GPU memtest failure [Linux GPU folding]

Postby davidcoton » Sun Jan 20, 2013 12:00 am

I have yet another problem. My system was working until I upgraded Ubuntu 12.10 to the latest kernel, 3.5.0.22. I then had to reinstall the Nvidia 310.19 drivers, and while doing that I broke the whole system. After extensive repair and reinstallation I think everything is back: the system appears normal and v7 SMP folding is OK. I rebuilt the v6 GPU system (new copies of downloaded files where available, following the headless install guide through the relevant steps), but now every attempt to start it produces a log like this:
Code: Select all
--- Opening Log file [January 19 23:29:00 UTC]


# Windows GPU Systray Edition #################################################
###############################################################################

                       Folding@Home Client Version 6.41r2

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: Z:\home\david\fahgpu3
Arguments: -forcegpu nvidia_fermi -advmethods -verbosity 9

[23:29:00] - Ask before connecting: No
[23:29:00] - User name: davidcoton (Team 0)
[23:29:00] - User ID: 2E46308D1632C9A3
[23:29:00] - Machine ID: 2
[23:29:00]
[23:29:00] Gpu type=3 species=20.
[23:29:00] Loaded queue successfully.
[23:29:01] Initialization complete
[23:29:01]
Autosending finished units... [January 19 23:29:01 UTC]
[23:29:01] Trying to send all finished work units
[23:29:01] + No unsent completed units remaining.
[23:29:01] - Autosend completed
[23:29:01] Core required: FahCore_15.exe
[23:29:01] Core found.
[23:29:01] Working on queue slot 07 [January 19 23:29:01 UTC]
[23:29:01] + Working ...
[23:29:01] - Calling '.\FahCore_15.exe -dir work/ -suffix 07 -nice 19 -nocpulock -checkpoint 6 -verbose -lifeline 8 -version 641'

[23:29:01]
[23:29:01] *------------------------------*
[23:29:01] Folding@Home GPU Core
[23:29:01] Version                2.25 (Wed May 9 17:03:01 EDT 2012)
[23:29:01] Build host             AmoebaRemote
[23:29:01] Board Type             NVIDIA/CUDA
[23:29:01] Core                   15
[23:29:01]
[23:29:01] Window's signal control handler registered.
[23:29:01] Preparing to commence simulation
[23:29:01] - Looking at optimizations...
[23:29:01] DeleteFrameFiles: successfully deleted file=work/wudata_07.ckp
[23:29:01] - Created dyn
[23:29:01] - Files status OK
[23:29:01] sizeof(CORE_PACKET_HDR) = 512 file=<>
[23:29:01] - Expanded 60219 -> 264610 (decompressed 439.4 percent)
[23:29:01] Called DecompressByteArray: compressed_data_size=60219 data_size=264610, decompressed_data_size=264610 diff=0
[23:29:01] - Digital signature verified
[23:29:01]
[23:29:01] Project: 8074 (Run 1, Clone 22, Gen 32)
[23:29:01]
[23:29:01] Assembly optimizations on if available.
[23:29:01] Entering M.D.
[23:29:03] Tpr hash work/wudata_07.tpr:  2296181751 201558024 2365683401 3020711065 2981506139
[23:29:03] GPU device id=0
[23:29:03] Working on GROningen Mixture of Alchemy and Childrens' Stories
[23:29:03] Client config found, loading data.
[23:29:03] Finished fah_main status=59
[23:29:03] mdrun_gpu returned 59
[23:29:03] GPU memtest failure
[23:29:03]
[23:29:03] Folding@home Core Shutdown: GPU_MEMTEST_ERROR
[23:29:07] CoreStatus = 7C (124)
[23:29:07] Client-core communications error: ERROR 0x7c
[23:29:07] This is a sign of more serious problems, shutting down.


I don't think it really is a GPU memory failure -- I can run memtestcl for 1000 cycles on 800MB without error. Any ideas what I have missed, broken, or got wrong?
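When comparing logs across driver or kernel changes, it can help to pull out only the lines that record the failure. The following is a minimal sketch: the function name is hypothetical, the patterns are taken from the log above, and the log file name in the usage comment is an assumption about a typical v6 client directory.

```shell
# Print only the failure-related lines from a v6 client log read on stdin.
# The patterns are copied from the log excerpt above; other failure modes
# may need additional patterns.
fah_failure_lines() {
    grep -E 'mdrun_gpu returned|Core Shutdown:|CoreStatus =|communications error'
}

# Usage (file name assumed): fah_failure_lines < FAHlog.txt
```

Run against the log above, this should leave the `mdrun_gpu returned 59`, `GPU_MEMTEST_ERROR`, `CoreStatus = 7C (124)` and `ERROR 0x7c` lines, which are the ones worth comparing between a working and a failing setup.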

David
davidcoton
 
Posts: 935
Joined: Wed Nov 05, 2008 3:19 pm
Location: Cambridge, UK

Re: GPU memtest failure

Postby art_l_j_PlanetAMD64 » Sun Jan 20, 2013 1:35 am

davidcoton wrote:I don't think it really is a GPU memory failure -- I can run memtestcl for 1000 cycles on 800MB without error. Any ideas what I have missed, broken, or got wrong?

That error usually indicates some incompatibility related to the NVidia driver version, even though it has worked for you in the past (until this upgrade). The Linux Wine/NVidia driver/GPU v6 client match is very delicate, and can be broken by quite trivial changes. Once a working combination is found, it's best not to make any changes at all. Try older NVidia driver versions, starting with 266.58. That's the version that shows the same download date (on my systems here) as the GPU v6.41 client .msi Windows Installer Package. Once you find a driver that works, stick with it, and don't make any system-wide changes or upgrades unless they are absolutely essential.
art_l_j_PlanetAMD64
 
Posts: 741
Joined: Sun May 30, 2010 2:28 pm

Re: GPU memtest failure

Postby art_l_j_PlanetAMD64 » Tue Jan 22, 2013 9:10 pm

Here is a configuration that works for one user: link

I hope this helps!
art_l_j_PlanetAMD64
 
Posts: 741
Joined: Sun May 30, 2010 2:28 pm

Re: GPU memtest failure

Postby codysluder » Tue Jan 22, 2013 9:33 pm

It could be a driver or it could be a bug in the memtest version that's incorporated into FahCore15. The quickest way to figure that out is to run the stand-alone version of memtestCL or memtestG80 which can be found here: https://simtk.org/project/xml/downloads ... kage_id907

There was a bug in version 1.00 which falsely reported an error on the newer GPUs. It was fixed in version 1.1.
codysluder
 
Posts: 2239
Joined: Sun Dec 02, 2007 12:43 pm

Re: GPU memtest failure

Postby davidcoton » Tue Jan 22, 2013 10:27 pm

Thanks both.

I haven't had stability problems running GPU folding under Wine before. I've had it running for some time (maybe two years, though I'd have to check that) on several Ubuntu releases and two or three driver releases. I recently changed cards (apparent card failure) from a GTS250 to a 550Ti, and that was relatively painless. I have upgraded kernels frequently, with no major problems apart from reinstalling drivers. Recently I had a HD failure and had to rebuild the entire Ubuntu system. I had many problems installing the Nvidia driver on 12.10, and then I had to retrieve the 6.41 client from the crashed HD, but even that worked for a week or more, until the upgrade to kernel 3.5.0.22.

I have not been able to install and run memtestG80; it fails on a missing file (from memory, a particular .so file). But I installed memtestcl and it ran clean. So I'm inclined to think it's the memtest in Core15 giving a false fault, though deleting the core and getting a new download didn't help. (NOTE TO SELF: this might be wishful thinking.) Is this likely to be linked to the particular WUs available? I think it was using Core15 successfully before.

At present I can't get Ubuntu to a state where I can reinstall the current Nvidia driver with the previous kernel (a configuration that ran correctly), because the Grub menu item for the previous kernel 3.5.0.21 doesn't start even in "single" mode. While I'd like to get the GPU back to work, I don't want to stop the SMP folding on the same rig by breaking Ubuntu, particularly since that may not be the issue anyway. I'm also not sure that I can do a new install with anything but the latest kernel: it would need to be done without updating during the installation. So it will have to wait until I have more time and patience, or native Linux GPU cores arrive.

Sorry if that's a bit of a ramble, but others might be warned, as
art_l_j_PlanetAMD64 wrote: The Linux Wine/NVidia driver/GPU v6 client match is very delicate, and can be broken by quite trivial changes. Once a working combination is found, it's best not to make any changes at all. ... Once you find a driver that works, stick with it, and don't make any system-wide changes or upgrades unless they are absolutely essential.


David
davidcoton
 
Posts: 935
Joined: Wed Nov 05, 2008 3:19 pm
Location: Cambridge, UK

Re: GPU memtest failure

Postby art_l_j_PlanetAMD64 » Fri Feb 08, 2013 4:56 pm

davidcoton wrote:Thanks both.

I haven't had stability problems running GPU folding under Wine before, I've had it running for some time (maybe two years -- though I'd have to check that) on several Ubuntu releases and two or three driver releases. I recently changed cards (apparent card failure) from a GTS250 to a 550Ti, that was relatively painless.

David,

I was able to get the NVidia 275.43 driver on one of my Debian Linux systems by doing this:
Code: Select all
wget http://us.download.nvidia.com/XFree86/Linux-x86_64/275.43/NVIDIA-Linux-x86_64-275.43.run
Basically, you can get most legacy drivers by using this method, just replace the driver version number in the 2 places where it appears in the command string.
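As a small sketch of that substitution, the helper below builds the URL from a version number so the version only has to be typed once. The function name is hypothetical, the URL layout is copied from the command above, and whether any given version is still hosted at that location is an assumption.

```shell
# Build the download URL for a given NVIDIA driver version (x86_64 only).
# The layout is taken from the wget command above: the version appears once
# as a directory and once in the file name.
nvidia_run_url() {
    printf 'http://us.download.nvidia.com/XFree86/Linux-x86_64/%s/NVIDIA-Linux-x86_64-%s.run\n' \
        "$1" "$1"
}

# Usage: wget "$(nvidia_run_url 275.43)"
```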

NVidia driver v275.43 should work OK with your GTX 550 Ti, as it is listed as being compatible here.

Also, when installing on a Linux system that is not headless, you will have to use a special procedure in place of 'Step 8'.

Here is how I had to run step 8: viewtopic.php?f=54&t=6793&p=236378#p236378
I hope this helps!

Edit by Mod: Replaced unnecessary C&P with link.
art_l_j_PlanetAMD64
 
Posts: 741
Joined: Sun May 30, 2010 2:28 pm

Re: GPU memtest failure

Postby art_l_j_PlanetAMD64 » Sat Feb 09, 2013 10:01 am

David, the NVidia driver v275.43 runs well on both of my Debian Linux/Wine GPU Folding machines here, with a GTX 285 GPU in each system.

Hopefully, it should also work with your Ubuntu 12.10 with the latest kernel 3.5.0.22, as this guy has v275.43 working:
levesqu6 wrote:for those having problems with the nvidia drivers and newer kernels... the 275.x branch seems to work ok for now. I havent tried 295.33
heres my configuration

debian testing (wheezy)
kernel 3.2.0-2-amd64
Nvidia drivers version 275.43
cuda 3.0
wine 1.5.0 ('unstable')
with a version 3.x.x-x kernel, and the instructions at the beginning of this topic were written for Ubuntu (although it was Ubuntu Server 9.04). Still, something may have changed that makes the GPU3 Linux/Wine combination fail with Ubuntu 12.10, but there's a good chance that it will work for you. You won't know until you try it.

PS: One of the GTX 285 GPUs has just completed a WU, and I have temporarily swapped in a GTX 460 from my #2 system, just for one WU, to see if FahCore 0x15 runs on my #5 computer (Debian Linux v6.0.6) with the NVidia v275.43 driver. It has started folding the WU (PRCG 8073 (0, 2156, 39)), and is at 4% complete now. I'll let you know (in about 14 hours) if the WU has completed successfully. If it did, then the GPUs supported by v275.43 (up to the GTX 590) should also fold OK there.
art_l_j_PlanetAMD64
 
Posts: 741
Joined: Sun May 30, 2010 2:28 pm

Re: GPU memtest failure

Postby davidcoton » Sat Feb 09, 2013 4:54 pm

Thanks for the info. I saw your post in the other forum, and noted step 8 as a possible solution for my system. Now I have had time to try it, I can report the results.

Ubuntu 12.10 relatively new install. Nvidia 550Ti. Currently NO NV video drivers (default nouveau drivers working apparently correctly). Grub also reports no video drivers, so unable to access the grub menu at startup.

This version does not seem to have a root console available, but I tried the gdm3 stop command under sudo. The GDM3 script does not exist.

Back to square one, with a good video card unable to fold, until I can find another way to stop nouveau, X, and all; and load NVidia drivers.

David
davidcoton
 
Posts: 935
Joined: Wed Nov 05, 2008 3:19 pm
Location: Cambridge, UK

Re: GPU memtest failure

Postby art_l_j_PlanetAMD64 » Sat Feb 09, 2013 5:30 pm

davidcoton wrote:Thanks for the info. I saw your post in the other forum, and noted step 8 as a possible solution for my system. Now I have had time to try it, I can report the results.

Ubuntu 12.10 relatively new install. Nvidia 550Ti. Currently NO NV video drivers (default nouveau drivers working apparently correctly). Grub also reports no video drivers, so unable to access the grub menu at startup.

This version does not seem to have a root console available, but I tried the gdm3 stop command under sudo. The GDM3 script does not exist.

Back to square one, with a good video card unable to fold, until I can find another way to stop nouveau, X, and all; and load NVidia drivers.

David

OK, your system must be using a different display manager (DM), not GNOME's (gdm3). Use:
Code: Select all
ps aux | grep dm
to find out which DM your system is using, and then use the appropriate '/etc/init.d/?dm stop' command to shut down your DM and X server.
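A slightly stricter variant of that search is sketched below. It matches whole command names, so the search does not pick up its own `grep` process or unrelated names that merely contain "dm". The function name and the list of display managers are assumptions; extend the list as needed.

```shell
# Print the first known display manager found in a list of command names
# read from stdin (one name per line). The list of names is an assumption.
detect_dm() {
    grep -E -x 'lightdm|gdm3|gdm|kdm|xdm|slim' | head -n 1
}

# Usage, feeding it bare command names from ps:
#   dm=$(ps -e -o comm= | detect_dm)
#   sudo /etc/init.d/"$dm" stop
```

If nothing is printed, no display manager from the list is running, or it runs under a name that is not in the list.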

Art
art_l_j_PlanetAMD64
 
Posts: 741
Joined: Sun May 30, 2010 2:28 pm

Re: GPU memtest failure

Postby davidcoton » Sat Feb 09, 2013 6:39 pm

Thanks Art

That identified lightdm. With only a little more pain I managed to install the NVidia drivers and complete the installation of F@H under Wine.
I downloaded the 6.41 no-nonsense zip file, extracted the exe, and fired it up.
Code: Select all
err:wgl:has_opengl Failed to load libGL: libGL.so.1: wrong ELF class: ELFCLASS64

That's new on me. I think I know what it means -- probably that the 32-bit CUDA library found a 64-bit library instead of a 32-bit one -- but how do I work round it?
(I was beginning to think I knew most of the quirks of GPU Wine folding -- but obviously I don't.)

David
davidcoton
 
Posts: 935
Joined: Wed Nov 05, 2008 3:19 pm
Location: Cambridge, UK

Re: GPU memtest failure

Postby art_l_j_PlanetAMD64 » Sat Feb 09, 2013 8:35 pm

davidcoton wrote:Thanks Art

That identified lightdm, with only a little more pain I managed to install NVidia drivers and complete the installation of F@H under Wine.
I downloaded the 6.41 no-nonsense zip file, extracted the exe, and fired it up.
Code: Select all
err:wgl:has_opengl Failed to load libGL: libGL.so.1: wrong ELF class: ELFCLASS64

That's new on me. I think I know what it means -- probably that the 32-bit CUDA library found a 64-bit library instead of a 32-bit one -- but how do I work round it?
(I was beginning to think I knew most of the quirks of GPU Wine folding -- but obviously I don't.)

David

OK, David, during the NVidia driver installation you have to make sure that you install the OpenGL libraries along with the driver. I think, going from memory, that the default (if you just keep hitting <Enter> during the installation) is not to install those libraries:

8. Install the NVIDIA driver:
Code: Select all
sudo sh NVIDIA-Linux-x86_64-275.43.run
Follow the prompts. Go ahead and install the OpenGL libraries too. Answer "Yes" when asked if you would like to run the nvidia-xconfig utility.
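The "wrong ELF class: ELFCLASS64" message means the process loading libGL.so.1 (here Wine, per the err:wgl prefix) is 32-bit but found a 64-bit copy of the library. One way to check which class a given copy has is to read the class byte from its ELF header. This is a minimal sketch; the function name is hypothetical and the library paths in the usage comment are assumptions that vary by distribution.

```shell
# Print ELFCLASS32 or ELFCLASS64 for the file given as $1.
elf_class() {
    # The first four bytes must be the ELF magic: 0x7f 'E' 'L' 'F'.
    magic=$(od -An -tx1 -N4 "$1" | tr -d ' \n')
    [ "$magic" = "7f454c46" ] || { echo "not an ELF file" >&2; return 1; }
    # The fifth byte (offset 4, EI_CLASS) is 1 for 32-bit and 2 for 64-bit.
    case "$(od -An -tu1 -j4 -N1 "$1" | tr -d ' \n')" in
        1) echo ELFCLASS32 ;;
        2) echo ELFCLASS64 ;;
        *) echo "unknown ELF class" >&2; return 1 ;;
    esac
}

# Usage (paths are assumptions):
#   elf_class /usr/lib/libGL.so.1
#   elf_class /usr/lib32/libGL.so.1
```

A 32-bit Wine needs a copy that reports ELFCLASS32 somewhere on its library search path.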

Art
Last edited by art_l_j_PlanetAMD64 on Sun Feb 10, 2013 4:12 pm, edited 1 time in total.
art_l_j_PlanetAMD64
 
Posts: 741
Joined: Sun May 30, 2010 2:28 pm

Re: GPU memtest failure

Postby davidcoton » Sat Feb 09, 2013 9:05 pm

Yes, it's possible I missed that prompt. I've done the process often enough to get casual about checking every prompt, but not frequently enough to remember all the non-default options. So I went back to re-run the nvidia installer, stopping lightdm and dropping to a text console. But the installer thinks a module called nvidia is still loaded. So now I know why I'm not a Linux guru :( It doesn't help that nvidia references things on their website that don't seem to be where they say. So how can I unload the nvidia module (assuming, of course, that the X server is exiting properly and not holding on somewhere)?
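A minimal sketch for checking this, assuming the usual /proc/modules layout in which the first field is the module name and the third is its use count. The function name is hypothetical.

```shell
# Print the use count of module $1 from /proc/modules-style text on stdin.
# Returns non-zero if the module is not listed at all.
module_use_count() {
    awk -v m="$1" '$1 == m { print $3; found = 1 } END { exit !found }'
}

# Usage:
#   module_use_count nvidia < /proc/modules
#   sudo rmmod nvidia    # only expected to succeed once the count is 0
```

A non-zero count usually means something, often the X server, still holds the module, in which case `rmmod` will refuse to unload it.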

David
davidcoton
 
Posts: 935
Joined: Wed Nov 05, 2008 3:19 pm
Location: Cambridge, UK

Re: GPU memtest failure

Postby art_l_j_PlanetAMD64 » Sat Feb 09, 2013 9:37 pm

davidcoton wrote:Yes, it's possible I missed that prompt. I've done the process often enough to get casual about checking every prompt, but not frequently enough to remember all the non-default options. So I went back to re-run the nvidia installer, stopping lightdm and dropping to a text console. But the installer thinks a module called nvidia is still loaded. So now I know why I'm not a Linux guru :( It doesn't help that nvidia references things on their website that don't seem to be where they say. So how can I unload the nvidia module (assuming, of course, that X server is exiting properly and not holding on somewhere...).

David

OK, I forgot to mention it, but after the first time that you've used the '/etc/init.d/?dm stop' command, it seems to lose its effect. After that, all you need to do is log out (System -> Log Out user...) to go back to the CLI, with lightdm and the X server stopped. You could also reboot, which makes the system 'forget' that you have used the '/etc/init.d/?dm stop' command.

Please look here, even though it is on an entirely different subject, for the method of going back and forth between the CLI and the GUI, after you have used the '/etc/init.d/?dm stop' command once.
art_l_j_PlanetAMD64
 
Posts: 741
Joined: Sun May 30, 2010 2:28 pm

Re: GPU memtest failure

Postby davidcoton » Sat Feb 09, 2013 10:21 pm

Art,

That doesn't seem to be it. The situation is the same after rebooting: the ?dm stop command doesn't seem to complete, leaving the X terminal (Alt-F7) showing the text "Checking battery state... [OK]". It's the same whether the ?dm stop is given from a command window under X or from a text terminal (Alt-F1). "logout" from the X terminal just returns to a graphical logon screen, and even then the text terminal cannot successfully stop the X terminal.

Still stuck.

David
davidcoton
 
Posts: 935
Joined: Wed Nov 05, 2008 3:19 pm
Location: Cambridge, UK

Re: GPU memtest failure

Postby art_l_j_PlanetAMD64 » Sat Feb 09, 2013 10:47 pm

davidcoton wrote:Art,

That doesn't seem to be it. The situation is the same after rebooting -- the ?dm stop command doesn't seem to complete, leaving the x terminal (alt-F7) showing text "Checking battery state... [OK]" . It's the same whether the ?dm stop is given from a command window under x, or a text terminal (alt-F1). "logout" from the x terminal just returns to a graphical logon screen, even then the text terminal cannot successfully stop the x terminal.

Still stuck.

David

Are you logged in as root? You must be root to issue the '/etc/init.d/?dm stop' command.
Art
art_l_j_PlanetAMD64
 
Posts: 741
Joined: Sun May 30, 2010 2:28 pm
