Validator problem?

Maurice Goulois
Joined: 18 May 15
Posts: 4
Credit: 10,918
RAC: 0
Message 236 - Posted: 5 Oct 2015, 11:07:11 UTC

I had stopped crunching here because I had only Windows machines to attach, and they were all erroring at that time; now I've reinstalled an Ubuntu one and attached it here, and bam! No validation :)

Mumps [MM]
Joined: 3 Oct 15
Posts: 3
Credit: 7,344,612
RAC: 0
Message 237 - Posted: 10 Oct 2015, 4:08:39 UTC

Well, that's enough wasted effort. 37 Linux hosts, with 322 cores, completed 26385 WUs over 6 days, and not a single WU was validated and not a single credit granted. And not a peep from the Admin of this project regarding the situation.

I guess I'll head off elsewhere and hope things get addressed eventually to make it worth returning with some compute cycles.

Maurice Goulois
Joined: 18 May 15
Posts: 4
Credit: 10,918
RAC: 0
Message 238 - Posted: 10 Oct 2015, 6:36:17 UTC
Last modified: 10 Oct 2015, 6:39:05 UTC

You can go to CAS :) problem fixed there, however the linux ict app works, not happy on win

Daniel
Project administrator
Joined: 5 Mar 15
Posts: 73
Credit: 162,134
RAC: 0
Message 239 - Posted: 10 Oct 2015, 15:54:20 UTC - in response to Message 238.

It's a RAM-intensive application, and you need to limit the number of copies running at the same time. The BOINC server cannot do that for you. There are two ways to do this. The first is to limit the RAM usage to 50% whether or not you are using the machine. The other way is to tell BOINC to run only one or two copies at the same time. You put the file app_config.xml in the project folder (see http://www.vdwnumbers.org/forum_thread.php?id=20&postid=134#134). On Windows, the project folder is C:\ProgramData\BOINC\projects\www.123numbers.org.
____________
Daniel Monroe
vdwnumbers.org Project Administrator
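
For readers who have not written an app_config.xml before, here is a minimal sketch of what such a file could contain, generated with Python. The `project_max_concurrent` and `max_concurrent` elements are standard BOINC client options; the helper name `build_app_config` is made up for illustration, and the project's real application name is not given in this thread, so the per-app form needs a name you look up yourself. The linked post may use a different layout.

```python
import xml.etree.ElementTree as ET


def build_app_config(max_tasks, app_name=None):
    """Return app_config.xml text limiting how many tasks run at once.

    With no app_name, the limit applies to the whole project.
    With an app_name (hypothetical here; check the project's own
    application name), the limit applies to that application only.
    """
    root = ET.Element("app_config")
    if app_name is None:
        ET.SubElement(root, "project_max_concurrent").text = str(max_tasks)
    else:
        app = ET.SubElement(root, "app")
        ET.SubElement(app, "name").text = app_name
        ET.SubElement(app, "max_concurrent").text = str(max_tasks)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Print the XML; save it as app_config.xml in the project folder.
    print(build_app_config(1))
```

After saving the file, the client has to re-read its config files (or be restarted) before the limit takes effect.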

Mumps [MM]
Joined: 3 Oct 15
Posts: 3
Credit: 7,344,612
RAC: 0
Message 240 - Posted: 10 Oct 2015, 19:14:20 UTC - in response to Message 239.

Excuse me, but all of my Linux hosts had either 2 or 4 GB per core, and exist only to run BOINC WUs. I don't expect it had anything to do with RAM. The work units ran to completion in relatively normal times and produced what look like valid output results.

It's all about the project's inability to validate Linux work against Windows work, and the choice not to enable Homogeneous Redundancy until the project can get around to fixing the apps. It's been pointed out repeatedly what the difference between the results is, and it seems that should be enough of a clue to fix the problem.

It looks like tasks crunched by both Windows and Linux will not validate together. Which means if the third machine is Windows, the Linux machine loses out. And vice versa. Until this gets resolved, can homogeneous redundancy please be turned on?

http://boinc.berkeley.edu/trac/wiki/HomogeneousRedundancy


And from 8 months ago:
zombie67 wrote:
FWIW, the only difference here is the 5th line of the stderr. For windows machines:

0,0,0,0,0,0,0,0,0,

For linux machines:

140733343532288,7067552,7067440,7067456,0,0,0,0,0,
Daniel wrote:
The first line should be all zeroes.
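
A quick way to check your own results for the difference quoted above is to test whether the line in question is all zeroes. This is only a sketch: the function name `is_all_zero_line` is hypothetical, and the comma-separated format with a trailing comma is assumed from the two sample lines in the quote.

```python
def is_all_zero_line(line):
    """True if every comma-separated field on the line is the integer 0."""
    # The sample lines end with a trailing comma, so drop empty fields.
    fields = [f for f in line.strip().split(",") if f]
    return bool(fields) and all(int(f) == 0 for f in fields)


if __name__ == "__main__":
    windows_line = "0,0,0,0,0,0,0,0,0,"
    linux_line = "140733343532288,7067552,7067440,7067456,0,0,0,0,0,"
    print("windows all zero:", is_all_zero_line(windows_line))
    print("linux all zero:", is_all_zero_line(linux_line))
```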

Maurice Goulois
Joined: 18 May 15
Posts: 4
Credit: 10,918
RAC: 0
Message 241 - Posted: 16 Oct 2015, 1:24:50 UTC

I have retried crunching with the Windows app limited to 1 task at a time; that seems to work, at least.

Morgan the Gold
Joined: 20 Oct 15
Posts: 3
Credit: 8,060
RAC: 0
Message 249 - Posted: 3 Nov 2015, 23:02:44 UTC

To quote zombie67:


It looks like tasks crunched by both Windows and Linux will not validate together. Which means if the third machine is Windows, the Linux machine loses out. And vice versa. Until this gets resolved, can homogeneous redundancy please be turned on?

http://boinc.berkeley.edu/trac/wiki/HomogeneousRedundancy


You ignored their obviously sage advice the eight times they bothered to give it. Add that 'I don't do Windows', and there is no point wasting my time with You any longer.

I might check back, if I remember to.

I know You don't care, but You have lost the interest of another BOINC fanatic.

Sincerely:
Morgan the Gold

Daniel
Project administrator
Joined: 5 Mar 15
Posts: 73
Credit: 162,134
RAC: 0
Message 250 - Posted: 4 Nov 2015, 4:34:57 UTC - in response to Message 249.

The problem of failed workunits happens when the program tries to create an array larger than the memory available. Sometimes the client detects this, and the tasks window shows "Waiting for memory". Other times, the client lets the app run and there is a memory fault. There are two fixes users can make: change your BOINC client settings to use only 50 percent of RAM, or create an XML file to allow only one copy of the program to run at a time. We do not know of a way to do either from our end. The BOINC client does not handle RAM-intensive apps well in other ways--for instance, it will pause a program if you touch your keyboard, but it will not release the RAM. This is another reason to limit the RAM usage even when you are not using your computer.

Once the workunit generates an error message due to insufficient memory, the results are garbage (the numbers in the output are not prime) and cannot be validated. This is not a problem with the validator. Six months ago, there was a problem of Windows and Linux work units not validating against each other due to different line endings, but this was fixed back then. Since then, the validator has worked perfectly.

We are exploring some other solutions, for instance, to have our app go to sleep unless there is plenty of memory. However, the app is always going to be RAM intensive because it has to create a large array with size equal to the prime being used. A month ago, when we found a way to cut memory usage in half (using long array rather than long long), the primes were roughly 250 million and the share of failed work units was near zero. Now the primes are roughly 400 million, so the memory usage is getting high again and about 20 percent of workunits are failing.
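
To put rough numbers on that, here is a back-of-the-envelope sketch of the array size per task. It assumes one array element per residue below the prime, 8 bytes for `long long` and 4 bytes for `long`; the helper name `array_bytes` is made up. Note that the size of `long` depends on the platform: it is 4 bytes on 64-bit Windows but 8 bytes on typical 64-bit Linux, so the halving assumed here would not apply everywhere.

```python
GIB = 1024 ** 3


def array_bytes(prime, bytes_per_element):
    """Bytes needed for an array with one element per residue below `prime`."""
    return prime * bytes_per_element


if __name__ == "__main__":
    for prime in (250_000_000, 400_000_000):
        for label, size in (("8-byte elements", 8), ("4-byte elements", 4)):
            gib = array_bytes(prime, size) / GIB
            print(f"p = {prime:,}, {label}: {gib:.2f} GiB per task")
```

Under these assumptions, a prime near 400 million with 4-byte elements needs 1,600,000,000 bytes (about 1.5 GiB) for the array alone, per running task, which is why several tasks running at once can exhaust RAM.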

There may be another solution on our end, but it would significantly slow down the app. Right now, the app stores the powers of the primitive root modulo the prime, and then checks for better lower bounds for each possible number of colors in parallel. If the app instead checked only a fixed number of colors, the array could store the values modulo that number of colors and use "byte" rather than "long", but the app would take ten times as long, because it would have to run once for each number of colors 3, 4, ..., 12.
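
The trade-off can be illustrated with a toy sketch. This is one reading of the idea, not the project's code: it assumes the stored array maps each nonzero residue x to the exponent e with g^e = x (mod p), and that a coloring with k colors only needs e mod k. The function names and the toy values p = 11, g = 2 are chosen for illustration.

```python
def exponent_table(p, g):
    """table[x] = e such that g**e % p == x, for x in 1..p-1 (index 0 unused)."""
    table = [0] * p
    x = 1
    for e in range(p - 1):
        table[x] = e
        x = (x * g) % p
    return table


def color_table(p, g, k):
    """Same walk, but keep only e % k, which fits in one byte for k <= 256."""
    colors = bytearray(p)
    x = 1
    for e in range(p - 1):
        colors[x] = e % k
        x = (x * g) % p
    return colors


if __name__ == "__main__":
    p, g = 11, 2  # toy values; 2 is a primitive root modulo 11
    full = exponent_table(p, g)
    # One pass per number of colors: ten passes for k = 3..12.
    for k in range(3, 13):
        small = color_table(p, g, k)
        assert all(small[x] == full[x] % k for x in range(1, p))
    print("byte tables agree with the full table for k = 3..12")
```

The full table serves every k at once but needs wide elements; the byte table is much smaller but has to be rebuilt for each k, which matches the "ten times as long" estimate for colors 3 through 12.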

Thank you to the volunteers. The project is producing good results--before it started, primes only through 10 million had been checked, and we are now at 40 times that level. I cannot respond to messages every day because of homework. I still appreciate that you are volunteering your computer time.

Daniel
____________
Daniel Monroe
vdwnumbers.org Project Administrator

Steve Hawker*
Joined: 4 Apr 15
Posts: 3
Credit: 25,200
RAC: 0
Message 251 - Posted: 5 Nov 2015, 18:12:07 UTC - in response to Message 250.
Last modified: 5 Nov 2015, 18:18:06 UTC

1. I'm guessing your app creates a two-dimensional array, n x m. In that case you could break the array into chunks. That way the tasks would be smaller, run faster and require less memory. You could have one app that breaks up the array and another app that consolidates the results.

You're going to need to do something like this because eventually not even the best equipped computer available here will be able to run the present application.

For comparison, my MacBook with 8GB of memory found a prime number with nearly 1 million digits. How will your app handle prime numbers like that?

2. Other projects are able to limit the number of tasks that run simultaneously, without the user needing to write an xml file (although such a file is trivial, not all users are able or willing to do it). If there isn't a setting within the server, then visit some of the projects and ask the admin. Most are only too happy to help. You could even ask David Anderson.

See: http://boincai05.cern.ch/CMS-dev/forum_thread.php?id=33#236

Morgan the Gold
Joined: 20 Oct 15
Posts: 3
Credit: 8,060
RAC: 0
Message 252 - Posted: 7 Nov 2015, 19:30:28 UTC

Hi again, sorry I was a bit harsh. I wrote and then discarded a long post showing how the top Linux host, and indeed all Linux hosts, fail to validate vs windoze, but I see now that's not really the issue.

The problem appears to be the way in which Your Linux app asks for memory, for it is not getting the RAM, even when there is 15 GB free on a single core allowed to BOINC.


The memory is there, the app is just not getting it.

Daniel
Project administrator
Joined: 5 Mar 15
Posts: 73
Credit: 162,134
RAC: 0
Message 257 - Posted: 9 Nov 2015, 1:00:44 UTC - in response to Message 252.

The problem appears to be the way in which Your Linux app asks for memory, for it is not getting the RAM, even when there is 15 GB free on a single core allowed to BOINC.


You were right. Version 32 should fix that. There was a bug that showed up in Linux but not Windows.

http://www.vdwnumbers.org/forum_thread.php?id=44#256
____________
Daniel Monroe
vdwnumbers.org Project Administrator

Morgan the Gold
Joined: 20 Oct 15
Posts: 3
Credit: 8,060
RAC: 0
Message 258 - Posted: 9 Nov 2015, 2:06:34 UTC - in response to Message 257.
Last modified: 9 Nov 2015, 2:10:54 UTC

Cool, I'm glad You're on the fix.

Doc_Gonzo
Joined: 13 Nov 15
Posts: 1
Credit: 0
RAC: 0
Message 260 - Posted: 13 Nov 2015, 21:38:18 UTC

I picked this as an alternative project to run when my main project is down. I ran a test of just 16 work units and then switched off my rigs to clean them.

When I checked back later tonight, all 16 work units were showing as 'validation inconclusive'. This is with Windows 7, a 3770K and 16 GB of RAM.
I was going to switch all 5 rigs over to this project whenever my main project goes down, but until I can fix this, there isn't any point.

Is there anything that I am doing wrong... or am I simply not doing something that I should be?

Steve Hawker*
Joined: 4 Apr 15
Posts: 3
Credit: 25,200
RAC: 0
Message 261 - Posted: 13 Nov 2015, 22:19:48 UTC - in response to Message 260.

Prior to version 3.2 (or v320.00) not one WU validated on my Linux box.

Since then, every single one has validated. I have noticed that for some of my valid results, the other results in the same workunit are invalid, and those are all from Windows boxes. This might be due to the overwhelming bias in the population or, as I suspect, a bug.

Recently it was claimed that memory was an issue but your machine has 16GB so quantity is not the issue.

If you have other flavors of Windows, you could try those to see if that makes a difference, or you could install a boot-time selectable Linux partition which is a truly onerous solution but it could work.

Daniel
Project administrator
Joined: 5 Mar 15
Posts: 73
Credit: 162,134
RAC: 0
Message 262 - Posted: 14 Nov 2015, 21:04:21 UTC - in response to Message 260.

When I checked back later tonight, all 16 work units are showing as 'validation inconclusive'.


'Validation inconclusive' means that the workunits have not been validated by other computers yet. Our error rates are very low in version 32.
____________
Daniel Monroe
vdwnumbers.org Project Administrator



Code and content created by Daniel Monroe © 2018.