Debugging Size Classes Take II

Kevin (Volunteer moderator · Project administrator · Project developer · Project tester · Volunteer developer · Volunteer tester · Project scientist)
Joined: 27 Jul 12 · Posts: 507 · Credit: 14,344,251 · RAC: 1,489
Message 4337 - Posted: 7 Jul 2015, 3:35:27 UTC

Just a quick heads up - David Anderson (the guy who wrote most of the BOINC server) from Berkeley is currently having another crack at debugging the size class problem we have.
____________
Regards
Kevin
-----
International Centre for Radio Astronomy Research

Conan
Joined: 16 Aug 12 · Posts: 52 · Credit: 5,227,240 · RAC: 3,475
Message 4338 - Posted: 7 Jul 2015, 3:41:54 UTC

Has he gotten anywhere with the problem since your post on the 28th of May?

Conan
____________

Yavanius
Joined: 10 Jan 15 · Posts: 44 · Credit: 1,024,266 · RAC: 3,371
Message 4343 - Posted: 9 Jul 2015, 1:16:39 UTC - in response to Message 4337.

Just a quick heads up - David Anderson (the guy who wrote most of the BOINC server) from Berkeley is currently having another crack at debugging the size class problem we have.


Need to take a (virtual) hammer to those bugs...

Elektra*
Joined: 12 May 14 · Posts: 127 · Credit: 8,060,292 · RAC: 0
Message 4345 - Posted: 9 Jul 2015, 13:24:49 UTC

I hope the changes now being made won't mess up my BOINC clients.

On June 19th I reverted from 7.0.32 (Android) / 7.4.42 (Win 8.1) back to 6.12.38 (Android) / 6.12.33 (Win 8.1). Since June 23rd all clients have been running under challenge conditions (POGS exclusively, work buffer of 7 days). So far I've achieved 3160 valid results, with only 15 tasks aborted because they hadn't started before their deadline. For me, that small number of aborted tasks is sensational.

With the BOINC 7 clients I used before, I would have had approximately 6200 aborted tasks under the same conditions due to BOINC 7's disastrous runtime estimates. A buffer estimated at 7 days of work turns out to be 21 days of work in the end (under stable conditions!). And BOINC 7 didn't learn to do better over about a year of use. That is very unpleasant for me, very unpleasant for the project operators, and very unpleasant for my wingmen, who have to wait another deadline period to get their results validated.
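To make the arithmetic behind that 7-days-becomes-21-days observation explicit, here is a minimal sketch in Python (the per-task numbers are made up; only the roughly 3x ratio comes from my figures above):

    # Sketch: how a 3x runtime underestimate inflates a work buffer.
    # Hypothetical per-task numbers; only the 3x ratio is from the post.
    est_hours_per_task = 2.0    # what the client believes a task takes
    real_hours_per_task = 6.0   # what it actually takes (3x more)
    buffer_days = 7.0           # requested work buffer

    # The client fetches enough tasks to fill 7 *estimated* days:
    tasks_fetched = buffer_days * 24 / est_hours_per_task    # 84 tasks

    # Crunching them at the real speed takes three times as long:
    real_days = tasks_fetched * real_hours_per_task / 24     # 21 days
    print(f"{tasks_fetched:.0f} tasks = {real_days:.0f} days of real work")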

So for me, POGS with BOINC 7 may work satisfactorily under everyday conditions with a small task cache of 1 day, but it fails when pushed to its limits during internal or BOINCstats challenges.

But where is the mistake? Are the new task duration and work fetch algorithms failing in general? Probably not. Are these POGS-specific faults? Are the "sizes" of the POGS tasks calculated wrongly by the project servers? Maybe, but POGS doesn't seem to be the only problematic project: at least Milkyway and WCG also tend to underestimate the effective task runtimes. Or is it this pesky <dont_use_dcf> flag set by many projects? BOINC 6 doesn't know about this flag, which is probably why its task duration estimates are much more reliable than BOINC 7's. That's my personal assumption.
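For readers who don't know DCF (the duration correction factor): the idea, roughly sketched below with a made-up update rule. This only illustrates the principle; the real client logic in C++ differs in detail (for instance, it raises the factor quickly and lowers it slowly):

    # Rough sketch of a BOINC 6 style duration correction factor (DCF).
    # The update rule here is a plain moving average, an assumption;
    # it illustrates the principle, not the actual client code.
    dcf = 1.0  # starts neutral

    def corrected_estimate(server_estimate_hours):
        """Client-side runtime estimate: server estimate scaled by DCF."""
        return server_estimate_hours * dcf

    def on_task_finished(server_estimate_hours, actual_hours):
        """Nudge DCF toward the observed actual/estimate ratio."""
        global dcf
        ratio = actual_hours / server_estimate_hours
        dcf += 0.1 * (ratio - dcf)

    # Tasks that consistently run 3x longer than estimated drive DCF
    # toward 3.0, so work fetch stops over-committing the cache:
    for _ in range(50):
        on_task_finished(2.0, 6.0)
    print(round(dcf, 2))  # ~3.0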

For me, the long runtimes of POGS tasks on my slow devices are not the main problem; the massively underestimated runtimes of the jobs under BOINC 7 are. Perhaps this should be investigated with higher priority.
____________

Elektra*
Joined: 12 May 14 · Posts: 127 · Credit: 8,060,292 · RAC: 0
Message 4354 - Posted: 13 Jul 2015, 18:12:48 UTC - in response to Message 4337.
Last modified: 13 Jul 2015, 18:32:42 UTC

Just a quick heads up - David Anderson (the guy who wrote most of the BOINC server) from Berkeley is currently having another crack at debugging the size class problem we have.


I'm afraid we can't expect much support from David Anderson for the time being:

https://secure.worldcommunitygrid.org/forums/wcg/viewthread?thread=38168
BOINC's funding from the U.S. National Science Foundation has ended,
at least for the time being.
This funding supported me, Rom Walton, and Charlie Fenton.
We're now working on other things,
although we'll stay involved in BOINC at some level.

The BOINC project will continue, and will be run according to
a community-based model rather than centrally.
In essence, the people who contribute to BOINC now make the decisions about it.
...
-- David


Regarding our problem with size classes and distributing properly sized tasks to the volunteer devices, I found this probably very interesting article. Unfortunately my English is somewhat limited, and I have only a limited grasp of mathematical and physical topics:

https://wiki.atlas.aei.uni-hannover.de/foswiki/bin/view/EinsteinAtHome/BOINC/EvaluationOfCreditNew#Android_Specific_40estimate_related_41_failure_mode_as_at_7th_June_2014

Android Specific (estimate related) failure mode as at 7th June 2014

Android device does a whetstone, gives [either] a 'normal', 'neon' or 'vfp' result into host p_fpops (SIMD vectorisation aware)
Android device [initially] requests tasks with a scheduler request, which includes host p_fpops
Scheduler [not currently SIMD vectorisation aware] selects tasks and app versions, sets estimate and bound using peak_flops (which was initialised to host.p_fpops [or 1E9 if it was <= 0])
Android receives tasks & begins processing
Assuming project unscaled estimate was 'reasonable', if app is NOT vfp or neon [i.e. not SIMD vectorised], wrong whetstone variant is applied [on initial request], dividing the [effective] bound [duration] by the vectorisation efficiency of the specialised Whetstones... in the region of 3x, which would seem to match up with the SIMAP time-exceeded scenario, where a bound of ~3x the estimate converges with the expected actual elapsed.
Bandaid suggestions become:
- extend bound to more than 3 times
- vectorise/optimise the application(s)
- don't let people use their phones to make calls. ;-)

In addition: subsequent validation of host app version peak_flops is subject to the same averaging quantisation instabilities as above, so convergence behaviour is non-deterministic and stochastically driven
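To illustrate the quoted failure mode with concrete numbers, a minimal sketch (all figures are hypothetical; the ~3x vectorisation efficiency and the ~3x bound factor are taken from the article above):

    # Sketch of the Android estimate failure mode quoted above.
    # Hypothetical speeds: the NEON whetstone reports ~3x scalar speed.
    scalar_flops = 1.0e9     # what a non-vectorised app actually achieves
    neon_flops = 3.0e9       # what the NEON whetstone puts into p_fpops

    rsc_fpops_est = 7.2e13   # server's (reasonable) FLOP count for a task
    bound_factor = 3.0       # typical rsc_fpops_bound / rsc_fpops_est

    # The scheduler derives estimate and bound from the NEON figure:
    estimate_s = rsc_fpops_est / neon_flops                # 24,000 s
    bound_s = bound_factor * rsc_fpops_est / neon_flops    # 72,000 s

    # But the non-SIMD app only runs at scalar speed:
    actual_s = rsc_fpops_est / scalar_flops                # 72,000 s

    # actual elapsed converges with the bound: "maximum time exceeded"
    print(estimate_s, bound_s, actual_s)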


My quick-and-dirty suggestions for solving the problems are:
- Only two sizes of tasks: the small ones exclusively assigned to Android (or similar) devices, the larger ones to the other devices. My personal experience is that the runtimes of the larger WUs are acceptable even on entry-level notebooks with Celeron or Pentium processors (max. 11 hours wall time with an Intel(R) Celeron(R) CPU N2910 @ 1.60GHz).
- Disabling the <dont_use_dcf> tag (this must be done server-side; unfortunately we can't disable it in our clients): predicting the run times of the tasks seems to work much better in the old BOINC 6 manner with DCF.



The project Collatz Conjecture (http://boinc.thesonntags.com/collatz) seems to have a working model with size classes (micro_collatz, mini_collatz, solo_collatz, large_collatz).
(Size hints: Large = 16 x Solo; Solo = 16 x Mini; Mini = 16 x Micro.) In the project-specific settings you can even select which size classes you want to crunch: the larger the tasks, the better the credits/hour. If you leave the default settings unchanged (all applications permitted), Collatz Conjecture will send only properly sized tasks suitable for your device.
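Just to spell out that ladder, and how a scheduler might map a device onto it (the 16x size ratios come from the hints above; the speed thresholds and the assignment rule are purely my assumptions, not Collatz's published logic):

    # The 16x ladder from the size hints: Mini = 16 x Micro,
    # Solo = 16 x Mini, Large = 16 x Solo, i.e. Large = 4096 x Micro.
    SIZE_FACTOR = {"micro": 1, "mini": 16, "solo": 16**2, "large": 16**3}
    print(SIZE_FACTOR["large"])  # 4096

    def pick_size_class(host_gflops):
        """Hypothetical assignment: faster hosts get bigger tasks.
        The thresholds are invented for illustration only."""
        if host_gflops < 1:
            return "micro"   # phones and similar
        if host_gflops < 10:
            return "mini"
        if host_gflops < 100:
            return "solo"
        return "large"       # fast desktops / GPUs

    print(pick_size_class(0.5))   # -> micro
    print(pick_size_class(50))    # -> solo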

Yavanius
Joined: 10 Jan 15 · Posts: 44 · Credit: 1,024,266 · RAC: 3,371
Message 4425 - Posted: 3 Aug 2015, 5:05:37 UTC - in response to Message 4354.


Regarding our problem with size classes and distributing properly sized tasks to the volunteer devices, I found this probably very interesting article.


Assuming project unscaled estimate was 'reasonable', if app is NOT vfp or neon [i.e. not SIMD vectorised], wrong whetstone variant is applied [on initial request], dividing the [effective] bound [duration] by the vectorisation efficiency of the specialised Whetstones... in the region of 3x, which would seem to match up with the SIMAP time-exceeded scenario, where a bound of ~3x the estimate converges with the expected actual elapsed.
Bandaid suggestions become:
- extend bound to more than 3 times
- vectorise/optimise the application(s)


My quick-and-dirty suggestions for solving the problems are:
- Only two sizes of tasks: the small ones exclusively assigned to Android



A bit outside my purview too, but I got to rereading this, and the Einstein folks seem to be saying that the issue only applies when the app is NOT neon- or vfp-vectorised. In that case, you only have to worry about devices whose benchmark reports neon or vfp speeds (whether that's reported correctly is another issue) while the app can't exploit them; the device's efficiency, or how fast it can compute, is then overrated by a factor of about 3x. It would be akin to a horse running as fast as a sports car: the horse can't really run that fast, but BOINC seems to think it can.

As a 'bandaid', or stopgap solution, they seem to be suggesting an assessment that negates that 3x, possibly by simply dividing the final rating by 3...
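A minimal sketch of that stopgap, assuming the scheduler can see whether the selected app version is NEON/VFP-vectorised (the function name, field names, and the exact factor are illustrative assumptions, not BOINC's actual code):

    # Sketch of the 'divide by 3' stopgap: deflate the benchmark-derived
    # speed when the app version can't use the SIMD units the whetstone
    # measured. Names and the exact factor are assumptions.
    VECTORISATION_EFFICIENCY = 3.0  # ~3x per the Einstein@Home analysis

    def effective_flops(reported_p_fpops, app_is_simd_vectorised):
        if app_is_simd_vectorised:
            return reported_p_fpops
        # Non-vectorised app: the SIMD whetstone overrates it ~3x.
        return reported_p_fpops / VECTORISATION_EFFICIENCY

    print(effective_flops(3.0e9, app_is_simd_vectorised=False))  # 1e9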

I'm not quite sure about the vectorise/optimise suggestion. That sounds like something that would be more of an actual fix than a "band-aid"...
