Posts by Ananas
1) Message boards : Number crunching : Flying Solo with over 10k ADC: how do they do it? (Message 1924)
Posted 20 Dec 2013 by Ananas
If you stick to Windows, consider using the x64 version if your CPU can handle it - the x64 Windows binary seems to be quite efficient.

My Xeon L5520 on x64 at stock speed (2.27GHz) is usually slower (per task!) than stuff like Q9550 running on x86 but in this project the CPU times for results with the same weight are quite similar (my Q9550 is OC'ed, that's why it is still somewhat faster).

One reason might be that RAM access does not play such a big role for POGS. Other projects that support Win x64 seem to suffer from the relatively low memory access speed of the Xeon as soon as the program needs uncached memory.
2) Message boards : Number crunching : Proxy notices problem (Message 1850)
Posted 23 Nov 2013 by Ananas
If I hadn't seen who posted this, I would have asked whether you had restarted the core client.

Does your proxy have any option to invalidate the DNS cache?

If you're using a project manager, the project manager admin might not have updated the project URL yet.
3) Message boards : News : Happy Halloween (Message 1768)
Posted 12 Nov 2013 by Ananas
My zodiac sign is pumpkin, so I don't have to dress any differently from every day :-)
4) Message boards : News : 37.1 TFlops (Message 1731)
Posted 6 Nov 2013 by Ananas
What is left of me is 56.

1958 built here and started with a TI 99/4A :-)
5) Message boards : Number crunching : Weird CPU time causes trouble (Message 1712)
Posted 2 Nov 2013 by Ananas
It's a really old one (5.10.28), but this problem isn't a BOINC core client issue. The only client-dependent thing here is that this client doesn't support the runtime (wall clock), which is totally useless anyway.

The CPU time is managed by the wrapper alone and reported to the core client.

This is another bad side effect of this wrong CPU time report :

result 7766966 ended with "Maximum CPU time exceeded" (RSC_FPOPS_BOUND), the wrapper reported 3,814,271 seconds to the core client. Quite interesting : even though the core client does not support the runtime, the result shows a runtime that is much more plausible than the CPU time. This might be a hint where to start the bug hunt.

If it was caused by a missing core client feature, it would happen all the time, not just now and then.

A good workaround would be to limit the reported CPU time somehow, e.g. to the minimum of the calculated CPU time and the delta between now and the time the result was downloaded.
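A minimal Python sketch of such a cap, taking the smaller of the two values, since only that acts as a limit. The function name and its parameters are hypothetical and not taken from the wrapper source. It also assumes a single-threaded task, where CPU time cannot exceed the wall-clock time since download, and the six-hour age of the task is made up for illustration.

```python
def capped_cpu_time(measured_cpu_s: float, downloaded_at: float, now: float) -> float:
    """Return a CPU time that never exceeds the time elapsed since download.

    All arguments are in seconds; downloaded_at and now share the same epoch.
    """
    elapsed = max(0.0, now - downloaded_at)
    return min(max(0.0, measured_cpu_s), elapsed)


# Hypothetical task age: downloaded six hours before reporting.
six_hours = 6 * 3600.0

# The bogus 3,814,271 s report gets cut down to the elapsed time.
print(capped_cpu_time(3_814_271.0, downloaded_at=0.0, now=six_hours))

# A plausible value passes through unchanged.
print(capped_cpu_time(9_000.0, downloaded_at=0.0, now=six_hours))
```

A cap like this would not find the bug, but it would keep one bad value from blowing up the duration correction factor.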

p.s.: sorry for the late reply, I have been visiting Sweden for 2 weeks - vacation :-)
6) Message boards : News : 37.1 TFlops (Message 1699)
Posted 18 Oct 2013 by Ananas
... I'd offer to help you myself, (20 years F77, engineering apps on DEC and SEL mini's, 10 years C/C++ on Windows/Linux/embedded systems), but have not programmed since 2008.

OT:
SEL is Standard Elektrik Lorenz? I didn't even know that they made computers. DEC I know of course, e.g. the Rainbow 100 with CP/M-86 and this funny double floppy drive - one slot in normal orientation, one upside down.

As C is a way of thinking, a concept rather than just a language, I highly doubt that you ever will forget how to do it :-) You might have forgotten some library function syntax but that can easily be looked up. Maybe this helps - or this :-)

p.s.: if you're somewhere near the road between Rødby and Helsingør, we might even be able to meet sometime for a coffee and some C talk (C and coffee are somehow conceptually connected).
7) Message boards : Number crunching : Weird CPU time causes trouble (Message 1695)
Posted 13 Oct 2013 by Ananas
Some results jump to ~25 days CPU time now at a checkpoint, keep those 25 days (as you fixed the CPU time loss) and report it to the core client.

This makes the duration correction factor jump to ~38.5 for my box (should be ~0.5), which makes POGS switch to panic mode (EDF) immediately.

As you cannot use BOINC softlinks, the up to 16 freshly started WUs hit the HDD simultaneously. The core client is then so busy copying files that all other running workunits (all projects except RNA-World, where the heartbeat crap is switched off) run into D. Anderson's favourite heartbeat bug.

It seems to be 25 days all the time lately, so it looks very much like the same cause every time.

Could you please look into it - probably something like an array using more elements than defined and writing into a variable of your runtime collector.

Lately, these results of mine had this bug :

http://pogs.theskynet.org/pogs/result.php?resultid=6889391
http://pogs.theskynet.org/pogs/result.php?resultid=6850286
http://pogs.theskynet.org/pogs/result.php?resultid=6816368

The only thing I can do in such a case is : empty the cache for this project and reset it - or it will take forever to get out of the panic mode.
8) Questions and Answers : Android : Problems with Android (Message 1641)
Posted 18 Sep 2013 by Ananas
... I'm thinking I might get you to do some floating point tests ...

Don't forget the rounding functions you use, as well as explicit and implicit type conversion, if that's used in your project.

RNA had problems with MS math libraries differing between Win2k, XP, x64 and Linux, each with different rounding policies ... Does -0.5 go up or down? Does +0.5 go up or down? Does a more complex function round each operand and calculate afterwards, or does it calculate and round at the end?
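To make those questions concrete, here is a small Python sketch that applies several tie-breaking policies to the same inputs. It uses Python's `decimal` module purely as a stand-in; it does not reproduce the MS or GCC math libraries mentioned above, and the helper name `round_with` is made up for this example.

```python
import math
from decimal import Decimal, ROUND_HALF_UP, ROUND_HALF_EVEN, ROUND_HALF_DOWN


def round_with(policy: str, value: str) -> int:
    """Round a decimal string to an integer using the given decimal policy."""
    return int(Decimal(value).quantize(Decimal("1"), rounding=policy))


# Same inputs, four policies: ties away from zero, ties to even,
# ties toward zero, and the common floor(x + 0.5) idiom.
for v in ("-0.5", "0.5", "1.5", "2.5"):
    print(
        v,
        round_with(ROUND_HALF_UP, v),
        round_with(ROUND_HALF_EVEN, v),
        round_with(ROUND_HALF_DOWN, v),
        math.floor(float(v) + 0.5),
    )

# Rounding each operand first versus rounding the final result.
a, b = "0.4", "0.4"
rounded_first = round_with(ROUND_HALF_UP, a) + round_with(ROUND_HALF_UP, b)
rounded_last = round_with(ROUND_HALF_UP, str(Decimal(a) + Decimal(b)))
print(rounded_first, rounded_last)
```

Two hosts that disagree on any of these choices can return different results for the same workunit, which is what a validator comparing results then trips over.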

A partial solution for RNA has been to deliver a specific MS library (MS changed the rounding at least 2 times) together with the project apps and tell GCC to use those. This is not a full solution, Linux, Win x64 and Win x86 still differ sometimes, but at least all Win x86 have the same results now.

Afaik, Einstein worked a lot on building their own "rounding safe" library functions, replacing (or wrapping) the ones that come with the OS or with the compilers. If I recall correctly, there has been a rough description somewhere on the Pirates site.
9) Message boards : Number crunching : 195 (0xc3) EXIT_CHILD_FAILED (Message 1636)
Posted 17 Sep 2013 by Ananas
If it started at a certain point, never had errors before and now they won't go away, this is always a good question to begin with : Could it have started with a specific Win7 Patch?

The combination 0xc0000005 and "windows 7" occurs quite often when you search for it.

edit: Here's one that identifies a specific patch : error 0xc0000005 and kb2859537

One more, same problem, same windows update, same solution.

Check this one, there seems to be a major flaw in this kb2859537
10) Message boards : Number crunching : Problem fixed : CPU time loss at each checkpoint. (Message 1504)
Posted 17 Aug 2013 by Ananas
Funny results are still possible :

http://pogs.theskynet.org/pogs/workunit.php?wuid=2116739

Run time 13,921.00 CPU time 935,451.60


but they must be really rare, this is the first one for me with 3.40.
11) Questions and Answers : Web site : Credits not updated on "Accounts" page for projects. wrong link. (Message 1381)
Posted 25 Jul 2013 by Ananas
BOINC Combined Statistics need to update their project URL for POGS in order to fix that :

POGS stats still show with the old URL there.
12) Message boards : Number crunching : Only getting 1 task at a time (Message 1377)
Posted 24 Jul 2013 by Ananas
It could be the long term debt. If that has a high negative value, BOINC sometimes refuses to request enough work. You can influence it with the command line tool. Open a "DOS box", navigate to your BOINC directory and try this command :

boinccmd.exe --host 127.0.0.1 --set_debts http://pogs.theskynet.org/pogs/ 0 80000

The second number (80000) is the new value for the long term debt, the first number is the short term value - but you cannot set them separately.

Another thing that influences how much BOINC requests is the combination of these settings in your global preferences :
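If you need to do this for several hosts or projects, the command line can be assembled in a script. The Python sketch below only builds the argument list shown above and prints it; it does not run anything. The function name is made up, and it assumes a client old enough to still accept `--set_debts`.

```python
def set_debts_args(project_url: str, short_term: int, long_term: int,
                   host: str = "127.0.0.1") -> list[str]:
    """Build the boinccmd argument list; both debt values are always passed."""
    return [
        "boinccmd.exe",
        "--host", host,
        "--set_debts", project_url,
        str(short_term),   # first number: short term value
        str(long_term),    # second number: long term value
    ]


args = set_debts_args("http://pogs.theskynet.org/pogs/", 0, 80000)
print(" ".join(args))
```

The list can be handed to `subprocess.run(args)` from the BOINC directory once the printed command looks right.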

- On multiprocessors, use at most
- Maintain enough tasks to keep busy for at least
- ... and up to an additional

Make sure that those values are positive and match your needs.

One more is "Resource share" in the project settings. If that value is only a small percentage of all concurrent / active projects, the server might "think" that you cannot finish the results in time (there should be a server message upon work requests if this is the case). Unfortunately BOINC ignores that projects can be set to "no new work"; it still counts those in. You can set 1000 there instead of the default 100 in order to increase the share for this project. And while you're there, check "Run only the selected applications" and make sure to set the checkbox for fitsedwrapper.

Another one is the average uptime of your box. You cannot influence the value without manipulating client_state.xml. After a longer downtime (vacation?) the value for <on_frac> in the <time_stats> section goes very low, whereas a host that runs continuously has a value of 0.999 there. Never manipulate this file while BOINC is active and do not edit it with Word or Wordpad, use notepad or some other plain text editor instead. Oh, and never manipulate this file while BOINC is active.
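Before editing anything, it is worth just looking at the current value. The following Python sketch reads <on_frac> without modifying the file. The XML fragment is a made-up, heavily shortened sample (root element name and value are assumptions); a real client_state.xml contains far more.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened sample; not copied from a real client_state.xml.
SAMPLE = """<client_state>
  <time_stats>
    <on_frac>0.412000</on_frac>
  </time_stats>
</client_state>"""


def read_on_frac(xml_text: str) -> float:
    """Return the <on_frac> value found inside a <time_stats> section."""
    root = ET.fromstring(xml_text)
    node = root.find(".//time_stats/on_frac")
    if node is None or node.text is None:
        raise ValueError("no <on_frac> found in <time_stats>")
    return float(node.text)


print(read_on_frac(SAMPLE))
```

For a real file, read its contents into a string first and pass that to `read_on_frac`; reading is harmless, but the warning about changing the file while BOINC is active still applies.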

I hope that at least one of those hints helps - good luck :-)
13) Message boards : Number crunching : Problem fixed : CPU time loss at each checkpoint. (Message 1331)
Posted 19 Jul 2013 by Ananas
The new wrapper version collects/accumulates the CPU times of all fit_sed calls.

This fixes the invalid DCF problem, so the client cache issue (clients requesting way too much work after a while) does not exist anymore :-)

Nice change, thanks!
14) Message boards : News : New theSkyNet Web Site (Message 1313)
Posted 18 Jul 2013 by Ananas
I think it would be better to configure the main page as

http://pogs.theskynet.org/pogs/

instead of

http://pogs.theskynet.org/pogs/index.php

Technically it doesn't make much of a difference, it's just how most other projects do it.

p.s.: strange effect ... 2 hosts have registered the scheduler on 54.208.77.129/... and requested work without a problem.
The third one, which came a bit later, found it to be pogs.theskynet.org/... - and fails to connect, even though the URL works when I try it in a web browser. EDIT : it is an IP resolution / IP cache problem. For some reason that 3rd host finds the old page contents under the new URL, probably some IP change confusion on the server side. It will sort itself out sooner or later. If you have the same problem, restarting the BOINC client should fix it (I have projects without checkpoints running, so I cannot try that now).
The difference between the first 2 hosts and the third one is that the third one tried to fetch work while the server was being moved (allowing it to resolve the now wrong IP), whereas the other two had been stopped.
15) Message boards : Number crunching : Validator needs a kick (Message 1281)
Posted 22 Jun 2013 by Ananas
it seems to be stuck or it crashed

edit : never mind, it was the server status display that didn't update - sorry.
16) Message boards : Number crunching : Is validation a game here? (Message 1189)
Posted 12 May 2013 by Ananas
I just checked the history of that computer reporting 0 sec run-time. It seems to be running an old BOINC 6.X version. ...

My 5.x version does the same, you're right, it is a version issue.

But then ... the runtime is irrelevant anyway, because a BOINC task that does not get 100% of a core due to concurrent tasks with higher priority might report too high a runtime, whereas the CPU time will still be correct.

The older core clients do report the CPU time btw., the wrapper just doesn't accumulate the values and reports only the time used by the final "concat" call - which needs nearly no time.

As POGS has fixed credits (per fit_sed call), it doesn't matter much for the credits (besides confusing people). It has a very strong effect on the duration correction factor though, causing the client-side caching to act up - it requests more and more work the further the DCF goes down. I fix that by draining the project cache and resetting the project every few days - but I would prefer a bugfix of course ;-)

p.s.: the cache limit you're writing about seems to be inactive here. The 50 you have seen seems to be the limit of tasks it sends per request, the total per host seems to be unlimited.
17) Message boards : Number crunching : Credit across BOINC projects (Message 1122)
Posted 13 Feb 2013 by Ananas
Try this one to make it cache more results :

boinccmd.exe --host <YourHostName> --set_debts http://ec2-23-23-126-96.compute-1.amazonaws.com/pogs/ 0 100000


If the second value (the 100000, long term debt) is negative, it will never request much work and, afaik, you cannot influence that value from the client GUI, so you need to use the command line tool.

The first value (the 0, short term debt) is required by the command, you cannot modify the values separately (besides editing client_state.xml of course).

edit :

By setting the first value very high (limit is somewhere at 86400), you can give a project a temporarily higher run priority, whereas the second value decides about the downloads.

As those values are not directly coupled, the core client can run into a situation where a project has a high crunching (short term) debt but a low download (long term) debt, which can make a project run empty. Projects that are empty or suspended lose their short term debt, but the long term debt is not affected.

p.s.: If your cache setting is very high, even this little trick will not help. Try to set the cache not higher than 2 weeks divided by the number of active projects. As you run multiple projects, the risk of running out of work is very low anyway. I'm running between 3 and 5 projects on my hosts and have set the cache size to less than a day.

A smaller cache even makes BOINC waste less time - it has to write client_state.xml every few seconds, and fewer cached WUs mean a smaller client_state.xml file.
18) Message boards : Number crunching : Error on file upload: can't open log file (Message 969)
Posted 18 Dec 2012 by Ananas
[error] Error on file upload: can't open log file '../log_ip-*-*-*-*/file_upload_handler.log' (errno: 9)

HDD full? Out of free inodes?
19) Message boards : Number crunching : Linux vs Windows completion times (Message 936)
Posted 14 Dec 2012 by Ananas
GNU Fortran, I don't think that there is so much difference in the assembly code quality on different operating systems.

On similar boxes - is the real runtime different or is it just the reported runtime? Getting the runtime from an uninterrupted workunit (end time - start time) and dividing it by the total number of fit_sed runs should give you a comparable value. Just comparing the runtime for the first fit_sed run will probably do as well.
20) Message boards : Number crunching : Check Pointing (Message 931)
Posted 13 Dec 2012 by Ananas
The estimated times are as random and as wrong as the reported runtimes and CPU times.

The BOINC client takes the estimated floating point operations, adjusts them to the host's benchmark value and to the duration correction factor - and this duration correction factor is completely useless in this project.
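As a rough illustration of why a wrong duration correction factor ruins the estimate, here is a simplified Python model of that calculation. It is not the client's actual code; the operation count and the benchmark value are made up, and the two DCF values (0.5 and 38.5) are the ones from message 1695 above.

```python
def estimated_runtime_s(fpops_est: float, benchmark_flops: float, dcf: float) -> float:
    """Simplified estimate: operations / host speed, scaled by the DCF."""
    return fpops_est / benchmark_flops * dcf


FPOPS_EST = 3.6e13        # hypothetical estimated operations per workunit
BENCHMARK_FLOPS = 2.0e9   # hypothetical benchmark result of the host

for dcf in (0.5, 38.5):
    seconds = estimated_runtime_s(FPOPS_EST, BENCHMARK_FLOPS, dcf)
    print(f"DCF {dcf}: {seconds:.0f} s ({seconds / 86400:.2f} days)")
```

In this model the same workunit goes from a few hours to about eight days of estimated runtime, which is enough to push the client into panic mode.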

So forget all displayed time values on BOINC client side, the only values you can count on are the progress bar and starting/ending timestamps in the stdout of the results, after they have been uploaded.

The project has checkpoints btw., not BOINC-like ones but by design. Example :

wrapper: starting
15:32:18 (1068): wrapper: running fit_sed (1 filters.dat observations.dat)
15:42:57 (1068): wrapper: running fit_sed (2 filters.dat observations.dat)
15:53:44 (1068): wrapper: running fit_sed (3 filters.dat observations.dat)
16:02:29 (1068): wrapper: running fit_sed (4 filters.dat observations.dat)
16:12:24 (1068): wrapper: running fit_sed (5 filters.dat observations.dat)
16:22:42 (1068): wrapper: running fit_sed (6 filters.dat observations.dat)
16:31:59 (1068): wrapper: running fit_sed (7 filters.dat observations.dat)
16:41:51 (1068): wrapper: running fit_sed (8 filters.dat observations.dat)
16:51:19 (1068): wrapper: running fit_sed (9 filters.dat observations.dat)
17:02:26 (1068): wrapper: running fit_sed (10 filters.dat observations.dat)
17:12:37 (1068): wrapper: running fit_sed (11 filters.dat observations.dat)
17:21:37 (1068): wrapper: running fit_sed (12 filters.dat observations.dat)
17:31:22 (1068): wrapper: running fit_sed (13 filters.dat observations.dat)
17:40:10 (1068): wrapper: running fit_sed (14 filters.dat observations.dat)
17:48:32 (1068): wrapper: running fit_sed (15 filters.dat observations.dat)
17:57:58 (1068): wrapper: running fit_sed (16 filters.dat observations.dat)
18:06:05 (1068): wrapper: running concat (16 output.fit)
18:06:17 (1068): called boinc_finish


This result had 15 checkpoints, one between every two consecutive runs of fit_sed. The total runtime was about 2.5 hours, so there has been a checkpoint about every 9 minutes (the calculation might look wrong, but the last fit_sed leads to the final result, not to a checkpoint, so you have to divide the runtime by 16, not by 15).

If a project does checkpoints like this, I am totally satisfied with it. This type of checkpoint doesn't consume any extra CPU time - a very good concept if you ask me.
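The same arithmetic can be done directly on the stdout of a result. The Python sketch below parses the timestamps from the log above and derives the number of fit_sed runs, the number of checkpoints and the average time per run. The function name is made up, and it assumes the task does not run across midnight, because the log only carries the time of day.

```python
import re
from datetime import datetime

STDOUT = """\
15:32:18 (1068): wrapper: running fit_sed (1 filters.dat observations.dat)
15:42:57 (1068): wrapper: running fit_sed (2 filters.dat observations.dat)
15:53:44 (1068): wrapper: running fit_sed (3 filters.dat observations.dat)
16:02:29 (1068): wrapper: running fit_sed (4 filters.dat observations.dat)
16:12:24 (1068): wrapper: running fit_sed (5 filters.dat observations.dat)
16:22:42 (1068): wrapper: running fit_sed (6 filters.dat observations.dat)
16:31:59 (1068): wrapper: running fit_sed (7 filters.dat observations.dat)
16:41:51 (1068): wrapper: running fit_sed (8 filters.dat observations.dat)
16:51:19 (1068): wrapper: running fit_sed (9 filters.dat observations.dat)
17:02:26 (1068): wrapper: running fit_sed (10 filters.dat observations.dat)
17:12:37 (1068): wrapper: running fit_sed (11 filters.dat observations.dat)
17:21:37 (1068): wrapper: running fit_sed (12 filters.dat observations.dat)
17:31:22 (1068): wrapper: running fit_sed (13 filters.dat observations.dat)
17:40:10 (1068): wrapper: running fit_sed (14 filters.dat observations.dat)
17:48:32 (1068): wrapper: running fit_sed (15 filters.dat observations.dat)
17:57:58 (1068): wrapper: running fit_sed (16 filters.dat observations.dat)
18:06:05 (1068): wrapper: running concat (16 output.fit)
18:06:17 (1068): called boinc_finish
"""


def checkpoint_stats(stdout_text: str):
    """Return (fit_sed runs, checkpoints, total seconds, seconds per run)."""
    stamps = [
        datetime.strptime(t, "%H:%M:%S")
        for t in re.findall(r"^(\d\d:\d\d:\d\d) \(", stdout_text, flags=re.MULTILINE)
    ]
    runs = len(re.findall(r"running fit_sed", stdout_text))
    total_s = (stamps[-1] - stamps[0]).total_seconds()
    # One checkpoint between consecutive runs; the last run ends in the result.
    return runs, runs - 1, total_s, total_s / runs


runs, checkpoints, total_s, per_run_s = checkpoint_stats(STDOUT)
print(runs, checkpoints, total_s, round(per_run_s / 60, 1))
```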

What needs to be fixed is the reporting of the CPU time, as this affects the workunit caching on client side - and it makes people nervous ;-)




Copyright © 2017 The International Centre for Radio Astronomy Research