Thursday 17 September 2009

Next Week EGEE 09

Next week is of course EGEE 09 in Barcelona. As a warm-up, the EGEE SA1 OAT section presents a sneak preview.



http://www.youtube.com/watch?v=PADq2x8q0kw

Monday 23 March 2009

Installed Capacity at CHEP

This week is CHEP 09, preceded by the WLCG workshop. I presented some updates on the roll-out of the installed capacity document. They included examples of a few sites that would have zero capacity if considered under the new metrics.
Sites should consider taking the following actions:

  • Check gridmap, in particular the view obtained by clicking on the "more" label and selecting "size by SI00 and LogicalCPUs".
  • Adjust the #LogicalCPUs published in your SubCluster. It should correspond to the number of computing cores that you have.
  • Adjust your #Specint2000 settings in the SubCluster. The aim is to make your gridmap box the correct size to represent your total site power.
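As a rough check of what the two published values imply, the sketch below assumes that the gridmap box size is proportional to LogicalCPUs multiplied by the per-core SI00 value. That product rule, the record layout and all the numbers are assumptions made for illustration, not something taken from gridmap itself.

```python
# Hypothetical sub-cluster records; names and numbers are illustrative only.
subclusters = [
    {"id": "cluster-a", "logical_cpus": 400, "si00_per_core": 1500},
    {"id": "cluster-b", "logical_cpus": 0, "si00_per_core": 2000},
]


def installed_capacity(records):
    """Total power, assuming size = LogicalCPUs * SI00 per core (an assumption)."""
    return sum(r["logical_cpus"] * r["si00_per_core"] for r in records)


def zero_sized(records):
    """Ids of sub-clusters that would contribute nothing under that rule."""
    return [r["id"] for r in records
            if r["logical_cpus"] * r["si00_per_core"] == 0]


print(installed_capacity(subclusters))
print(zero_sized(subclusters))
```

Under this assumption a SubCluster that publishes zero LogicalCPUs contributes nothing to the site total, whatever its SI00 value.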
The follow-up questions were as follows, and this is now a chance for a more considered response.
  1. Will there be any opportunity to run a benchmark within the gcm framework?
    I answered that this was not possible: unless a benchmark could be executed in under 2 seconds there was no room for it. Technically there would be no problem with running something for longer, as it could be run rarely. We should see how the first deployment of GCM goes, though longer tests are in no way planned.
  2. What is GCM collecting and who can see its results?
    Currently no one can see the results on the wire since messages are encrypted. There should be a display at https://gridops.cern.ch/gcm; it is currently down, but once it is there it will be accessible to IGTF CA members. For now there are some test details available.
  3. When should sites start publishing the HEPSpecInt2006 benchmark?
    The management contributed "Now", which is of course correct; the procedure is well established. Sites should be in the process of measuring their clusters with the HEPSpec06 benchmark. With the next YAIM release they will also be able to publish the value.
  4. If sites are measuring these benchmarks, can the values be made available on the worker nodes to jobs?
    Recently the new glite-wn-info made it as far as the PPS service. This allows a job on the WN to find which GlueSubCluster it belongs to. In principle this should be enough, since the Spec benchmarks can be retrieved from the GlueSubClusters. The reality of course is that this is not possible until some future date when all the WNWG recommendations are deployed along with CREAM. So for now I will extend glite-wn-info to also return a HepSpec2006 value as configured by the site administrators.
  5. Do you know how many sites are currently publishing incorrect data?
    I did not know the answer, nor is an answer easy to obtain other than by collecting the SubClusters of zero size. Checking now, of 498 (468 unique?) SubClusters some 170 have zero LogicalCPUs.
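The count in the last answer can be reproduced along these lines. This is a minimal sketch that assumes the SubCluster entries have already been dumped from an information system as LDIF text. The sample entries and hostnames are invented. The parser ignores LDIF line folding and base64 values, and it treats a missing GlueSubClusterLogicalCPUs attribute as zero.

```python
# Invented sample data in the shape of a Glue SubCluster LDIF dump.
SAMPLE_LDIF = """\
dn: GlueSubClusterUniqueID=ce01.example.org,mds-vo-name=SITE-A,o=grid
GlueSubClusterUniqueID: ce01.example.org
GlueSubClusterLogicalCPUs: 0

dn: GlueSubClusterUniqueID=ce02.example.org,mds-vo-name=SITE-B,o=grid
GlueSubClusterUniqueID: ce02.example.org
GlueSubClusterLogicalCPUs: 128

dn: GlueSubClusterUniqueID=ce02.example.org,mds-vo-name=SITE-B-alt,o=grid
GlueSubClusterUniqueID: ce02.example.org
GlueSubClusterLogicalCPUs: 128
"""


def parse_entries(ldif_text):
    """Split LDIF text into one attribute dict per entry (no line folding)."""
    entries = []
    for block in ldif_text.strip().split("\n\n"):
        entry = {}
        for line in block.splitlines():
            if ": " in line:
                key, value = line.split(": ", 1)
                entry[key] = value
        if entry:
            entries.append(entry)
    return entries


def summarise(entries):
    """Return (total entries, unique SubCluster ids, entries with zero LogicalCPUs)."""
    unique_ids = {e.get("GlueSubClusterUniqueID") for e in entries}
    zero = sum(1 for e in entries
               if int(e.get("GlueSubClusterLogicalCPUs", "0")) == 0)
    return len(entries), len(unique_ids), zero


print(summarise(parse_entries(SAMPLE_LDIF)))
```

The gap between total and unique entries in the sample comes from the same SubCluster appearing under two different DNs, which is one plausible reason for a difference like 498 against 468.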
On a more random note, a member of CMS approached me afterwards to thank me for the support I gave him 3 or so years ago while working at RAL. At the time we both had an interest in making grid work. He got extra queues, reserved resources, process dumps and general job watching from me. They were the first grid jobs we had approaching something similar to the analysis we now face. To quote the gentleman: from his grid experience and results using RAL he obtained his doctorate and CMS chose to use the grid.

Thursday 29 January 2009

SA1 Nagios Deployment Update

The EGEE SA1 Nagios bundle was released yesterday with significant updates.
  • GOCDB Integration: A list of Sites can now be collected using the GOCDB's new API. In particular a list of sites in a ROC or in a Country can be monitored extending the previous LDAP filter on Sites.
  • GOCDB Downtimes: Downtimes entered in the GOCDB are now also pulled in and inserted as NAGIOS downtimes for your services.
  • HGSM Integration: HGSM is the SouthEast Europe equivalent to the GOCDB.
  • NDOUtils Installed: NDOUtils sits behind NAGIOS and fills in a MySQL database with NAGIOS's configuration and metric and test results.
  • New SRM Tests: These mimic some of the logic of the existing SAM SRM tests and are their eventual replacement. In NAGIOS speak we now have an active check that submits scripts and returns passive results for each of the lcg-cr, lcg-rep, lcg-del steps seen before.
  • NSCA Installed: Especially for the case where two nodes are used, a NAGIOS node and an NRPE-triggered UI, passive test results are submitted back via NSCA from the NRPE-UI. Well almost - Bug.
  • New BDII Checks: These are the checks taken directly from the gstat2 work but now running against your services.
  • New msg-to-queue Service: Running on a NAGIOS box this subscribes to externally executed test results for your Site or ROC from the ActiveMQ messaging system. Currently nothing is actually coming in but much of the infrastructure is now there.
As before, installation can still be done completely via YAIM for either a site or a ROC. New packages can be followed for i386 or x86_64. And of course bugs and feedback are always welcome.
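To make the passive-result path concrete, the sketch below formats one result in two ways: as the tab-separated line that send_nsca reads on standard input by default, and as the PROCESS_SERVICE_CHECK_RESULT external command that NAGIOS accepts on its command file. The host and service names are hypothetical and are not the names used by the SA1 bundle.

```python
def nsca_line(host, service, return_code, output):
    """One passive service result in send_nsca's default tab-delimited format."""
    return f"{host}\t{service}\t{return_code}\t{output}\n"


def external_command(timestamp, host, service, return_code, output):
    """The same result as a NAGIOS external command line."""
    return (f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
            f"{host};{service};{return_code};{output}")


# Hypothetical per-step results from one SRM test run (0 = OK, 2 = CRITICAL).
steps = [("SRM-lcg-cr", 0, "OK"), ("SRM-lcg-rep", 0, "OK"), ("SRM-lcg-del", 2, "delete failed")]
for service, code, text in steps:
    print(nsca_line("se01.example.org", service, code, text), end="")
```

Each step of the active check becomes its own passive service, so a failure in the delete step does not hide a successful copy.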

The update contains work from Emir, James, Laurence, Konstantin and myself.