Thursday, 17 September 2009

Next Week EGEE 09

Next week is of course EGEE 09 in Barcelona. As a warm-up, here is a sneak preview from the EGEE SA1 OAT sessions.

Monday, 23 March 2009

Installed Capacity at CHEP

This week is CHEP 09, preceded by the WLCG workshop. I presented some updates on the rollout of the installed capacity document. It included examples of a few sites that would have zero capacity if considered under the new metrics.
Sites should consider taking the following actions.

  • Check gridmap. In particular, look at the view obtained by clicking on the "more" label and selecting "size by SI00 and LogicalCPUs".
  • Adjust your published #LogicalCPUs in your SubCluster. It should correspond to the number of computing cores that you have.
  • Adjust your #Specint2000 settings in the SubCluster. The aim is to make your gridmap box the correct size to represent your total site power.
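
As a rough illustration of why both values matter, here is a minimal sketch that assumes the gridmap box size is driven by the sum of LogicalCPUs times SI00 over a site's SubClusters. The SubCluster record, the host names and the numbers are all made up for the example; only the two published quantities come from the text above.

```python
from dataclasses import dataclass


@dataclass
class SubCluster:
    """Hypothetical record holding the two published values discussed above."""
    unique_id: str
    logical_cpus: int   # published LogicalCPUs (number of computing cores)
    si00: int           # published SpecInt2000 rating per core


def site_power(subclusters):
    """Total site power, assumed here to be sum(LogicalCPUs * SI00)."""
    return sum(sc.logical_cpus * sc.si00 for sc in subclusters)


def zero_capacity(subclusters):
    """IDs of SubClusters that contribute nothing under that formula."""
    return [sc.unique_id for sc in subclusters
            if sc.logical_cpus * sc.si00 == 0]


# Made-up example site: the second SubCluster never set its LogicalCPUs.
site = [
    SubCluster("ce01.example.org", logical_cpus=400, si00=1500),
    SubCluster("ce02.example.org", logical_cpus=0, si00=1800),
]
print(site_power(site))      # 600000
print(zero_capacity(site))   # ['ce02.example.org']
```

A SubCluster with either value left at zero simply vanishes from the total, which is how a working site can end up with zero capacity.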
The follow-up questions were the following. Now a chance for a more considered response.
  1. Will there be any opportunity to run a benchmark within the gcm framework?
    I answered that this was not possible: unless the benchmark could be executed in under 2 seconds there was no room for it. Technically there would be no problem with running something for longer, provided it was run rarely. We should check how the first deployment of GCM goes; longer tests are in no way planned though.
  2. What is GCM collecting and who can see its results?
    Currently no one can see anything on the wire since messages are encrypted. There should be a display, but it is currently down; once it is back it will be accessible to IGTF CA members. For now there are some test details available.
  3. When should sites start publishing the HEPSpecInt2006 benchmark?
    The management contributed "Now", which is of course correct; the procedure is well established. Sites should be in the process of measuring their clusters with the HEPSpec06 benchmark. With the next YAIM release they will also be able to publish the value.
  4. If sites are measuring these benchmarks, can the values be made available on the worker nodes to jobs?
    Recently the new glite-wn-info made it as far as the PPS service. This allows a job on the WN to find out which GlueSubCluster it belongs to. In principle this should be enough, since the Spec benchmarks can be retrieved from the GlueSubClusters. The reality of course is that this is not possible until some future date when all the WNWG recommendations are deployed along with CREAM. So for now I will extend glite-wn-info to also return a HepSpec2006 value as configured by the site administrators.
  5. Do you know how many sites are currently publishing incorrect data?
    I did not know the answer, nor is one easy to obtain other than by collecting the SubClusters of zero size. Checking now, of 498 (468 unique?) SubClusters some 170 have zero LogicalCPUs.
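
The count in the last answer can be reproduced with something like the sketch below. It assumes the SubCluster entries have already been dumped from a BDII as simple LDIF text, and that the attributes of interest are GlueSubClusterUniqueID and GlueSubClusterLogicalCPUs. The sample entries, host names and DNs are invented, and the parser is deliberately naive.

```python
def parse_ldif(text):
    """Split simple LDIF text into a list of {attribute: value} dicts.

    Assumes one 'name: value' pair per line and a blank line between
    entries; continuation lines and base64 values are not handled.
    """
    entries, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            if current:
                entries.append(current)
                current = {}
            continue
        name, _, value = line.partition(": ")
        current[name] = value
    if current:
        entries.append(current)
    return entries


def count_zero_logical_cpus(entries):
    """Return (total entries, unique IDs, unique IDs with zero LogicalCPUs)."""
    by_id = {}
    for entry in entries:
        # Later duplicates of the same unique ID overwrite earlier ones.
        by_id[entry["GlueSubClusterUniqueID"]] = int(
            entry.get("GlueSubClusterLogicalCPUs", "0"))
    zeros = sum(1 for cpus in by_id.values() if cpus == 0)
    return len(entries), len(by_id), zeros


# Invented sample: ce01 is published twice, ce02 has no cores configured.
SAMPLE = """\
dn: GlueSubClusterUniqueID=ce01.example.org,mds-vo-name=a
GlueSubClusterUniqueID: ce01.example.org
GlueSubClusterLogicalCPUs: 400

dn: GlueSubClusterUniqueID=ce02.example.org,mds-vo-name=a
GlueSubClusterUniqueID: ce02.example.org
GlueSubClusterLogicalCPUs: 0

dn: GlueSubClusterUniqueID=ce01.example.org,mds-vo-name=b
GlueSubClusterUniqueID: ce01.example.org
GlueSubClusterLogicalCPUs: 400
"""
print(count_zero_logical_cpus(parse_ldif(SAMPLE)))  # (3, 2, 1)
```

Counting by unique ID is what separates the "498" style total from the "468 unique" style figure; a missing attribute is treated as zero here, which is a choice of this sketch rather than anything the schema requires.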
On a more random note, a member of CMS approached me afterwards to thank me for the support I gave him 3 or so years ago while working at RAL. At the time we both had an interest in making grid work. He got extra queues, reserved resources, process dumps and general job watching from me. They were the first grid jobs we had that approached anything similar to the analysis we now face. Quoting the gentleman, from his grid experience and results using RAL he obtained his doctorate and CMS chose to use the grid.

Thursday, 29 January 2009

SA1 Nagios Deployment Update

The EGEE SA1 Nagios bundle was released yesterday with significant updates.
  • GOCDB Integration: A list of Sites can now be collected using the GOCDB's new API. In particular a list of sites in a ROC or in a Country can be monitored extending the previous LDAP filter on Sites.
  • GOCDB Downtimes: Downtimes entered in the GOCDB are now also pulled into and inserted as NAGIOS downtimes for your services.
  • HGSM Integration: HGSM is the SouthEast Europe equivalent to the GOCDB.
  • NDOUtils Installed: NDOUtils sits behind NAGIOS and fills in a MySQL database with NAGIOS's configuration and metric and test results.
  • New SRM Tests: These mimic some of the logic of the existing SAM SRM tests and are their eventual replacement. In NAGIOS speak we now have an active check that submits scripts and returns passive results for each of the steps of the lcg-cr, lcg-rep, lcg-del sequence seen before.
  • NSCA Installed: Especially for the case where two nodes are used, a NAGIOS node and an NRPE-triggered UI, passive test results are submitted back via NSCA from the NRPE-UI. Well almost - Bug.
  • New BDII Checks: These are the checks taken directly from the gstat2 work but now running against your services.
  • New msg-to-queue Service: Running on a NAGIOS box, this subscribes to externally executed test results for your Site or ROC from the ActiveMQ messaging system. Currently nothing is actually coming in but much of the infrastructure is now there.
As before, installation can still be done completely via YAIM for either a site or a ROC. New packages can be followed for i386 or x86_64. And of course bugs and feedback are always welcome.

The update contains work from Emir, James, Laurence, Konstantin and myself.
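
To make the passive result idea from the SRM and NSCA items a little more concrete, here is a small sketch of how per-step results could be formatted as input for send_nsca, which reads one tab-separated line per service check (host, service description, return code, plugin output). The "srm-" service names, the host name and the messages are hypothetical and are not what the bundle actually uses.

```python
# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def nsca_line(host, service, code, output):
    """One passive service result in the tab-separated send_nsca format."""
    if code not in (OK, WARNING, CRITICAL, UNKNOWN):
        raise ValueError("unexpected return code: %r" % code)
    # Tabs and newlines separate fields and records, so collapse them in output.
    clean = " ".join(output.split())
    return "\t".join([host, service, str(code), clean]) + "\n"


def srm_results(host, steps):
    """Turn [(step, code, message), ...] into send_nsca input text."""
    return "".join(
        nsca_line(host, "srm-" + step, code, message)  # hypothetical service names
        for step, code, message in steps
    )


# Invented results for the three steps mentioned above.
text = srm_results("se01.example.org", [
    ("lcg-cr", OK, "file copied and registered"),
    ("lcg-rep", OK, "replica created"),
    ("lcg-del", CRITICAL, "delete failed:\ntimeout"),
])
print(text, end="")
```

The resulting text would be piped into send_nsca on the NRPE-UI side; each line then shows up on the NAGIOS node as a passive result for the matching service.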

Friday, 17 October 2008

Top BDII Publishing Revamp

I have updated the topbdii summary page.
It now includes a historical element as well, so you can see the individual site inclusions in each top level BDII over a period of time. There are clearly some topBDIIs which are flakier than others, represented by vertical blue lines, and some sites are dropped at times by all BDIIs, represented by horizontal blue lines.

Also the tables are broken down by BDII version numbers as well.
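
The two patterns on the page, a site dropped by every BDII and a single BDII lagging behind the others, can be picked out of the same data with a few lines of code. The sketch below assumes the history is available as a mapping from sample time to the set of sites each top level BDII published; that structure, the BDII names and the site names are invented for the example and are not how the summary page stores its data.

```python
def find_gaps(history):
    """history maps a timestamp label to {bdii: set of sites it published}.

    Returns (dropped, missing): for each timestamp, the sites absent from
    every BDII, and per BDII the number of samples where it lacked a site
    that at least one other BDII was publishing.
    """
    all_sites = set()
    for snapshot in history.values():
        for sites in snapshot.values():
            all_sites |= sites

    dropped = {}
    missing = {}
    for when, snapshot in history.items():
        seen_somewhere = set().union(*snapshot.values())
        gone = all_sites - seen_somewhere
        if gone:
            dropped[when] = sorted(gone)
        for bdii, sites in snapshot.items():
            if seen_somewhere - sites:
                missing[bdii] = missing.get(bdii, 0) + 1
    return dropped, missing


# Invented history: S2 disappears everywhere at t2, bdii-b lags at t3.
history = {
    "t1": {"bdii-a": {"S1", "S2"}, "bdii-b": {"S1", "S2"}},
    "t2": {"bdii-a": {"S1"}, "bdii-b": {"S1"}},
    "t3": {"bdii-a": {"S1", "S2"}, "bdii-b": {"S1"}},
}
dropped, missing = find_gaps(history)
print(dropped)  # {'t2': ['S2']}
print(missing)  # {'bdii-b': 1}
```

A high count in the second result points at a flaky top level BDII, while entries in the first point at a problem on the site side.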

Tuesday, 13 May 2008

WNWG Update at the GDB.

Tomorrow I present an update of the Worker Node Working Group at the GDB. There is now a complete plan. It is complicated, with updates to information providers, lcg-tags, YAIM and the sites' YAIM configurations, but it will work, albeit with added complexity. As planned, small sites can ignore everything. It seems it can be deployed at its own speed, and once done sites can reconfigure to multiple sub clusters in their own time. The presentation has far too much detail for a GDB audience but it's quite a narrow subject. It is mostly of interest to site administrators since it is all work for them. The users just get the benefits: better allocation of jobs to suitable worker nodes.

Friday, 25 April 2008

Interesting GGUS Posts

For the last two or three weeks GGUS has been serving an experimental RSS feed of new posts. Watching the posts gives a whole new view on what is really happening at the low level of 1 to 1 GRID operations.... My aim is to post once a week a summary of the three or so most interesting or relevant tickets....
  1. 35783. Tristan Glatard reported that lcg-utils, when called with the timeout option, -t, exited with a segfault rather than timing out gracefully. In reality it seems an upgrade fixes this, though it is yet to be confirmed.
  2. 35798. Earlier in the week a gLite release was made that turned out to break certain low level communications between the WN and CE at a globus level. The problem was fixed and released within a working day, but in the rush it was never mentioned that the repositories for the different node types have been split for the first time. This is transparent if sites follow the install instructions to the letter but I am sure many do not.... I don't. Sites should be told that the glite-WN is currently a break away from everything else.

Thursday, 20 March 2008

Plotting the GlueSiteLocation

I plotted the GlueSiteLocation as advertised in the GlueSites a few days ago. In fact a quick look over the results shows that the field is filled in well in all but a few cases. Most of the plots are centered on a country with a few "World" ones showing everything. Chris and Steve as a result have an action to fix their location some time whenever they fancy it:-)