Steve at CERN

Next Week EGEE 09

2009-09-17T17:52:00.002+02:00

Next week is of course EGEE 09 in Barcelona. As a warm-up, the EGEE SA1 OAT offers a sneak preview.

http://www.youtube.com/watch?v=PADq2x8q0kw

Installed Capacity at CHEP

2009-03-23T08:06:00.009+01:00

This week is CHEP 09, preceded by the WLCG workshop. I presented some updates on the roll out of the installed capacity document. It included examples of a few sites that would have zero capacity if considered under the new metrics.
Sites should consider taking the following actions.

Check gridmap. In particular the view obtained by clicking on the more label and selecting the "size by SI00 and LogicalCPUs".
Adjust your published #LogicalCPUS in your SubCluster. It should correspond to the number of computing cores that you have.
Adjust your #Specint2000 settings in the SubCluster. The aim is to make your gridmap box the correct size to represent your total site power.
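
To illustrate why a zero in either field shrinks a site to nothing, here is a minimal sketch that sums LogicalCPUs times SI00 over a site's SubClusters. The record layout, host names and numbers are made up for the example, and that gridmap sizes its boxes with exactly this product is an assumption on my part.

```python
from dataclasses import dataclass


@dataclass
class SubCluster:
    # Hypothetical record holding the two published values discussed above.
    name: str
    logical_cpus: int  # value published as GlueSubClusterLogicalCPUs
    si00: int          # value published as GlueHostBenchmarkSI00 (per core)


def site_power(subclusters):
    """Total site power, assumed here to be sum(LogicalCPUs * SI00)."""
    return sum(sc.logical_cpus * sc.si00 for sc in subclusters)


def zero_capacity(subclusters):
    """Names of SubClusters that contribute nothing to the total."""
    return [sc.name for sc in subclusters if sc.logical_cpus * sc.si00 == 0]


# Made-up example site: the second SubCluster forgot to set LogicalCPUs.
example = [
    SubCluster("ce01.example.org", 400, 1500),
    SubCluster("ce02.example.org", 0, 1800),
]
```

Here site_power(example) only counts ce01, and zero_capacity(example) points at ce02 as the one to fix.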

The follow-up questions were as follows. Now a chance for a more considered response.

Will there be any opportunity to run a benchmark within the gcm framework?
I answered that this was not possible: unless a benchmark could be executed in under 2 seconds there was no room for it. Technically there would not be a problem with running something for longer, since it could be run rarely. We should check how the first deployment of GCM goes; longer tests are in no way planned though.
What is GCM collecting and who can see its results?
Currently no one can see anything on the wire since messages are encrypted. There should be a display at https://gridops.cern.ch/gcm; it is currently down, but once it is there it will be accessible to IGTF CA members. For now there are some test details available.
When should sites start publishing the HEPSpecInt2006 benchmark?
The management contributed "Now", which is of course correct; the procedure is well established. Sites should be in the process of measuring their clusters with the HEPSpec06 benchmark. With the next YAIM release they will be able to publish the value as well.
If sites are measuring these benchmarks can the values be made available on the worker nodes to jobs?
Recently the new glite-wn-info made it as far as the PPS service. This allows a job on the WN to find out which GlueSubCluster it belongs to. In principle this should be enough, since the Spec benchmarks can be retrieved from the GlueSubClusters. The reality of course is that this is not possible until some future date when all the WNWG recommendations are deployed along with CREAM. So for now I will extend glite-wn-info to also return a HepSpec2006 value as configured by the site administrators.
Do you know how many sites are currently publishing incorrect data?
I did not know the answer, nor is an answer easy to get other than by collecting the ones of zero size. Checking now, of 498 (468 unique?) SubClusters some 170 have zero LogicalCPUs.
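
A count like the one above can be sketched roughly as follows, assuming the GlueSubCluster entries have already been dumped as LDIF text from a BDII. The function name and the sample entries are invented for illustration; only the two GLUE attribute names are real.

```python
def count_subclusters(ldif_text):
    """Return (entries, unique_ids, zero_cpu_entries) for GlueSubCluster entries.

    Simplification: LDIF continuation lines (wrapped values) are not handled.
    A missing GlueSubClusterLogicalCPUs attribute is counted as zero.
    """
    entries = 0
    zero = 0
    unique_ids = set()
    # LDIF separates entries with a blank line.
    for block in ldif_text.strip().split("\n\n"):
        attrs = {}
        for line in block.splitlines():
            if ": " in line:
                key, value = line.split(": ", 1)
                attrs[key] = value.strip()
        if "GlueSubClusterUniqueID" not in attrs:
            continue
        entries += 1
        unique_ids.add(attrs["GlueSubClusterUniqueID"])
        if int(attrs.get("GlueSubClusterLogicalCPUs", "0")) == 0:
            zero += 1
    return entries, len(unique_ids), zero


# Made-up sample: ce02 appears twice, which is how the entry count and the
# unique count can drift apart.
SAMPLE = """\
dn: GlueSubClusterUniqueID=ce01.example.org,mds-vo-name=siteA,o=grid
GlueSubClusterUniqueID: ce01.example.org
GlueSubClusterLogicalCPUs: 128

dn: GlueSubClusterUniqueID=ce02.example.org,mds-vo-name=siteB,o=grid
GlueSubClusterUniqueID: ce02.example.org
GlueSubClusterLogicalCPUs: 0

dn: GlueSubClusterUniqueID=ce02.example.org,mds-vo-name=siteC,o=grid
GlueSubClusterUniqueID: ce02.example.org
GlueSubClusterLogicalCPUs: 0
"""
```

Whether the zero count should be taken over all entries or only over unique IDs is a choice; the sketch counts entries.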

On a more random note, a member of CMS approached me afterwards to thank me for the support I gave him 3 or so years ago while working at RAL. At the time we both had an interest in making grid work. He got extra queues, reserved resources, process dumps and general job watching from me. They were the first grid jobs we had approaching something similar to the analysis we now face. Quoting the gentleman: from his grid experience and results using RAL he obtained his doctorate and CMS chose to use the grid.

SA1 Nagios Deployment Update

2009-01-29T09:58:00.005+01:00

The EGEE SA1 Nagios bundle was released yesterday with significant updates.

GOCDB Integration: A list of Sites can now be collected using the GOCDB's new API. In particular a list of sites in a ROC or in a Country can be monitored extending the previous LDAP filter on Sites.
GOCDB Downtimes: Downtimes entered in the GOCDB are now also pulled into and inserted as NAGIOS downtimes for your services.
HGSM Integration: HGSM is the SouthEast Europe equivalent to the GOCDB.
NDOUtils Installed: NDOUtils sits behind NAGIOS and fills in a MySQL database with NAGIOS's configuration and metric and test results.
New SRM Tests: These mimic some of the logic of the existing SAM SRM tests and are their eventual replacement. In NAGIOS speak we now have an active check that submits scripts and returns passive results for each of the lcg-cr, lcg-rep, lcg-del steps seen before.
NSCA Installed: Especially for the case where two nodes are used, a NAGIOS node and an NRPE-triggered UI, passive test results are submitted back via NSCA from the NRPE-UI. Well almost - Bug.
New BDII Checks: These are the checks taken directly from the gstat2 work but now running against your services.
New msg-to-queue Service: Running on a NAGIOS box this subscribes to externally executed test results for your Site or ROC from the ActiveMQ messaging system. Currently nothing is actually coming in but much of the infrastructure is now there.
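
For anyone unfamiliar with passive results, the sketch below formats one NAGIOS external command line per SRM step, using the standard PROCESS_SERVICE_CHECK_RESULT command. The service naming scheme ("SRM-" plus the step name), the host name and the helper functions are assumptions for illustration and are not taken from the bundle itself.

```python
import time

# Standard NAGIOS plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def passive_result(host, service, code, output, timestamp=None):
    """Format one passive service result as a NAGIOS external command line."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return f"[{ts}] PROCESS_SERVICE_CHECK_RESULT;{host};{service};{code};{output}"


def srm_step_results(host, steps, timestamp=None):
    """steps is a list of (step_name, succeeded) pairs in execution order.

    The 'SRM-<step>' service names are hypothetical.
    """
    lines = []
    for step_name, succeeded in steps:
        code = OK if succeeded else CRITICAL
        text = f"{step_name} {'succeeded' if succeeded else 'failed'}"
        lines.append(passive_result(host, f"SRM-{step_name}", code, text, timestamp))
    return lines
```

One active check can run all the steps and then hand back a line per step, so each step shows up as its own service in NAGIOS.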

As before, installation can still be done completely via YAIM for either a site or a ROC. New packages can be followed for i386 or x86_64. And of course bugs and feedback are always welcome.

The update contains work from Emir, James, Laurence, Konstantin and myself.

Top BDII Publishing Revamp

2008-10-17T11:09:00.003+02:00

I have updated the topbdii summary page.
It now includes an historical element as well so you can see the individual site inclusions in each top level BDII over a period of time. There are clearly some topBDIIs which are flakier than others represented by vertical blue lines and some sites are dropped at times by all BDIIs represented by horizontal blue lines.

Also the tables are broken down by BDII version numbers as well.

WNWG Update at the GDB.

2008-05-13T17:21:00.004+02:00

Tomorrow I present an update of the Worker Node Working Group at the GDB. There is now a complete plan. It is complicated, with updates to information providers, lcg-tags, YAIM and site configurations of YAIM, but it will work, though it adds complexity. As planned, small sites can ignore everything. It seems it can be deployed at its own speed and once done sites can reconfigure to multiple sub clusters in their own time. The presentation has far too much detail for a GDB audience but it's quite a narrow subject. It is mostly of interest to site administrators since it is all work for them. The users just get the benefits: better allocation of jobs to suitable worker nodes.

Interesting GGUS Posts

2008-04-25T22:29:00.002+02:00

In the last two or three weeks GGUS has started serving an experimental RSS feed of new posts. Watching the posts there is a whole new view on what is really happening at a low level of 1 to 1 GRID operations.... My aim is to post once a week a summary of the three or so most interesting or relevant tickets....

35783. Tristan Glatard reported that lcg-utils, when called with the timeout option, -t, exited with a segfault rather than timing out gracefully. In reality it seems an upgrade fixes this though it is yet to be confirmed.
35798. Earlier in the week a gLite release was made that turned out to break certain low level communications between the WN and CE at a globus level. The problem was fixed and released within a working day but with the rush it was never mentioned that the repositories for the different node types have been split for the first time. This is transparent if sites follow the install instructions to the letter but I am sure many do not.... I don't. Sites should be told that the glite-WN is currently a break away from everything else.

Plotting the GlueSiteLocation

2008-03-20T23:39:00.006+01:00

I plotted the GlueSiteLocation as advertised in the GlueSites a few days ago. In fact a quick look over the results shows that the field is filled in well in all but a few cases. Most of the plots are centered on a country with a few "World" ones showing everything. Chris and Steve as a result have an action to fix their location some time whenever they fancy it:-)

Filling in the GlueSite

2008-03-14T20:42:00.003+01:00

The last few days I have again wanted to resolve a given site name as belonging to an EGEE ROC. There are no public interfaces to resolve something that would be useful to so many people including GGUS, the GridPP Real Time Monitor, Operations, Accounting and even the Managers. So I have created a review and proposal to fill the GlueSite object in with well defined information. I have spoken to the WLCG coordinators and will follow up in time with ROC managers, GGUS, OSG, the WLCG monitoring working group and anyone else I can think of who might be interested. Eventually when something settles down I'll push it through EGEE deployment and YAIM and onto the sites.

Graph the FTS Deployment

2008-02-25T17:56:00.004+01:00

The FTS deployment is of course made up of channels between sites in the EGEE/LCG grid. This plot shows all the channels as queried from the BDII as lines connecting the sites. The grouping of nodes by colours represents all the channels managed by a single FTS instance.

Again it is another plot that is almost impossible to use.

Graphing Glue

2008-02-18T22:56:00.019+01:00

I've been playing with GraphViz and HyperGraph to draw out the relationships between the objects in the GLUE schema. The results are interesting but, especially for the complicated sites, e.g. CERN, GRIF, RAL,... they are hard to read. There are massive PNGs, plus VRML and SVG formats. Also the html pages load a dynamic applet of the data. I'll request that they are worked into gstat, which is the natural place for them to live, once I have improved them. Warning: Some of the png images are huge and will most likely crash your browser. The image here is RAL-LCG2. The orange blob top right is the FTS.

Combined SLC4 and SL4 CE.

2008-02-14T16:40:00.004+01:00

As part of the WN working group efforts I've been finishing off a vandalised CE which is now part of the CERN_PPS site. It is a two-node cluster, both of which are WNs. One is also a gatekeeper publishing GlueCE, VOViews and CESEBind objects, the other is publishing two GlueCluster and GlueSubCluster objects. ... To cut a long story short I have a single gatekeeper node which can be matched to for either SL4 or SLC4 and the jobs route through to the correct node at the back. It is exactly the same for any GlueSubCluster attribute such as memory for instance. Tomorrow we try and thrash out how to get the final steps of this into a release.... It is quite possibly not going to be pretty, keeping things backwards compatible may just not be possible or desirable. details.

WN Meeting Kickoff Last Week

2007-10-09T20:19:00.000+02:00

The first EGEE WN working group meeting happened last week. It went well. I had previously thought I knew exactly what the end point would be even if the exact details of migration were not together. The oversight for me was software tags. Namely, for a site publishing non-overlapping SubClusters a software installation job has to publish software tags in the correct SubCluster. Exactly how you determine which SubCluster you are installing for is far from obvious. No one has suggested a better solution than a configuration file different on every batch worker, there has to be a better way....

From the EGEE conference there was an addition to the group's scope. The group will give help, advice, support, approval, something, ... to the WMS wrapper script addition of doing something sensible in grid jobs during the SIGTERM -> SIGKILL window given by the batch system. Francesco's presentation looks to have considered everything but I can check with the group if there are any comments.

One of the consequences of offering users better job matching to WNs is that the sites are possibly going to start killing more excessive jobs dead. A timely addition if the job wrapper now handles the SIGTERMs.
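
To make the SIGTERM -> SIGKILL window concrete, here is a minimal sketch of a wrapper that notices SIGTERM and uses the remaining time to clean up rather than being killed mid-step. It is not the WMS job wrapper; the class, the function and the idea of a job as a list of steps are all invented for illustration.

```python
import signal


class TermWatcher:
    """Remembers that SIGTERM has arrived (hypothetical helper)."""

    def __init__(self):
        self.terminating = False

    def install(self):
        # Must be called from the main thread of the wrapper process.
        signal.signal(signal.SIGTERM, self.on_term)

    def on_term(self, signum, frame):
        # Keep the handler tiny: just record the request.
        self.terminating = True


def run_job(steps, watcher, cleanup):
    """Run steps in order; stop and clean up once SIGTERM has been seen.

    Returns (names_of_completed_steps, finished_normally).
    """
    done = []
    for step in steps:
        if watcher.terminating:
            # Use the grace period before SIGKILL to save what we can.
            cleanup(done)
            return done, False
        step()
        done.append(step.__name__)
    return done, True
```

A real wrapper would call watcher.install() before starting, and cleanup would do things like copying back partial output. A single long step still needs to finish or be interrupted within the window the batch system allows.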

FTS Release Progress

2007-09-19T18:59:00.000+02:00

At last the FTM node has actually made it to be released on the PPS shortly. I really don't want to think how long it has taken. Longer than you can possibly imagine. Also this week I tried the SL4 build of the FTS for the first time. Installation has been fine but configuration now looks to be stuck and will need some new builds to be solved. Most of the problems are changes between gLite 3.0 and 3.1 rather than the OS upgrade. Having said that, most of the changes between 3.0 and 3.1 look sensible and bits of yaim configuration are being removed basically.

Torque Maui Builds from ETICS.

2007-09-14T20:44:00.000+02:00

Finally got around to pushing maui and torque through ETICS. Bang up to date releases for sl3 and sl4 with i386 or X86_64 are all available. Others like SL5 should now be trivial. The torque build was fairly straightforward whereas the maui build does some black magic for the torque dependency it needs. Recently there have been many requests for the X86_64 builds and it is needed anyway for the upcoming official gLite X86_64 WN which is around the corner now. Also the Dubliners have been wanting it for ages to build other exotic platforms like MacOS. Releases, CVS Repos. To finish this off I need to close down some of the pages hosted at RAL and GridPP with pointers to the new locations. Not bad for a day when I got my notice of contract termination.

Back to School.

2007-09-10T20:35:00.000+02:00

After a two week break without a single keystroke pressed I've been back for my first day. The big outstanding item is starting a TCG inspired group to investigate and conclude on finding and utilizing individual worker node resources. My first thought is that it is all done but there are massive gaps in flexibility of configuration or deployment to achieve it..., it is not done in other words. Other than this the FTM node I've been trying to deploy was rejected from certification for a lack of timestamps in its logfiles. This is fair enough, I thought it at the time but ignored the problem. Finally the SL4 FTS must be installed, the compilation is there but who knows if it works.

SAM test CE-sft-posix now running.

2007-08-16T09:44:00.000+02:00

After deployment mistakes and corrections the CE-sft-posix test is now running again. The significant change is that new SEs that appear now pass for the first two weeks. Previously they were guaranteed to fail for the first two weeks. The results look promising in that out of 287 CEs tested today 69 or so look to fail. This includes the 10 or so CERN CEs where there may be a problem with the configuration of the atlas-durable and lhcb-durable SRMs but perhaps only with respect to the OPS vo which is probably not well tested. Next step will be to get CERN passing. Need to have my own house in order before proceeding. Following that, go through the other failures to see if they seem fair.

Learning Oracle

2007-07-27T21:30:00.000+02:00

This week I was on the Oracle Administrators Course I and II. Apparently because we are CERN and bright young things they compressed the two week course into a one week course. At first it might not be obvious that the course would be directly related to what I do day-to-day but I now certainly understand better what the DBAs nearby, whom the FTS relies on, are doing. In the past they have been easily able to baffle me with the science of blocks, fragmentation, performance tuning, ... The course was given by Lutz Hartmann, a so called Oracle ACE no less. He was genuinely enthusiastic to teach us and it was all pretty effective. I learned as much as I possibly could in one week so am pleased. I now have one year to do my exams to get the Oracle Administrator qualification. It has to be useful for a post CERN life apart from anything else.

Dependents

2007-07-17T23:11:00.000+02:00

This month and either side is the much touted gLite restructuring exercise. The aim is to chuck out old code and clean up some of what is there, especially looking at dependencies and the like. A review basically. Seems I have been given the LFC and DPM to review which is good, I think they should be fairly obvious. The lack of java makes them easier compared to what it could be. There were some suggestions that the dependencies could be determined from the ETICS system; it is good that the final RPM result is now being used. It is this final product, after all, which we aim to clean up.

Cardinal Sin of Grid Operations

2007-07-04T20:43:00.000+02:00

CERN today committed the cardinal sin of grid operations. They allowed a host certificate to expire for a production service. This took the MyProxy service used by the FTS out for almost 24 hours. The service is a victim of its own success because it had been running itself for the last year without interruption but when it came to it, it fell between the cracks of responsibility. I've every confidence it will never happen again. At least with this service anyway.
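
The kind of check that would have caught this is small. The sketch below takes a certificate's notAfter string (the date format printed by openssl x509 -enddate) and says whether it is inside a warning window. The function names, the 30 day default and the dates in the usage note are made up for the example; how CERN actually monitors its certificates is not something this sketch claims to show.

```python
import ssl

SECONDS_PER_DAY = 86400.0


def days_left(not_after, now_seconds):
    """Days between now_seconds (epoch) and a notAfter string such as
    'Jul  4 00:00:00 2007 GMT'."""
    return (ssl.cert_time_to_seconds(not_after) - now_seconds) / SECONDS_PER_DAY


def needs_renewal(not_after, now_seconds, warn_days=30):
    """True once fewer than warn_days remain (or the certificate has expired)."""
    return days_left(not_after, now_seconds) < warn_days
```

For example, with now taken as ssl.cert_time_to_seconds("Jun  4 00:00:00 2007 GMT"), a certificate ending "Jul  3 00:00:00 2007 GMT" is already inside a 30 day window. Run from cron against every host certificate, something like this at least turns a silent expiry into a nagging mail.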

FTM Node is Getting There

2007-07-03T21:09:00.000+02:00

The newly named FTM (File Transfer Monitoring) node reached a significant point today. All the packages are done including the GridView publishing ones. Also the yaim configuration is done within the new glite-fts-yaim module. But there seems to be some dependency problem which is best resolved by just avoiding the problem and moving to SL4. This will be a pain since in the first instance the FTM monitoring will require a different OS to the FTS itself. The dependencies may be able to be improved further as well. Currently VDT is needed just to provide an openldap client to the BDII plugin within the glite-sd-query tool. Hopefully this can be removed. I'm sure it does not need to be there.

OS Matching

2007-07-03T21:07:00.000+02:00

I've now published a recipe for matching RHEL3 and RHEL4 OS clones. It's a lot more complicated than it should be and no doubt there will be grumbling but I am happy it is correct.

A Good Match

2007-06-29T22:58:00.000+02:00

A busy week this week where I collected more things to do than I completed. There were some complaints from sites that some VOs have asked them to switch off their resources when their SE is not present for whatever reason. It turns out that CMS already have a good query for finding CEs and SEs but it just needs a bit added to it so as not to match CEs when the SE is not published. This should help sites and users a lot if something sensible can be matched. In the process I discovered that there is work going on now in the UK to write a more intelligent service information provider. It looks good in principle but I must give some comments before it gets further. On a similar matching note CERN is about to go live with some SL4 resources which means people are now actually worried about matching to versions of OSes. It seems a bug reported as fixed a while back may in fact not be fixed, though I need to check whether this fix has actually gone into the latest WMS that was deployed today.

Writing a New YAIM Component

2007-06-20T17:04:00.000+02:00

Committed and started running a new YAIM component today to configure the new FTM node. It is going to be really trivial to do the code, basically I need to edit some text files and turn some services on so I should be able to handle that. I decided to dig out ed for the purpose. It was over 10 years ago, worryingly, that someone first showed me ed and tried to convince me that if you wanted to edit a text file from a script it was the perfect tool. At the time it seemed far too backwards, today it does indeed seem to be the perfect tool for the job. The initial testing went okay and uncovered several bugs or improvements to be made to YAIM core which are now submitted. The new modular yaim looks like a very sensible idea. It allows me to work on the FTS part without fear of breaking or holding up the rest of the world.

RPMs not ETICS RPMs

2007-06-19T17:36:00.000+02:00

After generating RPMS with ETICS and being generally dissatisfied with them I investigated what others are doing. It seems the order of the day is just to write your own .spec file and tell ETICS not to attempt to do so. It defeats the object of ETICS in that it is meant to be package neutral but it decreases my trivial installation instructions from 10 points to just 2 since the rest is handled by a decent package. It should make the subsequent upgrades trivial as well, as opposed to a complete reconfiguration from scratch. In fact ETICS is adding features all the time; for example it recently added support for the %config RPM directive, which is a step in the right direction to make the autogenerated spec files better.

FTS Upgrade at CERN

2007-06-18T19:02:00.000+02:00

After careful planning the CERN FTS servers, production T0 export, tiertwo and the pilot were all upgraded today to version 2.0. It basically went okay and much according to plan, completing 10 minutes before the end of the announced intervention period. In fact two things went wrong. The quattor wrapper for yaim was never tested with the new YAIM, so the new glite-FTS2 and glite-FTA2 targets had to be merged in by hand. That was my mistake. The other problem still needs investigating but basically huge fragmentation of the production database appears to have (a) made the schema update a lot slower than expected and (b) corrupted an index. It looks like, although things are working, some more downtime will be needed, perhaps longer than the upgrade itself, to clean the database up. All in all it makes my life a lot easier, there is now a lot less left that I am managing from before my time getting to CERN.