The first EGEE WN working group meeting happened last week. It went well. I had previously thought I knew exactly what the end point would be even if the exact details of migration were not together. The oversight for me was software tags. Namely that for a site publishing non-overlapping SubClusters, a software installation job has to publish software tags in the correct SubCluster. Exactly how you determine which SubCluster you are installing for is far from obvious. No one has suggested a better solution than a configuration file that is different on every batch worker. There has to be a better way....
From the EGEE conference there was an addition to the group's scope. The group will give help, advice, support, approval, something, ... to the WMS wrapper script addition of doing something sensible in grid jobs during the SIGTERM -> SIGKILL window given by the batch system. Francesco's presentation looks to have considered everything but I can check with the group if there are any comments.
One of the consequences of offering users better job matching to WNs is that the sites are possibly going to start killing more excessive jobs dead. A timely addition if the job wrapper now handles the SIGTERMs.
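To make the SIGTERM -> SIGKILL window concrete, here is a minimal sketch of the general idea in Python. It is not the WMS job wrapper and not what Francesco presented; the class name GracefulJob and its methods are made up for illustration. The handler only records that SIGTERM arrived, and the work loop checks the flag so it can stop and tidy up before the batch system follows with SIGKILL.

```python
import os
import signal
import time


class GracefulJob:
    """Hypothetical job wrapper that notices SIGTERM and stops cleanly."""

    def __init__(self):
        self.stop_requested = False

    def install(self):
        # Must be called from the main thread, as required by signal.signal.
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame):
        # Keep the handler tiny: just record the request.
        self.stop_requested = True

    def run(self, steps, step_seconds=0.0):
        """Run up to `steps` units of work; return how many completed."""
        done = 0
        for _ in range(steps):
            if self.stop_requested:
                # Assumed cleanup point: save output, report status, etc.
                break
            time.sleep(step_seconds)
            done += 1
        return done
```

The useful property is that the cleanup happens in the normal flow of the job rather than inside the signal handler, so it only has to finish before the batch system's SIGKILL arrives.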
Tuesday, 9 October 2007
Wednesday, 19 September 2007
At last the FTM node has actually made it and will be released on the PPS shortly. I really don't want to think how long it has taken. Longer than you can possibly imagine. Also this week I tried the SL4 build of the FTS for the first time. Installation has been fine but configuration now looks to be stuck and will need some new builds to be solved. Most of the problems are changes between gLite 3.0 and 3.1 rather than the OS upgrade. Having said that, most of the changes between 3.0 and 3.1 look sensible and bits of yaim configuration are basically being removed.
Friday, 14 September 2007
Finally got around to pushing maui and torque through ETICS. Bang up to date releases for SL3 and SL4 with i386 or X86_64 are all available. Others like SL5 should now be trivial. The torque build was fairly straightforward whereas the maui build does some black magic for the torque dependency it needs. Recently there have been many requests for the X86_64 builds and they are needed anyway for the upcoming official gLite X86_64 WN which is around the corner now. Also the Dubliners have been wanting it for ages to build for other exotic platforms like MacOS. Releases, CVS Repos. To finish this off I need to close down some of the pages hosted at RAL and GridPP with pointers to the new locations. Not bad for a day when I got my notice of contract termination.
Monday, 10 September 2007
After a two week break without a single keystroke pressed I've been back for my first day. The big outstanding item is starting a TCG inspired group to investigate and conclude on finding and utilizing individual worker node resources. My first thought is that it is all done but there are massive gaps in flexibility of configuration or deployment to achieve it..., it is not done in other words. Other than this the FTM node I've been trying to deploy was rejected from certification for a lack of timestamps in its logfiles. This is fair enough, I thought it at the time but ignored the problem. Finally the SL4 FTS must be installed, the compilation is there but who knows if it works.
Thursday, 16 August 2007
After deployment mistakes and corrections the CE-sft-posix test is now running again. The significant change is that now new SEs that appear pass for the first two weeks. Previously they were guaranteed to fail for the first two weeks. The results look promising in that out of 287 CEs tested today 69 or so look to fail. This includes the 10 or so CERN CEs where there may be a problem with the configuration of the atlas-durable and lhcb-durable SRMs but perhaps only with respect to the OPS VO which is probably not well tested. Next step will be to get CERN passing. Need to have my own house in order before proceeding. Following that, go through the other failures to see if they seem fair.
Friday, 27 July 2007
This week I was on the Oracle Administrators Course I and II. Apparently because we are CERN and bright young things they compressed the two week course into a one week course. At first it might not be obvious that the course would be directly related to what I do day-to-day but I now certainly understand better what the DBAs nearby are doing with the databases that the FTS uses. In the past they have easily been able to baffle me with the science of blocks, fragmentation, performance tuning, ... The course was given by Lutz Hartmann, a so called Oracle ACE no less. He was genuinely enthusiastic to teach us and it was all pretty effective. I learned as much as I possibly could in one week so am pleased. I now have one year to do my exams to get the Oracle Administrator qualification. It has to be useful for a post CERN life apart from anything else.
Tuesday, 17 July 2007
This month and either side is the much touted gLite restructuring exercise. The aim is to chuck out old code and clean up some of what is there, especially looking at dependencies and the like. A review basically. Seems I have been given the LFC and DPM to review which is good, I think they should be fairly obvious. The lack of java makes them easier compared to what it could be. There were some suggestions that the dependencies could be determined from the ETICS system, it is good that the final RPM result is now being used. It is the final product, after all, that we aim to clean up.
Wednesday, 4 July 2007
CERN today committed the cardinal sin of grid operations. They allowed a host certificate to expire for a production service. This took the MyProxy service used by the FTS out for almost 24 hours. The service is a victim of its own success because it had been running itself for the last year without interruption, but when it came to it, it fell between the cracks of responsibility. I've every confidence it will never happen again. At least with this service anyway.
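For what it is worth, the kind of check that would have caught this is tiny. A minimal sketch, assuming you already have the certificate's notAfter date as the text that OpenSSL prints (how you obtain it, and the 30 day warning threshold, are my assumptions and not how CERN monitors its services). It uses ssl.cert_time_to_seconds from the Python standard library.

```python
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Return the days left before a certificate's notAfter time.

    `not_after` is in the form OpenSSL prints, for example
    "Jan  5 09:34:43 2018 GMT". `now` is seconds since the epoch and
    defaults to the current time.
    """
    expires = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return (expires - now) / 86400.0


def needs_renewal(not_after, warn_days=30, now=None):
    """True when the certificate expires within `warn_days` days."""
    return days_until_expiry(not_after, now) < warn_days
```

Run from a daily cron against every production host certificate, something like this at least turns a silent expiry into a warning a month ahead.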
Tuesday, 3 July 2007
The newly named FTM (File Transfer Monitoring) node reached a significant point today. All the packages are done including the GridView publishing ones. Also the yaim configuration is done within the new glite-fts-yaim module. But there seems to be some dependency problem which is best resolved by just avoiding the problem and moving to SL4. This will be a pain since in the first instance the FTM monitoring will require a different OS to the FTS itself. The dependencies may be able to be improved further as well. Currently VDT is needed just to provide an openldap client to the BDII plugin within the glite-sd-query tool. Hopefully this can be removed. I'm sure it does not need to be there.
Friday, 29 June 2007
A busy week this week where I collected more things to do than I completed. There were some complaints from sites that some VOs have asked them to switch off their resources when their SE is not present for whatever reason. It turns out that CMS already have a good query for finding CEs and SEs but it just needs a bit added to it to not match CEs when the SE is not published. This should help sites and users a lot if something sensible can be matched. In the process discovered that there is work going on now to write a more intelligent service information provider in the UK. It looks good in principle but I must give some comments before it gets further. On a similar matching note CERN is about to go live with some SL4 resources which means people are now actually worried about matching to versions of OSes. It seems a bug reported as fixed a while back may in fact not be fixed, though I need to check this fix has actually gone into the latest WMS that was deployed today.
Wednesday, 20 June 2007
Committed and started running a new YAIM component today to configure the new FTM node. It is going to be really trivial to do the code, basically I need to edit some text files and turn some services on so I should be able to handle that. I decided to dig out ed for the purpose. It was over 10 years ago, worryingly, that someone first showed me ed and tried to convince me that if you wanted to edit a text file from a script it was the perfect tool. At the time it seemed far too backwards, today it does indeed seem to be the perfect tool for the job. The initial testing went okay and uncovered several bugs or improvements to be made to YAIM core which are now submitted. The new modular yaim looks like a very sensible idea. It allows me to work on the FTS part without fear of breaking or holding up the rest of the world.
Tuesday, 19 June 2007
After generating RPMs with ETICS and being generally dissatisfied with them I investigated what others are doing. It seems the order of the day is just to write your own .spec file and tell ETICS not to attempt to do so. It defeats the object of ETICS in that it is meant to be package neutral but it decreases my trivial installation instructions from 10 points to just 2 since the rest is handled by a decent package. It should make the subsequent upgrades trivial as well, as opposed to a complete reconfiguration from scratch. In fact ETICS is adding features all the time, for instance it now supports the %config RPM directive, which is a step in the right direction to make the autogenerated spec files better.
Monday, 18 June 2007
After careful planning the CERN FTS servers, production T0 export, tiertwo and the pilot were all upgraded today to version 2.0. It basically went okay and pretty much to plan, with completion 10 minutes before the end of the announced intervention period. In fact two things went wrong. The quattor wrapper for yaim was never tested with the new YAIM, so the new glite-FTS2 and glite-FTA2 targets had to be merged in by hand. That was my mistake. The other problem still needs investigating but basically huge fragmentation of the production database appears to have (a) made the schema update a lot slower than expected and (b) corrupted an index. All in all it looks like although things are working some more downtime will be needed, perhaps longer than the upgrade itself, to clean the database up. Overall it makes my life a lot easier, there is now a lot less left that I am managing from before my time at CERN.
Thursday, 14 June 2007
At the EGEE operations meeting a presentation from the WLCGMG described their first sensor that might be useful especially to sites running or considering running nagios. Basically a nagios sensor that will raise the SAM alarms on your local hosts in your nagios system. They are currently looking for a few volunteer sites as early adopters.
Tuesday, 12 June 2007
Thursday, 31 May 2007
Have now started a concerted effort to package each of the FTS scripts that are lying around as FtsAddons. One is done now, one of Ron's scripts. Now started on Paolo's scripts and by chance there was some interest in this from Taiwan's T1 so it looks like I have a tester, which is always good. Other than that I seem to be dashing between French and the Hospital for a dodgy thumb that I have. Took the minutes for the EMT as well. This is easily the best meeting to understand what is going into the upcoming software.
Friday, 25 May 2007
I've been CIC-on-Duty this week, lots of problems caused by poorly tested tests such as R-GMA not being on port 8443 at many sites. The good thing though is that it has given the procedure for new tests a wake up call and hopefully next time it will go better. This is good since the next test may well be the posix test of mine. End of the week has been difficult though, a scheduled GGUS intervention followed by an unscheduled SAM outage and now the GOCDB has fallen over. Of course these three events all happened on consecutive days.
Sunday, 20 May 2007
I used to publish my Apple iCal calendars to a webdav and from there import them in ical format to gcal. I've now configured GCalDaemon to do a two way sync between ical and gcal. Previously I could not update the gcal version of my calendar. This is a lot better since I need to work in gcal anyway to update Jenny's calendar. So far it looks to be working perfectly.
Friday, 18 May 2007
Installed VirtualBox on my Mac today followed by Ubuntu. Looks to be much the same as Parallels but it is free to use in binary form for personal use and the underlying code is even opensource so that is good. So far it seems to be working. The reality though is I have not missed linux on my laptop since getting a mac.
Building an RPM would not normally be exciting but this was out of the EGEE CVS using ETICS to build and publish an RPM. The RPM is in reality about as simple as it gets and contains basically a cron, some scripts and logrotate to do some monitoring of the FTS. I still have some unanswered questions, for instance I'm currently unable to tag a version nor do I know how to generate a nightly build of a particular ETICS tag. For now this is fine and allows Andrey to proceed with doing the rest.
Tuesday, 15 May 2007
Made some progress today with ETICS and actually have it making a package. However I did not manage to do it without getting expert help from someone who is far too busy to help with my small problems. Turned out I was using an old client, referenced on a twiki page albeit not an official ETICS page, and there was a bit of the web interface I could click that I did not realise I could click.
Monday, 14 May 2007
After what was meant to be an easy addition of another webserver to the FTS today, it turns out there was yet another file that had been added by the developers to make it work. It's a classic problem of course. I guess one day that layer of access should be closed down. Hopefully with the upcoming upgrade to the FTS there will be time to wipe the service and start again. The only sure way to check that what you think quattor is going to do actually happens, and that you end up with a working service, is at install time.
Started to look at an OGF standard today, BES. Looks to be some standard interface to batch systems that various bits of EGEE will support including cream. However I thought that DRMAA was a similar thing from OGF to give a method for how to interact with batch systems. Need to find out more about both to understand how they are distinct.
Friday, 11 May 2007
Looked yesterday at the changes that would be applied to the FTS pilot if it was brought in line with the PPS deployment of the FTS. In principle they are the same but the pilot has pre-release versions of the RPMs. In reality it looks like just about every RPM will get an upgrade so hopefully we will end up with the same thing that works the same or at least as well as the pilot. This will definitely be done before the production FTSes get the upgrade.
Made a first serious attempt today to add a component into ETICS. In terms of software to be added it is very trivial. Just some scripts to be run as a cron to generate some web reports from the FTS. So after reading through quite a few python stack traces it turns out I only have read access to the bits of ETICS I needed to operate on. Have requested more access now.
Wednesday, 9 May 2007
Spent almost the whole day writing quattor configuration files for the upcoming FTS release. Very painful, it is basically equivalent to writing an rpm database by hand via trial and error. One of those tasks where in hindsight it would have been easier to write a script but anyway....
Tuesday, 8 May 2007
Wrote a page describing how to install JPackage versions of JDK on Scientific Linux. It is a lot easier to use the JPackage ones rather than the SUN supplied ones but it probably requires a change in practice for many sites who were already doing something else.
Monday, 7 May 2007
Not very interesting but fixed up the local CERN FTS hacks to comply with the new yaim and ncm-yaim that have been applied across all services at CERN. Also the split in the SetToDesiredState packages has been accommodated. In the process I noticed that the SRM and castorgrid hosts were broken due to this so alerted the relevant people to correct the situation. Also asked about how and when the SetToDesiredState was useful which I now understand. This had been something I had not understood since learning of its existence shortly after arriving at CERN...
Friday, 4 May 2007
There were some reports from the EMT that installing tomcat5 from jpackage with SUN's JDK 1.5 is a little problematic. Had a look at it today and now I understand it all, though how to proceed is unclear. In short tomcat5 requires xml-commons-jaxp-1.3-apis which obsoletes xml-commons-apis. This property, however, is provided by SUN's JDK 1.5 and so installing the xml-commons api package has the result of removing SUN's JDK! Submitted a bug to jpackage and provided a recipe of installation order that works. Bug 266.
Thursday, 3 May 2007
Another meeting about monitoring the FTS and how to create summary tables of transfers categorized by error and the like. In fact the stuff already done on the FTS spider looks good. It just needs to be brought closer to the main development and do more in Oracle rather than in PHP.
I've written a short script as an adventure for me in python. voms2gacl downloads the XML formatted member list of a VOMS server and creates a .gacl file for use by mod_gridsite. In short it is an easy way to restrict access to webpages by VO for web-browsers. voms2gacl. In fact I wrote it a while ago but I realised I never posted it.
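For anyone wanting the flavour of the .gacl half without the script itself, here is a minimal sketch. It is not voms2gacl: it skips the VOMS download entirely and starts from a plain list of member DNs, and the function name build_gacl is made up. The element names (gacl, entry, person, dn, allow, read, list) are my understanding of the GridSite GACL layout and should be checked against the GridSite documentation before use.

```python
import xml.etree.ElementTree as ET


def build_gacl(member_dns, permissions=("read", "list")):
    """Return GACL XML text granting `permissions` to each DN.

    `member_dns` is assumed to be a list of certificate DN strings
    already extracted from the VOMS server by some other means.
    """
    gacl = ET.Element("gacl")
    for dn in member_dns:
        # One entry per VO member, identified by certificate DN.
        entry = ET.SubElement(gacl, "entry")
        person = ET.SubElement(entry, "person")
        ET.SubElement(person, "dn").text = dn
        allow = ET.SubElement(entry, "allow")
        for perm in permissions:
            ET.SubElement(allow, perm)
    return ET.tostring(gacl, encoding="unicode")
```

Writing the returned text to a .gacl file in the protected directory would be the last step, left out here since the file placement depends on the mod_gridsite setup.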
Wednesday, 2 May 2007
This week I'm both the GMOD and the CERN ROC which basically results in lots of meetings. The GDB was today as well but I was unable to attend much of it due to other meetings. In the EMT meeting I was surprised that moving to jpackage looks to be non-trivial. The most basic thing of using Java 1.5 and tomcat looks to not be obvious but I think with the correct combination of magic it should work.
French classes restarted on Monday. They are now on Monday, Wednesday and Thursday mornings. So far so good and this time I really want to put some work into getting the most out of them. Face it, I am never going to get a better chance to learn a language than I do now.
Tuesday, 1 May 2007
Thursday, 26 April 2007
Interesting talk about the hardware testing that CERN does both at burn in and also routinely. A lot of these things such as fsprobe, SMART, inventory and memory tests are run routinely on the boxes I run so it is good to hear a description of what they are doing.
A summary of the WLCG system admin group was given. Some of the things they are up to include starting a wiki to collect together tips and scripts that people are working on, e.g. torque, maui and cfengine recipes. It is a general Cookbook of ideas. There is also a subversion repository which again uses gridsite like the wiki does. There are currently 9 scripts in the repository and more are needed. They need more volunteers, which is exactly why they are publicising at events like HEPiX.
Monday, 23 April 2007
Brookhaven presented that they were running batch jobs in XEN machines on their compute farm. The reason is that the nodes also serve storage space within dCache or xrootd and they want to protect this storage from batch jobs that crash the machine.
One interesting idea from PSI. Rather than having a single SSH gateway they have a gateway that, once you are logged into it, allows you at the firewall to login to any machine. Much better, going through gateways is always a pain.
Friday, 20 April 2007
Two new roles have now been added to the dteam voms service. These are ftsadmin and ftsmaster. They are associated with regions, the first say /dteam/uki/Role=ftsadmin would contain a list of people able to edit FTS channels around the globe with RAL as endpoint. We still need to check that it really works on the FTS but it is meant to.
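To show how a name like /dteam/uki/Role=ftsadmin breaks down, here is a small sketch that splits it into group and role and pulls out the region. The helper names parse_fqan and region_for_role are made up, and treating the second group component as the region is my reading of the scheme above, not something the FTS itself is confirmed to do this way.

```python
def parse_fqan(fqan):
    """Split a VOMS FQAN into (group, role).

    "/dteam/uki/Role=ftsadmin" gives ("/dteam/uki", "ftsadmin").
    A missing role or "Role=NULL" gives a role of None.
    """
    group_parts = []
    role = None
    for part in fqan.strip("/").split("/"):
        if part.startswith("Role="):
            value = part[len("Role="):]
            role = None if value == "NULL" else value
        elif part.startswith("Capability="):
            continue  # capabilities are ignored in this sketch
        else:
            group_parts.append(part)
    return "/" + "/".join(group_parts), role


def region_for_role(fqan, wanted_role):
    """Return the region (second group component) if the role matches."""
    group, role = parse_fqan(fqan)
    if role != wanted_role:
        return None
    parts = group.strip("/").split("/")
    return parts[1] if len(parts) > 1 else None
```

So region_for_role("/dteam/uki/Role=ftsadmin", "ftsadmin") gives "uki", which is the piece that would be matched against the channel endpoints.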
A long meeting this morning about improving the reporting that the FTS is able to give. The addition of summary tables from triggers for instance. But also looking at historical information for the states of the system and individual services. The followup for me is to check what information is going to be recorded and then consider what is useful or missing, for admins in particular.
Thursday, 19 April 2007
Wednesday, 18 April 2007
I'm happier about how the FTS prod and pilot services are now represented in CDB. The pilot is no longer a modified prod service and is defined in its own right. This should remove some problems which are not really there for any general FTS installation.
Tuesday, 17 April 2007
Tried and failed a lot today to install a "package" on the FTS oracle servers to do some statistics collection. It is now done but due to the intervention of others. I need to learn some Oracle and fast if I don't want to feel helpless a few times a week as is the case today.
Big discussion today about some gLite middleware that has started, albeit two years ago, opening up services in user space on the WN so refreshed proxies can be injected from the CE. I'm convinced myself this is a bad idea and it is something we have to stop. Lots of discussion and some reluctance to actually do it but I think soon it will be okay.
Monday, 16 April 2007
Looks like I will now be attending the EMT meeting twice a week. This is good since I will be able to get a good handle on what is coming up in the release. Of course it is two extra meetings a week but it should be useful information that I listen to.
On the FTS I've set up the reporting package, I have to wait really till 08:00 tomorrow to check if it works which is a little boring of course. If it works then I'll look into running three instances of it for the different FTSes but it looks like it is easy to do but only time will tell of course.
Friday, 13 April 2007
Not the most productive of tasks but it is Friday afternoon after all. Have moved all my bookmarks to del.icio.us. Hopefully it pays off, seems like a good idea though. Big advantage is that bookmarks have tags rather than being in a directory structure.
Thursday, 12 April 2007
Spent an age trying to mark pages as secret so only a small group can see the pages before we publish them. It is easy to do but an odd thing which makes sense is you then can't search the pages with a formatted search since the search is anonymous. Of course it makes sense but....
Wednesday, 11 April 2007
Wrote a new page FtsVomsRoles detailing how VOMS roles might be used with the FTS. The FTS already supports this so we just need a schema for actually using them. Needs a bit of central management but still a lot better than what we had before.
Wrote a small mickey mouse script to parse the FTS call.logs. Interestingly it showed that in the last day some users submitted up to 3000 times as many status requests as transfers, which seems a little excessive. To be followed up.
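The counting itself is about as simple as it sounds. A minimal sketch of the idea, not the script I ran: the real call.log layout is not reproduced here, so this assumes a made-up format of three tab separated fields per line (timestamp, user, operation) and made-up operation names "submit" and "status".

```python
from collections import Counter


def status_to_submit_ratio(lines, submit_op="submit", status_op="status"):
    """Return {user: status requests per submit} from log lines.

    Assumed (hypothetical) line format: "timestamp<TAB>user<TAB>operation".
    Users with status requests but no submits get a ratio of infinity.
    """
    submits = Counter()
    statuses = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip lines that do not match the assumed layout
        _timestamp, user, operation = fields
        if operation == submit_op:
            submits[user] += 1
        elif operation == status_op:
            statuses[user] += 1
    ratios = {}
    for user, n_status in statuses.items():
        n_submit = submits.get(user, 0)
        ratios[user] = n_status / n_submit if n_submit else float("inf")
    return ratios
```

Feeding it an open log file works since a file object iterates over lines, and sorting the result by ratio puts the heaviest pollers at the top.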
Tuesday, 10 April 2007
Report of high load of the FTS webservers. It is clear they are busy but there is no script or log file written to do a quick analysis of what the queries are. So it is time to start one. The log format completely changes for release 2.0 so no point spending any time on this.
Started the empty documentation pages for what will be the FTS 2.0 release. There is a lot of information in the old release pages. The challenge is making use of it while not including things that are plain wrong.
Had a very first go connecting to Oracle with the cx_Oracle python module. It is easy of course like everything once you find the correct level of web page to help you do it. It is needed for some of the FTS monitoring that is being done.
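For reference, the shape of it is roughly the following sketch. The user, password and DSN are placeholders you would supply yourself, and the helper names connect_oracle and fetch_rows are made up. Since cx_Oracle follows the Python DB-API, fetch_rows works with any DB-API connection, which also makes it easy to try out without an Oracle server to hand.

```python
def connect_oracle(user, password, dsn):
    """Open an Oracle connection; cx_Oracle is imported only when called.

    cx_Oracle is a third-party module and needs the Oracle client
    libraries installed. All three arguments are placeholders.
    """
    import cx_Oracle
    return cx_Oracle.connect(user=user, password=password, dsn=dsn)


def fetch_rows(connection, sql):
    """Run a query on a DB-API connection and return all rows."""
    cursor = connection.cursor()
    try:
        cursor.execute(sql)
        return cursor.fetchall()
    finally:
        cursor.close()
```

With a real connection something like fetch_rows(connect_oracle(user, password, dsn), "select sysdate from dual") would be the first thing to try.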
I was GMOD last week over the Easter period, fairly quiet other than a couple of announcements. The gmod phone did ring but I missed the call. Checking the CERN status within SAM it was obvious that something was up since all the queues were in state queuing. I raised a ticket with ce.support and although it was not them who called there was an obvious problem. A cron had been left in place to drain the CERN farm, which had been wanted the previous week.
Thursday, 5 April 2007
Have been working hard on FtsWlcg, the homepage of the FTS for the WLCG service. A lot of the pages had fallen way out of date. Learnt a lot of twiki in the process including how to do templates, search for particular pages related to a page and many other things. It is looking a lot better but there are still things that need to be done for sure.
As per usual having been away a week for a wedding party, a very busy day of catchup. Highlights included:
- Attended 9.00 sysadmin meeting.
- Attended 10.00 WLCG meeting.
- Attended 1.30 meeting with Remi to go over CDB, quattor and things.
- Attended 2.30 meeting with LHCb people looking into using MAUI with Dirac.
- Attended 3.00 meeting of sysadmins.
- Attended 4.30 meeting of EMT.
Tuesday, 16 January 2007
Sven gave a talk on use of XEN at DESY. Lots of good things, they consider the problems are that:
- XEN is not integrated into the kernel.
- MAC address management has to be done.
- Automatic Installation.
- Full virtualisation IO is very slow.