Tuesday, 12 December 2006
After getting married and then being ill for a couple of days, I'm back at work. I have 400 emails to work through, which is a lot less than I used to have to deal with. This week I am ROC on Duty for CERN, and I also want to get my updated ncm-yaim package deployed. If I get both done I'll be happy.
Thursday, 30 November 2006
Have volunteered to be a CE expert at CERN. I don't really know anything about the gLite CE, but this is a very good way to learn, though not as good as actually running one. Essentially, tickets end up with me when they are not easy for the site admins to solve. A lot of ticket passing goes on here.
Attended the IT-GD-OPS section meeting and reported what I have been up to in the last week:
- More clean-up of the FTS profiles. A bit closer to actually being able to deploy the new hardware; I just need to understand how the load balancing works.
- Starting to impart the knowledge to Yvan about quattor.
- Acting as CERN ROC. Nothing interesting has happened, though.
- Started to contact people for the WLCG ops meeting. Only one firm offer of a talk at the moment.
- Presentation about the utilities that ship a job's STDOUT and STDERR to an SE for inspection while the job runs. The utility wakes up every now and then and GridFTPs the output off, appending to an existing file or creating a new one depending on the file size. It all makes sense to me, though there were lots of questions about it; I guess lots of bright people create lots of good ideas. Though if the interactive-jobs work succeeds, is this even needed?
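As I understood the presentation, the core of the utility is a periodic append-or-create decision. A minimal sketch in Python of that decision alone, where the threshold, names and return shape are all my assumptions, not the real implementation:

```python
# Sketch of the append-or-create logic as I understood it; the threshold
# value and the function name are my assumptions, not the real utility's.
MAX_CHUNK_BYTES = 1024 * 1024  # hypothetical roll-over threshold

def next_chunk(local_size, chunk_index, max_chunk=MAX_CHUNK_BYTES):
    """Return (chunk_index, append) for the next shipment of a job's
    STDOUT/STDERR file of local_size bytes.

    While the file stays under max_chunk, we append to the current
    remote chunk; once it grows past the threshold, we create a fresh
    chunk so no single file on the SE gets too big."""
    if local_size < max_chunk:
        return chunk_index, True   # append to the existing remote file
    return chunk_index + 1, False  # create a new remote file

# The shipping step itself would then GridFTP the data to the SE under
# the chosen chunk name; omitted, since I don't know the exact invocation.
```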
Tuesday, 28 November 2006
Really getting somewhere now with the FTS CDB profiles. The three clusters are now nicely abstracted from the template layout, and there are no longer bits of similar configuration dotted about the different files. However, there is still one piece of configuration dotted about: the load-balancing magic. This is quite CERN-specific, so finding out how it works is going to take a bit of hunting, I suspect. The real problem for me, though, is that the load-balancing configuration is currently active and live on the real production system; I could really mess things up if I'm not careful.
Asked a couple of people if they would give talks, with a positive response from one. The other one is my favorite, though. Also asked at the WLCG operations meeting for ideas and offers of talks; of course everyone was quiet, as is always the case at the OPS meeting. Must now get in touch with Ale' to see if she can help out as well.
Monday, 27 November 2006
Very little happening for the CERN ROC this week, though it looks like we will be taking on three or four US sites. I've dealt with IOWA before and know they have quite a different setup; for instance, GPBOX is deployed and used there. There was some real confusion earlier in the year with them advertising support for the MINOS VO but not accepting jobs from it.
This week I have two roles, which rotate around the group. The GMOD coordinates between the grid services for the group and dispatches tickets off to the rest of the group. The ROC on duty just deals with GGUS tickets assigned to the CERN ROC. Apart from that, my other plans for the week are to look at the tutorial session for the LCG meeting and get some concrete commitments from people to give talks.
Friday, 24 November 2006
Tried some more complicated things in PAN today while rationalizing the templates for the FTS service. In particular, conditionals and nested nlists are quite cool. What pleased me was that, having come up with a sensible plan for handling multiple sub-clusters and node types, I looked for examples to build on and found that the CASTOR team is doing exactly the same thing. The CASTOR templates look to be in a much better state than some of the others.
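The kind of structure I mean, as a hypothetical PAN sketch (template names, paths and values all invented; only the shape of the nested nlists and the conditional matters):

```pan
# Hypothetical template: one nested nlist per sub-cluster, keyed by role.
template pro_fts_subclusters;

variable FTS_SUBCLUSTERS = nlist(
    "ftsweb",   nlist("count", 3, "profile", "pro_type_ftsweb"),
    "ftsagent", nlist("count", 2, "profile", "pro_type_ftsagent"),
);

# Conditional defaulting: only set agent channels on agent nodes.
variable FTS_CHANNELS ?= if (NODE_TYPE == "ftsagent") {
    list("CERN-RAL", "CERN-FZK");
} else {
    list();
};
```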
I have been given the task of chairing and planning the tutorial portion of the LCG workshop in January at CERN. In fact this looks to be a good thing to do: there are several good ideas and plans that have been given to me or that I can think of. There are even some volunteers to give presentations, which is just great.
Thursday, 23 November 2006
The attempt to install a quattorized version of the glite-LB looks to have gone well. Certainly all the packages are now installed, and just the configuration needs to be looked at. The box needs some kind of RAID as well. RAID is something I seriously don't know how to do in quattor, but as with everything else, just find an example and steal it.
Attended the section meeting to report what I've been up to the last week:
- Maui update following the bug that needed fixing.
- FTS servers: pilot service reinstalled from scratch and at least running.
- Installing glite-LB with Yvan. Requires some new templates.
- Both the last points require an update to ncm-yaim which I'm trying to get AFS permission for.
- SDU-LCG2 asking to be certified.
- R-GMA broken on the PPS which is broken anyway due to IP renumbering.
- Description of Christmas-time work, which I can't do this year; it's basically time in lieu anyway.
- Interesting points from Alistair about users' expectations of quality. Do we need a new set of users who are more tolerant of grid than of batch?
- A couple of fixes have been put into SAM to try and clean up results.
- Five countries in the Mediterranean grid just tested 4.6 million drugs against malaria!
- Reinstallation from scratch: is it still needed? Yes, it should be done on the PPS.
Wednesday, 22 November 2006
So it turns out that I can't install the glite-LB as requested, since the ncm-yaim component does not support the new glite-LB method in YAIM. I need to understand why ncm-yaim is so explicit about all these things in the code. I think in quattor there is some way that a list of approved values can be stored in the templates. But for now I'm blocked on this until I get write access to the ncm-yaim area.
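Roughly what I have in mind: instead of hard-coding the accepted values in the component's code, keep them in a template where adding a new one is a one-line change. A hypothetical sketch (template and variable names invented):

```pan
# Hypothetical: the node types ncm-yaim would accept, kept in a template
# rather than in the component code, so new types need no code release.
template pro_yaim_valid_types;

variable YAIM_VALID_NODE_TYPES = list(
    "CE", "SE", "RB", "BDII",
    "glite-LB",   # the new type currently blocked on a code change
);
```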
Was asked to help deploy a gLite WMS without the WMS, basically a standalone logging and bookkeeping service. In principle easy, but removing packages is always harder than adding them. After discussion, I think what is needed is a new sub-cluster, lb, under glitewms. This should work well, but really the generic glitewms needs to become a sub-cluster as well. More generally, it seems that CDB profiles and clusters are well defined, but the use of sub-clusters seems quite ill defined to me at the moment; FTS and the RBs use completely different layouts, even internally. Something that needs rationalizing, I think.
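The layout I'm imagining, sketched as hypothetical CDB template includes (all names invented; only the cluster/sub-cluster nesting is the point):

```pan
# A node profile for the proposed "lb" sub-cluster under glitewms:
# common cluster settings first, then the L&B-only overrides that drop
# the WMS packages.
template profile_lxlb001;

include pro_cluster_glitewms;   # shared glitewms cluster configuration
include pro_subcluster_lb;      # L&B-only: removes the WMS package set
include pro_site_cern;
```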
Tried updating the ncm-yaim component after helpful comments from Thorsten about the release process for actually getting something installed everywhere. Unfortunately I don't have access to the CVS area at the moment. No doubt another day's delay; time to try something else.
Despite my hopes that it would work, the install of the FTS pilot cluster web services failed overnight: some timeout somewhere, yet to be determined. A second attempt this morning failed at an earlier stage, trying to install some random GNOME package. It seems there is a lot that can go wrong with any install, since there are so many components and so many people changing things in them. But the errors are good; they give me a reason to learn various things. If it all just worked, I would never get around to it; that's how to think of it.
Tuesday, 21 November 2006
It turns out that many of the non-service-specific values in a YAIM site-info.def file are hard-coded into the ncm-yaim component. There are, I guess, some good reasons for doing this, but it makes adding a new one quite a painful process. Anyway, a good opportunity to try to update and release a YAIM component at CERN. Luckily, as always, there are ways around it and I can continue with the test installs. All five nodes are installing now; hoping for a working system this time.
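For reference, the sort of non-service-specific site-info.def values I mean; site-info.def is just sourced shell, and these names and values are illustrative, written from memory:

```bash
# Site-wide site-info.def values (illustrative, from memory) -- the kind
# of thing ncm-yaim currently hard-codes knowledge of.
SITE_NAME=CERN-PROD
MY_DOMAIN=cern.ch
CE_HOST=ce101.$MY_DOMAIN
BDII_HOST=bdii101.$MY_DOMAIN
```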
A new cluster for the production FTS service is beginning to become available to me. So far one box is ready for the new Quattor->NCM->YAIM installation, with a bit of SINDES thrown in for good luck. Unfortunately some ORACLE_LOCATION values have proved very elusive to track down. Should be well on the way with the new hardware tomorrow.
Today was the first day installing a cluster of five nodes, with the aim of ending up with a working FTS service across them. I don't for a second expect it to work first time, but anyway. Concerning how the SSH keys make their way to the SINDES server, I found an answer: they are not uploaded to SINDES. This is something useful I could add to PrepareInstall; no doubt if it were there, people would use it. There was meant to be a new cluster of nodes arriving yesterday for me to install, but no sign of it yet.
AIMS is a somewhat magical system at CERN for maintaining kickstart, DHCP and PXE options via a command interface, much like RAL's swing device. AIMS is of course more complicated and a lot more magical in how it works. It seems to rely a lot on having AFS access to various directories, and it is taking time to find out, the hard way, what I need access to.
Monday, 20 November 2006
Spent quite a bit of time guessing how ssh keys are populated across centrally managed hosts at CERN. I had become fed up with keys changing after the frequent reinstalls I'm currently doing. SINDES is the mechanism. It works well though I still have questions about how the files are uploaded in the first place.
I've decided it is high time I kept a decent log of what I am up to at work. I used to do this before, and it was clearly useful to me and other people; when it went away, lots of people asked where it had gone. This will only cover the work I'm doing while at CERN. It is not meant to be interesting to anyone really and is pretty much just a record of ongoing work.
Friday, 17 November 2006
Following the release of the new patch level p17 of maui, I built packages for both SL3 and SL4. Unfortunately I did not have machines on which torque could be installed, so Yvan and Di came to the rescue with some. I now have my own SL3 and SL4 machines to do with as I wish, which should prove very useful. Passed the packages on to the deployment team to go out some time. It will be great to bring torque and maui up to date on the production LCG farm; it will make life a lot easier.