Bugs that we discovered in our ROCKS setup and what we have to
do to fix them:
***A: Problems with cloning slave servers:
Symptom: It takes a long time!
Cause: We are probably tarring up more than we need to.
Fix: We can investigate whether there is a faster way to transport the
data. It may be possible to tar or cpio across network sockets and
leave NFS out of it. Also, I have an idea of how to
do plain http installs of these machines from the main ROCKS server,
fnpt122, so we don't have the problem of creating the first one in a
subnet. (Ideally we would like to install a real head node via network
install from the first head node, make a ROCKS server, and not
bother with tarring at all, but ROCKS 3.0.0 does not support this; it
is promised for future versions.)
Further investigation showed that about half of the
4 GB rocksroot.tar file comes from OpenPBS logs in
/opt/OpenPBS/server_logs. I have deleted them, but they will
come back unless we learn how to make them stop. A query was
sent to the ROCKS discussion list. The logs were finally stopped
by moving the startup script out of /etc/rc.d/init.d entirely.
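Sketch of the "tar across the network, leave NFS out of it" idea. This
is only an illustration, not what clone_server.sh does: it uses ssh as
the transport (an assumption, the notes above only say "network
sockets"), and the host name and exclude pattern in the usage line are
made up. The function just builds the pipeline string so it can be
inspected before anyone runs it.

```python
import shlex


def build_tar_pipe(src_dir, dest_host, dest_dir, excludes=()):
    """Return a shell pipeline that streams src_dir to dest_host over ssh.

    Nothing is executed here; the caller decides whether to run the string.
    """
    # Pack relative to src_dir and write the archive to stdout.
    pack = ["tar", "-C", src_dir, "-cf", "-"]
    # Exclude patterns go before the file argument.
    for pattern in excludes:
        pack.append("--exclude=" + pattern)
    pack.append(".")
    # Unpack on the far side, preserving permissions.
    unpack = "tar -C {} -xpf -".format(shlex.quote(dest_dir))
    send = ["ssh", dest_host, unpack]
    return "{} | {}".format(
        " ".join(shlex.quote(arg) for arg in pack),
        " ".join(shlex.quote(arg) for arg in send),
    )
```

Example: build_tar_pipe("/mnt/rocksroot", "slave1", "/mnt/rocksroot",
excludes=["opt/OpenPBS/server_logs/*"]) would also skip the OpenPBS logs
that were bloating the tar file.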
Symptom: /export/home partition was only 5 GB, should be 10 GB.
Cause: Bug in db-partition-rocksslave.xml script.
Fix: Increased the size of this partition in db-partition-rocksslave.xml,
but need to test it. Also, whatever fixes are made to
db-partition.xml as a result of other bugs have to be made here too.
It is probably a good idea if we keep dfarm and user data files
off of the machines that are going to be slave servers.
Fixed.
Symptom: Freshly installed farms-rocksslave servers had the partitions
in a different place than we expected, and the clone_server.sh script
was hardwired for /dev/hda5 and /dev/hdb3.
Fix: We need to change the clone_server.sh script so it relies on
the labels the partitions are given when the node is installed, rather
than on any one hardwired partition. Important! Done.
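Sketch of looking up a partition by label instead of hardwiring the
device. This is not the actual change made to clone_server.sh. It
assumes the text being parsed looks like blkid output
(device: LABEL="..." ...), and the labels and devices in the usage
note are invented for illustration.

```python
import re


def device_for_label(blkid_output, label):
    """Return the device whose LABEL equals `label`, or None if absent."""
    for line in blkid_output.splitlines():
        # Each line looks like: /dev/hda5: LABEL="/" TYPE="ext3"
        device, sep, attrs = line.partition(":")
        if not sep:
            continue
        match = re.search(r'\bLABEL="([^"]*)"', attrs)
        if match and match.group(1) == label:
            return device.strip()
    return None
```

With a listing that contains the line /dev/hdb3: LABEL="/mnt/rocksroot",
device_for_label(listing, "/mnt/rocksroot") gives back /dev/hdb3 no
matter which disk the installer happened to put it on.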
Symptom: Slave rocks servers wouldn't boot; they came up in single user mode.
Cause: clone_server.sh was adding extra entries to /etc/fstab which
weren't needed.
Fix: Need to change clone_server.sh accordingly. Done, but the first
attempt didn't work: one "fi" was in the wrong place. Fixed (need to test again).
***Symptom: Slave rocks server, when booted off the /mnt/rocksroot partition,
came up with no network.
Cause: The /etc/modules.conf file on fnpt122 specifies e1000 as
the driver for eth0, but eth0 on the D0 nodes is an e100.
Fix: We fixed it by hand to get it going; should find some way
to automate it.
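One possible way to automate the modules.conf fix, as a sketch only.
It assumes the relevant line has the form "alias eth0 <driver>" and
that the right driver for the target hardware is already known; how
to detect the driver is not covered here.

```python
def set_eth0_driver(modules_conf_text, driver):
    """Return modules.conf text with the eth0 alias pointing at `driver`."""
    out = []
    replaced = False
    for line in modules_conf_text.splitlines():
        fields = line.split()
        if fields[:2] == ["alias", "eth0"]:
            # Rewrite the existing alias, whatever driver it named before.
            out.append("alias eth0 " + driver)
            replaced = True
        else:
            out.append(line)
    if not replaced:
        # No eth0 alias was present, so add one at the end.
        out.append("alias eth0 " + driver)
    return "\n".join(out) + "\n"
```

The clone script could run something like this against the copied
/etc/modules.conf before the slave server's first boot.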
***Symptom: Had to modify app_globals by hand for each of the slave servers
Cause: The script that was going to do that hasn't been written yet.
Fix: A table "clusnet" has been created in the database which
has all the cluster-wide parameters for each cluster. Need to
write python or perl script to read it and then modify
app_globals accordingly.
***Symptom: Had to modify nodes database to only include the nodes
on each slave server
Cause: The script that is supposed to do this hasn't been written yet.
Fix: It turns out that we may not have to do this after all. GP farms
ran OK with two rocks servers that both knew about everything.
But a field in the "nodesconf" table is there to specify which node
belongs to which slave server, and we can implement it with
a script if we want. This is a tricky piece of SQL to write,
which I haven't figured out yet.
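If we do want the pruning script, one way to avoid the tricky SQL is
to do the set difference in Python and issue only simple statements.
Everything about the schema below is hypothetical: the column names
(name, node, slave_server) are guesses, the demo data is invented, and
sqlite3 stands in for the real mysql database just so the sketch runs
on its own.

```python
import sqlite3


def make_demo_db():
    """Build a throwaway in-memory database with a made-up schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE nodes (name TEXT PRIMARY KEY)")
    conn.execute("CREATE TABLE nodesconf (node TEXT, slave_server TEXT)")
    conn.executemany(
        "INSERT INTO nodes VALUES (?)",
        [("nodeA1",), ("nodeA2",), ("nodeB1",)],
    )
    conn.executemany(
        "INSERT INTO nodesconf VALUES (?, ?)",
        [("nodeA1", "serverA"), ("nodeA2", "serverA"), ("nodeB1", "serverB")],
    )
    return conn


def prune_nodes(conn, server):
    """Delete nodes not assigned to `server`; return the deleted names."""
    # Step 1: which nodes does nodesconf assign to this slave server?
    keep = {row[0] for row in conn.execute(
        "SELECT node FROM nodesconf WHERE slave_server = ?", (server,))}
    # Step 2: everything else in the nodes table goes.
    everyone = [row[0] for row in conn.execute(
        "SELECT name FROM nodes ORDER BY name")]
    doomed = [name for name in everyone if name not in keep]
    conn.executemany("DELETE FROM nodes WHERE name = ?",
                     [(name,) for name in doomed])
    conn.commit()
    return doomed
```

Two plain SELECTs and a row-by-row DELETE need no subquery support from
the database server.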
***Symptom: There's a pvfs error message when the ROCKS server boots up
Cause: pvfs isn't configured correctly on fnpt122 or any of its clones.
We think this is harmless since we are not using pvfs (parallel
virtual file system) anyway.
Fix: Should look at fnpt122 to see if we can disable pvfs there.
***Symptom: We had to go slow due to high loads on fnpcd/fnpcb
Cause: The code to make the slave rocks servers be slave voldemort
servers as well wasn't ready to go.
Fix: We need to set up voldemort-push rpms on these slave servers,
and make a python script to read the database and do the
pullrsync from the appropriate location.
***Symptom: fnpcd actually crashed at one point in the morning.
Cause: Unknown
Fix: If fnpcd had been down, fnpcb would have automatically taken
over the load. If fnpcb were down, though, we would have been
in trouble. Should change the configuration to have at least two
independent copies of the files; they could go to fnpca if ganglia is revised.
***Symptom: multiple copies of the mysql database can be painful to update if
you find a last-minute problem.
Fix: We need to find a way to just copy the relevant tables of the
mysql db from node to node without recloning the whole rocks
server. Margaret has copied a couple of tables this way
and it works but we need to automate it. Ditto for syncing
up last minute changes in the site-nodes area.
(also on main rocks server check to see if it can run)
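Sketch of automating the table-by-table copy. This is not necessarily
how Margaret copied the tables. It assumes mysqldump and the mysql
client are installed on both ends, that ssh between the servers works
without a password, and that credentials are handled elsewhere. The
database name in the usage note is an assumption.

```python
import shlex


def build_table_copy(database, tables, dest_host):
    """Return a pipeline that dumps selected tables and loads them remotely."""
    if not tables:
        raise ValueError("name at least one table")
    # --add-drop-table makes the load replace the old copy of each table.
    dump = ["mysqldump", "--add-drop-table", database] + list(tables)
    load = "mysql " + shlex.quote(database)
    return "{} | ssh {} {}".format(
        " ".join(shlex.quote(arg) for arg in dump),
        shlex.quote(dest_host),
        shlex.quote(load),
    )
```

Example: build_table_copy("cluster", ["clusnet", "nodesconf"], "fncdf77")
gives a one-line command that refreshes just those two tables on fncdf77
without recloning the whole rocks server.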
Symptom: There are two entries for the rocks kernel in grub.conf.
Also, the script is clobbering the symlink.
Cause: Don't know what is causing the two entries in grub.conf.
The symlink was getting clobbered because we were not specifying
the outfile in the grubby command inside clone_server.sh.
Fix: The grubby command is fixed. Still not clear why there
are two rocks kernel entries or what to do about them.
Symptom: Rocks install sometimes dies with "kickstart file not found".
Cause: An rsync_push sent different automount maps to the rocks
server, so it no longer had the path to /export/home/install
in its automount maps and couldn't do any kickstart work.
Fix: We have to add /export/home/install to our normal auto.home
map on all three farms, also make sure any other voldemort
files don't mess up anything else with the rocks server.
auto.home has been fixed everywhere.
Eventually our goal is that the rocks server can run jobs
even when it is booted in rocks server mode. We should check
the interactions of the other voldemort files with stuff
on the rocks server to make sure nothing else is changed
inadvertently.
Symptom: We had several different cases in the CDF farms
where Joe modified the database on fncdf77 and it
appeared to modify the database on fnpt122 at the same time.
This shouldn't happen and we need to understand it.
Cause: Unknown. We looked for this and couldn't see it happen.
Symptom: Dfarm didn't work right on fnpc124 when we tried to
run it as a worker node booted off of the rocksroot partition.
Cause: /etc/hosts didn't have right entry for fnpc124 and this
confused dfarm--we edited /etc/hosts so that dfarm could
run but this is not a stable situation because it will get
wiped out with any insert-ethers --update.
Also discovered that mysqld wasn't running because user and
group 27 were absent. Changed voldemort file to add these.
vcsa, apache, rpm and nfsnobody were missing as well (also pcap,
named, gdm, maui, radvd).
Fix: The hard part is getting /etc/hosts to stay good, have to think
about this one. After advice from Federico Sacerdoti of the
Rocks team (who also wrote a patch for Rocks 3.2), I
modified, in Rocks 3.0, the same file he modified:
/opt/rocks/lib/python/rocks/reports/hosts.py. Here is the diff:
[root@fnpt122 reports]# diff hosts.py hosts.py.sav
210c210
< print '127.0.0.1\tlocalhost.localdomain\tlocalhost'
---
> print '127.0.0.1\t%s.%s\tlocalhost' % (hostname, domain)
Now the hosts file comes out right every time after
insert-ethers --update.
All users that exist on the head node have been added to the
/etc/passwd and /etc/group files in the voldemort files.
B: Configuration problems on worker nodes
Symptom: Filesystem / on GP farm worker nodes is formatted as ext2, not ext3.
Cause: This was unavoidable. We had to use the ROCKS default partitioning
to save the data on the data disks, and the default partitioning
does ext2. The same problem will happen on CDF farms.
Fix: Have to create a journal file using tune2fs and edit fstab.
Can do this in the %post section of the kickstart if necessary.
Have added this code to the %post section in db-partition.xml,
only when savedata is Y. Also changed db-partition.xml to
be for all cases where savedata="Y".
Need to test. Also still have to address the question of saving
the data when the system disk has been blown away.
Eventually we want to make the kickstart auto-detect
the case of a blown away system disk and still save
the data. We have created case "U" in the database to
alert the system to this in the meantime. In architectures
like abd and acd, where there is only one data partition
on each drive, we can save the data now. The others we
still have to work on.
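Sketch of the fstab half of the ext2-to-ext3 fix. The journal itself
would be added separately with tune2fs -j on the root device. This is
not the code that went into db-partition.xml, and it assumes fstab
entries are whitespace-separated with the mount point in the second
field and the filesystem type in the third.

```python
def fstab_root_to_ext3(fstab_text):
    """Return fstab text with the / entry switched from ext2 to ext3."""
    out = []
    for line in fstab_text.splitlines():
        fields = line.split()
        is_entry = len(fields) >= 3 and not fields[0].startswith("#")
        if is_entry and fields[1] == "/" and fields[2] == "ext2":
            # Only the root entry is touched; other lines pass through as-is.
            fields[2] = "ext3"
            line = "\t".join(fields)
        out.append(line)
    return "\n".join(out) + "\n"
```

Only the / line is rewritten (and re-joined with tabs); data partitions
and comments are left exactly as they were.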
Symptom: /var/adm/krb5 is empty on most fnd0 nodes.
Cause: The shell script in
/export/home/install/profiles/3.0.0/site-nodes/farms-keytab.xml
had a bug in it. It finds the node prefix with
HOSTNAME=
PREFIX=${HOSTNAME/[0-9]*/}
The second line strips everything from the first digit onward, so it
also removes the 0 in fnd0 node names and leaves just "fnd".
Fix: We need to add a simple if statement to this shell script
to make the prefix "fnd0" if it is "fnd". Fixed, OK. The first
fix had an extra space; the second fix is OK.
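The prefix logic, including the fnd0 special case, written out in
Python purely as an illustration. The real fix lives in the shell
script inside farms-keytab.xml; the function name here is made up.

```python
import re


def node_prefix(hostname):
    """Return the farm prefix of a node name, e.g. fnpc124 -> fnpc."""
    # Drop any domain part, then cut from the first digit onward,
    # which is what the shell substitution does.
    short = hostname.split(".")[0]
    prefix = re.sub(r"[0-9].*$", "", short)
    # D0 node names carry a digit in the prefix itself (fnd0NNN).
    if prefix == "fnd":
        prefix = "fnd0"
    return prefix
```

So fnd0180 comes out as fnd0, while fnpc124 and fncdf77 come out as
fnpc and fncdf as before.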
Symptom: /var/adm/krb5 was missing some of the newer keytabs on the
GP Farms
Cause: It turns out that we were pulling the keytab.tar from
an out-of-date location. I had not updated the cron script
when I moved the location of the keytabs on fnpcd.
Fix: This has been fixed by fixing the cron scripts on d0bbin, cdffarm1 and
fnsfo to send the .tar file to /var/adm/sshkey/keytab/[fnd0,cdf,fnpc]
respectively.
Symptom: In Margaret's tests on fnpc123-154, lm_sensors did not work,
it hung every time.
Cause: This was because I forgot to have the old
/etc/rc.d/init.d/lm_sensors moved out of the voldemort files,
and also because the voldemort script modulesconf.sh
had the wrong permission and wasn't getting executed,
leaving the line "alias char-major-89 i2c-dev" out of /etc/modules.conf
Fix: This was fixed last week by putting the right scripts in place.
Symptom: Due to pullrsync/voldemort some of the old rpms are getting
yummed back onto the machine from the 731 area
Cause: 731 area was hardwired into voldemort-0.6-3
Fix: voldemort-0.6.5-1 is released and on all worker nodes now.
***Symptom: Some files in pullrsync/voldemort aren't necessary
anymore because they come from RPMS now.
Symptom: sensors_wrap on fnpc51-90, fncdf91-154 didn't work properly
Cause: the output format of sensors changed with lm_sensors version 2.8.3
Fix: Modified the sensors_wrap.sh file, need to distribute the
rpm and rocks-dist. Rpm now distributed (and the workgroup
yum script fixed by Troy).
***Symptom: A number of nodes had no /etc/krb5.keytab after the reinstall
Cause: Unknown--but two major errors can be seen in
/tmp/hostkrb.err where the errors are captured:
a) "Preauthentication failed while initializing kadmin interface"
--most likely these nodes didn't have the password set
correctly to litesg00ut in the first place, but it could be a
clock issue too.
b) "communication failure with server while changing
host/fnd0180.fnal.gov@FNAL.GOV's key"
These happen from time to time, sometimes they work anyway
and sometimes they don't.
Fix: We have to make both the reinstall/unkeytab sequence and
the farms-private.xml/farms-keytab.xml scripts more bulletproof,
check the return codes, and retry if necessary. If the
unkeytab process fails, it should give an error and not
reboot the node to start the reinstall. Also it seems that
more education is in order so everyone understands what this
set of scripts is really trying to do.
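Sketch of the "check the return code and retry" idea. The function and
parameter names are made up, and the real unkeytab and keytab scripts
are shell, so this only shows the shape of the logic. The runner and
sleep arguments exist so the loop can be exercised without running
anything.

```python
import subprocess
import time


def run_with_retry(cmd, attempts=3, delay=5.0,
                   runner=subprocess.call, sleep=time.sleep):
    """Run cmd until it exits 0 or the attempts run out.

    Returns True on success, False if every attempt failed.
    """
    for attempt in range(1, attempts + 1):
        if runner(cmd) == 0:
            return True
        if attempt < attempts:
            # Give transient kadmin/network failures a chance to clear.
            sleep(delay)
    return False
```

The reinstall wrapper would then reboot the node only if the unkeytab
step returned True, and print an error and stop otherwise.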
Symptom: We kept getting python exceptions on the GP farms as
we were trying to partition the data disks, and save the data.
We destroyed data on data disks of about 6 nodes that way.
In a Fermi-to-ROCKS conversion, ROCKS wouldn't deal correctly
with nodes that had two swap partitions, or with nodes that
had more than one partition on any data disk, which all
of our nodes of 2003 and later do.
Cause: Not really known
Fix: We got a temporary fix from the ROCKS people. By touching
/.rocks-release and using autopartitioning, it kept the
existing structure of our disks and saved all the data partitions.
The autopartitioning leaves / as an ext2 file system, not
ext3; that is fixed, see above.
This autopartitioning doesn't cover the case
of a wiped out system disk where we still want
to save the data disk.
We have modified db-partition.xml to have a third value
of the SAVEDATA variable, "U" for unknown.
If the system disk is blown away, this is meant to still
save the data disks. The three cases of this we have
tried thus far (fnpt123, fnpt120, fnpc87) have worked.
More testing is necessary. This "U" option won't work
when converting nodes from Fermi Linux to Rocks; we know because
we were trying these exact partitioning schemes.
We do sometimes see problems when two swap partitions
are called for. We have temporarily modified the code
to just use one, and will test more later. There may be other
surprises here.
C: Problems in the reinstall scripts that were used to start the install
Symptom: unkeytab script couldn't run in an ssh command
Cause: /usr/krb5/sbin wasn't in the path.
Fix: Was fixed for D0 and GP by adding explicit path name
Symptom: reinstall_2518(370DLE) couldn't run
Cause: Two bugs, a quote was missing, and the path to lilo had to
be added.
Fix: Fixed in D0, GP, now fixed in CDF but need to test.
Symptom: Missing symlinks
Cause: We needed to touch /.rocks-release on all
nodes that we wanted to save the data partitions on for GP.
Also have to cd /boot/kickstart ; ln -s 7.3 default in all
Fix: need to add this to the reinstall scripts on CDF. Not done yet.
Symptom: Some nodes wouldn't PXE boot with /boot/kickstart/cluster-kickstart command
Cause: It appears that there were problems with hardware types
370DLE, 2518, L440GX that we could not force a PXE boot
through software, at least not before ROCKS was installed.
Fix: We worked around it on these architectures, for the conversion
from Fermi Linux to ROCKS, by adding a stanza to LILO
to boot the install image from the hard disk.
Now that ROCKS is properly installed, the equivalent
task should be accomplished with rocks-grub, i.e., if
the node doesn't PXE boot properly, it should boot off
of the hard disk and do an http reinstall. But we need
to test this.
Symptom: Four nodes so far would not PXE boot at all.
Cause: For the first 2, fnd0164 and fnd0368, we think it is a faulty
network interface that is not sending the PXE request properly
(the rocks server has no record of receiving
such a request).
Two others in the CDF farm have the same problem, both Tyan 2468;
the rest could have the same problem as well.
Fix: Can work around this by booting the node off a rocks CD,
but we will have to debug the network complex, particularly
in CDF. An identical node in the prototype farms does PXE fine,
as do all the hotdogs.
More details: the CD workaround works for Tyan 2468 but not
for koi nodes x5dpa-GG... the koi nodes work everywhere
else but in CDF.
Eventually datacomm changed the PAgP setting on the switch
and we were able to PXE boot all these nodes.
Symptom: CDF nodes with fiber gigabit won't install properly.
PXE boot off of on-board gig-e works but then kernel wants
to install through fiber card, which times out.
Fix: We enabled the PXE boot for fiber gig card, disabled
it for the other two, and it works all the way on the
fiber. Changes made by D. Tang to the switch since we
tried before may also be a factor in why it works now.
Symptom: first PXE DHCP request went through OK but DHCP request
at start of install did not.
Cause: This was a switch misconfiguration--we worked this out
because some nodes worked properly and others attached
to a different module did not.
Fix: Vladimir Bravov reconfigured the module in question on the
switch and everything is fine. We should make sure
that data comm knows where all our ROCKS servers ended up
on each farm.
Symptom: Tyan 2723 board: first PXE boot was OK but then it couldn't get
the kickstart file.
Cause: What we thought was eth0 and what rocks thought was eth0
were opposite.
Fix:
1) change the nodes table to show the other MAC address for these
6 nodes (need to propagate back to fnpt122).
2) change the boot order in the BIOS to make the gigabit
boot agent come before the other one, i.e., the 1.1.something in
slot 4 comes ahead of the other one.
3) change the plug at the back of the machine to use the
other interface.
Symptom: L440GX hardware installs OK but on reboot comes up
with a flashing cursor in the left-hand corner of the screen.
Cause: The GRUB that comes with ROCKS has a bug and doesn't work
properly with L440GX.
Fix: farms-l440gx-grub.xml was written to work around this
problem by installing the normal Red Hat grub again.
There was a slight bug in it: it did grub-install unconditionally,
whether or not the board was an L440GX. The python needed fixing.
A first attempt to fix that caused the above problem to reappear
on fnd061, so I rolled the fix back.
The code has now been fixed correctly and distributed
everywhere. The grub fix is now applied only to L440GX hardware.
Symptom: Tyan 2466 board hangs at "loading initrd.img........"
Cause: Tyan 2466 board is a piece of junk
Fix: reboot, hit f12 for pxe boot again, it will usually work the
second time.
Symptom: 370DLE hardware hangs at "grub loading stage2"
Cause: We tried to reinstall a node whose system disk had been
blown away but which had savedata="Y" set in the database.
We should have set savedata="U" before the reinstall happened.
As a result, auto-partitioning made /dev/hdc, which is not bootable
by default, the system disk, and the system will not boot.
Fix: Should make "U" be the default in the database, with "N" used
only when we actually want to change the partitioning scheme
on the node and "Y" used only when we are doing a full reinstall
of a good new node. To recover the node in question,
set savedata="N" and reinstall.
D: Organizational issues that we can do better next time
1) Get more machines on hold to test--at least 3 of each HW flavor.
(Preferably in future get at least one of each hwtype in prototype farm.)
2) Give users enough time to back up and save important files--for sure
on the test machines.
3) Get all testing done and verified a week in advance--no last minute
changes on the day of.
4) Make a checklist of what needs to be checked on newly installed nodes.
(Part of this being done by services project but that project can only
look for what it knows has to be there).
5) Don't schedule two big updates a day apart like we did this time,
ideally not in the same week.
6) Go through and see if there are things that
we don't need anymore.
7) Make sure the crucial machines, e.g. fnpcd, have suitable backups
in case of failure.
8) We need to have a distribution to go back to for installs...
what we have agreed on is that we will fix all known bugs in
the distro we have now, and then any future development, i.e.,
new installation features, new rpms, etc. will be done on
a development server and the production server won't be changed.
Also in future when there is a new release we will always keep a
rollback path to the previous release.
9) More people besides Steve need to become proficient in finding
and fixing problems.
E: Further development tasks that need to be done:
1) Support for some version of linux based on red hat enterprise
linux--rocks 3.2. Workgroups and comps files change
quite a bit in this release but in a way that should make it
all easier. For instance, the Fermi workgroups in the later releases
roll up all the special rpms into one big RPM that can just be installed.
2) testing of the db-workgroup.xml and fermi-generic.xml nodes
that are designed to install an arbitrary Fermi Linux workgroup.
3) merging of Farms and CMS ROCKS stuff, to be carried out by
the rocks merge project
4) document document document
5) CD distribution of this for the vendors, which would have to
include a Fermi roll. (can be done in 3.0 and 3.2)
6) automation of the process from rocks release to a Fermi installation.