Bugs that we discovered in our ROCKS setup and what we have to 
do to fix them:

***A:  Problems with cloning slave servers:
   Symptom:  It takes a long time!
   Cause:  We are probably tarring up more than we need to.
   Fix:  We can investigate whether there is a faster way to transport the
         data.  It may be possible to tar or cpio across network sockets and
         leave NFS out of it.  I also have an idea for doing http installs
         of these machines directly from the main ROCKS server, fnpt122, so
         that we don't have the problem of creating the first one in a
         subnet.  (Ideally we would install a real head node via network
         install from the first head node and make it a ROCKS server, with
         no tarring at all.  This is not supported in Rocks 3.0.0; it is
         promised for future versions.)
         Further investigation showed that about half of the 4 GB
         rocksroot.tar file comes from OpenPBS logs in
         /opt/OpenPBS/server_logs.  I deleted them, but they will come
         back unless we learn how to make them stop.  A query was sent to
         the ROCKS discussion list.  The logs were stopped by moving the
         startup script out of /etc/rc.d/init.d entirely.
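A minimal sketch of the "tar across network sockets" idea, written for a current Python 3 interpreter (an assumption; it is not what clone_server.sh does today).  The function names send_tree and receive_tree, and the choice of excluding opt/OpenPBS/server_logs, are illustrative only.  The sender streams a tar archive straight into a connected socket, and the receiver unpacks it as it arrives, so no intermediate rocksroot.tar and no NFS mount are involved.

```python
import os
import tarfile


def send_tree(root, sock, exclude=()):
    """Stream the contents of `root` as a tar archive over a connected socket.

    `exclude` holds paths relative to `root` (no leading slash) to leave out,
    for example "opt/OpenPBS/server_logs".
    """
    def keep(info):
        for prefix in exclude:
            if info.name == prefix or info.name.startswith(prefix + "/"):
                return None  # returning None drops this member and its children
        return info

    with sock.makefile("wb") as stream:
        # Mode "w|" writes a non-seekable stream, so no temporary file is needed.
        with tarfile.open(fileobj=stream, mode="w|") as tar:
            for entry in sorted(os.listdir(root)):
                tar.add(os.path.join(root, entry), arcname=entry, filter=keep)


def receive_tree(sock, dest):
    """Unpack a tar stream arriving on a connected socket into `dest`."""
    with sock.makefile("rb") as stream:
        with tarfile.open(fileobj=stream, mode="r|") as tar:
            tar.extractall(dest)
```

In use, the head node would call send_tree on an accepted connection and the new slave server would call receive_tree on its end.  Whether this is actually faster than tar over NFS on our network would still have to be measured.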
 
   Symptom:  /export/home partition was only 5 GB; it should be 10 GB.
   Cause:  Bug in the db-partition-rocksslave.xml script.
   Fix:  Increased the size of this partition in db-partition-rocksslave.xml,
         but this needs to be tested.  Any fixes made to db-partition.xml
         as a result of other bugs have to be made here too.
         It is probably a good idea to keep dfarm and user data files
         off the machines that are going to be slave servers.
         Fixed.

   Symptom:  Freshly installed farms-rocksslave servers had the partitions
         in a different place than we expected, and the clone_server.sh script
         was hardwired for /dev/hda5 and /dev/hdb3.
   Fix:  We need to change the clone_server.sh script so it relies on
         the labels the partitions are given when the node is installed,
         rather than on any one hardwired partition.  Important!  Done.
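As an illustration of looking a partition up by label instead of hardwiring /dev/hda5, here is a small Python 3 sketch.  It assumes the findfs utility is available on the node; the helper name device_for_label and the example label are hypothetical, and clone_server.sh itself is a shell script that may do this differently.

```python
import subprocess


def device_for_label(label, run=subprocess.run):
    """Return the block device that carries filesystem label `label`.

    `run` is injectable so the lookup can be exercised without real disks.
    """
    result = run(
        ["findfs", "LABEL=" + label],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    if result.returncode != 0:
        raise LookupError("no partition labelled %r" % label)
    return result.stdout.strip()
```

A cloning script would call, say, device_for_label("/mnt/rocksroot") (a made-up label) and mount whatever device comes back, so the same script works no matter where the installer placed the partition.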

   Symptom:  Slave rocks servers wouldn't boot; they came up in single-user mode.
   Cause:  clone_server.sh was adding extra entries to /etc/fstab that
           weren't needed.
   Fix:  Need to change clone_server.sh accordingly.  Done, but it didn't work:
         one "fi" was in the wrong place.  Fixed (need to test again).

***Symptom: Slave rocks server, when booted off the /mnt/rocksroot partition,
            came up with no network.
   Cause:  The /etc/modules.conf file on fnpt122 specifies e1000 as eth0,
           but eth0 on the D0 nodes is an e100.
   Fix:  We fixed it by hand to get it going; we should find some way
         to automate it.
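One way the hand edit could be automated is sketched below in Python 3.  The function name set_eth0_driver is made up for illustration, and how the right driver for a given node is determined (for example from the database) is left open.  The sketch only rewrites the "alias eth0" line of a modules.conf text, adding one if none exists.

```python
import re

# Matches the start of an "alias eth0 <driver>" line, keeping the prefix.
ETH0_ALIAS = re.compile(r"^([ \t]*alias[ \t]+eth0[ \t]+)\S+", re.MULTILINE)


def set_eth0_driver(modules_conf_text, driver):
    """Return modules.conf text with eth0 aliased to `driver`."""
    new_text, count = ETH0_ALIAS.subn(
        lambda match: match.group(1) + driver, modules_conf_text
    )
    if count == 0:
        # No eth0 alias present: append one on its own line.
        if new_text and not new_text.endswith("\n"):
            new_text += "\n"
        new_text += "alias eth0 %s\n" % driver
    return new_text
```

The clone script could apply this to the copy of /etc/modules.conf on the rocksroot partition before the slave server is booted for the first time.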
   
***Symptom:  Had to modify app_globals by hand for each of the slave servers
   Cause:  The script that was going to do that hasn't been written yet.
   Fix:  A table "clusnet" has been created in the database which
         has all the cluster-wide parameters for each cluster.  Need to
         write a Python or Perl script to read it and then modify
         app_globals accordingly.

***Symptom:  Had to modify nodes database to only include the nodes
            on each slave server
   Cause:  The script that is supposed to do this hasn't been written yet.
   Fix:  It turns out that we may not have to do this after all.  GP farms
         ran OK with two rocks servers that both knew about everything.
         But there is a field in the "nodesconf" table to specify which node
         belongs to which slave server, and we can implement it with
         a script if we want.  This is a tricky piece of SQL to write,
         and I haven't figured out how to do it yet.

***Symptom:  There's a pvfs error message when the ROCKS server boots up
   Cause:  pvfs isn't configured correctly on fnpt122 or any of its clones.
	   We think this is harmless since we are not using pvfs (parallel
	    virtual file system) anyway.
   Fix:    Should look at fnpt122 to see if we can disable pvfs there.

***Symptom:  We had to go slow due to high loads on fnpcd/fnpcb
   Cause:  The code to make the slave rocks servers be slave voldemort
           servers as well wasn't ready to go.
   Fix:  We need to set up voldemort-push rpms on these slave servers,
         and make a python script to read the database and do the 
         pullrsync from the appropriate location.

***Symptom:  fnpcd actually crashed at one point in the morning.
   Cause:  Unknown
   Fix:  If fnpcd had been down, fnpcb would have automatically taken
         over the load.  If fnpcb had been down, though, we would have been
         in trouble.  Should change the configuration to have at least two
         independent copies of the files; could go to fnpca if ganglia is revised.

***Symptom:  Multiple copies of the mysql database can be painful to update if
         you find a last-minute problem.
   Fix:  We need to find a way to copy just the relevant tables of the
         mysql db from node to node without recloning the whole rocks
         server.  Margaret has copied a couple of tables this way
         and it works, but we need to automate it.  Ditto for syncing
         up last-minute changes in the site-nodes area.
         (Also check on the main rocks server to see if it can run.)

    Symptom:  There are two entries for the rocks kernel in grub.conf;
          also, the script is clobbering the symlink.
    Cause:  Don't know what is causing the two entries in grub.conf.
            The symlink was getting clobbered because we were not specifying
            the outfile in the grubby command inside clone_server.sh.
    Fix:  The grubby command is fixed.  It is not clear what to do about
          the two rocks kernel entries.

    Symptom:  Rocks install sometimes dies with "kickstart file not found".
    Cause:  An rsync_push was sending out different automount
            maps to the rocks server, so it did not have the path
            to /export/home/install in its automount maps and
            could not do any kickstart work.
    Fix:  We have to add /export/home/install to our normal auto.home
          map on all three farms, and also make sure that no other voldemort
          files interfere with the rocks server.
          auto.home has been fixed everywhere.
          Eventually our goal is that the rocks server can run jobs
          even when it is booted in rocks server mode.  We should check
          the interactions of the other voldemort files with the
          rocks server to make sure nothing else is changed
          inadvertently.

    Symptom:  We had several different cases in the CDF farms
         where Joe modified the database on fncdf77 and it
         appeared to modify the database on fnpt122 at the same time.
         This shouldn't happen and we need to understand it.
    Cause:  We looked for this and couldn't see it happen.

     Symptom:  Dfarm didn't work right on fnpc124 when we tried to
         run it as a worker node booted off the rocksroot partition.
     Cause:  /etc/hosts didn't have the right entry for fnpc124, and this
              confused dfarm.  We edited /etc/hosts so that dfarm could
              run, but this is not a stable situation because the edit gets
              wiped out by any insert-ethers --update.
              We also discovered that mysqld wasn't running because user
              and group 27 were absent.  Changed the voldemort file to add
              these.  Also missing: vcsa, apache, rpm, nfsnobody (pcap,
              named, gdm, maui, radvd).
     Fix:   The hard part is getting /etc/hosts to stay good; have to think
            about this one.  After advice from Federico Sacerdoti of the
            Rocks team (who wrote a patch for Rocks 3.2), I made the
            corresponding change to the same file in Rocks 3.0,
            /opt/rocks/lib/python/rocks/reports/hosts.py.  Here is the diff:
[root@fnpt122 reports]# diff hosts.py hosts.py.sav
210c210
<               print '127.0.0.1\tlocalhost.localdomain\tlocalhost'
---
>               print '127.0.0.1\t%s.%s\tlocalhost' % (hostname, domain)


            Now the hosts file comes out right every time after
            insert-ethers --update.
            All users that exist on the head node have been added to the
            /etc/passwd and /etc/group files in the voldemort files.


B:  Configuration problems on worker nodes
   Symptom: Filesystem / on GP farm worker nodes is formatted as ext2, not ext3.
   Cause:  This was unavoidable: we had to use the ROCKS default partitioning
         to save the data on the data disks, and the default partitioning
         creates ext2.  The same problem will happen on the CDF farms.
   Fix:  Have to create a journal file using tune2fs and edit fstab.
         Can do this in the %post section of the kickstart if necessary.
         Have added this code to the %post section in db-partition.xml,
         only when savedata is Y.  Also changed db-partition.xml to
         be  for all cases where savedata="Y".
         Need to test.  We also still have to address the question of
         saving data when the system disk has been blown away.
         Eventually we want to make the kickstart auto-detect
         the case of a blown-away system disk and still save
         the data.  We have created case "U" in the database to
         flag this case in the meantime.  In architectures
         like abd and acd, where there is only one data partition
         on each drive, we can save the data now.  The others we
         still have to work on.
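The fstab half of the ext2-to-ext3 conversion can be sketched as follows in Python 3.  This is an illustration only, not the code that was added to db-partition.xml; the function name root_to_ext3 is made up.  The journal itself still has to be created separately with tune2fs -j on the root device.

```python
def root_to_ext3(fstab_text):
    """Return fstab text in which the / entry uses ext3 instead of ext2.

    Comment lines, blank lines, and every other mount point are left alone.
    The rewritten line is re-joined with tabs, so its spacing may change.
    """
    out = []
    for line in fstab_text.splitlines():
        fields = line.split()
        is_comment = line.lstrip().startswith("#")
        if (not is_comment and len(fields) >= 3
                and fields[1] == "/" and fields[2] == "ext2"):
            fields[2] = "ext3"
            line = "\t".join(fields)
        out.append(line)
    return "\n".join(out) + "\n"
```

A %post step would run tune2fs -j on the root device first and only then write the modified fstab back, so that a failed journal creation does not leave fstab claiming ext3.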


   Symptom:  /var/adm/krb5 is empty on most fnd0 nodes.
   Cause:  The shell script in 
	/export/home/install/profiles/3.0.0/site-nodes/farms-keytab.xml 
	had a bug in it.  It finds the node prefix with 
	HOSTNAME=
	PREFIX=${HOSTNAME/[0-9]*/}
	The second line strips everything from the first digit onward,
	including the 0 in fnd0 nodes, so the prefix comes out as "fnd".
   Fix:  We need to add a simple if statement to this shell script
	to make the prefix "fnd0" if it is "fnd".  Fixed, OK.  The first
         fix had an extra space; the second fix is OK.
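The real script is shell, but the logic of the bug and of the fix can be shown in a few lines of Python 3 (the function name node_prefix is illustrative).  The regular expression plays the role of the shell pattern [0-9]*: it removes everything from the first digit to the end of the name.

```python
import re


def node_prefix(hostname):
    """Return the farm prefix of a node name, e.g. "fnpc" for "fnpc124"."""
    short = hostname.split(".", 1)[0]          # drop any domain part
    prefix = re.sub(r"[0-9].*$", "", short)    # same effect as ${HOSTNAME/[0-9]*/}
    if prefix == "fnd":
        # D0 node names are fnd0NNN, so the stripped "0" has to be put back.
        prefix = "fnd0"
    return prefix
```

Without the special case, fnd0180 would yield "fnd", which matches the symptom of the keytab step failing on the fnd0 nodes only.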

   Symptom:  /var/adm/krb5 was missing some of the newer keytabs on the
	GP Farms.
   Cause:  We were pulling the keytab.tar from
        an out-of-date location.  I had not updated the cron script
	when I moved the location of the keytabs on fnpcd.
   Fix:  This has been fixed by changing the cron scripts on d0bbin, cdffarm1,
        and fnsfo to send the .tar file to /var/adm/sshkey/keytab/[fnd0,cdf,fnpc]
        respectively.

   Symptom:  In Margaret's tests on fnpc123-154, lm_sensors did not work;
	it hung every time.
   Cause:  I forgot to have the old
          /etc/rc.d/init.d/lm_sensors moved out of the voldemort files.
          In addition, the voldemort script modulesconf.sh
          had the wrong permissions and wasn't getting executed,
          leaving the line "alias char-major-89 i2c-dev" out of /etc/modules.conf.
   Fix:  This was fixed last week by putting the right scripts in place.

   Symptom:  Due to pullrsync/voldemort some of the old rpms are getting
          yummed back onto the machine from the 731 area
   Cause:  731 area was hardwired into voldemort-0.6-3
   Fix:  voldemort-0.6.5-1 is released and on all worker nodes now.

***Symptom:  Some files in pullrsync/voldemort aren't necessary
          anymore because they come from RPMS now.

   Symptom:  sensors_wrap on fnpc51-90, fncdf91-154 didn't work properly.
   Cause:  The output format of sensors changed with lm_sensors version 2.8.3.
   Fix:  Modified the sensors_wrap.sh file; need to distribute the
         rpm and rocks-dist.  The rpm is now distributed (and the workgroup
         yum script was fixed by Troy).

***Symptom:  A number of nodes had no /etc/krb5.keytab after the reinstall
   Cause:  Unknown--but two major errors can be seen in
         /tmp/hostkrb.err where the errors are captured:
       a)	"Preauthentication failed while initializing kadmin interface"
         --most likely these nodes didn't have the password set
           correctly to litesg00ut in the first place, but it could be a 
	clock issue too.  
       b)       "communication failure with server while changing 
		host/fnd0180.fnal.gov@FNAL.GOV's key"
	       These happen from time to time, sometimes they work anyway
               and sometimes they don't.  
        
   Fix:  We have to make both the reinstall/unkeytab sequence and 
         the farms-private.xml/farms-keytab.xml scripts more bulletproof,
	 check the return codes, and retry if necessary.  If the 
         unkeytab process fails, it should give an error and not 
         reboot the node to start the reinstall.  Also it seems that 
         more education is in order so everyone understands what this
         set of scripts is really trying to do.
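A rough Python 3 sketch of the "check the return codes, retry, and do not reboot on failure" behaviour described above.  The function names, the retry count, and the delay are all assumptions; the actual unkeytab and reinstall commands would be passed in by the caller.

```python
import subprocess
import time


def run_with_retries(cmd, attempts=3, delay=5.0):
    """Run `cmd` until it exits 0 or `attempts` is used up; return True on success."""
    for attempt in range(1, attempts + 1):
        if subprocess.call(cmd) == 0:
            return True
        if attempt < attempts:
            time.sleep(delay)  # give a transient kadmin/network failure time to clear
    return False


def unkeytab_then_reboot(unkeytab_cmd, reboot_cmd, attempts=3, delay=5.0):
    """Reboot into the reinstall only if the unkeytab step really succeeded."""
    if not run_with_retries(unkeytab_cmd, attempts=attempts, delay=delay):
        raise RuntimeError("unkeytab failed after %d attempts; not rebooting" % attempts)
    return subprocess.call(reboot_cmd)
```

Raising an error instead of rebooting leaves the node in its old, working state, so a failed unkeytab can be looked at before the node is wiped.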

   Symptom:  We kept getting python exceptions on the GP farms as 
         we were trying to partition the data disks, and save the data.
         We destroyed data on data disks of about 6 nodes that way.
         In a Fermi-to-ROCKS conversion, ROCKS wouldn't deal correctly 
         with nodes that had two swap partitions, or with nodes that
         had more than one partition on any data disk, which all
         of our nodes of 2003 and later do.
   Cause: Not really known
   Fix:  We got a temporary fix from the ROCKS people.  By touching
        /.rocks-release and using autopartitioning, the install kept the
        existing structure of our disks and saved all the data partitions.
        The autopartitioning leaves / as an ext2 file system, not
        ext3; this is fixed, see above.

        This autopartitioning doesn't cover the case
        of a wiped-out system disk where we still want
        to save the data disk.
        We have modified db-partition.xml to accept a third value
        of the SAVEDATA variable, "U" for unknown.
        If the system disk is blown away, this is meant to still
        save the data disks.  The three cases of this we have
        tried thus far (fnpt123, fnpt120, fnpc87) have worked.
        More testing is necessary.  This "U" option won't work
        when converting nodes from Fermi Linux to Rocks; we know because
        we were trying these exact partitioning schemes.

        We do sometimes see problems when two swap partitions
        are called for.  We have temporarily modified the code
        to use just one, and will test more later.  There may be other
        surprises here.

C:   Problems in the reinstall scripts that were used to start the install

     Symptom:  unkeytab script couldn't run in an ssh command 
     Cause: /usr/krb5/sbin wasn't in the path.
     Fix:  Was fixed for D0 and GP by adding explicit path name 

     Symptom:  reinstall_2518(370DLE) couldn't run
     Cause:  Two bugs, a quote was missing, and the path to lilo had to 
           be added.
     Fix:  Fixed in D0, GP, now fixed in CDF but need to test.

     Symptom:  Missing symlinks
     Cause:  We needed to touch /.rocks-release on all 
             nodes that we wanted to save the data partitions on for GP.
             Also have to cd /boot/kickstart ; ln -s 7.3 default in all
     Fix:  need to add this to the reinstall scripts on CDF.  Not done yet.

     Symptom:  Some nodes wouldn't PXE boot with /boot/kickstart/cluster-kickstart command
     Cause:  It appears that there were problems with hardware types 
	     370DLE, 2518, L440GX that we could not force a PXE boot 
             through software, at least not before ROCKS was installed.
     Fix:  We worked around it on these architectures, for the conversion
           from Fermi Linux to ROCKS, by adding a stanza to LILO
           to boot the install image from the hard disk.
           Now that ROCKS is properly installed, the equivalent
           task should be accomplished with rocks-grub, i.e., if
           the node doesn't PXE boot properly, it should boot off
           the hard disk and do an http reinstall.  But we need
           to test this.

    Symptom:  Four nodes so far would not PXE boot at all.
    Cause:  For the first two, fnd0164 and fnd0368, we think it is a faulty
          network interface that is not sending the PXE request properly
          (the rocks server has no record of receiving
          such a request).
          Two others in the CDF farm have the same problem, both Tyan 2468;
          the rest could have the same problem as well.
    Fix:  Can work around this by booting the node off a rocks CD,
          but we will have to debug the network complex, particularly
          in CDF.  An identical node in the prototype farms does PXE fine,
          as do all the hotdogs.
          More details: the CD workaround works for the Tyan 2468 but not
          for koi nodes x5dpa-GG...  The koi nodes work everywhere
          else but in CDF.
          Eventually datacomm changed the PAgP setting on the switch
          and we were able to PXE boot all these nodes.

    Symptom:  CDF nodes with fiber gigabit won't install properly.
          PXE boot off of on-board gig-e works but then kernel wants
	  to install through fiber card, which times out. 
    Fix:  We enabled the PXE boot for fiber gig card, disabled
          it for the other two, and it works all the way on the 
          fiber.  Changes made by D. Tang to the switch since we 
          tried before may also be a factor in why it works now.
    
   

    Symptom:  first PXE DHCP request went through OK but DHCP request
            at start of install did not.
    Cause:  This was a switch misconfiguration--we worked this out 
             because some nodes worked properly and others attached
            to a different module did not.
    Fix: Vladimir Bravov reconfigured the module in question on the 
         switch and everything is fine.  We should make sure
         that data comm knows where all our ROCKS servers ended up 
         on each farm.



    Symptom:  Tyan 2723 board--first PXE boot OK but then couldn't get
            kickstart file:
    Cause:  what we thought was eth0 and what rocks thought was eth0 
            were opposite
    Fix:  
	1) change the nodes table to show the other MAC address for these
	6 nodes (need to propagate back to fnpt122).
	2) change the boot order in the BIOS to make the gigabit 
	boot agent come before the other one, i.e, the 1.1.something  in 
	slot 4 comes ahead of the other one.
	3) change the plug at the back of the machine to use the 
	other interface.

    Symptom:  L440GX hardware installs OK but on reboot comes up
              with a flashing cursor in the left-hand corner of the screen.
    Cause:  The GRUB that comes with ROCKS has a bug and doesn't work
            properly with the L440GX.
    Fix:  farms-l440gx-grub.xml was written to work around this
          problem by installing the normal Red Hat grub again.
          There was a slight bug in it: it ran grub-install
          unconditionally, whether or not the board was an L440GX,
          so the python needed fixing.
          A first attempt to fix that caused the above problem to reappear
          on fnd061, and I rolled the fix back.

          The code has now been fixed correctly and distributed
          everywhere.  The grub fix is now applied only to L440GX hardware.

    Symptom:  Tyan 2466 board hangs at "loading initrd.img........"
    Cause: Tyan 2466 board is a piece of junk
    Fix:  reboot, hit f12 for pxe boot again, it will usually work the 
          second time.

    Symptom:  370DLE hardware hangs at "grub loading stage2"
    Cause:  We tried to reinstall a node that had its system disk
            blown away but had savedata="Y" set in the database.
            We should have set savedata="U" before the reinstall.
            As a result, auto-partitioning made /dev/hdc, which is not
            bootable by default, the system disk, and the system will not boot.
    Fix:  Should make "U" the default in the database, with "N" used
          only when we actually want to change the partitioning scheme
          on the node and "Y" used only when we are doing a full reinstall
          of a good new node.  To recover the node in question,
          set savedata="N" and reinstall.

D) organizational issues that we can do better next time

1) Get more machines on hold to test--at least 3 of each HW flavor.
(Preferably in future get at least one of each hwtype in prototype farm.)

2) Give users enough time to back up and save important files--for sure
on the test machines.

3) Get all testing done and verified a week in advance--no last minute
changes on the day of.

4) Make a checklist of what needs to be checked on newly installed nodes.
(Part of this being done by services project but that project can only 
look for what it knows has to be there).

5) Don't schedule two big updates a day apart like we did this time,
ideally not in the same week.

6) Go through and see whether there are things that
we don't need anymore.

7) Make sure the crucial machines have suitable backups in case
of failure, i.e. fnpcd

8) We need to have a distribution to go back to for installs... 
what we have agreed on is that we will fix all known bugs in 
the distro we have now, and then any future development, i.e.,
new installation features, new rpms, etc. will be done on 
a development server and the production server won't be changed.
Also in future when there is a new release we will always keep a 
rollback path to the previous release.

9) More people besides Steve need to become proficient in finding
and fixing problems.

E) Further development tasks that need to be done:

1) Support for some version of linux based on red hat enterprise 
linux--rocks  3.2.  Workgroups and comps files change
quite a bit in this release but in a way that should make it 
all easier. For instance, the Fermi workgroups in the later releases
roll up all the special rpms into one big RPM that can just be installed.

2) testing of the db-workgroup.xml and fermi-generic.xml nodes
that are designed to install an arbitrary Fermi Linux workgroup.

3) merging of Farms and CMS ROCKS stuff, to be carried out by 
the rocks merge project

4) document document document

5) CD distribution of this for the vendors, which would have to 
include a Fermi roll. (can be done in 3.0 and 3.2)

6) automation of the process from rocks release to a Fermi installation.