Primary Report for the week ending Friday, 2005-06-17
	*****************************************************


Summary:

Installs:
  0 installs  0 decommissions
Replaced drives:
  1 9940B  0 9940A  1 LTO1/LTO2  0 other
Replaced servers/movers/fileserver nodes:
  0 server  0 mover  0 fileservers
4 Robot hardware maintenance service calls
1 Server/Mover/Fileserver maintenance (repairs/parts replacement)
Investigations/Interventions:
  9 mover  3 server   7 tape drive  0 fileserver  12 tape  0 file  0 library
Tape operations:
  9 tapes clobbered/recycled
  250 tapes labeled/entered/removed
  1 tape MIRs fixed
  0 tape MIRs not fixed
  1 tape drive firmware updates
  0 tapes cloned
  1 quota increase requests serviced
  1 raid disk replacements/interventions
  2 enstore service requests
  0 new muon pages
  0 off-hour calls/interventions

Weekend:


Monday 06/13

STK
	JL8630 NOACCESS, READ_VOL1_READ_ERR . Read vol label from DG4D
        with no problems. Tape was released.

	DE2EDLT offline. Could not eject JL0745, which is NOACCESS. The tape
        was not in the drive; it had already ejected and dismounted. The
        tape was released and the drive was put back online.

D0
	PRO204L1 NOACCESS - Positioning error on file 350 trying to get
        to 373. Able to read file 350 with no problem. Read file 373 with
        no problem. The drive D21C was replaced last Friday. There was
        error sense data for D21C.  This error occurred on Thursday.

	PRU383L1 NOACCESS - Mover stuck on D41F, file 127. Fixed the TOC.
        Was able to read file 127 and last file 177.

	Service call placed for 9940B drive on 9940B60 mover. Having
        problems reading tapes. Drive was replaced on Tuesday.

CDF
	IA3854 NOACCESS - WRITE_VOL1_READ_ERR. No vol label but eod cookie
        set to "000000000000001". This is a recycled tape. Tape was mounted
        on a 9940B drive and EOF was written. Eod cookie modified to "NONE".

	IA3978 NOACCESS - WRITE_VOL1_READ_ERR. No vol label but eod
        cookie set to "000000000000001". This is a recycled tape. Tape
        was mounted on a 9940B drive and EOF was written. Eod cookie
        modified to "NONE".

	IA5387 NOACCESS - test dcache tape. Positioning error on file 4521,
        the last file on the tape. This file was written on the same drive
        (994003) that tried to read it a short while after it was written.


Tuesday  6/14

STK
	9940B10 offline.  Tape failed to eject after write completed. MT 
        offline command failed with an i/o error. ACSLS ejected the tape.
        Mover was rebooted. There were problems with the file system.
        Fsck corrected these.
	
	PNFS hung on stkensrv1. PNFS log file hit 2GB. The file was renamed.
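
	A minimal sketch of a check that could flag a log file before it
        reaches the limit. The function name, the 90% threshold, and the
        exact limit of 2**31 - 1 bytes are assumptions for illustration,
        not part of the PNFS or enstore tooling.

```python
import os

# Assumption: the limit is the classic 2 GiB large-file boundary.
LIMIT_BYTES = 2**31 - 1


def files_near_limit(paths, limit=LIMIT_BYTES, fraction=0.9):
    """Return (path, size) pairs for files at or above fraction * limit.

    Hypothetical helper: paths that are missing or unreadable are skipped.
    """
    threshold = limit * fraction
    flagged = []
    for path in paths:
        try:
            size = os.path.getsize(path)
        except OSError:
            continue  # file missing or unreadable; skip it
        if size >= threshold:
            flagged.append((path, size))
    return flagged
```

	Run from cron against the log directory, a check like this would
        give time to rotate the file before writes start failing.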

D0

	D21BLTO offline. Tape eject failed for PRO927L1 after 3 tries.
        MT commands kept failing with i/o errors even after power cycling
        the drive and rebooting the mover. A service call was placed to
        ADIC. The tape drive was replaced on Wednesday.
	PRO927L1 NOACCESS. Stuck in D21BLTO.

CDF

	Pagedcache Dccp running too long. Nothing in output file. Jobs were 
        killed and lock directory removed.


Wednesday  6/15

STK	

	37 errors trying to transfer 2 files from STK dcache to enstore.
        Error is "pnfs tag, wrapper contains invalid characters".

	Recycled 7 exp-db tapes.

	PagedcacheCMSDccp running too long. Nothing in output file. Job
        killed and lock directory removed.

	PagedcacheCMSGridftp running too long. Nothing in output file.
        Job killed and lock directory removed.

	About 13:10  Pagedcache Dccp and Gridftp failing due to 
        /home/enstore/dcache-deploy/config/dCacheSetup: No such file or 
        directory. I checked on stkendca3a and there was no file by that
        name. A link with the dCacheSetup name magically appeared at
        15:59 and the jobs started running.

	Problem with disk array on stkendca7a - ERROR: Disk Array Unit 0 on 
        controller ID:2 is degraded and no longer fault tolerant.
        ERROR: Drive error encountered on port 2 on controller ID:2.
        ERROR: Rebuild/initialization/verify failed due to an error on
        the source or destination drive of controller ID:2 unit: 0
	About 5 pm port 4 on one of stkendca7a's raid units failed.
        For an unknown reason, the array did not automatically rebuild.
        The hot spare was in port 2. Wayne and David started it rebuilding,
        and replaced the drive in port 4, which then became the hot spare.

        About 7 pm port 2 failed. Note that this was formerly the hot spare,
        so nothing was lost, even though the rebuild had not completed. All
        the data were still there, with no redundancy, on the remaining 6
        disks. The rebuild began again using the new hot spare in port 4.

        David has configured offline, but not replaced, the failed disk in
        port 2, because this time, David wanted to be certain that it is
        replaced with a good disk. So we are running in degraded mode, with
        the rebuild at 78%, and when the rebuild completes, we will be
        running without a spare.  On Thursday, everything was put right.

D0
	Firmware updated to current on D21BLTO when it was replaced.
	
CDF

	Replied to user about tape IA8157 with bad file 54.
		
	
Thursday  6/16

STK

	9940B24 offline. Eject failed. Write error with sense data -
        Error with "sense key unit attention". MT commands fail after
        error. Service call placed.

	9940B10 offline. Mover stuck in seek. Tape VO7613. Mover was brought
        back online.

	VO7613 NOACCESS. Mounted the tape and fsf to the end.
	
	Online Database support people reporting files missing from pnfs
        between yesterday and today. Their new script clobbered the files.

	Set minos 9940 quota to 380.

	The load average on stkensrv2 was 19, and lots of processes were
        waiting for disk (D state). The cpus, however, were mostly idle.
        So processes were waiting, and nothing was getting done. There
        were 8 cron jobs all trying to run; they had started anywhere from
        15 to 45 minutes earlier. David killed them all to free things up.
        The disk seems OK, but it's hard to tell.
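
        A minimal sketch of how the processes stuck waiting for disk could
        be listed, assuming the text comes from `ps -eo pid,stat,comm` on
        Linux. The function names are hypothetical and this is not the
        procedure that was used on stkensrv2.

```python
import subprocess


def d_state_processes(ps_output):
    """Parse `ps -eo pid,stat,comm` text; return (pid, command) for D-state rows."""
    blocked = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 2)
        if len(fields) < 3:
            continue  # blank or malformed line
        pid, stat, comm = fields
        if stat.startswith("D"):  # D = uninterruptible sleep, usually I/O wait
            blocked.append((int(pid), comm.strip()))
    return blocked


def read_ps():
    """Hypothetical helper: capture the ps listing from the local host."""
    result = subprocess.run(
        ["ps", "-eo", "pid,stat,comm"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

        Calling d_state_processes(read_ps()) gives the pids to inspect. A
        high load average with idle cpus, as seen here, usually shows up as
        a long list from this call.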

        The motherboard in stkenmvr32a was replaced Wednesday, and we are
        now running via the onboard 1000BT connection, instead of the
        SysKonnect card.

D0

CDF

	
Friday  6/17

STK
	
	CRC Encp Mismatch - lqcd from tape to dcache. Encped the file with 
        --bypass-filesystem-max-filesize-check option. File copied with no
        problems.

	Dismount Failed BAD on DG4CMV for tape JL8182.

	JL8182 NOACCESS.

D0
	994025 OFFLINE - Tape PRQ545 stuck in drive. Service call placed.
	PRQ545 NOACCESS. Stuck in 994025. The drive has an "UNLOADING"
        message on the display. I tried to ipl the drive but the
        "UNLOADING" message persists. Placed a service call to STK.

CDF

	IA2128 NOACCESS.

	9940B16 offline - ACSSA Dismount 17 dismount failed library failure. 
        Code 5B01 on drive indicates leader block elevator is stuck. Service
        call placed.


Weekend -