Primary Report for the week ending Friday, 2005-06-17
*****************************************************
Summary:
Installs:
0 installs 0 decommissions
Replaced drives:
1 9940B 0 9940A 1 LTO1/LTO2 0 other
Replaced servers/movers/fileserver nodes:
0 server 0 mover 0 fileservers
4 Robot hardware maintenance service calls
1 Server/Mover/Fileserver maintenance (repairs/parts replacement)
Investigations/Interventions:
9 mover 3 server 7 tape drive 0 fileserver 12 tape 0 file 0 library
Tape operations:
9 tapes clobbered/recycled
250 tapes labeled/entered/removed
1 tape MIRs fixed
0 tape MIRs not fixed
1 tape drive firmware updates
0 tapes cloned
1 quota increase requests serviced
1 raid disk replacements/interventions
2 enstore service requests
0 new muon pages
0 off-hour calls/interventions
Weekend:
Monday 06/13
STK
JL8630 NOACCESS, READ_VOL1_READ_ERR. Read vol label from DG4D
with no problems. Tape was released.
DE2EDLT offline. Could not eject JL0745, which is NOACCESS. Tape
was not in the drive. It did eject and dismount. Tape was released
and the drive was put back online.
D0
PRO204L1 NOACCESS - Positioning error on file 350 trying to get
to 373. Able to read file 350 with no problem. Read file 373 with
no problem. The drive D21C was replaced last Friday. There was
error sense data for D21C. This error occurred on Thursday.
PRU383L1 NOACCESS - Mover stuck on D41F, file 127. Fixed the TOC.
Was able to read file 127 and last file 177.
Service call placed for 9940B drive on 9940B60 mover. Having
problems reading tapes. Drive was replaced on Tuesday.
CDF
IA3854 NOACCESS - WRITE_VOL1_READ_ERR. No vol label but eod cookie
set to "000000000000001". This is a recycled tape. Tape was mounted
on a 9940B drive and EOF was written. Eod cookie modified to "NONE".
IA3978 NOACCESS - WRITE_VOL1_READ_ERR. No vol label but eod
cookie set to "000000000000001". This is a recycled tape. Tape
was mounted on a 9940B drive and EOF was written. Eod cookie
modified to "NONE".
IA5387 NOACCESS - test dcache tape. Positioning error on file 4521,
the last file on tape. This file was written on the same drive
(994003) that tried to read it a short while after it was written.
Tuesday 6/14
STK
9940B10 offline. Tape failed to eject after write completed. MT
offline command failed with an i/o error. ACSLS ejected the tape.
Mover was rebooted. There were problems with the file system.
Fsck corrected these.
PNFS hung on stkensrv1. PNFS log file hit 2GB. The file was renamed.
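The report does not say why a 2GB log file hung PNFS; one common cause is
a process built without large-file support, which cannot write past a
signed 32-bit offset. Assuming that is the limit involved, a minimal
sketch of a check that could flag the log before it gets there follows.
The function names and the 90% threshold are hypothetical, not part of
enstore or PNFS.

```python
import os

# Assumption: the limit is the largest offset a signed 32-bit file
# position can hold (2 GiB minus one byte).
LIMIT_BYTES = 2**31 - 1


def log_fill_fraction(path, limit=LIMIT_BYTES):
    """Return the file's current size as a fraction of the limit."""
    return os.path.getsize(path) / limit


def needs_rotation(path, threshold=0.9, limit=LIMIT_BYTES):
    """True when the log has grown past `threshold` of the limit."""
    return log_fill_fraction(path, limit) >= threshold
```

Run from cron against the log path, a check like this would give time to
rename the file before the service stalls.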
D0
D21BLTO offline. Tape eject failed for PRO927L1 after three attempts.
MT commands failed with i/o errors even after power cycling the drive
and rebooting the mover. A service call was placed to Adic. Tape drive
was replaced on Wednesday.
PRO927L1 NOACCESS. Stuck in D21BLTO.
CDF
Pagedcache Dccp Running too long. Nothing in output file. Jobs were
killed and lock directory removed.
Wednesday 6/15
STK
37 errors trying to transfer 2 files from STK dcache to enstore.
Error is "pnfs tag, wrapper contains invalid characters".
Recycled 7 exp-db tapes.
PagedcacheCMSDccp running too long. Nothing in output file. Job
killed and lock directory removed.
PagedcacheCMSGridftp running too long. Nothing in output file.
Job killed and lock directory removed.
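The "running too long" cases above were all cleared by hand: kill the
job, remove the lock directory. As a rough sketch of how a stale lock
could be detected automatically, assuming the lock is a directory whose
modification time marks when the job started (the report does not
describe how the Pagedcache jobs actually lock), something like the
following might serve. All names here are hypothetical.

```python
import os
import time


def lock_age_seconds(lock_dir, now=None):
    """Seconds since lock_dir was last modified, or None if it is absent."""
    try:
        mtime = os.stat(lock_dir).st_mtime
    except FileNotFoundError:
        return None
    if now is None:
        now = time.time()
    return now - mtime


def is_stale(lock_dir, max_age_seconds, now=None):
    """True when the lock exists and is older than max_age_seconds."""
    age = lock_age_seconds(lock_dir, now)
    return age is not None and age > max_age_seconds
```

This only reports staleness. Killing the job and removing the directory
are left to the operator, as in the entries above.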
About 13:10 Pagedcache Dccp and Gridftp failing due to
/home/enstore/dcache-deploy/config/dCacheSetup: No such file or
directory. I checked on stkendca3a and there was no file by that
name. A link with the dCacheSetup name magically appeared at
15:59 and the jobs started running.
Problem with disk array on stkendca7a - ERROR: Disk Array Unit 0 on
controller ID:2 is degraded and no longer fault tolerant.
ERROR: Drive error encountered on port 2 on controller ID:2.
ERROR: Rebuild/initialization/verify failed due to an error on
the source or destination drive of controller ID:2 unit: 0
About 5 pm port 4 on one of stkendca7a's raid units failed.
For an unknown reason, the array did not automatically rebuild.
The hot spare was in port 2. Wayne and David started it rebuilding,
and replaced the drive in port 4, which then became the hot spare.
About 7 pm port 2 failed. Note that this was formerly the hot spare,
so nothing was lost, even though the rebuild had not completed. All
the data were still there, with no redundancy, on the remaining 6
disks. The rebuild began again using the new hot spare in port 4.
David has configured the failed disk in port 2 offline, but has not
replaced it, because this time he wanted to be certain that it is
replaced with a good disk. So we are running in degraded mode, with
the rebuild at 78%, and when the rebuild completes, we will be
running without a spare. On Thursday, everything was put right.
D0
Firmware was updated to current on D21BLTO when it was replaced.
CDF
Replied to user about tape IA8157 with bad file 54.
Thursday 6/16
STK
9940B24 offline. Eject failed. Write error with sense data -
Error with "sense key unit attention". MT commands fail after
error. Service call placed.
9940B10 offline. Mover stuck in seek. Tape VO7613. Mover was brought
back online.
VO7613 NOACCESS. Mounted the tape and fsf to the end.
Online Database support people reporting files missing from pnfs
between yesterday and today. Their new script clobbered the files.
Set minos 9940 quota to 380.
The load average on stkensrv2 was 19, and lots of processes were
waiting for disk (D state). The cpus, however, were mostly idle.
So processes were waiting, and nothing was getting done. There
were 8 cron jobs all trying to run; they had started anywhere from
15 to 45 minutes earlier. David killed them all to free things up.
The disk seems OK, but it's hard to tell.
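On Linux the load average counts processes in uninterruptible sleep
(state D) as well as runnable ones, which is why the load can be 19 while
the cpus sit idle. A minimal sketch for listing the D-state processes by
reading /proc/<pid>/stat is below. The function names are hypothetical,
and the proc_root argument exists only so the code can be pointed at a
test directory.

```python
import os


def parse_state(stat_line):
    """Return (comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so locate the last ')' instead of splitting on whitespace.
    start = stat_line.index("(")
    end = stat_line.rindex(")")
    comm = stat_line[start + 1:end]
    state = stat_line[end + 1:].split()[0]
    return comm, state


def blocked_processes(proc_root="/proc"):
    """List (pid, comm) for processes in uninterruptible sleep (state D)."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                comm, state = parse_state(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or entry unreadable; skip it
        if state == "D":
            blocked.append((int(entry), comm))
    return blocked
```

Seeing which commands are blocked (here, presumably the 8 cron jobs)
helps decide what to kill, though it does not by itself say whether the
disk is healthy.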
The motherboard in stkenmvr32a was replaced Wednesday, and we are
now running via the onboard 1000BT connection, instead of the
SysKonnect card.
D0
CDF
Friday 6/17
STK
CRC Encp Mismatch - lqcd from tape to dcache. Encped the file with
--bypass-filesystem-max-filesize-check option. File copied with no
problems.
Dismount Failed BAD on DG4CMV for tape JL8182.
JL8182 NOACCESS.
D0
994025 OFFLINE - Tape PRQ545 stuck in drive; service call placed.
PRQ545 NOACCESS. Stuck in 994025. The drive has an "UNLOADING"
message on the display. I tried to ipl the drive but the
"UNLOADING" message persists. Placed a service call to STK.
CDF
IA2128 NOACCESS.
9940B16 offline - ACSSA Dismount 17 dismount failed library failure.
Code 5B01 on drive indicates leader block elevator is stuck. Service
call placed.
Weekend -