
Incident Log on Sirsi servers
I have kept an incident log for all major happenings on the Sirsi servers since 1995.
These are in two scribblers in my office. I have scanned the most recent four years'
worth and provide them below as one long set of images, beginning with the most recent,
from 2002, and going back to November 1998.
April 22, 2003
Sequence of events for today on SETER:
- After midnight there was the usual weekly aix reboot, but after
the reboot various things did not restart as they should have.
- The TSM backup failed because the TSM scheduler did not re-start.
- The cron daemon didn't start either, so the 6 a.m. cronjob didn't run.
- The system was unstable, so around 9:45 all the Unicorn server
software crashed (all 5 servers went down), possibly because
of an inability to fork a process.
- Hundreds or thousands of defunct processes were left behind.
Therefore the limit on the number of processes that the sirsi
user could run was exceeded. (The limit is probably 4,096.)
So each time I brought unicorn back up it crashed again.
- The only easy way to delete so many defunct processes at once
is with an aix reboot, so I asked Kin to do this.
- Afterwards everything was OK.
- The downtime also affected authentication for remote database
access and the workroom booking system.
- Total time of the problem was about an hour and a half.
Circulation staff did manual transactions during that time,
which had to be keyed when the system became available again.
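A minimal sketch of how the pile-up of defunct processes could be counted per user before deciding on a reboot. It assumes ps-style text in which the first column is the user name and a defunct process has `<defunct>` on its line; the function name `count_defunct` and the sample listing are made up for illustration and are not output from SETER.

```python
from collections import Counter


def count_defunct(ps_output):
    """Count defunct processes per user in ps-style text.

    Assumes the first column is the user name and that a defunct
    process has '<defunct>' somewhere on its line.
    """
    counts = Counter()
    for line in ps_output.splitlines():
        fields = line.split()
        if fields and "<defunct>" in line:
            counts[fields[0]] += 1
    return counts


# Made-up sample listing, not real output from SETER.
SAMPLE = """\
UID   PID  PPID  C  STIME  TTY  TIME CMD
sirsi 101     1  0  09:40    -  0:00 <defunct>
sirsi 102     1  0  09:40    -  0:00 <defunct>
root  200     1  0  00:10    -  0:01 /usr/sbin/syslogd
"""

if __name__ == "__main__":
    for user, n in count_defunct(SAMPLE).items():
        print(f"{user}: {n} defunct")
```

Comparing the count for the sirsi user with the per-user process limit (probably 4,096, as noted above) would show how close the system is to refusing new processes.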
It would be good to have a more informative checklist generated by
the system after an aix reboot, so that the operator and we would
know if all the various things got re-initialized as they should.
The usual message just says that the aix reboot completed
satisfactorily, which isn't always true.
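One way such a checklist could be generated is sketched below. It takes the text of a process listing and reports, for each service that should have come back after the reboot, whether a matching process is present. The service labels and the process-name patterns (`cron`, `dsmc sched`) are assumptions for illustration; the real process names on SETER would need to be confirmed.

```python
# Hypothetical mapping: service label -> substring expected in the process list.
EXPECTED = {
    "cron daemon": "cron",
    "TSM scheduler": "dsmc sched",
}


def reboot_checklist(ps_output, expected=EXPECTED):
    """Return {label: True/False} for whether each expected process is present."""
    lines = ps_output.splitlines()
    return {
        label: any(pattern in line for line in lines)
        for label, pattern in expected.items()
    }


def format_checklist(results):
    """Render the results as one 'OK' or 'MISSING' line per service."""
    return "\n".join(
        f"{'OK     ' if up else 'MISSING'}  {label}"
        for label, up in results.items()
    )


# Made-up listing in which cron is running but the TSM scheduler is not.
SAMPLE = """\
root  3100  1  0  00:12  -  0:00 /usr/sbin/cron
root  3200  1  0  00:12  -  0:00 /usr/sbin/syslogd
"""

if __name__ == "__main__":
    print(format_checklist(reboot_checklist(SAMPLE)))
```

A report like this, sent along with the usual reboot message, would have flagged the missing TSM scheduler and cron daemon hours before the 9:45 crash.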
April 1, 2003
Horus aix fix and subsequent problems
Same patch put on erl (isis)
March 27, 2003
apache 1.3.27 put on neter
then the pci's for http didn't work. Fixed.
aix patch on neter
March 11-13, 2003
sendmail efix installed on neter, seter, horus, erl by Kin Lee
March 4, 2003
Horus problem due to passwd.update file.
Also, the I.T. Operator POWERED OFF seter AGAIN.
Feb. 28, 2003
HKN CD-server unplugged permanently.
Also, the I.T. Operator POWERED OFF seter by mistake.
Jan. 17, 2003
Same event as on Sept. 17 and Nov. 15. See the email of Jan. 17 printed out in the yellow
event scribbler. It now seems to be caused by a runaway catzserver process, but I have no
idea what causes that. If it happens again, try bringing down the zserver first; hopefully
that will solve the problem no matter what the processes are showing.
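A rough sketch of how a runaway zserver process might be spotted from a process listing. It treats "runaway" as meaning a large accumulated CPU time, which is an assumption, as are the three-column layout (pid, CPU time, command), the one-hour threshold, and the sample data. CPU times with a day component are not handled.

```python
def cpu_seconds(time_field):
    """Convert a CPU time such as '5:07' or '1:02:03' to seconds."""
    seconds = 0
    for part in time_field.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def find_runaways(ps_output, name="zserver", limit_seconds=3600):
    """Return (pid, cpu_seconds) for matching processes over the CPU limit.

    Assumes three whitespace-separated columns: pid, cpu time, command.
    """
    hits = []
    for line in ps_output.splitlines():
        fields = line.split(None, 2)
        if len(fields) < 3 or not fields[0].isdigit():
            continue  # skip the header and malformed lines
        pid, time_field, command = fields
        used = cpu_seconds(time_field)
        if name in command and used > limit_seconds:
            hits.append((int(pid), used))
    return hits


# Made-up sample listing, not real output from the Sirsi servers.
SAMPLE = """\
PID  TIME  COMMAND
4101 0:12 zserver
4102 2:15:40 catzserver
4200 9:00:00 rebuildtext
"""

if __name__ == "__main__":
    for pid, used in find_runaways(SAMPLE):
        print(f"pid {pid} has used {used} CPU seconds")
```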
CIRCMAC-2 STILL has the old client from previous version. Mary Jane is fixing it. (She did)
Jan. 10, 2003
Kin upgraded wu-ftpd on all library servers.
Jan. 6, 2003
Kin disabled ftp, rlogin, etc. on the library servers. Big problem.
Nov. 15, 2002
Same event as Sept. 17 again.
Sept. 17, 2002
Sirsi crashed. WF dropped off, and eventually Webcats froze. The system would not halt.
Called Sirsi helpdesk (incident #39382). Eventually it did halt. After a run, all was fine.
This is unexplained. The error log did show someone trying to log in under the old version's
client. It showed the same thing the next day, too.
August 24, 2002
Wrong wording for how to log in with ID and password. This wording in Webcat does not
come from the Gateway configuration as you would expect it to, like the .hdr or .dis or .ftr
or .par files. Instead, it's in the Unicorn/Language/English/labels file.
August 16, 2002
Upgraded SETER to Unicorn v2001.12.0.4
There were a lot of snafus.
- Tu in I.T. forgot to change the libauth alias to neter yesterday. He's on vacation today.
The TIME-TO-LIVE is at 15 min., so he did do that part. Dan Clark changed the alias at 9:04 a.m.
today (Eric called him.) Authentication worked after that.
- I.T. forgot to do the full system backup at 00:01. Angus re-ran it at 7:30 a.m., but the TSM
backup didn't start. Kin Lee then changed the script to avoid the restart of servers, and Ops
then re-ran the script, which promptly erased the new disk-to-disk backup and started it again at
8:40 or so. I decided not to run the TSM backup at the same time as the upgrade and rebuild
because it competes for I/O too much. I also decided not to do a TSM backup because we have one
from the day before. I could ask Kin to stop all TSM activity on the machine until further
notice, but because lots of mistakes are made, it's safer to leave it alone. This does mean that
on Saturday during the rebuild TSM will pick up the huge _sirsi file, which may slow things down.
(But it didn't seem to.)
Started upgrade at 9:45. Finished at 11:12.
Rebuilds finished before 5 a.m. Sunday
Webcat display problem because we didn't move custom 5.pg out of the way BEFORE the upgrade.
August 15, 2002
Barrie loaded 5,398 new users from seter to neter for authentication: 5,176 adds, 222 updates.
June 11, 2002
Password changes not working on neter. Kin reinitialized passwd.queue file, which was
forgotten when we moved from nut to neter.
June 7, 2002
Did a set of rebuilds on neter on version 2001
- rebuildheading 23:40
- rebuildauth 8:52
- auth_maint 14:19 WON'T DO IN FUTURE, OR WILL DO AT A SEPARATE TIME
- rebuildtext 13:13 WITH 500,000 PER PASS
- rebuildthesauri :25
- correctthesauri's 1:00
- TOTAL: 60:49
- 1,519,288 bib records
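As a cross-check on the timings above, the sketch below adds up the six listed rebuild durations, assuming each is in hours:minutes (with ":25" read as 0:25). Under that reading the entries sum to 61:29 rather than the logged 60:49, so one of the entries or the total may have been noted slightly differently; the log does not say which. The helper names are illustrative.

```python
def to_minutes(hhmm):
    """Convert 'h:mm' (or ':mm') to a number of minutes."""
    hours, minutes = hhmm.split(":")
    return int(hours or 0) * 60 + int(minutes)


def to_hhmm(total_minutes):
    """Format a number of minutes as 'h:mm'."""
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}:{minutes:02d}"


# Durations copied from the list above, assumed to be hours:minutes.
REBUILDS = {
    "rebuildheading": "23:40",
    "rebuildauth": "8:52",
    "auth_maint": "14:19",
    "rebuildtext": "13:13",
    "rebuildthesauri": ":25",
    "correctthesauri's": "1:00",
}

if __name__ == "__main__":
    total = sum(to_minutes(v) for v in REBUILDS.values())
    print("sum of listed steps:", to_hhmm(total))
```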
May 7, 2002
Upgraded neter from unicorn 99.4.2 to 2001. Problems (perl binary). On May 8, Alan Welch
re-ran it. It finished, but we couldn't search the catalogue. Started rebuildtext, set to
blocks of 500,000 records at a time, at 9:30 a.m. It took 13.25 hours.
April 30, 2002
Kin Lee upgraded the native compiler for aix from xlc to cc on neter and seter
Kin added a second monitor to both machines. It's called topas,
and it correctly shows memory, which 'monitor' did not on these machines. But it doesn't
show users logged on, which monitor does show. I can use either with a pci.
Continuing from April 24, 2002:
The Sirsi incident ID for the inability to start workserver was 20473.
Couldn't run setprots - probably bad file permissions somewhere but no way of knowing.
Eventually it ran. nohup setprots >&setprots.out&. Got message:
setprots.out:0403-007 Generated or received a file descriptor number that is not valid.
Forgot in logs below to say I did a test RESTORE on neter May 1-5 2000.
Page maintained by Linda Pearce. Last updated April 22/2003.