Incident Log on Sirsi servers

I have kept an incident log of all major happenings on the Sirsi servers since 1995, in two scribblers in my office. I have scanned the most recent four years' worth and provide it below as one long set of images, beginning with the most recent (2002) and going back to November 1998.

  • April 22, 2003 Sequence of events for today on SETER:
  • after midnight there was the usual weekly aix reboot, but after the reboot various things did not restart as they should have
  • the TSM backup failed because the TSM scheduler did not re-start
  • the cron daemon didn't start either, so the 6 a.m. cron job didn't run
  • the system was unstable, so around 9:45 all the Unicorn server software crashed (all 5 servers went down), possibly because of an inability to fork a process
  • hundreds or thousands of defunct processes were left behind, so the limit on the number of processes that the sirsi user could run (probably 4,096) was exceeded. Each time I brought Unicorn back up, it crashed again.
  • the only easy way to delete so many defunct processes at once is with an aix reboot, so I asked Kin to do this
  • afterwards everything was OK
  • the downtime also affected authentication for remote database access and the workroom booking system
  • total time of the problem was about an hour and a half. Circulation staff did manual transactions during that time, which had to be keyed in when the system became available again
    It would be good to have a more informative checklist generated by the system after an aix reboot, so that the operator and we would know if all the various things got re-initialized as they should. The usual message just says that the aix reboot completed satisfactorily, which isn't always true.
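    As a rough illustration of the kind of checklist wished for above, here is a minimal Python sketch. It is not an existing Sirsi or AIX tool. It takes the text of a process listing, reports whether each expected service appears in it, and counts the sirsi user's processes, including defunct ones. The process-name fragments ("cron" for the cron daemon, "dsmc sched" for the TSM scheduler), the function names, and the 4,096 limit (described above only as "probably" correct) are all assumptions made for illustration.

```python
from collections import Counter

# Assumed name fragments for the services that failed to restart on April 22.
# These are illustrative guesses; adjust them to what `ps` really shows.
EXPECTED = {
    "cron daemon": "cron",
    "TSM scheduler": "dsmc sched",
}


def parse_ps(ps_text):
    """Parse 'USER PID COMMAND' style text into (user, pid, args) tuples."""
    rows = []
    for line in ps_text.strip().splitlines()[1:]:  # skip the header line
        parts = line.split(None, 2)
        if len(parts) == 3:
            rows.append((parts[0], parts[1], parts[2]))
    return rows


def reboot_checklist(ps_text, expected=EXPECTED, user="sirsi", limit=4096):
    """Return one report line per expected service, plus a process count."""
    rows = parse_ps(ps_text)
    report = []
    for label, fragment in expected.items():
        # Simple substring match; a stricter check would compare the full path.
        running = any(fragment in args for _, _, args in rows)
        report.append(f"{'OK' if running else 'MISSING'}: {label}")
    per_user = Counter(u for u, _, _ in rows)
    defunct = sum(1 for u, _, args in rows if u == user and "<defunct>" in args)
    report.append(
        f"{user}: {per_user[user]} processes ({defunct} defunct), limit {limit}"
    )
    return report


# Made-up sample listing, for demonstration only.
SAMPLE = """USER PID COMMAND
root 1 /etc/init
root 4130 /usr/sbin/cron
sirsi 5000 <defunct>
sirsi 5001 <defunct>
sirsi 5002 workserver
"""

if __name__ == "__main__":
    for line in reboot_checklist(SAMPLE):
        print(line)
```

    On a live system the listing could come from a command such as ps -e -o user,pid,args, and the report could be mailed to the operator after the reboot. Both of those are suggestions, not part of the current setup.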

  • April 1, 2003
  • Horus aix fix and subsequent problems
  • Same patch put on erl (isis)
  • March 27, 2003
  • apache 1.3.27 put on neter
  • then the pci's for http didn't work. Fixed.
  • aix patch on neter
  • March 11-13, 2003
  • sendmail efix installed on neter, seter, horus, erl by Kin Lee
  • March 4, 2003
  • Horus problem due to passwd.update file.
  • Also, the I.T. Operator POWERED OFF seter AGAIN.
  • Feb.28, 2003
  • HKN CD-server unplugged permanently.
  • Also, the I.T. Operator POWERED OFF seter by mistake.
  • Jan.17, 2003
  • Same event as on Sept.17 and Nov.15. See email of Jan.17 printed out in the yellow event scribbler. Now we know it seems to be caused by a runaway catzserver process, but what causes that I have no idea. If it happens again, just try bringing down the zserver first and hopefully that will solve the problem no matter what the processes are showing.
  • CIRCMAC-2 STILL has the old client from previous version. Mary Jane is fixing it. (She did)
  • Jan.10, 2003
  • Kin upgraded wu-ftpd on all library servers.
  • Jan.6, 2003
  • Kin disabled the ftp, rlogin, etc. on library servers. Big problem.
  • Nov.15, 2002
  • Same event as Sept.17 again.
  • Sept.17, 2002
  • Sirsi crashed. WF dropped off, and eventually Webcats froze. The system would not halt. Called the Sirsi helpdesk (incident #39382). Eventually it did halt. After a run, all was fine. This is unexplained. The error log did show someone trying to log in with the old version's client. It showed the same thing the next day, too.
  • August 24, 2002
  • Wrong wording for how to log in with ID and password. This wording in Webcat does not come from the Gateway configuration files (.hdr, .dis, .ftr or .par), as you would expect. Instead, it's in the Unicorn/Language/English/labels file.
  • August 16, 2002
  • Upgraded SETER to Unicorn v2001.12.0.4
  • There were a lot of snafus.
  • Started upgrade at 9:45. Finished at 11:12.
  • Rebuilds finished before 5 a.m. Sunday
  • Webcat display problem because we didn't move custom 5.pg out of the way BEFORE the upgrade.
  • August 15, 2002
  • Barrie loaded 5,398 new users from seter to neter for authentication: 5,176 adds, 222 updates.
  • June 11, 2002
  • Password changes not working on neter. Kin reinitialized passwd.queue file, which was forgotten when we moved from nut to neter.
  • June 7, 2002
  • Did a set of rebuilds on neter on version 2001
  • May 7, 2002
  • Upgraded neter from Unicorn 99.4.2 to 2001. Problems (perl binary). On May 8, Alan Welch re-ran it. It finished, but we couldn't search the catalogue. Started rebuildtext, set to blocks of 500,000 records at a time, at 9:30 a.m. It took 13.25 hours.
  • April 30, 2002
  • Kin Lee upgraded the native compiler for aix from xlc to cc on neter and seter
  • Kin added a second monitor to both machines. It's called topas and it correctly shows memory, which 'monitor' did not on these machines. But it doesn't show users logged on, which monitor does show. I can use either with a pci.

    Continuing from April 24, 2002:

  • The Sirsi incident ID for the inability to start workserver was 20473.
  • Couldn't run setprots - probably bad file permissions somewhere but no way of knowing.
    Eventually it ran. nohup setprots >&setprots.out&. Got message: setprots.out:0403-007 Generated or received a file descriptor number that is not valid.
  • I forgot to say in the logs below that I did a test RESTORE on neter May 1-5, 2000.

    Page maintained by Linda Pearce. Last updated April 22/2003.