
30-Nov-2007: Lights Out


One of the things I was attempting to do, from my earliest days as a systems manager, was to automate as many of the routine operations tasks as possible. One of my first efforts in this area was to automate the rental of tapes. Previously, if a user wished to rent a tape, he had to send a request, along with details about the rental, to a console operator. The operator would fill out a form on his console and reply to the job with the volume serial number of the next scratch tape, so that the user's batch job could then ask for the tape to be mounted and used. The operator also had to print out the details for the tape librarian to process. I wanted to do away with the console operator's role, as I could not see why it was necessary.

I wrote a program, then installed it as a system-wide command, using the same parameter parsing scheme as was used by the already installed commands. The program had a number of defaults, such as the requesting user's account number, the highest density tape available (6250 BPI at the time), and details like the number of tapes required and length of rental. Any default parameter could be overridden by the user if he wished. I called the command "scratch" (short for "scratch tape"). The scratch command could be used interactively or embedded in a batch job. What it did was send a message to the operator to mount the next available scratch tape. That is all the operator did. The program read the volume serial number of the tape and turned it over to the user's batch job. The details for the tape librarian were printed automatically. When I say I installed the program as a system-wide command, I also mean that I inserted an entry for it into the system's help file, in the same format as all the other entries there. In other words, a casual user would not know whether my program was part of the original operating system or something added on later. What I accomplished was to reduce the operator interface to something that a robot could perform. (In fact, it would have worked seamlessly with the robotic tape library storage systems that were developed later.)
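For readers who like to see the idea in code, here is a minimal modern sketch of the "defaults that the user can override" pattern described above. It is not the original program, which ran on VAX/VMS. The names `ScratchRequest` and `build_request`, and the default values for the number of tapes and the rental length, are assumptions made up for illustration; only the 6250 BPI density comes from the story.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ScratchRequest:
    """One tape rental request (hypothetical structure, for illustration only)."""
    account: str              # defaults to the requesting user's account
    density_bpi: int = 6250   # highest density available at the time
    tapes: int = 1            # assumed default, not from the original
    rental_days: int = 30     # assumed default, not from the original


def build_request(account, **overrides):
    """Start from the defaults, then apply whatever the user chose to override."""
    return replace(ScratchRequest(account=account), **overrides)


# A user who accepts every default except the number of tapes.
request = build_request("RB001", tapes=2)
print(request)
```

An unknown override name raises a `TypeError`, which is a cheap way of rejecting a mistyped parameter.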

I took a similar approach to any task that used to involve operator intervention. For example, I automated all the routine system maintenance jobs, eventually evolving my own batch queuing system with an interface that read a simple text file containing the names of batch jobs and the times and days they were to be run. (I ensured the queuing program would always run as scheduled by creating it as a "detached process," that is, a job that would run despite not being within an interactive context or held within a batch queue. As long as the system was operating, it would run. That sort of detail is one that not many system managers would think of and is one of the marks of my suite of programs.) So, if there was a new job called "system purge" that was to be run every Wednesday and Friday at 11:00 pm, then I just had to enter that information into the file and I could forget about the job. It would run exactly when I had said it should, every time. If one of the maintenance jobs required parameters, those too would be entered into the text file. "Set it once and forget it" was my attitude.
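The schedule-file idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the original queuing system: the file layout (job name, comma-separated days, time, optional parameters), the `disk_report` job, and the names `Job`, `parse_schedule`, and `due_jobs` are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime

DAYS = ["MON", "TUE", "WED", "THU", "FRI", "SAT", "SUN"]


@dataclass
class Job:
    name: str
    days: frozenset   # weekday abbreviations, e.g. {"WED", "FRI"}
    hour: int
    minute: int
    params: tuple     # extra parameters handed to the job


def parse_schedule(text):
    """Parse lines of the assumed form: NAME DAYS HH:MM [PARAMS...]"""
    jobs = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and blank lines
        if not line:
            continue
        name, days, time, *params = line.split()
        hour, minute = (int(part) for part in time.split(":"))
        jobs.append(Job(name, frozenset(days.upper().split(",")),
                        hour, minute, tuple(params)))
    return jobs


def due_jobs(jobs, now):
    """Return the jobs scheduled for exactly this weekday, hour, and minute."""
    today = DAYS[now.weekday()]
    return [job for job in jobs
            if today in job.days and (job.hour, job.minute) == (now.hour, now.minute)]


SCHEDULE = """\
# name          days      time   parameters
system_purge    WED,FRI   23:00
disk_report     MON       06:30  /verbose
"""

jobs = parse_schedule(SCHEDULE)
# 30 November 2007 was a Friday, so the purge is due at 11:00 pm.
print([job.name for job in due_jobs(jobs, datetime(2007, 11, 30, 23, 0))])
```

A long-running loop that calls `due_jobs` once a minute and launches whatever comes back would play the role of the detached process; adding a job means adding one line to the file.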

I developed a pretty sophisticated set of programs in my five years at EMR. At least one of them (one that gathered system statistics and formatted them in daily and weekly reports) was still running five years after I had left EMR. I know that because the system manager phoned me to tell me that the program was broken, and he demanded that I fix it. The logic behind his reasoning, that a program running every day without fail for nearly ten years would suddenly, for no reason, stop working, is the sort of thing I have often run into. It boggles my mind. Anyhow, it turned out that he had been tinkering with internal system files that he should never have touched for any reason. Once those files were restored from backup tape, my program continued to run in its usual merry way.

I took those programs with me when I left the government and I implemented them, while continuing to improve them, at various computer rooms across the city of Ottawa. At one site my account was kept alive for years, despite my pleas that it be disabled, simply because the system manager liked my work so much. My reputation preceded me as I moved from job to job. A few times when speaking to someone new, I was greeted with, "Are you that Ron Brown?"

The point of all this is that I was, independently, developing a suite of system management tools that fell into a new category called "lights out operations." The trend that I had unknowingly been a part of was to remove all humans from computer rooms. I had no problem with removing people from cold, noisy rooms where their tasks could be replaced with a few electronic signals. Operators needed more interesting things to do than watch tape drives spin all night. No one needed to be out of work: if they had minimal computer skills they would be grabbed by the newly evolving central "help desk" facilities. In fact, I received about $6,000 in recruitment bonuses for placing some of the operators from the EMR computer room in new jobs. Pay and hours were about the same, so they didn't really lose, and the computers benefited by not having any dirty, error-prone humans manhandling them. (Even a freshly showered person is filthy from a computer's point of view. All those microscopic skin flakes and hairs could be enough to crash a disk drive.)

Today, all large computer installations are "lights out." Important systems and components are mirrored, sometimes hundreds of miles apart. What you see now in contemporary computer rooms are racks of computer boards, each of which is a complete computer, lots of cable, and cabinets of various sizes, some of which contain dozens of disk drives, communications switches, or groups of mainframes. And no people. The entire system can be monitored remotely, often by a poorly paid, very bored person sitting in front of a personal computer. Systems can page system support folks automatically when they run into problems that they can't handle, though they will fix most problems themselves and inform the systems folks later. Systems are most vulnerable when people get involved, whether by patching or upgrading programs or by maliciously attacking them.

"Lights out," for me, has another connotation. When Brian Mulroney was elected prime minister in 1984, his election promise to reduce the size of the civil service (always a promise by conservative politicians, some of whom wind up bloating government with even more departments and ministers) was a minor distraction. After all, we were all doing essential work. Someone had to write the programs and maintain the computer systems. However, the computer centre at Energy, Mines, and Resources was one of the first targets. In 1987 all employees associated with the computer centre were invited to an auditorium where we were told that, over the next six months, our department would be cut from 200 employees to about thirty-five. However, we were encouraged to stay on the job and do our usual excellent work until we got layoff notices. I've always thought of that as a prime example of how not to downsize.

Of course morale fell through the floor and all respect for management disappeared in a single stroke. Instead of working, employees would gather outside the director's office and blow cigarette smoke at him. (He was allergic to smoke, and the notion that the workplace should be smoke-free was still new and untested.) People showed up for work at 10:00 or 11:00, booked off for a two- or three-hour lunch, then left for the day after putting in a token half hour's work. I never panicked and generally did my job as before, except that I took the long lunch hours. I carefully calculated who was absolutely required and whose skills were essential and reasonably rare. Of the thirty-five to be kept on, I was certain that I would be one. Though by that time I had an assistant, I was the only one in the department with the depth of knowledge of the workings of the VAX/VMS machines required to carry all the responsibility for them.

Like everyone else, I was looking for other employment opportunities. Near the end of the six-month period, I had an offer for a system management position that looked interesting. So, I made an appointment to see the director. I asked him, straight out, man-to-man, for his word that, once the dust had settled from all these layoffs, he would try to get my job classification changed to equal that of the systems people who looked after the Cybers and the IBM machine. I knew I was dealing from a position of strength, but I wasn't demanding anything he could not do. He would not respond to that. Instead, he asked if I had another offer, because he needed to know right away, now that we were down to the final few days. I refused to answer him. I wanted his word. It is just as well he didn't give it, because he was the first one to be escorted from the building.

I took the other job...and then six months later was back in my old job as a consultant. I had been correct in my assessment that I would have been one of the few retained, though, given who remained, I was just as happy to have moved on. Besides, if I had stayed at EMR, I would never have had the opportunity to work in the Prime Minister's office, become a software development project manager, become a self-employed entrepreneur, or do all the other fun things I got involved with.