30-Nov-2007: Lights Out
One of the things I was attempting to do, from my earliest days as a systems manager, was to automate
as many of the routine operations tasks as possible. One of my first efforts in this area was to automate the
rental of tapes. Previously, if a user wished to rent a tape he had to send a request, along with details about the rental,
to a console operator who would then fill out a form on his console and reply to the job with the volume serial number of the next scratch
tape so that the user's
batch job could then ask for the tape to be mounted and used. The operator had to print out the details for
the tape librarian to process. I wanted to do away with the console operator's role, as I could not see why it was necessary.
I wrote a program, then installed it as a system-wide command, using the same parameter parsing scheme as
was used by the already installed commands. The program had a number of defaults, such as the requesting user's account
number, the highest density tape available (6250 BPI at the time), and details like the number of tapes required and length of rental. Any default parameter
could be overridden by the user if he wished. I called the command "scratch" (short for "scratch tape"). Now the scratch command
could be used interactively or embedded in a batch job. What it did was send a message to the operator to mount the next
available scratch tape. That is all the operator did. The program read the volume serial number of the tape and turned it over to
the user's batch job. The details for the tape librarian were printed automatically. When I say I installed the program as a system-wide command,
I also mean that I inserted it into the system's help file in the same format as all the other entries there. In other words, a casual user
would not know if my program was part of the original operating system or something added on later. What I accomplished was to reduce the operator
interface to something that a robot could perform. (In fact, it would have worked seamlessly with the robotic tape library storage systems that were
developed later.)
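For readers who think in code, here is a minimal modern sketch of the "defaults that the user can override" idea. It is not the original program, which ran on VAX/VMS; the names ScratchRequest, scratch, and operator_message are hypothetical, and the default tape count and rental length are invented for illustration. Only the 6250 BPI default and the account default come from the description above.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ScratchRequest:
    account: str              # defaults to the requesting user's account
    density_bpi: int = 6250   # highest density available at the time
    tapes: int = 1            # hypothetical default
    rental_days: int = 30     # hypothetical default


def scratch(user_account, **overrides):
    """Build a request from the defaults; any field may be overridden."""
    return replace(ScratchRequest(account=user_account), **overrides)


def operator_message(request):
    """The one thing the operator is still asked to do: mount a tape."""
    return (f"Mount next scratch tape ({request.density_bpi} BPI) "
            f"for account {request.account}")
```

A call such as scratch("RBROWN") takes every default, while scratch("RBROWN", density_bpi=1600, tapes=2) overrides two of them. An unknown parameter name raises a TypeError instead of being silently ignored.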
I took a similar approach to any task that used to involve operator intervention. For example, I automated all the routine system maintenance jobs,
eventually evolving my own batch queuing system with an interface that read a simple text file containing the names of batch jobs and the times and days
they were to be run. (I ensured the queuing program would always run as scheduled by creating it as a "detached process"—that is, a job that would run
despite not being within an interactive context or held within a batch queue. As long as the system was operating, it would run. That sort of detail is one that
not many system managers would think of and is one of the marks of my suite of programs.)
So, if there was a new job called "system purge" that was to be run every Wednesday and Friday at 11:00 pm, then I just had to enter
that information into the file and I could forget about the job. It would run exactly when I had said it should, every time. If one of the maintenance jobs required
parameters, those too would be entered into the text file. "Set it once and forget it" was my attitude.
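As a rough illustration of that kind of schedule file, here is a small sketch in present-day Python. The file layout (job name, days, time, optional parameters), the function names parse_schedule and jobs_due, and the keep=3 parameter are all assumptions made for the example; the original ran on VAX/VMS and its actual file format is not reproduced here.

```python
from datetime import datetime

DAYS = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]


def parse_schedule(text):
    """Parse lines of the form: job_name  days  HH:MM  [parameters...]"""
    entries = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # allow comments and blank lines
        if not line:
            continue
        fields = line.split()
        if len(fields) < 3:
            raise ValueError(f"malformed schedule line: {raw!r}")
        hour, minute = (int(part) for part in fields[2].split(":"))
        entries.append({
            "name": fields[0],
            "days": {DAYS.index(d) for d in fields[1].lower().split(",")},
            "hour": hour,
            "minute": minute,
            "params": fields[3:],
        })
    return entries


def jobs_due(entries, now):
    """Return the entries scheduled for the weekday and minute of `now`."""
    return [e for e in entries
            if now.weekday() in e["days"]
            and (now.hour, now.minute) == (e["hour"], e["minute"])]


SCHEDULE = """
# job          days     time   parameters
system_purge   wed,fri  23:00  keep=3
"""

entries = parse_schedule(SCHEDULE)
```

A long-running process would call jobs_due(entries, datetime.now()) once a minute and submit whatever comes back. Adding a job means adding one line to the file, which is the whole point.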
I developed a pretty sophisticated set of programs in my five years at EMR. At least one of them—one that gathered system statistics and formatted
them in daily and weekly reports—was running five years after I had left EMR. I know that because the system manager phoned me to tell me that the
program was broken and he demanded that I fix it. The logic behind his reasoning—that a program running every day without fail for nearly ten years would
suddenly, for no reason, stop working—is the sort of thing I have often run into. It boggles my mind. Anyhow, it turned out that he had been
tinkering with internal systems files that he should never have touched for any reason. Once those files were restored from backup tape, my program
continued to run in its usual merry way.
I took those programs with me when I left the government and I implemented them, while continuing to improve them, at various computer rooms across the city of
Ottawa. At one site my account was kept alive for years, despite my pleas that it be disabled, simply because the system manager liked my work so much. My
reputation preceded me as I moved from job to job. A few times when speaking to someone new, I was greeted with, "Are you that Ron Brown?"
The point of all this is that I was, independently, developing a suite of system management tools that fell into a new category called "lights out operations." The trend
that I had unknowingly been a part of was to remove all humans from computer rooms. I had no problem with removing people from cold, noisy rooms where their
tasks could be replaced with a few electronic signals. Operators needed more interesting things to do than watch tape drives spin all night. No one needed to
be out of work: if they had minimal computer skills they would be grabbed by the newly evolving central "help desk" facilities. In fact, I received about $6,000 in
recruitment bonuses for placing some of the operators from the EMR computer room in new jobs. Pay and hours were about the same,
so they didn't really lose, and the computers benefited by not having any dirty, error-prone humans manhandling them. (Even a freshly showered person
is filthy from a computer's point of view. All those microscopic skin flakes and hairs could be enough to crash a disk drive.)
Today, all large computer installations are "lights out." Important systems and components are mirrored, sometimes hundreds of miles apart. What you see now
in contemporary computer rooms are racks of computer boards (each of which is a complete computer), lots of cable, and cabinets of various sizes, some of
which contain dozens of disk drives, communications switches, or groups of mainframes. What you do not see is people. The entire system can be monitored
remotely, often by a poorly paid, very bored person sitting in front of a personal computer. Systems can page system support folks automatically when they run into problems
that they can't handle, though they will fix most problems themselves and inform the systems folks later. Systems are most vulnerable when people get involved,
whether by patching or upgrading programs or by maliciously attacking them.
"Lights out," for me, has another connotation. When Brian Mulroney was elected prime minister in 1984 his election promise to reduce the size of the civil service
(always a promise by conservative politicians, some of whom wind up bloating government with even more departments and ministers) was a minor distraction. After all,
we were all doing essential work. Someone had to write the programs and maintain the computer systems. However, the computer centre at Energy, Mines, and Resources
was one of the first targets. In 1987 all employees associated with the computer centre were invited to an auditorium where
we were told that, over the next six months, our department would be cut from 200 employees to about thirty-five. However, we were encouraged to stay on the job
and do our usual excellent work until we got layoff notices. I've always thought of that as a prime example of how not to downsize.
Of course morale fell through the floor and all respect for management disappeared in a single stroke. Instead of working, employees would gather outside the
director's office and blow cigarette smoke at him. (He was allergic to smoke and the idea that the work place should be smoke-free was still a new, untested
idea.) People showed up for work at 10:00 or 11:00, booked off for a two- or three-hour lunch, then left for the day after putting in a token half hour's work. I never
panicked and generally did my job as before, except that I took the long lunch hours. I carefully calculated who was absolutely required and whose skills were essential and
reasonably rare. Of the thirty-five to be kept on, I was certain that I would be one of them. Though by that time I had an assistant, I was the only one in the
department with the depth of knowledge of the workings of the VAX/VMS machines required to carry all the responsibility for them.
Like everyone else, I was looking for other employment opportunities. Near the end of the six-month period, I had an offer for a system management position
that looked interesting. So, I made an appointment to see the director. I asked him, straight out, man-to-man, for his word that, once the dust had settled from all these layoffs,
he would try to get my job classification changed to equal that of the systems people who looked after the Cybers and the IBM machine. I knew I was dealing from a
position of strength, but I wasn't demanding anything he could not do. He would not respond to that. Instead, he asked if I had
another offer because he needed to know right away, now that we were down to the final few days. I refused to answer him. I wanted his word. It is just as
well he didn't give it, because he was the first one to be escorted from the building.
I took the other job...and then six months later was back in my old job as a consultant. I had been correct in my assessment that I would have been one of the
few retained, though, given who remained, I was just as happy to have moved on. Besides, if I had stayed at EMR, I would never have had the opportunity
to work in the Prime Minister's office, become a software development project manager, become a self-employed entrepreneur, or do all the other fun things I got involved with.