Taming the Office Network

We know a few folks who are still on dial-up Internet access at home, but most of us, at least those of us who are “in the business,” have DSL or cable, some sort of high-speed broadband connection.  The service providers originally intended one connection to serve one computer.  But many households now have more than one computer, and, in the case of Chaos Central, where we run two businesses (one of which builds software and web sites and manages other networks), we have “many,” most of which are virtualized but look, to the network, like individual computers.

The standard DSL or cable modem now comes configured as a network server appliance, with Network Address Translation (NAT), Domain Name Service (DNS), and Dynamic Host Configuration Protocol (DHCP).  But, as an appliance, the little modem does a less-than-adequate job in each area, with limited control available to the user via a web menu.  At Chaos Central, we have been gradually migrating these network functions to “real” (Linux/Unix, of course) servers, for both performance and control.

Back at the Rocky Mountain Nexus of Chaos Central, in Montana, we used a community-wide wireless network, established in the pre-DSL days.  The radio connection was configured as a bridge, so we built a FreeBSD-based router to handle the NAT functions, and put DNS on an internal server.  A separate wireless bridge/router handled DHCP for laptops and such, but the rest of the network had static addresses.

When Chaos Central’s West Coast Nexus coalesced, both the FreeBSD router and the DNS server were casualties of the move, so we relied on the [new] DSL modem for network services, assigning static addresses outside the DHCP scope for servers and workstations that needed to be accessed through SSH.  But, as the stable of virtual servers proliferated, the shortcomings of the DSL modem as a network appliance became painfully obvious.

NAT works by assigning a port to each client connection, through which requests are tunneled in and out of the system.  The first DSL modem we had didn’t keep track of which client requested what on which port for some services, so protocols like FTP, which listen on one port and transmit on another, didn’t work unless the ports were explicitly forwarded to the specific client that needed them.  This wasn’t a big deal at the time, since the Unix side of the business uses mainly SSH and most public download services offer a choice of HTTP or FTP.  So, NAT, while not smart, works, most of the time…

DNS became an issue once there were too many physical and virtual servers to keep track of in /etc/hosts (LMHOSTS for Windows clients).  The “little appliance that almost could” uses dynamic DNS, by which the client offers its name to the server, so machines can find each other by name.  But, the user interface doesn’t allow a lot of configuration options, so it had to go.

When setting up a private LAN DNS zone, we like to use the form  “company.lan” and generate named.conf.local files and zone files accordingly.  In these, we list the static addresses and server names, and also assign names like “dhcp-2” to enough addresses in the DHCP scope to cover the likely number of clients.  Printers, wireless routers, and portable machines are easier to use as DHCP clients, and, as we shall see later, DHCP with reserved addresses can simplify static assignments as well.
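As a sketch of what that looks like, here is a minimal hypothetical zone file for “company.lan” (every name, address, and timer below is invented for illustration; your serial, addresses, and record list will differ):

```
; /etc/bind/db.company.lan -- hypothetical example only
$TTL    604800
@           IN  SOA  ns1.company.lan. admin.company.lan. (
                     2011011001  ; serial
                     604800      ; refresh
                     86400       ; retry
                     2419200     ; expire
                     604800 )    ; negative cache TTL
@           IN  NS   ns1.company.lan.
ns1         IN  A    192.168.1.2    ; the BIND9 server itself
fileserver  IN  A    192.168.1.3    ; a static server address
dhcp-1      IN  A    192.168.1.100  ; names covering the DHCP scope
dhcp-2      IN  A    192.168.1.101
```

A matching reverse zone (1.168.192.in-addr.arpa) with PTR records completes the picture, so both name-to-address and address-to-name lookups work on the LAN.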

Which brings us to DHCP.  The modem’s DHCP service doesn’t allow for a lot of configuration.  Despite offering to do so through the interface, it just didn’t seem to “want” to substitute our LAN DNS server for the internal and external DNS services.  That meant manually editing the /etc/resolv.conf file every time the DHCP lease was renewed on clients that needed to address local machines that didn’t use Dynamic DNS (a number of Linux distros do, but some do not–I personally don’t like DDNS because of the potential for name conflicts when you allow servers to name themselves).

So, the next step was to set up our own DHCP server–first turning off the DHCP service in the modem.  Having our own service allows us to specify the local DNS server, domain, and search domain, and, better yet, to map the MAC addresses of various machines to static addresses outside the dynamic range, providing reserved addresses for those machines.  There is a definite advantage to having everything use DHCP–you no longer have to modify the network configuration on each machine if you move or add a service; you just change the service records in the DHCP server and renew the leases on the clients.  We’re using Ubuntu 10.10 Server Edition for DNS and DHCP, with BIND9 for name service and DHCP3 for address assignment, running as a virtual server.  The setup was quite easy, but, then, we’ve been doing this for 15 years and had the DNS zone templates from the old site archived.  We’re used to hand-editing the files, but Webmin does a great job of guiding the new user through the setup process and managing the services afterward.
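To make the idea concrete, here is a hypothetical dhcpd.conf fragment of the sort DHCP3 reads (the subnet, MAC address, and host names are invented for illustration; note that the reserved address sits outside the dynamic range):

```
# /etc/dhcp3/dhcpd.conf -- hypothetical fragment, sketch only
subnet 192.168.1.0 netmask 255.255.255.0 {
  range 192.168.1.100 192.168.1.150;       # dynamic pool
  option domain-name "company.lan";
  option domain-name-servers 192.168.1.2;  # our own BIND9 server
  option routers 192.168.1.1;              # the DSL modem
}

# reserved address outside the dynamic range, keyed to the MAC address
host fileserver {
  hardware ethernet 00:16:3e:aa:bb:cc;     # hypothetical MAC
  fixed-address 192.168.1.3;
}
```

Move the fileserver to new hardware and only the `hardware ethernet` line changes; every client keeps finding it at the same name and address.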

The next step in the process of taking charge of your own network is to configure the DSL or cable modem as a pass-through device, add a second network card to a spare machine, and build your own router, with more reliable NAT and a more configurable firewall, plus a local NTP time server.  But that’s a future project.  Right now, we’re evaluating network storage solutions for $CLIENT, and took time out to clean up and fix things to make that easier.
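For the curious, the NAT piece of such a homebuilt Linux router is small.  Here is a hypothetical iptables-save fragment, assuming eth0 faces the modem and eth1 faces the LAN (the interface names and the bare-bones policy are assumptions; a production firewall needs real filter rules around this):

```
# /etc/iptables.rules -- hypothetical sketch of the NAT piece only
*nat
-A POSTROUTING -o eth0 -j MASQUERADE
COMMIT
*filter
-A FORWARD -i eth1 -o eth0 -j ACCEPT
-A FORWARD -i eth0 -o eth1 -m state --state ESTABLISHED,RELATED -j ACCEPT
COMMIT
```

Enable IP forwarding (net.ipv4.ip_forward=1) and load the rules with `iptables-restore < /etc/iptables.rules`, and the box replaces the modem’s NAT.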


System Administration Rule #1: Thou Shalt Not Lose Thy Customer’s Data

Garrison Keillor opens the monologue on his weekly radio variety show with “It’s been a quiet week in Lake Wobegon…”  Well, it has definitely not been a quiet month at Chaos Central, which accounts in part for a long silence in this forum.  It began with an ominous question from $CLIENT one Friday morning…

“How do you recover an Amanda backup set manually?”  OK, innocent enough.  Sometimes it is faster to do it that way, and easier than installing the client software to restore data to a different system than the one from which it was backed up.  My response: the answer is found in the header of the first archive file, and you’ll need to figure out how to reassemble a sliced dataset.  Amanda is a popular Open Source backup tool for Unix.  $CLIENT runs it as a disk-to-disk application, which we do as well here at Chaos Central; USB disks are far cheaper than tape jukebox systems.  We, meaning system administrators in general, back up data regularly, so as not to violate the first rule of system administration:  Thou shalt not lose thy customer’s data.  Sometimes customers delete files by accident, sometimes they delete files on purpose and later find out that was a mistake, and sometimes disks fail.  We don’t guarantee you can get all of your data back, but, usually, if it existed for at least a day, it’s on a backup tape or disk and we can retrieve it, all things being equal…

On this particular week, things went wrong, in a perfect storm.  The data in question had been archived, which is to say it was not part of the active dataset, so was not being backed up regularly.  The archive was stored offsite, but accessible through the network.  Following the principle of “trust but verify,” $CLIENT had kept a backup copy of the data before it was archived.  Data is like forensic evidence–in order to be trusted, it must be maintained through a chain of custody that is trusted.  And, in the spirit of Rule #1, there must be at least two complete copies at all times.

As feared, the off-site archive copy had become corrupted, and the off-site storage agency had no backup of it.  Not all of the files were damaged, but the missing ones were critical.  So, out comes the original backup, which had been preserved on-site for just this contingency.  However, being somewhat “aged,” the backup index had not been updated to the current software revision level, so file recovery through the normal program interface was not an option.  At this point, we have copy number 1 damaged, and the backup intact but stubbornly non-recoverable.  Perfect storms require multiple unlikely, possibly unrelated events to coalesce, plus some definable human error in judgment in dealing with the combination, to become memorable bad examples and the stuff of books and movies.  Two unrelated events had converged now, and the ice was thinning rapidly, so to speak.  Back at Chaos Central, we were still blissfully unaware of the “whole story,” but would soon be drawn into the rescue operation.

The primary defense of Rule #1 is to ensure there are two verifiably good copies of any data at all times.  The correct response at this point would be to make another copy of the known good but inconveniently unusable backup before resorting to manual extraction measures.  And, that was the intent.  Except for a tiny flaw in the process, whereby a chunk of the backup (which, as we will see, was stored in individual blocks or slices–140,000 of them) was moved to the test area instead of copied.  At this point, the integrity of the sole remaining copy was compromised, but not yet beyond recovery.

What happened next involved yet another human error, caused first by an imperfect understanding of the exact semantics of the manual recovery procedure (partly due to exceptionally vague documentation) and second by applying it in such a way as to write into the directory containing the only copy of the critical first blocks of the archive.  The manual recovery procedure called for using the Unix ‘tar’ (tape archiver) command with a ‘-G’ option, which the manual says “handles the old GNU incremental format.”  Whatever that means.  Sounds innocuous, right?  A lot of these open source tools assume that you might be taking data from one system and importing it into another, and use lowest-common-denominator functionality by default.  The word “incremental” implies “partial,” right?  So we should be safe using it.  No.  What it actually does, and what happened when our hapless S/A applied it with the target directory set to the same directory containing the two archive blocks, was “remove files in the target directory that are not contained in the archive.”  That behavior appears in the software design notes, but not in the user manual.  In my own cautious way, I don’t usually expand tar files into the directory that contains the archive, as a matter of principle, so this normally wouldn’t be an issue, but the option would have been annoying at best even in a partial restore into a directory containing other files.

The effect was that, yes, the first few files in the archive were restored, but now the beginning blocks of the archive were inexplicably missing.  Inexplicably, that is, until the investigation and rescue operation initiated from Chaos Central discovered the horrible truth about what ‘-G’ does.  The tape archive itself consists of hundreds of thousands of files packed into one, which is then compressed.  Losing the first few blocks of the archive lost not only the first few files of the data, but also the dictionary needed to translate the compressed gibberish into the first chapter.  A simple data recovery operation had now escalated into a data repair operation, requiring some advanced skills, much research, and not a small amount of luck.

A bit of research on the Web showed that, yes, the dictionary in a gzipped file is reset from time to time, i.e., at compression-block boundaries, with a general description of how one could, by trial and error, find an intact block in what was left of the broken dataset.  But there was no readily available implementation of a solution anywhere.  So, I wrote one: a short Perl script that doesn’t attempt to find the first such block, just the first one that starts on a byte boundary (compression works in bits, not whole bytes, so not all blocks do).

#!/usr/bin/perl -w
# Repair a broken gzip file
# Input file is the tail of a corrupt gzip file.
# This script was written to recover a gzipped tar archive from
# a split gzipped file in which one or more segments are missing,
# such as a corrupt backup tape. Apply this to any segment.
# Usage: repair-gzip.pl brokenfile fixedfile.gz
# Script creates file "errmsg.txt"
# and uses newgzip.gz as a working filename
# for the recovery process
# NOTE:  Use this script as a guideline only--
#        adjust to fit your particular conditions.
use strict;
$| = 1;  # unbuffered output, so the progress dots appear as we work

# create a valid GZIP header (binary): magic bytes, deflate method,
# no flags, zeroed timestamp, no extra flags, OS = Unix
my $header = pack("C10", 0x1f, 0x8b, 0x08, 0, 0, 0, 0, 0, 0, 0x03);

open(RAW, "<", $ARGV[0]) or die "Cannot open $ARGV[0]: $!\n";
binmode(RAW);

# slurp the test slice into memory
my ($buf, $testfile) = ("", "");
while ( read(RAW, $buf, 32768) > 0 ) {
 $testfile .= $buf;
 print ".";  # output a dot for each 32K block to show we're working
}
close(RAW);
print "\nRead " . length($testfile) . " bytes\n";

# search for a possible compression block boundary, shifting off the
# first byte after each failure; low three bits of 100 (BFINAL=0,
# BTYPE=10, dynamic Huffman) mark a candidate deflate block header
while ( length($testfile) ) {
 my $fbyte = ord(substr($testfile,0,1)) & 7;
 if ( $fbyte == 4 ) {
  my $testzip = $header . $testfile;
  open(RUN, ">", "newgzip.gz") or die "Cannot write newgzip.gz: $!\n";
  binmode(RUN);
  syswrite(RUN,$testzip,length($testzip));
  close(RUN);
  system("gzip -t newgzip.gz 2>errmsg.txt");
  open(ERR, "<", "errmsg.txt") or die "Cannot read errmsg.txt: $!\n";
  my $rtn = join("", <ERR>);
  close(ERR);
  $rtn =~ y/\n//d;
  print length($testzip) . ": " . $rtn . "\r";
  # a CRC error means gzip decoded the deflate stream all the way to
  # the (inherited, so mismatched) trailer: the block boundary is good
  if ( $rtn =~ /crc error/ ) {
   print "\nSuccess\n";
   rename("newgzip.gz", $ARGV[1]);
   unlink("errmsg.txt");
   exit 0;
  }
 }
 $testfile = substr($testfile,1);
}
print "Failed to find a valid compression block in $ARGV[0]\n";
unlink("errmsg.txt");
exit 1;

Well, there it is.  It prints out some dots to show it is reading the file, then prints out where it is in the file, so the operator doesn’t get nervous about what it is doing to the data or how long it will take (a fairly long time, as you may have to search a long way into the file to find a usable block, and the script writes out a test file shortened by one byte on each pass).

So, we got a valid chunk of a compressed file that we could graft onto the front of the remaining 139,996 chunks (we lost two, and had to search through two more before we found a compression block).  You will lose some data with this procedure, but we were looking for specific files, and didn’t want to have to shift a terabyte of data bit by bit, so we looked for a block on a byte boundary.

But, now that we could decode the data, it was still gibberish, because it started in the middle of a file.  Fortunately, there was a Perl script available on the Internet (search for find_tar_headers.pl), which worked “out of the box” to find the start of the next file.  In this particular archive, we lost 30MB of compressed data off the front, and another 400MB of uncompressed data.  Fortunately, the files we needed to recover were further down in the archive.
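The idea behind that script can be illustrated with standard tools.  Tar headers are 512-byte records carrying the magic string “ustar” at offset 257, so scanning for that string locates the next intact file header in a damaged stream.  This sketch (not the actual find_tar_headers.pl) fakes the damage and finds the header:

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"

# make a real tar archive, then chop 100 bytes off the front to
# simulate a stream with a damaged beginning
echo "hello" > file.txt
tar -cf good.tar file.txt
tail -c +101 good.tar > damaged.bin

# every tar header record carries the magic "ustar" at offset 257;
# grep -abo reports the byte offset of each occurrence, so the next
# intact header in the damaged stream is easy to locate
grep -abo ustar damaged.bin | head -n 1
```

The header begins 257 bytes before the reported offset; from there, a real recovery resumes extraction on the next record boundary.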

For the remainder of the process, we added a dummy Amanda header to the recovered gzip archive block to simplify the script we wrote.  That script stripped the Amanda headers from all of the chunks, concatenated them together (after discovering that the last block on each “tape” has to be discarded, as it is truncated when EOT is reached and restarted on the next tape), unzipped the result, used ‘tail’ with an offset to strip the unusable partial file off the front, and then extracted the files from the ‘repaired’ (but incomplete) tape archive.  The script ran for about 20 hours, working on the repaired 1.4-terabyte backup dataset.  The files we lost off the front end were, fortunately, intact in the primary copy of the dataset, so we preserved Rule #1, not entirely by skill alone.
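The header-stripping step is simple in principle.  This sketch fakes two chunks, each with a 32 KiB dummy header standing in for the real Amanda chunk header (32 KiB is the common default, but treat the size as an assumption and verify it on your own installation), then splices the payloads back together with dd:

```shell
set -e
workdir=$(mktemp -d)
cd "$workdir"

# fabricate two chunks: 32 KiB of dummy header, then payload
for i in 1 2; do
 head -c 32768 /dev/zero > "chunk$i"
 printf 'payload-%d\n' "$i" >> "chunk$i"
done

# skip the first 32 KiB block of each chunk and concatenate the rest;
# on a real dataset you would also drop the truncated final chunk of
# each "tape" before this step
for f in chunk1 chunk2; do
 dd if="$f" bs=32768 skip=1 2>/dev/null
done > joined

cat joined
```

On a real dataset the result of the concatenation is the gzip stream, ready for gzip -d and the tar extraction that follows.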

But, this could have been prevented, obviously, by following a few simple steps.  First, as it says on the cover of the Hitchhiker’s Guide to the Galaxy, Don’t Panic.  Yes, the customer wants his data “right now,” but, given the choices of “soon” or “never,” I’m sure he would prefer the former.  Take time to plan each step of the recovery process, and carefully make a new plan if the first does not succeed.  Second, don’t use backup software to create archives.  Backups are meant to keep a copy of “live” data that supposedly gets checked often enough to be correct, and the backup is refreshed from time to time and managed by a directory.  Archives, by nature, preserve at least two static copies for an indefinite time, and require quite different directory and validation/verification processes.  Third, if one of the two copies of your data is outside your immediate chain of custody, it isn’t valid–you need a third copy.  Fourth, if you need to experiment with the data, experiment on a copy, not the original.  Fifth, if you are in a data recovery operation where only one copy exists, use a team approach and double-check every step.  Plan what you are going to do, understand what the expected results of an operation will be, and make sure the operation is repeatable or reversible.  And sixth, research your options carefully.  We learned that damaged gzip files and damaged tar files are at least partially recoverable, but at considerable expense in time and effort.  Above all, be careful.  Rule #2 says, “If you violate Rule #1, be sure your resume is up to date or you have another marketable skill.”