Yesterday, this site went down for about four hours. Complaints started rolling in from my millions of ardent followers, spurring me into action. Join me as I deconstruct what went wrong, how I fixed it, and how I tried to prevent the problem from occurring again.

Infrastructure Background

This site is built with Hugo, a static site generator written in Go. It¹ uses a slightly modified version of the Nix theme by Matúš Námešný. It’s deployed automatically using GitLab’s CI/CD (Continuous Integration/Continuous Deployment) service. Every time I push a commit to the remote repository, a server runs a deployment script that I wrote. The script generates the site and then uploads it to my web host.

I hadn’t had any problems since getting this flow working, until now.

The Incident

Yesterday at 8:22pm, I pushed a benign commit to the remote repository. The commit in question simply removed a draft of a post I had been writing and decided not to publish. As with all pushed commits, this triggered a build and push of the site.

Hours later, I visited my site and noticed something strange: the homepage was 404ing. So were all the other pages that should have been there. I looked through my CI logs and found that:

1. Hugo had failed to build my site
2. My script attempted to upload anyway
3. It uploaded my .git directory

The first one is okay; failures happen even though ideally they wouldn’t. The second one is worrying: Why would that happen? The third one had the potential to be disastrous.

The consequences

Luckily for me, there were few consequences. To my knowledge, my site gets very few visitors. Also, it’s not like my site runs important tools or services for me, since it’s a static site.

However, there is one thing that could have been very bad…

The potential consequences

As I mentioned above, for those four hours, my entire .git directory for the repository that holds this site was available for all to browse at https://reeshill.net/.git/. This of course means that not only was the current state of the repository publicly accessible, but all past version history was as well.

My CI script needs to authenticate with my web host to upload the generated site. It does this with an app-specific password. For the script, this password needs to be stored somewhere the CI server will have access to. The naïve solution would be to hard-code the password into the script that runs on CI. If I had done this, even if I had later removed it, the password would still exist in the history of the repository. Had this been the case, an attacker would have had plenty of time to discover it in the version history and upload malicious content to my web host, such as a JavaScript cryptocurrency miner.

Fortunately, Past Me was smart enough not to put any secrets in version control. The password is actually stored by GitLab and given to the CI server as an environment variable. I trust GitLab a lot more than I trust myself to keep my secrets safe.
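A script that depends on a CI-provided secret can check for it up front, so that a missing variable fails loudly rather than surfacing later as a confusing authentication error. The following is a minimal sketch, not the script this site uses; the variable name DEPLOY_PASSWORD is a hypothetical stand-in for whatever name is configured in the CI settings.

```shell
#!/bin/sh

# check_password succeeds only if DEPLOY_PASSWORD (a hypothetical name)
# is set to a non-empty value in the environment.
check_password() {
    if [ -z "${DEPLOY_PASSWORD:-}" ]; then
        echo 'DEPLOY_PASSWORD is not set; refusing to deploy' >&2
        return 1
    fi
    return 0
}
```

A deploy script could call `check_password || exit 1` before doing anything else. The secret itself never appears in the repository; only its name does.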

The Cause

Two big things went wrong at once to cause this failure. If either of them had gone right, the damage would have been nonexistent or minimal. But because both failed at once, the consequences were greater.

The immediate cause

The immediate cause of failure was that the Hugo installation started failing. I was using a Docker image of Alpine Linux that contained lftp, which I use as an FTP client. However, that image did not contain Hugo, and when I set it up I was afraid to mess with it, so CI would install Hugo from the network every time it ran. Something must have changed in the Alpine repositories, because suddenly running hugo resulted in the error:

Error relocating /usr/bin/hugo: _ZNSt7__cxx1118basic_stringstreamIcSt11char_traitsIcESaIcEEC1Ev: symbol not found

This should have been no big deal. The CI should have failed, and then I would have dealt with fixing the Hugo installation. But because of problem #2, the whole situation spiraled out of control.

The real cause

When I wrote the script that builds and uploads the site in April 2018, I did not yet understand how shell scripting works. Crucially, I didn’t understand exit codes. When you run a shell command, part of the result is an exit code: a number which is 0 if the command was successful and some other value if it failed.

By default, a script that contains a sequence of commands will run all of those commands in order, even if one or more of them fail. For an example, see the following script:

#!/bin/sh

echo 'Started the script!'
ls directory-that-does-not-exist
echo 'The exit code of the previous command was' "$?"

The second command (the ls) will fail if the named directory does not exist. But the script will keep running. Here’s the output from running the above script:

$ ./steamroll.sh 
Started the script!
ls: directory-that-does-not-exist: No such file or directory
The exit code of the previous command was 1

The second echo runs despite the previous command failing with a nonzero exit code.

That’s kind of scary! In most programming languages, if the program encounters an error, it will crash or display some sort of exception that halts execution. Because the default shell behavior is to keep going regardless of failure status, my script failed to generate the site but proceeded to the next step anyway.

In my case, when lftp was unable to upload the directory it was supposed to upload (which was a child of the current directory), it instead uploaded the current directory, which was the repository root. That looked like this:

/builds/jarhill0/homepage/public: No such file or directory
Transferring file `README.md'
Transferring file `ci-upload.sh'
Transferring file `config.toml'
Making directory `.git'
Transferring file `.git/FETCH_HEAD'
Transferring file `.git/HEAD'
[etc.]

Ultimately, the failure of my script to halt once a command failed was what led to the incident. Small failures should be expected, but when they happen there should be a clear indication of failure and the process should not continue.
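Independently of how the script handles exit codes, the upload step can refuse to run unless the build output actually exists. The following is a minimal sketch of such a guard, not part of my original script. It assumes the generated site lives in public (as in the log above) and that a finished build contains an index.html; the function name require_site_dir is hypothetical.

```shell
#!/bin/sh

# require_site_dir succeeds only if the given directory exists and contains
# an index.html, i.e. it looks like a finished site build.
# (The index.html check is an assumption about what a complete build holds.)
require_site_dir() {
    dir=$1
    if [ ! -d "$dir" ]; then
        echo "error: build output directory '$dir' does not exist" >&2
        return 1
    fi
    if [ ! -f "$dir/index.html" ]; then
        echo "error: '$dir' has no index.html; the build looks incomplete" >&2
        return 1
    fi
    return 0
}
```

Calling `require_site_dir public || exit 1` immediately before the upload would have stopped the job at the "No such file or directory" stage, before anything was transferred.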

The Fix

There were essentially three problems I had to fix:

1. My site was down
2. My builds weren’t failing
3. My builds were failing

The quick fix

Since I was still able to build my website on my local machine, I did that and uploaded it. That gave me time to fix the underlying problems.

The process fix

If one of my builds didn’t succeed, that should be reflected in the script’s exit status.

It took six characters to fix my script:

set -e

Adding this line towards the top of the script does the following, according to the man page for set:

-e  Exit immediately if a command exits with a non-zero status.

That’s exactly what we want! As soon as a command fails, the script should fail. With this option set, any failures that may occur down the road will stop the CI job with a failure status. Perfect!

Here’s what it looks like in our test script from above:

#!/bin/sh

set -e

echo 'Started the script!'
ls directory-that-does-not-exist
echo 'The exit code of the previous command was' "$?"

And the output:

$ ./stopsign.sh 
Started the script!
ls: directory-that-does-not-exist: No such file or directory

It stopped once there was an error. Sweet!
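What matters to a CI server is the exit status of the script as a whole. Without set -e, a script's exit status is that of the last command it ran, so an earlier failure is invisible from the outside. The following sketch makes that visible by running the same commands in a child shell both ways and recording each exit status. The variable names are only for illustration, and it assumes no directory called directory-that-does-not-exist is present.

```shell
#!/bin/sh

# The same three commands as the test script, run in a child shell so that
# the parent can record the exit status a CI runner would see.
commands='echo started; ls directory-that-does-not-exist; echo finished'

# Without set -e: the child runs every command, and its exit status is that
# of the final echo.
status_without=0
sh -c "$commands" >/dev/null 2>&1 || status_without=$?

# With set -e: the child exits as soon as ls fails, with the status of ls.
status_with=0
sh -c "set -e; $commands" >/dev/null 2>&1 || status_with=$?

echo "without set -e: exit status $status_without"
echo "with set -e:    exit status $status_with"
```

The first line reports 0 and the second reports a nonzero value (the exact number depends on the ls implementation). That difference is why the broken builds looked successful before the fix.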

The build fix

Now that my build status was properly reflecting the failure that was occurring, I needed to make the builds succeed.

A quick search didn’t help me figure out what was causing the mis-installation of Hugo, and I didn’t feel like digging in further because I knew that I was misusing Docker. The whole point of Docker is that it leads to reproducible states. Instead of installing a new dependency on top of my Docker image every time, I should put all necessary dependencies in the image from the beginning. So I bit the bullet and figured out how to make (actually, adapt) a Dockerfile. Now, all of my CI builds will start with Hugo and lftp already installed. It didn’t take all that long, and now I have removed one more source of failure!

Conclusion

Some failures, like the bug in my script fixed by set -e, can lie dormant for quite a while (in my case, a year and a half) before being exposed when something else fails. When something works reliably for a while, we don’t often think to go back and check that it works the way we expect. And it’s very difficult to get everything right the first time.

I’m glad that this failure led to an improvement in the way I use Docker images. And I’m glad that the script has been fixed, because now any other failures will be caught and I will be alerted. Above all, I’m glad that things were configured well enough in the beginning that nothing bad happened to my site.

¹ As of this writing. Subject to change.

When Failures Align: A meta tale