Yesterday, this site went down for about four hours. Complaints started rolling in from my millions of ardent followers, spurring me into action. Join me as I deconstruct what went wrong, how I fixed it, and how I tried to prevent the problem from occurring again.
This site is built with Hugo, a static site generator written in Go. It[1] uses a slightly modified version of the Nix theme by Matúš Námešný. It’s deployed automatically using GitLab’s CI/CD (Continuous Integration/Continuous Deployment) service. Every time I push a commit to the remote repository, a server runs a deployment script that I wrote. The script generates the site and then uploads it to my web host.
From the time I got this flow working until now, I hadn’t had any problems with it.
Yesterday at 8:22pm, I pushed a benign commit to the remote repository. The commit in question simply removed a draft of a post I had been writing and decided not to publish. As with all pushed commits, this triggered a build and push of the site.
Hours later, I went to my site for something and noticed something strange: the homepage was 404ing. So were all the other pages that should have been there. So I looked through my CI logs and found that:
- Hugo had failed to build my site
- My script attempted to upload anyway
- It uploaded my entire repository, including the .git directory
The first one is okay; failures happen even though ideally they wouldn’t. The second one is worrying: Why would that happen? The third one had the potential to be disastrous.
Luckily for me, there were few consequences. To my knowledge, my site gets very few visitors. Also, it’s not like my site runs important tools or services for me, since it’s a static site.
However, there is one thing that could have been very bad…
The potential consequences
As I mentioned above, for those four hours, my entire .git directory for the repository that holds this site was available for all to browse at https://reeshill.net/.git/. This of course means that not only was the current state of the repository publicly accessible, all past version history was as well.
Fortunately, Past Me was smart enough not to put any secrets in version control. The upload password is actually stored by GitLab and given to the CI server as an environment variable. I trust GitLab a lot more than I trust myself to keep my secrets safe.
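A script that relies on a secret from the environment can check for it up front rather than discovering halfway through that it is missing. The following is only a sketch: FTP_PASSWORD is a hypothetical variable name chosen for illustration (the real name is whatever is configured in the CI settings), and require_password is a made-up helper.

```shell
#!/bin/sh
# Sketch: make sure a secret supplied by the CI environment is present.
# FTP_PASSWORD is a hypothetical name, not the variable this site actually uses.
require_password() {
    if [ -z "${FTP_PASSWORD:-}" ]; then
        echo 'FTP_PASSWORD is not set; refusing to continue.' >&2
        return 1
    fi
}

# Demo with a placeholder value; in CI the variable would already be set.
FTP_PASSWORD='placeholder-not-a-real-secret'
export FTP_PASSWORD
require_password && echo 'Password is available to the upload step.'
```

Because the value only ever lives in the environment, nothing secret ends up in the repository, even if the whole repository is accidentally published.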
Two big things went wrong at once to cause this failure. If either of them had gone right, the damage would have been either nonexistent or extremely minimal. But because they both managed to fail at once, the consequences were greater.
The immediate cause
The immediate cause of failure was that the Hugo installation started failing. I was using a Docker image of Alpine Linux that contained lftp, which I use as an FTP client. However, that image did not contain Hugo, and when I set it up I was afraid to mess with it, so CI would install Hugo from the network every time it was run. Something must have changed in the Alpine repositories, because suddenly hugo resulted in the error:
Error relocating /usr/bin/hugo: _ZNSt7__cxx1118basic_stringstreamIcSt11char_traitsIcESaIcEEC1Ev: symbol not found
This should have been no big deal. The CI should have failed and then I would deal with fixing the Hugo installation. But because of problem #2, the whole situation spiraled out of control.
The real cause
When I wrote the script that builds and uploads the site in April 2018, I did not yet understand how shell scripting works. Crucially, I didn’t understand exit codes. When you run a shell command, part of the result is an exit code: a number which is 0 if the command was successful and some other value if it was not.
By default, a script that contains a sequence of commands will run all of those commands in order, even if one or more of them fail. For an example, see the following script:
    #!/bin/sh
    echo 'Started the script!'
    ls directory-that-does-not-exist
    echo 'The exit code of the previous command was' "$?"
The second command (the ls) will fail if the named directory does not exist. But the script will keep running. Here’s the output from running the above script:
    $ ./steamroll.sh
    Started the script!
    ls: directory-that-does-not-exist: No such file or directory
    The exit code of the previous command was 1
The final echo runs despite the previous command failing with a nonzero exit code.
That’s kind of scary! In most programming languages, if the program encounters an error, it will crash or display some sort of exception that halts execution. Because the default shell behavior is to keep going regardless of failure status, my script failed to generate the site but proceeded to the next step anyway.
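For comparison, a failure can also be caught by hand, by testing each command's exit code before moving on. This is a minimal sketch rather than the script this site uses; run_step is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: run one command and report its exit code instead of ignoring it.
# run_step is a made-up helper used only for this illustration.
run_step() {
    if "$@" >/dev/null 2>&1; then
        echo "ok: $*"
    else
        # In the else branch, $? still holds the exit code of the failed command.
        status=$?
        echo "failed (exit code $status): $*"
        return "$status"
    fi
}

run_step true
run_step ls directory-that-does-not-exist || echo 'Stopping here instead of steamrolling on.'
```

Checking every step this way works, but it is easy to forget one, which is why a script-wide setting is attractive.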
In my case, when lftp was unable to upload the directory it was supposed to upload (which was a child of the current directory), it instead uploaded the current directory, which was the repository root. That looked like this:
    /builds/jarhill0/homepage/public: No such file or directory
    Transferring file `README.md'
    Transferring file `ci-upload.sh'
    Transferring file `config.toml'
    Making directory `.git'
    Transferring file `.git/FETCH_HEAD'
    Transferring file `.git/HEAD'
    [etc.]
Ultimately, the failure of my script to halt once a command failed was what led to the incident. Small failures should be expected, but when they happen there should be a clear indication of failure and the process should not continue.
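One extra layer of protection is to check that the thing about to be uploaded actually exists. The sketch below assumes the generated site lives in a directory named public (as in the log above) and that a built site contains an index.html at its top level; site_is_built is a hypothetical helper, and the real upload command is deliberately left out.

```shell
#!/bin/sh
# Sketch: refuse to upload unless the generated site directory looks real.
# Assumes a built site has an index.html at its top level.
site_is_built() {
    [ -d "$1" ] && [ -f "$1/index.html" ]
}

if site_is_built public; then
    echo 'Site looks built; safe to start the upload.'
else
    echo 'No built site found; skipping the upload.' >&2
fi
```

A guard like this is a backstop, not a replacement for stopping at the first failed command.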
There were essentially three problems I had to fix:
- My site was down
- My builds weren’t failing
- My builds were failing
The quick fix
Since I was still able to build my website on my local machine, I did that and uploaded it. That gave me time to fix the underlying problems.
The process fix
If one of my builds didn’t succeed, that should be reflected in the script’s exit status.
It took six characters to fix my script:

    set -e

Adding this line towards the top of the script does the following, according to the man page:

    -e Exit immediately if a command exits with a non-zero status.
That’s exactly what we want! As soon as a command fails, the script should fail. With this option set, any failures that may occur down the road will stop the CI job with a failure status. Perfect!
Here’s what it looks like in our test script from above:
    #!/bin/sh
    set -e
    echo 'Started the script!'
    ls directory-that-does-not-exist
    echo 'The exit code of the previous command was' "$?"
And the output:
    $ ./stopsign.sh
    Started the script!
    ls: directory-that-does-not-exist: No such file or directory
It stopped once there was an error. Sweet!
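What matters to a CI job is the exit status of the script as a whole. The sketch below runs the same two-command script with and without set -e and prints the status each one finishes with; exit_status_of is a hypothetical helper written for this comparison.

```shell
#!/bin/sh
# Sketch: compare the final exit status of a failing script with and without set -e.
exit_status_of() {
    # Run the script text in $1 in a fresh shell and print only its exit status.
    status=0
    sh -c "$1" >/dev/null 2>&1 || status=$?
    echo "$status"
}

without=$(exit_status_of 'false; echo done')
with=$(exit_status_of 'set -e; false; echo done')

echo "without set -e: $without"
echo "with set -e:    $with"
```

Without set -e the script reports success, because its last command (the echo) succeeded. With set -e it stops at the failing command and reports a nonzero status, which is what lets the CI job be marked as failed.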
The build fix
Now that my build status was properly reflecting the failure that was occurring, I needed to make the builds succeed.
A quick search didn’t help me figure out what was causing the mis-installation of Hugo, and I didn’t feel like digging in further because I knew that I was approaching it the wrong way. The whole point of Docker is that it leads to reproducible states.
Instead of installing a new dependency on top of my Docker image every time, I should put all necessary dependencies in the image from the beginning. So I bit the bullet and figured out how to make (actually, adapt) a Dockerfile. Now, all of my CI builds will start with Hugo and lftp already installed.
It didn’t take all that long, and now I have removed one more source of failure!
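With the tools baked into the image, a build script can confirm they are present before doing any work. This is a sketch under that assumption, not part of the actual deployment script; missing_tools is a hypothetical helper.

```shell
#!/bin/sh
# Sketch: list any required tools that are not on PATH in the current image.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

missing=$(missing_tools hugo lftp)
if [ -n "$missing" ]; then
    echo "Missing from this image: $missing" >&2
else
    echo 'hugo and lftp are both installed.'
fi
```

In a real CI script the failing branch would exit with a nonzero status so that the job stops before the build step.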
Some failures, like the bug in my script fixed by set -e, can lie dormant for quite a while (in my case, a year and a half) before being exposed when something else fails. When something works reliably for a while, we don’t often think to go back and check that it works as we expect it does. And it’s very difficult to get everything right the first time.
I’m glad that this failure led to an improvement in the way I use Docker images. And I’m glad that the script has been fixed, because now any other failures will be caught and I will be alerted. Above all, I’m glad that things were configured well enough in the beginning that nothing bad happened to my site.
[1] As of this writing. Subject to change.