[Home](../../../index.html)

# Archiving websites

2025-02-27. Omar Mustardo.

I want to take a snapshot of my own website as it changes over time. Perhaps a yearly snapshot. I may capture other websites while I'm at it. I test out some options here, but this is not a deep dive into any of these tools. It's initial impressions from spending a couple of hours investigating. I didn't find an entirely satisfactory result.

## Kiwix / Zimit

A few years ago I downloaded a 95GB copy of Wikipedia using https://kiwix.org/en/ in the ZIM file format. Kiwix made the process easy, and I like knowing that I will theoretically always have a local copy available.

Most websites don't have a ZIM file nicely prepared for everyone to download. Googling led to https://kiwix.org/en/zim-it-up/ and https://github.com/openzim/zimit. Zimit takes a URL and gives you a ZIM file. Great! For small websites, it will even do the crawl for you. I wanted to run it locally since I'll probably take automated snapshots of my own website every year or so.

Running it locally required installing Docker, which had a few pitfalls, but after that it was easy to run. I'm not going to document Docker since it's a big thing and anything I document is likely to get out of date. I will say that you should follow the official instructions on their website. I originally tried `snap install docker`, but the version there was too old and couldn't even run Docker's hello-world.

```
$ docker pull ghcr.io/openzim/zimit:dev
$ docker run -v /home/omustardo/Desktop/website_snapshot/:/output ghcr.io/openzim/zimit zimit --seeds https://www.omustardo.com --name omustardo.com.zim
```

Opening the output file in Kiwix unfortunately doesn't show anything on most pages. The trouble is that I use an unusual method to display my website: I write markdown and have javascript dynamically transform it into HTML. View the source of this page if you're interested in that (caveat: no promises that this paragraph will stay current. I may change how I architect my site). Perhaps the Kiwix viewer doesn't fully support javascript? Or maybe it's an issue with local paths? I'll keep Zimit in mind for capturing standard websites.

## (web)httrack

https://www.httrack.com

This was a suggestion from Claude that I hadn't heard of.

```
$ sudo apt-get install webhttrack
$ webhttrack
```

This opened in my browser with a page that guided me through some basic steps. I was able to download most of omustardo.com, but it missed some javascript files and didn't download any images or other content.

I also tried the CLI version with similar results:

```
$ sudo apt-get install httrack
$ httrack https://www.omustardo.com -O /tmp/website_snapshot
```

There's more information at https://www.httrack.com/html/fcguide.html

This tool has a ton of options and is probably powerful if you take the time to dig into it. I don't really want to do that unless there's no other option, though the sketch below is roughly where I'd start.
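If I do come back to it, httrack's filter rules (`+pattern`) and its "near" option look like the way to pick up the missing images and other assets. This is an untested sketch, not a command I've verified against my site:

```
# Untested: "+pattern" filters explicitly include asset types that the default
# crawl skipped, and --near fetches non-html files linked from downloaded pages.
$ httrack https://www.omustardo.com -O /tmp/website_snapshot \
    "+*.omustardo.com/*" "+*.png" "+*.jpg" "+*.svg" "+*.css" "+*.js" \
    --near
```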
## wget

I've used wget to download individual files before, but it supports much more than that. Iterating with Claude, I ended up with very similar results to httrack: it downloads all of the html, js, and css but misses txt, go, md, and all images. Like httrack, I suspect wget could be made to work with some time reading the manual.

```
wget \
  --recursive \
  --level=5 \
  --page-requisites \
  --convert-links \
  --wait=2 \
  --random-wait \
  --execute robots=off \
  --user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36" \
  --no-parent \
  --limit-rate=200k \
  --domains=omustardo.com \
  --reject="*action=*,*diff=*,*oldid=*,*printable=*,*redirect=*" \
  --accept=".*" \
  --include-directories=/ \
  --mirror \
  --output-file=wget-log.txt \
  https://www.omustardo.com/
```

## Summary

This is a surprisingly hard task! Despite having the worst output (web pages with no visible content), I suspect Zimit was actually the closest to what I wanted:

* After dealing with Docker, it was very easy to use.
* I'm pretty sure it downloaded all of the content from my website, unlike the others, which missed a lot. Its output file was 27MB, which matches the size of my website.
* It didn't display content, probably due to javascript not working. This could be worked around by changing how I generate my website from markdown: do the markdown-to-HTML step when I deploy the site, rather than doing it with javascript on page load. I actually had this implemented using `pandoc`, but I preferred the formatting that `puremd.js` uses. With a tiny bit of css this should be solvable. (A rough sketch of a deploy-time conversion is at the end of this post.)
* At worst, I could save the archive and use zimdump to extract content. As long as the dump keeps the original website file structure, it would be what I want. zimdump is part of https://github.com/openzim/libzim. (There's an example invocation at the end of this post too.)

I also came across https://github.com/iipc/awesome-web-archiving which has a lot of leads.
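If I switch to converting markdown at deploy time, the conversion step could be as small as a shell loop over the source files. This is a rough, untested sketch with placeholder paths (`content/` and `public/`), not my actual deploy setup:

```
# Untested sketch: convert every .md under content/ into an .html under public/,
# preserving directory structure. Assumes filenames without spaces.
for f in $(find content -name '*.md'); do
  out="public/${f#content/}"
  out="${out%.md}.html"
  mkdir -p "$(dirname "$out")"
  pandoc "$f" --standalone --output "$out"
done
```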
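For pulling files back out of a ZIM archive, zimdump's `dump` command looks like the right tool. I haven't run it on my snapshot yet, so double-check the flags against `zimdump --help`:

```
$ zimdump dump --dir=/tmp/website_extracted omustardo.com.zim
```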