In preparation for PyCon 2014, the organizers wanted to make static archives of the sites for past years. The 2011 site was already suffering from bitrot (the stylesheets had disappeared), and we wanted to grab the 2012 and 2013 sites before they too started to rot. Noah Kantrowitz was the instigator, and he suggested using httrack. I volunteered to help out, and settled on httrack by default. I used httrack 3.43.9, the version available in Debian 6.0 (squeeze), since that's what I'm running on my personal web server.
The initial mirror was easy:
mkdir /tmp/pycon && cd /tmp/pycon
httrack -w -o0 -K4 -c20 https://us.pycon.org/{2011,2012,2013}
where:
- -w means mirror this site (as opposed to merely fetching individual resources, which you can do with curl or wget)
- -o0 means "don't fetch error pages" -- I want a 404 to remain a 404 (the catch is that we lose pretty Django-rendered error pages)
- -K4 means keep original URLs -- without this, links to /2012/about/ become /2012/about/index.html (minor yuck)
- -c20 is an attempt to go faster with 20 concurrent connections (it didn't help in this case)
To get my web server to serve the mirrored content statically, I put the mirrored trees in a simple directory structure:
mkdir -p /var/www/pycon.gerg.ca
mv us.pycon.org/{2011,2012,2013} /var/www/pycon.gerg.ca/.
cd /var/www/pycon.gerg.ca
Of course, I also had to create a DNS record and configure my web server to serve that directory as pycon.gerg.ca.
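The server side is nothing fancy: any server that can map a hostname to a static directory tree will do. With Apache, for example, a minimal virtual host would look something like this:

# hypothetical Apache vhost -- the exact server config isn't important here
<VirtualHost *:80>
    ServerName pycon.gerg.ca
    DocumentRoot /var/www/pycon.gerg.ca
</VirtualHost>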
In order to keep track of my changes, I turned each year into its own Mercurial repository:
cd 2011
hg init
hg add -q
hg commit -m"mirror of http://us.pycon.org/2011/, grabbed by httrack 3.43-9, ending 2013-06-19 12:47"
(and similar for 2012, 2013). (I could have put all three years into one big repository, but doing it this way seems more future-proof. At some point, we're going to want to archive 2014 and 2015 similarly.)
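The "similar for 2012, 2013" part is easy enough to script; something like this loop (zsh or bash, commit messages abbreviated) would do it:

for year in 2012 2013; do
    (cd $year && hg init && hg add -q && hg commit -m"mirror of http://us.pycon.org/$year/, grabbed by httrack")
done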
Now I can start finding and fixing problems. If a fix step goes horribly wrong, I can just hg revert the result and try again.
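For example, to throw away an uncommitted mess:

hg status       # see what the botched step touched
hg revert --all # discard it all, back to the last commit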
The 2011 and 2012 sites had remnants of revision history -- presumably a feature of Pinax? The static archive only needs to show the final revision of each page, so I nuked the revision history:
cd 2011
hg rm -I 're:.*/rev[0-9]+/' -I '**/history/*' .
hg commit -m"remove old revision history"
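The same -I include patterns work with hg locate, so it's easy to preview what an hg rm will hit before actually running it:

hg locate -I 're:.*/rev[0-9]+/' -I '**/history/*' | head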
The interface for editing pages is useless, since it just redirects to a Django login page, which of course won't work in the static archive. Get rid of it:
hg rm -I '**/edit/*' .
hg ci -m"remove edit pages (they just redirect to the login page)"
All three sites had a bunch of mystery pages with paths like 2011/account/login/index0000.html. I'm guessing there were links from old revisions to those pages, which is why httrack captured them. Now that the old revisions are gone, make sure nothing left in the static site references them:
hg locate -0 | xargs -0 grep 'index[0-9a-f][0-9a-f][0-9a-f][0-9a-f]'
That found nothing, so remove them:
hg rm account/login/index????.html account/login/index????-?.html account/signup/index????.html
hg ci -m"remove mystery index????.html pages (unreferenced)"
There were a bunch of gratuitous absolute URLs in links. I found them, or at least the ones that were likely to be HTML attribute values, with this:
hg locate -0 | xargs -0 egrep --color=always '=\"https?://us\.pycon\.org/2011/' | less -r
For a quick sanity check, I looked at the HTML source for some of the pages with absolute links: even in the dynamic, Django-generated site, the absolute links were there. So this was an error in the original site, not an artifact introduced by httrack.
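A quick way to spot-check the live site from the command line, while it's still up (the /2011/about/ path is just an example):

curl -s https://us.pycon.org/2011/about/ | egrep '=\"https?://us\.pycon\.org/2011/'

Any output means the absolute links come from the Django-generated HTML, not from httrack.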
To remove them, I used a slightly more conservative regex:
# note: this is zsh array syntax; bash can just use `...`
files=(`hg locate -0 | xargs -0 egrep -l '=\"https?://us\.pycon\.org/2011/'`)
perl -pi~ -e 's+(href|src)=\"https?://us\.pycon\.org/2011/+$1=\"/2011/+gi' $files
hg ci -m"fix gratuitous absolute links"
(Yes, there is still room in the world for Perl: sometimes, sed isn't quite up to the job.) Note that the regexes changed for each year: if the 2012 site had links to the 2011 site, I left them absolute.
Note how I did not try to fully relativize those URLs: I'm assuming the site will still be deployed as http://us.pycon.org/2011/. Seems like a safe bet.
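For 2012, that means the same command with the year swapped, something like:

files=(`hg locate -0 | xargs -0 egrep -l '=\"https?://us\.pycon\.org/2012/'`)
perl -pi~ -e 's+(href|src)=\"https?://us\.pycon\.org/2012/+$1=\"/2012/+gi' $files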
Several .css files had weird URLs like /2012/site_media/static/css/pycon.css?10. That's OK in a dynamic site, but doesn't work so well with static filenames. (In fact, httrack mangled those URLs: references to them in HTML remained unchanged, but the files themselves turned out like pycond3d9.css -- apparently some sort of failed URL escaping going on there.) Regardless: it's broken, so I fixed it:
hg locate -0 -I '**.html' | xargs -0 perl -pi~ -e 's|(/201\d/site_media/static/css/.*.css)\?\d+|$1|g'
hg ci -m"fix stylesheet naming oddity"
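One wrinkle to watch for here: if httrack saved only the mangled copy (pycond3d9.css in the example above) and no plain pycon.css, the file itself also has to be renamed to match the now-clean references -- hypothetically:

hg mv site_media/static/css/pycond3d9.css site_media/static/css/pycon.css
hg ci -m"rename mangled stylesheet to match references"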
Naturally, the precise sequence of fixups was slightly different for each of the PyCon sites that I captured (2011, 2012, and 2013). This blog post is a guideline and aide-memoire, not a tested, debugged, production-ready script. ;-)