Archiving historical PyCon web sites

In preparation for PyCon 2014, the organizers wanted to make static archives of the sites for past years. The 2011 site was already suffering from bitrot (the stylesheets had disappeared), and we wanted to grab the 2012 and 2013 sites before they too started to rot. Noah Kantrowitz was the instigator, and he suggested using httrack. I volunteered to help out, and settled on httrack by default. I used httrack 3.43.9, the version available in Debian 6.0 (squeeze), since that's what I'm running on my personal web server.

The initial mirror was easy:

mkdir /tmp/pycon && cd /tmp/pycon
httrack -w -o0 -K4 -c20 https://us.pycon.org/{2011,2012,2013}

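where (going by the httrack manual):

  -w mirrors the web sites (httrack's default mode)
  -o0 suppresses the output HTML page that httrack otherwise generates for errors such as 404s
  -K4 keeps links as they appeared in the original pages
  -c20 allows up to 20 simultaneous connections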

To get my web server to serve the mirrored content statically, I moved the three sites into a simple directory structure:

mkdir -p /var/www/pycon.gerg.ca
mv us.pycon.org/{2011,2012,2013} /var/www/pycon.gerg.ca/.
cd /var/www/pycon.gerg.ca

Of course, I also had to create a DNS record and configure my web server to serve that directory as pycon.gerg.ca.

In order to keep track of my changes, I turned each year into its own Mercurial repository:

cd 2011
hg init
hg add -q
hg commit -m"mirror of http://us.pycon.org/2011/, grabbed by httrack 3.43-9, ending 2013-06-19 12:47"

(and similarly for 2012 and 2013). I could have put all three years into one big repository, but doing it this way seems more future-proof: at some point, we're going to want to archive 2014 and later years similarly.
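
The per-year steps above are identical apart from the year, so they could be wrapped in a loop. This is only a sketch: init_year_repos is a hypothetical helper name, it assumes the year directories sit under the current directory, and it drops the capture timestamp from the commit message for brevity.

```shell
# Hypothetical helper: create one Mercurial repository per mirrored year.
init_year_repos() {
    # $1: httrack version string for the commit message; remaining args: years
    version=$1
    shift
    for year in "$@"; do
        (
            # Work inside a subshell so the caller's directory is unchanged.
            cd "$year" || exit 1
            hg init
            hg add -q
            hg commit -m "mirror of http://us.pycon.org/$year/, grabbed by httrack $version"
        ) || return 1
    done
}
```

Under those assumptions, running init_year_repos 3.43-9 2011 2012 2013 from /var/www/pycon.gerg.ca would commit each year's mirror in its own repository.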

Now I can start finding and fixing problems. If a fix step goes horribly wrong, I can just hg revert the result and try again.

Unnecessary revision history

The 2011 and 2012 sites had remnants of revision history -- presumably a feature of Pinax? The static archive only needs to show the final revision of each page, so I nuked the revision history:

cd 2011
hg rm -I 're:.*/rev[0-9]+/' -I '**/history/*' .
hg commit -m"remove old revision history"

The interface for editing pages is useless, since it just redirects to a Django login page, which of course won't work in the static archive. Get rid of it:

hg rm -I '**/edit/*' .
hg ci -m"remove edit pages (they just redirect to the login page)"
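
Before running removals like the ones above, it can be reassuring to preview which files would be hit. The sketch below uses plain find and grep rather than Mercurial's own pattern matching, so it only approximates the hg patterns; preview_removals is a hypothetical helper name, not something from the original workflow.

```shell
# Hypothetical helper: list files under revN/, history/, or edit/ directories.
# This approximates the hg patterns with one extended regex.
preview_removals() {
    # $1: root of the mirrored site (defaults to the current directory)
    # Skip Mercurial's own .hg directory, then filter the file list by path.
    find "${1:-.}" -type d -name .hg -prune -o -type f -print |
        grep -E '/(rev[0-9]+|history|edit)/' |
        sort
}
```

Running preview_removals . inside a year's directory prints the candidate paths one per line; if the list looks wrong, nothing has been removed yet.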

Mystery login pages

All three sites had a bunch of mystery pages with paths like 2011/account/login/index0000.html. I'm guessing there were links from old revisions to those pages, which is why httrack captured them. Now that the old revisions are gone, make sure nothing left in the static site references them:

hg locate -0 | xargs -0 grep 'index[0-9a-f][0-9a-f][0-9a-f][0-9a-f]'

That found nothing, so remove them:

hg rm account/login/index????.html account/login/index????-?.html account/signup/index????.html
hg ci -m"remove mystery index????.html pages (unreferenced)"
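
The reference check above relies on eyeballing empty grep output. A slightly more defensive sketch turns it into a pass/fail test; no_index_refs is a hypothetical helper name, and it searches every file under a directory with grep -r instead of only the files Mercurial tracks.

```shell
# Hypothetical helper: succeed only if no file under the given root mentions
# an httrack-style indexNNNN page (four hex digits after "index").
no_index_refs() {
    # -r: recurse, -l: print names of matching files, -E: extended regex
    if grep -rlE --exclude-dir=.hg 'index[0-9a-f]{4}' "${1:-.}"; then
        echo "references remain; not safe to remove" >&2
        return 1
    fi
    echo "no references found"
}
```

Because the function returns non-zero when anything matches, it can guard the removal: no_index_refs . && hg rm account/login/index????.html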

Weird stylesheet names

Several .css files had weird URLs like /2012/site_media/static/css/pycon.css?10. That's OK in a dynamic site, but doesn't work so well with static filenames. (In fact, httrack mangled those URLs: references to them in HTML remained unchanged, but the files themselves turned out like pycond3d9.css -- apparently some sort of failed URL escaping going on there.) Regardless: it's broken, so I fixed it:

hg locate -0 -I '**.html' | xargs -0 perl -pi~ -e 's|(/201\d/site_media/static/css/.*?\.css)\?\d+|$1|g'
hg ci -m"fix stylesheet naming oddity"
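
To see what this kind of substitution does without touching any files, it can be tried on a sample line first. The sketch below uses sed rather than perl, and a character class that assumes the URLs sit inside double-quoted attributes; strip_css_query is a hypothetical helper name.

```shell
# Hypothetical helper: strip a numeric query string from stylesheet URLs
# read on stdin (a sed approximation of the perl one-liner, for illustration).
strip_css_query() {
    sed -E 's|(/201[0-9]/site_media/static/css/[^"?]*\.css)\?[0-9]+|\1|g'
}

# Try it on a sample <link> tag like the ones in the mirrored pages.
echo '<link href="/2012/site_media/static/css/pycon.css?10" rel="stylesheet">' |
    strip_css_query
# prints: <link href="/2012/site_media/static/css/pycon.css" rel="stylesheet">
```

Only stylesheet URLs under site_media/static/css are rewritten; other query strings pass through unchanged.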

Conclusion

Naturally, the precise sequence of fixups was slightly different for each of the PyCon sites that I captured (2011, 2012, and 2013). This blog post is a guideline and aide-memoire, not a tested, debugged, production-ready script. ;-)

Author: Greg Ward
Published on: Jun 21, 2013, 11:14:49 AM