I’ve used the Heritrix web crawler quite a few times in the past. It’s a great piece of software, and has enough features to handle most crawling tasks with ease.
Recently, I wanted to crawl a whole bunch of PDFs, and since I didn’t know where the PDFs would be coming from, Heritrix seemed like a natural fit to help me out. I’ll go over some of the less intuitive steps:
Download the right version of the crawler
That is to say, version 1.*. Version 2 seems to have been dropped, and version 3 does not yet have all of the features from version 1 implemented (not to mention, the user interface seems to have gone downhill).
For your convenience, here’s a link to the download page.
Make sure you’re rejecting almost everything
You almost certainly don’t want all the web has to offer. You only want a tiny fraction of it. For instance, I use a MatchesRegexpDecideRule to drop any media content with the following expression:
.*\.(jpg|jpeg|gif|png|mpg|mpeg|txt|css|js|ppt|JPG|tar\.gz|flv|MPG|zip|exe|avi|tvd)$
Similarly, you’ll want to drop pesky calendar-like applications:
.*(calendar|/api|lecture).*
And any dynamic pages that want to suck up your bandwidth:
.*\?.*
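
To make the behaviour concrete, here is a rough Java sketch (the sample URLs are invented) that applies these three reject expressions to a URI as full matches, which is how Heritrix evaluates them:

import java.util.regex.Pattern;

// Illustrative sketch only: the reject expressions from above, applied to a
// URI as full matches. The sample URLs are made up for demonstration.
public class RejectRuleSketch {
    private static final Pattern[] REJECT = {
        Pattern.compile(".*\\.(jpg|jpeg|gif|png|mpg|mpeg|txt|css|js|ppt|JPG|tar\\.gz|flv|MPG|zip|exe|avi|tvd)$"),
        Pattern.compile(".*(calendar|/api|lecture).*"),
        Pattern.compile(".*\\?.*")
    };

    static boolean rejected(String uri) {
        for (Pattern p : REJECT) {
            if (p.matcher(uri).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(rejected("http://example.edu/images/logo.png"));      // true
        System.out.println(rejected("http://example.edu/events/calendar/2009")); // true
        System.out.println(rejected("http://example.edu/search?q=thesis"));      // true
        System.out.println(rejected("http://example.edu/papers/thesis.pdf"));    // false
    }
}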
Save only what you need
Heritrix has the nice property of letting decision rules be placed almost anywhere, including just before a file gets written to disk. To avoid writing files you’re not interested in, you can allow only certain MIME types through: add a default reject rule, then accept only the types you care about, which in my case meant PDF or PostScript files:
.*(pdf|postscript).*
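
As a rough sketch of that idea (the shouldWrite method and the sample Content-Type values are mine, purely for illustration), the accept side boils down to a full match on the response’s MIME type:

import java.util.regex.Pattern;

// Sketch of the MIME type filter described above: reject by default, and let
// through only responses whose type matches the accept expression.
// shouldWrite and the sample Content-Type strings are illustrative only.
public class MimeTypeFilterSketch {
    private static final Pattern ACCEPT = Pattern.compile(".*(pdf|postscript).*");

    static boolean shouldWrite(String mimeType) {
        return ACCEPT.matcher(mimeType).matches();
    }

    public static void main(String[] args) {
        System.out.println(shouldWrite("application/pdf"));        // true
        System.out.println(shouldWrite("application/postscript")); // true
        System.out.println(shouldWrite("text/html"));              // false
    }
}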
Regular expressions are full matches, not partial matches
You need to ensure your regular expression matches the entire string being tested (usually the URI), not just part of it. This means prepending and appending
.*
to your normal patterns.
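
If you want to sanity-check a pattern before kicking off a long crawl, the difference is the same one Java draws between a substring search and a full match (the URL below is invented):

import java.util.regex.Pattern;

// Why the leading and trailing .* matter: the expression has to match the
// entire URI, not just a piece of it. The URL below is made up.
public class FullMatchSketch {
    public static void main(String[] args) {
        String uri = "http://example.edu/events/calendar/2009";

        // A bare substring pattern can be found in the URI, but does not
        // match the whole thing...
        Pattern bare = Pattern.compile("calendar");
        System.out.println(bare.matcher(uri).find());    // true
        System.out.println(bare.matcher(uri).matches()); // false

        // ...so pad it with .* on both sides, as in the rules above.
        Pattern padded = Pattern.compile(".*calendar.*");
        System.out.println(padded.matcher(uri).matches()); // true
    }
}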
If you’re feeling lazy, you can download the crawl order I used and use it as a base for your crawl. Good luck!