running a pdf crawler with heritrix

I’ve used the Heritrix web crawler quite a few times in the past.  It’s a great piece of software, and has enough features to handle most crawling tasks with ease.
Recently, I wanted to crawl a whole bunch of PDFs, and since I didn’t know where the PDFs were going to come from, Heritrix seemed like a natural fit.  I’ll go over some of the less intuitive steps:

Download the right version of the crawler

That is to say, version 1.*.  Version 2 seems to have been dropped, and version 3 does not yet implement all of the features from version 1 (not to mention that the user interface seems to have gone downhill).

For your convenience, here’s a link to the download page.

Make sure you’re rejecting almost everything

You almost certainly don’t want all the web has to offer.  You only want a tiny fraction of it.  For instance, I use a MatchesRegexpDecideRule to drop any media content with the following expression:

.*\.(jpg|jpeg|gif|png|mpg|mpeg|txt|css|js|ppt|JPG|tar\.gz|flv|MPG|zip|exe|avi|tvd)$

Similarly, you’ll want to drop pesky calendar-like applications:

.*(calendar|/api|lecture).*

And any dynamic pages that want to suck up your bandwidth:

.*\?.*
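Before starting a long crawl, it can help to try the reject patterns against a few sample URLs. The sketch below is not Heritrix code: the class name, the helper method, and the example.org URLs are made up for illustration, and it assumes each pattern is applied as a full match with java.util.regex. The literal dots and the question mark are backslash-escaped so that they are not treated as regex metacharacters.

```java
import java.util.regex.Pattern;

// Hypothetical helper for trying out reject patterns; not part of Heritrix.
final class RejectPatternCheck {
    // The three reject patterns, with the literal '.' and '?' escaped.
    static final Pattern[] REJECT = {
        Pattern.compile(".*\\.(jpg|jpeg|gif|png|mpg|mpeg|txt|css|js|ppt|JPG|tar\\.gz|flv|MPG|zip|exe|avi|tvd)$"),
        Pattern.compile(".*(calendar|/api|lecture).*"),
        Pattern.compile(".*\\?.*")
    };

    // Returns true if any reject pattern matches the whole URL.
    static boolean rejected(String url) {
        for (Pattern p : REJECT) {
            if (p.matcher(url).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Made-up sample URLs covering each rule.
        String[] samples = {
            "http://example.org/papers/thesis.pdf",
            "http://example.org/images/logo.png",
            "http://example.org/calendar/2010/05",
            "http://example.org/search?q=pdf"
        };
        for (String url : samples) {
            System.out.println((rejected(url) ? "REJECT " : "keep   ") + url);
        }
    }
}
```

Note that the second pattern also drops any URL containing "lecture", so PDFs of lecture notes would be skipped too; adjust it if you want those.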

Save only what you need

Heritrix has the nice property of allowing decision rules to be placed almost anywhere, including just before a file gets written to disk. To avoid writing files you’re not interested in, you can allow only certain MIME types through: add a default reject rule, and then accept only the files you want. In my case, that means PDF or PostScript files:

.*(pdf|postscript).*
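The "reject by default, then accept on a match" logic can be sketched in a few lines of plain Java. This is only an illustration of the decision order, not how Heritrix implements its rules; the class and method names are hypothetical, and the sample content types are assumptions about what a server might send.

```java
import java.util.regex.Pattern;

// Hypothetical stand-in for a "default reject, then accept" rule chain.
final class MimeTypeFilter {
    // Accept pattern from the text, applied to the Content-Type value.
    static final Pattern ACCEPT = Pattern.compile(".*(pdf|postscript).*");

    // Returns true if a response with this content type should be written to disk.
    static boolean shouldWrite(String contentType) {
        boolean accept = false; // first rule: reject everything
        if (contentType != null && ACCEPT.matcher(contentType).matches()) {
            accept = true;      // later rule: accept PDF and PostScript
        }
        return accept;
    }

    public static void main(String[] args) {
        String[] types = {"application/pdf", "application/postscript", "text/html; charset=utf-8"};
        for (String type : types) {
            System.out.println(type + " -> " + (shouldWrite(type) ? "write" : "skip"));
        }
    }
}
```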

Regular expressions are full, not partial matches

You need to ensure your regular expression matches the entire item, not just part of it. This means prepending and appending

.*

to your normal patterns.
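In java.util.regex terms, this is the difference between Matcher.matches(), which must consume the whole input, and Matcher.find(), which looks for the pattern anywhere in it. Here is a minimal sketch, assuming the crawler's regex rules behave like matches(); the class name and the sample URL are invented for the example.

```java
import java.util.regex.Pattern;

// Illustrates full versus partial regex matching with java.util.regex.
final class FullMatchDemo {
    // Full match: the pattern must cover the entire input.
    static boolean fullMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    // Partial match: the pattern may occur anywhere in the input.
    static boolean partialMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String url = "http://example.org/calendar/2010"; // made-up URL
        System.out.println("find    'calendar':     " + partialMatch("calendar", url));
        System.out.println("matches 'calendar':     " + fullMatch("calendar", url));
        System.out.println("matches '.*calendar.*': " + fullMatch(".*calendar.*", url));
    }
}
```

Only the first and third checks succeed, which is why a bare "calendar" pattern would silently fail to reject anything under full-match semantics.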

If you’re feeling lazy, you can download the crawl order I used and use it as a base for your crawl. Good luck!

java profiling

The Java profiling world can be a somewhat arcane maze of GUIs, most of which seem to make things more complex.

Fortunately, it’s actually quite simple to get a usable, sample-based CPU profile from any modern JVM. Simply run your program with the additional flag:

-agentlib:hprof=cpu=samples

Now, when your program exits, the JVM will also emit a java.hprof.txt file with a listing of where time was spent. If you pore over that file, you’ll eventually find out where your program was wasting its time.
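If you want something small to try this on, here is a toy program with one deliberately hot method. It is purely illustrative (the class name, input string, and iteration count are arbitrary), and it assumes a JDK that still ships the hprof agent, which means JDK 8 or earlier; the agent was removed in JDK 9.

```java
// Toy workload for trying out the sampling profiler; not a real benchmark.
// Assumed invocation: java -agentlib:hprof=cpu=samples ProfileTarget
final class ProfileTarget {
    // Counts whitespace-separated tokens using a regex split.
    static int countTokens(String line) {
        return line.split("\\s+").length;
    }

    public static void main(String[] args) {
        String line = "the quick brown fox jumps over the lazy dog";
        long total = 0;
        // Repeat the work so the sampler has enough time to collect samples.
        for (int i = 0; i < 2_000_000; i++) {
            total += countTokens(line);
        }
        System.out.println("tokens counted: " + total);
    }
}
```

After the run finishes, java.hprof.txt should appear in the working directory, with the regex split near the top of the sample listing.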

But it turns out there is a much simpler option: gprof2dot.py. This lovely little utility can convert your grungy hprof output to a beautiful dot graph. (N.B. I wrote the hprof format importer for gprof2dot, so blame me if it’s wrong.)

For example:

gprof2dot.py -f hprof < java.hprof.txt | dot -Tpng > profile.png

Gives us back:

[Image: call graph of the profile, as rendered by gprof2dot and dot]

In this case, it appears that:

  • Using Java regular expressions to split things is slow
  • I need to speed up my cosine similarity calculation

xdot.py is another useful program for interactively viewing these graphs: simply feed the output of gprof2dot (or any other graphviz generator) to it:

gprof2dot.py -f hprof < java.hprof.txt | xdot.py

And now you can scan around your profile image directly.