
log-spaced values with numpy

I knew this had to exist, since otherwise generating logarithmic plots in matplotlib would be a pain in the butt. Still, it took a bit of searching, although perhaps just the name should have clued me in.

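 import numpy as N
 import matplotlib.pyplot as plt

 # f is any vectorized function of your own, e.g. f = N.sqrt
 fig, ax = plt.subplots()
 steps = N.log10(N.logspace(0.9, 1-1e-5))  # 50 evenly spaced values in [0.9, 1-1e-5]
 ax.set_yscale('log')  # log-scaled y axis; base 10 is the default
 ax.plot(steps, f(steps), '-')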
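
For a quick feel of what logspace actually returns, here is a minimal sketch. The endpoints and the point count are arbitrary picks for illustration, and it uses the more common np alias rather than N:

```python
import numpy as np

# logspace takes exponents: 4 points from 10**0 to 10**3,
# evenly spaced in the exponent (values 1, 10, 100, 1000).
points = np.logspace(0, 3, num=4)
print(points)

# Taking log10 recovers the evenly spaced exponents 0, 1, 2, 3.
exponents = np.log10(points)
print(exponents)
```

So logspace(a, b) is linspace(a, b) pushed through 10**x, which is why the points look evenly spaced on a log axis.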

Also, a shout-out for the IPython inline graphs (ipython notebook --pylab inline). Beautiful, and I can copy-paste them into emails and Google Docs!

running a pdf crawler with heritrix

I’ve used the Heritrix web crawler quite a few times in the past.  It’s a great piece of software, and has enough features to handle most crawling tasks with ease.
Recently, I wanted to crawl a whole bunch of PDFs, and since I didn’t know where the PDFs were going to come from, Heritrix seemed like a natural fit to help me out.  I’ll go over some of the less intuitive steps:

Download the right version of the crawler

That is to say, version 1.*.  Version 2 seems to have been dropped, and version 3 does not yet have all of the features from version 1 implemented (not to mention, the user interface seems to have gone downhill).

For your convenience, here’s a link to the download page.

Make sure you’re rejecting almost everything

You almost certainly don’t want all the web has to offer.  You only want a tiny fraction of it.  For instance, I use a MatchesRegexpDecideRule to drop any media content with the following expression:

.*\.(jpg|jpeg|gif|png|mpg|mpeg|txt|css|js|ppt|JPG|tar\.gz|flv|MPG|zip|exe|avi|tvd)$

Similarly, you’ll want to drop pesky calendar-like applications:

.*(calendar|/api|lecture).*

And any dynamic pages that want to suck up your bandwidth:

.*\?.*
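
If you want to sanity-check reject patterns like these before starting a crawl, a rough Python sketch is below. It is not Heritrix code: it only mimics whole-string matching with re.fullmatch, the literal dots and the question mark are escaped so they match literally, and the example.com URLs are made up for illustration.

```python
import re

# Reject patterns from this post, with literal "." and "?" escaped.
REJECT_PATTERNS = [
    re.compile(r".*\.(jpg|jpeg|gif|png|mpg|mpeg|txt|css|js|ppt|JPG|tar\.gz|flv|MPG|zip|exe|avi|tvd)$"),
    re.compile(r".*(calendar|/api|lecture).*"),
    re.compile(r".*\?.*"),
]


def should_reject(url):
    """Return True if any reject pattern matches the entire URL."""
    return any(p.fullmatch(url) for p in REJECT_PATTERNS)


# Hypothetical URLs, for illustration only.
for url in [
    "http://example.com/logo.png",
    "http://example.com/calendar/2012/05",
    "http://example.com/view?id=3",
    "http://example.com/papers/thesis.pdf",
]:
    print(url, "->", "reject" if should_reject(url) else "keep")
```

Only the last URL survives, which is the point: everything that is not obviously a document gets dropped early.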

Save only what you need

Heritrix has the nice property that decision rules can be placed almost anywhere, including just before a file gets written to disk. To avoid writing files you’re uninterested in, you can request that only certain MIME types are allowed through – add a default reject rule, and then only accept files you want – in my case PDFs or PostScript files:

.*(pdf|postscript).*

Regular expressions are full, not partial matches

You need to ensure your regular expression matches the entire item, not just part of it. This means prepending and appending .* to your normal patterns.
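
The difference is easy to see outside of Heritrix. A small sketch in Python, where re.fullmatch plays the role of Java's Matcher.matches() (whole-string match) and re.search is the partial match you might have expected; the MIME type string is just an example:

```python
import re

mime = "application/pdf"  # example MIME type

# A bare pattern fully matches only the strings "pdf" or "postscript".
bare = re.compile(r"(pdf|postscript)")
# Wrapping it in .* lets the pattern cover the whole string.
wrapped = re.compile(r".*(pdf|postscript).*")

bare_full = bool(bare.fullmatch(mime))        # False: whole string is not "pdf"
bare_partial = bool(bare.search(mime))        # True: "pdf" occurs somewhere
wrapped_full = bool(wrapped.fullmatch(mime))  # True: .* absorbs the rest

print(bare_full, bare_partial, wrapped_full)
```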

If you’re feeling lazy, you can download the crawl order I used and use it as a base for your crawl. Good luck!

java profiling

The Java profiling world can be a somewhat arcane maze of GUIs, most of which seem to make things more complex.

Fortunately, it’s actually quite simple to get a usable, sample-based CPU profile from any modern JVM. Simply run your program with the additional flags:

-agentlib:hprof=cpu=samples

Now, when your program exits, the JVM will also emit a java.hprof.txt file with a listing of where time was spent. If you pore over that file, you’ll eventually find out where your program was wasting its time.

But it turns out there is a much simpler option – gprof2dot.py. This lovely little utility can convert your grungy hprof output to a beautiful dot graph. (N.B. I wrote the hprof format importer for gprof2dot, so blame me if it’s wrong.)

For example:

gprof2dot.py -f hprof < java.hprof.txt | dot -Tpng > profile.png

Gives us back:

[Figure: call graph of the profile, rendered by dot]

In this case, it appears that:

  • Using Java regular expressions to split things is slow
  • I need to speed up my cosine similarity calculation

xdot.py is another useful program for interactively viewing these graphs: simply feed the output of gprof2dot (or any other graphviz generator) to it:

gprof2dot.py -f hprof < java.hprof.txt | xdot.py

And now you can scan around your profile image directly.

Mountain View

I’m in Mountain View for the summer for an MSR internship. I lived in the Bay Area for 5 years, but somehow I had forgotten how far apart everything is here.

A fruit fly just drowned itself in my cup of coffee this morning. Hopefully that wasn’t an omen, but just a morbid insect.

the buddhist nature of swimming

I’ve been reading The Empty Mirror. It’s a fascinating little book about a Dutchman and a year he spent in a Japanese Zen monastery a few years after WWII. It’s a really quick read, and I recommend it – I just happened upon it in the local used bookstore, but I think it’s worth the Amazon price too.

The book, as expected, focuses on the author’s time in the monastery and his interactions with the monk, and the trials, stresses and achievement he gets out of the whole thing. Since I’ve also been swimming more frequently these days, it was natural for me to think about how the activities are somewhat related. In some sense, swimming is my form of meditation – it’s an activity where you can completely empty your mind. It’s especially true when you’re working hard and the pain from your muscles wipes everything else out.

I’ve never really done an extensive survey on this, but from personal experience, swimmers are pretty mellow people. Maybe it’s due to their extensive Buddhist training? I should arrange for a conference between some monks and swimmers to investigate more. But first I should probably practice meditating (or swimming) some more.