Citibike: Bike Flow

I’m a big fan of the Citibike bike share program that started here recently.  One common issue I and others seem to suffer from is the lack of bikes (when starting a trip) or docks (when ending a trip).  Our neighborhood tends to be a very popular destination in the evening, so when I try to ride a bike in, I often end up a few blocks away from my desired station.  Similarly, if I get a late start in the morning I often find there are no bikes left to ride.

I was curious about how the flow of bikes works around the city — where do the bikes go to when they leave the East Village?  I crawled the Citibike web site and created a simple website to visualize the flow of bikes around the city; the results are pretty interesting:

[screenshot: Citibike bike flow visualization]

With some more work on the data, it might be possible to use it for predictions (“will I be able to return this bike to my station”) and to aid in balancing (choosing which stations to move bikes between, and at what time).
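
As a rough illustration of the kind of data work I mean, here's a minimal sketch (the snapshot format is hypothetical, just station id to available-bike count, not the actual feed) that computes the net change at each station between two points in time:

def net_flow(earlier, later):
  # Net change in available bikes per station between two snapshots.
  # Both arguments are hypothetical dicts: station id -> bike count.
  # A large negative number means the station emptied out over the interval.
  return {sid: later.get(sid, 0) - bikes for sid, bikes in earlier.items()}

# e.g. net_flow({'station-72': 20}, {'station-72': 3}) == {'station-72': -17}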

The source code for the application is available on Github.

Making a JIT interpreter with LuaJIT

(N.B. The code for all of these experiments is on Github).

I recently read this post by François Perrad regarding interpreters, where he compared interpreter loops written in Lua, LuaJIT, and PyPy. (I think the original toy interpreter example comes from PyPy.) After some suggestions, he ended up with a new PyPy version which performed very well — close to what you’d see from a static compiler.

The bytecode ‘program’ being used for all of these examples is simply calculating, in a round-about fashion, the square of an input number:

MOV_A_R,    0,
MOV_A_R,    1,
MOV_R_A,    0, 
DECR_A,
MOV_A_R,    0,
MOV_R_A,    2, 
ADD_R_TO_A, 1,
MOV_A_R,    2,
MOV_R_A,    0, 
JUMP_IF_A,  4,
MOV_R_A,    2,
RETURN_A
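
For reference, the bytecode above squares by repeated addition; in plain Python it would look roughly like this (a sketch, with the three registers given names):

def square(n):
  counter, addend, acc = n, n, 0   # registers 0, 1 and 2 in the bytecode
  while True:
    counter -= 1
    acc += addend
    if counter == 0:
      break
  return acc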

I made a slight modification to this interpreter to force PyPy to load the bytecode at runtime (to ensure it doesn’t “cheat” during translation by just statically optimizing for this particular program). This version runs quickly, but not as fast as the version that has the bytecode baked in. It evaluates 100M iterations of the bytecode loop in 1.6 seconds; still, this is roughly a hundred times faster than the CPython equivalent. This is what you’d expect, after all — it’s what PyPy is designed for.

The Lua-based interpreter, when run with LuaJIT, takes 5.5 seconds; not bad, but it’s 4 times slower than PyPy. Can we do better?  What’s causing LuaJIT to run slowly?  If we turn on JIT debugging for LuaJIT, we see the problem immediately:

luajit -jdump toy.lua 100000000

There’s no output!  The JIT compiler never activated.  What’s going on?

It turns out that our interpreter loop (like all interpreter loops) is unpredictable: the path of execution is very data-dependent (‘data’ here meaning the bytecode we’re interpreting):

while true do
  local opcode = bytecode[pc]
  pc = pc + 1 
  if opcode == JUMP_IF_A then
    local target = bytecode[pc]
    pc = pc + 1 
    if a ~= 0 then
      pc = target
    end
  elseif opcode == MOV_A_R then
    ...
  elseif opcode == MOV_R_A then
    ...
  elseif opcode == ADD_R_TO_A then
    ...
  elseif opcode == DECR_A then
    ...
  elseif opcode == RETURN_A then 

After executing a bytecode, the interpreter goes back up to the top of the while loop and then branches off somewhere new. A tracing JIT never gets a chance to see a stable pattern, and so you end up running in the interpreter the whole time.  PyPy solves this problem by using magic meta-tracing.
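
For reference, the PyPy trick amounts to annotating the interpreter loop so that the tracer keys its traces on the position in the interpreted program rather than on the interpreter's own while loop. A sketch of what those hints look like in RPython (names as in the PyPy tutorials; the exact module path has moved around between PyPy versions):

from rpython.rlib.jit import JitDriver

# greens: variables that identify a position in the interpreted program;
# reds: everything else that is live across the loop.
jitdriver = JitDriver(greens=['pc', 'bytecode'], reds=['a', 'reg'])

def interpret(bytecode, a):
  reg = [0] * 8
  pc = 0
  while True:
    # An iteration of the interpreted program starts here; traces are
    # keyed on (pc, bytecode) instead of on this while loop.
    jitdriver.jit_merge_point(pc=pc, bytecode=bytecode, a=a, reg=reg)
    opcode = bytecode[pc]
    pc = pc + 1
    # ...opcode dispatch, exactly as in the Lua loop above...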

It turns out we can get a similar effect in LuaJIT, without too much effort, using partial evaluation.  That is, given a chunk of bytecode, we’ll generate a specialized version of our interpreter for that bytecode.  We do this, in time-honored fashion, by copy-pasting: we step through each opcode, and instead of evaluating it, we build up a Lua string that evaluates it.  (A much cleaner approach would be to write our interpreter in some structured fashion and generate the JIT interpreter from that.)

if opcode == JUMP_IF_A then
  local target = bytecode[pc]
  pc = pc + 1
  f_str = f_str .. string.format([[
if a == 0 then
  goto op_%d
end
goto op_%d
]], pc, target)
elseif opcode == MOV_R_A then
  local n = bytecode[pc]
  pc = pc + 1
  f_str = f_str .. string.format([[
a = reg[%d]
]], n)

For our test program, this creates a Lua string like this:

function _jit(a)
  local reg = {0, 0, 0, 0, 0, 0, 0, 0}
  ::op_1::
reg[1] = a
::op_3::
reg[2] = a
::op_5::
a = reg[1]
::op_7::
a = a - 1
::op_8::
reg[1] = a
::op_10::
a = reg[3]
::op_12::
a = a + reg[2]
::op_14::
reg[3] = a
::op_16::
a = reg[1]
::op_18::
if a == 0 then
  goto op_20
end
goto op_5
::op_20::
a = reg[3]
::op_22::
return a
end

If we load and run this string (Lua’s equivalent of eval()), we get back an interpreter that’s been specialized for just this bytecode. What does our performance look like now?

time pypy-jit-c /home/power/tmp/bytecode.str 100000000
1.63s user 0.01s system 99% cpu 1.649 total

time luajit toy.lua 100000000       
5.51s user 0.01s system 99% cpu 5.549 total

time luajit toy-jit.lua 100000000
0.12s user 0.00s system 97% cpu 0.128 total

We’re now much faster than PyPy! Obviously this trick is easier to play with such a simple interpreter (we’re also using the native numeric type of our JIT, which isn’t always correct). A more complex, dynamically typed system might prove more difficult to partially evaluate. There could also be extra hints I could give to PyPy to make it work better (if you have any ideas, please tell me!).

Still, it’s somewhat surprising how easy it was to generate our ‘JIT’ interpreter — the code isn’t much bigger than the original version. Perhaps with some more scaffolding/helper libraries, this could be a viable way to create fast interpreters for new languages?

thread profiling in Python

Python has accumulated a lot of… character over the years.  We’ve got no fewer than three profiling libraries for single-threaded execution, plus a multi-threaded profiler with an incompatible interface (Yappi).  Since many applications use more than one thread, this can be a bit annoying.

Yappi works most of the time, except it can sometimes cause your application to hang for unknown reasons (I blame signals, personally). The other issue is that Yappi doesn’t have a way of collecting call-stack information (I don’t necessarily care that memcpy takes all of the time; I want to know who called memcpy). In particular, the lovely gprof2dot can take in pstats dumps and output a very nice profile graph.

To address this for my uses, I glom together cProfile runs from multiple threads. In case it might be useful to other people, I wrote a quick gist illustrating how to do it. To make it easy to drop in, I monkey-patch the Thread.run method, but you can use a more maintainable approach if you like (I create a ProfileThread subclass in my applications).

from threading import Thread
 
import cProfile
import pstats
 
def enable_thread_profiling():
  '''Monkey-patch Thread.run to enable global profiling.

  Each thread creates a local profiler; statistics are pooled
  to the global stats object on run completion.'''
  Thread.stats = None
  thread_run = Thread.run
  
  def profile_run(self):
    self._prof = cProfile.Profile()
    self._prof.enable()
    thread_run(self)
    self._prof.disable()
    
    if Thread.stats is None:
      Thread.stats = pstats.Stats(self._prof)
    else:
      Thread.stats.add(self._prof)
  
  Thread.run = profile_run
  
def get_thread_stats():
  stats = getattr(Thread, 'stats', None)
  if stats is None:
    raise ValueError('Thread profiling was not enabled, '
                     'or no threads finished running.')
  return stats
 
if __name__ == '__main__':
  enable_thread_profiling()
  import time
  t = Thread(target=time.sleep, args=(1,))
  t.start()
  t.join()
  
  get_thread_stats().print_stats()
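
To feed the result into gprof2dot (mentioned above), dump the pooled stats to a file and run gprof2dot in pstats mode (the file names here are just placeholders):

get_thread_stats().dump_stats('all_threads.pstats')
# then, from a shell (assuming gprof2dot.py is on your PATH):
#   gprof2dot.py -f pstats all_threads.pstats | dot -Tpng > threads.png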

setting figure height in ipython

The default size chosen by imshow yields unpleasantly small images. Fortunately, you can easily change the figure size using the rather strangely named gcf() (“get current figure”) function:

import pylab as P  
...  
f = P.gcf()  
f.set_figheight(16)  
f.set_figwidth(16)
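
Alternatively, you can set the size up front when the figure is created, to much the same effect:

import pylab as P
P.figure(figsize=(16, 16))  # (width, height) in inches, set at creation time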

log-spaced values with numpy

I knew this had to exist, since otherwise generating logarithmic plots in matplotlib would be a pain in the butt. Still, it took a bit of searching, although perhaps the name alone should have clued me in.

import numpy as N
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
steps = N.log10(N.logspace(0.9, 1 - 1e-5))  # 50 evenly spaced values in [0.9, 1)
ax.set_yscale('log', basey=10)              # basey, not basex, for the y-axis
ax.plot(steps, f(steps), '-')               # f is whatever function you're plotting
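
For reference, logspace by itself just returns evenly log-spaced values between two powers of ten; with the numpy import above:

N.logspace(0, 3, num=4)   # -> array([1., 10., 100., 1000.])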

Also, a shout-out for the ipython inline graphs:

ipython notebook --pylab inline

Beautiful, and I can copy-paste them into emails and google docs!

running a pdf crawler with heritrix

I’ve used the Heritrix web crawler quite a few times in the past.  It’s a great piece of software, and has enough features to handle most crawling tasks with ease.
Recently, I wanted to crawl a whole bunch of PDFs, and since I didn’t know where the PDFs were going to come from, Heritrix seemed like a natural fit to help me out.  I’ll go over some of the less intuitive steps:

Download the right version of the crawler

That is to say, version 1.*.  Version 2 seems to have been dropped, and version 3 does not yet have all of the features from version 1 implemented (not to mention that the user interface seems to have gone downhill).

For your convenience, here’s a link to the download page.

Make sure you’re rejecting almost everything

You almost certainly don’t want all the web has to offer.  You only want a tiny fraction of it.  For instance, I use a MatchesRegexpDecideRule to drop any media content with the following expression:

.*\.(jpg|jpeg|gif|png|mpg|mpeg|txt|css|js|ppt|JPG|tar\.gz|flv|MPG|zip|exe|avi|tvd)$

Similarly, you’ll want to drop pesky calendar-like applications:

.*(calendar|/api|lecture).*

And any dynamic pages that want to suck up your bandwidth:

.*\?.*

Save only what you need

Heritrix has the nice property of allowing decision rules to be placed almost anywhere, including just before a file gets written to disk. To avoid writing files you’re uninterested in, you can request that only certain mimetypes are allowed through – add a default reject rule, and then accept only the files you want – in my case, PDF or PostScript files:

.*(pdf|postscript).*

Regular expressions are full, not partial matches

You need to ensure your regular expression matches the entire item, not just part of it. This means prepending and appending

.*

to your normal patterns.
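
If it helps, this is the usual full-match versus partial-match distinction; in Python terms (just an illustration, since Heritrix itself uses Java’s regex engine):

import re

url = 'http://example.com/paper.pdf'
re.search(r'\.pdf$', url)       # matches: search looks anywhere in the string
re.fullmatch(r'\.pdf$', url)    # None: a full match must cover the whole URL
re.fullmatch(r'.*\.pdf', url)   # matches once the pattern is padded with .*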

If you’re feeling lazy, you can download the crawl order I used and use it as a base for your crawl. Good luck!

java profiling

The Java profiling world can be a somewhat arcane maze of GUIs, most of which seem to make things more complex.

Fortunately, it’s actually quite simple to get a usable, sample-based CPU profile from any modern JVM. Simply run your program with the additional flag:

-agentlib:hprof=cpu=samples
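
For example, if your program’s entry point is some class Main (a placeholder name here), the full invocation is just:

java -agentlib:hprof=cpu=samples Main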

Now, when your program exits, the JVM will also emit a java.hprof.txt file with a listing of where time was spent. If you pore over that file, you’ll eventually find out where your program was wasting its time.

But it turns out there is a much simpler option – gprof2dot.py. This lovely little utility can convert your grungy hprof output into a beautiful dot graph. (N.B. I wrote the hprof format importer for gprof2dot, so blame me if it’s wrong.)

For example:

gprof2dot.py < java.hprof.txt | dot -Tpng > profile.png

Gives us back:

[profile graph generated by gprof2dot]

In this case, it appears that:

  • Using Java regular expressions to split things is slow
  • I need to speed up my cosine similarity calculation

xdot.py is another useful program for interactively viewing these graphs: simply feed the output of gprof2dot (or any other graphviz generator) to it:

gprof2dot.py < java.hprof.txt | xdot.py

And now you can scan around your profile image directly.

Mountain View

I’m in Mountain View for the summer for an MSR internship. I lived in the bay area for 5 years, but somehow I had forgotten how far apart everything is here.

A fruitfly just drowned itself in my cup of coffee this morning. Hopefully that wasn’t an omen, but just a morbid insect.

the buddhist nature of swimming

I’ve been reading The Empty Mirror. It’s a fascinating little book about a Dutchman and a year he spent in a Japanese Zen monastery a few years after WWII. It’s a really quick read, and I recommend it – I just happened upon it in the local used bookstore, but I think it’s worth the Amazon price too.

The book, as expected, focuses on the author’s time in the monastery and his interactions with the monks, and on the trials, stresses, and achievements he gets out of the whole thing. Since I’ve also been swimming more frequently these days, it was natural for me to think about how the two activities are somewhat related. In some sense, swimming is my form of meditation – it’s an activity where you can completely empty your mind. That’s especially true when you’re working hard and the pain from your muscles wipes everything else out.

I’ve never really done an extensive survey on this, but from personal experience, swimmers are pretty mellow people. Maybe it’s due to their extensive Buddhist training? I should arrange for a conference between some monks and swimmers to investigate more. But first I should probably practice meditating (or swimming) some more.

Review: Dark Silicon and the End of Multicore Scaling

Paper link: p365-esmaeilzadeh

Premise

Single-core performance scaling stopped several years ago. But we’re okay, because we now have the “multi-core revolution” (I realllly hate that phrase). Or are we?

It turns out that another, very real wall is blocking future progress — power density. CPU designs already strive mightily to shut down or slow down underutilized circuitry to save power. But we’re reaching the point where processors simply won’t ever be able to run “full-out”.

This paper explores the future of scaling by extrapolating from a wide range of processors over time, and the authors find, rather compellingly, that we’re pretty much boned.

Details

The paper is pretty dense, but it boils down to the creation of a model parameterized by processor features:

  • number of cores
  • number of threads
  • CPU frequency
  • L1/L2/DRAM cache size, speed, miss rate
  • DRAM fetch width

and code features:

  • % code that is parallel
  • % read instructions

You can fit this model to existing processors using their benchmark performance, power consumption, and die area.
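
The performance half of such a model is, at its heart, an Amdahl’s-law-style bound (a simplification; the paper’s full model also accounts for power budgets, core designs, and memory):

def amdahl_speedup(parallel_fraction, n_cores):
  # The serial fraction doesn't shrink with more cores, so it caps the
  # achievable speedup no matter how wide the chip gets.
  return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_cores)

# Even 95%-parallel code tops out around 17x on 128 cores:
# amdahl_speedup(0.95, 128) ~= 17.4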

Once the model is created, the fun starts — we can see how performance will scale with future process improvements (moving to 32, 22, 16… nm), or, if we assume highly parallel code, we can see how to optimize our CPU design (lots of cores)!

The short of it

If we make optimistic assumptions for how power consumption drops with process scaling, we still end up with the sobering conclusion that we’re rapidly reaching the point where we won’t be able to power all of the parts of a chip. 128 cores aren’t as useful if we can’t keep them all running.

On the plus side, I can buy 1000 watt power supplies now, so I’m looking forward to my 16 processor, 16 core machine of the future. Welcome to the multiprocessor revolution!