Citibike: Bike Flow

I’m a big fan of the Citibike bike share program that started here recently.  One common issue I and others seem to suffer from is the lack of bikes (when starting a trip) or docks (when ending a trip).  Our neighborhood tends to be a very popular destination in the evening, so when I try to ride a bike in, I often end up a few blocks away from my desired station.  Similarly, if I get a late start in the morning I often find there are no bikes left to ride.

I was curious about how the flow of bikes works around the city — where do the bikes go to when they leave the East Village?  I crawled the Citibike web site and created a simple website to visualize the flow of bikes around the city; the results are pretty interesting:

[Image: citibike]

With some more work on the data, it might be possible to use it for predictions (“will I be able to return this bike to my station”) and to aid in balancing (choosing which stations to move bikes between, and at what time).

The source code for the application is available on GitHub.

Making a JIT interpreter with LuaJIT

(N.B. The code for all of these experiments is on GitHub.)

I recently read this post by François Perrad regarding interpreters, where he compared interpreter loops written in Lua, LuaJIT and PyPy. (I think the original toy interpreter example comes from PyPy.) After some suggestions, he ended up with a new PyPy version which performed very well, close to what you’d see from a static compiler.

The bytecode ‘program’ being used for all of these examples is simply calculating, in a round-about fashion, the square of an input number:

MOV_A_R,    0,
MOV_A_R,    1,
MOV_R_A,    0, 
DECR_A,
MOV_A_R,    0,
MOV_R_A,    2, 
ADD_R_TO_A, 1,
MOV_A_R,    2,
MOV_R_A,    0, 
JUMP_IF_A,  4,
MOV_R_A,    2,
RETURN_A

I made a slight modification to this interpreter to force PyPy to load the bytecode at runtime (to ensure it doesn’t “cheat” during translation by just statically optimizing for this particular program). This version runs quickly, but not as fast as the version that has the bytecode baked in. It evaluates 100M iterations of the bytecode loop in 1.6 seconds; this is still roughly a hundred times faster than the CPython equivalent. This is what you’d expect, after all: it’s what PyPy is designed for.

The Lua-based interpreter, when run with LuaJIT, takes 5.5 seconds; not bad, but it’s roughly 3.4 times slower than PyPy. Can we do better? What’s causing LuaJIT to run slowly? If we turn on JIT debugging for LuaJIT, we see the problem immediately:

luajit -jdump toy.lua 100000000

There’s no output!  The JIT compiler never activated.  What’s going on?

It turns out that our interpreter loop (like all interpreter loops) is unpredictable, as the path of execution is very data dependent (‘data’ here meaning the bytecode we’re interpreting):

while true do
  local opcode = bytecode[pc]
  pc = pc + 1
  if opcode == JUMP_IF_A then
    local target = bytecode[pc]
    pc = pc + 1
    if a ~= 0 then
      pc = target
    end
  elseif opcode == MOV_A_R then
    ...
  elseif opcode == MOV_R_A then
    ...
  elseif opcode == ADD_R_TO_A then
    ...
  elseif opcode == DECR_A then
    ...
  elseif opcode == RETURN_A then
    return a
  end
end

After executing a bytecode, the interpreter goes back up to the top of the while, and jumps to a different place. A tracing JIT never gets a chance to see the pattern, and so you end up running in the interpreter the whole time.  PyPy solves this problem by using magic meta-tracing.
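
For reference, the elided branches can be filled in along the following lines. This is a sketch rather than the original toy.lua: the opcode numbering, the three-entry register table and the `interpret` name are assumptions made for illustration. Jump targets and register numbers are treated as 0-based, as in the bytecode listing above, and shifted by one to fit Lua’s 1-based tables.

```lua
-- Opcode numbering is an arbitrary choice for this sketch.
local MOV_A_R, MOV_R_A, ADD_R_TO_A, DECR_A, JUMP_IF_A, RETURN_A = 1, 2, 3, 4, 5, 6

-- The squaring program from above, stored as a 1-based Lua array.
-- Operands (register numbers, jump targets) stay 0-based.
local program = {
  MOV_A_R,    0,
  MOV_A_R,    1,
  MOV_R_A,    0,
  DECR_A,
  MOV_A_R,    0,
  MOV_R_A,    2,
  ADD_R_TO_A, 1,
  MOV_A_R,    2,
  MOV_R_A,    0,
  JUMP_IF_A,  4,
  MOV_R_A,    2,
  RETURN_A,
}

local function interpret(bytecode, a)
  local reg = {0, 0, 0}  -- this program only touches registers 0..2
  local pc = 1
  while true do
    local opcode = bytecode[pc]
    pc = pc + 1
    if opcode == JUMP_IF_A then
      local target = bytecode[pc]
      pc = pc + 1
      if a ~= 0 then
        pc = target + 1  -- 0-based offset to 1-based index
      end
    elseif opcode == MOV_A_R then
      reg[bytecode[pc] + 1] = a
      pc = pc + 1
    elseif opcode == MOV_R_A then
      a = reg[bytecode[pc] + 1]
      pc = pc + 1
    elseif opcode == ADD_R_TO_A then
      a = a + reg[bytecode[pc] + 1]
      pc = pc + 1
    elseif opcode == DECR_A then
      a = a - 1
    elseif opcode == RETURN_A then
      return a
    else
      error("unknown opcode at index " .. (pc - 1))
    end
  end
end

print(interpret(program, 12))  -- prints 144
```

Every pass through the `while` loop takes a different branch of the `if` chain, which is exactly the data-dependent control flow described above. Note that the program only terminates for positive inputs, since it counts register 0 down to zero.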

It turns out we can get a similar effect in LuaJIT, without too much effort, using partial evaluation. That is, given a chunk of bytecode, we’ll generate a specialized version of our interpreter for that bytecode. We do this, in time-honored fashion, by copy-pasting. (A much cleaner approach would be to write our interpreter in some structured fashion, and generate the JIT interpreter from that.) We step through each opcode and, instead of evaluating it, append the Lua source that would evaluate it to a string:

if opcode == JUMP_IF_A then
  local target = bytecode[pc]
  pc = pc + 1
  f_str = f_str .. string.format([[
if a == 0 then
  goto op_%d
end
goto op_%d
]], pc, target)
elseif opcode == MOV_R_A then
  local n = bytecode[pc]
  pc = pc + 1
  f_str = f_str .. string.format([[
a = reg_%d
]], n)
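
A complete generator in this style might look like the sketch below. It is not the original toy-jit.lua: the `generate` and `emit` names, the opcode numbering, and the choice to emit an anonymous function (rather than a named global) are assumptions for illustration. It also emits the conditional jump as a single `if a ~= 0 then goto ... end` that falls through otherwise, which is a simplification of the two-goto form in the fragment above. Operands are again treated as 0-based and shifted by one.

```lua
-- Opcode numbering is an arbitrary choice for this sketch.
local MOV_A_R, MOV_R_A, ADD_R_TO_A, DECR_A, JUMP_IF_A, RETURN_A = 1, 2, 3, 4, 5, 6

-- Translate a bytecode array into Lua source text instead of executing it.
local function generate(bytecode)
  local out = {
    "return function(a)",
    "  local reg = {0, 0, 0, 0, 0, 0, 0, 0}",
  }
  local function emit(fmt, ...)
    out[#out + 1] = string.format(fmt, ...)
  end

  local pc = 1
  while pc <= #bytecode do
    local opcode = bytecode[pc]
    emit("  ::op_%d::", pc)        -- every instruction gets a jump label
    pc = pc + 1
    if opcode == DECR_A then
      emit("  a = a - 1")
    elseif opcode == RETURN_A then
      emit("  do return a end")    -- do/end allows a return that is not last
    else
      local arg = bytecode[pc]     -- the remaining opcodes take one operand
      pc = pc + 1
      if opcode == JUMP_IF_A then
        emit("  if a ~= 0 then goto op_%d end", arg + 1)
      elseif opcode == MOV_A_R then
        emit("  reg[%d] = a", arg + 1)
      elseif opcode == MOV_R_A then
        emit("  a = reg[%d]", arg + 1)
      elseif opcode == ADD_R_TO_A then
        emit("  a = a + reg[%d]", arg + 1)
      else
        error("unknown opcode " .. tostring(opcode))
      end
    end
  end
  emit("end")
  return table.concat(out, "\n")
end

-- The squaring program from earlier in the post.
local program = {
  MOV_A_R, 0, MOV_A_R, 1, MOV_R_A, 0, DECR_A, MOV_A_R, 0,
  MOV_R_A, 2, ADD_R_TO_A, 1, MOV_A_R, 2, MOV_R_A, 0,
  JUMP_IF_A, 4, MOV_R_A, 2, RETURN_A,
}

local source = generate(program)
-- loadstring exists in Lua 5.1 and LuaJIT; newer Lua versions use load.
local compile = loadstring or load
local square = assert(compile(source))()
print(square(12))  -- prints 144
```

Because the loop over the bytecode now runs once at generation time, the resulting function contains only straight-line code and a backward `goto`, which is the kind of loop a tracing JIT can recognize. The sketch needs `goto` support, so it runs on LuaJIT 2.x or Lua 5.2 and later, but not on plain Lua 5.1.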

For our test program, this creates a Lua string like this:

function _jit(a)
  local reg = {0, 0, 0, 0, 0, 0, 0, 0}
  ::op_1::
reg[1] = a
::op_3::
reg[2] = a
::op_5::
a = reg[1]
::op_7::
a = a - 1
::op_8::
reg[1] = a
::op_10::
a = reg[3]
::op_12::
a = a + reg[2]
::op_14::
reg[3] = a
::op_16::
a = reg[1]
::op_18::
if a == 0 then
  goto op_20
end
goto op_5
::op_20::
a = reg[3]
::op_22::
return a
end

If we eval() this string (in Lua, by compiling it with loadstring and running the resulting chunk), we get back an interpreter that’s been specialized for just this bytecode. What does our performance look like now?

time pypy-jit-c /home/power/tmp/bytecode.str 100000000
1.63s user 0.01s system 99% cpu 1.649 total

time luajit toy.lua 100000000       
5.51s user 0.01s system 99% cpu 5.549 total

time luajit toy-jit.lua 100000000
0.12s user 0.00s system 97% cpu 0.128 total

We’re now much faster than PyPy! Obviously this trick is easier to play with such a simple interpreter (we’re also using the native numeric type of our JIT, which isn’t always correct). A more complex, dynamically typed system might prove more difficult to partially evaluate. There could also be extra hints I could give to PyPy to make it work better (if you have any ideas, please tell me!).

Still, it’s somewhat surprising how easy it was to generate our ‘JIT’ interpreter: the code isn’t much bigger than the original version. Perhaps with some more scaffolding/helper libraries, this could be a viable way to create fast interpreters for new languages?