Category: Uncategorized

pup is great!

I definitely do more HTML scraping than is healthy for a person.  And inevitably, I end up spending more time doing the scraping than I would have spent doing some copy-pasting.

So of course I’m excited to have a new tool to make wasting time even easier: pup.

I’m sure I’ve worked with various HTML command line scrapers in the past, but pup seems to have hit a sweet spot — reusing CSS selectors makes it pretty intuitive, and the set of operations to extract what you want is just enough to get your data, without overwhelming you.  Nice!

Computing the optimal stop placement for transit

Like many people, I take the bus to work most days. My commute isn’t actually that far (about 3 miles), but I am incredibly lazy, and the bus lets me catch up on the magazines that would otherwise be accumulating dust on my table. (And if I keep up on my Harper’s, I can at least pretend I’m up to date on what’s going on).

Anyway, here’s my bus route:

bus route

The main thing to notice here is that it stops an awful lot.  During peak commute hours, I can sometimes walk faster than the bus.  Given that I’m out of shape and my commute involves a big hill, that’s not a good sign.

It’s been pointed out many times that perhaps stops are placed too close together in many locales:

So there are potentially good reasons why you want to have stops closer together than what might be optimal; though I would mostly bucket these into having a customer base that is old, fat, lazy, grumpy or some combination of those 4.  You can see in the humantransit post the outrage expressed at having stops more than 300 meters apart.  The horror of having to walk more than 1.5 blocks to your stop!

But let’s go ahead and assume we live in a world where people are happy to walk longer distances.  Let’s go further and assume they’re willing to walk as far as they need to ensure their overall trip time is minimized.  If we have such a cooperative public, then what’s our optimal stop distance?  I made up a trivial model of what happens in this case in a Ipython notebook here:

Here’s the resulting plot:

This model is incredibly contrived, but still, it’s interesting to toy with the tradeoffs.  Note that even with a very slow walking pace (2 minutes/block, or 50 meters a minute), the optimal distance is over 5 blocks apart.  (Compare that with the spacing on my route at a ~2 blocks between stops.)
If you have ideas on how to improve the model, please let me know!

Transaction Chain Visualization

We had a paper at the last SOSP on transaction chains.  Our original analysis of chains was done by hand, which is quite a silly way to do it.  We then wrote a simple script to do the graph analysis, but it’s still difficult to picture the interaction of chains (a script telling you that you have an S-C cycle is great, but what should you do about it?)

To make this a bit easier, I made up a little webpage that lets  you enter in a list of chains and indicate commutative links.  This page very effectively illustrates 3 things:

  • My ineptness at Javascript
  • My lack of graph theory knowledge
  • That there are some neat Javascript libraries out there (hello Dagre!)

Try it out here:

Creating fancy images with Matplotlib

I have to give a short presentation at SOSP next week, and for it, I needed to have some nice pictures representing a distributed array. After trying out several tools for trying to create these, I began to lament and cry over the state of Linux drawing software. But that’s a different story. I ended up writing a simple matplotlib script to generate the pictures I needed, and since it worked out pretty well, I thought I’d share it here.

Here’s the kind of picture I’m referring to:


It turns out this is pretty straightforward using matplotlib. Here’s the basic function:

def draw_array(a, target=None):
    fig = pylab.gcf()
    fig.frameon = False

ax = fig.gca()

ax.set_aspect('equal', 'box')

size = 1.0
z_scale = 1.4
i = 0
for z in reversed(range(a.shape[2])):
    for (x,y),v in np.ndenumerate(a[:, :, z]):
        i += 2
        alpha = a['transparency'][x,y,z]
        color = tuple(a['color'][x,y,z])
        off_x = 0.01 + x + size + z / z_scale
        off_y = y + size + z / z_scale

        rect = pylab.Rectangle([off_x, off_y], size, size,
                               facecolor=color, edgecolor=(0,0,0),
                               zorder = i, alpha = alpha)

        cx = off_x + size/2
        cy = off_y + size/2

        # sigh
        label = str(a['name'][x,y,z])
        w, h = pylab.matplotlib.text.TextPath((0,0), label).get_extents().size / 30

        #print w, h

        text = pylab.Text(cx - w / 2, cy - h / 2, label, zorder=i+1)

if target is not None:
return ax

The first part of this just turns off the various lines for the axes. We then iterate through the elements of the array and create a Rectangle() for each one; each “layer” (z-axis) is shifted off to the right a little bit from the previous, to give our illusion of depth. (We don’t want a normal perspective projection, as it would hide too much of the deeper layers).

The “sigh” comment is where I’m using a hack to determine the size of the text we’re going to put in so I can center it in the array cell. I couldn’t find an easier way to do this, and no, I don’t know why I have to divide the result by 30.

The input array has 3 fields which specify how to render each rectangle:

dtype=([('color', 'f,f,f'), ('name', 'i'), ('transparency', 'f')]))

Now we can construct an arbitrary array and feed it into our function:

shape = (3,3,5)
a = np.ndarray(shape, dtype=([('color', 'f,f,f'), ('name', 'i'), ('transparency', 'f')]))
a['name'] = np.arange(
a['transparency'] = 1.0
a['color'] = (1,1,1)
return a

draw_array(a, target='array.pdf')

Once we have the basics out of the way, we can do some fancy rendering really easily. First, let’s make a little helper class to draw slices:

class draw_slice(object):
    def <strong>init</strong>(self, a, target=None):
        self.a = a = target

def __getitem__(self, slc):
    slice_z = np.copy(self.a)
    slice_z['color'][slc] = (0.9, 0.5, 0.3)
    slice_z['transparency'] = 0.9

We can wrap an array in draw_slice() to make it easy to construct pictures of slices:



We can be fancier if we like too, drawing the results of a filter operation:,

draw_slice(a)[a[‘name’] &lt;= 1]


If you are interested, the full code for creating these figures is here: All you need is matplotlib and numpy.

Citibike: Bike Flow

I’m a big fan of the Citibike bike share program that started here recently.  One common issue I and others seem to suffer from is the lack of bikes (when starting a trip) or docks (when ending a trip).  Our neighborhood tends to be a very popular destination in the evening, so when I try to ride a bike in, I often end up a few blocks away from my desired station.  Similarly, if I get a late start in the morning I often find there are no bikes left to ride.

I was curious about how the flow of bikes works around the city — where do the bikes go to when they leave the East Village?  I crawled the Citibike web site and created a simple website to visualize the flow of bikes around the city; the results are pretty interesting:


With some more work on the data, it might be possible to use it for predictions (“will I be able to return this bike to my station”) and to aid in balancing (choosing which stations to move bikes between, and at what time).

The source code for the application is available on Github.

Making a JIT interpreter with LuaJIT

(N.B. The code for all of these experiments is on Github).

I recently read this post by François Perrad regarding interpreters, where he compared interpreter loops written in Lua, LuaJit and Pypy.  (I think the original toy interpreter example comes from PyPy).   After some suggestions, he ended up with a new PyPy version which performed very well — close to what you’d see from a static compiler.

The bytecode ‘program’ being used for all of these examples is simply calculating, in a round-about fashion, the square of an input number:

MOV_A_R,    0,
MOV_A_R,    1,
MOV_R_A,    0, 
MOV_A_R,    0,
MOV_R_A,    2, 
ADD_R_TO_A, 1,
MOV_A_R,    2,
MOV_R_A,    0, 
JUMP_IF_A,  4,
MOV_R_A,    2,

I made a slight modification to this interpreter to force PyPy to load the bytecode at runtime (to ensure it doesn’t “cheat” during translation by just statically optimizing for this particular program).    This version runs quickly, but not as fast as the version that has the bytecode baked in.  It evaluates 100M iterations of the bytecode loop in 1.6seconds; still this is roughly a hundred times faster then the CPython equivalent.  This is what you’d expect, after all — it’s what PyPy is designed for.

The lua based interpreter, when run with Luajit, takes 5.5 seconds; not bad, but it’s 4 times slower then PyPy. Can we do better?  What’s causing Luajit to run slowly?  If we turn on jit debugging for luajit, we see the problem immediately:

luajit -jdump toy-jit.lua 100000000

There’s no output!  The JIT compiler never activated.  What’s going on?

It turns out that our interpreter loop (like all interpreter loops), is unpredictable, as the path of the execution is very data dependent (‘data’ here meaning the bytecode we’re interpreting):

while true do
  local opcode = bytecode[pc]
  pc = pc + 1 
  if opcode == JUMP_IF_A then
    local target = bytecode[pc]
    pc = pc + 1 
    if a ~= 0 then
      pc = target
  elseif opcode == MOV_A_R then
  elseif opcode == MOV_R_A then
  elseif opcode == ADD_R_TO_A then
  elseif opcode == DECR_A then
  elseif opcode == RETURN_A then 

After executing a bytecode, the interpreter goes back up to the top of the while, and jumps to a different place. A tracing JIT never gets a chance to see the pattern, and so you end up running in the interpreter the whole time.  PyPy solves this problem by using magic meta-tracing.

It turns out we can get a similar effect in Luajit, without too much effort, using partial evaluation.  That is, given a chunk of bytecode, we’ll generate a specialized version of our interpreter for that bytecode.  We do this, in time-honored fashion, by copy-pasting. We step through each opcode, and instead of evaluating it, we build up a Lua string to evaluate it (A much cleaner approach would be to write our interpreter in some structured fashion, and generate the JIT interpreter from that):

if opcode == JUMP_IF_A then
      local target = bytecode[pc]
      pc = pc + 1
      f_str = f_str .. string.format([[
if a == 0 then
  goto op_%d
goto op_%d
]], pc, target)
    elseif opcode == MOV_R_A then
      local n = bytecode[pc]
      pc = pc + 1
    f_str = f_str .. string.format([[
a = reg_%d
]], n)

For our test program, this creates a Lua string like this:

function _jit(a)
  local reg = {0, 0, 0, 0, 0, 0, 0, 0}
reg[1] = a
reg[2] = a
a = reg[1]
a = a - 1
reg[1] = a
a = reg[3]
a = a + reg[2]
reg[3] = a
a = reg[1]
if a == 0 then
  goto op_20
goto op_5
a = reg[3]
return a

If we


this string, we get back an interpreter that’s been specialized for just this bytecode. What does our performance look like now?

time pypy-jit-c /home/power/tmp/bytecode.str 100000000
1.63s user 0.01s system 99% cpu 1.649 total

time luajit toy.lua 100000000       
5.51s user 0.01s system 99% cpu 5.549 total

time luajit toy-jit.lua 100000000
0.12s user 0.00s system 97% cpu 0.128 total

We’re now much faster then PyPy! Obviously this trick is easier to play with such a simple interpreter (we’re also using the native numeric type of our JIT, which isn’t always correct). Amore complex, dynamically typed systems might prove to be more difficult to do partial evaluation on. There also could be extra hints I could give to PyPy to make it work better (if you have any ideas, please tell me!).

Still, it’s somewhat surprising how easy it was to generate our ‘JIT’ interpreter — the code isn’t much bigger then the original version. Perhaps with some more scaffolding/helper libraries, this could be a viable way to create fast interpreters for new languages?

scidb performance

We searched around trying to find any reasonable comparison of Scidb performance to existing systems (specifically, we’re looking at doing straightforward bulk-parallel operations like logistic regression/k-means).  So far, we’ve found the performance is very poor — an order of magnitude or 2 worse then single machine runs or systems like Spark.

Any ideas what we’re doing wrong? The code is below. We’re running this across 4 machines with 32GB of memory each. The example code is just trying to do the regression on 100000 samples with 10 dimensions.


insert(multiply(X, W))

query itself seems to take several seconds. What’s going on here? On a single machine, this operation is less then a millisecond; even accounting for disk reads/network issues I’d expect this to be a hundred times faster then it is.

Trying to run with more data causes the system to run out of memory.


function create_randoms(){
  x=$(($1 - 1))
  y=$(($2 - 1))
  echo "Creating random array $3[$1, $2]..."
  iquery -anq "store(build([x=0:$x,$CHUNK_SIZE,0,
y=0:$y,$CHUNK_SIZE,0], double(1)/(random() % 10 + 1)), $3)" >/dev/null

function create_zeros(){
  x=$(($1 - 1))
  y=$(($2 - 1))
  echo "Creating zero array $3[$1, $2]..."
  iquery -anq "store(build([x=0:$x,$CHUNK_SIZE,0,y=0:$y,$CHUNK_SIZE,0],0),$3)" >/dev/null

function create_template(){
  x=$(($1 - 1))
  iquery -nq "create array Template [x=0:$x,$CHUNK_SIZE,0]" >/dev/null

function exe(){
  iquery -o csv+ -q  "$1"

function exe_silent(){
  iquery -nq "$1" >/dev/null

function clean(){
  exe_silent "drop array W"
  exe_silent "drop array X"
  exe_silent "drop array Y"
  exe_silent "drop array Pred"
  exe_silent "drop array Diff"
  exe_silent "drop array Grad"
  exe_silent "drop array Temp"
  exe_silent "drop array Template"

function init(){
  create_randoms $N $D X
  create_randoms $N 1 Y
  create_randoms $D 1 W
  create_zeros $N 1 Pred
  create_zeros $D 1 Grad
  create_zeros $N 1 Diff
  create_template $N

D=10 #Dimensions.
N=100000 #Points.

init #Create arrays.

for i in {1..5}
iquery <<HERE
set no fetch;

set lang afl;
insert(multiply(X, W), Pred);

set lang aql;
update Pred set val= double(1)/(double(1) + exp(-val));
select Pred.val - Y.val into Diff from Pred, Y;
select sum(new_val) as val into Temp from (select X.val * R.val as
new_val from X join reshape(Diff, Template) as R on X.x = R.x) group
by y;
select val into Grad from reshape(substitute(Temp, build(Temp, 0)), Grad);

set fetch;
select W.val + 0.001 * Grad.val into W from W, Grad;


thread profiling in Python

Python has accumulated a lot of… character over the years.  We’ve got no less then 3 profiling libraries for single threaded execution and a multi-threaded profiler with an incompatible interface (Yappi).  Since many applications use more then one thread, this can be a bit annoying.

Yappi works most of the time.  Except it can sometimes cause your application to hang for unknown reasons (I blame signals, personally). The other issue is that Yappi doesn’t have a way of collecting call-stack information. (I don’t necessarily care that memcpy takes all of the time, I want to know who called memcpy). In particular, the lovely gprof2dot can take in pstats dumps and output a very nice profile graph.

To address this for my uses, I glom together cProfile runs from multiple threads. In case it might be useful for other people I wrote a quick gist illustrating how to do it. To make it easy to drop in, I monkey-patch the method, but you can use a more maintainable approach if you like (I create a subclass ProfileThread in my applications).

from threading import Thread
import cProfile
import pstats
def enable_thread_profiling():
  '''Monkey-patch to enable global profiling.
Each thread creates a local profiler; statistics are pooled
to the global stats object on run completion.'''
  Thread.stats = None
  thread_run =
  def profile_run(self):
    self._prof = cProfile.Profile()
    if Thread.stats is None:
      Thread.stats = pstats.Stats(self._prof)
      Thread.stats.add(self._prof) = profile_run
def get_thread_stats():
  stats = getattr(Thread, 'stats', None)
  if stats is None:
    raise ValueError, 'Thread profiling was not enabled,'
                      'or no threads finished running.'
  return stats
if __name__ == '__main__':
  import time
  t = Thread(target=time.sleep, args=(1,))