Friday, March 4, 2011

Fun with ruby arrays, hashes and benchmarking

I recently came across a really interesting problem in ruby where I wanted to convert two arrays into a hash of key value pairs. Here are two simple arrays as an example and the desired resulting hash:
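Something along these lines (the exact arrays are an assumption; any two parallel arrays of equal length will do):

```ruby
# Hypothetical example data: two parallel arrays of equal length.
keys   = [:a, :b, :c]
values = [1, 2, 3]

# The desired resulting hash pairs each key with the value at the
# same index:
#   { :a => 1, :b => 2, :c => 3 }
```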


Let's first look at a brute force way of solving the problem:
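A sketch of the brute force version, assuming the small example arrays from above:

```ruby
keys   = [:a, :b, :c]
values = [1, 2, 3]

# Walk an index over the arrays and assign into the hash.
hash = {}
(0..2).each do |i|
  hash[keys[i]] = values[i]
end
hash  # => { :a => 1, :b => 2, :c => 3 }
```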


Aside from the shortcut for looping through 0 to 2, this doesn't feel very rubyish. It is also 3 lines of code for something that I felt should be a 1-liner. My next attempt was the following:
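A sketch of that attempt, using each_with_index (example arrays assumed):

```ruby
keys   = [:a, :b, :c]
values = [1, 2, 3]

# each_with_index yields each key along with its position,
# so the index bookkeeping disappears from our code.
hash = {}
keys.each_with_index do |key, i|
  hash[key] = values[i]
end
hash  # => { :a => 1, :b => 2, :c => 3 }
```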


This works and is slightly cleaner code regarding the assignment of the hash value, but it is still 3 lines and not very rubyish. The next realization was that Hash has a constructor, Hash[], that takes a list of arguments and assigns them alternately as key/value pairs:
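For example, given an even number of arguments, Hash[] pairs them up in order:

```ruby
# Hash[] treats its arguments as alternating keys and values.
Hash[:a, 1, :b, 2, :c, 3]  # => { :a => 1, :b => 2, :c => 3 }
```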


Going along this line of investigation I remembered the sparingly used zip function:
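zip pairs up the corresponding elements of the two arrays (example arrays assumed):

```ruby
keys   = [:a, :b, :c]
values = [1, 2, 3]

# Each element of keys is paired with the element of values
# at the same index.
keys.zip(values)  # => [[:a, 1], [:b, 2], [:c, 3]]
```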


This is really close to what we want and is even closer when combined with the flatten function:
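flatten collapses the nested pairs into the single alternating list that Hash[] wants:

```ruby
keys   = [:a, :b, :c]
values = [1, 2, 3]

# The zipped pairs flattened into one alternating key/value list.
keys.zip(values).flatten  # => [:a, 1, :b, 2, :c, 3]
```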


Now our hash can be constructed in the 1 line fashion we desire when we use the asterisk to turn the array into the individual argument values:
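A sketch of the resulting one-liner (example arrays assumed):

```ruby
keys   = [:a, :b, :c]
values = [1, 2, 3]

# The splat (*) expands the flattened array into the individual
# arguments Hash[] expects.
hash = Hash[*keys.zip(values).flatten]
hash  # => { :a => 1, :b => 2, :c => 3 }
```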


End of story, right? Well, not quite. I have always been fascinated by the inject method and wondered if I could use it to build the hash. Not surprisingly, you can:
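One inject-based version, matching the "inject/push" label used in the benchmarks below (the exact original isn't shown, so this is a guess at its shape):

```ruby
keys   = [:a, :b, :c]
values = [1, 2, 3]

# Accumulate a flat key/value array with inject and push, then
# hand it to Hash[] via the splat.
hash = Hash[*keys.zip(values).inject([]) { |flat, pair| flat.push(*pair) }]
hash  # => { :a => 1, :b => 2, :c => 3 }
```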


To this point it has been fun to come up with different solutions to the problem. That's what I love about ruby: it is so easy to experiment. The next question that came to mind was: which solution performs the best? Ruby comes with an awesome facility aptly named Benchmark. This is a pretty powerful utility that allows one to get an idea of the user CPU time, system CPU time, the sum of the user and system CPU times, and the elapsed real time. I wrote a quick VR::Script (a script framework for another blog post) to verify the performance of the various methods:
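VR::Script itself is for another post; a plain stand-in using Benchmark.bm, exercising the four techniques under the labels seen in the output below (the sizes are assumptions), looks roughly like this:

```ruby
require 'benchmark'

# The four techniques, keyed by the labels used in the output.
METHODS = {
  'brute force'  => lambda { |keys, values|
    hash = {}
    (0...keys.size).each { |i| hash[keys[i]] = values[i] }
    hash
  },
  'each w/index' => lambda { |keys, values|
    hash = {}
    keys.each_with_index { |key, i| hash[key] = values[i] }
    hash
  },
  'inject/push'  => lambda { |keys, values|
    Hash[*keys.zip(values).inject([]) { |flat, pair| flat.push(*pair) }]
  },
  'zip/flatten'  => lambda { |keys, values|
    Hash[*keys.zip(values).flatten]
  }
}

# Benchmark each technique at a few array sizes.
[10, 50, 100].each do |n|
  keys   = (1..n).to_a
  values = keys.map { |k| k * 2 }
  Benchmark.bm(12) do |bm|
    METHODS.each do |label, build|
      bm.report(label) { build.call(keys, values) }
    end
  end
end
```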


Just running the script with the default values produces the following output:


(arenal) vreng@wes-sandbox:~> ruby array_merge.rb
                   user     system      total        real
brute force    0.000000   0.000000   0.000000 (  0.000032)
each w/index   0.000000   0.000000   0.000000 (  0.000036)
inject/push    0.000000   0.000000   0.000000 (  0.000066)
zip/flatten    0.000000   0.000000   0.000000 (  0.000033)
                   user     system      total        real
brute force    0.000000   0.000000   0.000000 (  0.000088)
each w/index   0.000000   0.000000   0.000000 (  0.000105)
inject/push    0.000000   0.000000   0.000000 (  0.000160)
zip/flatten    0.000000   0.000000   0.000000 (  0.000127)
                   user     system      total        real
brute force    0.000000   0.000000   0.000000 (  0.000168)
each w/index   0.000000   0.000000   0.000000 (  0.000192)
inject/push    0.000000   0.000000   0.000000 (  0.000280)
zip/flatten    0.000000   0.000000   0.000000 (  0.000188)

We can also look at the data file that is produced, which presents the data in a more concise form:

                        (CPU)                                            (Real)
size          bf         ewi          ip          zf          bf         ewi          ip          zf
  10  0.00000000  0.00003314  0.00000000  0.00003505  0.00000000  0.00006390  0.00000000  0.00003195
  50  0.00000000  0.00010586  0.00000000  0.00016499  0.00000000  0.00015092  0.00000000  0.00009799
 100  0.00000000  0.00016212  0.00000000  0.00022697  0.00000000  0.00024605  0.00000000  0.00033784

The leftmost column is the upper limit size of the test arrays that were created. The next four columns are the CPU time for each of the methods (brute force, each with index, inject/push, zip/flatten from left to right), and the last four columns are the real time for each implementation method. This small sample of data doesn't really tell us too much; there is some variation in the data but nothing significant. Let's look at a larger set of data.


                          (CPU)                                                (Real)
  size          bf         ewi          ip            zf            bf         ewi          ip            zf
    10  0.00000000  0.00000000  0.00000000    0.00000000    0.00006104  0.00000000  0.00005794    0.00004578
   100  0.00000000  0.00000000  0.00000000    0.00000000    0.00018191  0.00000000  0.00026894    0.00014210
  1000  0.00000000  0.00000000  0.00000000    0.00000000    0.00152898  0.00000000  0.00241613    0.00650120
 10000  0.02000000  0.01000000  0.02000000    0.06000000    0.03584504  0.01000000  0.03609610    0.06806302
 50000  0.06000000  0.08000000  0.09000000    1.31000000    0.08757114  0.08000000  0.13225317    1.35912704
100000  0.11000000  0.13000000  0.14000000    5.12000000    0.17037010  0.13000000  0.28933597    5.30182099
200000  0.24000000  0.26000000  0.33000000   20.88000000    0.38813710  0.26000000  0.56432796   21.76600599
300000  0.39000000  0.38000000  0.48000000   59.14000000    0.51392102  0.38000000  0.79104495   70.06250811
400000  0.64000000  0.80000000  0.72000000  146.91000000    0.75246811  0.80000000  1.20897222  165.30492520
500000  0.76000000  1.00000000  0.72000000  309.41000000    0.90649104  1.00000000  1.43653202  378.01585603

Now this is where it gets interesting. Look at the last column, which is the real time usage for the zip/flatten technique. It performs so poorly that it is not even on the same scale as the other methods. The graph below plots the benchmark data for the four methods. The first three methods, labelled Brute Force, each with index and inject/push, use the y axis scale on the left; zip/flatten uses the scale on the right. The suspicion from the data was that the zip/flatten method grew quadratically with the size of the arrays, and this confirms it (OK, I didn't fit the data, but look at it). The clear winners are the Brute Force and each with index methods. They grow linearly at almost the same rate, with each with index doing really well up to a size limit of 300k. It then takes a sudden turn up and crosses over the Brute Force method. Brute Force is the most consistent performer, and inject/push holds its own despite the abstraction it introduces.



The real point of all this is that sometimes our code needs to be analyzed under various conditions. I would not have expected the zip/flatten method to perform so poorly so quickly.
It just goes to show that the most elegant code is not always the best code to use.