Exposing Gotchas: Fun with ruby arrays, hashes and benchmarking

I recently came across a really interesting problem in Ruby: I wanted to convert two arrays into a hash of key/value pairs. Here are two simple arrays as an example and the desired resulting hash:
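
A minimal sketch of such a pair, assuming three-element arrays named a1 and a2 (the names match those used in the comments; the values are purely illustrative):

```ruby
# Hypothetical sample data; the values are illustrative only.
a1 = [:a, :b, :c]   # keys
a2 = [1, 2, 3]      # values

# The hash we want to end up with: a1[i] maps to a2[i].
desired = { :a => 1, :b => 2, :c => 3 }
```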

Let's first look at a brute force way of solving the problem:
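
One way the brute force version might look, assuming the illustrative arrays a1 and a2 and an explicit loop over the indexes 0 to 2 (a sketch, not necessarily the exact code the post used):

```ruby
# Hypothetical sample data.
a1 = [:a, :b, :c]
a2 = [1, 2, 3]

h = {}
0.upto(2) do |i|      # visit indexes 0, 1 and 2
  h[a1[i]] = a2[i]    # pair the i-th key with the i-th value
end
```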

Aside from the shortcut for looping from 0 to 2, this doesn't feel very rubyish. It is also 3 lines of code for something that I felt should be a one-liner. My next attempt was the following:
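
A sketch of what this attempt could look like, assuming it is the each_with_index approach named in the benchmark labels further down (a1 and a2 are illustrative):

```ruby
# Hypothetical sample data.
a1 = [:a, :b, :c]
a2 = [1, 2, 3]

h = {}
a1.each_with_index do |key, i|
  h[key] = a2[i]      # the key comes from the block; only the value needs an index lookup
end
```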

This works, and the assignment of the hash value is slightly cleaner, but it is still 3 lines and not very rubyish. The next realization was that Hash has a constructor, Hash[], that takes a flat list of arguments and alternately assigns them as keys and values:
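
A minimal illustration of that constructor, Hash[], called with keys and values alternating (the literal values are illustrative):

```ruby
# Arguments are consumed pairwise: key, value, key, value, ...
h = Hash[:a, 1, :b, 2, :c, 3]
```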

Going along this line of investigation I remembered the sparingly used zip function:
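
For reference, a small sketch of what zip does with two illustrative arrays:

```ruby
# Hypothetical sample data.
a1 = [:a, :b, :c]
a2 = [1, 2, 3]

# zip pairs up elements by position, producing an array of [key, value] pairs.
pairs = a1.zip(a2)
```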

This is really close to what we want and is even closer when combined with the flatten function:
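
A sketch of the combination, again with illustrative arrays:

```ruby
# Hypothetical sample data.
a1 = [:a, :b, :c]
a2 = [1, 2, 3]

# flatten collapses the [key, value] pairs into one alternating list.
flat = a1.zip(a2).flatten
```

One caveat worth keeping in mind: flatten is recursive, so if any key or value were itself an array it would be flattened too and the pairing would break.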

Now our hash can be constructed in the one-line fashion we desire by using the asterisk (splat) to turn the array into individual argument values:
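
Put together, the one-liner would look something like this (a1 and a2 are illustrative):

```ruby
# Hypothetical sample data.
a1 = [:a, :b, :c]
a2 = [1, 2, 3]

# The splat (*) expands the flat array into separate arguments for Hash[].
h = Hash[*a1.zip(a2).flatten]
```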

End of story, right? Well, not quite. I have always been fascinated by the inject method and wondered if I could use it to build the hash. Not surprisingly, you can:
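
One possible inject-based version is sketched here. This is an assumption about the general shape only; the version benchmarked below is labelled inject/push and may well have been written differently.

```ruby
# Hypothetical sample data.
a1 = [:a, :b, :c]
a2 = [1, 2, 3]

# The accumulator starts as an empty hash. The block must return it each time,
# because inject feeds the block's return value into the next iteration.
h = (0...a1.size).inject({}) do |acc, i|
  acc[a1[i]] = a2[i]
  acc
end
```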

Up to this point it has been fun to come up with different solutions to the problem. What I love about Ruby is that it is so easy to experiment. The next question that came to mind was: which solution performs the best? Ruby comes with an awesome facility aptly named Benchmark. This is a pretty powerful utility that reports the user CPU time, system CPU time, the sum of the user and system CPU times, and the elapsed real time. I wrote a quick VR::Script (a script framework for another blog post) to verify the performance of the various methods:
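
The VR::Script wrapper is not reproduced here. As a rough stand-in, a plain Benchmark.bm comparison might look like the sketch below; the array size, the generated keys and the builders hash are all assumptions made for illustration, and the inject variant is only a guess at the post's inject/push method.

```ruby
require 'benchmark'

size = 1_000                                # assumed size; adjust to taste
a1 = (0...size).map { |i| "key#{i}" }       # hypothetical keys
a2 = (0...size).to_a                        # hypothetical values

# Each lambda builds the same hash with a different technique.
builders = {
  'brute force'  => lambda { h = {}; 0.upto(size - 1) { |i| h[a1[i]] = a2[i] }; h },
  'each w/index' => lambda { h = {}; a1.each_with_index { |k, i| h[k] = a2[i] }; h },
  'inject'       => lambda { (0...size).inject({}) { |h, i| h[a1[i]] = a2[i]; h } },
  'zip/flatten'  => lambda { Hash[*a1.zip(a2).flatten] }
}

results = {}
Benchmark.bm(12) do |x|                     # 12 = width reserved for the labels
  builders.each do |label, builder|
    x.report(label) { results[label] = builder.call }
  end
end
```

Keeping the results around makes it easy to confirm that every technique produced the same hash before reading anything into the timings.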

Just running the script with the default values produces the following output:

(arenal) vreng@wes-sandbox:~> ruby array_merge.rb

	user	system	total	real
brute force	0.000000	0.000000	0.000000	( 0.000032)
each w/index	0.000000	0.000000	0.000000	( 0.000036)
inject/push	0.000000	0.000000	0.000000	( 0.000066)
zip/flatten	0.000000	0.000000	0.000000	( 0.000033)
	user	system	total	real
brute force	0.000000	0.000000	0.000000	( 0.000088)
each w/index	0.000000	0.000000	0.000000	( 0.000105)
inject/push	0.000000	0.000000	0.000000	( 0.000160)
zip/flatten	0.000000	0.000000	0.000000	( 0.000127)
	user	system	total	real
brute force	0.000000	0.000000	0.000000	( 0.000168)
each w/index	0.000000	0.000000	0.000000	( 0.000192)
inject/push	0.000000	0.000000	0.000000	( 0.000280)
zip/flatten	0.000000	0.000000	0.000000	( 0.000188)

We can also look at the data file that is produced, which presents the data in a more concise form:

	(CPU)				(Real)
size	bf	ewi	ip	zf	bf	ewi	ip	zf
10	0.00000000	0.00003314	0.00000000	0.00003505	0.00000000	0.00006390	0.00000000	0.00003195
50	0.00000000	0.00010586	0.00000000	0.00016499	0.00000000	0.00015092	0.00000000	0.00009799
100	0.00000000	0.00016212	0.00000000	0.00022697	0.00000000	0.00024605	0.00000000	0.00033784

The leftmost column is the upper limit on the size of the test arrays that were created. The next four columns are the CPU time for each of the methods used (brute force, each with index, inject/push, zip/flatten from left to right) and the last four columns are the real time for each method. This small sample of data doesn't really tell us too much. There is some variation in the data but nothing significant. Let's look at a larger set of data:

	(CPU)				(Real)
size	bf	ewi	ip	zf	bf	ewi	ip	zf
10	0.00000000	0.00000000	0.00000000	0.00000000	0.00006104	0.00000000	0.00005794	0.00004578
100	0.00000000	0.00000000	0.00000000	0.00000000	0.00018191	0.00000000	0.00026894	0.00014210
1000	0.00000000	0.00000000	0.00000000	0.00000000	0.00152898	0.00000000	0.00241613	0.00650120
10000	0.02000000	0.01000000	0.02000000	0.06000000	0.03584504	0.01000000	0.03609610	0.06806302
50000	0.06000000	0.08000000	0.09000000	1.31000000	0.08757114	0.08000000	0.13225317	1.35912704
100000	0.11000000	0.13000000	0.14000000	5.12000000	0.17037010	0.13000000	0.28933597	5.30182099
200000	0.24000000	0.26000000	0.33000000	20.88000000	0.38813710	0.26000000	0.56432796	21.76600599
300000	0.39000000	0.38000000	0.48000000	59.14000000	0.51392102	0.38000000	0.79104495	70.06250811
400000	0.64000000	0.80000000	0.72000000	146.91000000	0.75246811	0.80000000	1.20897222	165.30492520
500000	0.76000000	1.00000000	0.72000000	309.41000000	0.90649104	1.00000000	1.43653202	378.01585603

Now this is where it gets interesting. Look at the last column, which is the real time usage for the zip/flatten technique. It performs so poorly that it is not even on the same scale as the other methods. The graph below plots the benchmark data for the four methods. The first three methods, labelled Brute Force, each with index and inject/push, use the y axis scale on the left. The zip/flatten method uses the scale on the right. The suspicion from the data was that the zip/flatten method grew quadratically with the size of the arrays, and this confirms that (OK, I didn't fit the data, but look at it). The clear winners are the Brute Force and each with index methods. They grow linearly at almost the same rate, with each with index doing really well up to a size limit of 300k. It then takes a sudden turn up and crosses over the Brute Force method. Brute Force is the most consistent performer, and inject/push holds its own despite the abstraction it introduces.
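
Short of a proper fit, a quick sanity check on the growth rate is to see what happens to the zip/flatten CPU time each time the size doubles. The sketch below uses four rows copied from the table above; the variable names are illustrative.

```ruby
# CPU seconds for zip/flatten, copied from the table above (size => seconds).
zf_cpu = { 50_000 => 1.31, 100_000 => 5.12, 200_000 => 20.88, 400_000 => 146.91 }

# If t grows like n**k, doubling n multiplies t by 2**k, so k = log2(t2 / t1).
exponents = zf_cpu.keys.sort.each_cons(2).map do |n1, n2|
  [n1, n2, Math.log(zf_cpu[n2] / zf_cpu[n1]) / Math.log(2)]
end

exponents.each do |n1, n2, k|
  puts format('%d -> %d: exponent ~ %.2f', n1, n2, k)
end
```

An exponent near 2 means the time roughly quadruples when the size doubles, which is what the steps up to 200k show; the step from 200k to 400k comes out higher than that.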

The real point of all this is that sometimes our code needs to be analyzed under various conditions. I would not have expected the zip/flatten method to perform so poorly so quickly. It just goes to show that the most elegant code is not always the best code to use.

9 comments:

Unknown, March 4, 2011 at 1:52 AM
Another choice was to just do:

Hash[ a1.zip( a2 ) ]

without the flatten and splatting, which would also work.
Unknown, March 4, 2011 at 4:39 AM
Hi,
I have run your test, with vr_script removed of course. To my complete surprise, I got totally different results:

ruby 1.8.7 (2010-06-23 patchlevel 299) [i686-linux]
user system total real
brute force 0.070000 0.000000 0.070000 ( 0.071190)
each with index 0.110000 0.000000 0.110000 ( 0.109908)
inject/push 0.130000 0.000000 0.130000 ( 0.124819)
zip/flatten 0.120000 0.000000 0.120000 ( 0.128477)
user system total real
brute force 0.400000 0.000000 0.400000 ( 0.396910)
each with index 0.530000 0.000000 0.530000 ( 0.539222)
inject/push 0.640000 0.000000 0.640000 ( 0.647112)
zip/flatten 0.690000 0.010000 0.700000 ( 0.701078)
user system total real
brute force 0.790000 0.000000 0.790000 ( 0.794770)
each with index 1.070000 0.010000 1.080000 ( 1.075875)
inject/push 1.700000 0.000000 1.700000 ( 1.720560)
zip/flatten 1.600000 0.000000 1.600000 ( 1.600440)
simcha@lapik:/tmp$ cat benchmark.data
(CPU) (Real)
size bf ewi ip zf bf ewi ip zf
100000 0.07000000 0.07000000 0.10000000 0.10000000 0.13000000 0.13000000 0.14000000 0.14000000
500000 0.36000000 0.36000000 0.57000000 0.57000000 0.67000000 0.67000000 0.70000000 0.70000000
1000000 0.73000000 0.73000000 1.09000000 1.09000000 1.53000000 1.53000000 1.57000000 1.57000000

These are for 100 000, 500 000 and 1 000 000, and the zip function does not perform so badly. Pablo's version is even better, at 0.48 for 1 million. You can see that something is wrong with the data in the file, so is there some difference between our systems or libraries? That may cause the difference in results as well. I'm really curious what it is.
Jan
Unknown, March 4, 2011 at 5:04 AM
This comment has been removed by the author.
Unknown, March 4, 2011 at 5:13 AM
Why didn't you test Hash[a1.zip(a2)]?
Unknown, March 4, 2011 at 9:43 AM
Hi naquad, I actually did, and it is the quickest of all.
For one million, run again:
user system total real
brute force 0.830000 0.000000 0.830000 ( 0.831484)
each with index 1.110000 0.000000 1.110000 ( 1.114325)
inject/push 1.830000 0.010000 1.840000 ( 1.837285)
zip 0.620000 0.010000 0.630000 ( 0.637284)
Unknown, March 5, 2011 at 3:12 AM
Wes: What version of Ruby is this testing on? I'm guessing 1.8.* from the `require 'rubygems'` line. Using 1.9.2 I'm getting StackOverflowErrors when using a limit more than ~65000, seems to be occurring in the Hash[] method.
TJSingleton, March 6, 2011 at 11:05 AM
Here are a couple notes: https://gist.github.com/857548
Unknown, March 7, 2011 at 6:09 PM
@Nemo157 - Believe it or not, I forgot to mention this was in a 1.8.6 environment.

@naquad - Because that only works in 1.8.7 and 1.9.x

@TJSingleton - See prior comment on version

I was going to follow up to this post with suggestions along with a comparison of ruby versions for fun. Interesting stuff and appreciate the comments and enthusiasm!
Max D, December 19, 2024 at 10:26 PM
Very thoughtful blog

Friday, March 4, 2011