Exposing Gotchas

One of the common themes that people talk about when it comes to craftsmanship is practice. Dave Thomas drew the parallel for programmers a few years ago in this blog post. He makes great points about not worrying about the answer but instead focusing on how you got to the answer that you implemented. He also emphasizes repetition, meaning you should practice a kata frequently.

I love the idea of the kata, but I had a few ideas that could make the process more effective and encourage wider adoption, not only in the ruby community but across all programming languages. The ideas can be summarized as follows:

Authoring - I wanted a consistent way to write a kata that would clearly illustrate the requirements through details and examples.

Administering - How do you take a kata? I wanted a way to step through the requirements and measure progress by how many requirements I completed and how much time it took to complete them.

Setup - I wanted to be able to chart my progress over time by using source control to keep track of the code for each time I took that kata in an automated fashion.

The result of these ideas is the kata gem. Check the github page for all of the details but briefly it provides the following:

An RSpec like DSL for authoring katas using descriptive keywords like requirement, detail and example.

A command line tool to self-administer a kata by stepping through the requirements.

The ability to automatically create a distinct github repo for the kata you wish to take allowing for later code review.
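To give a feel for what an RSpec like DSL built on those keywords could look like, here is a toy sketch. It is not the kata gem's implementation: the `KataSketch` class, the top-level `kata` method and the sample requirement text are all assumptions made up for illustration. Only the keywords requirement, detail and example come from the list above.

```ruby
# Toy stand-in for a kata DSL. NOT the kata gem's code; KataSketch and the
# top-level kata method are hypothetical names used only for illustration.
class KataSketch
  attr_reader :name, :requirements

  def initialize(name, &block)
    @name = name
    @requirements = []
    instance_eval(&block) if block
  end

  # Start a new requirement and collect its details and examples.
  def requirement(text, &block)
    @current = { :text => text, :details => [], :examples => [] }
    @requirements << @current
    instance_eval(&block) if block
  end

  def detail(text)
    @current[:details] << text
  end

  def example(text)
    @current[:examples] << text
  end
end

def kata(name, &block)
  KataSketch.new(name, &block)
end

# Hypothetical kata text, loosely modelled on a string calculator.
string_calculator = kata "String Calculator" do
  requirement "Create an add method that accepts a string" do
    detail "An empty string returns zero"
    example '"" computes to 0'
  end
end

puts string_calculator.requirements.first[:text]
```

Check the github page for the real syntax before authoring a kata with the gem.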

Now that we have tools in place to make us more productive in writing and taking katas we just need katas to test ourselves with. I have created the code katas project so that everyone has the opportunity to contribute and take advantage of the efforts put forth.

If you just want to take a kata that is part of the project it is very easy to get started if you are used to using rvm and bundler. Just clone the repo:

Change into the cloned repo:

Install your bundle:

Now you can set up your github repo to take your kata:

At this point you can cd into the created directory and view the repo in your github account. You can commit to it but should ignore it in the code_kata project. All the goodies like autotest and kata are installed so you can follow TDD best practices without any work on your part.

In a separate terminal window you can then administer the kata to guide you through the requirements for the code you need to write:

You are off and running to practice regularly!

What the project obviously needs, though, is more katas. String calculators are great, and thanks to Roy and Katacasts for the inspiration, but more can be done. Translating and updating the original katas from Dave Thomas would probably be a good start, but what ideas do you all have? Fork the code kata project and submit your ideas so everyone can grow and get better!

I recently came across a really interesting problem in ruby where I wanted to convert two arrays into a hash of key value pairs. Here are two simple arrays as an example and the desired resulting hash:
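As a concrete illustration, assume two arrays like the ones below. The variable names and the data are assumptions chosen for this sketch.

```ruby
# Hypothetical sample data; the names keys/values/desired are assumptions.
keys   = [:a, :b, :c]
values = [1, 2, 3]

# The hash we want to end up with: each key paired with the value
# at the same position in the other array.
desired = { :a => 1, :b => 2, :c => 3 }
```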

Let's first look at a brute force way of solving the problem:
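A minimal sketch of what a brute force version might look like, assuming both arrays have the same length. The method name `brute_force` and the sample data are assumptions for illustration.

```ruby
# Brute force: walk the indices by hand and assign each pair.
# Assumes keys and values have the same length.
def brute_force(keys, values)
  hash = {}
  0.upto(keys.length - 1) { |i| hash[keys[i]] = values[i] }
  hash
end

# Returns a hash mapping :a to 1, :b to 2 and :c to 3.
brute_force([:a, :b, :c], [1, 2, 3])
```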

Aside from the shortcut for looping through 0 to 2 this doesn't feel very rubyish. It is also 3 lines of code for something that I felt should be a 1 liner. My next attempt was the following:
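One way such a second attempt could look uses `each_with_index`, which matches the "each w/index" label in the benchmark output further down. The method name here is an assumption.

```ruby
# Let each_with_index hand us the position instead of looping over a range.
# Assumes keys and values have the same length.
def each_with_index_merge(keys, values)
  hash = {}
  keys.each_with_index { |key, i| hash[key] = values[i] }
  hash
end

# Returns a hash mapping :a to 1, :b to 2 and :c to 3.
each_with_index_merge([:a, :b, :c], [1, 2, 3])
```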

This works and is slightly cleaner code regarding the assignment of the hash value, but it is still 3 lines and not very rubyish. The next realization was that Hash has a bracket constructor, Hash[], that takes a flat list of arguments and alternately assigns them as keys and values:
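A short sketch of `Hash[]` with alternating keys and values. The wrapper method name is an assumption so the call is easy to reuse.

```ruby
# Hash[] treats an even-length argument list as key, value, key, value, ...
def hash_from_flat_list(*items)
  Hash[*items]
end

# Returns a hash mapping :a to 1, :b to 2 and :c to 3.
hash_from_flat_list(:a, 1, :b, 2, :c, 3)
```

An odd number of arguments raises an ArgumentError, so the list has to contain complete pairs.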

Going along this line of investigation I remembered the sparingly used zip function:
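For example, `zip` pairs up elements that share a position. The method name and data are assumptions for this sketch.

```ruby
# zip walks both arrays together and returns an array of [key, value] pairs.
def pair_up(keys, values)
  keys.zip(values)
end

pair_up([:a, :b, :c], [1, 2, 3])
# => [[:a, 1], [:b, 2], [:c, 3]]
```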

This is really close to what we want and is even closer when combined with the flatten function:
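A sketch of the combined call, again with an assumed method name:

```ruby
# Flattening the zipped pairs gives one flat key, value, key, value list.
def interleave(keys, values)
  keys.zip(values).flatten
end

interleave([:a, :b, :c], [1, 2, 3])
# => [:a, 1, :b, 2, :c, 3]
```

Note that `flatten` also flattens any keys or values that are themselves arrays, so this only works as intended for non-array elements.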

Now our hash can be constructed in the 1 line fashion we desire when we use the asterisk to turn the array into the individual argument values:
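Put together, the one liner might look like the body of this method. The method name is an assumption.

```ruby
# The splat (*) expands the flat array into individual arguments for Hash[].
def zip_flatten_merge(keys, values)
  Hash[*keys.zip(values).flatten]
end

# Returns a hash mapping :a to 1, :b to 2 and :c to 3.
zip_flatten_merge([:a, :b, :c], [1, 2, 3])
```

On Ruby 1.8.7 and later, `Hash[keys.zip(values)]` accepts the array of pairs directly, and from Ruby 2.1 on `keys.zip(values).to_h` does the same, so the flatten and splat steps can be skipped there.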

End of story right? Well not quite. I have always been fascinated by the inject method and wondered if I could use it to build the hash. Not surprisingly you can:
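One possible inject based version is sketched below. It is not necessarily the version that was benchmarked: the "inject/push" label in the output further down suggests that one also used push, which this sketch does not. The method name is an assumption.

```ruby
# inject threads an accumulator (the hash) through every [key, index] pair.
# The block must return the hash so it is passed on to the next iteration.
def inject_merge(keys, values)
  keys.each_with_index.inject({}) do |hash, (key, i)|
    hash[key] = values[i]
    hash
  end
end

# Returns a hash mapping :a to 1, :b to 2 and :c to 3.
inject_merge([:a, :b, :c], [1, 2, 3])
```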

To this point it has been fun to come up with different solutions to the problem. What I love about ruby is that it is so easy to experiment. The next question that came to mind was: which solution performs the best? Ruby comes with an awesome facility aptly named Benchmark. It is a pretty powerful utility that reports the user CPU time, system CPU time, the sum of the user and system CPU times, and the elapsed real time. I wrote a quick VR::Script (a script framework for another blog post) to verify the performance of the various methods:
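The VR::Script version is not reproduced here. As a stand-in, this is a minimal sketch using only the standard Benchmark library. The constant names, the array size of 1,000 and the exact method bodies are assumptions, so the timings will not match the numbers below.

```ruby
require 'benchmark'

# Hypothetical test data; the size is an assumption and can be raised.
SIZE   = 1_000
KEYS   = (1..SIZE).map { |i| "key#{i}" }
VALUES = (1..SIZE).to_a

# One lambda per technique, keyed by the label shown in the report.
MERGES = {
  'brute force'  => lambda { h = {}; 0.upto(KEYS.length - 1) { |i| h[KEYS[i]] = VALUES[i] }; h },
  'each w/index' => lambda { h = {}; KEYS.each_with_index { |k, i| h[k] = VALUES[i] }; h },
  'inject'       => lambda { KEYS.each_with_index.inject({}) { |h, (k, i)| h[k] = VALUES[i]; h } },
  'zip/flatten'  => lambda { Hash[*KEYS.zip(VALUES).flatten] }
}

# bm prints user, system, total and real time for every report line;
# the argument is the column width reserved for the labels.
Benchmark.bm(12) do |x|
  MERGES.each { |label, merge| x.report(label) { merge.call } }
end
```

Keeping the techniques in a hash of lambdas makes it easy to add another candidate and have it show up in the same report.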

Just running the script with the default values produces the following output:

(arenal) vreng@wes-sandbox:~> ruby array_merge.rb

	user	system	total	real
brute force	0.000000	0.000000	0.000000	( 0.000032)
each w/index	0.000000	0.000000	0.000000	( 0.000036)
inject/push	0.000000	0.000000	0.000000	( 0.000066)
zip/flatten	0.000000	0.000000	0.000000	( 0.000033)
	user	system	total	real
brute force	0.000000	0.000000	0.000000	( 0.000088)
each w/index	0.000000	0.000000	0.000000	( 0.000105)
inject/push	0.000000	0.000000	0.000000	( 0.000160)
zip/flatten	0.000000	0.000000	0.000000	( 0.000127)
	user	system	total	real
brute force	0.000000	0.000000	0.000000	( 0.000168)
each w/index	0.000000	0.000000	0.000000	( 0.000192)
inject/push	0.000000	0.000000	0.000000	( 0.000280)
zip/flatten	0.000000	0.000000	0.000000	( 0.000188)

We can also look at the datafile that is produced which is a more concise form of data:

	(CPU)				(Real)
size	bf	ewi	ip	zf	bf	ewi	ip	zf
10	0.00000000	0.00003314	0.00000000	0.00003505	0.00000000	0.00006390	0.00000000	0.00003195
50	0.00000000	0.00010586	0.00000000	0.00016499	0.00000000	0.00015092	0.00000000	0.00009799
100	0.00000000	0.00016212	0.00000000	0.00022697	0.00000000	0.00024605	0.00000000	0.00033784

The leftmost column is the upper limit size of the test arrays that were created. The next four columns are the CPU time for each of the methods used (brute force, each with index, inject/push, zip/flatten from left to right) and the last four columns are the real time for each implementation method. This small sample of data doesn't really tell us too much. There is some variation in the data but nothing significant. Let's look at a larger set of data:

	(CPU)				(Real)
size	bf	ewi	ip	zf	bf	ewi	ip	zf
10	0.00000000	0.00000000	0.00000000	0.00000000	0.00006104	0.00000000	0.00005794	0.00004578
100	0.00000000	0.00000000	0.00000000	0.00000000	0.00018191	0.00000000	0.00026894	0.00014210
1000	0.00000000	0.00000000	0.00000000	0.00000000	0.00152898	0.00000000	0.00241613	0.00650120
10000	0.02000000	0.01000000	0.02000000	0.06000000	0.03584504	0.01000000	0.03609610	0.06806302
50000	0.06000000	0.08000000	0.09000000	1.31000000	0.08757114	0.08000000	0.13225317	1.35912704
100000	0.11000000	0.13000000	0.14000000	5.12000000	0.17037010	0.13000000	0.28933597	5.30182099
200000	0.24000000	0.26000000	0.33000000	20.88000000	0.38813710	0.26000000	0.56432796	21.76600599
300000	0.39000000	0.38000000	0.48000000	59.14000000	0.51392102	0.38000000	0.79104495	70.06250811
400000	0.64000000	0.80000000	0.72000000	146.91000000	0.75246811	0.80000000	1.20897222	165.30492520
500000	0.76000000	1.00000000	0.72000000	309.41000000	0.90649104	1.00000000	1.43653202	378.01585603

Now this is where it gets interesting. Look at the last column, which is the real time usage for the zip/flatten technique. It performs so poorly that it is not even on the same scale as the other methods. The graph below plots the benchmark data for the four methods. The first three methods, labelled Brute Force, each with index and inject/push, use the y axis scale on the left. The zip/flatten method uses the scale on the right. The suspicion from the data was that the zip/flatten method grows quadratically with the size of the arrays, and the graph supports that (OK, I didn't fit the data, but look at it). The clear winners are the Brute Force and each with index methods. They grow linearly at almost the same rate, with each with index doing really well up to a size limit of 300k. It then takes a sudden turn up and crosses over the Brute Force method. Brute Force is the most consistent performer, and inject/push holds its own despite the abstraction it introduces.

The real point of all this is that sometimes our code needs to be analyzed under various conditions. I would not have expected the zip/flatten method to perform so poorly so quickly. It just goes to show that the most elegant code is not always the best code to use.

Monday, March 21, 2011

Introducing the Code Kata project

Friday, March 4, 2011

Fun with ruby arrays, hashes and benchmarking
