Book Review for "Exploring Everyday Things with R and Ruby"
by Sau Sheong Chang, O'Reilly Media
Data exploration and visualization made relatively easy, with Ruby and R.
I've been a fan of R for quite some time now. I'm not much of a statistician, but I am good at finding problems when given a visualization of a pile of data. That's what I like R for. I've used it to examine log files (especially timing metrics) and found leaking connections, pointers to under-allocated resources, and other problems. R can tell you things about data you'd never find just by looking it over.
The thing about R is that you usually have to set the data up in a format that makes it usable. R likes data grouped into neat bunches, which can then be mathematically examined and displayed on nifty plots and graphs. Preparing that data is something I've usually done in Python, but this book chose Ruby. I was happy with that, as Ruby is a language I could stand to learn more about.
The first chapter gives you a good introduction to Ruby. Readers should probably know what an object-oriented language looks like, but not much else is required. The author presents the basics in a clear and minimal manner, probably enough to get most programmers off the ground. I liked what I read.
The second chapter introduces R. Again, the author is very clear and gives plenty of easy-to-try examples. In my opinion, this book provides one of the best introductions to R that I've seen. Very well done.
Once you've been introduced to the core toolkit, the author is off to the races, explaining how you might envision data solutions to a variety of contrived problems. You get examples of how to calculate bathroom capacity for an office building, how to model a market economy, and how to model a flock of birds as they travel. The author is creative about the problems. Honestly, some of these were a bit tedious to follow. For each problem presented, the author sets up an object-oriented domain to represent the problem space. The next step is to produce data from the domain, and finally to run that data through R for insight into how it can be interpreted. All of these steps are explained, but you might find yourself referring to external sources once in a while to figure out how to read a particular line of Ruby or R.
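To give a flavour of that last step, here's a minimal sketch of the kind of R you might run over a simulation's output. The file name, column names, and plot are my own invention for illustration; they're not code from the book:
# Hypothetical output from a Ruby simulation: one row per minute of the workday,
# with a count of how many people are waiting for the bathroom.
usage <- read.csv("bathroom_sim.csv", header=TRUE)   # columns: minute, waiting

summary(usage$waiting)                                # quick statistical summary
plot(usage$minute, usage$waiting, type="l", col="blue",
     xlab="Minute of day", ylab="People waiting")     # where do the peaks land?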
I found these problem exercises a little tedious at times, but valid in their construction. It's sort of like reading someone else's infrastructure code-- you can gain a respect for how their thought process works, but it's not always fun going through the knowledge assimilation process. Probably good mental exercise for a programmer, though.
So, what's the verdict? I liked this book and am certain I'll use it as a reference once in a while (especially for the R portions). Ruby I may or may not end up using again, because Python is fun to program in. Is this book worth the money? I'd say it is, especially if you don't yet grok R or Ruby. If you're at the mid-to-expert level with R, then you may not have much to learn technically (assuming you already have a way to tee up your data), but you could find the problem-solving cases interesting. Or not; it depends on how you like reading that kind of stuff. If you're flat-out new to both R and Ruby, you probably should buy this book just for the first couple of chapters. It'll open up your world.
The book can be found here.
Happy Data Visualizing!
Saturday, April 3, 2010
Performance testing tip -- Visualize your data to help reveal patterns!
When you're doing performance testing, take a little extra time and examine your data for patterns. Sometimes the patterns aren't easy to see, but with the help of some tools they'll jump right out at you. Let's look at an example.
Here's some sample data that represents the amount of time it takes for a test driver to invoke the server. Do you see the pattern?
latency
0.7630
0.0626
0.0662
0.0705
0.0833
0.0631
0.0656
0.0712
0.0841
0.0623
0.0671
0.0699
0.0840
0.0619
0.0659
0.0720
0.0835
0.0630
0.0659
0.0706
0.0835
0.0619
0.0671
I can't see the pattern immediately. So let's have a look using R, the excellent open-source statistical language. (If you're not using R yet, you might consider looking at it. It really is easy to use, and it gives you excellent statistical capabilities. There are lots of nice web tutorials on how to use it, as well as the excellent Manning title "R in Action".) Here's the R script:
# Read the latency data (one 'latency' column, with a header row)
timingticks <- read.table("/home/rick/Blog_Temp/My.dat", header=TRUE)
attach(timingticks)        # make the 'latency' column available by name
'Count'                    # print a label for the next value
length(latency)            # number of records in the file
summary(timingticks)       # min, quartiles, median, mean, max
plot(latency, col='blue')  # plot latency against observation number
axis(2, tck=1)             # redraw the left axis with tick lines spanning the plot
q()                        # quit R
The script is pretty self-explanatory, but what we're saying is roughly this:
- Use my data file.
- Get me the number of records (length) in that file, and label it 'Count'.
- Give me a statistical summary of the file.
- Make a plot of the file, with tick lines drawn all the way across from the left axis. THIS IS THE PART THAT SOMETIMES MAKES THE PATTERNS JUMP OUT!
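As an aside, if you'd rather not redraw the axis, R's built-in grid() function gives a similar effect. This variation is my own, not part of the original script:
# Same plot, with horizontal reference lines drawn by grid() instead of axis()
timingticks <- read.table("/home/rick/Blog_Temp/My.dat", header=TRUE)
plot(timingticks$latency, col='blue')
grid(nx=NA, ny=NULL)   # horizontal grid lines only, at the default y-axis ticks
Either way, the horizontal lines give your eye a reference for spotting repeating shapes.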
So let's run the script then have a look at the results. First, the statistical output:
[1] "Count"
[1] 23
latency
Min. :0.06190
1st Qu.:0.06435
Median :0.06710
Mean :0.10036
3rd Qu.:0.07765
Max. :0.76300
> proc.time()
user system elapsed
0.732 0.028 0.747
This alone makes using R worthwhile. With almost no effort, I got some good information about my application timing performance. But that's not the real pay-dirt in this case! Let's have a look at the plot:
[Plot of the latency data: a single high point at the start, then a repeating saw-tooth pattern along the bottom.]
Now that's more interesting! Let's ignore the top dot, which represents a 'first transaction'. These are often much longer than the following entries due to 'warm up' issues. But look at the little repeating saw-tooth at the bottom! This is indicative of some sort of repeated behaviour.
In a real-world scenario, this was exactly what I saw when we found some server-side code that was trying to improve performance with a cache. The code was caching a value in a list, then when the list got to a certain size it tossed the cache and started over again. (In this case, the list size was 4. See how the timing gets a little longer for 4 cycles in a row, then 'busts' and starts over again?) It turned out our "performance-enhancing" cache was actually costing us performance! We removed the cache.
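If you'd rather confirm a suspected cycle length than eyeball it, R can help there too. Here's a minimal sketch using the same data file; dropping the first warm-up sample and the lag-4 guess are my own additions, not part of the original analysis:
# Re-read the latency file and drop the first 'warm-up' observation
timingticks <- read.table("/home/rick/Blog_Temp/My.dat", header=TRUE)
lat <- timingticks$latency[-1]

acf(lat, lag.max=8)        # a spike at lag 4 supports the 4-step saw-tooth theory

# Colour each point by its position within the suspected 4-step cycle
plot(lat, col=(seq_along(lat) - 1) %% 4 + 1, pch=19)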
This is just one sample of how data visualisation can be used to help you find patterns in performance data. It doesn't take much time, and it can lead you to valuable insights. It's also fun!
Happy Coding!
Sunday, October 4, 2009
Statistics 'R' easy!
A while back I read an early review of 'R in Action' from Manning. This gave me some ideas about what the R language can be used for-- to be specific, it makes statistics easy! With just a few lines, you can generate mean, standard deviation, quartiles, histograms, graphs, and much more.
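As a quick illustration of just how few lines that takes (the numbers here are made-up sample values, not data from any of my tests):
# A handful of made-up latency values, just to show how little code the basics take
x <- c(0.062, 0.066, 0.071, 0.083, 0.063, 0.066, 0.072)

mean(x)              # arithmetic mean
sd(x)                # standard deviation
quantile(x)          # min, quartiles, max
hist(x)              # histogram
plot(x, type="b")    # simple point-and-line graph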
I've recently had the opportunity to use this new-found knowledge. At work we're building a new sharded-MySQL back end, and we've just started performance testing. We ran our first performance test and started looking at latency data. I poured the data into R, ran a few stats, and printed a few graphs -- instantly the graph showed a repeating pattern of incremental slowdowns. (It's a saw-tooth pattern, much like what you see in JVM garbage-collection graphs: it starts out low, builds to an unacceptable level, then drops back down to the initial decent value.) Cool!
R pointed out the problem; now to go dig up the root cause....
Rick