“Full many a gem of purest serene
The dark oceans of cave bear.”
Thomas Gray – Elegy in country churchyard
In this post I do a fine grained analysis of the batting performances of cricketing icons from India and also from the international scene to determine how they stack up against each other. I perform 2 separate analyses 1) Between Indian legends (Sunil Gavaskar, Sachin Tendulkar & Rahul Dravid) and another 2) Between contemporary cricketing stars (Brian Lara, Sachin Tendulkar, Ricky Ponting and A B De Villiers)
In the world and more so in India, Tendulkar is probably placed on a higher pedestal than all other cricketers. I was curious to know how much of this adulation is justified. In “Zen and the art of motorcycle maintenance” Robert Pirsig mentions that while we cannot define Quality (in a book, music or painting) we usually know it when we see it. So do the people see an ineffable quality in Tendulkar or are they intuiting his greatness based on overall averages?
In this context, we need to keep in mind the warning that Daniel Kahnemann highlights in his book, ‘Thinking fast and slow’. Kahnemann suggests that we should regard “statistical intuition with proper suspicion and replace impression formation by computation wherever possible”. This is because our minds usually detects patterns and associations even when none actually exist.
So this analysis tries to look deeper into these aspects by performing a detailed statistical analysis.
The data for all the batsman has been taken from ESPN Cricinfo. The data is then cleaned to remove ‘DNB’ (did not bat), ‘TDNB’ (Team did not bat) etc before generating the graphs.
The code, data and the plots can be cloned,forked from Github at the following link bestBatsman. You should be able to use the code as-is for any other batsman you choose to.
Feel free to agree, disagree, dispute or argue with my analysis.
Take a look at my book with all my articles related to cricket – Cricket analytics with cricketr!!!. The book is also available in paperback and kindle versions at Amazon which has, by the way, better formatting!
The batting performances of the each of the cricketers is described in 3 plots a) Combined boxplot & histogram b) Runs frequency vs Runs plot c) Mean Strike Rate vs Runs plot
A) Batting performance of Sachin Tendulkar
The above graph is combined boxplot and a histogram. The boxplot at the top shows the 1st quantile (25th percentile) which is the left side of the green rectangle, the 3rd quantile( 75% percentile) right side of the green rectangle and the mean and the median. These values are also shown in the histogram below. The histogram gives the frequency of Runs scored in the given range for e.g (0-10, 11-20, 21-30 etc) for Tendulkar
The graph above plots the best fitting curve for Runs scored in the frequency ranges.
This plot computes the Mean Strike Rate for each interval for e.g if between the ranges 11-21 the Strike Rates were 40.5, 48.5, 32.7, 56.8 then the average of these values is computed for the range 11-21 = (40.5 + 48.5 + 32.7 + 56.8)/4. This is done for all ranges and the Mean Strike Rate in each range is plotted and the loess curve is fitted for this data.
The mean, median, the 25th and 75 th percentiles for the runs scored by Rahul Dravid are shown above
The above plot computes the percentage of the total career runs scored in a given range for each of the batsman.
For e.g if Dravid scored the runs 23, 22, 28, 21, 25 in the range 21-30 then the
Range 21 – 20 => percentageRuns = ( 23 + 22 + 28 + 21 + 25)/ Total runs in career * 100
The above plot shows that Rahul Dravid’s has a higher contribution in the range 20-70 while Tendulkar has a larger percentahe in the range 150-230
With respect to the Mean Strike Rate Tendulkar is clearly superior to both Gavaskar & Dravid
The above table captures the the career details of each of the batsman
The following points can be noted
1) The ‘number of innings’ is the data you get after removing rows with DNB, TDNB etc
2) Tendulkar has the higher average 48.39 > Gavaskar (47.3) > Dravid (46.46)
3) The skew of Dravid (1.67) is greater which implies that there the runs scored are more skewed to right (greater runs) in comparison to mean
Clearly De Villiers is ahead in the percentage Runs scores in the range 30-80. Tendulkar is better in the range between 80-120. Lara’s career has a long tail.
The Mean Strike Rate of Lara is ahead of the lot, followed by De Villiers, Ponting and then Tendulkar
L) Analysis of Tendulkar, Lara, Ponting and De Villiers
The following can be observed from the above table
1) Brian Lara has the highest average (51.52) > Sachin Tendulkar (48.39 > Ricky Ponting (46.61) > AB De Villiers (46.55)
2) Brian Lara also the highest skew which means that the data is more skewed to the right of the mean than the others
You can clone the code rom Github at the following link bestBatsman. You should be able to use the code as-is for any other batsman you choose to.
1. Informed choices through Machine Learning : Analyzing Kohli, Tendulkar and Dravid
2. Informed choices through Machine Learning-2: Pitting together Kumble, Kapil, Chandra
3. Analyzing cricket’s batting legends – Through the mirage with R
4. Masters of spin – Unraveling the web with R
You may also like
1. A peek into literacy in India:Statistical learning with R
2. A crime map of India in R: Crimes against women
3. What’s up Watson? Using IBM Watson’s QAAPI with Bluemix, NodeExpress – Part 1
4. Bend it like Bluemix, MongoDB with autoscaling – Part 2