Introduction

This is the 1st part of a series of posts I intend to write on some common Machine Learning Algorithms in R and Python. In this first part I cover the following Machine Learning Algorithms:

• Univariate Regression
• Multivariate Regression
• Polynomial Regression
• K Nearest Neighbors Regression

The code includes the implementation in both R and Python. This series of posts is based on the following 2 MOOC courses I did at Stanford Online and at Coursera

1. Statistical Learning, Prof Trevor Hastie & Prof Robert Tibshirani, Online Stanford
2. Applied Machine Learning in Python, Prof Kevyn Collins-Thompson, University of Michigan, Coursera

I have used the data sets from the UCI Machine Learning repository (Communities and Crime, and Auto MPG). I also use the Boston data set from the MASS package.

The content of this post and much more is now available as a compact book on Amazon in both formats – as Paperback ($9.99) and a Kindle version ($6.99/Rs 449). See ‘Practical Machine Learning with R and Python – Machine Learning in stereo’.

While coding in R and Python I found that some aspects were more convenient in one language and some in the other. For example, plotting the fit is straightforward in R, while computing the R squared, splitting into Train & Test sets etc. are already available in Python. In any case, the missing pieces can easily be implemented in either language.

R squared in R is computed as follows
$RSS=\sum (y-yhat)^{2}$
$TSS= \sum(y-mean(y))^{2}$
$Rsquared = 1-\frac{RSS}{TSS}$
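
The RSS/TSS definition above can be checked by hand with a few lines of plain Python. This is only a minimal sketch: the function name `r_squared` and the toy values are assumptions made for illustration and are not taken from RFunctions.R or from the Boston data.

```python
def r_squared(y_true, y_pred):
    """Return 1 - RSS/TSS for two equal-length sequences of numbers."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: squared error of the predictions
    rss = sum((y - yhat) ** 2 for y, yhat in zip(y_true, y_pred))
    # Total sum of squares: squared deviation from the mean of y
    tss = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - rss / tss


# Hypothetical toy values (not from any data set used in this post)
y = [3.0, 5.0, 7.0, 9.0]
yhat = [2.5, 5.5, 6.5, 9.5]
print(round(r_squared(y, yhat), 3))  # RSS = 1, TSS = 20, so this prints 0.95
```

When the function is evaluated on held-out data, the value can become negative if the model predicts worse than the mean of y.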

Note: You can download this R Markdown file and the associated data sets from Github at MachineLearning-RandPython
Note 1: This post was created as an R Markdown file in RStudio which has a cool feature of including R and Python snippets. The plot of matplotlib needs a workaround but otherwise this is a real cool feature of RStudio!

1.1a Univariate Regression – R code

Here a simple linear regression line is fitted between a single input feature and the target variable
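
For a single feature the least squares line has a closed form, which is what `lm` and `LinearRegression` solve in the univariate case. The sketch below is a plain Python illustration under that assumption; `fit_line` and the toy points are hypothetical and are not part of the post's code.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the least squares line through (x, y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # slope = covariance(x, y) / variance(x); the 1/n factors cancel
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical points lying exactly on y = 11 - 2x
x = [1.0, 2.0, 3.0, 4.0]
y = [9.0, 7.0, 5.0, 3.0]
intercept, slope = fit_line(x, y)
print(intercept, slope)
```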

# Source in the R function library
source("RFunctions.R")
# Read the Boston data file
df=read.csv("Boston.csv",stringsAsFactors = FALSE) # Data from MASS - Statistical Learning

# Split the data into training and test sets (75:25)
train_idx <- trainTestSplit(df,trainPercent=75,seed=5)
train <- df[train_idx, ]
test <- df[-train_idx, ]

# Fit a linear regression line between 'Median value of owner occupied homes' vs 'lower status of
# population'
fit=lm(medv~lstat,data=df)
# Display details of fit
summary(fit)
##
## Call:
## lm(formula = medv ~ lstat, data = df)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -15.168  -3.990  -1.318   2.034  24.500
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.55384    0.56263   61.41   <2e-16 ***
## lstat       -0.95005    0.03873  -24.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared:  0.5441, Adjusted R-squared:  0.5432
## F-statistic: 601.6 on 1 and 504 DF,  p-value: < 2.2e-16
# Display the confidence intervals
confint(fit)
##                 2.5 %     97.5 %
## (Intercept) 33.448457 35.6592247
## lstat       -1.026148 -0.8739505
plot(df$lstat,df$medv, xlab="Lower status (%)",ylab="Median value of owned homes ($1000)", main="Median value of homes ($1000) vs Lower status (%)")
abline(fit)
abline(fit,lwd=3)
abline(fit,lwd=3,col="red")

rsquared=Rsquared(fit,test,test$medv)
sprintf("R-squared for uni-variate regression (Boston.csv) is : %f", rsquared)
## [1] "R-squared for uni-variate regression (Boston.csv) is : 0.556964"

1.1b Univariate Regression – Python code

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
#os.chdir("C:\\software\\machine-learning\\RandPython")
# Read the CSV file
df = pd.read_csv("Boston.csv",encoding = "ISO-8859-1")
# Select the feature variable
X=df['lstat']
# Select the target
y=df['medv']
# Split into train and test sets (75:25)
X_train, X_test, y_train, y_test = train_test_split(X, y,random_state = 0)
X_train=X_train.values.reshape(-1,1)
X_test=X_test.values.reshape(-1,1)
# Fit a linear model
linreg = LinearRegression().fit(X_train, y_train)
# Print the training and test R squared score
print('R-squared score (training): {:.3f}'.format(linreg.score(X_train, y_train)))
print('R-squared score (test): {:.3f}'.format(linreg.score(X_test, y_test)))
# Plot the linear regression line
fig=plt.scatter(X_train,y_train)
# Create a range of points. Compute yhat=coeff1*x + intercept and plot
x=np.linspace(0,40,20)
fig1=plt.plot(x, linreg.coef_ * x + linreg.intercept_, color='red')
fig1=plt.title("Median value of homes ($1000) vs Lower status (%)")
fig1=plt.xlabel("Lower status (%)")
fig1=plt.ylabel("Median value of owned homes ($1000)")
fig.figure.savefig('foo.png', bbox_inches='tight')
fig1.figure.savefig('foo1.png', bbox_inches='tight')
print("Finished")
## R-squared score (training): 0.571
## R-squared score (test): 0.458
## Finished

1.2a Multivariate Regression – R code

# Read crimes data
crimesDF <- read.csv("crimes.csv",stringsAsFactors = FALSE)
# Keep columns 7 onwards (the leading columns do not impact output)
crimesDF1 <- crimesDF[,7:length(crimesDF)]
# Convert all to numeric
crimesDF2 <- sapply(crimesDF1,as.numeric)
# Check for NAs
a <- is.na(crimesDF2)
# Set to 0 as an imputation
crimesDF2[a] <-0
#Create as a dataframe
crimesDF2 <- as.data.frame(crimesDF2)
#Create a train/test split
train_idx <- trainTestSplit(crimesDF2,trainPercent=75,seed=5)
train <- crimesDF2[train_idx, ]
test <- crimesDF2[-train_idx, ]
# Fit a multivariate regression model between crimesPerPop and all other features
fit <- lm(ViolentCrimesPerPop~.,data=train)
# Compute and print R Squared
rsquared=Rsquared(fit,test,test$ViolentCrimesPerPop)
sprintf("R-squared for multi-variate regression (crimes.csv)  is : %f", rsquared)
## [1] "R-squared for multi-variate regression (crimes.csv)  is : 0.653940"

1.2b Multivariate Regression – Python code

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
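# Read the crimes data (the encoding is assumed to match the one used in 1.1b)
crimesDF = pd.read_csv("crimes.csv", encoding="ISO-8859-1")
#Remove the 1st 7 columns
crimesDF1=crimesDF.iloc[:,7:crimesDF.shape[1]]
# Convert to numeric
crimesDF2 = crimesDF1.apply(pd.to_numeric, errors='coerce')
# Impute NA to 0s
crimesDF2.fillna(0, inplace=True)

# Select the X (feature variables - all)
X=crimesDF2.iloc[:,0:120]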

# Set the target
y=crimesDF2.iloc[:,121]

X_train, X_test, y_train, y_test = train_test_split(X, y,random_state = 0)
# Fit a multivariate regression model
linreg = LinearRegression().fit(X_train, y_train)

# compute and print the R Square
print('R-squared score (training): {:.3f}'.format(linreg.score(X_train, y_train)))
print('R-squared score (test): {:.3f}'.format(linreg.score(X_test, y_test)))
## R-squared score (training): 0.699
## R-squared score (test): 0.677

1.3a Polynomial Regression – R

For Polynomial regression, polynomials of degree 1, 2 & 3 are used and R squared is computed on the test set. It can be seen that the quadratic model provides the best R squared score and hence the best fit.
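
To see what raising the degree does to the feature matrix, the sketch below expands one row of features into all monomials up to a given degree, in the spirit of `poly(x, degree, raw=TRUE)` and `PolynomialFeatures`. It is an illustration only: `polynomial_features` is a hypothetical helper, and the ordering of its terms is not claimed to match either library.

```python
from itertools import combinations_with_replacement
from math import prod


def polynomial_features(row, degree):
    """Return [1, all monomials of degree 1..degree] for one row of features."""
    terms = [1.0]  # bias term
    for d in range(1, degree + 1):
        # Each combination of d feature indices (with repetition) is one monomial
        for combo in combinations_with_replacement(range(len(row)), d):
            terms.append(prod(row[i] for i in combo))
    return terms


print(polynomial_features([2.0, 3.0], 2))  # [1.0, 2.0, 3.0, 4.0, 6.0, 9.0]

# With the 6 features used in this section the number of terms grows quickly
for degree in (1, 2, 3):
    print(degree, len(polynomial_features([1.0] * 6, degree)))  # 7, 28, 84 terms
```

With 6 features the degree 3 model has 84 terms to fit, which is one plausible reason the test R squared falls again at degree 3 while the training R squared keeps rising.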

 # Polynomial degree 1
df=read.csv("auto_mpg.csv",stringsAsFactors = FALSE) # Data from UCI
df1 <- as.data.frame(sapply(df,as.numeric))

# Select key columns
df2 <- df1 %>% select(cylinder,displacement, horsepower,weight, acceleration, year,mpg)
df3 <- df2[complete.cases(df2),]

# Split as train and test sets
train_idx <- trainTestSplit(df3,trainPercent=75,seed=5)
train <- df3[train_idx, ]
test <- df3[-train_idx, ]

# Fit a model of degree 1
fit <- lm(mpg~. ,data=train)
rsquared1 <-Rsquared(fit,test,test$mpg)
sprintf("R-squared for Polynomial regression of degree 1 (auto_mpg.csv) is : %f", rsquared1)
## [1] "R-squared for Polynomial regression of degree 1 (auto_mpg.csv) is : 0.763607"
# Polynomial degree 2 - Quadratic
x = as.matrix(df3[1:6])
# Make a polynomial of degree 2 for feature variables before split
df4=as.data.frame(poly(x,2,raw=TRUE))
df5 <- cbind(df4,df3[7])
# Split into train and test set
train_idx <- trainTestSplit(df5,trainPercent=75,seed=5)
train <- df5[train_idx, ]
test <- df5[-train_idx, ]
# Fit the quadratic model
fit <- lm(mpg~. ,data=train)
# Compute R squared
rsquared2=Rsquared(fit,test,test$mpg)
sprintf("R-squared for Polynomial regression of degree 2 (auto_mpg.csv)  is : %f", rsquared2)
## [1] "R-squared for Polynomial regression of degree 2 (auto_mpg.csv)  is : 0.831372"
#Polynomial degree 3
x = as.matrix(df3[1:6])
# Make polynomial of degree 3 of feature variables before split
df4=as.data.frame(poly(x,3,raw=TRUE))
df5 <- cbind(df4,df3[7])
train_idx <- trainTestSplit(df5,trainPercent=75,seed=5)

train <- df5[train_idx, ]
test <- df5[-train_idx, ]
# Fit a model of degree 3
fit <- lm(mpg~. ,data=train)
# Compute R squared
rsquared3=Rsquared(fit,test,test$mpg)
sprintf("R-squared for Polynomial regression of degree 3 (auto_mpg.csv) is : %f", rsquared3)
## [1] "R-squared for Polynomial regression of degree 3 (auto_mpg.csv) is : 0.773225"
df=data.frame(degree=c(1,2,3),Rsquared=c(rsquared1,rsquared2,rsquared3))
# Make a plot of Rsquared and degree
ggplot(df,aes(x=degree,y=Rsquared)) +geom_point() + geom_line(color="blue") +
ggtitle("Polynomial regression - R squared vs Degree of polynomial") +
xlab("Degree") + ylab("R squared")

1.3b Polynomial Regression – Python

For Polynomial regression, polynomials of degree 1, 2 & 3 are used and R squared is computed. It can be seen that the quadratic model provides the best test R squared score and hence the best fit.

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
autoDF =pd.read_csv("auto_mpg.csv",encoding="ISO-8859-1")
autoDF.shape
autoDF.columns
# Select key columns
autoDF1=autoDF[['mpg','cylinder','displacement','horsepower','weight','acceleration','year']]
# Convert columns to numeric
autoDF2 = autoDF1.apply(pd.to_numeric, errors='coerce')
# Drop NAs
autoDF3=autoDF2.dropna()
autoDF3.shape
X=autoDF3[['cylinder','displacement','horsepower','weight','acceleration','year']]
y=autoDF3['mpg']

# Polynomial degree 1
X_train, X_test, y_train, y_test = train_test_split(X, y,random_state = 0)
linreg = LinearRegression().fit(X_train, y_train)
print('R-squared score - Polynomial degree 1 (training): {:.3f}'.format(linreg.score(X_train, y_train)))
# Compute R squared
rsquared1 =linreg.score(X_test, y_test)
print('R-squared score - Polynomial degree 1 (test): {:.3f}'.format(linreg.score(X_test, y_test)))

# Polynomial degree 2
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_poly, y,random_state = 0)
linreg = LinearRegression().fit(X_train, y_train)
# Compute R squared
print('R-squared score - Polynomial degree 2 (training): {:.3f}'.format(linreg.score(X_train, y_train)))
rsquared2 =linreg.score(X_test, y_test)
print('R-squared score - Polynomial degree 2 (test): {:.3f}\n'.format(linreg.score(X_test, y_test)))

#Polynomial degree 3
poly = PolynomialFeatures(degree=3)
X_poly = poly.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_poly, y,random_state = 0)
linreg = LinearRegression().fit(X_train, y_train)
print('(R-squared score -Polynomial degree 3 (training): {:.3f}'
      .format(linreg.score(X_train, y_train)))
# Compute R squared
rsquared3 =linreg.score(X_test, y_test)
print('R-squared score Polynomial degree 3 (test): {:.3f}\n'.format(linreg.score(X_test, y_test)))
degree=[1,2,3]
rsquared =[rsquared1,rsquared2,rsquared3]
fig2=plt.plot(degree,rsquared)
fig2=plt.title("Polynomial regression - R squared vs Degree of polynomial")
fig2=plt.xlabel("Degree")
fig2=plt.ylabel("R squared")
fig2.figure.savefig('foo2.png', bbox_inches='tight')
print("Finished plotting and saving")
## R-squared score - Polynomial degree 1 (training): 0.811
## R-squared score - Polynomial degree 1 (test): 0.799
## R-squared score - Polynomial degree 2 (training): 0.861
## R-squared score - Polynomial degree 2 (test): 0.847
##
## (R-squared score -Polynomial degree 3 (training): 0.933
## R-squared score Polynomial degree 3 (test): 0.710
##
## Finished plotting and saving

1.4 K Nearest Neighbors

The code below implements KNN Regression both for R and Python. This is done for different numbers of neighbors. The R squared is computed in each case. This is repeated after performing feature scaling. It can be seen that the model fit is much better after feature scaling.

Normalization refers to
$X_{normalized} = \frac{X-min(X)}{max(X)-min(X)}$
Another technique that is used is Standardization which is
$X_{standardized} = \frac{X-mean(X)}{sd(X)}$

1.4a K Nearest Neighbors Regression – R( Unnormalized)

The R code below does not use feature scaling

# KNN regression requires the FNN package
df=read.csv("auto_mpg.csv",stringsAsFactors = FALSE) # Data from UCI
df1 <- as.data.frame(sapply(df,as.numeric))
df2 <- df1 %>% select(cylinder,displacement, horsepower,weight, acceleration, year,mpg)
df3 <- df2[complete.cases(df2),]

# Split train and test
train_idx <- trainTestSplit(df3,trainPercent=75,seed=5)
train <- df3[train_idx, ]
test <- df3[-train_idx, ]
# Select the feature variables
train.X=train[,1:6]
# Set the target for training
train.Y=train[,7]
# Do the same for test set
test.X=test[,1:6]
test.Y=test[,7]

rsquared <- NULL
# Create a list of neighbors
neighbors <-c(1,2,4,8,10,14)
for(i in seq_along(neighbors)){
# Perform a KNN regression fit
knn=knn.reg(train.X,test.X,train.Y,k=neighbors[i])
# Compute R squared
rsquared[i]=knnRSquared(knn$pred,test.Y)
}

# Make a dataframe for plotting
df <- data.frame(neighbors,Rsquared=rsquared)
# Plot the number of neighbors vs the R squared
ggplot(df,aes(x=neighbors,y=Rsquared)) + geom_point() +geom_line(color="blue") +
xlab("Number of neighbors") + ylab("R squared") +
ggtitle("KNN regression - R squared vs Number of Neighbors (Unnormalized)")

1.4b K Nearest Neighbors Regression – Python( Unnormalized)

The Python code below does not use feature scaling

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.neighbors import KNeighborsRegressor
autoDF =pd.read_csv("auto_mpg.csv",encoding="ISO-8859-1")
autoDF.shape
autoDF.columns
autoDF1=autoDF[['mpg','cylinder','displacement','horsepower','weight','acceleration','year']]
autoDF2 = autoDF1.apply(pd.to_numeric, errors='coerce')
autoDF3=autoDF2.dropna()
autoDF3.shape
X=autoDF3[['cylinder','displacement','horsepower','weight','acceleration','year']]
y=autoDF3['mpg']

# Perform a train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)
# Create a list of neighbors
rsquared=[]
neighbors=[1,2,4,8,10,14]
for i in neighbors:
    # Fit a KNN model
    knnreg = KNeighborsRegressor(n_neighbors = i).fit(X_train, y_train)
    # Compute R squared
    rsquared.append(knnreg.score(X_test, y_test))
    print('R-squared test score: {:.3f}'
          .format(knnreg.score(X_test, y_test)))
# Plot the number of neighbors vs the R squared
fig3=plt.plot(neighbors,rsquared)
fig3=plt.title("KNN regression - R squared vs Number of neighbors(Unnormalized)")
fig3=plt.xlabel("Neighbors")
fig3=plt.ylabel("R squared")
fig3.figure.savefig('foo3.png', bbox_inches='tight')
print("Finished plotting and saving")
## R-squared test score: 0.527
## R-squared test score: 0.678
## R-squared test score: 0.707
## R-squared test score: 0.684
## R-squared test score: 0.683
## R-squared test score: 0.670
## Finished plotting and saving
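
To make the mechanics behind KNN regression concrete, here is a minimal sketch in plain Python: the prediction for a query point is the mean target of its k nearest training points by Euclidean distance. `knn_predict` and the toy data are hypothetical; the libraries used above handle ties, weighting and efficient neighbor search differently.

```python
import math


def knn_predict(train_X, train_y, query, k):
    """Predict the target for `query` as the mean target of its k nearest rows."""
    # Pair each training row's distance to the query with its target value
    distances = sorted(
        (math.dist(row, query), target) for row, target in zip(train_X, train_y)
    )
    nearest = distances[:k]
    return sum(target for _, target in nearest) / k


# Hypothetical one-feature training set
train_X = [[1.0], [2.0], [3.0], [10.0]]
train_y = [10.0, 20.0, 30.0, 100.0]

print(knn_predict(train_X, train_y, [2.2], k=1))  # nearest row is [2.0] -> 20.0
print(knn_predict(train_X, train_y, [2.2], k=2))  # rows [2.0] and [3.0] -> 25.0
```

Because the prediction depends only on distances, a feature with a large numeric range (such as weight in the Auto MPG data) dominates the neighbor search unless the features are scaled.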

1.4c K Nearest Neighbors Regression – R( Normalized)

It can be seen that R squared improves when the features are normalized.
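
Min-max normalization maps each feature to the range of the training data. The sketch below, in plain Python, learns the per-column minimum and maximum on the training rows and then applies them to other rows, which mirrors the `fit_transform` / `transform` pattern of scikit-learn's `MinMaxScaler`. The helper names `fit_min_max` and `apply_min_max` and the toy rows are assumptions for illustration.

```python
def fit_min_max(rows):
    """Return per-column (mins, maxs) learned from the training rows."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def apply_min_max(rows, mins, maxs):
    """Scale each value as (x - min) / (max - min) using the learned ranges."""
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, mins, maxs)]
        for row in rows
    ]


# Hypothetical rows with two features on very different scales
train = [[1.0, 100.0], [3.0, 300.0], [5.0, 500.0]]
test = [[2.0, 600.0]]

mins, maxs = fit_min_max(train)
print(apply_min_max(train, mins, maxs))  # [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
print(apply_min_max(test, mins, maxs))   # [[0.25, 1.25]]
```

Test values outside the training range fall outside [0, 1], as the second feature of the test row shows. Scaling the test set with its own minimum and maximum instead would put the two sets on slightly different scales.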

df=read.csv("auto_mpg.csv",stringsAsFactors = FALSE) # Data from UCI
df1 <- as.data.frame(sapply(df,as.numeric))
df2 <- df1 %>% select(cylinder,displacement, horsepower,weight, acceleration, year,mpg)
df3 <- df2[complete.cases(df2),]

# Perform MinMaxScaling of feature variables
train.X.scaled=MinMaxScaler(train.X)
test.X.scaled=MinMaxScaler(test.X)

# Create a list of neighbors
rsquared <- NULL
neighbors <-c(1,2,4,6,8,10,12,15,20,25,30)
for(i in seq_along(neighbors)){
# Fit a KNN model
knn=knn.reg(train.X.scaled,test.X.scaled,train.Y,k=neighbors[i])
# Compute R squared
rsquared[i]=knnRSquared(knn$pred,test.Y)
}

df <- data.frame(neighbors,Rsquared=rsquared)
# Plot the number of neighbors vs the R squared
ggplot(df,aes(x=neighbors,y=Rsquared)) + geom_point() +geom_line(color="blue") +
xlab("Number of neighbors") + ylab("R squared") +
ggtitle("KNN regression - R squared vs Number of Neighbors(Normalized)")

1.4d K Nearest Neighbors Regression – Python( Normalized)

R squared improves when the features are normalized with MinMaxScaling

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import MinMaxScaler
autoDF =pd.read_csv("auto_mpg.csv",encoding="ISO-8859-1")
autoDF.shape
autoDF.columns
autoDF1=autoDF[['mpg','cylinder','displacement','horsepower','weight','acceleration','year']]
autoDF2 = autoDF1.apply(pd.to_numeric, errors='coerce')
autoDF3=autoDF2.dropna()
autoDF3.shape
X=autoDF3[['cylinder','displacement','horsepower','weight','acceleration','year']]
y=autoDF3['mpg']

# Perform a train/ test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)
# Use MinMaxScaling
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
# Apply scaling on test set
X_test_scaled = scaler.transform(X_test)

# Create a list of neighbors
rsquared=[]
neighbors=[1,2,4,6,8,10,12,15,20,25,30]
for i in neighbors:
    # Fit a KNN model
    knnreg = KNeighborsRegressor(n_neighbors = i).fit(X_train_scaled, y_train)
    # Compute R squared
    rsquared.append(knnreg.score(X_test_scaled, y_test))
    print('R-squared test score: {:.3f}'
          .format(knnreg.score(X_test_scaled, y_test)))

# Plot the number of neighbors vs the R squared
fig4=plt.plot(neighbors,rsquared)
fig4=plt.title("KNN regression - R squared vs Number of neighbors(Normalized)")
fig4=plt.xlabel("Neighbors")
fig4=plt.ylabel("R squared")
fig4.figure.savefig('foo4.png', bbox_inches='tight')
print("Finished plotting and saving")
## R-squared test score: 0.703
## R-squared test score: 0.810
## R-squared test score: 0.830
## R-squared test score: 0.838
## R-squared test score: 0.834
## R-squared test score: 0.828
## R-squared test score: 0.827
## R-squared test score: 0.826
## R-squared test score: 0.816
## R-squared test score: 0.815
## R-squared test score: 0.809
## Finished plotting and saving

Conclusion

In this initial post I cover the regression models when the output is continuous. I intend to touch upon other Machine Learning algorithms. Comments, suggestions and corrections are welcome.

Watch this space!

To be continued….

To see all posts see Index of posts

Using Linear Programming (LP) for optimizing bowling change or batting lineup in T20 cricket

In my recent post, My travels through the realms of Data Science, Machine Learning, Deep Learning and (AI), I had recounted my journey in the domains of Data Science, Machine Learning (ML), and more recently Deep Learning (DL), all of which are useful while analyzing data. Of late, I have come to the realization that there are many facets to data. And to glean insights from data, Data Science, ML and DL alone are not sufficient and one needs to also have a good handle on linear programming and optimization. My colleague at IBM Research also concurred with this view and told me he had arrived at this conclusion several years ago.

The 3rd edition of my books (paperback & kindle) Cricket analytics with cricketr & Beaten by sheer pace! Cricket analytics with yorkr is now available on Amazon for $12.99 (each for paperbacks), and $4.99/Rs 320 and $6.99/Rs 448 respectively

While ML & DL are very useful and interesting to make inferences and predictions of outputs from input variables, optimization computes the choice of input which results in maximizing or minimizing the output. So I made a small course correction and started on a course from India’s own NPTEL Introduction to Linear Programming by Prof G. Srinivasan of IIT Madras (highly recommended!). The lectures are delivered with remarkable clarity by the Prof and I am just about halfway through the course (each lecture is of 50-55 min duration), when I decided that I needed to try to formulate and solve some real world Linear Programming problem.

As usual, I turned towards cricket for some appropriate situations, and sure enough it was there in the open. For this LP formulation I take International T20 and IPL, though International ODI will also work equally well.  You can download the associated code and data for this from Github at LP-cricket-analysis

In T20 matches the captain has to choose how to rotate the bowlers with the aim of restricting the batting side. Conversely, the batsmen need to take advantage of weaknesses in the bowling to maximize the runs scored.

Note:
a) A simple and obvious strategy would be
– If the ith bowler’s economy rate is less than the economy rate of the jth bowler i.e.
$er_{i}$ < $er_{j}$ then have bowler ‘i’ to bowl more overs as his/her economy rate is better

b) A better strategy would be to consider the economy rate of each bowler against each batsman. How often have we witnessed bowlers with a great bowling average get thrashed time and again by the same batsman, or a bowler who is generally very poor being very effective against a particular batsman, i.e. $er_{ij}$ < $er_{ik}$ where the jth bowler is more effective than the kth bowler against the ith batsman. This now becomes a linear optimization problem, as there are several combinations of (number of overs x economy rate) for the different bowlers, and we have to solve this algorithmically to determine the lowest score for the bowling performance or the highest score for the batting order.

This post uses the latter approach to optimize bowling change and batting lineup.

Let us take a hypothetical situation
Assume there are 3 bowlers – $bwlr_{1},bwlr_{2},bwlr_{3}$
and there are 4 batsmen – $bman_{1},bman_{2},bman_{3},bman_{4}$

Let the economy rate $er_{ij}$ be the Economy Rate of the jth bowler to the ith batsman. Also if remaining overs for the bowlers are $o_{1},o_{2},o_{3}$
and the total number of overs left to be bowled are
$o_{1}+o_{2}+o_{3} = N$ then the question is

a) Given the economy rate of each bowler per batsman, how many overs should each bowler bowl, so that the total runs scored by all the batsmen are minimum?

b) Alternatively, if we know the individual strike rate of a batsman against the individual bowlers, how many overs should each batsman face with a bowler so that the total runs scored is maximized?

1. LP Formulation for bowling order

Let the economy rate $er_{ij}$ be the Economy Rate of the jth bowler to the ith batsman.
Objective function : Minimize –
$er_{11}*o_{11} + er_{12}*o_{12} +..+er_{1n}*o_{1n}+ er_{21}*o_{21} + er_{22}*o_{22}+..+ er_{2n}*o_{2n}+..+ er_{m1}*o_{m1}+..+ er_{mn}*o_{mn}$
i.e.
$\sum_{i=1}^{i=m}\sum_{j=1}^{j=n}er_{ij}*o_{ij}$
Constraints
Where $o_{j}$ is the number of overs remaining for the jth bowler against ‘k’ batsmen
$o_{j1} + o_{j2} + .. + o_{jk} \le o_{j}$
and if the total number of overs remaining to be bowled is N then
$o_{1} + o_{2} +...+ o_{k} = N$ or
$\sum_{j=1}^{j=k} o_{j} =N$
The overs that any bowler can bowl is $o_{j} \ge 0$

2. LP Formulation for batting lineup

Let the strike rate $sr_{ij}$  be the Strike Rate of the ith batsman to the jth bowler
Objective function : Maximize –
$sr_{11}*o_{11} + sr_{12}*o_{12} +..+ sr_{1n}*o_{1n}+ sr_{21}*o_{21} + sr_{22}*o_{22}+..+ sr_{2n}*o_{2n}+..+ sr_{m1}*o_{m1}+..+ sr_{mn}*o_{mn}$
i.e.
$\sum_{i=1}^{i=m}\sum_{j=1}^{j=n}sr_{ij}*o_{ij}$
Constraints
Where $o_{j}$ is the number of overs remaining for the jth bowler against ‘k’ batsmen
$o_{j1} + o_{j2} + .. + o_{jk} \le o_{j}$
and if the total number of overs remaining to be bowled is N then
$o_{1} + o_{2} +...+ o_{k} = N$ or
$\sum_{j=1}^{j=k} o_{j} =N$
The overs that any bowler can bowl is
$o_{j} \ge 0$
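
Before handing a formulation like this to a solver, it can help to evaluate the objective and the constraints by hand for one candidate allocation. The plain Python sketch below does that for a rate matrix and an overs matrix of the same shape. The function names, the 2 x 2 rates and the limits are all hypothetical values chosen for illustration, not cricket data.

```python
def total_runs(rates, overs):
    """Objective: sum of rate[i][j] * overs[i][j] over all cells."""
    return sum(
        r * o
        for rate_row, over_row in zip(rates, overs)
        for r, o in zip(rate_row, over_row)
    )


def is_feasible(overs, row_limits, total_overs):
    """Check non-negativity, the per-row limits and the total overs constraint."""
    non_negative = all(o >= 0 for row in overs for o in row)
    within_limits = all(sum(row) <= limit for row, limit in zip(overs, row_limits))
    adds_up = sum(sum(row) for row in overs) == total_overs
    return non_negative and within_limits and adds_up


# Hypothetical runs-per-over rates and a candidate allocation of 4 overs
rates = [[6.0, 8.0], [9.0, 7.0]]
overs = [[1, 2], [0, 1]]

print(total_runs(rates, overs))          # 6*1 + 8*2 + 9*0 + 7*1 = 29.0
print(is_feasible(overs, [3, 2], 4))     # True
print(is_feasible(overs, [2, 2], 4))     # False: the first row uses 3 overs
```

The LP solver searches over all feasible allocations for the one that minimizes (bowling) or maximizes (batting) this objective.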

lpSolveAPI– For this maximization and minimization problem I used lpSolveAPI.

Below I take 2 simple examples (example1 & 2)  to ensure that my LP formulation and solution is correct before applying it on real T20 cricket data (Intl. T20 and IPL)

3. LP formulation (Example 1)

Initially I created a test example to ensure that I get the LP formulation and solution correct. Here er1=4 and er2=3, and o1 & o2 are the overs bowled by bowlers 1 & 2. Also o1+o2=4. The possible combinations in this example are as below

o1   o2   Obj Fun (=4*o1+3*o2)
1    3    13
2    2    14
3    1    15

library(lpSolveAPI)
library(dplyr)
library(knitr)
lprec <- make.lp(0, 2)
a <-lp.control(lprec, sense="min")
set.objfn(lprec, c(4, 3))  # Economy Rate of 4 and 3 for er1 and er2
add.constraint(lprec, c(1, 1), "=",4)  # o1 + o2 =4
add.constraint(lprec, c(1, 0), ">",1)  # o1 >= 1 (lpSolveAPI treats ">" as ">=")
add.constraint(lprec, c(0, 1), ">",1)  # o2 >= 1
lprec
## Model name:
##             C1    C2
## Minimize     4     3
## R1           1     1   =  4
## R2           1     0  >=  1
## R3           0     1  >=  1
## Kind       Std   Std
## Type      Real  Real
## Upper      Inf   Inf
## Lower        0     0
b <-solve(lprec)
get.objective(lprec) # 13
## [1] 13
get.variables(lprec) # 1    3 
## [1] 1 3

Note 1: In the above example 13 runs is the minimum that can be scored and this requires

LP solution:
Minimum runs=13

• o1=1
• o2=3

Note 2:The numbers in the columns represent the number of overs that need to be bowled by a bowler to the corresponding batsman.
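
As a cross-check of Example 1, the small search below enumerates every whole-number split of the 4 overs with at least 1 over per bowler and picks the cheapest. This is a brute-force sketch in plain Python, assuming integer overs; it is not how lpSolveAPI solves the problem.

```python
# Economy rates 4 and 3, o1 + o2 = 4, each bowler bowls at least 1 over
best = min(
    (4 * o1 + 3 * o2, o1, o2)
    for o1 in range(1, 4)
    for o2 in range(1, 4)
    if o1 + o2 == 4
)
print(best)  # (13, 1, 3): 13 runs with o1=1, o2=3
```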

4. LP formulation (Example 2)

In this formulation there are 2 bowlers and 2 batsmen. o11, o12 are the overs bowled by bowler 1 to batsmen 1 & 2, and o21, o22 are the overs bowled by bowler 2 to batsmen 1 & 2. The economy rates are er11=4, er12=2, er21=2, er22=5 and o11+o12+o21+o22=5

The manually computed candidate solutions are shown below, with each of o11, o12, o21 and o22 being at least 1 over

o11   o12   o21   o22   Runs (=4*o11+2*o12+2*o21+5*o22)
1     1     1     2     18
1     2     1     1     15
2     1     1     1     17
1     1     2     1     15

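lprec <- make.lp(0, 4)
a <-lp.control(lprec, sense="min")
set.objfn(lprec, c(4, 2,2,5))
# Constraints reconstructed from the model printed below (R1 to R7)
add.constraint(lprec, c(1,1,0,0), "<=",8)  # Bowler 1 has 8 overs left
add.constraint(lprec, c(0,0,1,1), "<=",7)  # Bowler 2 has 7 overs left
add.constraint(lprec, c(1,1,1,1), "=",5)   # o11+o12+o21+o22=5
add.constraint(lprec, c(1,0,0,0), ">=",1)  # o11 >= 1
add.constraint(lprec, c(0,1,0,0), ">=",1)  # o12 >= 1
add.constraint(lprec, c(0,0,1,0), ">=",1)  # o21 >= 1
add.constraint(lprec, c(0,0,0,1), ">=",1)  # o22 >= 1
lprec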
## Model name:
##             C1    C2    C3    C4
## Minimize     4     2     2     5
## R1           1     1     0     0  <=  8
## R2           0     0     1     1  <=  7
## R3           1     1     1     1   =  5
## R4           1     0     0     0  >=  1
## R5           0     1     0     0  >=  1
## R6           0     0     1     0  >=  1
## R7           0     0     0     1  >=  1
## Kind       Std   Std   Std   Std
## Type      Real  Real  Real  Real
## Upper      Inf   Inf   Inf   Inf
## Lower        0     0     0     0
b<-solve(lprec)
get.objective(lprec) 
## [1] 15
get.variables(lprec) 
## [1] 1 2 1 1

Note: In the above example 15 runs is the minimum that can be scored and this requires

LP Solution:
Minimum runs=15

• o11=1
• o12=2
• o21=1
• o22=1

It is also possible to set the minimum overs to other values and solve the problem again.
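
The manual table for Example 2 contains two allocations that both give 15 runs, while the solver reports only one of them. The brute-force sketch below, in plain Python and assuming whole overs, lists every optimal allocation under the constraints R1 to R7 of the printed model. The names `rates`, `feasible` and `runs` are introduced here only for illustration.

```python
from itertools import product

rates = (4, 2, 2, 5)  # er11, er12, er21, er22


def feasible(o):
    """Constraints of the printed model; each variable is already >= 1."""
    o11, o12, o21, o22 = o
    return o11 + o12 <= 8 and o21 + o22 <= 7 and sum(o) == 5


def runs(o):
    """Objective value for one allocation (o11, o12, o21, o22)."""
    return sum(r * x for r, x in zip(rates, o))


# Enumerate integer allocations with every variable between 1 and 5
candidates = [o for o in product(range(1, 6), repeat=4) if feasible(o)]
best_runs = min(runs(o) for o in candidates)
optimal = [o for o in candidates if runs(o) == best_runs]

print(best_runs)  # 15
print(optimal)    # [(1, 1, 2, 1), (1, 2, 1, 1)]
```

An LP can have several optimal solutions with the same objective value, so which one `get.variables` returns depends on the solver.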

5. LP formulation for International T20 India vs Australia (Batting lineup)

To analyze batting and bowling lineups in the cricket world I needed to get the ball-by-ball details of runs scored by each batsman against each of the bowlers. Fortunately I had already created this with my R package yorkr. yorkr processes yaml data from Cricsheet. So I copied the data of all matches between Australia and India in International T20s. You can download my processed data for International T20 at Inswinger

load("Australia-India-allMatches.RData")
dim(matches)
## [1] 3541   25

The following functions compute the ‘Strike Rate’ of a batsman in runs per over as

$SR=\frac{RunsScored}{overs}$

Also the Economy Rate is computed as

$ER=\frac{RunsConceded}{overs}$

Incidentally, for a given batsman and bowler pair, SR=ER
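
In terms of ball-by-ball data, both rates come down to (total runs / number of balls) x 6 for one batsman and bowler pair, which is what the R functions that follow compute. The plain Python sketch below shows the same arithmetic; the record layout and the names `runs_per_over`, "A", "X" and "Y" are hypothetical and do not reflect the yorkr data format.

```python
def runs_per_over(balls, batsman, bowler):
    """Runs per over for one batsman-bowler pair, or None if they never met."""
    runs = [
        b["runs"] for b in balls
        if b["batsman"] == batsman and b["bowler"] == bowler
    ]
    if not runs:
        return None
    # Each record is one ball, so runs per ball times 6 gives runs per over
    return sum(runs) / len(runs) * 6


# Hypothetical ball-by-ball records
balls = [
    {"batsman": "A", "bowler": "X", "runs": 1},
    {"batsman": "A", "bowler": "X", "runs": 4},
    {"batsman": "A", "bowler": "X", "runs": 0},
    {"batsman": "A", "bowler": "X", "runs": 1},
    {"batsman": "A", "bowler": "Y", "runs": 6},
]

print(runs_per_over(balls, "A", "X"))  # 6 runs off 4 balls -> 9.0
print(runs_per_over(balls, "B", "X"))  # None
```

The same arithmetic reproduces the first figure in the output below: 45 runs off 37 balls gives 45/37 x 6, which is about 7.297.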

# Compute the Strike Rate of the batsman
computeSR <- function(batsman1,bowler1){
a <- matches %>% filter(batsman==batsman1 & bowler==bowler1)
a1 <- a %>% summarize(totalRuns=sum(runs),count=n()) %>% mutate(SR=(totalRuns/count)*6)
a1
}

# Compute the Economy Rate of the bowler
computeER <- function(batsman1,bowler1){
a <- matches %>% filter(batsman==batsman1 & bowler==bowler1)
a1 <- a %>% summarize(totalRuns=sum(runs),count=n()) %>% mutate(ER=(totalRuns/count)*6)
a1
}

Here I compute the Strike Rate of Virat Kohli, Yuvraj Singh and MS Dhoni against Shane Watson, Brett Lee and MA Starc

 # Kohli
kohliWatson<- computeSR("V Kohli","SR Watson")
kohliWatson
##   totalRuns count       SR
## 1        45    37 7.297297
kohliLee <- computeSR("V Kohli","B Lee")
kohliLee
##   totalRuns count       SR
## 1        10     7 8.571429
kohliStarc <- computeSR("V Kohli","MA Starc")
kohliStarc
##   totalRuns count       SR
## 1        11     9 7.333333
# Yuvraj
yuvrajWatson<- computeSR("Yuvraj Singh","SR Watson")
yuvrajWatson
##   totalRuns count       SR
## 1        24    22 6.545455
yuvrajLee <- computeSR("Yuvraj Singh","B Lee")
yuvrajLee
##   totalRuns count       SR
## 1        12     7 10.28571
yuvrajStarc <- computeSR("Yuvraj Singh","MA Starc")
yuvrajStarc
##   totalRuns count SR
## 1        12     8  9
# MS Dhoni
dhoniWatson<- computeSR("MS Dhoni","SR Watson")
dhoniWatson
##   totalRuns count       SR
## 1        33    28 7.071429
dhoniLee <- computeSR("MS Dhoni","B Lee")
dhoniLee
##   totalRuns count  SR
## 1        26    20 7.8
dhoniStarc <- computeSR("MS Dhoni","MA Starc")
dhoniStarc
##   totalRuns count   SR
## 1        11     8 8.25

When we consider the batting lineup, the problem is one of maximization. In the LP formulation below V Kohli has a SR of 7.29, 8.57, 7.33 against Watson, Lee & Starc
Yuvraj has a SR of 6.5, 10.28, 9 against Watson, Lee & Starc
and Dhoni has a SR of 7.07, 7.8,  8.25 against Watson, Lee and Starc

The constraints are that Watson, Lee and Starc have 3, 4 & 3 overs remaining respectively. The total number of overs remaining to be bowled is 9. Other constraints could be that a bowler bowls at least 1 over, etc.

Formulating and solving

# 3 batsman x 3 bowlers
lprec <- make.lp(0, 9)
# Maximization
a<-lp.control(lprec, sense="max")

# Set the objective function
set.objfn(lprec, c(kohliWatson$SR, kohliLee$SR,kohliStarc$SR, yuvrajWatson$SR,yuvrajLee$SR,yuvrajStarc$SR,
dhoniWatson$SR,dhoniLee$SR,dhoniStarc$SR)) #Assume the bowlers have 3,4,3 overs left respectively add.constraint(lprec, c(1, 1,1,0,0,0, 0,0,0), "<=",3) add.constraint(lprec, c(0,0,0,1,1,1,0,0,0), "<=",4) add.constraint(lprec, c(0,0,0,0,0,0,1,1,1), "<=",3) #o11+o12+o13+o21+o22+o23+o31+o32+o33=8 (overs remaining) add.constraint(lprec, c(1,1,1,1,1,1,1,1,1), "=",9) add.constraint(lprec, c(1,0,0,0,0,0,0,0,0), ">=",1) #o11 >=1 add.constraint(lprec, c(0,1,0,0,0,0,0,0,0), ">=",0) #o12 >=0 add.constraint(lprec, c(0,0,1,0,0,0,0,0,0), ">=",0) #o13 >=0 add.constraint(lprec, c(0,0,0,1,0,0,0,0,0), ">=",1) #o21 >=1 add.constraint(lprec, c(0,0,0,0,1,0,0,0,0), ">=",1) #o22 >=1 add.constraint(lprec, c(0,0,0,0,0,1,0,0,0), ">=",0) #o23 >=0 add.constraint(lprec, c(0,0,0,0,0,0,1,0,0), ">=",1) #o31 >=1 add.constraint(lprec, c(0,0,0,0,0,0,0,1,0), ">=",0) #o32 >=0 add.constraint(lprec, c(0,0,0,0,0,0,0,0,1), ">=",0) #o33 >=0 lprec ## Model name: ## a linear program with 9 decision variables and 13 constraints b <-solve(lprec) get.objective(lprec) #  ## [1] 77.16418 get.variables(lprec) #  ## [1] 1 2 0 1 3 0 1 0 1 This shows that the maximum runs that can be scored for the current strike rate is 77.16 runs in 9 overs The breakup is as follows This is also shown below get.variables(lprec) #  ## [1] 1 2 0 1 3 0 1 0 1 This is also shown below e <- as.data.frame(rbind(c(1,2,0,3),c(1,3,0,4),c(1,0,1,2))) names(e) <- c("S Watson","B Lee","MA Starc","Overs") rownames(e) <- c("Kohli","Yuvraj","Dhoni") e LP Solution: Maximum runs that can be scored by India against Australia is:77.164 if the 9 overs to be faced by the batsman are as below ## S Watson B Lee MA Starc Overs ## Kohli 1 2 0 3 ## Yuvraj 1 3 0 4 ## Dhoni 1 0 1 2 #Total overs=9 Note: This assumes that the batsmen perform at their current Strike Rate. 
However, anything can happen in a real game, but nevertheless this is a fairly reasonable estimate of the performance
Note 2: The numbers in the columns represent the number of overs that need to be bowled by a bowler to the corresponding batsman.
Note 3: You could try other combinations of overs for the above SR. For the above constraints 77.16 is the highest score for the given number of overs

6. LP formulation for International T20 India vs Australia (Bowling lineup)

For this I compute how the bowling should be rotated between R Ashwin, RA Jadeja and JJ Bumrah when taking into account their performance against batsmen like Shane Watson, AJ Finch and David Warner. For the bowling performance I take the Economy rate of the bowlers. The data is the same as above

# Economy Rate (runs conceded per over) of a bowler against a batsman
computeER <- function(batsman1,bowler1){
    a <- matches %>% filter(batsman==batsman1 & bowler==bowler1)
    a1 <- a %>% summarize(totalRuns=sum(runs),count=n()) %>% mutate(ER=(totalRuns/count)*6)
    a1
}
# RA Jadeja
jadejaWatson<- computeER("SR Watson","RA Jadeja")
jadejaWatson
##   totalRuns count       ER
## 1        60    29 12.41379
jadejaFinch <- computeER("AJ Finch","RA Jadeja")
jadejaFinch
##   totalRuns count       ER
## 1        36    33 6.545455
jadejaWarner <- computeER("DA Warner","RA Jadeja")
jadejaWarner
##   totalRuns count       ER
## 1        23    11 12.54545
# Ashwin
ashwinWatson<- computeER("SR Watson","R Ashwin")
ashwinWatson
##   totalRuns count       ER
## 1        41    26 9.461538
ashwinFinch <- computeER("AJ Finch","R Ashwin")
ashwinFinch
##   totalRuns count   ER
## 1        63    36 10.5
ashwinWarner <- computeER("DA Warner","R Ashwin")
ashwinWarner
##   totalRuns count       ER
## 1        38    28 8.142857
# JJ Bumrah
bumrahWatson<- computeER("SR Watson","JJ Bumrah")
bumrahWatson
##   totalRuns count  ER
## 1        22    20 6.6
bumrahFinch <- computeER("AJ Finch","JJ Bumrah")
bumrahFinch
##   totalRuns count       ER
## 1        25    19 7.894737
bumrahWarner <- computeER("DA Warner","JJ Bumrah")
bumrahWarner
##   totalRuns count ER
## 1         2     4  3

As can be seen from above RA Jadeja has an ER of 12.4, 6.54, 12.54 against
Watson, AJ Finch and Warner. Ashwin has an ER of 9.46, 10.5, 8.14 against Watson, Finch and Warner. Similarly Bumrah has an ER of 6.6, 7.89, 3 against Watson, Finch and Warner

The constraints are Jadeja, Ashwin and Bumrah have 4, 3 & 4 overs remaining and the total overs remaining to be bowled is 10.

Formulating and solving the bowling lineup is shown below

lprec <- make.lp(0, 9)
a <-lp.control(lprec, sense="min")

# Set the objective function
set.objfn(lprec, c(jadejaWatson$ER, jadejaFinch$ER,jadejaWarner$ER,
ashwinWatson$ER,ashwinFinch$ER,ashwinWarner$ER, bumrahWatson$ER,bumrahFinch$ER,bumrahWarner$ER))

add.constraint(lprec, c(1,1,1,0,0,0,0,0,0), "<=",4)   # Jadeja has 4 overs left
add.constraint(lprec, c(0,0,0,1,1,1,0,0,0), "<=",3)   # Ashwin has 3 overs left
add.constraint(lprec, c(0,0,0,0,0,0,1,1,1), "<=",4)   # Bumrah has 4 overs left
add.constraint(lprec, c(1,1,1,1,1,1,1,1,1), "=",10) # Total overs = 10

lprec
## Model name:
##   a linear program with 9 decision variables and 13 constraints
b <-solve(lprec)
get.objective(lprec) #  
## [1] 73.58775
get.variables(lprec) # 
## [1] 1 2 1 0 1 1 0 1 3

The minimum runs that will be conceded by these 3 bowlers in 10 overs is 73.58 assuming the bowling is rotated as follows

e <- as.data.frame(rbind(c(1,0,0),c(2,1,1),c(1,1,3),c(4,2,4)))
names(e) <- c("RA Jadeja","R Ashwin","JJ Bumrah")
rownames(e) <- c("S Watson","AJ Finch","DA Warner","Overs")
e 

LP Solution:
Minimum runs that will be conceded by India against Australia is 73.58 in 10 overs if the overs bowled are as follows

##           RA Jadeja R Ashwin JJ Bumrah
## S Watson          1        0         0
## AJ Finch          2        1         1
## DA Warner         1        1         3
## Overs             4        2         4
#Total overs=10  
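
The reported allocation can be sanity-checked by hand: each Economy Rate is runs conceded per ball times 6, and the objective is the sum of ER times overs over all bowler/batsman pairs. Below is a minimal Python sketch of that check. It only assumes the totalRuns and count figures printed above; the names `stats`, `overs` and `economy_rate` are illustrative and are not part of the original R code. It does not solve the LP, it only re-evaluates the solution that lpSolveAPI returned.

```python
# Runs conceded and balls bowled for each (bowler, batsman) pair,
# copied from the computeER output shown earlier in this post.
stats = {
    ("RA Jadeja", "S Watson"): (60, 29),
    ("RA Jadeja", "AJ Finch"): (36, 33),
    ("RA Jadeja", "DA Warner"): (23, 11),
    ("R Ashwin", "S Watson"): (41, 26),
    ("R Ashwin", "AJ Finch"): (63, 36),
    ("R Ashwin", "DA Warner"): (38, 28),
    ("JJ Bumrah", "S Watson"): (22, 20),
    ("JJ Bumrah", "AJ Finch"): (25, 19),
    ("JJ Bumrah", "DA Warner"): (2, 4),
}

# Overs from get.variables(lprec), in the same order as the objective function.
overs = dict(zip(stats, [1, 2, 1, 0, 1, 1, 0, 1, 3]))


def economy_rate(runs, balls):
    """Runs conceded per over (6 balls)."""
    return runs / balls * 6


# Objective value: sum of ER * overs over all pairs.
total_runs = sum(economy_rate(*stats[pair]) * overs[pair] for pair in stats)

# Overs bowled by each bowler, to compare against the 4, 3 and 4 overs available.
overs_per_bowler = {}
for (bowler, _), n in overs.items():
    overs_per_bowler[bowler] = overs_per_bowler.get(bowler, 0) + n

print(round(total_runs, 5))
print(overs_per_bowler)
print("Total overs:", sum(overs.values()))
```

The total agrees with the value reported by get.objective(lprec), and no bowler exceeds the overs he has left.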

7. LP formulation for IPL (Mumbai Indians – Kolkata Knight Riders – Bowling lineup)

As in the case of International T20s I have also processed IPL data derived from my R package yorkr. yorkr processes yaml data from Cricsheet. The processed data for all IPL matches can be downloaded from GooglyPlus

load("Mumbai Indians-Kolkata Knight Riders-allMatches.RData")
dim(matches)
## [1] 4237   25
# Compute the Economy Rate of the Mumbai Indian bowlers against Kolkata Knight Riders

# Gambhir
gambhirMalinga <- computeER("G Gambhir","SL Malinga")
gambhirHarbhajan <- computeER("G Gambhir","Harbhajan Singh")
gambhirPollard <- computeER("G Gambhir","KA Pollard")

#Yusuf Pathan
yusufMalinga <- computeER("YK Pathan","SL Malinga")
yusufHarbhajan <- computeER("YK Pathan","Harbhajan Singh")
yusufPollard <- computeER("YK Pathan","KA Pollard")

#JH Kallis
kallisMalinga <- computeER("JH Kallis","SL Malinga")
kallisHarbhajan <- computeER("JH Kallis","Harbhajan Singh")
kallisPollard <- computeER("JH Kallis","KA Pollard")

#RV Uthappa
uthappaMalinga <- computeER("RV Uthappa","SL Malinga")
uthappaHarbhajan <- computeER("RV Uthappa","Harbhajan Singh")
uthappaPollard <- computeER("RV Uthappa","KA Pollard")

Here

gambhirMalinga, yusufMalinga, kallisMalinga, uthappaMalinga is the ER of Malinga against Gambhir, Yusuf Pathan, Kallis and Uthappa
gambhirHarbhajan, yusufHarbhajan, kallisHarbhajan, uthappaHarbhajan is the ER of Harbhajan against Gambhir, Yusuf Pathan, Kallis and Uthappa
gambhirPollard, yusufPollard, kallisPollard, uthappaPollard is the ER of Kieron Pollard against Gambhir, Yusuf Pathan, Kallis and Uthappa

The constraints are Malinga, Harbhajan and Pollard have 4 overs each and remaining overs to be bowled is 10.

Formulating and solving this for the bowling lineup of Mumbai Indians against Kolkata Knight Riders

 library("lpSolveAPI")
lprec <- make.lp(0, 12)
a=lp.control(lprec, sense="min")

set.objfn(lprec, c(gambhirMalinga$ER, yusufMalinga$ER,kallisMalinga$ER,uthappaMalinga$ER,
gambhirHarbhajan$ER,yusufHarbhajan$ER,kallisHarbhajan$ER,uthappaHarbhajan$ER,
gambhirPollard$ER,yusufPollard$ER,kallisPollard$ER,uthappaPollard$ER))

lprec
## Model name:
##   a linear program with 12 decision variables and 16 constraints
 b=solve(lprec)
get.objective(lprec) #  
## [1] 55.57887
 get.variables(lprec) # 
##  [1] 3 1 0 0 0 1 0 1 3 1 0 0
e <- as.data.frame(rbind(c(3,1,0,0,4),c(0, 1, 0,1,2),c(3, 1, 0,0,4)))
names(e) <- c("Gambhir","Yusuf","Kallis","Uthappa","Overs")
rownames(e) <- c("Malinga","Harbhajan","Pollard")
e

LP Solution: Mumbai Indians can restrict Kolkata Knight Riders to 55.58 in 10 overs
if the overs are bowled as below

##           Gambhir Yusuf Kallis Uthappa Overs
## Malinga         3     1      0       0     4
## Harbhajan       0     1      0       1     2
## Pollard         3     1      0       0     4
#Total overs=10  
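
The solver returns the decision variables as one flat vector, in the same order as the terms of the objective function (all of Malinga's pairings first, then Harbhajan's, then Pollard's). The sketch below, in Python, shows one way to cut that vector back into the table above. The names `bowlers`, `batsmen`, `solution` and `table` are illustrative assumptions, not part of the original R code.

```python
bowlers = ["Malinga", "Harbhajan", "Pollard"]
batsmen = ["Gambhir", "Yusuf", "Kallis", "Uthappa"]

# Output of get.variables(lprec) for the bowling lineup above.
solution = [3, 1, 0, 0, 0, 1, 0, 1, 3, 1, 0, 0]

n = len(batsmen)
# Slice the flat vector into one row of overs per bowler.
table = {
    bowler: solution[i * n:(i + 1) * n]
    for i, bowler in enumerate(bowlers)
}

for bowler, row in table.items():
    print(bowler, dict(zip(batsmen, row)), "overs:", sum(row))
print("Total overs:", sum(solution))
```

Each row sum is the number of overs bowled by that bowler, which is what the "Overs" column of the data frame holds.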

8. LP formulation for IPL (Mumbai Indians – Kolkata Knight Riders – Batting lineup)

As I mentioned it is possible to perform a maximization with the same formulation since computeSR<==>computeER

This just flips the problem around and computes the maximum runs that can be scored for the batsman’s Strike rate (this is same as the bowler’s Economy rate) i.e.

gambhirMalinga, yusufMalinga, kallisMalinga, uthappaMalinga is the SR of Gambhir, Yusuf Pathan, Kallis and Uthappa against Malinga
gambhirHarbhajan, yusufHarbhajan, kallisHarbhajan, uthappaHarbhajan is the SR of Gambhir, Yusuf Pathan, Kallis and Uthappa against Harbhajan
gambhirPollard, yusufPollard, kallisPollard, uthappaPollard is the SR of Gambhir, Yusuf Pathan, Kallis and Uthappa against Kieron Pollard.

The constraints are Malinga, Harbhajan and Pollard have 4 overs each and remaining overs to be bowled is 10.

 library("lpSolveAPI")
lprec <- make.lp(0, 12)
a=lp.control(lprec, sense="max")

a <-set.objfn(lprec, c(gambhirMalinga$ER, yusufMalinga$ER,kallisMalinga$ER,uthappaMalinga$ER,
gambhirHarbhajan$ER,yusufHarbhajan$ER,kallisHarbhajan$ER,uthappaHarbhajan$ER,
gambhirPollard$ER,yusufPollard$ER,kallisPollard$ER,uthappaPollard$ER))

lprec
## Model name:
##   a linear program with 12 decision variables and 16 constraints
 b=solve(lprec)
get.objective(lprec) #  
## [1] 94.22649
 get.variables(lprec) # 
##  [1] 0 3 0 0 0 1 0 3 0 1 3 0
e <- as.data.frame(rbind(c(0,3,0,0,3),c(0, 1, 0,3,4),c(0, 1, 3,0,4)))
names(e) <- c("Gambhir","Yusuf","Kallis","Uthappa","Overs")
rownames(e) <- c("Malinga","Harbhajan","Pollard")
e

LP Solution: Kolkata Knight Riders can score a maximum of 94.22 in 11 overs against Mumbai Indians
if the number of overs KKR face is as below

##           Gambhir Yusuf Kallis Uthappa Overs
## Malinga         0     3      0       0     3
## Harbhajan       0     1      0       3     4
## Pollard         0     1      3       0     4
#Total overs=11  

Conclusion: It is thus possible to determine the optimum number of overs to give to a specific bowler based on his/her Economy Rate against a particular batsman. Similarly one can determine the maximum runs that can be scored by a batsman based on his/her Strike Rate against specific bowlers. Cricket, like many other games, is a game of strategy, skill, talent and some amount of luck. So while the LP formulation can provide some direction, one must be aware that anything could happen in a game of cricket!

To see all posts see Index of Posts

My 2 video presentations on ‘Essential Python for Datascience’

Here, in this post I include 2 sessions on ‘Essential Python for Datascience’. These 2 presentations cover the most important features of the Python language with which you can hit the ground running in datascience. All the related material for these sessions can be cloned/downloaded from Github at ‘EssentialPythonForDatascience’

1. Essential Python for Datascience -1
In this video presentation I cover basic data types like tuples, lists and dictionaries, how to get the type of a variable, subsetting and numpy arrays. Some basic operations on numpy arrays and slicing are also covered

2. Essential Python for Datascience -2
In the 2nd part I cover Pandas, pandas Series, dataframes, how to subset dataframes using iloc and loc, selection of specific columns, filtering dataframes by criteria etc. Other operations include groupby, apply and agg. Lastly I also touch upon matplotlib.

This is by no means an exhaustive coverage of the multitude of features available in Python but can serve as a good starting point for those venturing into datascience with Python.

Good luck with Python!

To see all posts see Index of posts

My travels through the realms of Data Science, Machine Learning, Deep Learning and (AI)

Then felt I like some watcher of the skies
When a new planet swims into his ken;
Or like stout Cortez when with eagle eyes
He star’d at the Pacific—and all his men
Look’d at each other with a wild surmise—
Silent, upon a peak in Darien.

On First Looking into Chapman’s Homer by John Keats

The above excerpt from John Keats’s poem captures the exhilaration that one experiences when discovering something for the first time. This also summarizes to some extent my own enjoyment while pursuing Data Science, Machine Learning and the like.

I decided to write this post, as occasionally youngsters approach me and ask me where they should start their adventure in Data Science & Machine Learning. There are other times, when the ‘not-so-youngsters’ want to know what their next step should be after having done some courses. This post includes my travels through the domains of Data Science, Machine Learning, Deep Learning and (soon to be done AI).

By no means am I an authority in this field, which is ever-widening and almost bottomless, yet I would like to share some of my experiences in this fascinating field. I include a short review of the courses I have done below. I also include alternative routes through courses which I did not do, but which are probably equally good. Feel free to pick and choose any course or set of courses. Alternatively, you may prefer to read books or attend bricks-n-mortar classes. In any case, I hope the list below will provide you with some overall direction.

All my learning in the above domains has come from MOOCs and I restrict myself to the top 3 MOOCs, or in my opinion, ‘the original MOOCs’, namely Coursera, edX or Udacity, but may throw in some courses from other online sites if they are only available there. I would recommend these 3 MOOCs over the other numerous online courses and also over face-to-face classroom courses for the following reasons. These MOOCs

• Are taken by world class colleges and the lectures are delivered by top class Professors who have a great depth of knowledge and a wealth of experience
• The Professors, besides delivering quality content, also point out important tips, tricks and traps
• You can revisit lectures in online courses
• Lectures are usually short between 8 -15 mins (Personally, my attention span is around 15-20 mins at a time!)

Here is a fair warning and something quite obvious. No amount of courses, lectures or books will help if you don’t put it to use through some language like Octave, R or Python.

The journey
My trip through Data Science and Machine Learning started with an off-chance remark, about 3 years ago, from an old friend of mine who spoke to me about having done a few courses at Coursera and really liking them. He further suggested that I should try. This was the final push which set me sailing into this vast domain.

I have included the list of the courses I have done over the past 3 years (33 certifications completed and another 9 audited-listened only without doing the assignments). For each of the courses I have included a short review of the course, whether I think the course is mandatory, the language in which the course is based on, and finally whether I have done the course myself etc. I have also included alternative courses, which I may have not done, but which I think are equally good. Finally, I suggest some courses which I have heard of and which are very good and worth taking.

1. Machine Learning, Stanford, Prof Andrew Ng, Coursera
(Requirement: Mandatory, Language:Octave,Status:Completed)
This course provides an excellent foundation to build your Machine Learning citadel on. The course covers the mathematical details of linear, logistic and multivariate regression. There is also a good coverage of topics like Neural Networks, SVMs, Anomaly Detection, underfitting, overfitting, regularization etc. Prof Andrew Ng presents the material in a very lucid manner. It is a great course to start with. It would be a good idea to brush up on some basics of linear algebra, matrices and a little bit of calculus, specifically computing the local maxima/minima. You should be able to take this course even if you don’t know Octave as the Prof goes over the key aspects of the language.

2. Statistical Learning, Prof Trevor Hastie & Prof Robert Tibshirani, Online Stanford (Requirement: Mandatory, Language: R, Status: Completed)
The course includes linear and polynomial regression, logistic regression. Details also include cross-validation and the bootstrap methods, how to do model selection and regularization (ridge and lasso). It also touches on non-linear models, generalized additive models, boosting and SVMs. Some unsupervised learning methods are  also discussed. The 2 Professors take turns in delivering lectures with a slight touch of humor.

3a. Data Science Specialization: Prof Roger Peng, Prof Brian Caffo & Prof Jeff Leek, Johns Hopkins University (Requirement: Option A, Language: R, Status: Completed)
This is a comprehensive 10 module specialization based on R. This Specialization gives a very broad overview of Data Science and Machine Learning. The modules cover R programming, Statistical Inference, Practical Machine Learning, how to build R products and R packages and finally has a very good Capstone project on NLP

3b. Applied Data Science with Python Specialization: University of Michigan (Requirement: Option B, Language: Python, Status: Not done)
In this specialization I only did the Applied Machine Learning in Python (Prof Kevyn Collins-Thompson). This is a very good course that covers a lot of Machine Learning algorithms (linear, logistic, ridge, lasso regression, knn, SVMs etc.). Also included are confusion matrices, ROC curves etc. This is based on Python’s Scikit Learn

3c. Machine Learning Specialization, University Of Washington (Requirement:Option C, Language:Python, Status : Not completed). This appears to be a very good Specialization in Python

4. Statistics with R Specialization, Duke University (Requirement: Useful and a must know, Language R, Status:Not Completed)
I audited (listened only) to the following 2 modules from this Specialization.
a.Inferential Statistics
b.Linear Regression and Modeling
Both these courses are taught by Prof Mine Çetinkaya-Rundel who delivers her lessons with extraordinary clarity. Her lectures are filled with many examples which she walks you through in great detail

5.Bayesian Statistics: From Concept to Data Analysis: Univ of California, Santa Cruz (Requirement: Optional, Language : R, Status:Completed)
This is an interesting course and provides an alternative point of view to frequentist approach

6. Data Science and Engineering with Spark, University of California, Berkeley, Prof Anthony Joseph, Prof Ameet Talwalkar, Prof Jon Bates
(Requirement: Mandatory for Big Data, Language: pySpark, Status: Completed)
This specialization contains 3 modules
a.Introduction to Apache Spark
b.Distributed Machine Learning with Apache Spark
c.Big Data Analysis with Apache Spark

This is an excellent course for those who want to make an entry into Distributed Machine Learning. The exercises are fairly challenging and your code will predominantly be made of map/reduce and lambda operations as you process data that is distributed across Spark RDDs. I really liked the part where the Prof shows how a matrix computation which is of the order of O(nd^2+d^3) on a single machine (and which is the basis of Machine Learning) is reduced to O(nd^2) by taking outer products on data which is distributed.
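
To make the outer product idea concrete, here is a small pure-Python sketch. It rests on an assumption of mine: that the matrix in question is X^T X for an n x d data matrix X (the O(nd^2) term), which can be written as the sum of the outer products of the rows of X. Because it is a sum over rows, each worker can compute the partial sum for the rows it holds and the partial results are simply added. The "workers" here are simulated by splitting a list, and all function names are illustrative. This is not Spark code.

```python
def outer(u, v):
    """Outer product of two vectors as a nested list."""
    return [[ui * vj for vj in v] for ui in u]


def add(A, B):
    """Element-wise sum of two equally sized matrices."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def gram_direct(X):
    """X^T X computed entry by entry on a single machine."""
    d = len(X[0])
    return [[sum(row[i] * row[j] for row in X) for j in range(d)]
            for i in range(d)]


def gram_partial(rows, d):
    """Sum of outer products for the rows held by one (simulated) worker."""
    total = [[0] * d for _ in range(d)]
    for row in rows:
        total = add(total, outer(row, row))
    return total


X = [[1, 2], [3, 4], [5, 6], [7, 8]]   # n = 4 rows, d = 2 features
partitions = [X[:2], X[2:]]            # two simulated workers

partials = [gram_partial(p, 2) for p in partitions]   # the "map" step
gram = add(partials[0], partials[1])                  # the "reduce" step

print(gram)
print(gram == gram_direct(X))
```

Only the small d x d partial sums need to be exchanged between workers, never the rows themselves.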

7. Deep Learning, Prof Andrew Ng, Younes Bensouda Mourri, Kian Katanforoosh (Requirement: Mandatory, Language: Python, Tensorflow, Status: Partially Completed)

This course had 5 Modules which start from the fundamentals of Neural Networks, their derivation and vectorized Python implementation. The specialization also covers regularization, optimization techniques, mini batch normalization, Convolutional Neural Networks, Recurrent Neural Networks, LSTMs applied to a wide variety of real world problems

The modules are
a. Neural Networks and Deep Learning
In this course Prof Andrew Ng explains differential calculus, linear algebra and vectorized Python implementations of Deep Learning algorithms. The derivation for back-propagation is done and then the Prof shows how to compute a multi-layered DL network

b.Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization
Deep Neural Networks can be very flexible, and come with a lot of knobs (hyper-parameters) to tune. In this module, Prof Andrew Ng shows a systematic way to tune hyperparameters and by how much one should tune them. The course also covers regularization (L1, L2, dropout), gradient descent optimization and batch normalization methods. The visualizations used to explain the momentum method, RMSprop, Adam, LR decay and batch normalization are really powerful and serve to clarify the concepts. As an added bonus, the module also includes a great introduction to Tensorflow.
c.Structuring Machine Learning Projects – To do
d. Convolutional Neural Networks – To do
e. Sequence Models – To do

8. Neural Networks for Machine Learning, Prof Geoffrey Hinton,University of Toronto
(Requirement: Mandatory, Language;Octave, Status:Completed)
This is a broad course which starts from the basics of Perceptrons and goes all the way to Boltzmann Machines, RNNs, CNNs, LSTMs etc. The course also covers regularization, learning rate decay, momentum method etc.

9.Probabilistic Graphical Models, Stanford  Prof Daphne Koller(Language:Octave, Status: Partially completed)
This has 3 courses
a.Probabilistic Graphical Models 1: Representation – Done
b.Probabilistic Graphical Models 2: Inference – To do
c.Probabilistic Graphical Models 3: Learning – To do
This course discusses how a system, which can be represented as a complex interaction of probability distributions, will behave. This is probably the toughest course I did. I did manage to get through the 1st module. While I felt that I grasped a few things, I did not wholly understand the import of this. However I feel this is an important domain and I will definitely revisit this in future

10. Mining Massive Data Sets, Prof Jure Leskovec, Prof Anand Rajaraman and Prof Jeff Ullman, Online Stanford (Status: Partially done)
I did quickly audit this course a year back, when it used to be on Coursera. It now seems to have moved to Stanford Online. This is a very good course that discusses key concepts of mining Big Data of the order of a few Petabytes

11. Introduction to Artificial Intelligence, Prof Sebastian Thrun & Prof Peter Norvig, Udacity
This is a really good course. I have started on this course a couple of times and somehow gave up. I will revisit it to complete it in future. It is quite extensive in its coverage and touches on BFS, DFS, A-Star, PGM, Machine Learning etc.

12. Deep Learning (with TensorFlow), Vincent Vanhoucke, Principal Scientist at Google Brain.
Got started on this one and abandoned some time back. In my to do list though

My learning journey is based on Lao Tzu’s dictum of ‘A good traveler has no fixed plans and is not intent on arriving’. You could have a goal and try to plan your courses accordingly.
And so my journey continues…

I hope you find this list useful.

R vs Python: Different similarities and similar differences

A debate about which language is better suited for Datascience, R or Python, can set off diehard fans of these languages into a tizzy. This post tries to look at some of the different similarities and similar differences between these languages. To a large extent the ease or difficulty in learning R or Python is subjective. I have heard that R has a steeper learning curve than Python and also vice versa. This probably depends on the degree of familiarity with the language. To a large extent both R and Python do the same thing in just slightly different ways and syntaxes. The ease or the difficulty of the R/Python constructs is largely in the ‘eyes of the beholder’, nay, ‘programmer’ we could say. I include my own experience with the languages below.


1. R data types

R has the following data types

1.  Character
2. Integer
3. Numeric
4. Logical
5. Complex
6. Raw

Python has several data types

1. int
2. float
3. complex
4. str, bool and so on

2. R Vector vs Python List

A common data type in R is the vector. Python has a similar data type, the list

# R vectors
a<-c(4,5,1,3,4,5)
print(a[3])
## [1] 1
print(a[3:4]) # R does not always need the explicit print. 
## [1] 1 3
#R type of variable
print(class(a))
## [1] "numeric"
# Length of a
print(length(a))
## [1] 6
# Python lists
a=[4,5,1,3,4,5] #
print(a[2]) # Some python IDEs require the explicit print
print(a[2:5])
print(type(a))
# Length of a
print(len(a))
## 1
## [1, 3, 4]
## <class 'list'>
## 6

2a. Other data types – Python

Python also has certain other data types like the tuple, dictionary etc as shown below. R does not have as many of the data types, nevertheless we can do everything that Python does in R

# Python tuple
b = (4,5,7,8)
print(b)

#Python dictionary
c={'name':'Ganesh','age':54,'Work':'Professional'}
print(c)
#Print type of variable c

## (4, 5, 7, 8)
## {'name': 'Ganesh', 'age': 54, 'Work': 'Professional'}
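
As a small illustration of how these two types behave differently from a list, the sketch below reuses the tuple `b` and dictionary `c` from above. It is only a minimal example: a tuple can be indexed but not changed in place, while a dictionary is accessed and updated by key.

```python
b = (4, 5, 7, 8)
c = {'name': 'Ganesh', 'age': 54, 'Work': 'Professional'}

# Tuples are indexed like lists but cannot be modified in place.
print(b[0])
try:
    b[0] = 10
except TypeError as err:
    print("TypeError:", err)

# Dictionaries are accessed by key rather than by position.
print(c['name'])
c['age'] = 55            # values can be updated through the key
print(sorted(c.keys()))
print(type(b), type(c))
```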

2.Type of Variable

To know the type of the variable in R we use ‘class’. In Python the corresponding command is ‘type’

#R - Type of variable
a<-c(4,5,1,3,4,5)
print(class(a))
## [1] "numeric"
#Python - Print type of list a and tuple b
a=[4,5,1,3,4,5]
print(type(a))
b=(4,3,"the",2)
print(type(b))
## <class 'list'>
## <class 'tuple'>

3. Length

To know length in R, use length()

#R - Length of vector
# Length of a
a<-c(4,5,1,3,4,5)
print(length(a))
## [1] 6

To know the length of a list,tuple or dict we can use len()

# Python - Length of list , tuple etc
# Length of a
a=[4,5,1,3,4,5]
print(len(a))
# Length of b
b = (4,5,7,8)
print(len(b))

## 6
## 4

4. Accessing help

To access help in R we use the ‘?’ or the ‘help’ function

#R - Help - To be done in R console or RStudio
#?sapply
#help(sapply)

Help in python on any topic involves

#Python help - This can be done on a (I)Python console
#help(len)
#?len

5. Subsetting

The key difference between R and Python with regards to subsetting is that in R the index starts at 1. In Python it starts at 0, much like C, C++ or Java. To subset a vector in R we use

#R - Subset
a<-c(4,5,1,3,4,8,12,18,1)
print(a[3])
## [1] 1
# To print a range or a slice. Print from the 3rd to the 6th element
print(a[3:6])
## [1] 1 3 4 8

Python also uses indices. The difference in Python is that the index starts from 0 and the end of a slice is not included.

#Python - Subset
a=[4,5,1,3,4,8,12,18,1]
# Print the 4th element (starts from 0)
print(a[3])

# Print a slice from 4 to 6th element
print(a[3:6])
## 3
## [3, 4, 8]
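
The two conventions differ in two ways: R counts from 1 and includes the end of the range, while Python counts from 0 and excludes the end. A small helper makes the translation explicit. Note that `r_slice` is a hypothetical name of my own and not a standard function, and the sketch assumes a positive start that is not greater than the end.

```python
def r_slice(seq, start, end):
    """Return the elements that R's seq[start:end] would select.

    R indices are 1-based and the end is included; Python slices are
    0-based and the end is excluded, so only the start shifts by one.
    """
    return seq[start - 1:end]


a = [4, 5, 1, 3, 4, 8, 12, 18, 1]
print(r_slice(a, 3, 6))   # same elements as R's a[3:6]
print(a[3:6])             # plain Python slice with the same numbers
```

With the same numbers 3 and 6, the Python slice starts one element later and is one element shorter than the R range.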

6. Operations on vectors in R and operation on lists in Python

In R we can do many operations on vectors, e.g. element by element addition, subtraction, exponentiation, product etc. as shown

#R - Operations on vectors
a<- c(5,2,3,1,7)
b<- c(1,5,4,6,8)

print(a+b)
## [1]  6  7  7  7 15
#Element wise subtraction
print(a-b)
## [1]  4 -3 -1 -5 -1
#Element wise product
print(a*b)
## [1]  5 10 12  6 56
# Exponentiating the elements of a vector
print(a^2)
## [1] 25  4  9  1 49

In Python to do this on lists we need to use the ‘map’ and the ‘lambda’ function as follows

# Python - Operations on list
a =[5,2,3,1,7]
b =[1,5,4,6,8]

#Element wise addition with map & lambda
print(list(map(lambda x,y: x+y,a,b)))
#Element wise subtraction
print(list(map(lambda x,y: x-y,a,b)))
#Element wise product
print(list(map(lambda x,y: x*y,a,b)))
# Exponentiating the elements of a list
print(list(map(lambda x: x**2,a)))

## [6, 7, 7, 7, 15]
## [4, -3, -1, -5, -1]
## [5, 10, 12, 6, 56]
## [25, 4, 9, 1, 49]
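
An alternative to ‘map’ and ‘lambda’ that many Python programmers find easier to read is a list comprehension over `zip(a, b)`. The sketch below repeats the same four operations on the same lists; the variable names for the results are illustrative.

```python
a = [5, 2, 3, 1, 7]
b = [1, 5, 4, 6, 8]

# zip pairs up the elements of a and b position by position
added = [x + y for x, y in zip(a, b)]
subtracted = [x - y for x, y in zip(a, b)]
multiplied = [x * y for x, y in zip(a, b)]
# A single list needs no zip
squared = [x ** 2 for x in a]

print(added)
print(subtracted)
print(multiplied)
print(squared)
```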

However if we create ndarrays from lists then we can do the element wise addition,subtraction,product, etc. like R. Numpy is really a powerful module with many, many functions for matrix manipulations

import numpy as np
a =[5,2,3,1,7]
b =[1,5,4,6,8]
a=np.array(a)
b=np.array(b)
print(a+b)
#Element wise subtraction
print(a-b)
#Element wise product
print(a*b)
# Exponentiating the elements of a list
print(a**2)

## [ 6  7  7  7 15]
## [ 4 -3 -1 -5 -1]
## [ 5 10 12  6 56]
## [25  4  9  1 49]

7. Getting the index of element

To determine the index of an element which satisfies a specific logical condition in R use ‘which’. In the code below the index of the element which is equal to 1 is 4

# R - Which
a<- c(5,2,3,1,7)
print(which(a == 1))
## [1] 4

In Python array we can use np.where to get the same effect. The index will be 3 as the index starts from 0

# Python - np.where
import numpy as np
a =[5,2,3,1,7]
a=np.array(a)
print(np.where(a==1))
## (array([3], dtype=int64),)
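
For a plain Python list, without NumPy, the same effect can be had with `enumerate` and a list comprehension. This is only a sketch of an equivalent; the names `indices` and `r_positions` are illustrative.

```python
a = [5, 2, 3, 1, 7]

# enumerate yields (index, value) pairs; keep the indices where the value is 1
indices = [i for i, x in enumerate(a) if x == 1]
print(indices)

# Adding 1 gives the position that R's which() reports
r_positions = [i + 1 for i in indices]
print(r_positions)
```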

8. Data frames

R, by default, comes with a set of in-built datasets. There are also some datasets which come with the Scikit-Learn package

# R
# To check built-in datasets use
#data() - In R console or in R Studio
#iris - Don't print to console

We can use the in-built data sets that come with Scikit package

#Python
import sklearn as sklearn
import pandas as pd
from sklearn import datasets
# This creates a Sklearn bunch
data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)

9. Working with dataframes

With R you can work with dataframes directly. For more complex dataframe operations in R there are convenient packages like dplyr, reshape2 etc. For Python we need to use the Pandas package. Pandas is quite comprehensive in the list of things we can do with data frames. The most common operations on a dataframe are

• Check the size of the dataframe
• Take a look at the top 5 or bottom 5 rows of dataframe
• Check the content of the dataframe

a.Size

In R use dim()

#R - Size
dim(iris)
## [1] 150   5

For Python use .shape

#Python - size
import sklearn as sklearn
import pandas as pd
from sklearn import datasets
data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)
print(iris.shape)

b. Top & bottom 5 rows of dataframe

To know the top and bottom rows of a data frame we use head() & tail as shown below for R and Python

#R
head(iris,5)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
tail(iris,5)
##     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 146          6.7         3.0          5.2         2.3 virginica
## 147          6.3         2.5          5.0         1.9 virginica
## 148          6.5         3.0          5.2         2.0 virginica
## 149          6.2         3.4          5.4         2.3 virginica
## 150          5.9         3.0          5.1         1.8 virginica
#Python
import sklearn as sklearn
import pandas as pd
from sklearn import datasets
data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)
print(iris.head(5))
print(iris.tail(5))
##    sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
## 0                5.1               3.5                1.4               0.2
## 1                4.9               3.0                1.4               0.2
## 2                4.7               3.2                1.3               0.2
## 3                4.6               3.1                1.5               0.2
## 4                5.0               3.6                1.4               0.2
##      sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
## 145                6.7               3.0                5.2               2.3
## 146                6.3               2.5                5.0               1.9
## 147                6.5               3.0                5.2               2.0
## 148                6.2               3.4                5.4               2.3
## 149                5.9               3.0                5.1               1.8

c. Check the content of the dataframe

#R
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
##        Species
##  setosa    :50
##  versicolor:50
##  virginica :50
##
##
## 
str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
#Python
import sklearn as sklearn
import pandas as pd
from sklearn import datasets
data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)
print(iris.info())
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 150 entries, 0 to 149
## Data columns (total 4 columns):
## sepal length (cm)    150 non-null float64
## sepal width (cm)     150 non-null float64
## petal length (cm)    150 non-null float64
## petal width (cm)     150 non-null float64
## dtypes: float64(4)
## memory usage: 4.8 KB
## None

d. Check column names

#R
names(iris)
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"
## [5] "Species"
colnames(iris)
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"
## [5] "Species"
#Python
import sklearn as sklearn
import pandas as pd
from sklearn import datasets
data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)
#Get column names
print(iris.columns)
## Index(['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
##        'petal width (cm)'],
##       dtype='object')

e. Rename columns

In R we can assign a vector to column names

#R
colnames(iris) <- c("lengthOfSepal","widthOfSepal","lengthOfPetal","widthOfPetal","Species")
colnames(iris)
## [1] "lengthOfSepal" "widthOfSepal"  "lengthOfPetal" "widthOfPetal"
## [5] "Species"

In Python we can assign a list to iris.columns

#Python
import sklearn as sklearn
import pandas as pd
from sklearn import datasets
data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)
iris.columns = ["lengthOfSepal","widthOfSepal","lengthOfPetal","widthOfPetal"]
print(iris.columns)
## Index(['lengthOfSepal', 'widthOfSepal', 'lengthOfPetal', 'widthOfPetal'], dtype='object')

f. Details of dataframe

#Python
import sklearn as sklearn
import pandas as pd
from sklearn import datasets
data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)
print(iris.info())
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 150 entries, 0 to 149
## Data columns (total 4 columns):
## sepal length (cm)    150 non-null float64
## sepal width (cm)     150 non-null float64
## petal length (cm)    150 non-null float64
## petal width (cm)     150 non-null float64
## dtypes: float64(4)
## memory usage: 4.8 KB
## None

g. Subsetting dataframes

# R
#To subset a dataframe 'df' in R we use df[row,column] or df[row vector,column vector]
#df[row,column]
iris[3,4]
## [1] 0.2
#df[row vector, column vector]
iris[2:5,1:3]
##   lengthOfSepal widthOfSepal lengthOfPetal
## 2           4.9          3.0           1.4
## 3           4.7          3.2           1.3
## 4           4.6          3.1           1.5
## 5           5.0          3.6           1.4
#If we omit the row vector, then it implies all rows or if we omit the column vector
# then implies all columns for that row
iris[2:5,]
##   lengthOfSepal widthOfSepal lengthOfPetal widthOfPetal Species
## 2           4.9          3.0           1.4          0.2  setosa
## 3           4.7          3.2           1.3          0.2  setosa
## 4           4.6          3.1           1.5          0.2  setosa
## 5           5.0          3.6           1.4          0.2  setosa
# In R we can call specific columns by column names
# (the columns were renamed above, so the old name Sepal.Length no longer exists)
iris$Sepal.Length[2:5]
## NULL
#Python
# To select an entire row we use .iloc. The index can be used with ':' as
# .iloc[start row:end row]. If the start row is omitted, the slice begins at the
# first row of the data frame; if the end row is omitted, it runs to the last row
import sklearn as sklearn
import pandas as pd
from sklearn import datasets
data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)
print(iris.iloc[3])
print(iris[:5])
# In Python we can select columns by column name as follows
print(iris['sepal length (cm)'][2:6])
# To select more than one column you must use double brackets '[[ ]]', since the
# index is itself a list
print(iris[['sepal length (cm)','sepal width (cm)']][4:7])
## sepal length (cm)    4.6
## sepal width (cm)     3.1
## petal length (cm)    1.5
## petal width (cm)     0.2
## Name: 3, dtype: float64
##    sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
## 0                5.1               3.5                1.4               0.2
## 1                4.9               3.0                1.4               0.2
## 2                4.7               3.2                1.3               0.2
## 3                4.6               3.1                1.5               0.2
## 4                5.0               3.6                1.4               0.2
## 2    4.7
## 3    4.6
## 4    5.0
## 5    5.4
## Name: sepal length (cm), dtype: float64
##    sepal length (cm)  sepal width (cm)
## 4                5.0               3.6
## 5                5.4               3.9
## 6                4.6               3.4
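
The row slices above follow Python's usual half-open convention: the start position is included and the stop position is excluded, which is why [2:6] returns rows 2 to 5. R's 2:5, by contrast, includes both ends. Here is a minimal sketch using a plain list that holds the first seven sepal lengths from the output above (the variable names are my own):

```python
# Half-open slicing: the start index is included, the stop index is excluded.
sepal_lengths = [5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6]  # rows 0..6 of the output above

middle = sepal_lengths[2:6]  # positions 2, 3, 4 and 5
head = sepal_lengths[:5]     # start omitted: begin at position 0
tail = sepal_lengths[4:]     # stop omitted: run to the end of the list

print(middle)
print(head)
print(tail)
```

The same start/stop rule applies to .iloc and to the bare [start:stop] row slices on a pandas dataframe.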

h. Computing Mean, Standard deviation

#R
#Mean
mean(iris$lengthOfSepal)
## [1] 5.843333
#Standard deviation
sd(iris$widthOfSepal)
## [1] 0.4358663
#Python
#Mean
import sklearn as sklearn
import pandas as pd
from sklearn import datasets
data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)
print(iris['sepal length (cm)'].mean())
#Standard deviation
print(iris['sepal width (cm)'].std())
## 5.843333333333335
## 0.4335943113621737
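
Both sd() in R and the .std() method in pandas compute the sample standard deviation by default, which divides by n - 1 rather than n. The sketch below spells that formula out in plain Python and checks it against the standard library; the helper name sample_sd is my own, and the five widths are simply the first five sepal widths from the data above.

```python
import math
import statistics


def sample_sd(values):
    """Sample standard deviation: squared deviations divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    squared_dev = sum((v - mean) ** 2 for v in values)
    return math.sqrt(squared_dev / (n - 1))


widths = [3.5, 3.0, 3.2, 3.1, 3.6]  # first five sepal widths

print(sample_sd(widths))
print(statistics.stdev(widths))   # same n - 1 formula
print(statistics.pstdev(widths))  # population formula, divides by n
```

The population version is always slightly smaller, so it is worth checking which one a library uses before comparing numbers across languages.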

i. Boxplot

A boxplot can be produced in R using base graphics

#R
boxplot(iris$lengthOfSepal)

Matplotlib is a popular package in Python for plots

#Python
import sklearn as sklearn
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets
data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)
img=plt.boxplot(iris['sepal length (cm)'])
plt.show()

j. Scatter plot

#R
plot(iris$widthOfSepal,iris$lengthOfSepal)

#Python
import matplotlib.pyplot as plt
import sklearn as sklearn
import pandas as pd
from sklearn import datasets
data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)
img=plt.scatter(iris['sepal width (cm)'],iris['sepal length (cm)'])
#plt.show()

k. Read from csv file

#R
tendulkar= read.csv("tendulkar.csv",stringsAsFactors = FALSE,na.strings=c(NA,"-"))
#Dimensions of dataframe
dim(tendulkar)
## [1] 347  13
names(tendulkar)
##  [1] "X"          "Runs"       "Mins"       "BF"         "X4s"
##  [6] "X6s"        "SR"         "Pos"        "Dismissal"  "Inns"
## [11] "Opposition" "Ground"     "Start.Date"

Use pandas.read_csv() for Python

#Python
import pandas as pd
#Read csv
tendulkar= pd.read_csv("tendulkar.csv",na_values=["-"])
print(tendulkar.shape)
print(tendulkar.columns)
## (347, 13)
## Index(['Unnamed: 0', 'Runs', 'Mins', 'BF', '4s', '6s', 'SR', 'Pos',
##        'Dismissal', 'Inns', 'Opposition', 'Ground', 'Start Date'],
##       dtype='object')

l. Clean the dataframe in R and Python

The following steps are done for R and Python

1. Remove rows with ‘DNB’
2. Remove rows with ‘TDNB’
3. Remove rows with absent
4. Remove the “*” indicating not out
5. Remove incomplete rows with NA for R or NaN in Python
6. Do a scatter plot

#R
# Remove rows with 'DNB'
a <- tendulkar$Runs != "DNB"
tendulkar <- tendulkar[a,]
dim(tendulkar)
## [1] 330  13
# Remove rows with 'TDNB'
b <- tendulkar$Runs != "TDNB"
tendulkar <- tendulkar[b,]
# Remove rows with absent
c <- tendulkar$Runs != "absent"
tendulkar <- tendulkar[c,]
dim(tendulkar)
## [1] 329  13
# Remove the "*" indicating not out
tendulkar$Runs <- as.numeric(gsub("\\*","",tendulkar$Runs))
dim(tendulkar)
## [1] 329  13
# Select only complete rows - complete.cases()
c <- complete.cases(tendulkar)
#Subset the rows which are complete
tendulkar <- tendulkar[c,]
dim(tendulkar)
## [1] 327  13
# Do some base plotting - Scatter plot
plot(tendulkar$BF,tendulkar$Runs)

#Python
import pandas as pd
import matplotlib.pyplot as plt
#Read csv
tendulkar= pd.read_csv("tendulkar.csv",na_values=["-"])
print(tendulkar.shape)
# Remove rows with 'DNB'
a=tendulkar.Runs !="DNB"
tendulkar=tendulkar[a]
print(tendulkar.shape)
# Remove rows with 'TDNB'
b=tendulkar.Runs !="TDNB"
tendulkar=tendulkar[b]
print(tendulkar.shape)
# Remove rows with absent
c= tendulkar.Runs != "absent"
tendulkar=tendulkar[c]
print(tendulkar.shape)
# Remove the "*" indicating not out (regex=True is needed on recent pandas versions)
tendulkar.Runs= tendulkar.Runs.str.replace(r"[*]","",regex=True)
#Select only complete rows - dropna()
tendulkar=tendulkar.dropna()
print(tendulkar.shape)
tendulkar.Runs = tendulkar.Runs.astype(int)
tendulkar.BF = tendulkar.BF.astype(int)
#Scatter plot
plt.scatter(tendulkar.BF,tendulkar.Runs)
## (347, 13)
## (330, 13)
## (329, 13)
## (329, 13)
## (327, 13)
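
To make the cleaning steps concrete without needing tendulkar.csv, here is a small sketch in plain Python that applies the same rules to a single column. The values in raw_runs are made up for illustration, None stands in for NA/NaN, and clean_runs is a name I picked, not something from pandas.

```python
# Hypothetical raw values for a Runs column; None plays the role of NA/NaN.
raw_runs = ["15", "DNB", "59*", "TDNB", "8", None, "absent", "41"]

# Markers for innings in which the player did not bat.
NOT_BATTED = {"DNB", "TDNB", "absent"}


def clean_runs(values):
    """Drop non-batting and missing entries, strip '*', convert to int."""
    cleaned = []
    for value in values:
        if value is None:        # incomplete row
            continue
        if value in NOT_BATTED:  # did not bat, team did not bat, or absent
            continue
        cleaned.append(int(value.replace("*", "")))  # '*' marks a not-out score
    return cleaned


runs = clean_runs(raw_runs)
print(runs)
```

The order matters in the same way as in the dataframe version: the text markers have to go before the column can be converted to integers.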

m. Chaining operations on dataframes

To chain a set of operations in R we need a package like dplyr. Pandas supports chaining natively. The following operations are performed on the tendulkar data frame, with dplyr in R and Pandas in Python

1. Group by ground
2. Compute average runs in each ground
3. Arrange in descending order
#R
library(dplyr)
tendulkar1 <- tendulkar %>% group_by(Ground) %>% summarise(meanRuns= mean(Runs)) %>%
arrange(desc(meanRuns))
head(tendulkar1,10)
## # A tibble: 10 × 2
##           Ground  meanRuns
##
## 1         Multan 194.00000
## 2          Leeds 193.00000
## 3  Colombo (RPS) 143.00000
## 4        Lucknow 142.00000
## 5          Dhaka 132.75000
## 6     Manchester  93.50000
## 7         Sydney  87.22222
## 8   Bloemfontein  85.00000
## 9     Georgetown  81.00000
## 10 Colombo (SSC)  77.55556
#Python
import pandas as pd
#Read csv
tendulkar= pd.read_csv("tendulkar.csv",na_values=["-"])
print(tendulkar.shape)
# Remove rows with 'DNB'
a=tendulkar.Runs !="DNB"
tendulkar=tendulkar[a]
# Remove rows with 'TDNB'
b=tendulkar.Runs !="TDNB"
tendulkar=tendulkar[b]
# Remove rows with absent
c= tendulkar.Runs != "absent"
tendulkar=tendulkar[c]
# Remove the "*" indicating not out (regex=True is needed on recent pandas versions)
tendulkar.Runs= tendulkar.Runs.str.replace(r"[*]","",regex=True)

#Select only complete rows - dropna()
tendulkar=tendulkar.dropna()
tendulkar.Runs = tendulkar.Runs.astype(int)
tendulkar.BF = tendulkar.BF.astype(int)
# Select the Runs column before taking the mean, so that text columns are not averaged
tendulkar1= tendulkar.groupby('Ground')['Runs'].mean().sort_values(ascending=False)
print(tendulkar1.head(10))
## (347, 13)
## Ground
## Multan           194.000000
## Leeds            193.000000
## Colombo (RPS)    143.000000
## Lucknow          142.000000
## Dhaka            132.750000
## Manchester        93.500000
## Sydney            87.222222
## Bloemfontein      85.000000
## Georgetown        81.000000
## Colombo (SSC)     77.555556
## Name: Runs, dtype: float64
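
For readers who want to see what the group/summarise/arrange chain amounts to, the three steps can be sketched with nothing but the standard library. The (ground, runs) pairs below are invented for illustration and are not taken from tendulkar.csv.

```python
from collections import defaultdict

# Hypothetical (ground, runs) pairs.
innings = [
    ("Ground A", 100),
    ("Ground B", 40),
    ("Ground A", 50),
    ("Ground C", 90),
    ("Ground B", 20),
]

# 1. Group by ground
runs_by_ground = defaultdict(list)
for ground, runs in innings:
    runs_by_ground[ground].append(runs)

# 2. Compute average runs at each ground
mean_runs = {g: sum(r) / len(r) for g, r in runs_by_ground.items()}

# 3. Arrange in descending order of the average
ranking = sorted(mean_runs.items(), key=lambda item: item[1], reverse=True)

for ground, average in ranking:
    print(ground, average)
```

dplyr and pandas express the same three steps as one chained expression, which is what makes them pleasant to read.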

9. Functions

#R
product <- function(a,b){
  c<- a*b
  c
}
product(5,7)
## [1] 35
#Python
def product(a,b):
    c = a*b
    return c

print(product(5,7))

## 35



Conclusion

Personally, I took to R, much like a ‘duck takes to water’. I found the R syntax very simple and mostly intuitive. R packages like dplyr, ggplot2 and reshape2 make the language quite irresistible. R is dynamically typed, and in day-to-day use you mostly deal with numeric and character types, as opposed to the full-fledged data types in Python.

Python has too many bells and whistles, which can be a little bewildering to the novice. It is possible that they may be useful as one becomes more experienced with the language. Also, I found that installing Python packages sometimes gives errors with Python versions 2.7 or 3.6. This will leave you scrambling to Google to find out how to fix these problems, which can be quite frustrating. R, on the other hand, makes installing R packages a breeze.

Anyway, this is my current opinion, and like all opinions, may change in the course of time. Let’s see!

I may write a follow up post with more advanced features of R and Python. So do keep checking! Long live R! Viva la Python!

Note: This post was created using RStudio’s RMarkdown which allows you to embed R and Python code snippets. It works perfectly, except that matplotlib’s pyplot does not display.

More book, more cricket! 2nd edition of my books now on Amazon

a) Cricket analytics with cricketr
b) Beaten by sheer pace – Cricket analytics with yorkr
is now available on Amazon, both as Paperback and Kindle versions.

The Kindle versions are just $4.99 for both books. Pick up your copies today!!!

My 3 video presentations on “Essential R”

In this post I include my 3 video presentations on the topic “Essential R”. In these 3 presentations I cover the entire landscape of R. I cover the following

• R Language – The essentials
• Key R Packages (dplyr, lubridate, ggplot2, etc.)
• How to create R Markdown and share reports
• A look at Shiny apps
• How to create a simple R package

You can download the relevant slide deck and practice code at Essential R

Essential R – Part 1

This video covers basic R data types – character, numeric, vectors, matrices, lists and data frames. It also touches on how to subset these data types

Essential R – Part 2

This video continues on how to subset dataframes (the most important data type) and some important packages. It also presents one of the most important jobs of a Data Scientist – that of cleaning and shaping the data. This is done with an example unclean data frame. It also touches on some key operations of dplyr like select, filter, arrange, summarise and mutate. Other packages like lubridate and quantmod are also included. This presentation also shows how to use base plot and ggplot2

Essential R – Part 3

This final session covers R Markdown, and touches on some of the key markdown elements. There is a brief overview of a simple Shiny app. Finally this presentation also shows the key steps to create an R package

These 3 R sessions cover most of the basic R topics that we tend to use in our day-to-day R way of life. With this you should be able to hit the ground running! Hope you enjoy these video presentations and also hope you have an even greater time with R!
cricketr flexes new muscles: The final analysis

Twas brillig, and the slithy toves
Did gyre and gimble in the wabe:
All mimsy were the borogoves,
And the mome raths outgrabe.

Jabberwocky by Lewis Carroll

No analysis of cricket is complete without determining how players would perform in the host country. Playing Test cricket on foreign pitches, in the host country, is a ‘real test’ for both batsmen and bowlers. Players who can perform consistently on both domestic and foreign pitches are the genuinely ‘class’ players. Player performance on foreign pitches lets us differentiate the paper tigers and home ground bullies among batsmen. Similarly, spinners who perform well only on rank turners at home, or pace bowlers who can only swing and generate bounce on specially prepared pitches, are neither genuine spinners nor real pace bowlers.

So this post helps in identifying those with real strengths, and those who play well only when the conditions are in their favor, on home grounds. This post brings a certain level of finality to the analysis of players with my R package ‘cricketr’.

Besides, I also meant ‘final analysis’ in the literal sense, as I intend to take a long break from cricket analysis/analytics and focus on some other domains like Neural Networks, Deep Learning and Spark.

The 3rd edition of my books (paperback & kindle) Cricket analytics with cricketr & Beaten by sheer pace! Cricket analytics with yorkr is now available on Amazon for $12.99 (each for paperbacks), and $4.99/Rs 320 and $6.99/Rs448 respectively

As already mentioned, my R package ‘cricketr’ uses the statistics info available in ESPN Cricinfo Statsguru. You should be able to install the package from CRAN and use many of the functions available in the package. Please be mindful of ESPN Cricinfo Terms of Use

To get the data of a player against a particular country for matches played in the host country, I just had to add 2 extra parameters to the getPlayerData() function. The cricketr package has been updated with the changed functions getPlayerData() for Tests, getPlayerDataOD() for ODIs and getPlayerDataTT() for Twenty20s. The updated functions will be available in cricketr version 0.0.14

The data for the following players have already been obtained with the new, changed getPlayerData() function and have been saved as *.csv files. I will be re-using these files, instead of getting them all over again. Hence the getPlayerData() lines have been commented below

library(cricketr)

1. Performance of a batsman against a host country in the host country

For example, we can get the data for Sachin Tendulkar for matches played against Australia and in Australia. Here opposition=2 and host=2 indicate that the opposition is Australia and the host country is also Australia

#tendulkarAus=getPlayerData(35320,opposition=2,host=2,file="tendulkarVsAusInAus.csv",type="batting")

All cricketr functions can be used with this data frame, as before. All the charts show the performance of Tendulkar in Australia against Australia.

par(mfrow=c(2,3))
par(mar=c(4,4,2,2))
batsman4s("./data/tendulkarVsAusInAus.csv","Tendulkar")
batsman6s("./data/tendulkarVsAusInAus.csv","Tendulkar")
batsmanRunsRanges("./data/tendulkarVsAusInAus.csv","Tendulkar")
batsmanDismissals("./data/tendulkarVsAusInAus.csv","Tendulkar")
batsmanAvgRunsGround("./data/tendulkarVsAusInAus.csv","Tendulkar")
batsmanMovingAverage("./data/tendulkarVsAusInAus.csv","Tendulkar")

dev.off()
## null device
##           1

2. Relative performances of international batsmen against England in England

While we can analyze the performance of a player against an opposition in some host country, I wanted to compare the relative performances of players, to see how players from different nations play in a host country which is not their home ground.

The following lines get the players’ data for matches played in England and against England. The Oval and Lord’s are famous for generating some dangerous swing and bounce. I chose the following players

1. Don Bradman (Australia)
2. Steve Waugh (Australia)
3. Rahul Dravid (India)
4. Vivian Richards (West Indies)
5. Sachin Tendulkar (India)
#tendulkarEng=getPlayerData(35320,opposition=1,host=1,file="tendulkarVsEngInEng.csv",type="batting")
#srwaughEng=getPlayerData(8192,opposition=1,host=1,file="srwaughVsEngInEng.csv",type="batting")
#dravidEng=getPlayerData(28114,opposition=1,host=1,file="dravidVsEngInEng.csv",type="batting")
#vrichardEng=getPlayerData(52812,opposition=1,host=1,file="vrichardsEngInEng.csv",type="batting")
frames <- list("./data/tendulkarVsEngInEng.csv","./data/bradmanVsEngInEng.csv","./data/srwaughVsEngInEng.csv",
"./data/dravidVsEngInEng.csv","./data/vrichardsEngInEng.csv")
names <- list("S Tendulkar","D Bradman","SR Waugh","R Dravid","Viv Richards")

Lord’s and the Oval in England are some of the best pitches in the world. Scoring on these pitches and in these weather conditions, where there is both swing and bounce, really requires excellent batting skills. It can easily be seen that Don Bradman stands head and shoulders above everybody else, with a cumulative average of 100+. He is followed by Viv Richards, who averages around 60. Interestingly, in English conditions Rahul Dravid edges out Sachin Tendulkar.

relativeBatsmanCumulativeAvgRuns(frames,names)
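
As a rough illustration of what a cumulative average curve represents, the sketch below computes the running mean of runs after each innings. This is only my simplified reading of the idea; the actual computation inside cricketr may differ (for example in how not-outs are treated), and the function name and sample scores are hypothetical.

```python
from itertools import accumulate


def cumulative_average(runs):
    """Running mean: total runs so far divided by the number of innings so far."""
    totals = accumulate(runs)
    return [total / n for n, total in enumerate(totals, start=1)]


scores = [10, 50, 0, 100]  # made-up innings scores
print(cumulative_average(scores))  # [10.0, 30.0, 20.0, 40.0]
```

A single big score moves the curve a lot early in a career and much less later on, which is why the plots flatten out as the number of innings grows.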

# The other 2 plots, on relative strike rate and cumulative average strike rate,
# show that Viv Richards really blasts the bowling. Viv Richards has a strike rate
# of 70, while Bradman has 62+, followed by Tendulkar.
relativeBatsmanSR(frames,names)

relativeBatsmanCumulativeStrikeRate(frames,names)
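
Strike rate is runs scored per 100 balls faced. One plausible way to build a cumulative strike rate, sketched below, is to divide the runs accumulated so far by the balls accumulated so far. I am assuming this definition for illustration; cricketr may instead average the per-innings strike rates, and the sample innings are invented.

```python
def strike_rate(runs, balls_faced):
    """Runs scored per 100 balls faced."""
    return 100.0 * runs / balls_faced


def cumulative_strike_rate(innings):
    """Strike rate over all innings up to and including each one.

    innings is a list of (runs, balls_faced) pairs with balls_faced > 0.
    """
    rates = []
    total_runs = 0
    total_balls = 0
    for runs, balls in innings:
        total_runs += runs
        total_balls += balls
        rates.append(strike_rate(total_runs, total_balls))
    return rates


sample = [(30, 60), (70, 40), (20, 100)]  # hypothetical (runs, balls) per innings
print(cumulative_strike_rate(sample))  # [50.0, 100.0, 60.0]
```

Weighting by balls faced means a quick 70 off 40 balls lifts the figure more than a short cameo would.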

3. Relative performances of international batsmen against Australia in Australia

The following players from these countries were chosen

1. Sachin Tendulkar (India)
2. Viv Richards (West Indies)
3. David Gower (England)
4. Jacques Kallis (South Africa)
5. Alastair Cook (England)
frames <- list("./data/tendulkarVsAusInAus.csv","./data/vrichardsVAusInAus.csv","./data/dgowerVsAusInAus.csv",
"./data/kallisVsAusInAus.csv","./data/ancookVsWIInWI.csv")
names <- list("S Tendulkar","Viv Richards","David Gower","J Kallis","AN Cook")

Alastair Cook of England has a fantastic cumulative average of 55+ on the pitches of Australia. There is a dip towards the end, but we cannot predict whether it would have continued. AN Cook is followed by Tendulkar, who has a steady average of 50+ runs, after which there is Viv Richards.

relativeBatsmanCumulativeAvgRuns(frames,names)

#With respect to cumulative or relative strike rate Viv Richards is a class apart. He seems to really
#tear into bowlers. David Gower has an excellent strike rate and is followed by Tendulkar
relativeBatsmanSR(frames,names)

relativeBatsmanCumulativeStrikeRate(frames,names)

4. Relative performances of international batsmen against India in India

While England & Australia are famous for bouncy tracks with swing, Indian pitches are renowned for being extraordinary turners. Also India has always thrown up world class spinners, from the spin quartet of BS Chandrasekhar, Bishen Singh Bedi, EAS Prasanna and S Venkataraghavan, to the times of the dangerous Anil Kumble, and now to the more recent Ravichandran Ashwin and Harbhajan Singh.

A batsman who can score runs in India against Indian spinners has to be really adept at handling all kinds of spin.

While Clive Lloyd & Alvin Kallicharran had the best performance against India, they have not been included as ESPN Cricinfo had many of the columns missing.

So I chose the following international players for the analysis against India

1. Hashim Amla (South Africa)
2. Alastair Cook (England)
3. Matthew Hayden (Australia)
4. Viv Richards (West Indies)
frames <- list("./data/amlaVsIndInInd.csv","./data/ancookVsIndInInd.csv","./data/mhaydenVsIndInInd.csv",
"./data/vrichardsVsIndInInd.csv")
names <- list("H Amla","AN Cook","M Hayden","Viv Richards")

Excluding Clive Lloyd & Alvin Kallicharran, the next best performer against India is Hashim Amla, followed by Alastair Cook and Viv Richards.

relativeBatsmanCumulativeAvgRuns(frames,names)

#With respect to strike rate, there is no contest when Viv Richards is around. He is clearly the best
#striker of the ball regardless of whether it is the pacy wickets of
#Australia/England or the spinning tracks of the subcontinent. After
#Viv Richards, Hayden and Alastair Cook have good cumulative strike rates
#in India
relativeBatsmanSR(frames,names)

relativeBatsmanCumulativeStrikeRate(frames,names)

5. All time greats of Indian batting

I couldn’t resist checking out how the top Indian batsmen perform when playing in host countries. So here is a look at how they fare against different host countries

6. Top Indian batsmen against Australia in Australia

The following Indian batsmen were chosen

1. Sunil Gavaskar
2. Sachin Tendulkar
3. Virat Kohli
4. Virender Sehwag
5. VVS Laxman
frames <- list("./data/tendulkarVsAusInAus.csv","./data/gavaskarVsAusInAus.csv","./data/kohliVsAusInAus.csv",
"./data/sehwagVsAusInAus.csv","./data/vvslaxmanVsAusInAus.csv")
names <- list("S Tendulkar","S Gavaskar","V Kohli","V Sehwag","VVS Laxman")

Virat Kohli has the best overall performance against Australia, with a current cumulative average of 60+ runs over the 15 innings he has played. With 15 matches, the 2nd best is Virender Sehwag, followed by VVS Laxman. Tendulkar maintains a cumulative average of 48+ runs over 30+ innings.

relativeBatsmanCumulativeAvgRuns(frames,names)

# Sehwag leads the strike rate against host Australia, followed by
# Tendulkar in Australia and then Kohli
relativeBatsmanSR(frames,names)

relativeBatsmanCumulativeStrikeRate(frames,names)

7. Top Indian batsmen against England in England

The top Indian batsmen’s performances against England are shown below

1. Sachin Tendulkar
2. Rahul Dravid
3. Dilip Vengsarkar
4. Sourav Ganguly
5. Sunil Gavaskar
6. Virat Kohli
frames <- list("./data/tendulkarVsEngInEng.csv","./data/dravidVsEngInEng.csv","./data/vengsarkarVsEngInEng.csv",
names <- list("S Tendulkar","R Dravid","D Vengsarkar","S Ganguly","S Gavaskar","V Kohli")

Rahul Dravid has the best performance against England and edges out Tendulkar, who is followed by Sourav Ganguly. Note: Incidentally, Virat Kohli’s performance against England in England so far has been extremely poor and he averages around 13-15 runs per innings. However he has a long way to go and I hope he catches up. In any case it will be an uphill climb for Kohli in England.

relativeBatsmanCumulativeAvgRuns(frames,names)

#Tendulkar, Ganguly and Dravid have the best strike rate and in that order.
relativeBatsmanSR(frames,names)

relativeBatsmanCumulativeStrikeRate(frames,names)

8. Top Indian batsmen against West Indies in West Indies

frames <- list("./data/tendulkarVsWInWI.csv","./data/dravidVsWInWI.csv","./data/vvslaxmanVsWIInWI.csv",
names <- list("S Tendulkar","R Dravid","VVS Laxman","S Gavaskar")

Against the West Indies, Sunil Gavaskar is head and shoulders above the rest. Gavaskar has a very impressive cumulative average against West Indies

relativeBatsmanCumulativeAvgRuns(frames,names)

# VVS Laxman followed by  Tendulkar & then Dravid have a very
# good strike rate against the West Indies
relativeBatsmanCumulativeStrikeRate(frames,names)

9. World’s best spinners on tracks suited for pace & bounce

In this part I compare the performances of the top 3 spinners in recent years and check out how they perform on surfaces that are known for pace, and bounce. I have taken the following 3 spinners

1. Anil Kumble (India)
2. M Muralitharan (Sri Lanka)
3. Shane Warne (Australia)
#kumbleEng=getPlayerData(30176  ,opposition=3,host=3,file="kumbleVsEngInEng.csv",type="bowling")
#muraliEng=getPlayerData(49636  ,opposition=3,host=3,file="muraliVsEngInEng.csv",type="bowling")
#warneEng=getPlayerData(8166  ,opposition=3,host=3,file="warneVsEngInEng.csv",type="bowling")

10. Top international spinners against England in England

frames <- list("./data/kumbleVsEngInEng.csv","./data/muraliVsEngInEng.csv","./data/warneVsEngInEng.csv")
names <- list("Anil Kumble","M Muralitharan","Shane Warne")

Against England and in England, Muralitharan shines with a cumulative average of nearly 5 wickets per match and a peak of almost 8 wickets. Shane Warne has a steady average of about 5 wickets, followed by Anil Kumble.

relativeBowlerCumulativeAvgWickets(frames,names)

# In the order of relative cumulative economy rate, Warne has the best figures, followed by Anil Kumble.
# Muralitharan is much more expensive.
relativeBowlerCumulativeAvgEconRate(frames,names)
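
Economy rate is the number of runs conceded per over. Scorecards usually write overs as "10.3", meaning 10 overs and 3 balls, so the figure cannot be used directly as a decimal number. The sketch below converts that notation to balls before dividing. The 6-ball over and the string format are my assumptions for illustration, and the helper names are hypothetical rather than part of cricketr.

```python
def overs_to_balls(overs_text, balls_per_over=6):
    """Convert scorecard notation such as '10.3' (10 overs, 3 balls) to balls."""
    whole, _, part = overs_text.partition(".")
    extra_balls = int(part) if part else 0
    return int(whole) * balls_per_over + extra_balls


def economy_rate(runs_conceded, overs_text, balls_per_over=6):
    """Runs conceded per over, using the exact number of balls bowled."""
    balls = overs_to_balls(overs_text, balls_per_over)
    return runs_conceded * balls_per_over / balls


print(overs_to_balls("10.3"))    # 63
print(economy_rate(42, "10.3"))  # 4.0
print(economy_rate(30, "10"))    # 3.0
```

A lower economy rate is better for the bowler, which is why the more "expensive" bowlers sit higher on these plots.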

11. Top international spinners against South Africa in South Africa

frames <- list("./data/kumbleVsSAInSA.csv","./data/muraliVsSAInSA.csv","./data/warneVsSAInSA.csv")
names <- list("Anil Kumble","M Muralitharan","Shane Warne")

In South Africa too, Muralitharan has the best wicket taking performance averaging about 4 wickets. Warne averages around 3 wickets and Kumble around 2 wickets

relativeBowlerCumulativeAvgWickets(frames,names)

# Muralitharan is expensive in South Africa too, while Kumble and Warne go neck and neck in the economy rate.
# Kumble edges out Warne and has a better cumulative average economy rate
relativeBowlerCumulativeAvgEconRate(frames,names)

11. Top international pacers against India in India

As a final analysis I check how the world’s pacers perform in India against India. Indian pitches are supposed to be flat and devoid of bounce, while being terrific turners. Hence Indian pitches are more suited to spin bowling than pace bowling. This is changing these days.

The best performers against India in India are mostly the deadly pacemen of yesteryears

For this I have chosen the following bowlers

1. Courtney Walsh (West Indies)
2. Andy Roberts (West Indies)
3. Malcolm Marshall (West Indies)
4. Glenn McGrath (Australia)
#cawalshInd=getPlayerData(53216  ,opposition=6,host=6,file="cawalshVsIndInInd.csv",type="bowling")
#arobertsInd=getPlayerData(52817  ,opposition=6,host=6,file="arobertsIndInInd.csv",type="bowling")
#mmarshallInd=getPlayerData(52419  ,opposition=6,host=6,file="mmarshallVsIndInInd.csv",type="bowling")
#gmccgrathInd=getPlayerData(6565  ,opposition=6,host=6,file="mccgrathVsIndInInd.csv",type="bowling")
frames <- list("./data/cawalshVsIndInInd.csv","./data/arobertsIndInInd.csv","./data/mmarshallVsIndInInd.csv",
"./data/mccgrathVsIndInInd.csv")
names <- list("C Walsh","A Roberts","M Marshall","G McGrath")

Courtney Walsh has the best performance, followed by Andy Roberts and then Malcolm Marshall, who edges ahead of Glenn McGrath

relativeBowlerCumulativeAvgWickets(frames,names)

#On the other hand McGrath has the best economy rate, followed by A Roberts and then Courtney Walsh
relativeBowlerCumulativeAvgEconRate(frames,names)

12. ODI performance of a player against a specific country in the host country

This gets the data for MS Dhoni in ODI matches against Australia and in Australia

#dhoniAusODI=getPlayerDataOD(28081,opposition=2,host=2,file="dhoniVsAusInAusODI.csv",type="batting")

13. Twenty 20 performance of a player against a specific country in the host country

#dhoniAusTT=getPlayerDataTT(28081,opposition=2,host=2,file="dhoniVsAusInAusTT.csv",type="batting")

All the ODI and Twenty20 functions of cricketr can be used on the above dataframes of MS Dhoni.

Some key observations

Here are some key observations

1. At the top of the batting spectrum is Don Bradman with a very impressive average of 100-120 in matches played in England and Australia. Unfortunately he did not play matches in other countries and on different pitches.
2. Viv Richards has the best cumulative strike rate overall.
3. Muralitharan strikes more often than Kumble or Warne, even on pitches in England, South Africa and West Indies. However, Muralitharan is also the most expensive.
4. Warne and Kumble have a much better economy rate than Muralitharan.
5. Sunil Gavaskar has an extremely impressive performance in West Indies.
6. Rahul Dravid performs much better than Tendulkar in both England and West Indies.
7. Virat Kohli has the best performance against Australia so far, followed by Sehwag, and I hope he maintains his stellar performance. However, Kohli’s performance in England has been very poor.
8. West Indies batsmen and bowlers seem to thrive on Indian pitches, with Clive Lloyd and Alvin Kallicharran at the top of the list.


Analysis of IPL T20 matches with yorkr templates

Introduction

In this post I create RMarkdown templates, based on my R package yorkr, for end-to-end analysis of the IPL T20 matches that are available on Cricsheet. With these templates you can convert all IPL data, which is in yaml format, to R dataframes. Further, I create the data and the necessary templates for analyzing IPL matches, teams and players. All of these can be accessed at yorkrIPLTemplate.

The templates are

1. Template for conversion and setup – IPLT20Template.Rmd
2. Any IPL match – IPLMatchtemplate.Rmd
3. IPL matches between 2 teams – IPLMatches2TeamTemplate.Rmd
4. An IPL team’s performance against all other IPL teams – IPLAllMatchesAllOppnTemplate.Rmd
5. Analysis of IPL batsmen and bowlers of all IPL teams – IPLBatsmanBowlerTemplate.Rmd

Besides the templates, the repository also includes the converted data for all IPL matches I downloaded from Cricsheet in Dec 2016. So this data is complete till the 2016 IPL season. You can recreate the files as more matches are added to the Cricsheet site in IPL 2017 and future seasons. This post contains all the steps needed for detailed analysis of IPL matches, teams and IPL players. This will also be my reference if I decide to analyze IPL in future!

There will be 5 folders at the root

1. IPLdata – Match files as yaml from Cricsheet
2. IPLMatches – Yaml match files converted to dataframes
3. IPLMatchesBetween2Teams – All matches between any 2 IPL teams
4. allMatchesAllOpposition – An IPL team’s performance against all other teams
5. BattingBowlingDetails – Batting and bowling details of all IPL teams

library(yorkr)
library(dplyr)

The first few steps take care of the data setup. This needs to be done before any of the analysis of IPL batsmen, bowlers, any IPL match, matches between any 2 IPL teams or analysis of a team’s performance against all other teams

There will be 5 folders at the root

1. data
2. IPLMatches
3. IPLMatchesBetween2Teams
4. allMatchesAllOpposition
5. BattingBowlingDetails

The source YAML files will be in IPLData folder

1. Create directory of IPLMatches

Some files may give conversion errors. You could try to debug the problem or just remove the file from the IPLdata folder. At most 2-4 files will have conversion problems and I usually remove them from the files to be converted. Also take a look at my GooglyPlus shiny app, which was created after performing the same conversion on the Dec 16 data.

convertAllYaml2RDataframesT20("data","IPLMatches")

2. Save all matches between all combinations of IPL teams

This function will create the set of all matches between each IPL team and every other IPL team. This uses the data that was created in IPLMatches, with the convertAllYaml2RDataframesIPL() function.

setwd("./IPLMatchesBetween2Teams")
saveAllMatchesBetween2IPLTeams("../IPLMatches")

3. Save all matches against all opposition

This will create a consolidated dataframe of all matches played by every IPL team against all other teams. This also uses the data that was created in IPLMatches, with the convertAllYaml2RDataframesIPL() function.

setwd("../allMatchesAllOpposition")
saveAllMatchesAllOppositionIPLT20("../IPLMatches")

4. Create batting and bowling details for each IPL team

These are the current IPL playing teams. You can add to this vector as newer teams start playing in the IPL. You can also find all the IPL teams by looking at the directory created above, namely allMatchesAllOpposition. This also uses the data that was created in IPLMatches, with the convertAllYaml2RDataframesIPL() function.

setwd("../BattingBowlingDetails")
ipl_teams <- list("Chennai Super Kings","Deccan Chargers", "Delhi Daredevils","Kings XI Punjab",
                  "Kochi Tuskers Kerala","Kolkata Knight Riders","Mumbai Indians","Pune Warriors",
                  "Rajasthan Royals","Royal Challengers Bangalore","Sunrisers Hyderabad","Gujarat Lions",
                  "Rising Pune Supergiants")

for(i in seq_along(ipl_teams)){
    print(ipl_teams[i])
    val <- paste(ipl_teams[i],"-details",sep="")
    val <- getTeamBattingDetails(ipl_teams[i],dir="../IPLMatches", save=TRUE)
}

for(i in seq_along(ipl_teams)){
    print(ipl_teams[i])
    val <- paste(ipl_teams[i],"-details",sep="")
    val <- getTeamBowlingDetails(ipl_teams[i],dir="../IPLMatches", save=TRUE)
}

5. Get the list of batsmen for a particular IPL team

The following code is needed for analyzing individual IPL batsmen. In IPL a player could have played in multiple IPL teams.

getBatsmen <- function(df){
bmen <- df %>% distinct(batsman)
bmen <- as.character(bmen$batsman)
batsmen <- sort(bmen)
}
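load("Chennai Super Kings-BattingDetails.RData")
csk_details <- battingDetails
load("Deccan Chargers-BattingDetails.RData")
dc_details <- battingDetails
load("Delhi Daredevils-BattingDetails.RData")
dd_details <- battingDetails
load("Kings XI Punjab-BattingDetails.RData")
kxip_details <- battingDetails
load("Kochi Tuskers Kerala-BattingDetails.RData")
ktk_details <- battingDetails
load("Kolkata Knight Riders-BattingDetails.RData")
kkr_details <- battingDetails
load("Mumbai Indians-BattingDetails.RData")
mi_details <- battingDetails
load("Pune Warriors-BattingDetails.RData")
pw_details <- battingDetails
load("Rajasthan Royals-BattingDetails.RData")
rr_details <- battingDetails
load("Royal Challengers Bangalore-BattingDetails.RData")
rcb_details <- battingDetails
load("Sunrisers Hyderabad-BattingDetails.RData")
sh_details <- battingDetails
load("Gujarat Lions-BattingDetails.RData")
gl_details <- battingDetails
load("Rising Pune Supergiants-BattingDetails.RData")
rps_details <- battingDetails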

#Get the batsmen for each IPL team
csk_batsmen <- getBatsmen(csk_details)
dc_batsmen <- getBatsmen(dc_details)
dd_batsmen <- getBatsmen(dd_details)
kxip_batsmen <- getBatsmen(kxip_details)
ktk_batsmen <- getBatsmen(ktk_details)
kkr_batsmen <- getBatsmen(kkr_details)
mi_batsmen <- getBatsmen(mi_details)
pw_batsmen <- getBatsmen(pw_details)
rr_batsmen <- getBatsmen(rr_details)
rcb_batsmen <- getBatsmen(rcb_details)
sh_batsmen <- getBatsmen(sh_details)
gl_batsmen <- getBatsmen(gl_details)
rps_batsmen <- getBatsmen(rps_details)

# Save the dataframes
save(csk_batsmen,file="csk.RData")
save(dc_batsmen, file="dc.RData")
save(dd_batsmen, file="dd.RData")
save(kxip_batsmen, file="kxip.RData")
save(ktk_batsmen, file="ktk.RData")
save(kkr_batsmen, file="kkr.RData")
save(mi_batsmen , file="mi.RData")
save(pw_batsmen, file="pw.RData")
save(rr_batsmen, file="rr.RData")
save(rcb_batsmen, file="rcb.RData")
save(sh_batsmen, file="sh.RData")
save(gl_batsmen, file="gl.RData")
save(rps_batsmen, file="rps.RData")
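
Since a player could have played in multiple IPL teams, the per-team lists saved above are what later let you find every team a given batsman appeared for. A minimal sketch of that lookup in base R, assuming small invented lists in place of the saved csk_batsmen, mi_batsmen and so on; the helper name teamsPlayedFor is hypothetical and not part of yorkr:

```r
# Invented per-team batsmen lists standing in for the saved *_batsmen vectors
teamsBatsmen <- list(
  "Chennai Super Kings" = c("MS Dhoni", "SK Raina"),
  "Mumbai Indians" = c("RG Sharma", "KA Pollard"),
  "Rising Pune Supergiants" = c("AM Rahane", "MS Dhoni")
)

# Hypothetical helper: return the names of the teams whose list contains the player
teamsPlayedFor <- function(player, teamLists) {
  played <- vapply(teamLists, function(players) player %in% players, logical(1))
  names(teamLists)[played]
}

teams <- teamsPlayedFor("MS Dhoni", teamsBatsmen)
print(teams)
```

Using a named list keeps the team name next to its players, so no separate index-to-name step is needed in this sketch.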

6. Get the list of bowlers for a particular IPL team

The method below can get the list of bowler names for any IPL team. The following code is needed for analyzing individual IPL bowlers. In IPL a player could have played in multiple IPL teams.

getBowlers <- function(df){
bwlr <- df %>% distinct(bowler)
bwlr <- as.character(bwlr$bowler)
bowler <- sort(bwlr)
}

load("Chennai Super Kings-BowlingDetails.RData")
csk_details <- bowlingDetails
load("Deccan Chargers-BowlingDetails.RData")
dc_details <- bowlingDetails
load("Delhi Daredevils-BowlingDetails.RData")
dd_details <- bowlingDetails
load("Kings XI Punjab-BowlingDetails.RData")
kxip_details <- bowlingDetails
load("Kochi Tuskers Kerala-BowlingDetails.RData")
ktk_details <- bowlingDetails
load("Kolkata Knight Riders-BowlingDetails.RData")
kkr_details <- bowlingDetails
load("Mumbai Indians-BowlingDetails.RData")
mi_details <- bowlingDetails
load("Pune Warriors-BowlingDetails.RData")
pw_details <- bowlingDetails
load("Rajasthan Royals-BowlingDetails.RData")
rr_details <- bowlingDetails
load("Royal Challengers Bangalore-BowlingDetails.RData")
rcb_details <- bowlingDetails
load("Sunrisers Hyderabad-BowlingDetails.RData")
sh_details <- bowlingDetails
load("Gujarat Lions-BowlingDetails.RData")
gl_details <- bowlingDetails
load("Rising Pune Supergiants-BowlingDetails.RData")
rps_details <- bowlingDetails

# Get the bowlers for each team
csk_bowlers <- getBowlers(csk_details)
dc_bowlers <- getBowlers(dc_details)
dd_bowlers <- getBowlers(dd_details)
kxip_bowlers <- getBowlers(kxip_details)
ktk_bowlers <- getBowlers(ktk_details)
kkr_bowlers <- getBowlers(kkr_details)
mi_bowlers <- getBowlers(mi_details)
pw_bowlers <- getBowlers(pw_details)
rr_bowlers <- getBowlers(rr_details)
rcb_bowlers <- getBowlers(rcb_details)
sh_bowlers <- getBowlers(sh_details)
gl_bowlers <- getBowlers(gl_details)
rps_bowlers <- getBowlers(rps_details)

#Save the dataframes
save(csk_bowlers,file="csk1.RData")
save(dc_bowlers, file="dc1.RData")
save(dd_bowlers, file="dd1.RData")
save(kxip_bowlers, file="kxip1.RData")
save(ktk_bowlers, file="ktk1.RData")
save(kkr_bowlers, file="kkr1.RData")
save(mi_bowlers , file="mi1.RData")
save(pw_bowlers, file="pw1.RData")
save(rr_bowlers, file="rr1.RData")
save(rcb_bowlers, file="rcb1.RData")
save(sh_bowlers, file="sh1.RData")
save(gl_bowlers, file="gl1.RData")
save(rps_bowlers, file="rps1.RData")

Now we are all set

A) IPL T20 Match Analysis

1. IPL Match Analysis

Load any match data from the ./IPLMatches folder, e.g. Chennai Super Kings-Deccan Chargers-2008-05-06.RData

setwd("./IPLMatches")
load("Chennai Super Kings-Deccan Chargers-2008-05-06.RData")
csk_dc<- overs
#The steps are
load("IPLTeam1-IPLTeam2-Date.Rdata")
IPLTeam1_IPLTeam2 <- overs

All analysis for this match can be done now

2. Scorecard

teamBattingScorecardMatch(IPLTeam1_IPLTeam2,"IPLTeam1")
teamBattingScorecardMatch(IPLTeam1_IPLTeam2,"IPLTeam2")

3. Batting Partnerships

teamBatsmenPartnershipMatch(IPLTeam1_IPLTeam2,"IPLTeam1","IPLTeam2")
teamBatsmenPartnershipMatch(IPLTeam1_IPLTeam2,"IPLTeam2","IPLTeam1")

4. Batsmen vs Bowler Plot

teamBatsmenVsBowlersMatch(IPLTeam1_IPLTeam2,"IPLTeam1","IPLTeam2",plot=TRUE)
teamBatsmenVsBowlersMatch(IPLTeam1_IPLTeam2,"IPLTeam1","IPLTeam2",plot=FALSE)

5. Team bowling scorecard

teamBowlingScorecardMatch(IPLTeam1_IPLTeam2,"IPLTeam1")
teamBowlingScorecardMatch(IPLTeam1_IPLTeam2,"IPLTeam2")

6. Team bowling Wicket kind match

teamBowlingWicketKindMatch(IPLTeam1_IPLTeam2,"IPLTeam1","IPLTeam2")
m <-teamBowlingWicketKindMatch(IPLTeam1_IPLTeam2,"IPLTeam1","IPLTeam2",plot=FALSE)
m

7. Team Bowling Wicket Runs Match

teamBowlingWicketRunsMatch(IPLTeam1_IPLTeam2,"IPLTeam1","IPLTeam2")
m <-teamBowlingWicketRunsMatch(IPLTeam1_IPLTeam2,"IPLTeam1","IPLTeam2",plot=FALSE)
m

8. Team Bowling Wicket Match

m <-teamBowlingWicketMatch(IPLTeam1_IPLTeam2,"IPLTeam1","IPLTeam2",plot=FALSE)
m
teamBowlingWicketMatch(IPLTeam1_IPLTeam2,"IPLTeam1","IPLTeam2")

9. Team Bowler vs Batsmen

teamBowlersVsBatsmenMatch(IPLTeam1_IPLTeam2,"IPLTeam1","IPLTeam2")
m <- teamBowlersVsBatsmenMatch(IPLTeam1_IPLTeam2,"IPLTeam1","IPLTeam2",plot=FALSE)
m

10. Match Worm chart

matchWormGraph(IPLTeam1_IPLTeam2,"IPLTeam1","IPLTeam2")

B) IPL Matches between 2 IPL teams

1. IPL Match Analysis

Load any match data from the ./IPLMatches folder, e.g. Chennai Super Kings-Deccan Chargers-2008-05-06.RData

setwd("./IPLMatches")
load("Chennai Super Kings-Deccan Chargers-2008-05-06.RData")
csk_dc<- overs
#The steps are
load("IPLTeam1-IPLTeam2-Date.Rdata")
IPLTeam1_IPLTeam2 <- overs

All analysis for this match can be done now

2. Scorecard

teamBattingScorecardMatch(IPLTeam1_IPLTeam2,"IPLTeam1")
teamBattingScorecardMatch(IPLTeam1_IPLTeam2,"IPLTeam2")

3. Batting Partnerships

teamBatsmenPartnershipMatch(IPLTeam1_IPLTeam2,"IPLTeam1","IPLTeam2")
teamBatsmenPartnershipMatch(IPLTeam1_IPLTeam2,"IPLTeam2","IPLTeam1")

4. Batsmen vs Bowler Plot

teamBatsmenVsBowlersMatch(IPLTeam1_IPLTeam2,"IPLTeam1","IPLTeam2",plot=TRUE)
teamBatsmenVsBowlersMatch(IPLTeam1_IPLTeam2,"IPLTeam1","IPLTeam2",plot=FALSE)

5. Team bowling scorecard

teamBowlingScorecardMatch(IPLTeam1_IPLTeam2,"IPLTeam1")
teamBowlingScorecardMatch(IPLTeam1_IPLTeam2,"IPLTeam2")

6. Team bowling Wicket kind match

teamBowlingWicketKindMatch(IPLTeam1_IPLTeam2,"IPLTeam1","IPLTeam2")
m <-teamBowlingWicketKindMatch(IPLTeam1_IPLTeam2,"IPLTeam1","IPLTeam2",plot=FALSE)
m

7. Team Bowling Wicket Runs Match

teamBowlingWicketRunsMatch(IPLTeam1_IPLTeam2,"IPLTeam1","IPLTeam2")
m <-teamBowlingWicketRunsMatch(IPLTeam1_IPLTeam2,"IPLTeam1","IPLTeam2",plot=FALSE)
m

8. Team Bowling Wicket Match

m <-teamBowlingWicketMatch(IPLTeam1_IPLTeam2,"IPLTeam1","IPLTeam2",plot=FALSE)
m
teamBowlingWicketMatch(IPLTeam1_IPLTeam2,"IPLTeam1","IPLTeam2")

9. Team Bowler vs Batsmen

teamBowlersVsBatsmenMatch(IPLTeam1_IPLTeam2,"IPLTeam1","IPLTeam2")
m <- teamBowlersVsBatsmenMatch(IPLTeam1_IPLTeam2,"IPLTeam1","IPLTeam2",plot=FALSE)
m

10. Match Worm chart

matchWormGraph(IPLTeam1_IPLTeam2,"IPLTeam1","IPLTeam2")

C) IPL Matches for a team against all other teams

1. IPL Matches for a team against all other teams

Load the data for an IPL team against all other teams from ./allMatchesAllOpposition, e.g. all matches of Kolkata Knight Riders

load("allMatchesAllOpposition-Kolkata Knight Riders.RData")
kkr_matches <- matches
IPLTeam="IPLTeam1"
allMatches <- paste("allMatchesAllOpposition-",IPLTeam,".RData",sep="")
load(allMatches)
IPLTeam1AllMatches <- matches

2. Team’s batting scorecard all Matches

m <-teamBattingScorecardAllOppnAllMatches(IPLTeam1AllMatches,theTeam="IPLTeam1")
m

3. Batting scorecard of opposing team

m <-teamBattingScorecardAllOppnAllMatches(matches=IPLTeam1AllMatches,theTeam="IPLTeam2")

4. Team batting partnerships

m <- teamBatsmenPartnershipAllOppnAllMatches(IPLTeam1AllMatches,theTeam="IPLTeam1")
m
m <- teamBatsmenPartnershipAllOppnAllMatches(IPLTeam1AllMatches,theTeam='IPLTeam1',report="detailed")
head(m,30)
m <- teamBatsmenPartnershipAllOppnAllMatches(IPLTeam1AllMatches,theTeam='IPLTeam1',report="summary")
m

5. Team batting partnerships plot

teamBatsmenPartnershipAllOppnAllMatchesPlot(IPLTeam1AllMatches,"IPLTeam1",main="IPLTeam1")
teamBatsmenPartnershipAllOppnAllMatchesPlot(IPLTeam1AllMatches,"IPLTeam1",main="IPLTeam2")

6. Team batsmen vs bowlers report

m <-teamBatsmenVsBowlersAllOppnAllMatchesRept(IPLTeam1AllMatches,"IPLTeam1",rank=0)
m
m <-teamBatsmenVsBowlersAllOppnAllMatchesRept(IPLTeam1AllMatches,"IPLTeam1",rank=1,dispRows=30)
m
m <-teamBatsmenVsBowlersAllOppnAllMatchesRept(matches=IPLTeam1AllMatches,theTeam="IPLTeam2",rank=1,dispRows=25)
m

7. Team batsmen vs bowler plot

d <- teamBatsmenVsBowlersAllOppnAllMatchesRept(IPLTeam1AllMatches,"IPLTeam1",rank=1,dispRows=50)
d
teamBatsmenVsBowlersAllOppnAllMatchesPlot(d)
d <- teamBatsmenVsBowlersAllOppnAllMatchesRept(IPLTeam1AllMatches,"IPLTeam1",rank=2,dispRows=50)
teamBatsmenVsBowlersAllOppnAllMatchesPlot(d)

8. Team bowling scorecard

teamBowlingScorecardAllOppnAllMatchesMain(matches=IPLTeam1AllMatches,theTeam="IPLTeam1")
teamBowlingScorecardAllOppnAllMatches(IPLTeam1AllMatches,'IPLTeam2')

9. Team bowler vs batsmen

teamBowlersVsBatsmenAllOppnAllMatchesMain(IPLTeam1AllMatches,theTeam="IPLTeam1",rank=0)
teamBowlersVsBatsmenAllOppnAllMatchesMain(IPLTeam1AllMatches,theTeam="IPLTeam1",rank=2)
teamBowlersVsBatsmenAllOppnAllMatchesRept(matches=IPLTeam1AllMatches,theTeam="IPLTeam1",rank=0)

10. Team bowler vs batsmen plot

df <- teamBowlersVsBatsmenAllOppnAllMatchesRept(IPLTeam1AllMatches,theTeam="IPLTeam1",rank=1)
teamBowlersVsBatsmenAllOppnAllMatchesPlot(df,"IPLTeam1","IPLTeam1")

11. Team bowler wicket kind

teamBowlingWicketKindAllOppnAllMatches(IPLTeam1AllMatches,t1="IPLTeam1",t2="All")
teamBowlingWicketKindAllOppnAllMatches(IPLTeam1AllMatches,t1="IPLTeam1",t2="IPLTeam2")

12. Team bowler wicket runs

teamBowlingWicketRunsAllOppnAllMatches(IPLTeam1AllMatches,t1="IPLTeam1",t2="All",plot=TRUE)
teamBowlingWicketRunsAllOppnAllMatches(IPLTeam1AllMatches,t1="IPLTeam1",t2="IPLTeam2",plot=TRUE)

1. IPL Batsman setup functions

Get the details for a batsman

setwd("../BattingBowlingDetails")
# IPL Team names
IPLTeamNames <- list("Chennai Super Kings","Deccan Chargers", "Delhi Daredevils","Kings XI Punjab",
"Kochi Tuskers Kerala","Kolkata Knight Riders","Mumbai Indians","Pune Warriors",
"Rajasthan Royals","Royal Challengers Bangalore","Sunrisers Hyderabad","Gujarat Lions",
"Rising Pune Supergiants")

# Check and get the team indices of IPL teams in which the batsman has played
getTeamIndex <- function(batsman){
setwd("./BattingBowlingDetails")
load("csk.RData")
load("dc.RData")
load("dd.RData")
load("kxip.RData")
load("ktk.RData")
load("kkr.RData")
load("mi.RData")
load("pw.RData")
load("rr.RData")
load("rcb.RData")
load("sh.RData")
load("gl.RData")
load("rps.RData")
setwd("..")
getwd()
print(ls())
teams_batsmen = list(csk_batsmen,dc_batsmen,dd_batsmen,kxip_batsmen,ktk_batsmen,kkr_batsmen,mi_batsmen,
pw_batsmen,rr_batsmen,rcb_batsmen,sh_batsmen,gl_batsmen,rps_batsmen)
b <- NULL
for (i in 1:length(teams_batsmen)){
a <- which(teams_batsmen[[i]] == batsman)
if(length(a) != 0)
b <- c(b,i)
}
b
}

# Get the list of the IPL team names from the indices passed
getTeams <- function(x){
l <- NULL
# Get the teams passed in as indexes
for (i in seq_along(x)){
l <- c(l, IPLTeamNames[[x[i]]])
}
l
}

# Create a consolidated data frame with all teams the IPL batsman has played for
getIPLBatsmanDF <- function(teamNames){
batsmanDF <- NULL
# Create a consolidated Data frame of batsman for all IPL teams played
for (i in seq_along(teamNames)){
df <- getBatsmanDetails(team=teamNames[i],name=IPLBatsman,dir="./BattingBowlingDetails")
batsmanDF <- rbind(batsmanDF,df)
}
batsmanDF
}

2. Create a consolidated IPL batsman data frame

# Since an IPL batsman could have played in multiple teams we need to determine these teams and
# create a consolidated data frame for the analysis
# For example to check MS Dhoni we need to do the following
IPLBatsman = "MS Dhoni"
#Check and get the team indices of IPL teams in which the batsman has played
i <- getTeamIndex(IPLBatsman)
# Get the team names in which the IPL batsman has played
teamNames <- getTeams(i)
# Check if file exists in the directory. This check is necessary when moving between matchType
############## Create a consolidated IPL batsman dataframe for analysis
batsmanDF <- getIPLBatsmanDF(teamNames)

3. Runs vs deliveries

# For e.g. batsmanName="MS Dhoni"
#batsmanRunsVsDeliveries(batsmanDF, "MS Dhoni")
batsmanRunsVsDeliveries(batsmanDF,"batsmanName")

4. Batsman 4s & 6s

batsman46 <- select(batsmanDF,batsman,ballsPlayed,fours,sixes,runs)
p1 <- batsmanFoursSixes(batsman46,"batsmanName")

5. Batsman dismissals

batsmanDismissals(batsmanDF,"batsmanName")

6. Runs vs Strike rate

batsmanRunsVsStrikeRate(batsmanDF,"batsmanName")

7. Batsman Moving Average

batsmanMovingAverage(batsmanDF,"batsmanName")

8. Batsman cumulative average

batsmanCumulativeAverageRuns(batsmanDF,"batsmanName")

9. Batsman cumulative strike rate

batsmanCumulativeStrikeRate(batsmanDF,"batsmanName")

10. Batsman runs against oppositions

batsmanRunsAgainstOpposition(batsmanDF,"batsmanName")

11. Batsman runs vs venue

batsmanRunsVenue(batsmanDF,"batsmanName")

12. Batsman runs predict

batsmanRunsPredict(batsmanDF,"batsmanName")

13. Bowler set up functions

setwd("../BattingBowlingDetails")
# IPL Team names
IPLTeamNames <- list("Chennai Super Kings","Deccan Chargers", "Delhi Daredevils","Kings XI Punjab",
"Kochi Tuskers Kerala","Kolkata Knight Riders","Mumbai Indians","Pune Warriors",
"Rajasthan Royals","Royal Challengers Bangalore","Sunrisers Hyderabad","Gujarat Lions",
"Rising Pune Supergiants")

# Get the team indices of IPL teams for which the bowler has played
getTeamIndex_bowler <- function(bowler){
# Load IPL Bowlers
setwd("./data")
load("csk1.RData")
load("dc1.RData")
load("dd1.RData")
load("kxip1.RData")
load("ktk1.RData")
load("kkr1.RData")
load("mi1.RData")
load("pw1.RData")
load("rr1.RData")
load("rcb1.RData")
load("sh1.RData")
load("gl1.RData")
load("rps1.RData")
setwd("..")
teams_bowlers = list(csk_bowlers,dc_bowlers,dd_bowlers,kxip_bowlers,ktk_bowlers,kkr_bowlers,mi_bowlers,
pw_bowlers,rr_bowlers,rcb_bowlers,sh_bowlers,gl_bowlers,rps_bowlers)
b <- NULL
for (i in 1:length(teams_bowlers)){
a <- which(teams_bowlers[[i]] == bowler)
if(length(a) != 0){
b <- c(b,i)
}
}
b
}

# Get the list of the IPL team names from the indices passed
getTeams <- function(x){
l <- NULL
# Get the teams passed in as indexes
for (i in seq_along(x)){
l <- c(l, IPLTeamNames[[x[i]]])
}
l
}

# Get the team names
teamNames <- getTeams(i)
getIPLBowlerDF <- function(teamNames){
bowlerDF <- NULL
# Create a consolidated Data frame of the bowler for all IPL teams played
for (i in seq_along(teamNames)){
df <- getBowlerWicketDetails(team=teamNames[i],name=IPLBowler,dir="./BattingBowlingDetails")
bowlerDF <- rbind(bowlerDF,df)
}
bowlerDF
}

14. Get the consolidated data frame for an IPL bowler

# Since an IPL bowler could have played in multiple teams we need to determine these teams and
# create a consolidated data frame for the analysis
# For example to check R Ashwin we need to do the following
IPLBowler = "R Ashwin"
#Check and get the team indices of IPL teams in which the bowler has played
i <- getTeamIndex_bowler(IPLBowler)
# Get the team names in which the IPL bowler has played
teamNames <- getTeams(i)
# Check if file exists in the directory. This check is necessary when moving between matchType
############## Create a consolidated IPL bowler dataframe for analysis
bowlerDF <- getIPLBowlerDF(teamNames)

15. Bowler Mean Economy rate

# For e.g. to get the details of R Ashwin do
#bowlerMeanEconomyRate(bowlerDF,"R Ashwin")
bowlerMeanEconomyRate(bowlerDF,"bowlerName")

16. Bowler mean runs conceded

bowlerMeanRunsConceded(bowlerDF,"bowlerName")

17. Bowler Moving Average

bowlerMovingAverage(bowlerDF,"bowlerName")

18. Bowler cumulative average wickets

bowlerCumulativeAvgWickets(bowlerDF,"bowlerName")

19. Bowler cumulative Economy Rate (ER)

bowlerCumulativeAvgEconRate(bowlerDF,"bowlerName")

20. Bowler wicket plot

bowlerWicketPlot(bowlerDF,"bowlerName")

21. Bowler wicket against opposition

bowlerWicketsAgainstOpposition(bowlerDF,"bowlerName")

22. Bowler wicket at cricket grounds

bowlerWicketsVenue(bowlerDF,"bowlerName")

23. Predict number of deliveries to wickets

setwd("./IPLMatches")
bowlerDF1 <- getDeliveryWickets(team="IPLTeam1",dir=".",name="bowlerName",save=FALSE)
bowlerWktsPredict(bowlerDF1,"bowlerName")

Analysis of International T20 matches with yorkr templates

Introduction

In this post I create yorkr templates for International T20 matches that are available on Cricsheet. With these templates you can convert all T20 data, which is in yaml format, to R dataframes. Further I create data and the necessary templates for analyzing.
All of these templates can be accessed from Github at yorkrT20Template.

The templates are

1. Template for conversion and setup – T20Template.Rmd
2. Any T20 match – T20Matchtemplate.Rmd
3. T20 matches between 2 nations – T20Matches2TeamTemplate.Rmd
4. A T20 nation’s performance against all other T20 nations – T20AllMatchesAllOppnTemplate.Rmd
5. Analysis of T20 batsmen and bowlers of all T20 nations – T20BatsmanBowlerTemplate.Rmd

Besides the templates, the repository also includes the converted data for all T20 matches I downloaded from Cricsheet in Dec 2016. You can recreate the files as more matches are added to the Cricsheet site. This post contains all the steps needed for T20 analysis, as more matches are played around the world and more data is added to Cricsheet. This will also be my reference if I decide to analyze T20 in future!

The 3rd edition of my books (paperback & kindle) Cricket analytics with cricketr & Beaten by sheer pace! Cricket analytics with yorkr is now available on Amazon for $12.99

There will be 5 folders at the root

1. T20data – Match files as yaml from Cricsheet
2. T20Matches – Yaml match files converted to dataframes
3. T20MatchesBetween2Teams – All Matches between any 2 T20 teams
4. allMatchesAllOpposition – A T20 country’s match data against all other teams
5. BattingBowlingDetails – Batting and bowling details of all countries
library(yorkr)
library(dplyr)

The first few steps take care of the data setup. This needs to be done before any of the analysis of T20 batsmen, bowlers, any T20 match, matches between any 2 T20 countries or analysis of a team’s performance against all other countries.

There will be 5 folders at the root

1. T20data
2. T20Matches
3. T20MatchesBetween2Teams
4. allMatchesAllOpposition
5. BattingBowlingDetails
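
The folder layout above can be created up front so that the later setwd() calls do not fail. A minimal sketch in base R, assuming the five folder names listed above; the helper name createT20Folders is hypothetical and not part of yorkr, and the demonstration runs in a temporary directory:

```r
# Hypothetical helper (not part of yorkr): create the five working folders
# under a chosen root if they do not already exist.
createT20Folders <- function(root = ".") {
  folders <- c("T20data", "T20Matches", "T20MatchesBetween2Teams",
               "allMatchesAllOpposition", "BattingBowlingDetails")
  paths <- file.path(root, folders)
  for (p in paths) {
    if (!dir.exists(p)) dir.create(p, recursive = TRUE)
  }
  invisible(paths)
}

# Demonstration in a temporary directory so nothing in the project is touched
root <- file.path(tempdir(), "yorkrT20")
paths <- createT20Folders(root)
print(basename(paths))
```

Calling createT20Folders(".") from the project root would create the same folders in place.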

1.Create directory T20Matches

Some files may give conversion errors. You could try to debug the problem or just remove the offending file from the T20data folder. At most 2-4 files will have conversion problems, and I usually remove them from the files to be converted.

Also take a look at my Inswinger shiny app, which was created after performing the same conversion on the Dec 2016 data.

convertAllYaml2RDataframesT20("T20Data","T20Matches")
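
Finding the handful of files that fail to convert can be partly automated. The sketch below is only an illustration of the idea: it assumes a per-file converter is passed in as convertFn (a hypothetical argument), and it uses a stand-in converter rather than any yorkr function, so the file names and the failure rule are invented.

```r
# Hypothetical helper: run a converter over each file and collect the
# names of the files for which it raised an error.
findFailingFiles <- function(files, convertFn) {
  failed <- character(0)
  for (f in files) {
    ok <- tryCatch({
      convertFn(f)
      TRUE
    }, error = function(e) {
      message("Could not convert ", f, ": ", conditionMessage(e))
      FALSE
    })
    if (!ok) failed <- c(failed, f)
  }
  failed
}

# Stand-in converter used only for this demonstration
fakeConvert <- function(f) {
  if (grepl("bad", f)) stop("malformed yaml")
  invisible(NULL)
}

files <- c("1001.yaml", "bad-1002.yaml", "1003.yaml")
failed <- findFailingFiles(files, fakeConvert)
print(failed)
```

The returned names are the candidates to debug or to move out of the source folder before running the full conversion.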

2.Save all matches between all combinations of T20 nations

This function will create the set of all matches between each T20 country and every other T20 country. This uses the data that was created in T20Matches, with the convertAllYaml2RDataframesT20() function.

setwd("./T20MatchesBetween2Teams")
saveAllMatchesBetweenTeams("../T20Matches")

3.Save all matches against all opposition

This will create a consolidated dataframe of all matches played by every T20 playing nation against all other nations. This also uses the data that was created in T20Matches, with the convertAllYaml2RDataframesT20() function.

setwd("../allMatchesAllOpposition")
saveAllMatchesAllOpposition("../T20Matches")

4. Create batting and bowling details for each T20 country

These are the current T20 playing nations. You can add to this vector as more countries start playing T20. You can also find all T20 nations by looking at the directory created above, namely allMatchesAllOpposition. This also uses the data that was created in T20Matches, with the convertAllYaml2RDataframesT20() function.

setwd("../BattingBowlingDetails")
teams <-c("Australia","India","Pakistan","West Indies", 'Sri Lanka',
"Bermuda","Kenya","Hong Kong","Nepal","Oman","Papua New Guinea",
"United Arab Emirates")

for(i in seq_along(teams)){
print(teams[i])
val <- paste(teams[i],"-details",sep="")
val <- getTeamBattingDetails(teams[i],dir="../T20Matches", save=TRUE)

}

for(i in seq_along(teams)){
print(teams[i])
val <- paste(teams[i],"-details",sep="")
val <- getTeamBowlingDetails(teams[i],dir="../T20Matches", save=TRUE)

}
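
Rather than typing the team vector by hand, the team names can be read back from the allMatchesAllOpposition directory. A minimal sketch, assuming the files there follow the "allMatchesAllOpposition-<team>.RData" naming used later in this post; the helper name teamsFromDirectory is hypothetical, and the demonstration uses empty placeholder files in a temporary directory:

```r
# Hypothetical helper: recover team names from files named
# "allMatchesAllOpposition-<team>.RData" in a directory.
teamsFromDirectory <- function(dir) {
  files <- list.files(dir, pattern = "^allMatchesAllOpposition-.*\\.RData$")
  teams <- sub("^allMatchesAllOpposition-", "", files)
  teams <- sub("\\.RData$", "", teams)
  sort(teams)
}

# Demonstration with empty placeholder files in a temporary directory
demoDir <- file.path(tempdir(), "allMatchesAllOpposition")
dir.create(demoDir, showWarnings = FALSE)
for (team in c("India", "Australia", "Sri Lanka")) {
  fileName <- paste("allMatchesAllOpposition-", team, ".RData", sep = "")
  file.create(file.path(demoDir, fileName))
}

teams <- teamsFromDirectory(demoDir)
print(teams)
```

The resulting character vector could then be used in place of the hand-written teams vector in the loops above.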

5. Get the list of batsmen for a particular country

For example, if you wanted to get the batsmen of Canada you would do the following. By replacing Canada with any other country you can get the batsmen of that country. These batsmen names can then be used in the batsmen analysis.

country="Canada"
teamData <- paste(country,"-BattingDetails.RData",sep="")
load(teamData)
countryDF <- battingDetails
bmen <- countryDF %>% distinct(batsman)
bmen <- as.character(bmen$batsman)
batsmen <- sort(bmen)
batsmen

6. Get the list of bowlers for a particular country

The method below can get the list of bowler names for any T20 nation. These names can then be used in the bowler analysis below

country="Netherlands"
teamData <- paste(country,"-BowlingDetails.RData",sep="")
load(teamData)
countryDF <- bowlingDetails
bwlr <- countryDF %>% distinct(bowler)
bwlr <- as.character(bwlr$bowler)
bowler <- sort(bwlr)
bowler
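
The name lists in steps 5 and 6 are simply the distinct values of one column, sorted. A minimal sketch of the same idea in base R, using an invented toy data frame in place of a loaded battingDetails (the rows are made up for illustration):

```r
# Toy stand-in for a loaded battingDetails data frame (invented rows)
battingDetails <- data.frame(
  batsman = c("V Kohli", "RG Sharma", "V Kohli", "S Dhawan"),
  runs = c(45, 12, 78, 30),
  stringsAsFactors = FALSE
)

# Base-R equivalent of distinct() followed by sort()
batsmen <- sort(unique(as.character(battingDetails$batsman)))
print(batsmen)
```

Each name appears once even if the player has many innings in the data.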

A)  International T20 Match Analysis

Load any match data from the ./T20Matches folder, e.g. Afghanistan-England-2016-03-23.RData

setwd("./T20Matches")
load("Afghanistan-England-2016-03-23.RData")
afg_eng<- overs
#The steps are
load("Country1-Country2-Date.Rdata")
country1_country2 <- overs

All analysis for this match can be done now
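
Every match file restores an object with the same name, overs, so loading a second match overwrites the first unless it is copied to another name. One way around this is to load each file into its own environment. This is a sketch only; the helper name loadObject is hypothetical, and the saved file is an invented stand-in for a real match file:

```r
# Hypothetical helper: load an .RData file into a private environment and
# return one named object from it, leaving the workspace untouched.
loadObject <- function(file, name) {
  env <- new.env()
  load(file, envir = env)
  get(name, envir = env)
}

# Demonstration with an invented stand-in for a saved match file
overs <- data.frame(ball = c("1st.0.1", "1st.0.2"), runs = c(1, 4))
matchFile <- file.path(tempdir(), "Country1-Country2-Date.RData")
save(overs, file = matchFile)
rm(overs)

country1_country2 <- loadObject(matchFile, "overs")
print(nrow(country1_country2))
```

The same helper would work for files that restore matches, battingDetails or bowlingDetails by changing the name argument.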

2. Scorecard

teamBattingScorecardMatch(country1_country2,"Country1")
teamBattingScorecardMatch(country1_country2,"Country2")

3.Batting Partnerships

teamBatsmenPartnershipMatch(country1_country2,"Country1","Country2")
teamBatsmenPartnershipMatch(country1_country2,"Country2","Country1")

4. Batsmen vs Bowler Plot

teamBatsmenVsBowlersMatch(country1_country2,"Country1","Country2",plot=TRUE)
teamBatsmenVsBowlersMatch(country1_country2,"Country1","Country2",plot=FALSE)

5. Team bowling scorecard

teamBowlingScorecardMatch(country1_country2,"Country1")
teamBowlingScorecardMatch(country1_country2,"Country2")

6. Team bowling Wicket kind match

teamBowlingWicketKindMatch(country1_country2,"Country1","Country2")
m <-teamBowlingWicketKindMatch(country1_country2,"Country1","Country2",plot=FALSE)
m

7. Team Bowling Wicket Runs Match

teamBowlingWicketRunsMatch(country1_country2,"Country1","Country2")
m <-teamBowlingWicketRunsMatch(country1_country2,"Country1","Country2",plot=FALSE)
m

8. Team Bowling Wicket Match

m <-teamBowlingWicketMatch(country1_country2,"Country1","Country2",plot=FALSE)
m
teamBowlingWicketMatch(country1_country2,"Country1","Country2")

9. Team Bowler vs Batsmen

teamBowlersVsBatsmenMatch(country1_country2,"Country1","Country2")
m <- teamBowlersVsBatsmenMatch(country1_country2,"Country1","Country2",plot=FALSE)
m

10. Match Worm chart

matchWormGraph(country1_country2,"Country1","Country2")



B)  International T20 Matches between 2 teams

Load match data between any 2 teams from ./T20MatchesBetween2Teams, e.g. Australia-India-allMatches

setwd("./T20MatchesBetween2Teams")
load("Australia-India-allMatches.RData")
aus_ind_matches <- matches
#Replace below with your own countries
country1<-"England"
country2 <- "South Africa"
country1VsCountry2 <- paste(country1,"-",country2,"-allMatches.RData",sep="")
load(country1VsCountry2)
country1_country2_matches <- matches
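
The file name is built from the two country names, so the order in which they are given matters. A small sketch that tries both orders, assuming (this is an assumption, not something stated above) that a pair may have been saved with the names in either order; the helper name findMatchesFile is hypothetical, and the demonstration uses an invented file in a temporary directory:

```r
# Hypothetical helper: look for "<team1>-<team2>-allMatches.RData" in either order
findMatchesFile <- function(team1, team2, dir = ".") {
  candidates <- c(
    paste(team1, "-", team2, "-allMatches.RData", sep = ""),
    paste(team2, "-", team1, "-allMatches.RData", sep = "")
  )
  paths <- file.path(dir, candidates)
  found <- paths[file.exists(paths)]
  if (length(found) == 0) {
    stop("No matches file found for ", team1, " and ", team2)
  }
  found[1]
}

# Demonstration with an invented file in a temporary directory
demoDir <- file.path(tempdir(), "T20MatchesBetween2Teams")
dir.create(demoDir, showWarnings = FALSE)
matches <- data.frame(runs = c(1, 2, 3))
save(matches, file = file.path(demoDir, "England-South Africa-allMatches.RData"))

f <- findMatchesFile("South Africa", "England", dir = demoDir)
print(basename(f))
```

The returned path can be passed straight to load().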


2.Batsmen partnerships

m<- teamBatsmenPartnershiOppnAllMatches(country1_country2_matches,"country1",report="summary")
m
m<- teamBatsmenPartnershiOppnAllMatches(country1_country2_matches,"country2",report="summary")
m
m<- teamBatsmenPartnershiOppnAllMatches(country1_country2_matches,"country1",report="detailed")
m
teamBatsmenPartnershipOppnAllMatchesChart(country1_country2_matches,"country1","country2")

3. Team batsmen vs bowlers

teamBatsmenVsBowlersOppnAllMatches(country1_country2_matches,"country1","country2")

4. Batting scorecard

a <-teamBattingScorecardOppnAllMatches(country1_country2_matches,main="country1",opposition="country2")
a

5. Team bowling performance

teamBowlingPerfOppnAllMatches(country1_country2_matches,main="country1",opposition="country2")

6. Team bowler wickets

teamBowlersWicketsOppnAllMatches(country1_country2_matches,main="country1",opposition="country2")
m <-teamBowlersWicketsOppnAllMatches(country1_country2_matches,main="country1",opposition="country2",plot=FALSE)
teamBowlersWicketsOppnAllMatches(country1_country2_matches,"country1","country2",top=3)
m

7. Team bowler vs batsmen

teamBowlersVsBatsmenOppnAllMatches(country1_country2_matches,"country1","country2",top=5)

8. Team bowler wicket kind

teamBowlersWicketKindOppnAllMatches(country1_country2_matches,"country1","country2",plot=TRUE)
m <- teamBowlersWicketKindOppnAllMatches(country1_country2_matches,"country1","country2",plot=FALSE)
head(m,30)

9. Team bowler wicket runs

teamBowlersWicketRunsOppnAllMatches(country1_country2_matches,"country1","country2")

10. Plot wins and losses

setwd("./T20Matches")
plotWinLossBetweenTeams("country1","country2")

C)  International T20 Matches for a team against all other teams

Load the data for a T20 team against all other countries from ./allMatchesAllOpposition, e.g. all matches of India

load("allMatchesAllOpposition-India.RData")
india_matches <- matches
country="country1"
allMatches <- paste("allMatchesAllOpposition-",country,".RData",sep="")
load(allMatches)
country1AllMatches <- matches


2. Team’s batting scorecard all Matches

m <-teamBattingScorecardAllOppnAllMatches(country1AllMatches,theTeam="country1")
m

3. Batting scorecard of opposing team

m <-teamBattingScorecardAllOppnAllMatches(matches=country1AllMatches,theTeam="country2")

4. Team batting partnerships

m <- teamBatsmenPartnershipAllOppnAllMatches(country1AllMatches,theTeam="country1")
m
m <- teamBatsmenPartnershipAllOppnAllMatches(country1AllMatches,theTeam='country1',report="detailed")
m <- teamBatsmenPartnershipAllOppnAllMatches(country1AllMatches,theTeam='country1',report="summary")
m

5. Team batting partnerships plot

teamBatsmenPartnershipAllOppnAllMatchesPlot(country1AllMatches,"country1",main="country1")
teamBatsmenPartnershipAllOppnAllMatchesPlot(country1AllMatches,"country1",main="country2")

6. Team batsmen vs bowlers report

m <-teamBatsmenVsBowlersAllOppnAllMatchesRept(country1AllMatches,"country1",rank=0)
m
m <-teamBatsmenVsBowlersAllOppnAllMatchesRept(country1AllMatches,"country1",rank=1,dispRows=30)
m
m <-teamBatsmenVsBowlersAllOppnAllMatchesRept(matches=country1AllMatches,theTeam="country2",rank=1,dispRows=25)
m

7. Team batsmen vs bowler plot

d <- teamBatsmenVsBowlersAllOppnAllMatchesRept(country1AllMatches,"country1",rank=1,dispRows=50)
d
teamBatsmenVsBowlersAllOppnAllMatchesPlot(d)
d <- teamBatsmenVsBowlersAllOppnAllMatchesRept(country1AllMatches,"country1",rank=2,dispRows=50)
teamBatsmenVsBowlersAllOppnAllMatchesPlot(d)

8. Team bowling scorecard

teamBowlingScorecardAllOppnAllMatchesMain(matches=country1AllMatches,theTeam="country1")
teamBowlingScorecardAllOppnAllMatches(country1AllMatches,'country2')

9. Team bowler vs batsmen

teamBowlersVsBatsmenAllOppnAllMatchesMain(country1AllMatches,theTeam="country1",rank=0)
teamBowlersVsBatsmenAllOppnAllMatchesMain(country1AllMatches,theTeam="country1",rank=2)
teamBowlersVsBatsmenAllOppnAllMatchesRept(matches=country1AllMatches,theTeam="country1",rank=0)

10. Team bowler vs batsmen plot

df <- teamBowlersVsBatsmenAllOppnAllMatchesRept(country1AllMatches,theTeam="country1",rank=1)
teamBowlersVsBatsmenAllOppnAllMatchesPlot(df,"country1","country1")

11. Team bowler wicket kind

teamBowlingWicketKindAllOppnAllMatches(country1AllMatches,t1="country1",t2="All")
teamBowlingWicketKindAllOppnAllMatches(country1AllMatches,t1="country1",t2="country2")


12. Team bowler wicket runs

teamBowlingWicketRunsAllOppnAllMatches(country1AllMatches,t1="country1",t2="All",plot=TRUE)
teamBowlingWicketRunsAllOppnAllMatches(country1AllMatches,t1="country1",t2="country2",plot=TRUE)

D) Batsman functions

Get the details for a batsman

setwd("../BattingBowlingDetails")
kohli <- getBatsmanDetails(team="India",name="Kohli",dir=".")
batsmanDF <- getBatsmanDetails(team="country1",name="batsmanName",dir=".")
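
Before asking for a player’s details it can help to confirm that the team’s details file from step 4 is actually present. A minimal sketch, assuming the "<team>-BattingDetails.RData" and "<team>-BowlingDetails.RData" naming seen earlier in this post; the helper name hasDetailsFile is hypothetical and the demonstration uses an empty placeholder file:

```r
# Hypothetical pre-check: does the details file for a team exist in dir?
hasDetailsFile <- function(team, dir = ".", kind = c("Batting", "Bowling")) {
  kind <- match.arg(kind)
  fileName <- paste(team, "-", kind, "Details.RData", sep = "")
  file.exists(file.path(dir, fileName))
}

# Demonstration with an empty placeholder file in a temporary directory
demoDir <- file.path(tempdir(), "BattingBowlingDetails")
dir.create(demoDir, showWarnings = FALSE)
file.create(file.path(demoDir, "India-BattingDetails.RData"))

print(hasDetailsFile("India", demoDir))
print(hasDetailsFile("Canada", demoDir))
```

If the check returns FALSE, rerun step 4 for that team before continuing.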

2. Runs vs deliveries

batsmanRunsVsDeliveries(batsmanDF,"batsmanName")

3. Batsman 4s & 6s

batsman46 <- select(batsmanDF,batsman,ballsPlayed,fours,sixes,runs)
p1 <- batsmanFoursSixes(batsman46,"batsmanName")

4. Batsman dismissals

batsmanDismissals(batsmanDF,"batsmanName")

5. Runs vs Strike rate

batsmanRunsVsStrikeRate(batsmanDF,"batsmanName")

6. Batsman Moving Average

batsmanMovingAverage(batsmanDF,"batsmanName")

7. Batsman cumulative average

batsmanCumulativeAverageRuns(batsmanDF,"batsmanName")

8. Batsman cumulative strike rate

batsmanCumulativeStrikeRate(batsmanDF,"batsmanName")

9. Batsman runs against oppositions

batsmanRunsAgainstOpposition(batsmanDF,"batsmanName")

10. Batsman runs vs venue

batsmanRunsVenue(batsmanDF,"batsmanName")

11. Batsman runs predict

batsmanRunsPredict(batsmanDF,"batsmanName")

12. Bowler functions

For example, to get Ravichandran Ashwin’s bowling details

setwd("../BattingBowlingDetails")
ashwin <- getBowlerWicketDetails(team="India",name="Ashwin",dir=".")
bowlerDF <- getBowlerWicketDetails(team="country1",name="bowlerName",dir=".")

13. Bowler Mean Economy rate

bowlerMeanEconomyRate(bowlerDF,"bowlerName")

14. Bowler mean runs conceded

bowlerMeanRunsConceded(bowlerDF,"bowlerName")

15. Bowler Moving Average

bowlerMovingAverage(bowlerDF,"bowlerName")

16. Bowler cumulative average wickets

bowlerCumulativeAvgWickets(bowlerDF,"bowlerName")

17. Bowler cumulative Economy Rate (ER)

bowlerCumulativeAvgEconRate(bowlerDF,"bowlerName")

18. Bowler wicket plot

bowlerWicketPlot(bowlerDF,"bowlerName")

19. Bowler wicket against opposition

bowlerWicketsAgainstOpposition(bowlerDF,"bowlerName")

20. Bowler wicket at cricket grounds

bowlerWicketsVenue(bowlerDF,"bowlerName")

21. Predict number of deliveries to wickets

setwd("./T20Matches")
bowlerDF1 <- getDeliveryWickets(team="country1",dir=".",name="bowlerName",save=FALSE)
bowlerWktsPredict(bowlerDF1,"bowlerName")