# Practical Machine Learning with R and Python – Part 2

In this 2nd part of the series “Practical Machine Learning with R and Python – Part 2”, I continue where I left off in my first post Practical Machine Learning with R and Python – Part 2. In this post I cover the some classification algorithmns and cross validation. Specifically I touch
-Logistic Regression
-K Nearest Neighbors (KNN) classification
-Leave out one Cross Validation (LOOCV)
-K Fold Cross Validation
in both R and Python.

As in my initial post the algorithms are based on the following courses.

You can download this R Markdown file along with the data from Github. I hope these posts can be used as a quick reference in R and Python and Machine Learning.I have tried to include the coolest part of either course in this post.

The following classification problem is based on Logistic Regression. The data is an included data set in Scikit-Learn, which I have saved as csv and use it also for R. The fit of a classification Machine Learning Model depends on how correctly classifies the data. There are several measures of testing a model’s classification performance. They are

Accuracy = TP + TN / (TP + TN + FP + FN) – Fraction of all classes correctly classified
Precision = TP / (TP + FP) – Fraction of correctly classified positives among those classified as positive
Recall = TP / (TP + FN) Also known as sensitivity, or True Positive Rate (True positive) – Fraction of correctly classified as positive among all positives in the data
F1 = 2 * Precision * Recall / (Precision + Recall)

## 1a. Logistic Regression – R code

The caret and e1071 package is required for using the confusionMatrix call

source("RFunctions.R")
library(dplyr)
library(caret)
library(e1071)
# Read the data (from sklearn)
# Rename the target variable
names(cancer) <- c(seq(1,30),"output")
# Split as training and test sets
train_idx <- trainTestSplit(cancer,trainPercent=75,seed=5)
train <- cancer[train_idx, ]
test <- cancer[-train_idx, ]

# Fit a generalized linear logistic model,
fit=glm(output~.,family=binomial,data=train,control = list(maxit = 50))
# Predict the output from the model
a=predict(fit,newdata=train,type="response")
# Set response >0.5 as 1 and <=0.5 as 0
b=ifelse(a>0.5,1,0)
# Compute the confusion matrix for training data
confusionMatrix(b,train$output) ## Confusion Matrix and Statistics ## ## Reference ## Prediction 0 1 ## 0 154 0 ## 1 0 272 ## ## Accuracy : 1 ## 95% CI : (0.9914, 1) ## No Information Rate : 0.6385 ## P-Value [Acc > NIR] : < 2.2e-16 ## ## Kappa : 1 ## Mcnemar's Test P-Value : NA ## ## Sensitivity : 1.0000 ## Specificity : 1.0000 ## Pos Pred Value : 1.0000 ## Neg Pred Value : 1.0000 ## Prevalence : 0.3615 ## Detection Rate : 0.3615 ## Detection Prevalence : 0.3615 ## Balanced Accuracy : 1.0000 ## ## 'Positive' Class : 0 ##  m=predict(fit,newdata=test,type="response") n=ifelse(m>0.5,1,0) # Compute the confusion matrix for test output confusionMatrix(n,test$output)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction  0  1
##          0 52  4
##          1  5 81
##
##                Accuracy : 0.9366
##                  95% CI : (0.8831, 0.9706)
##     No Information Rate : 0.5986
##     P-Value [Acc > NIR] : <2e-16
##
##                   Kappa : 0.8677
##  Mcnemar's Test P-Value : 1
##
##             Sensitivity : 0.9123
##             Specificity : 0.9529
##          Pos Pred Value : 0.9286
##          Neg Pred Value : 0.9419
##              Prevalence : 0.4014
##          Detection Rate : 0.3662
##    Detection Prevalence : 0.3944
##       Balanced Accuracy : 0.9326
##
##        'Positive' Class : 0
## 

## 1b. Logistic Regression – Python code

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
os.chdir("C:\\Users\\Ganesh\\RandPython")
from sklearn.datasets import make_classification, make_blobs

from sklearn.metrics import confusion_matrix
from matplotlib.colors import ListedColormap
from sklearn.datasets import load_breast_cancer
# Load the cancer data
(X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)
X_train, X_test, y_train, y_test = train_test_split(X_cancer, y_cancer,
random_state = 0)
# Call the Logisitic Regression function
clf = LogisticRegression().fit(X_train, y_train)
fig, subaxes = plt.subplots(1, 1, figsize=(7, 5))
# Fit a model
clf = LogisticRegression().fit(X_train, y_train)

# Compute and print the Accuray scores
print('Accuracy of Logistic regression classifier on training set: {:.2f}'
.format(clf.score(X_train, y_train)))
print('Accuracy of Logistic regression classifier on test set: {:.2f}'
.format(clf.score(X_test, y_test)))
y_predicted=clf.predict(X_test)
# Compute and print confusion matrix
confusion = confusion_matrix(y_test, y_predicted)
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
print('Accuracy: {:.2f}'.format(accuracy_score(y_test, y_predicted)))
print('Precision: {:.2f}'.format(precision_score(y_test, y_predicted)))
print('Recall: {:.2f}'.format(recall_score(y_test, y_predicted)))
print('F1: {:.2f}'.format(f1_score(y_test, y_predicted)))
## Accuracy of Logistic regression classifier on training set: 0.96
## Accuracy of Logistic regression classifier on test set: 0.96
## Accuracy: 0.96
## Precision: 0.99
## Recall: 0.94
## F1: 0.97

## 2. Dummy variables

The following R and Python code show how dummy variables are handled in R and Python. Dummy variables are categorival variables which have to be converted into appropriate values before using them in Machine Learning Model For e.g. if we had currency as ‘dollar’, ‘rupee’ and ‘yen’ then the dummy variable will convert this as
dollar 0 0 0
rupee 0 0 1
yen 0 1 0

## 2a. Logistic Regression with dummy variables- R code

# Load the dummies library
library(dummies) 
df <- read.csv("adult1.csv",stringsAsFactors = FALSE,na.strings = c(""," "," ?"))

# Remove rows which have NA
df1 <- df[complete.cases(df),]
dim(df1)
## [1] 30161    16
# Select specific columns
adult <- df1 %>% dplyr::select(age,occupation,education,educationNum,capitalGain,
capital.loss,hours.per.week,native.country,salary)
# Set the dummy data with appropriate values

#Split as training and test
train <- adult1[train_idx, ]
test <- adult1[-train_idx, ]

# Fit a binomial logistic regression
fit=glm(salary~.,family=binomial,data=train)
# Predict response
a=predict(fit,newdata=train,type="response")
# If response >0.5 then it is a 1 and 0 otherwise
b=ifelse(a>0.5,1,0)
confusionMatrix(b,train$salary) ## Confusion Matrix and Statistics ## ## Reference ## Prediction 0 1 ## 0 16065 3145 ## 1 968 2442 ## ## Accuracy : 0.8182 ## 95% CI : (0.8131, 0.8232) ## No Information Rate : 0.753 ## P-Value [Acc > NIR] : < 2.2e-16 ## ## Kappa : 0.4375 ## Mcnemar's Test P-Value : < 2.2e-16 ## ## Sensitivity : 0.9432 ## Specificity : 0.4371 ## Pos Pred Value : 0.8363 ## Neg Pred Value : 0.7161 ## Prevalence : 0.7530 ## Detection Rate : 0.7102 ## Detection Prevalence : 0.8492 ## Balanced Accuracy : 0.6901 ## ## 'Positive' Class : 0 ##  # Compute and display confusion matrix m=predict(fit,newdata=test,type="response") ## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = ## ifelse(type == : prediction from a rank-deficient fit may be misleading n=ifelse(m>0.5,1,0) confusionMatrix(n,test$salary)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction    0    1
##          0 5263 1099
##          1  357  822
##
##                Accuracy : 0.8069
##                  95% CI : (0.7978, 0.8158)
##     No Information Rate : 0.7453
##     P-Value [Acc > NIR] : < 2.2e-16
##
##                   Kappa : 0.4174
##  Mcnemar's Test P-Value : < 2.2e-16
##
##             Sensitivity : 0.9365
##             Specificity : 0.4279
##          Pos Pred Value : 0.8273
##          Neg Pred Value : 0.6972
##              Prevalence : 0.7453
##          Detection Rate : 0.6979
##    Detection Prevalence : 0.8437
##       Balanced Accuracy : 0.6822
##
##        'Positive' Class : 0
## 

## 2b. Logistic Regression with dummy variables- Python code

Pandas has a get_dummies function for handling dummies

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
# Drop rows with NA
df1=df.dropna()
print(df1.shape)
# Select specific columns
'hours-per-week','native-country','salary']]

'hours-per-week','native-country']]
# Set approporiate values for dummy variables

random_state = 0)
clf = LogisticRegression().fit(X_adult_train, y_train)

# Compute and display Accuracy and Confusion matrix
print('Accuracy of Logistic regression classifier on training set: {:.2f}'
print('Accuracy of Logistic regression classifier on test set: {:.2f}'
confusion = confusion_matrix(y_test, y_predicted)
print('Accuracy: {:.2f}'.format(accuracy_score(y_test, y_predicted)))
print('Precision: {:.2f}'.format(precision_score(y_test, y_predicted)))
print('Recall: {:.2f}'.format(recall_score(y_test, y_predicted)))
print('F1: {:.2f}'.format(f1_score(y_test, y_predicted)))
## (30161, 16)
## Accuracy of Logistic regression classifier on training set: 0.82
## Accuracy of Logistic regression classifier on test set: 0.81
## Accuracy: 0.81
## Precision: 0.68
## Recall: 0.41
## F1: 0.51

## 3a – K Nearest Neighbors Classification – R code

The Adult data set is taken from UCI Machine Learning Repository

source("RFunctions.R")
df <- read.csv("adult1.csv",stringsAsFactors = FALSE,na.strings = c(""," "," ?"))
# Remove rows which have NA
df1 <- df[complete.cases(df),]
dim(df1)
## [1] 30161    16
# Select specific columns
adult <- df1 %>% dplyr::select(age,occupation,education,educationNum,capitalGain,
capital.loss,hours.per.week,native.country,salary)
# Set dummy variables

#Split train and test as required by KNN classsification model
train <- adult1[train_idx, ]
test <- adult1[-train_idx, ]
train.X <- train[,1:76]
train.y <- train[,77]
test.X <- test[,1:76]
test.y <- test[,77]

# Fit a model for 1,3,5,10 and 15 neighbors
cMat <- NULL
neighbors <-c(1,3,5,10,15)
for(i in seq_along(neighbors)){
fit =knn(train.X,test.X,train.y,k=i)
table(fit,test.y)
a<-confusionMatrix(fit,test.y)
cMat[i] <- a$overall[1] print(a$overall[1])
}
##  Accuracy
## 0.7835831
##  Accuracy
## 0.8162047
##  Accuracy
## 0.8089113
##  Accuracy
## 0.8209787
##  Accuracy
## 0.8184591
#Plot the Accuracy for each of the KNN models
df <- data.frame(neighbors,Accuracy=cMat)
ggplot(df,aes(x=neighbors,y=Accuracy)) + geom_point() +geom_line(color="blue") +
xlab("Number of neighbors") + ylab("Accuracy") +
ggtitle("KNN regression - Accuracy vs Number of Neighors (Unnormalized)")

## 3b – K Nearest Neighbors Classification – Python code

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

df1=df.dropna()
print(df1.shape)
# Select specific columns
'hours-per-week','native-country','salary']]

'hours-per-week','native-country']]

#Set values for dummy variables

random_state = 0)

# KNN classification in Python requires the data to be scaled.
# Scale the data
scaler = MinMaxScaler()
# Apply scaling to test set also
# Compute the KNN model for 1,3,5,10 & 15 neighbors
accuracy=[]
neighbors=[1,3,5,10,15]
for i in neighbors:
knn = KNeighborsClassifier(n_neighbors = i)
knn.fit(X_train_scaled, y_train)
accuracy.append(knn.score(X_test_scaled, y_test))
print('Accuracy test score: {:.3f}'
.format(knn.score(X_test_scaled, y_test)))

# Plot the models with the Accuracy attained for each of these models
fig1=plt.plot(neighbors,accuracy)
fig1=plt.title("KNN regression - Accuracy vs Number of neighbors")
fig1=plt.xlabel("Neighbors")
fig1=plt.ylabel("Accuracy")
fig1.figure.savefig('foo1.png', bbox_inches='tight')
## (30161, 16)
## Accuracy test score: 0.749
## Accuracy test score: 0.779
## Accuracy test score: 0.793
## Accuracy test score: 0.804
## Accuracy test score: 0.803

Output image:

## 4 MPG vs Horsepower

The following scatter plot shows the non-linear relation between mpg and horsepower. This will be used as the data input for computing K Fold Cross Validation Error

## 4a MPG vs Horsepower scatter plot – R Code

df=read.csv("auto_mpg.csv",stringsAsFactors = FALSE) # Data from UCI
df1 <- as.data.frame(sapply(df,as.numeric))
df2 <- df1 %>% dplyr::select(cylinder,displacement, horsepower,weight, acceleration, year,mpg)
df3 <- df2[complete.cases(df2),]
ggplot(df3,aes(x=horsepower,y=mpg)) + geom_point() + xlab("Horsepower") +
ylab("Miles Per gallon") + ggtitle("Miles per Gallon vs Hosrsepower")

## 4b MPG vs Horsepower scatter plot – Python Code

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
autoDF.shape
autoDF.columns
autoDF1=autoDF[['mpg','cylinder','displacement','horsepower','weight','acceleration','year']]
autoDF2 = autoDF1.apply(pd.to_numeric, errors='coerce')
autoDF3=autoDF2.dropna()
autoDF3.shape
#X=autoDF3[['cylinder','displacement','horsepower','weight']]
X=autoDF3[['horsepower']]
y=autoDF3['mpg']

fig11=plt.scatter(X,y)
fig11=plt.title("KNN regression - Accuracy vs Number of neighbors")
fig11=plt.xlabel("Neighbors")
fig11=plt.ylabel("Accuracy")
fig11.figure.savefig('foo11.png', bbox_inches='tight')


## 5 K Fold Cross Validation

K Fold Cross Validation is a technique in which the data set is divided into K Folds or K partitions. The Machine Learning model is trained on K-1 folds and tested on the Kth fold i.e.
we will have K-1 folds for training data and 1 for testing the ML model. Since we can partition this as $C_{1}^{K}$ or K choose 1, there will be K such partitions. The K Fold Cross
Validation estimates the average validation error that we can expect on a new unseen test data.

The formula for K Fold Cross validation is as follows

$MSE_{K} = \frac{\sum (y-yhat)^{2}}{n_{K}}$
and
$n_{K} = \frac{N}{K}$
and
$CV_{K} = \sum_{K=1}^{K} (\frac{n_{K}}{N}) MSE_{K}$

where $n_{K}$ is the number of elements in partition ‘K’ and N is the total number of elements
$CV_{K} =\sum_{K=1}^{K} MSE_{K}$

$CV_{K} =\frac{\sum_{K=1}^{K} MSE_{K}}{K}$
Leave Out one Cross Validation (LOOCV) is a special case of K Fold Cross Validation where N-1 data points are used to train the model and 1 data point is used to test the model. There are N such paritions of N-1 & 1 that are possible. The mean error is measured The Cross Valifation Error for LOOCV is

$CV_{N} = \frac{1}{n} *\frac{\sum_{1}^{n}(y-yhat)^{2}}{1-h_{i}}$
where $h_{i}$ is the diagonal hat matrix

see [Statistical Learning]

The above formula is also included in this blog post

It took me a day and a half to implement the K Fold Cross Validation formula. I think it is correct. In any case do let me know if you think it is off

## 5a. Leave out one cross validation (LOOCV) – R Code

R uses the package ‘boot’ for performing Cross Validation error computation

library(boot)
library(reshape2)
df=read.csv("auto_mpg.csv",stringsAsFactors = FALSE) # Data from UCI
df1 <- as.data.frame(sapply(df,as.numeric))
# Select complete cases
df2 <- df1 %>% dplyr::select(cylinder,displacement, horsepower,weight, acceleration, year,mpg)
df3 <- df2[complete.cases(df2),]
set.seed(17)
cv.error=rep(0,10)
# For polynomials 1,2,3... 10 fit a LOOCV model
for (i in 1:10){
glm.fit=glm(mpg~poly(horsepower,i),data=df3)
cv.error[i]=cv.glm(df3,glm.fit)$delta[1] } cv.error ## [1] 24.23151 19.24821 19.33498 19.42443 19.03321 18.97864 18.83305 ## [8] 18.96115 19.06863 19.49093 # Create and display a plot folds <- seq(1,10) df <- data.frame(folds,cvError=cv.error) ggplot(df,aes(x=folds,y=cvError)) + geom_point() +geom_line(color="blue") + xlab("Degree of Polynomial") + ylab("Cross Validation Error") + ggtitle("Leave one out Cross Validation - Cross Validation Error vs Degree of Polynomial") ## 5b. Leave out one cross validation (LOOCV) – Python Code In Python there is no available function to compute Cross Validation error and we have to compute the above formula. I have done this after several hours. I think it is now in reasonable shape. Do let me know if you think otherwise. For LOOCV I use the K Fold Cross Validation with K=N import numpy as np import pandas as pd import os import matplotlib.pyplot as plt from sklearn.linear_model import LinearRegression from sklearn.cross_validation import train_test_split, KFold from sklearn.preprocessing import PolynomialFeatures from sklearn.metrics import mean_squared_error # Read data autoDF =pd.read_csv("auto_mpg.csv",encoding="ISO-8859-1") autoDF.shape autoDF.columns autoDF1=autoDF[['mpg','cylinder','displacement','horsepower','weight','acceleration','year']] autoDF2 = autoDF1.apply(pd.to_numeric, errors='coerce') # Remove rows with NAs autoDF3=autoDF2.dropna() autoDF3.shape X=autoDF3[['horsepower']] y=autoDF3['mpg'] # For polynomial degree 1,2,3... 10 def computeCVError(X,y,folds): deg=[] mse=[] degree1=[1,2,3,4,5,6,7,8,9,10] nK=len(X)/float(folds) xval_err=0 # For degree 'j' for j in degree1: # Split as 'folds' kf = KFold(len(X),n_folds=folds) for train_index, test_index in kf: # Create the appropriate train and test partitions from the fold index X_train, X_test = X.iloc[train_index], X.iloc[test_index] y_train, y_test = y.iloc[train_index], y.iloc[test_index] # For the polynomial degree 'j' poly = PolynomialFeatures(degree=j) # Transform the X_train and X_test X_train_poly = poly.fit_transform(X_train) X_test_poly = poly.fit_transform(X_test) # Fit a model on the transformed data linreg = LinearRegression().fit(X_train_poly, y_train) # Compute yhat or ypred y_pred = linreg.predict(X_test_poly) # Compute MSE * n_K/N test_mse = mean_squared_error(y_test, y_pred)*float(len(X_train))/float(len(X)) # Add the test_mse for this partition of the data mse.append(test_mse) # Compute the mean of all folds for degree 'j' deg.append(np.mean(mse)) return(deg) df=pd.DataFrame() print(len(X)) # Call the function once. For LOOCV K=N. hence len(X) is passed as number of folds cvError=computeCVError(X,y,len(X)) # Create and plot LOOCV df=pd.DataFrame(cvError) fig3=df.plot() fig3=plt.title("Leave one out Cross Validation - Cross Validation Error vs Degree of Polynomial") fig3=plt.xlabel("Degree of Polynomial") fig3=plt.ylabel("Cross validation Error") fig3.figure.savefig('foo3.png', bbox_inches='tight') ## 6a K Fold Cross Validation – R code Here K Fold Cross Validation is done for 4, 5 and 10 folds using the R package boot and the glm package library(boot) library(reshape2) set.seed(17) #Read data df=read.csv("auto_mpg.csv",stringsAsFactors = FALSE) # Data from UCI df1 <- as.data.frame(sapply(df,as.numeric)) df2 <- df1 %>% dplyr::select(cylinder,displacement, horsepower,weight, acceleration, year,mpg) df3 <- df2[complete.cases(df2),] a=matrix(rep(0,30),nrow=3,ncol=10) set.seed(17) # Set the folds as 4,5 and 10 folds<-c(4,5,10) for(i in seq_along(folds)){ cv.error.10=rep(0,10) for (j in 1:10){ # Fit a generalized linear model glm.fit=glm(mpg~poly(horsepower,j),data=df3) # Compute K Fold Validation error a[i,j]=cv.glm(df3,glm.fit,K=folds[i])$delta[1]

}

}

# Create and display the K Fold Cross Validation Error
b <- t(a)
df <- data.frame(b)
df1 <- cbind(seq(1,10),df)
names(df1) <- c("PolynomialDegree","4-fold","5-fold","10-fold")

df2 <- melt(df1,id="PolynomialDegree")
ggplot(df2) + geom_line(aes(x=PolynomialDegree, y=value, colour=variable),size=2) +
xlab("Degree of Polynomial") + ylab("Cross Validation Error") +
ggtitle("K Fold Cross Validation - Cross Validation Error vs Degree of Polynomial")

## 6b. K Fold Cross Validation – Python code

The implementation of K-Fold Cross Validation Error has to be implemented and I have done this below. There is a small discrepancy in the shapes of the curves with the R plot above. Not sure why!

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.cross_validation import train_test_split, KFold
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error
autoDF.shape
autoDF.columns
autoDF1=autoDF[['mpg','cylinder','displacement','horsepower','weight','acceleration','year']]
autoDF2 = autoDF1.apply(pd.to_numeric, errors='coerce')
# Drop NA rows
autoDF3=autoDF2.dropna()
autoDF3.shape
#X=autoDF3[['cylinder','displacement','horsepower','weight']]
X=autoDF3[['horsepower']]
y=autoDF3['mpg']

# Create Cross Validation function
def computeCVError(X,y,folds):
deg=[]
mse=[]
# For degree 1,2,3,..10
degree1=[1,2,3,4,5,6,7,8,9,10]

nK=len(X)/float(folds)
xval_err=0
for j in degree1:
# Split the data into 'folds'
kf = KFold(len(X),n_folds=folds)
for train_index, test_index in kf:
# Partition the data acccording the fold indices generated
X_train, X_test = X.iloc[train_index], X.iloc[test_index]
y_train, y_test = y.iloc[train_index], y.iloc[test_index]

# Scale the X_train and X_test as per the polynomial degree 'j'
poly = PolynomialFeatures(degree=j)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.fit_transform(X_test)
# Fit a polynomial regression
linreg = LinearRegression().fit(X_train_poly, y_train)
# Compute yhat or ypred
y_pred = linreg.predict(X_test_poly)
# Compute MSE *(nK/N)
test_mse = mean_squared_error(y_test, y_pred)*float(len(X_train))/float(len(X))
# Append to list for different folds
mse.append(test_mse)
# Compute the mean for poylnomial 'j'
deg.append(np.mean(mse))

return(deg)

# Create and display a plot of K -Folds
df=pd.DataFrame()
for folds in [4,5,10]:
cvError=computeCVError(X,y,folds)
#print(cvError)
df1=pd.DataFrame(cvError)
df=pd.concat([df,df1],axis=1)
#print(cvError)

df.columns=['4-fold','5-fold','10-fold']
df=df.reindex([1,2,3,4,5,6,7,8,9,10])
df
fig2=df.plot()
fig2=plt.title("K Fold Cross Validation - Cross Validation Error vs Degree of Polynomial")
fig2=plt.xlabel("Degree of Polynomial")
fig2=plt.ylabel("Cross validation Error")
fig2.figure.savefig('foo2.png', bbox_inches='tight')


This concludes this 2nd part of this series. I will look into model tuning and model selection in R and Python in the coming parts. Comments, suggestions and corrections are welcome!
To be continued….
Watch this space!

Also see

To see all posts see Index of posts

# Introduction

This is the 1st part of a series of posts I intend to write on some common Machine Learning Algorithms in R and Python. In this first part I cover the following Machine Learning Algorithms

• Univariate Regression
• Multivariate Regression
• Polynomial Regression
• K Nearest Neighbors Regression

The code includes the implementation in both R and Python. This series of posts are based on the following 2 MOOC courses I did at Stanford Online and at Coursera

1. Statistical Learning, Prof Trevor Hastie & Prof Robert Tibesherani, Online Stanford
2. Applied Machine Learning in Python Prof Kevyn-Collin Thomson, University Of Michigan, Coursera

I have used the data sets from UCI Machine Learning repository(Communities and Crime and Auto MPG). I also use the Boston data set from MASS package

While coding in R and Python I found that there were some aspects that were more convenient in one language and some in the other. For example, plotting the fit in R is straightforward in R, while computing the R squared, splitting as Train & Test sets etc. are already available in Python. In any case, these minor inconveniences can be easily be implemented in either language.

R squared computation in R is computed as follows
$RSS=\sum (y-yhat)^{2}$
$TSS= \sum(y-mean(y))^{2}$
$Rsquared- 1-\frac{RSS}{TSS}$

Note: You can download this R Markdown file and the associated data sets from Github at MachineLearning-RandPython
Note 1: This post was created as an R Markdown file in RStudio which has a cool feature of including R and Python snippets. The plot of matplotlib needs a workaround but otherwise this is a real cool feature of RStudio!

## 1.1a Univariate Regression – R code

Here a simple linear regression line is fitted between a single input feature and the target variable

# Source in the R function library
source("RFunctions.R")
# Read the Boston data file
df=read.csv("Boston.csv",stringsAsFactors = FALSE) # Data from MASS - Statistical Learning

# Split the data into training and test sets (75:25)
train_idx <- trainTestSplit(df,trainPercent=75,seed=5)
train <- df[train_idx, ]
test <- df[-train_idx, ]

# Fit a linear regression line between 'Median value of owner occupied homes' vs 'lower status of
# population'
fit=lm(medv~lstat,data=df)
# Display details of fir
summary(fit)
##
## Call:
## lm(formula = medv ~ lstat, data = df)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -15.168  -3.990  -1.318   2.034  24.500
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.55384    0.56263   61.41   <2e-16 ***
## lstat       -0.95005    0.03873  -24.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared:  0.5441, Adjusted R-squared:  0.5432
## F-statistic: 601.6 on 1 and 504 DF,  p-value: < 2.2e-16
# Display the confidence intervals
confint(fit)
##                 2.5 %     97.5 %
## (Intercept) 33.448457 35.6592247
## lstat       -1.026148 -0.8739505
plot(df$lstat,df$medv, xlab="Lower status (%)",ylab="Median value of owned homes ($1000)", main="Median value of homes ($1000) vs Lowe status (%)")
abline(fit)
abline(fit,lwd=3)
abline(fit,lwd=3,col="red")

rsquared=Rsquared(fit,test,test$medv) sprintf("R-squared for uni-variate regression (Boston.csv) is : %f", rsquared) ## [1] "R-squared for uni-variate regression (Boston.csv) is : 0.556964" ## 1.1b Univariate Regression – Python code import numpy as np import pandas as pd import os import matplotlib.pyplot as plt from sklearn.model_selection import train_test_split from sklearn.linear_model import LinearRegression #os.chdir("C:\\software\\machine-learning\\RandPython") # Read the CSV file df = pd.read_csv("Boston.csv",encoding = "ISO-8859-1") # Select the feature variable X=df['lstat'] # Select the target y=df['medv'] # Split into train and test sets (75:25) X_train, X_test, y_train, y_test = train_test_split(X, y,random_state = 0) X_train=X_train.values.reshape(-1,1) X_test=X_test.values.reshape(-1,1) # Fit a linear model linreg = LinearRegression().fit(X_train, y_train) # Print the training and test R squared score print('R-squared score (training): {:.3f}'.format(linreg.score(X_train, y_train))) print('R-squared score (test): {:.3f}'.format(linreg.score(X_test, y_test))) # Plot the linear regression line fig=plt.scatter(X_train,y_train) # Create a range of points. Compute yhat=coeff1*x + intercept and plot x=np.linspace(0,40,20) fig1=plt.plot(x, linreg.coef_ * x + linreg.intercept_, color='red') fig1=plt.title("Median value of homes ($1000) vs Lowe status (%)")
fig1=plt.xlabel("Lower status (%)")
fig1=plt.ylabel("Median value of owned homes ($1000)") fig.figure.savefig('foo.png', bbox_inches='tight') fig1.figure.savefig('foo1.png', bbox_inches='tight') print "Finished"  ## R-squared score (training): 0.571 ## R-squared score (test): 0.458 ## Finished ## 1.2a Multivariate Regression – R code # Read crimes data crimesDF <- read.csv("crimes.csv",stringsAsFactors = FALSE) # Remove the 1st 7 columns which do not impact output crimesDF1 <- crimesDF[,7:length(crimesDF)] # Convert all to numeric crimesDF2 <- sapply(crimesDF1,as.numeric) # Check for NAs a <- is.na(crimesDF2) # Set to 0 as an imputation crimesDF2[a] <-0 #Create as a dataframe crimesDF2 <- as.data.frame(crimesDF2) #Create a train/test split train_idx <- trainTestSplit(crimesDF2,trainPercent=75,seed=5) train <- crimesDF2[train_idx, ] test <- crimesDF2[-train_idx, ] # Fit a multivariate regression model between crimesPerPop and all other features fit <- lm(ViolentCrimesPerPop~.,data=train) # Compute and print R Squared rsquared=Rsquared(fit,test,test$ViolentCrimesPerPop)
sprintf("R-squared for multi-variate regression (crimes.csv)  is : %f", rsquared)
## [1] "R-squared for multi-variate regression (crimes.csv)  is : 0.653940"

## 1.2b Multivariate Regression – Python code

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# Read the data
#Remove the 1st 7 columns
crimesDF1=crimesDF.iloc[:,7:crimesDF.shape[1]]
# Convert to numeric
crimesDF2 = crimesDF1.apply(pd.to_numeric, errors='coerce')
# Impute NA to 0s
crimesDF2.fillna(0, inplace=True)

# Select the X (feature vatiables - all)
X=crimesDF2.iloc[:,0:120]

# Set the target
y=crimesDF2.iloc[:,121]

X_train, X_test, y_train, y_test = train_test_split(X, y,random_state = 0)
# Fit a multivariate regression model
linreg = LinearRegression().fit(X_train, y_train)

# compute and print the R Square
print('R-squared score (training): {:.3f}'.format(linreg.score(X_train, y_train)))
print('R-squared score (test): {:.3f}'.format(linreg.score(X_test, y_test)))
## R-squared score (training): 0.699
## R-squared score (test): 0.677

## 1.3a Polynomial Regression – R

For Polynomial regression , polynomials of degree 1,2 & 3 are used and R squared is computed. It can be seen that the quadaratic model provides the best R squared score and hence the best fit

 # Polynomial degree 1
df=read.csv("auto_mpg.csv",stringsAsFactors = FALSE) # Data from UCI
df1 <- as.data.frame(sapply(df,as.numeric))

# Select key columns
df2 <- df1 %>% select(cylinder,displacement, horsepower,weight, acceleration, year,mpg)
df3 <- df2[complete.cases(df2),]

# Split as train and test sets
train_idx <- trainTestSplit(df3,trainPercent=75,seed=5)
train <- df3[train_idx, ]
test <- df3[-train_idx, ]

# Fit a model of degree 1
fit <- lm(mpg~. ,data=train)
rsquared1 <-Rsquared(fit,test,test$mpg) sprintf("R-squared for Polynomial regression of degree 1 (auto_mpg.csv) is : %f", rsquared1) ## [1] "R-squared for Polynomial regression of degree 1 (auto_mpg.csv) is : 0.763607" # Polynomial degree 2 - Quadratic x = as.matrix(df3[1:6]) # Make a polynomial of degree 2 for feature variables before split df4=as.data.frame(poly(x,2,raw=TRUE)) df5 <- cbind(df4,df3[7]) # Split into train and test set train_idx <- trainTestSplit(df5,trainPercent=75,seed=5) train <- df5[train_idx, ] test <- df5[-train_idx, ] # Fit the quadratic model fit <- lm(mpg~. ,data=train) # Compute R squared rsquared2=Rsquared(fit,test,test$mpg)
sprintf("R-squared for Polynomial regression of degree 2 (auto_mpg.csv)  is : %f", rsquared2)
## [1] "R-squared for Polynomial regression of degree 2 (auto_mpg.csv)  is : 0.831372"
#Polynomial degree 3
x = as.matrix(df3[1:6])
# Make polynomial of degree 4  of feature variables before split
df4=as.data.frame(poly(x,3,raw=TRUE))
df5 <- cbind(df4,df3[7])
train_idx <- trainTestSplit(df5,trainPercent=75,seed=5)

train <- df5[train_idx, ]
test <- df5[-train_idx, ]
# Fit a model of degree 3
fit <- lm(mpg~. ,data=train)
# Compute R squared
rsquared3=Rsquared(fit,test,test$mpg) sprintf("R-squared for Polynomial regression of degree 2 (auto_mpg.csv) is : %f", rsquared3) ## [1] "R-squared for Polynomial regression of degree 2 (auto_mpg.csv) is : 0.773225" df=data.frame(degree=c(1,2,3),Rsquared=c(rsquared1,rsquared2,rsquared3)) # Make a plot of Rsquared and degree ggplot(df,aes(x=degree,y=Rsquared)) +geom_point() + geom_line(color="blue") + ggtitle("Polynomial regression - R squared vs Degree of polynomial") + xlab("Degree") + ylab("R squared") ## 1.3a Polynomial Regression – Python For Polynomial regression , polynomials of degree 1,2 & 3 are used and R squared is computed. It can be seen that the quadaratic model provides the best R squared score and hence the best fit import numpy as np import pandas as pd import os import matplotlib.pyplot as plt from sklearn.model_selection import train_test_split from sklearn.linear_model import LinearRegression from sklearn.preprocessing import PolynomialFeatures autoDF =pd.read_csv("auto_mpg.csv",encoding="ISO-8859-1") autoDF.shape autoDF.columns # Select key columns autoDF1=autoDF[['mpg','cylinder','displacement','horsepower','weight','acceleration','year']] # Convert columns to numeric autoDF2 = autoDF1.apply(pd.to_numeric, errors='coerce') # Drop NAs autoDF3=autoDF2.dropna() autoDF3.shape X=autoDF3[['cylinder','displacement','horsepower','weight','acceleration','year']] y=autoDF3['mpg'] # Polynomial degree 1 X_train, X_test, y_train, y_test = train_test_split(X, y,random_state = 0) linreg = LinearRegression().fit(X_train, y_train) print('R-squared score - Polynomial degree 1 (training): {:.3f}'.format(linreg.score(X_train, y_train))) # Compute R squared rsquared1 =linreg.score(X_test, y_test) print('R-squared score - Polynomial degree 1 (test): {:.3f}'.format(linreg.score(X_test, y_test))) # Polynomial degree 2 poly = PolynomialFeatures(degree=2) X_poly = poly.fit_transform(X) X_train, X_test, y_train, y_test = train_test_split(X_poly, y,random_state = 0) linreg = LinearRegression().fit(X_train, y_train) # Compute R squared print('R-squared score - Polynomial degree 2 (training): {:.3f}'.format(linreg.score(X_train, y_train))) rsquared2 =linreg.score(X_test, y_test) print('R-squared score - Polynomial degree 2 (test): {:.3f}\n'.format(linreg.score(X_test, y_test))) #Polynomial degree 3 poly = PolynomialFeatures(degree=3) X_poly = poly.fit_transform(X) X_train, X_test, y_train, y_test = train_test_split(X_poly, y,random_state = 0) linreg = LinearRegression().fit(X_train, y_train) print('(R-squared score -Polynomial degree 3 (training): {:.3f}' .format(linreg.score(X_train, y_train))) # Compute R squared rsquared3 =linreg.score(X_test, y_test) print('R-squared score Polynomial degree 3 (test): {:.3f}\n'.format(linreg.score(X_test, y_test))) degree=[1,2,3] rsquared =[rsquared1,rsquared2,rsquared3] fig2=plt.plot(degree,rsquared) fig2=plt.title("Polynomial regression - R squared vs Degree of polynomial") fig2=plt.xlabel("Degree") fig2=plt.ylabel("R squared") fig2.figure.savefig('foo2.png', bbox_inches='tight') print "Finished plotting and saving"  ## R-squared score - Polynomial degree 1 (training): 0.811 ## R-squared score - Polynomial degree 1 (test): 0.799 ## R-squared score - Polynomial degree 2 (training): 0.861 ## R-squared score - Polynomial degree 2 (test): 0.847 ## ## (R-squared score -Polynomial degree 3 (training): 0.933 ## R-squared score Polynomial degree 3 (test): 0.710 ## ## Finished plotting and saving ## 1.4 K Nearest Neighbors The code below implements KNN Regression both for R and Python. This is done for different neighbors. The R squared is computed in each case. This is repeated after performing feature scaling. It can be seen the model fit is much better after feature scaling. Normalization refers to $X_{normalized} = \frac{X-min(X)}{max(X-min(X))}$ Another technique that is used is Standardization which is $X_{standardized} = \frac{X-mean(X)}{sd(X)}$ ## 1.4a K Nearest Neighbors Regression – R( Unnormalized) The R code below does not use feature scaling # KNN regression requires the FNN package df=read.csv("auto_mpg.csv",stringsAsFactors = FALSE) # Data from UCI df1 <- as.data.frame(sapply(df,as.numeric)) df2 <- df1 %>% select(cylinder,displacement, horsepower,weight, acceleration, year,mpg) df3 <- df2[complete.cases(df2),] # Split train and test train_idx <- trainTestSplit(df3,trainPercent=75,seed=5) train <- df3[train_idx, ] test <- df3[-train_idx, ] # Select the feature variables train.X=train[,1:6] # Set the target for training train.Y=train[,7] # Do the same for test set test.X=test[,1:6] test.Y=test[,7] rsquared <- NULL # Create a list of neighbors neighbors <-c(1,2,4,8,10,14) for(i in seq_along(neighbors)){ # Perform a KNN regression fit knn=knn.reg(train.X,test.X,train.Y,k=neighbors[i]) # Compute R sqaured rsquared[i]=knnRSquared(knn$pred,test.Y)
}

# Make a dataframe for plotting
df <- data.frame(neighbors,Rsquared=rsquared)
# Plot the number of neighors vs the R squared
ggplot(df,aes(x=neighbors,y=Rsquared)) + geom_point() +geom_line(color="blue") +
xlab("Number of neighbors") + ylab("R squared") +
ggtitle("KNN regression - R squared vs Number of Neighors (Unnormalized)")

## 1.4b K Nearest Neighbors Regression – Python( Unnormalized)

The Python code below does not use feature scaling

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.neighbors import KNeighborsRegressor
autoDF.shape
autoDF.columns
autoDF1=autoDF[['mpg','cylinder','displacement','horsepower','weight','acceleration','year']]
autoDF2 = autoDF1.apply(pd.to_numeric, errors='coerce')
autoDF3=autoDF2.dropna()
autoDF3.shape
X=autoDF3[['cylinder','displacement','horsepower','weight','acceleration','year']]
y=autoDF3['mpg']

# Perform a train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)
# Create a list of neighbors
rsquared=[]
neighbors=[1,2,4,8,10,14]
for i in neighbors:
# Fit a KNN model
knnreg = KNeighborsRegressor(n_neighbors = i).fit(X_train, y_train)
# Compute R squared
rsquared.append(knnreg.score(X_test, y_test))
print('R-squared test score: {:.3f}'
.format(knnreg.score(X_test, y_test)))
# Plot the number of neighors vs the R squared
fig3=plt.plot(neighbors,rsquared)
fig3=plt.title("KNN regression - R squared vs Number of neighbors(Unnormalized)")
fig3=plt.xlabel("Neighbors")
fig3=plt.ylabel("R squared")
fig3.figure.savefig('foo3.png', bbox_inches='tight')
print "Finished plotting and saving"
## R-squared test score: 0.527
## R-squared test score: 0.678
## R-squared test score: 0.707
## R-squared test score: 0.684
## R-squared test score: 0.683
## R-squared test score: 0.670
## Finished plotting and saving

## 1.4c K Nearest Neighbors Regression – R( Normalized)

It can be seen that R squared improves when the features are normalized.

df=read.csv("auto_mpg.csv",stringsAsFactors = FALSE) # Data from UCI
df1 <- as.data.frame(sapply(df,as.numeric))
df2 <- df1 %>% select(cylinder,displacement, horsepower,weight, acceleration, year,mpg)
df3 <- df2[complete.cases(df2),]

# Perform MinMaxScaling of feature variables
train.X.scaled=MinMaxScaler(train.X)
test.X.scaled=MinMaxScaler(test.X)

# Create a list of neighbors
rsquared <- NULL
neighbors <-c(1,2,4,6,8,10,12,15,20,25,30)
for(i in seq_along(neighbors)){
# Fit a KNN model
knn=knn.reg(train.X.scaled,test.X.scaled,train.Y,k=i)
# Compute R ssquared
rsquared[i]=knnRSquared(knn\$pred,test.Y)

}

df <- data.frame(neighbors,Rsquared=rsquared)
# Plot the number of neighors vs the R squared
ggplot(df,aes(x=neighbors,y=Rsquared)) + geom_point() +geom_line(color="blue") +
xlab("Number of neighbors") + ylab("R squared") +
ggtitle("KNN regression - R squared vs Number of Neighors(Normalized)")

## 1.4d K Nearest Neighbors Regression – Python( Normalized)

R squared improves when the features are normalized with MinMaxScaling

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import MinMaxScaler
autoDF.shape
autoDF.columns
autoDF1=autoDF[['mpg','cylinder','displacement','horsepower','weight','acceleration','year']]
autoDF2 = autoDF1.apply(pd.to_numeric, errors='coerce')
autoDF3=autoDF2.dropna()
autoDF3.shape
X=autoDF3[['cylinder','displacement','horsepower','weight','acceleration','year']]
y=autoDF3['mpg']

# Perform a train/ test  split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)
# Use MinMaxScaling
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
# Apply scaling on test set
X_test_scaled = scaler.transform(X_test)

# Create a list of neighbors
rsquared=[]
neighbors=[1,2,4,6,8,10,12,15,20,25,30]
for i in neighbors:
# Fit a KNN model
knnreg = KNeighborsRegressor(n_neighbors = i).fit(X_train_scaled, y_train)
# Compute R squared
rsquared.append(knnreg.score(X_test_scaled, y_test))
print('R-squared test score: {:.3f}'
.format(knnreg.score(X_test_scaled, y_test)))

# Plot the number of neighors vs the R squared
fig4=plt.plot(neighbors,rsquared)
fig4=plt.title("KNN regression - R squared vs Number of neighbors(Normalized)")
fig4=plt.xlabel("Neighbors")
fig4=plt.ylabel("R squared")
fig4.figure.savefig('foo4.png', bbox_inches='tight')
print "Finished plotting and saving"
## R-squared test score: 0.703
## R-squared test score: 0.810
## R-squared test score: 0.830
## R-squared test score: 0.838
## R-squared test score: 0.834
## R-squared test score: 0.828
## R-squared test score: 0.827
## R-squared test score: 0.826
## R-squared test score: 0.816
## R-squared test score: 0.815
## R-squared test score: 0.809
## Finished plotting and saving

# Conclusion

In this initial post I cover the regression models when the output is continous. I intend to touch upon other Machine Learning algorithms.
Comments, suggestions and corrections are welcome.

Watch this this space!

To be continued….

To see all posts see Index of posts

# Neural Networks: On Perceptrons and Sigmoid Neurons

Neural Networks had their beginnings in 1943 when Warren McCulloch, a neurophysiologist, and a young mathematician, Walter Pitts, wrote a paper on how neurons might work.  Much later in 1958, Frank Rosenblatt, a neuro-biologist proposed the Perceptron. The Perceptron is a computer model or computerized machine which is devised to represent or simulate the ability of the brain to recognize and discriminate. In machine learning, the perceptron is an algorithm for supervised learning of binary classifiers

Initially it was believed that  Perceptrons were capable of many things including “the ability to walk, talk, see, write, reproduce itself and be conscious of its existence.”

However, a subsequent paper by Marvin Minky and Seymour Papert of MIT, titled “Perceptrons” proved that the Perceptron was truly limited in its functionality. Specifically they showed that the Perceptron was incapable of producing XOR functionality. The Perceptron is only capable of classification where the data points are linearly separable.

This post implements the simple learning algorithm of the ‘Linear Perceptron’ and the ‘Sigmoid Perceptron’.  The implementation has been done in Octave. This implementation is based on “Neural networks for Machine Learning” course by Prof Geoffrey Hinton at Coursera

Perceptron learning procedure
z = ∑wixi  + b
where wi is the ith weight and xi is the ith  feature

For every training case compute the activation output zi

• If the output classifies correctly, leave the weights alone
• If the output classifies a ‘0’ as a ‘1’, then subtract the the feature from the weight
• If the output classifies a ‘0’ as a ‘1’, then add the feature to the weight

This simple neural network is represented below

Sigmoid neuron learning procedure
zi = sigmoid(∑wixi  + b)
where sigmoid is
$sigmoid(z) = 1/1+e^{-z}$

Hence
$z_{i} = 1/1+e^{-(\sum w_{i}x_{i}+b)}$
For every training case compute the activation output zi

• If the output classifies correctly, leave the weights alone
• If the output incorrectly classifies a ‘0’ as a ‘1’ i.e. $z_{i} >sigmoid(0)$, then subtract the feature from the weight
• If the output incorrectly classifies a ‘1’ as ‘0’ i.e., i.e $z_{i} < sigmoid(0)$, then add the feature to the weight
• Iterate till errors <= 1

This is shown below

I have implemented the learning algorithm of the Perceptron and Sigmoid Neuron in Octave. The code is available at Github at Perceptron.

1. Perceptron execution

I performed the tests on 2 different datasets

Data 1

Data 2

2. Sigmoid Perceptron execution
Data 1 & Data 2

It can be seen that the Perceptron does work for simple linearly separable data. I will be implementing other more advanced Neural Networks in the months to come.

Watch this space!

# Video presentation on Machine Learning, Data Science, NLP and Big Data – Part 1

Here is the 1st part of my video presentation on “Machine Learning, Data Science, NLP and Big Data – Part 1”

# IBM Data Science Experience:  First steps with yorkr

Fresh, and slightly dizzy, from my foray into Quantum Computing with IBM’s Quantum Experience, I now turn my attention to IBM’s Data Science Experience (DSE).

I am on the verge of completing a really great 3 module ‘Data Science and Engineering with Spark XSeries’ from the University of California, Berkeley and I have been thinking of trying out some form of integrated delivery platform for performing analytics, for quite some time.  Coincidentally,  IBM comes out with its Data Science Experience. a month back. There are a couple of other collaborative platforms available for playing around with Apache Spark or Data Analytics namely Jupyter notebooks, Databricks, Data.world.

I decided to go ahead with IBM’s Data Science Experience as  the GUI is a lot cooler, includes shared data sets and integrates with Object Storage, Cloudant DB etc,  which seemed a lot closer to the cloud, literally!  IBM’s DSE is an interactive, collaborative, cloud-based environment for performing data analysis with Apache Spark. DSE is hosted on IBM’s PaaS environment, Bluemix. It should be possible to access in DSE the plethora of cloud services available on Bluemix. IBM’s DSE uses Jupyter notebooks for creating and analyzing data which can be easily shared and has access to a few hundred publicly available datasets

Disclaimer: This article represents the author’s viewpoint only and doesn’t necessarily represent IBM’s positions, strategies or opinions

In this post, I use IBM’s DSE and my R package yorkr, for analyzing the performance of 1 ODI match (Aus-Ind, 2 Feb 2012)  and the batting performance of Virat Kohli in IPL matches. These are my ‘first’ steps in DSE so, I use plain old “R language” for analysis together with my R package ‘yorkr’. I intend to  do more interesting stuff on Machine learning with SparkR, Sparklyr and PySpark in the weeks and months to come.

You can checkout the Jupyter notebooks created with IBM’s DSE Y at Github  – “Using R package yorkr – A quick overview’ and  on NBviewer at “Using R package yorkr – A quick overview

Working with Jupyter notebooks are fairly straight forward which can handle code in R, Python and Scala. Each cell can either contain code (Python or Scala), Markdown text, NBConvert or Heading. The code is written into the cells and can be executed sequentially. Here is a screen shot of the notebook.

The ‘File’ menu can be used for ‘saving and checkpointing’ or ‘reverting’ to a checkpoint. The ‘kernel’ menu can be used to start, interrupt, restart and run all cells etc. Data Sources icon can be used to load data sources to your code. The data is uploaded to Object Storage with appropriate credentials. You will have to  import this data from Object Storage using the credentials. In my notebook with yorkr I directly load the data from Github.  You can use the sharing to share the notebook. The shared notebook has an extension ‘ipynb’. You can use the ‘Sharing’ icon  to share the notebook. The shared notebook has an extension ‘ipynb’. You an import this notebook directly into your environment and can get started with the code available in the notebook.

You can import existing R, Python or Scala notebooks as shown below. My notebook ‘Using R package yorkr – A quick overview’ can be downloaded using the link ‘yorkrWithDSE’ and clicking the green download icon on top right corner.

I have also uploaded the file to Github and you can download from here too ‘yorkrWithDSE’. This notebook can be imported into your DSE as shown below

Jupyter notebooks have been integrated with Github and are rendered directly from Github.  You can view my Jupyter notebook here  – “Using R package yorkr – A quick overview’. You can also view it on NBviewer at “Using R package yorkr – A quick overview

So there it is. You can download my notebook, import it into IBM’s Data Science Experience and then use data from ‘yorkrData” as shown. As already mentioned yorkrData contains converted data for ODIs, T20 and IPL. For details on how to use my R package yorkr  please my posts on yorkr at “Index of posts

Hope you have fun playing wit IBM’s Data Science Experience and my package yorkr.

I will be exploring IBM’s DSE in weeks and months to come in the areas of Machine Learning with SparkR,SparklyR or pySpark.

Watch this space!!!

Disclaimer: This article represents the author’s viewpoint only and doesn’t necessarily represent IBM’s positions, strategies or opinions

Also see

To see all my posts check
Index of posts

# Introducing cricket package yorkr:Part 4-In the block hole!

## Introduction

“The nitrogen in our DNA, the calcium in our teeth, the iron in our blood, the carbon in our apple pies were made in the interiors of collapsing stars. We are made of starstuff.”

“If you wish to make an apple pie from scratch, you must first invent the universe.”

“We are like butterflies who flutter for a day and think it is forever.”

“The absence of evidence is not the evidence of absence.”

“We are star stuff which has taken its destiny into its own hands.”

                              Cosmos - Carl Sagan

This post is the 4th and possibly, the last part of my introduction, to my latest cricket package yorkr. This is the 4th part of the introduction, the 3 earlier ones were

The 1st part included functions dealing with a specific match, the 2nd part dealt with functions between 2 opposing teams. The 3rd part dealt with functions between a team and all matches with all oppositions. This 4th part includes individual batting and bowling performances in ODI matches and deals with Class 4 functions.

This post has also been published at RPubs yorkr-Part4 and can also be downloaded as a PDF document from yorkr-Part4.pdf.

You can clone/fork the code for the package yorkr from Github at yorkr-package

Check out my 2 books on cricket, a) Cricket analytics with cricketr b) Beaten by sheer pace – Cricket analytics with yorkr, now available in both paperback & kindle versions on Amazon!!! Pick up your copies today!

Checkout my interactive Shiny apps GooglyPlus (plots & tables) and Googly (only plots) which can be used to analyze IPL players, teams and matches.

#### Batsman functions

1. batsmanRunsVsDeliveries
3. batsmanDismissals
4. batsmanRunsVsStrikeRate
5. batsmanMovingAverage
6. batsmanCumulativeAverageRuns
7. batsmanCumulativeStrikeRate
8. batsmanRunsAgainstOpposition
9. batsmanRunsVenue
10. batsmanRunsPredict

#### Bowler functions

1. bowlerMeanEconomyRate
2. bowlerMeanRunsConceded
3. bowlerMovingAverage
4. bowlerCumulativeAvgWickets
5. bowlerCumulativeAvgEconRate
6. bowlerWicketPlot
7. bowlerWicketsAgainstOpposition
8. bowlerWicketsVenue
9. bowlerWktsPredict

Note: The yorkr package in its current avatar only supports ODI, T20 and IPL T20 matches.

library(yorkr)
library(gridExtra)
library(rpart.plot)
library(dplyr)
library(ggplot2)
rm(list=ls())

## A. Batsman functions

### 1. Get Team Batting details

The function below gets the overall team batting details based on the RData file available in ODI matches. This is currently also available in Github at (https://github.com/tvganesh/yorkrData/tree/master/ODI/ODI-matches).  However you may have to do this as future matches are added! The batting details of the team in each match is created and a huge data frame is created by rbinding the individual dataframes. This can be saved as a RData file

setwd("C:/software/cricket-package/york-test/yorkrData/ODI/ODI-matches")
india_details <- getTeamBattingDetails("India",dir=".", save=TRUE)
dim(india_details)
## [1] 11085    15
sa_details <- getTeamBattingDetails("South Africa",dir=".",save=TRUE)
dim(sa_details)
## [1] 6375   15
nz_details <- getTeamBattingDetails("New Zealand",dir=".",save=TRUE)
dim(nz_details)
## [1] 6262   15
eng_details <- getTeamBattingDetails("England",dir=".",save=TRUE)
dim(eng_details)
## [1] 9001   15

### 2. Get batsman details

This function is used to get the individual batting record for a the specified batsmen of the country as in the functions below. For analyzing the batting performances the following cricketers have been chosen

1. Virat Kohli (Ind)
2. M S Dhoni (Ind)
3. AB De Villiers (SA)
4. Q De Kock (SA)
5. J Root (Eng)
6. M J Guptill (NZ)
setwd("C:/software/cricket-package/york-test/yorkrData/ODI/ODI-matches")
kohli <- getBatsmanDetails(team="India",name="Kohli",dir=".")
## [1] "./India-BattingDetails.RData"
dhoni <- getBatsmanDetails(team="India",name="Dhoni")
## [1] "./India-BattingDetails.RData"
devilliers <-  getBatsmanDetails(team="South Africa",name="Villiers",dir=".")
## [1] "./South Africa-BattingDetails.RData"
deKock <-  getBatsmanDetails(team="South Africa",name="Kock",dir=".")
## [1] "./South Africa-BattingDetails.RData"
root <-  getBatsmanDetails(team="England",name="Root",dir=".")
## [1] "./England-BattingDetails.RData"
guptill <-  getBatsmanDetails(team="New Zealand",name="Guptill",dir=".")
## [1] "./New Zealand-BattingDetails.RData"

### 3. Runs versus deliveries

Kohli, De Villiers and Guptill have a good cluster of points that head towards 150 runs at 150 deliveries.

p1 <-batsmanRunsVsDeliveries(kohli,"Kohli")
p2 <- batsmanRunsVsDeliveries(dhoni, "Dhoni")
p3 <- batsmanRunsVsDeliveries(devilliers,"De Villiers")
p4 <- batsmanRunsVsDeliveries(deKock,"Q de Kock")
p5 <- batsmanRunsVsDeliveries(root,"JE Root")
p6 <- batsmanRunsVsDeliveries(guptill,"MJ Guptill")
grid.arrange(p1,p2,p3,p4,p5,p6, ncol=3)

### 4. Batsman Total runs, Fours and Sixes

The plots below show the total runs, fours and sixes by the batsmen

kohli46 <- select(kohli,batsman,ballsPlayed,fours,sixes,runs)
dhoni46 <- select(dhoni,batsman,ballsPlayed,fours,sixes,runs)
devilliers46 <- select(devilliers,batsman,ballsPlayed,fours,sixes,runs)
p3 <- batsmanFoursSixes(devilliers46, "De Villiers")
deKock46 <- select(deKock,batsman,ballsPlayed,fours,sixes,runs)
p4 <- batsmanFoursSixes(deKock46,"Q de Kock")
root46 <- select(root,batsman,ballsPlayed,fours,sixes,runs)
p5 <- batsmanFoursSixes(root46,"JE Root")
guptill46 <- select(guptill,batsman,ballsPlayed,fours,sixes,runs)
p6 <- batsmanFoursSixes(guptill46,"MJ Guptill")
grid.arrange(p1,p2,p3,p4,p5,p6, ncol=3)

### 5. Batsman dismissals

The type of dismissal for each batsman is shown below

p1 <-batsmanDismissals(kohli,"Kohli")
p2 <- batsmanDismissals(dhoni, "Dhoni")
p3 <- batsmanDismissals(devilliers, "De Villiers")
p4 <- batsmanDismissals(deKock,"Q de Kock")
p5 <- batsmanDismissals(root,"JE Root")
p6 <- batsmanDismissals(guptill,"MJ Guptill")
grid.arrange(p1,p2,p3,p4,p5,p6, ncol=3)

### 6. Runs versus Strike Rate

De villiers has the best strike rate among all as there are more points to the right side of the plot for the same runs. Kohli and Dhoni do well too. Q De Kock and Joe Root also have a very good spread of points though they have fewer innings.

p1 <-batsmanRunsVsStrikeRate(kohli,"Kohli")
p2 <- batsmanRunsVsStrikeRate(dhoni, "Dhoni")
p3 <- batsmanRunsVsStrikeRate(devilliers, "De Villiers")
p4 <- batsmanRunsVsStrikeRate(deKock,"Q de Kock")
p5 <- batsmanRunsVsStrikeRate(root,"JE Root")
p6 <- batsmanRunsVsStrikeRate(guptill,"MJ Guptill")
grid.arrange(p1,p2,p3,p4,p5,p6, ncol=3)

### 7. Batsman moving average

Kohli’s average is on a gentle increase from below 50 to around 60’s. Joe Root performance is impressive with his moving average of late tending towards the 70’s. Q De Kock seemed to have a slump around 2015 but his performance is on the increase. Devilliers consistently averages around 50. Dhoni also has been having a stable run in the last several years.

p1 <-batsmanMovingAverage(kohli,"Kohli")
p2 <- batsmanMovingAverage(dhoni, "Dhoni")
p3 <- batsmanMovingAverage(devilliers, "De Villiers")
p4 <- batsmanMovingAverage(deKock,"Q de Kock")
p5 <- batsmanMovingAverage(root,"JE Root")
p6 <- batsmanMovingAverage(guptill,"MJ Guptill")
grid.arrange(p1,p2,p3,p4,p5,p6, ncol=3)

### 8. Batsman cumulative average

The functions below provide the cumulative average of runs scored. As can be seen Kohli and Devilliers have a cumulative runs rate that averages around 48-50. Q De Kock seems to have had a rocky career with several highs and lows as the cumulative average oscillates between 45-40. Root steadily improves to a cumulative average of around 42-43 from his 50th innings

p1 <-batsmanCumulativeAverageRuns(kohli,"Kohli")
p2 <- batsmanCumulativeAverageRuns(dhoni, "Dhoni")
p3 <- batsmanCumulativeAverageRuns(devilliers, "De Villiers")
p4 <- batsmanCumulativeAverageRuns(deKock,"Q de Kock")
p5 <- batsmanCumulativeAverageRuns(root,"JE Root")
p6 <- batsmanCumulativeAverageRuns(guptill,"MJ Guptill")
grid.arrange(p1,p2,p3,p4,p5,p6, ncol=3)

### 9. Cumulative Average Strike Rate

The plots below show the cumulative average strike rate of the batsmen. Dhoni and Devilliers have the best cumulative average strike rate of 90%. The rest average around 80% strike rate. Guptill shows a slump towards the latter part of his career.

p1 <-batsmanCumulativeStrikeRate(kohli,"Kohli")
p2 <- batsmanCumulativeStrikeRate(dhoni, "Dhoni")
p3 <- batsmanCumulativeStrikeRate(devilliers, "De Villiers")
p4 <- batsmanCumulativeStrikeRate(deKock,"Q de Kock")
p5 <- batsmanCumulativeStrikeRate(root,"JE Root")
p6 <- batsmanCumulativeStrikeRate(guptill,"MJ Guptill")
grid.arrange(p1,p2,p3,p4,p5,p6, ncol=3)

### 10. Batsman runs against opposition

Kohli’s best performances are against Australia, West Indies and Sri Lanka

batsmanRunsAgainstOpposition(kohli,"Kohli")

batsmanRunsAgainstOpposition(dhoni, "Dhoni")

Kohli’s best performances are against Australia, Pakistan and West Indies

batsmanRunsAgainstOpposition(devilliers, "De Villiers")

Quentin de Kock average almost 100 runs against India and 75 runs against England

batsmanRunsAgainstOpposition(deKock, "Q de Kock")

Root’s best performances are against South Africa, Sri Lanka and West Indies

batsmanRunsAgainstOpposition(root, "JE Root")

batsmanRunsAgainstOpposition(guptill, "MJ Guptill")

### 11. Runs at different venues

The plots below give the performances of the batsmen at different grounds.

batsmanRunsVenue(kohli,"Kohli")

batsmanRunsVenue(dhoni, "Dhoni")

batsmanRunsVenue(devilliers, "De Villiers")

batsmanRunsVenue(deKock, "Q de Kock")

batsmanRunsVenue(root, "JE Root")

batsmanRunsVenue(guptill, "MJ Guptill")

### 12. Predict number of runs to deliveries

The plots below use rpart classification tree to predict the number of deliveries required to score the runs in the leaf node. For e.g. Kohli takes 66 deliveries to score 64 runs and for higher number of deliveries scores around 115 runs. Devilliers needs

par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
batsmanRunsPredict(kohli,"Kohli")
batsmanRunsPredict(dhoni, "Dhoni")
batsmanRunsPredict(devilliers, "De Villiers")

par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
batsmanRunsPredict(deKock,"Q de Kock")
batsmanRunsPredict(root,"JE Root")
batsmanRunsPredict(guptill,"MJ Guptill")

## B. Bowler functions

### 13. Get bowling details

The function below gets the overall team bowling details based on the RData file available in ODI matches. This is currently also available in Github at (https://github.com/tvganesh/yorkrData/tree/master/ODI/ODI-matches). The bowling details of the team in each match is created and a huge data frame is created by rbinding the individual dataframes. This can be saved as a RData file

setwd("C:/software/cricket-package/york-test/yorkrData/ODI/ODI-matches")
ind_bowling <- getTeamBowlingDetails("India",dir=".",save=TRUE)
dim(ind_bowling)
## [1] 7816   12
aus_bowling <- getTeamBowlingDetails("Australia",dir=".",save=TRUE)
dim(aus_bowling)
## [1] 9191   12
ban_bowling <- getTeamBowlingDetails("Bangladesh",dir=".",save=TRUE)
dim(ban_bowling)
## [1] 5665   12
sa_bowling <- getTeamBowlingDetails("South Africa",dir=".",save=TRUE)
dim(sa_bowling)
## [1] 3806   12
sl_bowling <- getTeamBowlingDetails("Sri Lanka",dir=".",save=TRUE)
dim(sl_bowling)
## [1] 3964   12

### 14. Get bowling details of the individual bowlers

This function is used to get the individual bowling record for a specified bowler of the country as in the functions below. For analyzing the bowling performances the following cricketers have been chosen

1. R A Jadeja (Ind)
2. Ravichander Ashwin (Ind)
3. Mitchell Starc (Aus)
4. Shakib Al Hasan (Ban)
5. Ajantha Mendis (SL)
6. Dale Steyn (SA)
jadeja <- getBowlerWicketDetails(team="India",name="Jadeja",dir=".")
ashwin <- getBowlerWicketDetails(team="India",name="Ashwin",dir=".")
starc <-  getBowlerWicketDetails(team="Australia",name="Starc",dir=".")
mendis <-  getBowlerWicketDetails(team="Sri Lanka",name="Mendis",dir=".")
steyn <-  getBowlerWicketDetails(team="South Africa",name="Steyn",dir=".")

### 15. Bowler Mean Economy Rate

Shakib Al Hassan is expensive in the 1st 3 overs after which he is very economical with a economy rate of 3-4. Starc, Steyn average around a ER of 4.0

p1<-bowlerMeanEconomyRate(jadeja,"RA Jadeja")
p2<-bowlerMeanEconomyRate(ashwin, "R Ashwin")
p3<-bowlerMeanEconomyRate(starc, "MA Starc")
p4<-bowlerMeanEconomyRate(shakib, "Shakib Al Hasan")
p5<-bowlerMeanEconomyRate(mendis, "A Mendis")
p6<-bowlerMeanEconomyRate(steyn, "D Steyn")
grid.arrange(p1,p2,p3,p4,p5,p6, ncol=3)

### 16. Bowler Mean Runs conceded

Ashwin is expensive around 6 & 7 overs

p1<-bowlerMeanRunsConceded(jadeja,"RA Jadeja")
p2<-bowlerMeanRunsConceded(ashwin, "R Ashwin")
p3<-bowlerMeanRunsConceded(starc, "M A Starc")
p4<-bowlerMeanRunsConceded(shakib, "Shakib Al Hasan")
p5<-bowlerMeanRunsConceded(mendis, "A Mendis")
p6<-bowlerMeanRunsConceded(steyn, "D Steyn")
grid.arrange(p1,p2,p3,p4,p5,p6, ncol=3)

### 17. Bowler Moving average

RA jadeja and Mendis’ performance has dipped considerably, while Ashwin and Shakib have improving performances. Starc average around 4 wickets

p1<-bowlerMovingAverage(jadeja,"RA Jadeja")
p2<-bowlerMovingAverage(ashwin, "Ashwin")
p3<-bowlerMovingAverage(starc, "M A Starc")
p4<-bowlerMovingAverage(shakib, "Shakib Al Hasan")
p5<-bowlerMovingAverage(mendis, "Ajantha Mendis")
p6<-bowlerMovingAverage(steyn, "Dale Steyn")
grid.arrange(p1,p2,p3,p4,p5,p6, ncol=3)

### 17. Bowler cumulative average wickets

Starc is clearly the most consistent performer with 3 wickets on an average over his career, while Jadeja averages around 2.0. Ashwin seems to have dropped from 2.4-2.0 wickets, while Mendis drops from high 3.5 to 2.2 wickets. The fractional wickets only show a tendency to take another wicket.

p1<-bowlerCumulativeAvgWickets(jadeja,"RA Jadeja")
p2<-bowlerCumulativeAvgWickets(ashwin, "Ashwin")
p3<-bowlerCumulativeAvgWickets(starc, "M A Starc")
p4<-bowlerCumulativeAvgWickets(shakib, "Shakib Al Hasan")
p5<-bowlerCumulativeAvgWickets(mendis, "Ajantha Mendis")
p6<-bowlerCumulativeAvgWickets(steyn, "Dale Steyn")
grid.arrange(p1,p2,p3,p4,p5,p6, ncol=3)

### 18. Bowler cumulative Economy Rate (ER)

The plots below are interesting. All of the bowlers seem to average around 4.5 runs/over. RA Jadeja’s ER improves and heads to 4.5, Mendis is seen to getting more expensive as his career progresses. From a ER of 3.0 he increases towards 4.5

p1<-bowlerCumulativeAvgEconRate(jadeja,"RA Jadeja")
p2<-bowlerCumulativeAvgEconRate(ashwin, "Ashwin")
p3<-bowlerCumulativeAvgEconRate(starc, "M A Starc")
p4<-bowlerCumulativeAvgEconRate(shakib, "Shakib Al Hasan")
p5<-bowlerCumulativeAvgEconRate(mendis, "Ajantha Mendis")
p6<-bowlerCumulativeAvgEconRate(steyn, "Dale Steyn")
grid.arrange(p1,p2,p3,p4,p5,p6, ncol=3)

### 19. Bowler wicket plot

The plot below gives the average wickets versus number of overs

p1<-bowlerWicketPlot(jadeja,"RA Jadeja")
p2<-bowlerWicketPlot(ashwin, "Ashwin")
p3<-bowlerWicketPlot(starc, "M A Starc")
p4<-bowlerWicketPlot(shakib, "Shakib Al Hasan")
p5<-bowlerWicketPlot(mendis, "Ajantha Mendis")
p6<-bowlerWicketPlot(steyn, "Dale Steyn")
grid.arrange(p1,p2,p3,p4,p5,p6, ncol=3)

### 20. Bowler wicket against opposition

#Jadeja's' best pertformance are against England, Pakistan and West Indies
bowlerWicketsAgainstOpposition(jadeja,"RA Jadeja")

#Ashwin's bets pertformance are against England, Pakistan and South Africa
bowlerWicketsAgainstOpposition(ashwin, "Ashwin")

#Starc has good performances against India, New Zealand, Pakistan, West Indies
bowlerWicketsAgainstOpposition(starc, "M A Starc")

bowlerWicketsAgainstOpposition(shakib,"Shakib Al Hasan")

bowlerWicketsAgainstOpposition(mendis, "Ajantha Mendis")

#Steyn has good performances against India, Sri Lanka, Pakistan, West Indies
bowlerWicketsAgainstOpposition(steyn, "Dale Steyn")

### 21. Bowler wicket at cricket grounds

bowlerWicketsVenue(jadeja,"RA Jadeja")

bowlerWicketsVenue(ashwin, "Ashwin")

bowlerWicketsVenue(starc, "M A Starc")
## Warning: Removed 2 rows containing missing values (geom_bar).

bowlerWicketsVenue(shakib,"Shakib Al Hasan")

bowlerWicketsVenue(mendis, "Ajantha Mendis")

bowlerWicketsVenue(steyn, "Dale Steyn")

### 22. Get Delivery wickets for bowlers

Thsi function creates a dataframe of deliveries and the wickets taken

setwd("C:/software/cricket-package/york-test/yorkrData/ODI/ODI-matches")
ashwin1 <- getDeliveryWickets(team="India",dir=".",name="Ashwin",save=FALSE)
starc1 <- getDeliveryWickets(team="Australia",dir=".",name="MA Starc",save=FALSE)
mendis1 <- getDeliveryWickets(team="Sri Lanka",dir=".",name="Mendis",save=FALSE)
steyn1 <- getDeliveryWickets(team="South Africa",dir=".",name="Steyn",save=FALSE)

### 23. Predict number of deliveries to wickets

#Jadeja and Ashwin need around 22 to 28 deliveries to make a break through
par(mfrow=c(1,2))
par(mar=c(4,4,2,2))
bowlerWktsPredict(ashwin1,"RAshwin")

#Starc and Shakib provide an early breakthrough producing a wicket in around 16 balls. Starc's 2nd wicket comed around the 30th delivery
par(mfrow=c(1,2))
par(mar=c(4,4,2,2))
bowlerWktsPredict(starc1,"MA Starc")
bowlerWktsPredict(shakib1,"Shakib Al Hasan")

#Steyn and Mendis take 20 deliveries to get their 1st wicket
par(mfrow=c(1,2))
par(mar=c(4,4,2,2))
bowlerWktsPredict(mendis1,"A Mendis")
bowlerWktsPredict(steyn1,"DSteyn")

## Conclusion

This concludes the 4 part introduction to my new R cricket package yorkr for ODIs. I will be enhancing the package to handle Twenty20 and IPL matches soon. You can fork/clone the code from Github at yorkr.

The yaml data from Cricsheet have already beeen converted into R consumable dataframes. The converted data can be downloaded from Github at yorkrData. There are 3 folders – ODI matches, ODI matches between 2 teams (oppnAllMatches), ODI matches between a team and the rest of the world (all matches,all oppositions).

As I have already mentioned I have around 67 functions for analysis, however I am certain that the data has a lot more secrets waiting to be tapped. So please do go ahead and run any machine learning or statistical learning algorithms on them. If you do come up with interesting insights, I would appreciate if attribute the source to Cricsheet(http://cricsheet.org), and my package yorkr and my blog Giga thoughts*, besides dropping me a note.

Hope you have a great time with my yorkr package!

Also see

# Thinking Web Scale (TWS-3): Map-Reduce – Bring compute to data

In the last decade and a half, there has arisen a class of problem that are becoming very critical in the computing domain. These problems deal with computing in a highly distributed environments. A key characteristic of this domain is the need to grow elastically with increasing workloads while tolerating failures without missing a beat.  In short I would like to refer to this as ‘Web Scale Computing’ where the number of servers exceeds several 100’s and the data size is of the order of few hundred terabytes to several Exabytes.

There are several features that are unique to large scale distributed systems

1. The servers used are not specialized machines but regular commodity, off-the-shelf servers
2. Failures are not the exception but the norm. The design must be resilient to failures
3. There is no global clock. Each individual server has its own internal clock with its own skew and drift rates. Algorithms exist that can create a notion of a global clock
4. Operations happen at these machines concurrently. The order of the operations, things like causality and concurrency, can be evaluated through special algorithms like Lamport or Vector clocks
5. The distributed system must be able to handle failures where servers crash, disk fails or there is a network problem. For this reason data is replicated across servers, so that if one server fails the data can still be obtained from copies residing on other servers.
6. Since data is replicated there are associated issues of consistency. Algorithms exist that ensure that the replicated data is either ‘strongly’ consistent or ‘eventually’ consistent. Trade-offs are often considered when choosing one of the consistency mechanisms
7. Leaders are elected democratically.  Then there are dictators who get elected through ‘bully’ing.

In some ways distributed systems behave like a murmuration of starlings (or a school of fish),  where a leader is elected on the fly (pun unintended) and the starlings or fishes change direction based on a few (typically 6) closest neighbors.

This series of posts, Thinking Web Scale (TWS) ,  will be about Web Scale problems and the algorithms designed to address this.  I would like to keep these posts more essay-like and less pedantic.

In the early days,  computing used to be done in a single monolithic machines with its own CPU, RAM and a disk., This situation was fine for a long time,  as technology promptly kept its date with Moore’s Law which stated that the “ computing power  and memory capacity’ will  double every 18 months. However this situation changed drastically as the data generated from machines grew exponentially – whether it was the call detail records, records from retail stores, click streams, tweets, and status updates of social networks of today

These massive amounts of data cannot be handled by a single machine. We need to ‘divide’ and ‘conquer this data for processing. Hence there is a need for a hundreds of servers each handling a slice of the data.

The first post is about the fairly recent computing paradigm “Map-Reduce”.  Map- Reduce is a product of Google Research and was developed to solve their need to calculate create an Inverted Index of Web pages, to compute the Page Rank etc. The algorithm was initially described in a white paper published by Google on the Map-Reduce algorithm. The Page Rank algorithm now powers Google’s search which now almost indispensable in our daily lives.

The Map-Reduce assumes that these servers are not perfect, failure-proof machines. Rather Map-Reduce folds into its design the assumption that the servers are regular, commodity servers performing a part of the task. The hundreds of terabytes of data is split into 16MB to 64MB chunks and distributed into a file system known as ‘Distributed File System (DFS)’.  There are several implementations of the Distributed File System. Each chunk is replicated across servers. One of the servers is designated as the “Master’. This “Master’ allocates tasks to ‘worker’ nodes. A Master Node also keeps track of the location of the chunks and their replicas.

When the Map or Reduce has to process data, the process is started on the server in which the chunk of data resides.

The data is not transferred to the application from another server. The Compute is brought to the data and not the other way around. In other words the process is started on the server where the data, intermediate results reside

The reason for this is that it is more expensive to transmit data. Besides the latencies associated with data transfer can become significant with increasing distances

Map-Reduce had its genesis from a Lisp Construct of the same name

Where one could apply a common operation over a list of elements and then reduce the resulting list of elements with a reduce operation

The Map-Reduce was originally created by Google solve Page Rank problem Now Map-Reduce is used across a wide variety of problems.

The main components of Map-Reduce are the following

1. Mapper: Convert all d ∈ D to (key (d), value (d))
2. Shuffle: Moves all (k, v) and (k’, v’) with k = k’ to same machine.
3. Reducer: Transforms {(k, v1), (k, v2) . . .} to an output D’ k = f(v1, v2, . . .). …
4. Combiner: If one machine has multiple (k, v1), (k, v2) with same k then it can perform part of Reduce before Shuffle

A schematic of the Map-Reduce is included below\

Map Reduce is usually a perfect fit for problems that have an inherent property of parallelism. To these class of problems the map-reduce paradigm can be applied in simultaneously to a large sets of data.  The “Hello World” equivalent of Map-Reduce is the Word count problem. Here we simultaneously count the occurrences of words in millions of documents

The map operation scans the documents in parallel and outputs a key-value pair. The key is the word and the value is the number of occurrences of the word. E.g. In this case ‘map’ will scan each word and emit the word and the value 1 for the key-value pair

So, if the document contained

“All men are equal. Some men are more equal than others”

Map would output

(all,1),  (men,1), (are,1), (equal,1), (some,1), (men,1), (are,1),  (equal,1), (than,1), (others,1)

The Reduce phase will take the above output and give sum all key value pairs with the same key

(all,1),  (men,2), (are,2),(equal,2), (than,1), (others,1)

So we get to count all the words in the document

In the Map-Reduce the Master node assigns tasks to Worker nodes which process the data on the individual chunks

Map-Reduce also makes short work of dealing with large matrices and can crunch matrix operations like matrix addition, subtraction, multiplication etc.

Matrix-Vector multiplication

As an example if we consider a Matrix-Vector multiplication (taken from the book Mining Massive Data Sets by Jure Leskovec, Anand Rajaraman et al

For a n x n matrix if we have M with the value mij in the ith row and jth column. If we need to multiply this with a vector vj, then the matrix-vector product of M x vj is given by xi

Here the product of mij x vj   can be performed by the map function and the summation can be performed by a reduce operation. The obvious question is, what if the vector vj or the matrix mij did not fit into memory. In such a situation the vector and matrix are divided into equal sized slices and performed acorss machines. The application would have to work on the data to consolidate the partial results.

Fortunately, several problems in Machine Learning, Computer Vision, Regression and Analytics which require large matrix operations. Map-Reduce can be used very effectively in matrix manipulation operations. Computation of Page Rank itself involves such matrix operations which was one of the triggers for the Map-Reduce paradigm.

Handling failures:  As mentioned earlier the Map-Reduce implementation must be resilient to failures where failures are the norm and not the exception. To handle this the ‘master’ node periodically checks the health of the ‘worker’ nodes by pinging them. If the ping response does not arrive, the master marks the worker as ‘failed’ and restarts the task allocated to worker to generate the output on a server that is accessible.

Stragglers: Executing a job in parallel brings forth the famous saying ‘A chain is as strong as the weakest link’. So if there is one node which is straggler and is delayed in computation due to disk errors, the Master Node starts a backup worker and monitors the progress. When either the straggler or the backup complete, the master kills the other process.

Mining Social Networks, Sentiment Analysis of Twitterverse also utilize Map-Reduce.

However, Map-Reduce is not a panacea for all of the industry’s computing problems (see To Hadoop, or not to Hadoop)

But the Map-Reduce is a very critical paradigm in the distributed computing domain as it is able to handle mountains of data, can handle multiple simultaneous failures, and is blazingly fast.

To see all posts click ‘Index of Posts

# An Octave primer

Here is a simple Octave Primer. Octave is a powerful language for implementing Machine Learning algorithms. As I have mentioned its strength is its simplicity. I am including some basic commands with which you can get by implementing fairly complex code

%%Matrix
A matrix can be created as a = [1 2 3; 4 7 8; 12 35 14]; % This is 3 x 3 matrix
Matrix multiplication can be done between m x n * n x k matrix as follows

a = [4 56 3; 2 3 4]; b = [23 1; 3 12; 34 12]; % a = 3 x 2 matrix b = 2 x 3 matrix c = a*b; %% c = 3 x 2 * 2 * 3 = 3 x 3 matrix

c = 362 712 191 86

%%Inverse of a matrix can be obtained by
d = pinv(c); octave-3.2.4.exe:37> d = pinv(c) d = -8.2014e-004 6.7900e-003 1.8215e-003 -3.4522e-003

%%Transpose of a matrix
e = c'; % e is the transpose of done

octave-3.2.4.exe:38> e = c' e = 362 191 712 86

The following operations are done on all elements of a matrix or a vector
k = 5; a = [1 2; 3 4; 5 6]; k = 5.23; c = k * a; d = a - 2 e = a / 5 f = a .* a % Dot product g = a .^2; % Square each elements

%% Select slice of matrix
b = a(:,2); % Select column 2 of matrix a (all rows) c = a(2,:) % Select row of matrix 'a' (all columns)

d = [7 8; 8 9; 10 11; 12 13]; % 4 rows 2 columns d(2:3,:); %Select from rows 2 to 3 (all columns)

octave-3.2.4.exe:41> d d = 7 8 8 9 10 11 12 13 octave-3.2.4.exe:43> d(2:3,:) ans = 8 9 10 11

%% Appending rows to matrix
a = [ 4 5; 5 6; 5 7; 9 8]; % 4 x 2 b = [ 1 3; 2 4]; % 2 x 2 c = [ a; b] % stack a over b d = [b ; a] % stack b over a*b 
octave-3.2.4.exe:44> a = [ 4 5; 5 6; 5 7; 9 8] % 4 x 2 a = 4 5 5 6 5 7 9 8

octave-3.2.4.exe:45> b = [ 1 3; 2 4] % 2 x 2 b = 1 3 2 4

octave-3.2.4.exe:46> c = [ a; b] % stack a over b c = 4 5 5 6 5 7 9 8 1 3 2 4

octave-3.2.4.exe:47> d = [b ; a] % stack b over a*b d = 1 3 2 4 4 5 5 6 5 7 9 8

%% Appending columns
a = [ 1 2 3; 3 4 5]; b = [ 1 2; 3 4]; c = [a b]; d = [b a];

octave-3.2.4.exe:48> a = [ 1 2 3; 3 4 5] a = 1 2 3 3 4 5

octave-3.2.4.exe:49> b = [ 1 2; 3 4] b = 1 2 3 4

octave-3.2.4.exe:50> c = [a b] c = 1 2 3 1 2 3 4 5 3 4

octave-3.2.4.exe:51> d = [b a] d = 1 2 1 2 3 3 4 3 4 5 %%Size of a matrix [c d ] = size(a); 
Creating a matrix of all zeros or ones
d = ones(3,2); e = zeros(4,3); 
%Appending an intercept term to a matrix
a = [1 2 3; 4 5 6]; %2 x 3 b = ones(2,1); a = [b a];

%% Plotting
Creating 2 vectors
x = [1 3 4 5 6]; y = [5 6 7 8 9]; plot(x,y); 
%%Create labels
xlabel("X values); ylabel("Y values); axis([1 10 4 10]); % Set the range of x and y title("Test plot);

%%Creating a 3D scatter plot
If we have 3 column csv file then we can load the data as follows
data = load('values.csv'); X = data(:, 1:2); y = data(:, 3); scatter3(X(:,1),X(:,2),y,[],[240 15 15],'x'); % X(:,1) - x axis X(:,2) - yaxis y[] - z axis

%% Drawing a 3D mesh
x = linspace(0,xrange + 20,10); y = linspace(1,yrange+ 20,10); [XX, YY ] = meshgrid(x,y); 
[a b] = size(XX)

Draw the mesh
for i=1:a, for j= 1:b, ZZ(i,j) = [1 (XX(i,j)-mu(1))/sigma(1) (YY(i,j) - mu(2))/sigma(2) ] * theta; end; end; mesh(XX,YY,ZZ);

For more details please see post Informed choices using Machine Learning 2- Pitting Kumble, Kapil and B S Chandra

%% Creating different polynomial equations
Let X be a feature vector
then
X = [X X.^2 X^3] %X X^2 X^3

This can be created using a for loop as follows
for i= 1:n xtemp = xinput .^i; x = [x xtemp]; end;

Finally while doing multivariate regression if we wanted to create polynomial terms of higher we could do as follows. Let us say we have a feature vector X made of 3 features x1, x2,

Let us say we wanted to create a polynomial of the form x1^2 x1.x2 x2^2 then we could create X as

X = [X(:,1) .^2 X(:,1) . X(:,2) X(:,2) .^2]

As you can see Octave is really powerful language for Machine Learning and has just a few handful of constructs with which one can implement powerful Machine Learning algorithms

# Applying the principles of Machine Learning

While working with multivariate regression there are certain essential principles that must be applied to ensure the correctness of the solution while being able to pick the most optimum solution. This is all the more important when the problem has a large number of features. In this post I apply these important principles to a regression data set which I was able to pull of the internet. This data set was taken from the UCI Machine Learning repository and deals with Boston housing data.  The housing data provides the cost of house in Boston suburbs given the number of rooms, the connectivity to main highways, and crime rate in the area and several other data.  There are a total of 506 data points in this data set with a total of 13 features.

This seemed a reasonable dataset to start to try out the principles of Machine Learning I had picked up from Coursera’s ML course.

Out of a total of 13 features 2 features ’ZN’ and ‘CHAS’ proximity to  Charles river were dropped as the values were mostly zero in these columns . The remaining 11 features were used to map to the output variable of the price.

The following key rules have been applied on the

• The dataset was divided into training samples (60%), cross-validation set (20%) and test set (20%) using a random index
• Try out different polynomial functions while performing gradient descent to determine the theta values
• Different combinations of ‘alpha’ learning rate and ‘lambda’ the regularization parameter were tried while performing gradient descent
• The error rate is then calculated on the cross-validation and test set
• The theta values that were obtained for the lowest cost for a polynomial was used to compute and plot the learning curve for the different polynomials against increasing number of training and cross-validation test samples to check for bias and variance.
• The plot of the cost versus the polynomial degree was plotted to obtain the best fit polynomial for the data set.

A multivariate regression hypothesis can be represented as

hθ(x) = θ0 + θ1x1 + θ2x2 + θ3x3 + θ4x4 + …
And the cost can is determined as
J(θ0, θ1, θ2, θ3..) = 1/2m ∑ (hΘ (xi) -yi)2
The implementation was done using Octave. As in my previous posts some functions have not been include to comply with Coursera’s Honor Code. The code can be cloned from GitHub at machine-learning-principles

a) housing compute.m. In this module I perform gradient descent for different polynomial degrees and check the error that is obtained when using the computed theta on the cross validation and test set

max_degrees =4; J_history = zeros(max_degrees, 1); Jcv_history = zeros(max_degrees, 1); for degree = 1:max_degrees; [J Jcv alpha lambda] = train_samples(randidx, training,cross_validation,test_data,degree); end;

b) train_samples.m – This module uses gradient descent to check the best fit for a given polynomial degree for different combinations of alpha (learning rate) and lambda( regularization).

for i = 1:length(alpha_arr), for j = 1:length(lambda_arr) alpha = alpha_arr{i}; lambda= lambda_arr{j}; % Perform Gradient descent % Compute error for training sample for computed theta values % Compute the error rate for the cross validation samples % Compute the error rate against the test set end; end;

c) cross_validation.m – This module uses the theta values to compute cost for the cross validation set

d) test-samples.m – This modules computes the error when using the trained theta on the test set

e) poly.m – This module constructs polynomial vectors based on the degree as follows
function [x] = poly(xinput, n) x = []; for i= 1:n xtemp = xinput .^i; x = [x xtemp]; end;

e) learning_curve.m – The learning curve module plots the error rate for increasing number of training and cross validation samples. This is done as follows. For the theta with the lowest cost as determined by gradient descent
for i from 1 to 100

• Compute the error for ‘i’ training samples
• Compute the error for ‘i’ cross-validation
• Plot the learning curve to determine the bias and variance of the polynomial fit

This is included below
for i = 1: 100 xsample = xtrain(1:i,:); ysample = ytrain(1:i,:); size(xsample); size(ysample); [xsample] = poly(xsample,degree); xsample= [ones(i, 1) xsample]; [c d] = size(xsample); theta = zeros(d, 1); % Minimize using fmincg J = computeCost(xsample, ysample, theta); Jtrain(i) = J; xsample_cv = xcv(1:i,:); ysample_cv = ycv(1:i,:); [xsample_cv] = poly(xsample_cv,degree); xsample_cv= [ones(i, 1) xsample_cv]; J_cv = computeCost(xsample_cv, ysample_cv,theta) Jcv(i) = J_cv; end;

Finally a plot is done been different lambda and the cost.

The results are included below

A) Polynomial degree 1
Convergence graph

Learning curve

The above figure does show a stronger bias. Note: the learning curve was done with around 100 samples
B) Polynomial degree 2

Convergence graph

Learning curve

The learning curve for degree 2 shows a stronger variance.

C) Polynomial degree 3
Convergence graph

Learning curve

D) Polynomial degree 4
Convergence graph

E) Learning curve

This plot is useful to determine which polynomial degree will give the best fit for the dataset and the lowest cost

Clearly from the above it can be seen that degree 2 will give a good fit for the data set.

F) Lambda vs Cost function

The above code demonstrates some key principles while performing multivariate regression
The code can be cloned from GitHub at machine-learning-principles

# Informed choices through Machine Learning-2: Pitting together Kumble, Kapil, Chandra

Continuing my earlier ‘innings’, of test driving my knowledge in Machine Learning acquired via Coursera,  I now turn my attention towards the bowling performances of our Indian bowling heroes. In this post I give a slightly different ‘spin’ to the bowling analysis and hope I can ‘swing’ your opinion based on my assessment.

I guess that is enough of my cricketing ‘double-speak’ for now and I will get down to the real business of my bowling analysis!

Check out my 2 books on cricket, a) Cricket analytics with cricketr b) Beaten by sheer pace – Cricket analytics with yorkr, now available in both paperback & kindle versions on Amazon!!! Pick up your copies today!

As in my earlier post Informed choices through Machine Learning – Analyzing Kohli, Tendulkar and Dravid ,the first part of the post has my analyses and the latter part has the details of the implementation of the algorithm. Feel free to read the first part and either scan or skip the latter.

To perform this analysis I have skipped the data on our recent crop of new bowlers. The reason being that data is scant on these bowlers, besides they also seem to have a relatively shorter shelf life (hope there are a couple of finds in this Australian tour of Dec 2014). For the analyses I have chosen B S Chandrasekhar, Kapil Dev Anil Kumble. My rationale as to why I chose the above 3

B S Chandrasekhar also known as “Chandra’ was one of the most lethal leg spinners in the late 1970’s. He had a very dangerous combination of fast leg breaks, searing tops spins interspersed with the  occasional googly. On many occasions he would leave most batsmen completely clueless.

Kapil Nikhanj Dev, the Haryana Hurricane who could outwit the most technically sound batsmen  through some really clever bowling. His variations were almost always effective and he would achieve the vital breakthrough outsmarting the opponent.

And finally Anil Kumble, I chose Kumble because in my opinion he is truly the embodiment of the ‘thinking’ bowler. Many times I have seen Kumble repeatedly beat batsmen. It was like he was telling the batsman ‘check’ as he bowled faster leg breaks, flippers, a straighter delivery or top spins before finally crashing into the wickets or trapping the batsmen. It felt he was saying ‘checkmate dude!’

I have taken the data for the 3 bowlers from ESPN Cricinfo. Only the Test matches were considered for the analyses. All tests against all oppositions both at home and away were included

The assumptions taken and basis of the computation is included below
a.The data is based on the following 2 input variables a) Overs bowled b) Runs given. The output variable is ‘Wickets taken’

b.To my surprise I found that in the late 1970’s when BS Chandrasekhar used to bowl, an over had 8 balls for matches in Australia. So, I had to normalize this data for Chandra to make it on par with the others. Hence for Chandra where the overs were made up of 8 balls the overs was calculated as follows
Overs (O) = (Overs * 8)/6

c.The Economy rate E was calculated as below
E = Overs/runs was chosen as input variable to take into account fewer runs given by the bowler

d.The output variable was re-calculated as Strike Rate (SR) to determine the ‘bowling effectiveness’
Strike Rate = Wickets/Overs
(not be confused with a batsman’s strike rate batsman strike rate = runs/ balls faced)

e.Hence the analysis is based on
f(O,E) = SR
An outline of the Octave code and the data used can be cloned from GitHub at ml-bowling-analyze

1. Surface of Bowling Effectiveness (SBE)
In my earlier post I was able to fit a ‘prediction plane’ based on the minutes at crease, balls faced versus the runs scored. But in this case a plane did not make sense as the wickets can only range from 0 – 10 and in most cases averaging between 3 and 5. So I plot the best fitting 3-D surface over the predicted hypothesis function. The steps performed are

1) The data for the different  bowlers were cleaned with data which indicated (DNB – Did not bowl)
2) The Economy Rate (E) = Runs given/Overs and Strike Rate(SR) = Wickets/overs were calculated.
3) The product of Overs (O), and Economy(E) were stored as Over_Economy(OE)
4) The hypothesis function was computed as h(O, E, OE) = y
5) Theta was calculated using the Normal Equation. The Surface of Bowling Effectiveness( SBE) was then plotted. The plots for each of the bowler is shown below

Here are the plots

A) Anil Kumble
The  data of Kumble, based on Overs bowled & Economy rate versus the Strike Rate is plotted as a 3-D scatter plot (pink crosses). The best fit as determined by solving the optimum theta using the Normal Equation is plotted as 3-D surface shown below.

The 3-D surface is what I have termed as ‘Surface of Bowling Effectiveness (SBE)’ as it depicts bowlers overall effectiveness as it plots the overs (O), ‘economy rate’ E against predicted ‘strike rate’ SR.
Here is another view

The theta values obtained for Kumble are
Theta =
0.104208
-0.043769
-0.016305
0.011949

And the cost at this theta is
Cost Function J = 0.0046269

B) B S Chandrasekhar
Here are the best optimal surface plot for Chandra with the data on O,E vs SR plotted as a 3D scatter plot.  Note: The dataset for  Chandrasekhar is smaller compared to the other two.
Another view for Chandra

Theta values for B S Chandrasekhar are
Theta =
0.095780
-0.025377
-0.024847
0.023415
and the cost is
Cost Function J = 0.0032980

c) Kapil Dev
The plots  for Kapil

Another view of SBE for Kapil

The Theta values and cost function for Kapil are
Theta =
0.090219
0.027725
0.023894
-0.021434
Cost Function J = 0.0035123

2. Predicting wickets
In the previous section the optimum theta with the lowest Cost Function J was calculated. Based on the value of theta, the wickets that will be taken by a bowler can be computed as the product of the hypothesis function and theta. i.e.

y= h(x) * theta  => Strike Rate (SR) = [1 O E OE] * theta
Now predicted wickets can be calculated as

wickets = Strike rate(SR) * Overs(O)
This is done  for Kumble, Chandra and Kapil  for different combinations of Overs(O) and Economy(E) rate.

Here are the results
Predicted wickets for Anil Kumble
The plot of predicted wickets for Kumble is represented below

This can also be represented as a a table

Predicted wickets for B S Chandrasekhar

The table for Chandra

Predicted wickets for Kapil Dev

The plot

The predicted table from the hypothesis function for Kapil Dev

Observation: A closer look at  the predicted wickets for Kapil, Kumble and B S Chandra shows an interesting aspect. The predicted number of wickets is higher for lower economy rates. With a little thought we can see bowlers on turning or pitches with a lot of movement can not only be more economical but can also be destructive and take a lot of wickets. Hence the higher wickets for lower economy rates!

Implementation details
In this post I have used the Normal Equation to get the optimal values of theta for local minimum of the Gradient function.  As mentioned above when I had run the 3D scatter plot fitting a 2D plane did not seem quite right. So I had to experiment with different polynomial equations first trying 2nd order, 3rd order and also the sqrt

I tried the following where ‘O is Overs, ‘E’ stands for Economy Rate and ‘SR’ the predicated Strike rate. Theta is the computed theta from the Normal Equation. The notation in  Matrix notation is shown below

i) A linear plane
SR = [1 O E] * theta

ii) Using the sqrt function
SR = [1 sqrt(O) sqrt(E)]  * theta

iii) Using 2nd order plynomial
SR = [1 O^2 E^2] * theta

iv) Using the 3rd order polynomial
SR = [1 O^3 E^3] * theta

v) Before finally settling on
SR = [1 O E OE] * theta

where OE  = O .* E

The last one seemed to give me the lowest cost and also seemed the most logical visual choice.

A good resource to play around with different functions and check out the shapes of combinations of variables and polynomial order of equation is at WolframAlpha: Plotting and Graphics

Note 1: The gradient descent with the Normal Equation has been performed on the entire data set (approx 220 for Kumble & Kapil) and 99 for Chandra. The proper process for verifying a Machine Learning algorithm is to split the data set into (60% training data, 20% cross validation data and 20% as the test set).  We need to validate the prediction function against the cross-validation set, fine tune it and finally ensure that it  fits  the test set samples well.  However, this split was not done as the data set itself was very low. The entire data set was used to perform the optimal surface fit

Note 2: The optimal theta values have been chosen with a feature vector that is of the form
[1 x y x .* y] The Surface of  Bowling Effectiveness’ has been plotted above. It may appear that there is a’high bias’ in the fit and an even better fit could be obtained by choosing higher order polynomials like
[1 x y x*y x^2 y^2 (x^2) .* y x  .* (y^2)] or
[1 x y x*y x^2 y^2 x^3 y^3]  etc
While we can get a better fit we could run into the problem of ‘high variance; and without the cross validation and test set we will not be able to verify the results, Hence the simpler option [1 x y x*y] was chosen

The Octave code outline and the data used can be cloned from GitHub at ml-bowling-analyze

Conclusion:

1) Predicted wickets: The predicted number of wickets is higher at lower economy rates
2) Comparing performances: There are different ways of looking at the results. One possible way is to check for a particular number of overs and economy rate who is most effective. Here is one way. Taking a small slice from each bowler’s predicted wickets table for anm Economy Rate=4.0 the predicted wickets are

From the above it does appear that Kapil is definitely more effective than the other two. However one could slice and dice in different ways, maybe the most economical for a given numbers and wickets combination or wickets taken in the least overs etc. Do add your thoughts. comments on my assessment or analysis