Into the Telecom vortex


“Ten little Indian boys went out to dine,
One choked his little self and then there were nine
Nine little Indian boys sat up very late;
One overslept himself and then there were eight…”

From the poem “Ten Little Indians”


You don’t need to be particularly observant to notice that the telecom landscape over the last decade and a half is full of dead organizations, bloodshed and gore. Organizations have been slain by ruthless times and bigger ones have devoured the weaker, fallen ones. Telecom titans have vanished, giants have been reduced to dwarfs.

Some telecom companies have merged in a deadly embrace trying to beat the market forces, only to capitulate to their inexorable death march.

The period from the early 1980s to the late 1990s was a glorious one for telecommunication. Digital switches (1972-1982), ISDN (1988), international calling, trunk protocols, mobile telephony (~1991), 2G, 2.5G and 3G followed one another in quick succession.

Advancement came after advancement. The future had never looked so bright for telecom companies.

The late 1990s were heady years, not just for telecom companies but for all technology companies. Stock prices soared and many stocks were over-valued, mainly due to what was famously described as the ‘irrational exuberance’ of the stock market.

Lucent, Alcatel, Ericsson, Nortel Networks, Nokia, Siemens and Telcordia all ruled supreme.

Then, around 1997-2000, the inevitable happened. The infamous dot-com bust of 2000 reduced many technology stocks to penny stocks. Telecom stocks went into a major tailspin and the share prices of telecom organizations plummeted. This situation, many felt, was further exacerbated by the fact that nothing important or earth-shattering was forthcoming from the telecom domain. In other words, there was no ‘killer app’ from telecommunication.

From 2000 onwards 3G, HSDPA, LTE etc. have all come and gone, but the markets were largely unimpressed. This was also the period of telecom’s downward slide. The last decade and a half has been extraordinarily violent: technology units of dying organizations have been cannibalized by the more successful ones.

Stellar organizations collapsed, others shrank into ‘white dwarfs’, and still others shattered with the ferocity of a supernova.

Here is a short recap of the major events.

  • 2006 – After a couple of unsuccessful attempts Alcatel and Lucent finally decide to merge
  • 2006 – Nokia marries Siemens in a 20 billion Euro deal
  • 2009-10 – Ericsson purchases Nortel’s CDMA and LTE business for $1.13 billion
  • 2009-10 – Nortel implodes
  • 2010 – Motorola sells networking unit to Nokia for $1.2 Billion
  • 2011 – Internet giant Google mops up Motorola’s handset division for $12.5 billion, largely for the patents
  • 2012 – Ericsson closes a deal with Telcordia for $1.15 billion
  • 2013 – Nokia sells its handset division to Microsoft after facing a serious beating from smartphones
  • 2015 – Nokia agrees to a $16.6 billion takeover of Alcatel Lucent

And so the story continues, like the rhyme in Agatha Christie’s mystery novel ‘And Then There Were None’

“Ten little Indian boys went out to dine,
One choked his little self and then there were nine…”

The telecom companies continue their search for the elusive ‘killer app’ as progress comes in small increments – 3G, 3.5G, 3.75G, 4G, 5G and so on.

Personally, I think the future of telecom companies lies in their ability to embrace the latest technologies – Cloud Computing, Big Data, Software Defined Networks and Software Defined Datacenters – and re-invent themselves. Rather than looking for some elusive ‘killer app’, they have to re-enter the technology scene with a Big Bang.

As I referred to in one of my earlier posts, “Architecting a cloud Based IP Multimedia System”, the proverbial pot of gold at the end of the rainbow may lie in

  1. Virtualizing the IP Multimedia Subsystem (IMS) core, namely the CSCFs (P-CSCF, S-CSCF, I-CSCF etc.),
  2. Using the features of the cloud like Software Defined Storage (SDS), load balancers and auto-scaling to elastically scale the CSCF instances up or down to handle varying ‘call traffic’
  3. Having equipment manufacturers (Nokia, Ericsson, Huawei) adopt innovative pricing models with carriers like AT&T, MCI, Airtel or Vodafone. Instead of a one-time cost for hardware and software, the equipment manufacturers would charge based on usage or call traffic (utility charging). This would be a win-win for both the equipment manufacturer and the carrier
  4. Using SDN to provide the necessary virtualized pipes between users with the necessary policies for advanced services like video-chat, white-boarding, real-time gaming etc.
  5. Using Big Data and Hadoop to analyze Call Detail Records (CDRs) and provide advanced services to customers like differential rates for calls etc

Clearly there will be challenges in this virtualized view of things. Telecom equipment is renowned for its five 9s (99.999%) availability. The challenge will be achieving this resiliency, high availability and fault-tolerance with cloud servers. How can WAN latencies be mitigated? How can SDN provide the QoS required for voice, video and data traffic in IMS?

IMS has many interesting services where video calls from laptops can be transferred as data calls to mobile phones and vice versa, from mobile networks to WiFi  and so on.

Many hurdles will have to be crossed. But this, in my opinion, is the path forward.

While the last decade and a half has been bad for the telecom industry, I personally feel we are on the verge of the next big breakthrough in telecom. Telecom will rise like the phoenix from its ashes in the next couple of years.

Also see
1. A crime map of India in R: Crimes against women
2.  What’s up Watson? Using IBM Watson’s QAAPI with Bluemix, NodeExpress – Part 1
3.  Bend it like Bluemix, MongoDB with autoscaling – Part 2
4. Informed choices through Machine Learning : Analyzing Kohli, Tendulkar and Dravid
5. Thinking Web Scale (TWS-3): Map-Reduce – Bring compute to data
6. Deblurring with OpenCV:Weiner filter reloaded

TWS-4: Gossip protocol: Epidemics and rumors to the rescue


Having successfully completed a grueling yet enjoyable ‘Cloud Computing Concepts’ course at Coursera, from the University of Illinois at Urbana-Champaign, by Prof Indranil Gupta, I continue my “Thinking Web Scale (TWS)” series of posts. In this post, I would like to dwell on the Gossip Protocol.

The gossip protocol finds its way into distributed systems from epidemiology, the branch of science that studies and models how diseases and rumors spread through society. The gossip protocol disseminates information the way diseases or rumors spread in society, or the way a computer virus is able to infect large networks very rapidly.

The gossip protocol is particularly relevant in large distributed systems with hundreds and hundreds of servers spread across multiple data centers, for e.g. social networks like Facebook or Twitter and search engines like Google. The servers that power Google’s search, or the Facebook or Twitter engines, are made of hundreds of commercial off-the-shelf (COTS) computers. This is another way of saying that the designers of these systems must fold extremely high failure rates of the servers into their design. In other words, “failures will be the norm and not the exception”.

As mentioned in my earlier post, in these large distributed systems servers will fail and new servers will be continuously joining the system. The distributed system must be able to accommodate servers joining or leaving the system. There is no global clock and each server has its own clock. To handle server failures, data is replicated over many servers, which obviously leads to issues of maintaining data consistency between the replicas.

A well-designed distributed system must include in its design the following key properties:

  1. Availability – Data should be available when you want it
  2. Consistency – Data should be consistent across multiple copies
  3. Should be fault tolerant
  4. Should be scalable
  5. Handle servers joining or leaving the systems transparently

One interesting aspect of distributed systems, much like operating systems (OS), is the fact that a lot of the design choices are based on engineering judgment. The design choices are usually a trade-off of slightly different performance characteristics. Some of them are obvious and some not so obvious.

Why Gossip protocol? What makes it attractive?

Here are some approaches

  1. Centralized Server:

Let us assume that in a network of servers we have a server (Server A) that has some piece of information which it needs to spread to other servers. One way is to have this server send the message to all the other servers. While this would work, there are 2 obvious deficiencies with this approach:

  1. The Server A will hog the bandwidth in transmitting the information to all other servers
  2. Server A will be a hot spot besides also being a Single Point of Failure

Cons: In other words, if we have a central server always disseminating information, then we run into the issue of a ‘Single Point of Failure’ of this central server.

  2. Directed Graph

Assuming that we construct a directed overlay graph over the network of servers, we could transmit the message from server A to all other servers. This approach has the advantage of less traffic, since each server node will typically have around 1-3 children, resulting in lower bandwidth utilization. The disadvantage, however, is that when an intermediate non-leaf node fails, information will not reach the children of the failed node.

 Cons: Does not handle failures of non-leaf nodes well

  3. Ring Architecture

In this architecture we could have Server A pass the message around the ring till it gets to the desired server. Each node has one predecessor and one successor. Like the previous example, this has the drawback that if one or more servers of the ring fail, then the message does not get to its destination.

Cons: Does not handle failures of nodes in the ring well

Note: We should note that these engineering choices only make sense in certain circumstances. For e.g. the directed graph or the ring structure discussed above have deficiencies for the distributed system case; however, these are accepted design patterns in computer networking, for e.g. Token Ring (IEEE 802.5) and the graph of nodes in a network. Hierarchical trees are the norm in telecom networks, where international calls reach the main trunk exchange, then the central office and finally the local office, in a route that is root – non-leaf – leaf.

  4. Gossip protocol

Enter the Gossip protocol (here is a good summary of the gossip protocol). In the gossip protocol each server sends the message to ‘b’ random peers. The value ‘b’, typically a small number, is called the fan-out. The server A which has the data is assumed to be ‘infected’. In the beginning only server A is infected, while all other servers are ‘susceptible’. Each server receiving the message is now considered to be infected. Each infected server transmits to ‘b’ other servers. It is likely that the receiving server is already infected, in which case it will drop the message.

In many ways this is similar to the way a disease spreads through a virus. The disease spreads when an infected person comes in contact with another person.

The nice part about the gossip protocol is that it is lightweight and it can infect the entire set of servers in the order of O(log N) rounds.

This is fairly obvious: in each round every infected server infects ‘b’ other servers, where ‘b’ is the fan-out, so the set of infected servers grows roughly geometrically and all servers are infected in O(log N) rounds.
The computation is as follows

Let x = n be the number of ‘susceptible’ (un-infected) servers and y = 1 be the number of ‘infected’ servers at time t = 0,
with x + y = n + 1 at all times

Let β be the contact rate between the ‘susceptible’ and the ‘infected’ (the x*y pairs); then the rate of infection can be represented as
dx/dt = -βxy

The negative sign indicates that the number of ‘non-infected’ servers decreases over time
(It is amazing how we can capture the entire essence of the spread of disease through a simple, compact equation)

The solutions of the above equation (which I have taken in good faith, as my knowledge of differential equations is a faint memory. Hope to refresh my memory when I get the chance, though!) are
x = n(n+1)/(n + e^(β(n+1)t))      – (1)
y = (n+1)/(1 + n·e^(-β(n+1)t))    – (2)

Solution (1) clearly shows that the number ‘x’ of un-infected servers at time ‘t’ rapidly drops to 0 as the denominator becomes large. The number of infected servers ‘y’ tends to n+1 as t increases, or in other words all servers get infected.
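Below is a minimal R sketch (with illustrative, assumed values for n and β) that simply evaluates solutions (1) and (2) and shows ‘x’ dropping towards 0 and ‘y’ rising towards n + 1 over time.

# Minimal R sketch of the SI-model solutions (n and beta are assumed, illustrative values)
n    <- 100                  # number of initially susceptible servers
beta <- 0.01                 # assumed contact rate
t    <- seq(0, 10, by = 0.5)

x <- n * (n + 1) / (n + exp(beta * (n + 1) * t))    # un-infected servers, eqn (1)
y <- (n + 1) / (1 + n * exp(-beta * (n + 1) * t))   # infected servers, eqn (2)

# x decays towards 0 while y tends to n + 1 as t increases
plot(t, y, type = "l", col = "red", ylim = c(0, n + 1), xlab = "time", ylab = "servers")
lines(t, x, col = "blue")
legend("right", legend = c("infected (y)", "susceptible (x)"), col = c("red", "blue"), lty = 1)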

This method, where an infected server pushes the message to ‘b’ random servers, is known as the ‘push’ approach.

Pros: The Gossip protocol is clearly more resilient to servers failing, as the gossip message is sent to ‘b’ random targets, and so it can handle failures better.
Cons: There is a possibility that the ‘b’ random targets selected for infection are already infected, in which case the infection can die out rapidly if these infected servers fail.

The solution for the above is to have a ‘pull’ approach where after a time ‘t’ the un-infected servers pull the data from random servers. This way the un-infected servers will also get infected if they pull the data from already infected servers

A third approach is to have a combination of a push-pull approach.
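To make the ‘push’ variant concrete, here is a small R simulation (a sketch with an assumed number of servers N and fan-out b) that counts the rounds needed for one message to reach every server; the round count stays close to log(N).

# Simple push-gossip simulation (sketch; N and the fan-out b are assumed values)
set.seed(42)
N <- 1024            # number of servers
b <- 2               # fan-out: each infected server pushes to b random peers per round

infected <- rep(FALSE, N)
infected[1] <- TRUE  # server 1 starts with the message
rounds <- 0

while (!all(infected)) {
    senders <- which(infected)
    # every infected server picks b random peers and infects them
    targets <- unlist(lapply(senders, function(s) sample(N, b)))
    infected[targets] <- TRUE
    rounds <- rounds + 1
}
cat("All", N, "servers infected after", rounds, "rounds; log2(N) =", log2(N), "\n")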
Gossip has been used extensively in Apache Cassandra, the NoSQL database originally developed at Facebook. Amazon’s Dynamo and the Riak NoSQL DB also use forms of the gossip protocol.

Failure detection: The gossip protocol has been used extensively in detecting failures. The failed servers are removed from the membership list and this list is gossiped so that all servers have a uniform view of the set of live servers. However, as with any approach, this is prone to a high rate of false positives, where servers are marked as ‘failed’ because of a temporary network problem even though they are actually alive. Moreover, the network load of epidemic-style membership lists is also high.

One method to handle false positives is to initially place a failed server under ‘suspicion’. When the number of messages attributing failure to this server increases above a threshold ‘t’, the server is assumed to have failed and is removed from the membership list.
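A toy R sketch of this suspicion mechanism is shown below; the server names, counters and the threshold ‘t’ are all assumed, illustrative values – the point is only the threshold logic.

# Sketch: suspicion-based failure marking (illustrative threshold logic only)
threshold <- 3                                      # assumed threshold 't'
suspicion <- c(serverA = 0, serverB = 0, serverC = 0)

report_failure <- function(suspicion, server) {
    # one gossiped message accusing 'server' of having failed
    suspicion[server] <- suspicion[server] + 1
    suspicion
}

# three independent gossip messages accuse serverB of having failed
for (k in 1:3) suspicion <- report_failure(suspicion, "serverB")

failed <- names(suspicion)[suspicion >= threshold]
cat("servers marked as failed:", failed, "\n")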

Cassandra uses a failure ‘accrual’ mechanism to detect failures in the distributed NoSQL database.

Epidemic protocols, like the gossip protocol are particularly useful in large scale distributed systems where servers leave and join the system.

One interesting application of the epidemic protocol is simply to collect the overall state of the system. Consider an information exchange where all nodes have set an internal value xi = 0, except node 1 which has x1 = 1 (‘infected’) (an example from the book Distributed Systems: Principles & Paradigms by Andrew Tanenbaum and Maarten Van Steen),

where xi = 1 if i =1, or 0 if i > 1
If the nodes repeatedly gossip this value and each pair that exchanges values computes the average (xi + xj)/2, then after a period of time every node’s value will tend towards 1/N, where N is the total number of nodes in the system. Hence all the servers in the system become aware of the total size of the system.
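A quick R sketch of this averaging gossip (with an assumed N and number of random pairwise exchanges) is shown below; every node’s value converges to 1/N, from which each node can estimate the system size as 1/value.

# Gossip-based averaging to estimate system size (sketch; N and the number of rounds are assumed)
set.seed(7)
N <- 50
x <- c(1, rep(0, N - 1))       # node 1 starts with 1, all other nodes with 0

for (round in 1:2000) {
    i <- sample(N, 1)          # a random node ...
    j <- sample(N, 1)          # ... gossips with another random node
    avg <- (x[i] + x[j]) / 2   # both replace their value with the pairwise average
    x[i] <- avg
    x[j] <- avg
}

cat("1/N =", 1 / N, "\n")
cat("first few node values:", round(head(x), 4), "\n")
cat("node 1's estimate of the system size:", round(1 / x[1]), "\n")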

Conclusion: The gossip protocol has widespread application in the distributed systems of today – spreading information, membership, failure detection, monitoring and alarming. It is really interesting to note how the theory of epidemics and rumor spreading, from epidemiology and sociology, has become so important in a field of computer science.

Also see
1. A crime map of India in R: Crimes against women
2.  What’s up Watson? Using IBM Watson’s QAAPI with Bluemix, NodeExpress – Part 1
3.  Bend it like Bluemix, MongoDB with autoscaling – Part 2
4. Informed choices through Machine Learning : Analyzing Kohli, Tendulkar and Dravid
5. Thinking Web Scale (TWS-3): Map-Reduce – Bring compute to data

Thinking Web Scale (TWS-3): Map-Reduce – Bring compute to data


In the last decade and a half, there has arisen a class of problems that is becoming very critical in the computing domain. These problems deal with computing in highly distributed environments. A key characteristic of this domain is the need to grow elastically with increasing workloads while tolerating failures without missing a beat. In short, I would like to refer to this as ‘Web Scale Computing’, where the number of servers exceeds several hundred and the data size is of the order of a few hundred terabytes to several exabytes.

There are several features that are unique to large scale distributed systems

  1. The servers used are not specialized machines but regular commodity, off-the-shelf servers
  2. Failures are not the exception but the norm. The design must be resilient to failures
  3. There is no global clock. Each individual server has its own internal clock with its own skew and drift rates. Algorithms exist that can create a notion of a global clock
  4. Operations happen at these machines concurrently. The order of the operations, things like causality and concurrency, can be evaluated through special algorithms like Lamport or Vector clocks
  5. The distributed system must be able to handle failures where servers crash, disk fails or there is a network problem. For this reason data is replicated across servers, so that if one server fails the data can still be obtained from copies residing on other servers.
  6. Since data is replicated there are associated issues of consistency. Algorithms exist that ensure that the replicated data is either ‘strongly’ consistent or ‘eventually’ consistent. Trade-offs are often considered when choosing one of the consistency mechanisms
  7. Leaders are elected democratically.  Then there are dictators who get elected through ‘bully’-ing (the Bully algorithm).

In some ways distributed systems behave like a murmuration of starlings (or a school of fish), where a leader is elected on the fly (pun unintended) and the starlings or fish change direction based on a few (typically 6) closest neighbors.

This series of posts, Thinking Web Scale (TWS), will be about Web Scale problems and the algorithms designed to address them. I would like to keep these posts more essay-like and less pedantic.

In the early days, computing used to be done on single monolithic machines, each with its own CPU, RAM and disk. This situation was fine for a long time, as technology promptly kept its date with Moore’s Law, which stated that computing power and memory capacity would double every 18 months. However, this situation changed drastically as the data generated from machines grew exponentially – whether it was call detail records, records from retail stores, click streams, tweets, or the status updates of today’s social networks.

These massive amounts of data cannot be handled by a single machine. We need to ‘divide and conquer’ this data for processing. Hence there is a need for hundreds of servers, each handling a slice of the data.

This post is about the fairly recent computing paradigm “Map-Reduce”. Map-Reduce is a product of Google Research and was developed to solve Google’s need to create an inverted index of Web pages, to compute PageRank etc. The algorithm was initially described in a white paper published by Google. The PageRank algorithm now powers Google’s search, which is almost indispensable in our daily lives.

Map-Reduce assumes that these servers are not perfect, failure-proof machines. Rather, Map-Reduce folds into its design the assumption that the servers are regular commodity servers, each performing a part of the task. The hundreds of terabytes of data are split into 16 MB to 64 MB chunks and distributed over a file system known as a Distributed File System (DFS). There are several implementations of the Distributed File System. Each chunk is replicated across servers. One of the servers is designated as the ‘Master’. This ‘Master’ allocates tasks to ‘worker’ nodes. The Master node also keeps track of the location of the chunks and their replicas.

When a Map or Reduce task has to process data, the process is started on the server on which the chunk of data resides.

The data is not transferred to the application from another server. The compute is brought to the data and not the other way around. In other words, the process is started on the server where the data and intermediate results reside.

The reason for this is that it is more expensive to transmit data than to move the computation. Besides, the latencies associated with data transfer can become significant over increasing distances.

Map-Reduce had its genesis in the Lisp constructs of the same name, where one could apply a common operation over a list of elements (map) and then combine the resulting list of elements with a reduce operation.

Map-Reduce was originally created by Google to solve the PageRank problem. Now Map-Reduce is used across a wide variety of problems.

The main components of Map-Reduce are the following

  1. Mapper: Converts every d ∈ D to a pair (key(d), value(d))
  2. Shuffle: Moves all pairs (k, v) and (k’, v’) with k = k’ to the same machine
  3. Reducer: Transforms the set {(k, v1), (k, v2), …} into an output pair (k, f(v1, v2, …))
  4. Combiner: If one machine has multiple pairs (k, v1), (k, v2) with the same k, it can perform part of the Reduce before the Shuffle

A schematic of Map-Reduce is included below


Map-Reduce is usually a perfect fit for problems that have an inherent property of parallelism. To this class of problems the map-reduce paradigm can be applied simultaneously to large sets of data. The “Hello World” equivalent of Map-Reduce is the word count problem, where we simultaneously count the occurrences of words in millions of documents.

The map operation scans the documents in parallel and outputs key-value pairs. The key is the word and the value is the number of occurrences of the word. In this case, ‘map’ will scan each word and emit the word with the value 1 as the key-value pair.

So, if the document contained

“All men are equal. Some men are more equal than others”

Map would output

(all,1), (men,1), (are,1), (equal,1), (some,1), (men,1), (are,1), (more,1), (equal,1), (than,1), (others,1)

The Reduce phase will take the above output and sum all key-value pairs with the same key

(all,1), (men,2), (are,2), (equal,2), (some,1), (more,1), (than,1), (others,1)

So we get to count all the words in the document
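Here is a toy R sketch of the word-count Map-Reduce on the sentence above; the map step emits (word, 1) pairs and the reduce step sums the counts per key. It runs on a single machine and only illustrates the idea, not a distributed implementation.

# Toy word-count "Map-Reduce" in R (single-machine sketch of the idea)
doc <- "All men are equal. Some men are more equal than others"

# Map: emit a (word, 1) pair for every word
words  <- tolower(unlist(strsplit(gsub("[[:punct:]]", "", doc), "\\s+")))
mapped <- data.frame(key = words, value = 1)

# Shuffle + Reduce: group by key and sum the values
reduced <- aggregate(value ~ key, data = mapped, FUN = sum)
print(reduced)
# gives (all,1) (are,2) (equal,2) (men,2) (more,1) (others,1) (some,1) (than,1)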

In the Map-Reduce the Master node assigns tasks to Worker nodes which process the data on the individual chunks


Map-Reduce also makes short work of dealing with large matrices and can crunch matrix operations like matrix addition, subtraction, multiplication etc.

Matrix-Vector multiplication

As an example, consider a matrix-vector multiplication (taken from the book Mining Massive Data Sets by Jure Leskovec, Anand Rajaraman et al).

For an n x n matrix M with the value mij in the ith row and jth column, the matrix-vector product of M with a vector v is the vector x whose ith element is

xi = Σj (mij × vj)

Here the product mij × vj can be performed by the map function and the summation can be performed by a reduce operation. The obvious question is, what if the vector v or the matrix M does not fit into memory? In such a situation the vector and matrix are divided into equal-sized slices and the operation is performed across machines. The application would then have to consolidate the partial results.
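A small R sketch of the same idea is shown below: the ‘map’ emits (i, mij × vj) pairs and the ‘reduce’ sums them per row index i. It is purely illustrative and runs on one machine.

# Matrix-vector multiplication expressed as map and reduce (illustrative sketch)
set.seed(1)
n <- 4
M <- matrix(rnorm(n * n), n, n)
v <- rnorm(n)

# Map: for every element m_ij emit the pair (key = i, value = m_ij * v_j)
pairs <- expand.grid(i = 1:n, j = 1:n)
pairs$value <- M[cbind(pairs$i, pairs$j)] * v[pairs$j]

# Reduce: sum all values with the same key i to get x_i
x <- tapply(pairs$value, pairs$i, sum)

# Check against R's built-in matrix-vector product
all.equal(as.numeric(x), as.numeric(M %*% v))   # TRUE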

Fortunately, several problems in Machine Learning, Computer Vision, Regression and Analytics require large matrix operations, and Map-Reduce can be used very effectively for matrix manipulation. The computation of PageRank itself involves such matrix operations, which was one of the triggers for the Map-Reduce paradigm.

Handling failures: As mentioned earlier, the Map-Reduce implementation must be resilient to failures, where failures are the norm and not the exception. To handle this, the ‘master’ node periodically checks the health of the ‘worker’ nodes by pinging them. If the ping response does not arrive, the master marks the worker as ‘failed’ and restarts the task allocated to that worker on another server so that the output remains accessible.

Stragglers: Executing a job in parallel brings forth the famous saying ‘a chain is only as strong as its weakest link’. If there is one node which is a straggler and is delayed in its computation due to, say, disk errors, the Master node starts a backup worker and monitors its progress. When either the straggler or the backup completes, the master kills the other process.

Mining social networks and sentiment analysis of the Twitterverse also utilize Map-Reduce.

However, Map-Reduce is not a panacea for all of the industry’s computing problems (see To Hadoop, or not to Hadoop)

But Map-Reduce is a very critical paradigm in the distributed computing domain, as it is able to handle mountains of data, can handle multiple simultaneous failures, and is blazingly fast.

Also see
1. A crime map of India in R: Crimes against women
2.  What’s up Watson? Using IBM Watson’s QAAPI with Bluemix, NodeExpress – Part 1
3.  Bend it like Bluemix, MongoDB with autoscaling – Part 2
4. Informed choices through Machine Learning : Analyzing Kohli, Tendulkar and Dravid

To see all posts click ‘Index of Posts

Mirror, mirror … the best batsman of them all?


“Full many a gem of purest ray serene
The dark unfathomed caves of ocean bear.”
Thomas Gray – Elegy Written in a Country Churchyard

In this post I do a fine-grained analysis of the batting performances of cricketing icons from India and from the international scene to determine how they stack up against each other. I perform 2 separate analyses: 1) between Indian legends (Sunil Gavaskar, Sachin Tendulkar & Rahul Dravid) and 2) between contemporary cricketing stars (Brian Lara, Sachin Tendulkar, Ricky Ponting and AB De Villiers)

In the world, and more so in India, Tendulkar is probably placed on a higher pedestal than all other cricketers. I was curious to know how much of this adulation is justified. In “Zen and the Art of Motorcycle Maintenance” Robert Pirsig mentions that while we cannot define Quality (in a book, music or painting), we usually know it when we see it. So do people see an ineffable quality in Tendulkar, or are they intuiting his greatness based on overall averages?

In this context, we need to keep in mind the warning that Daniel Kahneman highlights in his book ‘Thinking, Fast and Slow’. Kahneman suggests that we should regard “statistical intuition with proper suspicion and replace impression formation by computation wherever possible”. This is because our minds usually detect patterns and associations even when none actually exist.

So this analysis tries to look deeper into these aspects by performing a detailed statistical analysis.

The data for all the batsmen has been taken from ESPN Cricinfo. The data is then cleaned to remove ‘DNB’ (did not bat), ‘TDNB’ (team did not bat) etc. before generating the graphs.

The code, data and the plots can be cloned or forked from GitHub at the following link: bestBatsman. You should be able to use the code as-is for any other batsman you choose.

Feel free to agree, disagree, dispute or argue with my analysis.

The batting performance of each of the cricketers is described in 3 plots: a) combined boxplot & histogram b) runs frequency vs runs plot c) mean strike rate vs runs plot

A) Batting performance of Sachin Tendulkar

a) Combined Boxplot and histogram of runs scored
srt-boxhist1

The above graph is a combined boxplot and histogram. The boxplot at the top shows the 1st quartile (25th percentile), which is the left side of the green rectangle, the 3rd quartile (75th percentile), which is the right side of the green rectangle, and the mean and the median. These values are also shown in the histogram below it. The histogram gives the frequency of runs scored in each range, e.g. (0-10, 11-20, 21-30 etc.) for Tendulkar
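A rough R sketch of how such a combined boxplot + histogram can be laid out is shown below; the data frame name ‘batsman’ and its ‘Runs’ column are assumptions standing in for the cleaned Cricinfo data.

# Sketch: combined boxplot + histogram of runs scored
# (assumes a cleaned data frame 'batsman' with a numeric Runs column)
layout(matrix(c(1, 2), nrow = 2), heights = c(1, 3))
par(mar = c(0, 4, 2, 2))
boxplot(batsman$Runs, horizontal = TRUE, axes = FALSE, col = "green")
par(mar = c(4, 4, 0, 2))
hist(batsman$Runs, breaks = seq(0, max(batsman$Runs) + 10, by = 10),
     main = "", xlab = "Runs scored", col = "grey")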

b) Batting performance – Runs frequency vs Runs
srt-perf

The graph above plots the  best fitting curve for Runs scored in the frequency ranges.

c) Mean Strike Rate vs Runs
srt-sr

This plot computes the Mean Strike Rate for each runs interval. For e.g., if in the range 11-21 the strike rates were 40.5, 48.5, 32.7 and 56.8, then the average of these values is computed for the range 11-21: (40.5 + 48.5 + 32.7 + 56.8)/4. This is done for all ranges; the Mean Strike Rate in each range is plotted and a loess curve is fitted to this data.
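A rough R sketch of this computation is given below; again the data frame ‘batsman’ and its numeric ‘Runs’ and ‘SR’ (strike rate) columns are assumptions about the cleaned data, and lowess is used here as a stand-in for the loess fit.

# Sketch: mean strike rate for each runs range
# (assumes a cleaned data frame 'batsman' with numeric Runs and SR columns)
breaks <- seq(0, max(batsman$Runs) + 20, by = 20)
rng    <- cut(batsman$Runs, breaks = breaks, include.lowest = TRUE)

# average the strike rates of all innings that fall in each range
meanSR <- tapply(batsman$SR, rng, mean, na.rm = TRUE)

plot(breaks[-1], meanSR, xlab = "Runs range (upper limit)", ylab = "Mean Strike Rate")
ok <- !is.na(meanSR)
lines(lowess(breaks[-1][ok], meanSR[ok]))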

B) Batting performance of Rahul Dravid
a) Combined Boxplot and histogram of runs scored
dravid-boxhist1

The mean, median, and the 25th and 75th percentiles of the runs scored by Rahul Dravid are shown above

b) Batting performance – Runs frequency vs Runs
dravid-perf

c) Mean Strike Rate vs Runs
dravid-sr

C) Batting performance of Sunil Gavaskar
a) Combined Boxplot and histogram of runs scored
gavaskar-boxhist1

The mean, median, and the 25th and 75th percentiles of the runs scored by Sunil Gavaskar are shown above
b) Batting performance – Runs frequency vs Runs
gavaskar-perf

c) Mean Strike Rate vs Runs
gavaskar-sr
D) Relative performances of Tendulkar, Dravid and Gavaskar
relative-perf1

The above plot computes the percentage of the total career runs scored in a given range for each of the batsmen.
For e.g., if Dravid scored 23, 22, 28, 21 and 25 in the range 21-30, then for that range
percentageRuns = (23 + 22 + 28 + 21 + 25)/(total runs in career) * 100
The above plot shows that Rahul Dravid has a higher contribution in the range 20-70, while Tendulkar has a larger percentage in the range 150-230
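The corresponding R sketch is below; as before, ‘batsman’ and its ‘Runs’ column are assumed names for the cleaned data.

# Sketch: percentage of total career runs contributed in each runs range
# (assumes a cleaned data frame 'batsman' with a numeric Runs column)
breaks <- seq(0, max(batsman$Runs) + 10, by = 10)
rng    <- cut(batsman$Runs, breaks = breaks, include.lowest = TRUE)

runsByRange <- tapply(batsman$Runs, rng, sum, na.rm = TRUE)   # runs scored in each range
pctRuns     <- runsByRange / sum(batsman$Runs) * 100          # share of total career runs
round(pctRuns, 1)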

E) Relative Strike Rates of Tendulkar, Dravid and Gavaskar
relative-SR

With respect to the Mean Strike Rate Tendulkar is clearly superior to both Gavaskar & Dravid

F) Analysis of Tendulkar, Dravid and Gavaskar
rel-perf1

The above table captures the career details of each of the batsmen.
The following points can be noted
1) The ‘number of innings’ is the data you get after removing rows with DNB, TDNB etc.
2) Tendulkar has the highest average: 48.39 > Gavaskar (47.3) > Dravid (46.46)
3) The skew of Dravid (1.67) is the greatest, which implies that his runs scored are more skewed to the right (greater runs) relative to the mean

G) Batting performance of Brian Lara
a) Combined Boxplot and histogram of runs scored
lara-boxhist1
The mean, median, 1st and 3rd quartile are shown above

b) Batting performance – Runs frequency vs Runs
lara-perf

c) Mean Strike Rate vs Runs
lara-sr

H) Batting performance of Ricky Ponting
a) Combined Boxplot and histogram of runs scored
ponting-boxhist1

b) Batting performance – Runs frequency vs Runs
ponting-perf

c) Mean Strike Rate vs Runs
ponting-SR

I) Batting performance of AB De Villiers
a) Combined Boxplot and histogram of runs scored
devilliers-boxhist1

b) Batting performance – Runs frequency vs Runs
devillier-perf

c) Mean Strike Rate vs Runs
devilliers-SR

J) Relative performances of Tendulkar, Lara, Ponting and De Villiers
relative-perf-intl1

Clearly De Villiers is ahead in the percentage of runs scored in the range 30-80. Tendulkar is better in the range 80-120. Lara’s career has a long tail.

K) Relative Strike Rates of Tendulkar, Lara, Ponting and De Villiers
relative-SR-intl

The Mean Strike Rate of Lara is ahead of the lot, followed by De Villiers, Ponting and then Tendulkar
L) Analysis of Tendulkar, Lara, Ponting and De Villiers
rel-perf-intl1
The following can be observed from the above table
1) Brian Lara has the highest average (51.52) > Sachin Tendulkar (48.39) > Ricky Ponting (46.61) > AB De Villiers (46.55)
2) Brian Lara also has the highest skew, which means that his data is more skewed to the right of the mean than the others

You can clone the code from GitHub at the following link: bestBatsman. You should be able to use the code as-is for any other batsman you choose.

Also see
1. Informed choices through Machine Learning : Analyzing Kohli, Tendulkar and Dravid
2. Informed choices through Machine Learning-2: Pitting together Kumble, Kapil, Chandra
3. Analyzing cricket’s batting legends – Through the mirage with R
4. Masters of spin – Unraveling the web with R

You may also like
1. A peek into literacy in India:Statistical learning with R
2. A crime map of India in R: Crimes against women
3.  What’s up Watson? Using IBM Watson’s QAAPI with Bluemix, NodeExpress – Part 1
4.  Bend it like Bluemix, MongoDB with autoscaling – Part 2

Masters of Spin: Unraveling the web with R


Here is a look at some of the masters of spin bowling in cricket. Specifically, this post analyzes 3 giants of spin bowling in recent times, namely Shane Warne of Australia, Muthiah Muralitharan of Sri Lanka and our very own Anil Kumble of India. Who is the best spinner has been a hot topic in cricket in recent years. As in my earlier post “Analyzing cricket’s batting legends: Through the mirage with R”, I was not interested in gross statistics like most wickets taken.

In this post I try to analyze how each bowler has performed over his entire Test career. All 3 bowlers have bowled around ~240 innings. All other things being equal, it does make sense to look a little deeper into what their performance numbers reveal about them. As in my earlier posts, the data has been taken from ESPN Cricinfo’s Statsguru

I have chosen these 3 spinners for the following reasons

Shane Warne : Clearly a deadly spinner who can turn the ball at absurd angles
Muthiah Muralitharan : While controversy dogged Muralitharan, he was virtually unplayable at many cricketing venues
Anil Kumble: A master spinner whose chess like strategy usually outwitted the best of batsmen.

The King of Spin, according to my analysis below, is clearly Muthiah Muralitharan. This is shown in the final charts, where the performances of the bowlers are plotted on a single graph. Muralitharan is clearly a much more lethal bowler and has a higher strike rate. In addition, Muralitharan has the lowest mean economy rate amongst the 3 for wickets in the range 3 to 7. Feel free to add your own thoughts, comments and dissent.

The code for this implementation is available on GitHub at mastersOfSpin. Feel free to clone, fork or hack the code to your own needs. You should be able to use the code as-is on other bowlers with little or no modification

So here goes

Wickets frequency percentage vs Wickets plot
For this plot I determine how frequently the bowler takes ‘n’ wickets in his career and calculate the percentage over his entire career.  In other words this is done as follows in R

# Create a table of Wickets vs the frequency of the wickets
# (assuming the bowler's innings are in a data frame 'bowler')
wktsDF <- as.data.frame(table(bowler$Wkts))
colnames(wktsDF) <- c("Wickets","Freq")
# Calculate the wickets percentage
wktsDF$freqPercent <- (wktsDF$Freq/sum(wktsDF$Freq)) * 100

and plot this as a graph.

This is shown for Warne below
1) Shane Warne –  Wickets Frequency percentage vs Wickets plot

warne-wkts-1

Wickets – Mean Economy rate chart
This chart plots the mean economy rate for ‘n’ wickets for the bowler. As an example, to do this for 3 wickets for Shane Warne, a list is created of the economy rates in all innings in which Warne took 3 wickets. The average of this list is then computed and stored against Warne’s 3 wickets. This is done for all wicket counts in Warne’s career. The R snippet for this implementation is shown below

econRate <- NULL
for (i in 0:max(as.numeric(as.character(bowler$Wkts)))) {
    # Create a vector of economy rates for innings with 'i' wickets
    a <- bowler[bowler$Wkts == i,]$Econ
    b <- as.numeric(as.character(a))
    # Compute the mean economy rate for 'i' wickets by using lapply on the list
    econRate[i+1] <- lapply(list(b),mean)
    print(econRate[[i+1]])
}

Shane Warne –  Wickets vs Mean Economy rate
This plot for Shane Warne is shown below

warne-er-1

The plots for M Muralitharan and Anil Kumble are included below

2) M Muralitharan – Wickets Frequency percentage vs Wickets plot
murali-wkts

M Muralitharan – Wickets vs Mean Economy rate

murali-er

3) Anil Kumble – Wickets Frequency percentage vs Wickets plot
kumble-wkts

Anil Kumble – Wickets vs Mean Economy rate
kumble-er

Finally the relative performance of the bowlers is generated by creating a single chart where the wicket frequencies and the mean economy rate vs wickets is plotted.

This is shown below

Relative wicket percentages
relative-wkts-pct-1

Relative mean economy rate
relative-er-1

As can be seen in the above 2 charts, M Muralitharan not only has a higher strike rate for wickets in the 3 to 7 range, he also has a much lower mean economy rate

You can clone/fork the R code from GitHub at mastersOfSpin

Conclusion: The performance of Muthiah Muralitharan is clearly superior to that of both Shane Warne and Anil Kumble. In my opinion the king of spin is M Muralitharan, followed by Shane Warne and finally Anil Kumble

Feel free to dispute my claims. Comments, suggestions are more than welcome

Also see

1. Informed choices through Machine Learning : Analyzing Kohli, Tendulkar and Dravid
2. Informed choices through Machine Learning-2: Pitting together Kumble, Kapil, Chandra
3. Analyzing cricket’s batting legends – Through the mirage with R

You may also like
1. A peek into literacy in India:Statistical learning with R
2. A crime map of India in R: Crimes against women
3.  What’s up Watson? Using IBM Watson’s QAAPI with Bluemix, NodeExpress – Part 1
4.  Bend it like Bluemix, MongoDB with autoscaling – Part 2

Analyzing cricket’s batting legends – Through the mirage with R


In this post I do a deep dive into the records of the all-time batting legends of cricket to identify interesting information about their achievements. In my opinion, the usual currencies for a batsman’s performance, like the most number of centuries or the highest batting average, are too gross in their significance. I wanted something finer, where we can pin-point the specific strengths of different players

This post will answer the following questions.
– How many times has a batsman scored runs in a specific range say 20-40 or 80-100 and so on?
– How do different batsmen compare against each other?
– Which of the batsmen stayed well beyond their sell-by date?
– Which of the batsmen retired too soon?
– What is the propensity for a batsman to get caught, bowled, run out etc.?

For this analysis I have chosen the batsmen below for the following reasons
Sir Don Bradman : With a  batting average of 99.94 Bradman was an obvious choice
Sunil Gavaskar: One of India’s batting icons, who amassed 774 runs in his debut series against the formidable West Indies in the West Indies
Brian Lara : A West Indian batting hero who has double, triple and quadruple centuries under his belt
Sachin Tendulkar: A prolific run getter, India’s idol, who holds the record for most test centuries by any batsman (51 centuries)
Ricky Ponting: A dangerous batsman against any bowling attack who could demolish any bowler on his day
Rahul Dravid: He was India’s most dependable batsman who could weather any storm in a match single-handedly
AB De Villiers : The destructive South African batsman who can pulverize any attack when he gets going

The analysis has been performed on these batsmen on various parameters. Clearly different batsmen have shone in different batting aspects. The analysis focuses on each of these to see how the different players stack up against each other.

The data for the above batsmen has been taken from ESPN Cricinfo. Only the batting statistics of the above batsmen in Test cricket have been taken. The implementation for this analysis has been done using the R language. The R implementation, datasets and the plots can be accessed on GitHub at analyze-batting-legends. Feel free to fork or clone the code. You should be able to use the code with minor modifications on other players. Also, go ahead, make your own modifications and hack away!

Key insights from my analysis below
a) Sir Don Bradman’s unmatchable record of 99.94 test average with several centuries, double and triple centuries makes him the gold standard of test batting as seen in the ‘All-time best batsman below’
b) Sunil Gavaskar is the king of batting in India, followed by Rahul Dravid and finally Sachin Tendulkar. See the charts below for details
c) Sunil Gavaskar and Rahul Dravid had at least 2 more years of good test cricket in them. Their retirement was premature. This is based on the individual batsmen’s career graphs (moving average below)
d) Brian Lara, Sachin Tendulkar, Ricky Ponting, Vivian Richards retired at a time when their batting was clearly declining. The writing on the wall was clear and they had to go (see moving average below)
e) The biggest hitter of 4’s was Vivian Richards, with Brian Lara in 2nd place. Tendulkar & Dravid follow behind. Dravid is a surprise as he has the image of a defender.
f) While Sir Don Bradman made huge scores, the number of 4’s in his innings was significantly less. This could be because the grounds in those days did not carry the ball far enough
g) With respect to dismissals, Richards was able to keep his wicket intact (remaining not out 11% of the time), followed by Ponting, Tendulkar, De Villiers and Dravid (10%), and Gavaskar & Bradman (7%)

A) Runs frequency table and charts
These plots normalize the batting performance of different batsmen – since the number of innings played ranges from 89 (Bradman) to 348 (Tendulkar) – by calculating the percentage frequency with which the batsman scores runs in a particular range. For e.g., Sunil Gavaskar made scores between 60-80 in 10% of his total innings

This is shown in a tabular form below

runs-frequency
The individual charts for each of the players are shown below. The top performers, after removing the ranges 0-20 & 20-40, are
Between 40-60 runs – 1) Ricky Ponting (16.4%) 2) Brian lara (15.8%) 3) AB De Villiers (14.6%)
Between 60-80 runs – 1) Vivian Richards (18%) 2) AB De Villiers (10.2%) 3) Sunil Gavaskar (10%)
Between 80-100 runs – 1) Rahul Dravid (7.6%) 2) Brian Lara (7.4%) 3) AB De Villiers (6.4%)
Between 100 -120 runs – 1) Sunil Gavaskar (7.5%) 2) Sir Don Bradman (6.8%) 3) Vivian Richards (5.8%)
Between 120-140 runs – 1) Sir Don Bradman (6.8%) 2) Sachin Tendulkar (2.5%) 3) Vivian Richards (2.3%)

The percentage frequency for Brian Lara is included below
1) Brian Lara
lara-run-freq

The above chart shows that, out of the total number of innings played by Brian Lara, he scored runs in the range (40-60) 16% of the time. The chart also shows that Lara scored between 0-20 about 40% of the time, while scoring in the ranges 360-380 & 380-400 around 1% of the time.
The same chart is displayed as a continuous graph below
lara-run-perf

The run frequency charts for other batsman are
2) Sir Don Bradman
a) Run frequency
bradman-freq
Note: Notice the significant contributions by Sir Don Bradman in the ranges 120-140,140-160,220-240,all the way up to 340
b) Performance
bradman-perf
3) Sunil Gavaskar
a) Runs frequency chart
gavaskar-freq
b) Performance chart
gavaskar-perf
4) Sachin Tendulkar
a) Runs frequency chart
tendulkar-freq
b) Performance chart
tendulkar-perf
5) Ricky Ponting
a) Runs frequency
ponting-freq
b) Performance
ponting-perf
6) Rahul Dravid
a) Runs frequency chart
dravid-freq
b) Performance chart
dravid-perf
7) Vivian Richards
a) Runs frequency chart
richards-freq
b) Performance chart
richards-perf
8) AB De Villiers
a) Runs frequency chart
villiers-freq
b)  Performance chart
villier-perf

 B) Relative performance of the players
In this section I try to measure the relative performance of the players by superimposing the performance graphs obtained above.  You may say that “comparisons are odious!”. But equally odious are myths that are based on gross facts like highest runs, average or most number of centuries.
a) All-time best batsman
(Sir Don Bradman, Sunil Gavaskar, Vivian Richards, Sachin Tendulkar, Ricky Ponting, Brian Lara, Rahul Dravid, AB De Villiers)
overall-batting-perf
From the above chart it is clear that Sir Don Bradman is the ‘gold’ standard in batting. He is well above the others for run ranges from 100 to 350
b) Best Indian batsman (Sunil Gavaskar, Sachin Tendulkar, Rahul Dravid)
srt-sg-dravid-perf
The above chart shows that Gavaskar is ahead of the other two for key ranges between 100-130, with almost an 8% contribution of total runs. This is followed by Dravid, who is ahead of Tendulkar in the range 80-120. According to me the all-time best Indian batsmen are 1) Sunil Gavaskar 2) Rahul Dravid 3) Sachin Tendulkar

c) Best batsman -( Brian Lara, Ricky Ponting, Sachin Tendulkar, AB De Villiers)
This chart was prepared since this comparison was often made in recent times

rel

This chart shows the following ranking 1) AB De Villiers 2) Sachin Tendulkar 3) Brian Lara/Ricky Ponting
C) Chart of 4’s

fours-batsman
This chart is plotted with a 2nd order curve of the number of  4’s versus the total runs in the innings
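A rough R sketch of such a 2nd order fit is given below; the data frame name ‘batsman’ and the column name ‘X4s’ (the number of 4s in an innings) are assumptions about how the cleaned Cricinfo data might look after read.csv.

# Sketch: 2nd order curve of the number of 4s versus runs in the innings
# (assumes a cleaned data frame 'batsman' with numeric Runs and X4s columns)
fit <- lm(X4s ~ poly(Runs, 2), data = batsman)

plot(batsman$Runs, batsman$X4s, xlab = "Runs", ylab = "Number of 4s")
ord <- order(batsman$Runs)
lines(batsman$Runs[ord], predict(fit)[ord], col = "blue", lwd = 2)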
1) Brian Lara
bradman-4s
2) Sir Don Bradman
bradman-4s
3) Sunil Gavaskar
gavaskar-4s
4) Sachin Tendulkar
tendulkar-4s
5) Ricky Ponting
ponting-4s
6) Rahul Dravid
dravid-4s
7) Vivian Richards
richards-4s
8) AB De Villiers
villiers-4s
D) Proclivity for type of dismissal
The below charts show how often the batsman was out bowled, caught, run out etc
1) Brian Lara
lara-dismissals
2) Sir Don Bradman
bradman-dismissals
3) Sunil  Gavaskar
gavaskar-dismissals
4) Sachin Tendulkar
tendulkar-dismissals
5) Ricky Ponting
ponting-dismissals
6) Rahul Dravid
dravid-dismissals
7) Vivian Richard
richards-dismissals
8) AB De Villiers
villiers-dismissals
E) Moving Average
The plots below show the performance of each batsman as a time series (chronological), displayed as the continuous grey line. A moving average is computed using loess regression and is shown as the dark line. This dark line represents the player’s improvement or decline in performance. The moving average plots are shown below
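The moving-average idea can be sketched in R roughly as below; the data frame ‘batsman’, its ‘Date’ and ‘Runs’ columns and the date format are all assumptions about the cleaned data.

# Sketch: career time series with a loess 'moving average'
# (assumes a cleaned data frame 'batsman' with Date and numeric Runs columns)
batsman$Date <- as.Date(batsman$Date, format = "%d %b %Y")   # the date format is an assumption
batsman <- batsman[order(batsman$Date), ]

plot(batsman$Date, batsman$Runs, type = "l", col = "grey", xlab = "Year", ylab = "Runs")
fit <- loess(Runs ~ as.numeric(Date), data = batsman)
lines(batsman$Date, predict(fit), col = "black", lwd = 2)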
1) Brian Lara
lara-ma
2) Sir Don Bradman
bradman-ma
Sir Don Bradman’s moving average shows a remarkably consistent performance over the years. He probably could have continued for a couple more years
3) Sunil Gavaskar


Gavaskar’s moving average does show a good improvement from a dip around 1983. Gavaskar retired, bowing to public pressure, on the mistaken belief that he was under-performing. Gavaskar could have continued for a couple more years
4) Sachin Tendulkar


Tendulkar’s performance is clearly on the decline from 2011. He could have announced his retirement at least 2 years earlier
5) Ricky Ponting
ponting-ma
Ponting’s peak performance was around 2005 and it goes steeply downward from then on. Ponting could also have retired around 2012
6) Rahul Dravid


Dravid seems to have recovered very effectively from his poor form around 2009. His overall performance shows steady improvement. Dravid’s retirement announcement appeared impulsive. Dravid had another 2 good years of test cricket in him
7) Vivian Richards
richards-ma
Richards’ performance seems to have dropped around 1984 and remained that way.
8) AB De Villiers
villiers-ma
AB De Villiers’ moving average shows a steady upward swing from 2009 onwards. De Villiers has at least 3-4 years of great test cricket ahead of him.

Finally as mentioned above the dataset, the R implementation and all the charts are available at GitHub at analyze-batting-legends. Feel free to fork and clone the code. The code should work for other batsman as-is. Also go ahead and make any modifications for obtaining further insights.

Conclusion: The batting legends have been analyzed from various angles, namely i) the frequency of runs scored in a particular range ii) how each batsman compares with others for relative runs in a specified range iii) how the batsman gets out iv) the peak and lean periods of the batsman and whether they recovered or slumped during these periods. While the batsmen played in different eras, I think that in an overall sense their performance under the conditions of their time is comparable.
Anyway, feel free to let me know your thoughts. If you see other patterns in the data, do drop in a comment.

You may also like
1. Informed choices through Machine Learning : Analyzing Kohli, Tendulkar and Dravid
2. Informed choices through Machine Learning-2: Pitting together Kumble, Kapil,

Also see
– A crime map of India in R – Crimes against women
– What’s up Watson? Using IBM Watson’s QAAPI with Bluemix, NodeExpress – Part 1
– Bend it like Bluemix, MongoDB with autoscaling – Part 1

R incantations for the uninitiated


Here are some basic R incantations that will get you started with R

A) Scalars & Vectors:
Chant 1 – Now repeat after me, with your right hand forward at shoulder height: “In R there are no scalars. There are only vectors of length 1”.
Just kidding:-)

To create an integer variable x with a value 5 we write
x <- 5 or
x = 5

While the former notation may seem odd, it is actually more logical considering that the RHS is assigned to LHS. Anyway both seem to work
Vectors can be created as follows
a <- c( 2:10)
b <- c("This", "is", "R", "language")

B) Sequences:
There are several ways of creating sequences of numbers which you intend to use for your computation
a <- seq(5, 25)   # Sequence from 5 to 25

Other ways to create sequences
Increment by 2
> seq(5, 25, by=2)
[1]  5  7  9 11 13 15 17 19 21 23 25

>seq(5,25,length=18) # Create sequence from 5 to 25 with a total length of 18
[1]  5.000000  6.176471  7.352941  8.529412  9.705882 10.882353 12.058824 13.235294
[9] 14.411765 15.588235 16.764706 17.941176 19.117647 20.294118 21.470588 22.647059
[17] 23.823529 25.000000

C) Conditions and loops
An if-else if-else construct goes like this (shown here with a small runnable example)
x <- 7
if (x > 10) {
    print("big")
} else if (x > 5) {
    print("medium")
} else {
    print("small")
}

Note: Make sure the statements appear as above, with the ‘else if’ and ‘else’ appearing on the same line as the closing braces; otherwise R complains about an ‘unexpected else’

D) Loops
I would like to mention 2 ways of doing ‘for’ loops  in R.
a) for (i in 1:10) {
statement
}

> a <- seq(5,25,length=10)
> a
[1]  5.000000  7.222222  9.444444 11.666667 13.888889 16.111111 18.333333
[8] 20.555556 22.777778 25.000000

b) Sequence along the vector sequence. Note: This is useful as we don’t have to know  the length of the vector/sequence
for (i in seq_along(a)){
+   print(a[i])
+ }

[1] 5
[1] 7.222222
[1] 9.444444
[1] 11.66667

There are others ways of looping with ‘while’ and ‘repeat’ which I have not included in this post.

R makes manipulation of matrices and data frames really easy. All the elements in a matrix are numeric while data frames can have different types for each of the element

E) Matrix
> rnorm(12,5,2)
[1] 2.699961 3.160208 5.087478 3.969129 3.317840 4.551565 2.585758 2.397780
[9] 5.297535 6.574757 7.468268 2.440835

a) Create a vector of 12 random numbers with a mean of 5 and SD of 2
> a <-rnorm(12,5,2)
b) Convert the vector to a matrix with 4 rows and 3 columns
> mat <- matrix(a,4,3)
> mat
         [,1]     [,2]     [,3]
[1,] 5.197010 3.839281 9.022818
[2,] 4.053590 5.321399 5.587495
[3,] 4.225763 4.873768 6.648151
[4,] 4.709784 4.129093 2.575523

c) Subset rows 1 & 2 from the matrix
> mat[1:2,]
[,1]     [,2]     [,3]
[1,] 5.19701 3.839281 9.022818
[2,] 4.05359 5.321399 5.587495

d) Subset the matrix for rows 1 & 2 and columns 2 & 3
> mat[1:2,2:3]
[,1]     [,2]
[1,] 3.839281 9.022818
[2,] 5.321399 5.587495

e) Subset the matrix for all rows of column 3
> mat[,3]
[1] 9.022818 5.587495 6.648151 2.575523

f) Add row names and column names for the matrix as follows
> names <- c("tim","pat","joe","jim")
> v <- data.frame(names,mat)
> v
names       X1       X2       X3
1   tim 5.197010 3.839281 9.022818
2   pat 4.053590 5.321399 5.587495
3   joe 4.225763 4.873768 6.648151
4   jim 4.709784 4.129093 2.575523

> colnames(v) <- c("names","a","b","c")
> v
names        a        b        c
1   tim 5.197010 3.839281 9.022818
2   pat 4.053590 5.321399 5.587495
3   joe 4.225763 4.873768 6.648151
4   jim 4.709784 4.129093 2.575523

F) Data Frames
In R, data frames are the most important method for manipulating large amounts of data. One can read data in .csv format into a data frame using
df <- read.csv("mydata.csv")
To get a feel for data frames it is useful to play around with the numerous data sets that are available with the R installation.
To check the available data sets do
>data()
AirPassengers                    Monthly Airline Passenger Numbers 1949-1960
BJsales                          Sales Data with Leading Indicator
BJsales.lead (BJsales)           Sales Data with Leading Indicator
BOD                              Biochemical Oxygen Demand
CO2                              Carbon Dioxide Uptake in Grass Plants
ChickWeight                      Weight versus age of chicks on different diets
...

I will be using the mtcars data frame. Here are some of the most important commands on data frames
a) load data from mtcars
data(mtcars)
b) > head(mtcars,3) # Display the top 3 rows of the data frame
mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1

c) > tail(mtcars,4) # Display the bottom 4 rows of the data frame
mpg cyl disp  hp drat   wt qsec vs am gear carb
Ford Pantera L 15.8   8  351 264 4.22 3.17 14.5  0  1    5    4
Ferrari Dino   19.7   6  145 175 3.62 2.77 15.5  0  1    5    6
Maserati Bora  15.0   8  301 335 3.54 3.57 14.6  0  1    5    8
Volvo 142E     21.4   4  121 109 4.11 2.78 18.6  1  1    4    2

d) > names(mtcars)  # Display the names of the columns of the data frame
[1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear" "carb"

e) > summary(mtcars) # Display the summary of the data frame
mpg             cyl             disp             hp             drat             wt
Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0   Min.   :2.760   Min.   :1.513
1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5   1st Qu.:3.080   1st Qu.:2.581
Median :19.20   Median :6.000   Median :196.3   Median :123.0   Median :3.695   Median :3.325
Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7   Mean   :3.597   Mean   :3.217
3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0   3rd Qu.:3.920   3rd Qu.:3.610
Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0   Max.   :4.930   Max.   :5.424
qsec             vs               am              gear            carb
Min.   :14.50   Min.   :0.0000   Min.   :0.0000   Min.   :3.000   Min.   :1.000
1st Qu.:16.89   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000
Median :17.71   Median :0.0000   Median :0.0000   Median :4.000   Median :2.000
Mean   :17.85   Mean   :0.4375   Mean   :0.4062   Mean   :3.688   Mean   :2.812
3rd Qu.:18.90   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000
Max.   :22.90   Max.   :1.0000   Max.   :1.0000   Max.   :5.000   Max.   :8.000

f) > str(mtcars) # Generate a concise description of the data frame - values in each column, factors
'data.frame':   32 obs. of  11 variables:
$ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
$ disp: num  160 160 108 258 360 ...
$ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
$ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num  16.5 17 18.6 19.4 17 ...
$ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
$ am  : num  1 1 1 0 0 0 0 0 0 0 ...
$ gear: num  4 4 4 3 3 3 3 4 4 4 ...
$ carb: num  4 4 1 1 2 1 4 2 2 4 ...

g) > mtcars[mtcars$mpg == 10.4,] #Select all rows in mtcars where the mpg column has a value 10.4
mpg cyl disp  hp drat    wt  qsec vs am gear carb
Cadillac Fleetwood  10.4   8  472 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8  460 215 3.00 5.424 17.82  0  0    3    4

h) > mtcars[(mtcars$mpg >20) & (mtcars$mpg <24),] # Select all rows in mtcars where the mpg > 20 and mpg < 24
mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Merc 230       22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Toyota Corona  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

i) > myset <- mtcars[(mtcars$cyl == 6) | (mtcars$cyl == 4),] # Get all cars which have either 4 or 6 cylinders
> myset
mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Valiant        18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Merc 240D      24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2…

j) > mean(myset$mpg) # Determine the mean of the set created above
[1] 23.97222

k) > table(mtcars$cyl) # Create a table of counts of cars with 4, 6 or 8 cylinders

4  6  8
11  7 14
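As an aside (not part of the original set of commands), the filtering and mean steps above can be combined in one shot with aggregate(), which computes a statistic for each group – here the mean mpg for each cylinder count

> aggregate(mpg ~ cyl, data = mtcars, FUN = mean) # Mean mpg for each cylinder group
  cyl      mpg
1   4 26.66364
2   6 19.74286
3   8 15.10000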

G) lapply, sapply, tapply
I use the iris data set for these commands
a) > data(iris) #Load iris data set

b) > names(iris)  #Show the column names of the data set
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"
c) > lapply(iris,class) #Show the class of all the columns in iris
$Sepal.Length
[1] "numeric"
$Sepal.Width
[1] "numeric"
$Petal.Length
[1] "numeric"
$Petal.Width
[1] "numeric"
$Species
[1] "factor"

d) > sapply(iris,class) # Display the class of each column as a simplified (named) vector
Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species
"numeric"    "numeric"    "numeric"    "numeric"     "factor"

e) tapply: Instead of computing the mean for each species separately as below, we can use tapply
> a <-iris[iris$Species == "setosa",]
> mean(a$Sepal.Length)
[1] 5.006
> b <-iris[iris$Species == "versicolor",]
> mean(b$Sepal.Length)
[1] 5.936
> c <-iris[iris$Species == "virginica",]
> mean(c$Sepal.Length)
[1] 6.588

> tapply(iris$Sepal.Length,iris$Species,mean)
setosa versicolor  virginica
5.006      5.936      6.588
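As another aside (not in the original write-up), the same per-species means can be obtained with aggregate(), which returns a data frame rather than a named vector

> aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
     Species Sepal.Length
1     setosa        5.006
2 versicolor        5.936
3  virginica        6.588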

Hopefully this highly condensed version of R will set you on an R-oll.

You may like
– A peek into literacy in India: Statistical learning with R
– A crime map of India in R: Crimes against women
– Analyzing cricket’s batting legends – Through the mirage with R

Programming Zen and now – Some essential tips-2


This post is a follow-up to my earlier post – How to program – Some essential tips. In this post I expand on some of the ideas of my earlier post.

Programming means different things to different people. To some, programming is drudgery, almost akin to manual labor; to others it is an insurmountable mountain full of frustrations and disappointments; to still others it is an intense problem-solving and creative activity. In my opinion programming can mean anything to you. It is your attitude towards coding that makes it a chore, a daunting task or something really creative.

Here are some of my insights on how to go about learning to code

Eyes wide open:  People generally get frustrated when a piece of code that they wrote does not do what they intended it to do. In some cases the code snippet will do nothing when they were expecting a final result; sometimes the code will crash, or it will go into an infinite loop and drive the person nuts. (Let me assure you – I have been there, done that!) The usual reaction when this happens is anger and frustration, where we generally tinker around with the code only to get the same result. Soon the emotions progress from anger to hopelessness.

The first thing one needs to do while coding is to keep one’s ‘eyes wide open’. We tend to be guilty of ignoring the error messages that show up. Here is one way to attack coding

a) Fully understand the ‘what’ of the problem. If there is an infinite loop or a core dump, check at which point it happens. If there is an execution error, what is the error trying to tell us?
b) Next look into ‘why’ the error occurred. You could either use a debugger or insert appropriate print statements to take the offending code apart.
c) Thirdly, think about ‘how’ you can address the situation. Make appropriate changes and re-run the code.
d) Did it solve the issue? If yes, move forward. Otherwise go back to step a)

Remember that we learn more from our programming mistakes than when our code just ‘happens’ to work! Mistakes in our code force us to explain every part of the program
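To make this concrete, here is a minimal R sketch of the ‘what-why-how’ loop above. The function mysum() and its inputs are hypothetical, purely for illustration

# Hypothetical toy function that fails when the input contains NA
mysum <- function(x) {
  total <- 0
  for (i in seq_along(x)) {
    print(total)                     # 'why': print intermediate values to locate the problem
    total <- total + x[i]
  }
  if (is.na(total)) stop("total is NA - the input had a missing value")  # 'what': fail with a clear message
  total
}
# 'how': catch the error, read the message, then fix the input and re-run
tryCatch(mysum(c(1, 2, NA, 4)), error = function(e) conditionMessage(e))
mysum(c(1, 2, 0, 4))                 # re-run after the fix; returns 7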

Changing times:

Times have changed. Programming Zen and programming now are worlds apart. In many ways, IDEs, Git, Google etc. have made the programmer’s life a lot easier

‘Git’ing from here to there:  Here is a trick that I learnt fairly recently, though it should have occurred to me more than 2 years back. This is using Git judiciously for all programming tasks (Note: I am saying nothing new here!). I find it really useful in writing code with incremental changes. I create my initial code on the master and then test out incremental changes on a ‘new’ branch, even for personal projects. Once I have proved a small increment works, I merge it with the master branch. I again start working on the ‘new’ branch for the next incremental change, followed by a merge to the master

The steps are

Make initial changes

1. git add .
2. git commit -m "Initial changes"

Create a new branch
3. git checkout -b new

Make incremental changes. Test.
4. git add .
5. git commit -m "Change 1"

Merge with the master
6. git checkout master
7. git merge new

Continue to work with ‘new’.
8. git checkout new
9. Go to step 4)

This process can be continued till you get your final product. I find this extremely useful instead of just using an IDE to make code changes. Invariably you can run into a situation where you had something working some time back and in the next instant it is broken and you can’t figure out all the changes you made to the working code. This can be extremely frustrating. With Git you have a history of changes and you can switch to an earlier version of working code and start from there.

Rarely do I find a reason to have more than 1 branch

Here is a pictorial version of this

1

 

 

Taking help from Dr. Google: For most questions and errors that you encounter you will find others who have hit similar bugs. Just google it. You will be more than surprised to find that others went down the exact same path that you are treading. Besides, the internet is full of tutorials, blogs and articles on key aspects of programming

Explore the cave of Stack Overflow: Spend time exploring Stack Overflow. It is replete with code snippets and the very questions that you wanted to ask. There is so much information out there. If you really don’t find an answer to your problem, post it on Stack Overflow and you are bound to get an answer or a link to a similar question asked previously

Finally, programming requires dollops of patience. Develop patience along with your coding skills and soon programming will be much more enjoyable to you.

A crime map of India in R – Crimes against women


In this post I take a look at the gory crime scene across India to determine which states are the heavyweights in crime. Who is the undisputed champion of rapes in a year? Which state excels in cruelty by husbands and relatives towards wives? Which state leads in dowry deaths? To get the answers to these questions I analyze the state-wise data on crimes against women from the Open Government Data (OGD) Platform India. The dataset used for this analysis is the ‘Crime against Women’ data from OGD.

The data in OGD is available for crimes against women in different states under different ‘crime heads’ like rape, dowry deaths, kidnapping & abduction etc., for the years 2001 to 2012. This data is plotted as a scatter plot and a linear regression line is then fit on the available data. Based on this linear model, the projected incidence of crimes like rape, dowry deaths, and kidnapping & abduction is computed for each of the states. This is then used to build a table of the different crime heads for all the states, predicting the number of crimes up to the year 2018. Fortunately, R crunches through the data sets quite easily. The overall projections of crimes against women are shown below, based on the linear regression for each of these states

Projections over the next couple of years
The tables below are based on the projected incidence of crimes under various categories, assuming that these states maintain their torrid crime rate. A cursory look at the tables clearly indicates that Uttar Pradesh is the undisputed heavyweight champion in 4 of the 5 categories shown. Maharashtra and Andhra Pradesh take the 2nd and 3rd ranks in total crimes against women and are significant contenders in other categories too.

A) Projected rapes in India
The top 3 heavy weights in projected rapes over the next 5 years are 1) Madhya Pradesh  2) Uttar Pradesh 3) Maharashtra

rapes

Full table: Rape.csv
B) Projected Dowry deaths in India 
dowrydeaths

Full table: Dowry Deaths.csv
C) Kidnapping & Abduction
kidnapping

Full table: Kidnapping&Abduction.csv
D) Cruelty by husband & relatives
cruelty

Full table: Cruelty by husbands_relatives.csv
E) Total crimes against women

total

Full table: Total crimes.csv
Here is a beautiful visualization of ‘Total crimes against women’  created as a choropleth map  by Philip Predruco.

a

The implementation for this analysis was done using the R language. The R code, dataset, output and the crime charts can be accessed on GitHub at crime-against-women

Directory structure
– R code
– dataset used
– output
– statewise-crime-charts

The analysis has been completely parametrized. A quick look at the implementation is shown below. A function statecrime() was created, as given below

statecrime.R
This function (statecrime.R) does the following
a) Creates a scatter plot for the state for the crime head
b) Computes the best linear regression fit and draws this line
c) Uses the model parameters (coefficients) to compute the projected crime in the years to come
d) Writes the projected values to a text file
e) Creates a directory with the name of the state if it does not exist and stores the jpeg of the plot there.

statecrime <- function(indiacrime, row, state, crime) {
year <- c(2001:2012)
# Make separate folders for each state
if(!file.exists(state)) {
dir.create(state)
}
setwd(state)
crimeplot <- paste(crime,".jpg")
jpeg(crimeplot)

# Plot the details of the crime
# (thecrime, atitle, ymin and ymax are assumed to be derived from indiacrime[row, ]
# earlier in the full function; that part is omitted in this excerpt)
plot(year, thecrime, pch=15, col="red", xlab="Year", ylab=crime, main=atitle,
xlim=c(2001,2018), ylim=c(ymin,ymax), axes=FALSE)

A linear regression line is fit using ‘lm’

# Fit a linear regression model
lmfit <-lm(thecrime~year)
# Draw the lmfit line
abline(lmfit)

The model parameters are then used to draw the line and also to project the crime incidence for the years 2013 to 2018

nyears <-c(2013:2018)
nthecrime <- rep(0,length(nyears))
# Projected crime incidents from 2013 to 2018 using a linear regression model
for (i in seq_along(nyears)) {
nthecrime[i] <- lmfit$coefficients[2] * nyears[i] + lmfit$coefficients[1]
}
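(As an aside, the same projection could be obtained with R’s built-in predict() instead of multiplying out the coefficients by hand – a one-line sketch, not what statecrime.R actually does)

# Equivalent projection for 2013-2018 using predict() on the fitted model
nthecrime <- predict(lmfit, newdata = data.frame(year = 2013:2018))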

The projected data for each state is appended into an appropriate file which is then used to display the tables at the top of this post

# Write the projected crime rate in a file
nthecrime <- round(nthecrime,2)
nthecrime <- c(state, nthecrime, "\n")
print(nthecrime)
#write(nthecrime,file=fileconn, ncolumns=9, append=TRUE,sep="\t")
filename <- paste(crime,".txt")
# Write the output in the ./output directory
setwd("./output")
cat(nthecrime, file=filename, sep=",",append=TRUE)

The above function is then repeatedly called for each state for the different crime heads. (Note: It is possible to read both the states and crime heads with R and perform the computation in a loop – a sketch of such a loop is shown after the listing below. However, I have done this the manual way!)

crimereport.R
# 1. Andhra Pradesh
i <- 1
statecrime(indiacrime, i, "Andhra Pradesh","Rape")
i <- i+38
statecrime(indiacrime, i, "Andhra Pradesh","Kidnapping& Abduction")
i <- i+38
statecrime(indiacrime, i, "Andhra Pradesh","Dowry Deaths")
i <- i+38
statecrime(indiacrime, i, "Andhra Pradesh","Assault on Women")
i <- i+38
statecrime(indiacrime, i, "Andhra Pradesh","Insult to modesty")
i <- i+38
statecrime(indiacrime, i, "Andhra Pradesh","Cruelty by husband_relatives")
i <- i+38
statecrime(indiacrime, i, "Andhra Pradesh","Imporation of girls from foreign country")
i <- i+38
statecrime(indiacrime, i, "Andhra Pradesh","Immoral traffic act")
i <- i+38
statecrime(indiacrime, i, "Andhra Pradesh","Dowry prohibition act")
i <- i+38
statecrime(indiacrime, i, "Andhra Pradesh","Indecent representation of Women Act")
i <- i+38
statecrime(indiacrime, i, "Andhra Pradesh","Commission of Sati Act")
i <- i+38
statecrime(indiacrime, i, "Andhra Pradesh","Total crimes against women")
...
...

and so on for all the states
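Here is the sketch of the loop mentioned in the note above. It assumes, based on the pattern of offsets in crimereport.R, that each state occupies one row within every 38-row crime-head block of indiacrime; the truncated vectors are purely illustrative

states <- c("Andhra Pradesh", "Arunachal Pradesh", "Assam")        # ... and so on for all states
crimeheads <- c("Rape", "Kidnapping& Abduction", "Dowry Deaths")   # ... and so on for all crime heads
for (s in seq_along(states)) {
  i <- s                   # assumed: state s is the s-th row of each crime-head block
  for (crime in crimeheads) {
    statecrime(indiacrime, i, states[s], crime)
    i <- i + 38            # assumed: successive crime heads are 38 rows apart
  }
}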

Charts for different crimes against women

1) Uttar Pradesh

The plots for  Uttar Pradesh  are shown below

Rapes in UP

Rape

Dowry deaths in UP

Dowry Deaths

Cruelty by husband/relative

Cruelty by husband_relatives

Total crimes against women in Uttar Pradesh

Total crimes against women

You can find more charts in GitHub by clicking Uttar Pradesh

2) Maharashtra : Some of the charts for Maharashtra

Rape

Rape

Kidnapping & Abduction

Kidnapping& Abduction

Total crimes against women in Maharashtra

Total crimes against women

More crime charts  for Maharashtra

Crime charts can be accessed for the following states from GitHub ( in alphabetical order)

3) Andhra Pradesh
4) Arunachal Pradesh
5) Assam
6) Bihar
7) Chattisgarh
8) Delhi (Added as an exception based on its notoriety)
9) Goa
10) Gujarat
11) Haryana
12) Himachal Pradesh
13) Jammu & Kashmir
14) Jharkhand
15) Karnataka
16) Kerala
17) Madhya Pradesh
18) Manipur
19) Meghalaya
20) Mizoram
21) Nagaland
22) Odisha
23) Punjab
24) Rajasthan
25) Sikkim
26) Tamil Nadu
27) Tripura
28) Uttarakhand
29) West Bengal

The code, dataset and the charts can be cloned/forked from GitHub at crime-against-women

Let me know if you find any interesting patterns in the data.
Thoughts, comments welcome!


See also
A peek into literacy in India: Statistical learning with R

You may also like
– Analyzing cricket’s batting legends – Through the mirage with R
– What’s up Watson? Using IBM Watson’s QAAPI with Bluemix, NodeExpress – Part 1
– Bend it like Bluemix, MongoDB with autoscaling – Part 1

A peek into literacy in India: Statistical Learning with R


In this post I take a peek into the literacy landscape across India as a whole using the R language. The dataset from the Open Government Data (OGD) platform India was used for this purpose. This data is based on the 2011 census. The XL sheets were downloaded for each state. The Union Territories were not included in the analysis.

A thin slice of the data was taken from the data set for each individual state (Note: this could also have been done from the consolidated india.xls XL sheet, which I came to know of much later).

I calculate the following for each age group

Males (%) attending education institutions = (Males attending educational institutions * 100)/ Total males
Females (%) attending education institutions = (Females attending educational institutions * 100)/ Total Females

This is then plotted as a bar chart against the age distribution. I then overlay the national average on each state’s bar chart to check whether literacy in the state is above or below the national average. The implementation in R is included below.

The code and data can be forked/cloned from GitHub at india-literacy

The results based on the analysis are given below.

  1. Kerala is clearly the top ranker, with literacy rates for both males and females well above the average
  2. The states with above average literacy are – Kerala, Himachal Pradesh, Uttarakhand, Tamil Nadu, Haryana, Karnataka, Maharashtra, Punjab
  3. The states with just about average literacy are – Karnataka, Andhra Pradesh, Chattisgarh, Gujarat, Madhya Pradesh, Odisha, West Bengal
  4. The states with below average literacy are – Uttar Pradesh, Bihar, Jharkhand, Arunachal Pradesh, Assam, Jammu and Kashmir, Rajasthan

 

A brief implementation of the basic code in R is shown below

# Read the Arunachal Pradesh literacy related data
arunachal = read.csv("arunachal.csv")
# Create as a matrix
arunachalmat = as.matrix(arunachal)
arunachalTotal = arunachalmat[2:19,7:28]
# Take transpose as this is necessary for plotting bar charts
arunachalmat = t(arunachalTotal)
# Set the scipen option to format the y axis (otherwise prints as e^05 etc.)
getOption("scipen")
opt <- options("scipen" = 20)
getOption("scipen")
#Create a vector of total Males & Females
arunachalTotalM = arunachalmat[3,]
arunachalTotalF = arunachalmat[4,]
#Create a vector of males & females attending education institution
arunachalM = arunachalmat[6,]
arunachalF = arunachalmat[7,]
#Calculate percent of males attending education of total
arunachalpercentM = round(as.numeric(arunachalM) *100/as.numeric(arunachalTotalM),1)
barplot(arunachalpercentM,names.arg=arunachalmat[1,],main ="Percentage males attending educational institutions in Arunachal Pradesh",
xlab = "Age", ylab= "Percentage",ylim = c(0,100), col ="lightblue", legend= c("Males"))
# Overlay the national average for males (age and indiapercentM are assumed to be
# computed from the consolidated India data; see the sketch after this listing)
points(age,indiapercentM,pch=15)
lines(age,indiapercentM,col="red",pch=20,lty=2,lwd=3)
legend( x="bottomright",
legend=c("National average"),
col=c("red"), bty="n" , lwd=1, lty=c(2),
pch=c(15) )
#Calculate percent of females attending education of total
arunachalpercentF = round(as.numeric(arunachalF) *100/as.numeric(arunachalTotalF),1)
barplot(arunachalpercentF,names.arg=arunachalmat[1,],main ="Percentage females attending educational institutions in Arunachal Pradesh ",
xlab = "Age", ylab= "Percentage", ylim = c(0,100), col ="lightblue", legend= c("Females"))
# Overlay the national average for females
points(age,indiapercentF,pch=15)
lines(age,indiapercentF,col="red",pch=20,lty=2,lwd=3)
legend( x="bottomright",
legend=c("National average"),
col=c("red"), bty="n" , lwd=1, lty=c(2),
pch=c(15) )
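The vectors age, indiapercentM and indiapercentF used for the national-average overlay above are assumed to be computed in the same way from the consolidated India data. A purely hypothetical sketch (the file name india.csv and the reuse of the state-sheet row/column indices are assumptions)

# Hypothetical: consolidated national sheet with the same layout as the state sheets
india = read.csv("india.csv")
indiamat = t(as.matrix(india)[2:19, 7:28])
# x positions for the overlay; assumed to line up with the age-group bars
age = seq_along(indiamat[1, ])
indiapercentM = round(as.numeric(indiamat[6, ]) * 100 / as.numeric(indiamat[3, ]), 1)
indiapercentF = round(as.numeric(indiamat[7, ]) * 100 / as.numeric(indiamat[4, ]), 1)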

A) Overall plot for India

a) India – Males

india-males

b) India – females

india-females

The plots for each individual state are given below

1) Literacy in Tamil Nadu

Tamil Nadu is slightly over the national average. The women seem to do marginally better than the males

a) Tamil Nadu – males

tn-males

b) Tamil Nadu – females

tn-females

2) Literacy in Uttar Pradesh

UP is slightly below the national average. Women are comparatively below men here

a) Uttar Pradesh – males

UP-males

b) Uttar Pradesh – females

UP-females

3) Literacy in Bihar

Bihar is well below the national average for both men and women

a) Bihar – males

bihar-males

b) Bihar – females

bihar-females

4. Literacy in Kerala

Kerala is the winner all the way in literacy with almost 100% literacy across all age groups

a) Kerala – males


kerala-females

b) Kerala -females

kerala-females

 

5. Literacy in Andhra Pradesh

AP just meets the national average for literacy.

a) Andhra Pradesh – males

andhra-males

b) Andhra Pradesh – females

andhra-females

6. Literacy in Arunachal Pradesh

Arunachal Pradesh is below average for most of the age groups

a) Arunachal Pradesh – males

arunachal-males

b) Arunachal Pradesh – females

arunachal-females

7. Literacy in  Assam

Assam is below national average

a) Assam – males

assam-males

b) Assam – females

assam-females

 

8. Literacy in Chattisgarh

Chattisgarh is on par with the national average for both men and women

a) Chattisgarh – males

chattisgarh-males

b) Chattisgarh – females

chattisgarh-females

 

9. Literacy in Gujarat

Gujarat is just about average

a) Gujarat – males

gujarat-males

b) Gujarat – females

gujarat-females

10. Literacy in Haryana

Haryana is slightly above average

a) Haryana – males

haryana-males

b) Haryana – females

haryana-females

11. Literacy in Himachal Pradesh

Himachal Pradesh is cool and above average.

a) Himachal Pradesh – males

himachal-males

 

b) Himachal Pradesh – females

himachal-females

12. Literacy in Jammu and Kashmir

J & K is marginally below average

a) Jammu and Kashmir – males

jk-males

b) Jammu and Kashmir – females

jk-females

 

13. Literacy in Jharkhand

Jharkhand is some ways below average

a) Jharkhand – males

jharkand-males

b) Jharkhand – females

jharkand-feamles

14. Literacy in Karnataka

Karnataka is on average for men. Women seem to do better than men here

a) Karnataka – males

karnataka-males

b) Karnataka – females

karnataka-females

15. Literacy in Madhya Pradesh

Madhya Pradesh meets the national average

a) Madhya Pradesh – males

mp-males

b) Madhya Pradesh – females

mp-females

16. Literacy in Maharashtra

Maharashtra is a front-runner in literacy

a) Maharashtra – males

maharashtra

b) Maharashtra – females

maharashtra-feamles

 

17. Literacy in Odisha

Odisha meets national average

a) Odisha – males

odisha-males

b) Odisha – females

odisha-females

 

18. Literacy in  Punjab

Punjab is marginally above average with women doing even better

a) Punjab – males

punjab-males

b) Punjab – females

punjab-females

19. Literacy in Rajasthan

Rajasthan is average for males and below average for females

a) Rajasthan – males

rajashthan-males

b) Rajasthan – females

rajasthan-females

20. Literacy in Uttarakhand

Uttarakhand rocks and is above average

a) Uttarakhand – males

uttarkhan-males

b) Uttarakhand – females

uttarkhand-females

21. Literacy in West Bengal

West Bengal just about meets the national average.

a) West Bengal – males

wb-males

 

b) West Bengal – females

wb-females

The code can be cloned/forked from GitHub at india-literacy. I have done my analysis on the overall state-level data. The data is further sub-divided across districts in each state, and further into urban and rural. Many different ways of analysing it are possible; one method is shown here

Conclusion

  1. Kerala is clearly head and shoulders above all states when it comes to literacy
  2. Many states are above average. They are Kerala, Himachal Pradesh, Uttarakhand, Tamil Nadu, Haryana, Karnataka, Maharashtra, Punjab
  3. States with average literacy are – Karnataka, Andhra Pradesh, Chattisgarh, Gujarat, Madhya Pradesh, Odisha, West Bengal
  4. States which fall below the national average are – Uttar Pradesh, Bihar, Jharkhand, Arunachal Pradesh, Assam, Jammu and Kashmir, Rajasthan

See also
– A crime map of India in R: Crimes against women
– What’s up Watson? Using IBM Watson’s QAAPI with Bluemix, NodeExpress – Part 1
– Bend it like Bluemix, MongoDB with autoscaling – Part 1