My travels through the realms of Data Science, Machine Learning, Deep Learning and AI


Then felt I like some watcher of the skies 
When a new planet swims into his ken; 
Or like stout Cortez when with eagle eyes 
He star’d at the Pacific—and all his men 
Look’d at each other with a wild surmise— 
Silent, upon a peak in Darien. 

 

On First Looking into Chapman’s Homer by John Keats

The above excerpt from John Keats's poem captures the exhilaration one experiences when discovering something for the first time. It also summarizes, to some extent, my own enjoyment while pursuing Data Science, Machine Learning and the like.

I decided to write this post because youngsters occasionally approach me and ask where they should start their adventure in Data Science & Machine Learning. At other times, the 'not-so-youngsters' want to know what their next step should be after having done some courses. This post covers my travels through the domains of Data Science, Machine Learning, Deep Learning and (soon to be done) AI.

By no means am I an authority in this field, which is ever-widening and almost bottomless, yet I would like to share some of my experiences in this fascinating area. I include a short review of the courses I have done below, along with alternative routes through courses which I did not do but which are probably equally good. Feel free to pick and choose any course or set of courses. Alternatively, you may prefer to read books or attend brick-and-mortar classes. In any case, I hope the list below will provide you with some overall direction.

All my learning in the above domains has come from MOOCs, and I restrict myself to the top 3 MOOC platforms, or in my opinion 'the original MOOCs', namely Coursera, edX and Udacity, though I may throw in some courses from other online sites if they are only available there. I would recommend these 3 platforms over the numerous other online courses and also over face-to-face classroom courses for the following reasons. These MOOCs

  • Are offered by world-class universities, and the lectures are delivered by top-class professors who have great depth of knowledge and a wealth of experience
  • The professors, besides delivering quality content, also point out important tips, tricks and traps
  • You can revisit lectures whenever you want
  • Lectures are usually short, between 8-15 mins (personally, my attention span is around 15-20 mins at a time!)

Here is a fair warning, and something quite obvious: no amount of courses, lectures or books will help if you don't put what you learn to use through some language like Octave, R or Python.

The journey
My trip through Data Science and Machine Learning started about 3 years ago with an off-chance remark from an old friend of mine, who spoke about having done a few courses at Coursera and having really liked them. He further suggested that I should try one. This was the final push which set me sailing into this vast domain.

I have included the list of the courses I have done over the past 3 years (33 certifications completed and another 9 audited - listened to without doing the assignments). For each course I have included a short review, whether I think the course is mandatory, the language the course is based on, and whether I have completed it myself. I have also included alternative courses, which I may not have done, but which I think are equally good. Finally, I suggest some courses which I have heard are very good and worth taking.

1. Machine Learning, Stanford, Prof Andrew Ng, Coursera
(Requirement: Mandatory, Language:Octave,Status:Completed)
This course provides an excellent foundation to build your Machine Learning citadel on. The course covers the mathematical details of linear, logistic and multivariate regression. There is also good coverage of topics like Neural Networks, SVMs, Anomaly Detection, underfitting, overfitting, regularization etc. Prof Andrew Ng presents the material in a very lucid manner. It is a great course to start with. It would be a good idea to brush up on some basics of linear algebra, matrices and a little bit of calculus, specifically computing local maxima/minima. You should be able to take this course even if you don't know Octave, as the Prof goes over the key aspects of the language.

2. Statistical Learning, Prof Trevor Hastie & Prof Robert Tibshirani, Stanford Online (Requirement: Mandatory, Language: R, Status: Completed)
The course includes linear, polynomial and logistic regression. It also covers cross-validation and the bootstrap, model selection and regularization (ridge and lasso). It touches on non-linear models, generalized additive models, boosting and SVMs. Some unsupervised learning methods are also discussed. The 2 professors take turns delivering the lectures with a slight touch of humor.

3a. Data Science Specialization: Prof Roger Peng, Prof Brian Caffo & Prof Jeff Leek, Johns Hopkins University (Requirement: Option A, Language: R, Status: Completed)
This is a comprehensive 10-module specialization based on R. The Specialization gives a very broad overview of Data Science and Machine Learning. The modules cover R programming, Statistical Inference, Practical Machine Learning, how to build R products and R packages, and it ends with a very good Capstone project on NLP.

3b. Applied Data Science with Python Specialization: University of Michigan (Requirement: Option B, Language: Python, Status: Not done)
In this specialization I only did the Applied Machine Learning in Python course (Prof Kevyn Collins-Thompson). This is a very good course that covers a lot of Machine Learning algorithms (linear, logistic, ridge and lasso regression, kNN, SVMs etc.). Also included are confusion matrices, ROC curves etc. The course is based on Python's scikit-learn.

3c. Machine Learning Specialization, University of Washington (Requirement: Option C, Language: Python, Status: Not completed). This appears to be a very good Specialization in Python.

4. Statistics with R Specialization, Duke University (Requirement: Useful and a must know, Language: R, Status: Not completed)
I audited (listened only to) the following 2 modules from this Specialization.
a. Inferential Statistics
b. Linear Regression and Modeling
Both these courses are taught by Prof Mine Çetinkaya-Rundel, who delivers her lessons with extraordinary clarity. Her lectures are filled with many examples which she walks you through in great detail.

5. Bayesian Statistics: From Concept to Data Analysis, University of California, Santa Cruz (Requirement: Optional, Language: R, Status: Completed)
This is an interesting course and provides an alternative point of view to the frequentist approach.

6. Data Science and Engineering with Spark, University of California, Berkeley, Prof Anthony Joseph, Prof Ameet Talwalkar, Prof Jon Bates
(Requirement: Mandatory for Big Data, Status: Completed, Language: pySpark)
This specialization contains 3 modules
a. Introduction to Apache Spark
b. Distributed Machine Learning with Apache Spark
c. Big Data Analysis with Apache Spark

This is an excellent course for those who want to make an entry into Distributed Machine Learning. The exercises are fairly challenging and your code will predominantly be made of map/reduce and lambda operations as you process data that is distributed across Spark RDDs. I really liked the part where the Prof shows how the closed-form solution of linear regression, which is of the order of O(nd^2 + d^3) on a single machine, can be distributed by computing the O(nd^2) part as a sum of outer products over data which is spread across machines.
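
To make the outer-product idea concrete, here is a minimal single-machine sketch in base R (my own illustration, not code from the course): X^T X, the O(nd^2) term in the closed-form solution, can be written as a sum of per-row outer products, which is exactly the form that can be summed across distributed workers.

# X^T X as a sum of per-row outer products (single-machine illustration in base R)
set.seed(42)
n <- 1000; d <- 5
X <- matrix(rnorm(n * d), nrow = n)
XtX <- t(X) %*% X                                                          # standard computation
XtX_outer <- Reduce(`+`, lapply(seq_len(n), function(i) X[i, ] %o% X[i, ]))  # sum of outer products
all.equal(XtX, XtX_outer)                                                  # TRUE: the two agree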

7. Deep Learning Specialization, Prof Andrew Ng, Younes Bensouda Mourri, Kian Katanforoosh (Requirement: Mandatory, Language: Python, TensorFlow, Status: Partially completed)

This specialization has 5 modules which start from the fundamentals of Neural Networks, their derivation and vectorized Python implementation. The specialization also covers regularization, optimization techniques, mini-batch gradient descent, batch normalization, Convolutional Neural Networks, Recurrent Neural Networks and LSTMs applied to a wide variety of real-world problems.

The modules are
a. Neural Networks and Deep Learning
In this course Prof Andrew Ng explains the differential calculus, linear algebra and vectorized Python implementations behind Deep Learning algorithms. The derivation of back-propagation is worked through, and the Prof then shows how to compute a multi-layered DL network.

b. Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization
Deep Neural Networks can be very flexible and come with a lot of knobs (hyper-parameters) to tune. In this module, Prof Andrew Ng shows a systematic way to tune hyperparameters and by how much one should tune them. The course also covers regularization (L1, L2, dropout), gradient descent optimization and batch normalization methods. The visualizations used to explain the momentum method, RMSprop, Adam, LR decay and batch normalization are really powerful and serve to clarify the concepts. As an added bonus, the module also includes a great introduction to Tensorflow.
c. Structuring Machine Learning Projects – To do
d. Convolutional Neural Networks – To do
e. Sequence Models – To do

8. Neural Networks for Machine Learning, Prof Geoffrey Hinton, University of Toronto
(Requirement: Mandatory, Language: Octave, Status: Completed)
This is a broad course which starts from the basics of Perceptrons and goes all the way to Boltzmann Machines, RNNs, CNNs, LSTMs etc. The course also covers regularization, learning rate decay, the momentum method etc.

9. Probabilistic Graphical Models, Stanford, Prof Daphne Koller (Language: Octave, Status: Partially completed)
This has 3 courses
a. Probabilistic Graphical Models 1: Representation – Done
b. Probabilistic Graphical Models 2: Inference – To do
c. Probabilistic Graphical Models 3: Learning – To do
This course discusses how a system which can be represented as a complex interaction of probability distributions will behave. It is probably the toughest course I have done. I did manage to get through the 1st module, and while I felt that I grasped a few things, I did not wholly understand its import. However I feel this is an important domain and I will definitely revisit it in the future.

10. Mining Massive Data Sets, Prof Jure Leskovec, Prof Anand Rajaraman and Prof Jeff Ullman, Stanford Online (Status: Partially done)
I quickly audited this course a year back, when it used to be on Coursera. It now seems to have moved to Stanford Online. This is a very good course that discusses key concepts of mining Big Data of the order of a few petabytes.

11. Introduction to Artificial Intelligence, Prof Sebastian Thrun & Prof Peter Norvig, Udacity
This is a really good course. I have started on this course a couple of times and somehow gave up. I will revisit it to complete it in future. It is quite extensive in its coverage and touches on BFS, DFS, A-Star, PGMs, Machine Learning etc.

12. Deep Learning (with TensorFlow), Vincent Vanhoucke, Principal Scientist at Google Brain
Got started on this one and abandoned it some time back. It is in my to-do list though.

My learning journey is based on Lao Tzu's dictum that 'A good traveler has no fixed plans and is not intent on arriving'. You could, of course, have a goal and plan your courses accordingly.
And so my journey continues…

I hope you find this list useful.
Have a great journey ahead!!!


IBM Data Science Experience:  First steps with yorkr


Fresh, and slightly dizzy, from my foray into Quantum Computing with IBM’s Quantum Experience, I now turn my attention to IBM’s Data Science Experience (DSE).

I am on the verge of completing a really great 3-module 'Data Science and Engineering with Spark XSeries' from the University of California, Berkeley, and I have been thinking of trying out some form of integrated delivery platform for performing analytics for quite some time. Coincidentally, IBM came out with its Data Science Experience a month back. There are a few other collaborative platforms available for playing around with Apache Spark or data analytics, namely Jupyter notebooks, Databricks and Data.world.

I decided to go ahead with IBM's Data Science Experience as the GUI is a lot cooler, includes shared data sets and integrates with Object Storage, Cloudant DB etc., which seemed a lot closer to the cloud, literally! IBM's DSE is an interactive, collaborative, cloud-based environment for performing data analysis with Apache Spark. DSE is hosted on IBM's PaaS environment, Bluemix, so it should be possible to access from DSE the plethora of cloud services available on Bluemix. IBM's DSE uses Jupyter notebooks for creating and analyzing data; the notebooks can be easily shared and have access to a few hundred publicly available datasets.

Disclaimer: This article represents the author’s viewpoint only and doesn’t necessarily represent IBM’s positions, strategies or opinions

In this post, I use IBM's DSE and my R package yorkr for analyzing the performance of 1 ODI match (Aus-Ind, 2 Feb 2012) and the batting performance of Virat Kohli in IPL matches. These are my 'first' steps in DSE, so I use plain old R for the analysis together with my R package yorkr. I intend to do more interesting stuff on Machine Learning with SparkR, sparklyr and pySpark in the weeks and months to come.

You can check out the Jupyter notebook created with IBM's DSE on Github – "Using R package yorkr – A quick overview" – and on NBviewer at "Using R package yorkr – A quick overview".

Working with Jupyter notebooks is fairly straightforward; they can handle code in R, Python and Scala. Each cell can contain either code, Markdown text, raw NBConvert content or a heading. The code is written into the cells and can be executed sequentially. Here is a screen shot of the notebook.

[Screenshot: Jupyter notebook in IBM's DSE]

The 'File' menu can be used for 'saving and checkpointing' or 'reverting' to a checkpoint. The 'Kernel' menu can be used to start, interrupt, restart and run all cells etc. The Data Sources icon can be used to load data sources into your code. The data is uploaded to Object Storage with appropriate credentials, and you will have to import this data from Object Storage using those credentials. In my notebook with yorkr I directly load the data from Github. You can use the 'Sharing' icon to share the notebook; the shared notebook has the extension 'ipynb'. You can import this notebook directly into your environment and get started with the code available in it.

You can import existing R, Python or Scala notebooks as shown below. My notebook 'Using R package yorkr – A quick overview' can be downloaded using the link 'yorkrWithDSE' and clicking the green download icon on the top right corner.

[Screenshot: downloading the notebook]

I have also uploaded the file to Github, and you can download it from here too: 'yorkrWithDSE'. This notebook can be imported into your DSE as shown below.

[Screenshot: importing the notebook into DSE]

Jupyter notebooks have been integrated with Github and are rendered directly from Github. You can view my Jupyter notebook here – "Using R package yorkr – A quick overview". You can also view it on NBviewer at "Using R package yorkr – A quick overview".

So there it is. You can download my notebook, import it into IBM's Data Science Experience and then use data from 'yorkrData' as shown. As already mentioned, yorkrData contains converted data for ODIs, T20s and the IPL. For details on how to use my R package yorkr please see my posts on yorkr at "Index of posts".

Hope you have fun playing with IBM's Data Science Experience and my package yorkr.

I will be exploring IBM's DSE in the weeks and months to come in the areas of Machine Learning with SparkR, sparklyr or pySpark.

Watch this space!!!


Also see

1. Introducing QCSimulator: A 5-qubit quantum computing simulator in R
2. Natural Language Processing: What would Shakespeare say?
3. Introducing cricket package yorkr:Part 1- Beaten by sheer pace!
4. A closer look at “Robot horse on a Trot! in Android”
5.  Re-introducing cricketr! : An R package to analyze performances of cricketers
6.   What’s up Watson? Using IBM Watson’s QAAPI with Bluemix, NodeExpress – Part 1
7.  Deblurring with OpenCV: Wiener filter reloaded

To see all my posts check
Index of posts

Into the Telecom vortex


“Ten little Indian boys went out to dine,
One choked his little self and then there were nine
Nine little Indian boys sat up very late;
One overslept himself and then there were eight…”

From the poem “Ten Little Indians”


You don’t need to be particularly observant to notice that the telecom landscape over the last decade and a half is full of dead organizations, bloodshed and gore. Organizations have been slain by ruthless times and bigger ones have devoured the weaker, fallen ones. Telecom titans have vanished, giants have been reduced to dwarfs.

Some telecom companies have merged in a deadly embrace trying to beat the market forces only to capitulate to its inexorable death march.

The period from the early 1980s to the late 1990s was a glorious one for telecommunication. Digital switches (1972-1982), ISDN (1988), international calling, trunk protocols, mobile telephony (~1991), 2G, 2.5G and 3G arrived in succession, one after another.

Advancement came after advancement. The future had never looked so bright for telecom companies.

The late 1990s were heady years, not just for telecom companies but for all technology companies. Stock prices soared. Many stocks were over-valued. This was mainly due to what was described as the 'irrational exuberance' of the stock market.

Lucent, Alcatel, Ericsson, Nortel Networks, Nokia, Siemens and Telcordia all ruled supreme.

1997-2000: then the inevitable happened. The infamous dot-com bust of 2000 reduced many technology stocks to penny stocks. Telecom company stocks went into a major tail spin and their prices plummeted. This situation, many felt, was further exacerbated by the fact that nothing important or earth-shattering was forthcoming from the telecom domain. In other words, there was no 'killer app' from telecommunications.

From 2000 onwards 3G, HSDPA, LTE etc. have all come along. But the markets were largely unimpressed. This was also the period of the downward slide for telecom. The last decade and a half has been extraordinarily violent. Technology units of dying organizations have been cannibalized by the more successful ones.

Stellar organizations collapsed, others transformed into ‘white dwarfs’, still others shattered with the ferocity of a super nova.

Here is a short recap of the major events.

  • 2006 – After a couple of unsuccessful attempts Alcatel and Lucent finally decide to merge
  • 2006 – Nokia marries Siemens in a 20 billion Euro deal
  • 2009-10 – Ericsson purchases Nortel’s CDMA and LTE business for $1.13 billion
  • 2009-10 – Nortel implodes
  • 2010 – Motorola sells networking unit to Nokia for $1.2 Billion
  • 2011 – Internet giant Google mops up Motorola’s handset division for $12.5 billion, largely for the patents
  • 2012 – Ericsson closes a deal with Telcordia for $1.15 billion
  • 2013 – Nokia sells its handset division to Microsoft after facing a serious beating from smartphones
  • 2015 – Nokia agrees to a $16.6 billion takeover of Alcatel Lucent

And so the story continues like the rhyme in Agatha Christie's mystery novel

And Then There Were None

"Ten little Indian boys went out to dine,
One choked his little self and then there were nine…"

The Telecom companies continue their search for the elusive ‘killer app’ as progress comes in small increments – 3G, 3.5G, 3.75G, 4G, and 5G etc.

Personally I think the future of telecom companies lies in their ability to embrace the latest technologies of Cloud Computing, Big Data, Software Defined Networks and Software Defined Datacenters, and to re-invent themselves. Rather than looking for some elusive 'killer app' they have to re-enter the technology scene with a Big Bang.

As I mentioned in one of my earlier posts, "Architecting a cloud Based IP Multimedia System", the proverbial pot at the end of the rainbow may be in

  1. Virtualizing the IP Multimedia Subsystem (IMS) functions, namely the CSCFs (P-CSCF, S-CSCF, I-CSCF etc.),
  2. Using the features of the cloud like Software Defined Storage (SDS), load balancers and auto-scaling to elastically scale up or scale down the CSCF instances to handle varying 'call traffic'
  3. Having equipment manufacturers (Nokia, Ericsson and Huawei) use innovative pricing models with carriers like AT&T, MCI, Airtel or Vodafone. Instead of a one-time cost for hardware and software, the equipment manufacturers will need to charge based on usage or call traffic (utility charging). This will be a win-win for both the equipment manufacturer and the carrier
  4. Using SDN to provide the necessary virtualized pipes between users with the necessary policies for advanced services like video-chat, white-boarding, real-time gaming etc.
  5. Using Big Data and Hadoop to analyze Call Detail Records (CDRs) and provide advanced services to customers like differential rates for calls etc

Clearly there will be challenges in this virtualized view of things. Telecom equipment is renowned for its 5 9's availability. The challenge will be achieving this resiliency, high availability and fault-tolerance with cloud servers. How can WAN latencies be mitigated? How can SDN provide the QoS required for voice, video and data traffic in IMS?

IMS has many interesting services where video calls from laptops can be transferred as data calls to mobile phones and vice versa, from mobile networks to WiFi  and so on.

Many hurdles will have to be crossed. But this, in my opinion, will be the path forward.

While the last decade and a half has been bad for the telecom industry, I personally feel we are on the verge of the next big breakthrough. Telecom will rise like the phoenix from its ashes in the next couple of years.

Also see
1. A crime map of India in R: Crimes against women
2.  What’s up Watson? Using IBM Watson’s QAAPI with Bluemix, NodeExpress – Part 1
3.  Bend it like Bluemix, MongoDB with autoscaling – Part 2
4. Informed choices through Machine Learning : Analyzing Kohli, Tendulkar and Dravid
5. Thinking Web Scale (TWS-3): Map-Reduce – Bring compute to data
6. Deblurring with OpenCV: Wiener filter reloaded

Thinking Web Scale (TWS-3): Map-Reduce – Bring compute to data


In the last decade and a half, there has arisen a class of problems that is becoming very critical in the computing domain. These problems deal with computing in highly distributed environments. A key characteristic of this domain is the need to grow elastically with increasing workloads while tolerating failures without missing a beat. In short, I would like to refer to this as 'Web Scale Computing', where the number of servers exceeds several hundreds and the data size is of the order of a few hundred terabytes to several exabytes.

There are several features that are unique to large scale distributed systems

  1. The servers used are not specialized machines but regular commodity, off-the-shelf servers
  2. Failures are not the exception but the norm. The design must be resilient to failures
  3. There is no global clock. Each individual server has its own internal clock with its own skew and drift rates. Algorithms exist that can create a notion of a global clock
  4. Operations happen at these machines concurrently. The order of the operations, things like causality and concurrency, can be evaluated through special algorithms like Lamport or Vector clocks (a short sketch of a Lamport clock follows this list)
  5. The distributed system must be able to handle failures where servers crash, disks fail or there is a network problem. For this reason data is replicated across servers, so that if one server fails the data can still be obtained from copies residing on other servers.
  6. Since data is replicated there are associated issues of consistency. Algorithms exist that ensure that the replicated data is either 'strongly' consistent or 'eventually' consistent. Trade-offs are often considered when choosing one of the consistency mechanisms
  7. Leaders are elected democratically. Then there are dictators who get elected through the 'Bully' algorithm.
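
To make point 4 concrete, here is a minimal illustration of a Lamport logical clock in R (my own sketch, with two hypothetical processes P1 and P2 exchanging a single message; it is not taken from any particular library):

# Lamport clock rules: increment on every local event; on receive, take max(local, message) + 1
lamport_tick    <- function(clock) clock + 1
lamport_receive <- function(clock, msg_ts) max(clock, msg_ts) + 1

p1 <- 0; p2 <- 0
p1     <- lamport_tick(p1)              # P1 does a local event and sends a message: p1 = 1
msg_ts <- p1
p2     <- lamport_tick(p2)              # P2 does an unrelated local event:          p2 = 1
p2     <- lamport_receive(p2, msg_ts)   # P2 receives the message:                   p2 = 2
# the receive now carries a larger timestamp than the send, preserving the causal order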

In some ways distributed systems behave like a murmuration of starlings (or a school of fish),  where a leader is elected on the fly (pun unintended) and the starlings or fishes change direction based on a few (typically 6) closest neighbors.

This series of posts, Thinking Web Scale (TWS), will be about Web Scale problems and the algorithms designed to address them. I would like to keep these posts more essay-like and less pedantic.

In the early days, computing used to be done on single monolithic machines, each with its own CPU, RAM and disk. This situation was fine for a long time, as technology promptly kept its date with Moore's Law, which stated that computing power and memory capacity would double roughly every 18 months. However, this situation changed drastically as the data generated by machines grew exponentially – whether it was call detail records, records from retail stores, click streams, tweets, or the status updates of today's social networks.

These massive amounts of data cannot be handled by a single machine. We need to 'divide' and 'conquer' this data for processing. Hence there is a need for hundreds of servers, each handling a slice of the data.

The first post is about the fairly recent computing paradigm 'Map-Reduce'. Map-Reduce is a product of Google Research and was developed to solve their need to create an Inverted Index of Web pages, to compute the Page Rank etc. The algorithm was initially described in a white paper published by Google. The Page Rank algorithm now powers Google's search, which is now almost indispensable in our daily lives.

Map-Reduce assumes that these servers are not perfect, failure-proof machines. Rather, Map-Reduce folds into its design the assumption that the servers are regular, commodity servers performing a part of the task. The hundreds of terabytes of data are split into 16MB to 64MB chunks and distributed across a Distributed File System (DFS), of which there are several implementations. Each chunk is replicated across servers. One of the servers is designated as the 'Master'. This 'Master' allocates tasks to 'worker' nodes. The Master node also keeps track of the location of the chunks and their replicas.

When the Map or Reduce has to process data, the process is started on the server in which the chunk of data resides.

The data is not transferred to the application from another server. The compute is brought to the data and not the other way around. In other words, the process is started on the server where the data and intermediate results reside.

The reason for this is that it is more expensive to transmit data. Besides, the latencies associated with data transfer can become significant with increasing distances.

Map-Reduce had its genesis in a Lisp construct of the same name, where one could apply a common operation over a list of elements and then reduce the resulting list of elements with a reduce operation.
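
Incidentally, R has the same functional pair, Map and Reduce, so the Lisp idea can be sketched in a couple of lines (a purely single-machine illustration of my own, not distributed code):

squares <- Map(function(x) x^2, 1:5)   # apply a common operation over a list: 1, 4, 9, 16, 25
Reduce(`+`, squares)                   # reduce the resulting list with '+': 55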

Map-Reduce was originally created by Google to solve the Page Rank problem. Now Map-Reduce is used across a wide variety of problems.

The main components of Map-Reduce are the following

  1. Mapper: Convert all d ∈ D to (key (d), value (d))
  2. Shuffle: Moves all (k, v) and (k’, v’) with k = k’ to same machine.
  3. Reducer: Transforms {(k, v1), (k, v2), …} to an output (k, f(v1, v2, …))
  4. Combiner: If one machine has multiple (k, v1), (k, v2) with same k then it can perform part of Reduce before Shuffle

A schematic of Map-Reduce is included below.

[Figure: Map-Reduce schematic]

Map-Reduce is usually a perfect fit for problems that have an inherent property of parallelism. For this class of problems the map-reduce paradigm can be applied simultaneously to large sets of data. The 'Hello World' equivalent of Map-Reduce is the word count problem, where we simultaneously count the occurrences of words in millions of documents.

The map operation scans the documents in parallel and outputs key-value pairs, where the key is a word and the value is a count. In this case 'map' will scan each word and emit the pair (word, 1).

So, if the document contained

“All men are equal. Some men are more equal than others”

Map would output

(all,1), (men,1), (are,1), (equal,1), (some,1), (men,1), (are,1), (more,1), (equal,1), (than,1), (others,1)

The Reduce phase will take the above output and sum all key-value pairs with the same key

(all,1), (men,2), (are,2), (equal,2), (some,1), (more,1), (than,1), (others,1)

So we get the count of all the words in the document.
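
Here is a minimal single-machine sketch of the same word count in base R; it is only illustrative (the variable names are my own), since in a real Map-Reduce job the map and reduce steps would run on distributed workers over chunks of many documents.

doc   <- "All men are equal. Some men are more equal than others"
words <- tolower(unlist(strsplit(gsub("[[:punct:]]", "", doc), "\\s+")))
mapped   <- lapply(words, function(w) list(key = w, value = 1L))   # "map": emit (word, 1)
shuffled <- split(sapply(mapped, `[[`, "value"),                   # "shuffle": group by key
                  sapply(mapped, `[[`, "key"))
reduced  <- sapply(shuffled, sum)                                  # "reduce": sum per key
reduced
#  all  are equal  men  more others some than
#    1    2     2    2     1      1    1    1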

In Map-Reduce the Master node assigns tasks to Worker nodes, which process the data on the individual chunks.

[Figure: the Master node assigning tasks to Worker nodes]

Map-Reduce also makes short work of dealing with large matrices and can crunch matrix operations like matrix addition, subtraction, multiplication etc.

Matrix-Vector multiplication

As an example, consider a matrix-vector multiplication (taken from the book Mining Massive Data Sets by Jure Leskovec, Anand Rajaraman et al).

For an n x n matrix M with the value mij in the ith row and jth column, multiplied by a vector v with components vj, the ith component of the matrix-vector product x = Mv is given by

xi = Σj mij * vj

Here the products mij x vj can be computed by the map function and the summation can be performed by a reduce operation. The obvious question is, what if the vector v or the matrix M does not fit into memory? In such a situation the vector and matrix are divided into equal-sized slices and the computation is performed across machines. The application would then have to consolidate the partial results.
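
Below is a minimal single-machine sketch of this map-reduce style matrix-vector product in base R (my own illustration: here the whole matrix fits in memory, whereas a real job would distribute the (i, j, mij) triples across workers).

set.seed(1)
n <- 4
M <- matrix(sample(1:9, n * n, replace = TRUE), nrow = n)
v <- c(1, 2, 3, 4)
idx    <- expand.grid(i = 1:n, j = 1:n)                    # one (i, j) pair per matrix element
mapped <- data.frame(key   = idx$i,                        # "map": emit (i, mij * vj)
                     value = M[cbind(idx$i, idx$j)] * v[idx$j])
x <- tapply(mapped$value, mapped$key, sum)                 # "reduce": xi = sum over j
all.equal(as.numeric(x), as.numeric(M %*% v))              # TRUE: matches the direct product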

Several problems in Machine Learning, Computer Vision, Regression and Analytics require large matrix operations, and Map-Reduce can be used very effectively for them. The computation of Page Rank itself involves such matrix operations, which was one of the triggers for the Map-Reduce paradigm.

Handling failures: As mentioned earlier, the Map-Reduce implementation must be resilient to failures, where failures are the norm and not the exception. To handle this the 'master' node periodically checks the health of the 'worker' nodes by pinging them. If the ping response does not arrive, the master marks the worker as 'failed' and restarts the task allocated to that worker so that the output is generated on a server that is accessible.

Stragglers: Executing a job in parallel brings forth the famous saying 'A chain is only as strong as its weakest link'. If there is one node which is a straggler and is delayed in computation due to, say, disk errors, the Master node starts a backup worker and monitors the progress. When either the straggler or the backup completes, the master kills the other process.

Mining social networks and sentiment analysis of the Twitterverse also utilize Map-Reduce.

However, Map-Reduce is not a panacea for all of the industry’s computing problems (see To Hadoop, or not to Hadoop)

But Map-Reduce is a very important paradigm in the distributed computing domain as it is able to handle mountains of data, can handle multiple simultaneous failures, and is blazingly fast.

Also see
1. A crime map of India in R: Crimes against women
2.  What’s up Watson? Using IBM Watson’s QAAPI with Bluemix, NodeExpress – Part 1
3.  Bend it like Bluemix, MongoDB with autoscaling – Part 2
4. Informed choices through Machine Learning : Analyzing Kohli, Tendulkar and Dravid

To see all posts click 'Index of Posts'

The language R


In the universe of programming languages there is a rising staR. It is moving fasteR and getting biggeR and brighteR!

Ok, you get the hint! It is the language R or the R Language.

The R language is the successor to the language S. R is extremely powerful for statistical computing and processing. It is an interpreted language, much like Python or Perl. The power of R comes from the 4000+ software packages that make the language almost indispensable for any type of statistical computing.

As I mentioned above, in my opinion R is soon going to play a central role in the technological world. In today's world we are flooded with data from all sides, and we need techniques like Big Data, analytics and machine learning to make sense of this deluge. This is where R, with its numerous packages that make short work of data, becomes critical. R also has very interesting graphics packages that display the data in many forms for faster analysis and easier consumption.

The R language can easily ingest large sets of data in CSV format and perform many computations on them. R is being used in machine learning, data mining, classification and clustering, and text mining, besides also being utilized in sentiment analysis of social networks.

The R language contains the usual programming constructs, namely conditionals, loops, assignment etc. The language makes it easy to assign values to vectors, matrices and arrays and to perform all the associated operations on them.
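
For instance, here is a small illustrative snippet of everyday operations on vectors, matrices and a data frame (the read.csv call below reads from an inline string purely for the sake of the example):

v <- c(2, 4, 6, 8)           # a numeric vector
v * 3                        # element-wise arithmetic: 6 12 18 24
mean(v); sd(v)               # summary statistics
M <- matrix(1:6, nrow = 2)   # a 2 x 3 matrix, filled column-wise
t(M)                         # transpose
M %*% t(M)                   # matrix multiplication (2 x 2 result)
df <- read.csv(text = "x,y\n1,10\n2,20\n3,30")   # ingesting CSV data
summary(df)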

R can be installed from the R-project site. The R distribution comes with many datasets collected from various sources. One such dataset is the Iris dataset, a dataset about the Iris plant (Iris is a genus of 260-300 species of flowering plants with showy flowers).

The dataset contains 5 parameters

1) Sepal length 2) Sepal width 3) Petal length 4) Petal width 5) Species

This dataset has been used in many research papers. R allows you to easily perform a sophisticated set of statistical operations on this data. Included below is a sample set of operations you can perform on the Iris dataset, or on any dataset.

> iris[1:5,]
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa

> summary(iris)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500

> hist(iris$Sepal.Length)

[Histogram of iris$Sepal.Length]

Here is a scatter plot of the Petal width, sepal length and sepal width

> library(scatterplot3d)    # the scatterplot3d package needs to be installed first
> scatterplot3d(iris$Petal.Width, iris$Sepal.Length, iris$Sepal.Width)

[3D scatter plot of Petal.Width, Sepal.Length and Sepal.Width]

As can be seen, R can really make short work of data with the numerous packages that come along with it. I have just skimmed the surface of the R language.

I hope this has whetted your appetite. Do give R a spin!

Watch this space!

You may also like
1. Introducing cricketr! : An R package to analyze performances of cricketers
2. Literacy in India : A deepR dive.
3. Natural Language Processing: What would Shakespeare say?
4. Revisiting crimes against women in India
5. Sixer – R package cricketr’s new Shiny Avatar

Also see
1. Designing a Social Web Portal
2. Design principles of scalable, distributed systems
3. A Cloud Medley with IBM’s Bluemix, Cloudant and Node.js
4. Programming Zen and now – Some essential tips -2 
5. Fun simulation of a Chain in Android


Perils and pitfalls of Big Data


Big Data is hurtling towards us in a big way. It is already in the news and the blip seems to be getting bigger. Big Data will soon become the key driver for almost any kind of decision in manufacturing, retail and finance, all the way to astronomy, oceanography etc. The common aspect of all these industries and areas is that data is generated in the order of several petabytes to exabytes. Big Data is the technique used to analyze such large volumes of data.

Big Data represents the techniques to handle the huge deluge of data that is already becoming enmeshed in our lives. Multiple disparate, varied streams of data (text, tweets, click streams, HTML) flow through with tremendous volume and velocity. The key aspects of this data are its volume, variety and velocity. It is never ending and never seems to stop. How do we handle this deluge? How do we make sense of this data? That is what Big Data is all about.

Big Data provides algorithms to find patterns, determine trends or classify data depending on the features provided. It is supposed to enable decision makers to make key decisions based on the answers the algorithms spew forth.

Big Data is also complicated by the fact that data comes in multiple forms: click streams, tweets, HTML, text, CSVs, structured and unstructured data.

The ability to detect patterns, determine trends, classify data and identify outliers is no easy task.

In this post I try to take a philosophical look at Big Data and ask whether it can really help us. Will it help, or will it take us on a wild goose chase? Can we trust the results?

Big Data depends on algorithms to make sense of data. Big Data deals with data that is of the order of petabytes to exabytes. At this scale, with multiple features, our cognitive abilities are of no use. We must rely on machines and algorithms to make sense of these large amounts of data. Our mind can handle a few hundred data points and at most 3 dimensions; beyond that we can hardly make any sense of the data.

Data by itself, in the absence of features & algorithms, is indistinguishable from noise. It is data science that makes sense of data. Data science separates the signal from the noise.

It is the algorithms that try to determine the best fit for a given set of data. But how reliable are the results? For example, let us take the following case.

[Figure: data points forming a circle and a rectangle]

An unsupervised learning algorithm for the above data points could try to separate the data into 2 sets. Clearly this is one way, but what is more appropriate is to recognize that we have 2 shapes, the circle and the rectangle. A machine algorithm will work based on the features that we choose. Are we in a position to decide whether the answer the algorithm gives us is correct? We have no way of knowing, because the amount of data is beyond our cognitive capabilities.
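
Here is a small R illustration of this point (my own sketch, not the figure above): if we generate points on a circle and inside a rectangle that overlap in space, k-means on the raw (x, y) coordinates splits the plane left/right and mixes the two shapes, while a different feature, the distance from the center, recovers the grouping we actually want. The answer depends entirely on the features we chose.

set.seed(7)
theta  <- runif(200, 0, 2 * pi)
circle <- cbind(x = 2 * cos(theta), y = 2 * sin(theta))         # ring of radius 2
rect   <- cbind(x = runif(200, -1, 1), y = runif(200, -1, 1))   # square inside the ring
pts    <- rbind(circle, rect)
shape  <- rep(c("circle", "rectangle"), each = 200)
table(kmeans(pts, centers = 2)$cluster, shape)                  # raw coordinates: shapes get mixed
r <- sqrt(rowSums(pts^2))                                       # feature: distance from the origin
table(kmeans(matrix(r), centers = 2)$cluster, shape)            # now the two shapes separate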

In other words, Big Data is full of perils and pitfalls.

When we let the machine analyze on our behalf, the possibility of coming to a wrong conclusion is fairly high. This, coupled with the fact that we are sometimes led to erroneous judgments, as discussed below, compounds the problem further.

In his book "Thinking, Fast and Slow" Daniel Kahneman discusses several situations where our mind falls into the traps of lazy thinking and we come to wrong conclusions. Our minds also tend to detect patterns in data where there are none. According to Kahneman, 'randomness appears as regularity or a tendency to cluster', and 'the tendency to see patterns in randomness is overwhelming'. Since in Big Data it is the algorithm that determines the pattern, we could be tricked into coming to false conclusions. Sometimes the human mind sees causality where there is none. Occasionally we fail to see the obvious.

In the famous 'invisible gorilla' experiment the researchers tried to assess selective attention. The participants are asked to count the number of passes made by those in white t-shirts. Surprisingly, a large number of the participants were completely oblivious to a gorilla that appears midway in the video. When we, as humans, fail to see such large objects, can we expect the machine to accurately identify patterns and perform accurate classifications?

There are techniques that help in determining false positives, e.g. the Bonferroni correction. Simply put, when one tests many hypotheses simultaneously, the chance of getting at least 1 significant result purely by chance becomes large, and the Bonferroni correction adjusts the significance level to account for this. If we test 20 hypotheses at a significance level of 0.05, then the probability of at least 1 significant result is

P(at least one significant result) = 1 – P(no significant results)
                                   = 1 – (1 – 0.05)^20
                                   ≈ 0.64

So, with 20 tests being considered, we have a 64% chance of observing at least one significant result, even if all of the tests are actually not significant. This would be a false positive.
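
This calculation is easy to verify in R, and the same snippet shows the effect of the Bonferroni correction, which simply tests each hypothesis at a level of alpha/n instead of alpha (variable names are my own):

alpha   <- 0.05
n_tests <- 20
1 - (1 - alpha)^n_tests              # family-wise error without correction
# [1] 0.6415141
1 - (1 - alpha / n_tests)^n_tests    # with the Bonferroni-corrected level alpha / 20
# [1] 0.04883012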

Given that our ability to come to significant conclusions depends largely on being able to choose appropriate features, we must also be able to maneuver between false negatives and false positives. In addition we must also take into account the fallibility of the human mind.

Clearly, Big Data is the future! However with Big Data we are really on treacherous, slippery ground!


Close encounters with the future



Published in Telecom Asia, Oct 22,2013 – Close encounters with the future

Where a calculator on the ENIAC is equipped with 18,000 vacuum tubes and weighs 30 tons, computers in the future may have only 1,000 vacuum tubes and perhaps weigh 1.5 tons.—POPULAR MECHANICS, 1949

Introduction: Ray Kurzweil in his non-fiction book “The Singularity is near – When humans transcend biology” predicts that by the year 2045 the Singularity will allow humans to transcend our ‘frail biological bodies’ and our ‘petty, derivative and circumscribed brains’ . Specifically the book claims “that there will be a ‘technological singularity’ in the year 2045, a point where progress is so rapid it outstrips humans’ ability to comprehend it. Irreversibly transformed, people will augment their minds and bodies with genetic alterations, nanotechnology, and artificial intelligence”.

He believes that advances in robotics, AI, nanotechnology and genetics will grow exponentially and will lead us into a future realm of intelligence that will far exceed biological intelligence. This explosion will be the result of 'accelerating returns from significant advances in technology'.

Futurescape

Here is a look at some of the more fascinating key trends in technology. You can decide whether we are heading to Singularity or not.

Autonomous Vehicles (AVs): Self-driving cars have moved from the realm of science fiction to reality in recent times. Google's autonomous cars have already driven around half a million miles. All the major car manufacturers of the world, from BMW, Mercedes, Toyota and Nissan to Ford and GM, are coming out with their own versions of autonomous cars. These cars are equipped with Adaptive Cruise Control and Collision Avoidance technologies and are already taking some control away from drivers. Moreover, AVs alert drivers if their attention strays from the road ahead for too long. Autonomous Vehicles work with the help of Vehicular Communication Technology.

Vehicular Communication along with the Intelligent Transport Systems (ITS) achieves safety by enabling communication between vehicles, people and roads. Vehicle-to-vehicle communications are the fundamental building block of autonomous, self-driving cars. It enables the exchange of data between vehicles and allows automobiles to “see” and adapt to driving obstacles more completely, preventing accidents besides resulting in more efficient driving.

Smart Assistants: From the defeat of Kasparov in chess by IBM's Deep Blue in 1997, to the resounding victory of IBM's Watson in Jeopardy, capable of understanding natural human language, to the more prevalent Apple intelligent assistant Siri, Artificially Intelligent (AI) systems have come a long way. The newest trend in this area is Smart Assistants. Robots are currently analyzing documents, filling prescriptions, and handling other tasks that were once exclusively done by humans. Smart Assistants are already taking over the tasks of BPO operators, paralegals, store clerks and baby sitters. Robots, in many ways, are not only smarter than humans, but also do not get easily bored.

Intelligent homes and intelligent offices: Rapid advances in technology will come closer to the home, both literally and figuratively. The future home will have the ability to detect the presence of people, pets and smoke, and changes in humidity, moisture, lighting and temperature. Smart devices will monitor the environment and take appropriate steps to save energy, improve safety and enhance the security of homes. Devices will start learning your habits and enhance your comfort and convenience. Everything from thermostats, fire detectors and washing machines to refrigerators will be equipped with electronics capable of adapting to the environment. All gadgets at home will be accessible through laptops, tablets or smartphones from anywhere. We will be able to monitor all aspects of our intelligent home from anywhere.

Smart devices will also make major inroads into offices leading to the birth of intelligent offices where the lighting, heating, cooling will be based on the presence of people in the offices. This will result in an enormous savings in energy. The advances in intelligent homes and intelligent offices will be in the greater context of the Smart Grid.

Swarms of drones: Contrary to the use of weaponized drones for unmanned aerial surveys of enemy territory, we will soon have commercial drones being used for civilian purposes. The most compelling aspect of drones these days is the fact that they can be easily manufactured in large quantities, are cheap and can perform complex tasks either singly or collectively. Remotely controlled drones can perform hundreds of civilian jobs, including traffic monitoring, aerial surveying, oil pipeline inspections and monitoring of crop conditions. Drones are also being employed for the conservation of wildlife. In the wilderness of Africa, drones are already helping in providing aerial footage of the landscape, tracking poachers and also herding elephants. However, before drones become a common sight, it is necessary to ensure that appropriate laws are made for maintaining the safety and security of civilians. This is likely to happen in the US in 2015, when the Federal Aviation Administration (FAA) will come up with rules to safely integrate drones into the American skies.

MOOCs (Massive Open Online Courses): The concept of the MOOC, or the 'Massive Open Online Course' from top colleges, though just a few years old, is already taking the world by storm. Coursera, edX and Udacity are the top 3 MOOC platforms, among many others, and offer a variety of courses on technology, philosophy, sociology, computer science etc. As more courses become available online, the requirement of having a uniform start and end date will diminish gradually. The availability of course lectures at all times and through all devices, namely the laptop, tablet or smartphone, will result in large-scale adoption by students of all ages.

Contrary to regimented classes, MOOCs now allow students to take classes at their own pace. It is likely that some students will breeze through an entire semester's worth of classes in a few weeks. It is also likely that a few students will graduate in 4 years with more than a couple of degrees. MOOCs are a natural development considering that the world is going to be more knowledge-driven, and there will be a need for experts with a diverse set of in-depth skills. Here is an interesting article in the WSJ, "What College will be like in 2023".

3D Printing: This is another technology that is bound to become ubiquitous in our future. 3D printers will revolutionize manufacturing in ways we could never imagine. A 3D printer is similar to a hot-glue gun attached to a robotic arm. A 3D printer creates an object by stacking one layer of material, typically plastic or metal, on top of another. 3D printers have been used for making everything from prosthetic limbs, phone cases and lamps all the way to a NASA-funded 3D-printed pizza. Here is a great article in the New York Times, "Dinner is Printed". It is likely that a 3D printer will be as indispensable to our future homes as the refrigerator and microwave.

Artificial sense organs: A recent news item in Science 2.0, "The Future: touch sensitive prosthetic limbs", discusses the invention of a prosthetic limb that can actually provide the sense of touch by stimulating the regions of the brain that deal with it. The researchers identified the neural activity that occurs when grasping or feeling an object and successfully induced these patterns in the brain. Two parallel efforts are underway to understand how the human brain works: "The Human Brain Project", which has 130 members from the European Union, and Obama's BRAIN project. Both these projects attempt 'to give us a deeper and more meaningful understanding of how the human brain operates'. Possibilities as in the movies 'Avatar' or 'Terminator' may not be far away.

The others: Besides the above, technologies like Big Data, Cloud Computing, the Semantic Web, the Internet of Things and the Smart Grid will also swamp us in the future, and much has already been said about them.

Conclusion: The above sets of technologies represent seismic shifts and are bound to explode in our future in a million ways.

Given the advances in bionic limbs, Machine Intelligent AI systems, MOOCs, Autonomous Vehicles are we on target for the Singularity?

I wouldn’t be surprised at all!


The Next Frontier


Published in Telecom Asia – The next frontier, 21, Mar, 2012

In his classic book “The Innovator’s Dilemma” Prof. Clayton Christensen of Harvard Business School presents several compelling cases of great organizations that fail because they did not address disruptive technologies, occurring in the periphery, with the unique mindset required in managing these disruptions.

In the book the author claims that when these disruptive technologies appeared on the horizon there were few takers, because there were no immediate applications for them. For e.g. when the hydraulic excavator appeared, its performance was inferior to the existing, predominant cable-actuated excavators. But in course of time the technology behind hydraulic excavators improved significantly to displace the existing technology. Similarly, the appearance of the 3.5 inch disk had no immediate takers in desktop computers but it made its way into the laptop.

Similarly, the minicomputer giant Digital Equipment Corporation (DEC) ignored the advent of the PC era and focused all its attention on making more powerful minicomputers. This led to the ultimate demise of DEC and several other organizations in this space. The book includes several such examples of organizations that went defunct because disruptive technologies ended up cannibalizing established technologies.

In the last couple of months we have seen technology trends pouring in. It is now accepted that cloud computing, mobile broadband, social networks, big data, LTE, Smart Grids, and the Internet of Things will be key players in the world of our future. We are now at a point in time when serious disruption is not just possible but seems extremely likely. The IT market research firm IDC, in its Directions 2012, believes that we are on the cusp of a Third Platform that will dominate the IT landscape.

There are several technologies that have been appearing on the periphery and have gleaned only marginal interest, for e.g. Super Wi-Fi or Whitespaces, which uses unlicensed spectrum to cover larger distances of up to 100 km. Whitespaces has been trialed by a few companies in the last year. Another interesting technology is WiMAX, which provides speeds of 40 Mbps over distances of up to 50 km. WiMAX's deployment has been spotty and has not led to widespread adoption in comparison to its apparent competitor LTE.

In the light of these technology entrants, the disruption in the near future may occur because of a paradigm shift which I would like to refer to as the "Neighborhood Area Computing (NAC)" paradigm. It appears that technology will veer towards neighborhood computing given the bandwidth congestion issues of the WAN. A neighborhood area network (NAN) will supplant the WAN for networks which address a community in a smaller geographical area.

This will lead to three main trends

Neighborhood Area Networks (NAN): Major improvements in Neighborhood Area Networks (NAN) are inevitable given the rising importance of smart grids and M2M technology in the context of WAN latencies. Residential homes of the future will have a Home Area Network (HAN) based on Bluetooth or Zigbee protocols connecting all electrical appliances. In a smart grid context the NAN provides the connectivity between the Home Area Network (HAN) of a future Smart Home and the WAN network. While it is possible that the utility HAN network will be separate from the IP access network of the residential subscriber, the more likely possibility is that the HAN will be a subnet within the home network and will connect to the NAN network.

The data generated from smart grids, M2M networks and mobile broadband will need to be stored and processed immediately through big data analytics in a neighborhood datacenter. Shorter-range technologies like WiMAX and Super WiFi/Whitespaces will transport the data to a neighborhood cloud, on which Hadoop-based Big Data analytics will provide real-time insights.

Death of the Personal Computer: The PC/laptop will soon give way to a cloud-based computing platform similar to Google's Chromebook. Not only will we store all our data in the cloud (music, photos, videos), we will also use the cloud for our daily computing needs. Given the high speeds of the NAN this should be quite feasible in the future. The cloud will remove our worries about virus attacks, patch updates and the need to buy new software. We will also begin to trust our data in the cloud as we progress to the future. Moreover, pay-per-use will be very attractive to consumers.

Exploding Datacenters: As mentioned above, a serious drawback of the cloud is WAN latency. It is quite likely that, with the increases in processing power and storage capacity coupled with dropping prices, cloud providers will have hundreds of data centers with around 1000 servers for each city rather than a few mega data centers with tens of thousands of servers. These data centers will address the computing needs of a community in a small geographical area. Such smaller data centers, typically in a small city, will solve 2 problems: they will build geographical redundancy into the cloud, and they will provide excellent performance, as NAN latencies will be significantly lower than WAN latencies.

These technologies will improve significantly and fill the need for handling neighborhood high-speed data.

The future definitely points to computing in the neighborhood.
