May 13, 2014

Machine Learning

I had zero knowledge of this topic but wanted to explore it. I took Large Scale Hierarchical
Text Classification (LSHTC) as my MS project so that I would have a good scenario for getting
started with Machine Learning.

The first thing I wanted to know was the format of the data provided by LSHTC. It turned out to
be the sparse SVM (libsvm-style) format. Both the training data and the test data look like this:

label,label,label… feature:value feature:value

Each label is a category the document belongs to.

Each feature:value pair represents a word and its weight (term frequency, TF) in the document.
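
To make the format concrete, here is a small sketch of how one such line could be parsed in plain Python. The function name parse_line and the sample lines in the usage note are made up for illustration, and the sketch assumes that labels and feature ids are integers.

```python
def parse_line(line):
    """Parse 'label,label feature:value ...' into (labels, features)."""
    labels = []
    features = {}
    for token in line.split():
        if ":" in token:
            # A feature:value pair, e.g. "507:3"
            feature_id, value = token.split(":", 1)
            features[int(feature_id)] = float(value)
        else:
            # One or more labels, e.g. "33,45" or "33," when a space follows the comma
            labels.extend(int(part) for part in token.split(",") if part)
    return labels, features
```

For example, parse_line("33,45 12:1 507:3") gives the labels [33, 45] and the sparse vector {12: 1.0, 507: 3.0}. Keeping the features in a dict means only the words that actually occur in the document are stored.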

Choice of programming language

I had to choose between Java and Python.

I chose Python for the following reasons:

Huge set of Machine Learning libraries - given that I was a beginner, this made a big difference. More libraries, more documentation, more examples => more experiments and better understanding
A lot of Machine Learning these days is done with Python
Less cumbersome to try out a scenario - given that Python is more of a scripting language, experiments can be made quickly, especially with IPython
Also the hype around it these days :)

Libraries

scikit-learn - massive collection of algorithms for regression, classification, clustering, dimensionality reduction, model selection, pipelining etc.
mlpy - similar to scikit-learn but offers a smaller set
GraphLab - more of a recommendation engine
Spark - very good parallel ML framework, but still at an early stage; it does not offer many algorithms yet

I started off with scikit-learn. It offers a huge range of algorithms. I then had to do a lot of
reading about the basics of classification: hyperplanes, linear and non-linear classification,
K-Nearest Neighbours and Support Vector Machines (SVM), including what an SVM is and why it
is used.

The Stanford NLP book helped me a lot in understanding the basics of classification.

Algorithms

I’m an absolute beginner in Machine Learning, and every algorithm I look at seems like the right
one. Only after experimenting with each of them do you know which is the best fit and why.

The problem I was solving involved medium-scale data: 250,000 records of test data and 2
million records of training data. Both the training and the test data have a large number of
features.

K Nearest Neighbour

I started the first trial with the K-nearest neighbour (kNN) algorithm. It turns out to be a very
good algorithm, but it doesn’t scale well to larger datasets. There are variants of kNN that
speed up the neighbour search with an index structure, such as KD Tree and Ball Tree, but they
still don’t help much on datasets with more than 10,000 records.

I also frequently got a “Core dumped” error when I tried plain kNN and kNN with chi2-based
selection of the best features. I’m still figuring out the reason; my feeling is that it doesn’t
scale to larger datasets. But I get the same error for small datasets of 100 records, which is
weird and hints that I might be doing something wrong!

After reading a few articles I came to the conclusion that it is better to use SVM for large
datasets.
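
To see why plain kNN gets slow, here is a minimal brute-force sketch in pure Python. It is not what scikit-learn does internally. It only illustrates that every prediction compares the query against every training row. With 250,000 test records and 2 million training records, that is on the order of 5 × 10^11 comparisons. The function names, the cosine similarity measure and the {feature: weight} dict representation are my own choices for the illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as {feature: weight} dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    dot = sum(weight * b.get(feature, 0.0) for feature, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_predict(train, query, k=3):
    """train is a list of (label, vector) pairs; return the majority label of the k nearest rows."""
    # The query is scored against every training row, so one prediction costs O(len(train)).
    ranked = sorted(train, key=lambda row: cosine(row[1], query), reverse=True)
    votes = Counter(label for label, _ in ranked[:k])
    return votes.most_common(1)[0][0]
```

Nothing is learned up front, so all of the work happens at prediction time, which is exactly where a large training set hurts.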

Support Vector Machines (SVM)

Support Vector Machines are fast and efficient learning algorithms for classification and
regression, and the linear variant in particular works well on large datasets. A linear SVM does
linear classification. We can also define custom kernels for SVM. The SVM module in
scikit-learn offers these commonly used kernels:

linear
polynomial
Radial Basis Function (rbf)
sigmoid
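
For reference, the four kernels can be written as small functions on dense vectors. This is only a sketch of the textbook formulas in pure Python, not scikit-learn's implementation. The parameter names gamma, coef0 and degree mirror the ones scikit-learn uses, but the default values here are picked just for the illustration.

```python
import math


def dot(x, z):
    """Dot product of two equal-length dense vectors."""
    return sum(a * b for a, b in zip(x, z))


def linear_kernel(x, z):
    # K(x, z) = x . z
    return dot(x, z)


def polynomial_kernel(x, z, degree=3, gamma=1.0, coef0=0.0):
    # K(x, z) = (gamma * x . z + coef0) ^ degree
    return (gamma * dot(x, z) + coef0) ** degree


def rbf_kernel(x, z, gamma=1.0):
    # K(x, z) = exp(-gamma * ||x - z||^2)
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


def sigmoid_kernel(x, z, gamma=1.0, coef0=0.0):
    # K(x, z) = tanh(gamma * x . z + coef0)
    return math.tanh(gamma * dot(x, z) + coef0)
```

The linear kernel is just the dot product, while the other three add parameters (gamma, coef0, degree) that have to be tuned.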

The results with the RBF kernel turned out to be bad: I got the same predicted label for most of
the test data.

I switched to linear SVM and the results turned out to be quite decent.

Scaling

Given the problem is about Large Scale classification, scaling the algorithm to cater to large
datasets is very important!

The algorithms in scikit-learn are in-core, meaning they expect the whole dataset to fit in
memory, and prediction typically runs in a single process on a single core. This turns out to be
bad when running prediction on large datasets.

The way out is multicore processing: we can divide the task into subtasks and run them on
different cores. In my case, I split the test data into smaller subsets and predict them as
separate jobs, utilizing multiple cores. scikit-learn also provides a job-processing library
called joblib, which enables this process.
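
The splitting idea can be sketched as follows. Note that this sketch uses Python's standard concurrent.futures module rather than joblib, so that it runs without extra packages. The function names and the use_processes switch are made up for the illustration, and predict_chunk stands in for whatever function runs the model on one subset.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def split_into_chunks(rows, n_chunks):
    """Split rows into at most n_chunks contiguous, nearly equal slices."""
    size, extra = divmod(len(rows), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        if end > start:  # skip empty slices when there are fewer rows than chunks
            chunks.append(rows[start:end])
        start = end
    return chunks


def predict_in_chunks(predict_chunk, rows, n_workers=2, use_processes=True):
    """Run predict_chunk on each slice in parallel and stitch the results back in order."""
    # Processes use several cores for CPU-bound Python code; predict_chunk must then be
    # a picklable top-level function. Threads are kept as a lightweight fallback.
    executor_cls = ProcessPoolExecutor if use_processes else ThreadPoolExecutor
    predictions = []
    with executor_cls(max_workers=n_workers) as pool:
        # map() yields results in the order the chunks were submitted
        for chunk_result in pool.map(predict_chunk, split_into_chunks(rows, n_workers)):
            predictions.extend(chunk_result)
    return predictions
```

Because the results come back in submission order, the combined predictions line up with the original test records.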

Soon we run into the problem of having a copy of the training data in each job doing the
prediction. To overcome this, joblib provides caching of function results, and cached arrays can
be loaded memory-mapped, which lets the jobs share one copy rather than each creating its own.
The problem seems to be solved, but this will not work once the dataset is large enough that it
needs to be spread across different machines!
