<p>Raghotham S, raghothams.svbtle.com</p>
<h1>Parsing JSON in Scala</h1>
<p><em>2015-04-26</em></p>
<h2 id="introduction_2">Introduction <a class="head_anchor" href="#introduction_2">#</a>
</h2>
<p>I started a side project in Scala with a group of friends (all of us Scala noobs). We chose Scala because it is well known for type safety and for functional programming with OOP support.<br>
One of the important parts of the project was talking to a REST API that returned JSON responses.</p>
<p>We began our hunt for an efficient JSON parser for Scala and were soon flooded with libraries:</p>
<ul>
<li>spray-json</li>
<li>jerkson</li>
<li>jackson</li>
<li>json4s</li>
<li>jacksMapper</li>
</ul>
<p>With so many options, we were confused! Thanks to the Ooyala Engineering team for this <a href="http://engineering.ooyala.com/blog/comparing-scala-json-libraries">wonderful post</a> comparing the libraries. Finally, we decided to go ahead with <strong>json4s</strong> because we found it handy for extracting objects out of JSON, and because of its support for Jackson (faster parsing).</p>
<h2 id="problem_2">Problem <a class="head_anchor" href="#problem_2">#</a>
</h2>
<p>The problem with most of the libraries listed above, especially json4s, is the poor documentation. The examples given are straightforward cases where the structure of the JSON response and the object model are exactly the same.</p>
<pre><code class="prettyprint">scala> import org.json4s._
scala> import org.json4s.jackson.JsonMethods._
scala> implicit val formats = DefaultFormats // Brings in default date formats etc.
scala> case class Child(name: String, age: Int, birthdate: Option[java.util.Date])
scala> case class Address(street: String, city: String)
scala> case class Person(name: String, address: Address, children: List[Child])
scala> val json = parse("""
  { "name": "joe",
    "address": {
      "street": "Bulevard",
      "city": "Helsinki"
    },
    "children": [
      {
        "name": "Mary",
        "age": 5,
        "birthdate": "2004-09-04T18:06:22Z"
      },
      {
        "name": "Mazy",
        "age": 3
      }
    ]
  }
""")
scala> json.extract[Person]
res0: Person = Person(joe,Address(Bulevard,Helsinki),List(Child(Mary,5,Some(Sat Sep 04 18:06:22 EEST 2004)), Child(Mazy,3,None)))
</code></pre>
<h2 id="what-if-we-want-to-convert-part-of-the-json-i_2">What if we want to convert part of the JSON into an object? <a class="head_anchor" href="#what-if-we-want-to-convert-part-of-the-json-i_2">#</a>
</h2>
<p>From the above example, what if we want to convert only the address information into an object? There is little to no documentation that guides beginners through such a task.<br>
<img src="http://i.imgur.com/PGMDnU8.png?1" alt="tweet"><br>
<a href="https://twitter.com/shrayasr/status/590936104716488704">Link to tweet</a></p>
<h2 id="solution_2">Solution <a class="head_anchor" href="#solution_2">#</a>
</h2>
<p>We can traverse the JSON by giving it a path expression. In the above example, we can traverse to the <code class="prettyprint">address</code> object by giving its path from the root, which is <code class="prettyprint">"address"</code></p>
<pre><code class="prettyprint">scala> json \ "address"
</code></pre>
<p>The above statement does the traversal and returns a <code class="prettyprint">JValue</code>. Once we have the <code class="prettyprint">JValue</code> for the address, we can convert it into an <code class="prettyprint">Address</code> object using the <code class="prettyprint">extract</code> method</p>
<pre><code class="prettyprint">scala> case class Child(name: String, age: Int, birthdate: Option[java.util.Date])
scala> case class Address(street: String, city: String)
scala> val json = parse("""
  { "name": "joe",
    "address": {
      "street": "Bulevard",
      "city": "Helsinki"
    },
    "children": [
      {
        "name": "Mary",
        "age": 5,
        "birthdate": "2004-09-04T18:06:22Z"
      },
      {
        "name": "Mazy",
        "age": 3
      }
    ]
  }
""")
scala> val addressJson = json \ "address" // Extract address object
scala> val addressObj = addressJson.extract[Address]
addressObj: Address = Address(Bulevard,Helsinki)
</code></pre>
<p><strong>BOOM!</strong> You have extracted an object of type <code class="prettyprint">Address</code> from the JSON.</p>
<pre><code class="prettyprint">scala> val children = (json \ "children").extract[List[Child]] // Extract list of objects
children: List[Child] = List(Child(Mary,5,Some(Sat Sep 04 23:36:22 IST 2004)), Child(Mazy,3,None))
</code></pre>
<p>Now you have created a List of type <code class="prettyprint">Child</code></p>
<p>The general trend I see is that the <strong>Getting started</strong> or <strong>Usage</strong> guides available for various libraries do not help beginners get started quickly on a given problem. We need better beginner docs that showcase examples close to real-world scenarios.</p>
<h1>Markers with D3</h1>
<p><em>2015-01-22</em></p>
<p>Every time I look at the examples page of D3, I’m simply go…<br>
<a href="http://i.imgur.com/bhLuxln.gif"><img src="http://i.imgur.com/bhLuxln.gif" alt="Mind Blown"></a><br>
<a href="https://twitter.com/mbostock">@mbostock</a> has transformed how visualizations are created for the web.</p>
<p>Today I learnt how to use svg markers with D3. I was using force layout to analyze graphs, just like this <a href="http://bl.ocks.org/mbostock/4062045">example</a>. But I wanted a directed graph!<br><br>
<a href="http://i.imgur.com/t9aydD3.png"><img src="http://i.imgur.com/t9aydD3.png" alt="yuno-meme"></a></p>
<p>Later, I came across another <a href="http://bl.ocks.org/d3noob/5141278">example</a> which had direction. I was happy because a ready-made solution solved the problem. But I soon ran into a problem, as I wanted a custom tree-like structure with every path directed, i.e. arrow markers at the end of each path.</p>
<p>I went back to the ready-made solution and had a look at the part of the code that generates the arrows.</p>
<pre><code class="prettyprint">// build the arrow.
svg.append("svg:defs").selectAll("marker")
    // Different link/path types can be defined here
    .data(["end"])
    // This section adds in the arrows
  .enter().append("svg:marker")
    // this makes the id 'end', coming from the data
    .attr("id", String)
    .attr("viewBox", "0 -5 10 10")
    .attr("refX", 15)
    .attr("refY", -1.5)
    .attr("markerWidth", 6)
    .attr("markerHeight", 6)
    .attr("orient", "auto")
  .append("svg:path")
    .attr("d", "M0,-5L10,0L0,5");
<p><em>Thanks to <a href="https://gist.github.com/d3noob">d3noob</a> for adding comments to the code</em></p>
<p>The above code just creates an arrow definition. It can be attached to any element later by adding the code below as the element’s attribute.</p>
<pre><code class="prettyprint">svgElement.attr("marker-end", "url(#end)");
</code></pre>
<p>So what is the magic happening here? Let’s look closely at what we are doing while building the arrow.</p>
<pre><code class="prettyprint">svg.append("svg:defs").selectAll("marker")
    .data(["end"])
  .enter().append("svg:marker")
    .attr("id", String)
    .attr("viewBox", "0 -5 10 10")
    .attr("refX", 15)
    .attr("refY", -1.5)
    .attr("markerWidth", 6)
    .attr("markerHeight", 6)
    .attr("orient", "auto")
  .append("svg:path")
    .attr("d", "M0,-5L10,0L0,5");
<p>We are creating an <strong>SVG def</strong> which will generate the arrowhead. SVG defs are a way of defining graphical objects that can be applied to elements. The definition can be anything, like a marker or a gradient, as shown in the <a href="https://developer.mozilla.org/en-US/docs/Web/SVG/Element/defs">MDN example</a>. Now that we have created a definition, it can easily be applied to any element.</p>
<p>We will use the defined marker and apply it to every path we have by altering the path’s attribute</p>
<pre><code class="prettyprint">svg.selectAll(".link")
.attr("marker-end", "url(#end)");
</code></pre>
<p>We use the <code class="prettyprint">marker-end</code> attribute and assign the definition’s id as its value (in our case it is #end). The <a href="https://developer.mozilla.org/en-US/docs/Web/SVG/Attribute/marker-end">marker-end</a> attribute adds an arrowhead or any other object at the final vertex of the path.</p>
<p>Now that we have added arrows to the paths, let’s see the output<br>
<a href="http://i.imgur.com/kRSOQs9.png"><img src="http://i.imgur.com/kRSOQs9.png" alt="output"></a></p>
<p>A peek into the DOM<br>
<a href="http://i.imgur.com/mlB3Spz.png"><img src="http://i.imgur.com/mlB3Spz.png" alt="dom-svg"></a></p>
<p><a href="https://c1.staticflickr.com/3/2724/4416219525_eebf385a7d_z.jpg?zz=1"><img src="https://c1.staticflickr.com/3/2724/4416219525_eebf385a7d_z.jpg?zz=1" alt="thats all folks"></a></p>
<p><em>Thanks to mbostock, d3noob, MDN</em></p>
<h1>DLNA on Raspberry Pi</h1>
<p><em>2014-09-07</em></p>
<p>I always wanted to set up a media server at home, for the following reasons:</p>
<ol>
<li>Reduce redundancy - no more keeping multiple copies of media for different devices like phone, tablet, smart TV etc.</li>
<li>Ease of use - no need to copy files to and from devices to play media (mostly <em>Floyd</em> and movies)</li>
<li>One-stop shop with Transmission integration - download files on the rpi and they appear on the media server</li>
</ol>
<p>The easiest solution was to turn my <strong>RaspberryPi</strong> into a DLNA server. For this I needed a few basic packages and had to configure each of them.</p>
<p>It was a bit hard to find all of these steps in a single place, hence I’m writing this post.</p>
<h2 id="packages-required_2">Packages required <a class="head_anchor" href="#packages-required_2">#</a>
</h2>
<ul>
<li>samba</li>
<li>nginx (for transmission)</li>
<li>nfs</li>
<li>ntfs (optional, to support ntfs file system)</li>
<li>transmission-daemon</li>
<li>
<p>minidlna</p>
<pre><code class="prettyprint">sudo apt-get install samba samba-common-bin
sudo apt-get install nginx
sudo apt-get install nfs-kernel-server nfs-common portmap
sudo apt-get install ntfs-3g # if you want ntfs
sudo apt-get install transmission-daemon
sudo apt-get install minidlna
</code></pre>
</li>
</ul>
<h2 id="samba_2">samba <a class="head_anchor" href="#samba_2">#</a>
</h2>
<p>Append the following to /etc/samba/smb.conf</p>
<pre><code class="prettyprint">[public]
path = /path/to/public/folder
browseable = yes
writeable = yes
guest ok = no
read only = no
</code></pre>
<h2 id="minidlna_2">minidlna <a class="head_anchor" href="#minidlna_2">#</a>
</h2>
<p>Edit /etc/minidlna.conf</p>
<pre><code class="prettyprint">media_dir=/path/to/public/folder
media_dir=V,/path/to/public/videos/folder
media_dir=A,/path/to/public/music/folder
media_dir=P,/path/to/public/pictures/folder
friendly_name=rpi
</code></pre>
<h2 id="transmissiondaemon_2">transmission-daemon <a class="head_anchor" href="#transmissiondaemon_2">#</a>
</h2>
<pre><code class="prettyprint">sudo mkdir -p /opt/torr
sudo chown -R debian-transmission /opt/torr
sudo cp /etc/transmission-daemon/settings.json /etc/transmission-daemon/settings_template.json
</code></pre>
<p>Edit /etc/transmission-daemon/settings.json</p>
<p>Change the value of the <em>download-dir</em> field to /opt/torr</p>
<pre><code class="prettyprint">{
..
"download-dir": "/opt/torr",
..
}
</code></pre>
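<p>Since settings.json is plain JSON, the edit can also be scripted. Below is a minimal Python sketch (the helper name is mine, not part of transmission). Stop transmission-daemon before running it: the daemon rewrites settings.json on shutdown and will overwrite manual edits.</p>
<pre><code class="prettyprint">import json

def set_download_dir(settings_path, new_dir):
    """Rewrite the download-dir field of transmission's settings.json,
    leaving every other setting untouched."""
    with open(settings_path) as f:
        settings = json.load(f)
    settings["download-dir"] = new_dir  # only this field changes
    with open(settings_path, "w") as f:
        json.dump(settings, f, indent=4, sort_keys=True)
    return settings
</code></pre>
<p>For example, <code class="prettyprint">set_download_dir("/etc/transmission-daemon/settings.json", "/opt/torr")</code>, run with the daemon stopped.</p>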
<h2 id="time-to-test_2">Time to test! <a class="head_anchor" href="#time-to-test_2">#</a>
</h2>
<pre><code class="prettyprint">sudo service samba stop
sudo service samba start
sudo service minidlna stop
sudo service minidlna start
sudo service transmission-daemon stop
sudo service transmission-daemon start
</code></pre>
<p>To test if transmission daemon is running, open <a href="http://rpi_ip_addr:9091/transmission/web/">http://rpi_ip_addr:9091/transmission/web/</a></p>
<p>Device IP addresses keep changing, which makes accessing the rpi by IP address inconvenient.<br>
We can solve this problem by using the <code class="prettyprint">.local</code> domain. For this we need the avahi-daemon and a tweak to the hosts file</p>
<h2 id="avahidaemon_2">avahi-daemon <a class="head_anchor" href="#avahidaemon_2">#</a>
</h2>
<pre><code class="prettyprint">sudo apt-get install avahi-daemon
</code></pre>
<p>Edit /etc/hosts (the hostname itself is set in /etc/hostname)</p>
<p>Change</p>
<pre><code class="prettyprint"> 127.0.1.1 raspberrypi
</code></pre>
<p>to </p>
<pre><code class="prettyprint"> 127.0.1.1 [new name here]
</code></pre>
<p>Reboot rpi</p>
<p>Now you should be able to access your raspberrypi using the URL <a href="http://host_name.local">http://host_name.local</a></p>
<p>For example, <a href="http://raspberrypi.local">http://raspberrypi.local</a></p>
<hr>
<p>PS: Most of the time <em>minidlna</em> does not refresh the collection in the specified folders. We need to explicitly run the following commands</p>
<pre><code class="prettyprint">sudo minidlna -R
sudo service minidlna restart
</code></pre>
<p>This problem might be caused by the inotify functionality of the Linux kernel, which has to be enabled in the kernel. A solution is posted <a href="http://stackoverflow.com/questions/5180409/why-minidlna-not-refreshing-database">here</a></p>
<h3 id="courtesy_3">Courtesy <a class="head_anchor" href="#courtesy_3">#</a>
</h3>
<ul>
<li><a href="http://www.naspberrypi.com">NASpberrypi</a></li>
<li><a href="http://www.ryukent.com/2013/09/a-local-url-instead-of-an-ip-address-for-your-raspberry-pi/">Avahi stuff</a></li>
</ul>
<h1>Text Search on PostgreSQL</h1>
<p><em>2014-05-31</em></p>
<p>PostgreSQL has out-of-the-box support for text search.</p>
<p>Assume we have a table of documents: </p>
<pre><code class="prettyprint">CREATE TABLE documents
(
  id serial NOT NULL,
  doc text
);

INSERT INTO documents(doc)
VALUES ('Lorem ipsum .....');

INSERT INTO documents(doc)
VALUES ('Quick brown fox .....');

------------------------------------
 id | doc
------------------------------------
  1 | Lorem ipsum .....
  2 | Quick brown fox ...
<p>A simple text search is a basic requirement in any system. This can be done using <code class="prettyprint">tsvector</code> and <code class="prettyprint">tsquery</code> types in PostgreSQL.</p>
<p><a href="http://www.postgresql.org/docs/9.1/static/datatype-textsearch.html"><strong>tsvector</strong></a> gives us the list of lexemes for any given text.<br><br>
<a href="http://www.postgresql.org/docs/9.1/static/datatype-textsearch.html"><strong>tsquery</strong></a> facilitates the search by creating lexemes for the search terms, combining search terms / lexemes, and comparing them against a tsvector for the result.</p>
<p>The <code class="prettyprint">to_tsvector</code> method processes text by removing stop words and by stemming and normalizing words, so that different variants of a word can be matched.<br><br>
For example, <strong>precision</strong> becomes <strong>precis</strong> and <strong>running</strong> becomes <strong>run</strong></p>
<p>On every insert of a document, we need to compute the normalized text of the document and store it in a dedicated column. For this we need to create a new column of type tsvector.</p>
<pre><code class="prettyprint">ALTER TABLE documents ADD COLUMN tsv tsvector;
</code></pre>
<p>Next, we need to create a trigger function that will update the <strong>tsv</strong> column on every insert</p>
<pre><code class="prettyprint">CREATE TRIGGER tsvupdate
BEFORE INSERT OR UPDATE
ON documents
FOR EACH ROW
EXECUTE PROCEDURE tsvector_update_trigger(tsv, 'pg_catalog.english', doc);
</code></pre>
<p><code class="prettyprint">tsvector_update_trigger()</code> is a built-in function that takes the following arguments:</p>
<ol>
<li>the column in which to store the normalized text<br>
</li>
<li>the text search configuration for the language (removing stop words and stemming are language specific)<br>
</li>
<li>one or more columns to read the text from</li>
</ol>
<p>With data populated inside the documents table, we can perform a simple text search using the query:</p>
<pre><code class="prettyprint">WITH q AS (SELECT to_tsquery('brown:*') AS query)
SELECT id, doc, tsv FROM documents, q WHERE q.query @@ documents.tsv;
</code></pre>
<p>The <code class="prettyprint">to_tsquery</code> function converts the input text to the tsquery type, which supports the logical operators <code class="prettyprint">& (AND), | (OR), ! (NOT)</code> on lexemes and <strong>prefix matching</strong> using <code class="prettyprint">":*"</code></p>
<p>The <code class="prettyprint">@@</code> operator checks whether the tsvector matches the tsquery</p>
<p>So the above query returns the documents that contain “brown”<br><br>
<img src="http://imgur.com/fiEP4ot.png" alt="example-image"></p>
<h2 id="limitation_2">Limitation: <a class="head_anchor" href="#limitation_2">#</a>
</h2>
<p><code class="prettyprint">tsvector</code> and <code class="prettyprint">tsquery</code> only help us find whole words in a given text, <strong>not substrings</strong>.<br>
For substring matching we have to use the <code class="prettyprint">pg_trgm</code> extension (trigram-based text search), which can speed up <code class="prettyprint">LIKE</code> operations on text fields.</p>
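<p>To get an intuition for what pg_trgm does, here is a rough Python sketch of trigram matching. This is an illustration of the idea only, not the actual implementation: the real extension also pads each word with spaces before extracting trigrams, and the function names below are mine.</p>
<pre><code class="prettyprint">def trigrams(text):
    """Set of 3-character substrings of text (simplified pg_trgm)."""
    t = text.lower()
    return {t[i:i + 3] for i in range(len(t) - 2)}

def similarity(a, b):
    """Overlap of the two trigram sets, roughly what
    pg_trgm's similarity() computes."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        return 0.0
    return len(ta.intersection(tb)) / len(ta.union(tb))
</code></pre>
<p>Because most trigrams of a substring also appear among the trigrams of the full string, an index over trigrams lets the database narrow down candidate rows for a <code class="prettyprint">LIKE '%brown%'</code> style query.</p>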
<h1>Machine Learning</h1>
<p><em>2014-05-13</em></p>
<p>I had zero knowledge about this topic but wanted to explore it. I took <a href="http://lshtc.iit.demokritos.gr/">Large Scale Hierarchical Text Classification</a> (LSHTC) as my MS project, so that I would have a good scenario for starting Machine Learning</p>
<p>The first thing I wanted to know was the format of the data provided by LSHTC. It turned out to be the SVM format. The training data and test data had the following format:</p>
<p><code class="prettyprint">label,label,label… feature:value feature:value</code></p>
<p>The <strong><u>label</u></strong> indicates the category the document belongs to.</p>
<p>The <strong><u>feature:value</u></strong> vector represents a word and its weight (<strong>TF</strong>) in the document.</p>
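<p>As an illustration, one line of this format can be parsed with a few lines of Python. This is a sketch under the assumptions shown above (comma-separated integer labels in the first token, then space-separated feature:value pairs); the helper name is made up:</p>
<pre><code class="prettyprint">def parse_svm_line(line):
    """Parse 'label,label feature:value feature:value ...' into
    a list of label ints and a {feature_id: weight} dict."""
    head, *pairs = line.split()          # first token holds the labels
    labels = [int(label) for label in head.split(",")]
    features = {}
    for pair in pairs:
        feature, value = pair.split(":")
        features[int(feature)] = float(value)
    return labels, features
</code></pre>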
<h2 id="choice-of-programming-language_2">Choice of programming language <a class="head_anchor" href="#choice-of-programming-language_2">#</a>
</h2>
<p>I had to make a choice between <code class="prettyprint">Java</code> and <code class="prettyprint">Python</code></p>
<p>I chose <code class="prettyprint">Python</code> for the following reasons:</p>
<ol>
<li>
<strong>Huge set of Machine Learning libraries</strong> - given that I was a beginner, this made a lot of
impact. More libraries, more documentation, more examples => more experiments and
better understanding</li>
<li>Most of the Machine Learning these days is done with Python</li>
<li>Less cumbersome to try out a scenario - given that Python is more of a scripting
language, experiments can be run quickly, especially with <strong>IPython</strong>
</li>
<li>Also the hype around it these days :)</li>
</ol>
<h2 id="libraries_2">Libraries <a class="head_anchor" href="#libraries_2">#</a>
</h2>
<ol>
<li><p><a href="http://scikit-learn.org/stable/">scikit-learn</a> - massive collection of algorithms for regression, classification,<br>
clustering, dimensionality reduction, model selection, pipelining etc</p></li>
<li><p><a href="http://mlpy.sourceforge.net/">mlpy</a> - similar to scikit-learn but offers a smaller set</p></li>
<li><p><a href="http://graphlab.com/">graphlab</a> - more of a recommendation engine</p></li>
<li><p><a href="http://spark.apache.org/">Spark</a> - very good parallel ML framework but still in its early stage. Does not offer many<br>
algorithms</p></li>
</ol>
<p>I started off with sci-kit. It offers a huge range of libraries & algorithms. I then had to do a lot of<br>
reading about the basics of classification: hyperplanes, linear and non-linear<br>
classification, k-Nearest Neighbours, and Support Vector Machines (SVM) - what an SVM is and<br>
why it is used.</p>
<p><a href="http://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf">The Stanford NLP</a> book helped me a lot in understanding the basics of Classification</p>
<h2 id="algorithms_2">Algorithms <a class="head_anchor" href="#algorithms_2">#</a>
</h2>
<p>I’m an absolute beginner in Machine Learning, and every algorithm I look at seems like the right one.<br>
But only after experimenting with each of them do you know which is the best fit and why.</p>
<p>The problem I was solving had medium-scale data: 250,000 records of test data and 2<br>
million records of training data. Both training and test data have a large number of features.</p>
<h2 id="k-nearest-neighbour_2">K Nearest Neighbour <a class="head_anchor" href="#k-nearest-neighbour_2">#</a>
</h2>
<p>I started off the first trial using the k-nearest neighbour algorithm. It turns out this is a very good<br>
algorithm, but it doesn’t scale well to larger datasets. There are a number of flavours of kNN<br>
that speed up the neighbour search with index structures like KD-trees and ball trees. But they still don’t help<br>
much when running datasets with more than 10,000 records.<br>
Also, I frequently got “Core dumped” errors when I tried plain kNN and kNN with <strong><u>chi2</u></strong><br>
feature selection. I’m still figuring out the reason; my feeling is that it doesn’t scale to larger datasets. But I get the<br>
same error for smaller datasets of 100 records, which is weird and hints that I might be doing<br>
something wrong!<br>
After reading a few articles I came to the conclusion that it is better to use SVM for large datasets.</p>
<h2 id="support-vector-machines-svm_2">Support Vector Machines (SVM) <a class="head_anchor" href="#support-vector-machines-svm_2">#</a>
</h2>
<p>The Support Vector Machine is one of the fast and efficient learning algorithms for classification and<br>
regression, and it works well on large datasets. A linear SVM does linear classification, and we can define<br>
custom kernels for SVMs. The SVM library in sci-kit offers commonly used kernels like</p>
<ul>
<li>linear</li>
<li>polynomial</li>
<li>Radial Basis Function (rbf)</li>
<li>sigmoid</li>
</ul>
<p>The results with the RBF kernel turned out to be pretty bad: I got the same<br>
label prediction for most of the test data.</p>
<p>I switched to linear SVM and the results turned out to be quite decent.</p>
<h2 id="scaling_2">Scaling <a class="head_anchor" href="#scaling_2">#</a>
</h2>
<p>Given the problem is about Large Scale classification, scaling the algorithm to cater to large<br>
datasets is very important!</p>
<p>The algorithms in the sci-kit library are <code class="prettyprint">in-core</code>, meaning they run all the tasks on a single core.<br>
This turns out to be bad when running prediction on large datasets.</p>
<p>The way out is multicore processing: we can divide the task into subtasks<br>
and run them on different cores. In my case, I split the test data into smaller subsets and predict<br>
them as different jobs, utilizing multiple cores. Sci-kit also provides a job-processing library<br>
called <code class="prettyprint">joblib</code> which enables the above-mentioned process.</p>
<p>Soon we run into the problem of having multiple copies of the training data in each job doing the<br>
prediction. To overcome this, joblib provides <strong>memory caching</strong> of functions. This lets us avoid<br>
creating copies and instead share the memory across all jobs. The problem seems to be solved, but this<br>
will not work when we have a dataset large enough that it needs to be run on different machines!</p>
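<p>The split-and-predict idea can be sketched with the standard library alone. Here <code class="prettyprint">predict</code> is a stand-in for a real model’s predict method, and a thread pool is used so the sketch runs anywhere; for CPU-bound scikit predictions you would use processes (ProcessPoolExecutor or joblib) to actually occupy multiple cores:</p>
<pre><code class="prettyprint">from concurrent.futures import ThreadPoolExecutor

def predict(chunk):
    """Stand-in for a trained model's predict(): labels each
    record by the sign of its feature sum."""
    return [1 if sum(features) > 0 else 0 for features in chunk]

def chunked(records, size):
    """Split records into consecutive chunks of at most `size` items."""
    return [records[i:i + size] for i in range(0, len(records), size)]

def parallel_predict(records, size=1000):
    """Predict each chunk as a separate job and concatenate
    the per-chunk results in their original order."""
    with ThreadPoolExecutor() as pool:
        results = pool.map(predict, chunked(records, size))
    return [label for chunk in results for label in chunk]
</code></pre>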
<h1>Database Triggers</h1>
<p><em>2014-05-13</em></p>
<p>A database trigger fires after or before a statement is executed or a row is modified / inserted / deleted.<br>
It can be used to perform any task before or after the occurrence of a certain event in the database.</p>
<p>I had been curious about this concept for a very long time and wanted to check it out.</p>
<p>I wanted to try an automation by creating a trigger function.</p>
<p><strong>Trigger function</strong> in <strong>PostgreSQL</strong> is a kind of function to which special variables are passed - NEW, OLD etc. More on trigger functions - <a href="http://www.postgresql.org/docs/9.3/static/plpgsql-trigger.html">here</a></p>
<p><code class="prettyprint">NEW</code> - variable sent to trigger function when the trigger is INSERT / UPDATE. This variable will contain the new row to be inserted / updated</p>
<p><code class="prettyprint">OLD</code> - variable sent to trigger function when the trigger is DELETE. This variable will contain the row to be deleted</p>
<p>To try out trigger functions I created three tables: <code class="prettyprint">posts</code>, <code class="prettyprint">user_groups</code> and <code class="prettyprint">user_posts</code></p>
<p>I wanted to try an insert automation - on inserting a row into the posts table, I wanted the DB to automatically insert rows into the user_posts table. For this we need a trigger function like this:</p>
<pre><code class="prettyprint">CREATE FUNCTION update_user_post_association() RETURNS trigger
    LANGUAGE plpgsql
    AS $_$
DECLARE
    row user_groups%rowtype;
BEGIN
    FOR row IN SELECT * FROM public.user_groups WHERE group_id = NEW.group_id
    LOOP
        EXECUTE 'INSERT INTO public.user_posts VALUES ($1, $2, 1)' USING row.user_id, NEW.id;
    END LOOP;
    RETURN NEW;
END;
$_$;
</code></pre>
<p>In the example we do the following steps:</p>
<ul>
<li>get all the users of the group</li>
<li>loop on the users</li>
<li>perform an insert using the available data</li>
</ul>
<p>It is important to note that the trigger function here uses a dynamic SQL statement, which is slightly different from a normal SQL statement.<br>
When using variables in an SQL statement it is always good to use placeholders like $1, $2 and to pass the variables with the <code class="prettyprint">USING</code> keyword.</p>
<p>Now that we have the trigger function, we need to tell the DB when to run this function. For this, we create a trigger</p>
<pre><code class="prettyprint">CREATE TRIGGER populate_users
AFTER INSERT OR UPDATE ON posts
FOR EACH ROW
EXECUTE PROCEDURE update_user_post_association();
</code></pre>
<p>Now the DB executes the trigger function on every insert into the posts table.</p>
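<p>The same fan-out can be sketched end to end in SQLite through Python’s sqlite3 module. This is a different engine with a simpler trigger syntax, shown only as an illustration; the table and column names mirror the example above. It also shows that the whole loop can be replaced by a single set-based INSERT ... SELECT, a rewrite that would work in PostgreSQL too:</p>
<pre><code class="prettyprint">import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user_groups (user_id INTEGER, group_id INTEGER);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, group_id INTEGER, body TEXT);
    CREATE TABLE user_posts (user_id INTEGER, post_id INTEGER, unread INTEGER);

    -- Fan each new post out to every member of its group,
    -- mirroring update_user_post_association() above.
    CREATE TRIGGER populate_users AFTER INSERT ON posts
    BEGIN
        INSERT INTO user_posts (user_id, post_id, unread)
        SELECT user_id, NEW.id, 1 FROM user_groups
        WHERE group_id = NEW.group_id;
    END;
""")

conn.executemany("INSERT INTO user_groups VALUES (?, ?)",
                 [(1, 10), (2, 10), (3, 99)])
conn.execute("INSERT INTO posts (group_id, body) VALUES (10, 'hello')")

# The trigger has populated user_posts for users 1 and 2 only.
rows = conn.execute(
    "SELECT user_id, post_id, unread FROM user_posts ORDER BY user_id"
).fetchall()
</code></pre>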
<p>Note: it is not a good idea to perform inserts inside a trigger function, because the extra inserts may reduce the efficiency of the DB. When the DB is under heavy load, the multiple inserts inside the trigger function might become slow, and the DB might start to queue connections.</p>