Data Science

Before I start describing how to use Python and praising it incessantly as a wonderful programming language, let me point to an article in ET that explains why you need to learn Python, what its trends are, and where it stands on the list of technologies that are relevant today.
One line to describe that would be "right there on top, and it's awesome".

A nice read here

Now, coming back to the topic of discussion: what is Python? Those of you who have heard of Python might also have heard of Anaconda and Spyder! These are neither snakes nor insects. Python is a programming language, and an object-oriented one at that. Who would have thought of Python as an object-oriented language? I didn't believe it either when I started learning Python once upon a time. Now what is Anaconda? It is a Python distribution, a set of Python packages bundled together, and Spyder is the IDE (integrated development environment) that ships with it.
If you are learning Python for the first time, you should start with an IDE, preferably Spyder from the Anaconda bundle. You can also choose PyCharm, IPython, Wing IDE, PyDev, Komodo IDE, Eric, IEP, and the list is endless. I have never used some of them!

OK, so Python is an object-oriented programming language. So what? There are many other programming languages, such as Java, C++, C# and VB.NET. What is so special about Python?

Let me present my case with a simple example. Let's say I want to write the most basic program of all, our own hello world.

There we go: a simple print "hello world" does the trick for us.
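
As a minimal sketch of that one-liner: the statement form print "hello world" quoted above is Python 2 syntax. The version below assumes Python 3, where print is a function, and the variable name message is only there for illustration.

```python
# The whole program: no class, no main method, no boilerplate.
message = "hello world"
print(message)
```

Saved as a file (for example hello.py, a hypothetical name), it runs directly with the python command.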

Now let's write a "hello world" program in Java. Java needs a class declaration and a main method wrapped around the single print statement. See for yourself: doesn't it look like a lot of syntax and semantics? Indeed it does!

To arouse your interest further, let me first take you through some basic questions that are frequently asked about Python.

What kind of language is Python: compiled or interpreted?
Python is an interpreted language, as opposed to a compiled one, though the distinction can be blurry because of the presence of the bytecode compiler. This means that source files can be run directly without explicitly creating an executable to run.
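
The bytecode step mentioned above can be made visible with a small sketch. It assumes the standard CPython interpreter; the file name "<example>" is just a placeholder label, and the exact instructions printed by dis differ between Python versions.

```python
import dis

# Source code held in a plain string, as if it had been read from a .py file.
source = 'print("hello world")'

# Step 1: the interpreter first compiles the source into a code object (bytecode).
code_object = compile(source, "<example>", "exec")
print(type(code_object).__name__)

# Step 2: show the bytecode instructions in human-readable form.
dis.dis(code_object)

# Step 3: run the compiled bytecode. No separate executable is ever created.
exec(code_object)
```

Normally all three steps happen behind the scenes when you run a script, which is why Python feels interpreted even though a compiler is involved.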

Will I be able to create a website in Python?
Yes, you will be able to create a website in Python. Web frameworks such as Django and Flask are written for exactly this purpose.

Which language is better, R or Python?
I would say it depends on the usage, but personally I prefer Python because of its clean syntax. Both languages have automatic garbage collection, so memory management by itself does not decide the question.

To know more about compilers and interpreters, follow this link.

To keep the interest in Python alive, let me demo a simple web-scraping program. Web scraping is the process of pulling data directly from websites.
Let's say I want to pull movie details from the website http://www.imdb.com/ . I need the list of all hit movies released between 2012 and 2014, based on their rating. Here is what I have to do (as given below).

This simple 12-line program will pull the data from imdb.com without any hassle. Isn't this wonderful?
Note: copying and pasting the code might not work, because the link is incomplete in the screenshot.
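
The screenshot with the original code is not reproduced here, so what follows is not that program. It is only a rough sketch of the same idea (parse HTML, keep movies from 2012 to 2014, rank them by rating) using the standard library's html.parser. The HTML snippet, its class and data-* attributes, the movie titles and the 7.0 rating cut-off are all invented for illustration; the real imdb.com pages are structured differently, and a real scraper would first have to download the page.

```python
from html.parser import HTMLParser

# Hypothetical markup, NOT the real imdb.com page structure.
SAMPLE_HTML = """
<ul>
  <li class="movie" data-year="2011" data-rating="8.1">Example Movie A</li>
  <li class="movie" data-year="2012" data-rating="7.9">Example Movie B</li>
  <li class="movie" data-year="2013" data-rating="6.2">Example Movie C</li>
  <li class="movie" data-year="2014" data-rating="8.4">Example Movie D</li>
</ul>
"""


class MovieParser(HTMLParser):
    """Collects one dict per <li class="movie"> element."""

    def __init__(self):
        super().__init__()
        self.movies = []
        self._current = None  # the movie whose text we are currently reading

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and attrs.get("class") == "movie":
            self._current = {
                "title": "",
                "year": int(attrs["data-year"]),
                "rating": float(attrs["data-rating"]),
            }

    def handle_data(self, data):
        if self._current is not None:
            self._current["title"] += data

    def handle_endtag(self, tag):
        if tag == "li" and self._current is not None:
            self._current["title"] = self._current["title"].strip()
            self.movies.append(self._current)
            self._current = None


def hit_movies(html, first_year=2012, last_year=2014, min_rating=7.0):
    """Return movies in the year range with at least min_rating, best first."""
    parser = MovieParser()
    parser.feed(html)
    hits = [
        movie for movie in parser.movies
        if first_year <= movie["year"] <= last_year and movie["rating"] >= min_rating
    ]
    return sorted(hits, key=lambda movie: movie["rating"], reverse=True)


for movie in hit_movies(SAMPLE_HTML):
    print(movie["title"], movie["year"], movie["rating"])
```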

Now that I have your attention, keep watching this space for more updates. I plan to have a series of videos and tutorials to help all of you understand Python and its nuances.

What is MongoDB?

Of all the definitions that I read, I liked the one from Wikipedia the most.

MongoDB is a cross-platform document-oriented database system. Classified as a NoSQL database, MongoDB eschews the traditional table-based relational database structure in favor of JSON-like documents with dynamic schemas (MongoDB calls the format BSON), making the integration of data in certain types of applications easier and faster.

All the mumbo jumbo above only means that MongoDB is a document database whose structure is JSON-like, unlike a traditional database. The definition will become clearer when we start performing the CRUD operations (create, read, update, delete) later. For now, just go with the flow.

A few quick pointers:

1. A document in MongoDB is equivalent to a row/tuple in a relational database system.
2. A collection is equivalent to a table, but a collection is schema-free. A table in an RDBMS has a fixed structure, whereas in MongoDB each document/row/tuple can have its own schema.

Let us take an example from a relational database system. Here is how the data looks in our traditional relational database management systems.

 ID         NAME          AGE   
 1          Honey         10  
 2          Roney         20

As is evident, the schema (table structure) is the same for every record, whether ID=1 or ID=2. You might be thinking, "I still do not understand what he is trying to say; I can see no difference between the two records." If that is what you are thinking, you are headed in the right direction. Look at the column/field names. They are static: ID, NAME and AGE do not change. Do they change when a new record is inserted? Most definitely not!
Now let's take a look at the MongoDB documents below (just think of "document" as a fancy name for a row/tuple). Don't worry much about the "_id" field. The data is in the format column_name : column_value, that is, key : value pairs.
This means:
The first record has the columns user_id, age and status, with their values.
The second record has only the columns name and location.
Coming from a traditional database background, I would have said there is no way this type of storage is possible, but here it is. MongoDB makes it possible.

 > db.ab.find()  
 { "_id" : ObjectId("529ca24c3dc2f256c6bfcbb9"), "user_id" : "555", "age" : 100, "status" : "D" }  
 { "_id" : ObjectId("52e72d956cf8a5335f4ac07c"), "name" : "Honey singh", "location" : "Mumbai" }

Does it look different? I bet it does. In simple words, every document in a MongoDB database can have its own schema, and hence it is a SCHEMA-FREE database. There is no fixed schema for the entire table/collection.
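
To make the point concrete, here is a small sketch that writes the two documents from the find() output above as plain Python dictionaries and lists each one's field names. It does not talk to MongoDB at all; the names collection_ab and field_names are made up for illustration, and the "_id" values are kept as plain strings.

```python
# The two documents from the shell output, as Python dicts in one "collection".
collection_ab = [
    {"_id": "529ca24c3dc2f256c6bfcbb9", "user_id": "555", "age": 100, "status": "D"},
    {"_id": "52e72d956cf8a5335f4ac07c", "name": "Honey singh", "location": "Mumbai"},
]


def field_names(document):
    """Return the document's own field names, ignoring the _id key."""
    return sorted(key for key in document if key != "_id")


# Each document carries its own set of fields; nothing forces them to match.
for document in collection_ab:
    print(field_names(document))
```

In a relational table both rows would be forced to share one column list; here the two lists have nothing in common.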

Every document/tuple/row has a value that uniquely identifies it within its collection, and that is "_id". By default it is of type ObjectId.
If you were to compare "_id" with a relational database, it would be similar to a primary key.
This is how the 12 bytes of an ObjectId are laid out.

 0 1 2 3   |  4 5 6  |  7 8  |  9 10 11     
 Timestamp   Machine    PID     Increment

A few facts about ObjectId:
1. ObjectIds use 12 bytes of storage, which gives them a string representation of 24 hexadecimal digits: 2 digits for each byte (look at the 24-character "_id" values in the find() output above for an example).
2. The first four bytes of an ObjectId are a timestamp in seconds since the epoch.
(An epoch is an instant in time chosen as the origin of a particular era; it serves as the reference point from which time is measured. Here it is the Unix epoch, 1 January 1970 UTC.)
3. The next three bytes of an ObjectId are a unique identifier of the machine on which it was generated. This is usually a hash of the machine’s hostname.
4. To provide uniqueness among different processes generating ObjectIds concurrently on a single machine, the next two bytes are taken from the process identifier (PID) of the ObjectId-generating process.
5. The last three bytes are simply an incrementing counter that is responsible for uniqueness within a second in a single process.
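
The layout above can be checked by hand with a few lines of Python. This is only a sketch under the layout described here (timestamp, machine, PID, counter), which matches the MongoDB 2.4 shell used in this post; newer MongoDB versions fill the middle five bytes with a random value instead. The function name parse_object_id is made up, and reading the PID bytes as big-endian is an assumption.

```python
from datetime import datetime, timezone


def parse_object_id(hex_string):
    """Split a 24-hex-digit ObjectId string into its four parts (legacy layout)."""
    raw = bytes.fromhex(hex_string)
    if len(raw) != 12:
        raise ValueError("an ObjectId is exactly 12 bytes (24 hex digits)")
    seconds = int.from_bytes(raw[0:4], "big")        # bytes 0-3: seconds since the epoch
    return {
        "timestamp": datetime.fromtimestamp(seconds, tz=timezone.utc),
        "machine": raw[4:7].hex(),                   # bytes 4-6: machine identifier
        "pid": int.from_bytes(raw[7:9], "big"),      # bytes 7-8: process id (byte order assumed)
        "counter": int.from_bytes(raw[9:12], "big"), # bytes 9-11: incrementing counter
    }


# The first "_id" from the find() output shown earlier.
parts = parse_object_id("529ca24c3dc2f256c6bfcbb9")
for name, value in parts.items():
    print(name, "=", value)
```

Because the timestamp comes first, sorting ObjectIds also sorts documents roughly by creation time.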

Before we start with the basic operations, here are a few steps that will help us get going.
Disclaimer: you need to have MongoDB installed already. If not, go ahead and install it; it is very simple.

 // connecting to mongo   
 hadoop@ubuntu-server64:~/lab/install/mongodb/bin$ mongo  
 MongoDB shell version: 2.4.8  
 connecting to: test  
 // This will show all the databases available in the mongodb  
 > show databases;  
 balaramdb    0.203125GB  
 local  0.078125GB  
 new_db 0.203125GB  
 test  0.203125GB  
 testing 0.203125GB  
 // Very similar to MySQL or SQL Server, this takes us inside one of the databases, in this case new_db
 > use new_db  
 switched to db new_db  
 // This will show us all the tables (collections) in the database. Don't worry too much about the indexes now; I will explain them later.
 > show tables;  
 ab  
 system.indexes  
 // This is equivalent to "select * from table " in relational databases;  
 > db.ab.find()  
 { "_id" : ObjectId("529ca24c3dc2f256c6bfcbb9"), "user_id" : "555", "age" : 100, "status" : "D" }

Basic Operations with the Shell :

Before we start our MongoDB journey, you need to understand one thing very clearly: in the shell, the current database is always referred to as "db". So to access a table (collection), you write db.<table_name>.findOne(), db.<table_name>.insert(), and so on. Let's start with the CRUD (create, read, update, delete) operations.

Create:

 // let us first check which database we are in; look at the first line of the output
 > db.stats()
 // the o/p will be similar to the below, which means we are in the database new_db
 // "db" : "new_db"
 // let us drop the table "blog" that we are trying to create, in case it already exists
 > db.blog.drop()
 // o/p
 // true
 // Now let us see how we create a collection/table.
 // First we define a local variable, a JavaScript object representing our document.
 > post = {"title" : "My Blog Post",
           "content" : "Here's my blog post.",
           "date" : new Date()}
 // The shell immediately echoes the object back, which means that "post" is valid.
 {
     "title" : "My Blog Post",
     "content" : "Here's my blog post.",
     "date" : ISODate("2014-01-28T09:27:52.976Z")
 }
 // Let us invoke the command below to insert the data.
 // Did we forget to create the table before the insertion?
 // Nevertheless, let's try once and see what happens.
 > db.blog.insert(post)
 // Oops, there is no error. Does that mean the table/collection was created and the data inserted on the fly?
 // Yes, and that is why MongoDB is called a schema-free NoSQL database. If there is no table, it just creates one.

Read :
This is how you read all the fields/columns of all documents in a collection/table. It is the MongoDB equivalent of the SQL statement
select * from blog

 > db.blog.find()  // or db.blog.find({}); the two commands are equivalent
 { "_id" : ObjectId("52e779116cf8a5335f4ac083"), "title" : "My Blog Post", "content" : "Here's my blog post.", "date" : ISODate("2014-01-28T09:27:52.976Z") }  
 { "_id" : ObjectId("52e79c216cf8a5335f4ac084"), "name" : "Honey singh", "location" : "Mumbai" }

Read Filtering:
How do I read only the records I want? Let's say I want to pull the records only for "Honey singh", the equivalent of the SQL below.
select * from blog where name = "Honey singh"

 > db.blog.find({"name":"Honey singh"})  
 { "_id" : ObjectId("533a641a0216f306dfcb7163"), "name" : "Honey singh", "location" : "Mumbai" }

 // If I want only a few selected columns/fields, here is how I would write my find() call.
 // The second argument {"content" : 1} is a projection: it means pull only "content" (plus "_id") from all the documents.
 > db.blog.find({}, {"content" : 1})
 { "_id" : ObjectId("52e779116cf8a5335f4ac083"), "content" : "Here's my blog post." }
 { "_id" : ObjectId("52e79c216cf8a5335f4ac084") } // Only _id is returned when the document has no "content" field
 // select content,name from blog
 // but this will include "_id", as shown below
 > db.blog.find({}, {"content" : 1, "name" : 1})
 { "_id" : ObjectId("52e15de00b50d96fdf69790f"), "content" : "Here's my blog post." }
 { "_id" : ObjectId("533a641a0216f306dfcb7163"), "name" : "Honey singh" }
 // select content,name from blog
 // Let me remove the unwanted _id field from my result set to get clean data
 > db.blog.find({}, {_id : 0,"content" : 1, "name" : 1})
 { "content" : "Here's my blog post." }  
 { "name" : "Honey singh" }
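
The filter and projection rules used above can be mimicked in a few lines of plain Python. This is a toy model and not how MongoDB is implemented: the function find here is a made-up stand-in that only supports exact-match filters and inclusion projections (fields marked 1, with "_id" optionally switched off by 0), and the "_id" values are plain strings.

```python
def find(collection, query=None, projection=None):
    """Toy stand-in for db.<collection>.find(query, projection)."""
    query = query or {}
    results = []
    for doc in collection:
        # Filter: every queried key must exist in the document with exactly that value.
        if any(key not in doc or doc[key] != value for key, value in query.items()):
            continue
        if not projection:
            results.append(dict(doc))  # no projection: return all fields
            continue
        projected = {}
        # "_id" comes back by default; only an explicit _id : 0 removes it.
        if projection.get("_id", 1) and "_id" in doc:
            projected["_id"] = doc["_id"]
        # Copy the requested fields that this particular document actually has.
        for key, flag in projection.items():
            if flag and key != "_id" and key in doc:
                projected[key] = doc[key]
        results.append(projected)
    return results


blog = [
    {"_id": "52e779116cf8a5335f4ac083", "title": "My Blog Post", "content": "Here's my blog post."},
    {"_id": "52e79c216cf8a5335f4ac084", "name": "Honey singh", "location": "Mumbai"},
]

print(find(blog, {"name": "Honey singh"}))
print(find(blog, {}, {"content": 1}))
print(find(blog, {}, {"_id": 0, "content": 1, "name": 1}))
```

The second call shows why a document without a "content" field still appears in the shell output with only its "_id".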

Tuesday, July 7, 2015

Discovering Python !!

Monday, April 14, 2014

MongoDB - Part I
