Data Science: MongoDB

What is MongoDB?

Of all the definitions I have read, I like the one from Wikipedia the most.

MongoDB is a cross-platform document-oriented database system. Classified as a NoSQL database, MongoDB eschews the traditional table-based relational database structure in favor of JSON-like documents with dynamic schemas (MongoDB calls the format BSON), making the integration of data in certain types of applications easier and faster.

All the mumbo jumbo above only means that MongoDB is a document database whose structure is JSON-like and unlike that of a traditional relational database. The definition will become clearer when we start performing the CRUD operations (create, read, update and delete) later. For now, just go with the flow.

A few quick pointers:

1. A document in MongoDB is equivalent to a row/tuple in a relational database system.
2. A collection is equivalent to a table, but a collection is schema-free. A table in an RDBMS has a fixed structure, whereas in MongoDB each document (row/tuple) can have its own set of fields.

Let us take an example from a relational database system. Here is how data looks in a traditional relational database management system.

 ID         NAME          AGE   
 1          Honey         10  
 2          Roney         20

As is evident and already known, the schema/table structure is the same for every record (ID=1 or ID=2, look at the data). You might be thinking, "I still do not understand what he is trying to say, I can see no difference between the two records." If that is what you are thinking, you are headed in the right direction. Look at the column/field names. They are static: ID, NAME and AGE do not change. Do they change when a new record is inserted? Most definitely not!
Now let's take a look at the MongoDB documents below (just think of "document" as a fancy name for a row/tuple). Don't worry much about the "_id" field for now. The data is in the format column_name : column_value, that is, a key : value pair.
This means:
The first record has the columns user_id, age and status, with their values.
The second record has only the columns name and location.
If I came from a traditional database background, I would say there is no way this kind of storage is possible, but here it is. MongoDB makes it possible.

 > db.ab.find()  
 { "_id" : ObjectId("529ca24c3dc2f256c6bfcbb9"), "user_id" : "555", "age" : 100, "status" : "D" }  
 { "_id" : ObjectId("52e72d956cf8a5335f4ac07c"), "name" : "Honey singh", "location" : "Mumbai" }

Does it look different? I bet it does. In simple words, every document in a MongoDB database can have its own schema, and hence MongoDB is called a SCHEMA-FREE database. There is no fixed schema for the entire table/collection.
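
To make the schema-free idea concrete, here is a minimal sketch in plain JavaScript (no MongoDB needed). The two objects stand in for the two documents shown above, with "_id" left out, and fieldNames is a hypothetical helper written only for this illustration.

```javascript
// Illustrative sketch only: plain JavaScript objects standing in for the
// two documents returned by db.ab.find() above ("_id" is omitted).
const documents = [
  { user_id: "555", age: 100, status: "D" },
  { name: "Honey singh", location: "Mumbai" },
];

// fieldNames is a hypothetical helper: it lists the "columns" of each document.
function fieldNames(docs) {
  return docs.map((doc) => Object.keys(doc));
}

// Each document reports a different set of fields, which a fixed
// relational table structure would not allow.
console.log(fieldNames(documents));
```

The two lists of field names have nothing in common, yet both documents sit side by side in the same collection.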

Every document/tuple/row has a value that uniquely identifies it within its collection, and that is "_id". By default it is of type ObjectId.
If you were to compare "_id" with a relational database, it would be similar to a primary key.
This is how the structure of an ObjectId looks (byte positions on top, meaning below):

 0 1 2 3   |  4 5 6  |  7 8  |  9 10 11     
 Timestamp   Machine    PID     Increment

A few facts about ObjectId (as laid out in the MongoDB 2.4 release used in this post):
1. ObjectIds use 12 bytes of storage, which gives them a string representation that is 24 hexadecimal digits: 2 digits for each byte (the diagram above shows the 12 byte positions).
2. The first four bytes of an ObjectId are a timestamp in seconds since the epoch (an epoch is an instant in time chosen as the origin of a particular era; it serves as the reference point from which time is measured).
3. The next three bytes of an ObjectId are a unique identifier of the machine on which it was generated. This is usually a hash of the machine's hostname.
4. To provide uniqueness among different processes generating ObjectIds concurrently on a single machine, the next two bytes are taken from the process identifier (PID) of the ObjectId-generating process.
5. The last three bytes are simply an incrementing counter that is responsible for uniqueness within a second in a single process.
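
The byte layout above can be sketched in plain JavaScript by slicing the 24-digit hex string. This is only an illustration that assumes the timestamp/machine/PID/counter layout described in this post; parseObjectId is a hypothetical helper, not part of the mongo shell.

```javascript
// Split a 24-digit ObjectId hex string into the four parts described above.
// Assumes the timestamp/machine/PID/counter layout from this post.
function parseObjectId(hex) {
  if (!/^[0-9a-fA-F]{24}$/.test(hex)) {
    throw new Error("expected 24 hexadecimal digits");
  }
  const seconds = parseInt(hex.slice(0, 8), 16); // bytes 0-3: seconds since the epoch
  return {
    seconds,
    timestamp: new Date(seconds * 1000), // Date expects milliseconds
    machine: hex.slice(8, 14), // bytes 4-6, kept as a hex string
    pid: parseInt(hex.slice(14, 18), 16), // bytes 7-8
    counter: parseInt(hex.slice(18, 24), 16), // bytes 9-11
  };
}

// The "_id" of the first document shown earlier in this post.
const parts = parseObjectId("529ca24c3dc2f256c6bfcbb9");
console.log(parts.timestamp.toISOString(), parts.machine, parts.pid, parts.counter);
```

Inside the mongo shell itself you do not need such a helper for the timestamp part: ObjectId("529ca24c3dc2f256c6bfcbb9").getTimestamp() returns the creation time directly.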

Before we start the basic operations, here are a few steps that will help us get going.
Disclaimer: you need to have MongoDB installed already. If not, go ahead and install it; it is very simple.

 // connecting to mongo   
 hadoop@ubuntu-server64:~/lab/install/mongodb/bin$ mongo  
 MongoDB shell version: 2.4.8  
 connecting to: test  
 // This shows all the databases available on the MongoDB server  
 > show databases;  
 balaramdb    0.203125GB  
 local  0.078125GB  
 new_db 0.203125GB  
 test  0.203125GB  
 testing 0.203125GB  
 // Very similar to MySQL or SQL Server, this takes us inside one of the databases. In this case, new_db  
 > use new_db  
 switched to db new_db  
 // This shows all the collections (tables) in the current database. Don't worry too much about system.indexes now; I will explain indexes later.  
 > show tables;  
 ab  
 system.indexes  
 // This is equivalent to "select * from ab" in a relational database  
 > db.ab.find()  
 { "_id" : ObjectId("529ca24c3dc2f256c6bfcbb9"), "user_id" : "555", "age" : 100, "status" : "D" }

Basic Operations with the Shell:

Before we start our MongoDB journey, you need to understand one thing very clearly. In the shell, the current database is referred to as "db". So to access a collection (table), you write db.<collection_name>.findOne(), db.<collection_name>.insert(), and so on. Let's start with the CRUD (create, read, update, delete) operations.

Create:

 // Let us first check which database we are in.  
 > db.stats()  
 // The output will include a line like the one below, which means we are in the database new_db.  
 // "db" : "new_db"  
 // Let us drop the collection "blog" in case it already exists.  
 > db.blog.drop()  
 // Output (when the collection existed):  
 // true  
 // Now let us see how to create a collection/table.  
 // First we define a local variable, a JavaScript object representing our document.  
 > post = {"title" : "My Blog Post",  
         "content" : "Here's my blog post.",  
         "date" : new Date()}  
 // The shell immediately echoes the object as shown below, which means "post" is a valid object.  
 {  
     "title" : "My Blog Post",  
     "content" : "Here's my blog post.",  
     "date" : ISODate("2014-01-28T09:27:52.976Z")  
 }  
 // Let us run the command below to insert the data.  
 // Did we forget to create the table before inserting?  
 // Never mind, let's try it once and see what happens.  
 > db.blog.insert(post)  
 // Oops, there is no error. Does that mean the collection/table was created and the data inserted on the fly?  
 // Yes, and that fits MongoDB's schema-free nature: if the collection does not exist, it simply creates one.
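
The "create on first insert" behaviour seen above can be mimicked with a toy in-memory model in plain JavaScript. To be clear, ToyDatabase and its methods are hypothetical names invented for this sketch; this is not how MongoDB is implemented, only a picture of what the shell session just showed.

```javascript
// Toy in-memory model, NOT MongoDB's implementation: it only mimics the
// "create the collection on first insert" behaviour seen above.
class ToyDatabase {
  constructor() {
    this.collections = new Map(); // collection name -> array of documents
  }

  insert(collectionName, doc) {
    if (!this.collections.has(collectionName)) {
      // No such collection yet, so create it on the fly.
      this.collections.set(collectionName, []);
    }
    this.collections.get(collectionName).push({ ...doc }); // store a shallow copy
  }

  collectionNames() {
    return [...this.collections.keys()];
  }
}

const toyDb = new ToyDatabase();
const post = {
  title: "My Blog Post",
  content: "Here's my blog post.",
  date: new Date(),
};
toyDb.insert("blog", post); // note: no "create table" step beforehand
console.log(toyDb.collectionNames());
```

The point of the sketch is the missing step: nothing declares "blog" or its columns before the first insert.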

Read:
This is how you read all the fields/columns of all documents in a collection/table. It is the MongoDB equivalent of the SQL below.
select * from blog

 > db.blog.find()  // db.blog.find({}) is an equivalent command
 { "_id" : ObjectId("52e779116cf8a5335f4ac083"), "title" : "My Blog Post", "content" : "Here's my blog post.", "date" : ISODate("2014-01-28T09:27:52.976Z") }  
 { "_id" : ObjectId("52e79c216cf8a5335f4ac084"), "name" : "Honey singh", "location" : "Mumbai" }

Read Filtering:
How do I read only the records I want? Let's say I want to pull only the records for "Honey singh". The SQL equivalent would be:
select * from blog where name = "Honey singh"
Note that the match is case-sensitive, so the value must be written exactly as it is stored.

 > db.blog.find({"name":"Honey singh"})  
 { "_id" : ObjectId("533a641a0216f306dfcb7163"), "name" : "Honey singh", "location" : "Mumbai" }
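
The filtering idea can be sketched in plain JavaScript over an array of objects. findMatching is a hypothetical helper written for this illustration; it only handles simple equality on top-level fields and is not MongoDB's query engine. The sample documents are trimmed copies of the ones in this post.

```javascript
// Hypothetical helper mimicking an equality filter such as
// {"name": "Honey singh"}: a document matches when every key in the
// filter is present with exactly the same value.
function findMatching(docs, filter) {
  return docs.filter((doc) =>
    Object.entries(filter).every(([key, value]) => doc[key] === value)
  );
}

const blog = [
  { title: "My Blog Post", content: "Here's my blog post." },
  { name: "Honey singh", location: "Mumbai" },
];

const exact = findMatching(blog, { name: "Honey singh" });
const wrongCase = findMatching(blog, { name: "Honey Singh" }); // capital S

console.log(exact.length, wrongCase.length);
```

An empty filter, findMatching(blog, {}), matches every document, which mirrors db.blog.find({}) returning the whole collection.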

 // If I want only a few selected columns/fields, here is how I would write my find() call.  
 // The second argument is a projection: "content" : 1 means return only "content" (plus "_id") from every document.  
 > db.blog.find({}, {"content" : 1})  
 { "_id" : ObjectId("52e779116cf8a5335f4ac083"), "content" : "Here's my blog post." }  
 { "_id" : ObjectId("52e79c216cf8a5335f4ac084") } // Only _id is returned when the document has no "content" field 
 // select content, name from blog   
 // but the result will include "_id", as shown below  
 > db.blog.find({}, {"content" : 1, "name" : 1})  
 { "_id" : ObjectId("52e15de00b50d96fdf69790f"), "content" : "Here's my blog post." }  
 { "_id" : ObjectId("533a641a0216f306dfcb7163"), "name" : "Honey singh" }  
 // select content, name from blog   
 // Let me remove the unwanted _id field from my result set to get clean data.  
 > db.blog.find({}, {_id : 0,"content" : 1, "name" : 1})  
 { "content" : "Here's my blog post." }  
 { "name" : "Honey singh" }
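
The projection rules used above ("_id" comes along unless you switch it off with _id : 0, and a missing field is simply skipped) can be sketched in plain JavaScript. project is a hypothetical helper for illustration only; it covers just the inclusion form used in this post, and the "_id" values are shown as plain strings.

```javascript
// Hypothetical helper mimicking an inclusion projection such as
// {"content": 1, "name": 1}; "_id" is kept unless the spec sets _id to 0.
function project(doc, spec) {
  const result = {};
  if (spec._id !== 0 && "_id" in doc) {
    result._id = doc._id;
  }
  for (const [field, flag] of Object.entries(spec)) {
    // Copy only requested fields that the document actually has.
    if (field !== "_id" && flag === 1 && field in doc) {
      result[field] = doc[field];
    }
  }
  return result;
}

const blogDocs = [
  { _id: "52e779116cf8a5335f4ac083", title: "My Blog Post", content: "Here's my blog post." },
  { _id: "52e79c216cf8a5335f4ac084", name: "Honey singh", location: "Mumbai" },
];

const withId = blogDocs.map((doc) => project(doc, { content: 1 }));
const withoutId = blogDocs.map((doc) => project(doc, { _id: 0, content: 1, name: 1 }));

console.log(withId);
console.log(withoutId);
```

In the first result the second document carries only its "_id", because it has no "content" field; in the second result the "_id" is gone from both.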

Data Science

Monday, April 14, 2014

MongoDB - Part I
