Data Science: December 2013

Wednesday, December 25, 2013

Arrays and Lists in Python

A good read about arrays and multi-dimensional arrays/lists in Python:

Arrays in basic Python are actually lists that can contain mixed datatypes. However, the numarray module contains support for true arrays, including multi-dimensional arrays, as well as IDL-style array operations and the where function. To use arrays, you must import numarray or from numarray import *.

Unfortunately, numarray generally only supports numeric arrays. Lists must be used for strings or objects. By importing numarray.strings and numarray.objects, you can convert string and object lists to arrays and use some of the numarray features, but only numeric lists are fully supported by numarray.

Creating lists: A list can be created by defining it with []. A list of numbers can also be created with the range function, which takes start and stop values and an increment; the stop value itself is not included.
list = [12, 4, 17, 9]
list2 = [30, "testing", True, 71.4]
a = range(6) #a = [0,1,2,3,4,5]
a = range(12,0,-2) #a = [12,10,8,6,4,2]
An empty list can be initialized with [] and then the append command can be used to append data to the end of the list:
a=[]
a.append("test")
a.append(5)
print a
-> ['test', 5]
Finally, if you want a list to have a predetermined size, you can create a list and fill it with None's:
length = 10
a=[None]*length
a[5] = "Fifth"
a[3] = 6
print len(a)
-> 10
print a
-> [None, None, None, 6, None, 'Fifth', None, None, None, None]
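The examples above use Python 2 syntax. As a minimal sketch, assuming Python 3 (where print is a function and range returns a lazy sequence instead of a list), the same list-building steps might look like this. The variable names are illustrative only:

```python
# Assumes Python 3: range() is lazy, so wrap it in list() to get a real list.
numbers = [12, 4, 17, 9]
mixed = [30, "testing", True, 71.4]     # lists may mix datatypes

counting_up = list(range(6))            # [0, 1, 2, 3, 4, 5]
counting_down = list(range(12, 0, -2))  # [12, 10, 8, 6, 4, 2]

# Grow a list one item at a time with append().
grown = []
grown.append("test")
grown.append(5)

# Pre-size a list by filling it with None, then assign by index.
length = 10
slots = [None] * length
slots[5] = "Fifth"
slots[3] = 6

print(counting_up)
print(counting_down)
print(grown)
print(len(slots), slots)
```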

Removing from lists: The pop method removes the item at a given index from the list:
a.pop(5)
print a
-> [None, None, None, 6, None, None, None, None, None]
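As a small sketch, assuming Python 3, pop also hands back the item it removed, and remove deletes by value instead of by index. The starting list mirrors the pre-sized list from the previous example:

```python
# Assumes Python 3.
a = [None, None, None, 6, None, "Fifth", None, None, None, None]

removed = a.pop(5)   # remove by index; the removed item is returned
print(removed)       # Fifth
print(len(a))        # 9

last = a.pop()       # with no argument, pop() removes the last item
a.remove(6)          # remove() deletes the first item equal to the value
print(len(a))        # 7
```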

Creating arrays: An array can be defined by one of four procedures: zeros, ones, arange, or array. zeros creates an array of a specified size containing all zeros:
a = zeros(5) #a=[0 0 0 0 0]
ones similarly creates an array of a certain size containing all ones:
a = ones(5) #a=[1 1 1 1 1]
arange works exactly the same as range, but produces an array instead of a list:
a = arange(10,0,-2) #a = [10 8 6 4 2]
And finally, array can be used to convert a list to an array. For instance, when reading from a file, you can create an empty list and take advantage of the append command and lists not having a fixed size. Then once the data is all in the list, you can convert it to an array:
a = [1, 3, 9] #create a list and append to it
a.append(3)
a.append(5)
print a
-> [1, 3, 9, 3, 5]
a = array(a)
print a
-> [1 3 9 3 5]

Multi-dimensional lists: Because Python arrays are actually lists, you are allowed to have jagged arrays. Multi-dimensional lists are just lists of lists:
a=[[0,1,2],[3,4,5]]
print a[1]
-> [3, 4, 5]
s = ["Lee", "Walsh", "Roberson"]
s2 = ["Williams", "Redick", "Ewing", "Dockery"]
s3 = [s, s2]
print s3[1][2]
-> Ewing
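A short sketch, assuming Python 3, of what jagged lists of lists look like in practice, plus one common pitfall when building a rectangular list of lists with the * operator. The names jagged, shared and grid are illustrative only:

```python
# Assumes Python 3.
s = ["Lee", "Walsh", "Roberson"]
s2 = ["Williams", "Redick", "Ewing", "Dockery"]
jagged = [s, s2]                       # rows of length 3 and 4
print([len(row) for row in jagged])    # [3, 4]
print(jagged[1][2])                    # Ewing

# Pitfall: [[0] * 3] * 2 repeats a reference to ONE inner list.
shared = [[0] * 3] * 2
shared[0][0] = 9
print(shared)                          # [[9, 0, 0], [9, 0, 0]]

# A list comprehension builds an independent inner list per row.
grid = [[0] * 3 for _ in range(2)]
grid[0][0] = 9
print(grid)                            # [[9, 0, 0], [0, 0, 0]]
```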

Multi-dimensional arrays: However, numarray does support true multi-dimensional arrays. These can be created through one of five methods: zeros, ones, array, arange, and reshape. zeros and ones work the same way as for single dimensions, except that they take a tuple of dimensions (a comma-separated list enclosed in parentheses) instead of a single argument:
a = zeros((3,5))
a[1,2] = 8
print a
-> [[0 0 0 0 0]
[0 0 8 0 0]
[0 0 0 0 0]]
b = ones((2,3,4)) #create a 2x3x4 array containing all ones.

array works the same way as for 1-d arrays: you can create a list and then convert it to an array. Note, though, that with multi-dimensional arrays, trying to use array to convert a jagged list into an array will cause an error. Lists must be rectangular to be converted to arrays.
s = ["Lee", "Walsh", "Roberson", "Brewer"]
s2 = ["Williams", "Redick", "Ewing", "Dockery"]
s3 = [s, s2]
s4 = array(s3)
print s4 + "test"
-> [['Leetest', 'Walshtest', 'Robersontest', 'Brewertest'],
['Williamstest', 'Redicktest', 'Ewingtest', 'Dockerytest']]
print s4[:,1:3]
-> [['Walsh', 'Roberson'],
['Redick', 'Ewing']]
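Since only rectangular lists can be converted, it can be handy to check the shape first. The following is a minimal sketch in plain Python 3 that does not use numarray; is_rectangular is a hypothetical helper name. It also shows a list-based equivalent of the s4[:,1:3] slice above:

```python
def is_rectangular(rows):
    """Return True if every row of a list of lists has the same length."""
    return len({len(row) for row in rows}) <= 1

s = ["Lee", "Walsh", "Roberson", "Brewer"]
s2 = ["Williams", "Redick", "Ewing", "Dockery"]
print(is_rectangular([s, s2]))        # True
print(is_rectangular([s[:3], s2]))    # False

# List-of-lists equivalent of the array slice s4[:,1:3]
middle = [row[1:3] for row in [s, s2]]
print(middle)                         # [['Walsh', 'Roberson'], ['Redick', 'Ewing']]
```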
arange also works the same as with 1-d arrays except you need to pass the shape parameter:
a = arange(25, shape=(5,5))
And finally, reshape can be used to convert a 1-d array into a multi-dimensional array. To create a 5x5 array with the elements numbered from 0 to 24, you could use:
b = arange(25)
b = reshape(b,5,5)
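To make the effect of reshape concrete, here is a rough sketch in plain Python 3 that splits a flat list into rows. It assumes row-major order (the first row is filled first), and reshape_flat is a hypothetical helper, not a numarray function:

```python
def reshape_flat(values, rows, cols):
    """Split a flat list into `rows` lists of `cols` items each (row-major)."""
    if len(values) != rows * cols:
        raise ValueError("size mismatch: %d items for %dx%d"
                         % (len(values), rows, cols))
    return [values[r * cols:(r + 1) * cols] for r in range(rows)]

b = reshape_flat(list(range(25)), 5, 5)
print(b[0])     # [0, 1, 2, 3, 4]
print(b[4])     # [20, 21, 22, 23, 24]
print(b[1][2])  # 7
```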

Array Dimensions and Subscripts: When creating a multi-dimensional array, the format is ([[depth,] height,] width). Therefore, when accessing array elements in a two dimensional array, the row is listed first, then the column. When accessing an element of a two-dimensional list, the following notation must be used: list[i][j]. However, two dimensional arrays can also use the notation: array[i,j]. In fact, this is the preferred notation of the two for arrays because you cannot use wildcards in the first dimension of the array[i][j] notation (i.e., array[1:3][4] would cause an error whereas array[1:3,4] is valid).

Wildcards can be used in array subscripts with the : character, which is known as slicing. This is similar to IDL, with one major difference: if a=[0 1 2 3 4 5], in IDL a[1:4] = [1 2 3 4], but in Python, a[1:4] = [1 2 3]. In Python, slicing array[i:j] returns an array containing the elements from i to j-1. Just like with strings, indices of arrays can be negative, in which case they count from the right instead of the left, e.g. a[-4:-1] = [2 3 4]. A : can also select everything from a given element onward, everything up to a given element, or all elements. Arrays or lists can also be used to subscript other arrays:
print a[:3] #[0 1 2]
print a[4:] #[4 5]
print a[:] #[0 1 2 3 4 5]
print a[[1,3,4]] #[1 3 4]
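Plain lists follow the same slicing rules, which makes them a convenient way to try the notation out. A minimal sketch, assuming Python 3; the one difference shown is that a list cannot be subscripted by another list, so a comprehension stands in for a[[1,3,4]]:

```python
# Assumes Python 3.
a = [0, 1, 2, 3, 4, 5]

print(a[1:4])     # [1, 2, 3]  (element 4 is excluded)
print(a[-4:-1])   # [2, 3, 4]  (negative indices count from the right)
print(a[:3])      # [0, 1, 2]
print(a[4:])      # [4, 5]
print(a[:])       # [0, 1, 2, 3, 4, 5]

# Plain lists cannot be indexed by a list of positions;
# a comprehension gives the equivalent of a[[1,3,4]] on an array.
picked = [a[i] for i in [1, 3, 4]]
print(picked)     # [1, 3, 4]
```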

Note that slicing an array in Python does not create a new array but just a pointer into the original array. b=a[0:10] followed by b[0] = 5 also changes a[0] to 5. To avoid this, use b = copy(a[0:10])
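The aliasing described above applies to arrays. As a contrasting sketch, assuming Python 3 and plain lists, a slice of a list is a new list, so writing to it leaves the original untouched, while plain assignment shares the same object:

```python
# Assumes Python 3 and plain lists (not arrays).
a = list(range(10))
b = a[0:10]   # slicing a list makes a (shallow) copy
b[0] = 5
print(a[0])   # 0

c = a         # plain assignment: both names refer to the same list
c[0] = 5
print(a[0])   # 5
```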

Array Operators:
Concatenation:

Lists: a + b
For Lists, the + operator appends the list on the right (b) to the list on the left.
a = ["Roberson", "Walsh"]
b = ["Lee", "Humphrey"]
-> a+b = ["Roberson", "Walsh", "Lee", "Humphrey"]

Arrays: concatenate((a,b)[,axis])
For arrays, use the numarray function concatenate. It also allows you to specify the axis when concatenating multi-dimensional arrays.
b = arange(5)
print concatenate((b, arange(6)))
-> [0 1 2 3 4 0 1 2 3 4 5]
b=reshape(b,5,1)
print concatenate((b,a),axis=1) #judging by the output, a is a 5x3 array of zeros with a[2,1] = 8
-> [[0 0 0 0]
[1 0 0 0]
[2 0 8 0]
[3 0 0 0]
[4 0 0 0]]
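The axis argument can be pictured with nested lists. The sketch below, assuming Python 3, rebuilds the output above without numarray; the 5x3 shape of a (with an 8 in row 2, column 1) is inferred from the printed result and is an assumption:

```python
# Assumes Python 3; nested lists stand in for 2-d arrays.
a = [[0, 0, 0] for _ in range(5)]   # 5 rows, 3 columns of zeros
a[2][1] = 8
b = [[n] for n in range(5)]         # 5 rows, 1 column: [[0], [1], ...]

# axis=1: join each row of b with the matching row of a.
side_by_side = [row_b + row_a for row_b, row_a in zip(b, a)]
for row in side_by_side:
    print(row)

# 1-d case: + on lists matches concatenate((b, arange(6))).
joined = list(range(5)) + list(range(6))
print(joined)   # [0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 5]
```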

Equality: a == b and Inequality: a != b
For lists, these work the same as for scalars, meaning they can be used in if statements. For arrays, they return an array containing true or false for each array element.
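A minimal sketch of the difference, assuming Python 3. Lists compare as a whole, and a comprehension over zip imitates the element-by-element result that an array comparison returns:

```python
# Assumes Python 3.
a = [1, 2, 3]
b = [1, 5, 3]

print(a == b)    # False: lists compare as a whole
print(a != b)    # True

# Element-by-element comparison, similar in spirit to what arrays return.
elementwise = [x == y for x, y in zip(a, b)]
print(elementwise)        # [True, False, True]
print(all(elementwise))   # False
```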

Wednesday, December 18, 2013

Unable to Read "XLSX" file using R

1. If you have RStudio, check the package list for the packages "xlsx" and "xlsxjars".

If the packages do not exist:
Click on "Install Packages".
Search for each package and install it.

2. After installing the packages, click the checkbox next to each one to load it.

3. You might see a Java-related problem when you try to load your Excel file, as shown below:

Loading required package: rJava

library(rJava)

Error : .onLoad failed in loadNamespace() for ‘rJava’, details:

call: fun(libname, pkgname)

error: JAVA_HOME cannot be determined from the Registry

To resolve this issue, download the latest 64-bit Java installer (jre-7u15-windows-x64.exe) from Oracle's website:

http://www.oracle.com/technetwork/java/javase/downloads/index.html

You will need to log in to the website with your username and password.

4. On a Windows machine, the next step is as follows:

Change or add the JAVA_HOME system variable so that it points to the location where Java has been installed.

The location might differ for 32-bit and 64-bit installations, as given below. Put the appropriate one in the system variable list:

JAVA_HOME='C:\\Program Files\\Java\\jre7' # for 64-bit version

JAVA_HOME='C:\\Program Files (x86)\\Java\\jre7' # for 32-bit version

5. Restart RStudio.

Enjoy! You should now be able to see the data being pulled correctly from Excel.
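If the JAVA_HOME error persists after the restart, it can help to confirm what the variable actually contains. The following is a minimal sketch in Python 3 (an assumption; the steps above only use R and the Windows settings), and check_java_home is a hypothetical helper name:

```python
import os

def check_java_home(env=None):
    """Report whether JAVA_HOME is set and points at an existing folder."""
    env = os.environ if env is None else env
    java_home = env.get("JAVA_HOME")
    if not java_home:
        return "JAVA_HOME is not set"
    if not os.path.isdir(java_home):
        return "JAVA_HOME points to a missing folder: " + java_home
    return "JAVA_HOME looks fine: " + java_home

# Checks the environment of the current process.
print(check_java_home())
```

Passing a dictionary instead of using the real environment makes it easy to try the function out with the two paths from step 4.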

Tuesday, December 17, 2013

SQOOP Installation on HADOOP to Connect with ORACLE and Pull the Data and Its Structures

1. Log into the terminal/PuTTY and go to the downloads folder.

Download the Sqoop .tar file using the command below:

wget http://mirror.olnevhost.net/pub/apache/sqoop/1.4.4/sqoop-1.4.4.bin__hadoop-1.0.0.tar.gz

Please note that if this version is not available, download any other stable version from Apache's mirror: http://apache.mirrors.hoobly.com/sqoop/

2. Unpack the file using the command below:

tar -xzvf sqoop-1.4.4.bin__hadoop-1.0.0.tar.gz

3. Move the unpacked folder to the install folder:

4. Log into Oracle's website below:

http://www.oracle.com/technetwork/database/enterprise-edition/jdbc-112010-090769.html

Download ojdbc6.jar by accepting the terms and conditions.

Please note that you will have to sign up with Oracle to download the file.

5. Copy ojdbc6.jar to the "~/install/sqoop/lib" folder.

6. Go to "/home/hadoop/lab/install/sqoop-1.4.3.bin__hadoop-1.0.0/conf"

Open sqoop-env.sh in the vi editor.

Or do the below:

cp sqoop-env-template.sh sqoop-env.sh (this will create the file if it does not exist)

vi sqoop-env.sh

Edit the paths in the file to match your machine.

7. Start Hadoop by running:

hadoop@ubuntu-server64:~/lab/install/hadoop-1.0.4/bin$ ./start-all.sh

Running "jps" should then list all the running Hadoop processes.

For Hadoop installation, please check my blog post on Hadoop installation.

8. Go to "/home/hadoop/lab/install/sqoop-1.4.3.bin__hadoop-1.0.0/bin" and test whether Sqoop is working (this example is for Oracle; for other RDBMSs, please revisit #4 above and download the appropriate JAR):

./sqoop list-databases --connect jdbc:oracle:thin:@192.168.10.100:1521:PROD --username SYSTEM --password manager

If the connection is working fine, it is time to go to the next step. Otherwise, there can be a couple of issues:

1. The correct .jar is not placed in the sqoop/lib folder.

2. The user/password used with JDBC doesn't have access to the database.

This is one type of error you might see:

================ERROR=====================================================

hadoop@ubuntu-server64:~/lab/install/sqoop-1.4.3.bin__hadoop-1.0.0/bin$ ./sqoop list-databases --connect jdbc:oracle:thin:@192.168.0.41:1521:PROD --username SPENDOMETER -P

Warning: /usr/lib/hbase does not exist! HBase imports will fail.

Please set $HBASE_HOME to the root of your HBase installation.

Enter password:

13/12/16 14:48:08 INFO manager.SqlManager: Using default fetchSize of 1000

13/12/16 14:48:09 INFO manager.OracleManager: Time zone has been set to GMT

13/12/16 14:48:09 ERROR manager.OracleManager: The catalog view DBA_USERS was not found. This may happen if the user does not have DBA privileges. Please check privileges and try again.

================ERROR=====================================================

9. Execute the command below to pull the data and load it into Hive.

./sqoop import --connect jdbc:oracle:thin:@192.168.0.41:1521:PROD --table TEMP_FIRST_NAME --username SPENDOMETER --password spendometer --hive-import --split-by first_name --target-dir /user/hive/warehouse/name

TEMP_FIRST_NAME -> Name of the table

username -> This must always be in capital letters (SPENDOMETER)

password -> This must always be in small letters (spendometer)

10. To check whether the files have been created, use the command below:

hadoop fs -ls /user/hive/warehouse/temp_first_name

This should list the folders in which the data will be available.

11. Start Hive with the command given below.

If Hive starts and you still don't see the table, you are pointing to the wrong metastore_DB from Hive.

12. Start Hive using the "bash" command as below:

hadoop@ubuntu-server64:~/lab/install/sqoop-1.4.3.bin__hadoop-1.0.0/bin$ bash /home/hadoop/lab/install/hive-0.9.0/bin/hive

This should now fetch your data.
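The import command from step 9 has many flags, and a typo in any of them is easy to miss. As a rough sketch, assuming Python 3 is available on the machine, the command line could be assembled from named parts. build_sqoop_import is a hypothetical helper, not part of Sqoop; it only uses flags that appear in the commands above, with -P (prompt for the password) in place of --password:

```python
import shlex

def build_sqoop_import(host, port, sid, table, username, split_by, target_dir):
    """Assemble the argument list for a Sqoop import into Hive (hypothetical helper)."""
    return [
        "./sqoop", "import",
        "--connect", "jdbc:oracle:thin:@%s:%s:%s" % (host, port, sid),
        "--table", table,
        "--username", username,
        "-P",                      # prompt for the password instead of passing it
        "--hive-import",
        "--split-by", split_by,
        "--target-dir", target_dir,
    ]

args = build_sqoop_import("192.168.0.41", 1521, "PROD", "TEMP_FIRST_NAME",
                          "SPENDOMETER", "first_name", "/user/hive/warehouse/name")
# Print the command in a form that can be pasted into the shell.
print(" ".join(shlex.quote(a) for a in args))
```

Keeping the arguments in a list also makes it straightforward to pass them to subprocess.run later, should the import need to be scripted.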

Wednesday, December 4, 2013

Installing Hadoop on a Linux/Ubuntu Machine

Follow the procedure in sequence as given below.

step 1: Installation of Java 1.7/1.6.

Check whether Java is already installed.

Go to the JVM folder and check whether multiple Java installations exist.

If they do, remove all of them and leave the JVM folder clean (please note that there should be no files or folders left in the JVM directory).

Uninstall the Java settings.

Install Java.

step 2: Installation of SSH

Steps to be followed :

i. sudo apt-get install ssh

ii. su - root

iii. ssh-keygen -t rsa -P ""

iv. cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

v. ssh localhost

Running "ssh localhost" will ask whether you want to continue connecting.

Enter “yes” when the question is asked.

step 3: Download Hadoop from Apache.

$ wget http://apache.mirrors.hoobly.com/hadoop/common/stable/hadoop-2.2.0.tar.gz
$ cd /home/hadoop
$ tar xzf hadoop-2.2.0.tar.gz
$ mv hadoop-2.2.0 /usr/local/hadoop

step 4: Single Node Cluster – Setup

i. Go to /usr/local/hadoop/etc/hadoop/
ii. Edit hadoop-env.sh and set JAVA_HOME to the path of the Sun JDK.

iii. To find the JDK folder, type “whereis jvm”.

Go to /home/hadoop/Downloads/hadoop-2.2.0/bin

The result should look like this

Do the setup below in the directory
/home/hadoop/Downloads/hadoop-2.2.0/etc/hadoop

core-site.xml

<configuration>

<property>

<name>fs.default.name</name>

<value>hdfs://localhost:9000</value>

</property>

</configuration>

hdfs-site.xml

<configuration>

<property>

<name>dfs.replication</name>

<value>1</value>

</property>

</configuration>

mapred-site.xml

<configuration>

<property>

<name>mapred.job.tracker</name>

<value>localhost:9001</value>

</property>

</configuration>
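The three files above share the same property layout. As a minimal sketch, assuming Python 3 and its standard xml.etree.ElementTree module, the same documents could be generated from name/value pairs. site_xml is a hypothetical helper, and writing the result to the actual files is left out:

```python
import xml.etree.ElementTree as ET

def site_xml(properties):
    """Build a Hadoop-style <configuration> document from a dict of name -> value."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")

# The three single-node settings listed above.
core = site_xml({"fs.default.name": "hdfs://localhost:9000"})
hdfs = site_xml({"dfs.replication": 1})
mapred = site_xml({"mapred.job.tracker": "localhost:9001"})
print(core)
```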

Format Name-node for first time use
===========================
/home/hadoop/Downloads/hadoop-2.2.0/bin/hadoop namenode -format

Start hadoop daemons
=============================
/home/hadoop/Downloads/hadoop-2.2.0/sbin/start-dfs.sh
/home/hadoop/Downloads/hadoop-2.2.0/sbin/start-yarn.sh

Web interface available

=================

NameNode - http://localhost:50070/

JobTracker - http://localhost:50030/

Stop hadoop daemons

=================

/home/hadoop/Downloads/hadoop-2.2.0/sbin/stop-dfs.sh

/home/hadoop/Downloads/hadoop-2.2.0/sbin/stop-yarn.sh

Map-reduce: Create a .JAR File and Execute the Program Step by Step

This blog is meant for non-Java professionals for whom creating a .jar file and executing it as a Hadoop program is a big challenge.

Assumptions :
1. Eclipse is already installed on the machine .
2. Hadoop already installed or a VM is already available .
3. WinSCP already installed .

Step 0: Most importantly, please note that the version of Java installed on the Linux/Ubuntu server should be the same as the Java version used by the Eclipse installation where you are creating the .jar.