Big Data on a Shoestring by Nicholas Bessmer Page A


installation, use SFTP plugin:



    It lets you see both the files on your local computer and those on your remote EC2 instance. Now download the following to your computer:

    http://www.sai.msu.su/apache/hadoop/core/stable/

    for the latest stable version of Hadoop. Then download Cassandra:

    http://cassandra.apache.org/download/

    “PIG” is a query language designed for Big Data. We will use it to query our Big Data dataset.

    http://www.sai.msu.su/apache/pig

    Now copy these over to your new EC2 Linux Server:
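
    If you prefer the command line to the SFTP plugin, scp can do the same copy. The sketch below only prints the commands so you can check them first. The key file name and the instance address are placeholders (assumptions), not values from this book; ec2-user is the login used elsewhere in this chapter.

```shell
# Placeholder values (assumptions): replace with your own key file and instance address.
key_file="$HOME/.ssh/my-ec2-key.pem"
ec2_host="ec2-user@your-instance-public-dns"

# Print one scp command per downloaded archive.
print_copy_commands() {
    for archive in hadoop-0.20.2.tar.gz apache-cassandra-1.2.1-bin.tar.gz pig-0.10.1.tar.gz; do
        # The trailing colon means "the remote user's home directory".
        echo "scp -i ${key_file} ${archive} ${ec2_host}:"
    done
}

print_copy_commands
```

    Once the printed lines look right, paste them into your local terminal one at a time to perform the copy.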



    Once the files have been copied, copy and paste the following commands:

    tar -xvf hadoop-0.20.2.tar.gz
    tar -xvf apache-cassandra-1.2.1-bin.tar.gz
    tar -xvf pig-0.10.1.tar.gz
    Then change into the Pig directory and unpack the tutorial by typing these commands:

    »         cd pig-0.10.1 (cd changes the current directory)
    »         tar -xvf tutorial.tar (you can also use the gunzip utility)

    This extracts the files, which are compressed much like a ZIP file.
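
    If you want to see what an archive contains before unpacking it, tar can list the contents with -t instead of -x. Here is a small sketch that builds a throwaway archive in a temporary directory so it can be run anywhere; the demo-1.0 names are made up for illustration.

```shell
# Build a tiny throwaway archive (hypothetical names) in a scratch directory.
workdir=$(mktemp -d)
mkdir -p "$workdir/demo-1.0/bin"
echo "hello" > "$workdir/demo-1.0/bin/tool"
tar -czf "$workdir/demo-1.0.tar.gz" -C "$workdir" demo-1.0
rm -rf "$workdir/demo-1.0"

# -t lists the contents without extracting; -z handles the gzip compression.
tar -tzf "$workdir/demo-1.0.tar.gz"

# -x extracts; -C chooses the directory to extract into.
tar -xzf "$workdir/demo-1.0.tar.gz" -C "$workdir"
extracted=$(cat "$workdir/demo-1.0/bin/tool")
echo "$extracted"
```

    The same -t listing works on the Hadoop, Cassandra and Pig archives above.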

    It is possible to choose MS Windows Server as your preferred EC2 server. We installed Linux here because it is cheaper to run than Windows Server, but this means editing files with the vi editor, which is a bit harder to do. Look up vi on the Internet. It is like a very powerful Windows Notepad, but it is driven from the command line.


Getting The Linux Environment Set Up – Basic Steps

    Type the following:

    »         cd   (changes to the main directory)
    »         vi .bash_profile (vi is the editor, and you will be modifying a simple text configuration file). Please see this helpful vi tutorial from the University of California, San Diego:

    http://acms.ucsd.edu/info/vi_tutorial.html

    »         copy and paste the following into your file
# .bash_profile

# Get the aliases and functions
if [ -f ~/.bashrc ]; then
        . ~/.bashrc
fi

# User specific environment and startup programs

PATH=$PATH:$HOME/bin:/home/ec2-user/hadoop-0.20.2/bin:/home/ec2-user/pig-0.10.1/bin

export PATH
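
    The new PATH only takes effect in a new login session, or after you reload the profile by typing . ~/.bash_profile. To confirm that it worked, a small check like the following sketch can help. The on_path helper is a name made up for this example; the two directories are the ones added to the profile.

```shell
# Return success if the given directory is one of the entries in PATH.
on_path() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Report on the two directories added in .bash_profile.
for dir in /home/ec2-user/hadoop-0.20.2/bin /home/ec2-user/pig-0.10.1/bin; do
    if on_path "$dir"; then
        echo "on PATH: $dir"
    else
        echo "missing from PATH: $dir"
    fi
done
```

    If either directory is reported as missing, reopen .bash_profile and check the PATH line for typing mistakes.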

Editing Our Hadoop Configuration Files

    Now that the Hadoop TAR file from step #1 above has been downloaded and extracted, we need to edit the following files and then run the commands below:

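Edit conf/core-site.xml in the Hadoop directory. I have used localhost in the value of fs.default.name [1]. The property goes inside the <configuration> element:

    <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
    </property>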


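Edit conf/mapred-site.xml in the Hadoop directory. The property goes inside the <configuration> element:

    <property>
        <name>mapred.job.tracker</name>
        <value>localhost:9001</value>
    </property>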


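Edit conf/hdfs-site.xml in the Hadoop directory. Since this test cluster has a single node, the replication factor should be set to 1. The property goes inside the <configuration> element:

    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>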
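
If you would rather not type the XML by hand in vi, the three files can also be generated from the shell. This is only a sketch: write_site is a helper name made up for this example, and it writes into a temporary directory so that nothing real is overwritten. Point conf_dir at the conf directory of your Hadoop install once you are happy with the output.

```shell
# Scratch directory for the demo; set this to hadoop-0.20.2/conf to apply for real.
conf_dir=$(mktemp -d)

# Write one Hadoop config file holding a single property.
# usage: write_site FILE NAME VALUE
write_site() {
    cat > "$conf_dir/$1" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>$2</name>
    <value>$3</value>
  </property>
</configuration>
EOF
}

write_site core-site.xml   fs.default.name    hdfs://localhost:9000
write_site mapred-site.xml mapred.job.tracker localhost:9001
write_site hdfs-site.xml   dfs.replication    1

# Show one of the generated files.
cat "$conf_dir/hdfs-site.xml"
```

Each generated file replaces the whole config file, so only use this on a fresh install where the files hold no other settings.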

Format the name node (once per install).

    $ bin/hadoop namenode -format

    It should print out something like the following message:

    12/07/15 15:54:20 INFO namenode.NameNode: STARTUP_MSG:
    /************************************************************
    STARTUP_MSG: Starting NameNode
    STARTUP_MSG:   host = Shamim-2.local/192.168.0.103
    STARTUP_MSG:   args = [-format]
    STARTUP_MSG:   version = 0.20.2
    STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
    ************************************************************/
    12/07/15 15:54:21 INFO namenode.FSNamesystem: fsOwner=samim,staff,com.apple.sharepoint.group.1,everyone,_appstore,localaccounts,_appserverusr,admin,_appserveradm,_lpadmin,_lpoperator,_developer,com.apple.access_screensharing
    12/07/15 15:54:21 INFO namenode.FSNamesystem: supergroup=supergroup
    12/07/15 15:54:21 INFO namenode.FSNamesystem: isPermissionEnabled=true
    12/07/15 15:54:21 INFO common.Storage: Image file of size 95 saved in 0 seconds.
    12/07/15 15:54:21 INFO common.Storage: Storage directory