
Using Hadoop And PHP

Getting Started

First things first.  If you haven’t used Hadoop before, you’ll need to download a Hadoop release and make sure you have Java and PHP installed.  To download Hadoop, head over to:

http://hadoop.apache.org/common/releases.html

Click on “Download a release” and choose a mirror.  I suggest choosing the most recent stable release.  Once you’ve downloaded Hadoop, extract the archive.

user@computer:$ tar xpf hadoop-0.20.2.tar.gz

I like to create a symlink to the hadoop-<release> directory to make things easier to manage.

user@computer:$ ln -s hadoop-0.20.2 hadoop

Now you should have everything you need to start creating a Hadoop PHP job.

Creating The Job

For this example I’m going to create a simple Map/Reduce job for Hadoop.  Let’s start by understanding what we want to happen.

  1. We want to read from an input source – this is our mapper
  2. We want to do something with what we’ve mapped – this is our reducer

At the root of your development directory, let’s create another directory called script.  This is where we’ll store our PHP mapper and reducer files.

user@computer:$ ls -a

.
..
hadoop-0.20.2
hadoop-0.20.2.tar.gz
hadoop
user@computer:$ mkdir script

Now let’s begin creating our mapper script in PHP.  Go ahead and create a PHP file called mapper.php under the script directory.

user@computer:$ touch script/mapper.php

Now let’s look at the basic structure of a PHP mapper.

#!/usr/bin/php
<?php
//this can be anything from reading input from files, to retrieving database content, soap calls, etc.
//for this example I'm going to create a simple php associative array.
$a = array(
'first_name' => 'Hello',
'last_name' => 'World'
);
//it's important to note that anything you send to STDOUT becomes the mapper's output, which Hadoop passes on to the reducer.
//it's also important to end every record you write to STDOUT with a PHP_EOL, this will save you a lot of pain.
echo serialize($a), PHP_EOL;
?>

This example is extremely simple: create an associative array and serialize it.  Now on to the reducer.  Create a PHP file in the script directory called reducer.php.

user@computer:$ touch script/reducer.php

Now let’s take a look at the layout of a reducer.

#!/usr/bin/php
<?php
//Hadoop feeds whatever our mapper wrote to STDOUT into the reducer's STDIN,
//so we read from STDIN to get the result of our mapper.
//iterate all lines of output from our mapper
while (($line = fgets(STDIN)) !== false) {
    //remove leading and trailing whitespace, just in case 🙂
    $line = trim($line);
    //now recreate the array we serialized in our mapper
    $a = unserialize($line);
    //Now, we do whatever we need to with the data.  Write it out again so another process can pick it up,
    //send it to the database, soap call, whatever.  In this example, just change it a little and
    //write it back out.
    $a['middle_name'] = 'Jason';
    //do not forget the PHP_EOL
    echo serialize($a), PHP_EOL;
}//end while
?>

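So now we have a very simple mapper and reducer ready to go.  Hadoop runs both scripts directly through their #!/usr/bin/php line, so make sure they are executable (chmod +x script/mapper.php script/reducer.php).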
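The mapper above builds its data by hand and ignores its input.  A mapper that consumes real input would read it line by line, just like the reducer does.  Here is a minimal sketch of that idea.  The function names map_line and run_mapper and the "first_name last_name" input format are my own assumptions for illustration, not anything Hadoop requires.

```php
<?php
// Hypothetical helper: turn one raw input line into one serialized record.
// Returns null for blank lines so the caller can skip them.
function map_line(string $line): ?string
{
    $line = trim($line);
    if ($line === '') {
        return null;
    }
    // Assumed input format (illustration only): "first_name last_name".
    $parts = preg_split('/\s+/', $line, 2);
    $record = array(
        'first_name' => $parts[0],
        'last_name'  => isset($parts[1]) ? $parts[1] : '',
    );
    return serialize($record);
}

// Hypothetical helper: read a stream line by line, the same way the
// reducer reads STDIN, and write one record per line to the output stream.
function run_mapper($in, $out): void
{
    while (($line = fgets($in)) !== false) {
        $mapped = map_line($line);
        if ($mapped !== null) {
            // do not forget the PHP_EOL
            fwrite($out, $mapped . PHP_EOL);
        }
    }
}

// In a real mapper script the call would be: run_mapper(STDIN, STDOUT);
// Here an in-memory stream stands in for STDIN so the sketch runs on its own.
$in = fopen('php://memory', 'w+');
fwrite($in, "Hello World\n\nJane Doe\n");
rewind($in);
run_mapper($in, STDOUT);
fclose($in);
```

Keeping the per-line logic in its own function makes it easy to try out without Hadoop in the picture at all.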

Execution

So now let’s run it and see what happens.  But first, a little prep work.  We need to specify the input directory that will be used when the job runs.

user@computer:$ mkdir input
user@computer:$ touch input/conf

OK, that was difficult.  We have an input directory and we’ve created an empty conf file.  The empty conf file just gives Hadoop an input to hand to the mapper, so the job has something to start on.  For now, don’t worry about it.  Now let’s run this bad boy.  Make sure you have your JAVA_HOME set; on many systems the right value is /usr.  You can set it by running export JAVA_HOME=/usr.

user@computer:$ hadoop/bin/hadoop jar hadoop/contrib/streaming/hadoop-0.20.2-streaming.jar -mapper script/mapper.php -reducer script/reducer.php -input input/* -output output

So here’s what the command does.  The first part runs the hadoop launcher script.  The “jar” argument tells hadoop to run a jar, in this case “hadoop/contrib/streaming/hadoop-0.20.2-streaming.jar”.  Next we pass the mapper and reducer scripts to the job and specify the input and output directories.

If we wanted to, we could pass configuration information, files, etc. to the mapper by placing them in the input directory.  The mapper would read them from STDIN using the same line-by-line loop that we used in the reducer.  But for this example, we’ll just pass nothing.

The output directory will contain the output of our reducer.  If everything works out correctly, it will contain the PHP serialized form of our modified $a array.  If all goes well you should see something like this:

user@computer:$ hadoop/bin/hadoop jar hadoop/contrib/streaming/hadoop-0.20.2-streaming.jar -mapper script/mapper.php -reducer script/reducer.php -input input/* -output output

10/12/10 12:53:56 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
10/12/10 12:53:56 WARN mapred.JobClient: No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
10/12/10 12:53:56 INFO mapred.FileInputFormat: Total input paths to process : 1
10/12/10 12:53:56 INFO streaming.StreamJob: getLocalDirs(): [/tmp/hadoop-root/mapred/local]
10/12/10 12:53:56 INFO streaming.StreamJob: Running job: job_local_0001
10/12/10 12:53:56 INFO streaming.StreamJob: Job running in-process (local Hadoop)
10/12/10 12:53:56 INFO mapred.FileInputFormat: Total input paths to process : 1
10/12/10 12:53:56 INFO mapred.MapTask: numReduceTasks: 1
10/12/10 12:53:56 INFO mapred.MapTask: io.sort.mb = 100
10/12/10 12:53:57 INFO mapred.MapTask: data buffer = 79691776/99614720
10/12/10 12:53:57 INFO mapred.MapTask: record buffer = 262144/327680
10/12/10 12:53:57 INFO streaming.PipeMapRed: PipeMapRed exec [/root/./script/mapper.php]
10/12/10 12:53:57 INFO streaming.PipeMapRed: MRErrorThread done
10/12/10 12:53:57 INFO streaming.PipeMapRed: Records R/W=0/1
10/12/10 12:53:57 INFO streaming.PipeMapRed: MROutputThread done
10/12/10 12:53:57 INFO streaming.PipeMapRed: mapRedFinished
10/12/10 12:53:57 INFO mapred.MapTask: Starting flush of map output
10/12/10 12:53:57 INFO mapred.MapTask: Finished spill 0
10/12/10 12:53:57 INFO mapred.TaskRunner: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
10/12/10 12:53:57 INFO mapred.LocalJobRunner: Records R/W=0/1
10/12/10 12:53:57 INFO mapred.TaskRunner: Task 'attempt_local_0001_m_000000_0' done.
10/12/10 12:53:57 INFO mapred.LocalJobRunner:
10/12/10 12:53:57 INFO mapred.Merger: Merging 1 sorted segments
10/12/10 12:53:57 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 70 bytes
10/12/10 12:53:57 INFO mapred.LocalJobRunner:
10/12/10 12:53:57 INFO streaming.PipeMapRed: PipeMapRed exec [/root/./script/reducer.php]
10/12/10 12:53:57 INFO streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
10/12/10 12:53:57 INFO streaming.PipeMapRed: Records R/W=1/1
10/12/10 12:53:57 INFO streaming.PipeMapRed: MROutputThread done
10/12/10 12:53:57 INFO streaming.PipeMapRed: MRErrorThread done
10/12/10 12:53:57 INFO streaming.PipeMapRed: mapRedFinished
10/12/10 12:53:57 INFO mapred.TaskRunner: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
10/12/10 12:53:57 INFO mapred.LocalJobRunner:
10/12/10 12:53:57 INFO mapred.TaskRunner: Task attempt_local_0001_r_000000_0 is allowed to commit now
10/12/10 12:53:57 INFO mapred.FileOutputCommitter: Saved output of task 'attempt_local_0001_r_000000_0' to file:/root/output
10/12/10 12:53:57 INFO mapred.LocalJobRunner: Records R/W=1/1 > reduce
10/12/10 12:53:57 INFO mapred.TaskRunner: Task 'attempt_local_0001_r_000000_0' done.
10/12/10 12:53:57 INFO streaming.StreamJob:  map 100%  reduce 100%
10/12/10 12:53:57 INFO streaming.StreamJob: Job complete: job_local_0001
10/12/10 12:53:57 INFO streaming.StreamJob: Output: output

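If you get an error complaining about the output directory, it is most likely because the directory already exists; Hadoop will not overwrite an existing output directory.  Just remove it (rm -r output) and try again.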
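If the job fails for some other reason, it can be easier to check the PHP logic on its own before blaming Hadoop.  Below is a rough sketch that imitates the map, sort and reduce steps in a single PHP script.  The functions mapper_output and reduce_line are stand-ins I made up for mapper.php and reducer.php; this is not how Hadoop itself runs the job.

```php
<?php
// Stand-in for mapper.php: returns the lines the mapper would print.
function mapper_output(): array
{
    $a = array(
        'first_name' => 'Hello',
        'last_name'  => 'World',
    );
    return array(serialize($a));
}

// Stand-in for the body of the reducer loop: takes one line of mapper
// output and returns the line the reducer would print, or null if the
// line is not a serialized array.
function reduce_line(string $line): ?string
{
    // @ silences the notice PHP raises on malformed input;
    // unserialize() returns false in that case.
    $a = @unserialize(trim($line));
    if (!is_array($a)) {
        return null;
    }
    $a['middle_name'] = 'Jason';
    return serialize($a);
}

$lines = mapper_output();
// Hadoop sorts the mapper output by key before the reduce step.
sort($lines);

foreach ($lines as $line) {
    $result = reduce_line($line . PHP_EOL);
    if ($result === null) {
        fwrite(STDERR, 'Could not unserialize: ' . $line . PHP_EOL);
        continue;
    }
    echo $result, PHP_EOL;
}
```

If this prints the serialized array you expect, the scripts themselves are probably fine and the problem is more likely in the paths, permissions or Hadoop setup.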

Result

Once you’ve got something similar to the above and no errors, we can check out the result.

user@computer:$ cat output/*

a:3:{s:10:"first_name";s:5:"Hello";s:9:"last_name";s:5:"World";s:11:"middle_name";s:5:"Jason";}

There we go, a serialized form of our modified PHP array $a.  That’s all there is to it.  Now, go forth and Hadoop.
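If another PHP process needs to pick the result up, it can turn each output line back into an array.  Here is a small sketch; decode_output_line is a name I made up, and the sample line is simply the output shown above with a trailing tab added, since Hadoop Streaming may append its key/value separator to a line.

```php
<?php
// Hypothetical helper: decode one line of job output back into a PHP array.
// Returns null if the line does not contain a serialized array.
function decode_output_line(string $line): ?array
{
    // trim() drops the newline and any trailing tab separator.
    // allowed_classes => false (PHP 7.0+) stops unserialize() from
    // instantiating objects, which is safer for data read from files.
    $value = @unserialize(trim($line), array('allowed_classes' => false));
    return is_array($value) ? $value : null;
}

$line = 'a:3:{s:10:"first_name";s:5:"Hello";s:9:"last_name";s:5:"World";s:11:"middle_name";s:5:"Jason";}' . "\t\n";

$a = decode_output_line($line);
if ($a !== null) {
    // prints "Hello Jason World"
    echo $a['first_name'], ' ', $a['middle_name'], ' ', $a['last_name'], PHP_EOL;
}
```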