Using Hadoop And PHP

Getting Started

So, first things first. If you haven't used Hadoop before, you'll need to download a Hadoop release and make sure you have Java and PHP installed. To download Hadoop, head over to:

http://hadoop.apache.org/common/releases.html

Click "Download a release" and choose a mirror; I suggest the most recent stable release. Once you've downloaded Hadoop, unzip it.

user@computer:$ tar xzpf hadoop-0.20.2.tar.gz

I like to create a symlink to the hadoop-<release> directory to make things easier to manage.

user@computer:$ ln -s hadoop-0.20.2 hadoop

Now you should have everything you need to start creating a Hadoop PHP job.

Creating The Job

For this example I’m going to create a simple Map/Reduce job for Hadoop.  Let’s start by understanding what we want to happen.

  1. We want to read from an input source – this is our mapper.
  2. We want to do something with what we’ve mapped – this is our reducer.

At the root of your development directory, let’s create another directory called script.  This is where we’ll store our PHP mapper and reducer files.

user@computer:$ ls

hadoop
hadoop-0.20.2
hadoop-0.20.2.tar.gz

user@computer:$ mkdir script

Now let’s begin creating our mapper script in PHP.  Go ahead and create a PHP file called mapper.php under the script directory.

user@computer:$ touch script/mapper.php

Now let’s look at the basic structure of a PHP mapper.

#!/usr/bin/php
<?php
// The mapper can get its data from anywhere: reading input files, querying a
// database, making SOAP calls, etc. For this example we'll just build a
// simple PHP associative array.
$a = array(
    'first_name' => 'Hello',
    'last_name' => 'World'
);
// Anything written to STDOUT becomes the mapper's output and is handed on to
// the reducer. Always end each record with PHP_EOL; this will save you a lot of pain.
echo serialize($a), PHP_EOL;
?>
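
Since the mapper is just a PHP script, you can sanity-check it on its own from the command line before Hadoop ever gets involved:

user@computer:$ php script/mapper.php

a:2:{s:10:"first_name";s:5:"Hello";s:9:"last_name";s:5:"World";}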

So this example is extremely simple: create an associative array and serialize it.  Now on to the reducer.  Create a PHP file in the script directory called reducer.php.

user@computer:$ touch script/reducer.php

Now let’s take a look at the layout of a reducer.

#!/usr/bin/php
<?php
// Remember when I said anything sent to STDOUT in our mapper would go to the
// reducer? Well, now we read from STDIN to get the result of our mapper.
// Note: keep the opening PHP tag immediately after the shebang; any stray
// text or blank lines outside the PHP tags would be echoed straight into
// the job's output.
// Iterate over all lines of output from our mapper.
while (($line = fgets(STDIN)) !== false) {
    // Remove leading and trailing whitespace, just in case.
    $line = trim($line);
    // Recreate the array we serialized in our mapper.
    $a = unserialize($line);
    // Now we do whatever we need to with the data: write it out again so
    // another process can pick it up, send it to a database, make a SOAP
    // call, whatever. In this example, just change it a little and write
    // it back out.
    $a['middle_name'] = 'Jason';
    // Do not forget the PHP_EOL.
    echo serialize($a), PHP_EOL;
} //end while
?>
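
You can exercise the whole pipeline locally by piping the mapper into the reducer:

user@computer:$ php script/mapper.php | php script/reducer.php

a:3:{s:10:"first_name";s:5:"Hello";s:9:"last_name";s:5:"World";s:11:"middle_name";s:5:"Jason";}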

So now we have a very simple mapper and reducer ready to go.

Execution

So now let’s run it and see what happens.  But first, a little prep work.  We need to specify the input directory that will be used when the job runs.

user@computer:$ mkdir input
user@computer:$ touch input/conf

Ok, that was difficult.  We now have an input directory and an empty conf file inside it.  The empty conf file is just something for the mapper to read so the job has input to get started on; for now, don’t worry about it.  Now let’s run this bad boy.  First, make sure your JAVA_HOME is set (it usually points at /usr):

user@computer:$ export JAVA_HOME=/usr
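
Also make sure both scripts are executable; Hadoop Streaming runs them directly, relying on the shebang line at the top of each file:

user@computer:$ chmod +x script/*.php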

user@computer:$ hadoop/bin/hadoop jar hadoop/contrib/streaming/hadoop-0.20.2-streaming.jar -mapper script/mapper.php -reducer script/reducer.php -input input/* -output output

So here’s what the command does.  The first part executes the hadoop launcher script.  The “jar” argument tells Hadoop to run a jar, in this case the Streaming jar, “hadoop/contrib/streaming/hadoop-0.20.2-streaming.jar”, which lets executables like our PHP scripts act as the mapper and reducer.  Next we pass the mapper and reducer scripts to the job and specify the input and output directories.  If we wanted to, we could pass configuration information, files, and so on to the mapper through the input directory; the mapper would read it using the same line-read structure we used in the reducer (see the sketch below).  For this example, though, we pass nothing.  The output directory will contain the output of our reducer: if everything works out correctly, the PHP-serialized form of our modified $a array.
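
As an illustration of that point (not something this job needs), here is a minimal sketch of a mapper that reads configuration from its input files, assuming a made-up key=value-per-line format:

#!/usr/bin/php
<?php
// Hypothetical example: the job's -input files contain key=value pairs,
// one per line, which Hadoop feeds to this mapper on STDIN.
$conf = array();
while (($line = fgets(STDIN)) !== false) {
    $line = trim($line);
    // Skip blank or malformed lines.
    if ($line === '' || strpos($line, '=') === false) {
        continue;
    }
    list($key, $value) = explode('=', $line, 2);
    $conf[$key] = $value;
}
// Hand the parsed configuration to the reducer as a single serialized record.
echo serialize($conf), PHP_EOL;
?>

Back to our job.  If all goes well you should see something like this: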

user@computer:$ hadoop/bin/hadoop jar hadoop/contrib/streaming/hadoop-0.20.2-streaming.jar -mapper script/mapper.php -reducer script/reducer.php -input input/* -output output

10/12/10 12:53:56 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
10/12/10 12:53:56 WARN mapred.JobClient: No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
10/12/10 12:53:56 INFO mapred.FileInputFormat: Total input paths to process : 1
10/12/10 12:53:56 INFO streaming.StreamJob: getLocalDirs(): [/tmp/hadoop-root/mapred/local]
10/12/10 12:53:56 INFO streaming.StreamJob: Running job: job_local_0001
10/12/10 12:53:56 INFO streaming.StreamJob: Job running in-process (local Hadoop)
10/12/10 12:53:56 INFO mapred.FileInputFormat: Total input paths to process : 1
10/12/10 12:53:56 INFO mapred.MapTask: numReduceTasks: 1
10/12/10 12:53:56 INFO mapred.MapTask: io.sort.mb = 100
10/12/10 12:53:57 INFO mapred.MapTask: data buffer = 79691776/99614720
10/12/10 12:53:57 INFO mapred.MapTask: record buffer = 262144/327680
10/12/10 12:53:57 INFO streaming.PipeMapRed: PipeMapRed exec [/root/./script/mapper.php]
10/12/10 12:53:57 INFO streaming.PipeMapRed: MRErrorThread done
10/12/10 12:53:57 INFO streaming.PipeMapRed: Records R/W=0/1
10/12/10 12:53:57 INFO streaming.PipeMapRed: MROutputThread done
10/12/10 12:53:57 INFO streaming.PipeMapRed: mapRedFinished
10/12/10 12:53:57 INFO mapred.MapTask: Starting flush of map output
10/12/10 12:53:57 INFO mapred.MapTask: Finished spill 0
10/12/10 12:53:57 INFO mapred.TaskRunner: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
10/12/10 12:53:57 INFO mapred.LocalJobRunner: Records R/W=0/1
10/12/10 12:53:57 INFO mapred.TaskRunner: Task 'attempt_local_0001_m_000000_0' done.
10/12/10 12:53:57 INFO mapred.LocalJobRunner:
10/12/10 12:53:57 INFO mapred.Merger: Merging 1 sorted segments
10/12/10 12:53:57 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 70 bytes
10/12/10 12:53:57 INFO mapred.LocalJobRunner:
10/12/10 12:53:57 INFO streaming.PipeMapRed: PipeMapRed exec [/root/./script/reducer.php]
10/12/10 12:53:57 INFO streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
10/12/10 12:53:57 INFO streaming.PipeMapRed: Records R/W=1/1
10/12/10 12:53:57 INFO streaming.PipeMapRed: MROutputThread done
10/12/10 12:53:57 INFO streaming.PipeMapRed: MRErrorThread done
10/12/10 12:53:57 INFO streaming.PipeMapRed: mapRedFinished
10/12/10 12:53:57 INFO mapred.TaskRunner: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
10/12/10 12:53:57 INFO mapred.LocalJobRunner:
10/12/10 12:53:57 INFO mapred.TaskRunner: Task attempt_local_0001_r_000000_0 is allowed to commit now
10/12/10 12:53:57 INFO mapred.FileOutputCommitter: Saved output of task 'attempt_local_0001_r_000000_0' to file:/root/output
10/12/10 12:53:57 INFO mapred.LocalJobRunner: Records R/W=1/1 > reduce
10/12/10 12:53:57 INFO mapred.TaskRunner: Task 'attempt_local_0001_r_000000_0' done.
10/12/10 12:53:57 INFO streaming.StreamJob:  map 100%  reduce 100%
10/12/10 12:53:57 INFO streaming.StreamJob: Job complete: job_local_0001
10/12/10 12:53:57 INFO streaming.StreamJob: Output: output

If you get errors complaining that the output directory already exists, just remove the output directory and try again.
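
For this local example that’s simply:

user@computer:$ rm -rf output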

Result

Once you’ve got something similar to the above and no errors, we can check out the result.

user@computer:$ cat output/*

a:3:{s:10:"first_name";s:5:"Hello";s:9:"last_name";s:5:"World";s:11:"middle_name";s:5:"Jason";}

There we go: the serialized form of our modified PHP array $a.  As one last sketch, here’s how a downstream PHP process might read this output back in.
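
This is a minimal sketch, assuming the local file output shown above and Hadoop’s default part-* file naming:

<?php
// Walk every part file the reducer wrote and unserialize each record.
foreach (glob('output/part-*') as $file) {
    $lines = file($file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    if ($lines === false) {
        continue; // unreadable file, skip it
    }
    foreach ($lines as $line) {
        $a = unserialize(trim($line));
        if ($a !== false) {
            print_r($a); // do something useful with each record here
        }
    }
}
?>

That’s all there is to it.  Now, go forth and Hadoop.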

32 thoughts on “Using Hadoop And PHP”

  1. GodLikeMouse (Post author)

    Hi Chum, verify that your input and output files are being populated correctly. You can also just pass static values along the chain to verify that the entire map reduce system is working as it should.

  2. Chum

    Update:

    Re-“installed” Hadoop and compared what I did wrong the first time. Apparently I may have used root sometimes when starting/stopping my daemons (which I may have been oblivious of doing), and that’s what led to my Hadoop getting all whacked (since I started chowning and deleting stuff in tmp and in the DFS, that may have caused it to just go haywire) :3

    But your mapred example still doesn’t output anything, though (the task was successful, btw).

  3. Chum

    I see, okay I’ll try to be as detailed as possible:
    First off, I set up my Hadoop based on the configs shown here:
    http://hadoop.apache.org/docs/current/single_node_setup.html
    following the pseudo-distributed operation. Then, to try using the mapper and reducer, I followed your example (so my mapper and reducer code are the ones you provided in this post). When I tried to run your example, at first I got the ‘connection refused’ error, so I tried removing the configs I got from the hadoop.apache site (since in your example, you didn’t configure those files), and lo and behold, it worked! But I also noticed that when I tried using jps to see what tasks are running, only datanode and secondary datanode were shown. I then put back the configs, and namenode, jobtracker, and tasktracker were also showing up, but when I ran it again, I got ‘connection refused’ again. Sooo that’s pretty much where I am at right now.

    p.s.
    I did try executing the PHP scripts (as you suggested), and also checked the output files (when the mapred task did succeed), but when I tried to view the output:
    cat /path/to/output/*
    nothing was shown. There were 2 files generated: ‘_SUCCESS’ and ‘part-r-0000’, but both were empty. :(

  4. Chum

    Well, like I said, I was doing your example above :3
    The error popped out when I tried to execute: /path/to/hadoop jar /path/to/hadoop/contrib/streaming/jar_file -mapper /path/to/mapper.php -reducer /path/to/reduce.php -input /path/to/input/* -output /path/to/output.

    But to update my question, I did forget to run start-mapred.sh (ultra-noob here, haha) and I also added some configs in my hadoop/conf/core-site.xml and mapred-site.xml. I also noticed that the connection refused error doesn’t show when I comment out the configs I placed, but then I can’t access the web UI; when I keep the configs, I get the connection refused error but can access the web UI.
    Is there something wrong with my whole configuration, or am I doing something not right?

    1. GodLikeMouse (Post author)

      Hi Chum, it’s a bit hard to determine the actual cause of your issue without having any idea of what the code is actually doing, so here’s what I suggest. First pull the functional code out of the PHP daemon (just the PHP code itself) and execute it on the command line on its own. Once you have that working correctly, you should be able to move it into the daemon with little to no issue.

  5. Bhavani G

    I am new to Hadoop. My question is: can we make use of the Hadoop framework for “Reporting”, which deals with huge data?

  6. Chum

    Hi! I’m new to all this (literally) and I was following your example and everything was going well… until I tried running the command ../hadoop jar … -output output, where I keep getting a connection refused error. With only the internet as my resource (and reference), I really can only go so far to solve it. What might I be doing wrong? (Do tell me if I need to provide any needed information.)

  7. GodLikeMouse (Post author)

    Sounds like it may be a permission issue. You may need to promote the user under which Hadoop is running to make sure it has access to all the resources you need. If you need to get the task id of the currently executing PHP script under Hadoop, you could use getmypid().

  8. Lewis

    I’m not able to get the file paths; each mapper writes to its own Hadoop machine. How can I get the mapred task id? Also, HDFS doesn’t have access to files written by the mapper unless they are moved to HDFS.

  9. GodLikeMouse (Post author)

    I don’t see why not. You could write the file paths out, then decide when/what pass you want to handle the files themselves. You could do this by setting a flag that determines when to handle the files in the reducer.

  10. Lewis

    Is there a way to write to a file through the mapper and view that file later? I want all the mappers to write to the same file, or at least to be able to pass all the file paths to the reducer. Is this possible?

  11. ee

    With the code you have, is it possible to create session variables in the mapper and pass them to the reducer?

  12. GodLikeMouse (Post author)

    I see the problem in the last array element:

    i:4;s:11:"unveiled88";

    i:4 = index 4
    s:11 = string length of value

    but "unveiled88" is 10 characters, not 11. I also see that it jumped to the next line, which could actually mean the value is "unveiled88\n". Looks like the "\n" is not being preserved. I would make sure the "\n" doesn’t get in there to start with; use trim(), for example. I believe that’s what’s killing your unserialize() call.

  13. ee

    After it gets serialized in the mapper, it looks like this:
    a:5:{i:0;s:6:"secret";i:1;s:3:"liv";i:2;s:6:"galaxi";i:3;s:6:"unveil";i:4;s:11:"unveiled88
    ";}

  14. GodLikeMouse (Post author)

    Sorry, I should have been more specific: can you please post the actual value of the serialized array here? It should, for example, look like:

    a:3:{s:10:"first_name";s:5:"Hello";s:9:"last_name";s:5:"World";s:11:"middle_name";s:5:"Jason";}

  15. ee

    The mapper is executed before the reducer. So before I serialize my array in mapper.php, I print the array to a file and it is correct. My reducer.php code is the same as yours.

    My mapper.php code has the following structure:

    assign file values to the array $a
    ….
    echo serialize($a), PHP_EOL;
    Finish

    As for “Make sure your serialized array is not being changed by anything”: I don’t understand where it can get changed.

  16. GodLikeMouse (Post author)

    This will happen if the data being serialized changes before you unserialize it. For example, if you have a string “hello world” and you serialize it, then you remove a character from the serialized string, then try to unserialize the string, you’ll get the same error. Make sure your serialized array is not being changed by anything. I would suggest writing the pre and post values out to a log or text file to verify.

  17. ee

    Can I ask you something else? I want to import a file. I have tokenized the file and inserted it into the array $a. So I am doing everything the same, but instead of two elements in the array $a I have, let’s say, 1000 elements. But I get an error in the reducer:

    PHP Notice: unserialize(): Error at offset 0 of 3 bytes in /home/*****[removed]/Desktop/script/reducer.php on line 11

    Do you know why it gives me this error?

  18. ee

    After you untar Hadoop, you don’t need to install it? I followed your example but it seems that Hadoop is not installed.

