Using Hadoop And PHP

Getting Started

So first things first.  If you haven’t used Hadoop before, you’ll need to download a Hadoop release and make sure you have both Java and PHP installed.  To download Hadoop, head over to:

http://hadoop.apache.org/common/releases.html

Click on download a release and choose a mirror; I suggest the most recent stable release.  Once you’ve downloaded Hadoop, unzip it:

tar xpf hadoop-0.20.2.tar.gz

I like to create a symlink to the hadoop-<release> directory to make things easier to manage:

ln -s hadoop-0.20.2 hadoop

Now you should have everything you need to start creating a Hadoop PHP job.

Creating The Job

For this example I’m going to create a simple Map/Reduce job for Hadoop.  Let’s start by understanding what we want to happen.

  1. We want to read from an input source and emit records – this is our mapper
  2. We want to do something with what the mapper emitted – this is our reducer (see the pipeline sketch below)
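
If you’ve never seen Hadoop streaming before, a useful mental model (only a rough approximation of what Hadoop actually does, with sort standing in for the shuffle phase) is this shell pipeline:

cat input/* | ./mapper.php | sort | ./reducer.php > output

Hadoop runs the mapper over the input, sorts what the mapper wrote to STDOUT, and feeds the result to the reducer on STDIN.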

At the root of your development directory, let’s create another directory called script.  This is where we’ll store our PHP mapper and reducer files.

ls -a

.
..
hadoop-0.20.2
hadoop-0.20.2.tar.gz
hadoop

mkdir script

Now let’s begin creating our mapper script in PHP.  Go ahead and create a PHP file called mapper.php under the script directory:

touch script/mapper.php

Now let’s look at the basic structure of a PHP mapper.

[codesyntax lang="php"]

#!/usr/bin/php
<?php
//the mapper can get its data from anywhere: input files, database queries, SOAP calls, etc.
//for this example I'm going to create a simple PHP associative array.
$a = array(
    'first_name' => 'Hello',
    'last_name'  => 'World'
);
//it's important to note that anything you send to STDOUT goes to the reducer, and from there to the job's output.
//it's also important to end everything you write to STDOUT with PHP_EOL; this will save you a lot of pain.
echo serialize($a), PHP_EOL;
?>

[/codesyntax]

So this example is extremely simple.  Create a simple associative array and serialize it.  Now onto the reducer.  Create a PHP file in the script directory called reducer.php.

touch script/reducer.php

Now let’s take a look at the layout of a reducer.

[codesyntax lang="php"]

#!/usr/bin/php
<?php
//remember when I said anything sent to STDOUT in our mapper would go to the reducer?
//well, now we read from STDIN to get the result of our mapper.
//iterate over every line of output from our mapper
while (($line = fgets(STDIN)) !== false) {
    //remove leading and trailing whitespace, just in case :)
    $line = trim($line);
    //now recreate the array we serialized in our mapper
    $a = unserialize($line);
    //now we do whatever we need to with the data: write it out again so another
    //process can pick it up, send it to the database, make a SOAP call, whatever.
    //in this example, just change it a little and write it back out.
    $a['middle_name'] = 'Jason';
    //do not forget the PHP_EOL
    echo serialize($a), PHP_EOL;
}//end while
?>

[/codesyntax]

So now we have a very simple mapper and reducer ready to go.
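
Before involving Hadoop at all, you can sanity-check the pair straight from the command line. Make the scripts executable, then pipe the mapper into the reducer (this is essentially what the streaming jar will do for us):

chmod +x script/mapper.php script/reducer.php
script/mapper.php | script/reducer.php

a:3:{s:10:"first_name";s:5:"Hello";s:9:"last_name";s:5:"World";s:11:"middle_name";s:5:"Jason";}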

Execution

So now let’s run it and see what happens.  But first, a little prep work.  We need to specify the input directory that will be used when the job runs.

mkdir input
touch input/conf

Ok, that was difficult.  We have an input directory and we’ve created an empty conf file.  The empty conf file is just something for the mapper to get started with; for now, don’t worry about it.  Make sure you have JAVA_HOME set; it usually points at /usr.  You can set it by running export JAVA_HOME=/usr.  Now let’s run this bad boy.

hadoop/bin/hadoop jar hadoop/contrib/streaming/hadoop-0.20.2-streaming.jar -mapper script/mapper.php -reducer script/reducer.php -input input/* -output output

So here’s what the command does.  The first part runs the Hadoop launcher script.  The “jar” argument tells Hadoop to run a jar, in this case “hadoop/contrib/streaming/hadoop-0.20.2-streaming.jar”, the streaming jar that lets programs written in other languages, PHP included, act as mappers and reducers.  Next we pass the mapper and reducer scripts and specify the input and output directories.  If we wanted to, we could pass configuration information or files to the mapper through the input directory; the mapper would just use the same line-read structure that we used in the reducer to pick it up.  For this example we pass nothing, which is why the empty conf file is enough.  The output directory will contain the output of our reducer; if everything works correctly, it will contain the PHP-serialized form of our modified $a array.
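
If you did want the mapper to consume its input, it would read STDIN line by line exactly the way the reducer does. Here’s a minimal, hypothetical sketch (the example mapper above ignores its input entirely):

[codesyntax lang="php"]

#!/usr/bin/php
<?php
//Hadoop streaming feeds each line of the input files to the mapper on STDIN
while (($line = fgets(STDIN)) !== false) {
    //strip the trailing newline (and stray whitespace) before using the line
    $line = trim($line);
    if ($line === '') {
        continue;
    }
    //do something with the input line, then emit a record for the reducer
    echo serialize(array('config_line' => $line)), PHP_EOL;
}
?>

[/codesyntax]

If all goes well you should see something like this: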

hadoop/bin/hadoop jar hadoop/contrib/streaming/hadoop-0.20.2-streaming.jar -mapper script/mapper.php -reducer script/reducer.php -input input/* -output output

10/12/10 12:53:56 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
10/12/10 12:53:56 WARN mapred.JobClient: No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
10/12/10 12:53:56 INFO mapred.FileInputFormat: Total input paths to process : 1
10/12/10 12:53:56 INFO streaming.StreamJob: getLocalDirs(): [/tmp/hadoop-root/mapred/local]
10/12/10 12:53:56 INFO streaming.StreamJob: Running job: job_local_0001
10/12/10 12:53:56 INFO streaming.StreamJob: Job running in-process (local Hadoop)
10/12/10 12:53:56 INFO mapred.FileInputFormat: Total input paths to process : 1
10/12/10 12:53:56 INFO mapred.MapTask: numReduceTasks: 1
10/12/10 12:53:56 INFO mapred.MapTask: io.sort.mb = 100
10/12/10 12:53:57 INFO mapred.MapTask: data buffer = 79691776/99614720
10/12/10 12:53:57 INFO mapred.MapTask: record buffer = 262144/327680
10/12/10 12:53:57 INFO streaming.PipeMapRed: PipeMapRed exec [/root/./script/mapper.php]
10/12/10 12:53:57 INFO streaming.PipeMapRed: MRErrorThread done
10/12/10 12:53:57 INFO streaming.PipeMapRed: Records R/W=0/1
10/12/10 12:53:57 INFO streaming.PipeMapRed: MROutputThread done
10/12/10 12:53:57 INFO streaming.PipeMapRed: mapRedFinished
10/12/10 12:53:57 INFO mapred.MapTask: Starting flush of map output
10/12/10 12:53:57 INFO mapred.MapTask: Finished spill 0
10/12/10 12:53:57 INFO mapred.TaskRunner: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
10/12/10 12:53:57 INFO mapred.LocalJobRunner: Records R/W=0/1
10/12/10 12:53:57 INFO mapred.TaskRunner: Task 'attempt_local_0001_m_000000_0' done.
10/12/10 12:53:57 INFO mapred.LocalJobRunner:
10/12/10 12:53:57 INFO mapred.Merger: Merging 1 sorted segments
10/12/10 12:53:57 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 70 bytes
10/12/10 12:53:57 INFO mapred.LocalJobRunner:
10/12/10 12:53:57 INFO streaming.PipeMapRed: PipeMapRed exec [/root/./script/reducer.php]
10/12/10 12:53:57 INFO streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
10/12/10 12:53:57 INFO streaming.PipeMapRed: Records R/W=1/1
10/12/10 12:53:57 INFO streaming.PipeMapRed: MROutputThread done
10/12/10 12:53:57 INFO streaming.PipeMapRed: MRErrorThread done
10/12/10 12:53:57 INFO streaming.PipeMapRed: mapRedFinished
10/12/10 12:53:57 INFO mapred.TaskRunner: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
10/12/10 12:53:57 INFO mapred.LocalJobRunner:
10/12/10 12:53:57 INFO mapred.TaskRunner: Task attempt_local_0001_r_000000_0 is allowed to commit now
10/12/10 12:53:57 INFO mapred.FileOutputCommitter: Saved output of task 'attempt_local_0001_r_000000_0' to file:/root/output
10/12/10 12:53:57 INFO mapred.LocalJobRunner: Records R/W=1/1 > reduce
10/12/10 12:53:57 INFO mapred.TaskRunner: Task 'attempt_local_0001_r_000000_0' done.
10/12/10 12:53:57 INFO streaming.StreamJob:  map 100%  reduce 100%
10/12/10 12:53:57 INFO streaming.StreamJob: Job complete: job_local_0001
10/12/10 12:53:57 INFO streaming.StreamJob: Output: output

If you get errors where it’s complaining that the output directory already exists, just remove the output directory and try again:
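
rm -rf output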

Result

Once you’ve got something similar to the above and no errors, we can check out the result.

cat output/*

a:3:{s:10:"first_name";s:5:"Hello";s:9:"last_name";s:5:"World";s:11:"middle_name";s:5:"Jason";}

There we go, a serialized form of our modified PHP array $a.
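
If you want to prove the round trip to yourself, here’s a quick, hypothetical check script; it just unserializes each output record back into a PHP array (the file names assume the local-mode output layout shown above):

[codesyntax lang="php"]

#!/usr/bin/php
<?php
//read every reducer output file and turn each record back into a PHP array
foreach (glob('output/part-*') as $file) {
    foreach (file($file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
        var_dump(unserialize($line));
    }
}
?>

[/codesyntax]

That’s all there is to it.  Now, go forth and Hadoop.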

Comments

  1. Actually I want to use PHP as the front end and Hadoop as the back end. Is it possible to have a good GUI on the front end using PHP while connecting to Hadoop as the back end?

    1. Hadoop at the end of the day is Java. I don’t see any reason why you would not be able to write a front-end application using PHP technologies to execute the Hadoop commands on the back end.

    1. Hi Rajasree,

      You don’t need to do anything special to install PHP with Hadoop. Just install PHP as you normally would on the machine for your specific distribution, then pass Hadoop the PHP files when running jobs. Just make sure your PHP files have the correct run line comment as line 1 (for example: #!/usr/bin/php) and that the files are executable (chmod +x some-file.php).

    1. Hi RJ,

      More than likely; however, I don’t know of any publicly available. There are some books available online, though.

  2. Hi,

    I am getting the error below when I try to run your example. What am I missing?

    By reading the error, it looks like a main class is missing in the mapper.php file, but this is not Java code; it is PHP code, where no main class is required.

    Can you please help me?

    Exception in thread “main” java.lang.ClassNotFoundException: -mapper
    at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:323)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:268)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:274)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:205)

    1. Hi Sunil,

      It’s possible that the -mapper flag has changed in the version you are using, so verify the flags for your Hadoop version. If the flag is correct, then verify that the mapper.php file can be found from Hadoop’s execution directory.

  3. Hi, I am getting the following error; can you please help me…

    c:\map_reduce>hadoop jar c:/Hadoop/hadoop-1.1.0-SNAPSHOT/contrib/streaming/hadoop-streaming-1.1.0-SNAPSHOT.jar -mapper "php map_wordcount.php" -reducer "php reduce_wordcount.php" -input /input/wordcount/* -output /output/wordcount/

    packageJobJar: [] [/C:/Hadoop/hadoop-1.1.0-SNAPSHOT/lib/hadoop-streaming.jar] C:\Users\SHERY\AppData\Local\Temp\streamjob3626686751868886087.jar tmpDir=null
    14/04/20 14:09:07 INFO util.NativeCodeLoader: Loaded the native-hadoop library
    14/04/20 14:09:07 WARN snappy.LoadSnappy: Snappy native library not loaded
    14/04/20 14:09:07 INFO mapred.FileInputFormat: Total input paths to process : 43
    14/04/20 14:09:07 INFO streaming.StreamJob: getLocalDirs(): [c:\hadoop\HDFS\mapred\local]
    14/04/20 14:09:07 INFO streaming.StreamJob: Running job: job_201404200058_0051
    14/04/20 14:09:07 INFO streaming.StreamJob: To kill this job, run:
    14/04/20 14:09:07 INFO streaming.StreamJob: C:\Hadoop\hadoop-1.1.0-SNAPSHOT/bin/hadoop job -Dmapred.job.tracker=localhost:50300 -kill job_201404200058_0051
    14/04/20 14:09:07 INFO streaming.StreamJob: Tracking URL: http://anchorfree.net:50030/jobdetails.jsp?jobid=job_201404200058_0051
    14/04/20 14:09:08 INFO streaming.StreamJob: map 0% reduce 0%
    14/04/20 14:09:59 INFO streaming.StreamJob: map 100% reduce 100%
    14/04/20 14:09:59 INFO streaming.StreamJob: To kill this job, run:
    14/04/20 14:09:59 INFO streaming.StreamJob: C:\Hadoop\hadoop-1.1.0-SNAPSHOT/bin/hadoop job -Dmapred.job.tracker=localhost:50300 -kill job_201404200058_0051
    14/04/20 14:09:59 INFO streaming.StreamJob: Tracking URL: http://anchorfree.net:50030/jobdetails.jsp?jobid=job_201404200058_0051
    14/04/20 14:09:59 ERROR streaming.StreamJob: Job not successful. Error: # of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201404200058_0051_m_000000
    14/04/20 14:09:59 INFO streaming.StreamJob: killJob…
    Streaming Command Failed!

    Below is Link for Code:

    https://dl.dropboxusercontent.com/u/84598995/map_wordcount.php
    https://dl.dropboxusercontent.com/u/84598995/reduce_wordcount.php

    Thanks
    -Shahram

    1. Hi Shahram,

      It appears to be an error in your mapper. Have you tried executing the mapper PHP outside of Hadoop to ensure it is working correctly?

  4. Hi Chum, verify that your input and output files are being populated correctly. You can also just pass static values along the chain to verify that the entire map reduce system is working as it should.

  5. Update:

    Re-“installed” Hadoop and compared against what I did wrong the first time. Apparently I may have used root sometimes when starting/stopping my daemons (which I may have been oblivious to doing), and that’s what got my Hadoop all whacked (since I started chowning and deleting stuff in tmp and in the DFS, which may have caused it to just go haywire :3).

    But your mapred example still doesn’t output anything, though (the task was successful, btw).

  6. I see, okay I’ll try to be as detailed as possible:
    First off, I set up my Hadoop based on the configs shown here:
    http://hadoop.apache.org/docs/current/single_node_setup.html
    following the pseudo-distributed operation. Then, to try out the mapper and reducer, I followed your example (so my mapper and reducer code are the ones you provided in this post). When I tried to run your example, at first I got the ‘connection refused’ error, so I tried removing the configs I got from the hadoop.apache site (since in your example you didn’t configure those files), and lo and behold, it worked! But I also noticed that when I used jps to see what tasks were running, only the datanode and secondary namenode were shown. I then put the configs back, and the namenode, jobtracker, and tasktracker showed up too, but when I ran it again, I got ‘connection refused’ again. Sooo that’s pretty much where I am at right now.

    p.s.
    I did try executing the PHP scripts (as you suggested), and also checked the output files (when the mapred task did succeed), but when I tried to view the output:
    cat /path/to/output/*
    nothing was shown. There were 2 files generated, ‘_SUCCESS’ and ‘part-r-0000’, but both were empty. 🙁

  7. Well like I said, I was doing your example above :3
    The error popped out when I tried to execute: /path/to/hadoop jar /path/to/hadoop/contrib/streaming/jar_file -mapper /path/to/mapper.php -reducer /path/to/reduce.php -input /path/to/input/* -output /path/to/output.

    But to update my question, I did forget to run start-mapred.sh (ultra-noob here, haha) and also added some configs in my hadoop/conf/core-site.xml and mapred-site.xml. I also noticed that the connection refused error doesn’t show when I comment out the configs I added, but then I can’t access the web UI; when I keep the configs, I get the connection refused error but can access the web UI.
    Is there something wrong with my whole configuration, or am I doing something not right?

    1. Hi Chum, it’s a bit hard to determine the actual cause of your issue without any idea of what the code is doing, so here’s what I suggest. First pull the functional code out of the PHP daemon (just the PHP code itself) and execute it from the command line on its own. Once you have that working correctly, you should be able to move it back into the daemon with little to no issue.

  8. I am new to Hadoop. My question is: can we make use of the Hadoop framework for “reporting” that deals with huge data?

  9. Hi! I’m new to all this (literally) and I was following your example, and everything was going well… until I tried running the command ../hadoop jar … -output output, where I keep getting a connection refused error. With only the internet as my resource (and reference) I can really only go so far to solve it. What might I be doing wrong? (Do tell me if I need to provide any other information.)

  10. Sounds like it may be a permission issue. You may need to promote the user under which Hadoop is running to make sure it has access to all the resources you need. If you need to get the task id of the currently executing PHP script under Hadoop, you could use getmypid().

  11. I’m not able to get the file paths; each mapper writes to its own Hadoop machine. How can I get the mapred task ID? Also, HDFS doesn’t have access to files written by a mapper unless they are moved into HDFS.

  12. I don’t see why not. You could write the file paths out from the mapper, then decide when, and in what pass, you want to handle the files themselves. You could do this by setting a flag that determines when to handle the files in the reducer; see the sketch below.
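
    Something along these lines (a rough sketch; the 'type' and 'path' record fields are made up for illustration):

    [codesyntax lang="php"]

    #!/usr/bin/php
    <?php
    //reducer sketch: assume each mapper emitted records like
    //  serialize(array('type' => 'file_path', 'path' => '/some/where')), PHP_EOL
    while (($line = fgets(STDIN)) !== false) {
        $record = unserialize(trim($line));
        //the 'type' flag tells the reducer this record is a file path to handle
        if (is_array($record) && isset($record['type']) && $record['type'] === 'file_path') {
            echo 'handling file: ', $record['path'], PHP_EOL;
        }
    }
    ?>

    [/codesyntax]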

  13. Is there a way to write to a file through the mapper and view that file later? I want all the mappers to write to the same file, or at least to be able to pass all the file paths to the reducer. Is this possible?

  14. With the code you have, is it possible to create session variables in the mapper and pass them to the reducer?

  15. I see the problem in the last array element:

    i:4;s:11:”unveiled88″;

    i:4 = index 4
    s:11 = string length of the value

    but “unveiled88” is 10 characters, not 11. I also see that it jumped to the next line, which means the value is actually “unveiled88\n”, with the trailing newline included. I would make sure the “\n” doesn’t get in there to start with; use trim() on the value, for example. I believe that’s what’s killing your unserialize() call.
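
    For example, here’s a quick way to reproduce the failure (and the trim() fix) outside of Hadoop:

    [codesyntax lang="php"]

    #!/usr/bin/php
    <?php
    //a value with an embedded newline still serializes "correctly":
    //a:1:{i:4;s:11:"unveiled88\n";} where \n is a real newline character
    $raw = serialize(array(4 => "unveiled88\n"));
    //...but a line-oriented reader like fgets() splits the record at that newline,
    //so the reducer only ever sees the truncated first line, and unserialize() fails:
    $first_line = strtok($raw, "\n");
    var_dump(unserialize($first_line)); //bool(false), plus a PHP notice
    //trimming the value BEFORE serializing keeps the whole record on one line:
    $clean = serialize(array(4 => trim("unveiled88\n")));
    var_dump(unserialize($clean)); //array(1) { [4]=> string(10) "unveiled88" }
    ?>

    [/codesyntax]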

  16. After it gets serialized in the mapper it looks like this:
    a:5:{i:0;s:6:”secret”;i:1;s:3:”liv”;i:2;s:6:”galaxi”;i:3;s:6:”unveil”;i:4;s:11:”unveiled88
    “;}

  17. Sorry, I should have been more specific: can you please paste the actual value of the serialized array here? It should look something like:

    a:3:{s:10:”first_name”;s:5:”Hello”;s:9:”last_name”;s:5:”World”;s:11:”middle_name”;s:5:”Jason”;}

  18. The mapper is executed before the reducer. So before I serialize my array in mapper.php, I print the array to a file, and it is correct. My reducer.php code is the same as yours.

    My mapper.php code has the following structure:

    assign file values to the array $a
    ….
    echo serialize($a), PHP_EOL;
    Finish

    “Make sure your serialized array is not being changed by anything”
    I don’t understand where it could get changed.

  19. This will happen if the serialized data changes before you unserialize it. For example, if you serialize the string “hello world”, then remove a character from the serialized string and try to unserialize it, you’ll get the same error. Make sure your serialized array is not being changed by anything. I would suggest writing the pre and post values out to a log or text file to verify.

  20. Can I ask you something else? I want to import a file. I have tokenized the file and inserted it into the array $a. So I am doing everything the same, but instead of two elements in the array $a I have, let’s say, 1000 elements. But I get an error in the reducer:

    PHP Notice: unserialize(): Error at offset 0 of 3 bytes in /home/*****[removed]/Desktop/script/reducer.php on line 11

    Do you know why it gives me this error?

  21. After you untar Hadoop, do you not need to install it? I followed your example but it seems that Hadoop is not installed.
