Category Archives: Java

Getting Started With Windows Azure Blob Storage Using Java

The following is a quick-start guide to getting up and running with Windows Azure Blob storage using Java. The first thing you’ll need to do is set up a Windows Azure account, which you can do here: https://www.windowsazure.com

Next you’ll need to download the Azure SDK for Java, which is available here: https://www.windowsazure.com/en-us/develop/java/

Azure SDK Download

I found it easier to simply download the libraries and include the Azure jar file (microsoft-windowsazure-api-0.3.2.jar) directly.

Windows Azure Libraries

Next you’ll need to capture your Account Name and Account Key.

Windows Azure Dashboard

Windows Azure Keys

Now for the fun part: the code.  I’ve already done a lot of the heavy lifting for you (you’re welcome), so all you’ll really need to do is modify a few values and implement as needed.  The first thing to do is take the account name and key from above and build an Azure connection string like so:

// Replace your-account-name and your-account-key with your account values
public static final String storageConnectionString =
    "DefaultEndpointsProtocol=http;" +
    "AccountName=your-account-name;" +
    "AccountKey=your-account-key";

Now let’s create a client, which we’ll use to communicate with Azure Blob storage.
// Parse the connection string
CloudStorageAccount storageAccount = CloudStorageAccount.parse(storageConnectionString);
 
// Create the client
CloudBlobClient blobClient = storageAccount.createCloudBlobClient();
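
For reference, the examples in this post assume the following imports. The package names below match the 0.3.x SDK layout and may differ in other versions, so treat them as an assumption to check against the jar you downloaded.

// Imports assumed by the examples below (0.3.x SDK package layout)
import com.microsoft.windowsazure.services.core.storage.CloudStorageAccount;
import com.microsoft.windowsazure.services.blob.client.CloudBlobClient;
import com.microsoft.windowsazure.services.blob.client.CloudBlobContainer;
import com.microsoft.windowsazure.services.blob.client.CloudBlobDirectory;
import com.microsoft.windowsazure.services.blob.client.CloudBlockBlob;
import com.microsoft.windowsazure.services.blob.client.ListBlobItem;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;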

Now that we have a client we can begin performing actions like uploading, listing, and retrieving files.  Let’s start with listing.  You’ll need a root container name; this is essentially the name of the bucket that you’re going to be sticking your files in.  Now let’s list some files:
// Change this to your container name
String containerName = "your-container-name";
 
// Create the client
CloudBlobClient blobClient = storageAccount.createCloudBlobClient();
 
// Get a reference to the container
CloudBlobContainer container = blobClient.getContainerReference(containerName);
 
// Create the container if it does not exist
container.createIfNotExist();
 
// List them
for (ListBlobItem blobItem : container.listBlobs())
   System.out.println(blobItem.getUri());

So here’s what’s going on: we’re using the blob client we created to get a reference to the container (or root bucket), creating the container if it doesn’t exist, and generating a list of the blobs it holds.  Then we iterate over the blobs and display the URI for each.  Now let’s say we had a folder structure in that container, something to the effect of an “images” directory with image files inside it.  Here’s how you’d list those files:

// Change this to your container name
String containerName = "your-container-name";
 
// Create the client
CloudBlobClient blobClient = storageAccount.createCloudBlobClient();
 
// Get a reference to the container
CloudBlobContainer container = blobClient.getContainerReference(containerName);
 
// Create the container if it does not exist
container.createIfNotExist();
 
// Get a reference to the images directory
CloudBlobDirectory directory = container.getDirectoryReference("images/");
 
// List them
for (ListBlobItem blobItem : directory.listBlobs())
   System.out.println(blobItem.getUri());

The only real difference between this example and the previous one is whether we list the container or the directory; in this case we get a reference to the cloud directory and then list it.  Now that we can list container and directory contents, let’s upload some files:

// Change to your file to upload
File source = new File("path/to/your/file");
 
// Change this to your container name
String containerName = "your-container-name";
 
// Create the client
CloudBlobClient blobClient = storageAccount.createCloudBlobClient();
 
// Get a reference to the container
CloudBlobContainer container = blobClient.getContainerReference(containerName);
 
// Create the container if it does not exist
container.createIfNotExist();
 
// Upload the file into the "images" directory of the container
CloudBlockBlob blob = container.getBlockBlobReference("images/" + source.getName());
blob.upload(new FileInputStream(source), source.length());
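
One thing the snippet above glosses over is that the FileInputStream is never closed.  A minimal variation of the same upload, using only the calls already shown, would be:

// Upload the file, making sure the stream gets closed even if the upload fails
CloudBlockBlob blob = container.getBlockBlobReference("images/" + source.getName());
FileInputStream in = new FileInputStream(source);
try {
    blob.upload(in, source.length());
} finally {
    in.close();
}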

The above example will upload the source file into the images directory of your container. The only thing left to do now is download a file, and the pattern is complete.

// Where are we going to save the downloaded file?
File destination = new File("path/to/your/file");
 
// Change this to your container name
String containerName = "your-container-name";
 
// Get a reference to the container
CloudBlobContainer container = blobClient.getContainerReference(containerName);
 
// Create the container if it does not exist
container.createIfNotExist();
 
// Get a reference to the blob inside the "images" directory
CloudBlockBlob blob = container.getBlockBlobReference("images/" + destination.getName());
 
// Download it
blob.download(new FileOutputStream(destination));
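
To tie the pieces together, here’s a rough end-to-end sketch of the whole flow (connect, upload, list, download) under the same assumptions as above: the 0.3.x SDK, the imports listed earlier, and placeholder account, container, and file names that you’ll need to swap for your own.  Treat it as a starting point rather than production code.

public class BlobQuickStart {

    // Replace with your account values
    public static final String storageConnectionString =
        "DefaultEndpointsProtocol=http;" +
        "AccountName=your-account-name;" +
        "AccountKey=your-account-key";

    public static void main(String[] args) throws Exception {
        // Parse the connection string and create the client
        CloudStorageAccount storageAccount = CloudStorageAccount.parse(storageConnectionString);
        CloudBlobClient blobClient = storageAccount.createCloudBlobClient();

        // Get a reference to the container and create it if it does not exist
        CloudBlobContainer container = blobClient.getContainerReference("your-container-name");
        container.createIfNotExist();

        // Upload a local file into the "images" directory of the container
        File source = new File("path/to/your/file");
        CloudBlockBlob upload = container.getBlockBlobReference("images/" + source.getName());
        upload.upload(new FileInputStream(source), source.length());

        // List everything in the "images" directory
        CloudBlobDirectory directory = container.getDirectoryReference("images/");
        for (ListBlobItem blobItem : directory.listBlobs())
            System.out.println(blobItem.getUri());

        // Download the same file to a new local path
        File destination = new File("path/to/your/downloaded-file");
        CloudBlockBlob download = container.getBlockBlobReference("images/" + source.getName());
        download.download(new FileOutputStream(destination));
    }
}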

And that’s the whole enchilada.  Hopefully this helped you save a little bit of time and got you up to speed quickly…and I’m out.

Using Hadoop And PHP

Getting Started

So first things first.  If you haven’t used Hadoop before, you’ll need to download a Hadoop release and make sure you have Java and PHP installed.  To download Hadoop, head over to:

http://hadoop.apache.org/common/releases.html

Click on download a release and choose a mirror.  I suggest choosing the most recent stable release.  Once you’ve downloaded Hadoop, extract it.

user@computer:$ tar xpf hadoop-0.20.2.tar.gz

I like to create a symlink to the hadoop-<release> directory to make things easier to manage.

user@computer:$ ln -s hadoop-0.20.2 hadoop

Now you should have everything you need to start creating a Hadoop PHP job.

Creating The Job

For this example I’m going to create a simple Map/Reduce job for Hadoop.  Let’s start by understanding what we want to happen.

  1. We want to read from an input system – this is our mapper
  2. We want to do something with what we’ve mapped – this is our reducer

At the root of your development directory, let’s create another directory called script.  This is where we’ll store our PHP mapper and reducer files.

user@computer:$ ls

.
..
hadoop-0.20.2
hadoop-0.20.2.tar.gz
hadoop
user@computer:$ mkdir script

Now let’s begin creating our mapper script in PHP.  Go ahead and create a PHP file called mapper.php under the script directory.

user@computer:$ touch script/mapper.php

Now let’s look at the basic structure of a PHP mapper.

#!/usr/bin/php
<?php
//this can be anything from reading input from files, to retrieving database content, soap calls, etc.
//for this example I'm going to create a simple php associative array.
$a = array(
'first_name' => 'Hello',
'last_name' => 'World'
);
//it's important to note that anything you send to STDOUT becomes the mapper's output and gets passed on to the reducer.
//it's also important to note: do not forget to end everything you write to STDOUT with a PHP_EOL, this will save you a lot of pain.
echo serialize($a), PHP_EOL;
?>

So this example is extremely simple.  Create a simple associative array and serialize it.  Now onto the reducer.  Create a PHP file in the script directory called reducer.php.

user@computer:$ touch script/reducer.php

Now let’s take a look at the layout of a reducer.

#!/usr/bin/php
 
<?php
 
//Remember when I said anything written to STDOUT in our mapper would go to the reducer?
//Well, now we read from STDIN to get the result of our mapper.
//iterate all lines of output from our mapper
while (($line = fgets(STDIN)) !== false) {
    //remove leading and trailing whitespace, just in case 🙂
    $line = trim($line);
    //now recreate the array we serialized in our mapper
    $a = unserialize($line);
    //Now, we do whatever we need to with the data.  Write it out again so another process can pick it up,
    //send it to the database, soap call, whatever.  In this example, just change it a little and
    //write it back out.
    $a['middle_name'] = 'Jason';
    //do not forget the PHP_EOL
    echo serialize($a), PHP_EOL;
}//end while
?>

So now we have a very simple mapper and reducer ready to go.

Execution

So now let’s run it and see what happens.  But first, a little prep work.  We need to specify the input directory that will be used when the job runs.

user@computer:$ mkdir input
user@computer:$ touch input/conf

Ok, that was difficult.  We have an input directory and we’ve created an empty conf file.  The empty conf file is just something the mapper will use to get started; for now, don’t worry about it.  Now let’s run this bad boy.  Make sure you have JAVA_HOME set, which usually points to /usr.  You can set this by running export JAVA_HOME=/usr.

user@computer:$ hadoop/bin/hadoop jar hadoop/contrib/streaming/hadoop-0.20.2-streaming.jar -mapper script/mapper.php -reducer script/reducer.php -input input/* -output output

So here’s what the command does.  The first part runs the hadoop launcher script.  The “jar” argument tells Hadoop to run a jar, in this case “hadoop/contrib/streaming/hadoop-0.20.2-streaming.jar” (the streaming jar).  Next we pass the mapper and reducer scripts to the job and specify the input and output directories.  If we wanted to, we could pass configuration information or files to the mapper; we would just read it line by line, using the same structure we used in the reducer, and that is what would go in the input directory.  But for this example we pass nothing.  Finally, the output directory will contain the output of our reducer; if everything works out correctly, it will contain the PHP-serialized form of our modified $a array.  If all goes well you should see something like this:

user@computer:$ hadoop/bin/hadoop jar hadoop/contrib/streaming/hadoop-0.20.2-streaming.jar -mapper script/mapper.php -reducer script/reducer.php -input input/* -output output

10/12/10 12:53:56 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
10/12/10 12:53:56 WARN mapred.JobClient: No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
10/12/10 12:53:56 INFO mapred.FileInputFormat: Total input paths to process : 1
10/12/10 12:53:56 INFO streaming.StreamJob: getLocalDirs(): [/tmp/hadoop-root/mapred/local]
10/12/10 12:53:56 INFO streaming.StreamJob: Running job: job_local_0001
10/12/10 12:53:56 INFO streaming.StreamJob: Job running in-process (local Hadoop)
10/12/10 12:53:56 INFO mapred.FileInputFormat: Total input paths to process : 1
10/12/10 12:53:56 INFO mapred.MapTask: numReduceTasks: 1
10/12/10 12:53:56 INFO mapred.MapTask: io.sort.mb = 100
10/12/10 12:53:57 INFO mapred.MapTask: data buffer = 79691776/99614720
10/12/10 12:53:57 INFO mapred.MapTask: record buffer = 262144/327680
10/12/10 12:53:57 INFO streaming.PipeMapRed: PipeMapRed exec [/root/./script/mapper.php]
10/12/10 12:53:57 INFO streaming.PipeMapRed: MRErrorThread done
10/12/10 12:53:57 INFO streaming.PipeMapRed: Records R/W=0/1
10/12/10 12:53:57 INFO streaming.PipeMapRed: MROutputThread done
10/12/10 12:53:57 INFO streaming.PipeMapRed: mapRedFinished
10/12/10 12:53:57 INFO mapred.MapTask: Starting flush of map output
10/12/10 12:53:57 INFO mapred.MapTask: Finished spill 0
10/12/10 12:53:57 INFO mapred.TaskRunner: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
10/12/10 12:53:57 INFO mapred.LocalJobRunner: Records R/W=0/1
10/12/10 12:53:57 INFO mapred.TaskRunner: Task 'attempt_local_0001_m_000000_0' done.
10/12/10 12:53:57 INFO mapred.LocalJobRunner:
10/12/10 12:53:57 INFO mapred.Merger: Merging 1 sorted segments
10/12/10 12:53:57 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 70 bytes
10/12/10 12:53:57 INFO mapred.LocalJobRunner:
10/12/10 12:53:57 INFO streaming.PipeMapRed: PipeMapRed exec [/root/./script/reducer.php]
10/12/10 12:53:57 INFO streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
10/12/10 12:53:57 INFO streaming.PipeMapRed: Records R/W=1/1
10/12/10 12:53:57 INFO streaming.PipeMapRed: MROutputThread done
10/12/10 12:53:57 INFO streaming.PipeMapRed: MRErrorThread done
10/12/10 12:53:57 INFO streaming.PipeMapRed: mapRedFinished
10/12/10 12:53:57 INFO mapred.TaskRunner: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
10/12/10 12:53:57 INFO mapred.LocalJobRunner:
10/12/10 12:53:57 INFO mapred.TaskRunner: Task attempt_local_0001_r_000000_0 is allowed to commit now
10/12/10 12:53:57 INFO mapred.FileOutputCommitter: Saved output of task 'attempt_local_0001_r_000000_0' to file:/root/output
10/12/10 12:53:57 INFO mapred.LocalJobRunner: Records R/W=1/1 > reduce
10/12/10 12:53:57 INFO mapred.TaskRunner: Task 'attempt_local_0001_r_000000_0' done.
10/12/10 12:53:57 INFO streaming.StreamJob:  map 100%  reduce 100%
10/12/10 12:53:57 INFO streaming.StreamJob: Job complete: job_local_0001
10/12/10 12:53:57 INFO streaming.StreamJob: Output: output

If you get errors complaining that the output directory already exists, just remove the output directory (rm -rf output) and try again.

Result

Once you’ve got something similar to the above and no errors, we can check out the result.

user@computer:$ cat output/*

a:3:{s:10:"first_name";s:5:"Hello";s:9:"last_name";s:5:"World";s:11:"middle_name";s:5:"Jason";}

There we go, a serialized form of our modified PHP array $a.  That’s all there is to it.  Now, go forth and Hadoop.