life is a rum go guv’nor, and that’s the truth

Getting started with Hadoop on Amazon’s Elastic MapReduce

After playing with Hadoop a bit in the past, I’m now trying out some things on Amazon’s Elastic MapReduce.

I signed up for a new AWS account and ran their sample LogAnalyzer Job Flow using the AWS console. That was easy enough. Next I attempted to run the same sample from the command line using the Amazon Elastic MapReduce Ruby Client.

Note: The Ruby Client README turns out to be very helpful.

Next I downloaded the source and looked at it. It seemed simple enough. I noticed that this sample uses a library called Cascading, which appears to be a way to simplify common job flow tasks.

After adding the elastic-mapreduce app to my path and setting up my credentials file, I ran:

elastic-mapreduce --create --jar s3n://elasticmapreduce/samples/cloudfront/logprocessor.jar --args "-input,s3n://elasticmapreduce/samples/cloudfront/input,-output,s3n://,-start,any,-end,2010-09-07-21,-timeBucket,300,-overallVolumeReport"

It produced:

INFO Exception Retriable invalid response returned from RunJobFlow: {"Error"=>{"Details"=>#<SocketError: getaddrinfo: nodename nor servname provided, or not known>, "Code"=>"InternalFailure", "Type"=>"Sender"}} while calling RunJobFlow on Amazon::Coral::ElasticMapReduceClient, retrying in 3.0 seconds.

After some poking around, I realized that I had specified "west-1" as my region when it should have been "us-west-1". I’m guessing this made the client try to contact a non-existent server, hence the getaddrinfo error.
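Since the error message gave no hint that the region was the problem, a quick shape check on the region string would have saved me some time. Here is a minimal sketch; check_region is a hypothetical helper of my own, not something the Ruby client provides, and it only checks the general "xx-name-N" shape rather than a list of real regions.

```shell
#!/bin/sh
# Hypothetical helper, not part of the Ruby client: checks that a region
# string has the full "xx-name-N" shape (e.g. us-west-1) before it goes
# into the credentials file. Shape check only, not a list of real regions.
check_region() {
  case "$1" in
    [a-z][a-z]-[a-z]*-[0-9]) echo "ok" ;;
    *) echo "bad" ;;
  esac
}

check_region "us-west-1"   # prints "ok"
check_region "west-1"      # prints "bad": the "us-" prefix is missing
```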

So now, my jobs started, but failed immediately. I logged into the AWS console and clicked on one of the failed job flows to see the reason for the failure (Last State Change Reason):

The given SSH key name was invalid

Googling found:

That confused me at first, but then I went ahead and followed the link (while logged in) and did what it said. (Amazing how that works sometimes :-) ) It prompted me to create a new key and assign it a name.
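For reference, here is a rough sketch of what the credentials file looks like once the key pair name is filled in. The key names are the ones I understand the Ruby client's README to describe, so double-check them there; every value is a placeholder, and the script writes to credentials.json.example by default so it cannot clobber a real credentials file.

```shell
#!/bin/sh
# Writes a template credentials file. All values are placeholders, and the
# key names are assumed from the client's README; verify them before use.
CRED_FILE="${CRED_FILE:-credentials.json.example}"

cat > "$CRED_FILE" <<'EOF'
{
  "access_id": "YOUR_AWS_ACCESS_KEY_ID",
  "private_key": "YOUR_AWS_SECRET_ACCESS_KEY",
  "keypair": "my-example-keypair",
  "key-pair-file": "/path/to/my-example-keypair.pem",
  "log_uri": "s3n://my-example-bucket/logs",
  "region": "us-west-1"
}
EOF

echo "Wrote template to $CRED_FILE"
```

The two entries that bit me are "region" (it needs the full name, such as us-west-1) and "keypair" (it has to match the name of a key pair that exists in that region).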

After I had generated the key and put its name in the credentials.json, things worked like a charm. It turns out that if you run a job from scratch, it has to fire up an EC2 instance in order to run the job, and that can take a while. To avoid paying that startup time on every run, you can create a job flow that stays alive and waits for work:

elastic-mapreduce --create --alive --log-uri s3://my-example-bucket/logs

This is mentioned in the README.TXT.
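Once a job flow is alive, the same jar can be submitted to it as a step instead of creating a new job flow each time. The following is a sketch under a few assumptions: the job flow ID and output location are placeholders, the run wrapper is my own addition so the commands are only printed by default, and the exact form of the terminate command should be checked against the README.

```shell
#!/bin/sh
# With DRY_RUN=1 (the default here) commands are printed, not executed.
# Set DRY_RUN=0 to actually call the client.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

JOBFLOW_ID="j-XXXXXXXXXXXX"               # placeholder: use the ID printed by --create
OUTPUT="s3n://my-example-bucket/output"   # placeholder output location

# Find the ID and state of the waiting job flow.
run elastic-mapreduce --list --active

# Add the LogAnalyzer jar as a step on the running job flow.
run elastic-mapreduce --jobflow "$JOBFLOW_ID" \
  --jar s3n://elasticmapreduce/samples/cloudfront/logprocessor.jar \
  --args "-input,s3n://elasticmapreduce/samples/cloudfront/input,-output,$OUTPUT,-start,any,-end,2010-09-07-21,-timeBucket,300,-overallVolumeReport"

# Shut the job flow down when finished; an --alive job flow keeps running
# (and billing) until it is terminated.
run elastic-mapreduce --terminate --jobflow "$JOBFLOW_ID"
```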

My next steps are to:

  1. Modify the job flow and run that job flow.
  2. Run the job flow locally.
  3. Debug the MapReduce portion of the job flow.
