Tuesday, October 6, 2015

Hadoop Word Count Example using Maven and Eclipse

INTRODUCTION


Hadoop
  • Hadoop is a framework written entirely in Java.
  • Created by Doug Cutting (then at Yahoo!) and Mike Cafarella.
  • Hadoop 0.1.0 was released in April 2006.
  • It is an Open source project of the Apache Software Foundation.
  • Designed for distributed storage and processing.
  • It processes very large data sets on clusters built from commodity hardware.
  • The Hadoop framework is useful
    • When you must process lots of unstructured data.
    • When your processing can easily be made parallel.
    • When running batch jobs is acceptable.
    • When you have access to lots of cheap hardware.

Map-Reduce
  • It is the heart of Hadoop.
  • It is a programming model for data processing.
  • Hadoop can run MapReduce programs written in various languages (Java, Ruby, Python, etc.).
  • MapReduce programs are inherently parallel.
  • Works in a master-slave model.
  • Mapper Program
    • Performs filtering and sorting.
  • Reducer Program
    • Performs a summary operation.
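
For word count, the data flow looks like the sketch below (a simplified view; the framework shuffles and groups the map output by key between the two phases):

Input line:    Hello World Bye World
Map output:    (Hello,1) (World,1) (Bye,1) (World,1)
Shuffle/sort:  (Bye,[1]) (Hello,[1]) (World,[1,1])
Reduce output: (Bye,1) (Hello,1) (World,2)
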
HDFS (Hadoop Distributed File System)
  • Java-based file system to store a large volume of data.
  • Scales to around 200 PB of storage on a single cluster of 4,500 servers.
  • Supports close to a billion files and blocks.
  • Access
    • Java API
    • Python/C for Non-Java Applications
    • Web GUI through HTTP
    • FS Shell - shell-like commands that interact directly with HDFS (see the example after this list)
  • Features:
    • HDFS can handle large data sets.
    • Since HDFS deals with large-scale data, it supports a multitude of machines.
    • HDFS provides a write-once-read-many access model.
    • HDFS is built in Java, making it portable across various platforms.
    • Fault tolerance and availability are high.
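
For example, the FS Shell can list and read HDFS content (the paths below are illustrative):
$ bin/hadoop fs -ls /user/ranjith
$ bin/hadoop fs -cat /user/ranjith/mapreduce/input/file01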

SOFTWARE & TOOLS

  1. Eclipse (Mars2)
  2. Maven
  3. Java 1.6

USE CASE

Count the occurrences of each word across the input files of a directory (Word Count).

IMPLEMENTATION

Step 1

Create a Maven project in Eclipse and add the dependencies below to pom.xml:
<dependency>
 <groupId>jdk.tools</groupId>
 <artifactId>jdk.tools</artifactId>
 <scope>system</scope>
 <version>1.6</version>
 <systemPath>${JAVA_HOME}/lib/tools.jar</systemPath>
</dependency>

<dependency>
 <groupId>org.apache.hadoop</groupId>
 <artifactId>hadoop-common</artifactId>
 <version>2.7.1</version>
</dependency>

<dependency>
 <groupId>org.apache.hadoop</groupId>
 <artifactId>hadoop-core</artifactId>
 <version>1.2.1</version>
</dependency>
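
Optionally, pin the compiler level to match Java 1.6. This is a minimal sketch using the standard maven-compiler-plugin (adjust the plugin version to your setup):
<build>
 <plugins>
  <plugin>
   <groupId>org.apache.maven.plugins</groupId>
   <artifactId>maven-compiler-plugin</artifactId>
   <configuration>
    <source>1.6</source>
    <target>1.6</target>
   </configuration>
  </plugin>
 </plugins>
</build>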

Step 2

Create a Mapper class that extends org.apache.hadoop.mapreduce.Mapper:
package jbr.hadoopex;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
 // Reusable Writable instances, to avoid allocating new objects per record
 private final static IntWritable one = new IntWritable(1);
 private Text word = new Text();

 @Override
 public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
   // Tokenize the input line and emit (word, 1) for every token
   String line = value.toString();
   StringTokenizer tokenizer = new StringTokenizer(line);
   while (tokenizer.hasMoreTokens()) {
     word.set(tokenizer.nextToken());
     context.write(word, one);
   }
 }
}

Step 3

Create a Reducer class that extends org.apache.hadoop.mapreduce.Reducer. Note that the reduce method must take an Iterable (not an Iterator); otherwise it does not override the framework's reduce method and is never called:
package jbr.hadoopex;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
 @Override
 public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
   // Sum the counts emitted for this word and write the total
   int sum = 0;
   for (IntWritable value : values) {
     sum += value.get();
   }
   context.write(key, new IntWritable(sum));
 }
}

Step 4

Create a driver program that configures the job, wires in the Mapper and Reducer, and submits it:
package jbr.hadoopex;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

 public static void main(String[] args) throws Exception {
   Configuration conf = new Configuration();

   // Job.getInstance replaces the deprecated Job(Configuration, String) constructor
   Job job = Job.getInstance(conf, "wordcount");
   job.setJarByClass(WordCount.class);

   // Output key/value types produced by the reducer
   job.setOutputKeyClass(Text.class);
   job.setOutputValueClass(IntWritable.class);

   job.setMapperClass(WordCountMapper.class);
   job.setReducerClass(WordCountReducer.class);

   job.setInputFormatClass(TextInputFormat.class);
   job.setOutputFormatClass(TextOutputFormat.class);

   // args[0] is the input path in HDFS; args[1] is the output path, which must not exist yet
   FileInputFormat.addInputPath(job, new Path(args[0]));
   FileOutputFormat.setOutputPath(job, new Path(args[1]));

   // Block until the job finishes and exit non-zero on failure
   System.exit(job.waitForCompletion(true) ? 0 : 1);
 }
}

BUILD & DEPLOYMENT

  1. Build the jar containing the Mapper, Reducer & driver classes
  2. Copy the jar to the machine you submit jobs from (the jar stays on the local filesystem; 'hadoop jar' ships it to the cluster)
  3. Create an input folder in HDFS (/user/ranjith/mapreduce/input/)
  4. Create one or more files with some words and copy them to the input folder
  5. Do not create the output folder (/user/ranjith/mapreduce/output/) in advance; the Map-Reduce job creates it and fails if it already exists (see ERROR 4 below). These steps are sketched after this list.
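
A minimal sketch of these steps from the shell (mvn writes the jar under target/; this post refers to it as WordCount.jar):
$ mvn clean package
$ bin/hadoop dfs -mkdir -p /user/ranjith/mapreduce/input
$ bin/hadoop dfs -put file01 file02 /user/ranjith/mapreduce/input/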

INPUT

  1. Create two files under input folder
  2. Check the files
$ bin/hadoop dfs -ls /user/ranjith/mapreduce/input/
You should see the files below:
/user/ranjith/mapreduce/input/file01
/user/ranjith/mapreduce/input/file02
  3. Check the content of file01
$ bin/hadoop dfs -cat /user/ranjith/mapreduce/input/file01
File01 content is:
Hello World Bye World
  4. Check the content of file02
$ bin/hadoop dfs -cat /user/ranjith/mapreduce/input/file02
File02 content is:
Hello Hadoop Goodbye Hadoop

RUN

Run the command below to execute the job.
> hadoop jar WordCount.jar jbr.hadoopex.WordCount /user/ranjith/mapreduce/input /user/ranjith/mapreduce/output

  • hadoop - launcher script for Hadoop programs and tools
  • jar - tells Hadoop to run a program packaged as a jar file
  • jbr.hadoopex.WordCount - the driver (main) class
  • /user/ranjith/mapreduce/input - input directory in HDFS (every file in it is processed)
  • /user/ranjith/mapreduce/output - output location in HDFS
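
Once the job completes, read the result from the part file the reducer writes (with the default TextOutputFormat, the first reducer's output is named part-r-00000):
$ bin/hadoop dfs -cat /user/ranjith/mapreduce/output/part-r-00000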

OUTPUT


Bye 1
Goodbye 1
Hadoop 2
Hello 2
World 2

ERRORS & SOLUTIONS

Below are some of the errors I encountered while developing and deploying the Hadoop program, along with a solution for each. I hope they are helpful.

ERROR 1

ERROR:
> hadoop WordCount.jar jbr.hadoopex.WordCount /ranjith/learnings/mapreduce/testdata/input/emp.txt /hadoop/user/<USERNAME>/ranjith/learnings/mapreduce/testdata/output
Error: Could not find or load main class WordCount.jar

REASON
The 'jar' keyword was omitted.

SOLUTION
Add the 'jar' keyword after 'hadoop':
hadoop jar WordCount.jar jbr.hadoopex.WordCount /ranjith/learnings/mapreduce/testdata/input/emp.txt /hadoop/user/<USERNAME>/ranjith/learnings/mapreduce/testdata/output

ERROR 2

ERROR:
> hadoop jar WordCount.jar jbr.hadoopex.WordCount /ranjith/learnings/mapreduce/testdata/input/emp.txt /hadoop/user/<USERNAME>/ranjith/learnings/mapreduce/testdata/output
15/10/06 05:33:24 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
15/10/06 05:33:24 WARN mapreduce.JobSubmitter: No job jar file set.  User classes may not be found. See Job or Job#setJar(String).
15/10/06 05:33:24 INFO mapreduce.JobSubmitter: Cleaning up the staging area /user/<USERNAME>/.staging/job_1443059102069_45474
15/10/06 05:33:24 WARN security.UserGroupInformation: PriviledgedActionException as:<USERNAME> (auth:SIMPLE) cause:org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://ranjith/learnings/mapreduce/testdata/input/emp.txt
Exception in thread "main" org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://ranjith/learnings/mapreduce/testdata/input/emp.txt
       at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:321)
       at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:264)
       at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:385)
       at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:589)
       at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:606)
       at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:490)
       at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1295)
       at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1292)
       at java.security.AccessController.doPrivileged(Native Method)
       at javax.security.auth.Subject.doAs(Subject.java:422)
       at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
       at org.apache.hadoop.mapreduce.Job.submit(Job.java:1292)
       at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1313)
       at jbr.hadoopex.WordCount.main(WordCount.java:68)
       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
       at java.lang.reflect.Method.invoke(Method.java:483)
       at org.apache.hadoop.util.RunJar.main(RunJar.java:212)



SOLUTION
The input path was resolved inside HDFS, where the file does not exist. Copy the input file into HDFS instead of reading it from the local filesystem.
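
For example (destination taken from the failing command; adjust to your layout):
$ bin/hadoop dfs -mkdir -p /ranjith/learnings/mapreduce/testdata/input
$ bin/hadoop dfs -put emp.txt /ranjith/learnings/mapreduce/testdata/input/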

ERROR 3

ERROR:
> hadoop jar WordCount.jar jbr.hadoopex.WordCount hdfs://ranjith/mapreduce/input/words.txt hdfs://ranjith/mapreduce/output
Exception in thread "main" java.lang.IllegalArgumentException: java.net.UnknownHostException: ranjith
       at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:373)
       at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:258)
       at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:153)
       at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:628)
       at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:574)
       at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:147)
       at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2596)
       at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
       at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
       at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
       at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
       at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
       at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.addInputPath(FileInputFormat.java:518)
       at jbr.hadoopex.WordCount.main(WordCount.java:65)
       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
       at java.lang.reflect.Method.invoke(Method.java:483)
       at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Caused by: java.net.UnknownHostException: ranjith
       ... 19 more


SOLUTION
In an hdfs:// URI the first component after the scheme is treated as the NameNode hostname, so 'ranjith' was parsed as a host. Provide the path without the scheme (or with a full hdfs://<namenode-host>:<port>/ authority):
hadoop jar WordCount.jar jbr.hadoopex.WordCount /user/<USERNAME>/ranjith/mapreduce/input/words.txt /user/<USERNAME>/ranjith/mapreduce/output

ERROR 4

ERROR:
> hadoop jar WordCount.jar jbr.hadoopex.WordCount /user/<USERNAME>/ranjith/mapreduce/input/words.txt /user/<USERNAME>/ranjith/mapreduce/output
15/10/06 05:49:22 WARN security.UserGroupInformation: PriviledgedActionException as:<USERNAME>(auth:SIMPLE) cause:org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://user/<USERNAME>/ranjith/mapreduce/output already exists
Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://user/<USERNAME>/ranjith/mapreduce/output already exists
       at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:146)
       at org.apache.hadoop.mapreduce.JobSubmitter.checkSpecs(JobSubmitter.java:554)
       at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:430)
       at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1295)
       at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1292)
       at java.security.AccessController.doPrivileged(Native Method)
       at javax.security.auth.Subject.doAs(Subject.java:422)
       at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
       at org.apache.hadoop.mapreduce.Job.submit(Job.java:1292)
       at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1313)
       at jbr.hadoopex.WordCount.main(WordCount.java:68)
       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
       at java.lang.reflect.Method.invoke(Method.java:483)
       at org.apache.hadoop.util.RunJar.main(RunJar.java:212)


SOLUTION
I had created the output directory before running the program. That is not required: Map-Reduce creates the output directory itself and fails if it already exists. Delete the output directory from HDFS and run again.
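
For example, remove it with the FS Shell (path taken from the failing command):
$ bin/hadoop dfs -rm -r /user/<USERNAME>/ranjith/mapreduce/output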

ERROR 5

ERROR:
15/10/06 05:58:18 INFO mapreduce.Job: Task Id : attempt_1443059102069_45567_m_000000_0, Status : FAILED
Error: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class jbr.hadoopex.WordCountMapper not found
       at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2047)
       at org.apache.hadoop.mapreduce.task.JobContextImpl.getMapperClass(JobContextImpl.java:196)
       at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:742)
       at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
       at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
       at java.security.AccessController.doPrivileged(Native Method)
       at javax.security.auth.Subject.doAs(Subject.java:422)
       at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
       at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: java.lang.ClassNotFoundException: Class jbr.hadoopex.WordCountMapper not found
       at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1953)
       at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2045)
       ... 8 more

SOLUTION
The job jar was not set, so the cluster nodes could not find the Mapper class. Add the line below in the main program:
job.setJarByClass(WordCount.class);

Hope this example helps you create a Map-Reduce program on the Hadoop framework. Please share your comments below.
Happy Knowledge Sharing!!!
