Tuesday, October 6, 2015

Hadoop Word Count Example using Maven and Eclipse

INTRODUCTION


Hadoop
  • Hadoop is a framework written entirely in Java.
  • Created by Doug Cutting (then at Yahoo!) and Mike Cafarella.
  • Hadoop 0.1.0 was released in April 2006.
  • It is an Open source project of the Apache Software Foundation.
  • Designed for distributed storage and processing.
  • It processes very large data sets on clusters built from commodity hardware.
  • The Hadoop framework is useful
    • When you must process lots of unstructured data.
    • When your processing can easily be made parallel.
    • When running batch jobs is acceptable.
    • When you have access to lots of cheap hardware.

Map-Reduce
  • It is the heart of Hadoop.
  • It is a programming model for data processing.
  • Hadoop can run MapReduce programs written in various languages (Java, Ruby, Python, etc.).
  • MapReduce programs are inherently parallel.
  • Works in a master-slave model.
  • Mapper Program
    • Performs filtering and sorting.
  • Reducer Program
    • Performs a summary operation.
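
For word count, the data flow looks like the sketch below (a simplified view; the framework shuffles and groups the map output by key between the two phases):

Input line:    Hello World Bye World
Map output:    (Hello,1) (World,1) (Bye,1) (World,1)
Shuffle/sort:  (Bye,[1]) (Hello,[1]) (World,[1,1])
Reduce output: (Bye,1) (Hello,1) (World,2)
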
HDFS (Hadoop Distributed File System)
  • Java-based file system to store a large volume of data.
  • Scales to around 200 PB of storage on a single cluster of 4,500 servers.
  • Supports close to a billion files and blocks.
  • Access
    • Java API
    • Python/C for Non-Java Applications
    • Web GUI through HTTP
    • FS Shell - shell-like commands that interact directly with HDFS (see the example after this list)
  • Features:
    • HDFS can handle large data sets.
    • Since HDFS deals with large-scale data, it supports a multitude of machines.
    • HDFS provides a write-once-read-many access model.
    • HDFS is built in Java, making it portable across various platforms.
    • Fault tolerance and availability are high.
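
For example, the FS Shell can list and read HDFS content (the paths below are illustrative):
$ bin/hadoop fs -ls /user/ranjith
$ bin/hadoop fs -cat /user/ranjith/mapreduce/input/file01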

SOFTWARE & TOOLS

  1. Eclipse (Mars2)
  2. Maven
  3. Java 1.6

USE CASE

Count the occurrences of each word across the input files of a directory (Word Count).

IMPLEMENTATION

Step 1

Create a Maven project in Eclipse and add the dependencies below to pom.xml:
<dependency>
 <groupId>jdk.tools</groupId>
 <artifactId>jdk.tools</artifactId>
 <scope>system</scope>
 <version>1.6</version>
 <systemPath>${JAVA_HOME}/lib/tools.jar</systemPath>
</dependency>

<dependency>
 <groupId>org.apache.hadoop</groupId>
 <artifactId>hadoop-common</artifactId>
 <version>2.7.1</version>
</dependency>

<dependency>
 <groupId>org.apache.hadoop</groupId>
 <artifactId>hadoop-core</artifactId>
 <version>1.2.1</version>
</dependency>
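
Optionally, pin the compiler level to match Java 1.6. This is a minimal sketch using the standard maven-compiler-plugin (adjust the plugin version to your setup):
<build>
 <plugins>
  <plugin>
   <groupId>org.apache.maven.plugins</groupId>
   <artifactId>maven-compiler-plugin</artifactId>
   <configuration>
    <source>1.6</source>
    <target>1.6</target>
   </configuration>
  </plugin>
 </plugins>
</build>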

Step 2

Create a Mapper class that extends org.apache.hadoop.mapreduce.Mapper:
package jbr.hadoopex;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
 // Reusable Writable instances, to avoid allocating new objects per record
 private final static IntWritable one = new IntWritable(1);
 private Text word = new Text();

 @Override
 public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
   // Tokenize the input line and emit (word, 1) for every token
   String line = value.toString();
   StringTokenizer tokenizer = new StringTokenizer(line);
   while (tokenizer.hasMoreTokens()) {
     word.set(tokenizer.nextToken());
     context.write(word, one);
   }
 }
}

Step 3

Create a Reducer class that extends org.apache.hadoop.mapreduce.Reducer. Note that the reduce method must take an Iterable (not an Iterator); otherwise it does not override the framework's reduce method and is never called:
package jbr.hadoopex;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
 @Override
 public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
   // Sum the counts emitted for this word and write the total
   int sum = 0;
   for (IntWritable value : values) {
     sum += value.get();
   }
   context.write(key, new IntWritable(sum));
 }
}

Step 4

Create a driver program that configures the job, wires in the Mapper and Reducer, and submits it:
package jbr.hadoopex;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

 public static void main(String[] args) throws Exception {
   Configuration conf = new Configuration();

   // Job.getInstance replaces the deprecated Job(Configuration, String) constructor
   Job job = Job.getInstance(conf, "wordcount");
   job.setJarByClass(WordCount.class);

   // Output key/value types produced by the reducer
   job.setOutputKeyClass(Text.class);
   job.setOutputValueClass(IntWritable.class);

   job.setMapperClass(WordCountMapper.class);
   job.setReducerClass(WordCountReducer.class);

   job.setInputFormatClass(TextInputFormat.class);
   job.setOutputFormatClass(TextOutputFormat.class);

   // args[0] is the input path in HDFS; args[1] is the output path, which must not exist yet
   FileInputFormat.addInputPath(job, new Path(args[0]));
   FileOutputFormat.setOutputPath(job, new Path(args[1]));

   // Block until the job finishes and exit non-zero on failure
   System.exit(job.waitForCompletion(true) ? 0 : 1);
 }
}

BUILD & DEPLOYMENT

  1. Build the jar containing the Mapper, Reducer & driver classes
  2. Copy the jar to the machine you submit jobs from (the jar stays on the local filesystem; 'hadoop jar' ships it to the cluster)
  3. Create an input folder in HDFS (/user/ranjith/mapreduce/input/)
  4. Create one or more files with some words and copy them to the input folder
  5. Do not create the output folder (/user/ranjith/mapreduce/output/) in advance; the Map-Reduce job creates it and fails if it already exists (see ERROR 4 below). These steps are sketched after this list.
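
A minimal sketch of these steps from the shell (mvn writes the jar under target/; this post refers to it as WordCount.jar):
$ mvn clean package
$ bin/hadoop dfs -mkdir -p /user/ranjith/mapreduce/input
$ bin/hadoop dfs -put file01 file02 /user/ranjith/mapreduce/input/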

INPUT

  1. Create two files under input folder
  2. Check the files
$ bin/hadoop dfs -ls /user/ranjith/mapreduce/input/
You should see the files below:
/user/ranjith/mapreduce/input/file01
/user/ranjith/mapreduce/input/file02
  3. Check the content of file01
$ bin/hadoop dfs -cat /user/ranjith/mapreduce/input/file01
File01 content is:
Hello World Bye World
  4. Check the content of file02
$ bin/hadoop dfs -cat /user/ranjith/mapreduce/input/file02
File02 content is:
Hello Hadoop Goodbye Hadoop

RUN

Run the command below to execute the job.
> hadoop jar WordCount.jar jbr.hadoopex.WordCount /user/ranjith/mapreduce/input /user/ranjith/mapreduce/output

  • hadoop - launcher script for Hadoop programs and tools
  • jar - tells Hadoop to run a program packaged as a jar file
  • jbr.hadoopex.WordCount - the driver (main) class
  • /user/ranjith/mapreduce/input - input directory in HDFS (every file in it is processed)
  • /user/ranjith/mapreduce/output - output location in HDFS
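
Once the job completes, read the result from the part file the reducer writes (with the default TextOutputFormat, the first reducer's output is named part-r-00000):
$ bin/hadoop dfs -cat /user/ranjith/mapreduce/output/part-r-00000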

OUTPUT


Bye 1
Goodbye 1
Hadoop 2
Hello 2
World 2

ERRORS & SOLUTIONS

Below are some of the errors I encountered while developing and deploying the Hadoop program, along with a solution for each. I hope they are helpful.

ERROR 1

ERROR:
> hadoop WordCount.jar jbr.hadoopex.WordCount /ranjith/learnings/mapreduce/testdata/input/emp.txt /hadoop/user/<USERNAME>/ranjith/learnings/mapreduce/testdata/output
Error: Could not find or load main class WordCount.jar

REASON
The 'jar' keyword was omitted.

SOLUTION
Add the 'jar' keyword after 'hadoop':
hadoop jar WordCount.jar jbr.hadoopex.WordCount /ranjith/learnings/mapreduce/testdata/input/emp.txt /hadoop/user/<USERNAME>/ranjith/learnings/mapreduce/testdata/output

ERROR 2

ERROR:
> hadoop jar WordCount.jar jbr.hadoopex.WordCount /ranjith/learnings/mapreduce/testdata/input/emp.txt /hadoop/user/<USERNAME>/ranjith/learnings/mapreduce/testdata/output
15/10/06 05:33:24 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
15/10/06 05:33:24 WARN mapreduce.JobSubmitter: No job jar file set.  User classes may not be found. See Job or Job#setJar(String).
15/10/06 05:33:24 INFO mapreduce.JobSubmitter: Cleaning up the staging area /user/<USERNAME>/.staging/job_1443059102069_45474
15/10/06 05:33:24 WARN security.UserGroupInformation: PriviledgedActionException as:<USERNAME> (auth:SIMPLE) cause:org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://ranjith/learnings/mapreduce/testdata/input/emp.txt
Exception in thread "main" org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://ranjith/learnings/mapreduce/testdata/input/emp.txt
       at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:321)
       at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:264)
       at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:385)
       at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:589)
       at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:606)
       at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:490)
       at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1295)
       at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1292)
       at java.security.AccessController.doPrivileged(Native Method)
       at javax.security.auth.Subject.doAs(Subject.java:422)
       at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
       at org.apache.hadoop.mapreduce.Job.submit(Job.java:1292)
       at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1313)
       at jbr.hadoopex.WordCount.main(WordCount.java:68)
       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
       at java.lang.reflect.Method.invoke(Method.java:483)
       at org.apache.hadoop.util.RunJar.main(RunJar.java:212)



SOLUTION
The input path was resolved inside HDFS, where the file does not exist. Copy the input file into HDFS instead of reading it from the local filesystem.
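
For example (destination taken from the failing command; adjust to your layout):
$ bin/hadoop dfs -mkdir -p /ranjith/learnings/mapreduce/testdata/input
$ bin/hadoop dfs -put emp.txt /ranjith/learnings/mapreduce/testdata/input/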

ERROR 3

ERROR:
> hadoop jar WordCount.jar jbr.hadoopex.WordCount hdfs://ranjith/mapreduce/input/words.txt hdfs://ranjith/mapreduce/output
Exception in thread "main" java.lang.IllegalArgumentException: java.net.UnknownHostException: ranjith
       at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:373)
       at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:258)
       at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:153)
       at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:628)
       at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:574)
       at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:147)
       at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2596)
       at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
       at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
       at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
       at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
       at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
       at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.addInputPath(FileInputFormat.java:518)
       at jbr.hadoopex.WordCount.main(WordCount.java:65)
       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
       at java.lang.reflect.Method.invoke(Method.java:483)
       at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Caused by: java.net.UnknownHostException: ranjith
       ... 19 more


SOLUTION
In an hdfs:// URI the first component after the scheme is treated as the NameNode hostname, so 'ranjith' was parsed as a host. Provide the path without the scheme (or with a full hdfs://<namenode-host>:<port>/ authority):
hadoop jar WordCount.jar jbr.hadoopex.WordCount /user/<USERNAME>/ranjith/mapreduce/input/words.txt /user/<USERNAME>/ranjith/mapreduce/output

ERROR 4

ERROR:
> hadoop jar WordCount.jar jbr.hadoopex.WordCount /user/<USERNAME>/ranjith/mapreduce/input/words.txt /user/<USERNAME>/ranjith/mapreduce/output
15/10/06 05:49:22 WARN security.UserGroupInformation: PriviledgedActionException as:<USERNAME>(auth:SIMPLE) cause:org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://user/<USERNAME>/ranjith/mapreduce/output already exists
Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://user/<USERNAME>/ranjith/mapreduce/output already exists
       at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:146)
       at org.apache.hadoop.mapreduce.JobSubmitter.checkSpecs(JobSubmitter.java:554)
       at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:430)
       at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1295)
       at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1292)
       at java.security.AccessController.doPrivileged(Native Method)
       at javax.security.auth.Subject.doAs(Subject.java:422)
       at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
       at org.apache.hadoop.mapreduce.Job.submit(Job.java:1292)
       at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1313)
       at jbr.hadoopex.WordCount.main(WordCount.java:68)
       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
       at java.lang.reflect.Method.invoke(Method.java:483)
       at org.apache.hadoop.util.RunJar.main(RunJar.java:212)


SOLUTION
I had created the output directory before running the program. That is not required: Map-Reduce creates the output directory itself and fails if it already exists. Delete the output directory from HDFS and run again.
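
For example, remove it with the FS Shell (path taken from the failing command):
$ bin/hadoop dfs -rm -r /user/<USERNAME>/ranjith/mapreduce/output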

ERROR 5

ERROR:
15/10/06 05:58:18 INFO mapreduce.Job: Task Id : attempt_1443059102069_45567_m_000000_0, Status : FAILED
Error: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class jbr.hadoopex.WordCountMapper not found
       at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2047)
       at org.apache.hadoop.mapreduce.task.JobContextImpl.getMapperClass(JobContextImpl.java:196)
       at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:742)
       at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
       at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
       at java.security.AccessController.doPrivileged(Native Method)
       at javax.security.auth.Subject.doAs(Subject.java:422)
       at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
       at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: java.lang.ClassNotFoundException: Class jbr.hadoopex.WordCountMapper not found
       at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1953)
       at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2045)
       ... 8 more

SOLUTION
The job jar was not set, so the cluster nodes could not find the Mapper class. Add the line below in the main program:
job.setJarByClass(WordCount.class);

Hope this example helps you create a Map-Reduce program on the Hadoop framework. Please share your comments below.
Happy Knowledge Sharing!!!
