INTRODUCTION
Hadoop
- Hadoop is a framework written entirely in Java.
- Developed by Doug Cutting, who was later hired by Yahoo!.
- Hadoop 0.1.0 was released in April 2006.
- It is an open-source project of the Apache Software Foundation.
- Designed for distributed storage and distributed processing.
- It processes very large data sets on computer clusters built from commodity hardware.
- The Hadoop framework is useful:
- When you must process lots of unstructured data.
- When your processing can easily be made parallel.
- When running batch jobs is acceptable.
- When you have access to lots of cheap hardware.
Map-Reduce
- It is the heart of Hadoop.
- It is a programming model for data processing.
- Hadoop can run MapReduce programs written in various languages (Java, Ruby, Python, etc.).
- MapReduce programs are inherently parallel.
- Works in a master-slave model.
- Mapper program
- Performs filtering and sorting.
- Reducer program
- Performs a summary operation (a worked example follows this list).
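For example, given the input line "Hello World Bye World", the map phase emits the pairs (Hello, 1), (World, 1), (Bye, 1), (World, 1); the framework then shuffles and groups the pairs by key, so the reduce phase receives (World, [1, 1]) and emits (World, 2).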
HDFS (Hadoop Distributed File System)
- Java-based file system to store a large volume of data.
- Scales up to 200 PB of storage on a single cluster of 4,500 servers.
- Supports close to a billion files and blocks.
- Access
- Java API (a minimal sketch follows this list)
- Python/C bindings for non-Java applications
- Web GUI over HTTP
- FS Shell - shell-like commands that interact directly with HDFS
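As a quick illustration of the Java API, the sketch below streams an HDFS file to stdout. This is a minimal sketch, not part of the word-count example: the class name HdfsCat is my own, the path is an example, and the Configuration picks up the cluster settings (core-site.xml, hdfs-site.xml) from the classpath.
package jbr.hadoopex;
import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
// Illustrative helper: print the contents of an HDFS file to stdout.
public class HdfsCat {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();     // reads cluster config from the classpath
    FileSystem fs = FileSystem.get(conf);         // the configured default file system (HDFS)
    InputStream in = fs.open(new Path(args[0]));  // e.g. /user/ranjith/mapreduce/input/file01
    try {
      IOUtils.copyBytes(in, System.out, 4096, false); // stream the file to stdout
    } finally {
      IOUtils.closeStream(in);
    }
  }
}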
- Features:
- HDFS can handle large data sets.
- Since HDFS deals with large-scale data, it scales across a multitude of machines.
- HDFS provides a write-once-read-many access model.
- HDFS is built in Java, making it portable across platforms.
- Fault tolerance and availability are high.
SOFTWARES & TOOLS
- Eclipse (Mars.2)
- Maven
- Java 1.6
USECASE
Count the occurrences of each word across the input files of a directory (Word Count).
IMPLEMENTATION
Step 1
Create a Maven project in Eclipse and add the dependencies below to pom.xml.
<dependency>
  <groupId>jdk.tools</groupId>
  <artifactId>jdk.tools</artifactId>
  <scope>system</scope>
  <version>1.6</version>
  <systemPath>${JAVA_HOME}/lib/tools.jar</systemPath>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-common</artifactId>
  <version>2.7.1</version>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-core</artifactId>
  <version>1.2.1</version>
</dependency>
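A note on the first dependency: the jdk.tools entry is a common workaround that puts tools.jar from the local JDK on the build classpath, because some Hadoop classes reference it and Maven does not resolve it on its own.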
Step 2
Create a Mapper class that extends org.apache.hadoop.mapreduce.Mapper.
package jbr.hadoopex;
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
// Input:  (byte offset of the line, line text)
// Output: (word, 1) for every whitespace-separated token in the line
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();
  @Override
  public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line); // splits the line on whitespace
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      context.write(word, one); // emit (word, 1)
    }
  }
}
Step 3
Create a Reducer class that extends org.apache.hadoop.mapreduce.Reducer.
package jbr.hadoopex;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
// Input:  (word, list of 1s emitted by the mappers)
// Output: (word, total count)
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  // Note: the new (mapreduce) API passes an Iterable, not an Iterator; with an
  // Iterator parameter this method would not override reduce() and the reducer
  // would silently act as the identity.
  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable value : values) {
      sum += value.get(); // add up the 1s for this word
    }
    context.write(key, new IntWritable(sum));
  }
}
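Since word counting is associative and commutative, the same class can optionally be registered as a combiner, pre-aggregating counts on the map side to cut shuffle traffic. If you want that, add one line to the driver in Step 4:
job.setCombinerClass(WordCountReducer.class);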
Step 4
Create a client (driver) program that configures the job, wires up the Mapper and Reducer, and submits it.
package jbr.hadoopex;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class WordCount {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "wordcount"); // replaces the deprecated new Job(conf, name)
    job.setJarByClass(WordCount.class); // lets Hadoop locate the jar on the cluster (see ERROR 5 below)
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(WordCountMapper.class);
    job.setReducerClass(WordCountReducer.class);
    job.setInputFormatClass(TextInputFormat.class);   // one record per line of input
    job.setOutputFormatClass(TextOutputFormat.class); // tab-separated key/value lines
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input file or folder in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not exist yet (see ERROR 4 below)
    System.exit(job.waitForCompletion(true) ? 0 : 1); // block until the job finishes
  }
}
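Optionally, the driver can implement the Tool interface and run through ToolRunner, which parses the generic Hadoop options (-D key=value, -fs, etc.) and removes the "Hadoop command-line option parsing not performed" warning shown under ERROR 2 below. A minimal sketch, reusing the Mapper and Reducer above (the class name WordCountTool is my own):
package jbr.hadoopex;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class WordCountTool extends Configured implements Tool {
  public int run(String[] args) throws Exception {
    Job job = Job.getInstance(getConf(), "wordcount"); // getConf() carries options parsed by ToolRunner
    job.setJarByClass(WordCountTool.class);
    job.setMapperClass(WordCountMapper.class);
    job.setReducerClass(WordCountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    return job.waitForCompletion(true) ? 0 : 1;
  }
  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new Configuration(), new WordCountTool(), args));
  }
}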
BUILD & DEPLOYMENT
- Build a jar containing the Mapper, Reducer & client programs above.
- Keep the jar on the local file system of the machine you run the job from ('hadoop jar' ships it to the cluster for you).
- Create an input folder in HDFS (/user/ranjith/mapreduce/input/).
- Create input files (e.g. file01, file02) with some words and copy them into the input folder.
- Do not pre-create the output folder (/user/ranjith/mapreduce/output/); the Map-Reduce job creates it and writes its output files into it (see ERROR 4 below). Sample commands follow this list.
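A sketch of the corresponding commands (paths as used in this post; adjust to your environment):
$ mvn package
$ bin/hadoop fs -mkdir -p /user/ranjith/mapreduce/input
$ bin/hadoop fs -put file01 file02 /user/ranjith/mapreduce/input/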
INPUT
- Create two files under the input folder.
- Check the files:
$ bin/hadoop fs -ls /user/ranjith/mapreduce/input/
You should see the files below.
/user/ranjith/mapreduce/input/file01
/user/ranjith/mapreduce/input/file02
- Check the content of file01:
$ bin/hadoop fs -cat /user/ranjith/mapreduce/input/file01
file01 content:
Hello World Bye World
- Check the content of file02:
$ bin/hadoop fs -cat /user/ranjith/mapreduce/input/file02
file02 content:
Hello Hadoop Goodbye Hadoop
RUN
Run the job with the command below (passing the input folder so both files are processed).
> hadoop jar WordCount.jar jbr.hadoopex.WordCount /user/ranjith/mapreduce/input /user/ranjith/mapreduce/output
- hadoop - the Hadoop launcher command
- jar - tells it to run a program packaged as a jar file
- jbr.hadoopex.WordCount - the client (driver) class
- /user/ranjith/mapreduce/input - input folder in HDFS
- /user/ranjith/mapreduce/output - output location in HDFS (created by the job)
OUTPUT
Bye 1
Goodbye 1
Hadoop 2
Hello 2
World 2
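The counts are written to part files inside the output folder; with the paths above you can view them with:
$ bin/hadoop fs -cat /user/ranjith/mapreduce/output/part-r-00000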
ERRORS & SOLUTION
Below are some of the errors I encountered while developing and deploying the Hadoop program, along with the solution for each. I hope they are helpful.
ERROR 1
ERROR:
> hadoop WordCount.jar jbr.hadoopex.WordCount /ranjith/learnings/mapreduce/testdata/input/emp.txt /hadoop/user/<USERNAME>/ranjith/learnings/mapreduce/testdata/output
Error: Could not find or load main class WordCount.jar
REASON
The 'jar' keyword was omitted.
SOLUTION
Add the 'jar' keyword after 'hadoop':
hadoop jar WordCount.jar jbr.hadoopex.WordCount /ranjith/learnings/mapreduce/testdata/input/emp.txt /hadoop/user/<USERNAME>/ranjith/learnings/mapreduce/testdata/output
ERROR 2
ERROR:
> hadoop jar WordCount.jar jbr.hadoopex.WordCount /ranjith/learnings/mapreduce/testdata/input/emp.txt /hadoop/user/<USERNAME>/ranjith/learnings/mapreduce/testdata/output
15/10/06 05:33:24 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
15/10/06 05:33:24 WARN mapreduce.JobSubmitter: No job jar file set. User classes may not be found. See Job or Job#setJar(String).
15/10/06 05:33:24 INFO mapreduce.JobSubmitter: Cleaning up the staging area /user/<USERNAME>/.staging/job_1443059102069_45474
15/10/06 05:33:24 WARN security.UserGroupInformation: PriviledgedActionException as:<USERNAME> (auth:SIMPLE) cause:org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://ranjith/learnings/mapreduce/testdata/input/emp.txt
Exception in thread "main" org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://ranjith/learnings/mapreduce/testdata/input/emp.txt
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:321)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:264)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:385)
at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:589)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:606)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:490)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1295)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1292)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1292)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1313)
at jbr.hadoopex.WordCount.main(WordCount.java:68)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
SOLUTION
Copy the input file into HDFS instead of pointing the job at a local path.
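For example (destination as used in the command above):
$ bin/hadoop fs -put emp.txt /ranjith/learnings/mapreduce/testdata/input/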
ERROR 3
ERROR:
> hadoop jar WordCount.jar jbr.hadoopex.WordCount hdfs://ranjith/mapreduce/input/words.txt hdfs://ranjith/mapreduce/output
Exception in thread "main" java.lang.IllegalArgumentException: java.net.UnknownHostException: ranjith
at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:373)
at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:258)
at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:153)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:628)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:574)
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:147)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2596)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.addInputPath(FileInputFormat.java:518)
at jbr.hadoopex.WordCount.main(WordCount.java:65)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Caused by: java.net.UnknownHostException: ranjith
... 19 more
SOLUTION
In an hdfs:// URI the authority is the NameNode host, so hdfs://ranjith/... makes Hadoop try to resolve 'ranjith' as a host name. Provide plain HDFS paths instead and let the configured default file system supply the host:
hadoop jar WordCount.jar jbr.hadoopex.WordCount /user/<USERNAME>/ranjith/mapreduce/input/words.txt /user/<USERNAME>/ranjith/mapreduce/output
ERROR 4
ERROR:
> hadoop jar WordCount.jar jbr.hadoopex.WordCount /user/<USERNAME>/ranjith/mapreduce/input/words.txt /user/<USERNAME>/ranjith/mapreduce/output
15/10/06 05:49:22 WARN security.UserGroupInformation: PriviledgedActionException as:<USERNAME>(auth:SIMPLE) cause:org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://user/<USERNAME>/ranjith/mapreduce/output already exists
Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://user/<USERNAME>/ranjith/mapreduce/output already exists
at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:146)
at org.apache.hadoop.mapreduce.JobSubmitter.checkSpecs(JobSubmitter.java:554)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:430)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1295)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1292)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1292)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1313)
at jbr.hadoopex.WordCount.main(WordCount.java:68)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
SOLUTION
I had created the output directory before running the program, which is not required: the job creates it itself and refuses to overwrite an existing one. Delete the output directory from HDFS and run again.
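For example:
$ bin/hadoop fs -rm -r /user/<USERNAME>/ranjith/mapreduce/output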
ERROR 5
ERROR:
15/10/06 05:58:18 INFO mapreduce.Job: Task Id : attempt_1443059102069_45567_m_000000_0, Status : FAILED
Error: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class jbr.hadoopex.WordCountMapper not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2047)
at org.apache.hadoop.mapreduce.task.JobContextImpl.getMapperClass(JobContextImpl.java:196)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:742)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: java.lang.ClassNotFoundException: Class jbr.hadoopex.WordCountMapper not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1953)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2045)
... 8 more
SOLUTION
Add the line below to the driver program. setJarByClass tells Hadoop which jar to ship to the cluster, so the task JVMs can load the Mapper and Reducer classes.
job.setJarByClass(WordCount.class);
Hope this example helps you create a Map-Reduce program in the Hadoop framework. Please share your comments below.
Happy Knowledge Sharing!!!