Thursday, November 12, 2015

Simple User Defined Functions (UDF) in Hive

INTRODUCTION

In this article, we are going to see how we can create our own UDF in Hive.
The Hive API lets us create our own functions by extending its classes.

In this post, I am going to write a UDF using the org.apache.hadoop.hive.ql.exec.UDF class.

SOFTWARES & TOOLS

  1. Eclipse IDE (Mars2)
  2. Java 7
  3. Maven

DATABASE & TABLES

Create a new database
Query
CREATE DATABASE IF NOT EXISTS ranjith;

Create a new external table pointing to the HDFS location "/ranjith/hive/data/emp/empinfo/"
Query
USE ranjith;
DROP TABLE IF EXISTS empinfo;
CREATE EXTERNAL TABLE empinfo(empid STRING, firstname STRING, lastname STRING, dob STRING, designation STRING, doj STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION "/ranjith/hive/data/emp/empinfo/";

Check the available data.

USECASE

Calculate the years of experience of each employee using their joining date (doj).
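As a rough illustration of the calculation itself, assume the doj column is stored as text in yyyy-MM-dd format (an assumption; adjust the pattern to whatever your data actually uses). The class and method names below are purely illustrative:

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;

public class ExperienceCalc {

 // Assumed date pattern; change it to match how doj is stored in empinfo.
 private static final String DATE_PATTERN = "yyyy-MM-dd";

 public static int yearsOfExperience(String doj) throws ParseException {
  SimpleDateFormat sdf = new SimpleDateFormat(DATE_PATTERN);
  Calendar joined = Calendar.getInstance();
  joined.setTime(sdf.parse(doj));

  Calendar now = Calendar.getInstance();
  int years = now.get(Calendar.YEAR) - joined.get(Calendar.YEAR);
  // Subtract one year if the joining anniversary has not yet occurred this year.
  if (now.get(Calendar.DAY_OF_YEAR) < joined.get(Calendar.DAY_OF_YEAR)) {
   years--;
  }
  return years;
 }
}

For example, with today's date in November 2015, a doj of 2010-06-15 would give 5 years of experience.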

IMPLEMENTATION

Create UDF


import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class EmpExperienceUDFString extends UDF {

 public String evaluate(Text empid, Text doj) {
  // write your logic here and return the computed experience
  return null;
 }
}

Note the following points:
  1. Your UDF class should extend org.apache.hadoop.hive.ql.exec.UDF.
  2. Your UDF class must have an evaluate method, since Hive looks for this method.
  3. You can return a String, Map, or List from the evaluate method.
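Putting these points together, here is a minimal sketch of a possible implementation. The package and class names (jbr.hiveudf.EmpExperienceUDFString) match the ones used in the Hive commands later in this post; the yyyy-MM-dd date pattern and the exact output string are assumptions you may want to change:

package jbr.hiveudf;

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class EmpExperienceUDFString extends UDF {

 // Assumed date pattern for the doj column; adjust to match your data.
 private static final String DATE_PATTERN = "yyyy-MM-dd";

 public String evaluate(Text empid, Text doj) {
  if (empid == null || doj == null) {
   return null;
  }
  try {
   SimpleDateFormat sdf = new SimpleDateFormat(DATE_PATTERN);
   Calendar joined = Calendar.getInstance();
   joined.setTime(sdf.parse(doj.toString()));

   Calendar now = Calendar.getInstance();
   int years = now.get(Calendar.YEAR) - joined.get(Calendar.YEAR);
   // Subtract one year if the joining anniversary has not yet occurred this year.
   if (now.get(Calendar.DAY_OF_YEAR) < joined.get(Calendar.DAY_OF_YEAR)) {
    years--;
   }
   return empid.toString() + "\t" + years;
  } catch (ParseException e) {
   // Return null for rows whose doj cannot be parsed.
   return null;
  }
 }
}

Here the UDF returns the empid and the computed years of experience as a tab-separated string; you could just as well return only the number, or a Map or List.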

Create JAR & Copy to HDFS

  1. Create the jar using the Eclipse IDE, or
  2. Go to the project folder and run the command: mvn jar:jar
  3. Assume your jar name is HiveUDF-1.0.jar.
  4. Copy the jar to the HDFS location /ranjith/hive/jars/.
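It can also help to sanity-check the evaluate method locally with a plain main method before deploying the jar. This is a quick check against the sketch above, not a required step, and the sample values are made up:

import org.apache.hadoop.io.Text;

import jbr.hiveudf.EmpExperienceUDFString;

public class EmpExperienceUDFStringCheck {

 public static void main(String[] args) {
  // Hypothetical sample values; use an empid and doj that match your data.
  EmpExperienceUDFString udf = new EmpExperienceUDFString();
  System.out.println(udf.evaluate(new Text("101"), new Text("2010-06-15")));
 }
}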

Run UDF

Now we will see how to run it. First, enter the commands below:

ADD JAR /ranjith/hive/jars/HiveUDF-1.0.jar;
CREATE TEMPORARY FUNCTION emp_exp_string AS 'jbr.hiveudf.EmpExperienceUDFString';
SELECT emp_exp_string(empid,doj) from empinfo;

Line 1: adds the jar to the classpath.
Line 2: creates a temporary function name for your UDF.
Line 3: gets the output of the UDF by calling the temporary function name.

A MapReduce job will now run and display the output for each row.

