Monday, May 13, 2013

Map-Reduce Example using Java



Input file
1929143,Cloud,137970,23,3/7/1989,Female,LBEIDY,3009,19
1764937,FSI,106975,40,15/11/1972,Male,VBFMLI,4531,20
2219809,MFG,128452,34,21/1/1978,Male,YDEVBR,2266,20
2230165,Cloud,92520,33,25/3/1979,Female,KDSBYM,2246,18
1994082,MFG,61059,31,31/4/1981,Male,PTHIQG,1631,20
1672195,MFG,121177,36,24/2/1976,Female,UHBHTC,86,16
2130586,RCL,103082,24,24/9/1988,Male,ABEBZF,4177,16
1308642,RCL,84086,26,12/6/1986,Male,TFJHIK,816,21
1700297,Cloud,16624,30,29/4/1982,Female,ENJNSX,11,17
1768996,FSI,134554,33,31/12/1979,Female,TVHRKY,527,17

Schema of input
       employeeNo_Position=0;
       unit_Position=1;
       salary_Position=2;
       age_Position=3;
       dOB_Position=4;
       gender_Position=5;
       projectCode_Position=6;
       billingRevenue_Position=7;
       noOfBillableDays_Position=8;

       delimiter=",";

Requirement
For the given input file, find the total revenue of each unit.

Explanation
Every MapReduce program needs a mapper and a reducer, plus a driver class where the job configuration is set. The input and output of both the mapper and the reducer are key-value pairs.
Mapper -
In our scenario the input key to the mapper is the byte offset of each line and the input value is the line itself. The output key is the unit and the output value is the billing revenue read from that line.
The mapper output goes through a shuffle and sort phase and is then handed to the reducer.
Reducer -
The input key to the reducer is a Text value (the unit) and the values are an Iterable of IntWritable values (the revenue figures for that unit).
We sum the revenue for each key, so the reducer output is the total revenue per unit.
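
To make this concrete, here is the flow for the sample input above, with the totals worked out by hand from the billingRevenue column (position 7):

Mapper output (one key-value pair per input line):
       (Cloud, 3009) (FSI, 4531) (MFG, 2266) (Cloud, 2246) (MFG, 1631)
       (MFG, 86) (RCL, 4177) (RCL, 816) (Cloud, 11) (FSI, 527)

After shuffle and sort, grouped by key:
       Cloud -> [3009, 2246, 11]
       FSI   -> [4531, 527]
       MFG   -> [2266, 1631, 86]
       RCL   -> [4177, 816]

Reducer output (total revenue per unit):
       Cloud  5266
       FSI    5058
       MFG    3983
       RCL    4993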

Java Implementation

Driver class
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.log4j.Logger;

/**
 *
 * @author Amal_Babu
 *
 */
public class Driver extends Configured implements Tool {

       private static Logger log = Logger.getLogger(Driver.class);

       public static void main(String[] args) {

              try {
                     int res = ToolRunner.run(new Configuration(), new Driver(), args);
                     System.exit(res);
              } catch (Exception e1) {
                     e1.printStackTrace();
              }

       }

       @Override
       public int run(String[] arg0) throws Exception {
              // getting configuration object and setting job name
              Configuration conf = getConf();
              Job job = new Job(conf, "RevenueCalculator");

              // setting the class names
              job.setJarByClass(Driver.class);
              job.setMapperClass(RecordSelector.class);
              job.setReducerClass(RevenuePerUnitCalculator.class);

              // setting the output data type classes
              job.setOutputKeyClass(Text.class);
              job.setOutputValueClass(IntWritable.class);

              job.setNumReduceTasks(1);
              job.setInputFormatClass(TextInputFormat.class);
              job.setOutputFormatClass(TextOutputFormat.class);
              // the reducer doubles as a combiner: its output types match its input types and summing revenue is associative
              job.setCombinerClass(RevenuePerUnitCalculator.class);

              // to accept the hdfs input and output dirs at run time
              FileInputFormat.addInputPath(job, new Path(arg0[0]));
              FileOutputFormat.setOutputPath(job, new Path(arg0[1]));

              return job.waitForCompletion(true) ? 0 : 1;
       }

}
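
Note: the new Job(conf, "...") constructor used above matches the older Hadoop API and is marked deprecated from Hadoop 2.x onwards. If you are on a newer release, the same job setup can be created with the factory method instead; a minimal sketch:

              // Hadoop 2.x and later: preferred way to create the Job object
              Job job = Job.getInstance(conf, "RevenueCalculator");

Everything else in the driver stays the same.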

Mapper class
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.log4j.Logger;
/**
 *
 * @author Amal_Babu
 *
 * @Description RecordSelector is used to set the mapper output key-value pairs (unit, revenue)
 *
 */
public class RecordSelector extends Mapper<LongWritable, Text, Text, IntWritable>{

       private Logger log = Logger.getLogger(RecordSelector.class);
       @Override
       public void map(LongWritable key, Text value, Context context) {
              String line = value.toString();
              // extract the unit and the billing revenue columns from the CSV line
              TextFilePositions tokeniser = new TextFilePositions();
              String unit = tokeniser.getElementByPosition(line, TextFilePositions.unit_Position);
              String revenueVal = tokeniser.getElementByPosition(line, TextFilePositions.billingRevenue_Position);
              log.debug("Parsed revenue value: " + revenueVal);
              if (revenueVal != null) {
                     Text word = new Text(unit);
                     IntWritable revenue = new IntWritable(Integer.parseInt(revenueVal));
                     try {
                            log.info("Emitting record from mapper: key-->" + word
                                          + " value-->" + revenue);
                            context.write(word, revenue);
                     } catch (IOException e) {
                            log.error("IOException caught when emitting the record from RecordSelector (mapper)", e);
                     } catch (InterruptedException e) {
                            log.error("InterruptedException caught when emitting the record from RecordSelector (mapper)", e);
                     }
              }
       }


}
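
One thing to watch for: the mapper only checks that the revenue field is present, so a non-numeric value in that column (for example a header row) would make Integer.parseInt throw a NumberFormatException and fail the task. If that is a concern, a defensive helper along these lines could be added to RecordSelector (an optional, hypothetical addition, not part of the original code):

       /**
        * Parses the revenue field defensively and returns null for blank or
        * non-numeric values so the mapper can simply skip that record.
        */
       private Integer parseRevenue(String revenueVal) {
              if (revenueVal == null || revenueVal.trim().isEmpty()) {
                     return null;
              }
              try {
                     return Integer.parseInt(revenueVal.trim());
              } catch (NumberFormatException e) {
                     log.warn("Ignoring non-numeric revenue value: " + revenueVal);
                     return null;
              }
       }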

Tokenizer class
import java.util.StringTokenizer;
/**
 *
 * @author Amal_Babu
 *
 * @Description holds the column-position constants and a helper to read a field from a delimited line
 *
 */
public  class TextFilePositions {
       public static final int employeeNo_Position=0;
       public static final int unit_Position=1;
       public static final int salary_Position=2;
       public static final int age_Position=3;
       public static final int dOB_Position=4;
       public static final int gender_Position=5;
       public static final int projectCode_Position=6;
       public static final int billingRevenue_Position=7;
       public static final int noOfBillableDays_Position=8;
       public static final String delimiter=",";
      
      
      
       public String getElementByPosition(String line, int position) {
              String fieldToReturn = null;
              StringTokenizer strToken = new StringTokenizer(line, delimiter);
              // walk through the tokens until the one at the requested position is reached
              for (int i = 0; i <= position && strToken.hasMoreElements(); i++) {
                     fieldToReturn = strToken.nextToken();
              }
              return fieldToReturn;
       }
      
}
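
The helper can be checked quickly outside Hadoop with the first sample line; this is only an illustrative snippet (the demo class name is made up) and is not part of the job:

public class TextFilePositionsDemo {
       public static void main(String[] args) {
              String line = "1929143,Cloud,137970,23,3/7/1989,Female,LBEIDY,3009,19";
              TextFilePositions positions = new TextFilePositions();
              // prints "Cloud" (position 1)
              System.out.println(positions.getElementByPosition(line, TextFilePositions.unit_Position));
              // prints "3009" (position 7)
              System.out.println(positions.getElementByPosition(line, TextFilePositions.billingRevenue_Position));
       }
}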

Reducer class
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.log4j.Logger;
/**
 *
 * @author Amal_Babu
 * @Description RevenuePerUnitCalculator is used to get the total revenue per unit
 *
 */
public class RevenuePerUnitCalculator extends Reducer<Text, IntWritable, Text, IntWritable>{
       private Logger log = Logger.getLogger(RevenuePerUnitCalculator.class);
       @Override
       public void reduce(Text key, Iterable<IntWritable> values, Context context) {
              // add up every revenue value that arrived for this unit
              int sum = 0;
              for (IntWritable val : values) {
                     sum += val.get();
              }
              try {
                     context.write(key, new IntWritable(sum));
              } catch (IOException e) {
                     log.error("IOException caught when emitting the record from RevenuePerUnitCalculator (reducer)", e);
              } catch (InterruptedException e) {
                     log.error("InterruptedException caught when emitting the record from RevenuePerUnitCalculator (reducer)", e);
              }
       }

}

How to execute?
1. Export the project as jar file.
2. Place input file in HDFS.
3. Execute the command given below
                hadoop jar <jar_name> <input_hdfs_path> <output_hdfs_path>
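
For example, assuming the jar was exported as revenue.jar and the input file was uploaded to /user/hadoop/input/employees.csv (both names are placeholders), the run and a quick look at the result could be:

                hadoop jar revenue.jar /user/hadoop/input/employees.csv /user/hadoop/output/revenue
                hadoop fs -cat /user/hadoop/output/revenue/part-r-00000

If the main class is not set in the jar manifest, pass the fully qualified Driver class name right after the jar name. Also note that the output directory must not already exist, or the job will fail at startup.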
               
