This repository is a collection of MapReduce and Hadoop ecosystem programs:
- MapReduce WordCount application with MRUnit tests
- MapReduce custom Writable implementation for String pairs
- MapReduce custom InputFormat and RecordReader implementation
- MapReduce custom OutputFormat and RecordWriter implementation
- Pig custom LoadFunc to load and parse Apache HTTP log events
- Pig custom EvalFunc to transform IP addresses to location using MaxMind GEO API
Also see MapReduce Joins for how to implement joins in MapReduce.
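To illustrate the custom Writable item in the list above, here is a minimal sketch of a String-pair key in plain Java. It is not the repository's class: the name `StringPairSketch` and its `roundTrip` helper are hypothetical, and the class deliberately does not implement Hadoop's `WritableComparable` so that it compiles without Hadoop on the classpath. The `write` and `readFields` methods use the same `DataOutput`/`DataInput` signatures that the Hadoop `Writable` interface declares.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Objects;

// Hypothetical sketch; in a real job this would implement
// org.apache.hadoop.io.WritableComparable<StringPairSketch>.
class StringPairSketch implements Comparable<StringPairSketch> {
    private String first = "";
    private String second = "";

    // Hadoop creates Writables reflectively, so a no-arg constructor is needed.
    StringPairSketch() {}

    StringPairSketch(String first, String second) {
        this.first = first;
        this.second = second;
    }

    // Serialize both fields in a fixed order.
    public void write(DataOutput out) throws IOException {
        out.writeUTF(first);
        out.writeUTF(second);
    }

    // Deserialize in the same order used by write().
    public void readFields(DataInput in) throws IOException {
        first = in.readUTF();
        second = in.readUTF();
    }

    // Sort by the first string, then by the second.
    @Override
    public int compareTo(StringPairSketch other) {
        int cmp = first.compareTo(other.first);
        return cmp != 0 ? cmp : second.compareTo(other.second);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof StringPairSketch)) return false;
        StringPairSketch p = (StringPairSketch) o;
        return first.equals(p.first) && second.equals(p.second);
    }

    // Keys that are equal must hash alike so they reach the same reducer.
    @Override
    public int hashCode() {
        return Objects.hash(first, second);
    }

    @Override
    public String toString() {
        return first + "\t" + second;
    }

    // Hypothetical helper: serialize a pair to bytes and read it back.
    static StringPairSketch roundTrip(StringPairSketch pair) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            pair.write(new DataOutputStream(buffer));
            StringPairSketch copy = new StringPairSketch();
            copy.readFields(new DataInputStream(
                    new ByteArrayInputStream(buffer.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Overriding `equals` and `hashCode` together matters when the pair is used as a map output key, since the default partitioner relies on the key's hash code.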
From the root of this project, run:
mvn package
Executing WordCount
hadoop jar mapreduce_cwt-1.0-SNAPSHOT.jar com.cloudwick.mapreduce.wordcount.WordCountDriver \
input_path \
output_path
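For readers new to the model, the following is a rough sketch of the map, shuffle, and reduce steps that a word count job performs, written in plain Java with no Hadoop dependency. It is not the code behind `WordCountDriver`; the class `WordCountSketch` and its methods are hypothetical, and it assumes whitespace tokenization with no case folding.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory simulation of a word count job.
class WordCountSketch {

    // Map step: split a line into words; each word stands for a (word, 1) pair.
    static List<String> map(String line) {
        List<String> words = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        return words;
    }

    // Reduce step: sum all counts emitted for one word.
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int count : counts) {
            sum += count;
        }
        return sum;
    }

    // Driver: run map on every line, group by key (the shuffle), then reduce.
    static Map<String, Integer> run(List<String> lines) {
        // TreeMap keeps keys sorted, similar to how reducers see sorted keys.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (String word : map(line)) {
                grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            result.put(entry.getKey(), reduce(entry.getKey(), entry.getValue()));
        }
        return result;
    }
}
```

Calling `WordCountSketch.run` with the lines `to be or` and `not to be` yields the counts `be=2`, `not=1`, `or=1`, `to=2`.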
Executing Custom InputFormat
- Copy the test data from src/main/resources/columntext-testdata to the Hadoop cluster
- Move the file to HDFS
hadoop fs -mkdir fw_input
hadoop fs -put columntext-testdata fw_input
- Execute the MapReduce program to parse the fixed-width records
hadoop jar mapreduce_cwt-1.0-SNAPSHOT.jar com.cloudwick.mapreduce.inputformat.FixedWidthColumnTextDriver \
fw_input \
fw_output
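The core of fixed-width parsing, which a custom RecordReader would apply to each record, can be sketched in plain Java as follows. This is only an illustration: `FixedWidthParserSketch` is a hypothetical name, the column widths are supplied by the caller because the actual layout of `columntext-testdata` is not described here, and trimming the padding is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of splitting one fixed-width record into fields.
class FixedWidthParserSketch {

    // Cut the record into consecutive columns of the given widths.
    // Columns that fall past the end of a short record become empty strings.
    static List<String> parse(String record, int[] widths) {
        List<String> fields = new ArrayList<>();
        int start = 0;
        for (int width : widths) {
            if (start >= record.length()) {
                fields.add("");
            } else {
                int end = Math.min(start + width, record.length());
                // Assumption: padding is whitespace and can be trimmed.
                fields.add(record.substring(start, end).trim());
            }
            start += width;
        }
        return fields;
    }
}
```

With widths of 10, 10, and 3, a record built from `John`, `Smith`, and `042`, each left-aligned and space-padded to its column width, parses into those three values.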
Executing Custom OutputFormat
hadoop jar mapreduce_cwt-1.0-SNAPSHOT.jar com.cloudwick.mapreduce.outputformat.FixedWidthColumnTextDriver \
fw_output \
fw_output2