Recently I ran intro problem trying to output two types of data from MapReduce task.
First solution to solve this problem is to use Hadoop MapReduce feature called MultipleOutputs. But, unfortunately this feature is broken in our production version ( 0.20.1 ). We are using new MapReduce API.
In new Cloudera CDH3 Hadoop distribution this feature is working perfectly. Best reference is to use documentation: MultipleOutputs
So, for us not running last version of Hadoop there is simple solution to achieve same functionality. Starting point is Hadoop FAQ
Main trick is to write output for particular Mapper or Reducer to temporary task directory and only output from successfully finished tasks is copied to job output directory. Map Reduce engine will take care of copying successful temporary output, so on us is only to write output.
Enough with talk, let’s code. Remember, we are using new API, so all classes should be from org.apache.hadoop.mapreduce package.
First we need to overload
setup(Context context) method of Mapper or Reducer.
// format: attempt_200811201130_0004_m_000003_0
String taskId = context.getConfiguration().get("mapred.task.id").split("_");
Path workPath = FileOutputFormat.getWorkOutputPath(context);
workPath = workPath.suffix("/mo-"+ taskId + "-" + taskId.substring(1));
writer = workPath.getFileSystem(context.getConfiguration()).create(workPath);
We are first extracting taskId to be able to create per task unique files. After that we are using
FileOutputFormat.getWorkOutputPath to get temporary path for task output. At least, we are creating writer to be used from map or reduce call.
To output something to this writer from map or reduce call it enough to do following
cleanup(Context context) call we are closing writer
This code will produce following files in M/R output directory, if used in mapper
This code is taken from example I created to test this feature. It contains code to do same thing using MultipleOutputs and custom code. Check it out on GitHub.