I am looking for a little clarification on the answers to this question here:
Generating Separate Output files in Hadoop Streaming
My use case is as follows:
I have a map-only MapReduce job that takes an input file, does a lot of parsing and munging, and then writes back out. However, certain lines may be in an incorrect format, and if that is the case, I would like to write the original line to a separate file.
It seems that one way to do this would be to prepend the name of the output file to each line I am printing and use MultipleOutputFormat. For example, if I originally had:
    if line_is_valid(line):
        print name + '\t' + comments
I could instead do:
    if line_is_valid(line):
        print valid_file_name + '\t' + name + '\t' + comments
    else:
        print err_file_name + '\t' + line
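To make the idea concrete, here is a sketch of the full streaming mapper I have in mind (Python 3 syntax; `line_is_valid` and the two file-name prefixes are placeholders of my own, not anything Hadoop provides):

```python
import sys

# Hypothetical prefixes that MultipleOutputFormat would turn into file names.
VALID_PREFIX = 'valid'
ERR_PREFIX = 'errors'

def line_is_valid(line):
    # Placeholder check: pretend a valid line has at least two tab-separated fields.
    return len(line.split('\t')) >= 2

def map_line(line):
    """Prefix each line with the name of the file it should land in."""
    line = line.rstrip('\n')
    if line_is_valid(line):
        name, comments = line.split('\t', 1)
        return VALID_PREFIX + '\t' + name + '\t' + comments
    return ERR_PREFIX + '\t' + line

if __name__ == '__main__':
    for raw in sys.stdin:
        print(map_line(raw))
```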
The only problem I have with this solution is that I don't want the file name to appear as the first column in the output text files. I suppose I could then run another job to strip out the first column of each file, but that seems kind of silly. So:
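For reference, the follow-up strip job I'd rather avoid would be as small as this (a sketch assuming the file name is always the first tab-separated field):

```python
import sys

def strip_key(line):
    # Drop everything up to and including the first tab (the file-name column).
    return line.rstrip('\n').split('\t', 1)[1]

if __name__ == '__main__':
    for raw in sys.stdin:
        print(strip_key(raw))
```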
1) Is this the correct way to manage multiple output files with a Python MapReduce job?
2) What is the best way to get rid of that initial column?