Mapreduce throwing OutOfMemoryError for large input file

Question

Hi, I have a MapReduce jar that runs perfectly fine for small input files. By small I mean sample input files that I've created with fewer than 10 lines of input. But when I try to run MapReduce on a 1.8 GB input file, I get an OutOfMemoryError. I'm not sure what I'm supposed to be doing.

Is there any way that I can limit the number of tasks being spawned, and have fewer tasks run for longer durations?

Around 20 tasks are spawned on the large input file before I get this error. Here's part of the log that's generated for the first two tasks.

13/12/13 12:00:22 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
13/12/13 12:00:22 INFO mapreduce.Job: Running job: job_local1170901099_0001
13/12/13 12:00:22 INFO mapred.LocalJobRunner: OutputCommitter set in config null
13/12/13 12:00:22 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
13/12/13 12:00:22 INFO mapred.LocalJobRunner: Waiting for map tasks
13/12/13 12:00:22 INFO mapred.LocalJobRunner: Starting task: attempt_local1170901099_0001_m_000000_0
13/12/13 12:00:22 INFO util.ProcfsBasedProcessTree: ProcfsBasedProcessTree currently is supported only on Linux.
13/12/13 12:00:22 INFO mapred.Task:  Using ResourceCalculatorProcessTree : null
13/12/13 12:00:22 INFO mapred.MapTask: Processing split: hdfs://localhost:9000/user/chaitanya.nadig/friendship.txt:0+134217728
13/12/13 12:00:22 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
13/12/13 12:00:23 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
13/12/13 12:00:23 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
13/12/13 12:00:23 INFO mapred.MapTask: soft limit at 83886080
13/12/13 12:00:23 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
13/12/13 12:00:23 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
13/12/13 12:00:23 INFO mapreduce.Job: Job job_local1170901099_0001 running in uber mode : false
13/12/13 12:00:23 INFO mapreduce.Job:  map 0% reduce 0%
13/12/13 12:00:24 INFO mapred.MapTask: Starting flush of map output
13/12/13 12:00:24 INFO mapred.LocalJobRunner: Starting task: attempt_local1170901099_0001_m_000001_0
13/12/13 12:00:24 INFO util.ProcfsBasedProcessTree: ProcfsBasedProcessTree currently is supported only on Linux.
13/12/13 12:00:24 INFO mapred.Task:  Using ResourceCalculatorProcessTree : null
13/12/13 12:00:24 INFO mapred.MapTask: Processing split: hdfs://localhost:9000/user/chaitanya.nadig/friendship.txt:134217728+134217728
13/12/13 12:00:24 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
13/12/13 12:00:24 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
13/12/13 12:00:24 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
13/12/13 12:00:24 INFO mapred.MapTask: soft limit at 83886080
13/12/13 12:00:24 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
13/12/13 12:00:24 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
13/12/13 12:00:25 INFO mapred.MapTask: Starting flush of map output

This is the tail of the log which is generated when the error occurs.

13/12/13 12:00:43 INFO mapred.MapTask: Starting flush of map output
13/12/13 12:00:43 INFO mapred.Task: Task:attempt_local1170901099_0001_m_000020_0 is done. And is in the process of committing
13/12/13 12:00:43 INFO mapred.LocalJobRunner: map
13/12/13 12:00:43 INFO mapred.Task: Task 'attempt_local1170901099_0001_m_000020_0' done.
13/12/13 12:00:43 INFO mapred.LocalJobRunner: Finishing task: attempt_local1170901099_0001_m_000020_0
13/12/13 12:00:43 INFO mapred.LocalJobRunner: Map task executor complete.
13/12/13 12:00:43 WARN mapred.LocalJobRunner: job_local1170901099_0001
java.lang.Exception: java.lang.OutOfMemoryError: Java heap space
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:403)
Caused by: java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2786)
at org.apache.hadoop.io.Text.setCapacity(Text.java:266)
at org.apache.hadoop.io.Text.append(Text.java:236)
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:238)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:164)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:532)
at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)
at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:235)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:695)
13/12/13 12:00:44 INFO mapreduce.Job:  map 100% reduce 0%
13/12/13 12:00:44 INFO mapreduce.Job: Job job_local1170901099_0001 failed with state FAILED due to: NA
13/12/13 12:00:44 INFO mapreduce.Job: Counters: 22
File System Counters
    FILE: Number of bytes read=27635962
    FILE: Number of bytes written=28018656
    FILE: Number of read operations=0
    FILE: Number of large read operations=0
    FILE: Number of write operations=0
    HDFS: Number of bytes read=5338170260
    HDFS: Number of bytes written=0
    HDFS: Number of read operations=25
    HDFS: Number of large read operations=0
    HDFS: Number of write operations=1
Map-Reduce Framework
    Map input records=0
    Map output records=0
    Map output bytes=0
    Map output materialized bytes=6
    Input split bytes=122
    Combine input records=0
    Spilled Records=0
    Failed Shuffles=0
    Merged Map outputs=0
    GC time elapsed (ms)=5
    Total committed heap usage (bytes)=530186240
File Input Format Counters 
    Bytes Read=118909386

Do you have line breaks in your file? I assume that it will just read a massive line of text.
– Thomas Jungblut, Commented Dec 13, 2013 at 21:49
Hi, you were right. My input file was corrupted. It ran fine once I downloaded the input file again. Thank you.
– Chai Nadig, Commented Dec 15, 2013 at 23:43

Chai Nadig · Accepted Answer · 2015-09-16 00:47:06Z

1

This answer is late, but I'm posting it in case it helps someone else. The problem was that the file I was trying to process was corrupted. I got a different copy of the file, ran my MR job on it, and everything worked fine.
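The stack trace in the question fails inside LineReader.readLine while a Text buffer is being grown, which fits the comment above about the reader pulling in one massive line. A quick way to check a suspect file for that symptom is to measure its longest line before submitting the job. The following is a minimal sketch, not part of Hadoop: the class and method names are hypothetical, and it assumes a local copy of the file (for example, one fetched with `hdfs dfs -get`) with `\n` as the line terminator.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper (not a Hadoop class): reports the longest line in a local file. */
final class LineLengthCheck {

    /**
     * Streams the file through a buffered reader one byte at a time and returns
     * the length, in bytes, of the longest line. Only '\n' is treated as a line
     * terminator, so a trailing '\r' counts toward the line length. Memory use
     * stays constant regardless of how long any single line is.
     */
    static long longestLineBytes(Path file) throws IOException {
        long longest = 0;
        long current = 0;
        try (InputStream in = new BufferedInputStream(Files.newInputStream(file))) {
            int b;
            while ((b = in.read()) != -1) {
                if (b == '\n') {
                    longest = Math.max(longest, current);
                    current = 0;
                } else {
                    current++;
                }
            }
        }
        // The final line may not end with a newline, so compare it as well.
        return Math.max(longest, current);
    }
}
```

If the result is in the hundreds of megabytes for a file that should contain short records, the file probably lacks line breaks or is damaged, and raising the heap is unlikely to help.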

answered Sep 16, 2015 at 0:47

Chai Nadig


Mike Van · Accepted Answer · 2013-12-13 20:43:18Z

0

My first impulse would be to ask what your startup parameters are. Typically, when you run MapReduce and experience an out-of-memory error, you would use something like the following as your startup params:

-Dmapred.map.child.java.opts=-Xmx1G -Dmapred.reduce.child.java.opts=-Xmx1G

The key here is that these two amounts are cumulative. So the amounts you specify, added together, should not come close to exceeding the memory available on your system after you start MapReduce.
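The "cumulative" point can be checked with simple arithmetic. The sketch below is illustrative only: the class and method names are hypothetical, it handles only a bare `-Xmx` option with an optional k/m/g suffix, and it assumes exactly one map JVM and one reduce JVM running at the same time, which is the scenario the paragraph above describes.

```java
/** Hypothetical helper (not a Hadoop class): sums the -Xmx settings of the two options. */
final class HeapBudget {

    /** Parses a single JVM option such as "-Xmx1G" or "-Xmx512m" into a byte count. */
    static long parseXmxBytes(String opt) {
        if (!opt.startsWith("-Xmx") || opt.length() <= 4) {
            throw new IllegalArgumentException("Not an -Xmx option: " + opt);
        }
        String value = opt.substring(4);
        char suffix = Character.toLowerCase(value.charAt(value.length() - 1));
        long multiplier;
        switch (suffix) {
            case 'k': multiplier = 1024L; break;
            case 'm': multiplier = 1024L * 1024; break;
            case 'g': multiplier = 1024L * 1024 * 1024; break;
            default:  multiplier = 1L; // no suffix: the value is a plain byte count
        }
        String digits = (multiplier == 1L) ? value : value.substring(0, value.length() - 1);
        return Long.parseLong(digits) * multiplier;
    }

    /** True if the map heap plus the reduce heap fits within the given amount of memory. */
    static boolean fits(String mapOpts, String reduceOpts, long availableBytes) {
        return parseXmxBytes(mapOpts) + parseXmxBytes(reduceOpts) <= availableBytes;
    }
}
```

For example, `HeapBudget.fits("-Xmx1G", "-Xmx1G", 4L << 30)` asks whether two 1 GiB heaps fit in 4 GiB. If more than one task of each kind runs concurrently, the sum would have to be scaled by the number of concurrent tasks.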

edited Dec 13, 2013 at 20:43

answered Dec 13, 2013 at 20:25

Mike Van


3 Comments

Chai Nadig Over a year ago

I didn't have a mapred-site.xml file. I created one with the properties that you mentioned and the values that you gave. Restarted and ran the job again, but still no luck. It creates 22 tasks and then throws up that error.

Chai Nadig Over a year ago

No, I have it at 3G and 3G for both of them now. I have 8 GB of RAM on my Mac. It still throws the error after 22 tasks.

Raghuveer Over a year ago

Did you resolve this error? I am also facing this issue now. Kindly suggest a fix.

Kunal Khaire · Accepted Answer · 2016-03-08 05:32:10Z

0

This might be late, but I solved this by setting the following parameter to 0.2:

mapred.job.shuffle.input.buffer.percent

This tells the reducer JVM to request only 0.2 of the heap (20%) for the shuffle buffer, rather than 0.7 (70%). You are getting the "Out of heap space" error because the shuffle phase is asking the JVM for memory that is not available to it; rather than spilling, it just throws the exception. If you ask for only 20%, chances are you will get the memory. Once you exceed the allotted memory, the spilling logic comes into play.

Of course, the downside is the slowness.

You can also calculate at run-time the amount of memory available and then reset the buffer.
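To make the numbers concrete: the property is a fraction of the reducer's maximum heap, so 0.2 on a 1 GiB heap is roughly 205 MiB and 0.7 is roughly 717 MiB. The sketch below shows that arithmetic, plus one possible way to derive a fraction from the heap that is free at run time. Everything in it is an assumption for illustration: the class, the method names, and the "use half of the free heap" heuristic are hypothetical, not Hadoop behavior, and the resulting value would still have to be written into the job configuration before the job is submitted.

```java
/** Hypothetical helper (not a Hadoop class): sizing arithmetic for the shuffle buffer. */
final class ShuffleBufferSizing {

    /** Bytes the shuffle buffer may use for a given max heap and fraction (0.0 to 1.0). */
    static long bufferBytes(long maxHeapBytes, double fraction) {
        if (fraction < 0.0 || fraction > 1.0) {
            throw new IllegalArgumentException("fraction must be between 0 and 1: " + fraction);
        }
        return (long) (maxHeapBytes * fraction);
    }

    /**
     * Illustrative heuristic: offer the shuffle half of the heap that is currently
     * unused, never exceeding the given cap (for example 0.7).
     */
    static double suggestFraction(long maxHeapBytes, long usedHeapBytes, double cap) {
        double freeShare = (double) (maxHeapBytes - usedHeapBytes) / maxHeapBytes;
        return Math.max(0.0, Math.min(cap, freeShare * 0.5));
    }

    /** Applies the heuristic to the JVM this code runs in. */
    static double suggestFractionForThisJvm(double cap) {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        return suggestFraction(rt.maxMemory(), used, cap);
    }
}
```

Note that `suggestFractionForThisJvm` measures the JVM that calls it, which is only meaningful if that JVM has the same heap settings as the reducer.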

answered Mar 8, 2016 at 5:32

Kunal Khaire
