I'm getting an out-of-memory exception (Java heap space) when I try to download tweets using Flume and pipe them into Hadoop.

I currently have the heap size set to 4 GB in Hadoop's mapred-site.xml, like so:

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx4096m</value>
</property>

I am hoping to download tweets continually for two days but can't get past 45 minutes without errors.

Since I do have the disk space to hold all of this, I am assuming the error is coming from Java having to handle so many things at once. Is there a way for me to slow down the speed at which these tweets are downloaded, or do something else to solve this problem?

Edit: flume.conf included

TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

TwitterAgent.sources.Twitter.type = TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = <required>
TwitterAgent.sources.Twitter.consumerSecret = <required>
TwitterAgent.sources.Twitter.accessToken = <required> 
TwitterAgent.sources.Twitter.accessTokenSecret = <required> 
TwitterAgent.sources.Twitter.keywords = manchester united, man united, man utd, man u

TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:50070/user/flume/tweets/%Y/%m/%d/%H/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000

TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100

Edit 2

I've tried increasing the memory to 8 GB, which still doesn't help. I am assuming I am placing too many tweets in Hadoop at once and need to write them to disk and release the space again (or something to that effect). Is there a guide anywhere on how to do this?

  • Can you post the Java code you are using to download tweets? Commented Jul 29, 2013 at 12:44
  • Done - is it an error in there? I modified this code from code provided by Cloudera. Commented Jul 29, 2013 at 12:46
  • Usually -Xmx4096 should also include the unit, i.e. -Xmx4096m. Commented Jul 29, 2013 at 12:50
  • Apologies, it did contain it, I just copied and pasted the code over poorly. Edited my question now to reflect that. Commented Jul 29, 2013 at 12:53
  • When you increase the heap space, you should generally increase the perm-gen size as well (about 25% of the total heap). Commented Jul 29, 2013 at 14:11

2 Answers

Set the JAVA_OPTS value in flume-env.sh, then start the Flume agent.
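
For context, mapred.child.java.opts in mapred-site.xml sizes the MapReduce task JVMs, not the Flume agent's own JVM, which would explain why raising it had no effect. A minimal sketch of the flume-env.sh entry follows; the heap sizes are illustrative assumptions, not tuned values:

```shell
# flume-env.sh: sourced by the flume-ng launcher from the directory passed via --conf.
# The -Xms/-Xmx values are placeholders (assumptions); size them for your machine.
export JAVA_OPTS="-Xms512m -Xmx2048m"
```

Assuming the file lives in a directory named conf, the agent would then be started with something like `flume-ng agent --conf conf --conf-file flume.conf --name TwitterAgent`.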

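It appears the problem was caused by the batch size and transactionCapacity. In the original config the HDFS sink's batchSize (1000) was larger than the memory channel's transactionCapacity (100), so the sink was asking for more events per transaction than the channel allows. I changed them to the following: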

TwitterAgent.sinks.HDFS.hdfs.batchSize = 100
TwitterAgent.channels.MemChannel.transactionCapacity = 1000

This works without me even needing to change the JAVA_OPTS value.
