
I have a JSON file (.json) in Amazon S3. I need to read it and create a new field called Hash_index for each JsonObject. The file is very big, so I am using the GSON library's streaming API to avoid an OutOfMemoryError while reading it. Below is my code.

  //Create the Hashed JSON
    public void createHash() throws IOException
    {
        System.out.println("Hash Creation Started");

        strBuffer = new StringBuffer("");


        try
        {
            //List all the Buckets
            List<Bucket>buckets = s3.listBuckets();

            for(int i=0;i<buckets.size();i++)
            {
                System.out.println("- "+(buckets.get(i)).getName());
            }


            //Downloading the Object
            System.out.println("Downloading Object");
            S3Object s3Object = s3.getObject(new GetObjectRequest(inputBucket, inputFile));
            System.out.println("Content-Type: "  + s3Object.getObjectMetadata().getContentType());



            //Read the JSON File
            /*BufferedReader reader = new BufferedReader(new InputStreamReader(s3Object.getObjectContent()));
            while (true) {
                String line = reader.readLine();
                if (line == null) break;

               // System.out.println("    " + line);
                strBuffer.append(line);

            }*/

           // JSONTokener jTokener = new JSONTokener(new BufferedReader(new InputStreamReader(s3Object.getObjectContent())));
           // jsonArray = new JSONArray(jTokener);

            JsonReader reader = new JsonReader( new BufferedReader(new InputStreamReader(s3Object.getObjectContent())) );
            reader.beginArray();
            int gsonVal = 0;
            while (reader.hasNext()) {
                JsonParser  _parser = new JsonParser();
                JsonElement jsonElement =  _parser.parse(reader);
                JsonObject jsonObject1 = jsonElement.getAsJsonObject();
                //Do something



                StringBuffer hashIndex = new StringBuffer("");

                //Add Title and Body Together to the list
                String titleAndBodyContainer = jsonObject1.get("title")+" "+jsonObject1.get("body");


                //Remove full stops and commas
                titleAndBodyContainer = titleAndBodyContainer.replaceAll("\\.(?=\\s|$)", " ");
                titleAndBodyContainer = titleAndBodyContainer.replaceAll(",", " ");
                titleAndBodyContainer = titleAndBodyContainer.toLowerCase();


                //Create a word list without duplicated words
                StringBuilder result = new StringBuilder();

                HashSet<String> set = new HashSet<String>();
                for(String s : titleAndBodyContainer.split(" ")) {
                    if (!set.contains(s)) {
                        result.append(s);
                        result.append(" ");
                        set.add(s);
                    }
                }
                //System.out.println(result.toString());


                //Re-Arranging everything into Alphabetic Order
                String testString = "acarpous barnyard gleet diabolize acarus creosol eaten gleet absorbance";
                //String testHash = "057        1$k     983    5*1      058     52j    6!v   983     03z";

                String[]finalWordHolder = (result.toString()).split(" ");
                Arrays.sort(finalWordHolder);


                //Navigate through text and create the Hash
                for(int arrayCount=0;arrayCount<finalWordHolder.length;arrayCount++)
                {


                    if(wordMap.containsKey(finalWordHolder[arrayCount]))
                    {
                        hashIndex.append((String)wordMap.get(finalWordHolder[arrayCount]));
                    }

                }

                //System.out.println(hashIndex.toString().trim());

                jsonObject1.addProperty("hash_index", hashIndex.toString().trim()); 
                jsonObject1.addProperty("primary_key", gsonVal); 
                jsonObjectHolder.add(jsonObject1); //Add the JSON Object to the JSON collection

                jsonHashHolder.add(hashIndex.toString().trim());

                System.out.println("Primary Key: "+jsonObject1.get("primary_key"));

                //System.out.println(Arrays.toString(finalWordHolder));
                //System.out.println("- "+hashIndex.toString());

                //break;
                gsonVal++;
            }

            System.out.println("Hash Creation Completed");
        }
        catch(Exception e)
        {
            e.printStackTrace();
        }
    }

When this code is executed, I get the following error:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:2894)
        at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:117)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:407)
        at java.lang.StringBuilder.append(StringBuilder.java:136)
        at HashCreator.createHash(HashCreator.java:252)
        at HashCreator.<init>(HashCreator.java:66)
        at Main.main(Main.java:9)

Line number 252 is result.append(s);. It is inside the HashSet loop.

Previously, it generated an OutOfMemoryError at line number 254. Line number 254 is set.add(s);, which is also inside the same loop.

My JSON files are really, really big: gigabytes, even terabytes. I have no idea how to avoid the above issue.
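For reference, here is the hash-index step in isolation as a self-contained sketch. The wordMap entries below are made-up stand-ins (the real table is built elsewhere in HashCreator), and a TreeSet is used to do the dedupe and alphabetic sort in one step:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class HashIndexSketch {
    // Stand-in for the real wordMap lookup table; these entries are invented.
    static final Map<String, String> wordMap = new HashMap<>();
    static {
        wordMap.put("acarpous", "057");
        wordMap.put("barnyard", "1$k");
        wordMap.put("gleet", "983");
    }

    // Dedupe the words of title+body, sort them alphabetically, and
    // concatenate the code of every word that appears in wordMap.
    static String hashIndex(String titleAndBody) {
        String cleaned = titleAndBody
                .replaceAll("\\.(?=\\s|$)", " ")  // drop sentence-ending full stops
                .replaceAll(",", " ")
                .toLowerCase();
        // TreeSet removes duplicates and keeps the words sorted
        Set<String> words = new TreeSet<>(Arrays.asList(cleaned.split("\\s+")));
        StringBuilder hash = new StringBuilder();
        for (String w : words) {
            String code = wordMap.get(w);
            if (code != null) hash.append(code);
        }
        return hash.toString();
    }

    public static void main(String[] args) {
        // sorted unique words: acarpous, barnyard, gleet
        System.out.println(hashIndex("Gleet barnyard, acarpous gleet."));  // prints 0571$k983
    }
}
```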

  • If the files are that big you simply cannot have them in memory. Think of another solution. Commented Feb 4, 2014 at 15:47
  • At the end of the loop you are calling jsonObjectHolder.add - as this is not a local variable I assume it is instance scoped. This means that you are holding onto all the objects you unmarshal from JSON in memory. You cannot do this - you have to stream the objects back out again so the memory can be freed. Commented Feb 4, 2014 at 16:02
  • @BoristheSpider: Yes. But then how can I get the data inside this ArrayList outside of the loop? Commented Feb 4, 2014 at 16:17
  • You simply cannot have a collection in memory holding all the data. You must look to file based solutions. You can have part of the data in memory but it must be cleared out before you can load the next part. A database would do this for you automatically and you could query it at will. Alternatively you can look at things like a B-tree to store structured, queryable, data in a file. Commented Feb 4, 2014 at 16:30
  • @BoristheSpider: I am following your advice now. Will update you. Commented Feb 4, 2014 at 16:32

1 Answer


Use a streaming JSON library like Jackson. Read in some JSON objects, add the hash, and write them out. Then read in some more, process them, and write them out. Keep going until you have processed all the objects.

http://wiki.fasterxml.com/JacksonInFiveMinutes#Streaming_API_Example

(See also this Stack Overflow post: Is there a streaming API for JSON?)
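The same streaming idea also works with the GSON classes already in the question: parse one object from the JsonReader, enrich it, serialize it straight to a JsonWriter, and let it go out of scope instead of collecting everything in jsonObjectHolder. A minimal sketch of that pattern (computeHash is a placeholder for the existing hash logic, not a real method in your code):

```java
import com.google.gson.Gson;
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;
import com.google.gson.stream.JsonReader;
import com.google.gson.stream.JsonWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;

public class StreamingHashRewrite {
    private static final Gson GSON = new Gson();

    // Reads a JSON array from 'in', adds hash_index/primary_key to each
    // object, and writes the result to 'out' one object at a time, so
    // only a single object is ever held in memory.
    public static void rewrite(Reader in, Writer out) throws IOException {
        JsonReader reader = new JsonReader(in);
        JsonWriter writer = new JsonWriter(out);
        JsonParser parser = new JsonParser();
        reader.beginArray();
        writer.beginArray();
        int primaryKey = 0;
        while (reader.hasNext()) {
            JsonObject obj = parser.parse(reader).getAsJsonObject();
            obj.addProperty("hash_index", computeHash(obj)); // your existing logic
            obj.addProperty("primary_key", primaryKey++);
            GSON.toJson(obj, writer); // stream it out; do NOT keep a reference
        }
        reader.endArray();
        writer.endArray();
        writer.close();
        reader.close();
    }

    // Placeholder for the title/body hashing shown in createHash().
    private static String computeHash(JsonObject obj) {
        return "";
    }
}
```

Because each object is written as soon as it is processed, heap use stays roughly constant no matter how large the input file is.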
