
UPDATE: I have adapted my code according to suggestions in the first reply but still an error is produced.

I have written the following code to parse a very large json file:

    public static void main(String[] args) throws Exception {
        String jsonFile="/home/zz/Work/data/wdc/WDC_ProdMatch/idclusters.json";
        WDCProdMatchDatasetIndexer_2 indexer = new WDCProdMatchDatasetIndexer_2();
        indexer.readClusterMetadata(jsonFile);
    }

    public void readClusterMetadata(String jsonFile){
        try(JsonReader jsonReader = new JsonReader(
                new InputStreamReader(
                        new FileInputStream(jsonFile), StandardCharsets.UTF_8))) {

            Gson gson = new GsonBuilder().create();

            jsonReader.beginObject(); //start of json array
            int numberOfRecords = 0;
            while (jsonReader.hasNext()){ //next json array element
                Cluster c = gson.fromJson(jsonReader, Cluster.class);
                long[] sizeInfo=new long[]{c.clusterSizeInOffers, c.size};
                //clusterSize.put(String.valueOf(c.id), sizeInfo);

                numberOfRecords++;
                if (numberOfRecords%1000==0)
                    System.out.println(String.format("processed %d clusters", numberOfRecords));
            }
            jsonReader.endArray();
            System.out.println("Total Records Found : "+numberOfRecords);
        }
        catch (Exception e) {
            e.printStackTrace();
        }
    }


    class ArrayAsStringJsonDeserializer implements JsonDeserializer<List<String>> {
        @Override
        public List<String> deserialize(JsonElement json, Type typeOfT, JsonDeserializationContext context) throws JsonParseException {
            String value = json.getAsString().trim();
            value = value.substring(1, value.length() - 1);

            return Arrays.stream(value.split(",")).map(String::trim).collect(Collectors.toList());
        }
    }


    class Cluster {
        protected long id;
        protected long size;

        @SerializedName("cluster_size_in_offers")
        protected long clusterSizeInOffers;

        @JsonAdapter(ArrayAsStringJsonDeserializer.class)
        @SerializedName("id_values")
        protected List<String> idValues;

        @SerializedName("categoryDensity")
        protected double catDensity;

        @SerializedName("category")
        protected String cat;
    }

And the data file looks like this (first 10 lines):

{"size":4,"cluster_size_in_offers":1,"id_values":"[814914023129, w2190254, pfl60gs25ssdr,  pfl60gs25ssdr]","id":2,"categoryDensity":1,"category":"Computers_and_Accessories"}
{"size":2,"cluster_size_in_offers":1,"id_values":"[hst322440ss, g1042641]","id":3,"categoryDensity":1,"category":"Office_Products"}
{"size":4,"cluster_size_in_offers":1,"id_values":"[4051329063869, t24datr01765, t24datr01763, datr01763]","id":4,"categoryDensity":1,"category":"Automotive"}
{"size":2,"cluster_size_in_offers":1,"id_values":"[5057195062301, sppct335a2bl]","id":7,"categoryDensity":1,"category":"Office_Products"}
{"size":3,"cluster_size_in_offers":1,"id_values":"[ 845173001269,  mpnlkbusmokeam89us, lkbusmokeam89]","id":8,"categoryDensity":1,"category":"Computers_and_Accessories"}
{"size":2,"cluster_size_in_offers":1,"id_values":"[ksw26r0100, g1104817]","id":9,"categoryDensity":1,"category":"Other_Electronics"}
{"size":2,"cluster_size_in_offers":1,"id_values":"[5054328719897, ltr12x31r685c15]","id":11,"categoryDensity":1,"category":"Office_Products"}
{"size":2,"cluster_size_in_offers":1,"id_values":"[model82226, sirsir822261]","id":15,"categoryDensity":1,"category":"Tools_and_Home_Improvement"}
{"size":2,"cluster_size_in_offers":1,"id_values":"[5054328970724, sscl3816114a2bl]","id":17,"categoryDensity":1,"category":"Office_Products"}
{"size":2,"cluster_size_in_offers":1,"id_values":"[814882011647, 203932664]","id":20,"categoryDensity":1,"category":"Tools_and_Home_Improvement"}

But when the code is run on this data, an error is generated as follows:

com.google.gson.JsonSyntaxException: java.lang.IllegalStateException: Expected BEGIN_OBJECT but was NAME at line 1 column 3 path $.
    at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$Adapter.read(ReflectiveTypeAdapterFactory.java:226)
    at com.google.gson.Gson.fromJson(Gson.java:927)
    at uk.ac.shef.inf.wop.indexing.WDCProdMatchDatasetIndexer_2.readClusterMetadata(WDCProdMatchDatasetIndexer_2.java:38)
    at uk.ac.shef.inf.wop.indexing.WDCProdMatchDatasetIndexer_2.main(WDCProdMatchDatasetIndexer_2.java:25)
Caused by: java.lang.IllegalStateException: Expected BEGIN_OBJECT but was NAME at line 1 column 3 path $.
    at com.google.gson.stream.JsonReader.beginObject(JsonReader.java:385)
    at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$Adapter.read(ReflectiveTypeAdapterFactory.java:215)
    ... 3 more

Any suggestions please?

  • Each line of the file is an object, so you do need to use beginObject, not beginArray... What error does that give? Commented Dec 10, 2019 at 14:50
  • Thanks for your quick reply, I just updated with the information. Error message at the end of the post Commented Dec 10, 2019 at 14:53
  • I would suggest reading the file line by line, then just deserialize the whole string like normal in Gson. Then you can avoid the streaming logic (which needs to open and close the object within the loop, not outside) Commented Dec 10, 2019 at 14:57
  • Your code expects the JSON to start with an array, but if you look at the JSON, it starts with an object. That is why this exception is raised. Commented Dec 10, 2019 at 14:58
  • Thank you cricket... I guess I need to check with the data provider that they have indeed separated the data records line by line; Bilal, I also tried changing line A to 'beginObject' but that gives a different error as stated at the end of the post. Thanks Commented Dec 10, 2019 at 16:09

1 Answer

Each line in the file is a separate JSON object. One complication is that the id_values array is wrapped in quotes, which makes it a JSON string primitive rather than an array. You need to provide a custom deserialiser for it that strips the surrounding brackets from the string and splits the items by comma (,). An example solution could look like this:

import com.google.gson.Gson;
import com.google.gson.GsonBuilder;
import com.google.gson.JsonDeserializationContext;
import com.google.gson.JsonDeserializer;
import com.google.gson.JsonElement;
import com.google.gson.JsonParseException;
import com.google.gson.annotations.JsonAdapter;
import com.google.gson.annotations.SerializedName;
import lombok.Data;
import lombok.ToString;

import java.io.File;
import java.io.IOException;
import java.lang.reflect.Type;
import java.nio.file.Files;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class GsonApp {

    public static void main(String[] args) throws Exception {
        File jsonFile = new File("./resource/test.json").getAbsoluteFile();

        List<Cluster> clusters = readClusters(jsonFile);
        clusters.forEach(System.out::println);
    }

    private static List<Cluster> readClusters(File jsonFile) throws IOException {
        Gson gson = new GsonBuilder().create();

        try (Stream<String> lines = Files.lines(jsonFile.toPath())) {
            return lines
                    .map(line -> gson.fromJson(line, Cluster.class))
                    .collect(Collectors.toList());
        }
    }
}

class ArrayAsStringJsonDeserializer implements JsonDeserializer<List<String>> {
    @Override
    public List<String> deserialize(JsonElement json, Type typeOfT, JsonDeserializationContext context) throws JsonParseException {
        // id_values arrives as a JSON string such as "[a, b, c]", not as a JSON array
        String value = json.getAsString().trim();
        // drop the leading '[' and the trailing ']'
        value = value.substring(1, value.length() - 1);

        // split on commas and trim the stray whitespace around each item
        return Arrays.stream(value.split(",")).map(String::trim).collect(Collectors.toList());
    }
}

@Data
@ToString
class Cluster {
    protected long id;
    protected long size;

    @SerializedName("cluster_size_in_offers")
    protected long clusterSizeInOffers;

    @JsonAdapter(ArrayAsStringJsonDeserializer.class)
    @SerializedName("id_values")
    protected List<String> idValues;

    @SerializedName("categoryDensity")
    protected int catDensity;

    @SerializedName("category")
    protected String cat;
}

The above code prints:

Cluster(id=2, size=4, clusterSizeInOffers=1, idValues=[814914023129, w2190254, pfl60gs25ssdr, pfl60gs25ssdr], catDensity=1, cat=Computers_and_Accessories)
Cluster(id=3, size=2, clusterSizeInOffers=1, idValues=[hst322440ss, g1042641], catDensity=1, cat=Office_Products)
Cluster(id=4, size=4, clusterSizeInOffers=1, idValues=[4051329063869, t24datr01765, t24datr01763, datr01763], catDensity=1, cat=Automotive)
...
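
One edge case worth noting: with the split logic above, an empty value such as "[]" yields a list containing a single empty string, because "".split(",") returns one empty element. The sample data does not show whether empty lists occur, so the following is only a sketch of how the unwrapping step could guard against that. The class name BracketedList and its parse method are hypothetical helpers, not part of Gson, and the code uses only the JDK:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper (not part of Gson): parses strings such as "[a, b,  c]".
final class BracketedList {
    private BracketedList() {}

    static List<String> parse(String raw) {
        String value = raw.trim();
        // Fail with a clear message instead of a StringIndexOutOfBoundsException.
        if (value.length() < 2
                || value.charAt(0) != '['
                || value.charAt(value.length() - 1) != ']') {
            throw new IllegalArgumentException("Expected a bracketed list but got: " + raw);
        }
        String inner = value.substring(1, value.length() - 1).trim();
        // "".split(",") would return one empty element, so handle "[]" explicitly.
        if (inner.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.stream(inner.split(","))
                .map(String::trim)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(parse("[ 845173001269,  mpnlkbusmokeam89us, lkbusmokeam89]"));
        System.out.println(parse("[]"));
    }
}
```

Inside the deserializer, the body could then presumably be reduced to a single call such as BracketedList.parse(json.getAsString()).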

4 Comments

Many thanks for taking the time to work through my code. I have adapted my code according to yours, but because my data is huge (9GB) I need to process it as a stream. My adapted code is now reflected in the original post, but I still get the error described above.
@Ziqi, you cannot combine an explicit jsonReader.beginObject() with gson.fromJson(jsonReader, Cluster.class): beginObject() consumes the opening brace of the first record, so Gson then finds a NAME token where it expects BEGIN_OBJECT, which is the error you see. The file is also not a single JSON document, because there is a new JSON object on every line. You should remove all lines like jsonReader.beginObject(); you cannot parse it like this. If you want to read only the first 1000 rows, you can limit the Stream by adding .limit(1000) before collect. Or, if you do not want to collect and keep all the data, you can call the forEach method instead.
OK, so am I right that your code reads one line at a time and parses it into an object? I.e., it no longer uses the JsonReader to handle the stream (Files.lines...). If so, I think I know how to adapt it. Thanks.
@Ziqi, exactly. Files.lines(jsonFile.toPath()) returns a Stream that reads the file lazily, line by line. So, if you want to process all lines and save them to a DB, for example, you can add a forEach call to the stream chain and save each POJO to the DB. It should be quite easy to change my code.
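
The forEach variant discussed in these comments could be sketched roughly as follows. This is a minimal illustration, not the answerer's code: the class LineRecords and the method forEachRecord are hypothetical names, and the parser is passed in as a function so the sketch runs with the JDK alone. With Gson on the classpath, the parser would presumably be line -> gson.fromJson(line, Cluster.class), as in the answer. Skipping blank lines is an added assumption about the data file.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.stream.Stream;

// Hypothetical helper: handles one record per line without collecting them in memory.
final class LineRecords {
    private LineRecords() {}

    static <T> long forEachRecord(Path file, Function<String, T> parser, Consumer<T> handler)
            throws IOException {
        AtomicLong count = new AtomicLong();
        // Files.lines reads lazily; try-with-resources closes the underlying file.
        try (Stream<String> lines = Files.lines(file, StandardCharsets.UTF_8)) {
            lines.filter(line -> !line.trim().isEmpty()) // assumption: blank lines are skipped
                 .map(parser)
                 .forEach(record -> {
                     handler.accept(record);
                     long n = count.incrementAndGet();
                     if (n % 1000 == 0) {
                         System.out.printf("processed %d records%n", n);
                     }
                 });
        }
        return count.get();
    }

    public static void main(String[] args) throws IOException {
        // Small temporary file standing in for the real 9GB input.
        Path tmp = Files.createTempFile("clusters", ".json");
        Files.write(tmp, Arrays.asList(
                "{\"size\":2,\"id\":3}",
                "{\"size\":4,\"id\":4}"), StandardCharsets.UTF_8);

        // String::trim stands in for line -> gson.fromJson(line, Cluster.class).
        long total = forEachRecord(tmp, String::trim, System.out::println);
        System.out.println("Total Records Found : " + total);
        Files.delete(tmp);
    }
}
```

Because each record is handed to the handler and then dropped, memory use should stay roughly constant regardless of file size, which is the point of replacing collect with forEach here.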
