I have an assignment that consists of a set of questions and a large JSON file of objects. The file contains around 5 million objects and is 303 MB in size.

This large file can be downloaded here.

A small preview of its contents:

{ Reviewer:1, Movie:1535440, Grade:4, Date:'2005-08-18'}, 
{ Reviewer:1, Movie:1426604, Grade:4, Date:'2005-09-01'}, 
{ Reviewer:1, Movie:1815755, Grade:5, Date:'2004-07-20'}, 
{ Reviewer:2, Movie:2059652, Grade:4, Date:'2005-09-05'}, 
{ Reviewer:2, Movie:1666394, Grade:3, Date:'2005-04-19'}, 
{ Reviewer:2, Movie:1759415, Grade:4, Date:'2005-04-22'},

Each row represents one review: the reviewer's ID, the movie's ID, the grade the reviewer gave the movie, and the date (as a string).

I need to import this file into my .NET console app and deserialize it into objects so that I can work with them, for example to build lists of objects and write methods over them.

Example questions:

  1. Given a parameter N, how many reviews did reviewer N write?

(This should be a method that takes the reviewer's ID as its parameter; one reviewer can write multiple reviews of different movies.)

  2. Which reviewer(s) wrote the most reviews?

The problem is that deserialization alone takes around 10 seconds every time I load the file, while the requirement is that each method may take at most 4 seconds. Even if I deserialize only a single field, it still takes too long.

Do you know of any efficient approaches or NuGet packages that could load this data in under 4 seconds? So far I have only tried Newtonsoft.Json.

I found an interesting article, but I could not get its code working because the snippets are incomplete and I could not fill in the gaps. Here is the link to that article.

I would be grateful for any ideas or help.

  • Both questions are trivially cheated without any conversion of JSON data by doing regex matches. This can be further optimized if all the lines are of perfectly uniform length (dangerous as that would be to rely on for real data). Commented Oct 25, 2018 at 14:43
  • Are you expecting a new file every time you run your program? Is there a reason you couldn't deserialize the objects and then access the data later? Commented Oct 25, 2018 at 14:44
  • Do you need to deserialize in the methods? Perhaps you can deserialize earlier and store your data in memory. Then your methods can simply do your counting for you. Commented Oct 25, 2018 at 14:44
  • If you don't have any code, how do you know it takes 10 seconds to deserialize the JSON? Commented Oct 25, 2018 at 15:14
  • From how you explain it, the 4 second requirement might only apply to the methods after performing the deserialization. Commented Oct 25, 2018 at 15:29
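
The regex idea from the first comment could look roughly like the sketch below. It assumes every review carries a `Reviewer:<digits>` key formatted as in the preview (optionally with spaces around the colon); the class name `RegexReviewCounter` and its method are hypothetical, and the approach has not been timed against the real file.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper illustrating the "count with regex matches" comment.
public static class RegexReviewCounter
{
    // Captures the digits that follow a "Reviewer:" key, as seen in the preview rows.
    private static readonly Regex ReviewerPattern =
        new Regex(@"Reviewer\s*:\s*(\d+)", RegexOptions.Compiled);

    // Counts the reviews written by reviewerId without building any review objects.
    // "lines" can be any sequence of text lines, e.g. File.ReadLines(path).
    public static int CountReviews(IEnumerable<string> lines, long reviewerId)
    {
        int count = 0;
        foreach (string line in lines)
        {
            // A line may hold several objects, so inspect every match on it.
            foreach (Match match in ReviewerPattern.Matches(line))
            {
                long id = long.Parse(match.Groups[1].Value, CultureInfo.InvariantCulture);
                if (id == reviewerId)
                {
                    count++;
                }
            }
        }
        return count;
    }
}
```

Passing `File.ReadLines(@"ratings.json")` as `lines` would stream the file rather than load it whole. As the comment warns, this relies on the exact text layout, so it is more fragile than a real JSON reader.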

1 Answer

I decided to give it a try so I created some code that can be used to answer the 2 example questions posted by the OP. The best I could do was to get the results in less than 7 seconds, not 4 as requested by the OP :-(

Parsing the reviewer Ids

The following method returns all the reviewer IDs in the file. It uses a JsonTextReader to extract only the value of the Reviewer property, without deserializing each whole JSON object:

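// Requires the Newtonsoft.Json NuGet package.
using System.Collections.Generic;
using System.IO;
using Newtonsoft.Json;

// Streams the file token by token and lazily yields each Reviewer value.
private static IEnumerable<long> GetReviewerIds(string path)
{
    using (StreamReader streamReader = File.OpenText(path))
    using (JsonTextReader reader = new JsonTextReader(streamReader))
    {
        // Close the underlying StreamReader when the JSON reader is closed.
        reader.CloseInput = true;

        while (reader.Read())
        {
            // Only the Reviewer property is of interest; all other tokens are skipped.
            if (reader.TokenType == JsonToken.PropertyName && reader.Value.Equals("Reviewer"))
            {
                // Advance to the property's value and read it as an integer.
                int? id = reader.ReadAsInt32();

                if (id.HasValue)
                {
                    yield return id.Value;
                }
            }
        }
    }
}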

Getting the number of reviews by the reviewer with ID 4

int reviewerId = 4;    
var stopWatch = Stopwatch.StartNew();

int numberOfReviews = GetReviewerIds(@"ratings.json").Count(x => x == reviewerId);

stopWatch.Stop();
Console.WriteLine($"Number of reviews: {numberOfReviews}; Execution time: {stopWatch.Elapsed:g}");

Output:

Number of reviews: 142; Execution time: 0:00:06.2137702

Getting the top N reviewers

int numberOfReviewers = 3;
var stopWatch = Stopwatch.StartNew();

var reviewers = GetReviewerIds(@"ratings.json")
    .GroupBy(x => x)
    .Select(x => new
    {
        Id = x.Key,
        Count = x.Count()
    })
    .OrderByDescending(x => x.Count)
    .Take(numberOfReviewers)
    .ToList();

stopWatch.Stop();
Console.WriteLine($"Reviewer with ID {reviewers.First().Id} has done {reviewers.First().Count} reviews; Execution time: {stopWatch.Elapsed:g}");

Output:

Reviewer with ID 571 has done 154832 reviews; Execution time: 0:00:06.1256635
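
Both snippets above re-read the whole file for every query. If the file does not change between queries, one option is to make a single pass and keep the per-reviewer counts in memory, so that each question is answered from the counts alone. The following is a minimal sketch under that assumption; `ReviewStats` and its method names are hypothetical and not part of the code above. It accepts any sequence of reviewer IDs, and `MostActive` returns every reviewer tied for the highest count, since the question asks for "reviewer(s)".

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: tallies reviews once, then answers queries from memory.
public static class ReviewStats
{
    // Makes one pass over the reviewer IDs and counts the reviews per reviewer.
    public static Dictionary<long, int> CountByReviewer(IEnumerable<long> reviewerIds)
    {
        var counts = new Dictionary<long, int>();
        foreach (long id in reviewerIds)
        {
            // "current" is 0 when the ID has not been seen yet.
            counts.TryGetValue(id, out int current);
            counts[id] = current + 1;
        }
        return counts;
    }

    // Question 1: number of reviews by one reviewer (0 for an unknown ID).
    public static int ReviewsBy(Dictionary<long, int> counts, long reviewerId)
    {
        return counts.TryGetValue(reviewerId, out int n) ? n : 0;
    }

    // Question 2: all reviewers sharing the highest review count, in ascending ID order.
    public static List<long> MostActive(Dictionary<long, int> counts)
    {
        if (counts.Count == 0)
        {
            return new List<long>();
        }

        int max = counts.Values.Max();
        return counts.Where(pair => pair.Value == max)
                     .Select(pair => pair.Key)
                     .OrderBy(id => id)
                     .ToList();
    }
}
```

With the `GetReviewerIds` method above, `ReviewStats.CountByReviewer(GetReviewerIds(@"ratings.json"))` would be called once at startup. That one-time pass still costs whatever the reader costs, but each later query is a dictionary lookup or a single scan of the counts rather than another read of the file.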

2 Comments

Thank you, Rui, for helping me and for creating the methods. You helped me a lot. I was later informed that the 4-second limit applies only to the methods, not to the deserialization itself. So I call the deserialization method (from the example you gave me) immediately after startup and then call the methods. Using the reader was a good idea. Thank you again!
No problem @MapeSVK ;-)
