
I have a big database made of several JSON files. They contain tweets, and each tweet looks like this:

{"object": "Message",
"action": "create",
"data": {"id": 152376374, "body": "@PeakySellers right there with you brother. 368$", "created_at": "2019-01-31T23:59:56Z",
"user": {"id": 971815, "username": "mtothe5thpower"}}
}

One file has 3 million rows and is more than 5 GB in size. I use pandas to read the file, and it works well:

data2 = pd.read_table('file', sep="\n", header=None)

Now each row of the resulting DataFrame holds one element (a tweet like the one above) as a string. To access the fields, I convert each string to a dictionary using the code below:

data2["dic"] = ""
for i, row in data2.itertuples():
    data2["dic"][i] = json.loads(data2[0][i])

While this code successfully converts each string to a dictionary, it is very slow. I think there should be a faster way for this task. Thank you in advance for any help or suggestions.

  • This doesn't answer your question but you should enumerate like for i, row in enumerate(data2.itertuples()):. That will save you from having to manage i yourself. Commented May 7, 2020 at 18:16
  • Thank you, Daniel. I am learning. Commented May 7, 2020 at 18:17
  • No problem! That's why we're all here. =) Commented May 7, 2020 at 18:18
  • Is every json string on separate line? Commented May 7, 2020 at 18:24
  • @AndrejKesely Yes. Every line is a separate json. Commented May 7, 2020 at 18:27
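To illustrate the enumerate() suggestion from the comment above, here is a minimal, self-contained sketch; the two-row DataFrame is made up for demonstration, standing in for the real one produced by pd.read_table:

```python
import json
import pandas as pd

# Tiny stand-in for the real DataFrame produced by
# pd.read_table(..., sep="\n", header=None): one JSON string per row.
data2 = pd.DataFrame({0: ['{"id": 1, "body": "first"}',
                          '{"id": 2, "body": "second"}']})

# enumerate() supplies the row counter i, so there is no need to manage
# it by hand; collecting results into a list avoids per-cell writes.
dics = []
for i, row in enumerate(data2.itertuples(index=False)):
    dics.append(json.loads(row[0]))

data2["dic"] = dics
```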

1 Answer

You could load the huge file at once using readlines(), join the lines into one huge JSON array string, and decode it with a single json.loads() call.

Benchmark (using a file with 100k JSON rows):

import json
import pandas as pd
from timeit import timeit

def f1():
    # Original approach: parse each row separately,
    # writing into the DataFrame one cell at a time.
    data2 = pd.read_table('huge.txt', sep="\n", header=None)

    data2['dic'] = ''

    for i in range(len(data2[0])):
        data2["dic"][i] = json.loads(data2[0][i])

    return data2

def f2():
    # Faster approach: read all lines, wrap them in '[...]' to form one
    # big JSON array, and decode everything with a single json.loads().
    with open('huge.txt', 'r') as r:
        l = r.readlines()

    s = '[' + ','.join(l) + ']'

    data = json.loads(s)

    return pd.DataFrame({'dic': data})

t1 = timeit(lambda: f1(), number=1)
t2 = timeit(lambda: f2(), number=1)

print(t1)
print(t2)

Prints on my machine (AMD 2400G, Python 3.8):

102.88191884600383
0.30581573898962233

So this method seems a lot faster.
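For completeness, the same result can also be reached without building the intermediate string at all: parse each line with a list comprehension, or let pandas read newline-delimited JSON directly via pd.read_json(..., lines=True) (which expands the JSON keys into columns rather than keeping one dict per row). A minimal sketch, with a made-up two-line sample file standing in for huge.txt; relative speed will depend on the pandas version:

```python
import json
import pandas as pd

# Two-line JSON-lines sample standing in for the real 'huge.txt'.
with open('sample.txt', 'w') as w:
    w.write('{"id": 1, "body": "first"}\n')
    w.write('{"id": 2, "body": "second"}\n')

# Variant A: one dict per row, like f2() but without joining a giant string.
with open('sample.txt') as r:
    data = pd.DataFrame({'dic': [json.loads(line) for line in r]})

# Variant B: pandas parses the newline-delimited JSON itself and
# expands the keys into ordinary columns.
flat = pd.read_json('sample.txt', lines=True)
```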


1 Comment

Thank you @Andrej. I used your code on a smaller file, and it worked much faster than the initial code. Nice job.
