I am trying to analyze a large dataset from Yelp. The data is in JSON format, but the file is too large, so my script crashes when it tries to read all of the data at once. So I decided to read it line by line and concatenate the lines into a DataFrame to get a proper sample of the data.

f = open('./yelp_academic_dataset_review.json', encoding='utf-8')

I tried without encoding='utf-8', but that raises an error. I wrote a function that reads the file line by line and builds a pandas DataFrame from up to a given number of lines. Some lines come back as lists, so the script also iterates over each list and adds its items to the DataFrame.

def json_parser(file, max_chunk):
  f = open(file)
  df = pd.DataFrame([])
  for i in range(2, max_chunk + 2):
    try:
      type(f.readlines(i)) == list
      for j in range(len(f.readlines(i))):
        part = json.loads(f.readlines(i)[j])
        df2 = pd.DataFrame(part.items()).T
        df2.columns = df2.iloc[0]
        df2 = df2.drop(0)
        datas = [df2, df]
        df2 = pd.concat(datas)
        df = df2
    except:
      f = open(file, encoding = "utf-8")
      for j in range(len(f.readlines(i))):
        try:
          part = json.loads(f.readlines(i)[j-1])
        except:
          print(i,j)
        df2 = pd.DataFrame(part.items()).T
        df2.columns = df2.iloc[0]
        df2 = df2.drop(0)
        datas = [df2, df]
        df2 = pd.concat(datas)
        df = df2
  df2.reset_index(inplace=True, drop=True) 
  return df2

But I still get an error: list index out of range. (Yes, I used print to debug.) So I looked more closely at the lines that cause this error.

Interestingly, when I try to look at those lines, the script gives me a different list each time. Here is what I mean: (screenshot omitted) I ran the cell repeatedly and got lists of different lengths. So I looked at the lists themselves: (screenshot omitted)

They seem to be completely different lists. Each run returns a different list, although the line number is the same. The readlines documentation is not helping. What am I missing? Thanks in advance.

  • BTW the statement type(...) == list has no effect. Did you mean to write an assertion? Commented Nov 20, 2022 at 13:33
  • Yes, you are right. I edited the code too much and some parts were left over. The file has to be read with utf-8 encoding, otherwise it raises an error; that is why there is a try/except section. That statement doesn't have to do anything. Commented Nov 20, 2022 at 14:18

1 Answer

You are using the expression f.readlines(i) several times as if it referred to the same set of lines each time.

But as a side effect of evaluating the expression, more lines are read from the file on every call. At some point you are basing the indices j on more lines than are actually available, because the length and the indexed list came from different invocations of f.readlines.

You should use f.readlines(i) only once in each iteration of the for i in ... loop and store its result in a variable instead.
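
As a minimal sketch of that point (not the asker's actual data): the in-memory file and the records below are hypothetical stand-ins for the Yelp file, and the hint value 50 is arbitrary. It shows that two consecutive calls return different lists, and that storing the result once keeps the length and the indices consistent.

```python
import io
import json

# Hypothetical stand-in for the Yelp file: one JSON object per line.
records = [{"review_id": n, "stars": n % 5 + 1} for n in range(10)]
f = io.StringIO("".join(json.dumps(r) + "\n" for r in records))

# Each call consumes lines and advances the file position,
# so the second call continues where the first one stopped.
first = f.readlines(50)
second = f.readlines(50)
print(first == second)  # the two lists hold different lines

# Call readlines once, keep the result, and index only into that list.
f.seek(0)
lines = f.readlines(50)
parsed = [json.loads(line) for line in lines]
```

Because len(lines) and lines[j] now refer to the same list, j can no longer run past its end.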

Sign up to request clarification or add additional context in comments.

5 Comments

Let's set the function aside. I looked at the list with f.readlines(3452). Why do I get a different list each time I run only this cell?
I don't know, I assumed you wanted to know why you get an IndexError.
Are you aware that the argument you pass to readlines is a number of bytes, not a number of lines?
No. Actually, I need an explanation of readlines. Can you suggest a source on reading a large JSON file line by line?
I guess I misunderstood the readlines function, so I decided not to use it. Instead, I will iterate over the lines and use line.rstrip(). I consider this thread closed.
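
For readers landing here with the same problem, one possible way to take a sample line by line without readlines is sketched below. The helper name read_json_lines and the small demo file are hypothetical, and the sketch assumes the file holds one JSON object per line.

```python
import itertools
import json
import os
import tempfile


def read_json_lines(path, max_lines):
    """Parse at most max_lines JSON objects (one per line) from path."""
    rows = []
    with open(path, encoding="utf-8") as f:
        # Iterating over the file object yields one line at a time,
        # so the whole file is never loaded into memory.
        for line in itertools.islice(f, max_lines):
            line = line.strip()
            if line:  # skip blank lines
                rows.append(json.loads(line))
    return rows


# Hypothetical demo file standing in for the real dataset.
with tempfile.NamedTemporaryFile(
    "w", suffix=".json", delete=False, encoding="utf-8"
) as tmp:
    for n in range(5):
        tmp.write(json.dumps({"review_id": n, "text": "caf\u00e9"}) + "\n")
    demo_path = tmp.name

sample = read_json_lines(demo_path, 3)
os.remove(demo_path)
```

The resulting list of dicts can be passed to pd.DataFrame(sample) in a single step, which avoids calling pd.concat once per line. pandas can also read this layout directly with pd.read_json(path, lines=True, nrows=...), if a recent enough version is installed.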
