3

Apologies if this is very simple or has already been asked, I am new to Python and working with json files, so I'm quite confused.

I have a 9 GB json file scraped from a website. This data consists of information about some 3 million individuals. Each individual has attributes, but not all individuals have the same attributes. An attribute corresponds to a key in the json file, like so:

{
  "_id": "in-00000001",
  "name": {
    "family_name": "Trump",
    "given_name": "Donald"
  },
  "locality": "United States",
  "skills": [
    "Twitter",
    "Real Estate",
    "Golf"
     ],
  "industry": "Government",
  "experience": [
  {
    "org": "Republican",
    "end": "Present",
    "start": "January 2017",
    "title": "President of the United States"
  },
  {
    "org": "The Apprentice",
    "end": "2015",
    "start": "2003",
    "title": "The guy that fires people"
  }]
}

So here, _id, name, locality, skills, industry and experience are attributes (keys). Another profile may have additional attributes, like education, awards, interests, or lack some attribute found in another profile, like the skills attribute, and so on.

What I'd like to do is scan through each profile in the json file, and if a profile contains the attributes skills, industry and experience, I'd like to extract that information and insert it into a data frame (I suppose I need Pandas for this?). From experience, I would want to specifically extract the name of their current employer, i.e. the most recent listing under org. The data frame would look like this:

    Industry   | Current employer | Skills
    ___________________________________________________________________
    Government | Republican       | Twitter, Real Estate, Golf
    Marketing  | Marketers R Us   | Branding, Social Media, Advertising

... and so on for all profiles with these three attributes.

I'm struggling to find a good resource that explains how to do this kind of thing, hence my question.

I suppose rough pseudocode would be:

for each profile in open(path to .json file):
    if profile has keys "experience", "industry" AND "skills":
        on the same row of the data frame:
            insert current employer into "current employer" column of 
            data frame
            insert industry into "industry" column of data frame
            insert list of skills into "skills" column of data frame

I just need to know how to write this in Python.

2
  • 1
    I edited my answer many times. Please let me know if it is helpful. Commented Oct 12, 2019 at 22:08
  • 1
    @Attersson Of course, thanks for taking the time to answer my question. It is late where I live, so I will carefully read through it tomorrow and let you know how it goes. Commented Oct 12, 2019 at 22:12

1 Answer 1

2

I assume the file contains all the profiles, such as

{
    "profile 1" : {
        # Full object as in the example above
    },
    "profile 2" : {
        #Full object as in the example above
    }
}

Before continuing, let me show a correct way to use Pandas DataFrames.

Example of better usage of Pandas DataFrames:

Values in a Pandas DataFrame cannot be lists. So we will have to duplicate lines as shown in the example below. Check this question and JD Long's answer for more detail: how to use lists as values in pandas dataframe?

ID      |    Industry   | Current employer | Skill
___________________________________________________________________
in-01   |    Government | Republican       | Twitter
in-01   |    Government | Republican       | Real Estate
in-01   |    Government | Republican       | Golf
in-02   |    Marketing  | Marketers R Us   | Branding
in-02   |    Marketing  | Marketers R Us   | Social Media
in-02   |    Marketing  | Marketers R Us   | Advertising

Find explainations within comments in the code below:

import json
import pandas as pd

# Create a DataFrame df with the columns as in the example
df = pd.DataFrame(data, columns = ['ID', 'Industry','Employer','Skill']) 

#Load the file as json. 
with open(path to .json file) as file:
    #readlines() reads the file as string and loads() loads it into a dict
    obj = json.loads(''.join(file.readlines()))
    #Then iterate its items() as key value pairs
    #But the line of code below depends on my first assumption.
    #Depending on the file format, the line below might have to differ.
    for prof_key, profile in obj.items():
        # Verify if a profile contains all the required keys
        if all(key in profile.keys() for key in ("_id","experience", "industry","skills")):
            for skill in profile["skills"]:
                df.loc[-1] = [profile["_id"],
                              profile["industry"],
                              [x for x in profile["experience"] if x["end"] == "Present"][0]["org"],
                              skill]

The above line, df.loc[-1] = ... inserts a row in the dataframe as last row (index -1).

When later you wish to use this information, you will have to use df.groupby('ID')

Let me know if you have different format(s) in your file(s) and if this explaination is sufficient to get you started or you need more.

Sign up to request clarification or add additional context in comments.

9 Comments

Hi, so it seems the problem with this solution is that my json file is 9GB, so I get a memory error when executing this line: json.loads(file.readlines()). Is there some way to modify the code so that it reads a part of the file, works with it, and then repeats on another part of the file?
To answer your question about valid values of start and end: essentially, anything considered to be a "current position" will have an end value of "present". If the end value is not "present", then it can be ignored. Is that enough information?
Looking at the documentation for readlines(), I can see this function takes a "sizehint" as an input, i.e. the number of bytes to be read from the file. So is it possible to read, say, the first 500MB, process it, and then read the next 500MB,... and so on?
Having tried to read the first 50000000 bytes, I get this error: obj = json.loads(file.readlines(50000000)) # 50 MB File "C:\Users\Jake\AppData\Local\Programs\Python\Python37-32\lib\json\__init__.py", line 341, in loads raise TypeError(f'the JSON object must be str, bytes or bytearray, ' TypeError: the JSON object must be str, bytes or bytearray, not list
Ok you can read a single line with readline. It is not so trivial however to parse the file. You should know what to read, when to stop and what to parse into a json everytime... Also, I am not sure what your file format is... if you got bytearrays and must decode / encode or just str(). Another approach is read(nbytes)
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.