Python parsing unstructured data

Question

I have a file that contains following data in it. I am trying to parse the data.

08/23/21 04:00:05 AM


/* ----------------- data1----------------- */ 

make: honda   model: civic
year: 2019
trim: "lx"
owner: phillip

/* ----------------- data2----------------- */ 

make: toyota  model: highlander
year: 2021
trim: "Platinum"

I want to see the data like this:

Make, Model, Year, trim, Owner
Honda, civic, 2019, lx, phillip
toyota, highlander, 2021, platinum, Rex

here is my code: I was tryin to create dictionary and then load to panda dataframe. I think i am not on right direction.

def fix_line(record):
    #split every field and value into a seperate line
    results = []
    mini_collection = []
    if not record.startswith("/*"):
        #for data in record.rstrip('\n').strip().split('   '):
        for data in record.rstrip('\n').split('   '):
            if ':' not in data:
                mini_collection.append(data)
            else:
                results.append(data)
    return results
                    
def create_dictionary(data):   
    record = {}                
    for line in fix_line(data):
        line = line.strip()
        name, value = line.split(':', 1)
        record[name.strip()] = value.strip()
    return record

Can you elaborate on this format? Is this a standard output from some other program or system? Can we make guarantees about the dataset like that data always starts on line 4, and every block has values for all keys? Or do we have to interpret everything on the fly? — Henry Ecker
– Henry Ecker ♦, Commented Sep 19, 2021 at 2:24
@HenryEcker yes, it is a file from another system. Yes, every data starts from line 4. Some blocks may have more keys and values, and some may have less keys and value. — st_bones
– st_bones, Commented Sep 19, 2021 at 11:53

anon01 · Accepted Answer · 2021-09-19 04:06:46Z

2

Heres one way:

import re
import yaml #python -m pip install pyyaml
import pandas as pd 

s = """08/23/21 04:00:05 AM


/* ----------------- data1----------------- */ 

make: honda
model: civic
year: 2019
trim: lx
owner: phillip

/* ----------------- data2----------------- */ 

make: toyota
model: highlander
year: 2021
trim: Platinum
owner: Rex
"""

lines = re.split("/*\s*/", s)
records = [yaml.load(line) for line in lines if "make:" in line]
df = pd.DataFrame(records)

output:

     make       model  year      trim    owner
0   honda       civic  2019        lx  phillip
1  toyota  highlander  2021  Platinum      Rex

answered Sep 19, 2021 at 4:06

anon01

11.2k8 gold badges41 silver badges64 bronze badges

Sign up to request clarification or add additional context in comments.

3 Comments

st_bones Over a year ago

how do I apply re.split if I have data like this? /* ----------------- data1----------------- / make: honda model: civic year: 2019 trim: lx owner: phillip / ----------------- data2----------------- */ make: toyota model: highlander year: 2021 trim: Platinum owner: Rex

st_bones Over a year ago

@annon01 - I have made some changes to the input file, trim has value with double quote and owner does not exist data2. what changes I should do in your code?

anon01 Over a year ago

you will have to write a custom cleaning function if your data is dirty - there is no easy solution for that

Raymond Toh · Accepted Answer · 2021-09-21 12:32:10Z

0

Try using re.finditer with the following pattern to create dictionary based on finding. Then append to a dataframe.

import re

pattern = """
    (?P<make>(?<=(make:\ ))\w+) #use lookbehind regex to get make
    (\s + model: \ )            #Skip to model
    (?P<model>\w+)              #Get Model
    (\s year: \ )               #Skip to year
    (?P<year>\d+)               #Get year
    (\s + trim: \ ")            #Skip to trim
    (?P<trim>\w+)               #Get trim
    ("\s)                       #Skip to owner
    (?P<owner>.*)               #Get owner
"""

df = pd.DataFrame([item.groupdict() for item in re.finditer(pattern, data, re.VERBOSE)])
df["owner"] = df["owner"].str.replace("owner: ", "")
df
Out[563]: 
     make       model  year      trim    owner
0   honda       civic  2019        lx  phillip
1  toyota  highlander  2021  Platinum

edited Sep 21, 2021 at 12:32

answered Sep 19, 2021 at 3:02

Raymond Toh

7992 gold badges9 silver badges27 bronze badges

7 Comments

Henry Ecker Over a year ago

Not the DV, but never grow a DataFrame. DataFrame.append in a loop has quadratic time complexity. It would be better to add all of those dictionaries to a list then build the DataFrame at the end. df = pd.DataFrame([item.groupdict() for item in re.finditer(pattern, data, re.VERBOSE)])

Raymond Toh Over a year ago

@HenryEcker good idea. i learn something new today. Thanks for feedback!

st_bones Over a year ago

@RaymondToh - I have modified my input file data little bit, like trim has value with double quote and owner does not exist data2 and make and model are on one line. how do I write the pattern for this?

Raymond Toh Over a year ago

@st_bones are you able to provide me a sample? Also, do use list comprehension method provided by Henry and mark a tick to the solution.

Raymond Toh Over a year ago

@st_bones this is the best results I can get using regex. Because data1 and data2 do not have the same pattern. so the str.replace is needed. If this work for you, do upvote and approve the solution :)

|

Collectives™ on Stack Overflow

Python parsing unstructured data

2 Answers 2

3 Comments

7 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

3 Comments

7 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related