0

I have a file that contains following data in it. I am trying to parse the data.

08/23/21 04:00:05 AM


/* ----------------- data1----------------- */ 

make: honda   model: civic
year: 2019
trim: "lx"
owner: phillip

/* ----------------- data2----------------- */ 

make: toyota  model: highlander
year: 2021
trim: "Platinum"

I want to see the data like this:

Make, Model, Year, trim, Owner
Honda, civic, 2019, lx, phillip
toyota, highlander, 2021, platinum, Rex

here is my code: I was tryin to create dictionary and then load to panda dataframe. I think i am not on right direction.

def fix_line(record):
    #split every field and value into a seperate line
    results = []
    mini_collection = []
    if not record.startswith("/*"):
        #for data in record.rstrip('\n').strip().split('   '):
        for data in record.rstrip('\n').split('   '):
            if ':' not in data:
                mini_collection.append(data)
            else:
                results.append(data)
    return results
                    
def create_dictionary(data):   
    record = {}                
    for line in fix_line(data):
        line = line.strip()
        name, value = line.split(':', 1)
        record[name.strip()] = value.strip()
    return record
2
  • Can you elaborate on this format? Is this a standard output from some other program or system? Can we make guarantees about the dataset like that data always starts on line 4, and every block has values for all keys? Or do we have to interpret everything on the fly? Commented Sep 19, 2021 at 2:24
  • @HenryEcker yes, it is a file from another system. Yes, every data starts from line 4. Some blocks may have more keys and values, and some may have less keys and value. Commented Sep 19, 2021 at 11:53

2 Answers 2

2

Heres one way:

import re
import yaml #python -m pip install pyyaml
import pandas as pd 

s = """08/23/21 04:00:05 AM


/* ----------------- data1----------------- */ 

make: honda
model: civic
year: 2019
trim: lx
owner: phillip

/* ----------------- data2----------------- */ 

make: toyota
model: highlander
year: 2021
trim: Platinum
owner: Rex
"""

lines = re.split("/*\s*/", s)
records = [yaml.load(line) for line in lines if "make:" in line]
df = pd.DataFrame(records)

output:

     make       model  year      trim    owner
0   honda       civic  2019        lx  phillip
1  toyota  highlander  2021  Platinum      Rex
Sign up to request clarification or add additional context in comments.

3 Comments

how do I apply re.split if I have data like this? /* ----------------- data1----------------- / make: honda model: civic year: 2019 trim: lx owner: phillip / ----------------- data2----------------- */ make: toyota model: highlander year: 2021 trim: Platinum owner: Rex
@annon01 - I have made some changes to the input file, trim has value with double quote and owner does not exist data2. what changes I should do in your code?
you will have to write a custom cleaning function if your data is dirty - there is no easy solution for that
0

Try using re.finditer with the following pattern to create dictionary based on finding. Then append to a dataframe.

import re

pattern = """
    (?P<make>(?<=(make:\ ))\w+) #use lookbehind regex to get make
    (\s + model: \ )            #Skip to model
    (?P<model>\w+)              #Get Model
    (\s year: \ )               #Skip to year
    (?P<year>\d+)               #Get year
    (\s + trim: \ ")            #Skip to trim
    (?P<trim>\w+)               #Get trim
    ("\s)                       #Skip to owner
    (?P<owner>.*)               #Get owner
"""

df = pd.DataFrame([item.groupdict() for item in re.finditer(pattern, data, re.VERBOSE)])
df["owner"] = df["owner"].str.replace("owner: ", "")
df
Out[563]: 
     make       model  year      trim    owner
0   honda       civic  2019        lx  phillip
1  toyota  highlander  2021  Platinum 

7 Comments

Not the DV, but never grow a DataFrame. DataFrame.append in a loop has quadratic time complexity. It would be better to add all of those dictionaries to a list then build the DataFrame at the end. df = pd.DataFrame([item.groupdict() for item in re.finditer(pattern, data, re.VERBOSE)])
@HenryEcker good idea. i learn something new today. Thanks for feedback!
@RaymondToh - I have modified my input file data little bit, like trim has value with double quote and owner does not exist data2 and make and model are on one line. how do I write the pattern for this?
@st_bones are you able to provide me a sample? Also, do use list comprehension method provided by Henry and mark a tick to the solution.
@st_bones this is the best results I can get using regex. Because data1 and data2 do not have the same pattern. so the str.replace is needed. If this work for you, do upvote and approve the solution :)
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.