1

I have a text file with several one-line strings that are not always laid out in the same order, but usually contain some of the same information.

Ex.

(Names RED (property (x 123) (y 456) (type MT) (label ONE) (code XYZ)))
(Names GREEN (property (type MX) (label TWO) (x 789) (y 101)))

Not everything needs to be read from each line; in this example I only need 'Names', 'x', 'y', 'label' and 'code'. Assuming I have a couple hundred lines that look like the example, is it possible to easily get the data I want out of each line? Ideally I would load the information into a pandas DataFrame, but the question is mainly about how to write a regex for these strings, given that there is no real pattern.

An example of what the DataFrame might look like (if this helps with understanding the question):

Names   x   y   label   code
RED    123 456   ONE    XYZ
GREEN  789 101   TWO 

Is regex even the best approach to this problem? There is no real pattern that I have found when looking at all of the lines so it may not be ideal.

6
  • Not sure why you're getting downvoted; "is this a job for a regex" is a perfectly answerable (and valid) question. I've posted an implementation below using a regex in two steps, and using a dict of lists to build the DataFrame. Commented Feb 20, 2019 at 15:19
  • @mfitzp probably because no attempts have been made by OP (or at least none shown here). I didn't do it though Commented Feb 20, 2019 at 15:27
  • @mfitzp the question wasn't "how do I do this" but rather "should I go down this avenue or save the energy and go for a smarter approach", hence the reason I didn't try anything. Commented Feb 20, 2019 at 15:30
  • @MaxB Not sure if that validates as a good question, certainly wish I was a SO bot now. Commented Feb 20, 2019 at 15:34
  • @RickM. I appreciate the concern Commented Feb 20, 2019 at 16:12

4 Answers

3

The pattern is regular aside from the properties being in any order, so it's certainly doable. I've done this in two steps — one regex to grab the colour at the beginning and extract the properties string, and a second to extract the properties.

import re


inputs = [
'(Names RED (property (x 123) (y 456) (type MT) (label ONE) (code XYZ)))',
'(Names GREEN (property (type MX) (label TWO) (x 789) (y 101)))'
]

# Get the initial part, and chop off the property inner string
# (raw strings keep the backslashes from being treated as string escapes)
initial_re = re.compile(r'^\(Names\s([^\s]*)\s\(property\s(.*)\)\)')
# Get all groups from (x 123) (y 456) (type MT) (label ONE) (code XYZ)
prop_re = re.compile(r'\(([^\s]*)\s([^\s]*)\)')

for s in inputs:
    parts = initial_re.match(s)
    color = parts.group(1)
    props = parts.group(2)
    # e.g. (x 123) (y 456) (type MT) (label ONE) (code XYZ)
    properties = prop_re.findall(props)
    # [('x', '123'), ('y', '456'), ('type', 'MT'), ('label', 'ONE'), ('code', 'XYZ')]
    print("%s: %s" % (color, properties))

The output given is

RED: [('x', '123'), ('y', '456'), ('type', 'MT'), ('label', 'ONE'), ('code', 'XYZ')]
GREEN: [('type', 'MX'), ('label', 'TWO'), ('x', '789'), ('y', '101')]

To get this into pandas you can accumulate the properties in a dictionary of lists (I've done this below using defaultdict). You need to store something for missing values so that all columns are the same length; here I just store None (or null). Finally, use pd.DataFrame.from_dict to get your final DataFrame.

import re
import pandas as pd
from collections import defaultdict

inputs = [
'(Names RED (property (x 123) (y 456) (type MT) (label ONE) (code XYZ)))',
'(Names GREEN (property (type MX) (label TWO) (x 789) (y 101)))'
]

# Get the initial part, and chop off the property inner string
# (raw strings keep the backslashes from being treated as string escapes)
initial_re = re.compile(r'^\(Names\s([^\s]*)\s\(property\s(.*)\)\)')
# Get all groups from (x 123) (y 456) (type MT) (label ONE) (code XYZ)
prop_re = re.compile(r'\(([^\s]*)\s([^\s]*)\)')

columns = ['color', 'x', 'y', 'type', 'label', 'code']

data_dict = defaultdict(list)

for s in inputs:
    parts = initial_re.match(s)
    color = parts.group(1)
    props = parts.group(2)
    # e.g. (x 123) (y 456) (type MT) (label ONE) (code XYZ)
    properties = dict(prop_re.findall(props))
    properties['color'] = color

    for k in columns:
        v = properties.get(k)  # None if missing
        data_dict[k].append(v)


df = pd.DataFrame.from_dict(data_dict)
print(df)

The final output is

   color    x    y type label  code
0    RED  123  456   MT   ONE   XYZ
1  GREEN  789  101   MX   TWO  None
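
If some of the couple hundred lines do not match at all, `initial_re.match` returns None and the loop above would raise an AttributeError. A minimal sketch of one way to guard against that, assuming malformed lines should simply be skipped; `parse_line`, `wanted` and the malformed sample line are hypothetical names and data added here for illustration:

```python
import re

initial_re = re.compile(r'^\(Names\s(\S*)\s\(property\s(.*)\)\)')
prop_re = re.compile(r'\((\S*)\s(\S*)\)')

# Columns the question asks for (hypothetical variable name).
wanted = ['Names', 'x', 'y', 'label', 'code']


def parse_line(line):
    """Return a dict for one line, or None if the line does not match."""
    parts = initial_re.match(line.strip())
    if parts is None:
        return None
    row = dict(prop_re.findall(parts.group(2)))
    row['Names'] = parts.group(1)
    return row


lines = [
    '(Names RED (property (x 123) (y 456) (type MT) (label ONE) (code XYZ)))',
    'this line is malformed',  # made-up example of bad input
    '(Names GREEN (property (type MX) (label TWO) (x 789) (y 101)))',
]

# Keep only the lines that parsed successfully.
rows = [row for row in map(parse_line, lines) if row is not None]
```

The resulting list of dicts can be passed straight to pandas, e.g. `pd.DataFrame(rows, columns=wanted)`, which keeps only the wanted columns and fills missing keys with NaN. Note that every value is still a string at this point.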

3

You can manipulate the strings a bit with splits and by extracting the text between (). You first need to split on '(' to remove the first two levels of nesting.

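import pandas as pd

# The raw lines are assumed to be in a column named 'col'
df = pd.DataFrame({'col': [
    '(Names RED (property (x 123) (y 456) (type MT) (label ONE) (code XYZ)))',
    '(Names GREEN (property (type MX) (label TWO) (x 789) (y 101)))',
]})
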
s = df.col.str.split('(', n=2)
df['Names'] = s.str[1].str.split().str[1]

s2 = s.str[2].str.extractall('[(](.*?)[)]')[0].str.split()

df = pd.concat([df, (pd.DataFrame(s2.values.tolist(), index=s2.index.get_level_values(0))
                       .pivot(columns=0, values=1))], axis=1)

Output:

                                                 col  Names code label type    x    y
0  (Names RED (property (x 123) (y 456) (type MT)...    RED  XYZ   ONE   MT  123  456
1  (Names GREEN (property (type MX) (label TWO) (...  GREEN  NaN   TWO   MX  789  101
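
If only a handful of known keys are needed, another option that stays inside pandas is one `str.extract` call per key. This is only a sketch, assuming the raw lines sit in a column named `col` as above and that values never contain spaces or parentheses; `out` and the list of keys are names chosen here for illustration:

```python
import pandas as pd

df = pd.DataFrame({'col': [
    '(Names RED (property (x 123) (y 456) (type MT) (label ONE) (code XYZ)))',
    '(Names GREEN (property (type MX) (label TWO) (x 789) (y 101)))',
]})

# The name follows the literal "(Names ".
out = pd.DataFrame({
    'Names': df['col'].str.extract(r'\(Names (\S+)', expand=False)
})

# One pattern per wanted key; the order inside the line does not matter.
for key in ['x', 'y', 'label', 'code']:
    out[key] = df['col'].str.extract(rf'\({key} ([^\s)]+)\)', expand=False)
```

Rows that lack a key (such as `code` for GREEN) end up with NaN in that column, since `str.extract` returns NaN when the pattern does not match.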

2 Comments

  • What witchcraft is this? :) Nice answer.
  • My regex is just okay, so I have to resort to these oddities :p. It seems to be a fairly similar approach, though yours is probably more robust to bad inputs.

1

A very basic and straightforward implementation (just to show you that you could have started here before asking the question and gained a bit more credibility):

string1 = "(Names RED (property (x 123) (y 456) (type MT) (label ONE) (code XYZ)))"
string2 = "(Names GREEN (property (type MX) (label TWO) (x 789) (y 101)))"

names = []
x = []
y = []
label = []
code = []
split_string = string2.split(' ')

for i in range(0, len(split_string)):
    try:
        if "Names" in split_string[i]:
            names.append(split_string[i+1])
        # rstrip(')') removes every trailing parenthesis, so the last
        # value on the line does not keep a stray '))'
        if "x" in split_string[i]:
            x.append(split_string[i+1].rstrip(')'))
        # "(property" and "(type" also contain a "y", hence the position check
        if "y" in split_string[i] and split_string[i].find("y") <= 1:
            y.append(split_string[i+1].rstrip(')'))
        if "label" in split_string[i]:
            label.append(split_string[i+1].rstrip(')'))
        if "code" in split_string[i]:
            code.append(split_string[i+1].rstrip(')'))
    except IndexError:
        break
print(names, x, y, label, code, sep='\n')

Output (string2):

['GREEN']
['789']
['101']
['TWO']
[]

Output (string1, with split_string = string1.split(' ')):

['RED']
['123']
['456']
['ONE']
['XYZ']
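
The same split-based idea can be tightened up a little. A rough sketch, assuming every line has the shape (Names NAME (property (key value) ...)) and that neither keys nor values contain spaces or parentheses; `parse_tokens` is a hypothetical helper name:

```python
def parse_tokens(line):
    """Split one line into a dict of its name and property pairs."""
    # Turn every parenthesis into whitespace, then split on whitespace.
    tokens = line.replace('(', ' ').replace(')', ' ').split()
    # tokens now looks like:
    # ['Names', <name>, 'property', key1, value1, key2, value2, ...]
    row = dict(zip(tokens[3::2], tokens[4::2]))
    row['Names'] = tokens[1]
    return row


string1 = "(Names RED (property (x 123) (y 456) (type MT) (label ONE) (code XYZ)))"
string2 = "(Names GREEN (property (type MX) (label TWO) (x 789) (y 101)))"

rows = [parse_tokens(s) for s in (string1, string2)]
```

Because the keys come from their position in the token list rather than from substring checks, a key such as `y` cannot be confused with `property` or `type`, and missing keys are simply absent from the dict.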

1

If the parentheses are always perfectly matched, have you considered pyparsing instead of regex?

import itertools

import pandas as pd
import pyparsing as pp

lines = [
    '(Names RED (property (x 123) (y 456) (type MT) (label ONE) (code XYZ)))',
    '(Names GREEN (property (type MX) (label TWO) (x 789) (y 101)))'
]

parser = pp.nestedExpr(opener='(', closer=')')
rows = []

for line in lines:
    res = parser.parseString(line)
    # flatten first level
    l1 = list(itertools.chain.from_iterable(res))
    # flatten property
    l2 = list(itertools.chain.from_iterable(l1[2][1:]))
    # turn the flat key, value, key, value, ... list into a dict
    d1 = dict(itertools.zip_longest(*[iter(l2)] * 2, fillvalue=""))
    # add Names value
    d1['Names'] = l1[1]
    # collect the row (DataFrame.append was removed in pandas 2.0)
    rows.append(d1)

# build the dataframe with the possible columns, blank where a key is missing
df = pd.DataFrame(rows, columns=['Names', 'x', 'y', 'type', 'label', 'code']).fillna('')
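
If adding a dependency is not an option, the nested parentheses can also be handled with a small stack-based parser from the standard library alone. This is only a sketch of that idea, not a replacement for pyparsing; `parse_sexpr` is a hypothetical helper, and it assumes tokens never contain spaces or parentheses:

```python
def parse_sexpr(text):
    """Parse a parenthesised string into nested Python lists."""
    stack = [[]]
    # Pad the parentheses so they become separate tokens.
    for token in text.replace('(', ' ( ').replace(')', ' ) ').split():
        if token == '(':
            stack.append([])            # start a new nested list
        elif token == ')':
            if len(stack) == 1:
                raise ValueError('unbalanced closing parenthesis')
            finished = stack.pop()      # close the current list
            stack[-1].append(finished)
        else:
            stack[-1].append(token)
    if len(stack) != 1:
        raise ValueError('unbalanced opening parenthesis')
    return stack[0]


line = '(Names GREEN (property (type MX) (label TWO) (x 789) (y 101)))'
tree = parse_sexpr(line)[0]
# tree: ['Names', 'GREEN', ['property', ['type', 'MX'], ['label', 'TWO'], ...]]

row = {'Names': tree[1]}
row.update(dict(tree[2][1:]))  # each ['key', 'value'] pair becomes one entry
```

Unlike the regex versions, this raises a ValueError on unbalanced input instead of silently returning a partial match.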
