Pandas: parsing values in structured non tabular text

Question

I have a text file with a format like this:

k1[a-token]
v1
v2
k2[a-token]
v1'
k3[a-token]
v1"
v2"
v3"

What is the most pandorable way of reading these data into a dataframe of this form:

        A       B
0       k1      v1
1       k1      v2
2       k2      v1'
3       k3      v1"
4       k3      v2"
5       k3      v3"

that does not involve manual looping? alternatively is there any other library that allows me to input only some regular expressions that would specify the structure of my text file and output the data in the tabular form described above?

It's not clear what your format is. Do you have some token in square brackets at the start of each group, or is your [a-token] just a marker that you put in here on SO to indicate the start of each group? — DSM
– DSM, Commented Feb 12, 2017 at 3:51
yeah meant to say that there is a pattern indicating end of the string for the key values. — Maths noob
– Maths noob, Commented Feb 12, 2017 at 3:54

piRSquared · Accepted Answer · 2017-02-12 06:59:13Z

3

setup
borrowing from @jezrael

import pandas as pd
from pandas.compat import StringIO

temp=u"""
k1[a-token]
v1
v2
k2[a-token]
v1'
k3[a-token]
v1"
v2"
v3"
"""
#after testing replace 'StringIO(temp)' to 'filename.csv'
df = pd.read_csv(StringIO(temp), sep="|", names=['B'])
print (df)

str.extract with parameters specified in the regex and look ahead
use duplicated to identify the rows we want to keep.

df = df.B.str.extract('(?P<A>.*(?=\[a-token\]))?(?P<B>.*)', expand=True).ffill()
df[df.duplicated(subset=['A'])].reset_index(drop=True)

    A    B
0  k1   v1
1  k1   v2
2  k2  v1'
3  k3  v1"
4  k3  v2"
5  k3  v3"

answered Feb 12, 2017 at 6:59

piRSquared

296k68 gold badges509 silver badges654 bronze badges

Sign up to request clarification or add additional context in comments.

Comments

jezrael · Accepted Answer · 2017-02-12 07:24:35Z

You can use read_csv with some separator which is not in data like | or ¥:

import pandas as pd
from pandas.compat import StringIO

temp=u"""
k1[a-token]
v1
v2
k2[a-token]
v1'
k3[a-token]
v1"
v2"
v3"
"""
#after testing replace 'StringIO(temp)' to 'filename.csv'
df = pd.read_csv(StringIO(temp), sep="|", names=['B'])
print (df)
             B
0  k1[a-token]
1           v1
2           v2
3  k2[a-token]
4          v1'
5  k3[a-token]
6          v1"
7          v2"
8          v3"

Then insert new column A with extract values with [a-token] and last use boolean indexing with mask by duplicated for remove rows with keys in values column:

df.insert(0, 'A', df['B'].str.extract('(.*)\[a-token\]', expand=False).ffill())
df = df[df['A'].duplicated()].reset_index(drop=True)
print (df)
    A    B
0  k1   v1
1  k1   v2
2  k2  v1'
3  k3  v1"
4  k3  v2"
5  k3  v3"

But if file have duplicated keys:

print (df)
              B
0   k1[a-token]
1            v1
2            v2
3   k2[a-token]
4           v1'
5   k3[a-token]
6           v1"
7           v2"
8           v3"
9   k2[a-token]
10          v1'

df.insert(0, 'A', df['B'].str.extract('(.*)\[a-token\]', expand=False).ffill())
df = df[df['A'].duplicated()].reset_index(drop=True)
print (df)
    A            B
0  k1           v1
1  k1           v2
2  k2          v1'
3  k3          v1"
4  k3          v2"
5  k3          v3"
6  k2  k2[a-token]
7  k2          v1'

Then is necessary change mask to:

df.insert(0, 'A', df['B'].str.extract('(.*)\[a-token\]', expand=False).ffill())
df = df[~df['B'].str.contains('\[a-token]')].reset_index(drop=True)
print (df)
    A    B
0  k1   v1
1  k1   v2
2  k2  v1'
3  k3  v1"
4  k3  v2"
5  k3  v3"
6  k2  v1'

b2002 · Accepted Answer · 2017-02-12 14:48:03Z

0

With your file as 'temp.txt'...

df = pd.read_csv('temp.txt',
                 header=None,
                 delim_whitespace=True,
                 names=['data'])

bins = df.data.str.endswith('[a-token]')

idx_bins = df[bins][:]
idx_bins.data = idx_bins.data.str.rstrip(to_strip='[a-token]')
idx_vals = df[~bins][:]

a = pd.DataFrame(idx_bins.index.values, columns=['a'])
b = pd.DataFrame(idx_vals.index.values, columns=['b'])

merge_df = pd.merge_asof(b, a, left_on='b', right_on='a')
new_df = pd.DataFrame({'A': idx_bins.data.loc[list(merge_df.a)].values, 
                       'B': idx_vals.data.values})

edited Feb 12, 2017 at 14:48

answered Feb 12, 2017 at 12:54

b2002

9141 gold badge6 silver badges10 bronze badges

2 Comments

Maths noob Over a year ago

the whole point was not to have any visible loops in the code! :)

b2002 Over a year ago

sorry about that @ShS, I didn't read your question properly. how does this look after replacing the loop with merge_asof?

Collectives™ on Stack Overflow

Pandas: parsing values in structured non tabular text

3 Answers 3

Comments

Comments

2 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

Comments

Comments

2 Comments

Your Answer

Sign up or log in

Post as a guest

Related