Parsing a dataframe string for a column value

Question

I have a dataframe column with strings representing paths. I'd like to use part of each path as the value in another column.

The strings are similar to the following and are in a column titled 'Image Location':

C:\Users\Chris H\Desktop\20161017HCT116\Day 4\D2\Image9.tif
C:\Users\Chris H\Desktop\20161017HCT116\Day 4\D6\Image7.tif
C:\Users\Chris H\Desktop\20161017HCT116\Day 4\D7\Image3.tif
...
C:\Users\Chris H\Desktop\20161017HCT116\Day 6\D2\Image7.tif
C:\Users\Chris H\Desktop\20161017HCT116\Day 6\D2\Image1.tif
C:\Users\Chris H\Desktop\20161017HCT116\Day 6\D2\Image6.tif
C:\Users\Chris H\Desktop\20161017HCT116\Day 6\D3\Image4.tif
C:\Users\Chris H\Desktop\20161017HCT116\Day 6\D3\Image9.tif
...
C:\Users\Chris H\Desktop\20161017HCT116\Day 8\D1\Image4.tif
C:\Users\Chris H\Desktop\20161017HCT116\Day 8\D1\Image9.tif
C:\Users\Chris H\Desktop\20161017HCT116\Day 8\D1\Image3.tif
C:\Users\Chris H\Desktop\20161017HCT116\Day 8\D2\Image7.tif
C:\Users\Chris H\Desktop\20161017HCT116\Day 8\D2\Image1.tif
C:\Users\Chris H\Desktop\20161017HCT116\Day 8\D2\Image6.tif

Right now I'm doing the following:

df['Interval'] = df['Image Location'].str.split('\\').apply(lambda x: x[5])
df['Device'] = df['Image Location'].str.split('\\').apply(lambda x: x[6])

This clearly requires the path not to change very much because I'm counting the number of \ to find the Interval and Device values.

I'm wondering if there's a more robust way to do this. For instance, maybe find a pattern such as Day # and D#. Any thoughts would be appreciated.

burhan · Accepted Answer · 2016-11-08 18:44:09Z

2

If you don't want to depend on the number of \'s, you can do something like this:

import re
import numpy as np
df['Image Location'].map(lambda x: re.findall(r'(?<=Day )[0-9]+', x)).map(lambda x: np.nan if not x else x[0])
df['Image Location'].map(lambda x: re.findall(r'(?<=D)[0-9]+', x)).map(lambda x: np.nan if not x else x[0])

This finds the substring "Day " (or "D") and returns the digits that immediately follow it, or NaN when there is no match. It assumes the pattern does not appear earlier in the string: findall picks up every D followed by one or more digits, and only the first match is kept.
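A minimal, self-contained sketch of the lookbehind approach on a few sample rows. The helper name first_match and the three-row frame are illustrative assumptions, not part of the original answer; the third path is the hypothetical "48 hr" layout, included to show the NaN fallback.

```python
import re

import numpy as np
import pandas as pd

# Two paths from the question plus one hypothetical path without a "Day" folder.
df = pd.DataFrame({
    "Image Location": [
        r"C:\Users\Chris H\Desktop\20161017HCT116\Day 4\D2\Image9.tif",
        r"C:\Users\Chris H\Desktop\20161017HCT116\Day 8\D1\Image3.tif",
        r"C:\Users\Chris H\Desktop\20161017HCT116\48 hr\D2\Image6.tif",
    ]
})


def first_match(pattern, text):
    """Return the first regex match in text, or NaN when there is none."""
    found = re.findall(pattern, text)
    return found[0] if found else np.nan


# (?<=Day ) is a lookbehind: the digits must be preceded by "Day ",
# but "Day " itself is not part of the returned match.
df["Interval"] = df["Image Location"].map(lambda p: first_match(r"(?<=Day )[0-9]+", p))
df["Device"] = df["Image Location"].map(lambda p: first_match(r"(?<=D)[0-9]+", p))
```

The extracted values are strings (for example "4"), so they would still need converting if numeric columns are wanted.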

UPDATE: Looks like it's easier to use Series.str.extract as @MaxU suggested. Here it goes:

df['Image Location'].str.extract(r'Day ([0-9]+)', expand=False)
df['Image Location'].str.extract(r'D([0-9]+)', expand=False)
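A short sketch of the str.extract approach, using the literal prefixes Day and D. Note that square brackets in a regex form a character class, so [Day ] matches any single one of the characters D, a, y, or space rather than the word "Day ". The second path below (with a folder named data2) is a made-up example chosen to show where the two patterns diverge.

```python
import pandas as pd

df = pd.DataFrame({
    "Image Location": [
        r"C:\Users\Chris H\Desktop\20161017HCT116\Day 4\D2\Image9.tif",
        r"C:\data2\Day 6\D3\Image1.tif",  # hypothetical path, not from the question
    ]
})

# expand=False returns a Series when the pattern has a single capture group.
df["Interval"] = df["Image Location"].str.extract(r"Day ([0-9]+)", expand=False)
df["Device"] = df["Image Location"].str.extract(r"D([0-9]+)", expand=False)

# Character-class version for comparison: on the second path it captures
# the "2" in "data2", because "a" is in the class and is followed by a digit.
loose = df["Image Location"].str.extract(r"[Day ]([0-9]+)", expand=False)
```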

edited Nov 8, 2016 at 18:44

answered Nov 8, 2016 at 18:35


1 Comment

agf1997 Over a year ago

This was the direction I was thinking originally. I'm not sure which solution is better, this or the one from @MaxU. This one seems like it would be robust to some path change between \Day # and \D#, e.g. C:\Users\Chris H\Desktop\20161017HCT116\Day 8\run 1\D2\Image6.tif, but that's unlikely to happen. Max's solution is robust to the Interval changing from days to hours, e.g. C:\Users\Chris H\Desktop\20161017HCT116\48 hr\D2\Image6.tif. That is probably more likely, but both are great solutions!
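The trade-off described in this comment can be checked directly on the two hypothetical paths it mentions. This is only a sketch: the names by_keyword and by_position are made up for illustration, the keyword patterns are literal Day and D prefixes, and the positional pattern is the one from @MaxU's answer.

```python
import pandas as pd

# The two hypothetical paths from the comment.
paths = pd.Series([
    r"C:\Users\Chris H\Desktop\20161017HCT116\Day 8\run 1\D2\Image6.tif",
    r"C:\Users\Chris H\Desktop\20161017HCT116\48 hr\D2\Image6.tif",
])

# Keyword-based: looks for "Day <digits>" and "D<digits>" anywhere in the path.
by_keyword = pd.DataFrame({
    "Interval": paths.str.extract(r"Day ([0-9]+)", expand=False),
    "Device": paths.str.extract(r"D([0-9]+)", expand=False),
})

# Position-based: takes the last two directories before the file name.
by_position = paths.str.extract(r"([^\\]+)\\([^\\]+)\\[^\\]+$", expand=True)
by_position.columns = ["Interval", "Device"]
```

On the first path the keyword version still finds Day 8 while the positional version reports run 1 as the Interval; on the second path the positional version returns 48 hr while the keyword version has no Interval match and yields NaN.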

MaxU - stand with Ukraine · Accepted Answer · 2016-11-08 18:46:26Z

I would use the Series.str.extract() method:

In [40]: df[['Interval','Device']] = \
             df['Image Location'].str.extract(r'([^\\]+)\\([^\\]+)\\[^\\]+$', expand=True)

In [41]: df
Out[41]:
                                                 Image Location Interval Device
0   C:\Users\Chris H\Desktop\20161017HCT116\Day 4\D2\Image9.tif    Day 4     D2
1   C:\Users\Chris H\Desktop\20161017HCT116\Day 4\D6\Image7.tif    Day 4     D6
2   C:\Users\Chris H\Desktop\20161017HCT116\Day 4\D7\Image3.tif    Day 4     D7
3   C:\Users\Chris H\Desktop\20161017HCT116\Day 6\D2\Image7.tif    Day 6     D2
4   C:\Users\Chris H\Desktop\20161017HCT116\Day 6\D2\Image1.tif    Day 6     D2
5   C:\Users\Chris H\Desktop\20161017HCT116\Day 6\D2\Image6.tif    Day 6     D2
6   C:\Users\Chris H\Desktop\20161017HCT116\Day 6\D3\Image4.tif    Day 6     D3
7   C:\Users\Chris H\Desktop\20161017HCT116\Day 6\D3\Image9.tif    Day 6     D3
8   C:\Users\Chris H\Desktop\20161017HCT116\Day 8\D1\Image4.tif    Day 8     D1
9   C:\Users\Chris H\Desktop\20161017HCT116\Day 8\D1\Image9.tif    Day 8     D1
10  C:\Users\Chris H\Desktop\20161017HCT116\Day 8\D1\Image3.tif    Day 8     D1
11  C:\Users\Chris H\Desktop\20161017HCT116\Day 8\D2\Image7.tif    Day 8     D2
12  C:\Users\Chris H\Desktop\20161017HCT116\Day 8\D2\Image1.tif    Day 8     D2
13  C:\Users\Chris H\Desktop\20161017HCT116\Day 8\D2\Image6.tif    Day 8     D2

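Here is the RegEx parsed and explained: each ([^\\]+) captures a run of characters that are not backslashes, each \\ matches one literal backslash, and the trailing [^\\]+$ matches the file name at the end of the string.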

The RegEx in this solution assumes that the last two path parts (directories) are always Interval and Device, in that order.

It does NOT matter how many \ (backslashes) appear earlier in the path.
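To illustrate that only the tail of the path matters, here is the same pattern written with re.VERBOSE so each piece can carry a comment. The helper split_tail and the short D:\ path are hypothetical, added only for this demonstration.

```python
import re

# Same pattern as in the str.extract call, annotated piece by piece.
PATTERN = re.compile(
    r"""
    ([^\\]+)   # group 1: second-to-last directory (Interval)
    \\         # one literal backslash
    ([^\\]+)   # group 2: last directory (Device)
    \\         # one literal backslash
    [^\\]+     # the file name
    $          # anchor at the end of the string
    """,
    re.VERBOSE,
)


def split_tail(path):
    """Return (interval, device) taken from the end of path, or None if no match."""
    match = PATTERN.search(path)
    return match.groups() if match else None


deep = split_tail(r"C:\Users\Chris H\Desktop\20161017HCT116\Day 4\D2\Image9.tif")
shallow = split_tail(r"D:\Day 4\D2\Image9.tif")  # hypothetical shorter path
```

Both calls return the same pair, since the $ anchor pins the match to the end of the string regardless of how many directories come before it. A path with fewer than two directories before the file name does not match at all.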
