String Pattern Matching & Indexing between two Columns - Pandas

Question

I have a dataframe with two text columns. Each value in one column (say Col B) is a substring of the corresponding value in the other column (say Col A). I want to look for patterns in where the substring sits and in the characters that begin each Col A string. So I want to generate three columns: the position of the substring, the characters preceding it, and the characters following it.

Here is what the dataframe looks like:

| Col A       | Col B |
|-------------|-------|
| AGHXXXJ002  | XXX   |
| AGHGHJJ002  | GHJ   |
| ABCRTGHP001 | RTGH  |
| ABCDFFP01   | DFF   |
| ABCXGHJD09  | XGH   |

Based on the above pattern, I want to generate three columns:

| Col A       | Col B | Position | Preceding Chars | Following Chars |
|-------------|-------|----------|-----------------|-----------------|
| AGHXXXJ002  | XXX   | [3, 5]   | AGH             | J002            |
| AGHGHJJ002  | GHJ   | [3, 5]   | AGH             | J002            |
| ABCRTGHP001 | RTGH  | [3, 6]   | ABC             | P001            |
| ABCDFFP01   | DFFP  | [3, 5]   | ABC             | 01              |
| ABCXGHJD09  | XGH   | [3, 5]   | ABC             | D09             |
| HGMQQUTV01  | HGM   | [0, 2]   | NaN             | QQUTV01         |
| GBHUJJS099  | BHU   | [1, 3]   | G               | JJS099          |

(In the first row, Position is [3, 5] because XXX starts at index 3 and ends at index 5.)

This is my desired output. I tried using a for loop to extract the substrings, but it never ran successfully, so I removed the code. Until now I have been doing this manually, but there are more than 50k rows and that is not feasible. The position column can also be split into two separate columns, start position and end position.

tanjmaxalb · Accepted Answer · 2020-07-08 15:54:41Z

This may help:

>>> import re
>>> import pandas

>>> df = pandas.DataFrame([["AGHXXXJ002", "XXX"], ["AGHGHJJ002", "GHJ"], ["ABCRTGHP001", "RTGH"], ["ABCDFFP01", "DFF"], ["ABCXGHJD09", "XGH"]], columns=["Col A", "Col B"])
>>> df
         Col A Col B
0   AGHXXXJ002   XXX
1   AGHGHJJ002   GHJ
2  ABCRTGHP001  RTGH
3    ABCDFFP01   DFF
4   ABCXGHJD09   XGH

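>>> def get_position(row):
...     # re.escape makes Col B match literally, even if it contains regex metacharacters
...     match = re.search(re.escape(row["Col B"]), row["Col A"])
...     if match:
...         return match.span()  # (start, end), with end exclusive
...     return (-1, -1)  # sentinel when Col B does not occur in Col A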
... 
>>> df["Position"] = df.apply(get_position, axis=1)
>>> df
         Col A Col B Position
0   AGHXXXJ002   XXX   (3, 6)
1   AGHGHJJ002   GHJ   (3, 6)
2  ABCRTGHP001  RTGH   (3, 7)
3    ABCDFFP01   DFF   (3, 6)
4   ABCXGHJD09   XGH   (3, 6)

>>> def get_preceding(row):
...     if row["Position"][0] == -1:  # no match, so nothing precedes
...         return ""
...     return row["Col A"][:row["Position"][0]]
... 
>>> df["Preceding Chars"] = df.apply(get_preceding, axis=1)
>>> df
         Col A Col B Position Preceding Chars
0   AGHXXXJ002   XXX   (3, 6)             AGH
1   AGHGHJJ002   GHJ   (3, 6)             AGH
2  ABCRTGHP001  RTGH   (3, 7)             ABC
3    ABCDFFP01   DFF   (3, 6)             ABC
4   ABCXGHJD09   XGH   (3, 6)             ABC

>>> def get_following(row):
...     if row["Position"][1] == -1:  # no match, so nothing follows
...         return ""
...     return row["Col A"][row["Position"][1]:]
... 
>>> df["Following Chars"] = df.apply(get_following, axis=1)
>>> df
         Col A Col B Position Preceding Chars Following Chars
0   AGHXXXJ002   XXX   (3, 6)             AGH            J002
1   AGHGHJJ002   GHJ   (3, 6)             AGH            J002
2  ABCRTGHP001  RTGH   (3, 7)             ABC            P001
3    ABCDFFP01   DFF   (3, 6)             ABC             P01
4   ABCXGHJD09   XGH   (3, 6)             ABC            JD09
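
Note that `match.span()` is half-open, so `(3, 6)` covers indices 3 through 5, while the question asks for an inclusive end and mentions separate start and end columns. A minimal sketch of that conversion, assuming a `Position` column built as above with `(-1, -1)` marking a missing match. The column names `Start` and `End` and the sample rows are illustrative, not from the question:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Col A": ["AGHXXXJ002", "ABCRTGHP001", "NOMATCH01"],
        # half-open spans as returned by match.span(); (-1, -1) means "not found"
        "Position": [(3, 6), (3, 7), (-1, -1)],
    }
)

# Unpack the tuples into two integer columns (names are illustrative).
df[["Start", "End"]] = pd.DataFrame(
    df["Position"].tolist(), index=df.index, columns=["Start", "End"]
)

# Turn the exclusive end into an inclusive one, leaving the -1 sentinel untouched.
found = df["Start"] >= 0
df.loc[found, "End"] -= 1
```

After this, the first row has `Start` 3 and `End` 5, which matches the `[3, 5]` convention in the question.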

Umar.H · Accepted Answer · 2020-07-08 16:38:59Z

There isn't a vectorised method for this, as we are dealing with row-level string operations.

Let's use str.split and np.char.find to build the new columns.

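import numpy as np
import pandas as pd

# Note: I've removed the spaces from your column names, so df has 'ColA' and 'ColB'.
# Split each ColA value around the first occurrence of ColB -> [preceding, following]
s = pd.DataFrame(df.apply(lambda x: x['ColA'].split(x['ColB'], 1), axis=1).tolist())
# Start index of ColB within ColA (-1 if it does not occur)
idx = df.apply(lambda x: np.char.find(x['ColA'], x['ColB']), axis=1)

# Inclusive end index = start + len(ColB) - 1
pos = zip(idx.values, (idx - 1 + df["ColB"].str.len()).values)

df["Position"] = list(pos)
df['Proceeding Chars'], df['Following Chars'] = s[0], s[1]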

print(df)

        ColA  ColB Position Proceeding Chars Following Chars
0   AGHXXXJ002   XXX   (3, 5)              AGH            J002
1   AGHGHJJ002   GHJ   (3, 5)              AGH            J002
2  ABCRTGHP001  RTGH   (3, 6)              ABC            P001
3    ABCDFFP01   DFF   (3, 5)              ABC             P01
4   ABCXGHJD09   XGH   (3, 5)              ABC            JD09
5   HGMQQUTV01   HGM   (0, 2)                          QQUTV01
6   GBHUJJS099   BHU   (1, 3)                G          JJS099
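
One caveat with splitting: a row where ColB does not occur in ColA yields a single piece, so there is no second element to use as the following characters. A sketch of an alternative using `str.partition`, which always splits at the first occurrence and reports whether the separator was found. The helper name `locate`, the `Start`/`End` column names, the `None` placeholders, and the extra sample rows are my assumptions, and ColB is assumed to be non-empty:

```python
import pandas as pd

def locate(text, sub):
    """Return (start, inclusive end, preceding, following) for the first occurrence of sub."""
    before, sep, after = text.partition(sub)
    if not sep:  # sub does not occur in text
        return (-1, -1, None, None)
    start = len(before)
    return (start, start + len(sub) - 1, before, after)

df = pd.DataFrame(
    {
        "ColA": ["AGHXXXJ002", "HGMQQUTV01", "ABXABXAB", "NOMATCH"],
        "ColB": ["XXX", "HGM", "X", "ZZ"],
    }
)

# One pass over both columns; a plain loop is usually quicker than df.apply(axis=1).
rows = [locate(a, b) for a, b in zip(df["ColA"], df["ColB"])]
out = df.join(
    pd.DataFrame(
        rows,
        columns=["Start", "End", "Preceding Chars", "Following Chars"],
        index=df.index,
    )
)
```

For the repeated-substring row, only the first `X` is used, so everything after it (including the second `X`) ends up in the following characters.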

Alexey · Accepted Answer · 2020-07-08 17:50:18Z

import pandas as pd

# Prepare test data

dct = {'Col A': {0: 'AGHXXXJ002',
  1: 'AGHGHJJ002',
  2: 'ABCRTGHP001',
  3: 'ABCDFFP01',
  4: 'ABCXGHJD09'},
 'Col B': {0: 'XXX', 1: 'GHJ', 2: 'RTGH', 3: 'DFF', 4: 'XGH'}}

df = pd.DataFrame.from_dict(dct)


# Split each Col A value around the first occurrence of its Col B substring,
# e.g. 'AGHXXXJ002'.split('XXX', 1) -> ['AGH', 'J002']
tmp_lst = [a.split(b, 1) for a, b in zip(df['Col A'], df['Col B'])]
df['Preceding Chars'] = [c[0] for c in tmp_lst]  # first element, e.g. 'AGH'
df['Following Chars'] = [c[1] for c in tmp_lst]  # second element, e.g. 'J002'
# Inclusive [start, end]: start = len(preceding), end = start + len(substring) - 1
df['Position'] = [[len(p), len(p) + len(b) - 1] for p, b in zip(df['Preceding Chars'], df['Col B'])]

df
Out[1]:

    Col A       Col B   Preceding Chars Following Chars Position
0   AGHXXXJ002  XXX     AGH             J002            [3, 5]
1   AGHGHJJ002  GHJ     AGH             J002            [3, 5]
2   ABCRTGHP001 RTGH    ABC             P001            [3, 6]
3   ABCDFFP01   DFF     ABC             P01             [3, 5]
4   ABCXGHJD09  XGH     ABC             JD09            [3, 5]
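
With 50k rows it may be worth checking the computed positions rather than trusting them by eye. A small sketch, assuming `Position` holds inclusive `[start, end]` pairs as in the output above; the `Check` column name is hypothetical:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Col A": ["AGHXXXJ002", "ABCRTGHP001"],
        "Col B": ["XXX", "RTGH"],
        "Position": [[3, 5], [3, 6]],  # inclusive [start, end]
    }
)

# Slicing Col A with the inclusive pair should give back Col B exactly.
df["Check"] = [
    a[start:end + 1] == b
    for a, b, (start, end) in zip(df["Col A"], df["Col B"], df["Position"])
]
```

Any row where `Check` is False points at either a missing substring or an off-by-one in the position.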
