Python Pandas partial match of list of string in dataframe

Question

Hi everyone, I am trying to match partial strings within a column of a DataFrame and return the matched strings (capitalization matters). I don't have strong programming knowledge and have only just started learning.

import pandas as pd

# List of state abbreviations
state_abbrv = ["AL","AK","AZ","AR","CA","CO","CT","DE","FL","GA","HI","ID","IL","IN","IA","KS","KY","LA",
        "ME","MD","MA","MI","MN","MS","MO","MT","NE","NV","NH","NJ","NM","NY","NC","ND","OH","OK",
        "OR","PA","RI","SC","SD","TN","TX","UT","VT","VA","WA","WV","WI","WY"]

# Create the DataFrame
d = {"Index": [1, 2, 3, 4, 5, 6, 7], "Description": ["ABNY", "MANY", "NYNY", "DO", "nyNY", "CWARD NY", "HOWARD BEACH NY"]}

df = pd.DataFrame(data=d)

Here's the df:

Index Description 
1           ABNY         
2           MANY         
3           NYNY         
4           DO           
5           nyNY         
6           CWARD NY       
7           HOWARD BEACH NY

Here's my code:

df = df.assign(State = df["Description"].str.findall(state_abbrv))

And here's the expected result:

Index Description     State
1     ABNY            NY
2     MANY            MA,NY
3     NYNY            NY,NY
4     DO              
5     nyNY            NY
6     CWARD NY        WA,NY
7     HOWARD BEACH NY WA,AR,NY

Thanks

MrNobody33 · Accepted Answer · 2020-07-09 23:22:58Z

4

You could try with join, and then use str.findall:

statesjoin='|'.join(state_abbrv)
df=df.assign(State = df["Description"].str.findall(statesjoin))

Output:

df
   Index Description     State
0      1        ABNY      [NY]
1      2        MANY  [MA, NY]
2      3        NYNY  [NY, NY]
3      4          DO        []
4      5        nyNY      [NY]
5      6      ABALBB      [AL]
6      7        ALCA  [AL, CA]

In the possible case @AkshaySehgal described, you could try this:

import re
df=df.assign(State = df["Description"].apply(lambda x: ','.join(re.findall('..',x))).str.findall(statesjoin))
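To make the difference between the two variants above concrete, here is a minimal sketch using plain `re` on a few of the question's descriptions. The names `states`, `plain_matches`, and `chunked_matches` are illustrative, and `states` is a shortened subset of `state_abbrv` chosen only to keep the example small.

```python
import re

# Illustrative subset of the question's state_abbrv list (assumption: shortened for brevity)
states = ["AL", "AR", "CA", "MA", "NY", "WA"]
pattern = "|".join(states)  # "AL|AR|CA|MA|NY|WA"

def plain_matches(text):
    # Non-overlapping scan, left to right, over the raw string
    return re.findall(pattern, text)

def chunked_matches(text):
    # Split into consecutive two-character chunks first,
    # then search the comma-joined chunks
    chunks = ",".join(re.findall("..", text))
    return re.findall(pattern, chunks)

for text in ["MANY", "NYNY", "nyNY", "CWARD NY"]:
    print(text, plain_matches(text), chunked_matches(text))
```

Both functions agree on the four-letter inputs, but for "CWARD NY" the plain scan finds WA and NY while the chunked scan finds AR and NY. Neither finds both WA and AR, because `re.findall` never returns overlapping matches and the chunks only start at even offsets.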

edited Jul 9, 2020 at 23:22

answered Jul 9, 2020 at 22:33

MrNobody33


5 Comments

Akshay Sehgal Over a year ago

This can sometimes fail (though not in this scenario) when a string like BN is among the state abbreviations. Then ABNY would yield BN, NY when it should yield only NY.

MrNobody33 Over a year ago

Sure, that's true. Just added a solution for that case.

JustStartLearningCode Over a year ago

Thanks. I just found out that it won't work when the description contains both numbers and text (e.g. ny1NY), so I converted the column to str type.

JustStartLearningCode Over a year ago

Hi, some of the results are missing after I changed some of the description data to "CWARD NY" and "HOWARD BEACH NY". I updated the code and the table in my question.

MrNobody33 Over a year ago

@JustStartLearningCode But why does "CWARD NY" get [WA, NY] while "HOWARD BEACH NY" gets [WA, AR, NY]? Shouldn't "CWARD NY" get [WA, AR, NY] too, because of the substring "WARD"?
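One way to pick up overlapping abbreviations such as WA and AR inside "WARD" is a regex lookahead with a capture group. This is not from either answer, just a sketch of a different technique: a lookahead consumes no characters, so every starting position in the string gets tested. The name `overlap_pattern` is made up for this example.

```python
import pandas as pd

state_abbrv = ["AL","AK","AZ","AR","CA","CO","CT","DE","FL","GA","HI","ID","IL","IN","IA","KS","KY","LA",
        "ME","MD","MA","MI","MN","MS","MO","MT","NE","NV","NH","NJ","NM","NY","NC","ND","OH","OK",
        "OR","PA","RI","SC","SD","TN","TX","UT","VT","VA","WA","WV","WI","WY"]

# Zero-width lookahead wrapping one capture group: findall returns the
# captured abbreviation at every position where one starts
overlap_pattern = "(?=(" + "|".join(state_abbrv) + "))"

df = pd.DataFrame({"Description": ["ABNY", "MANY", "NYNY", "DO", "nyNY",
                                   "CWARD NY", "HOWARD BEACH NY"]})

# astype(str) guards against non-string values in the column
df["State"] = (df["Description"].astype(str)
               .str.findall(overlap_pattern)
               .str.join(","))
print(df)
```

With this pattern both "CWARD NY" and "HOWARD BEACH NY" come out as WA,AR,NY, which matches the point raised in the comment above rather than the WA,NY shown in the question's expected table.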

Akshay Sehgal · Accepted Answer · 2020-07-09 22:45:05Z

1

Instead of combining all the state abbreviations into a single pattern string (which can yield incorrect results if one abbreviation ends with the characters another begins with), you can use this:

def get_common(s):
    parts = set(map(''.join, zip(*[iter(s)]*2)))  # Break the string into two-character tokens
    common = ', '.join(parts.intersection(state_abbrv))  # Tokens that are also state abbreviations
    return common

df['State'] = df['Description'].apply(get_common)

Index Description State
1     ABNY         NY
2     MANY         MA,NY
3     NYNY         NY,NY
4     DO           
5     nyNY         NY
6     ABALBB       AL 
7     ALCA         AL,CA
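For anyone puzzling over `zip(*[iter(s)]*2)`: it walks the string in consecutive, non-overlapping pairs and drops a trailing odd character. Because `get_common` puts the tokens in a set, repeats collapse and the order is not guaranteed. The sketch below spells out the same tokenization with slicing and shows a hypothetical variant, `get_common_ordered`, that keeps order and repeats. `state_set` is a shortened illustrative subset of `state_abbrv`.

```python
# Illustrative subset of state_abbrv (assumption: shortened for brevity)
state_set = {"AL", "AR", "CA", "MA", "NY", "WA"}

def pairs(s):
    # Same tokens as zip(*[iter(s)]*2): consecutive two-character pieces,
    # with a trailing odd character dropped
    return [s[i:i + 2] for i in range(0, len(s) - 1, 2)]

def get_common_ordered(s):
    # Hypothetical variant: keep tokens in their original order and keep repeats
    return ",".join(tok for tok in pairs(s) if tok in state_set)

print(pairs("CWARD NY"))
print(get_common_ordered("NYNY"))
print(get_common_ordered("CWARD NY"))
```

Since the pairs only start at even offsets, an abbreviation that starts at an odd offset (like WA in "CWARD NY") is never seen by this approach.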

answered Jul 9, 2020 at 22:45

Akshay Sehgal


2 Comments

JustStartLearningCode Over a year ago

Thanks. I just found out that it won't work when the description contains both numbers and text (e.g. ny1NY), so I converted the column to str type.

JustStartLearningCode Over a year ago

Hi, some of the results are missing after I changed some of the description data to "CWARD NY" and "HOWARD BEACH NY". I updated the code and the table in my question.
