Hi everyone, I am trying to match partial strings within a column of a DataFrame and return the matched strings (the match is case-sensitive). I don't have a strong programming background and have just started learning.

# List of state abbreviations
state_abbrv = ["AL","AK","AZ","AR","CA","CO","CT","DE","FL","GA","HI","ID","IL","IN","IA","KS","KY","LA",
        "ME","MD","MA","MI","MN","MS","MO","MT","NE","NV","NH","NJ","NM","NY","NC","ND","OH","OK",
        "OR","PA","RI","SC","SD","TN","TX","UT","VT","VA","WA","WV","WI","WY"]

# Create the DataFrame
import pandas as pd

d = {"Index": [1, 2, 3, 4, 5, 6, 7], "Description": ["ABNY", "MANY", "NYNY", "DO", "nyNY", "CWARD NY", "HOWARD BEACH NY"]}

df = pd.DataFrame(data=d)

Here's the df:

Index Description 
1           ABNY         
2           MANY         
3           NYNY         
4           DO           
5           nyNY         
6           CWARD NY       
7           HOWARD BEACH NY   

Here's my code:

df = df.assign(State = df["Description"].str.findall(state_abbrv))

And here's the expected result:

Index Description     State
1     ABNY            NY
2     MANY            MA,NY
3     NYNY            NY,NY
4     DO
5     nyNY            NY
6     CWARD NY        WA,NY
7     HOWARD BEACH NY WA,AR,NY

Thanks

2 Answers

You could join the abbreviations with '|' to build a single regex pattern, and then use str.findall:

statesjoin = '|'.join(state_abbrv)
df = df.assign(State=df["Description"].str.findall(statesjoin))

Output:

df
   Index Description     State
0      1        ABNY      [NY]
1      2        MANY  [MA, NY]
2      3        NYNY  [NY, NY]
3      4          DO        []
4      5        nyNY      [NY]
5      6      ABALBB      [AL]
6      7        ALCA  [AL, CA]
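
For context, str.findall takes one regular-expression pattern rather than a list, which is why the abbreviations are joined with '|' first. The sketch below shows the same idea with the standard re module. The names sample_abbrv and find_states are illustrative and not from the original post, and the list is shortened on purpose. One thing to keep in mind is that findall returns non-overlapping matches, so once "WA" is consumed in "HOWARD", the "AR" that shares its "A" is not reported.

```python
import re

# Shortened list for illustration only (an assumption, not the full state_abbrv).
sample_abbrv = ["MA", "NY", "WA", "AR"]
pattern = "|".join(sample_abbrv)  # builds "MA|NY|WA|AR"

def find_states(text):
    # re.findall scans left to right and returns non-overlapping matches.
    # The match is case-sensitive because no re.IGNORECASE flag is passed.
    return re.findall(pattern, text)

print(find_states("MANY"))             # ['MA', 'NY']
print(find_states("nyNY"))             # ['NY']
print(find_states("HOWARD BEACH NY"))  # ['WA', 'NY']
```

If the items being joined could contain regex metacharacters, escaping each one with re.escape before joining would be the safer choice; two-letter abbreviations do not need it.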

To handle the case @AkshaySehgal describes in the comments, you could first split each description into two-character chunks and then search those:

import re
df = df.assign(State=df["Description"].apply(lambda x: ','.join(re.findall('..', x))).str.findall(statesjoin))
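
If overlapping matches are wanted, as the expected WA,AR,NY for "HOWARD BEACH NY" in the question suggests, one option is to wrap the alternation in a zero-width lookahead so that every start position is tested. This is only a sketch under that assumption and is not part of the original answer; find_overlapping is a hypothetical helper name.

```python
import re

state_abbrv = ["AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "FL", "GA", "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA",
               "ME", "MD", "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ", "NM", "NY", "NC", "ND", "OH", "OK",
               "OR", "PA", "RI", "SC", "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY"]

# (?=(...)) matches an empty string at each position where an abbreviation
# starts, and the inner group captures the abbreviation itself.
lookahead = re.compile("(?=(" + "|".join(state_abbrv) + "))")

def find_overlapping(text):
    # With exactly one capturing group, findall returns the captured strings.
    return lookahead.findall(text)

print(find_overlapping("HOWARD BEACH NY"))  # ['WA', 'AR', 'NY']
print(find_overlapping("CWARD NY"))         # ['WA', 'AR', 'NY']
print(find_overlapping("NYNY"))             # ['NY', 'NY']
```

With the DataFrame from the question, something like df["Description"].apply(find_overlapping).str.join(",") should produce comma-separated strings. Note that "CWARD NY" then also reports AR, which differs from the WA,NY shown in the expected result.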

5 Comments

This can sometimes fail (not in this scenario) when a string like BN is itself a state abbreviation. Then ABNY will yield BN, NY when it should yield only NY.
Sure, that's true. I just added a solution for that case.
Thanks, I just found out that when the description contains both numbers and text (e.g. ny1NY) it won't work, so I converted the column to str type.
Hi, some of the results are missing after I changed some of the description data to "CWARD NY" and "HOWARD BEACH NY". I have updated the code and table in my question.
@JustStartLearningCode but why does "CWARD NY" get [WA, NY] while "HOWARD BEACH NY" gets [WA, AR, NY]? Shouldn't "CWARD NY" get [WA, AR, NY] too, because of the substring "WARD"?
1

Instead of combining all the state abbreviations into a single string and searching with it (which can yield incorrect results when one abbreviation ends and another begins with similar characters), you can use this:

def get_common(s):
    parts = set(map(''.join, zip(*[iter(s)]*2)))  # Break the string into 2-character tokens
    common = ', '.join(parts.intersection(state_abbrv))  # Intersection of tokens and abbreviations
    return common

df['State'] = df['Description'].apply(get_common)

Output:

Index Description State
1     ABNY         NY
2     MANY         MA,NY
3     NYNY         NY,NY
4     DO           
5     nyNY         NY
6     ABALBB       AL 
7     ALCA         AL,CA
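
To make the tokenizing step easier to follow, here is a small sketch of what zip(*[iter(s)]*2) does. The helper names two_char_chunks and states_in_chunks and the shortened sample_abbrv set are illustrative assumptions, not part of the original answer. Unlike get_common, this variant keeps the chunks in a list, so order and repeats are preserved; a set would collapse the two "NY" chunks of "NYNY" into one.

```python
# Shortened set for illustration only (an assumption, not the full state_abbrv).
sample_abbrv = {"MA", "NY", "WA", "AR"}

def two_char_chunks(s):
    # [iter(s)] * 2 repeats the SAME iterator twice, so zip pulls two
    # consecutive characters per tuple. A trailing odd character is dropped.
    return ["".join(pair) for pair in zip(*[iter(s)] * 2)]

def states_in_chunks(s):
    # Keep only the chunks that are known abbreviations, in original order.
    return ",".join(c for c in two_char_chunks(s) if c in sample_abbrv)

print(two_char_chunks("ABNY"))      # ['AB', 'NY']
print(two_char_chunks("CWARD NY"))  # ['CW', 'AR', 'D ', 'NY']
print(states_in_chunks("NYNY"))     # NY,NY
print(states_in_chunks("CWARD NY")) # AR,NY
```

Because the chunks always start at even positions, an abbreviation that begins at an odd position (such as the "WA" in "CWARD NY") is never seen, and spaces shift the alignment. That would be consistent with the missing results reported in the comments for descriptions like "CWARD NY" and "HOWARD BEACH NY".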
