Merging two columns while eliminating duplicate strings in pandas dataframe

Question

I have a dataframe with an original column 'All', which I split into RegionName1 and RegionName2 columns. Some entries are duplicated, for example 'Duluth' and 'Duluth (University of Minnesota Duluth'. I want to convert strings like 'Duluth (University of Minnesota Duluth' to NaN values, so I tried:

unitown['RegionName2'] = [np.nan if '(' in x else x for x in unitown['RegionName2']]

and got the error TypeError: argument of type 'float' is not iterable. What else can I try?

unitown = pd.read_table('university_towns.txt', header=None).rename(columns={0: 'All'})
unitown['State'] = unitown['All'].apply(lambda x: x.split('[edi')[0].strip() if x.count('[edi') else np.nan).ffill()
unitown['RegionName1'] = unitown['All'].apply(lambda x: x.split('(')[0].strip() if x.count('(') else np.nan)
unitown['RegionName2'] = unitown['All'].apply(lambda x: x.split(',')[0].strip() if x.count(',') else np.nan)
# This is the line that raises the TypeError: RegionName2 already contains NaN (float) values
unitown['RegionName2'] = [np.nan if '(' in x else x for x in unitown['RegionName2']]
return unitown[unitown.State == 'Minnesota']

foglerit · Accepted Answer · 2020-03-13 16:27:26Z

1

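You can either use the following, passing regex=False so that "(" is matched literally rather than parsed as a regular expression, and na=False so that rows that are already NaN are skipped:

unitown.loc[unitown.RegionName2.str.contains("(", regex=False, na=False), 'RegionName2'] = np.nan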

Or add this logic directly to the code that generates RegionName2 as in:

unitown['RegionName2'] = unitown['All'].apply(
    lambda x: x.split(',')[0].strip() if x.count(',') and "(" not in x.split(',')[0] else np.nan
)
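
For background, the list comprehension in the question fails because the missing values in RegionName2 are float NaN, and the `in` test is not defined for floats. The following is a minimal sketch of that failure and of a guarded comprehension that avoids it. The small Series is a made-up stand-in for unitown['RegionName2'] (its values are assumptions, not taken from university_towns.txt).

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for unitown['RegionName2'];
# the values are illustrative only.
region = pd.Series(["Duluth (University of Minnesota Duluth", np.nan, "Mankato"])

# Reproduce the failure: NaN is a float, so `"(" in x` raises TypeError.
try:
    [np.nan if "(" in x else x for x in region]
    raised = False
except TypeError:
    raised = True

# Guarded comprehension: keep a value only if it is a real string
# without "("; everything else (including existing NaN) becomes NaN.
cleaned = pd.Series(
    [x if isinstance(x, str) and "(" not in x else np.nan for x in region],
    index=region.index,
)

print(raised)
print(cleaned)
```

The isinstance check leaves existing NaN values as NaN, which may be preferable to converting the column to strings first, since astype(str) would turn NaN into the literal text 'nan'.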

edited Mar 13, 2020 at 16:27

answered Mar 13, 2020 at 11:54


3 Comments

Bluetail Over a year ago

Thanks, foglerit! This is exactly what I was looking for.

foglerit Over a year ago

My pleasure @MariaBruevich. Could you hit the accept button so others can easily know this answer solves your problem? Thanks

Bluetail Over a year ago

I don't see the accept button. I clicked on 'this answer is useful' next to your answer. By the way, I discovered that I should convert the NaNs to type 'string' for my list comprehension to work.

stanna · Accepted Answer · 2020-03-13 12:34:55Z

0

import numpy as np
import pandas as pd

# Input data
d = {'RegionName1': ["a", "b", "c", "d"], 'RegionName2': ['Duluth and Duluth (University of Minnesota Duluth', "Monkato(Minnesota", 'Other1', 'Other2']}
df = pd.DataFrame(data=d)
print("Input dataframe:")
print(df)

# Search for '(' in the RegionName2 column and replace matching values with NaN.
# Using the index label from items() avoids a separate counter and works for any index.
for i, value in df['RegionName2'].items():
    # str() guards against NaN (float) entries, which do not support `in`
    if '(' in str(value):
        df.loc[i, 'RegionName2'] = np.nan
print("Output dataframe:")
print(df)
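
The same replacement can also be written without an explicit loop. This is a minimal sketch rather than a definitive implementation: it reuses the sample data above, with one missing value added as an assumption to show that NaN rows are handled, and it relies on `Series.str.contains` with `regex=False` and `na=False` together with `Series.mask`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "RegionName1": ["a", "b", "c", "d"],
    "RegionName2": [
        "Duluth and Duluth (University of Minnesota Duluth",
        "Monkato(Minnesota",
        "Other1",
        np.nan,  # missing value added for illustration (assumption)
    ],
})

# Boolean Series: True where RegionName2 contains a literal "(".
# regex=False treats "(" as plain text; na=False makes NaN rows evaluate to False.
has_paren = df["RegionName2"].str.contains("(", regex=False, na=False)

# mask() replaces the values where the condition is True with NaN
# and leaves all other values unchanged.
df["RegionName2"] = df["RegionName2"].mask(has_paren)

print(df)
```

On large frames a vectorized mask like this is usually faster than row-by-row assignment with .loc, although for a file the size of the one in the question either approach should be fine.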

answered Mar 13, 2020 at 12:34
