I'm working with a very large file in which one column of names contains extraneous characters (such as the "|" character) that I want to remove, but for some reason my str.replace call only seems to apply to some of the rows in the column.

My column in the dataframe summary looks something like this:

Labels
test|test 1
test 2
test 3
test|test 4
test|test 5
test 6

As you can see, some rows are already the way I want them, containing only the name "test #", but others have "test|" in front, which I want removed.

My function to remove them is like this:

correction = summary["Labels"].str.replace('test\|', '')

It seems to work for most of the values, but when I check for pipes ("|") in the dataframe (after merging correction back into summary), it still finds 9330 rows that contain one:

found = summary[summary['Labels'].str.contains('|',regex=False)]
print(len(found))
print(found['Labels'].value_counts())

Results
9330
test|test-667     59
test|test-765     40
test|test-1810    39
test|test-685     36
test|test-1077    33
                  ..

Does anyone know why this happens, and how I can fix it?

  • Any chance there could be something like testtest||test-667? Commented Jan 12, 2022 at 21:06
  • In the function you wrote, correction is a series. But when you are looking for errors, correction is a dataframe. So you are actually not showing us what you really did... Commented Jan 12, 2022 at 21:11
  • @Aryerez ah you're right sorry, forgot to add that i put correction into the summary dataframe after removing the unwanted values. i've corrected the code above to reflect that! Commented Jan 12, 2022 at 21:28
  • @Emily It is possible that your problem comes from combining correction and summary the wrong way, which we can't know since you are not showing us. Commented Jan 12, 2022 at 21:33

2 Answers


You were on the right track. Write the pattern as a raw string, pass regex=True explicitly, and assign the result back to the column:

summary['Labels'] = summary['Labels'].str.replace(r'test\|','', regex=True)



Labels
0  test 1
1  test 2
2  test 4
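For context, here is a minimal sketch of the same replacement on made-up sample data (the values below are illustrative, not the asker's real labels). It compares the escaped regex form with a literal match, which needs no escaping. Passing regex explicitly matters because the default changed to False in pandas 2.0, so the unescaped-looking pattern 'test\|' would otherwise be searched for literally and match nothing.

```python
import pandas as pd

# Illustrative sample modeled on the question's column (assumed values).
summary = pd.DataFrame(
    {"Labels": ["test|test 1", "test 2", "test 3", "test|test 4"]}
)

# Regex form: the pipe is escaped and regex=True is passed explicitly.
via_regex = summary["Labels"].str.replace(r"test\|", "", regex=True)

# Literal form: no escaping needed because the pattern is not a regex.
via_literal = summary["Labels"].str.replace("test|", "", regex=False)

# Assign back so the DataFrame itself is updated, then count remaining pipes.
summary["Labels"] = via_regex
remaining = int(summary["Labels"].str.contains("|", regex=False).sum())
print(summary)
print("rows still containing a pipe:", remaining)
```

If the count is still non-zero on the real data, the leftover rows probably do not match the exact prefix "test|" (for example, a repeated prefix or a doubled pipe), which is worth checking before blaming str.replace.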


Try str.extract:

df['Labels'] = df['Labels'].str.extract(r'\|(.*)', expand=False) \
                           .combine_first(df['Labels'])
print(df)

# Output
   Labels
0  test 1
1  test 2
2  test 3
3  test 4
4  test 5
5  test 6
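One caveat, sketched below on hypothetical data: the pattern \|(.*) keeps everything after the first pipe, so a label with a repeated prefix or a doubled pipe (as one of the comments on the question suggests might exist) would still contain a pipe afterwards. Assuming the wanted name is always the text after the last pipe, splitting on the pipe and taking the final piece is a possible alternative. The sample values are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical labels, including a repeated prefix and a doubled pipe.
df = pd.DataFrame(
    {"Labels": ["test|test 1", "test 2", "test|test|test 3", "testtest||test 4"]}
)

# extract + combine_first keeps the text after the FIRST pipe;
# rows without a pipe yield NaN and are filled from the original column.
after_first = (
    df["Labels"].str.extract(r"\|(.*)", expand=False).combine_first(df["Labels"])
)

# Splitting on the pipe and taking the last piece keeps the text
# after the LAST pipe; rows without a pipe are returned unchanged.
after_last = df["Labels"].str.split("|").str[-1]

print(after_first.tolist())
print(after_last.tolist())
```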

1 Comment

Thanks for your response! I tried this and it still doesn't seem to work, not sure why
