
I am doing some data cleansing and want to remove the whole string between the markers
"<p.>kódy:" and "</p.>", including the markers themselves. The strings are stored in a dataframe, and the text between the two markers differs from record to record, so I thought combining str.replace or re.sub with some kind of wildcard could work, but I have not been successful.

This is my sample input:

<p.>kódy: 2008212017 2008212025 2008212041 2008212066 2008212074 2008212108 2008212116 2008212124 2008212132 2008212140 2008212165 2008212199 2008212207 2008212215 2008212223 2008212231 2008212249 2008212256 2008212264 2008212272 2008212314 2008212355 2008212363 2008212389 2052500028 2052500036 2052500051 2052500069 2052500093 2052500101 2054384017 2054384041 2054384066 2054384090 2054384116 2054384124 2054384132 2054384140 2054384157 2054384165 2054384181 2054384199 2054384207 2054384215 2054384223 2054384249 20543842494 2054384348 2081043032 2081043057 2081043081 2081043214 2081043222 311088575007 311095577004 311095711009 4210013769006 62008212110</p.>

<p.> 924071180 924071181 924071182 </p.>

And the desired output:


<p.> 924071180 924071181 924071182 </p.>

Any help would be appreciated!

Cheers,

Stepan

  • Does this answer your question? How to delete the words between two delimiters? Commented Feb 28, 2021 at 15:22
  • Which parts do you want to remove and which parts do you want to keep? Commented Feb 28, 2021 at 15:22
  • I assume that the missing periods in the desired output vs the input data is a mistake? You don't say anything about removing the periods in your description. Commented Feb 28, 2021 at 15:25
  • @CryptoFool you are correct, thanks for pointing that out. Commented Feb 28, 2021 at 19:20
  • @ShoaibWani the whole string between the two characters. I will be more specific in the description, thanks. Commented Feb 28, 2021 at 19:23

2 Answers


You can use a regular expression substitution to get what you want in a single call:

data = """<p.>kódy: 2008212017 2008212025 2008212041 2008212066 2008212074 2008212108 2008212116 2008212124 2008212132 2008212140 2008212165 2008212199 2008212207 2008212215 2008212223 2008212231 2008212249 2008212256 2008212264 2008212272 2008212314 2008212355 2008212363 2008212389 2052500028 2052500036 2052500051 2052500069 2052500093 2052500101 2054384017 2054384041 2054384066 2054384090 2054384116 2054384124 2054384132 2054384140 2054384157 2054384165 2054384181 2054384199 2054384207 2054384215 2054384223 2054384249 20543842494 2054384348 2081043032 2081043057 2081043081 2081043214 2081043222 311088575007 311095577004 311095711009 4210013769006 62008212110</p.>

<p.> 924071180 924071181 924071182 </p.>
"""

import re

r = re.sub(r"<p\.>kódy:.+?</p\.>", "", data)

print(r)

Result:

<p.> 924071180 924071181 924071182 </p.>
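One caveat worth checking against your own data: by default `.` does not match a newline, so a block that wraps over several lines would be left in place. A minimal sketch, assuming a made-up record (`text` is hypothetical, not taken from the question) in which the block spans two lines:

```python
import re

# Hypothetical record in which the block to remove wraps over two lines.
text = "<p.>kódy: 2008212017\n2008212025</p.>\n<p.> 924071180 </p.>"

# Without re.DOTALL, "." stops at the newline, so the pattern never matches.
single_line = re.sub(r"<p\.>kódy:.+?</p\.>", "", text)

# With re.DOTALL, "." also matches newlines and the whole block is removed.
multi_line = re.sub(r"<p\.>kódy:.+?</p\.>", "", text, flags=re.DOTALL)

print(repr(single_line))
print(repr(multi_line))
```

If every block sits on a single line, as in the sample input, the flag makes no difference.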

1 Comment

That seems promising, is there any possibility to apply this over the whole dataframe column? Thanks!
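One possible way to apply the substitution to every record is to wrap it in a small function and call it per value. The sketch below is only an illustration: `strip_kody_block`, `KODY_BLOCK` and the `records` list are hypothetical names, and a plain list stands in for the dataframe column so the example runs without pandas.

```python
import re

# Compile once and reuse for every record; re.DOTALL lets a block span lines.
KODY_BLOCK = re.compile(r"<p\.>kódy:.*?</p\.>", flags=re.DOTALL)

def strip_kody_block(value):
    """Remove every '<p.>kódy: ... </p.>' block from one record."""
    # Leave non-string values (e.g. missing data) untouched.
    if not isinstance(value, str):
        return value
    return KODY_BLOCK.sub("", value).strip()

# Hypothetical column values standing in for the dataframe column.
records = [
    "<p.>kódy: 2008212017 2008212025</p.> <p.> 924071180 924071181 </p.>",
    "<p.> 924071182 </p.>",
    None,
]

cleaned = [strip_kody_block(r) for r in records]
print(cleaned)
```

With pandas, the same function could presumably be passed to `Series.apply`, for example `df["text"] = df["text"].apply(strip_kody_block)`, where `df` and `"text"` are placeholder names. `Series.str.replace` with `regex=True` is another option that takes the pattern directly.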

You can use str.split on the closing delimiter:

st = "<p.>kódy: 2008212017 2008212025 2008212041 2008212066 2008212074 2008212108 2008212116 2008212124 2008212132 2008212140 2008212165 2008212199 2008212207 2008212215 2008212223 2008212231 2008212249 2008212256 2008212264 2008212272 2008212314 2008212355 2008212363 2008212389 2052500028 2052500036 2052500051 2052500069 2052500093 2052500101 2054384017 2054384041 2054384066 2054384090 2054384116 2054384124 2054384132 2054384140 2054384157 2054384165 2054384181 2054384199 2054384207 2054384215 2054384223 2054384249 20543842494 2054384348 2081043032 2081043057 2081043081 2081043214 2081043222 311088575007 311095577004 311095711009 4210013769006 62008212110</p.> <p.> 924071180 924071181 924071182 </p.>"
st.split('</p.>')

Result:

['<p.>kódy: 2008212017 2008212025 2008212041 2008212066 2008212074 2008212108 2008212116 2008212124 2008212132 2008212140 2008212165 2008212199 2008212207 2008212215 2008212223 2008212231 2008212249 2008212256 2008212264 2008212272 2008212314 2008212355 2008212363 2008212389 2052500028 2052500036 2052500051 2052500069 2052500093 2052500101 2054384017 2054384041 2054384066 2054384090 2054384116 2054384124 2054384132 2054384140 2054384157 2054384165 2054384181 2054384199 2054384207 2054384215 2054384223 2054384249 20543842494 2054384348 2081043032 2081043057 2081043081 2081043214 2081043222 311088575007 311095577004 311095711009 4210013769006 62008212110', ' <p.> 924071180 924071181 924071182 ', '']

Or:

import re
# Escape the dots so they match a literal "." rather than any character
t = re.sub(r'<p\.>kódy.*?</p\.>', '', st)

Result:

' <p.> 924071180 924071181 924071182 </p.>'
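A note on the dots in the delimiters: in a regular expression an unescaped `.` matches any character, so escaping it keeps the match limited to the literal `<p.>` and `</p.>` markers. A small sketch with a made-up input (`sample` and its `<pX>` tag are hypothetical) shows the difference:

```python
import re

# Hypothetical input: "<pX>" is not the "<p.>" delimiter we are looking for.
sample = "<pX>kódy: 123</pX> <p.>kódy: 456</p.> <p.> 789 </p.>"

# Unescaped "." matches any character, so the "<pX>" block is removed as well.
loose = re.sub("<p.>kódy.*?</p.>", "", sample)

# Escaped "\." matches only a literal dot, so the "<pX>" block is kept.
strict = re.sub(r"<p\.>kódy.*?</p\.>", "", sample)

print(repr(loose))
print(repr(strict))
```

On the sample data from the question both patterns give the same result; the difference only shows up if similar-looking tags occur in the column.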
