Python replace substring given pattern

Question

I am doing some data cleansing and want to remove the whole block delimited by the two markers
"<p.>kódy:" and "</p.>". The strings are located in a dataframe, and the characters between the two markers differ from record to record, so I thought combining str.replace or re.sub with some kind of wildcard could work, but I have not been successful.

This is my sample input:

<p.>kódy: 2008212017 2008212025 2008212041 2008212066 2008212074 2008212108 2008212116 2008212124 2008212132 2008212140 2008212165 2008212199 2008212207 2008212215 2008212223 2008212231 2008212249 2008212256 2008212264 2008212272 2008212314 2008212355 2008212363 2008212389 2052500028 2052500036 2052500051 2052500069 2052500093 2052500101 2054384017 2054384041 2054384066 2054384090 2054384116 2054384124 2054384132 2054384140 2054384157 2054384165 2054384181 2054384199 2054384207 2054384215 2054384223 2054384249 20543842494 2054384348 2081043032 2081043057 2081043081 2081043214 2081043222 311088575007 311095577004 311095711009 4210013769006 62008212110</p.>

<p.> 924071180 924071181 924071182 </p.>

And the desired output:


<p.> 924071180 924071181 924071182 </p.>

Any help would be appreciated!

Cheers,

Stepan

Does this answer your question? How to delete the words between two delimiters?
– dwb, Commented Feb 28, 2021 at 15:22
Which parts do you want to remove and which parts do you want to keep?
– Shoaib Wani, Commented Feb 28, 2021 at 15:22
I assume that the missing periods in the desired output vs. the input data are a mistake? You don't say anything about removing the periods in your description.
– CryptoFool, Commented Feb 28, 2021 at 15:25
@ShoaibWani The whole string between the two characters. I will be more specific in the description, thanks.
– Štěpán Zechovský, Commented Feb 28, 2021 at 19:23

CryptoFool · Accepted Answer · 2021-02-28 15:21:50Z

You can use a regular expression substitution to get what you want in a single call:

data = """<p.>kódy: 2008212017 2008212025 2008212041 2008212066 2008212074 2008212108 2008212116 2008212124 2008212132 2008212140 2008212165 2008212199 2008212207 2008212215 2008212223 2008212231 2008212249 2008212256 2008212264 2008212272 2008212314 2008212355 2008212363 2008212389 2052500028 2052500036 2052500051 2052500069 2052500093 2052500101 2054384017 2054384041 2054384066 2054384090 2054384116 2054384124 2054384132 2054384140 2054384157 2054384165 2054384181 2054384199 2054384207 2054384215 2054384223 2054384249 20543842494 2054384348 2081043032 2081043057 2081043081 2081043214 2081043222 311088575007 311095577004 311095711009 4210013769006 62008212110</p.>

<p.> 924071180 924071181 924071182 </p.>
"""

import re

r = re.sub(r"<p\.>kódy:.+?</p\.>", "", data)

print(r)

Result:

<p.> 924071180 924071181 924071182 </p.>
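
Two details of the pattern are worth checking against your own data. The non-greedy `.+?` stops at the first `</p.>`, whereas a greedy `.+` would run to the last one on the line. And because `.` does not match a newline by default, a block that wraps across lines is left untouched unless `re.DOTALL` is passed. A minimal sketch, using shortened made-up strings rather than the question's data:

```python
import re

# Shortened, made-up records standing in for the real data.
one_line = "<p.>kódy: 111 222</p.> <p.> 333 444 </p.>"
wrapped = "<p.>kódy: 111\n222</p.>\n<p.> 333 444 </p.>"

pattern = r"<p\.>kódy:.+?</p\.>"

# Non-greedy: stops at the first closing marker, so only the kódy block goes.
lazy = re.sub(pattern, "", one_line)

# Greedy: runs to the last closing marker and removes both blocks.
greedy = re.sub(r"<p\.>kódy:.+</p\.>", "", one_line)

# Without DOTALL, "." cannot cross the newline, so nothing is removed.
no_dotall = re.sub(pattern, "", wrapped)

# With DOTALL, "." also matches newlines and the wrapped block is removed.
with_dotall = re.sub(pattern, "", wrapped, flags=re.DOTALL)

print(repr(lazy))
print(repr(greedy))
print(repr(no_dotall))
print(repr(with_dotall))
```

If your cells can contain line breaks inside a block, the `re.DOTALL` variant is probably the safer choice.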


1 Comment

Štěpán Zechovský Over a year ago

That seems promising, is there any possibility to apply this over the whole dataframe column? Thanks!
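
One possible way to apply the substitution to every value in a column is to wrap it in a small function and map that function over the values. The sketch below uses a plain list in place of the dataframe column so that it runs without pandas. The helper name `strip_kody_block`, the sample values, and the added `re.DOTALL` flag are assumptions for illustration, not part of the answer above.

```python
import re

# Pattern from the answer above, compiled once. DOTALL is an added assumption
# so that blocks spanning several lines inside one cell are also removed.
KODY_BLOCK = re.compile(r"<p\.>kódy:.+?</p\.>", flags=re.DOTALL)


def strip_kody_block(text):
    """Remove every '<p.>kódy: ... </p.>' block and trim leftover whitespace."""
    if not isinstance(text, str):
        return text  # leave missing values (None, NaN) untouched
    return KODY_BLOCK.sub("", text).strip()


# Hypothetical column values standing in for the real dataframe column.
column = [
    "<p.>kódy: 111 222</p.> <p.> 333 444 </p.>",
    "<p.> 555 </p.>",
    None,
]

cleaned = [strip_kody_block(value) for value in column]
print(cleaned)
```

With pandas, the same function could be passed to `Series.map`, for example `df["text"] = df["text"].map(strip_kody_block)`, where `df` and the column name `"text"` are placeholders. `Series.str.replace` with `regex=True` should be an equivalent route for string values.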

David Meu · Accepted Answer · 2021-02-28 15:24:21Z

You can use split.

st = "<p.>kódy: 2008212017 2008212025 2008212041 2008212066 2008212074 2008212108 2008212116 2008212124 2008212132 2008212140 2008212165 2008212199 2008212207 2008212215 2008212223 2008212231 2008212249 2008212256 2008212264 2008212272 2008212314 2008212355 2008212363 2008212389 2052500028 2052500036 2052500051 2052500069 2052500093 2052500101 2054384017 2054384041 2054384066 2054384090 2054384116 2054384124 2054384132 2054384140 2054384157 2054384165 2054384181 2054384199 2054384207 2054384215 2054384223 2054384249 20543842494 2054384348 2081043032 2081043057 2081043081 2081043214 2081043222 311088575007 311095577004 311095711009 4210013769006 62008212110</p.> <p.> 924071180 924071181 924071182 </p.>"
st.split('</p.>')

Result:

['<p.>kódy: 2008212017 2008212025 2008212041 2008212066 2008212074 2008212108 2008212116 2008212124 2008212132 2008212140 2008212165 2008212199 2008212207 2008212215 2008212223 2008212231 2008212249 2008212256 2008212264 2008212272 2008212314 2008212355 2008212363 2008212389 2052500028 2052500036 2052500051 2052500069 2052500093 2052500101 2054384017 2054384041 2054384066 2054384090 2054384116 2054384124 2054384132 2054384140 2054384157 2054384165 2054384181 2054384199 2054384207 2054384215 2054384223 2054384249 20543842494 2054384348 2081043032 2081043057 2081043081 2081043214 2081043222 311088575007 311095577004 311095711009 4210013769006 62008212110', ' <p.> 924071180 924071181 924071182 ', '']
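
The list returned by `split` still has to be filtered and rejoined to get the cleaned string. One way to finish that idea, sketched on a shortened made-up input and assuming every segment is a complete `<p.>...</p.>` block with no text after the last closing marker:

```python
# Shortened, made-up input in the same shape as the question's data.
st = "<p.>kódy: 111 222</p.> <p.> 333 444 </p.>"

closing = "</p.>"

# Split on the closing marker; the last element is whatever follows the final marker.
parts = st.split(closing)

# Drop empty segments and segments that open a "kódy:" block,
# then put the closing marker back on the segments that remain.
kept = [
    part.lstrip() + closing
    for part in parts
    if part.strip() and not part.lstrip().startswith("<p.>kódy:")
]

result = " ".join(kept)
print(result)
```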

Or:

import re
# Escape the dots so they match literal periods rather than any character.
t = re.sub(r'<p\.>kódy.*?</p\.>', '', st)

Result:

' <p.> 924071180 924071181 924071182 </p.>'
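
In a pattern like this, an unescaped `.` matches any character, so `<p.>` would also match a tag such as `<pX>`. That makes no difference on the sample data, but it can remove more than intended on other input. A small sketch comparing the two spellings on a made-up string:

```python
import re

# Made-up input: "<pX>" is not the real opening marker.
text = "<pX>kódy: 111</pX> <p.>kódy: 222</p.> <p.> 333 </p.>"

# Unescaped dots: "." matches any character, so "<pX>...</pX>" is removed too.
loose = re.sub(r"<p.>kódy.*?</p.>", "", text)

# Escaped dots: only the literal "<p.>" and "</p.>" markers match.
strict = re.sub(r"<p\.>kódy.*?</p\.>", "", text)

print(repr(loose))
print(repr(strict))
```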
