Removing variable length characters from a string in python

Question

I have strings that are of the form below:

<p>The is a string.</p>
<em>This is another string.</em>

They are read in from a text file one line at a time. I want to separate these into words. For that I am just splitting the string using split().

Now I have a set of words but the first word will be <p>The rather than The. Same for the other words that have <> next to them. I want to remove the <..> from the words.

I'd like to do this in one line. What I mean is that I want to pass as a parameter something of the form <*>, like I would on the command line. I was thinking of using the replace() function for this, but I am not sure what the replace() parameter would look like.

For example, how could I change <..> below in a way that it will mean that I want to include anything that is between < and >:

x = x.replace("<..>", "")

Why don't you just use a parser like BeautifulSoup since these are just tags?
– Cory Kramer, Commented Jul 19, 2014 at 21:05
HTML should not be parsed with regex. Try a parser like Beautiful Soup or etree instead.
– inspectorG4dget, Commented Jul 19, 2014 at 21:06
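
For reference, the parser route suggested in these comments might look roughly like the sketch below. It uses the standard library's html.parser instead of Beautiful Soup so that nothing needs to be installed. The names TextCollector and text_of are made up for illustration and do not come from the thread.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: collects only the text between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for text content; tags themselves never reach this method.
        self.chunks.append(data)


def text_of(line):
    """Return the text of one line with its tags dropped."""
    parser = TextCollector()
    parser.feed(line)
    parser.close()
    return "".join(parser.chunks)


print(text_of("<p>The is a string.</p>").split())
# ['The', 'is', 'a', 'string.']
```

A parser is more code than a one-line regex, but it should cope better with input such as attributes that contain a > character.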

user2555451 · Accepted Answer · 2014-07-19 21:06:07Z

3

Unfortunately, str.replace does not support regex patterns; it only replaces literal substrings. You need to use re.sub for this:

>>> from re import sub
>>> sub("<[^>]*>", "", "<p>The is a string.</p>")
'The is a string.'
>>> sub("<[^>]*>", "", "<em>This is another string.</em>")
'This is another string.'

[^>]* matches zero or more characters that are not >.
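
As a script rather than an interactive session, the same idea could be sketched as below. The names TAG_PATTERN, strip_tags and words_in are only illustrative, and the sketch assumes each line has the simple tag format shown in the question.

```python
import re

# Same pattern as above, compiled once; the name is an illustrative choice.
TAG_PATTERN = re.compile(r"<[^>]*>")


def strip_tags(line):
    """Remove every <...> run from a single line."""
    return TAG_PATTERN.sub("", line)


def words_in(line):
    """Strip the tags, then split on whitespace."""
    return strip_tags(line).split()


print(strip_tags("<p>The is a string.</p>"))
# The is a string.
print(words_in("<em>This is another string.</em>"))
# ['This', 'is', 'another', 'string.']
```

Note that split() leaves the trailing period attached to the last word, since only the tags are removed.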

answered Jul 19, 2014 at 21:06

user2555451


3 Comments

Ed L Over a year ago

It's probably better to use (<[^>]*>)? for the regex.

zx81 Over a year ago

He wants to retrieve the words, so this will be a two-step solution, right? Some splitting will need to take place.

user2555451 Over a year ago

@zx81 - Well, in that case, all he needs to do is sub("<[^>]*>", "", "<p>The is a string.</p>").split(). We don't need anything fancy because he said that he is getting the lines one at a time and that they are all of the same format.

zx81 · Answer · 2014-07-19 21:07Z (edited by Community, 2017-05-23 10:25:04Z)

2

No Need for a 2-Step Solution

You don't need to (1) split and then (2) replace. The two options below each do it in a single step.

Option 1: Match All Instead of Splitting

Matching all and splitting are two sides of the same coin, and in this case it is safer to match all:

<[^>]+>|(\w+)

The words will be in Group 1.

Use it like this:

import re

subject = '<p>The is a string.</p><em>This is another string.</em>'
regex = re.compile(r'<[^>]+>|(\w+)')
# findall returns Group 1 for every match; it is '' whenever a tag matched instead
matches = [group for group in regex.findall(subject) if group]
print(matches)

Output

['The', 'is', 'a', 'string', 'This', 'is', 'another', 'string']

Discussion

This problem is a classic case of the technique explained in this question to "regex-match a pattern, excluding..."

The left side of the alternation | matches complete <tags>. We will ignore these matches. The right side matches and captures words to Group 1, and we know they are the right ones because they were not matched by the expression on the left.
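
To see why the left side matters, here is a small sketch on a made-up input whose tag carries an attribute (this input is an assumption, not one of the strings from the question). A bare \w+ picks up the tag and attribute names too, while the alternation skips them.

```python
import re

# Hypothetical input: the tag contains word characters of its own.
subject = '<a href="page">Click here</a>'

pattern = re.compile(r"<[^>]+>|(\w+)")

# Group 1 is None whenever the left side (a whole tag) matched.
words = [m.group(1) for m in pattern.finditer(subject) if m.group(1)]
print(words)
# ['Click', 'here']

# For comparison, matching words alone also grabs the tag's contents.
naive = re.findall(r"\w+", subject)
print(naive)
# ['a', 'href', 'page', 'Click', 'here', 'a']
```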

Option 2: One Single Split

<[^>]+>|[ .]

On the left side of the |, we use <complete tags> as a split delimiter. On the right side, we use a space character or a period.

Output

This
is
a
string
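
The answer does not show code for this option, so here is a minimal sketch of how the split might be written. The input string is an assumption chosen to be consistent with the output above. re.split returns empty strings wherever two delimiters touch or a delimiter sits at the start or end, so those are filtered out.

```python
import re

# Tags, spaces and periods all act as delimiters.
splitter = re.compile(r"<[^>]+>|[ .]")

# Assumed input; the original post does not show which string was split.
subject = "<p>This is a string.</p>"

pieces = splitter.split(subject)
# Empty strings appear at the edges and between adjacent delimiters.
words = [piece for piece in pieces if piece]

for word in words:
    print(word)
```

Strictly speaking, the filtering is a second pass over the result, which is one reason the match-all option may be the tidier of the two.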

edited May 23, 2017 at 10:25 by Community Bot

answered Jul 19, 2014 at 21:07 by zx81

5 Comments

zx81 Over a year ago

FYI: Added simple code (really two lines) for the Group 1 option, which IMO is more solid than the split option.

zx81 Over a year ago

Hey, I gave two solutions that require a SINGLE step: you don't need to 1 split, then 2 replace. Why did you choose a 2-step solution? That makes no sense to me. My Option 1 is ONE step (just match all). My Option 2 is ONE step (just split)

zx81 Over a year ago

On top of that, the solution you picked was edited to essentially use my regex <[^>]+>.

Mars Over a year ago

My question was how to remove all <..> from a string. You put removing and splitting together but I was only asking about removing it, which the accepted answer provides. I do want to split also but my question didn't ask how to perform the two in one step.

zx81 Over a year ago

But you ARE splitting, and answers on SO always try to give you a better way to do things. If you can do it in one step, that's what we show you. I have nothing against the replacement by @iCodez, but in my view you made an extraordinarily poor choice, and I am sure that even he would agree, as I would if the situation were reversed. I normally don't rant about this kind of thing, it happens a lot, but usually not with someone with over 500 rep.
