I have strings that are of the form below:

<p>The is a string.</p>
<em>This is another string.</em>

They are read in from a text file one line at a time. I want to separate these into words. For that I am just splitting the string using split().

Now I have a set of words but the first word will be <p>The rather than The. Same for the other words that have <> next to them. I want to remove the <..> from the words.

I'd like to do this in one line. What I mean is that I want to pass a pattern such as <*> as a parameter, the way I would use a wildcard on the command line. I was thinking of using the replace() function for this, but I am not sure what its parameter would look like.

For example, how could I change "<..>" below so that it matches anything between < and >:

x = x.replace("<..>", "")
  • Why don't you just use a parser like BeautifulSoup since these are just tags? Commented Jul 19, 2014 at 21:05
  • what is expected output? Commented Jul 19, 2014 at 21:05
  • Either BeautifulSoup or re will do the trick. Commented Jul 19, 2014 at 21:06
  • @Cyber. Looking to do it without a parser. Commented Jul 19, 2014 at 21:06
  • HTML should not be parsed with regex. Try a parser like Beautiful Soup or etree instead Commented Jul 19, 2014 at 21:06

2 Answers

Unfortunately, str.replace does not support regex patterns; it only replaces literal substrings. You need to use re.sub for this:

>>> from re import sub
>>> sub("<[^>]*>", "", "<p>The is a string.</p>")
'The is a string.'
>>> sub("<[^>]*>", "", "<em>This is another string.</em>")
'This is another string.'

[^>]* matches zero or more characters that are not >.
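As a rough sketch of how this could fit the line-by-line workflow from the question: the helper name words_from_line and the in-memory list of lines are assumptions made for illustration, standing in for lines read from the text file.

```python
import re

# Same pattern as in the answer above, precompiled once for reuse.
TAG = re.compile(r"<[^>]*>")

def words_from_line(line):
    """Remove every <...> run from the line, then split on whitespace."""
    return TAG.sub("", line).split()

# Hypothetical stand-in for lines read from the file one at a time.
lines = ["<p>The is a string.</p>\n", "<em>This is another string.</em>\n"]
for line in lines:
    print(words_from_line(line))
```

Note that split() only splits on whitespace, so punctuation stays attached to the words (the last word of each line is 'string.').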

3 Comments

It's probably better to use (<[^>]*>)? for the regex.
He wants to retrieve the words, so this will be a two-step solution, right? Some splitting will need to take place.
@zx81 - Well, in that case, all he needs to do is sub("<[^>]*>", "", "<p>The is a string.</p>").split(). We don't need anything fancy because he said that he is getting the lines one at a time and that they are all of the same format.

No Need for a 2-Step Solution

You don't need to (1) split and then (2) replace. The two solutions below show how to do it in a single step.


Option 1: Match All Instead of Splitting

Matching all and splitting are two sides of the same coin, and in this case it is safer to match all:

<[^>]+>|(\w+)

The words will be in Group 1.

Use it like this:

import re

subject = '<p>The is a string.</p><em>This is another string.</em>'
regex = re.compile(r'<[^>]+>|(\w+)')
# findall returns Group 1 for each match; it is '' wherever a tag matched instead
matches = [group for group in regex.findall(subject) if group]
print(matches)

Output

['The', 'is', 'a', 'string', 'This', 'is', 'another', 'string']

Discussion

This problem is a classic case of the technique explained in this question to "regex-match a pattern, excluding..."

The left side of the alternation | matches complete <tags>. We will ignore these matches. The right side matches and captures words to Group 1, and we know they are the right ones because they were not matched by the expression on the left.
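To make the role of the left side more concrete, here is a small sketch comparing the alternation with a plain \w+ search on a tag that carries an attribute. The function name words_outside_tags and the sample anchor tag are made up for illustration.

```python
import re

pattern = re.compile(r'<[^>]+>|(\w+)')

def words_outside_tags(text):
    """Keep only matches where Group 1 took part, i.e. the right-hand branch."""
    return [m.group(1) for m in pattern.finditer(text) if m.group(1) is not None]

sample = '<a href="x">link</a>'

# The tag branch consumes the whole tag, so words inside it never reach Group 1.
print(words_outside_tags(sample))

# A bare \w+ search has no such guard and also picks up the tag contents.
print(re.findall(r'\w+', sample))
```

The first call prints only the word between the tags, while the bare \w+ search also returns the tag name and attribute text.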

Option 2: One Single Split

<[^>]+>|[ .]

On the left side of the |, we use <complete tags> as a split delimiter. On the right side, we use a space character or a period.

Output

This
is
a
string
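The answer gives no code for this option, so the following is only a sketch of how the split might be run in Python, using one of the sample lines from the question as input. One detail worth knowing: re.split returns empty strings wherever delimiters are adjacent or sit at the ends of the input, so the sketch filters them out.

```python
import re

splitter = re.compile(r'<[^>]+>|[ .]')

subject = '<p>The is a string.</p>'

# Raw result still contains empty strings around the tags and the period.
raw_parts = splitter.split(subject)
print(raw_parts)

# Drop the empty strings to keep only the words.
words = [part for part in raw_parts if part]
print(words)
```

Because the delimiter set is spelled out by hand (space and period only), other punctuation such as commas would stay attached to the words unless added to the character class.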

5 Comments

FYI: Added simple code (really two lines) for the Group 1 option, which IMO is more solid than the split option.
Hey, I gave two solutions that require a SINGLE step: you don't need to 1 split, then 2 replace. Why did you choose a 2-step solution? That makes no sense to me. My Option 1 is ONE step (just match all). My Option 2 is ONE step (just split)
On top of that, the solution you picked was edited to essentially use my regex <[^>]+>.
My question was how to remove all <..> from a string. You put removing and splitting together but I was only asking about removing it, which the accepted answer provides. I do want to split also but my question didn't ask how to perform the two in one step.
But you ARE splitting, and answers on SO always try to give you a better way to do things. If you can do it in one step, that's what we show you. I have nothing against the replacement by @iCodez, but in my view you made an extraordinarily poor choice, and I am sure that even he would agree, as I would if the situation were reversed. I normally don't rant about this kind of thing, it happens a lot, but usually not with someone with over 500 rep.
