I am working on an HTML file that contains item 1, item 2, and item 3. I want to delete all the text that comes after the LAST item 2. There may be more than one item 2 in the file. I am using the following, but it does not work:

text = """<A href="#106">Item&nbsp;2. <B>Item&nbsp;2. Properties</B> this is an example this is an example"""

>>> a=re.search ('(?<=<B>)Item&nbsp;2.',text)
>>> b= a.group(0)
>>> newText= text.partition(b)[0]
>>> newText
'<A href="#106">'

It deletes the text after the first item 2, not the second one.

  • could you please show the string you expect in your question? Commented Jul 27, 2013 at 19:31
  • Please read the highest voted answer here: stackoverflow.com/questions/1732348/… Commented Jul 27, 2013 at 19:36
  • I want the output to be "<A href="#106">Item&nbsp;2." Commented Jul 28, 2013 at 1:55

1 Answer

I'd use BeautifulSoup to parse the HTML and modify it. You might want to use the decompose() or extract() method.

BeautifulSoup is nice because it's pretty good at parsing malformed HTML.

For your specific example:

>>> import bs4
>>> text = """<A href="#106">Item&nbsp;2. <B>Item&nbsp;2. Properties</B> this is an example this is an example"""
>>> soup = bs4.BeautifulSoup(text)
>>> soup.b.next_sibling.extract()
u' this is an example this is an example'
>>> soup
<html><body><a href="#106">Item 2. <b>Item 2. Properties</b></a></body></html>

If you really want to use regular expressions, a non-greedy regex would work for your example:

>>> import re
>>> text = """<A href="#106">Item&nbsp;2. <B>Item&nbsp;2. Properties</B> this is an example this is an example"""
>>> m = re.match(r".*?Item&nbsp;2\.", text)
>>> m.group(0)
'<A href="#106">Item&nbsp;2.'
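
The question asks about the LAST "Item&nbsp;2.", while the non-greedy pattern above stops at the first one. As a rough sketch (the names `marker`, `first`, `last`, and `last_no_regex` are made up for illustration), a greedy `.*` backtracks to the last occurrence, and `str.rpartition` gives the same result without a regex:

```python
import re

text = ('<A href="#106">Item&nbsp;2. <B>Item&nbsp;2. Properties</B> '
        'this is an example this is an example')
marker = "Item&nbsp;2."  # literal text to cut after (escaped below for regex use)

# Non-greedy ".*?" stops at the FIRST marker.
first = re.match(r".*?" + re.escape(marker), text, re.DOTALL).group(0)

# Greedy ".*" consumes as much as possible, so it ends at the LAST marker.
last = re.match(r".*" + re.escape(marker), text, re.DOTALL).group(0)

# Same "last marker" result without a regex: rpartition splits at the
# last occurrence of the separator.
head, sep, _ = text.rpartition(marker)
last_no_regex = head + sep

print(first)  # <A href="#106">Item&nbsp;2.
print(last)   # <A href="#106">Item&nbsp;2. <B>Item&nbsp;2.
```

Note that if the marker is absent, `re.match` returns `None` and `rpartition` yields an empty `head` and `sep`, so real code would need to check for that case. Cutting inside a tag also leaves the HTML unbalanced, which is one reason the parser-based approach may be preferable.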