2

I have a regular expression that should remove all content in a file before <div id="content", as well as everything from <div id="footer" onward.

Live test

([\s\S]*)(?=<div id="content")|(?=<div id="footer)([\s\S]*)

I am using the re module to work with the regex in Python. The code I am using:

file = open(file_dir)
content = file.read()
result = re.search('([\s\S]*)(?=<div id="content")|(?=<div id="footer)([\s\S]*))', content)

I have tried using re.match as well, but I am unable to return the content I want. Right now I can only get it to return everything BEFORE the div#content.

10
  • Did you want to remove parts matching that regex, instead of finding and returning a part that matches that regex? Commented Jun 26, 2017 at 17:33
  • Do you want to include the <div> tags or do you want those to be removed? Commented Jun 26, 2017 at 17:33
  • I want to include the <div id="content" and everything inside that tag. I want to NOT include the <div id="footer" and everything after it. So basically just want the HTML/content for everything inside the <div id="content" Commented Jun 26, 2017 at 17:36
  • It's ambiguous until you show us exact right output. Commented Jun 26, 2017 at 17:38
  • @revo not sure what you're looking for atm. Did you see the live test link? Commented Jun 26, 2017 at 17:39

3 Answers

3

Though not advisable, you could extract your content instead of simply matching it:

import re

rx = re.compile(r'''
        .*?
        (
            <div\ id="content"
            .+?
        )
        <div\ id="footer
        ''', re.VERBOSE | re.DOTALL)

# your_string_here is the HTML text; group 1 is the captured content block
content = rx.search(your_string_here).group(1)
print(content)


This yields

<div id="content" class="other">
i have this other stuff 
<div>More stuff</div>

See a demo on regex101.com. Better yet: use a parser, e.g. BeautifulSoup instead.
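
The extraction above assumes every page contains both markers; on a page that lacks one there is no match, and indexing into the result raises an exception. For a batch run over many files, a guard may help. A minimal sketch, assuming the same pattern without the leading `.*?` (which `search` does not need); the helper name `extract_content` and the sample string are hypothetical:

```python
import re

# Capture from <div id="content" up to (but not including) <div id="footer
CONTENT_RX = re.compile(r'''
        (
            <div\ id="content"
            .+?
        )
        <div\ id="footer
        ''', re.VERBOSE | re.DOTALL)


def extract_content(html):
    """Return the captured content block, or None if a marker is missing."""
    match = CONTENT_RX.search(html)
    return match.group(1) if match else None


# Made-up sample page for illustration
sample = (
    '<div id="header">top</div>\n'
    '<div id="content" class="other">\n'
    '<div>More stuff</div>\n'
    '</div>\n'
    '<div id="footer">bottom</div>\n'
)
print(extract_content(sample))
print(extract_content('<p>no markers here</p>'))  # prints None
```

Returning None lets the caller log and skip a malformed page instead of aborting the whole run.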


3 Comments

Agreed that it isn't advisable. However I have something like 40,000 pages to go through and I don't want it to take an eternity so my thinking is that regex would be faster than a parser. Would you agree?
@BrianEdelman: It mostly is, yes. And if you always have the same structure, it will very likely work. Bear in mind though that you might get unexpected results for e.g. comments or nested <div class='footer'> structures - regular expressions are not parsers.
Thanks for this answer. Marking it as correct since it's what I asked, though I ultimately am going for a parser. Clearly that way lies madness.
2

If you will permit me to comment: HTML + regex = madness. :)

HTML is often irregular, and a few stray characters will derail the cleverest regex. Moreover, many web pages that appear to be HTML are actually not easily available as HTML. Meanwhile, there are several lovely products for processing websites that are undergoing continuous development, amongst them BeautifulSoup, selenium, and scrapy.

>>> from io import StringIO
>>> import bs4
>>> HTML = StringIO('''\
... <body>
...     <div id="container">
...         <div id="content">
...             <span class="something_1">some words</span>
...             <a href="https://link">big one</a>
...         </div>
...     <div>
...     <div id="footer">
... </body>''')
>>> soup = bs4.BeautifulSoup(HTML, 'lxml')
>>> soup.find('div', attrs={'id': 'container'})
<div id="container">
<div id="content">
<span class="something_1">some words</span>
<a href="https://link">big one</a>
</div>
<div>
<div id="footer">
</div></div></div>

3 Comments

Thanks for submitting! I ended up going with a parser even if it will be a bit slower. I was able to get similar code to the above working without StringIO. What's the advantage of that?
The advantage of StringIO is only that I didn't have to create a file to offer an example. :) Also, the authors of scrapy say that their stuff is faster than BeautifulSoup's. You don't have to write a scraper to use it.
Oh, and using StringIO, I could show the contents of the HTML right in the answer.
1

This RegEx should work: https://regex101.com/r/L1zzOc/1

\<div id=\"content\"[.\s\S]*?(?=\<div id=\"footer\")

It looks like your original pattern had a typo: the closing " after <div id="footer is missing.
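
For reference, a minimal sketch of using this pattern from Python, written without the backslashes and the `.` inside the character class, which are not needed. `re.search` returns only the first match, and in the question's alternation the first branch already matches at position 0, which would explain why only the text before div#content came back. The sample HTML and the variable names are made up for illustration:

```python
import re

# Made-up sample page for illustration
html = (
    '<div id="header">site header</div>\n'
    '<div id="content" class="other">\n'
    '<div>More stuff</div>\n'
    '</div>\n'
    '<div id="footer">site footer</div>\n'
)

# Match from the opening content tag, lazily, up to (but not including) the footer tag.
pattern = re.compile(r'<div id="content"[\s\S]*?(?=<div id="footer")')

match = pattern.search(html)
kept = match.group(0) if match else None
print(kept)

# The question's alternation (with the missing quote restored) matches the
# parts to discard, so it fits re.sub rather than re.search.
discard = re.compile(r'[\s\S]*(?=<div id="content")|(?=<div id="footer")[\s\S]*')
print(discard.sub('', html) == kept)
```

Both routes should leave the same text here; which one reads better depends on whether you think of the task as keeping the middle or deleting the ends.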
