Adapting regex to python re module

Question

I have a regular expression that should remove all content in a file before <div id="content" and everything from <div id="footer" onward.

Live test

([\s\S]*)(?=<div id="content")|(?=<div id="footer)([\s\S]*)

I am using the re module to work with the regex in Python. This is the code I am using:

file = open(file_dir)
content = file.read()
result = re.search('([\s\S]*)(?=<div id="content")|(?=<div id="footer)([\s\S]*)', content)

I have tried using re.match as well, but I am unable to return the content I want. Right now I can only get it to return everything BEFORE the div#content.

Did you want to remove parts matching that regex, instead of finding and returning a part that matches that regex?
– user2357112, Commented Jun 26, 2017 at 17:33
Do you want to include the <div> tags or do you want those to be removed?
– victor, Commented Jun 26, 2017 at 17:33
I want to include the <div id="content" and everything inside that tag. I want to NOT include the <div id="footer" and everything after it. So basically I just want the HTML/content for everything inside the <div id="content".
– Brian Edelman, Commented Jun 26, 2017 at 17:36
@revo not sure what you're looking for atm. Did you see the live test link?
– Brian Edelman, Commented Jun 26, 2017 at 17:39

Jan · Accepted Answer · 2017-06-26 17:55:44Z

3

Though not advisable, you could extract your content instead of simply matching it:

import re

rx = re.compile(r'''
        .*?
        (
            <div\ id="content"
            .+?
        )
        <div\ id="footer
        ''', re.VERBOSE | re.DOTALL)

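# search() stops at the first match; group(1) is the captured content block.
# (A second positional argument to findall() is a start offset, not a match limit.)
content = rx.search(your_string_here).group(1)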
print(content)

This yields

<div id="content" class="other">
i have this other stuff 
<div>More stuff</div>

See a demo on regex101.com. Better yet: use a parser, e.g. BeautifulSoup instead.
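
As a minimal sketch of the same idea on a made-up page: the sample string and the helper name extract_content below are assumptions for illustration, not part of the original answer. Wrapping the search in a helper lets a page without both divs return None instead of raising.

```python
import re

# Hypothetical sample page; the real files will differ.
sample = '''<html><body>
<div id="header">nav</div>
<div id="content" class="other">
i have this other stuff
<div>More stuff</div>
</div>
<div id="footer">copyright</div>
</body></html>'''

# Capture from the content div up to, but not including, the footer div.
rx = re.compile(r'(<div id="content".+?)<div id="footer', re.DOTALL)


def extract_content(html):
    """Return the content block as a string, or None if the page does not match."""
    m = rx.search(html)
    return m.group(1) if m else None


content = extract_content(sample)
print(content)
```

With 40,000 files, a None result is a cheap way to spot pages whose structure differs from the expected one.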

edited Jun 26, 2017 at 17:55

answered Jun 26, 2017 at 17:48


3 Comments

Brian Edelman Over a year ago

Agreed that it isn't advisable. However I have something like 40,000 pages to go through and I don't want it to take an eternity so my thinking is that regex would be faster than a parser. Would you agree?

Jan Over a year ago

@BrianEdelman: It mostly is, yes. And if you always have the same structure, it will very likely work. Bear in mind though that you might get unexpected results for e.g. comments or nested <div class='footer'> structures - regular expressions are not parsers.

Brian Edelman Over a year ago

Thanks for this answer. Marking it as correct since it's what I asked, though I ultimately am going for a parser. Clearly that way lies madness.

Bill Bell · Answer · 2017-06-26 18:18:42Z

2

If you will permit me to comment: HTML + regex = madness. :)

HTML is often irregular, and a few stray characters will derail the cleverest regex. Moreover, many web pages that appear to be HTML are actually not easily available as HTML. Meanwhile, there are several lovely products for processing websites that are undergoing continuous development, amongst them BeautifulSoup, selenium, and scrapy.

>>> from io import StringIO
>>> import bs4
>>> HTML = StringIO('''\
... <body>
...     <div id="container">
...         <div id="content">
...             <span class="something_1">some words</span>
...             <a href="https://link">big one</a>
...         </div>
...     <div>
...     <div id="footer">
... </body>''')
>>> soup = bs4.BeautifulSoup(HTML, 'lxml')
>>> soup.find('div', attrs={'id': 'container'})
<div id="container">
<div id="content">
<span class="something_1">some words</span>
<a href="https://link">big one</a>
</div>
<div>
<div id="footer">
</div></div></div>
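
For the div the question actually asks about, here is a rough sketch that uses only the standard library's html.parser rather than BeautifulSoup. This is a stand-in technique, not the library shown above. The class name DivById and the sample markup are assumptions made up for illustration. It counts nested <div> tags, so an inner <div> does not end the capture early.

```python
from html.parser import HTMLParser


class DivById(HTMLParser):
    """Collect the markup of the first <div> whose id equals target_id.

    Limitations of this sketch: comments are dropped and character
    references come back decoded, so the output is not byte-identical
    to the source.
    """

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # open <div> levels inside the target; 0 means outside
        self.done = False   # set once the target div has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth == 0:
            # Still looking for the target div.
            if tag == 'div' and dict(attrs).get('id') == self.target_id:
                self.depth = 1
                self.parts.append(self.get_starttag_text())
            return
        if tag == 'div':
            self.depth += 1
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through once.
        if self.depth and not self.done:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.depth == 0 or self.done:
            return
        self.parts.append('</%s>' % tag)
        if tag == 'div':
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth and not self.done:
            self.parts.append(data)


# Hypothetical sample page.
html = '''<body>
<div id="container">
<div id="content">
<span class="something_1">some words</span>
<div>nested</div>
</div>
<div id="footer">footer text</div>
</div>
</body>'''

parser = DivById('content')
parser.feed(html)
parser.close()
content = ''.join(parser.parts)
print(content)
```

With BeautifulSoup the equivalent lookup is a single find call on the id, so this is mainly useful where installing a third-party package is not an option.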

answered Jun 26, 2017 at 18:18


3 Comments

Brian Edelman Over a year ago

Thanks for submitting! I ended up going with a parser even if it will be a bit slower. I was able to get similar code to the above working without StringIO. What's the advantage of that?

Bill Bell Over a year ago

The advantage of StringIO is only that I didn't have to create a file to offer an example. :) Also, the authors of scrapy say that their stuff is faster than BeautifulSoup's. You don't have to write a scraper to use it.

Bill Bell Over a year ago

Oh, and using StringIO, I could show the contents of the HTML right in the answer.

victor · Answer · 2017-06-26 17:56:43Z

1

This RegEx should work: https://regex101.com/r/L1zzOc/1

\<div id=\"content\"[.\s\S]*?(?=\<div id=\"footer\")

It looks like you had a typo in your original pattern: the closing " is missing after <div id="footer.
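
Here is a small sketch of how this could look with Python's re module, on a made-up sample string (the sample and the variable names are assumptions for illustration). It also shows the question's own pattern, with the quote restored, passed to sub() instead of search(). re.search only returns the first match, which for that alternation is the "everything before the content div" branch, whereas sub() deletes every match.

```python
import re

# Hypothetical sample input, made up for illustration.
sample = (
    '<div id="header">nav</div>\n'
    '<div id="content"><p>body</p></div>\n'
    '<div id="footer">copyright</div>\n'
)

# The pattern above, minus the backslash escapes that Python does not need.
keep = re.compile(r'<div id="content"[\s\S]*?(?=<div id="footer")')
m = keep.search(sample)
kept = m.group(0) if m else None

# The question's pattern with the missing quote restored, used to delete
# the unwanted parts rather than to search for them.
drop = re.compile(r'[\s\S]*(?=<div id="content")|(?=<div id="footer")[\s\S]*')
stripped = drop.sub('', sample)

# Both approaches leave the same text on this sample.
print(kept == stripped)
```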

answered Jun 26, 2017 at 17:56
