
I'm parsing a website with the requests module, and I'm trying to get specific URLs inside <a> tags (as a list, since the tags are used more than once) without using BeautifulSoup. Here's part of the HTML I'm trying to parse:

<td class="notranslate" style="height:25px;">
    <a class="post-list-subject" href="/Forum/ShowPost.aspx?PostID=80631954">
        <div class="thread-link-outer-wrapper">
            <div class="thread-link-container notranslate">
                Forum Rule: Don&#39;t Spam in Any Way
            </div>

I'm trying to get the href value inside the tag:

/Forum/ShowPost.aspx?PostID=80631954

The thing is, because I'm parsing a forum site, those tags appear many times on a page. I'd like to retrieve a list of post URLs with str.split, using code similar to this:

htmltext.split('<a class="post-list-subject" href="')[1].split('"><div class="thread-link-outer-wrapper">')[0]

There is nothing in the HTML code to indicate a post number on the page, just links.

  • To parse HTML, use an HTML parser; that's what they are designed for. Commented May 3, 2016 at 0:31
  • @ScottHunter I understand I could use an HTML parser, but I don't want to import a big module for a tiny thing that could be done without that module. It's more efficient in the long run, execution times as well. The code I have now goes through PAGES of these forum posts and collects this kind of data. Commented May 3, 2016 at 0:34
  • And yet, here you are, needing help. If you are only doing simple parsing, then you probably only need a simple parser, or just a part of a larger one. You are not only re-inventing the wheel, you are asking others to help you do so, when existing tools specifically built for such tasks exist. Commented May 3, 2016 at 0:39
  • As I say in my profile, measure before optimizing. Worry about the hassle of building a fake HTML parser out of duct tape before you worry about a few milliseconds of runtime. Commented May 3, 2016 at 0:55
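For what it's worth, the standard library already ships a small parser of exactly that kind: html.parser requires no third-party install. A minimal sketch of using it for this page (the class name and the one-line sample are just illustrative):

```python
from html.parser import HTMLParser

class PostLinkParser(HTMLParser):
    """Collect href values from <a class="post-list-subject"> tags."""
    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # list of (name, value) pairs -> dict
        if tag == 'a' and attrs.get('class') == 'post-list-subject':
            self.urls.append(attrs.get('href'))

parser = PostLinkParser()
parser.feed('<a class="post-list-subject" '
            'href="/Forum/ShowPost.aspx?PostID=80631954"></a>')
print(parser.urls)  # ['/Forum/ShowPost.aspx?PostID=80631954']
```

Unlike string splitting, this keeps working when the page's whitespace or attribute order changes.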

3 Answers


In my opinion there are better ways to do this. Even if you don't want to use BeautifulSoup, I would lean towards regular expressions. However, the task can definitely be accomplished using the code you want. Here's one way, using a list comprehension:

results = [chunk.split('">')[0]
           for chunk in htmltext.split('<a class="post-list-subject" href="')[1:]]

I modeled it as closely on your original code as possible, but I simplified the second split argument to avoid whitespace issues.
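For instance, with a small sample of the markup from the question (the second link is invented for illustration), the comprehension yields every href on the page:

```python
# Hypothetical sample; the real page has the same repeated structure.
htmltext = '''
<td class="notranslate" style="height:25px;">
    <a class="post-list-subject" href="/Forum/ShowPost.aspx?PostID=80631954">
        <div class="thread-link-outer-wrapper">...</div>
    </a>
</td>
<td class="notranslate" style="height:25px;">
    <a class="post-list-subject" href="/Forum/ShowPost.aspx?PostID=80631955">
        <div class="thread-link-outer-wrapper">...</div>
    </a>
</td>
'''

# Everything after each anchor prefix, cut at the closing quote.
results = [chunk.split('">')[0]
           for chunk in htmltext.split('<a class="post-list-subject" href="')[1:]]
print(results)
# ['/Forum/ShowPost.aspx?PostID=80631954', '/Forum/ShowPost.aspx?PostID=80631955']
```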

In case regular expressions are fair game, here's how you could do it:

import re

# ([^"]*) stops at the first closing quote, so a line containing several
# links yields one match per link; a greedy (.*) could overshoot.
target = '<a class="post-list-subject" href="([^"]*)">'
results = re.findall(target, htmltext)
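For example, against a small assumed sample with two links on one line, findall returns every captured href (the non-greedy character class keeps each match inside its own tag):

```python
import re

# Hypothetical sample, not the real page: two anchors on a single line.
htmltext = ('<a class="post-list-subject" href="/Forum/ShowPost.aspx?PostID=80631954">'
            '<a class="post-list-subject" href="/Forum/ShowPost.aspx?PostID=80631955">')

# [^"]* cannot cross a quote, so each link is captured separately.
target = '<a class="post-list-subject" href="([^"]*)">'
results = re.findall(target, htmltext)
print(results)
# ['/Forum/ShowPost.aspx?PostID=80631954', '/Forum/ShowPost.aspx?PostID=80631955']
```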



Consider using Beautiful Soup. It will make your life a lot easier. Pay attention to the choice of parser so that you can get the balance of speed and leniency that is appropriate for your task.

1 Comment

I specified in the question that I don't want to use a big module for a little thing. See my comment to @ScottHunter.

It seems really dicey to pre-optimize without establishing that HTML parsing is actually going to be your bottleneck. If you're worried about performance, why not use lxml? Module imports are hardly ever the bottleneck, and it sounds like you're shooting yourself in the foot here.

That said, this will technically do what you want, but it seriously is not more performant than using an HTML parser like lxml in the long run. Explicitly avoiding an HTML parser will also probably drastically increase your development time as you figure out obscure string manipulation snippets rather than just using the nice tree structure that you get for free with HTML.

def strcleaner(s):
    # Strip newlines, spaces, and tabs so the split targets match the page source.
    return s.replace('\n', '').replace(' ', '').replace('\t', '')

S = strcleaner(htmltext)
S.split(strcleaner('<a class="post-list-subject" href="'))[1].split(strcleaner('"><div class="thread-link-outer-wrapper">'))[0]

The problem with the code you posted is that whitespace and newlines are characters too.
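To make that concrete, here is a self-contained demo (the sample string is abbreviated from the question, and the cleaning helper is redefined so the snippet runs on its own):

```python
def strcleaner(s):
    # Collapse formatting so the split targets line up with the page source.
    return s.replace('\n', '').replace(' ', '').replace('\t', '')

htmltext = '''<a class="post-list-subject" href="/Forum/ShowPost.aspx?PostID=80631954">
        <div class="thread-link-outer-wrapper">'''

# The naive target never matches: the page has a newline and indentation
# between the closing quote and the <div>.
naive_target = '"><div class="thread-link-outer-wrapper">'
print(naive_target in htmltext)  # False

# After cleaning both the haystack and the targets, the same split works.
S = strcleaner(htmltext)
url = S.split(strcleaner('<a class="post-list-subject" href="'))[1] \
       .split(strcleaner(naive_target))[0]
print(url)  # /Forum/ShowPost.aspx?PostID=80631954
```

Note that this only stays safe as long as the URLs themselves never contain spaces, which holds for these forum links.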

