Remove a portion of HTML text using Python

Question

I have a very long HTML text of the following structure:

<div>
    <div>
        <p>Paragraph 1 Lorem ipsum dolor... long text... </p>
        <p>Paragraph 2 Lorem ipsum dolor... long text... </p>
        <p>Paragraph 3 Lorem ipsum dolor... long text... </p>
    </div>
</div>

Now, let's say I want to trim the HTML text to just 1000 characters, but I still want the HTML to be valid, that is, close the tags whose closing tags were removed. What can I do to correct the trimmed HTML text using Python? Note that the HTML is not always structured as above.

I need this for an email campaign wherein a preview of the blog is sent but the recipient needs to visit the blog's URL to see the complete article.

I can use Django or Odoo for this, though I'm actually using Odoo in this case. I can get the rendered HTML body from my template, but I need to trim it first and then send the modified HTML to my mailing list.
– macdelacruz, Commented Nov 10, 2015 at 16:27

SeniorFoffo · Accepted Answer · 2015-11-10 17:54:49Z

1

How about BeautifulSoup? (python-bs4)

from bs4 import BeautifulSoup

test_html = """<div>
    <div>
        <p>Paragraph 1 Lorem ipsum dolor... long text... </p>
        <p>Paragraph 2 Lorem ipsum dolor... long text... </p>
        <p>Paragraph 3 Lorem ipsum dolor... long text... </p>
    </div>
</div>"""

test_html = test_html[0:50]
soup = BeautifulSoup(test_html, 'html.parser')

print(soup.prettify())

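BeautifulSoup closes the unclosed tags when it parses the truncated string, so .prettify() should print well-formed HTML. (str(soup) gives the same markup without the extra indentation that .prettify() adds.)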
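
One caveat with a fixed-length slice is that the cut can land inside a tag (for example right after `</`). A minimal sketch of one way to guard against that before parsing is shown below. The helper name `cut_before_partial_tag` is hypothetical, and the sketch assumes `<` and `>` appear only as tag delimiters, not inside attribute values, comments, or scripts.

```python
def cut_before_partial_tag(html_text, limit):
    """Slice html_text to at most `limit` characters, then drop a
    trailing, unfinished tag if the slice ended in the middle of one."""
    cut = html_text[:limit]
    last_open = cut.rfind("<")
    last_close = cut.rfind(">")
    # A "<" with no ">" after it means the slice stopped inside a tag.
    if last_open > last_close:
        cut = cut[:last_open]
    return cut


sample = "<div><p>Hello</p></div>"
print(cut_before_partial_tag(sample, 15))  # prints: <div><p>Hello
```

The result can then be passed to BeautifulSoup as above so that the remaining open tags get closed.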

edited Nov 10, 2015 at 17:54

answered Nov 10, 2015 at 17:07

SeniorFoffo


1 Comment

macdelacruz Over a year ago

So far, this is the most viable solution presented. But if this can be done with simple code that needs no extra module, that might be better.
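
For the no-extra-module route raised in this comment, a rough sketch using only the standard library's `html.parser` might look like the following. The names `TruncatingParser`, `truncate_html`, and `VOID_TAGS` are hypothetical. Note the assumptions: the limit counts visible text characters (including whitespace between tags) rather than raw markup characters, comments and doctype declarations are dropped, and content inside `<script>` or `<style>` is not treated specially.

```python
import html
from html.parser import HTMLParser

# Elements that never take a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TruncatingParser(HTMLParser):
    """Copies markup through until max_chars text characters are emitted."""

    def __init__(self, max_chars):
        super().__init__()  # convert_charrefs=True: text arrives unescaped
        self.remaining = max_chars
        self.out = []
        self.open_tags = []
        self.done = False

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        self.out.append(self.get_starttag_text())  # raw tag as written
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.done or tag not in self.open_tags:
            return  # ignore stray closing tags
        # Close anything still open inside this element, then the element.
        while True:
            top = self.open_tags.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        if self.done:
            return
        if len(data) >= self.remaining:
            data = data[:self.remaining]
            self.done = True
        self.remaining -= len(data)
        # Re-escape, because the parser handed us unescaped text.
        self.out.append(html.escape(data, quote=False))

    def result(self):
        # Append closing tags for everything still open at the cut.
        closing = "".join(f"</{tag}>" for tag in reversed(self.open_tags))
        return "".join(self.out) + closing


def truncate_html(html_text, max_chars):
    parser = TruncatingParser(max_chars)
    parser.feed(html_text)
    parser.close()  # flush any buffered text
    return parser.result()


sample = "<div><div><p>Hello world</p><p>Second paragraph</p></div></div>"
print(truncate_html(sample, 8))  # prints: <div><div><p>Hello wo</p></div></div>
```

Because the cut is made on text nodes, it cannot land inside a tag, and it does not depend on line breaks, so minified HTML is handled the same way.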

wallabra · Accepted Answer · 2015-11-10 16:39:19Z

0

Here is an example. Suppose the HTML looks like this:

<div>
  <p>Long text...</p>
  <p>Longer text to be trimmed</p>
</div>

And you have Python code like:

def TrimHTML(HtmlString):
    result = []
    newlinesremaining = 2  # number of leading lines to keep; pick any value
    foundlastpart = False
    for x in HtmlString:  # HtmlString is the HTML to be trimmed
        if newlinesremaining >= 1:
            # still inside the part to keep: copy the character through
            if x == '\n':
                newlinesremaining -= 1
            result.append(x)
        elif not foundlastpart:
            # skip one line; once its newline is seen, keep everything after it
            if x == '\n':
                newlinesremaining = float('inf')
                foundlastpart = True
    return ''.join(result)

If you pass the example HTML above to the function, it returns:

<div>
  <p>Long text...</p>
</div>

I couldn't test it in the short time I have before work.

answered Nov 10, 2015 at 16:39

wallabra


2 Comments

macdelacruz Over a year ago

This assumes that each line is one line of valid HTML with properly closed tags. Also, what if I remove all line breaks or minify the HTML?

wallabra Over a year ago

According to the official documentation, minifying is not recommended, so minified HTML might be uncommon. Either way, I ain't much good at such scripts.
