I have a very long HTML text of the following structure:

<div>
    <div>
        <p>Paragraph 1 Lorem ipsum dolor... long text... </p>
        <p>Paragraph 2 Lorem ipsum dolor... long text... </p>
        <p>Paragraph 3 Lorem ipsum dolor... long text... </p>
    </div>
</div>

Now, let's say I want to trim the HTML text to just 1000 characters, but I still want the HTML to be valid, that is, close the tags whose closing tags were removed. What can I do to correct the trimmed HTML text using Python? Note that the HTML is not always structured as above.

I need this for an email campaign wherein a preview of the blog is sent but the recipient needs to visit the blog's URL to see the complete article.

  • Are you using any framework? If so, which one? Commented Nov 10, 2015 at 16:23
  • I can use either Django or Odoo for this, though I'm actually using Odoo in this case. I can get the rendered HTML body from my template, but I need to trim it first and then send the modified HTML to my mailing list. Commented Nov 10, 2015 at 16:27

2 Answers

How about BeautifulSoup? (python-bs4)

from bs4 import BeautifulSoup

test_html = """<div>
    <div>
        <p>Paragraph 1 Lorem ipsum dolor... long text... </p>
        <p>Paragraph 2 Lorem ipsum dolor... long text... </p>
        <p>Paragraph 3 Lorem ipsum dolor... long text... </p>
    </div>
</div>"""

test_html = test_html[0:50]
soup = BeautifulSoup(test_html, 'html.parser')

print(soup.prettify())

Parsing the truncated string makes BeautifulSoup close any tags that were left open, and .prettify() then prints the repaired markup.
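One caveat with slicing the raw string is that the cut can land in the middle of a tag (for example `<p cla`), which the parser may then treat as stray text. A minimal sketch of one way around this, using a hypothetical helper named `cut_outside_tag` that is not part of BeautifulSoup: it backs the cut up to just before any unfinished tag. It assumes that `<` and `>` appear only as tag delimiters, so it is not a general solution.

```python
def cut_outside_tag(html, limit):
    """Slice html to at most `limit` characters without ending inside a tag.

    Hypothetical helper; assumes "<" and ">" only appear as tag delimiters.
    """
    cut = html[:limit]
    last_open = cut.rfind("<")
    last_close = cut.rfind(">")
    # A "<" after the last ">" means the slice ended inside a tag,
    # so drop that unfinished fragment.
    if last_open > last_close:
        cut = cut[:last_open]
    return cut


print(cut_outside_tag('<div><p class="x">hi</p></div>', 10))
```

The result can then be passed to `BeautifulSoup(..., 'html.parser')` as in the answer above so that the remaining open tags get closed.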

1 Comment

As of now, this is the most viable solution presented. But if this can be done with simple code that needs no extra module, that might be better.
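On the no-extra-module route: a rough sketch using only the standard library's `html.parser`. The names `TruncatingParser`, `truncate_html` and `VOID_TAGS` are made up for illustration. Note that it uses a different technique from slicing the raw string: it counts text characters (whitespace included) rather than markup characters, stops once the limit is reached, and then closes whatever elements are still open. It drops comments and doctypes and does not special-case `<script>` or `<style>` content, so treat it as a starting point rather than a finished implementation.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never take a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TruncatingParser(HTMLParser):
    """Copy HTML to `out` until `limit` text characters have been emitted."""

    def __init__(self, limit):
        super().__init__(convert_charrefs=True)
        self.remaining = limit
        self.out = []
        self.open_tags = []  # stack of elements that are currently open
        self.done = False

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        # Re-emit the start tag exactly as it was written in the source.
        self.out.append(self.get_starttag_text())
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: emit it, nothing to track.
        if not self.done:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.done or tag not in self.open_tags:
            return  # past the limit, or a stray closing tag
        # Close every element opened since the matching start tag.
        while self.open_tags:
            open_tag = self.open_tags.pop()
            self.out.append(f"</{open_tag}>")
            if open_tag == tag:
                break

    def handle_data(self, data):
        if self.done:
            return
        piece = data[:self.remaining]
        # Text arrives unescaped, so escape it again before re-emitting.
        self.out.append(escape(piece, quote=False))
        self.remaining -= len(piece)
        if self.remaining <= 0:
            self.done = True


def truncate_html(html, limit):
    parser = TruncatingParser(limit)
    parser.feed(html)
    parser.close()
    # Close whatever is still open, innermost element first.
    closing = "".join(f"</{tag}>" for tag in reversed(parser.open_tags))
    return "".join(parser.out) + closing


sample = "<div><div><p>Hello world</p><p>Second paragraph</p></div></div>"
print(truncate_html(sample, 15))
```

For the email preview use case, `limit` would be something like 1000, and the returned string could be dropped into the mail body as is.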

Here is an example. If the HTML looks like this:

<div>
  <p>Long text...</p>
  <p>Longer text to be trimmed</p>
</div>

and you have Python code like this:

def TrimHTML(HtmlString):
    result = []
    newlinesremaining = 2  # number of leading lines to keep (or some other value of your choice)
    foundlastpart = False
    for x in HtmlString:  # HtmlString is the HTML to be trimmed
        if newlinesremaining >= 1:
            if x == '\n':
                newlinesremaining -= 1
            result.append(x)
        elif not foundlastpart:
            # Skip the next line; once it ends, keep everything after it.
            if x == '\n':
                newlinesremaining = float('inf')
                foundlastpart = True
    return ''.join(result)

and you pass the example HTML above to the function, then it returns:

<div>
  <p>Long text...</p>
</div>

Note that I couldn't test it in the short time window I have before work.

2 Comments

This assumes that each line is one line of valid HTML with properly closed tags. Also, what if I remove all line breaks or minify the HTML?
According to the official documentation, doing so is not recommended, so minified HTML might be uncommon. Either way, I'm not much good at such scripts.
