Using lxml to find/replace text while maintaining HTML structure

Question

I'm trying to build a simple script to insert hyperlinks into HTML based on a dictionary.

It works up to a point, using simple str.replace() but then breaks when it encounters existing hyperlinks or other tags which are then rendered broken.

I've seen broad answers on StackOverflow to the question, which suggest using lxml and BeatifulSoup, but I'm running into specific issues and hope someone could give me a push in the right direction.

Here's an example sample:

test_string = """<p>
    COLD and frosty <a href="/weather">weather</a> has gripped the area and the Met Office have predicted strong winds along will the possibility of sleet and snow later in the week.
</p>
<p>
    For more details take a look at the full five-day weather forecast below.
</p>
<h3>Weather map</h3>
...
<p>
    <strong>Today:</strong> It will be a cold start today with a hard frost, perhaps with isolated wintry showers in the east at first. There will be plenty of sunny spells through the day, especially towards the west. Feeling cold. Maximum Temperature 6C.
</p>
<p>
    <strong>Tonight:</strong> Tonight will be quiet with long clear spells and light winds. A widespread frost is expected with very cold temperatures. Minimum Temperature -5C.
</p>
<p>
    <strong>Tuesday:</strong> Tuesday will be a similar day with a cold, frosty but bright start. There will be plenty of sunshine, with skies clouding over into the evening. Maximum Temperature 7C.
</p>"""

import lxml.html
root = lxml.html.fromstring(test_string)
paragraphs = root.xpath("//*")

for p in paragraphs:
    if p.tag == "p":
       DO SOMETHING

The logic was to only treat paragraph tags in an article (and avoid header tags and tables).

The problem, however, is that I've not been able to find a way of then processing the text in a tag in such a way that I preserve the location of tags such as

How can I break the paragraph into chunks using lxml which are:

text on which I can then run find/replace
tags which must be preserved

in such a way that I can then preserve the overall structure?

UPDATE: For instance, I might want to replace Met Office with <a href="met_office.html">Met Office</a> while retaining the other text as it is, thus:

<p>
        COLD and frosty <a href="/weather">weather</a> has gripped the area and the <a href="met_office.html">Met Office</a> have predicted strong winds along will the possibility of sleet and snow later in the week.
    </p>

You can use XPath to select just <p> elements directly : root.xpath("//p") — har07
– har07, Commented Feb 27, 2016 at 2:36
The example given is really confusing: "..I might want to replace Met Office with Met Office" — har07
– har07, Commented Feb 27, 2016 at 2:37
The replacement should be: Met Office (no tag) to <a href="met_office.jpg">Met Office</a> (with hyperlink). For some reason it didn't show up in the text (which is why I added the code example). — elksie5000
– elksie5000, Commented Feb 28, 2016 at 8:46

Kenly · Accepted Answer · 2016-02-26 12:52:27Z

0

To replace paragraph text for example, use node.set(attribute, value) then use lxml.html.tostring() to save changes:

p.set('string', '')
result = lxml.html.tostring(root)

answered Feb 26, 2016 at 12:52

Kenly

27k7 gold badges50 silver badges67 bronze badges

Sign up to request clarification or add additional context in comments.

Comments

Mr Lister · Accepted Answer · 2016-02-29 22:00:03Z

0

Just use BeautifulSoup which (optionally) uses lxml:

from bs4 import BeautifulSoup
soup = BeautifulSoup(test_string)
pars = soup.findall('p')
for p in pars:
  p.string = 'this text was replaced'
print s.prettify()

The result is:

<html>
 <body>
  <p>
   this text was replaced
  </p>
  <p>
   this text was replaced
  </p>
  <h3>
   Weather map
  </h3>
  ...
  <p>
   this text was replaced
  </p>
  <p>
   this text was replaced
  </p>
  <p>
   this text was replaced
  </p>
 </body>
</html>

Update: If you want to search for a string inside the paragraph's text and wrap it in a new <a> tag, a regular expression combined with the new_tag() function in BeautifulSoup should do the job.

edited Feb 29, 2016 at 22:00

Mr Lister

46.8k15 gold badges118 silver badges156 bronze badges

answered Feb 26, 2016 at 12:57

Suzana

4,4583 gold badges32 silver badges53 bronze badges

3 Comments

elksie5000 Over a year ago

I've added an update to better explain what I'm trying to achieve.

Suzana Over a year ago

That's a quite different question now...I assume you want to wrap a text in another <a> tag?

elksie5000 Over a year ago

I accept my question was not as well outlined as it could have been. The problem I've got is that I don't know how to add hyperlinks into HTML without corrupting existing hyperlinks or other HTML tags. It doesn't help that I've not used BeautifulSoup and instead previously relied upon lxml.

Collectives™ on Stack Overflow

Using lxml to find/replace text while maintaining HTML structure

2 Answers 2

Comments

3 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

3 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related