I'm trying to build a simple script to insert hyperlinks into HTML based on a dictionary.

It works up to a point using a simple str.replace(), but it breaks when it encounters existing hyperlinks or other tags, which then end up corrupted.

I've seen broad answers to this question on StackOverflow that suggest using lxml and BeautifulSoup, but I'm running into specific issues and hope someone can give me a push in the right direction.

Here's a sample:

test_string = """<p>
    COLD and frosty <a href="/weather">weather</a> has gripped the area and the Met Office have predicted strong winds along will the possibility of sleet and snow later in the week.
</p>
<p>
    For more details take a look at the full five-day weather forecast below.
</p>
<h3>Weather map</h3>
...
<p>
    <strong>Today:</strong> It will be a cold start today with a hard frost, perhaps with isolated wintry showers in the east at first. There will be plenty of sunny spells through the day, especially towards the west. Feeling cold. Maximum Temperature 6C.
</p>
<p>
    <strong>Tonight:</strong> Tonight will be quiet with long clear spells and light winds. A widespread frost is expected with very cold temperatures. Minimum Temperature -5C.
</p>
<p>
    <strong>Tuesday:</strong> Tuesday will be a similar day with a cold, frosty but bright start. There will be plenty of sunshine, with skies clouding over into the evening. Maximum Temperature 7C.
</p>"""

import lxml.html

root = lxml.html.fromstring(test_string)
paragraphs = root.xpath("//*")

for p in paragraphs:
    if p.tag == "p":
        pass  # DO SOMETHING with the paragraph here

The logic was to only treat paragraph tags in an article (and avoid header tags and tables).

The problem, however, is that I've not been able to find a way of processing the text inside a tag while preserving the location of nested tags such as <a>.

How can I break the paragraph into chunks using lxml which are:

  1. text on which I can then run find/replace
  2. tags which must be preserved

in such a way that I can then preserve the overall structure?

UPDATE: For instance, I might want to replace Met Office with <a href="met_office.html">Met Office</a> while retaining the other text as it is, thus:

<p>
        COLD and frosty <a href="/weather">weather</a> has gripped the area and the <a href="met_office.html">Met Office</a> have predicted strong winds along will the possibility of sleet and snow later in the week.
    </p>
  • You can use XPath to select just <p> elements directly: root.xpath("//p") Commented Feb 27, 2016 at 2:36
  • The example given is really confusing: "..I might want to replace Met Office with Met Office" Commented Feb 27, 2016 at 2:37
  • The replacement should be: Met Office (no tag) to <a href="met_office.jpg">Met Office</a> (with hyperlink). For some reason it didn't show up in the text (which is why I added the code example). Commented Feb 28, 2016 at 8:46

2 Answers

To replace a paragraph's text, for example, assign to node.text (node.set(attribute, value) only sets an attribute on the element, not its text), then use lxml.html.tostring() to serialize the changes:

p.text = ''
result = lxml.html.tostring(root)
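
For the hyperlinking case in the question's update, one possible approach relies on how lxml stores text: the text before an element's first child is in .text, and the text following each child is in that child's .tail. The following is only a sketch under assumptions, not a solution tested against every document: add_links_lxml, linkify, split_on_matches and the LINKS dictionary are hypothetical names, and phrases are matched with a plain regular-expression alternation. Text inside an existing <a> is skipped so that links are never nested.

```python
import re

import lxml.html
from lxml import etree

# Hypothetical phrase -> URL dictionary, based on the question's example.
LINKS = {"Met Office": "met_office.html"}


def split_on_matches(text, pattern, links):
    """Split text at each match; return (leading_text, new <a> elements).

    The text following each match is stored in the new element's .tail.
    """
    anchors = []
    lead = text
    pos = 0
    for m in pattern.finditer(text):
        before = text[pos:m.start()]
        if anchors:
            anchors[-1].tail = before
        else:
            lead = before
        a = etree.Element("a", href=links[m.group(0)])
        a.text = m.group(0)
        anchors.append(a)
        pos = m.end()
    if anchors:
        anchors[-1].tail = text[pos:]
    return lead, anchors


def linkify(el, pattern, links):
    """Insert links into el's text and its children's tails, recursively."""
    for child in list(el):  # snapshot: new <a> elements are inserted below
        # Do not descend into existing links, comments or processing instructions.
        if isinstance(child.tag, str) and child.tag != "a":
            linkify(child, pattern, links)
        if child.tail:
            lead, anchors = split_on_matches(child.tail, pattern, links)
            child.tail = lead
            start = el.index(child) + 1
            for i, a in enumerate(anchors):
                el.insert(start + i, a)
    if el.text:
        lead, anchors = split_on_matches(el.text, pattern, links)
        el.text = lead
        for i, a in enumerate(anchors):
            el.insert(i, a)


def add_links_lxml(html, links):
    """Return html with dictionary phrases inside <p> elements hyperlinked."""
    root = lxml.html.fromstring(html)
    # Longer phrases first, so they win over shorter overlapping ones.
    keys = sorted(links, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    for p in list(root.iter("p")):
        linkify(p, pattern, links)
    return lxml.html.tostring(root, encoding="unicode")
```

Calling add_links_lxml(test_string, LINKS) should leave headers and existing links untouched, since only <p> elements are visited. The pattern has no word-boundary handling, so a phrase would also match inside a longer word; that may need tightening for real data.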

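Just use BeautifulSoup, which can (optionally) use lxml as its parser:

from bs4 import BeautifulSoup

# "lxml" selects the lxml parser, which wraps the fragment in <html><body>
soup = BeautifulSoup(test_string, "lxml")
pars = soup.find_all('p')
for p in pars:
    # Assigning .string replaces everything inside the tag, child tags included
    p.string = 'this text was replaced'
print(soup.prettify())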

The result is:

<html>
 <body>
  <p>
   this text was replaced
  </p>
  <p>
   this text was replaced
  </p>
  <h3>
   Weather map
  </h3>
  ...
  <p>
   this text was replaced
  </p>
  <p>
   this text was replaced
  </p>
  <p>
   this text was replaced
  </p>
 </body>
</html>

Update: If you want to search for a string inside the paragraph's text and wrap it in a new <a> tag, a regular expression combined with BeautifulSoup's soup.new_tag() method should do the job.
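
A rough sketch of that idea follows, under some assumptions: add_links_bs4 and the LINKS dictionary are hypothetical names, the built-in "html.parser" is used so that no <html>/<body> wrapper is added, and text nodes that already sit inside an <a> are skipped to avoid nested links. Each matching text node is split into plain strings and new <a> tags, which are inserted in its place.

```python
import re

from bs4 import BeautifulSoup, NavigableString

# Hypothetical phrase -> URL dictionary, based on the question's example.
LINKS = {"Met Office": "met_office.html"}


def add_links_bs4(html, links):
    """Return html with dictionary phrases inside <p> tags hyperlinked."""
    soup = BeautifulSoup(html, "html.parser")
    # Longer phrases first, so they win over shorter overlapping ones.
    keys = sorted(links, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))

    for p in soup.find_all("p"):
        # Snapshot the text nodes, because the tree is modified in the loop.
        for node in list(p.find_all(string=True)):
            if node.find_parent("a") is not None:
                continue  # already part of a hyperlink
            text = str(node)
            pieces = []
            pos = 0
            for m in pattern.finditer(text):
                if m.start() > pos:
                    pieces.append(NavigableString(text[pos:m.start()]))
                a = soup.new_tag("a", href=links[m.group(0)])
                a.string = m.group(0)
                pieces.append(a)
                pos = m.end()
            if not pieces:
                continue  # no phrase found in this text node
            if pos < len(text):
                pieces.append(NavigableString(text[pos:]))
            # Put the pieces where the original text node was, then drop it.
            for piece in pieces:
                node.insert_before(piece)
            node.extract()
    return str(soup)
```

With the question's sample, add_links_bs4(test_string, LINKS) should link "Met Office" in the first paragraph while leaving the existing weather link and the <strong> tags as they are. The pattern does no word-boundary checking, which may matter for short phrases.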

3 Comments

I've added an update to better explain what I'm trying to achieve.
That's a quite different question now...I assume you want to wrap a text in another <a> tag?
I accept my question was not as well outlined as it could have been. The problem I've got is that I don't know how to add hyperlinks into HTML without corrupting existing hyperlinks or other HTML tags. It doesn't help that I've not used BeautifulSoup and instead previously relied upon lxml.
