I'm trying to build a simple script that inserts hyperlinks into HTML based on a dictionary.
It works up to a point using a simple str.replace(), but it breaks when it encounters existing hyperlinks or other tags, which then end up broken in the rendered output.
I've seen broad answers to this question on Stack Overflow that suggest using lxml and BeautifulSoup, but I'm running into specific issues and hope someone can give me a push in the right direction.
Here's a sample:
test_string = """<p>
COLD and frosty <a href="/weather">weather</a> has gripped the area and the Met Office have predicted strong winds along will the possibility of sleet and snow later in the week.
</p>
<p>
For more details take a look at the full five-day weather forecast below.
</p>
<h3>Weather map</h3>
...
<p>
<strong>Today:</strong> It will be a cold start today with a hard frost, perhaps with isolated wintry showers in the east at first. There will be plenty of sunny spells through the day, especially towards the west. Feeling cold. Maximum Temperature 6C.
</p>
<p>
<strong>Tonight:</strong> Tonight will be quiet with long clear spells and light winds. A widespread frost is expected with very cold temperatures. Minimum Temperature -5C.
</p>
<p>
<strong>Tuesday:</strong> Tuesday will be a similar day with a cold, frosty but bright start. There will be plenty of sunshine, with skies clouding over into the evening. Maximum Temperature 7C.
</p>"""
import lxml.html

root = lxml.html.fromstring(test_string)
paragraphs = root.xpath("//*")
for p in paragraphs:
    if p.tag == "p":
        pass  # DO SOMETHING with the paragraph here
The idea is to process only the paragraph tags in an article (and to skip header tags and tables).
The problem, however, is that I haven't found a way to then process the text inside a tag while preserving the position of inline tags such as the <a> and <strong> elements in the sample above.
How can I break the paragraph into chunks using lxml which are:
- text on which I can then run find/replace
- tags which must be preserved
in such a way that I can then preserve the overall structure?
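For illustration, here is a minimal sketch of how lxml appears to expose those two kinds of chunks: the text before the first child lives in the parent's .text, and the text after each child lives in that child's .tail. The helper name text_chunks and the shortened sample string are my own assumptions, not part of any library.

```python
import lxml.html


def text_chunks(element):
    """Yield (owner, attribute, text) for each text run directly inside element.

    Text inside child tags (for example the label of an existing link)
    is deliberately not yielded, so it cannot be touched by find/replace.
    """
    if element.text:
        yield element, "text", element.text
    for child in element:
        if child.tail:
            yield child, "tail", child.tail


# Shortened version of the first paragraph, used only for this sketch.
sample = '<p>COLD and frosty <a href="/weather">weather</a> has gripped the area.</p>'
p = lxml.html.fromstring(sample)

chunks = [(attr, text) for _, attr, text in text_chunks(p)]
print(chunks)
```

Each yielded pair of owner and attribute name says where the string lives, so a modified string could be written back with setattr(owner, attribute, new_text) without disturbing the child tags.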
UPDATE:
For instance, I might want to replace Met Office with <a href="met_office.html">Met Office</a> while retaining the other text as it is, thus:
<p>
COLD and frosty <a href="/weather">weather</a> has gripped the area and the <a href="met_office.html">Met Office</a> have predicted strong winds along will the possibility of sleet and snow later in the week.
</p>
Note that the <p> elements can also be selected directly: root.xpath("//p")