I have multiple strings that I want to wrap in HTML tags within an HTML document. I want to leave the text unchanged, but replace each string with an HTML element containing that string.

Furthermore, some of the strings I want to replace contain other strings I want to replace. In these cases, I want to apply the substitution for the longer string and ignore the one for the shorter string.

In addition, I only want to perform this substitution when those strings are contained fully within the same element.

Here's my replacement list.

replacement_list = [
    ('foo', '<span title="foo" class="customclass34">foo</span>'),
    ('foo bar', '<span id="id21" class="customclass79">foo bar</span>')
]

Given the following html:

<html>
<body>
<p>Paragraph contains foo</p>
<p>Paragraph contains foo bar</p>
</body>
</html>

I want the result to be this:

<html>
<body>
<p>Paragraph contains <span title="foo" class="customclass34">foo</span></p>
<p>Paragraph contains <span id="id21" class="customclass79">foo bar</span></p>
</body>
</html>

So far I've tried using the Beautiful Soup library and looping through my replacement list in order of decreasing string length. I can find and replace my strings with other strings, but I can't work out how to insert HTML at those points, or whether there's a better way entirely. Trying to perform string substitution with a soup.new_tag object fails whether I convert it to a string or not.

EDIT: Realised the example I gave didn't even conform to my own rules, so I modified the example.

3 Answers

I think this is very close to what you are looking for. You can use soup.find_all(string=True) to get only the NavigableString elements and then perform the replacement on each one.

from bs4 import BeautifulSoup
html="""
<html>
<body>
<p>Paragraph contains foo</p>
<p>Paragraph contains foo bar</p>
</body>
</html>
"""
replacement_list = [
    ('foo', '<span title="foo" class="customclass34">foo</span>'),
    ('foo bar', '<span id="id21" class="customclass79">foo bar</span>')
]
soup = BeautifulSoup(html, 'html.parser')
for s in soup.find_all(string=True):
    # Reversed so longer strings are tried first; assumes the list is in ascending order of length
    for key, val in replacement_list[::-1]:
        if key in s:
            new_s = s.replace(key, val)
            # Parse the substituted text so the span becomes a real tag; stick to the built-in html.parser here
            s.replace_with(BeautifulSoup(new_s, 'html.parser'))
            break  # stop at the first (longest) matching key
print(soup)

# Optionally re-parse the output to get a new valid soup that treats each span as a separate tag
soup = BeautifulSoup(str(soup), 'html.parser')
print(soup.find_all('span'))

Outputs:

<html>
<body>
<p>Paragraph contains <span class="customclass34" title="foo">foo</span></p>
<p>Paragraph contains <span class="customclass79" id="id21">foo bar</span></p>
</body>
</html>

[<span class="customclass34" title="foo">foo</span>, <span class="customclass79" id="id21">foo bar</span>]

1 Comment

Upvoted for a close solution that showed me the tools to use. The HTML I'm working with might have multiple matches for a string to be replaced within a single NavigableString object, so breaking on the first match to prevent duplicate tags didn't work for me.

I've found a solution for this.

I have to iterate through the HTML for each different string I want to wrap HTML tags around. This seems inefficient, but I can't find a better way of doing it.

I've added a class to all the tags I'm inserting, which I use to check if the string I'm trying to replace was part of a larger string that was already replaced.

This solution is also case-insensitive (it will wrap tags around the string 'fOo'), while preserving the case of the original text.

import re

from bs4 import BeautifulSoup


def html_update(input_html):
    # Name the parser explicitly so the result does not depend on which parsers are installed
    soup = BeautifulSoup(input_html, 'html.parser')

    replacement_list = [
        ('foo', '<span title="foo" class="customclass34 replace">', '</span>'),
        ('foo bar', '<span id="id21" class="customclass79 replace">', '</span>')
    ]
    # Go through the list in order of decreasing length
    replacement_list = sorted(replacement_list, key=lambda k: -len(k[0]))

    for item in replacement_list:
        # Escape the search string so regex metacharacters in it are matched literally
        replace_regex = re.compile(re.escape(item[0]), re.IGNORECASE)
        target = soup.find_all(string=replace_regex)
        for v in target:
            # You can use other conditions here, like (v.parent.name == 'a')
            # to not wrap the tags around strings within links
            if v.parent.has_attr('class') and 'replace' in v.parent['class']:
                # The match must be part of a larger string that was already replaced, so do nothing
                continue

            def replace(match):
                # Reuse the matched text so the original case is preserved
                return '{0}{1}{2}'.format(item[1], match.group(0), item[2])

            new_v = replace_regex.sub(replace, v)
            v.replace_with(BeautifulSoup(new_v, 'html.parser'))
    return str(soup)
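
The repeated passes and the marker class can probably be avoided by compiling all the search strings into a single alternation, longest first, and rewriting each text node once. What follows is only a sketch under that assumption, not a tested replacement for the function above: wrap_terms and REPLACEMENTS are hypothetical names, and the html.escape calls are an addition so that characters such as < in the surrounding text survive being re-parsed.

```python
import html
import re

from bs4 import BeautifulSoup, NavigableString

# Hypothetical table in the same shape as above: (text, opening tag, closing tag).
REPLACEMENTS = [
    ("foo", '<span title="foo" class="customclass34">', "</span>"),
    ("foo bar", '<span id="id21" class="customclass79">', "</span>"),
]


def wrap_terms(input_html, replacements=REPLACEMENTS):
    """Wrap each listed string found inside a single text node, longest first."""
    soup = BeautifulSoup(input_html, "html.parser")

    # Lower-cased search string -> (opening tag, closing tag).
    # Assumes plain ASCII search strings, so lower-casing a match gives back a key.
    wrappers = {text.lower(): (start, end) for text, start, end in replacements}

    # One alternation, longest first, so "foo bar" wins over "foo".
    ordered = sorted(wrappers, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in ordered), re.IGNORECASE)

    for node in soup.find_all(string=pattern):
        # Leave comments, scripts and stylesheets alone.
        if type(node) is not NavigableString or node.parent.name in ("script", "style"):
            continue

        text, pieces, pos = str(node), [], 0
        for match in pattern.finditer(text):
            start, end = wrappers[match.group(0).lower()]
            # Escape the plain text so it is not mistaken for markup when re-parsed.
            pieces.append(html.escape(text[pos:match.start()], quote=False))
            pieces.append(start + html.escape(match.group(0), quote=False) + end)
            pos = match.end()
        pieces.append(html.escape(text[pos:], quote=False))

        node.replace_with(BeautifulSoup("".join(pieces), "html.parser"))

    return str(soup)
```

Because each text node is scanned once from left to right, several occurrences in the same node all get wrapped, and the inserted spans are never scanned again, so no marker class is needed. Like the function above, it matches substrings (it would wrap the "foo" in "food"); adding word boundaries to the pattern is one way to tighten that if needed.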

When you are dealing with small files, you can read the file line by line, replace what you need in each line, and then write everything to a new file.

Assuming your input file is called test.html and the result is written to output.html:

# Maps each string to look for to the HTML that should replace it
replacement_list = {
    'foo': '<span title="foo" class="customclass34">foo</span>',
    'foo bar': '<span id="id21" class="customclass79">foo bar</span>'
}

with open('output.html', 'w') as dest:
    with open('test.html', 'r') as src:
        for line in src:  # read the source file line by line
            str_possible = []
            for string in replacement_list.keys():  # loop over all the strings to look for
                if string in line:  # check whether this string is in the line
                    str_possible.append(string)
            if len(str_possible) > 0:
                # Take the longest candidate, so 'foo bar' wins over 'foo'
                str_final = max(str_possible, key=len)
                line = line.replace(str_final, replacement_list[str_final])

            dest.write(line)

I also suggest you check the use of dictionaries in Python, which is the type I use for replacement_list.

Finally, this code works only if there is at most one of the strings on each line. If there are two or more, it needs to be adapted a bit, but this gives you the overall idea.
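
One possible adaptation for lines that contain several of the strings is to let a regular expression do the choosing: build one alternation with the longest keys first and substitute every match in a single call. This is a rough sketch only; replace_line and pattern are names made up for the illustration, and it keeps the weakness of plain string substitution, namely that it will also match text inside tags and attributes.

```python
import re

replacement_list = {
    "foo": '<span title="foo" class="customclass34">foo</span>',
    "foo bar": '<span id="id21" class="customclass79">foo bar</span>',
}

# Longest keys first, so the alternation prefers "foo bar" over "foo".
keys = sorted(replacement_list, key=len, reverse=True)
pattern = re.compile("|".join(re.escape(k) for k in keys))


def replace_line(line):
    """Replace every listed string in one line, the longest match winning."""
    # The substituted text is not scanned again, so the "foo" inside an
    # inserted span is left alone.
    return pattern.sub(lambda m: replacement_list[m.group(0)], line)
```

In the loop above, the body would then reduce to dest.write(replace_line(line)).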

6 Comments

This code will replace "foo" and leave "bar" untouched in the second line, so "foo bar" won't be replaced at all.
Oh, right, I went a bit too fast. I just updated my code, taking your comment into account.
The problem with this solution is that it would replace the strings if they appeared within the html tags themselves, leading to malformed HTML.
I am not sure I understand because I don't know much about HTML, but do you mean that if you have <foo> and </foo>, it would replace them? In that case, you could just check that there is no '<' right before the string in the line. I guess that should be sufficient, but I'm not sure.
For modifying HTML, it's generally better to use HTML libraries rather than simple string substitution. There are too many edge cases to handle reliably yourself. In this case, using Beautiful Soup to extract the text (leaving out the HTML tags) and then making the substitutions might work, so that's what I'm attempting now. I do appreciate the answer though.