I want to convert HTML snippets like
<strong>Hi</strong><strong> </strong><em><strong>Tim</strong></em>
of arbitrary complexity and nesting depth into
<strong>Hi <em>Tim</em></strong>.
The implementation should handle:
- All tags
- All attributes with any values (tags are treated as equal only if their attributes match)
- Adjacent equal tags (merge them)
- Nested tags in any depth (reorder and merge them)
- Maybe even know that some tags can be merged if they are separated by a space (like strong and em, but not u)
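To make the requirements above concrete, here is a minimal sketch of one possible approach, using only the standard library's html.parser. It is not taken from any of the packages mentioned here. It assumes that every tag is a pure inline wrapper whose nesting order does not matter, that the snippet contains no void elements such as br or img, and that duplicate nested tags can be collapsed. The names RunCollector, render and normalize are made up for illustration. The idea is to flatten the snippet into text runs, each labelled with the set of tags wrapping it, and then rebuild the markup by always opening the tag that covers the most consecutive runs.

```python
from html import escape
from html.parser import HTMLParser


class RunCollector(HTMLParser):
    """Flatten a snippet into (text, frozenset of wrapping tags) runs."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []  # currently open (tag, attrs) pairs
        self.runs = []   # list of (text, frozenset of (tag, attrs))

    def handle_starttag(self, tag, attrs):
        # Sort attributes so that their order does not affect equality.
        self.stack.append((tag, tuple(sorted((k, v or "") for k, v in attrs))))

    def handle_endtag(self, tag):
        # Close the innermost open element with this name, if any.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i]
                break

    def handle_data(self, data):
        key = frozenset(self.stack)
        if self.runs and self.runs[-1][1] == key:
            # Same wrappers as the previous run: merge the text.
            self.runs[-1] = (self.runs[-1][0] + data, key)
        else:
            self.runs.append((data, key))


def render(runs):
    """Rebuild markup, opening the tag that spans the most runs first."""
    out = []
    i = 0
    while i < len(runs):
        text, tags = runs[i]
        if not tags:
            out.append(escape(text, quote=False))
            i += 1
            continue

        def span(tag):
            # Index just past the last consecutive run wrapped by `tag`.
            j = i
            while j < len(runs) and tag in runs[j][1]:
                j += 1
            return j

        # sorted() makes the choice deterministic when spans are equal.
        best = max(sorted(tags), key=span)
        end = span(best)
        inner = [(t, s - {best}) for t, s in runs[i:end]]
        name, attrs = best
        attr_text = "".join(f' {k}="{escape(v)}"' for k, v in attrs)
        out.append(f"<{name}{attr_text}>{render(inner)}</{name}>")
        i = end
    return "".join(out)


def normalize(snippet):
    parser = RunCollector()
    parser.feed(snippet)
    parser.close()
    return render(parser.runs)


print(normalize("<strong>Hi</strong><strong> </strong><em><strong>Tim</strong></em>"))
# prints: <strong>Hi <em>Tim</em></strong>
```

Tags with different attributes end up as different set members, so they are never merged. The sketch does not cover the last point in the list: a space that sits outside any tag stays outside, and deciding which tags may absorb it would need an explicit whitelist.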
I expected to find something for this in the BeautifulSoup or lxml packages, or even in Python's standard library. lxml's clean_html looked promising at first, but I couldn't find anything that does this. I also searched for other packages (like https://github.com/matthiask/html-sanitizer/, http://countergram.github.io/pytidylib/) and related questions (like "BeautifulSoup - combine consecutive tags" and "Clean Up HTML in Python").
Code samples or links to packages are appreciated!
<a>t1</a><b><c><a>t2</a></c><d class="x"><a><a>t3</a></a></d></b>.