Extracting data from HTML bulleted lists in Python

Question

I have an HTML document with the following bulleted list:

Body=<ul><li>Preconditions<ul><li>PC1</li><li>PC2</li></ul></li><li>Use Case Triggers<ul><li>T1</li><li>T2</li></ul></li><li>Postconditions<ul><li>PO1</li><li>PO2</li></ul></li></ul>

(Alternative View):

Preconditions
- PC1
- PC2
Use Case Triggers
- T1
- T2
Postconditions
- PO1
- PO2

I'm trying to write a function in Python that will dissect this list and pull out groups of data. The goal is to put this data in a matrix that would look like the following:

[[Preconditions, PC1],[Preconditions, PC2],[Use Case Triggers, T1],[Use Case Triggers, T2],[Postconditions, PO1],[Postconditions,PO2]]

The other hurdle is that the matrix needs to be generated regardless of how many ul and li elements the list contains.

Any guidance is appreciated!

Are we to assume that PC1, ..., PO2 are handwritten in the HTML, or will they be generated by a function call initiated in JS?
– Alex Douglas, Commented Jul 31, 2020 at 19:43

Yagiz Degirmenci · Accepted Answer · 2020-07-31 19:53:25Z

1

You can write a function that takes raw HTML and replaces every tag and character entity with a space:

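import re

def cleanhtml(raw_html):
    # Match any tag, or a named/decimal/hex character entity such as &amp; or &#160;
    cleanr = re.compile(r"<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});")
    # Replace each match with a space so adjacent words do not run together
    cleantext = re.sub(cleanr, " ", raw_html)
    return cleantext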
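
To see what this approach gives for the list in the question, here is a small sketch that runs the same pattern over that HTML. The names `TAG_OR_ENTITY`, `html` and `flat` are only illustrative. The result is one flat string, so the link between each heading and its items is no longer visible.

```python
import re

# Same pattern as in cleanhtml above, written as a raw string.
TAG_OR_ENTITY = re.compile(r"<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});")

def cleanhtml(raw_html):
    # Replace every tag or entity with a single space.
    return TAG_OR_ENTITY.sub(" ", raw_html)

# The list from the question, split over several lines for readability.
html = ("<ul><li>Preconditions<ul><li>PC1</li><li>PC2</li></ul></li>"
        "<li>Use Case Triggers<ul><li>T1</li><li>T2</li></ul></li>"
        "<li>Postconditions<ul><li>PO1</li><li>PO2</li></ul></li></ul>")

# Collapse the runs of spaces left behind by the removed tags.
flat = " ".join(cleanhtml(html).split())
print(flat)
```

Headings and items come out as plain words in document order, which is fine for extracting text but not enough on its own to rebuild the [heading, item] matrix.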

Some other cleanr options:

cleanr = re.compile(r"<[A-Za-z/][^>]*>")
cleanr = re.compile(r"<[^>]*>")
cleanr = re.compile(r"</?\w+\s*[^>]*?/?>")
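
The options differ mainly in how they treat a `<` that does not start a tag. A rough comparison on a made-up input string (the sample text and the dictionary labels are assumptions for illustration only):

```python
import re

# The three alternative patterns, written as raw strings.
patterns = {
    "any <...> span": re.compile(r"<[^>]*>"),
    "letter or slash after <": re.compile(r"<[A-Za-z/][^>]*>"),
    "tag name required": re.compile(r"</?\w+\s*[^>]*?/?>"),
}

# Made-up input mixing comparison operators with a real tag.
sample = "if 1 < 2 and 3 > 2: <b>ok</b>"

# Remove every match of each pattern and keep the results side by side.
results = {label: p.sub("", sample) for label, p in patterns.items()}
for label, text in results.items():
    print(f"{label}: {text!r}")
```

The most permissive pattern also removes the text between the bare `<` and `>`, while the other two only remove the `<b>` and `</b>` tags.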

But there is a better and easier way with BeautifulSoup.

import requests
from bs4 import BeautifulSoup

def clean_with_soup(url: str) -> str:
    # Download the page, then let the parser drop the markup
    r = requests.get(url).text
    soup = BeautifulSoup(r, "html.parser")
    return soup.get_text()
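
The function above downloads a page from a URL. For HTML that is already in a string, as in the question, the string can be passed to BeautifulSoup directly. A minimal sketch, with `html` and `texts` as illustrative names:

```python
from bs4 import BeautifulSoup

# The list from the question, split over several lines for readability.
html = ("<ul><li>Preconditions<ul><li>PC1</li><li>PC2</li></ul></li>"
        "<li>Use Case Triggers<ul><li>T1</li><li>T2</li></ul></li>"
        "<li>Postconditions<ul><li>PO1</li><li>PO2</li></ul></li></ul>")

soup = BeautifulSoup(html, "html.parser")

# stripped_strings yields each text node with surrounding whitespace removed.
texts = list(soup.stripped_strings)
print(texts)
```

Each heading and item stays intact as its own string (including multi-word ones such as "Use Case Triggers"), although the list is still flat.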

edited Jul 31, 2020 at 19:53

answered Jul 31, 2020 at 19:43

Yagiz Degirmenci


SergeySD · Accepted Answer · 2020-07-31 19:44:44Z

-2

A good library for parsing HTML is BeautifulSoup. Code example:

from bs4 import BeautifulSoup

html = "<ul><li>Preconditions<ul><li>PC1</li><li>PC2</li></ul></li><li>Use Case Triggers<ul><li>T1</li><li>T2</li></ul></li><li>Postconditions<ul><li>PO1</li><li>PO2</li></ul></li></ul>"

bs = BeautifulSoup(html, "html.parser")
# find_all is the current name for the older findAll alias
uls = bs.find_all("ul")
for ul in uls:
    print(ul.find_all("li"))
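
The loop above prints the `<li>` elements of every `<ul>`, so nested items show up more than once and nothing ties an item to its heading. One possible way to get the [heading, item] pairs from the question is to pair every nested `<li>` with the text of its nearest enclosing `<li>`. This is only a sketch: the helper names `own_text` and `parent_child_pairs` are made up here, and it assumes each heading is the text written directly inside its `<li>`.

```python
from bs4 import BeautifulSoup

def own_text(li):
    """Text written directly inside this <li>, ignoring nested lists."""
    return "".join(li.find_all(string=True, recursive=False)).strip()

def parent_child_pairs(html):
    """Return [parent text, child text] for every <li> nested in another <li>."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for li in soup.find_all("li"):       # all <li> elements, in document order
        parent = li.find_parent("li")    # nearest enclosing <li>, or None
        if parent is not None:
            pairs.append([own_text(parent), own_text(li)])
    return pairs

# The list from the question, split over several lines for readability.
html = ("<ul><li>Preconditions<ul><li>PC1</li><li>PC2</li></ul></li>"
        "<li>Use Case Triggers<ul><li>T1</li><li>T2</li></ul></li>"
        "<li>Postconditions<ul><li>PO1</li><li>PO2</li></ul></li></ul>")

pairs = parent_child_pairs(html)
print(pairs)
```

Because the pairing depends only on the enclosing `<li>`, the number of headings and items does not matter, and deeper nesting simply produces additional [parent, child] pairs.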

edited Jul 31, 2020 at 19:44

answered Jul 31, 2020 at 19:37

SergeySD
