
I have an HTML document with the following nested bulleted list:

Body=<ul><li>Preconditions<ul><li>PC1</li><li>PC2</li></ul></li><li>Use Case Triggers<ul><li>T1</li><li>T2</li></ul></li><li>Postconditions<ul><li>PO1</li><li>PO2</li></ul></li></ul>

(Alternative View):

  • Preconditions
    • PC1
    • PC2
  • Use Case Triggers
    • T1
    • T2
  • Postconditions
    • PO1
    • PO2

I'm trying to write a function in Python that will dissect this list and pull out groups of data. The goal is to put this data in a matrix (a list of [parent, child] pairs) that would look like the following:

[[Preconditions, PC1],[Preconditions, PC2],[Use Case Triggers, T1],[Use Case Triggers, T2],[Postconditions, PO1],[Postconditions,PO2]]

The other hurdle is that the matrix must be generated correctly regardless of how many ul and li elements the document contains.

Any guidance is appreciated!

  • this will help stackoverflow.com/a/24216387/9050514 Commented Jul 31, 2020 at 19:42
  • Are we to assume that PC1,.., PO2 are handwritten in HTML, or will they be derived from a function call initiated in JS? Commented Jul 31, 2020 at 19:43

2 Answers

1

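You can write a function that takes raw HTML and replaces every tag and character entity with a space:

import re

def cleanhtml(raw_html):
    # Match any tag (<...>) or any named/numeric character entity (&amp;, &#160;, &#xA0;).
    cleanr = re.compile(r"<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});")
    cleantext = re.sub(cleanr, " ", raw_html)
    return cleantext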

Some other cleanr options (written as raw strings so the backslashes reach the regex engine unchanged):

  • cleanr = re.compile(r"<[A-Za-z/][^>]*>")
  • cleanr = re.compile(r"<[^>]*>")
  • cleanr = re.compile(r"</?\w+\s*[^>]*?/?>")
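To see what tag stripping does to the list from the question, here is a minimal sketch using the `<[^>]*>` pattern on a shortened copy of that HTML. The variable names are only illustrative.

```python
import re

# Shortened copy of the list from the question.
html = "<ul><li>Preconditions<ul><li>PC1</li><li>PC2</li></ul></li></ul>"

# Match anything between "<" and the next ">".
tag_pattern = re.compile(r"<[^>]*>")

# Replace each tag with a space, then collapse the runs of whitespace.
text = " ".join(tag_pattern.sub(" ", html).split())
print(text)  # Preconditions PC1 PC2
```

Note that the result is flat text: the nesting is gone, so nothing in it says that PC1 and PC2 belong to Preconditions. Tag stripping alone is therefore not enough to build the parent/child pairs the question asks for.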

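But there is an easier and more robust way with Beautiful Soup, which parses the document instead of pattern-matching it:

import requests
from bs4 import BeautifulSoup

def clean_with_soup(url: str) -> str:
    # Download the page and return its text content with all tags removed.
    r = requests.get(url).text
    soup = BeautifulSoup(r, "html.parser")
    return soup.get_text()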

-2

Beautiful Soup is a good library for parsing HTML. Code example:

from bs4 import BeautifulSoup

html = "<ul><li>Preconditions<ul><li>PC1</li><li>PC2</li></ul></li><li>Use Case Triggers<ul><li>T1</li><li>T2</li></ul></li><li>Postconditions<ul><li>PO1</li><li>PO2</li></ul></li></ul>"

bs = BeautifulSoup(html, "html.parser")
# find_all is the current name; findAll is a legacy alias.
uls = bs.find_all("ul")
# Print the li elements found under each ul (the outer ul also includes the nested ones).
for ul in uls:
    print(ul.find_all("li"))
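Printing the li elements does not yet produce the [parent, child] matrix from the question. One possible way to get there, sketched below, is to visit every li, look up the li that encloses it, and pair the two. The helper names `own_text` and `list_pairs` are hypothetical, and the sketch assumes each heading is the text written directly inside its li, as in the question's HTML.

```python
from bs4 import BeautifulSoup, NavigableString

html = "<ul><li>Preconditions<ul><li>PC1</li><li>PC2</li></ul></li><li>Use Case Triggers<ul><li>T1</li><li>T2</li></ul></li><li>Postconditions<ul><li>PO1</li><li>PO2</li></ul></li></ul>"


def own_text(tag):
    """Return only the text written directly inside tag, ignoring nested tags."""
    return "".join(c for c in tag.children if isinstance(c, NavigableString)).strip()


def list_pairs(raw_html):
    """Build [parent, child] pairs for every li that sits inside another li."""
    soup = BeautifulSoup(raw_html, "html.parser")
    pairs = []
    for li in soup.find_all("li"):
        parent_li = li.find_parent("li")  # nearest enclosing li, or None
        if parent_li is None:
            continue  # top-level item: it is a heading, not a child
        pairs.append([own_text(parent_li), own_text(li)])
    return pairs


print(list_pairs(html))
```

Because every nested li is paired with its nearest enclosing li, the number of ul and li elements does not matter. With deeper nesting, an item that has children of its own appears once as a child and again as the parent of each of its children.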
