
Below is the HTML source of the URL:

<h1>Queue &lt;&lt;hotspot-00:26:BB:05:BB:10&gt;&gt; Statistics </h1>
<ul>
  <li>Source-addresses: 10.10.1.130
  <li>Destination-address: ::/0
  <li>Max-limit: 1.02Mb/2.04Mb (Total: <i>unlimited</i>)
  <li>Limit-at: 1.02Mb/2.04Mb (Total: <i>unlimited</i>)
  <li>Last update: Mon Sep 23 21:41:16 2019

</ul>

And here's my code. Note that links is a list of URLs:

import requests
from bs4 import BeautifulSoup

for link in links:
    page = requests.get(link).text
    sp1 = BeautifulSoup(page, "html.parser").findAll('h1')
    sp2 = BeautifulSoup(page, "html.parser").findAll('li')
    print(sp1, sp2)

Current output:

[<h1>Queue &lt;&lt;hotspot-00:26:BB:05:BB:10&gt;&gt; Statistics </h1>] [<li>Source-addresses: 10.10.1.130
  <li>Destination-address: ::/0
  <li>Max-limit: 1.02Mb/2.04Mb (Total: <i>unlimited</i>)
  <li>Limit-at: 1.02Mb/2.04Mb (Total: <i>unlimited</i>)
  <li>Last update: Tue Sep 24 00:27:05 2019

I'm trying to edit my code to get the following output:

hotspot-00:26:BB:05:BB:10, Limit-at: 1.02Mb/2.04Mb (Total: <i>unlimited

1 Answer

First of all, you don't need to create two BeautifulSoup objects. As for your question:

import re

import requests
from bs4 import BeautifulSoup

for link in links:
    soup = BeautifulSoup(requests.get(link).content, "html.parser")
    header = soup.find('h1').text
    header = re.sub(r'.*<<(.*)>>.*', r'\g<1>', header)
    limit = [elem.text.strip() for elem in soup.find_all('li')
             if re.search(r'^Limit-at:', elem.text)][0].split('\n')[0]
    print(header, limit)

I used the html you provided to test the above solution.

You're getting lists there because you are using find_all, which always returns a list.

For the header I used find, which does the same thing but only returns the first match. Then I do a regex substitution to remove all but the desired portion of the header text.
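To see just that substitution step in isolation, here's a minimal sketch using the header text as .text would extract it from the sample <h1> (entities like &lt; already decoded):

```python
import re

# Header text as BeautifulSoup's .text would return it for the sample <h1>.
header = "Queue <<hotspot-00:26:BB:05:BB:10>> Statistics "

# The pattern matches the whole string; \g<1> replaces it with the
# capture group between the << and >> delimiters.
name = re.sub(r'.*<<(.*)>>.*', r'\g<1>', header)
print(name)  # -> hotspot-00:26:BB:05:BB:10
```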

For the limit, things are a little trickier because it's in a nested li element. So loop through all of the li elements, keeping the one whose text attribute begins with 'Limit-at:'. Because that produces a list, I grab element 0 and split it on the newline character, which produces another list; grabbing element 0 of that gets rid of the 'Last update' portion of the text.
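As a sanity check, here's the same extraction run against the sample HTML from the question instead of a live request (a sketch, assuming bs4 is installed; no requests call needed):

```python
import re
from bs4 import BeautifulSoup

# Sample HTML from the question; note the <li> tags are never closed.
html = """
<h1>Queue &lt;&lt;hotspot-00:26:BB:05:BB:10&gt;&gt; Statistics </h1>
<ul>
  <li>Source-addresses: 10.10.1.130
  <li>Destination-address: ::/0
  <li>Max-limit: 1.02Mb/2.04Mb (Total: <i>unlimited</i>)
  <li>Limit-at: 1.02Mb/2.04Mb (Total: <i>unlimited</i>)
  <li>Last update: Mon Sep 23 21:41:16 2019
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# Pull the device name out of the <<...>> delimiters in the header.
header = re.sub(r'.*<<(.*)>>.*', r'\g<1>', soup.find('h1').text)

# Keep the li whose text starts with 'Limit-at:', then drop everything
# after the first newline (the trailing 'Last update' text).
limit = [elem.text.strip() for elem in soup.find_all('li')
         if re.search(r'^Limit-at:', elem.text)][0].split('\n')[0]

print(header, limit)
```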


8 Comments

Got TypeError: expected string or bytes-like object.
See the edited answer. I forgot to grab .text on the header = soup.find line.
OP deleted comment but asked about matching &lt; instead of <. You would just need to swap << and >> with &lt;&lt; and &gt;&gt; respectively in the first parameter of re.sub.
It works fine, but if it's <h1>Queue &lt;hs-&lt;hotspot1&gt;&gt; Statistics </h1> then the output will be Queue <hs-<hotspot1>> Statistics
You will likely need a more complicated regular expression if the data is not consistent with the example you provided in your question. I recommend checking out the docs for Python's re module (docs.python.org/3.6/library/re.html) and making use of any number of online regular expression testers. Try some things and edit your post, or create a new one if you're having trouble.
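For reference, the swap described in the earlier comment (matching the still-escaped &lt;&lt; / &gt;&gt; when you work on the raw HTML rather than decoded text) might look like this:

```python
import re

# Raw HTML string: entities have NOT been decoded, so we match them literally.
raw = "<h1>Queue &lt;&lt;hotspot-00:26:BB:05:BB:10&gt;&gt; Statistics </h1>"

name = re.sub(r'.*&lt;&lt;(.*)&gt;&gt;.*', r'\g<1>', raw)
print(name)  # -> hotspot-00:26:BB:05:BB:10
```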
