
Below is the HTML source of the URL:

<h1>Queue &lt;&lt;hotspot-00:26:BB:05:BB:10&gt;&gt; Statistics </h1>
<ul>
  <li>Source-addresses: 10.10.1.130
  <li>Destination-address: ::/0
  <li>Max-limit: 1.02Mb/2.04Mb (Total: <i>unlimited</i>)
  <li>Limit-at: 1.02Mb/2.04Mb (Total: <i>unlimited</i>)
  <li>Last update: Mon Sep 23 21:41:16 2019

</ul>

And here's my code. Note that links is a list of URLs:

import requests
from bs4 import BeautifulSoup

for link in links:
    page = requests.get(link).text
    sp1 = BeautifulSoup(page, "html.parser").findAll('h1')
    sp2 = BeautifulSoup(page, "html.parser").findAll('li')
    print(sp1, sp2)

Current output:

[<h1>Queue &lt;&lt;hotspot-00:26:BB:05:BB:10&gt;&gt; Statistics </h1>] [<li>Source-addresses: 10.10.1.130
  <li>Destination-address: ::/0
  <li>Max-limit: 1.02Mb/2.04Mb (Total: <i>unlimited</i>)
  <li>Limit-at: 1.02Mb/2.04Mb (Total: <i>unlimited</i>)
  <li>Last update: Tue Sep 24 00:27:05 2019

I'm trying to edit my code to get the following output:

hotspot-00:26:BB:05:BB:10, Limit-at: 1.02Mb/2.04Mb (Total: <i>unlimited

1 Answer

First of all, you don't need to create two BeautifulSoup objects. As for your question:

import re

import requests
from bs4 import BeautifulSoup

for link in links:
    soup = BeautifulSoup(requests.get(link).content, "html.parser")
    header = soup.find('h1').text
    header = re.sub(r'.*<<(.*)>>.*', r'\g<1>', header)
    limit = [elem.text.strip() for elem in soup.find_all('li')
             if re.search(r'^Limit-at:', elem.text)][0].split('\n')[0]
    print(header, limit)

I used the html you provided to test the above solution.

You're getting lists there because you are using find_all, which always returns a list.

For the header I used find, which does the same thing but only returns the first match. Then I do a regex substitution to remove all but the desired portion of the header text.
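To see just that substitution step in isolation, here's a minimal sketch using the header text as .text would extract it from the sample <h1> (entities like &lt; already decoded):

```python
import re

# Header text as BeautifulSoup's .text would return it for the sample <h1>.
header = "Queue <<hotspot-00:26:BB:05:BB:10>> Statistics "

# The pattern matches the whole string; \g<1> replaces it with the
# capture group between the << and >> delimiters.
name = re.sub(r'.*<<(.*)>>.*', r'\g<1>', header)
print(name)  # -> hotspot-00:26:BB:05:BB:10
```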

For the limit, things are a little trickier because it's in a nested li element. So loop through all of the li elements, keeping the one whose text attribute begins with 'Limit-at:'. Because that produces a list, I grab element 0 and split it on the newline character, which produces another list; grabbing element 0 of that gets rid of the 'Last update' portion of the text.
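As a sanity check, here's the same extraction run against the sample HTML from the question instead of a live request (a sketch, assuming bs4 is installed; no requests call needed):

```python
import re
from bs4 import BeautifulSoup

# Sample HTML from the question; note the <li> tags are never closed.
html = """
<h1>Queue &lt;&lt;hotspot-00:26:BB:05:BB:10&gt;&gt; Statistics </h1>
<ul>
  <li>Source-addresses: 10.10.1.130
  <li>Destination-address: ::/0
  <li>Max-limit: 1.02Mb/2.04Mb (Total: <i>unlimited</i>)
  <li>Limit-at: 1.02Mb/2.04Mb (Total: <i>unlimited</i>)
  <li>Last update: Mon Sep 23 21:41:16 2019
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# Pull the device name out of the <<...>> delimiters in the header.
header = re.sub(r'.*<<(.*)>>.*', r'\g<1>', soup.find('h1').text)

# Keep the li whose text starts with 'Limit-at:', then drop everything
# after the first newline (the trailing 'Last update' text).
limit = [elem.text.strip() for elem in soup.find_all('li')
         if re.search(r'^Limit-at:', elem.text)][0].split('\n')[0]

print(header, limit)
```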


8 Comments

Got TypeError: expected string or bytes-like object.
See the edited answer. I forgot to grab .text on the header = soup.find line.
OP deleted comment but asked about matching &lt; instead of <. You would just need to swap << and >> with &lt;&lt; and &gt;&gt; respectively in the first parameter of re.sub.
It works fine, but if it's <h1>Queue &lt;hs-&lt;hotspot1&gt;&gt; Statistics </h1> then the output will be Queue <hs-<hotspot1>> Statistics
You will likely need a more complicated regular expression if the data is not consistent with the example you provided in your question. I recommend checking out the docs for Python's re module (docs.python.org/3.6/library/re.html) and making use of any number of online regular expression testers. Try some things and edit your post, or create a new one if you're having trouble.
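For reference, the swap described in the earlier comment (matching the still-escaped &lt;&lt; / &gt;&gt; when you work on the raw HTML rather than decoded text) might look like this:

```python
import re

# Raw HTML string: entities have NOT been decoded, so we match them literally.
raw = "<h1>Queue &lt;&lt;hotspot-00:26:BB:05:BB:10&gt;&gt; Statistics </h1>"

name = re.sub(r'.*&lt;&lt;(.*)&gt;&gt;.*', r'\g<1>', raw)
print(name)  # -> hotspot-00:26:BB:05:BB:10
```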
