0
import requests as rs
from bs4 import BeautifulSoup as bs
import re

site = 'https://www.iciciprulife.com/'
req = rs.get(site)
soup = bs(req.text, 'html.parser')
link = input("Enter which URL scheme you want (http or https): ")

# Print every anchor whose href starts with the chosen scheme.
if link in ("http", "https"):
    for i in soup.find_all('a', attrs={'href': re.compile(f"^{link}://")}):
        print(i.get('href'))

In the above code I don't want to use 'href' or 'a'. Instead, I want to search the entire webpage for URLs using a regular expression.

  • You should say why you don't want to use href. Using your own regex to parse HTML is generally considered a bad idea... Commented Jun 9, 2021 at 9:25
  • Use an attribute = value CSS selector. Commented Jun 9, 2021 at 11:01

2 Answers

0

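soup.text returns the text content of the parsed page as a string. Note that this is only the text between tags, so URLs that appear solely in attributes are not included; use req.text or str(soup) if you need the full markup. The string may contain non-ASCII characters. Python 3's re module works on Unicode strings, so this is not strictly required, but you can convert or remove those characters first if you need plain ASCII.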

Then, you can search the whole string with regex.

To remove non-ASCII characters from a string, see:

How to remove nonAscii characters in python
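
One common way to do this, shown here as a minimal sketch rather than the method from the linked post, is to encode to ASCII while ignoring anything that cannot be represented. The helper name strip_non_ascii and the sample string are made up for illustration.

```python
def strip_non_ascii(text: str) -> str:
    # Encode to ASCII, dropping characters that have no ASCII form,
    # then decode the bytes back to a str.
    return text.encode("ascii", errors="ignore").decode("ascii")


# Hypothetical sample standing in for soup.text
sample = "Plan détails ₹500 at http://example.com"
cleaned = strip_non_ascii(sample)
print(cleaned)
```

Characters are dropped, not transliterated, so "é" disappears instead of becoming "e".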


0
urls = re.findall(r'https?://[^\s<>"]+', req.text)
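
A slightly fuller sketch of the same idea, assuming you also want to filter by the scheme the user typed. The html string below is a made-up stand-in for req.text, and find_urls is a hypothetical helper name. The pattern stops at whitespace, angle brackets, and double quotes, so it would run past the end of a URL inside a single-quoted attribute.

```python
import re

# Hypothetical sample markup standing in for req.text
html = '''<a href="http://example.com/a">A</a>
<script src="https://cdn.example.com/app.js"></script>
<p>Plain text link: https://example.org/page?x=1</p>'''

# Same pattern as above: scheme, then everything up to whitespace, <, > or "
URL_RE = re.compile(r'https?://[^\s<>"]+')


def find_urls(page_source, scheme=None):
    """Return all URLs in page_source, optionally limited to one scheme."""
    urls = URL_RE.findall(page_source)
    if scheme is not None:
        # "https://..." does not start with "http://", so the two never mix.
        urls = [u for u in urls if u.startswith(scheme + "://")]
    return urls


print(find_urls(html, "http"))
print(find_urls(html, "https"))
```

Because this scans the raw source, it also picks up URLs in script tags and plain text, not only those in 'a' tags.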
