Web Scraping tag of html in Python

Question

I would like to scrape all the links that end with .php. I have written a regex to select the target URLs, such as samsung-phones-f-9-0-r1-p1.php

I am wondering whether something is wrong with my regex or whether the tag is incorrect.

Thank you so much in advance for answering

from bs4 import BeautifulSoup
from urllib import request
import re

base_url = 'https://www.gsmarena.com/samsung-phones-9.php'
# build a request that sends a browser-like User-Agent header
webrequest = request.Request(base_url, headers = {
    "User-Agent" : 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.101 Safari/537.36'})

# open the url
html = request.urlopen(webrequest).read().decode('utf-8')
soup = BeautifulSoup(html, features = 'lxml')
# scraping sub urls
sub_urls = soup.find_all('a', {"href": re.compile("(samsung).+(.php)")})
# https:\/\/www\.gsmarena\.com\/samsung.+(.php)
print(sub_urls)

What is the expected output?

– Jarvis, Commented Dec 25, 2020 at 20:17

idar · Accepted Answer · 2020-12-25 20:37:27Z

1

You are doing it right, but you are not extracting the actual href attribute from the tags.
Modify this line:

sub_urls = soup.find_all('a', {"href": re.compile("(samsung).+(.php)")})

to this:

sub_urls = [x.get('href') for x in soup.find_all('a', {"href": re.compile("(samsung).+(.php)")})]
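To make the difference concrete, here is a minimal sketch that runs without network access. The SAMPLE_HTML markup is made up for illustration and is not the real GSMArena page, and it assumes bs4 is installed (the built-in "html.parser" is used here instead of lxml to avoid an extra dependency).

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page.
SAMPLE_HTML = """
<div>
  <a href="samsung-phones-f-9-0-r1-p1.php">Page 1</a>
  <a href="samsung-phones-f-9-0-r1-p2.php">Page 2</a>
  <a href="news.php3">News</a>
  <a>No href here</a>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
pattern = re.compile(r"samsung.+\.php")

# find_all() only filters: the result is a list of <a> Tag objects.
tags = soup.find_all("a", href=pattern)

# .get("href") reads the attribute value from each selected tag.
hrefs = [tag.get("href") for tag in tags]

print(tags)   # whole <a>...</a> elements
print(hrefs)  # only the href strings
```

Printing tags shows the full elements, which is what the original code produced, while hrefs holds just the link targets.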


2 Comments

James Huang Over a year ago

Could I ask why it needs a get() to return the href? I thought that if I specified "href", all the "href" values would be selected.

idar Over a year ago

Anything specified within the find_all() function only helps narrow down the tag selection. This is why your find_all() call was returning a list of <a> tags. To get any attribute of the selected tags, you must use the .get() function and specify the key you need.

BrKo14 · Accepted Answer · 2020-12-25 20:38:13Z

0

Your regular expression is not exactly wrong, since it will capture the URLs you are aiming for. However, there are two points to consider: (1) the parentheses are unnecessary, and (2) you should escape the . character in .php, or it will be treated as a wildcard that matches any character. A better pattern might look like this: samsung.+\.php. Of course, this will only capture the .php file name itself and not the whole URL. If you wanted the whole thing, you would have to use .*samsung.+\.php.
You can always use an online tool to test your regular expressions.
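The same check can be done locally with the standard re module. The sketch below compares the original pattern with the escaped one; the string "samsung-galaxy-xphp" is a made-up example chosen only to show what the unescaped dot lets through.

```python
import re

loose = re.compile("(samsung).+(.php)")    # original: "." matches any character
strict = re.compile(r"samsung.+\.php")     # escaped dot: requires a literal ".php"
whole = re.compile(r".*samsung.+\.php")    # also consumes whatever precedes "samsung"

full_url = "https://www.gsmarena.com/samsung-phones-9.php"
candidates = [
    "samsung-phones-f-9-0-r1-p1.php",
    full_url,
    "samsung-galaxy-xphp",  # hypothetical string with no literal ".php"
]

for text in candidates:
    # search() looks for the pattern anywhere in the string
    print(text, bool(loose.search(text)), bool(strict.search(text)))

# The matched text differs depending on the leading ".*"
print(strict.search(full_url).group())
print(whole.search(full_url).group())
```

Both patterns find the real links, but only the loose one accepts the string without a literal ".php". The last two lines show that samsung.+\.php matches from "samsung" onward, while the .* prefix extends the match to the start of the URL.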

Either way, you haven't outlined any concrete issues, therefore I'm not sure if I've satisfied your doubts.


1 Comment

James Huang Over a year ago

I have tried the online tool, and it shows that the pattern matches correctly, so I suspect the problem is with the HTML tags.
