How to extract specific string on a web page using Python

Question

Here's the complete HTML code of the page I'm trying to scrape, so please take a look at it first: https://codepen.io/bendaggers/pen/LYpZMNv

As you can see, this is the page source of mbasic.facebook.com.

What I'm trying to do is scrape all the anchor tags that have a pattern like this:

Example

<a class="cf" href="/profile.php?id=100044454444312&amp;fref=fr_tab">

Example with wild card.

<a class="cf" href="*">

I put a wildcard in place of the href value (href="*") because the values are dynamic.

Here's my (not working) Python Code.

driver.get('https://mbasic.facebook.com/cheska.cabral.796/friends')
pagex = re.compile(driver.page_source)
pattern = "<a class=\"cf\" href=\"*\">"
print(pagex.findall(pattern))

Note that the page contains several anchors matching this pattern, so I need to capture all of them and print them.

<td class="w n" style="vertical-align: middle"><img src="https://scontent.fceb2-1.fna.fbcdn.net/v/t1.0-1/cp0/e15/q65/p50x50/79342209_112439723581175_5245034566049071104_o.jpg?_nc_cat=108&amp;_nc_sid=dbb9e7&amp;efg=eyJpIjoiYiJ9&amp;_nc_ohc=lADKURnNsk4AX8WTS1F&amp;_nc_ht=scontent.fceb2-1.fna&amp;_nc_tp=3&amp;oh=96f40cb2f95acbcfe9f6e4dc6cb31161&amp;oe=5EC27AEB" class="bo s" alt="Natividad Cruz, profile picture" /></td>
<td class="w t" style="vertical-align: middle"><a class="cf" href="/profile.php?id=100044454444312&amp;fref=fr_tab">Natividad Cruz</a>
<td class="w n" style="vertical-align: middle"><img src="https://scontent.fceb2-1.fna.fbcdn.net/v/t1.0-1/cp0/e15/q65/p50x50/10306248_10201945477974508_4213924286888352892_n.jpg?_nc_cat=109&amp;_nc_sid=dbb9e7&amp;efg=eyJpIjoiYiJ9&amp;_nc_ohc=Z2daQ-qGgpsAX8BmLKr&amp;_nc_ht=scontent.fceb2-1.fna&amp;_nc_tp=3&amp;oh=22f2b487166a7cd06e4ff650af4f7a7b&amp;oe=5EC34325" class="bo s" alt="John Vinas, profile picture" /></td>
<td class="w t" style="vertical-align: middle"><a class="cf" href="/john.vinas?fref=fr_tab">John Vinas</a>

My goal is to find all of these anchor tags and print them in the terminal. I'd appreciate your help on this. Thank you!

I tried another version of the code, but no luck :)

driver.get('https://mbasic.facebook.com/cheska.cabral.796/friends')
pagex = driver.page_source
pattern = "<td class=\".*\" style=\"vertical-align: middle\"><a class=\".*\">"
x = re.findall(pattern, pagex)
print(x)

Kyle · Accepted Answer · 2020-04-19 14:07:17Z

1

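I think your wildcard needs a dot in front of it, like .* (in a regex, * on its own only repeats the preceding character, so "* matches zero or more quote characters rather than "anything").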

I'd also recommend using a library like Beautiful Soup for this, it might make your life easier.
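As a rough illustration of the Beautiful Soup suggestion, here is a minimal sketch that runs against two rows copied from the question. It assumes the bs4 package is installed; the extra "Help" link and the variable names are hypothetical and only there to show that anchors without class "cf" are skipped. In the asker's script, the html string would presumably come from driver.page_source.

```python
from bs4 import BeautifulSoup

# Sample markup based on the question; the "Help" link is a made-up
# extra row so the class filter has something to exclude.
html = '''
<table><tr>
<td class="w t" style="vertical-align: middle"><a class="cf" href="/profile.php?id=100044454444312&amp;fref=fr_tab">Natividad Cruz</a></td>
<td class="w t" style="vertical-align: middle"><a class="cf" href="/john.vinas?fref=fr_tab">John Vinas</a></td>
<td><a class="other" href="/help">Help</a></td>
</tr></table>
'''

soup = BeautifulSoup(html, "html.parser")

# find_all with class_="cf" keeps only <a> tags carrying that CSS class.
links = [(a.get_text(strip=True), a["href"]) for a in soup.find_all("a", class_="cf")]

for name, href in links:
    print(name, href)
```

The parser decodes &amp;amp; in attribute values, so the hrefs come back with a plain & and no regex is needed.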

Eric Truett · Accepted Answer · 2020-04-19 14:08:10Z

1

You should use a parsing library, such as BeautifulSoup or requests-html. If you want to do it manually, then build on the second attempt you posted. The first won't get you what you want because you are compiling the entire page as a regular expression.

import re

s = """<a class="cf" href="/profile.php?id=100044454444312&amp;fref=fr_tab">\n\n<h1>\n<a class="cf" href="/profile.php?id=20004666644312&amp;fref=fr_tab">"""

patt = r'<a.*?class[="]{2}cf.*?href.*?profile.*?>'
matches = re.findall(patt, s)

Output

>>> matches
['<a class="cf" href="/profile.php?id=100044454444312&amp;fref=fr_tab">',
 '<a class="cf" href="/profile.php?id=20004666644312&amp;fref=fr_tab">']

3 Comments

Ben Daggers Over a year ago

Hi Eric! Thank you for helping out. The thing is, href values such as "/profile.php?id=100044454444312&fref=fr_tab" are dynamic. They're not always like this.

Eric Truett Over a year ago

You can just remove profile.*? but, in that case, there is little need for href in the regex at all, because practically every <a> element on the page will have an href attribute. Finding exactly the right regex takes some trial and error.

Ben Daggers Over a year ago

Thanks for your advice, Eric. I'm using Selenium right now because it lets me easily simulate mouse clicks on the Facebook page. I'm not sure about requests-html, as I'm not familiar with that library, though I do have beginner-level knowledge of BS. I'll look into that. Thank you.
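For the dynamic href values discussed in the comments above, one possible regex-only variant is sketched below. It is not the pattern from the answer: it swaps in a capture group, ([^"]*), which matches whatever sits between the quotes. It assumes the attributes always appear as class="cf" followed by href="...", exactly as in the question's sample; the variable names are hypothetical.

```python
import html
import re

# Two rows copied from the question's sample markup.
page = '''
<td class="w t" style="vertical-align: middle"><a class="cf" href="/profile.php?id=100044454444312&amp;fref=fr_tab">Natividad Cruz</a>
<td class="w t" style="vertical-align: middle"><a class="cf" href="/john.vinas?fref=fr_tab">John Vinas</a>
'''

# [^"]* means "any run of characters that are not a double quote",
# so the group captures the href value whatever it happens to be.
pattern = re.compile(r'<a class="cf" href="([^"]*)">')

# With one capture group, findall returns just the captured href values.
# html.unescape turns &amp; back into a plain &.
hrefs = [html.unescape(h) for h in pattern.findall(page)]
print(hrefs)
```

If the attribute order or quoting ever differs from the sample, this pattern silently misses those anchors, which is one reason the answers lean toward a parsing library.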

Rahul Podi Rajagopal · Accepted Answer · 2020-04-19 14:14:37Z

0

As mentioned by the previous respondent, BeautifulSoup is among the best tools available in Python for scraping web pages. To import Beautiful Soup and the other libraries, use the following:

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

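After that, the following commands should do what you need:

# url must already hold the address of the page you want to scrape
req = Request(url, headers={'User-Agent': 'Chrome/64.0.3282.140'})
result = urlopen(req).read()  # raw bytes of the response body
soup = BeautifulSoup(result, "html.parser")
atags = soup('a')  # shorthand for soup.find_all('a')

Here url is the link you want to scrape, and the headers argument carries the User-Agent string that identifies the browser and its version.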
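Since soup('a') returns every anchor on the page, a follow-up step would be needed to keep only the class "cf" links the question asks about. The sketch below shows one way to do that and to turn the relative hrefs into absolute URLs. It assumes bs4 is installed and uses a small hard-coded string in place of the downloaded result; base_url, friend_links, and the "Home" link are hypothetical names and data for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Assumed base address, used to resolve relative hrefs.
base_url = "https://mbasic.facebook.com/"

# Stand-in for the downloaded page; the "Home" link is made up.
result = (
    '<a class="cf" href="/john.vinas?fref=fr_tab">John Vinas</a>'
    '<a href="/home.php">Home</a>'
)

soup = BeautifulSoup(result, "html.parser")
atags = soup("a")

# Beautiful Soup exposes the class attribute as a list of class names,
# so membership can be tested with "in"; anchors without href are skipped.
friend_links = [
    urljoin(base_url, a["href"])
    for a in atags
    if a.has_attr("href") and "cf" in a.get("class", [])
]
print(friend_links)
```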
