Here's the complete HTML of the page I'm trying to scrape, so please take a look first: https://codepen.io/bendaggers/pen/LYpZMNv

As you can see, this is the page source of mbasic.facebook.com.

What I'm trying to do is scrape all the anchor tags that have a pattern like this:

Example

<a class="cf" href="/profile.php?id=100044454444312&amp;fref=fr_tab">

Example with a wildcard:

<a class="cf" href="*">

I decided to put a wildcard (*) in place of the href value, since the values are dynamic.

Here's my (non-working) Python code:

driver.get('https://mbasic.facebook.com/cheska.cabral.796/friends')
pagex = re.compile(driver.page_source)
pattern = "<a class=\"cf\" href=\"*\">"
print(pagex.findall(pattern))

Note that the page contains several anchors matching this pattern, so I need to capture all of them and print them.

<td class="w n" style="vertical-align: middle"><img src="https://scontent.fceb2-1.fna.fbcdn.net/v/t1.0-1/cp0/e15/q65/p50x50/79342209_112439723581175_5245034566049071104_o.jpg?_nc_cat=108&amp;_nc_sid=dbb9e7&amp;efg=eyJpIjoiYiJ9&amp;_nc_ohc=lADKURnNsk4AX8WTS1F&amp;_nc_ht=scontent.fceb2-1.fna&amp;_nc_tp=3&amp;oh=96f40cb2f95acbcfe9f6e4dc6cb31161&amp;oe=5EC27AEB" class="bo s" alt="Natividad Cruz, profile picture" /></td>
<td class="w t" style="vertical-align: middle"><a class="cf" href="/profile.php?id=100044454444312&amp;fref=fr_tab">Natividad Cruz</a>
<td class="w n" style="vertical-align: middle"><img src="https://scontent.fceb2-1.fna.fbcdn.net/v/t1.0-1/cp0/e15/q65/p50x50/10306248_10201945477974508_4213924286888352892_n.jpg?_nc_cat=109&amp;_nc_sid=dbb9e7&amp;efg=eyJpIjoiYiJ9&amp;_nc_ohc=Z2daQ-qGgpsAX8BmLKr&amp;_nc_ht=scontent.fceb2-1.fna&amp;_nc_tp=3&amp;oh=22f2b487166a7cd06e4ff650af4f7a7b&amp;oe=5EC34325" class="bo s" alt="John Vinas, profile picture" /></td>
<td class="w t" style="vertical-align: middle"><a class="cf" href="/john.vinas?fref=fr_tab">John Vinas</a>

My goal is to find all of these anchor tags and print them in the terminal. I'd appreciate your help with this. Thank you!

I tried another version of the code, but no luck :)

driver.get('https://mbasic.facebook.com/cheska.cabral.796/friends')
pagex = driver.page_source
pattern = "<td class=\".*\" style=\"vertical-align: middle\"><a class=\".*\">"
x = re.findall(pattern, pagex)
print(x)

3 Answers

1

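I think your wildcard needs a dot in front of it, like .* (in a regular expression, * on its own only repeats the preceding character, while . matches any character).

I'd also recommend using a library like Beautiful Soup for this. It might make your life easier.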
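To illustrate the point about the dot, here is a minimal sketch that runs both patterns against the two sample rows from the question. The variable names are made up for the illustration, and the sample string stands in for the real page source.

```python
import re

# Sample markup copied from the question (stand-in for the real page source).
html = (
    '<td class="w t" style="vertical-align: middle">'
    '<a class="cf" href="/profile.php?id=100044454444312&amp;fref=fr_tab">Natividad Cruz</a>\n'
    '<td class="w t" style="vertical-align: middle">'
    '<a class="cf" href="/john.vinas?fref=fr_tab">John Vinas</a>'
)

# A bare * repeats the preceding character (here the double quote),
# so this pattern never gets past the start of the href value.
glob_style = r'<a class="cf" href="*">'

# .*? means "any characters, as few as possible", which is the wildcard intended.
regex_style = r'<a class="cf" href=".*?">'

no_matches = re.findall(glob_style, html)
tags = re.findall(regex_style, html)

print(no_matches)
print(tags)
```

The non-greedy form (.*?) matters here: a greedy .* could run on to a later "> on the same line and swallow more than one tag.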

1

You should use a parsing library, such as BeautifulSoup or requests-html. If you want to do it manually, then build on the second attempt you posted. The first won't get you what you want because you are compiling the entire page as a regular expression.

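import re

# Sample input: two anchors of the kind described in the question.
s = """<a class="cf" href="/profile.php?id=100044454444312&amp;fref=fr_tab">\n\n<h1>\n<a class="cf" href="/profile.php?id=20004666644312&amp;fref=fr_tab">"""

# Non-greedy pattern: an <a> tag with class "cf" whose href contains "profile".
patt = r'<a.*?class[="]{2}cf.*?href.*?profile.*?>'
matches = re.findall(patt, s)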

Output

>>> matches
['<a class="cf" href="/profile.php?id=100044454444312&amp;fref=fr_tab">',
 '<a class="cf" href="/profile.php?id=20004666644312&amp;fref=fr_tab">']
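If the href values are not always profile links, one possible variation is to compile the pattern (rather than the page) and capture whatever sits between the quotes. This is only a sketch: page_source below is a stand-in string for driver.page_source, and link_re and hrefs are names chosen for the example.

```python
import re
from html import unescape

# Stand-in for driver.page_source; the two rows come from the question.
page_source = """
<td class="w t" style="vertical-align: middle"><a class="cf" href="/profile.php?id=100044454444312&amp;fref=fr_tab">Natividad Cruz</a>
<td class="w t" style="vertical-align: middle"><a class="cf" href="/john.vinas?fref=fr_tab">John Vinas</a>
"""

# Compile the pattern, then search the page with it (not the other way round).
# [^"]* matches everything up to the closing quote, so any href value is accepted.
link_re = re.compile(r'<a class="cf" href="([^"]*)">')

# With one capturing group, findall returns just the captured href values;
# unescape turns the HTML entity &amp; back into a plain &.
hrefs = [unescape(h) for h in link_re.findall(page_source)]

print(hrefs)
```

This assumes the attributes always appear in the order class, then href, as they do in the posted source. A parsing library does not depend on that ordering.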

3 Comments

Hi Eric! Thank you for helping out. The thing is, the href value "/profile.php?id=100044454444312&amp;fref=fr_tab" is dynamic. It's not always like this.
You can just remove profile.*? but, in that case, there is no need for href in the regex at all, because every <a> element will have an href attribute. It takes some trial and error to find exactly the right regex.
Thanks for your advice, Eric. I'm using Selenium right now because it lets me easily simulate mouse clicks on the Facebook page. I'm not sure about requests-html, as I don't know that library, though I have beginner-level knowledge of BS. I'll look into that. Thank you.
0

As mentioned by the previous respondent, BeautifulSoup is one of the best tools available in Python for scraping web pages. To import Beautiful Soup and the other libraries needed, use the following statements:

  • from urllib.request import Request, urlopen
  • from bs4 import BeautifulSoup

After that, the following commands should do what you need:

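# Build the request with a browser-like User-Agent header.
req = Request(url, headers={'User-Agent': 'Chrome/64.0.3282.140'})
result = urlopen(req).read()
soup = BeautifulSoup(result, "html.parser")
# Calling the soup object is shorthand for soup.find_all('a').
atags = soup('a')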

url in the code above is the address of the page you want to scrape, and the headers argument sets the User-Agent string, i.e. the browser name and version presented to the server.
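The code above collects every <a> tag. To keep only the class="cf" anchors with Beautiful Soup, soup.find_all('a', class_='cf') should do it. As a rough illustration of the same parser-based idea without any third-party install, here is a sketch that swaps in the standard library's html.parser instead. The CfLinkCollector class, the sample string, and the third "other" anchor are all made up for the example.

```python
from html.parser import HTMLParser


class CfLinkCollector(HTMLParser):
    """Collects (href, text) for every <a class="cf"> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a class="cf"> currently open, if any
        self._text = []      # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # class may hold several space-separated names, so split before checking.
        if tag == "a" and "cf" in (attrs.get("class") or "").split():
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Two anchors from the question plus one hypothetical anchor that should be skipped.
sample = (
    '<td class="w t"><a class="cf" href="/profile.php?id=100044454444312&amp;fref=fr_tab">Natividad Cruz</a>'
    '<td class="w t"><a class="cf" href="/john.vinas?fref=fr_tab">John Vinas</a>'
    '<a class="other" href="/ignored">Skip me</a>'
)

collector = CfLinkCollector()
collector.feed(sample)
collector.close()
links = collector.links

for href, text in links:
    print(href, text)
```

The parser decodes &amp; in attribute values on its own, so the collected hrefs contain a plain &. With Selenium, driver.page_source could be passed to feed() in place of the sample string.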
