Web scraping with Python and BeautifulSoup (on an Ubuntu-like Linux)

Question

from bs4 import BeautifulSoup
import requests

url = 'http://www.bhaktiyogapura.com/2017/03/calendario-vaisnava-marco-de-2017/'
header = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                        'AppleWebKit/537.36 (KHTML, like Gecko) '
                        'Chrome/51.0.2704.103 Safari/537.36'}



req = requests.get(url, headers=header)
html = req.text
soup = BeautifulSoup(html, 'html.parser')

I am trying to extract information from http://www.bhaktiyogapura.com/2017/03/calendario-vaisnava-marco-de-2017

Dia: 4 Março 2017- Sábado

Event: Desaparecimento de Śrī Śrīmad Bhakti Vaibhava Purī Gosvāmī Mahārāja

For every day I need to extract the day and the event, for example:

day -> Dia: 4 Março 2017- Sábado

event -> the string that is preceded by "-".

How can I do this using BeautifulSoup?

All the dates are in a <strong> tag, so use .select("strong"). Then for the event, maybe you can look for all p tags that begin with -.
– Charles Clayton, Commented Mar 10, 2017 at 21:51
@Charles Clayton, could you show some code? How can I print the event of the day? For example, what is today's event?
– Ed S, Commented Mar 10, 2017 at 22:06

DisappointedByUnaccountableMod · Accepted Answer · 2021-02-18 17:53:58Z

2

First, inspect what the page contains: print(soup) shows the parsed source and helps you locate the data you want to extract.

To get the text, use get_text() on the tags returned by find() and findAll():

data = soup.findAll(id="Dia:")
for dat in data:
    print (dat.get_text())

You can also follow next_sibling in a loop (it is an attribute, not a method).

Or use children to get only a tag's direct children, or descendants to walk everything nested inside it.
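The difference between these navigation helpers is easier to see on a small example. This is only a sketch: SAMPLE_HTML is hypothetical markup loosely modelled on the question, and the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on the question; the real page may differ.
SAMPLE_HTML = """
<div>
  <p><strong>4 Março 2017- Sábado</strong></p>
  <p class="event">- Desaparecimento de Śrī Śrīmad Bhakti Vaibhava Purī Gosvāmī Mahārāja</p>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
strong = soup.find("strong")
date_paragraph = strong.parent

# .next_sibling is an attribute; here it is the whitespace between the two <p> tags.
raw_sibling = date_paragraph.next_sibling

# find_next_sibling() is a method that skips text nodes and returns the next tag.
event_paragraph = date_paragraph.find_next_sibling("p")

# .children yields direct children only; .descendants walks the whole subtree.
child_count = len(list(soup.div.children))
descendant_count = len(list(soup.div.descendants))

print(repr(raw_sibling))
print(event_paragraph.get_text())
print(child_count, descendant_count)
```

In practice find_next_sibling() tends to be the safer choice when the HTML contains line breaks between tags.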

I've had a look at that site: event is a class in the HTML, so it should be

data = soup.findAll("p", {"class": "event"})
for dat in data:
    print(dat.get_text())

That gives all the events on the page. To print only the event for the date you want, you can use next_sibling, because each event sits directly under its date.

date = input("Enter the date in Portuguese: ")

for strong in soup.findAll("strong"):
    if strong.get_text() == date:
        # next_sibling is an attribute of the tag, not of its text
        print(strong.next_sibling)

This prints the event for the day you want.

You can use date from datetime, or the calendar module, to get the current day, and then map the month number to its name with a dictionary like:

date_aa = { "1":"janvier",....,"3":"Março",......}
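As a rough sketch of that idea, the snippet below builds the date string for a given day. The names MONTHS_PT and build_date_label are made up for illustration, and the "23 Março 2017- " format is assumed from the date strings used elsewhere in this answer.

```python
from datetime import date

# Portuguese month names, keyed by month number (1-12).
MONTHS_PT = {
    1: "Janeiro", 2: "Fevereiro", 3: "Março", 4: "Abril",
    5: "Maio", 6: "Junho", 7: "Julho", 8: "Agosto",
    9: "Setembro", 10: "Outubro", 11: "Novembro", 12: "Dezembro",
}


def build_date_label(day: date) -> str:
    """Return a label such as '23 Março 2017- ' (format assumed, see above)."""
    return f"{day.day} {MONTHS_PT[day.month]} {day.year}- "


# Label for today, ready to compare against the text of a <strong> tag.
print(build_date_label(date.today()))
```

A dictionary keeps this independent of the system locale, which may not have Portuguese installed.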

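You can start from the events and look at the previous sibling instead, to avoid picking up unrelated text that is not an event.

event_day = soup.findAll("p", {"class": "event"})
for event in event_day:
    # previous_sibling may be a whitespace text node, so step back to the previous tag
    previous = event.find_previous_sibling()
    if previous is not None and previous.get_text() == date:
        print(event.get_text())

That would be better. You can use a regular expression too.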

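import re
pattern = re.compile(r'(<p class="event">)+\w')
for strong in soup.findAll("strong"):
    if strong.get_text() == date:
        # assumes the date sits in its own tag and the event is that tag's next sibling
        following = strong.parent.find_next_sibling()
        if following is not None and pattern.search(str(following)):
            print(following.get_text())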

Another way:

To show all Dia entries (the dates, I guess):

for strong in soup.findAll("strong"):
    print(strong.get_text())

To show all events:

for event in soup.findAll("p", {"class": "event"}):
    print(event.get_text())

To look at everything nested next to the date you want:

for strong in soup.findAll("strong"):
    if strong.get_text() == date:
        for node in strong.parent.descendants:
            print(node)

You can use descendants instead of children. Reading the book Web Scraping with Python will help a lot.

That's how to scrape the events and dates:

from bs4 import BeautifulSoup
import requests
import re

url = 'http://www.bhaktiyogapura.com/2017/03/calendario-vaisnava-marco-de-2017/'
header = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                        'AppleWebKit/537.36 (KHTML, like Gecko) '
                        'Chrome/51.0.2704.103 Safari/537.36'}

req = requests.get(url, headers=header)
html = req.text
soup = BeautifulSoup(html, 'html.parser')
date = "23 Março 2017- "

#a = """(<p class="event">)+\w"""
#date_day = soup.findAll({"strong"})
#for strong in soup.find({"strong"}).descendants:
#    print (strong)

# note: this prints every event on the page under every date
for strong in soup.findAll("strong"):
    print("Dia:" + strong.get_text())
    for string in soup.findAll("p", {"class": "event"}):
        print("Event: " + string.get_text())

To print only the (Dia:) and (Event:) for one date with BeautifulSoup:

from bs4 import BeautifulSoup
import requests

url = 'http://www.bhaktiyogapura.com/2017/03/calendario-vaisnava-marco-de-2017/'
header = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                        'AppleWebKit/537.36 (KHTML, like Gecko) '
                        'Chrome/51.0.2704.103 Safari/537.36'}

req = requests.get(url, headers=header)
html = req.text
soup = BeautifulSoup(html, 'html.parser')

date = input("Enter the Dia as (23 Março 2017): ") + "- "
for strong in soup.findAll("strong"):
    if strong.get_text() == date:
        print("Dia: ", strong.get_text())
        # find_next returns the first matching tag after this one in document order;
        # this assumes the event paragraph follows its date on the page
        event = strong.find_next("p", {"class": "event"})
        if event is not None:
            print("Event: ", event.get_text())
        break
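If a day can have more than one event, one possible approach is to group the event paragraphs under the date that precedes them and then look up a single day. The sketch below runs on hypothetical markup (SAMPLE_HTML, with placeholder events "Evento A" and "Evento B"), and events_by_day is an illustrative helper, not part of BeautifulSoup. It assumes each date sits in its own <p> and its events are the <p class="event"> siblings that follow.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real page may nest these tags differently.
SAMPLE_HTML = """
<p><strong>4 Março 2017- </strong>Sábado</p>
<p class="event">- Desaparecimento de Śrī Śrīmad Bhakti Vaibhava Purī Gosvāmī Mahārāja</p>
<p><strong>23 Março 2017- </strong>Quinta-feira</p>
<p class="event">- Evento A</p>
<p class="event">- Evento B</p>
"""


def events_by_day(soup):
    """Map each <strong> date text to the event paragraphs that follow it."""
    result = {}
    for strong in soup.find_all("strong"):
        events = []
        # Walk forward through the paragraphs after the date's own <p>.
        for sibling in strong.parent.find_next_siblings("p"):
            if "event" not in sibling.get("class", []):
                break  # reached the next date paragraph
            events.append(sibling.get_text(strip=True))
        result[strong.get_text(strip=True)] = events
    return result


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
day_events = events_by_day(soup)

# Print only the events of one specific day.
for event in day_events.get("23 Março 2017-", []):
    print("Event:", event)
```

Because get_text(strip=True) trims whitespace, the lookup key here has no trailing space after the "-".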

edited Feb 18, 2021 at 17:53

DisappointedByUnaccountableMod


answered Mar 10, 2017 at 23:38

NINO Cikoo


7 Comments

Ed S Over a year ago

I tried to run this code and got: "in <module> if (strong.get_text() == date): NameError: name 'strong' is not defined"

NINO Cikoo Over a year ago

Sorry, I missed a global.

Ed S Over a year ago

:justpaste.it/edit/14985998/a7f67796 I got a lot of errors. Could you help me?

NINO Cikoo Over a year ago

This will give you the event from below the date.

Ed S Over a year ago

It's working. How can I only print the event in the specific day, not all events?


oshaiken · Accepted Answer · 2017-03-10 22:04:34Z

2

Here is a start for you. You can use XPath through lxml (BeautifulSoup itself does not support XPath). Add a loop according to your logic.

from lxml import html
import requests

url = 'http://www.bhaktiyogapura.com/2017/03/calendario-vaisnava-marco-de-2017/'
header = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                        'AppleWebKit/537.36 (KHTML, like Gecko) '
                        'Chrome/51.0.2704.103 Safari/537.36'}

page = requests.get(url, headers=header)
tree = html.fromstring(page.content)

# absolute path to one date; xpath() returns a list of matching text nodes
data = tree.xpath('//*[@id="post-9445"]/div[2]/div[1]/div/div/div/p[5]/strong/text()')

print(data)
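An absolute path like the one above breaks as soon as the page layout shifts, so a relative XPath may be a more robust starting point for the loop. The sketch below runs on hypothetical markup (SAMPLE_HTML) rather than the live page, and the class name "event" is taken from the other answer, not verified here.

```python
from lxml import html

# Hypothetical markup; the real page may differ.
SAMPLE_HTML = """
<div>
  <p><strong>4 Março 2017- </strong>Sábado</p>
  <p class="event">- Desaparecimento de Śrī Śrīmad Bhakti Vaibhava Purī Gosvāmī Mahārāja</p>
</div>
"""

tree = html.fromstring(SAMPLE_HTML)

# Relative paths: every <strong> inside a <p>, and every <p class="event">.
dates = tree.xpath('//p/strong/text()')
events = tree.xpath('//p[@class="event"]/text()')

for day in dates:
    print("Dia:", day)
for event in events:
    print("Event:", event)
```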

answered Mar 10, 2017 at 22:04

oshaiken
