Issue with scraping in Python

Question

I am trying to scrape a few specific lines from a page (URL in the code below) and build a table from the collected data, but I cannot get anything more targeted than the entire body text, so I am stuck.

To give an example:

I would like to arrive at the table below, built from details in the body content. All the details are there, but any help on how to retrieve them in that form would be much appreciated.

My code is:

import requests
from bs4 import BeautifulSoup
# providing url
url = 'https://www.polskawliczbach.pl/wies_Baniocha'

# creating request object
req = requests.get(url)

# creating soup object
data = BeautifulSoup(req.text, 'html')

# finding all li tags in ul and printing the text within it
data1 = data.find('body')
for li in data1.find_all("li"):
   print(li.text, end=" ")

To find the text in the body more quickly, look for: Liczba mieszkańców (number of inhabitants), Kobiety (women), Mężczyźni (men).
– Jaroslaw, Commented Jul 11, 2021 at 13:24
Please give more detailed information, such as which part of this website you want to scrape. SO only supports English, so try to give details in English rather than in other languages.
– imxitiz, Commented Jul 11, 2021 at 13:28
Thanks for your answer. The page (link given in the post) is about Polish census data. My question is how to scrape the number of inhabitants (Liczba mieszkańców), split into women (Kobiety) and men (Mężczyźni). I hope that helps with the language issue :-)
– Jaroslaw, Commented Jul 11, 2021 at 13:36
Yeah, that solves a bit of the language issue. But where exactly is that part on the website? Do you want that data from all categories or from a specific one? The data for men and women are in different categories. Please expand the screenshot to show the upper half of the data, and do the same for the lower half, because I didn't find that part while skimming through the website.
– imxitiz, Commented Jul 11, 2021 at 13:41

imxitiz · Accepted Answer · 2021-07-11 17:48:34Z

1

First find the ul, then find the li elements inside it. Scrape the data you need, store it in a variable, and build a table with pandas. After that, either save the table to a CSV file or just print it.

Here's the code that implements the steps above:

from bs4 import BeautifulSoup
import requests
import pandas as pd

page = requests.get('https://www.polskawliczbach.pl/wies_Baniocha')
soup = BeautifulSoup(page.content, 'lxml')

# Take the second "list-group row" list and skip its first and last <li>.
lis = soup.find_all("ul", class_="list-group row")[1].find_all("li")[1:-1]
dic = {"name": [], "value": []}
for li in lis:
    try:
        # The label is the text directly inside <li>; the value sits in its <span>.
        name = li.find(string=True, recursive=False).strip()
        value = li.find("span").text.replace(" ", "")
    except AttributeError:
        # Skip items that lack a direct text node or a <span>.
        continue
    # Append both together so the two columns stay the same length.
    dic["name"].append(name)
    dic["value"].append(value)
    print(name, value)

df = pd.DataFrame(dic)

print(df)
# If you want to save this as a file, uncomment the following line:
# df.to_csv("<FILENAME>.csv")
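To try the label/value extraction without hitting the site, here is a minimal offline sketch. The `SAMPLE_HTML` markup and the `extract_pairs` helper are hypothetical: the snippet only imitates the structure described in this thread (a `ul` with class `list-group row`, a header `li`, and items whose value sits in a `span`), and the real page may differ. It uses the built-in `html.parser` so that `lxml` is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the structure discussed in this thread;
# the real page may be laid out differently.
SAMPLE_HTML = """
<ul class="list-group row">
  <li class="hdr col-md-12">Podstawowe informacje</li>
  <li class="list-group-item">Liczba mieszkańców (2011)<span>1 935</span></li>
  <li class="list-group-item">Kod pocztowy<span>05-532</span></li>
  <li class="list-group-item">No span here</li>
</ul>
"""


def extract_pairs(html):
    """Return (label, value) tuples for every <li> that has both parts."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for li in soup.find_all("li"):
        # First text node that is a direct child of <li> (the label).
        label = li.find(string=True, recursive=False)
        # First <span> inside the <li> (the value).
        span = li.find("span")
        if label is None or span is None:
            continue  # header rows and malformed items are skipped
        pairs.append((label.strip(), span.get_text(strip=True)))
    return pairs


pairs = extract_pairs(SAMPLE_HTML)
print(pairs)
```

Checking for `None` explicitly does the same job as the `try`/`except` in the code above, and it also filters out the header item, which has text but no `span`.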

Additionally, if you want to scrape all the "categories": I don't understand the language, so I don't know which ones are useful and which are not, but here's the code anyway. Just replace the corresponding part of the code above:

soup = BeautifulSoup(page.content, 'lxml')

dic = {"name": [], "value": []}
# Walk every "list-group row" list instead of only the second one.
for ul in soup.find_all("ul", class_="list-group row"):
    for li in ul.find_all("li")[1:-1]:
        try:
            name = li.find(string=True, recursive=False).strip()
            value = li.find("span").text.replace(" ", "").replace(",", "")
        except AttributeError:
            # Skip items that lack a direct text node or a <span>.
            continue
        print(name, "\t", value)
        dic["name"].append(name)
        dic["value"].append(value)

df = pd.DataFrame(dic)
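One caveat about the cleanup step: Polish number formatting commonly uses a comma as the decimal separator, so stripping commas could turn a value like "51,2" into "512". Whether this page formats its numbers that way is an assumption worth checking against the actual output. The sketch below shows one possible way to convert the scraped strings; `parse_value` and its `decimal_comma` flag are hypothetical names introduced only for illustration.

```python
import re


def parse_value(raw, decimal_comma=True):
    """Return a float when raw looks like a plain number, else the stripped text."""
    # Drop every whitespace character, including non-breaking spaces
    # that are often used as thousands separators.
    compact = "".join(raw.split())
    if decimal_comma:
        candidate = compact.replace(",", ".")  # "51,2" -> "51.2"
    else:
        candidate = compact.replace(",", "")   # "1,935" -> "1935"
    if re.fullmatch(r"-?\d+(\.\d+)?", candidate):
        return float(candidate)
    # Postal codes, phone prefixes and similar values stay as text.
    return raw.strip()


print(parse_value("1 935"))     # number with a space as thousands separator
print(parse_value("51,2"))      # number with a decimal comma
print(parse_value("05-532"))    # postal code, left as text
```

It could be applied to the table with something like `df["value"].map(parse_value)`, keeping in mind that the resulting column then mixes floats and strings.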

edited Jul 11, 2021 at 17:48

answered Jul 11, 2021 at 17:06

imxitiz


2 Comments

Jaroslaw Over a year ago

Hello. Works like a charm. I think I get it now. This is exactly what I was searching for. You have made my day. GREAT THANKS!

Jaroslaw Over a year ago

Again, many thanks for your time & contribution! It is even more than I thought I could get. It's amazing!

Bhavya Parikh · Accepted Answer · 2021-07-11 14:34:25Z

1

Find the main ul tag by its class, then find all the li tags inside it:

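# `data` is the BeautifulSoup object created in the question's code.
main_data = data.find("ul", class_="list-group").find_all("li")[1:-1]
names = []
values = []
for i in main_data:
    # The value is in the <span>; the name is the text directly inside <li>.
    values.append(i.find("span").get_text())
    names.append(i.find(string=True, recursive=False))
# One row of values; the names become the column headers.
main_values = [values]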

For a table representation, use the pandas module:

import pandas as pd
df=pd.DataFrame(columns=names,data=main_values)
df

Output:

   Liczba mieszkańców (2011)   Kod pocztowy   Numer kierunkowy
0  1 935                       05-532         (+48) 22

answered Jul 11, 2021 at 14:34

Bhavya Parikh


1 Comment

Jaroslaw Over a year ago

Many thanks for this one! It's almost exactly what I need. I took the text location from the inspector: the list has class="list-group row" and the header li has class="hdr col-md-12". However, I'm not sure how to apply that directly in the code. I have added one more PNG to show exactly where the text is located on the website.
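One possible way to use those inspector classes is to pick the list by its header text rather than by position. The sketch below runs on hypothetical markup: the section titles and the values 111 and 222 are placeholders rather than data from the site, and `rows_under_header` is an illustrative helper name. It assumes each section is a `ul` with class `list-group row` whose header is an `li` with class `hdr`, as described in the comment above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; titles and numbers are placeholders, not real data.
SAMPLE_HTML = """
<ul class="list-group row">
  <li class="hdr col-md-12">Section A</li>
  <li>Kod pocztowy<span>05-532</span></li>
</ul>
<ul class="list-group row">
  <li class="hdr col-md-12">Section B</li>
  <li>Kobiety<span>111</span></li>
  <li>Mężczyźni<span>222</span></li>
</ul>
"""


def rows_under_header(html, header_text):
    """Return (label, value) pairs from the list whose header contains header_text."""
    soup = BeautifulSoup(html, "html.parser")
    # CSS selector: header <li> directly inside a <ul class="list-group row">.
    for header in soup.select("ul.list-group.row > li.hdr"):
        if header_text not in header.get_text():
            continue
        rows = []
        # The data items are the <li> siblings that follow the header.
        for li in header.find_next_siblings("li"):
            label = li.find(string=True, recursive=False)
            span = li.find("span")
            if label is not None and span is not None:
                rows.append((label.strip(), span.get_text(strip=True)))
        return rows
    return []  # no header matched


rows = rows_under_header(SAMPLE_HTML, "Section B")
print(rows)
```

Matching on the header text keeps the code working if the order of the lists on the page changes, which an index such as `[1]` would not.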
