I am trying to scrape specific lines from a web page (the URL is in the code below) and build a table from the collected data, but I cannot get anything more specific than the entire body text, so I am stuck.

For example:

I would like to arrive at the table below by scraping details from the body content. All the details are there, but any help on how to retrieve them in the form shown would be much appreciated.

enter image description here

My code is:

import requests
from bs4 import BeautifulSoup
# providing url
url = 'https://www.polskawliczbach.pl/wies_Baniocha'

# creating request object
req = requests.get(url)

# creating soup object
data = BeautifulSoup(req.text, 'html')

# finding all li tags in ul and printing the text within it
data1 = data.find('body')
for li in data1.find_all("li"):
   print(li.text, end=" ")


  • To find the text in the body more quickly, search for: Liczba mieszkańców, Kobiety, Mężczyźni. Commented Jul 11, 2021 at 13:24
  • Please give more detail, for example which part of this website you want to scrape. SO only supports English, so try to give the details in English rather than in other languages. Commented Jul 11, 2021 at 13:28
  • Thanks for your answer. The page (link given in the post) is about Polish census data. My question is how to scrape the number of citizens (Liczba mieszkańców) split into women (Kobiety) and men (Mężczyźni). I hope that helps with the language issue :-) Commented Jul 11, 2021 at 13:36
  • I have also added a screenshot taken directly from the webpage. Commented Jul 11, 2021 at 13:40
  • Yes, that solves the language issue a little. But where exactly is that part on the website? Do you want the data from all categories or from a specific one? The data for men and women are in different categories. Please extend the screenshot to show the surrounding data above and below that part, because I didn't find it while skimming through the website. Commented Jul 11, 2021 at 13:41

2 Answers

First find the ul, then find the li elements inside it. Scrape the needed data, store it in a variable, and build the table with pandas. After that, either save the table to a CSV file or just print it.

Here is the code that implements the steps above:

from bs4 import BeautifulSoup
import requests
import pandas as pd

page = requests.get('https://www.polskawliczbach.pl/wies_Baniocha')
soup = BeautifulSoup(page.content, 'lxml')

# Second "list-group row" list on the page; skip its first and last <li>
lis = soup.find_all("ul", class_="list-group row")[1].find_all("li")[1:-1]
dic = {"name": [], "value": []}
for li in lis:
    try:
        # The label is the text directly inside <li>; the value is in the nested <span>
        name = li.find(string=True, recursive=False).strip()
        value = li.find("span").text.replace(" ", "")
    except AttributeError:
        # This <li> has no direct text or no <span>, so skip it
        continue
    # Append both together so the two lists always have the same length
    dic["name"].append(name)
    dic["value"].append(value)
    print(name, value)

df = pd.DataFrame(dic)

print(df)
# To save the table to a file, uncomment the following line:
# df.to_csv("<FILENAME>.csv")
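
To see why the label and the value are read separately, here is a minimal offline sketch that runs on a small hand-written HTML fragment instead of the live page. The fragment only imitates the structure described in this thread: the `list-group-item` and `badge` class names and the `extract_rows` helper are assumptions made for illustration, and the two values are copied from the output shown in the other answer.

```python
from bs4 import BeautifulSoup

# Hand-written fragment imitating the page structure (an assumption, not the real markup)
SAMPLE_HTML = """
<ul class="list-group row">
  <li class="hdr col-md-12">Example header</li>
  <li class="list-group-item">Liczba mieszkańców (2011)<span class="badge">1 935</span></li>
  <li class="list-group-item">Kod pocztowy<span class="badge">05-532</span></li>
</ul>
"""

def extract_rows(html):
    """Return (label, value) pairs for every <li> that has both parts."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for li in soup.select("ul.list-group.row > li"):
        # The label is the first text node that is a direct child of <li>
        label = li.find(string=True, recursive=False)
        # The value sits in the nested <span>
        span = li.find("span")
        if label is None or span is None:
            continue  # e.g. the header row, which has no <span>
        rows.append((label.strip(), span.get_text(strip=True)))
    return rows

rows = extract_rows(SAMPLE_HTML)
for label, value in rows:
    print(label, value)
```

Checking for `None` explicitly plays the same role as the `try`/`except` in the code above: rows without a direct text node or a `<span>` are simply skipped.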

Additionally, if you want to scrape all the "categories" (I don't understand the language, so I don't know which ones are useful and which are not), you can replace this part of the code above:

soup = BeautifulSoup(page.content, 'lxml')

dic = {"name": [], "value": []}
# Loop over every "list-group row" list instead of only the second one
for ul in soup.find_all("ul", class_="list-group row"):
    for li in ul.find_all("li")[1:-1]:
        try:
            name = li.find(string=True, recursive=False).strip()
            # Drop spaces and commas from the value
            value = li.find("span").text.replace(" ", "").replace(",", "")
        except AttributeError:
            # This <li> has no direct text or no <span>, so skip it
            continue
        print(name, "\t", value)
        dic["name"].append(name)
        dic["value"].append(value)

df = pd.DataFrame(dic)
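
Since the question asks only for Liczba mieszkańców, Kobiety and Mężczyźni, the scraped `dic` could be narrowed down before the DataFrame is built. The following is a rough sketch: `pick_rows` and `WANTED` are hypothetical names, it assumes the labels on the page start with exactly those words, and the `sample` input uses placeholder strings for the values that do not appear anywhere in this thread.

```python
# Labels the question asks for (taken from the asker's comment)
WANTED = ("Liczba mieszkańców", "Kobiety", "Mężczyźni")

def pick_rows(dic, wanted=WANTED):
    """Keep only the name/value pairs whose name starts with a wanted label."""
    picked = {"name": [], "value": []}
    for name, value in zip(dic["name"], dic["value"]):
        # str.startswith accepts a tuple and matches any of its items
        if name.startswith(wanted):
            picked["name"].append(name)
            picked["value"].append(value)
    return picked

# Same shape as `dic` above; "<women>" and "<men>" are placeholders, not real figures
sample = {
    "name": ["Liczba mieszkańców (2011)", "Kod pocztowy", "Kobiety", "Mężczyźni"],
    "value": ["1935", "05-532", "<women>", "<men>"],
}

picked = pick_rows(sample)
print(picked)
```

The result has the same two-list shape, so it can be passed to `pd.DataFrame` in place of `dic`.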

2 Comments

Hello. Works like a charm. I think I get it now. This is exactly what I was searching for. You have made my day. GREAT THANKS!
Again, many thanks for your time and contribution! It is even more than I thought I could get. It's amazing!

Find the main tag by its class, then find all the li tags inside it:

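# `data` is the BeautifulSoup object created in the question's code
main_data = data.find("ul", class_="list-group").find_all("li")[1:-1]
names = []
values = []
for i in main_data:
    # Value from the nested <span>, label from the text directly inside <li>
    values.append(i.find("span").get_text())
    names.append(i.find(string=True, recursive=False))
# A single row of values, with the labels used as column names
main_values = [values]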

For the table representation, use the pandas module:

import pandas as pd
df=pd.DataFrame(columns=names,data=main_values)
df

Output:

  Liczba mieszkańców (2011) Kod pocztowy Numer kierunkowy
0                     1 935       05-532         (+48) 22

1 Comment

Many thanks for this one! It's almost exactly what I need. I took the text location from the inspector: the list has class="list-group row" and the li has class "hdr col-md-12". However, I am not sure how to apply that directly in the code. I have added one more PNG to show exactly where the text is located on the website.
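
One possible way to use the classes mentioned in the comment above is to pick the list by the text of its `hdr` item and then read the remaining items. This is only a sketch on a hand-written fragment: the header texts "Section A" and "Section B", the `rows_under_header` helper and the exact markup are assumptions for illustration, while the labels and values are copied from the output shown in the answer.

```python
from bs4 import BeautifulSoup

# Hand-written fragment; header texts and layout are made up for illustration
SAMPLE_HTML = """
<ul class="list-group row">
  <li class="hdr col-md-12">Section A</li>
  <li>Kod pocztowy<span>05-532</span></li>
  <li>Numer kierunkowy<span>(+48) 22</span></li>
</ul>
<ul class="list-group row">
  <li class="hdr col-md-12">Section B</li>
  <li>Liczba mieszkańców (2011)<span>1 935</span></li>
</ul>
"""

def rows_under_header(html, header_text):
    """Return (label, value) pairs from the list whose header contains header_text."""
    soup = BeautifulSoup(html, "html.parser")
    for ul in soup.select("ul.list-group.row"):
        header = ul.find("li", class_="hdr")
        if header is None or header_text not in header.get_text():
            continue  # not the list we are looking for
        rows = []
        for li in ul.find_all("li"):
            label = li.find(string=True, recursive=False)
            span = li.find("span")
            # Skip the header row and any row without a label or a value
            if "hdr" in li.get("class", []) or label is None or span is None:
                continue
            rows.append((label.strip(), span.get_text(strip=True)))
        return rows
    return []

section_b = rows_under_header(SAMPLE_HTML, "Section B")
print(section_b)
```

On the real page, `header_text` would be whatever heading text the inspector shows above the rows of interest.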
