I ran into an issue preparing data for a database, since I'm doing this for the very first time. I scraped text from HTML dt and dd tags, so I'm getting a lot of information, both what I need and what I don't.

My output looks like this:

{'Plotas:': '49,16 m²', 'Kambarių sk.:': '2', 'Aukštas:': '2', 'Aukštų sk.:': '7', 'Metai:': '2022', 'Pastato tipas:': 'Mūrinis', 'Šildymas:': 'Centrinis kolektorinis', 'Įrengimas:': 'Dalinė apdaila                                                                            NAUDINGA:\nInterjero dizaineriai', 'Pastato energijos suvartojimo klasė:': 'A+', 'Reklama/pasiūlymas:': 'Pasirinkite geriausią internetą namams', 'Ypatybės:': 'Nauja kanalizacija\nNauja elektros instaliacija', 'Papildomos patalpos:': 'Sandėliukas\nVieta automobiliui', 'Apsauga:': 'Šarvuotos durys\nKodinė laiptinės spyna\nVaizdo kameros'}

My code looks like this:

import pandas as pd
from selenium import webdriver
from bs4 import BeautifulSoup
import re
import time
import csv

PATH = r'C:\Program Files (x86)\chromedriver.exe'
driver = webdriver.Chrome(PATH)


for puslapis in range(2, 3):
    driver.get(f'https://www.aruodas.lt/butai/vilniuje/puslapis/{puslapis}')
    response = driver.page_source
    soup = BeautifulSoup(response, 'html.parser')
    blocks = soup.find_all('tr', class_='list-row')

    stored_urls = []

    for url in blocks:
        try:
            stored_urls.append(url.a['href'])
        except (TypeError, KeyError):
            # Some rows (e.g. ads) have no link - skip them
            pass

    for link in stored_urls:
        driver.get(link)
        response = driver.page_source
        soup = BeautifulSoup(response, 'html.parser')

        try:
            # TODO: tidy up the address with a regex
            address = soup.find('h1', 'obj-header-text').text.strip()
            # print(address)
        except AttributeError:
            address = 'n/a'

        def get_dl(soup):
            keys, values = [], []
            for dl in soup.find_all("dl", {"class": "obj-details"}):
                for dt in dl.find_all("dt"):
                    keys.append(dt.text.strip())
                for dd in dl.find_all("dd"):
                    values.append(dd.text.strip())
            return dict(zip(keys, values))

        dl_dict = get_dl(soup)

Question: How can I filter and prepare only the data I need? For example, my desired output should look like this:

Plotas: 49,16 m²
Kambariu_sk: 2
Metai: 2022

How should I structure that info for easier transfer into a database?

1 Answer

I suggest you improve your loop to find both dt and dd entries at the same time. Then only add keys that are in a required list.

Try the following approach:

from selenium import webdriver
from bs4 import BeautifulSoup


def get_dl(soup):
    d = {}
    # Labels we want to keep from the definition list
    wanted = ['Plotas:', 'Kambarių sk.:', 'Metai:']
    key = None  # guard against a dd appearing before any dt

    for dl in soup.find_all("dl", {"class": "obj-details"}):
        # Each dt holds a label; the dd that follows it holds the value
        for el in dl.find_all(["dt", "dd"]):
            if el.name == 'dt':
                key = el.get_text(strip=True)
            elif key in wanted:
                d[key] = el.get_text(strip=True)

    return d


PATH = r'C:\Program Files (x86)\chromedriver.exe'
driver = webdriver.Chrome(PATH)
data = []

for puslapis in range(2, 3):
    driver.get(f'https://www.aruodas.lt/butai/vilniuje/puslapis/{puslapis}')
    response = driver.page_source
    soup = BeautifulSoup(response, 'html.parser')
    blocks = soup.find_all('tr', class_='list-row')
    stored_urls = []

    for url in blocks:
        try:
            stored_urls.append(url.a['href'])
        except (TypeError, KeyError):
            # Some rows (e.g. ads) have no link - skip them
            pass

    for link in stored_urls:
        driver.get(link)
        response = driver.page_source
        soup = BeautifulSoup(response, 'html.parser')
        h1 = soup.find('h1', 'obj-header-text')
        
        if h1:
            address = h1.get_text(strip=True)
        else:
            address = 'n/a'

        data.append({'Address': address, **get_dl(soup)})
            
for entry in data:
    print(entry)

Giving you data starting:

{'Address': 'Vilnius, Markučiai, Pakraščio g., 2 kambarių butas', 'Plotas:': '44,9 m²', 'Kambarių sk.:': '2', 'Metai:': '2023'}
{'Address': 'Vilnius, Pašilaičiai, Budiniškių g., 2 kambarių butas', 'Plotas:': '49,16 m²', 'Kambarių sk.:': '2', 'Metai:': '2022'}
{'Address': 'Vilnius, Senamiestis, Liejyklos g., 4 kambarių butas', 'Plotas:': '55 m²', 'Kambarių sk.:': '4', 'Metai:': '1940'}
{'Address': 'Vilnius, Žirmūnai, Kareivių g., 2 kambarių butas', 'Plotas:': '24,3 m²', 'Kambarių sk.:': '2', 'Metai:': '2020'}
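
Your desired output uses database-friendly names (Kambariu_sk rather than Kambarių sk.:). If you want that, you can rename the keys after scraping; a minimal sketch, where the COLUMNS mapping and the target names are just one possible choice:

# Hypothetical mapping from scraped labels to database column names
COLUMNS = {
    'Address': 'adresas',
    'Plotas:': 'plotas',
    'Kambarių sk.:': 'kambariu_sk',
    'Metai:': 'metai',
}

# Build new dicts with renamed keys; listings missing a field get None
rows = [{col: entry.get(label) for label, col in COLUMNS.items()} for entry in data]

This leaves data (and the CSV example below) untouched; rows is a separate list you can feed to the database step.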

You could write this to output.csv using:

import csv

with open('output.csv', 'w', encoding='utf-8', newline='') as f_output:
    csv_output = csv.DictWriter(f_output, fieldnames=data[0].keys())
    csv_output.writeheader()
    csv_output.writerows(data)

Giving output.csv starting:

Address,Plotas:,Kambarių sk.:,Metai:
"Vilnius, Markučiai, Pakraščio g., 2 kambarių butas","44,9 m²",2,2023
"Vilnius, Pašilaičiai, Budiniškių g., 2 kambarių butas","49,16 m²",2,2022
"Vilnius, Senamiestis, Liejyklos g., 4 kambarių butas",55 m²,4,1940