I ran into an issue preparing data for a database, since I am doing this for the very first time.
I scraped text from the HTML dt and dd tags, so I am getting a lot of information, some of which I need and some of which I don't.
My output looks like this:
{'Plotas:': '49,16 m²', 'Kambarių sk.:': '2', 'Aukštas:': '2', 'Aukštų sk.:': '7', 'Metai:': '2022', 'Pastato tipas:': 'Mūrinis', 'Šildymas:': 'Centrinis kolektorinis', 'Įrengimas:': 'Dalinė apdaila NAUDINGA:\nInterjero dizaineriai', 'Pastato energijos suvartojimo klasė:': 'A+', 'Reklama/pasiūlymas:': 'Pasirinkite geriausią internetą namams', 'Ypatybės:': 'Nauja kanalizacija\nNauja elektros instaliacija', 'Papildomos patalpos:': 'Sandėliukas\nVieta automobiliui', 'Apsauga:': 'Šarvuotos durys\nKodinė laiptinės spyna\nVaizdo kameros'}
My code looks like this:
import pandas as pd
from selenium import webdriver
from bs4 import BeautifulSoup
import re
import time
import csv

# Raw string so the backslashes in the Windows path are not treated as escapes
PATH = r'C:\Program Files (x86)\chromedriver.exe'
driver = webdriver.Chrome(PATH)


def get_dl(soup):
    # Collect every <dt> label and <dd> value from the details lists
    keys, values = [], []
    for dl in soup.find_all("dl", {"class": "obj-details"}):
        for dt in dl.find_all("dt"):
            keys.append(dt.text.strip())
        for dd in dl.find_all("dd"):
            values.append(dd.text.strip())
    return dict(zip(keys, values))


for puslapis in range(2, 3):
    driver.get(f'https://www.aruodas.lt/butai/vilniuje/puslapis/{puslapis}')
    response = driver.page_source
    soup = BeautifulSoup(response, 'html.parser')
    blocks = soup.find_all('tr', class_='list-row')

    stored_urls = []
    for url in blocks:
        try:
            stored_urls.append(url.a['href'])
        except (TypeError, KeyError):
            # Row has no <a> tag or the link has no href; skip it
            pass

    for link in stored_urls:
        driver.get(link)
        response = driver.page_source
        soup = BeautifulSoup(response, 'html.parser')
        try:
            # The address still needs to be cleaned up with a regex
            adress = soup.find('h1', 'obj-header-text').text.strip()
            # print(adress)
        except AttributeError:
            # Header not found on the page
            adress = 'n/a'

        dl_dict = get_dl(soup)
Question: How can I filter the data and keep only the fields I need? For example, my desired output should look like this:
Plotas: 49,16 m²
Kambariu_sk: 2
Metai: 2022
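One possible way to get this shape, as a minimal sketch: keep a mapping from the scraped labels to the names wanted in the output, and build a new dict from it. The `WANTED` mapping and the `filter_details` helper are hypothetical names introduced here for illustration, and the sample `dl_dict` is a shortened copy of the output shown above.

```python
# Hypothetical mapping: scraped label -> name to use in the filtered output.
WANTED = {
    'Plotas:': 'Plotas',
    'Kambarių sk.:': 'Kambariu_sk',
    'Metai:': 'Metai',
}


def filter_details(details, wanted=WANTED):
    """Keep only the wanted labels, renamed; labels missing on a page become None."""
    return {new: details.get(old) for old, new in wanted.items()}


# Shortened sample of the scraped dict from the question.
dl_dict = {
    'Plotas:': '49,16 m²',
    'Kambarių sk.:': '2',
    'Aukštas:': '2',
    'Metai:': '2022',
    'Reklama/pasiūlymas:': 'Pasirinkite geriausią internetą namams',
}

row = filter_details(dl_dict)
for key, value in row.items():
    print(f'{key}: {value}')
```

Using `dict.get` means a listing that lacks one of the fields still produces a row with the same keys, which tends to be easier to load into a table later.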
How should I structure this data to make it easier to transfer into a database?
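As a rough sketch of one option, each listing could be turned into a flat record with typed values and inserted with the standard-library `sqlite3` module. The table name `listings`, its column names, and the `to_record` helper are assumptions made for this example, not anything taken from the site or from an existing schema. The in-memory database is used only so the snippet runs on its own.

```python
import sqlite3


def to_record(row):
    """Convert the scraped strings into typed values for one database row."""
    # '49,16 m²' -> 49.16 (drop the unit, use a dot as the decimal separator)
    area_text = row['Plotas'].replace('m²', '').replace(',', '.').strip()
    return {
        'plotas_m2': float(area_text),
        'kambariu_sk': int(row['Kambariu_sk']),
        'metai': int(row['Metai']),
    }


# Sample filtered row in the desired output shape.
row = {'Plotas': '49,16 m²', 'Kambariu_sk': '2', 'Metai': '2022'}
record = to_record(row)

conn = sqlite3.connect(':memory:')  # in-memory database, for illustration only
conn.execute(
    'CREATE TABLE listings (plotas_m2 REAL, kambariu_sk INTEGER, metai INTEGER)'
)
# Named placeholders are filled from the dict keys.
conn.execute(
    'INSERT INTO listings VALUES (:plotas_m2, :kambariu_sk, :metai)',
    record,
)
conn.commit()

stored = conn.execute('SELECT * FROM listings').fetchall()
print(stored)
```

This sketch assumes every field is present and well formed; a listing with a missing value would need a check before the `float` and `int` conversions.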