
I am trying to write code to scrape data from http://goldpricez.com/gold/history/lkr/years-3. The code I have written is below. It works and gives me my intended results.

import pandas as pd

url = "http://goldpricez.com/gold/history/lkr/years-3"

df = pd.read_html(url)

print(df)

But the result includes some unwanted data, and I want only the data in the table. Can someone please help me with this?

Here I have added an image of the output, with the unwanted data circled in red.

  • You can always slice the DataFrame to get rid of the unwanted data. Alternatively, use the BeautifulSoup library to parse the HTML before using pandas. Commented Jun 14, 2020 at 5:13
  • read_html returns a list of DataFrames, one for each table in the HTML source; use a list index to access the required DataFrame (see the sketch after these comments): stackoverflow.com/questions/39710903/… Commented Jun 14, 2020 at 5:46
  • You were correct to use pd.read_html. Just select the correct index where the data is, [3]. See my answer below. Commented Jun 14, 2020 at 7:45
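
A minimal sketch of the indexing these comments describe, using the URL from the question; printing each table's shape is one way to spot which list index holds the price history (the exact index is not guaranteed to stay stable if the page changes):

import pandas as pd

url = "http://goldpricez.com/gold/history/lkr/years-3"

# read_html returns one DataFrame per <table> element on the page
tables = pd.read_html(url)

# Print each table's list index and dimensions to find the price-history table
for i, t in enumerate(tables):
    print(i, t.shape)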

3 Answers

import pandas as pd

url = "http://goldpricez.com/gold/history/lkr/years-3"

df = pd.read_html(url)  # this will give you a list of DataFrames from the HTML

print(df[3])

5 Comments

Thanks mate, it works fine. One small question: what does df[3] do?
Using urllib.request actually just performed the process twice, since .read_html already does that :) so there is no need for that step.
Explanation of why I downvoted: I rarely downvote answers, and I usually dislike downvotes that come without an explanation of where to improve, so here is mine. You added extra, unused code: from urllib.request import urlopen, Request; url = "http://goldpricez.com/gold/history/lkr/years-3"; req = Request(url=url); html = urlopen(req).read(). None of that is used; df[3] would work if all of it were deleted ;) that is why. Hope you understand :)
@ThejithaAnjana df[3] prints the fourth DataFrame from the list of DataFrames.
As of now, it's the df[1] element; that is what worked for me.
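
Since the table's position in the list can shift as the page changes (as the comment above shows), a more robust sketch is to let pandas filter tables by their text with the match parameter; the "Date" string below is only a guess at text appearing in the target table:

import pandas as pd

url = "http://goldpricez.com/gold/history/lkr/years-3"

# match keeps only tables whose text matches the given string/regex,
# so the result does not depend on a hard-coded list index.
# "Date" is an assumed header cell in the price-history table.
df = pd.read_html(url, match="Date")[0]

print(df)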

Use BeautifulSoup for this; the code below works perfectly.

import requests
from bs4 import BeautifulSoup

url = "http://goldpricez.com/gold/history/lkr/years-3"
r = requests.get(url)
s = BeautifulSoup(r.text, "html.parser")

# Collect every table cell, then skip the first 11 cells,
# which hold page furniture rather than price data
data = s.find_all("td")
data = data[11:]

# The remaining cells alternate date and price; print them as pairs
for i in range(0, len(data), 2):
    print(data[i].text.strip(), "      ", data[i + 1].text.strip())

Another advantage of using BeautifulSoup is that it is much faster than your code.
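
If you want the scraped pairs as a DataFrame rather than printed text, here is a minimal sketch building on the answer above, under the same assumption that the cells alternate date/price after the first 11 (the column names are illustrative guesses):

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "http://goldpricez.com/gold/history/lkr/years-3"
s = BeautifulSoup(requests.get(url).text, "html.parser")

# Same slicing as the answer: drop the first 11 non-data cells
cells = [td.text.strip() for td in s.find_all("td")][11:]

# Pair alternating cells into rows of (date, price)
rows = list(zip(cells[0::2], cells[1::2]))
df = pd.DataFrame(rows, columns=["Date", "Price (LKR)"])
print(df.head())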

1 Comment

.read_html uses bs4 under the hood ;) From the docs: "flavor : str or None, container of strings. The parsing engine to use. 'bs4' and 'html5lib' are synonymous with each other; they are both there for backwards compatibility. The default of None tries to use lxml to parse, and if that fails it falls back on bs4 + html5lib."

The way you used .read_html returns a list of all tables. Your table is at index 3:

import pandas as pd

url = "http://goldpricez.com/gold/history/lkr/years-3"

df = pd.read_html(url)[3]

print(df)

.read_html makes a call to the URL and uses BeautifulSoup to parse the response under the hood. You can change the parser, match a table by its text, and pass header as you would in .read_csv. Check the .read_html documentation for more details.
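
As a minimal sketch of passing those options together (the match string and header row here are assumptions about this particular page, not documented facts about it):

import pandas as pd

url = "http://goldpricez.com/gold/history/lkr/years-3"

# flavor selects the parsing engine, match filters tables by their text,
# and header says which row holds the column names, as in .read_csv.
# match="Date" assumes the target table contains that string.
df = pd.read_html(url, flavor="lxml", match="Date", header=0)[0]

print(df)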

For speed, you can use lxml, e.g. pd.read_html(url, flavor='lxml')[3]. With the default flavor of None, pandas tries lxml first and falls back on bs4 + html5lib, which are slower.
