
I am trying to write code to scrape data from http://goldpricez.com/gold/history/lkr/years-3. The code I have written is below. It works and gives me my intended results.

import pandas as pd

url = "http://goldpricez.com/gold/history/lkr/years-3"

df = pd.read_html(url)

print(df)

But the result includes some unwanted data, and I want only the data in the table. Can someone please help me with this?

Here I have added an image of the output, with the unwanted data circled in red.

  • You can always slice the DataFrame to get rid of the unwanted data. Alternatively, use the BeautifulSoup library to parse the HTML before using pandas. Commented Jun 14, 2020 at 5:13
  • read_html returns a list of DataFrames, one for each table in the HTML source; use a list index to access the required DataFrame (see the sketch after these comments): stackoverflow.com/questions/39710903/… Commented Jun 14, 2020 at 5:46
  • You were correct to use pd.read_html. Just select the correct index where the data is, [3]. See my answer below. Commented Jun 14, 2020 at 7:45
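
A minimal sketch of the indexing these comments describe, using the URL from the question; printing each table's shape is one way to spot which list index holds the price history (the exact index is not guaranteed to stay stable if the page changes):

import pandas as pd

url = "http://goldpricez.com/gold/history/lkr/years-3"

# read_html returns one DataFrame per <table> element on the page
tables = pd.read_html(url)

# Print each table's list index and dimensions to find the price-history table
for i, t in enumerate(tables):
    print(i, t.shape)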

3 Answers

import pandas as pd

url = "http://goldpricez.com/gold/history/lkr/years-3"

df = pd.read_html(url)  # this will give you a list of DataFrames from the HTML

print(df[3])

5 Comments

Thanks mate, it works fine. One small question: what does df[3] do?
Using urllib.request actually just performed the process twice, since .read_html already does that :) so there is no need for that step.
Explanation of why I downvoted: I rarely downvote answers, and I usually dislike downvotes that come without an explanation of where to improve, so here is mine. You added extra, unused code: from urllib.request import urlopen, Request; url = "http://goldpricez.com/gold/history/lkr/years-3"; req = Request(url=url); html = urlopen(req).read(). None of that is used; df[3] would work if all of it were deleted ;) that is why. Hope you understand :)
@ThejithaAnjana df[3] prints the fourth DataFrame from the list of DataFrames.
As of now, it's the df[1] element; that is what worked for me.
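
Since the table's position in the list can shift as the page changes (as the comment above shows), a more robust sketch is to let pandas filter tables by their text with the match parameter; the "Date" string below is only a guess at text appearing in the target table:

import pandas as pd

url = "http://goldpricez.com/gold/history/lkr/years-3"

# match keeps only tables whose text matches the given string/regex,
# so the result does not depend on a hard-coded list index.
# "Date" is an assumed header cell in the price-history table.
df = pd.read_html(url, match="Date")[0]

print(df)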

Use BeautifulSoup for this; the code below works perfectly.

import requests
from bs4 import BeautifulSoup

url = "http://goldpricez.com/gold/history/lkr/years-3"
r = requests.get(url)
s = BeautifulSoup(r.text, "html.parser")

# Collect every table cell, then skip the first 11 cells,
# which hold page furniture rather than price data
data = s.find_all("td")
data = data[11:]

# The remaining cells alternate date and price; print them as pairs
for i in range(0, len(data), 2):
    print(data[i].text.strip(), "      ", data[i + 1].text.strip())

Another advantage of using BeautifulSoup is that it is much faster than your code.
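
If you want the scraped pairs as a DataFrame rather than printed text, here is a minimal sketch building on the answer above, under the same assumption that the cells alternate date/price after the first 11 (the column names are illustrative guesses):

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "http://goldpricez.com/gold/history/lkr/years-3"
s = BeautifulSoup(requests.get(url).text, "html.parser")

# Same slicing as the answer: drop the first 11 non-data cells
cells = [td.text.strip() for td in s.find_all("td")][11:]

# Pair alternating cells into rows of (date, price)
rows = list(zip(cells[0::2], cells[1::2]))
df = pd.DataFrame(rows, columns=["Date", "Price (LKR)"])
print(df.head())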

1 Comment

.read_html uses bs4 under the hood ;) From the docs: "flavor : str or None, container of strings. The parsing engine to use. 'bs4' and 'html5lib' are synonymous with each other; they are both there for backwards compatibility. The default of None tries to use lxml to parse, and if that fails it falls back on bs4 + html5lib."

The way you used .read_html returns a list of all tables. Your table is at index 3:

import pandas as pd

url = "http://goldpricez.com/gold/history/lkr/years-3"

df = pd.read_html(url)[3]

print(df)

.read_html makes a call to the URL and uses BeautifulSoup to parse the response under the hood. You can change the parser, match a table by its text, and pass header as you would in .read_csv. Check the .read_html documentation for more details.
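
As a minimal sketch of passing those options together (the match string and header row here are assumptions about this particular page, not documented facts about it):

import pandas as pd

url = "http://goldpricez.com/gold/history/lkr/years-3"

# flavor selects the parsing engine, match filters tables by their text,
# and header says which row holds the column names, as in .read_csv.
# match="Date" assumes the target table contains that string.
df = pd.read_html(url, flavor="lxml", match="Date", header=0)[0]

print(df)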

For speed, you can use lxml, e.g. pd.read_html(url, flavor='lxml')[3]. With the default flavor of None, pandas tries lxml first and falls back on bs4 + html5lib, which are slower.
