
I want to automate data fetching from the USDA website, where I am specifically interested in a few categories for data selection. To do so, I tried the following:

import io
import requests
import pandas as pd

url = 'https://www.marketnews.usda.gov/mnp/ls-report-retail?&repType=summary&portal=ls&category=Retail&species=BEEF&startIndex=1'

query_list = {"Report Type":"item","species":"BEEF","portal":"ls","category":"Retail", "Regions":"National", "Grades":"ALL", "Cut": "All", "Dates_from":"2019-03-01", "Dates_to":"2021-02-01"}
req = requests.get(url, params=query_list)
df = pd.read_csv(io.StringIO(req.text), sep="\s\s+", engine="python")
df.to_csv("usda_report.csv")

but I couldn't get the dataframe that I expected. Here is the output after running the above attempt:

ParserError: Expected 1 fields in line 117, saw 2. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.

desired output

I need to pass these queries to do correct data selection: Category = "Retail"; Report Type = "Item"; Species = "Beef"; Region(s) = "National"; Dates_from = "2019-03-01"; Dates_to = "2021-02-15".

Ideally, I want to pass those queries and get the following dataframe (head of dataframe):

[screenshot: head of the desired dataframe]

update

In my desired output, I need these columns: Date, Region, Grade, Cut, Retail Items, Outlets (number of stores), Weighted Avg.

With the above attempt, I couldn't get an output dataframe like this. How should I fetch the data correctly? Can anyone suggest a way of doing this right in pandas? Any idea?

  • As an aside, an alternative method is described here with BeautifulSoup scraping and a simple for-loop. Commented Feb 23, 2021 at 3:03
  • You can only use pandas.read_csv if the input is a csv. You can't parse arbitrary html pages like this. Commented Feb 23, 2021 at 3:04
  • @Yehuda how can we do this with BeautifulSoup scraping instead? Do you have a possible approach to get my expected output? Any thoughts? Commented Feb 23, 2021 at 3:47
  • @HåkenLid query_list is not working at all for data selection, and I don't know why. How can we get the desired output as I showed above? Any idea? Commented Feb 23, 2021 at 4:04
  • @Hamilton Check the link I provided. The first half of the code from the question (through table_rows = table.find_all('tr')) should get you the data; then add the code from the linked answer to finish it off. Commented Feb 23, 2021 at 19:21
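To illustrate the BeautifulSoup route mentioned in these comments, here is a minimal sketch, assuming the report page at the question's URL renders its results as a plain HTML <table>; the table lookup and header handling below are illustrative guesses, not a verified description of the page:

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = ('https://www.marketnews.usda.gov/mnp/ls-report-retail'
       '?&repType=summary&portal=ls&category=Retail&species=BEEF&startIndex=1')

resp = requests.get(url)
soup = BeautifulSoup(resp.text, 'html.parser')

# Assumption: the first <table> on the page holds the report rows.
table = soup.find('table')
rows = []
for tr in table.find_all('tr'):
    cells = [cell.get_text(strip=True) for cell in tr.find_all(['th', 'td'])]
    if cells:
        rows.append(cells)

# Treat the first row as the header and the rest as data.
df = pd.DataFrame(rows[1:], columns=rows[0])
print(df.head())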

1 Answer


You must add the query parameter format=text to get the data in csv format from this web site.

import io
import requests
import pandas as pd

url = 'https://www.marketnews.usda.gov/mnp/ls-report-retail'

# format=text makes the site return the report as plain text instead of an HTML page.
query_list = {
    "format": "text",
    "repType": "item",
    "species": "BEEF",
    "portal": "ls",
    "region": "NATIONAL",
    "cut": "0",
    "repDate": "03/01/2019",
    "endDate": "02/01/2021",
}
req = requests.get(url, params=query_list)
# The text report separates columns with runs of whitespace, hence the multi-space separator.
df = pd.read_csv(io.StringIO(req.text), sep=r"\s\s+", engine="python")

You might have to modify the query parameters further. You can use the web site in your browser and change the filters to what you want. Then you can convert the current query parameters in the URL to JSON with this command in the browser's JavaScript console.

JSON.stringify(Object.fromEntries(new URLSearchParams(location.search)), null, 2)
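
The JSON that command prints can then be pasted back into Python and passed straight to requests as the params dict. A small usage sketch, using the example parameters from above as placeholders:

import io
import json
import requests
import pandas as pd

# Paste the JSON copied from the browser console between the triple quotes.
params_json = """
{
  "format": "text",
  "repType": "item",
  "species": "BEEF",
  "portal": "ls",
  "region": "NATIONAL",
  "cut": "0",
  "repDate": "03/01/2019",
  "endDate": "02/01/2021"
}
"""

query_list = json.loads(params_json)
req = requests.get('https://www.marketnews.usda.gov/mnp/ls-report-retail', params=query_list)
df = pd.read_csv(io.StringIO(req.text), sep=r"\s\s+", engine="python")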

8 Comments

Thanks for your input. If I use the above attempt, I can't get the desired output like I showed above. Any possible updates? Is it doable to get my desired output dataframe? Thanks!
You still get the same error? Did you also change the last line? In your question you are using req in one line and r in the next.
Yep, now the error is gone, but I'm not getting the desired output and some of the queries are not working for data selection. Any possible updates to get the desired output? Thanks
Seems like some of your query parameters were invalid. I can change the ones I'm able to guess.
I've changed some of the incorrect parameters in query_list. There might be others. You should use the web site and change the filters to what you want. Then press submit and check what the current query parameters are. You can open the javascript console and do this: JSON.stringify(Object.fromEntries(new URLSearchParams(location.search)), null,2)
