
I want to automate data fetching from the USDA website, where I am specifically interested in a few categories for data selection. To do so, I tried the following:

import io
import requests
import pandas as pd

url = 'https://www.marketnews.usda.gov/mnp/ls-report-retail?&repType=summary&portal=ls&category=Retail&species=BEEF&startIndex=1'

query_list = {"Report Type":"item","species":"BEEF","portal":"ls","category":"Retail", "Regions":"National", "Grades":"ALL", "Cut": "All", "Dates_from":"2019-03-01", "Dates_to":"2021-02-01"}
req = requests.get(url, params=query_list)
df = pd.read_csv(io.StringIO(req.text), sep="\s\s+", engine="python")
df.to_csv("usda_report.csv")

but I couldn't get the dataframe that I expected. Here is the output after running the above attempt:

ParserError: Expected 1 fields in line 117, saw 2. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.

desired output

I need to pass these queries to do correct data selection: Category = "Retail"; Report Type = "Item"; Species = "Beef"; Region(s) = "National"; Dates_from = "2019-03-01"; Dates_to = "2021-02-15".

Ideally, I want to pass those queries and get the following dataframe (head of dataframe):

[screenshot: head of the desired dataframe]

update

In my desired output, I need these columns: Date, Region, Grade, Cut, Retail Items, Outlets (number of stores), Weighted Avg.

With the above attempt, I couldn't get an output dataframe like this. How should I fetch the data correctly? Can anyone suggest a way of doing this right in pandas? Any idea?

  • As an aside, an alternative method is described here with BeautifulSoup scraping and a simple for-loop. Commented Feb 23, 2021 at 3:03
  • You can only use pandas.read_csv if the input is a csv. You can't parse arbitrary html pages like this. Commented Feb 23, 2021 at 3:04
  • @Yehuda how can we do this with BeautifulSoup scraping instead? Do you have a possible approach to get my expected output? Any thoughts? Commented Feb 23, 2021 at 3:47
  • @HåkenLid query_list is not working at all for data selection, and I don't know why. How can we get the desired output as I showed above? Any idea? Commented Feb 23, 2021 at 4:04
  • @Hamilton Check the link I provided. The first half of the code from the question (through table_rows = table.find_all('tr')) should get you the data; then add the code from the linked answer to finish it off. Commented Feb 23, 2021 at 19:21
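To illustrate the BeautifulSoup route mentioned in these comments, here is a minimal sketch, assuming the report page at the question's URL renders its results as a plain HTML <table>; the table lookup and header handling below are illustrative guesses, not a verified description of the page:

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = ('https://www.marketnews.usda.gov/mnp/ls-report-retail'
       '?&repType=summary&portal=ls&category=Retail&species=BEEF&startIndex=1')

resp = requests.get(url)
soup = BeautifulSoup(resp.text, 'html.parser')

# Assumption: the first <table> on the page holds the report rows.
table = soup.find('table')
rows = []
for tr in table.find_all('tr'):
    cells = [cell.get_text(strip=True) for cell in tr.find_all(['th', 'td'])]
    if cells:
        rows.append(cells)

# Treat the first row as the header and the rest as data.
df = pd.DataFrame(rows[1:], columns=rows[0])
print(df.head())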

1 Answer


You must add the query parameter format=text to get the data in csv format from this web site.

import io
import requests
import pandas as pd

url = 'https://www.marketnews.usda.gov/mnp/ls-report-retail'

# format=text makes the site return the report as plain text instead of an HTML page.
query_list = {
    "format": "text",
    "repType": "item",
    "species": "BEEF",
    "portal": "ls",
    "region": "NATIONAL",
    "cut": "0",
    "repDate": "03/01/2019",
    "endDate": "02/01/2021",
}
req = requests.get(url, params=query_list)
# The text report separates columns with runs of whitespace, hence the multi-space separator.
df = pd.read_csv(io.StringIO(req.text), sep=r"\s\s+", engine="python")

You might have to modify the query parameters further. You can use the web site in your browser and change the filters to what you want. Then you can convert the current query parameters in the URL to JSON with this command in the browser's JavaScript console.

JSON.stringify(Object.fromEntries(new URLSearchParams(location.search)), null, 2)
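
The JSON that command prints can then be pasted back into Python and passed straight to requests as the params dict. A small usage sketch, using the example parameters from above as placeholders:

import io
import json
import requests
import pandas as pd

# Paste the JSON copied from the browser console between the triple quotes.
params_json = """
{
  "format": "text",
  "repType": "item",
  "species": "BEEF",
  "portal": "ls",
  "region": "NATIONAL",
  "cut": "0",
  "repDate": "03/01/2019",
  "endDate": "02/01/2021"
}
"""

query_list = json.loads(params_json)
req = requests.get('https://www.marketnews.usda.gov/mnp/ls-report-retail', params=query_list)
df = pd.read_csv(io.StringIO(req.text), sep=r"\s\s+", engine="python")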

8 Comments

Thanks for your input. If I use the above attempt, I can't get the desired output like I showed above. Any possible updates? Is it doable to get my desired output dataframe? Thanks!
You still get the same error? Did you also change the last line? In your question you are using req in one line and r in the next.
Yep, now the error is gone, but I'm not getting the desired output and some of the queries are not working for data selection. Any possible updates to get the desired output? Thanks
Seems like some of your query parameters were invalid. I can change the ones I'm able to guess.
I've changed some of the incorrect parameters in query_list. There might be others. You should use the web site and change the filters to what you want. Then press submit and check what the current query parameters are. You can open the javascript console and do this: JSON.stringify(Object.fromEntries(new URLSearchParams(location.search)), null,2)
