
I have a CSV that I read using pandas and looks like:

|   | URL               | Status Code |
|---|-------------------|-------------|
| 0 | www.example.com   | 404         |
| 1 | www.example.com/2 | 404         |

I want to check if the URLs on the second column are still responding with 404. I have this code:

url = df['URL']
urlData = requests.get(url).content
rawData = pd.read_csv(io.StringIO(urlData.decode('utf-8')))
print(rawData)

I get the following error:

InvalidSchema: No connection adapters were found for '0    http://www.example.com
1    http://www.example.com/2
Name: URL, dtype: object'

I searched several questions but could not find the answer. Any help is appreciated.

Comments:
  • That is failing at the requests call, correct? (Aug 7, 2017)
  • What happens when you do urlData = requests.get(url[0]).content?
  • I doubt you can call get on something that is either a Series or a DataFrame like that. At least, I think that's what the error message is telling you.

3 Answers


requests.get is not broadcastable, so you'll have to either call it for each URL with pandas.DataFrame.apply:

>>> df['New Status Code'] = df.URL.apply(lambda url: requests.get(url).status_code)
>>> df
   Status Code                URL  New Status Code
0          404    www.example.com              404
1          404  www.example.com/2              404

or use numpy.vectorize:

>>> import numpy
>>> vectorized_get = numpy.vectorize(lambda url: requests.get(url).status_code)
>>> df['New Status Code'] = vectorized_get(df.URL)
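A hedged variation on the apply approach: wrapping each request in try/except means one dead host won't abort the whole column (the helper name check_status and the timeout value are illustrative, not from the original answer):

```python
import requests

def check_status(url, timeout=10):
    """Return the HTTP status code, or None when the request fails entirely."""
    try:
        return requests.get(url, timeout=timeout).status_code
    except requests.RequestException:
        return None

# Usage (hits the network):
# df['New Status Code'] = df['URL'].apply(check_status)

# A malformed URL fails gracefully instead of raising InvalidSchema:
print(check_status("not-a-url"))  # None
```

Note that requests.exceptions.InvalidSchema is itself a subclass of RequestException, so the original error would be caught here too.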



df['URL'] is going to return you a Series of data, not a single value. I suspect your code is blowing up on the requests.get(url).content line.

Can you post more of the code?

You may want to look at the apply function: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html.
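To make the diagnosis concrete, here is a minimal sketch (sample data assumed from the question's table) showing that column selection yields a Series, which is why passing it straight to requests.get fails:

```python
import pandas as pd

df = pd.DataFrame({
    "URL": ["www.example.com", "www.example.com/2"],
    "Status Code": [404, 404],
})

# Selecting a column returns a Series, not a single string:
col = df["URL"]
print(type(col).__name__)  # Series

# requests.get expects one URL string at a time, e.g. col.iloc[0]:
print(col.iloc[0])  # www.example.com
```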



If you are in a Jupyter notebook you can easily use pandas-aiohttp (disclaimer: I just published this package):

import pandas as pd
import pandas_aiohttp

example_urls = pd.Series([
    "https://jsonplaceholder.typicode.com/posts/1",
    "https://jsonplaceholder.typicode.com/posts/2",
])

data = await example_urls.aiohttp.get_text()
0    {\n  "userId": 1,\n  "id": 1,\n  "title": "sun...
1    {\n  "userId": 1,\n  "id": 2,\n  "title": "qui...
dtype: object

Note: You can add assert pandas_aiohttp on the line after import pandas_aiohttp to prevent your IDE from highlighting the apparently "unused import". This package works by registering a custom accessor (i.e. monkey patching, which I feel is only OK because pandas documents it as a feature).
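For the curious, the accessor mechanism mentioned above is a documented pandas extension point. A minimal sketch (the accessor name "demo" and its method are made up for illustration, not part of pandas-aiohttp):

```python
import pandas as pd

@pd.api.extensions.register_series_accessor("demo")
class DemoAccessor:
    """Toy accessor: adds a .demo namespace to every Series."""
    def __init__(self, series):
        self._s = series

    def shout(self):
        # Upper-case every string element in the Series.
        return self._s.str.upper()

s = pd.Series(["a", "b"])
print(s.demo.shout().tolist())  # ['A', 'B']
```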

If you are not in a jupyter notebook then there is some extra work to start your own async event loop:

import pandas as pd
import pandas_aiohttp
import asyncio

example_urls = pd.Series([
    "https://jsonplaceholder.typicode.com/posts/1",
    "https://jsonplaceholder.typicode.com/posts/2",
])

async def main():
    data = await example_urls.aiohttp.get_text()
    print(data)

asyncio.run(main())

By default this will use 100 parallel connections, and should be waaaay faster than most other methods.

