
I have a CSV that I read using pandas and looks like:

|   | URL               | Status Code |
|---|-------------------|-------------|
| 0 | www.example.com   | 404         |
| 1 | www.example.com/2 | 404         |

I want to check if the URLs on the second column are still responding with 404. I have this code:

url = df['URL']
urlData = requests.get(url).content
rawData = pd.read_csv(io.StringIO(urlData.decode('utf-8')))
print(rawData)

I get the following error:

InvalidSchema: No connection adapters were found for '0    http://www.example.com
1    http://www.example.com/2
Name: URL, dtype: object'

I searched several questions but could not find the answer. Any help is appreciated.

Comments:
  • That is failing at the requests call, correct? (Aug 7, 2017)
  • What happens when you do urlData = requests.get(url[0]).content?
  • I doubt you can call get on something that is either a Series or a DataFrame like that. At least, I think that's what the error message is telling you.

3 Answers


requests.get is not broadcastable, so you'll have to either call it for each URL with pandas.DataFrame.apply:

>>> df['New Status Code'] = df.URL.apply(lambda url: requests.get(url).status_code)
>>> df
   Status Code                URL  New Status Code
0          404    www.example.com              404
1          404  www.example.com/2              404

or use numpy.vectorize:

>>> import numpy
>>> vectorized_get = numpy.vectorize(lambda url: requests.get(url).status_code)
>>> df['New Status Code'] = vectorized_get(df.URL)
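A hedged variation on the apply approach: wrapping each request in try/except means one dead host won't abort the whole column (the helper name check_status and the timeout value are illustrative, not from the original answer):

```python
import requests

def check_status(url, timeout=10):
    """Return the HTTP status code, or None when the request fails entirely."""
    try:
        return requests.get(url, timeout=timeout).status_code
    except requests.RequestException:
        return None

# Usage (hits the network):
# df['New Status Code'] = df['URL'].apply(check_status)

# A malformed URL fails gracefully instead of raising InvalidSchema:
print(check_status("not-a-url"))  # None
```

Note that requests.exceptions.InvalidSchema is itself a subclass of RequestException, so the original error would be caught here too.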



df['URL'] is going to return you a Series of data, not a single value. I suspect your code is blowing up on the requests.get(url).content line.

Can you post more of the code?

You may want to look at the apply function: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html.
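To make the diagnosis concrete, here is a minimal sketch (sample data assumed from the question's table) showing that column selection yields a Series, which is why passing it straight to requests.get fails:

```python
import pandas as pd

df = pd.DataFrame({
    "URL": ["www.example.com", "www.example.com/2"],
    "Status Code": [404, 404],
})

# Selecting a column returns a Series, not a single string:
col = df["URL"]
print(type(col).__name__)  # Series

# requests.get expects one URL string at a time, e.g. col.iloc[0]:
print(col.iloc[0])  # www.example.com
```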



If you are in a Jupyter notebook you can easily use pandas-aiohttp (disclaimer: I just published this package):

import pandas as pd
import pandas_aiohttp

example_urls = pd.Series([
    "https://jsonplaceholder.typicode.com/posts/1",
    "https://jsonplaceholder.typicode.com/posts/2",
])

data = await example_urls.aiohttp.get_text()
0    {\n  "userId": 1,\n  "id": 1,\n  "title": "sun...
1    {\n  "userId": 1,\n  "id": 2,\n  "title": "qui...
dtype: object

Note: You can add assert pandas_aiohttp on the line after import pandas_aiohttp to prevent your IDE from highlighting the apparently "unused import". This package works by registering a custom accessor (i.e. monkey patching, which I feel is only OK because pandas documents it as a feature).
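For the curious, the accessor mechanism mentioned above is a documented pandas extension point. A minimal sketch (the accessor name "demo" and its method are made up for illustration, not part of pandas-aiohttp):

```python
import pandas as pd

@pd.api.extensions.register_series_accessor("demo")
class DemoAccessor:
    """Toy accessor: adds a .demo namespace to every Series."""
    def __init__(self, series):
        self._s = series

    def shout(self):
        # Upper-case every string element in the Series.
        return self._s.str.upper()

s = pd.Series(["a", "b"])
print(s.demo.shout().tolist())  # ['A', 'B']
```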

If you are not in a jupyter notebook then there is some extra work to start your own async event loop:

import pandas as pd
import pandas_aiohttp
import asyncio

example_urls = pd.Series([
    "https://jsonplaceholder.typicode.com/posts/1",
    "https://jsonplaceholder.typicode.com/posts/2",
])

async def main():
    data = await example_urls.aiohttp.get_text()
    print(data)

asyncio.run(main())

By default this will use 100 parallel connections, and should be waaaay faster than most other methods.

