Creating pandas dataframe from API call

Question

I'm building an API to retrieve Census data, but I'm having trouble formatting the output. My question is really one of two:

1) How can I improve my API call so that the output is prettier (ideally a dataframe)?

or

2) How can I manipulate the list that I currently get so that it is in a pandas dataframe?

Here is what I have so far:

import requests
import pandas as pd
import numpy as np

mytoken = "numbersandletters" 
# this is my API key, so unfortunately I can't provide it

def state_data(token, variables, year = 2010, state = "*", survey = "sf1"):
    # make sure the input for state (integers) are strings
    state = [str(i) for i in state]
    variables = ",".join(variables) # squish all the variables into one string
    year = str(year)
    # make a list of all the components to construct a URL
    combine = ["http://api.census.gov/data/", year, "/", survey, "?key=", token, "&get=", variables, "&for=state:"]
    incomplete_url = "".join(combine) # the URL without the state tacked on to the end
    # now the state is tacked on to the end; one URL per state or for "*"
    complete_url = [incomplete_url + i for i in state]
    # make an API call to each complete_url
    r = [requests.get(i) for i in complete_url]
    data = [i.json() for i in r]
    print(r)
    print(data)
    print(type(data))
    df = pd.DataFrame(data)
    print(df)

An example of calling the function is this, with the output below.

state_data(token = mytoken, state = [47, 48, 49, 50], variables = ["P0010001", "P0010001"])

resulting in:

[<Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>]


[[[u'P0010001', u'P0010001', u'state'], [u'6346105', u'6346105', u'47']], 
[[u'P0010001', u'P0010001', u'state'], [u'25145561', u'25145561', u'48']], 
[[u'P0010001', u'P0010001', u'state'], [u'2763885', u'2763885', u'49']], 
[[u'P0010001', u'P0010001', u'state'], [u'625741', u'625741', u'50']]]

<type 'list'>
                         0                         1
0  [P0010001, P0010001, state]    [6346105, 6346105, 47]
1  [P0010001, P0010001, state]  [25145561, 25145561, 48]
2  [P0010001, P0010001, state]    [2763885, 2763885, 49]
3  [P0010001, P0010001, state]      [625741, 625741, 50]

Whereas the desired outcome would be:

  P0010001  P0010001  state
0 6346105   6346105   47
1 25145561  25145561  48
2 2763885   2763885   49
3 625741    625741    50

Fwiw, the analogous code in R is below. I'm translating a library I've written in R to Python:

state.data = function(token, state = "*", variables, year = 2010, survey = "sf1"){
  state = as.character(state)
  variables = paste(variables, collapse = ",")
  year = as.character(year)
  my.url = matrix(paste("http://api.census.gov/data/", year, "/", survey, "?key=", token,
                    "&get=",variables, "&for=state:", state, sep = ""), ncol = 1)

  process.url = apply(my.url, 1, function(x)   process.api.data(fromJSON(file=url(x))))
  rbind.dat = data.frame(rbindlist(process.url))
  rbind.dat = rbind.dat[, c(tail(seq_len(ncol(rbind.dat)), 1), seq_len(ncol(rbind.dat) - 1))] 
  rbind.dat
}

Can you give an example of what the original data looks like (what you retrieve from the complete_url)? If it is JSON, maybe you can use pd.read_json?
– joris, Commented Feb 3, 2015 at 21:12

acushner · Accepted Answer · 2015-02-05 16:51:18Z

2

so you have duplicate fields, which is nonsensical, and your result will only show one of the duplicated fields.

however, all you need to do is pass a list/iterable of dict objects to the pd.DataFrame constructor, and you'll have your results:

vals = [[[...]]]  # the data you provided in your example
df = pd.DataFrame(dict(zip(*v)) for v in vals)
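
As a rough illustration of the dict approach, here is a minimal sketch that uses the four parsed responses printed in the question as vals. The name records is introduced only for this example. Because dict keys are unique, the repeated P0010001 field collapses into a single column:

```python
import pandas as pd

# Parsed responses copied from the question's output:
# one [header, row] pair per state.
vals = [
    [["P0010001", "P0010001", "state"], ["6346105", "6346105", "47"]],
    [["P0010001", "P0010001", "state"], ["25145561", "25145561", "48"]],
    [["P0010001", "P0010001", "state"], ["2763885", "2763885", "49"]],
    [["P0010001", "P0010001", "state"], ["625741", "625741", "50"]],
]

# zip(*v) pairs each header name with the value below it;
# dict() keeps one entry per distinct name.
records = [dict(zip(*v)) for v in vals]
df = pd.DataFrame(records)
print(df)
```

The list comprehension stands in for the generator expression above; both feed the constructor the same dicts.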

assuming this is your data:

data = [["P0010001","PCO0020019","state"], ["4779736","1204","01"], ["710231","53","02"], ["6392017","799","04"], ["2915918","924","05"], ["37253956","6244","06"], ["5029196","955","08"], ["3574097","1266","09"], ["897934","266","10"], ["601723","170","11"], ["18801310","4372","12"], ["9687653","1629","13"], ["1360301","251","15"], ["1567582","320","16"], ["12830632","3713","17"]]

then this works:

df = pd.DataFrame(data[1:], columns=data[0])

so you'll need to figure out how to get the data into that form. all i'm doing is passing a list of lists (data[1:]) and a list (data[0])
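
One possible way to get the per-state responses from the question into that form is sketched below. The helper name responses_to_frame is hypothetical, and the sketch assumes every response is a table whose first row is the same header:

```python
import pandas as pd

def responses_to_frame(responses):
    """Hypothetical helper: stack several [header, row, ...] tables into one DataFrame."""
    header = responses[0][0]  # assumes all responses share this header
    rows = [row for table in responses for row in table[1:]]
    return pd.DataFrame(rows, columns=header)

# Parsed responses copied from the question's output.
responses = [
    [["P0010001", "P0010001", "state"], ["6346105", "6346105", "47"]],
    [["P0010001", "P0010001", "state"], ["25145561", "25145561", "48"]],
    [["P0010001", "P0010001", "state"], ["2763885", "2763885", "49"]],
    [["P0010001", "P0010001", "state"], ["625741", "625741", "50"]],
]

df = responses_to_frame(responses)
print(df)
```

Unlike the dict approach, this keeps the duplicated column, since columns= accepts repeated names. The values are still strings, as returned by the API; pd.to_numeric is one way to convert them afterwards.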

edited Feb 5, 2015 at 16:51

answered Feb 3, 2015 at 22:02

acushner


4 Comments

acushner Over a year ago

yes, different problem, and simple: df = pd.DataFrame(data[1:], columns=data[0])

Nancy Over a year ago

Okay, I'll try that. I added my R code to the original question to show why I'm doing things the way I am. It might help contextualize.

Nancy Over a year ago

Hm. I'm getting this error for all the examples I tried: ValueError: Shape of passed values is (0, 0), indices imply (3, 0). Should I be iterating over the length of the original data set?

acushner Over a year ago

i edited the answer. you'll have to inspect the data to see why it's not correct.
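
A minimal sketch of that kind of inspection, assuming Python 3 and that the expected form is a flat list whose first row is the header. The function name looks_like_table is hypothetical:

```python
def looks_like_table(data):
    """Hypothetical check: True if data is a header row followed by equal-length rows."""
    if not isinstance(data, list) or len(data) < 2:
        return False
    if not all(isinstance(row, list) for row in data):
        return False
    header = data[0]
    # The header must hold column names, not nested lists.
    if not all(isinstance(name, str) for name in header):
        return False
    return all(len(row) == len(header) for row in data[1:])

# Flat table, in the form used in the answer.
flat = [["P0010001", "PCO0020019", "state"], ["4779736", "1204", "01"]]

# Nested list, in the form the question's function produces (one table per state).
nested = [
    [["P0010001", "P0010001", "state"], ["6346105", "6346105", "47"]],
    [["P0010001", "P0010001", "state"], ["25145561", "25145561", "48"]],
]

print(looks_like_table(flat))
print(looks_like_table(nested))
```

If the check fails, printing data[0] and len(data) is usually enough to see whether the responses are nested one level deeper than data[1:] and data[0] expect.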
