I'm building an API to retrieve Census data, but I'm having trouble formatting the output. My question is really one of two:

1) How can I improve my API call so that the output is prettier (ideally a dataframe)?

or

2) How can I manipulate the list that I currently get so that it is in a pandas dataframe?

Here is what I have so far:

import requests
import pandas as pd
import numpy as np

mytoken = "numbersandletters"
# this is my API key, so unfortunately I can't provide it

def state_data(token, variables, year = 2010, state = "*", survey = "sf1"):
    # make sure the input for state (integers) are strings
    state = [str(i) for i in state]
    variables = ",".join(variables) # squish all the variables into one string
    year = str(year)
    # make a list of all the components to construct a URL
    combine = ["http://api.census.gov/data/", year, "/", survey, "?key=", token, "&get=", variables, "&for=state:"]
    incomplete_url = "".join(combine) # the URL without the state tacked on to the end
    complete_url = map(lambda i: incomplete_url + i, state) # now the state is tacked on to the end; one URL per state or for "*"
    # make an API call to each complete_url
    r = map(lambda i: requests.get(i), complete_url)
    data = map(lambda i: i.json(), r)
    print r
    print data
    print type(data)
    df = pd.DataFrame(data)
    print df

An example of calling the function is this, with the output below.

state_data(token = mytoken, state = [47, 48, 49, 50], variables = ["P0010001", "P0010001"])

resulting in:

[<Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>]


[[[u'P0010001', u'P0010001', u'state'], [u'6346105', u'6346105', u'47']], 
[[u'P0010001', u'P0010001', u'state'], [u'25145561', u'25145561', u'48']], 
[[u'P0010001', u'P0010001', u'state'], [u'2763885', u'2763885', u'49']], 
[[u'P0010001', u'P0010001', u'state'], [u'625741', u'625741', u'50']]]

<type 'list'>
                         0                         1
0  [P0010001, P0010001, state]    [6346105, 6346105, 47]
1  [P0010001, P0010001, state]  [25145561, 25145561, 48]
2  [P0010001, P0010001, state]    [2763885, 2763885, 49]
3  [P0010001, P0010001, state]      [625741, 625741, 50]

Whereas the desired outcome would be:

  P0010001  P0010001  state
0 6346105   6346105   47
1 25145561  25145561  48
2 2763885   2763885   49
3 625741    625741    50

Fwiw, the analogous code in R is below. I'm translating a library I've written in R to Python:

state.data = function(token, state = "*", variables, year = 2010, survey = "sf1"){
  state = as.character(state)
  variables = paste(variables, collapse = ",")
  year = as.character(year)
  my.url = matrix(paste("http://api.census.gov/data/", year, "/", survey, "?key=", token,
                    "&get=",variables, "&for=state:", state, sep = ""), ncol = 1)

  process.url = apply(my.url, 1, function(x)   process.api.data(fromJSON(file=url(x))))
  rbind.dat = data.frame(rbindlist(process.url))
  rbind.dat = rbind.dat[, c(tail(seq_len(ncol(rbind.dat)), 1), seq_len(ncol(rbind.dat) - 1))] 
  rbind.dat
}
2 Comments

Can you give an example of what the original data looks like (what you retrieve from the complete_url)? If it is JSON, maybe you can use pd.read_json? (Commented Feb 3, 2015 at 21:12)
I added a print r and the output. r is a list. (Commented Feb 3, 2015 at 22:03)

1 Answer

so you have duplicate fields (P0010001 is requested twice), which is nonsensical, and with the dict-based approach below your result will only show one of the duplicated fields, since dict keys are unique.

however, all you need to do is pass a list/iterable of dict objects to the pd.DataFrame constructor, and you'll have your results:

vals = [[[...]]]  # the data you provided in your example
df = pd.DataFrame(dict(zip(*v)) for v in vals)
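as a concrete sketch of that one-liner, here it is applied to the four responses printed in the question (values copied from that output; a list comprehension is used in place of the generator, which should behave the same). note how the duplicated P0010001 key collapses into a single column:

```python
import pandas as pd

# Parsed JSON from the question: one [header, row] pair per state request.
vals = [
    [["P0010001", "P0010001", "state"], ["6346105", "6346105", "47"]],
    [["P0010001", "P0010001", "state"], ["25145561", "25145561", "48"]],
    [["P0010001", "P0010001", "state"], ["2763885", "2763885", "49"]],
    [["P0010001", "P0010001", "state"], ["625741", "625741", "50"]],
]

# zip(*v) pairs each header name with the value in the same position;
# dict() keeps only one entry per key, so the repeated field appears once.
records = [dict(zip(*v)) for v in vals]

df = pd.DataFrame(records)
print(df)
```

the values stay strings here; converting them to numbers is a separate step.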

assuming this is your data:

data = [["P0010001","PCO0020019","state"], ["4779736","1204","01"], ["710231","53","02"], ["6392017","799","04"], ["2915918","924","05"], ["37253956","6244","06"], ["5029196","955","08"], ["3574097","1266","09"], ["897934","266","10"], ["601723","170","11"], ["18801310","4372","12"], ["9687653","1629","13"], ["1360301","251","15"], ["1567582","320","16"], ["12830632","3713","17"]]

then this works:

df = pd.DataFrame(data[1:], columns=data[0])

so you'll need to figure out how to get the data into that form. all i'm doing is passing a list of lists (data[1:]) as the rows and a list (data[0]) as the column names.
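one possible way to get the per-state responses from the question into that form, as a rough sketch: take the header from the first response and collect the remaining rows from every response. responses_to_frame is a hypothetical helper name, not part of pandas, and it assumes every response starts with the same header row:

```python
import pandas as pd

def responses_to_frame(responses):
    """Hypothetical helper: merge several [header, row, ...] responses.

    Assumes each response is a list whose first element is the header
    and that all responses share that header.
    """
    header = responses[0][0]
    # Skip the header of each response and keep every data row.
    rows = [row for resp in responses for row in resp[1:]]
    return pd.DataFrame(rows, columns=header)

# Parsed JSON as printed in the question (one response per state).
responses = [
    [["P0010001", "P0010001", "state"], ["6346105", "6346105", "47"]],
    [["P0010001", "P0010001", "state"], ["25145561", "25145561", "48"]],
    [["P0010001", "P0010001", "state"], ["2763885", "2763885", "49"]],
    [["P0010001", "P0010001", "state"], ["625741", "625741", "50"]],
]

df = responses_to_frame(responses)
print(df)
```

unlike the dict approach, this keeps both P0010001 columns, because the column names are passed as a plain list.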

4 Comments

yes, different problem, and simple: df = pd.DataFrame(data[1:], columns=data[0])
Okay I'll try that. I added my R code to the original question to show why I'm doing things the way I am. It might help contextualize.
Hm. I'm getting this error for all the examples I tried: ValueError: Shape of passed values is (0, 0), indices imply (3, 0) Should I be iterating over the length of the original data set?
i edited the answer. you'll have to inspect the data to see why it's not correct.
