
For context, I have a dataset comprising the USA's states and territories. I have made a new data frame with only the 50 states (excluding territories); let's call it States_Only. This is complete. However, the first dataset (let's call it USA_ALL) has both NY and NYC as independent rows, meaning that the values attributed to NY do not already include NYC's recorded data. Because they originate from the same dataset, the columns match. All values are either NaN/null or integers. For my States_Only data to be complete, the NYC values from USA_ALL need to be added to NY in the States_Only data frame. How can I achieve this? For clarity, I do not want to append NYC, nor can I use groupby(), because nothing in the data ties these two rows together (such as a shared identifier), only the knowledge that NYC is within NY.

import requests
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os

if __name__ == '__main__':
    #data prep
    data_path = './assets/'
    out_path = './output'
    #fetch the map data as JSON from the CDC endpoint
    endpoint = "https://covid.cdc.gov/covid-data-tracker/COVIDData/getAjaxData"
    data = requests.get(endpoint, params={"id": "US_MAP_DATA"}).json()
    #convert to df and export raw data as csv
    df = pd.DataFrame(data["US_MAP_DATA"])
    path = os.path.join(out_path,'Raw_CDC_Data.csv')
    df.to_csv(path)

    #Remove last data point (Total USA)
    df.drop(df.tail(1).index,inplace=True)
    #Create DF of just the 50 states (the list below also includes DC)
    state_abbr =["AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DC", "DE", "FL", "GA", 
          "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD", 
          "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ", 
          "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC", 
          "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY"]


    states = df[df['abbr'].isin(state_abbr)]
    # Add NYC from df to NY's existing values (sum of each column) to states

Here is an Excel spreadsheet showing the expected final values in the States_Only dataset. I have linked it because this data would be hard to read in this forum's formatting: Expected Values

  • Please provide a minimal reproducible example, as well as the current and expected output. As an aside, is the NY/NYC issue the only one of its kind, or do you need to do this for multiple cities? Commented Oct 18, 2020 at 0:15
  • This is my code up to this point, so as long as the Python file has folders labeled assets and output alongside it, it is perfectly reproducible. The expected output is in the link at the bottom labeled Expected Values; it did not translate well to text and was impossible to read clearly, so I attached it there. This is the only city that needs to be consolidated. The CDC website records NYC separately because it's so much larger than the rest of the state. Commented Oct 18, 2020 at 0:55

1 Answer


While this isn't super clean, it will do the trick:

import pandas as pd

import requests

endpoint = "https://covid.cdc.gov/covid-data-tracker/COVIDData/getAjaxData"
data = requests.get(endpoint, params={"id": "US_MAP_DATA"}).json()

df = pd.DataFrame(data["US_MAP_DATA"])

# drop last row
df = df[:-1]

ny_rows_mask = df["abbr"].isin(["NY", "NYC"])

ny_rows = df.loc[ny_rows_mask]

df = df.loc[~ny_rows_mask]

new_row = ny_rows.sum()
new_row["abbr"] = "NY"
new_row["id"] = 36
new_row["fips"] = 36
new_row["name"] = "New York"

# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)
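A variation on the approach above, sketched on a small made-up frame rather than the live CDC data: instead of removing both rows and adding a combined one at the end, it adds NYC's numbers to the existing NY row and then drops NYC, so NY keeps its original position. The column names tot_cases and tot_death are hypothetical stand-ins. On the real frame, numeric identifier columns such as id and fips would need to be left out of numeric_cols so they are not summed.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the CDC frame; tot_cases / tot_death are assumed names.
df = pd.DataFrame({
    "abbr": ["NJ", "NY", "NYC", "OH"],
    "name": ["New Jersey", "New York", "New York City", "Ohio"],
    "tot_cases": [10.0, 20.0, 5.0, 7.0],
    "tot_death": [1.0, np.nan, 2.0, 3.0],
})

numeric_cols = df.select_dtypes("number").columns
is_ny = df["abbr"] == "NY"
is_nyc = df["abbr"] == "NYC"

# Column-wise sum of the two rows. NaN is skipped when the other value
# is present; min_count=1 keeps NaN if both values are missing.
ny_total = df.loc[is_ny | is_nyc, numeric_cols].sum(min_count=1)

# Write the totals back into the NY row, one column at a time.
for col in numeric_cols:
    df.loc[is_ny, col] = ny_total[col]

# Drop the NYC row; NY stays where it was.
states = df.loc[~is_nyc].reset_index(drop=True)
print(states)
```

Because nothing is appended, no re-sorting is needed afterwards.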

As an aside, if you haven't already, you should examine the data types that pandas infers for each column. The id column probably shouldn't be a numeric type, for example.
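As a rough illustration of that dtype check, here is a sketch on a made-up two-row frame (not the real CDC response). It assumes the fips column arrives as integers; state FIPS codes are conventionally written as two-digit, zero-padded strings, so converting them avoids accidental arithmetic on identifiers.

```python
import pandas as pd

# Toy frame; the values are illustrative only.
df = pd.DataFrame({
    "abbr": ["AL", "NY"],
    "fips": [1, 36],
    "tot_cases": [10.0, 20.0],
})

# Inspect what pandas inferred for each column.
print(df.dtypes)

# Treat the identifier as text, zero-padded to two digits.
df["fips"] = df["fips"].astype(str).str.zfill(2)
print(df["fips"].tolist())
```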


3 Comments

That did work, but since it appends the row at the end of the dataframe, graphs derived from it show NY at the bottom, out of (mostly) alphabetical order. How could this code be modified to insert it at the same index (between Nevada and Ohio)?
@TaylorKillen It's probably simpler to just sort by that column after the new row is added.
@TaylorKillen By the way, you can accept this answer if it solves the issue.
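The sorting suggestion from the comments can be sketched as follows, again on a small made-up frame. It assumes that sorting on the abbr column gives an acceptable order; sorting on name instead would work the same way.

```python
import pandas as pd

# Toy frame with the combined NY row sitting at the end, as after an append.
df = pd.DataFrame({
    "abbr": ["NV", "OH", "NY"],
    "name": ["Nevada", "Ohio", "New York"],
})

# Re-sort by abbreviation and renumber the index from 0.
df = df.sort_values("abbr", ignore_index=True)
print(df)
```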
