0

I want to clean a folder of csv files but there are differences in the dataframes.

The first chunk has the following:

Province/State,Country/Region,Last Update,Confirmed,Deaths,Recovered
Anhui,Mainland China,1/22/2020 17:00,1,,
Beijing,Mainland China,1/22/2020 17:00,14,,
Chongqing,Mainland China,1/22/2020 17:00,6,,

and the second chunk has the following:

FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,Combined_Key
45001,Abbeville,South Carolina,US,2020-03-23 23:19:34,34.22333378,-82.46170658,1,0,0,0,"Abbeville, South Carolina, US"
22001,Acadia,Louisiana,US,2020-03-23 23:19:34,30.295064899999996,-92.41419698,1,0,0,0,"Acadia, Louisiana, US"
51001,Accomack,Virginia,US,2020-03-23 23:19:34,37.76707161,-75.63234615,1,0,0,0,"Accomack, Virginia, US"

I am trying to clean them all up to this format:

0,County,State,Country,Confirmed,Deaths,Recovered,Active,City
0,Abbeville,South Carolina,US,3,0,0,0,"Abbeville, South Carolina, US"
1,Acadia,Louisiana,US,9,1,0,0,"Acadia, Louisiana, US"
2,Accomack,Virginia,US,3,0,0,0,"Accomack, Virginia, US"

My question is is there a way to sort based on the differences in the dataframes or will I always have to find where the files change and then sort based on that?

I have tried the following, with 01-22-2020.csv being the first reference:

from glob import glob

# files = glob('*.csv')

samples = []
references = []

ref = str(input('Enter first reference name: '))
num_ref = int(input('How many references are there? '))

all_files = glob('*.csv')
first_ref = all_files.index(ref)
ref_files = all_files[first_ref:first_ref+num_ref]

sample_files = all_files
del sample_files[first_ref:first_ref+num_ref]
del all_files

and the result is:

ValueError: '01-22-2020.csv' is not in list

Here is another attempt:

files = glob('*.csv')
for f in files:
    df = pd.read_csv(f)
    df = df.replace(np.nan, 'Other', regex=True)
    if df.columns[0] == ['FIPS']:
        df = df.drop(['FIPS', 'Last_Update', 'Lat', 'Long_'], axis=1)
        df = df.rename(columns={'Admin2': 'County',
                                'Province_State': 'State',
                                'Country_Region': 'Country',
                                'Combined_Key': 'City'})
        df.to_csv(f)
    elif df.columns[0] != ['FIPS']:
        df = df.drop(['Last Update'], axis=1)
        df = df.rename(columns={'Province/State': 'State',
                               'Country/Region': 'Country'})
        df.to_csv(f)
    else:
        pass

Which results in:

KeyError: "['Last Update'] not found in axis"

2 Answers 2

1

I would load in the file with Python first, and split them into different files. For example based on whether the first characters are digits or not.

pandas's .read_csv() has no way of differentiating between different styles of lines within the same CSV file.

Sign up to request clarification or add additional context in comments.

5 Comments

I would think an if statement checking the first column name would suffice but it appears not. (edited OP to reflect on this) It's a folder that I am adding more files to and if the dataframe changes, I have to add new code in and comment out other code.
Try if df.columns[0] instead.
KeyError: "['Last Update'] not found in axis" is now my new error.
Do a print(df.columns) after the if statement and you'll see what you are working with.
I am working with the first file, which is ['Province/Sate'] for the first column. This does have the ['Last Update'] which is not being found.
0

Instead of

df = df.drop('Last Update')

use

df = df.drop('Last_Update')

(note the underline symbol _).

5 Comments

The first subset has it as Last Update while the second has it as Last_Update
Your supposed order of files returned by glob('*.csv') is obviously incorrect — try print(files).
In your first approach your current directory doesn't contain entered file. BTW, the applying str to input() is superfluous, you may omit it as input() returns a string.
The if statement should handle the second subset while the elif should handle the first. In my mind, it shouldn't matter what the order of the files are. If the first column is FIPS, perform this, if not, then perform that. I have already tried your suggestion and it came out to the same result. KeyError: "['Last_Update'] not found in axis"
With the first approach, why does sorted(glob return the same result as what I have now?

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.