Sorting files based on dataframe Python

Question

I want to clean a folder of csv files but there are differences in the dataframes.

The first chunk has the following:

Province/State,Country/Region,Last Update,Confirmed,Deaths,Recovered
Anhui,Mainland China,1/22/2020 17:00,1,,
Beijing,Mainland China,1/22/2020 17:00,14,,
Chongqing,Mainland China,1/22/2020 17:00,6,,

and the second chunk has the following:

FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,Combined_Key
45001,Abbeville,South Carolina,US,2020-03-23 23:19:34,34.22333378,-82.46170658,1,0,0,0,"Abbeville, South Carolina, US"
22001,Acadia,Louisiana,US,2020-03-23 23:19:34,30.295064899999996,-92.41419698,1,0,0,0,"Acadia, Louisiana, US"
51001,Accomack,Virginia,US,2020-03-23 23:19:34,37.76707161,-75.63234615,1,0,0,0,"Accomack, Virginia, US"

I am trying to clean them all up to this format:

0,County,State,Country,Confirmed,Deaths,Recovered,Active,City
0,Abbeville,South Carolina,US,3,0,0,0,"Abbeville, South Carolina, US"
1,Acadia,Louisiana,US,9,1,0,0,"Acadia, Louisiana, US"
2,Accomack,Virginia,US,3,0,0,0,"Accomack, Virginia, US"

My question is: is there a way to sort the files based on the differences in the dataframes, or will I always have to find where the files change and then sort based on that?

I have tried the following, with 01-22-2020.csv being the first reference:

from glob import glob

# files = glob('*.csv')

samples = []
references = []

ref = str(input('Enter first reference name: '))
num_ref = int(input('How many references are there? '))

all_files = glob('*.csv')
first_ref = all_files.index(ref)
ref_files = all_files[first_ref:first_ref+num_ref]

sample_files = all_files
del sample_files[first_ref:first_ref+num_ref]
del all_files

and the result is:

ValueError: '01-22-2020.csv' is not in list

Here is another attempt:

files = glob('*.csv')
for f in files:
    df = pd.read_csv(f)
    df = df.replace(np.nan, 'Other', regex=True)
    if df.columns[0] == ['FIPS']:
        df = df.drop(['FIPS', 'Last_Update', 'Lat', 'Long_'], axis=1)
        df = df.rename(columns={'Admin2': 'County',
                                'Province_State': 'State',
                                'Country_Region': 'Country',
                                'Combined_Key': 'City'})
        df.to_csv(f)
    elif df.columns[0] != ['FIPS']:
        df = df.drop(['Last Update'], axis=1)
        df = df.rename(columns={'Province/State': 'State',
                               'Country/Region': 'Country'})
        df.to_csv(f)
    else:
        pass

Which results in:

KeyError: "['Last Update'] not found in axis"

gosuto · Accepted Answer · 2020-04-02 21:29:25Z

1

I would read the file with plain Python first and split it into separate files, for example based on whether the first characters of each line are digits or not.

pandas's .read_csv() has no way of differentiating between different styles of lines within the same CSV file.
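A minimal sketch of that splitting step, assuming both layouts really do sit in one text and that each begins with its own header line. It starts a new block at each header line rather than testing for leading digits, since the FIPS header itself does not start with a digit. The sample text and the `HEADER_STARTS` tuple are illustrative, built from the rows shown in the question.

```python
import csv

# Hypothetical combined input: two layouts in one text, each with its own header.
combined = """Province/State,Country/Region,Last Update,Confirmed,Deaths,Recovered
Anhui,Mainland China,1/22/2020 17:00,1,,
FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,Combined_Key
45001,Abbeville,South Carolina,US,2020-03-23 23:19:34,34.22333378,-82.46170658,1,0,0,0,"Abbeville, South Carolina, US"
"""

# Assumed first field of each known header line.
HEADER_STARTS = ("Province/State,", "FIPS,")


def split_by_header(text):
    """Return one list of lines per layout, starting a new block at each header."""
    blocks = []
    for line in text.splitlines():
        if line.startswith(HEADER_STARTS):
            blocks.append([])
        if blocks:  # ignore anything that precedes the first header
            blocks[-1].append(line)
    return blocks


blocks = split_by_header(combined)
# Parse every block on its own, so each one gets its own column names.
rows_by_layout = [list(csv.DictReader(block)) for block in blocks]
```

Each block could equally be handed to pandas with `pd.read_csv(io.StringIO("\n".join(block)))` once it has been separated this way.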

5 Comments

Luck Box Over a year ago

I would think an if statement checking the first column name would suffice but it appears not. (edited OP to reflect on this) It's a folder that I am adding more files to and if the dataframe changes, I have to add new code in and comment out other code.

gosuto Over a year ago

Try if df.columns[0] instead.

Luck Box Over a year ago

KeyError: "['Last Update'] not found in axis" is now my new error.

gosuto Over a year ago

Do a print(df.columns) after the if statement and you'll see what you are working with.

Luck Box Over a year ago

I am working with the first file, whose first column is ['Province/State']. It does have the ['Last Update'] column, which is not being found.
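One thing worth checking in the code from the question: `df.columns[0]` is a string, while the question compares it with the list `['FIPS']`. The small sketch below uses a plain list as a stand-in for `df.columns` (an assumption made to keep it dependency-free) and shows why the `elif` branch would then run for every file, including the FIPS files that have no 'Last Update' column.

```python
# Stand-in for list(df.columns) of a file in the second layout.
columns = ["FIPS", "Admin2", "Province_State"]
first = columns[0]

# Comparing a str with a list is never equal, so the `if` branch is skipped...
matches_list = first == ["FIPS"]
# ...and the `elif` condition is always true.
differs_from_list = first != ["FIPS"]
# Comparing against the plain string behaves as intended.
matches_string = first == "FIPS"

print(matches_list, differs_from_list, matches_string)
```

With the string comparison, the two branches would separate the layouts by their first column name as the question intended.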

MarianD · Accepted Answer · 2020-04-02 23:20:03Z

0

Instead of

df = df.drop(['Last Update'], axis=1)

use

df = df.drop(['Last_Update'], axis=1)

(note the underscore _).
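Since the question's files use both spellings, one possible approach is to handle the two variants in a single step instead of branching. This is a sketch under that assumption, not necessarily what the answer intended; the `normalize` helper and the two tiny sample frames are hypothetical, with column names copied from the question.

```python
import pandas as pd

# Both spellings of each column map to the same target name.
RENAME = {
    "Admin2": "County",
    "Province_State": "State", "Province/State": "State",
    "Country_Region": "Country", "Country/Region": "Country",
    "Combined_Key": "City",
}
DROP = ["FIPS", "Last_Update", "Last Update", "Lat", "Long_"]


def normalize(df):
    """Drop and rename whichever variant of each column is present."""
    # errors="ignore" skips labels that this particular layout does not have.
    return df.drop(columns=DROP, errors="ignore").rename(columns=RENAME)


old = pd.DataFrame({"Province/State": ["Anhui"],
                    "Country/Region": ["Mainland China"],
                    "Last Update": ["1/22/2020 17:00"],
                    "Confirmed": [1]})
new = pd.DataFrame({"FIPS": [45001], "Admin2": ["Abbeville"],
                    "Province_State": ["South Carolina"],
                    "Country_Region": ["US"],
                    "Last_Update": ["2020-03-23 23:19:34"],
                    "Lat": [34.2], "Long_": [-82.4], "Confirmed": [1],
                    "Combined_Key": ["Abbeville, South Carolina, US"]})

old_cols = list(normalize(old).columns)
new_cols = list(normalize(new).columns)
print(old_cols)  # ['State', 'Country', 'Confirmed']
print(new_cols)  # ['County', 'State', 'Country', 'Confirmed', 'City']
```

When writing results back over the source files, passing `index=False` to `to_csv` avoids adding an extra unnamed index column, which would otherwise become the first column the next time the script reads the file.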

5 Comments

Luck Box Over a year ago

The first subset has it as Last Update while the second has it as Last_Update

MarianD Over a year ago

The order of files you expect glob('*.csv') to return is evidently not the order it actually returns; try print(files).

MarianD Over a year ago

In your first approach, your current directory doesn't contain the file you entered. BTW, applying str to input() is superfluous; you may omit it, as input() returns a string.

Luck Box Over a year ago

The if statement should handle the second subset while the elif should handle the first. In my mind, it shouldn't matter what the order of the files is. If the first column is FIPS, perform this; if not, then perform that. I have already tried your suggestion and it came out to the same result. KeyError: "['Last_Update'] not found in axis"

Luck Box Over a year ago

With the first approach, why does sorted(glob return the same result as what I have now?
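On the last point: `glob('*.csv')` matches relative to the current working directory and returns paths in arbitrary order. Sorting changes only the order, not which files are found, so a `ValueError` that persists after sorting suggests the file is not in the directory being searched. A minimal sketch, assuming an explicit folder is passed in; `list_csvs` is a hypothetical helper and the file names in the throwaway directory are illustrative.

```python
import tempfile
from glob import glob
from pathlib import Path


def list_csvs(folder):
    """Return the CSV file names found in `folder`, sorted by name."""
    return sorted(Path(p).name for p in glob(str(Path(folder) / "*.csv")))


# Demonstration in a temporary directory so the sketch has no side effects.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["03-23-2020.csv", "01-22-2020.csv", "notes.txt"]:
        (Path(tmp) / name).touch()

    files = list_csvs(tmp)
    ref = "01-22-2020.csv"
    # Check membership first instead of letting .index() raise ValueError.
    position = files.index(ref) if ref in files else None

print(files, position)
```

Printing `files` before calling `.index()` shows at a glance whether the reference file was found at all.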
