How to merge multiple CSV files and create a DataFrame?

Question

I'd like to do the following steps:
1. Merge all CSV files in the same directory.
2. Create a DataFrame from them.
3. Assign column names, drop a column, then set one of the columns ('Type') as the index.
4. For all files, melt the columns from column D through the last column into rows.

import glob
import pandas as pd

file_list = glob.glob("*.csv")

for file in file_list:
    merged_file = pd.read_csv(file)
    print(merged_file)

merged_file = pd.DataFrame()
df.columns = ['Type', 'Country', 'Source']
df = df.drop('Source', axis=1)
df = df.set_index('Type',drop=False).stack().reset_index()

agg_df = pd.melt(df, id_vars = [df, 'Source', 'Country', 'Type'])

df = df.sort_values('Type').reset_index(drop=True)
print(df)

The expected output (columns aligned to the left):

Mineral name - Type - Country - Prod_t_2021
Mineral name - Type - Country - Prod_t_2022
Mineral name - Type - Country - Reserves_t
Mineral name - Type - Country - Reserves_notes

The mineral name could be extracted from Type as a string.
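A minimal sketch of that extraction, assuming the Type values follow the "Mine production: Palladium" pattern seen in the answer's output. The sample frame and the new column name `Mineral` are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical sample; the Type strings mimic the pattern in the answer's output.
df = pd.DataFrame(
    {
        "Type": [" Mine production: Palladium", "Mine production: Platinum"],
        "Country": ["Russia", "South Africa"],
    }
)

# Keep the text after the last colon and trim surrounding whitespace.
# If a Type value has no colon, the whole (stripped) string is kept.
df["Mineral"] = df["Type"].str.split(":").str[-1].str.strip()
print(df[["Mineral", "Country"]])
```

Taking the last piece of the split rather than the second one is a design choice: it still returns something sensible when a Type value contains no colon at all.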

The source is World.zip from URL: https://www.sciencebase.gov/catalog/item/63b5f411d34e92aad3caa57f

Can you show a clear expected output of the final df?

– Timeless, Commented Feb 7, 2023 at 0:42

Timeless · Accepted Answer · 2023-02-07 00:48:14Z

If I understand correctly, use this:

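import glob
import pandas as pd

file_list = glob.glob("*.csv")

df = (
    pd.concat([pd.read_csv(f) for f in file_list], ignore_index=True)  # stack all files row-wise
      .melt(id_vars=["Source", "Country", "Type"])  # other columns become variable/value rows
      .set_index("Type")
      .sort_index()
)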

Output:

                              Source          Country         variable                   value
Type                                                                                          
 Mine production: Palladium  MCS2023    United States  Prod_t_est_2022                     NaN
 Mine production: Palladium  MCS2023     South Africa  Reserves_ore_kt                     NaN
 Mine production: Palladium  MCS2023           Russia  Reserves_ore_kt                     NaN
 Mine production: Palladium  MCS2023           Canada  Reserves_ore_kt                     NaN
 Mine production: Palladium  MCS2023    United States  Reserves_ore_kt                     NaN
 Mine production: Palladium  MCS2023    United States      Reserves_kt                     NaN
 Mine production: Palladium  MCS2023           Canada      Reserves_kt                     NaN
 Mine production: Palladium  MCS2023           Russia      Reserves_kt                     NaN
 Mine production: Palladium  MCS2023     South Africa      Reserves_kt                     NaN
 Mine production: Palladium  MCS2023  Other countries   Reserves_notes  Included with platinum
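Many of the NaN values above probably arise because the files do not share the same measurement columns: `pd.concat` fills any column a file lacks with NaN, and `melt` then turns each of those cells into its own row. The sketch below reproduces this with two hypothetical in-memory CSVs (the values and the Nickel row are made up) and drops the placeholder rows with `dropna`. Whether dropping them is appropriate depends on whether a missing value is meaningful in the real data.

```python
import io
import pandas as pd

# Two hypothetical CSVs with different measurement columns (values are made up).
csv_a = io.StringIO(
    "Source,Country,Type,Prod_t_2021,Prod_t_est_2022\n"
    "MCS2023,Russia,Mine production: Palladium,1.0,2.0\n"
)
csv_b = io.StringIO(
    "Source,Country,Type,Reserves_kt\n"
    "MCS2023,Canada,Mine production: Nickel,3.0\n"
)

# Columns missing from one file are filled with NaN by concat.
combined = pd.concat([pd.read_csv(f) for f in (csv_a, csv_b)], ignore_index=True)

# 2 rows x 3 measurement columns = 6 melted rows, 3 of which are NaN placeholders.
long_df = combined.melt(id_vars=["Source", "Country", "Type"])

# Keep only the rows that carry an actual value.
tidy = long_df.dropna(subset=["value"]).reset_index(drop=True)
print(tidy)
```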

3 Comments

user21084142 Over a year ago

For instance, how can I add the values for prod_t_est_2022 and prod_t_est_2021 as new rows for palladium production in the United States? (This should apply to all minerals and countries.)

Timeless Over a year ago

That's why you need to add a clear expected output in your question.

user21084142 Over a year ago

Please refer to the edited question.
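One possible reading of the request in this thread is that, after melting, each measurement should appear as its own row per mineral and country. A rough sketch under that assumption follows. The frame is a hypothetical stand-in for the melted result with made-up values, and the three-level index is an illustrative choice, not something the answer prescribes.

```python
import pandas as pd

# Hypothetical melted frame with made-up values, shaped like the answer's result.
long_df = pd.DataFrame(
    {
        "Type": ["Mine production: Palladium"] * 3,
        "Country": ["United States", "United States", "Russia"],
        "variable": ["Prod_t_2021", "Prod_t_est_2022", "Prod_t_2021"],
        "value": [1.0, 2.0, 3.0],
    }
)

# Index by mineral type, country, and measurement so lookups are direct.
indexed = long_df.set_index(["Type", "Country", "variable"]).sort_index()

# All measurements for one mineral/country pair, one row per measurement.
us_palladium = indexed.loc[("Mine production: Palladium", "United States")]
print(us_palladium)
```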
