1

I'm trying to append the data frame values as rows, but it's appending them as columns. I have 32 files, and I would like to take the second column (called dataset_code) from each one and append it. Instead, it's creating 32 rows and 101 columns. I would like 1 column and 3232 rows.

import pandas as pd
import os



source_directory = r'file_path'

df_combined = pd.DataFrame(columns=["dataset_code"])

for file in os.listdir(source_directory):
    if file.endswith(".csv"):
            #Read the new CSV to a dataframe.  
            df = pd.read_csv(source_directory + '\\' + file)
            df = df["dataset_code"]
            df_combined=df_combined.append(df)



print(df_combined)
  • Are you sure the columns are the same? From the append docs: "Append rows of other to the end of this frame, returning a new object. Columns not in this frame are added as new columns." Commented Aug 14, 2016 at 13:40
  • Yes, when I subset df and print it, it prints the appropriate column. Commented Aug 14, 2016 at 13:42

3 Answers

7

You already have two perfectly good answers, but let me make a couple of recommendations.

  1. If you only want the dataset_code column, tell pd.read_csv directly (usecols=['dataset_code']) instead of loading the whole file into memory only to subset the dataframe immediately.
  2. Instead of appending to an initially empty dataframe, collect a list of dataframes and concatenate them in one fell swoop at the end. Appending rows to a pandas DataFrame is costly (it has to create a whole new one each time), so your approach creates 65 DataFrames: one at the beginning, one when reading each file, and one for each append (maybe even 32 more, with the subsetting). The approach I am proposing only creates 33 of them (one per file plus the concatenated result), and is the common idiom for this kind of importing.

Here is the code:

import os
import pandas as pd

source_directory = r'file_path'

dfs = []
for file in os.listdir(source_directory):
    if file.endswith(".csv"):
        # os.path.join builds the path portably; usecols loads only the needed column.
        df = pd.read_csv(os.path.join(source_directory, file),
                         usecols=['dataset_code'])
        dfs.append(df)

df_combined = pd.concat(dfs)
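
For checking the resulting shape without the real files, here is a small self-contained sketch of the same idiom. The temporary directory, the file names, the `other` column and the sample values are all made up for illustration. `ignore_index=True` is an optional addition that is not in the code above: it renumbers the rows 0..N-1 instead of repeating each file's own index.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the real directory: 3 small CSV files, 4 rows each.
with tempfile.TemporaryDirectory() as source_directory:
    for i in range(3):
        sample = pd.DataFrame({
            "other": range(4),
            "dataset_code": [f"f{i}_r{j}" for j in range(4)],
        })
        sample.to_csv(os.path.join(source_directory, f"file{i}.csv"), index=False)

    # Same idiom as above: read one column per file, collect, concatenate once.
    dfs = []
    for file in sorted(os.listdir(source_directory)):
        if file.endswith(".csv"):
            dfs.append(pd.read_csv(os.path.join(source_directory, file),
                                   usecols=["dataset_code"]))

# Optional: ignore_index=True gives the result a fresh 0..N-1 index.
df_combined = pd.concat(dfs, ignore_index=True)
print(df_combined.shape)  # (12, 1)
```

With 32 files of 101 rows each, the same code should give the 3232 rows and 1 column asked for.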

1 Comment

Thank you Alberto, I changed yours to the accepted answer because it is the better solution
3

df["dataset_code"] is a Series, not a DataFrame. Since you want to append one DataFrame to another, you need to change the Series object to a DataFrame object.

>>> type(df)
<class 'pandas.core.frame.DataFrame'>
>>> type(df['dataset_code'])
<class 'pandas.core.series.Series'>

To make the conversion, do this:

df = df["dataset_code"].to_frame()
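
As for why the conversion matters, a rough sketch: a Series placed into a DataFrame is treated as a single row, and its index labels (0, 1, 2, ...) are lined up against the column names. None of them match `dataset_code`, so each label becomes a new column. `DataFrame.append` was removed in pandas 2.0, so the snippet below uses the `pd.DataFrame` constructor on a list of Series, which lays them out as rows in a similar way. The sample values are made up.

```python
import pandas as pd

# Made-up stand-ins for the dataset_code column of two files.
s1 = pd.Series(["a", "b", "c"], name="dataset_code")
s2 = pd.Series(["d", "e", "f"], name="dataset_code")

# Each Series becomes one ROW; its index labels 0, 1, 2 become the columns.
as_rows = pd.DataFrame([s1, s2])
print(as_rows.shape)  # (2, 3)

# Converted to one-column DataFrames first, they stack vertically instead.
stacked = pd.concat([s1.to_frame(), s2.to_frame()], ignore_index=True)
print(stacked.shape)  # (6, 1)
```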

1 Comment

hey Nehal, this worked, thank you!! But why did it work? Can you help me understand?
3

Alternatively, you can create a dataframe with double square brackets:

df = df[["dataset_code"]]
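
A quick way to see the difference between the two kinds of brackets, using a made-up two-row frame (the `id` column and the values are only for illustration):

```python
import pandas as pd

# Made-up data for illustration.
df = pd.DataFrame({"id": [1, 2], "dataset_code": ["a", "b"]})

single = df["dataset_code"]     # a column label -> Series
double = df[["dataset_code"]]   # a list of labels -> one-column DataFrame

print(type(single).__name__)  # Series
print(type(double).__name__)  # DataFrame

# The double-bracket form matches the to_frame() conversion.
same = double.equals(df["dataset_code"].to_frame())
print(same)  # True
```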
