How to remove a column from a Pandas DataFrame using Python?

Question

I have the following code (Python 2.7):

df = pd.DataFrame()
pages = [i for i in range(1, int(math.ceil(reports.get_reports_count()/page_size)+1))]
with ThreadPoolExecutor(max_workers=len(pages)) as executor:
    futh = [executor.submit(reports.fill_dataframe, page) for page in pages]
    for data in as_completed(futh):
        df = df.append(data.result(), ignore_index=True)
current_time = datetime.utcnow().strftime('%Y-%m-%d %H:%M:%S')
df["timestamp"] = current_time
df.columns = [c.lower().replace(' ', '_') for c in df.columns]
df = df.replace(r'\n', ' ', regex=True)
file_name = "{0}.csv.gz".format(tab_name)
df.to_csv(path_or_buf=file_name, index=False, encoding='utf-8',
          compression='gzip',
          quoting=QUOTE_NONNUMERIC)

This creates a compressed CSV file from the data stream. Now I want to make sure that the columns in the file are the ones I expect (order does not matter). That is, if for any reason the data stream contains more columns than expected, the extra columns should be removed. Note that I add a column of my own to the data stream, called timestamp.

The allowed columns are:

cols_list = ['order_id', 'customer_id', 'date', 'price']

I'm aware of the del df['column_name'] option, but it doesn't work for me because I don't know in advance what the redundant column names will be.

I'm looking for something like:

if col_name not in cols_list:
   del df[???]  # delete the column and its data
   print [???]  # print the name of the redundant column for the log

I think there are two approaches here:

1. Don't add the redundant columns to the df in the first place.
2. Remove the redundant columns after the df.append is finished.

I prefer the 1st option, as I expect it to perform better (?)

One of my attempts was:

for i, data in enumerate(df):
        for col_name in cols_list:
            if col_name not in data.keys():
               del df[col_name ]

but it doesn't work..

if col_name not in data.keys(): AttributeError: 'str' object has no attribute 'keys'

I'm not sure whether I'm actually iterating over df itself.

Why not just create a new dataframe that has the desired columns from the previous dataframe plus the new one that you have added? That way, it doesn't matter if the previous dataframe has more columns, as you will only be dealing with the required columns in the new dataframe.
– Inder, Commented Jul 17, 2018 at 7:49
@Inder I'm not sure I'm following you. I can't compare the previous CSV to the current one. I delete them after the code is finished. The CSV is a step towards uploading the data into BigQuery.
– jack, Commented Jul 17, 2018 at 7:52
What I am saying is that you only need order id, customer id, date, and price from a dataframe df(1) that may have, say, 10 columns. Just create an empty dataframe df(2) and assign the columns that you need from df(1), e.g. df(2)["customer id"] = df(1)["customer id"]. You can also add your custom column to this new dataframe and do as you wish with it. You can be sure that it only has the required columns, regardless of what the original dataframe had.
– Inder, Commented Jul 17, 2018 at 7:56
@Inder that might consume a huge amount of memory and time... Isn't there another way? What about doing the df.append only for the desired columns?
– jack, Commented Jul 17, 2018 at 8:06
append will throw an error if the number of columns are different.
– Inder, Commented Jul 17, 2018 at 8:14

Joe · Accepted Answer · 2018-07-17 08:29:24Z

1

If you want to make your for-loop attempt work, try this:

# Iterate over a copy of the column names, since columns are deleted inside the loop.
for col_name in list(df.columns):
    if col_name not in cols_list:
        del df[col_name]
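
The question also asks for the names of the redundant columns to be logged. A minimal sketch of one way to do that, assuming hypothetical sample data in which 'coupon' and 'notes' stand in for the unexpected columns:

```python
import pandas as pd

# Hypothetical sample data; 'coupon' and 'notes' are the unexpected columns.
df = pd.DataFrame({
    'order_id': [1, 2],
    'customer_id': [10, 20],
    'date': ['2018-07-16', '2018-07-17'],
    'price': [9.5, 12.0],
    'coupon': ['A', 'B'],
    'notes': ['x', 'y'],
})
cols_list = ['order_id', 'customer_id', 'date', 'price']

# Collect the names that are not in the allowed list, in their original order.
redundant = [c for c in df.columns if c not in cols_list]
for col_name in redundant:
    print("dropping redundant column: {0}".format(col_name))

# Drop all redundant columns in a single call.
df = df.drop(columns=redundant)
```

If the timestamp column from the question has already been added at this point, it would need to be included in the allowed list (for example cols_list + ['timestamp']), otherwise it is dropped as well.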

edited Jul 17, 2018 at 8:29

answered Jul 17, 2018 at 8:17

Thijs van Ede · Accepted Answer · 2018-07-17 07:44:23Z

0

Removing the redundant column after the df.append is finished is quite simple:

df = df[cols_list]

As for the first suggestion, you could apply the statement above to each result before appending it to the df. Note, however, that this requires a pandas DataFrame object, so you would probably need to convert data.result() to a pandas DataFrame first.
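
A rough sketch of that first approach (filtering each page before it is combined), assuming each page result can be passed to pd.DataFrame and that its column names are already in their final lower-case form. The helper keep_allowed and the sample pages are hypothetical, and pd.concat is used here in place of df.append:

```python
import pandas as pd

cols_list = ['order_id', 'customer_id', 'date', 'price']


def keep_allowed(page_data, allowed):
    """Return a DataFrame holding only the allowed columns present in page_data."""
    page_df = pd.DataFrame(page_data)
    present = [c for c in page_df.columns if c in allowed]
    return page_df[present]


# Hypothetical page results, each a list of records with one unexpected field.
pages = [
    [{'order_id': 1, 'customer_id': 10, 'date': '2018-07-16',
      'price': 9.5, 'coupon': 'A'}],
    [{'order_id': 2, 'customer_id': 20, 'date': '2018-07-17',
      'price': 12.0, 'notes': 'x'}],
]

# Filter every page first, then combine the filtered frames once.
df = pd.concat([keep_allowed(p, cols_list) for p in pages], ignore_index=True)
```

Selecting only the columns that are present avoids the KeyError that df[cols_list] raises when one of the expected columns is missing from a page.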

answered Jul 17, 2018 at 7:44

3 Comments

jack Over a year ago

cols_list are the columns that should stay. I don't know the names of the columns that I need to remove. Everything that is not on cols_list should be removed.

Thijs van Ede Over a year ago

Yes, if you apply the function that I gave, it will only take the columns specified in cols_list to the new df variable.

jack Over a year ago

this doesn't answer my question :)

Paula Thomas · Accepted Answer · 2018-07-17 07:45:54Z

0

According to the Pandas documentation for the function read_csv at https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html there is a parameter 'usecols' which is described:

usecols : list-like or callable, default None

Return a subset of the columns. If list-like, all elements must either be positional (i.e. integer indices into the document columns) or strings that correspond to column names provided either by the user in names or inferred from the document header row(s). For example, a valid list-like usecols parameter would be [0, 1, 2] or ['foo', 'bar', 'baz']. Element order is ignored, so usecols=[0, 1] is the same as [1, 0]. To instantiate a DataFrame from data with element order preserved use pd.read_csv(data, usecols=['foo', 'bar'])[['foo', 'bar']] for columns in ['foo', 'bar'] order or pd.read_csv(data, usecols=['foo', 'bar'])[['bar', 'foo']] for ['bar', 'foo'] order.

If callable, the callable function will be evaluated against the column names, returning names where the callable function evaluates to True. An example of a valid callable argument would be lambda x: x.upper() in ['AAA', 'BBB', 'DDD']. Using this parameter results in much faster parsing time and lower memory usage.

This is the answer to your problem.
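
For illustration, the callable form of usecols might look like the sketch below. It only applies when the data is read from a CSV with read_csv, which is an assumption that does not match the code in the question; the in-memory csv_text is hypothetical sample data.

```python
import io

import pandas as pd

cols_list = ['order_id', 'customer_id', 'date', 'price']

# Hypothetical CSV content with one unexpected column, 'coupon'.
csv_text = (
    u"order_id,customer_id,date,price,coupon\n"
    u"1,10,2018-07-16,9.5,A\n"
    u"2,20,2018-07-17,12.0,B\n"
)

# The callable is evaluated against each column name; only names for which
# it returns True are parsed.
df = pd.read_csv(io.StringIO(csv_text), usecols=lambda name: name in cols_list)
```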

answered Jul 17, 2018 at 7:45

2 Comments

jack Over a year ago

This is not a solution. First, you assume that I use read_csv, which is wrong. I upload the file to Google Storage and update BigQuery with it. Second, there could be more than 20 columns that I need to ignore. Assuming 500K rows per file, that is a lot of space used for nothing, not to mention the processing time.

Paula Thomas Over a year ago

Then I suggest that at least the title needs to be changed; it specifically mentions csv. But I would also point out that you must have criteria for excluding columns, and presumably these can be coded, so you may want to use the callable version of something like this parameter. It can be a function.

jezrael · Accepted Answer · 2018-07-17 08:22:12Z

0

I think you need the intersection with the list of column names, and then to select that subset with []:

cols_list = ['order_id', 'customer_id', 'date', 'price']
cols = df.columns.intersection(cols_list)
df = df[cols]
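
A small sketch of how the intersection behaves when an expected column is absent, with Index.difference used to report both the unexpected and the missing names. The sample frame is hypothetical: it lacks 'price' and carries an extra 'coupon' column.

```python
import pandas as pd

cols_list = ['order_id', 'customer_id', 'date', 'price']

# Hypothetical frame: 'price' is missing and 'coupon' is unexpected.
df = pd.DataFrame({
    'order_id': [1, 2],
    'customer_id': [10, 20],
    'date': ['2018-07-16', '2018-07-17'],
    'coupon': ['A', 'B'],
})

# Names present in df but not allowed, and names allowed but not present.
extra = df.columns.difference(cols_list)
missing = pd.Index(cols_list).difference(df.columns)
print("unexpected columns: {0}".format(list(extra)))
print("missing columns: {0}".format(list(missing)))

# The intersection skips the missing 'price' column instead of raising KeyError.
cols = df.columns.intersection(cols_list)
df = df[cols]
```

Because the intersection tolerates missing columns, the explicit check on the missing names is what surfaces a column that never arrived in the data stream.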

answered Jul 17, 2018 at 8:22
