I have the following code (Python 2.7):
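import math
from csv import QUOTE_NONNUMERIC
from datetime import datetime
# On Python 2.7 this module comes from the "futures" backport package
from concurrent.futures import ThreadPoolExecutor, as_completed

import pandas as pd

df = pd.DataFrame()
# float() guards against Python 2 integer division, which would make math.ceil a no-op
page_count = int(math.ceil(reports.get_reports_count() / float(page_size)))
pages = list(range(1, page_count + 1))
with ThreadPoolExecutor(max_workers=len(pages)) as executor:
    futh = [executor.submit(reports.fill_dataframe, page) for page in pages]
    for data in as_completed(futh):
        df = df.append(data.result(), ignore_index=True)
current_time = datetime.utcnow().strftime('%Y-%m-%d %H:%M:%S')
df["timestamp"] = current_time
df.columns = [c.lower().replace(' ', '_') for c in df.columns]
df = df.replace(r'\n', ' ', regex=True)
file_name = "{0}.csv.gz".format(tab_name)
df.to_csv(path_or_buf=file_name, index=False, encoding='utf-8',
          compression='gzip',
          quoting=QUOTE_NONNUMERIC)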
This creates a compressed csv file from the data stream.
Now I want to make sure that the columns in the file are the ones I expect (order does not matter). That is, if for any reason the data stream contains more columns than these, the extra columns should be removed. Note that I add a column of my own, called timestamp, to the data stream.
The allowed columns are:
cols_list = ['order_id', 'customer_id', 'date', 'price']
I'm aware of the del df['column_name'] option, but it doesn't work for me because I don't know in advance what the redundant column will be called.
I'm looking for something like:
if col_name not in cols_list:
    del df[???]   # delete the column and its data
    print [???]   # print the name of the redundant column for the log
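For illustration, the pseudocode above could be sketched as follows. This is a minimal sketch, not a definitive implementation: the helper name drop_redundant_columns, the toy frame, and the extra column 'coupon' are hypothetical, and it assumes the column names have already been lowercased as in the code at the top. The timestamp column is added to the allowed list because it is added deliberately.

```python
import pandas as pd

cols_list = ['order_id', 'customer_id', 'date', 'price']


def drop_redundant_columns(df, allowed):
    """Return (trimmed frame, list of dropped column names).

    Hypothetical helper: any column whose name is not in `allowed` is dropped.
    """
    redundant = [c for c in df.columns if c not in allowed]
    # axis=1 tells drop() that the labels are column names
    return df.drop(redundant, axis=1), redundant


# Toy frame standing in for the real data; 'coupon' is a made-up extra column.
df = pd.DataFrame({
    'order_id': [1, 2],
    'price': [9.5, 3.0],
    'coupon': ['A', 'B'],
    'timestamp': ['2020-01-01 00:00:00', '2020-01-01 00:00:00'],
})

# 'timestamp' is added on purpose, so it must be allowed as well
allowed = cols_list + ['timestamp']
df, redundant = drop_redundant_columns(df, allowed)
for name in redundant:
    print('dropping redundant column: %s' % name)
```

Collecting the redundant names first means they are available for logging without knowing them in advance.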
I think there are two approaches here:
- not to add the redundant column to the df in the first place.
- remove the redundant column after the df.append is finished.
I prefer the first option, as I expect it to perform better (?)
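A rough sketch of the first approach, filtering each page before it is combined, might look like this. The helper keep_allowed and the toy pages are hypothetical stand-ins for the results of reports.fill_dataframe(page). It assumes the column names have to be normalized (lowercased, spaces replaced) before they can be compared with the allowed list. Whether this is measurably faster than dropping columns at the end is not established here.

```python
import pandas as pd

ALLOWED = ['order_id', 'customer_id', 'date', 'price']


def keep_allowed(page_df, allowed=ALLOWED):
    """Hypothetical helper: normalize column names, then keep only allowed ones."""
    # Same normalization as in the original code, applied per page
    page_df = page_df.rename(columns=lambda c: c.lower().replace(' ', '_'))
    extra = [c for c in page_df.columns if c not in allowed]
    if extra:
        print('ignoring unexpected columns: %s' % ', '.join(extra))
    return page_df[[c for c in page_df.columns if c in allowed]]


# Two toy pages standing in for the per-page results; 'Coupon' is made up.
page_results = [
    pd.DataFrame({'Order ID': [1], 'Price': [9.5], 'Coupon': ['A']}),
    pd.DataFrame({'Order ID': [2], 'Price': [3.0]}),
]

# Filter each page first, then build the combined frame in a single step
df = pd.concat([keep_allowed(p) for p in page_results], ignore_index=True)
```

In the threaded version, keep_allowed would be applied to each data.result() as it completes, and the timestamp column would still be added afterwards.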
One of my attempts was:
for i, data in enumerate(df):
    for col_name in cols_list:
        if col_name not in data.keys():
            del df[col_name]
but it doesn't work:
    if col_name not in data.keys():
AttributeError: 'str' object has no attribute 'keys'
I'm not sure I'm enumerating over df itself correctly.
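The error is consistent with how pandas iterates: looping over a DataFrame (with or without enumerate) yields the column labels, which are plain strings here, rather than rows. A small sketch on a made-up two-column frame shows the difference between iterating the frame and iterating its rows with iterrows():

```python
import pandas as pd

# Toy frame used only to demonstrate iteration behaviour
df = pd.DataFrame({'order_id': [1, 2], 'price': [9.5, 3.0]})

# Iterating over a DataFrame yields its column labels, not its rows
labels = [item for _, item in enumerate(df)]

# iterrows() yields (index, row) pairs; each row is a Series, which has keys()
rows = [row for _, row in df.iterrows()]

for label in labels:
    print('column label: %s' % label)
for row in rows:
    print('row keys: %s' % ', '.join(row.keys()))
```

So data in the attempt above is a column name such as 'order_id', which is why calling data.keys() fails.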