7

I have a lot of csv files that I would like to read with Pandas (pd.read_csv). However, in some of the files a column is added midway through that does not have a header, like in this example:

Apples, Pears
1, 2
3, 4
5, 6, 7

Using pd.read_csv(example_file) throws the following error: "ParserError: Error tokenizing data. C error: Expected 2 fields in line 4, saw 3"
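A minimal reproduction of the error (the file name example_file.csv is used here only for illustration):

```python
import pandas as pd

# Write the sample from above to disk (file name is illustrative)
with open('example_file.csv', 'w') as f:
    f.write('Apples, Pears\n1, 2\n3, 4\n5, 6, 7\n')

# The extra field on the last line makes the C parser fail
try:
    pd.read_csv('example_file.csv')
except pd.errors.ParserError as e:
    print(e)  # Error tokenizing data. C error: Expected 2 fields in line 4, saw 3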

I would like to avoid having to skip the line and instead just add a dummy header name, like Unknown1, and get the following result:

Apples, Pears, Unknown1  
1, 2, np.nan
3, 4, np.nan
5, 6, 7

3 Answers

5

pandas needs to know the geometry in advance to build the dataframe. You can read the header line, append enough dummy column names to cover the extra columns, then re-read the whole csv with those names and drop the columns that turn out to be entirely empty.

>>> import pandas as pd
>>> names = list(pd.read_csv('foo.csv', nrows=0)) + ['unknown1', 'unknown2']
>>> df = pd.read_csv('foo.csv', names=names, skiprows=1).dropna(axis='columns', how='all')
>>> df
   Apples   Pears  unknown1
0       1       2       NaN
1       3       4       NaN
2       5       6       7.0

If there are many extra columns and you are worried about the memory footprint of the intermediate dataframe, you can use the csv module to scan the file and calculate the maximum number of columns. Unlike pandas, csv is quite happy to yield rows of varying length.

>>> import csv
>>> with open('foo.csv', newline='') as in_fp:
...     reader = csv.reader(in_fp)
...     header = next(reader)
...     num_cols = max(len(row) for row in reader)
... 
>>> names = header + ['unknown{}'.format(i+1) for i in range(num_cols-len(header))]
>>> df = pd.read_csv('foo.csv', names=names, skiprows=1)
>>> df
   Apples   Pears  unknown1
0       1       2       NaN
1       3       4       NaN
2       5       6       7.0

4 Comments

Excellent! Nice and simple, and only uses Pandas. Thank you! :)
I am afraid this is not scalable. What do you do if there are 10 missing column names?
@sudonym - It really depends on what level of scaling you need. Preprocessing the file adds to CPU cost. The memory footprint is not that big of a deal on modern hardware unless the files are quite large. OP has many files and few extra columns. It is likely a good solution.
I agree with you.
1

We can load the csv as a single column, then fix up the output afterwards:

import io
import pandas as pd

t = """Apples, Pears
1, 2
3, 4
5, 6, 7"""
# sep='\t' forces every line into a single column
df = pd.read_csv(io.StringIO(t), sep='\t')

yourdf = df.iloc[:, 0].str.split(', ', expand=True)
s = df.columns.str.split(', ').tolist()[0]
yourdf.columns = s + ['unknown' + str(x + 1) for x in range(yourdf.shape[1] - len(s))]


yourdf
Out[104]: 
  Apples Pears unknown1
0      1     2     None
1      3     4     None
2      5     6        7

1 Comment

In the case of 't' being a path to the csv file, how would IO be used?
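If t is a path to a csv file rather than a literal string, one option (a sketch; the file name foo.csv is illustrative) is to read the file's text and wrap it in io.StringIO, so the rest of the answer works unchanged:

```python
import io
import pandas as pd

# Sample data written to disk; 'foo.csv' is just an illustrative name
with open('foo.csv', 'w') as f:
    f.write('Apples, Pears\n1, 2\n3, 4\n5, 6, 7\n')

# Read the file's text and wrap it in StringIO so the
# single-column sep='\t' trick from the answer works unchanged
with open('foo.csv') as f:
    df = pd.read_csv(io.StringIO(f.read()), sep='\t')
```

Note that pd.read_csv also accepts the path directly, so pd.read_csv('foo.csv', sep='\t') gives the same single-column frame without the wrapper.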
0

If you don't know the number of columns beforehand, you can determine the maximum number of columns across all rows using readlines(), at the cost of losing the known header names.

import pandas as pd

sep = ','                                                   # Define separator
lines = open("test.csv").readlines()                        # Open file and read lines
colcount = max(len(l.strip().split(sep)) for l in lines)    # Count fields per line
df = pd.read_csv("test.csv", names=range(colcount), skiprows=[0])
print(df)

   0  1    2
0  1  2  NaN
1  3  4  NaN
2  5  6  7.0

The colcount computed above can be combined with the other answers so far, too.


Edit: Beware of input files other than .csv (see comments)

5 Comments

This doesn't work either. You've ignored any quoting and escapes for the column separator that a CSV processor would consider.
this works for me - did you copy and paste into jupyter? Would you mind elaborating?
Quoting and escapes are part of the csv protocol. For instance, what if a column is "embedded comma, should not split this column"? CSV parsers deal with this, which is why we use them instead of just splitting.
I have upvoted your answer and removed any reference from mine regarding scalability of yours. Thanks for your explanation.
If you use csv to handle the escaping, then this solution is good.
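Putting the comments above together: a quoting-aware variant of the column count, using csv.reader instead of str.split (a sketch along the lines of the accepted answer; the sample data and test.csv file name mirror this answer):

```python
import csv
import pandas as pd

# Sample data; 'test.csv' matches the file name used in this answer
with open('test.csv', 'w', newline='') as f:
    f.write('Apples, Pears\n1, 2\n3, 4\n5, 6, 7\n')

# csv.reader honors quoting, so a quoted "embedded, comma"
# counts as one field rather than two
with open('test.csv', newline='') as f:
    colcount = max(len(row) for row in csv.reader(f))

df = pd.read_csv('test.csv', names=list(range(colcount)), skiprows=[0])
```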
