Reading csv with variable number of columns with pandas

Question

I have a csv file with a variable number of columns.

Pandas can handle the situation fairly easily with the name parameter. This works if the first row of the file is either the same length or shorter of names, but fails if it's longer.

Examples

Case 1

First row has the same number of elements as names

from io import StringIO
import pandas as pd

file = StringIO(
        '''1, 2, 3,
           1, 2
           1, 2, 3, 4, 
           1, 2, 3,''')

df = pd.read_csv(file, names=['A','B','C'], index_col=False)
df

# Out[0]:
#    A  B    C
# 0  1  2  3.0
# 1  1  2  NaN
# 2  1  2  3.0
# 3  1  2  3.0

This works.

Case 2

First row has less elements than names

from io import StringIO
import pandas as pd

file = StringIO(
        '''1, 2,
           1, 2
           1, 2, 3, 4, 
           1, 2, 3,''')

df = pd.read_csv(file, names=['A','B','C'], index_col=False)
df

# Out[0]:
#    A  B    C
# 0  1  2  NaN
# 1  1  2  NaN
# 2  1  2  3.0
# 3  1  2  3.0

This works too.

Case 3

First row has more elements than names

from io import StringIO
import pandas as pd

file = StringIO(
        '''1, 2, 3, 4,
           1, 2
           1, 2, 3, 4, 
           1, 2, 3,''')

df = pd.read_csv(file, names=['A','B','C'], index_col=False)
df

# Expected output:
#    A  B    C
# 0  1  2  3.0
# 1  1  2  NaN
# 2  1  2  3.0
# 3  1  2  3.0
#
# Instead I get:
# IndexError: list index out of range

This doesn't work!!!

Question

Is there a way to make this script more robust so that it handles also Case 3 without generating an error? Doesn't this looks a bit like a bug

I'm running pandas version 0.23.4

EDIT:

Case 3b

First row has more elements than names

from io import StringIO
import pandas as pd

file = StringIO(
        '''1, 2, 3, 4
           1, 2
           1, 2, 3, 4, 
           1, 2, 3,''') # NOTE: No comma after the first "4"!!

df = pd.read_csv(file, names=['A','B','C'], index_col=False)
df

#Out[0]:
#    A  B    C
# 0  1  2  3.0
# 1  1  2  NaN
# 2  1  2  3.0
# 3  1  2  3.0

This works!!!

What the hell is going on?

Neroksi · Accepted Answer · 2019-03-19 16:58:23Z

3

Just make use of the usecols params, instead of the names one. names assume that you're listing all the columns' name, whereas usecolsassume a subample of the columns.

from io import StringIO
import pandas as pd

file = StringIO(
        '''1, 2, 3, 4,
           1, 2
           1, 2, 3, 4, 
           1, 2, 3,''')

df = pd.read_csv(file, usecols =[0,1,2], header = None)
df
0   1   2
0   1   2   3.0
1   1   2   NaN
2   1   2   3.0
3   1   2   3.0

answered Mar 19, 2019 at 16:58

Neroksi

1,4181 gold badge13 silver badges20 bronze badges

Sign up to request clarification or add additional context in comments.

Collectives™ on Stack Overflow

Reading csv with variable number of columns with pandas

Examples

Case 1

Case 2

Case 3

Question

EDIT:

Case 3b

1 Answer 1

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

Examples

Case 1

Case 2

Case 3

Question

EDIT:

Case 3b

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Related