73

I am reading from an Excel sheet and I want to read certain columns: column 0 because it is the row-index, and columns 22:37. Now here is what I do:

import pandas as pd
import numpy as np
file_loc = "path.xlsx"
df = pd.read_excel(file_loc, index_col=None, na_values=['NA'], parse_cols = 37)
df= pd.concat([df[df.columns[0]], df[df.columns[22:]]], axis=1)

But I would hope there is better way to do that! I know if I do parse_cols=[0, 22,..,37] I can do it, but for large datasets this doesn't make sense.

I also did this:

s = pd.Series(0)
s[1]=22
for i in range(2,14):
    s[i]=s[i-1]+1
df = pd.read_excel(file_loc, index_col=None, na_values=['NA'], parse_cols = s)

But it reads the first 15 columns which is the length of s.

4
  • 1
    you'd have to generate a list of cols and pass this to parse_cols e.g. parse_cols=[0, 22,23,24.....,37] rather than what you're doing now Commented Nov 11, 2015 at 16:30
  • Not sure why that didn't work, it could be a bug, what happens when you pass a hard coded list: df = pd.read_excel(file_loc, index_col=None, na_values=['NA'], parse_cols = [0,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37)? Commented Nov 13, 2015 at 9:31
  • @EdChum, It works if pass a hard coded list. Commented Nov 13, 2015 at 17:34
  • In that case generate a list rather than a series Commented Nov 13, 2015 at 17:44

6 Answers 6

101

You can use column indices (letters) like this:

import pandas as pd
import numpy as np
file_loc = "path.xlsx"
df = pd.read_excel(file_loc, index_col=None, na_values=['NA'], usecols="A,C:AA")
print(df)

Corresponding documentation:

usecols : int, str, list-like, or callable default None

  • If None, then parse all columns.

  • If str, then indicates comma separated list of Excel column letters and column ranges (e.g. “A:E” or “A,C,E:F”). Ranges are inclusive of both sides.

  • If list of int, then indicates list of column numbers to be parsed.

  • If list of string, then indicates list of column names to be parsed.

    New in version 0.24.0.

  • If callable, then evaluate each column name against it and parse the column if the callable returns True.

Returns a subset of the columns according to behavior above.

New in version 0.24.0.

Sign up to request clarification or add additional context in comments.

1 Comment

It should be noted that "names" should be read as "names in excel", not those you could choose or use as headers. The docs are not clear about this but it is worth mentionning, it gave me some headaches.
23

"usecols" should help, use range of columns (as per excel worksheet, A,B...etc.) below are the examples

1. Selected Columns

df = pd.read_excel(file_location,sheet_name='Sheet1', usecols="A,C,F")

2. Range of Columns and selected column

df = pd.read_excel(file_location,sheet_name='Sheet1', usecols="A:F,H")

3. Multiple Ranges

df = pd.read_excel(file_location,sheet_name='Sheet1', usecols="A:F,H,J:N")

4. Range of columns

df = pd.read_excel(file_location,sheet_name='Sheet1', usecols="A:N")

2 Comments

any ideas for limiting columns by number?
@rluts, replace usecols="A,C,F", with usecols=[0,2,5], incase of range of column numbers, usecols=range(2,9), depending on the requirement please replace the numbers.
22

parse_cols is deprecated, use usecols instead

that is:

df = pd.read_excel(file_loc, index_col=None, na_values=['NA'], usecols = "A,C:AA")

1 Comment

Note also this bug/unexpected behavior, which I ran into today. github.com/pandas-dev/pandas/issues/18273 Looks like using column names does not work with Excel...
11

If you know the names of the columns and do not want to use A,B,D or 0,4,7. This actually works

df = pd.read_excel(url)[['name of column','name of column','name of column','name of column','name of column']]

where "name of column" = columns wanted. Case and whitespace sensitive

Comments

2

Read any column's data in excel

import pandas as pd


name_of_file =  "test.xlsx"
data = pd.read_excel(name_of_file)

required_colum_name = "Post test Number"
print(data[required_colum_name])

Comments

0

Unfortunately these methods still seem to read and convert the headers before returning the subselection. I have an Excel sheet with duplicate header names because the sheet contains several similar tables. I want to read those tables individually, so I would want to apply usecols. However, this still add suffixes to the duplicate column names.

To reproduce:

  • create an Excel sheet with headers named Header1, Header2, Header1, Header2 under columns A, B, C, D
  • df.read_excel(filename, usecols='C:D')

df.columns will return ['Header1.1', 'Header2.1']

Is there way to circumvent this, aside from splitting and joining the resulting headers? Especially when it is unknown whether there are duplicate columns it is tricky to rename them, as splitting on '.' may be corrupting a non-duplicate header.

Edit: additionally, the length (in indeces) of a DataFrame based on a subset of columns will be determined by the length of the full file. So if column A has 10 rows, and column B only has 5, a DataFrame generated by usecols='B' will have 10 rows of which 5 filled with NaN's.

1 Comment

This should be posted as a separate question

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.