I tried searching for the answer to this question but was not able to find it... so here it goes.

I have a dataset with 23987 columns. I only want the information in 35 of those columns, which are spread throughout the dataset. I have put the names of these 35 columns in a list. Is there a quick way to drop all the columns except those by passing the list?

I tried this:

df1.drop(df1.columns.difference([ALTJ_genes]), axis=1, inplace=True)

ALTJ_genes is the list with the 35 items. The error I get is:

TypeError: unhashable type: 'list'

Is there a way to do this? I know I can reach my goal by passing the individual columns, but I want to know whether it is possible with the list. This would make the code much clearer.

In any case, thanks!

EDIT: I have added some screenshots in case they are useful.

The first screenshot shows the head of the dataframe. The second screenshot shows how I can select one column.

Now, this is the complete error I get when passing the list with all the genes.

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
----> 1 df1[ALTJ_genes]

/opt/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py in __getitem__(self, key)
   2984             if is_iterator(key):
   2985                 key = list(key)
-> 2986             indexer = self.loc._convert_to_indexer(key, axis=1, raise_missing=True)
   2987
   2988         # take() does not accept boolean indexers

/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py in _convert_to_indexer(self, obj, axis, is_setter, raise_missing)
   1283                 # When setting, missing keys are not allowed, even with .loc:
   1284                 kwargs = {"raise_missing": True if is_setter else raise_missing}
-> 1285                 return self._get_listlike_indexer(obj, axis, **kwargs)[1]
   1286             else:
   1287                 try:

/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py in _get_listlike_indexer(self, key, axis, raise_missing)
   1090
   1091         self._validate_read_indexer(
-> 1092             keyarr, indexer, o._get_axis_number(axis), raise_missing=raise_missing
   1093         )
   1094         return keyarr, indexer

/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py in _validate_read_indexer(self, key, indexer, axis, raise_missing)
   1175                 raise KeyError(
   1176                     "None of [{key}] are in the [{axis}]".format(
-> 1177                         key=key, axis=self.obj._get_axis_name(axis)
   1178                     )
   1179                 )

KeyError: "None of [Index([ ('APEX1',), ('ASF1A',), ('CDKN2D',), ('CIB1',), ('DNA2',),\n ('FAAP24',), ('FANCM',), ('GEN1',), ('HRAS',), ('LIG1',),\n ('LIG3',), ('MEN1',), ('MRE11',), ('MSH3',), ('MSH6',),\n ('NUDT1',), ('MTOR',), ('NABP2',), ('NTHL1',), ('PALB2',),\n ('PARP1',), ('PARP3',), ('POLA1',), ('POLM',), ('POLQ',),\n ('PRPF19',), ('RAD51D',), ('RBBP8',), ('RRM2',), ('RUVBL2',),\n ('SOD1',), ('KAT5',), ('UNG',), ('WRN',), ('XRCC1',)],\n dtype='object', name='Gene_Name')] are in the [columns]"

1 Answer

I think you need to remove the [] because ALTJ_genes is already a list, so [ALTJ_genes] is a nested list:

df1.drop(df1.columns.difference(ALTJ_genes), axis=1, inplace=True)
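
As a minimal sketch of that pattern on a toy frame (the frame and the `keep` list are made up for illustration, not taken from the original data), `Index.difference` returns the column labels that are not in the list, and those are the ones passed to `drop`:

```python
import pandas as pd

# Toy frame standing in for the real 23987-column dataset (illustrative only).
df = pd.DataFrame([[1, 2, 3, 4]], columns=["APEX1", "ASF1A", "CDKN2D", "AAA"])

# Flat list of column names to keep (hypothetical stand-in for ALTJ_genes).
keep = ["APEX1", "CDKN2D"]

# Labels present in the frame but absent from `keep`.
to_drop = df.columns.difference(keep)

# Dropping them leaves only the wanted columns, in their original order.
trimmed = df.drop(columns=to_drop)
print(trimmed)
```

Using `drop(columns=...)` without `inplace=True` returns a new frame and leaves `df` untouched, which tends to be easier to reason about.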

But it is simpler to select the columns by list:

df1 = df1[ALTJ_genes]
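
Selecting with a list raises a KeyError when a name in the list is not among the columns. If that is a concern, one possible variant is a boolean mask built with `Index.isin`, which simply skips names that are absent. This is only a sketch on a toy frame; the column names and the `keep` list are illustrative assumptions:

```python
import pandas as pd

# Toy frame (illustrative only).
df = pd.DataFrame([[1, 2, 3, 4]], columns=["APEX1", "ASF1A", "CDKN2D", "AAA"])

# Hypothetical list of wanted names; "NOT_A_GENE" is deliberately absent from df.
keep = ["APEX1", "CDKN2D", "NOT_A_GENE"]

# Boolean mask over the columns: True where the label is in `keep`.
mask = df.columns.isin(keep)

# Keep only the matching columns; names missing from df are ignored.
subset = df.loc[:, mask]

# Report which requested names were not found, so typos do not go unnoticed.
missing = [name for name in keep if name not in df.columns]
print(subset)
print(missing)
```

The trade-off is that a silent skip can hide a misspelled gene name, which is why the sketch also collects the missing names.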

EDIT:

I think the problem is that the columns were defined with a nested list, which produces a non-standard one-level MultiIndex:

df1 = pd.DataFrame([[1,2,3,4]])
#nested list
df1.columns = [['APEX1', 'ASF1A', 'CDKN2D', 'AAA']]
print (df1) 
  APEX1 ASF1A CDKN2D AAA
0     1     2      3   4

print (df1.columns)
MultiIndex([( 'APEX1',),
            ( 'ASF1A',),
            ('CDKN2D',),
            (   'AAA',)],
           )

If you pass a non-nested list, you get a regular Index:

df1 = pd.DataFrame([[1,2,3,4]])
#not nested list
df1.columns = ['APEX1', 'ASF1A', 'CDKN2D', 'AAA']
print (df1) 
   APEX1  ASF1A  CDKN2D  AAA
0      1      2       3    4

print (df1.columns)
Index(['APEX1', 'ASF1A', 'CDKN2D', 'AAA'], dtype='object')
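
If the frame already has such a one-level MultiIndex, one way to get back to a regular Index is `get_level_values(0)`. The sketch below also unwraps 1-tuples in the list of names, on the assumption (suggested by the KeyError message, but not confirmed) that the list itself holds tuples like ('APEX1',). The name `genes_raw` is hypothetical:

```python
import pandas as pd

# Reproduce the situation above: a nested list yields a one-level MultiIndex.
df = pd.DataFrame([[1, 2, 3, 4]])
df.columns = [["APEX1", "ASF1A", "CDKN2D", "AAA"]]

# Flatten the columns to a regular Index of strings, only if needed.
if isinstance(df.columns, pd.MultiIndex) and df.columns.nlevels == 1:
    df.columns = df.columns.get_level_values(0)

# Hypothetical gene list holding 1-tuples, as the KeyError message suggests.
genes_raw = [("APEX1",), ("CDKN2D",)]

# Unwrap each 1-tuple to its string; plain strings pass through unchanged.
genes = [g[0] if isinstance(g, tuple) else g for g in genes_raw]

# With flat columns and a flat list, plain list selection works.
subset = df[genes]
print(subset)
```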

15 Comments

Thanks for this, Jezrael! I had tried selecting the columns this way but it did not work. Maybe I am doing something incorrectly: when I select multiple columns by typing them out, as in df1[['APEX1', 'ASF1A', 'CDKN2D']], I get a result, but when I pass my list it says "None of [Index([ ('APEX1',), ('ASF1A',), ('CDKN2D',) ... name='Gene_Name')] are in the [columns]". Truly lost! But thanks a lot, it got me thinking and gave me some new ideas to try!
@JourneyDS - How is the DataFrame created?
@JourneyDS - The answer was edited to explain the possible problem.
The DataFrame comes from a tab-separated CSV file. I can post more information if it helps to see the problem.
Maybe I clicked twice; it should be green now.