Bug in pandas query() method?

Question

I was experimenting with several use cases for the pandas query() method and tried one argument that threw an exception, yet still caused an unwanted modification to the data in my DataFrame.

In [549]: syn_fmax_sort
Out[549]: 
     build_number      name    fmax
0             390     adpcm  143.45
1             390       aes  309.60
2             390     dfadd  241.02
3             390     dfdiv   10.80
....
211           413     dfmul  215.98
212           413     dfsin   11.94
213           413       gsm  194.70
214           413      jpeg  197.75
215           413      mips  202.39
216           413     mpeg2  291.29
217           413       sha  243.19

[218 rows x 3 columns]

I wanted to use query() to extract the subset of this DataFrame where build_number is 392, so I tried:

In [550]: syn_fmax_sort.query('build_number = 392')

That raised a ValueError: cannot label index with a null key exception, and on top of that it gave me back the full DataFrame with every build_number set to 392:

In [551]: syn_fmax_sort
Out[551]: 
     build_number      name    fmax
0             392     adpcm  143.45
1             392       aes  309.60
2             392     dfadd  241.02
3             392     dfdiv   10.80
....
211           392     dfmul  215.98
212           392     dfsin   11.94
213           392       gsm  194.70
214           392      jpeg  197.75
215           392      mips  202.39
216           392     mpeg2  291.29
217           392       sha  243.19

[218 rows x 3 columns]

However, I have since figured out how to get only the rows with value 392: if I use syn_fmax_sort.query('391 < build_number < 393'), it works.

So my question is: is the behavior I observed above, when I queried the DataFrame incorrectly, due to a bug in the query() method?
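
For reference, here is a minimal sketch of the working chained-comparison query next to a plain equality test. The small DataFrame is a made-up sample with the same column names as syn_fmax_sort, not the real data; on an integer column both expressions should select the same rows.

```python
import pandas as pd

# Hypothetical sample with the same columns as syn_fmax_sort (values are illustrative only)
sample = pd.DataFrame({
    "build_number": [390, 392, 392, 413],
    "name": ["adpcm", "aes", "dfadd", "sha"],
    "fmax": [143.45, 309.60, 241.02, 243.19],
})

# Chained comparison, as in the workaround above
chained = sample.query("391 < build_number < 393")

# Equality test: note the double equals sign
equal = sample.query("build_number == 392")

print(chained)
print(chained.equals(equal))
```

Both queries return a new DataFrame and leave `sample` unchanged.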

I don't know why you got the error, but don't you want this instead: syn_fmax_sort.query('build_number == 392')
– EdChum, Commented Feb 25, 2015 at 8:52
Oh right, yes, that works too! I thought == caused the same problem, but I didn't mention it above. I think that's because I tried == after =, and the = had already corrupted my DataFrame, so I assumed == wouldn't work.
– AKKO, Commented Feb 25, 2015 at 8:57
OK, I think the problem is that you are passing a query that assigns the value to the column (this looks unintentional). The error itself looks spurious: internally it seems to think you're trying to get a value, creates an index, then checks the length and concludes you're indexing with an invalid key. Technically it should probably raise a KeyError, but I don't know how your query was parsed.
– EdChum, Commented Feb 25, 2015 at 8:58
@AKKO This is indeed a bug (it should not alter your data), and there is already an open ticket about it: github.com/pydata/pandas/issues/8664
– joris, Commented Feb 25, 2015 at 13:00
@AKKO You're certainly welcome to look into this issue! Just chime in on the GitHub issue with any questions.
– joris, Commented Feb 26, 2015 at 12:53

EdChum · Accepted Answer · 2015-02-25 09:00:15Z

It looks like you had a typo: you probably wanted to use == rather than =. A simple example shows the same problem:

In [286]:

df = pd.DataFrame({'a':np.arange(5)})
df
Out[286]:
   a
0  0
1  1
2  2
3  3
4  4
In [287]:

df.query('a = 3')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-287-41cfa0572737> in <module>()
----> 1 df.query('a = 3')

C:\WinPython-64bit-3.4.2.4\python-3.4.2.amd64\lib\site-packages\pandas\core\frame.py in query(self, expr, **kwargs)
   1923             # when res is multi-dimensional loc raises, but this is sometimes a
   1924             # valid query
-> 1925             return self[res]
   1926 
   1927     def eval(self, expr, **kwargs):

C:\WinPython-64bit-3.4.2.4\python-3.4.2.amd64\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
   1778             return self._getitem_multilevel(key)
   1779         else:
-> 1780             return self._getitem_column(key)
   1781 
   1782     def _getitem_column(self, key):

C:\WinPython-64bit-3.4.2.4\python-3.4.2.amd64\lib\site-packages\pandas\core\frame.py in _getitem_column(self, key)
   1785         # get column
   1786         if self.columns.is_unique:
-> 1787             return self._get_item_cache(key)
   1788 
   1789         # duplicate columns & possible reduce dimensionaility

C:\WinPython-64bit-3.4.2.4\python-3.4.2.amd64\lib\site-packages\pandas\core\generic.py in _get_item_cache(self, item)
   1066         res = cache.get(item)
   1067         if res is None:
-> 1068             values = self._data.get(item)
   1069             res = self._box_item_values(item, values)
   1070             cache[item] = res

C:\WinPython-64bit-3.4.2.4\python-3.4.2.amd64\lib\site-packages\pandas\core\internals.py in get(self, item, fastpath)
   2856                         loc = indexer.item()
   2857                     else:
-> 2858                         raise ValueError("cannot label index with a null key")
   2859 
   2860             return self.iget(loc, fastpath=fastpath)

ValueError: cannot label index with a null key

It looks like internally it's trying to build an index from your query, then checks the length and, because it's 0, raises a ValueError (it should probably be a KeyError). I don't know how your query was evaluated, but perhaps assigning values to columns inside query() is unsupported at the moment.
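
As a sketch of the intended usage, the same toy DataFrame can be filtered with == inside query(), or with an ordinary boolean mask. This only illustrates the corrected comparison; how the mistaken `a = 3` form behaves may differ between pandas versions, so it is not run here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(5)})

# Comparison inside query() uses ==, exactly as in plain Python
matched = df.query("a == 3")

# Equivalent boolean-mask form, with no expression string involved
masked = df[df["a"] == 3]

print(matched)
# The original frame is not modified by either selection
print(df["a"].tolist())
```

If there is any doubt about side effects, running the query on `df.copy()` keeps the original data safe.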

lordy... this was my issue too, though not really a "typo": it comes from working back and forth between Postgres (where it's =) and pandas all day!
