
I have access to a MS SQL Server database from which I retrieve data for analysis. I use a Mac and so can access the database with Navicat Essentials for SQL Server. That works really well. However, I would like to access the database using Python. I have installed a virtual environment for Python 3.4 and have installed various libraries including NumPy, pandas, pypyodbc and some others. I configured a DSN in the ODBC Manager app and I can access a table called 'Category' in the database using Python as follows:

import pandas as pd
import pypyodbc

connectionName = pypyodbc.connect('DSN=myDSNName')

queryName = 'SELECT ID, CategoryName FROM Category'

retrievedDataDF = pd.io.sql.read_sql(queryName, con=connectionName)

connectionName.close()

print(retrievedDataDF.head())
print(retrievedDataDF.columns)

This seems to work fine except that the column headings in the returned dataframe are represented in some form of binary format; in this case, the column headings are b'i' and b'c'. The output from the print functions is:

   b'i'     b'c'
0     1  missing
1     2     blue
2     3      red
3     4    green
4     5   yellow

Index([b'i', b'c'], dtype='object')

I don't recall having this problem previously and I can't find any reference to similar issues online. As a result, I can't work out what is going on.

Any suggestions would be appreciated.

EDIT: Following comments by Joris, the following may be useful:

connectionName.cursor().execute(queryName).description

[(b'i', int, 11, 10, 10, 0, False), (b'c', str, 100, 100, 100, 0, True)]

Versions of all installed libraries are given below:

From Terminal

$ env/bin/pip list

appnope (0.1.0) decorator (4.0.4) gnureadline (6.3.3) ipykernel (4.1.1) ipython (4.0.0) ipython-genutils (0.1.0) ipywidgets (4.1.1) jdcal (1.0) Jinja2 (2.8) jsonschema (2.5.1) jupyter (1.0.0) jupyter-client (4.1.1) jupyter-console (4.0.3) jupyter-core (4.0.6) MarkupSafe (0.23) matplotlib (1.4.3) mistune (0.7.1) nbconvert (4.0.0) nbformat (4.0.1) nose (1.3.7) notebook (4.0.6) numexpr (2.4.3) numpy (1.10.1) openpyxl (2.2.4) pandas (0.17.0) pandastable (0.4.0) path.py (8.1.2) pexpect (4.0.1) pickleshare (0.5) pip (1.5.6) ptyprocess (0.5) Pygments (2.0.2) pyparsing (2.0.3) pypyodbc (1.3.3) python-dateutil (2.4.2) pytz (2015.6) pyzmq (14.7.0) qtconsole (4.1.0) scipy (0.16.1) setuptools (3.6) simplegeneric (0.8.1) six (1.9.0) terminado (0.5) tornado (4.2.1) traitlets (4.0.0) xlrd (0.9.3)

From within virtual environment

import pandas as pd
pd.show_versions(as_json=False)

INSTALLED VERSIONS

commit: None python: 3.4.1.final.0 python-bits: 64 OS: Darwin OS-release: 15.2.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: None LANG: en_GB.UTF-8

pandas: 0.17.0 nose: 1.3.7 pip: 1.5.6 setuptools: 3.6 Cython: None numpy: 1.10.1 scipy: 0.16.1 statsmodels: None IPython: 4.0.0 sphinx: None patsy: None dateutil: 2.4.2 pytz: 2015.6 blosc: None bottleneck: None tables: None numexpr: 2.4.3 matplotlib: 1.4.3 openpyxl: 2.2.4 xlrd: 0.9.3 xlwt: None xlsxwriter: None lxml: None bs4: None html5lib: None httplib2: None apiclient: None sqlalchemy: None pymysql: None psycopg2: None

(Since then, I've installed sqlalchemy 1.0.10 but I'm still working on trying to connect using SQLAlchemy.)
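For reference, a sketch of the SQLAlchemy route suggested in the comments. The connection URL below is hypothetical (it reuses the placeholder DSN and credentials from above), and actually creating the engine still needs pyodbc to be installed, which is exactly the sticking point described in EDIT 2:

```python
# Hypothetical mssql+pyodbc connection URL built from the placeholders used
# elsewhere in this question; SQLAlchemy's create_engine() would be called
# with this string once pyodbc is available.
dsn_name = 'myDSNName'
user = 'myUserName'
password = 'myPassword'
url = 'mssql+pyodbc://{}:{}@{}'.format(user, password, dsn_name)

# With pyodbc installed, the engine would then be created and passed to pandas:
# from sqlalchemy import create_engine
# engine = create_engine(url)
# df = pd.read_sql(queryName, con=engine)
print(url)  # mssql+pyodbc://myUserName:myPassword@myDSNName
```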

EDIT 2

I have failed to connect using SQLAlchemy to create an engine because I couldn't get pyodbc to install on a Mac running El Capitan (pip install fails with a fatal error caused by a missing sql.h header file), and SQLAlchemy seems to require pyodbc. I generally use pypyodbc instead, but SQLAlchemy can't use pypyodbc in place of pyodbc. I have, however, successfully connected to the database using the following:

phjConnection = pypyodbc.connect(driver='{Actual SQL Server}',
                                 server='myServerName',
                                 uid='myUserName',
                                 pwd='myPassword',
                                 db='myDBName',
                                 port='1433')
phjQuery = '''SELECT ID, CategoryName FROM Category'''
phjLatestData = pd.io.sql.read_sql(phjQuery, con=phjConnection)

Not sure if that achieves the same goal suggested by Joris, but the problem still exists, namely:

print(phjLatestData.head())

   b'i'     b'c'
0     1  missing
1     2     blue
2     3      red
3     4    green
4     5   yellow
  • Can you show the output of retrieveDataDF.head() and retrieveDataDF.columns ? Commented Dec 18, 2015 at 12:16
  • I've edited the original to include those outputs. Commented Dec 18, 2015 at 13:26
  • What does connectionName.cursor().execute(queryName).description give you? Commented Dec 18, 2015 at 14:05
  • Can you also mention the pandas version you are using? Further, could you try the same but using an SQLAlchemy connection? (so giving an engine instead of connectionName to read_sql, to create this engine, this will be similar to docs.sqlalchemy.org/en/latest/dialects/…) Commented Dec 18, 2015 at 14:10
  • Are you sure it's b'xxx' and not u'xxx'? Commented Dec 18, 2015 at 18:41

1 Answer


This seems to be a problem with the pypyodbc driver itself. Pandas constructs the column names for the resulting dataframe from information it gets from the query result, in particular its description attribute.
If you run this manually, you get (copied from your edit):

>>> connectionName.cursor().execute(queryName).description
[(b'i', int, 11, 10, 10, 0, False), (b'c', str, 100, 100, 100, 0, True)]

Normally, the first value in each tuple should be the column name. But here, it gives you only its first character, as a byte.
This seems to be a known problem in some environments (specifically Python 3, I think); at least it has already been reported: https://code.google.com/p/pypyodbc/issues/detail?id=43
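Until that is fixed in pypyodbc, a possible workaround is to decode the byte labels and, since the names come back truncated to one character anyway, rename the columns to match the ones selected in the query. A minimal sketch (the dataframe here just mimics the output shown in the question):

```python
import pandas as pd

# Mimic the dataframe from the question: truncated, bytes column labels.
df = pd.DataFrame({b'i': [1, 2], b'c': ['missing', 'blue']})

# Decode any bytes labels into plain strings...
df.columns = [c.decode() if isinstance(c, bytes) else c for c in df.columns]

# ...and, because pypyodbc truncated the names to one character anyway,
# rename them to the columns selected in the query.
df.columns = ['ID', 'CategoryName']

print(df.columns.tolist())  # ['ID', 'CategoryName']
```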
