
I have a tabledata.csv file and I have been using pandas.read_csv to read it and select specific columns under specific conditions.

For instance, I use the following code to select all "name" values where session_id == 1, which works fine in an IPython Notebook on Data Scientist Workbench.

df = pandas.read_csv('/resources/data/findhelp/tabledata.csv')
df['name'][df['session_id']==1]

I just wonder, after I have read the csv file, is it possible to somehow "switch/read" it as a SQL database? (I am pretty sure that I did not explain it well using the correct terms, sorry about that!) What I want is to use SQL statements in the IPython notebook to choose specific rows under specific conditions, something like:

Select `name`, count(distinct `session_id`) from tabledata where `session_id` like "100.1%" group by `session_id` order by `session_id`

But I guess I need to figure out a way to convert the csv file into another form so that I can use SQL statements on it. Many thanks!
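For reference, a filter-group-aggregate query of that shape can also be expressed directly in pandas. This is a rough sketch with made-up sample data (the real tabledata.csv is not shown), and it counts distinct names per session rather than distinct session_ids, since COUNT(DISTINCT session_id) grouped by session_id is always 1:

```python
import pandas as pd

# hypothetical sample data standing in for tabledata.csv
df = pd.DataFrame({
    "name": ["alice", "bob", "carol"],
    "session_id": ["100.1", "100.12", "200.3"],
})

# WHERE session_id LIKE "100.1%"  ->  string prefix match
filtered = df[df["session_id"].astype(str).str.startswith("100.1")]

# GROUP BY session_id ... ORDER BY session_id
result = (
    filtered.groupby("session_id", as_index=False)
    .agg(names=("name", "nunique"))
    .sort_values("session_id")
)
```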

  • You might want to look at Blaze, which provides a common interface (not SQL) to query and process data stored in different formats, or at Odo, which can easily move data between different formats (e.g. it can load csv into an sql database). Commented Apr 5, 2016 at 16:44
  • This is a great introduction to how pandas compares to SQL: pandas.pydata.org/pandas-docs/version/0.18.0/…. In the meantime, could you Commented Apr 5, 2016 at 17:41
  • Also, would it possible to provide a df.head() or an example of the data you are working with? Commented Apr 5, 2016 at 17:45
  • @measureallthethings Thanks! These comparisons are extremely useful! Commented Apr 6, 2016 at 1:28

2 Answers


Here is a quick primer on pandas and SQL, using the built-in sqlite3 package. Generally speaking, you can do all SQL operations in pandas one way or another, but databases are of course still useful. The first thing you need to do is store the original df in a SQL database so that you can query it. The steps are listed below.

import pandas as pd
import sqlite3

# read the CSV
df = pd.read_csv('/resources/data/findhelp/tabledata.csv')
# connect to a database (if the .db file does not exist, this creates it in the current directory)
conn = sqlite3.connect("Any_Database_Name.db")
# store your table in the database
df.to_sql('Some_Table_Name', conn)
# read a SQL query out of your database and into a pandas dataframe
sql_string = 'SELECT * FROM Some_Table_Name'
df = pd.read_sql(sql_string, conn)
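With the table stored, a grouped query like the one in the question runs through the same connection. A minimal sketch, using an in-memory database and made-up sample data since the real tabledata.csv isn't shown:

```python
import pandas as pd
import sqlite3

# hypothetical sample data standing in for tabledata.csv
df = pd.DataFrame({
    "name": ["alice", "bob", "carol"],
    "session_id": [1, 1, 2],
})

conn = sqlite3.connect(":memory:")  # in-memory db; pass a filename to persist
df.to_sql("tabledata", conn, index=False)

# count distinct names per session, ordered by session
result = pd.read_sql(
    "SELECT session_id, COUNT(DISTINCT name) AS names "
    "FROM tabledata GROUP BY session_id ORDER BY session_id",
    conn,
)
```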

6 Comments

Thanks!!! I was testing your code and I got an error: "'utf-8' codec can't decode byte 0x89 in position 27: invalid start byte". Any chance you could help me with it? Thanks!
Can you tell me what line of code it is on? You may be able to fix it by doing df = pd.read_csv('/resources/data/findhelp/tabledata.csv', encoding='utf8')
It's on that df line (df = pd.read_csv...). I added encoding='utf8' but it still gave me the same error. Could something be wrong with my csv file? Thanks!
Basically you just need to try a few different encodings. For example, try: df = pd.read_csv('filename.csv', encoding="ISO-8859-1"). It has to do with the fact that you have special characters in your csv, I think. Do you know what encoding the CSV should be?
To be honest, I don't know what encoding the csv is; I simply saved it as a csv file on my Mac. Does that matter?

Another answer suggested using SQLite. However, DuckDB is a much faster alternative to loading your data into SQLite.

First, loading your data into SQLite will take time; second, SQLite is not optimized for analytical queries (e.g., aggregations).

Here's a full example you can run in a Jupyter notebook:

Installation

pip install jupysql duckdb duckdb-engine

Note: if you want to run this in a notebook, use %pip install jupysql duckdb duckdb-engine

Example

Load extension (%sql magic) and create in-memory database:

%load_ext sql
%sql duckdb://

Download some sample CSV data:

from urllib.request import urlretrieve

urlretrieve("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv", "penguins.csv")

Query:

%%sql
SELECT species, COUNT(*) AS count
FROM penguins.csv
GROUP BY species
ORDER BY count DESC

See the JupySQL documentation for more details.
