
I want to append about 700 million rows and 2 columns to a database, using the code below:

import pandas as pd
from sqlalchemy import create_engine

disk_engine = create_engine('sqlite:///screen-user.db')
chunksize = 1000000
j = 0

for df in pd.read_csv('C:/Users/xxx/Desktop/jjj.tsv', chunksize=chunksize,
                      header=None, names=['screen', 'user'],
                      sep='\t', encoding='utf-8'):
    df.to_sql('data', disk_engine, if_exists='append')
    j += 1
    print(j * chunksize)  # rows written so far

It is taking a really long time (I estimate it would take days). Is there a more efficient way to do this? In R, I have been using the data.table package to load large data sets, and it only takes about a minute. Is there a similar package in Python? As a tangential point, I also want to physically store this file on my Desktop. Right now, I am assuming 'data' is being stored as a temporary file. How would I do this?

Also assuming I load the data into a database, I want the queries to execute in a minute or less. Here is some pseudocode of what I want to do using Python + SQL:

# load data (600 million rows * 2 columns) into database
# def count(screens):
#     return count of distinct users for a given set of screens

Essentially, I am returning the number of distinct users for a given set of screens. Is the data too big for this task? I also want to merge this table with another table. Is there a reason why the fread function in R is so much faster?
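A minimal sketch of that pseudocode using the standard sqlite3 module (the table and column names `data`, `screen`, and `user` are assumed to match the import above; a tiny in-memory table stands in for the real file):

```python
import sqlite3

# Build a small in-memory stand-in for the 'data' table.
conn = sqlite3.connect(':memory:')  # use 'screen-user.db' for the real file
conn.execute('CREATE TABLE data (screen TEXT, user TEXT)')
conn.executemany('INSERT INTO data VALUES (?, ?)',
                 [('a', 'u1'), ('a', 'u2'), ('a', 'u1'), ('b', 'u3')])

def count_distinct_users(conn, screens):
    # COUNT(DISTINCT ...) over the screens of interest, passed as
    # bound parameters rather than string formatting.
    placeholders = ','.join('?' * len(screens))
    sql = ('SELECT COUNT(DISTINCT user) FROM data '
           'WHERE screen IN (%s)' % placeholders)
    return conn.execute(sql, screens).fetchone()[0]

print(count_distinct_users(conn, ['a']))  # 2 distinct users for screen 'a'
```

Whether this finishes in under a minute on 700 million rows depends mainly on having an index on `screen` (see the answers below for loading and indexing).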

  • 1
    Gotcha, so you are using SQLite. As for your question "is there a Python equivalent to R data tables". Pandas is that library. The slow part of your code is the database writing. Can you not count the distinct users from the df variable itself? Why do you need SQL? Commented Apr 24, 2016 at 1:58
  • I assumed a database would be faster to execute queries. As a new user to Python, how would I see the records for df? If I do print(df), I get the object name. I thought writing the data to a SQL db would be easier in terms of writing queries, and I can also view the output of my table. Also, I am not sure how long it would take to load the data in my pd.read_csv statement. Commented Apr 24, 2016 at 2:02
  • It probably would be a lot easier to write the query itself in SQL, yes, but as you've discovered, loading data into a database is slow. Personally, I would recommend you look into SparkSQL and worry about writing to a database file later. Commented Apr 24, 2016 at 2:07

2 Answers


If your goal is to import data from your TSV file into SQLite, you should try the native import functionality in SQLite itself. Just open the sqlite console program and do something like this:

sqlite> .separator "\t"
sqlite> .import C:/Users/xxx/Desktop/jjj.tsv screen-user

Don't forget to build appropriate indexes before doing any queries.
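For example, an index on the lookup column turns full-table scans into B-tree lookups. A sketch via sqlite3 (the table/column names `data` and `screen` are assumed to match the import above; a small synthetic table is used so the snippet is self-contained):

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # use 'screen-user.db' for the real file
conn.execute('CREATE TABLE data (screen TEXT, user TEXT)')
conn.executemany('INSERT INTO data VALUES (?, ?)',
                 [('s%d' % (i % 100), 'u%d' % i) for i in range(10000)])

# Index the column used in WHERE clauses.
conn.execute('CREATE INDEX idx_data_screen ON data(screen)')

# EXPLAIN QUERY PLAN confirms the index is actually used:
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT COUNT(DISTINCT user) "
    "FROM data WHERE screen = 's1'").fetchall()
print(plan)
```

Note that on 700 million rows the index build itself takes a while, which is why it is usually done once, after the bulk import.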


1 Comment

The same would be true for MySQL, btw. Just let the database handle the import by importing the whole file. I've done this with multi-GB files in reasonable times (tens of minutes).

As @John Zwinck has already said, you should probably use the RDBMS's native tools for loading that amount of data.

First of all, I think SQLite is not the proper tool/DB for 700 million rows, especially if you want to join/merge this data afterwards.

Depending on what kind of processing you want to do with your data after loading, I would either use the free MySQL or, if you can afford to run a cluster, Apache Spark SQL, and parallelize the processing of your data across multiple cluster nodes.

For loading your data into a MySQL DB you can and should use the native LOAD DATA tool.

Here is a great article showing how to optimize the data load process for MySQL (for different MySQL versions, MySQL options, MySQL storage engines: MyISAM and InnoDB, etc.).

Conclusion: use the DB's native tools for loading large amounts of CSV/TSV data efficiently instead of pandas, especially if your data doesn't fit into memory and you want to process (join/merge/filter/etc.) it after loading.

4 Comments

After using MySQL to load the data, is it easy to to interact with that database through python/pandas? Also is Spark useful if you are only using your laptop?
@zorny, no, the idea is that you don't want to use pandas unless all the data you want to process fits into memory, or you can easily process (join, group, aggregate, filter, etc.) your data in chunks using pandas, which is rarely the case.
@zorny, using Spark on one machine/laptop doesn't make much sense, except maybe for learning it... If you have no other options and have to process all your data on your laptop, you may try to do it directly in MySQL - it was designed for processing relational data ;)
I need a way to write functions that interact with the data. I want to utilize pandas machine learning functions since I want to build models. In MySQL, I can load the data but cannot write any functions.
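On the question of interacting with the loaded table from pandas: a minimal sketch using pandas.read_sql with chunksize, so only one chunk is in memory at a time. SQLite in-memory is used here so the snippet is self-contained; against MySQL you would pass a SQLAlchemy engine instead of the sqlite3 connection:

```python
import sqlite3
import pandas as pd

# Small stand-in for the server-side 'data' table.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE data (screen TEXT, user TEXT)')
conn.executemany('INSERT INTO data VALUES (?, ?)',
                 [('s%d' % (i % 3), 'u%d' % i) for i in range(9)])

# Stream the table back into pandas, chunk by chunk.
total = 0
for chunk in pd.read_sql('SELECT screen, user FROM data', conn, chunksize=4):
    total += len(chunk)  # per-chunk aggregation / model fitting goes here
print(total)  # 9
```

This way the database does the heavy filtering/joining, and pandas only ever sees a chunk-sized slice for the modeling work.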

