I want to append about 700 millions rows and 2 columns to a database. Using the code below:
disk_engine = create_engine('sqlite:///screen-user.db')
chunksize = 1000000
j = 0
index_start = 1
for df in pd.read_csv('C:/Users/xxx/Desktop/jjj.tsv', chunksize=chunksize, header = None, names=['screen','user'],sep='\t', iterator=True, encoding='utf-8'):
df.to_sql('data', disk_engine, if_exists='append')
count = j*chunksize
print(count)
print(j)
It is taking a really long time (I estimate it would take days). Is there a more efficient way to do this? In R, I have have been using the data.table package to load large data sets and it only take 1 minute. Is there a similar package in Python? As a tangential point, I want to also physically store this file on my Desktop. Right now, I am assuming 'data' is being stored as a temporary file. How would I do this?
Also assuming I load the data into a database, I want the queries to execute in a minute or less. Here is some pseudocode of what I want to do using Python + SQL:
#load data(600 million rows * 2 columns) into database
#def count(screen):
#return count of distinct list of users for a given set of screens
Essentially, I am returning the number of screens for a given set of users.Is the data too big for this task? I also want to merge this table with another table. Is there a reason why the fread function in R is much faster?
dfvariable itself? Why do you need SQL?