
I have two very large csv files that contain the same variables. I want to combine them into one table inside an SQLite database, if possible using R.

I successfully managed to put both csv files into separate tables inside one database using inborutils::csv_to_sqlite, which imports small chunks of data at a time.

Is there a way to create a third table where both tables are simply appended, using R (keeping in mind the limited RAM)? And if not, how else can I perform this task? Maybe via the terminal?
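For illustration, the per-file import might look roughly like this (file and table names are placeholders, and I am assuming the usual csv_file, sqlite_file, table_name argument order of inborutils::csv_to_sqlite):

library(inborutils)

# placeholders: point these at the real csv files and the target database file
csv_to_sqlite("file1.csv", "combined.sqlite", "table1")
csv_to_sqlite("file2.csv", "combined.sqlite", "table2")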

1 Answer


We assume that when the question refers to the "same variables" it means that the two tables have the same column names. Below we create two such test tables, BOD and BOD2, and then combine them in the create statement, producing the table both. This does the combining entirely on the SQLite side. Finally, we look at both.

library(RSQLite)
con <- dbConnect(SQLite())  # modify to refer to existing SQLite database

dbWriteTable(con, "BOD", BOD)
dbWriteTable(con, "BOD2", 10 * BOD)

dbExecute(con, "create table both as select * from BOD union select * from BOD2")

dbReadTable(con, "both")

dbDisconnect(con)
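Note that UNION discards duplicate rows, which takes extra work; if the goal is a plain append that keeps every row from both tables, UNION ALL expresses that directly. A minimal variation, to be run while the connection is still open:

# keep all rows from both tables, including duplicates
dbExecute(con, "create table both as
                select * from BOD union all select * from BOD2")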

6 Comments

I can reproduce your example; however, when I run the code with my data it has already been running for ~3 hours now. Of course this depends on the structure and amount of data as well as on my notebook, but would you be able to take an educated guess how long this might take for two tables with around 20 million rows and 36 columns? Just so I know whether I should expect it to take days. (I know, I specifically asked for a solution in R, and if it takes a few hours I do not care, since I only have to fulfill that task once. However, would there be other (faster) options as well?)
As stated in the answer, this is done entirely on the SQLite side. The create statement does not pass the data through R at all.
Ah, of course, my bad. Still, regarding the expected time - putting the csv files inside the database chunk by chunk took around 1 hour per table/csv file. Since I do not know what is going on on the SQLite side when I append both tables, could you maybe explain? Is the creation of the appended table so much more time consuming than the creation of the tables in the database from the csv files?
You can try it with different numbers of records, timing each one, and then plot the time vs. number of records to see if you can determine the shape of the curve and extrapolate it. Since you have both as csv, you could concatenate the csv files and then read the result in. You could also try different databases.
I had to stop the process after ~10 hours, but I found another workaround: combining the two csv files via the terminal and then putting the result into the SQLite database via inborutils::csv_to_sqlite. I accept the answer though, since it works in the example you provided (and possibly also for my data if I gave it more time).
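For completeness, the same concatenation can also be done from R itself without loading either file into memory; a minimal sketch, with placeholder file names and assuming both files share an identical header row:

# combined.csv ends up with file1's header plus all data rows from both files
file.copy("file1.csv", "combined.csv", overwrite = TRUE)

inp <- file("file2.csv", open = "r")
out <- file("combined.csv", open = "a")
readLines(inp, n = 1)                  # skip file2's duplicate header row
repeat {
  chunk <- readLines(inp, n = 100000)  # read a bounded number of lines at a time
  if (length(chunk) == 0) break
  writeLines(chunk, out)
}
close(inp)
close(out)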