
I am trying to import a dataset (spread over many csv files) into R and then write the data into a table in a PostgreSQL database.

I successfully connected to the database and created a loop to import the csv files, but when I try the import R returns an error because my PC runs out of memory.

My question is: is there a way to create a loop that imports the files one after another, writes each one into the PostgreSQL table and then removes it from memory? That way I would not run out of memory.

Code which returns the memory error:

# load the required packages
library(RPostgreSQL)   # provides the PostgreSQL driver (loads DBI)
library(readr)         # read_csv()
library(dplyr)         # %>% and bind_rows()

# connect to PostgreSQL database
db_tankdata <- 'tankdaten'
host_db     <- 'localhost'
db_port     <- '5432'
db_user     <- 'postgres'
db_password <- 'xxx'
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname = db_tankdata, host = host_db,
                 port = db_port, user = db_user, password = db_password)

# check if the connection was successful
dbExistsTable(con, "prices")

# create a function to load multiple csv files
import_csvfiles <- function(path){
  files <- list.files(path, pattern = "*.csv", recursive = TRUE, full.names = TRUE)
  lapply(files, read_csv) %>% bind_rows() %>% as.data.frame()
}

# import the files
prices <- import_csvfiles("path...")
dbWriteTable(con, "prices", prices, append = TRUE, row.names = FALSE)

Thanks in advance for the feedback!

  • All the code pieces work; you just have to throttle down to account for RAM. If the csv files are of similar size, test how many you can handle at a time and process them in batches of that size. A handy notation is, for example, lapply(files[1:10], read_csv), or having import_csvfiles("path...") work on only the first ten files at a time. Once you have established how much you can consume, wrap your import in (horrors!) a for loop to achieve the throttling. Pop open a terminal and use htop to keep an eye on RAM use, and close unnecessary programs. Just a way to think about the problem. HTH Commented Apr 19, 2020 at 13:26
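A minimal sketch of that batching idea, assuming the connection con, the readr/dplyr functions and the folder path from the question are already in place; the batch size of 10 is only an example value to tune against your available RAM:

files <- list.files("path...", pattern = "*.csv", recursive = TRUE, full.names = TRUE)
batch_size <- 10                                    # example value, tune to your RAM

for (i in seq(1, length(files), by = batch_size)) {
    batch  <- files[i:min(i + batch_size - 1, length(files))]   # next batch of file paths
    prices <- lapply(batch, read_csv) %>% bind_rows() %>% as.data.frame()
    dbWriteTable(con, "prices", prices, append = TRUE, row.names = FALSE)
    rm(prices); gc()                                # free the batch before reading the next one
}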

2 Answers


If you change the lapply() to include an anonymous function, you can read each file and write it to the database, reducing the amount of memory required. Since lapply() acts as an implied for() loop, you don't need an extra looping mechanism.

import_csvfiles <- function(path){
     files <- list.files(path, pattern = "*.csv", recursive = TRUE, full.names = TRUE)
     lapply(files, function(x){
          prices <- read.csv(x)                                             # read one file at a time
          dbWriteTable(con, "prices", prices, append = TRUE, row.names = FALSE)  # write it, then move on
     })
}
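A possible way to call it, assuming the connection con from the question is already open; dbDisconnect() from the same DBI/RPostgreSQL packages closes the connection once everything has been written:

import_csvfiles("path...")   # reads and writes one file at a time
dbDisconnect(con)            # close the connection when the import is done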

6 Comments

And you don't need prices[x] because there should be only one file in the dbWriteTable queue at any given pass?
@Chris - correct. Within the environment of lapply(), the prices object contains one and only one input file for each iteration within lapply().
This is great to drive home, within the context of anonymous function in lapply, for all the cases of exploding memory that come up. I'll be able to use this profitably today.
This did it for me! Thanks!
Done, Thanks again!

I assume that the csv files you are importing into your database are very large? As far as I know, with the code you have written R first wants to store all of the data in one data frame, keeping everything in memory. The alternative is to read each CSV file in chunks, as you would with Python's Pandas.

When calling ?read.csv I saw the following output:

nrows : the maximum number of rows to read in. Negative and other invalid values are ignored.

skip : the number of lines of the data file to skip before beginning to read data.

Why don't you try to read 5000 rows at a time into a data frame, write them to the PostgreSQL database, and repeat until the file is exhausted, doing this for each file?

For example, for each file do the following:

number_of_lines = 5000                 # Number of rows to read at a time
row_skip = 0                           # number of lines to skip before the current chunk
col_names = NULL                       # column names, filled in from the first chunk
keep_reading = TRUE                    # We will change this value to stop the while

while (keep_reading) {
    if (is.null(col_names)) {
        # First chunk: read the header line together with the data
        my_data <- read.csv(x, nrows = number_of_lines)
        col_names <- names(my_data)
        row_skip <- 1                  # the extra 1 accounts for the header line
    } else {
        # Later chunks: the header is already behind us, so pass the column names explicitly
        my_data <- read.csv(x, nrows = number_of_lines, skip = row_skip,
                            header = FALSE, col.names = col_names)
    }
    dbWriteTable(con, "prices", my_data, append = TRUE, row.names = FALSE) # Write to the DB

    row_skip = row_skip + nrow(my_data)

    # Exit statement: a chunk shorter than number_of_lines means the end of the file was reached
    if (nrow(my_data) < number_of_lines) {
        keep_reading = FALSE
    } # end-if
} # end-while

By doing this you are breaking the csv up into smaller parts. You can play around with the number_of_lines variable to reduce the number of loop iterations. This may seem a bit hacky with a loop involved, but I'm sure it will work.

2 Comments

Do I have to run the code for each file then? Because I have over 2.000 csv files to import. I think Len Greski's answer was the simpler way in my case. Thanks anyway!
Yes, it would then have to be run for every csv file, but I am sure it will still be fast. If you have big memory constraints this is an extra option: it breaks the reading of each csv file into smaller parts and limits the amount of memory used via the number_of_lines variable. My pleasure, happy to contribute!
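For the many-files case from this comment thread, here is one possible sketch that pulls the chunked reading into a helper and applies it to every file. import_file_chunked is a hypothetical name, con is the open connection from the question, and chunk_size is only an example default:

import_file_chunked <- function(x, con, chunk_size = 5000) {
    # read the header line once so every chunk keeps the right column names
    # (for strictly consistent column types you could also pass colClasses)
    col_names <- names(read.csv(x, nrows = 1))
    row_skip  <- 1                                  # skip the header line from now on
    repeat {
        my_data <- tryCatch(
            read.csv(x, nrows = chunk_size, skip = row_skip,
                     header = FALSE, col.names = col_names),
            error = function(e) NULL)               # NULL once we are past the end of the file
        if (is.null(my_data) || nrow(my_data) == 0) break
        dbWriteTable(con, "prices", my_data, append = TRUE, row.names = FALSE)
        row_skip <- row_skip + nrow(my_data)
        if (nrow(my_data) < chunk_size) break       # a short chunk means the file is finished
    }
}

files <- list.files("path...", pattern = "*.csv", recursive = TRUE, full.names = TRUE)
invisible(lapply(files, import_file_chunked, con = con))

Each chunk holds at most chunk_size rows in memory, so the files can be processed one after another without ever loading a whole file at once.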
