
I am trying to import a dataset (spread over many csv files) into R and then write the data into a table in a PostgreSQL database.

I successfully connected to the database and created a loop to import the csv files, but when I try the import R returns an error because my PC runs out of memory.

My question is: is there a way to create a loop that imports the files one after another, writes each one into the PostgreSQL table and then removes it from memory? That way I would not run out of memory.

Code which returns the memory error:

# load the required packages
library(RPostgreSQL)   # provides the PostgreSQL driver (loads DBI)
library(readr)         # read_csv()
library(dplyr)         # %>% and bind_rows()

# connect to PostgreSQL database
db_tankdata <- 'tankdaten'
host_db     <- 'localhost'
db_port     <- '5432'
db_user     <- 'postgres'
db_password <- 'xxx'
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname = db_tankdata, host = host_db,
                 port = db_port, user = db_user, password = db_password)

# check if the connection was successful
dbExistsTable(con, "prices")

# create a function to load multiple csv files
import_csvfiles <- function(path){
  files <- list.files(path, pattern = "*.csv", recursive = TRUE, full.names = TRUE)
  lapply(files, read_csv) %>% bind_rows() %>% as.data.frame()
}

# import the files
prices <- import_csvfiles("path...")
dbWriteTable(con, "prices", prices, append = TRUE, row.names = FALSE)

Thanks in advance for the feedback!

  • All the code pieces work; you just have to throttle down to account for RAM. If the csv files are of similar size, test how many you can handle at a time and process them in batches of that size. A handy notation is, for example, lapply(files[1:10], read_csv), or having import_csvfiles("path...") work on only the first ten files at a time. Once you have established how much you can consume, wrap your import in (horrors!) a for loop to achieve the throttling. Pop open a terminal and use htop to keep an eye on RAM use, and close unnecessary programs. Just a way to think about the problem. HTH Commented Apr 19, 2020 at 13:26
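A minimal sketch of that batching idea, assuming the connection con, the readr/dplyr functions and the folder path from the question are already in place; the batch size of 10 is only an example value to tune against your available RAM:

files <- list.files("path...", pattern = "*.csv", recursive = TRUE, full.names = TRUE)
batch_size <- 10                                    # example value, tune to your RAM

for (i in seq(1, length(files), by = batch_size)) {
    batch  <- files[i:min(i + batch_size - 1, length(files))]   # next batch of file paths
    prices <- lapply(batch, read_csv) %>% bind_rows() %>% as.data.frame()
    dbWriteTable(con, "prices", prices, append = TRUE, row.names = FALSE)
    rm(prices); gc()                                # free the batch before reading the next one
}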

2 Answers


If you change the lapply() to include an anonymous function, you can read each file and write it to the database, reducing the amount of memory required. Since lapply() acts as an implied for() loop, you don't need an extra looping mechanism.

import_csvfiles <- function(path){
     files <- list.files(path, pattern = "*.csv", recursive = TRUE, full.names = TRUE)
     lapply(files, function(x){
          prices <- read.csv(x)                                             # read one file at a time
          dbWriteTable(con, "prices", prices, append = TRUE, row.names = FALSE)  # write it, then move on
     })
}
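A possible way to call it, assuming the connection con from the question is already open; dbDisconnect() from the same DBI/RPostgreSQL packages closes the connection once everything has been written:

import_csvfiles("path...")   # reads and writes one file at a time
dbDisconnect(con)            # close the connection when the import is done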

6 Comments

And you don't need prices[x] because there should be only one file in the dbWriteTable queue at any given pass?
@Chris - correct. Within the environment of lapply(), the prices object contains one and only one input file for each iteration within lapply().
This is great to drive home, within the context of anonymous function in lapply, for all the cases of exploding memory that come up. I'll be able to use this profitably today.
This did it for me! Thanks!
Done, Thanks again!

I assume that the csv files you are importing into your database are very large? As far as I know, with the code you have written R first wants to store all of the data in one data frame, keeping everything in memory. The alternative is to read each CSV file in chunks, as you would with Python's Pandas.

When calling ?read.csv I saw the following output:

nrows : the maximum number of rows to read in. Negative and other invalid values are ignored.

skip : the number of lines of the data file to skip before beginning to read data.

Why don't you try to read 5000 rows at a time into a data frame, write them to the PostgreSQL database, and repeat until the file is exhausted, doing this for each file?

For example, for each file do the following:

number_of_lines = 5000                 # Number of rows to read at a time
row_skip = 0                           # number of lines to skip before the current chunk
col_names = NULL                       # column names, filled in from the first chunk
keep_reading = TRUE                    # We will change this value to stop the while

while (keep_reading) {
    if (is.null(col_names)) {
        # First chunk: read the header line together with the data
        my_data <- read.csv(x, nrows = number_of_lines)
        col_names <- names(my_data)
        row_skip <- 1                  # the extra 1 accounts for the header line
    } else {
        # Later chunks: the header is already behind us, so pass the column names explicitly
        my_data <- read.csv(x, nrows = number_of_lines, skip = row_skip,
                            header = FALSE, col.names = col_names)
    }
    dbWriteTable(con, "prices", my_data, append = TRUE, row.names = FALSE) # Write to the DB

    row_skip = row_skip + nrow(my_data)

    # Exit statement: a chunk shorter than number_of_lines means the end of the file was reached
    if (nrow(my_data) < number_of_lines) {
        keep_reading = FALSE
    } # end-if
} # end-while

By doing this you are breaking the csv up into smaller parts. You can play around with the number_of_lines variable to reduce the number of loop iterations. This may seem a bit hacky with a loop involved, but I'm sure it will work.

2 Comments

Do I have to run the code for each file then? Because I have over 2.000 csv files to import. I think Len Greski's answer was the simpler way in my case. Thanks anyway!
Yes, it would then have to be run for every csv file, but I am sure it will still be fast. If you have big memory constraints this is an extra option: it breaks the reading of each csv file into smaller parts and limits the amount of memory used via the number_of_lines variable. My pleasure, happy to contribute!
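For the many-files case from this comment thread, here is one possible sketch that pulls the chunked reading into a helper and applies it to every file. import_file_chunked is a hypothetical name, con is the open connection from the question, and chunk_size is only an example default:

import_file_chunked <- function(x, con, chunk_size = 5000) {
    # read the header line once so every chunk keeps the right column names
    # (for strictly consistent column types you could also pass colClasses)
    col_names <- names(read.csv(x, nrows = 1))
    row_skip  <- 1                                  # skip the header line from now on
    repeat {
        my_data <- tryCatch(
            read.csv(x, nrows = chunk_size, skip = row_skip,
                     header = FALSE, col.names = col_names),
            error = function(e) NULL)               # NULL once we are past the end of the file
        if (is.null(my_data) || nrow(my_data) == 0) break
        dbWriteTable(con, "prices", my_data, append = TRUE, row.names = FALSE)
        row_skip <- row_skip + nrow(my_data)
        if (nrow(my_data) < chunk_size) break       # a short chunk means the file is finished
    }
}

files <- list.files("path...", pattern = "*.csv", recursive = TRUE, full.names = TRUE)
invisible(lapply(files, import_file_chunked, con = con))

Each chunk holds at most chunk_size rows in memory, so the files can be processed one after another without ever loading a whole file at once.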
