
I am trying to use an efficient method of reading select data from very large csv files, described here by @JD_Long. The method relies on sqldf(), a function from the package of the same name. The problem I am having is that the function returns an empty data frame, that is, a data frame with the correct number of columns and their names but zero rows.

Below is a truncated version of the formatting function that's intended to process the data frame once sqldf() has read it into the working environment.

library(sqldf)

sum_stats_df <- function(f_name){
  # open a connection so sqldf() can read from the file
  f <- file(f_name, open = "r")
  df <- sqldf("select * from f limit 6", dbname = tempfile(),
              file.format = list(header = T, row.names = F))
  close(f)
  return(df)
}

The csv datasets that I'm working with are the well-known Bureau of Transportation Statistics flights data which I downloaded from the ASA website here.

I tried the sample code from the StackOverflow question linked above.

bigdf <- data.frame(dim=sample(letters, replace=T, 4e7), fact1=rnorm(4e7), fact2=rnorm(4e7, 20, 50))
write.csv(bigdf, 'bigdf.csv', quote = F)
library(sqldf)
f <- file("bigdf.csv")
bigdf <- sqldf("select * from f", dbname = tempfile(), file.format = list(header = T, row.names = F))

It worked just fine when I called sum_stats_df("bigdf.csv"). I compared the properties of the simulated file and the flights csv files in the Windows file explorer, and they look identical in format. The only difference is the file size: the flights files are much larger, but I don't see why that would matter.

The next thing I tried was loading twenty lines of one of the flights csv files using read.csv(..., nrow=20), then taking the resulting data frame and writing it back to a csv file. When I tried the problematic function on that new csv test file, it worked.
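For reference, the round trip just described can be sketched like this (a minimal sketch; the filenames and the nrows value are taken from the question):

```r
# read only the first 20 rows of the problematic file with base R
small <- read.csv("2000.csv", nrows = 20)

# write them back out; the resulting file then loads fine through sqldf()
write.csv(small, "test.csv")
```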

> print(head(sum_stats_df("test.csv")))
    X Year Month DayofMonth DayOfWeek DepTime CRSDepTime ArrTime CRSArrTime UniqueCarrier
1 "1" 1987    10         14         3     741        730     912        849          "PS"
2 "2" 1987    10         15         4     729        730     903        849          "PS"
3 "3" 1987    10         17         6     741        730     918        849          "PS"
4 "4" 1987    10         18         7     729        730     847        849          "PS"
5 "5" 1987    10         19         1     749        730     922        849          "PS"
6 "6" 1987    10         21         3     728        730     848        849          "PS"

But, again, a call using the original csv file returns an empty dataframe.

> print(head(sum_stats_df("2000.csv")))
 [1] Year              Month             DayofMonth        DayOfWeek         DepTime          
 [6] CRSDepTime        ArrTime           CRSArrTime        UniqueCarrier     FlightNum        
[11] TailNum           ActualElapsedTime CRSElapsedTime    AirTime           ArrDelay         
[16] DepDelay          Origin            Dest              Distance          TaxiIn           
[21] TaxiOut           Cancelled         CancellationCode  Diverted          CarrierDelay     
[26] WeatherDelay      NASDelay          SecurityDelay     LateAircraftDelay
<0 rows> (or 0-length row.names)

So I'm wondering: what could be different about the csv files I downloaded from ASA compared with those I write locally using write.csv()? What else might cause sqldf() to return an empty data frame while still picking up all the column names?

6 Comments
  • One problem I see is that you are using LIMIT without ORDER BY. This doesn't make sense because you aren't telling the database how you want to limit the result set. Which 6 records do you want? Commented Jan 9, 2018 at 5:06
  • As far as I know, and I might be wrong, using limit this way is analogous to getting just the first six rows. Commented Jan 9, 2018 at 5:11
  • But I've tried it without limit and get the same empty dataframe. Commented Jan 9, 2018 at 5:11
  • Please read this accepted answer and see if you can read from the file as you have opened it in your function. Commented Jan 9, 2018 at 5:17
  • Added l <- readLines(con=f, n=6); print(l) and got the header and five full lines from the file. I've read many answers here saying that the sql objects are deleted immediately after the call, but that doesn't explain why the results are first assigned to df, nor why it works with locally written csv files. Commented Jan 9, 2018 at 5:28

2 Answers


SQLite is fussy about end-of-line characters, and these can differ from your OS default if the file comes from somewhere else.

On Windows and Linux (Ubuntu) this worked for me, assuming you have already unzipped the file:

read.csv.sql("1987.csv", "select * from file limit 3", eol = "\n")

or on Windows this variation using the original bz2 file and assuming that the 7z executable is on your path:

read.csv.sql("1987.csv", "select * from file limit 3", eol = "\n",
    filter = "7z x -so 1987.csv.bz2 2>NUL")

On Linux we can use bzip2 instead:

read.csv.sql("1987.csv", "select * from file limit 3", eol = "\n",
    filter = "bzip2 -cd 1987.csv.bz2")

FAQ 16 on the sqldf github page has an example that is very similar to this one.




The files you can download from the quoted web page are bzip2-compressed. I can reproduce your problem when using a compressed file directly. One possible solution is to decompress the file(s) before using them with sqldf. This is unfortunate, since other tools for reading CSV files support bzip2-compressed files directly. I do not know how to do this with sqldf() itself, but read.csv.sql() from the same package supports a filter option, making the following function possible:

sum_stats_df <- function(f_name){
  read.csv.sql(f_name, sql = "select * from file limit 6", filter = "bunzip2")
}
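If you would rather keep the original sum_stats_df() untouched, an alternative (just a sketch in base R, assuming the downloaded .bz2 file is named as below) is to decompress once up front and then query the plain csv as before:

```r
# decompress 2000.csv.bz2 once using a bzfile() connection (base R, no extra packages);
# note this reads the whole file into memory, which is fine as a one-off step
con <- bzfile("2000.csv.bz2", open = "rt")
writeLines(readLines(con), "2000.csv")
close(con)
# the decompressed 2000.csv can then be passed to sum_stats_df()
```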

5 Comments

  • Thanks for your input, but I unzipped the files before trying to load them. I wonder, though, if there is some processing or formatting going on in the unzipper that might be behind the problem.
  • Which tool on which OS did you use to unzip the files? What file sizes do you currently have? For comparison, the data from 2008 is 109 MB with and 658 MB without compression.
  • Opening the locally created csv files and the downloaded-and-unzipped files in Excel doesn't reveal any difference between the two. But opening them in Notepad++ showed quotes around values in some columns of the former. With this in mind I tried defining the class for each column using the method arg. It still didn't work.
  • I used 7-Zip, and the unzipped 2008 csv file is 673 MB in my case.
  • This is very odd. Can you add print(f) to your function after opening the file and post the result when calling the function on one of the files?
