I have a CSV file that has ~1.9M rows and 32 columns. I also have limited RAM, which makes loading it into memory very inconvenient. As a result I am thinking of using a database, but I do not have any intimate knowledge of the subject and have looked around this site without finding a viable solution so far.
The CSV file looks like this:
Case,Event,P01,P02,P03,P04,P05,P06,P07,P08,P09,P10,P11,P12,P13,P14,P15,P16,P17,P18,P19,P20,P21,P22,P23,P24,P25,P26,P27,P28,P29,P30
C000039,E97553,8,10,90,-0.34176313227395744,-5.581162038780728E-4,-0.12090388100201072,-1.5172412910939355,-0.9075283173030568,2.0571877671625742,-0.002902632819930783,-0.6761896565590585,-0.7258602353522214,0.8684602429202587,0.0023189312896576167,0.002318939470525324,-0.1881462494296103,-0.0014303471592995315,-0.03133299206977217,7.72338072867324E-4,-0.08952068388668191,-1.4536398437657685,-0.020065144945600275,-0.16276139919188118,0.6915962670997067,-1.593412697264055,-1.563877781707804,-1.4921751129092755,4.701551108078644,6,-0.688302560842075
C000039,E23039,8,10,90,-0.3420173545012358,-5.581162038780728E-4,-1.6563770995734233,-1.5386562526752448,-1.3604342580422861,2.1025445031625525,-0.0028504751366762804,-0.6103972392687121,-2.0390388918403284,-1.7249948885013526,0.00231891181914203,0.0023189141684282384,-0.18603688853814693,-0.0014303471592995315,-0.03182759137355937,0.001011754948131039,0.13009444290656555,-1.737249614361576,-0.015763602969926262,-0.16276139919188118,0.7133868949811379,-1.624962995908364,-1.5946762525901037,-1.5362787555380522,4.751479927607516,6,-0.688302560842075
C000039,E23039,35,10,90,-0.3593468363273839,-5.581162038780728E-4,-2.2590624066428937,-1.540784192984501,-1.3651511418164592,0.05539868728273849,-0.00225912499740972,0.20899232681704485,-2.2007336302050633,-2.518401278903022,0.0023189850665203673,0.0023189834133465186,-0.1386548782028836,-0.0013092574968056093,-0.0315006293688149,9.042390365542781E-4,-0.3514180333671346,-1.8007561969675518,-0.008593259125791147,-2.295351187387221,0.6329101442826701,-1.8095530459660578,-1.7748676145152822,-1.495347406256394,2.553693742122162,34,-0.6882806822066699
... up to 1.9M rows
As you can see, the 'Case' column repeats, but I want to keep only the unique records (the Cases that occur exactly once) before importing into a data frame. So I used this:
f <- file("test.csv")
bigdf <- sqldf("select * from 'f' where Case in (select Case from 'f' group by Case having count(*) = 1)",
               dbname = tempfile(), file.format = list(header = T, row.names = F))
However I get this error:
Error in sqliteExecStatement(con, statement, bind.data) :
RS-DBI driver: (error in statement: near "in": syntax error)
Is there something obvious I am missing here? Many thanks in advance.
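For what it's worth, the likely culprit is that Case is a reserved word in SQLite (it opens a CASE expression), so the parser trips on the in that follows it. Quoting the column name, e.g. with square brackets, is one way around that. A minimal sketch, assuming the test.csv layout shown above:

library(sqldf)

# [Case] quotes the reserved word so SQLite treats it as a column name
# rather than the start of a CASE expression.
f <- file("test.csv")
bigdf <- sqldf("select * from 'f' where [Case] in
                 (select [Case] from 'f' group by [Case] having count(*) = 1)",
               dbname = tempfile(),
               file.format = list(header = TRUE, row.names = FALSE))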
SELECT DISTINCT...?
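If plain deduplication is the goal, that suggestion would look something like the sketch below. Note the difference: DISTINCT keeps one copy of each repeated row, whereas the count(*) = 1 query above drops every Case that repeats.

# Sketch: DISTINCT collapses exact duplicate rows to a single copy.
f <- file("test.csv")
bigdf <- sqldf("select distinct * from 'f'",
               dbname = tempfile(),
               file.format = list(header = TRUE, row.names = FALSE))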