Create a hash value for each row of a data frame in R

Question

I am exploring how to compare two data frames in R more efficiently, and I came up with the idea of hashing.

My plan is to create a hash for each row of two data frames that have the same columns, using digest from the digest package. I assume the hash will be the same for any two identical rows.

I tried to give a unique hash to each row, using the code below:

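library(digest)

for (loop.ssi in seq_len(nrow(ssi.10q3.v1))) {
  # Hash the character representation of the current row
  ssi.10q3.v1[loop.ssi, "hash"] <- digest(as.character(ssi.10q3.v1[loop.ssi, ]))
  # Report progress as "current/total"
  print(paste(loop.ssi, nrow(ssi.10q3.v1), sep = "/"))
  flush.console()
}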

But this is very slow.

Is this a sound approach to comparing data frames? If yes, any suggestions for speeding up the code above? Thanks.

UPDATE

I have updated the code as below:

library(plyr)
library(digest)

ssi.10q3.v1[, "uid"] <- seq_len(nrow(ssi.10q3.v1))

ssi.10q3.v1.hash <- ddply(ssi.10q3.v1,
                          c("uid"),
                          function(df) {
                            # Drop the helper id so it does not affect the hash
                            df[, "uid"] <- NULL
                            hash <- digest(as.character(df))
                            data.frame(hash = hash)
                          },
                          .progress = "text")

I generated the uid column myself so that each row forms its own group in ddply.

What makes the row unique? Maybe it's just better/faster to compare each of those fields. Is there a way to add the hash to the data so you don't have to generate the hash each time? Can you concatenate all the fields together as one field and then test? Maybe that would be faster? FYI - I know nothing about R
– Shiv Kumar, Commented Feb 23, 2011 at 3:59
I'd suggest looking at plyr::join (and match_df in the devel version) for an implementation of this strategy for comparing and matching data frames. See also plyr::join.df and plyr::id.
– hadley, Commented Feb 23, 2011 at 13:33

mdsumner · Accepted Answer · 2011-02-23 07:45:52Z

6

If I understand what you want correctly, digest will work directly with apply:

library(digest)
ssi.10q3.v1.hash <- data.frame(uid = 1:nrow(ssi.10q3.v1), hash = apply(ssi.10q3.v1, 1, digest))
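To make the comparison step concrete, here is a minimal sketch of hashing rows in two data frames and checking which rows of one appear in the other. The toy data frames `df1` and `df2`, the helper name `row_hash`, and the `"\r"` field separator are all assumptions made for illustration, not part of the original answer. It pastes the columns together instead of calling apply, because apply first converts the data frame to a matrix, so with mixed column types the hash is computed on a character representation of each row.

```r
library(digest)

# Hypothetical toy data frames sharing the same columns
df1 <- data.frame(id = c(1, 2, 3), name = c("a", "b", "c"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(id = c(1, 2, 4), name = c("a", "b", "d"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: returns one hash per row
row_hash <- function(df) {
  # Paste the fields of each row into a single string (vectorised over rows);
  # the "\r" separator is an assumption and must not occur in the data
  keys <- do.call(paste, c(unname(as.list(df)), sep = "\r"))
  # Hash each row string; digest() defaults to md5
  vapply(keys, digest, character(1), USE.NAMES = FALSE)
}

h1 <- row_hash(df1)
h2 <- row_hash(df2)

# TRUE where a row of df1 also occurs somewhere in df2
in_both <- h1 %in% h2
```

Since the hash only stands in for the pasted row string, comparing `keys` directly would give the same answer here; the hash is mainly useful when you want to store a short fixed-length fingerprint alongside the data.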

edited Feb 23, 2011 at 7:45

answered Feb 23, 2011 at 5:40


1 Comment

lokheart Over a year ago

by changing uid = nrow(ssi.10q3.v1) into uid = 1:nrow(ssi.10q3.v1), this will be perfect, thanks!

mdsumner · Answer · 2011-02-23 05:31:52Z

1

I know this answer doesn't match the title of the question, but if you just want to see when rows are different you can do it directly:

rowSums(df2 == df1) == ncol(df1)

Assuming both data.frames have the same dimensions, that will evaluate to FALSE for every row that is not identical. If you need to test rownames as well, that could be managed separately and combined with the test of contents, and similarly for colnames (and attributes, and strict tests on column types).

 rowSums(df2 == df1) == ncol(df1) & rownames(df2) == rownames(df1)

edited Feb 23, 2011 at 5:31

answered Feb 23, 2011 at 4:40


2 Comments

lokheart Over a year ago

I guess you are assuming the data frame is all numeric, but it is not, so I don't think using rowSums is possible

mdsumner Over a year ago

rowSums is getting a logical matrix as input, not a data.frame - that's the result of the == comparison - it's a bit funny to rely on numerical conversion from logical for the sum but no big deal
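A small sketch of the point made in the last comment, using hypothetical toy data frames `df1` and `df2` with one numeric and one character column: the `==` comparison yields a logical matrix, and rowSums counts the TRUE cells in each row.

```r
# Hypothetical toy data with a numeric and a character column
df1 <- data.frame(id = c(1, 2, 3), name = c("a", "b", "c"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(id = c(1, 2, 3), name = c("a", "x", "c"),
                  stringsAsFactors = FALSE)

# `==` on two data frames returns a logical matrix, one cell per field
cell_equal <- df2 == df1

# TRUE counts as 1 in rowSums, so a row matches when every cell is TRUE
same_row <- rowSums(cell_equal) == ncol(df1)
```

One caveat: if any field is NA, the comparison for that cell is NA and so is the row sum, so rows containing missing values would need separate handling.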
