sqldf - using variables in reference dataframe, add a variable to input dataframe

Question

I have a data frame with a CategoryCodes value in every row. Many rows share the same CategoryCodes, and there are a few hundred unique values. I need to assign a category name to each row, pulling it from a reference data frame. I tried the syntax below, but the output has several times as many rows as MyData. The output should have the same number of rows as MyData. Where am I going wrong?

 Combineddf<-sqldf("select * from MyData left join 
              ReferenceDf using (CategoryCodes)")

Reference Data:

   CategoryCodes Class
5     120500      Tools
6     166300 Spare Parts
7     280200 Spare Parts
8     280200 Spare Parts
9     295200 Spare Parts
10    165000 Spare Parts

MyData (over 30 columns):

   X    Z    CategoryCodes  Y
5  OW   EA   120300         S
6  ANB  EA   120500         S
7  ANB  FOT  120300         S
8  ANB  EA   120500         S
9  ANB  EA   120300         S
10 MIS  EA   120500         S

Gregor Thomas · Accepted Answer · 2018-01-04 16:36:10Z

A join returns more rows than the left table when a row on the left matches more than one row on the right.

In Reference Data you can see duplicate category codes - for example rows 7 and 8 both have code 280200, so any code 280200 in MyData will get matched to both of those rows.
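
One way to confirm this before joining is to count how often each code appears in the reference table. The snippet below is a minimal base R sketch, not sqldf; the data frame is rebuilt by hand from the sample rows in the question, and `code_counts` and `dup_codes` are names made up for illustration.

```r
# Toy copy of the question's sample reference data
ReferenceDf <- data.frame(
  CategoryCodes = c(120500L, 166300L, 280200L, 280200L, 295200L, 165000L),
  Class = c("Tools", rep("Spare Parts", 5)),
  stringsAsFactors = FALSE
)

# Count how many reference rows exist for each code
code_counts <- table(ReferenceDf$CategoryCodes)

# Any code appearing more than once will multiply matching rows in a join
dup_codes <- names(code_counts[code_counts > 1])
print(dup_codes)
```

If `dup_codes` is empty, the reference table has one row per code and a left join on CategoryCodes cannot add rows.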

Maybe you want to select only the unique rows of ReferenceDf? Something like:

Combineddf<-sqldf("select * from MyData left join 
              (select distinct * from ReferenceDf)
              using (CategoryCodes)")

5 Comments

pyeR_biz Over a year ago

Thanks, this worked well. Being familiar with the concept of a Cartesian product, I understood your explanation.

pyeR_biz Over a year ago

It's not working with my original data; it is still finding multiple matches. Could this be because my reference data frame actually has over 10 columns?

Gregor Thomas Over a year ago

It probably means your reference data frame has multiple rows with the same CategoryCodes value but differences in other columns. You'll need to come up with some logic that determines which row you actually want to keep, select those rows from ReferenceDf, and join to that.

Gregor Thomas Over a year ago

If all you care about are the CategoryCodes and Class columns, then (select distinct CategoryCodes, Class from ReferenceDf) will work - as long as there are no CategoryCodes with multiple classes. I can't help you more without seeing some data - and even then the question will just be "when there are duplicated CategoryCodes, which row do you want?"
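
The two-column approach from the comment above can be sketched with a guard that fails early if a code still maps to more than one class. This is a base R sketch using `unique()` and `merge()` in place of sqldf; the data frames are rebuilt from the question's samples (MyData trimmed to four columns), and `ref_small` is a hypothetical name.

```r
# Toy data copied from the question's samples
ReferenceDf <- data.frame(
  CategoryCodes = c(120500L, 166300L, 280200L, 280200L, 295200L, 165000L),
  Class = c("Tools", rep("Spare Parts", 5)),
  stringsAsFactors = FALSE
)
MyData <- data.frame(
  X = c("OW", "ANB", "ANB", "ANB", "ANB", "MIS"),
  Z = c("EA", "EA", "FOT", "EA", "EA", "EA"),
  CategoryCodes = c(120300L, 120500L, 120300L, 120500L, 120300L, 120500L),
  Y = "S",
  stringsAsFactors = FALSE
)

# Keep only the two columns needed, then drop exact duplicate rows
ref_small <- unique(ReferenceDf[, c("CategoryCodes", "Class")])

# Guard: stop if any code still appears more than once (conflicting classes)
stopifnot(anyDuplicated(ref_small$CategoryCodes) == 0)

# Left join: all.x = TRUE keeps every row of MyData, matched or not
Combineddf <- merge(MyData, ref_small, by = "CategoryCodes", all.x = TRUE)

# The join should leave the row count unchanged
stopifnot(nrow(Combineddf) == nrow(MyData))
```

Note that `merge()` sorts the result by the key column, so row order may differ from MyData. Codes with no match in the reference table (120300 here) get `NA` for Class.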

pyeR_biz Over a year ago

Thanks for the tip. What you said was indeed true: there are repeated values in the reference as well, with different category codes. So I manufactured a new unique key in both data sets and used the syntax from your comment, and it worked. Thanks again. This was a good little experience.