I have a data frame with a CategoryCodes value in every row. Many rows share the same CategoryCodes, and there are a few hundred unique codes. I need to assign a category name to each row by looking it up in a reference data frame. I tried the syntax below, but the output has several times as many rows as MyData. The output should have the same number of rows as MyData. Where am I going wrong?

 Combineddf<-sqldf("select * from MyData left join 
              ReferenceDf using (CategoryCodes)")

Reference Data:

   CategoryCodes Class
5     120500      Tools
6     166300 Spare Parts
7     280200 Spare Parts
8     280200 Spare Parts
9     295200 Spare Parts
10    165000 Spare Parts

MyData (over 30 columns):

   X   Z   CategoryCodes Y
5  OW  EA  120300        S
6  ANB EA  120500        S
7  ANB FOT 120300        S
8  ANB EA  120500        S
9  ANB EA  120300        S
10 MIS EA  120500        S

1 Answer

A join returns more rows than its left table when a row in the left table matches more than one row in the right table.

In Reference Data you can see duplicate category codes - for example rows 7 and 8 both have code 280200, so any code 280200 in MyData will get matched to both of those rows.
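To check which codes are duplicated before joining, something like the following should work. This is a minimal sketch in base R rather than sqldf, using the sample reference rows from the question; `key_counts` and `dup_keys` are names made up for the illustration.

```r
# Reference data copied from the question (row names dropped).
ReferenceDf <- data.frame(
  CategoryCodes = c(120500, 166300, 280200, 280200, 295200, 165000),
  Class = c("Tools", "Spare Parts", "Spare Parts", "Spare Parts",
            "Spare Parts", "Spare Parts")
)

# Count how many reference rows share each key.
key_counts <- table(ReferenceDf$CategoryCodes)

# Any key that appears more than once multiplies the matching rows of
# MyData in a join. Roughly the same check in SQL would be:
#   select CategoryCodes, count(*) from ReferenceDf
#   group by CategoryCodes having count(*) > 1
dup_keys <- names(key_counts[key_counts > 1])
print(dup_keys)
```

If `dup_keys` is empty, the reference table has one row per code and a left join cannot add rows.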

Maybe you want to select only the unique rows of ReferenceDf? Something like

Combineddf<-sqldf("select * from MyData left join 
              (select distinct * from ReferenceDf)
              using (CategoryCodes)")

5 Comments

Thanks, this worked well, and being familiar with concept of cartesian product, I understood your explanation.
It's not working with my original data; it is still finding multiple matches. Is that because my reference data frame actually has over 10 columns?
That probably means your reference data frame has multiple rows with the same CategoryCodes but differences in other columns. You'll need to come up with some logic that determines which row you actually want to keep, select those rows from ReferenceDf, and join to that.
If all you care about are the CategoryCodes and Class columns, then (select distinct CategoryCodes, Class from ReferenceDf) will work - as long as there are no CategoryCodes with multiple classes. I can't help you more without seeing some data - and even then the question will just be "when there are duplicated CategoryCodes, which row do you want?"
Thanks for the tip. What you said was indeed true: there are repeated values in the reference as well, with different category codes. So I created a new unique key in both data sets and used the syntax from your comment, and it worked. Thanks again. This was a good little experience.
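For anyone hitting the same problem, here is a rough sketch of the approach discussed in these comments: reduce the reference table to one row per code, verify that, then join. It uses base R `merge` in place of sqldf so it runs without extra packages. The `Note` column is hypothetical, added only to show why `select distinct *` can still leave duplicate codes, and `lookup` is an illustrative name.

```r
# Sample data from the question (only four of MyData's columns shown).
MyData <- data.frame(
  X = c("OW", "ANB", "ANB", "ANB", "ANB", "MIS"),
  Z = c("EA", "EA", "FOT", "EA", "EA", "EA"),
  CategoryCodes = c(120300, 120500, 120300, 120500, 120300, 120500),
  Y = "S"
)
ReferenceDf <- data.frame(
  CategoryCodes = c(120500, 166300, 280200, 280200, 295200, 165000),
  Class = c("Tools", "Spare Parts", "Spare Parts", "Spare Parts",
            "Spare Parts", "Spare Parts"),
  # Hypothetical extra column: the two 280200 rows differ here, so
  # de-duplicating whole rows would keep both of them.
  Note = c("a", "b", "c", "d", "e", "f")
)

# Keep only the columns needed for the lookup, then drop exact duplicates.
lookup <- unique(ReferenceDf[, c("CategoryCodes", "Class")])

# Stop early if any code still maps to more than one Class.
stopifnot(anyDuplicated(lookup$CategoryCodes) == 0)

# all.x = TRUE keeps every row of MyData, like a left join.
# Note that merge() sorts the result by the key column.
Combineddf <- merge(MyData, lookup, by = "CategoryCodes", all.x = TRUE)

# The row count should now match the original data.
stopifnot(nrow(Combineddf) == nrow(MyData))
```

If the uniqueness check fails, the same code has more than one Class, and you still have to decide which row to keep before joining.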
