I have two DataFrames in Spark, train and test. Both have a categorical column, say Product_ID. I want to set the value to -1 for those categories that appear in test but are not present in train. As a first step, I collected the distinct categories of that column that occur only in test into p_not_in_test, but I am not able to proceed further. How can I do that?

p_not_in_test = test.select('Product_ID').subtract(train.select('Product_ID'))

p_not_in_test  = p_not_in_test.distinct()

Regards

1 Answer

Here's a reproducible example. First, we create dummy data:

test = sc.parallelize([("ID1", 1,5),("ID2", 2,4),
                       ("ID3", 5,8),("ID4", 9,0),
                       ("ID5", 0,3)]).toDF(["PRODUCT_ID", "val1", "val2"])

train = sc.parallelize([("ID1", 4,7),("ID3", 1,4),
                        ("ID5", 9,2)]).toDF(["PRODUCT_ID", "val1", "val2"])

Now we need to extend your definition of p_not_in_test so that it returns a Python list of the IDs that are in test but not in train:

p_not_in_test = (test.select('PRODUCT_ID')
                 .subtract(train.select('PRODUCT_ID'))
                 .rdd.map(lambda x: x[0]).collect())

Finally, we can create a UDF that prepends "-1 " to each ID that is not present in train:

from pyspark.sql.types import StringType
from pyspark.sql.functions import udf

addString = udf(lambda x: '-1 ' + x if x in p_not_in_test else x, StringType())

test.withColumn("NEW_ID",addString(test["PRODUCT_ID"])).show()
+----------+----+----+------+
|PRODUCT_ID|val1|val2|NEW_ID|
+----------+----+----+------+
|       ID1|   1|   5|   ID1|
|       ID2|   2|   4|-1 ID2|
|       ID3|   5|   8|   ID3|
|       ID4|   9|   0|-1 ID4|
|       ID5|   0|   3|   ID5|
+----------+----+----+------+