Here's a reproducible example. First, we create some dummy data:
test = sc.parallelize([("ID1", 1, 5), ("ID2", 2, 4),
                       ("ID3", 5, 8), ("ID4", 9, 0),
                       ("ID5", 0, 3)]).toDF(["PRODUCT_ID", "val1", "val2"])
train = sc.parallelize([("ID1", 4, 7), ("ID3", 1, 4),
                        ("ID5", 9, 2)]).toDF(["PRODUCT_ID", "val1", "val2"])
Next, extend your definition of p_not_in_test so that it returns a plain Python list of the IDs that are in test but not in train:
p_not_in_test = (test.select('PRODUCT_ID')
                 .subtract(train.select('PRODUCT_ID'))
                 .rdd.map(lambda x: x[0]).collect())
Finally, we can create a UDF that prepends "-1 " to each ID that is not present in train:
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf
addString = udf(lambda x: '-1 ' + x if x in p_not_in_test else x, StringType())
test.withColumn("NEW_ID",addString(test["PRODUCT_ID"])).show()
+----------+----+----+------+
|PRODUCT_ID|val1|val2|NEW_ID|
+----------+----+----+------+
|       ID1|   1|   5|   ID1|
|       ID2|   2|   4|-1 ID2|
|       ID3|   5|   8|   ID3|
|       ID4|   9|   0|-1 ID4|
|       ID5|   0|   3|   ID5|
+----------+----+----+------+
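
If you want to sanity-check the logic without a Spark session, the two steps above can be mirrored in plain Python. This is only a sketch: the ID lists are copied from the dummy data, add_prefix is a hypothetical name that is not part of the original answer, and a set is used instead of a list (my assumption, to keep membership checks cheap).

```python
# Plain-Python mirror of the Spark steps above (no Spark session needed).
# The IDs are copied from the dummy data; add_prefix is a hypothetical name.
test_ids = ["ID1", "ID2", "ID3", "ID4", "ID5"]
train_ids = ["ID1", "ID3", "ID5"]

# Counterpart of subtract(...).collect(): IDs in test that are not in train.
# A set is used here so each membership check is a hash lookup.
not_in_train = set(test_ids) - set(train_ids)


def add_prefix(product_id, missing=not_in_train):
    """Return the ID prefixed with '-1 ' if it is absent from train."""
    return "-1 " + product_id if product_id in missing else product_id


new_ids = [add_prefix(pid) for pid in test_ids]
print(new_ids)  # ['ID1', '-1 ID2', 'ID3', '-1 ID4', 'ID5']
```

The resulting list matches the NEW_ID column shown above. Once the logic looks right, the same conditional is what the lambda passed to udf evaluates for each row.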