I have two dataframes.
AA =
+---+----+---+-----+-----+
| id1|id2| nr|cell1|cell2|
+---+----+---+-----+-----+
| 1| 1| 0| ab2 | ac3 |
| 1| 1| 1| dg6 | jf2 |
| 2| 1| 1| 84d | kf6 |
| 2| 2| 1| 89m | k34 |
| 3| 1| 0| 5bd | nc4 |
+---+----+---+-----+-----+
and a second dataframe BB, which looks like:
BB =
+---+----+---+-----+
| a | b|use|cell |
+---+----+---+-----+
| 1| 1| x| ab2 |
| 1| 1| a| dg6 |
| 2| 1| b| 84d |
| 2| 2| t| 89m |
| 3| 1| d| 5bd |
+---+----+---+-----+
The cell column of BB contains every cell that can appear in the cell1 and cell2 columns of AA (cell1 - cell2 is an interval).
I want to add two columns to BB, val1 and val2, under the following conditions.
val1 is 1 when:
id1 == id2 (in AA),
and cell (in BB) == cell1 or cell2 (in AA),
and nr == 1 (in AA),
and 0 otherwise.
The other column is constructed similarly.
val2 is 1 when:
id1 != id2 (in AA),
and cell (in BB) == cell1 or cell2 (in AA),
and nr == 1 (in AA),
and 0 otherwise.
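Read literally, the two rules can be sketched in plain Python (not PySpark) against the sample rows above. This is only an illustration of the intended result: the helper name flags is hypothetical, and matching on cell alone (without also comparing a/b to id1/id2) is an assumption, since the rules do not mention those columns.

```python
# Sample rows copied from the AA and BB tables above.
AA = [
    {"id1": 1, "id2": 1, "nr": 0, "cell1": "ab2", "cell2": "ac3"},
    {"id1": 1, "id2": 1, "nr": 1, "cell1": "dg6", "cell2": "jf2"},
    {"id1": 2, "id2": 1, "nr": 1, "cell1": "84d", "cell2": "kf6"},
    {"id1": 2, "id2": 2, "nr": 1, "cell1": "89m", "cell2": "k34"},
    {"id1": 3, "id2": 1, "nr": 0, "cell1": "5bd", "cell2": "nc4"},
]
BB = [
    {"a": 1, "b": 1, "use": "x", "cell": "ab2"},
    {"a": 1, "b": 1, "use": "a", "cell": "dg6"},
    {"a": 2, "b": 1, "use": "b", "cell": "84d"},
    {"a": 2, "b": 2, "use": "t", "cell": "89m"},
    {"a": 3, "b": 1, "use": "d", "cell": "5bd"},
]


def flags(cell, aa_rows):
    """Return (val1, val2) for one BB cell (hypothetical helper)."""
    val1 = val2 = 0
    for row in aa_rows:
        # Both rules require nr == 1 and a match on cell1 or cell2.
        if row["nr"] == 1 and cell in (row["cell1"], row["cell2"]):
            if row["id1"] == row["id2"]:
                val1 = 1
            else:
                val2 = 1
    return val1, val2


result = []
for bb_row in BB:
    v1, v2 = flags(bb_row["cell"], AA)
    result.append({**bb_row, "val1": v1, "val2": v2})

for row in result:
    print(row["cell"], row["val1"], row["val2"])
```

Under these assumptions, dg6 and 89m get val1 = 1, 84d gets val2 = 1, and ab2 and 5bd get 0 in both columns.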
My attempt: I tried to work with:
from pyspark.sql.functions import when, col
condition = col("id1") == col("id2")
result = df.withColumn("val1", when(condition, 1))
result.show()
But it soon became apparent that this task is well beyond my PySpark skill level.
EDIT:
I am trying to run :
condition1 = AA.id1 == AA.id2
condition2 = AA.nr == 1
condition3 = AA.cell1 == BB.cell | AA.cell2 == BB.cell
result = BB.withColumn("val1", when(condition1 & condition2 & condition3, 1).otherwise(0)
This gives an error inside a Zeppelin notebook:
Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-4362.py", line 344, in <module>
code = compile('\n'.join(final_code), '<stdin>', 'exec', ast.PyCF_ONLY_AST, 1)
File "<stdin>", line 6
__zeppelin__._displayhook()
^
SyntaxError: invalid syntax
EDIT2: Thanks for the correction, I was missing a closing bracket. However, now I get:
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
This is odd, since I am already using these operators.
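The | operator binds more tightly than ==, so your condition3 is parsed as AA.cell1 == (BB.cell | AA.cell2) == BB.cell. Python treats that as a chained comparison, which joins the two == results with a plain "and", and converting a column to a bool for that "and" is what raises the error. Wrap each comparison in parentheses: condition3 = (AA.cell1 == BB.cell) | (AA.cell2 == BB.cell)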
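The precedence problem can be reproduced without Spark. The sketch below uses a hypothetical stand-in class, FakeColumn, which is not part of PySpark; it only mimics the relevant behaviour of a column object (overloaded ==, | and &, and a refusal to be converted to bool) so that the effect of the missing parentheses is visible in plain Python.

```python
class FakeColumn:
    """Hypothetical stand-in for a Spark column; stores the expression as text."""

    def __init__(self, expr):
        self.expr = expr

    def __eq__(self, other):
        return FakeColumn(f"({self.expr} = {other.expr})")

    def __or__(self, other):
        return FakeColumn(f"({self.expr} OR {other.expr})")

    def __and__(self, other):
        return FakeColumn(f"({self.expr} AND {other.expr})")

    def __bool__(self):
        # Mimics the refusal to turn a column expression into True/False.
        raise ValueError("Cannot convert column into bool")


cell1, cell2, cell = FakeColumn("cell1"), FakeColumn("cell2"), FakeColumn("cell")

try:
    # Parsed as: cell1 == (cell | cell2) == cell, a chained comparison,
    # which Python evaluates as (cell1 == tmp) and (tmp == cell).
    bad = cell1 == cell | cell2 == cell
    outcome = "no error"
except ValueError as exc:
    outcome = f"ValueError: {exc}"

# With explicit parentheses each comparison is built first, then OR-ed.
good = (cell1 == cell) | (cell2 == cell)

print(outcome)
print(good.expr)
```

The unparenthesised line fails because the implicit "and" of the chained comparison calls bool() on the intermediate object, while the parenthesised version only ever uses the overloaded operators.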