
I have two dataframes.

AA = 

+---+---+---+-----+-----+
|id1|id2| nr|cell1|cell2|
+---+---+---+-----+-----+
|  1|  1|  0|  ab2|  ac3|
|  1|  1|  1|  dg6|  jf2|
|  2|  1|  1|  84d|  kf6|
|  2|  2|  1|  89m|  k34|
|  3|  1|  0|  5bd|  nc4|
+---+---+---+-----+-----+

and a second dataframe BB, which looks like:

BB =

+---+---+---+----+
|  a|  b|use|cell|
+---+---+---+----+
|  1|  1|  x| ab2|
|  1|  1|  a| dg6|
|  2|  1|  b| 84d|
|  2|  2|  t| 89m|
|  3|  1|  d| 5bd|
+---+---+---+----+

In BB's cell column I have all possible cells that can appear in AA's cell1 and cell2 columns (cell1 - cell2 is an interval).

I want to add two columns to BB, val1 and val2. The conditions are the following.

val1 is 1 when:

             id1 == id2 (in AA),
         and cell (in BB) == cell1 or cell2 (in AA),
         and nr == 1 (in AA),

and 0 otherwise.

The other column is constructed analogously:

val2 is 1 when:

           id1 != id2 (in AA),
      and  cell (in BB) == cell1 or cell2 (in AA),
      and  nr == 1 (in AA),

and 0 otherwise. For example, for the BB row (2, 1, b, 84d) the matching AA row is (2, 1, 1, 84d, kf6): there id1 != id2 and nr == 1, so val2 = 1 and val1 = 0.

My attempt: I tried to work with:

from pyspark.sql.functions import when, col

condition = col("id1") == col("id2")
result = df.withColumn("val1", when(condition, 1))
result.show()

But it soon became apparent that this task is way over my pyspark skill level.

EDIT:

I am trying to run:

condition1 = AA.id1 == AA.id2
condition2 = AA.nr == 1
condition3 = AA.cell1 == BB.cell  | AA.cell2 == BB.cell

result = BB.withColumn("val1", when(condition1 & condition2 & condition3, 1).otherwise(0)

This gives an error inside a Zeppelin notebook:

Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-4362.py", line 344, in <module>
    code = compile('\n'.join(final_code), '<stdin>', 'exec', ast.PyCF_ONLY_AST, 1)
  File "<stdin>", line 6
    __zeppelin__._displayhook()
               ^
SyntaxError: invalid syntax

EDIT2: Thanks for the correction, I was missing a closing bracket. However, now I get:

ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.

This is puzzling, since I am already using exactly these operators.

3 Comments
  • You're missing a closing bracket in the last row, after otherwise(0). Commented Oct 17, 2018 at 9:39
  • Thanks @gaw, I corrected it, but it does not solve the problem. Commented Oct 17, 2018 at 9:45
  • You get the error because the | operator binds more strongly than ==. Your condition3 is therefore parsed as the chained comparison AA.cell1 == (BB.cell | AA.cell2) == BB.cell, and the implicit and that Python inserts between the two comparisons is what cannot be applied to columns (see the sketch below). Commented Oct 17, 2018 at 9:53
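
A minimal sketch of the fix described in the comments: give each comparison its own parentheses so that | combines two boolean columns, and add the missing closing bracket.

condition3 = (AA.cell1 == BB.cell) | (AA.cell2 == BB.cell)

result = BB.withColumn("val1", when(condition1 & condition2 & condition3, 1).otherwise(0))

Note that this expression still references columns of AA inside a withColumn on BB, which is the problem the answer below solves with a join.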

1 Answer


In my opinion the best approach is to join the two dataframes and then model the conditions in the when clause. withColumn only iterates over the rows of the dataframe it is called on; you cannot reference values from another dataframe and expect it to iterate through that one's rows as well. The following code should fulfill your request:

from pyspark.sql import SparkSession
from pyspark.sql.functions import when, col

spark = SparkSession.builder.getOrCreate()

df_aa = spark.createDataFrame([
    (1, 1, 0, "ab2", "ac3"),
    (1, 1, 1, "dg6", "jf2"),
    (2, 1, 1, "84d", "kf6"),
    (2, 2, 1, "89m", "k34"),
    (3, 1, 0, "5bd", "nc4")
], ("id1", "id2", "nr", "cell1", "cell2"))

df_bb = spark.createDataFrame([
    (1, 1, "x", "ab2"),
    (1, 1, "a", "dg6"),
    (2, 1, "b", "84d"),
    (2, 2, "t", "89m"),
    (3, 1, "d", "5bd")
], ("a", "b", "use", "cell"))

# join wherever BB's cell shows up in AA's cell1 or cell2
cond = (df_bb.cell == df_aa.cell1) | (df_bb.cell == df_aa.cell2)
match = (col("cell") == col("cell1")) | (col("cell") == col("cell2"))

(df_bb.join(df_aa, cond, how="full")
    .withColumn("val1", when((col("id1") == col("id2")) & match & (col("nr") == 1), 1).otherwise(0))
    .withColumn("val2", when(~(col("id1") == col("id2")) & match & (col("nr") == 1), 1).otherwise(0))
    .show())

Result looks like:

+---+---+---+----+---+---+---+-----+-----+----+----+
|  a|  b|use|cell|id1|id2| nr|cell1|cell2|val1|val2|
+---+---+---+----+---+---+---+-----+-----+----+----+
|  1|  1|  x| ab2|  1|  1|  0|  ab2|  ac3|   0|   0|
|  1|  1|  a| dg6|  1|  1|  1|  dg6|  jf2|   1|   0|
|  2|  1|  b| 84d|  2|  1|  1|  84d|  kf6|   0|   1|
|  2|  2|  t| 89m|  2|  2|  1|  89m|  k34|   1|   0|
|  3|  1|  d| 5bd|  3|  1|  0|  5bd|  nc4|   0|   0|
+---+---+---+----+---+---+---+-----+-----+----+----+

Strictly speaking I might not even need the cell == cell1 | cell == cell2 check inside when, since that is essentially the join condition, but I kept it so that the when conditions mirror your stated requirements. A trimmed variant is sketched below.
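
A trimmed sketch under that assumption: with an inner join the cell match is already guaranteed by cond, so the when conditions reduce to the id and nr checks. The inner join drops unmatched rows, which makes no difference for this sample data.

result = (df_bb.join(df_aa, cond, how="inner")
          .withColumn("val1", when((col("id1") == col("id2")) & (col("nr") == 1), 1).otherwise(0))
          .withColumn("val2", when((col("id1") != col("id2")) & (col("nr") == 1), 1).otherwise(0)))
result.show()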


3 Comments

If you want to make sure the conditions apply to the same lines (index-wise), an additional id column can be created in both dataframes and used for the join (see the sketch after these comments).
Seems to work. However, I am trying to modify the condition to cond = (df_bb.cell == df_aa.cell1)|(df_bb.cell == df_aa.cell2)&(df_aa.cell1 != 0)&(df_aa.cell2 != 0), so that rows where cell1 or cell2 is 0 are not added, but this seems to have no effect.
You have to do it in a different way: cond = ((df_bb.cell == df_aa.cell1)|(df_bb.cell == df_aa.cell2)) & (~(df_aa.cell1 == '0')) & (~(df_aa.cell2 == '0')). Because & binds more strongly than |, your version is parsed as cell == cell1 | (cell == cell2 & cell1 != 0 & cell2 != 0), so any row matching cell1 passes regardless of the extra checks; the outer parentheses fix that, and conditions are usually negated with ~. It might also help to change the join type to "inner" so you only get the successfully joined records: use join(df_aa, cond, how="inner").
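
Spelled out as blocks for readability (the same expressions as in the comments; comparing against '0' assumes the cell columns hold strings):

cond = (
    ((df_bb.cell == df_aa.cell1) | (df_bb.cell == df_aa.cell2))
    & (~(df_aa.cell1 == '0'))
    & (~(df_aa.cell2 == '0'))
)
joined = df_bb.join(df_aa, cond, how="inner")  # keep only successfully joined rows

And a sketch of the extra-id-column idea from the first comment, pairing rows positionally (row order is only dependable here because these small dataframes have not been shuffled):

from pyspark.sql.window import Window
from pyspark.sql.functions import monotonically_increasing_id, row_number

w = Window.orderBy(monotonically_increasing_id())
df_aa_i = df_aa.withColumn("rid", row_number().over(w))  # 1, 2, 3, ...
df_bb_i = df_bb.withColumn("rid", row_number().over(w))
paired = df_bb_i.join(df_aa_i, on="rid", how="inner")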
