I want to collect three fields from one dataframe into an array, then use that array to remove from another dataframe the rows whose codes are stored in the array.

df1

id; id1; code; date_create
1; 100; 50; 2021-10-10
2; 200; 60; 2021-10-10
3; 300; 70; 2021-10-10
4; 400; 80; 2021-10-10
5; 500; 90; 2021-10-10

df2

id; id1; code; date_create
1; 100; 50; 2021-10-10
2; 200; 60; 2021-10-10
3; 300; 70; 2021-10-10
4; 400; 80; 2021-10-15
5; 500; 90; 2021-10-15
6; 600; 100; 2021-10-15
7; 700; 101; 2021-10-15

I would like to build the array like this:

Read df2 where date_create equals 2021-10-15 and save the fields id, id1, and code.

Then read the array and rebuild df1 without the rows whose id, id1, and code are in the array.

Something like the code below. It is not correct, it is more of an idea:

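# Collect the (id, id1, code) keys of the df2 rows dated 2021-10-15
keys = df2.filter(df2["date_create"] == "2021-10-15").select("id", "id1", "code").collect()
for k in keys:
    # Drop the df1 row whose three key fields all match k
    df1 = df1.filter(~((df1["id"] == k["id"]) & (df1["id1"] == k["id1"]) & (df1["code"] == k["code"])))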

Then I was going to do a union

df2.union(df1)

to avoid duplicate rows.

If anyone can help, I would appreciate it.

Expected result:
    id; id1; code; date_create
    1; 100; 50; 2021-10-10
    2; 200; 60; 2021-10-10
    3; 300; 70; 2021-10-10
    4; 400; 80; 2021-10-15
    5; 500; 90; 2021-10-15
    6; 600; 100; 2021-10-15
    7; 700; 101; 2021-10-15

1 Answer

You can do an anti-join to eliminate duplicates, and then union:

result = df1.join(df2, ['id', 'id1', 'code'], 'anti').union(df2)