I have a dataframe like:

+--------+-------+--------------------+-------------------+
|     id1|    id2|                body|         created_at|
+--------+-------+--------------------+-------------------+
|1       |      4|....................|2017-10-01 00:00:05|
|2       |      3|....................|2017-10-01 00:00:05|
|3       |      2|....................|2017-10-01 00:00:05|
|4       |      1|....................|2017-10-01 00:00:05|
+--------+-------+--------------------+-------------------+

I would like to filter the table on id1 and id2 together. For example, get the rows where (id1 = 1 and id2 = 4) or (id1 = 2 and id2 = 3).

Currently, I'm using a loop to generate a giant query string for df.filter(), i.e. ((id1 = 1) and (id2 = 4)) or ((id1 = 2) and (id2 = 3)). Is there a more proper way to achieve this?
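
For reference, the loop-based approach described above could look roughly like the sketch below. The function name build_pair_filter is made up for illustration, and it assumes the ids are integers (values are coerced with int() so nothing unexpected ends up in the query string).

```python
def build_pair_filter(pairs, left="id1", right="id2"):
    """Build a SQL condition string that matches any of the given (id1, id2) pairs.

    Hypothetical helper: assumes both ids are integers.
    """
    if not pairs:
        raise ValueError("at least one (id1, id2) pair is required")
    # One parenthesised AND clause per pair, all clauses OR-ed together.
    clauses = [
        f"(({left} = {int(a)}) AND ({right} = {int(b)}))"
        for a, b in pairs
    ]
    return " OR ".join(clauses)


condition = build_pair_filter([(1, 4), (2, 3)])
print(condition)
# prints: ((id1 = 1) AND (id2 = 4)) OR ((id1 = 2) AND (id2 = 3))
```

The resulting string would then be passed to df.filter(condition).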

1 Answer

You can generate a helper DataFrame (table) that holds the wanted (id1, id2) pairs:

tmp:

+--------+-------+
|     id1|    id2|
+--------+-------+
|1       |      4|
|2       |      3|
+--------+-------+

and then join them:

SELECT a.*
FROM tab a
JOIN tmp b
  ON (a.id1 = b.id1 and a.id2 = b.id2)

where tab is your original DataFrame and tmp is the helper DataFrame, both registered as tables.
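
The same join can also be written with the DataFrame API instead of SQL. This is only a sketch, assuming PySpark: the function name filter_by_pairs is invented here, spark is assumed to be an active SparkSession, and df is the original DataFrame.

```python
def filter_by_pairs(spark, df, pairs, keys=("id1", "id2")):
    """Keep only the rows of `df` whose key columns match one of `pairs`.

    Hypothetical helper: `spark` is assumed to be an active SparkSession
    and `df` a DataFrame that contains the columns named in `keys`.
    """
    # Helper DataFrame with just the wanted key combinations (the "tmp" table).
    tmp = spark.createDataFrame(list(pairs), list(keys))
    # Inner join on both key columns. Joining on column names keeps a single
    # copy of id1/id2 in the result, much like SELECT a.* in the SQL version.
    return df.join(tmp, on=list(keys), how="inner")
```

It would be called as filter_by_pairs(spark, df, [(1, 4), (2, 3)]). If the list of pairs may contain duplicates, an inner join repeats the matching rows; passing how="left_semi" instead returns each matching row of df once.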

2 Comments

Thanks MaxU, how is the performance of this approach? I tried it to select 2 rows from 437 rows, which took 8.11s and my original approach took 0.03s.
I guess this approach will be slower, but it doesn't depend on the number of rows in the tmp table. Your approach may fail if the condition string gets too long...
