
In pandas, I can successfully run the following:

def car(t):
    if t in df_a:
        return df_a[t] / df_b[t]
    else:
        return 0

But how can I do the exact same thing with a Spark DataFrame? Many thanks!

The data looks like this:

df_a
a 20
b 40
c 60

df_b
a 80
b 50
e 100

The result should be 0.25 when I call car("a").
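
For reference, a minimal runnable sketch of the pandas version above, assuming df_a and df_b are Series indexed by those keys (as value_counts() would produce):

import pandas as pd

df_a = pd.Series({"a": 20, "b": 40, "c": 60})
df_b = pd.Series({"a": 80, "b": 50, "e": 100})

def car(t):
    # Return the ratio if t is a key of df_a, else 0. Note this assumes
    # t is also present in df_b; car("c") would raise a KeyError.
    if t in df_a:
        return df_a[t] / df_b[t]
    return 0

print(car("a"))  # 0.25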

  • What are you trying to compute? Commented Oct 18, 2016 at 13:35
  • I am using Hadoop; I just want to convert the code from pandas to Spark. Commented Oct 18, 2016 at 13:43
  • Yes, but what does that function do? You should show the input and the output. Commented Oct 18, 2016 at 14:18
  • df_a contains the ids; I run df_a.value_counts() before I run the code above. Commented Oct 18, 2016 at 16:01
  • Are you using Scala or Pyspark? Commented Oct 18, 2016 at 16:11

1 Answer


First join both DataFrames on the key, then filter by the key you want and select the computed ratio.

df_a = sc.parallelize([("a", 20), ("b", 40), ("c", 60)]).toDF(["key", "value"])
df_b = sc.parallelize([("a", 80), ("b", 50), ("e", 100)]).toDF(["key", "value"])

def car(c):
    # Inner-join the two DataFrames on "key", keep only the requested key,
    # and divide df_a's value by df_b's value.
    return (df_a.join(df_b, on=["key"])
                .where(df_a["key"] == c)
                .select((df_a["value"] / df_b["value"]).alias("ratio"))
                .head())

car("a")

# Row(ratio=0.25)
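
Note that head() returns None when the key is missing from either DataFrame, unlike the pandas version, which returns 0. A hedged variant restoring that fallback (the helper name car_or_zero is hypothetical):

def car_or_zero(c):
    row = (df_a.join(df_b, on=["key"])
               .where(df_a["key"] == c)
               .select((df_a["value"] / df_b["value"]).alias("ratio"))
               .head())
    # head() yields None for an empty result, e.g. for key "c",
    # which exists in df_a but not in df_b.
    return row["ratio"] if row is not None else 0

car_or_zero("c")  # 0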

2 Comments

One more question: can the input be a DataFrame? I would like to pass in a DataFrame df_c which contains the keys, and then car() would loop through each key in df_c and output the ratio for each key.
You would have to show me an example first. However, avoid thinking in such an imperative way; Spark is lazy and most of the computation is done in parallel.
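
To make that last comment concrete, here is a sketch of the non-imperative approach, assuming a hypothetical df_c that holds the keys of interest: a single pair of joins computes the ratio for every key at once, with no Python-side loop.

from pyspark.sql import functions as F

# Hypothetical DataFrame of keys to look up.
df_c = sc.parallelize([("a",), ("b",)]).toDF(["key"])

# Rename the value columns to avoid ambiguity after the joins,
# then compute all ratios in one distributed operation.
ratios = (df_c
          .join(df_a.withColumnRenamed("value", "value_a"), on=["key"])
          .join(df_b.withColumnRenamed("value", "value_b"), on=["key"])
          .select("key", (F.col("value_a") / F.col("value_b")).alias("ratio")))

ratios.show()
# key "a" -> 0.25, key "b" -> 0.8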
