I have a dataframe with user, location, and values columns. Currently, US-related locations appear in separate rows for each user:

user     location   values
209       OH_US          45
O09       PA_US          30
O09       AQ             10
209       CA_US          50
209       UK             10 
....          

For each user, I want to replace the US-related rows with a single new row whose location is 'US' and whose value is the sum, removing the individual US-state rows. The expected result looks like this:

user     location   values
209       US          200
209       UK          10
O09       US          300
O09       AQ          10
...

Currently I'm thinking of pulling all US-related rows into a separate dataframe, summing them with a groupby, then dropping the US rows from the original dataframe and joining the result with the US-sum dataframe.
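For concreteness, here is the multi-step plan described above sketched in pandas (assuming a pandas DataFrame built from the sample rows shown; variable names are illustrative):

```python
import pandas as pd

# sample data from the question
df = pd.DataFrame({
    "user": ["209", "O09", "O09", "209", "209"],
    "location": ["OH_US", "PA_US", "AQ", "CA_US", "UK"],
    "values": [45, 30, 10, 50, 10],
})

is_us = df["location"].str.endswith("_US")

# 1) pull the US rows into a separate frame and sum them per user
us_sum = (
    df[is_us]
    .groupby("user", as_index=False)["values"]
    .sum()
    .assign(location="US")
)

# 2) drop the US rows from the original and stitch the sums back on
result = pd.concat([df[~is_us], us_sum], ignore_index=True)
```

This works, but it makes two passes over the data and an extra concatenation, which is what prompts the question below.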

Is there a more efficient way to do this?

1 Answer

There are multiple approaches to solve this in PySpark:

  1. Using spark.sql:

    df.createOrReplaceTempView("SAMPLE_TABLE")

    # '_' is a single-character wildcard in SQL LIKE, so escape it
    # to match a literal underscore
    df2 = spark.sql("""
        SELECT user,
               CASE WHEN location LIKE '%\\_US' THEN 'US' ELSE location END AS location,
               SUM(values) AS values
        FROM SAMPLE_TABLE
        GROUP BY user,
                 CASE WHEN location LIKE '%\\_US' THEN 'US' ELSE location END
    """)

    df2.show()
  2. Using the PySpark API:

    import pyspark.sql.functions as F

    # include 'user' in the grouping so each user keeps its own totals;
    # endswith matches the literal suffix, avoiding LIKE's '_' wildcard
    (df.groupBy(
            "user",
            F.when(F.col("location").endswith("_US"), "US")
             .otherwise(F.col("location"))
             .alias("location"))
       .agg(F.sum("values").alias("values"))
       .show())
    