I'm using Spark on Google Cloud to process data from Google Analytics, but I don't know how to select custom dimensions based on their index.

The structure of GA's custom dimensions is the following:

ARRAY<STRUCT< index: INTEGER, value:STRING >>

Usually, in BigQuery, I would do a subquery to select the data like

SELECT (select value from customDimensions where index = 2)

But as explained here, a subquery in SELECT is not yet supported in Spark SQL.

  • Have you tried using UNSET? Commented Jan 2, 2019 at 14:17

1 Answer

I know nothing about Spark on Google Cloud, but if it's close enough to Apache Spark, you can use the element_at function, which returns the element of an array at the given (1-based) position, followed by the dot accessor to read a field of the returned struct.

// create a sample dataset
val structData = Seq((0,"zero"), (1, "one")).toDF("id", "value")
val data = structData
  .select(struct("id", "value") as "s")
  .groupBy()
  .agg(collect_list("s") as "a")

// the schema matches the requirements
scala> data.printSchema
root
 |-- a: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: integer (nullable = false)
 |    |    |-- value: string (nullable = true)

data.createOrReplaceTempView("customDimensions")

The following query won't work because index is not a column of the view; the only column is the array a.

scala> sql("select value from customDimensions where index = 2").show
org.apache.spark.sql.AnalysisException: cannot resolve '`index`' given input columns: [customdimensions.a]; line 1 pos 41;
'Project ['value]
+- 'Filter ('index = 2)
   +- SubqueryAlias `customdimensions`
      +- Aggregate [collect_list(s#9, 0, 0) AS a#13]
         +- Project [named_struct(id, id#5, value, value#6) AS s#9]
            +- Project [_1#2 AS id#5, _2#3 AS value#6]
               +- LocalRelation [_1#2, _2#3]
...

Let's use the standard element_at function instead.

scala> sql("select element_at(a, 2) from customDimensions").show
+----------------+
|element_at(a, 2)|
+----------------+
|        [1, one]|
+----------------+

Each element of the array is a struct, so you can use . (dot) to access its fields.

scala> sql("select element_at(a, 2).value from customDimensions").show
+----------------------+
|element_at(a, 2).value|
+----------------------+
|                   one|
+----------------------+
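
Note that element_at picks an element by its position in the array, not by the value of a field inside the struct. If the goal is to match on the struct's own field (index in GA's schema, id in the sample data here), one option is to combine element_at with the filter higher-order function, which is available in Spark SQL 2.4 and later. The following is a minimal sketch under those assumptions; it rebuilds the same sample data, and the app name and local master are placeholders for running it outside spark-shell.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, struct}

// In spark-shell, getOrCreate() simply returns the existing session.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("custom-dimensions-sketch") // placeholder name
  .getOrCreate()
import spark.implicits._

// Same sample data as above: one row holding an array of (id, value) structs.
val data = Seq((0, "zero"), (1, "one")).toDF("id", "value")
  .select(struct("id", "value") as "s")
  .groupBy()
  .agg(collect_list("s") as "a")
data.createOrReplaceTempView("customDimensions")

// Keep only the structs whose id field equals 1, then take the first match
// and read its value field.
val byField = spark.sql(
  "select element_at(filter(a, x -> x.id = 1), 1).value as value from customDimensions")

// Pull the single result back to the driver as a String.
val result = byField.head.getString(0)
```

With real GA data the lambda would presumably read x -> x.index = 2 against the customDimensions column. Matching on the field also avoids relying on the order of the array elements, which collect_list does not guarantee after a shuffle.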