I wish to access different fields / subfields from a fairly deeply nested structure with arrays in order to do arithmetic operations on them. Some of the data is actually in the field names themselves (the structure that I have to access is created that way and there is nothing I can do about that). In particular, I have a list of numbers as the field names which I must use, and these will change from one json file to the next, so I must dynamically infer what those field names are and then use them with subfield values.

I've looked at "Access names of fields in struct Spark SQL". Unfortunately, I do not know in advance what the field names of my structure will be, so I cannot use it.

I've also tried "how to extract the column name and data type from nested struct type in spark", which looked promising. Unfortunately, whatever the magic in the "flatten" function does, I have not been able to adapt it to work on field names rather than on the fields themselves.

Here is an example json dataset. It represents consumption baskets:

  • each of the two baskets "comp A" and "comp B" has a number of prices as subfields: compA.'55.80' is a price, compA.'132.88' is another price, etc.
  • I wish to associate those unit prices with the quantities in their respective subfields: compA.'55.80'[0].comment.qty (500) and compA.'55.80'[1].comment.qty (600) should both be associated with 55.80, compA.'132.88'[0].comment.qty (700) should be associated with 132.88, etc.

{"type":"test","name":"john doe","products":{
    "baskets":{
        "comp A":{
            "55.80":[{"type":"fun","comment":{"qty":500,"text":"hello"}},{"type":"work","comment":{"qty":600,"text":"hello"}}]
            ,"132.88":[{"type":"fun","comment":{"qty":700,"text":"hello"}}]
            ,"0.03":[{"type":"fun","comment":{"qty":500,"text":"hello"}},{"type":"work","comment":{"qty":600,"text":"hello"}}]
        }
        ,"comp B":{
            "55.70":[{"type":"fun","comment":{"qty":500,"text":"hello"}},{"type":"work","comment":{"qty":600,"text":"hello"}}]
            ,"132.98":[{"type":"fun","comment":{"qty":300,"text":"hello"}},{"type":"work","comment":{"qty":900,"text":"hello"}}]
            ,"0.01":[{"type":"fun","comment":{"qty":400,"text":"hello"}}]
        }
    }
}}

I would like to obtain all these numbers in a dataframe in order to do some operations on them:

+--------+--------+----------+
| basket | price  | quantity |
+--------+--------+----------+
| comp A | 55.80  | 500      |
| comp A | 55.80  | 600      |
| comp A | 132.88 | 700      |
| comp A | 0.03   | 500      |
| comp A | 0.03   | 600      |
| comp B | 55.70  | 500      |
| comp B | 55.70  | 600      |
| comp B | 132.98 | 300      |
| comp B | 132.98 | 900      |
| comp B | 0.01   | 400      |
+--------+--------+----------+
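
One possible way to get a table of this shape, sketched here under assumptions rather than taken from the thread: because the basket names and the prices are keys of JSON objects, they can be read as map keys instead of struct field names by supplying an explicit schema built with MapType, after which explode turns each key into an ordinary column value. The object name BasketsAsMaps, the helper flatten, the trimmed sample string and the decimal(10,2) cast are illustrative choices, not part of the original data set.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}
import org.apache.spark.sql.types._

object BasketsAsMaps {
  // Trimmed copy of the example JSON from the question, kept on a single line.
  val sample: String =
    """{"type":"test","name":"john doe","products":{"baskets":{"comp A":{"55.80":[{"type":"fun","comment":{"qty":500,"text":"hello"}},{"type":"work","comment":{"qty":600,"text":"hello"}}],"132.88":[{"type":"fun","comment":{"qty":700,"text":"hello"}}]},"comp B":{"0.01":[{"type":"fun","comment":{"qty":400,"text":"hello"}}]}}}}"""

  // Only the innermost records have a fixed layout, so only they are structs.
  private val item: StructType = new StructType()
    .add("type", StringType)
    .add("comment", new StructType().add("qty", LongType).add("text", StringType))

  // Baskets and prices are declared as maps: their keys are data, not schema.
  val schema: StructType = new StructType()
    .add("products", new StructType()
      .add("baskets", MapType(StringType, MapType(StringType, ArrayType(item)))))

  // One explode per nesting level: basket name, then price, then array element.
  def flatten(raw: DataFrame): DataFrame =
    raw
      .select(explode(col("products.baskets")).as(Seq("basket", "prices")))
      .select(col("basket"), explode(col("prices")).as(Seq("price", "items")))
      // Assumption: two decimal places are enough for the price keys.
      .select(col("basket"), col("price").cast("decimal(10,2)").as("price"),
        explode(col("items")).as("item"))
      .select(col("basket"), col("price"), col("item.comment.qty").as("quantity"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("baskets-as-maps").getOrCreate()
    import spark.implicits._

    // Reading with the explicit schema skips inference, so unknown prices are fine.
    val raw = spark.read.schema(schema).json(Seq(sample).toDS())
    flatten(raw).show(false)
    spark.stop()
  }
}
```

With this layout the prices never appear in the schema, so the same code should apply to files whose price keys differ. Against real files, the Dataset of strings would be replaced by a path passed to spark.read.schema(schema).json(...).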

The original dataset is accessed as such:

scala> myDs
res135: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [products: struct<baskets: struct<compA: struct<55.80: array<struct .....
  • I'm making a bit of progress using the following: spark.read.json(myDs.withColumn("compA",get_json_object($"json","$.products.baskets.compA")).select("compA").rdd.map(_.getString(0))).columns. This yields: Array[String] = Array(55.80, 132.88, 0.03, .... This is not enough, but it's a start... Commented Jul 7, 2019 at 3:02
  • stackoverflow.com/questions/52525013/… this should give you some guidance. exploding reqd. Commented Jul 7, 2019 at 6:20
  • Any joy yet on this front? Commented Jul 7, 2019 at 9:19
  • The main difficulties are that 1. I do not know the schema in advance, since the prices change, and 2. the prices are field names. Unless I am mistaken, the example you linked to (the one with films) therefore does not help in this respect? Commented Jul 7, 2019 at 13:24
  • That example I had is pretty straight forward. U need to know something on schema. Commented Jul 7, 2019 at 13:30
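
The idea in the first comment above, reading the prices back as column names, can also be sketched by walking the schema that Spark infers when no schema is supplied. This is only an illustration under assumptions: the object name ListPriceFieldNames, the helpers structOf and priceNames, and the trimmed sample string are hypothetical and not from the thread.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{StructField, StructType}

object ListPriceFieldNames {
  // Trimmed copy of the example JSON from the question, kept on a single line.
  val sample: String =
    """{"products":{"baskets":{"comp A":{"55.80":[{"type":"fun","comment":{"qty":500,"text":"hello"}}],"132.88":[{"type":"fun","comment":{"qty":700,"text":"hello"}}]},"comp B":{"0.01":[{"type":"fun","comment":{"qty":400,"text":"hello"}}]}}}}"""

  // Return the struct type of a field, failing loudly if it is not a struct.
  private def structOf(field: StructField): StructType = field.dataType match {
    case s: StructType => s
    case other => throw new IllegalArgumentException(s"${field.name} is $other, not a struct")
  }

  // Map each basket name to the price field names found in the inferred schema.
  def priceNames(df: DataFrame): Map[String, Seq[String]] = {
    val products = structOf(df.schema("products"))
    val baskets = structOf(products("baskets"))
    baskets.fields.map(f => f.name -> structOf(f).fieldNames.toSeq).toMap
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("list-price-fields").getOrCreate()
    import spark.implicits._

    // No schema is given, so Spark infers one struct field per price key.
    val df = spark.read.json(Seq(sample).toDS())
    priceNames(df).foreach { case (basket, prices) =>
      println(s"$basket -> ${prices.mkString(", ")}")
    }
    spark.stop()
  }
}
```

When these names are used in a column path, names that contain a dot or a space need backtick quoting, for example col("products.baskets.`comp A`.`55.80`"); without the backticks the dot is read as a nesting separator.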

1 Answer

Processing data that arrives as column names is not an approach to follow. It will simply not work.

6 Comments

I haven't completely given up yet (as I have no power over the data format, and the data is there...). I'll post updates as they come along.
Good for you but real hard yakka.
this answer states that it's impossible but does not show why it is impossible. Actually, I am pretty sure this is possible => I'd prefer that someone (maybe me) find the actual answer and post it here to document / help others.
Then I look forward to the answer one day. I note that there are more than a few experts on this platform, far better than me, and no response to-date.
Solved yet? Pls post if so.