This is the schema of my main data frame:

root
 |-- DataPartition: string (nullable = true)
 |-- TimeStamp: string (nullable = true)
 |-- _lineItemId: long (nullable = true)
 |-- _organizationId: long (nullable = true)
 |-- fl:FinancialConceptGlobal: string (nullable = true)
 |-- fl:FinancialConceptGlobalId: long (nullable = true)
 |-- fl:FinancialConceptLocal: string (nullable = true)
 |-- fl:FinancialConceptLocalId: long (nullable = true)
 |-- fl:InstrumentId: long (nullable = true)
 |-- fl:IsCredit: boolean (nullable = true)
 |-- fl:IsDimensional: boolean (nullable = true)
 |-- fl:IsRangeAllowed: boolean (nullable = true)
 |-- fl:IsSegmentedByOrigin: boolean (nullable = true)
 |-- fl:LineItemName: struct (nullable = true)
 |    |-- _VALUE: string (nullable = true)
 |    |-- _languageId: long (nullable = true)
 |-- fl:LocalLanguageLabel: struct (nullable = true)
 |    |-- _VALUE: string (nullable = true)
 |    |-- _languageId: long (nullable = true)
 |-- fl:SegmentChildDescription: struct (nullable = true)
 |    |-- _VALUE: string (nullable = true)
 |    |-- _languageId: long (nullable = true)
 |-- fl:SegmentGroupDescription: string (nullable = true)
 |-- fl:Segments: struct (nullable = true)
 |    |-- fl:SegmentSequence: struct (nullable = true)
 |    |    |-- _VALUE: long (nullable = true)
 |    |    |-- _segmentId: long (nullable = true)
 |-- fl:StatementTypeCode: string (nullable = true)
 |-- FFAction|!|: string (nullable = true)

From this, my required output is:

LineItem.organizationId|^|LineItem.lineItemId|^|StatementTypeCode|^|LineItemName|^|LocalLanguageLabel|^|FinancialConceptLocal|^|FinancialConceptGlobal|^|IsDimensional|^|InstrumentId|^|LineItemSequence|^|PhysicalMeasureId|^|FinancialConceptCodeGlobalSecondary|^|IsRangeAllowed|^|IsSegmentedByOrigin|^|SegmentGroupDescription|^|SegmentChildDescription|^|SegmentChildLocalLanguageLabel|^|LocalLanguageLabel.languageId|^|LineItemName.languageId|^|SegmentChildDescription.languageId|^|SegmentChildLocalLanguageLabel.languageId|^|SegmentGroupDescription.languageId|^|SegmentMultipleFundbDescription|^|SegmentMultipleFundbDescription.languageId|^|IsCredit|^|FinancialConceptLocalId|^|FinancialConceptGlobalId|^|FinancialConceptCodeGlobalSecondaryId|^|FFAction|!|
4295879842|^|1246|^|CUS|^|Net Sales-Customer Segment|^|相手先別の販売高(相手先別)|^|JCSNTS|^|REXM|^|False|^||^||^||^||^|False|^|False|^|CUS_JCSNTS|^||^||^|505126|^|505074|^|505074|^|505126|^|505126|^||^|505074|^|True|^|3020155|^|3015249|^||^|I|!|

To get the above output, this is what I have tried:

val dfContentEnvelope = sqlContext.read.format("com.databricks.spark.xml").option("rowTag", "env:ContentEnvelope").load("C://Users//u6034690//Desktop//SPARK//trfsmallfffile//XML")
val dfContentItem = dfContentEnvelope.withColumn("column1", explode(dfContentEnvelope("env:Body.env:ContentItem"))).select($"env:Header.fun:DataPartitionId".as("DataPartition"), $"env:Header.env:info.env:TimeStamp".as("TimeStamp"), $"column1.*")
val dfType = dfContentItem.select(getDataPartition($"DataPartition").as("DataPartition"), $"TimeStamp".as("TimeStamp"), $"env:Data.fl:LineItem.*", getFFActionParent($"_action").as("FFAction|!|")).filter($"env:Data.fl:LineItem._organizationId".isNotNull)

With this, I am getting the output below:

+------------------+-------------------------+-----------+---------------+-------------------------+---------------------------+------------------------+--------------------------+---------------+-----------+----------------+-----------------+----------------------+-----------------------------------------------------------------------------------------------------+---------------------+--------------------------+--------------------------+-----------+--------------------+-----------+
|DataPartition     |TimeStamp                |_lineItemId|_organizationId|fl:FinancialConceptGlobal|fl:FinancialConceptGlobalId|fl:FinancialConceptLocal|fl:FinancialConceptLocalId|fl:InstrumentId|fl:IsCredit|fl:IsDimensional|fl:IsRangeAllowed|fl:IsSegmentedByOrigin|fl:LineItemName                                                                                      |fl:LocalLanguageLabel|fl:SegmentChildDescription|fl:SegmentGroupDescription|fl:Segments|fl:StatementTypeCode|FFAction|!||
+------------------+-------------------------+-----------+---------------+-------------------------+---------------------------+------------------------+--------------------------+---------------+-----------+----------------+-----------------+----------------------+-----------------------------------------------------------------------------------------------------+---------------------+--------------------------+--------------------------+-----------+--------------------+-----------+
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|3          |4298009288     |XTOT                     |3016350                    |null                    |null                      |null           |true       |false           |false            |false                 |[Total Assets,505074]                                                                                |null                 |null                      |null                      |null       |BAL                 |I|!|       |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|9          |4298009288     |XTCOI                    |3016329                    |null                    |null                      |21521455386    |true       |false           |false            |false                 |[S/O-Ordinary Shares,505074]                                                                         |null                 |null                      |null                      |null       |BAL                 |I|!|       |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|10         |4298009288     |XTCOC                    |3016328                    |null                    |null                      |null           |true       |false           |false            |false                 |[Total Equivalent No of Common Shares O/S,505074]                                                    |null                 |null                      |null                      |null       |BAL                 |I|!|       |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|11         |4298009288     |XTCTI                    |3016331                    |null                    |null                      |21521455386    |true       |false           |false            |false                 |[T/S-Ordinary Shares,505074]                                                                         |null                 |null                      |null                      |null       |BAL                 |I|!|       |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|19         |4298009288     |ESGA                     |3018991                    |null                    |null                      |null           |false      |false           |false            |false                 |[General and administrative expense,505074]                                                          |null                 |null                      |null                      |null       |INC                 |I|!|       |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|20         |4298009288     |XTOE                     |3016349                    |null                    |null                      |null           |false      |false           |false            |false                 |[Total Operating Expense,505074]                                                                     |null                 |null                      |null                      |null       |INC                 |I|!|       |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|21         |4298009288     |XIBT                     |3016299                    |null                    |null                      |null           |true       |false           |false            |false                 |[Net Income Before Taxes,505074]                                                                     |null                 |null                      |null                      |null       |INC                 |I|!|       |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|22         |4298009288     |TTAX                     |3019472                    |null                    |null                      |null           |false      |false           |false            |false                 |[Income tax benefit,505074]                                                                          |null                 |null                      |null                      |null       |INC                 |I|!|       |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|23         |4298009288     |XIAT                     |3016297                    |null                    |null                      |null           |true       |false           |false            |false                 |[Net Income After Taxes,505074]                                                                      |null                 |null                      |null                      |null       |INC                 |I|!|       |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|24         |4298009288     |XBXP                     |3016252                    |null                    |null                      |null           |true       |false           |false            |false                 |[Net Income Before Extra. Items,505074]                                                              |null                 |null                      |null                      |null       |INC                 |I|!|       |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|25         |4298009288     |XNIC                     |3019922                    |null                    |null                      |null           |true       |false           |false            |false                 |[Net loss,505074]                                                                                    |null                 |null                      |null                      |null       |INC                 |I|!|       |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|26         |4298009288     |XNCN                     |3016316                    |null                    |null                      |null           |true       |false           |false            |false                 |[Income Available to Com Excl ExtraOrd,505074]                                                       |null                 |null                      |null                      |null       |INC                 |I|!|       |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|27         |4298009288     |XNCX                     |3016318                    |null                    |null                      |null           |true       |false           |false            |false                 |[Net loss,505074]                                                                                    |null                 |null                      |null                      |null       |INC                 |I|!|       |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|29         |4298009288     |CDNI                     |3018735                    |null                    |null                      |null           |true       |false           |false            |false                 |[Diluted Net Income,505074]                                                                          |null                 |null                      |null                      |null       |INC                 |I|!|       |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|30         |4298009288     |XTAX                     |3019589                    |null                    |null                      |null           |false      |false           |false            |false                 |[Income Taxes - Total,505074]                                                                        |null                 |null                      |null                      |null       |INC                 |I|!|       |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|33         |4298009288     |RNTS                     |3015275                    |null                    |null                      |null           |true       |false           |false            |false                 |[Revenues,505074]                                                                                    |null                 |null                      |null                      |null       |INC                 |I|!|       |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|34         |4298009288     |XTLR                     |3016345                    |null                    |null                      |null           |true       |false           |false            |false                 |[Total revenues,505074]                                                                              |null                 |null                      |null                      |null       |INC                 |I|!|       |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|35         |4298009288     |XTCII                    |3016326                    |null                    |null                      |21521455386    |true       |false           |false            |null                  |[Common Shares Issued - (Instrument Level),505074]                                                   |null                 |null                      |null                      |null       |BAL                 |I|!|       |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|36         |4298009288     |XTCTIPF                  |1002023922                 |null                    |null                      |21521455386    |true       |false           |false            |null                  |[Common Treasury Shares on Instrument Level Multiplied to its Conversion to Primary Factor,505074]   |null                 |null                      |null                      |null       |BAL                 |I|!|       |
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|37         |4298009288     |XTCOIPF                  |1002023921                 |null                    |null                      |21521455386    |true       |false           |false            |null                  |[Common Shares Outstanding on Instrument Level Multiplied to its Conversion to Primary Factor,505074]|null                 |null                      |null                      |null       |BAL                 |I|!|       |
+------------------+-------------------------+-----------+---------------+-------------------------+---------------------------+------------------------+--------------------------+---------------+-----------+----------------+-----------------+----------------------+-----------------------------------------------------------------------------------------------------+---------------------+--------------------------+--------------------------+-----------+--------------------+-----------+

My issue is with the column fl:LineItemName. It is a struct type, and I need to create two separate columns from it: one for _VALUE, named LineItemName, and another for _languageId, named LanguageId.

I have to do the same for fl:LocalLanguageLabel and fl:SegmentChildDescription.

Do I have to do this using withColumn, or is there a way to do it without that?

This is working for me except for the last line:

val dfType = dfContentItem.select(getDataPartition($"DataPartition").as("DataPartition"), $"TimeStamp".as("TimeStamp"), $"env:Data.fl:LineItem.*", getFFActionParent($"_action").as("FFAction|!|")).filter($"env:Data.fl:LineItem._organizationId".isNotNull)

val dfnewTemp = dfType
  .withColumn("LineItemName", $"fl:LineItemName._VALUE")
  .withColumn("LineItemName.languageId", $"fl:LineItemName._languageId")
  .withColumn("LocalLanguageLabel", $"fl:LocalLanguageLabel._VALUE")
  .withColumn("LocalLanguageLabel.languageId", $"fl:LocalLanguageLabel._languageId")
  .withColumn("SegmentChildDescription", $"fl:SegmentChildDescription._VALUE")
  .withColumn("SegmentChildDescription.languageId", $"fl:SegmentChildDescription._languageId")
  .drop($"fl:LineItemName")
  .drop($"fl:LocalLanguageLabel")
  .drop($"fl:SegmentChildDescription")
dfnewTemp.show(false)
val temp = dfnewTemp.select(dfnewTemp.columns.filter(x => !x.equals("fl:Segments")).map(x => col(x).as(x.replace("_", "LineItem_").replace("fl:", ""))): _*)
  • So, if I understand you correctly, you want to take the fl:LineItemName column and split it into two (LineItemName and LanguageId)? Commented Feb 20, 2018 at 9:29
  • @Shaido yes, exactly Commented Feb 20, 2018 at 9:33

1 Answer

Use withColumn and select the fields inside the struct directly. The fl:LineItemName column contains a struct with two fields, _VALUE and _languageId, which can be selected as follows:

val df = dfType.withColumn("LineItemName", $"fl:LineItemName._VALUE")
  .withColumn("LanguageId", $"fl:LineItemName._languageId")
  .drop("fl:LineItemName")

For the other two mentioned columns, simply do the same thing.
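
A minimal sketch of applying the same pattern to all three struct columns, assuming Spark 2.x or later and a local SparkSession. The names flattenStructs and sampleDfType are hypothetical helpers, the one-row data frame is only a stand-in for the real dfType, and the output column names follow the headers in the question's required output.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, struct}

object FlattenStructColumns {
  // Struct columns to flatten, written without the "fl:" prefix.
  val structNames: Seq[String] =
    Seq("LineItemName", "LocalLanguageLabel", "SegmentChildDescription")

  // Hypothetical stand-in for dfType: one row, each struct holding (_VALUE, _languageId).
  def sampleDfType(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val base = Seq((3L, 4298009288L)).toDF("_lineItemId", "_organizationId")
    structNames.foldLeft(base) { (df, name) =>
      df.withColumn(
        s"fl:$name",
        struct(lit(s"sample $name").as("_VALUE"), lit(505074L).as("_languageId")))
    }
  }

  // For each struct: pull out both fields under distinct names, then drop the struct.
  def flattenStructs(df: DataFrame, names: Seq[String]): DataFrame =
    names.foldLeft(df) { (acc, name) =>
      acc
        .withColumn(name, col(s"fl:$name._VALUE"))
        .withColumn(s"$name.languageId", col(s"fl:$name._languageId"))
        .drop(s"fl:$name")
    }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, for illustration only.
    val spark = SparkSession.builder().master("local[1]").appName("flatten-structs").getOrCreate()
    val flat = flattenStructs(sampleDfType(spark), structNames)
    flat.printSchema()
    spark.stop()
  }
}
```

Folding over the list of names keeps each struct's two output columns distinct, which avoids reusing a name such as LanguageId for more than one struct.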


8 Comments

Yes, thank you. We also need to drop($"fl:LineItemName"). I was wondering whether we can do this in the explode itself ...
@SUDARSHAN: That is correct. It would also be possible to use explode; however, it's not so convenient here since the column names differ from the variable names in the struct.
We have to do the same for fl:LocalLanguageLabel and fl:SegmentChildDescription as well.
@SUDARSHAN: Yes, as I mentioned you need to do the same thing for the other two columns (but make sure to use different column names).
Just one more clarification. I have updated my question with the latest change. I am doing this on the last line but getting an error: Exception in thread "main" org.apache.spark.sql.AnalysisException: Can't extract value from LineItemName#368: need struct type but got string;
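
The AnalysisException quoted in the last comment is what Spark reports when col("LineItemName.languageId") is parsed as the field languageId of a column named LineItemName, which by that point is a plain string. A minimal sketch of one way around it, assuming Spark 2.x or later: wrap each name in backticks so the dot is treated as part of the column name. The helper quoted and the one-row data frame are hypothetical and serve only as illustration.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object DottedColumnNames {
  // Hypothetical helper: backticks make Spark read the whole string as one column name.
  def quoted(name: String): Column = col(s"`$name`")

  // Same renaming as the question's last line, but with every source name quoted.
  def renameAll(df: DataFrame): DataFrame =
    df.select(
      df.columns
        .filter(x => !x.equals("fl:Segments"))
        .map(x => quoted(x).as(x.replace("_", "LineItem_").replace("fl:", ""))): _*)

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, for illustration only.
    val spark = SparkSession.builder().master("local[1]").appName("dotted-names").getOrCreate()
    import spark.implicits._

    // Stand-in for dfnewTemp: one column name contains a dot.
    val dfnewTemp = Seq(("Total Assets", 505074L, 3L))
      .toDF("LineItemName", "LineItemName.languageId", "_lineItemId")

    renameAll(dfnewTemp).printSchema()
    spark.stop()
  }
}
```

An alternative is to avoid dots in intermediate column names altogether and introduce them only in the header of the written file.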