Merging duplicate columns in seq json hdfs files in spark

Question

I am reading a sequence file of JSON records from HDFS using Spark like this:

val data = spark.read.json(
  spark.sparkContext
    .sequenceFile[String, String]("/prod/data/class1/20190114/2019011413/class2/part-*")
    .map { case (_, y) => y.toString })

data.registerTempTable("data")

val filteredData = data.filter("sourceInfo='Web'")

val explodedData = filteredData.withColumn("A", explode(filteredData("payload.adCsm.vfrd")))
val explodedDataDbg = explodedData.withColumn("B", explode(filteredData("payload.adCsm.dbg"))).drop("payload")

This fails with the following error:

org.apache.spark.sql.AnalysisException: 
Ambiguous reference to fields StructField(adCsm,ArrayType(StructType(StructField(atfComp,StringType,true), StructField(csmTot,StringType,true), StructField(dbc,ArrayType(LongType,true),true), StructField(dbcx,LongType,true), StructField(dbg,StringType,true), StructField(dbv,LongType,true), StructField(fv,LongType,true), StructField(hdr,LongType,true), StructField(hidden,StructType(StructField(duration,LongType,true), StructField(stime,StringType,true)),true), StructField(hvrx,DoubleType,true), StructField(hvry,DoubleType,true), StructField(inf,StringType,true), StructField(isP,LongType,true), StructField(ltav,StringType,true), StructField(ltdb,StringType,true), StructField(ltdm,StringType,true), StructField(lteu,StringType,true), StructField(ltfm,StringType,true), StructField(ltfs,StringType,true), StructField(lths,StringType,true), StructField(ltpm,StringType,true), StructField(ltpq,StringType,true), StructField(ltts,StringType,true), StructField(ltut,StringType,true), StructField(ltvd,StringType,true), StructField(ltvv,StringType,true), StructField(msg,StringType,true), StructField(nl,LongType,true), StructField(prerender,StructType(StructField(duration,LongType,true), StructField(stime,LongType,true)),true), StructField(pt,StringType,true), StructField(src,StringType,true), StructField(states,StringType,true), StructField(tdr,StringType,true), StructField(tld,StringType,true), StructField(trusted,BooleanType,true), StructField(tsc,LongType,true), StructField(tsd,DoubleType,true), StructField(tsz,DoubleType,true), StructField(type,StringType,true), StructField(unloaded,StructType(StructField(duration,LongType,true), StructField(stime,LongType,true)),true), StructField(vdr,StringType,true), StructField(vfrd,LongType,true), StructField(visible,StructType(StructField(duration,LongType,true), StructField(stime,StringType,true)),true), StructField(xpath,StringType,true)),true),true), StructField(adcsm,ArrayType(StructType(StructField(tdr,DoubleType,true), 
StructField(vdr,DoubleType,true)),true),true);

I am not sure how, but only sometimes there are two structs with the same name "adCsm" inside "payload". Since I am interested in fields present in only one of them, I need to deal with this ambiguity.

I know one way is to check for fields A and B, drop the column if they are absent, and so choose the other adCsm. Is there a better way to handle this? Could I just merge the duplicate columns (which hold different data) instead of filtering explicitly? I am also not sure how duplicate structs can even be present in a sequence "json" file. TIA!

Md Shihab Uddin · Accepted Answer · 2019-01-21 17:33:36Z


I think the ambiguity is caused by case sensitivity in Spark DataFrame column names. In the last part of the schema I see:

StructField(adcsm,
ArrayType(StructType(
StructField(tdr,DoubleType,true), 
StructField(vdr,DoubleType,true)),true),true)

So there are two StructFields inside the same StructType whose names differ only by case (adCsm and adcsm). First enable case sensitivity in Spark SQL with:

sqlContext.sql("set spark.sql.caseSensitive=true")

Spark will then differentiate the two fields. Hopefully this helps.
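As a rough illustration of the suggestion above, here is a minimal sketch that assumes Spark 2.2 or later and a local session. The object name `CaseSensitiveDemo`, the helper `explodeVfrd`, and the inline sample record are hypothetical and only mimic the shape of the schema in the question; the real data would still come from the sequence file. It uses `spark.conf.set` on the `SparkSession` rather than `sqlContext.sql`, which should have the same effect for this setting.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

object CaseSensitiveDemo {

  // Hypothetical helper: reads one made-up record and explodes payload.adCsm.vfrd.
  def explodeVfrd(spark: SparkSession): DataFrame = {
    import spark.implicits._

    // Make field resolution case sensitive so "adCsm" and "adcsm" are
    // treated as two different fields. Set it before reading the data.
    spark.conf.set("spark.sql.caseSensitive", "true")

    // Made-up sample with two fields that differ only by case.
    val records = Seq(
      """{"sourceInfo":"Web","payload":{"adCsm":[{"vfrd":1,"dbg":"x"}],"adcsm":[{"tdr":0.5,"vdr":0.7}]}}"""
    ).toDS()

    val data = spark.read.json(records)

    // With case sensitivity on, "payload.adCsm" matches only the exact name.
    data
      .filter(col("sourceInfo") === "Web")
      .withColumn("A", explode(col("payload.adCsm.vfrd")))
      .drop("payload")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("case-sensitive-demo")
      .master("local[*]")
      .getOrCreate()
    try explodeVfrd(spark).show()
    finally spark.stop()
  }
}
```

Note that with case sensitivity enabled, every column reference in the job has to match the stored name exactly, so existing queries that relied on case-insensitive matching may need adjusting.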
