I am trying to read multiple parquet files from GCS in a Dataproc Spark job:

```python
df = spark.read.option("mergeSchema", "true").parquet(remote_path)
```
The above code throws the following error:

```
org.apache.spark.SparkException: Failed merging schema of file gs://x/2023-04-03T11:33:15.parquet
org.apache.spark.SparkException: Failed to merge fields 'group_size__c' and 'group_size__c'. Failed to merge incompatible data types double and string
```
To work around this, I changed the code to pass an explicit schema in which the 'group_size__c' column is declared as a string:

```python
df = spark.read.schema(schema).parquet(remote_path)
```
This read no longer throws any error. But when I try to print the distinct values of 'group_size__c' with

```python
df = df.withColumn("group_size__c", col("group_size__c").cast("string"))
LOG.info(df.select("group_size__c").distinct().show())
```
it throws:

```
java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainDoubleDictionary
```
What might be causing this error? I have tried disabling dictionary encoding when building the session, but it does not solve the problem:

```python
spark = SparkSession.builder.config("parquet.enable.dictionary", "false").getOrCreate()
```