
I am trying to read multiple Parquet files from GCS using a Dataproc Spark job.

df = spark.read.option("mergeSchema", "true").parquet(remote_path)

The above code throws an error:

    org.apache.spark.SparkException: Failed merging schema of file gs://x/2023-04-03T11:33:15.parquet
    org.apache.spark.SparkException: Failed to merge fields 'group_size__c' and 'group_size__c'. Failed to merge incompatible data types double and string

To work around this, I changed the code to read with an explicit schema in which the 'group_size__c' column is defined as string:

df = spark.read.schema(schema).parquet(remote_path)

This line does not throw an error. But when I try to print the distinct values of 'group_size__c' with this code:

 from pyspark.sql.functions import col

 df = df.withColumn("group_size__c", col("group_size__c").cast("string"))
 df.select("group_size__c").distinct().show()

it throws the error

java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainDoubleDictionary

What might be causing this error? I have tried disabling dictionary encoding, but it doesn't solve the problem:

spark = SparkSession.builder.config("parquet.enable.dictionary","false").getOrCreate()
