
I am executing a simple CREATE TABLE query in Spark SQL using spark-submit (cluster mode) and I am getting an org.apache.parquet.io.ParquetDecodingException. I could only find a few details on this issue online; one of the suggestions was to add the config spark.sql.parquet.writeLegacyFormat=true. The issue still persists after adding this setting.
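
For reference, this is how that setting can be applied directly on the session instead of (or in addition to) the configuration file; a minimal sketch, assuming an existing SparkSession named spark:

// Minimal sketch (Scala): applying the suggested write-side option on
// the active session. In this case it did not resolve the exception.
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")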

Below is the query:

spark.sql("""
CREATE TABLE TestTable
STORED AS PARQUET
AS
SELECT Col1,
       Col2,
       Col3
FROM Stable""")

Error description:

Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 1 in block 0 in file maprfs:///path/disputer/1545555-r-00000.snappy.parquet
at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:461)
at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:219)
at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:186)
... 13 more
Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableLong cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableInt
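
The ClassCastException at the bottom of the trace points at a type mismatch between what is stored in the Parquet files and what the reader expects (a 64-bit column being decoded as a 32-bit one, or the reverse). A minimal diagnostic sketch, assuming a SparkSession named spark and using the directory from the stack trace as a placeholder path:

// Schema as written in the Parquet files themselves.
spark.read.parquet("maprfs:///path/disputer/").printSchema()

// Schema the metastore expects for the source table.
spark.table("Stable").printSchema()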

Spark configuration file:

spark.driver.memory=10G
spark.executor.memory=23G 
spark.executor.cores=3
spark.executor.instances=100  
spark.dynamicAllocation.enabled=false      
spark.yarn.preserve.staging.files=false  
spark.yarn.executor.extraJavaOptions=-XX:MaxDirectMemorySize=6144m    
spark.sql.shuffle.partitions=1000
spark.shuffle.service.enabled=true
spark.yarn.maxAppAttempts=1  
spark.sql.broadcastTimeout=36000
spark.debug.maxToStringFields=100  
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2  
spark.network.timeout=600s  
spark.sql.parquet.enableVectorizedReader=false
spark.scheduler.listenerbus.eventqueue.capacity=200000  
spark.driver.memoryOverhead=1024  
spark.yarn.executor.memoryOverhead=5120  
spark.executor.extraJavaOptions=-XX:+UseG1GC  
spark.driver.extraJavaOptions=-XX:+UseG1GC

1 Answer


This issue was occurring because spark.sql.parquet.enableVectorizedReader was disabled. With the vectorized reader turned off, Spark falls back to the row-based parquet-mr record reader (the InternalParquetRecordReader path visible in the stack trace), which is where the ClassCastException is thrown. Setting spark.sql.parquet.enableVectorizedReader=true resolves the issue.
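
For example, a minimal sketch of re-enabling the reader on the session before rerunning the query (assuming a SparkSession named spark):

// The vectorized Parquet reader is on by default in Spark 2.x; this
// re-enables it after it was explicitly disabled in the config file.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "true")

spark.sql("""
CREATE TABLE TestTable
STORED AS PARQUET
AS
SELECT Col1, Col2, Col3
FROM Stable""")

Equivalently, flip the line in the configuration file to spark.sql.parquet.enableVectorizedReader=true, or remove it entirely, since true is the default.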

For more details, see https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-vectorized-parquet-reader.html
