
I want to read an Excel file with multiple sheets from my Azure Blob Storage (Gen2) using Databricks PySpark. I have already installed the Maven package. Below is my code:

df = spark.read.format("com.crealytics.spark.excel") \
    .option("header", "true") \
    .option("useHeader", "true") \
    .option("treatEmptyValuesAsNulls", "true") \
    .option("inferSchema", "true") \
    .option("sheetName", "sheet1") \
    .option("maxRowsInMemory", 10) \
    .load(file_path)

Running this code, I get the following error:

Py4JJavaError: An error occurred while calling o323.load.
: java.lang.NoClassDefFoundError: Could not initialize class com.crealytics.spark.excel.WorkbookReader$
    at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:22)
    at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:13)
    at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:8)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:390)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:444)
    at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:400)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:400)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:287)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
    at py4j.Gateway.invoke(Gateway.java:295)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:251)
    at java.lang.Thread.run(Thread.java:748)

Any help is appreciated. Thanks

  • Have you attached the library to a cluster? Commented Dec 8, 2021 at 16:23
  • Hello @AlexOtt, yes, I already attached the library to the cluster and notebook. Commented Dec 9, 2021 at 9:35
  • Is it compiled for the correct version of Scala, matching the DBR version? Commented Dec 9, 2021 at 11:05
  • @AlexOtt To be sure, I installed both the Scala 2.11 and Scala 2.12 builds, and neither works. My DBR version is "9.1 LTS (includes Apache Spark 3.1.2, Scala 2.12)". (See the coordinate sketch after these comments.) Commented Dec 9, 2021 at 14:55
  • @KarthikBhyresh-MT, not yet. I am using pandas as an alternative for now, but I hope to solve this issue. Commented Dec 16, 2021 at 13:13
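
A note on the Scala-version point raised in the comments: the Maven coordinate's _2.11/_2.12 suffix must match the cluster's Scala version, and only one build should be attached at a time (mixing Scala versions on the classpath can itself produce this kind of NoClassDefFoundError). A minimal sketch of a matching install, where the exact artifact version is an assumption to verify on Maven Central:

# Cluster -> Libraries -> Install New -> Maven, with a coordinate such as:
#   com.crealytics:spark-excel_2.12:0.13.7
# The _2.12 suffix matches DBR 9.1 LTS (Scala 2.12); the 0.13.7 version
# number is an assumption -- pick a release built against Spark 3.1.x,
# and detach any _2.11 build before restarting the cluster.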

1 Answer


Can you verify that you have properly mounted the Azure Blob storage container?

Check out the official MS doc: Access Azure Blob storage using the RDD API.
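
If the container is not mounted yet, here is a minimal mounting sketch; the container, storage account, mount point, and secret scope/key names are all placeholders to replace with your own:

# Minimal mount sketch -- every <...> name below is a placeholder.
dbutils.fs.mount(
    source="wasbs://<container-name>@<storage-account-name>.blob.core.windows.net",
    mount_point="/mnt/excel-data",
    extra_configs={
        "fs.azure.account.key.<storage-account-name>.blob.core.windows.net":
            dbutils.secrets.get(scope="<scope-name>", key="<key-name>")
    }
)

After mounting, the file_path in the question can point at /mnt/excel-data/<your-file>.xlsx.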

Hadoop configuration options are not accessible via SparkContext. If you are using the RDD API to read from Azure Blob storage, you must set the Hadoop credential configuration properties as Spark configuration options when you create the cluster, adding the spark.hadoop. prefix to the corresponding Hadoop configuration keys to propagate them to the Hadoop configurations that are used for your RDD jobs.

Configure an account access key:

spark.hadoop.fs.azure.account.key.<storage-account-name>.blob.core.windows.net <storage-account-access-key>
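
Alternatively, for DataFrame reads the same key can be set on the running session (per the quoted doc, only RDD jobs require the spark.hadoop.-prefixed form at cluster creation). A minimal sketch, with placeholder account, key, and path names:

# Set the storage account key for the current session (placeholders in <...>).
spark.conf.set(
    "fs.azure.account.key.<storage-account-name>.blob.core.windows.net",
    "<storage-account-access-key>"
)

# Retry the Excel read against the wasbs:// path once the credential is in place.
df = (spark.read.format("com.crealytics.spark.excel")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/path/to/file.xlsx"))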

