
I have a piece of Scala code that works locally:

val test = "resources/test.csv"

val trainInput = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .format("com.databricks.spark.csv")
  .load(train)
  .cache

However, when I try to run it on Azure by submitting the Spark job and adjusting the following line:

val test = "wasb:///tmp/MachineLearningScala/test.csv"

it doesn't work. How do I reference files in Azure Blob Storage using Scala? This should be straightforward.

1 Answer


If you are using sbt, add this dependency to build.sbt:

"org.apache.hadoop" % "hadoop-azure" % "2.7.3"

For Maven, add the dependency as:

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-azure</artifactId>
    <version>2.7.3</version>
</dependency>

To read files from Blob Storage, you need to configure the file system and account key in the underlying Hadoop configuration:

spark.sparkContext.hadoopConfiguration.set("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
spark.sparkContext.hadoopConfiguration.set("fs.azure.account.key.yourAccount.blob.core.windows.net", "yourKey")
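
Alternatively (my own suggestion, not part of the original answer): Spark copies any spark.hadoop.* property into the underlying Hadoop configuration, so you can also set these when building the SparkSession. A sketch, with "yourAccount" and "yourKey" as placeholders:

import org.apache.spark.sql.SparkSession

// spark.hadoop.* properties are copied into the Hadoop configuration at startup
val spark = SparkSession.builder()
  .appName("BlobStorageExample")
  .config("spark.hadoop.fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
  .config("spark.hadoop.fs.azure.account.key.yourAccount.blob.core.windows.net", "yourKey")
  .getOrCreate()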

And read the CSV file as:

  val path = "wasb[s]://[email protected]"
  val dataframe = spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv(path + "/tmp/MachineLearningScala/test.csv")
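
Putting it together for the test.csv from the question, here is a minimal end-to-end sketch (the account, container, and key are placeholders you must substitute):

import org.apache.spark.sql.SparkSession

object ReadBlobCsv {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ReadBlobCsv").getOrCreate()

    // Placeholders: substitute your storage account, container, and access key
    spark.sparkContext.hadoopConfiguration
      .set("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
    spark.sparkContext.hadoopConfiguration
      .set("fs.azure.account.key.yourAccount.blob.core.windows.net", "yourKey")

    val path = "wasbs://[email protected]"
    val test = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(path + "/tmp/MachineLearningScala/test.csv")
      .cache()

    test.show(5) // quick sanity check
    spark.stop()
  }
}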

Hope this helps!


1 Comment

Thanks! That's brilliant. I will try to implement it now.
