I'm using PySpark SQL with Keras via Elephas.
I want to try some kind of distributed image processing with MongoDB GridFS.
I've found a related question, but it is from the Java world, in Scala: "Loading a Spark 2.x DataFrame from MongoDB GridFS".
Beyond that, I can't find any documentation on how to work with GridFS from PySpark.
My PySpark/MongoDB code looks like this:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

sparkConf = SparkConf().setMaster("local[4]").setAppName("MongoSparkConnectorTour")\
    .set("spark.app.id", "MongoSparkConnectorTour")\
    .set("spark.mongodb.input.database", config.MONGO_DB)
# If executed via pyspark, sc is already instantiated
sc = SparkContext(conf=sparkConf)
sqlContext = SQLContext(sc)
dk = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource")\
    .option("spark.mongodb.input.uri", config.MONGO_MED_EVENTS)\
    .load()
if dk.count() > 0:
    # Print the DataFrame schema
    dk.printSchema()
    # Preview the DataFrame (the pandas preview is cleaner)
    print(dk.limit(5).toPandas())
Is it possible to work with GridFS data this way? I'd like to see a minimal example.
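To make the question concrete: as far as I understand, GridFS keeps file metadata in an fs.files collection and the binary content in an fs.chunks collection, where each chunk document has a files_id, a sequence number n, and a data field. Below is a rough sketch of the reassembly step I have in mind, in plain Python with no Spark involved. The function name reassemble_gridfs_files and the list-of-dicts input are my own assumptions for illustration, not part of any library.

```python
from collections import defaultdict


def reassemble_gridfs_files(chunk_rows):
    """Join GridFS-style chunks back into whole files.

    chunk_rows is an iterable of dicts with the keys 'files_id', 'n' and
    'data', mirroring documents in an fs.chunks collection (assumed layout).
    Returns a dict mapping each files_id to the file's full bytes.
    """
    grouped = defaultdict(list)
    for row in chunk_rows:
        # Keep the chunk index next to the payload so we can sort later.
        grouped[row["files_id"]].append((row["n"], bytes(row["data"])))

    # Chunks may arrive in any order, so sort each file's chunks by n.
    return {
        file_id: b"".join(data for _, data in sorted(parts, key=lambda p: p[0]))
        for file_id, parts in grouped.items()
    }
```

Presumably the Spark version would load fs.chunks as a DataFrame, group by files_id, and apply something like this per group, but I have not verified that this works with the connector.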