Loading Spark 2.x DataFrame from MongoDB GridFS in Python

Question

I'm using PySpark SQL with Keras via Elephas.

I want to try some kind of distributed image processing with MongoDB GridFS.

I've found a related question, but it is about Scala rather than Python: Loading a Spark 2.x DataFrame from MongoDB GridFS

Beyond that, I can't find any documentation on how to work with GridFS from PySpark.

My PySpark/Mongo code looks like this:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

sparkConf = SparkConf().setMaster("local[4]").setAppName("MongoSparkConnectorTour")\
                                             .set("spark.app.id", "MongoSparkConnectorTour")\
                                             .set("spark.mongodb.input.database", config.MONGO_DB)

# If executed via pyspark, sc is already instantiated
sc = SparkContext(conf=sparkConf)
sqlContext = SQLContext(sc)

dk = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource")\
                    .option("spark.mongodb.input.uri", config.MONGO_MED_EVENTS)\
                    .load()

if (dk.count() > 0):
    # print data frame schema
    dk.printSchema()

    # Preview Dataframe (Pandas Preview is Cleaner)
    print( dk.limit(5).toPandas() )

Is it possible to work with GridFS data that way? I'd like to see a minimal example.

Artem Vovsia · Accepted Answer · 2019-10-18 11:51:08Z

The Scala code from that question can be translated to PySpark.

Download mongo-hadoop-core.jar from https://mvnrepository.com/artifact/org.mongodb.mongo-hadoop/mongo-hadoop-core/2.0.2
Run pyspark with the jar included:

SPARK_CLASSPATH=./path/to/mongo-hadoop-core.jar pyspark

And the translated code:

sc = SparkContext(conf=sparkConf)

# Hadoop configuration for the mongo-hadoop input format
mongo_conf = {
    "mongo.input.uri": "mongodb://...",
    "mongo.input.query": "...mongo query here..."
}

rdd = sc.newAPIHadoopRDD("com.mongodb.hadoop.GridFSInputFormat",
                         keyClass="org.apache.hadoop.io.NullWritable",
                         valueClass="org.apache.hadoop.io.MapWritable",
                         conf=mongo_conf)
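The conf argument of newAPIHadoopRDD is a plain dict of string keys and string values, and mongo-hadoop expects mongo.input.query as a JSON string. A minimal sketch of building that dict from a Python query follows; build_mongo_conf is a hypothetical helper name, and the URI and the filename filter are placeholders I made up, not values from the question.

```python
import json


def build_mongo_conf(input_uri, query=None):
    """Build the string-to-string dict passed as conf= to newAPIHadoopRDD.

    Hypothetical helper: only the two keys used in the answer are set.
    """
    conf = {"mongo.input.uri": input_uri}
    if query is not None:
        # Serialize the Python dict to a JSON string for mongo.input.query.
        conf["mongo.input.query"] = json.dumps(query)
    return conf


# Placeholder URI and filter, for illustration only.
mongo_conf = build_mongo_conf(
    "mongodb://localhost:27017/example_db.example_collection",
    {"filename": "cat.png"},
)
```

The resulting mongo_conf can then be passed as conf=mongo_conf in the call above.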

I'm not one hundred percent sure about the keyClass and valueClass passed to newAPIHadoopRDD; they are my best reading of the sources I used to put this code together.
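A side note on the DataFrame route from the question: GridFS stores each file as ordinary documents, with metadata in a files collection and the bytes in a chunks collection (fs.files and fs.chunks for the default bucket), and every chunk document carries files_id, n and data fields. So one option, which is an assumption on my side and not something I have tested against Spark, is to load the chunks collection with the connector and stitch the chunks back together per file. A minimal pure-Python sketch of the stitching step, using made-up sample documents and a hypothetical helper name:

```python
from collections import defaultdict


def reassemble_files(chunk_docs):
    """Group GridFS chunk documents by files_id and join them in order of n."""
    parts_by_file = defaultdict(list)
    for doc in chunk_docs:
        parts_by_file[doc["files_id"]].append((doc["n"], bytes(doc["data"])))
    return {
        files_id: b"".join(data for _, data in sorted(parts, key=lambda p: p[0]))
        for files_id, parts in parts_by_file.items()
    }


# Made-up chunk documents, deliberately out of order.
sample_chunks = [
    {"files_id": "a", "n": 1, "data": b"world"},
    {"files_id": "b", "n": 0, "data": b"xyz"},
    {"files_id": "a", "n": 0, "data": b"hello "},
]
files = reassemble_files(sample_chunks)
```

With a DataFrame, the same idea would be a group-by on files_id followed by an ordered concatenation of data; the plain function here only shows the ordering logic.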
