I'm using PySpark SQL with Keras under Elephas.

I want to try distributed image processing with MongoDB GridFS.

I found a related question, but it is from the Java/Scala world: Loading a Spark 2.x DataFrame from MongoDB GridFS

Beyond that, I can't find any documentation on how to work with GridFS from PySpark.

My PySpark/MongoDB code looks like this:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

import config  # local module that defines MONGO_DB and MONGO_MED_EVENTS

sparkConf = SparkConf().setMaster("local[4]").setAppName("MongoSparkConnectorTour")\
                       .set("spark.app.id", "MongoSparkConnectorTour")\
                       .set("spark.mongodb.input.database", config.MONGO_DB)

# If executed via pyspark, sc is already instantiated
sc = SparkContext(conf=sparkConf)
sqlContext = SQLContext(sc)

dk = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource")\
                    .option("spark.mongodb.input.uri", config.MONGO_MED_EVENTS)\
                    .load()

if dk.count() > 0:
    # Print the DataFrame schema
    dk.printSchema()

    # Preview the DataFrame (the pandas preview is cleaner)
    print(dk.limit(5).toPandas())

Is it possible to work with GridFS data this way? I'd like to see a minimal example.

1 Answer

The Scala code from that question can be translated to PySpark.

  1. Download mongo-hadoop-core.jar from https://mvnrepository.com/artifact/org.mongodb.mongo-hadoop/mongo-hadoop-core/2.0.2

  2. Run pyspark with the jar included:

SPARK_CLASSPATH=./path/to/mongo-hadoop-core.jar pyspark

  3. Use the translated code:

sc = SparkContext(conf=sparkConf)

# Hadoop configuration read by the mongo-hadoop input format
mongo_conf = {
    "mongo.input.uri": "mongodb://...",
    "mongo.input.query": "...mongo query here...",
}

rdd = sc.newAPIHadoopRDD(
    "com.mongodb.hadoop.GridFSInputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.apache.hadoop.io.MapWritable",
    conf=mongo_conf,
)
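
As far as I know, mongo-hadoop expects the value of "mongo.input.query" to be a JSON string, so it may be easier to write the query as a Python dict and serialize it. Here is a minimal sketch of that idea. The helper name build_gridfs_conf, the localhost URI, and the filename filter are all assumptions made up for illustration, and I have not verified how GridFSInputFormat applies the query.

```python
import json


def build_gridfs_conf(uri, query=None):
    """Build the conf dict for sc.newAPIHadoopRDD (hypothetical helper)."""
    conf = {"mongo.input.uri": uri}
    if query is not None:
        # Serialize the dict so the query is passed as a JSON string
        conf["mongo.input.query"] = json.dumps(query)
    return conf


# Placeholder URI and filter, assumed for illustration only
mongo_conf = build_gridfs_conf(
    "mongodb://localhost:27017/mydb.fs",
    {"filename": {"$regex": "\\.png$"}},
)
```

The resulting dict can then be passed as the conf argument of sc.newAPIHadoopRDD; leaving query out yields a conf with only the input URI.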

I'm not one hundred percent sure about the keyClass and valueClass, so here are the sources that I used to compile this code:
