
There are methods to save the data of an org.apache.spark.sql.DataFrame to the file system or to Hive. But how do I save data from a DataFrame that was created from MongoDB data back to MongoDB?

Edit: I created the DataFrame using

SparkContext sc = new SparkContext();
Configuration config = new Configuration();
config.set("mongo.input.uri", "mongodb://localhost:27017/testDB.testCollection");
JavaRDD<Tuple2<Object, BSONObject>> mongoJavaRDD = sc.newAPIHadoopRDD(config, MongoInputFormat.class, Object.class,
            BSONObject.class).toJavaRDD();
JavaRDD<Object> mongoRDD = mongoJavaRDD.flatMap(new FlatMapFunction<Tuple2<Object, BSONObject>, Object>()
    {
        @Override
        public Iterable<Object> call(Tuple2<Object, BSONObject> arg)
        {
            BSONObject obj = arg._2();
            Object javaObject = generateJavaObjectFromBSON(obj, clazz);
            return Arrays.asList(javaObject);
        }
    });

SQLContext sqlContext = new SQLContext(sc);
DataFrame df = sqlContext.createDataFrame(mongoRDD, Person.class);
df.registerTempTable("Person");
  • OK, my Java is rusty at best, but I really don't understand why you'd create a single-element list just to flatMap. A simple map should be enough. Also, what is going on inside generateJavaObjectFromBSON? Commented Jul 20, 2015 at 13:17

1 Answer


Using PySpark and assuming you have a local MongoDB instance:

import pymongo
from toolz import dissoc

# First, let's create a dummy collection
client = pymongo.MongoClient()
client["foo"]["bar"].insert_many([{"k": "foo", "v": 1}, {"k": "bar", "v": 2}])
client.close()

config = {
    "mongo.input.uri": "mongodb://localhost:27017/foo.bar",
    "mongo.output.uri": "mongodb://localhost:27017/foo.barplus"
}

# Read data from MongoDB
rdd = sc.newAPIHadoopRDD(
    "com.mongodb.hadoop.MongoInputFormat",
    "org.apache.hadoop.io.Text",
    "org.apache.hadoop.io.MapWritable",
    None, None, config)

# Drop the _id field and create a data frame
dt = sqlContext.createDataFrame(rdd.map(lambda kv: dissoc(kv[1], "_id")))
dt_plus_one = dt.select(dt["k"], (dt["v"] + 1).alias("v"))

(dt_plus_one.
    rdd.  # Extract the rdd
    map(lambda row: (None, row.asDict())).  # Map to (None, dict) pairs
    saveAsNewAPIHadoopFile(
        "file:///placeholder",  # Ignored
        # From org.mongodb.mongo-hadoop:mongo-hadoop-core
        "com.mongodb.hadoop.MongoOutputFormat",
        None, None, None, None, config))
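The `dissoc` call above drops the MongoDB `_id` field (an ObjectId Spark cannot infer a usable schema for) by returning a copy of the dict without that key, leaving the original untouched. If you'd rather not pull in `toolz` just for this, a stdlib-only equivalent is a one-liner (a sketch; the sample document below is illustrative):

```python
def dissoc(d, key):
    """Return a copy of d without key; the original dict is not mutated."""
    return {k: v for k, v in d.items() if k != key}

doc = {"_id": "placeholder-object-id", "k": "foo", "v": 1}
cleaned = dissoc(doc, "_id")
# cleaned == {"k": "foo", "v": 1}, and doc still contains "_id"
```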

See also: Getting Spark, Python, and MongoDB to work together
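On the write side, MongoOutputFormat consumes key/value pairs where the value is the document to insert; mapping each Row to `(None, row.asDict())` lets MongoDB assign its own `_id`. The shaping itself is plain Python, which you can see outside Spark (a sketch using literal dicts in place of `Row.asDict()` output; the sample rows are illustrative):

```python
# Stand-ins for what dt_plus_one's rows would look like after asDict()
rows = [{"k": "foo", "v": 2}, {"k": "bar", "v": 3}]

# Shape each document into the (key, value) pair MongoOutputFormat expects;
# a None key means MongoDB generates the _id on insert.
pairs = [(None, row) for row in rows]
# pairs == [(None, {"k": "foo", "v": 2}), (None, {"k": "bar", "v": 3})]
```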


2 Comments

I am using and did not find any createDataFrame() method for the list of JSON objects...
So, is the problem with loading into the data frame as well?
