Load/import a CSV file into MongoDB using PySpark

Question

I want to know how to load/import a CSV file into MongoDB using PySpark. I have a CSV file named cal.csv on my desktop. Can somebody share a code snippet?

You want to read the CSV from the desktop using PySpark and then save it in MongoDB, right?
– mayank agrawal, Commented Sep 28, 2018 at 14:13
Yes, absolutely correct. I want to import the CSV file and store it in MongoDB.
– swetha reddy, Commented Sep 29, 2018 at 15:06

mayank agrawal · Accepted Answer · 2018-10-01 08:27:05Z

First, read the CSV into a PySpark DataFrame.

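from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

# Build the configuration; the app name is a placeholder, and the MongoDB
# URIs are passed on the spark-submit command line (see below)
conf = SparkConf().setAppName("csv-to-mongodb")
sc = SparkContext(conf=conf)
sql = SQLContext(sc)

# Treat the first row as the header and drop malformed rows
df = sql.read.csv("cal.csv", header=True, mode="DROPMALFORMED")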

Then write it to MongoDB:

df.write.format('com.mongodb.spark.sql.DefaultSource').mode('append')\
        .option('database',NAME).option('collection',COLLECTION_MONGODB).save()

Replace NAME and COLLECTION_MONGODB with the names of your own database and collection.
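
The read and write steps above could be wrapped in one helper, roughly as sketched below. The function name csv_to_mongodb and its parameters are hypothetical; the sketch assumes `spark` is an SQLContext or SparkSession (both expose `.read`) and that the MongoDB connector package and output URI have been supplied at launch.

```python
def csv_to_mongodb(spark, csv_path, database, collection):
    """Read csv_path through `spark` and append its rows to database.collection.

    `spark` is assumed to be an SQLContext or SparkSession started with the
    MongoDB Spark connector on the classpath.
    """
    # Same reader options as in the answer: header row, drop malformed rows
    df = spark.read.csv(csv_path, header=True, mode="DROPMALFORMED")

    # Same writer chain as in the answer, with the names passed as arguments
    (df.write
       .format("com.mongodb.spark.sql.DefaultSource")
       .mode("append")
       .option("database", database)
       .option("collection", collection)
       .save())
    return df
```

With the objects from the snippet above, a call might look like csv_to_mongodb(sql, "cal.csv", NAME, COLLECTION_MONGODB). Returning the DataFrame is only a convenience so the caller can inspect what was read.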

You also need to pass the connector configuration and package to spark-submit, choosing the connector version that matches your Spark version:

/bin/spark-submit --conf "spark.mongodb.input.uri=mongodb://127.0.0.1/DATABASE.COLLECTION_NAME?readPreference=primaryPreferred" \
                  --conf "spark.mongodb.output.uri=mongodb://127.0.0.1/DATABASE.COLLECTION_NAME" \
                  --packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 \
                  tester.py

Replace DATABASE and COLLECTION_NAME above with your own values. tester.py is assumed to be the name of the script file. For more information, refer to this.
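
Since the database and collection names appear in both URIs, it can help to generate the command rather than type it by hand. Below is a minimal sketch using only the standard library; the helper names mongo_uri and spark_submit_args are hypothetical, and the host and connector coordinates are simply copied from the command above.

```python
import shlex

# Connector coordinates as used in the answer; adjust to your Spark version
CONNECTOR = "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0"


def mongo_uri(database, collection, host="127.0.0.1"):
    """Return a URI of the form mongodb://HOST/DATABASE.COLLECTION."""
    return "mongodb://{}/{}.{}".format(host, database, collection)


def spark_submit_args(script, database, collection, package=CONNECTOR):
    """Build the spark-submit argument list shown in the answer."""
    uri = mongo_uri(database, collection)
    return [
        "spark-submit",
        "--conf", "spark.mongodb.input.uri=" + uri + "?readPreference=primaryPreferred",
        "--conf", "spark.mongodb.output.uri=" + uri,
        "--packages", package,
        script,
    ]


if __name__ == "__main__":
    # Placeholder values; substitute your own database, collection and script
    args = spark_submit_args("tester.py", "DATABASE", "COLLECTION_NAME")
    print(" ".join(shlex.quote(a) for a in args))
```

The list form can also be handed directly to subprocess.run, which avoids shell quoting problems with the `?` in the input URI.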

edited Oct 1, 2018 at 8:27

answered Oct 1, 2018 at 6:51


1 Comment

mayank agrawal Over a year ago

Did you find the answer useful?

swetha reddy · Accepted Answer · 2018-10-03 11:06:22Z

This worked for me. Database: people; collection: con.

pyspark --conf "spark.mongodb.input.uri=mongodb://127.0.0.1/people.con?readPreference=primaryPreferred" \
    --conf "spark.mongodb.output.uri=mongodb://127.0.0.1/people.con" \
    --packages org.mongodb.spark:mongo-spark-connector_2.11:2.3.0


from pyspark.sql import SparkSession

my_spark = SparkSession \
         .builder \
         .appName("myApp") \
         .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/people.con") \
         .config("spark.mongodb.output.uri", "mongodb://127.0.0.1/people.con") \
         .getOrCreate()

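# Read the CSV through the session created above, inferring column types
df = my_spark.read.csv(path="file:///home/user/Desktop/people.csv", header=True, inferSchema=True)

df.printSchema()

# Append the rows to the "con" collection of the "people" database
df.write.format("com.mongodb.spark.sql.DefaultSource").mode("append").option("database", "people").option("collection", "con").save()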

Next, open the mongo shell and check that the collection was written, using the following steps:

mongo
show dbs
use people
show collections
db.con.find().pretty()
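
The same check could be made from Python instead of the mongo shell. This is a sketch only: it assumes the separate pymongo package is installed, and the helper name count_imported is hypothetical.

```python
def count_imported(client, database, collection):
    """Return the number of documents in database.collection.

    `client` is assumed to be a pymongo.MongoClient (or anything that
    supports client[database][collection].count_documents).
    """
    return client[database][collection].count_documents({})


# Usage (assumes pymongo is installed and MongoDB runs on 127.0.0.1):
#   from pymongo import MongoClient
#   print(count_imported(MongoClient("mongodb://127.0.0.1"), "people", "con"))
```

A count greater than zero after the write suggests the import reached the collection. Because the write uses mode("append"), re-running the job adds the rows again.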
