
I'm currently attempting to load several text files into MongoDB (they're in JSON format).

I tried using an OS walk, but I seem to be having trouble. My current method is:

>>> import pymongo
>>> import os
>>> import json
>>> from pymongo import Connection
>>> connection = Connection()
>>> db = connection.Austin
>>> collection = db.tweets
>>> collection = db.tweet_collection
>>> db.tweet_collection
Collection(Database(Connection('localhost', 27017), u'Austin'), u'tweet_collection')
>>> collection
Collection(Database(Connection('localhost', 27017), u'Austin'), u'tweet_collection')
>>> tweets = db.tweets
>>> tweet = open(os.path.expanduser('~/Tweets/10_7_2012_12:09-Tweets.txt'),'r')
>>> for line in tweet:
...      d = json.loads(line)
...      tweets.insert(d)
... 

That inserts the tweets from a single file. I want to be able to open multiple files and run that same piece of code (the for loop that parses each line of JSON into a Python dictionary and inserts it into the collection) automatically.

Does anyone have a solid example of how to do this, complete with an explanation?

While we're on the topic: I'm attempting to use MongoDB with a poor understanding of databases (silly, I know). Am I right that a MongoDB server can host multiple databases at the same time, that each database stores collections (which are groups of documents), and that you insert individual documents into a collection?

(Also, please ignore the inconsistency between the collections tweets and tweet_collection; I was just experimenting to get a better understanding.)

2 Comments
  • Is there a single tweet or multiple tweets per file? (The name *Tweets.txt implies more than one.) Commented Jul 19, 2012 at 21:04
  • Aye, each file contains multiple tweets, but each tweet has its own line. Sorry for the slow response. Commented Jul 23, 2012 at 16:37

1 Answer


Untested:

from glob import iglob
import os.path
import pymongo
import json

db = pymongo.Connection().Austin  # connect as in the question

for fname in iglob(os.path.expanduser('~/Tweets/*.txt')):
    with open(fname) as fin:
        tweets = json.load(fin)
        for tweet in tweets:
            db.tweets.insert(tweet)

This loops over all the filenames matching '~/Tweets/*.txt', opens each one, and loads one or more tweets from the file into Python dictionaries. Note the use of .load instead of .loads: .load() takes a file-like object, while .loads() takes a string. Then each tweet is inserted into the database. (Note I've used db.tweets.insert instead of tweets = db.tweets, as I personally find the 'db.' prefix a useful reminder that it's a DB operation and not some other object.)
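To make the .load()/.loads() distinction concrete, here is a minimal illustration (not part of the original answer; io.StringIO stands in for a real file):

```python
import io
import json

s = '{"text": "hello", "id": 1}'
from_string = json.loads(s)            # .loads() parses a str
from_file = json.load(io.StringIO(s))  # .load() reads from a file-like object
assert from_string == from_file == {"text": "hello", "id": 1}
```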

As to your understanding of MongoDB (databases/collections/documents): yes, you're correct.


6 Comments

Well, I tried it, and it gives the following error:

    Traceback (most recent call last):
      File "masspush.py", line 13, in <module>
        tweets = json.load(fin)
      File "/usr/lib/python2.7/json/__init__.py", line 278, in load
        **kw)
      File "/usr/lib/python2.7/json/__init__.py", line 326, in loads
        return _default_decoder.decode(s)
      File "/usr/lib/python2.7/json/decoder.py", line 369, in decode
        raise ValueError(errmsg("Extra data", s, end, len(s)))
    ValueError: Extra data: line 2 column 1 - line 218 column 1 (char 2590 - 554222)

Sorry for formatting. @Jon Clements
Can you copy and post that bit and some of the surrounding data?
Do you mean within the text file itself? The issue is that there are ~500 files, and I'm not sure which file it refers to that has the extra data.
Catch the exception and print fname?
I caught the exception - it turns out that all of them are doing it... Awesome. So, for the sake of space, is the extra data line 2 column 1 or line 218 column 1? Or is that line 2 TO line 218 column 1? And even line 2 is roughly 2,000 characters :|
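Given the comments above (each tweet sits on its own line, so the files are line-delimited JSON rather than a single JSON document), parsing each line with json.loads avoids the "Extra data" error. A sketch, with the path pattern and the blank-line handling as assumptions:

```python
from glob import iglob
import json
import os.path

def iter_tweets(pattern='~/Tweets/*.txt'):
    """Yield one dict per non-empty line across every matching file."""
    for fname in iglob(os.path.expanduser(pattern)):
        with open(fname) as fin:
            for line in fin:
                line = line.strip()
                if line:  # skip blank lines rather than crash on them
                    yield json.loads(line)

# Usage, assuming `db` is connected as in the question:
# for tweet in iter_tweets():
#     db.tweets.insert(tweet)
```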
