
I'm currently attempting to load several text files into MongoDB (they're in JSON format).

I tried using an OS walk, but I seem to be having trouble. My current method is:

>>> import pymongo
>>> import os
>>> import json
>>> from pymongo import Connection
>>> connection = Connection()
>>> db = connection.Austin
>>> collection = db.tweets
>>> collection = db.tweet_collection
>>> db.tweet_collection
Collection(Database(Connection('localhost', 27017), u'Austin'), u'tweet_collection')
>>> collection
Collection(Database(Connection('localhost', 27017), u'Austin'), u'tweet_collection')
>>> tweets = db.tweets
>>> tweet = open(os.path.expanduser('~/Tweets/10_7_2012_12:09-Tweets.txt'),'r')
>>> for line in tweet:
...      d = json.loads(line)
...      tweets.insert(d)
... 

That inserts the tweets from a single file. I want to be able to open multiple files and run that same piece of code (the for loop that parses each line of JSON into a Python dictionary and inserts it into the collection) automatically.

Does anyone have a solid example of how to do this, complete with an explanation?

While we're on the topic: I'm attempting to use MongoDB with a poor understanding of databases (silly, I know). Am I right that a MongoDB server can host multiple databases at the same time, that each database stores collections (which are groups of documents), and that you insert individual documents into a collection?

(Also, please ignore the inconsistency between the collections tweets and tweet_collection; I was just experimenting to get a better understanding.)

2 Comments
  • Is there a single tweet or multiple tweets per file? (The name *Tweets.txt implies more than one.) Commented Jul 19, 2012 at 21:04
  • Aye, each file contains multiple tweets, but each tweet has its own line. Sorry for the slow response. Commented Jul 23, 2012 at 16:37

1 Answer


Untested:

from glob import iglob
import os.path
import pymongo
import json

db = pymongo.Connection().Austin  # connect as in the question

for fname in iglob(os.path.expanduser('~/Tweets/*.txt')):
    with open(fname) as fin:
        tweets = json.load(fin)
        for tweet in tweets:
            db.tweets.insert(tweet)

This loops over all the filenames matching '~/Tweets/*.txt', opens each one, and loads one or more tweets from the file into Python dictionaries. Note the use of .load instead of .loads: .load() takes a file-like object, while .loads() takes a string. Then each tweet is inserted into the database. (Note I've used db.tweets.insert instead of tweets = db.tweets, as I personally find the 'db.' prefix a useful reminder that it's a DB operation and not some other object.)
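To make the .load()/.loads() distinction concrete, here is a minimal illustration (not part of the original answer; io.StringIO stands in for a real file):

```python
import io
import json

s = '{"text": "hello", "id": 1}'
from_string = json.loads(s)            # .loads() parses a str
from_file = json.load(io.StringIO(s))  # .load() reads from a file-like object
assert from_string == from_file == {"text": "hello", "id": 1}
```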

As to your understanding of MongoDB (databases/collections/documents): yes, you're correct.


6 Comments

Well, I tried it, and it gives the following error:

    Traceback (most recent call last):
      File "masspush.py", line 13, in <module>
        tweets = json.load(fin)
      File "/usr/lib/python2.7/json/__init__.py", line 278, in load
        **kw)
      File "/usr/lib/python2.7/json/__init__.py", line 326, in loads
        return _default_decoder.decode(s)
      File "/usr/lib/python2.7/json/decoder.py", line 369, in decode
        raise ValueError(errmsg("Extra data", s, end, len(s)))
    ValueError: Extra data: line 2 column 1 - line 218 column 1 (char 2590 - 554222)

Sorry for formatting. @Jon Clements
Can you copy and post that bit and some of the surrounding data?
Do you mean within the text file itself? The issue is that there are ~500 files, and I'm not sure which file it refers to that has the extra data.
Catch the exception and print fname?
I caught the exception - it turns out that all of them are doing it... Awesome. So, for the sake of space, is the extra data line 2 column 1 or line 218 column 1? Or is that line 2 TO line 218 column 1? And even line 2 is roughly 2,000 characters :|
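Given the comments above (each tweet sits on its own line, so the files are line-delimited JSON rather than a single JSON document), parsing each line with json.loads avoids the "Extra data" error. A sketch, with the path pattern and the blank-line handling as assumptions:

```python
from glob import iglob
import json
import os.path

def iter_tweets(pattern='~/Tweets/*.txt'):
    """Yield one dict per non-empty line across every matching file."""
    for fname in iglob(os.path.expanduser(pattern)):
        with open(fname) as fin:
            for line in fin:
                line = line.strip()
                if line:  # skip blank lines rather than crash on them
                    yield json.loads(line)

# Usage, assuming `db` is connected as in the question:
# for tweet in iter_tweets():
#     db.tweets.insert(tweet)
```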
