
This BeautifulSoup parser works as it should when printing data while looping; it outputs the correct things. The final line of code (writing to CSV) says that user2 is not defined, even though it seems to be... Any ideas? (Thanks all! It was an indentation error, now edited. Code works!)

import csv
from bs4 import BeautifulSoup

# Create output file and write headers
f = csv.writer(open('/Users/xx/Downloads/#parsed.csv', "w"), delimiter = '\t')
f.writerow(["date", "username", "tweet"]) #csv column headings
soup = BeautifulSoup(open("/Users/simonlindgren/Downloads/#raw.html")) #input html document 

tweetdata = soup.find_all("div", class_="content") #find anchors of each tweet
#print tweetdata
for tweet in tweetdata:
    username = tweet.find_all(class_="username js-action-profile-name")
    for user in username:
        user2 = user.get_text()
        #print user2
    date = tweet.find_all(class_="_timestamp js-short-timestamp ")
    for d in date:
        date2 = d.get_text()
        tweet = tweet.find_all(class_="js-tweet-text tweet-text")
        for t in tweet:
            tweet2 = t.get_text().encode('utf-8')
            tweet3 = tweet2.replace('\n', ' ')
            tweet4 = tweet3.replace('\"','')

    f.writerow([date2, user2, tweet4])
  • Could you please review the indentation - it's important in Python. Commented Feb 19, 2015 at 14:11
  • A copy of the input html document and expected CSV output would also be helpful. Commented Feb 19, 2015 at 14:38

1 Answer


The problem is that user2 is only assigned inside the loop for user in username:. In Python a name bound in a loop does survive after the loop, but if find_all() returns an empty list for a tweet, the loop body never runs and user2 is never defined (or still holds the value from a previous tweet), hence the NameError. Changing your code to f.writerow([username, date, tweet]) would avoid the NameError, but I suspect it will not produce what you want: those values are still lists of tags with the HTML markup in them (which is why you used get_text() to pull the text out of the tags in the first place).
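To see why the NameError can appear, here is a minimal stdlib-only sketch (the names are hypothetical, chosen to mirror the question): a name assigned in a for loop stays bound after the loop, but only if the loop body actually ran at least once.

```python
# The assignment inside the loop survives the loop...
for user in ["alice"]:
    user2 = user.upper()
print(user2)  # user2 is still bound here

# ...but with an empty iterable the body never executes,
# so the name is never created and using it raises NameError.
raised = False
try:
    for user in []:
        missing = user.upper()
    print(missing)
except NameError:
    raised = True
```

This is exactly what happens if one tweet's find_all() call matches nothing: that iteration never binds the variable.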

Instead, assuming that there is only one username, date and tweet text body per tweet, you could change your code to something like this:

tweetdata = soup.find_all("div", class_="content") #find anchors of each tweet
for tweet in tweetdata:
    # pull out our data
    username = tweet.find_all(class_="username js-action-profile-name")
    date = tweet.find_all(class_="_timestamp js-short-timestamp ")
    text = tweet.find_all(class_="js-tweet-text tweet-text")

    # note: build the tuple with a literal; tuple(a, b, c) is a TypeError
    # because tuple() takes a single iterable argument
    our_data = (username[0].get_text(), date[0].get_text(),
                text[0].get_text().encode('utf-8'))
    print "User: %s - Date: %s - Text: %s" % our_data

    # write to CSV
    f.writerow(our_data)

This avoids using the unnecessary for loops (since each tweet will only have one username, date and text body anyway). If you need to write it out as a list, change our_data from being a tuple to a list.
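Note that csv.writer's writerow() accepts any iterable of values, so the tuple can be written as-is; converting our_data to a list is only necessary if you want to mutate the row first. A minimal stdlib-only sketch using an in-memory buffer (io.StringIO here stands in for the output file in the question):

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf, delimiter='\t')

# Both a tuple and a list work as row arguments.
writer.writerow(("2015-02-19", "someuser", "tweet text"))
writer.writerow(["2015-02-19", "otheruser", "more text"])

rows = buf.getvalue().splitlines()
```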
