I'm trying to write multiple rows into a CSV file using Python, and I've been piecing this code together for a while. My goal is to scrape the Oxford dictionary website (oed.com) for the words created in a given year and write the year and its words to a CSV file. I want each row to start with the year I'm searching for, followed by all of that year's words listed horizontally across the row. Then I want to be able to repeat this for multiple years.

Here's my code so far:

import requests
import re 
import urllib2
import os
import csv

year_search = 1550
subject_search = ['Law'] 

path = '/Applications/Python 3.5/Economic'
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
urllib2.install_opener(opener)

user_agent = 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
header = {'User-Agent':user_agent}
request = urllib2.Request('http://www.oed.com/', None, header)
f = opener.open(request)  
data = f.read()
f.close()
print 'database first access was successful'

resultPath = os.path.join(path, 'OED_table.csv')
htmlPath = os.path.join(path, 'OED.html')
outputw = open(resultPath, 'w')
outputh = open(htmlPath, 'w')
request = urllib2.Request('http://www.oed.com/search?browseType=sortAlpha&case-insensitive=true&dateFilter='+str(year_search)+'&nearDistance=1&ordered=false&page=1&pageSize=100&scope=ENTRY&sort=entry&subjectClass='+str(subject_search)+'&type=dictionarysearch', None, header)
page = opener.open(request)
urlpage = page.read()
outputh.write(urlpage)
new_word = re.findall(r'<span class=\"hwSect\"><span class=\"hw\">(.*?)</span>', urlpage)
print str(new_word)
outputw.write(str(new_word))
page.close()
outputw.close()

This prints the list of words that were identified for the year 1550. I then tried to make the code write to a CSV file on my computer, which it does, but there are two things I can't get right:

  1. I want to be able to insert multiple rows into this and
  2. I want to have the year show up in the first spot

Next part of my code:

with open('OED_table.csv', 'w') as csvfile:
    fieldnames = ['year_search']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

    writer.writeheader()
    writer.writerow({'year_search': new_word})

I was using the csv module's online documentation as a reference for the second part of the code.

And just to clarify, I included the first part of the code in order to give perspective.

  • OK, I've probably spent more time on this than I should have, trying to understand where the dictionary was coming from (a Python dictionary, not the OED) and what needed to be written. As far as I can tell, your expected output is just a list of rows like 1550 | accomplice, i.e. a year in column A and a word in column B, for every word in 1550? Commented Oct 9, 2016 at 15:55
  • And do you want to do this for all years in a range? If I understand your request properly, it would be easier to build that into an answer. A lot of your code is unnecessary and you're using regex to parse HTML. However, it appears to work in this case, so I'll formulate an answer now, trying to use your approach. Commented Oct 9, 2016 at 16:00
  • You should probably use the Python 2 documentation for the csv module as a reference. Commented Oct 9, 2016 at 16:23
  • @roganjosh: No, you're not crazy. I, too, was getting multiple results for a while but now only one, ['leggiero']. Commented Oct 9, 2016 at 16:48
  • @martineau Thanks for the confirmation, I've spent ages debugging, thinking I did something silly. OP: I don't think this is possible without an account; they appear to require a login after so many requests from the same IP. Commented Oct 9, 2016 at 16:52

1 Answer

You really shouldn't parse HTML with a regex. That said, here's how to modify your code to produce a CSV file of all the words found.

Note: for unknown reasons, the list of result words varies in length from one execution to the next.

import csv
import os
import re
import urllib2

year_search = 1550
subject_search = ['Law']

path = '/Applications/Python 3.5/Economic'
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
urllib2.install_opener(opener)

user_agent = 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
header = {'User-Agent':user_agent}

# commented out because not used
#request = urllib2.Request('http://www.oed.com/', None, header)
#f = opener.open(request)
#data = f.read()
#f.close()
#print 'database first access was successful'

resultPath = os.path.join(path, 'OED_table.csv')
htmlPath = os.path.join(path, 'OED.html')
request = urllib2.Request(
    'http://www.oed.com/search?browseType=sortAlpha&case-insensitive=true&dateFilter='
    + str(year_search)
    + '&nearDistance=1&ordered=false&page=1&pageSize=100&scope=ENTRY&sort=entry&subjectClass='
    + str(subject_search)
    + '&type=dictionarysearch', None, header)
page = opener.open(request)

with open(resultPath, 'wb') as outputw, open(htmlPath, 'w') as outputh:
    urlpage = page.read()
    outputh.write(urlpage)

    new_words = re.findall(
        r'<span class=\"hwSect\"><span class=\"hw\">(.*?)</span>', urlpage)
    print new_words
    csv_writer = csv.writer(outputw)
    for word in new_words:
        csv_writer.writerow([year_search, word])

Here are the contents of the OED_table.csv file when it works:

1550,above bounden
1550,accomplice
1550,baton
1550,civilist
1550,garnishment
1550,heredity
1550,maritime
1550,municipal
1550,nil
1550,nuncupate
1550,perjuriously
1550,rank
1550,semi-
1550,torture
1550,unplace
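
As an alternative to the regex used above, the headwords could be pulled out with the standard library's HTML parser. The following is only a minimal sketch, written for Python 3 (the code above is Python 2, where the module is named HTMLParser). The class name HeadwordParser, the helper extract_headwords, and the sample markup are hypothetical; the only thing taken from the original is the assumption that each word sits inside a span with class "hw".

```python
from html.parser import HTMLParser


class HeadwordParser(HTMLParser):
    """Collect the text inside every <span class="hw"> element."""

    def __init__(self):
        super().__init__()
        self.words = []      # headwords found so far
        self._depth = 0      # > 0 while inside a <span class="hw">
        self._buffer = []    # text pieces of the current headword

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self._depth:
            self._depth += 1          # nested span inside a headword
        elif "hw" in (dict(attrs).get("class") or "").split():
            self._depth = 1           # entering a headword span

    def handle_endtag(self, tag):
        if tag == "span" and self._depth:
            self._depth -= 1
            if self._depth == 0:      # the headword span just closed
                self.words.append("".join(self._buffer).strip())
                self._buffer = []

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def extract_headwords(html):
    """Return the list of headwords found in an HTML string."""
    parser = HeadwordParser()
    parser.feed(html)
    parser.close()
    return parser.words


# Hypothetical markup shaped like what the regex above expects.
sample = ('<span class="hwSect"><span class="hw">accomplice</span></span>'
          '<span class="hwSect"><span class="hw">baton</span></span>')
print(extract_headwords(sample))
```

Unlike the regex, this does not depend on the two span tags being adjacent or on the exact attribute quoting, so it should be less brittle if the page markup shifts slightly.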

7 Comments

"leggiero" appears to be the word of the day. If you load the url in a browser, you're met with a login screen. While I don't doubt this is a decent approach written by you, I think OP will hit a roadblock after just a few requests. I don't think they allow scraping at all.
@roganjosh: All part of the reason I started my answer with a caveat.
True, the only reason I commented is because we both get the same word and OP needs to abandon this approach unless there is a login mechanism that is accessible (I didn't check to see if it was a paid subscription). We both ended up pulling a word from the login screen. Upvote anyway since you technically answer the question about writing to csv :)
@roganjosh: Thanks. If nothing else, the OP can see how to write multiple rows into a CSV file, regardless of the source of the data for them. I too was wondering how it was possible to do queries like this without some sort of OED account and related authorization.
Kainesplain: You could write them all as one row (without the year) by removing the for word in new_words: and making a single call to csv_writer.writerow(new_words). You might need to make it conditional by using if new_words: csv_writer.writerow(new_words). If you want to add the year at the beginning, use csv_writer.writerow([year_search] + new_words).
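
The one-row-per-year layout described in the last comment, repeated over several years, could look roughly like the sketch below. It is written for Python 3 and writes to an in-memory buffer so it runs without touching the website. The function name write_year_rows and the words_by_year dictionary are hypothetical stand-ins for whatever the scraping step returns; the 1551 entries are placeholders, not real results.

```python
import csv
import io


def write_year_rows(words_by_year, csvfile):
    """Write one row per year: the year first, then that year's words."""
    writer = csv.writer(csvfile)
    for year in sorted(words_by_year):
        words = words_by_year[year]
        if words:                        # skip years with no results
            writer.writerow([year] + list(words))


# Hypothetical data standing in for scraped results.
words_by_year = {
    1550: ["accomplice", "baton", "heredity"],
    1551: ["word_a", "word_b"],          # placeholder values
    1552: [],                            # no results, so no row is written
}

buffer = io.StringIO()
write_year_rows(words_by_year, buffer)
print(buffer.getvalue())
```

When writing to a real file instead of a buffer, the csv module expects the file to be opened with newline='' in Python 3, or in 'wb' mode in Python 2 as in the answer above.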