I have a large HTML document that I would like to scrape. A heavily simplified example of its structure:

<a name = 'ID_0'></a>
<span class='c2'>Date</span>
<span class='c2'>December 12,2005</span>
<span class='c2'>Source</span>
<span class='c2'>NY Times</span>
<span class='c2'>Author</span>
<span class='c2'>John</span>

<a name = 'ID_1'></a>
<span class='c2'>Date</span>
<span class='c2'>January 21,2008</span>
<span class='c2'>Source</span>
<span class='c2'>LA Times</span>

<a name = 'ID_2'></a>
<span class='c2'>Source</span>
<span class='c2'>Wall Street Journal</span>
<span class='c2'>Author</span>
<span class='c2'>Jane</span>

The document has roughly 3,500 'a' tags, and at first I assumed every record would have an identical layout. So I wrote something along the lines of:

a_list = soup.find_all('a')
data2D = []
for i in range(0,len(a_list)):
    data=[]
    data.append(a_list[i]['name'])
    data.append(a_list[i].find_next(text='Date').find_next().text)
    data.append(a_list[i].find_next(text='Source').find_next().text)
    data.append(a_list[i].find_next(text='Author').find_next().text)
    data2D.append(data)

However, some IDs are missing an Author or a Date, and in those cases the scraper picks up the next available Author or Date, which belongs to a later ID: ID_1 gets ID_2's Author, and ID_2 gets ID_3's Date. My first thought was to keep track of each tag's position and append null whenever a match lies beyond the next 'a' tag. Is there a better solution?
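
The leak described above can be reproduced in isolation. The following is only a minimal sketch: it assumes Beautiful Soup 4 with the built-in "html.parser" backend, uses two records cut down from the sample, and the variable names (html, first, leaked_author) are made up for illustration. It also uses the string= argument, which newer Beautiful Soup releases prefer over text=.

```python
from bs4 import BeautifulSoup

# Two records cut down from the sample above; ID_1 has no Author.
html = """
<a name='ID_1'></a>
<span class='c2'>Date</span>
<span class='c2'>January 21,2008</span>
<span class='c2'>Source</span>
<span class='c2'>LA Times</span>

<a name='ID_2'></a>
<span class='c2'>Source</span>
<span class='c2'>Wall Street Journal</span>
<span class='c2'>Author</span>
<span class='c2'>Jane</span>
"""

soup = BeautifulSoup(html, "html.parser")
first = soup.find("a", attrs={"name": "ID_1"})

# find_next() searches the rest of the document, not just this record,
# so the lookup runs past the ID_2 anchor and picks up ID_2's Author.
leaked_author = first.find_next(string="Author").find_next().text
print(leaked_author)
```

Nothing in find_next() knows where a record ends, so any fix has to stop the search at the next 'a' tag explicitly.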

  • Use lxml and XPath. Commented Nov 4, 2015 at 13:50

1 Answer

Instead of find_next(), I would use .find_next_siblings() (or .find_all_next()) and walk the tags until the next 'a' element or the end of the document. Something along these lines:

links = soup.find_all('a', {"name": True})  # only anchors that carry a name attribute
data = []
columns = {'Date', 'Source', 'Author'}

for link in links:
    item = [link["name"]]
    for elm in link.find_next_siblings():
        if elm.name == "a":
            break  # hit the next "a" element - break

        if elm.text in columns:
            # the element right after a label holds its value
            item.append(elm.find_next().text)

    data.append(item)
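
One gap in the loop above: values are appended positionally, so a record with a missing field produces a shorter row rather than a null in the right column. Below is a minimal, self-contained sketch of one way to close that gap. It assumes Beautiful Soup 4 with the built-in "html.parser" backend and that labels and values strictly alternate as in the sample; HTML, COLUMNS, and scrape_records are names made up for this illustration, not part of any library.

```python
from bs4 import BeautifulSoup

# Sample markup mirroring the simplified document from the question.
HTML = """
<a name='ID_0'></a>
<span class='c2'>Date</span>
<span class='c2'>December 12,2005</span>
<span class='c2'>Source</span>
<span class='c2'>NY Times</span>
<span class='c2'>Author</span>
<span class='c2'>John</span>

<a name='ID_1'></a>
<span class='c2'>Date</span>
<span class='c2'>January 21,2008</span>
<span class='c2'>Source</span>
<span class='c2'>LA Times</span>

<a name='ID_2'></a>
<span class='c2'>Source</span>
<span class='c2'>Wall Street Journal</span>
<span class='c2'>Author</span>
<span class='c2'>Jane</span>
"""

COLUMNS = ("Date", "Source", "Author")  # fixed output column order


def scrape_records(html):
    """Return one [name, Date, Source, Author] row per named anchor."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for link in soup.find_all("a", attrs={"name": True}):
        fields = dict.fromkeys(COLUMNS)  # every column starts as None
        label = None  # label whose value we are still waiting for
        for elm in link.find_next_siblings():
            if elm.name == "a":
                break  # reached the next record
            text = elm.get_text(strip=True)
            if label is None:
                if text in fields:
                    label = text  # this span is a label
            else:
                fields[label] = text  # this span is the label's value
                label = None
        rows.append([link["name"]] + [fields[c] for c in COLUMNS])
    return rows


rows = scrape_records(HTML)
for row in rows:
    print(row)
```

Every row has the same four columns, with None wherever a record lacks a field. Because labels and values are consumed in pairs, a value that happens to equal a label name (an author called "Source", say) is not mistaken for a label.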