Python Web Scraping Issue

Question

I have a large HTML document that I would like to scrape. A much-simplified example of a similar document is as follows:

<a name = 'ID_0'></a>
<span class='c2'>Date</span>
<span class='c2'>December 12,2005</span>
<span class='c2'>Source</span>
<span class='c2'>NY Times</span>
<span class='c2'>Author</span>
<span class='c2'>John</span>

<a name = 'ID_1'></a>
<span class='c2'>Date</span>
<span class='c2'>January 21,2008</span>
<span class='c2'>Source</span>
<span class='c2'>LA Times</span>

<a name = 'ID_2'></a>
<span class='c2'>Source</span>
<span class='c2'>Wall Street Journal</span>
<span class='c2'>Author</span>
<span class='c2'>Jane</span>

The document has roughly 3,500 'a' tags, and at first I assumed that each would be followed by the same layout, so I wrote something along these lines:

a_list = soup.find_all('a')
data2D = []
for i in range(0,len(a_list)):
    data=[]
    data.append(a_list[i]['name'])
    data.append(a_list[i].find_next(text='Date').find_next().text)
    data.append(a_list[i].find_next(text='Source').find_next().text)
    data.append(a_list[i].find_next(text='Author').find_next().text)
    data2D.append(data)

However, since some IDs are missing an Author or a Date, find_next() keeps searching past the next 'a' tag and picks up the next available Author or Date, which belongs to a later ID. ID_1 would get ID_2's Author, and ID_2 would get ID_3's Date. My first thought was to keep track of the index of each tag and append null whenever a match lies beyond the index of the next 'a' tag. Is there a better solution?

Use lxml and XPath.

– Learner, Commented Nov 4, 2015 at 13:50
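The comment does not say what the XPath would look like, so the following is only one possible reading of it. For each anchor, the sketch selects the following-sibling spans whose nearest preceding 'a' sibling is that same anchor, then pairs labels with values. The sample markup, the names SAMPLE, COLUMNS, SPANS_OF_ANCHOR and scrape_with_xpath, the wrapping div, and the assumptions that anchor names are unique and that labels and values strictly alternate are all illustrative, not taken from the real document. On a document with thousands of anchors this per-anchor query may be noticeably slower than a single pass.

```python
from lxml import html

# Hypothetical sample modeled on the question's simplified document.
SAMPLE = """
<div>
<a name="ID_0"></a>
<span class="c2">Date</span>
<span class="c2">December 12,2005</span>
<span class="c2">Source</span>
<span class="c2">NY Times</span>
<span class="c2">Author</span>
<span class="c2">John</span>
<a name="ID_1"></a>
<span class="c2">Date</span>
<span class="c2">January 21,2008</span>
<span class="c2">Source</span>
<span class="c2">LA Times</span>
<a name="ID_2"></a>
<span class="c2">Source</span>
<span class="c2">Wall Street Journal</span>
<span class="c2">Author</span>
<span class="c2">Jane</span>
</div>
"""

COLUMNS = ("Date", "Source", "Author")

# Spans after the context anchor whose nearest preceding <a> sibling
# carries the anchor's own name (assumes names are unique).
SPANS_OF_ANCHOR = (
    "following-sibling::span[preceding-sibling::a[1][@name = $anchor]]"
)


def scrape_with_xpath(markup):
    root = html.fromstring(markup)
    rows = []
    for anchor in root.xpath(".//a[@name]"):
        name = anchor.get("name")
        spans = anchor.xpath(SPANS_OF_ANCHOR, anchor=name)
        texts = [span.text_content().strip() for span in spans]
        # Assumes spans alternate label, value, label, value, ...
        fields = dict(zip(texts[0::2], texts[1::2]))
        # Missing fields become None instead of leaking from the next record.
        rows.append([name] + [fields.get(column) for column in COLUMNS])
    return rows


if __name__ == "__main__":
    for row in scrape_with_xpath(SAMPLE):
        print(row)
```

Because each row is built from a dictionary lookup per column, a record without an Author or Date gets None in that position rather than a value borrowed from a later ID.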

alecxe · Accepted Answer · 2015-11-04 14:45:04Z

Instead of find_next(), I would use .find_next_siblings() (or .find_all_next()) and iterate over the tags until the next 'a' tag or the end of the document. Something along these lines:

# Assumes `soup` is the BeautifulSoup object from the question.
links = soup.find_all('a', {"name": True})
data = []
columns = {'Date', 'Source', 'Author'}

for link in links:
    item = [link["name"]]
    for elm in link.find_next_siblings():
        if elm.name == "a":
            break  # hit the next "a" element - stop collecting for this record

        # a label span: its value is in the element that follows it
        if elm.text in columns:
            item.append(elm.find_next().text)

    data.append(item)
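Note that the loop above appends only the values it finds, so records with missing fields produce shorter rows and a position no longer identifies a column. A minimal sketch of a variant that keeps every row aligned and fills gaps with None (the "append null" behaviour the question asks about) might look like this. The sample markup and the names HTML, COLUMNS and scrape_records are illustrative assumptions, as is the choice of the built-in "html.parser" backend.

```python
from bs4 import BeautifulSoup

# Hypothetical sample modeled on the question's simplified document.
HTML = """
<a name="ID_0"></a>
<span class="c2">Date</span>
<span class="c2">December 12,2005</span>
<span class="c2">Source</span>
<span class="c2">NY Times</span>
<span class="c2">Author</span>
<span class="c2">John</span>
<a name="ID_1"></a>
<span class="c2">Date</span>
<span class="c2">January 21,2008</span>
<span class="c2">Source</span>
<span class="c2">LA Times</span>
<a name="ID_2"></a>
<span class="c2">Source</span>
<span class="c2">Wall Street Journal</span>
<span class="c2">Author</span>
<span class="c2">Jane</span>
"""

COLUMNS = ("Date", "Source", "Author")


def scrape_records(markup):
    soup = BeautifulSoup(markup, "html.parser")
    rows = []
    for link in soup.find_all("a", attrs={"name": True}):
        found = {}
        for elm in link.find_next_siblings():
            if elm.name == "a":
                break  # reached the next record
            label = elm.get_text(strip=True)
            if label in COLUMNS and label not in found:
                value = elm.find_next_sibling()
                # Only accept a value that still belongs to this record.
                if value is not None and value.name != "a":
                    found[label] = value.get_text(strip=True)
        # Fixed column order; missing fields become None.
        rows.append([link["name"]] + [found.get(c) for c in COLUMNS])
    return rows


if __name__ == "__main__":
    for row in scrape_records(HTML):
        print(row)
```

Collecting into a dictionary first and emitting the columns in a fixed order is what keeps the rows rectangular, which makes the result easy to load into a CSV writer or a DataFrame later.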
