I need to scrape a website that has a basic folder system, with folders labeled with keywords; some of the folders contain text files. I need to scan all the pages (folders), follow the links to new folders, and record keywords and files. My main problem is more abstract: if there is a directory with nested folders of unknown "depth", what is the most Pythonic way to iterate through all of them? (If the "depth" were known, it would be a really simple for loop.) Ideas greatly appreciated.
2 Answers
Here's a simple spider algorithm. It uses a deque for documents to be processed and a set of already processed documents:
    from collections import deque

    active = deque()
    seen = set()
    active.append(first_document)          # starting point, e.g. the root URL
    while active:
        document = active.popleft()
        if document in seen:
            continue
        # do stuff with the document -- e.g. index keywords
        seen.add(document)
        for link in links_in(document):    # placeholder: however you extract links
            active.append(link)
Note that this is iterative and as such can work with arbitrarily deep trees.
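For the asker's concrete case of nested folders on disk, a minimal sketch of the same breadth-first idea is below. The function name and the decision to collect file paths are assumptions; the realpath check guards against symlink cycles:

    import os
    from collections import deque

    def walk_folders(root):
        """Breadth-first walk over a directory tree of unknown depth."""
        active = deque([root])
        seen = set()
        found = []
        while active:
            folder = active.popleft()
            real = os.path.realpath(folder)   # resolve symlinks so loops are detected
            if real in seen:
                continue
            seen.add(real)
            for name in os.listdir(folder):
                path = os.path.join(folder, name)
                if os.path.isdir(path):
                    active.append(path)       # a "link" to a new folder
                else:
                    found.append(path)        # e.g. record a text file
        return found

Because the queue holds the pending folders explicitly, the depth of the tree never matters.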
Recursion is usually the easiest way to go.
However, in Python that might eventually give you a RecursionError (Python's equivalent of a StackOverflowError) if someone creates a directory with a symlink to itself or to a parent.
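A recursive version of the same traversal might look like the sketch below (the function name is an assumption); passing a set of already-visited real paths avoids the infinite recursion a symlink cycle would otherwise cause:

    import os

    def walk_recursive(folder, seen=None):
        """Depth-first recursive walk; returns all file paths under folder."""
        if seen is None:
            seen = set()
        real = os.path.realpath(folder)
        if real in seen:                 # already visited -- a symlink loop
            return []
        seen.add(real)
        files = []
        for name in os.listdir(folder):
            path = os.path.join(folder, name)
            if os.path.isdir(path):
                files.extend(walk_recursive(path, seen))
            else:
                files.append(path)
        return files

Even with the cycle guard, a legitimately very deep tree can still exceed Python's recursion limit (about 1000 frames by default), which is why the iterative approach scales better.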
1 Comment
root
Thank you for your answer. It seems to me that both answers to the question can solve the problem. However, as I am fairly new to Python/recursion, would it be possible for you to provide a small snippet of (pseudo)code to make it easier to compare these two options?