0

I need to scrape a website that has a basic folder system, with folders labeled with keywords; some of the folders contain text files. I need to scan all the pages (folders), follow the links to new folders, and record the keywords and files. My main problem is more abstract: if there is a directory with nested folders of unknown "depth", what is the most Pythonic way to iterate through all of them? (If the "depth" were known, it would be a really simple for loop.) Ideas greatly appreciated.

2 Answers

2

Here's a simple spider algorithm. It uses a deque for documents to be processed and a set of already processed documents:

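from collections import deque

active = deque()   # documents waiting to be processed
seen = set()       # documents already processed

active.append(first_document)   # placeholder: the start page

while active:
    document = active.popleft()
    if document in seen:
        continue

    # do stuff with the document -- e.g. index keywords

    seen.add(document)
    for link in links_in(document):   # placeholder: yields the links found in the document
        active.append(link)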

Note that this is iterative and as such can work with arbitrarily deep trees. The seen set also keeps it from looping forever when pages link back to each other.
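For comparison, here is a minimal runnable sketch of the same idea. It assumes the site can be modeled as a dictionary mapping each folder to the folders it links to; the `crawl` function, the `get_links` callback, and the `site` data are made up for illustration, and a real scraper would fetch and parse each page instead.

```python
from collections import deque


def crawl(start, get_links):
    """Visit every page reachable from start, breadth-first.

    get_links is a hypothetical callback returning the links on a page.
    Returns the pages in the order they were first visited.
    """
    active = deque([start])  # pages waiting to be processed
    seen = set()             # pages already processed
    order = []

    while active:
        page = active.popleft()
        if page in seen:
            continue  # reached again through another link
        seen.add(page)
        order.append(page)  # "do stuff with the document" goes here
        active.extend(get_links(page))

    return order


# Made-up site layout: folder -> folders it links to (note the back links).
site = {
    "/": ["/a/", "/b/"],
    "/a/": ["/a/x/", "/"],
    "/b/": [],
    "/a/x/": ["/a/"],
}

pages = crawl("/", lambda page: site.get(page, []))
print(pages)  # ['/', '/a/', '/b/', '/a/x/']
```

Because pages are taken from the left of the deque, folders are visited level by level; using `active.pop()` instead would make the traversal depth-first.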


2 Comments

Thank you for your answer - it's working well. However, I think there's a mistake: a deque object doesn't have an add attribute; it should be append.
@priilane: you're welcome. My post is more pseudocode than working Python... nevertheless, fixed.
2

Recursion is usually the easiest way to go.

However, in Python that might raise a RecursionError (maximum recursion depth exceeded) after some time if someone creates a directory with a symlink to itself or a parent.
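A rough sketch of the recursive approach, assuming the folder structure is given as a dictionary of links (the `walk` function, the `get_links` callback, and the `tree` data are hypothetical names for illustration). Passing a set of visited pages along guards against the self-link problem, although a very deep structure could still exceed Python's recursion limit.

```python
def walk(page, get_links, visited=None):
    """Visit page and everything reachable from it, depth-first.

    get_links is a hypothetical callback returning the links on a page.
    Returns the pages in the order they were first visited.
    """
    if visited is None:
        visited = set()
    if page in visited:
        return []  # already handled: stops loops caused by back links
    visited.add(page)

    found = [page]  # process the page here, e.g. record keywords
    for link in get_links(page):
        found.extend(walk(link, get_links, visited))
    return found


# Made-up folder layout: folder -> folders it links to (note the back links).
tree = {
    "/": ["/a/", "/b/"],
    "/a/": ["/a/x/", "/"],
    "/b/": [],
    "/a/x/": ["/a/"],
}

result = walk("/", lambda page: tree.get(page, []))
print(result)  # ['/', '/a/', '/a/x/', '/b/']
```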

1 Comment

Thank you for your answer. It seems to me that both answers to the question can solve the problem. However, as I am fairly new to Python/recursion, would it be possible for you to provide a small snippet of (pseudo)code to make it easier to compare these two options?
