I need to scrape a website that has a basic folder system, with folders labeled with keywords; some of the folders contain text files. I need to scan all the pages (folders), follow the links to new folders, and record keywords and files. My main problem is more abstract: if there is a directory with nested folders of unknown "depth", what is the most Pythonic way to iterate through all of them? (If the "depth" were known, it would be a really simple for loop.) Ideas greatly appreciated.
2 Answers
Here's a simple spider algorithm. It uses a deque for documents to be processed and a set of already processed documents:
    from collections import deque

    active = deque()
    seen = set()
    active.append(first_document)          # starting point, e.g. the root URL
    while active:
        document = active.popleft()
        if document in seen:
            continue
        # do stuff with the document -- e.g. index keywords
        seen.add(document)
        for link in links_in(document):    # placeholder: however you extract links
            active.append(link)
Note that this is iterative and as such can work with arbitrarily deep trees.
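For the asker's concrete case of nested folders on disk, a minimal sketch of the same breadth-first idea is below. The function name and the decision to collect file paths are assumptions; the realpath check guards against symlink cycles:

    import os
    from collections import deque

    def walk_folders(root):
        """Breadth-first walk over a directory tree of unknown depth."""
        active = deque([root])
        seen = set()
        found = []
        while active:
            folder = active.popleft()
            real = os.path.realpath(folder)   # resolve symlinks so loops are detected
            if real in seen:
                continue
            seen.add(real)
            for name in os.listdir(folder):
                path = os.path.join(folder, name)
                if os.path.isdir(path):
                    active.append(path)       # a "link" to a new folder
                else:
                    found.append(path)        # e.g. record a text file
        return found

Because the queue holds the pending folders explicitly, the depth of the tree never matters.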
Recursion is usually the easiest way to go.
However, in Python that might eventually give you a RecursionError (Python's equivalent of a StackOverflowError) if someone creates a directory with a symlink to itself or to a parent.
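A recursive version of the same traversal might look like the sketch below (the function name is an assumption); passing a set of already-visited real paths avoids the infinite recursion a symlink cycle would otherwise cause:

    import os

    def walk_recursive(folder, seen=None):
        """Depth-first recursive walk; returns all file paths under folder."""
        if seen is None:
            seen = set()
        real = os.path.realpath(folder)
        if real in seen:                 # already visited -- a symlink loop
            return []
        seen.add(real)
        files = []
        for name in os.listdir(folder):
            path = os.path.join(folder, name)
            if os.path.isdir(path):
                files.extend(walk_recursive(path, seen))
            else:
                files.append(path)
        return files

Even with the cycle guard, a legitimately very deep tree can still exceed Python's recursion limit (about 1000 frames by default), which is why the iterative approach scales better.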
1 Comment
root
Thank you for your answer. It seems to me that both answers to the question can solve the problem. However, as I am fairly new to Python/recursion, would it be possible for you to provide a small snippet of (pseudo)code to make it easier to compare these two options?