Description
I recently wrote a Python script to download files from the Library of Congress (LOC) based on a search query. The code fetches metadata, extracts the URLs of files in a specific format (e.g., JPEG), and downloads those files into a designated directory. I'm seeking feedback on its structure, efficiency, and readability.
Functionality
The script uses the Library of Congress API to:
- Retrieve Item IDs: Query the API for items matching the search URL and parse their IDs (a request sketch follows this section).
- Get File URLs: For each item, extract URLs for files of a specified MIME type (e.g., image/jpeg for .jpg).
- Download Files: Save the matching files locally with a logical directory and naming structure.
Key features include retry logic with exponential backoff, rate limiting between requests, pagination handling with a page-count safeguard, streamed chunked downloads, and per-item directories with unique filenames.
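For context, here is a minimal sketch of the kind of search request the script builds on. The URL is a placeholder, and the fo, c, and at parameters mirror the ones the class passes when paging through results:

import requests

search_url = "https://www.loc.gov/photos/?q=bridges"  # placeholder; any loc.gov search URL works the same way
params = {"fo": "json", "c": 100, "at": "results,pagination"}

response = requests.get(search_url, params=params, timeout=10)
response.raise_for_status()
data = response.json()

# "results" lists the matching items; "pagination" -> "next" holds the next page URL, if any
for result in data.get("results", []):
    print(result.get("id"))
print(data.get("pagination", {}).get("next"))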
Code Walkthrough
- Initialization (__init__): Validates inputs like search_url, file_extension, and save_to, and maps the specified file extension to a MIME type using MIME_TYPE_MAP.
- Data Fetching (fetch_json): Sends HTTP requests with retries on failure and ensures the response is JSON, logging a warning otherwise.
- Item ID Retrieval (get_item_ids): Parses paginated results from the LOC API to collect relevant item IDs, filtering out items that don't match the criteria (e.g., collections).
- File URL Extraction (get_image_urls): Iterates through the items, gathering file URLs that match the specified MIME type, with rate limiting to respect the server (a sketch of the item JSON this step reads follows the list).
- File Downloading (download_files): Saves files to the designated directory with unique filenames, streaming downloads in chunks to handle large files.
- Main Workflow (run): Combines the above steps: fetch item IDs, extract file URLs, and download the files.
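To make the extraction step concrete, below is a trimmed, made-up sketch of the per-item JSON shape the code relies on. Only the keys the script actually reads (resources, files, mimetype, url) are shown; the URLs and nesting are illustrative, not the full LOC schema:

item_data = {
    "resources": [
        {
            "files": [
                [  # file entries can be nested one level deep, which is why flatten_files() exists
                    {"mimetype": "image/jpeg", "url": "https://example.org/item/full.jpg"},
                    {"mimetype": "image/tiff", "url": "https://example.org/item/master.tif"},
                ]
            ]
        }
    ]
}

# Equivalent of the extraction loop, keeping only JPEGs:
jpeg_urls = [
    f["url"]
    for resource in item_data.get("resources", [])
    for group in resource.get("files", [])
    for f in (group if isinstance(group, list) else [group])
    if f.get("mimetype") == "image/jpeg"
]
print(jpeg_urls)  # ['https://example.org/item/full.jpg']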
Source Code
import requests
import os
import time
import logging
from tenacity import retry, wait_exponential, stop_after_attempt
from typing import List, Dict, Optional

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s"
)


class LOCImageDownloader:
    """
    A class to download files from the Library of Congress based on a search query.

    Example:
        downloader = LOCImageDownloader(search_url, file_extension, save_to)
        downloader.run()
    """

    MIME_TYPE_MAP = {
        "jpg": "image/jpeg",
        "jpeg": "image/jpeg",
        "tif": "image/tiff",
        "tiff": "image/tiff",
        "pdf": "application/pdf",
        "png": "image/png",
    }

    def __init__(self, search_url: str, file_extension: str, save_to: str):
        """
        Initialize the downloader with search URL, desired file extension, and save path.

        Args:
            search_url (str): The search URL to query the LOC API.
            file_extension (str): Desired file extension (e.g., 'jpg', 'pdf').
            save_to (str): Directory to save downloaded files.

        Raises:
            ValueError: If the specified file extension is unsupported.
        """
        if not search_url.startswith("http"):
            raise ValueError("Invalid search_url. Must start with http or https.")
        if not os.access(os.path.dirname(save_to) or ".", os.W_OK):
            raise ValueError(f"Save directory '{save_to}' is not writable.")
        self.search_url = search_url
        self.file_extension = file_extension.lower()
        self.mime_type = self.MIME_TYPE_MAP.get(self.file_extension)
        if not self.mime_type:
            raise ValueError(f"Unsupported file extension: {self.file_extension}")
        self.save_to = save_to
        os.makedirs(self.save_to, exist_ok=True)
        self.session = requests.Session()

    @retry(wait=wait_exponential(multiplier=1, min=2, max=10), stop=stop_after_attempt(5))
    def fetch_json(self, url: str, params: Optional[Dict] = None) -> Optional[Dict]:
        """Fetch JSON data from a given URL."""
        try:
            response = self.session.get(url, params=params, timeout=10)
            response.raise_for_status()
            if 'application/json' not in response.headers.get("Content-Type", ""):
                logging.warning(f"Unexpected Content-Type for {url}: {response.headers.get('Content-Type')}")
                return None
            return response.json()
        except requests.RequestException as e:
            logging.error(f"Error fetching {url}: {e}")
            return None

    def get_item_ids(self) -> List[str]:
        """Retrieve item IDs from the LOC search URL."""
        logging.info("Fetching item IDs from the Library of Congress...")
        item_ids, url, count = [], self.search_url, 0
        max_pages = 100  # Safeguard for pagination loops
        pages_processed = 0
        while url and pages_processed < max_pages:
            data = self.fetch_json(url, {"fo": "json", "c": 100, "at": "results,pagination"})
            if not data:
                break
            results = data.get("results", [])
            for result in results:
                if "item" in result.get("id", "") and "collection" not in result.get("original_format", []):
                    item_ids.append(result["id"])
            count += len(results)
            url = data.get("pagination", {}).get("next")  # Get the next page URL
            pages_processed += 1
            # Be kind to the server (this is a magic number provided by the LOC)
            time.sleep(1)
        logging.info(f"Found {len(item_ids)} items.")
        return item_ids

    def get_image_urls(self, item_ids: List[str]) -> List[Dict[str, str]]:
        """Retrieve URLs for files matching the specified file type."""
        logging.info("Retrieving file URLs for the specified items...")
        file_urls, processed = [], 0
        for item_url in item_ids:
            data = self.fetch_json(item_url, {"fo": "json"})
            processed += 1
            logging.info(f"Processing item {processed}/{len(item_ids)}...")
            if not data:
                continue
            for resource in data.get("resources", []):
                for file_info in self.flatten_files(resource.get("files", [])):
                    if file_info.get("mimetype") == self.mime_type:
                        file_urls.append({"image_url": file_info["url"], "item_id": item_url})
            # Be kind to the server (this is a magic number provided by the LOC)
            time.sleep(2)
        logging.info(f"Found {len(file_urls)} matching files.")
        return file_urls

    @staticmethod
    def flatten_files(files: List) -> List[Dict]:
        """Flatten a potentially nested list of files into a single list of dictionaries."""
        if not isinstance(files, list):
            raise ValueError("Expected a list for 'files'")
        return [item for sublist in files for item in (sublist if isinstance(sublist, list) else [sublist])]

    def download_files(self, file_urls: List[Dict[str, str]]) -> None:
        """Download files from the given URLs and save them to the specified directory."""
        logging.info("Downloading files...")
        for index, file_info in enumerate(file_urls, start=1):
            file_url = file_info["image_url"]
            item_id = file_info["item_id"].strip("/").split("/")[-1]
            save_path = os.path.join(self.save_to, item_id)
            os.makedirs(save_path, exist_ok=True)

            # Determine the filename
            if "image-services/iiif" in file_url:
                url_parts = file_url.split("/")
                filename_part = next((part.split(":")[-1] for part in url_parts if "service:" in part), "image")
                ext = file_url.split(".")[-1]
                filename = f"{filename_part}.{ext}"
            else:
                filename = file_url.split("/")[-1]

            # Ensure unique filename
            file_path = os.path.join(save_path, filename)
            counter = 1
            while os.path.exists(file_path):
                file_path = os.path.join(save_path, f"{filename.rsplit('.', 1)[0]}_{counter}.{self.file_extension}")
                counter += 1

            logging.info(f"[{index}/{len(file_urls)}] Downloading {file_url} as {file_path}...")

            # Download the file
            try:
                with self.session.get(file_url, stream=True) as response:
                    response.raise_for_status()
                    with open(file_path, "wb") as f:
                        for chunk in response.iter_content(chunk_size=8192):
                            f.write(chunk)
            except requests.RequestException as e:
                logging.error(f"Error downloading {file_url}: {e}")

            # Be kind to the server (this is a magic number provided by the LOC)
            time.sleep(2)

    def run(self) -> None:
        """Execute the downloader: fetch item IDs, retrieve file URLs, and download files."""
        item_ids = self.get_item_ids()
        if not item_ids:
            logging.error("No items found.")
            return
        file_urls = self.get_image_urls(item_ids)
        if not file_urls:
            logging.error("No matching files found.")
            return
        self.download_files(file_urls)
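For reference, this is roughly how the class is invoked (the search URL and output directory below are placeholders):

if __name__ == "__main__":
    downloader = LOCImageDownloader(
        search_url="https://www.loc.gov/photos/?q=bridges",  # placeholder query
        file_extension="jpg",
        save_to="loc_downloads",  # created automatically if it does not exist
    )
    downloader.run()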
Feedback Areas
The code seems to work well and as expected. However, I haven't done much web scraping before, so I'm looking for general feedback. I'm most concerned about the following:
- Overall Readability: Is the flow of the program intuitive and maintainable?
- Error Handling: Are exceptions handled effectively, particularly for network-related errors? Are there additional checks or validations I should include?
- Best Practices: Are there any Pythonic improvements or standard practices I should adopt?