Let's assume this project becomes successful
and is installed on a great many websites,
which want more than "I sent a random number to the user's email
and they sent it back to my website" during registration.
ongoing support
There will be false negatives and false positives.
There always are.
And it will cause various people grief to set things right.
updates
The readme
explains one can create a PR or Issue suggesting an
update to host lists.
Let's imagine this process works smoothly in a timely fashion.
And a new minor version of the package quickly gets pushed out.
Now one must wait for a great many websites to get around
to obtaining an up-to-date version of the package.
This can be problematic.
Issuing code updates in this way makes perfect sense.
Data updates?
Less so.
Please reconsider this approach.
policy
I didn't notice anything in the repo describing a policy
for what hosts go into what categories.
I only saw an ill-defined distinction between "fake / temp"
emails and "good" emails.
It would be helpful to operationalize the definition, so programs
or people could examine the same address and reach the same conclusion.
If you delegate policy decisions to others, such as 7c,
please spell that out.
A "fake" address is easily defined: zero percent SMTP delivery,
for example due to bad DNS, no {A, AAAA, MX} server address,
or a server that doesn't respond on TCP ports such as 25.
Sending a random number challenge weeds these out,
but I can see it's attractive to save the effort of
attempting a doomed delivery which will always eventually time out.
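For what it's worth, here is a minimal sketch of that DNS-level test, using PHP's built-in checkdnsrr(); the function name is mine, and a fuller check would add the TCP port 25 probe and explicit timeouts:

```php
<?php
// Sketch: a domain is "fake" in this sense when DNS offers neither an MX
// record nor an A/AAAA fallback address, i.e. zero percent SMTP delivery.
function domainLooksFake(string $domain): bool
{
    foreach (['MX', 'A', 'AAAA'] as $type) {
        if (checkdnsrr($domain, $type)) {
            return false;   // at least one delivery target exists in DNS
        }
    }
    return true;            // nothing to deliver to
}
```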
OTOH a "temp" address needs some parameters to define it:
random challenge initially worked fine but then after X time Y% of
users become unresponsive, that sort of thing.
data delivery
There are many reputable SMTP blacklist services
with good support operations and good data delivery,
including Spamhaus, SORBS, UCEPROTECTL3, Nordspam DBL, and others.
They curate blacklist files and ship them around.
But for up-to-date data delivery to end user websites,
it's typical to look up a TXT record or similar in the
Domain Name System. This has the virtues of being highly
replicated and available, offering caching, and imposing a
TTL
on records so the support process can recover from a "whoops!".
A typical failure mode is that a popular site gets 0wned by a
bad actor who exhibits bad behavior, then the site is repaired,
and it goes back to exhibiting good behavior.
Please consider sending much of your data through that or some
similar online system, so that updates become available
at client websites before those client websites
get around to performing a package software update.
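As a sketch of what the client side of that could look like, assuming you published listed domains under a zone of your own (dbl.example.org below is entirely hypothetical), a lookup is a single resolver call, and the resolver's cache honours whatever TTL you set:

```php
<?php
// Hypothetical DBL-style client: a listed domain such as "mailinator.com"
// would be published as "mailinator.com.dbl.example.org" with an A record.
// The zone name below is a placeholder, not anything that exists today.
function isListedDisposable(string $domain, string $zone = 'dbl.example.org'): bool
{
    $query = rtrim($domain, '.') . '.' . $zone;

    // Any A record means "listed"; the 127.0.0.x value (or a TXT record)
    // could encode the category and the reason, if you wanted richer data.
    $records = dns_get_record($query, DNS_A);

    return is_array($records) && $records !== [];
}
```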
The flip side of this is relevant as well.
If a new fake-email domain crops up and is an overnight success,
a great many client websites will want to know about it overnight.
```php
final class Validator
{
    ...
    private const PATTERN2 = '/^[a-zA-Z0-9!#$%&\'*+\\/=?^_`{|}~-]+(?:\.[a-zA-Z0-9!#$%&\'*+\\/=?^_`{|}~-]+)*@(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?\.)+[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?$/';
```
The code credits Yii validator, but alas does not mention why
we don't just take a package dependency on the current version of Yii.
In any event, I fear the underlying approach is deeply flawed.
Once upon a time one could validate e.g. [email protected] with such a regex.
Once upon a time all such traffic originated at ARPAnet nodes
located in the U.S., and the network map fit on a sheet of 8.5×11.
Nowadays email traffic has significantly broader reach.
I look at that regex and I think, "Wow! That is very US-ASCII centric."
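To make that concrete, here is the pattern from above turned loose on a couple of sample addresses (the addresses are mine); it rejects the internationalized one outright, before any DNS or SMTP check gets a say:

```php
<?php
// PATTERN2 copied verbatim from the class quoted above.
$pattern = '/^[a-zA-Z0-9!#$%&\'*+\\/=?^_`{|}~-]+(?:\.[a-zA-Z0-9!#$%&\'*+\\/=?^_`{|}~-]+)*@(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?\.)+[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?$/';

var_dump((bool) preg_match($pattern, 'franz@example.com'));  // true
var_dump((bool) preg_match($pattern, 'müller@bücher.de'));   // false -- turned away unseen
```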
An email address is localpart@globalpart,
where globalpart should appear in the Domain Name System.
DNS lookups are cheap, fast, and can be quickly timed out.
I have trouble seeing how such a regex wins over just doing the lookup.
And apart from enforcing length restrictions, why is "localpart"
any of your business? Send it over port 25 and see what the other
end says. If you can't transmit a random number because an SMTP
peer said "no such user", fine, you have your answer.
Insist on rapid replies, during an interactive web registration session,
if a synchronous flow works better for your UX.
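Here is a rough sketch of that port-25 conversation; the HELO name and probe addresses are placeholders, many large providers will defer or tarpit such callouts, and a production version needs careful timeout and multi-MX handling, so treat it as illustrative only:

```php
<?php
// Rough sketch of an SMTP callout: connect to the recipient's first MX host
// and see whether RCPT TO is accepted. Treat the answer as a hint, not a
// verdict; the HELO / MAIL FROM identities below are placeholders.
function smtpSaysDeliverable(string $email, int $timeout = 5): ?bool
{
    $parts = explode('@', $email, 2);
    if (count($parts) !== 2) {
        return false;                        // not even localpart@globalpart
    }
    $domain = $parts[1];

    $mx = [];
    if (!getmxrr($domain, $mx) || $mx === []) {
        $mx = [$domain];                     // RFC 5321 fallback to the A record
    }

    $fp = @fsockopen($mx[0], 25, $errno, $errstr, $timeout);
    if ($fp === false) {
        return null;                         // could not connect: no verdict
    }
    stream_set_timeout($fp, $timeout);

    $chat = static function ($fp, ?string $cmd): string {
        if ($cmd !== null) {
            fwrite($fp, $cmd . "\r\n");
        }
        do {                                 // swallow multi-line replies ("250-")
            $line = fgets($fp, 512);
        } while (is_string($line) && isset($line[3]) && $line[3] === '-');
        return is_string($line) ? $line : '';
    };

    $chat($fp, null);                        // 220 greeting
    $chat($fp, 'HELO validator.example');
    $chat($fp, 'MAIL FROM:<probe@validator.example>');
    $reply = $chat($fp, 'RCPT TO:<' . $email . '>');
    $chat($fp, 'QUIT');
    fclose($fp);

    return str_starts_with($reply, '2');     // 250 accepted, 550 no such user
}
```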
I don't know about you or your various client websites,
but I'm accustomed to seeing lots of customers in my logs
who have something besides English as their native tongue.
For both localpart and globalpart I have seen lots of
names written in Arabic, Chinese, Hindi, and many more besides.
Of course, if you're filtering such customers early in the flow,
you definitely won't see such examples in lists of paid invoices.
Please write down what portions of what Unicode code blocks
you support, as it will make a difference to maintainers
of your various client websites.
Please write down where this code fits into a stack
that may see IDN names, e.g. bücher.de
encoded as xn--bcher-kva.de.
Maybe client websites should perform such an encode / decode
so only US-ASCII globalpart is presented to this layer?
Document the details.
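If the answer is "yes, the client website does the encode", a sketch of that step with PHP's intl extension might look like this (function name mine):

```php
<?php
// Sketch: convert the global part to its ASCII (punycode) form before it
// reaches an ASCII-only validator. Requires the intl extension.
function asciiGlobalPart(string $email): ?string
{
    $at = strrpos($email, '@');
    if ($at === false) {
        return null;
    }

    $ascii = idn_to_ascii(substr($email, $at + 1), IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46);

    return $ascii === false ? null : $ascii;
}

var_dump(asciiGlobalPart('kunde@bücher.de'));   // "xn--bcher-kva.de"
```

Note this only touches the globalpart; whether a non-ASCII localpart (SMTPUTF8) is accepted is a separate decision that also deserves a line in the docs.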
memory vs disk
```php
final class UpdaterService
{
    public function execute(): void
    {
        $sourceBlacklistsDto = (new SourceBlacklistsLoader(new FileContentsLoader(), Config::SOURCE_BLACKLISTS_PATH))->load();
```
Hmmm, that looks like an inconveniently large memory footprint.
Once we get past fifty or sixty thousand records,
it's time to start thinking about databases and caches.
Some of those records will be hot, accessed frequently,
and most will be cold, never referenced at all.
Let sqlite or some other solution worry about those details for you.
Or make it the nameserver's problem, by issuing DNS queries.
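As a sketch of the sqlite route (file, table, and example domain are illustrative), a lookup becomes an indexed point query and nothing beyond one row ever sits in PHP memory:

```php
<?php
// Sketch: keep the lists on disk and let sqlite's index serve the hot lookups.
$db = new PDO('sqlite:' . __DIR__ . '/blacklists.sqlite');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$db->exec('CREATE TABLE IF NOT EXISTS domain (
               name     TEXT PRIMARY KEY,
               category TEXT NOT NULL
           )');

$stmt = $db->prepare('SELECT category FROM domain WHERE name = :name');
$stmt->execute([':name' => 'mailinator.com']);

$category = $stmt->fetchColumn();   // false when the domain is not listed
var_dump($category);
```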
I understand that this is used for batch processing,
so it's a transient memory footprint (though still a good
match for a JOIN).
But the DisposableEmailDomains getDomains() method
has the same kind of poorly scaling footprint.
And if we were a DBL DNS client, we wouldn't need to worry
about the full list on client machines anyway.
Once we accept the current approach, sure, the code looks good.
Returning error status 1 for "no changes" seems
slightly odd for a routine condition which is Not An Error,
but that's fine; the caller needs to learn about the status somehow.
Among the alternative approaches, you might promise that
the final line of stdout shall be a hash of the output file,
and then let the caller worry about comparing hashes to see
if there was a change.
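A sketch of that contract, with sha256 as an arbitrary choice and the paths illustrative: the updater always exits 0, its last stdout line is the hash, and the caller keeps yesterday's hash around to decide whether anything changed.

```php
<?php
// Updater side: after writing the output file, report its hash and exit 0;
// "no changes" is not an error. Paths and hash algorithm are illustrative.
$outputPath = __DIR__ . '/disposable_domains.txt';

// ... existing logic that (re)writes $outputPath goes here ...

echo hash_file('sha256', $outputPath), PHP_EOL;
exit(0);

// Caller side, e.g. a cron wrapper:
//   $new = trim((string) shell_exec('php updater.php'));
//   $old = (string) @file_get_contents('last_hash.txt');
//   if ($new !== $old) { /* something actually changed */ file_put_contents('last_hash.txt', $new); }
```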