Let's assume this project becomes successful
and is installed on a great many websites,
which want more than "I sent a random number to the user's email
and they sent it back to my website" during registration.
ongoing support
There will be false negatives and false positives.
There always are.
And it will cause various people grief to set things right.
updates
The readme
explains one can create a PR or Issue suggesting an
update to host lists.
Let's imagine this process works smoothly in a timely fashion.
And a new minor version of the package quickly gets pushed out.
Now one must wait for a great many websites to get around
to obtaining an up-to-date version of the package.
This can be problematic.
Issuing code updates in this way makes perfect sense.
Data updates?
Less so.
Please reconsider this approach.
policy
I didn't notice anything in the repo describing a policy
for what hosts go into what categories.
I only saw an ill-defined distinction between "fake / temp"
emails and "good" emails.
It would be helpful to operationalize the definition, so programs
or people could examine the same address and reach the same conclusion.
If you delegate policy decisions to others, such as 7c,
please spell that out.
A "fake" address is easily defined: zero percent SMTP delivery,
for example due to bad DNS, no {A, AAAA, MX} server address,
or a server that doesn't respond on TCP ports such as 25.
Sending a random number challenge weeds these out,
but I can see it's attractive to save the effort of
attempting a doomed delivery which will always eventually time out.
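For what it's worth, here is a minimal sketch of that DNS-level test, using PHP's built-in checkdnsrr(); the function name is mine, and a fuller check would add the TCP port 25 probe and explicit timeouts:

```php
<?php
// Sketch: a domain is "fake" in this sense when DNS offers neither an MX
// record nor an A/AAAA fallback address, i.e. zero percent SMTP delivery.
function domainLooksFake(string $domain): bool
{
    foreach (['MX', 'A', 'AAAA'] as $type) {
        if (checkdnsrr($domain, $type)) {
            return false;   // at least one delivery target exists in DNS
        }
    }
    return true;            // nothing to deliver to
}
```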
OTOH a "temp" address needs some parameters to define it:
random challenge initially worked fine but then after X time Y% of
users become unresponsive, that sort of thing.
data delivery
There are many reputable SMTP blacklist services
with good support operations and good data delivery,
including Spamhaus, SORBS, UCEPROTECTL3, Nordspam DBL, and others.
They curate blacklist files and ship them around.
But for up-to-date data delivery to end user websites,
it's typical to look up a TXT record or similar in the
Domain Name System. This has the virtues of being highly
replicated and available, offering caching, and imposing a
TTL
on records so the support process can recover from a "whoops!".
A typical failure mode is that a popular site gets 0wned by a
bad actor who exhibits bad behavior, then the site is repaired,
and it goes back to exhibiting good behavior.
Please consider sending much of your data through that or some
similar online system, so that updates become available
at client websites before those client websites
get around to performing a package software update.
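As a sketch of what the client side of that could look like, assuming you published listed domains under a zone of your own (dbl.example.org below is entirely hypothetical), a lookup is a single resolver call, and the resolver's cache honours whatever TTL you set:

```php
<?php
// Hypothetical DBL-style client: a listed domain such as "mailinator.com"
// would be published as "mailinator.com.dbl.example.org" with an A record.
// The zone name below is a placeholder, not anything that exists today.
function isListedDisposable(string $domain, string $zone = 'dbl.example.org'): bool
{
    $query = rtrim($domain, '.') . '.' . $zone;

    // Any A record means "listed"; the 127.0.0.x value (or a TXT record)
    // could encode the category and the reason, if you wanted richer data.
    $records = dns_get_record($query, DNS_A);

    return is_array($records) && $records !== [];
}
```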
The flip side of this is relevant as well.
If a new fake-email domain crops up and is an overnight success,
a great many client websites will want to know about it overnight.
```php
final class Validator
{
    ...
    private const PATTERN2 = '/^[a-zA-Z0-9!#$%&\'*+\\/=?^_`{|}~-]+(?:\.[a-zA-Z0-9!#$%&\'*+\\/=?^_`{|}~-]+)*@(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?\.)+[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?$/';
```
The code credits Yii validator, but alas does not mention why
we don't just take a package dependency on the current version of Yii.
In any event, I fear the underlying approach is deeply flawed.
Once upon a time one could validate e.g. [email protected] with such a regex.
Once upon a time all such traffic originated at ARPAnet nodes
located in the U.S., and the network map fit on a sheet of 8.5×11.
Nowadays email traffic has significantly broader reach.
I look at that regex and I think, "Wow! That is very US-ASCII centric."
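To make that concrete, here is the pattern from above turned loose on a couple of sample addresses (the addresses are mine); it rejects the internationalized one outright, before any DNS or SMTP check gets a say:

```php
<?php
// PATTERN2 copied verbatim from the class quoted above.
$pattern = '/^[a-zA-Z0-9!#$%&\'*+\\/=?^_`{|}~-]+(?:\.[a-zA-Z0-9!#$%&\'*+\\/=?^_`{|}~-]+)*@(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?\.)+[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?$/';

var_dump((bool) preg_match($pattern, 'franz@example.com'));  // true
var_dump((bool) preg_match($pattern, 'müller@bücher.de'));   // false -- turned away unseen
```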
An email address is localpart@globalpart,
where globalpart should appear in the Domain Name System.
DNS lookups are cheap, fast, and can be quickly timed out.
I have trouble seeing how such a regex wins over just doing the lookup.
And apart from enforcing length restrictions, why is "localpart"
any of your business? Send it over port 25 and see what the other
end says. If you can't transmit a random number because an SMTP
peer said "no such user", fine, you have your answer.
Insist on rapid replies, during an interactive web registration session,
if a synchronous flow works better for your UX.
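Here is a rough sketch of that port-25 conversation; the HELO name and probe addresses are placeholders, many large providers will defer or tarpit such callouts, and a production version needs careful timeout and multi-MX handling, so treat it as illustrative only:

```php
<?php
// Rough sketch of an SMTP callout: connect to the recipient's first MX host
// and see whether RCPT TO is accepted. Treat the answer as a hint, not a
// verdict; the HELO / MAIL FROM identities below are placeholders.
function smtpSaysDeliverable(string $email, int $timeout = 5): ?bool
{
    $parts = explode('@', $email, 2);
    if (count($parts) !== 2) {
        return false;                        // not even localpart@globalpart
    }
    $domain = $parts[1];

    $mx = [];
    if (!getmxrr($domain, $mx) || $mx === []) {
        $mx = [$domain];                     // RFC 5321 fallback to the A record
    }

    $fp = @fsockopen($mx[0], 25, $errno, $errstr, $timeout);
    if ($fp === false) {
        return null;                         // could not connect: no verdict
    }
    stream_set_timeout($fp, $timeout);

    $chat = static function ($fp, ?string $cmd): string {
        if ($cmd !== null) {
            fwrite($fp, $cmd . "\r\n");
        }
        do {                                 // swallow multi-line replies ("250-")
            $line = fgets($fp, 512);
        } while (is_string($line) && isset($line[3]) && $line[3] === '-');
        return is_string($line) ? $line : '';
    };

    $chat($fp, null);                        // 220 greeting
    $chat($fp, 'HELO validator.example');
    $chat($fp, 'MAIL FROM:<probe@validator.example>');
    $reply = $chat($fp, 'RCPT TO:<' . $email . '>');
    $chat($fp, 'QUIT');
    fclose($fp);

    return str_starts_with($reply, '2');     // 250 accepted, 550 no such user
}
```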
I don't know about you or your various client websites,
but I'm accustomed to seeing lots of customers in my logs
who have something besides English as their native tongue.
For both localpart and globalpart I have seen lots of
names written in Arabic, Chinese, Hindi, and many more besides.
Of course, if you're filtering such customers early in the flow,
you definitely won't see such examples in lists of paid invoices.
Please write down what portions of what Unicode code blocks
you support, as it will make a difference to maintainers
of your various client websites.
Please write down where this code fits into a stack
that may see IDN names, e.g. bücher.de
encoded as xn--bcher-kva.de.
Maybe client websites should perform such an encode / decode
so only US-ASCII globalpart is presented to this layer?
Document the details.
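If the answer is "yes, the client website does the encode", a sketch of that step with PHP's intl extension might look like this (function name mine):

```php
<?php
// Sketch: convert the global part to its ASCII (punycode) form before it
// reaches an ASCII-only validator. Requires the intl extension.
function asciiGlobalPart(string $email): ?string
{
    $at = strrpos($email, '@');
    if ($at === false) {
        return null;
    }

    $ascii = idn_to_ascii(substr($email, $at + 1), IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46);

    return $ascii === false ? null : $ascii;
}

var_dump(asciiGlobalPart('kunde@bücher.de'));   // "xn--bcher-kva.de"
```

Note this only touches the globalpart; whether a non-ASCII localpart (SMTPUTF8) is accepted is a separate decision that also deserves a line in the docs.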
memory vs disk
```php
final class UpdaterService
{
    public function execute(): void
    {
        $sourceBlacklistsDto = (new SourceBlacklistsLoader(new FileContentsLoader(), Config::SOURCE_BLACKLISTS_PATH))->load();
```
Hmmm, that looks like an inconveniently large memory footprint.
Once we get past fifty or sixty thousand records,
it's time to start thinking about databases and caches.
Some of those records will be hot, accessed frequently,
and most will be cold, never referenced at all.
Let sqlite or some other solution worry about those details for you.
Or make it the nameserver's problem, by issuing DNS queries.
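As a sketch of the sqlite route (file, table, and example domain are illustrative), a lookup becomes an indexed point query and nothing beyond one row ever sits in PHP memory:

```php
<?php
// Sketch: keep the lists on disk and let sqlite's index serve the hot lookups.
$db = new PDO('sqlite:' . __DIR__ . '/blacklists.sqlite');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$db->exec('CREATE TABLE IF NOT EXISTS domain (
               name     TEXT PRIMARY KEY,
               category TEXT NOT NULL
           )');

$stmt = $db->prepare('SELECT category FROM domain WHERE name = :name');
$stmt->execute([':name' => 'mailinator.com']);

$category = $stmt->fetchColumn();   // false when the domain is not listed
var_dump($category);
```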
I understand that this is used for batch processing,
so it's a transient memory footprint (though still a good
match for a JOIN).
But the DisposableEmailDomains getDomains() method
has the same kind of poorly scaling footprint.
And if we were a DBL DNS client, we wouldn't need to worry
about the full list on client machines anyway.
Once we accept the current approach, sure, the code looks good.
Returning error status 1 for "no changes" seems
slightly odd for a routine condition which is Not An Error,
but that's fine; the caller needs to learn about the status somehow.
Among the alternative approaches, you might promise that
the final line of stdout shall be a hash of the output file,
and then let the caller worry about comparing hashes to see
if there was a change.
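A sketch of that contract, with sha256 as an arbitrary choice and the paths illustrative: the updater always exits 0, its last stdout line is the hash, and the caller keeps yesterday's hash around to decide whether anything changed.

```php
<?php
// Updater side: after writing the output file, report its hash and exit 0;
// "no changes" is not an error. Paths and hash algorithm are illustrative.
$outputPath = __DIR__ . '/disposable_domains.txt';

// ... existing logic that (re)writes $outputPath goes here ...

echo hash_file('sha256', $outputPath), PHP_EOL;
exit(0);

// Caller side, e.g. a cron wrapper:
//   $new = trim((string) shell_exec('php updater.php'));
//   $old = (string) @file_get_contents('last_hash.txt');
//   if ($new !== $old) { /* something actually changed */ file_put_contents('last_hash.txt', $new); }
```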