Ask Al: Checking email addresses against URIBLs?

Scott writes, "I use URIBL lists in SpamAssassin, and these are configured according to their documented purpose (URI checks). I am now getting a lot of spam that spamvertises email addresses rather than URIs. The bad domain might be in the From: field, or it might be in the message body. Example: 'Contact me at this address xyz@example.com'. These domains are already on RBLs such as URIBL_BLACK. The first question is, do you think this is a valid strategy, and how 'safe' is it? I cannot see any negatives with the spam examples I have received. Second question is, how does one take advantage of URIBL etc. to validate email address strings? Google for "spamassassin rhsbl" and you get no useful information."

Wow! That's a great question, and I have absolutely no idea how to answer it. So, I turned to an expert: Steven Champeon. He maintains Enemieslist, a set of over forty thousand patterns for classifying reverse DNS naming conventions. He's a smart guy, and no stranger to using URIBL data in an unconventional manner. Here's what he had to say in response.

Interesting question, Scott. To be right up front about the answer, the URI blocklists I'm familiar with are generally strongly opposed to the use of their data for purposes outside of their stated purview -- to help you filter spam based on URIs in the body -- and their official policy is to discourage such use. That said, I chatted via IM with one of the administrators behind URIBL, and this was his response (loosely paraphrased):

"We don't encourage the use of URIBL data in looking up mailto:s; but we can't prohibit it unless it starts to show up as extraordinary lookup volumes, which could lead us to block whoever is doing the querying. In any case, I can't imagine it would be effective; we aren't adding domains from mailto:s to URIBL, so any successful results would be coincidental at best."

So, the quick answer is "please don't do that, but we can't stop you, and we'd be surprised if it worked at all."

That said, a few years ago, I implemented some rulesets for my homebrewed sendmail antispam package that look for domains in message headers and look them up in SURBL and URIBL. Specifically, the rules extract domains from the contents of the From:, Message-Id:, and Reply-To: headers, on the grounds that some spam was getting through with suspected spammer domains in those headers. Jeff Chan from SURBL, when I told him what I was doing, simply asked me to stop. The URIBL folks were less firmly opposed, but still discouraged the practice. I kept it up, and while my mail flows here aren't anywhere near high enough to cause noticeable load on their servers, it easily quadrupled my query volume without catching much spam I wasn't already catching via other means. A firm believer in the "belt and suspenders" principle, as well as having a typical systems administrator's "if it's not broken, don't touch it" mentality, I kept them in the filters. Let's just call it "data gathering for research purposes". And hey, turns out it might be useful to you in this instance, so there's that.
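The original checks were written as sendmail rulesets, which are not shown here. As a rough illustration of the same idea, the following Python sketch pulls domains out of the From:, Reply-To:, and Message-Id: headers and builds the DNS names one would query. The function names, the regular expression, and the default zone `multi.uribl.com` are assumptions made for this example, and the sketch only constructs query names; it performs no lookups.

```python
import re
from email import message_from_string

# Headers inspected, mirroring the three named in the text.
HEADERS = ("From", "Reply-To", "Message-Id")

# Hypothetical pattern: anything after "@" that looks like a dotted hostname.
DOMAIN_RE = re.compile(r"@([A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+)")


def header_domains(raw_message):
    """Return the set of lowercased domains found in the chosen headers."""
    msg = message_from_string(raw_message)
    found = set()
    for name in HEADERS:
        for value in msg.get_all(name, []):
            found.update(d.lower() for d in DOMAIN_RE.findall(str(value)))
    return found


def query_names(domains, zone="multi.uribl.com"):
    """Build the DNS names to look up (zone is an assumed default).

    Note: this does not reduce hostnames to their registered domain,
    which a real check would likely need to do first.
    """
    return sorted(f"{d}.{zone}" for d in domains)
```

Each message can yield several names, and each name is a separate DNS query, which is how a check like this multiplies query volume.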

I just checked my logs, so I can actually give you some firm data: over the past 31 days, the filter tagged 421 messages as containing domains in those headers that were also listed in URIBL. This is with a daily spam load of anywhere from three to ten thousand messages or so. *All* of those 421 were scored highly enough by other checks to be rejected after DATA. I've disabled the checks now, as they obviously don't add anything to my filtering effectiveness today, even if they may have in the past. Besides, querying hundreds of thousands of times to catch a few hundred spams I would have already rejected for other reasons isn't neighborly, never mind being a terribly useless practice.
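To put those figures in proportion, here is a small back-of-the-envelope calculation. It assumes the quoted range of three to ten thousand spam messages per day held for all 31 days, which the logs may not bear out exactly; the variable names are just for illustration.

```python
hits = 421                 # messages tagged by the header-domain check
days = 31
daily_low, daily_high = 3_000, 10_000   # quoted daily spam load range

total_low = days * daily_low     # fewest messages seen in the period
total_high = days * daily_high   # most messages seen in the period

# Share of spam that the check tagged at all, as a percentage.
hit_rate_max = 100 * hits / total_low
hit_rate_min = 100 * hits / total_high

print(f"messages seen: {total_low} to {total_high}")
print(f"tagged: {hit_rate_min:.2f}% to {hit_rate_max:.2f}%")
```

Under those assumptions the check tagged well under one percent of the spam, and since every tagged message was already rejected by other checks, the number of additional catches was zero.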

As your question was in regard to spam your system *isn't* already catching, I can certainly understand the motivation to try something like this, but in the end, I'd caution you against it for the same reasons the URIBL admin mentioned -- it will increase your query volumes with little noticeable result, because those domains aren't likely to be listed by URIBL, or if they are, it's just a coincidence.

Finally, I'll add that, as with all content filtering (as opposed to source-based filtering), any such strategy has a relatively high risk of catching stuff you *don't* want to be filtered, such as mailing list discussions or off-list exchanges about spam domains. The bottom line is that you'll likely increase your false positives more than you'll increase your spam catch. However, as you seem convinced that it will be useful to you, and will likely go ahead and try it anyway, I strongly recommend that you gather some data first.

With any such potentially risky strategy, I recommend using temporary rules in procmail, or whatever is appropriate for your setup, to emit some logging data that shows whether or not the risk is worth it to you. That way, your decision to proceed over and against the recommendations of the blocklist maintainers is at least based on a reasonably large data set, so you can make an informed decision. And remember -- they *will* block you if your query volumes exceed recommended levels, so focus on that aspect during your analysis of the data you gather.
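One way to analyze such logging data, once gathered, is sketched below in Python. This is not a procmail rule; it assumes you have already logged, per message, how many domains would have been queried, whether any was listed, whether other checks already caught the message, and whether it was actually spam. The `Observation` record and `summarize` function are hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """One logged message from a log-only trial (hypothetical record)."""
    domains_queried: int          # DNS queries the check would have issued
    listed: bool                  # at least one domain was listed
    caught_by_other_checks: bool  # existing filters already flagged it
    is_spam: bool                 # ground truth, e.g. from manual review


def summarize(observations):
    """Tally query cost against the benefit and the risk of the check."""
    obs = list(observations)
    return {
        "messages": len(obs),
        "queries": sum(o.domains_queried for o in obs),
        "hits": sum(o.listed for o in obs),
        # Spam only this check would have caught.
        "new_catches": sum(
            o.listed and o.is_spam and not o.caught_by_other_checks
            for o in obs
        ),
        # Legitimate mail the check would have flagged.
        "false_positives": sum(o.listed and not o.is_spam for o in obs),
    }
```

Comparing `queries` with `new_catches` shows what the extra lookups buy, and `false_positives` shows what they cost; the query total is also the figure to watch against the blocklist's usage limits.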

