It dawned on me today that I haven't been logging the recipient addresses identified in the spam messages I'm cataloging and reporting data on. I think it'd be a good idea to expand my data set sideways and start adding that info, as spot checking the data has been quite insightful. I've found, for example, that spammers are dumb enough to harvest from Google Groups, because I have a fair number of recipient addresses with “...” in them, indicating they were truncated versions of real addresses I used when posting to newsgroups years ago. Then there's lots of spam sent directly to those newsgroup-harvested addresses, spam to addresses obviously harvested from the web, spam hitting abused co-reg addresses, and god knows what else sent to once-valid but long-dead user addresses.
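Here's a rough sketch of how that spot check could be automated once recipient addresses are being logged. It is only an illustration: the function names and the sample addresses are made up, and it assumes the recipient addresses are already available as plain strings.

```python
from collections import Counter

def looks_truncated(address):
    """Return True if the username part contains an ellipsis, which suggests
    the address was scraped from a page that displays truncated addresses."""
    local_part = address.split("@", 1)[0]
    return "..." in local_part or "\u2026" in local_part

def count_truncated(addresses):
    """Tally how many recipient addresses look truncated versus intact."""
    tally = Counter()
    for address in addresses:
        key = "truncated" if looks_truncated(address) else "intact"
        tally[key] += 1
    return tally

# Hypothetical sample data, not real recipient addresses.
sample = ["al...@example.com", "alice@example.com", "bo...@example.org"]
tally = count_truncated(sample)
```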
There's one alias that is getting just a metric ton of spam, and the construction of the username portion makes it clear to me that it was an alias I gave to somebody and they misused it, or somehow leaked it to some real bad dudes. I wish I could remember who I gave the address to – but that info is stored on a drive pulled from my old unix server when I moved to Chicago. I'm dying to know which random bad actor is responsible for that bit o' feed, because the mail it's getting is so far from CAN-SPAM compliant that it's not even funny.
Even though I'm getting more than six thousand spams a day, I've only been tracking an average of 2200 a day for the past forty-one days. At first I had to do a lot of manual review of the spam to ensure that it wasn't accidental ham; there was a fair amount of that to be weeded out. It was easily weeded out and rules were put in place to help keep it out, but doing so took time, and I couldn't run the whole spamtrap feed through the measuring stick until I reviewed it all.
Now that this is out of the way, the only things holding me back here and there are software bugs and/or server issues. Occasionally the drive on the server handling this mail fills up, so I had to do a lot of fancy coding around that, to make stuff sit and pause and wait for the disk usage to come back down. That's no fun. But now that I'm able to work around it, I should start consistently logging data about at least five thousand spams each day.
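For the curious, the "sit and pause and wait" idea can be sketched roughly like this. This is not my actual code, just a minimal illustration: the 90% threshold, the polling interval, and the function names are all assumptions. The only real library call is `shutil.disk_usage` from the Python standard library.

```python
import shutil
import time

def used_fraction(path="/"):
    """Return the fraction of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total

def wait_for_disk(path="/", threshold=0.90, poll_seconds=60,
                  probe=used_fraction, sleep=time.sleep, max_checks=None):
    """Block until disk usage drops below `threshold`.

    `probe` and `sleep` are injectable so the loop can be exercised without
    a real full disk. Returns True once usage is below the threshold, or
    False if `max_checks` over-threshold readings were seen first.
    """
    checks = 0
    while probe(path) >= threshold:
        checks += 1
        if max_checks is not None and checks >= max_checks:
            return False
        sleep(poll_seconds)
    return True
```

The processing script would call `wait_for_disk()` before writing each batch, so it stalls instead of crashing when the drive fills up.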
Here are some random statistics for you. I recently added Gmail bulk foldering to my spam results, and so far I'm seeing that Gmail is only 88.8% effective against my spam feed. Meaning, 11.2% of spam I receive is not going to the spam folder in Gmail. Of the 92,730 spam messages I've tracked so far, over the past forty-one days, they have come my way from 68,516 unique IP addresses, and 58,022 unique /24 blocks.
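If you want to compute the same kind of numbers on your own feed, here's a minimal sketch using Python's standard `ipaddress` module. The function names and the sample inputs are hypothetical, and it assumes you already have one sending IP per message; IPv6 addresses are counted as unique IPs but skipped for the /24 tally.

```python
import ipaddress

def summarize_sources(ips):
    """Return (unique IP count, unique IPv4 /24 block count)."""
    unique_ips = set()
    unique_24s = set()
    for raw in ips:
        addr = ipaddress.ip_address(raw)
        unique_ips.add(addr)
        if addr.version == 4:
            # strict=False masks off the host bits: 192.0.2.77 -> 192.0.2.0/24
            unique_24s.add(ipaddress.ip_network(f"{addr}/24", strict=False))
    return len(unique_ips), len(unique_24s)

def filter_rate(total, caught):
    """Percentage of messages that landed in the spam folder."""
    return round(100.0 * caught / total, 1)

# Hypothetical sample using documentation-only address ranges.
sample_ips = ["192.0.2.1", "192.0.2.200", "198.51.100.7", "192.0.2.1"]
ip_count, block_count = summarize_sources(sample_ips)
```

With the sample above, two of the three distinct addresses share a /24, so the block count comes out lower than the IP count, the same pattern as in my real numbers.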
Just yesterday it dawned on me that I should start tracking domains used in spam. I decided to focus on From lines, and log unique From domains that actually exist. Just since I turned it on, I've tracked over 5,500 unique domains. I have a few ideas of neat things I can do with this data, after I compile enough of it, but I'm not sharing any of those secrets quite yet.
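A rough sketch of that kind of From-domain logging might look like the following. It is an illustration under assumptions, not my actual setup: the function names are made up, and "actually exists" is approximated here by a plain address lookup with `socket.getaddrinfo`, which would miss a domain that only publishes MX records. Swap in whatever existence check you prefer.

```python
import socket
from email import message_from_string
from email.utils import parseaddr

def from_domain(raw_message):
    """Pull the lowercased domain out of a message's From header, or None."""
    msg = message_from_string(raw_message)
    _, address = parseaddr(str(msg.get("From", "")))
    if "@" not in address:
        return None
    return address.rsplit("@", 1)[1].lower().rstrip(".") or None

def resolves(domain):
    """Assumed existence test: does the name resolve to any address?"""
    try:
        socket.getaddrinfo(domain, None)
        return True
    except (socket.gaierror, UnicodeError):
        return False

def log_domains(messages, seen, exists=resolves):
    """Add each new From domain that passes `exists` to the `seen` set."""
    for raw in messages:
        domain = from_domain(raw)
        if domain and domain not in seen and exists(domain):
            seen.add(domain)
    return seen
```

Checking `domain not in seen` before calling `exists` keeps the number of lookups down, since the same domains show up over and over in a spam feed.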
What I will share, though, is information showing which IP addresses and netblocks actually send me the most spam. It'll be interesting to see how it compares to what other people are seeing on their own mail streams. Look for that soon!