What identifies an email message?


Today, I'm trying to help a client answer the question of how Gmail decides which tab to place a sender's email messages in. That leads to another question: how does Gmail, or any other ISP, identify the sender or the messages in question? Google, like most ISPs, does not explicitly say, but you can form some hypotheses based on observation.

So, let's theorize. I'll base this mostly on my own prior observations: I believe that ISPs identify senders primarily through some mix of these means:

  • Sending IP address,
  • Sending domain name,
  • DKIM-authenticated From domain,
  • SPF-authenticated return-path address or domain.

Most ISPs base most of this on the sending IP address. Google's Gmail leans more toward identification based on the authenticated domain where possible, primarily via DKIM and perhaps also via SPF. Other ISPs use a mix of one or more of these.
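To make those four identifiers concrete, here is a minimal Python sketch that pulls each of them out of a raw message. The sample message, the `sender_identifiers` function name, and all the example.com addresses are made up for illustration. It only reads what the headers claim; it does not verify the DKIM signature or run an SPF check, which a real ISP would do before trusting any of these values. It also only looks for an IPv4 address in the first Received header.

```python
import re
from email import message_from_string
from email.utils import parseaddr

# Hypothetical sample message; every name and address here is invented.
RAW = """\
Return-Path: <bounce@mail.example.com>
Received: from mta1.example.com (mta1.example.com [192.0.2.10])
\tby mx.receiver.test with ESMTP; Mon, 1 Jan 2024 00:00:00 +0000
DKIM-Signature: v=1; a=rsa-sha256; d=example.com; s=sel1;
\th=from:subject; bh=abc; b=xyz
From: Example Store <news@example.com>
Subject: Hello

Body
"""


def domain_of(header_value):
    """Return the lowercased domain of an address header, or None."""
    _, addr = parseaddr(header_value or "")
    if "@" not in addr:
        return None
    return addr.rpartition("@")[2].lower()


def sender_identifiers(raw):
    """Collect the claimed (unverified) sender markers from a raw message."""
    msg = message_from_string(raw)
    # First Received header only; a bracketed IPv4 literal is assumed.
    ip = re.search(r"\[(\d{1,3}(?:\.\d{1,3}){3})\]", msg.get("Received", ""))
    # The d= tag of the DKIM signature names the signing domain.
    dkim = re.search(r"\bd=([^;\s]+)", msg.get("DKIM-Signature", ""))
    return {
        "ip": ip.group(1) if ip else None,
        "from_domain": domain_of(msg.get("From")),
        "dkim_domain": dkim.group(1).lower() if dkim else None,
        "return_path_domain": domain_of(msg.get("Return-Path")),
    }


print(sender_identifiers(RAW))
```

Note that in this sample the return-path domain (mail.example.com) differs from the From and DKIM domain (example.com), which is common in practice and is one reason an ISP might weigh these markers differently.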

The next question is how Gmail identifies the content of a message to decide whether it's an update, a promotion, or something else. Again, no explicit guidance is forthcoming, but one can theorize. My theory is that Google could identify content through any or all of these:

  • A fingerprint calculated based on review of the entire content and code,
  • A fingerprint calculated based on review of the HTML code only,
  • A fingerprint calculated based on review of the text but not code,
  • Matching certain phrases or text,
  • Identifying certain domains you link to or use to host images.

In this context, fingerprinting means identifying various markers in an email message and generating a sort of score or numeric checksum based on them. It is often done to identify the same email being sent from different IPs and domains. If a spammer is rotating through domains and IP addresses, fingerprinting the content allows you to identify that the messages are probably all from the same entity, even when the primary sender markers of IP address and sending domain are highly variable. Cloudmark and the Distributed Checksum Clearinghouse (DCC) are well-known examples of systems that use fingerprinting.
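As a rough illustration of the three fingerprint flavors theorized above (whole message, HTML code only, text only), here is a small Python sketch. It is not how Gmail, Cloudmark, or DCC actually work; those systems are not public in that detail, and real fingerprinting is generally fuzzier than the exact hash used here. The `fingerprints` function and the sample messages are hypothetical.

```python
import hashlib
import re
from html.parser import HTMLParser


class _Splitter(HTMLParser):
    """Separate an HTML body into its tag sequence and its visible text."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        self.text.append(data)


def _digest(value):
    # Exact hash, shortened for readability; an assumption of this sketch.
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:16]


def fingerprints(html):
    """Return three simple fingerprints of an HTML message body."""
    parser = _Splitter()
    parser.feed(html)
    parser.close()
    # Normalize case and whitespace so trivial changes do not alter the hash.
    whole = re.sub(r"\s+", " ", html).strip().lower()
    text = re.sub(r"\s+", " ", " ".join(parser.text)).strip().lower()
    structure = ">".join(parser.tags)
    return {
        "whole": _digest(whole),
        "html_only": _digest(structure),
        "text_only": _digest(text),
    }


a = fingerprints("<html><body><h1>Big Sale</h1><p>Save 20% today</p></body></html>")
b = fingerprints("<html><body><h1>New Arrivals</h1><p>Fresh styles are here</p></body></html>")

# Same template, different copy: only the HTML-only fingerprint matches.
print(a["html_only"] == b["html_only"])
print(a["text_only"] == b["text_only"])
```

The two sample messages share a template but carry different copy, so their HTML-only fingerprints match while the text-only and whole-message fingerprints do not. That is the kind of signal that would let a filter tie messages together even when the sending IP and domain change.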

Also, content and reputation overlap here. An example: linking to a blocklisted domain can cause deliverability issues. It can drive spam folder placement or, in some cases, make an ISP block that message outright. That's a content issue, in that it concerns something referenced in the body of an email message, but it is also a reputation issue, in that it is driven by the (poor) reputation of the domain you are linking to.
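Here is a minimal Python sketch of that overlap: it collects the domains a message links to or loads images from, then compares them against a blocklist. The blocklist here is a hard-coded, hypothetical set (`HYPOTHETICAL_BLOCKLIST`), and the sample HTML is invented. A real filter would consult live reputation data rather than a static set.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit

# Stand-in for real reputation data; this domain is invented.
HYPOTHETICAL_BLOCKLIST = {"bad.example"}


class _UrlCollector(HTMLParser):
    """Collect link targets and image sources from an HTML body."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        wanted = {"a": "href", "img": "src"}.get(tag)
        for name, value in attrs:
            if name == wanted and value:
                self.urls.append(value)


def referenced_domains(html):
    """Return the set of hostnames the message links to or loads images from."""
    collector = _UrlCollector()
    collector.feed(html)
    collector.close()
    hosts = set()
    for url in collector.urls:
        try:
            host = urlsplit(url).hostname
        except ValueError:
            continue  # skip malformed URLs
        if host:
            hosts.add(host.lower())
    return hosts


SAMPLE = """
<p><a href="https://shop.example.com/sale">Shop now</a></p>
<p><a href="http://bad.example/track?id=1">Click here</a></p>
<img src="https://img.example.net/banner.png">
<a href="mailto:help@example.com">Contact us</a>
"""

domains = referenced_domains(SAMPLE)
flagged = domains & HYPOTHETICAL_BLOCKLIST
print(sorted(flagged))
```

The `mailto:` link is skipped because it has no hostname. Anything left in `flagged` is a domain whose reputation could affect the message, regardless of which IP address or domain sent it.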

Bonus: Let's talk more about Gmail tab placement. From Andrew Barrett over on his "the Email Skinny" blog: "There's definitely a way out of Gmail Promotions and into the Primary Tab (But you’re not going to like it)."

If you've got additional thoughts on how ISPs identify senders and content, feel free to weigh in via the comments. There are always things that I might not know, or haven't thought of!
