This month’s Social Engineer podcast discussed a tactic for identifying bad guys through their writing style, and the hosts expressed surprise that this was possible.
This won’t be news to anyone who minored in English, communications, or journalism. A lot of factors go into style: where we grew up, where our parents are from, what we read growing up, and our life experience. It really is like a fingerprint. Fitzgerald’s Gatsby called everyone “old sport,” and we all have something like that; it’s just usually more subtle. I’ll say “taste this” when my wife or mother-in-law will say “taste of this.” That’s a regional thing. I pick up on that because I’m interested in language. A really good linguist can pick up on a lot more than that, and machine learning can potentially pick up on still more.
If you recall, it was the Unabomber’s long manifesto that brought down Ted Kaczynski. Other forensics proved it, but the investigation began with his brother’s observation that the manifesto “sounded like Ted.”
Attribution is a common problem in security, but written clues certainly help. English has hundreds of thousands of words, as many as a million by some counts, so it should come as no surprise that no two people combine them in exactly the same way. Our lives influence the way we combine those words. If our parents’ native language is something other than English, it can heavily color our use of English, but the region we live in can as well. Someone from Minnesota can understand someone from Georgia, but each will find some of the things the other says peculiar. In one of my favorite essays about writing, Kurt Vonnegut observed that his own writing sounded like someone from Indianapolis, and made no apologies for it.
Given a reasonably sized corpus of known writings, it’s not hard to determine whether an anonymous or pseudonymous work was written by the same person. If the goal is just to track people on criminal forums or to detect and ban previously banned trolls, the job is pretty easy.
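To make that concrete, here is a minimal sketch of one common approach: count how often each writer uses a handful of function words, then compare the resulting profiles with cosine similarity. The word list, the function names, and the choice of cosine similarity are all my own assumptions for illustration. Real stylometry tools use far more features and far more text, so treat this as a toy, not as what any investigator actually runs.

```python
import math
import re
from collections import Counter

# Illustrative assumption: a tiny set of function words. Real systems use
# hundreds of features (word lengths, punctuation, character n-grams, etc.).
FUNCTION_WORDS = (
    "the", "of", "and", "to", "a", "in", "that", "it", "is", "was",
    "i", "for", "on", "with", "but", "not", "this", "just",
)

def style_profile(text):
    """Return the relative frequency of each function word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid dividing by zero on empty input
    return [counts[w] / total for w in FUNCTION_WORDS]

def cosine_similarity(a, b):
    """Cosine of the angle between two profiles; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_authors(unknown_text, known_corpora):
    """Score the unknown text against each known author, best match first."""
    unknown = style_profile(unknown_text)
    scores = {
        author: cosine_similarity(unknown, style_profile(text))
        for author, text in known_corpora.items()
    }
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
```

Calling rank_authors(anonymous_post, {"author_a": text_a, "author_b": text_b}) returns the known authors sorted by how closely their function-word habits match the anonymous post. With only a few sentences of text the scores are mostly noise, which is why the size of the known corpus matters so much.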
Unmasking identities is harder but possible. A linguist or a good machine learning program probably will be able to pick up that I’m from the Midwest and a member of Generation X, and it may even be able to pinpoint me to the early or late leg of Gen X. It will also almost certainly pick up that I am male, and probably that I am white. My word choices will give a good indication of my level of education, and in some cases, things like religion as well.
Human analysis can pick up more clues as well, and by the process of elimination, it’s possible to narrow the pool of candidates. How many white Protestant male computer security analysts are there who are interested in baseball and 1980s computers and mid-20th-century toy trains?
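The process of elimination is easy to picture as a series of filters, each clue shrinking the pool. The sketch below assumes a small, entirely made-up list of candidates and a hypothetical narrow() helper; nothing here reflects real data or a real tool.

```python
# Hypothetical candidate pool: every record below is invented for illustration.
CANDIDATES = [
    {"handle": "alice", "job": "security analyst",
     "interests": {"baseball", "toy trains", "1980s computers"}},
    {"handle": "bob", "job": "security analyst",
     "interests": {"baseball", "cooking"}},
    {"handle": "carol", "job": "accountant",
     "interests": {"baseball", "toy trains", "1980s computers"}},
    {"handle": "dave", "job": "security analyst",
     "interests": {"toy trains", "1980s computers"}},
]

def narrow(pool, job=None, interests=()):
    """Keep only the candidates consistent with every clue supplied."""
    required = set(interests)
    return [
        c for c in pool
        if (job is None or c["job"] == job) and required <= c["interests"]
    ]

def narrowing_steps(pool, clues):
    """Apply clues one at a time, recording how many candidates survive each."""
    sizes = [len(pool)]
    for clue in clues:
        pool = narrow(pool, **clue)
        sizes.append(len(pool))
    return pool, sizes
```

Feeding narrowing_steps a list of clues such as a job title followed by a few interests shows the pool shrinking at every step. No single clue identifies anyone, but a handful of ordinary ones combined can leave very few people standing.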
The more we write, the more we betray ourselves, and many of us write a lot more than we probably think we do. Bad opsec was what brought down Ross Ulbricht, the creator of the Silk Road underground marketplace. But even if he had used a different handle and e-mail address when he was getting started, it’s likely that the authorities eventually would have been able to match his pseudonymous Dread Pirate Roberts writings with what he wrote online under his real name.
Do I believe some FBI computer has analyzed the entire contents of this blog at some point? Yes, I do. Why? Because they’re idiots if they haven’t. There are probably a couple million words here, and I’m writing under my real name, so it absolutely makes sense to compare any random anonymous criminal’s writings to mine just in case it’s me. The same goes for any other blogger who blogs under a real name.