Stylometry, the study of how a person or group writes, sits at a fascinating intersection of forensics, math, and literature. I have tried writing some code for it, and while the code was fairly trivial to write, my computer simply isn't powerful enough to run it on genuinely large amounts of text. In this post, I intend to detail the techniques for stylometry that I believe are most important for determining authorship, which is stylometry's main use case.
There are two kinds of length distribution: word length and sentence length. Sentence length tends to be viewed as non-indicative of writing style, but I believe it is indicative. My own writing, for example (I'm pretty sure), tends toward fairly long sentences, and I prefer writing them that way, while other people's writing may run much shorter. Word length works a little differently: it acts essentially as a proxy for the use of function words, another important signal stylometrists look for (I intend to detail their role soon), since function words tend to be two, three, or five characters long. Anyway, the hypothesis goes that if you take the sentence length distribution over any n-character sample of an author's collected works, the frequency graph will look about the same for any sample of length n, and the same applies to word lengths.
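To make the idea concrete, here is a minimal sketch of both distributions in Python. The regex-based sentence splitter is my own naive assumption (it just breaks on ., !, and ?); real stylometry code would want a proper tokenizer.

```python
import re
from collections import Counter

def sentence_length_distribution(text):
    # Naively split into sentences on ., !, ? and count how often
    # each sentence length (in words) occurs.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return Counter(len(s.split()) for s in sentences)

def word_length_distribution(text):
    # Count how often each word length (in characters) occurs,
    # ignoring punctuation entirely.
    words = re.findall(r"[A-Za-z']+", text)
    return Counter(len(w) for w in words)
```

Comparing the resulting counters (normalized to frequencies) across two equal-sized samples is then the actual test of the hypothesis.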
N-grams are an interesting and fairly recent technique (unlike length distribution, which dates back to the late 19th century). It works by sliding a window of length n across the text, one character (or word) at a time, so that every gram is exactly n long and adjacent grams share characters (ex. the character 3-grams of “hello” are [“hel”, “ell”, “llo”]; notice the repeated letters). After splitting the text whose author you want to identify into n-grams, you then split up collected works of the possible authors, using equal amounts of text to prevent bias. You then find which author's n-grams have the most overlap with the text's, and that author is likely the author of the text (of course, this isn't perfect).
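Here is a quick sketch of the matching step in Python. The plain set-overlap score below is only one simple way to compare n-gram profiles (frequency-based distances are also common), and the `candidates`/`corpora` names in the usage comment are hypothetical.

```python
def char_ngrams(text, n):
    # Slide a window of width n across the text one character at a time,
    # so every gram is exactly n characters and adjacent grams overlap.
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def ngram_overlap(sample, corpus, n=3):
    # Crude similarity: the fraction of the sample's distinct n-grams
    # that also appear somewhere in a candidate author's corpus.
    sample_grams = set(char_ngrams(sample, n))
    corpus_grams = set(char_ngrams(corpus, n))
    return len(sample_grams & corpus_grams) / len(sample_grams)

# Attribute the disputed text to whichever candidate scores highest:
# best = max(candidates, key=lambda a: ngram_overlap(disputed, corpora[a]))
```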
Word frequency is a crucial part of stylometry, as certain authors are much, much more likely to use certain words than other authors. The Federalist Papers are the classic example: one of the reasons early stylometrists attributed many of the disputed papers to Madison, as opposed to Hamilton, was that Hamilton used “upon” far more often, more than ten times as often, and that didn't line up with the disputed papers, so it was concluded that Hamilton wasn't the one who wrote them; rather, it was Madison. As a technique, word frequency is of course pretty simple: you just count the frequency of a word, normally expressed per thousand or so words. The same counting can be applied to special characters, such as semicolons or commas, which some authors use more or less than others. Counting special characters is an especially good method for author identification: you look for the author who uses a similar amount of them per thousand words.
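In code, the per-thousand-words count looks roughly like this; `hamilton_text` and `madison_text` in the usage comment are hypothetical variables standing in for the actual corpora.

```python
import re

def rate_per_thousand(text, target):
    # Occurrences of `target` per thousand words. Alphabetic targets are
    # matched as whole words; punctuation like ';' is counted as raw
    # characters instead.
    words = re.findall(r"[A-Za-z']+", text.lower())
    count = words.count(target.lower()) if target.isalpha() else text.count(target)
    return 1000 * count / len(words)

# e.g. compare rate_per_thousand(hamilton_text, "upon")
#      against rate_per_thousand(madison_text, "upon")
```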
The last group of techniques encompasses a lot of smaller measurements that don't need much explaining. The first is sentence length standard deviation, which is a very author-specific thing, as some authors vary their sentence lengths over a given text more than others. Another measurement, closely related to both the standard deviation and the sentence length distribution, is mean sentence length. A final method, which in my opinion is not that accurate for pinpointing an author, is readability: several algorithms (e.g. the Flesch reading-ease score) reduce a given sample of text to a score in a limited range (e.g. 0-100), and the hypothesis is that a given author tends to maintain a fairly constant readability score.
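Here's a compact sketch of those three measurements. The Flesch reading-ease formula itself is standard (206.835 - 1.015*(words per sentence) - 84.6*(syllables per word)), but my syllable counter is a crude vowel-run heuristic; real implementations use pronunciation dictionaries or better heuristics.

```python
import re
from statistics import mean, stdev

def sentence_word_counts(text):
    # Word count of each sentence, using the same naive splitter as before.
    return [len(s.split()) for s in re.split(r"[.!?]+", text) if s.strip()]

def sentence_stats(text):
    # Mean sentence length and its spread (standard deviation).
    counts = sentence_word_counts(text)
    return mean(counts), stdev(counts)

def naive_syllables(word):
    # Crude syllable estimate: count runs of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    # Flesch reading ease: higher scores mean easier text, roughly 0-100.
    words = re.findall(r"[A-Za-z']+", text)
    n_sentences = len(sentence_word_counts(text)) or 1
    n_syllables = sum(naive_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / n_sentences)
            - 84.6 * (n_syllables / len(words)))
```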
Ngl, this write-up was a little rushed, but finding everything I did in the web of noise around this topic was really difficult.