How long is that word? As long as it needs to be.

Mar 14 2011 Published by scicurious under Synaptic Misfires

He studies too much for words of four syllables

-Jane Austen, Pride and Prejudice

"No"
"the"
"of"
"to"
"and"
"a"
"Yes"
"How"
"LOL"
"F**K"

What do all of these words have in common? They are all among the 500 most frequently used words in the English language (ok, two of them aren't, but I bet that on the internet, they are). Notice something about these words? They are all really SHORT. Most of the big 500 are; the majority are only one syllable. This is not surprising: the most frequently used words in many languages are short. It's pretty obvious when you think about it, and it applies to most languages (ok, maybe not German, where the word for "speed limit" is "Geschwindigkeitsbegrenzung". They've got plenty of other words of similar length. A mouthful? Sure. But super fun to say!).

When you know that short words are used more frequently, the next logical step is to hypothesize that word length is DETERMINED by how frequently a word is used. If you're going to use a word to mean you agree with something, and you're going to use it a lot, it doesn't make much sense for that word to be 15 letters long, especially when you can say "yes", "oui", "ja", or "si" (I'd include an example from a non-phonetic alphabet too, but I don't think WordPress can do that in text...). The hypothesis that word length is determined by how frequently the word is used in a language was laid out by a guy named Zipf, who observed that the length of a word is inversely correlated with how often it's used.
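
If you want to see the shape of Zipf's observation, here's a quick back-of-the-envelope sketch in Python. The words and frequency counts are completely made up for illustration (they're not from any real corpus), but the same pattern shows up in real frequency lists: length and frequency run in opposite directions.

```python
# A toy check of Zipf's observation: frequent words tend to be short.
# The frequency counts below are invented, not real corpus data.
from math import log
from statistics import correlation  # Python 3.10+

lexicon = [
    ("the", 60000), ("of", 36000), ("and", 28000), ("a", 22000),
    ("yes", 300), ("criminal", 40), ("notwithstanding", 2),
]

lengths = [len(word) for word, freq in lexicon]
log_freqs = [log(freq) for word, freq in lexicon]

# Prints a strongly negative number: the more frequent the word, the shorter it is.
print(correlation(lengths, log_freqs))
```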

The idea behind this is straight-up efficiency. Information is conveyed most efficiently when it's conveyed in the shortest way possible. That means short words.

This hypothesis has stood for the past 75 years or so, but there have been some problems. And now, there's a new hypothesis: what if the length of words is correlated with their INFORMATION content?

Piantadosi, Tily, Gibson. "Word lengths are optimized for efficient communication" PNAS, 2011.

Zipf's idea, that the frequency of word usage determines how long a word is, does work in some respects. But the problem is that the frequency of word usage depends heavily on context. The words you choose on the internet may be very different from the words you choose in a business meeting, where longer words may be used with great frequency (especially compared to "LOL", which many people may argue is not a word, but hey, it's in the OED). So if context changes word choice, but you still want to keep communication maximally efficient, what determines whether words are short or long?

The scientists behind today's paper hypothesize that the INFORMATION conveyed by a word determines its length. This differs from Zipf in that it doesn't depend on frequency. If Zipf's hypothesis were true, the more we used a word (like 'supercalifragalisticexpialidocious'), the shorter it would become, as frequency of use would determine its eventual length. But under the hypothesis that information content determines length, the more information a word conveys, the longer it's allowed to be. That means you can keep your long words long if the information they convey is essential.

The other big way this differs from Zipf is the rate of information communication. The hypothesis that information content determines word length keeps the rate of information flow during conversation as constant as possible. For example, if you use the word "a", you're conveying relatively little information, and it's also a very fast word to say. If you use the word "criminal", you convey more information, but it takes more time to do so. The hypothesis described in this paper not only ties word length to information, it also keeps the flow of information relatively constant. Long words with lots of information take a long time, and short words with little information take a short time. The net result is that the flow of information remains pretty constant over time, whether you're using long words or short ones.
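
To get a feel for what a constant flow of information looks like, here's a toy sketch. The in-context probabilities are invented out of thin air; the point is just that when rare (high-surprisal) words are also long, the bits-per-letter rate comes out roughly flat.

```python
# Toy illustration of a roughly constant information rate.
# The in-context probabilities here are invented for illustration.
from math import log2

examples = {"a": 0.5, "the": 0.13, "criminal": 0.004, "misdemeanor": 0.0005}

for word, p in examples.items():
    surprisal = -log2(p)          # information the word conveys, in bits
    rate = surprisal / len(word)  # bits per letter, a crude stand-in for bits per unit time
    print(f"{word}: {surprisal:.1f} bits, {rate:.2f} bits/letter")
```

Each word lands close to one bit per letter, even though the words themselves differ a lot in length.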

So the question now is: does the hypothesis WORK? Does the information contained in a word really predict its length?

Edited to add: a note on the methods, replicated from my comment below:

They say they used “an unsmoothed N-gram model trained on data from Google”. So basically they took frequently used strings of words in the Google datasets for each language, sequences of 2, 3, or 4 words. They compared them to the OPUS corpus (I looked it up, and it's an archive of open source documents) to take out nonsense words. They used the most frequent words in the dataset because information content can be estimated reliably only for high-frequency words (so even the words they worked with that were used less frequently were still used more frequently than many other words). They then used a mathematical model to estimate the amount of information contained in each word.

Here's the mathematical model:

I(w) = -Σ_c P(c|w) log P(w|c)

Where C is context, W is word, and the joint distribution of a word in a context is P(C, W). The -log P(w|c) part is the information the word carries in one particular context, and the sum averages that over all the contexts the word appears in. This corresponds to the expected information content of a random word in a large trove of words. At least, that's how I understand it.
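
For the curious, here's a minimal sketch of how that quantity could be computed from bigram counts (that's an N-gram model with N = 2). The mini-corpus is a stand-in I made up; the authors used enormous Google datasets, so don't read anything into the actual numbers this prints.

```python
# A minimal sketch of the measure under a bigram (N = 2) model:
#   info(w) = -sum over contexts c of P(c|w) * log2 P(w|c)
# The mini-corpus below is a stand-in for the Google N-gram data.
from collections import Counter
from math import log2

corpus = ("the cat sat on the mat the cat ate the criminal "
          "a criminal fled the mat").split()

bigrams = Counter(zip(corpus, corpus[1:]))  # (context word, target word) counts
context_totals = Counter(corpus[:-1])       # how often each word serves as a context
word_totals = Counter(corpus[1:])           # how often each word appears as a target

def info_content(word):
    """Average surprisal of `word` across its one-word contexts, in bits."""
    total = 0.0
    for (c, w), n in bigrams.items():
        if w == word:
            p_c_given_w = n / word_totals[word]  # P(c|w)
            p_w_given_c = n / context_totals[c]  # P(w|c)
            total -= p_c_given_w * log2(p_w_given_c)
    return total

for w in ["the", "cat", "criminal"]:
    print(w, round(info_content(w), 2), "bits")
```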

Looks pretty good so far. Here you can see a correlation between the length of words in the English language and the amount of information the words are thought to convey, and the correlation is very nicely in favor of longer words conveying more information.

Sure that works fine for English, but what about other languages?

Here you can see correlation values for information value and word length (the solid bars), and frequency of usage and word length (the hashed bars) for 11 different languages. I am not sure what the n=2, n=3, and n=4 stand for; if anyone knows, please drop it in the comments!! But what you can see overall is that there is a correlation between information value and word length in all of these languages (the lowest correlation overall appears to be Romanian, and I'm not sure why that is). There is a much lower correlation between frequency of usage and word length (except in Italian). But when they compared the two correlations to each OTHER, they found that the frequency of word usage is also correlated with information content. When you need to convey a lot of information frequently, you're going to use the long words.
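
If you want to play with that kind of comparison yourself, here's roughly what it looks like in code, using SciPy's Spearman rank correlation as a stand-in for the paper's statistics. Every number below is a placeholder I invented; none of it is the paper's data.

```python
# Comparing length-vs-information against length-vs-frequency.
# All values are invented placeholders, not the paper's data.
from scipy.stats import spearmanr

lengths = [1, 3, 3, 5, 8, 11, 15]                 # word lengths, in letters
info    = [0.8, 2.5, 2.0, 5.5, 8.1, 10.0, 9.4]    # estimated bits per word (hypothetical)
freqs   = [60000, 3000, 28000, 900, 15, 40, 300]  # uses per million words (hypothetical)

rho_info, _ = spearmanr(lengths, info)
rho_freq, _ = spearmanr(lengths, freqs)
print(f"length vs. information content: rho = {rho_info:.2f}")
print(f"length vs. frequency of use:    rho = {rho_freq:.2f}")
```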

So far it looks pretty good: information value and word length are well correlated. The only place where this breaks down is in words that mean relatively little and are used extremely frequently. The top 5-20% most frequently used words show little correlation between word length and information content, meaning that some short words convey a lot of information, while long words may contain less.

They conclude that the information content of words correlates with their length, and inversely correlates with their frequency of usage. But this DOES break down for the most frequently used words, and so Zipf's hypothesis is not entirely incorrect. I imagine that the REAL determinant of word length is probably informative value (more information promoting longer words), modulated by frequency of use (frequent use promoting shorter words for more efficiency). So it is possible to have high-information short words IF those words are used extremely frequently. With words that are used less often, there's less pressure to shorten them up, and thus their length correlates more with their informative value. The net result: the most efficient use of language, taking into account frequency, informative value, and word length. Language use gets less efficient as you reach for less frequent words. So for maximal communication? Small words.

But I'm still waiting for an explanation of "Geschwindigkeitsbegrenzung".

Piantadosi ST, Tily H, & Gibson E (2011). From the Cover: Word lengths are optimized for efficient communication. Proceedings of the National Academy of Sciences of the United States of America, 108 (9), 3526-9 PMID: 21278332

11 responses so far

  • Matt Hall says:

    Interesting post. Always fun to read about a new subject. I love how you can take any aspect of anything at all, and find that there is a rich and active vein of research behind it.

    I found the paper here at MIT.

    Disclaimer: I don't know anything about this field of study. But my reading of this paper leaves me a bit puzzled. They seem to define 'information content' in terms of word frequency: -log P(W|C), where P is probability of finding Words in a given Context. This seems to beg the question; I must be missing something.

    The N=2, 3, etc., refer to N-gram models, 'standard quantitative probabilistic models' as the authors put it. These are short sequences of words. Read more in Wikipedia.

    • Craig says:

      It's not entirely circular; or, at least, not unjustifiably circular.

      The information value is the negative log probability of the word appearing in that context. So, if a word is easy to predict from the preceding words, then it doesn't add much extra information to the sentence; chop the word out and you could still probably guess what was meant. But if the word is highly surprising given its context, then it does substantially alter the meaning of the sentence; it's had more of an impact on the overall information in the communication.

      It's not perfect, but it seems an adequate approximation to me.

  • Jacques says:

    How did they measure information content?

    • I had the same question. I hope they found an objective way to do this...

    • scicurious says:

      They say they used "an unsmoothed N-gram model trained on data from Google". So basically they took frequently used strings in the Google datasets for each language, strings of 2, 3, or 4 letters. They compared them to the OPUS Corpus (I don't know what that is) to take out nonsense words. They used the most frequent words in the dataset because information content can be estimated reliably only for high-frequency words (so even the words they worked with that were used less frequently were still used more frequently than many other words). They then used a mathematical model to estimate the amount of information contained in each word, which I can't replicate here, but will post as an image in the post.

      At least, that's how I understand it. This is WAY out of my depth.

  • Jason Dick says:

    Oh, I see! Yes, the definition of information content is, I think, the key here, and it is directly tied to the understanding of the N number in those plots. From the text:

    "We chose to approximate P(W |C) by using a standard quantitative probabilistic model, an N-gram model, which treats C as consisting only of the local linguistic context containing the previous N − 1 words."

    So it sounds like they just judge how much information a word conveys by looking at the word before it. If a particular word always follows some previous set of words, then that word provides the reader (or listener) with no additional information: the previous two words are enough. So the N just refers to the number of words in a sequence they look at.

  • Zuska says:

    If Zipf’s hypothesis were true, the more we used a word (like ‘supercalifragalisticexpialidocious’), the shorter it would become, as frequency of use would determine its eventual length. But under the hypothesis that information content determines length, the more information a word conveys, the longer it’s allowed to be. That means you can keep your long words long if the information they convey is essential.

    This provides a possible explanation for something I noticed that really bothered me: the transition from "Guantanamo Bay" to "GitMo", at least among a certain subset of people who were talking about it on tv, in print, on blogs, and in personal conversations. Initially, the full name was used because it was unfamiliar, but over time it was shortened to a "nickname" as it was used more frequently. The shortening also served to decrease the amount of information conveyed by the name. Guantanamo Bay sounds like someplace serious where something might be happening that is essential to attend to. GitMo sounds sorta clever and light, like one of those celebrity couple hybrid names, and what were we talking about? Look, something shiny!

  • L3viathan says:

    "Geschwindigkeitsbegrenzung": The nice (or strange) thing in german is, that you can basicly take any noun and attach it to any other.
    "Donaudampfschiffahrtskapitänsmütze" means the cap of the captain of the steam ship on the Donau.

    "Geschwindigkeitsbegrenzung" is "die Begrenzung der Geschwindigkeit": the limit of speed.
    "Begrenzung": "Grenze" means border. The syllable "be" can do different things with a word, like "besetzen" = occupy and "setzen" = "sit".
    "Geschwindigkeit" = fast means "schnell" in modern german, but there is also another word called "geschwind".
    "keit" means more or less "ness": "Müdigkeit" = "tiredness" ("müde" = tired)
    so Geschwindigkeit means more or less "speedyness"

    And that's how we come to "Geschwindigkeitsbegrenzung".

  • Stecki says:

    Sorry to be a little obnoxious, but there is a typo: It’s supercalifragilisticexpialidocious, not supercalifragalisticexpialidocious… 😉

  • Micha says:

    In large parts of Germany there is traditionally no Geschwindigkeitsbegrenzung; you can just go as fast as you want. It's Zipf's law live in action.
