10/19/2010

Alex just asked me what letter is most common in the English language. I told him it was “e” (everyone knows that, right?) and following that is either “s” or “t”.

But I’m a curious kind of guy, and since I was sitting on the bed using my laptop anyway, I figured it wouldn’t be that hard to set up a series of shell commands that would do this for me. I knew how to access the spell-checking dictionary, and from there it was just a matter of chaining together the right pipes to take all of the words from the dictionary, break them down, and count the letters.

Here’s the command:

aspell -d en dump master | tr A-Z a-z | sed 's/./&\n/g' | grep '[a-z]' | sort | uniq -c | sort -nr

And the results:

127009 e
125815 s
97486 i
93927 a
82464 r
81679 n
72800 t
69631 o
60813 l
43910 c
40123 d
35710 u
31921 m
30971 g
29675 p
27042 h
22359 b
19136 y
14482 f
11675 k
11287 v
9917 w
4603 z
3016 x
2950 j
2070 q
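
For anyone who wants to see what each stage of that pipeline is doing, here is a rough sketch of the same idea run on a tiny input. It is not the exact command above: the function name letter_counts is made up for illustration, and it uses grep -o to split out the letters instead of the sed step, which assumes a grep that supports -o (GNU and BSD grep both do).

```shell
# Sketch only: same idea as the command above, wrapped in a function
# (letter_counts is a made-up name) so it can be fed any text on stdin.
letter_counts() {
  tr 'A-Z' 'a-z' |     # fold everything to lowercase
    grep -o '[a-z]' |  # print each letter on its own line, dropping the rest
    sort |             # put identical letters next to each other
    uniq -c |          # collapse each run into "count letter"
    sort -nr           # biggest counts first
}

# Tiny input: "Bee" and "see" contain four e's, one b and one s.
printf 'Bee\nsee\n' | letter_counts
```

On that sample, "e" comes out on top with a count of 4. Feeding it the dictionary dump instead would look like aspell -d en dump master | letter_counts.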

It’s not surprising that “e” was on top, with “s” right behind, but I could’ve sworn “t” was more popular than that. Upon further thought, I suspect the data is a bit skewed because this is the dictionary, where a word like “the” only appears once. In spoken or written English, “the” appears much more often than words like, say, “phlegm”.

So to be more realistic, I should use a book instead of the dictionary. Of course I have a copy of Moby Dick handy… I use it for all of the testing I do on web sites. One change to the command above gives me the count from the novel instead:

cat moby-dick.txt | tr A-Z a-z | sed 's/./&\n/g' | grep '[a-z]' | sort | uniq -c | sort -nr

The answer? Things are a bit different:

115020 e
86552 t
76491 a
68135 o
64553 n
64381 i
63105 s
61778 h
51157 r
42048 l
37656 d
26316 u
22902 m
22141 c
21774 w
20493 g
20475 f
16961 p
16602 y
16600 b
8418 v
7937 k
1544 q
1061 j
1006 x
621 z

Once again “e” trumps the rest, but now it’s “t” in second place and “s” has dropped down a bit. How odd. Even more interesting is the fact that “q” isn’t dead last: it’s 23rd, ahead of “j”, “x”, and “z”. My theory on that: there’s a character in Moby Dick named Queequeg, and the mere mention of his name probably bumps the letter up the list.
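
The Queequeg theory is easy enough to check with one more pipe. This is only a sketch: count_word is a made-up helper name, it again assumes a grep with -o, and I’m not quoting a number for the novel here. Each mention of the name contributes two q’s, so doubling the count gives the name’s share of the total.

```shell
# Sketch only: count how many times a word appears on stdin, ignoring case.
# count_word is a hypothetical helper, not part of the original command.
count_word() {
  grep -oi "$1" |  # print every match on its own line, case-insensitively
    wc -l          # count the matches
}

# Small demonstration: the name appears twice in this sample line.
printf 'Queequeg met queequeg.\n' | count_word queequeg
```

Against the book, that would be count_word queequeg < moby-dick.txt, using the same file as the command above.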

This is hardly scientific, but all in all it’s a decent test. And yes, I’m a complete geek for taking three minutes to figure out how to do this.