Mathematically, why was the Enigma machine so hard to crack?

Mathematically, why was the Enigma machine so hard to crack?

In laymen terms, what was it exactly that made cracking the Enigma machine such a formidable task? Everything I have seen about the Enigma machine, from a general article to information about cryptanalysis of the Enigma, is quite lengthy, and it appears to be difficult to pinpoint exactly the most salient mathematical difficulty facing the codebreakers other than the sheer number of possible settings (159 million million million according to this Bletchley Park website) that changed every single day. Or was this really it? There seems to be much more to it than that.

I was hoping someone with a fair amount of knowledge about the mathematics behind the Enigma and its breaking might be able to provide a condensed, simplified reason for what made cracking this machine such a monumental undertaking.

I strongly disagree with the other answer which trivializes the contributions of Alan Turing and his group, as well as the Polish mathematicians who first worked on the problem. Of course increasing the numbers makes the problem harder, but even with a modern computer a brute force attack on the often cited 150 million million combinations (this number actually varied throughout the war and for different configurations of the Enigma machine) would be a tall order. With computers not even existing at the start of the war, there was absolutely no chance to attack the problem in a brute force way.

Therefore new methods to reduce the possible number of combinations had to be developed. One example is Banburismus, a statistical method developed by Alan Turing. This was a method which allowed excluding many possibilities even without any “cribs”, i.e., known plaintext parts of the message.

The basic idea is that, given two natural language strings, they will share many more letters in the same positions than two random strings would. This is due to the fact that certain letters are much more probable than others. For instance, an example from the linked page:

HereisthefirstmessageofapairNothingspecialjustordinaryEnglishtext
BelowitanotherAgainyoucanseethatitismerelyarandomexamplemessage
-    -                -        -  - -             -       -  -


Here we have nine matching letters or overlaps. With two random strings, we would expect an average of two or three matches for a message of this length.

A crucial insight then was that this property is preserved even if both messages are enciphered through the Enigma machine. This allowed the development of an attack which could discard many combinations on statistical grounds.

The description of the method makes it clear that the beginnings of information theory, formally established by Claude Shannon in his paper “A Mathematical Theory of Communication” in 1948, are already present in these ideas. Saying that it was just a matter of building a fast enough machine to try all combinations is an insult to these theoretical achievements.

Just for fun, here’s a bit of math on the letter frequency problem. Let’s say $$X1∈{1,2,…,26}X_1 \in \{ 1, 2, \ldots, 26 \}$$ denotes the event that a letter in the first message is A, B, …, Z, and similarly $$X2X_2$$ a letter in the second message. Since the messages are independent, so are these events. So let $$p∈R26p \in \mathbb R^{26}$$ denote the vector $$(P(X1=1),…,P(X1=26))(P(X_1 = 1), \ldots, P(X_1 = 26))$$, i.e., the letter frequencies for our language.

Then the probability that one position contains the same letter in both messages is
$$P(X1=X2)=26∑i=1P(X1=i∧X2=i)=26∑i=1P(X1=i)P(X2=i)=‖ P(X_1=X_2) = \sum_{i=1}^{26} P(X_1 = i \land X_2=i) = \sum_{i=1}^{26} P(X_1 = i) P(X_2=i) = \| p \|_{\ell_2}^2.$$
(Here we implicitly assumed that each letter is independent from the next, which is not quite true. Using digram or trigram frequencies would be more accurate.)

If our “language” is random, i.e., $$p=(\frac 1{26}, \ldots, \frac 1{26})p=(\frac 1{26}, \ldots, \frac 1{26})$$, we get that the above probability for a matching letter is $$\frac 1{26}\frac 1{26}$$.
For our 63-letter message, we would thus expect around $$63/26 \approx 2.463/26 \approx 2.4$$ matching letters.

In fact this is the lowest possible probability: since the probabilities in $$pp$$ must sum up to 1, we obtain from the Cauchy-Schwarz inequality that
$$1 = \sum_{i=1}^{26} p_i \le \sqrt{26} \|p\|_{\ell_2}, 1 = \sum_{i=1}^{26} p_i \le \sqrt{26} \|p\|_{\ell_2},$$
and hence
$$\|p\|_{\ell_2}^2 \ge \frac 1 {26}. \|p\|_{\ell_2}^2 \ge \frac 1 {26}.$$

Using the letter frequencies for English given on Wikipedia, we can compute that for the English language, we have
$$\|p\|_{\ell_2}^2 = 0.0655 > 0.0385 = \frac 1 {26}. \|p\|_{\ell_2}^2 = 0.0655 > 0.0385 = \frac 1 {26}.$$
I assume that by taking di- and trigrams into account, this number would rise even more.

Update: By downloading a text version of “War and Peace” from Project Gutenberg and massaging it in IPython, I found very similar letter distributions to the ones given on Wikipedia, resulting in $$\|p\|_{\ell_2}^2 \approx 0.0659\|p\|_{\ell_2}^2 \approx 0.0659$$.

Then I extracted pairs of random 60-letter strings from the text and counted the number of matches between them. Repeating this experiment 1,000,000 times, I found an average probability for a match of 0.0659 per letter. So it seems that the simple first order estimate $$\|p\|_{\ell_2}^2\|p\|_{\ell_2}^2$$ is actually a very good approximation to the probability of matching letters.