There is an interesting recent article, “Mathematicians shocked to find pattern in “random” prime numbers”, in New Scientist. (Don’t you love math titles in the popular press? Compare to the source paper’s Unexpected Biases in the Distribution of Consecutive Primes.)

To summarize, let p,q be consecutive primes of form a\pmod {10} and b\pmod {10}, respectively. In the paper by K. Soundararajan and R. Lemke Oliver, here is the number N (in million units) of such pairs for the first hundred million primes modulo 10,

\begin{array}{|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|}

\hline

&a&b&\color{blue}N&&a&b&\color{blue}N&&a&b&\color{blue}N&&a&b&\color{blue}N\\

\hline

&1&3&7.43&&3&7&7.04&&7&9&7.43&&9&1&7.99\\

&1&7&7.50&&3&9&7.50&&7&1&6.37&&9&3&6.37\\

&1&9&5.44&&3&1&6.01&&7&3&6.76&&9&7&6.01\\

&1&1&\color{brown}{4.62}&&3&3&\color{brown}{4.44}&&7&7&\color{brown}{4.44}&&9&9&\color{brown}{4.62}\\

\hline

\text{Total}& & &24.99&& & &24.99&& & &25.00&& & &24.99\\

\hline

\end{array}

As expected, each class a has a total of 25 million primes (after rounding). The “shocking” thing, according to the article, is that if the primes were truly random, then it is reasonable to expect that each subclass will have \color{blue}{N=25/4 = 6.25}. As the present data shows, this is apparently not the case.
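A tally of this kind can be reproduced on a much smaller scale. The following is only a minimal Python sketch, not the computation from the paper: the helper names `primes_up_to` and `pair_counts` and the cut-off of one million are made up for illustration, and the primes 2, 3 and 5 are skipped so that only the end digits 1, 3, 7, 9 occur.

```python
from collections import Counter
from math import isqrt


def primes_up_to(limit):
    """Sieve of Eratosthenes: return the list of all primes <= limit."""
    sieve = bytearray([1]) * (limit + 1)
    sieve[0:2] = b"\x00\x00"
    for i in range(2, isqrt(limit) + 1):
        if sieve[i]:
            # Cross out i*i, i*i + i, ... in one slice assignment.
            sieve[i * i :: i] = bytearray(len(range(i * i, limit + 1, i)))
    return [n for n, is_prime in enumerate(sieve) if is_prime]


def pair_counts(limit):
    """Count (p mod 10, q mod 10) over consecutive primes 5 < p < q <= limit."""
    primes = [p for p in primes_up_to(limit) if p > 5]
    return Counter((p % 10, q % 10) for p, q in zip(primes, primes[1:]))


if __name__ == "__main__":
    counts = pair_counts(1_000_000)  # illustrative cut-off, far below 10^8 primes
    total = sum(counts.values())
    digits = (1, 3, 7, 9)
    print("a\\b " + " ".join(f"{b:>7}" for b in digits))
    for a in digits:
        # Share of each pair in percent; 6.25 would be the "uniform" value.
        row = " ".join(f"{100 * counts[(a, b)] / total:7.2f}" for b in digits)
        print(f"{a:>3} {row}")
```

The output is printed in percent of all pairs, so that the uniform expectation is again 6.25 per cell and the rows can be compared directly with the table above.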

Argument: The disparity seems to make sense. For example, let p=11, so a=1. Since p,q are consecutive primes, then, of course, subsequent numbers are not chosen at random. Wouldn’t it be more likely the next prime will end in the “closer” 3 or 7 such as q=13 or q=17, rather than looping back to the same end digit, like q=31? (I’ve taken the liberty of re-arranging the table to reflect this.)

However, what is surprising is the article concludes, and I quote,

“…as the primes stretch to infinity, they do eventually shake off the pattern and give the random distribution mathematicians are used to expecting.”

Question: What is an effective way to counter the argument given above and come up with the same conclusion as in the article? (Will all the N eventually approach N\to 6.25, with the unit suitably adjusted?) Or is the conclusion based on a conjecture and may not be true?

P.S.: A more enlightening popular article is “Mathematicians Discover Prime Conspiracy”. (It turns out the same argument is mentioned there, but with a subtle way to address it.)

**Answer**

\qquad \qquad *Remark: see also [update 3] at end*

## 1. First observations

I think there is at least one artifact (=non-random) in that list of frequencies.

If we rewrite this as a “correlation”-table *(the row headers indicate the residue classes of the smaller prime p and the column headers those of the larger prime q)*:

\small \begin{array} {r|rrrr}

& 1&3&7&9 \\ \hline

1& 4.62& 7.43& 7.50& 5.44\\

3& 6.01& 4.44& 7.04& 7.50\\

7& 6.37& 6.76& 4.44& 7.43\\

9& 7.99& 6.37& 6.01& 4.62

\end{array}

then a surprising observation is surely the striking symmetry around the antidiagonal. But the asymmetric increase of the frequencies along the antidiagonal, from top-right to bottom-left, is also somewhat surprising.

However, if we look at this table in terms of **primegaps**, then

- residue-pairs (1,1), (3,3), (7,7), (9,9) (the diagonal) refer to primegaps of the lengths (10,20,30,…,10k,…), and those are the entries in the table with the lowest frequencies
- residue-pairs (1,3), (7,9) and (9,1) refer to primegaps of the lengths (2,12,22,32,…,10k+2,…), and those contain the entry with the highest frequency
- residue-pairs (3,7), (7,1) and (9,3) refer to primegaps of the lengths (4,14,24,34,…,10k+4,…)
- residue-pairs (1,7), (3,9) and (7,3) refer to primegaps of the lengths (6,16,26,36,…,10k+6,…) and have the two next-largest frequencies
- residue-pairs (1,9), (3,1) and (9,7) refer to primegaps of the lengths (8,18,28,38,…,10k+8,…)

so the different frequencies of the pairs (1,9) and (9,1), surprising at first sight, occur because one collects the gaps of (minimal) length 8 and the other those of (minimal) length 2, and the latter are much more frequent, which is completely compatible with the general distribution of primegaps. The following images show the distribution of the primegaps modulo 100 (whose greater number of residue classes should make the problem more transparent).

(I’ve left the primes smaller than 10 out of the computation):

We see the clear logarithmic decrease of the frequencies, with a small jittering disturbance over the residue classes. It is also obvious that the smaller primegaps dominate the larger ones, so that a “slot” which catches the primegaps of lengths 2,12,22,… has more occurrences than the “slot” which catches 8,18,28,…, just because of the frequencies in the very first residue class. The original table of frequencies in the residue classes modulo 10 splits this into 16 combinations of pairs of 4 residue classes, and the observed non-smoothness is due to that general jitter in the residue classes of the primegaps.

It might also be interesting to see the primegap frequencies separated into three subclasses:

That trisection shows the collected residue classes 6,12,18,… (the green line) as dominant over the two other collections, while the two other collections swap “priority” across the individual residue classes.

The modulo-10 problem overlays those curves, irons out the variation somewhat and makes it less visible, but not completely, because the general distribution of residue classes in the primegaps has such a strong dominance in the small residue classes. So I think that this general characteristic of the distribution explains the modulo-10 problem, although a bit less obviously…
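The gap-based reading of the table can be checked numerically. The sketch below is an illustration under my own assumptions, not the computation behind the images: the helper names and the cut-off are hypothetical, and grouping by g mod 6 is only my guess at how the three subclasses were formed. It tallies the primegaps g = q - p (leaving out the primes smaller than 10, as above) and then collects them by g mod 10, which equals (b - a) mod 10 for the end digits a and b.

```python
from collections import Counter
from math import isqrt


def primes_up_to(limit):
    """Sieve of Eratosthenes: return the list of all primes <= limit."""
    sieve = bytearray([1]) * (limit + 1)
    sieve[0:2] = b"\x00\x00"
    for i in range(2, isqrt(limit) + 1):
        if sieve[i]:
            sieve[i * i :: i] = bytearray(len(range(i * i, limit + 1, i)))
    return [n for n, is_prime in enumerate(sieve) if is_prime]


def gap_counts(limit, smallest=11):
    """Count gaps g = q - p over consecutive primes smallest <= p < q <= limit."""
    primes = [p for p in primes_up_to(limit) if p >= smallest]
    return Counter(q - p for p, q in zip(primes, primes[1:]))


def collect_by_residue(counts, modulus):
    """Add up the gap counts that share the same residue g mod modulus."""
    collected = Counter()
    for g, n in counts.items():
        collected[g % modulus] += n
    return collected


if __name__ == "__main__":
    counts = gap_counts(1_000_000)  # illustrative cut-off
    # Slots 0, 2, 4, 6, 8 correspond to the five groups of residue pairs.
    print("g mod 10:", sorted(collect_by_residue(counts, 10).items()))
    # Assumed trisection: slot 0 collects the gaps 6, 12, 18, ...
    print("g mod 6: ", sorted(collect_by_residue(counts, 6).items()))
```

Calling `collect_by_residue(counts, 100)` instead gives the finer slots modulo 100 that the images refer to.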

## 2. Further observations (update 2)

For further analysis of the remaining jitter in the previous image, I’ve tried to de-trend the frequency distribution of the primegaps (however, now without modulo considerations!).

Here is what I got on the basis of *5 700 000* primes and the first *75* nonzero lengths *g*. The regression formula was simply created by the Excel spreadsheet:

De-trending means computing the difference between the true frequencies \small f(g) and the estimated ones; however, the frequency residuals \small r_0(g)=f(g) - 16.015 e^{-0.068 g} decrease in absolute value as *g* grows. Heuristically, I applied a further detrending function to the residuals \small r_0(g), obtaining \small r_1(g) = r_0(g) \cdot 1.07^g, which now look much better de-trended.
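The two detrending steps can be sketched in a few lines of Python. This is a hedged illustration, not the spreadsheet itself: `fit_exponential` and `detrend` are hypothetical names, the fit is an ordinary least-squares line through ln f(g) (which, as far as I know, is how a spreadsheet exponential trendline is obtained), and the scale in which f(g) was measured is not stated above, so the input table is left to the caller.

```python
import math


def fit_exponential(freq):
    """Fit f(g) ~ a * exp(b * g) by least squares on ln f(g); return (a, b)."""
    gs = [g for g, f in freq.items() if f > 0]
    ys = [math.log(freq[g]) for g in gs]
    n = len(gs)
    mean_g = sum(gs) / n
    mean_y = sum(ys) / n
    sxx = sum((g - mean_g) ** 2 for g in gs)
    sxy = sum((g - mean_g) * (y - mean_y) for g, y in zip(gs, ys))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_g)
    return a, b


def detrend(freq, a, b, growth=1.07):
    """Return (r0, r1) with r0(g) = f(g) - a*exp(b*g) and r1(g) = r0(g) * growth**g.

    growth=1.07 is the heuristic factor quoted in the text.
    """
    r0 = {g: f - a * math.exp(b * g) for g, f in freq.items()}
    r1 = {g: r * growth ** g for g, r in r0.items()}
    return r0, r1
```

With a frequency table `freq` in the same scale as the fit quoted above, `detrend(freq, 16.015, -0.068)` would apply exactly the two formulas for r_0 and r_1; `fit_exponential(freq)` is only needed when the coefficients are to be re-estimated.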

This is the plot of the residuals \small r_1(g):

Now we see periodic occurrences of peaks in steps of *6* and even some apparent overlay. Thus I marked the small primefactors \small (3,5,7,11) of *g*, and we see a strong hint of an additive composition due to those primefactors in *g*.

The red dots mark those *g* divisible by *3*, the green dots those divisible by *5*, and we see that at those *g* which are divisible by both, the frequency is increased even further.

I’ve also tried a multiple regression using those small primefactors on the residuals, but this is still in progress…

## 3. Observations after regression/detrending (update 3)

Using multiple regression to detrend the frequencies of primegaps by their length *g* and additionally by the primefactors of *g*, I initially got a strong surviving pattern with peaks for the primefactor *5*. But those peaks can be explained by the observation that *(mod 100)* there are *40* residues of the (first) prime *p* at which a gaplength *g=0 (mod 10)* can occur, but only *30* residues at which the other gaplengths can occur.

Thus I computed the *relative (logarithmized) frequencies* as \text{fl}(g)=\ln(f(g)/m_p(g)), where f(g) is the frequency of that gaplength and m_p(g) is the number of possible residue classes of the (first) prime *p* (in the pair *(p,q)*) at which the gaplength *g* can occur.
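The counts of 40 and 30 residue classes can be verified by brute force. The following is a small Python sketch with hypothetical helper names; it assumes that a residue r of p is "possible" for a gap g exactly when both r and r + g are coprime to the modulus, which is my reading of the definition of m_p(g).

```python
from math import gcd, log


def admissible_residues(g, modulus=100):
    """Residues r mod `modulus` with both r and r + g coprime to the modulus."""
    return [
        r
        for r in range(modulus)
        if gcd(r, modulus) == 1 and gcd((r + g) % modulus, modulus) == 1
    ]


def relative_log_frequency(freq, modulus=100):
    """Return fl(g) = ln(f(g) / m_p(g)) for every gap g with f(g) > 0 and m_p(g) > 0."""
    fl = {}
    for g, f in freq.items():
        m = len(admissible_residues(g, modulus))
        if f > 0 and m > 0:
            fl[g] = log(f / m)
    return fl


if __name__ == "__main__":
    for g in (2, 4, 6, 8, 10):
        print(g, len(admissible_residues(g)))
```

For the modulus 100 this yields 40 classes for g = 10 and 30 classes for each of g = 2, 4, 6, 8, in agreement with the counts stated above.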

The first result is the following picture, where only the general decrease of the frequencies toward larger gaps is detrended:

This computation gives a residual \text{res}_0, which is the relative (logarithmized) frequency after the length of the primegap is held constant (see the equation in the picture). The regular pattern of peaks in steps of *5* in the earlier pictures is now practically removed.

However, there is still the pattern in steps of *3*, which indicates the dominance of gaplength *6*. I then tried to remove the influence of the primefactorization of *g* by using it as additional predictors: I included marker variables for the primefactors *q* from *3* to *29* in the multiple regression equation, and the following picture shows the residuals \text{res}_1(g) after the systematic influence of the primefactorization of *g* is removed.

Besides a soft, long, hill-like trend, this picture shows no systematic pattern that is visible to me and would indicate non-random influences.
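A regression of this kind can be sketched without any statistics package. The code below is my own minimal illustration, not the original computation: the predictors are an intercept, the gap length g, and one 0/1 marker per prime in `MARKER_PRIMES` (assumed here to be all odd primes from 3 to 29), and the normal equations are solved by plain Gaussian elimination. All names are hypothetical.

```python
MARKER_PRIMES = (3, 5, 7, 11, 13, 17, 19, 23, 29)  # assumed set of marker primes


def design_row(g):
    """Predictors for gap g: intercept, g itself, and one divisibility marker per prime."""
    return [1.0, float(g)] + [1.0 if g % q == 0 else 0.0 for q in MARKER_PRIMES]


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("design matrix is rank deficient")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def regress_on_markers(fl):
    """Least-squares fit of fl(g) on design_row(g); return (coefficients, residuals)."""
    gs = sorted(fl)
    rows = [design_row(g) for g in gs]
    k = len(rows[0])
    # Normal equations (X^T X) beta = X^T y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * fl[g] for r, g in zip(rows, gs)) for i in range(k)]
    beta = solve_linear_system(xtx, xty)
    residuals = {
        g: fl[g] - sum(b * x for b, x in zip(beta, r)) for g, r in zip(gs, rows)
    }
    return beta, residuals
```

Here `fl` is a table of the relative (logarithmized) frequencies by gap length, and the returned residuals play the role of \text{res}_1(g). Solving the normal equations directly is adequate for a dozen predictors; a dedicated least-squares routine would be the more robust choice for larger designs.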

*(For me this is now enough, and I’ll step out, but I am still curious whether someone else will find more.)*

**Attribution**

*Source: Link, Question Author: Tito Piezas III, Answer Author: Gottfried Helms*