In these notes by Terence Tao is a proof of Stirling’s formula. I really like most of it, but at a crucial step he uses the integral identity

n! = \int_{0}^{\infty} t^n e^{-t} dt

, coming from the Gamma function. I have a mathematical confession to make: I have never “grokked” this identity. Why should I expect the integral on the right to give me the number of elements in the symmetric group on n letters?

(It’s not that I don’t know how to prove it. It’s quite fun to prove; my favorite proof observes that it is equivalent to the integral identity \int_{0}^{\infty} e^{(x-1)t} dt = \frac{1}{1 – x}. But if someone were to ask me, “Yes, but why,

really?” I would have no idea what to say.)So what are more intuitive ways of thinking about this identity? Is there a probabilistic interpretation? What kind of random variable has probability density function \frac{t^n}{n!} e^{-t}? (What does this all have to do with Tate’s thesis?)

As a rough measure of what I’m looking for, your answer should make it obvious that t^n e^{-t} attains its maximum at t = n.

Edit: The kind of explanation I’m looking for, as I described in the comments, is similar to this explanation of the beta integral.

**Answer**

I haven’t quite got this straight yet, but I think one way to go is to think about choosing points at random from the positive reals. This answer is going to be rather longer than it really needs to be, because I’m thinking about this in a few (closely related) ways, which probably aren’t all necessary, and you can decide to reject the uninteresting parts and keep anything of value. Very roughly, the idea is that if you “randomly” choose points from the positive reals and arrange them in increasing order, then the probability that the (n+1)^\text{th} point is in a small interval (t,t+dt) is a product of probabilities of independent events, n factors of t for choosing n points in the interval [0,t], one factor of e^{-t} as all the other points are in [t,\infty), one factor of dt for choosing the point in (t,t+dt), and a denominator of n! coming from the reordering. At least, as an exercise in making a simple problem much harder, here it goes…

I’ll start with a bit of theory before trying to describe intuitively why the probability density \dfrac{t^n}{n!}e^{-t} pops out.

We can look at the homogeneous Poisson process (with rate parameter 1). One way to think of this is to take a sequence on independent exponentially distributed random variables with rate parameter 1, S_1,S_2,\ldots, and set T_n=S_1+\cdots+S_n. As has been commented on already, T_{n+1} has the probability density function \dfrac{t^n}{n!}e^{-t}. I’m going to avoid proving this immediately though, as it would just reduce to manipulating some integrals. Then, the Poisson process X(t) counts the number of times T_i lying in the interval [0,t].

We can also look at Poisson point processes (aka, Poisson random measures, but that Wikipedia page is very poor). This is just makes rigorous the idea of randomly choosing unordered sets of points from a sigma-finite measure space (E,\mathcal{E},\mu). Technically, it can be defined as a set of nonnegative integer-valued random variables \{N(A)\colon A\in\mathcal{E}\} counting the number of points chosen from each subset A, such that N(A) has the Poisson distribution of rate \mu(A) and N(A_1),N(A_2),\ldots are independent for pairwise disjoint sets A_1,A_2,\ldots. By definition, this satisfies

\begin{array}{}\mathbb{P}(N(A)=n)=\dfrac{\mu(A)^n}{n!}e^{-\mu(A)}.&&(1)\end{array}

The points T_1,T_2,\ldots above defining the homogeneous Poisson process also define a Poisson random measure with respect to the Lebesgue measure (\mathbb{R}\_+,{\cal B},\lambda). Once you forget about the order in which they were defined and just regard them as a random set that is, which I think is the source of the n!. If you think about the probability of T_{n+1} being in a small interval (t,t+\delta t) then this is just the same as having N([0,t])=n and N((t,t+\delta t))=1, which has probability \dfrac{t^n}{n!}e^{-t}\delta t.

So, how can we choose points at random so that each small set \delta A has probability \mu(\delta A) of containing a point, and why does (1) pop out? I’m imagining a hopeless darts player randomly throwing darts about and, purely by luck, hitting the board with some of them. Consider throwing a very large number N\gg1 of darts, independently, so that each one only has probability \mu(A)/N of hitting the set, and is distributed according to the probability distribution \mu/\mu(A). This is consistent, at least, if you think about the probability of hitting a subset B\subseteq A. The probability of missing with all of them is (1-\mu(A)/N)^N=e^{-\mu(A)}. This is a multiplicative function due to independence of the number hitting disjoint sets. To get the probability of one dart hitting the set, multiply by \mu(A) (one factor of \mu(A)/N for each individual dart, multiplied by N because there are N of them). For n darts, we multiply by \mu(A) n times, for picking n darts to hit, then divide by n! because we have over-counted the subsets of size n by this factor (due to counting all n! ways of ordering them). This gives (1). I think this argument can probably be cleaned up a bit.

Getting back to choosing points randomly on the positive reals, this gives a probability of \dfrac{t^n}{n!}e^{-t}dt of picking n in the interval [0,t] and one in (t,t+dt). If we sort them in order as T_1\lt T_2\lt\cdots then \mathbb{P}(T_1\gt t)=e^{-t}, so it is exponentially distributed. Conditional on this, T_2,T_3,\ldots are chosen randomly from [T_1,\infty), so we see that the differences T_{i+1}-T_{i} are independent and identically distributed.

Why is \dfrac{t^n}{n!}e^{-t} maximized at t=n? I’m not sure why the mode should be a simple property of a distribution. It doesn’t even exist except for unimodal distributions. As T_{n+1} is the sum of n+1 IID random variables of mean one, the law of large numbers suggests that it should be peaked approximately around n. The central limit theorem goes further, and gives \dfrac{t^n}{n!}e^{-t}\approx\dfrac{1}{\sqrt{2\pi n}}e^{-(t-n)^2/{2n}}. Stirling’s formula is just this evaluated at t=n.

What’s this to do with Tate’s thesis? I don’t know, and I haven’t read it (but intend to), but have a vague idea of what it’s about. If there is anything to do with it, maybe it is something to do with the fact that we are relating the sums of independent random variables S_1+\cdots+S_n distributed with respect to the Haar measure on the multiplicative group \mathbb{R}_+ (edit: oops, that’s not true, the multiplicative Haar measure has cumulative distribution given by \log, not \exp) with randomly chosen sets according to the Haar measure on the additive group \mathbb{R}.

**Attribution***Source : Link , Question Author : Qiaochu Yuan , Answer Author : BAI*