Let $S$ be a smooth oriented surface in $\mathbb R^3$ with boundary $C$, and let $f: \mathbb R^3 \to \mathbb R^3$ be a continuously differentiable vector field on $\mathbb R^3$.
Stokes’s theorem states that
$$
\int_C f \cdot dr = \int_S (\nabla \times f) \cdot dA.
$$
In other words, the line integral of $f$ over the curve $C$ is equal to the integral of the curl of $f$ over the surface $S$.
Here the orientation of the boundary $C$ is induced by the orientation of $S$.Question: How might somebody have derived or discovered this formula? Where does this formula come from?
The goal is to provide an intuitive explanation of Stokes’s theorem, rather than a rigorous proof.
(I’ll post an answer below.)
Answer
Here’s an intuitive way to discover Stokes’s theorem.
Imagine chopping up the surface $S$ into tiny pieces such that each tiny piece is a parallelogram (or at least, each tiny piece is approximately a parallelogram).
Let $C_i$ be the boundary of the $i$th tiny parallelogram. I’ll assume each $C_i$ has the orientation induced by the orientation of $S$. Notice that
$$
\tag{1} \sum_i \int_{C_i} f \cdot dr = \int_C f \cdot dr.
$$
This is because the sum on the left “telescopes”. Everything in the middle cancels out and we are left only with boundary terms. This beautiful step in the derivation is reminiscent of the telescoping sum that appears when deriving the fundamental theorem of calculus in single variable calculus.
To complete our derivation of Stokes’s theorem, we must compute the integral of $f$ around the boundary of a tiny parallelogram. Below is a picture of one single tiny parallelogram which is based at a point $x = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} \in \mathbb R^3$ and which is spanned by vectors $v = \begin{bmatrix} v_1 \\ v_2 \\ v_3 \end{bmatrix}$ and $w = \begin{bmatrix} w_1 \\ w_2 \\ w_3 \end{bmatrix} \in \mathbb R^3$. The orientation of the boundary of the parallelogram is indicated by the little direction arrows.
Since this is a very tiny parallelogram, I’ll make the approximation that the integral of $f$ along edge 1 is approximately $f(x) \cdot v$, the integral of $f$ along edge 2 is approximately $f(x + v) \cdot w$, the integral of $f$ along edge 3 is approximately $f(x + w) \cdot (v)$, and the integral of $f$ along edge 4 is approximately $f(x) \cdot (w)$. Summing these four terms, and pairing edge 1 with edge 3 and edge 2 with edge 4, we find that the integral of $f$ along the boundary of this parallelogram is approximately
\begin{align*}
&\quad \langle f(x+v) – f(x), w \rangle – \langle f(x + w) – f(x), v \rangle \\
&\approx \langle f'(x) v, w \rangle – \langle f'(x) w, v \rangle \\
&= \langle v, (f'(x)^T – f'(x)) w \rangle \\
&= \left \langle v, \begin{bmatrix} 0 & \frac{\partial f_2(x)}{\partial x_1} – \frac{\partial f_1(x)}{\partial x_2} & \frac{\partial f_3(x)}{\partial x_1} – \frac{\partial f_1(x)}{\partial x_3} \\
\frac{\partial f_1(x)}{\partial x_2} – \frac{\partial f_2(x)}{\partial x_1} & 0 &
\frac{\partial f_3(x)}{\partial x_2} – \frac{\partial f_2(x)}{\partial x_3} \\
\frac{\partial f_1(x)}{\partial x_3} – \frac{\partial f_3(x)}{\partial x_1} &
\frac{\partial f_2(x)}{\partial x_3} – \frac{\partial f_3(x)}{\partial x_2} & 0
\end{bmatrix} w \right\rangle \\
&= \langle v, w \times (\nabla \times f) \rangle \\
&=\tag{2} \langle \nabla \times f, \underbrace{v \times w}_{\substack{\text{Area vector}\\ \text{for this tiny} \\ \text{parallelogram}}} \rangle.
\end{align*}
Here $f_1, f_2$, and $f_3$ are the component functions of $f$ and
$
f'(x) = \begin{bmatrix} \frac{\partial f_1(x)}{\partial x_1} & \frac{\partial f_1(x)}{\partial x_2} & \frac{\partial f_1(x)}{\partial x_3} \\
\frac{\partial f_2(x)}{\partial x_1} & \frac{\partial f_2(x)}{\partial x_2} & \frac{\partial f_2(x)}{\partial x_3} \\
\frac{\partial f_3(x)}{\partial x_1} & \frac{\partial f_3(x)}{\partial x_2} & \frac{\partial f_3(x)}{\partial x_3} \\
\end{bmatrix}
$
is the Jacobian matrix of $f$ at $x$.
The vector $\nabla \times f$, which is called the “curl” of $f$ at $x$, is defined by
$$
\nabla \times f = \begin{bmatrix}
\frac{\partial f_3(x)}{\partial x_2} – \frac{\partial f_2(x)}{\partial x_3} \\
\frac{\partial f_1(x)}{\partial x_3} – \frac{\partial f_3(x)}{\partial x_1} \\
\frac{\partial f_2(x)}{\partial x_1} – \frac{\partial f_1(x)}{\partial x_2}
\end{bmatrix}.
$$
This is the moment in math when we discover the curl for the first time.
Technically, I should write the curl of $f$ at $x$ as $(\nabla \times f)(x)$.
The final step in our derivation of Stokes’s theorem is to apply formula (2) to the sum on the left in equation (1). Let $\Delta A_i$ be the “area vector” for the $i$th tiny parallelogram. In other words, the vector $\Delta A_i$ points outwards, and the magnitude of $\Delta A_i$ is equal to the area of the $i$th tiny parallelogram.
Let $x^i \in \mathbb R^3$ be the point where the $i$th tiny parallelogram is based. (The $i$ here is a superscript, not an exponent.) Combining formulas (1) and (2) reveals that
\begin{align}
\int_C f \cdot dr &\approx \sum_i (\nabla \times f)(x_i) \cdot \Delta A_i \\
&\approx \int_S (\nabla \times f) \cdot dA.
\end{align}
We have discovered the Stokes’s theorem formula. It seems plausible that we can make the approximation as accurate as we like by chopping up $S$ into sufficiently small pieces. Thus, we conclude that
$$
\int_C f \cdot dr = \int_S (\nabla \times f) \cdot dA
$$
Comments:

I gave a similar derivation of Green’s theorem here. I also wrote notes that attempt to give a similar derivation of the generalized Stokes’s theorem here.

Physicists frequently use similar arguments when deriving Stokes’s theorem. Feynman, for example, integrates a vector field around a little square in the $xy$plane, then recognizes that the result can be expressed in terms of the curl vector. Here is the relevant passage from Feynman:
However, how did Feynman discover the curl in the first place? He did it by treating the gradient operator $\nabla$ as a vector, and symbolically computing the cross product of this “vector” with $f$. I find that to be interesting and characteristically Feynman, but I also want a more direct way to discover Stokes’s theorem, the same way that we discovered Green’s theorem. (See section 36 and section 25 of volume II of the Feynman Lectures on Physics for reference.)The book Div, Grad, Curl and All That computes the three components of the curl vector by integrating a vector field around small rectangles which are parallel to either the $xy$plane or the $xz$plane or the $yz$plane. The author remarks, “It turns out that these three quantities are the Cartesian components of a vector. To this vector we give the name ‘curl of $\mathbf F$,’ which we write $\text{curl } \mathbf F$.” In other words, now paraphrasing and switching to my notation, they assume the existence of a vector $(\nabla \times f)(x)$ which satisfies
$$
(\nabla \times f)(x) \cdot \Delta A \approx \int_{\partial E} f \cdot dr
$$ for any tiny planar surface $E$ containing $x$ with area vector $\Delta A$. By considering the special cases where $E$ is a rectangle and $\Delta A$ is parallel to either $\hat i$ or $\hat j$ or $\hat k$, they derive the components of $(\nabla \times f)(x)$. Here is the relevant passage:

When deriving Green’s theorem and the divergence theorem, physicists typically chop up the region that we are integrating over into tiny rectangles or tiny boxes. I think the most clear and elegant way to make this strategy work for Stokes’s theorem is to chop up $S$ into tiny parallelograms. In fact, I think we should also use parallelograms or parallelepipeds when deriving Green’s theorem and the divergence theorem. This strategy can even be used to derive the generalized Stokes’s theorem and to discover the exterior derivative (by chopping up a smooth manifold into tiny parallelepipeds).

One way to chop up $S$ into tiny parallelograms is to start with a rectangular region $R$ that is chopped up into tiny rectangles, then smoothly morph $R$ onto $S$. If $S$ is not diffeomorphic to a rectangular region, then $S$ can at least be broken into simpler pieces, each of which is diffeomorphic to a rectangular region.

When deriving equation (2), I used the firstorder Taylor approximation
$$
\tag{3} f(x + v) – f(x) \approx f'(x) v.
$$
The approximation is good when $v$ is small.
The Jacobian matrix $f'(x)$ is also called the derivative of $f$ at $x$. The approximation (3), which Terence Tao refers to as “Newton’s approximation”, is the key idea of calculus. It is essentially the definition of $f'(x)$. The fundamental strategy of calculus is to take a nonlinear function $f$ (difficult) and approximate it locally by a linear function (easy). When deriving the formulas of calculus, we always find that we use the approximation (3) at the crucial moment. 
It would also be ok to evaluate $f$ at the midpoints of the edges when approximating the integral of $f$ along each edge of the tiny parallelogram. So the integral of $f$ along edge 1 is approximately $f(x + v/2) \cdot v$, the integral of $f$ along edge 2 is approximately $f(x + v + w/2) \cdot w$, etc. These are typically more accurate approximations and the calculation works out equally nicely. However, since our goal is just to provide an intuitive derivation of Stokes’s theorem, we might as well keep the calculation as simple as possible.
Attribution
Source : Link , Question Author : littleO , Answer Author : littleO