# Formal definition of conditional probability

It would be extremely helpful if someone could give me the formal definition of conditional probability and conditional expectation in the following setting. We are given a probability space
$(\Omega, \mathscr{A}, \mu )$ with $\mu(\Omega) = 1$, and a random variable $X : \Omega \rightarrow \mathbb{R}^n$, where for any Borel set $A \in \mathscr{B}(\mathbb{R}^n)$ we define
$$\mathbb{P}(X \in A) = (X_*\mu)(A) = \mu(X^{-1}(A))= \mu(\{\omega\in \Omega\ \ |\ \ X(\omega) \in A\})\ \ \text{and}\ \ \mathbb{E}(X) = \int_\Omega Xd\mu$$
Regardless of $X, Y$ being discrete or continuous (with density $f_X, f_Y$ and joint density $f_{X,Y}$ w.r.t some measure $\nu$ on $\mathbb{R}^n$), I am asking for the definition
of $\mathbb{P}(Y\in B\ |\ X \in A)$ and $\mathbb{E}(Y|X)$ for all Borel sets $A, B \in \mathscr{B}(\mathbb{R}^n)$, keeping in mind that $\mathbb{P}(X \in A)$ may well be zero.

In our probability class, something of the following sort was mentioned: if
$\delta_x$ is the Dirac distribution at $x$, then we have
$$\mathbb{E}(Y|X = x) = \frac{\mathbb{E}(\delta_x(X)Y)}{\mathbb{P}(X=x)}$$
which I can’t make any sense of. Any appropriate reference for these would also be very welcome.

Thank you.

Throughout this post, let $(\Omega,\mathcal{F},P)$ be a probability space. Let us first define the conditional expectation ${\rm E}[X\mid\mathcal{G}]$ for integrable random variables $X:\Omega\to\mathbb{R}$, i.e. $X\in L^1(P)$, and sub-sigma-algebras $\mathcal{G}\subseteq\mathcal{F}$.

Definition: The conditional expectation ${\rm E}[X\mid\mathcal{G}]$ of $X$ given $\mathcal{G}$ is the random variable $Z$ having the following properties:

(i) $Z$ is integrable, i.e. $Z\in L^1(P)$.

(ii) $Z$ is $(\mathcal{G},\mathcal{B}(\mathbb{R}))$-measurable.

(iii) For any $A\in\mathcal{G}$ we have
$$\int_A Z\,\mathrm dP=\int_A X\,\mathrm dP.$$

Note: It makes sense to talk about the conditional expectation since if $U$ is another random variable satisfying (i)-(iii) then $U=Z$ $P$-a.s.
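
To make (i)-(iii) concrete, here is a minimal sketch on a finite sample space, assuming that $\mathcal{G}$ is generated by a finite partition into blocks of positive probability. In that special case ${\rm E}[X\mid\mathcal{G}]$ is the probability-weighted average of $X$ on each block. The fair die, the odd/even partition, and the helper names `conditional_expectation` and `integral` are illustrative choices of mine, not part of the definition.

```python
from fractions import Fraction


def conditional_expectation(prob, X, partition):
    """Return E[X | G] as a dict omega -> value.

    prob:      dict omega -> P({omega})
    X:         dict omega -> X(omega)
    partition: list of disjoint sets covering the sample space, each with
               positive probability; G is the sigma-algebra they generate.
    """
    Z = {}
    for block in partition:
        p_block = sum(prob[w] for w in block)
        # Weighted average of X over the block.
        avg = sum(X[w] * prob[w] for w in block) / p_block
        for w in block:
            Z[w] = avg  # Z is constant on each block, hence G-measurable.
    return Z


def integral(f, A, prob):
    """Integral of f over the event A with respect to P."""
    return sum(f[w] * prob[w] for w in A)


# Illustrative example: a fair die, with G generated by {odd, even}.
omega = range(1, 7)
prob = {w: Fraction(1, 6) for w in omega}
X = {w: Fraction(w) for w in omega}
odd, even = {1, 3, 5}, {2, 4, 6}
Z = conditional_expectation(prob, X, [odd, even])

# Property (iii): the integrals of Z and X agree on every A in G.
for A in (set(), odd, even, odd | even):
    assert integral(Z, A, prob) == integral(X, A, prob)
```

Exact rational arithmetic via `Fraction` is used so that property (iii) can be checked with equality rather than up to rounding.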

Definition: If $X\in L^1(P)$ and $Y:\Omega\to\mathbb{R}$ is any random variable, then the conditional expectation of $X$ given $Y$ is defined as
$${\rm E}[X\mid Y]:={\rm E}[X\mid\sigma(Y)],$$
where $\sigma(Y)=\{Y^{-1}(B)\mid B\in\mathcal{B}(\mathbb{R})\}$ is the sigma-algebra generated by $Y$.

I’m not aware of any definition of $P(Y\in B\mid X\in A)$ other than the obvious one, namely
$$P(Y\in B\mid X\in A)=\frac{P(Y\in B,X\in A)}{P(X\in A)}$$
provided that $P(X\in A)>0$. The only exception is when $A$ consists of a single point, i.e. $A=\{x\}$ for some $x\in\mathbb{R}$. In this case, the object $P(Y\in B\mid X=x)$ is defined in terms of a regular conditional distribution.

Let us first define regular conditional probabilities. Let $X:\Omega\to\mathbb{R}$ be a random variable.

Definition: A regular conditional probability for $P$ given $X$ is a function
$$\mathcal{F}\times \mathbb{R} \ni(A,x)\mapsto P^X(A\mid x)$$
satisfying the following three conditions:

(i) The mapping $A\mapsto P^X(A\mid x)$ is a probability measure on $(\Omega,\mathcal{F})$ for all $x\in \mathbb{R}$.

(ii) The mapping $x\mapsto P^X(A\mid x)$ is $(\mathcal{B}(\mathbb{R}),\mathcal{B}(\mathbb{R}))$-measurable for all $A\in\mathcal{F}$.

(iii) The defining equation holds: For any $A\in\mathcal{F}$ and $B\in\mathcal{B}(\mathbb{R})$ we have
$$\int_B P^X(A\mid x)\,P_X(\mathrm dx)=P(A\cap\{X\in B\}).$$

Note: A mapping satisfying (i) and (ii) is often called a Markov kernel. Furthermore, since $(\mathbb{R},\mathcal{B}(\mathbb{R}))$ is a nice space, the regular conditional probability is unique in the sense that if $\tilde{P}^X(\cdot\mid\cdot)$ is another regular conditional probability of $P$ given $X$, then we have that $P^X(\cdot\mid x)=\tilde{P}^X(\cdot\mid x)$ for $P_X$-a.a. $x$. Here $P_X=P\circ X^{-1}$ is the distribution of $X$.

Connection: Let $P^X(\cdot\mid\cdot)$ be a regular conditional probability of $P$ given $X$. Then for any $A\in\mathcal{F}$ we have
$${\rm E}[1_A\mid X]=\varphi(X),$$
where $\varphi(x)=P^X(A\mid x)$. In short we write ${\rm E}[1_A\mid X]=P^X(A\mid X)$.

Now let us introduce another random variable $Y:\Omega\to\mathbb{R}$, and $P^X(\cdot\mid \cdot)$ still denotes a regular conditional probability of $P$ given $X$.

Definition: For $B\in\mathcal{B}(\mathbb{R})$ and $x\in\mathbb{R}$ we define the regular conditional distribution of $Y$ given $X$ by
$$P_{Y\mid X}(B\mid x):=P^X(Y\in B\mid x).$$

Instead of $P_{Y\mid X}(B\mid x)$ one often writes $P(Y\in B\mid X=x)$.

An easy consequence of this definition is that $(B,x)\mapsto P_{Y\mid X}(B\mid x)$ is a Markov kernel and for any $A,B\in\mathcal{B}(\mathbb{R})$ we have
$$\int_A P_{Y\mid X}(B\mid x)\,P_X(\mathrm dx)=P(\{X\in A\}\cap\{Y\in B\}). \tag{1}$$

In fact, $P_{Y\mid X}(\cdot \mid \cdot)$ is a regular conditional distribution of $Y$ given $X$ if and only if $P_{Y\mid X}(\cdot\mid\cdot)$ is a Markov kernel and satisfies $(1)$. Again $(1)$ is often referred to as the defining equation.
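
As an illustration of how $(1)$ is checked in practice, here is a sketch for the density case. The extra assumption (not needed for the definitions above) is that $(X,Y)$ has a joint density $f_{X,Y}$ with respect to Lebesgue measure on $\mathbb{R}^2$, with marginal density $f_X(x)=\int_{\mathbb{R}} f_{X,Y}(x,y)\,\mathrm dy$. One candidate is
$$P_{Y\mid X}(B\mid x):=\int_B \frac{f_{X,Y}(x,y)}{f_X(x)}\,\mathrm dy\quad\text{if } f_X(x)>0,$$
and $P_{Y\mid X}(\cdot\mid x)$ equal to any fixed probability measure if $f_X(x)=0$. Since $P_X(\{f_X=0\})=0$, we get for $A,B\in\mathcal{B}(\mathbb{R})$
$$\int_A P_{Y\mid X}(B\mid x)\,P_X(\mathrm dx)=\int_A\int_B \frac{f_{X,Y}(x,y)}{f_X(x)}\,\mathrm dy\,f_X(x)\,\mathrm dx=\int_A\int_B f_{X,Y}(x,y)\,\mathrm dy\,\mathrm dx=P(\{X\in A\}\cap\{Y\in B\}),$$
which is exactly $(1)$. Under this assumption the familiar conditional density $f_{X,Y}(x,y)/f_X(x)$ therefore gives a regular conditional distribution of $Y$ given $X$.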

Definition: Let $P^X(\cdot\mid\cdot)$ be a regular conditional probability of $P$ given $X$. Furthermore, let $U:\Omega\to\mathbb{R}$ be another random variable that is assumed bounded (to ensure the following expectations exist). Then we define the (regular) conditional mean of $U$ given $X=x$ by
$${\rm E}[U\mid X=x]:=\int_\Omega U(\omega)\, P^X(\mathrm d\omega\mid x).$$

Let us denote $\psi(x)={\rm E}[U\mid X=x]$. Then we have the following:

Connection: The mapping $\mathbb{R}\ni x\mapsto \psi(x)$ is $(\mathcal{B}(\mathbb{R}),\mathcal{B}(\mathbb{R}))$-measurable, and
$${\rm E}[U\mid X]=\psi(X).$$
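
This also explains the formula from the question in the case of an atom. Suppose $x$ is a point with $P(X=x)>0$. Taking $B=\{x\}$ in the defining equation of the regular conditional probability gives
$$P^X(A\mid x)\,P(X=x)=P(A\cap\{X=x\}),\quad\text{i.e.}\quad P^X(A\mid x)=\frac{P(A\cap\{X=x\})}{P(X=x)},$$
and integrating $U$ against this measure yields
$${\rm E}[U\mid X=x]=\int_\Omega U(\omega)\,P^X(\mathrm d\omega\mid x)=\frac{{\rm E}[U\,1_{\{X=x\}}]}{P(X=x)}.$$
If $\delta_x(X)$ in the question is read as the indicator $1_{\{X=x\}}$ (my interpretation of the notation), this is the formula mentioned in class. It only makes sense when $P(X=x)>0$; otherwise one has to work with the general definition above.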

The following is an extremely useful rule when calculating with conditional distributions:

Rule: Let $X$ and $Y$ be as above, and let $\xi:\mathbb{R}^2\to\mathbb{R}$ be $(\mathcal{B}(\mathbb{R}^2),\mathcal{B}(\mathbb{R}))$-measurable. Then
$$P(\xi(X,Y)\in D\mid X=x)=P(\xi(x,Y)\in D\mid X=x),\quad D\in\mathcal{B}(\mathbb{R}),$$
holds for $P_X$-a.a. $x$. This is saying that “conditional on $X=x$ we may replace $X$ by $x$”.

The following example shows how this rule can be useful: Let $X$ and $Y$ be independent $\mathcal{N}(0,1)$ random variables, and let $U=X+Y$. Then we claim that $U\mid X=x\sim \mathcal{N}(x,1)$ for $P_X$-a.a. $x$. To see this, note that by the rule above, the distributions of $U\mid X=x$ and $Y+x\mid X=x$ are the same. But since $Y$ is independent of $X$, we have that $Y+x\mid X=x$ is distributed as $Y+x$. We can write it as follows:
$$U\mid X=x\sim Y+x\mid X=x\sim Y+x\sim\mathcal{N}(x,1).$$
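
As a rough numerical sanity check of this example (not a proof), one can simulate $(X,Y)$ and keep only the draws where $X$ falls in a small window around $x$. Conditioning on the positive-probability event $\{|X-x|<h\}$ is only an approximation to conditioning on $X=x$; the window half-width, the sample size, the seed, and the helper name `conditional_sample` are assumptions of this sketch.

```python
import random
import statistics


def conditional_sample(x, half_width=0.05, n=200_000, seed=0):
    """Draw (X, Y) i.i.d. N(0,1) and return the values of U = X + Y
    for those draws with |X - x| < half_width."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    kept = []
    for _ in range(n):
        X = rng.gauss(0.0, 1.0)
        Y = rng.gauss(0.0, 1.0)
        if abs(X - x) < half_width:
            kept.append(X + Y)
    return kept


sample = conditional_sample(1.0)
mean_u = statistics.mean(sample)
var_u = statistics.variance(sample)
# If U | X = x ~ N(x, 1), then for x = 1 both numbers should be close to 1.
print(len(sample), mean_u, var_u)
```

Shrinking `half_width` reduces the approximation error from the window but also leaves fewer retained draws, so the estimates become noisier.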