# Probability

## Definitions

\[
L^2(X) &= \theset{f: X \to \RR : \int_\RR f(x)^2 ~dx < \infty} &&\text{square integrable functions}\\
\inner{g}{f}_{2} &= \int_\RR g(x)f(x) ~dx &&\text{the } L^2 \text{ inner product}\\
\norm{f}_2^2 &= \inner{f}{f} = \int_\RR f(x)^2 ~dx &&\text{norm}\\
E[\wait] &= \inner{\wait}{f} &&\text{expectation}\\
(\tau_{p}f)(x) &= f(p- x) &&\text{translation}\\
(f \ast g)(p) &= \int_\RR f(t)g(p-t)~dt = \int_\RR f(t)(\tau_{p}g)(t) ~dt = \inner{\tau_pg}{f} &&\text{convolution}\\
\]

:::{.definition title="Random Variable"}
For $(\Sigma, E, \mu)$ a probability space with sample space $\Sigma$ and probability measure $\mu$, a random variable is a measurable function $X: \Sigma \to \RR$.
:::

:::{.definition title="Probability Density Function (PDF)"}
The function $f$ determined by the relation: for any $U \subseteq \RR$,
\[
P(X \in U) = \int_U f(x) ~dx
\implies P(a \leq X \leq b) = \int_a^b f(x) ~dx
\]
:::

:::{.definition title="Cumulative Distribution Function (CDF)"}
The antiderivative of the PDF:
\[
F(x) = P(X \leq x) = \int_{-\infty}^x f(t) ~dt
.\]
Yields $\dd{F}{x} = f(x)$.
:::

:::{.definition title="Mean/Expected Value"}
\[
E[X] \definedas \inner{\id}{f} = \int_\RR x f(x) ~dx
.\]
Also denoted $\mu_X$.
:::

:::{.proposition title="Linearity of Expectation"}
\[
E\left[\sum_{i\in\NN} a_i X_i\right] = \sum_{i\in\NN} a_i E[X_i]
.\]
This holds whether or not the $X_i$ are independent.
:::

:::{.definition title="Variance"}
\[
\mathrm{Var}(X) &= E[(X - E[X])^2] \\
&= \int (x - E[X])^2 f(x) ~dx \\
&= E[X^2] - E[X]^2 \\
&\definedas \sigma^2(X)
\]
where $\sigma$ is the standard deviation.

Can also be defined as $\inner{(\id - \inner{\id}{f})^2}{f}$.
Take the portion of the id function in the orthogonal complement of $f$, squared, and project it back onto $f$?
\todo[inline]{???}
:::

:::{.proposition title="Properties of Variance"}
\[
\mathrm{Var}(aX + b) &= a^2\mathrm{Var}(X) \\
\mathrm{Var}\qty{ \sum_\NN X_i } &= \sum_i \mathrm{Var}(X_i) + 2 \sum_{i < j}\mathrm{Cov}(X_i, X_j)
.\]
:::

:::{.definition title="Covariance"}
\[
\mathrm{Cov}(X,Y) &= E[(X-\mu_X)(Y-\mu_Y)] \\
&= E[XY] - E[X]E[Y]
\]
:::

:::{.proposition title="Properties of Covariance"}
\[
\mathrm{Cov}(X, X) &= \mathrm{Var}(X) \\
\mathrm{Cov}(aX, Y) &= a\mathrm{Cov}(X,Y) \\
\mathrm{Cov}\qty{\sum_{i} X_i, \sum_j Y_j} &= \sum_i \sum_j\mathrm{Cov}(X_i, Y_j)
.\]
:::

:::{.proposition title="Stirling's Approximation"}
\[
k! \sim k^{k + \frac 1 2}e^{-k} \sqrt{2\pi}
.\]
:::

:::{.proposition title="Markov's Inequality"}
For $X \geq 0$ and $a > 0$,
\[
P(X \geq a) \leq \frac 1 a E[X]
\]

One-sided (Cantelli) version: for $a > 0$,
\[
P(X - \mu \geq a) \leq \frac{\sigma^2}{\sigma^2 + a^2}
\implies P(X \in N_a(\mu)^c) \leq \frac{2\sigma^2}{\sigma^2 + a^2}
.\]
:::

:::{.proposition title="Chebyshev's Inequality"}
\[
P(\abs{X - \mu} \geq a) \leq \left( \frac \sigma a \right)^2
\]
:::

:::{.proof title="?"}
Apply Markov's inequality to the variable $(X-\mu)^2$ with threshold $a^2$:
\[
P(\abs{X - \mu} \geq a) = P((X-\mu)^2 \geq a^2) \leq \frac{E[(X-\mu)^2]}{a^2} = \left( \frac \sigma a \right)^2
.\]
:::

:::{.theorem title="Central Limit Theorem"}
For $X_i$ i.i.d. with mean $\mu$ and variance $\sigma^2$, as $n \to \infty$,
\[
\frac{\sum_{i=1}^n X_i - n\mu}{\sigma \sqrt n} \to N(0, 1) \quad \text{in distribution}
.\]
:::

:::{.theorem title="Strong Law of Large Numbers"}
For $X_i$ i.i.d. with mean $\mu$,
\[
P\qty{\frac{1}{n} \sum_{i=1}^n X_i \to \mu} = 1
.\]
:::

:::{.proposition title="Chernoff Bounds"}
For all $t > 0$,
\[
P(X \geq a) \leq e^{-at}M_X(t)
,\]
and for all $t < 0$, $P(X \leq a) \leq e^{-at}M_X(t)$.
:::

:::{.proposition title="Jensen's Inequality"}
For $f$ convex,
\[
E[f(X)] \geq f(E[X])
\]
:::

:::{.definition title="Entropy"}
\[
H(X) = - \sum_i p_i \ln p_i
\]
where $p_i = P(X = x_i)$.
:::

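For a quick worked example of this definition: a Bernoulli$(p)$ variable has entropy
\[
H(X) = -p \ln p - (1-p)\ln(1-p)
,\]
which is maximized at $p = \frac 1 2$, where $H(X) = \ln 2$; a fair coin is the most uncertain Bernoulli trial.
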
## Theory and Background

:::{.definition title="Axioms for a Probability Space"}
Given a sample space $\Sigma$ with a collection of events $E$ and probability measure $\mu = P$,

1. $\mu(\Sigma) = 1$
    - Yields $S \in E \implies 0 \leq P(S) \leq 1$
2. For mutually exclusive events, $P(\union_\NN S_i) = \sum_\NN P(S_i)$
    - Yields $P(\emptyset) = 0$
:::

:::{.proposition title="Properties that follow from axioms"}
\envlist

- $P(S^c) = 1 - P(S)$
- $E \subseteq F \implies P(E) \leq P(F)$
  - Proof: $E \subseteq F \implies F = E \union (E^c \intersect F)$, which are disjoint, so $P(E) \leq P(E) + P(E^c \intersect F) = P(F)$.
- $P(E \union F) = P(E) + P(F) - P(E \intersect F)$
:::

:::{.definition title="Conditional Probability"}
\[
P(F)P(E \mid F) = P(E \intersect F) = P(E)P(F \mid E)
\]

Generalization:
\[
P(\intersect_\NN E_i) = P(E_1) P(E_2 \mid E_1)P(E_3\mid E_1 \intersect E_2) \cdots
\]
:::

:::{.theorem title="Bayes' Rule"}
Law of total probability:
\[
P(E) = P(F)P(E \mid F) + P(F^c)P(E \mid F^c) \\
P(E) = \sum_i P(A_i) P(E \mid A_i)
\]

Generalization: for $\coprod_{i=1}^n A_i = \Sigma$ and $A=A_i$ for some $i$,
\[
P(A \mid B) = \frac{P(A)P(B\mid A)}{\sum_{j = 1}^n P(A_j) P(B \mid A_j)}
.\]

The LHS is the posterior probability, while the $P(A_i)$ are the priors.
:::

:::{.definition title="Odds"}
\[
P(A) / P(A^c)
\]

Conditional odds:
\[
\frac{P(A \mid E)}{P(A^c \mid E)} = \frac{P(A)}{P(A^c)} \frac{P(E \mid A)}{P(E \mid A^c)}
\]
:::

:::{.definition title="Independence"}
$A$ and $B$ are independent iff
\[
P(A \intersect B) = P(A) P(B)
\]
:::

:::{.proposition title="Change of Variables for PDFs"}
If $g$ is differentiable and monotonic and $Y=g(X)$, then
\[
f_Y(y) =
\begin{cases}
(f_X \circ g^{-1})(y) \abs{\dd{}{y} g^{-1}(y)} & y \in \im(g) \\
0 & \text{else}
\end{cases}
\]
:::

:::{.proposition title="PDF for a sum of independent random variables"}
\[
f_{X+Y} = f_X \ast f_Y
\]
:::

## Distributions

Let $X$ be a random variable, and let $f$ be its probability mass or density function; in the discrete case, $f(k) = P(X = k)$.

### Uniform

- Consider an event with $n$ mutually exclusive outcomes of equal probability, and let $X \in \theset{1,2,\ldots, n}$ denote which outcome occurs. Then,
\[
f(k) &= \quad \frac 1 n \\
\mu &= \quad \frac{n+1}{2} \\
\sigma^2 &= \quad \frac{n^2-1}{12}
\]
- Examples:
  - Dice rolls where $n=6$.
  - Fair coin toss where $n=2$.
- Continuous $U(a, b)$: $\mu = \frac{1}{2}(a+b), \sigma^2 = \frac{1}{12}(b-a)^2$

### Bernoulli

- Consider a trial with either a positive or negative outcome, and let $X \in\theset{0,1}$ where $1$ denotes a success with probability $p$. Then,
\[
f(k) &= \begin{cases} 1-p, & k = 0 \\ p, & k = 1 \end{cases} \\
\mu &= \quad p \\
\sigma^2 &= \quad p(1-p)
\]
- Examples:
  - A weighted coin with $P(\text{Heads}) = p$

### Binomial

- Consider a sequence of $n$ independent Bernoulli trials, and let $X \in \theset{0, 1,\ldots, n}$ denote the number of successes occurring in the $n$ trials. Then,
\[
f(k) &= \quad {n \choose k} p^k (1-p)^{n-k} \\
\mu &= \quad np \\
\sigma^2 &= \quad np(1-p)
\]
- Examples:
  - A sequence of coin flips and the total number of heads occurring.

### Poisson

- Given a parameter $\lambda > 0$ that denotes the rate per unit time of an event occurring, let $X$ be the number of times the event occurs in one unit of time. Then,
\[
f(k) &= \quad \frac{\lambda^k}{k!}e^{-\lambda} \\
\mu &= \quad \lambda \\
\sigma^2 &= \quad \lambda
\]
- Approximates the binomial when $n \gg 1$ and $p \ll 1$ by taking $\lambda = np$.

### Negative Binomial

- $B^- (r, p)$: similar to the binomial, where $X$ is the number of trials it takes to accumulate $r$ successes:
\[
f(k) &= \quad {k-1 \choose r-1}p^r(1-p)^{k-r} \\
\mu &= \quad \frac r p \\
\sigma^2 &= \quad \frac{r (1-p)}{p^2}
\]

### Geometric

- Consider a sequence of independent Bernoulli trials, and let $X \in \theset{1, 2, \ldots}$ where $X=k$ denotes the first success happening on the $k\dash$th trial. Then,
\[
f(k) &= \quad (1-p)^{k-1} p \\
\mu &= \quad \frac 1 p \\
\sigma^2 &= \quad \frac{1-p}{p^2}
\]
- Note this is equal to $B^-(1, p)$; a short derivation of the mean is given below.
- Examples:
  - A sequence of coin flips and the number of flips needed to see the first heads.

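As a worked example, the mean $\mu = \frac 1 p$ above can be recovered by conditioning on the first trial: either it succeeds immediately, or one trial is spent and the process starts over,
\[
E[X] = p \cdot 1 + (1-p)\left(1 + E[X]\right) \implies p E[X] = 1 \implies E[X] = \frac 1 p
.\]
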
### Hypergeometric

- $H(n, m, s)$: an urn filled with $n$ balls, where $m$ are white and $n-m$ are black; pick a sample of size $s$ and let $X$ denote the number of white balls:
\[
f(k) &= \quad {m \choose k} {n-m \choose s-k} {n \choose s}^{-1} \\
\mu &= \quad \frac{ms}{n} \\
\sigma^2 &= \quad \frac{ms}{n}\left(1- \frac{m}{n}\right)\left( 1 - \frac{s-1}{n-1} \right)
\]

### Normal

\[
f(x) = \frac{1}{\sigma \sqrt{2\pi}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}
\]

Approximate values of the standard normal CDF:

$z$ | $\Phi(z)$
-- | --
$0$ | $0.50$
$0.5$ | $0.69$
$1$ | $0.84$
$1.5$ | $0.93$
$2$ | $0.98$
$2.5$ | $0.99$
$\geq 3$ | $> 0.998$

## Table of Distributions

Table: let $q = 1-p$.

\[
\text{Distribution} & f(x) & & \mu & \sigma^2 & M(t) \\
\hline \\
B(n, p) & {n\choose x}p^x q^{n-x} & & np & npq & (pe^t + q)^n \\
P(\lambda) & \frac{\lambda^x}{x!}e^{-\lambda} & & \lambda & \lambda & e^{\lambda(e^t-1)} \\
G(p) & q^{x-1}p & & \frac{1}{p} & \frac{q}{p^2} & \frac{pe^t}{1-qe^t} \\
B^-(r, p) & {x-1 \choose r-1}p^rq^{x-r} & & \frac{r}{p} & \frac{rq}{p^2} & \left(\frac{pe^t}{1-qe^t}\right)^r \\
U(a, b) & \indic{a\leq x\leq b}\frac 1 {b-a} & & \frac{1}{2}(a+b) & \frac{1}{12}(b-a)^2 & \frac{e^{tb} - e^{ta}}{t(b-a)} \\
Exp(\lambda) & \indic{0 \leq x}\lambda e^{-\lambda x} & & \frac 1 \lambda & \frac 1 {\lambda^2} & \frac \lambda {\lambda - t} \\
\Gamma(s, \lambda) & \indic{0 \leq x} \frac{\lambda e^{-\lambda x} (\lambda x)^{s-1}}{\Gamma(s)} & & \frac s \lambda & \frac s {\lambda^2} & \left( \frac{\lambda}{\lambda - t} \right)^s \\
N(\mu, \sigma^2) & \frac{1}{\sigma \sqrt{2\pi}}e^{-\frac{(x-\mu)^2}{2\sigma^2}} & & \mu & \sigma^2 & e^{\mu t + \frac{1}{2}\sigma^2 t^2}
\]

- Why you need the Stieltjes integral: let $X \sim B(1, \frac 1 2)$, $Y \sim U(0, 1)$, and
\[
Z =
\begin{cases}
X, & X = 1 \\
Y, & \text{else}
\end{cases}
,\]
then $Z$ takes uncountably many values, so $Z$ is not discrete, but $P(Z = 1) = P(X = 1) = \frac 1 2 \neq 0$, so $Z$ is not continuous either.

  Definition:
\[
\int_a^b g(x) ~dF(x) = \lim \sum_{i=1}^{n} g(x_i)\left( F(x_i) - F(x_{i-1}) \right)
.\]

## Common Problems

- Birthday Paradox
- Coupon Collector's Problem:
  - Given $X = \theset{1, \cdots, n}$, what is the expected number of draws until all $n$ outcomes are seen?

## Notes, Shortcuts, Misc

- When computing expected values, variances, etc., just insert a parameter $k$ and compute the moments $E[X^k]$. Then, with a solution in terms of $k$, let $k=1,2$, etc.
- Neat property of pdfs: $P(X \in N_\varepsilon(a)) \approx \varepsilon f(a)$

:::{.definition title="The Gamma Function"}
\[
\Gamma(x+1) = \int_{\RR^{>0}} e^{-t} t^x ~dt
.\]

Integrate by parts to obtain the functional relation $\Gamma(x+1) = x\Gamma(x)$.
:::

:::{.proposition title="Boole's Inequality"}
\[
P(\union_\NN A_i) \leq \sum_\NN P(A_i)
\]
:::

- For any function $f:X \to \RR$, the expectation gives a lower bound: $\max_{x\in X}f(x) \geq E[f(X)]$

:::{.definition title="Moment Generating Functions"}
Define
\[
M(t) = E[e^{Xt}]
\]

- Then $M^{(n)}(0)$ is the $n\dash$th moment, i.e. $M'(0) = E[X], M''(0) = E[X^2]$, etc.
- $M_{X+Y}(t) = M_X(t)M_Y(t)$ (if $X, Y$ are independent)
- $M_{aX + b}(t) = e^{bt} M_X(at)$
- $f_X = \mathcal{F}^{-1}(M_X(it))$, where $\mathcal{F}^{-1}$ denotes the inverse Fourier transform and $M_X(it)$ is the characteristic function.
:::
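As a worked example of extracting moments from an MGF, take $X \sim Exp(\lambda)$ with $M_X(t) = \frac{\lambda}{\lambda - t}$ from the table above:
\[
M_X'(t) = \frac{\lambda}{(\lambda - t)^2} &\implies E[X] = M_X'(0) = \frac 1 \lambda \\
M_X''(t) = \frac{2\lambda}{(\lambda - t)^3} &\implies E[X^2] = M_X''(0) = \frac{2}{\lambda^2} \\
&\implies \mathrm{Var}(X) = \frac{2}{\lambda^2} - \frac{1}{\lambda^2} = \frac{1}{\lambda^2}
,\]
agreeing with the mean and variance listed for $Exp(\lambda)$ in the table of distributions.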