Basics of Probability Theory for Data Science


1. Basic Concepts for Probability

Random Experiment:

  • A trial that can have more than one possible outcome.
  • The trial can be repeated under fixed conditions.
  • The outcome of any single trial is unpredictable.

Event: A specific outcome of a random experiment (e.g., X = 1).

Fundamental Event: The finest-grained event defined according to the objective of the random experiment (it is neither possible nor necessary to split it further). For example, for a throw of a die, the fundamental events are face = 1, 2, ..., 6.

Compound Event: An event consisting of multiple fundamental events (e.g., for a throw of a die: face < 5).

Sample Space: The collection of all possible fundamental events (e.g., for two flips of a coin, \(\Omega = \{(H,H),(T,H),(H,T),(T,T)\}\)).

Random Variable: A function that maps each sample point \(\omega\) in the sample space \(\Omega\) to a real number \(X = X(\omega)\). Strictly speaking, an event is a collection \(\{\omega \mid X(\omega) = a\}\), but in most practical settings it is enough to understand an event simply as an outcome.

2. Interpretation of Probability

Probability describes how likely an event is to happen. In probability theory, the following axioms are given: \[ 0 \le P(A) \le 1 \]

\[ P(\Omega) = 1 \]

\[ P(A_1+A_2) = P(A_1)+P(A_2) \]

where \(\Omega\) is the certain event (containing all fundamental events), and \(A_1\) and \(A_2\) are mutually exclusive.

2.1 Classical Model of Probability

In the classical interpretation of probability, two assumptions are considered satisfied:

  1. The sample space contains finitely many fundamental events.
  2. All fundamental events are equally likely.

Under such assumptions, the probability of an event A can be defined as \[ P(A) = \frac{n_a}{n_s} \] where \(n_s\) is the number of fundamental events in the sample space, and \(n_a\) is the number of fundamental events in event A.

For most classical probability problems, \(n_s\) and \(n_a\) can be computed with permutations and combinations: \[ P_n^m = \frac{n!}{(n-m)!} \]

\[ C_n^m = \frac{n!}{m!(n-m)!} \]
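As a quick sketch of this counting in code (the two-red/three-blue card setup is just an invented example), Python's `math.comb` and `math.perm` compute these quantities directly:

```python
from math import comb, perm

# Classical model: draw 2 cards from 5 (2 red, 3 blue);
# P(both red) = C(2,2) / C(5,2).
n_s = comb(5, 2)   # number of fundamental events in the sample space
n_a = comb(2, 2)   # number of fundamental events in event A (both red)
print(n_a / n_s)   # 0.1

# Counting m = 2 out of n = 5 with and without order
print(perm(5, 2))  # 20 ordered arrangements (P_n^m)
print(comb(5, 2))  # 10 unordered selections (C_n^m)
```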

2.2 Geometric Model of Probability

Define a geometric measure \(M\) of an event (e.g., the length of a line segment, or an area); then \[ P(A) = \frac{M(A)}{M(S)} \] where \(S\) is the whole sample region.
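A minimal simulation sketch of the geometric model, using a made-up quarter-circle event inside the unit square so that \(P(A) = \pi/4\):

```python
import random

# Geometric probability: a point falls uniformly in the unit square (the
# sample region S); event A = "the point lands inside the quarter circle
# of radius 1".  P(A) = M(A)/M(S) = (pi/4)/1.
n, hits = 100_000, 0
for _ in range(n):
    x, y = random.random(), random.random()
    if x * x + y * y <= 1.0:
        hits += 1
print(hits / n)   # close to pi/4 ≈ 0.785
```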

2.3 Frequency and Statistical Probability

Suppose a random experiment is conducted n times and event A happens m times; then define the frequency of event A as: \[ \omega(A) = \frac{m}{n} \] The statistical probability of event A is then: \[ P(A) = \lim_{n\to\infty}\omega(A) \] Note that probability is an intrinsic property of a random variable; the statistical probability is a mathematical approximation of the real probability.
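A small simulation sketch (a fair die and the event "face < 5" are assumed purely for illustration) shows the frequency drifting toward the true probability as n grows:

```python
import random

# Frequency omega(A) = m/n approaches P(A) as n grows.
# Event A: a fair die shows a face smaller than 5, so P(A) = 4/6.
for n in (100, 10_000, 1_000_000):
    m = sum(1 for _ in range(n) if random.randint(1, 6) < 5)
    print(n, m / n)   # the frequencies approach 0.666...
```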

3. Basic Theorems in Probability Theory

\[ P(A) = 1-P(\bar{A}) \]

\[ P(A-B) = P(A) - P(A\cap B) \]

\[ P(A+B) = P(A) +P(B) - P(A\cap B) \]
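These identities can be checked directly on a toy classical model; the fair die and the two events below are arbitrary choices made only for illustration:

```python
from fractions import Fraction

# Verify the three identities on a fair die (classical model).
omega = set(range(1, 7))
A = {1, 2, 3, 4}            # face < 5
B = {2, 4, 6}               # even face
P = lambda E: Fraction(len(E), len(omega))

assert P(A) == 1 - P(omega - A)               # complement rule
assert P(A - B) == P(A) - P(A & B)            # difference rule
assert P(A | B) == P(A) + P(B) - P(A & B)     # inclusion-exclusion
```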

4. Conditional Probability, Joint Probability and Independence

4.1 Conditional Probability

Let A, B be two events in the sample space \(\Omega\). The probability of A given that B has happened is called the conditional probability of A given B, denoted \(P(A|B)\).

The sample space of \(P(A|B)\) is B, not \(\Omega\): conditioning "compresses" the sample space. According to the axioms of probability theory, the probability of the whole sample space is 1. Thus, if A denotes the event X = a, summing over all possible values a gives: \[ \sum_a P(X=a|B) = 1 \] A and B may be two events of the same variable, or events of two separate variables.

4.2 Law of Total Probability

Let \(B_1,B_2,...,B_n\) be a series of mutually exclusive and collectively exhaustive events (a partition of \(\Omega\)), and let A be another event: \[ P(A) = \sum_{i=1}^n P(B_i)P(A|B_i) \]
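A minimal numeric sketch, with made-up factory proportions and defect rates standing in for \(P(B_i)\) and \(P(A|B_i)\):

```python
# Law of total probability with an invented factory example:
# B1, B2, B3 partition the sample space (which factory a part comes from),
# A is "the part is defective".
P_B = [0.5, 0.3, 0.2]             # P(B_i), sums to 1
P_A_given_B = [0.01, 0.02, 0.05]  # P(A | B_i)

P_A = sum(pb * pa for pb, pa in zip(P_B, P_A_given_B))
print(P_A)                        # 0.005 + 0.006 + 0.01 = 0.021
```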

4.3 Joint Probability

Assume a two-dimensional sample space determined by two random experiments, which means we have two random variables X and Y on the same sample space. Let A and B be particular outcomes of X and Y respectively; the probability that A and B both happen is called the joint probability of A and B, denoted \(P(A,B)\). The joint probability has the following properties: \[ \sum_x\sum_y P(X=x,Y=y) = 1 \] If X and Y are independent: \[ P(X,Y) = P(X)P(Y) \] For such a sample space: \[ P(A|B) = \frac{P(A,B)}{P(B)} \] Combined with the law of total probability: \[ P(A)= \sum_{i=1}^n P(B_i)P(A|B_i) = \sum_{i=1}^n P(A,B_i) \]
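These relations can be illustrated with a small joint probability table (the numbers below are arbitrary, chosen only to make the arithmetic easy to follow):

```python
import numpy as np

# A made-up joint distribution P(X = x, Y = y) as a table; rows index x,
# columns index y.  All entries sum to 1.
P_XY = np.array([[0.10, 0.20],
                 [0.30, 0.40]])

P_X = P_XY.sum(axis=1)               # marginal P(X) via the law of total probability
P_Y = P_XY.sum(axis=0)               # marginal P(Y)
P_X_given_Y0 = P_XY[:, 0] / P_Y[0]   # P(X | Y = y0) = P(X, Y=y0) / P(Y=y0)

print(P_X, P_Y, P_X_given_Y0)        # [0.3 0.7] [0.4 0.6] [0.25 0.75]
```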

4.4 Bayes' Law

Let \(A_1,A_2,...,A_n\) be a series of mutually exclusive and collectively exhaustive events, and let B be another event: \[ P(A_i|B) = \frac{P(A_i)P(B|A_i)}{P(B)}=\frac{P(A_i)P(B|A_i)}{\sum_{j=1}^n P(A_j)P(B|A_j)} \] where we call:

  • \(A_i\): the hypothesis event, an event whose probability we want to assess through observations of the evidence
  • B: the evidence, an event used to update our knowledge about the hypothesis event
  • \(P(A_i)\): the prior probability, representing the knowledge before the evidence emerges
  • \(P(B|A_i)\): the likelihood, representing the probability of B given event \(A_i\)
  • \(P(A_i|B)\): the posterior probability, representing the updated knowledge after the evidence emerges

Specific examples of Bayesian inference can be found in this article.
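As a quick sketch, the classic diagnostic-test example (the prevalence and test accuracy below are invented numbers) shows how a prior is updated into a posterior:

```python
# Bayes' law on a made-up diagnostic-test example.
# Hypotheses: A1 = "has the disease", A2 = "does not" (exclusive, exhaustive).
# Evidence: B = "the test is positive".
P_A = [0.01, 0.99]           # priors P(A_i)
P_B_given_A = [0.95, 0.05]   # likelihoods P(B | A_i)

P_B = sum(pa * pb for pa, pb in zip(P_A, P_B_given_A))   # law of total probability
posterior = [pa * pb / P_B for pa, pb in zip(P_A, P_B_given_A)]
print(posterior)             # P(A1 | B) ≈ 0.161: still fairly unlikely
```

Even with a 95% accurate test, the posterior probability of disease stays low because the prior (prevalence) is so small.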

4.5 Independence of Events

If the probability of A is not affected by whether event B happens, then A is independent of B. In terms of conditional probability: \[ P(A) = P(A|B) \]

\[ P(A,B) = P(A)P(B) \]

Note that, for events with nonzero probability, independence (\(A \perp \!\!\! \perp B\)) and mutual exclusivity (\(A\cap B = \emptyset\)) cannot both hold.
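A quick check on a fair die (the particular events are chosen only for illustration) shows an independent pair and a mutually exclusive, hence dependent, pair:

```python
from fractions import Fraction

# Independence vs. mutual exclusivity on a fair die.
omega = set(range(1, 7))
P = lambda E: Fraction(len(E), len(omega))

A = {2, 4, 6}                     # even face, P(A) = 1/2
B = {1, 2}                        # face <= 2, P(B) = 1/3
print(P(A & B) == P(A) * P(B))    # True: A and B are independent (and A∩B ≠ ∅)

C = {1, 3, 5}                     # odd face: A and C are mutually exclusive
print(P(A & C) == P(A) * P(C))    # False: exclusive events with P > 0 are dependent
```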

5. Probability Distribution & Probability Density Function

5.1 Discrete Random Variable and Probability Distribution

If the possible values of a random variable are countable, it is a discrete random variable. The probability distribution (probability mass function, PMF) of a discrete random variable X is defined as: \[ p_k = P(X=x_k) \] The PMF has the following properties: \[ p_k \ge 0 \]

\[ \sum_k p_k = 1 \]
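As an illustration (the Binomial(4, 0.5) variable and SciPy's `binom` are chosen arbitrarily here), the PMF values are nonnegative and sum to one:

```python
from scipy import stats

# A discrete distribution: X ~ Binomial(n=4, p=0.5), e.g. heads in 4 coin flips.
X = stats.binom(n=4, p=0.5)
pmf = [X.pmf(k) for k in range(5)]
print(pmf)        # [0.0625, 0.25, 0.375, 0.25, 0.0625]
print(sum(pmf))   # ≈ 1.0 -- the probabilities sum to one
```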

5.2 Continuous Random Variable and Probability Density

If a random variable X can take any value in a range and there exists an integrable function \(f(x)\) such that \[ P(a< X \le b) = \int_a^b f(x)dx \] then X is called a continuous random variable, and \(f(x)\) is called the probability density function of X.

For a continuous random variable, the probability of any single sample point is 0. Instead of an actual probability, the distribution of a continuous random variable is described by the probability density at each point. The value of \(f(x)\) at a specific point is the probability density of that sample point: \[ f(a) = \lim_{\Delta \to 0^+} \frac{P(a<X \le a+\Delta)}{\Delta} \]
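A short sketch with an arbitrarily chosen standard normal variable: integrating the density over an interval gives a probability, while the density value at a single point is not itself a probability:

```python
from scipy import stats
from scipy.integrate import quad

# Continuous case: X ~ N(0, 1).  P(a < X <= b) is the integral of the
# density f(x) over (a, b]; single points carry zero probability.
X = stats.norm(0, 1)
a, b = -1.0, 1.0
area, _ = quad(X.pdf, a, b)
print(area, X.cdf(b) - X.cdf(a))   # both ≈ 0.6827
print(X.pdf(0.0))                  # density at a point (≈ 0.3989), not a probability
```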

5.3 Distribution Type

For details about distribution types, see here.

6. Expectation and Variance

6.1 Expectation

For a discrete variable: \[ E[X] = \sum_{i=1}^n x_iP(X=x_i) \]

For a continuous variable: \[ E[X] = \int_{-\infty}^\infty xf(x)dx \] provided the integral converges.

Properties of Expectation: \[ E[X+C] = E[X]+C \]

\[ E[CX] = CE[X] \]

\[ E[X + Y] = E[X]+E[Y] \qquad \forall X, Y \]

\[ E[XY] = E[X]E[Y] \qquad \text{if } X \perp \!\!\! \perp Y \]

\[ E[g(X)] = \sum_{i=1}^n g(x_i)P(X=x_i) \qquad \text{or} \qquad \int_{-\infty}^\infty g(x)f(x)dx \]

\[ E[X|Y=y] = \sum_{i=1}^n x_iP(X=x_i|Y=y) \qquad \text{or} \qquad \int_{-\infty}^\infty xf(x|y)dx \]
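A sketch of these formulas in code, using a fair die for the discrete case and an arbitrarily chosen N(2, 1) density for the continuous case:

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

# Discrete expectation: a fair die, E[X] = sum of x_i * P(X = x_i).
x = np.arange(1, 7)
p = np.full(6, 1 / 6)
print(np.sum(x * p))                     # ≈ 3.5

# Continuous expectation: E[X] = integral of x * f(x) dx for X ~ N(2, 1).
f = stats.norm(loc=2, scale=1).pdf
mean, _ = quad(lambda t: t * f(t), -np.inf, np.inf)
print(mean)                              # ≈ 2.0

# E[g(X)] for g(x) = x**2 on the die (the last formula, discrete form).
print(np.sum(x**2 * p))                  # ≈ 15.17
```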

6.2 Variance

\[ V[X] = E[(X-E[X])^2] = E[X^2]-(E[X])^2 \]

Properties of Variance: \[ V[C] = 0 \]

\[ V[X+C] = V[X] \]

\[ V[CX] = C^2V[X] \]

\[ V[X\pm Y] = V[X]+V[Y] \pm 2Cov[X, Y] \]

\[ V[g(X)] \approx g'(E[X])^2V[X] \] (a first-order Taylor approximation, known as the delta method)

6.3 Covariance

\[ Cov[X, Y] = E[(X-E[X])(Y-E[Y])] = E[XY]-E[X]E[Y] \]

Covariance is a measure of the joint variability of two random variables; it represents the degree to which the two variables vary in the same direction.

Properties of Covariance: \[ Cov[X,Y] = 0 \qquad \text{if } X \perp \!\!\! \perp Y \]

\[ |Cov[X, Y]| \le \sqrt {V[X]V[Y]} \]

\[ Cov[X,X] = V[X] \]
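As an empirical sketch (the simulated data and the relation Y = 2X + noise are assumptions made purely for illustration), the identities above can be checked with NumPy:

```python
import numpy as np

# Empirical check of the variance/covariance identities on simulated data.
rng = np.random.default_rng(0)
X = rng.normal(size=100_000)
Y = 2 * X + rng.normal(size=100_000)     # correlated with X by construction

print(np.var(X), np.mean(X**2) - np.mean(X)**2)        # two forms of V[X] agree
cov = np.cov(X, Y)[0, 1]
print(cov, np.mean(X * Y) - np.mean(X) * np.mean(Y))   # Cov[X,Y] ≈ E[XY]-E[X]E[Y]
print(np.var(X + Y), np.var(X) + np.var(Y) + 2 * cov)  # V[X+Y] identity (approx.)
```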

