Lecture 3 - Part 1

Contents

Introduction

Probability Theory

Before introducing detection, estimation, and forecasting techniques in detail, it is important to understand some key probabilistic concepts. Probability theory merits an entire course on its own, so it is impossible to cover all of its aspects in one chapter. Interested readers are referred to [1] for a detailed exposition of the topic.

Sample Space

Let us consider an experiment, e.g., tossing a coin, rolling a die, or predicting the location of a ball in roulette. The outcome of such an experiment is random and not predictable with certainty. Nevertheless, we do know that tossing a coin will result in heads or tails, and that rolling a die will result in a number between 1 and 6. Consequently, while the outcome of the experiment is random and not known in advance, the set of all possible outcomes is known.

Definition

The set of all possible outcomes of an experiment is known as the Sample Space and is commonly denoted by $S$.

Examples

  1. Coin Toss: For a single coin toss, $S_{coin}=\{H,T\}$, i.e., the outcome is either heads ($H$) or tails ($T$).
  2. Student Exam: Consider a cohort of five students taking an exam. The outcome is the ranking of the students after the test, i.e.
\begin{equation} S_{student}=\{ \text{all } 5! \text{ permutations of } (1,2,3,4,5)\}, \end{equation}

      the outcome (2,3,4,1,5), for instance, means that the student with ID 2 came first in the test, and so on.

  3. Double Coin Toss: When flipping two coins, the sample space consists of all ordered pairs:
\begin{equation} \begin{split} S_{dc} &=\{ (H,H), (H,T), (T,T), (T,H)\}, \\ &= \underbrace{\{H,T\}}_{S_{coin}} \times \underbrace{\{H,T\}}_{S_{coin}}. \\ \end{split} \end{equation}

One can appreciate from this example that $S_{dc}$ is the product space of two copies of the single-toss sample space $S_{coin}$.
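The product-space construction can be sketched in Python with `itertools.product` (a minimal illustration; the variable names are ours):

```python
from itertools import product

# Sample space of a single coin toss.
S_coin = {"H", "T"}

# The double-toss sample space is the Cartesian product S_coin × S_coin.
S_dc = set(product(S_coin, repeat=2))

print(sorted(S_dc))  # [('H', 'H'), ('H', 'T'), ('T', 'H'), ('T', 'T')]
```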

Event

Definition

Any subset $E \subset S$ of the sample space is known as an event. In other words, an event $E$ is a set of possible outcomes of the experiment.

If the outcome of the experiment is contained in $E$, we say that event $E$ has occurred. For instance, consider the following examples:

  1. Rolling a die: An event can be rolling an even number, i.e.
\begin{equation} E = \{2,4,6\} \subset S_{dice} \end{equation}
  2. Double coin toss: When flipping two coins, an event can be that at least one tail appears, i.e.,
\begin{equation} E = \{ (T,H), (H, T), (T,T)\} \subset S_{dc} \end{equation}

Let $E$ and $F$ be two events in $S$. The event $E \cup F$, called the union of $E$ and $F$, occurs when either $E$ or $F$ occurs. Similarly, the intersection $EF$ (also written $E \cap F$) contains all outcomes for which both $E$ and $F$ occur. If $E \cap F = \varnothing$, then $E$ and $F$ are mutually exclusive events. The set $S \setminus E$ is the complement of $E$, often denoted by $E^{c}$ or $\bar{E}$. More generally, unions and intersections of events can be defined as:

\begin{equation} \bigcup_{i \in I} E_i \text{ with } I =\{1,2,...,N\} \end{equation} \begin{equation} \bigcap_{i \in I} E_i \text{ with } I =\{1,2,...,N\} \end{equation}

Some examples of these three concepts are as follows:

Figure 1: Set theory diagrams showing union, intersection, and complement

Example 1: Rolling a Die

The sample space is

$S = \{1,2,3,4,5,6\}.$

Let

  • $E = \{2,4,6\}$ (event that the outcome is even),
  • $F = \{4,5,6\}$ (event that the outcome is at least 4).

Then:

  • Intersection:
\begin{equation} EF = \{4,6\} \end{equation}

is the event that the outcome is even and at least 4.

  • Union:
\begin{equation} E \cup F = \{2,4,5,6\} \end{equation}

is the event that the outcome is even or at least 4 (or both).

  • Complement:
\begin{equation} E^c = \{1,3,5\} \end{equation}

is the event that the outcome is odd.
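The die example above maps directly onto Python's built-in set operations (a quick sketch, not part of the original lecture):

```python
# Fair-die events from Example 1, expressed as Python sets.
S = set(range(1, 7))   # sample space {1, ..., 6}
E = {2, 4, 6}          # outcome is even
F = {4, 5, 6}          # outcome is at least 4

print(E & F)   # intersection EF
print(E | F)   # union E ∪ F
print(S - E)   # complement E^c
```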


Example 2: Drawing a Card

The sample space is

$S = \{\text{all 52 cards}\}.$

Let

  • $E = \{\text{all hearts}\}$ (13 cards),
  • $F = \{\text{all face cards (J, Q, K)}\}$ (12 cards).

Then:

  • Intersection:
\begin{equation} EF = \{\heartsuit J, \heartsuit Q, \heartsuit K\} \end{equation}

is the event that the card is a heart and a face card.

  • Union:
\begin{equation} E \cup F = \{\text{all hearts}\} \cup \{\text{all J, Q, K of any suit}\} \end{equation}

is the event that the card is a heart or a face card (or both).

  • Complement:
\begin{equation} F^c = \{\text{all cards except J, Q, K}\} \end{equation}

is the event that the card is not a face card.

Algebra of Events

The operations of forming unions, intersections, and complements of events obey certain rules similar to the rules of algebra.

We list a few of these important laws:

Commutative Laws

\begin{equation} E \cup F = F \cup E \end{equation} \begin{equation} EF = FE \end{equation}

Associative Laws

\begin{equation} (E \cup F) \cup G = E \cup (F \cup G) \end{equation} \begin{equation} (EF)G = E(FG) \end{equation}

Distributive Laws

\begin{equation} (E \cup F)G = EG \cup FG \end{equation} \begin{equation} EF \cup G = (E \cup G)(F \cup G) \end{equation}

De Morgan's Laws in Probability

For two events $E$ and $F$:

  1. Union complement:
\begin{equation} (E \cup F)^c = E^c \cap F^c \end{equation}
  2. Intersection complement:
\begin{equation} (E \cap F)^c = E^c \cup F^c \end{equation}

Explanation:

  • $(E \cup F)^c$ contains all outcomes in neither $E$ nor $F$.
  • $(E \cap F)^c$ contains all outcomes not in both $E$ and $F$, i.e., missing from at least one of them.
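De Morgan's laws can be verified mechanically on a finite sample space (a sketch using the die events from earlier):

```python
# Verify De Morgan's laws for die events E (even) and F (at least 4).
S = set(range(1, 7))
E, F = {2, 4, 6}, {4, 5, 6}

lhs_union = S - (E | F)          # (E ∪ F)^c
rhs_union = (S - E) & (S - F)    # E^c ∩ F^c
lhs_inter = S - (E & F)          # (E ∩ F)^c
rhs_inter = (S - E) | (S - F)    # E^c ∪ F^c

assert lhs_union == rhs_union
assert lhs_inter == rhs_inter
```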

Axioms of Probability

Probability can be defined in a number of ways. One way is in terms of the relative frequency of the event:

\begin{equation} P(E) = \lim_ {n \rightarrow \infty} \frac{n(E)}{n}, \end{equation}

where $P(E)$ is the probability of the event and $n(E)$ is the number of times the event occurs in $n$ runs of the experiment. In other words, $P(E)$ is defined as the limiting proportion of time that $E$ occurs; it is thus a limiting frequency of $E$. This can be verified with a small simulation.


Law of Large Numbers

Suppose we roll a fair six-sided die. The theoretical probability of getting a $3$ is $P(3) = \tfrac{1}{6}$.

If we perform $n$ rolls and observe $k$ outcomes equal to 3, then the empirical probability is

\hat{P}(3) = \frac{k}{n}

As $n \to \infty$, $\hat{P}(3)$ converges to $P(3) = \tfrac{1}{6}$.
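The convergence can be simulated in a few lines (a sketch; `empirical_probability` is our own helper, and the seed is arbitrary):

```python
import random

random.seed(0)  # fixed seed for reproducibility

def empirical_probability(n_rolls, target=3):
    """Estimate P(target) by rolling a fair die n_rolls times."""
    hits = sum(1 for _ in range(n_rolls) if random.randint(1, 6) == target)
    return hits / n_rolls

for n in (100, 10_000, 100_000):
    print(n, empirical_probability(n))  # estimates approach 1/6 ≈ 0.1667
```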

Notice that this definition inherently assumes that $n(E)/n$ converges to a finite value for all repetitions of the experiment. It is difficult to prove this without making assumptions about the convergence. Therefore, modern probability theory instead adopts an axiomatic approach. In particular, for each event $E \subset S$, we assume that there is a probability $P(E)$ which satisfies the following axioms:

Axiom 1

\begin{equation} 0 \leq P(E) \leq 1. \end{equation}

The probability of an event $E$ takes a value between 0 and 1.

Axiom 2

\begin{equation} P(S) = 1 \end{equation}

The outcome of an experiment is a point in $S$ with probability 1.

Axiom 3

For any sequence of mutually exclusive events $\{E_1,E_2,\dots, E_N\}$, i.e., $E_i \cap E_j = \varnothing$ when $i \neq j$,

\begin{equation} P\bigg( \bigcup_{i=1}^{N} E_i \bigg) = \sum_{i=1}^{N}P(E_i) \end{equation}

If events are mutually exclusive, then the chance of at least one happening is just the total of their separate chances.

Key Propositions in Probability Theory

1. Probability of the empty set

Statement:

\begin{equation} P(\varnothing) = 0 \end{equation}

Proof:
The empty set $\varnothing$ is disjoint from $S$ and $S = \varnothing \cup S$. By additivity,

\begin{equation} P(S) = P(\varnothing \cup S) = P(\varnothing) + P(S). \end{equation}

Subtract $P(S)$ from both sides and use $P(S)=1$:

\begin{equation} 0 = P(\varnothing). \end{equation}

2. Boundedness

Statement:

\begin{equation} 0 \le P(A) \le 1 \end{equation}

Proof:
Non-negativity gives $P(A)\ge 0$. Also $A \subseteq S$, and $S = A \cup A^c$ with $A$ and $A^c$ disjoint, so by additivity

\begin{equation} 1 = P(S) = P(A) + P(A^c) \ge P(A), \end{equation}

since $P(A^c)\ge 0$. Hence $P(A)\le 1$. Combining yields $0\le P(A)\le 1$.

Experiment: Toss a fair coin.

  • Sample space: $S=\{H,T\}$
  • Event: $A=\{H\}$

Probability:

\begin{equation} P(A) = \frac{1}{2}, \quad 0 \leq P(A) \leq 1 \end{equation}

3. Complement Rule

Statement:

\begin{equation} P(A^c) = 1 - P(A) \end{equation}

Proof:
Because $A$ and $A^c$ are disjoint and $A\cup A^c=S$,

\begin{equation} 1 = P(S) = P(A\cup A^c) = P(A) + P(A^c). \end{equation}

Rearrange to get $P(A^c)=1-P(A)$.

Experiment: Roll a fair die.

  • Event $A=\{2,4,6\}$ (even)
  • Complement $A^c=\{1,3,5\}$ (odd)

Probabilities:

\begin{equation} P(A) = \frac{3}{6} = \frac{1}{2}, \quad P(A^c) = \frac{3}{6} = \frac{1}{2} \end{equation}

Check complement rule:

\begin{equation} P(A^c) = 1 - P(A) = 1 - \frac{1}{2} = \frac{1}{2} \end{equation}

4. Sub-additivity (union bound)

Statement:

\begin{equation} P(A \cup B) \le P(A) + P(B) \end{equation}

Proof:
Start from the inclusion–exclusion identity (proved below):

\begin{equation} P(A\cup B) = P(A) + P(B) - P(A\cap B). \end{equation}

Since $P(A\cap B)\ge 0$, subtracting it makes the right-hand side $\le P(A)+P(B)$.

Experiment: Draw a card from a deck of 52.

  • Event $A$: heart ($P(A)=13/52$)
  • Event $B$: king ($P(B)=4/52$)

Intersection: $A\cap B = \{\text{King of Hearts}\}$, $P(A\cap B)=1/52$

Union bound:

\begin{equation} P(A\cup B) \le P(A) + P(B) \end{equation}

Actual probability:

\begin{equation} P(A\cup B) = P(A) + P(B) - P(A\cap B) = \frac{13}{52} + \frac{4}{52} - \frac{1}{52} = \frac{16}{52} \le \frac{17}{52} \end{equation}

5. Difference Rule

Statement:

\begin{equation} P(A \setminus B) = P(A) - P(A \cap B) \end{equation}

Proof:
Partition $A$ into the disjoint sets $A\setminus B$ and $A\cap B$:

\begin{equation} A = (A\setminus B)\cup (A\cap B), \qquad (A\setminus B)\cap(A\cap B)=\varnothing. \end{equation}

By additivity,

\begin{equation} P(A) = P(A\setminus B) + P(A\cap B). \end{equation}

Rearrange to obtain the stated identity.

Experiment: Roll a die.

  • Event $A=\{1,2,3,4\}$
  • Event $B=\{3,4,5,6\}$

Then

\begin{equation} A\setminus B = \{1,2\}, \quad A\cap B = \{3,4\} \end{equation}

Probabilities:

\begin{equation} P(A) = \frac{4}{6}, \quad P(A\cap B) = \frac{2}{6}, \quad P(A\setminus B) = \frac{2}{6} \end{equation}

Check difference rule:

\begin{equation} P(A\setminus B) = P(A) - P(A\cap B) = \frac{4}{6} - \frac{2}{6} = \frac{2}{6} \end{equation}

6. Inclusion–Exclusion (two events)

Statement:

\begin{equation} P(A\cup B) = P(A) + P(B) - P(A\cap B) \end{equation}

Proof:
Write $A\cup B$ as a disjoint union:

\begin{equation} A\cup B = (A\setminus B)\ \cup\ (B\setminus A)\ \cup\ (A\cap B), \end{equation}

with the three pieces pairwise disjoint. By additivity,

\begin{equation} P(A\cup B)=P(A\setminus B)+P(B\setminus A)+P(A\cap B). \end{equation}

But

\begin{equation} P(A)=P(A\setminus B)+P(A\cap B),\qquad P(B)=P(B\setminus A)+P(A\cap B). \end{equation}

Adding these and subtracting $P(A\cap B)$ yields

\begin{equation} P(A)+P(B)-P(A\cap B)=P(A\cup B). \end{equation}

Experiment: Draw a card from a deck of 52.

  • Event $A$: heart ($P(A)=13/52$)
  • Event $B$: king ($P(B)=4/52$)
  • Intersection: king of hearts ($P(A\cap B)=1/52$)

Check formula:

\begin{equation} P(A\cup B) = P(A) + P(B) - P(A\cap B) = \frac{13}{52} + \frac{4}{52} - \frac{1}{52} = \frac{16}{52} \end{equation}

7. Inclusion–Exclusion (general form)

Statement:
For events $A_1,\dots,A_n$,

\begin{equation} P\!\left(\bigcup_{i=1}^n A_i\right) = \sum_{i} P(A_i) - \sum_{i<j} P(A_i\cap A_j) + \sum_{i<j<k} P(A_i\cap A_j\cap A_k) - \cdots \end{equation}

Proof (sketch by induction):
The base case $n=1$ is trivial. For $n=2$ we have the two-event formula. Assume it holds for $n-1$ events. Let
$U_{n-1}=\bigcup_{i=1}^{n-1} A_i$. Then

\begin{equation} P\!\left(U_{n-1}\cup A_n\right)=P(U_{n-1})+P(A_n)-P(U_{n-1}\cap A_n). \end{equation}

Apply the induction hypothesis to $P(U_{n-1})$. Note that

\begin{equation} U_{n-1}\cap A_n = \bigcup_{i=1}^{n-1} (A_i\cap A_n), \end{equation}

so applying inclusion–exclusion again inside and collecting terms produces the alternating sum for $n$ sets.

Experiment: Roll a die.

  • $A=\{\text{even}\}=\{2,4,6\}$, $P(A)=3/6$
  • $B=\{\text{prime}\}=\{2,3,5\}$, $P(B)=3/6$
  • $C=\{x \leq 3\}=\{1,2,3\}$, $P(C)=3/6$

Intersections:

\begin{equation} A\cap B=\{2\}, \quad P=1/6; \quad A\cap C=\{2\}, \quad P=1/6; \quad B\cap C=\{2,3\}, \quad P=2/6; \quad A\cap B\cap C=\{2\}, \quad P=1/6 \end{equation}

Formula:

\begin{equation} P(A\cup B\cup C) = P(A)+P(B)+P(C) - [P(A\cap B)+P(A\cap C)+P(B\cap C)] + P(A\cap B\cap C) \end{equation}

Substitute values:

\begin{equation} P(A\cup B\cup C) = 3/6 + 3/6 + 3/6 - (1/6 + 1/6 + 2/6) + 1/6 = 1 \end{equation}

8. Monotonicity

Statement:
If $A\subseteq B$ then

\begin{equation} P(A) \le P(B) \end{equation}

Proof:
When $A\subseteq B$, we can write $B = A \cup (B\setminus A)$. By additivity,

\begin{equation} P(B) = P(A) + P(B\setminus A). \end{equation}

Since $P(B\setminus A)\ge 0$, it follows that $P(B)\ge P(A)$.

Experiment: Roll a die.

  • $A=\{1,2\}$, $P(A)=2/6$
  • $B=\{1,2,3,4\}$, $P(B)=4/6$

Since $A\subseteq B$:

\begin{equation} P(A) \le P(B) \quad \Rightarrow \quad 2/6 \le 4/6 \end{equation}
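All of the propositions above can be checked numerically on the fair-die sample space, where an event's probability is just its size over 6 (a sketch; the helper `P` is our own):

```python
from fractions import Fraction

S = frozenset(range(1, 7))

def P(event):
    """Uniform probability of an event (a subset of S) for a fair die."""
    return Fraction(len(set(event)), len(S))

A, B = {1, 2, 3, 4}, {3, 4, 5, 6}

assert P(set()) == 0                            # 1. empty set
assert 0 <= P(A) <= 1                           # 2. boundedness
assert P(S - A) == 1 - P(A)                     # 3. complement rule
assert P(A | B) <= P(A) + P(B)                  # 4. union bound
assert P(A - B) == P(A) - P(A & B)              # 5. difference rule
assert P(A | B) == P(A) + P(B) - P(A & B)       # 6. inclusion–exclusion
assert P({1, 2}) <= P(A)                        # 8. monotonicity, {1,2} ⊆ A
```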

Conditional Probability

Suppose that we draw one card from a standard deck of 52 playing cards, and suppose that each of the 52 possible outcomes is equally likely to occur and hence has probability

\begin{equation} \frac{1}{52} \end{equation}

Suppose further that we are told that the card drawn is a heart. Then, given this information, what is the probability that the card is a face card (Jack, Queen, or King)?

To calculate this probability, we reason as follows: given that the card is a heart, there are 13 possible outcomes of our experiment, namely,
$\{ \heartsuit A, \heartsuit 2, \heartsuit 3, \dots, \heartsuit K \}$.

Since each of these outcomes originally had the same probability of occurring, they should still have equal probabilities. That is, given that the card is a heart, the (conditional) probability of each of these outcomes is

\begin{equation} \frac{1}{13} \end{equation}

whereas the (conditional) probability of the other 39 points in the sample space is 0.

Hence, the desired probability will be

\begin{equation} \frac{3}{13} \end{equation}

since there are 3 favorable outcomes: $\{ \heartsuit J, \heartsuit Q, \heartsuit K \}$.

If we let $E$ and $F$ denote, respectively, the event that the card is a face card and the event that the card is a heart, then the probability just obtained is called the conditional probability that $E$ occurs given that $F$ has occurred and is denoted by

\begin{equation} P(E \mid F) \end{equation}

Definition

A general formula for $P(E \mid F)$ that is valid for all events $E$ and $F$ is derived in the same manner: if the event $F$ occurs, then, in order for $E$ to occur, the actual outcome must be a point both in $E$ and in $F$; that is, it must be in $E \cap F$. Now, since we know that $F$ has occurred, $F$ becomes our new, or reduced, sample space; hence, the probability that $E$ occurs will equal the probability of $E \cap F$ relative to the probability of $F$. That is, we have the following definition:

\begin{equation} P(E \mid F) = \frac{P(E \cap F)}{P(F)}, \qquad \text{with } P(F)>0. \end{equation}

Figure 2: Visual representation
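The reduced-sample-space reasoning can be replayed by brute-force enumeration of the deck (a sketch; the `ranks`/`suits` labels are our own):

```python
from fractions import Fraction

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
S = {(r, s) for r in ranks for s in suits}            # all 52 cards

E = {(r, s) for (r, s) in S if r in ("J", "Q", "K")}  # face card
F = {(r, s) for (r, s) in S if s == "hearts"}         # heart

def P(event):
    return Fraction(len(event), len(S))

P_E_given_F = P(E & F) / P(F)                         # P(E|F) = P(E ∩ F)/P(F)
print(P_E_given_F)  # 3/13
```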

Multiplication Rule in Probability

The multiplication rule expresses the probability of the intersection of events in terms of conditional probabilities. For $n$ events $A_1, A_2, \dots, A_n$, the joint probability can be written as:

\begin{equation} P(A_1 \cap A_2 \cap \dots \cap A_n) = P(A_1) \, P(A_2 \mid A_1) \, P(A_3 \mid A_1 \cap A_2) \, \dots \, P(A_n \mid A_1 \cap A_2 \cap \dots \cap A_{n-1}) \end{equation}

Or using product notation:

\begin{equation} P\Bigg(\bigcap_{i=1}^{n} A_i\Bigg) = \prod_{i=1}^{n} P\Big(A_i \,\big|\, \bigcap_{j=1}^{i-1} A_j\Big) \end{equation}
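As a worked instance of the chain rule (our own example, not from the lecture): the probability of drawing three hearts in a row from a 52-card deck without replacement factors as $P(H_1)\,P(H_2 \mid H_1)\,P(H_3 \mid H_1 \cap H_2)$:

```python
from fractions import Fraction

# P(H1 ∩ H2 ∩ H3) = P(H1) · P(H2|H1) · P(H3|H1 ∩ H2):
# 13 hearts out of 52, then 12 of 51, then 11 of 50.
p = Fraction(13, 52) * Fraction(12, 51) * Fraction(11, 50)
print(p)  # 11/850
```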

Bayes Theorem

Bayes' Theorem provides a systematic way to update the probability of an event $A$ when new evidence $B$ is observed.
It is based on the idea that the posterior probability $P(A \mid B)$ depends on:

  1. The prior probability $P(A)$, representing our initial belief before seeing the evidence,
  2. The likelihood $P(B \mid A)$, the probability of observing the evidence if the event occurs, and
  3. The total probability of the evidence $P(B)$, which normalizes the posterior so that all probabilities sum to 1.

Definition

Formally, Bayes' Theorem is:

P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}

If the evidence $B$ can occur under a set of mutually exclusive and exhaustive events $\{E_1, E_2, \dots, E_n\}$, then the Law of Total Probability gives:

P(B) = \sum_{i=1}^{n} P(B \mid E_i) \, P(E_i)

Combining these gives the full form of Bayes' Theorem:

P(A \mid B) = \frac{P(B \mid A) \, P(A)}{\sum_{i=1}^{n} P(B \mid E_i) \, P(E_i)}

This formula shows that the posterior probability increases if the evidence $B$ is more likely when $A$ occurs (higher likelihood) or if the prior probability $P(A)$ is larger, but it is moderated by how likely the evidence is overall.

Example 1: Low prevalence, high test accuracy.

Consider a medical test for a rare condition:

  • AA = "person has the condition" with P(A)=0.01P(A) = 0.01,
  • E2E_2 = "person does not have the condition" with P(E2)=0.99P(E_2) = 0.99,
  • P(BA)=0.95P(B \mid A) = 0.95 (sensitivity),
  • P(BE2)=0.05P(B \mid E_2) = 0.05 (false positive rate).

Then the probability of having the condition given a positive test is:

P(AB)=0.950.010.950.01+0.050.990.161P(A \mid B) = \frac{0.95 \cdot 0.01}{0.95 \cdot 0.01 + 0.05 \cdot 0.99} \approx 0.161

Even though the test is positive, the posterior probability rises only to 16%, because the condition is rare and false positives are possible.

Counterexample: Higher prevalence, lower test accuracy.

Now suppose the condition is more common and the test is less accurate:

  • $P(A) = 0.3$,
  • $P(E_2) = 0.7$,
  • $P(B \mid A) = 0.7$,
  • $P(B \mid E_2) = 0.2$.

Then the posterior probability is:

P(A \mid B) = \frac{0.7 \cdot 0.3}{0.7 \cdot 0.3 + 0.2 \cdot 0.7} = \frac{0.21}{0.35} = 0.6

A positive test now increases the probability from 30% to 60%, showing how a higher prevalence can yield a strong posterior even with a less accurate test.
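Both medical-test calculations follow the same two-hypothesis pattern, which can be wrapped in a small helper (a sketch; `bayes_posterior` is our own name):

```python
def bayes_posterior(prior, sensitivity, false_positive):
    """P(A|B) from the prior P(A), the likelihood P(B|A), and the
    false-positive rate P(B|A^c), via the law of total probability."""
    evidence = sensitivity * prior + false_positive * (1 - prior)
    return sensitivity * prior / evidence

# Example 1: low prevalence, accurate test.
print(round(bayes_posterior(0.01, 0.95, 0.05), 3))  # 0.161
# Counterexample: higher prevalence, weaker test.
print(round(bayes_posterior(0.30, 0.70, 0.20), 3))  # 0.6
```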

Key Takeaways:

  • Bayes’ Theorem combines prior knowledge and new evidence to produce a rational update of probabilities.
  • The denominator (total probability of the evidence) ensures the posterior is properly normalized.
  • Posterior probabilities depend on prevalence, likelihood, and false positive/negative rates.
  • It provides a quantitative framework for reasoning under uncertainty, widely used in many fields.

Practical Applications of Bayes’ Theorem

General Applications:

  1. Medical Diagnosis: Estimating the probability of disease given test results, as shown above.
  2. Risk Assessment: Calculating likelihoods of rare events (e.g., accidents, system failures).
  3. Decision Making: Updating beliefs based on new evidence in finance, law, and engineering.
  4. Fault Detection: In manufacturing or electronics, estimating the probability of failure given observed signals.
  5. Spam Detection: Filtering emails based on the probability that certain words indicate spam.

Some usage examples are as follows:

1. Spam Email Classification (Naive Bayes)

Suppose we want to classify an email as spam ($S$) or not spam ($H$) based on the presence of the word "discount" ($W$).

  • Prior probabilities: $P(S) = 0.2$, $P(H) = 0.8$
  • Likelihoods: $P(W \mid S) = 0.7$, $P(W \mid H) = 0.1$

Compute the probability that the email is spam given it contains "discount":

P(S \mid W) = \frac{P(W \mid S) P(S)}{P(W \mid S) P(S) + P(W \mid H) P(H)} = \frac{0.7 \cdot 0.2}{0.7 \cdot 0.2 + 0.1 \cdot 0.8} = \frac{0.14}{0.14 + 0.08} = \frac{0.14}{0.22} \approx 0.636

Interpretation: The email is approximately 63.6% likely to be spam given it contains the word "discount".

2. Disease Prediction in ML Model (Binary Classification)

Suppose a binary classifier predicts a disease based on a symptom feature:

  • Prior probabilities: $P(\text{Disease}) = 0.05$, $P(\text{No Disease}) = 0.95$
  • Likelihoods: $P(\text{Symptom} \mid \text{Disease}) = 0.9$, $P(\text{Symptom} \mid \text{No Disease}) = 0.1$

Compute the posterior probability:

P(\text{Disease} \mid \text{Symptom}) = \frac{0.9 \cdot 0.05}{0.9 \cdot 0.05 + 0.1 \cdot 0.95} = \frac{0.045}{0.045 + 0.095} = \frac{0.045}{0.14} \approx 0.321

Interpretation: Even though the symptom is strongly associated with the disease, the posterior probability is only 32.1% due to the low prevalence.

3. Feature-Based Classification in Naive Bayes

Suppose a classifier uses two conditionally independent features $F_1$ and $F_2$:

  • Priors: $P(\text{Class}=1) = 0.4$, $P(\text{Class}=0) = 0.6$
  • Likelihoods for Class 1: $P(F_1=1 \mid \text{Class}=1)=0.8$, $P(F_2=1 \mid \text{Class}=1)=0.7$
  • Likelihoods for Class 0: $P(F_1=1 \mid \text{Class}=0)=0.3$, $P(F_2=1 \mid \text{Class}=0)=0.2$

If we observe $F_1=1$ and $F_2=1$, the posterior is:

P(\text{Class}=1 \mid F_1=1, F_2=1) = \frac{0.8 \cdot 0.7 \cdot 0.4}{0.8 \cdot 0.7 \cdot 0.4 + 0.3 \cdot 0.2 \cdot 0.6} = \frac{0.224}{0.224 + 0.036} = \frac{0.224}{0.26} \approx 0.862

Interpretation: Given both features are present, there is an 86.2% chance that the sample belongs to Class 1.

Example: IoT Sensor Fault Detection Using MTBF

Suppose we have a temperature sensor with a Mean Time Between Failures (MTBF) of 1000 hours. We want to determine the probability that the sensor has failed ($F$) given that it shows an abnormal reading ($A$) after 200 hours of operation; let $W$ denote the event that it is still working.

Step 1: Convert MTBF to failure probability

The failure rate per hour is approximately:

\lambda = \frac{1}{\text{MTBF}} = \frac{1}{1000} = 0.001 \text{ per hour}

After $t = 200$ hours, the prior probability of failure (assuming an exponential failure-time distribution) is:

P(F) = 1 - e^{-\lambda t} = 1 - e^{-0.001 \cdot 200} \approx 0.181

So, $P(W) = 1 - P(F) = 0.819$.


Step 2: Sensor behavior (likelihoods)

  • If the sensor is faulty, the probability of an abnormal reading is $P(A \mid F) = 0.9$
  • If the sensor is working, the probability of a false alarm is $P(A \mid W) = 0.05$

Step 3: Apply Bayes’ Theorem

P(F \mid A) = \frac{P(A \mid F) P(F)}{P(A \mid F) P(F) + P(A \mid W) P(W)}

Substitute values:

P(F \mid A) = \frac{0.9 \cdot 0.181}{0.9 \cdot 0.181 + 0.05 \cdot 0.819} = \frac{0.1629}{0.1629 + 0.04095} = \frac{0.1629}{0.20385} \approx 0.799

Interpretation: Given an abnormal reading after 200 hours of operation, there is approximately an 80% chance that the sensor has failed.

Insight: By combining MTBF-based prior probability and observed evidence, Bayes’ Theorem helps in predictive maintenance to identify likely sensor failures early and reduce system downtime.
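The three steps can be reproduced end to end (a sketch under the same assumptions: exponential failure times and the stated likelihoods):

```python
import math

MTBF = 1000.0                      # mean time between failures, hours
t = 200.0                          # operating time, hours
lam = 1 / MTBF                     # failure rate per hour

# Step 1: prior failure probability under an exponential lifetime model.
P_F = 1 - math.exp(-lam * t)       # ≈ 0.181
P_W = 1 - P_F

# Step 2: likelihoods of an abnormal reading.
P_A_given_F = 0.90                 # if faulty
P_A_given_W = 0.05                 # false alarm if working

# Step 3: Bayes' theorem.
posterior = P_A_given_F * P_F / (P_A_given_F * P_F + P_A_given_W * P_W)
print(round(posterior, 3))         # ≈ 0.799
```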

Random Variable

Generally, when an experiment is performed, we are interested in functions of the outcome rather than the outcome itself. For instance, when rolling two dice we may care whether the faces sum to 6, and not so much about the individual face values of each roll. The sum equals 6 for any of the outcomes in the set $\hat{S}=\{(1,5), (2,4), (3,3), (4,2), (5,1)\}$. Such functions, which map the outcomes in the sample space to real values, are known as random variables. Formally,

Definition

A random variable (RV) XX is a function mapping outcomes to real values:

\begin{equation} X: S \to \mathbb{R} \end{equation}

A discrete RV takes values in a countable set (e.g., a subset of $\mathbb{Z}$), while a continuous RV can take values anywhere in $\mathbb{R}$.

Probability Mass Function (PMF)

For a discrete random variable:

\begin{equation} p_X(x) = P(X = x) \end{equation}

with

\begin{equation} \sum_x p_X(x) = 1 \end{equation}

Example

Suppose that our experiment consists of tossing 3 fair coins. If we let $Y$ denote the number of heads that appear, then $Y$ is a random variable taking on one of the values $0, 1, 2,$ and $3$ with respective probabilities

\begin{equation} \begin{gather} P\{Y=0\}=P\{(T, T, T)\}=\frac{1}{8},\\ P\{Y=1\}=P\{(T, T, H),(T, H, T),(H, T, T)\}=\frac{3}{8},\\ P\{Y=2\}=P\{(T, H, H),(H, T, H),(H, H, T)\}=\frac{3}{8},\\ P\{Y=3\}=P\{(H, H, H)\}=\frac{1}{8}. \end{gather} \end{equation}
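This PMF can be obtained by enumerating all $2^3$ equally likely outcomes (a minimal sketch):

```python
from collections import Counter
from fractions import Fraction
from itertools import product

outcomes = list(product("HT", repeat=3))              # 8 equally likely tosses
counts = Counter(seq.count("H") for seq in outcomes)  # Y = number of heads

pmf = {y: Fraction(c, len(outcomes)) for y, c in counts.items()}
print(dict(sorted(pmf.items())))
# {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 8)}
```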

Probability Density Function (PDF)

We say that $X$ is a continuous random variable if there exists a nonnegative function $f_X$ with

\begin{equation} f_X(x) \geq 0, \quad \int_{-\infty}^{\infty} f_X(x)\, dx = 1 \end{equation}

so that we can define the probability of any set $B \subseteq \mathbb{R}$ to be

\begin{equation} P(X \in B) = \int_{B} f_X(x)\,dx. \end{equation}

In other words, the probability of the RV $X$ taking a value in the interval $[a,b]$ is given by:

\begin{gather} P(a \leq X \leq b) = \int_{a}^{b} f_X(x)\,dx = F_X(b)-F_X(a), \\ F_X(z) = \int_{-\infty}^{z} f_X(x)\,dx, \end{gather}

where $F_X$ is the cumulative distribution function introduced next.

Cumulative Distribution Function (CDF)

For both discrete and continuous cases:

\begin{equation} F_X(x) = P(X \leq x) \end{equation}

Expectation and Variance

Expectation (Mean):

\begin{equation} \mathbb{E}[X] = \begin{cases} \sum_x x\, p_X(x) & \text{discrete} \\ \int_{-\infty}^\infty x f_X(x)\, dx & \text{continuous} \end{cases} \end{equation}

Variance:

\begin{equation} \mathrm{Var}(X) = \mathbb{E}[(X - \mathbb{E}[X])^2] \end{equation}
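For a discrete RV the two formulas are finite sums; a fair die makes a quick check (our own example):

```python
from fractions import Fraction

values = range(1, 7)
p = Fraction(1, 6)                               # fair-die PMF

mean = sum(x * p for x in values)                # E[X] = Σ x p_X(x)
var = sum((x - mean) ** 2 * p for x in values)   # Var(X) = E[(X − E[X])²]

print(mean, var)  # 7/2 35/12
```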

Inequalities of Expected Value

Expected value inequalities provide upper bounds on the probability of a random variable's value deviating from its expected value. They are used when the exact probability distribution is unknown, offering a way to make probabilistic statements with limited information.


1. Markov's Inequality

Markov's inequality is a fundamental tool that applies to any non-negative random variable. It gives an upper bound on the probability that the random variable is greater than or equal to some positive constant.

Statement: For a non-negative random variable $X$ and a positive constant $a > 0$, the inequality is:

\begin{equation} P(X \ge a) \le \frac{E[X]}{a} \end{equation}

Analogy: If you know the average length of a movie is 120 minutes ($E[X]=120$), you can use this inequality to say that the probability of randomly picking a movie that is 240 minutes long or longer is no more than $120/240 = 0.5$.


2. Chebyshev's Inequality

Chebyshev's inequality is a more powerful version of Markov's that uses the variance of the random variable. It provides a tighter bound on the probability that a random variable deviates from its mean by more than a certain amount. It applies to any random variable with a finite mean and variance.

Statement: For a random variable $X$ with finite expected value $\mu = E[X]$ and finite non-zero variance $\sigma^2 = \mathrm{Var}(X)$, the inequality is, for any $k > 0$:

\begin{equation} P(|X - \mu| \ge k) \le \frac{\sigma^2}{k^2} \end{equation}

Analogy: If you're measuring the heights of students and you know the average height is 5'9" with a standard deviation of 2 inches, Chebyshev's inequality with $k = 4$ inches (two standard deviations) guarantees that no more than $1/4$ of the students are either shorter than 5'5" or taller than 6'1".
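Both bounds can be checked exactly for a fair die (our own example; note the bounds hold but are not tight here):

```python
from fractions import Fraction

values = range(1, 7)
p = Fraction(1, 6)

mean = sum(x * p for x in values)                      # E[X] = 7/2
var = sum((x - mean) ** 2 * p for x in values)         # Var(X) = 35/12

# Markov: P(X ≥ 5) ≤ E[X]/5
P_ge_5 = sum(p for x in values if x >= 5)              # 2/6
assert P_ge_5 <= mean / 5

# Chebyshev: P(|X − μ| ≥ 2) ≤ σ²/2²
P_dev = sum(p for x in values if abs(x - mean) >= 2)   # X ∈ {1, 6}
assert P_dev <= var / 4
```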

3. Jensen's Inequality

Jensen's inequality is different from the others as it doesn't directly bound probabilities. Instead, it relates the expected value of a convex or concave function to the function of the expected value. It's a foundational concept in optimization and information theory.

Statement: For a convex function $\phi$:

\begin{equation} E[\phi(X)] \ge \phi(E[X]) \end{equation}

For a concave function $\phi$:

\begin{equation} E[\phi(X)] \le \phi(E[X]) \end{equation}

Analogy: Imagine a game where your winnings are a convex function of your dice roll. Jensen's inequality tells you that your average winnings will be greater than or equal to what you'd win if the die always landed on its average value (3.5). This shows that variability can be beneficial for convex functions.
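Jensen's inequality for the convex function $\phi(x)=x^2$ can be verified on the same fair die (our own example): $E[X^2] = 91/6$ exceeds $(E[X])^2 = 49/4$.

```python
from fractions import Fraction

values = range(1, 7)
p = Fraction(1, 6)

def phi(x):
    return x ** 2                          # a convex function

E_phi = sum(phi(x) * p for x in values)    # E[φ(X)] = 91/6
phi_E = phi(sum(x * p for x in values))    # φ(E[X]) = 49/4

assert E_phi >= phi_E                      # Jensen for convex φ
```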

References

[1] S. M. Ross, "A First Course in Probability", Pearson.