Lecture 3 - Part 1

Introduction

Probability Theory

Before introducing detection, estimation, and forecasting techniques in detail, it is important to understand some key probabilistic concepts. Probability theory merits an entire course of its own, so a single chapter cannot cover every aspect. Interested readers are referred to [1] for a detailed exposition of the topic.

Sample Space

Let us consider an experiment, e.g., tossing a coin, rolling a die, or predicting where the ball lands in roulette. The outcome of such an experiment is random and cannot be predicted with certainty. Nevertheless, we do know that tossing a coin will result in heads or tails, and that rolling a die will result in a number between 1 and 6. Consequently, while the outcome of the experiment is random and not known in advance, the set of all possible outcomes is known.

Definition

The set of all possible outcomes of an experiment is known as the Sample Space and is commonly denoted by $S$ .

Examples

Coin Toss: In case of coin-toss example, $S_{coin}=\{H,T\}$ , i.e. the outcome is either heads( $H$ ) or tails( $T$ ).
Student Exam: Consider a cohort of five students sitting an exam. The outcome is the ranking of the students after the test, i.e.

\begin{equation} S_{student}=\{ \text{all } 5! \text{ permutations of } (1,2,3,4,5)\}, \end{equation}

where the outcome (2,3,4,1,5), for instance, means that the student with ID 2 came first in the test, the student with ID 3 came second, and so on.

Double Coin-toss: When two coins are flipped, the sample space consists of all ordered pairs of single-toss outcomes:

\begin{equation} \begin{split} S_{dc} &=\{ (H,H), (H,T), (T,T), (T,H)\}, \\ &= \underbrace{\{H,T\}}_{S_{coin}} \times \underbrace{\{H,T\}}_{S_{coin}}. \\ \end{split} \end{equation}

One can appreciate from this example that $S_{dc}$ is the product space of the single coin-toss sample space $S_{coin}$ with itself.
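The product structure can be reproduced in a few lines of Python. This is only an illustrative sketch; the variable names `S_coin` and `S_dc` are chosen here to mirror the notation above.

```python
from itertools import product

# Sample space of a single coin toss.
S_coin = {"H", "T"}

# Sample space of two tosses: the Cartesian product S_coin x S_coin.
S_dc = set(product(S_coin, repeat=2))

print(sorted(S_dc))  # [('H', 'H'), ('H', 'T'), ('T', 'H'), ('T', 'T')]
```

The size of the product space is the product of the sizes, here $2 \times 2 = 4$ outcomes.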

Event

Definition

Any subset $E \subset S$ of the sample space is known as an event. In other words, an event ( $E$ ) is a set of possible outcomes of the experiment.

If the outcome of the experiment is contained in the set $E$ , we say that event $E$ has occurred. For instance, consider the following examples:

Rolling a die: Consider rolling a die; an event can be rolling an even number, i.e.

\begin{equation} E = \{2,4,6\} \subset S_{dice} \end{equation}

Double coin-toss: Imagine flipping two coins, then an event can be that at least one tail appears, i.e.,

\begin{equation} E = \{ (T,H), (H, T), (T,T)\} \subset S_{dc} \end{equation}

Let $E$ and $F$ be two events in $S$ . The event $E \cup F$ , called the union of $E$ and $F$ , occurs when either $E$ or $F$ (or both) occurs. Similarly, the event $EF$ (also written $E \cap F$ ), called the intersection of $E$ and $F$ , contains all outcomes that belong to both $E$ and $F$ , so it occurs only when both $E$ and $F$ occur. If $E \cap F = \varnothing$ , then $E$ and $F$ are mutually exclusive events. The set $S \setminus E$ is the complement of $E$ , often denoted by $E^{c}$ or $\bar{E}$ . More generally, unions and intersections of events $E_1, \dots, E_N$ can be defined as:

\begin{equation} \bigcup_{i \in I} E_i \text{ with } I =\{1,2,...,N\} \end{equation}

\begin{equation} \bigcap_{i \in I} E_i \text{ with } I =\{1,2,...,N\} \end{equation}

Some examples of these three concepts are as follows:

Figure 1: Set theory diagrams showing union, intersection, and complement

Example 1: Rolling a Die

The sample space is

\begin{equation} S = \{1,2,3,4,5,6\}. \end{equation}

Let

$E = \{2,4,6\}$ (event that the outcome is even),
$F = \{4,5,6\}$ (event that the outcome is at least 4).

Then:

Intersection:

\begin{equation} EF = \{4,6\} \end{equation}

is the event that the outcome is even and at least 4.

Union:

\begin{equation} E \cup F = \{2,4,5,6\} \end{equation}

is the event that the outcome is even or at least 4 (or both).

Complement:

\begin{equation} E^c = \{1,3,5\} \end{equation}

is the event that the outcome is odd.
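The three operations in this example map directly onto Python's built-in set operators. The snippet below is a small sketch of that correspondence; the variable names are illustrative.

```python
# Sample space and events for one roll of a die.
S = {1, 2, 3, 4, 5, 6}
E = {2, 4, 6}  # outcome is even
F = {4, 5, 6}  # outcome is at least 4

intersection = E & F   # EF: even and at least 4
union = E | F          # E ∪ F: even or at least 4 (or both)
complement_E = S - E   # E^c: odd

print(intersection, union, complement_E)
```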

Example 2: Drawing a Card

The sample space is

\begin{equation} S = \{\text{all 52 cards}\}. \end{equation}

Let

$E = \{\text{all hearts}\}$ (13 cards),
$F = \{\text{all face cards (J, Q, K)}\}$ (12 cards).

Then:

Intersection:

\begin{equation} EF = \{\heartsuit J, \heartsuit Q, \heartsuit K\} \end{equation}

is the event that the card is a heart and a face card.

Union:

\begin{equation} E \cup F = \{\text{all hearts}\} \cup \{\text{all J, Q, K of any suit}\} \end{equation}

is the event that the card is a heart or a face card (or both).

Complement:

\begin{equation} F^c = \{\text{all cards except J, Q, K}\} \end{equation}

is the event that the card is not a face card.

Algebra of Events

The operations of forming unions, intersections, and complements of events obey certain rules similar to the rules of algebra.

We list a few of these important laws:

Commutative Laws

\begin{equation} E \cup F = F \cup E \end{equation}

\begin{equation} EF = FE \end{equation}

Associative Laws

\begin{equation} (E \cup F) \cup G = E \cup (F \cup G) \end{equation}

\begin{equation} (EF)G = E(FG) \end{equation}

Distributive Laws

\begin{equation} (E \cup F)G = EG \cup FG \end{equation}

\begin{equation} EF \cup G = (E \cup G)(F \cup G) \end{equation}

De Morgan's Laws in Probability

For two events $E$ and $F$ :

Union complement:

\begin{equation} (E \cup F)^c = E^c \cap F^c \end{equation}

Intersection complement:

\begin{equation} (E \cap F)^c = E^c \cup F^c \end{equation}

Explanation:

$(E \cup F)^c$ contains all outcomes that are in neither $E$ nor $F$ .
$(E \cap F)^c$ contains all outcomes that are missing from at least one of $E$ and $F$ , i.e., outcomes that are not in both.
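For a finite sample space, De Morgan's laws can be checked exhaustively. The sketch below does this for the die sample space by testing every pair of events; the helper names `all_events` and `complement` are introduced here purely for illustration.

```python
from itertools import chain, combinations

S = frozenset({1, 2, 3, 4, 5, 6})

def all_events(sample_space):
    """Return every subset (event) of a finite sample space."""
    items = sorted(sample_space)
    subsets = chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )
    return [frozenset(s) for s in subsets]

def complement(event, sample_space=S):
    """Complement of an event relative to the sample space."""
    return sample_space - event

events = all_events(S)

# Check both laws for every pair (E, F) of events.
de_morgan_holds = all(
    complement(E | F) == (complement(E) & complement(F))
    and complement(E & F) == (complement(E) | complement(F))
    for E in events
    for F in events
)

print(len(events), de_morgan_holds)  # 64 True
```

A check like this is not a proof for general sample spaces, but it is a quick sanity test of the identities.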

Axioms of Probability

Probability can be defined in a number of ways. One way is in terms of the relative frequency with which an event occurs:

\begin{equation} P(E) = \lim_ {n \rightarrow \infty} \frac{n(E)}{n}, \end{equation}

where $P(E)$ is the probability of the event and $n(E)$ is the number of times the event occurs in $n$ runs of the experiment. In other words, $P(E)$ is defined as the limiting proportion of runs in which $E$ occurs, i.e., the limiting frequency of $E$ . This can be checked empirically by repeating an experiment, such as rolling a die, a large number of times.

Law of Large Numbers

Suppose we roll a fair six-sided die. The theoretical probability of getting a $3$ is $P(3) = \tfrac{1}{6}$ .

If we perform $n$ rolls and observe $k$ outcomes equal to 3, then the empirical probability is

\begin{equation} \hat{P}(3) = \frac{k}{n} \end{equation}

As $n \to \infty$ , $\hat{P}(3)$ converges to $P(3) = \tfrac{1}{6}$ .
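This convergence can be observed with a short simulation. The sketch below uses Python's standard pseudo-random generator as a stand-in for a fair die; the function name `empirical_probability` and the fixed seed are assumptions made for reproducibility, not part of the lecture material.

```python
import random

def empirical_probability(face, n_rolls, seed=0):
    """Estimate P(face) as k / n from n_rolls simulated rolls of a fair die."""
    rng = random.Random(seed)  # seeded so that repeated runs give the same result
    hits = sum(1 for _ in range(n_rolls) if rng.randint(1, 6) == face)
    return hits / n_rolls

# The estimate should move closer to 1/6 ≈ 0.1667 as n grows.
for n in (100, 10_000, 200_000):
    print(n, empirical_probability(3, n))
```

For small $n$ the estimate can be far from $1/6$; the fluctuations shrink roughly like $1/\sqrt{n}$.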

Notice that this definition inherently assumes that $n(E)/n$ converges to the same finite value for every sequence of repetitions of the experiment. This is difficult to prove without assuming the convergence in the first place. Therefore, modern probability theory adopts an axiom-based approach instead. In particular, for each event $E \subset S$ , we assume that there is a number $P(E)$ , called the probability of the event, which satisfies the following axioms:

Axiom 1

\begin{equation} 0 \leq P(E) \leq 1. \end{equation}

The probability of an event $E$ takes value between 0 and 1.

Axiom 2

\begin{equation} P(S) = 1 \end{equation}

The outcome of an experiment is a point in $S$ with probability 1.

Axiom 3

For any sequence of mutually exclusive events $\{E_1,E_2,\dots, E_N\}$ , i.e., events with $E_i E_j = \varnothing$ when $i \neq j$ ,

\begin{equation} P\bigg( \bigcup_{i=1}^{N} E_i \bigg) = \sum_{i=1}^{N}P(E_i) \end{equation}

If events are mutually exclusive, then the chance of at least one happening is just the total of their separate chances.

Key Propositions in Probability Theory

1. Probability of the empty set

Statement:

\begin{equation} P(\varnothing) = 0 \end{equation}

Proof:
The empty set $\varnothing$ is disjoint from $S$ and $S = \varnothing \cup S$ . By additivity,

\begin{equation} P(S) = P(\varnothing \cup S) = P(\varnothing) + P(S). \end{equation}

Subtract $P(S)$ from both sides and use $P(S)=1$ :

\begin{equation} 0 = P(\varnothing). \end{equation}

2. Boundedness

Statement:

\begin{equation} 0 \le P(A) \le 1 \end{equation}

Proof:
Non-negativity gives $P(A)\ge0$ . Also $A \subseteq S$ , and $S = A \cup A^c$ with $A$ and $A^c$ disjoint, so by additivity

\begin{equation} 1 = P(S) = P(A) + P(A^c) \ge P(A), \end{equation}

since $P(A^c)\ge0$ . Hence $P(A)\le1$ . Combining yields $0\le P(A)\le1$ .

Experiment: Toss a fair coin.

Sample space: $S=\{H,T\}$
Event: $A=\{H\}$

Probability:

\begin{equation} P(A) = \frac{1}{2}, \quad 0 \leq P(A) \leq 1 \end{equation}

3. Complement Rule

Statement:

\begin{equation} P(A^c) = 1 - P(A) \end{equation}

Proof:
Because $A$ and $A^c$ are disjoint and $A\cup A^c=S$ ,

\begin{equation} 1 = P(S) = P(A\cup A^c) = P(A) + P(A^c). \end{equation}

Rearrange to get $P(A^c)=1-P(A)$ .

Experiment: Roll a fair die.

Event $A=\{2,4,6\}$ (even)
Complement $A^c=\{1,3,5\}$ (odd)

Probabilities:

\begin{equation} P(A) = \frac{3}{6} = \frac{1}{2}, \quad P(A^c) = \frac{3}{6} = \frac{1}{2} \end{equation}

Check complement rule:

\begin{equation} P(A^c) = 1 - P(A) = 1 - \frac{1}{2} = \frac{1}{2} \end{equation}

4. Sub-additivity (union bound)

Statement:

\begin{equation} P(A \cup B) \le P(A) + P(B) \end{equation}

Proof:
Start from the inclusion–exclusion identity (proved below):

\begin{equation} P(A\cup B) = P(A) + P(B) - P(A\cap B). \end{equation}

Since $P(A\cap B)\ge0$ , subtracting it makes the right-hand side $\le P(A)+P(B)$ .

Experiment: Draw a card from a deck of 52.

Event $A$ : heart ( $P(A)=13/52$ )
Event $B$ : king ( $P(B)=4/52$ )

Intersection: $A\cap B = \{\text{King of Hearts}\}, P(A\cap B)=1/52$

Union bound:

\begin{equation} P(A\cup B) \le P(A) + P(B) \end{equation}

Actual probability:

\begin{equation} P(A\cup B) = P(A) + P(B) - P(A\cap B) = \frac{13}{52} + \frac{4}{52} - \frac{1}{52} = \frac{16}{52} \le \frac{17}{52} \end{equation}

5. Difference Rule

Statement:

\begin{equation} P(A \setminus B) = P(A) - P(A \cap B) \end{equation}

Proof:
Partition $A$ into disjoint sets $A\setminus B$ and $A\cap B$ :

\begin{equation} A = (A\setminus B)\cup (A\cap B), \qquad (A\setminus B)\cap(A\cap B)=\varnothing. \end{equation}

By additivity,

\begin{equation} P(A) = P(A\setminus B) + P(A\cap B). \end{equation}

Rearrange to obtain the stated identity.

Experiment: Roll a die.

Event $A=\{1,2,3,4\}$
Event $B=\{3,4,5,6\}$

Then

\begin{equation} A\setminus B = \{1,2\}, \quad A\cap B = \{3,4\} \end{equation}

Probabilities:

\begin{equation} P(A) = \frac{4}{6}, \quad P(A\cap B) = \frac{2}{6}, \quad P(A\setminus B) = \frac{2}{6} \end{equation}

Check difference rule:

\begin{equation} P(A\setminus B) = P(A) - P(A\cap B) = \frac{4}{6} - \frac{2}{6} = \frac{2}{6} \end{equation}

6. Inclusion–Exclusion (two events)

Statement:

\begin{equation} P(A\cup B) = P(A) + P(B) - P(A\cap B) \end{equation}

Proof:
Write $A\cup B$ as a disjoint union:

\begin{equation} A\cup B = (A\setminus B)\ \cup\ (B\setminus A)\ \cup\ (A\cap B), \end{equation}

with the three pieces pairwise disjoint. By additivity,

\begin{equation} P(A\cup B)=P(A\setminus B)+P(B\setminus A)+P(A\cap B). \end{equation}

But

\begin{equation} P(A)=P(A\setminus B)+P(A\cap B),\qquad P(B)=P(B\setminus A)+P(A\cap B). \end{equation}

Adding these and subtracting $P(A\cap B)$ yields

\begin{equation} P(A)+P(B)-P(A\cap B)=P(A\cup B). \end{equation}

Experiment: Draw a card from a deck of 52.

Event $A$ : heart ( $P(A)=13/52$ )
Event $B$ : king ( $P(B)=4/52$ )
Intersection: king of hearts ( $P(A\cap B)=1/52$ )

Check formula:

\begin{equation} P(A\cup B) = P(A) + P(B) - P(A\cap B) = \frac{13}{52} + \frac{4}{52} - \frac{1}{52} = \frac{16}{52} \end{equation}

7. Inclusion–Exclusion (general form)

Statement:
For events $A_1,\dots,A_n$ ,

\begin{equation} P\!\left(\bigcup_{i=1}^n A_i\right) = \sum_{i} P(A_i) - \sum_{i<j} P(A_i\cap A_j) + \sum_{i<j<k} P(A_i\cap A_j\cap A_k) - \cdots \end{equation}

Proof (sketch by induction):
Base $n=1$ is trivial. For $n=2$ we have the two-event formula. Assume it holds for $n-1$ events. Let
$U_{n-1}=\bigcup_{i=1}^{n-1} A_i$ . Then

\begin{equation} P\!\left(U_{n-1}\cup A_n\right)=P(U_{n-1})+P(A_n)-P(U_{n-1}\cap A_n). \end{equation}

Apply the induction hypothesis to $P(U_{n-1})$ . Note that

\begin{equation} U_{n-1}\cap A_n = \bigcup_{i=1}^{n-1} (A_i\cap A_n), \end{equation}

so apply inclusion–exclusion again inside. Collecting terms produces the alternating sum for $n$ sets.

Experiment: Roll a die.

$A=\{\text{even}\}=\{2,4,6\}, P(A)=3/6$
$B=\{\text{prime}\}=\{2,3,5\}, P(B)=3/6$
$C=\{\leq 3\}=\{1,2,3\}, P(C)=3/6$

Intersections:

\begin{equation} A\cap B=\{2\}, \quad P=1/6; \quad A\cap C=\{2\}, \quad P=1/6; \quad B\cap C=\{2,3\}, \quad P=2/6; \quad A\cap B\cap C=\{2\}, \quad P=1/6 \end{equation}

Formula:

\begin{equation} P(A\cup B\cup C) = P(A)+P(B)+P(C) - [P(A\cap B)+P(A\cap C)+P(B\cap C)] + P(A\cap B\cap C) \end{equation}

Substitute values:

\begin{equation} P(A\cup B\cup C) = 3/6 + 3/6 + 3/6 - (1/6 + 1/6 + 2/6) + 1/6 = 1 \end{equation}
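The three-event computation can be double-checked with exact rational arithmetic. In this sketch, `prob` is a helper defined here under the assumption that all six faces are equally likely.

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}

def prob(event):
    """Probability of an event when all outcomes in S are equally likely."""
    return Fraction(len(event & S), len(S))

A = {2, 4, 6}  # even
B = {2, 3, 5}  # prime
C = {1, 2, 3}  # at most 3

# Left-hand side: probability of the union computed directly.
lhs = prob(A | B | C)

# Right-hand side: inclusion-exclusion sum.
rhs = (prob(A) + prob(B) + prob(C)
       - prob(A & B) - prob(A & C) - prob(B & C)
       + prob(A & B & C))

print(lhs, rhs)  # 1 1
```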

8. Monotonicity

Statement:
If $A\subseteq B$ then

\begin{equation} P(A) \le P(B) \end{equation}

Proof:
When $A\subseteq B$ , we can write $B = A \cup (B\setminus A)$ , where $A$ and $B\setminus A$ are disjoint. By additivity,

\begin{equation} P(B) = P(A) + P(B\setminus A). \end{equation}

Since $P(B\setminus A)\ge0$ , it follows that $P(B)\ge P(A)$ .

Experiment: Roll a die.

$A=\{1,2\}, P(A)=2/6$
$B=\{1,2,3,4\}, P(B)=4/6$

Since $A\subseteq B$ :

\begin{equation} P(A) \le P(B) \quad \Rightarrow \quad 2/6 \le 4/6 \end{equation}

Conditional Probability

Suppose that we draw one card from a standard deck of 52 playing cards, and suppose that each of the 52 possible outcomes is equally likely to occur and hence has probability

\begin{equation} \frac{1}{52} \end{equation}

Suppose further that we are told that the card drawn is a heart. Then, given this information, what is the probability that the card is a face card (Jack, Queen, or King)?

To calculate this probability, we reason as follows: Given that the card is a heart, there can be at most 13 possible outcomes of our experiment, namely,
$\{ \heartsuit A, \heartsuit 2, \heartsuit 3, \dots, \heartsuit K \}$ .

Since each of these outcomes originally had the same probability of occurring, the outcomes should still have equal probabilities. That is, given that the card is a heart, the (conditional) probability of each of the outcomes is

\begin{equation} \frac{1}{13} \end{equation}

whereas the (conditional) probability of the other 39 points in the sample space is 0.

Hence, the desired probability will be

\begin{equation} \frac{3}{13} \end{equation}

since there are 3 favorable outcomes: $\{ \heartsuit J, \heartsuit Q, \heartsuit K \}$ .

If we let $E$ and $F$ denote, respectively, the event that the card is a face card and the event that the card is a heart, then the probability just obtained is called the conditional probability that $E$ occurs given that $F$ has occurred and is denoted by

\begin{equation} P(E \mid F) \end{equation}

Definition

A general formula for $P(E \mid F)$ that is valid for all events $E$ and $F$ is derived in the same manner: If the event $F$ occurs, then, in order for $E$ to occur, it is necessary that the actual occurrence be a point both in $E$ and in $F$ ; that is, it must be in $E \cap F$ . Now, since we know that $F$ has occurred, it follows that $F$ becomes our new, or reduced, sample space; hence, the probability that the event $E \cap F$ occurs will equal the probability of $E \cap F$ relative to the probability of $F$ . That is, we have the following definition:

\begin{equation} P(E \mid F) = \frac{P(E \cap F)}{P(F)}, \text{ with } \qquad P(F)>0. \end{equation}

Figure 2: Visual representation
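The card example can be reproduced by enumerating the deck and applying the definition of conditional probability. This is a minimal sketch; the string labels used for ranks and suits are arbitrary choices made here.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = set(product(ranks, suits))  # 52 equally likely outcomes

def prob(event):
    """Probability of an event under equally likely cards."""
    return Fraction(len(event), len(deck))

E = {card for card in deck if card[0] in {"J", "Q", "K"}}  # face card
F = {card for card in deck if card[1] == "hearts"}         # heart

# P(E | F) = P(E ∩ F) / P(F)
p_E_given_F = prob(E & F) / prob(F)

print(p_E_given_F)  # 3/13
```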

Multiplication Rule in Probability

The multiplication rule expresses the probability of the intersection of events in terms of conditional probabilities. For $n$ events $A_1, A_2, \dots, A_n$ , the joint probability can be written as:

\begin{equation} P(A_1 \cap A_2 \cap \dots \cap A_n) = P(A_1) \, P(A_2 \mid A_1) \, P(A_3 \mid A_1 \cap A_2) \, \dots \, P(A_n \mid A_1 \cap A_2 \cap \dots \cap A_{n-1}) \end{equation}

Or using product notation:

\begin{equation} P\Bigg(\bigcap_{i=1}^{n} A_i\Bigg) = \prod_{i=1}^{n} P\Big(A_i \,\big|\, \bigcap_{j=1}^{i-1} A_j\Big) \end{equation}
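As an illustration that is not part of the original notes, consider drawing three cards without replacement and let $A_i$ be the event that the $i$-th card is a heart. The sketch below evaluates the chain-rule product and compares it with a direct counting argument; the function name `chain_rule_all_hearts` is hypothetical.

```python
from fractions import Fraction
from math import comb

def chain_rule_all_hearts(n_draws, hearts=13, deck_size=52):
    """P(A_1 ∩ ... ∩ A_n) as a product of conditional probabilities.

    After i hearts have been drawn, hearts - i hearts remain among
    deck_size - i cards, which gives the i-th conditional factor.
    """
    p = Fraction(1)
    for i in range(n_draws):
        p *= Fraction(hearts - i, deck_size - i)
    return p

p_chain = chain_rule_all_hearts(3)                # 13/52 * 12/51 * 11/50
p_count = Fraction(comb(13, 3), comb(52, 3))      # choose 3 hearts out of all 3-card hands

print(p_chain, p_count)  # 11/850 11/850
```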

Bayes Theorem

Bayes' Theorem provides a systematic way to update the probability of an event $A$ when new evidence $B$ is observed.
It is based on the idea that the posterior probability $P(A \mid B)$ depends on:

The prior probability $P(A)$ , representing our initial belief before seeing the evidence,
The likelihood $P(B \mid A)$ , the probability of observing the evidence if the event occurs, and
The total probability of the evidence $P(B)$ , which normalizes the posterior to ensure all probabilities sum to 1.

Definition

Formally, Bayes' Theorem is:

\begin{equation} P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)} \end{equation}

If the evidence $B$ can occur under a set of mutually exclusive and exhaustive events $\{E_1, E_2, \dots, E_n\}$ , then the Law of Total Probability gives:

\begin{equation} P(B) = \sum_{i=1}^{n} P(B \mid E_i) \, P(E_i) \end{equation}

Combining these, with $A$ taken to be one of the events $E_i$ , gives the full form of Bayes’ Theorem:

\begin{equation} P(A \mid B) = \frac{P(B \mid A) \, P(A)}{\sum_{i=1}^{n} P(B \mid E_i) \, P(E_i)} \end{equation}

This formula shows that the posterior probability increases if the evidence $B$ is more likely when $A$ occurs (higher likelihood) or if the prior probability $P(A)$ is larger, but it is moderated by how likely the evidence is overall.

Example 1: Low prevalence, high test accuracy.

Consider a medical test for a rare condition:

$A$ = "person has the condition" with $P(A) = 0.01$ ,
$E_2$ = "person does not have the condition" with $P(E_2) = 0.99$ ,
$P(B \mid A) = 0.95$ (sensitivity),
$P(B \mid E_2) = 0.05$ (false positive rate).

Then the probability of having the condition given a positive test is:

\begin{equation} P(A \mid B) = \frac{0.95 \cdot 0.01}{0.95 \cdot 0.01 + 0.05 \cdot 0.99} \approx 0.161 \end{equation}

Even though the test is positive, the posterior probability rises only to 16%, because the condition is rare and false positives are possible.

Example 2: Higher prevalence, lower test accuracy.

Now suppose the condition is more common and the test is less accurate:

$P(A) = 0.3$ ,
$P(E_2) = 0.7$ ,
$P(B \mid A) = 0.7$ ,
$P(B \mid E_2) = 0.2$ .

Then the posterior probability is:

\begin{equation} P(A \mid B) = \frac{0.7 \cdot 0.3}{0.7 \cdot 0.3 + 0.2 \cdot 0.7} = \frac{0.21}{0.35} = 0.6 \end{equation}

A positive test now increases the probability from 30% to 60%, showing that a higher prevalence can produce a much stronger posterior probability even when the test is less accurate.
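Both medical-test calculations follow the same two-hypothesis pattern, which can be captured in a small helper. The function `posterior` below is a sketch written for these notes, with parameter names chosen for readability.

```python
def posterior(prior, p_evidence_given_a, p_evidence_given_not_a):
    """P(A | B) for a binary hypothesis: A versus not-A.

    prior                   -- P(A)
    p_evidence_given_a      -- P(B | A), e.g. test sensitivity
    p_evidence_given_not_a  -- P(B | not A), e.g. false positive rate
    """
    numerator = p_evidence_given_a * prior
    # Law of total probability over the two hypotheses.
    evidence = numerator + p_evidence_given_not_a * (1.0 - prior)
    return numerator / evidence

print(round(posterior(0.01, 0.95, 0.05), 3))  # rare condition, accurate test: 0.161
print(round(posterior(0.30, 0.70, 0.20), 3))  # common condition, weaker test: 0.6
```

Varying `prior` while holding the two likelihoods fixed shows directly how strongly the prevalence drives the posterior.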

Key Takeaways:

Bayes’ Theorem combines prior knowledge and new evidence to produce a rational update of probabilities.
The denominator (total probability of the evidence) ensures the posterior is properly normalized.
Posterior probabilities depend on prevalence, likelihood, and false positive/negative rates.
It provides a quantitative framework for reasoning under uncertainty, widely used in many fields.

Practical Applications of Bayes’ Theorem

General Applications:

Medical Diagnosis: Estimating the probability of disease given test results, as shown above.
Risk Assessment: Calculating likelihoods of rare events (e.g., accidents, system failures).
Decision Making: Updating beliefs based on new evidence in finance, law, and engineering.
Fault Detection: In manufacturing or electronics, estimating the probability of failure given observed signals.
Spam Detection: Filtering emails based on the probability that certain words indicate spam.

Some usage examples are as follows:

1. Spam Email Classification (Naive Bayes)

Suppose we want to classify an email as spam ( $S$ ) or not spam ( $H$ ) based on the presence of the word "discount" ( $W$ ).

Prior probability: $P(S) = 0.2$ , $P(H) = 0.8$
Likelihoods: $P(W \mid S) = 0.7$ , $P(W \mid H) = 0.1$

Compute the probability that the email is spam given it contains "discount":

\begin{equation} P(S \mid W) = \frac{P(W \mid S) P(S)}{P(W \mid S) P(S) + P(W \mid H) P(H)} = \frac{0.7 \cdot 0.2}{0.7 \cdot 0.2 + 0.1 \cdot 0.8} = \frac{0.14}{0.14 + 0.08} = \frac{0.14}{0.22} \approx 0.636 \end{equation}

Interpretation: The email is approximately 63.6% likely to be spam given it contains the word "discount".

2. Disease Prediction in ML Model (Binary Classification)

Suppose a binary classifier predicts a disease based on a symptom feature:

Prior probability: $P(Disease) = 0.05$ , $P(No\ Disease) = 0.95$
Likelihoods: $P(Symptom \mid Disease) = 0.9$ , $P(Symptom \mid No\ Disease) = 0.1$

Compute the posterior probability:

\begin{equation} P(Disease \mid Symptom) = \frac{0.9 \cdot 0.05}{0.9 \cdot 0.05 + 0.1 \cdot 0.95} = \frac{0.045}{0.045 + 0.095} = \frac{0.045}{0.14} \approx 0.321 \end{equation}

Interpretation: Even though the symptom is strongly associated with the disease, the posterior probability is only 32.1% due to the low prevalence.

3. Feature-Based Classification in Naive Bayes

Suppose a classifier uses two features $F_1$ and $F_2$ that are assumed to be conditionally independent given the class:

Prior: $P(Class=1) = 0.4$ , $P(Class=0) = 0.6$
Likelihoods: $P(F_1=1 \mid Class=1)=0.8$ , $P(F_2=1 \mid Class=1)=0.7$
Likelihoods for Class 0: $P(F_1=1 \mid Class=0)=0.3$ , $P(F_2=1 \mid Class=0)=0.2$

If we observe $F_1=1$ and $F_2=1$ , the posterior is:

\begin{equation} P(Class=1 \mid F_1=1, F_2=1) = \frac{0.8 \cdot 0.7 \cdot 0.4}{0.8 \cdot 0.7 \cdot 0.4 + 0.3 \cdot 0.2 \cdot 0.6} = \frac{0.224}{0.224 + 0.036} = \frac{0.224}{0.26} \approx 0.862 \end{equation}

Interpretation: Given both features are present, there is an 86.2% chance that the sample belongs to Class 1.
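The same calculation generalizes to any number of classes and features. The sketch below is a hand-rolled illustration, not a library API: `naive_bayes_posterior` is a hypothetical helper that multiplies the per-feature likelihoods (the naive independence assumption) and normalizes over the classes.

```python
from math import prod

def naive_bayes_posterior(priors, likelihoods):
    """Posterior over classes given observed features.

    priors      -- {class: P(class)}
    likelihoods -- {class: [P(feature_i observed | class), ...]}
    """
    # Unnormalized joint: prior times the product of feature likelihoods.
    joint = {c: priors[c] * prod(likelihoods[c]) for c in priors}
    total = sum(joint.values())  # total probability of the evidence
    return {c: joint[c] / total for c in joint}

priors = {1: 0.4, 0: 0.6}
likelihoods = {1: [0.8, 0.7], 0: [0.3, 0.2]}  # F_1 = 1 and F_2 = 1 observed

post = naive_bayes_posterior(priors, likelihoods)
print(round(post[1], 3))  # 0.862
```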

Example: IoT Sensor Fault Detection Using MTBF

Suppose we have a temperature sensor with a Mean Time Between Failures (MTBF) of 1000 hours. We want to determine the probability that the sensor has failed given that it shows an abnormal reading ( $A$ ) after 200 hours of operation.

Step 1: Convert MTBF to failure probability

The failure rate per hour is approximately:

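\begin{equation} \lambda = \frac{1}{\text{MTBF}} = \frac{1}{1000} = 0.001 \text{ per hour} \end{equation}

After $t = 200$ hours, the prior probability that the sensor has failed, denoted $P(F)$ , is (assuming exponentially distributed failure times):

\begin{equation} P(F) = 1 - e^{-\lambda t} = 1 - e^{-0.001 \cdot 200} \approx 0.181 \end{equation}

So the probability that the sensor is still working, denoted $P(W)$ , is $P(W) = 1 - P(F) \approx 0.819$ .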

Step 2: Sensor behavior (likelihoods)

If the sensor is faulty, probability of abnormal reading: $P(A \mid F) = 0.9$
If the sensor is working, probability of false alarm: $P(A \mid W) = 0.05$

Step 3: Apply Bayes’ Theorem

\begin{equation} P(F \mid A) = \frac{P(A \mid F) P(F)}{P(A \mid F) P(F) + P(A \mid W) P(W)} \end{equation}

Substitute values:

\begin{equation} P(F \mid A) = \frac{0.9 \cdot 0.181}{0.9 \cdot 0.181 + 0.05 \cdot 0.819} = \frac{0.1629}{0.1629 + 0.04095} = \frac{0.1629}{0.20385} \approx 0.799 \end{equation}

Interpretation: Given an abnormal reading after 200 hours of operation, there is approximately an 80% chance that the sensor has failed.

Insight: By combining MTBF-based prior probability and observed evidence, Bayes’ Theorem helps in predictive maintenance to identify likely sensor failures early and reduce system downtime.
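The three steps of this example can be chained into one function. The sketch below assumes, as the example does, exponentially distributed failure times; the function and parameter names are illustrative only.

```python
import math

def failure_posterior(mtbf_hours, t_hours,
                      p_abnormal_given_failed, p_abnormal_given_working):
    """P(failed | abnormal reading) after t_hours of operation."""
    # Step 1: prior from the exponential model, P(F) = 1 - exp(-t / MTBF).
    p_failed = 1.0 - math.exp(-t_hours / mtbf_hours)
    p_working = 1.0 - p_failed
    # Steps 2 and 3: weigh the likelihoods and apply Bayes' theorem.
    numerator = p_abnormal_given_failed * p_failed
    evidence = numerator + p_abnormal_given_working * p_working
    return numerator / evidence

p = failure_posterior(mtbf_hours=1000, t_hours=200,
                      p_abnormal_given_failed=0.9,
                      p_abnormal_given_working=0.05)
print(round(p, 3))  # about 0.799
```

Re-evaluating the function for larger `t_hours` shows the posterior rising as the prior probability of failure grows with operating time.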

Random Variable

Generally, when an experiment is performed, we are interested in a function of the outcome rather than in the outcome itself. For instance, when tossing two dice we may be interested in whether the faces sum to 6, and not much concerned about the individual face values. The sum equals 6 for any of the outcomes in the set $\hat{S}=\{(1,5), (2,4), (3,3), (4,2), (5,1)\}$ . Functions that map the outcomes in the sample space to real values are known as random variables. Formally,

Definition

A random variable (RV) $X$ is a function mapping outcomes to real values:

\begin{equation} X: S \to \mathbb{R} \end{equation}

A discrete RV takes values in a countable set (e.g., a subset of $\mathbb{Z}$ ), while a continuous RV can take any value in an interval of $\mathbb{R}$ .

Probability Mass Function (PMF)

For a discrete random variable:

\begin{equation} p_X(x) = P(X = x) \end{equation}

with

\begin{equation} \sum_x p_X(x) = 1 \end{equation}

Example

Suppose that our experiment consists of tossing 3 fair coins. If we let $Y$ denote the number of heads that appear, then $Y$ is a random variable taking on one of the values $0, 1, 2,$ and $3$ with respective probabilities

\begin{equation} \begin{gathered} P\{Y=0\}=P\{(T, T, T)\}=\frac{1}{8},\\ P\{Y=1\}=P\{(T, T, H),(T, H, T),(H, T, T)\}=\frac{3}{8},\\ P\{Y=2\}=P\{(T, H, H),(H, T, H),(H, H, T)\}=\frac{3}{8},\\ P\{Y=3\}=P\{(H, H, H)\}=\frac{1}{8}. \end{gathered} \end{equation}
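The PMF of $Y$ can also be obtained by brute-force enumeration of the eight equally likely outcomes. A minimal sketch, with variable names chosen here for illustration:

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 2^3 outcomes of tossing three fair coins.
outcomes = list(product("HT", repeat=3))

# The random variable Y maps each outcome to its number of heads.
counts = Counter(outcome.count("H") for outcome in outcomes)

# PMF: fraction of outcomes mapped to each value of Y.
pmf = {y: Fraction(n, len(outcomes)) for y, n in sorted(counts.items())}

for y, p in pmf.items():
    print(y, p)
```

The probabilities sum to 1, as required of any PMF.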

Probability Density Function (PDF)

We say that $X$ is a continuous random variable if there exists a function $f_X$ , called the probability density function, with

\begin{equation} f_X(x) \geq 0, \quad \int_{-\infty}^{\infty} f_X(x)\, dx = 1 \end{equation}

so that we can define the probability of some set $B \subseteq \mathbb{R}$ to be

\begin{equation} P(X \in B) = \int_{B} f_X(x)\,dx. \end{equation}

In other words, the probability of the RV $X$ taking a value in the interval $[a,b]$ is given by:

\begin{equation} \begin{gathered} P(a \le X \le b) = \int_{a}^{b} f_X(x)\,dx = F_X(b)-F_X(a),\\ F_X(z) = \int_{-\infty}^{z} f_X(x)\,dx. \end{gathered} \end{equation}

Cumulative Distribution Function (CDF)

For both discrete and continuous cases:

\begin{equation} F_X(x) = P(X \leq x) \end{equation}

Expectation and Variance

Expectation (Mean):

\begin{equation} \mathbb{E}[X] = \begin{cases} \sum_x x p_X(x) & \text{discrete} \\ \int_{-\infty}^\infty x f_X(x)\, dx & \text{continuous} \end{cases} \end{equation}

Variance:

\begin{equation} \mathrm{Var}(X) = \mathbb{E}[(X - \mathbb{E}[X])^2] \end{equation}
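As a worked instance of the discrete formulas, the sketch below computes the mean and variance of the number of heads in three fair coin tosses, using the PMF values $1/8, 3/8, 3/8, 1/8$ . The helper names `expectation` and `variance` are illustrative.

```python
from fractions import Fraction

# PMF of Y = number of heads in three fair coin tosses.
pmf = {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 8)}

def expectation(pmf):
    """E[X] = sum over x of x * p_X(x)."""
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    """Var(X) = E[(X - E[X])^2]."""
    mu = expectation(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())

print(expectation(pmf), variance(pmf))  # 3/2 3/4
```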

Inequalities of Expected Value

Expected value inequalities provide upper bounds on the probability of a random variable's value deviating from its expected value. They are used when the exact probability distribution is unknown, offering a way to make probabilistic statements with limited information.

1. Markov's Inequality

Markov's inequality is a fundamental tool that applies to any non-negative random variable. It gives an upper bound on the probability that the random variable is greater than or equal to some positive constant.

Statement: For a non-negative random variable $X$ and a positive constant $a > 0$ , the inequality is:

\begin{equation} P(X \ge a) \le \frac{E[X]}{a} \end{equation}

Analogy: If you know the average length of a movie is 120 minutes ( $E[X]=120$ ), you can use this inequality to say that the probability of randomly picking a movie that is 240 minutes long or longer is no more than $120/240 = 0.5$ .
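To see how loose or tight the bound can be, it helps to compare it with an exact tail probability. The sketch below does so for a fair six-sided die, an example chosen here for convenience; the helper names are illustrative.

```python
from fractions import Fraction

# Fair six-sided die: a non-negative random variable with known PMF.
pmf = {face: Fraction(1, 6) for face in range(1, 7)}
mean = sum(x * p for x, p in pmf.items())  # E[X] = 7/2

def tail_probability(a):
    """Exact P(X >= a) for the die."""
    return sum(p for x, p in pmf.items() if x >= a)

def markov_bound(a):
    """Markov's upper bound E[X] / a."""
    return mean / a

for a in (2, 4, 6):
    print(a, tail_probability(a), markov_bound(a))
```

For small $a$ the bound exceeds 1 and carries no information; it only becomes useful once $a$ is larger than $E[X]$ .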

2. Chebyshev's Inequality

Chebyshev's inequality is a more powerful version of Markov's that uses the variance of the random variable. It provides a tighter bound on the probability that a random variable deviates from its mean by more than a certain amount. It applies to any random variable with a finite mean and variance.

Statement: For a random variable $X$ with finite expected value $\mu = E[X]$ and finite non-zero variance $\sigma^2 = \text{Var}(X)$ , and for any constant $k > 0$ , the inequality is:

\begin{equation} P(|X - \mu| \ge k) \le \frac{\sigma^2}{k^2} \end{equation}

Analogy: If you're measuring the height of students and you know the average height is 5'9" with a standard deviation of 2 inches, Chebyshev's inequality guarantees that no more than $1/4$ of the students are either shorter than 5'5" or taller than 6'1" (i.e., more than 2 standard deviations from the mean).
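Chebyshev's bound can be compared with an exact deviation probability in the same way. The sketch below again uses a fair six-sided die as an assumed example, for which $\mu = 7/2$ and $\sigma^2 = 35/12$ .

```python
from fractions import Fraction

pmf = {face: Fraction(1, 6) for face in range(1, 7)}
mu = sum(x * p for x, p in pmf.items())               # mean, 7/2
var = sum((x - mu) ** 2 * p for x, p in pmf.items())  # variance, 35/12

def deviation_probability(k):
    """Exact P(|X - mu| >= k) for the die."""
    return sum(p for x, p in pmf.items() if abs(x - mu) >= k)

def chebyshev_bound(k):
    """Chebyshev's upper bound sigma^2 / k^2."""
    return var / k ** 2

# Only the faces 1 and 6 lie at least 2 away from the mean 3.5.
print(deviation_probability(2), chebyshev_bound(2))  # 1/3 35/48
```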

3. Jensen's Inequality

Jensen's inequality is different from the others as it doesn't directly bound probabilities. Instead, it relates the expected value of a convex or concave function to the function of the expected value. It's a foundational concept in optimization and information theory.

Statement: For a convex function $\phi$ :

\begin{equation} E[\phi(X)] \ge \phi(E[X]) \end{equation}

For a concave function $\phi$ :

\begin{equation} E[\phi(X)] \le \phi(E[X]) \end{equation}

Analogy: Imagine a game where your winnings are a convex function of your dice roll. Jensen's inequality tells you that your average winnings will be greater than or equal to what you'd win if the die always landed on its average value (3.5). This shows that variability can be beneficial for convex functions.
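The dice analogy can be made concrete by picking a specific convex payoff. The sketch below assumes the payoff $\phi(x) = x^2$ , a choice made here purely for illustration, and compares both sides of the inequality for a fair die.

```python
from fractions import Fraction

pmf = {face: Fraction(1, 6) for face in range(1, 7)}

def expect(g):
    """E[g(X)] for a fair six-sided die."""
    return sum(g(x) * p for x, p in pmf.items())

def phi(x):
    """An example convex function: the square."""
    return x ** 2

lhs = expect(phi)               # E[phi(X)] = 91/6
rhs = phi(expect(lambda x: x))  # phi(E[X]) = (7/2)^2 = 49/4

print(lhs, rhs, lhs >= rhs)  # 91/6 49/4 True
```

For this particular choice of $\phi$ , the gap $E[X^2] - (E[X])^2$ is exactly the variance of $X$ .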

References

[1] Ross, "Signal Processing for Communications", EPFL Press, 2008.
