The Power of Probability in AI

An Overview of Probability in AI, ML and NLP

This blog explains the basic concepts of Probability Theory that apply to the major areas of Artificial Intelligence (AI), Machine Learning (ML) and Natural Language Processing (NLP). Probability is the heart of AI.

The following are the major topics discussed here. They come up throughout AI, and being familiar with them will make you comfortable in the field.

Please note that all these topics come under Probability Theory. In the images I mention only Probability, Chain Rule and Bayes; take these to mean Probability Theory as a whole.

1) Distributions 2) Probability Axioms, Random Variables, Types of Random Variables 3) Conditional Probability 4) Independence 5) Bayes Rule 6) Chain Rule 7) Maximum Likelihood, and 8) Maximum A Posteriori (MAP)

Probability and Information Theory in AI

Let’s start with a short introduction to probability and information theory in Artificial Intelligence. Both subjects deal with uncertainty: probability theory allows us to make uncertain statements and to reason in the presence of uncertainty, whereas information theory measures the disorder (or uncertainty) in a probability distribution.

First we will go through these concepts, and then see how they are involved in AI, Machine Learning and NLP applications.

Distribution: In simple terms, a distribution is a data source. It describes how likely each value is, so we can draw samples from it (Normal, Poisson, Bernoulli, Binomial, etc.) for use in AI applications. We can define distributions using functions and probability concepts, build our own, and later draw samples from them for training and testing data sets.
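As a minimal sketch of drawing samples from a few of the distributions named above, the snippet below uses only Python's standard `random` module. The helper names (`sample_bernoulli`, `sample_binomial`, `train_test_split`) and all parameter values are illustrative assumptions, not part of any particular library.

```python
import random

def sample_bernoulli(p, rng):
    """Return 1 with probability p, otherwise 0."""
    return 1 if rng.random() < p else 0

def sample_binomial(n, p, rng):
    """Number of successes in n independent Bernoulli(p) trials."""
    return sum(sample_bernoulli(p, rng) for _ in range(n))

def train_test_split(samples, train_fraction=0.8):
    """Split a list of samples into a training part and a testing part."""
    cut = int(len(samples) * train_fraction)
    return samples[:cut], samples[cut:]

rng = random.Random(0)  # fixed seed so the draws are reproducible

normal_samples = [rng.gauss(0.0, 1.0) for _ in range(1000)]        # Normal(0, 1)
bernoulli_samples = [sample_bernoulli(0.3, rng) for _ in range(1000)]
binomial_samples = [sample_binomial(10, 0.5, rng) for _ in range(1000)]

train, test = train_test_split(normal_samples)
print(len(train), len(test))  # 800 200
```

In practice a numerical library would usually be used for sampling; the standard library is enough to show the idea.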

Probability Formula in Mathematics

P(E) = n(E) / n, where n is the total number of possible outcomes and n(E) is the number of outcomes favourable to E.

1) Probability: The probability of a desired event E is the ratio of the number of trials that result in E to the total number of trials performed. It always lies in [0,1].
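The ratio-of-trials definition can be illustrated with a small simulation. This is only a sketch: the function `empirical_probability`, the fair six-sided die, and the event "the roll is even" are assumptions chosen for the example.

```python
import random

def empirical_probability(event, trial, num_trials, rng):
    """Fraction of trials whose outcome satisfies `event`."""
    hits = sum(1 for _ in range(num_trials) if event(trial(rng)))
    return hits / num_trials

def roll_die(rng):
    """One trial: roll a fair six-sided die."""
    return rng.randint(1, 6)

def is_even(outcome):
    """Event E: the roll is even."""
    return outcome % 2 == 0

rng = random.Random(42)  # fixed seed for reproducibility
p_even = empirical_probability(is_even, roll_die, 100_000, rng)
print(round(p_even, 2))
```

With many trials the estimate should land close to the theoretical value n(E) / n = 3 / 6 = 0.5, and it always stays within [0, 1].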

Axioms of Probability :

1) 0<=P(E)<=1

2) P(S)=1

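3) P(success) = 1 - P(failure), therefore P(success) + P(failure) = 1. This is the complement rule, a consequence of additivity: for mutually exclusive events A and B, P(A or B) = P(A) + P(B).

Where E is an event and S is the sample space, the set of all possible outcomes of an experiment (the possible values of each trial).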

2) Random Variable: A Random Variable (RV) is a variable that can take on different values randomly. For example, if x1 and x2 are the possible values that the random variable X can take on, we write X = [x1, x2]. Random Variables are of 2 types: 1) Discrete Random Variable (DRV): it has a finite or countably infinite number of values/states. 2) Continuous Random Variable (CRV): it takes real values. Note: all variables and events are expressed in terms of Random Variables.

3) Conditional Probability: The probability of some event, given that some other event has happened, is known as a Conditional Probability. We write the probability that Y = y given X = x as P(Y = y | X = x), where X and Y are Random Variables. It is only defined when P(X = x) > 0.

Conditional Probability Formula
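Written out, the standard definition of conditional probability for the Random Variables X and Y used above is:

```latex
P(Y = y \mid X = x) = \frac{P(Y = y,\; X = x)}{P(X = x)}, \qquad P(X = x) > 0
```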

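4) Independence: Two Random Variables X and Y are said to be independent if their joint distribution can be expressed as a product of two factors, one involving only X and one involving only Y. X and Y are conditionally independent given a Random Variable Z if the conditional distribution over X and Y factorizes in this way for every value of Z.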

Conditional Independence Formulas
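In symbols, the standard statements of independence and conditional independence are as follows (lower-case x, y, z denote particular values of X, Y, Z):

```latex
\begin{align*}
% Independence of X and Y
p(X = x,\; Y = y) &= p(X = x)\, p(Y = y)
  && \text{for all } x, y \\
% Conditional independence of X and Y given Z
p(X = x,\; Y = y \mid Z = z) &= p(X = x \mid Z = z)\, p(Y = y \mid Z = z)
  && \text{for all } x, y, z
\end{align*}
```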

5) Chain Rule or Product Rule: A joint probability distribution over two or more Random Variables may be decomposed into conditional distributions over only one variable each.

Chain Rule for 2 ,3 and N Random Variables
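For reference, the chain rule for 2, 3 and n Random Variables can be written as below. The variable names a, b, c and x_1, ..., x_n are chosen here for illustration.

```latex
\begin{align*}
P(a, b) &= P(a \mid b)\, P(b) \\
P(a, b, c) &= P(a \mid b, c)\, P(b \mid c)\, P(c) \\
P(x_1, \dots, x_n) &= P(x_1) \prod_{i=2}^{n} P(x_i \mid x_1, \dots, x_{i-1})
\end{align*}
```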

6) Bayes Rule :

Bayes Rule formula
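Bayes Rule follows from writing the joint probability P(x, y) in two ways with the chain rule. In its standard form, with the denominator expanded by summing over all values of x:

```latex
\begin{align*}
P(x \mid y) &= \frac{P(y \mid x)\, P(x)}{P(y)},
& P(y) &= \sum_{x} P(y \mid x)\, P(x)
\end{align*}
```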

Parameter Estimation: estimating the value of a model parameter from data.

7) Maximum Likelihood Estimator (MLE)

MLE Formula
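A common way to write the maximum likelihood estimator is shown below. The notation is an assumption made here: m i.i.d. training examples x^(1), ..., x^(m) and a model distribution p_model with parameter θ. Taking the logarithm turns the product into a sum without changing the maximizer.

```latex
\begin{align*}
\theta_{\mathrm{ML}}
  &= \arg\max_{\theta} \prod_{i=1}^{m} p_{\mathrm{model}}\big(x^{(i)}; \theta\big) \\
  &= \arg\max_{\theta} \sum_{i=1}^{m} \log p_{\mathrm{model}}\big(x^{(i)}; \theta\big)
\end{align*}
```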

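It is used to estimate the parameters that make the observed training data most probable under the model. For many models this is equivalent to reducing the error between the training targets and the model's predictions.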
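As a small illustration of maximum likelihood, the sketch below estimates the parameter of a Bernoulli distribution (for example, the bias of a coin) in two ways: with the well-known closed form, the sample mean, and with a brute-force search over a grid of candidate values. The data and the function name `bernoulli_log_likelihood` are made up for this example.

```python
import math

def bernoulli_log_likelihood(theta, data):
    """Log-likelihood of i.i.d. 0/1 observations under Bernoulli(theta)."""
    return sum(math.log(theta) if x == 1 else math.log(1.0 - theta) for x in data)

data = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]  # made-up observations: 7 ones, 3 zeros

# Closed-form MLE for a Bernoulli parameter: the sample mean.
theta_closed_form = sum(data) / len(data)

# Numerical check: pick the grid point with the highest log-likelihood.
grid = [i / 1000 for i in range(1, 1000)]  # skip 0 and 1, where log is undefined
theta_grid = max(grid, key=lambda t: bernoulli_log_likelihood(t, data))

print(theta_closed_form, theta_grid)  # 0.7 0.7
```

Both approaches agree, which is the point: the MLE is simply the parameter value under which the observed data is most probable.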

The Random Variables can be words (internally converted into vectors), vectors, numbers, etc.

8) Maximum A Posteriori (MAP)

MAP Formula
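In the usual notation, the MAP estimate maximizes the posterior over the parameter θ given data x. By Bayes Rule this splits into the log-likelihood plus the log-prior (the term p(x) does not depend on θ and drops out of the maximization):

```latex
\begin{align*}
\theta_{\mathrm{MAP}}
  &= \arg\max_{\theta}\; p(\theta \mid x) \\
  &= \arg\max_{\theta}\; \big[ \log p(x \mid \theta) + \log p(\theta) \big]
\end{align*}
```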

MAP gives a point estimate using Bayes Rule. While the most principled approach is to make predictions using the full Bayesian posterior distribution over the parameter, it is still often desirable to have a single point estimate. MAP chooses the parameter value with the highest posterior probability.
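To see how MAP differs from MLE, here is a minimal sketch for a Bernoulli parameter. It assumes a Beta(alpha, beta) prior, which is a choice made for this example because the posterior is then also a Beta distribution whose mode has a closed form. The function names and the data are hypothetical.

```python
def bernoulli_mle(data):
    """Maximum likelihood estimate: the fraction of ones."""
    return sum(data) / len(data)

def bernoulli_map(data, alpha, beta):
    """MAP estimate under a Beta(alpha, beta) prior.

    The posterior is Beta(k + alpha, n - k + beta); its mode is returned.
    The formula is valid when alpha > 1 and beta > 1.
    """
    k = sum(data)   # number of ones
    n = len(data)   # number of observations
    return (k + alpha - 1) / (n + alpha + beta - 2)

data = [1, 1, 1]  # made-up data: three successes in three trials

print(bernoulli_mle(data))                    # 1.0
print(bernoulli_map(data, alpha=2, beta=2))   # 0.8
```

With only three observations the MLE jumps to 1.0, while the prior pulls the MAP estimate back towards 0.5. As more data arrives, the two estimates move closer together.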

Probability in Artificial Intelligence (AI)

AI subjects or fields can be categorised as Learning, Problem Solving, Uncertainty & Reasoning, Knowledge Representation, and Communication.

This diagram shows where Probability Theory can be applied in AI. Learning (especially Machine Learning) and NLP are part of AI, but they are listed separately because they are so widely used and so necessary to understand.

Probability Theory applied in different models in AI

This diagram lists Bayesian Network based probabilistic programs for reasoning, reasoning over time, and making decisions. The models are listed by area: learning-based for Machine Learning, state-based for Problem Solving, logic-based for Bayesian Networks and First Order Logic, and Communication for NLP (which is not listed in the diagram).

Probability in Machine Learning (ML)

The following diagram shows where Probability Theory can be applied in Machine Learning algorithms: mostly in Generative Algorithms, Classification Algorithms and Estimation of Parameters.

Reinforcement Learning (RL) is a branch of ML that works on an environment-and-reward basis. Here probability is applied in Markov Decision Processes (MDP) and Partially Observable Markov Decision Processes (POMDP).

Generative Algorithms:

Algorithms that try to learn p(y|x) directly, that is, a mapping from the space of inputs to the labels, are called Discriminative Learning Algorithms. Here y is the label and x is the input.

Algorithms that instead try to model p(x|y) and p(y) are called Generative Learning Algorithms.

Using Bayes Rule, the posterior is p(y|x) = p(x|y) p(y) / p(x), where p(y) is the prior. A well-known example is e-mail classification: deciding whether a message is ham or spam. Such tasks are called Text Classification in NLP.

For example, Naive Bayes is a generative classification algorithm: it models p(x|y) and p(y), then applies Bayes Rule to obtain p(y|x) for the classification step.
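The ham-or-spam example can be sketched as a tiny Naive Bayes text classifier. This is a toy illustration, not a production implementation: the four training messages are invented, the function names are hypothetical, and add-one (Laplace) smoothing is an assumption added so that unseen word/label pairs do not get zero probability.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """Collect label counts (for p(y)) and per-label word counts (for p(x|y))."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocabulary.add(word)
    return label_counts, word_counts, vocabulary

def predict(text, label_counts, word_counts, vocabulary):
    """Return the label y maximizing log p(y) + sum of log p(word | y)."""
    total = sum(label_counts.values())
    scores = {}
    for label, count in label_counts.items():
        score = math.log(count / total)  # log prior p(y)
        denom = sum(word_counts[label].values()) + len(vocabulary)
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # add-one smoothed estimate of p(word | y)
            score += math.log((word_counts[label][word] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)

examples = [
    ("win money now", "spam"),
    ("free money offer", "spam"),
    ("meeting at noon", "ham"),
    ("lunch meeting tomorrow", "ham"),
]
model = train_naive_bayes(examples)
print(predict("free money", *model))        # spam
print(predict("meeting tomorrow", *model))  # ham
```

The denominator p(x) of Bayes Rule is the same for every label, so it can be dropped when only the most probable label is needed.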

Estimation of Parameters: Parameter estimation comes up in almost all supervised algorithms, where parameters are chosen to reduce the error on the training data set, with the testing data set used to check how well they generalize. For example: Regression, Logistic Regression, Naive Bayes, Neural Networks, SVM, etc.

Linear Regression as Maximum Likelihood:

The following example shows how Maximum Likelihood Estimation (MLE) for a regression algorithm leads to the Mean Squared Error (MSE) loss function.

Input space: x; algorithm output: ^y. The mapping from x to ^y is chosen to minimize MSE. Instead of producing a single prediction ^y, we now think of the model as producing a conditional distribution p(y|x), the distribution over all of the different y values that are compatible with x. Since the examples are assumed to be i.i.d., the conditional log-likelihood is

MLE Equation
Expanding MLE equation
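The usual derivation can be written out as follows. It assumes, as is standard for this argument, that p(y|x) is a Gaussian centred on the prediction ^y with a fixed variance σ², and that there are m training examples; these symbols are assumptions made here for illustration.

```latex
\begin{align*}
p(y \mid x) &= \mathcal{N}\big(y;\; \hat{y}(x; w),\; \sigma^2\big) \\
\sum_{i=1}^{m} \log p\big(y^{(i)} \mid x^{(i)}; \theta\big)
  &= -m \log \sigma - \frac{m}{2} \log(2\pi)
     - \sum_{i=1}^{m} \frac{\big(\hat{y}^{(i)} - y^{(i)}\big)^2}{2\sigma^2} \\
\mathrm{MSE}_{\mathrm{train}}
  &= \frac{1}{m} \sum_{i=1}^{m} \big(\hat{y}^{(i)} - y^{(i)}\big)^2
\end{align*}
```

Only the last term of the log-likelihood depends on the weights w, and it is a negative multiple of the MSE. Maximizing the log-likelihood with respect to w therefore gives the same w as minimizing the MSE.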

Probability in Natural Language Processing (NLP)

This diagram shows where Probability Theory can be applied in NLP.

Probability Theory can be applied in NLP for N-grams, Language Modeling (LM), Conditional Language Modeling (CLM), Text Classification (e-mail: spam or ham), part-of-speech tagging, Speech Recognition, Machine Translation, Information Extraction (by applying a Conditional Random Field (CRF)), etc.
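As one concrete case from this list, a bigram (2-gram) language model combines conditional probability and the chain rule: the probability of a sentence is approximated as a product of p(word | previous word). The sketch below uses a made-up three-sentence corpus, hypothetical function names, and `<s>` / `</s>` as assumed start and end markers.

```python
from collections import Counter

def train_bigram_model(sentences):
    """Count each context word and each (previous word, current word) pair."""
    context_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, curr in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, curr)] += 1
    return context_counts, bigram_counts

def bigram_probability(prev, curr, context_counts, bigram_counts):
    """Maximum likelihood estimate of p(curr | prev)."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, curr)] / context_counts[prev]

def sentence_probability(sentence, context_counts, bigram_counts):
    """Chain rule with the bigram approximation."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    prob = 1.0
    for prev, curr in zip(tokens, tokens[1:]):
        prob *= bigram_probability(prev, curr, context_counts, bigram_counts)
    return prob

corpus = ["i like ai", "i like nlp", "i study ai"]
contexts, bigrams = train_bigram_model(corpus)

p_like_given_i = bigram_probability("i", "like", contexts, bigrams)
p_sentence = sentence_probability("i like ai", contexts, bigrams)
print(p_like_given_i)  # 2 of the 3 occurrences of "i" are followed by "like"
print(p_sentence)      # 1 * 2/3 * 1/2 * 1, about 0.333
```

Real language models add smoothing so that unseen word pairs do not receive zero probability; that step is left out here to keep the example short.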

Knowing the concepts and background is very important in AI. I hope this gives you a starting point for understanding how AI works internally. Other topics also come into the picture, but these are the major ones; once you understand them, the others are easier to grasp.

Expertise in Probability Theory will take you into many fields in AI. That’s why Probability is the Heart of AI.

Thanks for reading this article. Please drop a note with comments, corrections, etc.