Basic knowledge about Machine Learning

1. About Machine Learning

1.1 What is Machine Learning

The definition of machine learning can be expressed as:

Suppose we want a model to perform a task T, and we have an approach E to evaluate how well the model completes this task. If we improve the performance measured by E by updating the model with an optimization algorithm O on an observational dataset D, we call such a process a machine learning process.

1.2 Three components of Machine Learning

Model

A model is a learner trying to solve the task we want to perform. In most cases, that task is fitting a real mapping between variables. Thus, a model can usually be denoted as \(f(x,\theta)\), where \(\theta\) is the parameter of the model, and the fitting task is to find the best \(\theta\) according to the training dataset.

As we may not know the specific form of the real mapping, we need to give the algorithm a scope for learning. We call such a scope the Hypothesis Space, denoted as \(F\). Every candidate fitted mapping \(f\) satisfies \(f\in F\).

Learning Principle

To find the best \(\theta\), we need a principle to judge how good a \(\theta\) is. This principle usually involves the definition and calculation of a loss function.
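As a minimal sketch (not from the original post), the snippet below scores candidate parameters of an assumed linear model with mean squared error, one common choice of loss function:

```python
# A minimal sketch, assuming a linear model f(x, theta) = X @ theta and squared loss.
import numpy as np

def mse_loss(theta, X, y):
    """Empirical risk of the linear model under mean squared error."""
    y_pred = X @ theta
    return np.mean((y - y_pred) ** 2)

# toy data: y = 2*x + noise (made up for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=100)

print(mse_loss(np.array([0.0]), X, y))  # a bad theta -> large loss
print(mse_loss(np.array([2.0]), X, y))  # a good theta -> small loss
```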

Optimization Algorithm

After confirming the training dataset, hypothesis space and learning principle, the task of finding the best parameters becomes an optimization problem. An optimization algorithm is a solver for such a problem.
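Continuing the toy linear-model setup above, a minimal sketch of an optimization algorithm is batch gradient descent on the squared loss; the data and learning rate here are assumed for illustration:

```python
# A minimal sketch of an optimizer: batch gradient descent on the MSE of a linear model.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=100)   # toy data, true slope = 2

theta = np.zeros(1)          # initial parameter
lr = 0.1                     # learning rate (a hyperparameter of the optimizer)
for _ in range(200):
    grad = -2.0 * X.T @ (y - X @ theta) / len(y)   # gradient of the MSE loss w.r.t. theta
    theta -= lr * grad

print(theta)   # should approach [2.0]
```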

1.3 Preprocessing of the data

1.4 Evaluation of the model

2. Supervised Machine Learning

Supervised learning is a kind of machine learning where the output variable of the model is explicitly labeled in the dataset. That is to say, we have a true outcome \(Y_{true}\) and a predicted outcome \(Y_{pred}\).

2.1 Joint Probability Distribution

Joint probability is the probability that multiple conditions are satisfied at the same time, denoted as \(P(X,Y)\).

The relationships among joint probability, conditional probability and marginal probability are: \[ P(Y|X) = \frac{P(X,Y)}{P(X)} \\ P(Y) = \sum_{X} P(X,Y) = \sum_{X} P(Y|X)P(X) \] For two independent variables: \[ P(X, Y) = P(X)P(Y) \\ P(Y|X) = P(Y) \]
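The identities above can be checked numerically on a toy discrete joint table; the values below are made up purely for illustration:

```python
# Numerical check of the conditional / marginal / independence identities on a toy 2x2 table.
import numpy as np

# rows = values of X, columns = values of Y (made-up probabilities summing to 1)
P_XY = np.array([[0.10, 0.30],
                 [0.20, 0.40]])

P_X = P_XY.sum(axis=1)               # marginal P(X): sum over Y
P_Y = P_XY.sum(axis=0)               # marginal P(Y): sum over X
P_Y_given_X = P_XY / P_X[:, None]    # conditional P(Y|X) = P(X,Y) / P(X)

print(P_Y)                                     # [0.3 0.7]
print(P_Y_given_X)                             # each row sums to 1
print(np.allclose(P_XY, np.outer(P_X, P_Y)))   # True only if X and Y are independent (False here)
```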

The fundamental hypothesis of supervised machine learning is the existence of a joint probability distribution of the input variable X and the output variable Y.

The task we want a supervised machine learning algorithm to perform is to imitate the real JPD \(P(X,Y)\), so that we can calculate \(P(Y|X)\) when an input X is given.

2.2 Categories of Supervised Machine Learning

Probability model vs. Non-probability model

A non-probability model tries to learn the mapping relationship \(f\) directly. Usually it fixes a hypothesis space according to prior knowledge (e.g., a linear space). For example, KNN, SVM and NN are all non-probability models.

A probability model tries to learn one or more of \(P(X,Y)\), \(P(X|Y)\) and \(P(Y)\) directly.

Usually, it pre-decides the distribution form of \(P(Y|X)\) or \(P(Y)\); for example, logistic regression supposes \(P(Y|X)\) follows a Bernoulli distribution. Logistic regression and naive Bayes are probability models.
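As an illustrative example (scikit-learn assumed installed, data synthetic), logistic regression returns an estimate of \(P(Y|X)\) directly via predict_proba:

```python
# A minimal sketch of a probability model: logistic regression outputs an estimate of P(Y|X).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)   # toy binary labels

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[1.5]]))   # [P(Y=0|x), P(Y=1|x)] for the input x = 1.5
```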

Parametric model vs. Non-parametric model

In statistics, a parametric estimation means an estimation with a given hypothesis on the distribution of the whole population, while a non-parametric estimation does not make such a hypothesis.

For machine learning:

A parametric model means you have an explicit hypothesis on the mapping or the probability distribution, like linear regression, logistic regression, naive Bayes (finite dimension of \(\theta\)) or MLP. The advantage of such models is that, if the hypothesis is correct, the model can be fit with a very small dataset. However, if the hypothesis is incorrect, no matter how large the dataset is, there would still be an inevitable bias.

A non-parametric model means you put no or only limited hypotheses on the mapping or distribution, like KNN, tree-based models, or SVM (non-linear). The storage and computation costs of non-parametric models are higher, but theoretically a non-parametric model can fit any complicated mapping as long as we have enough data. Usually a non-parametric model has a few hyperparameters and an unbounded number of parameters.
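A small illustrative comparison (assumed toy data, scikit-learn assumed installed): a parametric model with a wrong linear hypothesis stays biased on a non-linear mapping, while a non-parametric KNN model fits it given enough data:

```python
# Parametric (linear regression) vs. non-parametric (KNN) on a non-linear toy mapping.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=300)      # the true mapping is non-linear

lin = LinearRegression().fit(X, y)                     # wrong (linear) hypothesis -> biased
knn = KNeighborsRegressor(n_neighbors=5).fit(X, y)     # no hypothesis on the mapping's shape

X_test = np.linspace(-3, 3, 50).reshape(-1, 1)
y_true = np.sin(X_test[:, 0])
print(np.mean((lin.predict(X_test) - y_true) ** 2))    # larger error, no matter how much data
print(np.mean((knn.predict(X_test) - y_true) ** 2))    # smaller error with enough data
```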

Discriminative model vs. Generative model

A discriminative model models Y directly:

  • All non-probability models are discriminative models
  • If a probability model tries to learn \(P(Y|X)\) directly, it is a discriminative model

This includes most machine learning models, such as MLP, logistic regression, decision tree, KNN and SVM.

A generative model tries to induce \(P(Y|X)\) indirectly, through learning \(P(X|Y)\) and \(P(Y)\), with \[ P(X,Y) = P(X|Y)P(Y) \]

\[ P(Y|X) = \frac{P(X,Y)}{P(X)} \] Examples include naive Bayes and GMM.
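A minimal sketch of this generative route on made-up discrete data: estimate \(P(Y)\) and \(P(X|Y)\) from counts, then derive \(P(Y|X)\) with the two formulas above:

```python
# Generative route: estimate P(Y) and P(X|Y) from counts, then derive P(Y|X) by Bayes' rule.
import numpy as np

# X is a single binary feature, Y is a binary label; rows are (x, y) samples (made up)
data = np.array([(0, 0), (0, 0), (1, 0), (0, 1), (1, 1), (1, 1), (1, 1), (0, 1)])
X, Y = data[:, 0], data[:, 1]

P_Y = np.array([np.mean(Y == c) for c in (0, 1)])                                     # prior P(Y)
P_X_given_Y = np.array([[np.mean(X[Y == c] == v) for v in (0, 1)] for c in (0, 1)])   # likelihood P(X|Y)

x = 1                                    # query input
joint = P_Y * P_X_given_Y[:, x]          # P(X=x, Y=c) = P(X=x|Y=c) P(Y=c)
posterior = joint / joint.sum()          # P(Y|X=x) = P(X,Y) / P(X)
print(posterior)                         # [0.25, 0.75] for this toy data
```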

3. Unsupervised Machine Learning

Unsupervised learning is a kind of machine learning where the output variable of the model is not labeled in the dataset. Typically, we can separate unsupervised machine learning algorithms into several types according to the task we want them to perform.

3.1 Feature learning

Mining useful representations or combinations of features in an unlabeled dataset. Usually applied in dimensionality reduction or visualization, including the following (see the sketch after the list):

  • PCA, t-SNE, SVD
  • Sparse Coding, Autoencoder, Denoising Autoencoder
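A minimal PCA sketch (scikit-learn assumed installed; the data is random and only illustrative), showing how a lower-dimensional representation is learned without labels:

```python
# A minimal sketch of feature learning for dimensionality reduction with PCA.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ np.array([[1.0, 0.5, 0.2],
                                          [0.0, 1.0, 0.3]])   # 100 samples, 3 correlated features

pca = PCA(n_components=2)
Z = pca.fit_transform(X)                 # learned low-dimensional representation
print(Z.shape)                           # (100, 2)
print(pca.explained_variance_ratio_)     # fraction of variance kept by each component
```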

3.2 Probabilistic Density Estimation

Inducing the probability density function of a variable from observational data; approaches can be classified as follows (a small sketch follows the list):

  • Parametric Density Estimation: has a prior hypothesis on the distribution form of the variable, including MLE, MAP, etc.
  • Non-parametric Density Estimation: does not have a prior hypothesis, including histogram, Kernel Density Estimation (KDE), etc.
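A small sketch contrasting the two routes on synthetic 1D samples: MLE under a Gaussian hypothesis (parametric) versus kernel density estimation (non-parametric), using NumPy and SciPy (both assumed installed):

```python
# Parametric vs. non-parametric density estimation on toy 1D samples.
import numpy as np
from scipy.stats import norm, gaussian_kde

rng = np.random.default_rng(0)
samples = rng.normal(loc=1.0, scale=2.0, size=500)   # made-up observations

mu, sigma = samples.mean(), samples.std()   # MLE under a Gaussian hypothesis
kde = gaussian_kde(samples)                 # no hypothesis on the distribution form

x = 0.0
print(norm.pdf(x, mu, sigma))   # parametric density estimate at x
print(kde(x)[0])                # non-parametric (KDE) estimate at x
```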

3.3 Clustering

About Clustering

Clustering is the process of splitting a sample space into subspaces, where the points in each subspace should be similar to each other. An important thing to notice is that, unlike in supervised learning, a data point in clustering naturally does not have a true label. This implies that a clustering algorithm is not always theoretically feasible. Let \(\chi\) denote the sample space, \(d\) denote a dissimilarity function measuring the dissimilarity, or "distance", between two points, and \(F\) denote a clustering model that maps \(\chi\) into a group of subspaces \(C = (C_1,C_2,\dots,C_k)\): \[ C = (C_1,C_2,\dots,C_k) = F(\chi,d) \] For \(F\) to be a useful process, i.e., one that segments the space into subspaces with internal similarity, three assumptions must be fulfilled:

  1. Scale Invariance: \(F(\chi,d) = F(\chi,\alpha d)\) for any \(\alpha > 0\). The clustering does not depend on the scale (unit of distance).
  2. Richness: for any possible split \(c\) of the sample space, there exists a dissimilarity function \(d\) such that \(F(\chi,d) = c\). That is, \(F\) should be able to output every possible split.
  3. Consistency: shrinking distances within clusters and stretching distances between clusters should not change the result. That is, if \(d_2\) is obtained from \(d_1\) by decreasing the distances between points in the same cluster of \(F(\chi,d_1)\) and increasing the distances between points in different clusters, then \(F(\chi,d_1) = F(\chi,d_2)\).

Nevertheless, according to Kleinberg, there does not exist an \(F\) that fulfills all three assumptions. Thus, a clustering model is always imperfect; this is called the Clustering Paradox. Still, we can relax the constraint of richness to k-richness, which means that, given the number of clusters k, \(F\) is able to output all possible splits into k clusters. Under such a compromise, we can build a model that satisfies scale invariance, k-richness and consistency, and try to find the best k.
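As a quick, informal check rather than a proof, the snippet below uses k-means (an assumed example of \(F\), scikit-learn assumed installed) and verifies that scaling all distances by a constant does not change the resulting partition, i.e., the Scale Invariance axiom:

```python
# Empirical check of Scale Invariance for k-means on well-separated toy data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(50, 2)) for c in (0.0, 5.0, 10.0)])

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
labels_scaled = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(3.0 * X)  # alpha * d

# 1.0 means the two partitions are identical up to relabeling, i.e. F(X, d) = F(X, alpha*d);
# expected to print 1.0 for this well-separated toy data.
print(adjusted_rand_score(labels, labels_scaled))
```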

Though different clustering models have their own ways to evaluate the fitness of the model, a general way to evaluate the utility of a clustering model is: \[ U[C = (C_1,C_2,\dots,C_k)] = \frac{\sum_{m=1}^k P(C_m)\sum_i \sum_j \left[P(a_i=v_{i,j}|C_m)^2 -P(a_i=v_{i,j})^2\right]}{k} \] where \(a_i\) is the \(i^{th}\) attribute and \(v_{i,j}\) is the \(j^{th}\) possible value of \(a_i\).

From an information-theoretic perspective, through the clustering model we should gain knowledge about the attribute values of a data point by knowing its cluster. \(P(a_i=v_{i,j}|C_m)^2 -P(a_i=v_{i,j})^2\) measures such a gain of knowledge. \(P(C_m)\) is the weight of each cluster in the sample space, which prevents the split from being too imbalanced. \(k\) is the number of clusters; it is included to penalize the complexity of the model, as making every data point its own cluster would always maximize within-cluster similarity.
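The utility \(U\) above can be computed directly; the snippet below does so term by term on a made-up dataset with two binary attributes and an assumed 2-cluster split:

```python
# Direct computation of the clustering utility U on a toy discrete dataset.
import numpy as np

data = np.array([[0, 0], [0, 0], [0, 1], [1, 1], [1, 1], [1, 0]])   # rows: points, columns: attributes a_i
clusters = [np.array([0, 1, 2]), np.array([3, 4, 5])]               # indices of the two assumed clusters

k = len(clusters)
U = 0.0
for idx in clusters:
    P_Cm = len(idx) / len(data)                    # P(C_m)
    gain = 0.0
    for i in range(data.shape[1]):                 # attributes a_i
        for v in np.unique(data[:, i]):            # values v_{i,j}
            p_cond = np.mean(data[idx, i] == v)    # P(a_i = v_{i,j} | C_m)
            p_marg = np.mean(data[:, i] == v)      # P(a_i = v_{i,j})
            gain += p_cond ** 2 - p_marg ** 2
    U += P_Cm * gain
U /= k                                             # divide by the number of clusters
print(U)
```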

Categories of Clustering Algorithms

Based on the mechanism of clustering, clustering algorithms can be categorized as follows (a minimal example follows the list):

  • Distance-based clustering: K-Means, FCM
  • Hierarchical clustering: agglomerative (bottom-up) and divisive (top-down) HC
  • Density-based clustering: DBSCAN, CFSFDP
  • Probability-based clustering: GMM
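A minimal example of two of the families above, K-Means (distance-based) and DBSCAN (density-based), on synthetic data with scikit-learn assumed installed:

```python
# One algorithm from each of two clustering families on a toy two-blob dataset.
import numpy as np
from sklearn.cluster import KMeans, DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in (0.0, 3.0)])

print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)[:5])   # distance-based labels
print(DBSCAN(eps=0.5, min_samples=5).fit_predict(X)[:5])                    # density-based labels (-1 = noise)
```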

3.4 Other Unsupervised Learning

There are many emerging unsupervised learning algorithms such as PredNet and GAN. These algorithms will be elaborated in separate sections in other articles.

