Encoding Methods for Categorical Variables
1. About Encoding
Encoding is a process that maps a categorical variable to a numerical variable. Most models do not accept untransformed categorical data as input. Seven encoders are introduced below.
2. Label Encoder
Label Encoder is one of the simplest encoding methods: it assigns an integer label to each category.
A categorical variable before encoding:
Species |
---|
Cat |
Dog |
Bird |
after encoding:
Species |
---|
0 |
1 |
2 |
A Label Encoder has the following properties:
- It changes a categorical variable into a multivalued discrete variable
- It does not generate extra variables, and is therefore memory-saving
- The encoded values have a magnitude relationship, so the Label Encoder should be applied to ordinal variables rather than nominal variables
- It does not change the number of categories
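The mapping described above can be sketched in a few lines of plain Python. `label_encode` is a hypothetical helper written for illustration, not a library API; production code would typically use an existing implementation instead.

```python
def label_encode(values):
    # Assign each category an integer in order of first appearance.
    mapping = {}
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping)
    return [mapping[v] for v in values], mapping

encoded, mapping = label_encode(["Cat", "Dog", "Bird"])
print(encoded)  # [0, 1, 2]
print(mapping)  # {'Cat': 0, 'Dog': 1, 'Bird': 2}
```

Note that the integer order here is arbitrary (first appearance); for a genuinely ordinal variable you would supply the category order explicitly.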
3. One-hot Encoder
One-hot Encoder expands the categorical variable into c variables, where c is the number of categories. These variables are mutually exclusive: for each sample, exactly one of them is 1.
A categorical variable before encoding:
Species |
---|
Cat |
Dog |
Bird |
after encoding:
Cat | Dog | Bird |
---|---|---|
1 | 0 | 0 |
0 | 1 | 0 |
0 | 0 | 1 |
For some kinds of data, the generated variables can also be non-exclusive; strictly speaking this is no longer one-hot encoding but a decomposition into component features:
Color | R | G | B |
---|---|---|---|
White | 255 | 255 | 255 |
Purple | 128 | 0 | 128 |
Orange | 255 | 165 | 0 |
A One-hot Encoder has the following properties:
- It changes a categorical variable into c multivalued/binary discrete variables
- It generates c-1 extra variables, and is therefore memory-costly when c is large; it suits variables with few categories
- It can be applied to both ordinal and nominal variables
- It does not change the number of categories
- Dummy-variable and true/false encoders are very similar to the one-hot encoder, with only small differences
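The expansion into c exclusive binary columns can be sketched as follows. `one_hot_encode` is a hypothetical helper for illustration; it sorts the categories so the column order is deterministic.

```python
def one_hot_encode(values):
    # One binary column per distinct category, in sorted order.
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return rows, categories

rows, columns = one_hot_encode(["Cat", "Dog", "Bird"])
print(columns)  # ['Bird', 'Cat', 'Dog']
print(rows)     # [[0, 1, 0], [0, 0, 1], [1, 0, 0]]
```

A dummy-variable encoding would simply drop one of these columns, since it is fully determined by the other c-1.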
4. Target Encoder
Target Encoder transforms the categorical variable according to the output variable.
For a numerical output, the target encoder replaces each category with the mean of the output over the samples in that category:
X(species) | Y(Weight) | X' |
---|---|---|
cat | 10 | 12.5 |
cat | 15 | 12.5 |
dog | 20 | 25 |
dog | 30 | 25 |
For a categorical output, the target encoder replaces the categorical variable with \(P(y = y_i \mid x = x_i)\) for each output category:
X(Species) | Y(Size) | X1(size=small) | X2(size=medium) | X3(size=big) |
---|---|---|---|---|
cat | big | 0.25 | 0.25 | 0.5 |
cat | big | 0.25 | 0.25 | 0.5 |
cat | medium | 0.25 | 0.25 | 0.5 |
cat | small | 0.25 | 0.25 | 0.5 |
dog | big | 0.33 | 0 | 0.67 |
dog | big | 0.33 | 0 | 0.67 |
dog | small | 0.33 | 0 | 0.67 |
A Target Encoder has the following properties:
- It changes a categorical variable into one or more continuous variables
- For continuous and binary outputs it generates no extra variables; for a multivalued categorical output it generates k variables, where k is the number of categories of the output variable. When k < c, the target encoder is more memory-saving than the one-hot encoder
- It can be applied to both ordinal and nominal variables
- It does not change the number of categories
- There are several improved target encoders, such as the smoothed target encoder and the Bayesian target encoder
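The numerical-output case (mean encoding) can be sketched directly from the weight table above. `target_encode` is a hypothetical helper for illustration; a real pipeline would fit the means on the training set only, to avoid target leakage.

```python
from collections import defaultdict

def target_encode(xs, ys):
    # Replace each category with the mean of the target over that category.
    sums = defaultdict(float)
    counts = defaultdict(int)
    for x, y in zip(xs, ys):
        sums[x] += y
        counts[x] += 1
    means = {x: sums[x] / counts[x] for x in sums}
    return [means[x] for x in xs]

print(target_encode(["cat", "cat", "dog", "dog"], [10, 15, 20, 30]))
# [12.5, 12.5, 25.0, 25.0]
```

The smoothed variants mentioned above blend each category mean with the global mean, which stabilizes the encoding for rare categories.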
5. Frequency Encoder
The Frequency Encoder converts the categorical variable into a discrete variable by counting each category's frequency in the training dataset:
A categorical variable before encoding:
Species |
---|
Cat |
Cat |
Dog |
Bird |
after encoding:
Species | X' |
---|---|
Cat | 2 |
Dog | 1 |
Bird | 1 |
A Frequency Encoder has the following properties:
- It changes a categorical variable into a discrete variable
- It does not generate extra variables, and is therefore memory-saving
- Categories with equal frequency collide into the same value, which changes the number of categories; this encoding method therefore does not fit small datasets
- The transformed variable has a magnitude relationship that the original categories may not have
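The counting step is a one-liner with the standard library. `frequency_encode` is a hypothetical helper written for illustration; note in the output how Dog and Bird collide into the same value, the drawback listed above.

```python
from collections import Counter

def frequency_encode(values):
    # Replace each value with the count of its category in the data.
    counts = Counter(values)
    return [counts[v] for v in values]

print(frequency_encode(["Cat", "Cat", "Dog", "Bird"]))
# [2, 2, 1, 1] -- Dog and Bird both map to 1
```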
6. Binary Encoder
Binary Encoder uses \(\lceil \log_2 N \rceil\) variables to express the original variable with N categories.
Species |
---|
Cat |
Dog |
Bird |
Snake |
A variable with four categories can be expressed as a 2-dimensional vector:
Species | X1 | X2 |
---|---|---|
Cat | 0 | 0 |
Dog | 0 | 1 |
Bird | 1 | 0 |
Snake | 1 | 1 |
The Binary Encoder has properties similar to the One-hot Encoder, but:
- It saves more memory
- The generated variables are less interpretable
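Binary encoding is label encoding followed by writing each integer code in base 2, using \(\lceil \log_2 N \rceil\) bit columns. `binary_encode` is a hypothetical helper written for illustration:

```python
import math

def binary_encode(values):
    # Assign integer codes in order of first appearance, then write each
    # code as ceil(log2(N)) bits, most significant bit first.
    codes = {}
    for v in values:
        if v not in codes:
            codes[v] = len(codes)
    width = max(1, math.ceil(math.log2(len(codes))))
    return [[(codes[v] >> i) & 1 for i in reversed(range(width))]
            for v in values]

print(binary_encode(["Cat", "Dog", "Bird", "Snake"]))
# [[0, 0], [0, 1], [1, 0], [1, 1]]
```

With 4 categories this reproduces the table above; 1000 categories would need only 10 columns instead of the 1000 that one-hot encoding requires.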
7. Hash Encoder
The Hash Encoder maps the original variable into a low-dimensional space with a hash function, and uses the count in each hash bin as the transformed value. It is usually applied in text-processing scenarios.
A text variable before encoding:
Message |
---|
I love python python is good |
I dont like python |
The same variable after encoding (here each word happens to get its own bin, so no collisions occur):
text | I | love | Python | is | good | dont | like |
---|---|---|---|---|---|---|---|
I love python python is good | 1 | 1 | 2 | 1 | 1 | 0 | 0 |
I dont like python | 1 | 0 | 1 | 0 | 0 | 1 | 1 |
A Hash Encoder has the following properties:
- It changes a categorical variable into several discrete variables
- Compared to the One-hot Encoder, it saves memory when the original variable is complex and repetitive, such as text or graphs
- There might be collisions between categories, which changes the number of categories; this encoding method therefore does not fit small datasets
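A minimal hashing-trick sketch, assuming a fixed bin count `n_bins` chosen in advance. `hash_encode` is a hypothetical helper; it uses `hashlib.md5` rather than Python's built-in `hash()`, because the built-in string hash is salted per process and would not give reproducible bins.

```python
import hashlib

def hash_encode(token_lists, n_bins=8):
    # Hash every token into one of n_bins bins and count hits per bin.
    def bin_of(token):
        return int(hashlib.md5(token.encode()).hexdigest(), 16) % n_bins

    rows = []
    for tokens in token_lists:
        row = [0] * n_bins
        for t in tokens:
            row[bin_of(t)] += 1
        rows.append(row)
    return rows

rows = hash_encode([["I", "love", "python", "python", "is", "good"],
                    ["I", "dont", "like", "python"]], n_bins=4)
```

Unlike the word-per-column table above, the number of columns stays fixed at `n_bins` no matter how large the vocabulary grows, at the cost of occasional collisions.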
8. Embedding Encoder
Embedding is a technique that transforms the original categorical variable into a vector that reflects the similarity between the original categories. It is most often used in deep-learning scenarios such as NLP. Generally speaking, it can be regarded as a kind of encoding method.
[ongoing]