Encoding Methods for Categorical Variables

1. About Encoding

Encoding is a process that maps a categorical variable to a numerical variable. Most models do not accept untransformed categorical data as input. Several types of encoders are introduced below.

2. Label Encoder

The Label Encoder is one of the simplest encoding methods. It assigns an integer label to each category.

A categorical variable before encoding:

Species
Cat
Dog
Bird

After encoding:

Species
0
1
2

A Label Encoder has the following properties:

  • It changes a categorical variable into a multivalued discrete variable
  • It does not generate extra variables, thus it is memory-efficient
  • The encoded values carry a magnitude (order) relationship, thus the Label Encoder should be applied to ordinal variables rather than nominal variables
  • It does not change the number of categories
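
As a quick illustration, here is a minimal sketch of label encoding using scikit-learn's LabelEncoder (assuming scikit-learn is installed; note that it assigns labels in sorted category order, so the exact integers may differ from the table above):

```python
# Minimal label-encoding sketch with scikit-learn (assumption: sklearn installed).
from sklearn.preprocessing import LabelEncoder

species = ["Cat", "Dog", "Bird"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(species)   # sorted order: Bird -> 0, Cat -> 1, Dog -> 2

print(dict(zip(species, encoded)))
# inverse_transform recovers the original categories from the integer labels
print(encoder.inverse_transform(encoded))
```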

3. One-hot Encoder

The One-hot Encoder expands a categorical variable into c variables, where c is the number of categories. These variables are mutually exclusive: for each sample, exactly one of them is 1 and the rest are 0.

A categorical variable before encoding:

Species
Cat
Dog
Bird

After encoding:

Cat Dog Bird
1 0 0
0 1 0
0 0 1

For some kinds of data, the generated variables can also be non-exclusive:

Color R G B
Black 0 0 0
Purple 255 0 100
Orange 100 120 10

A One-hot Encoder has the following properties:

  • It changes a categorical variable into c binary (or multivalued) discrete variables
  • It generates c-1 extra variables, so it is memory-costly when c is large; it suits variables with few categories
  • It can be applied to both ordinal and nominal variables
  • It does not change the number of categories
  • Dummy-variable encoding and True/False encoding are very similar to one-hot encoding, with only small differences
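
Below is a minimal sketch of one-hot encoding with pandas (assuming pandas is installed); passing drop_first=True gives the closely related dummy-variable encoding with c-1 columns mentioned above:

```python
# Minimal one-hot encoding sketch with pandas (assumption: pandas installed).
import pandas as pd

df = pd.DataFrame({"Species": ["Cat", "Dog", "Bird"]})

one_hot = pd.get_dummies(df["Species"], dtype=int)                    # c binary columns
dummies = pd.get_dummies(df["Species"], dtype=int, drop_first=True)   # c-1 columns (dummy variables)

print(one_hot)
print(dummies)
```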

4. Target Encoder

The Target Encoder transforms a categorical variable according to the output (target) variable.

For a numerical output, the Target Encoder replaces each category with the mean of the target over the samples in that category:

X (Species) Y (Weight) X'
cat 10 12.5
cat 15 12.5
dog 20 25
dog 30 25

For a categorical output, the Target Encoder replaces each category with the conditional probabilities \(P(y = y_j \mid x = x_i)\), one variable per output class:

X (Species) Y (Size) X1 (size=small) X2 (size=medium) X3 (size=big)
cat big 0.25 0.25 0.5
cat big 0.25 0.25 0.5
cat medium 0.25 0.25 0.5
cat small 0.25 0.25 0.5
dog big 0.33 0 0.66
dog big 0.33 0 0.66
dog small 0.33 0 0.66

A Target Encoder has the following properties:

  • It changes a categorical variable into one or more continuous variables
  • For continuous and binary outputs, it does not generate extra variables; for a multivalued categorical output, it generates k variables, where k is the number of categories of the output variable. When k < c, the Target Encoder can be more memory-efficient than the One-hot Encoder
  • It can be applied to both ordinal and nominal variables
  • It does not change the number of categories
  • There are several improved target encoders, such as the smoothed target encoder and the Bayesian target encoder
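
A minimal sketch of target encoding with plain pandas is shown below (assuming pandas is installed); practical variants such as the smoothed or Bayesian encoders mentioned above add regularization or cross-validation to reduce target leakage:

```python
# Minimal target-encoding sketch with pandas (assumption: pandas installed).
import pandas as pd

# Numerical target: replace each category with the mean target of that category.
df = pd.DataFrame({
    "Species": ["cat", "cat", "dog", "dog"],
    "Weight":  [10, 15, 20, 30],
})
category_means = df.groupby("Species")["Weight"].mean()
df["Species_encoded"] = df["Species"].map(category_means)
print(df)

# Categorical target: replace each category with P(y = y_j | x = x_i),
# one column per output class.
df2 = pd.DataFrame({
    "Species": ["cat", "cat", "cat", "cat", "dog", "dog", "dog"],
    "Size":    ["big", "big", "medium", "small", "big", "big", "small"],
})
probs = pd.crosstab(df2["Species"], df2["Size"], normalize="index")
print(df2.join(probs, on="Species"))
```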

5. Frequency Encoder

The Frequency Encoder converts a categorical variable into a discrete variable by counting each category's frequency in the training dataset.

A categorical variable before encoding:

Species
Cat
Cat
Dog
Bird

After encoding:

Species X'
Cat 2
Dog 1
Bird 1

A Frequency Encoder has the following properties:

  • It changes a categorical variable into a discrete variable
  • It does not generate extra variables, thus it is memory-efficient
  • Collisions are possible (categories with the same frequency become indistinguishable), which changes the number of categories; thus this encoding method does not suit small datasets
  • The transformed variable carries a magnitude (frequency order) relationship
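
A minimal sketch of frequency encoding with pandas follows (assuming pandas is installed); note how Dog and Bird collide on the same value, as described above:

```python
# Minimal frequency-encoding sketch with pandas (assumption: pandas installed).
import pandas as pd

train = pd.DataFrame({"Species": ["Cat", "Cat", "Dog", "Bird"]})

counts = train["Species"].value_counts()        # Cat: 2, Dog: 1, Bird: 1
train["Species_freq"] = train["Species"].map(counts)   # Dog and Bird collide on 1

print(train)
# The same `counts` mapping should be reused to transform validation/test data.
```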

6. Binary Encoder

The Binary Encoder uses \(\lceil \log_2 N \rceil\) variables to express an original variable with N categories:

Species
Cat
Dog
Bird
Snake

A variable with four categories can be expressed as a 2-dimensional vector:

Species X1 X2
Cat 0 0
Dog 0 1
Bird 1 0
Snake 1 1

The Binary Encoder has similar properties to the One-hot Encoder, but:

  • It saves more memory
  • The generated variables are less interpretable
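
Below is a hand-rolled sketch of binary encoding (illustrative only, not a library API; packages such as category_encoders also offer a BinaryEncoder): first label-encode the categories, then write each label in base 2 using \(\lceil \log_2 N \rceil\) binary columns.

```python
# Hand-rolled binary-encoding sketch (assumption: pandas installed).
import math
import pandas as pd

species = ["Cat", "Dog", "Bird", "Snake"]
labels = {cat: i for i, cat in enumerate(species)}    # Cat -> 0, ..., Snake -> 3

n_bits = math.ceil(math.log2(len(labels)))            # 2 bits cover 4 categories

rows = []
for cat in species:
    code = labels[cat]
    bits = [(code >> b) & 1 for b in reversed(range(n_bits))]   # most significant bit first
    rows.append([cat] + bits)

print(pd.DataFrame(rows, columns=["Species"] + [f"X{i + 1}" for i in range(n_bits)]))
```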

7. Hash Encoder

The Hash Encoder maps the original variable into a low-dimensional space and uses the count in each hash bin as the transformed value. It is usually applied in text-processing scenarios.

A text variable before encoding:

Message
I love python python is good
I dont like python

A text variable after encoding:

Message                          I  love  python  is  good  dont  like
"I love python python is good"   1  1     2       1   1     0     0
"I dont like python"             1  0     1       0   0     1     1

A Hash Encoder has the following properties:

  • It changes a categorical variable into several discrete variables
  • Compared to the One-hot Encoder, it saves memory when the original variable is complex and contains many repeated elements, such as text or graphs
  • Collisions are possible (different values mapped to the same bin), which changes the number of categories; thus this encoding method does not suit small datasets
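
A minimal sketch of hash encoding for text with scikit-learn's HashingVectorizer is shown below (assuming scikit-learn is installed). Unlike the illustrative table above, the resulting columns are anonymous hash bins rather than named words, and different tokens may share a bin:

```python
# Minimal hash-encoding sketch for text (assumption: scikit-learn installed).
from sklearn.feature_extraction.text import HashingVectorizer

messages = [
    "I love python python is good",
    "I dont like python",
]

# Each token is hashed into one of n_features bins; bin counts become the features.
vectorizer = HashingVectorizer(n_features=8, norm=None, alternate_sign=False)
X = vectorizer.fit_transform(messages)     # sparse matrix of shape (2, 8)

print(X.toarray())
```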

8. Embedding Encoder

Embedding is a technique that transforms the original categorical variable into a vector that reflects the similarity between the original categories. It is more frequently used in deep learning scenarios such as NLP. Generally speaking, it can be regarded as a kind of encoding method.
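
As a brief sketch, an embedding layer in PyTorch maps each label-encoded category to a trainable dense vector (assuming torch is installed; the embedding dimension is a free choice, and the vectors below are randomly initialized until trained):

```python
# Minimal embedding sketch with PyTorch (assumption: torch installed).
import torch
import torch.nn as nn

num_categories = 3        # Cat, Dog, Bird
embedding_dim = 4         # size of the learned vector per category (free choice)

embedding = nn.Embedding(num_categories, embedding_dim)

species_ids = torch.tensor([0, 1, 2])   # label-encoded categories
vectors = embedding(species_ids)        # shape: (3, 4), learned during training
print(vectors)
```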

[ongoing]

