Encoding Methods for Categorical Variables
1. About Encoding
Encoding is a process that maps a categorical variable to a numerical variable. Most models do not accept untransformed categorical data as input. Seven encoders are introduced below.
2. Label Encoder
Label Encoder is one of the simplest encoding methods: it assigns an integer label to each category.
A categorical variable before encoding:
Species |
---|
Cat |
Dog |
Bird |
after encoding:
Species |
---|
0 |
1 |
2 |
A Label Encoder has the following properties:
- It changes a categorical variable into a multivalued discrete variable
- It does not generate extra variables, and is therefore memory-saving
- The encoded values have a magnitude relationship, so the Label Encoder should be applied to ordinal variables rather than nominal variables
- It does not change the number of categories
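The mapping described above can be sketched in a few lines of plain Python. `label_encode` is a hypothetical helper written for illustration, not a library API; production code would typically use an existing implementation instead.

```python
def label_encode(values):
    # Assign each category an integer in order of first appearance.
    mapping = {}
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping)
    return [mapping[v] for v in values], mapping

encoded, mapping = label_encode(["Cat", "Dog", "Bird"])
print(encoded)  # [0, 1, 2]
print(mapping)  # {'Cat': 0, 'Dog': 1, 'Bird': 2}
```

Note that the integer order here is arbitrary (first appearance); for a genuinely ordinal variable you would supply the category order explicitly.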
3. One-hot Encoder
One-hot Encoder expands the categorical variable into c variables, where c is the number of categories. These variables are mutually exclusive: for each sample, exactly one of them is 1.
A categorical variable before encoding:
Species |
---|
Cat |
Dog |
Bird |
after encoding:
Cat | Dog | Bird |
---|---|---|
1 | 0 | 0 |
0 | 1 | 0 |
0 | 0 | 1 |
For some kinds of data, the generated variables can also be non-exclusive; strictly speaking this is no longer one-hot encoding but a decomposition into component features:
Color | R | G | B |
---|---|---|---|
White | 255 | 255 | 255 |
Purple | 128 | 0 | 128 |
Orange | 255 | 165 | 0 |
A One-hot Encoder has the following properties:
- It changes a categorical variable into c multivalued/binary discrete variables
- It generates c-1 extra variables, and is therefore memory-costly when c is large; it suits variables with few categories
- It can be applied to both ordinal and nominal variables
- It does not change the number of categories
- Dummy-variable and true/false encoders are very similar to the one-hot encoder, with only small differences
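The expansion into c exclusive binary columns can be sketched as follows. `one_hot_encode` is a hypothetical helper for illustration; it sorts the categories so the column order is deterministic.

```python
def one_hot_encode(values):
    # One binary column per distinct category, in sorted order.
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return rows, categories

rows, columns = one_hot_encode(["Cat", "Dog", "Bird"])
print(columns)  # ['Bird', 'Cat', 'Dog']
print(rows)     # [[0, 1, 0], [0, 0, 1], [1, 0, 0]]
```

A dummy-variable encoding would simply drop one of these columns, since it is fully determined by the other c-1.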
4. Target Encoder
Target Encoder transforms the categorical variable according to the output variable.
For a numerical output, the target encoder replaces each category with the mean of the output over the samples in that category:
X(species) | Y(Weight) | X' |
---|---|---|
cat | 10 | 12.5 |
cat | 15 | 12.5 |
dog | 20 | 25 |
dog | 30 | 25 |
For a categorical output, the target encoder replaces the categorical variable with \(P(y = y_i \mid x = x_i)\) for each output category:
X(Species) | Y(Size) | X1(size=small) | X2(size=medium) | X3(size=big) |
---|---|---|---|---|
cat | big | 0.25 | 0.25 | 0.5 |
cat | big | 0.25 | 0.25 | 0.5 |
cat | medium | 0.25 | 0.25 | 0.5 |
cat | small | 0.25 | 0.25 | 0.5 |
dog | big | 0.33 | 0 | 0.67 |
dog | big | 0.33 | 0 | 0.67 |
dog | small | 0.33 | 0 | 0.67 |
A Target Encoder has the following properties:
- It changes a categorical variable into one or more continuous variables
- For continuous and binary outputs it generates no extra variables; for a multivalued categorical output it generates k variables, where k is the number of categories of the output variable. When k < c, the target encoder is more memory-saving than the one-hot encoder
- It can be applied to both ordinal and nominal variables
- It does not change the number of categories
- There are several improved target encoders, such as the smoothed target encoder and the Bayesian target encoder
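The numerical-output case (mean encoding) can be sketched directly from the weight table above. `target_encode` is a hypothetical helper for illustration; a real pipeline would fit the means on the training set only, to avoid target leakage.

```python
from collections import defaultdict

def target_encode(xs, ys):
    # Replace each category with the mean of the target over that category.
    sums = defaultdict(float)
    counts = defaultdict(int)
    for x, y in zip(xs, ys):
        sums[x] += y
        counts[x] += 1
    means = {x: sums[x] / counts[x] for x in sums}
    return [means[x] for x in xs]

print(target_encode(["cat", "cat", "dog", "dog"], [10, 15, 20, 30]))
# [12.5, 12.5, 25.0, 25.0]
```

The smoothed variants mentioned above blend each category mean with the global mean, which stabilizes the encoding for rare categories.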
5. Frequency Encoder
The Frequency Encoder converts the categorical variable into a discrete variable by counting each category's frequency in the training dataset:
A categorical variable before encoding:
Species |
---|
Cat |
Cat |
Dog |
Bird |
after encoding:
Species | X' |
---|---|
Cat | 2 |
Dog | 1 |
Bird | 1 |
A Frequency Encoder has the following properties:
- It changes a categorical variable into a discrete variable
- It does not generate extra variables, and is therefore memory-saving
- Categories with equal frequency collide into the same value, which changes the number of categories; this encoding method therefore does not fit small datasets
- The transformed variable has a magnitude relationship that the original categories may not have
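The counting step is a one-liner with the standard library. `frequency_encode` is a hypothetical helper written for illustration; note in the output how Dog and Bird collide into the same value, the drawback listed above.

```python
from collections import Counter

def frequency_encode(values):
    # Replace each value with the count of its category in the data.
    counts = Counter(values)
    return [counts[v] for v in values]

print(frequency_encode(["Cat", "Cat", "Dog", "Bird"]))
# [2, 2, 1, 1] -- Dog and Bird both map to 1
```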
6. Binary Encoder
Binary Encoder uses \(\lceil \log_2 N \rceil\) variables to express the original variable with N categories.
Species |
---|
Cat |
Dog |
Bird |
Snake |
A variable with four categories can be expressed as a 2-dimensional vector:
Species | X1 | X2 |
---|---|---|
Cat | 0 | 0 |
Dog | 0 | 1 |
Bird | 1 | 0 |
Snake | 1 | 1 |
The Binary Encoder has properties similar to the One-hot Encoder, but:
- It saves more memory
- The generated variables are less interpretable
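Binary encoding is label encoding followed by writing each integer code in base 2, using \(\lceil \log_2 N \rceil\) bit columns. `binary_encode` is a hypothetical helper written for illustration:

```python
import math

def binary_encode(values):
    # Assign integer codes in order of first appearance, then write each
    # code as ceil(log2(N)) bits, most significant bit first.
    codes = {}
    for v in values:
        if v not in codes:
            codes[v] = len(codes)
    width = max(1, math.ceil(math.log2(len(codes))))
    return [[(codes[v] >> i) & 1 for i in reversed(range(width))]
            for v in values]

print(binary_encode(["Cat", "Dog", "Bird", "Snake"]))
# [[0, 0], [0, 1], [1, 0], [1, 1]]
```

With 4 categories this reproduces the table above; 1000 categories would need only 10 columns instead of the 1000 that one-hot encoding requires.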
7. Hash Encoder
The Hash Encoder maps the original variable into a low-dimensional space with a hash function, and uses the count in each hash bin as the transformed value. It is usually applied in text-processing scenarios.
A text variable before encoding:
Message |
---|
I love python python is good |
I dont like python |
The same variable after encoding (here each word happens to get its own bin, so no collisions occur):
text | I | love | Python | is | good | dont | like |
---|---|---|---|---|---|---|---|
I love python python is good | 1 | 1 | 2 | 1 | 1 | 0 | 0 |
I dont like python | 1 | 0 | 1 | 0 | 0 | 1 | 1 |
A Hash Encoder has the following properties:
- It changes a categorical variable into several discrete variables
- Compared to the One-hot Encoder, it saves memory when the original variable is complex and repetitive, such as text or graphs
- There might be collisions between categories, which changes the number of categories; this encoding method therefore does not fit small datasets
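A minimal hashing-trick sketch, assuming a fixed bin count `n_bins` chosen in advance. `hash_encode` is a hypothetical helper; it uses `hashlib.md5` rather than Python's built-in `hash()`, because the built-in string hash is salted per process and would not give reproducible bins.

```python
import hashlib

def hash_encode(token_lists, n_bins=8):
    # Hash every token into one of n_bins bins and count hits per bin.
    def bin_of(token):
        return int(hashlib.md5(token.encode()).hexdigest(), 16) % n_bins

    rows = []
    for tokens in token_lists:
        row = [0] * n_bins
        for t in tokens:
            row[bin_of(t)] += 1
        rows.append(row)
    return rows

rows = hash_encode([["I", "love", "python", "python", "is", "good"],
                    ["I", "dont", "like", "python"]], n_bins=4)
```

Unlike the word-per-column table above, the number of columns stays fixed at `n_bins` no matter how large the vocabulary grows, at the cost of occasional collisions.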
8. Embedding Encoder
Embedding is a technique that transforms the original categorical variable into a vector that reflects the similarity between the original categories. It is most often used in deep-learning scenarios such as NLP. Generally speaking, it can be regarded as a kind of encoding method.
[ongoing]