Unveiling Autoencoders in Machine Learning


Sparse Autoencoder

Sparse autoencoders are used to pull out the most influential feature representations. This is beneficial when trying to understand what the most unique features of a data set are. It is also useful when using autoencoders as inputs to downstream supervised models, as it helps to highlight the unique signals across the features.

Recall that a neuron in a network is considered active if its output exceeds a certain threshold. Since the Tanh activation function is S-shaped with outputs ranging from -1 to 1, we consider a neuron active if its output is close to 1 and inactive if its output is close to -1. Incorporating sparsity forces more neurons to be inactive, which requires the autoencoder to represent each input as a combination of a smaller number of activations.

To incorporate sparsity, we must first understand the actual sparsity of the coding layer. This is simply the average activation of the coding layer as a function of the activation function used (𝐴) and the inputs supplied (𝑋):

𝜌̂ = (1/m) ∑ 𝐴(𝑋)

where the sum runs over the m training observations. For our current best_model with 100 codings, the sparsity level is approximately zero.

ae100_codings <- h2o.deepfeatures(best_model, features, layer = 1)
ae100_codings %>%
  as.data.frame() %>%
  tidyr::gather() %>%
  summarize(average_activation = mean(value))

Sparse autoencoders attempt to enforce the constraint 𝜌̂ = 𝜌, where 𝜌 is a sparsity parameter. This penalizes the neurons that are too active, forcing them to activate less. To achieve this, we add an extra penalty term to our objective function.

The most commonly used penalty is the Kullback-Leibler divergence (KL divergence), which measures the divergence between the target probability 𝜌 that a neuron in the coding layer will activate and the actual probability 𝜌̂. This penalty term is commonly written as

∑ KL(𝜌 ‖ 𝜌̂) = ∑ [ 𝜌 log(𝜌 / 𝜌̂) + (1 − 𝜌) log((1 − 𝜌) / (1 − 𝜌̂)) ]

Similar to the ridge and LASSO penalties discussed earlier, we add this penalty to our objective function and incorporate a parameter (𝛽) to control the weight of the penalty. Consequently, our revised loss function with sparsity induced is

minimize( L(X, X′) + 𝛽 ∑ KL(𝜌 ‖ 𝜌̂) )
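
To make the penalty concrete, here is a minimal base-R sketch of the KL divergence term and the 𝛽-weighted loss. The function name kl_penalty and every numeric value below are hypothetical, and the activations are expressed on a (0, 1) probability scale, which is what the KL formula requires.

# Minimal sketch of the KL divergence sparsity penalty (base R only)
# rho     : target average activation (the sparsity parameter)
# rho_hat : actual average activation of each coding neuron, in (0, 1)
kl_penalty <- function(rho, rho_hat) {
  sum(rho * log(rho / rho_hat) +
        (1 - rho) * log((1 - rho) / (1 - rho_hat)))
}

# Hypothetical values for illustration only
rho     <- 0.05
rho_hat <- c(0.02, 0.10, 0.07, 0.04)

# Revised sparse loss: reconstruction error plus the beta-weighted penalty
beta                 <- 0.01
reconstruction_error <- 0.15   # stand-in for the autoencoder's MSE
reconstruction_error + beta * kl_penalty(rho, rho_hat)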

Figure: The average activation of the coding neurons in our default autoencoder using a Tanh activation function.

Assume we want to induce sparsity with our current autoencoder, which contains 100 codings. We need to specify two parameters: 𝜌 and 𝛽. In this example, we will just induce a little sparsity and specify 𝜌 = −0.1 by including average_activation = -0.1. Since 𝛽 could take on multiple values, we will do a grid search across different sparsity_beta values. Our results indicate that 𝛽 = 0.01 performs best in reconstructing the original inputs.

# Hyperparameter search grid
hyper_grid <- list(sparsity_beta = c(0.01, 0.05, 0.1, 0.2))

# Execute grid search
ae_sparsity_grid <- h2o.grid(
  algorithm = 'deeplearning',
  x = seq_along(features),
  training_frame = features,
  grid_id = 'sparsity_grid',
  autoencoder = TRUE,
  hidden = 100,
  activation = 'Tanh',
  hyper_params = hyper_grid,
  sparse = TRUE,
  average_activation = -0.1,
  ignore_const_cols = FALSE,
  seed = 123
)

# Print grid details
h2o.getGrid('sparsity_grid', sort_by = 'mse', decreasing = FALSE)
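
As a quick follow-up, the best model from the grid can be pulled out for inspection. h2o.getGrid() and h2o.getModel() are standard h2o functions; the object name best_sparse_model is simply an assumption reused in the sketches that follow.

# Grab the grid sorted by MSE; the first model ID is the best performer
sparsity_grid     <- h2o.getGrid('sparsity_grid', sort_by = 'mse', decreasing = FALSE)
best_sparse_model <- h2o.getModel(sparsity_grid@model_ids[[1]])

# Reconstruction error of the best sparse autoencoder
h2o.mse(best_sparse_model)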

If we look at the average activation across our neurons now, we see that it has shifted to the left compared to before; it is now -0.108, as illustrated in the figure.

Figure: The average activation of the coding neurons in our sparse autoencoder is now -0.108.
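
You can check this value yourself by re-running the same average-activation pattern used earlier against the sparse model. This is just a sketch and assumes the best_sparse_model object extracted from the grid above.

# Average activation of the coding neurons in the sparse autoencoder
sparse_codings <- h2o.deepfeatures(best_sparse_model, features, layer = 1)
sparse_codings %>%
  as.data.frame() %>%
  tidyr::gather() %>%
  summarize(average_activation = mean(value))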

The amount of sparsity you apply depends on multiple factors. When using autoencoders for descriptive dimension reduction, the level of sparsity depends on the level of insight you want to gain about the most unique statistical features. If you are trying to understand the most essential characteristics that explain the features or images, then a lower sparsity value is preferred.

For example, this figure compares four sampled digits from the MNIST test set with a non-sparse autoencoder with a single layer of 100 codings using Tanh activation functions and a sparse autoencoder that constrains 𝜌 = −0.75. Adding sparsity helps to highlight the features that are driving the uniqueness of these sampled digits. This is most pronounced with the number 5, where the sparse autoencoder reveals that the primary focus is on the upper portion of the glyph.

If you are using autoencoders as a feature engineering step prior to downstream supervised modeling, then the level of sparsity can be considered a hyperparameter that can be optimized with a grid search.

Figure: Original digits sampled from the MNIST test set (left), reconstruction of the sampled digits with a non-sparse autoencoder (middle), and reconstruction with a sparse autoencoder (right).

As we discussed, an undercomplete autoencoder constrains the number of codings to be less than the number of inputs. This constraint prevents the autoencoder from learning the identity function, which would simply create a perfect mapping of inputs to outputs and not learn anything about the features' salient characteristics. There are ways to prevent an autoencoder with more hidden units than inputs, known as an overcomplete autoencoder, from learning the identity function.

Adding sparsity is one such approach; another is to add randomness to the transformation from input to reconstruction, which we discuss next.
