Autoencoder

Schematic structure of an autoencoder with 3 fully-connected hidden layers.

An autoencoder, autoassociator or Diabolo network^[1]^:19 is an artificial neural network used for unsupervised learning of efficient codings.^[2]^[3] The aim of an autoencoder is to learn a representation (encoding) for a set of data, typically for the purpose of dimensionality reduction. Recently, the autoencoder concept has become more widely used for learning generative models of data.^[4]^[5]

Structure

Architecturally, the simplest form of an autoencoder is a feedforward, non-recurrent neural network very similar to the multilayer perceptron (MLP): it has an input layer, an output layer and one or more hidden layers connecting them. Unlike an MLP, its output layer has the same number of nodes as its input layer, and it is trained to reconstruct its own inputs (instead of predicting a target value $Y$ given inputs $X$ ). Because no labels are required, autoencoders are unsupervised learning models.

An autoencoder always consists of two parts, the encoder and the decoder, which can be defined as transitions $\phi$ and $\psi$ , such that:

\phi :{\mathcal {X}}\rightarrow {\mathcal {F}}

\psi :{\mathcal {F}}\rightarrow {\mathcal {X}}

\phi ,\psi =\arg \min _{\phi ,\psi }\|X-(\psi \circ \phi )X\|^{2}

In the simplest case, where there is one hidden layer, an autoencoder takes the input $\mathbf {x} \in \mathbb {R} ^{d}={\mathcal {X}}$ and maps it onto $\mathbf {z} \in \mathbb {R} ^{p}={\mathcal {F}}$ :

\mathbf {z} =\sigma _{1}(\mathbf {Wx} +\mathbf {b} )

The image $\mathbf {z}$ is usually referred to as the code, the latent variables, or the latent representation. Here, $\sigma _{1}$ is an element-wise activation function such as a sigmoid function or a rectified linear unit, $\mathbf {W}$ is a weight matrix and $\mathbf {b}$ is a bias vector. After that, $\mathbf {z}$ is mapped onto the reconstruction $\mathbf {x'}$ of the same shape as $\mathbf {x}$ :

\mathbf {x'} =\sigma _{2}(\mathbf {W'z} +\mathbf {b'} )

Autoencoders are trained to minimise reconstruction errors (such as squared errors):

{\mathcal {L}}(\mathbf {x} ,\mathbf {x'} )=\|\mathbf {x} -\mathbf {x'} \|^{2}=\|\mathbf {x} -\sigma _{2}(\mathbf {W'} (\sigma _{1}(\mathbf {Wx} +\mathbf {b} ))+\mathbf {b'} )\|^{2}
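The single-hidden-layer mapping and its squared-error loss can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: $\sigma _{1}$ is taken to be a sigmoid, $\sigma _{2}$ the identity, and the sizes `d = 8`, `p = 3` and the random weights are arbitrary choices that do not come from the article.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def encode(x, W, b):
    """z = sigma_1(W x + b), with a sigmoid as sigma_1 (assumption)."""
    return sigmoid(W @ x + b)

def decode(z, W_prime, b_prime):
    """x' = sigma_2(W' z + b'), with the identity as sigma_2 (assumption)."""
    return W_prime @ z + b_prime

def reconstruction_loss(x, x_prime):
    """Squared error ||x - x'||^2."""
    return float(np.sum((x - x_prime) ** 2))

# Illustrative sizes and untrained random weights (hypothetical values).
rng = np.random.default_rng(0)
d, p = 8, 3
W, b = rng.normal(scale=0.1, size=(p, d)), np.zeros(p)
W_prime, b_prime = rng.normal(scale=0.1, size=(d, p)), np.zeros(d)

x = rng.random(d)
z = encode(x, W, b)                    # code in R^p
x_prime = decode(z, W_prime, b_prime)  # reconstruction in R^d
loss = reconstruction_loss(x, x_prime)
```

With `p < d` the code `z` is a compressed representation of `x`; the weights here are untrained, so the loss is not expected to be small.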

If the feature space ${\mathcal {F}}$ has lower dimensionality than the input space ${\mathcal {X}}$ , then the feature vector $\phi (x)$ can be regarded as a compressed representation of the input $x$ . If the hidden layers are larger than the input layer, an autoencoder can potentially learn the identity function and become useless. However, experimental results have shown that autoencoders might still learn useful features in these cases.^[1]^:19

Variations

Various techniques exist to prevent autoencoders from learning the identity function and to improve their ability to capture important information and learn richer representations:

Denoising autoencoder

Denoising autoencoders take a partially corrupted input during training and learn to recover the original undistorted input. The technique was introduced together with a specific notion of what makes a representation good:^[6] a good representation is one that can be obtained robustly from a corrupted input and that is useful for recovering the corresponding clean input. This definition contains the following implicit assumptions:

The higher level representations are relatively stable and robust to the corruption of the input;
It is necessary to extract features that are useful for representation of the input distribution.

To train an autoencoder to denoise data, a preliminary stochastic mapping $\mathbf {x} \rightarrow \mathbf {\tilde {x}}$ is first applied to corrupt the data, and $\mathbf {\tilde {x}}$ is then used as the input to a normal autoencoder. The only difference is that the loss is still computed against the original input, ${\mathcal {L}}(\mathbf {x} ,\mathbf {{\tilde {x}}'} )$ , instead of against the corrupted one, ${\mathcal {L}}(\mathbf {\tilde {x}} ,\mathbf {{\tilde {x}}'} )$ .
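A minimal sketch of the denoising loss follows. The corruption used here (zeroing each component independently with probability `drop_prob`) is one possible stochastic mapping and is an assumption, as are the function names; `reconstruct` stands for any autoencoder forward pass.

```python
import numpy as np

def corrupt(x, drop_prob, rng):
    """Stochastic mapping x -> x~: zero each component with probability drop_prob."""
    mask = rng.random(x.shape) >= drop_prob
    return x * mask

def denoising_loss(x, reconstruct, drop_prob, rng):
    """Squared error between the CLEAN input x and the reconstruction of x~."""
    x_tilde = corrupt(x, drop_prob, rng)
    x_tilde_prime = reconstruct(x_tilde)
    return float(np.sum((x - x_tilde_prime) ** 2))

rng = np.random.default_rng(0)
x = np.array([1.0, 2.0, 3.0])

# Hypothetical stand-in for a trained autoencoder: the identity map.
identity = lambda v: v
loss = denoising_loss(x, identity, drop_prob=0.5, rng=rng)
```

With the identity stand-in, the loss is the squared magnitude of the components that were zeroed out. This shows why copying the input is no longer a perfect solution once the input is corrupted.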

Sparse autoencoder

By imposing sparsity on the hidden units during training (whilst having a larger number of hidden units than inputs), an autoencoder can learn useful structures in the input data. This allows sparse representations of inputs. These are useful in pretraining for classification tasks.

Sparsity may be achieved by additional terms in the loss function during training (by comparing the probability distribution of the hidden unit activations with some low desired value),^[7] or by manually zeroing all but the few strongest hidden unit activations (referred to as a k-sparse autoencoder).^[8]
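Both routes to sparsity can be sketched briefly. The penalty below is the sum over hidden units of the Kullback–Leibler divergence between a target activation level `rho` and each unit's mean activation, which is one common form of the additional loss term; the value `rho = 0.05`, the clipping constant and the function names are assumptions made for illustration.

```python
import numpy as np

def kl_sparsity_penalty(H, rho=0.05, eps=1e-8):
    """Sum over hidden units of KL(rho || rho_hat_j).

    H has shape (n_samples, n_hidden) with activations in (0, 1);
    rho_hat_j is the mean activation of unit j over the batch.
    """
    rho_hat = np.clip(H.mean(axis=0), eps, 1.0 - eps)
    kl = (rho * np.log(rho / rho_hat)
          + (1.0 - rho) * np.log((1.0 - rho) / (1.0 - rho_hat)))
    return float(np.sum(kl))

def k_sparse(h, k):
    """Keep the k largest entries of a 1-D activation vector, zero the rest."""
    out = np.zeros_like(h)
    top = np.argpartition(h, -k)[-k:]  # indices of the k largest entries
    out[top] = h[top]
    return out

h = np.array([0.1, 0.9, 0.3, 0.7])
h_sparse = k_sparse(h, 2)  # only the two strongest activations survive
```

The penalty is zero when every unit's mean activation equals `rho` and grows as units become more active on average, so adding it to the reconstruction loss pushes most units towards inactivity.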

Variational autoencoder (VAE)

Variational autoencoder models inherit the autoencoder architecture, but make strong assumptions concerning the distribution of latent variables. They use a variational approach for latent representation learning, which results in an additional loss component and a specific training algorithm called Stochastic Gradient Variational Bayes (SGVB).^[4] The model assumes that the data is generated by a directed graphical model $p_{\theta }(\mathbf {x} |\mathbf {z} )$ and that the encoder is learning an approximation $q_{\phi }(\mathbf {z} |\mathbf {x} )$ to the posterior distribution $p_{\theta }(\mathbf {z} |\mathbf {x} )$ , where ${\mathbf {\phi }}$ and $\mathbf {\theta }$ denote the parameters of the encoder (recognition model) and decoder (generative model) respectively. The objective of the variational autoencoder in this case has the following form:

{\mathcal {L}}(\mathbf {\phi } ,\mathbf {\theta } ,\mathbf {x} )=-D_{KL}(q_{\phi }(\mathbf {z} |\mathbf {x} )||p_{\theta }(\mathbf {z} ))+\mathbb {E} _{q_{\phi }(\mathbf {z} |\mathbf {x} )}{\big (}\log p_{\theta }(\mathbf {x} |\mathbf {z} ){\big )}

Here, $D_{KL}$ stands for the Kullback–Leibler divergence of the approximate posterior from the prior, and the second term is an expected negative reconstruction error. The prior over the latent variables is set to be the centred isotropic multivariate Gaussian $p_{\theta }(\mathbf {z} )={\mathcal {N}}(\mathbf {0,I} )$ .
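The objective can be estimated numerically as sketched below. Several choices here are assumptions rather than part of the text above: the encoder is taken to output the mean `mu` and log-variance `log_var` of a diagonal Gaussian $q_{\phi }(\mathbf {z} |\mathbf {x} )$ , the decoder likelihood is a unit-variance Gaussian, and the expectation is approximated by sampling $\mathbf {z} =\mu +\sigma \odot \epsilon$ . The function names and the `decode` callable are hypothetical.

```python
import numpy as np

def neg_kl_to_standard_normal(mu, log_var):
    """-D_KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return 0.5 * float(np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var)))

def gaussian_log_likelihood(x, x_mean):
    """log N(x; x_mean, I), an assumed unit-variance Gaussian decoder."""
    d = x.size
    return float(-0.5 * np.sum((x - x_mean) ** 2) - 0.5 * d * np.log(2.0 * np.pi))

def objective_estimate(x, mu, log_var, decode, rng, n_samples=1):
    """Monte Carlo estimate of L(phi, theta, x) for one data point."""
    total = 0.0
    for _ in range(n_samples):
        eps = rng.standard_normal(mu.shape)
        z = mu + np.exp(0.5 * log_var) * eps  # sample from q(z | x)
        total += gaussian_log_likelihood(x, decode(z))
    return neg_kl_to_standard_normal(mu, log_var) + total / n_samples

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 2))   # hypothetical linear decoder weights
decode = lambda z: A @ z
x = rng.normal(size=4)
value = objective_estimate(x, mu=np.zeros(2), log_var=np.zeros(2),
                           decode=decode, rng=rng, n_samples=10)
```

Under the unit-variance assumption the second term is, up to an additive constant, minus half the squared reconstruction error. This matches its reading as an expected negative reconstruction error.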

Contractive autoencoder (CAE)

A contractive autoencoder adds an explicit regularizer to its objective function that forces the model to learn a function that is robust to slight variations of input values. This regularizer corresponds to the squared Frobenius norm of the Jacobian matrix of the encoder activations $h$ with respect to the input $x$ . The final objective function has the following form:

{\mathcal {L}}(\mathbf {x} ,\mathbf {x'} )+\lambda \sum _{i}||\nabla _{x}h_{i}||^{2}
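For a sigmoid encoder $h=\sigma (\mathbf {Wx} +\mathbf {b} )$ the penalty term has a simple closed form, because the Jacobian is $\operatorname {diag} (h\odot (1-h))\,\mathbf {W}$ . The sketch below assumes that encoder (an illustrative choice, not one fixed by the text above) and checks the closed form against a finite-difference Jacobian; all names are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def contractive_penalty(x, W, b):
    """sum_i ||grad_x h_i||^2 for h = sigmoid(W x + b), in closed form."""
    h = sigmoid(W @ x + b)
    # Row i of the Jacobian is h_i (1 - h_i) * W[i, :].
    return float(np.sum((h * (1.0 - h)) ** 2 * np.sum(W ** 2, axis=1)))

def numerical_penalty(x, W, b, step=1e-6):
    """Same quantity from a central finite-difference Jacobian (for checking)."""
    J = np.zeros((W.shape[0], x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = step
        J[:, j] = (sigmoid(W @ (x + e) + b) - sigmoid(W @ (x - e) + b)) / (2.0 * step)
    return float(np.sum(J ** 2))

rng = np.random.default_rng(0)
W, b = rng.normal(size=(3, 5)), rng.normal(size=3)
x = rng.normal(size=5)
analytic = contractive_penalty(x, W, b)
numeric = numerical_penalty(x, W, b)
```

The two values should agree to several significant digits. In training, `analytic` would be multiplied by $\lambda$ and added to the reconstruction loss.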

Relationship with truncated singular value decomposition (TSVD)

If linear activations are used, or only a single sigmoid hidden layer, then the optimal solution to an autoencoder is strongly related to principal component analysis (PCA),^[9] as explained by Pierre Baldi in several papers.^[10]
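The linear case can be illustrated numerically. By the Eckart–Young theorem, the best rank-$p$ reconstruction of centred data under squared error is given by the truncated SVD, so a linear autoencoder whose encoder and decoder weights are the top $p$ right singular vectors attains the minimum possible loss. The data and sizes below are arbitrary, and the weights are set directly from the SVD rather than learned by training.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
X = X - X.mean(axis=0)  # centre the data, as PCA does
p = 2                   # size of the hidden layer

U, S, Vt = np.linalg.svd(X, full_matrices=False)
W = Vt[:p]              # encoder weights, shape (p, d)
W_prime = Vt[:p].T      # decoder weights, shape (d, p)

codes = X @ W.T         # linear encoding, one row per sample
X_rec = codes @ W_prime.T
svd_error = float(np.sum((X - X_rec) ** 2))

# The residual equals the sum of the discarded squared singular values.
discarded = float(np.sum(S[p:] ** 2))

# Any other rank-p orthogonal projection reconstructs no better.
Q, _ = np.linalg.qr(rng.normal(size=(5, p)))
other_error = float(np.sum((X - X @ Q @ Q.T) ** 2))
```

The rows of `W` span the same subspace as the first `p` principal components of `X`, which is the sense in which the optimal linear autoencoder is related to PCA.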

Training

The training algorithm for an autoencoder can be summarized as follows. For each input $\mathbf {x}$ :

Do a feed-forward pass to compute activations at all hidden layers, then at the output layer to obtain an output $\mathbf {x'}$ ;
Measure the deviation of $\mathbf {x'}$ from the input $\mathbf {x}$ (typically using squared error);
Backpropagate the error through the net and perform weight updates.
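The three steps can be sketched for a single-hidden-layer autoencoder trained by per-sample gradient descent. The sigmoid encoder, linear decoder, learning rate, layer sizes and random data are all assumptions chosen to keep the example small; they are not prescribed by the algorithm.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, params):
    """Feed-forward pass: hidden activations z, then output x'."""
    W, b, W_prime, b_prime = params
    z = sigmoid(W @ x + b)
    return z, W_prime @ z + b_prime

def train_step(x, params, lr):
    """One weight update on a single input; returns its squared error."""
    W, b, W_prime, b_prime = params
    z, x_prime = forward(x, params)
    # Deviation of x' from x under squared error, as a gradient.
    g_out = 2.0 * (x_prime - x)
    # Backpropagate through the decoder and the sigmoid.
    g_hid = (W_prime.T @ g_out) * z * (1.0 - z)
    # In-place gradient-descent updates.
    W_prime -= lr * np.outer(g_out, z)
    b_prime -= lr * g_out
    W -= lr * np.outer(g_hid, x)
    b -= lr * g_hid
    return float(np.sum((x - x_prime) ** 2))

def mean_loss(X, params):
    return float(np.mean([np.sum((x - forward(x, params)[1]) ** 2) for x in X]))

rng = np.random.default_rng(0)
d, p = 6, 3
X = rng.random((100, d))  # illustrative data
params = (rng.normal(scale=0.1, size=(p, d)), np.zeros(p),
          rng.normal(scale=0.1, size=(d, p)), np.zeros(d))

initial_loss = mean_loss(X, params)
for epoch in range(20):
    for x in X:
        train_step(x, params, lr=0.05)
final_loss = mean_loss(X, params)
```

Comparing `final_loss` with `initial_loss` shows whether the updates reduced the reconstruction error on this data.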

An autoencoder is often trained using one of the many variants of backpropagation (such as the conjugate gradient method or steepest descent). Though these are often reasonably effective, there are fundamental problems with the use of backpropagation to train networks with many hidden layers. Once errors are backpropagated to the first few layers, they become minuscule and insignificant, so those layers barely change and the network tends to learn to reconstruct the average of all the training data. Though more advanced backpropagation methods (such as the conjugate gradient method) can solve this problem to a certain extent, they still result in a very slow learning process and poor solutions. This problem can be remedied by using initial weights that approximate the final solution. The process of finding these initial weights is often referred to as pretraining.

Geoffrey Hinton developed a pretraining technique for training many-layered "deep" autoencoders. This method involves treating each neighboring set of two layers as a restricted Boltzmann machine so that the pretraining approximates a good solution, then using a backpropagation technique to fine-tune the results.^[11] The resulting model is known as a deep belief network.

References

1. Bengio, Y. (2009). "Learning Deep Architectures for AI" (PDF). Foundations and Trends in Machine Learning. 2. doi:10.1561/2200000006.
2. Liou, C.-Y.; Huang, J.-C.; Yang, W.-C. (2008). "Modeling word perception using the Elman network". Neurocomputing. 71: 3150–3157. doi:10.1016/j.neucom.2008.04.030.
3. Liou, C.-Y.; Cheng, C.-W.; Liou, J.-W.; Liou, D.-R. (2014). "Autoencoder for Words". Neurocomputing. 139: 84–96. doi:10.1016/j.neucom.2013.09.055.
4. Kingma, D.P.; Welling, M. (2013). "Auto-Encoding Variational Bayes". ArXiv e-prints. arxiv.org/abs/1312.6114.
5. Boesen, A.; Larsen, L.; Sonderby, S.K. (2015). "Generating Faces with Torch". torch.ch/blog/2015/11/13/gan.html.
6. Vincent, Pascal; Larochelle, Hugo; Lajoie, Isabelle; Bengio, Yoshua; Manzagol, Pierre-Antoine (2010). "Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion". The Journal of Machine Learning Research. 11: 3371–3408.
7. "sparse autoencoders" (PDF).
8. "k-sparse autoencoder". arXiv:1312.5663.
9. Bourlard, H.; Kamp, Y. (1988). "Auto-association by multilayer perceptrons and singular value decomposition". Biological Cybernetics. 59 (4–5): 291–294. doi:10.1007/BF00332918. PMID 3196773.
10. Baldi et al. (2014). "Deep autoencoder neural networks for gene ontology annotation predictions". Proceedings of the 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics. ACM.
11. Hinton; Salakhutdinov (28 July 2006). "Reducing the Dimensionality of Data with Neural Networks". Science.

This article is issued from Wikipedia - version of the 11/12/2016. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.