CompressionVAE - A Powerful and Versatile Alternative to t-SNE and UMAP
Introducing a fast, easy to use, deep learning based dimensionality reduction tool
[tl;dr: CompressionVAE is a dimensionality reduction tool based on the idea of Variational Autoencoders. It can be installed via pip install cvae and used very similarly to scikit-learn’s t-SNE or UMAP-learn. Full code and documentation can be found here: https://github.com/maxfrenzel/CompressionVAE]
Dimensionality reduction is at the heart of many machine learning applications. Even simple 28x28 pixel black and white images like MNIST occupy a 784-dimensional space, and for most real data of interest this can easily go into the tens of thousands and beyond. In order to be able to handle this kind of data, a first step often involves reducing the data’s dimensionality.
The general idea behind this is that different dimensions are often highly correlated and redundant, and not all dimensions contain an equal amount of useful information. The challenge of a dimensionality reduction algorithm is to retain as much information as possible while describing the data with fewer features, mapping the data from a high-dimensional data space to a much lower dimensional latent space (also known as embedding space). It’s essentially a problem of compression. In general, the better a compression algorithm, the less information we lose in the process (with the small caveat that depending on the application, not all information is equally important, so the most lossless method might not always be the best).
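To make the compression view concrete, here is a minimal sketch using plain linear PCA (not CVAE) on synthetic data. The data, the helper name pca_compress, and the choice of mean squared error as a measure of information loss are all assumptions made for illustration only.

```python
import numpy as np

def pca_compress(X, n_components):
    """Project X onto its top principal components and reconstruct it."""
    mean = X.mean(axis=0)
    X_centered = X - mean
    # Rows of Vt are the principal directions, ordered by explained variance
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    Z = X_centered @ components.T      # compressed (latent) representation
    X_rec = Z @ components + mean      # decompressed data
    return Z, X_rec

# Synthetic data: 500 points in 50 dimensions that really only vary along 3
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 50))
X_demo = latent @ mixing + 0.01 * rng.normal(size=(500, 50))

for k in (1, 2, 3):
    Z, X_rec = pca_compress(X_demo, k)
    mse = np.mean((X_demo - X_rec) ** 2)
    print(f"{k} latent dimensions: reconstruction MSE = {mse:.4f}")
```

Because the 50 features are highly redundant by construction, three latent dimensions are enough to reconstruct the data almost perfectly, while fewer dimensions lose noticeably more information.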
A good dimensionality reduction technique allows us to create data that is more suitable for downstream tasks, allowing us to build more compact models and preventing us from suffering the curse of dimensionality. It also enables us to visualise and analyse data in two- or three-dimensional spaces. And it can uncover the hidden latent structure in large datasets.
For a long time t-Distributed Stochastic Neighbor Embedding (t-SNE), and particularly its implementation as part of scikit-learn, has been the workhorse of dimensionality reduction (along with PCA). It is easy to use and has many nice properties, particularly for visualisation. But it has several drawbacks. Finding the right parameters and interpreting the results can be somewhat challenging, and it doesn’t scale well to high-dimensional spaces (both in the input as well as the latent dimension). Its properties also don’t make it very suitable for non-visualisation purposes.
More recently, Uniform Manifold Approximation and Projection (UMAP) has become popular and started to supplement or even replace t-SNE for many applications, thanks to several advantages it offers. UMAP is considerably faster and scales better to high-dimensional data. It also preserves global structure better, making it more suitable for many applications beyond visualisation.
In this article, I want to present an alternative to both these methods: CompressionVAE (or CVAE for short). CVAE builds on the ease of use of the t-SNE and UMAP implementations, but offers several highly desirable properties (as well as a few drawbacks—it’s not a silver bullet).
At its core, CVAE is a TensorFlow implementation of a Variational Autoencoder (VAE), wrapped in an API that makes it easy to use. I decided to call it CompressionVAE to give the tool a unique name, but fundamentally it is just a VAE (with the optional addition of Inverse Autoregressive Flow layers that allow the learned distribution to be more expressive). As such, it is based on the same powerful theoretical foundation as the VAE, and can be extended easily to take into account the many ongoing developments in the field. If you’d like to learn more about VAEs in general, I wrote an in-depth three-part series on the topic, explaining them from the unique perspective of a two-player game.
We will explore some of CVAE's benefits (and drawbacks) more extensively in the examples below, but here is a general overview:
CVAE is faster than either t-SNE or UMAP
CVAE tends to learn a very smooth embedding space
The learned representations are highly suitable as intermediate representations for downstream tasks
CVAE learns a deterministic and reversible mapping from data to embedding space (note: the most recent version of UMAP also offers some reversibility)
The tool can be integrated into live systems that can process previously unseen examples without re-training
The trained systems are full generative models, and can be used to generate new data from arbitrary latent variables
CVAE scales well to high dimensional input and latent spaces
It can in principle scale to arbitrarily large datasets. It can either be given the entire training data like the t-SNE and UMAP implementations, or load only a single training batch at a time into memory.
CVAE is highly customisable, even in its current implementation, giving many controllable parameters (while also providing fairly robust default settings for less experienced users)
Beyond the current implementation, it is highly extensible, and future versions could provide for example convolutional or recurrent encoders/decoders to allow for more problem/data specific high quality embeddings beyond simple vectors.
VAEs have a very strong and well studied theoretical foundation
Due to the optimisation objective, CVAE often does not get a very strong separation between clusters. Very hand-wavingly, there is a trade-off between having a smooth latent space and getting strong cluster separation, and the VAE is more on the smooth side and preserves global structure. This can be a problem or an advantage, depending on the application.
Biggest downside: Like almost any deep learning tool, it needs a lot of data to be trained. t-SNE and UMAP work better on small datasets (there is no hard rule for when CVAE becomes applicable, but in general fewer than 10k examples might be difficult)
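The trade-off between smoothness and cluster separation mentioned in the list above can be read off the standard VAE objective, the evidence lower bound. The notation below is the usual textbook one and is not taken from the CVAE code: q_φ(z|x) is the encoder, p_θ(x|z) the decoder, and p(z) the prior over the latent space, typically a standard normal distribution.

```latex
\mathcal{L}(\theta, \phi; x)
  = \mathbb{E}_{q_\phi(z \mid x)}\left[ \log p_\theta(x \mid z) \right]
  - D_{\mathrm{KL}}\left( q_\phi(z \mid x) \,\|\, p(z) \right)
```

The first term rewards accurate reconstructions. The second term pulls the encodings of all data points towards the same prior, which favours a smooth, contiguous latent space and works against widely separated clusters.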
So with this long introduction out of the way, let’s look at CompressionVAE in action.
Using CompressionVAE
After installing through the PyPI distribution (or cloning the repository and installing from there), CVAE is ready to use. More extensive code examples can be found in the readme, but in its most basic use case, we can train a CVAE model on a data array X with the following few lines of code:
# Import CVAE
from cvae import cvae
# Initialise the tool, assuming we already have an array X containing the data
embedder = cvae.CompressionVAE(X)
# Train the model
embedder.train()
By default, initialising a CompressionVAE object creates a model with two-dimensional latent space, splits the data X randomly into 90% train and 10% validation data, applies feature normalization, and tries to match the model architecture to the input and latent feature dimensions. It also saves the model in a temporary directory which gets overwritten the next time you create a new CVAE object there. All this can be customised, but I refer you to the readme for details on this.
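For readers who want to see what this kind of default preprocessing amounts to, here is a rough NumPy sketch of a random 90/10 split followed by per-feature standardisation. It is not the library's actual code: the function name split_and_normalise, the use of zero-mean/unit-variance scaling, and computing the statistics on the training part only are my assumptions.

```python
import numpy as np

def split_and_normalise(X, train_fraction=0.9, seed=0):
    """Randomly split X into train/validation parts and standardise features."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    n_train = int(train_fraction * len(X))
    X_train, X_val = X[indices[:n_train]], X[indices[n_train:]]

    # Statistics come from the training part only, so that no information
    # from the validation data leaks into the preprocessing
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant features

    return (X_train - mean) / std, (X_val - mean) / std

# Random stand-in for a dataset of 1000 flattened 28x28 images
X_demo = np.random.default_rng(1).uniform(0, 255, size=(1000, 784))
X_train, X_val = split_and_normalise(X_demo)
print(X_train.shape, X_val.shape)
```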
The train method applies automatic learning rate scheduling based on the validation data loss, and stops either when the model converges (determined by certain adjustable criteria) or after 50k training steps. We can also stop the training process early through a KeyboardInterrupt (ctrl-c or 'interrupt kernel' in Jupyter notebook). The model will be saved at this point. It is also possible to stop training and then re-start with different parameters (again, see the readme for more details).
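The general idea of scheduling the learning rate on the validation loss can be sketched in a few lines of plain Python. This is a generic reduce-on-plateau scheme under assumed settings, not CVAE's implementation: the function run_schedule and the parameters patience, decay, and min_lr are hypothetical, and only the 50k step limit comes from the text above.

```python
def run_schedule(val_losses, lr=1e-3, patience=3, decay=0.5,
                 min_lr=1e-5, max_steps=50_000):
    """Return (steps_taken, final_lr) for a reduce-on-plateau schedule.

    val_losses is an iterable of validation losses, one per evaluation step.
    """
    best = float("inf")
    stale = 0   # evaluations since the last improvement
    steps = 0
    for loss in val_losses:
        if steps >= max_steps:
            break  # hard limit on training length
        steps += 1
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
        if stale >= patience:
            lr *= decay  # validation loss has plateaued: lower the learning rate
            stale = 0
            if lr < min_lr:
                break  # learning rate is negligible: treat the model as converged
    return steps, lr

# A loss curve that improves for three evaluations and then stays flat
losses = [1.0, 0.5, 0.4] + [0.4] * 100
steps, final_lr = run_schedule(losses)
print(steps, final_lr)
```

With these settings the loop stops long before the step limit, because the flat validation loss repeatedly halves the learning rate until it drops below min_lr.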
Once we have a trained model, we can use the tool's embed method to embed data. For example to embed our entire dataset X, we can run
z = embedder.embed(X)
In this case we embedded data that the model has already seen during training, but we can also pass new examples to the embed method.
Visualising Embeddings
For two-dimensional latent spaces, CVAE comes with built in methods for visualisation. Let’s assume we trained our model on MNIST data, and originally defined X in the following way:
from sklearn.datasets import fetch_openml
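# as_frame=False returns NumPy arrays (recent scikit-learn versions default to a pandas DataFrame here)
mnist = fetch_openml('mnist_784', version=1, cache=True, as_frame=False)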
X = mnist.data
We can then plot the embeddings we created above via
embedder.visualize(z, labels=[int(label) for label in mnist.target])
Below is an example of this.
We can also scan the latent space and use the generator part of the VAE to visualize the learned space.
embedder.visualize_latent_grid()
Again, you can find more details and options on all of this in the readme.
Speed Comparisons
One of the benefits of CVAE is that it is fast to train. Below are some comparisons to UMAP and t-SNE on two popular datasets of different sizes: MNIST (70,000 samples), and Kuzushiji-49 (270,912 samples).
For the comparisons, I used the basic version of t-SNE provided by scikit-learn, not any multicore version. When using UMAP I received a “NumbaPerformanceWarning”, warning that parallelisation could not be used. So both methods can probably be sped up, but the following is still a good comparison for the most vanilla use case most users will encounter. All tests were performed on my MacBook Pro without any GPU usage.
MNIST:
t-SNE: 5735 seconds (single run)
UMAP: 611 ± 1 seconds (mean and standard deviation over 10 runs)
CVAE: 121 ± 39 seconds (mean and standard deviation over 10 runs)
Kuzushiji-49:
t-SNE: Didn’t even try
UMAP: 4099 seconds (single run)
CVAE: 235 ± 70 seconds (mean and standard deviation over 10 runs)
We see that CVAE is clearly superior in terms of convergence times on these dataset sizes, and the advantage only increases with dataset size. Note that the CVAE duration is to convergence, as determined by the default learning rate schedule and stopping criteria.
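Timings like the ones above can be collected with a small harness along the following lines. This is only a sketch of the general approach, not the script used for the numbers in this article: the helper time_runs is hypothetical, the workload is a trivial stand-in, and the use of the sample standard deviation is an assumption.

```python
import statistics
import time

def time_runs(fn, n_runs=10):
    """Call fn() n_runs times and return (mean, standard deviation) in seconds."""
    durations = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    mean = statistics.mean(durations)
    # The sample standard deviation needs at least two measurements
    std = statistics.stdev(durations) if n_runs > 1 else 0.0
    return mean, std

# Stand-in workload; in a real comparison fn would build and train a fresh
# model each time, so that every run starts from a new initialisation
mean, std = time_runs(lambda: sum(i * i for i in range(100_000)), n_runs=5)
print(f"{mean:.4f} ± {std:.4f} seconds")
```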
More Detailed Look
The animation below shows an example of a training progression on MNIST. We can see that CVAE very quickly gets close to its final solution and then spends most of the time refining this. This shows that if training speed is a critical issue, stopping early might still provide sufficient results depending on the use case. [This particular model actually never fully converged since I disabled the learning rate decay.]
We also saw in the speed numbers above that CVAE has quite a large variance in convergence times, due to the current version being fairly sensitive to the initialisation of the model. This sensitivity affects both convergence speed and the final result. It might be worth trying several models and comparing results. As an example, below are the latent spaces of three different instances trained on the same MNIST data.
However, in all cases we see that CVAE tends to achieve a very smooth latent space and preserves global structure very well. In contrast to this, UMAP's strong clustering for example separates similar digits very distinctly, while for CVAE they smoothly flow into each other.
The above plot also shows the result of the inverse transform of UMAP. Comparing this to the CVAE version above, we again see that CVAE’s latent space tends to be smoother. Just to be clear, the “smoothness” argument I’m making here is very qualitative. If you are interested in something more concrete, I wrote a paper related to this topic a while ago, but I did not apply any of these more precise metrics in any of the tests here.
One note on decoding data with CVAE: If decoding is the main interest, we should probably have a more specialised decoder. Right now the decoder is completely unconstrained, e.g. the MNIST model can output values < 0 and > 255, even though they don’t make sense as pixel values. I wanted CVAE to be as universal and data-agnostic as possible. However, it might be worth giving more options in future versions to specialise this more. For example on MNIST it is common practice to restrain the value to the appropriate range via a sigmoid function and appropriate loss.
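As an illustration of what such a specialisation could look like, the NumPy sketch below squashes unconstrained outputs into the unit interval with a sigmoid and scores them with a binary cross-entropy loss against pixel values rescaled to [0, 1]. None of this is part of the current CVAE API; the function names and the example values are made up.

```python
import numpy as np

def sigmoid(x):
    """Logistic function written via tanh, which avoids overflow for large |x|."""
    return 0.5 * (1.0 + np.tanh(0.5 * x))

def binary_cross_entropy(target, prediction, eps=1e-7):
    """Mean binary cross-entropy between targets and predictions in [0, 1]."""
    prediction = np.clip(prediction, eps, 1 - eps)  # keep the logarithms finite
    return -np.mean(target * np.log(prediction)
                    + (1 - target) * np.log(1 - prediction))

# Made-up unconstrained decoder outputs and the corresponding true pixels
raw_output = np.array([-300.0, -2.0, 0.0, 3.0, 400.0])
pixels = np.array([0, 0, 128, 255, 255]) / 255.0  # rescaled to [0, 1]

constrained = sigmoid(raw_output)
print(constrained.min() >= 0.0, constrained.max() <= 1.0)
print(binary_cross_entropy(pixels, constrained))
```

Multiplying the constrained outputs by 255 would bring them back to the usual pixel scale.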
Fashion MNIST is another simple and useful toy dataset to explore the tool’s capabilities and output. First, let’s look for comparison at the visualisations of UMAP.
Compared to this, here are the results of a trained CVAE model.
We again see a result very similar to the one above, with CVAE having less distinct cluster separations but smoother transitions.
This example also shows very nicely how well CVAE captures local as well as global structure.
Globally, we see that the “ankle boot”, “sneaker”, and “sandals” categories all cluster together and flow seamlessly into each other. We observe a similar transition from “dress”, to “shirt” and “t-shirt/top” (which almost entirely overlap), to “pullover" and “coat”. We see much stronger divisions (and ill-defined reconstructions) between for example “trousers” and the larger shoe cluster. Interestingly bags are able to morph almost seamlessly into pullovers.
Locally we also discover a lot of structure, with shoes for example gaining more and more ankle height with decreasing y, and more heel height with increasing x.
All this shows that the model has clearly learned/discovered some of the underlying latent structure of the dataset.
We can also play with data interpolation. Let’s pick two random real examples and create five intermediary “fake” data points.
import random
import numpy as np
# Get two random data points
X1 = np.expand_dims(random.choice(X_fashion), axis=0)
X2 = np.expand_dims(random.choice(X_fashion), axis=0)
# Embed both
z1 = embedder.embed(X1)
z2 = embedder.embed(X2)
# Interpolate between embeddings: five intermediate points plus the two endpoints (seven in total)
z_diff = (z2 - z1) / 6
z_list = [z1 + k*z_diff for k in range(7)]
z_interp = np.concatenate(z_list)
# Decode
X_interp = embedder.decode(z_interp)
Note that this code assumes that embedder is a CVAE instance that was trained on the Fashion MNIST data X_fashion.
After doing some reshaping and displaying, we can get results like the following.
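The reshaping step might look roughly like the sketch below. It assumes the decoded points are flattened 28x28 images on a 0 to 255 pixel scale, and it uses a random placeholder array instead of the real X_interp so that the snippet runs on its own. The helper to_image_strip is a hypothetical name, not part of CVAE.

```python
import numpy as np

def to_image_strip(X_flat, height=28, width=28):
    """Turn (n, height*width) flat vectors into one (height, n*width) image strip."""
    images = X_flat.reshape(-1, height, width)
    # Clip to the valid pixel range, since decoded values can fall outside it
    images = np.clip(images, 0, 255)
    # Place the individual images side by side
    return np.hstack(list(images))

# Placeholder standing in for the decoded X_interp (7 points, 784 features each)
X_placeholder = np.random.default_rng(0).uniform(-20, 280, size=(7, 784))
strip = to_image_strip(X_placeholder)
print(strip.shape)
```

The resulting array can then be displayed with, for example, matplotlib's plt.imshow(strip, cmap="gray").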
While all the examples discussed here were only toy datasets without any real applications, CVAE can be applied to many real world problems, and used in production settings as well.
At my current company Qosmo, we use variants of CVAE in several of our projects, ranging from the purely artistic to the commercial. Examples of this are the various music recommendation and playlist generation solutions we have developed.
In some of these, we first use a proprietary system to convert songs into very high dimensional vector representations of fixed size, then use CVAE to further compress the vectors and make them more suitable for downstream tasks, and finally employ another custom built system that uses the created song embeddings to generate meaningful and natural sounding playlists, potentially taking certain additional criteria and live data (e.g. current weather, or time of day) into account as well.
If you are interested in using one of these systems or want to work with us on a custom solution for your specific needs, please visit our website and get in touch with us.
CompressionVAE is still at a very early stage. But I encourage you to try it and apply it to your own problems.
And if the current basic version does not serve the exact purpose you need, feel free to go under the hood and extend and adapt it to your own needs. If you share your customisation so that it becomes part of the main CVAE distribution, even better. While there is definitely A LOT of room for improvement, I hope that the current implementation provides a nice framework and starting point to make this process as easy as possible.
I’m looking forward to seeing everyone’s use cases and contributions (and hopefully also fixes to my sloppy code, potential bugs, and instabilities).
CompressionVAE won’t be able to replace the various other dimensionality reduction tools, but I hope it provides a valuable addition to the existing arsenal of tools.
Happy compressing!