The Variational Autoencoder as a Two-Player Game — Part III

The Difficulties of Encoding Text

Illustrations by KITTYZILLA

Welcome back to the final part of this three-part series on variational autoencoders (VAEs) and their application to encoding text.

In Part I we met Alice and Bob, who were preparing for the Autoencoding Olympics. While following their training process, we learned about the concept of autoencoders and some of the basics of deep learning.

Unfortunately, due to some training issues we uncovered, we had to watch them fail miserably in their quest for a gold medal.

However, they managed to redeem themselves in Part II. By following an extra difficult training regimen, known as variational autoencoding, they managed to overcome their problems and returned to the Olympics to dominate their competition and claim a decisive victory.

Now let’s rejoin Alice and Bob after they return home with their gold medals.

Entering a New Discipline: Text Encoding

Alice and Bob are overjoyed by their victory. But soon they are looking for a new challenge.

They decide to aim for a new discipline in the next Olympics, one that requires them to encode not images but sentences. So Bob hangs up his paintbrush and grabs a pen instead, and Alice gets to reading.

03_vae_02.jpg

The basic rules of the game are essentially still the same.

Alice has to read a sentence, which she needs to encode and send to Bob, who then has to try to reconstruct the sentence from Alice’s code.

Again we have to keep in mind that, just like in the image case in Part I, our AI Alice and Bob have absolutely no preconceived notion of language, not even the meaning of individual words, let alone complex sentences.

Initially, the sentence “A tall man stands at the side of the road.” is just as likely to them and carries just as much meaning as “Church doll regret lake unarmed machine changed appears knot precede.”


Their basic toolbox is what’s called their vocabulary: the set of words they can play with.

The problem is they have no idea about the meaning of those words or how to combine them into sequences that carry meaning.

It is as if they were handed a dictionary, but without any explanations. Just a long list of all the words in the English language. (Not that the explanations would have helped, since those themselves are formed of the same words that have no meaning to our players yet.)

Once again, Alice and Bob have to make sense of the external world from scratch, via their interactions with the sentences they are provided and the feedback Bob gets from their coach Charlie.

The way Charlie judges Bob’s prediction is slightly different in this variation of the game though.

Previously Charlie waited for Bob to paint the entire image.

Now, on the other hand, he does not wait for Bob to finish the entire sentence. Instead, he gives Bob a score and feedback after every single word he predicts.

Even more crucially, Charlie tells Bob what would have been the correct word.

This simplifies Bob’s task tremendously. Instead of predicting the entire sentence based only on Alice’s code, he can predict one word at a time, relying on the words he has already seen to refine his prediction of the next word.

Developing a Language Model

Every one of us has an inbuilt (or rather learned) language model.

Consider the sentence fragment “The dog chases the…”.

What would you think the next word is? To determine this you just invoked your own language model.

Different people have different language models given their background and experiences, but in this case I would bet that almost everyone would have guessed “cat”.

But what if I now told you that this sentence was taken from a quirky science fiction story about an alien invasion?

You might still think “cat” is most likely, but you’re probably not quite so sure anymore. Or maybe you even expect something different as most likely. You conditioned your language model on an additional piece of information I gave you.

This conditioning is exactly what Bob needs to learn to get a high score. In particular, he needs to condition his language model on Alice’s code. And Alice once again needs to figure out a clever way to convey as much information as possible in the two numbers she is allowed to send to Bob.
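To make “conditioning on Alice’s code” a bit more concrete, here is a minimal sketch of what Bob’s decoder could look like. This is a hypothetical illustration rather than the exact setup from any paper: it assumes PyTorch, a GRU-based decoder, and made-up sizes, and it simply concatenates Alice’s two-number code to the embedding of the previous word at every step.

```python
import torch
import torch.nn as nn

# Hypothetical sizes: vocabulary, word embedding, Alice's code (two numbers), hidden state.
vocab_size, embed_dim, code_dim, hidden_dim = 10_000, 64, 2, 128

embed = nn.Embedding(vocab_size, embed_dim)
rnn = nn.GRUCell(embed_dim + code_dim, hidden_dim)  # input = previous word + code
to_vocab = nn.Linear(hidden_dim, vocab_size)

def decode_step(prev_word, code, hidden):
    """One step of Bob's prediction, conditioned on Alice's code."""
    inp = torch.cat([embed(prev_word), code], dim=-1)  # the code is injected at every step
    hidden = rnn(inp, hidden)
    logits = to_vocab(hidden)  # a score for every word in the vocabulary
    return logits, hidden

prev_word = torch.tensor([42])        # index of the word Bob saw last
code = torch.randn(1, code_dim)       # Alice's two numbers (with sampling noise)
hidden = torch.zeros(1, hidden_dim)
logits, hidden = decode_step(prev_word, code, hidden)
```

The only line that really matters here is the concatenation: whatever Bob predicts next, Alice’s code gets a chance to influence it.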

The problem is that because of the way Charlie provides his feedback, Bob can actually become pretty decent at the game without conditioning his language model.

Like in the example sentence above, in many cases the guess “cat” would have been right, leading to a good score. Only in a few outlier sentences will this be a bad guess.

But a language model, if unconditioned, can be misleading.

Let’s assume Bob, with corrections from Charlie, has so far guessed the text fragment “The dog chases the…” and look at what might happen next.

03_vae_03.jpg

Let’s assume that the full sentence is “The dog chases the flamboyant spaceship from Alpha Centauri”. By the time Bob gets to guess the last word his language model might have recovered from the initial shock of the “flamboyant spaceship” and he might make a reasonable guess about “Centauri” even without conditioning on Alice’s code. But the total score for the entire sentence will already have suffered a lot.

The trick, given the limited information flow allowed, lies in Alice encoding exactly the kind of information she thinks is surprising to Bob and letting him rely on his own language model for the rest, and also hoping that he actually uses her information in the first place.

This is what information theorists call an efficient code.

Encode exactly what is most surprising and omit the rest.

This is also closely related to the concept of entropy you might have heard thrown around in various contexts. But a thorough discussion of that would require many articles in its own right.
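For the curious, these ideas do have precise definitions: the surprisal of a word measures how unexpected it is under a model, and entropy is the average surprisal. In the usual notation:

```latex
I(w) = -\log p(w), \qquad H(p) = -\sum_{w} p(w)\,\log p(w)
```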

In the past, people have tried coming up with efficient codes for all sorts of problems manually. But now, if trained correctly, models like VAEs can actually automatically find highly efficient codes for very complex problems.


Note that in reality the decoder doesn’t predict just a single word at each step. It predicts a probability for every word in its vocabulary.

So while in the example Bob might have given “cat” a 99.9% probability, he would also have given every single other word he knows a small but non-zero probability, including maybe 0.0000037% for the correct word “flamboyant”.

This is what allows our critic Charlie to give Bob a precise score. Bob only receives a perfect score if he assigns 100% probability to the correct word; the lower the probability he gives to the correct solution, the worse his score.
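As a toy illustration of this scoring rule (an assumption on my part: this is the standard negative log-likelihood, or cross-entropy, used to train text decoders), here is how the two example predictions above would be penalized:

```python
import math

def penalty(predicted_probs, correct_word):
    """Charlie's penalty: the negative log of the probability Bob gave the true word."""
    return -math.log(predicted_probs[correct_word])

predicted = {
    "cat": 0.999,                      # 99.9%
    "flamboyant": 0.0000037 / 100,     # 0.0000037% written as a fraction
    "time": 0.000999963,               # the rest of the probability mass (toy numbers)
}

print(round(penalty(predicted, "cat"), 3))         # 0.001 -> near-perfect score
print(round(penalty(predicted, "flamboyant"), 1))  # 17.1  -> heavy penalty
```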

Sounds all good, doesn’t it? Bob only needs to condition his language model on Alice’s code and they are all set again, right?

Alice’s Struggle

Well, yes. But… It turns out that in this new discipline Alice is struggling quite a bit.

In the case of the cat and dog images, we saw in Part I that Bob can get some early victories (painting a gray/brown blob with two smaller round blobs as eyes) without consulting Alice’s code (which at that time was still random due to the lack of feedback from Bob).

But it doesn’t get him very far. Fairly soon he is stuck and needs to figure out how to use Alice’s code.

However, in this new discipline Bob can actually get pretty decent at the game without considering Alice’s code (and hence without giving her any useful feedback).

Also, in Part II we noted that in the variational setting, a higher accuracy in code transmission comes at the cost of a higher penalty.

Now, since Alice realizes that Bob isn’t using her code anyway, she figures out that she might as well increase the uncertainty, so that they don’t pay an additional penalty for the precise transmission of a code that is useless anyway.


As noted above, initially neither Bob nor Alice have any language model whatsoever. Text is just a random mess to them.

But very early on, by counting the occurrence of words, Bob might realize that “The” or “A” are the most common words at the beginning of a sentence, so he might just start every sentence with these.

And just based on word frequency, these are the most common words in general. So initially Bob might figure out a strategy of repeating the same word over and over again (“the the the the the the”) because he notices this gives him a higher score than just random guessing.

But soon he’ll notice that “the” is usually followed by a noun. One of the most common nouns in English is “time”, so as a first improvement Bob might learn to say “the time the time the time the…”.

By slowly figuring out more of these common word combinations, and longer ones, Bob builds his language model.
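As a toy illustration of this early, unconditioned stage (the corpus is made up, not taken from any real training data), Bob’s first “language model” is little more than a table of which word tends to follow which:

```python
from collections import Counter

# Count word pairs in a tiny made-up corpus and always guess the most common successor.
corpus = "the dog chases the cat . the cat chases the mouse .".split()
bigrams = Counter(zip(corpus, corpus[1:]))

def guess_next(prev_word):
    candidates = {nxt: n for (w, nxt), n in bigrams.items() if w == prev_word}
    return max(candidates, key=candidates.get) if candidates else "the"

print(guess_next("the"))     # -> "cat" (the most frequent word after "the" in this corpus)
print(guess_next("chases"))  # -> "the"
```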

He learns the statistics of the English language.

Or at least the particular language that is used in their training data. A language model trained on tweets is very different from a language model trained on the Bible.

By the time Bob gets stuck by himself and can’t improve the score any further, he’s already learned quite a lot, whereas Alice is still stuck at the very beginning.

Alice has a much harder time learning.

Bob gets direct feedback after every single word he predicts, whereas the only feedback Alice gets on her understanding of the entire sentence comes from the two feedback numbers Bob sends her.

And if Bob doesn’t use her code at all he also can’t give her useful feedback.

So Alice basically just gets random noise from Bob to improve her already random code.

The variational setup further complicates this since Bob doesn’t even get Alice’s actual ideal code choice, but a value with some added uncertainty.

The already random code is further randomised.

So now with Bob being pretty decent all by himself and Alice having learned absolutely nothing yet, we risk being stuck at what is called a local maximum.

03_vae_04.jpg

Bob gets a decent score he can’t improve by himself, but every time he tries to listen to Alice and use the code she sends him to influence his predictions, their score gets worse.

The learning process is similar to mountain climbers without a map of the terrain looking for a summit in dense fog.

The only thing they can do is go upwards in the steepest direction. Once they reach a point where every direction goes downhill, they assume they have reached the ultimate summit. But they don’t know that if they just went downhill for a little while, they could get to an even higher peak.

So Bob, just like a climber who thinks he already accomplished the climb, abandons his attempts for improvement by using Alice’s code.

His process is already so refined that it’s very sensitive to the change introduced by trying to condition his language model on the code.

What he doesn’t know is that if he could just sacrifice their score for a little while and try using the code and giving Alice some feedback, they could get unstuck.

Alice could learn enough to provide Bob with useful codes that allow him to make predictions with much higher accuracy than he was ever able to by himself.

But Bob is too nearsighted and self-confident to sacrifice their score.

Restoring Balance

How can we help Alice with her difficult task of convincing Bob to use her code and provide good feedback?

This question has been around in the academic community for a while, ever since people started using VAEs on text and encountered exactly this problem.

And it is still not completely solved. But many ideas have emerged that make variational autoencoding of text good enough to be useful in practice.

Just as introducing the variational aspect in Part II made autoencoding harder but improved performance, most strategies here also involve making the problem seemingly harder rather than easier.

In particular, people tried making the task more challenging for Bob so that he can’t quite as easily race ahead of Alice in terms of learning.

One approach, called “word dropout”, has our critic Charlie sometimes staying silent.

He still always gives Bob a score for his word prediction, but from time to time he doesn’t reveal the correct answer, giving Bob’s own language model less information to work with.

03_vae_05.jpg

Let’s again look at our example sentence “The dog chases the…”. Let’s assume Charlie stayed silent on the second word. So to Bob the sentence now looks like “The … chases the…”.

I’m sure your own language model has a much harder time predicting the next word for this sentence with incomplete information.

The same is true for Bob. And he starts looking at Alice’s code for clues.
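A rough sketch of the idea (details such as the dropout rate and the placeholder token vary between implementations): with some probability, the ground-truth word that would normally be revealed to Bob is replaced by an unknown-word placeholder.

```python
import random

def word_dropout(words, keep_prob=0.75, unk="<unk>"):
    """Randomly hide some of the ground-truth words that Bob would normally see."""
    return [w if random.random() < keep_prob else unk for w in words]

print(word_dropout("the dog chases the".split()))
# e.g. ['the', '<unk>', 'chases', 'the'] -- the exact output changes from run to run
```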


Other approaches are even more radical in modifying Charlie’s behavior.

In one of them, instead of giving Bob a score after every word, Charlie waits for Bob to finish the entire sentence before giving him any feedback.

This is similar to what Charlie did in the image case (and the neural network architecture that’s used for Bob in practice is actually also quite similar to the image case, both using convolutional neural networks).

This way Bob basically can’t rely on his own language model at all, because he doesn’t know whether his predictions at the start of the sentence were correct or not.

This gives him a huge incentive to consult Alice for additional information very early on.

In practice this approach has not been too successful, as the task becomes too difficult for Bob.

But it has not been completely abandoned yet. It is still an active area of research.


The most successful approaches (e.g. one that builds Bob’s brain from dilated convolutions instead of the more common recurrent neural networks) don’t actually aim at making the task harder per se. Instead, they make Bob dumber, or more forgetful.

03_vae_06.jpg

If we make Bob dumber in just the right way, essentially giving him a slight learning disability, we basically give the disadvantaged Alice a chance to keep up with Bob’s progress.


Imagine, for example, that Bob gets very forgetful.

Poor Bob can now only remember 2 words at a time. In the middle of a sentence he has no idea anymore how it started.

Our example sentence now looks just like “… chases the…” to him.

Missing this crucial contextual information, he is eagerly looking for anything that can help him figure out the next word. He happily turns to Alice’s code.

Approaches of this kind (although considerably more sophisticated) have proven extremely successful in ensuring that Alice and Bob learn in sync and don’t get trapped at local maxima.
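As a toy illustration of the “forgetful Bob” idea (this is the simplest possible variant; the successful models achieve the effect more elegantly, for example with dilated convolutions whose receptive field only covers a few words), imagine that only the last two words of the history ever reach the decoder:

```python
def visible_history(words, k=2):
    """Bob only remembers the last k words; everything else must come from Alice's code."""
    return words[-k:]

history = "the dog chases the".split()
print(visible_history(history))  # ['chases', 'the'] -- all the context Bob has left
```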

03_vae_07.jpg

Initially their score rises more slowly than it did when Bob was still smart. But crucially it keeps rising without them getting stuck.


One more method (known as KL cost annealing), which can be combined with any of the other approaches, looks at the way Charlie penalizes Alice for how precisely she specifies her code.

As we saw in the image case in Part I and Part II, using the variational machines was crucial to learn examples outside of the training dataset. But it also increased the difficulty of Alice passing on useful information to Bob.

The new method allows Charlie some flexibility in how much he wants to penalize Alice for using a small uncertainty.

At the beginning, when they haven’t learned anything yet, Charlie is very forgiving, allowing Alice to specify her code as precisely as she wants without any penalty.

But as the training progresses, Charlie gradually ramps up the penalty to its full extent.

As we have seen, this might initially lead to “holes” in the code, regions of code space that Bob has no clue how to decode. But later, as Charlie ramps the penalty back up and Alice starts to use higher uncertainties, they will naturally have to learn to smooth out their code and make the holes disappear.

Alice essentially gets a penalty that adjusts to her learning curve.

In particular, the task is no longer so difficult right from the start that she never ends up learning anything.

This, together with dumbing down Bob, allows them to become very successful at encoding text.
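A minimal sketch of KL cost annealing (the schedule shape and the numbers are illustrative, not a canonical recipe): the reconstruction part of the score is always active, while the weight on the uncertainty penalty is slowly ramped up from zero to its full strength.

```python
import math

def kl_weight(step, warmup_steps=10_000):
    """Sigmoid ramp from ~0 to ~1 over the warm-up period (a linear ramp is also common)."""
    return 1.0 / (1.0 + math.exp(-(step - warmup_steps / 2) / (warmup_steps / 10)))

def total_penalty(reconstruction_loss, kl_divergence, step):
    """Charlie's combined penalty: reconstruction error plus the annealed uncertainty penalty."""
    return reconstruction_loss + kl_weight(step) * kl_divergence

for step in (0, 2_500, 5_000, 7_500, 10_000):
    print(step, round(kl_weight(step), 3))
# 0 0.007, 2500 0.076, 5000 0.5, 7500 0.924, 10000 0.993
```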


So what kind of code might they learn?

Bob needs to condition his language model on what Alice provides him.

Even if they only figure out a simple code, in which certain regions of code space tell Bob that the text is a tweet, others a news article, and yet others a restaurant review, this can dramatically help Bob make better guesses about the next word.

Since Bob doesn’t know when to actually stop his prediction either, Alice might also learn to encode the length of the sentence.

Just like in the image example in Part I with dog/cat on one axis and dark/light on another, they might learn a simple code with sentence length encoded in one dimension, and “formality level” encoded in the other dimension.

03_vae_08.jpg

Once again, as training continues and the two jointly learn, they might figure out much smarter (and probably less easily interpretable) ways of encoding lots of information in the two numbers.

We also saw that the code should be smooth and continuous.

That means if Alice wants to encode a sentence and not use too low of an uncertainty, she needs to make sure that her code is robust.

For example she should use very similar codes for the sentences “I went to his house” and “I went to his flat”, so that some randomness that changes one code into the other won’t have too dramatic consequences when Bob decodes it.

The sentence “This problematic tendency in learning is compounded by the LSTM decoder’s sensitivity to subtle variation in the hidden states, such as that introduced by the posterior sampling process” however should have a very different code.

Accidentally sending that instead of “I went to his house” would lead to a very crappy score from Charlie.


Having refined their training process and thus their code, Alice and Bob once again return victorious from the Autoencoding Olympics.

03_vae_09.jpg

What next?

There are still many other disciplines, such as audio encoding.

There are also completely different tasks which are not strictly autoencoding but still share many similarities, such as translation, where Alice encodes in one language and Bob decodes in a different language.

Sounds even trickier, and indeed it is!

But in this case there are slightly different rules that allow Bob to use what is called “attention”.

Essentially, in addition to the code that Alice sends him for the whole sentence, he is also allowed to ask her for a “bonus-code” for each new word he needs to predict. This technique is basically what allowed Google Translate to become so good in recent years.

Text summarisation is also a closely related discipline. Here Alice encodes a long text, and Bob has to decode it into a summary.

There are also countless more training methods, including yet to be discovered ones, that will help them keep up with the increasingly strong competition in their already mastered disciplines.

But for now, Alice and Bob need a well-deserved rest.