Google announced Smart Reply for Gmail at its I/O conference in 2017. Since then, it has been available to people with Android and iOS phones. I only recently started noticing and using this feature heavily, and it is now one of the features I love the most about Gmail. It seems to really get my tone, anticipate my likely response, and offer a few different ways to say it. So the obvious question: how does it work?
Smart Reply is the Gmail feature that shows three suggested responses at the bottom of an incoming email. Instead of crafting a reply yourself, you can just click on one of the Google-generated phrases and send your email.
An initial study by the Google Research team, covering several million email-reply pairs, showed that about 25% of replies have 20 tokens or fewer. With replies this short, the team began thinking of ways to help users compose them, ideally with a single tap.
The training data consisted of pairs of an incoming message and the user's response to that message. Additionally, the team sampled some messages that the user didn't respond to.
The data was preprocessed as follows:
Language detection: only English is supported as of now, so messages in all other languages are discarded.
Tokenization: the subject and message body are broken into words and punctuation marks.
Infrequent words and entities like personal names, URLs, email addresses, phone numbers, etc., are replaced by special tokens.
Quoted original messages and forwarded messages are removed.
After the preprocessing steps, the size of the training set was 238 million messages, which included 153 million messages that had no response.
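The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration and not the pipeline from the paper: the regular expressions, the placeholder tokens (`<url>`, `<email>`, `<phone>`, `<unk>`), the "Forwarded message" marker, and the function names are all my own assumptions, and language detection is left out because it needs a separate model.

```python
import re

# All patterns and placeholder tokens below are hypothetical choices for this sketch.
URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{6,}\d")
TOKEN_RE = re.compile(r"<\w+>|\w+(?:'\w+)?|[^\w\s]")
SPECIAL = {"<url>", "<email>", "<phone>"}


def strip_quoted(body: str) -> str:
    """Drop '>'-quoted lines and everything after a forwarded-message marker."""
    kept = []
    for line in body.splitlines():
        if line.lstrip().startswith(">"):
            continue
        if "Forwarded message" in line:
            break
        kept.append(line)
    return "\n".join(kept)


def preprocess(subject: str, body: str, vocab: set[str]) -> list[str]:
    """Tokenize subject and body, replacing entities and rare words with special tokens."""
    text = (subject + "\n" + strip_quoted(body)).lower()
    text = EMAIL_RE.sub("<email>", text)
    text = URL_RE.sub("<url>", text)
    text = PHONE_RE.sub("<phone>", text)
    tokens = TOKEN_RE.findall(text)
    # Anything outside the (frequent-word) vocabulary becomes <unk>.
    return [t if t in vocab or t in SPECIAL else "<unk>" for t in tokens]
```

Here `vocab` stands in for the set of frequent words; a name like "Bob" that is not in it comes out as `<unk>`, which is the anonymizing effect the list above describes.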
The Smart Reply Pipeline
To smart respond or not to?
The trigger module is the entry point to the system. Not every email can be answered with a quick response. Examples include emails on sensitive topics, emails with open-ended questions, and promotional emails that don't need a reply at all. This classification is performed using a simple feed-forward neural network.
So, after the preprocessing described above, n-gram features such as unigrams and bigrams are extracted from the message and sent to the trigger module. Its training data also included emails that received no reply, which serve as negative examples.
If the trigger module labels an incoming email negative, no suggestions are shown. If it labels the email positive, the email moves on to the next step: response selection.
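To make the trigger step concrete, here is a toy sketch of a feed-forward gate over hashed unigram and bigram counts. Everything specific here is assumed for illustration: the bucket count, the single hidden layer, the ReLU and sigmoid choices, the 0.5 threshold, and above all the weights, which are random placeholders rather than trained values.

```python
import math
import random
import zlib

NUM_BUCKETS = 64  # hypothetical size of the hashed feature space
HIDDEN = 8        # hypothetical hidden layer width


def ngram_features(tokens: list[str]) -> list[float]:
    """Hash unigrams and bigrams into a fixed-size count vector."""
    grams = list(tokens) + [a + " " + b for a, b in zip(tokens, tokens[1:])]
    vec = [0.0] * NUM_BUCKETS
    for gram in grams:
        # crc32 is used because Python's built-in hash() varies between runs.
        vec[zlib.crc32(gram.encode("utf-8")) % NUM_BUCKETS] += 1.0
    return vec


class TriggerNet:
    """One ReLU hidden layer and a sigmoid output; weights are untrained placeholders."""

    def __init__(self, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.w1 = [[rng.gauss(0, 0.1) for _ in range(NUM_BUCKETS)] for _ in range(HIDDEN)]
        self.b1 = [0.0] * HIDDEN
        self.w2 = [rng.gauss(0, 0.1) for _ in range(HIDDEN)]
        self.b2 = 0.0

    def predict_proba(self, features: list[float]) -> float:
        hidden = [max(0.0, sum(w * x for w, x in zip(row, features)) + b)
                  for row, b in zip(self.w1, self.b1)]
        logit = sum(w * h for w, h in zip(self.w2, hidden)) + self.b2
        return 1.0 / (1.0 + math.exp(-logit))


def should_suggest(tokens: list[str], net: TriggerNet, threshold: float = 0.5) -> bool:
    """Return True when the email should be passed on to response selection."""
    return net.predict_proba(ngram_features(tokens)) >= threshold
```

In a real system the weights would come from training on emails with and without replies; the point of the sketch is only the shape of the computation: features in, one probability out, and a threshold that decides whether suggestions are shown.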
The input is a sequence of tokens, and the output is a sequence of tokens conditioned on the incoming sequence. Hence the model used to generate responses is a sequence-to-sequence model, which is basically two neural networks: an encoder and a decoder. The encoder ingests the incoming email, and the decoder produces the response. In this case, the model is an LSTM. The objective is to maximize the log probability of observed responses given their originating messages. Training was performed against this objective using stochastic gradient descent with AdaGrad, running ten epochs in a distributed fashion using the TensorFlow library.
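Written out, the objective looks like this. The notation is my own shorthand: o_1..o_n are the tokens of the incoming message, r_1..r_m are the tokens of the response, and theta stands for the model parameters. The model factorizes the response probability token by token, and training maximizes the summed log probability over all observed (message, response) pairs.

```latex
\[
P(r_1, \dots, r_m \mid o_1, \dots, o_n)
  = \prod_{i=1}^{m} P(r_i \mid o_1, \dots, o_n, r_1, \dots, r_{i-1})
\]
\[
\max_{\theta} \sum_{(o, r)} \log P_{\theta}(r_1, \dots, r_m \mid o_1, \dots, o_n)
\]
```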
It was also found that adding a recurrent projection layer substantially improved both the quality of the converged model and the time to converge, and that gradient clipping (with a value of 1) was essential for stable training.
So how does it work in practice?
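For readers who haven't seen gradient clipping before, here is a minimal sketch. Note the assumption: the text above only says "a value of 1" and does not say whether gradients are clipped elementwise or by their overall norm, so this sketch picks norm-based clipping. The function name is mine.

```python
import math


def clip_by_global_norm(grads: list[float], max_norm: float = 1.0) -> list[float]:
    """Rescale the gradient vector so its L2 norm is at most max_norm (assumed variant)."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)  # already small enough, leave unchanged
    scale = max_norm / norm
    return [g * scale for g in grads]
```

A gradient of [3.0, 4.0] has norm 5, so it gets scaled down to roughly [0.6, 0.8]; the direction is preserved and only the step size is capped, which is what keeps training from blowing up.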
At inference time, we feed the tokens of the incoming email into the model one by one. The simplest way to decode is greedy: take the most probable token as output, feed it back in, and repeat until the response ends. Another way is beam search: take the top b tokens and feed them in, then retain the b best response prefixes and repeat. How do we score a candidate response? Feed in its tokens one at a time and use the softmax output at each step to get the likelihood of the next candidate token; together these give the likelihood of the whole response.
This looks easy, but it comes with its own set of challenges, as mentioned in the paper:
Utility: how to select multiple options to show a user so as to maximize the likelihood that one is chosen.
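Here is a small sketch of greedy versus beam decoding, plus scoring a fixed candidate. The "model" is a hand-written lookup table of next-token probabilities (`TOY`), which is purely hypothetical and stands in for the LSTM's softmax output; the function names are mine as well. Unlike the real system, this sketch does not restrict the search to a fixed response set.

```python
import math
from typing import Callable

EOS = "<eos>"
Prefix = tuple[str, ...]
StepFn = Callable[[Prefix], dict[str, float]]

# Hypothetical next-token probabilities, standing in for the model's softmax output.
TOY = {
    (): {"thanks": 0.55, "sounds": 0.45},
    ("thanks",): {"!": 0.4, EOS: 0.6},
    ("thanks", "!"): {EOS: 1.0},
    ("sounds",): {"good": 1.0},
    ("sounds", "good"): {EOS: 1.0},
}


def toy_step(prefix: Prefix) -> dict[str, float]:
    """Return log-probabilities of each possible next token after prefix."""
    return {tok: math.log(p) for tok, p in TOY[prefix].items()}


def beam_search(step: StepFn, beam_width: int, max_len: int) -> list[tuple[Prefix, float]]:
    """Keep the beam_width best prefixes at each step; beam_width=1 is greedy decoding."""
    beams: list[tuple[Prefix, float]] = [((), 0.0)]
    finished: list[tuple[Prefix, float]] = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token, logp in step(prefix).items():
                item = (prefix + (token,), score + logp)
                (finished if token == EOS else candidates).append(item)
        if not candidates:
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    finished.sort(key=lambda c: c[1], reverse=True)
    return finished[:beam_width]


def score_response(step: StepFn, response: Prefix) -> float:
    """Log-likelihood of a fixed candidate: sum of per-token log-probabilities."""
    return sum(step(response[:i])[tok] for i, tok in enumerate(response))
```

With this toy table, greedy decoding (beam width 1) commits to "thanks" first and ends up with a response of probability 0.33, while a beam of width 2 keeps "sounds" alive and finds "sounds good" at 0.45. That gap is the reason to keep more than one prefix around.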
Ensuring Response Quality/Utility
We need to ensure that responses are correct sentences that do not carry common spelling or grammatical errors or use any profanity. The responses also need to be diverse and specific, and ideally should not repeat the same intention.
For this, we revisit the preprocessed corpus from earlier. A response set is built using only the most frequent anonymized sentences aggregated from this preprocessed data. This process gives you a response set with a few million unique sentences.
The first step is to automatically generate a set of canonical response messages that capture the variability in language. Each sentence is parsed using a dependency parser, and its syntactic structure is used to generate a canonicalized representation. Words (or phrases) that are modifiers or unattached to head words are dropped.
Now, these responses are clustered based on semantic intent, for example a "Thank you" set, a "Sorry" set, a "Yes" set, etc. The clustering is treated as a semi-supervised machine learning problem and solved with scalable graph algorithms. This automatically digests the information present in frequent responses into a coherent set of semantic clusters. I am not digging into its details here.
Once the clustering is done, we extract the top k members for each semantic cluster, sorted by their label scores. The set of (response, cluster label) pairs is then validated by actual humans. The raters are provided with a response Ri, a corresponding cluster label C (e.g., thanks), and a few example responses belonging to the cluster (e.g., “Thanks!”, “Thank you.”), and are asked whether Ri belongs to C. The result is an automatically generated and validated set of high-quality response messages labeled with semantic intent. This is subsequently used by the response scoring model to search for approximate best responses to an incoming email.
So basically, the LSTM first processes an incoming message and then selects the (approximate) best responses from the target response set.
Now how do we select the three responses to show the user? The straightforward way is to go with the top three responses from the model, but these are highly likely to be redundant. So, to ensure diversity, we make sure that the three responses do not share an intent, that is, no two of them come from the same semantic cluster discussed above.
10% of all replies are now crafted from the suggestions that the Smart Reply feature provides, making emailing a little easier for users like you and me around the globe.
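The diversity rule above can be sketched in a few lines. This is a simplified illustration under my own assumptions: responses arrive already scored by the model, each response maps to exactly one cluster label, and the names `pick_diverse`, `scored`, and `cluster_of` are hypothetical.

```python
def pick_diverse(scored: list[tuple[str, float]],
                 cluster_of: dict[str, str],
                 k: int = 3) -> list[str]:
    """Walk responses from best to worst score, keeping at most one per cluster."""
    chosen: list[str] = []
    seen_clusters: set[str] = set()
    for response, _score in sorted(scored, key=lambda rs: rs[1], reverse=True):
        cluster = cluster_of[response]
        if cluster in seen_clusters:
            continue  # same intent as an already chosen response
        chosen.append(response)
        seen_clusters.add(cluster)
        if len(chosen) == k:
            break
    return chosen


scored = [("Thanks!", -0.5), ("Thank you.", -0.6), ("Sounds good.", -0.9),
          ("Thanks a lot!", -1.0), ("Sorry, I can't.", -1.4)]
cluster_of = {"Thanks!": "thanks", "Thank you.": "thanks", "Thanks a lot!": "thanks",
              "Sounds good.": "yes", "Sorry, I can't.": "sorry"}
suggestions = pick_diverse(scored, cluster_of)
```

Taking the plain top three here would show three flavors of "thanks". With the cluster check, `suggestions` is "Thanks!", "Sounds good." and "Sorry, I can't.", so the user sees three different intents.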
People Behind The Work
(The paper has more than ten authors; I am writing about the first three.)
She completed her bachelor's degree in Computer Science at Harvard and now works as a Software Engineer at Google.
Karol completed his PhD at the University of Warsaw, before which he had interned with companies like NVIDIA, Facebook and Google. He joined Google on the Gmail Comprehension/Intelligence team in 2012 and was at Google Brain (ML Researcher / Research Lead) until he recently left to join Cosmose Inc as CTO.
Sujith completed his PhD in Computer Science at the University of Southern California. Prior to that, he had worked as a research intern at Google and Yahoo. He later joined Google, where he was a Senior Staff Research Scientist, Senior Manager and Technical Lead. He left Google in 2019 and is now a Director at Amazon Alexa.