I have been doing a lot of shopping on Amazon lately. The ease with which I find most products on Amazon got me thinking: how does it even work? I mean, there are millions of products on Amazon. How do they search and sort through this huge pile of data and get me what I want? I looked up a few resources online, and this is a summary of what I learned. I want to share what I understood in this blog, in case you have been wondering too.
You might already know that Amazon has separate sites for many countries around the world.
For each site, there are around 30 main categories. Models are trained for each (category, site) pair, i.e., each context. The machine learning models used for each context are pretty standard for the industry: Gradient Boosted Trees with Pairwise Ranking. Each of these models uses about 20 specialized features. More about that later.
So, on what data would we train and test the models?
Amazon is a giant company, with billions of product purchases each year. The training data consists of real-life user interactions with searches and products.
What would make a positive label? If the user put in a query and then clicked on a listed product or purchased a listed item, we can consider that a positive label. You might be wondering why the team didn't use purchases alone. Why include clicks? After all, a purchase is what they are trying to optimize the search for. Dr. Daria Sorokina explains it well in her MLconf talk.
The fact is that the most-bought product for a query isn't necessarily the most relevant item for the query. She uses the example of the search query 'iPhone.' The most-bought item under that query is the iPhone cable, which is clearly not what the user would expect as the first search result for 'iPhone.' Hence, the search relevance team at Amazon chose to use both purchases and clicks as positive labels.
Then what would make a negative label? This is pretty straightforward. If the user searched for an item and didn't click the listed products, the ranking might be wrong. A few random samples are also added to the negatively labeled data before training.
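The labeling scheme above can be sketched in a few lines of Python. This is only an illustration: the field names, the log format, and the random-negative sampling rate are my assumptions, not Amazon's actual data model.

```python
import random

def label_interactions(log_rows, negative_sample_rate=0.05):
    """Turn raw (query, product, clicked, purchased) log rows into labeled pairs.

    Clicks and purchases are both positives; shown-but-ignored items are
    negatives; a few random (query, product) pairs are mixed in as extra
    negatives, as described in the talk.
    """
    labeled = []
    for row in log_rows:
        if row["clicked"] or row["purchased"]:
            labeled.append((row["query"], row["product"], 1))  # positive
        else:
            labeled.append((row["query"], row["product"], 0))  # negative
    # Mix in a few random (query, product) pairs as additional negatives.
    all_products = [r["product"] for r in log_rows]
    for row in log_rows:
        if random.random() < negative_sample_rate:
            labeled.append((row["query"], random.choice(all_products), 0))
    return labeled
```

With the sampling rate set to zero, the function simply maps each logged impression to a labeled example.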
Initially, around 150 features are available for a particular category. Some of them are random. The model is trained over these, and the features that perform better than the random ones are kept, shortening the list to around 50.
The features are then reduced from 50 to approximately 20 through backward elimination or forward selection. The latter is pretty straightforward, but the former has some interesting challenges.
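The first cut, filtering against random probe features, can be sketched as follows. Here `importances` is assumed to map feature names to importance scores from a trained model, and random probes are assumed to carry a `random_` prefix; both conventions are mine, not Amazon's.

```python
def filter_by_random_probes(importances):
    """Keep only real features whose importance beats the best random probe."""
    probe_scores = [s for name, s in importances.items()
                    if name.startswith("random_")]
    threshold = max(probe_scores)  # best-scoring random feature
    return sorted(
        name for name, s in importances.items()
        if not name.startswith("random_") and s > threshold
    )
```

The intuition: a feature that cannot beat pure noise is unlikely to carry signal, so noise features give a natural cutoff.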
When scoring features with an ensemble of trees, the more splits a feature is used in, the more important it is assumed to be. This works in most cases, but not when there is a mix of continuous and binary features.
A continuous feature can be split at many points, while a binary feature can be split in only one way. Hence, in feature selection over a mix of both, continuous features always rank higher. Important binary features end up ranking worse than random continuous features. Ouch!
The search relevance team ran experiments to confirm this. They observed that as the cardinality of a feature grows exponentially, its importance grows only linearly, a logarithmic relationship. Hence, to bring all random features to the same score, the team normalized the original score by the node's entropy at each split. This worked.
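One plausible reading of that normalization is to discount each split's gain by the entropy of the partition the split produces: a many-way split on a high-cardinality feature has high entropy and gets discounted, while a binary split does not. The function names and the exact formula here are my illustration of the idea, not Amazon's implementation.

```python
import math

def split_entropy(branch_sizes):
    """Entropy (in bits) of how a split partitions the node's samples."""
    total = sum(branch_sizes)
    return -sum((n / total) * math.log2(n / total)
                for n in branch_sizes if n > 0)

def normalized_gain(raw_gain, branch_sizes):
    """Discount a split's raw gain by the entropy of its partition."""
    return raw_gain / split_entropy(branch_sizes)
```

A 50/50 binary split has entropy 1 bit, so its gain is unchanged; an even four-way split has entropy 2 bits, so the same raw gain counts for half as much.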
This is how product searches happen within each category. Now, how about a general product search?
General Product Search
For a general product search, the top results of the query in each category are chosen. But there are at least 20 categories. How can we combine them to show only the most relevant results?
For this, the Search Relevance Team uses a blend of three main components.
The Query Category Score
This score tells us how important a given query is for a category. For common queries, we can tell by looking at the number of clicks and purchases on the category's products for that query. For rare queries, which tend to be longer, unigrams, bigrams, and trigrams of the query are checked for clicks instead.
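The n-gram fallback for rare queries might look something like the sketch below. The click table and the additive scoring are assumptions made for illustration.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, joined back into phrases."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def query_category_score(query, category, click_counts):
    """click_counts: {(phrase, category): clicks} aggregated from logs."""
    if (query, category) in click_counts:       # common query: direct lookup
        return click_counts[(query, category)]
    tokens = query.split()
    score = 0
    for n in (1, 2, 3):                         # rare query: n-gram fallback
        for gram in ngrams(tokens, n):
            score += click_counts.get((gram, category), 0)
    return score
```

A rare query like 'organic cat food' would then borrow evidence from the clicks its pieces ('cat', 'cat food', ...) have accumulated in each category.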
The Hunger Score
The hunger score for a category measures its hunger for being chosen/clicked for a given query. The longer a category goes without being chosen, the hungrier it becomes. A category with a high Query Category Score gets hungrier faster.
The In-Category Relevance Score
This is the relevance of each product within the category.
A combination of these is used to decide which products appear at the top for a general product search query. Let us look at an example.
The search query is 'Cat.'
Suppose there are three categories: Pet Products, Movies, and Music Albums.
The Query Category Score is as listed below.
P stands for Pet Products, M for Movies, and U for Music Albums.
Initially, the hunger score is set to zero for every category.
Product P1 has the highest In Category Relevance Score and is chosen as Rank 1.
Since Movies and Music Albums weren't chosen, their hunger scores increase. Since Movies has a higher Query Category Score than Music Albums, its hunger increases faster.
In Iteration 2, Product M1 has the highest combined In-Category Relevance and hunger score and is chosen as Rank 2. Since the other two categories aren't chosen, their hunger increases. And the process continues.
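The iterative blending described above can be sketched as a simple loop. The exact update rule is not public, so this is an assumption: here a losing category's hunger grows each round by its Query Category Score, and hunger resets once the category wins a slot.

```python
def blend_rankings(per_category, qc_score, k):
    """per_category: {category: [(product, in_category_score), ...]},
    each list sorted by score descending. qc_score: Query Category Scores.
    Returns the top-k blended product list."""
    hunger = {c: 0.0 for c in per_category}
    results = []
    for _ in range(k):
        # Pick the category whose top item's score plus hunger is highest.
        best = max(
            (c for c in per_category if per_category[c]),
            key=lambda c: per_category[c][0][1] + hunger[c],
        )
        product, _ = per_category[best].pop(0)
        results.append(product)
        for c in per_category:               # losing categories get hungrier
            if c != best:
                hunger[c] += qc_score[c]
        hunger[best] = 0.0                   # the winner's hunger resets
    return results
```

Running it on a toy version of the 'Cat' example reproduces the walkthrough: the Pet Products item wins Rank 1 outright, then hunger lets the other categories catch up.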
Roughly, this is how various scores are blended to give a search result on the general products page.
How is the match set generated?
The match set is the set of items we show the user when they search for a query. It is generated in two ways:
Textual matches: items that match the query's description.
Behavioral matches: items that users' behavior shows are similar.
Textual matches are easy to find. But how are behavioral matches found?
Behavioral matches are found by looking at how users interacted with other items when the queried item was not yet available on Amazon. For example, a user searches for a movie on Amazon, but the item is not yet available. The user continues searching within the same session and engages with a similar movie instead. Since both interactions happen in the same session, the pair is saved as a behavioral match.
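Mining such matches from session logs might look like the sketch below. The event schema (`search`, `click`, `purchase` events with these fields) is hypothetical.

```python
from collections import defaultdict

def behavioral_matches(sessions):
    """sessions: list of sessions; each session is a time-ordered list of
    event dicts. A query that returned no results is matched to any item
    the user later engaged with in the same session."""
    matches = defaultdict(set)
    for session in sessions:
        pending = []                          # queries with no results so far
        for event in session:
            if event["type"] == "search" and not event["results"]:
                pending.append(event["query"])
            elif event["type"] in ("click", "purchase"):
                for q in pending:
                    matches[q].add(event["item"])
    return matches
```

Because matching is scoped to a single session, the items a user falls back to are plausibly substitutes for what they originally wanted.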
Cold Start Problem
What about when no such data is available for an item? Dr. Daria Sorokina explains this with the example of the latest Harry Potter book. When such an item is released, the new book should clearly take precedence over all other books. But the search would rank it lowest and show the older books more prominently.
How does Amazon deal with this problem?
On day one, users who want the item will find it anyway, even if it takes an exhaustive search, and the item's score slowly increases. By day two, when customer complaints pour in, Amazon uses an infrastructure shortcut to push a manual boost. If by day seven the changes are still not reflected as expected, manual patchwork is done.
Finally, the Non-Relevance Sort
What happens when a user opts for a sort other than 'Relevance,' for example, average review?
A query like 'TV' could return ten thousand items. Dr. Daria says the most highly reviewed product under that query is often a TV cable. Suddenly the result becomes irrelevant to the user's query. How does the Search Relevance Team deal with this problem?
When a user applies a sort other than relevance, the product listings are first filtered down to at most three product types. Only those items are then sorted by average customer review, giving the user a useful search result.
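The idea can be sketched as follows: restrict the match set to its dominant product types before sorting, so that accessories like cables can't flood the top. The field names and the choice of "most common types" as the restriction rule are my assumptions.

```python
from collections import Counter

def sort_by_reviews(items, top_types=3):
    """items: list of dicts with 'product_type' and 'avg_review' keys.
    Keep only the most common product types, then sort by review."""
    counts = Counter(item["product_type"] for item in items)
    keep = {t for t, _ in counts.most_common(top_types)}
    filtered = [item for item in items if item["product_type"] in keep]
    return sorted(filtered, key=lambda item: item["avg_review"], reverse=True)
```

In the 'TV' example, actual TVs dominate the match set by count, so a highly reviewed cable gets filtered out before the review sort is applied.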
Roughly, this is the amazing logic that powers Amazon Product Search. ;)
Dr. Daria Sorokina is an Applied Machine Learning researcher at Amazon. She did her Ph.D. at Cornell University and previously interned with Google. Later on, she worked as a Senior Data Scientist at LinkedIn. Since 2010, she has been part of the Search Relevance team at Amazon as an Applied Scientist.
PS: In case you are unfamiliar with feature selection and Gradient Boosted Trees with Pairwise Ranking, I'll be writing about them soon. You can sign up for the mailing list to get an alert when they're up, or follow me on LinkedIn.
Let me know if you would like me to write about other real-life ML applications. Trying to make this a habit. :)