Because of the high number of categorical features and important interaction effects, Ads/RecSys models often explicitly incorporate cross features (the interaction of two features). We review a few different approaches to this problem and explain how MaskNet, the model used by the Twitter heavy ranker, works.
Let's suppose you have a friend @username on Twitter and you only retweet their tweets about cats. You follow them, but you don't want to see their tweets about dogs (on Twitter these days people only like cats).
To understand the probability that you will retweet their tweet, the model needs to understand the interaction between a feature like "tweet is about cats" and "author is @username". We refer to this as a feature cross and it is a common problem in Ads/RecSys models.
So, how do we model this interaction?
The classical approach to modeling feature crosses is to use a linear polynomial model: a linear combination of the features, their powers, and their products. For example, if we have two features x and y and we want to model the second-order interaction between x and y, we would use a model like:

$$\hat{y} = \beta_0 + \beta_1 x + \beta_2 y + \beta_3 xy$$
To make this a little more concrete, let's assume we have a model with only two features, a categorical feature for tweet topic and a categorical feature for author id, and let's one-hot encode each of them into a group of binary features.
Now consider the second-order features we can form by multiplying any given one-hot feature with another one-hot feature. In this setup, we only have 3 non-zero second-order features:
- topic = cats AND author = @username (the cross feature we wanted)
- topic = cats AND topic = cats (which reduces to topic = cats, already in the model)
- author = @username AND author = @username (which similarly reduces to author = @username, already in the model)
All other second-order features are zero under this one-hot encoding, so the model can learn a dedicated weight for the strength of the interaction between tweets about cats and @username (and we can generalize this to include higher-order interactions like viewer = you AND topic = cats AND author = @username).
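To make this concrete, here is a minimal Python sketch of explicitly crossing two one-hot encoded features (the vocabularies are hypothetical, purely for illustration):

```python
import numpy as np

# Hypothetical vocabularies for our two categorical features.
topics = ["cats", "dogs", "tech"]
authors = ["@username", "@someone_else"]

def one_hot(value, vocab):
    """One-hot encode a single categorical value against a vocabulary."""
    vec = np.zeros(len(vocab))
    vec[vocab.index(value)] = 1.0
    return vec

topic_x = one_hot("cats", topics)         # [1, 0, 0]
author_x = one_hot("@username", authors)  # [1, 0]

# Second-order (cross) features: the outer product of the two one-hot groups.
# Exactly one entry is non-zero: (topic = cats AND author = @username).
cross = np.outer(topic_x, author_x).flatten()

# A linear model over [first-order features, cross features] can now learn a
# separate weight for every (topic, author) pair: |topics| * |authors| weights.
features = np.concatenate([topic_x, author_x, cross])
print(features.shape)  # (3 + 2 + 6,) = (11,)
```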
However, the problem with this approach is that it is computationally expensive to train and difficult to scale to large numbers of features. Unfortunately, to consider the second-order interaction between all features, we would need to train a model with on the order of $n^2$ weights, where $n$ is the total number of one-hot features.
Feature selection or hand-picking features is a common approach to this problem. However, this is not a scalable solution for features with high cardinality (like authors) and it is difficult to know which features to select.
Let's write the general form of our linear interaction model (order 2):

$$\hat{y} = \beta_0 + \sum_{i=1}^n \beta_i x_i + \sum_{i=1}^n \sum_{j=i+1}^n w_{ij}\, x_i x_j$$
To make the next model more intuitive, let's rewrite this into matrix form:

$$\hat{y} = \beta_0 + \boldsymbol{\beta}^T \mathbf{x} + \mathbf{x}^T W \mathbf{x}$$
where $\mathbf{x} \in \mathbb{R}^n$ is the feature vector, $\boldsymbol{\beta} \in \mathbb{R}^n$ holds the first-order weights, and $W \in \mathbb{R}^{n \times n}$ collects the pairwise interaction weights $w_{ij}$ (we can take it to be strictly upper-triangular so each pair is counted once).
In particular, $W$ has on the order of $n^2$ entries, which is exactly the scaling problem we ran into above.
In a Factorization Machine (FM), we factorize the weights matrix $W$ into a product of low-dimensional embeddings:

$$w_{ij} = \mathbf{v}_i^T \mathbf{v}_j$$

where $\mathbf{v}_i \in \mathbb{R}^k$ is a learned embedding vector for feature $i$ and $k \ll n$, so we only need $nk$ parameters instead of $O(n^2)$.
Writing this in another form, we can see that the FM is a linear model with a feature cross term:

$$\hat{y} = \beta_0 + \sum_{i=1}^n \beta_i x_i + \sum_{i=1}^n \sum_{j=i+1}^n \mathbf{v}_i^T \mathbf{v}_j\, x_i x_j$$

where the cross weight for every pair of features is now the dot product of their embeddings rather than a free parameter.
Factorization Machines in particular have a neat trick, the sum-of-squares trick, which reduces the cost of computing the cross term from quadratic to linear in the number of features.
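Concretely, the trick from the FM paper rewrites $\sum_{i} \sum_{j>i} \mathbf{v}_i^T \mathbf{v}_j\, x_i x_j$ as $\frac{1}{2}\sum_{f=1}^k \left[\left(\sum_i v_{i,f}\, x_i\right)^2 - \sum_i v_{i,f}^2\, x_i^2\right]$, which is $O(nk)$ instead of $O(n^2 k)$. A quick numerical sanity check of the identity (a sketch, not production code):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 8                  # n features, k-dimensional embeddings
x = rng.normal(size=n)        # feature values
V = rng.normal(size=(n, k))   # one embedding vector v_i per feature

# Naive O(n^2 k): sum over all pairs i < j of <v_i, v_j> * x_i * x_j.
naive = sum(V[i] @ V[j] * x[i] * x[j]
            for i in range(n) for j in range(i + 1, n))

# Sum-of-squares trick, O(n k):
# 0.5 * sum_f [ (sum_i v_if x_i)^2 - sum_i v_if^2 x_i^2 ]
vx = V * x[:, None]
trick = 0.5 * np.sum(np.sum(vx, axis=0) ** 2 - np.sum(vx ** 2, axis=0))

assert np.allclose(naive, trick)
```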
Another generalization of our linear polynomial model is the Wide & Deep model. In this model, we replace the linear part over the raw features with a neural network (the "deep" part) while keeping a small set of hand-picked higher-order feature crosses in a linear "wide" part of the model.
We can think of this as a feature partition between deep features $\mathbf{x}_{deep}$ (dense features fed to a neural network) and wide features $\mathbf{x}_{wide}$ (the hand-picked cross features):

$$\hat{y} = \sigma\left(\mathbf{w}_{wide}^T \mathbf{x}_{wide} + \mathbf{w}_{deep}^T f(\mathbf{x}_{deep}) + b\right)$$

where $f$ is a feed-forward network, $\mathbf{w}_{wide}$ and $\mathbf{w}_{deep}$ are the weights of the final layer, and $\sigma$ is the output non-linearity (a sigmoid for engagement probabilities).
The idea here is that the deep part of the network can learn a complex non-linear transformation of the dense features while the wide part of the network keeps important cross features directly in the last layer.
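Here is a minimal PyTorch sketch of that partition (layer sizes and names are illustrative, not the published Wide & Deep implementation):

```python
import torch
import torch.nn as nn

class WideAndDeep(nn.Module):
    """Wide part: a linear layer over hand-picked cross features.
    Deep part: an MLP over dense features. Both feed the final logit."""

    def __init__(self, n_wide: int, n_dense: int, hidden: int = 64):
        super().__init__()
        self.wide = nn.Linear(n_wide, 1)   # keeps crosses directly at the last layer
        self.deep = nn.Sequential(
            nn.Linear(n_dense, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x_wide: torch.Tensor, x_dense: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.wide(x_wide) + self.deep(x_dense))

# e.g. probability of retweet from 10 cross features and 32 dense features
model = WideAndDeep(n_wide=10, n_dense=32)
p = model(torch.randn(4, 10), torch.randn(4, 32))  # shape (4, 1)
```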
The Deep Factorization Machine (DLRM variant) is a generalization of a Factorization Machine that also incorporates a feature partition between dense and sparse features. We can write it as:
$$\hat{y} = g\left(f\Big(\beta_0 + \sum_{d=1}^n \beta_d x_d\Big) + \sum_{i=1}^n \sum_{j=i+1}^n \mathbf{v}_i^T \mathbf{v}_j\, x_i x_j\right)$$
where $f$ is a neural network applied to the dense features, $\mathbf{v}_i$ are learned embeddings for the sparse features, and $g$ is a final neural network that turns the combined representation into a prediction.
Like the Wide & Deep model, we learn a transformation of the dense features and then add the feature crosses. However, unlike the Wide & Deep model, we use the embeddings to generate the feature crosses.
We then pass the concatenation of the dense projection and the feature crosses to a final neural network $g$ to produce the prediction.
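A sketch of this DLRM-style interaction in PyTorch (sizes and cardinalities are made up for illustration, and the details are simplified relative to the paper): project the dense features, embed the sparse features, take pairwise dot products of the embeddings, and feed the concatenation to a final MLP.

```python
import torch
import torch.nn as nn

class DLRMStyle(nn.Module):
    def __init__(self, n_dense: int, sparse_cardinalities, dim: int = 16):
        super().__init__()
        # f: dense features -> an embedding-sized projection
        self.dense_proj = nn.Sequential(nn.Linear(n_dense, dim), nn.ReLU())
        # one embedding table per sparse (categorical) feature
        self.embeddings = nn.ModuleList([nn.Embedding(c, dim) for c in sparse_cardinalities])
        n_vecs = len(sparse_cardinalities) + 1      # sparse embeddings + dense projection
        n_crosses = n_vecs * (n_vecs - 1) // 2      # pairwise dot products
        # g: final network over [dense projection, feature crosses]
        self.top = nn.Sequential(nn.Linear(dim + n_crosses, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x_dense: torch.Tensor, x_sparse: torch.Tensor) -> torch.Tensor:
        d = self.dense_proj(x_dense)                                   # (B, dim)
        vecs = [d] + [emb(x_sparse[:, i]) for i, emb in enumerate(self.embeddings)]
        V = torch.stack(vecs, dim=1)                                   # (B, n_vecs, dim)
        dots = V @ V.transpose(1, 2)                                   # (B, n_vecs, n_vecs)
        iu = torch.triu_indices(V.shape[1], V.shape[1], offset=1)
        crosses = dots[:, iu[0], iu[1]]                                # (B, n_crosses)
        return torch.sigmoid(self.top(torch.cat([d, crosses], dim=1)))

model = DLRMStyle(n_dense=13, sparse_cardinalities=[1000, 50, 10])
p = model(torch.randn(2, 13), torch.randint(0, 10, (2, 3)))            # shape (2, 1)
```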
A similar riff on this idea is the Deep Cross Net (v2) model. Unlike the DeepFM or Wide & Deep models, the Deep Cross Net (v2) model does not have a feature partition between dense and sparse features.
Instead, we generate feature crosses for all the features using a matrix equation very similar to our linear polynomial model (where $\odot$ is element-wise multiplication):

$$\mathbf{x}_{cross} = \mathbf{x} \odot (W \mathbf{x})$$
The key insight here is that rather than taking the full quadratic form $\mathbf{x}^T W \mathbf{x}$, which collapses every cross into a single scalar, we multiply by $W$ once and then take the element-wise product with $\mathbf{x}$, so the output is still a vector with one entry per feature. For a given feature $i$, the $i$-th output is $x_i \sum_j W_{ij} x_j$: a learned, weighted sum of the crosses between feature $i$ and every other feature.
This is a neat trick because we're able to generate the same number of (second-order) feature crosses as we have features. This means that we can feed the output of this equation back into it to generate the crosses between our second-order crosses and our original features (aka third-order crosses), and so on! This gives us the real equation:

$$\mathbf{x}_{l+1} = \mathbf{x}_0 \odot (W_l \mathbf{x}_l + \mathbf{b}_l) + \mathbf{x}_l$$

where $\mathbf{x}_0$ is the original feature vector and $\mathbf{x}_l$ is the output of the $l$-th cross layer.
This approach can be combined with a neural net output to generate the final prediction.
We can also use the Factorization Machine approach and factorize the weight matrix $W_l$ into two low-rank matrices, $W_l \approx U_l V_l^T$, which cuts the parameter count from $n^2$ to $2nr$ for rank $r$.
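A sketch of a single Deep Cross Net v2 cross layer, with the optional low-rank factorization of the weight matrix (sizes are illustrative):

```python
import torch
import torch.nn as nn

class CrossLayerV2(nn.Module):
    """One DCN v2 cross layer: x_{l+1} = x_0 * (W x_l + b) + x_l,
    where * is element-wise multiplication."""

    def __init__(self, dim: int, rank=None):
        super().__init__()
        if rank is None:
            self.w = nn.Linear(dim, dim)   # full W, O(dim^2) parameters
        else:
            # factorized W ~= U V^T, O(2 * dim * rank) parameters
            self.w = nn.Sequential(nn.Linear(dim, rank, bias=False), nn.Linear(rank, dim))

    def forward(self, x0: torch.Tensor, xl: torch.Tensor) -> torch.Tensor:
        return x0 * self.w(xl) + xl

# Stacking layers raises the order of the crosses by one each time.
x0 = torch.randn(4, 32)
cross1 = CrossLayerV2(32)           # second-order crosses
cross2 = CrossLayerV2(32, rank=8)   # third-order crosses, low-rank weights
x = cross2(x0, cross1(x0, x0))      # shape (4, 32)
```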
MaskNet can be thought of as a multi-head variant of the Deep Cross Net shown above. (Although the paper frames it in terms of an attention mechanism - which also uses element-wise multiplication - attention usually involves a softmax which is not the case here).
The MaskNet equation (for a single head) is:

$$\mathbf{x}_{out} = \big(W_2\,\mathrm{ReLU}(W_1 \mathbf{x})\big) \odot \mathbf{x}$$

This introduces crosses via a matrix decomposition: the mask $W_2\,\mathrm{ReLU}(W_1 \mathbf{x})$ plays the same role as $W\mathbf{x}$ in the Deep Cross Net, but with the weight matrix factored into two smaller matrices with a ReLU in between.
Finally, we also introduce LayerNorms on the output of this, which is a common technique to stabilize training and improve generalization.
These parallel heads are then concatenated together and passed through a final neural network to generate the final prediction (or in this case, a collection of multi-task neural network heads produce a set of predictions).
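Putting the pieces together, here is a sketch of a parallel MaskNet along the lines described above (layer sizes and names are illustrative, not the exact Twitter implementation):

```python
import torch
import torch.nn as nn

class MaskBlock(nn.Module):
    """One MaskNet head: an instance-guided mask (a factorized weight matrix with
    a ReLU in between), applied element-wise to the input, then projected and
    passed through a LayerNorm."""

    def __init__(self, dim: int, mask_hidden: int, out_dim: int):
        super().__init__()
        self.mask = nn.Sequential(              # W2 ReLU(W1 x) -- the "mask"
            nn.Linear(dim, mask_hidden), nn.ReLU(),
            nn.Linear(mask_hidden, dim),
        )
        self.proj = nn.Linear(dim, out_dim)
        self.norm = nn.LayerNorm(out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.norm(self.proj(self.mask(x) * x)))

class ParallelMaskNet(nn.Module):
    """Several MaskBlocks run in parallel on the same input; their outputs are
    concatenated and fed to a final MLP (or to multi-task heads)."""

    def __init__(self, dim: int, n_heads: int = 4, out_dim: int = 64):
        super().__init__()
        self.heads = nn.ModuleList([MaskBlock(dim, dim // 2, out_dim) for _ in range(n_heads)])
        self.top = nn.Sequential(nn.Linear(n_heads * out_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.top(torch.cat([head(x) for head in self.heads], dim=1))

model = ParallelMaskNet(dim=128)
logits = model(torch.randn(8, 128))   # shape (8, 1)
```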
To see the code for this, check out the MaskNet implementation in the Twitter repo (note that `net == mask_input`).
- Factorization Machines: https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf
- DeepFM: https://arxiv.org/pdf/1703.04247.pdf
- Wide & Deep: https://arxiv.org/pdf/1606.07792.pdf
- DLRM: https://arxiv.org/pdf/1906.00091.pdf
- Deep Cross Net v2: https://arxiv.org/abs/2008.13535
- Google blog about Feature Crosses: https://www.tensorflow.org/recommenders/examples/dcn
- MaskNet: https://arxiv.org/abs/2102.07619
- Neural Word Embeddings as Implicit Matrix Factorization: https://papers.nips.cc/paper/2014/file/feab05aa91085b7a8012516bc3533958-Paper.pdf