Metric Learning Survey

Metric learning attempts to map data to an embedding space, where similar data are close together and dissimilar data are far apart. In general, this can be achieved by means of embedding losses and classification losses. Embedding losses operate on the relationships between samples in a batch, while classification losses include a weight matrix that transforms the embedding space into a vector of class logits.

Embedding losses: Pair and triplet losses provide the foundation for two fundamental approaches to metric learning. A classic pair-based method is the contrastive loss, which attempts to make the distance between positive pairs, $d_p$, smaller than some threshold $m_{pos}$, and the distance between negative pairs, $d_n$, larger than some threshold $m_{neg}$:

$$L_{contrastive} = [d_p - m_{pos}]_+ + [m_{neg} - d_n]_+$$

where $[\cdot]_+ = \max(\cdot, 0)$.
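As a quick illustration, here is a minimal PyTorch-style sketch of the contrastive loss above. The function name, the default margin values, and the assumption that pairwise distances have already been computed are my own choices, not part of any particular implementation:

```python
import torch

def contrastive_loss(d_pos, d_neg, m_pos=0.0, m_neg=0.5):
    # d_pos: distances between positive pairs; d_neg: distances between negative pairs.
    # [x]_+ is the hinge max(x, 0): penalize positives farther than m_pos
    # and negatives closer than m_neg.
    pos_term = torch.clamp(d_pos - m_pos, min=0.0)
    neg_term = torch.clamp(m_neg - d_neg, min=0.0)
    return pos_term.mean() + neg_term.mean()
```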

Classification losses are based on the inclusion of a weight matrix, where each column corresponds to a particular class. In most cases, training consists of multiplying the weight matrix with embedding vectors to obtain logits, and then applying a loss function to the logits. The most straightforward case is the normalized softmax loss, which is identical to cross entropy, but with the columns of the weight matrix L2-normalized. ProxyNCA is a variation of this, where cross entropy is applied to the Euclidean distances, rather than the cosine similarities, between embeddings and the columns of the weight matrix. A number of face verification losses have modified the cross entropy loss with angular margins in the softmax expression. Specifically, SphereFace, CosFace, and ArcFace apply multiplicative-angular, additive-cosine, and additive-angular margins, respectively. (It is interesting to note that metric learning papers have consistently left face verification losses out of their experiments, even though there is nothing face-specific about them.) The SoftTriple loss takes a different approach, expanding the weight matrix to have multiple columns per class, which theoretically provides more flexibility for modeling class variances.
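To make the weight-matrix view concrete, here is a rough PyTorch sketch of a normalized softmax loss. The temperature value, the random weight initialization, and normalizing the embeddings as well as the weights are assumptions on my part, not prescriptions from any specific paper:

```python
import torch
import torch.nn.functional as F

class NormalizedSoftmaxLoss(torch.nn.Module):
    def __init__(self, num_classes, embedding_size, temperature=0.05):
        super().__init__()
        # One class weight vector per class (stored as rows here).
        self.weight = torch.nn.Parameter(torch.randn(num_classes, embedding_size))
        self.temperature = temperature

    def forward(self, embeddings, labels):
        # L2-normalize the class weights (and, commonly, the embeddings too).
        w = F.normalize(self.weight, dim=1)
        e = F.normalize(embeddings, dim=1)
        # Logits are cosine similarities scaled by a temperature,
        # fed into ordinary cross entropy.
        logits = e @ w.t() / self.temperature
        return F.cross_entropy(logits, labels)
```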

Pair and triplet mining: Mining is the process of finding the best pairs or triplets to train on. There are two broad approaches to mining: offline and online. Offline mining is performed before batch construction, so that each batch is made to contain the most informative samples. This might be accomplished by storing lists of hard negatives, or by doing a nearest-neighbors search before each epoch or each iteration. In contrast, online mining finds hard pairs or triplets within each randomly sampled batch. Using all possible pairs or triplets is an alternative, but this has two weaknesses: practically, it can consume a lot of memory, and theoretically, it tends to include a large number of easy negatives and positives, causing performance to plateau quickly. Thus, one intuitive strategy is to select only the most difficult positive and negative samples, but this has been found to produce noisy gradients and convergence to bad local optima [65]. A possible remedy is semihard negative mining, which finds the negative samples in a batch that are close to the anchor, but still further away than the corresponding positive samples. On the other hand, Wu et al. found that semihard mining makes little progress as the number of semihard negatives drops. They claim that distance-weighted sampling results in a variety of negatives (easy, semihard, and hard) and improved performance. Online mining can also be integrated into the structure of models. Specifically, the hard-aware deeply cascaded method uses models of varying complexity, in which the loss for the complex models only considers the pairs that the simpler models find difficult. Recently, Wang et al. proposed a simple pair mining strategy, where negatives are chosen if they are closer to an anchor than its hardest positive, and positives are chosen if they are further from an anchor than its hardest negative.
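Going back to semihard mining for concreteness, below is a simplified per-anchor sketch: it keeps negatives that are farther from the anchor than its hardest positive, but by less than a margin. Real miners typically operate per (anchor, positive) pair, and the margin value here is an arbitrary placeholder:

```python
import torch

def semihard_negative_mask(dist_mat, labels, margin=0.2):
    # dist_mat: (B, B) pairwise distances within the batch; labels: (B,) class labels.
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    eye = torch.eye(len(labels), dtype=torch.bool, device=labels.device)
    pos_mask, neg_mask = same & ~eye, ~same

    # Distance to each anchor's hardest (farthest) positive; -inf if it has none.
    neg_inf = torch.full_like(dist_mat, float("-inf"))
    d_ap = torch.where(pos_mask, dist_mat, neg_inf).max(dim=1, keepdim=True).values

    # Semihard negatives: farther than that positive, but by less than the margin.
    return neg_mask & (dist_mat > d_ap) & (dist_mat < d_ap + margin)
```

The resulting boolean mask can then be used to pick which (anchor, negative) pairs contribute to a pair- or triplet-based loss on the batch.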

Advanced training methods: To obtain higher accuracy, many recent papers have gone beyond loss functions and mining techniques. For example, several recent methods incorporate generator networks into their training procedure. Lin et al. use a generator as part of their framework for modeling class centers and intraclass variance. Duan et al. use a hard-negative generator to expose the model to difficult negatives that might be absent from the training set. Zheng et al. follow up on this work with an adaptive interpolation method that creates negatives of varying difficulty, based on the strength of the model. Other sophisticated training methods include HTL, ABE, MIC, and DCES. HTL constructs a hierarchical class tree at regular intervals during training to estimate the optimal per-class margin in the triplet margin loss. ABE is an attention-based ensemble, where each model learns a different set of attention masks. MIC uses a combination of clustering and encoder networks to disentangle class-specific properties from shared characteristics like color and pose. DCES uses a divide-and-conquer approach, partitioning the embedding space and training an embedding layer for each partition separately.

Below is my recent presentation on several of the latest metric learning methods. Before diving into it, please refer to these links for some background on deep metric learning:

https://github.com/kdhht2334/Survey_of_Deep_Metric_Learning#dvm

https://hav4ik.github.io/articles/deep-metric-learning-survey
