@@ -41,17 +41,37 @@ the covariance matrix of the input data. This is a simple baseline method.
 
 .. [1] On the Generalized Distance in Statistics, P. C. Mahalanobis, 1936
 
+.. _lmnn:
+
 LMNN
 -----
 
-Large-margin nearest neighbor metric learning.
+Large Margin Nearest Neighbor Metric Learning
+(:py:class:`LMNN <metric_learn.lmnn.LMNN>`)
 
-`LMNN` learns a Mahanalobis distance metric in the kNN classification
-setting using semidefinite programming. The learned metric attempts to keep
-k-nearest neighbors in the same class, while keeping examples from different
-classes separated by a large margin. This algorithm makes no assumptions about
+`LMNN` learns a Mahalanobis distance metric in the kNN classification
+setting. The learned metric attempts to keep each point's k-nearest
+neighbors from the same class close, while keeping examples from different
+classes separated by a large margin. This algorithm makes no assumptions about
 the distribution of the data.
 
+The distance is learned by solving the following optimization problem:
+
+.. math::
+
+      \min_\mathbf{L}\sum_{i, j}\eta_{ij}||\mathbf{L(x_i-x_j)}||^2 +
+      c\sum_{i, j, l}\eta_{ij}(1-y_{il})[1+||\mathbf{L(x_i-x_j)}||^2-||
+      \mathbf{L(x_i-x_l)}||^2]_+
+
+where :math:`\mathbf{x}_i` is a data point, :math:`\mathbf{x}_j` is one
+of its k-nearest neighbors sharing the same label, and :math:`\mathbf{x}_l`
+ranges over the other instances in that neighborhood carrying different
+labels. The indicators :math:`\eta_{ij}, y_{il} \in \{0, 1\}` are defined
+as follows: :math:`\eta_{ij}=1` means that :math:`\mathbf{x}_{j}` is one of
+the k-nearest neighbors (with the same label) of :math:`\mathbf{x}_{i}`,
+while :math:`y_{il}=0` indicates that :math:`\mathbf{x}_{i}` and
+:math:`\mathbf{x}_{l}` belong to different classes. Finally,
+:math:`[\cdot]_+=\max(0, \cdot)` is the hinge loss.
+
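+For intuition, the objective can be written out directly in NumPy for a
+fixed matrix :math:`\mathbf{L}` (a minimal sketch only, not the solver used
+by :py:class:`LMNN <metric_learn.lmnn.LMNN>`; the ``targets`` lists of
+same-class neighbor indices and the weight ``c`` are assumptions of the
+example):
+
+::
+
+    import numpy as np
+
+    def lmnn_objective(L, X, y, targets, c=1.0):
+        """Evaluate the LMNN loss above for a given linear map L."""
+        LX = X @ L.T                                  # project all points with L
+        pull, push = 0.0, 0.0
+        for i, neighbors in enumerate(targets):       # eta_ij = 1 exactly for j in targets[i]
+            for j in neighbors:
+                d_ij = np.sum((LX[i] - LX[j]) ** 2)
+                pull += d_ij                          # first term: pull target neighbors close
+                for l in range(len(X)):
+                    if y[l] != y[i]:                  # (1 - y_il) = 1 only for different labels
+                        d_il = np.sum((LX[i] - LX[l]) ** 2)
+                        push += max(0.0, 1.0 + d_ij - d_il)   # hinge on margin violations
+        return pull + c * push
+
+    X = np.array([[0., 0.], [0., 1.], [5., 0.], [5., 1.]])
+    y = np.array([0, 0, 1, 1])
+    targets = [[1], [0], [3], [2]]        # each point's single same-class target neighbor
+    print(lmnn_objective(np.eye(2), X, y, targets))   # loss of the plain Euclidean metric
+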
 .. topic:: Example Code:
 
 ::
@@ -80,16 +100,44 @@ The two implementations differ slightly, and the C++ version is more complete.
        -margin-nearest-neighbor-classification>`_ Kilian Q. Weinberger, John
        Blitzer, Lawrence K. Saul
 
+.. _nca:
+
 NCA
 ---
 
-Neighborhood Components Analysis (`NCA`) is a distance metric learning
-algorithm which aims to improve the accuracy of nearest neighbors
-classification compared to the standard Euclidean distance. The algorithm
-directly maximizes a stochastic variant of the leave-one-out k-nearest
-neighbors (KNN) score on the training set. It can also learn a low-dimensional
-linear embedding of data that can be used for data visualization and fast
-classification.
+Neighborhood Components Analysis (:py:class:`NCA <metric_learn.nca.NCA>`)
+
+`NCA` is a distance metric learning algorithm which aims to improve the
+accuracy of nearest neighbors classification compared to the standard
+Euclidean distance. The algorithm directly maximizes a stochastic variant
+of the leave-one-out k-nearest neighbors (KNN) score on the training set.
+It can also learn a low-dimensional linear transformation of data that can
+be used for data visualization and fast classification.
+
+NCA uses the decomposition :math:`\mathbf{M} = \mathbf{L}^T\mathbf{L}` and
+defines the probability :math:`p_{ij}` that :math:`\mathbf{x}_i` picks
+:math:`\mathbf{x}_j` as its neighbor, obtained as a softmax over the
+Mahalanobis distances:
+
+.. math::
+
+      p_{ij} = \frac{\exp(-||\mathbf{Lx}_i - \mathbf{Lx}_j||_2^2)}
+      {\sum_{l\neq i}\exp(-||\mathbf{Lx}_i - \mathbf{Lx}_l||_2^2)},
+      \qquad p_{ii}=0
+
+Then the probability that :math:`\mathbf{x}_i` will be correctly classified
+by the stochastic nearest neighbors rule is:
+
+.. math::
+
+      p_{i} = \sum_{j:j\neq i, y_j=y_i}p_{ij}
+
+The optimization problem is to find the matrix :math:`\mathbf{L}` that
+maximizes the sum of the probabilities of correct classification:
+
+.. math::
+
+      \mathbf{L} = \arg\max_\mathbf{L}\sum_i p_i
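+
+This score can be evaluated directly for a fixed :math:`\mathbf{L}` (a
+minimal NumPy sketch, for illustration only; the
+:py:class:`NCA <metric_learn.nca.NCA>` estimator optimizes it internally):
+
+::
+
+    import numpy as np
+
+    def nca_score(L, X, y):
+        """Sum over i of p_i, the stochastic leave-one-out kNN score under L."""
+        LX = X @ L.T
+        # pairwise squared distances between the embedded points
+        d = np.sum((LX[:, None, :] - LX[None, :, :]) ** 2, axis=-1)
+        np.fill_diagonal(d, np.inf)            # enforces p_ii = 0
+        p = np.exp(-d)
+        p /= p.sum(axis=1, keepdims=True)      # softmax over each row
+        same_label = (y[:, None] == y[None, :])
+        return np.sum(p * same_label)          # sum_i sum_{j : y_j = y_i} p_ij
+
+    X = np.array([[0., 0.], [0., 1.], [5., 0.], [5., 1.]])
+    y = np.array([0, 0, 1, 1])
+    print(nca_score(np.eye(2), X, y))          # score of the plain Euclidean metric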
 
 .. topic:: Example Code:
 
@@ -116,16 +164,55 @@ classification.
 .. [2] Wikipedia entry on Neighborhood Components Analysis
        https://en.wikipedia.org/wiki/Neighbourhood_components_analysis
 
+.. _lfda:
+
 LFDA
 ----
 
-Local Fisher Discriminant Analysis (LFDA)
+Local Fisher Discriminant Analysis (:py:class:`LFDA <metric_learn.lfda.LFDA>`)
 
 `LFDA` is a linear supervised dimensionality reduction method. It is
-particularly useful when dealing with multimodality, where one ore more classes
+particularly useful when dealing with multi-modality, where one or more classes
 consist of separate clusters in input space. The core optimization problem of
 LFDA is solved as a generalized eigenvalue problem.
 
+
+The algorithm defines the Fisher local within-class and between-class
+scatter matrices :math:`\mathbf{S}^{(w)}` and :math:`\mathbf{S}^{(b)}`
+in a pairwise fashion:
+
+.. math::
+
+      \mathbf{S}^{(w)} = \frac{1}{2}\sum_{i,j=1}^nW_{ij}^{(w)}(\mathbf{x}_i -
+      \mathbf{x}_j)(\mathbf{x}_i - \mathbf{x}_j)^T,\\
+      \mathbf{S}^{(b)} = \frac{1}{2}\sum_{i,j=1}^nW_{ij}^{(b)}(\mathbf{x}_i -
+      \mathbf{x}_j)(\mathbf{x}_i - \mathbf{x}_j)^T,\\
+
+where
+
+.. math::
+
+      W_{ij}^{(w)} = \begin{cases}0 & y_i\neq y_j \\
+      \mathbf{A}_{i,j}/n_l & y_i = y_j\end{cases}\\
+      W_{ij}^{(b)} = \begin{cases}1/n & y_i\neq y_j \\
+      \mathbf{A}_{i,j}(1/n-1/n_l) & y_i = y_j\end{cases}
+
+where :math:`\mathbf{A}_{i,j}` is the :math:`(i,j)`-th entry of the affinity
+matrix :math:`\mathbf{A}`, which can be computed with local scaling methods,
+and :math:`n_l` is the number of samples in class :math:`l`, the label shared
+by :math:`\mathbf{x}_i` and :math:`\mathbf{x}_j` when :math:`y_i = y_j`.
+
+The learning problem then reduces to deriving the LFDA transformation
+matrix :math:`\mathbf{T}_{LFDA}`:
+
+.. math::
+
+      \mathbf{T}_{LFDA} = \arg\max_\mathbf{T}
+      [\text{tr}((\mathbf{T}^T\mathbf{S}^{(w)}
+      \mathbf{T})^{-1}\mathbf{T}^T\mathbf{S}^{(b)}\mathbf{T})]
+
+That is, LFDA looks for a transformation matrix :math:`\mathbf{T}` such that
+nearby data pairs from the same class are brought close together and data
+pairs from different classes are pushed far apart, while same-class pairs
+that are already far apart are not forced to be close.
+
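+The construction above can be reproduced with a few lines of NumPy and SciPy
+(a rough sketch, not the library's implementation; the Gaussian affinity used
+here, without local scaling, is an assumption of the example):
+
+::
+
+    import numpy as np
+    from scipy.linalg import eigh
+
+    def lfda_transform(X, y, A, dim):
+        """Build T_LFDA from a precomputed affinity matrix A."""
+        n = len(X)
+        n_l = np.array([np.sum(y == label) for label in y])   # class size for each sample
+        same = (y[:, None] == y[None, :])
+        W_w = np.where(same, A / n_l[:, None], 0.0)
+        W_b = np.where(same, A * (1.0 / n - 1.0 / n_l[:, None]), 1.0 / n)
+
+        def scatter(W):
+            # S = 1/2 * sum_ij W_ij (x_i - x_j)(x_i - x_j)^T
+            S = np.zeros((X.shape[1], X.shape[1]))
+            for i in range(n):
+                for j in range(n):
+                    diff = (X[i] - X[j])[:, None]
+                    S += 0.5 * W[i, j] * (diff @ diff.T)
+            return S
+
+        # maximizing tr((T^T S_w T)^-1 T^T S_b T) -> generalized eigenvectors of (S_b, S_w)
+        evals, evecs = eigh(scatter(W_b), scatter(W_w))
+        return evecs[:, np.argsort(evals)[::-1][:dim]]
+
+    rng = np.random.RandomState(0)
+    X = np.vstack([rng.randn(10, 3), rng.randn(10, 3) + 2.0])
+    y = np.array([0] * 10 + [1] * 10)
+    A = np.exp(-np.sum((X[:, None] - X[None, :]) ** 2, axis=-1))   # naive affinity
+    print(lfda_transform(X, y, A, dim=1).shape)                    # (3, 1)
+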
 .. topic:: Example Code:
 
 ::
@@ -151,17 +238,50 @@ LFDA is solved as a generalized eigenvalue problem.
        <https://gastrograph.com/resources/whitepapers/local-fisher
        -discriminant-analysis-on-beer-style-clustering.html#>`_ Yuan Tang.
 
+.. _mlkr:
 
 MLKR
 ----
 
-Metric Learning for Kernel Regression.
+Metric Learning for Kernel Regression (:py:class:`MLKR <metric_learn.mlkr.MLKR>`)
 
 `MLKR` is an algorithm for supervised metric learning, which learns a
-distance function by directly minimising the leave-one-out regression error.
+distance function by directly minimizing the leave-one-out regression error.
 This algorithm can also be viewed as a supervised variation of PCA and can be
 used for dimensionality reduction and high dimensional data visualization.
 
+Theoretically, `MLKR` can be applied with many types of kernel functions and
+distance metrics. The exposition below focuses on the particular instance of
+a Gaussian kernel and a Mahalanobis metric, the combination used in the
+original paper. The Gaussian kernel is denoted as:
+
+.. math::
+
+      k_{ij} = \frac{1}{\sqrt{2\pi}\sigma}\exp(-\frac{d(\mathbf{x}_i,
+      \mathbf{x}_j)}{\sigma^2})
+
+where :math:`d(\cdot, \cdot)` is the squared distance under some metric. For
+a Mahalanobis metric it is :math:`d(\mathbf{x}_i,
+\mathbf{x}_j) = ||\mathbf{A}(\mathbf{x}_i - \mathbf{x}_j)||^2`, where the
+transformation matrix :math:`\mathbf{A}` comes from the decomposition of the
+Mahalanobis matrix :math:`\mathbf{M=A^TA}`.
+
+Since :math:`\sigma^2` can be absorbed into :math:`d(\cdot)`, we set
+:math:`\sigma^2 = 1` for simplicity. The loss function is the cumulative
+leave-one-out quadratic regression error over the training samples:
+
+.. math::
+
+      \mathcal{L} = \sum_i(y_i - \hat{y}_i)^2
+
+where the prediction :math:`\hat{y}_i` is obtained by kernel regression, i.e.
+as a weighted average over the targets of all the other training samples:
+
+.. math::
+
+      \hat{y}_i = \frac{\sum_{j\neq i}y_jk_{ij}}{\sum_{j\neq i}k_{ij}}
+
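+The loss above can be computed directly for a fixed transformation
+:math:`\mathbf{A}` (a minimal NumPy sketch, for illustration only; the
+:py:class:`MLKR <metric_learn.mlkr.MLKR>` estimator minimizes it internally,
+and the constant kernel normalization is dropped since it cancels in the
+weighted average):
+
+::
+
+    import numpy as np
+
+    def mlkr_loss(A, X, y):
+        """Leave-one-out quadratic regression error for a given transformation A."""
+        AX = X @ A.T
+        d = np.sum((AX[:, None, :] - AX[None, :, :]) ** 2, axis=-1)  # squared distances
+        K = np.exp(-d)                       # Gaussian kernel with sigma^2 = 1
+        np.fill_diagonal(K, 0.0)             # leave-one-out: exclude each sample itself
+        y_hat = (K @ y) / K.sum(axis=1)      # kernel-regression prediction for each sample
+        return np.sum((y - y_hat) ** 2)
+
+    rng = np.random.RandomState(0)
+    X = rng.randn(30, 4)
+    y = X @ np.array([1.0, -2.0, 0.0, 0.5]) + 0.1 * rng.randn(30)
+    print(mlkr_loss(np.eye(4), X, y))        # error of the untrained (Euclidean) metric
+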
 .. topic:: Example Code:
 
 ::
@@ -193,7 +313,6 @@ generated from the labels information and passed to the underlying algorithm.
 .. todo:: add more details about that (see issue `<https://github
    .com/metric-learn/metric-learn/issues/135>`_)
 
-
 .. topic:: Example Code:
 
 ::