LMNN: fix mistake and improve performances #78

toto6 · 2017-11-24T09:40:20Z

Fix mistake in LMNN
Issue in function _find_impostors:

- The squared Euclidean distance is used to compute the margins in the variable "margin_radii".
- The Euclidean distance is used (through the function sklearn.metrics.pairwise.pairwise_distances) to compute distances between samples with different labels in the variable "dist".
- Impostors are found by testing "dist < margin_radii", which is wrong because "dist" holds distances while "margin_radii" holds squared distances.

I propose to solve this problem by always computing squared distances.
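
To make the unit mismatch concrete, here is a toy sketch (not the code in _find_impostors). The three points and the "1 + squared distance" margin are assumptions chosen for illustration, following the usual LMNN margin definition. It shows a sample that the plain-distance test flags as an impostor even though it lies outside the squared-distance margin.

```python
import numpy as np

# Hypothetical 2-D points, chosen only to illustrate the unit mismatch.
x = np.array([0.0, 0.0])       # reference sample
target = np.array([0.5, 0.0])  # its furthest target neighbor (same label)
other = np.array([1.2, 0.0])   # a sample with a different label

# Margin in squared-distance units (assumed form: 1 + squared distance).
margin_radius = 1.0 + np.sum((x - target) ** 2)  # 1.25

sq_dist = np.sum((x - other) ** 2)  # squared distance, about 1.44
dist = np.sqrt(sq_dist)             # plain distance, about 1.2

# Mixed units: a plain distance compared against a squared margin.
buggy_is_impostor = bool(dist < margin_radius)

# Consistent units: squared distance compared against a squared margin.
fixed_is_impostor = bool(sq_dist < margin_radius)

print(buggy_is_impostor, fixed_is_impostor)
```

With these numbers the mixed-unit test reports an impostor and the consistent test does not, which is the kind of false positive the fix removes.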

Faster LMNN
Using sklearn.metrics.pairwise_distances here is very slow, so this PR replaces it with a faster function.

The short script below measures the running time of LMNN:

    import time

    from sklearn.datasets import load_breast_cancer
    from metric_learn import lmnn

    dataset = load_breast_cancer()

    # Time a single LMNN fit with k=3 target neighbors.
    s = time.time()
    l1 = lmnn.LMNN(k=3)
    l1.fit(dataset["data"], dataset["target"])
    e = time.time()
    print(e - s)  # elapsed time in seconds

Before this commit, the script runs in 32 seconds.
After this commit, it runs in 1 second.


perimosocordiae · 2017-11-24T23:21:08Z

Thanks for the contribution! I agree with your analysis of the bug, though I was surprised to hear that sklearn's pairwise_distances was so slow.

So I looked a little closer and found that the issue is actually metric='seuclidean'. Regular L2 distance uses essentially the same algorithm as your pairwiseEuclidean:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/metrics/pairwise.py#L246

But if you use the seuclidean metric, it ends up calling SciPy's cdist function instead, which then has to copy both input arrays to make them C-contiguous:
https://github.com/scipy/scipy/blob/v1.0.0/scipy/spatial/distance.py#L2361

So I think the easiest solution is to import euclidean_distances from sklearn, rather than rolling our own distance function, and call it with squared=True anywhere we want squared L2 distances. I did some quick testing and found that this also avoids the very slow running time.
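
As a rough sanity check of that suggestion, the sketch below compares euclidean_distances(..., squared=True) with a brute-force NumPy computation. The random arrays X and Y are placeholders, not the data used in LMNN.

```python
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

# Placeholder data: two small random sample sets with 5 features each.
rng = np.random.RandomState(0)
X = rng.rand(20, 5)
Y = rng.rand(15, 5)

# Squared L2 distances computed by sklearn.
sq = euclidean_distances(X, Y, squared=True)

# Brute-force reference: sum of squared coordinate differences.
brute = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)

# The squared variant should match the reference, and its square root
# should match the default (non-squared) output.
squared_matches = np.allclose(sq, brute)
sqrt_matches = np.allclose(np.sqrt(sq), euclidean_distances(X, Y))

print(sq.shape, squared_matches, sqrt_matches)
```

Working with the squared values directly also keeps them in the same units as the squared margins discussed in the PR description.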

perimosocordiae · 2017-11-26T15:57:16Z

metric_learn/lmnn.py

    for label in self.labels_:
      inds, = np.nonzero(self.label_inds_ == label)
-      dd = pairwise_distances(self.X_[inds])
+      dd = euclidean_distances(self.X_[inds], self.X_[inds], squared=True)


You can leave out the second argument, similar to how pairwise_distances works.
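
A quick check of that point, using a placeholder random array rather than self.X_: calling euclidean_distances with one array gives the same result as passing the array twice.

```python
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

# Placeholder data standing in for the samples of one label.
rng = np.random.RandomState(0)
X = rng.rand(10, 4)

# One-argument form: distances between all pairs of rows of X.
one_arg = euclidean_distances(X, squared=True)

# Two-argument form with the same array passed twice.
two_arg = euclidean_distances(X, X, squared=True)

same_result = np.allclose(one_arg, two_arg)
print(one_arg.shape, same_result)
```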

toto6 added 2 commits November 24, 2017 10:27

Use euclidean_distances from sklearn

27eac1f

perimosocordiae requested changes Nov 26, 2017


Remove no needed parameter

6f37865

perimosocordiae approved these changes Nov 27, 2017


perimosocordiae merged commit 4b889d4 into scikit-learn-contrib:master Nov 27, 2017
