Skip to content

Conversation

@toto6
Copy link
Contributor

@toto6 toto6 commented Nov 24, 2017

Fix mistake in LMNN
Issue in function _find_impostors:

  • the squared euclidean distance is used to compute the margins in variable "margin_radii"
  • the euclidean distance is used (through the function sklearn.metrics.pairwise.pairwise_distances) to compute distances between samples of different labels in variable "dist"
  • the issue is that the impostors are found by testing "dist < margin_radii" which is wrong because "dist" represent distances, and "margin_radii" represent squared distances.

I propose to solve this problem by computing always the squared distances.

Faster LMNN
The use of the function sklearn.metrics.pairwise_distances gives horrible performances.
Replace it by a faster function.

Below is a little script to show the computation time of LMNN

import time
from sklearn.datasets import load_breast_cancer
from metric_learn import lmnn
dataset = load_breast_cancer()
s = time.time()
l1 = lmnn.LMNN(k=3)
l1.fit(dataset["data"], dataset["target"])
e = time.time()
print(e - s)

Before this commit, this script executes in 32 seconds
After this commit, this script executes in 1 second

Issue in function _find_impostors:
 - the squared euclidean distance is used to compute the margins in variable "margin_radii"
 - the euclidean distance is used (through the function sklearn.metrics.pairwise.pairwise_distances) to compute distances between samples of different labels in variable "dist"
 - the issue is that the impostors are found by testing "dist < margin_radii" which is wrong because "dist" represent distances, and "margin_radii" represent squared distances.

I propose to solve this problem by computing always the squared distances.
The use of the function sklearn.metrics.pairwise_distances gives horrible performances.
Replace it by a faster function.

Below is a little script to show the computation time of LMNN

import time
from sklearn.datasets import load_breast_cancer
from metric_learn import lmnn
dataset = load_breast_cancer()
s = time.time()
l1 = lmnn.LMNN(k=3)
l1.fit(dataset["data"], dataset["target"])
e = time.time()
print(e - s)

Before this commit, this script executes in 32 seconds
After this commit, this script executes in 1 second
@perimosocordiae
Copy link
Contributor

Thanks for the contribution! I agree with your analysis of the bug, though I was surprised to hear that sklearn's pairwise_distances was so slow.

So I looked a little closer and found that the issue is actually metric='seuclidean'. Regular L2 distance uses essentially the same algorithm as your pairwiseEuclidean:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/metrics/pairwise.py#L246

But if you use the seuclidean metric, it ends up calling Scipy's cdist function instead, which then has to make a copy of both input arrays to make them C-contiguous:
https://github.com/scipy/scipy/blob/v1.0.0/scipy/spatial/distance.py#L2361

So I think the easiest solution is, instead of rolling our own distance function, importing euclidean_distances from sklearn and calling it with squared=True anywhere that we want squared L2 distances. I did some quick testing and found that this also prevents the very slow running time.

for label in self.labels_:
inds, = np.nonzero(self.label_inds_ == label)
dd = pairwise_distances(self.X_[inds])
dd = euclidean_distances(self.X_[inds], self.X_[inds], squared=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can leave out the second argument, similar to how pairwise_distances works.

@perimosocordiae perimosocordiae merged commit 4b889d4 into scikit-learn-contrib:master Nov 27, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants