Guidelines to choose an index
Choosing an index is not obvious, so here are a few essential questions that can help guide the choice. They are mainly applicable for L2 distances. We indicate:
- the index_factory string for each of them.
- if there are parameters, we indicate them as the corresponding ParameterSpace argument.
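As a reference point, here is a minimal sketch of how these two notions surface in the Python API (the "IVF1024,Flat" string and the nprobe value are only placeholders, not a recommendation):

```python
import faiss

d = 64                                            # dimension of the vectors
index = faiss.index_factory(d, "IVF1024,Flat")    # built from an index_factory string

# runtime parameters (here nprobe, the ParameterSpace argument mentioned above)
# can be set through a ParameterSpace object
ps = faiss.ParameterSpace()
ps.set_index_parameter(index, "nprobe", 16)
```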
The only index that can guarantee exact results is the IndexFlatL2. It provides the baseline results for the other indexes. It does not compress the vectors, but it does not add any overhead on top of them either. It does not support adding with ids (add_with_ids), only sequential adds, so if you need add_with_ids, use "IDMap,Flat". The flat index does not require training and does not have parameters.
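For illustration, a minimal sketch of the exact baseline and of the IDMap wrapper (dimensions and data are placeholders):

```python
import numpy as np
import faiss

d = 64
xb = np.random.random((10000, d)).astype('float32')   # database vectors
xq = np.random.random((5, d)).astype('float32')       # query vectors

# exact baseline: no training, no compression, no parameters
index = faiss.IndexFlatL2(d)
index.add(xb)                      # sequential adds only
D, I = index.search(xq, 10)        # exact L2 search, 10 nearest neighbors

# if add_with_ids is needed, wrap the flat index in an IDMap
index_idmap = faiss.index_factory(d, "IDMap,Flat")
ids = np.arange(1000, 1000 + xb.shape[0]).astype('int64')
index_idmap.add_with_ids(xb, ids)
```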
Keep in mind that all Faiss indexes are stored in RAM. The following considers that if exact results are not required, RAM is the limiting factor, and that within memory constraints we optimize the precision-speed tradeoff.
If you have a lot of RAM or the dataset is small, HNSW is the best option: it is a very fast and accurate index. The parameter x (4 <= x <= 64) is the number of links per vector; higher is more accurate but uses more RAM. The speed-accuracy tradeoff is set via the efSearch parameter. The memory usage is (d * 4 + x * 2 * 4) bytes per vector.
HNSW only supports sequential adds (not add_with_ids), so here again, prefix with IDMap if needed. HNSW does not require training and does not support removing vectors from the index.
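A minimal sketch with x = 32, assuming the standard "HNSW32" index_factory string (the value 32 and efSearch setting are illustrative):

```python
import numpy as np
import faiss

d = 64
xb = np.random.random((10000, d)).astype('float32')
xq = np.random.random((5, d)).astype('float32')

# 32 links per vector (the x above); no training phase is needed
index = faiss.index_factory(d, "HNSW32")
index.add(xb)                      # sequential adds only, no add_with_ids

index.hnsw.efSearch = 64           # speed/accuracy tradeoff at search time
D, I = index.search(xq, 10)
```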
"..." means a clustering of the dataset has to be performed beforehand (read below). After clustering, "Flat" just organizes the vectors into buckets, so it does not compress them, the storage size is the same as that of the original dataset. The tradeoff between speed and accuracy is set via the nprobe parameter.
If storing the whole vectors is too expensive, this performs two operations:
- a PCA to dimension x to reduce the dimension
- a scalar quantization of each vector component into 1 byte.
Therefore the total storage is x bytes per vector.
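A sketch under the assumption of a 128-dimensional dataset reduced to x = 32 dimensions, with an IVF clustering standing in for the "..." part (all the sizes here are illustrative):

```python
import numpy as np
import faiss

d = 128
xb = np.random.random((100000, d)).astype('float32')

# PCAR32 reduces the vectors to 32 dimensions, SQ8 stores 1 byte per component,
# so each stored vector takes 32 bytes (plus inverted-list overhead)
index = faiss.index_factory(d, "PCAR32,IVF1024,SQ8")
index.train(xb)            # trains the PCA, the clustering and the scalar quantizer
index.add(xb)
index.nprobe = 16
```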
PQx compresses the vectors using a product quantizer that outputs x-byte codes. x is typically <= 64; for larger codes, SQ is usually as accurate and faster. OPQ is a linear transformation of the vectors to make them easier to compress. y is a dimension such that:
- y is a multiple of x (required)
- y <= d, with d the dimension of the input vectors (preferable)
- y <= 4*x (preferable)
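A sketch with x = 16 and y = 64, values chosen to satisfy the constraints above; the IVF part stands in for "..." and its size is illustrative:

```python
import numpy as np
import faiss

d = 96
xb = np.random.random((100000, d)).astype('float32')

# OPQ16_64: rotate the vectors and reduce them to y = 64 dimensions
# (a multiple of x = 16, <= d, <= 4*x); PQ16: 16-byte codes per vector
index = faiss.index_factory(d, "OPQ16_64,IVF1024,PQ16")
index.train(xb)
index.add(xb)
index.nprobe = 16
```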
This question is used to fill in the clustering options (the "..." above). The dataset is clustered into buckets and, at search time, only a fraction of the buckets (nprobe buckets) are visited. The clustering is performed on a representative sample of the dataset vectors. We indicate the optimal size for this sample.
The number of clusters x should be between 4*sqrt(N) and 16*sqrt(N), with N the size of the dataset. This just clusters the vectors with k-means. You will need between 30*x and 256*x vectors for training (the more the better).
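A sketch of this rule of thumb (the dataset size and the factor 4 are illustrative; any value in the ranges above is reasonable):

```python
import numpy as np
import faiss

d = 64
N = 500_000                                   # dataset size (below 1M vectors)
xb = np.random.random((N, d)).astype('float32')

# number of clusters between 4*sqrt(N) and 16*sqrt(N)
nlist = int(4 * np.sqrt(N))
index = faiss.index_factory(d, f"IVF{nlist},Flat")

# k-means needs roughly 30*nlist to 256*nlist training vectors
train_size = min(N, 64 * nlist)
index.train(xb[:train_size])
index.add(xb)
```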
(here x is a literal x in the index string, not a number)
The IMI (inverted multi-index) also performs k-means with 2^10 centroids on the training vectors, but it does so independently on the first half and the second half of each vector. This increases the number of clusters to 2^(2*10). You will need around 64 * 2^10 vectors for training.
For 2^12 centroids per half: same as above, replace 10 with 12.
For 2^14 centroids per half: same as above, replace 10 with 14.
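A sketch of the 2^10 case, assuming the "IMI2x10,Flat" index_factory string; the dataset size and nprobe value are illustrative:

```python
import numpy as np
import faiss

d = 64
xb = np.random.random((1_000_000, d)).astype('float32')

# inverted multi-index with 2 * 2^10 centroids, i.e. 2^(2*10) buckets
index = faiss.index_factory(d, "IMI2x10,Flat")

index.train(xb[:64 * 2**10])   # around 64 * 2^10 training vectors
index.add(xb)

index.nprobe = 256             # a relatively large nprobe is needed for good recall
D, I = index.search(xb[:5], 10)
```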