FastKnn is a library for building a fast kNN index and getting query results as a pandas dataframe.
FastKnn mainly uses nmslib as its (fast) kNN backend.
pip install git+https://github.com/Fanchouille/fastknn.git
FastKnn builds a kNN index with the specified index_method (default: hnsw) and index_space (default: cosinesimil).
This code has been tested with the hnsw method, using the cosinesimil / l2 spaces for dense data and the cosinesimil_sparse / cosinesimil_sparse_fast spaces for sparse data.
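To make the two dense spaces concrete, here is a minimal sketch of what they measure, assuming cosinesimil denotes the cosine distance 1 - cos(u, v) and l2 the Euclidean distance. It uses an exact brute-force search in plain Python, not the approximate hnsw index that FastKnn builds through nmslib, and the function names are made up for this illustration.

```python
import math


def cosine_distance(u, v):
    """Cosine distance 1 - cos(u, v); assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def l2_distance(u, v):
    """Euclidean (L2) distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def brute_force_knn(data, query, k, distance):
    """Return the k nearest rows of data as (row_index, distance), nearest first."""
    scored = [(i, distance(row, query)) for i, row in enumerate(data)]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Toy data: plain lists stand in for the rows of a dense matrix.
data = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [1.0, 1.0]]
query = [3.0, 0.0]

cosine_neighbours = brute_force_knn(data, query, 2, cosine_distance)
l2_neighbours = brute_force_knn(data, query, 2, l2_distance)
```

With the cosine distance only the direction of a vector matters, so rows 0 and 2 are equally close to the query. With l2 the magnitude matters too, so row 2 comes first.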
Example with dense data:
from fastknn import FastKnn
# Create index...
fastknn = FastKnn(data, id_dict)
# Save index
fastknn.save("test_fastknn")
# ...or load if exists
fastknn = FastKnn(fastknn_folder="test_fastknn")
# Choose sample vectors
query = data[:3, :]
# Query index & get results as df
results_df = fastknn.query_as_df(query, k=10, same_ids=True, remove_identity=True)
- Where data is an m x n numpy array and id_dict is a Python dictionary mapping each integer index (0 to m-1) to its real id. fastknn.datautils provides methods to get data and id_dict easily from pandas dataframes.
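As an illustration of how data and id_dict relate to each other, the sketch below builds both by hand from a small list of records. It does not use fastknn.datautils, whose functions are not described here. The record layout and the item_id / features names are hypothetical, and plain lists stand in for the numpy array (np.asarray(data) would convert them).

```python
# Hypothetical input: one record per item, with a real id and a feature vector.
records = [
    {"item_id": "sku-17", "features": [0.1, 0.9]},
    {"item_id": "sku-42", "features": [0.8, 0.2]},
    {"item_id": "sku-99", "features": [0.5, 0.5]},
]

# Row i of data holds the feature vector of the i-th record (m rows, n columns).
data = [record["features"] for record in records]

# id_dict maps each row index (0 to m-1) back to the real id of that row.
id_dict = {row: record["item_id"] for row, record in enumerate(records)}
```

The important invariant is that the keys of id_dict are exactly the row positions of data, so a neighbour returned as row 1 can be reported as "sku-42".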
- To use FastKnn in supervised mode, provide a target parameter: a Python dictionary containing the labels (classes or a quantitative target) related to data (default: None, i.e. unsupervised mode).
- Other important parameters: data_type (default: dense) and dist_type (default: float). See main.py for examples.
- Once instantiated, the save method writes the following files:
  - mappings from integer index to real ids as a json file
  - index parameters as a json file
  - index as a bin file
  - target dictionary as a json file
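One detail worth knowing when a mapping with integer keys is stored as JSON: object keys in JSON are always strings, so the keys have to be converted back to integers after loading. The sketch below shows that round trip with the standard json module. The file name is hypothetical, and whether FastKnn performs this conversion internally is not stated here.

```python
import json
import os
import tempfile

id_dict = {0: "sku-17", 1: "sku-42", 2: "sku-99"}

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "id_dict.json")  # hypothetical file name
    with open(path, "w") as handle:
        json.dump(id_dict, handle)
    with open(path) as handle:
        raw = json.load(handle)

# The integer keys come back as the strings "0", "1", "2".
# Convert them so that lookups by row index work again.
restored = {int(key): value for key, value in raw.items()}
```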
- Get a saved FastKnn back by specifying fastknn_folder.
- Query a FastKnn object with the provided query_as_df method, which takes the following parameters:
  - query - p x n numpy array - matrix to be matched to data
  - k - integer - the number of nearest neighbours (default: 10)
  - query_index - list of integers - index of the data provided in query (default: None - takes the row index as index)
  - nn_column - string - name of the resulting column containing the nearest neighbours (default: nearest_neighbours)
  - distance_column - string - name of the resulting column containing the distances to the nearest neighbours (default: distances)
  - same_ids - bool - when querying the same data that was indexed, gets index + real ids (default: False)
  - remove_identity - bool - when querying the same data that was indexed, gets the k nearest neighbours without the perfect identity match (default: False)
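The effect of remove_identity can be sketched with an exact search: when the query rows are the indexed rows themselves, each row is its own nearest neighbour at distance 0, and dropping that self-match leaves the k closest other rows. The function below is a hypothetical stand-in written for this illustration (it takes a row index rather than a query matrix and uses Euclidean distance). It is not FastKnn's implementation.

```python
import math


def nearest_neighbours(data, query_row, k, remove_identity=False):
    """Exact k nearest rows of data to data[query_row], by Euclidean distance."""
    scored = sorted(
        (math.dist(row, data[query_row]), i) for i, row in enumerate(data)
    )
    if remove_identity:
        # Drop the query row itself before keeping the k closest rows.
        scored = [(d, i) for d, i in scored if i != query_row]
    return [i for _, i in scored[:k]]


data = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [5.0, 5.0]]

with_self = nearest_neighbours(data, query_row=0, k=2)
without_self = nearest_neighbours(data, query_row=0, k=2, remove_identity=True)
```

Without the flag, row 0 appears in its own result list. With it, the list still has k entries but starts at the closest distinct row.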
- Get predictions from a FastKnn object with the provided prediction_as_df method, which takes the following parameters:
  - query - p x n numpy array - matrix to be matched to data
  - k - integer - the number of nearest neighbours (default: 10)
  - query_index - list of integers - index of the data provided in query (default: None - takes the row index as index)
  - same_ids - bool - when querying the same data that was indexed, gets index + real ids (default: False)
  - remove_identity - bool - when querying the same data that was indexed, gets the k nearest neighbours without the perfect identity match (default: False)
  - prediction_type - string - classification (majority voting on the k nearest neighbours) or regression (mean on the k nearest neighbours) (default: classification)
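The two prediction types can be sketched as follows, given the neighbours already found for one query row. This is a minimal illustration and not FastKnn's code: the predict function is hypothetical, the target dictionaries are keyed by row index as an assumption, and ties in the vote are resolved in favour of the label seen first, which FastKnn may handle differently.

```python
from collections import Counter
from statistics import mean


def predict(neighbour_ids, target, prediction_type="classification"):
    """Combine the targets of the k nearest neighbours into one prediction."""
    values = [target[i] for i in neighbour_ids]
    if prediction_type == "classification":
        # Majority voting: the most frequent label among the neighbours.
        return Counter(values).most_common(1)[0][0]
    if prediction_type == "regression":
        # Mean of the neighbours' target values.
        return mean(values)
    raise ValueError(f"unknown prediction_type: {prediction_type!r}")


class_target = {0: "cat", 1: "dog", 2: "cat", 3: "bird"}
price_target = {0: 10.0, 1: 20.0, 2: 30.0, 3: 40.0}

neighbours = [0, 1, 2]  # row indices of the k = 3 nearest neighbours
predicted_class = predict(neighbours, class_target)
predicted_price = predict(neighbours, price_target, "regression")
```

Here two of the three neighbours are labelled "cat", so the vote returns "cat", and the regression output is the average of 10.0, 20.0 and 30.0.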
Clone the project.
Install the local Anaconda environment as below:
./install.sh
Activate the local Anaconda environment as below:
conda activate ${PWD}/.conda