From 38061b0f95e91f076f3b7e9e1c132a9aa06065a8 Mon Sep 17 00:00:00 2001
From: Sean MacAvaney
Date: Wed, 4 Dec 2024 15:08:26 +0000
Subject: [PATCH] good enough, I just need to be done with this

---
 pyterrier_dr/pt_docs/index.rst    |  2 +-
 pyterrier_dr/pt_docs/overview.rst | 40 +++++++++++++++++++++++++++++--
 2 files changed, 39 insertions(+), 3 deletions(-)

diff --git a/pyterrier_dr/pt_docs/index.rst b/pyterrier_dr/pt_docs/index.rst
index ea6061c..7e8dd8a 100644
--- a/pyterrier_dr/pt_docs/index.rst
+++ b/pyterrier_dr/pt_docs/index.rst
@@ -6,7 +6,7 @@ that provides functionality for Dense Retrieval.
 
 It provides this functionality primarily through:
 
-1. Transformers for encoding queries/documents into dense vectors (e.g., :class:`~pyterrier_dr.SBertBiEncoder`)
+1. Transformers for :doc:`encoding queries/documents <./encoding>` into dense vectors (e.g., :class:`~pyterrier_dr.SBertBiEncoder`)
 
 2. Transformers for :doc:`indexing and retrieval <./indexing-retrieval>` using these dense vectors (e.g., :class:`~pyterrier_dr.FlexIndex`)
 
diff --git a/pyterrier_dr/pt_docs/overview.rst b/pyterrier_dr/pt_docs/overview.rst
index 07407ab..4cf34a5 100644
--- a/pyterrier_dr/pt_docs/overview.rst
+++ b/pyterrier_dr/pt_docs/overview.rst
@@ -35,6 +35,8 @@ and (2) algorithms and data structures to index and retrieve documents using the
 Encoding
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
+(More information can be found at :doc:`encoding`.)
+
 Let's start by loading a dense model: `RetroMAE `__. The model has several
 checkpoints available on huggingface, including ``Shitao/RetroMAE_MSMARCO_distill``.
 ``pyterrier_dr`` provides an alias to this checkpoint with :meth:`RetroMAE.msmarco_distill() `:[#]_
@@ -77,7 +79,42 @@ next section, we will use these vectors to perform retrieval.
 Indexing and Retrieval
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-More information found at :doc:`indexing-retrieval`.
+(More information can be found at :doc:`indexing-retrieval`.)
+
+:class:`pyterrier_dr.FlexIndex` provides dense indexing and retrieval capabilities. Here's how you can index
+a collection of documents:
+
+.. code-block:: python
+    :caption: Indexing documents with ``pyterrier_dr``
+
+    >>> from pyterrier_dr import FlexIndex, RetroMAE
+    >>> model = RetroMAE.msmarco_distill()
+    >>> index = FlexIndex('my-index.flex')
+    # build an indexing pipeline that first applies RetroMAE to get dense vectors, then indexes them into the FlexIndex
+    >>> pipeline = model >> index.indexer()
+    # run the indexing pipeline over a set of documents
+    >>> pipeline.index([
+    ...     {"docno": "1161848_2", "text": "Cutest breed of dog is a PBGV (look up on Internet) they are a little hound that looks like a shaggy terrier."},
+    ...     {"docno": "686980_0", "text": "Golden retriever has longer hair and is a little heavier."},
+    ...     {"docno": "4189224_1", "text": "The onion releases a chemical that makes your eyes water up. I mean, no way short of wearing a mask or just avoiding the sting."},
+    ... ])
+
+Now that the documents are indexed, you can retrieve over them:
+
+.. code-block:: python
+    :caption: Retrieving with ``pyterrier_dr``
+
+    >>> from pyterrier_dr import FlexIndex, RetroMAE
+    >>> model = RetroMAE.msmarco_distill()
+    >>> index = FlexIndex('my-index.flex')
+    # build a retrieval pipeline that first applies RetroMAE to encode the query, then retrieves using those vectors over the FlexIndex
+    >>> pipeline = model >> index.retriever()
+    # run the retrieval pipeline for a query
+    >>> pipeline.search('golden retrievers')
+      qid              query      docno  docid      score  rank
+    0   1  golden retrievers   686980_0      1  77.125557     0
+    1   1  golden retrievers  1161848_2      0  61.379417     1
+    2   1  golden retrievers  4189224_1      2  54.269958     2
 
 Extras
 -------------------------------------------------------
@@ -99,4 +136,3 @@ Extras
 
 .. [#] You can also load the model from HuggingFace with :class:`~pyterrier_dr.HgfBiEncoder`: ``HgfBiEncoder("Shitao/RetroMAE_MSMARCO_distill")``.
    Using the alias will ensure that all settings for the model are assigned properly.
-