BM25

  • Okapi BM25 - Wikipedia #ril
    • In information retrieval, Okapi BM25 (BM stands for Best Matching) is a RANKING FUNCTION used by search engines to rank matching documents according to their RELEVANCE TO A GIVEN SEARCH QUERY. If BM means Best Matching, then what does the 25 stand for??
    • The name of the actual ranking function is BM25. To set the right context, however, it is usually referred to as "Okapi BM25", since the Okapi information retrieval system, implemented at London's City University in the 1980s and 1990s, was the first system to implement this function. Okapi is pronounced [o‵kɑpɪ]; it is an African animal related to the giraffe.
    • BM25 and its newer variants, e.g. BM25F (a version of BM25 that can take document STRUCTURE and ANCHOR TEXT into account), represent state-of-the-art TF-IDF-LIKE retrieval functions used in document retrieval. So TF-IDF is a family/category, and BM25 is one instantiation within it??

Getting Started ?? {: #getting-started }

  • The ranking function - Okapi BM25 - Wikipedia #ril

    • BM25 is a BAG-OF-WORDS retrieval function that ranks a set of documents based on the QUERY TERMS APPEARING IN EACH DOCUMENT, regardless of the INTER-RELATIONSHIP BETWEEN THE QUERY TERMS WITHIN A DOCUMENT (e.g., their relative PROXIMITY). Is ignoring how close the query terms are to each other within a document a weakness of BM25?

    • It is not a single function, but actually a whole FAMILY OF SCORING FUNCTIONS (the formula below can be expanded further), with slightly different components and parameters. One of the most prominent instantiations of the function is as follows.

      Given a query $Q$, containing keywords $q_1$,...,$q_n$, the BM25 score of a document $D$ is (the per-keyword scores for that document are summed):

      $$\text{score}(D, Q) = \sum_{i=1}^{n} IDF(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{avgdl}\right)}$$

      where $f(q_i, D)$ is $q_i$'s TERM FREQUENCY in the document $D$, $|D|$ is the length of the document $D$ IN WORDS, and $avgdl$ is the average document length in the TEXT COLLECTION from which documents are drawn. It seems $k_1$ and $b$ respectively tune the influence of $f(q_i, D)$ and $\frac{|D|}{avgdl}$? But $k_1$ also seems to interact with $\frac{|D|}{avgdl}$??

      $k_1$ and $b$ are FREE PARAMETERS, usually CHOSEN, in absence of an advanced optimization, as $k_1 \in [1.2, 2.0]$ and $b = 0.75$. Tuned this way, they usually give decent results.

      $IDF(q_i)$ is the IDF (INVERSE DOCUMENT frequency) WEIGHT of the query term $q_i$. It is usually computed as:

      $$IDF(q_i) = \ln\left(\frac{N - n(q_i) + 0.5}{n(q_i) + 0.5} + 1\right)$$

      where $N$ is the total number of documents in the COLLECTION, and $n(q_i)$ is the number of documents containing $q_i$. The larger the collection and the fewer documents that contain the keyword, the higher the IDF value -- a higher IDF means the keyword is more distinctive.

      Note the difference between TERM frequency (in TF) and DOCUMENT frequency (in IDF): the former is "within a document", the latter is "in the collection". That is, the former counts how many times a keyword appears in a single document, while the latter measures how distinctive/rare the keyword is across the whole collection. (See the Python sketch after this list.)

  • Information retrieval - Wikipedia #ril

  • Bag-of-words model - Wikipedia #ril

  • Ranking (information retrieval) - Wikipedia #ril
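
Putting the pieces together, below is a minimal Python sketch of the scoring function described above: for each document, sum each query term's IDF weight times its saturated, length-normalized term frequency. It assumes naive lowercase whitespace tokenization and uses the smoothed $\ln(\ldots + 1)$ IDF variant from the Wikipedia article; the function name `bm25_scores` and the toy corpus are my own, not from any library.

```python
# Minimal BM25 sketch (assumptions: whitespace tokenization, smoothed IDF).
import math
from collections import Counter

def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Return a BM25 score for every document in `documents` against `query`."""
    docs = [doc.lower().split() for doc in documents]
    query_terms = query.lower().split()
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N

    # n(q_i): number of documents in the collection containing each query term
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}

    scores = []
    for d in docs:
        tf = Counter(d)  # f(q_i, D): term frequency within this document
        score = 0.0
        for t in query_terms:
            # IDF(q_i) = ln((N - n(q_i) + 0.5) / (n(q_i) + 0.5) + 1)
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            # Saturated TF with document-length normalization via k1 and b
            f = tf[t]
            score += idf * (f * (k1 + 1)) / (f + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores

if __name__ == "__main__":
    corpus = [
        "the quick brown fox jumps over the lazy dog",
        "a quick brown dog outpaces a quick fox",
        "the cat sleeps all day",
    ]
    print(bm25_scores("quick fox", corpus))
```

Note how $b$ controls how strongly long documents are penalized (set $b = 0$ and the $\frac{|D|}{avgdl}$ factor disappears), while $k_1$ controls how quickly repeated occurrences of a term stop adding to the score.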

TF-IDF ??

References {: #reference }

Related: