Implementation of a search engine from scratch
This project is developed by 2 students from CentraleSupélec as part of the "Fondements en Recherche d'Information" course:
- Cécile Gontier - @CecileSerene
- Delphine Shi - @dlphn
We are working on two given collections:
- CACM collection
- CS276 collection
When installing, create a file config.py
in the main directory and fill with global paths to collections and path where you want the index to be stored:
CACM_path = '/path/to/CACM/'
CS276_path = '/path/to/pa1-data/'
index_path = '/path/to/index/'
Go to RunMe.ipynb
for a notebook with main results and explanations.
If you don't want to spend too much time generating the index, you can download it from there :
https://drive.google.com/drive/folders/17glYdz6KY_PJsnANKrYi4xooNkDQ0ua1?usp=sharing.
Be sure to replace the index/
folder with the unzipped folder.
Entry point: CACMIndex.py
and CS276Index.py
. Each will calculate token size and number of vocabulary of the collection, and also draw the corresponding frequency graphs.
Helper functions:
textProcessing.py
processes text with language processing tools like tokenize, lemmatize, removing stop words etc.indexBuilder.py
to help build each index.CACMParser.py
to parse CACM document and get title, summary and key words.
Heap Law: heapRegression.py
. Run to calculate Heap Law parameters of each collection. You will need to uncomment to change collection.
Frequency graphs: frequencyRankGraph.py
- helper class to draw frequency graphs.
Entry point : BSBI.py
.
Running this file will generate the different dictionaries (documents, terms, index) in the index/
folder given in config.py
.
Entry point : boolean/booleanEvaluation.py
.
Run tests on boolean/test.py
Entry point : vectorial/vectorialEvaluation.py
.
Run tests on vectorial/test.py
Both search models that we implemented inherit from evaluation.py
.
Evaluate our CACM search models by running functions in CACMEvaluation.py
.