For this project, we trained the model on a large amount of text: about 66 GB of data from the English Wikipedia corpus (https://dumps.wikimedia.org/enwiki/), which covers almost all words together with the words that typically appear near them. This data is fed to Latent Semantic Indexing (LSI), producing an LSI model of about 2.5 GB that stores the corpus in numerical form. This step is necessary because machine learning algorithms understand only numbers, not text, so all the text data is converted into a numerical form that the algorithms can work with.
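A minimal sketch of this training step using gensim is shown below; the dump file name, the dictionary filtering thresholds and the number of topics are illustrative assumptions, not the project's exact settings.

```python
from gensim.corpora import MmCorpus, WikiCorpus
from gensim.models import LsiModel

# Parse the Wikipedia dump (tens of GB) into a streamed bag-of-words corpus.
# The file name and filter thresholds are assumptions for illustration.
wiki = WikiCorpus('enwiki-latest-pages-articles.xml.bz2')
wiki.dictionary.filter_extremes(no_below=20, no_above=0.1)  # drop very rare / very common tokens

# Serialise the numeric corpus so it can be streamed from disk during training.
MmCorpus.serialize('wiki_bow.mm', wiki)
bow_corpus = MmCorpus('wiki_bow.mm')

# Train the LSI model; the saved model holds the corpus knowledge in numerical form.
lsi = LsiModel(bow_corpus, id2word=wiki.dictionary, num_topics=300)
lsi.save('wiki_lsi.model')
```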
After the LSI model is created, the marking scheme is taken, stop words are removed from it, and it is converted into a dictionary. This dictionary contains all the unique words, each mapped to a unique number; these numbers act as keys and the occurrences of the corresponding words as values. The dictionary is then converted into a bag of words, and the same procedure is applied to the answer sheet. The bag of words created from the marking scheme is passed to the LSI model, which prepares the data for comparison and yields a set of indexed values that makes the comparison process easier.
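The sketch below shows one way this preparation can look in gensim. The file name and the sentence splitting are assumptions, and the bag-of-words vectors are built with the word-to-id mapping stored inside the pre-trained LSI model so that the ids stay consistent with the model.

```python
from gensim.models import LsiModel
from gensim.parsing.preprocessing import STOPWORDS
from gensim.similarities import MatrixSimilarity
from gensim.utils import simple_preprocess

# Load the LSI model trained on the Wikipedia corpus.
lsi = LsiModel.load('wiki_lsi.model')
dictionary = lsi.id2word  # word <-> id mapping the model was trained with (assumed to be reused here)

# Split the marking scheme into sentences and remove stop words.
with open('marking_scheme.txt') as f:
    scheme_sentences = [s.strip() for s in f.read().split('.') if s.strip()]
scheme_tokens = [[w for w in simple_preprocess(s) if w not in STOPWORDS]
                 for s in scheme_sentences]

# Convert every sentence into a bag of words: (word id, occurrence count) pairs.
scheme_bow = [dictionary.doc2bow(tokens) for tokens in scheme_tokens]

# Project the bag-of-words vectors into LSI space and index the result;
# this index holds the set of values the answers are compared against.
scheme_index = MatrixSimilarity(lsi[scheme_bow], num_features=lsi.num_topics)
```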
Next, the answer sheet's corpus is passed through the LSI model and a set of values is generated. These values are compared with the indexed values of the marking scheme, and the LSI model assigns a comparison percentage. This is done for every sentence, so each sentence receives a comparison score. The average of these scores is taken as the comparison score of the answer. Marks are assigned to each answer based on its comparison score, and once all answers are processed, the marks are added up to obtain the total.
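A sketch of this scoring step is given below, reusing the objects from the previous sketch; the helper name and the rule that multiplies the average score by the question's maximum marks are assumptions made for illustration.

```python
from gensim.parsing.preprocessing import STOPWORDS
from gensim.utils import simple_preprocess

def score_answer(answer_text, lsi, dictionary, scheme_index, max_marks):
    """Return the marks awarded for one answer (hypothetical helper)."""
    # Split the answer into sentences and clean them the same way as the marking scheme.
    sentences = [s.strip() for s in answer_text.split('.') if s.strip()]
    sentence_scores = []
    for sentence in sentences:
        tokens = [w for w in simple_preprocess(sentence) if w not in STOPWORDS]
        sims = scheme_index[lsi[dictionary.doc2bow(tokens)]]  # similarity to every scheme sentence
        sentence_scores.append(float(sims.max()))             # best match for this sentence
    # The average of the per-sentence scores is the comparison score of the answer.
    comparison_score = sum(sentence_scores) / len(sentence_scores) if sentence_scores else 0.0
    return comparison_score * max_marks

# Total marks = sum of the marks obtained for every answer in the script, e.g.:
# total_marks = sum(score_answer(a, lsi, dictionary, scheme_index, 5) for a in answers)
```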
To construct a corpus from a Wikipedia (or other MediaWiki-based) database dump, refer to this tutorial: https://radimrehurek.com/gensim/corpora/wikicorpus.html.
For detailed technical information, please refer to my research paper: https://www.ijariit.com/manuscript/answer-script-evaluator/
In the above figure, the entries at index 0 refer to the first sentence in the answer sheet. Each comparison score written against it gives a marking scheme index and the similarity with that marking scheme sentence. Here, 0, 0.9099441 indicates that sentence 0 of the answer sheet has a comparison percentage of about 90% with sentence 0 of the marking scheme.
Similarly, the second entry, 5, 0.68514407, indicates that sentence 0 of the answer sheet has a comparison percentage of about 68% with sentence 5 of the marking scheme.
In the same way, the entire set of sentences is listed for each answer script. For each index of the answer script, the marking scheme index with the maximum comparison percentage is selected and recorded as the most similar, together with the corresponding comparison percentage.
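For example, if the similarity output for sentence 0 of the answer sheet is the list of (marking scheme index, score) pairs shown in the figure, the most similar marking scheme sentence can be picked as follows (the list here is typed in by hand for illustration):

```python
# (marking scheme index, similarity) pairs for answer sheet sentence 0, as in the figure.
sims_for_sentence_0 = [(0, 0.9099441), (5, 0.68514407)]

# Select the marking scheme sentence with the maximum comparison percentage.
best_index, best_score = max(sims_for_sentence_0, key=lambda pair: pair[1])
print(best_index, int(best_score * 100))  # -> 0 90
```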
In the above answer, each sentence is again split into an individual list item, so there are 6 sentences ranging from index 0 to index 5.
In the above figure, each sentence is again split into a list item, and as can be seen, there are more sentences in the marking scheme than in the answer script. It can also be observed that the points are jumbled, i.e. they are not in the same order in both documents. Each sentence-to-sentence comparison is now carried out.
As shown in the above figure, the first column is the index of the marking scheme sentence, the second column is the index of the answer sheet sentence, and the last column is the comparison percentage of those two sentences. This is the process for one answer; the same procedure is applied to every answer in the answer script, and the comparison percentage is calculated for each one individually.
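The three columns in the figure can be produced with a loop like the one below, reusing the objects from the earlier sketches; answer_tokens is a hypothetical list of the already tokenised answer sentences.

```python
# Build (marking scheme index, answer sheet index, comparison percentage) rows for one answer.
rows = []
for ans_idx, tokens in enumerate(answer_tokens):          # answer sentences, already tokenised
    sims = scheme_index[lsi[dictionary.doc2bow(tokens)]]  # score against every scheme sentence
    scheme_idx = int(sims.argmax())                       # most similar marking scheme sentence
    rows.append((scheme_idx, ans_idx, round(float(sims[scheme_idx]) * 100, 2)))

for scheme_idx, ans_idx, percentage in rows:
    print(scheme_idx, ans_idx, percentage)
```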