Note: With default parameters, LinearSVC gave the highest accuracy, so all results other than the machine-learning parameter-tuning ones, i.e. the feature-generation and vectorization results, were produced with sklearn's LinearSVC model. The features used throughout, including while testing the feature stack, are NER, TAG, DEP and is-STOP. Accuracy reported is empirical accuracy (percent).
- Various `ngram_range` (min, max) values:
  - (1, 2) - 59.0%
  - (3, 5) - 57.4%
  - (2, 3) - 57.8%
  - (3, 4) - 57.2%
  - (1, 3) - 58.4%
Hence, longer n-gram ranges are not a good way to generate text features for the question dataset. For the results below, `ngram_range=(1, 2)` was used.
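A minimal sketch of how these ranges can be compared, assuming a TfidfVectorizer over the per-question feature strings and a held-out test split (the vectorizer choice and variable names are assumptions, not the project's exact code):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

# train_texts/test_texts hold the feature strings, *_labels the question classes
for ngram_range in [(1, 2), (3, 5), (2, 3), (3, 4), (1, 3)]:
    vectorizer = TfidfVectorizer(ngram_range=ngram_range)
    X_train = vectorizer.fit_transform(train_texts)
    X_test = vectorizer.transform(test_texts)
    clf = LinearSVC().fit(X_train, train_labels)
    print(ngram_range, accuracy_score(test_labels, clf.predict(X_test)))
```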
- 'TAG' is just a more fine-grained form of POS in the spaCy library. With 'POS' instead of 'TAG' - 55.6%; with 'TAG' - 59.0%. More information can be found in the spaCy token documentation.
- 'Words' added as feature - 81.9%
- 'Lemma' (stem form of the word) added as feature - 82.7%
- 'Lemma', 'Shape', 'is-Alpha' added as features - 83.9%
- 'Lemma', 'is-Alpha' added as features - 82.8%
- 'Lemma', 'Shape' added as features - 84.2%
Hence, from point 1 it is clear that the accuracy and depth of the NLP library can markedly affect the results. From points 2 and 3 we conclude that the stem form of a word is a more useful feature than the word itself. From points 4, 5 and 6 we can conclude that the shape or length of a word is a more effective feature than one telling us whether the word is alphabetic or not. A feature-extraction sketch follows.
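For reference, a sketch of turning a question into a single feature string with spaCy's token attributes (NER, TAG, DEP, is-STOP, plus the winning Lemma and Shape); the model name and the concatenation format are assumptions:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed model; any English pipeline works

def featurize(question):
    """Concatenate per-token attributes into one whitespace-separated string."""
    doc = nlp(question)
    feats = []
    for tok in doc:
        feats.extend([
            tok.ent_type_,     # NER label ("" when the token is not an entity)
            tok.tag_,          # fine-grained POS tag (TAG)
            tok.dep_,          # dependency relation (DEP)
            str(tok.is_stop),  # is-STOP
            tok.lemma_,        # Lemma (stem form of the word)
            tok.shape_,        # Shape, e.g. "Xxxx" or "dd"
        ])
    return " ".join(f for f in feats if f)

print(featurize("Who wrote Hamlet?"))
```

The resulting strings are what the vectorizer above consumes.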
- All Default - 80.4%
- `solver="newton-cg"` - 80.8%; increasing `max_iter` also gives the same results.
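A sketch of this run with sklearn's LogisticRegression, reusing the vectorized `X_train`/`X_test` from the sketches above (variable names are assumptions):

```python
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(solver="newton-cg", max_iter=1000)
clf.fit(X_train, train_labels)
print(clf.score(X_test, test_labels))  # ~0.808 in the runs reported above
```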
- All Default - 27.8%
- `gamma=1E-6` - 2.4%
- `C=0.025` - 2.4%
- `max_iter=750` - 14.2%
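These kernel-SVM numbers correspond to sklearn's `SVC`; a sketch varying one parameter at a time (the one-at-a-time protocol is an assumption):

```python
from sklearn.svm import SVC

for params in [{}, {"gamma": 1e-6}, {"C": 0.025}, {"max_iter": 750}]:
    clf = SVC(**params).fit(X_train, train_labels)
    print(params, clf.score(X_test, test_labels))
```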
Coarse accuracy (only the main class) = 89.4%
Fine accuracy (both main class and subclass) = 82.8%
Tested different combinations of epochs, batch size, layers, etc. Almost all give 76-82% fine accuracy.
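Coarse accuracy can be computed from the same fine-grained predictions by comparing only the main class. Assuming TREC-style `MAIN:sub` labels (e.g. `LOC:city`), a minimal sketch that works for any model's string predictions:

```python
def coarse(label):
    # "LOC:city" -> "LOC"; a label with no subclass passes through unchanged
    return label.split(":")[0]

def accuracies(y_true, y_pred):
    fine = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    crse = sum(coarse(t) == coarse(p) for t, p in zip(y_true, y_pred)) / len(y_true)
    return crse, fine

print(accuracies(test_labels, predictions))  # (coarse, fine)
```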
The LinearSVC module is more efficient than `svm.SVC` with `kernel="linear"`.
- All Default - 84.2%
- `loss="squared_hinge"` and `dual=False` - 84.2%; `dual=True` also gives the same accuracy.
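The corresponding LinearSVC variants, as a sketch (same assumed variables as above):

```python
from sklearn.svm import LinearSVC

for params in [{},
               {"loss": "squared_hinge", "dual": False},
               {"loss": "squared_hinge", "dual": True}]:
    clf = LinearSVC(**params).fit(X_train, train_labels)
    print(params, clf.score(X_test, test_labels))  # all ~84.2% here
```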
Coarse accuracy (only the main class) = 91.2%
Fine accuracy (both main class and subclass) = 84.2%