This repository contains Streaming-KNORA (S-KNORA), an algorithm designed to analyse streaming data in a distributed environment. S-KNORA is implemented using Spark across multiple Hadoop nodes.
This project was developed in the School of Computer Science at the University of Manchester as part of my MSc dissertation, under the supervision of Professor John A. Keane. Additional staff involved: Dr. Firat Tekiner.
This project was a proof of concept that aimed to demonstrate the feasibility of applying KNORA ensemble learning to high-throughput streaming data. The results show that:
- S-KNORA can learn concepts on disjoint streaming data and achieve higher accuracy than the single streaming learning mode;
- the pipeline's throughput, when running with a large batch size, is up to 6.82 times that of the pipeline running on a single thread;
- to capture severe concept drift, batch-incremental learning requires more frequent model updates on small batches, which causes high overhead in a distributed environment.
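
For readers unfamiliar with KNORA (K-Nearest Oracles), the sketch below illustrates the general KNORA-Eliminate selection rule: for a query instance, find its k nearest neighbours in a labelled validation set and keep only the base classifiers that classify all of those neighbours correctly, shrinking the neighbourhood when no classifier qualifies. It is a minimal single-machine sketch and not the code in this repository. Which KNORA variant S-KNORA uses is not stated here, and the class name `KnoraEliminateSketch`, the layout of the `correct` matrix, and the fallback to the full ensemble are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of KNORA-Eliminate; not taken from this repository.
class KnoraEliminateSketch {

    /** Squared Euclidean distance between two feature vectors of equal length. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Indices of the (at most) k validation instances closest to the query, nearest first. */
    static int[] nearestNeighbours(double[][] validation, double[] query, int k) {
        Integer[] order = new Integer[validation.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Sort instance indices by their distance to the query.
        Arrays.sort(order, (x, y) -> Double.compare(
                squaredDistance(validation[x], query),
                squaredDistance(validation[y], query)));
        int[] result = new int[Math.min(k, order.length)];
        for (int i = 0; i < result.length; i++) {
            result[i] = order[i];
        }
        return result;
    }

    /**
     * correct[c][v] is true when base classifier c predicted validation instance v correctly.
     * Returns the classifiers that are correct on every neighbour. If none qualifies,
     * the farthest neighbour is dropped and the check is repeated. If no classifier is
     * correct even on the single nearest neighbour, the whole ensemble is returned
     * (an assumed fallback).
     */
    static List<Integer> select(boolean[][] correct, int[] neighbours) {
        for (int k = neighbours.length; k >= 1; k--) {
            List<Integer> selected = new ArrayList<>();
            for (int c = 0; c < correct.length; c++) {
                boolean allCorrect = true;
                for (int j = 0; j < k; j++) {
                    if (!correct[c][neighbours[j]]) {
                        allCorrect = false;
                        break;
                    }
                }
                if (allCorrect) {
                    selected.add(c);
                }
            }
            if (!selected.isEmpty()) {
                return selected;
            }
        }
        List<Integer> all = new ArrayList<>();
        for (int c = 0; c < correct.length; c++) {
            all.add(c);
        }
        return all;
    }

    public static void main(String[] args) {
        double[][] validation = {{0, 0}, {1, 0}, {5, 5}};
        boolean[][] correct = {
            {true, true, false},   // classifier 0
            {true, false, true},   // classifier 1
            {false, false, true}   // classifier 2
        };
        int[] neighbours = nearestNeighbours(validation, new double[] {0.1, 0}, 2);
        System.out.println("Selected classifiers: " + select(correct, neighbours));
    }
}
```

The selected classifiers would then vote on the label of the query instance. In a streaming setting the validation set and the `correct` matrix would have to be refreshed as new labelled batches arrive, which this sketch does not attempt.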

The repository is organised as follows:
- ./Dataset_Single: the datasets used in the StreamingKNORA_Single experiments
- ./Dataset_Spark: the datasets used in the StreamingKNORA_Spark experiments
- ./StreamingKNORA_Single: a Java implementation used for batch size selection; it is also treated as an ideal program without overhead.
- ./StreamingKNORA_Spark: a Spark implementation for performance evaluation that measures throughput and monitors resource utilization on different datasets.