
2 RapidMiner plugin

agudys edited this page Aug 2, 2019 · 10 revisions

2.1. Installation

To use the RuleKit RapidMiner plugin, download the rulekit-<version>-rmbundle.zip file from the releases folder. The archive contains RapidMiner 9.3 bundled with the plugin. The bundle can also be built from the sources by running the following commands in the adaa.analytics.rules directory. Windows:

gradlew -b build.gradle rmbundle

Linux:

./gradlew -b build.gradle rmbundle

The output archive will be stored in adaa.analytics.rules/build/distributions. After unpacking the ZIP file, run the RapidMiner-Studio.bat (Windows) or RapidMiner-Studio.sh (Linux) script. Note that an archive built under Windows may not work on Linux because of the different newline characters in the shell script. An archive built under Linux works on Windows without this problem. The archive provided in the releases works under both systems.
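If an archive built under Windows has to be run on Linux, one possible workaround is to normalize the line endings of the launcher script after unpacking. The following is a minimal Python sketch, assuming the only problem is Windows-style (CRLF) line endings in RapidMiner-Studio.sh. The helper names crlf_to_lf and fix_script are hypothetical and not part of RuleKit or RapidMiner.

```python
from pathlib import Path


def crlf_to_lf(data: bytes) -> bytes:
    """Replace Windows (CRLF) line endings with Unix (LF) ones."""
    return data.replace(b"\r\n", b"\n")


def fix_script(path: str) -> bool:
    """Rewrite the file at `path` with LF endings.

    Returns True if the file content was changed, False otherwise.
    """
    script = Path(path)
    original = script.read_bytes()
    converted = crlf_to_lf(original)
    if converted != original:
        script.write_bytes(converted)
    return converted != original
```

Calling fix_script("RapidMiner-Studio.sh") from the directory with the unpacked bundle would rewrite the script in place. Working on bytes rather than text avoids any implicit newline translation by Python itself.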

2.2. Usage

The plugin consists of two operators:

  • RuleKit Generator,
  • RuleKit Performance,

which can be found in Extensions → ADAA → RuleKit folder.

The former operator induces various types of rule models. It is a RapidMiner learner with a single training set input and three outputs: model (to be applied on unseen data), example set (the input training set passed through without any changes), and estimated performance (model characteristics). RuleKit automatically determines the type of the problem on the basis of the training set metadata:

  • classification - nominal label attribute,
  • regression - numerical label attribute,
  • survival analysis - binary label attribute and numerical attribute with role survival_time specified.

The metadata are crucial for proper operation, as they determine which GUI parameters are available and which induction algorithm is used.
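The selection rule listed above can be sketched in a few lines of Python. This is only an illustration of the documented behaviour, not RuleKit's actual code: the function name, the value-type strings ("nominal", "binary", "numerical"), and the representation of roles as a dictionary are all assumptions made for the example.

```python
from typing import Mapping


def detect_problem_type(label_type: str, roles: Mapping[str, str]) -> str:
    """Pick the problem type from the label value type and attribute roles.

    label_type: value type of the label attribute (assumed naming:
        "nominal", "binary", or "numerical").
    roles: maps attribute name -> role, e.g. {"time": "survival_time"}.
    """
    if label_type == "numerical":
        return "regression"
    # A binary label is treated as survival status only when some
    # attribute carries the survival_time role.
    if label_type == "binary" and "survival_time" in roles.values():
        return "survival analysis"
    if label_type in ("nominal", "binary"):
        return "classification"
    raise ValueError(f"unsupported label type: {label_type}")
```

In this sketch a binary label without a survival_time attribute falls back to classification, since a binary attribute is a special case of a nominal one.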

The RuleKit Performance operator allows assessing the model. It conforms to the standard RapidMiner Performance operator, thus it has labelled data and performance inputs. The former allows calculating various performance metrics on the predicted data, while the latter can be used to capture model characteristics returned by the RuleKit Generator.

2.3. Example

In this subsection we show an example regression analysis using the RuleKit RapidMiner plugin. The investigated dataset is named methane and concerns the problem of predicting methane concentration in a coal mine. The set is split into separate testing and training parts distributed in ARFF format (download). For demonstration purposes, smaller versions of these datasets, suffixed with -minimal, are also provided. The analysis is divided into two parts: data preparation and main processing. The corresponding RapidMiner processes, preparation.rmp and regression.rmp, are presented in Figures 2.1 and 2.2. The processes can be imported into RapidMiner (File → Import Process...) and executed (Play button).

The role of the preparation process is to add metadata to the sets and store them in the RM format (RapidMiner does not support metadata for ARFF files). Since the Read ARFF operator is no longer provided by some RapidMiner revisions, we distribute its version from RapidMiner 5 as a part of RuleKit. After reading an ARFF file, the Set Role operator is used to set MM116_pred as the label attribute (in survival analysis, the survival_time role has to be additionally assigned to some other attribute). Then, the sets are saved in the RapidMiner local repository with Store operators.

In the main process, the datasets are loaded from the RM repository with the Retrieve operator. Then, the training set is provided as an input for RuleKit Generator. All the parameters configurable from the XML interface are accessible through the RapidMiner GUI. In this example, mincov is set to 4 and the RSS measure is used for growing, pruning, and voting. The corresponding panel with operator properties is presented in Figure 2.3.

Figure 2.1. Data preparation process.
Figure 2.2. Main analysis process.
Figure 2.3. RuleKit Generator parameters.

The model generated by RuleKit Generator is then applied on unseen data (Apply Model operator). The performance of the prediction is assessed using the RuleKit Evaluator operator. The performance metrics as well as the generated model are passed as process outputs. The text representation of the model is presented in the training report description.