A library for Positive-Unlabeled Learning for Apache Spark MLlib (ml package)
Original Positive-Unlabeled learning algorithm, first proposed in:
Liu, B., Dai, Y., Li, X. L., Lee, W. S., & Philip, Y. (2002). Partially supervised classification of text documents. In ICML 2002, Proceedings of the nineteenth international conference on machine learning. (pp. 387–394).
Modified Positive-Unlabeled learning algorithm; the main idea is to gradually refine the set of positive examples. The pseudocode was taken from:
Fusilier, D. H., Montes-y-Gómez, M., Rosso, P., & Cabrera, R. G. (2015). Detecting positive and negative deceptive opinions using PU-learning. Information Processing & Management, 51(4), 433–443.
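The gradual-refinement idea can be illustrated without Spark. The sketch below is a toy and not the library's implementation: it uses one-dimensional features, a hypothetical midpoint-of-means "classifier", and assumes that positives tend to have larger values than negatives. It also assumes that refinement means shrinking the pool of unlabeled instances treated as negative: train on positives versus that pool, drop the instances the classifier labels positive, and repeat while the pool keeps shrinking. All names (`GradualRefinementSketch`, `train`, `refine`) are made up for illustration.

```scala
// Toy illustration only: 1-D features, hypothetical helper names, no Spark.
object GradualRefinementSketch {
  // The "classifier" is a single decision threshold: x >= threshold means positive.
  // It sits midway between the mean of the positives and the mean of the
  // instances currently treated as negatives.
  def train(positives: Seq[Double], negatives: Seq[Double]): Double =
    (positives.sum / positives.size + negatives.sum / negatives.size) / 2.0

  // Returns the final threshold and the unlabeled instances still treated as negative.
  def refine(positives: Seq[Double], unlabeled: Seq[Double], maxIters: Int): (Double, Seq[Double]) = {
    require(positives.nonEmpty && unlabeled.nonEmpty, "both sets must be non-empty")
    var negatives = unlabeled
    var threshold = train(positives, negatives)
    var iter = 0
    var shrinking = true
    while (iter < maxIters && shrinking) {
      // Keep only the instances the current classifier still calls negative.
      val kept = negatives.filter(_ < threshold)
      shrinking = kept.nonEmpty && kept.size < negatives.size
      if (shrinking) {
        negatives = kept
        threshold = train(positives, negatives)
      }
      iter += 1
    }
    (threshold, negatives)
  }
}
```

For example, `GradualRefinementSketch.refine(Seq(8.0, 9.0, 10.0), Seq(1.0, 2.0, 3.0, 8.5, 9.5), 10)` drops the two high-valued unlabeled points in the first pass and then stops, because the second pass removes nothing.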
Spark 1.5+ (Spark 2+ has not been tested, but it should work if you replace SparkContext with SparkSession and mllib.linalg.Vector with ml.linalg.Vector)
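For readers attempting the Spark 2+ route, the following is a minimal sketch of what the two replacements look like in standard Spark code. It uses only the public Spark 2+ API and has not been checked against pu4spark itself; the application name and the sample vector are placeholders.

```scala
// Spark 2+ only. Nothing here comes from pu4spark; it is plain Spark API.
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.linalg.{Vector => NewVector}
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}

// SparkSession is the Spark 2+ entry point; the underlying SparkContext
// remains reachable for code that still needs it.
val spark = SparkSession.builder()
  .appName("pu4spark-spark2-sketch") // placeholder name
  .master("local[*]")
  .getOrCreate()
val sc = spark.sparkContext

// An old mllib vector can be converted to the new ml type with asML.
val oldVec = OldVectors.dense(0.1, 0.2, 0.3)
val newVec: NewVector = oldVec.asML
```

Whether these two substitutions are sufficient for the library to compile and run on Spark 2+ is not confirmed here.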
The library is published to Maven Central and JCenter. Add the dependency that matches your build system (Gradle, Maven, or sbt):
compile 'ru.ispras:pu4spark:0.3'
<dependency>
    <groupId>ru.ispras</groupId>
    <artifactId>pu4spark</artifactId>
    <version>0.3</version>
</dependency>
libraryDependencies += "ru.ispras" % "pu4spark" % "0.3"
Build the library with Gradle:
./gradlew jar
val inputLabelName = "category"
val srcFeaturesName = "srcFeatures"
val outputLabel = "outputLabel"
val puLearnerConfig = TraditionalPULearnerConfig(0.05, 1, LogisticRegressionConfig())
val puLearner = puLearnerConfig.build()
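val df = ... // input DataFrame; after preparation it must contain at least the following columns:
// a binary label separating positive from unlabeled instances (inputLabelName)
// and the features assembled as a vector (srcFeaturesName)
// preparedDf is df with its features assembled into srcFeaturesName (e.g. with VectorAssembler)
val weightedDF = puLearner.weight(preparedDf, inputLabelName, srcFeaturesName, outputLabel)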
The returned DataFrame contains a probability estimate for each instance in the outputLabel column.
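One way to use these estimates is to keep only the instances that score above a cut-off. The helper below is a hypothetical sketch, not part of pu4spark, and it assumes the output column holds a numeric probability; the threshold value is up to the caller.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper, not part of pu4spark: keeps rows whose score is at
// least `threshold` and sorts them from most to least likely positive.
// Assumes `scoreCol` is a numeric column.
def likelyPositives(weighted: DataFrame, scoreCol: String, threshold: Double): DataFrame =
  weighted.filter(col(scoreCol) >= threshold).orderBy(col(scoreCol).desc)
```

With the names used above, a call could look like `likelyPositives(weightedDF, outputLabel, 0.5).show()`.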
Features can be assembled into a single vector column with VectorAssembler:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.VectorAssembler

val assembler = new VectorAssembler()
  .setInputCols(df.columns.filter(c => c != rowName)) // keep only feature columns here (rowName is a non-feature column)
  .setOutputCol(srcFeaturesName)
val pipeline = new Pipeline().setStages(Array(assembler))
val preparedDf = pipeline.fit(df).transform(df)
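To make the preparation step concrete, here is a self-contained sketch on toy data. The column names (`id`, `f1`, `f2`), the values, and the integer encoding of the label (1 for known positive, 0 for unlabeled) are assumptions made for illustration; the label type that pu4spark actually expects is not verified here. The sketch uses the Spark 1.x style entry point (`SQLContext`) to match the stated requirement.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.ml.feature.VectorAssembler

val sc = new SparkContext(
  new SparkConf().setAppName("pu4spark-assembler-sketch").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)

// Hypothetical toy data: a row id, two numeric features, and a binary label
// (assumed encoding: 1 = known positive, 0 = unlabeled).
val toyDf = sqlContext.createDataFrame(Seq(
  (1L, 0.9, 0.8, 1),
  (2L, 0.2, 0.1, 0),
  (3L, 0.7, 0.9, 0)
)).toDF("id", "f1", "f2", "category")

// Exclude both the id and the label so that only real features are assembled.
val nonFeatureCols = Set("id", "category")
val toyAssembler = new VectorAssembler()
  .setInputCols(toyDf.columns.filterNot(nonFeatureCols.contains))
  .setOutputCol("srcFeatures")

val toyPrepared = toyAssembler.transform(toyDf)
toyPrepared.show()
```

Excluding the label column matters: if it were assembled into the feature vector, the classifier would see the answer it is supposed to estimate.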