Evaluators

An evaluator is a transformation that maps a DataFrame into a metric indicating how good a model is.

evaluator: DataFrame =[evaluate]=> Double
Evaluator is an abstract class with the following evaluate methods:
evaluate(dataset: DataFrame): Double
evaluate(dataset: DataFrame, paramMap: ParamMap): Double
It employs the isLargerBetter method to indicate whether the Double metric should be maximized (true) or minimized (false). It considers a larger value better (true) by default.

isLargerBetter: Boolean = true
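As a quick sketch of how the pieces fit together, the snippet below drives a RegressionEvaluator (covered later in this section) through both evaluate variants and checks isLargerBetter. It assumes a spark-shell session (so toDF is in scope) and uses a made-up two-row dataset.

import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.param.ParamMap

val eval = new RegressionEvaluator // rmse by default

// a hand-made dataset with the two columns the evaluator expects
val predictions = Seq((1.5, 1.0), (0.5, 1.0)).toDF("prediction", "label")

// evaluate with the evaluator's current parameters
val rmse = eval.evaluate(predictions)

// ...or override parameters for this one call only
val r2 = eval.evaluate(predictions, ParamMap(eval.metricName -> "r2"))

// rmse is an error measure, so smaller is better
assert(eval.isLargerBetter == false)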
The following is a list of some of the available Evaluator implementations:
MulticlassClassificationEvaluator

MulticlassClassificationEvaluator is a concrete Evaluator that expects DataFrame datasets with the following two columns:

- prediction of DoubleType
- label of float or double values
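A minimal usage sketch on made-up data (again assuming a spark-shell session); the evaluator computes its default metric, F1:

import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

// toy (prediction, label) pairs, as if produced by a fitted classifier
val predictions = Seq(
  (0.0, 0.0),
  (1.0, 1.0),
  (1.0, 0.0) // one misclassified row
).toDF("prediction", "label")

val mcEval = new MulticlassClassificationEvaluator // f1 by default
val f1 = mcEval.evaluate(predictions)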
BinaryClassificationEvaluator

BinaryClassificationEvaluator is a concrete Evaluator for binary classification that expects datasets (of DataFrame type) with two columns:

- rawPrediction being DoubleType or VectorUDT
- label being NumericType

Note: It can be used to cross-validate models such as LogisticRegression, RandomForestClassifier et al.
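A minimal sketch on made-up scores (assuming Spark 2.x, where vectors live in org.apache.spark.ml.linalg); rawPrediction is given here in its VectorUDT flavour:

import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.linalg.Vectors

// made-up (rawPrediction, label) rows
val scored = Seq(
  (Vectors.dense(0.1, 0.9), 1.0),
  (Vectors.dense(0.8, 0.2), 0.0),
  (Vectors.dense(0.4, 0.6), 0.0)
).toDF("rawPrediction", "label")

val binEval = new BinaryClassificationEvaluator // areaUnderROC by default
val auc = binEval.evaluate(scored)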
RegressionEvaluator

RegressionEvaluator is a concrete Evaluator for regression that expects datasets (of DataFrame type) with the following two columns:

- prediction of float or double values
- label of float or double values
When executed (via evaluate) it prepares an RDD[(Double, Double)] with (prediction, label) pairs and passes it on to org.apache.spark.mllib.evaluation.RegressionMetrics (from the "old" Spark MLlib).
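Roughly, that hand-off could be reproduced by hand as follows (a sketch; predictions stands for any DataFrame with the two columns above):

import org.apache.spark.mllib.evaluation.RegressionMetrics

// build the RDD[(Double, Double)] of (prediction, label) pairs ourselves
val predictionsAndLabels = predictions
  .select("prediction", "label")
  .rdd
  .map(row => (row.getDouble(0), row.getDouble(1)))

val metrics = new RegressionMetrics(predictionsAndLabels)
metrics.rootMeanSquaredError // rmse
metrics.meanSquaredError     // mse
metrics.r2                   // r2
metrics.meanAbsoluteError    // mae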
RegressionEvaluator can evaluate the following metrics:

- rmse (default; larger is better? no) is the root mean squared error.
- mse (larger is better? no) is the mean squared error.
- r2 (larger is better? yes) is the coefficient of determination.
- mae (larger is better? no) is the mean absolute error.
// prepare a fake input dataset using transformers
import org.apache.spark.ml.feature.Tokenizer
val tok = new Tokenizer().setInputCol("text")
import org.apache.spark.ml.feature.HashingTF
val hashTF = new HashingTF()
.setInputCol(tok.getOutputCol) // it reads the output of tok
.setOutputCol("features")
// Scala trick to chain transform methods
// It's of little to no use since we've got Pipelines
// Just to have it as an alternative
// (the explicit types disambiguate the overloaded transform methods)
import org.apache.spark.sql.DataFrame
val transform: DataFrame => DataFrame =
  (tok.transform(_: DataFrame)).andThen(hashTF.transform(_: DataFrame))
val dataset = Seq((0, "hello world", 0.0)).toDF("id", "text", "label")
// we're using Linear Regression algorithm
import org.apache.spark.ml.regression.LinearRegression
val lr = new LinearRegression
import org.apache.spark.ml.Pipeline
val pipeline = new Pipeline().setStages(Array(tok, hashTF, lr))
val model = pipeline.fit(dataset)
// Let's do prediction
// Note that we're using the same dataset as for fitting the model
// Something you'd definitely not be doing in prod
val predictions = model.transform(dataset)
// Now we're ready to evaluate the model
// Evaluator works on datasets with predictions
import org.apache.spark.ml.evaluation.RegressionEvaluator
val regEval = new RegressionEvaluator
// check the available parameters
scala> println(regEval.explainParams)
labelCol: label column name (default: label)
metricName: metric name in evaluation (mse|rmse|r2|mae) (default: rmse)
predictionCol: prediction column name (default: prediction)
scala> regEval.evaluate(predictions)
res0: Double = 0.0
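As a follow-up to the session above, the metric can be switched on the same evaluator instance; note how isLargerBetter flips to true for r2 (actual metric values depend on your data, so none are shown here):

// switch to mean absolute error; like rmse, smaller is better
regEval.setMetricName("mae").evaluate(predictions)

// r2 is the only one of the four metrics where larger is better
regEval.setMetricName("r2")
regEval.isLargerBetter // true for r2, false for rmse/mse/mae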