Tokenizer

Tokenizer is a unary transformer that converts a column of String values to lowercase and then splits each value on whitespace.

import org.apache.spark.ml.feature.Tokenizer
import spark.implicits._ // for toDF; already in scope in spark-shell

val tok = new Tokenizer()
// dataset to transform
val df = Seq(
  (1, "Hello world!"),
  (2, "Here is yet another sentence.")).toDF("id", "sentence")

// configure the input/output columns and transform the dataset
val tokenized = tok.setInputCol("sentence").setOutputCol("tokens").transform(df)
scala> tokenized.show(truncate = false)
+---+-----------------------------+-----------------------------------+
|id |sentence |tokens |
+---+-----------------------------+-----------------------------------+
|1 |Hello world! |[hello, world!] |
|2 |Here is yet another sentence.|[here, is, yet, another, sentence.]|
+---+-----------------------------+-----------------------------------+
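The transformation itself is just lowercasing followed by a whitespace split, so a roughly equivalent column can be built with plain Spark SQL functions. A minimal sketch (manual is an illustrative name; Tokenizer wraps comparable logic in a UDF):

import org.apache.spark.sql.functions.{lower, split}

// roughly what Tokenizer computes per row: lowercase, then split on whitespace
val manual = df.withColumn("tokens", split(lower($"sentence"), "\\s"))
manual.show(truncate = false)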
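Note that the tokens in the output keep their punctuation (world!, sentence.) because Tokenizer splits on whitespace only. If you need a different rule, RegexTokenizer from the same package accepts a custom pattern. A minimal sketch, assuming the same df as above:

import org.apache.spark.ml.feature.RegexTokenizer

// split on runs of non-word characters so punctuation is dropped
val rtok = new RegexTokenizer()
  .setInputCol("sentence")
  .setOutputCol("tokens")
  .setPattern("\\W+")
rtok.transform(df).show(truncate = false)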