Tokenizer

Tokenizer is a unary transformer that converts a column of String values to lowercase and then splits each value on whitespace.

import org.apache.spark.ml.feature.Tokenizer
import spark.implicits._ // for toDF; already in scope in spark-shell

val tok = new Tokenizer()
// dataset to transform
val df = Seq(
  (1, "Hello world!"),
  (2, "Here is yet another sentence.")).toDF("id", "sentence")

// configure the input/output columns and transform the dataset
val tokenized = tok.setInputCol("sentence").setOutputCol("tokens").transform(df)
scala> tokenized.show(truncate = false)
+---+-----------------------------+-----------------------------------+
|id |sentence |tokens |
+---+-----------------------------+-----------------------------------+
|1 |Hello world! |[hello, world!] |
|2 |Here is yet another sentence.|[here, is, yet, another, sentence.]|
+---+-----------------------------+-----------------------------------+
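The transformation itself is just lowercasing followed by a whitespace split, so a roughly equivalent column can be built with plain Spark SQL functions. A minimal sketch (manual is an illustrative name; Tokenizer wraps comparable logic in a UDF):

import org.apache.spark.sql.functions.{lower, split}

// roughly what Tokenizer computes per row: lowercase, then split on whitespace
val manual = df.withColumn("tokens", split(lower($"sentence"), "\\s"))
manual.show(truncate = false)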
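Note that the tokens in the output keep their punctuation (world!, sentence.) because Tokenizer splits on whitespace only. If you need a different rule, RegexTokenizer from the same package accepts a custom pattern. A minimal sketch, assuming the same df as above:

import org.apache.spark.ml.feature.RegexTokenizer

// split on runs of non-word characters so punctuation is dropped
val rtok = new RegexTokenizer()
  .setInputCol("sentence")
  .setOutputCol("tokens")
  .setPattern("\\W+")
rtok.transform(df).show(truncate = false)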