UnsafeRow — Mutable Raw-Memory Unsafe Binary Row Format

UnsafeRow is a concrete InternalRow that represents a mutable internal raw-memory (and hence unsafe) binary row format.

In other words, UnsafeRow is an InternalRow that is backed by raw memory instead of Java objects.

// Use ExpressionEncoder for simplicity
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
val stringEncoder = ExpressionEncoder[String]
val row = stringEncoder.toRow("hello world")

import org.apache.spark.sql.catalyst.expressions.UnsafeRow
val unsafeRow = row match { case ur: UnsafeRow => ur }

scala> println(unsafeRow.getSizeInBytes)
32

scala> unsafeRow.getBytes
res0: Array[Byte] = Array(0, 0, 0, 0, 0, 0, 0, 0, 11, 0, 0, 0, 16, 0, 0, 0, 104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100, 0, 0, 0, 0, 0)

scala> unsafeRow.getUTF8String(0)
res1: org.apache.spark.unsafe.types.UTF8String = hello world

UnsafeRow supports Java’s Externalizable and Kryo’s KryoSerializable serialization/deserialization protocols.

The fields of a data row are placed using field offsets.

UnsafeRow’s mutable field data types (in alphabetical order):

  • BooleanType

  • ByteType

  • DateType

  • DoubleType

  • FloatType

  • IntegerType

  • LongType

  • NullType

  • ShortType

  • TimestampType

UnsafeRow is composed of three regions:

  1. Null Bit Set Bitmap Region (1 bit/field) for tracking null values

  2. Fixed-Length 8-Byte Values Region

  3. Variable-Length Data Section

That gives the property of rows being always 8-byte word aligned and so their size is always a multiple of 8 bytes.

Equality comparision and hashing of rows can be performed on raw bytes since if two rows are identical so should be their bit-wise representation. No type-specific interpretation is required.

isMutable Method

static boolean isMutable(DataType dt)

isMutable is enabled (i.e. returns true) when the input dt DataType is a mutable field type or DecimalType.

Otherwise, isMutable is disabled (i.e. returns false).

Note

isMutable is used when:

  • UnsafeFixedWidthAggregationMap does supportsAggregationBufferSchema

  • SortBasedAggregationIterator does newBuffer

Kryo’s KryoSerializable SerDe Protocol

Tip
Read up on KryoSerializable.

Serializing JVM Object — KryoSerializable’s write Method

void write(Kryo kryo, Output out)

Deserializing Kryo-Managed Object — KryoSerializable’s read Method

void read(Kryo kryo, Input in)

Java’s Externalizable SerDe Protocol

Tip
Read up on java.io.Externalizable.

Serializing JVM Object — Externalizable’s writeExternal Method

void writeExternal(ObjectOutput out)
throws IOException

Deserializing Java-Externalized Object — Externalizable’s readExternal Method

void readExternal(ObjectInput in)
throws IOException, ClassNotFoundException

results matching ""

    No results matching ""