package functions
Type Members
-
trait
AggregateFunction[S <: Serializable, R] extends BoundFunction
Interface for a function that produces a result value by aggregating over multiple input rows.
For each input row, Spark will call the #update method, which should evaluate the row and update the aggregation state. The JVM type of result values produced by #produceResult must be the type used by Spark's InternalRow API for the SQL data type returned by #resultType(). Please refer to the class documentation of ScalarFunction for the mapping between DataType and the JVM type.
All implementations must support partial aggregation by implementing #merge so that Spark can partially aggregate and shuffle intermediate results, instead of shuffling all rows for an aggregate. This reduces the impact of data skew and the amount of data shuffled to produce the result.
Intermediate aggregation state must be Serializable so that state produced by parallel tasks can be serialized, shuffled, and then merged to produce a final result.
- Annotations
- @Evolving()
- Since
3.2.0
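The partial-aggregation contract described above (update per row, merge intermediate states, then produce the final result) can be sketched as follows. This is a self-contained illustration using simplified stand-ins, not Spark's actual interfaces: the real AggregateFunction's #update takes an InternalRow, and implementations must also provide the BoundFunction methods such as #inputTypes() and #resultType().

```java
import java.io.Serializable;

// Sketch of AggregateFunction's partial-aggregation contract. The nested
// Agg interface is a hypothetical stand-in for Spark's AggregateFunction.
public class IntegerSumSketch {
  interface Agg<S extends Serializable, R> {
    S newAggregationState();
    S update(S state, int input);   // real signature: S update(S, InternalRow)
    S merge(S left, S right);       // enables partial aggregation and shuffle of states
    R produceResult(S state);
  }

  // Intermediate state must be Serializable so it can be shuffled.
  public static class SumState implements Serializable {
    public long sum;
  }

  public static class IntegerSum implements Agg<SumState, Long> {
    public SumState newAggregationState() { return new SumState(); }
    public SumState update(SumState s, int v) { s.sum += v; return s; }
    public SumState merge(SumState a, SumState b) { a.sum += b.sum; return a; }
    public Long produceResult(SumState s) { return s.sum; }
  }

  public static void main(String[] args) {
    IntegerSum f = new IntegerSum();
    // Two tasks build partial states in parallel...
    SumState p1 = f.newAggregationState();
    for (int v : new int[] {1, 2, 3}) f.update(p1, v);
    SumState p2 = f.newAggregationState();
    for (int v : new int[] {4, 5}) f.update(p2, v);
    // ...then the intermediate states are merged to produce the final result.
    System.out.println(f.produceResult(f.merge(p1, p2))); // prints 15
  }
}
```

Because merge combines two already-aggregated states, Spark only needs to shuffle one small state object per partition instead of every input row.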
-
trait
BoundFunction extends Function
Represents a function that is bound to an input type.
- Annotations
- @Evolving()
- Since
3.2.0
-
trait
Function extends Serializable
Base class for user-defined functions.
- Annotations
- @Evolving()
- Since
3.2.0
-
trait
ScalarFunction[R] extends BoundFunction
Interface for a function that produces a result value for each input row.
To evaluate each input row, Spark will first try to look up and use a "magic method" (described below) through Java reflection. If the method is not found, Spark will call #produceResult(InternalRow) as a fallback approach.
The JVM type of result values produced by this function must be the type used by Spark's InternalRow API for the SQL data type returned by #resultType(). The mapping between DataType and the corresponding JVM type is defined below.
Magic method
IMPORTANT: the default implementation of #produceResult throws UnsupportedOperationException. Users must choose to either override this method, or implement a magic method with the name #MAGIC_METHOD_NAME, which takes individual parameters instead of an InternalRow. The magic method approach is generally recommended because it provides better performance than the default #produceResult, due to optimizations such as whole-stage codegen and the elimination of Java boxing.
The type parameters for the magic method must match those returned from BoundFunction#inputTypes(). Otherwise Spark will not be able to find the magic method.
In addition, for stateless Java functions, users can optionally define #MAGIC_METHOD_NAME as a static method, which further avoids certain runtime costs such as Java dynamic dispatch.
For example, a scalar UDF for adding two integers can be defined as follows with the magic method approach:
```java
public class IntegerAdd implements ScalarFunction<Integer> {
  public DataType[] inputTypes() {
    return new DataType[] { DataTypes.IntegerType, DataTypes.IntegerType };
  }

  public int invoke(int left, int right) {
    return left + right;
  }
}
```
In the above, since #MAGIC_METHOD_NAME is defined, and it has matching parameter types and return type, Spark will use it to evaluate inputs.
As another example, in the following:
```java
public class IntegerAdd implements ScalarFunction<Integer> {
  public DataType[] inputTypes() {
    return new DataType[] { DataTypes.IntegerType, DataTypes.IntegerType };
  }

  public static int invoke(int left, int right) {
    return left + right;
  }

  public Integer produceResult(InternalRow input) {
    return input.getInt(0) + input.getInt(1);
  }
}
```
the class defines both the magic method and #produceResult, and Spark will use #MAGIC_METHOD_NAME over #produceResult(InternalRow) as it takes higher precedence. Also note that the magic method is declared as a static method in this case.
Resolution of the magic method is done during query analysis: Spark looks up the magic method by first converting the actual input SQL data types to their corresponding Java types following the mapping defined below, and then checking whether there is a matching method among all the declared methods in the UDF class, using the method name and the Java types.
Handling of nullable primitive arguments
The handling of null primitive arguments differs between the magic method approach and the #produceResult approach. With the former, whenever any of the method arguments meets all of the following conditions:
- the argument is of primitive type
- the argument is nullable
- the value of the argument is null
Spark will return null directly instead of calling the magic method. On the other hand, Spark will pass null primitive arguments to #produceResult, and it is the user's responsibility to handle them in the function implementation.
Because of this difference, if Spark users want to implement special handling of nulls for nullable primitive arguments, they should override the #produceResult method instead of using the magic method approach.
Spark data type to Java type mapping
The following is the mapping from SQL data type to Java type, which is used by Spark to infer parameter types for the magic methods as well as the return value type for #produceResult:
- org.apache.spark.sql.types.BooleanType: boolean
- org.apache.spark.sql.types.ByteType: byte
- org.apache.spark.sql.types.ShortType: short
- org.apache.spark.sql.types.IntegerType: int
- org.apache.spark.sql.types.LongType: long
- org.apache.spark.sql.types.FloatType: float
- org.apache.spark.sql.types.DoubleType: double
- org.apache.spark.sql.types.StringType: org.apache.spark.unsafe.types.UTF8String
- org.apache.spark.sql.types.DateType: int
- org.apache.spark.sql.types.TimestampType: long
- org.apache.spark.sql.types.BinaryType: byte[]
- org.apache.spark.sql.types.DayTimeIntervalType: long
- org.apache.spark.sql.types.YearMonthIntervalType: int
- org.apache.spark.sql.types.DecimalType: org.apache.spark.sql.types.Decimal
- org.apache.spark.sql.types.StructType: InternalRow
- org.apache.spark.sql.types.ArrayType: org.apache.spark.sql.catalyst.util.ArrayData
- org.apache.spark.sql.types.MapType: org.apache.spark.sql.catalyst.util.MapData
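The null-handling contrast between the magic method path and the #produceResult path can be sketched as follows. This is a self-contained illustration with a hypothetical Row stand-in, not Spark's actual dispatch code:

```java
// Sketch of the two null-handling behaviors for nullable primitive arguments.
// Row is a hypothetical stand-in for Spark's InternalRow.
public class NullHandlingSketch {
  static class Row {
    final Integer[] values;
    Row(Integer... values) { this.values = values; }
    boolean isNullAt(int i) { return values[i] == null; }
    int getInt(int i) { return values[i]; }
  }

  // Magic-method path: Spark short-circuits and returns null itself when a
  // nullable primitive argument is null, so invoke() never sees a null.
  static Integer callViaMagicMethod(Row input) {
    if (input.isNullAt(0) || input.isNullAt(1)) return null; // Spark's short-circuit
    return invoke(input.getInt(0), input.getInt(1));
  }

  static int invoke(int left, int right) { return left + right; }

  // produceResult path: nulls reach the function; the user must handle them.
  static Integer produceResult(Row input) {
    if (input.isNullAt(0) || input.isNullAt(1)) return 0; // custom null handling
    return input.getInt(0) + input.getInt(1);
  }

  public static void main(String[] args) {
    System.out.println(callViaMagicMethod(new Row(null, 2))); // prints null
    System.out.println(produceResult(new Row(null, 2)));      // prints 0
  }
}
```

The returning-zero behavior above is only possible on the #produceResult path; the magic method can never observe or replace a null primitive argument.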
- Annotations
- @Evolving()
- Since
3.2.0
-
trait
UnboundFunction extends Function
Represents a user-defined function that is not bound to input types.
- Annotations
- @Evolving()
- Since
3.2.0
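The relationship between these traits (an UnboundFunction is resolved against the caller's input types to yield a BoundFunction) can be sketched with simplified stand-ins. In the real API, bind() takes an org.apache.spark.sql.types.StructType and the type names below are hypothetical placeholders:

```java
import java.io.Serializable;
import java.util.Arrays;
import java.util.List;

// Sketch of the UnboundFunction -> BoundFunction lifecycle. The nested
// interfaces are stand-ins for Spark's Function, BoundFunction, and
// UnboundFunction; type names are plain strings for illustration.
public class BindSketch {
  interface Function extends Serializable { String name(); }

  // Stand-in for BoundFunction: knows its input and result types.
  interface Bound extends Function {
    List<String> inputTypes();
    String resultType();
  }

  // Stand-in for UnboundFunction: resolves against the caller's input schema.
  interface Unbound extends Function {
    Bound bind(List<String> inputSchema);
  }

  static class UnboundAdd implements Unbound {
    public String name() { return "add"; }
    public Bound bind(List<String> inputSchema) {
      // Reject input types the function cannot handle.
      if (inputSchema.size() != 2 || !inputSchema.stream().allMatch("int"::equals)) {
        throw new UnsupportedOperationException("add expects (int, int)");
      }
      return new Bound() {
        public String name() { return "add"; }
        public List<String> inputTypes() { return inputSchema; }
        public String resultType() { return "int"; }
      };
    }
  }

  public static void main(String[] args) {
    Bound bound = new UnboundAdd().bind(Arrays.asList("int", "int"));
    System.out.println(bound.resultType()); // prints int
  }
}
```

Binding happens once during analysis, so type checking and rejection of unsupported inputs occur before any rows are processed.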