Converts from Dataset[Row] into Dataset[Array[Byte]] containing Avro records.
Intended to be used when there is a Spark schema present in the Dataframe, from which the Avro schema will be translated.
The API infers the Avro schema from the incoming Dataframe. The inferred schema receives the name and namespace provided as parameters.
The API throws an exception if the Dataframe does not have a schema.
Unlike the other API, this one does not suffer from the schema-change issue, since the final Avro schema is derived from the schema Spark already uses.
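A minimal sketch of this flow using Spark's built-in spark-avro module (the record name, namespace, and the Dataframe `df` below are illustrative assumptions, not this library's exact API):

```scala
import org.apache.spark.sql.avro.SchemaConverters
import org.apache.spark.sql.avro.functions.to_avro
import org.apache.spark.sql.functions.{col, struct}

// Derive the Avro schema from the Spark schema already attached to the
// Dataframe, applying the record name and namespace passed as parameters.
val avroSchema = SchemaConverters.toAvroType(
  df.schema,
  nullable   = false,
  recordName = "Person",      // assumed record name parameter
  nameSpace  = "com.example"  // assumed namespace parameter
)

// Serialize every Row into a single column of Avro-encoded Array[Byte].
val avroDf = df.select(to_avro(struct(df.columns.map(col): _*)).as("value"))
```

Because the Avro schema is computed from `df.schema` at conversion time, it always agrees with what Spark actually holds in memory.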
Converts from Dataset[Row] into Dataset[Array[Byte]] containing Avro records.
Intended to be used when there is no Spark schema available in the Dataframe but an Avro schema is expected.
It is important to keep in mind that the specification of a field in the schema MUST be the same at both ends, writer and reader. For some fields (e.g. strings), Spark can ignore the nullability specified in the SQL struct (SPARK-14139). This can lead to fields being silently dropped, so it is important to check the final SQL schema after Spark has created the Dataframes.
For instance, the Spark construct 'StructField("name", StringType, false)' translates to the Avro field {"name": "name", "type": "string"}. However, if Spark changes the nullability (StructField("name", StringType, true)), the Avro field becomes a union: {"name": "name", "type": ["string", "null"]}.
This mismatch between writer and reader specifications will prevent the field from being correctly loaded by Avro readers, leading to data loss.
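The effect of the nullability flag can be inspected directly with Spark's schema converter (an illustration using the spark-avro module, not this library's own code):

```scala
import org.apache.spark.sql.avro.SchemaConverters
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// nullable = false: the field translates to the plain Avro type "string".
val strict  = StructType(Seq(StructField("name", StringType, nullable = false)))

// nullable = true: the field translates to the union ["string", "null"].
val relaxed = StructType(Seq(StructField("name", StringType, nullable = true)))

println(SchemaConverters.toAvroType(strict))
println(SchemaConverters.toAvroType(relaxed))
```

If the reader's schema declares the plain type while the writer produced the union (or vice versa), the field specifications no longer match.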
This class provides methods to translate Dataframe Rows into Avro records on the fly.
Users can either provide the path to the destination Avro schema, or provide a record name and namespace so that the schema is inferred from the Dataframe.
The methods are storage-agnostic: they produce Dataframes of Avro records which can be stored in any sink (e.g. Kafka, Parquet).
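For example, a Dataframe `avroDf` holding the Avro payloads in a binary column named "value" (names assumed for illustration) could be written to either kind of sink:

```scala
// Kafka expects a binary "value" column, which the Avro payload provides.
avroDf.write
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker address
  .option("topic", "avro-records")                     // assumed topic name
  .save()

// The same Dataframe can equally be written to a file-based sink.
avroDf.write.parquet("/path/to/avro-records")
```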