Package io.delta.kernel.engine
Interface JsonHandler
Provides JSON handling functionality to Delta Kernel. Delta Kernel can use this client to parse JSON strings into ColumnarBatch or read content from JSON files. Connectors can leverage this interface to provide their best implementation of the JSON parsing capability to Delta Kernel.
- Since:
  - 3.0.0
Method Summary
- StructType deserializeStructType(String structTypeJson): Deserialize the Delta schema from structTypeJson according to the Delta Protocol schema serialization rules.
- ColumnarBatch parseJson(ColumnVector jsonStringVector, StructType outputSchema, Optional<ColumnVector> selectionVector): Parse the given JSON strings and return the fields requested by outputSchema as columns in a ColumnarBatch.
- CloseableIterator<ColumnarBatch> readJsonFiles(CloseableIterator<FileStatus> fileIter, StructType physicalSchema, Optional<Predicate> predicate): Read and parse the JSON format files at the given locations and return the data as a ColumnarBatch with the columns requested by physicalSchema.
- void writeJsonFileAtomically(String filePath, CloseableIterator<Row> data, boolean overwrite): Serialize each Row in the iterator as JSON and write it as a separate line in the destination file.
Method Details
parseJson
ColumnarBatch parseJson(ColumnVector jsonStringVector, StructType outputSchema, Optional<ColumnVector> selectionVector)
Parse the given JSON strings and return the fields requested by outputSchema as columns in a ColumnarBatch.
There are a few special cases that should be handled for specific data types:
- FloatType and DoubleType: handle non-numeric numbers encoded as strings
  - NaN: "NaN"
  - Positive infinity: "+INF", "Infinity", "+Infinity"
  - Negative infinity: "-INF", "-Infinity"
- DateType: handle dates encoded as strings in the format "yyyy-MM-dd"
- TimestampType: handle timestamps encoded as strings in the format "yyyy-MM-dd'T'HH:mm:ss.SSSXXX"
- Parameters:
  - jsonStringVector - String ColumnVector of valid JSON strings.
  - outputSchema - Schema of the data to return from the parsed JSON. If any requested fields are missing in the JSON string, a null is returned for that particular field in the returned Row. The type for each given field is expected to match the type in the JSON.
  - selectionVector - Optional selection vector indicating which rows to parse. If present, only the selected rows should be parsed. Unselected rows should be all null in the returned batch.
- Returns:
  - a ColumnarBatch of schema outputSchema with one row for each entry in jsonStringVector
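The special cases above can be sketched with the Java standard library alone. The class and method names below are hypothetical and are not part of the Delta Kernel API. The sketch assumes that the string has already been extracted from the JSON document, that a date is represented as days since 1970-01-01, and that a timestamp is represented as microseconds since the epoch. It also shows the selection vector rule on a plain array: unselected entries come back as null.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;
import java.util.Optional;

/** Hypothetical helper; not part of the Delta Kernel API. */
final class JsonSpecialValues {
    private static final DateTimeFormatter TIMESTAMP_FORMAT =
        DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSSXXX");

    /** Decode a double that may be encoded as one of the special strings. */
    static double parseDouble(String text) {
        switch (text) {
            case "NaN":
                return Double.NaN;
            case "+INF":
            case "Infinity":
            case "+Infinity":
                return Double.POSITIVE_INFINITY;
            case "-INF":
            case "-Infinity":
                return Double.NEGATIVE_INFINITY;
            default:
                return Double.parseDouble(text);
        }
    }

    /** Assumption: a date value is stored as days since 1970-01-01. */
    static int parseDate(String text) {
        // ISO_LOCAL_DATE, the default for LocalDate.parse, matches "yyyy-MM-dd".
        return Math.toIntExact(LocalDate.parse(text).toEpochDay());
    }

    /** Assumption: a timestamp value is stored as microseconds since the epoch. */
    static long parseTimestampMicros(String text) {
        OffsetDateTime ts = OffsetDateTime.parse(text, TIMESTAMP_FORMAT);
        return ChronoUnit.MICROS.between(Instant.EPOCH, ts.toInstant());
    }

    /** One output slot per input entry; unselected entries stay null. */
    static Double[] parseDoubles(String[] values, Optional<boolean[]> selection) {
        Double[] out = new Double[values.length];
        for (int i = 0; i < values.length; i++) {
            boolean selected = !selection.isPresent() || selection.get()[i];
            if (selected) {
                out[i] = parseDouble(values[i]);
            }
        }
        return out;
    }
}
```

A real implementation would apply the same rules per field while building the column vectors of the returned batch, rather than on standalone strings.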
deserializeStructType
StructType deserializeStructType(String structTypeJson)
Deserialize the Delta schema from structTypeJson according to the Delta Protocol schema serialization rules.
- Parameters:
  - structTypeJson - the JSON formatted schema string to parse
- Returns:
  - the parsed StructType
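To illustrate the kind of string that structTypeJson is expected to hold, the sketch below builds the serialized form of a flat struct schema. The layout (a "struct" object with a "fields" array whose entries carry "name", "type", "nullable" and "metadata") follows the Delta protocol's schema serialization format. The helper class itself is hypothetical, handles only primitive type names, and does not escape field names.

```java
import java.util.StringJoiner;

/** Hypothetical helper; not part of the Delta Kernel API. */
final class SchemaJsonSketch {
    /**
     * Build the serialized schema for a flat struct.
     * Each entry is {fieldName, primitiveTypeName, "true" or "false" for nullable}.
     * Field names are not escaped, so they must not contain quotes or backslashes.
     */
    static String structJson(String[][] fields) {
        StringJoiner joined =
            new StringJoiner(",", "{\"type\":\"struct\",\"fields\":[", "]}");
        for (String[] field : fields) {
            joined.add("{\"name\":\"" + field[0] + "\""
                + ",\"type\":\"" + field[1] + "\""
                + ",\"nullable\":" + field[2]
                + ",\"metadata\":{}}");
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        // Prints the schema string for a struct with a long and a string field.
        System.out.println(structJson(new String[][] {
            {"id", "long", "false"},
            {"name", "string", "true"}
        }));
    }
}
```

A string of this shape is what an implementation of deserializeStructType would be asked to turn back into a StructType.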
readJsonFiles
CloseableIterator<ColumnarBatch> readJsonFiles(CloseableIterator<FileStatus> fileIter, StructType physicalSchema, Optional<Predicate> predicate) throws IOException
Read and parse the JSON format files at the given locations and return the data as a ColumnarBatch with the columns requested by physicalSchema.
- Parameters:
  - fileIter - Iterator of files to read data from.
  - physicalSchema - Select list of columns to read from the JSON file.
  - predicate - Optional predicate which the JSON reader can use to prune rows that don't satisfy the predicate. Because pruning is optional and may be incomplete, the caller is still responsible for applying the predicate to the data returned by this method.
- Returns:
  - an iterator of ColumnarBatch objects containing the data in columnar format. It is the responsibility of the caller to close the iterator. The data is returned in the same order as the files given in fileIter.
- Throws:
  - IOException - if an I/O error occurs during the read.
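The contract around predicate can be sketched without the Kernel types. In the example below, lists of integers stand in for files and rows, and java.util.function.Predicate stands in for the Kernel Predicate. All names are hypothetical. The stand-in reader prunes only the first file, which models incomplete pruning, so the caller applies the predicate again to everything it gets back.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

/** Hypothetical stand-in types; not part of the Delta Kernel API. */
final class PredicateReapplySketch {
    /**
     * Stand-in reader: returns rows in the order of the given files and
     * prunes only rows of the first file, leaving later files unfiltered.
     */
    static List<Integer> read(List<List<Integer>> files,
                              Optional<Predicate<Integer>> predicate) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < files.size(); i++) {
            for (Integer row : files.get(i)) {
                boolean prune =
                    i == 0 && predicate.isPresent() && !predicate.get().test(row);
                if (!prune) {
                    rows.add(row);
                }
            }
        }
        return rows;
    }

    /** Caller side: pass the predicate down, then apply it again to the result. */
    static List<Integer> readAndFilter(List<List<Integer>> files,
                                       Predicate<Integer> predicate) {
        List<Integer> result = new ArrayList<>();
        for (Integer row : read(files, Optional.of(predicate))) {
            if (predicate.test(row)) {
                result.add(row);
            }
        }
        return result;
    }
}
```

With the real interface, the caller would additionally close the returned CloseableIterator, for example with try-with-resources.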
writeJsonFileAtomically
void writeJsonFileAtomically(String filePath, CloseableIterator<Row> data, boolean overwrite) throws IOException
Serialize each Row in the iterator as JSON and write it as a separate line in the destination file. This call either succeeds in creating the file with the given contents or no file is created at all. It won't leave behind a partially written file.
The supported data types and their serialization rules are listed below. At a high level, the JSON serialization is similar to that of the Jackson JSON serializer.
- Primitive types: boolean, byte, short, int, long, float, double, string
- struct: any element whose value is null is not written to file
- map: only a map with string key type is supported. If an entry value is null, it should be written to the file.
- array: null value elements are written to file
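The null handling rules above can be sketched as a small serializer over plain Java values. Everything here is hypothetical and is not part of the Delta Kernel API: a Struct marker class stands in for a struct value, java.util.Map for a map, and java.util.List for an array. The sketch escapes only quotes and backslashes in strings and does not handle non-finite floating point values.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical serializer; not part of the Delta Kernel API. */
final class RowJsonSketch {
    /** Marker type for struct values; a plain Map is treated as a map value. */
    static final class Struct {
        final Map<String, Object> fields = new LinkedHashMap<>();

        Struct with(String name, Object value) {
            fields.put(name, value);
            return this;
        }
    }

    static String toJson(Object value) {
        if (value == null) {
            return "null";
        }
        if (value instanceof String) {
            return quote((String) value);
        }
        if (value instanceof Struct) {
            // struct: null-valued fields are omitted
            return objectJson(((Struct) value).fields, false);
        }
        if (value instanceof Map) {
            // map: null entry values are written
            return objectJson((Map<?, ?>) value, true);
        }
        if (value instanceof List) {
            // array: null elements are written
            StringJoiner items = new StringJoiner(",", "[", "]");
            for (Object item : (List<?>) value) {
                items.add(toJson(item));
            }
            return items.toString();
        }
        return String.valueOf(value); // boolean and numeric values
    }

    private static String objectJson(Map<?, ?> entries, boolean keepNulls) {
        StringJoiner out = new StringJoiner(",", "{", "}");
        for (Map.Entry<?, ?> entry : entries.entrySet()) {
            if (entry.getValue() == null && !keepNulls) {
                continue;
            }
            out.add(quote(String.valueOf(entry.getKey())) + ":" + toJson(entry.getValue()));
        }
        return out.toString();
    }

    private static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }
}
```

Each top-level row would be serialized this way and written as one line of the destination file.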
- Parameters:
  - filePath - Fully qualified destination file path
  - data - Iterator of Row objects where each row should be serialized as JSON and written as a separate line in the destination file.
  - overwrite - If true, the file is overwritten if it already exists. If false and a file exists, FileAlreadyExistsException is thrown.
- Throws:
  - FileAlreadyExistsException - if the file already exists and overwrite is false.
  - IOException - if any other I/O error occurs.
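One common way to get the "all or nothing" behavior on a local file system is to write to a temporary file in the same directory and then rename it into place. The sketch below does this with java.nio.file. It is an illustration under stated assumptions, not how any particular engine implements writeJsonFileAtomically. The class name is hypothetical, rows are assumed to be already serialized to JSON strings, and the exists check followed by the move is not race free when overwrite is false. Whether ATOMIC_MOVE replaces an existing target is file system specific according to the Files.move documentation.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Iterator;

/** Hypothetical local file system sketch; not part of the Delta Kernel API. */
final class AtomicJsonWriteSketch {
    static void writeAtomically(Path target, Iterator<String> jsonLines, boolean overwrite)
            throws IOException {
        // Caveat: another writer could create the file between this check and the move.
        if (!overwrite && Files.exists(target)) {
            throw new FileAlreadyExistsException(target.toString());
        }
        // The temp file lives in the target directory so the final move is a rename.
        Path dir = target.toAbsolutePath().getParent();
        Path temp = Files.createTempFile(dir, ".tmp-", ".json");
        try {
            try (BufferedWriter writer =
                     Files.newBufferedWriter(temp, StandardCharsets.UTF_8)) {
                while (jsonLines.hasNext()) {
                    writer.write(jsonLines.next());
                    writer.write('\n');
                }
            }
            // Readers see either the old state or the complete new file.
            Files.move(temp, target, StandardCopyOption.ATOMIC_MOVE);
        } finally {
            // Removes the temp file if writing or moving failed; no-op after a successful move.
            Files.deleteIfExists(temp);
        }
    }
}
```

Object stores and distributed file systems usually need a different mechanism, such as a conditional put, to provide the same guarantee.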