@Evolving public interface JsonHandler
Provides the ability to parse JSON strings into a ColumnarBatch or read content from JSON files. Connectors can leverage
this interface to provide their best implementation of the JSON parsing capability to Delta
Kernel.

| Modifier and Type | Method and Description |
|---|---|
| ColumnarBatch | parseJson(ColumnVector jsonStringVector, StructType outputSchema, java.util.Optional<ColumnVector> selectionVector): Parse the given JSON strings and return the fields requested by outputSchema as columns in a ColumnarBatch. |
| CloseableIterator<ColumnarBatch> | readJsonFiles(CloseableIterator<FileStatus> fileIter, StructType physicalSchema, java.util.Optional<Predicate> predicate): Read and parse the JSON files at the given locations and return the data as ColumnarBatch objects with the columns requested by physicalSchema. |
| void | writeJsonFileAtomically(String filePath, CloseableIterator<Row> data, boolean overwrite): Serialize each Row in the iterator as JSON and write it as a separate line in the destination file. |

ColumnarBatch parseJson(ColumnVector jsonStringVector, StructType outputSchema, java.util.Optional<ColumnVector> selectionVector)
Parse the given JSON strings and return the fields requested by outputSchema as
columns in a ColumnarBatch.
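There are a few special cases that should be handled for specific data types:
float and double: non-finite values are encoded as strings
NaN: "NaN"
positive infinity: "+INF", "Infinity", "+Infinity"
negative infinity: "-INF", "-Infinity"
date: parsed from a string in the format "yyyy-MM-dd"
timestamp: parsed from a string in the format "yyyy-MM-dd'T'HH:mm:ss.SSSXXX"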
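The special cases above could be handled along the following lines. This is only a sketch using the JDK: the class name JsonSpecialValues and its methods are hypothetical and not part of Delta Kernel, and returning LocalDate and Instant is an assumption made for illustration, since a real handler would convert to whatever representation its ColumnVector implementation uses.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of Delta Kernel.
final class JsonSpecialValues {
    private static final DateTimeFormatter DATE_FORMAT =
        DateTimeFormatter.ofPattern("yyyy-MM-dd");
    private static final DateTimeFormatter TIMESTAMP_FORMAT =
        DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSSXXX");

    // Maps the string encodings of non-finite values; everything else
    // is parsed as an ordinary decimal number.
    static double parseDouble(String text) {
        switch (text) {
            case "NaN":
                return Double.NaN;
            case "+INF":
            case "Infinity":
            case "+Infinity":
                return Double.POSITIVE_INFINITY;
            case "-INF":
            case "-Infinity":
                return Double.NEGATIVE_INFINITY;
            default:
                return Double.parseDouble(text);
        }
    }

    // Parses a "yyyy-MM-dd" date string.
    static LocalDate parseDate(String text) {
        return LocalDate.parse(text, DATE_FORMAT);
    }

    // Parses a "yyyy-MM-dd'T'HH:mm:ss.SSSXXX" timestamp string and
    // normalizes it to an instant on the UTC timeline.
    static Instant parseTimestamp(String text) {
        return OffsetDateTime.parse(text, TIMESTAMP_FORMAT).toInstant();
    }
}
```

For example, JsonSpecialValues.parseDouble("+INF") yields Double.POSITIVE_INFINITY, and a timestamp string with a "+02:00" offset is shifted back two hours when converted to an Instant.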
Parameters:
jsonStringVector - String ColumnVector of valid JSON strings.
outputSchema - Schema of the data to return from the parsed JSON. If any requested fields are missing in the JSON string, a null is returned for that particular field in the returned Row. The type for each given field is expected to match the type in the JSON.
selectionVector - Optional selection vector indicating which rows to parse. If present, only the selected rows should be parsed. Unselected rows should be all null in the returned batch.
Returns:
a ColumnarBatch of schema outputSchema with one row for each entry in jsonStringVector

CloseableIterator<ColumnarBatch> readJsonFiles(CloseableIterator<FileStatus> fileIter, StructType physicalSchema, java.util.Optional<Predicate> predicate) throws java.io.IOException
Read and parse the JSON files at the given locations and return the data as ColumnarBatch objects with the columns requested by physicalSchema.
Parameters:
fileIter - Iterator of files to read data from.
physicalSchema - Select list of columns to read from the JSON file.
predicate - Optional predicate which the JSON reader can use to prune rows that don't satisfy the predicate. Because pruning is optional and may be incomplete, the caller is still responsible for applying the predicate to the data returned by this method.
Returns:
an iterator of ColumnarBatch objects containing the data in columnar format. It is the responsibility of the caller to close the iterator. The data is returned in the same order as the files given in fileIter.
Throws:
java.io.IOException - if an I/O error occurs during the read.

void writeJsonFileAtomically(String filePath, CloseableIterator<Row> data, boolean overwrite) throws java.io.IOException
Serialize each Row in the iterator as JSON and write it as a separate line in the destination
file. This call either succeeds in creating the file with the given contents or creates no file
at all. It won't leave behind a partially written file.
Following are the supported data types and their serialization rules. At a high level, the
JSON serialization is similar to that of the Jackson JSON serializer.
struct: any element whose value is null is not written to file
map: only a map with string key type is supported. If an entry
value is null, it should be written to the file.
array: null value elements are written to file
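The null-handling rules above can be illustrated with a small hand-rolled serializer. This is a sketch under stated assumptions, not the Delta Kernel implementation: the class NullRules and its nested Struct type are hypothetical stand-ins for a Row, a java.util.Map stands in for a map value, and a java.util.List for an array value. String escaping is deliberately minimal.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical illustration of the null-handling rules; not Delta Kernel code.
final class NullRules {
    // Stand-in for a struct value: ordered field name -> value.
    static final class Struct {
        final Map<String, Object> fields = new LinkedHashMap<>();

        Struct with(String name, Object value) {
            fields.put(name, value);
            return this;
        }
    }

    static String toJson(Object value) {
        if (value == null) {
            return "null";
        }
        if (value instanceof Struct) {
            StringJoiner out = new StringJoiner(",", "{", "}");
            for (Map.Entry<String, Object> field : ((Struct) value).fields.entrySet()) {
                // struct: fields whose value is null are not written
                if (field.getValue() != null) {
                    out.add(quote(field.getKey()) + ":" + toJson(field.getValue()));
                }
            }
            return out.toString();
        }
        if (value instanceof Map) {
            StringJoiner out = new StringJoiner(",", "{", "}");
            for (Map.Entry<?, ?> entry : ((Map<?, ?>) value).entrySet()) {
                // map: only string keys are supported; null values are written
                out.add(quote((String) entry.getKey()) + ":" + toJson(entry.getValue()));
            }
            return out.toString();
        }
        if (value instanceof List) {
            StringJoiner out = new StringJoiner(",", "[", "]");
            for (Object element : (List<?>) value) {
                // array: null elements are written
                out.add(toJson(element));
            }
            return out.toString();
        }
        if (value instanceof String) {
            return quote((String) value);
        }
        return String.valueOf(value); // finite numbers and booleans
    }

    // Minimal escaping (backslash and double quote only), enough for the sketch.
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }
}
```

With this sketch, a struct with fields a = 1 and b = null serializes without b, while a null map value or a null array element appears as a literal null in the output.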
Parameters:
filePath - Fully qualified destination file path.
data - Iterator of Row objects where each row should be serialized as JSON and written as a separate line in the destination file.
overwrite - If true, the file is overwritten if it already exists. If false and the file already exists, a FileAlreadyExistsException is thrown.
Throws:
java.nio.file.FileAlreadyExistsException - if the file already exists and overwrite is false.
java.io.IOException - if any other I/O error occurs.
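One common way to get the all-or-nothing behavior described for writeJsonFileAtomically on a local file system is to write to a temporary file in the same directory and then rename it into place. The sketch below assumes a local java.nio file system and rows that are already serialized to JSON strings. The class AtomicLineWriter is hypothetical, and object stores or distributed file systems generally need a different mechanism.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Iterator;

// Hypothetical helper, not part of Delta Kernel.
final class AtomicLineWriter {
    static void writeAtomically(Path target, Iterator<String> jsonLines, boolean overwrite)
            throws IOException {
        // Best-effort check only: another writer could create the file between
        // this check and the rename below, so this is not race-free.
        if (!overwrite && Files.exists(target)) {
            throw new FileAlreadyExistsException(target.toString());
        }
        Path dir = target.toAbsolutePath().getParent();
        // Same directory as the target, so the rename stays on one file system.
        Path temp = Files.createTempFile(dir, ".tmp-", ".json");
        try {
            try (BufferedWriter writer = Files.newBufferedWriter(temp, StandardCharsets.UTF_8)) {
                while (jsonLines.hasNext()) {
                    writer.write(jsonLines.next());
                    writer.write('\n'); // one JSON document per line
                }
            }
            // Whether an existing target is replaced by an atomic move is
            // implementation specific according to the Files.move documentation.
            Files.move(temp, target, StandardCopyOption.ATOMIC_MOVE);
        } finally {
            // No-op after a successful move; removes the partial file on failure.
            Files.deleteIfExists(temp);
        }
    }
}
```

Readers never observe a half-written target in this scheme, because the target path only appears once the fully written temporary file is renamed.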