- All Implemented Interfaces:
- java.io.Externalizable, java.io.Serializable, java.lang.Cloneable, water.Freezable
public class OrcParser
extends water.parser.Parser
ORC parser for the H2O distributed parsing subsystem.
The plan for parsing an ORC file is as follows:
1. Obtain a Reader, rdr, for the file.
2. From rdr, obtain the following information:
a. the number of columns, the column types and the column names. Only primitive types are supported;
b. the list of StripeInformation objects, which describes the stripes of data that need to be read;
c. for each stripe, details such as the number of rows and the data size in bytes.
3. Read the file in parallel, in whole numbers of stripes.
4. Within each stripe, read the data in batches of VectorizedRowBatch (1024 rows or fewer).
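
Steps 3 and 4 can be illustrated with a small, library-free sketch. It is not the H2O implementation: the `Stripe` class is a hypothetical stand-in for ORC's StripeInformation that carries only a row count and a data size, and the greedy byte-target policy in `planChunks` is an assumption chosen for illustration. Only the 1024-row batch limit comes from the description above.

```java
import java.util.ArrayList;
import java.util.List;

class StripePlanSketch {
    /** Hypothetical stand-in for StripeInformation: rows and data size only. */
    static final class Stripe {
        final long numberOfRows;
        final long dataLength;
        Stripe(long numberOfRows, long dataLength) {
            this.numberOfRows = numberOfRows;
            this.dataLength = dataLength;
        }
    }

    /** Upper bound on rows per batch, as stated in step 4. */
    static final int MAX_BATCH_ROWS = 1024;

    /**
     * Groups stripe indices into units of parallel work. A stripe is never
     * split: a unit is closed once its accumulated size reaches targetBytes
     * (assumed policy, for illustration only).
     */
    static List<List<Integer>> planChunks(List<Stripe> stripes, long targetBytes) {
        List<List<Integer>> chunks = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        long bytes = 0;
        for (int i = 0; i < stripes.size(); i++) {
            current.add(i);
            bytes += stripes.get(i).dataLength;
            if (bytes >= targetBytes) {
                chunks.add(current);
                current = new ArrayList<>();
                bytes = 0;
            }
        }
        if (!current.isEmpty()) {
            chunks.add(current); // trailing stripes that did not reach the target
        }
        return chunks;
    }

    /** Row counts of the successive batches needed to read one stripe. */
    static List<Integer> batchSizes(long rowsInStripe) {
        List<Integer> sizes = new ArrayList<>();
        long remaining = rowsInStripe;
        while (remaining > 0) {
            int n = (int) Math.min(MAX_BATCH_ROWS, remaining);
            sizes.add(n);
            remaining -= n;
        }
        return sizes;
    }

    public static void main(String[] args) {
        List<Stripe> stripes = new ArrayList<>();
        stripes.add(new Stripe(5000, 60));
        stripes.add(new Stripe(5000, 60));
        stripes.add(new Stripe(2500, 30));
        System.out.println(planChunks(stripes, 100)); // prints [[0, 1], [2]]
        System.out.println(batchSizes(2500));         // prints [1024, 1024, 452]
    }
}
```

Because every unit of work covers whole stripes, each parallel task can read its stripes independently, and only the last batch of a stripe may hold fewer than 1024 rows.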
- See Also:
- Serialized Form