- Enclosing class:
- AvroFileHdfsReader
public static class AvroFileHdfsReader.AvroFileCheckpoint
extends java.lang.Object
An Avro file is laid out roughly like this:
Byte offset: 0        103            271         391
             ┌────────┬──────────────┬───────────┬───────────┐
Avro file:   │ Header │   Block 1    │  Block 2  │  Block 3  │ ...
             └────────┴──────────────┴───────────┴───────────┘
Each block contains multiple records. The start of each block is a valid
synchronization point, and a file reader can only seek to a synchronization point,
i.e., to the start of a block. Thus, to precisely describe the location of a record,
we need the pair (blockStart, recordOffset). Here "blockStart" is the byte offset at
which the block starts and "recordOffset" is the index of the record within the block.
In the example above, if block 1 has 4 records, the record sequence is:
(103, 0), (103, 1), (103, 2), (103, 3), (271, 0), ...
where (271, 0) represents the first record in block 2.
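The addressing scheme above can be sketched in a few lines of Java. This is only an illustration and not the actual AvroFileCheckpoint code: the class RecordPositionSketch and its method positionsInBlock are hypothetical names introduced here, and the block start and record count are taken from the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of the (blockStart, recordOffset) pair; not the real implementation. */
final class RecordPositionSketch {
    final long blockStart;    // byte offset of the block's synchronization point
    final long recordOffset;  // zero-based index of the record within that block

    RecordPositionSketch(long blockStart, long recordOffset) {
        this.blockStart = blockStart;
        this.recordOffset = recordOffset;
    }

    /** Lists the positions of every record in one block, in read order. */
    static List<RecordPositionSketch> positionsInBlock(long blockStart, long recordCount) {
        List<RecordPositionSketch> positions = new ArrayList<>();
        for (long i = 0; i < recordCount; i++) {
            positions.add(new RecordPositionSketch(blockStart, i));
        }
        return positions;
    }

    @Override
    public String toString() {
        return "(" + blockStart + ", " + recordOffset + ")";
    }

    public static void main(String[] args) {
        // Block 1 starts at byte 103 and holds 4 records, as in the example above.
        System.out.println(positionsInBlock(103, 4));
        // prints [(103, 0), (103, 1), (103, 2), (103, 3)]
    }
}
```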
With CP_DELIM set to '@', an actual checkpoint string looks like "103@1", "271@0",
or "271". For convenience, a checkpoint that gives only the blockStart and no
recordOffset refers to the first record in that block. Thus, "271" is equivalent
to "271@0".
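A minimal sketch of how such a checkpoint string could be parsed and rendered, assuming the delimiter is the single character '@' as stated above. The class CheckpointStringSketch and its parse method are hypothetical names used for illustration; the real AvroFileCheckpoint may validate or represent checkpoints differently.

```java
/** Hypothetical sketch of checkpoint-string handling; not the actual AvroFileCheckpoint code. */
final class CheckpointStringSketch {
    static final String CP_DELIM = "@";  // delimiter named in the description above

    final long blockStart;    // byte offset of the block's synchronization point
    final long recordOffset;  // zero-based index of the record within that block

    CheckpointStringSketch(long blockStart, long recordOffset) {
        this.blockStart = blockStart;
        this.recordOffset = recordOffset;
    }

    /** Parses "blockStart@recordOffset" or "blockStart" (recordOffset then defaults to 0). */
    static CheckpointStringSketch parse(String checkpoint) {
        // Limit -1 keeps trailing empty parts, so "271@" fails in parseLong instead of passing silently.
        String[] parts = checkpoint.split(CP_DELIM, -1);
        if (parts.length > 2) {
            throw new IllegalArgumentException("Malformed checkpoint: " + checkpoint);
        }
        long blockStart = Long.parseLong(parts[0]);
        long recordOffset = parts.length == 2 ? Long.parseLong(parts[1]) : 0L;
        return new CheckpointStringSketch(blockStart, recordOffset);
    }

    /** Renders the full form, e.g. "271@0". */
    @Override
    public String toString() {
        return blockStart + CP_DELIM + recordOffset;
    }

    public static void main(String[] args) {
        System.out.println(parse("103@1"));  // prints 103@1
        System.out.println(parse("271"));    // prints 271@0
    }
}
```

Rendering always emits the full "blockStart@recordOffset" form, so "271" and "271@0" normalize to the same string.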