public class DiskOrderedScanner
extends Object
Provides an enumeration of all key/data pairs in a database, striving to
fetch in disk order.
Unlike SortedLSNTreeWalker, for which the primary use case is preload, this
class notifies the callback while holding a latch only if that can be done
without blocking (e.g., when the callback can buffer the data without
blocking). This is appropriate for the DOS
(DiskOrderedCursor) use case, since the callback will block if the DOS queue
is full, and the user's consumer thread may not empty the queue as quickly
as it can be filled by the producer thread. If the callback were allowed to
block while a latch is held, this would block other threads attempting to
access the database, including JE internal threads, which would have a very
detrimental impact.
Algorithm
=========
Terminology
-----------
callback: object implementing the RecordProcessor interface
process: invoking the callback with a key-data pair
iteration: top level iteration consisting of phase I and II
phase I: accumulate LSNs
phase II: sort, fetch and process LSNs
Phase I and II
--------------
To avoid processing resident nodes (invoking the callback with a latch
held), a non-recursive algorithm is used. Instead of recursively
accumulating LSNs in a depth-first iteration of the tree (like the
SortedLSNTreeWalker algorithm), level 2 INs are traversed in phase I and
LSNs are accumulated for LNs or BINs (more on this below). When the memory
or LSN batch size limit is exceeded, phase I ends and all tree latches are
released. During phase II the previously accumulated LSNs are fetched and
the callback is invoked for each key or key-data pair. Since no latches are
held, it is permissible for the callback to block.
One iteration of phase I and II processes some subset of the database.
Since INs are traversed in tree order in phase I, this subset is described
by a range of keys. When performing the next iteration, the IN traversal is
restarted at the highest key that was processed by the previous iteration.
The previous highest key is used to avoid duplication of entries, since some
overlap between iterations may occur.
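The iteration structure above can be sketched with a small, self-contained
simulation. All names here are hypothetical; the real scanner traverses
level 2 INs under latches and invokes a RecordProcessor callback, while this
sketch stands in a sorted map for the Btree and integer keys for records:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical sketch of one iteration = phase I + phase II, repeated until
// the whole key range has been covered.
public class TwoPhaseScanSketch {

    static final int LSN_BATCH_SIZE = 4; // phase I ends at this many LSNs

    /** Returns keys in processed order; keyToLsn simulates the Btree. */
    public static List<Integer> scan(TreeMap<Integer, Long> keyToLsn) {
        List<Integer> processed = new ArrayList<>();
        Integer resumeKey = null; // highest key from the previous iteration
        while (true) {
            // Phase I: traverse in key (tree) order, accumulating LSNs
            // until the batch limit is hit. Latches would be held here.
            NavigableMap<Integer, Long> remaining = (resumeKey == null)
                ? keyToLsn : keyToLsn.tailMap(resumeKey, false);
            List<long[]> batch = new ArrayList<>(); // {lsn, key} pairs
            for (Map.Entry<Integer, Long> e : remaining.entrySet()) {
                batch.add(new long[] {e.getValue(), e.getKey()});
                if (batch.size() >= LSN_BATCH_SIZE) {
                    break;
                }
            }
            if (batch.isEmpty()) {
                break; // the whole key range has been covered
            }
            // Phase II: latches released; sort by LSN and fetch/process in
            // disk order. The callback may block freely here.
            batch.sort(Comparator.comparingLong(a -> a[0]));
            int highestKey = Integer.MIN_VALUE;
            for (long[] pair : batch) {
                processed.add((int) pair[1]); // "process" the record
                highestKey = Math.max(highestKey, (int) pair[1]);
            }
            resumeKey = highestKey; // restart above this key next time
        }
        return processed;
    }
}
```

Note that within each batch, records reach the callback in LSN (disk)
order rather than key order, which is the point of the exercise.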
LN and BIN modes
----------------
As mentioned above, we accumulate LSNs for either LNs or BINs. The BIN
accumulation mode provides an optimization for key-only traversals and for
all traversals of duplicate DBs (in a dup DB, the data is included in the
key). In these cases we never need to fetch the LN, so we can sort and
fetch the BIN LSNs instead. This supports at least some types of traversals
that remain efficient even when not all BINs are in the JE cache.
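A minimal sketch of the mode decision, assuming hypothetical keysOnly and
duplicatesDb flags analogous to the scanner's configuration:

```java
// Hypothetical predicate for choosing the accumulation mode; the flag
// names are illustrative, not the real scanner's fields.
public class AccumulationMode {

    /** True for BIN mode: sort and fetch BIN LSNs rather than LN LSNs. */
    public static boolean useBinMode(boolean keysOnly, boolean duplicatesDb) {
        // In a dup DB the data is embedded in the key, and a key-only
        // traversal never reads data, so the LN never needs fetching.
        return keysOnly || duplicatesDb;
    }
}
```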
We must only accumulate LN or BIN LSNs, never both, and never the LSNs of
other INs (above level 1). If we broke this rule, there would be no way to
constrain memory usage in our non-recursive approach, since we could not
easily predict in advance how much memory would be needed to fetch the
nested nodes. Even if we were able to predict the memory needed, there
would be little advantage in sorting and fetching a small number of higher
level nodes, only to accumulate the LSNs of their far more numerous
descendants. In a large data set, the much smaller number of higher level
nodes would likely be fetched via random IO anyway.
The above justification also applies to the algorithm we use in LN mode, in
which we accumulate and fetch only LN LSNs. In this mode we always fetch
BINs explicitly (not in LSN sorted order), if they are not resident, for the
reasons stated above.
Furthermore, in BIN mode we must account for BIN-deltas. Phase I must keep
a copy of any BIN-deltas encountered in the cache. And phase II must make two
passes for the accumulated LSNs: one pass to load the deltas and another to
load the full BINs and merge the previously loaded deltas. Unfortunately
we must budget memory for the deltas during phase I; since most BIN LSNs are
for deltas, not full BINs, we assume that we will need to temporarily save a
delta for each LSN. However, this two-pass approach differs from the
recursive algorithm rejected above in two respects: 1) we know in advance
(at least roughly) how much memory we will need for both passes, and 2) the
number of LSNs fetched in each pass is roughly the same.
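The two passes might be sketched with an in-memory stand-in for the log
(all names are hypothetical; the real code fetches log entries by LSN and
merges each delta into its full BIN):

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of phase II in BIN mode: pass 1 saves deltas and
// processes full BINs; pass 2 fetches full BINs and merges saved deltas.
public class BinDeltaTwoPass {

    /** A BIN-delta: the full BIN's LSN plus the changed entries. */
    record Delta(long fullBinLsn, Map<String, String> changedEntries) {}

    public static Map<String, String> scan(
            List<Long> accumulatedLsns,
            Map<Long, Delta> deltaLog,                   // LSN -> BIN-delta
            Map<Long, Map<String, String>> fullBinLog) { // LSN -> full BIN
        Map<String, String> result = new TreeMap<>();
        // Pass 1: fetch accumulated LSNs in sorted (disk) order, saving
        // deltas and processing full BINs directly.
        Collections.sort(accumulatedLsns);
        Map<Long, Delta> savedDeltas = new TreeMap<>(); // by full-BIN LSN
        for (long lsn : accumulatedLsns) {
            Delta d = deltaLog.get(lsn);
            if (d != null) {
                savedDeltas.put(d.fullBinLsn(), d); // budgeted in phase I
            } else {
                result.putAll(fullBinLog.get(lsn));
            }
        }
        // Pass 2: fetch the referenced full BINs in sorted order and merge
        // each previously saved delta before processing.
        for (Map.Entry<Long, Delta> e : savedDeltas.entrySet()) {
            Map<String, String> merged =
                new TreeMap<>(fullBinLog.get(e.getKey()));
            merged.putAll(e.getValue().changedEntries());
            result.putAll(merged);
        }
        return result;
    }
}
```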
Data Lag
--------
In phase I, as an exception to what was said above, we sometimes process
nodes that are resident in the Btree (in the JE cache) if this is possible
without blocking. The primary intention of this is to provide more recent
data to the callback. When accumulating BINs, if a BIN is dirty, then
fetching its last logged version later via its LSN would omit recently
written LNs. Therefore, if the callback would not block, we process the keys
in a dirty BIN during phase I. Likewise, when accumulating LNs in a
deferred-write database, we process dirty LNs if the callback would not
block. When accumulating LN LSNs for a non-deferred-write database, we can
go further and process all resident LNs, as long as the callback would not
block, since we know that no LNs are dirty.
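The non-blocking rule above might be sketched as follows, assuming a
bounded DOS-style queue; the names are illustrative, not the real API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;

// Hypothetical sketch: a resident record is processed during phase I only
// if the bounded queue can take it without blocking; otherwise its LSN is
// deferred to phase II, where no latches are held.
public class ResidentNodeSketch {

    final ArrayBlockingQueue<String> queue;
    final List<Long> deferredLsns = new ArrayList<>();

    ResidentNodeSketch(int queueSize) {
        queue = new ArrayBlockingQueue<>(queueSize);
    }

    /** Called in phase I with a latch held: must never block. */
    void visitResident(String record, long lsn) {
        // offer() fails immediately when the queue is full, so a full
        // queue defers the record rather than blocking under a latch.
        if (!queue.offer(record)) {
            deferredLsns.add(lsn); // fetched and processed in phase II
        }
    }
}
```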
In spite of our attempt to process resident nodes, we may not be able to
process all of them if doing so would cause the callback to block. When we
can't process a dirty, resident node, the changes made since the node was
last flushed (records inserted, deleted or updated) will not be visible to
the callback.
In other words, the data presented to the callback may lag back to the time
of the last checkpoint. It cannot lag further back than the last
checkpoint, because: 1) the scan doesn't accumulate LSNs any higher than the
BIN level, and 2) checkpoints flush all dirty BINs. For a DOS, the user may
decrease the likelihood of stale data by increasing the DOS queue size,
decreasing the LSN batch size, decreasing the memory limit, or performing a
checkpoint immediately before the start of the scan. Even so, it may be
impossible to guarantee that all records written at the start of the scan
are visible to the callback.
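As a hedged illustration, the tuning knobs mentioned above correspond to
setters on DiskOrderedCursorConfig in the JE API; this is a configuration
sketch only (verify the setter names against your JE release), with `env`
and `db` standing for an already open Environment and Database:

```java
import com.sleepycat.je.CheckpointConfig;
import com.sleepycat.je.Database;
import com.sleepycat.je.DiskOrderedCursor;
import com.sleepycat.je.DiskOrderedCursorConfig;
import com.sleepycat.je.Environment;

// Configuration sketch: reduce the data lag of a disk-ordered scan.
public class FreshScanConfig {

    static DiskOrderedCursor openFreshScan(Environment env, Database db) {
        // Flush dirty BINs so the scan starts from a very recent checkpoint.
        CheckpointConfig ckpt = new CheckpointConfig();
        ckpt.setForce(true);
        env.checkpoint(ckpt);

        DiskOrderedCursorConfig config = new DiskOrderedCursorConfig();
        config.setQueueSize(10_000);       // larger queue: fewer deferred nodes
        config.setLSNBatchSize(1_000);     // smaller batch: shorter iterations
        config.setInternalMemoryLimit(8L << 20); // lower limit: phase I ends sooner
        return db.openCursor(config);
    }
}
```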