public class RecoveryManager
extends Object
Performs recovery when an Environment is opened.
TODO: Need a description of the recovery algorithm here. For some related
information, see the Checkpointer class comments.
Recovery, the INList and Eviction
=================================
There are two major steps in recovery: 1) recover the mapping database and
the INs for all other databases, 2) recover the LNs for the other databases.
In the buildTree method, step 1 comes before the call to buildINList and
step 2 comes after that. The INList is not maintained in step 1.
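The two-step structure can be sketched as follows. This is a hypothetical simplification; the method names mirror the description above, not the real signatures in RecoveryManager.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical outline of buildTree's two recovery steps; the method
// names mirror the description above, not the real RecoveryManager API.
public class BuildTreeSketch {

    final List<String> phases = new ArrayList<>();
    boolean inListEnabled = false;   // the INList is disabled in step 1

    // Step 1: recover the mapping database and the INs for all other
    // databases.  The INList is not maintained here.
    void recoverMapDbAndINs() { phases.add("step1"); }

    // Between the steps: enable the INList and populate it from the
    // Btrees constructed in step 1.
    void buildINList() {
        inListEnabled = true;
        phases.add("buildINList");
    }

    // Step 2: recover the LNs; eviction is now possible.
    void recoverLNs() { phases.add("step2"); }

    void buildTree() {
        recoverMapDbAndINs();
        buildINList();
        recoverLNs();
    }
}
```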
The INList is not maintained in step 1 because there is no benefit -- we
cannot evict anyway as explained below -- and there are potential drawbacks
to maintaining it: added complexity and decreased performance. The
drawbacks are described in more detail further below.
Even if the INList were maintained in step 1, eviction could not be enabled
until step 2, because logging is not allowed until all the INs are in place.
In principle we could evict non-dirty nodes in step 1, but since recovery is
dirtying the tree as it goes, there would be little or nothing that is
non-dirty and could be evicted.
Therefore, the INList has an 'enabled' mode that is initially false (in step
1) and is set to true by buildINList, just before step 2. The mechanism for
adding nodes to the INList is skipped when it is disabled. In addition to
enabling it, buildINList populates it from the contents of the Btrees that
were constructed in step 1. In step 2, eviction is invoked explicitly by
calling EnvironmentImpl.invokeEvictor often during recovery. This is
important since the background evictor thread is not yet running.
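As a hedged illustration, the 'enabled' mode might look like this; INListSketch and its methods are simplifications invented here, not the real JE INList API.

```java
import java.util.HashSet;
import java.util.Set;

// Simplified sketch of the INList's 'enabled' mode described above.  The
// class and method names are illustrative, not the real JE INList API.
public class INListSketch {

    private final Set<Long> nodeIds = new HashSet<>();
    private boolean enabled = false;   // false during recovery step 1

    // The mechanism for adding nodes is skipped while disabled.
    public void add(long nodeId) {
        if (enabled) {
            nodeIds.add(nodeId);
        }
    }

    // Called by buildINList just before step 2: enable the list and
    // populate it from the Btrees constructed in step 1.
    public void enable(Iterable<Long> btreeNodeIds) {
        enabled = true;
        for (long id : btreeNodeIds) {
            nodeIds.add(id);
        }
    }

    public int size() { return nodeIds.size(); }
}
```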
An externally visible limitation created by this situation is that the nodes
placed in the Btree during step 1 must all fit in memory, since no eviction
is performed. So memory limits how large a recovery can be. Since
eviction is allowed in step 2, and step 2 is where the
bulk of the recovery is normally performed, this limitation of step 1 hasn't
been a critical problem.
Maintaining the INList
----------------------
In this section we consider the impact of maintaining the INList in step 1,
if this were done in a future release. It is being considered so we can
rely on the INList to reference INs by node ID in the in-memory
representation of an IN (see the Big Memory SR [#22292]).
To maintain the INList in step 1, when a branch of a tree (a parent IN) is
spliced in, the previous branch (all of the previous node's children) would
have to be removed from the INList. Doing this incorrectly could cause an
OOME, and it may also have a performance impact.
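The removal described above might be sketched as follows; the IN class, the splice method, and the INList-as-a-set representation are all simplifying assumptions, not the real JE Tree code.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hedged sketch of splicing in a replayed parent IN while maintaining the
// INList: the previous node and its resident children come off the list.
// All names here are illustrative, not the real JE API.
public class SpliceSketch {

    static class IN {
        final long nodeId;
        final List<IN> residentChildren = new ArrayList<>();
        IN(long id) { nodeId = id; }
    }

    static void splice(Set<Long> inList, IN oldNode, IN newNode) {
        if (oldNode != null) {
            inList.remove(oldNode.nodeId);
            for (IN child : oldNode.residentChildren) {
                // Grandchildren are assumed non-resident, so a one-level
                // removal suffices in this sketch.
                inList.remove(child.nodeId);
            }
        }
        inList.add(newNode.nodeId);
    }
}
```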
The performance impact of removing the previous branch from the INList is
difficult to estimate. In the normal case (recovery after normal shutdown),
very few nodes will be replaced because normally only nodes at the max flush
level are replayed, and the slots they are placed into will be empty (not
resident). Here is a description of a worst-case scenario, which occurs when
there is a crash near the end of a checkpoint:
+ The last checkpoint is large, includes all nodes in the tree, is mostly
complete, but was not finished (no CkptEnd). The middle INs (above BIN
and below the max flush level) must be replayed (see Checkpointer and
Provisional.BEFORE_CKPT_END).
+ For these middle INs, the INs at each level are placed in the tree and
replace any IN present in the slot. For the bottom-most level of middle
INs (level 2), they don't replace a node (the slot will be empty because
BINs are never replayed). But for the middle INs in all levels above
that, they replace a node that was fetched earlier; it was fetched
because it is the parent of a node at a lower level that was replayed.
+ In the worst case, all INs from level 3 to R-1, where R is the root
level, would be replayed and replace a node. However, it seems the
replaced node would not have resident children in the scenario described,
so the cost of removing it from the INList does not seem excessive.
+ Here's an example illustrating this scenario. The BINs and their parents
(as a sub-tree) are logged first, followed by all dirty INs at the next
level, etc.
0050 CkptStart
0100 BIN level 1
0200 BIN level 1
...
1000 IN level 2, parent of 0100, 0200, etc.
1100 BIN level 1
1200 BIN level 1
...
2000 IN level 2, parent of 1100, 1200, etc.
...
7000 IN level 2, last level 2 IN logged
8000 IN level 3, parent of 1000, 2000, etc.
...
9000 IN level 4, parent of 8000, etc.
...
                 9000                 level 4
                /
         ----8000----                 level 3
        /       /      \
    1000     2000     ......          level 2
                                      BINs not shown
Only the root (if it happens to be logged right before the crash) is
non-provisional. We'll assume in this example that the root was not
logged. Level 2 through R-1 are logged as Provisional.BEFORE_CKPT_END,
and treated as non-provisional (replayed) by recovery because there is no
CkptEnd.
When 1000 (and all other nodes at level 2) is replayed, it is placed into
an empty slot.
When 8000 (and all other INs at level 3 and higher, below the root) is
replayed, it will replace a resident node that was fetched and placed in
the slot when replaying its children. The replaced node is one not
shown, and assumed to have been logged sometime prior to this checkpoint.
The replaced node will have all the level 2 nodes that were replayed
earlier (1000, 2000, etc.) as its resident children, and these are the
nodes that would have to be removed from the INList, if recovery were
changed to place INs on the INList in step 1.
So if the INs were placed on the INList, in this worst case scenario, all
INs from level 3 to R-1 will be replayed, and all their immediate
children would need to be removed from the INList. Grandchildren would
not be resident. In other words, all nodes at level 2 and above (except
the root) would be removed from the INList and replaced by a node being
replayed.
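The replay rule used in this scenario can be summarized with a small sketch. The enum values mirror JE's Provisional class; the replay method itself is a simplification invented here, not the real recovery code.

```java
// Sketch of the replay rule in the scenario above.  The enum values
// mirror JE's Provisional class; the replay method is a simplification.
public class ReplaySketch {

    enum Provisional { NO, YES, BEFORE_CKPT_END }

    static boolean replay(Provisional p, boolean sawCkptEnd) {
        switch (p) {
        case NO:
            return true;              // non-provisional: always replayed
        case YES:
            return false;             // fully provisional: never replayed
        default:
            // BEFORE_CKPT_END: treated as non-provisional (replayed)
            // only when there is no CkptEnd, i.e. after a crash that
            // interrupted the checkpoint.
            return !sawCkptEnd;
        }
    }
}
```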
When there is a normal shutdown, we don't know of scenarios that would cause
this sort of INList thrashing. So perhaps maintaining the INList in step 1
could be justified, if the additional recovery cost after a crash is
acceptable.
Or, a potential solution for the worst case scenario above might be to place
the resident child nodes in the new parent, rather than discarding them and
removing them from the INList. This would have the benefit of populating
the cache and not wasting the work done to read and replay these nodes.
OTOH, it may cause OOME if too much of the tree is loaded in step 1.
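A minimal sketch of this alternative, under the same simplifying assumptions as before (the IN class and splice method are invented here): the resident children migrate to the new parent and stay on the INList.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hedged sketch of the alternative splice: instead of discarding the
// replaced node's resident children, attach them to the new parent so
// they stay cached and on the INList.  Names are illustrative only.
public class PreservingSpliceSketch {

    static class IN {
        final long nodeId;
        final Map<Long, IN> residentChildren = new HashMap<>();
        IN(long id) { nodeId = id; }
    }

    static void splice(Set<Long> inList, IN oldNode, IN newNode) {
        if (oldNode != null) {
            inList.remove(oldNode.nodeId);
            // Migrate the children; they remain on the INList.
            newNode.residentChildren.putAll(oldNode.residentChildren);
        }
        inList.add(newNode.nodeId);
    }
}
```

As noted above, the trade-off is that keeping the children resident may cause OOME if too much of the tree is loaded in step 1.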