@InterfaceAudience.Private @InterfaceStability.Evolving public class DynamoDBMetadataStore extends Object implements MetadataStore, AWSPolicyProvider
MetadataStore that persists file system metadata to DynamoDB.
The current implementation uses a schema consisting of a single table. The
name of the table can be configured by config key
Constants.S3GUARD_DDB_TABLE_NAME_KEY.
By default, it matches the name of the S3 bucket. Each item in the table
represents a single directory or file. Its path is split into separate table
attributes:
s3a://bucket/dir1
|-- dir2
|   |-- file1
|   `-- file2
`-- dir3
    |-- dir4
    |   `-- file3
    |-- dir5
    |   `-- file4
    `-- dir6
This is persisted to a single DynamoDB table as:
| parent | child | is_dir | mod_time | len | etag | ver_id | ... |
|---|---|---|---|---|---|---|---|
| /bucket | dir1 | true | | | | | |
| /bucket/dir1 | dir2 | true | | | | | |
| /bucket/dir1 | dir3 | true | | | | | |
| /bucket/dir1/dir2 | file1 | | 100 | 111 | abc | mno | |
| /bucket/dir1/dir2 | file2 | | 200 | 222 | def | pqr | |
| /bucket/dir1/dir3 | dir4 | true | | | | | |
| /bucket/dir1/dir3 | dir5 | true | | | | | |
| /bucket/dir1/dir3/dir4 | file3 | | 300 | 333 | ghi | stu | |
| /bucket/dir1/dir3/dir5 | file4 | | 400 | 444 | jkl | vwx | |
| /bucket/dir1/dir3 | dir6 | true | | | | | |

This choice of schema is efficient for read access patterns.
get(Path) can be served from a single item lookup.
listChildren(Path) can be served from a query against all rows
matching the parent (the partition key) and the returned list is guaranteed
to be sorted by child (the range key). Tracking whether or not a path is a
directory helps prevent unnecessary queries during traversal of an entire
sub-tree.
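To illustrate the layout, the mapping of a flat path onto the two key attributes can be sketched in plain Java. This is a simplified stand-in, not the actual Hadoop implementation:

```java
// Illustrative sketch: how a flat path maps onto the two DynamoDB key
// attributes described above. The partition key is the parent path,
// the range key is the child name.
public class KeySplitDemo {

    /** Partition key: everything before the last '/'. */
    static String parentKey(String path) {
        int i = path.lastIndexOf('/');
        return i <= 0 ? "/" : path.substring(0, i);
    }

    /** Range key: the final path component. */
    static String childKey(String path) {
        return path.substring(path.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        String p = "/bucket/dir1/dir2/file1";
        // get(Path) needs only one item lookup: (parent, child).
        System.out.println(parentKey(p) + " | " + childKey(p));
        // listChildren(Path) is a query on the partition key alone.
        System.out.println("query partition key = " + parentKey(p));
    }
}
```

Because every child of a directory shares the same partition key, a listing is one DynamoDB query, and the range key returns the children already sorted.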
Some mutating operations, notably
MetadataStore.deleteSubtree(Path, BulkOperationState) and
MetadataStore.move(Collection, Collection, BulkOperationState)
are less efficient with this schema.
They require mutating multiple items in the DynamoDB table.
By default, DynamoDB access is performed within the same AWS region as
the S3 bucket that hosts the S3A instance. During initialization, it checks
the location of the S3 bucket and creates a DynamoDB client connected to the
same region. The region may also be set explicitly by setting the config
parameter fs.s3a.s3guard.ddb.region to the corresponding region.

Nested classes referenced below: MetadataStore.PruneMode, AWSPolicyProvider.AccessLevel.

| Modifier and Type | Field and Description |
|---|---|
| static String | E_ON_DEMAND_NO_SET_CAPACITY |
| static org.slf4j.Logger | LOG |
| static org.slf4j.Logger | OPERATIONS_LOG. A log of all state-changing operations to the store; only updated at debug level. |
| static String | OPERATIONS_LOG_NAME. Name of the operations log. |
| static int | VERSION. Current version number. |
| static String | VERSION_MARKER_ITEM_NAME. Parent/child name to use in the version marker. |
| static String | VERSION_MARKER_TAG_NAME. Parent/child name to use in the version marker. |
| Constructor and Description |
|---|
| DynamoDBMetadataStore() |
| Modifier and Type | Method and Description |
|---|---|
| void | addAncestors(org.apache.hadoop.fs.Path qualifiedPath, BulkOperationState operationState). Adds all new ancestors of a path as directories. |
| void | close() |
| void | delete(org.apache.hadoop.fs.Path path, BulkOperationState operationState). Deletes exactly one path, leaving a tombstone to prevent lingering, inconsistent copies of it from being listed. |
| void | deletePaths(Collection<org.apache.hadoop.fs.Path> paths, BulkOperationState operationState). Delete the paths. |
| void | deleteSubtree(org.apache.hadoop.fs.Path path, BulkOperationState operationState). Deletes the entire sub-tree rooted at the given path, leaving tombstones to prevent lingering, inconsistent copies of it from being listed. |
| void | destroy(). Destroy all resources associated with the metadata store. |
| void | forgetMetadata(org.apache.hadoop.fs.Path path). Removes the record of exactly one path. |
| DDBPathMetadata | get(org.apache.hadoop.fs.Path path). Gets metadata for a path. |
| DDBPathMetadata | get(org.apache.hadoop.fs.Path path, boolean wantEmptyDirectoryFlag). Gets metadata for a path. |
| com.amazonaws.services.dynamodbv2.AmazonDynamoDB | getAmazonDynamoDB() |
| long | getBatchWriteCapacityExceededCount() |
| Map<String,String> | getDiagnostics(). Get any diagnostics information from a store, as a list of (key, value) tuples for display. |
| MetastoreInstrumentation | getInstrumentation(). Get any S3GuardInstrumentation for this store; must not be null. |
| Invoker | getInvoker(). Get the operation invoker for write operations. |
| long | getReadThrottleEventCount(). Get the count of read throttle events. |
| long | getScanThrottleEventCount(). Get the count of scan throttle events. |
| protected DynamoDBMetadataStoreTableManager | getTableHandler() |
| String | getTableName() |
| long | getWriteThrottleEventCount(). Get the count of write throttle events. |
| void | initialize(org.apache.hadoop.conf.Configuration config, ITtlTimeProvider ttlTp). Performs one-time initialization of the metadata store via configuration. |
| void | initialize(org.apache.hadoop.fs.FileSystem fs, ITtlTimeProvider ttlTp). Performs one-time initialization of the metadata store. |
| org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore.AncestorState | initiateBulkWrite(BulkOperationState.OperationType operation, org.apache.hadoop.fs.Path dest). Initiate a bulk update and create an operation state for it. |
| RenameTracker | initiateRenameOperation(StoreContext storeContext, org.apache.hadoop.fs.Path source, S3AFileStatus sourceStatus, org.apache.hadoop.fs.Path dest). Initiate the rename operation by creating the tracker for the filesystem to keep up to date with state changes in the S3A bucket. |
| List<RoleModel.Statement> | listAWSPolicyRules(Set<AWSPolicyProvider.AccessLevel> access). The administrative policy includes all DDB table operations; application access is restricted to those operations S3Guard requires when working with data in a guarded bucket. |
| DirListingMetadata | listChildren(org.apache.hadoop.fs.Path path). Lists metadata for all direct children of a path. |
| int | markAsAuthoritative(org.apache.hadoop.fs.Path dest, BulkOperationState operationState). Mark the directories instantiated under the destination path as authoritative. |
| void | move(Collection<org.apache.hadoop.fs.Path> pathsToDelete, Collection<PathMetadata> pathsToCreate, BulkOperationState operationState). Record the effects of a FileSystem.rename(Path, Path) in the MetadataStore. |
| void | prune(MetadataStore.PruneMode pruneMode, long cutoff). Prune method with two modes of operation; MetadataStore.PruneMode.ALL_BY_MODTIME clears any metadata older than a specified mod_time from the store. |
| long | prune(MetadataStore.PruneMode pruneMode, long cutoff, String keyPrefix). Prune files, in batches. |
| void | put(Collection<? extends PathMetadata> metas, BulkOperationState operationState). Saves metadata for any number of paths. |
| void | put(DirListingMetadata meta, List<org.apache.hadoop.fs.Path> unchangedEntries, BulkOperationState operationState). Save directory listing metadata. |
| void | put(PathMetadata meta). Saves metadata for exactly one path. |
| void | put(PathMetadata meta, BulkOperationState operationState). Saves metadata for exactly one path, potentially using any bulk operation state to eliminate duplicate work. |
| void | setTtlTimeProvider(ITtlTimeProvider ttlTimeProvider). The TtlTimeProvider has to be set during initialization of the metadata store, but this method can be used in tests to change the instance at runtime. |
| String | toString() |
| void | updateParameters(Map<String,String> parameters). Tune/update parameters for an existing table. |
| <T> Iterable<T> | wrapWithRetries(Iterable<T> source). Wrap an iterator returned from any scan with a retrying one. |
public static final org.slf4j.Logger LOG
public static final String OPERATIONS_LOG_NAME
public static final org.slf4j.Logger OPERATIONS_LOG
public static final String VERSION_MARKER_ITEM_NAME
public static final String VERSION_MARKER_TAG_NAME
public static final int VERSION
public static final String E_ON_DEMAND_NO_SET_CAPACITY
@Retries.OnceRaw public void initialize(org.apache.hadoop.fs.FileSystem fs, ITtlTimeProvider ttlTp) throws IOException

Performs one-time initialization of the metadata store. Credentials are obtained from the owning filesystem via S3AFileSystem.shareCredentials(String); this will increment the reference counter of these credentials.

Specified by: initialize in interface MetadataStore
Parameters:
fs - S3AFileSystem associated with the MetadataStore
ttlTp - the time provider to use for metadata expiry
Throws:
IOException - on a failure

@Retries.OnceRaw public void initialize(org.apache.hadoop.conf.Configuration config, ITtlTimeProvider ttlTp) throws IOException

Performs one-time initialization of the metadata store via configuration, as an alternative to MetadataStore.initialize(FileSystem, ITtlTimeProvider) with an initialized S3AFileSystem instance.
Without a filesystem to act as a reference point, the configuration itself
must declare the table name and region in the
Constants.S3GUARD_DDB_TABLE_NAME_KEY and
Constants.S3GUARD_DDB_REGION_KEY respectively.
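A minimal sketch of this fail-fast contract, using a plain map as a stand-in for a Hadoop Configuration object (the key strings mirror the documented config parameters; the helper method itself is hypothetical, not part of the Hadoop API):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the contract described above: without a filesystem
// to infer settings from, the configuration itself must supply both the
// table name and the region, or initialization fails fast.
public class ConfigCheckDemo {

    static final String TABLE_KEY = "fs.s3a.s3guard.ddb.table";
    static final String REGION_KEY = "fs.s3a.s3guard.ddb.region";

    /** Fail with IllegalArgumentException when the configuration is incomplete. */
    static void requireTableAndRegion(Map<String, String> conf) {
        if (conf.get(TABLE_KEY) == null || conf.get(REGION_KEY) == null) {
            throw new IllegalArgumentException(
                "Both " + TABLE_KEY + " and " + REGION_KEY + " must be set");
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(TABLE_KEY, "my-s3guard-table");
        conf.put(REGION_KEY, "eu-west-1");
        requireTableAndRegion(conf); // passes: both keys are present
        System.out.println("configuration complete");
    }
}
```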
It also creates a new credential provider list from the configuration,
using the base fs.s3a.* options, as there is no bucket to infer per-bucket
settings from.

Specified by: initialize in interface MetadataStore
Parameters:
config - Configuration
ttlTp - the time provider to use for metadata expiry
Throws:
IOException - if there is an error
IllegalArgumentException - if the configuration is incomplete
See also: MetadataStore.initialize(FileSystem, ITtlTimeProvider)

@Retries.RetryTranslated public void delete(org.apache.hadoop.fs.Path path, BulkOperationState operationState) throws IOException
Deleting an entry with a tombstone needs a S3Guard.TtlTimeProvider because the lastUpdated field of the record has to be updated to now.

Specified by: delete in interface MetadataStore
Parameters:
path - the path to delete
operationState - (nullable) operational state for a bulk update
Throws:
IOException - if there is an error

@Retries.RetryTranslated public void forgetMetadata(org.apache.hadoop.fs.Path path) throws IOException
Removes the record of exactly one path, without leaving a tombstone (see MetadataStore.delete(Path, BulkOperationState)). It is currently intended for testing only, and a need to use it as part of normal FileSystem usage is not anticipated.

Specified by: forgetMetadata in interface MetadataStore
Parameters:
path - the path to delete
Throws:
IOException - if there is an error

@Retries.RetryTranslated public void deleteSubtree(org.apache.hadoop.fs.Path path, BulkOperationState operationState) throws IOException
Deletes the entire sub-tree rooted at the given path, leaving tombstones. In addition to affecting future calls to MetadataStore.get(Path), implementations must also update any stored DirListingMetadata objects which track the parent of this file.
Deleting a subtree with a tombstone needs a S3Guard.TtlTimeProvider because the lastUpdated field of all records has to be updated to now.

Specified by: deleteSubtree in interface MetadataStore
Parameters:
path - the root of the sub-tree to delete
operationState - (nullable) operational state for a bulk update
Throws:
IOException - if there is an error

@Retries.RetryTranslated public void deletePaths(Collection<org.apache.hadoop.fs.Path> paths, BulkOperationState operationState) throws IOException
Delete the paths.

Specified by: deletePaths in interface MetadataStore
Parameters:
paths - paths to delete.
operationState - nullable operation state
Throws:
IOException - failure

@Retries.RetryTranslated public DDBPathMetadata get(org.apache.hadoop.fs.Path path) throws IOException
Gets metadata for a path.

Specified by: get in interface MetadataStore
Parameters:
path - the path to get
Returns: metadata for path, null if not found
Throws:
IOException - if there is an error

@Retries.RetryTranslated public DDBPathMetadata get(org.apache.hadoop.fs.Path path, boolean wantEmptyDirectoryFlag) throws IOException
Gets metadata for a path. When wantEmptyDirectoryFlag is set, the result may carry the value of PathMetadata.isEmptyDirectory(). Since determining emptiness may be an expensive operation, this can save wasted work.

Specified by: get in interface MetadataStore
Parameters:
path - the path to get
wantEmptyDirectoryFlag - Set to true to give a hint to the MetadataStore that it should try to compute the empty directory flag.
Returns: metadata for path, null if not found
Throws:
IOException - if there is an error

@Retries.RetryTranslated public DirListingMetadata listChildren(org.apache.hadoop.fs.Path path) throws IOException
Lists metadata for all direct children of a path.

Specified by: listChildren in interface MetadataStore
Parameters:
path - the path to list
Returns: metadata for all direct children of path which are being tracked by the MetadataStore, or null if the path was not found in the MetadataStore.
Throws:
IOException - if there is an error

@Retries.RetryTranslated public void addAncestors(org.apache.hadoop.fs.Path qualifiedPath, @Nullable BulkOperationState operationState) throws IOException
Important: to propagate TTL information, any new ancestors added
must have their last updated timestamps set through
S3Guard.patchLastUpdated(Collection, ITtlTimeProvider).
The implementation scans up the directory tree, issuing a get() for each entry; when an entry is found at a level, it is added to the ancestor state.
The original implementation would stop on finding the first non-empty parent. This (re) implementation issues a GET for every parent entry and so detects and recovers from a tombstone marker further up the tree (i.e. an inconsistent store is corrected for).
If operationState is not null, when this method returns the
operation state will be updated with all new entries created.
This ensures that subsequent operations with the same store will not
trigger new updates.
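The walk itself can be sketched in plain Java, with strings standing in for Hadoop Path objects (an illustrative stand-in, not the Hadoop code):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the scan described above: collect every ancestor
// of a qualified path up to the root; addAncestors() probes each of these
// levels with a get(), so a tombstone anywhere up the tree is detected.
public class AncestorDemo {

    static List<String> ancestors(String path) {
        List<String> out = new ArrayList<>();
        int i = path.lastIndexOf('/');
        while (i > 0) {
            path = path.substring(0, i);
            out.add(path);          // each level gets its own get() lookup
            i = path.lastIndexOf('/');
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(ancestors("/bucket/dir1/dir3/dir4/file3"));
    }
}
```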
Specified by: addAncestors in interface MetadataStore
Parameters:
qualifiedPath - path to update
operationState - (nullable) operational state for a bulk update
Throws:
IOException - on failure.

@Retries.RetryTranslated public void move(@Nullable Collection<org.apache.hadoop.fs.Path> pathsToDelete, @Nullable Collection<PathMetadata> pathsToCreate, @Nullable BulkOperationState operationState) throws IOException
Record the effects of a FileSystem.rename(Path, Path) in the
MetadataStore. Clients provide explicit enumeration of the affected
paths (recursively), before and after the rename.
This operation is not atomic, unless specific implementations claim
otherwise.
On the need to provide an enumeration of directory trees instead of just
source and destination paths:
Since a MetadataStore does not have to track all metadata for the
underlying storage system, and a new MetadataStore may be created on an
existing underlying filesystem, this move() may be the first time the
MetadataStore sees the affected paths. Therefore, simply providing src
and destination paths may not be enough to record the deletions (under
src path) and creations (at destination) that are happening during the
rename().
The DDB implementation sorts all the paths such that new items
are ordered highest level entry first; deleted items are ordered
lowest entry first.
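This ordering rule can be sketched as a depth sort, with strings standing in for Hadoop Path objects (an illustrative stand-in, not the Hadoop code):

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Sketch of the ordering described above: creations run parents-first
// (shallow to deep), deletions run children-first (deep to shallow).
public class MoveOrderDemo {

    /** Depth = number of path separators; the root is shallowest. */
    static int depth(String path) {
        return (int) path.chars().filter(c -> c == '/').count();
    }

    static List<String> createOrder(List<String> paths) {
        return paths.stream()
            .sorted(Comparator.comparingInt(MoveOrderDemo::depth))
            .collect(Collectors.toList());
    }

    static List<String> deleteOrder(List<String> paths) {
        return paths.stream()
            .sorted(Comparator.comparingInt(MoveOrderDemo::depth).reversed())
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> paths = Arrays.asList(
            "/bucket/dst/dir2/file1", "/bucket/dst", "/bucket/dst/dir2");
        System.out.println(createOrder(paths)); // parents before children
        System.out.println(deleteOrder(paths)); // children before parents
    }
}
```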
This is to ensure that if a client failed partway through the update, there will be no entries in the table which lack parent entries.

Specified by: move in interface MetadataStore
Parameters:
pathsToDelete - Collection of all paths that were removed from the source directory tree of the move.
pathsToCreate - Collection of all PathMetadata for the new paths that were created at the destination of the rename().
operationState - Any ongoing state supplied to the rename tracker which is to be passed in with each move operation.
Throws:
IOException - if there is an error

@Retries.RetryTranslated public void put(PathMetadata meta) throws IOException
Saves metadata for exactly one path. Implementations must update any stored DirListingMetadata objects which track the immediate parent of this file.

Specified by: put in interface MetadataStore
Parameters:
meta - the metadata to save
Throws:
IOException - if there is an error

@Retries.RetryTranslated public void put(PathMetadata meta, @Nullable BulkOperationState operationState) throws IOException

Saves metadata for exactly one path, potentially using any bulk operation state to eliminate duplicate work. Implementations must update any stored DirListingMetadata objects which track the immediate parent of this file.

Specified by: put in interface MetadataStore
Parameters:
meta - the metadata to save
operationState - operational state for a bulk update
Throws:
IOException - if there is an error

@Retries.RetryTranslated public void put(Collection<? extends PathMetadata> metas, @Nullable BulkOperationState operationState) throws IOException

Saves metadata for any number of paths.

Specified by: put in interface MetadataStore
Parameters:
metas - the metadata to save
operationState - (nullable) operational state for a bulk update
Throws:
IOException - if there is an error

@Retries.RetryTranslated public void put(DirListingMetadata meta, List<org.apache.hadoop.fs.Path> unchangedEntries, @Nullable BulkOperationState operationState) throws IOException
Save directory listing metadata. Implementations may subsequently keep track of all modifications to the directory contents at this path, and return authoritative results from subsequent calls to MetadataStore.listChildren(Path). See DirListingMetadata.
Any authoritative results returned are only authoritative for the scope
of the MetadataStore: A per-process MetadataStore, for
example, would only show results visible to that process, potentially
missing metadata updates (create, delete) made to the same path by
another process.
To optimize updates and avoid overwriting existing entries which may contain extra data, entries in the list of unchangedEntries may be excluded. That is: the listing metadata has the full list of what it believes are children, but implementations can opt to ignore some.
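The effect of this optimization can be sketched as a simple filter, with strings standing in for Path/PathMetadata objects (a hypothetical stand-in, not the Hadoop code):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical sketch of the optimisation described above: given the full
// child listing and the set of unchanged entries, only the changed children
// need to be written back to the table.
public class UnchangedFilterDemo {

    static List<String> entriesToWrite(List<String> listing,
                                       List<String> unchangedEntries) {
        Set<String> unchanged = new HashSet<>(unchangedEntries);
        return listing.stream()
            .filter(p -> !unchanged.contains(p)) // skip entries with extra data
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> listing = Arrays.asList("file1", "file2", "dir4");
        List<String> unchanged = Arrays.asList("file2");
        System.out.println(entriesToWrite(listing, unchanged));
    }
}
```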
There is retry around building the list of paths to update, but
the call to
processBatchWriteRequest(DynamoDBMetadataStore.AncestorState, PrimaryKey[], Item[])
is only tried once.

Specified by: put in interface MetadataStore
Parameters:
meta - Directory listing metadata.
unchangedEntries - unchanged child entry paths
operationState - operational state for a bulk update
Throws:
IOException - IO problem

public void close()
Specified by: close in interface Closeable
Specified by: close in interface AutoCloseable

@Retries.RetryTranslated public void destroy() throws IOException

Destroy all resources associated with the metadata store.

Specified by: destroy in interface MetadataStore
Throws:
IOException - if there is an error

@Retries.RetryTranslated public void prune(MetadataStore.PruneMode pruneMode, long cutoff) throws IOException
Prune method with two modes of operation:

MetadataStore.PruneMode.ALL_BY_MODTIME
Clear any metadata older than a specified mod_time from the store.
Note that this modification time is the S3 modification time from the
object's metadata - from the object store.
Implementations MUST clear file metadata, and MAY clear directory
metadata (s3a itself does not track modification time for directories).
Implementations may also choose to throw UnsupportedOperationException
instead. Note that modification times must be in UTC, as returned by
System.currentTimeMillis at the time of modification.
MetadataStore.PruneMode.TOMBSTONES_BY_LASTUPDATED
Clear any tombstone updated earlier than a specified time from the
store. Note that this last_updated is the time when the metadata
entry was last updated and maintained by the metadata store.
Implementations MUST clear file metadata, and MAY clear directory
metadata (s3a itself does not track modification time for directories).
Implementations may also choose to throw UnsupportedOperationException
instead. Note that last_updated must be in UTC, as returned by
System.currentTimeMillis at the time of modification.
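A worked example of the cutoff arithmetic: both prune modes compare against a UTC epoch-millisecond timestamp, so "everything older than a week" is simply now minus seven days of milliseconds. The helper below is illustrative, not part of the Hadoop API:

```java
import java.util.concurrent.TimeUnit;

// Worked example of the cutoff parameter used by both prune modes.
public class PruneCutoffDemo {

    /** Cutoff = now minus the retention age, all in UTC epoch milliseconds. */
    static long cutoffMillis(long nowMillis, long ageDays) {
        return nowMillis - TimeUnit.DAYS.toMillis(ageDays);
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis(); // UTC, as prune() expects
        long cutoff = cutoffMillis(now, 7);
        System.out.println("prune entries with mod_time < " + cutoff);
    }
}
```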
Specified by: prune in interface MetadataStore
Parameters:
pruneMode - Prune Mode
cutoff - Oldest time to allow (UTC)
Throws:
IOException - if there is an error

@Retries.RetryTranslated public long prune(MetadataStore.PruneMode pruneMode, long cutoff, String keyPrefix) throws IOException

Prune files, in batches.
Specified by: prune in interface MetadataStore
Parameters:
pruneMode - The mode of operation for the prune; for details see MetadataStore.prune(PruneMode, long)
cutoff - Oldest modification time to allow
keyPrefix - The prefix for the keys that should be removed
Throws:
IOException - Any IO/DDB failure.
InterruptedIOException - if the prune was interrupted

public com.amazonaws.services.dynamodbv2.AmazonDynamoDB getAmazonDynamoDB()
public List<RoleModel.Statement> listAWSPolicyRules(Set<AWSPolicyProvider.AccessLevel> access)
Specified by: listAWSPolicyRules in interface AWSPolicyProvider
Parameters:
access - access level desired.

public String getTableName()
@Retries.OnceRaw public Map<String,String> getDiagnostics() throws IOException
Get any diagnostics information from a store, as a list of (key, value) tuples for display.

Specified by: getDiagnostics in interface MetadataStore
Throws:
IOException - if there is an error

@Retries.OnceRaw public void updateParameters(Map<String,String> parameters) throws IOException

Tune/update parameters for an existing table.

Specified by: updateParameters in interface MetadataStore
Parameters:
parameters - map of params to change.
Throws:
IOException - if there is an error

public long getReadThrottleEventCount()
public long getWriteThrottleEventCount()
public long getScanThrottleEventCount()
public long getBatchWriteCapacityExceededCount()
public Invoker getInvoker()
public <T> Iterable<T> wrapWithRetries(Iterable<T> source)
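What "wrap an iterator with a retrying one" means can be sketched as follows. This is an illustrative stand-in, not the Hadoop implementation, which routes retries through its Invoker policy:

```java
import java.util.Iterator;
import java.util.List;

// Illustrative sketch: each next() call on the wrapped iterator is retried
// a bounded number of times before the failure is rethrown.
public class RetryIterableDemo {

    static <T> Iterable<T> withRetries(Iterable<T> source, int attempts) {
        return () -> new Iterator<T>() {
            private final Iterator<T> it = source.iterator();

            @Override
            public boolean hasNext() {
                return it.hasNext();
            }

            @Override
            public T next() {
                RuntimeException last = null;
                for (int i = 0; i < attempts; i++) {
                    try {
                        return it.next();
                    } catch (RuntimeException e) {
                        last = e; // treat as transient: retry the call
                    }
                }
                throw last;
            }
        };
    }

    public static void main(String[] args) {
        for (String s : withRetries(List.of("a", "b"), 3)) {
            System.out.println(s);
        }
    }
}
```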
Parameters:
source - source iterator

public RenameTracker initiateRenameOperation(StoreContext storeContext, org.apache.hadoop.fs.Path source, S3AFileStatus sourceStatus, org.apache.hadoop.fs.Path dest)

Initiate the rename operation by creating the tracker for the filesystem to keep up to date with state changes in the S3A bucket.

Specified by: initiateRenameOperation in interface MetadataStore
Parameters:
storeContext - store context.
source - source path
sourceStatus - status of the source file/dir
dest - destination path.

public int markAsAuthoritative(org.apache.hadoop.fs.Path dest, BulkOperationState operationState) throws IOException

Mark the directories instantiated under the destination path as authoritative.

Specified by: markAsAuthoritative in interface MetadataStore
Parameters:
dest - destination path.
operationState - active state.
Throws:
IOException - failure.

public org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore.AncestorState initiateBulkWrite(BulkOperationState.OperationType operation, org.apache.hadoop.fs.Path dest)
Initiate a bulk update and create an operation state for it.

Specified by: initiateBulkWrite in interface MetadataStore
Parameters:
operation - the type of the operation.
dest - path under which updates will be explicitly put.

public void setTtlTimeProvider(ITtlTimeProvider ttlTimeProvider)

Specified by: setTtlTimeProvider in interface MetadataStore

public MetastoreInstrumentation getInstrumentation()

Specified by: getInstrumentation in interface MetadataStore

protected DynamoDBMetadataStoreTableManager getTableHandler()
Copyright © 2008–2022 Apache Software Foundation. All rights reserved.