public class FSUtils extends Object
Nested Class Summary
| Modifier and Type | Class and Description |
|---|---|
| static interface | FSUtils.SerializableFunction<T,R> |
Field Summary
| Modifier and Type | Field and Description |
|---|---|
| static Pattern | LOG_FILE_PATTERN |
Constructor Summary
| Constructor and Description |
|---|
| FSUtils() |
Method Summary
| Modifier and Type | Method and Description |
|---|---|
| static org.apache.hadoop.fs.Path | addSchemeIfLocalPath(String path) |
| static int | computeNextLogVersion(org.apache.hadoop.fs.FileSystem fs, org.apache.hadoop.fs.Path partitionPath, String fileId, String logFileExtension, String baseCommitTime) Computes the next log version for the specified fileId in the partition path. |
| static String | createNewFileId(String idPfx, int id) |
| static String | createNewFileIdPfx() Returns a new unique prefix for creating a file group. |
| static void | createPathIfNotExists(org.apache.hadoop.fs.FileSystem fs, org.apache.hadoop.fs.Path partitionPath) |
| static boolean | deleteDir(HoodieEngineContext hoodieEngineContext, org.apache.hadoop.fs.FileSystem fs, org.apache.hadoop.fs.Path dirPath, int parallelism) Deletes a directory by deleting sub-paths in parallel on the file system. |
| static boolean | deleteSubPath(String subPathStr, SerializableConfiguration conf, boolean recursive) Deletes a sub-path. |
| static org.apache.hadoop.fs.FileStatus[] | getAllDataFilesInPartition(org.apache.hadoop.fs.FileSystem fs, org.apache.hadoop.fs.Path partitionPath) Get the names of all the base and log files in the given partition path. |
| static List<String> | getAllFoldersWithPartitionMetaFile(org.apache.hadoop.fs.FileSystem fs, String basePathStr) Obtain all the partition paths present in this table, denoted by the presence of HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE_PREFIX. |
| static Stream<HoodieLogFile> | getAllLogFiles(org.apache.hadoop.fs.FileSystem fs, org.apache.hadoop.fs.Path partitionPath, String fileId, String logFileExtension, String baseCommitTime) Get all the log files for the passed-in file-id in the partition path. |
| static List<String> | getAllPartitionFoldersThreeLevelsDown(org.apache.hadoop.fs.FileSystem fs, String basePath) Gets all partition paths, assuming date partitioning (year, month, day) three levels down. |
| static List<String> | getAllPartitionPaths(HoodieEngineContext engineContext, HoodieMetadataConfig metadataConfig, String basePathStr) |
| static List<String> | getAllPartitionPaths(HoodieEngineContext engineContext, String basePathStr, boolean useFileListingFromMetadata, boolean assumeDatePartitioning) |
| static String | getBaseCommitTimeFromLogPath(org.apache.hadoop.fs.Path path) Get the first part of the file name in the log file. |
| static String | getCommitFromCommitFile(String commitFileName) |
| static String | getCommitTime(String fullFileName) |
| static int | getDefaultBufferSize(org.apache.hadoop.fs.FileSystem fs) |
| static Short | getDefaultReplication(org.apache.hadoop.fs.FileSystem fs, org.apache.hadoop.fs.Path path) |
| static String | getDFSFullPartitionPath(org.apache.hadoop.fs.FileSystem fs, org.apache.hadoop.fs.Path fullPartitionPath) Get the DFS full partition path. |
| static String | getFileExtension(String fullName) |
| static String | getFileExtensionFromLog(org.apache.hadoop.fs.Path logPath) Get the file extension from the log file. |
| static String | getFileId(String fullFileName) |
| static String | getFileIdFromFilePath(org.apache.hadoop.fs.Path filePath) Check if the file is a base file or a log file. |
| static String | getFileIdFromLogPath(org.apache.hadoop.fs.Path path) Get the first part of the file name in the log file. |
| static String | getFileName(String filePathWithPartition, String partition) Extracts the file name from the relative path based on the table base path. |
| static Map<String,org.apache.hadoop.fs.FileStatus[]> | getFilesInPartitions(HoodieEngineContext engineContext, HoodieMetadataConfig metadataConfig, String basePathStr, String[] partitionPaths, String spillableMapPath) |
| static long | getFileSize(org.apache.hadoop.fs.FileSystem fs, org.apache.hadoop.fs.Path path) |
| static List<org.apache.hadoop.fs.FileStatus> | getFileStatusAtLevel(HoodieEngineContext hoodieEngineContext, org.apache.hadoop.fs.FileSystem fs, org.apache.hadoop.fs.Path rootPath, int expectLevel, int parallelism) Lists file status at a certain level in the directory hierarchy. |
| static int | getFileVersionFromLog(org.apache.hadoop.fs.Path logPath) Get the last part of the file name in the log file and convert it to an int. |
| static org.apache.hadoop.fs.FileSystem | getFs(org.apache.hadoop.fs.Path path, org.apache.hadoop.conf.Configuration conf) |
| static org.apache.hadoop.fs.FileSystem | getFs(String pathStr, org.apache.hadoop.conf.Configuration conf) |
| static org.apache.hadoop.fs.FileSystem | getFs(String pathStr, org.apache.hadoop.conf.Configuration conf, boolean localByDefault) |
| static HoodieWrapperFileSystem | getFs(String path, SerializableConfiguration hadoopConf, ConsistencyGuardConfig consistencyGuardConfig) Get the FS implementation for this table. |
| static List<org.apache.hadoop.fs.FileStatus> | getGlobStatusExcludingMetaFolder(org.apache.hadoop.fs.FileSystem fs, org.apache.hadoop.fs.Path globPath) Helper to filter out paths under the metadata folder when running fs.globStatus. |
| static Option<HoodieLogFile> | getLatestLogFile(org.apache.hadoop.fs.FileSystem fs, org.apache.hadoop.fs.Path partitionPath, String fileId, String logFileExtension, String baseCommitTime) Get the latest log file for the passed-in file-id in the partition path. |
| static Option<Pair<Integer,String>> | getLatestLogVersion(org.apache.hadoop.fs.FileSystem fs, org.apache.hadoop.fs.Path partitionPath, String fileId, String logFileExtension, String baseCommitTime) Get the latest log version for the fileId in the partition path. |
| static org.apache.hadoop.fs.Path | getPartitionPath(org.apache.hadoop.fs.Path basePath, String partitionPath) |
| static org.apache.hadoop.fs.Path | getPartitionPath(String basePath, String partitionPath) |
| static String | getRelativePartitionPath(org.apache.hadoop.fs.Path basePath, org.apache.hadoop.fs.Path fullPartitionPath) Given a base path and a full partition path, returns the partition path relative to the base path. |
| static Long | getSizeInMB(long sizeInBytes) |
| static Integer | getStageIdFromLogPath(org.apache.hadoop.fs.Path path) Get the stage id used in the log path. |
| static Integer | getTaskAttemptIdFromLogPath(org.apache.hadoop.fs.Path path) Get the task attempt id used in the log path. |
| static Integer | getTaskPartitionIdFromLogPath(org.apache.hadoop.fs.Path path) Get the task partition id used in the log path. |
| static String | getWriteTokenFromLogPath(org.apache.hadoop.fs.Path path) Get the write token used in the log path. |
| static boolean | isBaseFile(org.apache.hadoop.fs.Path path) |
| static boolean | isCHDFileSystem(org.apache.hadoop.fs.FileSystem fs) Chdfs will throw IOException instead of EOFException. |
| static boolean | isDataFile(org.apache.hadoop.fs.Path path) Returns true if the given path is a base file or a log file. |
| static boolean | isGCSFileSystem(org.apache.hadoop.fs.FileSystem fs) This exists because of HUDI-140: GCS has a different behavior for detecting EOF during seek(). |
| static boolean | isLogFile(org.apache.hadoop.fs.Path logPath) |
| static boolean | isLogFile(String fileName) |
| static boolean | isTableExists(String path, org.apache.hadoop.fs.FileSystem fs) Check if a table already exists in the given path. |
| static String | makeBaseFileName(String instantTime, String writeToken, String fileId) |
| static String | makeBaseFileName(String instantTime, String writeToken, String fileId, String fileExtension) |
| static String | makeBootstrapIndexFileName(String instantTime, String fileId, String ext) |
| static String | makeLogFileName(String fileId, String logFileExtension, String baseCommitTime, int version, String writeToken) |
| static org.apache.hadoop.fs.Path | makeQualified(org.apache.hadoop.fs.FileSystem fs, org.apache.hadoop.fs.Path path) Makes the path qualified with the FileSystem's URI. |
| static String | makeWriteToken(int taskPartitionId, int stageId, long taskAttemptId) A write token uniquely identifies an attempt at one of the IOHandle operations (Merge/Create/Append). |
| static String | maskWithoutFileId(String instantTime, int taskPartitionId) |
| static <T> Map<String,T> | parallelizeFilesProcess(HoodieEngineContext hoodieEngineContext, org.apache.hadoop.fs.FileSystem fs, int parallelism, FSUtils.SerializableFunction<Pair<String,SerializableConfiguration>,T> pairFunction, List<String> subPaths) |
| static <T> Map<String,T> | parallelizeSubPathProcess(HoodieEngineContext hoodieEngineContext, org.apache.hadoop.fs.FileSystem fs, org.apache.hadoop.fs.Path dirPath, int parallelism, Predicate<org.apache.hadoop.fs.FileStatus> subPathPredicate, FSUtils.SerializableFunction<Pair<String,SerializableConfiguration>,T> pairFunction) Processes sub-paths in parallel. |
| static org.apache.hadoop.conf.Configuration | prepareHadoopConf(org.apache.hadoop.conf.Configuration conf) |
| static void | processFiles(org.apache.hadoop.fs.FileSystem fs, String basePathStr, Function<org.apache.hadoop.fs.FileStatus,Boolean> consumer, boolean excludeMetaFolder) Recursively processes all files in the base path. |
| static boolean | recoverDFSFileLease(org.apache.hadoop.hdfs.DistributedFileSystem dfs, org.apache.hadoop.fs.Path p) When a file was opened and the task died without closing the stream, another task executor cannot open it because the existing lease is still active. |
| static org.apache.hadoop.conf.Configuration | registerFileSystem(org.apache.hadoop.fs.Path file, org.apache.hadoop.conf.Configuration conf) |
Field Detail
public static final Pattern LOG_FILE_PATTERN
Method Detail
public static org.apache.hadoop.conf.Configuration prepareHadoopConf(org.apache.hadoop.conf.Configuration conf)
public static org.apache.hadoop.fs.FileSystem getFs(String pathStr, org.apache.hadoop.conf.Configuration conf)
public static org.apache.hadoop.fs.FileSystem getFs(org.apache.hadoop.fs.Path path, org.apache.hadoop.conf.Configuration conf)
public static org.apache.hadoop.fs.FileSystem getFs(String pathStr, org.apache.hadoop.conf.Configuration conf, boolean localByDefault)
public static boolean isTableExists(String path, org.apache.hadoop.fs.FileSystem fs) throws IOException
Check if a table already exists in the given path.
Parameters: path - base path of the table; fs - instance of FileSystem.
Returns: true if the table exists, false otherwise.
Throws: IOException
public static org.apache.hadoop.fs.Path addSchemeIfLocalPath(String path)
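A Hudi table's presence is tied to its `.hoodie` metadata folder (the same folder `processFiles` can exclude and that `getAllFoldersWithPartitionMetaFile` mentions). As an illustration only, here is a `java.nio`-based sketch of the existence check; the real `isTableExists` works against a Hadoop `FileSystem`, and the exact check it performs is not shown in this reference.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Conceptual sketch of isTableExists using java.nio instead of Hadoop's
// FileSystem: treat a directory as a Hudi table when its ".hoodie" metadata
// folder exists. Simplified for illustration; not the real implementation.
public class TableExistsSketch {
    static boolean isTableExists(String basePath) {
        return Files.exists(Paths.get(basePath, ".hoodie"));
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempDirectory("hoodie_table");
        System.out.println(isTableExists(tmp.toString())); // false: no .hoodie yet
        Files.createDirectory(tmp.resolve(".hoodie"));
        System.out.println(isTableExists(tmp.toString())); // true
    }
}
```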
public static org.apache.hadoop.fs.Path makeQualified(org.apache.hadoop.fs.FileSystem fs, org.apache.hadoop.fs.Path path)
Makes the path qualified with the FileSystem's URI.
Parameters: fs - instance of FileSystem the path belongs to; path - path to be qualified.
public static String makeWriteToken(int taskPartitionId, int stageId, long taskAttemptId)
A write token uniquely identifies an attempt at one of the IOHandle operations (Merge/Create/Append).
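A write token combines the three ids above into one string. The `taskPartitionId-stageId-taskAttemptId` layout used in this sketch is an assumption for illustration; the authoritative format is whatever `FSUtils.makeWriteToken` actually emits.

```java
// Illustrative sketch only: assumes a "taskPartitionId-stageId-taskAttemptId"
// layout for the write token. The real token format is defined by Hudi.
public class WriteTokenSketch {
    static String makeWriteToken(int taskPartitionId, int stageId, long taskAttemptId) {
        return String.format("%d-%d-%d", taskPartitionId, stageId, taskAttemptId);
    }

    public static void main(String[] args) {
        System.out.println(makeWriteToken(1, 0, 1)); // prints "1-0-1"
    }
}
```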
public static String makeBaseFileName(String instantTime, String writeToken, String fileId)
public static String makeBaseFileName(String instantTime, String writeToken, String fileId, String fileExtension)
public static String makeBootstrapIndexFileName(String instantTime, String fileId, String ext)
public static long getFileSize(org.apache.hadoop.fs.FileSystem fs, org.apache.hadoop.fs.Path path) throws IOException
Throws: IOException
public static List<String> getAllPartitionFoldersThreeLevelsDown(org.apache.hadoop.fs.FileSystem fs, String basePath) throws IOException
Gets all partition paths, assuming date partitioning (year, month, day) three levels down.
Throws: IOException
public static String getRelativePartitionPath(org.apache.hadoop.fs.Path basePath, org.apache.hadoop.fs.Path fullPartitionPath)
Given a base path and a full partition path, returns the partition path relative to the base path.
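The idea behind `getRelativePartitionPath` can be shown with a plain-string sketch. The helper name and the prefix-stripping logic here are our own simplification; the real method operates on Hadoop `Path` objects and also has to reconcile scheme and authority differences.

```java
// Simplified sketch of computing a partition path relative to a table base
// path, in the spirit of FSUtils.getRelativePartitionPath. Assumes plain
// string paths with no scheme/authority; illustration only.
public class RelativePartitionPathSketch {
    static String relativize(String basePath, String fullPartitionPath) {
        String base = basePath.endsWith("/") ? basePath : basePath + "/";
        if (!fullPartitionPath.startsWith(base)) {
            throw new IllegalArgumentException(fullPartitionPath + " is not under " + base);
        }
        // Drop the base-path prefix, leaving the relative partition path.
        return fullPartitionPath.substring(base.length());
    }

    public static void main(String[] args) {
        System.out.println(relativize("/tmp/hoodie_table", "/tmp/hoodie_table/2022/01/01"));
    }
}
```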
public static List<String> getAllFoldersWithPartitionMetaFile(org.apache.hadoop.fs.FileSystem fs, String basePathStr) throws IOException
Obtain all the partition paths present in this table, denoted by the presence of
HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE_PREFIX.
If basePathStr is a subdirectory of the .hoodie folder, we assume that the partitions of an internal
table (a hoodie table within the .hoodie directory) are to be obtained.
Parameters: fs - FileSystem instance; basePathStr - base directory.
Throws: IOException
public static void processFiles(org.apache.hadoop.fs.FileSystem fs, String basePathStr, Function<org.apache.hadoop.fs.FileStatus,Boolean> consumer, boolean excludeMetaFolder) throws IOException
Recursively processes all files in the base path.
Parameters: fs - File System; basePathStr - base path; consumer - callback for processing; excludeMetaFolder - exclude the .hoodie folder.
Throws: IOException
public static List<String> getAllPartitionPaths(HoodieEngineContext engineContext, String basePathStr, boolean useFileListingFromMetadata, boolean assumeDatePartitioning)
public static List<String> getAllPartitionPaths(HoodieEngineContext engineContext, HoodieMetadataConfig metadataConfig, String basePathStr)
public static Map<String,org.apache.hadoop.fs.FileStatus[]> getFilesInPartitions(HoodieEngineContext engineContext, HoodieMetadataConfig metadataConfig, String basePathStr, String[] partitionPaths, String spillableMapPath)
public static String createNewFileIdPfx()
public static String getFileExtensionFromLog(org.apache.hadoop.fs.Path logPath)
public static String getFileIdFromLogPath(org.apache.hadoop.fs.Path path)
public static String getFileIdFromFilePath(org.apache.hadoop.fs.Path filePath)
public static String getBaseCommitTimeFromLogPath(org.apache.hadoop.fs.Path path)
public static Integer getTaskPartitionIdFromLogPath(org.apache.hadoop.fs.Path path)
public static String getWriteTokenFromLogPath(org.apache.hadoop.fs.Path path)
public static Integer getStageIdFromLogPath(org.apache.hadoop.fs.Path path)
public static Integer getTaskAttemptIdFromLogPath(org.apache.hadoop.fs.Path path)
public static int getFileVersionFromLog(org.apache.hadoop.fs.Path logPath)
public static String makeLogFileName(String fileId, String logFileExtension, String baseCommitTime, int version, String writeToken)
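The various `get...FromLogPath` accessors above all pull components out of a log file name, which `makeLogFileName` builds and `LOG_FILE_PATTERN` parses. The `".<fileId>_<baseCommitTime>.<ext>.<version>_<writeToken>"` layout and the regex below are our own illustrative reading of that convention, not the authoritative pattern from Hudi.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of the log-file naming round trip: build a name, then parse its
// components back out with a regex. The layout and pattern are assumptions
// for illustration; Hudi's LOG_FILE_PATTERN is the source of truth.
public class LogFileNameSketch {
    static final Pattern SKETCH_PATTERN =
        Pattern.compile("\\.(.+)_(.+)\\.(.+)\\.(\\d+)_(.+)");

    static String makeLogFileName(String fileId, String ext, String baseCommitTime,
                                  int version, String writeToken) {
        return "." + fileId + "_" + baseCommitTime + "." + ext + "." + version + "_" + writeToken;
    }

    public static void main(String[] args) {
        String name = makeLogFileName("file-1", "log", "20220101000000", 1, "1-0-1");
        Matcher m = SKETCH_PATTERN.matcher(name);
        if (m.matches()) {
            // group(1)=fileId, group(2)=baseCommitTime, group(4)=version
            System.out.println(m.group(1) + " v" + m.group(4));
        }
    }
}
```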
public static boolean isBaseFile(org.apache.hadoop.fs.Path path)
public static boolean isLogFile(org.apache.hadoop.fs.Path logPath)
public static boolean isLogFile(String fileName)
public static boolean isDataFile(org.apache.hadoop.fs.Path path)
public static org.apache.hadoop.fs.FileStatus[] getAllDataFilesInPartition(org.apache.hadoop.fs.FileSystem fs, org.apache.hadoop.fs.Path partitionPath) throws IOException
Throws: IOException
public static Option<HoodieLogFile> getLatestLogFile(org.apache.hadoop.fs.FileSystem fs, org.apache.hadoop.fs.Path partitionPath, String fileId, String logFileExtension, String baseCommitTime) throws IOException
Throws: IOException
public static Stream<HoodieLogFile> getAllLogFiles(org.apache.hadoop.fs.FileSystem fs, org.apache.hadoop.fs.Path partitionPath, String fileId, String logFileExtension, String baseCommitTime) throws IOException
Throws: IOException
public static Option<Pair<Integer,String>> getLatestLogVersion(org.apache.hadoop.fs.FileSystem fs, org.apache.hadoop.fs.Path partitionPath, String fileId, String logFileExtension, String baseCommitTime) throws IOException
Throws: IOException
public static int computeNextLogVersion(org.apache.hadoop.fs.FileSystem fs, org.apache.hadoop.fs.Path partitionPath, String fileId, String logFileExtension, String baseCommitTime) throws IOException
Throws: IOException
public static int getDefaultBufferSize(org.apache.hadoop.fs.FileSystem fs)
public static Short getDefaultReplication(org.apache.hadoop.fs.FileSystem fs, org.apache.hadoop.fs.Path path)
public static boolean recoverDFSFileLease(org.apache.hadoop.hdfs.DistributedFileSystem dfs, org.apache.hadoop.fs.Path p) throws IOException, InterruptedException
When a file was opened and the task died without closing the stream, another task executor cannot open it because the existing lease is still active.
Throws: IOException, InterruptedException
public static void createPathIfNotExists(org.apache.hadoop.fs.FileSystem fs, org.apache.hadoop.fs.Path partitionPath) throws IOException
Throws: IOException
public static Long getSizeInMB(long sizeInBytes)
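As a small worked example, a plausible reading of `getSizeInMB` is a bytes-to-mebibytes conversion. The truncating integer division below is our assumption; the actual rounding behavior of Hudi's implementation may differ.

```java
// Illustrative sketch of getSizeInMB: convert bytes to whole mebibytes.
// Integer division truncates; Hudi's actual rounding may differ.
public class SizeInMBSketch {
    static long getSizeInMB(long sizeInBytes) {
        return sizeInBytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println(getSizeInMB(5L * 1024 * 1024)); // prints 5
    }
}
```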
public static org.apache.hadoop.fs.Path getPartitionPath(String basePath, String partitionPath)
public static org.apache.hadoop.fs.Path getPartitionPath(org.apache.hadoop.fs.Path basePath, String partitionPath)
public static String getFileName(String filePathWithPartition, String partition)
Extracts the file name from the relative path based on the table base path.
Parameters: filePathWithPartition - the relative file path based on the table base path; partition - the relative partition path. For a partitioned table, `partition` contains the relative partition path; for a non-partitioned table, `partition` is empty.
public static String getDFSFullPartitionPath(org.apache.hadoop.fs.FileSystem fs, org.apache.hadoop.fs.Path fullPartitionPath)
Get the DFS full partition path.
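The `getFileName` contract described above (drop the partition prefix, or return the path as-is when `partition` is empty) can be sketched with plain strings. The logic is our own simplification for illustration, not Hudi's implementation.

```java
// Sketch of extracting a file name from a partition-relative path, following
// the getFileName parameter description: partitioned tables carry a
// "partition/" prefix, non-partitioned tables have an empty partition.
public class FileNameSketch {
    static String getFileName(String filePathWithPartition, String partition) {
        return partition.isEmpty()
            ? filePathWithPartition
            : filePathWithPartition.substring(partition.length() + 1); // skip trailing '/'
    }

    public static void main(String[] args) {
        System.out.println(getFileName("2022/01/01/abc.parquet", "2022/01/01")); // abc.parquet
        System.out.println(getFileName("abc.parquet", ""));                      // abc.parquet
    }
}
```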
public static boolean isGCSFileSystem(org.apache.hadoop.fs.FileSystem fs)
This exists because of HUDI-140: GCS has a different behavior for detecting EOF during seek().
Parameters: fs - FileSystem instance.
public static boolean isCHDFileSystem(org.apache.hadoop.fs.FileSystem fs)
Chdfs will throw IOException instead of EOFException, which causes an error in isBlockCorrupted().
Wrapped by BoundedFsDataInputStream to check in advance whether the desired offset is beyond the file size.
public static org.apache.hadoop.conf.Configuration registerFileSystem(org.apache.hadoop.fs.Path file, org.apache.hadoop.conf.Configuration conf)
public static HoodieWrapperFileSystem getFs(String path, SerializableConfiguration hadoopConf, ConsistencyGuardConfig consistencyGuardConfig)
Get the FS implementation for this table.
Parameters: path - path String; hadoopConf - serializable Hadoop configuration; consistencyGuardConfig - consistency guard config.
public static List<org.apache.hadoop.fs.FileStatus> getGlobStatusExcludingMetaFolder(org.apache.hadoop.fs.FileSystem fs, org.apache.hadoop.fs.Path globPath) throws IOException
Helper to filter out paths under the metadata folder when running fs.globStatus.
Parameters: fs - File System; globPath - glob path.
Throws: IOException - when having trouble listing the path
public static boolean deleteDir(HoodieEngineContext hoodieEngineContext, org.apache.hadoop.fs.FileSystem fs, org.apache.hadoop.fs.Path dirPath, int parallelism)
Deletes a directory by deleting sub-paths in parallel on the file system.
Parameters: hoodieEngineContext - HoodieEngineContext instance; fs - file system; dirPath - directory path; parallelism - parallelism to use for sub-paths.
Returns: true if the directory is deleted; false otherwise.
public static <T> Map<String,T> parallelizeSubPathProcess(HoodieEngineContext hoodieEngineContext, org.apache.hadoop.fs.FileSystem fs, org.apache.hadoop.fs.Path dirPath, int parallelism, Predicate<org.apache.hadoop.fs.FileStatus> subPathPredicate, FSUtils.SerializableFunction<Pair<String,SerializableConfiguration>,T> pairFunction)
Processes sub-paths in parallel.
Type parameter: T - type of result to return for each sub-path.
Parameters: hoodieEngineContext - HoodieEngineContext instance; fs - file system; dirPath - directory path; parallelism - parallelism to use for sub-paths; subPathPredicate - predicate used to filter sub-paths for processing; pairFunction - actual processing logic for each sub-path.
public static <T> Map<String,T> parallelizeFilesProcess(HoodieEngineContext hoodieEngineContext, org.apache.hadoop.fs.FileSystem fs, int parallelism, FSUtils.SerializableFunction<Pair<String,SerializableConfiguration>,T> pairFunction, List<String> subPaths)
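The shape of `parallelizeSubPathProcess` (apply a function to each sub-path in parallel and collect results into a `Map` keyed by path) can be mimicked with plain `java.util.stream`. This is a conceptual analogue using a parallel stream instead of Hudi's `HoodieEngineContext`, not the real implementation.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Pure-Java sketch of the parallelizeSubPathProcess idea: run a function over
// each sub-path in parallel and gather the per-path results into a Map.
public class ParallelizeSketch {
    static <T> Map<String, T> parallelizeSubPaths(List<String> subPaths,
                                                  Function<String, T> fn) {
        return subPaths.parallelStream()
            .collect(Collectors.toConcurrentMap(Function.identity(), fn));
    }

    public static void main(String[] args) {
        Map<String, Integer> lengths =
            parallelizeSubPaths(Arrays.asList("2022/01/01", "2022/01/02"), String::length);
        System.out.println(lengths.get("2022/01/01")); // prints 10
    }
}
```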
public static boolean deleteSubPath(String subPathStr, SerializableConfiguration conf, boolean recursive)
Deletes a sub-path.
Parameters: subPathStr - sub-path String; conf - serializable config; recursive - whether to delete recursively.
Returns: true if the sub-path is deleted; false otherwise.
public static List<org.apache.hadoop.fs.FileStatus> getFileStatusAtLevel(HoodieEngineContext hoodieEngineContext, org.apache.hadoop.fs.FileSystem fs, org.apache.hadoop.fs.Path rootPath, int expectLevel, int parallelism)
Lists file status at a certain level in the directory hierarchy.
E.g., given "/tmp/hoodie_table" as the rootPath and 3 as the expected level,
this method gives back the FileStatus of all files under
"/tmp/hoodie_table/[*]/[*]/[*]/" folders.
Parameters: hoodieEngineContext - HoodieEngineContext instance; fs - FileSystem instance; rootPath - root path for the file listing; expectLevel - expected level of directory hierarchy for files to be added; parallelism - parallelism for the file listing.

Copyright © 2022 The Apache Software Foundation. All rights reserved.