public class DataTokenization
extends java.lang.Object
The DataTokenization pipeline reads data from one of the supported sources, tokenizes the data with external API calls to a tokenization server, and writes the data into one of the supported sinks.

## Pipeline Requirements

The required pipeline parameters are listed under *Running the pipeline* below.

## Example Usage

### Gradle Preparation

To run this example, your `build.gradle` file should contain the following task to execute the pipeline:

```groovy
task execute(type: JavaExec) {
    mainClass = System.getProperty("mainClass")
    classpath = sourceSets.main.runtimeClasspath
    systemProperties System.getProperties()
    args System.getProperty("exec.args", "").split()
}
```

This task allows running the pipeline via the following command:

```
gradle clean execute -DmainClass=org.apache.beam.examples.complete.datatokenization.DataTokenization \
    -Dexec.args="--<argument>=<value> --<argument>=<value>"
```

### Running the pipeline

To execute this pipeline, specify the parameters:

- Data schema
  - `dataSchemaPath`: Path to the data schema (JSON format) compatible with BigQuery.
- One input source out of these:
  - File System
    - `inputFilePattern`: File pattern for files to read data from
    - `inputFileFormat`: File format of input files. Supported formats: JSON, CSV
    - If the input data is in CSV format:
      - `csvContainsHeaders`: `true` if the file(s) to read data from contain headers, `false` otherwise
      - `csvDelimiter`: Delimiting character in CSV. Default: use the delimiter provided in `csvFormat`
      - `csvFormat`: CSV format according to Apache Commons CSV. Default: [Apache Commons CSV default](https://static.javadoc.io/org.apache.commons/commons-csv/1.7/org/apache/commons/csv/CSVFormat.html#DEFAULT). Must exactly match one of the format names found at https://static.javadoc.io/org.apache.commons/commons-csv/1.7/org/apache/commons/csv/CSVFormat.Predefined.html
  - Google Pub/Sub
    - `pubsubTopic`: The Cloud Pub/Sub topic to read from, in the format `projects/yourproject/topics/yourtopic`
- One output sink out of these:
  - File System
    - `outputDirectory`: Directory to write data to
    - `outputFileFormat`: File format of output files. Supported formats: JSON, CSV
    - `windowDuration`: The window duration in which data will be written. Should be specified only for the 'Pub/Sub -> FileSystem' case. Defaults to 30s. Allowed formats are: `Ns` (seconds, example: 5s), `Nm` (minutes, example: 12m), `Nh` (hours, example: 2h).
  - Google Cloud BigQuery
    - `bigQueryTableName`: Cloud BigQuery table name to write into
    - `tempLocation`: Folder in a Google Cloud Storage bucket, which is needed for BigQuery to handle data writing
  - Cloud Bigtable
    - `bigTableProjectId`: ID of the project where the Cloud Bigtable instance to write into is located
    - `bigTableInstanceId`: ID of the Cloud Bigtable instance to write into
    - `bigTableTableId`: ID of the Cloud Bigtable table to write into
    - `bigTableKeyColumnName`: Column name to use as a key in Cloud Bigtable
    - `bigTableColumnFamilyName`: Column family name to use in Cloud Bigtable
- RPC server parameters
  - `rpcUri`: URI for the API calls to the RPC server
  - `batchSize`: Size of the batch to send to the RPC server per request

The template also allows the user to supply the following optional parameter:

- `nonTokenizedDeadLetterPath`: Folder where data that failed to tokenize will be stored

Specify the parameters in the following format:

```
--dataSchemaPath="path-to-data-schema-in-json-format"
--inputFilePattern="path-pattern-to-input-data"
--outputDirectory="path-to-output-directory"
# example for CSV case
--inputFileFormat="CSV"
--outputFileFormat="CSV"
--csvContainsHeaders="true"
--nonTokenizedDeadLetterPath="path-to-errors-rows-writing"
--batchSize=batch-size-number
--rpcUri=http://host:port/tokenize
```

By default, this will run the pipeline locally with the DirectRunner. To change the runner, specify:

```
--runner=YOUR_SELECTED_RUNNER
```
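The `Ns`/`Nm`/`Nh` window-duration format above can be parsed with a small helper like the following sketch. This is a hypothetical utility for illustration only, not the template's actual duration parsing:

```java
import java.time.Duration;

public class WindowDurationParser {
    // Parses "Ns", "Nm", or "Nh" (e.g. "5s", "12m", "2h") into a Duration.
    public static Duration parse(String value) {
        if (value == null || value.length() < 2) {
            throw new IllegalArgumentException("Expected format Ns, Nm, or Nh, got: " + value);
        }
        char unit = value.charAt(value.length() - 1);
        long amount = Long.parseLong(value.substring(0, value.length() - 1));
        switch (unit) {
            case 's': return Duration.ofSeconds(amount);
            case 'm': return Duration.ofMinutes(amount);
            case 'h': return Duration.ofHours(amount);
            default:
                throw new IllegalArgumentException("Unknown unit '" + unit + "' in: " + value);
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("30s").getSeconds()); // prints 30
        System.out.println(parse("12m").toMinutes());  // prints 12
        System.out.println(parse("2h").toHours());     // prints 2
    }
}
```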
| Modifier and Type | Field and Description |
|---|---|
| `static FailsafeElementCoder<java.lang.String,java.lang.String>` | `FAILSAFE_ELEMENT_CODER` String/String Coder for FailsafeElement. |
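The failsafe coder exists so that elements which fail tokenization keep their original payload alongside the error, allowing them to be routed to the dead-letter output instead of being dropped. A minimal sketch of the idea, using a hypothetical simplified class rather than Beam's actual FailsafeElement:

```java
// Simplified illustration of the failsafe pattern: carry the original
// payload together with any processing error, so failed records can be
// written to a dead-letter location (e.g. nonTokenizedDeadLetterPath).
public class SimpleFailsafeElement {
    private final String originalPayload;
    private final String errorMessage; // null if processing succeeded

    public SimpleFailsafeElement(String originalPayload, String errorMessage) {
        this.originalPayload = originalPayload;
        this.errorMessage = errorMessage;
    }

    public String getOriginalPayload() { return originalPayload; }
    public String getErrorMessage() { return errorMessage; }
    public boolean failed() { return errorMessage != null; }

    public static void main(String[] args) {
        SimpleFailsafeElement ok = new SimpleFailsafeElement("{\"id\":1}", null);
        SimpleFailsafeElement bad = new SimpleFailsafeElement("not-json", "parse error");
        System.out.println(ok.failed());  // prints false
        System.out.println(bad.failed()); // prints true
    }
}
```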
| Constructor and Description |
|---|
| `DataTokenization()` |
| Modifier and Type | Method and Description |
|---|---|
| `static void` | `main(java.lang.String[] args)` Main entry point for pipeline execution. |
| `static org.apache.beam.sdk.PipelineResult` | `run(DataTokenizationOptions options)` Runs the pipeline to completion with the specified options. |
### Field Detail

```java
public static final FailsafeElementCoder<java.lang.String,java.lang.String> FAILSAFE_ELEMENT_CODER
```

String/String Coder for FailsafeElement.
### Method Detail

```java
public static void main(java.lang.String[] args)
```

Main entry point for pipeline execution.

Parameters:
- `args` - Command line arguments to the pipeline.

```java
public static org.apache.beam.sdk.PipelineResult run(DataTokenizationOptions options)
```

Runs the pipeline to completion with the specified options.

Parameters:
- `options` - The execution options.
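The `batchSize` parameter described earlier controls how many elements are sent to the RPC tokenization server per request. The grouping it implies can be sketched as follows; this is a hypothetical helper for illustration, not the template's actual implementation:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class Batcher {
    // Splits elements into consecutive batches of at most batchSize,
    // mirroring how rows would be grouped before each RPC call.
    public static List<List<String>> batch(List<String> elements, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<String>> batches = new ArrayList<>();
        for (int i = 0; i < elements.size(); i += batchSize) {
            batches.add(new ArrayList<>(
                elements.subList(i, Math.min(i + batchSize, elements.size()))));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("r1", "r2", "r3", "r4", "r5");
        List<List<String>> batches = batch(rows, 2);
        System.out.println(batches.size()); // prints 3
        System.out.println(batches.get(2)); // prints [r5]
    }
}
```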