Class DataTokenization
- java.lang.Object
  - org.apache.beam.examples.complete.datatokenization.DataTokenization

public class DataTokenization extends java.lang.Object

The DataTokenization pipeline reads data from one of the supported sources, tokenizes the data with external API calls to a tokenization server, and writes the data into one of the supported sinks.
Pipeline Requirements
- Java 8
- Data schema (JSON with an array of fields described in BigQuery format)
- One of the supported sources to read data from:
  - File system (JSON or CSV only)
  - Google Pub/Sub
- One of the supported destination sinks to write data into:
  - File system (JSON or CSV only)
  - Google Cloud BigQuery
  - Cloud Bigtable
- A configured tokenization server
Example Usage
Gradle Preparation

To run this example, your build.gradle file should contain the following task to execute the pipeline:

```
task execute(type: JavaExec) {
    mainClass = System.getProperty("mainClass")
    classpath = sourceSets.main.runtimeClasspath
    systemProperties System.getProperties()
    args System.getProperty("exec.args", "").split()
}
```

This task allows running the pipeline via the following command:

```
gradle clean execute -DmainClass=org.apache.beam.examples.complete.datatokenization.DataTokenization \
    -Dexec.args="--<argument>=<value> --<argument>=<value>"
```

Running the pipeline

To execute this pipeline, specify the parameters:

- Data schema
  - dataSchemaPath: Path to the data schema (JSON format) compatible with BigQuery.
- One specified input source out of these:
  - File system
    - inputFilePattern: File pattern for files to read data from
    - inputFileFormat: File format of input files. Supported formats: JSON, CSV
    - If the input data is in CSV format:
      - csvContainsHeaders: `true` if the file(s) to read data from contain headers, `false` otherwise
      - csvDelimiter: Delimiting character in CSV. Default: use the delimiter provided in csvFormat
      - csvFormat: CSV format according to the Apache Commons CSV format. Default: the [Apache Commons CSV default](https://static.javadoc.io/org.apache.commons/commons-csv/1.7/org/apache/commons/csv/CSVFormat.html#DEFAULT). Must exactly match one of the format names listed at https://static.javadoc.io/org.apache.commons/commons-csv/1.7/org/apache/commons/csv/CSVFormat.Predefined.html
  - Google Pub/Sub
    - pubsubTopic: The Cloud Pub/Sub topic to read from, in the format 'projects/yourproject/topics/yourtopic'
- One specified output sink out of these:
  - File system
    - outputDirectory: Directory to write data to
    - outputFileFormat: File format of output files. Supported formats: JSON, CSV
    - windowDuration: The window duration in which data will be written. Should be specified only for the 'Pub/Sub -> FileSystem' case. Defaults to 30s. Allowed formats are:
      - Ns (for seconds, example: 5s)
      - Nm (for minutes, example: 12m)
      - Nh (for hours, example: 2h)
  - Google Cloud BigQuery
    - bigQueryTableName: Cloud BigQuery table name to write into
    - tempLocation: Folder in a Google Cloud Storage bucket, which is needed for BigQuery to handle data writing
  - Cloud Bigtable
    - bigTableProjectId: Id of the project where the Cloud Bigtable instance to write into is located
    - bigTableInstanceId: Id of the Cloud Bigtable instance to write into
    - bigTableTableId: Id of the Cloud Bigtable table to write into
    - bigTableKeyColumnName: Column name to use as a key in Cloud Bigtable
    - bigTableColumnFamilyName: Column family name to use in Cloud Bigtable
- RPC server parameters
  - rpcUri: URI for the API calls to the RPC server
  - batchSize: Size of the batch to send to the RPC server per request

The template allows the user to supply the following optional parameter:

- nonTokenizedDeadLetterPath: Folder where data that failed to tokenize will be stored

Specify the parameters in the following format:

```
--dataSchemaPath="path-to-data-schema-in-json-format"
--inputFilePattern="path-pattern-to-input-data"
--outputDirectory="path-to-output-directory"
# example for the CSV case
--inputFileFormat="CSV"
--outputFileFormat="CSV"
--csvContainsHeaders="true"
--nonTokenizedDeadLetterPath="path-to-errors-rows-writing"
--batchSize=batch-size-number
--rpcUri=http://host:port/tokenize
```

By default, this will run the pipeline locally with the DirectRunner. To change the runner, specify:

```
--runner=YOUR_SELECTED_RUNNER
```
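The file referenced by dataSchemaPath is a JSON document containing an array of fields in BigQuery table-schema style (name, type, and optionally mode per field). A minimal hypothetical example is sketched below; the field names and types are illustrative, not taken from this documentation:

```json
{
  "fields": [
    {"name": "card_number", "type": "STRING", "mode": "REQUIRED"},
    {"name": "amount", "type": "FLOAT", "mode": "NULLABLE"}
  ]
}
```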
Field Summary
Fields

- static FailsafeElementCoder<java.lang.String,java.lang.String> FAILSAFE_ELEMENT_CODER: String/String Coder for FailsafeElement.
Constructor Summary
Constructors

- DataTokenization()
Method Summary
All methods are static and concrete.

- static void main(java.lang.String[] args): Main entry point for pipeline execution.
- static org.apache.beam.sdk.PipelineResult run(DataTokenizationOptions options): Runs the pipeline to completion with the specified options.
Field Detail
FAILSAFE_ELEMENT_CODER
public static final FailsafeElementCoder<java.lang.String,java.lang.String> FAILSAFE_ELEMENT_CODER
String/String Coder for FailsafeElement.
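A FailsafeElement carries both the original payload and its current form so that records that fail tokenization can be routed to the dead-letter path with their source data intact. A minimal sketch of building the String/String coder, assuming a FailsafeElementCoder.of factory in the example's utils subpackage (the package path and factory name are assumptions, not confirmed by this page):

```java
import org.apache.beam.examples.complete.datatokenization.utils.FailsafeElementCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;

public class CoderSketch {
  public static void main(String[] args) {
    // Coder for elements that keep the original String payload alongside
    // the current (possibly transformed) String payload.
    FailsafeElementCoder<String, String> coder =
        FailsafeElementCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of());
    System.out.println(coder);
  }
}
```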
Method Detail
main
public static void main(java.lang.String[] args)
Main entry point for pipeline execution.

Parameters:
- args: Command line arguments to the pipeline.
run
public static org.apache.beam.sdk.PipelineResult run(DataTokenizationOptions options)
Runs the pipeline to completion with the specified options.

Parameters:
- options: The execution options.

Returns:
- The pipeline result.
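Besides the gradle task shown above, run(DataTokenizationOptions) makes it possible to launch the pipeline programmatically. A minimal sketch, assuming DataTokenizationOptions lives in the example's options subpackage and with placeholder argument values that must be replaced with real paths and a real tokenization endpoint:

```java
import org.apache.beam.examples.complete.datatokenization.DataTokenization;
import org.apache.beam.examples.complete.datatokenization.options.DataTokenizationOptions;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class LaunchTokenization {
  public static void main(String[] args) {
    // Hypothetical argument values; replace with real paths and endpoints.
    String[] pipelineArgs = {
      "--dataSchemaPath=/path/to/schema.json",
      "--inputFilePattern=/path/to/input/*.csv",
      "--inputFileFormat=CSV",
      "--csvContainsHeaders=true",
      "--outputDirectory=/path/to/output",
      "--outputFileFormat=CSV",
      "--rpcUri=http://localhost:8080/tokenize",
      "--batchSize=100"
    };

    // Parse and validate the arguments into the pipeline's options interface.
    DataTokenizationOptions options =
        PipelineOptionsFactory.fromArgs(pipelineArgs)
            .withValidation()
            .as(DataTokenizationOptions.class);

    // Run the pipeline and block until it finishes on the chosen runner
    // (the DirectRunner by default).
    PipelineResult result = DataTokenization.run(options);
    result.waitUntilFinish();
  }
}
```

With no --runner argument the DirectRunner executes the pipeline locally, which is convenient for verifying the schema and tokenization-server wiring before deploying to a distributed runner.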