Class DataTokenization


  • public class DataTokenization
    extends java.lang.Object
    The DataTokenization pipeline reads data from one of the supported sources, tokenizes it with external API calls to a tokenization server, and writes it into one of the supported sinks.

    Pipeline Requirements

    • Java 8
    • Data schema (JSON with an array of fields described in BigQuery format)
    • One of the supported sources to read data from
    • One of the supported destination sinks to write data into
    • A configured tokenization server

    Example Usage

     Gradle Preparation
     To run this example, your build.gradle file should contain the following task
     to execute the pipeline:
       
       task execute (type:JavaExec) {
          mainClass = System.getProperty("mainClass")
          classpath = sourceSets.main.runtimeClasspath
          systemProperties System.getProperties()
          args System.getProperty("exec.args", "").split()
       }
       
     This task allows you to run the pipeline via the following command:
       
       gradle clean execute -DmainClass=org.apache.beam.examples.complete.datatokenization.DataTokenization \
            -Dexec.args="--<argument>=<value> --<argument>=<value>"
       
     Running the pipeline
     To execute this pipeline, specify the parameters:
    
     - Data schema
         - dataSchemaPath: Path to data schema (JSON format) compatible with BigQuery.
     - One input source, chosen from the following:
         - File System
             - inputFilePattern: File pattern for files to read data from
             - inputFileFormat: File format of input files. Supported formats: JSON, CSV
             - If the input data is in CSV format:
                 - csvContainsHeaders: `true` if the file(s) to read contain headers,
                   and `false` otherwise
                 - csvDelimiter: Delimiter character in CSV. Default: the delimiter provided in
                   csvFormat
                 - csvFormat: CSV format according to Apache Commons CSV. Default:
                   [Apache Commons CSV default](https://static.javadoc.io/org.apache.commons/commons-csv/1.7/org/apache/commons/csv/CSVFormat.html#DEFAULT).
                   Must exactly match one of the format names listed
                   at: https://static.javadoc.io/org.apache.commons/commons-csv/1.7/org/apache/commons/csv/CSVFormat.Predefined.html
         - Google Pub/Sub
             - pubsubTopic: The Cloud Pub/Sub topic to read from, in the format
               'projects/yourproject/topics/yourtopic'
     - One output sink, chosen from the following:
         - File System
             - outputDirectory: Directory to write data to
             - outputFileFormat: File format of output files. Supported formats: JSON, CSV
             - windowDuration: The window duration in which data will be written. Should be specified
               only for the 'Pub/Sub -> FileSystem' case. Defaults to 30s.
    
               Allowed formats are:
                 - Ns (for seconds, example: 5s),
                 - Nm (for minutes, example: 12m),
                 - Nh (for hours, example: 2h).
         - Google Cloud BigQuery
             - bigQueryTableName: Cloud BigQuery table name to write into
             - tempLocation: Folder in a Google Cloud Storage bucket that BigQuery needs
               to handle data writing
         - Cloud BigTable
             - bigTableProjectId: ID of the project where the Cloud BigTable instance to write into
               is located
             - bigTableInstanceId: ID of the Cloud BigTable instance to write into
             - bigTableTableId: ID of the Cloud BigTable table to write into
             - bigTableKeyColumnName: Column name to use as a key in Cloud BigTable
             - bigTableColumnFamilyName: Column family name to use in Cloud BigTable
     - RPC server parameters
         - rpcUri: URI for API calls to the RPC server
         - batchSize: Size of the batch sent to the RPC server per request
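     The windowDuration values described above follow an N-plus-unit pattern (Ns, Nm, Nh). As a rough illustration of that format only, here is a minimal stdlib sketch of such a parser; this is hypothetical and not the pipeline's actual parsing code:

     ```java
     import java.time.Duration;

     /** Illustrative parser for windowDuration-style values such as "5s", "12m", "2h".
      *  Hypothetical sketch; the pipeline's actual implementation may differ. */
     class WindowDurationParser {
         static Duration parse(String value) {
             if (value == null || value.length() < 2) {
                 throw new IllegalArgumentException("Expected <N><s|m|h>, got: " + value);
             }
             char unit = value.charAt(value.length() - 1);
             long n = Long.parseLong(value.substring(0, value.length() - 1));
             switch (unit) {
                 case 's': return Duration.ofSeconds(n);
                 case 'm': return Duration.ofMinutes(n);
                 case 'h': return Duration.ofHours(n);
                 default: throw new IllegalArgumentException("Unknown unit: " + unit);
             }
         }
     }
     ```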
    
     The template also allows the user to supply the following optional parameter:
    
     - nonTokenizedDeadLetterPath: Folder where data that failed to tokenize will be stored
    
    
     Specify the parameters in the following format:
    
     
     --dataSchemaPath="path-to-data-schema-in-json-format"
     --inputFilePattern="path-pattern-to-input-data"
     --outputDirectory="path-to-output-directory"
     # example for CSV case
     --inputFileFormat="CSV"
     --outputFileFormat="CSV"
     --csvContainsHeaders="true"
     --nonTokenizedDeadLetterPath="path-to-errors-rows-writing"
     --batchSize=batch-size-number
     --rpcUri=http://host:port/tokenize
     
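     For reference, a file passed via dataSchemaPath might look like the following. This is a minimal sketch with hypothetical field names; per the Pipeline Requirements above, the format is a JSON object with an array of fields described in BigQuery format:

     ```json
     {
       "fields": [
         {"name": "user_id", "type": "STRING", "mode": "REQUIRED"},
         {"name": "email", "type": "STRING", "mode": "NULLABLE"}
       ]
     }
     ```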
    
     By default, this will run the pipeline locally with the DirectRunner. To change the runner, specify:
    
     
     --runner=YOUR_SELECTED_RUNNER
     
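     The batchSize parameter described above bounds how many records are sent to the RPC server per request. A minimal sketch of such grouping, using only the standard library; this is illustrative only and not the pipeline's actual batching code:

     ```java
     import java.util.ArrayList;
     import java.util.List;

     /** Illustrative batching: splits records into sublists of at most batchSize
      *  elements, mirroring how batchSize bounds each RPC request payload.
      *  Hypothetical sketch, not the pipeline's implementation. */
     class Batcher {
         static List<List<String>> batch(List<String> records, int batchSize) {
             List<List<String>> batches = new ArrayList<>();
             for (int i = 0; i < records.size(); i += batchSize) {
                 // Copy each window of up to batchSize records into its own list.
                 batches.add(new ArrayList<>(
                     records.subList(i, Math.min(i + batchSize, records.size()))));
             }
             return batches;
         }
     }
     ```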
     
    • Method Summary

      • static void main​(java.lang.String[] args)
        Main entry point for pipeline execution.
      • static org.apache.beam.sdk.PipelineResult run​(DataTokenizationOptions options)
        Runs the pipeline to completion with the specified options.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • FAILSAFE_ELEMENT_CODER

        public static final FailsafeElementCoder<java.lang.String,​java.lang.String> FAILSAFE_ELEMENT_CODER
        String/String Coder for FailsafeElement.
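        A failsafe element pairs the original payload with its current (possibly transformed) payload, so a record that fails tokenization can still be written unchanged to the dead-letter path. A minimal stdlib analogue, as a hypothetical sketch rather than Beam's actual FailsafeElement class:

        ```java
        /** Illustrative analogue of a failsafe element: keeps the original payload
         *  alongside the current payload so failed records can be routed to the
         *  dead-letter path unmodified. Hypothetical sketch only. */
        class SimpleFailsafeElement {
            private final String original;
            private final String current;

            SimpleFailsafeElement(String original, String current) {
                this.original = original;
                this.current = current;
            }

            String getOriginal() { return original; }
            String getCurrent() { return current; }
        }
        ```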
    • Constructor Detail

      • DataTokenization

        public DataTokenization()
    • Method Detail

      • main

        public static void main​(java.lang.String[] args)
        Main entry point for pipeline execution.
        Parameters:
        args - Command line arguments to the pipeline.
      • run

        public static org.apache.beam.sdk.PipelineResult run​(DataTokenizationOptions options)
        Runs the pipeline to completion with the specified options.
        Parameters:
        options - The execution options.
        Returns:
        The pipeline result.