Class SchemaConformingTransformer

  • All Implemented Interfaces:
    Serializable, RecordTransformer

    public class SchemaConformingTransformer
    extends Object
    implements RecordTransformer
    This transformer reshapes records with varying keys so that they can be stored in a table with a fixed schema. Because the keys vary from record to record, it is impractical to store each field in its own table column. At the same time, most (if not all) fields may be important to the user, so no field should be dropped unnecessarily. The transformer therefore takes the record fields that don't exist in the schema and stores them in catch-all fields.

    For example, consider this record:

     {
       "timestamp": 1687786535928,
       "hostname": "host1",
       "HOSTNAME": "host1",
       "level": "INFO",
       "message": "Started processing job1",
       "tags": {
         "platform": "data",
         "service": "serializer",
         "params": {
           "queueLength": 5,
           "timeout": 299,
           "userData_noIndex": {
             "nth": 99
           }
         }
       }
     }
     
    And let's say the table's schema contains these fields:
    • timestamp
    • hostname
    • level
    • message
    • tags.platform
    • tags.service
    • indexableExtras
    • unindexableExtras

    Without this transformer, the entire "tags" field would be dropped when storing the record in the table. However, with this transformer, the record would be transformed into the following:

     {
       "timestamp": 1687786535928,
       "hostname": "host1",
       "level": "INFO",
       "message": "Started processing job1",
       "tags.platform": "data",
       "tags.service": "serializer",
       "indexableExtras": {
         "tags": {
           "params": {
             "queueLength": 5,
             "timeout": 299
           }
         }
       },
       "unindexableExtras": {
         "tags": {
           "params": {
             "userData_noIndex": {
               "nth": 99
             }
           }
         }
       }
     }
     
    Notice that the transformer:
    • Flattens nested fields which exist in the schema, like "tags.platform"
    • Drops fields such as "HOSTNAME", but only when they are listed in the config option "fieldPathsToDrop"
    • Moves fields which don't exist in the schema and have the suffix "_noIndex" into the "unindexableExtras" field (the field name is configurable)
    • Moves any remaining fields which don't exist in the schema into the "indexableExtras" field (the field name is configurable)

    The "unindexableExtras" field allows the transformer to separate fields which don't need indexing (because they are only retrieved, not searched) from those that do. The transformer also has other configuration options specified in SchemaConformingTransformerConfig.
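
    The routing rules listed above can be sketched in plain Java. This is a minimal illustration over nested Maps, not the Pinot implementation: the class SchemaConformingSketch, its conform method, and the hard-coded field names and suffix are hypothetical, and it assumes that keys contain no dots. Each leftover field is kept under its full original path inside the extras map.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class SchemaConformingSketch {
    // Names taken from the example above; in the real transformer they are configurable.
    static final String INDEXABLE_EXTRAS = "indexableExtras";
    static final String UNINDEXABLE_EXTRAS = "unindexableExtras";
    static final String NO_INDEX_SUFFIX = "_noIndex";

    /** Hypothetical helper: returns a flat record that fits the given schema fields. */
    public static Map<String, Object> conform(Map<String, Object> record,
                                              Set<String> schemaFields,
                                              Set<String> fieldPathsToDrop) {
        Map<String, Object> out = new LinkedHashMap<>();
        Map<String, Object> indexable = new LinkedHashMap<>();
        Map<String, Object> unindexable = new LinkedHashMap<>();
        walk("", record, schemaFields, fieldPathsToDrop, out, indexable, unindexable);
        if (!indexable.isEmpty()) {
            out.put(INDEXABLE_EXTRAS, indexable);
        }
        if (!unindexable.isEmpty()) {
            out.put(UNINDEXABLE_EXTRAS, unindexable);
        }
        return out;
    }

    @SuppressWarnings("unchecked")
    private static void walk(String prefix, Map<String, Object> node,
                             Set<String> schemaFields, Set<String> fieldPathsToDrop,
                             Map<String, Object> out,
                             Map<String, Object> indexable,
                             Map<String, Object> unindexable) {
        for (Map.Entry<String, Object> e : node.entrySet()) {
            String key = e.getKey();
            String path = prefix.isEmpty() ? key : prefix + "." + key;
            Object value = e.getValue();
            if (fieldPathsToDrop.contains(path)) {
                continue; // explicitly configured to be dropped
            }
            if (schemaFields.contains(path)) {
                out.put(path, value); // flattened into its own column
            } else if (key.endsWith(NO_INDEX_SUFFIX)) {
                putNested(unindexable, path, value); // whole subtree is kept unindexed
            } else if (value instanceof Map) {
                walk(path, (Map<String, Object>) value, schemaFields, fieldPathsToDrop,
                     out, indexable, unindexable);
            } else {
                putNested(indexable, path, value);
            }
        }
    }

    /** Stores value under a dotted path, creating intermediate maps as needed. */
    @SuppressWarnings("unchecked")
    private static void putNested(Map<String, Object> root, String path, Object value) {
        String[] parts = path.split("\\.");
        Map<String, Object> cur = root;
        for (int i = 0; i < parts.length - 1; i++) {
            cur = (Map<String, Object>) cur.computeIfAbsent(
                parts[i], k -> new LinkedHashMap<String, Object>());
        }
        cur.put(parts[parts.length - 1], value);
    }

    public static void main(String[] args) {
        Map<String, Object> tags = new LinkedHashMap<>();
        tags.put("platform", "data");
        tags.put("extra", 1); // not in the schema, so it lands in indexableExtras
        Map<String, Object> record = new LinkedHashMap<>();
        record.put("hostname", "host1");
        record.put("HOSTNAME", "host1");
        record.put("tags", tags);
        Set<String> schema = Set.of("hostname", "tags.platform",
                                    INDEXABLE_EXTRAS, UNINDEXABLE_EXTRAS);
        System.out.println(conform(record, schema, Set.of("HOSTNAME")));
    }
}
```

    The sketch checks the drop list first, then the schema, then the suffix, and only then recurses. That order is a design choice of this example; the precedence used by the actual transformer is not stated here.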

    See Also:
    Serialized Form
    • Constructor Detail

      • SchemaConformingTransformer

        public SchemaConformingTransformer(TableConfig tableConfig,
                                           Schema schema)
    • Method Detail

      • validateSchema

        public static void validateSchema(@Nonnull Schema schema,
                                          @Nonnull SchemaConformingTransformerConfig transformerConfig)
        Validates the schema against the given transformer's configuration.
      • isNoOp

        public boolean isNoOp()
        Description copied from interface: RecordTransformer
        Returns true if the transformer is no-op (can be skipped), false otherwise.
        Specified by:
        isNoOp in interface RecordTransformer
      • transform

        @Nullable
        public GenericRow transform(GenericRow record)
        Description copied from interface: RecordTransformer
        Transforms a record based on some custom rules.
        Specified by:
        transform in interface RecordTransformer
        Parameters:
        record - Record to transform
        Returns:
        Transformed record, or null if the record does not follow certain rules.
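
    How a caller might honor isNoOp() and the nullable result of transform() can be sketched as follows. The RowTransformer interface below is a local stand-in over Map rows so that the example runs without Pinot on the classpath; it is not the RecordTransformer interface itself. Treating a null result as "skip this record" is an assumption about the caller, as is the TransformerPipelineSketch class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class TransformerPipelineSketch {
    /** Local stand-in that mirrors the two documented methods; not a Pinot type. */
    public interface RowTransformer {
        default boolean isNoOp() {
            return false;
        }

        /** May return null when the row does not follow the transformer's rules. */
        Map<String, Object> transform(Map<String, Object> row);
    }

    /** Applies the transformers in order and leaves out rows that become null. */
    public static List<Map<String, Object>> apply(List<RowTransformer> transformers,
                                                  List<Map<String, Object>> rows) {
        // Filter out no-op transformers once, instead of checking for every row.
        List<RowTransformer> active = new ArrayList<>();
        for (RowTransformer t : transformers) {
            if (!t.isNoOp()) {
                active.add(t);
            }
        }

        List<Map<String, Object>> result = new ArrayList<>();
        for (Map<String, Object> row : rows) {
            Map<String, Object> current = row;
            for (RowTransformer t : active) {
                current = t.transform(current);
                if (current == null) {
                    break; // assumption: a null result means the row is skipped
                }
            }
            if (current != null) {
                result.add(current);
            }
        }
        return result;
    }
}
```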