class DsvIngestionJob extends IngestionJob
Main class to ingest delimiter-separated values files.
- By Inheritance
- DsvIngestionJob
- IngestionJob
- SparkJob
- JobBase
- StrictLogging
- AnyRef
- Any
Instance Constructors
- new DsvIngestionJob(domain: Domain, schema: Schema, types: List[Type], path: List[Path], storageHandler: StorageHandler, schemaHandler: SchemaHandler, options: Map[String, String])(implicit settings: Settings)
  - domain: Input Dataset Domain
  - schema: Input Dataset Schema
  - types: List of globally defined types
  - path: Input dataset paths
  - storageHandler: Storage Handler
  - options: Parameters to pass as input (k1=v1,k2=v2,k3=v3)
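For orientation, a minimal wiring sketch in Scala. All bindings (domain, schema, types, handlers, the input path and the option keys) are assumptions provided by the caller, and imports for the library's own types are omitted since this page does not give their packages:

```scala
import org.apache.hadoop.fs.Path

// Sketch only: construct the job from pre-loaded configuration objects.
def buildJob(
    domain: Domain,
    schema: Schema,
    types: List[Type],
    storageHandler: StorageHandler,
    schemaHandler: SchemaHandler
)(implicit settings: Settings): DsvIngestionJob =
  new DsvIngestionJob(
    domain,
    schema,
    types,
    List(new Path("/incoming/orders.dsv")), // hypothetical input file
    storageHandler,
    schemaHandler,
    Map("k1" -> "v1", "k2" -> "v2")         // "k1=v1,k2=v2" in the flat input form
  )
```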
Value Members
- final def !=(arg0: Any): Boolean
  - Definition Classes: AnyRef → Any
- final def ##(): Int
  - Definition Classes: AnyRef → Any
- final def ==(arg0: Any): Boolean
  - Definition Classes: AnyRef → Any
- def analyze(fullTableName: String): Any
  - Attributes: protected
  - Definition Classes: SparkJob
- def applyIgnore(dfIn: DataFrame): Dataset[Row]
  - Attributes: protected
  - Definition Classes: IngestionJob
- final def asInstanceOf[T0]: T0
  - Definition Classes: Any
- def clone(): AnyRef
  - Attributes: protected[lang]
  - Definition Classes: AnyRef
  - Annotations: @throws( ... ) @native()
- def createSparkViews(views: Views, sqlParameters: Map[String, String]): Unit
  - Attributes: protected
  - Definition Classes: SparkJob
- val domain: Domain
  - Definition Classes: DsvIngestionJob → IngestionJob
- final def eq(arg0: AnyRef): Boolean
  - Definition Classes: AnyRef
- def equals(arg0: Any): Boolean
  - Definition Classes: AnyRef → Any
- lazy val extension: String
  - Definition Classes: IngestionJob
- def finalize(): Unit
  - Attributes: protected[lang]
  - Definition Classes: AnyRef
  - Annotations: @throws( classOf[java.lang.Throwable] )
- val flatRowValidator: GenericRowValidator
  - Attributes: protected
  - Definition Classes: IngestionJob
- lazy val format: String
  - Definition Classes: IngestionJob
- final def getClass(): Class[_]
  - Definition Classes: AnyRef → Any
  - Annotations: @native()
- def getWriteMode(): WriteMode
  - Definition Classes: IngestionJob
- def hashCode(): Int
  - Definition Classes: AnyRef → Any
  - Annotations: @native()
- def ingest(dataset: DataFrame): (RDD[_], RDD[_])
  Apply the schema to the dataset. This is where all the magic happens: valid records are stored in the accepted path / table and invalid records in the rejected path / table.
  - dataset: Spark Dataset
  - Attributes: protected
  - Definition Classes: DsvIngestionJob → IngestionJob
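As a conceptual illustration only (the actual validation is delegated to the configured GenericRowValidator), splitting rows by validity could look like the sketch below; `isValid` is a hypothetical stand-in for schema validation:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

// Sketch: partition raw rows into accepted and rejected sets.
def splitByValidity(rows: RDD[Row], isValid: Row => Boolean): (RDD[Row], RDD[Row]) = {
  val accepted = rows.filter(isValid)          // destined for the accepted path / table
  val rejected = rows.filter(r => !isValid(r)) // destined for the rejected path / table
  (accepted, rejected)
}
```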
- final def isInstanceOf[T0]: Boolean
  - Definition Classes: Any
- def loadDataSet(): Try[DataFrame]
  Load the dataset using the Spark CSV reader and all metadata. Does not infer the schema. Columns not defined in the schema are dropped from the dataset (requires datasets with a header).
  - returns: Spark Dataset
  - Attributes: protected
  - Definition Classes: DsvIngestionJob → IngestionJob
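The documented behaviour can be approximated with the plain Spark CSV reader; the delimiter and option values below are assumptions (the real job takes them from the merged metadata):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Sketch: read a headered DSV file without schema inference and keep only
// the columns declared in the schema.
def loadDsv(session: SparkSession, paths: List[String], schemaHeaders: List[String]): DataFrame = {
  val raw = session.read
    .option("header", "true")       // a header is required
    .option("inferSchema", "false") // the schema is supplied, never inferred
    .option("delimiter", ";")       // assumed separator
    .csv(paths: _*)
  raw.select(raw.columns.filter(schemaHeaders.contains).map(raw.col): _*)
}
```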
- val logger: Logger
  - Attributes: protected
  - Definition Classes: StrictLogging
- lazy val metadata: Metadata
  Merged metadata.
  - Definition Classes: IngestionJob
- def name: String
  - returns: Spark Job name
  - Definition Classes: DsvIngestionJob → JobBase
- final def ne(arg0: AnyRef): Boolean
  - Definition Classes: AnyRef
- final def notify(): Unit
  - Definition Classes: AnyRef
  - Annotations: @native()
- final def notifyAll(): Unit
  - Definition Classes: AnyRef
  - Annotations: @native()
- val now: Timestamp
  - Definition Classes: IngestionJob
- val options: Map[String, String]
  - Definition Classes: DsvIngestionJob → IngestionJob
- def parseViewDefinition(valueWithEnv: String): (SinkType, Option[JdbcConfigName], String)
  - valueWithEnv: In the form [SinkType:[configName:]]viewName
  - returns: (SinkType, configName, viewName)
  - Attributes: protected
  - Definition Classes: JobBase
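A sketch of how the [SinkType:[configName:]]viewName form might be parsed. The string-typed result and the default sink for a bare view name are illustrative assumptions; the real method returns the library's SinkType:

```scala
// Sketch: split a view definition into (sinkType, optional configName, viewName).
def parseView(valueWithEnv: String): (String, Option[String], String) =
  valueWithEnv.split(':') match {
    case Array(viewName)                       => ("SPARK", None, viewName) // assumed default sink
    case Array(sinkType, viewName)             => (sinkType, None, viewName)
    case Array(sinkType, configName, viewName) => (sinkType, Some(configName), viewName)
    case _ => throw new IllegalArgumentException(s"Invalid view definition: $valueWithEnv")
  }
```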
- def partitionDataset(dataset: DataFrame, partition: List[String]): DataFrame
  - Attributes: protected
  - Definition Classes: SparkJob
- def partitionedDatasetWriter(dataset: DataFrame, partition: List[String]): DataFrameWriter[Row]
  Partition a dataset using dataset columns. To partition the dataset using the ingestion time, use the reserved column names:
  - comet_date
  - comet_year
  - comet_month
  - comet_day
  - comet_hour
  - comet_minute
  These columns are renamed to "date", "year", "month", "day", "hour" and "minute" in the dataset, and their values are set to the current date/time.
  - dataset: Input dataset
  - partition: List of columns to use for partitioning.
  - returns: The DataFrameWriter for the partitioned dataset
  - Attributes: protected
  - Definition Classes: SparkJob
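In plain Spark terms the contract amounts to the following sketch; materializing the reserved comet_* columns from the ingestion time is omitted here:

```scala
import org.apache.spark.sql.{DataFrame, DataFrameWriter, Row}

// Sketch: build a writer partitioned by the given columns, if any.
def partitionedWriter(dataset: DataFrame, partition: List[String]): DataFrameWriter[Row] =
  if (partition.isEmpty) dataset.write
  else dataset.write.partitionBy(partition: _*)
```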
- val path: List[Path]
  - Definition Classes: DsvIngestionJob → IngestionJob
- def registerUdf(udf: String): Unit
  - Attributes: protected
  - Definition Classes: SparkJob
- def reorderAttributes(dataFrame: DataFrame): List[Attribute]
  - Definition Classes: IngestionJob
- def run(): Try[JobResult]
  Main entry point as required by the Spark Job interface.
  - returns: The job result, wrapped in a Try
  - Definition Classes: IngestionJob → JobBase
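A hedged usage sketch, assuming `job` was built as in the constructor example above:

```scala
import scala.util.{Failure, Success}

// Run the ingestion and inspect the Try result.
job.run() match {
  case Success(result) => println(s"Ingestion succeeded: $result")
  case Failure(error)  => println(s"Ingestion failed: ${error.getMessage}")
}
```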
- def saveAccepted(acceptedRDD: RDD[Row], orderedSparkTypes: StructType): (DataFrame, Path)
  - Attributes: protected
- def saveAccepted(dataframe: DataFrame): (DataFrame, Path)
  Merge the new and existing datasets if required, then save using overwrite / append mode.
  - Attributes: protected
  - Definition Classes: IngestionJob
- def saveRejected(rejectedRDD: RDD[String]): Try[Path]
  - Attributes: protected
  - Definition Classes: IngestionJob
- val schema: Schema
  - Definition Classes: DsvIngestionJob → IngestionJob
- val schemaHandler: SchemaHandler
  - Definition Classes: DsvIngestionJob → IngestionJob
- val schemaHeaders: List[String]
  Dataset header names as defined by the schema.
- lazy val session: SparkSession
  - Definition Classes: SparkJob
- implicit val settings: Settings
  - Definition Classes: DsvIngestionJob → JobBase
- lazy val sparkEnv: SparkEnv
  - Definition Classes: SparkJob
- val storageHandler: StorageHandler
  - Definition Classes: DsvIngestionJob → IngestionJob
- final def synchronized[T0](arg0: ⇒ T0): T0
  - Definition Classes: AnyRef
- def toString(): String
  - Definition Classes: AnyRef → Any
- val treeRowValidator: GenericRowValidator
  - Attributes: protected
  - Definition Classes: IngestionJob
- val types: List[Type]
  - Definition Classes: DsvIngestionJob → IngestionJob
- final def wait(): Unit
  - Definition Classes: AnyRef
  - Annotations: @throws( ... )
- final def wait(arg0: Long, arg1: Int): Unit
  - Definition Classes: AnyRef
  - Annotations: @throws( ... )
- final def wait(arg0: Long): Unit
  - Definition Classes: AnyRef
  - Annotations: @throws( ... ) @native()