class Pool extends Params with HasLabelCol with HasFeaturesCol with HasWeightCol with Logging
CatBoost's abstraction of a dataset.
Features data can be stored in raw (features column has org.apache.spark.ml.linalg.Vector type)
or quantized (float feature values are quantized into integer bin values, features column has
Array[Byte] type) form.
Raw Pool can be transformed to quantized form using quantize method.
This is useful if this dataset is used for training multiple times and quantization parameters do not
change. Pre-quantized Pool allows to cache quantized features data and so do not re-run
feature quantization step at the start of an each training.
- Alphabetic
- By Inheritance
- Pool
- Logging
- HasWeightCol
- HasFeaturesCol
- HasLabelCol
- Params
- Serializable
- Identifiable
- AnyRef
- Any
- Hide All
- Show All
- Public
- Protected
Instance Constructors
- new Pool(data: DataFrame, pairsData: DataFrame)
Construct Pool from DataFrame also specifying pairs data in an additional DataFrame
Construct Pool from DataFrame also specifying pairs data in an additional DataFrame
val spark = SparkSession.builder() .master("local[4]") .appName("PoolWithPairsTest") .getOrCreate(); val srcData = Seq( Row(Vectors.dense(0.1, 0.2, 0.11), "0.12", 0x0L, 0.12f, 0), Row(Vectors.dense(0.97, 0.82, 0.33), "0.22", 0x0L, 0.18f, 1), Row(Vectors.dense(0.13, 0.22, 0.23), "0.34", 0x1L, 1.0f, 2), Row(Vectors.dense(0.23, 0.01, 0.0), "0.0", 0x1L, 1.2f, 3) ) val srcDataSchema = Seq( StructField("features", SQLDataTypes.VectorType), StructField("label", StringType), StructField("groupId", LongType), StructField("weight", FloatType) StructField("sampleId", LongType) ) val df = spark.createDataFrame(spark.sparkContext.parallelize(srcData), StructType(srcDataSchema)) val srcPairsData = Seq( Row(0x0L, 0, 1), Row(0x1L, 3, 2) ) val srcPairsDataSchema = Seq( StructField("groupId", LongType), StructField("winnerId", IntegerType), StructField("loserId", IntegerType) ) val pairsDf = spark.createDataFrame( spark.sparkContext.parallelize(srcPairsData), StructType(srcPairsDataSchema) ) val pool = new Pool(df, pairsDf) .setGroupIdCol("groupId") .setWeightCol("weight") .setSampleIdCol("sampleId") pool.data.show() pool.pairsData.show()
Example: - new Pool(data: DataFrame)
Construct Pool from DataFrame Call set*Col methods to specify non-default columns.
Construct Pool from DataFrame Call set*Col methods to specify non-default columns. Only features and label columns with "features" and "label" names are assumed by default.
val spark = SparkSession.builder() .master("local[4]") .appName("PoolTest") .getOrCreate(); val srcData = Seq( Row(Vectors.dense(0.1, 0.2, 0.11), "0.12", 0x0L, 0.12f), Row(Vectors.dense(0.97, 0.82, 0.33), "0.22", 0x0L, 0.18f), Row(Vectors.dense(0.13, 0.22, 0.23), "0.34", 0x1L, 1.0f) ) val srcDataSchema = Seq( StructField("features", SQLDataTypes.VectorType), StructField("label", StringType), StructField("groupId", LongType), StructField("weight", FloatType) ) val df = spark.createDataFrame(spark.sparkContext.parallelize(srcData), StructType(srcDataSchema)) val pool = new Pool(df) .setGroupIdCol("groupId") .setWeightCol("weight") pool.data.show()
Example: - new Pool(uid: String, data: DataFrame = null, featuresLayout: TFeaturesLayoutPtr = null, quantizedFeaturesInfo: QuantizedFeaturesInfoPtr = null, pairsData: DataFrame = null, partitionedByGroups: Boolean = false)
Value Members
- final def !=(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
- final def ##: Int
- Definition Classes
- AnyRef → Any
- final def $[T](param: Param[T]): T
- Attributes
- protected
- Definition Classes
- Params
- final def ==(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
- final def asInstanceOf[T0]: T0
- Definition Classes
- Any
- final val baselineCol: Param[String]
- def cache(): Pool
Persist Datasets of this Pool with the default storage level (MEMORY_AND_DISK).
- def calcNanModesAndBorders(nanModeAndBordersBuilder: TNanModeAndBordersBuilder, quantizationParams: QuantizationParamsTrait): Unit
- Attributes
- protected
- def checkpoint(): Pool
Returns Pool with eagerly checkpointed Datasets.
- def checkpoint(eager: Boolean): Pool
Returns Pool with checkpointed Datasets.
Returns Pool with checkpointed Datasets.
- eager
Whether to checkpoint Datasets immediately
- final def clear(param: Param[_]): Pool.this.type
- Definition Classes
- Params
- def clone(): AnyRef
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.CloneNotSupportedException]) @native()
- def copy(extra: ParamMap): Pool
- Definition Classes
- Pool → Params
- def copyValues[T <: Params](to: T, extra: ParamMap): T
- Attributes
- protected
- Definition Classes
- Params
- def copyWithModifiedData(modifiedData: DataFrame, partitionedByGroups: Boolean = false): Pool
used to add additional columns to data (for example estimated features) It is impossible to just write an external function for this because copyValues is protected
- def count: Long
- returns
Number of objects in the dataset, similar to the same method of org.apache.spark.sql.Dataset
- def createQuantizationSchema(quantizationParams: QuantizationParamsTrait): QuantizedFeaturesInfoPtr
- Attributes
- protected
- def createQuantized(quantizedFeaturesInfo: QuantizedFeaturesInfoPtr): Pool
- Attributes
- protected
- val data: DataFrame
- final def defaultCopy[T <: Params](extra: ParamMap): T
- Attributes
- protected
- Definition Classes
- Params
- def ensurePartitionByGroupsIfPresent(): Pool
ensure that if groups are present data in partitions contains whole groups in consecutive order
- final def eq(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
- def equals(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef → Any
- def explainParam(param: Param[_]): String
- Definition Classes
- Params
- def explainParams(): String
- Definition Classes
- Params
- final def extractParamMap(): ParamMap
- Definition Classes
- Params
- final def extractParamMap(extra: ParamMap): ParamMap
- Definition Classes
- Params
- final val featuresCol: Param[String]
- Definition Classes
- HasFeaturesCol
- var featuresLayout: TFeaturesLayoutPtr
- Attributes
- protected
- def finalize(): Unit
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.Throwable])
- final def get[T](param: Param[T]): Option[T]
- Definition Classes
- Params
- final def getBaselineCol: String
- def getBaselineCount: Int
- returns
dimension of formula baseline, 0 if no baseline specified
- def getCatFeaturesUniqValueCounts: Array[Int]
- final def getClass(): Class[_ <: AnyRef]
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
- final def getDefault[T](param: Param[T]): Option[T]
- Definition Classes
- Params
- def getEstimatedFeatureCount: Int
- def getFeatureCount: Int
- def getFeatureNames: Array[String]
- final def getFeaturesCol: String
- Definition Classes
- HasFeaturesCol
- def getFeaturesLayout: TFeaturesLayoutPtr
- final def getGroupIdCol: String
- final def getGroupWeightCol: String
- final def getLabelCol: String
- Definition Classes
- HasLabelCol
- final def getOrDefault[T](param: Param[T]): T
- Definition Classes
- Params
- def getParam(paramName: String): Param[Any]
- Definition Classes
- Params
- final def getSampleIdCol: String
- final def getSubgroupIdCol: String
- def getTargetType: ERawTargetType
- final def getTimestampCol: String
- final def getWeightCol: String
- Definition Classes
- HasWeightCol
- final val groupIdCol: Param[String]
- final val groupWeightCol: Param[String]
- final def hasDefault[T](param: Param[T]): Boolean
- Definition Classes
- Params
- def hasParam(paramName: String): Boolean
- Definition Classes
- Params
- def hashCode(): Int
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
- def initializeLogIfNecessary(isInterpreter: Boolean, silent: Boolean): Boolean
- Attributes
- protected
- Definition Classes
- Logging
- def initializeLogIfNecessary(isInterpreter: Boolean): Unit
- Attributes
- protected
- Definition Classes
- Logging
- final def isDefined(param: Param[_]): Boolean
- Definition Classes
- Params
- final def isInstanceOf[T0]: Boolean
- Definition Classes
- Any
- def isQuantized: Boolean
- final def isSet(param: Param[_]): Boolean
- Definition Classes
- Params
- def isTraceEnabled(): Boolean
- Attributes
- protected
- Definition Classes
- Logging
- final val labelCol: Param[String]
- Definition Classes
- HasLabelCol
- def localCheckpoint(): Pool
Returns Pool with eagerly locally checkpointed Datasets.
- def localCheckpoint(eager: Boolean): Pool
Returns Pool with locally checkpointed Datasets.
Returns Pool with locally checkpointed Datasets.
- eager
Whether to checkpoint Datasets immediately
- def log: Logger
- Attributes
- protected
- Definition Classes
- Logging
- def logDebug(msg: => String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def logDebug(msg: => String): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def logError(msg: => String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def logError(msg: => String): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def logInfo(msg: => String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def logInfo(msg: => String): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def logName: String
- Attributes
- protected
- Definition Classes
- Logging
- def logTrace(msg: => String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def logTrace(msg: => String): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def logWarning(msg: => String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def logWarning(msg: => String): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def mapQuantizedPartitions[R](selectedColumns: Seq[String], includeEstimatedFeatures: Boolean, includePairsIfPresent: Boolean, dstColumnNames: Array[String], dstRowLength: Int, f: (TDataProviderPtr, TDataProviderPtr, ArrayBuffer[Array[Any]], TLocalExecutor) => Iterator[R])(implicit arg0: Encoder[R], arg1: ClassTag[R]): Dataset[R]
Map over partitions for quantized Pool
- final def ne(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
- final def notify(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
- final def notifyAll(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
- def pairsCount: Long
- returns
Number of pairs in the dataset
- val pairsData: DataFrame
- lazy val params: Array[Param[_]]
- Definition Classes
- Params
- val partitionedByGroups: Boolean
- def persist(): Pool
Persist Datasets of this Pool with the default storage level (MEMORY_AND_DISK).
- def persist(storageLevel: StorageLevel): Pool
Returns Pool with Datasets persisted with the given storage level.
- def quantize(quantizedFeaturesInfo: QuantizedFeaturesInfoPtr): Pool
Create Pool with quantized features from Pool with raw features.
Create Pool with quantized features from Pool with raw features. This variant of the method is useful if QuantizedFeaturesInfo with data for quantization (borders and nan modes) has already been computed. Used, for example, to quantize evaluation datasets after the training dataset has been quantized.
- def quantize(quantizationParams: QuantizationParamsTrait = new QuantizationParams()): Pool
Create Pool with quantized features from Pool with raw features
val spark = SparkSession.builder() .master("local[*]") .appName("QuantizationTest") .getOrCreate(); val srcData = Seq( Row(Vectors.dense(0.1, 0.2, 0.11), "0.12"), Row(Vectors.dense(0.97, 0.82, 0.33), "0.22"), Row(Vectors.dense(0.13, 0.22, 0.23), "0.34") ) val srcDataSchema = Seq( StructField("features", SQLDataTypes.VectorType), StructField("label", StringType) ) val df = spark.createDataFrame(spark.sparkContext.parallelize(srcData), StructType(srcDataSchema)) val pool = new Pool(df) val quantizedPool = pool.quantize(new QuantizationParams) val quantizedPoolWithTwoBinsPerFeature = pool.quantize(new QuantizationParams().setBorderCount(1)) quantizedPool.data.show() quantizedPoolWithTwoBinsPerFeature.data.show()
Example: - def quantizeForModelApplication[Model <: PredictionModel[Vector, Model]](model: CatBoostModelTrait[Model]): Pool
Create Pool with quantized features from Pool with raw features.
- val quantizedFeaturesInfo: QuantizedFeaturesInfoPtr
- def repartition(partitionCount: Int, byGroupColumnsIfPresent: Boolean = true): Pool
Repartition data to the specified number of partitions.
Repartition data to the specified number of partitions. Useful to repartition data to create one partition per executor for training (where each executor gets its' own CatBoost worker with a part of the training data).
- def sample(fraction: Double): Pool
Create subset of this pool with the fraction of the samples (or groups of samples if present)
- final val sampleIdCol: Param[String]
- final def set(paramPair: ParamPair[_]): Pool.this.type
- Attributes
- protected
- Definition Classes
- Params
- final def set(param: String, value: Any): Pool.this.type
- Attributes
- protected
- Definition Classes
- Params
- final def set[T](param: Param[T], value: T): Pool.this.type
- Definition Classes
- Params
- final def setBaselineCol(value: String): Pool.this.type
- final def setDefault(paramPairs: ParamPair[_]*): Pool.this.type
- Attributes
- protected
- Definition Classes
- Params
- final def setDefault[T](param: Param[T], value: T): Pool.this.type
- Attributes
- protected[ml]
- Definition Classes
- Params
- def setFeaturesCol(value: String): Pool
- final def setGroupIdCol(value: String): Pool.this.type
- final def setGroupWeightCol(value: String): Pool.this.type
- def setLabelCol(value: String): Pool
- final def setSampleIdCol(value: String): Pool.this.type
- final def setSubgroupIdCol(value: String): Pool.this.type
- final def setTimestampCol(value: String): Pool.this.type
- def setWeightCol(value: String): Pool
- final val subgroupIdCol: Param[String]
- final def synchronized[T0](arg0: => T0): T0
- Definition Classes
- AnyRef
- final val timestampCol: Param[String]
- def toString(): String
- Definition Classes
- Identifiable → AnyRef → Any
- val uid: String
- Definition Classes
- Pool → Identifiable
- def unpersist(blocking: Boolean): Pool
Mark Datasets of this Pool as non-persistent, and remove all blocks for them from memory and disk.
Mark Datasets of this Pool as non-persistent, and remove all blocks for them from memory and disk.
- blocking
Whether to block until all blocks are deleted.
- def unpersist(): Pool
Mark Datasets of this Pool as non-persistent, and remove all blocks for them from memory and disk.
- def updateCatFeaturesInfo(isInitialization: Boolean, quantizedFeaturesInfo: QuantizedFeaturesInfoPtr): Unit
- Attributes
- protected
- final def wait(): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.InterruptedException])
- final def wait(arg0: Long, arg1: Int): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.InterruptedException])
- final def wait(arg0: Long): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.InterruptedException]) @native()
- final val weightCol: Param[String]
- Definition Classes
- HasWeightCol
- def write(): PoolWriter
Interface for saving the content out into external storage (API similar to Spark's Dataset).