Construct Pool from DataFrame. Call set*Col methods to specify non-default columns. By default, only the features and label columns are assumed to be present, under the names "features" and "label".
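    import org.apache.spark.ml.linalg.{SQLDataTypes, Vectors}
    import org.apache.spark.sql.{Row, SparkSession}
    import org.apache.spark.sql.types.{FloatType, LongType, StringType, StructField, StructType}
    import ai.catboost.spark.Pool

    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("PoolTest")
      .getOrCreate()

    // Each row: features vector, label, group id, object weight.
    val srcData = Seq(
      Row(Vectors.dense(0.1, 0.2, 0.11), "0.12", 0x0L, 0.12f),
      Row(Vectors.dense(0.97, 0.82, 0.33), "0.22", 0x0L, 0.18f),
      Row(Vectors.dense(0.13, 0.22, 0.23), "0.34", 0x1L, 1.0f)
    )

    val srcDataSchema = Seq(
      StructField("features", SQLDataTypes.VectorType),
      StructField("label", StringType),
      StructField("groupId", LongType),
      StructField("weight", FloatType)
    )

    val df = spark.createDataFrame(spark.sparkContext.parallelize(srcData), StructType(srcDataSchema))

    // "features" and "label" use the default names; the other columns must be set explicitly.
    val pool = new Pool(df)
      .setGroupIdCol("groupId")
      .setWeightCol("weight")

    pool.data.show()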
Number of objects in the dataset, analogous to the same method of org.apache.spark.sql.Dataset.
Dimension of the formula baseline; 0 if no baseline is specified.
Create Pool with quantized features from Pool with raw features. This variant of the method is useful when the QuantizedFeaturesInfo holding the quantization data (borders and NaN modes) has already been computed. It is used, for example, to quantize evaluation datasets after the training dataset has been quantized.
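The idea of reusing already computed borders can be illustrated with a small sketch in plain Scala. It does not use the CatBoost API: `computeBorders` and `toBin` are hypothetical helpers, the evenly spaced border selection is a simplifying assumption (CatBoost's own border selection is more elaborate), and the bin rule assumes that a value's bin index equals the number of borders it exceeds.

```scala
object QuantizationSketch {
  // Hypothetical border selection: `borderCount` evenly spaced cut points
  // strictly between the minimum and maximum of the training values.
  def computeBorders(trainValues: Seq[Double], borderCount: Int): Vector[Double] = {
    require(trainValues.nonEmpty && borderCount >= 1)
    val lo = trainValues.min
    val hi = trainValues.max
    val step = (hi - lo) / (borderCount + 1)
    Vector.tabulate(borderCount)(i => lo + step * (i + 1))
  }

  // Assumed bin rule: the bin index is the number of borders the value exceeds,
  // so N borders produce N + 1 bins.
  def toBin(value: Double, borders: Vector[Double]): Int =
    borders.count(value > _)

  def main(args: Array[String]): Unit = {
    val trainValues = Seq(0.0, 1.0, 2.0, 4.0)
    val evalValues = Seq(0.5, 3.5, 10.0)

    // Borders are computed once, from the training data only.
    val borders = computeBorders(trainValues, borderCount = 3)

    // The same borders are then applied to both datasets, so a given bin index
    // means the same value range in training and in evaluation.
    println(trainValues.map(toBin(_, borders)))
    println(evalValues.map(toBin(_, borders)))
  }
}
```

Because the evaluation values are binned with the training borders, an evaluation value outside the training range (10.0 here) simply lands in the last bin instead of shifting the borders.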
Create Pool with quantized features from Pool with raw features
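    import org.apache.spark.ml.linalg.{SQLDataTypes, Vectors}
    import org.apache.spark.sql.{Row, SparkSession}
    import org.apache.spark.sql.types.{StringType, StructField, StructType}
    import ai.catboost.spark.Pool
    // QuantizationParams is assumed to live in the params subpackage.
    import ai.catboost.spark.params.QuantizationParams

    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("QuantizationTest")
      .getOrCreate()

    // Each row: features vector, label.
    val srcData = Seq(
      Row(Vectors.dense(0.1, 0.2, 0.11), "0.12"),
      Row(Vectors.dense(0.97, 0.82, 0.33), "0.22"),
      Row(Vectors.dense(0.13, 0.22, 0.23), "0.34")
    )

    val srcDataSchema = Seq(
      StructField("features", SQLDataTypes.VectorType),
      StructField("label", StringType)
    )

    val df = spark.createDataFrame(spark.sparkContext.parallelize(srcData), StructType(srcDataSchema))

    val pool = new Pool(df)

    // Quantize with default parameters.
    val quantizedPool = pool.quantize(new QuantizationParams)

    // A single border per feature splits its values into two bins.
    val quantizedPoolWithTwoBinsPerFeature = pool.quantize(new QuantizationParams().setBorderCount(1))

    quantizedPool.data.show()
    quantizedPoolWithTwoBinsPerFeature.data.show()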
Repartition data to the specified number of partitions. Useful for creating one partition per executor for training, where each executor gets its own CatBoost worker with a part of the training data.
CatBoost's abstraction of a dataset.
Features data can be stored in raw form (the features column has org.apache.spark.ml.linalg.Vector type) or in quantized form (float feature values are quantized into integer bin values, and the features column has Array[Byte] type).
A raw Pool can be transformed to quantized form using the quantize method. This is useful if the dataset is used for training multiple times and the quantization parameters do not change. A pre-quantized Pool allows the quantized features data to be cached, so the feature quantization step is not re-run at the start of each training.