class CatBoostRegressionModel extends RegressionModel[Vector, CatBoostRegressionModel] with CatBoostModelTrait[CatBoostRegressionModel]
Regression model trained by CatBoost. Use CatBoostRegressor to train it.
Serialization
Supports standard Spark MLlib serialization. The model can be saved to a distributed filesystem such as HDFS or to local files.
When a model is saved to <path>, two files are created:
- <path>/metadata, which contains Spark-specific metadata in JSON format
- <path>/model, which contains the model in the usual CatBoost format. It can be read using other local CatBoost APIs (if it is stored in a distributed filesystem, it has to be copied to the local filesystem first).
Saving to and loading from local files in standard CatBoost model formats is also supported.
Save model
    val trainPool : Pool = ... init Pool ...
    val regressor = new CatBoostRegressor
    val model = regressor.fit(trainPool)
    val path = "/home/user/catboost_spark_models/model0"
    model.write.save(path)
Load model
    val dataFrameForPrediction : DataFrame = ... init DataFrame ...
    val path = "/home/user/catboost_spark_models/model0"
    val model = CatBoostRegressionModel.load(path)
    val predictions = model.transform(dataFrameForPrediction)
    predictions.show()
Save as a native model
    val trainPool : Pool = ... init Pool ...
    val regressor = new CatBoostRegressor
    val model = regressor.fit(trainPool)
    val path = "/home/user/catboost_native_models/model0.cbm"
    model.saveNativeModel(path)
Load native model
    val dataFrameForPrediction : DataFrame = ... init DataFrame ...
    val path = "/home/user/catboost_native_models/model0.cbm"
    val model = CatBoostRegressionModel.loadNativeModel(path)
    val predictions = model.transform(dataFrameForPrediction)
    predictions.show()
Linear Supertypes
- CatBoostRegressionModel
- CatBoostModelTrait
- MLWritable
- RegressionModel
- PredictionModel
- PredictorParams
- HasPredictionCol
- HasFeaturesCol
- HasLabelCol
- Model
- Transformer
- PipelineStage
- Logging
- Params
- Serializable
- Identifiable
- AnyRef
- Any
Instance Constructors
- new CatBoostRegressionModel(nativeModel: TFullModel)
- new CatBoostRegressionModel(uid: String, nativeModel: TFullModel = null, nativeDimension: Int)
Value Members
- final def !=(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
- final def ##: Int
- Definition Classes
- AnyRef → Any
- final def $[T](param: Param[T]): T
- Attributes
- protected
- Definition Classes
- Params
- final def ==(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
- final def asInstanceOf[T0]: T0
- Definition Classes
- Any
- final def clear(param: Param[_]): CatBoostRegressionModel.this.type
- Definition Classes
- Params
- def clone(): AnyRef
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.CloneNotSupportedException]) @native()
- def copy(extra: ParamMap): CatBoostRegressionModel
- Definition Classes
- CatBoostRegressionModel → Model → Transformer → PipelineStage → Params
- def copyValues[T <: Params](to: T, extra: ParamMap): T
- Attributes
- protected
- Definition Classes
- Params
- final def defaultCopy[T <: Params](extra: ParamMap): T
- Attributes
- protected
- Definition Classes
- Params
- final def eq(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
- def equals(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef → Any
- def explainParam(param: Param[_]): String
- Definition Classes
- Params
- def explainParams(): String
- Definition Classes
- Params
- final def extractParamMap(): ParamMap
- Definition Classes
- Params
- final def extractParamMap(extra: ParamMap): ParamMap
- Definition Classes
- Params
- final val featuresCol: Param[String]
- Definition Classes
- HasFeaturesCol
- def featuresDataType: DataType
- Attributes
- protected
- Definition Classes
- PredictionModel
- def finalize(): Unit
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.Throwable])
- final def get[T](param: Param[T]): Option[T]
- Definition Classes
- Params
- def getAdditionalColumnsForApply: Seq[StructField]
- Attributes
- protected
- Definition Classes
- CatBoostRegressionModel → CatBoostModelTrait
- final def getClass(): Class[_ <: AnyRef]
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
- final def getDefault[T](param: Param[T]): Option[T]
- Definition Classes
- Params
- def getFeatureImportance(fstrType: EFstrType = EFstrType.FeatureImportance, data: Pool = null, calcType: ECalcTypeShapValues = ECalcTypeShapValues.Regular): Array[Double]
- fstrType
Supported values are FeatureImportance, PredictionValuesChange, LossFunctionChange, PredictionDiff
- data
Required if fstrType is PredictionDiff, in which case it must contain 2 samples. Also required if fstrType is PredictionValuesChange and the model was explicitly trained with the flag to store no leaf weights. Otherwise it can be null.
- calcType
Used only for PredictionValuesChange. Possible values:
- Regular Calculate regular SHAP values
- Approximate Calculate approximate SHAP values
- Exact Calculate exact SHAP values
- returns
array of feature importances (index corresponds to the order of features in the model)
- Definition Classes
- CatBoostModelTrait
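Because the returned array is indexed by the order of features in the model, matching importances to feature names is left to the caller. The following is a minimal sketch in plain Scala of one way to do that. The object `FeatureImportanceRanking`, the method `rank`, and the feature names are hypothetical and not part of the library; the literal array stands in for the result of `model.getFeatureImportance()`.

```scala
object FeatureImportanceRanking {
  /** Pairs each importance with its feature name and sorts by importance, descending.
    * `featureNames` is a hypothetical, caller-supplied list in model feature order. */
  def rank(importances: Array[Double], featureNames: Seq[String]): Seq[(String, Double)] = {
    require(importances.length == featureNames.length, "expected one name per feature")
    featureNames.zip(importances).sortBy { case (_, importance) => -importance }
  }

  def main(args: Array[String]): Unit = {
    // Stand-in for: val importances = model.getFeatureImportance()
    val importances = Array(12.5, 60.0, 27.5)
    rank(importances, Seq("age", "income", "tenure")).foreach {
      case (name, value) => println(s"$name: $value")
    }
  }
}
```

If only a sorted listing is needed, getFeatureImportancePrettified already returns the importances sorted in descending order.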
- def getFeatureImportanceInteraction(): Array[FeatureInteractionScore]
- returns
array of feature interaction scores
- Definition Classes
- CatBoostModelTrait
- def getFeatureImportancePrettified(fstrType: EFstrType = EFstrType.FeatureImportance, data: Pool = null, calcType: ECalcTypeShapValues = ECalcTypeShapValues.Regular): Array[FeatureImportance]
- fstrType
Supported values are FeatureImportance, PredictionValuesChange, LossFunctionChange, PredictionDiff
- data
Required if fstrType is PredictionDiff, in which case it must contain 2 samples. Also required if fstrType is PredictionValuesChange and the model was explicitly trained with the flag to store no leaf weights. Otherwise it can be null.
- calcType
Used only for PredictionValuesChange. Possible values:
- Regular Calculate regular SHAP values
- Approximate Calculate approximate SHAP values
- Exact Calculate exact SHAP values
- returns
array of feature importances sorted in descending order by importance
- Definition Classes
- CatBoostModelTrait
- def getFeatureImportanceShapInteractionValues(data: Pool, featureIndices: Pair[Int, Int] = null, featureNames: Pair[String, String] = null, preCalcMode: EPreCalcShapValues = EPreCalcShapValues.Auto, calcType: ECalcTypeShapValues = ECalcTypeShapValues.Regular, outputColumns: Array[String] = null): DataFrame
SHAP interaction values are calculated for all feature pairs if neither featureIndices nor featureNames is specified.
- data
dataset to calculate SHAP interaction values
- featureIndices
(optional) pair of feature indices to calculate SHAP interaction values for.
- featureNames
(optional) pair of feature names to calculate SHAP interaction values for.
- preCalcMode
Possible values:
- Auto Use direct SHAP values calculation only if the data size is smaller than the average number of leaves (the better of the two strategies below is chosen).
- UsePreCalc Calculate SHAP values for every leaf in preprocessing. Final complexity is O(NT(D+F)) + O(TL^2 D^2), where N is the number of documents (objects), T is the number of trees, D is the average tree depth, F is the average number of features in a tree, and L is the average number of leaves in a tree. This is much faster (because of a smaller constant) than direct calculation when N >> L.
- NoPreCalc Use direct SHAP values calculation with complexity O(NTLD^2). The direct algorithm is faster when N < L (algorithm from https://arxiv.org/abs/1802.03888).
- calcType
Possible values:
- Regular Calculate regular SHAP values
- Approximate Calculate approximate SHAP values
- Exact Calculate exact SHAP values
- outputColumns
columns from data to add to output DataFrame, if null - add all columns
- returns
- for binclass or regression: DataFrame which contains outputColumns and "featureIdx1", "featureIdx2", "shapInteractionValue" columns
- for multiclass: DataFrame which contains outputColumns and "classIdx", "featureIdx1", "featureIdx2", "shapInteractionValue" columns
- Definition Classes
- CatBoostModelTrait
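To get a feel for the two preCalcMode strategies, the complexity estimates quoted for UsePreCalc and NoPreCalc can be evaluated for concrete sizes. The sketch below is plain Scala and only plugs numbers into those two expressions. It ignores the constant factors that the description says matter in practice, so it is a rough illustration and not how the Auto mode decides. The object and method names are hypothetical.

```scala
object ShapCostEstimate {
  /** Operation-count estimate for UsePreCalc: N*T*(D+F) + T*L^2*D^2. */
  def preCalcCost(n: Double, t: Double, d: Double, f: Double, l: Double): Double =
    n * t * (d + f) + t * l * l * d * d

  /** Operation-count estimate for NoPreCalc (direct calculation): N*T*L*D^2. */
  def directCost(n: Double, t: Double, d: Double, l: Double): Double =
    n * t * l * d * d

  def main(args: Array[String]): Unit = {
    // Illustrative model shape: 1000 trees, depth 6, 6 features per tree, 64 leaves per tree.
    val (t, d, f, l) = (1000.0, 6.0, 6.0, 64.0)
    for (n <- Seq(10.0, 1e6)) {
      val pre = preCalcCost(n, t, d, f, l)
      val direct = directCost(n, t, d, l)
      val cheaper = if (direct < pre) "NoPreCalc" else "UsePreCalc"
      println(s"N=$n preCalc=$pre direct=$direct cheaper by this estimate: $cheaper")
    }
  }
}
```

With these illustrative sizes, the direct estimate is smaller for N = 10 (N < L) and the precalculated estimate is smaller for N = 1,000,000 (N >> L), which matches the guidance in the parameter description.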
- def getFeatureImportanceShapValues(data: Pool, preCalcMode: EPreCalcShapValues = EPreCalcShapValues.Auto, calcType: ECalcTypeShapValues = ECalcTypeShapValues.Regular, modelOutputType: EExplainableModelOutput = EExplainableModelOutput.Raw, referenceData: Pool = null, outputColumns: Array[String] = null): DataFrame
- data
dataset to calculate SHAP values for
- preCalcMode
Possible values:
- Auto Use direct SHAP values calculation only if the data size is smaller than the average number of leaves (the better of the two strategies below is chosen).
- UsePreCalc Calculate SHAP values for every leaf in preprocessing. Final complexity is O(NT(D+F)) + O(TL^2 D^2), where N is the number of documents (objects), T is the number of trees, D is the average tree depth, F is the average number of features in a tree, and L is the average number of leaves in a tree. This is much faster (because of a smaller constant) than direct calculation when N >> L.
- NoPreCalc Use direct SHAP values calculation with complexity O(NTLD^2). The direct algorithm is faster when N < L (algorithm from https://arxiv.org/abs/1802.03888).
- calcType
Possible values:
- Regular Calculate regular SHAP values
- Approximate Calculate approximate SHAP values
- Exact Calculate exact SHAP values
- referenceData
Reference data for Independent Tree SHAP values (see https://arxiv.org/abs/1905.04610v1). If referenceData is not null, Independent Tree SHAP values are calculated.
- outputColumns
columns from data to add to output DataFrame, if null - add all columns
- returns
- for regression and binclass models: DataFrame which contains outputColumns and "shapValues" column with Vector of length (n_features + 1) with SHAP values
- for multiclass models: DataFrame which contains outputColumns and "shapValues" column with Matrix of shape (n_classes x (n_features + 1)) with SHAP values
- Definition Classes
- CatBoostModelTrait
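The "shapValues" vector for regression and binclass models has n_features + 1 entries. The sketch below shows one way to take such a row apart in plain Scala. It rests on an assumption that this page does not state: that the extra trailing entry is the expected (baseline) value, as in other CatBoost APIs, so that with the default Raw output type the entries sum to the raw prediction. The object `ShapRowLayout` and its methods are hypothetical helpers, and the literal array stands in for one row's "shapValues" converted to an array.

```scala
object ShapRowLayout {
  /** Splits one SHAP row of length n_features + 1 into per-feature
    * contributions and the trailing entry.
    * ASSUMPTION: the trailing entry is the expected (baseline) value. */
  def split(shapRow: Array[Double]): (Array[Double], Double) = {
    require(shapRow.nonEmpty, "a SHAP row has at least the trailing entry")
    (shapRow.init, shapRow.last)
  }

  /** Under the same assumption, contributions plus baseline give the raw prediction. */
  def rawPrediction(shapRow: Array[Double]): Double = shapRow.sum

  def main(args: Array[String]): Unit = {
    // Stand-in for one "shapValues" vector of a model with two features.
    val row = Array(0.5, -0.25, 2.0)
    val (contributions, baseline) = split(row)
    println(s"contributions=${contributions.mkString(", ")} baseline=$baseline")
    println(s"raw prediction under the stated assumption: ${rawPrediction(row)}")
  }
}
```

It is worth checking the assumption against model.transform output on a few rows before relying on it.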
- final def getFeaturesCol: String
- Definition Classes
- HasFeaturesCol
- final def getLabelCol: String
- Definition Classes
- HasLabelCol
- final def getOrDefault[T](param: Param[T]): T
- Definition Classes
- Params
- def getParam(paramName: String): Param[Any]
- Definition Classes
- Params
- final def getPredictionCol: String
- Definition Classes
- HasPredictionCol
- def getResultIteratorForApply(objectsDataProvider: SWIGTYPE_p_NCB__TObjectsDataProviderPtr, dstRows: ArrayBuffer[Array[Any]], localExecutor: TLocalExecutor): Iterator[Row]
- Attributes
- protected
- Definition Classes
- CatBoostRegressionModel → CatBoostModelTrait
- final def hasDefault[T](param: Param[T]): Boolean
- Definition Classes
- Params
- def hasParam(paramName: String): Boolean
- Definition Classes
- Params
- def hasParent: Boolean
- Definition Classes
- Model
- def hashCode(): Int
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
- def initializeLogIfNecessary(isInterpreter: Boolean, silent: Boolean): Boolean
- Attributes
- protected
- Definition Classes
- Logging
- def initializeLogIfNecessary(isInterpreter: Boolean): Unit
- Attributes
- protected
- Definition Classes
- Logging
- final def isDefined(param: Param[_]): Boolean
- Definition Classes
- Params
- final def isInstanceOf[T0]: Boolean
- Definition Classes
- Any
- final def isSet(param: Param[_]): Boolean
- Definition Classes
- Params
- def isTraceEnabled(): Boolean
- Attributes
- protected
- Definition Classes
- Logging
- final val labelCol: Param[String]
- Definition Classes
- HasLabelCol
- def log: Logger
- Attributes
- protected
- Definition Classes
- Logging
- def logDebug(msg: => String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def logDebug(msg: => String): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def logError(msg: => String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def logError(msg: => String): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def logInfo(msg: => String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def logInfo(msg: => String): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def logName: String
- Attributes
- protected
- Definition Classes
- Logging
- def logTrace(msg: => String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def logTrace(msg: => String): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def logWarning(msg: => String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def logWarning(msg: => String): Unit
- Attributes
- protected
- Definition Classes
- Logging
- var nativeDimension: Int
- Attributes
- protected
- Definition Classes
- CatBoostRegressionModel → CatBoostModelTrait
- final def ne(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
- final def notify(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
- final def notifyAll(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
- def numFeatures: Int
- Definition Classes
- PredictionModel
- Annotations
- @Since("1.6.0")
- lazy val params: Array[Param[_]]
- Definition Classes
- Params
- var parent: Estimator[CatBoostRegressionModel]
- Definition Classes
- Model
- def predict(features: Vector): Double
Prefer batch computations operating on datasets as a whole for efficiency
- Definition Classes
- CatBoostRegressionModel → PredictionModel
- final def predictRawImpl(features: Vector): Array[Double]
Prefer batch computations operating on datasets as a whole for efficiency
- Definition Classes
- CatBoostModelTrait
- final val predictionCol: Param[String]
- Definition Classes
- HasPredictionCol
- def save(path: String): Unit
- Definition Classes
- MLWritable
- Annotations
- @Since("1.6.0") @throws("If the input path already exists but overwrite is not enabled.")
- def saveNativeModel(fileName: String, format: ru.yandex.catboost.spark.catboost4j_spark.core.src.native_impl.EModelType = EModelType.CatboostBinary, exportParameters: Map[String, Any] = null, pool: Pool = null): Unit
Save the model to a local file.
- fileName
The path to the output model.
- format
The output format of the model. Possible values:
- CatboostBinary CatBoost binary format (default).
- AppleCoreML Apple CoreML format (only datasets without categorical features are currently supported).
- Cpp Standalone C++ code (multiclassification models are not currently supported). See the C++ section for details on applying the resulting model.
- Python Standalone Python code (multiclassification models are not currently supported). See the Python section for details on applying the resulting model.
- Json JSON format. Refer to the CatBoost JSON model tutorial for format details.
- Onnx ONNX-ML format (only datasets without categorical features are currently supported). Refer to https://onnx.ai for details.
- Pmml PMML version 4.3 format. Categorical features must be interpreted as one-hot encoded during the training if present in the training dataset. This can be accomplished by setting the --one-hot-max-size/one_hot_max_size parameter to a value that is greater than the maximum number of unique categorical feature values among all categorical features in the dataset. Note: multiclassification models are not currently supported. See the PMML section for details on applying the resulting model.
- exportParameters
Additional format-dependent parameters for AppleCoreML, Onnx or Pmml formats. See the Python API documentation for details.
- pool
The dataset previously used for training. This parameter is required if the model contains categorical features and the output format is Cpp, Python, or Json.
- Definition Classes
- CatBoostModelTrait
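For the Pmml format, the one_hot_max_size rule described above can be worked out from the data: take the largest number of unique values over all categorical features and pick a value greater than it. The following is a minimal plain-Scala sketch of that arithmetic. The object name, the method, and the sample columns are hypothetical, and in a real job the distinct counts would be computed on the training DataFrame instead of in-memory sequences.

```scala
object OneHotMaxSizeForPmml {
  /** Returns the smallest value that is greater than the number of unique
    * values of every categorical column. Each inner Seq holds one column's values. */
  def minimalOneHotMaxSize(categoricalColumns: Seq[Seq[String]]): Int = {
    val maxUnique = categoricalColumns.map(_.distinct.size).foldLeft(0)(_ max _)
    maxUnique + 1
  }

  def main(args: Array[String]): Unit = {
    val color = Seq("red", "green", "blue", "red") // 3 unique values
    val city = Seq("Oslo", "Lima", "Oslo", "Oslo") // 2 unique values
    println(minimalOneHotMaxSize(Seq(color, city)))
  }
}
```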
Example:
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("testSaveLocalModel")
      .getOrCreate()

    val pool = Pool.load(
      spark,
      "dsv:///home/user/datasets/my_dataset/train.dsv",
      columnDescription = "/home/user/datasets/my_dataset/cd"
    )

    val regressor = new CatBoostRegressor()
    val model = regressor.fit(pool)

    // save in CatBoostBinary format
    model.saveNativeModel("/home/user/model/model.cbm")

    // save in ONNX format with metadata
    model.saveNativeModel(
      "/home/user/model/model.onnx",
      EModelType.Onnx,
      Map(
        "onnx_domain" -> "ai.catboost",
        "onnx_model_version" -> 1,
        "onnx_doc_string" -> "test model for regression",
        "onnx_graph_name" -> "CatBoostModel_for_regression"
      )
    )
- final def set(paramPair: ParamPair[_]): CatBoostRegressionModel.this.type
- Attributes
- protected
- Definition Classes
- Params
- final def set(param: String, value: Any): CatBoostRegressionModel.this.type
- Attributes
- protected
- Definition Classes
- Params
- final def set[T](param: Param[T], value: T): CatBoostRegressionModel.this.type
- Definition Classes
- Params
- final def setDefault(paramPairs: ParamPair[_]*): CatBoostRegressionModel.this.type
- Attributes
- protected
- Definition Classes
- Params
- final def setDefault[T](param: Param[T], value: T): CatBoostRegressionModel.this.type
- Attributes
- protected[ml]
- Definition Classes
- Params
- def setFeaturesCol(value: String): CatBoostRegressionModel
- Definition Classes
- PredictionModel
- def setParent(parent: Estimator[CatBoostRegressionModel]): CatBoostRegressionModel
- Definition Classes
- Model
- def setPredictionCol(value: String): CatBoostRegressionModel
- Definition Classes
- PredictionModel
- final def synchronized[T0](arg0: => T0): T0
- Definition Classes
- AnyRef
- def toString(): String
- Definition Classes
- Identifiable → AnyRef → Any
- def transform(dataset: Dataset[_]): DataFrame
- Definition Classes
- PredictionModel → Transformer
- def transform(dataset: Dataset[_], paramMap: ParamMap): DataFrame
- Definition Classes
- Transformer
- Annotations
- @Since("2.0.0")
- def transform(dataset: Dataset[_], firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*): DataFrame
- Definition Classes
- Transformer
- Annotations
- @Since("2.0.0") @varargs()
- def transformCatBoostImpl(dataset: Dataset[_]): DataFrame
- Attributes
- protected
- Definition Classes
- CatBoostModelTrait
- def transformImpl(dataset: Dataset[_]): DataFrame
- Definition Classes
- CatBoostRegressionModel → PredictionModel
- def transformPool(dataset: Pool): DataFrame
This function is useful when the dataset has been already quantized but works with any Pool
- Definition Classes
- CatBoostModelTrait
- def transformSchema(schema: StructType): StructType
- Definition Classes
- PredictionModel → PipelineStage
- def transformSchema(schema: StructType, logging: Boolean): StructType
- Attributes
- protected
- Definition Classes
- PipelineStage
- Annotations
- @DeveloperApi()
- val uid: String
- Definition Classes
- CatBoostRegressionModel → Identifiable
- def validateAndTransformSchema(schema: StructType, fitting: Boolean, featuresDataType: DataType): StructType
- Attributes
- protected
- Definition Classes
- PredictorParams
- final def wait(): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.InterruptedException])
- final def wait(arg0: Long, arg1: Int): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.InterruptedException])
- final def wait(arg0: Long): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.InterruptedException]) @native()
- def write: MLWriter
- Definition Classes
- CatBoostModelTrait → MLWritable