package splits
Type Members
-
case class
BoltzmannSplitter(temperature: Double) extends Splitter[Double] with Product with Serializable
Find a split for a regression problem.
The splits are picked with a probability related to the reduction in variance: P(split) ~ exp[ - {remaining variance} / ({temperature} * {total variance}) ]. Recall that the "variance" here is weighted by the sample size, so it's really the sum of squared differences from the mean on that side of the split. This is analogous to simulated annealing and Metropolis-Hastings.
The motivation here is to reduce the correlation between the trees by making random choices among splits that are nearly as good as the strictly optimal one. Reducing the correlation between trees reduces the variance of an ensemble method (e.g. random forests): the variance both decreases more quickly with the tree count and reaches a lower floor. In this paragraph, "variance" is used as in the bias-variance trade-off.
Division by the local total variance makes the splitting behavior invariant to the data size and the scale of the labels. That means, however, that you can't set the temperature based on a known absolute noise scale. For that, you'd want to divide by the total weight rather than the total variance.
TODO: allow the rescaling to happen based on the total weight instead of the total variance, as an option
Created by maxhutch on 11/29/16.
- temperature
used to control how sensitive the probability of a split is to its change in variance. The temperature can be thought of as a hyperparameter.
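The Boltzmann-weighted selection described above can be sketched in a few lines of Scala. This is an illustrative sketch only: the names `sampleSplit`, `remainingVariances`, and the seeded `Random` are assumptions for demonstration, not the library's API.

```scala
import scala.util.Random

// Hypothetical sketch of Boltzmann-weighted split selection, not the library's API.
object BoltzmannSketch {
  val rng = new Random(0L)

  /** Sample a candidate split index with
    * P(i) ~ exp(-remainingVariance_i / (temperature * totalVariance)). */
  def sampleSplit(remainingVariances: Seq[Double],
                  totalVariance: Double,
                  temperature: Double): Int = {
    // Boltzmann weight for each candidate split
    val weights = remainingVariances.map(v => math.exp(-v / (temperature * totalVariance)))
    // Draw a point in [0, sum of weights) and walk the cumulative sum
    var draw = rng.nextDouble() * weights.sum
    weights.indices.find { i =>
      draw -= weights(i)
      draw <= 0.0
    }.getOrElse(weights.size - 1)
  }
}
```

As the temperature approaches zero, the probability mass concentrates on the split with the lowest remaining variance, recovering the strictly greedy choice; higher temperatures spread probability over near-optimal splits.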
-
class
CategoricalSplit extends Split
Split based on inclusion in a set
-
case class
ClassificationSplitter(randomizedPivotLocation: Boolean = false) extends Splitter[Char] with Product with Serializable
Find the best split for classification problems.
Created by maxhutch on 12/2/16.
-
class
NoSplit extends Split
If no split was found
-
class
RealSplit extends Split
Split based on a real value in the index position
-
case class
RegressionSplitter(randomizePivotLocation: Boolean = false) extends Splitter[Double] with Product with Serializable
Find the best split for regression problems.
The best split is the one that reduces the total weighted variance:

totalVariance = N_left * sigma_left^2 + N_right * sigma_right^2

which, in scala-ish, would be:

totalVariance = leftWeight * (leftSquareSum / leftWeight - (leftSum / leftWeight)^2) + rightWeight * (rightSquareSum / rightWeight - (rightSum / rightWeight)^2)

Because we are comparing them, we can subtract off leftSquareSum + rightSquareSum, which yields the following simple expression after some simplification:

totalVariance = -leftSum * leftSum / leftWeight - Math.pow(totalSum - leftSum, 2) / (totalWeight - leftWeight)

which depends only on updates to leftSum and leftWeight (since totalSum and totalWeight are constant).
Created by maxhutch on 11/29/16.
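The incremental formula above can be sketched as a single left-to-right scan. This is an illustrative sketch, not the library's implementation: the names `bestPivot` and `data` (a sequence of (feature, label, weight) tuples already sorted by feature) are assumptions.

```scala
// Hypothetical sketch of the simplified split score, not the library's API.
object RegressionSplitSketch {
  /** Scan pivots left to right, tracking only leftSum and leftWeight,
    * and return the feature value of the best split's last left-hand point.
    * `data` is (feature, label, weight), pre-sorted by feature. */
  def bestPivot(data: Seq[(Double, Double, Double)]): Double = {
    val totalSum = data.map { case (_, y, w) => y * w }.sum
    val totalWeight = data.map(_._3).sum
    var leftSum = 0.0
    var leftWeight = 0.0
    var bestScore = Double.MaxValue
    var pivot = data.head._1
    // data.init: the last point cannot be a left side on its own right boundary
    data.init.foreach { case (x, y, w) =>
      leftSum += y * w
      leftWeight += w
      // totalVariance up to an additive constant (the square sums drop out)
      val score = -leftSum * leftSum / leftWeight -
        math.pow(totalSum - leftSum, 2) / (totalWeight - leftWeight)
      if (score < bestScore) {
        bestScore = score
        pivot = x
      }
    }
    pivot
  }
}
```

Because the score depends only on the running `leftSum` and `leftWeight`, each candidate pivot costs O(1) to evaluate after sorting.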
-
trait
Split extends Serializable
Splits are used by decision trees to partition the input space
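A minimal sketch of how these Split variants could partition an input vector is below. The method name `turnLeft`, the `Vector[Any]` input representation, and the `*Sketch` class names are assumptions for illustration, not the library's exact interface.

```scala
// Hypothetical sketch of the Split hierarchy, not the library's exact interface.
trait SplitSketch extends Serializable {
  /** True if the input should go to the left child. */
  def turnLeft(input: Vector[Any]): Boolean
}

/** Split on a real-valued feature at `index`: left if value <= pivot. */
class RealSplitSketch(index: Int, pivot: Double) extends SplitSketch {
  def turnLeft(input: Vector[Any]): Boolean =
    input(index).asInstanceOf[Double] <= pivot
}

/** Split on membership of a categorical feature in an inclusion set. */
class CategoricalSplitSketch(index: Int, includeSet: Set[Char]) extends SplitSketch {
  def turnLeft(input: Vector[Any]): Boolean =
    includeSet.contains(input(index).asInstanceOf[Char])
}
```

A decision tree applies the split at each internal node to route an input down to a leaf; NoSplit marks nodes where no useful partition was found.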
-
trait
Splitter[T] extends AnyRef
Created by maxhutch on 7/5/17.
Value Members
- object BoltzmannSplitter extends Serializable
-
object
MultiTaskSplitter
Created by maxhutch on 11/29/16.
- object Splitter