public class SimpleKMeans extends RandomizableClusterer implements NumberOfClustersRequestable, WeightedInstancesHandler, TechnicalInformationHandler
@inproceedings{Arthur2007,
author = {D. Arthur and S. Vassilvitskii},
booktitle = {Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms},
pages = {1027-1035},
title = {k-means++: the advantages of careful seeding},
year = {2007}
}
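The seeding idea in the cited paper can be sketched in plain Java. This is an illustrative sketch only, not the code behind kMeansPlusPlusInit: the class name KMeansPlusPlusSketch, its methods, and the use of double[] points with squared Euclidean distance are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Hypothetical illustration of k-means++ seeding; not part of Weka.
public class KMeansPlusPlusSketch {

    /** Squared Euclidean distance between two points of equal dimension. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /**
     * Picks k initial centers: the first uniformly at random, each later one
     * with probability proportional to its squared distance from the nearest
     * center chosen so far.
     */
    static List<double[]> chooseSeeds(double[][] points, int k, Random rng) {
        List<double[]> centers = new ArrayList<>();
        centers.add(points[rng.nextInt(points.length)]);

        // nearest[i] = squared distance from point i to its closest center
        double[] nearest = new double[points.length];
        Arrays.fill(nearest, Double.POSITIVE_INFINITY);

        while (centers.size() < k) {
            double[] last = centers.get(centers.size() - 1);
            double total = 0.0;
            for (int i = 0; i < points.length; i++) {
                nearest[i] = Math.min(nearest[i], squaredDistance(points[i], last));
                total += nearest[i];
            }
            if (total == 0.0) {
                // Every point coincides with a chosen center; fall back to uniform.
                centers.add(points[rng.nextInt(points.length)]);
                continue;
            }
            // Sample an index with probability nearest[i] / total.
            double target = rng.nextDouble() * total;
            int chosen = points.length - 1; // fallback guards against rounding
            double cumulative = 0.0;
            for (int i = 0; i < points.length; i++) {
                cumulative += nearest[i];
                if (cumulative > target) {
                    chosen = i;
                    break;
                }
            }
            centers.add(points[chosen]);
        }
        return centers;
    }

    public static void main(String[] args) {
        double[][] points = { {0, 0}, {0, 1}, {10, 10}, {10, 11} };
        for (double[] seed : chooseSeeds(points, 2, new Random(10))) {
            System.out.println(Arrays.toString(seed));
        }
    }
}
```

Points far from every existing center carry more weight, so the seeds tend to spread across the data instead of landing in one dense region.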
Valid options are:
-N <num> Number of clusters. (default 2).
-init Initialization method to use. 0 = random, 1 = k-means++, 2 = canopy, 3 = farthest first. (default = 0)
-C Use canopies to reduce the number of distance calculations.
-max-candidates <num> Maximum number of candidate canopies to retain in memory at any one time when using canopy clustering. The T2 distance, together with the characteristics of the data, determines how many candidate canopies are formed before periodic and final pruning are performed, which might result in excess memory consumption. This setting avoids large numbers of candidate canopies consuming memory. (default = 100)
-periodic-pruning <num> How often to prune low density canopies when using canopy clustering. (default = every 10,000 training instances)
-min-density Minimum canopy density, when using canopy clustering, below which a canopy will be pruned during periodic pruning. (default = 2 instances)
-t2 The T2 distance to use when using canopy clustering. Values < 0 indicate that a heuristic based on attribute std. deviation should be used to set this. (default = -1.0)
-t1 The T1 distance to use when using canopy clustering. A value < 0 is taken as a positive multiplier for T2. (default = -1.5)
-V Display std. deviations for centroids.
-M Don't replace missing values with mean/mode.
-A <classname and options> Distance function to use. (default: weka.core.EuclideanDistance)
-I <num> Maximum number of iterations.
-O Preserve order of instances.
-fast Enables faster distance calculations, using cut-off values. Disables the calculation/output of squared errors/distances.
-num-slots <num> Number of execution slots. (default 1 - i.e. no parallelism)
-S <num> Random number seed. (default 10)
-output-debug-info If set, clusterer is run in debug mode and may output additional info to the console
-do-not-check-capabilities If set, clusterer capabilities are not checked before clusterer is built (use with caution).
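As a worked example of the -t1 rule above (a value < 0 is taken as a positive multiplier for T2), the following sketch resolves the effective T1 radius. The class and method names are hypothetical, and the sketch assumes T2 has already been resolved to a positive distance, because the heuristic used for negative -t2 values is not specified here.

```java
// Hypothetical helper illustrating the -t1 rule; not part of Weka.
public class CanopyRadiusSketch {

    /**
     * Resolves the effective T1 radius from the -t1 setting.
     * Assumes resolvedT2 is already a positive distance.
     */
    static double resolveT1(double t1Setting, double resolvedT2) {
        if (resolvedT2 <= 0) {
            throw new IllegalArgumentException("T2 must be resolved to a positive value first");
        }
        // A negative setting means "multiply T2 by the absolute value".
        return t1Setting < 0 ? -t1Setting * resolvedT2 : t1Setting;
    }

    public static void main(String[] args) {
        System.out.println(resolveT1(-1.5, 2.0)); // prints 3.0 (default -1.5 with T2 = 2.0)
        System.out.println(resolveT1(4.0, 2.0));  // prints 4.0 (non-negative values are used as-is)
    }
}
```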
See Also: RandomizableClusterer, Serialized Form

| Modifier and Type | Field and Description |
|---|---|
| static int | CANOPY |
| static int | FARTHEST_FIRST |
| static int | KMEANS_PLUS_PLUS |
| protected int[] | m_Assignments: Assignments obtained. |
| protected Canopy | m_canopyClusters: The canopy clusterer (if being used). |
| protected java.util.List<long[]> | m_centroidCanopyAssignments: Canopies that each centroid falls into (determined by T1 radius). |
| protected Instances | m_ClusterCentroids: Holds the cluster centroids. |
| protected double[][] | m_ClusterMissingCounts |
| protected double[][][] | m_ClusterNominalCounts: For each cluster, holds the frequency counts for the values of each nominal attribute. |
| protected double[] | m_ClusterSizes: The number of instances in each cluster. |
| protected Instances | m_ClusterStdDevs: Holds the standard deviations of the numeric attributes in each cluster. |
| protected int | m_completed |
| protected java.util.List<long[]> | m_dataPointCanopyAssignments: Canopies that each training instance falls into (determined by T1 radius). |
| protected boolean | m_displayStdDevs: Display standard deviations for numeric attributes. |
| protected DistanceFunction | m_DistanceFunction: The distance function used. |
| protected boolean | m_dontReplaceMissing: Whether to skip global replacement of missing values. |
| protected int | m_executionSlots: Number of threads to run. |
| protected java.util.concurrent.ExecutorService | m_executorPool: For parallel execution mode. |
| protected int | m_failed |
| protected boolean | m_FastDistanceCalc: Whether to use fast calculation of distances (using a cut-off). |
| protected double[] | m_FullMeansOrMediansOrModes: Stats on the full data set for comparison purposes. |
| protected double[] | m_FullMissingCounts |
| protected double[][] | m_FullNominalCounts |
| protected double[] | m_FullStdDevs |
| protected int | m_initializationMethod: The initialization method to use. |
| protected Instances | m_initialStartPoints: Holds the initial start points, as supplied by the initialization method used. |
| protected int | m_Iterations: Keeps track of the number of iterations completed before convergence. |
| protected int | m_maxCanopyCandidates: The maximum number of candidate canopies to hold in memory at any one time (if using canopy clustering). |
| protected int | m_MaxIterations: Maximum number of iterations to be executed. |
| protected double | m_minClusterDensity: The minimum cluster density (according to T2 distance) allowed. |
| protected int | m_NumClusters: Number of clusters to generate. |
| protected int | m_periodicPruningRate: Prune low-density candidate canopies after every x instances have been seen (if using canopy clustering). |
| protected boolean | m_PreserveOrder: Preserve order of instances. |
| protected ReplaceMissingValues | m_ReplaceMissingFilter: Replaces missing values in training instances. |
| protected boolean | m_speedUpDistanceCompWithCanopies: Whether to reduce the number of distance calculations done by k-means with canopies. |
| protected double[] | m_squaredErrors: Holds the squared errors for all clusters. |
| protected double | m_t1: The t1 radius to pass through to Canopy. |
| protected double | m_t2: The t2 radius to pass through to Canopy. |
| static int | RANDOM |
| static Tag[] | TAGS_SELECTION: Initialization methods. |
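
Fields inherited from class RandomizableClusterer: m_Seed, m_SeedDefault
Fields inherited from class AbstractClusterer: m_Debug, m_DoNotCheckCapabilities

| Constructor and Description |
|---|
| SimpleKMeans(): The default constructor. |
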
| Modifier and Type | Method and Description |
|---|---|
| void | buildClusterer(Instances data): Generates a clusterer. |
| protected void | canopyInit(Instances data): Initialize with the canopy centers of the Canopy clustering method. |
| java.lang.String | canopyMaxNumCanopiesToHoldInMemoryTipText(): Returns the tip text for this property. |
| java.lang.String | canopyMinimumCanopyDensityTipText(): Returns the tip text for this property. |
| java.lang.String | canopyPeriodicPruningRateTipText(): Returns the tip text for this property. |
| java.lang.String | canopyT1TipText(): Returns the tip text for this property. |
| java.lang.String | canopyT2TipText(): Returns the tip text for this property. |
| int | clusterInstance(Instance instance): Assigns a given instance to a cluster. |
| java.lang.String | displayStdDevsTipText(): Returns the tip text for this property. |
| java.lang.String | distanceFunctionTipText(): Returns the tip text for this property. |
| java.lang.String | dontReplaceMissingValuesTipText(): Returns the tip text for this property. |
| protected void | farthestFirstInit(Instances data): Initialize with the farthest first centers. |
| java.lang.String | fastDistanceCalcTipText(): Returns the tip text for this property. |
| int[] | getAssignments(): Gets the assignments for each instance. |
| int | getCanopyMaxNumCanopiesToHoldInMemory(): Get the maximum number of candidate canopies to retain in memory during training. |
| double | getCanopyMinimumCanopyDensity(): Get the minimum T2-based density below which a canopy will be pruned during periodic pruning. |
| int | getCanopyPeriodicPruningRate(): Get how often to prune low density canopies during training (if using canopy clustering). |
| double | getCanopyT1(): Get the t1 radius to use when canopy clustering is being used as start points and/or to reduce the number of distance calculations. |
| double | getCanopyT2(): Get the t2 radius to use when canopy clustering is being used as start points and/or to reduce the number of distance calculations. |
| Capabilities | getCapabilities(): Returns default capabilities of the clusterer. |
| Instances | getClusterCentroids(): Gets the cluster centroids. |
| double[][][] | getClusterNominalCounts(): Returns for each cluster the weighted frequency counts for the values of each nominal attribute. |
| double[] | getClusterSizes(): Gets the sum of weights for all the instances in each cluster. |
| Instances | getClusterStandardDevs(): Gets the standard deviations of the numeric attributes in each cluster. |
| boolean | getDisplayStdDevs(): Gets whether standard deviations and nominal counts should be displayed. |
| DistanceFunction | getDistanceFunction(): Returns the distance function currently in use. |
| boolean | getDontReplaceMissingValues(): Gets whether replacement of missing values is disabled. |
| boolean | getFastDistanceCalc(): Gets whether to use faster distance calculation. |
| SelectedTag | getInitializationMethod(): Get the initialization method to use. |
| int | getMaxIterations(): Gets the maximum number of iterations to be executed. |
| int | getNumClusters(): Gets the number of clusters to generate. |
| int | getNumExecutionSlots(): Get the degree of parallelism to use. |
| java.lang.String[] | getOptions(): Gets the current settings of SimpleKMeans. |
| boolean | getPreserveInstancesOrder(): Gets whether order of instances must be preserved. |
| boolean | getReduceNumberOfDistanceCalcsViaCanopies(): Get whether to use canopies to reduce the number of distance computations required. |
| java.lang.String | getRevision(): Returns the revision string. |
| double | getSquaredError(): Gets the squared error for all clusters. |
| TechnicalInformation | getTechnicalInformation(): Returns an instance of a TechnicalInformation object, containing detailed information about the technical background of this class, e.g., paper reference or book this class is based on. |
| java.lang.String | globalInfo(): Returns a string describing this clusterer. |
| java.lang.String | initializationMethodTipText(): Returns the tip text for this property. |
| protected void | kMeansPlusPlusInit(Instances data): Initialize using the k-means++ method. |
| protected boolean | launchAssignToClusters(Instances insts, int[] clusterAssignments): Launch the tasks that assign instances to clusters. |
| protected int | launchMoveCentroids(Instances[] clusters): Launch the move centroids tasks. |
| java.util.Enumeration<Option> | listOptions(): Returns an enumeration describing the available options. |
| static void | main(java.lang.String[] args): Main method for executing this class. |
| java.lang.String | maxIterationsTipText(): Returns the tip text for this property. |
| protected double[] | moveCentroid(int centroidIndex, Instances members, boolean updateClusterInfo, boolean addToCentroidInstances): Move the centroid to its new coordinates. |
| int | numberOfClusters(): Returns the number of clusters. |
| java.lang.String | numClustersTipText(): Returns the tip text for this property. |
| java.lang.String | numExecutionSlotsTipText(): Returns the tip text for this property. |
| java.lang.String | preserveInstancesOrderTipText(): Returns the tip text for this property. |
| java.lang.String | reduceNumberOfDistanceCalcsViaCanopiesTipText(): Returns the tip text for this property. |
| void | setCanopyMaxNumCanopiesToHoldInMemory(int max): Set the maximum number of candidate canopies to retain in memory during training. |
| void | setCanopyMinimumCanopyDensity(double dens): Set the minimum T2-based density below which a canopy will be pruned during periodic pruning. |
| void | setCanopyPeriodicPruningRate(int p): Set how often to prune low density canopies during training (if using canopy clustering). |
| void | setCanopyT1(double t1): Set the t1 radius to use when canopy clustering is being used as start points and/or to reduce the number of distance calculations. |
| void | setCanopyT2(double t2): Set the t2 radius to use when canopy clustering is being used as start points and/or to reduce the number of distance calculations. |
| void | setDisplayStdDevs(boolean stdD): Sets whether standard deviations and nominal counts should be displayed. |
| void | setDistanceFunction(DistanceFunction df): Sets the distance function to use for instance comparison. |
| void | setDontReplaceMissingValues(boolean r): Sets whether replacement of missing values is disabled. |
| void | setFastDistanceCalc(boolean value): Sets whether to use faster distance calculation. |
| void | setInitializationMethod(SelectedTag method): Set the initialization method to use. |
| void | setMaxIterations(int n): Sets the maximum number of iterations to be executed. |
| void | setNumClusters(int n): Sets the number of clusters to generate. |
| void | setNumExecutionSlots(int slots): Set the degree of parallelism to use. |
| void | setOptions(java.lang.String[] options): Parses a given list of options. |
| void | setPreserveInstancesOrder(boolean r): Sets whether order of instances must be preserved. |
| void | setReduceNumberOfDistanceCalcsViaCanopies(boolean c): Set whether to use canopies to reduce the number of distance computations required. |
| protected void | startExecutorPool(): Start the pool of execution threads. |
| java.lang.String | toString(): Returns a string describing this clusterer. |

Methods inherited from class RandomizableClusterer: getSeed, seedTipText, setSeed
Methods inherited from class AbstractClusterer: debugTipText, distributionForInstance, doNotCheckCapabilitiesTipText, forName, getDebug, getDoNotCheckCapabilities, makeCopies, makeCopy, postExecution, preExecution, run, runClusterer, setDebug, setDoNotCheckCapabilities

protected ReplaceMissingValues m_ReplaceMissingFilter
protected int m_NumClusters
protected Instances m_initialStartPoints
protected Instances m_ClusterCentroids
protected Instances m_ClusterStdDevs
protected double[][][] m_ClusterNominalCounts
protected double[][] m_ClusterMissingCounts
protected double[] m_FullMeansOrMediansOrModes
protected double[] m_FullStdDevs
protected double[][] m_FullNominalCounts
protected double[] m_FullMissingCounts
protected boolean m_displayStdDevs
protected boolean m_dontReplaceMissing
protected double[] m_ClusterSizes
protected int m_MaxIterations
protected int m_Iterations
protected double[] m_squaredErrors
protected DistanceFunction m_DistanceFunction
protected boolean m_PreserveOrder
protected int[] m_Assignments
protected boolean m_FastDistanceCalc
public static final int RANDOM
public static final int KMEANS_PLUS_PLUS
public static final int CANOPY
public static final int FARTHEST_FIRST
public static final Tag[] TAGS_SELECTION
protected int m_initializationMethod
protected boolean m_speedUpDistanceCompWithCanopies
protected java.util.List<long[]> m_centroidCanopyAssignments
protected java.util.List<long[]> m_dataPointCanopyAssignments
protected Canopy m_canopyClusters
protected int m_maxCanopyCandidates
protected int m_periodicPruningRate
protected double m_minClusterDensity
protected double m_t2
protected double m_t1
protected int m_executionSlots
protected transient java.util.concurrent.ExecutorService m_executorPool
protected int m_completed
protected int m_failed
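
The fields m_executionSlots and m_executorPool indicate that cluster assignment can be spread over several threads. The following is a minimal sketch of that general pattern using java.util.concurrent, not the code behind launchAssignToClusters: the class ParallelAssignSketch, the double[] data representation, the contiguous chunking scheme, and the return convention (true if any assignment changed) are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration of slot-based parallel assignment; not part of Weka.
public class ParallelAssignSketch {

    /** Index of the centroid closest to the point (squared Euclidean distance). */
    static int nearestCentroid(double[] point, double[][] centroids) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.length; c++) {
            double sum = 0.0;
            for (int i = 0; i < point.length; i++) {
                double d = point[i] - centroids[c][i];
                sum += d * d;
            }
            if (sum < bestDist) {
                bestDist = sum;
                best = c;
            }
        }
        return best;
    }

    /**
     * Assigns each point to its nearest centroid using `slots` worker threads.
     * Each task owns a disjoint index range, so no two tasks write the same
     * element of `assignments`. Returns true if any assignment changed.
     */
    static boolean assignInParallel(double[][] points, double[][] centroids,
                                    int[] assignments, int slots) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(slots);
        try {
            int chunk = (points.length + slots - 1) / slots; // ceiling division
            List<Future<Boolean>> results = new ArrayList<>();
            for (int s = 0; s < slots; s++) {
                final int start = s * chunk;
                final int end = Math.min(points.length, start + chunk);
                if (start >= end) {
                    break; // fewer points than slots
                }
                Callable<Boolean> task = () -> {
                    boolean changed = false;
                    for (int i = start; i < end; i++) {
                        int c = nearestCentroid(points[i], centroids);
                        if (c != assignments[i]) {
                            assignments[i] = c;
                            changed = true;
                        }
                    }
                    return changed;
                };
                results.add(pool.submit(task));
            }
            boolean anyChanged = false;
            for (Future<Boolean> f : results) {
                anyChanged |= f.get(); // get() also makes the task's writes visible
            }
            return anyChanged;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        double[][] points = { {0, 0}, {1, 0}, {9, 9}, {10, 10} };
        double[][] centroids = { {0, 0}, {10, 10} };
        int[] assignments = { -1, -1, -1, -1 };
        assignInParallel(points, centroids, assignments, 2);
        System.out.println(Arrays.toString(assignments)); // prints [0, 0, 1, 1]
    }
}
```

With one slot the loop degenerates to a single sequential task, which matches the documented default of no parallelism.
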
protected void startExecutorPool()

public TechnicalInformation getTechnicalInformation()
Specified by: getTechnicalInformation in interface TechnicalInformationHandler

public java.lang.String globalInfo()

public Capabilities getCapabilities()
Specified by: getCapabilities in interface Clusterer
Specified by: getCapabilities in interface CapabilitiesHandler
Overrides: getCapabilities in class AbstractClusterer
See Also: Capabilities

protected int launchMoveCentroids(Instances[] clusters)
Parameters:
clusters - the cluster centroids

protected boolean launchAssignToClusters(Instances insts, int[] clusterAssignments) throws java.lang.Exception
Parameters:
insts - the instances to be clustered
clusterAssignments - the array of cluster assignments
Throws:
java.lang.Exception - if a problem occurs

public void buildClusterer(Instances data) throws java.lang.Exception
Specified by: buildClusterer in interface Clusterer
Specified by: buildClusterer in class AbstractClusterer
Parameters:
data - set of instances serving as training data
Throws:
java.lang.Exception - if the clusterer has not been generated successfully

protected void canopyInit(Instances data) throws java.lang.Exception
Parameters:
data - the training data
Throws:
java.lang.Exception - if a problem occurs

protected void farthestFirstInit(Instances data) throws java.lang.Exception
Parameters:
data - the training data
Throws:
java.lang.Exception - if a problem occurs

protected void kMeansPlusPlusInit(Instances data) throws java.lang.Exception
Parameters:
data - the training data
Throws:
java.lang.Exception - if a problem occurs

protected double[] moveCentroid(int centroidIndex, Instances members, boolean updateClusterInfo, boolean addToCentroidInstances)
Parameters:
centroidIndex - index of the centroid whose coordinates will be computed
members - the objects that are assigned to the cluster of this centroid
updateClusterInfo - true if the method is supposed to update the m_Cluster arrays
addToCentroidInstances - true if the method is to add the computed coordinates to the Instances holding the centroids

public int clusterInstance(Instance instance) throws java.lang.Exception
Specified by: clusterInstance in interface Clusterer
Overrides: clusterInstance in class AbstractClusterer
Parameters:
instance - the instance to be assigned to a cluster
Throws:
java.lang.Exception - if instance could not be classified successfully

public int numberOfClusters() throws java.lang.Exception
Specified by: numberOfClusters in interface Clusterer
Specified by: numberOfClusters in class AbstractClusterer
Throws:
java.lang.Exception - if number of clusters could not be returned successfully

public java.util.Enumeration<Option> listOptions()
Specified by: listOptions in interface OptionHandler
Overrides: listOptions in class RandomizableClusterer

public java.lang.String numClustersTipText()

public void setNumClusters(int n) throws java.lang.Exception
Specified by: setNumClusters in interface NumberOfClustersRequestable
Parameters:
n - the number of clusters to generate
Throws:
java.lang.Exception - if number of clusters is negative

public int getNumClusters()

public java.lang.String initializationMethodTipText()

public void setInitializationMethod(SelectedTag method)
Parameters:
method - the initialization method to use

public SelectedTag getInitializationMethod()

public java.lang.String reduceNumberOfDistanceCalcsViaCanopiesTipText()

public void setReduceNumberOfDistanceCalcsViaCanopies(boolean c)
Parameters:
c - true if canopies are to be used to reduce the number of distance computations

public boolean getReduceNumberOfDistanceCalcsViaCanopies()

public java.lang.String canopyPeriodicPruningRateTipText()

public void setCanopyPeriodicPruningRate(int p)
Parameters:
p - how often (every p instances) to prune low density canopies

public int getCanopyPeriodicPruningRate()

public java.lang.String canopyMinimumCanopyDensityTipText()

public void setCanopyMinimumCanopyDensity(double dens)
Parameters:
dens - the minimum canopy density

public double getCanopyMinimumCanopyDensity()

public java.lang.String canopyMaxNumCanopiesToHoldInMemoryTipText()

public void setCanopyMaxNumCanopiesToHoldInMemory(int max)
Parameters:
max - the maximum number of candidate canopies to retain in memory during training

public int getCanopyMaxNumCanopiesToHoldInMemory()

public java.lang.String canopyT2TipText()

public void setCanopyT2(double t2)
Parameters:
t2 - the t2 radius to use

public double getCanopyT2()

public java.lang.String canopyT1TipText()

public void setCanopyT1(double t1)
Parameters:
t1 - the t1 radius to use

public double getCanopyT1()

public java.lang.String maxIterationsTipText()

public void setMaxIterations(int n) throws java.lang.Exception
Parameters:
n - the maximum number of iterations
Throws:
java.lang.Exception - if maximum number of iterations is smaller than 1

public int getMaxIterations()

public java.lang.String displayStdDevsTipText()

public void setDisplayStdDevs(boolean stdD)
Parameters:
stdD - true if std. devs and counts should be displayed

public boolean getDisplayStdDevs()

public java.lang.String dontReplaceMissingValuesTipText()

public void setDontReplaceMissingValues(boolean r)
Parameters:
r - true if missing values are not to be replaced

public boolean getDontReplaceMissingValues()

public java.lang.String distanceFunctionTipText()

public DistanceFunction getDistanceFunction()

public void setDistanceFunction(DistanceFunction df) throws java.lang.Exception
Parameters:
df - the new distance function to use
Throws:
java.lang.Exception - if instances cannot be processed

public java.lang.String preserveInstancesOrderTipText()

public void setPreserveInstancesOrder(boolean r)
Parameters:
r - true if the order of instances is to be preserved

public boolean getPreserveInstancesOrder()

public java.lang.String fastDistanceCalcTipText()

public void setFastDistanceCalc(boolean value)
Parameters:
value - true if faster calculation is to be used

public boolean getFastDistanceCalc()

public java.lang.String numExecutionSlotsTipText()

public void setNumExecutionSlots(int slots)
Parameters:
slots - the number of execution slots, i.e. tasks to run in parallel

public int getNumExecutionSlots()

public void setOptions(java.lang.String[] options) throws java.lang.Exception
Specified by: setOptions in interface OptionHandler
Overrides: setOptions in class RandomizableClusterer
Parameters:
options - the list of options as an array of strings
Throws:
java.lang.Exception - if an option is not supported

public java.lang.String[] getOptions()
Specified by: getOptions in interface OptionHandler
Overrides: getOptions in class RandomizableClusterer

public java.lang.String toString()
Overrides: toString in class java.lang.Object

public Instances getClusterCentroids()

public Instances getClusterStandardDevs()

public double[][][] getClusterNominalCounts()

public double getSquaredError()
See Also: m_FastDistanceCalc

public double[] getClusterSizes()

public int[] getAssignments() throws java.lang.Exception
Throws:
java.lang.Exception - if order of instances wasn't preserved or no assignments were made

public java.lang.String getRevision()
Specified by: getRevision in interface RevisionHandler
Overrides: getRevision in class AbstractClusterer

public static void main(java.lang.String[] args)
Parameters:
args - use -h to list all parameters