public class Canopy extends RandomizableClusterer implements UpdateableClusterer, NumberOfClustersRequestable, OptionHandler, TechnicalInformationHandler
@inproceedings{McCallum2000,
author = {A. McCallum and K. Nigam and L.H. Ungar},
booktitle = {Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining},
pages = {169-178},
title = {Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching},
year = {2000}
}
Valid options are:
-N <num> Number of clusters. (default 2).
-max-candidates <num> Maximum number of candidate canopies to retain in memory at any one time. The T2 distance, together with the characteristics of the data, determines how many candidate canopies are formed before periodic and final pruning are performed, which might result in excess memory consumption. This setting prevents large numbers of candidate canopies from consuming memory. (default = 100)
-periodic-pruning <num> How often to prune low density canopies. (default = every 10,000 training instances)
-min-density Minimum canopy density, below which a canopy will be pruned during periodic pruning. (default = 2 instances)
-t2 The T2 distance to use. Values < 0 indicate that a heuristic based on attribute standard deviation should be used to set this. Note that this heuristic can only be used when batch training. (default = -1.0)
-t1 The T1 distance to use. A value < 0 is taken as a positive multiplier for T2. (default = -1.5)
-M Don't replace missing values with mean/mode when running in batch mode.
-S <num> Random number seed. (default 1)
-output-debug-info If set, clusterer is run in debug mode and may output additional info to the console
-do-not-check-capabilities If set, clusterer capabilities are not checked before clusterer is built (use with caution).
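The relationship between the -t1 and -t2 options can be illustrated with a short sketch. This is only an illustration of the rule stated in the option text (a negative -t1 acts as a multiplier for T2); `ThresholdSketch` and `resolveT1` are hypothetical names, not part of Weka, and the sketch assumes T2 has already been resolved to a positive distance (for example by the batch-mode heuristic).

```java
// Hypothetical helper for illustration only; not part of the Weka API.
final class ThresholdSketch {

    /**
     * Resolves the outer radius T1 from the user-supplied -t1 value.
     * A negative value is treated as a multiplier of T2; any other value
     * is used as is. Assumes T2 is already a positive distance.
     */
    static double resolveT1(double userT1, double actualT2) {
        if (actualT2 <= 0) {
            throw new IllegalArgumentException("T2 must be resolved to a positive distance first");
        }
        return userT1 < 0 ? -userT1 * actualT2 : userT1;
    }

    public static void main(String[] args) {
        System.out.println(resolveT1(-1.5, 2.0)); // default -t1: 1.5 times T2
        System.out.println(resolveT1(4.0, 2.0));  // non-negative -t1: used unchanged
    }
}
```

With the default -t1 of -1.5, the outer radius is therefore always 1.5 times whatever T2 turns out to be.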
| Modifier and Type | Field and Description |
|---|---|
| static double | DEFAULT_T1 |
| static double | DEFAULT_T2 |
| protected Instances | m_canopies: The canopy centers |
| protected java.util.List<double[][]> | m_canopyCenters |
| protected java.util.List<double[]> | m_canopyNumMissingForNumerics |
| protected java.util.List<double[]> | m_canopyT2Density: The T2 density of each canopy |
| protected java.util.List<long[]> | m_clusterCanopies: The list of canopies that each canopy is a member of (according to the T1 radius, which can overlap). |
| protected boolean | m_didPruneLastTime: True if the pruning operation removed at least one low density canopy the last time it was invoked |
| protected NormalizableDistance | m_distanceFunction: The distance function to use |
| protected boolean | m_dontReplaceMissing: Replace missing values globally when running in batch mode? |
| protected int | m_instanceCount: Number of training instances seen so far |
| protected int | m_maxCanopyCandidates: The maximum number of candidate canopies to hold in memory at any one time |
| protected double | m_minClusterDensity: The minimum cluster density (according to T2 distance) allowed. |
| protected Filter | m_missingValuesReplacer: If not null, then this is expected to be a filter that can replace missing values immediately (at training and testing time) |
| protected int | m_numClustersRequested: Default is to let the T2 radius determine how many canopies/clusters are formed |
| protected int | m_periodicPruningRate: Prune low-density candidate canopies after every x instances have been seen |
| protected double | m_t1: Outer radius |
| protected double | m_t2: Inner radius |
| protected Instances | m_trainingData: Used to pad out the number of cluster centers if fewer canopies are generated than the number of requested clusters and we are running in batch mode. |
| protected double | m_userT1: A value < 0 indicates the multiplier to use for T2 when setting T1, otherwise the value is taken as is |
| protected double | m_userT2: A value < 0 means use the heuristic based on standard deviation to set the T2 radius |
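Fields inherited from class RandomizableClusterer: m_Seed, m_SeedDefault
Fields inherited from class AbstractClusterer: m_Debug, m_DoNotCheckCapabilities

| Constructor and Description |
|---|
| Canopy() |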
| Modifier and Type | Method and Description |
|---|---|
| protected void | adjustCanopies(double[] densities): Adjust the final number of canopies to match the user-requested number (if possible) |
| static Canopy | aggregateCanopies(java.util.List<Canopy> canopies, double aggregationT1, double aggregationT2, NormalizableDistance finalDistanceFunction, Filter missingValuesReplacer, int finalNumCanopies): Aggregate the canopies from a list of Canopy clusterers together into one final model. |
| long[] | assignCanopies(Instance inst): Uses T1 distance to assign canopies to the supplied instance. |
| void | buildClusterer(Instances data): Generates a clusterer. |
| void | cleanUp(): Save memory |
| double[] | distributionForInstance(Instance instance): Predicts the cluster memberships for a given instance. |
| java.lang.String | dontReplaceMissingValuesTipText(): Returns the tip text for this property. |
| double | getActualT1(): Get the actual value of T1 (which may be different from the initial value if the heuristic is used) |
| double | getActualT2(): Get the actual value of T2 (which may be different from the initial value if the heuristic is used) |
| Instances | getCanopies(): Get the canopies (cluster centers). |
| Capabilities | getCapabilities(): Returns default capabilities of the clusterer. |
| java.util.List<long[]> | getClusterCanopyAssignments(): Get the canopies that each canopy (cluster center) is within T1 distance of |
| boolean | getDontReplaceMissingValues(): Gets whether missing values are to be replaced. |
| int | getMaxNumCandidateCanopiesToHoldInMemory(): Get the maximum number of candidate canopies to retain in memory during training. |
| double | getMinimumCanopyDensity(): Get the minimum T2-based density below which a canopy will be pruned during periodic pruning. |
| int | getNumClusters(): Get the number of clusters to generate |
| java.lang.String[] | getOptions(): Gets the current settings of Canopy. |
| int | getPeriodicPruningRate(): Get how often to prune low density canopies during training |
| double | getT1(): Get the T1 distance. |
| double | getT2(): Get the T2 distance to use. |
| TechnicalInformation | getTechnicalInformation(): Returns an instance of a TechnicalInformation object, containing detailed information about the technical background of this class, e.g., paper reference or book this class is based on. |
| java.lang.String | globalInfo(): Returns a string describing this clusterer. |
| void | initializeDistanceFunction(Instances init): Initialize the distance function (i.e. set min/max values for numeric attributes) with the supplied instances. |
| java.util.Enumeration<Option> | listOptions(): Returns an enumeration describing the available options. |
| static void | main(java.lang.String[] args) |
| java.lang.String | maxNumCandidateCanopiesToHoldInMemory(): Returns the tip text for this property. |
| java.lang.String | minimumCanopyDensityTipText(): Returns the tip text for this property. |
| static boolean | nonEmptyCanopySetIntersection(long[] first, long[] second): Tests if two sets of canopies have a non-empty intersection |
| int | numberOfClusters(): Returns the number of clusters. |
| java.lang.String | numClustersTipText(): Returns the tip text for this property. |
| java.lang.String | periodicPruningRateTipText(): Returns the tip text for this property. |
| static java.lang.String | printCanopyAssignments(Instances dataPoints, java.util.List<long[]> canopyAssignments): Print the supplied instances and their canopies |
| static java.lang.String | printSingleAssignment(long[] assignments) |
| protected void | pruneCandidateCanopies(): Prune low density candidate canopies |
| void | setCanopies(Instances canopies): Set the canopies to use (replaces any learned by this clusterer already) |
| void | setClusterCanopyAssignments(java.util.List<long[]> clusterCanopies): Set the canopies that each canopy (cluster center) is within T1 distance of |
| void | setDontReplaceMissingValues(boolean r): Sets whether missing values are to be replaced. |
| void | setMaxNumCandidateCanopiesToHoldInMemory(int max): Set the maximum number of candidate canopies to retain in memory during training. |
| void | setMinimumCanopyDensity(double dens): Set the minimum T2-based density below which a canopy will be pruned during periodic pruning. |
| void | setMissingValuesReplacer(Filter missingReplacer): Set a ready-to-use missing values replacement filter |
| void | setNumClusters(int numClusters): Set the number of clusters to generate |
| void | setOptions(java.lang.String[] options): Parses a given list of options. |
| void | setPeriodicPruningRate(int p): Set how often to prune low density canopies during training |
| void | setT1(double t1): Set the T1 distance. |
| void | setT2(double t2): Set the T2 distance to use. |
| protected void | setT2T1BasedOnStdDev(Instances trainingBatch): Pretty hokey heuristic to try and set T2 distance automatically based on standard deviation |
| java.lang.String | t1TipText(): Tip text for this property |
| java.lang.String | t2TipText(): Tip text for this property |
| java.lang.String | toString() |
| java.lang.String | toString(boolean header): Return a textual description of this clusterer |
| protected void | updateCanopyCenter(Instance newInstance, double[][] center, double[] numMissingNumerics) |
| void | updateClusterer(Instance newInstance): Adds an instance to the clusterer. |
| void | updateFinished(): Signals the end of the updating. |
Methods inherited from class RandomizableClusterer: getSeed, seedTipText, setSeed
Methods inherited from class AbstractClusterer: clusterInstance, debugTipText, doNotCheckCapabilitiesTipText, forName, getDebug, getDoNotCheckCapabilities, getRevision, makeCopies, makeCopy, postExecution, preExecution, run, runClusterer, setDebug, setDoNotCheckCapabilities
protected Instances m_canopies
protected java.util.List<double[]> m_canopyT2Density
protected java.util.List<double[][]> m_canopyCenters
protected java.util.List<double[]> m_canopyNumMissingForNumerics
protected java.util.List<long[]> m_clusterCanopies
public static final double DEFAULT_T2
public static final double DEFAULT_T1
protected double m_userT2
protected double m_userT1
protected double m_t1
protected double m_t2
protected int m_periodicPruningRate
protected double m_minClusterDensity
protected int m_maxCanopyCandidates
protected boolean m_didPruneLastTime
protected int m_instanceCount
protected int m_numClustersRequested
protected Filter m_missingValuesReplacer
protected boolean m_dontReplaceMissing
protected NormalizableDistance m_distanceFunction
protected Instances m_trainingData
public java.lang.String globalInfo()
public TechnicalInformation getTechnicalInformation()
Specified by: getTechnicalInformation in interface TechnicalInformationHandler
public Capabilities getCapabilities()
Specified by: getCapabilities in interface Clusterer
Specified by: getCapabilities in interface CapabilitiesHandler
Overrides: getCapabilities in class AbstractClusterer
public java.util.Enumeration<Option> listOptions()
Specified by: listOptions in interface OptionHandler
Overrides: listOptions in class RandomizableClusterer
public void setOptions(java.lang.String[] options) throws java.lang.Exception
Specified by: setOptions in interface OptionHandler
Overrides: setOptions in class RandomizableClusterer
Parameters: options - the list of options as an array of strings
Throws: java.lang.Exception - if an option is not supported
public java.lang.String[] getOptions()
Specified by: getOptions in interface OptionHandler
Overrides: getOptions in class RandomizableClusterer
public static boolean nonEmptyCanopySetIntersection(long[] first, long[] second) throws java.lang.Exception
Parameters:
first - the first canopy set
second - the second canopy set
Throws: java.lang.Exception - if a problem occurs
public long[] assignCanopies(Instance inst) throws java.lang.Exception
Parameters: inst - the instance to find covering canopies for
Throws: java.lang.Exception - if a problem occurs
protected void updateCanopyCenter(Instance newInstance, double[][] center, double[] numMissingNumerics)
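The canopy sets taken by nonEmptyCanopySetIntersection and returned by assignCanopies are long[] arrays. The sketch below shows one way such sets can be tested for overlap, under the assumption that each array is a bit set with one bit per canopy; that encoding is not stated on this page, and `CanopySetSketch` with its methods is a hypothetical stand-in rather than Weka's code.

```java
// Illustration only: assumes canopy sets are bit-packed, one bit per canopy.
final class CanopySetSketch {

    /** Creates an empty set able to hold membership bits for numCanopies canopies. */
    static long[] emptySet(int numCanopies) {
        return new long[(numCanopies + 63) / 64];
    }

    /** Marks the canopy with the given zero-based index as a member of the set. */
    static void add(long[] set, int canopy) {
        set[canopy / 64] |= 1L << (canopy % 64);
    }

    /** Returns true if at least one canopy is a member of both sets. */
    static boolean intersects(long[] first, long[] second) {
        if (first.length != second.length) {
            throw new IllegalArgumentException("Canopy sets must have the same length");
        }
        for (int i = 0; i < first.length; i++) {
            if ((first[i] & second[i]) != 0L) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        long[] a = emptySet(70);
        long[] b = emptySet(70);
        add(a, 3);
        add(a, 65);
        add(b, 65);
        System.out.println(intersects(a, b)); // both sets contain canopy 65
    }
}
```

A word-wise AND keeps the overlap test cheap, which matters when the test is used to skip expensive distance computations between points that share no canopy.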
public void updateClusterer(Instance newInstance) throws java.lang.Exception
Specified by: updateClusterer in interface UpdateableClusterer
Parameters: newInstance - the instance to be added
Throws: java.lang.Exception - if something goes wrong
protected void pruneCandidateCanopies()
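How incremental updating and periodic pruning interact can be sketched as follows. This is a deliberately simplified illustration of the general canopy idea, not Weka's implementation: it assumes plain Euclidean distance on double[] points, absorbs each point into the first candidate within T2 only, and moves the center to the running mean. The class `StreamingCanopySketch` and all of its members are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified illustration of incremental canopy formation; not Weka's implementation.
final class StreamingCanopySketch {

    /** A candidate canopy: running coordinate sums plus a T2-based density count. */
    static final class Candidate {
        final double[] sum;
        int density = 1;

        Candidate(double[] first) {
            this.sum = first.clone();
        }

        /** Current center, i.e. the mean of all absorbed points. */
        double[] center() {
            double[] c = new double[sum.length];
            for (int i = 0; i < sum.length; i++) {
                c[i] = sum[i] / density;
            }
            return c;
        }
    }

    final List<Candidate> candidates = new ArrayList<>();
    private final double t2;
    private final int pruneEvery;
    private final double minDensity;
    private int seen;

    StreamingCanopySketch(double t2, int pruneEvery, double minDensity) {
        this.t2 = t2;
        this.pruneEvery = pruneEvery;
        this.minDensity = minDensity;
    }

    /** Absorbs x into the first candidate within T2, or starts a new candidate. */
    void update(double[] x) {
        Candidate home = null;
        for (Candidate c : candidates) {
            if (distance(x, c.center()) < t2) {
                home = c;
                break;
            }
        }
        if (home == null) {
            candidates.add(new Candidate(x));
        } else {
            for (int i = 0; i < x.length; i++) {
                home.sum[i] += x[i];
            }
            home.density++;
        }
        // Periodic pruning: runs after every pruneEvery instances.
        if (++seen % pruneEvery == 0) {
            prune();
        }
    }

    /** Drops candidates whose density is below the minimum. */
    void prune() {
        candidates.removeIf(c -> c.density < minDensity);
    }

    static double distance(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) {
            s += (a[i] - b[i]) * (a[i] - b[i]);
        }
        return Math.sqrt(s);
    }

    public static void main(String[] args) {
        StreamingCanopySketch s = new StreamingCanopySketch(1.0, 4, 2.0);
        double[][] stream = {{0, 0}, {0.1, 0}, {5, 5}, {0, 0.1}};
        for (double[] x : stream) {
            s.update(x);
        }
        System.out.println(s.candidates.size());
    }
}
```

In this sketch a smaller T2 produces more candidates between pruning passes, which is the memory pressure that the -max-candidates option is described as guarding against.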
public double[] distributionForInstance(Instance instance) throws java.lang.Exception
Specified by: distributionForInstance in interface Clusterer
Overrides: distributionForInstance in class AbstractClusterer
Parameters: instance - the instance to be assigned a cluster.
Throws: java.lang.Exception - if distribution could not be computed successfully
protected void adjustCanopies(double[] densities)
Parameters: densities - the density of each of the canopies
public void updateFinished()
Specified by: updateFinished in interface UpdateableClusterer
public void initializeDistanceFunction(Instances init) throws java.lang.Exception
Parameters: init - the instances to initialize with
Throws: java.lang.Exception - if a problem occurs
protected void setT2T1BasedOnStdDev(Instances trainingBatch) throws java.lang.Exception
Parameters: trainingBatch - the training instances
Throws: java.lang.Exception - if a problem occurs
public void buildClusterer(Instances data) throws java.lang.Exception
Specified by: buildClusterer in interface Clusterer
Specified by: buildClusterer in class AbstractClusterer
Parameters: data - set of instances serving as training data
Throws: java.lang.Exception - if the clusterer has not been generated successfully
public int numberOfClusters() throws java.lang.Exception
Specified by: numberOfClusters in interface Clusterer
Specified by: numberOfClusters in class AbstractClusterer
Throws: java.lang.Exception - if number of clusters could not be returned successfully
public void setMissingValuesReplacer(Filter missingReplacer)
Parameters: missingReplacer - the missing values replacement filter to use
public Instances getCanopies()
public void setCanopies(Instances canopies)
Parameters: canopies - the canopies to use
public java.util.List<long[]> getClusterCanopyAssignments()
public void setClusterCanopyAssignments(java.util.List<long[]> clusterCanopies)
Parameters: clusterCanopies - the list of canopies for each cluster center
public double getActualT2()
public double getActualT1()
public java.lang.String t1TipText()
public void setT1(double t1)
Parameters: t1 - the T1 distance to use
public double getT1()
public java.lang.String t2TipText()
public void setT2(double t2)
Parameters: t2 - the T2 distance to use
public double getT2()
public java.lang.String numClustersTipText()
public void setNumClusters(int numClusters) throws java.lang.Exception
Specified by: setNumClusters in interface NumberOfClustersRequestable
Parameters: numClusters - the number of clusters to generate
Throws: java.lang.Exception - if the requested number of clusters is inappropriate
public int getNumClusters()
public java.lang.String periodicPruningRateTipText()
public void setPeriodicPruningRate(int p)
Parameters: p - how often (every p instances) to prune low density canopies
public int getPeriodicPruningRate()
public java.lang.String minimumCanopyDensityTipText()
public void setMinimumCanopyDensity(double dens)
Parameters: dens - the minimum canopy density
public double getMinimumCanopyDensity()
public java.lang.String maxNumCandidateCanopiesToHoldInMemory()
public void setMaxNumCandidateCanopiesToHoldInMemory(int max)
Parameters: max - the maximum number of candidate canopies to retain in memory during training
public int getMaxNumCandidateCanopiesToHoldInMemory()
public java.lang.String dontReplaceMissingValuesTipText()
public void setDontReplaceMissingValues(boolean r)
Parameters: r - true if missing values are not to be replaced
public boolean getDontReplaceMissingValues()
public static java.lang.String printSingleAssignment(long[] assignments)
public static java.lang.String printCanopyAssignments(Instances dataPoints, java.util.List<long[]> canopyAssignments)
Parameters:
dataPoints - the instances to print
canopyAssignments - the canopy assignments, one assignment array for each instance
public java.lang.String toString(boolean header)
Parameters: header - true if the header should be printed
public java.lang.String toString()
Overrides: toString in class java.lang.Object
public void cleanUp()
public static Canopy aggregateCanopies(java.util.List<Canopy> canopies, double aggregationT1, double aggregationT2, NormalizableDistance finalDistanceFunction, Filter missingValuesReplacer, int finalNumCanopies)
Parameters:
canopies - the list of Canopy clusterers to aggregate
aggregationT1 - the T1 distance to use for the aggregated clusterer
aggregationT2 - the T2 distance to use when aggregating canopies
finalDistanceFunction - the distance function to use with the final Canopy clusterer
missingValuesReplacer - the missing value replacement filter to use with the final clusterer (can be null for no missing value replacement)
finalNumCanopies - the final number of canopies
public static void main(java.lang.String[] args)
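The idea behind aggregateCanopies, combining several separately trained models into one, can be pictured with the sketch below. It is one plausible reading rather than Weka's algorithm: it assumes each partial model is reduced to a list of double[] centers, uses Euclidean distance, and keeps a center only if no already kept center lies within the aggregation T2 distance. `AggregationSketch` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: greedy merge of canopy centers from several partial models.
final class AggregationSketch {

    /** Keeps a center only if no previously kept center lies within aggregationT2. */
    static List<double[]> aggregate(List<List<double[]>> partialCenters, double aggregationT2) {
        List<double[]> merged = new ArrayList<>();
        for (List<double[]> model : partialCenters) {
            for (double[] center : model) {
                boolean covered = false;
                for (double[] kept : merged) {
                    if (distance(center, kept) < aggregationT2) {
                        covered = true;
                        break;
                    }
                }
                if (!covered) {
                    merged.add(center.clone());
                }
            }
        }
        return merged;
    }

    static double distance(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) {
            s += (a[i] - b[i]) * (a[i] - b[i]);
        }
        return Math.sqrt(s);
    }

    public static void main(String[] args) {
        List<double[]> first = List.of(new double[] {0, 0}, new double[] {10, 10});
        List<double[]> second = List.of(new double[] {0.2, 0}, new double[] {20, 20});
        System.out.println(aggregate(List.of(first, second), 1.0).size());
    }
}
```

Because the merge is greedy, the result in this sketch depends on the order in which the partial models are listed.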