public class SGDText extends RandomizableClassifier implements UpdateableClassifier, UpdateableBatchProcessor, WeightedInstancesHandler, Aggregateable<SGDText>
-F Set the loss function to minimize. 0 = hinge loss (SVM), 1 = log loss (logistic regression) (default = 0)
-outputProbs Output probabilities for SVMs (fits a logistic model to the output of the SVM)
-L The learning rate (default = 0.01).
-R <double> The lambda regularization constant (default = 0.0001)
-E <integer> The number of epochs to perform (batch learning only, default = 500)
-W Use word frequencies instead of binary bag of words.
-P <# instances> How often to prune the dictionary of low frequency words (default = 0, i.e. don't prune)
-M <double> Minimum word frequency. Words with a frequency lower than this are ignored. If periodic pruning is turned on, this is also used to determine which words to remove from the dictionary (default = 3).
-min-coeff <double> Minimum absolute value of coefficients in the model. If periodic pruning is turned on, this is also used to prune words from the dictionary (default = 0.001).
-normalize Normalize document length (use in conjunction with -norm and -lnorm)
-norm <num> Specify the norm that each instance must have (default 1.0)
-lnorm <num> Specify L-norm to use (default 2.0)
-lowercase Convert all tokens to lowercase before adding to the dictionary.
-stopwords-handler The stopwords handler to use (default Null).
-tokenizer <spec> The tokenizing algorithm (classname plus parameters) to use. (default: weka.core.tokenizers.WordTokenizer)
-stemmer <spec> The stemming algorithm (classname plus parameters) to use.
-S <num> Random number seed. (default 1)
-output-debug-info If set, classifier is run in debug mode and may output additional info to the console
-do-not-check-capabilities If set, classifier capabilities are not checked before classifier is built (use with caution).
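
The interaction of -normalize, -norm and -lnorm can be illustrated with a minimal sketch. It is not SGDText's actual code: it assumes the document vector is rescaled so that its L-norm (the -lnorm value) equals the target length (the -norm value). The class DocNormSketch and the method normalize are hypothetical names introduced for illustration.

```java
import java.util.Arrays;

/** Hypothetical helper, not part of Weka: rescales a document vector to a target L-p norm. */
final class DocNormSketch {

    /**
     * Returns a copy of v scaled so that its L-p norm equals targetNorm.
     * Assumption: this mirrors the intent of -normalize / -norm / -lnorm.
     */
    static double[] normalize(double[] v, double targetNorm, double lnorm) {
        double sum = 0.0;
        for (double x : v) {
            sum += Math.pow(Math.abs(x), lnorm);
        }
        double length = Math.pow(sum, 1.0 / lnorm);
        double[] out = new double[v.length];
        if (length == 0.0) {
            return out; // an all-zero document cannot be rescaled
        }
        for (int i = 0; i < v.length; i++) {
            out[i] = v[i] * targetNorm / length;
        }
        return out;
    }

    public static void main(String[] args) {
        // Term counts 3 and 4, scaled with the documented defaults (-norm 1.0, -lnorm 2.0).
        System.out.println(Arrays.toString(normalize(new double[] {3.0, 4.0}, 1.0, 2.0)));
    }
}
```

With the defaults, every document ends up with Euclidean length 1, so long documents do not dominate the gradient simply because they contain more tokens.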
| Modifier and Type | Class and Description |
|---|---|
| static class | SGDText.Count |
| Modifier and Type | Field and Description |
|---|---|
| static int | HINGE: The hinge loss function. |
| static int | LOGLOSS: The log loss function. |
| protected double | m_bias: Holds the bias term. |
| protected Instances | m_data: The header of the training data. |
| protected java.util.LinkedHashMap<java.lang.String,SGDText.Count> | m_dictionary: The dictionary (and term weights). |
| protected int | m_epochs: The number of epochs to perform (batch learning). |
| protected boolean | m_fitLogistic: True if a logistic regression is to be fit to the output of the SVM for producing probability estimates. |
| protected Instances | m_fitLogisticStructure |
| protected java.util.LinkedHashMap<java.lang.String,SGDText.Count> | m_inputVector: Holds the current document vector (LinkedHashMap is more efficient when iterating over EntrySet than HashMap). |
| protected double | m_lambda: The regularization parameter. |
| protected double | m_learningRate: The learning rate. |
| protected double | m_lnorm: The L-norm to use. |
| protected int | m_loss: The current loss function to minimize. |
| protected boolean | m_lowercaseTokens: Whether or not to convert all tokens to lowercase. |
| protected double | m_minAbsCoefficient: Prune terms from the model that have a coefficient smaller than this. |
| protected double | m_minWordP: Only consider dictionary words (features) that occur at least this many times. |
| protected double | m_norm: The length that each document vector should have in the end. |
| protected boolean | m_normalize: Whether to normalize document length or not. |
| protected double | m_numInstances: The number of training instances. |
| protected int | m_numModels |
| protected int | m_periodicP: The number of training instances at which to periodically prune the dictionary of min frequency words. |
| protected Stemmer | m_stemmer: The stemming algorithm. |
| protected StopwordsHandler | m_StopwordsHandler: Stopword handler to use. |
| protected SGD | m_svmProbs: Used for producing probabilities for SVM via SGD logistic regression. |
| protected double | m_t: Holds the current iteration number. |
| protected Tokenizer | m_tokenizer: The tokenizer to use. |
| protected boolean | m_wordFrequencies: Use word frequencies rather than bag-of-words if true. |
| static Tag[] | TAGS_SELECTION: Loss functions to choose from. |
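Fields inherited from class RandomizableClassifier: m_Seed
Fields inherited from class AbstractClassifier: BATCH_SIZE_DEFAULT, m_BatchSize, m_Debug, m_DoNotCheckCapabilities, m_numDecimalPlaces, NUM_DECIMAL_PLACES_DEFAULT
| Constructor and Description |
|---|
| SGDText() |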
| Modifier and Type | Method and Description |
|---|---|
| SGDText | aggregate(SGDText toAggregate): Aggregate an object with this one. |
| void | batchFinished(): Signal that the training data is finished (for now). |
| double | bias() |
| void | buildClassifier(Instances data): Method for building the classifier. |
| double[] | distributionForInstance(Instance inst): Predicts the class memberships for a given instance. |
| protected double | dloss(double z) |
| protected double | dotProd(java.util.Map<java.lang.String,SGDText.Count> document) |
| java.lang.String | epochsTipText(): Returns the tip text for this property. |
| void | finalizeAggregation(): Call to complete the aggregation process. |
| Capabilities | getCapabilities(): Returns default capabilities of the classifier. |
| java.util.LinkedHashMap<java.lang.String,SGDText.Count> | getDictionary(): Get this model's dictionary (including term weights). |
| int | getDictionarySize(): Return the size of the dictionary (minus any low frequency terms that are below the threshold but haven't been pruned yet). |
| int | getEpochs(): Get the current number of epochs. |
| double | getLambda(): Get the current value of lambda. |
| double | getLearningRate(): Get the learning rate. |
| double | getLNorm(): Get the L-norm used. |
| SelectedTag | getLossFunction(): Get the current loss function. |
| boolean | getLowercaseTokens(): Get whether to convert all tokens to lowercase. |
| double | getMinAbsoluteCoefficientValue(): Get the minimum absolute magnitude for model coefficients. |
| double | getMinWordFrequency(): Get the minimum word frequency. |
| double | getNorm(): Get the instance's norm. |
| boolean | getNormalizeDocLength(): Get whether to normalize the length of each document. |
| java.lang.String[] | getOptions(): Gets the current settings of the classifier. |
| boolean | getOutputProbsForSVM(): Get whether to fit a logistic regression (itself trained using SGD) to the outputs of the SVM (if an SVM is being learned). |
| int | getPeriodicPruning(): Get how often to prune the dictionary. |
| java.lang.String | getRevision(): Returns the revision string. |
| Stemmer | getStemmer(): Returns the current stemming algorithm, null if none is used. |
| StopwordsHandler | getStopwordsHandler(): Gets the stopwords handler. |
| Tokenizer | getTokenizer(): Returns the current tokenizer algorithm. |
| boolean | getUseWordFrequencies(): Get whether to use word frequencies rather than binary bag of words representation. |
| java.lang.String | globalInfo(): Returns a string describing the classifier. |
| protected void | initializeSVMProbs(Instances data) |
| java.lang.String | lambdaTipText(): Returns the tip text for this property. |
| java.lang.String | learningRateTipText(): Returns the tip text for this property. |
| java.util.Enumeration<Option> | listOptions(): Returns an enumeration describing the available options. |
| java.lang.String | LNormTipText(): Returns the tip text for this property. |
| java.lang.String | lossFunctionTipText(): Returns the tip text for this property. |
| java.lang.String | lowercaseTokensTipText(): Returns the tip text for this property. |
| static void | main(java.lang.String[] args): Main method for testing this class. |
| java.lang.String | minAbsoluteCoefficientValueTipText(): Returns the tip text for this property. |
| java.lang.String | minWordFrequencyTipText(): Returns the tip text for this property. |
| java.lang.String | normalizeDocLengthTipText(): Returns the tip text for this property. |
| java.lang.String | normTipText(): Returns the tip text for this property. |
| java.lang.String | outputProbsForSVMTipText(): Returns the tip text for this property. |
| java.lang.String | periodicPruningTipText(): Returns the tip text for this property. |
| protected void | pruneDictionary(boolean force) |
| void | reset(): Reset the classifier. |
| void | setBias(double bias) |
| void | setEpochs(int e): Set the number of epochs to use. |
| void | setLambda(double lambda): Set the value of lambda to use. |
| void | setLearningRate(double lr): Set the learning rate. |
| void | setLNorm(double newLNorm): Set the L-norm to use. |
| void | setLossFunction(SelectedTag function): Set the loss function to use. |
| void | setLowercaseTokens(boolean l): Set whether to convert all tokens to lowercase. |
| void | setMinAbsoluteCoefficientValue(double minCoeff): Set the minimum absolute magnitude for model coefficients. |
| void | setMinWordFrequency(double minFreq): Set the minimum word frequency. |
| void | setNorm(double newNorm): Set the norm of the instances. |
| void | setNormalizeDocLength(boolean norm): Set whether to normalize the length of each document. |
| void | setOptions(java.lang.String[] options): Parses a given list of options. |
| void | setOutputProbsForSVM(boolean o): Set whether to fit a logistic regression (itself trained using SGD) to the outputs of the SVM (if an SVM is being learned). |
| void | setPeriodicPruning(int p): Set how often to prune the dictionary. |
| void | setStemmer(Stemmer value): Sets the stemming algorithm to use; null means no stemming at all (i.e., the NullStemmer is used). |
| void | setStopwordsHandler(StopwordsHandler value): Sets the stopwords handler to use. |
| void | setTokenizer(Tokenizer value): Sets the tokenizer algorithm to use. |
| void | setUseWordFrequencies(boolean u): Set whether to use word frequencies rather than binary bag of words representation. |
| java.lang.String | stemmerTipText(): Returns the tip text for this property. |
| java.lang.String | stopwordsHandlerTipText(): Returns the tip text for this property. |
| protected double | svmOutput() |
| protected void | tokenizeInstance(Instance instance, boolean updateDictionary) |
| java.lang.String | tokenizerTipText(): Returns the tip text for this property. |
| java.lang.String | toString() |
| protected void | train(Instances data) |
| void | updateClassifier(Instance instance): Updates the classifier with the given instance. |
| protected void | updateClassifier(Instance instance, boolean updateDictionary) |
| java.lang.String | useWordFrequenciesTipText(): Returns the tip text for this property. |
Methods inherited from class RandomizableClassifier: getSeed, seedTipText, setSeed
Methods inherited from class AbstractClassifier: batchSizeTipText, classifyInstance, debugTipText, distributionsForInstances, doNotCheckCapabilitiesTipText, forName, getBatchSize, getDebug, getDoNotCheckCapabilities, getNumDecimalPlaces, implementsMoreEfficientBatchPrediction, makeCopies, makeCopy, numDecimalPlacesTipText, postExecution, preExecution, run, runClassifier, setBatchSize, setDebug, setDoNotCheckCapabilities, setNumDecimalPlaces
protected int m_periodicP
protected double m_minWordP
protected double m_minAbsCoefficient
protected boolean m_wordFrequencies
protected boolean m_normalize
protected double m_norm
protected double m_lnorm
protected java.util.LinkedHashMap<java.lang.String,SGDText.Count> m_dictionary
protected StopwordsHandler m_StopwordsHandler
protected Tokenizer m_tokenizer
protected boolean m_lowercaseTokens
protected Stemmer m_stemmer
protected double m_lambda
protected double m_learningRate
protected double m_t
protected double m_bias
protected double m_numInstances
protected Instances m_data
protected int m_epochs
protected transient java.util.LinkedHashMap<java.lang.String,SGDText.Count> m_inputVector
public static final int HINGE
public static final int LOGLOSS
protected int m_loss
public static final Tag[] TAGS_SELECTION
protected SGD m_svmProbs
protected boolean m_fitLogistic
protected Instances m_fitLogisticStructure
protected int m_numModels
protected double dloss(double z)
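
The page gives no description for dloss, so the following is only a sketch of what a loss-derivative helper for the two documented loss functions might look like. It assumes z is the margin (label times model output, with labels in {-1, +1}) and that the method returns the magnitude of the loss gradient with respect to z. The class name LossSketch and the extra loss parameter are hypothetical; the constant values 0 and 1 follow the -F option described above.

```java
/** Hypothetical illustration, not Weka source: gradient magnitudes for hinge and log loss. */
final class LossSketch {
    static final int HINGE = 0;   // matches -F 0
    static final int LOGLOSS = 1; // matches -F 1

    /**
     * Returns -dL/dz for the chosen loss, where z is the margin y * f(x).
     * Hinge: L(z) = max(0, 1 - z). Log loss: L(z) = log(1 + exp(-z)).
     */
    static double dloss(int loss, double z) {
        if (loss == HINGE) {
            return z < 1.0 ? 1.0 : 0.0;
        }
        // Log loss: 1 / (1 + exp(z)), evaluated in two branches to avoid overflow.
        if (z < 0.0) {
            return 1.0 / (Math.exp(z) + 1.0);
        }
        double t = Math.exp(-z);
        return t / (t + 1.0);
    }

    public static void main(String[] args) {
        System.out.println(dloss(HINGE, 0.5));   // inside the margin, so the gradient is active
        System.out.println(dloss(LOGLOSS, 0.0)); // on the decision boundary
    }
}
```

In a stochastic gradient step, a value like this would typically be multiplied by the learning rate, the label and the feature value to obtain the weight update.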
public Capabilities getCapabilities()
Specified by: getCapabilities in interface Classifier
Specified by: getCapabilities in interface CapabilitiesHandler
Overrides: getCapabilities in class AbstractClassifier
See Also: Capabilities
public void setStemmer(Stemmer value)
Parameters: value - the configured stemming algorithm, or null
See Also: NullStemmer
public Stemmer getStemmer()
public java.lang.String stemmerTipText()
public void setTokenizer(Tokenizer value)
Parameters: value - the configured tokenizing algorithm
public Tokenizer getTokenizer()
public java.lang.String tokenizerTipText()
public java.lang.String useWordFrequenciesTipText()
public void setUseWordFrequencies(boolean u)
Parameters: u - true if word frequencies are to be used.
public boolean getUseWordFrequencies()
public java.lang.String lowercaseTokensTipText()
public void setLowercaseTokens(boolean l)
Parameters: l - true if all tokens are to be converted to lowercase
public boolean getLowercaseTokens()
public void setStopwordsHandler(StopwordsHandler value)
Parameters: value - the stopwords handler; if null, Null is used
public StopwordsHandler getStopwordsHandler()
public java.lang.String stopwordsHandlerTipText()
public java.lang.String periodicPruningTipText()
public void setPeriodicPruning(int p)
Parameters: p - how often to prune
public int getPeriodicPruning()
public java.lang.String minWordFrequencyTipText()
public void setMinWordFrequency(double minFreq)
Parameters: minFreq - the minimum word frequency to use
public double getMinWordFrequency()
public java.lang.String minAbsoluteCoefficientValueTipText()
public void setMinAbsoluteCoefficientValue(double minCoeff)
Parameters: minCoeff - the minimum absolute value of a model coefficient
public double getMinAbsoluteCoefficientValue()
public java.lang.String normalizeDocLengthTipText()
public void setNormalizeDocLength(boolean norm)
Parameters: norm - true if document length is to be normalized
public boolean getNormalizeDocLength()
public java.lang.String normTipText()
public double getNorm()
public void setNorm(double newNorm)
Parameters: newNorm - the norm to which the instances must be set
public java.lang.String LNormTipText()
public double getLNorm()
public void setLNorm(double newLNorm)
Parameters: newLNorm - the L-norm
public java.lang.String lambdaTipText()
public void setLambda(double lambda)
Parameters: lambda - the value of lambda to use
public double getLambda()
public void setLearningRate(double lr)
Parameters: lr - the learning rate to use.
public double getLearningRate()
public java.lang.String learningRateTipText()
public java.lang.String epochsTipText()
public void setEpochs(int e)
Parameters: e - the number of epochs to use
public int getEpochs()
public void setLossFunction(SelectedTag function)
Parameters: function - the loss function to use.
public SelectedTag getLossFunction()
public java.lang.String lossFunctionTipText()
public void setOutputProbsForSVM(boolean o)
Parameters: o - true if a logistic regression is to be fit to the output of the SVM to produce probability estimates.
public boolean getOutputProbsForSVM()
public java.lang.String outputProbsForSVMTipText()
public java.util.Enumeration<Option> listOptions()
Specified by: listOptions in interface OptionHandler
Overrides: listOptions in class RandomizableClassifier
public void setOptions(java.lang.String[] options) throws java.lang.Exception
Specified by: setOptions in interface OptionHandler
Overrides: setOptions in class RandomizableClassifier
Parameters: options - the list of options as an array of strings
Throws: java.lang.Exception - if an option is not supported
public java.lang.String[] getOptions()
Specified by: getOptions in interface OptionHandler
Overrides: getOptions in class RandomizableClassifier
public java.lang.String globalInfo()
public void reset()
public void buildClassifier(Instances data) throws java.lang.Exception
Specified by: buildClassifier in interface Classifier
Parameters: data - the set of training instances.
Throws: java.lang.Exception - if the classifier can't be built successfully.
protected void initializeSVMProbs(Instances data) throws java.lang.Exception
Throws: java.lang.Exception
protected void train(Instances data) throws java.lang.Exception
Throws: java.lang.Exception
public void updateClassifier(Instance instance) throws java.lang.Exception
Specified by: updateClassifier in interface UpdateableClassifier
Parameters: instance - the new training instance to include in the model
Throws: java.lang.Exception - if the instance could not be incorporated in the model.
protected void updateClassifier(Instance instance, boolean updateDictionary) throws java.lang.Exception
Throws: java.lang.Exception
protected void tokenizeInstance(Instance instance, boolean updateDictionary)
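
To make the difference between the binary bag-of-words representation and word frequencies (-W) concrete, here is a small self-contained sketch. It is an assumption-laden illustration rather than the tokenizeInstance implementation: it splits on whitespace instead of using a configurable Tokenizer, skips stemming and stopword handling, and uses the hypothetical names BagOfWordsSketch and vectorize.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical illustration, not Weka source: turns text into a sparse document vector. */
final class BagOfWordsSketch {

    /**
     * Builds a term-to-value map for one document.
     * wordFrequencies = false gives a binary vector (1.0 per distinct term);
     * wordFrequencies = true gives the number of occurrences of each term.
     */
    static Map<String, Double> vectorize(String text, boolean lowercase, boolean wordFrequencies) {
        Map<String, Double> doc = new LinkedHashMap<>();
        for (String raw : text.split("\\s+")) {
            if (raw.isEmpty()) {
                continue; // leading whitespace yields an empty first token
            }
            String token = lowercase ? raw.toLowerCase(Locale.ROOT) : raw;
            if (wordFrequencies) {
                doc.merge(token, 1.0, Double::sum);
            } else {
                doc.put(token, 1.0);
            }
        }
        return doc;
    }

    public static void main(String[] args) {
        System.out.println(vectorize("The cat saw the dog", true, true));
        System.out.println(vectorize("The cat saw the dog", true, false));
    }
}
```

A LinkedHashMap is used here because the field summary notes that it iterates over its entry set more efficiently than a HashMap.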
protected void pruneDictionary(boolean force)
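
The options -P, -M and -min-coeff together describe dictionary pruning. The sketch below shows one plausible reading of that description, under the assumption that a term is dropped when its count is below the minimum word frequency or its coefficient is smaller in magnitude than the minimum absolute coefficient. It does not model the force parameter, and PruneSketch, TermStats and prune are hypothetical names, not part of Weka.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration, not Weka source: prunes rare or near-zero terms. */
final class PruneSketch {

    /** Minimal stand-in for a dictionary entry: an occurrence count and a model weight. */
    static final class TermStats {
        final double count;
        final double weight;

        TermStats(double count, double weight) {
            this.count = count;
            this.weight = weight;
        }
    }

    /**
     * Removes terms whose count is below minFrequency or whose absolute weight is
     * below minAbsCoefficient (assumed rule). Returns the number of removed terms.
     */
    static int prune(Map<String, TermStats> dictionary, double minFrequency, double minAbsCoefficient) {
        int before = dictionary.size();
        dictionary.values().removeIf(
            t -> t.count < minFrequency || Math.abs(t.weight) < minAbsCoefficient);
        return before - dictionary.size();
    }

    public static void main(String[] args) {
        Map<String, TermStats> dictionary = new LinkedHashMap<>();
        dictionary.put("frequent", new TermStats(5, 0.5));
        dictionary.put("rare", new TermStats(1, 0.5));
        dictionary.put("weak", new TermStats(5, 0.0001));
        int removed = prune(dictionary, 3.0, 0.001); // documented defaults for -M and -min-coeff
        System.out.println(removed + " removed, kept " + dictionary.keySet());
    }
}
```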
protected double svmOutput()
public double[] distributionForInstance(Instance inst) throws java.lang.Exception
Description copied from class: AbstractClassifier
Predicts the class memberships for a given instance.
Specified by: distributionForInstance in interface Classifier
Overrides: distributionForInstance in class AbstractClassifier
Parameters: inst - the instance to be classified
Throws: java.lang.Exception - if distribution could not be computed successfully
protected double dotProd(java.util.Map<java.lang.String,SGDText.Count> document)
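
Since dotProd is undocumented here, the following is a sketch of a sparse dot product between a document vector and a weight dictionary, which is what the signature suggests. It assumes terms missing from the dictionary contribute nothing, and it uses plain Map<String, Double> values in place of SGDText.Count; SparseDotSketch is a hypothetical name.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration, not Weka source: sparse dot product over term maps. */
final class SparseDotSketch {

    /** Sums documentValue * weight over the terms of the document that have a weight. */
    static double dotProd(Map<String, Double> document, Map<String, Double> weights) {
        double result = 0.0;
        for (Map.Entry<String, Double> term : document.entrySet()) {
            Double w = weights.get(term.getKey());
            if (w != null) {
                result += term.getValue() * w;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Double> document = new LinkedHashMap<>();
        document.put("good", 2.0);
        document.put("bad", 1.0);
        document.put("unseen", 1.0); // not in the dictionary, so it is ignored
        Map<String, Double> weights = new LinkedHashMap<>();
        weights.put("good", 0.5);
        weights.put("bad", -0.25);
        System.out.println(dotProd(document, weights));
    }
}
```

Iterating over the document rather than the dictionary keeps the cost proportional to the number of distinct terms in the document, which is usually far smaller than the dictionary.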
public java.lang.String toString()
Overrides: toString in class java.lang.Object
public java.util.LinkedHashMap<java.lang.String,SGDText.Count> getDictionary()
public int getDictionarySize()
public double bias()
public void setBias(double bias)
public java.lang.String getRevision()
Specified by: getRevision in interface RevisionHandler
Overrides: getRevision in class AbstractClassifier
public SGDText aggregate(SGDText toAggregate) throws java.lang.Exception
Specified by: aggregate in interface Aggregateable<SGDText>
Parameters: toAggregate - the object to aggregate
Throws: java.lang.Exception - if the supplied object can't be aggregated for some reason
public void finalizeAggregation() throws java.lang.Exception
Specified by: finalizeAggregation in interface Aggregateable<SGDText>
Throws: java.lang.Exception - if the aggregation can't be finalized for some reason
public void batchFinished() throws java.lang.Exception
Description copied from interface: UpdateableBatchProcessor
Signal that the training data is finished (for now).
Specified by: batchFinished in interface UpdateableBatchProcessor
Throws: java.lang.Exception - if a problem occurs
public static void main(java.lang.String[] args)
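
The aggregate and finalizeAggregation methods listed above follow a two-phase pattern: combine several models, then complete the combination. A common way to do this for linear models is to sum the weights and then divide by the number of models. The sketch below illustrates that idea only; whether SGDText averages in exactly this way is an assumption, and AggregateSketch with its fields is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration, not Weka source: averages linear models in two phases. */
final class AggregateSketch {
    final Map<String, Double> weights = new LinkedHashMap<>();
    double bias;
    int numModels = 1; // how many models have been summed into this one

    AggregateSketch(Map<String, Double> weights, double bias) {
        this.weights.putAll(weights);
        this.bias = bias;
    }

    /** Phase 1: add the other model's weights and bias to this model. */
    AggregateSketch aggregate(AggregateSketch other) {
        for (Map.Entry<String, Double> e : other.weights.entrySet()) {
            weights.merge(e.getKey(), e.getValue(), Double::sum);
        }
        bias += other.bias;
        numModels += other.numModels;
        return this;
    }

    /** Phase 2: turn the sums into averages and reset the model counter. */
    void finalizeAggregation() {
        final int n = numModels;
        weights.replaceAll((term, w) -> w / n);
        bias /= n;
        numModels = 1;
    }

    public static void main(String[] args) {
        Map<String, Double> first = new LinkedHashMap<>();
        first.put("x", 1.0);
        Map<String, Double> second = new LinkedHashMap<>();
        second.put("x", 3.0);
        second.put("y", 2.0);
        AggregateSketch model = new AggregateSketch(first, 0.0).aggregate(new AggregateSketch(second, 2.0));
        model.finalizeAggregation();
        System.out.println(model.weights + " bias=" + model.bias);
    }
}
```

Splitting the work into two phases lets any number of partial models be summed in any order before a single division, which suits training on separate data partitions.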