Class TermFrequencyParser<V extends elki.data.SparseNumberVector>

  • All Implemented Interfaces:
    elki.datasource.bundle.BundleStreamSource, Parser, StreamingParser

    public class TermFrequencyParser<V extends elki.data.SparseNumberVector>
    extends NumberVectorLabelParser<V>
    A parser for term frequency data, which is essentially a set of sparse vectors with text keys.

    Parse a file containing term frequencies. The expected format is:

     rowlabel1 term1 <freq> term2 <freq> ...
     rowlabel2 term1 <freq> term3 <freq> ...
     
    Terms must not contain the separator character!

    If your data does not contain frequencies, consider using SimpleTransactionParser instead.
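
    The line format above can be illustrated with a minimal, self-contained sketch. It does not use ELKI and is not the parser's actual implementation: the class TermFrequencyLineSketch, its Row type, and the choice of whitespace as the separator are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper, not part of ELKI: reads one "label term freq term freq ..." line. */
final class TermFrequencyLineSketch {
  /** A row label together with its term frequencies, kept in input order. */
  static final class Row {
    final String label;
    final Map<String, Double> frequencies;

    Row(String label, Map<String, Double> frequencies) {
      this.label = label;
      this.frequencies = frequencies;
    }
  }

  static Row parseLine(String line) {
    // Assumption: whitespace is the separator, so a term must not contain it.
    String[] tokens = line.trim().split("\\s+");
    if (tokens[0].isEmpty()) {
      throw new IllegalArgumentException("empty line");
    }
    // One label plus (term, frequency) pairs gives an odd token count.
    if (tokens.length % 2 == 0) {
      throw new IllegalArgumentException("term without frequency: " + line);
    }
    Map<String, Double> frequencies = new LinkedHashMap<>();
    for (int i = 1; i < tokens.length; i += 2) {
      frequencies.put(tokens[i], Double.parseDouble(tokens[i + 1]));
    }
    return new Row(tokens[0], frequencies);
  }

  public static void main(String[] args) {
    Row row = parseLine("doc1 apple 3 banana 1.5");
    System.out.println(row.label + " -> " + row.frequencies);
  }
}
```

    The sketch also shows why terms must not contain the separator: a term with embedded whitespace would be split into two tokens and the term/frequency pairing would break.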

    Since:
    0.4.0
    Author:
    Erich Schubert
    • Field Detail

      • LOG

        private static final elki.logging.Logging LOG
        Class logger.
      • numterms

        int numterms
        Number of different terms observed.
      • keymap

        it.unimi.dsi.fastutil.objects.Object2IntOpenHashMap<java.lang.String> keymap
        Map from term strings to integer keys.
      • normalize

        boolean normalize
        Whether to normalize the parsed vectors.
      • sparsefactory

        private elki.data.SparseNumberVector.Factory<V extends elki.data.SparseNumberVector> sparsefactory
        Same as NumberVectorLabelParser.factory, but typed as the sparse vector factory subtype.
      • values

        it.unimi.dsi.fastutil.ints.Int2DoubleOpenHashMap values
        (Reused) set of values for the number vector.
      • labels

        java.util.ArrayList<java.lang.String> labels
        (Reused) label buffer.
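
    The interplay of keymap and numterms can be sketched as follows. This is an assumption about how such fields are typically used, not ELKI's code: the class TermIndexSketch is hypothetical, java.util.HashMap stands in for the fastutil map, and zero-based indices assigned in order of first appearance are an illustrative choice.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in, not ELKI code: gives every distinct term a dimension index. */
final class TermIndexSketch {
  // Plays the role of keymap; HashMap replaces fastutil's Object2IntOpenHashMap here.
  private final Map<String, Integer> keymap = new HashMap<>();

  // Plays the role of numterms: the number of different terms observed so far.
  private int numterms = 0;

  /** Returns the index of a term, assigning the next free index on first sight. */
  int indexOf(String term) {
    Integer index = keymap.get(term);
    if (index == null) {
      index = numterms++;
      keymap.put(term, index);
    }
    return index;
  }

  int numTerms() {
    return numterms;
  }

  public static void main(String[] args) {
    TermIndexSketch index = new TermIndexSketch();
    for (String term : new String[] { "apple", "banana", "apple", "cherry" }) {
      System.out.println(term + " -> dimension " + index.indexOf(term));
    }
    System.out.println("distinct terms: " + index.numTerms());
  }
}
```

    With a mapping of this kind, each text key becomes a stable integer dimension, which is what allows rows with different term sets to be stored as sparse number vectors in a shared space.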
    • Constructor Detail

      • TermFrequencyParser

        public TermFrequencyParser(boolean normalize,
                                   elki.data.SparseNumberVector.Factory<V> factory)
        Constructor.
        Parameters:
        normalize - Whether to normalize the vectors
        factory - Vector type
      • TermFrequencyParser

        public TermFrequencyParser(boolean normalize,
                                   CSVReaderFormat format,
                                   long[] labelIndices,
                                   elki.data.SparseNumberVector.Factory<V> factory)
        Constructor.
        Parameters:
        normalize - Whether to normalize the vectors
        format - Input format
        labelIndices - Indices to use as labels
        factory - Vector type
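
    The documentation above does not state which normalization the normalize flag applies. As a hedged illustration only, the sketch below assumes scaling to unit sum, which turns raw term counts into relative frequencies. The class NormalizeSketch and its method are hypothetical and do not belong to ELKI.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration, not ELKI code: scales term counts so that they sum to 1. */
final class NormalizeSketch {
  /** Returns a scaled copy; input whose sum is not positive is copied unchanged. */
  static Map<String, Double> normalizeToUnitSum(Map<String, Double> counts) {
    double sum = 0.0;
    for (double value : counts.values()) {
      sum += value;
    }
    Map<String, Double> result = new LinkedHashMap<>();
    for (Map.Entry<String, Double> entry : counts.entrySet()) {
      // Assumption: unit-sum scaling; the real parser may use a different norm.
      result.put(entry.getKey(), sum > 0.0 ? entry.getValue() / sum : entry.getValue());
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, Double> counts = new LinkedHashMap<>();
    counts.put("apple", 3.0);
    counts.put("banana", 1.0);
    System.out.println(normalizeToUnitSum(counts));
  }
}
```

    Normalizing in this way makes rows of very different lengths comparable, since each vector then describes the proportion of each term rather than its absolute count.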
    • Method Detail

      • parseLineInternal

        protected boolean parseLineInternal()
        Description copied from class: NumberVectorLabelParser
        Internal method for parsing a single line. Used by both line-based parsing and block parsing, which avoids building metadata for each line.
        Overrides:
        parseLineInternal in class NumberVectorLabelParser<V extends elki.data.SparseNumberVector>
        Returns:
        true when a valid line was read, false on a label row.
      • getTypeInformation

        protected elki.data.type.SimpleTypeInformation<V> getTypeInformation(int mindim,
                                                                             int maxdim)
        Description copied from class: NumberVectorLabelParser
        Get a prototype object for the given dimensionality.
        Overrides:
        getTypeInformation in class NumberVectorLabelParser<V extends elki.data.SparseNumberVector>
        Parameters:
        mindim - Minimum dimensionality
        maxdim - Maximum dimensionality
        Returns:
        Prototype object