Class SparseNumberVectorLabelParser<V extends elki.data.SparseNumberVector>

  • Type Parameters:
    V - vector type
    All Implemented Interfaces:
    elki.datasource.bundle.BundleStreamSource, Parser, StreamingParser
    Direct Known Subclasses:
    LibSVMFormatParser

    @Title("Sparse Vector Label Parser")
    @Description("Parser for the following line format:\nA single line provides a single point. Entries are separated by whitespace. The values will be parsed as floats (resulting in a set of SparseFloatVectors).\nA line is expected in the following format:\nThe first entry of each line is the number of attributes with coordinate value not zero. Subsequent entries are of the form (index, value), where index is the number of the corresponding dimension, and value is the value of the corresponding attribute. Any pair of two subsequent substrings not containing whitespace is tried to be read as int and float. If this fails for the first of the pair (interpreted as index), it will be appended to a label. (Thus, any label must not be parseable as Integer.) If the float component is not parseable, an exception will be thrown. Empty lines and lines beginning with \"#\" will be ignored.")
    public class SparseNumberVectorLabelParser<V extends elki.data.SparseNumberVector>
    extends NumberVectorLabelParser<V>
    Parser for parsing one point per line, attributes separated by whitespace.

    Several labels may be given per point. A label must not be parseable as double. Lines starting with "#" will be ignored.

    A line is expected in the following format: The first entry of each line is the number of attributes with coordinate value not zero. Subsequent entries are of the form index value each, where index is the number of the corresponding dimension, and value is the value of the corresponding attribute. A complete line then could look like this:

     3 7 12.34 8 56.78 11 1.234 objectlabel
     
    where 3 indicates that three attributes are set, 7, 8, and 11 are the attribute indices, and objectlabel is a non-numerical object label.
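
    The line format described above can be illustrated with a minimal, self-contained sketch. It is not the ELKI implementation: the class SparseLineSketch, the holder ParsedLine, and the helper tryParseInt are hypothetical names introduced only for this example, and checking the leading count against the number of pairs read is this sketch's own choice, not documented behavior of the parser.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch of the sparse line format; hypothetical, not ELKI code. */
public class SparseLineSketch {
  /** Result of parsing one line (hypothetical holder, not an ELKI type). */
  public static final class ParsedLine {
    public final Map<Integer, Double> values = new LinkedHashMap<>();
    public final List<String> labels = new ArrayList<>();
  }

  /** Returns null for empty lines and comment lines starting with "#". */
  public static ParsedLine parse(String line) {
    String trimmed = line.trim();
    if (trimmed.isEmpty() || trimmed.startsWith("#")) {
      return null;
    }
    String[] tok = trimmed.split("\\s+");
    // First entry: number of attributes with a non-zero value.
    int expected = Integer.parseInt(tok[0]);
    ParsedLine out = new ParsedLine();
    int i = 1;
    while (i < tok.length) {
      Integer index = tryParseInt(tok[i]);
      if (index == null) {
        // Not parseable as an index: treat the token as (part of) a label.
        out.labels.add(tok[i]);
        i++;
        continue;
      }
      if (i + 1 >= tok.length) {
        throw new IllegalArgumentException("Index without value: " + tok[i]);
      }
      // Throws NumberFormatException if the value is not a number.
      out.values.put(index, Double.parseDouble(tok[i + 1]));
      i += 2;
    }
    // Assumption of this sketch: reject lines whose count does not match.
    if (out.values.size() != expected) {
      throw new IllegalArgumentException(
          "Expected " + expected + " pairs, found " + out.values.size());
    }
    return out;
  }

  private static Integer tryParseInt(String s) {
    try {
      return Integer.valueOf(s);
    } catch (NumberFormatException e) {
      return null;
    }
  }

  public static void main(String[] args) {
    ParsedLine p = parse("3 7 12.34 8 56.78 11 1.234 objectlabel");
    System.out.println(p.values + " " + p.labels);
  }
}
```

    The sketch keeps the pairs in a plain map from dimension index to value, which mirrors the idea of the reusable values buffer listed under Field Detail without depending on fastutil.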

    An index can be specified to identify an entry to be treated as a class label. This index counts all entries (numeric values and labels alike), starting at 0.

    Since:
    0.2
    Author:
    Arthur Zimek
    • Field Detail

      • LOG

        private static final elki.logging.Logging LOG
        Class logger.
      • sparsefactory

        protected elki.data.SparseNumberVector.Factory<V extends elki.data.SparseNumberVector> sparsefactory
        Same as NumberVectorLabelParser.factory, but subtype.
      • values

        it.unimi.dsi.fastutil.ints.Int2DoubleOpenHashMap values
        (Reused) set of values for the number vector.
      • labels

        java.util.ArrayList<java.lang.String> labels
        (Reused) label buffer.
    • Constructor Detail

      • SparseNumberVectorLabelParser

        public SparseNumberVectorLabelParser(CSVReaderFormat format,
                                             long[] labelIndices,
                                             elki.data.SparseNumberVector.Factory<V> factory)
        Constructor.
        Parameters:
        format - Input format
        labelIndices - Indices to use as labels
        factory - Vector factory
      • SparseNumberVectorLabelParser

        public SparseNumberVectorLabelParser(java.util.regex.Pattern colSep,
                                             java.lang.String quoteChars,
                                             java.util.regex.Pattern comment,
                                             long[] labelIndices,
                                             elki.data.SparseNumberVector.Factory<V> factory)
        Constructor.
        Parameters:
        colSep - Column separator
        quoteChars - Quotation characters
        comment - Comment pattern
        labelIndices - Indices to use as labels
        factory - Vector factory
    • Method Detail

      • parseLineInternal

        protected boolean parseLineInternal()
        Description copied from class: NumberVectorLabelParser
        Internal method for parsing a single line. Used by both line based parsing as well as block parsing. This saves the building of meta data for each line.
        Overrides:
        parseLineInternal in class NumberVectorLabelParser<V extends elki.data.SparseNumberVector>
        Returns:
        true when a valid line was read, false on a label row.
      • getTypeInformation

        protected elki.data.type.SimpleTypeInformation<V> getTypeInformation(int mindim,
                                                                             int maxdim)
        Description copied from class: NumberVectorLabelParser
        Get a prototype object for the given dimensionality.
        Overrides:
        getTypeInformation in class NumberVectorLabelParser<V extends elki.data.SparseNumberVector>
        Parameters:
        mindim - Minimum dimensionality
        maxdim - Maximum dimensionality
        Returns:
        Prototype object