Class NumberVectorLabelParser<V extends elki.data.NumberVector>

  • Type Parameters:
    V - the type of NumberVector used
    All Implemented Interfaces:
    elki.datasource.bundle.BundleStreamSource, Parser, StreamingParser
    Direct Known Subclasses:
    BitVectorLabelParser, CategorialDataAsNumberVectorParser, SparseNumberVectorLabelParser, TermFrequencyParser

    public class NumberVectorLabelParser<V extends elki.data.NumberVector>
    extends AbstractStreamingParser
    Parser for a simple CSV-like format, with columns separated by the given pattern (default: whitespace).

    Several labels may be given per point. A label must not be parseable as a double. Lines starting with "#" are ignored.

    An index can be specified to identify an entry to be treated as the class label. This index counts all entries, numeric values and labels alike, starting at 0.
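
The format rules above can be illustrated with a minimal, ELKI-independent sketch. The class `LineFormatSketch` and its members are hypothetical names introduced only for this example, and `Double.parseDouble` is used as a stand-in for the parser's own number handling, which may accept a different set of inputs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of ELKI: mimics the documented line format. */
class LineFormatSketch {
  /** Column separator assumed for this sketch: runs of whitespace. */
  static final Pattern WHITESPACE = Pattern.compile("\\s+");

  /** Numeric values and labels found in one line. */
  static final class Row {
    final List<Double> values = new ArrayList<>();
    final List<String> labels = new ArrayList<>();
  }

  /** Returns null for comment or empty lines, otherwise the parsed row. */
  static Row parse(String line) {
    String trimmed = line.trim();
    if (trimmed.isEmpty() || trimmed.startsWith("#")) {
      return null; // lines starting with "#" are ignored
    }
    Row row = new Row();
    for (String token : WHITESPACE.split(trimmed)) {
      try {
        row.values.add(Double.parseDouble(token));
      } catch (NumberFormatException e) {
        row.labels.add(token); // anything that is not a double is a label
      }
    }
    return row;
  }

  public static void main(String[] args) {
    Row row = parse("5.1 3.5 1.4 0.2 Iris-setosa");
    System.out.println(row.values + " " + row.labels);
    System.out.println(parse("# a comment line"));
  }
}
```

In the sample line, the entry at index 4 (counting numeric entries and labels together, starting at 0) is the one that would be selected as the class label.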

    Since:
    0.1
    Author:
    Arthur Zimek, Erich Schubert
    • Nested Class Summary

      Nested Classes 
      Modifier and Type Class Description
      static class  NumberVectorLabelParser.Par<V extends elki.data.NumberVector>
      Parameterization class.
      • Nested classes/interfaces inherited from interface elki.datasource.bundle.BundleStreamSource

        elki.datasource.bundle.BundleStreamSource.Event
    • Field Summary

      Fields 
      Modifier and Type Field Description
      protected elki.utilities.datastructures.arraylike.DoubleArray attributes
      Double array storing the numerical attributes during parsing.
      protected java.util.List<java.lang.String> columnnames
      Column names.
      protected elki.data.LabelList curlbl
      Current labels.
      protected V curvec
      Current vector.
      protected elki.data.NumberVector.Factory<V> factory
      Vector factory class.
      protected boolean haslabels
      Whether or not the data set has labels.
      private long[] labelIndices
      Keeps the indices of the attributes to be treated as string labels.
      (package private) java.util.ArrayList<java.lang.String> labels
      (Reused) store for labels.
      private static elki.logging.Logging LOG
      Logging class.
      protected int maxdim
      Maximum dimensionality reported.
      protected elki.datasource.bundle.BundleMeta meta
      Metadata.
      protected int mindim
      Minimum dimensionality reported.
      (package private) elki.datasource.bundle.BundleStreamSource.Event nextevent
      Event to report next.
      (package private) it.unimi.dsi.fastutil.objects.ObjectOpenHashSet<java.lang.String> unique
      For String unification.
      (package private) boolean warnedDim
      Emit a dimensionality change warning once.
      (package private) boolean warnedPrecision
      Emit a double-precision limit warning once.
    • Constructor Summary

      Constructors 
      Constructor Description
      NumberVectorLabelParser​(elki.data.NumberVector.Factory<V> factory)
      Constructor with defaults.
      NumberVectorLabelParser​(CSVReaderFormat format, long[] labelIndices, elki.data.NumberVector.Factory<V> factory)
      Constructor.
      NumberVectorLabelParser​(java.util.regex.Pattern colSep, java.lang.String quoteChars, java.util.regex.Pattern comment, long[] labelIndices, elki.data.NumberVector.Factory<V> factory)
      Constructor.
    • Method Summary

      All Methods Instance Methods Concrete Methods 
      Modifier and Type Method Description
      protected void buildMeta()
      Update the meta element.
      void cleanup()
      Perform cleanup operations after parsing.
      protected V createVector()
      Creates a database object of type V.
      java.lang.Object data​(int rnum)  
      protected elki.logging.Logging getLogger()
      Get the logger for this class.
      elki.datasource.bundle.BundleMeta getMeta()  
      (package private) elki.data.type.SimpleTypeInformation<V> getTypeInformation​(int mindim, int maxdim)
      Get a prototype object for the given dimensionality.
      void initStream​(java.io.InputStream in)
      Init the streaming parser for the given input stream.
      protected boolean isLabelColumn​(int col)
      Test whether the given column is marked as a label column.
      elki.datasource.bundle.BundleStreamSource.Event nextEvent()  
      protected boolean parseLineInternal()
      Internal method for parsing a single line.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • LOG

        private static final elki.logging.Logging LOG
        Logging class.
      • labelIndices

        private long[] labelIndices
        Keeps the indices of the attributes to be treated as string labels.
      • factory

        protected elki.data.NumberVector.Factory<V> factory
        Vector factory class.
      • mindim

        protected int mindim
        Minimum dimensionality reported.
      • maxdim

        protected int maxdim
        Maximum dimensionality reported.
      • meta

        protected elki.datasource.bundle.BundleMeta meta
        Metadata.
      • columnnames

        protected java.util.List<java.lang.String> columnnames
        Column names.
      • haslabels

        protected boolean haslabels
        Whether or not the data set has labels.
      • curvec

        protected V curvec
        Current vector.
      • curlbl

        protected elki.data.LabelList curlbl
        Current labels.
      • attributes

        protected elki.utilities.datastructures.arraylike.DoubleArray attributes
        Double array storing the numerical attributes during parsing.
      • labels

        final java.util.ArrayList<java.lang.String> labels
        (Reused) store for labels.
      • unique

        it.unimi.dsi.fastutil.objects.ObjectOpenHashSet<java.lang.String> unique
        For String unification.
      • nextevent

        elki.datasource.bundle.BundleStreamSource.Event nextevent
        Event to report next.
      • warnedPrecision

        boolean warnedPrecision
        Emit a double-precision limit warning once.
      • warnedDim

        boolean warnedDim
        Emit a dimensionality change warning once.
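
The `unique` field is described as being "for String unification". A minimal sketch of that general idea follows, assuming it means that equal label strings are replaced by a single shared instance so repeated labels do not occupy memory many times. The class `LabelUnifierSketch` is a hypothetical name and uses a plain `HashMap` rather than the fastutil set named in the field type.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of string unification; not the ELKI implementation. */
class LabelUnifierSketch {
  /** Maps each distinct label to the first instance that was seen. */
  private final Map<String, String> seen = new HashMap<>();

  /** Returns the canonical instance for the given label. */
  String unify(String label) {
    String previous = seen.putIfAbsent(label, label);
    return previous != null ? previous : label;
  }

  public static void main(String[] args) {
    LabelUnifierSketch unifier = new LabelUnifierSketch();
    String first = unifier.unify(new String("Iris-setosa"));
    String second = unifier.unify(new String("Iris-setosa"));
    System.out.println(first == second); // both refer to the same object
  }
}
```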
    • Constructor Detail

      • NumberVectorLabelParser

        public NumberVectorLabelParser​(CSVReaderFormat format,
                                       long[] labelIndices,
                                       elki.data.NumberVector.Factory<V> factory)
        Constructor.
        Parameters:
        format - Input format
        labelIndices - Indexes of the columns that are not numeric
        factory - Vector factory
      • NumberVectorLabelParser

        public NumberVectorLabelParser​(elki.data.NumberVector.Factory<V> factory)
        Constructor with defaults.
        Parameters:
        factory - Vector factory
      • NumberVectorLabelParser

        public NumberVectorLabelParser​(java.util.regex.Pattern colSep,
                                       java.lang.String quoteChars,
                                       java.util.regex.Pattern comment,
                                       long[] labelIndices,
                                       elki.data.NumberVector.Factory<V> factory)
        Constructor.
        Parameters:
        colSep - Column separator
        quoteChars - Quote characters
        comment - Comment pattern
        labelIndices - Indexes of the columns that are not numeric
        factory - Vector factory
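
To make the roles of `colSep`, `quoteChars` and `comment` concrete, here is a rough, ELKI-independent sketch of how such parameters could act on a line. The class `CsvFormatSketch`, the tokenizing logic and the concrete patterns are assumptions chosen for illustration; the parser's actual tokenizer may behave differently, for example with respect to escaping.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration of the three format parameters; not ELKI code. */
class CsvFormatSketch {
  /** True if the whole line matches the comment pattern. */
  static boolean isComment(String line, Pattern comment) {
    return comment.matcher(line).matches();
  }

  /** Split a line at colSep matches, except inside quoted sections. */
  static List<String> tokenize(String line, Pattern colSep, String quoteChars) {
    List<String> tokens = new ArrayList<>();
    Matcher sep = colSep.matcher(line);
    StringBuilder current = new StringBuilder();
    char quote = 0; // 0 means "not inside quotes"
    int i = 0;
    while (i < line.length()) {
      char c = line.charAt(i);
      if (quote != 0) {
        if (c == quote) {
          quote = 0; // closing quote, not part of the token
        } else {
          current.append(c);
        }
        i++;
      } else if (quoteChars.indexOf(c) >= 0) {
        quote = c; // opening quote
        i++;
      } else if (sep.region(i, line.length()).lookingAt() && sep.end() > i) {
        tokens.add(current.toString()); // separator ends the current token
        current.setLength(0);
        i = sep.end();
      } else {
        current.append(c);
        i++;
      }
    }
    tokens.add(current.toString());
    return tokens;
  }

  public static void main(String[] args) {
    Pattern colSep = Pattern.compile("\\s*,\\s*");  // illustrative choice
    Pattern comment = Pattern.compile("^\\s*#.*$"); // illustrative choice
    String quoteChars = "\"'";
    System.out.println(isComment("# header", comment));
    System.out.println(tokenize("1.0, 2.5, \"New York, NY\"", colSep, quoteChars));
  }
}
```

Values of this kind correspond to the `colSep`, `quoteChars` and `comment` arguments of the five-argument constructor documented above.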
    • Method Detail

      • isLabelColumn

        protected boolean isLabelColumn​(int col)
        Test whether the given column is marked as a label column.
        Parameters:
        col - Column number
        Returns:
        true if the column is a label column.
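
Since `labelIndices` is declared as a `long[]`, one plausible reading is that it is a bit set packed into 64-bit words, with one bit per column. The sketch below shows a lookup under that assumption; `LabelMaskSketch` and `maskOf` are hypothetical names and this is not the library's own code.

```java
/** Hypothetical sketch: label column lookup in a bit set packed into long words. */
class LabelMaskSketch {
  /** Build a mask with one bit set per (non-negative) label column index. */
  static long[] maskOf(int... columns) {
    int max = 0;
    for (int c : columns) {
      max = Math.max(max, c);
    }
    long[] mask = new long[max / 64 + 1];
    for (int c : columns) {
      mask[c >>> 6] |= 1L << (c & 63); // word c/64, bit c%64
    }
    return mask;
  }

  /** True if the bit for the given column is set; false if there is no mask. */
  static boolean isLabelColumn(long[] mask, int col) {
    if (mask == null || col < 0) {
      return false;
    }
    int word = col >>> 6;
    return word < mask.length && (mask[word] & (1L << (col & 63))) != 0L;
  }

  public static void main(String[] args) {
    long[] mask = maskOf(0, 70);
    System.out.println(isLabelColumn(mask, 0));  // true
    System.out.println(isLabelColumn(mask, 1));  // false
    System.out.println(isLabelColumn(mask, 70)); // true
  }
}
```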
      • getMeta

        public elki.datasource.bundle.BundleMeta getMeta()
      • nextEvent

        public elki.datasource.bundle.BundleStreamSource.Event nextEvent()
      • buildMeta

        protected void buildMeta()
        Update the meta element.
      • data

        public java.lang.Object data​(int rnum)
      • parseLineInternal

        protected boolean parseLineInternal()
        Internal method for parsing a single line. Used by both line-based parsing and block parsing, so that the metadata does not have to be built for each line.
        Returns:
        true when a valid line was read, false on a label row.
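
The return value can be pictured with a small sketch. It assumes that a "label row" is a first line without any numeric entry, such as a header of column names, which is then kept as column names instead of being reported as a data row. `HeaderRowSketch` and its fields are hypothetical, and the exact condition used by the parser may differ.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the true/false contract; not the ELKI implementation. */
class HeaderRowSketch {
  List<String> columnNames;   // filled from a label row, if any
  double[] currentValues;     // numeric part of the last data row
  List<String> currentLabels; // label part of the last data row

  /** Returns true for a data row, false if the line was taken as a label row. */
  boolean parseLine(int lineNumber, String line) {
    List<Double> values = new ArrayList<>();
    List<String> labels = new ArrayList<>();
    for (String token : line.trim().split("\\s+")) {
      if (token.isEmpty()) {
        continue;
      }
      try {
        values.add(Double.parseDouble(token));
      } catch (NumberFormatException e) {
        labels.add(token);
      }
    }
    // Assumption: only the first line may be a label (header) row.
    if (lineNumber == 1 && values.isEmpty()) {
      columnNames = labels;
      currentValues = null;
      currentLabels = null;
      return false;
    }
    currentValues = values.stream().mapToDouble(Double::doubleValue).toArray();
    currentLabels = labels;
    return true;
  }

  public static void main(String[] args) {
    HeaderRowSketch parser = new HeaderRowSketch();
    System.out.println(parser.parseLine(1, "sepal_length sepal_width species"));
    System.out.println(parser.parseLine(2, "5.1 3.5 Iris-setosa"));
    System.out.println(parser.columnNames);
  }
}
```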
      • createVector

        protected V createVector()
        Creates a database object of type V.
        Returns:
        a vector of type V containing the given attribute values
      • getTypeInformation

        elki.data.type.SimpleTypeInformation<V> getTypeInformation​(int mindim,
                                                                   int maxdim)
        Get a prototype object for the given dimensionality.
        Parameters:
        mindim - Minimum dimensionality
        maxdim - Maximum dimensionality
        Returns:
        Prototype object
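
The `mindim` and `maxdim` parameters suggest that the parser tracks the smallest and largest dimensionality seen so far. The following sketch shows that bookkeeping and the distinction it allows, assuming that equal values indicate a fixed-dimensional data set and differing values a variable-dimensional one. `DimensionRangeSketch`, its initial values and the returned descriptions are hypothetical and do not reproduce the library's type information classes.

```java
/** Hypothetical sketch of min/max dimensionality tracking; not ELKI code. */
class DimensionRangeSketch {
  int mindim = Integer.MAX_VALUE; // assumed start value: nothing seen yet
  int maxdim = 0;

  /** Record the dimensionality of one parsed vector. */
  void observe(int dim) {
    mindim = Math.min(mindim, dim);
    maxdim = Math.max(maxdim, dim);
  }

  /** Describe what kind of type information would fit the observed range. */
  String describe() {
    if (mindim > maxdim) {
      return "no vectors seen";
    }
    return mindim == maxdim
        ? "fixed dimensionality " + mindim
        : "variable dimensionality " + mindim + ".." + maxdim;
  }

  public static void main(String[] args) {
    DimensionRangeSketch range = new DimensionRangeSketch();
    range.observe(4);
    range.observe(4);
    System.out.println(range.describe());
    range.observe(6);
    System.out.println(range.describe());
  }
}
```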