MLData
===================

.. highlight:: cpp

For the machine learning algorithms, the data set is often stored in a file of a ``.csv``-like format. The file contains a table of predictor and response values where each row of the table corresponds to a sample. Missing values are supported. The UC Irvine Machine Learning Repository (http://archive.ics.uci.edu/ml/) provides many data sets stored in such a format to the machine learning community. The class ``CvMLData`` is implemented to easily load the data for training one of the OpenCV machine learning algorithms. For float values, only the ``'.'`` decimal separator is supported. The table can have a header; in that case, the user has to set the number of header lines so that they are skipped while the file is read (see :ocv:func:`CvMLData::set_header_lines_number`).

CvMLData
--------
.. ocv:class:: CvMLData

Class for loading the data from a ``.csv`` file.
::

    class CV_EXPORTS CvMLData
    {
    public:
        CvMLData();
        virtual ~CvMLData();

        int read_csv(const char* filename);

        const CvMat* get_values() const;
        const CvMat* get_responses();
        const CvMat* get_missing() const;

        void set_response_idx( int idx );
        int get_response_idx() const;

        void set_header_lines_number( int n );
        int get_header_lines_number() const;

        void set_train_test_split( const CvTrainTestSplit * spl);
        const CvMat* get_train_sample_idx() const;
        const CvMat* get_test_sample_idx() const;
        void mix_train_and_test_idx();

        const CvMat* get_var_idx();
        void change_var_idx( int vi, bool state );

        const CvMat* get_var_types();
        void set_var_types( const char* str );

        int get_var_type( int var_idx ) const;
        void change_var_type( int var_idx, int type);

        void set_delimiter( char ch );
        char get_delimiter() const;

        void set_miss_ch( char ch );
        char get_miss_ch() const;

        const std::map<String, int>& get_class_labels_map() const;

    protected:
        ...
    };

CvMLData::read_csv
------------------
Reads the data set from a ``.csv``-like ``filename`` file and stores all read values in a matrix.

.. ocv:function:: int CvMLData::read_csv(const char* filename)

    :param filename: The input file name.

The method returns ``0`` on success and ``-1`` if the file cannot be opened or is not valid. While reading the data, the method tries to determine the type of each variable (predictor or response): ordered or categorical. If any value of a variable is not numerical (other than the label for a missing value), the type of the variable is set to ``CV_VAR_CATEGORICAL``. If all existing values of the variable are numerical, the type is set to ``CV_VAR_ORDERED``. This default detection works correctly in all cases except one: a categorical variable with numerical class labels is detected as ``CV_VAR_ORDERED``. In this case, change the type to ``CV_VAR_CATEGORICAL`` using the method :ocv:func:`CvMLData::change_var_type`. For categorical variables, a common map is built to convert string class labels to numerical class labels. Use :ocv:func:`CvMLData::get_class_labels_map` to obtain this map.

Also, when reading the data, the method constructs the mask of missing values, that is, of the values equal to the missing-value character (``'?'`` by default).
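
The type-detection rule described above can be illustrated with a small standalone sketch. It does not use OpenCV and is not the library's implementation; ``VarType``, ``is_numeric`` and ``infer_var_type`` are hypothetical names, and ``std::strtod`` is only assumed to be a reasonable stand-in for the numeric check.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical stand-ins for CV_VAR_ORDERED / CV_VAR_CATEGORICAL.
enum VarType { VAR_ORDERED, VAR_CATEGORICAL };

// True if the whole token parses as a floating-point number.
static bool is_numeric(const std::string& token)
{
    if (token.empty())
        return false;
    char* end = 0;
    std::strtod(token.c_str(), &end);
    return end == token.c_str() + token.size();
}

// Infers the type of one column: categorical as soon as a non-missing
// value is not numerical, ordered otherwise.
VarType infer_var_type(const std::vector<std::string>& column, char miss_ch = '?')
{
    for (std::size_t i = 0; i < column.size(); ++i)
    {
        const std::string& token = column[i];
        const bool is_missing = token.size() == 1 && token[0] == miss_ch;
        if (!is_missing && !is_numeric(token))
            return VAR_CATEGORICAL;
    }
    return VAR_ORDERED;
}
```

Note that a column of numerical class labels such as ``0`` and ``1`` comes out as ordered in this sketch, which is exactly the case where the type has to be changed by hand.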

CvMLData::get_values
--------------------
Returns a pointer to the matrix of predictor and response values.

.. ocv:function:: const CvMat* CvMLData::get_values() const

The method returns a pointer to the matrix of predictor and response ``values`` or ``0`` if the data has not been loaded from the file yet.

The row count of this matrix equals the sample count. The column count equals the number of predictors, plus one for the response if it exists. Each row of the matrix therefore contains the predictor values and the response value of one sample. The matrix type is ``CV_32FC1``.

CvMLData::get_responses
-----------------------
Returns a pointer to the matrix of response values.

.. ocv:function:: const CvMat* CvMLData::get_responses()

The method returns a pointer to the matrix of response values or throws an exception if the data has not been loaded from the file yet.

This is a single-column matrix of the type ``CV_32FC1``. Its row count is equal to the sample count.

CvMLData::get_missing
---------------------
Returns a pointer to the mask matrix of missing values.

.. ocv:function:: const CvMat* CvMLData::get_missing() const

The method returns a pointer to the mask matrix of missing values or throws an exception if the data has not been loaded from the file yet.

This matrix has the same size as the ``values`` matrix (see :ocv:func:`CvMLData::get_values`) and the type ``CV_8UC1``.
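
As a rough illustration of how such a mask lines up with the table, the following standalone sketch builds one byte per table cell. It does not use OpenCV; ``Table`` and ``build_missing_mask`` are hypothetical names, and marking a missing cell with ``1`` is an assumption made for the example.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Rows of raw text tokens, as they would appear in the file.
typedef std::vector<std::vector<std::string> > Table;

// Returns a row-major mask with one byte per table cell:
// 1 where the cell holds the missing-value label, 0 elsewhere.
std::vector<unsigned char> build_missing_mask(const Table& table, char miss_ch = '?')
{
    std::vector<unsigned char> mask;
    for (std::size_t r = 0; r < table.size(); ++r)
        for (std::size_t c = 0; c < table[r].size(); ++c)
        {
            const std::string& cell = table[r][c];
            mask.push_back(cell.size() == 1 && cell[0] == miss_ch ? 1 : 0);
        }
    return mask;
}
```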

CvMLData::set_response_idx
--------------------------
Specifies the index of the response column in the data matrix.

.. ocv:function:: void CvMLData::set_response_idx( int idx )

The method sets the index of a response column in the ``values`` matrix (see :ocv:func:`CvMLData::get_values`) or throws an exception if the data has not been loaded from the file yet.

The old response column becomes a predictor. If ``idx < 0``, there is no response.

CvMLData::get_response_idx
--------------------------
Returns the index of the response column in the loaded data matrix.

.. ocv:function:: int CvMLData::get_response_idx() const

The method returns the index of the response column in the ``values`` matrix (see :ocv:func:`CvMLData::get_values`) or throws an exception if the data has not been loaded from the file yet.

If the returned index is negative, there is no response.

CvMLData::set_train_test_split
------------------------------
Divides the read data set into two disjoint training and test subsets.

.. ocv:function:: void CvMLData::set_train_test_split( const CvTrainTestSplit * spl )

This method sets parameters for such a split using ``spl`` (see :ocv:class:`CvTrainTestSplit`) or throws an exception if the data has not been loaded from the file yet.

CvMLData::get_train_sample_idx
------------------------------
Returns the matrix of sample indices for the training subset.

.. ocv:function:: const CvMat* CvMLData::get_train_sample_idx() const

The method returns the matrix of sample indices for the training subset. This is a single-row matrix of the type ``CV_32SC1``. If the data split is not set, the method returns ``0``. If the data has not been loaded from the file yet, an exception is thrown.

CvMLData::get_test_sample_idx
-----------------------------
Returns the matrix of sample indices for the test subset.

.. ocv:function:: const CvMat* CvMLData::get_test_sample_idx() const

CvMLData::mix_train_and_test_idx
--------------------------------
Mixes the indices of training and test samples.

.. ocv:function:: void CvMLData::mix_train_and_test_idx()

The method shuffles the indices of training and test samples, preserving the sizes of the training and test subsets, if the data split has been set by :ocv:func:`CvMLData::set_train_test_split`. If the data has not been loaded from the file yet, an exception is thrown.
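
One way to picture a size-preserving shuffle is the standalone sketch below. It is not OpenCV's implementation: ``mix_indices`` is a hypothetical helper, and the use of ``std::mt19937`` with an explicit seed is an assumption made so that the example is reproducible.

```cpp
#include <algorithm>
#include <random>
#include <vector>

// Shuffles the union of both index sets, then refills the two vectors
// so that each keeps its original size.
void mix_indices(std::vector<int>& train_idx, std::vector<int>& test_idx, unsigned seed)
{
    std::vector<int> all(train_idx);
    all.insert(all.end(), test_idx.begin(), test_idx.end());

    std::mt19937 rng(seed);
    std::shuffle(all.begin(), all.end(), rng);

    std::copy(all.begin(), all.begin() + train_idx.size(), train_idx.begin());
    std::copy(all.begin() + train_idx.size(), all.end(), test_idx.begin());
}
```

After the call, samples may have moved between the two subsets, but the subsets stay disjoint and their sizes are unchanged.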

CvMLData::get_var_idx
---------------------
Returns the indices of the active variables in the data matrix.

.. ocv:function:: const CvMat* CvMLData::get_var_idx()

The method returns the indices of variables (columns) used in the ``values`` matrix (see :ocv:func:`CvMLData::get_values`).

It returns ``0`` if the used subset is not set. It throws an exception if the data has not been loaded from the file yet. The returned matrix is a single-row matrix of the type ``CV_32SC1``. Its column count is equal to the size of the used variable subset.

CvMLData::change_var_idx
------------------------
Enables or disables a particular variable in the loaded data.

.. ocv:function:: void CvMLData::change_var_idx( int vi, bool state )

By default, after reading the data set, all variables in the ``values`` matrix (see :ocv:func:`CvMLData::get_values`) are used. To use only a subset of the variables, include a variable with the index ``vi`` in the used subset (``state`` is ``true``) or exclude it (``state`` is ``false``). If the data has not been loaded from the file yet, an exception is thrown.
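
The bookkeeping behind such a used subset can be sketched without OpenCV as follows. ``VarSubset`` and its members are hypothetical names chosen for illustration only; the sketch keeps one flag per column and derives the index list from the flags.

```cpp
#include <cstddef>
#include <vector>

struct VarSubset
{
    std::vector<bool> used; // one flag per column of the values matrix

    // All variables are used by default.
    explicit VarSubset(std::size_t var_count) : used(var_count, true) {}

    // Plays the role of change_var_idx(vi, state) in this sketch.
    void change(std::size_t vi, bool state) { used.at(vi) = state; }

    // Indices of the variables currently in the used subset.
    std::vector<int> indices() const
    {
        std::vector<int> idx;
        for (std::size_t i = 0; i < used.size(); ++i)
            if (used[i])
                idx.push_back(static_cast<int>(i));
        return idx;
    }
};
```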

CvMLData::get_var_types
-----------------------
Returns a matrix of the variable types.

.. ocv:function:: const CvMat* CvMLData::get_var_types()

The function returns a single-row matrix of the type ``CV_8UC1``, where each element is set to either ``CV_VAR_ORDERED`` or ``CV_VAR_CATEGORICAL``. The number of columns is equal to the number of variables. If the data has not been loaded from the file yet, an exception is thrown.

CvMLData::set_var_types
-----------------------
Sets the variable types in the loaded data.

.. ocv:function:: void CvMLData::set_var_types( const char* str )

In the string, a variable type is followed by a list of variable indices. For example: ``"ord[0-17],cat[18]"``, ``"ord[0,2,4,10-12], cat[1,3,5-9,13,14]"``, ``"cat"`` (all variables are categorical), ``"ord"`` (all variables are ordered).
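
To make the string format concrete, here is a minimal standalone parser for specifications of this shape. It is a sketch, not the OpenCV parser: ``VarType`` and ``apply_var_types`` are hypothetical names, and the error handling (throwing ``std::invalid_argument`` or ``std::out_of_range``) is an assumption of the example.

```cpp
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-ins for CV_VAR_ORDERED / CV_VAR_CATEGORICAL.
enum VarType { VAR_ORDERED, VAR_CATEGORICAL };

// Applies a specification such as "ord[0-17],cat[18]" to `types`,
// which holds one entry per variable.
void apply_var_types(const std::string& spec, std::vector<VarType>& types)
{
    std::size_t pos = 0;
    const std::size_t n = spec.size();
    while (pos < n)
    {
        // Skip commas and spaces between groups.
        if (spec[pos] == ',' || std::isspace(static_cast<unsigned char>(spec[pos])))
        {
            ++pos;
            continue;
        }

        VarType type;
        if (spec.compare(pos, 3, "ord") == 0)
            type = VAR_ORDERED;
        else if (spec.compare(pos, 3, "cat") == 0)
            type = VAR_CATEGORICAL;
        else
            throw std::invalid_argument("expected \"ord\" or \"cat\"");
        pos += 3;

        // No index list: the type applies to every variable.
        if (pos >= n || spec[pos] != '[')
        {
            types.assign(types.size(), type);
            continue;
        }

        const std::size_t close = spec.find(']', pos);
        if (close == std::string::npos)
            throw std::invalid_argument("missing ']'");
        ++pos; // skip '['
        while (pos < close)
        {
            // Each item is either a single index "a" or a range "a-b".
            std::size_t used = 0;
            const int first = std::stoi(spec.substr(pos, close - pos), &used);
            pos += used;
            int last = first;
            if (pos < close && spec[pos] == '-')
            {
                ++pos;
                last = std::stoi(spec.substr(pos, close - pos), &used);
                pos += used;
            }
            for (int i = first; i <= last; ++i)
                types.at(i) = type; // throws if the index is out of range
            if (pos < close && spec[pos] == ',')
                ++pos;
        }
        pos = close + 1;
    }
}
```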

CvMLData::get_header_lines_number
---------------------------------
Returns the number of table header lines.

.. ocv:function:: int CvMLData::get_header_lines_number() const

CvMLData::set_header_lines_number
---------------------------------
Sets the number of table header lines.

.. ocv:function:: void CvMLData::set_header_lines_number( int n )

By default, the table is assumed to have no header, that is, to contain only data.

CvMLData::get_var_type
----------------------
Returns the type of the specified variable.

.. ocv:function:: int CvMLData::get_var_type( int var_idx ) const

The method returns the type of the variable with the index ``var_idx`` (``CV_VAR_ORDERED`` or ``CV_VAR_CATEGORICAL``).

CvMLData::change_var_type
-------------------------
Changes the type of the specified variable.

.. ocv:function:: void CvMLData::change_var_type( int var_idx, int type)

The method changes the type of the variable with the index ``var_idx`` from its existing type to ``type`` (``CV_VAR_ORDERED`` or ``CV_VAR_CATEGORICAL``).

CvMLData::set_delimiter
-----------------------
Sets the delimiter used in the file to separate input values.

.. ocv:function:: void CvMLData::set_delimiter( char ch )

The method sets the delimiter for variables in a file. For example: ``','`` (default), ``';'``, ``' '`` (space), or other characters. The floating-point separator ``'.'`` is not allowed.
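
The role of the delimiter can be shown with a small standalone sketch that cuts one line of a file into fields. It does not use OpenCV; ``split_line`` is a hypothetical helper, and details such as how empty fields are treated are assumptions of this example rather than documented library behavior.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Splits one line of text into fields at every occurrence of `delimiter`.
std::vector<std::string> split_line(const std::string& line, char delimiter = ',')
{
    std::vector<std::string> fields;
    std::string field;
    std::istringstream stream(line);
    while (std::getline(stream, field, delimiter))
        fields.push_back(field);
    return fields;
}
```

A line such as ``5.1;3.5;?;setosa`` would be split with ``split_line(line, ';')``. This also shows why ``'.'`` cannot serve as a delimiter: it would cut every floating-point value in two.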

CvMLData::get_delimiter
-----------------------
Returns the currently used delimiter character.

.. ocv:function:: char CvMLData::get_delimiter() const


CvMLData::set_miss_ch
---------------------
Sets the character used to specify missing values.

.. ocv:function:: void CvMLData::set_miss_ch( char ch )

The method sets the character used to specify missing values. For example: ``'?'`` (default), ``'-'``. The floating-point separator ``'.'`` is not allowed.

CvMLData::get_miss_ch
---------------------
Returns the currently used missing value character.

.. ocv:function:: char CvMLData::get_miss_ch() const

CvMLData::get_class_labels_map
-------------------------------
Returns a map that converts strings to labels.

.. ocv:function:: const std::map<String, int>& CvMLData::get_class_labels_map() const

The method returns a map that converts string class labels to numerical class labels. It can be used to recover the original class label as it appears in the file.
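
Recovering the original label from a numerical one means searching the map by value. The standalone sketch below shows one way to do that; ``label_for_value`` is a hypothetical helper, and ``std::string`` is assumed as the key type in place of ``String``.

```cpp
#include <map>
#include <string>

// Looks up the string label mapped to `value`; returns an empty
// string if no label maps to it.
std::string label_for_value(const std::map<std::string, int>& labels, int value)
{
    for (std::map<std::string, int>::const_iterator it = labels.begin();
         it != labels.end(); ++it)
    {
        if (it->second == value)
            return it->first;
    }
    return std::string();
}
```

The search is linear in the number of labels, which is usually acceptable for class labels; for repeated lookups an inverted map could be built once instead.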

CvTrainTestSplit
----------------
.. ocv:struct:: CvTrainTestSplit

Structure setting the split of a data set read by :ocv:class:`CvMLData`.
::

    struct CvTrainTestSplit
    {
        CvTrainTestSplit();
        CvTrainTestSplit( int train_sample_count, bool mix = true);
        CvTrainTestSplit( float train_sample_portion, bool mix = true);

        union
        {
            int count;
            float portion;
        } train_sample_part;
        int train_sample_part_mode;

        bool mix;
    };

There are two ways to construct a split:

* Set the training sample count (subset size) ``train_sample_count``. The other existing samples form the test subset.

* Set the training sample portion ``train_sample_portion`` in ``[0, 1]``.

The flag ``mix`` is used to mix training and test sample indices when the split is set. Otherwise, the data set is split in the storage order: the first part of the samples, of the given size, is the training subset, and the remaining samples are the test subset.
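
The two ways of constructing a split can be sketched in standalone C++ as follows. This is an illustration under stated assumptions, not OpenCV's code: ``TrainTest``, ``split_by_count`` and ``split_by_portion`` are hypothetical names, the seeded ``std::mt19937`` is chosen for reproducibility, and rounding the portion down to a whole sample count is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// (training indices, test indices)
typedef std::pair<std::vector<int>, std::vector<int> > TrainTest;

// Splits sample indices 0..sample_count-1 by an absolute training count.
// Without `mix`, the first `train_count` samples in storage order are
// the training subset and the rest are the test subset.
TrainTest split_by_count(int sample_count, int train_count, bool mix, unsigned seed = 0)
{
    assert(0 <= train_count && train_count <= sample_count);
    std::vector<int> idx(sample_count);
    std::iota(idx.begin(), idx.end(), 0);
    if (mix)
    {
        std::mt19937 rng(seed);
        std::shuffle(idx.begin(), idx.end(), rng);
    }
    return TrainTest(std::vector<int>(idx.begin(), idx.begin() + train_count),
                     std::vector<int>(idx.begin() + train_count, idx.end()));
}

// Splits by a training portion in [0, 1]; the resulting count is
// rounded down here (an assumption of this sketch).
TrainTest split_by_portion(int sample_count, float portion, bool mix, unsigned seed = 0)
{
    assert(0.0f <= portion && portion <= 1.0f);
    const int train_count = static_cast<int>(portion * sample_count);
    return split_by_count(sample_count, train_count, mix, seed);
}
```

For ten samples, ``split_by_count(10, 7, false)`` puts samples 0 to 6 in the training subset and 7 to 9 in the test subset, while ``split_by_portion(10, 0.5f, true, 1u)`` yields two shuffled subsets of five samples each.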