Training Data
===================

.. highlight:: cpp

In machine learning algorithms there is a notion of training data. Training data includes several components:

* A set of training samples. Each training sample is a vector of values (in computer vision it is sometimes referred to as a feature vector). Usually all the vectors have the same number of components (features); the OpenCV ml module assumes that. Each feature can be ordered (i.e. its values are floating-point numbers that can be compared with each other and sorted) or categorical (i.e. its value belongs to a fixed set of values, which can be integers, strings etc.).

* An optional set of responses corresponding to the samples. Training data with no responses is used in unsupervised learning algorithms, which learn the structure of the supplied data based on distances between samples. Training data with responses is used in supervised learning algorithms, which learn the function mapping samples to responses. Usually the responses are scalar values, ordered (in a regression problem) or categorical (in a classification problem; in this case the responses are often called "labels"). Some algorithms, most notably neural networks, can handle not only scalar but also multi-dimensional (vector) responses.

* Another optional component is the mask of missing measurements. Most algorithms require all the components of all the training samples to be valid, but some algorithms, such as decision trees, can handle missing measurements.

* In a classification problem the user may want to give different weights to different classes. This is useful, for example, when:

  * the user wants to shift prediction accuracy towards a lower false-alarm rate or a higher hit rate.
  * the user wants to compensate for significantly different numbers of training samples from different classes.

* In addition, each training sample may be given a weight if the user wants the algorithm to pay special attention to certain training samples and adjust the trained model accordingly.

* Also, the user may wish not to use the whole training data at once, but rather parts of it, e.g. to do parameter optimization via a cross-validation procedure.
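
One common way to compensate for unequal class sizes is to weight each class inversely to its frequency. The sketch below illustrates only that idea: ``balancedClassWeights`` is a hypothetical helper, not part of OpenCV, and the formula ``N / (K * n_c)`` (N samples, K classes, n_c samples in class c) is an assumption chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not an OpenCV function): returns one weight per class,
// inversely proportional to the class frequency, so that every non-empty
// class contributes the same total weight. Labels must lie in [0, numClasses).
std::vector<double> balancedClassWeights(const std::vector<int>& labels, int numClasses)
{
    assert(numClasses > 0);
    const std::size_t k = static_cast<std::size_t>(numClasses);

    // Count how many samples fall into each class.
    std::vector<std::size_t> counts(k, 0);
    for (int label : labels)
    {
        assert(label >= 0 && label < numClasses);
        ++counts[static_cast<std::size_t>(label)];
    }

    // weight_c = N / (K * n_c); empty classes keep a weight of 0.
    std::vector<double> weights(k, 0.0);
    for (std::size_t c = 0; c < k; ++c)
        if (counts[c] > 0)
            weights[c] = static_cast<double>(labels.size()) /
                         (static_cast<double>(k) * static_cast<double>(counts[c]));
    return weights;
}
```

With labels ``{0, 0, 0, 1}`` and two classes, the rare class 1 receives three times the weight of class 0.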

As you can see, training data can have a rather complex structure; besides, it may be very big and/or not entirely available, so there is a need for an abstraction of this concept. In OpenCV ml the ``cv::ml::TrainData`` class serves that purpose.

TrainData
---------
.. ocv:class:: TrainData

Class encapsulating training data. Please note that the class only specifies the interface of training data, not the implementation. All the statistical model classes in ml take ``Ptr<TrainData>``. In other words, you can create your own class derived from ``TrainData`` and pass a smart pointer to an instance of this class to ``StatModel::train``.

TrainData::loadFromCSV
----------------------
Reads the dataset from a .csv file and returns the ready-to-use training data.

.. ocv:function:: Ptr<TrainData> loadFromCSV(const String& filename, int headerLineCount, int responseStartIdx=-1, int responseEndIdx=-1, const String& varTypeSpec=String(), char delimiter=',', char missch='?')

    :param filename: The input file name

    :param headerLineCount: The number of lines at the beginning to skip; besides the header, the function also skips empty lines and lines starting with '#'.

    :param responseStartIdx: Index of the first output variable. If -1, the function considers the last variable as the response

    :param responseEndIdx: Index of the last output variable + 1. If -1, then there is a single response variable at ``responseStartIdx``.

    :param varTypeSpec: The optional text string that specifies the variables' types. It has the format ``ord[n1-n2,n3,n4-n5,...]cat[n6,n7-n8,...]``. That is, variables from n1 to n2 (inclusive range), n3, n4 to n5 ... are considered ordered and n6, n7 to n8 ... are considered categorical. The ranges [n1..n2] + [n3] + [n4..n5] + ... + [n6] + [n7..n8] should cover all the variables. If ``varTypeSpec`` is not specified, the function uses the following rules:

        1. All input variables are considered ordered by default. If some column contains non-numerical values, e.g. 'apple', 'pear', 'apple', 'apple', 'mango', the corresponding variable is considered categorical.
        2. If there are several output variables, they are all considered ordered. An error is reported when non-numerical values are used.
        3. If there is a single output variable, it is considered categorical if its values are non-numerical or are all integers. Otherwise, it is considered ordered.

    :param delimiter: The character used to separate values in each line.

    :param missch: The character used to specify missing measurements. It should not be a digit. Although it is a non-numerical value, it does not affect the decision of whether a variable is ordered or categorical.
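
To make the ``varTypeSpec`` format concrete, the following sketch expands a spec string into one type per variable. It is only an illustration of the format described above, not the parser used by ``loadFromCSV``; the names ``VarKind`` and ``expandVarTypeSpec`` are hypothetical, and rejecting a variable that is listed twice is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

enum VarKind { UNSET, ORDERED, CATEGORICAL };  // hypothetical names for this sketch

// Expands a spec such as "ord[0-2,4]cat[3,5-6]" into one kind per variable.
// Throws std::invalid_argument if the spec is malformed, mentions a variable
// twice, or does not cover all numVars variables.
std::vector<VarKind> expandVarTypeSpec(const std::string& spec, std::size_t numVars)
{
    std::vector<VarKind> kinds(numVars, UNSET);
    std::size_t pos = 0;
    while (pos < spec.size())
    {
        VarKind kind = UNSET;
        if (spec.compare(pos, 4, "ord[") == 0) kind = ORDERED;
        else if (spec.compare(pos, 4, "cat[") == 0) kind = CATEGORICAL;
        else throw std::invalid_argument("expected 'ord[' or 'cat['");
        pos += 4;

        const std::size_t close = spec.find(']', pos);
        if (close == std::string::npos) throw std::invalid_argument("missing ']'");

        // Walk the comma-separated items ("a" or "a-b") between the brackets.
        while (pos < close)
        {
            std::size_t used = 0;
            const std::size_t first = std::stoul(spec.substr(pos, close - pos), &used);
            pos += used;
            std::size_t last = first;
            if (pos < close && spec[pos] == '-')
            {
                ++pos;
                last = std::stoul(spec.substr(pos, close - pos), &used);
                pos += used;
            }
            if (last < first || last >= numVars) throw std::invalid_argument("bad range");
            for (std::size_t i = first; i <= last; ++i)
            {
                if (kinds[i] != UNSET) throw std::invalid_argument("variable listed twice");
                kinds[i] = kind;
            }
            if (pos < close)
            {
                if (spec[pos] != ',') throw std::invalid_argument("expected ','");
                ++pos;
            }
        }
        assert(pos == close);
        pos = close + 1;
    }
    for (VarKind k : kinds)
        if (k == UNSET) throw std::invalid_argument("spec does not cover all variables");
    return kinds;
}
```

For seven variables, ``"ord[0-2,4]cat[3,5-6]"`` marks variables 0, 1, 2 and 4 as ordered and variables 3, 5 and 6 as categorical.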

TrainData::create
-----------------
Creates training data from in-memory arrays.

.. ocv:function:: Ptr<TrainData> create(InputArray samples, int layout, InputArray responses, InputArray varIdx=noArray(), InputArray sampleIdx=noArray(), InputArray sampleWeights=noArray(), InputArray varType=noArray())

    :param samples: matrix of samples. It should have ``CV_32F`` type.

    :param layout: it's either ``ROW_SAMPLE``, which means that each training sample is a row of ``samples``, or ``COL_SAMPLE``, which means that each training sample occupies a column of ``samples``.

    :param responses: matrix of responses. If the responses are scalar, they should be stored as a single row or as a single column. The matrix should have type ``CV_32F`` or ``CV_32S`` (in the former case the responses are considered ordered by default; in the latter case, categorical).

    :param varIdx: vector specifying which variables to use for training. It can be an integer vector (``CV_32S``) containing 0-based variable indices or a byte vector (``CV_8U``) containing a mask of active variables.

    :param sampleIdx: vector specifying which samples to use for training. It can be an integer vector (``CV_32S``) containing 0-based sample indices or a byte vector (``CV_8U``) containing a mask of training samples.

    :param sampleWeights: optional vector with weights for each sample. It should have ``CV_32F`` type.

    :param varType: optional vector of type ``CV_8U`` and size <number_of_variables_in_samples> + <number_of_variables_in_responses>, containing the type of each input and output variable. Ordered variables are denoted by the value ``VAR_ORDERED`` and categorical ones by ``VAR_CATEGORICAL``.
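
``varIdx`` and ``sampleIdx`` each accept two forms: a list of 0-based indices or a byte mask. The sketch below shows how the two forms describe the same selection. ``maskToIndices`` and ``indicesToMask`` are hypothetical helpers written on plain ``std::vector`` for illustration; they are not OpenCV functions, and treating any non-zero mask byte as "selected" is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helpers for illustration; they are not OpenCV functions.

// Byte mask (non-zero = selected) -> ascending list of 0-based indices.
std::vector<int> maskToIndices(const std::vector<unsigned char>& mask)
{
    std::vector<int> indices;
    for (std::size_t i = 0; i < mask.size(); ++i)
        if (mask[i] != 0)
            indices.push_back(static_cast<int>(i));
    return indices;
}

// List of 0-based indices -> byte mask of length total.
std::vector<unsigned char> indicesToMask(const std::vector<int>& indices, std::size_t total)
{
    std::vector<unsigned char> mask(total, 0);
    for (int idx : indices)
    {
        assert(idx >= 0 && static_cast<std::size_t>(idx) < total);
        mask[static_cast<std::size_t>(idx)] = 1;
    }
    return mask;
}
```

For five samples, the mask ``{1, 0, 0, 1, 1}`` and the index list ``{0, 3, 4}`` select the same three samples.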


TrainData::getTrainSamples
--------------------------
Returns the matrix of training samples.

.. ocv:function:: Mat TrainData::getTrainSamples(int layout=ROW_SAMPLE, bool compressSamples=true, bool compressVars=true) const

    :param layout: The requested layout. If it's different from the initial one, the matrix is transposed.

    :param compressSamples: if true, the function returns only the training samples (specified by ``sampleIdx``).

    :param compressVars: if true, the function returns the shorter training samples, containing only the active variables.

In the current implementation the function tries to avoid physical data copying and returns the matrix stored inside ``TrainData`` (unless transposition or compression is needed).
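
The effect of the two compression flags can be pictured as selecting rows and columns of the sample matrix. The sketch below shows that selection on a plain row-major matrix. It is a conceptual illustration only: ``Matrix`` and ``selectSubmatrix`` are hypothetical names, the code always copies, and it says nothing about how ``TrainData`` stores or returns its data.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One inner vector per sample, i.e. a row-sample layout (hypothetical type).
using Matrix = std::vector<std::vector<float>>;

// Keeps only the samples listed in sampleIdx and, within each of them, only
// the variables listed in varIdx. Both lists hold 0-based indices.
Matrix selectSubmatrix(const Matrix& samples,
                       const std::vector<int>& sampleIdx,
                       const std::vector<int>& varIdx)
{
    Matrix out;
    out.reserve(sampleIdx.size());
    for (int s : sampleIdx)
    {
        assert(s >= 0 && static_cast<std::size_t>(s) < samples.size());
        const std::vector<float>& src = samples[static_cast<std::size_t>(s)];

        std::vector<float> row;
        row.reserve(varIdx.size());
        for (int v : varIdx)
        {
            assert(v >= 0 && static_cast<std::size_t>(v) < src.size());
            row.push_back(src[static_cast<std::size_t>(v)]);
        }
        out.push_back(row);
    }
    return out;
}
```

Passing every sample index while restricting ``varIdx`` corresponds to compressing variables only, and vice versa.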


TrainData::getTrainResponses
----------------------------
Returns the vector of responses

.. ocv:function:: Mat TrainData::getTrainResponses() const

The function returns ordered or the original categorical responses. Usually it's used in regression algorithms.


TrainData::getClassLabels
----------------------------
Returns the vector of class labels

.. ocv:function:: Mat TrainData::getClassLabels() const

The function returns the vector of unique labels that occur in the responses.


TrainData::getTrainNormCatResponses
-----------------------------------
Returns the vector of normalized categorical responses

.. ocv:function:: Mat TrainData::getTrainNormCatResponses() const

The function returns a vector of responses. Each response is an integer from 0 to <number of classes>-1. The actual label value can then be retrieved from the class label vector, see ``TrainData::getClassLabels``.
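
The relation between the original labels, the class label vector and the normalized responses can be sketched on plain vectors as follows. ``normalizeResponses`` and ``NormalizedResponses`` are hypothetical names, not OpenCV API, and storing the class labels in ascending order is an assumption of this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical result type for this sketch.
struct NormalizedResponses
{
    std::vector<int> classLabels;  // unique labels, ascending (assumed order)
    std::vector<int> normalized;   // index into classLabels for each response
};

// Maps arbitrary integer labels to the range 0 .. <number of classes>-1.
NormalizedResponses normalizeResponses(const std::vector<int>& responses)
{
    NormalizedResponses result;

    // Collect the unique labels.
    result.classLabels = responses;
    std::sort(result.classLabels.begin(), result.classLabels.end());
    result.classLabels.erase(
        std::unique(result.classLabels.begin(), result.classLabels.end()),
        result.classLabels.end());

    // Replace each label by its position in the sorted label vector.
    result.normalized.reserve(responses.size());
    for (int label : responses)
    {
        auto it = std::lower_bound(result.classLabels.begin(),
                                   result.classLabels.end(), label);
        assert(it != result.classLabels.end() && *it == label);
        result.normalized.push_back(static_cast<int>(it - result.classLabels.begin()));
    }
    return result;
}
```

In this sketch ``classLabels[normalized[i]]`` always equals the original ``responses[i]``, which is the lookup the paragraph above describes.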

TrainData::setTrainTestSplitRatio
-----------------------------------
Splits the training data into the training and test parts

.. ocv:function:: void TrainData::setTrainTestSplitRatio(double ratio, bool shuffle=true)

The function selects a subset of the specified relative size and then uses it as the training set. If the function is not called, all the data is used for training. Please note that for each ``TrainData::getTrain*`` method there is a corresponding ``TrainData::getTest*`` method, so that the test subset can be retrieved and processed as well.
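
A ratio-based split can be sketched as a partition of the sample indices. The code below is a conceptual illustration, not the OpenCV implementation: ``splitIndices``, ``Split`` and the ``seed`` parameter are hypothetical, and rounding ``ratio * numSamples`` to the nearest integer is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical result type: 0-based sample indices of the two parts.
struct Split
{
    std::vector<int> train;
    std::vector<int> test;
};

// Puts roughly ratio * numSamples indices into the training part and the rest
// into the test part. With shuffle == true the indices are permuted first.
Split splitIndices(std::size_t numSamples, double ratio, bool shuffle, unsigned seed = 0)
{
    assert(ratio >= 0.0 && ratio <= 1.0);

    std::vector<int> order(numSamples);
    std::iota(order.begin(), order.end(), 0);  // 0, 1, ..., numSamples-1
    if (shuffle)
    {
        std::mt19937 rng(seed);
        std::shuffle(order.begin(), order.end(), rng);
    }

    // Assumed rounding rule: nearest integer.
    const std::ptrdiff_t numTrain =
        static_cast<std::ptrdiff_t>(std::lround(ratio * static_cast<double>(numSamples)));

    Split split;
    split.train.assign(order.begin(), order.begin() + numTrain);
    split.test.assign(order.begin() + numTrain, order.end());
    return split;
}
```

With 10 samples and a ratio of 0.8, the sketch yields 8 training indices and 2 test indices; every sample lands in exactly one of the two parts.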


Other methods
-------------
The class includes many other methods that can be used to access normalized categorical input variables, to access training data by parts so that it does not have to fit into memory, etc.