This lightweight class encapsulates pitched memory on a GPU and is passed to nvcc-compiled code (CUDA kernels). Typically, it is used internally by OpenCV and by users who write device code. You can call its members from both host and device code. ::
    template <typename T> struct DevMem2D_
    {
        ...
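For illustration, here is a minimal sketch of device code that consumes this structure (assuming nvcc compilation against the OpenCV 2.x ``gpu`` headers; the kernel name and indexing scheme are illustrative): ::

    // Scale every pixel of a pitched single-channel float image in place.
    __global__ void scaleKernel(cv::gpu::DevMem2D_<float> img, float factor)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < img.cols && y < img.rows)
            img.ptr(y)[x] *= factor; // ptr(y) accounts for the row pitch (step)
    }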
This is a base storage class for GPU memory with reference counting. Its interface matches the :c:type:`Mat` interface with several limitations, including:

*
    no expression templates technique support
Beware that the latter limitation may lead to overloaded matrix operators that cause memory allocations. The ``GpuMat`` class is convertible to :cpp:class:`gpu::DevMem2D_` and :cpp:class:`gpu::PtrStep_`, so it can be passed directly to a kernel.
**Note:**
In contrast with :c:type:`Mat`, in most cases ``GpuMat::isContinuous() == false``. This means that rows are aligned to a size depending on the hardware. A single-row ``GpuMat`` is always a continuous matrix. ::
    class CV_EXPORTS GpuMat
    {
        ...
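A minimal host-side usage sketch (assuming the module is built with CUDA support; the image file name is a placeholder): ::

    #include <opencv2/opencv.hpp>
    #include <opencv2/gpu/gpu.hpp>

    int main()
    {
        cv::Mat src_host = cv::imread("file.png", 0); // placeholder file name
        cv::gpu::GpuMat d_src, d_dst;
        d_src.upload(src_host);                       // host -> device copy
        cv::gpu::threshold(d_src, d_dst, 128.0, 255.0, cv::THRESH_BINARY);
        cv::Mat result_host;
        d_dst.download(result_host);                  // device -> host copy
        return 0;
    }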
**Note:**
It is not recommended to leave static or global ``GpuMat`` variables allocated, that is, to rely on their destructors. The destruction order of such variables and the CUDA context is undefined. The GPU memory release function returns an error if the CUDA context has been destroyed before.
See Also:
...

This class with reference counting wraps special memory type allocation functions from CUDA. Its interface is also :func:`Mat`-like but with additional memory type parameters:
*
    ``ALLOC_PAGE_LOCKED``: Sets a page-locked memory type, used commonly for fast and asynchronous uploading/downloading of data from/to a GPU.
*
    ``ALLOC_ZEROCOPY``: Specifies a zero-copy memory allocation that enables mapping the host memory to the GPU address space, if supported.
*
    ``ALLOC_WRITE_COMBINED``: Sets the write-combined buffer that is not cached by the CPU. Such buffers are used to supply the GPU with data when the GPU only reads it. The advantage is better CPU cache utilization.
**Note:**
Allocation size of such memory types is usually limited. For more details, see the *CUDA 2.2 Pinned Memory APIs* document or the *CUDA C Programming Guide*.
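For illustration, a minimal sketch of allocating a page-locked buffer and viewing it through a regular :c:type:`Mat` header (buffer dimensions are illustrative; no data is copied): ::

    cv::gpu::CudaMem pinned(480, 640, CV_8UC1, cv::gpu::CudaMem::ALLOC_PAGE_LOCKED);
    cv::Mat header = pinned.createMatHeader(); // Mat view of the pinned buffer
    header.setTo(cv::Scalar::all(0));          // fill it via the ordinary Mat API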
Maps CPU memory to the GPU address space and creates a :cpp:class:`gpu::GpuMat` header without reference counting for it. This can be done only if the memory was allocated with the ``ALLOC_ZEROCOPY`` flag and if it is supported by the hardware (laptops often share video and CPU memory, so address spaces can be mapped, which eliminates an extra copy).
Returns ``true`` if the current hardware supports address space mapping and ``ALLOC_ZEROCOPY`` memory allocation.
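A minimal sketch of the zero-copy path, guarded by the capability check described above: ::

    if (cv::gpu::CudaMem::canMapHostMemory())
    {
        cv::gpu::CudaMem zero_copy(480, 640, CV_8UC1, cv::gpu::CudaMem::ALLOC_ZEROCOPY);
        cv::gpu::GpuMat d_view = zero_copy.createGpuMatHeader(); // maps, no copy
        // d_view now addresses the same buffer from the GPU side
    }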
.. index:: gpu::Stream
gpu::Stream
-----------

.. cpp:class:: gpu::Stream
This class encapsulates a queue of asynchronous calls. Some functions have overloads with the additional ``gpu::Stream`` parameter. The overloads do initialization work (allocate output buffers, upload constants, and so on), start the GPU kernel, and return before results are ready. You can check whether all operations are complete via :cpp:func:`gpu::Stream::queryIfComplete`. You can asynchronously upload/download data from/to page-locked buffers using :cpp:class:`gpu::CudaMem` or a :c:type:`Mat` header that points to a region of :cpp:class:`gpu::CudaMem`.
**Note:**
Currently, you may face problems if an operation is enqueued twice with different data. Some functions use constant GPU memory, and the next call may update the memory before the previous one has finished. However, calling different operations asynchronously is safe because each operation has its own constant buffer. Memory copy/upload/download/set operations to the buffers you hold are also safe.
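For illustration, a minimal sketch of the asynchronous pattern (assuming a ``gpu`` function overload that accepts a stream; buffer sizes are illustrative): ::

    cv::gpu::CudaMem src(480, 640, CV_8UC1, cv::gpu::CudaMem::ALLOC_PAGE_LOCKED);
    cv::gpu::CudaMem dst(480, 640, CV_8UC1, cv::gpu::CudaMem::ALLOC_PAGE_LOCKED);
    cv::gpu::GpuMat d_src, d_dst;
    cv::gpu::Stream stream;

    stream.enqueueUpload(src, d_src);                 // returns immediately
    cv::gpu::threshold(d_src, d_dst, 128.0, 255.0, cv::THRESH_BINARY, stream);
    stream.enqueueDownload(d_dst, dst);

    while (!stream.queryIfComplete())
    {
        // overlap useful CPU work here; dst is valid once the queue completes
    }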
Sets a device and initializes it for the current thread. If this call is omitted, a default device is initialized at the first GPU usage.
:param device: System index of a GPU device starting with 0.
...

gpu::getDevice
------------------

.. cpp:function:: int getDevice()
Returns the current device index that was set by :cpp:func:`gpu::setDevice` or initialized by default.
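A minimal sketch (device 0 is chosen arbitrarily): ::

    if (cv::gpu::getCudaEnabledDeviceCount() > 0)
    {
        cv::gpu::setDevice(0);              // bind device 0 to this thread
        int current = cv::gpu::getDevice(); // current == 0 here
    }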
.. index:: gpu::GpuFeature
...

This class provides functionality for querying the specified GPU properties.

Checks the GPU module and device compatibility. This function returns ``true`` if the GPU module can be run on the specified device. Otherwise, it returns ``false``.
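For illustration, a minimal compatibility check (assuming the :cpp:class:`gpu::DeviceInfo` interface of the GPU module): ::

    cv::gpu::DeviceInfo info(cv::gpu::getDevice());
    if (!info.isCompatible())
    {
        // the GPU module cannot run on this device; fall back to the CPU path
    }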
.. index:: gpu::TargetArchs
gpu::TargetArchs
----------------

.. cpp:class:: gpu::TargetArchs
This class provides a set of static methods to check what NVIDIA* card architecture the GPU module was built for.
The following method checks whether the module was built with the support of the given feature:
:param feature: Feature to be checked. See :cpp:class:`gpu::GpuFeature`.
There is a set of methods to check whether the module contains intermediate (PTX) or binary GPU code for the given architecture(s):
...

:param minor: Minor compute capability version.
According to the CUDA C Programming Guide Version 3.2: "PTX code produced for some specific compute capability can always be compiled to binary code of greater or equal compute capability".
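For illustration, a minimal sketch of these queries (the feature constant is assumed to come from the :cpp:class:`gpu::GpuFeature` enumeration): ::

    // Was the module built with native double support?
    bool has_double = cv::gpu::TargetArchs::builtWith(cv::gpu::NATIVE_DOUBLE);
    // Does the module carry binary code for CC 1.3?
    bool bin_13 = cv::gpu::TargetArchs::hasBin(1, 3);
    // Is there PTX code for CC 1.1 or higher that can be JIT'ed on this device?
    bool ptx_11_up = cv::gpu::TargetArchs::hasEqualOrGreaterPtx(1, 1);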
.. index:: gpu::MultiGpuManager
gpu::MultiGpuManager
--------------------

.. cpp:class:: gpu::MultiGpuManager
This class provides functionality for working with many GPUs. ::

    ...
The OpenCV GPU module is a set of classes and functions to utilize GPU computational capabilities. It is implemented using the NVIDIA* CUDA Runtime API and supports only NVIDIA GPUs. The OpenCV GPU module includes utility functions, low-level vision primitives, and high-level algorithms. The utility functions and low-level primitives provide a powerful infrastructure for developing fast vision algorithms that take advantage of the GPU, whereas the high-level functionality includes some state-of-the-art algorithms (such as stereo correspondence, face and people detectors, and others) ready to be used by application developers.
The GPU module is designed as a host-level API. This means that if you have pre-compiled OpenCV GPU binaries, you are not required to have the CUDA Toolkit installed or write any extra code to make use of the GPU.
The GPU module depends on the CUDA Toolkit and the NVIDIA Performance Primitives (NPP) library. Make sure you have the latest versions of these libraries installed. You can download both libraries for all supported platforms from the NVIDIA site. To compile the OpenCV GPU module, you need a compiler compatible with the CUDA Runtime Toolkit.
The OpenCV GPU module is designed for ease of use and does not require any knowledge of CUDA. However, such knowledge will certainly be useful for handling non-trivial cases and achieving the highest performance. It is helpful to understand the cost of various operations, what the GPU does, what the preferred data formats are, and so on. The GPU module is an effective instrument for quick implementation of GPU-accelerated computer vision algorithms. However, if your algorithm involves many simple operations, then, for the best possible performance, you may still need to write your own kernels to avoid extra write and read operations on the intermediate results.
To enable CUDA support, configure OpenCV using ``CMake`` with ``WITH_CUDA=ON``. When the flag is set and if CUDA is installed, the full-featured OpenCV GPU module is built. Otherwise, the module is still built, but at runtime all functions from the module throw :func:`Exception` with the ``CV_GpuNotSupported`` error code, except for :func:`gpu::getCudaEnabledDeviceCount()`. The latter function returns a zero GPU count in this case. Building OpenCV without CUDA support does not perform device code compilation, so it does not require the CUDA Toolkit installed. Therefore, using the :func:`gpu::getCudaEnabledDeviceCount()` function, you can implement a high-level algorithm that detects GPU presence at runtime and chooses an appropriate implementation (CPU or GPU) accordingly.
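A minimal dispatch sketch (``processOnGpu`` and ``processOnCpu`` are hypothetical helpers standing in for your pipeline): ::

    if (cv::gpu::getCudaEnabledDeviceCount() > 0)
        processOnGpu(frame); // hypothetical GPU implementation
    else
        processOnCpu(frame); // hypothetical CPU fallback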
Compilation for Different NVIDIA* Platforms
-------------------------------------------
The NVIDIA* compiler enables generating binary code (cubin and fatbin) and intermediate code (PTX). Binary code often implies a specific GPU architecture and generation, so the compatibility with other GPUs is not guaranteed. PTX is targeted for a virtual platform that is defined entirely by the set of capabilities or features. Depending on the selected virtual platform, some of the instructions are emulated or disabled, even if the real hardware supports all the features.
At the first call, the PTX code is compiled to binary code for the particular GPU using a JIT compiler. When the target GPU has a compute capability (CC) lower than the PTX code, JIT fails.
By default, the OpenCV GPU module includes:
*
    Binaries for compute capabilities 1.3 and 2.0 (controlled by ``CUDA_ARCH_BIN`` in ``CMake``)
*
    PTX code for compute capabilities 1.1 and 1.3 (controlled by ``CUDA_ARCH_PTX`` in ``CMake``)
This means that for devices with CC 1.3 and 2.0, binary images are ready to run. For all newer platforms, the PTX code for 1.3 is JIT'ed to a binary image. For devices with CC 1.1 and 1.2, the PTX for 1.1 is JIT'ed. For devices with CC 1.0, no code is available and the functions throw :func:`Exception`. For platforms where JIT compilation is performed first, the run is slow.
On a GPU with CC 1.0, you can still compile the GPU module, and most of the functions will run flawlessly. To achieve this, add "1.0" to the list of binaries, for example, ``CUDA_ARCH_BIN="1.0 1.3 2.0"``. The functions that cannot be run on CC 1.0 GPUs throw an exception.
...

You can always determine at runtime whether the OpenCV GPU-built binaries (or PTX code) are compatible with your GPU.

Threading and Multi-threading
------------------------------
The OpenCV GPU module follows the CUDA Runtime API conventions regarding multi-threaded programming. This means that for the first API call, a CUDA context is created implicitly, attached to the current CPU thread, and then used as the thread's "current" context. All further operations, such as memory allocation or GPU code compilation, are associated with the context and the thread. Because any other thread is not attached to the context, memory (and other resources) allocated in the first thread cannot be accessed by another thread. Instead, for this other thread, CUDA creates another context associated with it. In short, by default, different threads do not share resources.
But you can remove this limitation by using the CUDA Driver API (version 3.1 or later). You can retrieve the context reference for one thread, attach it to another thread, and make it "current" for that thread. As a result, the threads can share memory and other resources. It is also possible to create a context explicitly before calling any GPU code and attach it to all the threads you want to share the resources with.
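For illustration, a minimal sketch of handing a context from one thread to another with the Driver API (CUDA 3.1 or later; error handling omitted): ::

    CUcontext ctx;
    cuCtxPopCurrent(&ctx); // thread A: detach the context from this thread
    // ... pass ctx to thread B by any thread-safe means ...
    cuCtxPushCurrent(ctx); // thread B: attach the context and make it current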
...

Utilizing Multiple GPUs
-----------------------
In the current version, each of the OpenCV GPU algorithms can use only a single GPU. So, to utilize multiple GPUs, you have to manually distribute the work between GPUs. Here are the two ways of utilizing multiple GPUs:
*
    If you use only synchronous functions, create several CPU threads (one per GPU) and from within each thread create a CUDA context for the corresponding GPU using :func:`gpu::setDevice()` or the Driver API. Each of the threads will use the associated GPU, as in the sketch below.
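For the first approach, a minimal per-thread sketch (``gpuWorker`` is a hypothetical function to be launched once per GPU, each call from its own CPU thread): ::

    void gpuWorker(int device_id)
    {
        cv::gpu::setDevice(device_id); // creates and binds this thread's context
        // ... run this thread's share of the GPU pipeline on device_id ...
    }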
This class provides the histogram of oriented gradients (HOG) descriptor and detector [Navneet Dalal and Bill Triggs. Histograms of Oriented Gradients for Human Detection. 2005.]. ::
    struct CV_EXPORTS HOGDescriptor
    {
        ...
    };
Interfaces of all methods are kept similar to their CPU HOG descriptor and detector analogues as much as possible.
Performs object detection without a multi-scale window.
:param img: Source image. ``CV_8UC1`` and ``CV_8UC4`` types are supported for now.
:param found_locations: Left-top corner points of detected object boundaries.
:param hit_threshold: Threshold for the distance between features and the SVM classifying plane. Usually it is 0 and should be specified in the detector coefficients (as the last free coefficient). But if the free coefficient is omitted (which is allowed), you can specify it manually here.
:param win_stride: Window stride. It must be a multiple of block stride.
:param padding: Mock parameter to keep the CPU interface compatibility. It must be (0,0).
:param win_stride: Window stride. It must be a multiple of block stride.
:param padding: Mock parameter to keep the CPU interface compatibility. It must be (0,0).
:param scale0: Coefficient of the detection window increase.
:param group_threshold: Coefficient to regulate the similarity threshold. When detected, some objects can be covered by many rectangles. 0 means not to perform grouping. See :func:`groupRectangles`.
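A minimal detection sketch using the people detector shipped with the module (``img`` is assumed to be a ``CV_8UC1`` host image): ::

    cv::gpu::HOGDescriptor hog;
    hog.setSVMDetector(cv::gpu::HOGDescriptor::getDefaultPeopleDetector());

    cv::gpu::GpuMat d_img(img);         // upload the CV_8UC1 image
    std::vector<cv::Rect> found;
    hog.detectMultiScale(d_img, found); // multi-scale people detection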
.. index:: gpu::HOGDescriptor::getDescriptors
...

gpu::CascadeClassifier_GPU
--------------------------

.. cpp:class:: gpu::CascadeClassifier_GPU
This cascade classifier class is used for object detection.
:param filename: Name of the file from which the classifier is loaded. Only the old ``haar`` classifier (trained by the ``haartraining`` application) and NVIDIA's ``nvbin`` are supported.
Loads the classifier from a file. The previous content is destroyed.
:param filename: Name of the file from which the classifier is loaded. Only the old ``haar`` classifier (trained by the ``haartraining`` application) and NVIDIA's ``nvbin`` are supported.
:param image: Matrix of type ``CV_8U`` containing an image where objects should be detected.
:param objects: Buffer to store detected objects (rectangles). If it is empty, it is allocated with the default size. If not empty, the function searches not more than N objects, where ``N = sizeof(objectsBuffer's data) / sizeof(cv::Rect)``.
:param scaleFactor: Value to specify how much the image size is reduced at each image scale.
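A minimal detection sketch (the cascade file name is a placeholder; ``img`` is assumed to be a ``CV_8U`` host image): ::

    cv::gpu::CascadeClassifier_GPU cascade("haarcascade_frontalface_alt.xml");

    cv::gpu::GpuMat d_img(img), d_objects;
    int n = cascade.detectMultiScale(d_img, d_objects, 1.2, 4);

    cv::Mat objects_host;
    d_objects.colRange(0, n).download(objects_host); // first n rectangles
    const cv::Rect* faces = objects_host.ptr<cv::Rect>();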