KinectFusion big update: OpenCL support, etc. (#1798)

* KinFu demo: idle mode added * undistort added * KinFu demo: depth recorder added * TSDFVolume gets voxelSize, voxelSizeInv, truncDist members; decorative fixes * TSDFVolumeGPU::integrate(): host code compiles * TSDFVolume: truncDist fixed * TSDFVolume::integrate(): initial OCL version is done * TSDFVolume::integrate(): OCL: minor fixes * kinfu: small fixes * TSDFVolume::raycast(): initial GPU version is done * USE_INTRINSICS directive for centralized enable/disable opt. code * TSDF Volume supports 3 different sizes/resolutions on each dimension * TSDFVolume: serviceMembers moved to parent class * TSDFVolumeGPU: mem order changed, gave perf boost 4x * Frame: fixed UMat as InputArray; TSDF: comments, TODOs, minor fixes * Frame::getPointsNormals(); minors * FrameGPU: initial version (not working) * minor * FrameGPU: several fixes * added OCL "fast" options * ICP OCL: initial slow version is done (host-side reduce) * ICP OCL: reduce is done, buggy * KinFu Frame: more args fixes * ICP OCL: small fixes to kernel * ICP OCL reduce works * OCL code refactored * ICP OCL: less allocations, better speed * ICP OCL: pose matrix made arg * Render OCL: small fix * Demo: using UMats everywhere * TSDF integrate OCL: put const arg into kernel arg * Platform parameter partially removed, implementation choice is done through OCL availability check * Frame and FrameGenerator removed (other code is in following commits) * CPU render: 4b instead of 3b * ICP: no Frame class use, one class for both CPU and GPU code * demo: fix for UMat chain * TSDF: no Frame or FrameGenerator use * KinFu: no Frame or FrameGenerator, parametrized for Mat or UMat based on OCL availability * KinFu::setParams() removed since it has no effect anyway * TSDF::raycast OCL: fixed normals rendering * ScopeTime -> CV_TRACE * 3-dims resolution and size passed to API * fixed unexpected fails of ICP OCL * voxels made cubic again * args fixed a little * fixed volSize calculation * Tests: inequal, OCL, unified scene test function * kinfu_frame: input types fixed * fixed for Intel HD Graphics * KinFu demo: setUseOptimized instead of setUseOpenCL * tsdf: data types fixed * TSDF OCL: fetch normals implemented * roundDownPow2 -> utils.hpp * TSDF OCL: fetch points+normals implemented * TSDF OCL: NoSize, other fixes for kernel args * Frame OCL: HAVE_OPENCL, NoSize, other kernel args fixed * ICP OCL: HAVE_OPENCL, NoSize and other kernel fixes * KinFu demo fixes: volume size and too long delay * whitespace fix * TSDF: allowed sizes not divisable by 32 * TSDF: fixed type traits; added optimization TODOs * KinFu made non-free * minor fixes: cast and whitespace * fixed FastICP test * warnings fixed: implicit type conversions * OCL kernels: local args made through KernelArg::Local(lsz) call * MSVC warnings fixed * a workaround for broken OCV's bilateral * KinFu tests made a bit faster * build fixed until 3.4 isn't merged to master * warnings fixed, test time shortened

KinectFusion big update: OpenCL support, etc. (#1798)
* KinFu demo: idle mode added * undistort added * KinFu demo: depth recorder added * TSDFVolume gets voxelSize, voxelSizeInv, truncDist members; decorative fixes * TSDFVolumeGPU::integrate(): host code compiles * TSDFVolume: truncDist fixed * TSDFVolume::integrate(): initial OCL version is done * TSDFVolume::integrate(): OCL: minor fixes * kinfu: small fixes * TSDFVolume::raycast(): initial GPU version is done * USE_INTRINSICS directive for centralized enable/disable opt. code * TSDF Volume supports 3 different sizes/resolutions on each dimension * TSDFVolume: serviceMembers moved to parent class * TSDFVolumeGPU: mem order changed, gave perf boost 4x * Frame: fixed UMat as InputArray; TSDF: comments, TODOs, minor fixes * Frame::getPointsNormals(); minors * FrameGPU: initial version (not working) * minor * FrameGPU: several fixes * added OCL "fast" options * ICP OCL: initial slow version is done (host-side reduce) * ICP OCL: reduce is done, buggy * KinFu Frame: more args fixes * ICP OCL: small fixes to kernel * ICP OCL reduce works * OCL code refactored * ICP OCL: less allocations, better speed * ICP OCL: pose matrix made arg * Render OCL: small fix * Demo: using UMats everywhere * TSDF integrate OCL: put const arg into kernel arg * Platform parameter partially removed, implementation choice is done through OCL availability check * Frame and FrameGenerator removed (other code is in following commits) * CPU render: 4b instead of 3b * ICP: no Frame class use, one class for both CPU and GPU code * demo: fix for UMat chain * TSDF: no Frame or FrameGenerator use * KinFu: no Frame or FrameGenerator, parametrized for Mat or UMat based on OCL availability * KinFu::setParams() removed since it has no effect anyway * TSDF::raycast OCL: fixed normals rendering * ScopeTime -> CV_TRACE * 3-dims resolution and size passed to API * fixed unexpected fails of ICP OCL * voxels made cubic again * args fixed a little * fixed volSize calculation * Tests: inequal, OCL, unified scene test function * kinfu_frame: input types fixed * fixed for Intel HD Graphics * KinFu demo: setUseOptimized instead of setUseOpenCL * tsdf: data types fixed * TSDF OCL: fetch normals implemented * roundDownPow2 -> utils.hpp * TSDF OCL: fetch points+normals implemented * TSDF OCL: NoSize, other fixes for kernel args * Frame OCL: HAVE_OPENCL, NoSize, other kernel args fixed * ICP OCL: HAVE_OPENCL, NoSize and other kernel fixes * KinFu demo fixes: volume size and too long delay * whitespace fix * TSDF: allowed sizes not divisable by 32 * TSDF: fixed type traits; added optimization TODOs * KinFu made non-free * minor fixes: cast and whitespace * fixed FastICP test * warnings fixed: implicit type conversions * OCL kernels: local args made through KernelArg::Local(lsz) call * MSVC warnings fixed * a workaround for broken OCV's bilateral * KinFu tests made a bit faster * build fixed until 3.4 isn't merged to master * warnings fixed, test time shortened
75bcd397 · Rostislav Vasilikhin · Vadim Pisarevsky · 2f014f6f · 75bcd397 · 75bcd397
Commit 75bcd397 authored Oct 09, 2018 by Rostislav Vasilikhin Committed by Vadim Pisarevsky Oct 09, 2018
20 changed files
--- a/modules/rgbd/README.md
+++ b/modules/rgbd/README.md
@@ -15,3 +15,5 @@ Note that the KinectFusion algorithm was patented and its use may be restricted
 * _US8401225B2_  Moving object segmentation using depth images

 Since OpenCV's license imposes different restrictions on usage please consult a legal before using this algorithm any way.
+
+That's why you need to set the OPENCV_ENABLE_NONFREE option in CMake to use KinectFusion.
--- a/modules/rgbd/include/opencv2/rgbd/kinfu.hpp
+++ b/modules/rgbd/include/opencv2/rgbd/kinfu.hpp
@@ -28,19 +28,6 @@ struct CV_EXPORTS_W Params
    */
    CV_WRAP static Ptr<Params> coarseParams();

-    enum PlatformType
-    {
-
-        PLATFORM_CPU, PLATFORM_GPU
-    };
-
-    /** @brief A platform on which to run the algorithm.
-     *
-    Currently KinFu supports only one platform which is a CPU.
-    GPU platform is to be implemented in the future.
-    */
-    PlatformType platform;
-
    /** @brief frame size in pixels */
    CV_PROP_RW Size frameSize;

@@ -50,8 +37,9 @@ struct CV_EXPORTS_W Params
    /** @brief pre-scale per 1 meter for input values

    Typical values are:
-    * 5000 per 1 meter for the 16-bit PNG files of TUM database
-    * 1 per 1 meter for the 32-bit float images in the ROS bag files
+         * 5000 per 1 meter for the 16-bit PNG files of TUM database
+         * 1000 per 1 meter for Kinect 2 device
+         * 1 per 1 meter for the 32-bit float images in the ROS bag files
    */
    CV_PROP_RW float depthFactor;

@@ -65,13 +53,13 @@ struct CV_EXPORTS_W Params
    /** @brief Number of pyramid levels for ICP */
    CV_PROP_RW int pyramidLevels;

-    /** @brief Resolution of voxel cube
+    /** @brief Resolution of voxel space

-    Number of voxels in each cube edge.
+    Number of voxels in each dimension.
    */
-    CV_PROP_RW int volumeDims;
-    /** @brief Size of voxel cube side in meters */
-    CV_PROP_RW float volumeSize;
+    CV_PROP_RW Vec3i volumeDims;
+    /** @brief Size of voxel in meters */
+    CV_PROP_RW float voxelSize;

    /** @brief Minimal camera movement in meters

@@ -84,7 +72,7 @@ struct CV_EXPORTS_W Params

    /** @brief distance to truncate in meters

-    Distances that exceed this value will be truncated in voxel cube values.
+    Distances to surface that exceed this value will be truncated to 1.0.
    */
    CV_PROP_RW float tsdf_trunc_dist;

@@ -128,11 +116,19 @@ struct CV_EXPORTS_W Params
  The output can be obtained as a vector of points and their normals
  or can be Phong-rendered from given camera pose.

-  An internal representation of a model is a voxel cube that keeps TSDF values
+  An internal representation of a model is a voxel cuboid that keeps TSDF values
  which are a sort of distances to the surface (for details read the @cite kinectfusion article about TSDF).
  There is no interface to that representation yet.

-  This implementation is based on (kinfu-remake)[https://github.com/Nerei/kinfu_remake].
+  KinFu uses OpenCL acceleration automatically if available.
+  To enable or disable it explicitly use cv::setUseOptimized() or cv::ocl::setUseOpenCL().
+
+  This implementation is based on [kinfu-remake](https://github.com/Nerei/kinfu_remake).
+
+  Note that the KinectFusion algorithm was patented and its use may be restricted by
+  the list of patents mentioned in README.md file in this module directory.
+
+  That's why you need to set the OPENCV_ENABLE_NONFREE option in CMake to use KinectFusion.
 */
 class CV_EXPORTS_W KinFu
 {
@@ -142,11 +138,10 @@ public:

    /** @brief Get current parameters */
    virtual const Params& getParams() const = 0;
-    virtual void setParams(const Params&) = 0;

    /** @brief Renders a volume into an image

-      Renders a 0-surface of TSDF using Phong shading into a CV_8UC3 Mat.
+      Renders a 0-surface of TSDF using Phong shading into a CV_8UC4 Mat.
      Light pose is fixed in KinFu params.

        @param image resulting image
@@ -186,21 +181,18 @@ public:
    */
    CV_WRAP virtual void reset() = 0;

-    /** @brief Get current pose in voxel cube space */
+    /** @brief Get current pose in voxel space */
    virtual const Affine3f getPose() const = 0;

    /** @brief Process next depth frame

-      Integrates depth into voxel cube with respect to its ICP-calculated pose.
+      Integrates depth into voxel space with respect to its ICP-calculated pose.
      Input image is converted to CV_32F internally if has another type.

    @param depth one-channel image which size and depth scale is described in algorithm's parameters
    @return true if succeded to align new frame with current scene, false if opposite
    */
    CV_WRAP virtual bool update(InputArray depth) = 0;
-
-private:
-    class KinFuImpl;
 };

 //! @}

--- a/modules/rgbd/samples/kinfu_demo.cpp
+++ b/modules/rgbd/samples/kinfu_demo.cpp
--- a/modules/rgbd/src/fast_icp.cpp
+++ b/modules/rgbd/src/fast_icp.cpp
--- a/modules/rgbd/src/fast_icp.hpp
+++ b/modules/rgbd/src/fast_icp.hpp
@@ -18,7 +18,10 @@ class ICP
 public:
    ICP(const cv::kinfu::Intr _intrinsics, const std::vector<int> &_iterations, float _angleThreshold, float _distanceThreshold);

-    virtual bool estimateTransform(cv::Affine3f& transform, cv::Ptr<Frame> oldFrame, cv::Ptr<Frame> newFrame) const = 0;
+    virtual bool estimateTransform(cv::Affine3f& transform,
+                                   InputArray oldPoints, InputArray oldNormals,
+                                   InputArray newPoints, InputArray newNormals
+                                   ) const = 0;

    virtual ~ICP() { }

@@ -30,8 +33,7 @@ protected:
    cv::kinfu::Intr intrinsics;
 };

-cv::Ptr<ICP> makeICP(cv::kinfu::Params::PlatformType t,
-                     const cv::kinfu::Intr _intrinsics, const std::vector<int> &_iterations,
+cv::Ptr<ICP> makeICP(const cv::kinfu::Intr _intrinsics, const std::vector<int> &_iterations,
                     float _angleThreshold, float _distanceThreshold);

 } // namespace kinfu

--- a/modules/rgbd/src/kinfu.cpp
+++ b/modules/rgbd/src/kinfu.cpp
--- a/modules/rgbd/src/kinfu_frame.cpp
+++ b/modules/rgbd/src/kinfu_frame.cpp
--- a/modules/rgbd/src/kinfu_frame.hpp
+++ b/modules/rgbd/src/kinfu_frame.hpp
@@ -70,52 +70,24 @@ inline ptype toPtype(const cv::Vec3f& x)
    return ptype(x[0], x[1], x[2], 0);
 }

-typedef cv::Mat_< ptype > Points;
-typedef Points Normals;
-
-typedef cv::Mat_< depthType > Depth;
-
-struct Frame
+enum
 {
-public:
-    virtual void render(cv::OutputArray image, int level, cv::Affine3f lightPose) const = 0;
-    virtual void getDepth(cv::OutputArray depth) const = 0;
-    virtual ~Frame() { }
+    DEPTH_TYPE = DataType<depthType>::type,
+    POINT_TYPE = DataType<ptype    >::type
 };

-struct FrameCPU : Frame
-{
-public:
-    FrameCPU() : points(), normals() { }
-    virtual ~FrameCPU() { }
-
-    virtual void render(cv::OutputArray image, int level, cv::Affine3f lightPose) const override;
-    virtual void getDepth(cv::OutputArray depth) const override;
-
-    std::vector<Points> points;
-    std::vector<Normals> normals;
-    Depth depthData;
-};
-
-struct FrameGPU : Frame
-{
-public:
-    virtual void render(cv::OutputArray image, int level, cv::Affine3f lightPose) const override;
-    virtual void getDepth(cv::OutputArray depth) const override;
-    virtual ~FrameGPU() { }
-};
+typedef cv::Mat_< ptype > Points;
+typedef Points Normals;

-struct FrameGenerator
-{
-public:
-    virtual cv::Ptr<Frame> operator ()() const = 0;
-    virtual void operator() (cv::Ptr<Frame> frame, cv::InputArray depth, const cv::kinfu::Intr, int levels, float depthFactor,
-                                       float sigmaDepth, float sigmaSpatial, int kernelSize) const = 0;
-    virtual void operator() (cv::Ptr<Frame> frame, cv::InputArray points, cv::InputArray normals, int levels) const = 0;
-    virtual ~FrameGenerator() {}
-};
+typedef cv::Mat_< depthType > Depth;

-cv::Ptr<FrameGenerator> makeFrameGenerator(cv::kinfu::Params::PlatformType t);
+void renderPointsNormals(InputArray _points, InputArray _normals, OutputArray image, cv::Affine3f lightPose);
+void makeFrameFromDepth(InputArray depth, OutputArray pyrPoints, OutputArray pyrNormals,
+                        const Intr intr, int levels, float depthFactor,
+                        float sigmaDepth, float sigmaSpatial, int kernelSize);
+void buildPyramidPointsNormals(InputArray _points, InputArray _normals,
+                               OutputArrayOfArrays pyrPoints, OutputArrayOfArrays pyrNormals,
+                               int levels);

 } // namespace kinfu
 } // namespace cv

--- a/modules/rgbd/src/odometry.cpp
+++ b/modules/rgbd/src/odometry.cpp
@@ -1490,21 +1490,10 @@ Size FastICPOdometry::prepareFrameCache(Ptr<OdometryFrame>& frame, int cacheType
    checkDepth(frame->depth, frame->depth.size());

    // mask isn't used by FastICP
-
-    Ptr<FrameGenerator> fg = makeFrameGenerator(Params::PlatformType::PLATFORM_CPU);
-    Ptr<FrameCPU> f = (*fg)().dynamicCast<FrameCPU>();
    Intr intr(cameraMatrix);
    float depthFactor = 1.f; // user should rescale depth manually
-    (*fg)(f, frame->depth, intr, (int)iterCounts.total(), depthFactor,
-          sigmaDepth, sigmaSpatial, kernelSize);
-
-    frame->pyramidCloud.clear();
-    frame->pyramidNormals.clear();
-    for(size_t i = 0; i < f->points.size(); i++)
-    {
-        frame->pyramidCloud.push_back(f->points[i]);
-        frame->pyramidNormals.push_back(f->normals[i]);
-    }
+    makeFrameFromDepth(frame->depth, frame->pyramidCloud, frame->pyramidNormals, intr, (int)iterCounts.total(),
+                       depthFactor, sigmaDepth, sigmaSpatial, kernelSize);

    return frame->depth.size();
 }
@@ -1526,22 +1515,16 @@ bool FastICPOdometry::computeImpl(const Ptr<OdometryFrame>& srcFrame,
 {
    kinfu::Intr intr(cameraMatrix);
    std::vector<int> iterations = iterCounts;
-    Ptr<kinfu::ICP> icp = kinfu::makeICP(kinfu::Params::PlatformType::PLATFORM_CPU,
-                                         intr,
+    Ptr<kinfu::ICP> icp = kinfu::makeICP(intr,
                                         iterations,
                                         angleThreshold,
                                         maxDistDiff);

+    // KinFu's ICP calculates transformation from new frame to old one (src to dst)
    Affine3f transform;
-    Ptr<FrameCPU> srcF = makePtr<FrameCPU>(), dstF = makePtr<FrameCPU>();
-    for(size_t i = 0; i < srcFrame->pyramidCloud.size(); i++)
-    {
-        srcF-> points.push_back(srcFrame->pyramidCloud  [i]);
-        srcF->normals.push_back(srcFrame->pyramidNormals[i]);
-        dstF-> points.push_back(dstFrame->pyramidCloud  [i]);
-        dstF->normals.push_back(dstFrame->pyramidNormals[i]);
-    }
-    bool result = icp->estimateTransform(transform, dstF, srcF);
+    bool result = icp->estimateTransform(transform,
+                                         dstFrame->pyramidCloud, dstFrame->pyramidNormals,
+                                         srcFrame->pyramidCloud, srcFrame->pyramidNormals);

    Rt.create(Size(4, 4), CV_64FC1);
    Mat(Matx44d(transform.matrix)).copyTo(Rt.getMat());

--- a/modules/rgbd/src/opencl/icp.cl
+++ b/modules/rgbd/src/opencl/icp.cl
+// This file is part of OpenCV project.
+// It is subject to the license terms in the LICENSE file found in the top-level directory
+// of this distribution and at http://opencv.org/license.html
+
+// This code is also subject to the license terms in the LICENSE_KinectFusion.md file found in this module's directory
+
+#define UTSIZE 27
+
+typedef float4 ptype;
+
+/*
+    Calculate an upper triangle of Ab matrix then reduce it across workgroup
+*/
+
+inline void calcAb7(__global const char * oldPointsptr,
+                    int oldPoints_step, int oldPoints_offset,
+                    __global const char * oldNormalsptr,
+                    int oldNormals_step, int oldNormals_offset,
+                    const int2 oldSize,
+                    __global const char * newPointsptr,
+                    int newPoints_step, int newPoints_offset,
+                    __global const char * newNormalsptr,
+                    int newNormals_step, int newNormals_offset,
+                    const int2 newSize,
+                    const float16 poseMatrix,
+                    const float2 fxy,
+                    const float2 cxy,
+                    const float sqDistanceThresh,
+                    const float minCos,
+                    float* ab7
+                    )
+{
+    const int x = get_global_id(0);
+    const int y = get_global_id(1);
+
+    if(x >= newSize.x || y >= newSize.y)
+        return;
+
+    // coord-independent constants
+
+    const float3 poseRot0 = poseMatrix.s012;
+    const float3 poseRot1 = poseMatrix.s456;
+    const float3 poseRot2 = poseMatrix.s89a;
+    const float3 poseTrans = poseMatrix.s37b;
+
+    const float2 oldEdge = (float2)(oldSize.x - 1, oldSize.y - 1);
+
+    __global const ptype* newPtsRow = (__global const ptype*)(newPointsptr +
+                                                              newPoints_offset +
+                                                              y*newPoints_step);
+
+    __global const ptype* newNrmRow = (__global const ptype*)(newNormalsptr +
+                                                              newNormals_offset +
+                                                              y*newNormals_step);
+
+    float3 newP = newPtsRow[x].xyz;
+    float3 newN = newNrmRow[x].xyz;
+
+    if(any(isnan(newP)) || any(isnan(newN)))
+        return;
+
+    //transform to old coord system
+    newP = (float3)(dot(newP, poseRot0),
+                    dot(newP, poseRot1),
+                    dot(newP, poseRot2)) + poseTrans;
+    newN = (float3)(dot(newN, poseRot0),
+                    dot(newN, poseRot1),
+                    dot(newN, poseRot2));
+
+    //find correspondence by projecting the point
+    float2 oldCoords = (newP.xy/newP.z)*fxy+cxy;
+
+    if(!(all(oldCoords >= 0.f) && all(oldCoords < oldEdge)))
+        return;
+
+    // bilinearly interpolate oldPts and oldNrm under oldCoords point
+    float3 oldP, oldN;
+    float2 ip = floor(oldCoords);
+    float2 t = oldCoords - ip;
+    int xi = ip.x, yi = ip.y;
+
+    __global const ptype* prow0 = (__global const ptype*)(oldPointsptr +
+                                                          oldPoints_offset +
+                                                          (yi+0)*oldPoints_step);
+    __global const ptype* prow1 = (__global const ptype*)(oldPointsptr +
+                                                          oldPoints_offset +
+                                                          (yi+1)*oldPoints_step);
+    float3 p00 = prow0[xi+0].xyz;
+    float3 p01 = prow0[xi+1].xyz;
+    float3 p10 = prow1[xi+0].xyz;
+    float3 p11 = prow1[xi+1].xyz;
+
+    // NaN check is done later
+
+    __global const ptype* nrow0 = (__global const ptype*)(oldNormalsptr +
+                                                          oldNormals_offset +
+                                                          (yi+0)*oldNormals_step);
+    __global const ptype* nrow1 = (__global const ptype*)(oldNormalsptr +
+                                                          oldNormals_offset +
+                                                          (yi+1)*oldNormals_step);
+
+    float3 n00 = nrow0[xi+0].xyz;
+    float3 n01 = nrow0[xi+1].xyz;
+    float3 n10 = nrow1[xi+0].xyz;
+    float3 n11 = nrow1[xi+1].xyz;
+
+    // NaN check is done later
+
+    float3 p0 = mix(p00, p01, t.x);
+    float3 p1 = mix(p10, p11, t.x);
+    oldP = mix(p0, p1, t.y);
+
+    float3 n0 = mix(n00, n01, t.x);
+    float3 n1 = mix(n10, n11, t.x);
+    oldN = mix(n0, n1, t.y);
+
+    if(any(isnan(oldP)) || any(isnan(oldN)))
+        return;
+
+    //filter by distance
+    float3 diff = newP - oldP;
+    if(dot(diff, diff) > sqDistanceThresh)
+        return;
+
+    //filter by angle
+    if(fabs(dot(newN, oldN)) < minCos)
+        return;
+
+    // build point-wise vector ab = [ A | b ]
+
+    float3 VxN = cross(newP, oldN);
+    float ab[7] = {VxN.x, VxN.y, VxN.z, oldN.x, oldN.y, oldN.z, -dot(oldN, diff)};
+
+    for(int i = 0; i < 7; i++)
+        ab7[i] = ab[i];
+}
+
+
+__kernel void getAb(__global const char * oldPointsptr,
+                    int oldPoints_step, int oldPoints_offset,
+                    __global const char * oldNormalsptr,
+                    int oldNormals_step, int oldNormals_offset,
+                    const int2 oldSize,
+                    __global const char * newPointsptr,
+                    int newPoints_step, int newPoints_offset,
+                    __global const char * newNormalsptr,
+                    int newNormals_step, int newNormals_offset,
+                    const int2 newSize,
+                    const float16 poseMatrix,
+                    const float2 fxy,
+                    const float2 cxy,
+                    const float sqDistanceThresh,
+                    const float minCos,
+                    __local float * reducebuf,
+                    __global char* groupedSumptr,
+                    int groupedSum_step, int groupedSum_offset
+)
+{
+    const int x = get_global_id(0);
+    const int y = get_global_id(1);
+
+    if(x >= newSize.x || y >= newSize.y)
+        return;
+
+    const int gx = get_group_id(0);
+    const int gy = get_group_id(1);
+    const int gw = get_num_groups(0);
+    const int gh = get_num_groups(1);
+
+    const int lx = get_local_id(0);
+    const int ly = get_local_id(1);
+    const int lw = get_local_size(0);
+    const int lh = get_local_size(1);
+    const int lsz = lw*lh;
+    const int lid = lx + ly*lw;
+
+    float ab[7];
+    for(int i = 0; i < 7; i++)
+        ab[i] = 0;
+
+    calcAb7(oldPointsptr,
+            oldPoints_step, oldPoints_offset,
+            oldNormalsptr,
+            oldNormals_step, oldNormals_offset,
+            oldSize,
+            newPointsptr,
+            newPoints_step, newPoints_offset,
+            newNormalsptr,
+            newNormals_step, newNormals_offset,
+            newSize,
+            poseMatrix,
+            fxy, cxy,
+            sqDistanceThresh,
+            minCos,
+            ab);
+
+    // build point-wise upper-triangle matrix [ab^T * ab] w/o last row
+    // which is [A^T*A | A^T*b]
+    // and gather sum
+
+    __local float* upperTriangle = reducebuf + lid*UTSIZE;
+
+    int pos = 0;
+    for(int i = 0; i < 6; i++)
+    {
+        for(int j = i; j < 7; j++)
+        {
+            upperTriangle[pos++] = ab[i]*ab[j];
+        }
+    }
+
+    // reduce upperTriangle to local mem
+
+    // maxStep = ctz(lsz), ctz isn't supported on CUDA devices
+    const int c = clz(lsz & -lsz);
+    const int maxStep = c ? 31 - c : c;
+    for(int nstep = 1; nstep <= maxStep; nstep++)
+    {
+        if(lid % (1 << nstep) == 0)
+        {
+            __local float* rto   = reducebuf + UTSIZE*lid;
+            __local float* rfrom = reducebuf + UTSIZE*(lid+(1 << (nstep-1)));
+            for(int i = 0; i < UTSIZE; i++)
+                rto[i] += rfrom[i];
+        }
+        barrier(CLK_LOCAL_MEM_FENCE);
+    }
+
+    // here group sum should be in reducebuf[0...UTSIZE]
+    if(lid == 0)
+    {
+        __global float* groupedRow = (__global float*)(groupedSumptr +
+                                                       groupedSum_offset +
+                                                       gy*groupedSum_step);
+
+        for(int i = 0; i < UTSIZE; i++)
+            groupedRow[gx*UTSIZE + i] = reducebuf[i];
+    }
+}
--- a/modules/rgbd/src/opencl/kinfu_frame.cl
+++ b/modules/rgbd/src/opencl/kinfu_frame.cl
--- a/modules/rgbd/src/opencl/tsdf.cl
+++ b/modules/rgbd/src/opencl/tsdf.cl
--- a/modules/rgbd/src/precomp.hpp
+++ b/modules/rgbd/src/precomp.hpp
@@ -13,6 +13,8 @@
 #include "opencv2/core/utility.hpp"
 #include "opencv2/core/private.hpp"
 #include "opencv2/core/hal/intrin.hpp"
+#include "opencv2/core/ocl.hpp"
+#include "opencl_kernels_rgbd.hpp"
 #include "opencv2/imgproc.hpp"
 #include "opencv2/calib3d.hpp"
 #include "opencv2/rgbd.hpp"

--- a/modules/rgbd/src/tsdf.cpp
+++ b/modules/rgbd/src/tsdf.cpp
--- a/modules/rgbd/src/tsdf.hpp
+++ b/modules/rgbd/src/tsdf.hpp
@@ -18,12 +18,12 @@ class TSDFVolume
 {
 public:
    // dimension in voxels, size in meters
-    TSDFVolume(int _res, float _size, cv::Affine3f _pose, float _truncDist, int _maxWeight,
-               float _raycastStepFactor);
+    TSDFVolume(Point3i _res, float _voxelSize, cv::Affine3f _pose, float _truncDist, int _maxWeight,
+               float _raycastStepFactor, bool zFirstMemOrder = true);

-    virtual void integrate(cv::Ptr<Frame> depth, float depthFactor, cv::Affine3f cameraPose, cv::kinfu::Intr intrinsics) = 0;
-    virtual void raycast(cv::Affine3f cameraPose, cv::kinfu::Intr intrinsics, cv::Size frameSize, int pyramidLevels,
-                         cv::Ptr<FrameGenerator> frameGenerator, cv::Ptr<Frame> frame) const = 0;
+    virtual void integrate(InputArray _depth, float depthFactor, cv::Affine3f cameraPose, cv::kinfu::Intr intrinsics) = 0;
+    virtual void raycast(cv::Affine3f cameraPose, cv::kinfu::Intr intrinsics, cv::Size frameSize,
+                         cv::OutputArray points, cv::OutputArray normals) const = 0;

    virtual void fetchPointsNormals(cv::OutputArray points, cv::OutputArray normals) const = 0;
    virtual void fetchNormals(cv::InputArray points, cv::OutputArray _normals) const = 0;
@@ -32,14 +32,20 @@ public:

    virtual ~TSDFVolume() { }

-    float edgeSize;
-    int edgeResolution;
+    float voxelSize;
+    float voxelSizeInv;
+    Point3i volResolution;
    int maxWeight;
    cv::Affine3f pose;
+    float raycastStepFactor;
+
+    Point3f volSize;
+    float truncDist;
+    Vec4i volDims;
+    Vec8i neighbourCoords;
 };

-cv::Ptr<TSDFVolume> makeTSDFVolume(cv::kinfu::Params::PlatformType t,
-                                   int _res, float _size, cv::Affine3f _pose, float _truncDist, int _maxWeight,
+cv::Ptr<TSDFVolume> makeTSDFVolume(Point3i _res,  float _voxelSize, cv::Affine3f _pose, float _truncDist, int _maxWeight,
                                   float _raycastStepFactor);

 } // namespace kinfu

--- a/modules/rgbd/src/utils.cpp
+++ b/modules/rgbd/src/utils.cpp
@@ -50,38 +50,6 @@ namespace rgbd

 namespace kinfu {

-#if PRINT_TIME
-
-ScopeTime::ScopeTime(std::string name_, bool _enablePrint) :
-    name(name_), enablePrint(_enablePrint)
-{
-    start = (double)cv::getTickCount();
-    nested++;
-}
-
-ScopeTime::~ScopeTime()
-{
-    double time_ms =  ((double)cv::getTickCount() - start)*1000.0/cv::getTickFrequency();
-    if(enablePrint)
-    {
-        std::string spaces(nested, '-');
-        std::cout << spaces << "Time(" << name << ") = " << time_ms << " ms" << std::endl;
-    }
-    nested--;
-}
-
-int ScopeTime::nested  = 0;
-
-#else
-
-ScopeTime::ScopeTime(std::string /*name_*/, bool /*_enablePrint = true*/)
-{ }
-
-ScopeTime::~ScopeTime()
-{ }
-
-#endif
-
 } // namespace kinfu
 } // namespace cv

--- a/modules/rgbd/src/utils.hpp
+++ b/modules/rgbd/src/utils.hpp
@@ -47,14 +47,14 @@ rescaleDepthTemplated<double>(const Mat& in, Mat& out)

 namespace kinfu {

-// Print execution times of each block marked with ScopeTime
-#define PRINT_TIME 0
+// One place to turn intrinsics on and off
+#define USE_INTRINSICS CV_SIMD128

 typedef float depthType;

 const float qnan = std::numeric_limits<float>::quiet_NaN();
 const cv::Vec3f nan3(qnan, qnan, qnan);
-#if CV_SIMD128
+#if USE_INTRINSICS
 const cv::v_float32x4 nanv(qnan, qnan, qnan, qnan);
 #endif

@@ -63,26 +63,13 @@ inline bool isNaN(cv::Point3f p)
    return (cvIsNaN(p.x) || cvIsNaN(p.y) || cvIsNaN(p.z));
 }

-#if CV_SIMD128
+#if USE_INTRINSICS
 static inline bool isNaN(const cv::v_float32x4& p)
 {
    return cv::v_check_any(p != p);
 }
 #endif

-struct ScopeTime
-{
-    ScopeTime(std::string name_, bool _enablePrint = true);
-    ~ScopeTime();
-
-#if PRINT_TIME
-    static int nested;
-    const std::string name;
-    const bool enablePrint;
-    double start;
-#endif
-};
-
 /** @brief Camera intrinsics */
 struct Intr
 {
@@ -143,6 +130,16 @@ struct Intr
    float fx, fy, cx, cy;
 };

+inline size_t roundDownPow2(size_t x)
+{
+    size_t shift = 0;
+    while(x != 0)
+    {
+        shift++; x >>= 1;
+    }
+    return (size_t)(1ULL << (shift-1));
+}
+
 } // namespace kinfu

 } // namespace cv

--- a/modules/rgbd/test/test_kinfu.cpp
+++ b/modules/rgbd/test/test_kinfu.cpp
@@ -90,10 +90,18 @@ struct RenderInvoker : ParallelLoopBody
    float depthFactor;
 };

-struct CubeSpheresScene
+struct Scene
+{
+    virtual ~Scene() {}
+    static Ptr<Scene> create(int nScene, Size sz, Matx33f _intr, float _depthFactor);
+    virtual Mat depth(Affine3f pose) = 0;
+    virtual std::vector<Affine3f> getPoses() = 0;
+};
+
+struct CubeSpheresScene : Scene
 {
    const int framesPerCycle = 32;
-    const int nCycles = 1;
+    const float nCycles = 0.25f;
    const Affine3f startPose = Affine3f(Vec3f(-0.5f, 0.f, 0.f), Vec3f(2.1f, 1.4f, -2.1f));

    CubeSpheresScene(Size sz, Matx33f _intr, float _depthFactor) :
@@ -125,7 +133,7 @@ struct CubeSpheresScene
        return res;
    }

-    Mat depth(Affine3f pose)
+    Mat depth(Affine3f pose) override
    {
        Mat_<float> frame(frameSize);
        Reprojector reproj(intr);
@@ -136,10 +144,10 @@ struct CubeSpheresScene
        return frame;
    }

-    std::vector<Affine3f> getPoses()
+    std::vector<Affine3f> getPoses() override
    {
        std::vector<Affine3f> poses;
-        for(int i = 0; i < framesPerCycle*nCycles; i++)
+        for(int i = 0; i < (int)(framesPerCycle*nCycles); i++)
        {
            float angle = (float)(CV_2PI*i/framesPerCycle);
            Affine3f pose;
@@ -160,10 +168,10 @@ struct CubeSpheresScene
 };


-struct RotatingScene
+struct RotatingScene : Scene
 {
-    const int framesPerCycle = 64;
-    const int nCycles = 1;
+    const int framesPerCycle = 32;
+    const float nCycles = 0.5f;
    const Affine3f startPose = Affine3f(Vec3f(-1.f, 0.f, 0.f), Vec3f(1.5f, 2.f, -1.5f));

    RotatingScene(Size sz, Matx33f _intr, float _depthFactor) :
@@ -221,7 +229,7 @@ struct RotatingScene
        return res;
    }

-    Mat depth(Affine3f pose)
+    Mat depth(Affine3f pose) override
    {
        Mat_<float> frame(frameSize);
        Reprojector reproj(intr);
@@ -232,7 +240,7 @@ struct RotatingScene
        return frame;
    }

-    std::vector<Affine3f> getPoses()
+    std::vector<Affine3f> getPoses() override
    {
        std::vector<Affine3f> poses;
        for(int i = 0; i < framesPerCycle*nCycles; i++)
@@ -258,24 +266,42 @@ struct RotatingScene

 Mat_<float> RotatingScene::randTexture(256, 256);

+Ptr<Scene> Scene::create(int nScene, Size sz, Matx33f _intr, float _depthFactor)
+{
+    if(nScene == 0)
+        return makePtr<RotatingScene>(sz, _intr, _depthFactor);
+    else
+        return makePtr<CubeSpheresScene>(sz, _intr, _depthFactor);
+}
+
 static const bool display = false;

-TEST( KinectFusion, lowDense )
+void flyTest(bool hiDense, bool inequal)
 {
-    Ptr<kinfu::Params> params = kinfu::Params::coarseParams();
+    Ptr<kinfu::Params> params;
+    if(hiDense)
+        params = kinfu::Params::defaultParams();
+    else
+        params = kinfu::Params::coarseParams();
+
+    if(inequal)
+    {
+        params->volumeDims[0] += 32;
+        params->volumeDims[1] -= 32;
+    }

-    RotatingScene scene(params->frameSize, params->intr, params->depthFactor);
+    Ptr<Scene> scene = Scene::create(hiDense, params->frameSize, params->intr, params->depthFactor);

    Ptr<kinfu::KinFu> kf = kinfu::KinFu::create(params);

-    std::vector<Affine3f> poses = scene.getPoses();
+    std::vector<Affine3f> poses = scene->getPoses();
    Affine3f startPoseGT = poses[0], startPoseKF;
    Affine3f pose, kfPose;
    for(size_t i = 0; i < poses.size(); i++)
    {
        pose = poses[i];

-        Mat depth = scene.depth(pose);
+        Mat depth = scene->depth(pose);

        ASSERT_TRUE(kf->update(depth));

@@ -295,46 +321,35 @@ TEST( KinectFusion, lowDense )
        }
    }

-    ASSERT_LT(cv::norm(kfPose.rvec() - pose.rvec()), 0.01);
-    ASSERT_LT(cv::norm(kfPose.translation() - pose.translation()), 0.1);
+    double rvecThreshold = hiDense ? 0.01 : 0.02;
+    ASSERT_LT(cv::norm(kfPose.rvec() - pose.rvec()), rvecThreshold);
+    double poseThreshold = hiDense ? 0.03 : 0.1;
+    ASSERT_LT(cv::norm(kfPose.translation() - pose.translation()), poseThreshold);
 }

-TEST( KinectFusion, highDense )
+TEST( KinectFusion, lowDense )
 {
-    Ptr<kinfu::Params> params = kinfu::Params::defaultParams();
-    CubeSpheresScene scene(params->frameSize, params->intr, params->depthFactor);
-
-    Ptr<kinfu::KinFu> kf = kinfu::KinFu::create(params);
-
-    std::vector<Affine3f> poses = scene.getPoses();
-    Affine3f startPoseGT = poses[0], startPoseKF;
-    Affine3f pose, kfPose;
-    for(size_t i = 0; i < poses.size(); i++)
-    {
-        pose = poses[i];
-
-        Mat depth = scene.depth(pose);
-
-        ASSERT_TRUE(kf->update(depth));
-
-        kfPose = kf->getPose();
-        if(i == 0)
-            startPoseKF = kfPose;
+    flyTest(false, false);
+}

-        pose = (  startPoseGT.inv() * pose  )*startPoseKF;
+TEST( KinectFusion, highDense )
+{
+    flyTest(true, false);
+}

-        if(display)
-        {
-            imshow("depth", depth*(1.f/params->depthFactor/4.f));
-            Mat rendered;
-            kf->render(rendered);
-            imshow("render", rendered);
-            waitKey(10);
-        }
-    }
+TEST( KinectFusion, inequal )
+{
+    flyTest(false, true);
+}

-    ASSERT_LT(cv::norm(kfPose.rvec() - pose.rvec()), 0.01);
-    ASSERT_LT(cv::norm(kfPose.translation() - pose.translation()), 0.03);
+#ifdef HAVE_OPENCL
+TEST( KinectFusion, OCL )
+{
+    cv::ocl::setUseOpenCL(false);
+    flyTest(false, false);
+    cv::ocl::setUseOpenCL(true);
+    flyTest(false, false);
 }
+#endif

 }} // namespace
--- a/modules/rgbd/test/test_odometry.cpp
+++ b/modules/rgbd/test/test_odometry.cpp
@@ -235,10 +235,10 @@ void CV_OdometryTest::run(int)
    // On each iteration an input frame is warped using generated transformation.
    // Odometry is run on the following pair: the original frame and the warped one.
    // Comparing a computed transformation with an applied one we compute 2 errors:
-    // better_1time_count - count of poses which error is less than ground thrush pose,
-    // better_5times_count - count of poses which error is 5 times less than ground thrush pose.
+    // better_1time_count - count of poses which error is less than ground truth pose,
+    // better_5times_count - count of poses which error is 5 times less than ground truth pose.
    int iterCount = 100;
-    int better_1time_count = 0; 
+    int better_1time_count = 0;
    int better_5times_count = 0;
    for(int iter = 0; iter < iterCount; iter++)
    {
@@ -267,7 +267,6 @@ void CV_OdometryTest::run(int)
        waitKey();
 #endif

-
        // compare rotation
        double rdiffnorm = cv::norm(rvec - calcRvec),
               rnorm = cv::norm(rvec);

--- a/modules/rgbd/test/test_precomp.hpp
+++ b/modules/rgbd/test/test_precomp.hpp
@@ -13,6 +13,10 @@
 #include <opencv2/rgbd.hpp>
 #include <opencv2/calib3d.hpp>

+#ifdef HAVE_OPENCL
+#include <opencv2/core/ocl.hpp>
+#endif
+
 namespace opencv_test {
 using namespace cv::rgbd;
 }