Instant OpenCV for iOS
Overview of this book

Computer vision on mobile devices is becoming more and more popular. Personal gadgets are now powerful enough to process high-resolution images, stitch panoramas, and detect and track objects. OpenCV, with its decent performance and wide range of functionality, can be an extremely useful tool in the hands of iOS developers.

Instant OpenCV for iOS is a practical guide that walks you through every important step for building a computer vision application for the iOS platform. It will help you to port your OpenCV code, profile and optimize it, and wrap it into a GUI application. Each recipe is accompanied by a sample project or an example that helps you focus on a particular aspect of the technology. The book starts by creating a simple iOS application and linking OpenCV, before moving on to processing images and videos in real time. It covers the major ways to retrieve images, process them, and view or export the results. Special attention is also given to performance issues, as they greatly affect the user experience.

Several computer vision projects are considered throughout the book. These include a couple of photo filters that help you to print a postcard or add a retro effect to your images, and a demonstration of a facial feature detection algorithm. In several time-critical cases, the processing speed is measured and optimized using ARM NEON and the Accelerate framework. Instant OpenCV for iOS gives you all the information you need to build a high-performance computer vision application for iOS devices.
Table of Contents (7 chapters)

Optimizing performance with ARM NEON (Advanced)


NEON is a set of single instruction, multiple data (SIMD) instructions for ARM, and it can help in performance optimization. In this recipe, we will learn how to add NEON support to your project, and how to vectorize the code using it.

Getting ready

We will use the Recipe12_ProcessingVideo project as a starting point, trying to minimize the processing time. The source code is available in the Recipe14_OptimizingWithNEON folder in the code bundle that accompanies this book. For this recipe, you can't use the Simulator, because NEON instructions are ARM-specific, while the Simulator runs x86 code.

How to do it...

The following is how we will optimize our video processing application:

  1. Profile the application and find hotspots.

  2. Enable NEON support in our source code.

  3. Create an alternative implementation for the bottleneck functions using NEON.

Let's implement the described steps:

  1. First of all, we need to profile the RetroFilter::applyToVideo method, as it is the most time-consuming part of our application. We'll create a copy of this method with the name applyToVideo_optimized, and insert time measurements into it, as we did in the Printing a postcard (Intermediate) recipe. We'll not show the code of the method here, as it differs only in these measurements.

    Note

    It is generally a good practice to use special profiling tools to find hotspots in an application. But in our case, we only have a few functions, and it is better to measure their individual time without using any tools. Image processing tasks are quite time consuming, so you can easily detect bottlenecks with simple logging, and focus on optimization.
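As a sketch of such logging-based measurement, the following portable C++ helper times one processing step with std::chrono and prints it in the same style as the console log below. The macro and the stub workload are our illustration, not the book's actual code:

```cpp
#include <chrono>
#include <cstdio>

// Hypothetical timing helper (our sketch, not the book's actual macro):
// runs a piece of code once and logs its duration in the TIMER_* style
// used in the console log below.
#define TIME_STEP(name, code)                                        \
    do {                                                             \
        auto start_ = std::chrono::steady_clock::now();              \
        code;                                                        \
        auto end_ = std::chrono::steady_clock::now();                \
        double ms_ = std::chrono::duration<double, std::milli>(      \
                         end_ - start_).count();                     \
        std::printf("TIMER_%s: %.2fms\n", name, ms_);                \
    } while (0)

// Example workload standing in for one processing step.
double convertToGrayStub()
{
    double acc = 0.0;
    for (int i = 0; i < 1000000; i++)
        acc += i * 0.5;
    return acc;
}
```

A call such as `TIME_STEP("ConvertingToGray", convertToGrayStub());` then prints one `TIMER_ConvertingToGray: ...ms` line per invocation.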

  2. The following is a sample console log with processing steps:

    TIMER_ConvertingToGray: 8.28ms
    TIMER_IntensityVariation: 16.23ms
    TIMER_AddingScratches: 4.46ms
    TIMER_FuzzyBorder: 14.65ms
    TIMER_ConvertingToBGR: 2.59ms
    2013-05-25 19:04:12.879 Recipe14_OptimizingWithNEON[4503:5203] Processing time = 48.05ms; Running average FPS = 20.1;
    

    Profiling will show that there are two major hotspots in our application: the alphaBlendC1 function and the multiplication of a matrix by a scalar (intensity variation). Because both functions process individual pixels independently, we can parallelize their execution. We then have several choices, such as multi-threading (via libdispatch) or vectorization using the NEON SIMD instruction set. To process an image with several threads, we can split it into several stripes (for example, into four horizontal stripes) and process them as submatrices. This approach is quite easy to implement, and it doesn't require any memory copying.
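The striping approach can be sketched in portable C++, with std::thread standing in for libdispatch and a raw byte buffer standing in for cv::Mat (the function names here are ours, and the per-pixel work is a trivial stand-in for the real filter):

```cpp
#include <algorithm>
#include <cstdint>
#include <thread>
#include <vector>

// Process a horizontal stripe of rows [rowBegin, rowEnd) in place.
// Here we simply invert every pixel; a real filter would run
// alphaBlendC1 and friends on the stripe.
static void processStripe(uint8_t* data, int cols,
                          int rowBegin, int rowEnd)
{
    for (int y = rowBegin; y < rowEnd; y++)
        for (int x = 0; x < cols; x++)
            data[y * cols + x] = uint8_t(255 - data[y * cols + x]);
}

// Split the image into numThreads horizontal stripes and process them
// in parallel; no pixel data is copied, each thread works in place on
// its own rows.
void processParallel(uint8_t* data, int rows, int cols, int numThreads)
{
    std::vector<std::thread> workers;
    int rowsPerStripe = (rows + numThreads - 1) / numThreads;
    for (int t = 0; t < numThreads; t++)
    {
        int begin = t * rowsPerStripe;
        int end = std::min(rows, begin + rowsPerStripe);
        if (begin >= end)
            break;
        workers.emplace_back(processStripe, data, cols, begin, end);
    }
    for (auto& w : workers)
        w.join();
}
```

Because each thread touches a disjoint row range, no locking is needed.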

  3. But let's focus on NEON; we will put the vectorized code into the Processing_NEON.cpp file of the CvEffects static library project. It is shown in the following code snippet:

    #include "Processing.hpp"
    
    #if defined(__ARM_NEON__)
      #include <arm_neon.h>
    #endif
    
    #define USE_NEON true
    #define USE_FIXED_POINT false
    
    using namespace cv;
    
    void alphaBlendC1_NEON(const Mat& src, Mat& dst, const Mat& alpha)
    {
        CV_Assert(src.type() == CV_8UC1 && dst.type() == CV_8UC1 &&
                  alpha.type() == CV_8UC1 &&
                  src.isContinuous() && dst.isContinuous() &&
                  alpha.isContinuous() &&
                  (src.cols % 8 == 0) &&
                  (src.cols == dst.cols) && (src.cols == alpha.cols));
        
    #if !defined(__ARM_NEON__) || !USE_NEON
        alphaBlendC1(src, dst, alpha);
    #else
        uchar* pSrc = src.data;
        uchar* pDst = dst.data;
        uchar* pAlpha = alpha.data;
        for(int i=0; i < src.total(); i+=8, pSrc+=8, pDst+=8, pAlpha+=8)
        {
            // Load data from memory to NEON registers
            uint8x8_t vsrc = vld1_u8(pSrc);
            uint8x8_t vdst = vld1_u8(pDst);
            uint8x8_t valpha = vld1_u8(pAlpha);
            uint8x8_t v255 = vdup_n_u8(255);
            
            // Multiply source pixels
            uint16x8_t mult1 = vmull_u8(vsrc, valpha);
            
            // Multiply destination pixels
            uint8x8_t tmp = vsub_u8(v255, valpha);
            uint16x8_t mult2 = vmull_u8(tmp, vdst);
            
            // Add them
            uint16x8_t sum = vaddq_u16(mult1, mult2);
            
            // Take upper bytes (approximates division by 255)
            uint8x8_t out = vshrn_n_u16(sum, 8);
            
            // Store the result back to the memory
            vst1_u8(pDst, out);
        }
    #endif
    }
    
    void multiply_NEON(Mat& src, float multiplier)
    {
        CV_Assert(src.type() == CV_8UC1 && src.isContinuous() &&
                  (src.cols % 8 == 0));
        
    #if !defined(__ARM_NEON__) || !USE_NEON
        src *= multiplier;
    #elif USE_FIXED_POINT
        uchar fpMult = uchar((multiplier * 128.f) + 0.5f);
        uchar* ptr = src.data;
        for(int i = 0; i < src.total(); i+=8, ptr+=8)
        {
            uint8x8_t vsrc = vld1_u8(ptr);
            uint8x8_t vmult = vdup_n_u8(fpMult);
            uint16x8_t product = vmull_u8(vsrc, vmult);
            uint8x8_t out = vqshrn_n_u16(product, 7);
            vst1_u8(ptr, out);
        }
    
    #else
        uchar* ptr = src.data;
        for(int i = 0; i < src.total(); i+=8, ptr+=8)
        {
            float32x4_t vmult1 = vdupq_n_f32(multiplier);
            float32x4_t vmult2 = vdupq_n_f32(multiplier);
            
            uint8x8_t in = vld1_u8(ptr); // Load
            
            // Convert to 16bit
            uint16x8_t in16bit = vmovl_u8(in);
            
            // Split vector
            uint16x4_t in16bit1 = vget_high_u16(in16bit);
            uint16x4_t in16bit2 = vget_low_u16(in16bit);
            
            // Convert to float
            uint32x4_t in32bit1 = vmovl_u16(in16bit1);
            uint32x4_t in32bit2 = vmovl_u16(in16bit2);
            float32x4_t inFlt1 = vcvtq_f32_u32(in32bit1);
            float32x4_t inFlt2 = vcvtq_f32_u32(in32bit2);
            
            // Multiplication
            float32x4_t outFlt1 = vmulq_f32(vmult1, inFlt1);
            float32x4_t outFlt2 = vmulq_f32(vmult2, inFlt2);
            
            // Convert from float
            uint32x4_t out32bit1 = vcvtq_u32_f32(outFlt1);
            uint32x4_t out32bit2 = vcvtq_u32_f32(outFlt2);
            uint16x4_t out16bit1 = vmovn_u32(out32bit1);
            uint16x4_t out16bit2 = vmovn_u32(out32bit2);
            
            // Combine back
            uint16x8_t out16bit = vcombine_u16(out16bit2, out16bit1);
            
            // Convert to 8bit
            uint8x8_t out8bit = vqmovn_u16(out16bit);
            
            // Store to the memory
            vst1_u8(ptr, out8bit);
        }
    #endif
    }
  4. Now, we should call these functions from the applyToVideo_optimized method.

  5. When ready, build and run the application. Depending on your device, you can see up to a 2x overall speedup; the speedup of the optimized functions alone is much higher.

How it works...

Nowadays, SIMD instructions are available on many architectures, from desktop CPUs to embedded DSPs. ARM processors provide a rich set of such instructions, called NEON; they are available on all iOS devices starting from the iPhone 3GS.

To start writing NEON code, you have to add the following declaration to your file:

#if defined(__ARM_NEON__)
  #include <arm_neon.h>
#endif

Now you can use all the types and functions declared there. Please note that we're going to use so-called intrinsics: functions in C that serve as wrappers over NEON assembler instructions. In fact, you can write your code in pure assembler, but that worsens readability, and although there may be a small performance gain, it usually isn't worth it.

Let's consider how the alphaBlendC1_NEON function works. This function should use the following formula to calculate the resulting pixel's value:

dst(x, y) = [alpha(x, y) * src(x, y) + (255.0 - alpha(x, y)) * dst(x, y)] / 255.0;

The NEON code does exactly that, except for the very last division, which is approximated by a bit shift of 8 positions to the right (the vshrn_n_u16 function). This means that we divide by 256 instead of 255, so the result of the vectorized function may differ from that of the original implementation. We can tolerate this, as we're working on a visual effect and the possible difference is negligibly small, but please note that such approximations may be unacceptable in a numerical pipeline.
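The size of that approximation error can be checked with a scalar model in plain C++, no NEON required (the function names are ours):

```cpp
#include <cstdint>
#include <cstdlib>

// Exact blend: (alpha*src + (255 - alpha)*dst) / 255.
uint8_t blendExact(uint8_t src, uint8_t dst, uint8_t alpha)
{
    return uint8_t((alpha * src + (255 - alpha) * dst) / 255);
}

// NEON-style approximation: taking the upper byte of the 16-bit sum
// (as vshrn_n_u16(sum, 8) does) divides by 256 instead of 255.
uint8_t blendApprox(uint8_t src, uint8_t dst, uint8_t alpha)
{
    return uint8_t((alpha * src + (255 - alpha) * dst) >> 8);
}

// Scan a grid of inputs and report the largest per-pixel difference
// between the exact and the approximated blend.
int maxBlendError()
{
    int maxErr = 0;
    for (int s = 0; s < 256; s++)
        for (int d = 0; d < 256; d++)
            for (int a = 0; a < 256; a += 5)
            {
                int err = std::abs(int(blendExact(s, d, a)) -
                                   int(blendApprox(s, d, a)));
                if (err > maxErr)
                    maxErr = err;
            }
    return maxErr;
}
```

The maximum difference turns out to be a single gray level, which is invisible in a photo filter.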

You can also see that we process 8 pixels simultaneously. Our alphaBlendC1_NEON function heavily relies on the exact format of the input matrices (that is, single channel, continuous, and with the number of columns being a multiple of 8), but it can easily be generalized for other situations.

Note

If the image width is not divisible by the width of the SIMD instruction, the common practice is to process the tail with ordinary C code. As images are normally large enough, this non-vectorized processing near the right-hand border doesn't affect performance much.
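The pattern looks like this in scalar C++, with 8 standing in for the vector width (in real code the first loop body would be NEON intrinsics; the function name and the doubling operation are our illustration):

```cpp
#include <cstdint>

// Double every pixel of a row with saturation. The main loop walks in
// chunks of 8 (the "vector" width); the tail loop finishes the
// remaining 0-7 pixels with ordinary scalar code.
void scaleRow(uint8_t* row, int cols)
{
    int i = 0;
    int vecEnd = cols - (cols % 8);  // last index the chunked loop covers

    for (; i < vecEnd; i += 8)       // "vectorized" part, 8 pixels per step
        for (int j = 0; j < 8; j++)
        {
            int v = row[i + j] * 2;
            row[i + j] = uint8_t(v > 255 ? 255 : v);
        }

    for (; i < cols; i++)            // scalar tail, at most 7 pixels
    {
        int v = row[i] * 2;
        row[i] = uint8_t(v > 255 ? 255 : v);
    }
}
```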

The multiply_NEON function performs a simple multiplication by a floating-point coefficient, but it needs a sequence of conversions to carry it out. Still, because we process 8 pixels simultaneously, the speedup is impressive.
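That conversion chain can be modeled per pixel in plain C++ (the function name is ours; the real NEON code performs each of these steps on 8 lanes at once):

```cpp
#include <cstdint>

// Mirrors the NEON float path for a single pixel:
// u8 -> u16 -> u32 -> f32 (vmovl_u8, vmovl_u16, vcvtq_f32_u32),
// multiply (vmulq_f32), convert back with truncation (vcvtq_u32_f32),
// then narrow with the saturation that vqmovn_u16 provides.
uint8_t multiplyPixel(uint8_t src, float multiplier)
{
    float product = float(src) * multiplier;  // vmulq_f32
    uint32_t back = uint32_t(product);        // vcvtq_u32_f32 (truncates)
    return back > 255 ? 255 : uint8_t(back);  // vqmovn_u16 saturation
}
```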

There's more...

Performance optimization with NEON is a deep and wide subject. Most image processing functions can be optimized for a 3x speedup without affecting accuracy, and you can get even more if you apply some approximations. In the following sections, we provide some pointers for further study.

NEON

The ARM Information Center provides extensive documentation on NEON intrinsics; it can be found at http://bit.ly/3848_ARMNEON. You can see that the instruction set is quite rich and allows you to optimize your code in many different situations.

Fixed-point arithmetic

Our multiply function is a naive translation of the C++ code to NEON intrinsics. But sometimes it is possible to achieve a much better speedup by using some approximation. A very popular method of approximating floating-point calculations is so-called fixed-point arithmetic, where real numbers are stored in variables of integer type (http://en.wikipedia.org/wiki/Fixed-point_arithmetic).

In our case, we can convert the value of multiplier into the Q1.7 format, perform multiplication, and then scale the result back. More about the Qm.n format can be found at http://en.wikipedia.org/wiki/Q_(number_format). The only difference is that the actual Q1.7 format requires 9 bits, where the first bit is used for the sign. But because pixel values are positive, we can drop the sign bit and pack the Q1.7 format into 8 bits of a single byte.

In the following code, we demonstrate the use of the fixed-point arithmetic:

    uchar src = 111;
    float multiplier = 0.76934;
    uchar dst = 0;

    dst = uchar(src * multiplier);
    printf("dst floating-point = %d\n", dst);

    uchar fpMultiplier = uchar((multiplier * 128.f) + 0.5f);
    dst = (src * fpMultiplier) >> 7; // 128 = 2^7
    printf("dst fixed-point = %d\n", dst);

The following is the console output for that code. You can see that the approximation is not exact, but again, we can tolerate it in our application. We can also try to use the Qm.n format with a larger value of n, for example, Q1.15:

    dst floating-point = 85
    dst fixed-point = 84

It can be seen that fixed-point arithmetic uses integer operations instead of floating-point ones, and so is much more efficient. At the same time, it can be effectively vectorized with NEON, producing even higher speedups.

Please note that you shouldn't expect a speedup from this in our example, as the NEON version is already good enough. But if the numerical pipeline is a little more complicated, fixed-point arithmetic may give you an impressive speedup.
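As mentioned above, the same recipe works with Q1.15: with 15 fractional bits, the multiplier is rounded 256 times more precisely than in Q1.7. A minimal sketch (the function name is ours; it assumes the multiplier lies in [0, 2), as the Q1.15 format requires):

```cpp
#include <cstdint>

// Q1.15 variant of the Q1.7 example above: the multiplier is stored
// with 15 fractional bits in a 16-bit variable. Assumes
// 0 <= multiplier < 2, so the scaled value fits into uint16_t.
uint8_t multiplyQ15(uint8_t src, float multiplier)
{
    uint16_t fpMultiplier = uint16_t((multiplier * 32768.f) + 0.5f);
    return uint8_t((uint32_t(src) * fpMultiplier) >> 15); // 32768 = 2^15
}
```

For src = 111 and multiplier = 0.76934, this returns 85, matching the floating-point result above.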