Android* Application Optimization on Intel® Architecture Multi-core Processors

1. Introduction

Android* 4.1 includes improvements that optimize multi-threaded applications running on multi-core processors: the Android operating system can schedule threads to run on each CPU core. In addition, on Intel® architecture (IA)-based devices, you have another way to implement multi-core optimization: Intel® Threading Building Blocks (Intel® TBB).

Intel TBB can be downloaded from http://threadingbuildingblocks.org/download; choose the Linux* version. You also need libtbb.so and libtbbmalloc.so, which are found under lib/android in the package.

2. Using Intel TBB with SSE

As we know, SSE (Streaming SIMD Extensions) can deliver impressive performance improvements on IA devices, but it only exploits a single core. Using Intel TBB together with SSE gets the best performance on multi-core devices. Here, I use a program called YUV2RGB as an example.

YUV2RGB is a color space transform function. You can find it in many open-source projects, such as ffmpeg and opencv, and it generally has an SSE (or MMX2) implementation for IA platforms. The SSE optimization alone gives nearly a 6x performance improvement.
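For reference, the scalar path that the SSE code replaces looks roughly like the sketch below. This is not the ffmpeg or opencv implementation; it is a minimal illustration assuming planar YUV420 input, packed RGB888 output, and full-range BT.601 coefficients (the function and parameter names are mine). The SSE version performs roughly the same arithmetic on 8 pixels at a time, which is where the gain in the table below comes from.

#include <algorithm>
#include <cstdint>

// Hypothetical scalar reference: planar YUV420 in, RGB888 out,
// full-range BT.601 coefficients in 16.16 fixed point.
static inline uint8_t clamp_u8(int v) { return (uint8_t)std::min(255, std::max(0, v)); }

void yuv420_to_rgb888_c(const uint8_t* y, const uint8_t* u, const uint8_t* v,
                        uint8_t* rgb, int width, int height)
{
    for (int row = 0; row < height; ++row) {
        for (int col = 0; col < width; ++col) {
            int Y = y[row * width + col];
            int U = u[(row / 2) * (width / 2) + (col / 2)] - 128;   // chroma is subsampled 2x2
            int V = v[(row / 2) * (width / 2) + (col / 2)] - 128;
            uint8_t* p = rgb + (row * width + col) * 3;
            p[0] = clamp_u8(Y + ((91881 * V) >> 16));               // R ~= Y + 1.402*V
            p[1] = clamp_u8(Y - ((22554 * U + 46802 * V) >> 16));   // G ~= Y - 0.344*U - 0.714*V
            p[2] = clamp_u8(Y + ((116130 * U) >> 16));              // B ~= Y + 1.772*U
        }
    }
}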

3040*1824 YUV2RGB888      SWS_BILINEAR      SWS_FAST_BILINEAR
C code                    179 ms            155 ms
SSE code                  27 ms             27 ms

See * below for performance tests notice.

But on an Intel® Atom™ processor with 2 cores and 4 hyper-threads, Intel TBB gets a further 2.2x performance improvement. The result should be even better when running on more CPU cores.

3040*1824 YUV2RGB888      SWS_BILINEAR      SWS_FAST_BILINEAR
SSE code + TBB            12 ms             12 ms

See * below for performance tests notice.

3. How to Do It in Detail

For image processing, we generally apply SSE optimization along the image width. That is to say, we take 8~16 pixels, pack them into an XMM register, and then run the SSE instructions across the whole width. Because the operation is nearly the same for every row, parallelizing along the height is possible. Sample code for parallel image processing is shown below:

#include "tbb/task_scheduler_init.h"
#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"

using namespace tbb;

class multi_two
{
public:
  void operator()(const tbb::blocked_range<size_t>& range) const
  {		
		Log (“range.begin(),range.end()”);
		for(int j=range.begin();j<range.end();j++) {
			for(i=0; i<ImgWidth; i+=8) {
				__asm__ volatile(
				// do sse 
			);
				YUVBuffer+=8;
			}
		}
		//.. the same as UV
  }
    multi_two(BYTE * _RGBBuffer, BYTE * _YUVBuffer)
    {
    	RGBBuffer = _ RGBBuffer;
		YUVBuffer = _ YUVBuffer;
    }
private:
		BYTE * RGBBuffer;
BYTE * YUVBuffer;
};

void YUV2RGB(BYTE * YUVBuffer, BYTE * RGBBuffer)
{
	tbb::parallel_for(tbb::blocked_range<size_t>(0, height/2), multi_two(YUVBuffer , RGBBuffer ));
	
	//if using MMX, remember add emms after all parallel computing finish
	//__asm__ volatile("emms");	
}

parallel_for is the simplest Intel TBB construct. You create a class that overloads operator() and pass an instance of it as a parameter. The blocked range is important: it tells Intel TBB to split the whole operation into several task ranges, each running from range.begin() to range.end(). So if you set the range to [0, 4], you get four log entries, which may appear as [2,3), [0,1), [3,4), [1,2). Generally, we set the task range from 0 to height/2 (for the UV plane, the height is the Y height/2).
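A minimal stand-alone sketch of this splitting behavior (using printf in place of the Log call above) is shown below; the exact sub-ranges, and the order they appear in, are up to the TBB scheduler.

#include <cstdio>
#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"

struct print_range
{
    void operator()(const tbb::blocked_range<size_t>& range) const
    {
        // Each task receives a contiguous sub-range [begin, end).
        std::printf("task range: [%zu, %zu)\n", range.begin(), range.end());
    }
};

int main()
{
    // With the range [0, 4) and the default grain size of 1, TBB may split
    // the work into four tasks, logged in a nondeterministic order such as
    // [2,3) [0,1) [3,4) [1,2).
    tbb::parallel_for(tbb::blocked_range<size_t>(0, 4), print_range());
    return 0;
}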

Note: a task is not a thread. Intel TBB creates a thread pool according to the number of CPU cores and distributes the tasks among the threads in that pool. Because Intel TBB splits the tasks evenly across the threads (each thread bound to a CPU core), it achieves better load balancing, and thus better performance, than hand-rolled multi-threading.
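In the TBB releases this article targets, the size of that thread pool can also be inspected or capped explicitly through tbb::task_scheduler_init (the header is already included in the sample above); newer oneTBB releases replace this class with task_arena and global_control. This is optional, since TBB sizes the pool automatically, and the sketch below only illustrates one way it might be done.

#include "tbb/task_scheduler_init.h"

void init_tbb_pool()
{
    // default_num_threads() reports how many workers TBB would create by
    // default, normally one per logical CPU core.
    int cores = tbb::task_scheduler_init::default_num_threads();

    // Constructing task_scheduler_init explicitly caps the pool; one could,
    // for example, pass cores - 1 to leave a core free for the UI thread.
    tbb::task_scheduler_init init(cores);

    // ... run parallel_for work while 'init' is alive ...
}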

4. Intel TBB vs. Multi-Thread

#include <pthread.h>

pthread_t m_StreamingThread1, m_StreamingThread2, m_StreamingThread3, m_StreamingThread4;

pthread_create(&m_StreamingThread1, NULL, streamingThreadProc1, NULL);
pthread_create(&m_StreamingThread2, NULL, streamingThreadProc2, NULL);
pthread_create(&m_StreamingThread3, NULL, streamingThreadProc3, NULL);
pthread_create(&m_StreamingThread4, NULL, streamingThreadProc4, NULL);

void* status;
pthread_join(m_StreamingThread1, &status);  // wait for thread 1 to finish
pthread_join(m_StreamingThread2, &status);  // wait for thread 2 to finish
pthread_join(m_StreamingThread3, &status);  // wait for thread 3 to finish
pthread_join(m_StreamingThread4, &status);  // wait for thread 4 to finish

This demo compares Intel TBB with manual multi-threading. The multi-threaded version splits the work into four pthreads, as shown above.

When no other threads are running, the execution times for Intel TBB and multi-threading are nearly the same. But when I add some dummy threads that just wait 200 ms, things change: the multi-threaded version becomes slower even than the single-threaded one, while Intel TBB's time stays good. You might assume that all four worker threads finish at the same time, but in practice the other threads block some of the four workers, so the total finish time gets longer. The Intel TBB side of the comparison is sketched below.
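For comparison, the Intel TBB side of this demo can express the same four-way split as four tasks in a single parallel_for. The sketch below is my own stand-in (streamingProc and run_streaming_tbb are hypothetical names for the per-chunk work that streamingThreadProc1..4 would do); because the chunks are tasks, work that would otherwise sit behind a blocked thread can be stolen by the other cores.

#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"

// Hypothetical stand-in for the per-chunk work done by streamingThreadProc1..4.
static void streamingProc(size_t chunk)
{
    (void)chunk;
    // ... per-chunk streaming work ...
}

struct streaming_body
{
    void operator()(const tbb::blocked_range<size_t>& range) const
    {
        for (size_t i = range.begin(); i != range.end(); ++i)
            streamingProc(i);
    }
};

void run_streaming_tbb()
{
    // The same four chunks as the pthread version, expressed as tasks that
    // TBB's work-stealing scheduler distributes across the thread pool.
    tbb::parallel_for(tbb::blocked_range<size_t>(0, 4), streaming_body());
}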

Test results are shown in the table below (each step adds 2 more dummy threads; each dummy thread simply waits 10 ms).

STEP      Multi-Thread      Intel TBB
1         296 ms            272 ms
2         316 ms            280 ms
3         341 ms            292 ms
4         363 ms            298 ms
5         457 ms            310 ms

See * below for performance tests notice.

"*": Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
Configurations: [describe config + what test used + who did testing]. For more information go to http://www.intel.com/performance
