Series
- Part 1: compile opencv on ubuntu 16.04
- Part 2: compile opencv with CUDA support on windows 10
- Part 3: opencv mat for loop
- Part 4: speed up opencv image processing with openmp
Guide
config
- Linux/Windows: build with CMake and add -fopenmp to the C++ compiler flags (CXX_FLAGS=-fopenmp)
- Windows (Visual Studio): VS also supports OpenMP; enable it under C/C++ | Language | /openmp
usage
code
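The original hello.cpp listing is not preserved here. The following is only a minimal sketch of what such a program could look like; the function name hello_openmp and the message text are assumptions for illustration. The omp.h calls are guarded with _OPENMP so the file still builds when -fopenmp is omitted.

```cpp
#include <cstdio>
#ifdef _OPENMP
#include <omp.h>
#endif

// Prints one greeting per thread and returns how many greetings were printed.
// hello_openmp is a hypothetical name, not taken from the original post.
int hello_openmp() {
    int greetings = 0;
    #pragma omp parallel
    {
        int id = 0, total = 1;  // serial values when built without -fopenmp
#ifdef _OPENMP
        id = omp_get_thread_num();
        total = omp_get_num_threads();
#endif
        std::printf("hello from thread %d of %d\n", id, total);
        #pragma omp atomic
        ++greetings;  // shared counter, updated atomically
    }
    return greetings;
}
```

Calling hello_openmp() from main and compiling with g++ hello.cpp -fopenmp should print one line per thread; the order of the lines is not deterministic.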
CMakeLists.txt
use CXX_FLAGS=-fopenmp in CMakeLists.txt
cmake_minimum_required(VERSION 3.0.0)
options
or use g++ hello.cpp -fopenmp to compile
view demo
list dynamic dependencies (ldd)
ldd hello
linux-vdso.so.1 => (0x00007ffd71365000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f8ea7f00000)
libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1 (0x00007f8ea7cde000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f8ea7914000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f8ea760b000)
/lib64/ld-linux-x86-64.so.2 (0x00007f8ea8282000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f8ea73f5000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f8ea71f1000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f8ea6fd4000)
Note the OpenMP runtime (GNU libgomp) among the dependencies: libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1
list names (nm)
nm hello
0000000000602080 B __bss_start
0000000000602190 b completed.7594
U __cxa_atexit@@GLIBC_2.2.5
0000000000602070 D __data_start
0000000000602070 W data_start
0000000000400b00 t deregister_tm_clones
0000000000400b80 t __do_global_dtors_aux
0000000000601df8 t __do_global_dtors_aux_fini_array_entry
0000000000602078 d __dso_handle
0000000000601e08 d _DYNAMIC
0000000000602080 D _edata
0000000000602198 B _end
0000000000400d44 T _fini
0000000000400ba0 t frame_dummy
0000000000601de8 t __frame_dummy_init_array_entry
0000000000400f18 r __FRAME_END__
0000000000602000 d _GLOBAL_OFFSET_TABLE_
0000000000400c28 t _GLOBAL__sub_I_main
w __gmon_start__
0000000000400d54 r __GNU_EH_FRAME_HDR
U GOMP_parallel@@GOMP_4.0
U __gxx_personality_v0@@CXXABI_1.3
00000000004009e0 T _init
0000000000601df8 t __init_array_end
0000000000601de8 t __init_array_start
0000000000400d50 R _IO_stdin_used
w _ITM_deregisterTMCloneTable
w _ITM_registerTMCloneTable
0000000000601e00 d __JCR_END__
0000000000601e00 d __JCR_LIST__
w _Jv_RegisterClasses
0000000000400d40 T __libc_csu_fini
0000000000400cd0 T __libc_csu_init
U __libc_start_main@@GLIBC_2.2.5
0000000000400bc6 T main
0000000000400c3d t main._omp_fn.0
U omp_get_num_threads@@OMP_1.0
U omp_get_thread_num@@OMP_1.0
0000000000400b40 t register_tm_clones
0000000000400ad0 T _start
0000000000602080 d __TMC_END__
0000000000400bea t _Z41__static_initialization_and_destruction_0ii
U _ZNSolsEPFRSoS_E@@GLIBCXX_3.4
U _ZNSt8ios_base4InitC1Ev@@GLIBCXX_3.4
U _ZNSt8ios_base4InitD1Ev@@GLIBCXX_3.4
0000000000602080 B _ZSt4cout@@GLIBCXX_3.4
U _ZSt4endlIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_@@GLIBCXX_3.4
0000000000602191 b _ZStL8__ioinit
U _ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_c@@GLIBCXX_3.4
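Note the undefined (U) OpenMP symbols omp_get_num_threads and omp_get_thread_num; they are resolved from libgomp when the program is loaded.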
OpenMP Introduction
OpenMP directive format
#pragma omp directive [clause[clause]…]
#pragma omp parallel private(i, j)
Here parallel is the directive and private is a clause.
directive
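- parallel: placed before a code block; the block is executed by multiple threads in parallel.
- for: placed before a for loop; the loop iterations are distributed across the threads. The iterations must be independent of one another.
- parallel for: parallel and for combined; placed before a for loop, whose iterations are then executed by multiple threads in parallel.
- sections: placed before a group of code blocks that may be executed in parallel.
- parallel sections: parallel and sections combined.
- critical: placed before a critical section, which only one thread may execute at a time.
- single: placed before a block that is executed by only one thread of the team.
- flush: makes a thread's view of shared variables consistent with memory.
- barrier: synchronizes the threads inside a parallel region; every thread stops at the barrier and continues only after all threads have reached it.
- atomic: specifies that a memory location is updated atomically.
- master: specifies that a block is executed by the master thread only.
- ordered: specifies that a region inside a parallel loop is executed in sequential iteration order.
- threadprivate: specifies that a variable is private to each thread.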
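To make two of the directives above concrete, here is a small sketch that uses atomic for a single scalar update and critical for a compound read-compare-write. The Summary struct and the summarize function are hypothetical names introduced only for this illustration.

```cpp
#include <vector>

// Hypothetical result type for this sketch.
struct Summary { long count; int max; };

// Counts the elements and finds the maximum of v.
Summary summarize(const std::vector<int>& v) {
    Summary s{0, v.empty() ? 0 : v[0]};
    const int n = static_cast<int>(v.size());
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        #pragma omp atomic   // one scalar update: atomic is enough
        s.count += 1;
        #pragma omp critical // read, compare and write must not interleave
        {
            if (v[i] > s.max) s.max = v[i];
        }
    }
    return s;
}
```

This only illustrates the directives; serializing every iteration like this is slow, and a reduction clause would normally be the better choice for both values.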
parallel for
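OpenMP places the following five requirements on a loop that is to be parallelized:
- The loop variable (i.e. i) must be an integer. OpenMP 2.x required a signed integer; OpenMP 3.0 and later also accept unsigned integers, pointers and random-access iterators.
- The loop condition must be one of <, <=, > or >=.
- The increment must add or subtract a value that does not change between iterations.
- If the comparison is < or <=, i must increase on every iteration; otherwise it must decrease.
- The loop must be well structured: control may not jump out of the loop, goto and break may only jump within the loop body, and exceptions must be caught inside the loop.
If your loop does not meet these requirements, you have to rewrite it.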
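A loop that meets the requirements above could look like the following sketch. The function name scale is an assumption for illustration; the loop uses a signed int index, a < comparison, a constant increment and no early exit.

```cpp
#include <vector>

// Multiplies every element of v by factor.
// The loop is in canonical form, so "parallel for" can split it across threads.
void scale(std::vector<double>& v, double factor) {
    const int n = static_cast<int>(v.size());  // loop bound fixed before the loop
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        v[i] *= factor;  // each iteration touches only its own element
    }
}
```

A loop such as `for (auto it = v.begin(); it != v.end(); ++it)` with a break inside would not qualify and would have to be rewritten along these lines first.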
avoid race condition
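Even when a loop meets all five requirements, it may still not parallelize correctly because of data dependences. This happens when one iteration depends on data that a different iteration produces.
// assume the array has been initialized to 1
ERROR.
omp_set_num_threads(4);
same as
omp_set_num_threads(4);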
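The original listings for this section are truncated, so the following is only a sketch of the idea, with hypothetical function names. The first loop has a loop-carried dependence (iteration i reads what iteration i-1 wrote) and is left sequential; the second reads only from an untouched input array, so its iterations are independent and safe to parallelize.

```cpp
#include <vector>

// Running sum: a[i] depends on the a[i-1] written by the previous iteration.
// Adding "parallel for" here would be a race, so the loop stays sequential.
std::vector<int> running_sum_sequential(std::vector<int> a) {
    for (int i = 1; i < static_cast<int>(a.size()); ++i) {
        a[i] = a[i] + a[i - 1];
    }
    return a;
}

// Neighbor sum: reads only from in, writes only out[i]; no dependence between iterations.
std::vector<int> neighbor_sum_parallel(const std::vector<int>& in) {
    std::vector<int> out(in);
    const int n = static_cast<int>(in.size());
    #pragma omp parallel for
    for (int i = 1; i < n; ++i) {
        out[i] = in[i] + in[i - 1];
    }
    return out;
}
```

With an array of ones, the first function yields 1, 2, 3, 4, … while the second yields 1, 2, 2, 2, …; they compute different things, which is why simply adding a pragma to the first loop cannot be correct.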
parallel sections
The blocks inside parallel sections are executed in parallel; the work is divided so that each section is executed by one thread.
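As a rough illustration of parallel sections, the sketch below runs three independent computations as three sections. The Stats struct and compute_stats are hypothetical names, and the input is assumed to be non-empty.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Hypothetical result type for this sketch.
struct Stats { long sum; int min; int max; };

// Computes sum, min and max of a non-empty vector, one section per result.
Stats compute_stats(const std::vector<int>& v) {
    Stats s{0, 0, 0};
    #pragma omp parallel sections
    {
        #pragma omp section
        { s.sum = std::accumulate(v.begin(), v.end(), 0L); }
        #pragma omp section
        { s.min = *std::min_element(v.begin(), v.end()); }
        #pragma omp section
        { s.max = *std::max_element(v.begin(), v.end()); }
    }
    return s;  // each section wrote a different member, so there is no race
}
```

If there are fewer threads than sections, some thread simply executes more than one section.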
clause
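- private: each thread gets its own private copy of the variable.
- firstprivate: each thread gets its own private copy, initialized with the value the variable had in the master thread before the region.
- lastprivate: after the parallel construct ends, the private value from the sequentially last iteration (or section) is copied back to the corresponding variable of the master thread.
- reduction: one or more variables are made private, and the private copies are combined with the given operator when the parallel construct ends.
- nowait: removes the barrier implied at the end of a directive.
- num_threads: specifies the number of threads.
- schedule: specifies how the iterations of a for loop are scheduled.
- shared: one or more variables are shared among the threads.
- ordered: specifies that the for loop contains code that must execute in sequential order.
- copyprivate: used with the single directive; the value of the listed variable is broadcast from the thread that executed the single block to the private copies of the other threads.
- copyin: a threadprivate variable is initialized with the value from the master thread.
- default: specifies the default data-sharing attribute of variables in the parallel region; the default is shared.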
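To illustrate firstprivate and lastprivate from the list above, here is a small sketch; last_square is a hypothetical function and n is assumed to be greater than zero.

```cpp
// Returns (n-1)*(n-1) + offset for n > 0.
// offset: firstprivate, so every thread starts with the caller's value.
// last:   lastprivate, so the value from the final iteration (i == n-1)
//         is copied back to the outer variable after the loop.
int last_square(int n, int offset) {
    int last = -1;
    #pragma omp parallel for firstprivate(offset) lastprivate(last)
    for (int i = 0; i < n; ++i) {
        last = i * i + offset;
    }
    return last;
}
```

Without lastprivate, a private last would be discarded at the end of the loop, and a shared last would hold whichever iteration happened to write it most recently.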
private
Variables declared inside the parallel region are automatically private to each thread.
The private clause exists so that you don't have to change your code; see openmp-are-local-variables-automatically-private under Reference.
The only way to parallelize the following code without the private clause
int i,j;
is to change the code. For example like this:
int i;
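The listings here are truncated, so the following is only a sketch of the private(j) variant under the assumption that the original was a nested loop with both counters declared up front. The function name row_sums is hypothetical.

```cpp
#include <vector>

// Sums each row of m. i is the parallel loop variable and is private automatically;
// j is declared outside the loop, so it must be listed in private(j) to avoid
// all threads sharing a single inner counter.
std::vector<int> row_sums(const std::vector<std::vector<int>>& m) {
    const int rows = static_cast<int>(m.size());
    std::vector<int> sums(rows, 0);
    int i, j;
    #pragma omp parallel for private(j)
    for (i = 0; i < rows; ++i) {
        for (j = 0; j < static_cast<int>(m[i].size()); ++j) {
            sums[i] += m[i][j];  // each thread writes only its own rows
        }
    }
    return sums;
}
```

Declaring the inner counter inside the loop (`for (int j = 0; …)`) has the same effect without the clause, which is the "change the code" alternative.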
reduction
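For example, an accumulation:
int sum = 0;
In the program above, sum is wrong whether it is shared or private: a shared sum is updated by several threads at once (a data race), and a private sum loses each thread's partial result. To solve this, OpenMP provides the reduction clause:
int sum = 0;
Internally, OpenMP gives each thread a private sum variable (initialized to 0); when the threads finish, OpenMP adds the private sums together to get the final result.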
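A minimal sketch of such an accumulation with reduction(+:sum) might look like this; the function name sum_to is an assumption for illustration.

```cpp
// Sums the integers 1..n.
// Each thread accumulates into its own private copy of sum (starting at 0);
// the copies are added into the outer sum when the loop ends.
long sum_to(int n) {
    long sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 1; i <= n; ++i) {
        sum += i;
    }
    return sum;
}
```

For example, sum_to(100) returns 5050 regardless of the number of threads.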
num_threads
The num_threads(4) clause has the same effect as calling omp_set_num_threads(4), but it applies only to the parallel region it is attached to.
// `num_threads(4)` same as `omp_set_num_threads(4)`
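As a hedged illustration, the sketch below requests four threads with the clause and reports the team size actually granted; the runtime may provide fewer. team_size_with_clause is a hypothetical name, and the omp.h call is guarded so the code also builds without -fopenmp.

```cpp
#ifdef _OPENMP
#include <omp.h>
#endif

// Returns the team size seen inside a region that requests 4 threads.
int team_size_with_clause() {
    int team = 1;  // stays 1 when built without OpenMP
    #pragma omp parallel num_threads(4)
    {
        #pragma omp single  // only one thread records the team size
        {
#ifdef _OPENMP
            team = omp_get_num_threads();
#endif
        }
    }
    return team;
}
```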
schedule
format
#pragma omp parallel for schedule(kind [, chunk size])
kind: see openmp-loop-scheduling and whats-the-difference-between-static-and-dynamic-schedule-in-openmp
- static: Divide the loop into equal-sized chunks, or as equal as possible when the number of loop iterations is not evenly divisible by the number of threads multiplied by the chunk size. By default, the chunk size is loop_count/number_of_threads.
- dynamic: Use the internal work queue to give a chunk-sized block of loop iterations to each thread. When a thread is finished, it retrieves the next block of loop iterations from the top of the work queue. By default, the chunk size is 1. Be careful when using this scheduling type because of the extra overhead involved.
- guided: a special case of dynamic. Similar to dynamic scheduling, but the chunk size starts off large and decreases to better handle load imbalance between iterations. The optional chunk parameter specifies the minimum chunk size to use. By default, the chunk size is approximately loop_count/number_of_threads.
- auto: When schedule(auto) is specified, the decision regarding scheduling is delegated to the compiler. The programmer gives the compiler the freedom to choose any possible mapping of iterations to threads in the team.
- runtime: the schedule is taken from the environment variable OMP_SCHEDULE, so the three types static, dynamic and guided can be tested without recompiling the code.
The optional parameter (chunk), when specified, must be a positive integer.
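By default, OpenMP assumes that every loop iteration takes the same amount of time, so it divides the iterations evenly among the cores and arranges them to minimize memory-access conflicts. Loops usually access memory linearly, so giving the first half of the iterations to one thread and the second half to another keeps conflicts to a minimum. That may be the best layout for memory access, but it is not necessarily the best for load balancing, and conversely the best load balancing may hurt memory access. Memory access and load balancing therefore have to be traded off against each other.
The schedule used when none is specified depends on the compiler implementation. In GCC, a loop without a schedule clause is scheduled statically, while schedule(runtime) with OMP_SCHEDULE unset falls back to schedule(dynamic,1), i.e. dynamic scheduling with a chunk size of 1.
Do not use more threads than there are cores; otherwise the machine is oversubscribed.
An isprime loop is a good way to demonstrate dynamic scheduling, because the cost of an iteration varies with the number being tested.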
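The isprime demonstration could be sketched as follows. The trial-division test and the names is_prime and count_primes are assumptions for illustration, not the original benchmark code; the chunk size of 200 mirrors one of the settings measured later in this post.

```cpp
// Trial-division primality test; larger n costs more, so iterations are unbalanced.
bool is_prime(int n) {
    if (n < 2) return false;
    for (int d = 2; static_cast<long long>(d) * d <= n; ++d) {
        if (n % d == 0) return false;
    }
    return true;
}

// Counts the primes in [2, limit]. Dynamic scheduling hands out blocks of 200
// iterations on demand, so threads that finish cheap blocks early pick up more work.
int count_primes(int limit) {
    int count = 0;
    #pragma omp parallel for schedule(dynamic, 200) reduction(+:count)
    for (int n = 2; n <= limit; ++n) {
        if (is_prime(n)) ++count;
    }
    return count;
}
```

Replacing the clause with schedule(runtime) would let OMP_SCHEDULE select the schedule without recompiling.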
functions
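- omp_get_num_procs: returns the number of processors available to the program.
- omp_set_num_threads: sets the number of threads used for parallel regions.
- omp_get_num_threads: returns the number of threads in the current parallel region; outside a parallel region it returns 1.
- omp_get_thread_num: returns the thread number (0, 1, 2, …).
- omp_init_lock: initializes a simple lock.
- omp_set_lock: acquires the lock.
- omp_unset_lock: releases the lock; must be paired with omp_set_lock.
- omp_destroy_lock: destroys the lock; must be paired with omp_init_lock.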
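The lock functions above could be used roughly as in the sketch below, where a lock protects push_back on a shared vector. collect_even is a hypothetical function, and the serial fallback stubs in the #else branch are an assumption of this example so that it also builds without -fopenmp.

```cpp
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#else
// Serial stand-ins (assumption of this sketch) used only when OpenMP is disabled.
typedef int omp_lock_t;
inline void omp_init_lock(omp_lock_t*) {}
inline void omp_set_lock(omp_lock_t*) {}
inline void omp_unset_lock(omp_lock_t*) {}
inline void omp_destroy_lock(omp_lock_t*) {}
#endif

// Collects the even numbers below n into a shared vector.
std::vector<int> collect_even(int n) {
    std::vector<int> out;
    omp_lock_t lock;
    omp_init_lock(&lock);          // paired with omp_destroy_lock below
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        if (i % 2 == 0) {
            omp_set_lock(&lock);   // only one thread may modify out at a time
            out.push_back(i);
            omp_unset_lock(&lock);
        }
    }
    omp_destroy_lock(&lock);
    return out;                    // element order depends on thread timing
}
```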
check cpu
cat /proc/cpuinfo | grep name | cut -f2 -d: | uniq -c
8 Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
omp_get_num_procs returns 8.
OpenMP Example
omp_get_num_threads
void test0()
parallel
case1
void test1()
case2
void test1_2()
notice the difference between std::cout and printf
case3
void test1_3()
omp parallel/for
omp parallel + omp for
void test2()
omp parallel for
void test2_2()
sqrt case
void base_sqrt()
sequential
time ./demo_openmp
Worker Thread = 0, cost = 1746 ms
Worker Thread = 0, cost = 1711 ms
Worker Thread = 0, cost = 1736 ms
Worker Thread = 0, cost = 1734 ms
Worker Thread = 0, cost = 1750 ms
Worker Thread = 0, cost = 1718 ms
Worker Thread = 0, cost = 1769 ms
Worker Thread = 0, cost = 1732 ms
Main Thread = 0, cost = 13899 ms
./demo_openmp 13.90s user 0.00s system 99% cpu 13.903 total
parallel
time ./demo_openmp
Worker Thread = 1, cost = 1875 ms
Worker Thread = 6, cost = 1876 ms
Worker Thread = 0, cost = 1876 ms
Worker Thread = 7, cost = 1876 ms
Worker Thread = 5, cost = 1877 ms
Worker Thread = 3, cost = 1963 ms
Worker Thread = 4, cost = 2000 ms
Worker Thread = 2, cost = 2027 ms
Main Thread = 0, cost = 2031 ms
./demo_openmp 15.10s user 0.01s system 740% cpu 2.041 total
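2031 ms + 10 ms (system) = 2041 ms (total)
2.041 s * 740% = 15.1034 s of CPU time, which matches the 15.10 s user time.
Compared with the sequential run (13.903 s total), the parallel run (2.041 s total) is about 6.8 times faster on 8 threads.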
parallel sections
void test3()
private
error case
void test4_error()
error results.
fix1 by changing code
void test4_fix1()
fix2 by private(j)
void test4_fix2()
reduction
error case
void test5_error()
reduction(+:sum)
void test5_fix()
num_threads
void test6()
schedule
(static,2)
void test7_1()
(static,4)
void test7_2()
(dynamic,1)
void test7_3()
(dynamic,3)
void test7_4()
schedule compare
no schedule
Number of primes numbers: 5761455
./demo_openmp 151.64s user 0.04s system 582% cpu 26.048 total
schedule(static,1)
Number of primes numbers: 5761455
./demo_openmp 111.13s user 0.00s system 399% cpu 27.799 total
schedule(dynamic,1)
Number of primes numbers: 5761455
./demo_openmp 167.22s user 0.02s system 791% cpu 21.135 total
schedule(dynamic,200)
Number of primes numbers: 5761455
./demo_openmp 165.96s user 0.02s system 791% cpu 20.981 total
OpenCV with OpenMP
see how-opencv-use-openmp-thread-to-get-performance
3 types of OpenCV implementation
- sequential implementation: default (slowest)
- parallel implementation: OpenMP / TBB
- GPU implementation: CUDA (fastest) / OpenCL
With CMake-gui, building OpenCV with the WITH_OPENMP flag means that the internal functions will use OpenMP to parallelize some of the algorithms, like cvCanny, cvSmooth and cvThreshold.
In OpenCV, an algorithm can have a sequential (slowest) implementation; a parallel implementation using OpenMP or TBB; and a GPU implementation using OpenCL or CUDA (fastest). You can decide with the WITH_XXX flags which version to use.
Of course, not every algorithm can be parallelized.
Now, if you want to parallelize your methods with OpenMP, you have to implement it yourself.
concepts
avoiding extra copying
from improving-image-processing-speed
There is one important thing about increasing speed in OpenCV not related to processor nor algorithm and it is avoiding extra copying when dealing with matrices. I will give you an example taken from the documentation:
“…by constructing a header for a part of another matrix. It can be a single row, single column, several rows, several columns, rectangular region in the matrix (called a minor in algebra) or a diagonal. Such operations are also O(1), because the new header will reference the same data. You can actually modify a part of the matrix using this feature, e.g.”
parallel for
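The original OpenCV listing is not preserved. As a library-free sketch of the same idea, the code below thresholds a row-major 8-bit grayscale buffer row by row with parallel for; the buffer stands in for the pixel data of a cv::Mat, and the name threshold_rows is hypothetical. With a real cv::Mat, the row pointer would typically come from mat.ptr<uchar>(r) instead.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Binary threshold, in place: pixels above thresh become 255, all others 0.
// pixels holds rows * cols values in row-major order (stand-in for cv::Mat data).
void threshold_rows(std::vector<std::uint8_t>& pixels, int rows, int cols,
                    std::uint8_t thresh) {
    #pragma omp parallel for
    for (int r = 0; r < rows; ++r) {
        // Each thread works on whole rows, so no two threads touch the same pixel.
        std::uint8_t* row = pixels.data() + static_cast<std::size_t>(r) * cols;
        for (int c = 0; c < cols; ++c) {
            row[c] = (row[c] > thresh) ? 255 : 0;
        }
    }
}
```

Parallelizing over rows rather than individual pixels keeps each thread's memory accesses contiguous and works in place, in line with the advice above on avoiding extra copies.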
parallel sections
Reference
- openmp
- openmp + MPI
- openmp
- how-opencv-use-openmp-thread-to-get-performance
- csdn opencv with openmp for+section
- openmp functions
- improving-image-processing-speed
- openmp-are-local-variables-automatically-private
- whats-the-difference-between-static-and-dynamic-schedule-in-openmp
- dynamic openmp with isprime
History
- 20190403: created.