Linear Algebra
determinant
basic

determinant properties
online calculator
inverse/adjoint(adjugate) matrix
Only non-singular matrices have inverses. (det(A) != 0)
- minor matrix
- cofactor matrix
- adjoint/adjugate matrix
- inverse matrix
- conjugate matrix
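A numpy sketch of the chain above (minor -> cofactor -> adjugate -> inverse) on a small non-singular matrix; numpy has no built-in adjugate, so it is assembled from cofactors here:

import numpy as np

A = np.array([[4.0, 7.0], [2.0, 6.0]])
det = np.linalg.det(A)                    # must be non-zero (non-singular)

n = A.shape[0]
C = np.zeros_like(A)                      # cofactor matrix
for i in range(n):
    for j in range(n):
        minor = np.delete(np.delete(A, i, axis=0), j, axis=1)
        C[i, j] = (-1) ** (i + j) * np.linalg.det(minor)

adjugate = C.T                            # adjugate = transpose of the cofactor matrix
A_inv = adjugate / det                    # inverse = adjugate / determinant
assert np.allclose(A_inv, np.linalg.inv(A))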
eigenvalue/eigenvector
A: n×n square matrix
graph demo
eigenvector
eigenvector steps
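A quick numpy check of the defining relation A·v = λ·v (toy 2×2 matrix):

import numpy as np

A = np.array([[2.0, 1.0], [1.0, 2.0]])
eigenvalues, eigenvectors = np.linalg.eig(A)   # eigenvectors are the COLUMNS
v = eigenvectors[:, 0]
assert np.allclose(A @ v, eigenvalues[0] * v)  # A v = lambda v
print(eigenvalues)                             # 3 and 1 here (order not guaranteed)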
svd
singular value decomposition
A: m×n matrix; SVD applies even when m != n (no square matrix required)
svd
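A numpy sketch; np.linalg.svd factors any m×n matrix into U·diag(s)·V^T with singular values sorted in descending order:

import numpy as np

A = np.random.rand(4, 3)                          # m = 4, n = 3
U, s, Vt = np.linalg.svd(A, full_matrices=False)  # thin SVD
assert np.allclose(A, U @ np.diag(s) @ Vt)        # A = U diag(s) V^T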
Probability Theory
random variable: discrete/continuous
- probability mass function: pmf (Poisson, binomial distributions) for discrete random variables
- probability density function: pdf (normal, uniform) for continuous random variables
- cumulative distribution function: cdf for both discrete and continuous random variables
see pmf-cdf-pdf
distribution-function-terminology-pdf-cdf-pmf-etc
binomial: n independent Bernoulli trials, P(X=k) = C(n,k) * p^k * (1-p)^(n-k)
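A direct check of the formula in Python, where math.comb is the C(n,k) above:

from math import comb

n, p, k = 10, 0.5, 3
prob = comb(n, k) * p**k * (1 - p)**(n - k)   # P(X = 3) for Binomial(10, 0.5)
print(prob)                                   # 0.1171875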
- marginal probability
- joint probability
- conditional probability
- bayes theorem
see here
Marginal probability: the probability of an event occurring (p(A)), it may be thought of as an unconditional probability. It is not conditioned on another event. Example: the probability that a card drawn is red (p(red) = 0.5). Another example: the probability that a card drawn is a 4 (p(four)=1/13).
Joint probability: p(A and B). The probability of event A and event B occurring. It is the probability of the intersection of two or more events. The probability of the intersection of A and B may be written p(A ∩ B). Example: the probability that a card is a four and red =p(four and red) = 2/52=1/26. (There are two red fours in a deck of 52, the 4 of hearts and the 4 of diamonds).
Conditional probability: p(A|B) is the probability of event A occurring, given that event B occurs. Example: given that you drew a red card, what’s the probability that it’s a four (p(four|red))=2/26=1/13. So out of the 26 red cards (given a red card), there are two fours so 2/26=1/13.
bayes theorem: p(cancer)=0.01, p(positive test|cancer)=0.9, p(positive test|no cancer)=0.08
p(cancer|positive test)?
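Worked through in plain Python; the denominator p(positive test) comes from the law of total probability:

p_cancer = 0.01
p_pos_given_cancer = 0.9
p_pos_given_no_cancer = 0.08

p_pos = p_pos_given_cancer * p_cancer + p_pos_given_no_cancer * (1 - p_cancer)
p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos
print(round(p_cancer_given_pos, 4))   # 0.102: a positive test still means only ~10% chance of cancer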
basic-prob
Statistics
2 types of statistics
- descriptive statistics
- inferential statistics
descriptive statistics
basic
- n, sum, min, max, range = max - min
- mean, median, mode
- variance, standard deviation
- skewness, kurtosis
from mean,median,mode
mean/median/mode
- mean: regular meaning of “average”
- median: middle value
- mode: most often
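A quick pandas pass over the basic list above (toy series; pandas uses the sample definitions with ddof=1, and kurt() reports excess kurtosis):

import pandas as pd

s = pd.Series([2, 3, 3, 5, 7, 10])
print(s.count(), s.sum(), s.min(), s.max(), s.max() - s.min())  # n, sum, min, max, range
print(s.mean(), s.median(), s.mode().tolist())                  # mean, median, mode
print(s.var(), s.std())                                         # variance, standard deviation
print(s.skew(), s.kurt())                                       # skewness, excess kurtosis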
2 types of data sets: here
- population: μ, σ^2, σ -> parameters
- sample: x̄, s^2, s -> statistics
The population is the whole group and its data is fixed, so μ denotes the true population mean.
Population data is usually hard to obtain, so we must infer it from samples; each draw differs, so x̄ denotes the mean of one particular sample.
Sample means x̄ vary from sample to sample, but the population mean μ never changes.
population: σ^2 = Σ(x_i - μ)^2 / N
sample: s^2 = Σ(x_i - x̄)^2 / (n - 1)
see example
skewness vs kurtosis
- skewness: the degree of asymmetry
- kurtosis: the degree of peakedness/flatness


formula see skewness kurtosis formula
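The same two measures with scipy (Fisher definition, so a normal distribution has kurtosis 0):

from scipy.stats import kurtosis, skew

data = [2, 3, 3, 5, 7, 10, 25]
print(skew(data))       # > 0: right-skewed, long right tail
print(kurtosis(data))   # excess kurtosis, relative to the normal distribution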
inferential statistics


Each hypothesis test pairs a null hypothesis with an alternative hypothesis.
- H0: μ1 = μ2 = μ3 = … = μn. It indicates that the group means are NOT very different from each other at the chosen statistical significance level.
- Ha: there exist at least two group means that are statistically significantly different from each other.
significance tests
- H0: there is NO real difference
- Ha: there is a difference
Reject H0 at the 5% significance level if p-value < 5%: statistically significant
Reject H0 at the 1% significance level if p-value < 1%: highly significant
one-tailed tests vs two-tailed tests
one-way ANOVA test:
- if p-value <= 5%, the result is statistically significant; we reject the null hypothesis in favor of the alternative hypothesis (Ha was correct)
- otherwise, the result is not statistically significant; we fail to reject the null hypothesis (H0 stands)
demo

If F-stat > 4.737 (the critical value) or p-value < 0.05, then reject H0
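A one-way ANOVA sketch with scipy (three toy groups):

from scipy.stats import f_oneway

group1 = [85, 86, 88, 75, 78]
group2 = [91, 92, 93, 85, 87]
group3 = [79, 78, 88, 94, 92]

f_stat, p_value = f_oneway(group1, group2, group3)
print(f_stat, p_value)   # reject H0 at the 5% level only if p_value < 0.05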

parametric tests vs nonparametric tests
Data Mining
- KDD: knowledge discovery in databases
- CRISP-DM: cross-industry standard process for data mining

Machine Learning methods
with/without labels
- supervised learning
  - classification
  - regression
- unsupervised learning
  - clustering
  - dimensionality reduction
  - anomaly detection
  - association rule mining/market basket analysis
- semi-supervised learning
- reinforcement learning
online/offline
- batch learning/offline learning
- online learning
instance/model
- instance based learning
- model based learning
EDA
statistics
- descriptive statistics
- inferential statistics
analysis
3 types
- univariate analysis: n=1
- bivariate analysis: n=2
- multivariate analysis: n>=3

- use histograms to visualize distributions
- correlation matrix/heatmap
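A minimal sketch of those two visualizations with pandas/seaborn (toy two-column frame; names are made up):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [2, 4, 5, 4, 6]})
df["x"].hist()                      # univariate: histogram of one column
plt.show()
sns.heatmap(df.corr(), annot=True)  # multivariate: correlation matrix as a heatmap
plt.show()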
Model Evaluation
Classification
confusion matrix
- accuracy
- precision
- recall
- F1-score: harmonic mean of precision and recall
value range (0-1); the bigger, the better.
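All of these derive from the confusion matrix; a minimal sklearn sketch on toy labels:

from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))   # rows = true class, columns = predicted class
print(accuracy_score(y_true, y_pred))     # (TP + TN) / total
print(precision_score(y_true, y_pred))    # TP / (TP + FP)
print(recall_score(y_true, y_pred))       # TP / (TP + FN)
print(f1_score(y_true, y_pred))           # harmonic mean of precision and recall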

precision vs recall curve

another curve
- ROC: receiver operating characteristic; the TPR vs FPR curve
- AUC: area under the ROC curve. value range (0-1); the bigger, the better.
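A sketch with sklearn, where y_score holds predicted probabilities for the positive class (toy values):

from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # points of the TPR vs FPR curve
print(roc_auc_score(y_true, y_score))               # 0.75; 0.5 = random, 1.0 = perfect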




all in one

multi-class classification for ROC
- micro-averaging: pool all classes' decisions and treat them as one binary problem
- macro-averaging: compute the metric per class, then average with equal weight

Clustering
types
- partition/centroid-based clustering: k-means, k-medoids
- hierarchical clustering: AgglomerativeClustering, affinity propagation
  - ward/single linkage
  - average linkage
  - complete linkage
- distribution-based clustering: Gaussian mixture models
- density-based clustering: DBSCAN, OPTICS
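A minimal k-means example with sklearn (toy 2-D points forming two obvious groups):

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
print(kmeans.labels_)           # cluster assignment per point
print(kmeans.cluster_centers_)  # centroid of each cluster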





external validation
with labels
- homogeneity
- completeness
- v-measure: harmonic mean of homogeneity and completeness
value range (0-1); the bigger, the better.
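All three live in sklearn.metrics and require ground-truth labels (toy labelings):

from sklearn.metrics import completeness_score, homogeneity_score, v_measure_score

labels_true = [0, 0, 1, 1, 2, 2]
labels_pred = [0, 0, 1, 2, 2, 2]

print(homogeneity_score(labels_true, labels_pred))   # each cluster contains only one class
print(completeness_score(labels_true, labels_pred))  # each class lands in one cluster
print(v_measure_score(labels_true, labels_pred))     # harmonic mean of the two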


internal validation
no labels
2 most important traits:
- compact groups
- well separated groups
metrics
- silhouette coefficient (SC): value range (-1, 1); the bigger, the better.
- Calinski-Harabasz index (CHI): value range > 0; the bigger, the better.
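Both internal metrics in sklearn, on synthetic blobs clustered with k-means:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score, silhouette_score

X, _ = make_blobs(n_samples=200, centers=3, random_state=42)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
print(silhouette_score(X, labels))          # (-1, 1); the bigger, the better
print(calinski_harabasz_score(X, labels))   # > 0; the bigger, the better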




Regression
metric:
- mean squared error: MSE
- root mean squared error: RMSE
- coefficient of determination (R^2)
- coefficient of correlation (r): value range (-1, 1)
R^2: value range (0, 1); the bigger, the better.
for simple linear regression, R^2 = r^2
formula:
R^2 = 1 - SS_res/SS_tot = 1 - Σ(y_i - ŷ_i)^2 / Σ(y_i - ȳ)^2
correlation coefficient:
r = Σ(x_i - x̄)(y_i - ȳ) / sqrt(Σ(x_i - x̄)^2 · Σ(y_i - ȳ)^2)
r2 demo


images from bing search.
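A quick check of MSE, RMSE, and R^2 with sklearn (toy values):

import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mse = mean_squared_error(y_true, y_pred)
print(mse, np.sqrt(mse))         # MSE and RMSE
print(r2_score(y_true, y_pred))  # coefficient of determination R^2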
regression analysis
types
- simple linear regression
- multiple linear regression
- nonlinear regression
assumptions
- training dataset(sample) is representative of the population being modeled
- x1, x2, …, xn are linearly independent: no multicollinearity
- homoscedasticity of errors: residuals are random and show no patterns
multicollinearity: check with a correlation matrix
variance inflation factor (VIF): VIF_i = 1/(1 - R_i^2). The larger the VIF, the more severe the collinearity. Rule of thumb: 0 < VIF < 10: no multicollinearity; 10 <= VIF < 100: strong multicollinearity; VIF >= 100: severe multicollinearity.
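statsmodels computes VIF per column; in the toy frame below x2 is built to be nearly collinear with x1, and a constant column is added because VIF assumes an intercept:

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

df = pd.DataFrame({"x1": [1, 2, 3, 4, 5],
                   "x2": [2, 4, 6, 8, 11],   # nearly collinear with x1
                   "x3": [5, 3, 8, 1, 4]})
X = add_constant(df)
for i, col in enumerate(X.columns):
    print(col, variance_inflation_factor(X.values, i))   # large VIF = strong collinearity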
homoscedastic vs heteroscedastic: check with a residual plot
homogeneous vs heterogeneous




evaluation analysis
- residual analysis
- normality tests (Q-Q plot)
- R^2

linear regression
y = kx + b; fit with OLS (ordinary least squares)
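A minimal OLS fit with sklearn (toy data generated from y = 2x + 1):

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4]])
y = np.array([3, 5, 7, 9])
model = LinearRegression().fit(X, y)   # ordinary least squares
print(model.coef_, model.intercept_)   # k ~ 2, b ~ 1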
decision tree based regression
linear vs non-linear regression:
- linear regression
- decision tree based regression (non-linear)
decision trees can be used for both classification and regression (CART)
node splitting
for regression:
- MSE: mean squared error
- RMSE: root mean squared error
- MAE: mean absolute error
- MAPE: mean absolute percentage error
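A minimal CART regressor in sklearn; note that recent sklearn versions spell the MSE criterion "squared_error" (toy one-feature data):

from sklearn.tree import DecisionTreeRegressor

X = [[1], [2], [3], [10], [11], [12]]
y = [1.1, 1.3, 1.2, 8.8, 9.1, 9.0]
tree = DecisionTreeRegressor(criterion="squared_error", max_depth=2).fit(X, y)
print(tree.predict([[2.5], [10.5]]))   # piecewise-constant predictions per leaf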


for classification
- information gain (entropy)
- Gini impurity/index
- misclassification error


stopping criteria
- max depth
- min samples to split internal nodes
- max leaf nodes
use grid search to find optimal hyperparameters
decision tree algorithms
ensemble learning
3 major families:
- bagging: bootstrap aggregating, uses bootstrap sampling. e.g. RandomForest
- boosting: e.g. Gradient Boosting Machine (GBM), AdaBoost
  - GBM variants: LightGBM, Extreme Gradient Boosting (XGBoost)
- stacking
others
- binning
- blending
- averaging
- voting
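A bagging-vs-boosting sketch with sklearn on synthetic data (hyperparameters mostly left at defaults):

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)   # bagging
gbm = GradientBoostingClassifier(random_state=42).fit(X_tr, y_tr)                # boosting
print(rf.score(X_te, y_te), gbm.score(X_te, y_te))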
see What is the difference between Bagging and Boosting?
see 集成学习-Boosting,Bagging与Stacking
bootstrap aggregating/bagging

boosting


model stacking


Model Tuning
decision trees
- information gain: IG
- gini impurity: GI
bias-variance tradeoff
The main causes of error in learning are noise, bias, and variance.
extreme cases of bias-variance
- underfitting: high bias, low variance
- overfitting: low bias, high variance
bias-variance tradeoff


see learnopencv
cross validation
train/validation/test
cross validation strategies:
- leave-one-out CV: n-1 samples for training, 1 sample for validation
- k-fold CV: split the data into k equal subsets; k-1 subsets for training, 1 subset for validation
5-fold or 10-fold in practice
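A 5-fold CV sketch with sklearn (iris data; logistic regression as a placeholder model):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores, scores.mean())   # one accuracy per fold, then the average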
hyperparameter tuning strategies
- grid search: manually specify the grid; parallelizable
- randomized search: samples parameter settings at random
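A grid search sketch with sklearn, reusing the decision-tree stopping criteria from earlier (iris data):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
param_grid = {"max_depth": [2, 3, 5], "min_samples_split": [2, 5, 10]}
search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)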
Model Interpretation
tools
global vs local interpretation
- global interpretation: based on the whole dataset (feature_importance, partial_dependence plots)
- local interpretation: based on a single prediction
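As a sketch of the global side, impurity-based feature_importances_ in sklearn (iris data):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
model = RandomForestClassifier(random_state=42).fit(data.data, data.target)
for name, imp in zip(data.feature_names, model.feature_importances_):
    print(name, round(imp, 3))   # dataset-level importance per feature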
global interpretation



local interpretation

model decision surface/hypersurface

Model Deployment
- REST API
- microservice
- model deployment as a service; anything as a service (XaaS)
Real-world case studies
customer segmentation
clustering problem
factors
- geographic
- demographic
- psychographic
- behavioural

RFM Model for customer value
- recency
- frequency
- monetary value
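A minimal RFM table with pandas; the transactions frame and its columns (customer_id, order_date, amount) are made up for illustration:

import pandas as pd

tx = pd.DataFrame({"customer_id": [1, 1, 2, 2, 2],
                   "order_date": pd.to_datetime(["2024-01-05", "2024-03-01", "2024-02-10",
                                                 "2024-02-20", "2024-03-03"]),
                   "amount": [50.0, 30.0, 20.0, 25.0, 40.0]})
now = tx["order_date"].max()
rfm = tx.groupby("customer_id").agg(
    recency=("order_date", lambda d: (now - d.max()).days),  # days since last purchase
    frequency=("order_date", "count"),                       # number of purchases
    monetary=("amount", "sum"))                              # total spend
print(rfm)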

association-rule mining
association rule mining/market basket analysis
basics
- association rule: {item1, item2, item3} -> {itemK}
- itemset: {milk,bread} {beer,diaper}
- frequent itemset: {milk,bread}
metrics
- support = frq(X,Y)/N
- confidence = support(X,Y)/support(X) = frq(X,Y)/frq(X)
- lift = support(X,Y)/(support(X)*support(Y)) = N*frq(X,Y)/(frq(X)*frq(Y))
good rules: large support, large confidence, and lift > 1
lift(X->Y) = 0 means X and Y never occur at the same time
lift(X->Y) = 1 means X and Y are independent of each other.
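A hand-rolled check of these metrics on toy transactions, for the rule {milk} -> {bread}:

transactions = [{"milk", "bread"}, {"milk"}, {"bread", "beer"},
                {"milk", "bread", "beer"}, {"bread"}]
N = len(transactions)

def frq(*items):
    # number of transactions containing all given items
    return sum(1 for t in transactions if set(items) <= t)

support = frq("milk", "bread") / N                              # 2/5
confidence = frq("milk", "bread") / frq("milk")                 # 2/3
lift = N * frq("milk", "bread") / (frq("milk") * frq("bread"))  # 10/12 < 1
print(support, confidence, lift)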


algorithms
- Apriori algorithm: generates all 2^k itemsets; TOO EXPENSIVE
- FP-growth: no need to generate all 2^k itemsets; uses a special FP-tree structure and a divide-and-conquer strategy
k unique products yield 2^k itemsets.
recommender system
recommender systems/ recommendation engines
big data with pandas
how to process big data with pandas?
import pandas as pd

for chunk in pd.read_csv(<filepath>, chunksize=<your_chunksize_here>):
    do_processing(chunk)   # each chunk is a DataFrame; process it independently
train_algorithm()
read by chunk
see opening-a-20gb-file-for-analysis-with-pandas
other tools
other refs
types of recommendation engines
3 types
- user-based recommendation engines
- content-based recommendation engines
- hybrid/collaborative filtering recommendation engines
based on similarity
different cases
- popularity-based: most liked songs across all users
- similarity-based: similar songs for a given user
- matrix factorization based: use SVD to get a low-rank approximation of the utility matrix
similarity
- Jaccard Index/Jaccard similarity coefficient, (0-1)
- cosine similarity
Jaccard Distance = 1 - Jaccard Index
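Both similarities in a few lines (plain Python sets for Jaccard, numpy for cosine):

import numpy as np

a, b = {"milk", "bread", "beer"}, {"milk", "bread", "diaper"}
jaccard = len(a & b) / len(a | b)   # 2/4 = 0.5
print(jaccard, 1 - jaccard)         # Jaccard index, Jaccard distance

u, v = np.array([1.0, 2.0, 0.0]), np.array([2.0, 4.0, 1.0])
print(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))   # cosine similarity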



matrix factorization
use matrix factorization to discover latent features between two different kinds of entities

sparse matrix

use SVD: matrix factorization, PCA
implicit feedback: e.g. song play count -> likeness
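A sketch of the SVD route on a toy user × song play-count matrix; truncating to k latent features yields the low-rank approximation of the utility matrix:

import numpy as np

# rows = users, columns = songs, values = play counts (0 = not played)
R = np.array([[5.0, 3.0, 0.0, 1.0],
              [4.0, 0.0, 0.0, 1.0],
              [1.0, 1.0, 0.0, 5.0],
              [0.0, 1.0, 5.0, 4.0]])
U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 2                                          # keep the top-2 latent features
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # rank-k approximation
print(np.round(R_hat, 2))                      # predicted affinity, including unplayed cells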
recommendation engine libraries
- scikit-surprise (Simple Python Recommendation System Engine)
- lightfm
- crab
- rec_sys
time series forecasting
basics
predictive modeling
time series analysis/forecasting:
- traditional approaches
  - Moving Average: MA
  - Exponential Smoothing: EWMA
  - Holt-Winters EWMA
  - Box-Jenkins methodologies: AR, MA, ARIMA, SARIMA
- deep learning approaches: RNNs, e.g. LSTM
  - regression modeling (x1,x2,…,x6 -> x7): many-to-one
  - sequence modeling: sequence -> sequence
two domains
- frequency domain: spectral and wavelet analysis
- time domain: auto- and cross-correlation analysis
where to get data ?
tools to fetch data:
- quandl: register for an API key first
- pandas-datareader
time series components
3 major components:
- seasonality
- trend
- residual
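statsmodels can split a series into these components; a sketch on a synthetic monthly series (additive model assumed):

import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

idx = pd.date_range("2020-01-01", periods=48, freq="MS")
y = pd.Series(np.arange(48) + 10 * np.sin(2 * np.pi * np.arange(48) / 12), index=idx)
result = seasonal_decompose(y, model="additive")   # y = trend + seasonality + residual
print(result.trend.dropna().head())
print(result.seasonal.head(12))                    # one full seasonal cycle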

smoothing techniques
- Moving Average: MA
- Exponential Smoothing: EWMA
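Both smoothers are one-liners in pandas (toy series):

import pandas as pd

s = pd.Series([3, 5, 4, 6, 8, 7, 9, 10])
print(s.rolling(window=3).mean())   # simple moving average
print(s.ewm(span=3).mean())         # exponentially weighted moving average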
ARIMA
AR vs MA
- autoregressive
- moving average
ARIMA: autoregressive integrated moving average
key concepts
- Stationarity: one of the key assumptions behind ARIMA models. Stationarity refers to the property that a time series' mean, variance, and autocorrelation are time-invariant; in other words, they do not change with time.
- Differencing: widely used to stabilize the mean of a time series. We can then apply different tests to confirm whether the resulting series is stationary.
- Unit Root Tests: statistical tests that help us understand whether a given series is stationary.
ad_fuller_test: the Augmented Dickey-Fuller test begins with a null hypothesis of the series being non-stationary,
kpss_test: while the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test has a null hypothesis that the series is stationary.
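Both tests live in statsmodels; a sketch on a random walk, which is non-stationary by construction:

import numpy as np
from statsmodels.tsa.stattools import adfuller, kpss

series = np.cumsum(np.random.randn(200))   # random walk: non-stationary

adf_stat, adf_p = adfuller(series)[:2]
print("ADF p-value:", adf_p)               # large p: cannot reject H0 (non-stationary)

kpss_stat, kpss_p = kpss(series, nlags="auto")[:2]
print("KPSS p-value:", kpss_p)             # small p: reject H0 of stationarity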
ad_fuller_test

not statistically significant, accept H0: non-stationary


statistically significant, reject H0 and accept Ha: stationary

ARIMA(p,d,q)
model
where,
- p is the order of autoregression (AR)
- d is the order of differencing
- q is the order of the moving average (MA)
how to choose p and q?
- ACF or Auto Correlation Function plot —> q = 1
- PACF or the Partial Auto Correlation Function plot —> p = 1

use grid search to choose p and q based on AIC
AIC, the Akaike Information Criterion, measures goodness of fit and parsimony.
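A minimal fit with statsmodels; order=(1, 1, 1) is just an assumed starting point, to be tuned via ACF/PACF or the AIC grid search above:

import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

y = pd.Series(np.cumsum(np.random.randn(100)))   # toy non-stationary series
result = ARIMA(y, order=(1, 1, 1)).fit()         # (p, d, q)
print(result.aic)                                # compare across candidate (p, d, q)
print(result.forecast(steps=5))                  # 5-step-ahead forecast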

LSTM
Efficient Market Hypothesis: says that it is almost impossible to beat the market consistently; others disagree with it.
modeling
- regression modeling
- sequence modeling

(N, W, F) format as input:
- N: number of sequences
- W: window, the length of each sequence
- F: features per timestep
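A shape-only sketch of this input convention, assuming TensorFlow/Keras and hypothetical sizes:

import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM, Dense

N, W, F = 32, 10, 1                    # 32 sequences, window of 10 steps, 1 feature per step
X = np.random.rand(N, W, F)
y = np.random.rand(N, 1)               # many-to-one: one target value per sequence

model = Sequential([LSTM(16, input_shape=(W, F)), Dense(1)])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=2, verbose=0)
print(model.predict(X[:1]).shape)      # (1, 1)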
for regression

for sequence

we need to pad test sequences to match the input shape.
other time series tools
New Concepts
- Linear Discriminant Analysis (LDA)
- Quadratic Discriminant Analysis (QDA)
sklearn code
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
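Extending that import into a runnable sketch (iris data; training accuracy only, for illustration):

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)

X, y = load_iris(return_X_y=True)
lda = LinearDiscriminantAnalysis().fit(X, y)
qda = QuadraticDiscriminantAnalysis().fit(X, y)
print(lda.score(X, y), qda.score(X, y))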
Reference
History