
Guide

requirements:

  • ubuntu: 16.04
  • python: 3.5.2
  • opencv: 3.4.2+
  • tesseract: v4 (binary)
  • pytesseract: 0.2.4 (python bindings)

### install python

```bash
apt-get install python3-dev python3-pip
```

### install opencv
```bash
workon py3
pip install opencv-contrib-python
```

### install tesseract
```bash
sudo add-apt-repository ppa:alex-p/tesseract-ocr
sudo apt-get update
sudo apt install tesseract-ocr
```

The latest release of Tesseract (v4) supports deep learning-based OCR that is significantly more accurate.

The underlying OCR engine itself utilizes a Long Short-Term Memory (LSTM) network, a kind of Recurrent Neural Network (RNN).

check

```bash
tesseract -v
tesseract 4.0.0-beta.4-138-g2093
leptonica-1.76.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.1.2
Found AVX2
Found AVX
Found SSE
```

### install Tesseract + Python bindings

```bash
workon py3
pip install pytesseract
pip install pillow imutils
```
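
a quick sanity check of the bindings (a minimal sketch; `images/example_01.jpg` stands for any test image on disk):

```python
from PIL import Image
import pytesseract

# run OCR on a single image with the default engine settings
text = pytesseract.image_to_string(Image.open("images/example_01.jpg"))
print(text)
```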

tesseract

help

```bash
tesseract --help
Usage:
  tesseract --help | --help-extra | --version
  tesseract --list-langs
  tesseract imagename outputbase [options...] [configfile...]

OCR options:
  -l LANG[+LANG]        Specify language(s) used for OCR.
NOTE: These options must occur before any configfile.

Single options:
  --help                Show this help message.
  --help-extra          Show extra help for advanced users.
  --version             Show version information.
  --list-langs          List available languages for tesseract engine.
```

help-extra

```bash
tesseract --help-extra
Usage:
  tesseract --help | --help-extra | --help-psm | --help-oem | --version
  tesseract --list-langs [--tessdata-dir PATH]
  tesseract --print-parameters [options...] [configfile...]
  tesseract imagename|imagelist|stdin outputbase|stdout [options...] [configfile...]

OCR options:
  --tessdata-dir PATH   Specify the location of tessdata path.
  --user-words PATH     Specify the location of user words file.
  --user-patterns PATH  Specify the location of user patterns file.
  -l LANG[+LANG]        Specify language(s) used for OCR.
  -c VAR=VALUE          Set value for config variables.
                        Multiple -c arguments are allowed.
  --psm NUM             Specify page segmentation mode.
  --oem NUM             Specify OCR Engine mode.
NOTE: These options must occur before any configfile.

Page segmentation modes:
  0    Orientation and script detection (OSD) only.
  1    Automatic page segmentation with OSD.
  2    Automatic page segmentation, but no OSD, or OCR.
  3    Fully automatic page segmentation, but no OSD. (Default)
  4    Assume a single column of text of variable sizes.
  5    Assume a single uniform block of vertically aligned text.
  6    Assume a single uniform block of text.
  7    Treat the image as a single text line.
  8    Treat the image as a single word.
  9    Treat the image as a single word in a circle.
 10    Treat the image as a single character.
 11    Sparse text. Find as much text as possible in no particular order.
 12    Sparse text with OSD.
 13    Raw line. Treat the image as a single text line,
       bypassing hacks that are Tesseract-specific.

OCR Engine modes: (see https://github.com/tesseract-ocr/tesseract/wiki#linux)
  0    Legacy engine only.
  1    Neural nets LSTM engine only.
  2    Legacy + LSTM engines.
  3    Default, based on what is available.

Single options:
  -h, --help            Show minimal help message.
  --help-extra          Show extra help for advanced users.
  --help-psm            Show page segmentation modes.
  --help-oem            Show OCR Engine modes.
  -v, --version         Show version information.
  --list-langs          List available languages for tesseract engine.
  --print-parameters    Print tesseract parameters.
```
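
in practice the page segmentation mode matters a lot; a hedged sketch of passing `--psm` and `--oem` through pytesseract's `config` argument (`line.png` is a hypothetical single-line image):

```python
from PIL import Image
import pytesseract

# --oem 1: LSTM engine only; --psm 7: treat the image as a single text line
config = "-l eng --oem 1 --psm 7"
print(pytesseract.image_to_string(Image.open("line.png"), config=config))
```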

run script

```bash
python text_recognition.py --east frozen_east_text_detection.pb \
	--image images/example_01.jpg
[INFO] loading EAST text detector...
OCR TEXT
========
OH OK
```

History

  • 20180920: created.

Guide

install python

install commands

```bash
sudo apt-get install python3-pip python3-dev

pip3 -V
pip 8.1.1 from /usr/lib/python3/dist-packages (python 3.5)
```

change pip source

ubuntu

edit .pip/pip.conf

```
[global]
index-url = http://pypi.douban.com/simple
[install]
trusted-host = pypi.douban.com
```

windows

edit C:\Users\zunli\AppData\Roaming\pip\pip.ini

```
[global]
index-url = http://pypi.douban.com/simple
[install]
trusted-host = pypi.douban.com
```

temp solutions

```bash
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple tensorflow-gpu==1.4.0
```

install virtualenv

```bash
sudo pip3 install virtualenv virtualenvwrapper
```

vim .bashrc

```bash
# for virtualenv and virtualenvwrapper
export WORKON_HOME=$HOME/.local
export VIRTUALENVWRAPPER_PYTHON=/usr/bin/python3
source /usr/local/bin/virtualenvwrapper.sh
```

source .bashrc

mkvirtualenv

```bash
kezunlin@ke: mkvirtualenv py3 -p python3
(py3) kezunlin@ke:~$
```

commands

```bash
ls $WORKON_HOME
mkvirtualenv py3 -p python3
mkvirtualenv py2 -p python2
rmvirtualenv py3

lsvirtualenv
lssitepackages

workon py3
deactivate
```
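
to double-check which interpreter an activated env resolves to, a small Python check (not part of the original tutorial):

```python
import sys

# inside an activated virtualenv, both of these point into the env directory
print(sys.prefix)
print(sys.executable)
```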

Opencv with virtualenv

python2

OpenCV should now be installed in

```bash
locate cv2.so
/usr/local/lib/python2.7/dist-packages/cv2.so
```

However, our py2 virtual environment is located in our home directory — thus to use OpenCV within our py2 environment, we first need to sym-link OpenCV into the site-packages directory of the py2 virtual environment:

```bash
cd ~/.local/py2/lib/python2.7/site-packages/
ln -s /usr/local/lib/python2.7/site-packages/cv2.so cv2.so
ln -s /usr/local/lib/python2.7/dist-packages/cv2.so cv2.so
```

import opencv

```bash
workon py2
python
>>> import cv2
>>> print(cv2.__version__)
'3.1.0'
```

python3

you may get this error

```
ImportError: dynamic module does not define init function (PyInit_cv2)
```

when importing cv2 in python3 (no such problem in python2).

install opencv-python

```bash
workon py3
pip3 install opencv-contrib-python
```

test version

```bash
workon py3
python
>>> import cv2
>>> print(cv2.__version__)
'3.4.2'
```

install pycharm

apt-get (slow)

```bash
sudo add-apt-repository ppa:mystic-mirage/pycharm

sudo apt update

# not free
sudo apt install pycharm

# free
sudo apt install pycharm-community

# remove
sudo apt remove pycharm pycharm-community && sudo apt autoremove
```

official (faster)

download from here

start by

```bash
sh pycharm.sh
```

History

  • 20180920: created.

Series

Guide

requirements:

  • ubuntu: 16.04
  • opencv: 3.3.0

install dependencies

```bash
sudo apt-get install build-essential
sudo apt-get install cmake git libgtk2.0-dev pkg-config libavcodec-dev libavformat-dev libswscale-dev
sudo apt-get install python-dev python-numpy libtbb2 libtbb-dev libjpeg-dev libpng-dev libtiff-dev libjasper-dev libdc1394-22-dev

sudo apt-get install cmake-gui
```

compile

```bash
git clone https://github.com/opencv/opencv.git
wget https://github.com/opencv/opencv/archive/3.1.0.zip

cd opencv-3.1.0
mkdir build
cd build && cmake-gui ..

# may take several minutes
sudo make -j8

# install to /usr/local/bin
sudo make install
```

check version

```bash
opencv_version
3.3.0
```

python cv2

```bash
python
>>> import cv2
>>> cv2.__version__
```

pip install opencv

```bash
workon py3
pip install opencv-contrib-python

python
>>> import cv2
>>> cv2.__version__
'3.3.0'
```

for virtualenv, see python virtualenv tutorial

opencv samples

```bash
cd samples
cmake .
make
```

Example

Code

```cpp
#include <iostream>
using namespace std;

#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>
#include <opencv2/highgui.hpp>
using namespace cv;

int main()
{
    // load as grayscale (flag 0)
    Mat image = imread("../image/cat.jpg", 0);
    if (image.empty())
    {
        cerr << "failed to load ../image/cat.jpg" << endl;
        return -1;
    }
    imshow("image", image);
    waitKey(0);
    return 0;
}
```

CMakeLists.txt

```cmake
cmake_minimum_required(VERSION 2.8.8)

project(demo)

# Find includes in corresponding build directories
set(CMAKE_INCLUDE_CURRENT_DIR ON)

find_package(OpenCV REQUIRED COMPONENTS core highgui imgproc features2d calib3d)
include_directories(${OpenCV_INCLUDE_DIRS})

message([opencv] ${OpenCV_INCLUDE_DIRS})
message([opencv] ${OpenCV_LIBS})
message([opencv] ${OpenCV_LIBRARIES})

add_executable(${PROJECT_NAME}
    demo.cpp
)
target_link_libraries(${PROJECT_NAME} ${OpenCV_LIBRARIES})
```

History

  • 20180919: created.

Overview

cuda 9.2

  • nvidia driver 396.54
  • cuda 9.2 (do not install the driver; install the toolkit and samples)
  • cudnn 7.1.4 for cuda 9.2 (for TensorRT, caffe, tensorflow, baidu anakin)

cuda 8.0

  • nvidia driver 384.130
  • cuda 8.0 (do not install the driver; install the toolkit and samples)
  • cudnn 6.0.21 for cuda 8.0 (for caffe)

prepare

GUI vs tty

  • ctrl+alt+F7 to enter the GUI
  • ctrl+alt+F1-F6 to enter tty1-6, log in with (username, password)

use fbterm instead of the default terminal when we are in tty1

```bash
sudo apt-get -y install fbterm
sudo fbterm
```

cuda and cudnn

  • download cuda_9.2.148_396.37_linux.run from cuda
  • download cudnn-9.2-linux-x64-v7.1.tgz from cudnn

Steps

install general dependencies

```bash
apt-get install libprotobuf-dev libleveldb-dev libsnappy-dev libhdf5-serial-dev protobuf-compiler
apt-get install --no-install-recommends libboost-all-dev

# blas
sudo apt-get install libopenblas-dev liblapack-dev libatlas-base-dev

sudo apt-get install libgflags-dev libgoogle-glog-dev liblmdb-dev

sudo apt-get install git cmake build-essential

# fix missing
#sudo apt-get install freeglut3-dev build-essential libx11-dev libxmu-dev libxi-dev libgl1-mesa-glx libglu1-mesa libglu1-mesa-dev
```

GUI mode

```bash
# disable default ubuntu driver
sudo vim /etc/modprobe.d/blacklist-nouveau.conf

blacklist nouveau
blacklist lbm-nouveau
options nouveau modeset=0
alias nouveau off
alias lbm-nouveau off

echo options nouveau modeset=0 | sudo tee -a /etc/modprobe.d/nouveau-kms.conf
sudo update-initramfs -u
sudo reboot
```

tty mode

ctrl+alt+F1 to enter tty1, log in with (username, password)

```bash
sudo fbterm

# stop x-server before install cuda driver
sudo service lightdm stop
```

#### remove previous nvidia driver + cuda toolkit

```bash
sudo apt-get remove --purge nvidia-*
# remove 8.0
sudo /usr/local/cuda-8.0/bin/uninstall_cuda_8.0.pl
# remove 9.2
sudo /usr/local/cuda-9.2/bin/uninstall_cuda_9.2.pl
```

#### install nvidia driver from ppa

DO NOT use cuda_xxx_linux.run to install the nvidia driver, otherwise we
get the login loop problem when we reboot. Installing the driver from the official PPA is the recommended way.

```bash
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt-get update

sudo apt-cache search nvidia-*
# nvidia-384
# nvidia-396
sudo apt-get -y install nvidia-396

# test
sudo nvidia-smi
```

#### install cuda toolkit from run file

> 1. DO NOT install nvidia driver, install cuda toolkit + samples.
>
> 2. use default install path `/usr/local/cuda-9.2`
>
> 3. use `/usr/local/cuda-9.2/bin/uninstall_cuda_9.2.pl` to uninstall

```bash
chmod +x ./cuda_9.2.148_396.37_linux.run

# unsupported compiler ---> use --override
./cuda_9.2.148_396.37_linux.run --override
```

output

```
---------------------------------------
Do you accept the previously read EULA? 
(accept/decline/quit): accept

Install NVIDIA Accelerated Graphics Driver for Linux-x86_64 396.37? (y)es/(n)o/(q)uit: no

Install the CUDA 9.2 Toolkit? 
(y)es/(n)o/(q)uit: yes

Enter Toolkit Location 
    [ default is /usr/local/cuda-9.2 ]:

Do you want to install a symbolic link at /usr/local/cuda? (y)es/(n)o/(q)uit: yes


Install the CUDA 9.2 Samples? 
(y)es/(n)o/(q)uit: yes

Enter CUDA Samples Location 
    [ default is /home/kezunlin ]: 


Installing the CUDA Toolkit in /usr/local/cuda-9.2 ...
Installing the CUDA Samples in /home/kezunlin ...

===========
= Summary =
===========

Driver:   Not Selected
Toolkit:  Installed in /usr/local/cuda-9.2
Samples:  Installed in /home/kezunlin

Please make sure that
 -   PATH includes /usr/local/cuda-9.2/bin
 -   LD_LIBRARY_PATH includes /usr/local/cuda-9.2/lib64, or, add /usr/local/cuda-9.2/lib64 to /etc/ld.so.conf and run ldconfig as root

To uninstall the CUDA Toolkit, run the uninstall script in /usr/local/cuda-9.2/bin

Please see CUDA_Installation_Guide_Linux.pdf in /usr/local/cuda-9.2/doc/pdf for detailed information on setting up CUDA.

***WARNING: Incomplete installation! This installation did not install the CUDA Driver. A driver of version at least 384.00 is required for CUDA 9.2 functionality to work.
To install the driver using this installer, run the following command, replacing <CudaInstaller> with the name of this run file:
    sudo <CudaInstaller>.run -silent -driver

Logfile is /tmp/cuda_install_6659.log
```

reboot to enter GUI

```bash
sudo reboot
```

OK, we no longer have the login loop problem.

add library path

system env

```bash
vim .bashrc

# for cuda and cudnn
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

source .bashrc
```

or by conf file

```bash
sudo vim /etc/ld.so.conf.d/cuda.conf
/usr/local/cuda/lib64

sudo ldconfig
```

test

nvidia-smi

```bash
nvidia-smi
Tue Sep 18 10:35:55 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.54 Driver Version: 396.54 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 1060 Off | 00000000:01:00.0 Off | N/A |
| N/A 58C P0 31W / N/A | 288MiB / 6078MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1636 G /usr/lib/xorg/Xorg 164MiB |
| 0 2569 G compiz 40MiB |
| 0 4828 G ...-token=2DAB0000EFF3321D4D304928FA64B811 81MiB |
+-----------------------------------------------------------------------------+
```

or

```bash
cat /proc/driver/nvidia/version
```

nvcc

```bash
nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Tue_Jun_12_23:07:04_CDT_2018
Cuda compilation tools, release 9.2, V9.2.148
```

deviceQuery

```bash
cd ~/NVIDIA_CUDA-9.2_Samples/1_Utilities/deviceQuery
make
./deviceQuery
```

output

```
./deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GTX 1060"
CUDA Driver Version / Runtime Version 9.2 / 9.2
CUDA Capability Major/Minor version number: 6.1
Total amount of global memory: 6078 MBytes (6373572608 bytes)
(10) Multiprocessors, (128) CUDA Cores/MP: 1280 CUDA Cores
GPU Max Clock rate: 1733 MHz (1.73 GHz)
Memory Clock rate: 4004 Mhz
Memory Bus Width: 192-bit
L2 Cache Size: 1572864 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.2, CUDA Runtime Version = 9.2, NumDevs = 1
Result = PASS
```

we get Result = PASS.

install cudnn

download cudnn-9.2-linux-x64-v7.1.tgz for ubuntu 16.04

  • copy include to /usr/local/cuda-9.2/include
  • copy lib64 to /usr/local/cuda-9.2/lib64

commands

```bash
tar -xzvf cudnn-9.2-linux-x64-v7.1.tgz
sudo cp cuda/include/cudnn.h /usr/local/cuda/include/
sudo cp cuda/lib64/* /usr/local/cuda/lib64/
```
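
a quick way to confirm the loader can see the library afterwards (a minimal sketch using ctypes; the soname `libcudnn.so.7` matches cudnn 7.x):

```python
import ctypes

# CDLL raises OSError if the dynamic loader cannot find the library
cudnn = ctypes.CDLL("libcudnn.so.7")
print("libcudnn.so.7 loaded")
```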

History

  • 20180917: created.

Series

Guide

version

  • ubuntu 16.04 (14.04 and 16.04 only; Windows is not supported)
  • CUDA 8.0 (8.0,9.0,9.2 only)
  • CUDA 9.2
  • cudnn 7.1.4 (7.1 only)
  • TensorRT 4.0.1.6
  • TensorFlow-gpu v1.4+
  • python: 3.5.2 (2.7 or 3.5)

TensorRT support matrix

  • 4.0.1.6
    support matrix

  • 5.0.2.6
    support matrix

hardware precision matrix

hardware precision support matrix

see tensorrt-support-matrix

ubuntu

  • GeForce 1060 (fp32,int8) no fp16

jetson products

  • Jetson TX1 (fp32,fp16)
  • Jetson TX2 (fp32,fp16)
  • Jetson AGX Xavier (fp32,fp16,int8,dla)
  • Jetson Nano (Jetbot)

install

download and install

download TensorRT-4.0.1.6.Ubuntu-16.04.4.x86_64-gnu.cuda-8.0.cudnn7.1.tar.gz from here

```bash
tar zxvf TensorRT-4.0.1.6.Ubuntu-16.04.4.x86_64-gnu.cuda-8.0.cudnn7.1.tar.gz

ls TensorRT-4.0.1.6
bin data doc graphsurgeon include lib python samples targets TensorRT-Release-Notes.pdf uff

sudo mv TensorRT-4.0.1.6 /opt/
cd /opt
sudo ln -s TensorRT-4.0.1.6/ tensorrt
```

Updates: from cuda-8.0 ===> cuda-9.2. download TensorRT-4.0.1.6.Ubuntu-16.04.4.x86_64-gnu.cuda-9.2.cudnn7.1.tar.gz from here

add lib to path

```bash
sudo vim /etc/ld.so.conf.d/tensorrt.conf
/opt/tensorrt/lib

sudo ldconfig
```

or

```bash
vim ~/.bashrc
export LD_LIBRARY_PATH=/opt/tensorrt/lib:$LD_LIBRARY_PATH

source ~/.bashrc
```

python package

```bash
cd /opt/tensorrt/python
sudo pip2 install tensorrt-4.0.1.6-cp27-cp27mu-linux_x86_64.whl
```

or

```bash
cd /opt/tensorrt/python
sudo pip3 install tensorrt-4.0.1.6-cp35-cp35m-linux_x86_64.whl
```
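
a quick import check of the bindings (a hedged sketch; whether these legacy 4.x bindings expose `__version__` is an assumption):

```python
import tensorrt as trt

# __version__ is assumed to be exposed by the bindings
print(trt.__version__)
```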

uff package

```bash
cd /opt/tensorrt/uff
sudo pip install uff-0.4.0-py2.py3-none-any.whl

which convert-to-uff
/usr/local/bin/convert-to-uff
```
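
besides the convert-to-uff CLI, the wheel also exposes a Python API; a hedged sketch (`frozen.pb` and the output node name `prob` are placeholders for your own model):

```python
import uff

# convert a frozen TensorFlow graph to UFF and keep the serialized bytes
uff_model = uff.from_tensorflow_frozen_model("frozen.pb", output_nodes=["prob"])
with open("model.uff", "wb") as f:
    f.write(uff_model)
```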

folder structure

include

```bash
tree include/
include/
├── NvCaffeParser.h
├── NvInfer.h
├── NvInferPlugin.h
├── NvOnnxConfig.h
├── NvOnnxParser.h
├── NvUffParser.h
└── NvUtils.h
```

lib

```bash
ls -al *.4.1.2
lrwxrwxrwx 1 kezunlin kezunlin 21 6月 12 15:42 libnvcaffe_parser.so.4.1.2 -> libnvparsers.so.4.1.2
-rwxrwxr-x 1 kezunlin kezunlin 2806840 6月 12 15:42 libnvinfer_plugin.so.4.1.2
-rwxrwxr-x 1 kezunlin kezunlin 80434488 6月 12 15:42 libnvinfer.so.4.1.2
-rwxrwxr-x 1 kezunlin kezunlin 3951712 6月 12 15:42 libnvparsers.so.4.1.2
```

bin

```bash
tree bin
bin
├── download-digits-model.py
├── giexec
└── trtexec
```

sample

add envs

```bash
vim ~/.bashrc

# tensorrt cuda and cudnn
export CUDA_INSTALL_DIR=/usr/local/cuda
export CUDNN_INSTALL_DIR=/usr/local/cuda
```

compile all

```bash
cd samples/
make -j8
```

this generates all sample_xxx binaries in the bin/ folder.

compile sampleMNIST

```bash
cd samples/sampleMNIST
ls
Makefile sampleMNIST.cpp
make -j8
```

error occurs

```
dpkg-query: no packages found matching cuda-cudart-[0-9]*
../Makefile.config:6: CUDA_INSTALL_DIR variable is not specified, using /usr/local/cuda- by default, use CUDA_INSTALL_DIR=<cuda_directory> to change.
../Makefile.config:9: CUDNN_INSTALL_DIR variable is not specified, using  by default, use CUDNN_INSTALL_DIR=<cudnn_directory> to change.
```

fix solutions:

```bash
vim ~/.bashrc

# tensorrt cuda and cudnn
export CUDA_INSTALL_DIR=/opt/cuda
export CUDNN_INSTALL_DIR=/opt/cuda
```

make again

```
Compiling: sampleMNIST.cpp
Compiling: sampleMNIST.cpp
Linking: ../../bin/sample_mnist
Linking: ../../bin/sample_mnist_debug
# Copy every EXTRA_FILE of this sample to bin dir
```

test sample_mnist

```
./sample_mnist
Reading Caffe prototxt: ../../../data/mnist/mnist.prototxt
Reading Caffe model: ../../../data/mnist/mnist.caffemodel

Input:

@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@%.:@@@@@@@@@@@@
@@@@@@@@@@@@@: *@@@@@@@@@@@@
@@@@@@@@@@@@* =@@@@@@@@@@@@@
@@@@@@@@@@@% :@@@@@@@@@@@@@@
@@@@@@@@@@@- *@@@@@@@@@@@@@@
@@@@@@@@@@# .@@@@@@@@@@@@@@@
@@@@@@@@@@: #@@@@@@@@@@@@@@@
@@@@@@@@@+ -@@@@@@@@@@@@@@@@
@@@@@@@@@: %@@@@@@@@@@@@@@@@
@@@@@@@@+ +@@@@@@@@@@@@@@@@@
@@@@@@@@:.%@@@@@@@@@@@@@@@@@
@@@@@@@% -@@@@@@@@@@@@@@@@@@
@@@@@@@% -@@@@@@#..:@@@@@@@@
@@@@@@@% +@@@@@-    :@@@@@@@
@@@@@@@% =@@@@%.#@@- +@@@@@@
@@@@@@@@..%@@@*+@@@@ :@@@@@@
@@@@@@@@= -%@@@@@@@@ :@@@@@@
@@@@@@@@@- .*@@@@@@+ +@@@@@@
@@@@@@@@@@+  .:-+-: .@@@@@@@
@@@@@@@@@@@@+:    :*@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@

Output:

0: 
1: 
2: 
3: 
4: 
5: 
6: **********
7: 
8: 
9:
```

Sample

compile all samples

```bash
cd samples
make -j8
```

sample_mnist

see above. skip.

```bash
ldd sample_mnist
linux-vdso.so.1 => (0x00007ffecd9f3000)
libnvinfer.so.4 => /opt/tensorrt/lib/libnvinfer.so.4 (0x00007f48de6f2000)
libnvparsers.so.4.1.2 => /opt/tensorrt/lib/libnvparsers.so.4.1.2 (0x00007f48de12c000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f48ddf24000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f48ddd20000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f48ddb03000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f48dd781000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f48dd478000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f48dd262000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f48dce98000)
libcudnn.so.7 => /usr/local/cuda/lib64/libcudnn.so.7 (0x00007f48c8818000)
libcublas.so.9.2 => /usr/local/cuda/lib64/libcublas.so.9.2 (0x00007f48c4dca000)
libcudart.so.9.2 => /usr/local/cuda/lib64/libcudart.so.9.2 (0x00007f48c4b60000)
/lib64/ld-linux-x86-64.so.2 (0x00007f48e42bc000)
```

libnvinfer.so, libnvparsers.so, libcudart.so, libcudnn.so, libcublas.so

sample_onnx_mnist

```
./sample_onnx_mnist



---------------------------



@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@%.-@@@@@@@@@@@
@@@@@@@@@@@*-    %@@@@@@@@@@
@@@@@@@@@@= .-.  *@@@@@@@@@@
@@@@@@@@@= +@@@  *@@@@@@@@@@
@@@@@@@@* =@@@@  %@@@@@@@@@@
@@@@@@@@..@@@@%  @@@@@@@@@@@
@@@@@@@# *@@@@-  @@@@@@@@@@@
@@@@@@@: @@@@%   @@@@@@@@@@@
@@@@@@@: @@@@-   @@@@@@@@@@@
@@@@@@@: =+*= +: *@@@@@@@@@@
@@@@@@@*.    +@: *@@@@@@@@@@
@@@@@@@@%#**#@@: *@@@@@@@@@@
@@@@@@@@@@@@@@@: -@@@@@@@@@@
@@@@@@@@@@@@@@@+ :@@@@@@@@@@
@@@@@@@@@@@@@@@*  @@@@@@@@@@
@@@@@@@@@@@@@@@@  %@@@@@@@@@
@@@@@@@@@@@@@@@@  #@@@@@@@@@
@@@@@@@@@@@@@@@@: +@@@@@@@@@
@@@@@@@@@@@@@@@@- +@@@@@@@@@
@@@@@@@@@@@@@@@@*:%@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@


 Prob 0  0.00000: 
 Prob 1  0.00001: 
 Prob 2  0.00002: 
 Prob 3  0.00003: 
 Prob 4  0.00044: 
 Prob 5  0.00005: 
 Prob 6  0.00006: 
 Prob 7  0.00007: 
 Prob 8  0.00008: 
 Prob 9  0.99969: **********
```

sample_uff_mnist

```
../../../data/mnist/lenet5.uff



---------------------------



@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@%.-@@@@@@@@@@@
@@@@@@@@@@@*-    %@@@@@@@@@@
@@@@@@@@@@= .-.  *@@@@@@@@@@
@@@@@@@@@= +@@@  *@@@@@@@@@@
@@@@@@@@* =@@@@  %@@@@@@@@@@
@@@@@@@@..@@@@%  @@@@@@@@@@@
@@@@@@@# *@@@@-  @@@@@@@@@@@
@@@@@@@: @@@@%   @@@@@@@@@@@
@@@@@@@: @@@@-   @@@@@@@@@@@
@@@@@@@: =+*= +: *@@@@@@@@@@
@@@@@@@*.    +@: *@@@@@@@@@@
@@@@@@@@%#**#@@: *@@@@@@@@@@
@@@@@@@@@@@@@@@: -@@@@@@@@@@
@@@@@@@@@@@@@@@+ :@@@@@@@@@@
@@@@@@@@@@@@@@@*  @@@@@@@@@@
@@@@@@@@@@@@@@@@  %@@@@@@@@@
@@@@@@@@@@@@@@@@  #@@@@@@@@@
@@@@@@@@@@@@@@@@: +@@@@@@@@@
@@@@@@@@@@@@@@@@- +@@@@@@@@@
@@@@@@@@@@@@@@@@*:%@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
10 eltCount
--- OUTPUT ---
0 => -2.75228	 : 
1 => -1.51534	 : 
2 => -4.11729	 : 
3 => 0.316925	 : 
4 => 3.73423	 : 
5 => -3.00593	 : 
6 => -6.18866	 : 
7 => -1.02671	 : 
8 => 1.937	 : 
9 => 14.8275	 : ***

Average over 10 runs is 0.0843257 ms.
```

sample_mnist_api

```
./sample_mnist_api
Loading weights: ../../../data/mnist/mnistapi.wts

Input:

@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@+ @@@@@@@@@@@@@@
@@@@@@@@@@@@. @@@@@@@@@@@@@@
@@@@@@@@@@@@- @@@@@@@@@@@@@@
@@@@@@@@@@@#  @@@@@@@@@@@@@@
@@@@@@@@@@@#  *@@@@@@@@@@@@@
@@@@@@@@@@@@  :@@@@@@@@@@@@@
@@@@@@@@@@@@= .@@@@@@@@@@@@@
@@@@@@@@@@@@#  %@@@@@@@@@@@@
@@@@@@@@@@@@% .@@@@@@@@@@@@@
@@@@@@@@@@@@%  %@@@@@@@@@@@@
@@@@@@@@@@@@%  %@@@@@@@@@@@@
@@@@@@@@@@@@@= +@@@@@@@@@@@@
@@@@@@@@@@@@@* -@@@@@@@@@@@@
@@@@@@@@@@@@@*  @@@@@@@@@@@@
@@@@@@@@@@@@@@  @@@@@@@@@@@@
@@@@@@@@@@@@@@  *@@@@@@@@@@@
@@@@@@@@@@@@@@  *@@@@@@@@@@@
@@@@@@@@@@@@@@  *@@@@@@@@@@@
@@@@@@@@@@@@@@  *@@@@@@@@@@@
@@@@@@@@@@@@@@* @@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@

Output:

0: 
1: **********
2: 
3: 
4: 
5: 
6: 
7: 
8: 
9:
```

sample_int8

```
./sample_int8 mnist

FP32 run:400 batches of size 100 starting at 100
........................................
Top1: 0.9904, Top5: 1
Processing 40000 images averaged 0.00332707 ms/image and 0.332707 ms/batch.

FP16 run:400 batches of size 100 starting at 100
Engine could not be created at this precision

INT8 run:400 batches of size 100 starting at 100
........................................
Top1: 0.9909, Top5: 1
Processing 40000 images averaged 0.00215323 ms/image and 0.215323 ms/batch.
```

History

  • 20180907: created.
  • 20181119: add tensorrt-5.0.

Guide

version

  • gcc 4.8.5/5.4.0
  • g++ 4.8.5/5.4.0
  • cmake 3.2.2
  • nvidia driver 396.54 + cuda 9.2 + cudnn 7.1.4
  • protobuf 3.4.0

install nvidia-docker2

see nvidia-docker2 guide on ubuntu 16.04

test

```bash
sudo docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi
```

build and run

build

```bash
git clone https://github.com/PaddlePaddle/Anakin.git anakin
cd anakin/docker
./anakin_docker_build_and_run.sh -p NVIDIA-GPU -o Ubuntu -m Build
```

an error occurs with cudnn; skip.

run

```bash
./anakin_docker_build_and_run.sh -p NVIDIA-GPU -o Ubuntu -m Run
```

compile anakin

```bash
sudo docker run -it --runtime=nvidia fdcda959f60a /bin/bash
root@962077742ae9:/# cd Anakin/
git checkout developing
```

build

```bash
# 1. use script to build
./tools/gpu_build.sh

# 2. or you can build directly.
mkdir build
cd build
cmake ..
make -j8
```

x86 build

```bash
./tools/x86_build.sh
```

OK. no errors.

gpu build

```bash
./tools/gpu_build.sh
```

build errors occur: no cudnn found.

compile anakin on the host

install protobuf

install protobuf 3.4.0, see Part 1: compile protobuf-cpp on ubuntu 16.04

configure env

vim .bashrc

```bash
# cuda for anakin
export PATH=/usr/local/cuda/bin:$PATH

# CUDNN for anakin
export CUDNN_ROOT=/usr/local/cuda/
export LD_LIBRARY_PATH=${CUDNN_ROOT}/lib64:$LD_LIBRARY_PATH
export CPLUS_INCLUDE_PATH=${CUDNN_ROOT}/include:$CPLUS_INCLUDE_PATH
```

source .bashrc

build anakin

x86 build

```bash
git checkout developing
./tools/x86_build.sh

mv output x86_output
```

OK. no errors.

if error occurs, then

```bash
rm -rf CMakeFiles
rm -rf anakin/framework/model_parser/proto/*.h
rm output

chown -R kezunlin:kezunlin anakin
```

gpu build

```bash
./tools/gpu_build.sh
mv output gpu_output
```

gpu build with cmake

```bash
cd anakin
mkdir build
cd build && cmake-gui ..
```

anakin overview

anakin

Forward computation with Anakin involves three main steps:

  1. Parse an external model into an Anakin model with the Anakin Parser.
  2. Load the Anakin model to generate the raw compute graph, then optimize the raw graph.
  3. Anakin executes the compute graph on the chosen hardware platform.

Tensor

Tensor takes three template parameters:

```cpp
template<typename TargetType, DataType datatype, typename LayOutType = NCHW>
class Tensor .../* Inherit other class */{
  //some implements
  ...
};
```

  • TargetType is the platform type, such as X86 or GPU; Anakin has internal identifiers corresponding to it;
  • datatype is the ordinary data type, which also has corresponding identifiers inside Anakin;
  • LayOutType is the data layout type, such as batch x channel x height x width [NxCxHxW]; Anakin identifies it with a struct. The mapping between Anakin data types and basic data types is as follows:

TargetType

| Anakin TargetType | platform |
| :---: | :--- |
| NV | NVIDIA GPU |
| ARM | ARM |
| AMD | AMD GPU |
| X86 | X86 |
| NVHX86 | NVIDIA GPU with Pinned Memory |

DataType

| Anakin DataType | C++ | Description |
| :---: | :---: | :---: |
| AK_HALF | short | fp16 |
| AK_FLOAT | float | fp32 |
| AK_DOUBLE | double | fp64 |
| AK_INT8 | char | int8 |
| AK_INT16 | short | int16 |
| AK_INT32 | int | int32 |
| AK_INT64 | long | int64 |
| AK_UINT8 | unsigned char | uint8 |
| AK_UINT16 | unsigned short | uint16 |
| AK_UINT32 | unsigned int | uint32 |
| AK_STRING | std::string | / |
| AK_BOOL | bool | / |
| AK_SHAPE | / | Anakin Shape |
| AK_TENSOR | / | Anakin Tensor |

LayOutType

| Anakin LayOutType (Tensor LayOut) | Tensor Dimension | Tensor Support | Op Support |
| :---: | :---: | :---: | :---: |
| W | 1-D | YES | NO |
| HW | 2-D | YES | NO |
| WH | 2-D | YES | NO |
| NW | 2-D | YES | YES |
| NHW | 3-D | YES | YES |
| NCHW (default) | 4-D | YES | YES |
| NHWC | 4-D | YES | NO |
| NCHW_C4 | 5-D | YES | YES |

In theory Anakin supports declaring tensors with more than one dimension, but Anakin Ops only support the four layouts NW, NHW, NCHW and NCHW_C4; NCHW is the default LayOutType, and NCHW_C4 is dedicated to the int8 data type.

Graph

The Graph class is responsible for loading an Anakin model to build the compute graph, optimizing the graph, and saving the model.

```cpp
template<typename TargetType, DataType Dtype, Precision Ptype>
class Graph ... /* inherit other class*/{

  //some implements
  ...

};
```

load

```cpp
//some declarations
...
auto graph = new Graph<NV, AK_FLOAT, Precision::FP32>();
std::string model_path = "the/path/to/where/your/models/are";
const char *model_path1 = "the/path/to/where/your/models/are";

//Loading Anakin model to generate a compute graph.
auto status = graph->load(model_path);

//Or this way.
auto status = graph->load(model_path1);
//Check whether load operation success.
if(!status){
  std::cout << "error" << endl;
  //do something...
}
```

optimize

```cpp
//some declarations
...
//Load graph.
...
//According to the ops of loaded graph, optimize compute graph.
graph->Optimize();
```

save

```cpp
//some declarations
...
//Load graph.
...
// save a model
//save_model_path: the path to where your model is.
auto status = graph->save(save_model_path);

//Checking
if(!status){
  cout << "error" << endl;
  //do something...
}
```

Net

Net is the executor of the compute graph; inputs and outputs are obtained through the Net object.

```cpp
template<typename TargetType, DataType Dtype, Precision PType, OpRunType RunType = OpRunType::ASYNC>
class Net{
  //some implements
  ...

};
```

  • Precision specifies the precision of Ops.
  • OpRunType selects synchronous or asynchronous execution, and asynchronous is the default. OpRunType::SYNC means synchronous, with a single stream on the GPU; OpRunType::ASYNC means asynchronous, with multiple streams executing asynchronously on the GPU.

Precision

| Precision | Op support |
| :---: | :---: |
| Precision::INT4 | NO |
| Precision::INT8 | NO |
| Precision::FP16 | NO |
| Precision::FP32 | YES |
| Precision::FP64 | NO |

For now Ops only support FP32 precision, but the remaining precisions will be supported in the future.

OpRunType

| OpRunType | Sync/Async | Description |
| :---: | :---: | :---: |
| OpRunType::SYNC | Synchronous | single-stream on GPU |
| OpRunType::ASYNC | Asynchronous | multi-stream on GPU |

create an executor

```cpp
//some declarations
...
//Create a pointer to a graph.
auto graph = new Graph<NV, AK_FLOAT, Precision::FP32>();
//do something...
...

//create an executor
Net<NV, AK_FLOAT, Precision::FP32> executor(*graph);
```

get input tensor

```cpp
//some declarations
...

//create an executor
//TargetType is NV [NVIDIA GPU]
Net<NV, AK_FLOAT, Precision::FP32> executor(*graph);

//Get the first input tensor.
//The following tensors(tensor_in0, tensor_in2 ...) are resident at GPU.
//Note: Member function get_in returns a pointer to tensor.
Tensor<NV, AK_FLOAT>* tensor_in0 = executor.get_in("input_0");

//If you have multiple input tensors
//You just type this code below.
Tensor<NV, AK_FLOAT>* tensor_in1 = executor.get_in("input_1");
...
auto tensor_inn = executor.get_in("input_n");
```

fill input tensor

```cpp
//This tensor is resident at GPU.
auto tensor_d_in = executor.get_in("input_0");

//If we want to feed above tensor, we must feed the tensor which is resident at host,
//and then copy the host tensor to the device one.

//using Tensor4d = Tensor<Ttype, Dtype>;
Tensor4d<X86, AK_FLOAT> tensor_h_in; //host tensor;
//Tensor<X86, AK_FLOAT> tensor_h_in;

//Allocate memory for host tensor.
tensor_h_in.re_alloc(tensor_d_in->valid_shape());
//Get a writable pointer to tensor.
float *h_data = tensor_h_in.mutable_data();

//Feed your tensor.
/** example
for(int i = 0; i < tensor_h_in.size(); i++){
  h_data[i] = 1.0f;
}
*/
//Copy host tensor's data to device tensor.
tensor_d_in->copy_from(tensor_h_in);

// And then
```

get output tensor

```cpp
//Note: this tensor is resident at GPU.
Tensor<NV, AK_FLOAT>* tensor_out_d = executor.get_out("pred_out");
```

execute graph

```cpp
executor.prediction();
```

code example

```cpp
std::string model_path = "your_Anakin_models/xxxxx.anakin.bin";
// Create an empty graph object.
auto graph = new Graph<NV, AK_FLOAT, Precision::FP32>();
// Load Anakin model.
auto status = graph->load(model_path);
if(!status ) {
LOG(FATAL) << " [ERROR] " << status.info();
}
// Reshape
graph->Reshape("input_0", {10, 384, 960, 10});
// You must optimize graph for the first time.
graph->Optimize();
// Create a executer.
Net<NV, AK_FLOAT, Precision::FP32> net_executer(*graph);

//Get your input tensors through some specific string such as "input_0", "input_1", and
//so on.
//And then, feed the input tensor.
//If you don't know Which input do these specific string ("input_0", "input_1") correspond with, you can launch dash board to find out.
auto d_tensor_in_p = net_executer.get_in("input_0");
Tensor4d<X86, AK_FLOAT> h_tensor_in;
auto valid_shape_in = d_tensor_in_p->valid_shape();
for (int i=0; i<valid_shape_in.size(); i++) {
LOG(INFO) << "detect input dims[" << i << "]" << valid_shape_in[i]; //see tensor's dimentions
}
h_tensor_in.re_alloc(valid_shape_in);
float* h_data = h_tensor_in.mutable_data();
for (int i=0; i<h_tensor_in.size(); i++) {
h_data[i] = 1.0f;
}
d_tensor_in_p->copy_from(h_tensor_in);

//Do inference.
net_executer.prediction();

//Get result tensor through the name of output node.
//And also, you need to see the dash board again to find out how many output nodes are and remember their name.

//For example, you've got a output node named obj_pre_out
//Then, you can get an output tensor.
auto d_tensor_out_0_p = net_executer.get_out("obj_pred_out"); //get_out returns a pointer to output tensor.
auto d_tensor_out_1_p = net_executer.get_out("lc_pred_out"); //get_out returns a pointer to output tensor.
//......
// do something else ...
//...
//save model.
//You might not optimize the graph when you load the saved model again.
std::string save_model_path = model_path + std::string(".saved");
status = graph->save(save_model_path);
if (!status ) {
LOG(FATAL) << " [ERROR] " << status.info();
}
```

anakin converter

```bash
cd anakin/tools/external_converter_v2
sudo pip install flask prettytable

vim config.yaml
# ...
python converter.py
```

config.yaml

```yaml
OPTIONS:
    Framework: CAFFE
    SavePath: ./output
    ResultName: mylenet
    Config:
        LaunchBoard: ON
        Server:
            ip: 0.0.0.0
            port: 8888
        OptimizedGraph:
            enable: OFF
            path: ./anakin_optimized/lenet.anakin.bin.saved
    LOGGER:
        LogToPath: ./log/
        WithColor: ON

TARGET:
    CAFFE:
        # path to proto files
        ProtoPaths:
            - /home/kezunlin/program/caffe/src/caffe/proto/caffe.proto
        PrototxtPath: /home/kezunlin/program/caffe/examples/mnist/lenet.prototxt
        ModelPath: /home/kezunlin/program/caffe/examples/mnist/lenet_iter_10000.caffemodel

    FLUID:
        # path of fluid inference model
        Debug: NULL # Generally no need to modify.
        ModelPath: /path/to/your/model/ # The upper path of a fluid inference model.
        NetType: # Generally no need to modify.

    LEGO:
        # path to proto files
        ProtoPath:
        PrototxtPath:
        ModelPath:

    TENSORFLOW:
        ProtoPaths: /
        PrototxtPath: /
        ModelPath: /
        OutPuts:

    ONNX:
        ProtoPath:
        PrototxtPath:
        ModelPath:
```

  • input: caffe.proto + lenet.prototxt + lenet_iter_10000.caffemodel
  • output: output/mylenet.anakin.bin + log/xxx.log

anakin test

model_test.cpp

```bash
cat Anakin/test/framework/net/model_test.cpp

cd gpu_output
./unit_test/model_test '/home/kezunlin/program/anakin/demo/model/'
```

example_nv_cnn_net.cpp

```bash
cat Anakin/examples/cuda/example_nv_cnn_net.cpp
```

my example

my workspace

```bash
ls demo/
anakin_lib build cmake CMakeLists.txt image model src


tree demo/src/ demo/model/ demo/cmake demo/image
demo/src/
└── demo.cpp
demo/model/
└── mylenet.anakin.bin
demo/cmake
├── anakin-config.cmake
├── msg_color.cmake
├── statistic.cmake
└── utils.cmake
demo/image
├── big.jpg
└── cat.jpg

0 directories, 8 files
```

anakin_lib

use ./tools/gpu_build.sh to generate gpu_build_sm61 and rename to anakin_lib

```bash
./tools/gpu_build.sh
# ...

mv gpu_build_sm61 anakin_lib

ls anakin_lib/
anakin_config.h libanakin_saber_common.so libanakin.so log unit_test
framework libanakin_saber_common.so.0.1.2 libanakin.so.0.1.2 saber utils
```

anakin-config.cmake

```cmake
set(ANAKIN_FOUND TRUE) # auto 
set(ANAKIN_VERSION 0.1.2)
set(ANAKIN_ROOT_DIR "/home/kezunlin/program/anakin/demo/anakin_lib")

set(ANAKIN_ROOT ${ANAKIN_ROOT_DIR})
set(ANAKIN_FRAMEWORK ${ANAKIN_ROOT}/framework)
set(ANAKIN_SABER ${ANAKIN_ROOT}/saber)
set(ANAKIN_UTILS ${ANAKIN_ROOT}/utils)


set(ANAKIN_FRAMEWORK_CORE ${ANAKIN_FRAMEWORK}/core)
set(ANAKIN_FRAMEWORK_GRAPH ${ANAKIN_FRAMEWORK}/graph)
set(ANAKIN_FRAMEWORK_LITE ${ANAKIN_FRAMEWORK}/lite)
set(ANAKIN_FRAMEWORK_MODEL_PARSER ${ANAKIN_FRAMEWORK}/model_parser)
set(ANAKIN_FRAMEWORK_OPERATORS ${ANAKIN_FRAMEWORK}/operators)

set(ANAKIN_SABER_CORE ${ANAKIN_SABER}/core)
set(ANAKIN_SABER_FUNCS ${ANAKIN_SABER}/funcs)
set(ANAKIN_SABER_LITE ${ANAKIN_SABER}/lite)

set(ANAKIN_UTILS_LOGGER ${ANAKIN_UTILS}/logger)
set(ANAKIN_UTILS_UINT_TEST ${ANAKIN_UTILS}/unit_test)

#find_path(ANAKIN_INCLUDE_DIR NAMES anakin_config.h PATHS "${ANAKIN_ROOT_DIR}")
mark_as_advanced(ANAKIN_INCLUDE_DIR) # show entry in cmake-gui

find_library(ANAKIN_SABER_COMMON_LIBRARY NAMES anakin_saber_common PATHS "${ANAKIN_ROOT_DIR}")
mark_as_advanced(ANAKIN_SABER_COMMON_LIBRARY) # show entry in cmake-gui

find_library(ANAKIN_LIBRARY NAMES anakin PATHS "${ANAKIN_ROOT_DIR}")
mark_as_advanced(ANAKIN_LIBRARY) # show entry in cmake-gui

# use xxx_INCLUDE_DIRS and xxx_LIBRARIES in CMakeLists.txt
set(ANAKIN_INCLUDE_DIRS
${ANAKIN_ROOT}
${ANAKIN_FRAMEWORK}
${ANAKIN_SABER}
${ANAKIN_UTILS}

${ANAKIN_FRAMEWORK_CORE}
${ANAKIN_FRAMEWORK_GRAPH}
${ANAKIN_FRAMEWORK_LITE}
${ANAKIN_FRAMEWORK_MODEL_PARSER}
${ANAKIN_FRAMEWORK_OPERATORS}

${ANAKIN_SABER_CORE}
${ANAKIN_SABER_FUNCS}
${ANAKIN_SABER_LITE}

${ANAKIN_UTILS_LOGGER}
${ANAKIN_UTILS_UINT_TEST}
)

set(ANAKIN_LIBRARIES ${ANAKIN_SABER_COMMON_LIBRARY} ${ANAKIN_LIBRARY} )

message( "anakin-config.cmake " ${ANAKIN_ROOT_DIR})
```

CMakeLists.txt

```cmake
cmake_minimum_required(VERSION 2.8.8)

project(demo)

include(cmake/msg_color.cmake)
include(cmake/utils.cmake)
include(cmake/statistic.cmake)

#add_definitions( -Dshared_DEBUG) # define macro

set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -std=c++11")

set(ROOT_CMAKE_DIR ./cmake)
set(CMAKE_PREFIX_PATH ${CMAKE_PREFIX_PATH} "${ROOT_CMAKE_DIR};${CMAKE_PREFIX_PATH}")
MESSAGE( [cmake] " CMAKE_PREFIX_PATH = ${CMAKE_PREFIX_PATH} for find_package")

# Find includes in corresponding build directories
set(CMAKE_INCLUDE_CURRENT_DIR ON)

find_package(OpenCV REQUIRED COMPONENTS core highgui imgproc features2d calib3d)
include_directories(${OpenCV_INCLUDE_DIRS})

# find anakin-config.cmake file
#include(cmake/anakin-config.cmake)
find_package(ANAKIN REQUIRED)
include_directories(${ANAKIN_INCLUDE_DIRS})

#message( [opencv] ${OpenCV_INCLUDE_DIRS} )
#message( [opencv] ${OpenCV_LIBS} )
#message( [anakin] ${ANAKIN_INCLUDE_DIRS} )
#message( [anakin] ${ANAKIN_LIBRARIES} )

add_executable(${PROJECT_NAME}
src/demo.cpp
)

# dl pthread
# error with -std=c++11 -lpthread -ldl

target_link_libraries(${PROJECT_NAME}
dl
pthread
${OpenCV_LIBS}
${ANAKIN_LIBRARIES}
)
```

src/demo.cpp

edit from Anakin/examples/cuda/example_nv_cnn_net.cpp

```cpp
#include <iostream>
using namespace std;

// opencv
#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>
#include <opencv2/highgui.hpp>
using namespace cv;

// anakin
#include "utils/logger/logger.h"
#include "framework/graph/graph.h"
#include "framework/core/net/net.h"

/*util to fill tensor*/
#include "saber/core/tensor_op.h"
using namespace anakin;
using namespace anakin::graph;
using namespace anakin::saber;

/*
+------------+-----------------+-------+-----------+
| Input Name | Shape | Alias | Data Type |
+------------+-----------------+-------+-----------+
| input_0 | [64, 1, 28, 28] | NULL | NULL |
+------------+-----------------+-------+-----------+
+-------------+
| Output Name |
+-------------+
| prob_out |
+-------------+
*/

int fill_tensor(Tensor4d<X86, AK_FLOAT>& h_tensor_in, const cv::Mat& image)
{
// write data to tensor
int height = image.rows;
int width = image.cols;

LOG(INFO)<<"height*width ="<< height*width <<std::endl; // 784
LOG(INFO)<<"h_tensor_in.size() ="<<h_tensor_in.size()<<std::endl; // 784

float* tensor_ptr = h_tensor_in.mutable_data(); // int, float or double.

const float* ptr;
for (int h = 0; h < height; ++h)
{
ptr = image.ptr<float>(h); // row ptr
for (int w = 0; w < width; ++w)
{
*tensor_ptr++ = *ptr++;
}
}

return 1;
}

int main(int argc, const char** argv) {

const char *model_path = "../model/mylenet.anakin.bin";

Mat image = imread("../image/cat.jpg",0);
cv::resize(image,image,Size(28,28));
//imshow("image",image);
//waitKey(0);

/*init graph object, graph is the skeleton of model*/
Graph<NV, AK_FLOAT, Precision::FP32> graph;

/*load model from file to init the graph*/
auto status = graph.load(model_path);
if (!status) {
LOG(FATAL) << " [ERROR] " << status.info();
}

/*set net input shape and use this shape to optimize the graph(fusion and init operator),shape is n,c,h,w*/
graph.Reshape("input_0", {1, 1, 28, 28});
graph.Optimize();

/*net_executer is the executor object of model. use graph to init Net*/
Net<NV, AK_FLOAT, Precision::FP32> net_executer(graph, true);

/*use input string to get the input tensor of net. for we use NV as target, the tensor of net_executer is on GPU memory*/
auto d_tensor_in_p = net_executer.get_in("input_0");
auto valid_shape_in = d_tensor_in_p->valid_shape();

/*create tensor located in host*/
Tensor4d<X86, AK_FLOAT> h_tensor_in;

/*alloc for host tensor*/
h_tensor_in.re_alloc(valid_shape_in);

/*init host tensor by random*/
//fill_tensor_host_rand(h_tensor_in, -1.0f, 1.0f);

image.convertTo(image, CV_32FC1); // faster
fill_tensor(h_tensor_in,image);

/*use host tensor to int device tensor which is net input*/
d_tensor_in_p->copy_from(h_tensor_in);

/*run infer*/
net_executer.prediction();

LOG(INFO)<<"infer finish";

/*get the out put of net, which is a device tensor*/
auto d_out=net_executer.get_out("prob_out");

/*create another host tensor, and copy the content of device tensor to host*/
Tensor4d<X86, AK_FLOAT> h_tensor_out;
h_tensor_out.re_alloc(d_out->valid_shape());
h_tensor_out.copy_from(*d_out);

/*show output content*/
for(int i=0;i<h_tensor_out.valid_size();i++){
LOG(INFO)<<"out ["<<i<<"] = "<<h_tensor_out.data()[i];
}
}
```

compile demo

```bash
mkdir build
cd build
cmake ..
make
./demo
```

output

```
ERR| 16:45:56.00581| 110838.067s|         37CBF8C0| operator_attr.h:94]  you have set the argument: is_reverse , so it's igrored by anakin
 ERR| 16:45:56.00581| 110838.067s|         37CBF8C0| operator_attr.h:94]  you have set the argument: is_reverse , so it's igrored by anakin
   0| 16:45:56.00681| 0.098s|         37CBF8C0| parser.cpp:96] graph name: LeNet
   0| 16:45:56.00681| 0.099s|         37CBF8C0| parser.cpp:101] graph in: input_0
   0| 16:45:56.00681| 0.099s|         37CBF8C0| parser.cpp:107] graph out: prob_out
   0| 16:45:56.00742| 0.159s|         37CBF8C0| graph.cpp:153]  processing in-ordered fusion : ConvBatchnormScaleReluPool
   0| 16:45:56.00742| 0.160s|         37CBF8C0| graph.cpp:153]  processing in-ordered fusion : ConvBatchnormScaleRelu
   0| 16:45:56.00742| 0.160s|         37CBF8C0| graph.cpp:153]  processing in-ordered fusion : ConvReluPool
   0| 16:45:56.00742| 0.160s|         37CBF8C0| graph.cpp:153]  processing in-ordered fusion : ConvBatchnormScale
   0| 16:45:56.00742| 0.160s|         37CBF8C0| graph.cpp:153]  processing in-ordered fusion : DeconvRelu
   0| 16:45:56.00742| 0.160s|         37CBF8C0| graph.cpp:153]  processing in-ordered fusion : ConvRelu
   0| 16:45:56.00742| 0.160s|         37CBF8C0| graph.cpp:153]  processing in-ordered fusion : PermutePower
   0| 16:45:56.00742| 0.160s|         37CBF8C0| graph.cpp:153]  processing in-ordered fusion : ConvBatchnorm
   0| 16:45:56.00742| 0.160s|         37CBF8C0| graph.cpp:153]  processing in-ordered fusion : EltwiseRelu
   0| 16:45:56.00742| 0.160s|         37CBF8C0| graph.cpp:153]  processing in-ordered fusion : EltwiseActivation
 WAN| 16:45:56.00743| 0.160s|         37CBF8C0| net.cpp:663] Detect and initial 1 lanes.
   0| 16:45:56.00743| 0.161s|         37CBF8C0| env.h:44] found 1 device(s)
   0| 16:45:56.00743| 0.161s|         37CBF8C0| cuda_device.cpp:45] Device id: 0 , name: GeForce GTX 1060
   0| 16:45:56.00743| 0.161s|         37CBF8C0| cuda_device.cpp:47] Multiprocessors: 10
   0| 16:45:56.00743| 0.161s|         37CBF8C0| cuda_device.cpp:50] frequency:1733MHz
   0| 16:45:56.00743| 0.161s|         37CBF8C0| cuda_device.cpp:52] CUDA Capability : 6.1
   0| 16:45:56.00743| 0.161s|         37CBF8C0| cuda_device.cpp:54] total global memory: 6078MBytes.
 WAN| 16:45:56.00743| 0.161s|         37CBF8C0| net.cpp:667] Current used device id : 0
 WAN| 16:45:56.00744| 0.161s|         37CBF8C0| input.cpp:16] Parsing Input op parameter.
   0| 16:45:56.00744| 0.161s|         37CBF8C0| input.cpp:19]  |-- shape [0]: 1
   0| 16:45:56.00744| 0.161s|         37CBF8C0| input.cpp:19]  |-- shape [1]: 1
   0| 16:45:56.00744| 0.161s|         37CBF8C0| input.cpp:19]  |-- shape [2]: 28
   0| 16:45:56.00744| 0.161s|         37CBF8C0| input.cpp:19]  |-- shape [3]: 28
 ERR| 16:45:56.00744| 0.161s|         37CBF8C0| net.cpp:210] node_ptr->get_op_name()  sass not support yet.
 ERR| 16:45:56.00744| 0.161s|         37CBF8C0| net.cpp:210] node_ptr->get_op_name()  sass not support yet.
 WAN| 16:45:57.00269| 0.686s|         37CBF8C0| context.h:40] device index exceeds the number of devices, set to default device(0)!
   0| 16:45:57.00270| 0.687s|         37CBF8C0| net.cpp:300] Temp mem used:        0 MB
   0| 16:45:57.00270| 0.687s|         37CBF8C0| net.cpp:301] Original mem used:    0 MB
   0| 16:45:57.00270| 0.687s|         37CBF8C0| net.cpp:302] Model mem used:       1 MB
   0| 16:45:57.00270| 0.687s|         37CBF8C0| net.cpp:303] System mem used:      153 MB
   0| 16:45:57.00270| 0.687s|         37CBF8C0| demo.cpp:40] height*width =784
   0| 16:45:57.00270| 0.687s|         37CBF8C0| demo.cpp:41] h_tensor_in.size() =784
   0| 16:45:57.00270| 0.688s|         37CBF8C0| demo.cpp:105] infer finish
   0| 16:45:57.00270| 0.688s|         37CBF8C0| demo.cpp:117] out [0] = 0
   0| 16:45:57.00270| 0.688s|         37CBF8C0| demo.cpp:117] out [1] = 0
   0| 16:45:57.00270| 0.688s|         37CBF8C0| demo.cpp:117] out [2] = 0
   0| 16:45:57.00270| 0.688s|         37CBF8C0| demo.cpp:117] out [3] = 1
   0| 16:45:57.00270| 0.688s|         37CBF8C0| demo.cpp:117] out [4] = 0
   0| 16:45:57.00270| 0.688s|         37CBF8C0| demo.cpp:117] out [5] = 0
   0| 16:45:57.00270| 0.688s|         37CBF8C0| demo.cpp:117] out [6] = 0
   0| 16:45:57.00270| 0.688s|         37CBF8C0| demo.cpp:117] out [7] = 0
   0| 16:45:57.00270| 0.688s|         37CBF8C0| demo.cpp:117] out [8] = 0
   0| 16:45:57.00270| 0.688s|         37CBF8C0| demo.cpp:117] out [9] = 0
```

For Windows (skip)

version

  • windows 10
  • vs 2015
  • cmake 3.2.2
  • cuda 8.0 + cudnn 6.0.21 (same as caffe) sm_61
  • protobuf 3.4.0

protobuf

see compile protobuf-cpp on windows 10

compile

```bash
#git clone https://github.com/PaddlePaddle/Anakin.git anakin
git clone https://github.com/kezunlin/Anakin.git anakin
cd anakin
mkdir build && cd build && cmake-gui ..
```

with options

```
CUDNN_ROOT "C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v8.0/"
PROTOBUF_ROOT "C:/Program Files/protobuf"

BUILD_SHARED ON
USE_GPU_PLACE ON
USE_OPENMP OFF
USE_OPENCV ON
```

generate Anakin.sln and compile with VS 2015 in x64 Release mode.

error fixes

we get 101 errors, which are hard to fix.
skip for now.

History

  • 20180903: created.

Docker Guide

install docker

```bash
# step 1: install tools
sudo apt-get update
sudo apt-get -y install apt-transport-https ca-certificates curl software-properties-common

# step 2: install GPG
curl -fsSL http://mirrors.aliyun.com/docker-ce/linux/ubuntu/gpg | sudo apt-key add -

# Step 3: add apt repo
sudo add-apt-repository "deb [arch=amd64] http://mirrors.aliyun.com/docker-ce/linux/ubuntu $(lsb_release -cs) stable"

# Step 4: install docker-ce
sudo apt-get -y update
sudo apt-get -y install docker-ce
```

install docker-ce for given version

```bash
# Step 1: search versions
# apt-cache madison docker-ce
# docker-ce | 17.03.1~ce-0~ubuntu-xenial | http://mirrors.aliyun.com/docker-ce/linux/ubuntu xenial/stable amd64 Packages
# docker-ce | 17.03.0~ce-0~ubuntu-xenial | http://mirrors.aliyun.com/docker-ce/linux/ubuntu xenial/stable amd64 Packages

# Step 2: install given version
# sudo apt-get -y install docker-ce=17.03.1~ce-0~ubuntu-xenial
```

test docker

```bash
sudo docker version
Client:
Version: 18.06.1-ce
API version: 1.38
Go version: go1.10.3
Git commit: e68fc7a
Built: Tue Aug 21 17:24:56 2018
OS/Arch: linux/amd64
Experimental: false

Server:
Engine:
Version: 18.06.1-ce
API version: 1.38 (minimum version 1.12)
Go version: go1.10.3
Git commit: e68fc7a
Built: Tue Aug 21 17:23:21 2018
OS/Arch: linux/amd64
Experimental: false
```

docker namespace

host

```bash
id
uid=1000(kezunlin) gid=1000(kezunlin) groups=1000(kezunlin),4(adm),24(cdrom),27(sudo),30(dip),46(plugdev),113(lpadmin),128(sambashare)

sudo docker images
sudo docker run -it --name kzl -v /home/kezunlin/workspace/:/home/kezunlin/workspace nvidia/cuda
```

container

```bash
root@6f167ef72a80:/home/kezunlin/workspace# ll
total 48
drwxrwxr-x 12 1000 1000 4096 Nov 30 10:04 ./
drwxr-xr-x  3 root root 4096 Nov 30 10:14 ../
drwxrwxr-x 10 1000 1000 4096 Dec  5  2017 MyGit/
drwxrwxr-x 12 1000 1000 4096 Oct 31 03:01 blog/
drwxrwxr-x  5 1000 1000 4096 Sep 20 07:33 opencv/
drwxrwxr-x  4 1000 1000 4096 Oct 31 07:55 openmp/
drwxrwxr-x  5 1000 1000 4096 Jan  9  2018 qt/
drwxrwxr-x  2 1000 1000 4096 Jan  4  2018 ros/
drwxrwxr-x  4 1000 1000 4096 Nov 16  2017 voc/
drwxrwxr-x  5 1000 1000 4096 Aug  7 03:19 vs/
root@6f167ef72a80:/home/kezunlin/workspace# touch 1.txt

root@6f167ef72a80:/home/kezunlin/workspace# id
uid=0(root) gid=0(root) groups=0(root)
```

host

```bash
ll /home/kezunlin/workspace/
total 48
drwxrwxr-x 12 kezunlin kezunlin 4096 11月 30 18:14 ./
drwxr-xr-x 47 kezunlin kezunlin 4096 11月 30 18:04 ../

-rw-r--r--  1 root     root        0 11月 30 18:14 1.txt

drwxrwxr-x 12 kezunlin kezunlin 4096 10月 31 11:01 blog/
drwxrwxr-x  5 kezunlin kezunlin 4096 9月  20 15:33 opencv/
drwxrwxr-x  4 kezunlin kezunlin 4096 10月 31 15:55 openmp/
drwxrwxr-x  5 kezunlin kezunlin 4096 1月   9  2018 qt/
drwxrwxr-x  2 kezunlin kezunlin 4096 1月   4  2018 ros/
drwxrwxr-x  4 kezunlin kezunlin 4096 11月 16  2017 voc/
drwxrwxr-x  5 kezunlin kezunlin 4096 8月   7 11:19 vs/
```

install nvidia-docker2

The machine running the CUDA container only requires the NVIDIA driver; the CUDA toolkit does not have to be installed on the host.

install

remove nvidia-docker 1.0

```bash
# If you have nvidia-docker 1.0 installed: we need to remove it and all existing GPU containers
docker volume ls -q -f driver=nvidia-docker | xargs -r -I{} -n1 docker ps -q -a -f volume={} | xargs -r docker rm -f
sudo apt-get purge -y nvidia-docker
```

Add the package repositories

vim repo.sh

```bash
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | \
	sudo apt-key add -
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
	sudo tee /etc/apt/sources.list.d/nvidia-docker.list
```

run scripts

```bash
chmod +x repo.sh
./repo.sh
```

Install nvidia-docker2 and reload the Docker daemon configuration

```bash
sudo apt-get install -y nvidia-docker2
sudo pkill -SIGHUP dockerd
```

test

```bash
sudo docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi
```

output

```
Unable to find image 'nvidia/cuda:latest' locally
latest: Pulling from nvidia/cuda
8ee29e426c26: Pull complete
6e83b260b73b: Pull complete
e26b65fd1143: Pull complete
40dca07f8222: Pull complete
b420ae9e10b3: Pull complete
a579c1327556: Pull complete
b440bb8df79e: Pull complete
de3b2ccf9562: Pull complete
a69a544d350e: Pull complete
02348b5db71c: Pull complete
Digest: sha256:5996fa2fc0666972360502fe32118286177b879a8a1a834a176e7786021b8cee
Status: Downloaded newer image for nvidia/cuda:latest
Mon Sep 3 10:08:27 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.130 Driver Version: 384.130 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 1060 Off | 00000000:01:00.0 Off | N/A |
| N/A 59C P8 8W / N/A | 408MiB / 6072MiB | 40% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
```

or by tty

```bash
sudo docker run --runtime=nvidia -t -i --privileged nvidia/cuda bash

root@8f3ebd5ecbb6:/# nvidia-smi
Tue Sep 4 01:26:31 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.130 Driver Version: 384.130 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 1060 Off | 00000000:01:00.0 Off | N/A |
| N/A 56C P0 31W / N/A | 374MiB / 6072MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
```

Advanced Topics

Default runtime

The default runtime used by the Docker® Engine is runc; our runtime can become the default one by configuring the docker daemon with --default-runtime=nvidia. Doing so will remove the need to add the --runtime=nvidia argument to docker run. It is also the only way to have GPU access during docker build.

Environment variables

The behavior of the runtime can be modified through environment variables (such as NVIDIA_VISIBLE_DEVICES).
Those environment variables are consumed by nvidia-container-runtime and are documented here.
Our official CUDA images use default values for these variables.

docker command

```bash
sudo docker image list
REPOSITORY TAG IMAGE ID CREATED SIZE
nvidia/cuda latest 04a9ce0dec6d 3 weeks ago 1.96GB

sudo docker run -it --privileged nvidia/cuda bash

docker build --network=host -t anakin:$tag . -f $DockerfilePath
```

kubernetes with GPU

Kubernetes GPU support, as of version 1.9, has gone through three stages:

  • kubernetes 1.3 introduced GPU support, limited to a single GPU card;

  • kubernetes 1.6 added support for multiple GPU cards;

  • kubernetes 1.8 provides GPU support through the device plugin mechanism.

    ls /dev/nvidia*
    /dev/nvidia0 /dev/nvidia2 /dev/nvidia4 /dev/nvidia6 /dev/nvidiactl
    /dev/nvidia1 /dev/nvidia3 /dev/nvidia5 /dev/nvidia7

  • In Kubernetes 1.8~1.9, k8s-device-plugin gathers the GPU information on each Node, and GPU resources are managed and scheduled based on that information. It must be used together with nvidia-docker2.

  • k8s-device-plugin is also provided by nvidia and can be run as a DaemonSet in kubernetes; a pod spec sketch follows this list.
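With the device plugin deployed, a pod requests GPUs through the nvidia.com/gpu resource; a minimal sketch (pod name and image are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-test
spec:
  containers:
    - name: cuda-test
      image: nvidia/cuda
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1   # one GPU via the device plugin
```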

Reference

History

  • 20180903: created.

Series

Tutorial

version

version 1:

  • windows 10 64 bit + GTX 1060(8G) + cuda driver
  • windows 10 64 bit + GTX 1080(12G) + cuda driver
  • CUDA 8.0 + cudnn 6.0.1(win10) + tensorflow-gpu 1.4.0
  • python 3.5.3

version 2:

  • windows 10 64 bit + GeForce Titan Xp(12G) + cuda driver for Titan xp
  • CUDA 9.0 + cudnn 7.1.4(win10) + tensorflow-gpu 1.8.0 ( 1.8.0, 1.9.0 for cuda 9.0)

version 3:

  • windows 10 64 bit + Quadro P4000(8G) + cuda driver for Quadro P4000 (tested: the Titan Xp driver also works)
  • CUDA 9.0 + cudnn 7.1.4(win10) + tensorflow-gpu 1.8.0 ( 1.8.0, 1.9.0 for cuda 9.0)

errors

error retrieving driver version: Unimplemented: kernel reported driver version not implemented on Windows

see tensorflow-gpu==1.4.0

Tips for tensorflow-gpu==1.4.0:
on Linux, Python 2.7 and 3.3-3.6 are supported;
on Windows, only Python 3.5 and 3.6 are supported.

see tensorflow-gpu==1.8.0

Tips for tensorflow-gpu==1.8.0:
on Linux, Python 2.7 and 3.3-3.6 are supported;
on Windows, only Python 3.5 and 3.6 are supported.

Starting with TensorFlow 1.6, use CUDA 9.0 + cuDNN 7.

tensorflow download pages

cuda & cudnn

see Part 1: Install and Configure Caffe on windows 10

system env

C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\bin

python

Install python 3.5.3, then add the python and pip paths to the system environment variables.

Copy python.exe to python3.exe and copy pip.exe to pip3.exe, so the python3 and pip3 commands work on Windows; see the sketch below.
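For example, from a command prompt (a sketch; paths assume the default per-user install location shown in the next block):

```bat
cd C:\Users\zunli\AppData\Local\Programs\Python\Python35
copy python.exe python3.exe
cd Scripts
copy pip.exe pip3.exe
```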

system env

C:\Users\zunli\AppData\Local\Programs\Python\Python35\
C:\Users\zunli\AppData\Local\Programs\Python\Python35\Scripts

test

python3
Python 3.5.3 (v3.5.3:1880cb95a742, Jan 16 2017, 16:02:32) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> ^Z

pip3

pip3 -V
pip 9.0.1 from c:\users\zunli\appdata\local\programs\python\python35\lib\site-packages (python 3.5)

tensorflow

pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple Pillow scipy sklearn scikit-image matplotlib

1.4.0

pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple tensorflow-gpu==1.4.0 keras==2.1.0

1.8.0

pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple tensorflow-gpu==1.8.0 keras==2.2.0

test tensorflow

import tensorflow as tf
import numpy as np

hello = tf.constant('hhh')
sess = tf.Session()
print(sess.run(hello))

test cuda and gpu

import tensorflow as tf

a = tf.test.is_built_with_cuda()  # was TensorFlow built with CUDA support?

b = tf.test.is_gpu_available(
    cuda_only=False,
    min_cuda_compute_capability=None
)  # is a GPU available?

print(a)
print(b)

test gpu

import tensorflow as tf

with tf.device('/cpu:0'):
    a = tf.constant([1.0, 2.0, 3.0], shape=[3], name='a')
    b = tf.constant([1.0, 2.0, 3.0], shape=[3], name='b')
with tf.device('/gpu:0'):
    c = a + b

# Note: allow_soft_placement=True lets TensorFlow pick another device on its
# own; without it this errors out, because not every op can run on the GPU,
# and pinning an unsupported op to the GPU raises an error.
sess = tf.Session(config=tf.ConfigProto(allow_soft_placement=True, log_device_placement=True))
# sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
sess.run(tf.global_variables_initializer())
print(sess.run(c))

gpu run

pycharm

run code with pycharm

pycharm with python3

jupyter notebook

pip install ipykernel
python -m ipykernel install --user --name=tensorflow

Installed kernelspec tensorflow in C:\Users\zunli\AppData\Roaming\jupyter\kernels\tensorflow

error fix

errors:

No matching distribution found for tensorflow

solution: use python 3.5 instead of python 2.7

Reference

History

  • 20180829: created.

Guide

syntax

[ capture clause ] (parameters) -> return-type  
{   
   definition of method   
} 

capture

We can capture external variables from the enclosing scope in three ways:

  Capture by reference
  Capture by value (making a copy)
  Capture by both (mixed capture)

Syntax used for capturing variables:

  []      : capture nothing
  [&]     : capture all external variables by reference
  [=]     : capture all external variables by value (making a copy)
  [a, &b] : capture a by value and b by reference
  [this]  : capture the this pointer of the enclosing class

In C++11, a lambda expression can capture external variables in the following forms:

Capture form	Description
[]	capture no external variables
[var, …]	capture the listed external variables by value (comma-separated); capture by reference must be declared explicitly with &
[this]	capture the this pointer by value
[=]	capture all external variables by value
[&]	capture all external variables by reference
[=, &x]	capture x by reference, all other variables by value
[&, x]	capture x by value, all other variables by reference
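As a quick illustration of mixed capture, which the larger example below does not exercise (a minimal sketch, not from the original post):

```cpp
#include <iostream>

int main()
{
    int a = 1, b = 2;

    // [a, &b]: a by value, b by reference
    auto f = [a, &b]() { b = a + b; };
    f();
    std::cout << b << std::endl;  // prints 3

    // [=, &x]: everything by value, except x by reference
    int x = 0;
    auto g = [=, &x]() { x = a + b; };
    g();
    std::cout << x << std::endl;  // prints 4 (b is now 3)
    return 0;
}
```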

example code

#include <bits/stdc++.h>
using namespace std;

void test_lambda_0()
{
    // call the lambda immediately with a trailing ();
    [] ()
    {
        cout << "Hello, my Greek friends";
    }();

    // return value
    auto l1 = [] ()
    {
        return 1;
    }; // compiler deduces that this returns an integer

    auto l2 = [] () -> int
    {
        return 1;
    }; // now we're telling the compiler what we want
}

// Function to print vector
void printVector(const vector<int>& v)
{
    // lambda expression to print vector
    for_each(v.begin(), v.end(), [](int i)
    {
        std::cout << i << " ";
    });
    cout << endl;
}

void test_lambda_1()
{
    vector<int> v {4, 1, 3, 5, 2, 3, 1, 7};
    printVector(v);

    // capture nothing
    std::sort(v.begin(), v.end(), [](const int& a, const int& b) -> bool
    {
        return a > b;
    });
    printVector(v);

    int ans = accumulate(v.begin(), v.end(), 0,
        [](int i, int j)
        {
            return i + j;
        }
    );
    cout << "SUM = " << ans << endl;
}

void test_lambda_2()
{
    vector<int> v1 = {3, 1, 7, 9};
    vector<int> v2 = {10, 2, 7, 16, 9};

    // access v1 and v2 by reference
    auto pushinto = [&] (int m)
    {
        v1.push_back(m);
        v2.push_back(m);
    };

    // it pushes 20 in both v1 and v2
    pushinto(20);

    // access v1 by value (copy)
    auto printv = [v1]()
    {
        for (auto p = v1.begin(); p != v1.end(); p++)
        {
            cout << *p << " ";
        }
        cout << endl;
    };
    printv();

    int N = 5;
    // the snippet below finds the first number greater than N;
    // [N] means the lambda can access only N, by value
    vector<int>::iterator p = find_if(v1.begin(), v1.end(), [N](int i)
    {
        return i > N;
    });
    cout << "First number greater than 5 is : " << *p << endl;
}

class Foo
{
public:
    Foo () : _x( 3 ) {}
    void func ()
    {
        // a very silly, but illustrative way of printing out the value of _x
        [this] ()
        {
            cout << this->_x;
        } ();
    }

private:
    int _x;
};

void test_lambda_3()
{
    Foo f;
    f.func();
}

void main_demo()
{
    test_lambda_0();
    test_lambda_1();
    test_lambda_2();
    test_lambda_3();
}

int main(int argc, char const *argv[])
{
    main_demo();
    return 0;
}

Reference

History

  • 20180823: created.

Series

Guide

Mat

  • for a gray image, use type <uchar>
  • for an RGB color image, use type <Vec3b>

gray format storage (figure)

color format storage: BGR (figure)

We can use the method isContinuous() to check whether the Mat's memory buffer is continuous.
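Continuity matters because a continuous buffer can be traversed as one long row, which the scan functions below exploit. Note that a sub-matrix view is generally not continuous (a minimal sketch, with an illustrative 640x480 image):

```cpp
#include <opencv2/opencv.hpp>
#include <iostream>
using namespace cv;

int main()
{
    Mat img(480, 640, CV_8UC3, Scalar::all(0));
    std::cout << img.isContinuous() << std::endl;  // 1: freshly allocated, one block

    Mat roi = img(Rect(10, 10, 100, 100));
    std::cout << roi.isContinuous() << std::endl;  // 0: rows are strided views
    return 0;
}
```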

color space reduction

uchar color_space_reduction(uchar pixel)
{
    /*
    0-9     ===> 0
    10-19   ===> 10
    20-29   ===> 20
    ...
    240-249 ===> 240
    250-255 ===> 250

    map from 256*256*256 ===> 26*26*26
    */

    int divideWith = 10;
    uchar new_pixel = (pixel / divideWith) * divideWith;
    return new_pixel;
}
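For example, a pixel value of 137 maps to (137 / 10) * 10 = 130.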

color table

void get_color_table()
{
    // cache reduced color values in table[256]
    int divideWith = 10;
    uchar table[256];
    for (int i = 0; i < 256; ++i)
        table[i] = divideWith * (i / divideWith);
}

C++

ptr []

// C ptr []: faster but not safe
Mat& ScanImageAndReduce_Cptr(Mat& I, const uchar* const table)
{
    // accept only uchar type matrices
    CV_Assert(I.depth() == CV_8U);
    int channels = I.channels();
    int nRows = I.rows;
    int nCols = I.cols * channels;
    if (I.isContinuous())
    {
        nCols *= nRows;
        nRows = 1;
    }
    int i, j;
    uchar* p;
    for (i = 0; i < nRows; ++i)
    {
        p = I.ptr<uchar>(i);
        for (j = 0; j < nCols; ++j)
        {
            p[j] = table[p[j]];
        }
    }
    return I;
}

ptr ++

// C ptr ++: faster but not safe
Mat& ScanImageAndReduce_Cptr2(Mat& I, const uchar* const table)
{
    // accept only uchar type matrices
    CV_Assert(I.depth() == CV_8U);
    int channels = I.channels();
    int nRows = I.rows;
    int nCols = I.cols * channels;
    if (I.isContinuous())
    {
        nCols *= nRows;
        nRows = 1;
    }
    // note: the single pointer walk below assumes a continuous buffer;
    // on a non-continuous Mat it would run into the row padding
    uchar* start = I.ptr<uchar>(0); // same as I.ptr<uchar>(0,0)
    uchar* end = start + nRows * nCols;
    for (uchar* p = start; p < end; ++p)
    {
        *p = table[*p];
    }
    return I;
}

at(i,j)

// at<uchar>(i,j): random access, slow
Mat& ScanImageAndReduce_atRandomAccess(Mat& I, const uchar* const table)
{
    // accept only uchar type matrices
    CV_Assert(I.depth() == CV_8U);
    const int channels = I.channels();
    switch (channels)
    {
    case 1:
    {
        for (int i = 0; i < I.rows; ++i)
            for (int j = 0; j < I.cols; ++j)
                I.at<uchar>(i, j) = table[I.at<uchar>(i, j)];
        break;
    }
    case 3:
    {
        Mat_<Vec3b> _I = I;

        for (int i = 0; i < I.rows; ++i)
            for (int j = 0; j < I.cols; ++j)
            {
                _I(i, j)[0] = table[_I(i, j)[0]];
                _I(i, j)[1] = table[_I(i, j)[1]];
                _I(i, j)[2] = table[_I(i, j)[2]];
            }
        I = _I;
        break;
    }
    }
    return I;
}

Iterator

// MatIterator_<uchar>: safe but slow
Mat& ScanImageAndReduce_Iterator(Mat& I, const uchar* const table)
{
    // accept only uchar type matrices
    CV_Assert(I.depth() == CV_8U);
    const int channels = I.channels();
    switch (channels)
    {
    case 1:
    {
        MatIterator_<uchar> it, end;
        for (it = I.begin<uchar>(), end = I.end<uchar>(); it != end; ++it)
            *it = table[*it];
        break;
    }
    case 3:
    {
        MatIterator_<Vec3b> it, end;
        for (it = I.begin<Vec3b>(), end = I.end<Vec3b>(); it != end; ++it)
        {
            (*it)[0] = table[(*it)[0]];
            (*it)[1] = table[(*it)[1]];
            (*it)[2] = table[(*it)[2]];
        }
        break;
    }
    }
    return I;
}

opencv LUT

// LUT
Mat& ScanImageAndReduce_LUT(Mat& I, const uchar* const table)
{
    Mat lookUpTable(1, 256, CV_8U);
    uchar* p = lookUpTable.data;
    for (int i = 0; i < 256; ++i)
        p[i] = table[i];

    cv::LUT(I, lookUpTable, I);
    return I;
}

forEach

The forEach method of the Mat class utilizes all the cores on your machine to apply a function at every pixel.

// Parallel execution with function object.
struct ForEachOperator
{
    uchar m_table[256];
    ForEachOperator(const uchar* const table)
    {
        for (size_t i = 0; i < 256; i++)
        {
            m_table[i] = table[i];
        }
    }

    void operator ()(uchar& p, const int * position) const
    {
        // Perform a simple operation
        p = m_table[p];
    }
};

// forEach uses multiple processors, very fast
Mat& ScanImageAndReduce_forEach(Mat& I, const uchar* const table)
{
    I.forEach<uchar>(ForEachOperator(table));
    return I;
}

forEach with lambda

// forEach with a lambda also uses multiple processors, very fast
// (the lambda is slightly slower than ForEachOperator)
Mat& ScanImageAndReduce_forEach_with_lambda(Mat& I, const uchar* const table)
{
    I.forEach<uchar>
    (
        [=](uchar &p, const int * position) -> void
        {
            p = table[p];
        }
    );
    return I;
}

time cost

no foreach

[1 Cptr   ] times=5000, total_cost=988 ms, avg_cost=0.1976 ms
[1 Cptr2  ] times=5000, total_cost=1704 ms, avg_cost=0.3408 ms
[2 atRandom] times=5000, total_cost=9611 ms, avg_cost=1.9222 ms
[3 Iterator] times=5000, total_cost=20195 ms, avg_cost=4.039 ms
[4 LUT    ] times=5000, total_cost=899 ms, avg_cost=0.1798 ms

[1 Cptr   ] times=10000, total_cost=2425 ms, avg_cost=0.2425 ms
[1 Cptr2  ] times=10000, total_cost=3391 ms, avg_cost=0.3391 ms
[2 atRandom] times=10000, total_cost=20024 ms, avg_cost=2.0024 ms
[3 Iterator] times=10000, total_cost=39980 ms, avg_cost=3.998 ms
[4 LUT    ] times=10000, total_cost=103 ms, avg_cost=0.0103 ms

foreach

[5 forEach     ] times=200000, total_cost=199 ms, avg_cost=0.000995 ms
[5 forEach lambda] times=200000, total_cost=521 ms, avg_cost=0.002605 ms

[5 forEach     ] times=20000, total_cost=17 ms, avg_cost=0.00085 ms
[5 forEach lambda] times=20000, total_cost=23 ms, avg_cost=0.00115 ms

results

Loop Type      | Time Cost (us)
:------------: | :------------:
ptr []         | 242
ptr ++         | 339
at             | 2002
iterator       | 3998
LUT            | 10
forEach        | 0.85
forEach lambda | 1.15

forEach is about 10x faster than LUT, 240-340x faster than ptr [] and ptr ++, and 2000-4000x faster than at and iterator.

code

code here

Python

pure python

# import the necessary packages
import matplotlib.pyplot as plt
import cv2
print(cv2.__version__)

%matplotlib inline
3.4.2
# load the original image, convert it to grayscale, and display
# it inline
image = cv2.imread("cat.jpg")
image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
print(image.shape)
#plt.imshow(image, cmap="gray")
(360, 480)
%load_ext cython
The cython extension is already loaded. To reload it, use:
  %reload_ext cython
%%cython -a

def threshold_python(T, image):
    # grab the image dimensions
    h = image.shape[0]
    w = image.shape[1]

    # loop over the image, pixel by pixel
    for y in range(0, h):
        for x in range(0, w):
            # threshold the pixel
            image[y, x] = 255 if image[y, x] >= T else 0

    # return the thresholded image
    return image
%timeit threshold_python(5, image)
263 ms ± 20.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

cython

%%cython -a

import cython

@cython.boundscheck(False)
cpdef unsigned char[:, :] threshold_cython(int T, unsigned char [:, :] image):
    # set the variable extension types
    cdef int x, y, w, h

    # grab the image dimensions
    h = image.shape[0]
    w = image.shape[1]

    # loop over the image
    for y in range(0, h):
        for x in range(0, w):
            # threshold the pixel
            image[y, x] = 255 if image[y, x] >= T else 0

    # return the thresholded image
    return image

%timeit threshold_cython(5, image)
150 µs ± 7.14 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

numba
from numba import njit

@njit
def threshold_njit(T, image):
    # grab the image dimensions
    h = image.shape[0]
    w = image.shape[1]

    # loop over the image, pixel by pixel
    for y in range(0, h):
        for x in range(0, w):
            # threshold the pixel
            image[y, x] = 255 if image[y, x] >= T else 0

    # return the thresholded image
    return image
%timeit threshold_njit(5, image)
43.5 µs ± 142 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

numpy

def threshold_numpy(T, image):
    image[image > T] = 255
    return image
%timeit threshold_numpy(5, image)
111 µs ± 334 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
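For reference, OpenCV ships a vectorized threshold as well; note that cv2.threshold zeroes pixels at or below T, whereas the numpy version above only sets the bright pixels and leaves the rest unchanged (a minimal sketch):

```python
import cv2

def threshold_opencv(T, image):
    # THRESH_BINARY: dst = 255 if src > T else 0
    _, out = cv2.threshold(image, T, 255, cv2.THRESH_BINARY)
    return out
```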

conclusions

image = cv2.imread("cat.jpg")
image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
print(image.shape)

%timeit threshold_python(5, image)
%timeit threshold_cython(5, image)
%timeit threshold_njit(5, image)
%timeit threshold_numpy(5, image)
(360, 480)
251 ms ± 6.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
143 µs ± 1.19 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
43.8 µs ± 284 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
113 µs ± 957 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
image = cv2.imread("big.jpg")
image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
print(image.shape)

%timeit threshold_python(5, image)
%timeit threshold_cython(5, image)
%timeit threshold_njit(5, image)
%timeit threshold_numpy(5, image)
(2880, 5120)
21.8 s ± 460 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
12.3 ms ± 231 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.91 ms ± 66.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
10.3 ms ± 179 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

(360, 480)

  • python: 251 ms
  • cython: 143 us
  • numba: 43 us
  • numpy: 113 us

(2880, 5120)

  • python: 21 s
  • cython: 12 ms
  • numba: 4 ms
  • numpy: 10 ms

Reference

History

  • 20180823: created.