
Guide

requirements:

  • ubuntu: 16.04
  • python: 3.5.2
  • opencv: 3.4.2+
  • tesseract: v4 (binary)
  • pytesseract: 0.2.4 (python bindings)

### install python

```bash
apt-get install python3-dev python3-pip
```

### install opencv
```bash
workon py3
pip install opencv-contrib-python
```

### install tesseract
```bash
sudo add-apt-repository ppa:alex-p/tesseract-ocr
sudo apt-get update
sudo apt install tesseract-ocr
```

The latest release of Tesseract (v4) supports deep learning-based OCR that is significantly more accurate.

The underlying OCR engine itself utilizes a Long Short-Term Memory (LSTM) network, a kind of Recurrent Neural Network (RNN).

check

```bash
tesseract -v
tesseract 4.0.0-beta.4-138-g2093
leptonica-1.76.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.1.2
Found AVX2
Found AVX
Found SSE
```

### install Tesseract + Python bindings

```bash
workon py3
pip install pytesseract
pip install pillow imutils
```
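
a quick sanity check of the bindings (a minimal sketch; `images/example_01.jpg` stands for any test image on disk):

```python
from PIL import Image
import pytesseract

# run OCR on a single image with the default engine settings
text = pytesseract.image_to_string(Image.open("images/example_01.jpg"))
print(text)
```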

tesseract

help

```bash
tesseract --help
Usage:
  tesseract --help | --help-extra | --version
  tesseract --list-langs
  tesseract imagename outputbase [options...] [configfile...]

OCR options:
  -l LANG[+LANG]        Specify language(s) used for OCR.
NOTE: These options must occur before any configfile.

Single options:
  --help                Show this help message.
  --help-extra          Show extra help for advanced users.
  --version             Show version information.
  --list-langs          List available languages for tesseract engine.
```

help-extra

```bash
tesseract --help-extra
Usage:
  tesseract --help | --help-extra | --help-psm | --help-oem | --version
  tesseract --list-langs [--tessdata-dir PATH]
  tesseract --print-parameters [options...] [configfile...]
  tesseract imagename|imagelist|stdin outputbase|stdout [options...] [configfile...]

OCR options:
  --tessdata-dir PATH   Specify the location of tessdata path.
  --user-words PATH     Specify the location of user words file.
  --user-patterns PATH  Specify the location of user patterns file.
  -l LANG[+LANG]        Specify language(s) used for OCR.
  -c VAR=VALUE          Set value for config variables.
                        Multiple -c arguments are allowed.
  --psm NUM             Specify page segmentation mode.
  --oem NUM             Specify OCR Engine mode.
NOTE: These options must occur before any configfile.

Page segmentation modes:
  0    Orientation and script detection (OSD) only.
  1    Automatic page segmentation with OSD.
  2    Automatic page segmentation, but no OSD, or OCR.
  3    Fully automatic page segmentation, but no OSD. (Default)
  4    Assume a single column of text of variable sizes.
  5    Assume a single uniform block of vertically aligned text.
  6    Assume a single uniform block of text.
  7    Treat the image as a single text line.
  8    Treat the image as a single word.
  9    Treat the image as a single word in a circle.
 10    Treat the image as a single character.
 11    Sparse text. Find as much text as possible in no particular order.
 12    Sparse text with OSD.
 13    Raw line. Treat the image as a single text line,
       bypassing hacks that are Tesseract-specific.

OCR Engine modes: (see https://github.com/tesseract-ocr/tesseract/wiki#linux)
  0    Legacy engine only.
  1    Neural nets LSTM engine only.
  2    Legacy + LSTM engines.
  3    Default, based on what is available.

Single options:
  -h, --help            Show minimal help message.
  --help-extra          Show extra help for advanced users.
  --help-psm            Show page segmentation modes.
  --help-oem            Show OCR Engine modes.
  -v, --version         Show version information.
  --list-langs          List available languages for tesseract engine.
  --print-parameters    Print tesseract parameters.
```
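
in practice the page segmentation mode matters a lot; a hedged sketch of passing `--psm` and `--oem` through pytesseract's `config` argument (`line.png` is a hypothetical single-line image):

```python
from PIL import Image
import pytesseract

# --oem 1: LSTM engine only; --psm 7: treat the image as a single text line
config = "-l eng --oem 1 --psm 7"
print(pytesseract.image_to_string(Image.open("line.png"), config=config))
```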

run script

```bash
python text_recognition.py --east frozen_east_text_detection.pb \
	--image images/example_01.jpg
[INFO] loading EAST text detector...
OCR TEXT
========
OH OK
```

History

  • 20180920: created.

Guide

install python

install commands

```bash
sudo apt-get install python3-pip python3-dev

pip3 -V
pip 8.1.1 from /usr/lib/python3/dist-packages (python 3.5)
```

change pip source

ubuntu

edit .pip/pip.conf

```
[global]
index-url = http://pypi.douban.com/simple
[install]
trusted-host = pypi.douban.com
```

windows

edit C:\Users\zunli\AppData\Roaming\pip\pip.ini

```
[global]
index-url = http://pypi.douban.com/simple
[install]
trusted-host = pypi.douban.com
```

temp solutions

```bash
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple tensorflow-gpu==1.4.0
```

install virtualenv

```bash
sudo pip3 install virtualenv virtualenvwrapper
```

vim .bashrc

```bash
# for virtualenv and virtualenvwrapper
export WORKON_HOME=$HOME/.local
export VIRTUALENVWRAPPER_PYTHON=/usr/bin/python3
source /usr/local/bin/virtualenvwrapper.sh
```

source .bashrc

mkvirtualenv

```bash
kezunlin@ke: mkvirtualenv py3 -p python3
(py3) kezunlin@ke:~$
```

commands

```bash
ls $WORKON_HOME
mkvirtualenv py3 -p python3
mkvirtualenv py2 -p python2
rmvirtualenv py3

lsvirtualenv
lssitepackages

workon py3
deactivate
```
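
to double-check which interpreter an activated env resolves to, a small Python check (not part of the original tutorial):

```python
import sys

# inside an activated virtualenv, both of these point into the env directory
print(sys.prefix)
print(sys.executable)
```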

Opencv with virtualenv

python2

OpenCV should now be installed in

```bash
locate cv2.so
/usr/local/lib/python2.7/dist-packages/cv2.so
```

However, our py2 virtual environment is located in our home directory — thus to use OpenCV within our py2 environment, we first need to sym-link OpenCV into the site-packages directory of the py2 virtual environment:

```bash
cd ~/.local/py2/lib/python2.7/site-packages/
ln -s /usr/local/lib/python2.7/site-packages/cv2.so cv2.so
ln -s /usr/local/lib/python2.7/dist-packages/cv2.so cv2.so
```

import opencv

```bash
workon py2
python
>>> import cv2
>>> print(cv2.__version__)
'3.1.0'
```

python3

you may get this error

```
ImportError: dynamic module does not define init function (PyInit_cv2)
```

when importing cv2 in python3 (no such problem in python2).

install opencv-python

```bash
workon py3
pip3 install opencv-contrib-python
```

test version

```bash
workon py3
python
>>> import cv2
>>> print(cv2.__version__)
'3.4.2'
```

install pycharm

apt-get (slow)

```bash
sudo add-apt-repository ppa:mystic-mirage/pycharm

sudo apt update

# not free
sudo apt install pycharm

# free
sudo apt install pycharm-community

# remove
sudo apt remove pycharm pycharm-community && sudo apt autoremove
```

official (faster)

download from here

start by

```bash
sh pycharm.sh
```

History

  • 20180920: created.

Series

Guide

requirements:

  • ubuntu: 16.04
  • opencv: 3.3.0

install dependencies

```bash
sudo apt-get install build-essential
sudo apt-get install cmake git libgtk2.0-dev pkg-config libavcodec-dev libavformat-dev libswscale-dev
sudo apt-get install python-dev python-numpy libtbb2 libtbb-dev libjpeg-dev libpng-dev libtiff-dev libjasper-dev libdc1394-22-dev

sudo apt-get install cmake-gui
```

compile

```bash
git clone https://github.com/opencv/opencv.git
wget https://github.com/opencv/opencv/archive/3.1.0.zip

cd opencv-3.1.0
mkdir build
cd build && cmake-gui ..

# may take several minutes
sudo make -j8

# install to /usr/local/bin
sudo make install
```

check version

```bash
opencv_version
3.3.0
```

python cv2

```bash
python
>>> import cv2
>>> cv2.__version__
```

pip install opencv

```bash
workon py3
pip install opencv-contrib-python

python
>>> import cv2
>>> cv2.__version__
'3.3.0'
```

for virtualenv, see python virtualenv tutorial

opencv samples

```bash
cd samples
cmake .
make
```

Example

Code

```cpp
#include <iostream>
using namespace std;

#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>
#include <opencv2/highgui.hpp>
using namespace cv;

int main()
{
    // load as grayscale (flag 0)
    Mat image = imread("../image/cat.jpg", 0);
    if (image.empty())
    {
        cerr << "failed to load ../image/cat.jpg" << endl;
        return -1;
    }
    imshow("image", image);
    waitKey(0);
    return 0;
}
```

CMakeLists.txt

```cmake
cmake_minimum_required(VERSION 2.8.8)

project(demo)

# Find includes in corresponding build directories
set(CMAKE_INCLUDE_CURRENT_DIR ON)

find_package(OpenCV REQUIRED COMPONENTS core highgui imgproc features2d calib3d)
include_directories(${OpenCV_INCLUDE_DIRS})

message([opencv] ${OpenCV_INCLUDE_DIRS})
message([opencv] ${OpenCV_LIBS})
message([opencv] ${OpenCV_LIBRARIES})

add_executable(${PROJECT_NAME}
    demo.cpp
)
target_link_libraries(${PROJECT_NAME} ${OpenCV_LIBRARIES})
```

History

  • 20180919: created.

Overview

cuda 9.2

  • nvidia driver 396.54
  • cuda 9.2 (do not install the driver; install the toolkit and samples)
  • cudnn 7.1.4 for cuda 9.2 (for TensorRT, caffe, tensorflow, baidu anakin)

cuda 8.0

  • nvidia driver 384.130
  • cuda 8.0 (do not install the driver; install the toolkit and samples)
  • cudnn 6.0.21 for cuda 8.0 (for caffe)

prepare

GUI vs tty

  • ctrl+alt+F7 to enter the GUI
  • ctrl+alt+F1-F6 to enter tty1-6, log in with (username, password)

use fbterm instead of the default terminal when we are in tty1

```bash
sudo apt-get -y install fbterm
sudo fbterm
```

cuda and cudnn

  • download cuda_9.2.148_396.37_linux.run from cuda
  • download cudnn-9.2-linux-x64-v7.1.tgz from cudnn

Steps

install general dependencies

```bash
apt-get install libprotobuf-dev libleveldb-dev libsnappy-dev libhdf5-serial-dev protobuf-compiler
apt-get install --no-install-recommends libboost-all-dev

# blas
sudo apt-get install libopenblas-dev liblapack-dev libatlas-base-dev

sudo apt-get install libgflags-dev libgoogle-glog-dev liblmdb-dev

sudo apt-get install git cmake build-essential

# fix missing
#sudo apt-get install freeglut3-dev build-essential libx11-dev libxmu-dev libxi-dev libgl1-mesa-glx libglu1-mesa libglu1-mesa-dev
```

GUI mode

```bash
# disable default ubuntu driver
sudo vim /etc/modprobe.d/blacklist-nouveau.conf

blacklist nouveau
blacklist lbm-nouveau
options nouveau modeset=0
alias nouveau off
alias lbm-nouveau off

echo options nouveau modeset=0 | sudo tee -a /etc/modprobe.d/nouveau-kms.conf
sudo update-initramfs -u
sudo reboot
```

tty mode

ctrl+alt+F1 to enter tty1, log in with (username, password)

```bash
sudo fbterm

# stop x-server before install cuda driver
sudo service lightdm stop
```

#### remove previous nvidia driver + cuda toolkit

```bash
sudo apt-get remove --purge nvidia-*
# remove 8.0
sudo /usr/local/cuda-8.0/bin/uninstall_cuda_8.0.pl
# remove 9.2
sudo /usr/local/cuda-9.2/bin/uninstall_cuda_9.2.pl
```

#### install nvidia driver from ppa

DO NOT use cuda_xxx_linux.run to install the nvidia driver, otherwise we
get the login loop problem when we reboot. Installing the driver from the official PPA is the recommended way.

```bash
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt-get update

sudo apt-cache search nvidia-*
# nvidia-384
# nvidia-396
sudo apt-get -y install nvidia-396

# test
sudo nvidia-smi
```

#### install cuda toolkit from run file

> 1. DO NOT install nvidia driver, install cuda toolkit + samples.
>
> 2. use default install path `/usr/local/cuda-9.2`
>
> 3. use `/usr/local/cuda-9.2/bin/uninstall_cuda_9.2.pl` to uninstall

```bash
chmod +x ./cuda_9.2.148_396.37_linux.run

# unsupported compiler ---> use --override
./cuda_9.2.148_396.37_linux.run --override
```

output

```
---------------------------------------
Do you accept the previously read EULA? 
(accept/decline/quit): accept

Install NVIDIA Accelerated Graphics Driver for Linux-x86_64 396.37? (y)es/(n)o/(q)uit: no

Install the CUDA 9.2 Toolkit? 
(y)es/(n)o/(q)uit: yes

Enter Toolkit Location 
    [ default is /usr/local/cuda-9.2 ]:

Do you want to install a symbolic link at /usr/local/cuda? (y)es/(n)o/(q)uit: yes


Install the CUDA 9.2 Samples? 
(y)es/(n)o/(q)uit: yes

Enter CUDA Samples Location 
    [ default is /home/kezunlin ]: 


Installing the CUDA Toolkit in /usr/local/cuda-9.2 ...
Installing the CUDA Samples in /home/kezunlin ...

===========
= Summary =
===========

Driver:   Not Selected
Toolkit:  Installed in /usr/local/cuda-9.2
Samples:  Installed in /home/kezunlin

Please make sure that
 -   PATH includes /usr/local/cuda-9.2/bin
 -   LD_LIBRARY_PATH includes /usr/local/cuda-9.2/lib64, or, add /usr/local/cuda-9.2/lib64 to /etc/ld.so.conf and run ldconfig as root

To uninstall the CUDA Toolkit, run the uninstall script in /usr/local/cuda-9.2/bin

Please see CUDA_Installation_Guide_Linux.pdf in /usr/local/cuda-9.2/doc/pdf for detailed information on setting up CUDA.

***WARNING: Incomplete installation! This installation did not install the CUDA Driver. A driver of version at least 384.00 is required for CUDA 9.2 functionality to work.
To install the driver using this installer, run the following command, replacing <CudaInstaller> with the name of this run file:
    sudo <CudaInstaller>.run -silent -driver

Logfile is /tmp/cuda_install_6659.log
```

reboot to enter GUI

```bash
sudo reboot
```

OK, we no longer have the login loop problem.

add library path

system env

```bash
vim .bashrc

# for cuda and cudnn
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

source .bashrc
```

or by conf file

```bash
sudo vim /etc/ld.so.conf.d/cuda.conf
/usr/local/cuda/lib64

sudo ldconfig
```

test

nvidia-smi

```bash
nvidia-smi
Tue Sep 18 10:35:55 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.54 Driver Version: 396.54 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 1060 Off | 00000000:01:00.0 Off | N/A |
| N/A 58C P0 31W / N/A | 288MiB / 6078MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1636 G /usr/lib/xorg/Xorg 164MiB |
| 0 2569 G compiz 40MiB |
| 0 4828 G ...-token=2DAB0000EFF3321D4D304928FA64B811 81MiB |
+-----------------------------------------------------------------------------+
```

or

```bash
cat /proc/driver/nvidia/version
```

nvcc

```bash
nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Tue_Jun_12_23:07:04_CDT_2018
Cuda compilation tools, release 9.2, V9.2.148
```

deviceQuery

```bash
cd ~/NVIDIA_CUDA-9.2_Samples/1_Utilities/deviceQuery
make
./deviceQuery
```

output

```
./deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GTX 1060"
CUDA Driver Version / Runtime Version 9.2 / 9.2
CUDA Capability Major/Minor version number: 6.1
Total amount of global memory: 6078 MBytes (6373572608 bytes)
(10) Multiprocessors, (128) CUDA Cores/MP: 1280 CUDA Cores
GPU Max Clock rate: 1733 MHz (1.73 GHz)
Memory Clock rate: 4004 Mhz
Memory Bus Width: 192-bit
L2 Cache Size: 1572864 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.2, CUDA Runtime Version = 9.2, NumDevs = 1
Result = PASS
```

we get Result = PASS.

install cudnn

download cudnn-9.2-linux-x64-v7.1.tgz for ubuntu 16.04

  • copy include to /usr/local/cuda-9.2/include
  • copy lib64 to /usr/local/cuda-9.2/lib64

commands

```bash
tar -xzvf cudnn-9.2-linux-x64-v7.1.tgz
sudo cp cuda/include/cudnn.h /usr/local/cuda/include/
sudo cp cuda/lib64/* /usr/local/cuda/lib64/
```
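
a quick way to confirm the loader can see the library afterwards (a minimal sketch using ctypes; the soname `libcudnn.so.7` matches cudnn 7.x):

```python
import ctypes

# CDLL raises OSError if the dynamic loader cannot find the library
cudnn = ctypes.CDLL("libcudnn.so.7")
print("libcudnn.so.7 loaded")
```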

History

  • 20180917: created.

Series

Guide

version

  • ubuntu 16.04 (14.04 and 16.04 only; Windows is not supported)
  • CUDA 8.0 (8.0,9.0,9.2 only)
  • CUDA 9.2
  • cudnn 7.1.4 (7.1 only)
  • TensorRT 4.0.1.6
  • TensorFlow-gpu v1.4+
  • python: 3.5.2 (2.7 or 3.5)

TensorRT support matrix

  • 4.0.1.6
    support matrix

  • 5.0.2.6
    support matrix

hardware precision matrix

hardware precision support matrix

see tensorrt-support-matrix

ubuntu

  • GeForce 1060 (fp32,int8) no fp16

jetson products

  • Jetson TX1 (fp32,fp16)
  • Jetson TX2 (fp32,fp16)
  • Jetson AGX Xavier (fp32,fp16,int8,dla)
  • Jetson Nano (Jetbot)

install

download and install

download TensorRT-4.0.1.6.Ubuntu-16.04.4.x86_64-gnu.cuda-8.0.cudnn7.1.tar.gz from here

```bash
tar zxvf TensorRT-4.0.1.6.Ubuntu-16.04.4.x86_64-gnu.cuda-8.0.cudnn7.1.tar.gz

ls TensorRT-4.0.1.6
bin data doc graphsurgeon include lib python samples targets TensorRT-Release-Notes.pdf uff

sudo mv TensorRT-4.0.1.6 /opt/
cd /opt
sudo ln -s TensorRT-4.0.1.6/ tensorrt
```

Updates: from cuda-8.0 ===> cuda-9.2. download TensorRT-4.0.1.6.Ubuntu-16.04.4.x86_64-gnu.cuda-9.2.cudnn7.1.tar.gz from here

add lib to path

```bash
sudo vim /etc/ld.so.conf.d/tensorrt.conf
/opt/tensorrt/lib

sudo ldconfig
```

or

```bash
vim ~/.bashrc
export LD_LIBRARY_PATH=/opt/tensorrt/lib:$LD_LIBRARY_PATH

source ~/.bashrc
```

python package

```bash
cd /opt/tensorrt/python
sudo pip2 install tensorrt-4.0.1.6-cp27-cp27mu-linux_x86_64.whl
```

or

```bash
cd /opt/tensorrt/python
sudo pip3 install tensorrt-4.0.1.6-cp35-cp35m-linux_x86_64.whl
```
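
a quick import check of the bindings (a hedged sketch; whether these legacy 4.x bindings expose `__version__` is an assumption):

```python
import tensorrt as trt

# __version__ is assumed to be exposed by the bindings
print(trt.__version__)
```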

uff package

```bash
cd /opt/tensorrt/uff
sudo pip install uff-0.4.0-py2.py3-none-any.whl

which convert-to-uff
/usr/local/bin/convert-to-uff
```
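
besides the convert-to-uff CLI, the wheel also exposes a Python API; a hedged sketch (`frozen.pb` and the output node name `prob` are placeholders for your own model):

```python
import uff

# convert a frozen TensorFlow graph to UFF and keep the serialized bytes
uff_model = uff.from_tensorflow_frozen_model("frozen.pb", output_nodes=["prob"])
with open("model.uff", "wb") as f:
    f.write(uff_model)
```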

folder structure

include

```bash
tree include/
include/
├── NvCaffeParser.h
├── NvInfer.h
├── NvInferPlugin.h
├── NvOnnxConfig.h
├── NvOnnxParser.h
├── NvUffParser.h
└── NvUtils.h
```

lib

```bash
ls -al *.4.1.2
lrwxrwxrwx 1 kezunlin kezunlin 21 6月 12 15:42 libnvcaffe_parser.so.4.1.2 -> libnvparsers.so.4.1.2
-rwxrwxr-x 1 kezunlin kezunlin 2806840 6月 12 15:42 libnvinfer_plugin.so.4.1.2
-rwxrwxr-x 1 kezunlin kezunlin 80434488 6月 12 15:42 libnvinfer.so.4.1.2
-rwxrwxr-x 1 kezunlin kezunlin 3951712 6月 12 15:42 libnvparsers.so.4.1.2
```

bin

```bash
tree bin
bin
├── download-digits-model.py
├── giexec
└── trtexec
```

sample

add envs

```bash
vim ~/.bashrc

# tensorrt cuda and cudnn
export CUDA_INSTALL_DIR=/usr/local/cuda
export CUDNN_INSTALL_DIR=/usr/local/cuda
```

compile all

```bash
cd samples/
make -j8
```

this generates all sample_xxx binaries in the bin/ folder.

compile sampleMNIST

```bash
cd samples/sampleMNIST
ls
Makefile sampleMNIST.cpp
make -j8
```

error occurs

```
dpkg-query: no packages found matching cuda-cudart-[0-9]*
../Makefile.config:6: CUDA_INSTALL_DIR variable is not specified, using /usr/local/cuda- by default, use CUDA_INSTALL_DIR=<cuda_directory> to change.
../Makefile.config:9: CUDNN_INSTALL_DIR variable is not specified, using  by default, use CUDNN_INSTALL_DIR=<cudnn_directory> to change.
```

fix solutions:

```bash
vim ~/.bashrc

# tensorrt cuda and cudnn
export CUDA_INSTALL_DIR=/opt/cuda
export CUDNN_INSTALL_DIR=/opt/cuda
```

make again

```
Compiling: sampleMNIST.cpp
Compiling: sampleMNIST.cpp
Linking: ../../bin/sample_mnist
Linking: ../../bin/sample_mnist_debug
# Copy every EXTRA_FILE of this sample to bin dir
```

test sample_mnist

```
./sample_mnist
Reading Caffe prototxt: ../../../data/mnist/mnist.prototxt
Reading Caffe model: ../../../data/mnist/mnist.caffemodel

Input:

@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@%.:@@@@@@@@@@@@
@@@@@@@@@@@@@: *@@@@@@@@@@@@
@@@@@@@@@@@@* =@@@@@@@@@@@@@
@@@@@@@@@@@% :@@@@@@@@@@@@@@
@@@@@@@@@@@- *@@@@@@@@@@@@@@
@@@@@@@@@@# .@@@@@@@@@@@@@@@
@@@@@@@@@@: #@@@@@@@@@@@@@@@
@@@@@@@@@+ -@@@@@@@@@@@@@@@@
@@@@@@@@@: %@@@@@@@@@@@@@@@@
@@@@@@@@+ +@@@@@@@@@@@@@@@@@
@@@@@@@@:.%@@@@@@@@@@@@@@@@@
@@@@@@@% -@@@@@@@@@@@@@@@@@@
@@@@@@@% -@@@@@@#..:@@@@@@@@
@@@@@@@% +@@@@@-    :@@@@@@@
@@@@@@@% =@@@@%.#@@- +@@@@@@
@@@@@@@@..%@@@*+@@@@ :@@@@@@
@@@@@@@@= -%@@@@@@@@ :@@@@@@
@@@@@@@@@- .*@@@@@@+ +@@@@@@
@@@@@@@@@@+  .:-+-: .@@@@@@@
@@@@@@@@@@@@+:    :*@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@

Output:

0: 
1: 
2: 
3: 
4: 
5: 
6: **********
7: 
8: 
9:
```

Sample

compile all samples

```bash
cd samples
make -j8
```

sample_mnist

see above. skip.

```bash
ldd sample_mnist
linux-vdso.so.1 => (0x00007ffecd9f3000)
libnvinfer.so.4 => /opt/tensorrt/lib/libnvinfer.so.4 (0x00007f48de6f2000)
libnvparsers.so.4.1.2 => /opt/tensorrt/lib/libnvparsers.so.4.1.2 (0x00007f48de12c000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f48ddf24000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f48ddd20000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f48ddb03000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f48dd781000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f48dd478000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f48dd262000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f48dce98000)
libcudnn.so.7 => /usr/local/cuda/lib64/libcudnn.so.7 (0x00007f48c8818000)
libcublas.so.9.2 => /usr/local/cuda/lib64/libcublas.so.9.2 (0x00007f48c4dca000)
libcudart.so.9.2 => /usr/local/cuda/lib64/libcudart.so.9.2 (0x00007f48c4b60000)
/lib64/ld-linux-x86-64.so.2 (0x00007f48e42bc000)
```

libnvinfer.so, libnvparsers.so, libcudart.so, libcudnn.so, libcublas.so

sample_onnx_mnist

```
./sample_onnx_mnist



---------------------------



@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@%.-@@@@@@@@@@@
@@@@@@@@@@@*-    %@@@@@@@@@@
@@@@@@@@@@= .-.  *@@@@@@@@@@
@@@@@@@@@= +@@@  *@@@@@@@@@@
@@@@@@@@* =@@@@  %@@@@@@@@@@
@@@@@@@@..@@@@%  @@@@@@@@@@@
@@@@@@@# *@@@@-  @@@@@@@@@@@
@@@@@@@: @@@@%   @@@@@@@@@@@
@@@@@@@: @@@@-   @@@@@@@@@@@
@@@@@@@: =+*= +: *@@@@@@@@@@
@@@@@@@*.    +@: *@@@@@@@@@@
@@@@@@@@%#**#@@: *@@@@@@@@@@
@@@@@@@@@@@@@@@: -@@@@@@@@@@
@@@@@@@@@@@@@@@+ :@@@@@@@@@@
@@@@@@@@@@@@@@@*  @@@@@@@@@@
@@@@@@@@@@@@@@@@  %@@@@@@@@@
@@@@@@@@@@@@@@@@  #@@@@@@@@@
@@@@@@@@@@@@@@@@: +@@@@@@@@@
@@@@@@@@@@@@@@@@- +@@@@@@@@@
@@@@@@@@@@@@@@@@*:%@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@


 Prob 0  0.00000: 
 Prob 1  0.00001: 
 Prob 2  0.00002: 
 Prob 3  0.00003: 
 Prob 4  0.00044: 
 Prob 5  0.00005: 
 Prob 6  0.00006: 
 Prob 7  0.00007: 
 Prob 8  0.00008: 
 Prob 9  0.99969: **********
```

sample_uff_mnist

```
../../../data/mnist/lenet5.uff



---------------------------



@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@%.-@@@@@@@@@@@
@@@@@@@@@@@*-    %@@@@@@@@@@
@@@@@@@@@@= .-.  *@@@@@@@@@@
@@@@@@@@@= +@@@  *@@@@@@@@@@
@@@@@@@@* =@@@@  %@@@@@@@@@@
@@@@@@@@..@@@@%  @@@@@@@@@@@
@@@@@@@# *@@@@-  @@@@@@@@@@@
@@@@@@@: @@@@%   @@@@@@@@@@@
@@@@@@@: @@@@-   @@@@@@@@@@@
@@@@@@@: =+*= +: *@@@@@@@@@@
@@@@@@@*.    +@: *@@@@@@@@@@
@@@@@@@@%#**#@@: *@@@@@@@@@@
@@@@@@@@@@@@@@@: -@@@@@@@@@@
@@@@@@@@@@@@@@@+ :@@@@@@@@@@
@@@@@@@@@@@@@@@*  @@@@@@@@@@
@@@@@@@@@@@@@@@@  %@@@@@@@@@
@@@@@@@@@@@@@@@@  #@@@@@@@@@
@@@@@@@@@@@@@@@@: +@@@@@@@@@
@@@@@@@@@@@@@@@@- +@@@@@@@@@
@@@@@@@@@@@@@@@@*:%@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
10 eltCount
--- OUTPUT ---
0 => -2.75228	 : 
1 => -1.51534	 : 
2 => -4.11729	 : 
3 => 0.316925	 : 
4 => 3.73423	 : 
5 => -3.00593	 : 
6 => -6.18866	 : 
7 => -1.02671	 : 
8 => 1.937	 : 
9 => 14.8275	 : ***

Average over 10 runs is 0.0843257 ms.
```

sample_mnist_api

```
./sample_mnist_api
Loading weights: ../../../data/mnist/mnistapi.wts

Input:

@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@+ @@@@@@@@@@@@@@
@@@@@@@@@@@@. @@@@@@@@@@@@@@
@@@@@@@@@@@@- @@@@@@@@@@@@@@
@@@@@@@@@@@#  @@@@@@@@@@@@@@
@@@@@@@@@@@#  *@@@@@@@@@@@@@
@@@@@@@@@@@@  :@@@@@@@@@@@@@
@@@@@@@@@@@@= .@@@@@@@@@@@@@
@@@@@@@@@@@@#  %@@@@@@@@@@@@
@@@@@@@@@@@@% .@@@@@@@@@@@@@
@@@@@@@@@@@@%  %@@@@@@@@@@@@
@@@@@@@@@@@@%  %@@@@@@@@@@@@
@@@@@@@@@@@@@= +@@@@@@@@@@@@
@@@@@@@@@@@@@* -@@@@@@@@@@@@
@@@@@@@@@@@@@*  @@@@@@@@@@@@
@@@@@@@@@@@@@@  @@@@@@@@@@@@
@@@@@@@@@@@@@@  *@@@@@@@@@@@
@@@@@@@@@@@@@@  *@@@@@@@@@@@
@@@@@@@@@@@@@@  *@@@@@@@@@@@
@@@@@@@@@@@@@@  *@@@@@@@@@@@
@@@@@@@@@@@@@@* @@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@

Output:

0: 
1: **********
2: 
3: 
4: 
5: 
6: 
7: 
8: 
9:
```

sample_int8

```
./sample_int8 mnist

FP32 run:400 batches of size 100 starting at 100
........................................
Top1: 0.9904, Top5: 1
Processing 40000 images averaged 0.00332707 ms/image and 0.332707 ms/batch.

FP16 run:400 batches of size 100 starting at 100
Engine could not be created at this precision

INT8 run:400 batches of size 100 starting at 100
........................................
Top1: 0.9909, Top5: 1
Processing 40000 images averaged 0.00215323 ms/image and 0.215323 ms/batch.
```

History

  • 20180907: created.
  • 20181119: add tensorrt-5.0.

Guide

version

  • gcc 4.8.5/5.4.0
  • g++ 4.8.5/5.4.0
  • cmake 3.2.2
  • nvidia driver 396.54 + cuda 9.2 + cudnn 7.1.4
  • protobuf 3.4.0

install nvidia-docker2

see nvidia-docker2 guide on ubuntu 16.04

test

```bash
sudo docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi
```

build and run

build

```bash
git clone https://github.com/PaddlePaddle/Anakin.git anakin
cd anakin/docker
./anakin_docker_build_and_run.sh -p NVIDIA-GPU -o Ubuntu -m Build
```

an error occurs with cudnn; skip.

run

```bash
./anakin_docker_build_and_run.sh -p NVIDIA-GPU -o Ubuntu -m Run
```

compile anakin

```bash
sudo docker run -it --runtime=nvidia fdcda959f60a /bin/bash
root@962077742ae9:/# cd Anakin/
git checkout developing
```

build

```bash
# 1. use script to build
./tools/gpu_build.sh

# 2. or you can build directly.
mkdir build
cd build
cmake ..
make -j8
```

x86 build

```bash
./tools/x86_build.sh
```

OK. no errors.

gpu build

```bash
./tools/gpu_build.sh
```

build errors occur: no cudnn found.

compile anakin on the host

install protobuf

install protobuf 3.4.0, see Part 1: compile protobuf-cpp on ubuntu 16.04

configure env

vim .bashrc

```bash
# cuda for anakin
export PATH=/usr/local/cuda/bin:$PATH

# CUDNN for anakin
export CUDNN_ROOT=/usr/local/cuda/
export LD_LIBRARY_PATH=${CUDNN_ROOT}/lib64:$LD_LIBRARY_PATH
export CPLUS_INCLUDE_PATH=${CUDNN_ROOT}/include:$CPLUS_INCLUDE_PATH
```

source .bashrc

build anakin

x86 build

```bash
git checkout developing
./tools/x86_build.sh

mv output x86_output
```

OK. no errors.

if error occurs, then

```bash
rm -rf CMakeFiles
rm -rf anakin/framework/model_parser/proto/*.h
rm output

chown -R kezunlin:kezunlin anakin
```

gpu build

```bash
./tools/gpu_build.sh
mv output gpu_output
```

gpu build with cmake

```bash
cd anakin
mkdir build
cd build && cmake-gui ..
```

anakin overview

anakin

Forward computation with Anakin involves three main steps:

  1. Parse an external model into an Anakin model with the Anakin Parser.
  2. Load the Anakin model to generate the raw compute graph, then optimize the raw graph.
  3. Anakin executes the compute graph on the chosen hardware platform.

Tensor

Tensor takes three template parameters:

```cpp
template<typename TargetType, DataType datatype, typename LayOutType = NCHW>
class Tensor .../* Inherit other class */{
  //some implements
  ...
};
```

  • TargetType is the platform type, such as X86 or GPU; Anakin has internal identifiers corresponding to it;
  • datatype is the ordinary data type, which also has corresponding identifiers inside Anakin;
  • LayOutType is the data layout type, such as batch x channel x height x width [NxCxHxW]; Anakin identifies it with a struct. The mapping between Anakin data types and basic data types is as follows:

TargetType

| Anakin TargetType | platform |
| :---: | :--- |
| NV | NVIDIA GPU |
| ARM | ARM |
| AMD | AMD GPU |
| X86 | X86 |
| NVHX86 | NVIDIA GPU with Pinned Memory |

DataType

| Anakin DataType | C++ | Description |
| :---: | :---: | :---: |
| AK_HALF | short | fp16 |
| AK_FLOAT | float | fp32 |
| AK_DOUBLE | double | fp64 |
| AK_INT8 | char | int8 |
| AK_INT16 | short | int16 |
| AK_INT32 | int | int32 |
| AK_INT64 | long | int64 |
| AK_UINT8 | unsigned char | uint8 |
| AK_UINT16 | unsigned short | uint16 |
| AK_UINT32 | unsigned int | uint32 |
| AK_STRING | std::string | / |
| AK_BOOL | bool | / |
| AK_SHAPE | / | Anakin Shape |
| AK_TENSOR | / | Anakin Tensor |

LayOutType

| Anakin LayOutType (Tensor LayOut) | Tensor Dimension | Tensor Support | Op Support |
| :---: | :---: | :---: | :---: |
| W | 1-D | YES | NO |
| HW | 2-D | YES | NO |
| WH | 2-D | YES | NO |
| NW | 2-D | YES | YES |
| NHW | 3-D | YES | YES |
| NCHW (default) | 4-D | YES | YES |
| NHWC | 4-D | YES | NO |
| NCHW_C4 | 5-D | YES | YES |

In theory Anakin supports declaring tensors with more than one dimension, but Anakin Ops only support the four layouts NW, NHW, NCHW and NCHW_C4; NCHW is the default LayOutType, and NCHW_C4 is dedicated to the int8 data type.

Graph

The Graph class is responsible for loading an Anakin model to build the compute graph, optimizing the graph, and saving the model.

```cpp
template<typename TargetType, DataType Dtype, Precision Ptype>
class Graph ... /* inherit other class*/{

  //some implements
  ...

};
```

load

```cpp
//some declarations
...
auto graph = new Graph<NV, AK_FLOAT, Precision::FP32>();
std::string model_path = "the/path/to/where/your/models/are";
const char *model_path1 = "the/path/to/where/your/models/are";

//Loading Anakin model to generate a compute graph.
auto status = graph->load(model_path);

//Or this way.
auto status = graph->load(model_path1);
//Check whether load operation success.
if(!status){
  std::cout << "error" << endl;
  //do something...
}
```

optimize

```cpp
//some declarations
...
//Load graph.
...
//According to the ops of loaded graph, optimize compute graph.
graph->Optimize();
```

save

```cpp
//some declarations
...
//Load graph.
...
// save a model
//save_model_path: the path to where your model is.
auto status = graph->save(save_model_path);

//Checking
if(!status){
  cout << "error" << endl;
  //do something...
}
```

Net

Net is the executor of the compute graph; inputs and outputs are obtained through the Net object.

```cpp
template<typename TargetType, DataType Dtype, Precision PType, OpRunType RunType = OpRunType::ASYNC>
class Net{
  //some implements
  ...

};
```

  • Precision specifies the precision of Ops.
  • OpRunType selects synchronous or asynchronous execution, and asynchronous is the default. OpRunType::SYNC means synchronous, with a single stream on the GPU; OpRunType::ASYNC means asynchronous, with multiple streams executing asynchronously on the GPU.

Precision

| Precision | Op support |
| :---: | :---: |
| Precision::INT4 | NO |
| Precision::INT8 | NO |
| Precision::FP16 | NO |
| Precision::FP32 | YES |
| Precision::FP64 | NO |

For now Ops only support FP32 precision, but the remaining precisions will be supported in the future.

OpRunType

| OpRunType | Sync/Async | Description |
| :---: | :---: | :---: |
| OpRunType::SYNC | Synchronous | single-stream on GPU |
| OpRunType::ASYNC | Asynchronous | multi-stream on GPU |

create an executor

```cpp
//some declarations
...
//Create a pointer to a graph.
auto graph = new Graph<NV, AK_FLOAT, Precision::FP32>();
//do something...
...

//create an executor
Net<NV, AK_FLOAT, Precision::FP32> executor(*graph);
```

get input tensor

```cpp
//some declarations
...

//create an executor
//TargetType is NV [NVIDIA GPU]
Net<NV, AK_FLOAT, Precision::FP32> executor(*graph);

//Get the first input tensor.
//The following tensors(tensor_in0, tensor_in2 ...) are resident at GPU.
//Note: Member function get_in returns a pointer to tensor.
Tensor<NV, AK_FLOAT>* tensor_in0 = executor.get_in("input_0");

//If you have multiple input tensors
//You just type this code below.
Tensor<NV, AK_FLOAT>* tensor_in1 = executor.get_in("input_1");
...
auto tensor_inn = executor.get_in("input_n");
```

fill input tensor

```cpp
//This tensor is resident at GPU.
auto tensor_d_in = executor.get_in("input_0");

//If we want to feed above tensor, we must feed the tensor which is resident at host,
//and then copy the host tensor to the device one.

//using Tensor4d = Tensor<Ttype, Dtype>;
Tensor4d<X86, AK_FLOAT> tensor_h_in; //host tensor;
//Tensor<X86, AK_FLOAT> tensor_h_in;

//Allocate memory for host tensor.
tensor_h_in.re_alloc(tensor_d_in->valid_shape());
//Get a writable pointer to tensor.
float *h_data = tensor_h_in.mutable_data();

//Feed your tensor.
/** example
for(int i = 0; i < tensor_h_in.size(); i++){
  h_data[i] = 1.0f;
}
*/
//Copy host tensor's data to device tensor.
tensor_d_in->copy_from(tensor_h_in);

// And then
```

get output tensor

```cpp
//Note: this tensor is resident at GPU.
Tensor<NV, AK_FLOAT>* tensor_out_d = executor.get_out("pred_out");
```

execute graph

```cpp
executor.prediction();
```

code example

```cpp
std::string model_path = "your_Anakin_models/xxxxx.anakin.bin";
// Create an empty graph object.
auto graph = new Graph<NV, AK_FLOAT, Precision::FP32>();
// Load Anakin model.
auto status = graph->load(model_path);
if(!status ) {
LOG(FATAL) << " [ERROR] " << status.info();
}
// Reshape
graph->Reshape("input_0", {10, 384, 960, 10});
// You must optimize graph for the first time.
graph->Optimize();
// Create a executer.
Net<NV, AK_FLOAT, Precision::FP32> net_executer(*graph);

//Get your input tensors through some specific string such as "input_0", "input_1", and
//so on.
//And then, feed the input tensor.
//If you don't know Which input do these specific string ("input_0", "input_1") correspond with, you can launch dash board to find out.
auto d_tensor_in_p = net_executer.get_in("input_0");
Tensor4d<X86, AK_FLOAT> h_tensor_in;
auto valid_shape_in = d_tensor_in_p->valid_shape();
for (int i=0; i<valid_shape_in.size(); i++) {
LOG(INFO) << "detect input dims[" << i << "]" << valid_shape_in[i]; //see tensor's dimentions
}
h_tensor_in.re_alloc(valid_shape_in);
float* h_data = h_tensor_in.mutable_data();
for (int i=0; i<h_tensor_in.size(); i++) {
h_data[i] = 1.0f;
}
d_tensor_in_p->copy_from(h_tensor_in);

//Do inference.
net_executer.prediction();

//Get result tensor through the name of output node.
//And also, you need to see the dash board again to find out how many output nodes are and remember their name.

//For example, you've got a output node named obj_pre_out
//Then, you can get an output tensor.
auto d_tensor_out_0_p = net_executer.get_out("obj_pred_out"); //get_out returns a pointer to output tensor.
auto d_tensor_out_1_p = net_executer.get_out("lc_pred_out"); //get_out returns a pointer to output tensor.
//......
// do something else ...
//...
//save model.
//You might not optimize the graph when you load the saved model again.
std::string save_model_path = model_path + std::string(".saved");
status = graph->save(save_model_path);
if (!status ) {
LOG(FATAL) << " [ERROR] " << status.info();
}
```

anakin converter

```bash
cd anakin/tools/external_converter_v2
sudo pip install flask prettytable

vim config.yaml
# ...
python converter.py
```

config.yaml

```yaml
OPTIONS:
    Framework: CAFFE
    SavePath: ./output
    ResultName: mylenet
    Config:
        LaunchBoard: ON
        Server:
            ip: 0.0.0.0
            port: 8888
        OptimizedGraph:
            enable: OFF
            path: ./anakin_optimized/lenet.anakin.bin.saved
    LOGGER:
        LogToPath: ./log/
        WithColor: ON

TARGET:
    CAFFE:
        # path to proto files
        ProtoPaths:
            - /home/kezunlin/program/caffe/src/caffe/proto/caffe.proto
        PrototxtPath: /home/kezunlin/program/caffe/examples/mnist/lenet.prototxt
        ModelPath: /home/kezunlin/program/caffe/examples/mnist/lenet_iter_10000.caffemodel

    FLUID:
        # path of fluid inference model
        Debug: NULL # Generally no need to modify.
        ModelPath: /path/to/your/model/ # The upper path of a fluid inference model.
        NetType: # Generally no need to modify.

    LEGO:
        # path to proto files
        ProtoPath:
        PrototxtPath:
        ModelPath:

    TENSORFLOW:
        ProtoPaths: /
        PrototxtPath: /
        ModelPath: /
        OutPuts:

    ONNX:
        ProtoPath:
        PrototxtPath:
        ModelPath:
```

  • input: caffe.proto + lenet.prototxt + lenet_iter_10000.caffemodel
  • output: output/mylenet.anakin.bin + log/xxx.log

anakin test

model_test.cpp

```bash
cat Anakin/test/framework/net/model_test.cpp

cd gpu_output
./unit_test/model_test '/home/kezunlin/program/anakin/demo/model/'
```

example_nv_cnn_net.cpp

```bash
cat Anakin/examples/cuda/example_nv_cnn_net.cpp
```

my example

my workspace

```bash
ls demo/
anakin_lib build cmake CMakeLists.txt image model src


tree demo/src/ demo/model/ demo/cmake demo/image
demo/src/
└── demo.cpp
demo/model/
└── mylenet.anakin.bin
demo/cmake
├── anakin-config.cmake
├── msg_color.cmake
├── statistic.cmake
└── utils.cmake
demo/image
├── big.jpg
└── cat.jpg

0 directories, 8 files
```

anakin_lib

use ./tools/gpu_build.sh to generate gpu_build_sm61 and rename to anakin_lib

```bash
./tools/gpu_build.sh
# ...

mv gpu_build_sm61 anakin_lib

ls anakin_lib/
anakin_config.h libanakin_saber_common.so libanakin.so log unit_test
framework libanakin_saber_common.so.0.1.2 libanakin.so.0.1.2 saber utils
```

anakin-config.cmake

```cmake
set(ANAKIN_FOUND TRUE) # auto 
set(ANAKIN_VERSION 0.1.2)
set(ANAKIN_ROOT_DIR "/home/kezunlin/program/anakin/demo/anakin_lib")

set(ANAKIN_ROOT ${ANAKIN_ROOT_DIR})
set(ANAKIN_FRAMEWORK ${ANAKIN_ROOT}/framework)
set(ANAKIN_SABER ${ANAKIN_ROOT}/saber)
set(ANAKIN_UTILS ${ANAKIN_ROOT}/utils)


set(ANAKIN_FRAMEWORK_CORE ${ANAKIN_FRAMEWORK}/core)
set(ANAKIN_FRAMEWORK_GRAPH ${ANAKIN_FRAMEWORK}/graph)
set(ANAKIN_FRAMEWORK_LITE ${ANAKIN_FRAMEWORK}/lite)
set(ANAKIN_FRAMEWORK_MODEL_PARSER ${ANAKIN_FRAMEWORK}/model_parser)
set(ANAKIN_FRAMEWORK_OPERATORS ${ANAKIN_FRAMEWORK}/operators)

set(ANAKIN_SABER_CORE ${ANAKIN_SABER}/core)
set(ANAKIN_SABER_FUNCS ${ANAKIN_SABER}/funcs)
set(ANAKIN_SABER_LITE ${ANAKIN_SABER}/lite)

set(ANAKIN_UTILS_LOGGER ${ANAKIN_UTILS}/logger)
set(ANAKIN_UTILS_UINT_TEST ${ANAKIN_UTILS}/unit_test)

#find_path(ANAKIN_INCLUDE_DIR NAMES anakin_config.h PATHS "${ANAKIN_ROOT_DIR}")
mark_as_advanced(ANAKIN_INCLUDE_DIR) # show entry in cmake-gui

find_library(ANAKIN_SABER_COMMON_LIBRARY NAMES anakin_saber_common PATHS "${ANAKIN_ROOT_DIR}")
mark_as_advanced(ANAKIN_SABER_COMMON_LIBRARY) # show entry in cmake-gui

find_library(ANAKIN_LIBRARY NAMES anakin PATHS "${ANAKIN_ROOT_DIR}")
mark_as_advanced(ANAKIN_LIBRARY) # show entry in cmake-gui

# use xxx_INCLUDE_DIRS and xxx_LIBRARIES in CMakeLists.txt
set(ANAKIN_INCLUDE_DIRS
${ANAKIN_ROOT}
${ANAKIN_FRAMEWORK}
${ANAKIN_SABER}
${ANAKIN_UTILS}

${ANAKIN_FRAMEWORK_CORE}
${ANAKIN_FRAMEWORK_GRAPH}
${ANAKIN_FRAMEWORK_LITE}
${ANAKIN_FRAMEWORK_MODEL_PARSER}
${ANAKIN_FRAMEWORK_OPERATORS}

${ANAKIN_SABER_CORE}
${ANAKIN_SABER_FUNCS}
${ANAKIN_SABER_LITE}

${ANAKIN_UTILS_LOGGER}
${ANAKIN_UTILS_UINT_TEST}
)

set(ANAKIN_LIBRARIES ${ANAKIN_SABER_COMMON_LIBRARY} ${ANAKIN_LIBRARY} )

message( "anakin-config.cmake " ${ANAKIN_ROOT_DIR})
```

CMakeLists.txt

```cmake
cmake_minimum_required(VERSION 2.8.8)

project(demo)

include(cmake/msg_color.cmake)
include(cmake/utils.cmake)
include(cmake/statistic.cmake)

#add_definitions( -Dshared_DEBUG) # define macro

set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -std=c++11")

set(ROOT_CMAKE_DIR ./cmake)
set(CMAKE_PREFIX_PATH ${CMAKE_PREFIX_PATH} "${ROOT_CMAKE_DIR};${CMAKE_PREFIX_PATH}")
MESSAGE( [cmake] " CMAKE_PREFIX_PATH = ${CMAKE_PREFIX_PATH} for find_package")

# Find includes in corresponding build directories
set(CMAKE_INCLUDE_CURRENT_DIR ON)

find_package(OpenCV REQUIRED COMPONENTS core highgui imgproc features2d calib3d)
include_directories(${OpenCV_INCLUDE_DIRS})

# find anakin-config.cmake file
#include(cmake/anakin-config.cmake)
find_package(ANAKIN REQUIRED)
include_directories(${ANAKIN_INCLUDE_DIRS})

#message( [opencv] ${OpenCV_INCLUDE_DIRS} )
#message( [opencv] ${OpenCV_LIBS} )
#message( [anakin] ${ANAKIN_INCLUDE_DIRS} )
#message( [anakin] ${ANAKIN_LIBRARIES} )

add_executable(${PROJECT_NAME}
src/demo.cpp
)

# dl pthread
# error with -std=c++11 -lpthread -ldl

target_link_libraries(${PROJECT_NAME}
dl
pthread
${OpenCV_LIBS}
${ANAKIN_LIBRARIES}
)
```

src/demo.cpp

edit from Anakin/examples/cuda/example_nv_cnn_net.cpp

```cpp
#include <iostream>
using namespace std;

// opencv
#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>
#include <opencv2/highgui.hpp>
using namespace cv;

// anakin
#include "utils/logger/logger.h"
#include "framework/graph/graph.h"
#include "framework/core/net/net.h"

/*util to fill tensor*/
#include "saber/core/tensor_op.h"
using namespace anakin;
using namespace anakin::graph;
using namespace anakin::saber;

/*
+------------+-----------------+-------+-----------+
| Input Name | Shape | Alias | Data Type |
+------------+-----------------+-------+-----------+
| input_0 | [64, 1, 28, 28] | NULL | NULL |
+------------+-----------------+-------+-----------+
+-------------+
| Output Name |
+-------------+
| prob_out |
+-------------+
*/

int fill_tensor(Tensor4d<X86, AK_FLOAT>& h_tensor_in, const cv::Mat& image)
{
// write data to tensor
int height = image.rows;
int width = image.cols;

LOG(INFO)<<"height*width ="<< height*width <<std::endl; // 784
LOG(INFO)<<"h_tensor_in.size() ="<<h_tensor_in.size()<<std::endl; // 784

float* tensor_ptr = h_tensor_in.mutable_data(); // int, float or double.

const float* ptr;
for (int h = 0; h < height; ++h)
{
ptr = image.ptr<float>(h); // row ptr
for (int w = 0; w < width; ++w)
{
*tensor_ptr++ = *ptr++;
}
}

return 1;
}

int main(int argc, const char** argv) {

const char *model_path = "../model/mylenet.anakin.bin";

Mat image = imread("../image/cat.jpg",0);
cv::resize(image,image,Size(28,28));
//imshow("image",image);
//waitKey(0);

/*init graph object, graph is the skeleton of model*/
Graph<NV, AK_FLOAT, Precision::FP32> graph;

/*load model from file to init the graph*/
auto status = graph.load(model_path);
if (!status) {
LOG(FATAL) << " [ERROR] " << status.info();
}

/*set net input shape and use this shape to optimize the graph(fusion and init operator),shape is n,c,h,w*/
graph.Reshape("input_0", {1, 1, 28, 28});
graph.Optimize();

/*net_executer is the executor object of model. use graph to init Net*/
Net<NV, AK_FLOAT, Precision::FP32> net_executer(graph, true);

/*use input string to get the input tensor of net. for we use NV as target, the tensor of net_executer is on GPU memory*/
auto d_tensor_in_p = net_executer.get_in("input_0");
auto valid_shape_in = d_tensor_in_p->valid_shape();

/*create tensor located in host*/
Tensor4d<X86, AK_FLOAT> h_tensor_in;

/*alloc for host tensor*/
h_tensor_in.re_alloc(valid_shape_in);

/*init host tensor by random*/
//fill_tensor_host_rand(h_tensor_in, -1.0f, 1.0f);

image.convertTo(image, CV_32FC1); // faster
fill_tensor(h_tensor_in,image);

/*use host tensor to int device tensor which is net input*/
d_tensor_in_p->copy_from(h_tensor_in);

/*run infer*/
net_executer.prediction();

LOG(INFO)<<"infer finish";

/*get the out put of net, which is a device tensor*/
auto d_out=net_executer.get_out("prob_out");

/*create another host tensor, and copy the content of device tensor to host*/
Tensor4d<X86, AK_FLOAT> h_tensor_out;
h_tensor_out.re_alloc(d_out->valid_shape());
h_tensor_out.copy_from(*d_out);

/*show output content*/
for(int i=0;i<h_tensor_out.valid_size();i++){
LOG(INFO)<<"out ["<<i<<"] = "<<h_tensor_out.data()[i];
}
}
```

compile demo

```bash
mkdir build
cd build
cmake ..
make
./demo
```

output

```
ERR| 16:45:56.00581| 110838.067s|         37CBF8C0| operator_attr.h:94]  you have set the argument: is_reverse , so it's igrored by anakin
 ERR| 16:45:56.00581| 110838.067s|         37CBF8C0| operator_attr.h:94]  you have set the argument: is_reverse , so it's igrored by anakin
   0| 16:45:56.00681| 0.098s|         37CBF8C0| parser.cpp:96] graph name: LeNet
   0| 16:45:56.00681| 0.099s|         37CBF8C0| parser.cpp:101] graph in: input_0
   0| 16:45:56.00681| 0.099s|         37CBF8C0| parser.cpp:107] graph out: prob_out
   0| 16:45:56.00742| 0.159s|         37CBF8C0| graph.cpp:153]  processing in-ordered fusion : ConvBatchnormScaleReluPool
   0| 16:45:56.00742| 0.160s|         37CBF8C0| graph.cpp:153]  processing in-ordered fusion : ConvBatchnormScaleRelu
   0| 16:45:56.00742| 0.160s|         37CBF8C0| graph.cpp:153]  processing in-ordered fusion : ConvReluPool
   0| 16:45:56.00742| 0.160s|         37CBF8C0| graph.cpp:153]  processing in-ordered fusion : ConvBatchnormScale
   0| 16:45:56.00742| 0.160s|         37CBF8C0| graph.cpp:153]  processing in-ordered fusion : DeconvRelu
   0| 16:45:56.00742| 0.160s|         37CBF8C0| graph.cpp:153]  processing in-ordered fusion : ConvRelu
   0| 16:45:56.00742| 0.160s|         37CBF8C0| graph.cpp:153]  processing in-ordered fusion : PermutePower
   0| 16:45:56.00742| 0.160s|         37CBF8C0| graph.cpp:153]  processing in-ordered fusion : ConvBatchnorm
   0| 16:45:56.00742| 0.160s|         37CBF8C0| graph.cpp:153]  processing in-ordered fusion : EltwiseRelu
   0| 16:45:56.00742| 0.160s|         37CBF8C0| graph.cpp:153]  processing in-ordered fusion : EltwiseActivation
 WAN| 16:45:56.00743| 0.160s|         37CBF8C0| net.cpp:663] Detect and initial 1 lanes.
   0| 16:45:56.00743| 0.161s|         37CBF8C0| env.h:44] found 1 device(s)
   0| 16:45:56.00743| 0.161s|         37CBF8C0| cuda_device.cpp:45] Device id: 0 , name: GeForce GTX 1060
   0| 16:45:56.00743| 0.161s|         37CBF8C0| cuda_device.cpp:47] Multiprocessors: 10
   0| 16:45:56.00743| 0.161s|         37CBF8C0| cuda_device.cpp:50] frequency:1733MHz
   0| 16:45:56.00743| 0.161s|         37CBF8C0| cuda_device.cpp:52] CUDA Capability : 6.1
   0| 16:45:56.00743| 0.161s|         37CBF8C0| cuda_device.cpp:54] total global memory: 6078MBytes.
 WAN| 16:45:56.00743| 0.161s|         37CBF8C0| net.cpp:667] Current used device id : 0
 WAN| 16:45:56.00744| 0.161s|         37CBF8C0| input.cpp:16] Parsing Input op parameter.
   0| 16:45:56.00744| 0.161s|         37CBF8C0| input.cpp:19]  |-- shape [0]: 1
   0| 16:45:56.00744| 0.161s|         37CBF8C0| input.cpp:19]  |-- shape [1]: 1
   0| 16:45:56.00744| 0.161s|         37CBF8C0| input.cpp:19]  |-- shape [2]: 28
   0| 16:45:56.00744| 0.161s|         37CBF8C0| input.cpp:19]  |-- shape [3]: 28
 ERR| 16:45:56.00744| 0.161s|         37CBF8C0| net.cpp:210] node_ptr->get_op_name()  sass not support yet.
 ERR| 16:45:56.00744| 0.161s|         37CBF8C0| net.cpp:210] node_ptr->get_op_name()  sass not support yet.
 WAN| 16:45:57.00269| 0.686s|         37CBF8C0| context.h:40] device index exceeds the number of devices, set to default device(0)!
   0| 16:45:57.00270| 0.687s|         37CBF8C0| net.cpp:300] Temp mem used:        0 MB
   0| 16:45:57.00270| 0.687s|         37CBF8C0| net.cpp:301] Original mem used:    0 MB
   0| 16:45:57.00270| 0.687s|         37CBF8C0| net.cpp:302] Model mem used:       1 MB
   0| 16:45:57.00270| 0.687s|         37CBF8C0| net.cpp:303] System mem used:      153 MB
   0| 16:45:57.00270| 0.687s|         37CBF8C0| demo.cpp:40] height*width =784
   0| 16:45:57.00270| 0.687s|         37CBF8C0| demo.cpp:41] h_tensor_in.size() =784
   0| 16:45:57.00270| 0.688s|         37CBF8C0| demo.cpp:105] infer finish
   0| 16:45:57.00270| 0.688s|         37CBF8C0| demo.cpp:117] out [0] = 0
   0| 16:45:57.00270| 0.688s|         37CBF8C0| demo.cpp:117] out [1] = 0
   0| 16:45:57.00270| 0.688s|         37CBF8C0| demo.cpp:117] out [2] = 0
   0| 16:45:57.00270| 0.688s|         37CBF8C0| demo.cpp:117] out [3] = 1
   0| 16:45:57.00270| 0.688s|         37CBF8C0| demo.cpp:117] out [4] = 0
   0| 16:45:57.00270| 0.688s|         37CBF8C0| demo.cpp:117] out [5] = 0
   0| 16:45:57.00270| 0.688s|         37CBF8C0| demo.cpp:117] out [6] = 0
   0| 16:45:57.00270| 0.688s|         37CBF8C0| demo.cpp:117] out [7] = 0
   0| 16:45:57.00270| 0.688s|         37CBF8C0| demo.cpp:117] out [8] = 0
   0| 16:45:57.00270| 0.688s|         37CBF8C0| demo.cpp:117] out [9] = 0
```

For Windows (skip)

version

  • windows 10
  • vs 2015
  • cmake 3.2.2
  • cuda 8.0 + cudnn 6.0.21 (same as caffe) sm_61
  • protobuf 3.4.0

protobuf

see compile protobuf-cpp on windows 10

compile

```bash
#git clone https://github.com/PaddlePaddle/Anakin.git anakin
git clone https://github.com/kezunlin/Anakin.git anakin
cd anakin
mkdir build && cd build && cmake-gui ..
```

with options

```
CUDNN_ROOT "C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v8.0/"
PROTOBUF_ROOT "C:/Program Files/protobuf"

BUILD_SHARED ON
USE_GPU_PLACE ON
USE_OPENMP OFF
USE_OPENCV ON
```

generate Anakin.sln and compile with VS 2015 in x64 Release mode.

error fixes

we get 101 errors, which are hard to fix.
skip for now.

History

  • 20180903: created.

Docker Guide

install docker

```bash
# step 1: install tools
sudo apt-get update
sudo apt-get -y install apt-transport-https ca-certificates curl software-properties-common

# step 2: install GPG
curl -fsSL http://mirrors.aliyun.com/docker-ce/linux/ubuntu/gpg | sudo apt-key add -

# Step 3: add apt repo
sudo add-apt-repository "deb [arch=amd64] http://mirrors.aliyun.com/docker-ce/linux/ubuntu $(lsb_release -cs) stable"

# Step 4: install docker-ce
sudo apt-get -y update
sudo apt-get -y install docker-ce
```

install docker-ce for given version

```bash
# Step 1: search versions
# apt-cache madison docker-ce
# docker-ce | 17.03.1~ce-0~ubuntu-xenial | http://mirrors.aliyun.com/docker-ce/linux/ubuntu xenial/stable amd64 Packages
# docker-ce | 17.03.0~ce-0~ubuntu-xenial | http://mirrors.aliyun.com/docker-ce/linux/ubuntu xenial/stable amd64 Packages

# Step 2: install given version
# sudo apt-get -y install docker-ce=17.03.1~ce-0~ubuntu-xenial
```

test docker

```bash
sudo docker version
Client:
Version: 18.06.1-ce
API version: 1.38
Go version: go1.10.3
Git commit: e68fc7a
Built: Tue Aug 21 17:24:56 2018
OS/Arch: linux/amd64
Experimental: false

Server:
Engine:
Version: 18.06.1-ce
API version: 1.38 (minimum version 1.12)
Go version: go1.10.3
Git commit: e68fc7a
Built: Tue Aug 21 17:23:21 2018
OS/Arch: linux/amd64
Experimental: false
```

docker namespace

host

```bash
id
uid=1000(kezunlin) gid=1000(kezunlin) groups=1000(kezunlin),4(adm),24(cdrom),27(sudo),30(dip),46(plugdev),113(lpadmin),128(sambashare)

sudo docker images
sudo docker run -it --name kzl -v /home/kezunlin/workspace/:/home/kezunlin/workspace nvidia/cuda
```

container

```bash
root@6f167ef72a80:/home/kezunlin/workspace# ll
total 48
drwxrwxr-x 12 1000 1000 4096 Nov 30 10:04 ./
drwxr-xr-x  3 root root 4096 Nov 30 10:14 ../
drwxrwxr-x 10 1000 1000 4096 Dec  5  2017 MyGit/
drwxrwxr-x 12 1000 1000 4096 Oct 31 03:01 blog/
drwxrwxr-x  5 1000 1000 4096 Sep 20 07:33 opencv/
drwxrwxr-x  4 1000 1000 4096 Oct 31 07:55 openmp/
drwxrwxr-x  5 1000 1000 4096 Jan  9  2018 qt/
drwxrwxr-x  2 1000 1000 4096 Jan  4  2018 ros/
drwxrwxr-x  4 1000 1000 4096 Nov 16  2017 voc/
drwxrwxr-x  5 1000 1000 4096 Aug  7 03:19 vs/
root@6f167ef72a80:/home/kezunlin/workspace# touch 1.txt

root@6f167ef72a80:/home/kezunlin/workspace# id
uid=0(root) gid=0(root) groups=0(root)
```

host

```bash
ll /home/kezunlin/workspace/
total 48
drwxrwxr-x 12 kezunlin kezunlin 4096 11月 30 18:14 ./
drwxr-xr-x 47 kezunlin kezunlin 4096 11月 30 18:04 ../

-rw-r--r--  1 root     root        0 11月 30 18:14 1.txt

drwxrwxr-x 12 kezunlin kezunlin 4096 10月 31 11:01 blog/
drwxrwxr-x  5 kezunlin kezunlin 4096 9月  20 15:33 opencv/
drwxrwxr-x  4 kezunlin kezunlin 4096 10月 31 15:55 openmp/
drwxrwxr-x  5 kezunlin kezunlin 4096 1月   9  2018 qt/
drwxrwxr-x  2 kezunlin kezunlin 4096 1月   4  2018 ros/
drwxrwxr-x  4 kezunlin kezunlin 4096 11月 16  2017 voc/
drwxrwxr-x  5 kezunlin kezunlin 4096 8月   7 11:19 vs/
```

install nvidia-docker2

The machine running the CUDA container only requires the NVIDIA driver; the CUDA toolkit does not have to be installed on the host.

install

remove nvidia-docker 1.0

```bash
# If you have nvidia-docker 1.0 installed: we need to remove it and all existing GPU containers
docker volume ls -q -f driver=nvidia-docker | xargs -r -I{} -n1 docker ps -q -a -f volume={} | xargs -r docker rm -f
sudo apt-get purge -y nvidia-docker
```

Add the package repositories

vim repo.sh

```bash
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | \
	sudo apt-key add -
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
	sudo tee /etc/apt/sources.list.d/nvidia-docker.list
```

run scripts

```bash
chmod +x repo.sh
./repo.sh
```

Install nvidia-docker2 and reload the Docker daemon configuration

```bash
sudo apt-get install -y nvidia-docker2
sudo pkill -SIGHUP dockerd
```

test

```bash
sudo docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi
```

output

```
Unable to find image 'nvidia/cuda:latest' locally
latest: Pulling from nvidia/cuda
8ee29e426c26: Pull complete
6e83b260b73b: Pull complete
e26b65fd1143: Pull complete
40dca07f8222: Pull complete
b420ae9e10b3: Pull complete
a579c1327556: Pull complete
b440bb8df79e: Pull complete
de3b2ccf9562: Pull complete
a69a544d350e: Pull complete
02348b5db71c: Pull complete
Digest: sha256:5996fa2fc0666972360502fe32118286177b879a8a1a834a176e7786021b8cee
Status: Downloaded newer image for nvidia/cuda:latest
Mon Sep 3 10:08:27 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.130 Driver Version: 384.130 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 1060 Off | 00000000:01:00.0 Off | N/A |
| N/A 59C P8 8W / N/A | 408MiB / 6072MiB | 40% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
```

or by tty

```bash
sudo docker run --runtime=nvidia -t -i --privileged nvidia/cuda bash

root@8f3ebd5ecbb6:/# nvidia-smi
Tue Sep 4 01:26:31 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.130 Driver Version: 384.130 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 1060 Off | 00000000:01:00.0 Off | N/A |
| N/A 56C P0 31W / N/A | 374MiB / 6072MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
```

Advanced Topics

Default runtime

The default runtime used by the Docker® Engine is runc; our runtime can become the default one by configuring the docker daemon with --default-runtime=nvidia. Doing so will remove the need to add the --runtime=nvidia argument to docker run. It is also the only way to have GPU access during docker build.

Environment variables

The behavior of the runtime can be modified through environment variables (such as NVIDIA_VISIBLE_DEVICES).
Those environment variables are consumed by nvidia-container-runtime and are documented here.
Our official CUDA images use default values for these variables.

docker command

```bash
sudo docker image list
REPOSITORY TAG IMAGE ID CREATED SIZE
nvidia/cuda latest 04a9ce0dec6d 3 weeks ago 1.96GB

sudo docker run -it --privileged nvidia/cuda bash

docker build --network=host -t anakin:$tag . -f $DockerfilePath
```

kubernetes with GPU

Kubernetes GPU support, as of version 1.9, has gone through three stages:

  • kubernetes 1.3 introduced GPU support, limited to a single GPU card;

  • kubernetes 1.6 added support for multiple GPU cards;

  • kubernetes 1.8 provides GPU support through the device plugin mechanism.

    ls /dev/nvidia*
    /dev/nvidia0 /dev/nvidia2 /dev/nvidia4 /dev/nvidia6 /dev/nvidiactl
    /dev/nvidia1 /dev/nvidia3 /dev/nvidia5 /dev/nvidia7

  • In Kubernetes 1.8~1.9, k8s-device-plugin gathers the GPU information on each Node, and GPU resources are managed and scheduled based on that information. It must be used together with nvidia-docker2.

  • k8s-device-plugin is also provided by nvidia and can be run as a DaemonSet in kubernetes; a pod spec sketch follows this list.
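With the device plugin deployed, a pod requests GPUs through the nvidia.com/gpu resource; a minimal sketch (pod name and image are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-test
spec:
  containers:
    - name: cuda-test
      image: nvidia/cuda
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1   # one GPU via the device plugin
```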

Reference

History

  • 20180903: created.

Series

Tutorial

version

version 1:

  • windows 10 64 bit + GTX 1060(8G) + cuda driver
  • windows 10 64 bit + GTX 1080(12G) + cuda driver
  • CUDA 8.0 + cudnn 6.0.1(win10) + tensorflow-gpu 1.4.0
  • python 3.5.3

version 2:

  • windows 10 64 bit + GeForce Titan Xp(12G) + cuda driver for Titan xp
  • CUDA 9.0 + cudnn 7.1.4(win10) + tensorflow-gpu 1.8.0 ( 1.8.0, 1.9.0 for cuda 9.0)

version 3:

  • windows 10 64 bit + Quadro P4000(8G) + cuda driver for Quadro P4000 (tested: the Titan Xp driver also works)
  • CUDA 9.0 + cudnn 7.1.4(win10) + tensorflow-gpu 1.8.0 ( 1.8.0, 1.9.0 for cuda 9.0)

errors

error retrieving driver version: Unimplemented: kernel reported driver version not implemented on Windows

see tensorflow-gpu==1.4.0

Tips for tensorflow-gpu==1.4.0:
on Linux, Python 2.7 and 3.3-3.6 are supported;
on Windows, only Python 3.5 and 3.6 are supported.

see tensorflow-gpu==1.8.0

Tips for tensorflow-gpu==1.8.0:
on Linux, Python 2.7 and 3.3-3.6 are supported;
on Windows, only Python 3.5 and 3.6 are supported.

Starting with TensorFlow 1.6, use CUDA 9.0 + cuDNN 7.

tensorflow download pages

cuda & cudnn

see Part 1: Install and Configure Caffe on windows 10

system env

C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\bin

python

Install python 3.5.3, then add the python and pip paths to the system environment variables.

Copy python.exe to python3.exe and copy pip.exe to pip3.exe, so the python3 and pip3 commands work on Windows; see the sketch below.
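For example, from a command prompt (a sketch; paths assume the default per-user install location shown in the next block):

```bat
cd C:\Users\zunli\AppData\Local\Programs\Python\Python35
copy python.exe python3.exe
cd Scripts
copy pip.exe pip3.exe
```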

system env

C:\Users\zunli\AppData\Local\Programs\Python\Python35\
C:\Users\zunli\AppData\Local\Programs\Python\Python35\Scripts

test

python3
Python 3.5.3 (v3.5.3:1880cb95a742, Jan 16 2017, 16:02:32) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> ^Z

pip3

pip3 -V
pip 9.0.1 from c:\users\zunli\appdata\local\programs\python\python35\lib\site-packages (python 3.5)

tensorflow

pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple Pillow scipy sklearn scikit-image matplotlib

1.4.0

pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple tensorflow-gpu==1.4.0 keras==2.1.0

1.8.0

pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple tensorflow-gpu==1.8.0 keras==2.2.0

test tensorflow

import tensorflow as tf
import numpy as np

hello = tf.constant('hhh')
sess = tf.Session()
print(sess.run(hello))

test cuda and gpu

import tensorflow as tf

a = tf.test.is_built_with_cuda()  # was TensorFlow built with CUDA support?

b = tf.test.is_gpu_available(
    cuda_only=False,
    min_cuda_compute_capability=None
)  # is a GPU available?

print(a)
print(b)

test gpu

import tensorflow as tf

with tf.device('/cpu:0'):
    a = tf.constant([1.0, 2.0, 3.0], shape=[3], name='a')
    b = tf.constant([1.0, 2.0, 3.0], shape=[3], name='b')
with tf.device('/gpu:0'):
    c = a + b

# Note: allow_soft_placement=True lets TensorFlow pick another device on its
# own; without it this errors out, because not every op can run on the GPU,
# and pinning an unsupported op to the GPU raises an error.
sess = tf.Session(config=tf.ConfigProto(allow_soft_placement=True, log_device_placement=True))
# sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
sess.run(tf.global_variables_initializer())
print(sess.run(c))

gpu run

pycharm

run code with pycharm

pycharm with python3

jupyter notebook

pip install ipykernel
python -m ipykernel install --user --name=tensorflow

Installed kernelspec tensorflow in C:\Users\zunli\AppData\Roaming\jupyter\kernels\tensorflow

error fix

errors:

No matching distribution found for tensorflow

solution: use python 3.5 instead of python 2.7

Reference

History

  • 20180829: created.

Guide

syntax

[ capture clause ] (parameters) -> return-type  
{   
   definition of method   
} 

capture

We can capture external variables from the enclosing scope in three ways:

  Capture by reference
  Capture by value (making a copy)
  Capture by both (mixed capture)

Syntax used for capturing variables:

  []      : capture nothing
  [&]     : capture all external variables by reference
  [=]     : capture all external variables by value (making a copy)
  [a, &b] : capture a by value and b by reference
  [this]  : capture the this pointer of the enclosing class

In C++11, a lambda expression can capture external variables in the following forms:

Capture form	Description
[]	capture no external variables
[var, …]	capture the listed external variables by value (comma-separated); capture by reference must be declared explicitly with &
[this]	capture the this pointer by value
[=]	capture all external variables by value
[&]	capture all external variables by reference
[=, &x]	capture x by reference, all other variables by value
[&, x]	capture x by value, all other variables by reference
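As a quick illustration of mixed capture, which the larger example below does not exercise (a minimal sketch, not from the original post):

```cpp
#include <iostream>

int main()
{
    int a = 1, b = 2;

    // [a, &b]: a by value, b by reference
    auto f = [a, &b]() { b = a + b; };
    f();
    std::cout << b << std::endl;  // prints 3

    // [=, &x]: everything by value, except x by reference
    int x = 0;
    auto g = [=, &x]() { x = a + b; };
    g();
    std::cout << x << std::endl;  // prints 4 (b is now 3)
    return 0;
}
```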

example code

#include <bits/stdc++.h>
using namespace std;

void test_lambda_0()
{
    // call the lambda immediately with a trailing ();
    [] ()
    {
        cout << "Hello, my Greek friends";
    }();

    // return value
    auto l1 = [] ()
    {
        return 1;
    }; // compiler deduces that this returns an integer

    auto l2 = [] () -> int
    {
        return 1;
    }; // now we're telling the compiler what we want
}

// Function to print vector
void printVector(const vector<int>& v)
{
    // lambda expression to print vector
    for_each(v.begin(), v.end(), [](int i)
    {
        std::cout << i << " ";
    });
    cout << endl;
}

void test_lambda_1()
{
    vector<int> v {4, 1, 3, 5, 2, 3, 1, 7};
    printVector(v);

    // capture nothing
    std::sort(v.begin(), v.end(), [](const int& a, const int& b) -> bool
    {
        return a > b;
    });
    printVector(v);

    int ans = accumulate(v.begin(), v.end(), 0,
        [](int i, int j)
        {
            return i + j;
        }
    );
    cout << "SUM = " << ans << endl;
}

void test_lambda_2()
{
    vector<int> v1 = {3, 1, 7, 9};
    vector<int> v2 = {10, 2, 7, 16, 9};

    // access v1 and v2 by reference
    auto pushinto = [&] (int m)
    {
        v1.push_back(m);
        v2.push_back(m);
    };

    // it pushes 20 in both v1 and v2
    pushinto(20);

    // access v1 by value (copy)
    auto printv = [v1]()
    {
        for (auto p = v1.begin(); p != v1.end(); p++)
        {
            cout << *p << " ";
        }
        cout << endl;
    };
    printv();

    int N = 5;
    // the snippet below finds the first number greater than N;
    // [N] means the lambda can access only N, by value
    vector<int>::iterator p = find_if(v1.begin(), v1.end(), [N](int i)
    {
        return i > N;
    });
    cout << "First number greater than 5 is : " << *p << endl;
}

class Foo
{
public:
    Foo () : _x( 3 ) {}
    void func ()
    {
        // a very silly, but illustrative way of printing out the value of _x
        [this] ()
        {
            cout << this->_x;
        } ();
    }

private:
    int _x;
};

void test_lambda_3()
{
    Foo f;
    f.func();
}

void main_demo()
{
    test_lambda_0();
    test_lambda_1();
    test_lambda_2();
    test_lambda_3();
}

int main(int argc, char const *argv[])
{
    main_demo();
    return 0;
}

Reference

History

  • 20180823: created.

Series

Guide

Mat

  • for a gray image, use type <uchar>
  • for an RGB color image, use type <Vec3b>

gray format storage (figure)

color format storage: BGR (figure)

We can use the method isContinuous() to check whether the Mat's memory buffer is continuous.
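Continuity matters because a continuous buffer can be traversed as one long row, which the scan functions below exploit. Note that a sub-matrix view is generally not continuous (a minimal sketch, with an illustrative 640x480 image):

```cpp
#include <opencv2/opencv.hpp>
#include <iostream>
using namespace cv;

int main()
{
    Mat img(480, 640, CV_8UC3, Scalar::all(0));
    std::cout << img.isContinuous() << std::endl;  // 1: freshly allocated, one block

    Mat roi = img(Rect(10, 10, 100, 100));
    std::cout << roi.isContinuous() << std::endl;  // 0: rows are strided views
    return 0;
}
```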

color space reduction

uchar color_space_reduction(uchar pixel)
{
    /*
    0-9     ===> 0
    10-19   ===> 10
    20-29   ===> 20
    ...
    240-249 ===> 240
    250-255 ===> 250

    map from 256*256*256 ===> 26*26*26
    */

    int divideWith = 10;
    uchar new_pixel = (pixel / divideWith) * divideWith;
    return new_pixel;
}
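For example, a pixel value of 137 maps to (137 / 10) * 10 = 130.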

color table

void get_color_table()
{
    // cache reduced color values in table[256]
    int divideWith = 10;
    uchar table[256];
    for (int i = 0; i < 256; ++i)
        table[i] = divideWith * (i / divideWith);
}

C++

ptr []

// C ptr []: faster but not safe
Mat& ScanImageAndReduce_Cptr(Mat& I, const uchar* const table)
{
    // accept only uchar type matrices
    CV_Assert(I.depth() == CV_8U);
    int channels = I.channels();
    int nRows = I.rows;
    int nCols = I.cols * channels;
    if (I.isContinuous())
    {
        nCols *= nRows;
        nRows = 1;
    }
    int i, j;
    uchar* p;
    for (i = 0; i < nRows; ++i)
    {
        p = I.ptr<uchar>(i);
        for (j = 0; j < nCols; ++j)
        {
            p[j] = table[p[j]];
        }
    }
    return I;
}

ptr ++

// C ptr ++: faster but not safe
Mat& ScanImageAndReduce_Cptr2(Mat& I, const uchar* const table)
{
    // accept only uchar type matrices
    CV_Assert(I.depth() == CV_8U);
    int channels = I.channels();
    int nRows = I.rows;
    int nCols = I.cols * channels;
    if (I.isContinuous())
    {
        nCols *= nRows;
        nRows = 1;
    }
    // note: the single pointer walk below assumes a continuous buffer;
    // on a non-continuous Mat it would run into the row padding
    uchar* start = I.ptr<uchar>(0); // same as I.ptr<uchar>(0,0)
    uchar* end = start + nRows * nCols;
    for (uchar* p = start; p < end; ++p)
    {
        *p = table[*p];
    }
    return I;
}

at(i,j)

// at<uchar>(i,j): random access, slow
Mat& ScanImageAndReduce_atRandomAccess(Mat& I, const uchar* const table)
{
    // accept only uchar type matrices
    CV_Assert(I.depth() == CV_8U);
    const int channels = I.channels();
    switch (channels)
    {
    case 1:
    {
        for (int i = 0; i < I.rows; ++i)
            for (int j = 0; j < I.cols; ++j)
                I.at<uchar>(i, j) = table[I.at<uchar>(i, j)];
        break;
    }
    case 3:
    {
        Mat_<Vec3b> _I = I;

        for (int i = 0; i < I.rows; ++i)
            for (int j = 0; j < I.cols; ++j)
            {
                _I(i, j)[0] = table[_I(i, j)[0]];
                _I(i, j)[1] = table[_I(i, j)[1]];
                _I(i, j)[2] = table[_I(i, j)[2]];
            }
        I = _I;
        break;
    }
    }
    return I;
}

Iterator

// MatIterator_<uchar>: safe but slow
Mat& ScanImageAndReduce_Iterator(Mat& I, const uchar* const table)
{
    // accept only uchar type matrices
    CV_Assert(I.depth() == CV_8U);
    const int channels = I.channels();
    switch (channels)
    {
    case 1:
    {
        MatIterator_<uchar> it, end;
        for (it = I.begin<uchar>(), end = I.end<uchar>(); it != end; ++it)
            *it = table[*it];
        break;
    }
    case 3:
    {
        MatIterator_<Vec3b> it, end;
        for (it = I.begin<Vec3b>(), end = I.end<Vec3b>(); it != end; ++it)
        {
            (*it)[0] = table[(*it)[0]];
            (*it)[1] = table[(*it)[1]];
            (*it)[2] = table[(*it)[2]];
        }
        break;
    }
    }
    return I;
}

opencv LUT

// LUT
Mat& ScanImageAndReduce_LUT(Mat& I, const uchar* const table)
{
    Mat lookUpTable(1, 256, CV_8U);
    uchar* p = lookUpTable.data;
    for (int i = 0; i < 256; ++i)
        p[i] = table[i];

    cv::LUT(I, lookUpTable, I);
    return I;
}

forEach

The forEach method of the Mat class utilizes all the cores on your machine to apply a function at every pixel.

// Parallel execution with function object.
struct ForEachOperator
{
    uchar m_table[256];
    ForEachOperator(const uchar* const table)
    {
        for (size_t i = 0; i < 256; i++)
        {
            m_table[i] = table[i];
        }
    }

    void operator ()(uchar& p, const int * position) const
    {
        // Perform a simple operation
        p = m_table[p];
    }
};

// forEach uses multiple processors, very fast
Mat& ScanImageAndReduce_forEach(Mat& I, const uchar* const table)
{
    I.forEach<uchar>(ForEachOperator(table));
    return I;
}

forEach with lambda

// forEach with a lambda also uses multiple processors, very fast
// (the lambda is slightly slower than ForEachOperator)
Mat& ScanImageAndReduce_forEach_with_lambda(Mat& I, const uchar* const table)
{
    I.forEach<uchar>
    (
        [=](uchar &p, const int * position) -> void
        {
            p = table[p];
        }
    );
    return I;
}

time cost

no foreach

[1 Cptr   ] times=5000, total_cost=988 ms, avg_cost=0.1976 ms
[1 Cptr2  ] times=5000, total_cost=1704 ms, avg_cost=0.3408 ms
[2 atRandom] times=5000, total_cost=9611 ms, avg_cost=1.9222 ms
[3 Iterator] times=5000, total_cost=20195 ms, avg_cost=4.039 ms
[4 LUT    ] times=5000, total_cost=899 ms, avg_cost=0.1798 ms

[1 Cptr   ] times=10000, total_cost=2425 ms, avg_cost=0.2425 ms
[1 Cptr2  ] times=10000, total_cost=3391 ms, avg_cost=0.3391 ms
[2 atRandom] times=10000, total_cost=20024 ms, avg_cost=2.0024 ms
[3 Iterator] times=10000, total_cost=39980 ms, avg_cost=3.998 ms
[4 LUT    ] times=10000, total_cost=103 ms, avg_cost=0.0103 ms

foreach

[5 forEach     ] times=200000, total_cost=199 ms, avg_cost=0.000995 ms
[5 forEach lambda] times=200000, total_cost=521 ms, avg_cost=0.002605 ms

[5 forEach     ] times=20000, total_cost=17 ms, avg_cost=0.00085 ms
[5 forEach lambda] times=20000, total_cost=23 ms, avg_cost=0.00115 ms

results

Loop Type      | Time Cost (us)
:------------: | :------------:
ptr []         | 242
ptr ++         | 339
at             | 2002
iterator       | 3998
LUT            | 10
forEach        | 0.85
forEach lambda | 1.15

forEach is about 10x faster than LUT, 240-340x faster than ptr [] and ptr ++, and 2000-4000x faster than at and iterator.

code

code here

Python

pure python

# import the necessary packages
import matplotlib.pyplot as plt
import cv2
print(cv2.__version__)

%matplotlib inline
3.4.2
# load the original image, convert it to grayscale, and display
# it inline
image = cv2.imread("cat.jpg")
image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
print(image.shape)
#plt.imshow(image, cmap="gray")
(360, 480)
%load_ext cython
The cython extension is already loaded. To reload it, use:
  %reload_ext cython
%%cython -a

def threshold_python(T, image):
    # grab the image dimensions
    h = image.shape[0]
    w = image.shape[1]

    # loop over the image, pixel by pixel
    for y in range(0, h):
        for x in range(0, w):
            # threshold the pixel
            image[y, x] = 255 if image[y, x] >= T else 0

    # return the thresholded image
    return image
%timeit threshold_python(5, image)
263 ms ± 20.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

cython

%%cython -a

import cython

@cython.boundscheck(False)
cpdef unsigned char[:, :] threshold_cython(int T, unsigned char [:, :] image):
    # set the variable extension types
    cdef int x, y, w, h

    # grab the image dimensions
    h = image.shape[0]
    w = image.shape[1]

    # loop over the image
    for y in range(0, h):
        for x in range(0, w):
            # threshold the pixel
            image[y, x] = 255 if image[y, x] >= T else 0

    # return the thresholded image
    return image

%timeit threshold_cython(5, image)
150 µs ± 7.14 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

numba
from numba import njit

@njit
def threshold_njit(T, image):
    # grab the image dimensions
    h = image.shape[0]
    w = image.shape[1]

    # loop over the image, pixel by pixel
    for y in range(0, h):
        for x in range(0, w):
            # threshold the pixel
            image[y, x] = 255 if image[y, x] >= T else 0

    # return the thresholded image
    return image
%timeit threshold_njit(5, image)
43.5 µs ± 142 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

numpy

def threshold_numpy(T, image):
    image[image > T] = 255
    return image
%timeit threshold_numpy(5, image)
111 µs ± 334 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
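For reference, OpenCV ships a vectorized threshold as well; note that cv2.threshold zeroes pixels at or below T, whereas the numpy version above only sets the bright pixels and leaves the rest unchanged (a minimal sketch):

```python
import cv2

def threshold_opencv(T, image):
    # THRESH_BINARY: dst = 255 if src > T else 0
    _, out = cv2.threshold(image, T, 255, cv2.THRESH_BINARY)
    return out
```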

conclusions

image = cv2.imread("cat.jpg")
image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
print(image.shape)

%timeit threshold_python(5, image)
%timeit threshold_cython(5, image)
%timeit threshold_njit(5, image)
%timeit threshold_numpy(5, image)
(360, 480)
251 ms ± 6.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
143 µs ± 1.19 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
43.8 µs ± 284 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
113 µs ± 957 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
image = cv2.imread("big.jpg")
image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
print(image.shape)

%timeit threshold_python(5, image)
%timeit threshold_cython(5, image)
%timeit threshold_njit(5, image)
%timeit threshold_numpy(5, image)
(2880, 5120)
21.8 s ± 460 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
12.3 ms ± 231 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.91 ms ± 66.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
10.3 ms ± 179 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

(360, 480)

  • python: 251 ms
  • cython: 143 us
  • numba: 43 us
  • numpy: 113 us

(2880, 5120)

  • python: 21 s
  • cython: 12 ms
  • numba: 4 ms
  • numpy: 10 ms

Reference

History

  • 20180823: created.