Guide

ln

ln -s src dest

ln -s $(pwd)/yolo /home/kezunlin/program/darknet/kzl-yolo

tips: use an absolute path for the source when linking a folder
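
The tip matters because a symlink stores its target string verbatim, and a relative target is resolved relative to the directory that holds the link, not the directory where the command was run. A minimal Python sketch of the same idea follows; the directory names and the `link_folder` helper are hypothetical and only serve the demonstration.

```python
import os
import tempfile

def link_folder(src, dest):
    """Create a symlink at dest that points to the absolute path of src."""
    target = os.path.abspath(src)  # same idea as $(pwd)/yolo in the shell
    os.symlink(target, dest)
    return target

# Hypothetical layout: <tmp>/yolo is the real folder,
# <tmp>/program/kzl-yolo is the link that should point to it.
root = tempfile.mkdtemp()
src = os.path.join(root, "yolo")
os.makedirs(src)
os.makedirs(os.path.join(root, "program"))
dest = os.path.join(root, "program", "kzl-yolo")

target = link_folder(src, dest)
print(os.readlink(dest) == target)  # the stored target is the absolute path
```

Because the stored target is absolute, the link keeps working even if the link itself is later moved to another directory.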

sed

case1

sed -i "s/old/new/g" 1.txt

replace “old” with “new”

case2

sed -i "7,7s/neat_enable\: false/neat_enable\: true/g" _config.yml

replace neat_enable: false with neat_enable: true on line 7 only

case3

sed -i "s/http:\/\/kezunlin.me/https:\/\/kezunlin.me/g" posts/post1.md

replace “http://kezunlin.me” with “https://kezunlin.me”

grep

case1

grep UpsampleParameter . -r --include *.h 

grep UpsampleParameter . -r --include *.proto
./src/caffe/proto/caffe.proto: optional UpsampleParameter upsample_param = 150;
./src/caffe/proto/caffe.proto:message UpsampleParameter {

case2

grep -r "http://kezunlin.me" posts/ 
posts/post1.md:[here](http://kezunlin.me/post/book)
posts/post2.md:-[here](http://kezunlin.me/post/book)
posts/post2.md:- [img](http://kezunlin.me/post/book)

case3

grep -rl "http://kezunlin.me" posts 
posts/post1.md
posts/post2.md

case4

sed -i "s/http:\/\/kezunlin.me/https:\/\/kezunlin.me/g" `grep -rl "http://kezunlin.me" posts`  

replace http://kezunlin.me with https://kezunlin.me in all posts files.
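
For comparison, the same "find the files that match, then rewrite them" idea can be sketched in Python. This is only an illustration under assumptions: `replace_in_tree` is a hypothetical helper, the demo files are created in a temporary directory, and `str.replace` does literal matching, whereas the `.` in the sed pattern above matches any character.

```python
import os
import tempfile

def replace_in_tree(root, old, new):
    """Rewrite every file under root that contains old; return the changed paths."""
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "r", encoding="utf-8") as f:
                text = f.read()
            if old not in text:
                continue  # like grep -l: skip files without a match
            with open(path, "w", encoding="utf-8") as f:
                f.write(text.replace(old, new))  # like sed s/old/new/g
            changed.append(path)
    return sorted(changed)

# Hypothetical posts directory, created only for this demonstration.
posts = tempfile.mkdtemp()
post1 = os.path.join(posts, "post1.md")
post2 = os.path.join(posts, "post2.md")
with open(post1, "w", encoding="utf-8") as f:
    f.write("[here](http://kezunlin.me/post/book)\n")
with open(post2, "w", encoding="utf-8") as f:
    f.write("no links here\n")

changed = replace_in_tree(posts, "http://kezunlin.me", "https://kezunlin.me")
print([os.path.basename(p) for p in changed])
```

Only files that actually contain the old string are rewritten, which mirrors feeding the `grep -rl` result to `sed -i`.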

case5

sed -i "s/comments\: false/comments\: true/g" _posts/*

replace comments: false with comments: true in all posts files.

case6

grep --include=*.py -lr OKUtil .  | xargs 

# template: substitute real strings for <old> and <new>
grep -r -l <old> * | xargs sed -i 's/<old>/<new>/g'
# example: rename OKUtil to OkoooUtil
grep -r -l OKUtil * | xargs sed -i 's/OKUtil/OkoooUtil/g'

ldd

for so

ldd test_opencv | awk '/ => / { print $3 }'
(0x00007fff309d4000)
/usr/local/lib/libopencv_imgcodecs.so.3.1
/usr/local/lib/libopencv_core.so.3.1
/usr/lib/x86_64-linux-gnu/libstdc++.so.6
/lib/x86_64-linux-gnu/libgcc_s.so.1
/lib/x86_64-linux-gnu/libc.so.6
/usr/local/lib/libopencv_imgproc.so.3.1
/usr/local/lib/libjpeg.so.8
/lib/x86_64-linux-gnu/libpng12.so.0
/usr/lib/x86_64-linux-gnu/libtiff.so.5
/usr/lib/x86_64-linux-gnu/libjasper.so.1
/lib/x86_64-linux-gnu/libz.so.1
/lib/x86_64-linux-gnu/libm.so.6
/lib/x86_64-linux-gnu/libpthread.so.0
/lib/x86_64-linux-gnu/libdl.so.2
/lib/x86_64-linux-gnu/librt.so.1
/lib/x86_64-linux-gnu/liblzma.so.5
/usr/lib/x86_64-linux-gnu/libjbig.so.0

for not found

ldd test_opencv | awk 'NF==1 {file=$1} /not found/ { print file, $1 }'
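
The two awk filters above can also be expressed as one small parser. The following Python sketch works on ldd-style text; the `parse_ldd` helper and the sample text (including the missing `libfoo.so.1`) are made up for illustration and are not output from a real run.

```python
def parse_ldd(output):
    """Split ldd-style output into resolved library paths and missing library names."""
    resolved, missing = [], []
    for line in output.splitlines():
        if "=>" not in line:
            continue  # e.g. the dynamic loader line has no "=>"
        name, _, rest = line.partition("=>")
        rest = rest.strip()
        if rest.startswith("not found"):
            missing.append(name.strip())
        elif rest.startswith("/"):
            resolved.append(rest.split()[0])  # drop the load address
    return resolved, missing

# Hypothetical sample in the usual ldd layout.
sample = """\
\tlinux-vdso.so.1 =>  (0x00007fff309d4000)
\tlibopencv_core.so.3.1 => /usr/local/lib/libopencv_core.so.3.1 (0x00007f0000000000)
\tlibfoo.so.1 => not found
\t/lib64/ld-linux-x86-64.so.2 (0x00007f0000100000)
"""
resolved, missing = parse_ldd(sample)
print(resolved)
print(missing)
```

To use it on a real binary, the text could come from `subprocess.run(["ldd", path], capture_output=True, text=True).stdout`.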

useful

ldd example_opencv | awk  '{if (match($3,"/")){ printf("%s \n"),$3 } }'
/usr/local/lib/libopencv_imgcodecs.so.3.1
/usr/local/lib/libopencv_core.so.3.1
/usr/lib/x86_64-linux-gnu/libstdc++.so.6
/lib/x86_64-linux-gnu/libgcc_s.so.1
/lib/x86_64-linux-gnu/libc.so.6
/usr/local/lib/libopencv_imgproc.so.3.1
/usr/local/lib/libjpeg.so.8
/lib/x86_64-linux-gnu/libpng12.so.0
/usr/lib/x86_64-linux-gnu/libtiff.so.5
/usr/lib/x86_64-linux-gnu/libjasper.so.1
/lib/x86_64-linux-gnu/libz.so.1
/lib/x86_64-linux-gnu/libm.so.6
/lib/x86_64-linux-gnu/libpthread.so.0
/lib/x86_64-linux-gnu/libdl.so.2
/lib/x86_64-linux-gnu/librt.so.1
/lib/x86_64-linux-gnu/liblzma.so.5
/usr/lib/x86_64-linux-gnu/libjbig.so.0

curl

login website with json data

curl -X POST \
-H "Accept: application/json" -H "Content-type: application/json" \
--data '{"name":"admin","password":"21232f297a57a5a743894a0e4a801fc3"}' \
http://192.168.0.12:8888/api/login

md5(admin) = 21232f297a57a5a743894a0e4a801fc3
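
The password field is the MD5 hex digest of the plain-text password. A minimal Python sketch that computes the digest and builds (but does not send) an equivalent request; `build_login_request` is a hypothetical helper, and the URL is the one from the curl example above.

```python
import hashlib
import json
import urllib.request

def build_login_request(url, name, password):
    """Build a JSON POST request whose password field is the MD5 hex digest."""
    digest = hashlib.md5(password.encode("utf-8")).hexdigest()
    body = json.dumps({"name": name, "password": digest}).encode("utf-8")
    headers = {"Accept": "application/json", "Content-Type": "application/json"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")

req = build_login_request("http://192.168.0.12:8888/api/login", "admin", "admin")
payload = json.loads(req.data.decode("utf-8"))
print(payload["password"])  # 21232f297a57a5a743894a0e4a801fc3
```

Passing `req` to `urllib.request.urlopen` would send it; that step is left out here because it needs the server to be reachable.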

output

{"rtn":0,"message":"OK","session_id":"[email protected]"}

query results with session_id

curl -X GET \
-H "session_id: [email protected]" \
http://192.168.0.12:8888/api/book

download file

curl -LSso ~/.vim/autoload/pathogen.vim https://tpo.pe/pathogen.vim

Tips

check cpu

cat /proc/cpuinfo | grep name | cut -f2 -d: | uniq -c 
8 Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz


cat /proc/cpuinfo | grep physical | uniq -c
1 physical id : 0
1 address sizes : 39 bits physical, 48 bits virtual
1 physical id : 0
1 address sizes : 39 bits physical, 48 bits virtual
1 physical id : 0
1 address sizes : 39 bits physical, 48 bits virtual
1 physical id : 0
1 address sizes : 39 bits physical, 48 bits virtual
1 physical id : 0
1 address sizes : 39 bits physical, 48 bits virtual
1 physical id : 0
1 address sizes : 39 bits physical, 48 bits virtual
1 physical id : 0
1 address sizes : 39 bits physical, 48 bits virtual
1 physical id : 0
1 address sizes : 39 bits physical, 48 bits virtual

check gpu

lspci | grep -i vga
00:02.0 VGA compatible controller: Intel Corporation Device 591b (rev 04)
01:00.0 VGA compatible controller: NVIDIA Corporation Device 1c20 (rev a1)


lspci -v -s 01:00.0
01:00.0 VGA compatible controller: NVIDIA Corporation Device 1c20 (rev a1) (prog-if 00 [VGA controller])
Subsystem: CLEVO/KAPOK Computer Device 65a1
Flags: bus master, fast devsel, latency 0, IRQ 130
Memory at db000000 (32-bit, non-prefetchable) [size=16M]
Memory at 90000000 (64-bit, prefetchable) [size=256M]
Memory at a0000000 (64-bit, prefetchable) [size=32M]
I/O ports at e000 [size=128]
[virtual] Expansion ROM at dc000000 [disabled] [size=512K]
Capabilities: <access denied>
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_396, nvidia_396_drm


nvidia-smi
Tue Feb 12 10:09:14 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.54 Driver Version: 396.54 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 1060 Off | 00000000:01:00.0 Off | N/A |
| N/A 56C P8 8W / N/A | 601MiB / 6078MiB | 3% Default |
+-------------------------------+----------------------+----------------------+


watch -n 1 nvidia-smi

pip install gpustat
watch --color -n1 gpustat -cpu

multiple terminal

gnome-terminal --working-directory=/home --tab --tab --tab

stop lightdm

sudo apt-get install xserver-xorg

# stop desktop
sudo /etc/init.d/lightdm stop

sudo fbterm

changedir no effect

changedir.sh

#!/bin/bash
cd /home/firefox
pwd

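When the script is run as ./changedir.sh, the terminal starts a subshell to execute it. The directory change happens inside that subshell, but the subshell exits as soon as the script finishes, and all of its variables and state are discarded. Back in the terminal, the change is not visible at all.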

solution

source ./changedir.sh
# or
. changedir.sh

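Running the script with source changedir.sh executes it directly in the terminal's own shell. No subshell is created, so the current terminal's working directory changes.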
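
The same isolation applies to any child process, not just shell scripts. A minimal Python sketch, with a temporary directory standing in for /home/firefox: the child changes its own working directory, and the parent's stays where it was.

```python
import os
import subprocess
import sys
import tempfile

before = os.getcwd()
other = tempfile.mkdtemp()  # hypothetical directory the child switches to

# The child process changes its own working directory and reports it.
child = subprocess.run(
    [sys.executable, "-c",
     "import os, sys; os.chdir(sys.argv[1]); print(os.getcwd())", other],
    capture_output=True, text=True, check=True,
)
child_cwd = child.stdout.strip()
after = os.getcwd()

print(child_cwd)        # the directory the child moved into
print(after == before)  # True: the parent process is unaffected
```

This is why `source` is needed in the shell case: it runs the commands in the current process instead of a child.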

kill process

lsof -i:4000
netstat -tunlp | grep 4000
kill -9 1234

History

  • 20181130: created.

Guide

compile

git clone https://github.com/davisking/dlib.git
cd dlib && mkdir build && cd build
cmake-gui ..

with options

CMAKE_INSTALL_PREFIX /usr/local
CUDA 9.2 + cuDNN 7.1.4

generate

Found CUDA: /usr/local/cuda (found suitable version "9.2", minimum required is "7.5") 
Looking for cuDNN install...
Found cuDNN: /usr/local/cuda/lib64/libcudnn.so
Building a CUDA test project to see if your compiler is compatible with CUDA...
Checking if you have the right version of cuDNN installed.
Enabling CUDA support for dlib.  DLIB WILL USE CUDA
C++11 activated.

make and install

make -j8 
sudo make install

output

[100%] Linking CXX static library libdlib.a
[100%] Built target dlib

generate static library libdlib.a

CMakeLists.txt

find_package(dlib REQUIRED)

if(MSVC)
set(dlib_LIBRARIES "C:/Program Files/dlib/lib/dlib.lib") # replace dlib::dlib
else()
endif(MSVC)
# ${dlib_INCLUDE_DIRS} and ${dlib_LIBRARIES} are deprecated, simply use target_link_libraries(your_app dlib::dlib)
MESSAGE( [Main] " dlib_INCLUDE_DIRS = ${dlib_INCLUDE_DIRS}")
MESSAGE( [Main] " dlib_LIBRARIES = ${dlib_LIBRARIES}")


add_executable(demo demo.cpp)
#target_link_libraries(demo ${dlib_LIBRARIES})
target_link_libraries(demo dlib::dlib)

History

  • 20181127: created.

Guide

requirements:

  • pybind11 v2.3.dev0
  • python 3.5

install pytest

pip3 install pytest 

compile

git clone https://github.com/pybind/pybind11.git
cd pybind11
mkdir build
cd build
cmake-gui ..

with options

PYBIND11_CPP_STANDARD -std=c++11 # default c++14; with MSVC the flag form is /std:c++14
PYTHON_EXECUTABLE /usr/bin/python3.5
CMAKE_INSTALL_PREFIX /usr/local

install

make and install

make -j8
sudo make install

pybind11 is header-only, so this installs only the headers to /usr/local/include/pybind11 and the CMake config files to /usr/local/share/cmake/pybind11

output

Install the project...
-- Install configuration: "MinSizeRel"
-- Installing: /usr/local/include/pybind11
-- Installing: /usr/local/include/pybind11/chrono.h
-- Installing: /usr/local/include/pybind11/eigen.h
-- Installing: /usr/local/include/pybind11/stl.h
-- Installing: /usr/local/include/pybind11/complex.h
-- Installing: /usr/local/include/pybind11/detail
-- Installing: /usr/local/include/pybind11/detail/internals.h
-- Installing: /usr/local/include/pybind11/detail/common.h
-- Installing: /usr/local/include/pybind11/detail/descr.h
-- Installing: /usr/local/include/pybind11/detail/init.h
-- Installing: /usr/local/include/pybind11/detail/class.h
-- Installing: /usr/local/include/pybind11/detail/typeid.h
-- Installing: /usr/local/include/pybind11/common.h
-- Installing: /usr/local/include/pybind11/iostream.h
-- Installing: /usr/local/include/pybind11/buffer_info.h
-- Installing: /usr/local/include/pybind11/attr.h
-- Installing: /usr/local/include/pybind11/numpy.h
-- Installing: /usr/local/include/pybind11/pybind11.h
-- Installing: /usr/local/include/pybind11/operators.h
-- Installing: /usr/local/include/pybind11/options.h
-- Installing: /usr/local/include/pybind11/cast.h
-- Installing: /usr/local/include/pybind11/eval.h
-- Installing: /usr/local/include/pybind11/embed.h
-- Installing: /usr/local/include/pybind11/pytypes.h
-- Installing: /usr/local/include/pybind11/functional.h
-- Installing: /usr/local/include/pybind11/stl_bind.h
-- Installing: /usr/local/share/cmake/pybind11/pybind11Config.cmake
-- Installing: /usr/local/share/cmake/pybind11/pybind11ConfigVersion.cmake
-- Installing: /usr/local/share/cmake/pybind11/FindPythonLibsNew.cmake
-- Installing: /usr/local/share/cmake/pybind11/pybind11Tools.cmake
-- Installing: /usr/local/share/cmake/pybind11/pybind11Targets.cmake

Usage

pybind11

CMakeLists.txt

find_package(pybind11 CONFIG REQUIRED)
include_directories(${pybind11_INCLUDE_DIRS})

MESSAGE( [MAIN] "Found pybind11 v${pybind11_VERSION}: ${pybind11_INCLUDE_DIRS}")
MESSAGE( [Main] " pybind11_INCLUDE_DIRS = ${pybind11_INCLUDE_DIRS}")
MESSAGE( [Main] " pybind11_LIBRARIES = ${pybind11_LIBRARIES}")

add_library(examplelib
${HEADER_FILES}
${SOURCE_FILES}
)

target_link_libraries (examplelib
pybind11::module
${xxx_LIBRARIES}
)

embed

CMakeLists.txt

find_package(pybind11 CONFIG REQUIRED)
include_directories(${pybind11_INCLUDE_DIRS})

MESSAGE( [MAIN] "Found pybind11 v${pybind11_VERSION}: ${pybind11_INCLUDE_DIRS}")
MESSAGE( [Main] " pybind11_INCLUDE_DIRS = ${pybind11_INCLUDE_DIRS}")
MESSAGE( [Main] " pybind11_LIBRARIES = ${pybind11_LIBRARIES}")

add_executable(cpp_use_python cpp_use_python.cpp)
target_link_libraries(cpp_use_python PRIVATE pybind11::embed)

History

  • 20181127: created.

Guide

RefineDet is based on Caffe.

See Install and Configure Caffe on ubuntu 16.04

  • ubuntu 16.04
  • CUDA 9.2 + cudnn 7.1.4 (for caffe/tensorrt/anakin)
  • opencv 3.3.0
  • python 2.7
  • caffe (from refinedet)

compile

git clone https://github.com/sfzhang15/RefineDet.git
cd RefineDet
mkdir build && cd build && cmake-gui ..

make -j8 && make pycaffe

options

USE_CUDNN True
USE_OPENCV True
WITH_PYTHON_LAYER True
BLAS atlas

CMAKE_INSTALL_PREFIX /home/kezunlin/program/refinedet/build/install

tips: vim CMakeLists.txt and comment out examples and docs

#add_subdirectory(examples)
add_subdirectory(python)
add_subdirectory(matlab)
#add_subdirectory(docs)

fix gflags error

  • caffe/include/caffe/common.hpp
  • caffe/examples/mnist/convert_mnist_data.cpp

Comment out the ifndef

// #ifndef GFLAGS_GFLAGS_H_
namespace gflags = google;
// #endif // GFLAGS_GFLAGS_H_

example

The script below has two versions:

  • single version: one image per forward pass
  • batch version: all images in one forward pass
'''
In this example, we will load a RefineDet model and use it to detect objects.
'''
import argparse
import os
import sys
import numpy as np
import skimage.io as io
import cv2
# Make sure that caffe is on the python path:
caffe_root = './'
os.chdir(caffe_root)
sys.path.insert(0, os.path.join(caffe_root, 'python'))
import caffe

classes = ['background', 'person']

def filter_boxs(boxs, threshold=0.4):
    """
    boxs: 500*6 (xmin,ymin,xmax,ymax,confidence,class_index)
    class_index: 0 background, 1 person
    confidence: 0-1
    return:
        new_boxs `list` [b1,b2,b3,...]
    """
    new_boxs = []
    for i in range(0, boxs.shape[0]):
        xmin,ymin,xmax,ymax,confidence,class_index = boxs[i]
        #print(type(class_index)) # float32
        if int(class_index)>0 and confidence >= threshold:
            box = [int(xmin),int(ymin),int(xmax),int(ymax),confidence, int(class_index)]
            new_boxs.append(box)
    return new_boxs # list [b1,b2,b3,...]

def save_results(counter, image_file, boxs, save_fig=False):

    img = cv2.imread(image_file)
    for i in range(0, len(boxs)):
        xmin,ymin,xmax,ymax,confidence,class_index = boxs[i]

        name = classes[class_index]
        coords = (xmin, ymin), xmax - xmin, ymax - ymin

        cv2.rectangle(img, (xmin, ymin), (xmax, ymax), (0, 0, 255), 3) # bgr
        #display_text = '%s: %.2f' % (name, confidence)
        display_text = '%.2f' % (confidence)
        cv2.putText(img, display_text, (xmin, ymin-5), cv2.FONT_HERSHEY_SIMPLEX, 1, color=(0,0,255), thickness=2)

    if save_fig:
        image_filepath = 'output/{0}_results.jpg'.format(counter)
        cv2.imwrite(image_filepath, img)
        print('Saved: ' + image_filepath)


def single():
    caffe.set_device(0)
    caffe.set_mode_gpu()

    save_dir = "./output"
    if not os.path.exists(save_dir):
        os.mkdir(save_dir)

    # load model
    model_def = 'models/ResNet/coco/refinedet_resnet101_512x512/deploy.prototxt'
    model_weights = 'models/ResNet/coco/refinedet_resnet101_512x512/coco_refinedet_resnet101_512x512_iter_75000.caffemodel'
    net = caffe.Net(model_def, model_weights, caffe.TEST)

    # image preprocessing
    img_resize = 512
    net.blobs['data'].reshape(1, 3, img_resize, img_resize)
    data_shape = net.blobs['data'].data.shape
    print("data_shape=", data_shape) # 1, 3, 512, 512
    # by default, caffe use chw, bgr, 0-255, image-[104, 117, 123]
    transformer = caffe.io.Transformer({'data':data_shape})
    transformer.set_transpose('data', (2, 0, 1)) # hwc ===> chw
    transformer.set_channel_swap('data', (2, 1, 0)) # rgb===>bgr
    transformer.set_raw_scale('data', 255) # [0-1]===> [0,255]
    transformer.set_mean('data', np.array([104, 117, 123])) # bgr mean pixel

    files = ["./images/1.png", "./images/2.png"]# 500,7 + 384,7 === 500,7 + 500,7
    for index,image_file in enumerate(files):
        print("image_file=", image_file)
        image = caffe.io.load_image(image_file) # hwc, rgb, 0-1
        print("image.shape=", image.shape)

        transformed_image = transformer.preprocess('data', image)
        print("transformed_image.shape=", transformed_image.shape)

        net.blobs['data'].data[...] = transformed_image

        detections = net.forward()['detection_out']
        print("detections.shape = ",detections.shape) # 1, 1, 500, 7
        det_label = detections[0, 0, :, 1] # 0 back, 1 -person (now only ==1)
        det_conf = detections[0, 0, :, 2] # 0-1
        det_xmin = detections[0, 0, :, 3] * image.shape[1]
        det_ymin = detections[0, 0, :, 4] * image.shape[0]
        det_xmax = detections[0, 0, :, 5] * image.shape[1]
        det_ymax = detections[0, 0, :, 6] * image.shape[0]
        boxs = np.column_stack([det_xmin, det_ymin, det_xmax, det_ymax, det_conf, det_label])
        print("boxs = ", boxs.shape) # 500,6

        new_boxs = filter_boxs(boxs)
        print("new_boxs = ", len(new_boxs)) # 3 boxs

        # show result
        save_results(index, image_file, new_boxs, save_fig=True)

def batch():
    caffe.set_device(0)
    caffe.set_mode_gpu()

    save_dir = "./output"
    if not os.path.exists(save_dir):
        os.mkdir(save_dir)

    # load model
    model_def = 'models/ResNet/coco/refinedet_resnet101_512x512/deploy.prototxt'
    model_weights = 'models/ResNet/coco/refinedet_resnet101_512x512/coco_refinedet_resnet101_512x512_iter_75000.caffemodel'
    net = caffe.Net(model_def, model_weights, caffe.TEST)

    box_count_per_image = 500
    #files = ["./images/2.png"]
    files = ["./images/1.png", "./images/2.png"]# 500,7 + 384,7 === 500,7 + 500,7
    # update detection_output_layer.cpp and cu to keep 500 box results
    batch_size = len(files)
    # image preprocessing
    img_resize = 512
    net.blobs['data'].reshape(batch_size, 3, img_resize, img_resize)
    data_shape = net.blobs['data'].data.shape
    print("data_shape=", data_shape) # batch_size, 3, 512, 512
    # by default, caffe use chw, bgr, 0-255, image-[104, 117, 123]
    transformer = caffe.io.Transformer({'data':data_shape})
    transformer.set_transpose('data', (2, 0, 1)) # hwc ===> chw
    transformer.set_channel_swap('data', (2, 1, 0)) # rgb===>bgr
    transformer.set_raw_scale('data', 255) # [0-1]===> [0,255]
    transformer.set_mean('data', np.array([104, 117, 123])) # bgr mean pixel

    image_shapes = [] # remember each image's own (h, w, c) for rescaling boxes
    for i in range(len(files)):
        #image_file = "./images/1.png"
        image_file = files[i]
        print("image_file=", image_file)
        image = caffe.io.load_image(image_file) # hwc, rgb, 0-1
        print("image.shape=", image.shape)
        image_shapes.append(image.shape)

        transformed_image = transformer.preprocess('data', image)
        print("transformed_image.shape=", transformed_image.shape)

        net.blobs['data'].data[i,:,:,:] = transformed_image

    detections = net.forward()['detection_out']
    print("detections.shape = ",detections.shape) # 1, 1, 500+384, 7 ===> 1,1, 1000,7

    for i in range(batch_size):
        start = i * box_count_per_image
        end = (i+1) * box_count_per_image
        print("start-end: ",start, end)

        # use the size and file of image i, not of the last image loaded
        height, width = image_shapes[i][:2]

        det_label = detections[0, 0, start:end, 1] # 0 back, 1 -person (now only ==1)
        print(det_label[:10])
        det_conf = detections[0, 0, start:end, 2] # 0-1
        det_xmin = detections[0, 0, start:end, 3] * width
        det_ymin = detections[0, 0, start:end, 4] * height
        det_xmax = detections[0, 0, start:end, 5] * width
        det_ymax = detections[0, 0, start:end, 6] * height
        boxs = np.column_stack([det_xmin, det_ymin, det_xmax, det_ymax, det_conf, det_label])
        print("boxs = ", boxs.shape) # 500,6

        new_boxs = filter_boxs(boxs)
        print("new_boxs = ", len(new_boxs)) # 3 boxs

        # show result
        save_results(i, files[i], new_boxs, save_fig=True)

if __name__ == '__main__':
    #single()
    batch()

output

('data_shape=', (2, 3, 512, 512))
('image_file=', './images/1.png')
('image.shape=', (1080, 1920, 3))
('transformed_image.shape=', (3, 512, 512))
('image_file=', './images/2.png')
('image.shape=', (1080, 1920, 3))
('transformed_image.shape=', (3, 512, 512))
('detections.shape = ', (1, 1, 1000, 7))
('start-end: ', 0, 500)
[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.]
('boxs = ', (500, 6))
('new_boxs = ', 3)
Saved: output/0_results.jpg
('start-end: ', 500, 1000)
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
('boxs = ', (500, 6))
('new_boxs = ', 6)
Saved: output/1_results.jpg
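
The per-image slicing and filtering in the batch version can also be written with NumPy masks instead of a Python loop over boxes. The sketch below is an illustration on synthetic data, not part of the original script: `split_detections` is a hypothetical helper, and it assumes, as the script does, that the detection layer emits exactly 500 rows per image in batch order.

```python
import numpy as np

def split_detections(detections, image_shapes, boxes_per_image=500, threshold=0.4):
    """detections: (1, 1, N*boxes_per_image, 7); image_shapes: list of (height, width).

    Returns one (k, 6) array per image with columns
    xmin, ymin, xmax, ymax, confidence, class_index in pixel coordinates.
    """
    results = []
    for i, (height, width) in enumerate(image_shapes):
        rows = detections[0, 0, i * boxes_per_image:(i + 1) * boxes_per_image, :]
        label, conf = rows[:, 1], rows[:, 2]
        keep = (label.astype(int) > 0) & (conf >= threshold)  # drop background and weak boxes
        scale = np.array([width, height, width, height], dtype=rows.dtype)
        boxes = rows[keep, 3:7] * scale  # normalized -> pixel coordinates
        results.append(np.column_stack([boxes, conf[keep], label[keep]]))
    return results

# Synthetic detections for two 1080x1920 images (all values are made up).
rng = np.random.default_rng(0)
dets = np.zeros((1, 1, 1000, 7), dtype=np.float32)
dets[0, 0, :, 3:7] = rng.random((1000, 4))
dets[0, 0, 0, 1:3] = [1, 0.9]    # image 0: one confident person
dets[0, 0, 500, 1:3] = [1, 0.2]  # image 1: person below the threshold
dets[0, 0, 501, 1:3] = [1, 0.5]  # image 1: person that is kept
out = split_detections(dets, [(1080, 1920), (1080, 1920)])
print([o.shape for o in out])
```

Each image is scaled by its own height and width, so the sketch also works when the images in a batch differ in size.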

History

  • 20181127: created.

code example

import os
import xml.etree.ElementTree as ET
import numpy as np
import cv2
import pickle
import copy
import yolo.config as cfg
import matplotlib.pyplot as plt

np.random.seed(1234) # for np.random.shuffle(gt_labels)

classes = ['aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus',
           'car', 'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse',
           'motorbike', 'person', 'pottedplant', 'sheep', 'sofa',
           'train', 'tvmonitor']


class pascal_voc(object):
    def __init__(self, phase, rebuild=False):
        self.devkil_path = os.path.join(cfg.PASCAL_PATH, 'VOCdevkit')
        self.data_path = os.path.join(self.devkil_path, 'VOC2007')
        self.cache_path = cfg.CACHE_PATH
        self.batch_size = cfg.BATCH_SIZE
        self.image_size = cfg.IMAGE_SIZE
        self.cell_size = cfg.CELL_SIZE
        self.classes = cfg.CLASSES
        self.class_to_ind = dict(zip(self.classes, range(len(self.classes))))
        self.flipped = False # cfg.FLIPPED
        self.phase = phase
        self.rebuild = rebuild
        self.cursor = 0
        self.epoch = 1
        self.gt_labels = None
        self.prepare()

    def get(self):
        images = np.zeros(
            (self.batch_size, self.image_size, self.image_size, 3))
        labels = np.zeros(
            (self.batch_size, self.cell_size, self.cell_size, 25))
        count = 0
        while count < self.batch_size:
            imname = self.gt_labels[self.cursor]['imname']
            flipped = self.gt_labels[self.cursor]['flipped']
            images[count, :, :, :] = self.image_read(imname, flipped)
            labels[count, :, :, :] = self.gt_labels[self.cursor]['label']
            count += 1
            self.cursor += 1
            if self.cursor >= len(self.gt_labels):
                np.random.shuffle(self.gt_labels)
                self.cursor = 0
                self.epoch += 1
        return images, labels

    def image_read(self, imname, flipped=False):
        image = cv2.imread(imname)
        image = cv2.resize(image, (self.image_size, self.image_size))
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB).astype(np.float32)
        image = (image / 255.0) * 2.0 - 1.0
        if flipped:
            image = image[:, ::-1, :]
        return image

    def prepare(self):
        gt_labels = self.load_labels()
        if self.flipped:
            print('Appending horizontally-flipped training examples ...')
            # keep y; flip x;
            gt_labels_cp = copy.deepcopy(gt_labels)
            for idx in range(len(gt_labels_cp)):
                gt_labels_cp[idx]['flipped'] = True
                gt_labels_cp[idx]['label'] = \
                    gt_labels_cp[idx]['label'][:, ::-1, :] # flip x grid index [0,1,2,3,4,5,6] ===>[6,5,4,3,2,1,0]
                for i in range(self.cell_size):
                    for j in range(self.cell_size):
                        if gt_labels_cp[idx]['label'][i, j, 0] == 1:
                            gt_labels_cp[idx]['label'][i, j, 1] = \
                                self.image_size - 1 - \
                                gt_labels_cp[idx]['label'][i, j, 1] # cx = 448 -1 - cx flipped cx
            gt_labels += gt_labels_cp
        np.random.shuffle(gt_labels) # shuffle labels
        self.gt_labels = gt_labels
        return gt_labels

    def load_labels(self):
        cache_file = os.path.join(
            self.cache_path, 'pascal_' + self.phase + '_gt_labels.pkl')

        if os.path.isfile(cache_file) and not self.rebuild:
            print('Loading gt_labels from: ' + cache_file)
            with open(cache_file, 'rb') as f:
                gt_labels = pickle.load(f)
            return gt_labels

        print('Processing gt_labels from: ' + self.data_path)

        if not os.path.exists(self.cache_path):
            os.makedirs(self.cache_path)

        if self.phase == 'train':
            txtname = os.path.join(
                self.data_path, 'ImageSets', 'Main', 'trainval.txt')
        else:
            txtname = os.path.join(
                self.data_path, 'ImageSets', 'Main', 'test.txt')
        with open(txtname, 'r') as f:
            self.image_index = [x.strip() for x in f.readlines()] # 5011 lines

        gt_labels = []
        for index in self.image_index:
            label, num = self.load_pascal_annotation(index)
            if num == 0:
                continue
            imname = os.path.join(self.data_path, 'JPEGImages', index + '.jpg')
            gt_labels.append({'imname': imname,
                              'label': label,
                              'flipped': False})
        print('Saving gt_labels to: ' + cache_file)
        with open(cache_file, 'wb') as f:
            pickle.dump(gt_labels, f)
        return gt_labels

    def load_pascal_annotation(self, index):
        """
        Load image and bounding boxes info from XML file in the PASCAL VOC
        format. 002939
        """

        imname = os.path.join(self.data_path, 'JPEGImages', index + '.jpg')
        im = cv2.imread(imname)
        h_ratio = 1.0 * self.image_size / im.shape[0]
        w_ratio = 1.0 * self.image_size / im.shape[1]
        # im = cv2.resize(im, [self.image_size, self.image_size])

        label = np.zeros((self.cell_size, self.cell_size, 25)) # 7,7,25
        filename = os.path.join(self.data_path, 'Annotations', index + '.xml')
        tree = ET.parse(filename)
        objs = tree.findall('object')

        for obj in objs:
            bbox = obj.find('bndbox') # xmin,ymin,xmax,ymax 1-based ===> 0-based
            # Make pixel indexes 0-based
            x1 = max(min((float(bbox.find('xmin').text) - 1) * w_ratio, self.image_size - 1), 0)
            y1 = max(min((float(bbox.find('ymin').text) - 1) * h_ratio, self.image_size - 1), 0)
            x2 = max(min((float(bbox.find('xmax').text) - 1) * w_ratio, self.image_size - 1), 0)
            y2 = max(min((float(bbox.find('ymax').text) - 1) * h_ratio, self.image_size - 1), 0)
            cls_ind = self.class_to_ind[obj.find('name').text.lower().strip()]
            boxes = [(x2 + x1) / 2.0, (y2 + y1) / 2.0, x2 - x1, y2 - y1] # cx,cy,w,h [0-447]
            x_ind = int(boxes[0] * self.cell_size / self.image_size) # grid x,y index [0-6]
            y_ind = int(boxes[1] * self.cell_size / self.image_size)
            if label[y_ind, x_ind, 0] == 1: # if multiple objects fall in same grid, we only use the first one
                continue
            label[y_ind, x_ind, 0] = 1 # has object 1 or 0
            label[y_ind, x_ind, 1:5] = boxes # boxs (cx,cy,w,h) [0-447]
            label[y_ind, x_ind, 5 + cls_ind] = 1 # class 20-one-hot-vector

        return label, len(objs)


"""
3 , 4 = [0. 0. 0. 0. 0.]
3 , 5 = [ 1. 325.248 229.6 111.104 228.48 ]
    class_one_hot = [0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
    class_index = 8
    class_name = chair
3 , 6 = [0. 0. 0. 0. 0.]
4 , 0 = [0. 0. 0. 0. 0.]
4 , 1 = [0. 0. 0. 0. 0.]
4 , 2 = [ 1. 132.16 288.4 172.928 316.96 ]
    class_one_hot = [0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
    class_index = 8
    class_name = chair
4 , 3 = [0. 0. 0. 0. 0.]


data['label'].shape # 7,7,25 (confidence+ (x,y,w,h) + 20-classes)

confidence: 1 if gt_box center falls in this grid, otherwise 0
box(x,y,w,h): gt_box center x,y,w,h; otherwise [0,0,0,0]
class: 20-one-hot-vector if gt_box; otherwise [0]*20

how flip works: flip x dim

(1) flip grid x-dim: data['label'] = data['label'][:, ::-1, :]

    y-grid = y-grid [0,1,2,3,4,5,6]
    x-grid flip [0,1,2,3,4,5,6] ===>[6,5,4,3,2,1,0]

(2) flip data['label']
    confidence = confidence
    cx: flip cx = 447-cx: data['label'][i, j, 1] = 448 - 1 - data['label'][i, j, 1]
    cy = cy
    w = w
    h = h
    class = class
"""


def print_data(data):
    # grid y,x
    for y in range(7):
        for x in range(7):
            print(y, ",", x, "= ", data['label'][y, x, :5])
            if data['label'][y, x, 0] > 0: # confidence >0
                class_one_hot = data['label'][y, x, 5:] # (20)
                class_index = np.argmax(class_one_hot)
                print(" class_one_hot = ", class_one_hot)
                print(" class_index = ", class_index)
                print(" class_name = ", classes[class_index])


def flip_data(data):
    data['flipped'] = True
    data['label'] = data['label'][:, ::-1, :]
    for y in range(7):
        for x in range(7):
            if data['label'][y, x, 0] == 1:
                data['label'][y, x, 1] = 448 - 1 - data['label'][y, x, 1] # cx = 448 -1 - cx flipped cx


def show_image(filename):
    image = cv2.imread(filename)
    # convert from BGR to RGB
    rgb_image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    plt.axis("off")
    plt.imshow(rgb_image)
    plt.show()


print("========================PASCAL=================================")
pascal = pascal_voc('train')
print(pascal.class_to_ind) # dict 20
print(len(pascal.gt_labels)) # list: default 5011; flipped 10022
data = pascal.gt_labels[0]
print(data.keys()) # dict_keys(['flipped', 'imname', 'label'])
print(data['imname'])
print(data['label'].shape) # 7,7,25 (confidence+ (x,y,w,h) + 20-classes)

print(classes)
show_image(data['imname'])

print("========================DATA=================================")
print_data(data)

print("=========================FLIPPED================================")
# flip data
flipped = copy.deepcopy(data)
flip_data(flipped)
print_data(flipped)

output

Loading gt_labels from: data\pascal_voc\cache\pascal_train_gt_labels.pkl
{'dog': 11, 'train': 18, 'bus': 5, 'motorbike': 13, 'aeroplane': 0, 'bicycle': 1, 'person': 14, 'horse': 12, 'bird': 2, 'tvmonitor': 19, 'sheep': 16, 'boat': 3, 'car': 6, 'diningtable': 10, 'pottedplant': 15, 'sofa': 17, 'bottle': 4, 'chair': 8, 'cat': 7, 'cow': 9}
5011
dict_keys(['flipped', 'label', 'imname'])
data\pascal_voc\VOCdevkit\VOC2007\JPEGImages\002939.jpg
(7, 7, 25)
['aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus', 'car', 'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse', 'motorbike', 'person', 'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor']

========================DATA=================================
0 , 0 =  [0. 0. 0. 0. 0.]
0 , 1 =  [0. 0. 0. 0. 0.]
0 , 2 =  [0. 0. 0. 0. 0.]
0 , 3 =  [0. 0. 0. 0. 0.]
0 , 4 =  [0. 0. 0. 0. 0.]
0 , 5 =  [0. 0. 0. 0. 0.]
0 , 6 =  [0. 0. 0. 0. 0.]
1 , 0 =  [0. 0. 0. 0. 0.]
1 , 1 =  [0. 0. 0. 0. 0.]
1 , 2 =  [0. 0. 0. 0. 0.]
1 , 3 =  [0. 0. 0. 0. 0.]
1 , 4 =  [0. 0. 0. 0. 0.]
1 , 5 =  [0. 0. 0. 0. 0.]
1 , 6 =  [0. 0. 0. 0. 0.]
2 , 0 =  [0. 0. 0. 0. 0.]
2 , 1 =  [0. 0. 0. 0. 0.]
2 , 2 =  [0. 0. 0. 0. 0.]
2 , 3 =  [0. 0. 0. 0. 0.]
2 , 4 =  [0. 0. 0. 0. 0.]
2 , 5 =  [0. 0. 0. 0. 0.]
2 , 6 =  [0. 0. 0. 0. 0.]
3 , 0 =  [0. 0. 0. 0. 0.]
3 , 1 =  [  1.          70.336      202.496       74.368      149.33333333]
    class_one_hot =  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
    class_index =  19
    class_name =  tvmonitor
3 , 2 =  [0. 0. 0. 0. 0.]
3 , 3 =  [0. 0. 0. 0. 0.]
3 , 4 =  [  1.         267.456      229.97333333  29.568       77.65333333]
    class_one_hot =  [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
    class_index =  4
    class_name =  bottle
3 , 5 =  [0. 0. 0. 0. 0.]
3 , 6 =  [0. 0. 0. 0. 0.]
4 , 0 =  [0. 0. 0. 0. 0.]
4 , 1 =  [0. 0. 0. 0. 0.]
4 , 2 =  [0. 0. 0. 0. 0.]
4 , 3 =  [  1.         220.864      283.136      158.592      327.33866667]
    class_one_hot =  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
    class_index =  14
    class_name =  person
4 , 4 =  [0. 0. 0. 0. 0.]
4 , 5 =  [0. 0. 0. 0. 0.]
4 , 6 =  [0. 0. 0. 0. 0.]
5 , 0 =  [0. 0. 0. 0. 0.]
5 , 1 =  [0. 0. 0. 0. 0.]
5 , 2 =  [0. 0. 0. 0. 0.]
5 , 3 =  [0. 0. 0. 0. 0.]
5 , 4 =  [  1.         283.584      337.49333333  92.288      185.17333333]
    class_one_hot =  [0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
    class_index =  8
    class_name =  chair
5 , 5 =  [0. 0. 0. 0. 0.]
5 , 6 =  [0. 0. 0. 0. 0.]
6 , 0 =  [0. 0. 0. 0. 0.]
6 , 1 =  [0. 0. 0. 0. 0.]
6 , 2 =  [0. 0. 0. 0. 0.]
6 , 3 =  [0. 0. 0. 0. 0.]
6 , 4 =  [0. 0. 0. 0. 0.]
6 , 5 =  [0. 0. 0. 0. 0.]
6 , 6 =  [0. 0. 0. 0. 0.]
=========================FLIPPED================================
0 , 0 =  [0. 0. 0. 0. 0.]
0 , 1 =  [0. 0. 0. 0. 0.]
0 , 2 =  [0. 0. 0. 0. 0.]
0 , 3 =  [0. 0. 0. 0. 0.]
0 , 4 =  [0. 0. 0. 0. 0.]
0 , 5 =  [0. 0. 0. 0. 0.]
0 , 6 =  [0. 0. 0. 0. 0.]
1 , 0 =  [0. 0. 0. 0. 0.]
1 , 1 =  [0. 0. 0. 0. 0.]
1 , 2 =  [0. 0. 0. 0. 0.]
1 , 3 =  [0. 0. 0. 0. 0.]
1 , 4 =  [0. 0. 0. 0. 0.]
1 , 5 =  [0. 0. 0. 0. 0.]
1 , 6 =  [0. 0. 0. 0. 0.]
2 , 0 =  [0. 0. 0. 0. 0.]
2 , 1 =  [0. 0. 0. 0. 0.]
2 , 2 =  [0. 0. 0. 0. 0.]
2 , 3 =  [0. 0. 0. 0. 0.]
2 , 4 =  [0. 0. 0. 0. 0.]
2 , 5 =  [0. 0. 0. 0. 0.]
2 , 6 =  [0. 0. 0. 0. 0.]
3 , 0 =  [0. 0. 0. 0. 0.]
3 , 1 =  [0. 0. 0. 0. 0.]
3 , 2 =  [  1.         179.544      229.97333333  29.568       77.65333333]
    class_one_hot =  [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
    class_index =  4
    class_name =  bottle
3 , 3 =  [0. 0. 0. 0. 0.]
3 , 4 =  [0. 0. 0. 0. 0.]
3 , 5 =  [  1.         376.664      202.496       74.368      149.33333333]
    class_one_hot =  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
    class_index =  19
    class_name =  tvmonitor
3 , 6 =  [0. 0. 0. 0. 0.]
4 , 0 =  [0. 0. 0. 0. 0.]
4 , 1 =  [0. 0. 0. 0. 0.]
4 , 2 =  [0. 0. 0. 0. 0.]
4 , 3 =  [  1.         226.136      283.136      158.592      327.33866667]
    class_one_hot =  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
    class_index =  14
    class_name =  person
4 , 4 =  [0. 0. 0. 0. 0.]
4 , 5 =  [0. 0. 0. 0. 0.]
4 , 6 =  [0. 0. 0. 0. 0.]
5 , 0 =  [0. 0. 0. 0. 0.]
5 , 1 =  [0. 0. 0. 0. 0.]
5 , 2 =  [  1.         163.416      337.49333333  92.288      185.17333333]
    class_one_hot =  [0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
    class_index =  8
    class_name =  chair
5 , 3 =  [0. 0. 0. 0. 0.]
5 , 4 =  [0. 0. 0. 0. 0.]
5 , 5 =  [0. 0. 0. 0. 0.]
5 , 6 =  [0. 0. 0. 0. 0.]
6 , 0 =  [0. 0. 0. 0. 0.]
6 , 1 =  [0. 0. 0. 0. 0.]
6 , 2 =  [0. 0. 0. 0. 0.]
6 , 3 =  [0. 0. 0. 0. 0.]
6 , 4 =  [0. 0. 0. 0. 0.]
6 , 5 =  [0. 0. 0. 0. 0.]
6 , 6 =  [0. 0. 0. 0. 0.]
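The two dumps above are consistent with a horizontal flip of the labels: each x-centre and its flipped value sum to 447 (for example 220.864 + 226.136), and the grid column is mirrored (column 4 becomes column 2). A minimal sketch of that mapping follows, assuming a 448-pixel-wide input and a 7x7 grid. Both values are inferred from the numbers in the log, and `flip_label` is a hypothetical helper, not the code that produced the output.

```python
IMAGE_WIDTH = 448  # assumption: inferred from 220.864 + 226.136 = 447 = width - 1
GRID_SIZE = 7      # the log lists cells 0..6 in both directions

def flip_label(row, col, box):
    """Mirror one grid label horizontally.

    box = [confidence, x_center, y_center, width, height] in pixels.
    Only the x-centre and the grid column change.
    """
    conf, x, y, w, h = box
    flipped_box = [conf, (IMAGE_WIDTH - 1) - x, y, w, h]
    return row, GRID_SIZE - 1 - col, flipped_box

# The "chair" label from the log: cell (5, 4) moves to cell (5, 2).
print(flip_label(5, 4, [1.0, 283.584, 337.49333333, 92.288, 185.17333333]))
```

The class one-hot vector is unaffected by the flip, which matches the log.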

Reference

History

  • 20181126: created.

Series

Guide

introduction

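In a heterogeneous computing architecture, the GPU and CPU are connected by the PCIe bus and work together. The CPU side is called the host and the GPU side is called the device, as shown in the figure below.
[figure: host and device]

A CPU+GPU heterogeneous platform lets the two complement each other: the CPU handles serial programs with complex logic, while the GPU handles data-intensive parallel computation, so each does what it is best at.
[figure: workflow]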

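CUDA programming model basics

  • host: CPU and its memory
  • device: GPU and its memory

A CUDA program contains both host code and device code, which run on the CPU and the GPU respectively. The host and the device can communicate, so data can be copied between them. A typical CUDA program runs as follows:

  1. Allocate host memory and initialize the data;
  2. Allocate device memory and copy the data from host to device;
  3. Call a CUDA kernel function to perform the computation on the device;
  4. Copy the results from the device back to the host;
  5. Free the memory allocated on the device and the host.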

kernel

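The kernel is a central concept in CUDA: it is a function executed in parallel by threads on the device. A kernel is declared with the __global__ qualifier, and the launch syntax <<<grid, block>>> specifies how many threads execute it. Every thread runs the same kernel function and is assigned a thread ID, which the kernel can read through the built-in variable threadIdx (unique within its block).

Because CUDA programming is heterogeneous, host code and device code must be distinguished. CUDA does this with function type qualifiers. The three main ones are:

  • __global__: runs on the device and is called from the host (some GPUs also allow calling it from the device). The return type must be void, variadic arguments are not supported, and it cannot be a class member function. Note that a __global__ kernel launch is asynchronous: the host does not wait for the kernel to finish before moving on.
  • __device__: runs on the device and can only be called from the device. It cannot be combined with __global__.
  • __host__: runs on the host and can only be called from the host. It is usually omitted. It cannot be combined with __global__, but it can be combined with __device__, in which case the function is compiled for both device and host.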

grid/block/thread

dim3 grid(3, 2);
dim3 block(5, 3);
kernel_fun<<<grid, block>>>(params...);

[figure: grid, block, thread]

The triple-angle-bracket syntax, such as <<<1, 1>>>, is called the execution configuration. It tells the CUDA runtime how many parallel threads to use for the launch on the GPU: in the snippet above, a 3x2 grid of blocks with 5x3 threads each.

builtin variables

  • threadIdx
  • blockIdx
  • blockDim
  • gridDim

For a 2-dim block (Dx, Dy), the thread (x, y) has ID x + y*Dx.
For a 3-dim block (Dx, Dy, Dz), the thread (x, y, z) has ID x + y*Dx + z*Dx*Dy.
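The ID formula can be checked with a few lines of host-side Python. This is only an illustration of the arithmetic; `flat_thread_id` is a hypothetical helper name and is not part of CUDA.

```python
def flat_thread_id(idx, dim):
    """Linearize a thread index inside one block.

    idx = (x, y, z) is the thread index, dim = (Dx, Dy, Dz) the block shape.
    For a 2-dim block pass z = 0 and Dz = 1.
    """
    x, y, z = idx
    dx, dy, _dz = dim
    return x + y * dx + z * dx * dy

# 2-dim block of 5 x 3 threads, thread (x=2, y=1): 2 + 1*5
print(flat_thread_id((2, 1, 0), (5, 3, 1)))  # 7
```

Because x varies fastest, threads with consecutive x values get consecutive IDs.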

matrix add

// kernel function
__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    // guard: the grid may contain more threads than matrix elements
    if (col < N && row < N)
        C[row][col] = A[row][col] + B[row][col];
}

int main()
{
    ...
    // kernel config: round up so the grid also covers N not divisible by 16
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks((N + threadsPerBlock.x - 1) / threadsPerBlock.x,
                   (N + threadsPerBlock.y - 1) / threadsPerBlock.y);
    // kernel call
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
    ...
}

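CUDA memory model

gpu memory

[figure: gpu memory]

logical/physical layer

[figure: logical and physical layers]

  • SP (Streaming Processor) is the most basic processing unit, also called a CUDA core.

  • SM stands for Streaming Multiprocessor.

  • The thread blocks of a kernel may be distributed across several SMs, so the grid is only a logical layer, while the SM is the physical layer of execution. The SM uses the SIMT (Single-Instruction, Multiple-Thread) architecture. Its basic execution unit is the warp, which contains 32 threads. The threads of a warp execute the same instruction at the same time, but each thread has its own instruction address counter and register state, and its own independent execution path.

  • Because the basic execution unit of an SM is a warp of 32 threads, the block size should normally be a multiple of 32.

  • Each thread is executed by an SP.

  • Each thread block is executed by an SM.

  • A kernel is executed by one grid, and a kernel can only run on one GPU at a time.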
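To see why the block size should be a multiple of 32, the sketch below counts how many warps a block occupies and how many lanes of the last warp stay idle. It is plain arithmetic on the warp size given above; `warps_and_idle_lanes` is a hypothetical helper.

```python
WARP_SIZE = 32  # threads per warp

def warps_and_idle_lanes(block_size):
    """Return (number of warps, idle lanes in the last warp) for one block."""
    warps = (block_size + WARP_SIZE - 1) // WARP_SIZE
    idle = warps * WARP_SIZE - block_size
    return warps, idle

print(warps_and_idle_lanes(512))  # (16, 0): every lane does useful work
print(warps_and_idle_lanes(100))  # (4, 28): the last warp is mostly idle
```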

Code

see cuda-demo

CMakeLists.txt

cmake_minimum_required (VERSION 2.8.7)

project (CudaExample)
enable_language(C)
enable_language(CXX)
set(CMAKE_CXX_STANDARD 11)

# Set the output folder where your program will be created
set(CMAKE_BINARY_DIR ${CMAKE_SOURCE_DIR}/bin)
set(EXECUTABLE_OUTPUT_PATH ${CMAKE_BINARY_DIR})
set(LIBRARY_OUTPUT_PATH ${CMAKE_BINARY_DIR})

find_package(CUDA REQUIRED) # user-defined

MESSAGE( [Main] " CUDA_LIBRARIES = ${CUDA_LIBRARIES}")
MESSAGE( [Main] " CUDA_INCLUDE_DIRS = ${CUDA_INCLUDE_DIRS}")

# The following folder will be included
include_directories(
${CUDA_INCLUDE_DIRS}
)

set(CUDA_NVCC_FLAGS "-g -G")
set(GENCODE -gencode=arch=compute_61,code=sm_61)

cuda_add_executable(demo src/demo.cu OPTIONS ${GENCODE})
target_link_libraries(demo ${CUDA_LIBRARIES})

#cuda_add_library(gpu SHARED ${CURRENT_HEADERS} ${CURRENT_SOURCES})
#cuda_add_library(gpu STATIC ${CURRENT_HEADERS} ${CURRENT_SOURCES})

vector add


#include <stdlib.h>
#include <math.h>    // fabs
#include <iostream>

#include <cuda_runtime.h>

using namespace std;

/*
https://blog.csdn.net/fb_help/article/details/79330815
foo.cuh + foo.cu
*/

// function to add the elements of two arrays
void add(int n, float *a, float *b, float *c)
{
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

__global__ void kernel_add(int n, float *a, float *b, float *c)
{
    // global thread id
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    // guard: the last block may extend past n
    if (i < n)
        c[i] = a[i] + b[i];
}

__global__ void kernel_add2(int n, float *a, float *b, float *c)
{
    // global thread id
    int index = blockDim.x * blockIdx.x + threadIdx.x;

    // grid-stride loop: the stride is the total number of threads in the grid
    int grid_stride = blockDim.x * gridDim.x; // 512 * 2048 in test_gpu_2
    // the grid covers all n elements here, so each thread loops only once
    for (int i = index; i < n; i += grid_stride)
    {
        c[i] = a[i] + b[i];
    }
}

void device_info()
{
    int deviceCount;
    cudaGetDeviceCount(&deviceCount);
    for (int i = 0; i < deviceCount; i++)
    {
        cudaDeviceProp devProp;
        cudaGetDeviceProperties(&devProp, i);
        std::cout << "GPU device " << i << ": " << devProp.name << std::endl;
        std::cout << "Total global memory: " << devProp.totalGlobalMem / 1024 / 1024 << " MB" << std::endl;
        std::cout << "Number of SMs: " << devProp.multiProcessorCount << std::endl;
        std::cout << "Max threads per SM: " << devProp.maxThreadsPerMultiProcessor << std::endl;
        std::cout << "Max warps per SM: " << devProp.maxThreadsPerMultiProcessor / 32 << std::endl;

        std::cout << "Shared memory per block: " << devProp.sharedMemPerBlock / 1024.0 << " KB" << std::endl;
        std::cout << "Max threads per block: " << devProp.maxThreadsPerBlock << std::endl;
        std::cout << "32-bit registers available per block: " << devProp.regsPerBlock << std::endl;
        std::cout << "======================================================" << std::endl;
    }
}

void test_cpu()
{
    float *A, *B, *C;
    int n = 1024 * 1024;
    int size = n * sizeof(float);

    // allocate host memory
    A = (float*)malloc(size);
    B = (float*)malloc(size);
    C = (float*)malloc(size);

    // initialize the arrays
    for (int i = 0; i < n; i++)
    {
        A[i] = 90.0;
        B[i] = 10.0;
    }

    // run the add on 1M elements on the CPU
    add(n, A, B, C);

    // check the result: every element should be 100.0
    float max_error = 0.0;
    for (int i = 0; i < n; i++)
    {
        max_error = fmax(max_error, fabs(100.0 - C[i]));
    }

    cout << "max error is " << max_error << endl;

    // free host memory
    free(A);
    free(B);
    free(C);
}

/*
cudaMalloc+cudaMemcpy+cudaFree
*/
int test_gpu_1()
{
    float *A, *Ad, *B, *Bd, *C, *Cd;
    int n = 1024 * 1024;
    int size = n * sizeof(float);

    // allocate host memory
    A = (float*)malloc(size);
    B = (float*)malloc(size);
    C = (float*)malloc(size);

    // initialize the arrays
    for (int i = 0; i < n; i++)
    {
        A[i] = 90.0;
        B[i] = 10.0;
    }

    // allocate device memory
    cudaMalloc((void**)&Ad, size);
    cudaMalloc((void**)&Bd, size);
    cudaMalloc((void**)&Cd, size);

    // copy the inputs from host to device (Cd is output only, no copy needed)
    cudaMemcpy(Ad, A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(Bd, B, size, cudaMemcpyHostToDevice);

    // 1-dim
    // execution configuration: (1024*1024/512) = 2048 blocks of 512 threads each
    int block_size = 512;
    int num_of_blocks = (n + block_size - 1) / block_size;
    dim3 dimBlock(block_size);
    dim3 dimGrid(num_of_blocks);

    // launch the kernel
    kernel_add<<<dimGrid, dimBlock>>>(n, Ad, Bd, Cd);

    // copy the result from device back to host
    // (cudaMemcpy waits for the kernel to finish first)
    cudaMemcpy(C, Cd, size, cudaMemcpyDeviceToHost);

    // check the result: every element should be 100.0
    float max_error = 0.0;
    for (int i = 0; i < n; i++)
    {
        max_error = fmax(max_error, fabs(100.0 - C[i]));
    }

    cout << "max error is " << max_error << endl;

    // free host and device memory
    free(A);
    free(B);
    free(C);
    cudaFree(Ad);
    cudaFree(Bd);
    cudaFree(Cd);
    return 0;
}

/*
cudaMallocManaged+cudaDeviceSynchronize+cudaFree
*/
void test_gpu_2()
{
    float *A, *B, *C;
    int n = 1024 * 1024;
    int size = n * sizeof(float);

    // allocate Unified Memory, accessible from CPU or GPU
    cudaMallocManaged((void**)&A, size);
    cudaMallocManaged((void**)&B, size);
    cudaMallocManaged((void**)&C, size);

    // initialize the arrays
    for (int i = 0; i < n; i++)
    {
        A[i] = 90.0;
        B[i] = 10.0;
    }

    // 1-dim
    // execution configuration: (1024*1024/512) = 2048 blocks of 512 threads each
    int block_size = 512;
    int num_of_blocks = (n + block_size - 1) / block_size;
    dim3 dimBlock(block_size);
    dim3 dimGrid(num_of_blocks);

    // launch the kernel
    kernel_add2<<<dimGrid, dimBlock>>>(n, A, B, C);

    // wait for the GPU to finish before accessing the result on the host
    cudaDeviceSynchronize(); // blocks until the GPU has finished all tasks

    // check the result: every element should be 100.0
    float max_error = 0.0;
    for (int i = 0; i < n; i++)
    {
        max_error = fmax(max_error, fabs(100.0 - C[i]));
    }

    cout << "max error is " << max_error << endl;

    // free Unified Memory
    cudaFree(A);
    cudaFree(B);
    cudaFree(C);
}

int main()
{
    device_info();
    test_cpu();
    test_gpu_1();
    test_gpu_2();
    return 0;
}

notes for block_size and num_of_blocks

int block_size = 512;
// round up so that a partially filled last block is still launched
int num_of_blocks = (n + block_size - 1) / block_size; // 2048 for n = 1024 * 1024
dim3 dimBlock(block_size);
dim3 dimGrid(num_of_blocks);
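The expression (n + block_size - 1) / block_size is an integer ceiling division. A small host-side check of the arithmetic, with `num_blocks` as a hypothetical helper name:

```python
def num_blocks(n, block_size):
    """Ceiling division: the smallest block count with blocks * block_size >= n."""
    return (n + block_size - 1) // block_size

print(num_blocks(1024 * 1024, 512))  # 2048
print(num_blocks(1000, 512))         # 2; plain 1000 // 512 gives 1 and would skip 488 elements
```

When n is not a multiple of the block size, the last block has threads with no element to process, which is why the kernel needs a bounds check or a grid-stride loop.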

notes for grid-stride loop

__global__ void kernel_add2(int n, float *a, float *b, float *c)
{
    // global thread id
    int index = blockDim.x * blockIdx.x + threadIdx.x;

    // grid-stride loop: the stride is the total number of threads in the grid
    int grid_stride = blockDim.x * gridDim.x; // 512 * 2048 in test_gpu_2
    // the grid covers all n elements here, so each thread loops only once
    for (int i = index; i < n; i += grid_stride)
    {
        c[i] = a[i] + b[i];
    }
}
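The effect of the grid-stride loop is easier to see when the grid is smaller than the data. The sketch below emulates the index arithmetic of kernel_add2 on the CPU and records which elements each thread would touch. `grid_stride_indices` is a hypothetical helper and the tiny sizes are chosen only for readability.

```python
def grid_stride_indices(n, block_dim, grid_dim):
    """Map each global thread id to the element indices it processes."""
    stride = block_dim * grid_dim  # total number of threads in the grid
    work = {}
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            index = block_dim * block_idx + thread_idx
            work[index] = list(range(index, n, stride))
    return work

# 10 elements, 2 blocks of 2 threads: 4 threads, stride 4
work = grid_stride_indices(n=10, block_dim=2, grid_dim=2)
print(work[0])  # [0, 4, 8]
print(work[3])  # [3, 7]
```

Every element is handled by exactly one thread, whatever the grid size, so the same kernel stays correct when fewer blocks are launched than the data would need.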

thread block and grid size

nvprof

nvprof.exe demo.exe 
==8748== Profiling application: .\demo.exe
==8748== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 43.63%  1.6413ms         3  547.10us  517.71us  591.41us  [CUDA memcpy HtoD]
 30.11%  1.1327ms         1  1.1327ms  1.1327ms  1.1327ms  [CUDA memcpy DtoH]
 26.26%  987.80us         2  493.90us  243.43us  744.37us  kernel_add(int, float*, float*, float*)

nvprof.exe is located at C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\bin\nvprof.exe

matrix multiply

  • for 1-dim vector add, we use 1-dim grid and block
  • for 2-dim matrix multiply, we use 2-dim grid and block.
// ========================================
// 2-dim
// ========================================
// matrix type, row-major: M(row, col) = *(M.elements + row * M.width + col)
struct Matrix
{
    int width;
    int height;
    float *elements;
};

// get element (row, col) of matrix A
__device__ float getElement(Matrix *A, int row, int col)
{
    return A->elements[row * A->width + col];
}

// set element (row, col) of matrix A
__device__ void setElement(Matrix *A, int row, int col, float value)
{
    A->elements[row * A->width + col] = value;
}

// matrix multiply kernel, 2-D, each thread computes one element of C
__global__ void matMulKernel(Matrix *A, Matrix *B, Matrix *C)
{
    float sum = 0.0;
    int row = threadIdx.y + blockIdx.y * blockDim.y;
    int col = threadIdx.x + blockIdx.x * blockDim.x;
    // guard: the grid may extend past the edges of C
    if (row >= C->height || col >= C->width)
        return;
    for (int i = 0; i < A->width; ++i)
    {
        sum += getElement(A, row, i) * getElement(B, i, col);
    }
    setElement(C, row, col, sum);
}

void test_gpu_3()
{
    int width = 1 << 8;
    int height = 1 << 8;

    Matrix *A, *B, *C;
    // allocate managed (unified) memory
    cudaMallocManaged((void**)&A, sizeof(Matrix));
    cudaMallocManaged((void**)&B, sizeof(Matrix));
    cudaMallocManaged((void**)&C, sizeof(Matrix));

    int nBytes = width * height * sizeof(float);
    cudaMallocManaged((void**)&A->elements, nBytes);
    cudaMallocManaged((void**)&B->elements, nBytes);
    cudaMallocManaged((void**)&C->elements, nBytes);

    // initialize the data
    A->height = height;
    A->width = width;
    B->height = height;
    B->width = width;
    C->height = height;
    C->width = width;
    for (int i = 0; i < width * height; ++i)
    {
        A->elements[i] = 1.0;
        B->elements[i] = 2.0;
    }

    // kernel execution configuration
    dim3 blockSize(32, 32);
    dim3 gridSize(
        (width + blockSize.x - 1) / blockSize.x,
        (height + blockSize.y - 1) / blockSize.y
    );
    // launch the kernel
    matMulKernel<<<gridSize, blockSize>>>(A, B, C);

    // synchronize the device so the result can be read safely on the host
    cudaDeviceSynchronize();

    // check the result: every element of C should be 2 * width
    float maxError = 0.0;
    for (int i = 0; i < width * height; ++i)
        maxError = fmax(maxError, fabs(C->elements[i] - 2 * width));
    cout << "max error is " << maxError << endl;

    // free managed memory
    cudaFree(A->elements);
    cudaFree(B->elements);
    cudaFree(C->elements);
    cudaFree(A);
    cudaFree(B);
    cudaFree(C);
}


int main()
{
    test_gpu_3();
    return 0;
}

notes for blockSize and gridSize

dim3 blockSize(32, 32);
dim3 gridSize(
    (width + blockSize.x - 1) / blockSize.x,
    (height + blockSize.y - 1) / blockSize.y
);
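The 2-dim launch can be emulated on the CPU to check both the grid-size rounding and the expected result (every element of C equals 2 * width when A is all 1.0 and B is all 2.0). This is a sketch for square n x n row-major matrices. `mat_mul_emulated` is a hypothetical helper, and the bounds check inside the loop is an assumption needed because n = 3 is not a multiple of the 2x2 block used here.

```python
def mat_mul_emulated(a, b, n, block=(2, 2)):
    """Emulate a 2-dim kernel launch computing C = A * B, one element per thread."""
    grid = ((n + block[0] - 1) // block[0],
            (n + block[1] - 1) // block[1])
    c = [0.0] * (n * n)
    for by in range(grid[1]):
        for bx in range(grid[0]):
            for ty in range(block[1]):
                for tx in range(block[0]):
                    row = ty + by * block[1]
                    col = tx + bx * block[0]
                    if row >= n or col >= n:
                        continue  # thread falls outside the matrix
                    total = 0.0
                    for i in range(n):
                        total += a[row * n + i] * b[i * n + col]
                    c[row * n + col] = total
    return c

n = 3
c = mat_mul_emulated([1.0] * (n * n), [2.0] * (n * n), n)
print(c[0], 2 * n)  # 6.0 6
```

With n = 3 and 2x2 blocks the grid is 2x2, so 16 threads are launched for 9 elements and 7 of them are skipped by the bounds check.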

Reference

History

  • 20181121: created.

Series

Guide

FP32/FP16/INT8 range

INT8 has significantly lower precision and dynamic range compared to FP32.
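A rough feel for the lost precision comes from a symmetric linear quantizer that maps floats to integers in [-127, 127]. The sketch below picks the scale from the largest absolute value. That choice is an assumption made for illustration: TensorRT derives its scales from calibration rather than from a plain maximum, and both function names are hypothetical.

```python
def quantize_int8(values, scale):
    """Symmetric linear quantization: q = clamp(round(x / scale), -127, 127)."""
    return [max(-127, min(127, round(v / scale))) for v in values]

def dequantize_int8(q_values, scale):
    """Map int8 codes back to approximate float values."""
    return [q * scale for q in q_values]

activations = [0.02, -1.5, 3.0, 12.7]
scale = max(abs(v) for v in activations) / 127  # max-abs scaling (assumption)
q = quantize_int8(activations, scale)
print(q)  # [0, -15, 30, 127]
print(dequantize_int8(q, scale))
```

The small activation 0.02 collapses to 0, while values beyond the chosen range would be clamped to 127. Balancing these two kinds of error is what calibration is for.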


High-throughput INT8 math

DP4A: a four-element int8 dot product accumulated into a 32-bit integer. Requires sm_61+ (Pascal TitanX, GTX 1080, Tesla P4, P40 and others).
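What the instruction computes can be written out in a few lines. This is a host-side emulation of the arithmetic only, not the CUDA intrinsic. `dp4a` here is a hypothetical Python function, and it does not model 32-bit overflow of the accumulator.

```python
def dp4a(a, b, acc=0):
    """Dot product of four int8 pairs, added to an integer accumulator."""
    a, b = list(a), list(b)
    assert len(a) == 4 and len(b) == 4
    assert all(-128 <= v <= 127 for v in a + b)  # operands must fit in int8
    return acc + sum(x * y for x, y in zip(a, b))

print(dp4a([1, 2, 3, 4], [5, 6, 7, 8]))      # 70
print(dp4a([1, 2, 3, 4], [5, 6, 7, 8], 10))  # 80
```

Four multiply-adds per instruction on packed 8-bit operands is where the INT8 throughput gain over FP32 comes from.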

Calibration Dataset

When preparing the calibration dataset, capture the expected distribution of data in typical inference scenarios. Make sure the calibration dataset covers all expected scenarios, for example clear weather, rainy days, and night scenes. If you are creating your own dataset, we recommend creating a separate calibration dataset. It shouldn’t overlap with the training, validation or test datasets, to avoid a situation where the calibrated model only works well on these datasets.
Note: the calibration set should be representative, preferably a subset of the val set.

result

caffe / tensorrt FP32 / tensorrt INT8


Code

fp32

FP32 is the default precision; no extra builder setting is needed.

fp16

  • cpp

    builder->setFp16Mode(true);
  • python

    builder.set_fp16_mode(True)

int8

  • cpp usage

    builder->setInt8Mode(true);
    builder->setInt8Calibrator(calibrator);
  • python usage

    import tensorrt as trt

    NUM_IMAGES_PER_BATCH = 5
    batchstream = ImageBatchStream(NUM_IMAGES_PER_BATCH, calibration_files)

    Int8_calibrator = trt.infer.EntropyCalibrator(["input_node_name"], batchstream)

    trt_builder = trt.infer.create_infer_builder(G_LOGGER)
    trt_builder.set_int8_mode(True)
    trt_builder.set_int8_calibrator(Int8_calibrator)

Int8 Calibrator

see 5.1.3.2. INT8 Calibration Using C++

Calibration can be slow, so the IInt8Calibrator interface provides methods for caching intermediate data. Using these methods effectively requires a more detailed understanding of calibration.

When building an INT8 engine, the builder performs the following steps:

  1. Builds a 32-bit engine, runs it on the calibration set, and records a histogram for each tensor of the distribution of activation values.
  2. Builds a calibration table from the histograms.
  3. Builds the INT8 engine from the calibration table and the network definition.

The calibration table can be cached. Caching is useful when building the same network multiple times, for example, on multiple platforms. It captures data derived from the network and the calibration set. The parameters are recorded in the table. If the network or calibration set changes, it is the application’s responsibility to invalidate the cache.

The cache is used as follows:

  1. If a calibration table is found, calibration is skipped; otherwise the calibration table is built from the histograms and parameters.
  2. The INT8 network is then built from the network definition and the calibration table.
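The cache logic above amounts to "read the table if it exists, otherwise calibrate and store it". A minimal sketch of that control flow follows, with the slow calibration pass replaced by a caller-supplied function. `get_calibration_table` and `run_calibration` are hypothetical names and do not belong to the TensorRT API.

```python
import os
import tempfile

def get_calibration_table(cache_path, run_calibration):
    """Return (table_bytes, calibrated).

    run_calibration is a stand-in for the slow histogram-based calibration.
    It is only called when no usable cache file exists.
    """
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            data = f.read()
        if data:
            return data, False  # cache hit: calibration is skipped
    table = run_calibration()
    with open(cache_path, "wb") as f:
        f.write(table)
    return table, True

cache_file = os.path.join(tempfile.mkdtemp(), "calib.table")
print(get_calibration_table(cache_file, lambda: b"dummy table"))  # calibrates
print(get_calibration_table(cache_file, lambda: b"dummy table"))  # reads the cache
```

As the text notes, invalidating the cache is the application's job: if the network or the calibration set changes, the file has to be deleted by hand.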

Cached data is passed as a pointer and length.
After you have implemented the calibrator, you can configure the builder to use it:

builder->setInt8Calibrator(calibrator);

The make_plan program must run on the target system in order for the TensorRT engine to be optimized correctly for that system. However, if an INT8 calibration cache was produced on the host, the cache may be re-used by the builder on the target when generating the engine (in other words, there is no need to do INT8 calibration on the target system itself).

In short: the INT8 calibration cache can be reused on another system, while the engine cannot.

demo

c++

cpp:

calibrator.h

#ifndef _CALIBRATOR_H_
#define _CALIBRATOR_H_

#include <cstdint>
#include <string>
#include <vector>

#include "NvInfer.h"
#include "ds_image.h"
#include "trt_utils.h"

class Int8EntropyCalibrator : public nvinfer1::IInt8EntropyCalibrator
{
public:
    Int8EntropyCalibrator(const uint& batchSize, const std::string& calibrationSetPath,
                          const std::string& calibTableFilePath, const uint64_t& inputSize,
                          const uint& inputH, const uint& inputW, const std::string& inputBlobName);
    virtual ~Int8EntropyCalibrator() { NV_CUDA_CHECK(cudaFree(m_DeviceInput)); }

    int getBatchSize() const override { return m_BatchSize; }
    bool getBatch(void* bindings[], const char* names[], int nbBindings) override;
    const void* readCalibrationCache(size_t& length) override;
    void writeCalibrationCache(const void* cache, size_t length) override;

private:
    const uint m_BatchSize;
    const uint m_InputH;
    const uint m_InputW;
    const uint64_t m_InputSize;
    const uint64_t m_InputCount;
    // points into the caller's string: inputBlobName must outlive the calibrator
    const char* m_InputBlobName;
    // set in the constructor (a std::string must not be initialized from nullptr)
    const std::string m_CalibTableFilePath;
    uint m_ImageIndex;
    bool m_ReadCache{true};
    void* m_DeviceInput{nullptr};
    std::vector<std::string> m_ImageList;
    std::vector<char> m_CalibrationCache;
};

#endif

calibrator.cpp


#include "calibrator.h"
#include <algorithm>
#include <cassert>
#include <cstdlib>
#include <cstring>
#include <fstream>
#include <iostream>
#include <iterator>

Int8EntropyCalibrator::Int8EntropyCalibrator(const uint& batchSize,
                                             const std::string& calibrationSetPath,
                                             const std::string& calibTableFilePath,
                                             const uint64_t& inputSize, const uint& inputH,
                                             const uint& inputW, const std::string& inputBlobName) :
    m_BatchSize(batchSize),
    m_InputH(inputH),
    m_InputW(inputW),
    m_InputSize(inputSize),
    m_InputCount(batchSize * inputSize),
    m_InputBlobName(inputBlobName.c_str()),
    m_CalibTableFilePath(calibTableFilePath),
    m_ImageIndex(0)
{
    m_ImageList = loadListFromTextFile(calibrationSetPath);
    // keep only whole batches, then shuffle the image order
    m_ImageList.resize(static_cast<int>(m_ImageList.size() / m_BatchSize) * m_BatchSize);
    std::random_shuffle(m_ImageList.begin(), m_ImageList.end(), [](int i) { return rand() % i; });
    NV_CUDA_CHECK(cudaMalloc(&m_DeviceInput, m_InputCount * sizeof(float)));
}

bool Int8EntropyCalibrator::getBatch(void* bindings[], const char* names[], int nbBindings)
{
    // stop when no full batch is left (">" rather than ">=" so the last batch is used too)
    if (m_ImageIndex + m_BatchSize > m_ImageList.size()) return false;

    // Load next batch
    std::vector<DsImage> dsImages(m_BatchSize);
    for (uint j = m_ImageIndex; j < m_ImageIndex + m_BatchSize; ++j)
    {
        dsImages.at(j - m_ImageIndex) = DsImage(m_ImageList.at(j), m_InputH, m_InputW);
    }
    m_ImageIndex += m_BatchSize;

    cv::Mat trtInput = blobFromDsImages(dsImages, m_InputH, m_InputW);

    NV_CUDA_CHECK(cudaMemcpy(m_DeviceInput, trtInput.ptr<float>(0), m_InputCount * sizeof(float),
                             cudaMemcpyHostToDevice));
    assert(!strcmp(names[0], m_InputBlobName));
    bindings[0] = m_DeviceInput;
    return true;
}

const void* Int8EntropyCalibrator::readCalibrationCache(size_t& length)
{
    void* output;
    m_CalibrationCache.clear();
    assert(!m_CalibTableFilePath.empty());
    std::ifstream input(m_CalibTableFilePath, std::ios::binary);
    input >> std::noskipws;
    if (m_ReadCache && input.good())
        std::copy(std::istream_iterator<char>(input), std::istream_iterator<char>(),
                  std::back_inserter(m_CalibrationCache));

    length = m_CalibrationCache.size();
    if (length)
    {
        std::cout << "Using cached calibration table to build the engine" << std::endl;
        output = &m_CalibrationCache[0];
    }
    else
    {
        std::cout << "New calibration table will be created to build the engine" << std::endl;
        output = nullptr;
    }

    return output;
}

void Int8EntropyCalibrator::writeCalibrationCache(const void* cache, size_t length)
{
    assert(!m_CalibTableFilePath.empty());
    std::ofstream output(m_CalibTableFilePath, std::ios::binary);
    output.write(reinterpret_cast<const char*>(cache), length);
    output.close();
}

c++ v2

class Int8CacheCalibrator : public IInt8EntropyCalibrator {
public:
    Int8CacheCalibrator(std::string cacheFile)
        : mCacheFile(cacheFile) {}
    virtual ~Int8CacheCalibrator() {}

    int getBatchSize() const override { return 1; }

    // no calibration data is supplied: the engine is built from the cache only
    bool getBatch(void* bindings[], const char* names[], int nbBindings) override {
        return false;
    }

    const void* readCalibrationCache(size_t& length) override
    {
        mCalibrationCache.clear();
        std::ifstream input(mCacheFile, std::ios::binary);
        input >> std::noskipws;
        if (input.good()) {
            std::copy(std::istream_iterator<char>(input),
                      std::istream_iterator<char>(),
                      std::back_inserter(mCalibrationCache));
        }
        length = mCalibrationCache.size();
        return length ? &mCalibrationCache[0] : nullptr;
    }

    // nothing to write: this calibrator only reads an existing cache
    void writeCalibrationCache(const void* cache, size_t length) override {}

private:
    std::string mCacheFile;
    std::vector<char> mCalibrationCache;
};

python

import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np
from PIL import Image
import ctypes
import tensorrt as trt

CHANNEL = 3
HEIGHT = 512
WIDTH = 1024

class PythonEntropyCalibrator(trt.infer.EntropyCalibrator):
    def __init__(self, input_layers, stream):
        trt.infer.EntropyCalibrator.__init__(self)
        self.input_layers = input_layers
        self.stream = stream
        # device buffer large enough for one calibration batch
        self.d_input = cuda.mem_alloc(self.stream.calibration_data.nbytes)
        stream.reset()

    def get_batch_size(self):
        return self.stream.batch_size

    def get_batch(self, bindings, names):
        batch = self.stream.next_batch()
        if not batch.size:
            return None  # no more data: calibration ends

        cuda.memcpy_htod(self.d_input, batch)
        for i in self.input_layers[0]:
            assert names[0] != i

        bindings[0] = int(self.d_input)
        return bindings

    def read_calibration_cache(self, length):
        return None  # always recalibrate

    def write_calibration_cache(self, ptr, size):
        # copy exactly `size` bytes; c_char_p(...).value would stop at the first NUL byte
        cache = ctypes.string_at(int(ptr), size)
        with open('calibration_cache.bin', 'wb') as f:
            f.write(cache)
        return None

Reference

History

  • 20181119: created.

Guide

install and config

download httpd-2.2.34-win64.zip
extract to c:/Apache2
edit conf/httpd.conf

ServerRoot "c:/Apache2"
Listen 80
ServerName 127.0.0.1
DocumentRoot "c:/Apache2/htdocs"

run cmd.exe as administrator and install apache2.2 service

cd c:/Apache2/bin
httpd.exe -k install -n apache2.2

tips: use sc delete apache2.2 to delete service first if error occurs and then install again.

install service

run bin/ApacheMonitor.exe to start apache2.2 service

apache2.2 service running

access from local

test localhost
works

access from remote

edit conf/httpd.conf

<Directory "c:/Apache2/htdocs">
    AllowOverride None
    Order allow,deny
    Allow from all
</Directory>

open Windows Defender Firewall and allow inbound and outbound packets for Apache HTTP Server
[figure: allow apache]

  • server host ip: 192.168.6.149
  • access host ip: 192.168.6.100
$ wget http://192.168.6.149/1.txt

Reference

History

  • 20181114: created.

Series

Guide

requirements

my system requirements

Tips:
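Install CUDA after VS 2015, otherwise errors occur (likely because the CUDA installer only adds its Visual Studio integration, such as CUDA 8.0.props, to Visual Studio versions that are already installed).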

commands

git clone https://github.com/AlexeyAB/darknet.git

config

  • extract opencv-3.3.0-vc14.exe to C:\opencv330\
  • find CUDA 8.0.props from C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\extras\visual_studio_integration\MSBuildExtensions\CUDA 8.0.props

vim build/darknet/darknet.vcxproj

  • replace C:\opencv_2.4.9\ with C:\opencv330\
  • replace CUDA 9.2.props with CUDA 8.0.props
  • replace CUDA 9.2.targets with CUDA 8.0.targets
  • replace compute_30,sm_30; with compute_61,sm_61;
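The four replacements can also be scripted instead of done by hand in an editor. The sketch below applies them as plain string substitutions. The function names are hypothetical, the file is assumed to be UTF-8 text, and it is worth keeping a backup of darknet.vcxproj before running it.

```python
REPLACEMENTS = [
    ("C:\\opencv_2.4.9\\", "C:\\opencv330\\"),
    ("CUDA 9.2.props", "CUDA 8.0.props"),
    ("CUDA 9.2.targets", "CUDA 8.0.targets"),
    ("compute_30,sm_30;", "compute_61,sm_61;"),
]

def patch_vcxproj_text(text):
    """Apply the replacements listed above to the project file contents."""
    for old, new in REPLACEMENTS:
        text = text.replace(old, new)
    return text

def patch_vcxproj_file(path="build/darknet/darknet.vcxproj"):
    """Rewrite the project file in place (UTF-8 assumed, line endings preserved)."""
    with open(path, encoding="utf-8", newline="") as f:
        text = f.read()
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(patch_vcxproj_text(text))

print(patch_vcxproj_text("CUDA 9.2.props"))  # CUDA 8.0.props
```

compute_61,sm_61 matches a Pascal card such as the GTX 1080; other GPUs need the pair for their own compute capability.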

compile

open build/darknet/darknet.sln with VS 2015

include path

C:\opencv330\opencv\build\include
..\..\3rdparty\include
$(CUDA_PATH)\include

library path

C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\lib\x64
$(CUDA_PATH)\lib\$(PlatformName)

linker input

    ..\..\3rdparty\lib\x64\pthreadVC2.lib;cublas.lib;curand.lib;cudart.lib;

build with x64 Release.

run

darknet.exe

./darknet.exe detector test data/coco.data yolov3.cfg yolov3.weights -i 0 -thresh 0.25 dog.jpg -ext_output

darknet.py

compile yolo_cpp_dll.sln and generate yolo_cpp_dll.dll for python usage.

python darknet.py

Code

yolo.py

for linux and windows.
see yolo.py

yolo_cpp_dll

System Path

C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\bin
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\libnvvp
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\jre\bin
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\jre\bin\plugin2

cuda80+cudnn6

  • cuda_8.0.61_win10.exe: cublas64_80.dll,curand64_80.dll,cudart64_80.dll
  • cudnn-8.0-windows10-x64-v6.0.zip: cudnn64_6.dll

cuda90+cudnn7

  • cuda_9.0.176_win10.exe: cublas64_90.dll,curand64_90.dll,cudart64_90.dll
  • cudnn-9.0-windows10-x64-v7.1.zip: cudnn64_7.dll

yolo dll

  • cuda80_yolo_cpp_dll: pthreadvc2.dll, cublas64_80.dll,curand64_80.dll,cudart64_80.dll,cudnn64_6.dll
  • cuda90_yolo_cpp_dll: pthreadvc2.dll, cublas64_90.dll,curand64_90.dll,cudart64_90.dll,cudnn64_7.dll

    Tips: use Dependency Walker to list dlls.

Reference

History

  • 20181101: created.