Kezunlin's Blog

linux command ref

Posted on 2018-11-30 Edited on 2024-10-14 In linux

Guide

ln

ln -s src dest

ln -s $(pwd)/yolo /home/kezunlin/program/darknet/kzl-yolo

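Tip: use an absolute path for the source when linking a folder. A relative source is resolved relative to the link's location, not the current directory.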
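A rough Python equivalent of the tip above, as a sketch only. `link_folder` is a hypothetical helper name, and the usage runs on throwaway temporary directories rather than the paths from the post. It resolves the source to an absolute path before creating the link, which is the same idea as prefixing the source with $(pwd).

```python
import os
import tempfile


def link_folder(src, dest):
    """Create a symlink at dest pointing to the absolute path of src."""
    abs_src = os.path.abspath(src)  # same idea as $(pwd)/yolo
    os.symlink(abs_src, dest, target_is_directory=True)
    return abs_src


# usage with throwaway directories (names are made up for the demo)
work = tempfile.mkdtemp()
src_dir = os.path.join(work, "yolo")
os.mkdir(src_dir)
link_path = os.path.join(work, "kzl-yolo")
target = link_folder(src_dir, link_path)
print(os.readlink(link_path))  # prints the absolute source path
```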

sed

case1

sed -i "s/old/new/g" 1.txt

Replace every occurrence of “old” with “new” in 1.txt, editing the file in place.

case2

sed -i "7,7s/neat_enable\: false/neat_enable\: true/g" _config.yml

Replace neat_enable: false with neat_enable: true, on line 7 only.
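The line-restricted substitution in case2 can be sketched in Python as follows. `replace_on_line` is a hypothetical helper, and it does a literal string replacement instead of a sed regular expression, which is enough for this config edit. The sample config text is made up.

```python
def replace_on_line(text, line_no, old, new):
    """Replace old with new only on the 1-based line line_no, like sed 'Ns/old/new/g'."""
    lines = text.split("\n")
    if 1 <= line_no <= len(lines):
        lines[line_no - 1] = lines[line_no - 1].replace(old, new)
    return "\n".join(lines)


# made-up three-line config; only line 2 should change
config = "title: blog\nneat_enable: false\nother: neat_enable: false"
result = replace_on_line(config, 2, "neat_enable: false", "neat_enable: true")
print(result)
```

Restricting the edit to one line avoids touching other lines that happen to contain the same text.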

case3

sed -i "s/http:\/\/kezunlin.me/https:\/\/kezunlin.me/g" posts/post1.md

Replace “http://kezunlin.me” with “https://kezunlin.me” in posts/post1.md.

grep

case1

grep UpsampleParameter . -r --include "*.h"

grep UpsampleParameter . -r --include "*.proto"
./src/caffe/proto/caffe.proto:  optional UpsampleParameter upsample_param = 150;
./src/caffe/proto/caffe.proto:message UpsampleParameter {

case2

grep -r "http://kezunlin.me" posts/ 
posts/post1.md:[here](http://kezunlin.me/post/book)
posts/post2.md:-[here](http://kezunlin.me/post/book)
posts/post2.md:- [img](http://kezunlin.me/post/book)

case3

grep -rl "http://kezunlin.me" posts 
posts/post1.md
posts/post2.md

case4

sed -i "s/http:\/\/kezunlin.me/https:\/\/kezunlin.me/g" `grep -rl "http://kezunlin.me" posts`

Replace http://kezunlin.me with https://kezunlin.me in every file under posts that contains it.
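The grep -rl plus sed -i combination above can be sketched in Python like this. `replace_in_tree` is a hypothetical helper, the files are assumed to be UTF-8 text, and the usage runs on a temporary directory with made-up file contents.

```python
import os
import tempfile


def replace_in_tree(root, old, new):
    """Rewrite every file under root that contains old; return the changed paths."""
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "r", encoding="utf-8") as f:
                content = f.read()
            if old in content:  # same filter as grep -rl
                with open(path, "w", encoding="utf-8") as f:
                    f.write(content.replace(old, new))
                changed.append(path)
    return sorted(changed)


# usage on a throwaway directory (file contents are made up)
posts = tempfile.mkdtemp()
with open(os.path.join(posts, "post1.md"), "w", encoding="utf-8") as f:
    f.write("[here](http://kezunlin.me/post/book)\n")
with open(os.path.join(posts, "post2.md"), "w", encoding="utf-8") as f:
    f.write("no link here\n")
changed = replace_in_tree(posts, "http://kezunlin.me", "https://kezunlin.me")
print(changed)
```

Only files that actually contain the old text are rewritten, so untouched files keep their modification time.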

case5

sed -i "s/comments\: false/comments\: true/g" _posts/*

Replace comments: false with comments: true in every file under _posts.

case6

grep --include=*.py -lr OKUtil .  | xargs 

# template: replace OLD with NEW in every file that contains OLD
grep -r -l OLD * | xargs sed -i 's/OLD/NEW/g'
# example: rename OKUtil to OkoooUtil
grep -r -l OKUtil * | xargs sed -i 's/OKUtil/OkoooUtil/g'

ldd

for so

ldd test_opencv | awk '/ => / { print $3 }'
(0x00007fff309d4000)
/usr/local/lib/libopencv_imgcodecs.so.3.1
/usr/local/lib/libopencv_core.so.3.1
/usr/lib/x86_64-linux-gnu/libstdc++.so.6
/lib/x86_64-linux-gnu/libgcc_s.so.1
/lib/x86_64-linux-gnu/libc.so.6
/usr/local/lib/libopencv_imgproc.so.3.1
/usr/local/lib/libjpeg.so.8
/lib/x86_64-linux-gnu/libpng12.so.0
/usr/lib/x86_64-linux-gnu/libtiff.so.5
/usr/lib/x86_64-linux-gnu/libjasper.so.1
/lib/x86_64-linux-gnu/libz.so.1
/lib/x86_64-linux-gnu/libm.so.6
/lib/x86_64-linux-gnu/libpthread.so.0
/lib/x86_64-linux-gnu/libdl.so.2
/lib/x86_64-linux-gnu/librt.so.1
/lib/x86_64-linux-gnu/liblzma.so.5
/usr/lib/x86_64-linux-gnu/libjbig.so.0

for not found

ldd test_opencv | awk 'NF==1 {file=$1} /not found/ { print file, $1 }'

useful

ldd example_opencv | awk  '{if (match($3,"/")){ printf("%s \n"),$3 } }'
/usr/local/lib/libopencv_imgcodecs.so.3.1 
/usr/local/lib/libopencv_core.so.3.1 
/usr/lib/x86_64-linux-gnu/libstdc++.so.6 
/lib/x86_64-linux-gnu/libgcc_s.so.1 
/lib/x86_64-linux-gnu/libc.so.6 
/usr/local/lib/libopencv_imgproc.so.3.1 
/usr/local/lib/libjpeg.so.8 
/lib/x86_64-linux-gnu/libpng12.so.0 
/usr/lib/x86_64-linux-gnu/libtiff.so.5 
/usr/lib/x86_64-linux-gnu/libjasper.so.1 
/lib/x86_64-linux-gnu/libz.so.1 
/lib/x86_64-linux-gnu/libm.so.6 
/lib/x86_64-linux-gnu/libpthread.so.0 
/lib/x86_64-linux-gnu/libdl.so.2 
/lib/x86_64-linux-gnu/librt.so.1 
/lib/x86_64-linux-gnu/liblzma.so.5 
/usr/lib/x86_64-linux-gnu/libjbig.so.0
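The same filtering can be done in Python on captured ldd output. This is a sketch: `parse_ldd` is a hypothetical helper, and the sample text below is hand-written in the usual ldd layout (the missing library name libfoo.so.1 is made up), not real output from test_opencv.

```python
def parse_ldd(output):
    """Split ldd output into resolved library paths and names reported as not found."""
    resolved, missing = [], []
    for line in output.splitlines():
        if "=>" not in line:
            continue
        name, _, rest = line.partition("=>")
        rest = rest.strip()
        if rest.startswith("not found"):
            missing.append(name.strip())
        elif rest.startswith("/"):
            resolved.append(rest.split()[0])  # drop the trailing (0x...) address
    return resolved, missing


# hand-written sample in the usual ldd layout; libfoo.so.1 is a made-up name
sample = (
    "\tlinux-vdso.so.1 =>  (0x00007fff309d4000)\n"
    "\tlibopencv_core.so.3.1 => /usr/local/lib/libopencv_core.so.3.1 (0x00007f0000000000)\n"
    "\tlibfoo.so.1 => not found\n"
    "\t/lib64/ld-linux-x86-64.so.2 (0x00007f0000001000)\n"
)
resolved, missing = parse_ldd(sample)
print(resolved, missing)
```

Like the match($3,"/") filter above, this skips the vdso line, which has an address but no path.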

curl

curl -X POST \
    -H "Accept: application/json" -H "Content-type: application/json" \
    --data '{"name":"admin","password":"21232f297a57a5a743894a0e4a801fc3"}' \
    http://192.168.0.12:8888/api/login

md5(admin) = 21232f297a57a5a743894a0e4a801fc3
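The password hash can be reproduced with Python's hashlib. This sketch only builds the JSON body used in the login request and does not send anything; the field names are taken from the curl command above.

```python
import hashlib
import json

# md5 of the literal string "admin", hex encoded
password_hash = hashlib.md5("admin".encode("utf-8")).hexdigest()
body = json.dumps({"name": "admin", "password": password_hash})
print(password_hash)  # 21232f297a57a5a743894a0e4a801fc3
```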

output

{"rtn":0,"message":"OK","session_id":"<session_id>"}

query results with session_id

curl -X GET \
    -H "session_id: <session_id>" \
    http://192.168.0.12:8888/api/book

download file

curl -LSso ~/.vim/autoload/pathogen.vim https://tpo.pe/pathogen.vim

Tips

check cpu

cat /proc/cpuinfo | grep name | cut -f2 -d: | uniq -c 
        8  Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz


cat /proc/cpuinfo | grep physical | uniq -c 
        1 physical id	: 0
        1 address sizes	: 39 bits physical, 48 bits virtual
        1 physical id	: 0
        1 address sizes	: 39 bits physical, 48 bits virtual
        1 physical id	: 0
        1 address sizes	: 39 bits physical, 48 bits virtual
        1 physical id	: 0
        1 address sizes	: 39 bits physical, 48 bits virtual
        1 physical id	: 0
        1 address sizes	: 39 bits physical, 48 bits virtual
        1 physical id	: 0
        1 address sizes	: 39 bits physical, 48 bits virtual
        1 physical id	: 0
        1 address sizes	: 39 bits physical, 48 bits virtual
        1 physical id	: 0
        1 address sizes	: 39 bits physical, 48 bits virtual
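Note that grep physical also matches the "address sizes" lines, which is why the second command prints alternating rows. A small Python sketch that matches on the exact keys instead: `summarize_cpuinfo` is a hypothetical helper, and the sample text is two logical CPUs' worth of lines shaped like /proc/cpuinfo, reusing the values shown above.

```python
from collections import Counter


def summarize_cpuinfo(text):
    """Count logical CPUs per model name and collect distinct physical ids."""
    models = Counter()
    physical_ids = set()
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "model name":
            models[value] += 1
        elif key == "physical id":
            physical_ids.add(value)
    return models, physical_ids


# sample lines shaped like /proc/cpuinfo (two logical CPUs)
sample = (
    "model name\t: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz\n"
    "physical id\t: 0\n"
    "address sizes\t: 39 bits physical, 48 bits virtual\n"
    "model name\t: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz\n"
    "physical id\t: 0\n"
    "address sizes\t: 39 bits physical, 48 bits virtual\n"
)
models, physical_ids = summarize_cpuinfo(sample)
print(models, physical_ids)
```

On a real machine the text would come from reading /proc/cpuinfo.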

check gpu

lspci | grep -i vga
    00:02.0 VGA compatible controller: Intel Corporation Device 591b (rev 04)
    01:00.0 VGA compatible controller: NVIDIA Corporation Device 1c20 (rev a1)


lspci -v -s 01:00.0
01:00.0 VGA compatible controller: NVIDIA Corporation Device 1c20 (rev a1) (prog-if 00 [VGA controller])
    Subsystem: CLEVO/KAPOK Computer Device 65a1
    Flags: bus master, fast devsel, latency 0, IRQ 130
    Memory at db000000 (32-bit, non-prefetchable) [size=16M]
    Memory at 90000000 (64-bit, prefetchable) [size=256M]
    Memory at a0000000 (64-bit, prefetchable) [size=32M]
    I/O ports at e000 [size=128]
    [virtual] Expansion ROM at dc000000 [disabled] [size=512K]
    Capabilities: <access denied>
    Kernel driver in use: nvidia
    Kernel modules: nvidiafb, nouveau, nvidia_396, nvidia_396_drm


nvidia-smi
Tue Feb 12 10:09:14 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.54                 Driver Version: 396.54                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1060    Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   56C    P8     8W /  N/A |    601MiB /  6078MiB |      3%      Default |
+-------------------------------+----------------------+----------------------+


watch -n 1 nvidia-smi

pip install gpustat
watch --color -n1 gpustat -cpu

multiple terminal

gnome-terminal --working-directory=/home --tab --tab --tab

stop lightdm

sudo apt-get install xserver-xorg

# stop desktop
sudo /etc/init.d/lightdm stop

sudo fbterm

changedir no effect

changedir.sh

#!/bin/bash
cd /home/firefox  
pwd

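When the script is run as ./changedir.sh, the terminal starts a subshell to execute it. The directory change happens inside that subshell, and as soon as the script finishes the subshell exits and all of its variables and state are discarded. Back in the terminal, none of the change is visible.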

solution

source ./changedir.sh
# or
. changedir.sh

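Running it with source changedir.sh executes the script directly in the terminal's own shell. No subshell is created, so the current terminal's working directory changes.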
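The same effect can be seen from Python, as a small sketch: a child process changes its own working directory, and the parent's directory stays where it was. The child here is another Python interpreter standing in for the shell script.

```python
import os
import subprocess
import sys

before = os.getcwd()
# the child changes its own working directory and prints it, like ./changedir.sh
child = subprocess.run(
    [sys.executable, "-c", "import os; os.chdir(os.sep); print(os.getcwd())"],
    capture_output=True,
    text=True,
    check=True,
)
child_cwd = child.stdout.strip()
after = os.getcwd()  # the parent is unaffected by the child's chdir
print(child_cwd, before == after)
```

A working directory belongs to a process, so only code running inside the current shell (source or .) can change it.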

kill process

lsof -i:4000
netstat -tunlp | grep 4000
kill -9 1234
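For reference, kill -9 sends SIGKILL. A minimal POSIX-only Python sketch that starts a throwaway child process and kills it the same way (the child is a sleeping Python interpreter, not the process from the commands above):

```python
import os
import signal
import subprocess
import sys

# start a child that would otherwise sleep for a minute
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
os.kill(child.pid, signal.SIGKILL)  # same signal as kill -9 <pid>
return_code = child.wait(timeout=10)
print(child.pid, return_code)
```

On POSIX, a negative return code means the process was terminated by that signal number.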

Reference

History

20181130: created.

compile and install dlib on ubuntu 16.04

Posted on 2018-11-27 Edited on 2024-10-14 In cpp

Series

Guide

compile

git clone https://github.com/davisking/dlib.git
cd dlib && mkdir build && cd build 
cmake-gui ..

with options

CMAKE_INSTALL_PREFIX /usr/local
CUDA 9.2 + cuDNN 7.1.4

generate

Found CUDA: /usr/local/cuda (found suitable version "9.2", minimum required is "7.5") 
Looking for cuDNN install...
Found cuDNN: /usr/local/cuda/lib64/libcudnn.so
Building a CUDA test project to see if your compiler is compatible with CUDA...
Checking if you have the right version of cuDNN installed.
Enabling CUDA support for dlib.  DLIB WILL USE CUDA
C++11 activated.

make and install

make -j8
sudo make install

output

[100%] Linking CXX static library libdlib.a
[100%] Built target dlib

generate static library libdlib.a

CMakeLists.txt

find_package(dlib REQUIRED)

if(MSVC)
	set(dlib_LIBRARIES "C:/Program Files/dlib/lib/dlib.lib") # replace dlib::dlib
else()
endif(MSVC)
# ${dlib_INCLUDE_DIRS} and ${dlib_LIBRARIES} are deprecated, simply use target_link_libraries(your_app dlib::dlib)
MESSAGE( [Main] " dlib_INCLUDE_DIRS = ${dlib_INCLUDE_DIRS}") 
MESSAGE( [Main] " dlib_LIBRARIES = ${dlib_LIBRARIES}")   


add_executable(demo demo.cpp)
#target_link_libraries(demo ${dlib_LIBRARIES})
target_link_libraries(demo dlib::dlib)

Reference

Part 1: compile dlib on windows 10

History

20181127: created.

Interfacing C++ and Python with pybind11 on ubuntu 16.04

Posted on 2018-11-27 Edited on 2024-10-14 In cpp

Series

Guide

requirements:

pybind11 v2.3.dev0
python 3.5

install pytest

1	pip3 install pytest

compile

git clone https://github.com/pybind/pybind11.git
cd pybind11
mkdir build
cd build
cmake-gui ..

with options

PYBIND11_CPP_STANDARD -std=c++11 # default c++14
PYTHON_EXECUTABLE /usr/bin/python3.5
CMAKE_INSTALL_PREFIX /usr/local

install

make and install

make -j8
sudo make install

This installs only headers to /usr/local/include/pybind11 (pybind11 is header-only) and the CMake config files to /usr/local/share/cmake/pybind11.

output

Install the project...
-- Install configuration: "MinSizeRel"
-- Installing: /usr/local/include/pybind11
-- Installing: /usr/local/include/pybind11/chrono.h
-- Installing: /usr/local/include/pybind11/eigen.h
-- Installing: /usr/local/include/pybind11/stl.h
-- Installing: /usr/local/include/pybind11/complex.h
-- Installing: /usr/local/include/pybind11/detail
-- Installing: /usr/local/include/pybind11/detail/internals.h
-- Installing: /usr/local/include/pybind11/detail/common.h
-- Installing: /usr/local/include/pybind11/detail/descr.h
-- Installing: /usr/local/include/pybind11/detail/init.h
-- Installing: /usr/local/include/pybind11/detail/class.h
-- Installing: /usr/local/include/pybind11/detail/typeid.h
-- Installing: /usr/local/include/pybind11/common.h
-- Installing: /usr/local/include/pybind11/iostream.h
-- Installing: /usr/local/include/pybind11/buffer_info.h
-- Installing: /usr/local/include/pybind11/attr.h
-- Installing: /usr/local/include/pybind11/numpy.h
-- Installing: /usr/local/include/pybind11/pybind11.h
-- Installing: /usr/local/include/pybind11/operators.h
-- Installing: /usr/local/include/pybind11/options.h
-- Installing: /usr/local/include/pybind11/cast.h
-- Installing: /usr/local/include/pybind11/eval.h
-- Installing: /usr/local/include/pybind11/embed.h
-- Installing: /usr/local/include/pybind11/pytypes.h
-- Installing: /usr/local/include/pybind11/functional.h
-- Installing: /usr/local/include/pybind11/stl_bind.h
-- Installing: /usr/local/share/cmake/pybind11/pybind11Config.cmake
-- Installing: /usr/local/share/cmake/pybind11/pybind11ConfigVersion.cmake
-- Installing: /usr/local/share/cmake/pybind11/FindPythonLibsNew.cmake
-- Installing: /usr/local/share/cmake/pybind11/pybind11Tools.cmake
-- Installing: /usr/local/share/cmake/pybind11/pybind11Targets.cmake

Usage

pybind11

CMakeLists.txt

find_package(pybind11 CONFIG REQUIRED)
include_directories(${pybind11_INCLUDE_DIRS})

MESSAGE( [MAIN] "Found pybind11 v${pybind11_VERSION}: ${pybind11_INCLUDE_DIRS}")
MESSAGE( [Main] " pybind11_INCLUDE_DIRS = ${pybind11_INCLUDE_DIRS}")
MESSAGE( [Main] " pybind11_LIBRARIES = ${pybind11_LIBRARIES}")

add_library(examplelib 
	${HEADER_FILES}
	${SOURCE_FILES}
)

target_link_libraries (examplelib  
	pybind11::module
	${xxx_LIBRARIES}
)

embed

CMakeLists.txt

find_package(pybind11 CONFIG REQUIRED)
include_directories(${pybind11_INCLUDE_DIRS})

MESSAGE( [MAIN] "Found pybind11 v${pybind11_VERSION}: ${pybind11_INCLUDE_DIRS}")
MESSAGE( [Main] " pybind11_INCLUDE_DIRS = ${pybind11_INCLUDE_DIRS}")
MESSAGE( [Main] " pybind11_LIBRARIES = ${pybind11_LIBRARIES}")

add_executable(cpp_use_python cpp_use_python.cpp)
target_link_libraries(cpp_use_python PRIVATE pybind11::embed)

Reference

Part 1: Interfacing C++ and Python with pybind11 on windows 10

History

20181127: created.

compile and install refinedet on ubuntu 16.04

Posted on 2018-11-27 Edited on 2024-10-14 In deep learning

Guide

RefineDet is based on Caffe.

See Install and Configure Caffe on ubuntu 16.04

ubuntu 16.04
CUDA 9.2 + cudnn 7.1.4 (for caffe/tensorrt/anakin)
opencv 3.3.0
python 2.7
caffe (from refinedet)

compile

git clone https://github.com/sfzhang15/RefineDet.git
cd RefineDet 
mkdir build && cd build && cmake-gui ..

make -j8 && make pycaffe

options

USE_CUDNN True
USE_OPENCV True
WITH_PYTHON_LAYER True
BLAS atlas

CMAKE_INSTALL_PREFIX /home/kezunlin/program/refinedet/build/install

tips: vim CMakeLists.txt and comment out examples and docs

#add_subdirectory(examples)
add_subdirectory(python)
add_subdirectory(matlab)
#add_subdirectory(docs)

fix gflags error

caffe/include/caffe/common.hpp
caffe/examples/mnist/convert_mnist_data.cpp

Comment out the ifndef

// #ifndef GFLAGS_GFLAGS_H_
namespace gflags = google;
// #endif  // GFLAGS_GFLAGS_H_

example

Two versions:

single version
batch version

'''
In this example, we will load a RefineDet model and use it to detect objects.
'''
import argparse
import os
import sys
import numpy as np
import skimage.io as io
import cv2
# Make sure that caffe is on the python path:
caffe_root = './'
os.chdir(caffe_root)
sys.path.insert(0, os.path.join(caffe_root, 'python'))
import caffe

classes = ['background', 'person']

def filter_boxs(boxs, threshold=0.4):
    """
    boxs: 500*6  (xmin,ymin,xmax,ymax,confidence,class_index) 
                class_index: 0 background, 1 person
                confidence: 0-1
    return: 
            new_boxs  `list`  [b1,b2,b3,...]
    """
    new_boxs = []
    for i in range(0, boxs.shape[0]):
        xmin,ymin,xmax,ymax,confidence,class_index = boxs[i]
        #print(type(class_index)) # float32
        if int(class_index)>0 and confidence >= threshold:
            box = [int(xmin),int(ymin),int(xmax),int(ymax),confidence, int(class_index)]
            new_boxs.append(box)
    return new_boxs # list [b1,b2,b3,...]

def save_results(counter, image_file, boxs, save_fig=False):

    img = cv2.imread(image_file)
    for i in range(0, len(boxs)):
        xmin,ymin,xmax,ymax,confidence,class_index = boxs[i]

        name = classes[class_index]
        coords = (xmin, ymin), xmax - xmin, ymax - ymin

        cv2.rectangle(img, (xmin, ymin), (xmax, ymax), (0, 0, 255), 3) # bgr
        #display_text = '%s: %.2f' % (name, confidence)
        display_text = '%.2f' % (confidence)
        cv2.putText(img, display_text, (xmin, ymin-5), cv2.FONT_HERSHEY_SIMPLEX, 1, color=(0,0,255), thickness=2)
            
    if save_fig:
        image_filepath = 'output/{0}_results.jpg'.format(counter)
        cv2.imwrite(image_filepath, img)
        print('Saved: ' + image_filepath)


def single():
    caffe.set_device(0)
    caffe.set_mode_gpu()

    save_dir = "./output"
    if not os.path.exists(save_dir):
        os.mkdir(save_dir)

    # load model
    model_def = 'models/ResNet/coco/refinedet_resnet101_512x512/deploy.prototxt'
    model_weights = 'models/ResNet/coco/refinedet_resnet101_512x512/coco_refinedet_resnet101_512x512_iter_75000.caffemodel'
    net = caffe.Net(model_def, model_weights, caffe.TEST)

    # image preprocessing
    img_resize = 512
    net.blobs['data'].reshape(1, 3, img_resize, img_resize)
    data_shape = net.blobs['data'].data.shape
    print("data_shape=", data_shape) # 1, 3, 512, 512
    # by default, caffe use chw, bgr, 0-255, image-[104, 117, 123] 
    transformer = caffe.io.Transformer({'data':data_shape})
    transformer.set_transpose('data', (2, 0, 1)) # hwc ===> chw
    transformer.set_channel_swap('data', (2, 1, 0))  # rgb===>bgr
    transformer.set_raw_scale('data', 255)  # [0-1]===> [0,255]
    transformer.set_mean('data', np.array([104, 117, 123]))  # bgr mean pixel
    
    files = ["./images/1.png", "./images/2.png"]# 500,7  + 384,7 === 500,7 + 500,7 
    for index,image_file in enumerate(files):
        print("image_file=", image_file)
        image = caffe.io.load_image(image_file) # hwc, rgb, 0-1 
        print("image.shape=", image.shape)

        transformed_image = transformer.preprocess('data', image)
        print("transformed_image.shape=", transformed_image.shape)

        net.blobs['data'].data[...] = transformed_image

        detections = net.forward()['detection_out']
        print("detections.shape = ",detections.shape) # 1, 1, 500, 7
        det_label = detections[0, 0, :, 1] # 0 back, 1 -person (now only ==1)
        det_conf = detections[0, 0, :, 2] # 0-1 
        det_xmin = detections[0, 0, :, 3] * image.shape[1]
        det_ymin = detections[0, 0, :, 4] * image.shape[0]
        det_xmax = detections[0, 0, :, 5] * image.shape[1]
        det_ymax = detections[0, 0, :, 6] * image.shape[0]
        boxs = np.column_stack([det_xmin, det_ymin, det_xmax, det_ymax, det_conf, det_label])
        print("boxs = ", boxs.shape) # 500,6

        new_boxs = filter_boxs(boxs)
        print("new_boxs = ", len(new_boxs)) # 3 boxs

        # show result
        save_results(index, image_file, new_boxs, save_fig=True)

def batch():
    caffe.set_device(0)
    caffe.set_mode_gpu()

    save_dir = "./output"
    if not os.path.exists(save_dir):
        os.mkdir(save_dir)

    # load model
    model_def = 'models/ResNet/coco/refinedet_resnet101_512x512/deploy.prototxt'
    model_weights = 'models/ResNet/coco/refinedet_resnet101_512x512/coco_refinedet_resnet101_512x512_iter_75000.caffemodel'
    net = caffe.Net(model_def, model_weights, caffe.TEST)

    box_count_per_image = 500 
    #files = ["./images/2.png"]
    files = ["./images/1.png", "./images/2.png"]# 500,7  + 384,7 === 500,7 + 500,7 
    # update detection_output_layer.cpp and cu to keep 500 box results
    batch_size = len(files) 
    # image preprocessing
    img_resize = 512
    net.blobs['data'].reshape(batch_size, 3, img_resize, img_resize)
    data_shape = net.blobs['data'].data.shape
    print("data_shape=", data_shape) # 1, 3, 512, 512
    # by default, caffe use chw, bgr, 0-255, image-[104, 117, 123] 
    transformer = caffe.io.Transformer({'data':data_shape})
    transformer.set_transpose('data', (2, 0, 1)) # hwc ===> chw
    transformer.set_channel_swap('data', (2, 1, 0))  # rgb===>bgr
    transformer.set_raw_scale('data', 255)  # [0-1]===> [0,255]
    transformer.set_mean('data', np.array([104, 117, 123]))  # bgr mean pixel
    
    for i in range(len(files)):
        #image_file = "./images/1.png"
        image_file = files[i]
        print("image_file=", image_file)
        image = caffe.io.load_image(image_file) # hwc, rgb, 0-1 
        print("image.shape=", image.shape)

        transformed_image = transformer.preprocess('data', image)
        print("transformed_image.shape=", transformed_image.shape)

        net.blobs['data'].data[i,:,:,:] = transformed_image

    detections = net.forward()['detection_out']
    print("detections.shape = ",detections.shape) # 1, 1, 500+384, 7 ===> 1,1, 1000,7

    for i in range(batch_size):
        start = i * box_count_per_image
        end = (i+1) * box_count_per_image
        print("start-end: ",start, end)

        # use this image's own file and size; `image` and `image_file` left over
        # from the loading loop always refer to the last file in the batch
        image_file = files[i]
        height, width = cv2.imread(image_file).shape[:2]
    
        det_label = detections[0, 0, start:end, 1] # 0 back, 1 -person (now only ==1)
        print(det_label[:10])
        det_conf = detections[0, 0, start:end, 2] # 0-1 
        det_xmin = detections[0, 0, start:end, 3] * width
        det_ymin = detections[0, 0, start:end, 4] * height
        det_xmax = detections[0, 0, start:end, 5] * width
        det_ymax = detections[0, 0, start:end, 6] * height
        boxs = np.column_stack([det_xmin, det_ymin, det_xmax, det_ymax, det_conf, det_label])
        print("boxs = ", boxs.shape) # 500,6

        new_boxs = filter_boxs(boxs)
        print("new_boxs = ", len(new_boxs)) # 3 boxs

        # show result
        save_results(i, image_file, new_boxs, save_fig=True)

if __name__ == '__main__':
   #single()
   batch()
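To make the Transformer settings in the script concrete, here is a NumPy-only sketch of the same steps for an image that is already at network size (resizing is left out). `preprocess` is a hypothetical helper, not part of caffe, and it assumes the steps run in the order configured above: transpose, channel swap, raw scale, mean subtraction.

```python
import numpy as np


def preprocess(image_rgb01, mean_bgr=(104, 117, 123)):
    """Mimic the Transformer settings for an image already at network size.

    image_rgb01: H x W x 3 float array, RGB, values in [0, 1].
    Returns a 3 x H x W float32 array, BGR, values in [0, 255] minus the mean.
    """
    chw = image_rgb01.transpose(2, 0, 1)        # hwc -> chw
    bgr = chw[[2, 1, 0], :, :]                  # rgb -> bgr
    scaled = bgr * 255.0                        # [0, 1] -> [0, 255]
    mean = np.asarray(mean_bgr, dtype=np.float32).reshape(3, 1, 1)
    return (scaled - mean).astype(np.float32)


# a 2 x 2 pure-red image: after the swap, red ends up in channel 2
red = np.zeros((2, 2, 3), dtype=np.float32)
red[..., 0] = 1.0
out = preprocess(red)
print(out.shape, out[:, 0, 0])
```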

output

('data_shape=', (2, 3, 512, 512))
('image_file=', './images/1.png')
('image.shape=', (1080, 1920, 3))
('transformed_image.shape=', (3, 512, 512))
('image_file=', './images/2.png')
('image.shape=', (1080, 1920, 3))
('transformed_image.shape=', (3, 512, 512))
('detections.shape = ', (1, 1, 1000, 7))
('start-end: ', 0, 500)
[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.]
('boxs = ', (500, 6))
('new_boxs = ', 3)
Saved: output/0_results.jpg
('start-end: ', 500, 1000)
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
('boxs = ', (500, 6))
('new_boxs = ', 6)
Saved: output/1_results.jpg
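As a side note, the per-row loop in filter_boxs can also be written with a NumPy boolean mask. This is a sketch under the same assumptions (N x 6 rows of xmin, ymin, xmax, ymax, confidence, class_index); `filter_boxes_vectorized` is a hypothetical name, and unlike the original it returns a float array instead of a list with integer coordinates. The sample rows are made up.

```python
import numpy as np


def filter_boxes_vectorized(boxes, threshold=0.4):
    """Keep rows with class_index > 0 and confidence >= threshold.

    boxes: N x 6 array of (xmin, ymin, xmax, ymax, confidence, class_index).
    """
    class_index = boxes[:, 5].astype(int)
    keep = (class_index > 0) & (boxes[:, 4] >= threshold)
    return boxes[keep]


# made-up detections
sample = np.array([
    [10, 20, 110, 220, 0.90, 1],   # kept
    [10, 20, 110, 220, 0.30, 1],   # dropped: low confidence
    [10, 20, 110, 220, 0.95, 0],   # dropped: background
], dtype=np.float32)
kept = filter_boxes_vectorized(sample)
print(kept)
```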

Reference

RefineDet official

History

20181127: created.

yolov1 pascal label data

Posted on 2018-11-26 Edited on 2024-10-14 In deep learning

code example

import os
import xml.etree.ElementTree as ET
import numpy as np
import cv2
import pickle
import copy
import yolo.config as cfg
import matplotlib.pyplot as plt

np.random.seed(1234)  # for np.random.shuffle(gt_labels)

classes = ['aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus',
           'car', 'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse',
           'motorbike', 'person', 'pottedplant', 'sheep', 'sofa',
           'train', 'tvmonitor']


class pascal_voc(object):
    def __init__(self, phase, rebuild=False):
        self.devkil_path = os.path.join(cfg.PASCAL_PATH, 'VOCdevkit')
        self.data_path = os.path.join(self.devkil_path, 'VOC2007')
        self.cache_path = cfg.CACHE_PATH
        self.batch_size = cfg.BATCH_SIZE
        self.image_size = cfg.IMAGE_SIZE
        self.cell_size = cfg.CELL_SIZE
        self.classes = cfg.CLASSES
        self.class_to_ind = dict(zip(self.classes, range(len(self.classes))))
        self.flipped = False  # cfg.FLIPPED
        self.phase = phase
        self.rebuild = rebuild
        self.cursor = 0
        self.epoch = 1
        self.gt_labels = None
        self.prepare()

    def get(self):
        images = np.zeros(
            (self.batch_size, self.image_size, self.image_size, 3))
        labels = np.zeros(
            (self.batch_size, self.cell_size, self.cell_size, 25))
        count = 0
        while count < self.batch_size:
            imname = self.gt_labels[self.cursor]['imname']
            flipped = self.gt_labels[self.cursor]['flipped']
            images[count, :, :, :] = self.image_read(imname, flipped)
            labels[count, :, :, :] = self.gt_labels[self.cursor]['label']
            count += 1
            self.cursor += 1
            if self.cursor >= len(self.gt_labels):
                np.random.shuffle(self.gt_labels)
                self.cursor = 0
                self.epoch += 1
        return images, labels

    def image_read(self, imname, flipped=False):
        image = cv2.imread(imname)
        image = cv2.resize(image, (self.image_size, self.image_size))
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB).astype(np.float32)
        image = (image / 255.0) * 2.0 - 1.0
        if flipped:
            image = image[:, ::-1, :]
        return image

    def prepare(self):
        gt_labels = self.load_labels()
        if self.flipped:
            print('Appending horizontally-flipped training examples ...')
            # keep y; flip x;
            gt_labels_cp = copy.deepcopy(gt_labels)
            for idx in range(len(gt_labels_cp)):
                gt_labels_cp[idx]['flipped'] = True
                gt_labels_cp[idx]['label'] = \
                    gt_labels_cp[idx]['label'][:, ::-1, :]  # flip x grid index  [0,1,2,3,4,5,6] ===>[6,5,4,3,2,1,0]
                for i in range(self.cell_size):
                    for j in range(self.cell_size):
                        if gt_labels_cp[idx]['label'][i, j, 0] == 1:
                            gt_labels_cp[idx]['label'][i, j, 1] = \
                                self.image_size - 1 - \
                                gt_labels_cp[idx]['label'][i, j, 1]  # cx = 448 -1 - cx  flipped cx
            gt_labels += gt_labels_cp
        np.random.shuffle(gt_labels)  # shuffle labels
        self.gt_labels = gt_labels
        return gt_labels

    def load_labels(self):
        cache_file = os.path.join(
            self.cache_path, 'pascal_' + self.phase + '_gt_labels.pkl')

        if os.path.isfile(cache_file) and not self.rebuild:
            print('Loading gt_labels from: ' + cache_file)
            with open(cache_file, 'rb') as f:
                gt_labels = pickle.load(f)
            return gt_labels

        print('Processing gt_labels from: ' + self.data_path)

        if not os.path.exists(self.cache_path):
            os.makedirs(self.cache_path)

        if self.phase == 'train':
            txtname = os.path.join(
                self.data_path, 'ImageSets', 'Main', 'trainval.txt')
        else:
            txtname = os.path.join(
                self.data_path, 'ImageSets', 'Main', 'test.txt')
        with open(txtname, 'r') as f:
            self.image_index = [x.strip() for x in f.readlines()]  # 5011 lines

        gt_labels = []
        for index in self.image_index:
            label, num = self.load_pascal_annotation(index)
            if num == 0:
                continue
            imname = os.path.join(self.data_path, 'JPEGImages', index + '.jpg')
            gt_labels.append({'imname': imname,
                              'label': label,
                              'flipped': False})
        print('Saving gt_labels to: ' + cache_file)
        with open(cache_file, 'wb') as f:
            pickle.dump(gt_labels, f)
        return gt_labels

    def load_pascal_annotation(self, index):
        """
        Load image and bounding boxes info from XML file in the PASCAL VOC
        format.   002939
        """

        imname = os.path.join(self.data_path, 'JPEGImages', index + '.jpg')
        im = cv2.imread(imname)
        h_ratio = 1.0 * self.image_size / im.shape[0]
        w_ratio = 1.0 * self.image_size / im.shape[1]
        # im = cv2.resize(im, [self.image_size, self.image_size])

        label = np.zeros((self.cell_size, self.cell_size, 25))  # 7,7,25
        filename = os.path.join(self.data_path, 'Annotations', index + '.xml')
        tree = ET.parse(filename)
        objs = tree.findall('object')

        for obj in objs:
            bbox = obj.find('bndbox')  # xmin,ymin,xmax,ymax  1-based ===> 0-based
            # Make pixel indexes 0-based
            x1 = max(min((float(bbox.find('xmin').text) - 1) * w_ratio, self.image_size - 1), 0)
            y1 = max(min((float(bbox.find('ymin').text) - 1) * h_ratio, self.image_size - 1), 0)
            x2 = max(min((float(bbox.find('xmax').text) - 1) * w_ratio, self.image_size - 1), 0)
            y2 = max(min((float(bbox.find('ymax').text) - 1) * h_ratio, self.image_size - 1), 0)
            cls_ind = self.class_to_ind[obj.find('name').text.lower().strip()]
            boxes = [(x2 + x1) / 2.0, (y2 + y1) / 2.0, x2 - x1, y2 - y1]  # cx,cy,w,h   [0-447]
            x_ind = int(boxes[0] * self.cell_size / self.image_size)  # grid x,y index  [0-6]
            y_ind = int(boxes[1] * self.cell_size / self.image_size)
            if label[y_ind, x_ind, 0] == 1:  # if multiple objects fall in same grid, we only use the first one
                continue
            label[y_ind, x_ind, 0] = 1  # has object  1 or 0
            label[y_ind, x_ind, 1:5] = boxes  # boxs (cx,cy,w,h)  [0-447]
            label[y_ind, x_ind, 5 + cls_ind] = 1  # class   20-one-hot-vector

        return label, len(objs)


"""
3 , 4 =  [0. 0. 0. 0. 0.]
3 , 5 =  [  1.    325.248 229.6   111.104 228.48 ]
    class_one_hot =  [0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
    class_index =  8
    class_name =  chair
3 , 6 =  [0. 0. 0. 0. 0.]
4 , 0 =  [0. 0. 0. 0. 0.]
4 , 1 =  [0. 0. 0. 0. 0.]
4 , 2 =  [  1.    132.16  288.4   172.928 316.96 ]
    class_one_hot =  [0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
    class_index =  8
    class_name =  chair
4 , 3 =  [0. 0. 0. 0. 0.]


data['label'].shape # 7,7,25  (confidence+ (x,y,w,h) + 20-classes)

confidence: 1 if gt_box center falls in this grid, otherwise 0
box(x,y,w,h): gt_box center x,y,w,h; otherwise [0,0,0,0]
class: 20-one-hot-vector if gt_box; otherwise [0]*20

how flip works: flip x dim

(1) flip grid x-dim:  data['label'] = data['label'][:, ::-1, :]

y-grid = y-grid  [0,1,2,3,4,5,6]
x-grid flip      [0,1,2,3,4,5,6] ===>[6,5,4,3,2,1,0]

(2) flip data['label']
confidence = confidence
cx:  flip cx = 447-cx:   data['label'][i, j, 1] = 448 - 1 - data['label'][i, j, 1]
cy = cy
w = w
h = h
class = class
"""


def print_data(data):
    # grid y,x
    for y in range(7):
        for x in range(7):
            print(y, ",", x, "= ", data['label'][y, x, :5])
            if data['label'][y, x, 0] > 0:  # confidence >0
                class_one_hot = data['label'][y, x, 5:]  # (20)
                class_index = np.argmax(class_one_hot)
                print("    class_one_hot = ", class_one_hot)
                print("    class_index = ", class_index)
                print("    class_name = ", classes[class_index])


def flip_data(data):
    data['flipped'] = True
    data['label'] = data['label'][:, ::-1, :]
    for y in range(7):
        for x in range(7):
            if data['label'][y, x, 0] == 1:
                data['label'][y, x, 1] = 448 - 1 - data['label'][y, x, 1]  # cx = 448 -1 - cx  flipped cx


def show_image(filename):
    image = cv2.imread(filename)
    # convert from BGR to RGB
    rgb_image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    plt.axis("off")
    plt.imshow(rgb_image)
    plt.show()


print("========================PASCAL=================================")
pascal = pascal_voc('train')
print(pascal.class_to_ind)  # dict 20
print(len(pascal.gt_labels))  # list: default 5011; flipped 10022
data = pascal.gt_labels[0]
print(data.keys())  # dict_keys(['flipped', 'imname', 'label'])
print(data['imname'])
print(data['label'].shape)  # 7,7,25  (confidence+ (x,y,w,h) + 20-classes)

print(classes)
show_image(data['imname'])

print("========================DATA=================================")
print_data(data)

print("=========================FLIPPED================================")
# flip data
flipped = copy.deepcopy(data)
flip_data(flipped)
print_data(flipped)
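To check the encoding and the flip by hand, here is a self-contained sketch that needs neither the dataset nor yolo.config. `encode_box` and `flip_label` are hypothetical helpers that follow the same rules as load_pascal_annotation and flip_data above, with the image size (448) and grid size (7) hard-coded. The box values are the tvmonitor entry from the printed results.

```python
import numpy as np

IMAGE_SIZE = 448
CELL_SIZE = 7


def encode_box(label, cx, cy, w, h, cls_ind):
    """Write one box (pixel units, 0-447) into a 7 x 7 x 25 label grid."""
    x_ind = int(cx * CELL_SIZE / IMAGE_SIZE)
    y_ind = int(cy * CELL_SIZE / IMAGE_SIZE)
    if label[y_ind, x_ind, 0] == 1:       # cell already taken: keep the first box
        return y_ind, x_ind
    label[y_ind, x_ind, 0] = 1            # has object
    label[y_ind, x_ind, 1:5] = [cx, cy, w, h]
    label[y_ind, x_ind, 5 + cls_ind] = 1  # one-hot class
    return y_ind, x_ind


def flip_label(label):
    """Horizontally flip a label grid: mirror the columns, then mirror cx."""
    flipped = label[:, ::-1, :].copy()
    has_obj = flipped[:, :, 0] == 1
    flipped[has_obj, 1] = IMAGE_SIZE - 1 - flipped[has_obj, 1]
    return flipped


label = np.zeros((CELL_SIZE, CELL_SIZE, 25))
cell = encode_box(label, 70.336, 202.496, 74.368, 149.33333333, 19)  # tvmonitor
flipped = flip_label(label)
print(cell, flipped[3, 5, 1])
```

The box lands in grid cell (3, 1), and after the flip it moves to cell (3, 5) with cx = 447 - 70.336.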

Loading gt_labels from: data\pascal_voc\cache\pascal_train_gt_labels.pkl
{'dog': 11, 'train': 18, 'bus': 5, 'motorbike': 13, 'aeroplane': 0, 'bicycle': 1, 'person': 14, 'horse': 12, 'bird': 2, 'tvmonitor': 19, 'sheep': 16, 'boat': 3, 'car': 6, 'diningtable': 10, 'pottedplant': 15, 'sofa': 17, 'bottle': 4, 'chair': 8, 'cat': 7, 'cow': 9}
5011
dict_keys(['flipped', 'label', 'imname'])
data\pascal_voc\VOCdevkit\VOC2007\JPEGImages\002939.jpg
(7, 7, 25)
['aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus', 'car', 'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse', 'motorbike', 'person', 'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor']

========================DATA=================================
0 , 0 =  [0. 0. 0. 0. 0.]
0 , 1 =  [0. 0. 0. 0. 0.]
0 , 2 =  [0. 0. 0. 0. 0.]
0 , 3 =  [0. 0. 0. 0. 0.]
0 , 4 =  [0. 0. 0. 0. 0.]
0 , 5 =  [0. 0. 0. 0. 0.]
0 , 6 =  [0. 0. 0. 0. 0.]
1 , 0 =  [0. 0. 0. 0. 0.]
1 , 1 =  [0. 0. 0. 0. 0.]
1 , 2 =  [0. 0. 0. 0. 0.]
1 , 3 =  [0. 0. 0. 0. 0.]
1 , 4 =  [0. 0. 0. 0. 0.]
1 , 5 =  [0. 0. 0. 0. 0.]
1 , 6 =  [0. 0. 0. 0. 0.]
2 , 0 =  [0. 0. 0. 0. 0.]
2 , 1 =  [0. 0. 0. 0. 0.]
2 , 2 =  [0. 0. 0. 0. 0.]
2 , 3 =  [0. 0. 0. 0. 0.]
2 , 4 =  [0. 0. 0. 0. 0.]
2 , 5 =  [0. 0. 0. 0. 0.]
2 , 6 =  [0. 0. 0. 0. 0.]
3 , 0 =  [0. 0. 0. 0. 0.]
3 , 1 =  [  1.          70.336      202.496       74.368      149.33333333]
    class_one_hot =  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
    class_index =  19
    class_name =  tvmonitor
3 , 2 =  [0. 0. 0. 0. 0.]
3 , 3 =  [0. 0. 0. 0. 0.]
3 , 4 =  [  1.         267.456      229.97333333  29.568       77.65333333]
    class_one_hot =  [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
    class_index =  4
    class_name =  bottle
3 , 5 =  [0. 0. 0. 0. 0.]
3 , 6 =  [0. 0. 0. 0. 0.]
4 , 0 =  [0. 0. 0. 0. 0.]
4 , 1 =  [0. 0. 0. 0. 0.]
4 , 2 =  [0. 0. 0. 0. 0.]
4 , 3 =  [  1.         220.864      283.136      158.592      327.33866667]
    class_one_hot =  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
    class_index =  14
    class_name =  person
4 , 4 =  [0. 0. 0. 0. 0.]
4 , 5 =  [0. 0. 0. 0. 0.]
4 , 6 =  [0. 0. 0. 0. 0.]
5 , 0 =  [0. 0. 0. 0. 0.]
5 , 1 =  [0. 0. 0. 0. 0.]
5 , 2 =  [0. 0. 0. 0. 0.]
5 , 3 =  [0. 0. 0. 0. 0.]
5 , 4 =  [  1.         283.584      337.49333333  92.288      185.17333333]
    class_one_hot =  [0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
    class_index =  8
    class_name =  chair
5 , 5 =  [0. 0. 0. 0. 0.]
5 , 6 =  [0. 0. 0. 0. 0.]
6 , 0 =  [0. 0. 0. 0. 0.]
6 , 1 =  [0. 0. 0. 0. 0.]
6 , 2 =  [0. 0. 0. 0. 0.]
6 , 3 =  [0. 0. 0. 0. 0.]
6 , 4 =  [0. 0. 0. 0. 0.]
6 , 5 =  [0. 0. 0. 0. 0.]
6 , 6 =  [0. 0. 0. 0. 0.]
=========================FLIPPED================================
0 , 0 =  [0. 0. 0. 0. 0.]
0 , 1 =  [0. 0. 0. 0. 0.]
0 , 2 =  [0. 0. 0. 0. 0.]
0 , 3 =  [0. 0. 0. 0. 0.]
0 , 4 =  [0. 0. 0. 0. 0.]
0 , 5 =  [0. 0. 0. 0. 0.]
0 , 6 =  [0. 0. 0. 0. 0.]
1 , 0 =  [0. 0. 0. 0. 0.]
1 , 1 =  [0. 0. 0. 0. 0.]
1 , 2 =  [0. 0. 0. 0. 0.]
1 , 3 =  [0. 0. 0. 0. 0.]
1 , 4 =  [0. 0. 0. 0. 0.]
1 , 5 =  [0. 0. 0. 0. 0.]
1 , 6 =  [0. 0. 0. 0. 0.]
2 , 0 =  [0. 0. 0. 0. 0.]
2 , 1 =  [0. 0. 0. 0. 0.]
2 , 2 =  [0. 0. 0. 0. 0.]
2 , 3 =  [0. 0. 0. 0. 0.]
2 , 4 =  [0. 0. 0. 0. 0.]
2 , 5 =  [0. 0. 0. 0. 0.]
2 , 6 =  [0. 0. 0. 0. 0.]
3 , 0 =  [0. 0. 0. 0. 0.]
3 , 1 =  [0. 0. 0. 0. 0.]
3 , 2 =  [  1.         179.544      229.97333333  29.568       77.65333333]
    class_one_hot =  [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
    class_index =  4
    class_name =  bottle
3 , 3 =  [0. 0. 0. 0. 0.]
3 , 4 =  [0. 0. 0. 0. 0.]
3 , 5 =  [  1.         376.664      202.496       74.368      149.33333333]
    class_one_hot =  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
    class_index =  19
    class_name =  tvmonitor
3 , 6 =  [0. 0. 0. 0. 0.]
4 , 0 =  [0. 0. 0. 0. 0.]
4 , 1 =  [0. 0. 0. 0. 0.]
4 , 2 =  [0. 0. 0. 0. 0.]
4 , 3 =  [  1.         226.136      283.136      158.592      327.33866667]
    class_one_hot =  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
    class_index =  14
    class_name =  person
4 , 4 =  [0. 0. 0. 0. 0.]
4 , 5 =  [0. 0. 0. 0. 0.]
4 , 6 =  [0. 0. 0. 0. 0.]
5 , 0 =  [0. 0. 0. 0. 0.]
5 , 1 =  [0. 0. 0. 0. 0.]
5 , 2 =  [  1.         163.416      337.49333333  92.288      185.17333333]
    class_one_hot =  [0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
    class_index =  8
    class_name =  chair
5 , 3 =  [0. 0. 0. 0. 0.]
5 , 4 =  [0. 0. 0. 0. 0.]
5 , 5 =  [0. 0. 0. 0. 0.]
5 , 6 =  [0. 0. 0. 0. 0.]
6 , 0 =  [0. 0. 0. 0. 0.]
6 , 1 =  [0. 0. 0. 0. 0.]
6 , 2 =  [0. 0. 0. 0. 0.]
6 , 3 =  [0. 0. 0. 0. 0.]
6 , 4 =  [0. 0. 0. 0. 0.]
6 , 5 =  [0. 0. 0. 0. 0.]
6 , 6 =  [0. 0. 0. 0. 0.]

Reference

History

20181126: created.

Untitled

Posted on 2018-11-26 Edited on 2024-10-14

tutorial to cuda programming with C++

Posted on 2018-11-21 Edited on 2024-10-14 In cpp

Series

Guide

introduction

In a heterogeneous computing architecture, the GPU and CPU are connected over the PCIe bus and work together. The CPU side is called the host, and the GPU side is called the device.

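A CPU+GPU heterogeneous platform lets the two processors complement each other: the CPU handles serial programs with complex control logic, while the GPU handles data-intensive parallel computation, so the system as a whole performs best.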

CUDA programming model basics

host: CPU,Memory
device: GPU,Memory

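A CUDA program contains both host code and device code, which run on the CPU and the GPU respectively. The host and the device can also communicate, so data can be copied between them. A typical CUDA program executes as follows:

Allocate host memory and initialize the data;
Allocate device memory and copy the data from the host to the device;
Launch a CUDA kernel function to perform the computation on the device;
Copy the result from the device back to the host;
Free the memory allocated on the device and the host.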

kernel

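The kernel is a central concept in CUDA. A kernel is a function that is executed in parallel by threads on the device. It is declared with the __global__ qualifier, and its launch uses <<<grid, block>>> to specify how many threads execute it. Every launched thread runs the kernel, and each thread is assigned a thread ID that is unique within its block and available inside the kernel through the built-in variable threadIdx.

Because the programming model is heterogeneous, host code and device code have to be told apart. CUDA does this with function type qualifiers. The three main qualifiers are:

__global__: runs on the device and is called from the host (some GPUs also allow calling it from the device). The return type must be void, variadic arguments are not supported, and it cannot be a class member function. Note that a kernel defined with __global__ is asynchronous: the host does not wait for the kernel to finish before moving on to the next step.
__device__: runs on the device and can only be called from the device. It cannot be combined with __global__.
__host__: runs on the host and can only be called from the host. It is usually omitted. It cannot be combined with __global__, but it can be combined with __device__, in which case the function is compiled for both the device and the host.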

grid/block/thread

dim3 grid(3, 2);
dim3 block(5, 3);
kernel_fun<<<grid, block>>>(params...);

The key is CUDA's <<<grid, block>>> syntax, for example <<<1, 1>>> for a single thread. This is called the execution configuration, and it tells the CUDA runtime how many parallel threads to use for the launch on the GPU.
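To make the execution configuration concrete, here is a minimal sketch in plain Python (not CUDA) that counts the threads a launch would create. The helper name `threads_launched` is made up for illustration; the shapes are the `grid(3, 2)` and `block(5, 3)` values used above.

```python
from math import prod

def threads_launched(grid, block):
    """Total threads = (number of blocks in the grid) * (threads per block)."""
    return prod(grid) * prod(block)

total = threads_launched((3, 2), (5, 3))    # dim3 grid(3, 2); dim3 block(5, 3);
single = threads_launched((1, 1), (1, 1))   # the <<<1, 1>>> launch
```

With these shapes the grid holds 6 blocks of 15 threads each.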

builtin variables

threadIdx
blockIdx
blockDim
gridDim

For a 2-dim block (Dx, Dy), the ID of thread (x, y) is (x + y∗Dx).
For a 3-dim block (Dx, Dy, Dz), the ID of thread (x, y, z) is (x + y∗Dx + z∗Dx∗Dy).
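The formula can be checked with a small plain-Python sketch. The function name `flat_thread_id` is an assumption for illustration only; a 2-dim block is treated as a 3-dim block with Dz = 1.

```python
def flat_thread_id(idx, dim):
    """Linear ID of thread idx = (x, y, z) inside a block of shape dim = (Dx, Dy, Dz)."""
    x, y, z = idx
    dx, dy, _ = dim
    return x + y * dx + z * dx * dy

# Enumerate a 2-dim block (5, 3) row by row: the IDs come out as 0..14.
ids = [flat_thread_id((x, y, 0), (5, 3, 1)) for y in range(3) for x in range(5)]
example_3d = flat_thread_id((1, 2, 3), (4, 5, 6))   # 1 + 2*4 + 3*4*5
```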

matrix add

// kernel function
__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N]) 
{ 
    int col = blockIdx.x * blockDim.x + threadIdx.x; 
    int row = blockIdx.y * blockDim.y + threadIdx.y; 
    if (col < N && row < N) 
        C[row][col] = A[row][col] + B[row][col]; 
}
int main() 
{ 
    ...
    // Kernel config
    dim3 threadsPerBlock(16, 16); 
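    // round up so the grid still covers the matrix when N is not a multiple of 16
    dim3 numBlocks((N + threadsPerBlock.x - 1) / threadsPerBlock.x,
                   (N + threadsPerBlock.y - 1) / threadsPerBlock.y);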
    // kernel call
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C); 
    ...
}
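The number of blocks matters when N is not a multiple of the block size. The plain-Python sketch below (hypothetical helper `cells_covered`, with N = 40 chosen only as an example) compares a truncating division with a rounded-up one; the `if (col < N && row < N)` guard in MatAdd is what makes the rounded-up grid safe.

```python
def cells_covered(n, block, num_blocks):
    """Matrix cells reached by a square grid: min(n, num_blocks * block) per dimension."""
    reach = min(n, num_blocks * block)
    return reach * reach

N, BLOCK = 40, 16
truncated = cells_covered(N, BLOCK, N // BLOCK)                  # 2 x 2 blocks
rounded_up = cells_covered(N, BLOCK, (N + BLOCK - 1) // BLOCK)   # 3 x 3 blocks
```

Here the truncated grid reaches only 32 x 32 of the 40 x 40 cells, while the rounded-up grid reaches all of them.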

CUDA memory model

gpu memory

logical/physical layer

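SP (Streaming Processor), also called a CUDA core, is the most basic processing unit.
SM stands for Streaming Multiprocessor.
The thread blocks of one kernel may be distributed across several SMs, so the grid is only a logical layer, while the SM is the physical layer of execution. An SM uses the SIMT (Single-Instruction, Multiple-Thread) architecture. Its basic unit of execution is the warp, which contains 32 threads. These threads execute the same instruction at the same time, but each thread has its own instruction address counter and register state, and its own independent execution path.
Because the SM's basic unit of execution is a warp of 32 threads, the block size should generally be a multiple of 32.
Each thread is executed by an SP.
Each thread block is executed by an SM.
A kernel is executed by one grid, and a kernel runs on only one GPU at a time.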
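As a rough illustration of why a multiple of 32 is preferred, the plain-Python sketch below counts the warps needed for a block and the lanes left idle in the last warp. The helper name `warps_per_block` is made up; the warp size of 32 comes from the text above.

```python
WARP_SIZE = 32

def warps_per_block(block_size):
    """Return (number of warps, idle lanes in the last warp) for one block."""
    warps = (block_size + WARP_SIZE - 1) // WARP_SIZE
    idle_lanes = warps * WARP_SIZE - block_size
    return warps, idle_lanes

full = warps_per_block(512)      # a multiple of 32: no idle lanes
partial = warps_per_block(100)   # not a multiple of 32: the last warp is partly empty
```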

Code

see cuda-demo

CMakeLists.txt

cmake_minimum_required (VERSION 2.8.7)

project (CudaExample)
enable_language(C)
enable_language(CXX)
set(CMAKE_CXX_STANDARD 11)

# Set the output folder where your program will be created
set(CMAKE_BINARY_DIR ${CMAKE_SOURCE_DIR}/bin)
set(EXECUTABLE_OUTPUT_PATH ${CMAKE_BINARY_DIR})
set(LIBRARY_OUTPUT_PATH ${CMAKE_BINARY_DIR})

find_package(CUDA REQUIRED) # user-defined

MESSAGE( [Main] " CUDA_LIBRARIES = ${CUDA_LIBRARIES}")
MESSAGE( [Main] " CUDA_INCLUDE_DIRS = ${CUDA_INCLUDE_DIRS}")

# The following folder will be included
include_directories(
	${CUDA_INCLUDE_DIRS}
)

set(CUDA_NVCC_FLAGS "-g -G")
set(GENCODE -gencode=arch=compute_61,code=sm_61)

cuda_add_executable(demo src/demo.cu OPTIONS ${GENCODE})
target_link_libraries(demo ${CUDA_LIBRARIES})

#cuda_add_library(gpu SHARED ${CURRENT_HEADERS} ${CURRENT_SOURCES})
#cuda_add_library(gpu STATIC ${CURRENT_HEADERS} ${CURRENT_SOURCES})

vector add


#include <stdlib.h>
#include <cmath>
#include <iostream>

#include <cuda_runtime.h>

using namespace std;

/*
https://blog.csdn.net/fb_help/article/details/79330815
foo.cuh + foo.cu
*/

// function to add the elements of two arrays
void add(int n, float *a, float *b, float *c)
{
	for (int i = 0; i < n; i++)
		c[i] = a[i] + b[i];
}

__global__ void kernel_add(int n, float *a, float *b, float *c)
{
	// global thread id
	int i = blockDim.x * blockIdx.x + threadIdx.x;
	// guard: the last block may contain threads past the end of the arrays
	if (i < n)
		c[i] = a[i] + b[i];
}

__global__ void kernel_add2(int n, float *a, float *b, float *c)
{
	// thread id
	int index = blockDim.x * blockIdx.x + threadIdx.x;
	
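	// grid-stride loop: each thread handles index, index + grid_stride, ...
	int grid_stride = blockDim.x * gridDim.x; // 512 * 2048 for the launch in test_gpu_2
	// in that case grid_stride == n, so each thread runs the loop body once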
	for (int i = index; i < n; i += grid_stride)
	{
		c[i] = a[i] + b[i];
	}
}

void device_info()
{
	int deviceCount;
	cudaGetDeviceCount(&deviceCount);
	for (int i = 0;i<deviceCount;i++)
	{
		cudaDeviceProp devProp;
		cudaGetDeviceProperties(&devProp, i);
		std::cout << "GPU device " << i << ": " << devProp.name << std::endl;
		std::cout << "Total global memory: " << devProp.totalGlobalMem / 1024 / 1024 << " MB" << std::endl;
		std::cout << "Number of SMs: " << devProp.multiProcessorCount << std::endl;
		std::cout << "Max threads per SM: " << devProp.maxThreadsPerMultiProcessor << std::endl;
		std::cout << "Max warps per SM: " << devProp.maxThreadsPerMultiProcessor / 32 << std::endl;

		std::cout << "Shared memory per block: " << devProp.sharedMemPerBlock / 1024.0 << " KB" << std::endl;
		std::cout << "Max threads per block: " << devProp.maxThreadsPerBlock << std::endl;
		std::cout << "32-bit registers available per block: " << devProp.regsPerBlock << std::endl;
		std::cout << "======================================================" << std::endl;
	}
}

void test_cpu()
{
	float *A, *B, *C;
	int n = 1024 * 1024;
	int size = n * sizeof(float);

	// allocate host memory
	A = (float*)malloc(size);
	B = (float*)malloc(size);
	C = (float*)malloc(size);

	// initialize the arrays
	for (int i = 0;i<n;i++)
	{
		A[i] = 90.0;
		B[i] = 10.0;
	}

	// Run kernel on 1M elements on the CPU
	add(n, A, B, C);

	// check the result: largest absolute deviation from the expected 100.0
	float max_error = 0.0;
	for (int i = 0;i<n;i++)
	{
		max_error = fmaxf(max_error, fabsf(100.0f - C[i]));
	}

	cout << "max error is " << max_error << endl;

	// free host memory
	free(A);
	free(B);
	free(C);
}

/*
cudaMalloc+cudaMemcpy+cudaFree
*/
int test_gpu_1()
{
    float*A, *Ad, *B, *Bd, *C, *Cd;
    int n = 1024 * 1024;
    int size = n * sizeof(float);

    // allocate host memory
    A = (float*)malloc(size);
    B = (float*)malloc(size);
    C = (float*)malloc(size);

    // initialize the arrays
    for(int i=0;i<n;i++)
    {
        A[i] = 90.0;
        B[i] = 10.0;
    }

    // allocate device memory
    cudaMalloc((void**)&Ad, size);
    cudaMalloc((void**)&Bd, size);
    cudaMalloc((void**)&Cd, size);

    // copy the inputs from host to device (Cd is written by the kernel, so C is not copied)
    cudaMemcpy(Ad, A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(Bd, B, size, cudaMemcpyHostToDevice);

    // 1-dim
    // kernel execution configuration: (1024*1024/512) = 2048 blocks of 512 threads each
    int block_size = 512;
    int num_of_blocks = (n + block_size - 1) / block_size;
    dim3 dimBlock(block_size);
    dim3 dimGrid(num_of_blocks);

    // launch the kernel
    kernel_add<<<dimGrid, dimBlock>>>(n, Ad, Bd, Cd);

    // copy the result computed on the GPU back to the host
    cudaMemcpy(C, Cd, size, cudaMemcpyDeviceToHost);

    // check the result: largest absolute deviation from the expected 100.0
    float max_error = 0.0;
    for(int i=0;i<n;i++)
    {
        max_error = fmaxf(max_error, fabsf(100.0f - C[i]));
    }

    cout << "max error is " << max_error << endl;

    // free host and device memory
    free(A);
    free(B);
    free(C);
    cudaFree(Ad);
    cudaFree(Bd);
    cudaFree(Cd);
    return 0;
}

/*
cudaMallocManaged+cudaDeviceSynchronize+cudaFree
*/
void test_gpu_2()
{
	float*A, *B, *C;
	int n = 1024 * 1024;
	int size = n * sizeof(float);

	// Allocate Unified Memory – accessible from CPU or GPU
	cudaMallocManaged((void**)&A, size);
	cudaMallocManaged((void**)&B, size);
	cudaMallocManaged((void**)&C, size);

	// initialize the arrays
	for (int i = 0;i<n;i++)
	{
		A[i] = 90.0;
		B[i] = 10.0;
	}

	// 1-dim
	// kernel execution configuration: (1024*1024/512) = 2048 blocks of 512 threads each
	int block_size = 512;
	int num_of_blocks = (n + block_size - 1) / block_size;
	dim3 dimBlock(block_size);
	dim3 dimGrid(num_of_blocks);

	// launch the kernel
	kernel_add2<<<dimGrid, dimBlock>>>(n, A, B, C);

	// Wait for GPU to finish before accessing on host
	cudaDeviceSynchronize(); // block until the GPU has finished all tasks

	// check the result: largest absolute deviation from the expected 100.0
	float max_error = 0.0;
	for (int i = 0;i<n;i++)
	{
		max_error = fmaxf(max_error, fabsf(100.0f - C[i]));
	}

	cout << "max error is " << max_error << endl;

	// Free Unified Memory
	cudaFree(A);
	cudaFree(B);
	cudaFree(C);
}

int main()
{
	device_info();
	test_cpu();
	test_gpu_1();
	test_gpu_2();
	return 0;
}

notes for block_size and num_of_blocks

int block_size = 512;
int num_of_blocks = (n + block_size - 1) / block_size; // 2048 for n = 1024 * 1024
dim3 dimBlock(block_size);
dim3 dimGrid(num_of_blocks);
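The rounding-up expression can be tried out in plain Python. This is only a sketch; `launch_1d` is a hypothetical helper, and n = 1000 is an extra example that is not in the demo.

```python
def launch_1d(n, block_size):
    """Return (number of blocks, threads launched beyond n) for a 1-dim launch."""
    num_blocks = (n + block_size - 1) // block_size   # integer division rounded up
    idle = num_blocks * block_size - n
    return num_blocks, idle

demo = launch_1d(1024 * 1024, 512)   # the configuration used in the demo
ragged = launch_1d(1000, 512)        # n is not a multiple of block_size
```

When `idle` is non-zero, the kernel needs a bounds check such as `if (i < n)` so the extra threads do nothing.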

notes for grid-stride loop

__global__ void kernel_add2(int n, float *a, float *b, float *c)
{
	// thread id
	int index = blockDim.x * blockIdx.x + threadIdx.x;

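	// grid-stride loop: each thread handles index, index + grid_stride, ...
	int grid_stride = blockDim.x * gridDim.x; // 512 * 2048 for the launch in test_gpu_2
	// in that case grid_stride == n, so each thread runs the loop body once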
	for (int i = index; i < n; i += grid_stride)
	{
		c[i] = a[i] + b[i];
	}
}
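To see how a grid-stride loop divides the work when there are fewer threads than elements, here is a plain-Python emulation. It is a sketch, not CUDA: `grid_stride_indices` is a made-up name, and the tiny sizes (10 elements, 2 blocks of 2 threads) are chosen only so the result is easy to read.

```python
def grid_stride_indices(n, block_dim, grid_dim):
    """For each emulated thread, list the array indices it would process."""
    stride = block_dim * grid_dim            # total number of threads in the grid
    work = {}
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            index = block_dim * block_idx + thread_idx
            work[index] = list(range(index, n, stride))
    return work

work = grid_stride_indices(n=10, block_dim=2, grid_dim=2)
visited = sorted(i for indices in work.values() for i in indices)
```

Thread 0 handles indices 0, 4 and 8, and together the four threads visit every index exactly once.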

nvprof

nvprof.exe demo.exe

==8748== Profiling application: .\demo.exe
==8748== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 43.63%  1.6413ms         3  547.10us  517.71us  591.41us  [CUDA memcpy HtoD]
 30.11%  1.1327ms         1  1.1327ms  1.1327ms  1.1327ms  [CUDA memcpy DtoH]
 26.26%  987.80us         2  493.90us  243.43us  744.37us  kernel_add(int, float*, float*, float*)

nvprof.exe is located at C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\bin\nvprof.exe

matrix multiply

for 1-dim vector add, we use 1-dim grid and block
for 2-dim matrix multiply, we use 2-dim grid and block.

// ========================================
// 2-dim
// ========================================
// matrix type, row-major: M(row, col) = *(M.elements + row * M.width + col)
struct Matrix
{
	int width;
	int height;
	float *elements;
};

// get element (row, col) of matrix A
__device__ float getElement(Matrix *A, int row, int col)
{
	return A->elements[row * A->width + col];
}

// set element (row, col) of matrix A
__device__ void setElement(Matrix *A, int row, int col, float value)
{
	A->elements[row * A->width + col] = value;
}

// matrix multiply kernel, 2-D, each thread computes one element of C
__global__ void matMulKernel(Matrix *A, Matrix *B, Matrix *C)
{
	float sum = 0.0;
	int row = threadIdx.y + blockIdx.y * blockDim.y;
	int col = threadIdx.x + blockIdx.x * blockDim.x;
	// guard: the grid is rounded up, so some threads may fall outside C
	if (row >= C->height || col >= C->width)
		return;
	for (int i = 0; i < A->width; ++i)
	{
		sum += getElement(A, row, i) * getElement(B, i, col);
	}
	setElement(C, row, col, sum);
}

void test_gpu_3()
{
	int width = 1 << 8;
	int height = 1 << 8;

	Matrix *A, *B, *C;
	// allocate managed (unified) memory
	cudaMallocManaged((void**)&A, sizeof(Matrix));
	cudaMallocManaged((void**)&B, sizeof(Matrix));
	cudaMallocManaged((void**)&C, sizeof(Matrix));

	int nBytes = width * height * sizeof(float);
	cudaMallocManaged((void**)&A->elements, nBytes);
	cudaMallocManaged((void**)&B->elements, nBytes);
	cudaMallocManaged((void**)&C->elements, nBytes);

	// initialize the data
	A->height = height;
	A->width = width;
	B->height = height;
	B->width = width;
	C->height = height;
	C->width = width;
	for (int i = 0; i < width * height; ++i)
	{
		A->elements[i] = 1.0;
		B->elements[i] = 2.0;
	}

	// kernel execution configuration
	dim3 blockSize(32, 32);
	dim3 gridSize(
		(width + blockSize.x - 1) / blockSize.x,
		(height + blockSize.y - 1) / blockSize.y
	);
	// launch the kernel
	matMulKernel<<<gridSize, blockSize>>>(A, B, C);

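	// synchronize with the device so the result can be read safely on the host
	cudaDeviceSynchronize();

	// check the result: every element of C should equal 2 * width
	float maxError = 0.0;
	for (int i = 0; i < width * height; ++i)
		maxError = fmaxf(maxError, fabsf(C->elements[i] - 2 * width));
	cout << "max error is " << maxError << endl;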

	// free managed memory
	cudaFree(A->elements);
	cudaFree(B->elements);
	cudaFree(C->elements);
	cudaFree(A);
	cudaFree(B);
	cudaFree(C);
}


int main()
{
	test_gpu_3();
	return 0;
}

notes for blockSize and gridSize

dim3 blockSize(32, 32);
dim3 gridSize(
	(width + blockSize.x - 1) / blockSize.x,
	(height + blockSize.y - 1) / blockSize.y
);
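The row-major indexing and the rounded-up 2-dim grid can be combined in a plain-Python emulation of the kernel. This is a sketch under stated assumptions: `matmul_by_threads` is a hypothetical name, the matrices are square, and a 5 x 5 size with 2 x 2 blocks is used so that the grid overshoots and the bounds check is exercised.

```python
def matmul_by_threads(a, b, width, height, block=(2, 2)):
    """Emulate one thread per output element, with row-major storage."""
    grid = ((width + block[0] - 1) // block[0],
            (height + block[1] - 1) // block[1])
    c = [0.0] * (width * height)
    for by in range(grid[1]):
        for bx in range(grid[0]):
            for ty in range(block[1]):
                for tx in range(block[0]):
                    row = ty + by * block[1]
                    col = tx + bx * block[0]
                    if row < height and col < width:   # skip threads outside the matrix
                        c[row * width + col] = sum(
                            a[row * width + i] * b[i * width + col]
                            for i in range(width)
                        )
    return grid, c

width = height = 5
grid, c = matmul_by_threads([1.0] * 25, [2.0] * 25, width, height)
```

As in the demo, A is all ones and B is all twos, so every element of C equals 2 * width.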

Reference

History

20181121: created.

how to use tensorrt int8 to do network calibration

Posted on 2018-11-19 Edited on 2024-10-14 In deep learning

Series

Guide

FP32/FP16/INT8 range

INT8 has significantly lower precision and dynamic range compared to FP32.
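To give a feel for the reduced range, the sketch below applies symmetric linear quantization in plain Python. This scheme is an assumption chosen for illustration and is not claimed to be what TensorRT does internally; `quantize_int8`, `dequantize` and the threshold `max_abs` are made-up names.

```python
def quantize_int8(values, max_abs):
    """Map floats in [-max_abs, max_abs] to integers in [-127, 127], saturating outside."""
    scale = max_abs / 127.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Map the integers back to approximate float values."""
    return [x * scale for x in q]

values = [0.0, 0.25, -1.0, 3.0]
q, scale = quantize_int8(values, max_abs=1.0)
restored = dequantize(q, scale)
```

The value 3.0 lies outside the chosen range and saturates, so it comes back as roughly 1.0. Choosing that range well is what calibration is for.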

High-throughput INT8 math

DP4A: int8 dot product. Requires sm_61+ (Pascal TitanX, GTX 1080, Tesla P4, P40 and others).
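DP4A computes the dot product of two 4-element int8 vectors and adds it to a 32-bit accumulator. The plain-Python sketch below emulates that arithmetic only. The function name `dp4a` and its argument layout are chosen for illustration, and 32-bit overflow is not modelled.

```python
def dp4a(a, b, c):
    """Emulate a 4-way int8 dot product with accumulate: c + sum(a[i] * b[i])."""
    assert len(a) == len(b) == 4
    assert all(-128 <= v <= 127 for v in list(a) + list(b))   # int8 range
    return c + sum(x * y for x, y in zip(a, b))

acc = dp4a([1, -2, 3, 4], [5, 6, -7, 8], 10)   # 5 - 12 - 21 + 32 + 10
```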

Calibration Dataset

When preparing the calibration dataset, you should capture the expected distribution of data in typical inference scenarios. You want to make sure that the calibration dataset covers all the expected scenarios; for example, clear weather, rainy day, night scenes, etc. If you are creating your own dataset, we recommend creating a separate calibration dataset. The calibration dataset shouldn’t overlap with the training, validation or test datasets, in order to avoid a situation where the calibrated model only works well on these datasets.
My note: the calibration set should be representative, preferably a subset of the val set.

result

caffe / tensorrt FP32 / tensorrt INT8

Code

fp32

by default.

fp16

cpp

builder->setFp16Mode(true);

python

builder.set_fp16_mode(True)

int8

cpp usage

builder->setInt8Mode(true);
builder->setInt8Calibrator(calibrator);

python usage

import tensorrt as trt

NUM_IMAGES_PER_BATCH = 5
batchstream = ImageBatchStream(NUM_IMAGES_PER_BATCH, calibration_files)

Int8_calibrator = trt.infer.EntropyCalibrator(["input_node_name"], batchstream)

trt_builder = trt.infer.create_infer_builder(G_LOGGER)
trt_builder.set_int8_mode(True)
trt_builder.set_int8_calibrator(Int8_calibrator)

Int8 Calibrator

see 5.1.3.2. INT8 Calibration Using C++

Calibration can be slow, therefore, the IInt8Calibrator interface provides methods for caching intermediate data. Using these methods effectively requires a more detailed understanding of calibration.

When building an INT8 engine, the builder performs the following steps:

Builds a 32-bit engine, runs it on the calibration set, and records a histogram for each tensor of the distribution of activation values.
Builds a calibration table from the histograms.
Builds the INT8 engine from the calibration table and the network definition.

The calibration table can be cached. Caching is useful when building the same network multiple times, for example, on multiple platforms. It captures data derived from the network and the calibration set. The parameters are recorded in the table. If the network or calibration set changes, it is the application’s responsibility to invalidate the cache.

The cache is used as follows:

if a calibration table is found, calibration is skipped, otherwise:
the calibration table is built from the histograms and parameters
then the INT8 network is built from the network definition and the calibration table.
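The cache logic above can be sketched in plain Python. This is not the TensorRT API: `load_or_build_table`, the `build_table` callback and the file name are hypothetical stand-ins for the calibrator's read and write methods and the slow calibration pass.

```python
import os
import tempfile

def load_or_build_table(cache_path, build_table):
    """Return (table_bytes, calibrated); calibration runs only when no cache is found."""
    if os.path.exists(cache_path) and os.path.getsize(cache_path) > 0:
        with open(cache_path, "rb") as f:
            return f.read(), False            # cache found: calibration is skipped
    table = build_table()                     # stand-in for the slow calibration pass
    with open(cache_path, "wb") as f:
        f.write(table)                        # store the table for the next build
    return table, True

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "calibration.cache")   # hypothetical file name
    first = load_or_build_table(path, lambda: b"fake-table")
    second = load_or_build_table(path, lambda: b"should-not-run")
```

The second call returns the cached bytes without invoking its callback. Deleting the file when the network or calibration set changes is left to the application, as noted above.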

Cached data is passed as a pointer and length.
After you have implemented the calibrator, you can configure the builder to use it:

builder->setInt8Calibrator(calibrator);

The make_plan program must run on the target system in order for the TensorRT engine to be optimized correctly for that system. However, if an INT8 calibration cache was produced on the host, the cache may be re-used by the builder on the target when generating the engine (in other words, there is no need to do INT8 calibration on the target system itself).

The INT8 calibration cache can be reused on another system, but the engine cannot.

demo

c++

cpp:

see calibrator.h
and calibrator.cpp

calibrator.h

#ifndef _CALIBRATOR_H_
#define _CALIBRATOR_H_

#include <string>
#include <vector>

#include "NvInfer.h"
#include "ds_image.h"
#include "trt_utils.h"

class Int8EntropyCalibrator : public nvinfer1::IInt8EntropyCalibrator
{
public:
    Int8EntropyCalibrator(const uint& batchSize, const std::string& calibrationSetPath,
                          const std::string& calibTableFilePath, const uint64_t& inputSize,
                          const uint& inputH, const uint& inputW, const std::string& inputBlobName);
    virtual ~Int8EntropyCalibrator() { NV_CUDA_CHECK(cudaFree(m_DeviceInput)); }

    int getBatchSize() const override { return m_BatchSize; }
    bool getBatch(void* bindings[], const char* names[], int nbBindings) override;
    const void* readCalibrationCache(size_t& length) override;
    void writeCalibrationCache(const void* cache, size_t length) override;

private:
    const uint m_BatchSize;
    const uint m_InputH;
    const uint m_InputW;
    const uint64_t m_InputSize;
    const uint64_t m_InputCount;
    const char* m_InputBlobName;
    const std::string m_CalibTableFilePath;
    uint m_ImageIndex;
    bool m_ReadCache{true};
    void* m_DeviceInput{nullptr};
    std::vector<std::string> m_ImageList;
    std::vector<char> m_CalibrationCache;
};

#endif

calibrator.cpp


#include "calibrator.h"
#include <algorithm>
#include <cassert>
#include <cstdlib>
#include <cstring>
#include <fstream>
#include <iostream>
#include <iterator>

Int8EntropyCalibrator::Int8EntropyCalibrator(const uint& batchSize,
                                             const std::string& calibrationSetPath,
                                             const std::string& calibTableFilePath,
                                             const uint64_t& inputSize, const uint& inputH,
                                             const uint& inputW, const std::string& inputBlobName) :
    m_BatchSize(batchSize),
    m_InputH(inputH),
    m_InputW(inputW),
    m_InputSize(inputSize),
    m_InputCount(batchSize * inputSize),
    m_InputBlobName(inputBlobName.c_str()),
    m_CalibTableFilePath(calibTableFilePath),
    m_ImageIndex(0)
{
    m_ImageList = loadListFromTextFile(calibrationSetPath);
    m_ImageList.resize(static_cast<int>(m_ImageList.size() / m_BatchSize) * m_BatchSize);
    std::random_shuffle(m_ImageList.begin(), m_ImageList.end(), [](int i) { return rand() % i; });
    NV_CUDA_CHECK(cudaMalloc(&m_DeviceInput, m_InputCount * sizeof(float)));
}

bool Int8EntropyCalibrator::getBatch(void* bindings[], const char* names[], int nbBindings)
{
    // stop only when a full batch is no longer available (">=" would drop the last batch)
    if (m_ImageIndex + m_BatchSize > m_ImageList.size()) return false;

    // Load next batch
    std::vector<DsImage> dsImages(m_BatchSize);
    for (uint j = m_ImageIndex; j < m_ImageIndex + m_BatchSize; ++j)
    {
        dsImages.at(j - m_ImageIndex) = DsImage(m_ImageList.at(j), m_InputH, m_InputW);
    }
    m_ImageIndex += m_BatchSize;

    cv::Mat trtInput = blobFromDsImages(dsImages, m_InputH, m_InputW);

    NV_CUDA_CHECK(cudaMemcpy(m_DeviceInput, trtInput.ptr<float>(0), m_InputCount * sizeof(float),
                             cudaMemcpyHostToDevice));
    assert(!strcmp(names[0], m_InputBlobName));
    bindings[0] = m_DeviceInput;
    return true;
}

const void* Int8EntropyCalibrator::readCalibrationCache(size_t& length)
{
    void* output;
    m_CalibrationCache.clear();
    assert(!m_CalibTableFilePath.empty());
    std::ifstream input(m_CalibTableFilePath, std::ios::binary);
    input >> std::noskipws;
    if (m_ReadCache && input.good())
        std::copy(std::istream_iterator<char>(input), std::istream_iterator<char>(),
                  std::back_inserter(m_CalibrationCache));

    length = m_CalibrationCache.size();
    if (length)
    {
        std::cout << "Using cached calibration table to build the engine" << std::endl;
        output = &m_CalibrationCache[0];
    }

    else
    {
        std::cout << "New calibration table will be created to build the engine" << std::endl;
        output = nullptr;
    }

    return output;
}

void Int8EntropyCalibrator::writeCalibrationCache(const void* cache, size_t length)
{
    assert(!m_CalibTableFilePath.empty());
    std::ofstream output(m_CalibTableFilePath, std::ios::binary);
    output.write(reinterpret_cast<const char*>(cache), length);
    output.close();
}

c++ v2

class Int8CacheCalibrator : public IInt8EntropyCalibrator {
public:
  Int8CacheCalibrator(std::string cacheFile)
    : mCacheFile(cacheFile) {}
  virtual ~Int8CacheCalibrator() {}
  
  int getBatchSize() const override {return 1;}
  
  bool getBatch(void* bindings[], const char* names[], int nbBindings) override {
    return false;
  }
  
  const void* readCalibrationCache(size_t& length) override
  {
    mCalibrationCache.clear();
    std::ifstream input(mCacheFile, std::ios::binary);
    input >> std::noskipws;
    if (input.good()) {
      std::copy(std::istream_iterator<char>(input),
                std::istream_iterator<char>(),
                std::back_inserter(mCalibrationCache));
    }
    length = mCalibrationCache.size();
    return length ? &mCalibrationCache[0] : nullptr;
  }

  // the cache is only read by this calibrator, so writing is a no-op
  void writeCalibrationCache(const void* cache, size_t length) override {}

private:
  std::string mCacheFile;
  std::vector<char> mCalibrationCache;
};

python

see calibrator.py

import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np
from PIL import Image
import ctypes
import tensorrt as trt

CHANNEL = 3
HEIGHT = 512
WIDTH = 1024

class PythonEntropyCalibrator(trt.infer.EntropyCalibrator):
  def __init__(self, input_layers, stream):
    trt.infer.EntropyCalibrator.__init__(self)       
    self.input_layers = input_layers
    self.stream = stream
    self.d_input = cuda.mem_alloc(self.stream.calibration_data.nbytes)
    stream.reset()

  def get_batch_size(self):
    return self.stream.batch_size

  def get_batch(self, bindings, names):
    batch = self.stream.next_batch()
    if not batch.size:   
      return None
      
    cuda.memcpy_htod(self.d_input, batch)
    for i in self.input_layers[0]:
      assert names[0] != i

    bindings[0] = int(self.d_input)
    return bindings

  def read_calibration_cache(self, length):
    return None

  def write_calibration_cache(self, ptr, size):
    # copy exactly `size` bytes; c_char_p(...).value would stop at the first NUL byte
    cache = ctypes.string_at(int(ptr), size)
    with open('calibration_cache.bin', 'wb') as f:
      f.write(cache)
    return None

Reference

History

20181119: created.

install apache httpd server 2.2 on windows 10

Posted on 2018-11-14 Edited on 2024-10-14 In web

Guide

install and config

download httpd-2.2.34-win64.zip
extract to c:/Apache2
edit conf/httpd.conf

ServerRoot "c:/Apache2"
Listen 80
ServerName 127.0.0.1
DocumentRoot "c:/Apache2/htdocs"

run cmd.exe as administrator and install apache2.2 service

cd c:/Apache2/bin
httpd.exe -k install -n apache2.2

tips: use sc delete apache2.2 to delete service first if error occurs and then install again.

run bin/ApacheMonitor.exe to start apache2.2 service

access from local

test localhost

access from remote

edit conf/httpd.conf

<Directory "c:/Apache2/htdocs">
    AllowOverride None
    Order allow,deny
    Allow from all
</Directory>

open Windows Defender Firewall and allow inbound and outbound traffic for Apache HTTP Server

server host ip: 192.168.6.149
access host ip: 192.168.6.100

wget http://192.168.6.149/1.txt

Reference

History

20181114: created.

compile and install darknet on windows 10

Posted on 2018-11-01 Edited on 2024-10-14 In deep learning

Series

Guide

requirements

my system requirements

windows 10
GeForce 1060 (6G) sm_61
VS: 2015
cuda: 8.0
cudnn: 6.0.1
opencv: opencv-3.3.0-vc14.exe

Tips:
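Install CUDA after VS 2015, otherwise errors occur. A likely reason is that the CUDA installer adds its Visual Studio integration only to the Visual Studio versions it finds at install time.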

commands

git clone https://github.com/AlexeyAB/darknet.git

config

extract opencv-3.3.0-vc14.exe to C:\opencv330\
find CUDA 8.0.props from C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\extras\visual_studio_integration\MSBuildExtensions\CUDA 8.0.props

vim build/darknet/darknet.vcxproj

replace C:\opencv_2.4.9\ with C:\opencv330\
replace CUDA 9.2.props with CUDA 8.0.props
replace CUDA 9.2.targets with CUDA 8.0.targets
replace compute_30,sm_30; with compute_61,sm_61;
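The four replacements can also be applied with a short script instead of editing by hand. The plain-Python sketch below shows the idea on an in-memory string; `patch_project` is a made-up helper and the sample lines are hypothetical, not copied from the real darknet.vcxproj.

```python
# (old, new) pairs taken from the list above
REPLACEMENTS = [
    ("C:\\opencv_2.4.9\\", "C:\\opencv330\\"),
    ("CUDA 9.2.props", "CUDA 8.0.props"),
    ("CUDA 9.2.targets", "CUDA 8.0.targets"),
    ("compute_30,sm_30;", "compute_61,sm_61;"),
]

def patch_project(text):
    """Apply every (old, new) replacement to the project file text."""
    for old, new in REPLACEMENTS:
        text = text.replace(old, new)
    return text

# hypothetical sample lines, for illustration only
sample = "\n".join([
    '<Import Project="$(VCTargetsPath)\\BuildCustomizations\\CUDA 9.2.props" />',
    "<CodeGeneration>compute_30,sm_30;</CodeGeneration>",
])
patched = patch_project(sample)
```

To patch the real file, read build/darknet/darknet.vcxproj into a string, pass it through `patch_project`, and write the result back. Keeping a backup copy first is advisable.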

compile

open build/darknet/darknet.sln with VS 2015

include path

C:\opencv330\opencv\build\include
..\..\3rdparty\include
$(CUDA_PATH)\include

library path

C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\lib\x64
$(CUDA_PATH)\lib\$(PlatformName)  

linker input

    ..\..\3rdparty\lib\x64\pthreadVC2.lib;cublas.lib;curand.lib;cudart.lib;

build with x64 Release.

run

darknet.exe

./darknet.exe detector test data/coco.data yolov3.cfg yolov3.weights -i 0 -thresh 0.25 dog.jpg -ext_output

darknet.py

compile yolo_cpp_dll.sln and generate yolo_cpp_dll.dll for python usage.

python darknet.py

Code

yolo.py

for linux and windows.
see yolo.py

yolo_cpp_dll

System Path

C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\bin
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\libnvvp
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\jre\bin
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\jre\bin\plugin2

cuda80+cudnn6

cuda_8.0.61_win10.exe: cublas64_80.dll,curand64_80.dll,cudart64_80.dll
cudnn-8.0-windows10-x64-v6.0.zip: cudnn64_6.dll

cuda90+cudnn7

cuda_9.0.176_win10.exe: cublas64_90.dll,curand64_90.dll,cudart64_90.dll
cudnn-9.0-windows10-x64-v7.1.zip: cudnn64_7.dll

yolo dll

cuda80_yolo_cpp_dll: pthreadvc2.dll, cublas64_80.dll,curand64_80.dll,cudart64_80.dll,cudnn64_6.dll
cuda90_yolo_cpp_dll: pthreadvc2.dll, cublas64_90.dll,curand64_90.dll,cudart64_90.dll,cudnn64_7.dll

Tips: use Dependency Walker to list dlls.
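Before shipping the DLL, a quick script can confirm that every listed dependency sits next to it. The plain-Python sketch below is only an illustration: `missing_dlls` is a hypothetical helper, and the temporary directory with empty files stands in for a real deployment folder.

```python
import os
import tempfile

# DLL names copied from the cuda80_yolo_cpp_dll list above
CUDA80_DLLS = ["pthreadvc2.dll", "cublas64_80.dll", "curand64_80.dll",
               "cudart64_80.dll", "cudnn64_6.dll"]

def missing_dlls(directory, names):
    """Return the names that are not present in directory (case-insensitive)."""
    present = {name.lower() for name in os.listdir(directory)}
    return [name for name in names if name.lower() not in present]

with tempfile.TemporaryDirectory() as tmp:
    # create empty stand-in files for everything except cuDNN
    for name in CUDA80_DLLS[:-1]:
        open(os.path.join(tmp, name), "wb").close()
    missing = missing_dlls(tmp, CUDA80_DLLS)
```

This only checks file names. Dependency Walker remains the way to discover which DLLs are actually needed.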

Reference

History

20181101: created.