
Guide

version

  • ubuntu 16.04 64 bit
  • cuda driver 384.130
  • tensorflow-gpu 1.4.0 (CUDA 8.0 + cudnn 6.0)
  • tensorflow-gpu 1.5.0+ (CUDA 9.0 + cudnn 7.0)
  • python 3.5

install

version:

  • cpu: tensorflow
  • gpu: tensorflow-gpu

commands

workon py3
pip install tensorflow-gpu==1.4
pip install keras

pip install Pillow scipy sklearn scikit-image ipython

pip list
pip3 list # same results as pip

Tip: for the virtualenv workon command, see the python virtualenv tutorial.

test

import tensorflow as tf
hello = tf.constant('Hello, TensorFlow!')
sess = tf.Session()
print hello
print(sess.run(hello))

output

Tensor("Const:0", shape=(), dtype=string)
Hello, TensorFlow!
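To double-check that the GPU build is actually being picked up, a quick sanity check like the following can help (a minimal sketch using standard TensorFlow 1.x utilities; the exact device names printed depend on your machine):

```python
# minimal sketch: verify that tensorflow-gpu can see the GPU
import tensorflow as tf
from tensorflow.python.client import device_lib

print(tf.__version__)              # e.g. 1.4.0
print(tf.test.is_gpu_available())  # True if CUDA/cuDNN are set up correctly
print([d.name for d in device_lib.list_local_devices()])  # e.g. ['/cpu:0', '/device:GPU:0']
```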

fix errors

error

ImportError: libcublas.so.9.0: cannot open shared object file: No such file or directory
ImportError: libcudnn.so.6: cannot open shared object file: No such file or directory

tensorflow-gpu 1.5 uses CUDA 9.0, so we install tensorflow-gpu 1.4 to use CUDA 8.0 instead.

pip uninstall tensorflow-gpu
pip install tensorflow-gpu==1.4

Jupyter notebook with tensorflow

install tensorflow kernel

workon py3
pip install ipykernel

python -m ipykernel install --user --name=tensorflow
Installed kernelspec tensorflow in /home/kezunlin/.local/share/jupyter/kernels/tensorflow

use tensorflow kernel

cd workspace/anjian
jupyter notebook

create a notebook with tensorflow kernel

png

Demo

disable info

import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
#default 0: output all log messages
#set to 1: additionally filter out INFO messages
#set to 2: additionally filter out WARNING messages
#set to 3: additionally filter out ERROR messages

tensorflow errors

error

Exception ignored in: <bound method BaseSession.__del__ of <tensorflow.python.client.session.Session object at 0x7fd3edd13e10>>

Traceback (most recent call last):
File "venv/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 712, in __del__
TypeError: 'NoneType' object is not callable

fix
from keras import backend as K
#...
#...
K.clear_session()
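A minimal sketch of where the call typically goes (the tiny model here is just a placeholder; the point is to call K.clear_session() once you are finished, e.g. at the very end of the script):

```python
# minimal sketch: release the TensorFlow session Keras holds before the
# interpreter shuts down, to avoid the "'NoneType' object is not callable" noise
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras import backend as K

model = Sequential([Dense(2, input_dim=4, activation='softmax')])
model.compile(optimizer='sgd', loss='categorical_crossentropy')
print(model.predict(np.zeros((1, 4))))

K.clear_session()  # explicitly dispose of the backend session at exit
```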

Reference

History

  • 20180821: created.

How to Install

Bazel is an open-source build and test tool similar to Make, Maven, and Gradle. It uses a human-readable, high-level build language. Bazel supports projects in multiple languages and builds outputs for multiple platforms. Bazel supports large codebases across multiple repositories, and large numbers of users.

supported languages and platforms:

  • c++
  • java
  • android
  • ios

Using binary installer

sudo apt-get install pkg-config zip g++ zlib1g-dev unzip python

#download `bazel-0.16.1-installer-linux-x86_64.sh` from `https://github.com/bazelbuild/bazel/releases`

chmod +x bazel-0.16.1-installer-linux-x86_64.sh
./bazel-0.16.1-installer-linux-x86_64.sh --user
# The --user flag installs Bazel to the $HOME/bin directory on your system and sets the .bazelrc path to $HOME/.bazelrc.

vim .bashrc
export PATH="$PATH:$HOME/bin"

Using Bazel custom APT repository

Note: this method requires access to storage.googleapis.com, which may be unreachable from some networks.

sudo apt-get install openjdk-8-jdk

echo "deb [arch=amd64] http://storage.googleapis.com/bazel-apt stable jdk1.8" | sudo tee /etc/apt/sources.list.d/bazel.list
curl https://bazel.build/bazel-release.pub.gpg | sudo apt-key add -


sudo apt-get update && sudo apt-get install bazel
sudo apt-get upgrade bazel

Tutorial

get examples

git clone https://github.com/bazelbuild/examples/

folder structure

cpp-tutorial/
├── README.md
├── stage1
│   ├── main
│   │   ├── BUILD
│   │   └── hello-world.cc
│   ├── README.md
│   └── WORKSPACE
├── stage2
│   ├── main
│   │   ├── BUILD
│   │   ├── hello-greet.cc
│   │   ├── hello-greet.h
│   │   └── hello-world.cc
│   ├── README.md
│   └── WORKSPACE
└── stage3
    ├── lib
    │   ├── BUILD
    │   ├── hello-time.cc
    │   └── hello-time.h
    ├── main
    │   ├── BUILD
    │   ├── hello-greet.cc
    │   ├── hello-greet.h
    │   └── hello-world.cc
    ├── README.md
    └── WORKSPACE

7 directories, 20 files

stage1

Understand the BUILD file

cpp-tutorial/stage1/main/BUILD

cc_binary(
    name = "hello-world",
    srcs = ["hello-world.cc"],
)

build target

cd stage1
bazel build //main:hello-world

output

INFO: Found 1 target...
Target //main:hello-world up-to-date:
  bazel-bin/main/hello-world
INFO: Elapsed time: 2.267s, Critical Path: 0.25s

test binary

bazel-bin/main/hello-world

Review the dependency graph

install graphviz

sudo apt install graphviz xdot

vizualize

bazel query --nohost_deps --noimplicit_deps 'deps(//main:hello-world)' --output graph
xdot <(bazel query --nohost_deps --noimplicit_deps 'deps(//main:hello-world)' --output graph)

graph

png

stage2

Specify multiple build targets

cpp-tutorial/stage2/main/BUILD

cc_library(
    name = "hello-greet",
    srcs = ["hello-greet.cc"],
    hdrs = ["hello-greet.h"],
)

cc_binary(
    name = "hello-world",
    srcs = ["hello-world.cc"],
    deps = [
        ":hello-greet",
    ],
)

build target

cd stage2
bazel build //main:hello-world

output

INFO: Found 1 target...
Target //main:hello-world up-to-date:
  bazel-bin/main/hello-world
INFO: Elapsed time: 2.267s, Critical Path: 0.25s

test binary

bazel-bin/main/hello-world

graph

png

stage3

Use multiple packages

folder structure

└──stage3
   ├── main
   │   ├── BUILD
   │   ├── hello-world.cc
   │   ├── hello-greet.cc
   │   └── hello-greet.h
   ├── lib
   │   ├── BUILD
   │   ├── hello-time.cc
   │   └── hello-time.h
   └── WORKSPACE

lib/BUILD

cc_library(
    name = "hello-time",
    srcs = ["hello-time.cc"],
    hdrs = ["hello-time.h"],
    visibility = ["//main:__pkg__"],
)

The visibility attribute is needed here because, by default, targets are visible only to other targets in the same BUILD file.

main/BUILD

cc_library(
    name = "hello-greet",
    srcs = ["hello-greet.cc"],
    hdrs = ["hello-greet.h"],
)

cc_binary(
    name = "hello-world",
    srcs = ["hello-world.cc"],
    deps = [
        ":hello-greet",
        "//lib:hello-time",
    ],
)

build target

cd stage3
bazel build //main:hello-world

output

INFO: Found 1 target...
Target //main:hello-world up-to-date:
  bazel-bin/main/hello-world
INFO: Elapsed time: 2.267s, Critical Path: 0.25s

test binary

bazel-bin/main/hello-world

graph

png

Use labels to reference targets

//path/to/package:target-name
  • When referencing targets within the same package, you can skip the package path and just use //:target-name.
  • When referencing targets within the same BUILD file, you can even skip the // workspace root identifier and just use :target-name.

Reference

History

  • 20180821: created.

Guide

ncnn

ncnn is a high-performance neural network forward-inference framework heavily optimized for mobile phones. Mobile deployment and usage were considered from the very first design: it has no third-party dependencies, is cross-platform, and its mobile CPU speed is faster than all currently known open-source frameworks. With ncnn, developers can easily port deep learning algorithms to phones and run them efficiently, building AI apps that bring AI to your fingertips. ncnn is already used in many Tencent applications, such as QQ, Qzone, WeChat and Pitu.

Feature overview

  • Supports convolutional neural networks, multi-input and multi-branch structures, and can evaluate only part of the branches
  • No third-party library dependencies; does not rely on compute frameworks such as BLAS/NNPACK
  • Pure C++ implementation, cross-platform, supports Android, iOS, etc.
  • Careful ARM NEON assembly-level optimization for very fast computation
  • Fine-grained memory management and data structure design for a very low memory footprint
  • Supports multi-core parallel acceleration, with ARM big.LITTLE CPU scheduling optimization
  • The whole library is under 500K and can easily be trimmed to under 300K
  • Extensible model design, supports 8-bit quantization and half-precision float storage, can import caffe models
  • Supports zero-copy loading of network models directly referenced from memory
  • Custom layer implementations can be registered as extensions
  • In short: it is very strong QvQ

nihui, a C/C++ enthusiast, is a senior researcher in the basic research group of Tencent Youtu Lab, responsible for image- and face-related research and software development. A passionate open-source contributor, nihui leads ncnn, the first AI open-source project of Tencent's Social Network Group.

features:

  • Runs models such as VGG, GoogLeNet and ResNet 2-4x faster than other known open-source frameworks
  • C++ is close to the hardware and can control nearly every resource, so the runtime overhead is small. It currently targets Android and iOS, but in fact any platform with a C++ compiler will do. No third-party library dependencies; does not rely on compute frameworks such as BLAS/NNPACK
  • All ncnn code is written in C/C++ with a cross-platform CMake build system, so it compiles and runs on most known platforms, such as Linux, Windows, macOS, Android and iOS. Because ncnn has no third-party dependencies, is implemented against the C++03 standard, and uses only the std::vector and std::string STL templates, it is easy to port to other systems and devices.
  • Why choose the CPU rather than the GPU as the compute hardware? CPU compatibility is very good, whereas GPU feature support varies widely and is hard to target, e.g. Metal on iOS and OpenCL on Android. Nobody denies a GPU would be faster, but GPU optimization is complex and writing a generic GPU path is hard; it is still difficult to implement at the moment.
  • ncnn powers several algorithms provided by Youtu, for example face-related applications: automatic portrait beautification, photo stylization, super-resolution, object recognition, and so on. Small network models can run in real time.
  • Cloud vs. device? AR and VR both require real-time performance; no matter how fast the cloud is, the round trip cannot be real-time, so on-device deployment is necessary. The cloud suits big-data workloads such as recommendation and security systems; the device suits real-time scenarios such as intelligent robots and autonomous driving.

tools:

  • caffe2ncnn: converts a caffe model (prototxt, caffemodel) into ncnn's xxx.param and xxx.bin files
  • ncnn2mem: encrypts the model xxx.param into a binary file xxx.param.bin

For now, ncnn only supports OpenCV 2.

FeatherCNN

By Tencent.

see here

mace

By Xiaomi.

features

  • Speed: models that run on the device usually have very strict requirements on end-to-end prediction latency. At the framework level, MACE has NEON instruction-level optimizations for ARM CPUs and efficient OpenCL kernels for mobile GPUs, and for Qualcomm DSPs it integrates the nnlib library for HVX acceleration. At the algorithm level, the Winograd algorithm is used to speed up convolutions.
  • Power: mobile devices are very sensitive to power consumption. For the ARM big.LITTLE architecture the framework provides several high-performance and low-power configurations, and for Adreno GPUs it offers different power/performance options so developers can flexibly trade performance against power.
  • System responsiveness: in GPU compute mode, the framework adaptively splits and schedules OpenCL kernels so GPU rendering tasks can preempt properly, keeping the system smooth.
  • Initialization latency: in real projects initialization time matters a great deal for user experience, and the framework is specifically optimized for it.
  • Memory footprint: by analyzing dependencies between the model's operators and reusing memory, memory usage is greatly reduced.
  • Model protection: protecting the intellectual property of on-device models is often very important. MACE can convert a model into C++ code, which makes reverse engineering much harder.
    In addition, MACE supports TensorFlow and Caffe models and provides conversion tools that turn a trained model into a proprietary model data file; the model can optionally be converted into C++ code and built as a dynamic or static library, further improving model confidentiality.

TensorRT

NVIDIA TensorRT is a high-performance deep learning inference engine for production deployment, used for image classification, segmentation and object detection; it delivers up to 14x the frames per second of a CPU-only inference engine.

Key features:

1) Generates an optimized, ready-to-run model for inference;

2) Optimizes and deploys a wide range of neural network layers, such as convolution, fully connected, LRN, pooling, activation, softmax, concat and deconvolution layers;

3) Supports caffe prototxt network description files;

4) Runs the network at full precision (FP32) or at reduced precision (INT8, FP16);

5) Provides a custom layer API to define and implement your own functionality.

DIGITS 5 and TensorRT are available as free downloads for members of the NVIDIA Developer Program.

The defining characteristic of online deployment is its strict real-time requirement: it is very latency-sensitive, so inference must return results very quickly. Deployment is not only a cost problem; if done poorly, even the most advanced GPU cannot meet the real-time requirement, because an unoptimized model may need 200-300 ms for a single inference, and with the network round trip the user may only see a result after a second. In a speech-recognition scenario the user can wait; in a driving scenario it may cost lives.

At deployment time, latency is a crucial point, and TensorRT is optimized specifically for the deployment side. TensorRT currently supports most mainstream deep learning applications; it is best at CNNs (convolutional neural networks), but TensorRT 3.0 also has an RNN API.

To summarize the differences between inference and training:

  • For inference the network weights are already fixed and there is no backward pass, so the model is frozen and the computation graph can be optimized; input and output sizes are fixed, so memory can also be optimized. (Note: fine-tuning, i.e. continuing to tune an already-trained model with small changes, is essentially still training; TensorRT does not do fine-tuning.)

  • The inference batch size is much smaller, again because of latency: with a large batch size, training throughput can be very high, e.g. processing a 1024-sample batch in 500 ms gives a throughput of 2048 and makes good use of the GPU; but inference cannot afford 500 ms per batch, so the batch size may only be 8 or 16, throughput drops, and the GPU cannot be used as effectively.

  • Inference can use low-precision techniques. Training needs relatively high precision (usually 32-bit float, FP32), because the forward and backward passes must be preserved and each gradient update is tiny. At inference time the precision requirement is much lower: research shows that half-precision floats (FP16) or even 8-bit integers (INT8) can be used with no significant accuracy loss, especially for CNNs. Binary weights (only 0 and 1) are still under research, while FP16 and INT8 are already fairly mature. Low precision helps in two ways: it reduces computation, since a unit that processes 32-bit values can in theory run FP16 at 2x and INT8 at 4x the speed (at the cost of some extra conversion operations), and it reduces the memory needed to store weights and intermediate values, so the model becomes correspondingly smaller. (A small numeric sketch of INT8 weight quantization follows this list.)
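As a toy illustration of the idea, here is simple symmetric max-abs INT8 quantization of a weight tensor in numpy; this is not TensorRT's actual calibration algorithm, just a sketch of the storage/accuracy trade-off:

```python
# toy sketch: symmetric INT8 quantization of a weight tensor (not TensorRT's calibrator)
import numpy as np

w = np.random.randn(64, 64).astype(np.float32)    # FP32 weights
scale = np.abs(w).max() / 127.0                   # map max |w| to 127
w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
w_dequant = w_int8.astype(np.float32) * scale     # what the INT8 path effectively computes with

print("max abs error:", np.abs(w - w_dequant).max())
print("bytes fp32 vs int8:", w.nbytes, w_int8.nbytes)  # 4x smaller storage
```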

Setting TensorRT aside for a moment, if you had to write the forward pass of a deep learning model from scratch, the steps would be as follows (a minimal sketch follows the list):

  1. First, implement the NN layers, e.g. the convolution and pooling implementations.

  2. Manage memory, i.e. how data flows between the layers.

  3. An inference engine that calls each layer's implementation.
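A minimal numpy sketch of those three steps, with hypothetical toy layers, just to make the structure concrete (nothing here is TensorRT code):

```python
# toy sketch of a hand-written forward pass: layer implementations + buffers between layers + an engine loop
import numpy as np

def conv2d(x, w):                      # 1. layer implementation (naive, no padding/stride)
    h, wd = x.shape
    kh, kw = w.shape
    out = np.zeros((h - kh + 1, wd - kw + 1), dtype=np.float32)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i+kh, j:j+kw] * w)
    return out

def max_pool2x2(x):                    # another layer implementation
    h, w = x.shape
    return x[:h//2*2, :w//2*2].reshape(h//2, 2, w//2, 2).max(axis=(1, 3))

layers = [lambda x: conv2d(x, np.ones((3, 3), np.float32) / 9.0),
          max_pool2x2]

def run_engine(x, layers):             # 3. engine: call each layer; 2. pass the buffer along
    for layer in layers:
        x = layer(x)                   # output buffer of one layer is the input of the next
    return x

print(run_engine(np.random.rand(8, 8).astype(np.float32), layers).shape)  # (3, 3)
```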

TensorRT advanced features:

  • Plugin support: layers that TensorRT does not support can be implemented by the user in the form of a plugin.
  • Low-precision support: low precision refers to the FP16 and INT8 mentioned above. FP16 is mainly supported on the Pascal P100 and V100 (tensor cores), while INT8 mainly targets the P4 and P40; the P4 is a small 75 W card built specifically for online inference, about the size of an iPhone, with very good performance per watt.
  • Python API and broader framework support: TensorRT currently offers Python and C++ APIs. The model importer (parser) mainly supports Caffe and UFF; other frameworks can be added through the API. Once TensorRT runs inference, the original framework (caffe, tensorflow) is no longer needed.

Low-precision inference

  • FP16 inference: TensorRT supports highly automated FP16 inference. When parsing the model, set the data type to DataType::kHALF and switch inference to FP16 mode with builder->setHalf2Mode(true). Two things to note: FP16 inference needs no extra input beyond the pretrained FP32 model, and at the time of writing only Tesla P100/V100 support native FP16.

  • INT8 inference: the main topics are how to generate the calibration table, how to use the calibration table, and an INT8 inference example.

To summarize, the advantages of TensorRT:

  • TensorRT is a high-performance optimizer and runtime engine for deep learning inference;
  • TensorRT supports plugins, so users can add custom implementations for unsupported layers;
  • TensorRT uses low-precision techniques to obtain roughly 2-3x speedup over FP32, requiring only small code changes.

Anakin

Baidu PaddlePaddle Anakin.
see here

Anakin supports a wide range of neural network architectures and different hardware platforms. It is easy to run Anakin on GPU / x86 / ARM platform.

TVM

see here

TVM is a brand-new framework that can:

  • represent and optimize common deep learning compute workloads for CPUs, GPUs and other specialized hardware
  • automatically transform the computation graph to minimize memory footprint, optimize data layout and fuse compute patterns
  • provide end-to-end compilation, from existing frontend frameworks down to bare-metal hardware and even browser-executable JavaScript

With TVM, deep learning workloads can easily be run on phones, embedded devices and even browsers with no extra effort. TVM also provides a unified optimization framework for deep learning workloads across many hardware platforms, including specialized accelerators that rely on new compute primitives.

Reference

TensorRT

History

  • 20180817: created.

Guide

Quick guide with Demo

# install packages

pip install cython
pip install easydict
apt-get install python-opencv

# Make sure to clone with --recursive
git clone --recursive https://github.com/rbgirshick/py-faster-rcnn.git

# Build the Cython modules
cd py-faster-rcnn/lib
make

# Build Caffe and pycaffe
cd py-faster-rcnn/caffe-fast-rcnn
mkdir build && cd build && cmake-gui ..
make -j8


#Download pre-computed Faster R-CNN detectors

cd py-faster-rcnn
./data/scripts/fetch_faster_rcnn_models.sh

# This will populate the `FRCN_ROOT/data` folder with faster_rcnn_models. See `data/README.md` for details. These models were trained on VOC 2007 trainval.

# Demo
./tools/demo.py --gpu 0 --net zf
./tools/demo.py --gpu 0 --net vgg16

fix gflags error

  • caffe-fast-rcnn/include/caffe/common.hpp
  • caffe-fast-rcnn/examples/mnist/convert_mnist_data.cpp

Comment out the ifndef

// #ifndef GFLAGS_GFLAGS_H_
namespace gflags = google;
// #endif // GFLAGS_GFLAGS_H_

Train net with your own data

Faster R-CNN can be trained in two ways:

  • Alternating optimization: two networks are trained, an RPN and a Fast R-CNN, over two stages in total; in each stage the RPN and the Fast R-CNN are each trained once.
  • Approximate joint training, also called end-to-end training: only a single shared network is trained, which speeds up training considerably with no loss in accuracy.
# prepare data
wget http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtrainval_06-Nov-2007.tar
wget http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtest_06-Nov-2007.tar
wget http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCdevkit_08-Jun-2007.tar

tar xvf VOCtrainval_06-Nov-2007.tar
tar xvf VOCtest_06-Nov-2007.tar
tar xvf VOCdevkit_08-Jun-2007.tar


VOCdevkit/ # development kit
VOCdevkit/VOCcode/ # VOC utility code
VOCdevkit/VOC2007 # image sets, annotations, etc.
# ... and several other directories ...


cd py-faster-rcnn/data
ln -s VOCdevkit VOCdevkit2007
# Using symlinks is a good idea because you will likely want to share the same PASCAL dataset installation between multiple projects.

# download pre-trained imagenet models
./data/scripts/fetch_imagenet_models.sh

# train net
./experiments/scripts/faster_rcnn_end2end.sh 0 ZF pascal_voc

error fixes

error

AttributeError: 'module' object has no attribute 'text_format'

fix

Add one line to ./lib/fast_rcnn/train.py:

import google.protobuf.text_format

training results

AP for aeroplane = 0.6312
AP for bicycle = 0.7069
AP for bird = 0.5836
AP for boat = 0.4471
AP for bottle = 0.3562
AP for bus = 0.6682
AP for car = 0.7569
AP for cat = 0.7249
AP for chair = 0.3844
AP for cow = 0.6152
AP for diningtable = 0.6162
AP for dog = 0.6502
AP for horse = 0.7580
AP for motorbike = 0.7128
AP for person = 0.6744
AP for pottedplant = 0.3358
AP for sheep = 0.5872
AP for sofa = 0.5649
AP for train = 0.7128
AP for tvmonitor = 0.6133
Mean AP = 0.6050

Results:
0.631
0.707
0.584
0.447
0.356
0.668
0.757
0.725
0.384
0.615
0.616
0.650
0.758
0.713
0.674
0.336
0.587
0.565
0.713
0.613
0.605

--------------------------------------------------------------
Results computed with the **unofficial** Python eval code.
Results should be very close to the official MATLAB eval code.
Recompute with `./tools/reval.py --matlab ...` for your paper.
-- Thanks, The Management
--------------------------------------------------------------

real	5m16.906s
user	4m6.179s
sys	1m16.157s

Reference

History

  • 20180816: created.

Guide

Matplot (skimage/ PIL Image)

# Matplot: dims: (height,width,channels),order: RGB,range: [0,255] dtype: uint8
import matplotlib.pyplot as plt
import matplotlib.image as img
image = img.imread("images/cat.jpg")
print image.shape # (360, 480, 3)
print image[:5,:5,0]
#plt.axis("off")
plt.imshow(image)
plt.show()
(360, 480, 3)
[[26 27 25 28 30]
 [26 27 25 26 28]
 [26 26 26 26 27]
 [27 26 27 28 29]
 [29 27 26 26 29]]

png

PIL.Image

# PIL Image.open: dims: hwc,order: RGB, ??( range: [0,255] dtype: uint8)??

import matplotlib.pyplot as plt
from PIL import Image
image = Image.open("images/cat.jpg")
print(image)
# <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=480x360 at 0x7F258E0B8410>
plt.imshow(image)
plt.show()

png

skimage

import skimage.io
image = skimage.io.imread(image_filepath) # RGB (608, 606, 3)

OpenCV

# OpenCV: dims: (height,width,channels),order: BGR,range: [0,255] dtype: uint8
import cv2
image = cv2.imread("images/cat.jpg")
print image.shape # (360, 480, 3)
print image[:5,:5,0]
#plt.axis("off")
plt.imshow(image)
plt.show()
(360, 480, 3)
[[49 50 47 48 50]
 [51 52 48 48 50]
 [51 51 49 48 49]
 [50 49 49 48 49]
 [52 50 49 48 49]]

png

The colors of our image are clearly wrong! Why is this?

The answer lies in a caveat with OpenCV: OpenCV represents images as multi-dimensional NumPy arrays, but with the channels in reverse order. This means that OpenCV images are actually stored in BGR order rather than RGB!

import cv2
image = cv2.imread("images/cat.jpg")
# convert from BGR to RGB
rgb_image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
plt.axis("off")
plt.imshow(rgb_image)
plt.show()

png

Matplot VS. OpenCV

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.image as img
image1 = img.imread("images/cat.jpg") # rgb

import cv2
image = cv2.imread("images/cat.jpg")
# convert from BGR to RGB
image2 = cv2.cvtColor(image, cv2.COLOR_BGR2RGB) # rgb

#image1 and image2 are same at all.

print image1.dtype
print image1[:5,:5,0]

print
print image2.dtype
print image2[:5,:5,0]

equal_count = np.sum( np.equal(image1[:,:,:],image2[:,:,:]) )
print equal_count
print equal_count == 360*480*3
uint8
[[26 27 25 28 30]
 [26 27 25 26 28]
 [26 26 26 26 27]
 [27 26 27 28 29]
 [29 27 26 26 29]]

uint8
[[26 27 25 28 30]
 [26 27 25 26 28]
 [26 26 26 26 27]
 [27 26 27 28 29]
 [29 27 26 26 29]]
518400
True

caffe.io.load_image

caffe.io.load_image loads data in a normalized form (0-1)

# caffe.io.load_image: dims: (height,width,channels),order: RGB,range: [0,1] dtype: float32
# matplot: caffe_image = matplot_image/255.0

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# configure plotting
#plt.rcParams['figure.figsize'] = (10, 10)
plt.rcParams['image.interpolation'] = 'nearest'
#plt.rcParams['image.cmap'] = 'gray'

import sys
caffe_root = '../' # this file should be run from {caffe_root}/examples (otherwise change this line)
sys.path.insert(0, caffe_root + 'python')

import caffe
#======================================================================
# load image
#======================================================================
image = caffe.io.load_image(caffe_root + 'examples/images/cat.jpg')
print image.shape,image.dtype # (360, 480, 3) float32
print image[:5,:5,0]

plt.figure()
plt.imshow(image) # (360, 480, 3) RGB

#======================================================================
# load color image with color=False
#======================================================================
image2 = caffe.io.load_image(caffe_root + 'examples/images/cat.jpg',color=False)
print image2.shape #(360, 480, 1)
gray_image2 = image2.squeeze()
print gray_image2.shape,gray_image2.dtype # (360, 480) float32
print gray_image2[:5,:5]

plt.figure()
plt.imshow(gray_image2) # (360, 480) gray


#======================================================================
# load color image with color=False
#======================================================================
image3 = caffe.io.load_image(caffe_root + 'examples/images/cat_gray.jpg',color=False)
print image3.shape #(360, 480, 1)
gray_image3 = image3.squeeze()
print gray_image3.shape,gray_image3.dtype # (360, 480) float32
print gray_image3[:5,:5]

plt.figure()
plt.imshow(gray_image3) # (360, 480) gray

plt.show()
(360, 480, 3) float32
[[ 0.10196079  0.10588235  0.09803922  0.10980392  0.11764706]
 [ 0.10196079  0.10588235  0.09803922  0.10196079  0.10980392]
 [ 0.10196079  0.10196079  0.10196079  0.10196079  0.10588235]
 [ 0.10588235  0.10196079  0.10588235  0.10980392  0.11372549]
 [ 0.11372549  0.10588235  0.10196079  0.10196079  0.11372549]]
(360, 480, 1)
(360, 480) float32
[[ 0.19543412  0.19935569  0.18842432  0.19120707  0.1990502 ]
 [ 0.19599961  0.19992118  0.19151255  0.19234589  0.20018902]
 [ 0.19599961  0.19599961  0.19543412  0.19234589  0.19626746]
 [ 0.19935569  0.19543412  0.19626746  0.19120707  0.19512863]
 [ 0.20719883  0.19935569  0.19543412  0.19234589  0.19512863]]
(360, 480, 1)
(360, 480) float32
[[ 0.10196079  0.10588235  0.09803922  0.10980392  0.11372549]
 [ 0.10196079  0.10588235  0.09803922  0.10196079  0.10980392]
 [ 0.10196079  0.10588235  0.10196079  0.10196079  0.10588235]
 [ 0.10588235  0.10196079  0.10588235  0.10980392  0.11372549]
 [ 0.11764706  0.10196079  0.10196079  0.10588235  0.10980392]]

png

png

png

caffe.io.Transformer

caffe.io.Transformer for Network input blob(m,c,h,w):

  • caffe Network default use BGR image format just as OpenCV format.
  • caffe mean files use BGR ordering, which is calculated from the training images rather than the test images. mu = np.array([104, 117, 123]) # BGR
  • pixel range in [0,255] with dtype float32.
  • (m,c,h,w), BGR order,[0,255] range,float32

caffe.io.load_image

caffe.io.Transformer:

  • input image: caffe.io.load_image: (h,w,c),RGB,[0,1],float32
  • transformed image: (c,h,w), BGR,[0,255] float32

caffe.io.Transformer steps:

Note that the mean subtraction is always carried out before scaling.

  • transformer.set_transpose(‘data’, (2,0,1)) #(h,w,c)->(c,h,w)
  • transformer.set_channel_swap(‘data’, (2,1,0)) # RGB->BGR
  • transformer.set_raw_scale(‘data’, 255) # [0,1]->[0,255] float32
  • transformer.set_mean(‘data’, mu) # subtract BGR

keep in mind that the Transformer is only required when using a deploy.prototxt-like network definition, so without the Data Layer. When using a Data Layer, things get easier to understand.

import numpy as np
import matplotlib.pyplot as plt

import sys
caffe_root = '../' # this file should be run from {caffe_root}/examples (otherwise change this line)
sys.path.insert(0, caffe_root + 'python')

import caffe
image = caffe.io.load_image(caffe_root + 'examples/images/cat.jpg')
# caffe.io.load_image: dims: (height,width,channels),order: RGB,range: [0,1] dtype: float32
print image.shape,image.dtype # (360, 480, 3) float32
print image[:5,:5,0]

#plt.imshow(image)
#plt.show()

mu = np.load(caffe_root + 'python/caffe/imagenet/ilsvrc_2012_mean.npy')
mu = mu.mean(1).mean(1) # # BGR

data_shape = (10, 3, 227, 227)
transformer = caffe.io.Transformer({'data': data_shape})

transformer.set_transpose('data', (2,0,1)) # h,w,c->c,h,w(012->201) move image channels to outermost dimension
transformer.set_channel_swap('data', (2,1,0)) # swap channels from RGB to BGR
transformer.set_raw_scale('data', 255) # rescale from [0, 1] to [0, 255]
transformer.set_mean('data', mu) # subtract the dataset-mean value(BGR) in each channel

transformed_image = transformer.preprocess('data', image)
print
print 'original image: ',image.shape,image.dtype # (360, 480, 3) float32
print 'transform image: ',transformed_image.shape,transformed_image.dtype #(3, 227, 227) float32
print transformed_image[0,:5,:5]

# By default, using CaffeNet, your net.blobs['data'].data.shape == (10, 3, 227, 227).
# This is because 10 random 227x227 crops are supposed to be extracted from a 256x256 image
# and passed through the net.

# net.blobs['data'].reshape(50,3,227,227) # we can change network input mini-batch to 50 as we like
# net.blobs['data'].data[...] = transformed_image # --->(50,3,227,227) 50 images
(360, 480, 3) float32
[[ 0.10196079  0.10588235  0.09803922  0.10980392  0.11764706]
 [ 0.10196079  0.10588235  0.09803922  0.10196079  0.10980392]
 [ 0.10196079  0.10196079  0.10196079  0.10196079  0.10588235]
 [ 0.10588235  0.10196079  0.10588235  0.10980392  0.11372549]
 [ 0.11372549  0.10588235  0.10196079  0.10196079  0.11372549]]

original  image:  (360, 480, 3) float32
transform image:  (3, 227, 227) float32
[[-53.86381531 -56.23903656 -53.54626465 -53.14715195 -51.32625961]
 [-52.93947601 -55.71855164 -54.00423813 -54.76469803 -52.88771057]
 [-53.89373398 -55.67879486 -55.4278717  -55.22265625 -53.47174454]
 [-50.98455811 -51.3506012  -54.06866074 -52.09104156 -52.94168854]
 [-49.92769241 -49.85874176 -52.08575439 -52.50840759 -51.3900528 ]]

cv2.imread

caffe.io.Transformer:

  • input image: cv2.imread: (h,w,c), BGR, [0,255], uint8 (cast to float32 inside preprocess)
  • transformed image: (c,h,w), BGR order,[0,255] float32

caffe.io.Transformer steps:

Note that the mean subtraction is always carried out before scaling.

  • transformer.set_transpose(‘data’, (2,0,1)) #(h,w,c)->(c,h,w)
  • transformer.set_mean(‘data’, mu) # subtract BGR
import cv2

image = cv2.imread("test/cat.jpg")

data_shape = (10, 3, 227, 227)

transformer = caffe.io.Transformer({'data': data_shape})
transformer.set_transpose('data', (2,0,1))
transformer.set_mean('data', mu)

transformed_image = transformer.preprocess('data', image)
print
print 'original image: ',image.shape,image.dtype # (360, 480, 3) float32
print 'transform image: ',transformed_image.shape,transformed_image.dtype #(3, 227, 227) float32
print transformed_image[0,:5,:5]

# By default, using CaffeNet, your net.blobs['data'].data.shape == (10, 3, 227, 227).
# This is because 10 random 227x227 crops are supposed to be extracted from a 256x256 image
# and passed through the net.

# net.blobs['data'].reshape(50,3,227,227) # we can change network input mini-batch to 50 as we like
# net.blobs['data'].data[...] = transformed_image # --->(50,3,227,227) 50

deprocess transformed_image

# Helper function for deprocessing preprocessed images, e.g., for display.
def deprocess_net_image(image):
    # [('B', 104.0069879317889), ('G', 116.66876761696767), ('R', 122.6789143406786)]

    # input: (c,h,w), BGR, [lower,upper], float32
    # output: (h,w,c), RGB, [0,255], uint8
    image = image.copy()              # don't modify destructively
    image = image[::-1]               # BGR -> RGB
    image = image.transpose(1, 2, 0)  # CHW -> HWC
    image += [123, 117, 104]          # (approximately) undo mean subtraction, RGB order

    # clamp values in [0, 255]
    image[image < 0], image[image > 255] = 0, 255

    # round and cast from float32 to uint8
    image = np.round(image)
    image = np.require(image, dtype=np.uint8)

    return image

image = deprocess_net_image(transformed_image)
#(h,w,c), RGB,[0,255], uint8

print image.shape,image.dtype # (227, 227, 3) uint8
print image[:5,:5,0]
plt.imshow(image)
plt.show()
(227, 227, 3) uint8
[[27 27 29 29 30]
 [26 26 28 27 28]
 [27 27 27 26 28]
 [27 28 25 28 27]
 [26 29 28 28 28]]

png

set 3-dim image to 4-dim input blob data

import numpy as np
data = np.zeros((2,3,4,4))
print data
image = np.arange(48).reshape(3,4,4)
print
print image

print 'set image to data'
data[...] = image # auto broadcasting from 3-dims to 4-dims
print data
[[[[ 0.  0.  0.  0.]
   [ 0.  0.  0.  0.]
   [ 0.  0.  0.  0.]
   [ 0.  0.  0.  0.]]

  [[ 0.  0.  0.  0.]
   [ 0.  0.  0.  0.]
   [ 0.  0.  0.  0.]
   [ 0.  0.  0.  0.]]

  [[ 0.  0.  0.  0.]
   [ 0.  0.  0.  0.]
   [ 0.  0.  0.  0.]
   [ 0.  0.  0.  0.]]]


 [[[ 0.  0.  0.  0.]
   [ 0.  0.  0.  0.]
   [ 0.  0.  0.  0.]
   [ 0.  0.  0.  0.]]

  [[ 0.  0.  0.  0.]
   [ 0.  0.  0.  0.]
   [ 0.  0.  0.  0.]
   [ 0.  0.  0.  0.]]

  [[ 0.  0.  0.  0.]
   [ 0.  0.  0.  0.]
   [ 0.  0.  0.  0.]
   [ 0.  0.  0.  0.]]]]

[[[ 0  1  2  3]
  [ 4  5  6  7]
  [ 8  9 10 11]
  [12 13 14 15]]

 [[16 17 18 19]
  [20 21 22 23]
  [24 25 26 27]
  [28 29 30 31]]

 [[32 33 34 35]
  [36 37 38 39]
  [40 41 42 43]
  [44 45 46 47]]]
set image to data
[[[[  0.   1.   2.   3.]
   [  4.   5.   6.   7.]
   [  8.   9.  10.  11.]
   [ 12.  13.  14.  15.]]

  [[ 16.  17.  18.  19.]
   [ 20.  21.  22.  23.]
   [ 24.  25.  26.  27.]
   [ 28.  29.  30.  31.]]

  [[ 32.  33.  34.  35.]
   [ 36.  37.  38.  39.]
   [ 40.  41.  42.  43.]
   [ 44.  45.  46.  47.]]]


 [[[  0.   1.   2.   3.]
   [  4.   5.   6.   7.]
   [  8.   9.  10.  11.]
   [ 12.  13.  14.  15.]]

  [[ 16.  17.  18.  19.]
   [ 20.  21.  22.  23.]
   [ 24.  25.  26.  27.]
   [ 28.  29.  30.  31.]]

  [[ 32.  33.  34.  35.]
   [ 36.  37.  38.  39.]
   [ 40.  41.  42.  43.]
   [ 44.  45.  46.  47.]]]]

transformer vs. python code

caffe.io.load_image

transformer
import os
import sys
import cv2
import numpy as np
# Make sure that caffe is on the python path:
caffe_root = './'
os.chdir(caffe_root)
sys.path.insert(0, os.path.join(caffe_root, 'python'))
import caffe


# caffe.io.load_image: transformer + python code
data_shape = [1,3,512,512]

transformer = caffe.io.Transformer({'data':data_shape}) # resize
transformer.set_transpose('data', (2, 0, 1)) # hwc ===> chw
transformer.set_channel_swap('data', (2, 1, 0)) # rgb===>bgr
transformer.set_raw_scale('data', 255) # [0-1]===> [0,255]
transformer.set_mean('data', np.array([104, 117, 123])) # bgr mean pixel

image_file = "./images/1.png"
print("image_file=", image_file)
image = caffe.io.load_image(image_file) # hwc, rgb, 0-1
print("image.shape=", image.shape)

transformed_image = transformer.preprocess('data', image) #
print("transformed_image.shape=", transformed_image.shape) # 3,512,512
b,g,r = transformed_image
print(b.shape) # 512,512
print(g.shape)
print(r.shape)

print("")
print(transformed_image[:,:5,:5])

output

('image_file=', './images/1.png')
('image.shape=', (1080, 1920, 3))
('transformed_image.shape=', (3, 512, 512))
(512, 512)
(512, 512)
(512, 512)

[[[ -98.          -98.          -98.          -98.          -98.        ]
  [ -98.          -98.          -98.          -98.          -98.        ]
  [ -23.96776581  -28.58105469  -31.359375    -25.08592987  -28.90721893]
  [  -8.21874237  -12.71092987  -15.46875     -15.27832031  -10.57226562]
  [  -7.75        -12.12499237  -15.          -15.          -10.984375  ]]

 [[-117.         -117.         -117.         -117.         -117.        ]
  [-117.         -117.         -117.         -117.         -117.        ]
  [ -43.96776581  -48.58105469  -51.359375    -45.08592987  -48.90721893]
  [ -26.21874237  -30.71092987  -33.46875     -33.27832031  -33.57226562]
  [ -24.75        -29.12499237  -32.          -32.          -31.984375  ]]

 [[-123.         -123.         -123.         -123.         -123.        ]
  [-123.         -123.         -123.         -123.         -123.        ]
  [ -52.96776581  -57.58105469  -60.359375    -54.08592987  -57.90721893]
  [ -40.21874237  -44.71092987  -47.46875     -47.27832031  -44.572258  ]
  [ -40.75        -45.12499237  -48.          -48.          -47.984375  ]]]
python code
print(image.shape) # hwc,rgb,0-1   (1080, 1920, 3)
print(image.dtype) # float32

# resize
image = cv2.resize(image, (512,512))
print("image resize = ",image.shape) # (512, 512, 3)

# hwc,rgb ===> chw, bgr
r,g,b = image[:,:,0],image[:,:,1],image[:,:,2]

print(b.shape) # (512, 512)
print(g.shape) # (512, 512)
print(r.shape) # (512, 512)

bgr = np.zeros([3,b.shape[0],b.shape[1]])
print(bgr.shape)
bgr[0,:,:] = b
bgr[1,:,:] = g
bgr[2,:,:] = r

# 0-1 ===>0-255
bgr = bgr *255.

# -mean
print("")
bgr[0] -= 104
bgr[1] -= 117
bgr[2] -= 123
print(bgr[:,:5,:5])

output

(1080, 1920, 3)
float32
('image resize = ', (512, 512, 3))
float32
(512, 512)
(512, 512)
(512, 512)
(3, 512, 512)

[[[ -97.99999988  -97.99999988  -97.99999988  -97.99999988  -97.99999988]
  [ -97.99999988  -97.99999988  -97.99999988  -97.99999988  -97.99999988]
  [ -23.9677673   -28.58105415  -31.35937387  -25.0859333   -28.90722105]
  [  -8.21874478  -12.71093214  -15.46874815  -15.27831757  -10.5722701 ]
  [  -7.74999434  -12.12499598  -14.99999771  -14.99999771  -10.98437318]]

 [[-117.         -117.         -117.         -117.         -117.        ]
  [-117.         -117.         -117.         -117.         -117.        ]
  [ -43.96776688  -48.58105373  -51.35937345  -45.08593288  -48.90722823]
  [ -26.21874449  -30.71093184  -33.46874785  -33.27831727  -33.5722695 ]
  [ -24.7499941   -29.12499574  -31.99999747  -31.99999747  -31.98437271]]

 [[-123.         -123.         -123.         -123.         -123.        ]
  [-123.         -123.         -123.         -123.         -123.        ]
  [ -52.9677667   -57.58105356  -60.35937327  -54.0859327   -57.90722805]
  [ -40.21874401  -44.71093136  -47.46874738  -47.2783168   -44.5722692 ]
  [ -40.7499935   -45.12499514  -47.99999687  -47.99999687  -47.98437211]]]

cv2.imread

transformer
# cv2.imread: transformer  + python code
data_shape = [1,3,512,512]

transformer = caffe.io.Transformer({'data':data_shape}) # resize
transformer.set_transpose('data', (2, 0, 1)) # hwc ===> chw
#transformer.set_channel_swap('data', (2, 1, 0)) # rgb===>bgr
#transformer.set_raw_scale('data', 255) # [0-1]===> [0,255]
transformer.set_mean('data', np.array([104, 117, 123])) # bgr mean pixel

image_file = "./images/1.png"
print("image_file=", image_file)
image = cv2.imread(image_file) # hwc, bgr, 0-255
print("image.shape=", image.shape)

transformed_image = transformer.preprocess('data', image) #
print("transformed_image.shape=", transformed_image.shape) # 3,512,512
b,g,r = transformed_image
print(b.shape) # 512,512
print(g.shape)
print(r.shape)

print("")
print(transformed_image[:,:5,:5])

output

('image_file=', './images/1.png')
('image.shape=', (1080, 1920, 3))
('transformed_image.shape=', (3, 512, 512))
(512, 512)
(512, 512)
(512, 512)

[[[ -98.          -98.          -98.          -98.          -98.        ]
  [ -98.          -98.          -98.          -98.          -98.        ]
  [ -23.96777344  -28.58105469  -31.359375    -25.0859375   -28.90722656]
  [  -8.21875     -12.7109375   -15.46875     -15.27832031  -10.57226562]
  [  -7.75        -12.125       -15.          -15.          -10.984375  ]]

 [[-117.         -117.         -117.         -117.         -117.        ]
  [-117.         -117.         -117.         -117.         -117.        ]
  [ -43.96777344  -48.58105469  -51.359375    -45.0859375   -48.90722656]
  [ -26.21875     -30.7109375   -33.46875     -33.27832031  -33.57226562]
  [ -24.75        -29.125       -32.          -32.          -31.984375  ]]

 [[-123.         -123.         -123.         -123.         -123.        ]
  [-123.         -123.         -123.         -123.         -123.        ]
  [ -52.96777344  -57.58105469  -60.35937119  -54.0859375   -57.90722656]
  [ -40.21875     -44.7109375   -47.46875     -47.27832031  -44.57226562]
  [ -40.75        -45.125       -48.          -48.          -47.984375  ]]]
python code
print(image.shape) # hwc,bgr,0-255   (1080, 1920, 3)
print(image.dtype) # uint8

# int8 ===>float32
image = image.astype('float32') # key steps
print(image.dtype) # float32

# resize
image = cv2.resize(image, (512,512))
print("image resize = ",image.shape) # (512, 512, 3)
print(image.dtype) # float32

# hwc ===> chw
b,g,r = image[:,:,0],image[:,:,1],image[:,:,2]

print(b.shape) # (512, 512)
print(g.shape) # (512, 512)
print(r.shape) # (512, 512)

bgr = np.zeros([3,b.shape[0],b.shape[1]])
print(bgr.shape)

# -mean
b -= 104
g -= 117
r -= 123

bgr[0,:,:] = b
bgr[1,:,:] = g
bgr[2,:,:] = r


print(bgr[:,:5,:5])
python code v2
image = cv2.imread(filepath) # hwc, bgr,0-255
print(image.dtype) # uint8

image = image.astype('float32') # key steps
image = cv2.resize(image, (512,512))
print("image resize = ",image.shape) # (512, 512, 3)
print(image.dtype) # float32

image -= np.array((104.00698793,116.66876762,122.67891434)) # bgr mean
image = image.transpose((2,0,1)) # hwc ===>chw

print(image[:,:5,:5])

output

(1080, 1920, 3)
uint8
float32
('image resize = ', (512, 512, 3))
float32
(512, 512)
(512, 512)
(512, 512)
(3, 512, 512)

[[[ -98.          -98.          -98.          -98.          -98.        ]
  [ -98.          -98.          -98.          -98.          -98.        ]
  [ -23.96777344  -28.58105469  -31.359375    -25.0859375   -28.90722656]
  [  -8.21875     -12.7109375   -15.46875     -15.27832031  -10.57226562]
  [  -7.75        -12.125       -15.          -15.          -10.984375  ]]

 [[-117.         -117.         -117.         -117.         -117.        ]
  [-117.         -117.         -117.         -117.         -117.        ]
  [ -43.96777344  -48.58105469  -51.359375    -45.0859375   -48.90722656]
  [ -26.21875     -30.7109375   -33.46875     -33.27832031  -33.57226562]
  [ -24.75        -29.125       -32.          -32.          -31.984375  ]]

 [[-123.         -123.         -123.         -123.         -123.        ]
  [-123.         -123.         -123.         -123.         -123.        ]
  [ -52.96777344  -57.58105469  -60.359375    -54.0859375   -57.90722656]
  [ -40.21875     -44.7109375   -47.46875     -47.27832031  -44.57226562]
  [ -40.75        -45.125       -48.          -48.          -47.984375  ]]]

Conclusions

  • Matplot.imread: dims: (height,width,channels),order: RGB,range: [0,255] dtype: uint8, plot
  • OpenCV.imread: dims: (height,width,channels),order: BGR,range: [0,255] dtype: uint8, plot
  • caffe.io.load_image: dims: (height,width,channels),order: RGB,range: [0,1] dtype: float32 (caffe_io_image = matplot_image/255.0) ,plot
  • caffe Network Input(Transformer): dims: (m,c,h,w), order: BGR, range [0,255],dtype: float32, PLOT ERROR
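Putting the table above together, a minimal sketch of manually converting a matplotlib-style RGB uint8 image into the caffe network input layout (assuming the usual ImageNet-style BGR mean; the file path is just an example):

```python
# minimal sketch: matplotlib (h,w,c) RGB uint8 -> caffe input (1,c,h,w) BGR float32, mean-subtracted
import numpy as np
import matplotlib.image as img

image = img.imread("images/cat.jpg")              # (h,w,3), RGB, uint8, [0,255]
blob = image.astype(np.float32)[:, :, ::-1]       # RGB -> BGR, float32
blob -= np.array([104.0, 117.0, 123.0], dtype=np.float32)  # subtract BGR mean per channel
blob = blob.transpose(2, 0, 1)[np.newaxis, ...]   # (h,w,c) -> (1,c,h,w)
print(blob.shape, blob.dtype)                     # (1, 3, h, w) float32
```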

Reference

History

  • 20180816: created.

R-CNN

introduction

R-CNN is a state-of-the-art detector that classifies region proposals by a finetuned Caffe model. For the full details of the R-CNN system and model, refer to its project site and the paper:

Rich feature hierarchies for accurate object detection and semantic segmentation. Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik. CVPR 2014. Arxiv 2013.

In this example, we do detection by a pure Caffe edition of the R-CNN model for ImageNet. The R-CNN detector outputs class scores for the 200 detection classes of ILSVRC13. Keep in mind that these are raw one vs. all SVM scores, so they are not probabilistically calibrated or exactly comparable across classes. Note that this off-the-shelf model is simply for convenience, and is not the full R-CNN model.

Let’s run detection on an image of a bicyclist riding a fish bike in the desert (from the ImageNet challenge—no joke).

First, we’ll need region proposals and the Caffe R-CNN ImageNet model:

Selective Search is the region proposer used by R-CNN. The selective_search_ijcv_with_python Python module takes care of extracting proposals through the selective search MATLAB implementation.

clone repo

cd $CAFFE_ROOT/caffe/python
git clone https://github.com/sergeyk/selective_search_ijcv_with_python.git

pip install tables

install matlab

Install matlab and run demo.m file to compile functions

see here

Note: restart the computer to resolve the following error:

OSError: [Errno 2] No such file or directory

compile matlab functions

cd caffe/python/caffe/selective_search_ijcv_with_python
which matlab
#/opt/MATLAB/R2016b/bin/matlab

matlab demo.m
  • run demo in matlab
    png

  • original image
    png

  • region results
    png

detect regions

Run scripts to get the Caffe R-CNN ImageNet model.

./scripts/download_model_binary.py models/bvlc_reference_rcnn_ilsvrc13

With that done, we’ll call the bundled detect.py to generate the region proposals and run the network. For an explanation of the arguments, do ./detect.py --help.

cd caffe/examples/
mkdir -p _temp
echo `pwd`/images/fish-bike.jpg > _temp/det_input.txt
../python/detect.py --crop_mode=selective_search --pretrained_model=../models/bvlc_reference_rcnn_ilsvrc13/bvlc_reference_rcnn_ilsvrc13.caffemodel --model_def=../models/bvlc_reference_rcnn_ilsvrc13/deploy.prototxt --gpu --raw_scale=255 _temp/det_input.txt _temp/det_output.h5
...
I1129 15:02:22.498908  3483 net.cpp:242] This network produces output fc-rcnn
I1129 15:02:22.498919  3483 net.cpp:255] Network initialization done.
I1129 15:02:22.577332  3483 upgrade_proto.cpp:53] Attempting to upgrade input file specified using deprecated V1LayerParameter: ../models/bvlc_reference_rcnn_ilsvrc13/bvlc_reference_rcnn_ilsvrc13.caffemodel
I1129 15:02:22.685262  3483 upgrade_proto.cpp:61] Successfully upgraded file specified using deprecated V1LayerParameter
I1129 15:02:22.685796  3483 upgrade_proto.cpp:67] Attempting to upgrade input file specified using deprecated input fields: ../models/bvlc_reference_rcnn_ilsvrc13/bvlc_reference_rcnn_ilsvrc13.caffemodel
I1129 15:02:22.685804  3483 upgrade_proto.cpp:70] Successfully upgraded file specified using deprecated input fields.
W1129 15:02:22.685809  3483 upgrade_proto.cpp:72] Note that future Caffe releases will only support input layers and not input fields.
Loading input...
selective_search_rcnn({'/home/kezunlin/program/caffe/examples/images/fish-bike.jpg'}, '/tmp/tmpkOe6J0.mat')
/home/kezunlin/program/caffe/python/caffe/detector.py:140: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
  crop = im[window[0]:window[2], window[1]:window[3]]
/home/kezunlin/program/caffe/python/caffe/detector.py:174: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
  context_crop = im[box[0]:box[2], box[1]:box[3]]
/usr/local/lib/python2.7/dist-packages/skimage/transform/_warps.py:84: UserWarning: The default mode, 'constant', will be changed to 'reflect' in skimage 0.15.
  warn("The default mode, 'constant', will be changed to 'reflect' in "
/home/kezunlin/program/caffe/python/caffe/detector.py:177: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
  crop[pad_y:(pad_y + crop_h), pad_x:(pad_x + crop_w)] = context_crop
Processed 1565 windows in 15.899 s.
/usr/local/lib/python2.7/dist-packages/pandas/core/generic.py:1299: PerformanceWarning: 
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block1_values] [items->['prediction']]

  return pytables.to_hdf(path_or_buf, key, self, **kwargs)
Saved to _temp/det_output.h5 in 0.082 s.

This run was in GPU mode. For CPU mode detection, call detect.py without the --gpu argument.

process regions

Running this outputs a DataFrame with the filenames, selected windows, and their detection scores to an HDF5 file.
(We only ran on one image, so the filenames will all be the same.)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

df = pd.read_hdf('_temp/det_output.h5', 'df')
print(df.shape)
row = df.iloc[0] # prediction(200,), bbox as input image
print 'row ',row.shape
print 'prediction ',row[0].shape
print type(row) # class 'pandas.core.series.Series
print row
(1565, 5)
row  (5,)
prediction  (200,)
<class 'pandas.core.series.Series'>
prediction    [-2.60202, -2.87814, -3.0061, -2.77251, -2.077...
ymin                                                    152.958
xmin                                                    159.692
ymax                                                    261.702
xmax                                                    340.586
Name: /home/kezunlin/program/caffe/examples/images/fish-bike.jpg, dtype: object

1565 regions were proposed with the R-CNN configuration of selective search. The number of proposals will vary from image to image based on its contents and size – selective search isn't scale invariant.

In general, detect.py is most efficient when running on a lot of images: it first extracts window proposals for all of them, batches the windows for efficient GPU processing, and then outputs the results.
Simply list an image per line in the images_file, and it will process all of them.

Although this guide gives an example of R-CNN ImageNet detection, detect.py is clever enough to adapt to different Caffe models’ input dimensions, batch size, and output categories. You can switch the model definition and pretrained model as desired. Refer to python detect.py --help for the parameters to describe your data set. There’s no need for hardcoding.

Anyway, let’s now load the ILSVRC13 detection class names and make a DataFrame of the predictions. Note you’ll need the auxiliary ilsvrc2012 data fetched by data/ilsvrc12/get_ilsvrc12_aux.sh.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

#n01443537 goldfish
#n03445777 golf ball
#...
with open('../data/ilsvrc12/det_synset_words.txt') as f: # 200 classes from 1000 imagenet classes
    labels_df = pd.DataFrame([
        {
            'synset_id': l.strip().split(' ')[0],
            'name': ' '.join(l.strip().split(' ')[1:]).split(',')[0]
        }
        for l in f.readlines()
    ])
labels_df.sort_values(by='synset_id') # from a... to z
print labels_df.shape # (200, 2)
print labels_df.head(5)
(200, 2)
        name  synset_id
0  accordion  n02672831
1   airplane  n02691156
2        ant  n02219486
3   antelope  n02419796
4      apple  n07739125
#print type(df.prediction) # <class 'pandas.core.series.Series'>

print df.prediction.values.shape # numpy.ndarray (1565,)
print df.prediction.values[0].shape # numpy.ndarray (200,)

print np.vstack(df.prediction.values).shape # (1565, 200)

predictions_df = pd.DataFrame(np.vstack(df.prediction.values), columns=labels_df['name'])
#print predictions_df.values.shape # (1565, 200)
print(predictions_df.iloc[:5,:7])
(1565,)
(200,)
(1565, 200)
name  accordion  airplane       ant  antelope     apple  armadillo  artichoke
0     -2.602018 -2.878137 -3.006104 -2.772514 -2.077227  -2.590448  -2.414262
1     -2.997767 -3.312270 -2.878942 -3.434367 -2.227469  -2.492260  -2.383878
2     -2.476110 -3.145484 -2.377191 -2.684406 -2.289587  -2.428077  -2.390187
3     -2.362699 -2.784188 -1.981096 -2.664146 -2.207042  -2.299127  -2.181105
4     -2.929469 -2.323617 -2.755007 -3.165601 -2.188648  -2.486410  -2.505435

Let’s look at the activations.

plt.gray()
plt.matshow(predictions_df.values) # (1565, 200)
plt.xlabel('Classes')
plt.ylabel('Windows')

png

Now let’s take max across all windows and plot the top classes.

max_s = predictions_df.max(0)
max_s = max_s.sort_values(ascending=False)
print(max_s[:10])
name
person          1.839882
bicycle         0.855625
unicycle        0.085192
motorcycle      0.003604
turtle         -0.030388
banjo          -0.114999
electric fan   -0.220595
cart           -0.225192
lizard         -0.365949
helmet         -0.477555
dtype: float32

The top detections are in fact a person and bicycle.
Picking good localizations is a work in progress; we pick the top-scoring person and bicycle detections.

i = predictions_df['person'].argmax() # 70  rect
j = predictions_df['bicycle'].argmax()# 262 rect

# Show top predictions for top detection.
f = pd.Series(df['prediction'].iloc[i], index=labels_df['name']) # (200,)
#print f.head(5)
#print
print('Top detection:')
print(f.sort_values(ascending=False)[:5])
print('')

# Show top predictions for second-best detection.
f = pd.Series(df['prediction'].iloc[j], index=labels_df['name']) # (200,)
print('Second-best detection:')
print(f.sort_values(ascending=False)[:5])
Top detection:
name
person             1.839882
swimming trunks   -1.157806
turtle            -1.168884
tie               -1.217267
rubber eraser     -1.246662
dtype: float32

Second-best detection:
name
bicycle     0.855625
unicycle   -0.334367
scorpion   -0.824552
lobster    -0.965544
lamp       -1.076224
dtype: float32
# Find, print, and display the top detections: person and bicycle.
i = predictions_df['person'].argmax()
j = predictions_df['bicycle'].argmax()

# Show top predictions for top detection.
f = pd.Series(df['prediction'].iloc[i], index=labels_df['name'])
print('Top detection:')
print(f.sort_values(ascending=False)[:5])
print('')

# Show top predictions for second-best detection.
f = pd.Series(df['prediction'].iloc[j], index=labels_df['name'])
print('Second-best detection:')
print(f.sort_values(ascending=False)[:5])

# Show top detection in red, second-best top detection in blue.
im = plt.imread('images/fish-bike.jpg')
plt.imshow(im)
currentAxis = plt.gca()

det = df.iloc[i]
coords = (det['xmin'], det['ymin']), det['xmax'] - det['xmin'], det['ymax'] - det['ymin']
print coords # ((207.792, 7.6959999999999997), 134.71799999999999, 155.88200000000001)
currentAxis.add_patch(plt.Rectangle(*coords, fill=False, edgecolor='r', linewidth=5))

det = df.iloc[j]
coords = (det['xmin'], det['ymin']), det['xmax'] - det['xmin'], det['ymax'] - det['ymin']
print coords # ((108.706, 184.70400000000001), 284.78999999999996, 127.98399999999998)
currentAxis.add_patch(plt.Rectangle(*coords, fill=False, edgecolor='b', linewidth=5))
Top detection:
name
person             1.839882
swimming trunks   -1.157806
turtle            -1.168884
tie               -1.217267
rubber eraser     -1.246662
dtype: float32

Second-best detection:
name
bicycle     0.855625
unicycle   -0.334367
scorpion   -0.824552
lobster    -0.965544
lamp       -1.076224
dtype: float32
((207.792, 7.6959999999999997), 134.71799999999999, 155.88200000000001)
((108.706, 184.70400000000001), 284.78999999999996, 127.98399999999998)

png

That’s cool. Let’s take all ‘bicycle’ detections and NMS them to get rid of overlapping windows.

def nms_detections(dets, overlap=0.3):
    """
    Non-maximum suppression: Greedily select high-scoring detections and
    skip detections that are significantly covered by a previously
    selected detection.

    This version is translated from Matlab code by Tomasz Malisiewicz,
    who sped up Pedro Felzenszwalb's code.

    Parameters
    ----------
    dets: ndarray
        each row is ['xmin', 'ymin', 'xmax', 'ymax', 'score']
    overlap: float
        minimum overlap ratio (0.3 default); boxes whose IoU with a picked box exceeds this are dropped

    Output
    ------
    dets: ndarray
        remaining after suppression.
    """
    x1 = dets[:, 0]
    y1 = dets[:, 1]
    x2 = dets[:, 2]
    y2 = dets[:, 3]
    ind = np.argsort(dets[:, 4]) # current index set, sorted by score (min--->max)

    w = x2 - x1
    h = y2 - y1
    area = (w * h).astype(float)
    """
    dets
    pick = []
    ind = [a,b,c,d,e,f]

    while not ind.empty:
        f, pick=[f],     ind=[a,b,c,d,e], o=[0.1,0.2,0.5,0.9,0.2], keep_ind=[0,1,4], ind=[a,b,e]
        e, pick=[f,e],   ind=[a,b],       o=[0.4,0.1],             keep_ind=[1],     ind=[b]
        b, pick=[f,e,b], ind=[],          o=[],                    keep_ind=[],      ind=[]
    return dets[pick]
    """

    pick = [] # picked indices
    while len(ind) > 0:
        i = ind[-1] # choose the remaining best-scoring index
        pick.append(i)
        ind = ind[:-1] # remove it from the candidate set

        xx1 = np.maximum(x1[i], x1[ind])
        yy1 = np.maximum(y1[i], y1[ind])
        xx2 = np.minimum(x2[i], x2[ind])
        yy2 = np.minimum(y2[i], y2[ind])

        w = np.maximum(0., xx2 - xx1)
        h = np.maximum(0., yy2 - yy1)

        wh = w * h
        o = wh / (area[i] + area[ind] - wh) # IoU with each remaining box, e.g. [0.1,0.2,0.5,0.9,0.2]

        keep_ind = np.nonzero(o <= overlap)[0] # (array([0, 1, 4]),) ===> [0 1 4]
        ind = ind[keep_ind]

    return dets[pick, :]
scores = predictions_df['bicycle'] # (1565,)
windows = df[['xmin', 'ymin', 'xmax', 'ymax']].values # (1565, 4)
dets = np.hstack((windows, scores[:, np.newaxis])) # (1565, 4) (1565,1)===>(1565,5) xmin,ymin,xmax,ymax,score
nms_dets = nms_detections(dets,0.3)
print dets.shape # (1565, 5)
print nms_dets.shape # (181, 5)
(1565, 5)
(181, 5)
print nms_dets[:3]
[[ 108.706       184.704       393.496       312.688         0.85562503]
 [   0.           14.43        397.344       323.27         -0.73134482]
 [ 131.794       202.982       249.196       290.562        -1.26836455]]

Show top 3 NMS’d detections for ‘bicycle’ in the image and note the gap between the top scoring box (red) and the remaining boxes.

plt.imshow(im)
currentAxis = plt.gca()
colors = ['r', 'b', 'y']
for c, det in zip(colors, nms_dets[:3]):
    currentAxis.add_patch(
        plt.Rectangle((det[0], det[1]), det[2]-det[0], det[3]-det[1],
                      fill=False, edgecolor=c, linewidth=5)
    )
print 'scores:', nms_dets[:3, 4]
scores: [ 0.85562503 -0.73134482 -1.26836455]

png

This was an easy instance for bicycle as it was in the class’s training set. However, the person result is a true detection since this was not in the set for that class.

You should try out detection on an image of your own next!

(Remove the temp directory to clean up, and we’re done.)

!rm -rf _temp

Reference

History

  • 20180816: created.

Tutorial

In this tutorial we will do multilabel classification on PASCAL VOC 2012.

Multilabel classification is a generalization of multiclass classification, where each instance (image) can belong to many classes. For example, an image may both belong to a “beach” category and a “vacation pictures” category. In multiclass classification, on the other hand, each image belongs to a single class.

Caffe supports multilabel classification through the SigmoidCrossEntropyLoss layer, and we will load data using a Python data layer. Data could also be provided through HDF5 or LMDB data layers, but the python data layer provides endless flexibility, so that’s what we will use.
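To make the loss concrete, here is a tiny numpy sketch of what a sigmoid cross-entropy loss computes for one image: an independent sigmoid and cross-entropy per class, so several labels can be 1 at once (a toy illustration, not Caffe's exact normalization):

```python
# toy sketch: multilabel sigmoid cross-entropy for one image over 20 classes
import numpy as np

z = np.random.randn(20).astype(np.float32)     # raw scores from the "score" layer
y = np.zeros(20, dtype=np.float32)
y[[12, 14]] = 1                                 # e.g. the image contains "horse" and "person"

p = 1.0 / (1.0 + np.exp(-z))                    # independent sigmoid per class
loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
print(loss)                                     # averaged over the 20 classes
```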

Preliminaries

  • First, make sure you compile caffe using
    WITH_PYTHON_LAYER := 1

  • Second, download PASCAL VOC 2012. It’s available here:

  • Third, import modules:

import sys 
import os

import numpy as np
import os.path as osp
import matplotlib.pyplot as plt

from copy import copy

%matplotlib inline
plt.rcParams['figure.figsize'] = (6, 6)

caffe_root = '../' # this file is expected to be in {caffe_root}/examples
sys.path.append(caffe_root + 'python')
import caffe # If you get "No module named _caffe", either you have not built pycaffe or you have the wrong path.

from caffe import layers as L, params as P # Shortcuts to define the net prototxt.

sys.path.append("pycaffe/layers") # the datalayers we will use are in this directory.
sys.path.append("pycaffe") # the tools file is in this folder

import tools #this contains some tools that we need
  • Fourth, set data directories and initialize caffe
# set data root directory, e.g:
pascal_root = osp.join(caffe_root, 'data/pascal/VOC2012')

# these are the PASCAL classes, we'll need them later.
classes = np.asarray(['aeroplane', 'bicycle', 'bird', 'boat', 'bottle',
'bus', 'car', 'cat', 'chair', 'cow', 'diningtable',
'dog', 'horse', 'motorbike', 'person', 'pottedplant',
'sheep', 'sofa', 'train', 'tvmonitor'])

# make sure we have the caffenet weight downloaded.
#if not os.path.isfile(caffe_root + 'models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel'):
# print("Downloading pre-trained CaffeNet model...")
# !../scripts/download_model_binary.py ../models/bvlc_reference_caffenet

# initialize caffe for gpu mode
caffe.set_mode_gpu()
caffe.set_device(0)

Define network prototxts

  • Let’s start by defining the nets using caffe.NetSpec. Note how we used the SigmoidCrossEntropyLoss layer. This is the right loss for multilabel classification. Also note how the data layer is defined.
# helper function for common structures
def conv_relu(bottom, ks, nout, stride=1, pad=0, group=1):
    conv = L.Convolution(bottom, kernel_size=ks, stride=stride,
                         num_output=nout, pad=pad, group=group)
    return conv, L.ReLU(conv, in_place=True)

# another helper function
def fc_relu(bottom, nout):
    fc = L.InnerProduct(bottom, num_output=nout)
    return fc, L.ReLU(fc, in_place=True)

# yet another helper function
def max_pool(bottom, ks, stride=1):
    return L.Pooling(bottom, pool=P.Pooling.MAX, kernel_size=ks, stride=stride)

# main netspec wrapper
def caffenet_multilabel(data_layer_params, datalayer):
    # setup the python data layer
    n = caffe.NetSpec()
    n.data, n.label = L.Python(module='pascal_multilabel_datalayers', layer=datalayer,
                               ntop=2, param_str=str(data_layer_params))

    # the net itself
    n.conv1, n.relu1 = conv_relu(n.data, 11, 96, stride=4)
    n.pool1 = max_pool(n.relu1, 3, stride=2)
    n.norm1 = L.LRN(n.pool1, local_size=5, alpha=1e-4, beta=0.75)
    n.conv2, n.relu2 = conv_relu(n.norm1, 5, 256, pad=2, group=2)
    n.pool2 = max_pool(n.relu2, 3, stride=2)
    n.norm2 = L.LRN(n.pool2, local_size=5, alpha=1e-4, beta=0.75)
    n.conv3, n.relu3 = conv_relu(n.norm2, 3, 384, pad=1)
    n.conv4, n.relu4 = conv_relu(n.relu3, 3, 384, pad=1, group=2)
    n.conv5, n.relu5 = conv_relu(n.relu4, 3, 256, pad=1, group=2)
    n.pool5 = max_pool(n.relu5, 3, stride=2)
    n.fc6, n.relu6 = fc_relu(n.pool5, 4096)
    n.drop6 = L.Dropout(n.relu6, in_place=True)
    n.fc7, n.relu7 = fc_relu(n.drop6, 4096)
    n.drop7 = L.Dropout(n.relu7, in_place=True)
    n.score = L.InnerProduct(n.drop7, num_output=20) # z value
    n.loss = L.SigmoidCrossEntropyLoss(n.score, n.label) # a = sigmoid(z)

    return str(n.to_proto())

Write nets and solver files

  • Now we can create the net and solver prototxts. For the solver, we use the CaffeSolver class from the “tools” module.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
workdir = './pascal_multilabel_with_datalayer'
if not os.path.isdir(workdir):
    os.makedirs(workdir)

solverprototxt = tools.CaffeSolver(trainnet_prototxt_path = osp.join(workdir, "trainnet.prototxt"),
                                   testnet_prototxt_path = osp.join(workdir, "valnet.prototxt"))
solverprototxt.sp['display'] = "1"
solverprototxt.sp['base_lr'] = "0.0001"
solverprototxt.write(osp.join(workdir, 'solver.prototxt'))

# write train net.
with open(osp.join(workdir, 'trainnet.prototxt'), 'w') as f:
    # provide parameters to the data layer as a python dictionary. Easy as pie!
    data_layer_params = dict(batch_size = 128, im_shape = [227, 227], split = 'train', pascal_root = pascal_root)
    f.write(caffenet_multilabel(data_layer_params, 'PascalMultilabelDataLayerSync'))

# write validation net.
with open(osp.join(workdir, 'valnet.prototxt'), 'w') as f:
    data_layer_params = dict(batch_size = 128, im_shape = [227, 227], split = 'val', pascal_root = pascal_root)
    f.write(caffenet_multilabel(data_layer_params, 'PascalMultilabelDataLayerSync'))
  • This net uses a python datalayer: ‘PascalMultilabelDataLayerSync’, which is defined in ‘./pycaffe/layers/pascal_multilabel_datalayers.py’.

  • Take a look at the code. It’s quite straightforward, and gives you full control over data and labels.
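If you just want to see the overall shape of such a layer before opening that file, the skeleton below is a rough sketch (the class name and internals are made up; only the setup/reshape/forward/backward hooks and the param_str handling follow the real Caffe Python layer interface):

import numpy as np
import caffe

class MyMultilabelDataLayer(caffe.Layer):
    """A made-up, minimal data layer showing the hooks Caffe expects."""

    def setup(self, bottom, top):
        params = eval(self.param_str)       # the dict we passed via param_str above
        self.batch_size = params['batch_size']
        self.im_shape = params['im_shape']
        # top[0]: image batch, top[1]: one 0/1 indicator per PASCAL class
        top[0].reshape(self.batch_size, 3, self.im_shape[0], self.im_shape[1])
        top[1].reshape(self.batch_size, 20)

    def reshape(self, bottom, top):
        pass                                # shapes are already fixed in setup

    def forward(self, bottom, top):
        for i in range(self.batch_size):
            # a real layer loads, resizes and normalizes the next image here
            top[0].data[i, ...] = np.zeros((3,) + tuple(self.im_shape))
            top[1].data[i, ...] = np.zeros(20)

    def backward(self, top, propagate_down, bottom):
        pass                                # data layers have nothing to backprop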

  • Now we can load the caffe solver as usual.

1
2
3
4
5
solver = caffe.SGDSolver(osp.join(workdir, 'solver.prototxt'))
solver.net.copy_from(caffe_root + 'models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel')
solver.test_nets[0].share_with(solver.net)
solver.step(1) # load 128 train images
# 5717 train images; 5823 val images
BatchLoader initialized with 5717 images
PascalMultilabelDataLayerSync initialized for split: train, with bs: 128, im_shape: [227, 227].
BatchLoader initialized with 5823 images
PascalMultilabelDataLayerSync initialized for split: val, with bs: 128, im_shape: [227, 227].
1
2
3
4
5
6
7
8
print solver.net.blobs['data'].data.shape # (128, 3, 227, 227)
print solver.net.blobs['label'].data.shape # (128, 20)

#print solver.net.blobs['loss'].data # 13.8629436493
#print solver.test_nets[0].blobs['data'].data.shape # (128, 3, 227, 227) no test images loaded

#print solver.net.params['score'][0].data.shape # (20, 4096) filled weights
#print solver.net.params['score'][0].data[:20,:5]
(128, 3, 227, 227)
(128, 20)
  • Let’s check the data we have loaded.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
transformer = tools.SimpleTransformer() # This is simply to add back the bias, re-shuffle the color channels to RGB, and so on...
image_index = 0 # First image in the batch.
image = solver.net.blobs['data'].data[image_index, ...]
print image.shape # (3, 227, 227) BGR [0,255]
#print image[0,:10,:10]

plot_image = transformer.deprocess(copy(image))
#print plot_image.shape #(227, 227, 3) RGB [0,255]
#print plot_image[:10,:10,0]

image_labels = solver.net.blobs['label'].data[image_index]
print image_labels.shape # (20,)
print image_labels #float32 [ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0.]

plt.figure()
plt.imshow(plot_image)
gtlist = image_labels.astype(np.int) # float32->int labels
plt.title('GT: {}'.format(classes[np.where(gtlist)])) # ground truth label list
plt.axis('off')
(3, 227, 227)
(20,)
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  1.  0.  0.  0.
  1.  0.]

png

  • NOTE: we are reading the image from the data layer, so the resolution is lower than the original PASCAL image.

Train a net

  • Let’s train the net. First, though, we need some way to measure the accuracy. Hamming distance is commonly used in multilabel problems; here we report the fraction of matching labels per image, i.e. 1 minus the normalized Hamming distance. We also need a simple test loop. Let’s write that down.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
def hamming_distance(gt, est):
    # accu for only one image
    # gt(20,)  [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1]
    # est(20,) [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1]
    # accu = 19/20 = 0.95
    #print gt.shape,est.shape
    return sum([1 for (g, e) in zip(gt, est) if g == e]) / float(len(gt))

def check_accuracy(net, num_batches, batch_size = 128):
    acc = 0.0
    for t in range(num_batches):
        net.forward() # load 128 batch images from test_nets
        gts = net.blobs['label'].data # (128,20)
        gts = gts.astype(np.int) # float32->int

        ests = net.blobs['score'].data > 0 # (128,20) z-score>0===>1,otherwise ===>0
        ests = ests.astype(np.int) # bool->int

        for gt, est in zip(gts, ests): #for each ground truth and estimated label vector
            acc += hamming_distance(gt, est) # gt(20,) est(20,) for 1 image
    return acc / (num_batches * batch_size)
  • Alright, now let’s train for a while
1
2
3
4
5
6
7
8
9
10
11
%%time
for itt in range(6):
    solver.step(100)
    print 'itt:{:3d}'.format((itt + 1) * 100), 'accuracy:{0:.4f}'.format(check_accuracy(solver.test_nets[0], 50))

#itt:100 accuracy:0.9591
#itt:200 accuracy:0.9599
#itt:300 accuracy:0.9596
#itt:400 accuracy:0.9584
#itt:500 accuracy:0.9598
#itt:600 accuracy:0.9590
itt:100 accuracy:0.9591
itt:200 accuracy:0.9599
itt:300 accuracy:0.9596
itt:400 accuracy:0.9584
itt:500 accuracy:0.9598
itt:600 accuracy:0.9590
  • Great, the accuracy is increasing, and it seems to converge rather quickly. It may seem strange that it starts off so high, but that is because the ground truth is sparse. There are 20 classes in PASCAL, and usually only one or two are present in an image. So predicting all zeros yields rather high accuracy. Let’s check to make sure.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
%%time
num_train_images = 5717
num_val_images = 5823
num_batches = num_val_images/128 # 45

def check_baseline_accuracy(net, num_batches, batch_size = 128):
    acc = 0.0
    for t in range(num_batches):
        net.forward()
        gts = net.blobs['label'].data # (128,20) labels
        ests = np.zeros((batch_size, 20)) # (128,20) set to [0,0,0,...0,0]
        for gt, est in zip(gts, ests): #for each ground truth and estimated label vector
            acc += hamming_distance(gt, est)
    return acc / (num_batches * batch_size)

# gts 19 + 1, est 20, accu = 19/20 = 0.95
# gts 18 + 2, est 20, accu = 18/20 = 0.90
# avg cases: 0.925
print 'Baseline accuracy:{0:.4f}'.format(check_baseline_accuracy(solver.test_nets[0], num_batches))
Baseline accuracy:0.9241
CPU times: user 40.4 s, sys: 864 ms, total: 41.3 s
Wall time: 41.3 s

Look at some prediction results

1
2
3
4
5
6
7
8
9
10
11
12
13
14
test_net = solver.test_nets[0]
print classes
for image_index in range(5):
    print
    plt.figure()
    plot_image = transformer.deprocess(copy(test_net.blobs['data'].data[image_index,...]))
    plt.imshow(plot_image)
    gtlist = test_net.blobs['label'].data[image_index, ...].astype(np.int)
    print 'gt',gtlist
    estlist = test_net.blobs['score'].data[image_index, ...] > 0
    estlist = estlist.astype(np.int)
    print 'est',estlist
    plt.title('GT: {} \n EST: {}'.format(classes[np.where(gtlist)], classes[np.where(estlist)]))
    plt.axis('off')
['aeroplane' 'bicycle' 'bird' 'boat' 'bottle' 'bus' 'car' 'cat' 'chair'
 'cow' 'diningtable' 'dog' 'horse' 'motorbike' 'person' 'pottedplant'
 'sheep' 'sofa' 'train' 'tvmonitor']

gt [0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0]
est [0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0]

gt [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
est [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]

gt [0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0]
est [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

gt [0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0]
est [0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0]

gt [0 0 0 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 0]
est [0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0]

png

png

png

png

png

Reference

History

  • 20180816: created.

Tutorial

Caffe networks can be transformed to your particular needs by editing the model parameters. The data, diffs, and parameters of a net are all exposed in pycaffe.

Roll up your sleeves for net surgery with pycaffe!
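“Exposed” here means plain numpy arrays that you can read and overwrite in place. Assuming a net has been loaded (as in the cells below), the access pattern looks like this:

# Activations and gradients live in net.blobs, parameters in net.params;
# all of them are numpy arrays that can be read and modified in place.
acts  = net.blobs['conv'].data   # forward activations of the 'conv' blob
diffs = net.blobs['conv'].diff   # gradient buffer of the same blob (filled by backward())
W = net.params['conv'][0].data   # [0] -> weights of the 'conv' layer
b = net.params['conv'][1].data   # [1] -> biases
W *= 0.5                         # in-place edits take effect on the next net.forward()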

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Make sure that caffe is on the python path:
caffe_root = '../' # this file is expected to be in {caffe_root}/examples
import sys
sys.path.insert(0, caffe_root + 'python')

import caffe

# configure plotting
plt.rcParams['figure.figsize'] = (10, 10)
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

Designer Filters

To show how to load, manipulate, and save parameters we’ll design our own filters into a simple network that’s only a single convolution layer. This net has two blobs, data for the input and conv for the convolution output and one parameter conv for the convolution filter weights and biases.
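The prototxt itself is not reproduced here, but based on the shapes printed below it is roughly equivalent to this NetSpec sketch (the input layer type and filler values are assumptions, not the literal file):

from caffe import layers as L

# Rough NetSpec equivalent of a single-convolution net (the actual
# net_surgery/conv.prototxt may differ in its input definition and fillers).
ns = caffe.NetSpec()
ns.data = L.DummyData(shape=dict(dim=[1, 1, 100, 100]))  # one grayscale image; reshaped later
ns.conv = L.Convolution(ns.data, num_output=3, kernel_size=5,
                        weight_filler=dict(type='gaussian', std=0.01),
                        bias_filler=dict(type='constant', value=0))
print(ns.to_proto())  # inspect the generated prototxt text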

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
# Load the net, list its data and params, and filter an example image.
caffe.set_mode_cpu()
net = caffe.Net('net_surgery/conv.prototxt', caffe.TEST)
print("blobs {}\nparams {}".format(net.blobs.keys(), net.params.keys()))

# load image and prepare as a single input batch for Caffe
im = np.array(caffe.io.load_image('images/cat_gray.jpg', color=False)).squeeze()
# caffe.io.load_image: dims: (height,width,channels),order: RGB,range: [0,1] dtype: float32
#(360, 480, 1)-->(360, 480)
#print im[:5,:5]

plt.title("original image")
plt.imshow(im)
plt.axis('off')

im_input = im[np.newaxis, np.newaxis, :, :] #(1, 1, 360, 480) (c,h,w) [0,1] float32

net.blobs['data'].reshape(*im_input.shape) # (1, 1, 100, 100) --->(1, 1, 360, 480)
print net.blobs['data'].data.shape
net.blobs['data'].data[...] = im_input
blobs ['data', 'conv']
params ['conv']
[[ 0.10196079  0.10588235  0.09803922  0.10980392  0.11372549]
 [ 0.10196079  0.10588235  0.09803922  0.10196079  0.10980392]
 [ 0.10196079  0.10588235  0.10196079  0.10196079  0.10588235]
 [ 0.10588235  0.10196079  0.10588235  0.10980392  0.11372549]
 [ 0.11764706  0.10196079  0.10196079  0.10588235  0.10980392]]
(1, 1, 360, 480)

png

The convolution weights are initialized from Gaussian noise while the biases are initialized to zero. These random filters give output somewhat like edge detections.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
# helper show filter outputs
def show_filters(net):
    net.forward()

    for name,blob in net.blobs.iteritems():
        print name,blob.data.shape
        # data (1, 1, 360, 480) o = (i+2*p-k)/s+1 ->360-5+1=356, 480-5+1=476
        # conv (1, 3, 356, 476)
    print
    for name,param in net.params.iteritems():
        print name,param[0].data.shape # conv (3, 1, 5, 5)

    plt.figure()
    filt_count = 3
    filt_min, filt_max = net.blobs['conv'].data.min(), net.blobs['conv'].data.max()
    print filt_min,filt_max
    for i in range(3):
        plt.subplot(1,4,i+2)
        plt.title("filter #{} output".format(i))

        plt.imshow(net.blobs['conv'].data[0, i], vmin=filt_min, vmax=filt_max)
        #plt.imshow(net.blobs['conv'].data[0, i])
        #cbar = plt.colorbar() # depends on vmin,vmax

        plt.tight_layout()
        plt.axis('off')

# filter the image with initial
show_filters(net)
data (1, 1, 360, 480)
conv (1, 3, 356, 476)

conv (3, 1, 5, 5)
-0.0651154 0.097207

png

Raising the bias of a filter will correspondingly raise its output:

1
2
3
4
5
6
7
8
9
10
# pick first filter output
conv0 = net.blobs['conv'].data[0, 0]
print("pre-surgery output mean {:.2f}".format(conv0.mean()))
# set first filter bias to 1
#print net.params['conv'][1].data.shape
net.params['conv'][1].data[0] = 1. #(3,)
net.forward()
print("post-surgery output mean {:.2f}".format(conv0.mean()))
# for conv data,z = wx+b
# z = wx+0, z = wx+1
pre-surgery output mean 0.04
(3,)
post-surgery output mean 1.04

Altering the filter weights is more exciting since we can assign any kernel like Gaussian blur, the Sobel operator for edges, and so on. The following surgery turns the 0th filter into a Gaussian blur and the 1st and 2nd filters into the horizontal and vertical gradient parts of the Sobel operator.

See how the 0th output is blurred, the 1st picks up horizontal edges, and the 2nd picks up vertical edges.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
ksize = net.params['conv'][0].data.shape[2:] # conv (3, 1, 5, 5)--->(5,5)

# make Gaussian blur
sigma = 1.
y, x = np.mgrid[-ksize[0]//2 + 1:ksize[0]//2 + 1, -ksize[1]//2 + 1:ksize[1]//2 + 1]
g = np.exp(-((x**2 + y**2)/(2.0*sigma**2)))
gaussian = (g / g.sum()).astype(np.float32)

net.params['conv'][0].data[0] = gaussian

# make Sobel operator for edge detection
net.params['conv'][0].data[1:] = 0.
sobel = np.array((-1, -2, -1, 0, 0, 0, 1, 2, 1), dtype=np.float32).reshape((3,3))
net.params['conv'][0].data[1, 0, 1:-1, 1:-1] = sobel # horizontal
net.params['conv'][0].data[2, 0, 1:-1, 1:-1] = sobel.T # vertical
show_filters(net)
data (1, 1, 360, 480)
conv (1, 3, 356, 476)

conv (3, 1, 5, 5)
-3.67843 3.77647

png

With net surgery, parameters can be transplanted across nets, regularized by custom per-parameter operations, and transformed according to your schemes.
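For example, a custom per-parameter operation can be a single in-place numpy expression on the weight array. The snippet below (a sketch, not part of the original notebook) prunes small weights from the conv layer we just edited:

# Sketch: zero out small-magnitude weights of the 'conv' layer in place (simple pruning).
W = net.params['conv'][0].data        # (3, 1, 5, 5) filter bank
threshold = 0.01 * np.abs(W).max()
W[np.abs(W) < threshold] = 0.0        # edits the shared memory directly
net.forward()                         # the next forward pass uses the pruned filters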

Casting a Classifier into a Fully Convolutional Network

Let’s take the standard Caffe Reference ImageNet model “CaffeNet” and transform it into a fully convolutional net for efficient, dense inference on large inputs. This model generates a classification map that covers a given input size instead of a single classification. In particular, an 8 $\times$ 8 classification map on a 451 $\times$ 451 input gives 64x the output in only 3x the time. The computation exploits a natural efficiency of convolutional network (convnet) structure by amortizing the computation of overlapping receptive fields.

To do so we translate the InnerProduct matrix multiplication layers of CaffeNet into Convolutional layers. This is the only change: the other layer types are agnostic to spatial size. Convolution is translation-invariant, activations are elementwise operations, and so on. The fc6 inner product when carried out as convolution by fc6-conv turns into a 6 $\times$ 6 filter with stride 1 on pool5. Back in image space this gives a classification for each 227 $\times$ 227 box with stride 32 in pixels. Remember the equation for output map / receptive field size, output = (input - kernel_size) / stride + 1, and work out the indexing details for a clear understanding.
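As a quick sanity check (not in the original notebook), the sketch below traces the spatial size of a 451 $\times$ 451 input through CaffeNet with that formula; the final value of 8 matches the 8 $\times$ 8 classification map printed later.

# Trace the spatial size of a 451x451 input through CaffeNet's layers
# using out = (in + 2*pad - kernel) / stride + 1 (integer division).
def out_size(i, k, s=1, p=0):
    return (i + 2 * p - k) // s + 1

size = 451
size = out_size(size, k=11, s=4)  # conv1 -> 111
size = out_size(size, k=3, s=2)   # pool1 -> 55
size = out_size(size, k=5, p=2)   # conv2 -> 55
size = out_size(size, k=3, s=2)   # pool2 -> 27
size = out_size(size, k=3, p=1)   # conv3..conv5 keep 27
size = out_size(size, k=3, s=2)   # pool5 -> 13
size = out_size(size, k=6)        # fc6-conv (6x6 kernel, stride 1) -> 8
print(size)                       # 8, i.e. an 8x8 classification map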

1
!diff net_surgery/bvlc_caffenet_full_conv.prototxt ../models/bvlc_reference_caffenet/deploy.prototxt
1,2c1
< # Fully convolutional network version of CaffeNet.
< name: "CaffeNetConv"
---
> name: "CaffeNet"
7,11c6
<   input_param {
<     # initial shape for a fully convolutional network:
<     # the shape can be set for each input by reshape.
<     shape: { dim: 1 dim: 3 dim: 451 dim: 451 }
<   }
---
>   input_param { shape: { dim: 10 dim: 3 dim: 227 dim: 227 } }
157,158c152,153
<   name: "fc6-conv"
<   type: "Convolution"
---
>   name: "fc6"
>   type: "InnerProduct"
160,161c155,156
<   top: "fc6-conv"
<   convolution_param {
---
>   top: "fc6"
>   inner_product_param {
163d157
<     kernel_size: 6
169,170c163,164
<   bottom: "fc6-conv"
<   top: "fc6-conv"
---
>   bottom: "fc6"
>   top: "fc6"
175,176c169,170
<   bottom: "fc6-conv"
<   top: "fc6-conv"
---
>   bottom: "fc6"
>   top: "fc6"
182,186c176,180
<   name: "fc7-conv"
<   type: "Convolution"
<   bottom: "fc6-conv"
<   top: "fc7-conv"
<   convolution_param {
---
>   name: "fc7"
>   type: "InnerProduct"
>   bottom: "fc6"
>   top: "fc7"
>   inner_product_param {
188d181
<     kernel_size: 1
194,195c187,188
<   bottom: "fc7-conv"
<   top: "fc7-conv"
---
>   bottom: "fc7"
>   top: "fc7"
200,201c193,194
<   bottom: "fc7-conv"
<   top: "fc7-conv"
---
>   bottom: "fc7"
>   top: "fc7"
207,211c200,204
<   name: "fc8-conv"
<   type: "Convolution"
<   bottom: "fc7-conv"
<   top: "fc8-conv"
<   convolution_param {
---
>   name: "fc8"
>   type: "InnerProduct"
>   bottom: "fc7"
>   top: "fc8"
>   inner_product_param {
213d205
<     kernel_size: 1
219c211
<   bottom: "fc8-conv"
---
>   bottom: "fc8"

The only differences needed in the architecture are to change the fully connected classifier inner product layers into convolutional layers with the right filter size – 6 x 6, since the reference model classifiers take the 6 x 6 spatial elements of pool5 as input – and stride 1 for dense classification. Note that the layers are renamed so that Caffe does not try to blindly load the old parameters when it maps layer names to the pretrained model.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
# Load the original network and extract the fully connected layers' parameters.
net = caffe.Net('../models/bvlc_reference_caffenet/deploy.prototxt',
'../models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel',
caffe.TEST)
params = ['fc6', 'fc7', 'fc8']
# fc_params = {name: (weights, biases)}
fc_params = {pr: (net.params[pr][0].data, net.params[pr][1].data) for pr in params}

for pr in params:
    print '{} weights are {} dimensional and biases are {} dimensional'.format(pr, fc_params[pr][0].shape, fc_params[pr][1].shape)

pr = 'fc6'
print net.params[pr][0].data[0,:6*6] # no weight_filler,loaded from weights file
print net.params[pr][1].data[0] # no bias_filler,loaded from weights file
fc6 weights are (4096, 9216) dimensional and biases are (4096,) dimensional
fc7 weights are (4096, 4096) dimensional and biases are (4096,) dimensional
fc8 weights are (1000, 4096) dimensional and biases are (1000,) dimensional
[ 0.00639847  0.00915686  0.00467043  0.00118941  0.00083305  0.00249258
  0.00249609 -0.00354958 -0.00502381 -0.00660044 -0.00810635 -0.00120969
 -0.00182751 -0.00181385 -0.00327348 -0.00657627 -0.01059825 -0.00223066
  0.00023664  0.00040984 -0.00052619 -0.00124062 -0.00269398 -0.00051081
  0.0014997   0.00123309 -0.00013806 -0.00111619  0.00321043  0.00284487
  0.00051387 -0.00087142 -0.00038937 -0.0008678   0.0049024   0.00155215]
0.983698
1
2
for layer_name, blob in net.blobs.iteritems():
    print layer_name + '\t' + str(blob.data.shape)
data	(10, 3, 227, 227)
conv1	(10, 96, 55, 55)
pool1	(10, 96, 27, 27)
norm1	(10, 96, 27, 27)
conv2	(10, 256, 27, 27)
pool2	(10, 256, 13, 13)
norm2	(10, 256, 13, 13)
conv3	(10, 384, 13, 13)
conv4	(10, 384, 13, 13)
conv5	(10, 256, 13, 13)
pool5	(10, 256, 6, 6)
fc6	(10, 4096)
fc7	(10, 4096)
fc8	(10, 1000)
prob	(10, 1000)
1
2
for layer_name, param in net.params.iteritems():
    print layer_name + '\t' + str(param[0].data.shape), str(param[1].data.shape)
conv1	(96, 3, 11, 11) (96,)
conv2	(256, 48, 5, 5) (256,)
conv3	(384, 256, 3, 3) (384,)
conv4	(384, 192, 3, 3) (384,)
conv5	(256, 192, 3, 3) (256,)
fc6	(4096, 9216) (4096,)
fc7	(4096, 4096) (4096,)
fc8	(1000, 4096) (1000,)

Consider the shapes of the inner product parameters. The weight dimensions are the output and input sizes while the bias dimension is the output size.
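A quick consistency check between the two parameterizations (illustrative): the 9216 inputs of fc6 are exactly the flattened 256 $\times$ 6 $\times$ 6 pool5 volume, so fc6-conv holds the same number of weights in a different shape.

# fc6:      (4096, 9216)        inner product weights
# fc6-conv: (4096, 256, 6, 6)   convolution filters over pool5
print(256 * 6 * 6)                         # 9216
print(4096 * 9216 == 4096 * 256 * 6 * 6)   # True: same parameter count, different shape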

1
2
3
4
5
6
7
8
9
10
11
12
13
14
# Load the fully convolutional network to transplant the parameters.
net_full_conv = caffe.Net('net_surgery/bvlc_caffenet_full_conv.prototxt',
'../models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel',
caffe.TEST)
params_full_conv = ['fc6-conv', 'fc7-conv', 'fc8-conv']
# conv_params = {name: (weights, biases)}
conv_params = {pr: (net_full_conv.params[pr][0].data, net_full_conv.params[pr][1].data) for pr in params_full_conv}

for pr in params_full_conv:
    print '{} weights are {} dimensional and biases are {} dimensional'.format(pr, conv_params[pr][0].shape, conv_params[pr][1].shape)

pr = 'fc6-conv'
print net_full_conv.params[pr][0].data[0,0,:,:] # no weight_filler,default to 0s
print net_full_conv.params[pr][1].data[0] # no bias_filler,default to 0s
fc6-conv weights are (4096, 256, 6, 6) dimensional and biases are (4096,) dimensional
fc7-conv weights are (4096, 4096, 1, 1) dimensional and biases are (4096,) dimensional
fc8-conv weights are (1000, 4096, 1, 1) dimensional and biases are (1000,) dimensional
[[ 0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.]]
0.0
1
2
for layer_name, blob in net_full_conv.blobs.iteritems():
    print layer_name + '\t' + str(blob.data.shape)
data	(1, 3, 451, 451)
conv1	(1, 96, 111, 111)
pool1	(1, 96, 55, 55)
norm1	(1, 96, 55, 55)
conv2	(1, 256, 55, 55)
pool2	(1, 256, 27, 27)
norm2	(1, 256, 27, 27)
conv3	(1, 384, 27, 27)
conv4	(1, 384, 27, 27)
conv5	(1, 256, 27, 27)
pool5	(1, 256, 13, 13)
fc6-conv	(1, 4096, 8, 8)
fc7-conv	(1, 4096, 8, 8)
fc8-conv	(1, 1000, 8, 8)
prob	(1, 1000, 8, 8)
1
2
for layer_name, param in net_full_conv.params.iteritems():
    print layer_name + '\t' + str(param[0].data.shape), str(param[1].data.shape)
conv1	(96, 3, 11, 11) (96,)
conv2	(256, 48, 5, 5) (256,)
conv3	(384, 256, 3, 3) (384,)
conv4	(384, 192, 3, 3) (384,)
conv5	(256, 192, 3, 3) (256,)
fc6-conv	(4096, 256, 6, 6) (4096,)
fc7-conv	(4096, 4096, 1, 1) (4096,)
fc8-conv	(1000, 4096, 1, 1) (1000,)

The convolution weights are arranged in output $\times$ input $\times$ height $\times$ width dimensions. To map the inner product weights to convolution filters, we could roll the flat inner product vectors into channel $\times$ height $\times$ width filter matrices, but actually these are identical in memory (as row major arrays) so we can assign them directly.

The biases are identical to those of the inner product.
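The “identical in memory” claim is easy to verify with a toy example (illustration only): assigning through .flat is the same as reshaping the row-major inner product matrix into the 4D filter bank.

import numpy as np

# Toy check: a (2, 12) inner product weight matrix transplanted into a
# (2, 3, 2, 2) filter bank (since 3*2*2 = 12) is just a reshape.
fc_w = np.arange(24, dtype=np.float32).reshape(2, 12)
conv_w = np.zeros((2, 3, 2, 2), dtype=np.float32)
conv_w.flat = fc_w.flat                    # same trick used in the transplant below
assert np.array_equal(conv_w, fc_w.reshape(2, 3, 2, 2))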

Let’s transplant!

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
def print_params():
    for pr in params:
        print pr, fc_params[pr][0].shape, fc_params[pr][1].shape

    for pr in params_full_conv:
        print pr, conv_params[pr][0].shape, conv_params[pr][1].shape

    pr = 'fc6-conv'
    print 'params value for ',pr
    print net_full_conv.params[pr][0].data[0,0,:,:]
    print net_full_conv.params[pr][1].data[0]

print '*'*50
print '(1) before updated by fc'
print '*'*50
print_params()

#print type(conv_params[pr_conv][0]) # ndarray ndarray.flat
#conv_params[pr_conv][0].flat = fc_params[pr][0].flat

# set w6,w7,w8 of conv from fc w6,w7,w8
for pr, pr_conv in zip(params, params_full_conv):
    conv_params[pr_conv][0].flat = fc_params[pr][0].flat # flat unrolls the arrays
    conv_params[pr_conv][1][...] = fc_params[pr][1]

print_conv_params = True
print_conv_params = False
if print_conv_params:
    pr = 'fc6'
    print net.params[pr][0].data[0,:6*6] # no weight_filler,loaded from weights file
    print net.params[pr][1].data[0] # no bias_filler,loaded from weights file

    print
    print 'after init from fc'
    pr = 'fc6-conv'
    print net_full_conv.params[pr][0].data[0,0,:,:] # no weight_filler,default to 0s, here updated by fc
    print net_full_conv.params[pr][1].data[0] # no bias_filler,default to 0s , here updated by fc


print '*'*50
print '(2) after updated by fc'
print '*'*50
print_params()
**************************************************
(1) before updated by fc
**************************************************
fc6 (4096, 9216) (4096,)
fc7 (4096, 4096) (4096,)
fc8 (1000, 4096) (1000,)
fc6-conv (4096, 256, 6, 6) (4096,)
fc7-conv (4096, 4096, 1, 1) (4096,)
fc8-conv (1000, 4096, 1, 1) (1000,)
params value for  fc6-conv
[[ 0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.]]
0.0
**************************************************
(2) after updated by  fc
**************************************************
fc6 (4096, 9216) (4096,)
fc7 (4096, 4096) (4096,)
fc8 (1000, 4096) (1000,)
fc6-conv (4096, 256, 6, 6) (4096,)
fc7-conv (4096, 4096, 1, 1) (4096,)
fc8-conv (1000, 4096, 1, 1) (1000,)
params value for  fc6-conv
[[ 0.00639847  0.00915686  0.00467043  0.00118941  0.00083305  0.00249258]
 [ 0.00249609 -0.00354958 -0.00502381 -0.00660044 -0.00810635 -0.00120969]
 [-0.00182751 -0.00181385 -0.00327348 -0.00657627 -0.01059825 -0.00223066]
 [ 0.00023664  0.00040984 -0.00052619 -0.00124062 -0.00269398 -0.00051081]
 [ 0.0014997   0.00123309 -0.00013806 -0.00111619  0.00321043  0.00284487]
 [ 0.00051387 -0.00087142 -0.00038937 -0.0008678   0.0049024   0.00155215]]
0.983698

Next, save the new model weights.

1
net_full_conv.save('net_surgery/bvlc_caffenet_full_conv.caffemodel')

To conclude, let’s make a classification map from the example cat image and visualize the confidence of “tiger cat” as a probability heatmap. This gives an 8-by-8 prediction on overlapping regions of the 451 $\times$ 451 input.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# load input and configure preprocessing
im = caffe.io.load_image('images/cat.jpg')
transformer = caffe.io.Transformer({'data': net_full_conv.blobs['data'].data.shape}) # (1,3,451,451)
transformer.set_mean('data', np.load('../python/caffe/imagenet/ilsvrc_2012_mean.npy').mean(1).mean(1))
transformer.set_transpose('data', (2,0,1))
transformer.set_channel_swap('data', (2,1,0))
transformer.set_raw_scale('data', 255.0)

transformed_image = transformer.preprocess('data', im)
#print transformed_image.shape #(3, 451, 451)
net_full_conv.blobs['data'].data[...] = transformed_image # (1, 3, 451, 451)


#out = net_full_conv.forward_all(data=np.asarray([transformer.preprocess('data', im)]))

# make classification map by forward and print prediction indices at each location
out = net_full_conv.forward()
prob = out['prob'][0] # (1, 1000, 8, 8)-->(1000, 8, 8)
classification_map = out['prob'][0].argmax(axis=0)
print classification_map # (8,8)

# show net input and confidence map (probability of the top prediction at each location)
plt.subplot(1, 2, 1)
plt.imshow(transformer.deprocess('data', net_full_conv.blobs['data'].data[0]))

plt.subplot(1, 2, 2)
plt.imshow(out['prob'][0,281]) # correct class = 281
plt.colorbar()

plt.tight_layout()
[[282 282 281 281 281 281 277 282]
 [281 283 283 281 281 281 281 282]
 [283 283 283 283 283 283 287 282]
 [283 283 283 281 283 283 283 259]
 [283 283 283 283 283 283 283 259]
 [283 283 283 283 283 283 259 259]
 [283 283 283 283 259 259 259 277]
 [335 335 283 259 263 263 263 277]]

png

The classifications include various cats – 282 = tiger cat, 281 = tabby, 283 = persian – and foxes and other mammals.

In this way the fully connected layers can be extracted as dense features across an image (see net_full_conv.blobs['fc6-conv'].data for instance), which is perhaps more useful than the classification map itself.

Note that this model isn’t totally appropriate for sliding-window detection since it was trained for whole-image classification. Nevertheless it can work just fine. Sliding-window training and finetuning can be done by defining a sliding-window ground truth and loss such that a loss map is made for every location and solving as usual. (This is an exercise for the reader.)

A thank you to Rowland Depp for first suggesting this trick.

1
net_full_conv.blobs['fc6-conv'].data[0,176,:,:] # (1, 4096, 8, 8)
array([[  0.        ,   3.78561878,   4.91759014,  11.89788914,
         14.29053116,  16.50216484,   3.7467947 ,   0.        ],
       [  0.        ,  17.67206573,  25.0014534 ,  39.59349442,
         39.08831787,  29.11470604,   9.98679352,   0.        ],
       [  1.67216611,  18.15454102,  24.08405876,  39.18917847,
         37.54191971,  15.41128445,   0.        ,   0.        ],
       [  0.        ,   3.00706673,   5.87482309,  15.25675011,
         12.55344582,   0.        ,   0.        ,   0.        ],
       [  0.        ,   0.        ,   0.        ,   0.        ,
          1.        ,   0.        ,   0.        ,   0.        ],
       [  0.        ,   0.        ,   0.        ,   0.        ,
          1.        ,   0.        ,   0.        ,   0.        ],
       [  0.        ,   0.        ,   0.        ,   0.        ,
          1.        ,   0.        ,   0.        ,   0.        ],
       [  0.        ,   0.        ,   0.        ,   0.        ,
          1.        ,   0.        ,   0.        ,   0.        ]], dtype=float32)

Reference

History

  • 20180816: created.

Tutorial

In this example, we’ll explore a common approach that is particularly useful in real-world applications: take a pre-trained Caffe network and fine-tune the parameters on your custom data.

The advantage of this approach is that, since pre-trained networks are learned on a large set of images, the intermediate layers capture the “semantics” of the general visual appearance. Think of it as a very powerful generic visual feature that you can treat as a black box. On top of that, only a relatively small amount of data is needed for good performance on the target task.

First, we will need to prepare the data. This involves the following parts:
(1) Get the ImageNet ilsvrc pretrained model with the provided shell scripts.
(2) Download a subset of the overall Flickr style dataset for this demo.
(3) Compile the downloaded Flickr dataset into a database that Caffe can then consume.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
caffe_root = '../'  # this file should be run from {caffe_root}/examples (otherwise change this line)

import sys
sys.path.insert(0, caffe_root + 'python')
import caffe

caffe.set_device(0)
caffe.set_mode_gpu()

import numpy as np
from pylab import *
%matplotlib inline
import tempfile

# Helper function for deprocessing preprocessed images, e.g., for display.
def deprocess_net_image(image):
    image = image.copy() # don't modify destructively
    image = image[::-1] # BGR -> RGB
    image = image.transpose(1, 2, 0) # CHW -> HWC
    image += [123, 117, 104] # (approximately) undo mean subtraction

    # clamp values in [0, 255]
    image[image < 0], image[image > 255] = 0, 255

    # round and cast from float32 to uint8
    image = np.round(image)
    image = np.require(image, dtype=np.uint8)

    return image

Setup and dataset download

Download data required for this exercise.

  • get_ilsvrc_aux.sh to download the ImageNet data mean, labels, etc.
  • download_model_binary.py to download the pretrained reference model
  • finetune_flickr_style/assemble_data.py downloads the style training and testing data

We’ll download just a small subset of the full dataset for this exercise: just 2000 of the 80K images, from 5 of the 20 style categories. (To download the full dataset, set full_dataset = True in the cell below.)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
# Download just a small subset of the data for this exercise.
# (2000 of 80K images, 5 of 20 labels.)
# To download the entire dataset, set `full_dataset = True`.
full_dataset = False
if full_dataset:
    NUM_STYLE_IMAGES = NUM_STYLE_LABELS = -1
else:
    NUM_STYLE_IMAGES = 2000
    NUM_STYLE_LABELS = 5

# This downloads the ilsvrc auxiliary data (mean file, etc),
# and a subset of 2000 images for the style recognition task.
import os
os.chdir(caffe_root) # run scripts from caffe root
!data/ilsvrc12/get_ilsvrc_aux.sh
!scripts/download_model_binary.py models/bvlc_reference_caffenet
!python examples/finetune_flickr_style/assemble_data.py \
--workers=-1 --seed=1701 \
--images=$NUM_STYLE_IMAGES --label=$NUM_STYLE_LABELS
# back to examples
os.chdir('examples')
Downloading...
--2016-02-24 00:28:36--  http://dl.caffe.berkeleyvision.org/caffe_ilsvrc12.tar.gz
Resolving dl.caffe.berkeleyvision.org (dl.caffe.berkeleyvision.org)... 169.229.222.251
Connecting to dl.caffe.berkeleyvision.org (dl.caffe.berkeleyvision.org)|169.229.222.251|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 17858008 (17M) [application/octet-stream]
Saving to: ‘caffe_ilsvrc12.tar.gz’

100%[======================================>] 17,858,008   112MB/s   in 0.2s   

2016-02-24 00:28:36 (112 MB/s) - ‘caffe_ilsvrc12.tar.gz’ saved [17858008/17858008]

Unzipping...
Done.
Model already exists.
Downloading 2000 images with 7 workers...
Writing train/val for 1996 successfully downloaded images.

Define weights, the path to the ImageNet pretrained weights we just downloaded, and make sure it exists.

1
2
3
import os
weights = os.path.join(caffe_root, 'models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel')
assert os.path.exists(weights)

Load the 1000 ImageNet labels from ilsvrc12/synset_words.txt, and the 5 style labels from finetune_flickr_style/style_names.txt.

1
2
3
4
5
6
7
8
9
10
11
12
# Load ImageNet labels to imagenet_labels
imagenet_label_file = caffe_root + 'data/ilsvrc12/synset_words.txt'
imagenet_labels = list(np.loadtxt(imagenet_label_file, str, delimiter='\t'))
assert len(imagenet_labels) == 1000
print 'Loaded ImageNet labels:\n', '\n'.join(imagenet_labels[:10] + ['...'])

# Load style labels to style_labels
style_label_file = caffe_root + 'examples/finetune_flickr_style/style_names.txt'
style_labels = list(np.loadtxt(style_label_file, str, delimiter='\n'))
if NUM_STYLE_LABELS > 0:
    style_labels = style_labels[:NUM_STYLE_LABELS]
print '\nLoaded style labels:\n', ', '.join(style_labels)
Loaded ImageNet labels:
n01440764 tench, Tinca tinca
n01443537 goldfish, Carassius auratus
n01484850 great white shark, white shark, man-eater, man-eating shark, Carcharodon carcharias
n01491361 tiger shark, Galeocerdo cuvieri
n01494475 hammerhead, hammerhead shark
n01496331 electric ray, crampfish, numbfish, torpedo
n01498041 stingray
n01514668 cock
n01514859 hen
n01518878 ostrich, Struthio camelus
...

Loaded style labels:
Detailed, Pastel, Melancholy, Noir, HDR

Defining and running the nets

We’ll start by defining caffenet, a function which initializes the CaffeNet architecture (a minor variant on AlexNet), taking arguments specifying the data and number of output classes.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
from caffe import layers as L
from caffe import params as P

weight_param = dict(lr_mult=1, decay_mult=1)
bias_param = dict(lr_mult=2, decay_mult=0)
learned_param = [weight_param, bias_param]

frozen_param = [dict(lr_mult=0)] * 2

def conv_relu(bottom, ks, nout, stride=1, pad=0, group=1,
              param=learned_param,
              weight_filler=dict(type='gaussian', std=0.01),
              bias_filler=dict(type='constant', value=0.1)):
    conv = L.Convolution(bottom, kernel_size=ks, stride=stride,
                         num_output=nout, pad=pad, group=group,
                         param=param, weight_filler=weight_filler,
                         bias_filler=bias_filler)
    return conv, L.ReLU(conv, in_place=True)

def fc_relu(bottom, nout, param=learned_param,
            weight_filler=dict(type='gaussian', std=0.005),
            bias_filler=dict(type='constant', value=0.1)):
    fc = L.InnerProduct(bottom, num_output=nout, param=param,
                        weight_filler=weight_filler,
                        bias_filler=bias_filler)
    return fc, L.ReLU(fc, in_place=True)

def max_pool(bottom, ks, stride=1):
    return L.Pooling(bottom, pool=P.Pooling.MAX, kernel_size=ks, stride=stride)

def caffenet(data, label=None, train=True, num_classes=1000,
             classifier_name='fc8', learn_all=False):
    """Returns a NetSpec specifying CaffeNet, following the original proto text
       specification (./models/bvlc_reference_caffenet/train_val.prototxt)."""
    n = caffe.NetSpec()
    n.data = data
    param = learned_param if learn_all else frozen_param
    n.conv1, n.relu1 = conv_relu(n.data, 11, 96, stride=4, param=param)
    n.pool1 = max_pool(n.relu1, 3, stride=2)
    n.norm1 = L.LRN(n.pool1, local_size=5, alpha=1e-4, beta=0.75)
    n.conv2, n.relu2 = conv_relu(n.norm1, 5, 256, pad=2, group=2, param=param)
    n.pool2 = max_pool(n.relu2, 3, stride=2)
    n.norm2 = L.LRN(n.pool2, local_size=5, alpha=1e-4, beta=0.75)
    n.conv3, n.relu3 = conv_relu(n.norm2, 3, 384, pad=1, param=param)
    n.conv4, n.relu4 = conv_relu(n.relu3, 3, 384, pad=1, group=2, param=param)
    n.conv5, n.relu5 = conv_relu(n.relu4, 3, 256, pad=1, group=2, param=param)
    n.pool5 = max_pool(n.relu5, 3, stride=2)
    n.fc6, n.relu6 = fc_relu(n.pool5, 4096, param=param)
    if train:
        n.drop6 = fc7input = L.Dropout(n.relu6, in_place=True)
    else:
        fc7input = n.relu6
    n.fc7, n.relu7 = fc_relu(fc7input, 4096, param=param)
    if train:
        n.drop7 = fc8input = L.Dropout(n.relu7, in_place=True)
    else:
        fc8input = n.relu7
    # always learn fc8 (param=learned_param)
    fc8 = L.InnerProduct(fc8input, num_output=num_classes, param=learned_param)
    # give fc8 the name specified by argument `classifier_name`
    n.__setattr__(classifier_name, fc8)
    if not train:
        n.probs = L.Softmax(fc8)
    if label is not None:
        n.label = label
        n.loss = L.SoftmaxWithLoss(fc8, n.label)
        n.acc = L.Accuracy(fc8, n.label)
    # write the net to a temporary file and return its filename
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(str(n.to_proto()))
        return f.name

Now, let’s create a CaffeNet that takes unlabeled “dummy data” as input, allowing us to set its input images externally and see what ImageNet classes it predicts.

1
2
3
dummy_data = L.DummyData(shape=dict(dim=[1, 3, 227, 227]))
imagenet_net_filename = caffenet(data=dummy_data, train=False)
imagenet_net = caffe.Net(imagenet_net_filename, weights, caffe.TEST)

Define a function style_net which calls caffenet on data from the Flickr style dataset.

The new network will also have the CaffeNet architecture, with differences in the input and output:

  • the input is the Flickr style data we downloaded, provided by an ImageData layer
  • the output is a distribution over 20 classes rather than the original 1000 ImageNet classes
  • the classification layer is renamed from fc8 to fc8_flickr to tell Caffe not to load the original classifier (fc8) weights from the ImageNet-pretrained model
1
2
3
4
5
6
7
8
9
10
11
12
13
def style_net(train=True, learn_all=False, subset=None):
    if subset is None:
        subset = 'train' if train else 'test'
    source = caffe_root + 'data/flickr_style/%s.txt' % subset
    transform_param = dict(mirror=train, crop_size=227,
                           mean_file=caffe_root + 'data/ilsvrc12/imagenet_mean.binaryproto')
    style_data, style_label = L.ImageData(
        transform_param=transform_param, source=source,
        batch_size=50, new_height=256, new_width=256, ntop=2)
    return caffenet(data=style_data, label=style_label, train=train,
                    num_classes=NUM_STYLE_LABELS,
                    classifier_name='fc8_flickr',
                    learn_all=learn_all)

Use the style_net function defined above to initialize untrained_style_net, a CaffeNet with input images from the style dataset and weights from the pretrained ImageNet model.

Call forward on untrained_style_net to get a batch of style training data.

1
2
3
4
5
untrained_style_net = caffe.Net(style_net(train=False, subset='train'),
weights, caffe.TEST)
untrained_style_net.forward()
style_data_batch = untrained_style_net.blobs['data'].data.copy()
style_label_batch = np.array(untrained_style_net.blobs['label'].data, dtype=np.int32)

Pick one of the style net training images from the batch of 50 (we’ll arbitrarily choose #8 here). Display it, then run it through imagenet_net, the ImageNet-pretrained network to view its top 5 predicted classes from the 1000 ImageNet classes.

Below we chose an image where the network’s predictions happen to be reasonable, as the image is of a beach, and “sandbar” and “seashore” both happen to be ImageNet-1000 categories. For other images, the predictions won’t be this good, sometimes due to the network actually failing to recognize the object(s) present in the image, but perhaps even more often due to the fact that not all images contain an object from the (somewhat arbitrarily chosen) 1000 ImageNet categories. Modify the batch_index variable by changing its default setting of 8 to another value from 0-49 (since the batch size is 50) to see predictions for other images in the batch. (To go beyond this batch of 50 images, first rerun the above cell to load a fresh batch of data into style_net.)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
def disp_preds(net, image, labels, k=5, name='ImageNet'):
    input_blob = net.blobs['data']
    net.blobs['data'].data[0, ...] = image
    probs = net.forward(start='conv1')['probs'][0]
    top_k = (-probs).argsort()[:k]
    print 'top %d predicted %s labels =' % (k, name)
    print '\n'.join('\t(%d) %5.2f%% %s' % (i+1, 100*probs[p], labels[p])
                    for i, p in enumerate(top_k))

def disp_imagenet_preds(net, image):
    disp_preds(net, image, imagenet_labels, name='ImageNet')

def disp_style_preds(net, image):
    disp_preds(net, image, style_labels, name='style')
1
2
3
4
batch_index = 8
image = style_data_batch[batch_index]
plt.imshow(deprocess_net_image(image))
print 'actual label =', style_labels[style_label_batch[batch_index]]
actual label = Melancholy

png

1
disp_imagenet_preds(imagenet_net, image)
top 5 predicted ImageNet labels =
    (1) 69.89% n09421951 sandbar, sand bar
    (2) 21.76% n09428293 seashore, coast, seacoast, sea-coast
    (3)  3.22% n02894605 breakwater, groin, groyne, mole, bulwark, seawall, jetty
    (4)  1.89% n04592741 wing
    (5)  1.23% n09332890 lakeside, lakeshore

We can also look at untrained_style_net‘s predictions, but we won’t see anything interesting as its classifier hasn’t been trained yet.

In fact, since we zero-initialized the classifier (see caffenet definition – no weight_filler is passed to the final InnerProduct layer), the softmax inputs should be all zero and we should therefore see a predicted probability of 1/N for each label (for N labels). Since we set N = 5, we get a predicted probability of 20% for each class.
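A one-line check of that claim (illustrative numpy, not part of the notebook): the softmax of an all-zero score vector is exactly uniform.

import numpy as np
z = np.zeros(5)                   # zero-initialized classifier -> all-zero scores
p = np.exp(z) / np.exp(z).sum()   # softmax
print(p)                          # [0.2 0.2 0.2 0.2 0.2], i.e. 1/N for N = 5 labels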

1
disp_style_preds(untrained_style_net, image)
top 5 predicted style labels =
    (1) 20.00% Detailed
    (2) 20.00% Pastel
    (3) 20.00% Melancholy
    (4) 20.00% Noir
    (5) 20.00% HDR

We can also verify that the activations in layer fc7 immediately before the classification layer are the same as (or very close to) those in the ImageNet-pretrained model, since both models are using the same pretrained weights in the conv1 through fc7 layers.

1
2
3
diff = untrained_style_net.blobs['fc7'].data[0] - imagenet_net.blobs['fc7'].data[0]
error = (diff ** 2).sum()
assert error < 1e-8

Delete untrained_style_net to save memory. (Hang on to imagenet_net as we’ll use it again later.)

1
del untrained_style_net

Training the style classifier

Now, we’ll define a function solver to create our Caffe solvers, which are used to train the network (learn its weights). In this function we’ll set values for various parameters used for learning, display, and “snapshotting” – see the inline comments for explanations of what they mean. You may want to play with some of the learning parameters to see if you can improve on the results here!

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
from caffe.proto import caffe_pb2

def solver(train_net_path, test_net_path=None, base_lr=0.001):
    s = caffe_pb2.SolverParameter()

    # Specify locations of the train and (maybe) test networks.
    s.train_net = train_net_path
    if test_net_path is not None:
        s.test_net.append(test_net_path)
        s.test_interval = 1000 # Test after every 1000 training iterations.
        s.test_iter.append(100) # Test on 100 batches each time we test.

    # The number of iterations over which to average the gradient.
    # Effectively boosts the training batch size by the given factor, without
    # affecting memory utilization.
    s.iter_size = 1

    s.max_iter = 100000 # # of times to update the net (training iterations)

    # Solve using the stochastic gradient descent (SGD) algorithm.
    # Other choices include 'Adam' and 'RMSProp'.
    s.type = 'SGD'

    # Set the initial learning rate for SGD.
    s.base_lr = base_lr

    # Set `lr_policy` to define how the learning rate changes during training.
    # Here, we 'step' the learning rate by multiplying it by a factor `gamma`
    # every `stepsize` iterations.
    s.lr_policy = 'step'
    s.gamma = 0.1
    s.stepsize = 20000

    # Set other SGD hyperparameters. Setting a non-zero `momentum` takes a
    # weighted average of the current gradient and previous gradients to make
    # learning more stable. L2 weight decay regularizes learning, to help prevent
    # the model from overfitting.
    s.momentum = 0.9
    s.weight_decay = 5e-4

    # Display the current training loss and accuracy every 1000 iterations.
    s.display = 1000

    # Snapshots are files used to store networks we've trained. Here, we'll
    # snapshot every 10K iterations -- ten times during training.
    s.snapshot = 10000
    s.snapshot_prefix = caffe_root + 'models/finetune_flickr_style/finetune_flickr_style'

    # Train on the GPU. Using the CPU to train large networks is very slow.
    s.solver_mode = caffe_pb2.SolverParameter.GPU

    # Write the solver to a temporary file and return its filename.
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(str(s))
        return f.name

Now we’ll invoke the solver to train the style net’s classification layer.

For the record, if you want to train the network using only the command line tool, this is the command:

1
2
3
4
build/tools/caffe train \
-solver models/finetune_flickr_style/solver.prototxt \
-weights models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel \
-gpu 0

However, we will train using Python in this example.

We’ll first define run_solvers, a function that takes a list of solvers and steps each one in a round robin manner, recording the accuracy and loss values each iteration. At the end, the learned weights are saved to a file.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
def run_solvers(niter, solvers, disp_interval=10):
    """Run solvers for niter iterations,
       returning the loss and accuracy recorded each iteration.
       `solvers` is a list of (name, solver) tuples."""
    blobs = ('loss', 'acc')
    loss, acc = ({name: np.zeros(niter) for name, _ in solvers}
                 for _ in blobs)
    for it in range(niter):
        for name, s in solvers:
            s.step(1) # run a single SGD step in Caffe
            loss[name][it], acc[name][it] = (s.net.blobs[b].data.copy()
                                             for b in blobs)
        if it % disp_interval == 0 or it + 1 == niter:
            loss_disp = '; '.join('%s: loss=%.3f, acc=%2d%%' %
                                  (n, loss[n][it], np.round(100*acc[n][it]))
                                  for n, _ in solvers)
            print '%3d) %s' % (it, loss_disp)
    # Save the learned weights from both nets.
    weight_dir = tempfile.mkdtemp()
    weights = {}
    for name, s in solvers:
        filename = 'weights.%s.caffemodel' % name
        weights[name] = os.path.join(weight_dir, filename)
        s.net.save(weights[name])
    return loss, acc, weights

Let’s create and run solvers to train nets for the style recognition task. We’ll create two solvers – one (style_solver) will have its train net initialized to the ImageNet-pretrained weights (this is done by the call to the copy_from method), and the other (scratch_style_solver) will start from a randomly initialized net.

During training, we should see that the ImageNet pretrained net is learning faster and attaining better accuracies than the scratch net.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
niter = 200  # number of iterations to train

# Reset style_solver as before.
style_solver_filename = solver(style_net(train=True))
style_solver = caffe.get_solver(style_solver_filename)
style_solver.net.copy_from(weights)

# For reference, we also create a solver that isn't initialized from
# the pretrained ImageNet weights.
scratch_style_solver_filename = solver(style_net(train=True))
scratch_style_solver = caffe.get_solver(scratch_style_solver_filename)

print 'Running solvers for %d iterations...' % niter
solvers = [('pretrained', style_solver),
('scratch', scratch_style_solver)]
loss, acc, weights = run_solvers(niter, solvers)
print 'Done.'

train_loss, scratch_train_loss = loss['pretrained'], loss['scratch']
train_acc, scratch_train_acc = acc['pretrained'], acc['scratch']
style_weights, scratch_style_weights = weights['pretrained'], weights['scratch']

# Delete solvers to save memory.
del style_solver, scratch_style_solver, solvers
Running solvers for 200 iterations...
  0) pretrained: loss=1.609, acc=28%; scratch: loss=1.609, acc=28%
 10) pretrained: loss=1.293, acc=52%; scratch: loss=1.626, acc=14%
 20) pretrained: loss=1.110, acc=56%; scratch: loss=1.646, acc=10%
 30) pretrained: loss=1.084, acc=60%; scratch: loss=1.616, acc=20%
 40) pretrained: loss=0.898, acc=64%; scratch: loss=1.588, acc=26%
 50) pretrained: loss=1.024, acc=54%; scratch: loss=1.607, acc=32%
 60) pretrained: loss=0.925, acc=66%; scratch: loss=1.616, acc=20%
 70) pretrained: loss=0.861, acc=74%; scratch: loss=1.598, acc=24%
 80) pretrained: loss=0.967, acc=60%; scratch: loss=1.588, acc=30%
 90) pretrained: loss=1.274, acc=52%; scratch: loss=1.608, acc=20%
100) pretrained: loss=1.113, acc=62%; scratch: loss=1.588, acc=30%
110) pretrained: loss=0.922, acc=62%; scratch: loss=1.578, acc=36%
120) pretrained: loss=0.918, acc=62%; scratch: loss=1.599, acc=20%
130) pretrained: loss=0.959, acc=58%; scratch: loss=1.594, acc=22%
140) pretrained: loss=1.228, acc=50%; scratch: loss=1.608, acc=14%
150) pretrained: loss=0.727, acc=76%; scratch: loss=1.623, acc=16%
160) pretrained: loss=1.074, acc=66%; scratch: loss=1.607, acc=20%
170) pretrained: loss=0.887, acc=60%; scratch: loss=1.614, acc=20%
180) pretrained: loss=0.961, acc=62%; scratch: loss=1.614, acc=18%
190) pretrained: loss=0.737, acc=76%; scratch: loss=1.613, acc=18%
199) pretrained: loss=0.836, acc=70%; scratch: loss=1.614, acc=16%
Done.

Let’s look at the training loss and accuracy produced by the two training procedures. Notice how quickly the ImageNet pretrained model’s loss value (blue) drops, and that the randomly initialized model’s loss value (green) barely (if at all) improves from training only the classifier layer.

1
2
3
plot(np.vstack([train_loss, scratch_train_loss]).T)
xlabel('Iteration #')
ylabel('Loss')
<matplotlib.text.Text at 0x7f75d49e1090>

png

1
2
3
plot(np.vstack([train_acc, scratch_train_acc]).T)
xlabel('Iteration #')
ylabel('Accuracy')
<matplotlib.text.Text at 0x7f75d49e1a90>

png

Let’s take a look at the testing accuracy after running 200 iterations of training. Note that we’re classifying among 5 classes, giving chance accuracy of 20%. We expect both results to be better than chance accuracy (20%), and we further expect the result from training using the ImageNet pretraining initialization to be much better than the one from training from scratch. Let’s see.

1
2
3
4
5
6
7
def eval_style_net(weights, test_iters=10):
    test_net = caffe.Net(style_net(train=False), weights, caffe.TEST)
    accuracy = 0
    for it in xrange(test_iters):
        accuracy += test_net.forward()['acc']
    accuracy /= test_iters
    return test_net, accuracy
1
2
3
4
test_net, accuracy = eval_style_net(style_weights)
print 'Accuracy, trained from ImageNet initialization: %3.1f%%' % (100*accuracy, )
scratch_test_net, scratch_accuracy = eval_style_net(scratch_style_weights)
print 'Accuracy, trained from random initialization: %3.1f%%' % (100*scratch_accuracy, )
Accuracy, trained from ImageNet initialization: 50.0%
Accuracy, trained from   random initialization: 23.6%

End-to-end finetuning for style

Finally, we’ll train both nets again, starting from the weights we just learned. The only difference this time is that we’ll be learning the weights “end-to-end” by turning on learning in all layers of the network, starting from the RGB conv1 filters directly applied to the input image. We pass the argument learn_all=True to the style_net function defined earlier in this notebook, which tells the function to apply a positive (non-zero) lr_mult value for all parameters. Under the default, learn_all=False, all parameters in the pretrained layers (conv1 through fc7) are frozen (lr_mult = 0), and we learn only the classifier layer fc8_flickr.

Note that both networks start at roughly the accuracy achieved at the end of the previous training session, and improve significantly with end-to-end training. To be more scientific, we’d also want to follow the same additional training procedure without the end-to-end training, to ensure that our results aren’t better simply because we trained for twice as long. Feel free to try this yourself!
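One quick way to confirm which layers actually move (a sketch, not in the original notebook) is to snapshot a pretrained layer’s weights, take a single solver step, and compare. With learn_all=False, conv1 stays bit-identical because its lr_mult is 0:

# Sketch: check whether conv1 is frozen by comparing its weights across one solver step.
solver_check = caffe.get_solver(solver(style_net(train=True, learn_all=False)))
solver_check.net.copy_from(style_weights)

conv1_before = solver_check.net.params['conv1'][0].data.copy()
solver_check.step(1)
conv1_after = solver_check.net.params['conv1'][0].data

print((conv1_before == conv1_after).all())  # True here; with learn_all=True it would print False
del solver_check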

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
end_to_end_net = style_net(train=True, learn_all=True)

# Set base_lr to 1e-3, the same as last time when learning only the classifier.
# You may want to play around with different values of this or other
# optimization parameters when fine-tuning. For example, if learning diverges
# (e.g., the loss gets very large or goes to infinity/NaN), you should try
# decreasing base_lr (e.g., to 1e-4, then 1e-5, etc., until you find a value
# for which learning does not diverge).
base_lr = 0.001

style_solver_filename = solver(end_to_end_net, base_lr=base_lr)
style_solver = caffe.get_solver(style_solver_filename)
style_solver.net.copy_from(style_weights)

scratch_style_solver_filename = solver(end_to_end_net, base_lr=base_lr)
scratch_style_solver = caffe.get_solver(scratch_style_solver_filename)
scratch_style_solver.net.copy_from(scratch_style_weights)

print 'Running solvers for %d iterations...' % niter
solvers = [('pretrained, end-to-end', style_solver),
('scratch, end-to-end', scratch_style_solver)]
_, _, finetuned_weights = run_solvers(niter, solvers)
print 'Done.'

style_weights_ft = finetuned_weights['pretrained, end-to-end']
scratch_style_weights_ft = finetuned_weights['scratch, end-to-end']

# Delete solvers to save memory.
del style_solver, scratch_style_solver, solvers
Running solvers for 200 iterations...
  0) pretrained, end-to-end: loss=0.781, acc=64%; scratch, end-to-end: loss=1.585, acc=28%
 10) pretrained, end-to-end: loss=1.178, acc=62%; scratch, end-to-end: loss=1.638, acc=14%
 20) pretrained, end-to-end: loss=1.084, acc=60%; scratch, end-to-end: loss=1.637, acc= 8%
 30) pretrained, end-to-end: loss=0.902, acc=76%; scratch, end-to-end: loss=1.600, acc=20%
 40) pretrained, end-to-end: loss=0.865, acc=64%; scratch, end-to-end: loss=1.574, acc=26%
 50) pretrained, end-to-end: loss=0.888, acc=60%; scratch, end-to-end: loss=1.604, acc=26%
 60) pretrained, end-to-end: loss=0.538, acc=78%; scratch, end-to-end: loss=1.555, acc=34%
 70) pretrained, end-to-end: loss=0.717, acc=72%; scratch, end-to-end: loss=1.563, acc=30%
 80) pretrained, end-to-end: loss=0.695, acc=74%; scratch, end-to-end: loss=1.502, acc=42%
 90) pretrained, end-to-end: loss=0.708, acc=68%; scratch, end-to-end: loss=1.523, acc=26%
100) pretrained, end-to-end: loss=0.432, acc=78%; scratch, end-to-end: loss=1.500, acc=38%
110) pretrained, end-to-end: loss=0.611, acc=78%; scratch, end-to-end: loss=1.618, acc=18%
120) pretrained, end-to-end: loss=0.610, acc=76%; scratch, end-to-end: loss=1.473, acc=30%
130) pretrained, end-to-end: loss=0.471, acc=78%; scratch, end-to-end: loss=1.488, acc=26%
140) pretrained, end-to-end: loss=0.500, acc=76%; scratch, end-to-end: loss=1.514, acc=38%
150) pretrained, end-to-end: loss=0.476, acc=80%; scratch, end-to-end: loss=1.452, acc=46%
160) pretrained, end-to-end: loss=0.368, acc=82%; scratch, end-to-end: loss=1.419, acc=34%
170) pretrained, end-to-end: loss=0.556, acc=76%; scratch, end-to-end: loss=1.583, acc=36%
180) pretrained, end-to-end: loss=0.574, acc=72%; scratch, end-to-end: loss=1.556, acc=22%
190) pretrained, end-to-end: loss=0.360, acc=88%; scratch, end-to-end: loss=1.429, acc=44%
199) pretrained, end-to-end: loss=0.458, acc=78%; scratch, end-to-end: loss=1.370, acc=44%
Done.

Let’s now test the end-to-end finetuned models. Since all layers have been optimized for the style recognition task at hand, we expect both nets to get better results than the ones above, which were achieved by nets with only their classifier layers trained for the style task (on top of either ImageNet pretrained or randomly initialized weights).

1
2
3
4
test_net, accuracy = eval_style_net(style_weights_ft)
print 'Accuracy, finetuned from ImageNet initialization: %3.1f%%' % (100*accuracy, )
scratch_test_net, scratch_accuracy = eval_style_net(scratch_style_weights_ft)
print 'Accuracy, finetuned from random initialization: %3.1f%%' % (100*scratch_accuracy, )
Accuracy, finetuned from ImageNet initialization: 53.6%
Accuracy, finetuned from   random initialization: 39.2%

We’ll first look back at the image we started with and check our end-to-end trained model’s predictions.

1
2
plt.imshow(deprocess_net_image(image))
disp_style_preds(test_net, image)
top 5 predicted style labels =
    (1) 55.67% Melancholy
    (2) 27.21% HDR
    (3) 16.46% Pastel
    (4)  0.63% Detailed
    (5)  0.03% Noir

png

Whew, that looks a lot better than before! But note that this image was from the training set, so the net got to see its label at training time.

Finally, we’ll pick an image from the test set (an image the model hasn’t seen) and look at our end-to-end finetuned style model’s predictions for it.

1
2
3
4
batch_index = 1
image = test_net.blobs['data'].data[batch_index]
plt.imshow(deprocess_net_image(image))
print 'actual label =', style_labels[int(test_net.blobs['label'].data[batch_index])]
actual label = Pastel

png

1
disp_style_preds(test_net, image)
top 5 predicted style labels =
    (1) 99.76% Pastel
    (2)  0.13% HDR
    (3)  0.11% Detailed
    (4)  0.00% Melancholy
    (5)  0.00% Noir

We can also look at the predictions of the network trained from scratch. We see that in this case, the scratch network also predicts the correct label for the image (Pastel), but is much less confident in its prediction than the pretrained net.

1
disp_style_preds(scratch_test_net, image)
top 5 predicted style labels =
    (1) 49.81% Pastel
    (2) 19.76% Detailed
    (3) 17.06% Melancholy
    (4) 11.66% HDR
    (5)  1.72% Noir

Of course, we can again look at the ImageNet model’s predictions for the above image:

1
disp_imagenet_preds(imagenet_net, image)
top 5 predicted ImageNet labels =
    (1) 34.90% n07579787 plate
    (2) 21.63% n04263257 soup bowl
    (3) 17.75% n07875152 potpie
    (4)  5.72% n07711569 mashed potato
    (5)  5.27% n07584110 consomme

So we did finetuning and it is awesome. Let’s take a look at what kind of results we are able to get with a longer, more complete run of the style recognition dataset. Note: the below URL might be occasionally down because it is run on a research machine.

Reference

History

  • 20180808: created.

Solving in Python with LeNet

In this example, we’ll explore learning with Caffe in Python, using the fully-exposed Solver interface.

Setup

  • Set up the Python environment: we’ll use the pylab import for numpy and plot inline.
1
2
from pylab import *
%matplotlib inline
  • Import caffe, adding it to sys.path if needed. Make sure you’ve built pycaffe.
1
2
3
4
5
caffe_root = '../'  # this file should be run from {caffe_root}/examples (otherwise change this line)

import sys
sys.path.insert(0, caffe_root + 'python')
import caffe
  • We’ll be using the provided LeNet example data and networks (make sure you’ve downloaded the data and created the databases, as below).
1
2
3
4
5
6
7
8
9
# run scripts from caffe root
import os
os.chdir(caffe_root)
# Download data
!data/mnist/get_mnist.sh
# Prepare data
!examples/mnist/create_mnist.sh
# back to examples
os.chdir('examples')
Downloading...
Creating lmdb...
Done.

Creating the net

Now let’s make a variant of LeNet, the classic 1989 convnet architecture.

We’ll need two external files to help out:

  • the net prototxt, defining the architecture and pointing to the train/test data
  • the solver prototxt, defining the learning parameters

We start by creating the net. We’ll write the net in a succinct and natural way as Python code that serializes to Caffe’s protobuf model format.

This network expects to read from pregenerated LMDBs, but reading directly from ndarrays is also possible using MemoryDataLayer.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
from caffe import layers as L, params as P

def lenet(lmdb, batch_size):
    # our version of LeNet: a series of linear and simple nonlinear transformations
    n = caffe.NetSpec()

    n.data, n.label = L.Data(batch_size=batch_size, backend=P.Data.LMDB, source=lmdb,
                             transform_param=dict(scale=1./255), ntop=2)

    n.conv1 = L.Convolution(n.data, kernel_size=5, num_output=20, weight_filler=dict(type='xavier'))
    n.pool1 = L.Pooling(n.conv1, kernel_size=2, stride=2, pool=P.Pooling.MAX)
    n.conv2 = L.Convolution(n.pool1, kernel_size=5, num_output=50, weight_filler=dict(type='xavier'))
    n.pool2 = L.Pooling(n.conv2, kernel_size=2, stride=2, pool=P.Pooling.MAX)
    n.fc1 = L.InnerProduct(n.pool2, num_output=500, weight_filler=dict(type='xavier'))
    n.relu1 = L.ReLU(n.fc1, in_place=True)
    n.score = L.InnerProduct(n.relu1, num_output=10, weight_filler=dict(type='xavier'))
    n.loss = L.SoftmaxWithLoss(n.score, n.label)

    return n.to_proto()

with open('mnist/lenet_auto_train.prototxt', 'w') as f:
    f.write(str(lenet('mnist/mnist_train_lmdb', 64)))

with open('mnist/lenet_auto_test.prototxt', 'w') as f:
    f.write(str(lenet('mnist/mnist_test_lmdb', 100)))
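As an aside, the MemoryData route mentioned above would look roughly like the sketch below. This is only an illustration: the shapes assume MNIST-style 1x28x28 inputs, and lenet_memory plus the images/labels arrays are hypothetical names. The rest of this tutorial keeps the LMDB version.

# Sketch only: a minimal net fed from in-memory ndarrays via MemoryData
# (assumes caffe, L, and P are imported as above).
def lenet_memory(batch_size):
    n = caffe.NetSpec()
    # MemoryData needs the blob shape up front: channels, height, width
    n.data, n.label = L.MemoryData(batch_size=batch_size,
                                   channels=1, height=28, width=28, ntop=2)
    n.score = L.InnerProduct(n.data, num_output=10, weight_filler=dict(type='xavier'))
    n.loss = L.SoftmaxWithLoss(n.score, n.label)
    return n.to_proto()

# At runtime the arrays are handed to the net directly with
# net.set_input_arrays(images, labels); both must be float32, and the exact
# label shape requirements depend on the pycaffe version.

The tutorial itself continues with the LMDB-based nets written to disk above.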

The net has been written to disk in a more verbose but human-readable serialization format using Google’s protobuf library. You can read, write, and modify this description directly. Let’s take a look at the train net.

1
!cat mnist/lenet_auto_train.prototxt
layer {
  name: "data"
  type: "Data"
  top: "data"
  top: "label"
  transform_param {
    scale: 0.00392156862745
  }
  data_param {
    source: "mnist/mnist_train_lmdb"
    batch_size: 64
    backend: LMDB
  }
}
layer {
  name: "conv1"
  type: "Convolution"
  bottom: "data"
  top: "conv1"
  convolution_param {
    num_output: 20
    kernel_size: 5
    weight_filler {
      type: "xavier"
    }
  }
}
layer {
  name: "pool1"
  type: "Pooling"
  bottom: "conv1"
  top: "pool1"
  pooling_param {
    pool: MAX
    kernel_size: 2
    stride: 2
  }
}
layer {
  name: "conv2"
  type: "Convolution"
  bottom: "pool1"
  top: "conv2"
  convolution_param {
    num_output: 50
    kernel_size: 5
    weight_filler {
      type: "xavier"
    }
  }
}
layer {
  name: "pool2"
  type: "Pooling"
  bottom: "conv2"
  top: "pool2"
  pooling_param {
    pool: MAX
    kernel_size: 2
    stride: 2
  }
}
layer {
  name: "fc1"
  type: "InnerProduct"
  bottom: "pool2"
  top: "fc1"
  inner_product_param {
    num_output: 500
    weight_filler {
      type: "xavier"
    }
  }
}
layer {
  name: "relu1"
  type: "ReLU"
  bottom: "fc1"
  top: "fc1"
}
layer {
  name: "score"
  type: "InnerProduct"
  bottom: "fc1"
  top: "score"
  inner_product_param {
    num_output: 10
    weight_filler {
      type: "xavier"
    }
  }
}
layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "score"
  bottom: "label"
  top: "loss"
}
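Because the prototxt above is plain protobuf text format, it can also be read and modified programmatically rather than by hand. A minimal sketch (the batch-size tweak is only an illustration):

# parse the generated train net description with the protobuf text-format API
from caffe.proto import caffe_pb2
from google.protobuf import text_format

net_param = caffe_pb2.NetParameter()
with open('mnist/lenet_auto_train.prototxt') as f:
    text_format.Merge(f.read(), net_param)

# tweak a field -- here the data layer's batch size -- and write it back out
net_param.layer[0].data_param.batch_size = 128
with open('mnist/lenet_auto_train.prototxt', 'w') as f:
    f.write(text_format.MessageToString(net_param))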

Now let’s see the learning parameters, which are also written as a prototxt file (already provided on disk). We’re using SGD with momentum, weight decay, and a specific learning rate schedule.

1
!cat mnist/lenet_auto_solver.prototxt
# The train/test net protocol buffer definition
train_net: "mnist/lenet_auto_train.prototxt"
test_net: "mnist/lenet_auto_test.prototxt"
# test_iter specifies how many forward passes the test should carry out.
# In the case of MNIST, we have test batch size 100 and 100 test iterations,
# covering the full 10,000 testing images.
test_iter: 100
# Carry out testing every 500 training iterations.
test_interval: 500
# The base learning rate, momentum and the weight decay of the network.
base_lr: 0.01
momentum: 0.9
weight_decay: 0.0005
# The learning rate policy
lr_policy: "inv"
gamma: 0.0001
power: 0.75
# Display every 100 iterations
display: 100
# The maximum number of iterations
max_iter: 10000
# snapshot intermediate results
snapshot: 5000
snapshot_prefix: "mnist/lenet"
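For reference, the inv policy used here decays the learning rate as base_lr * (1 + gamma * iter)^(-power). A quick sketch of the schedule with the values from this solver:

# effective learning rate under Caffe's 'inv' policy, using the solver values above
base_lr, gamma, power = 0.01, 0.0001, 0.75

def inv_lr(it):
    return base_lr * (1 + gamma * it) ** (-power)

for it in [0, 500, 1000, 5000, 10000]:
    print('iter %5d: lr = %.6f' % (it, inv_lr(it)))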

Loading and checking the solver

  • Let’s pick a device and load the solver. We’ll use SGD (with momentum), but other methods (such as Adagrad and Nesterov’s accelerated gradient) are also available.
1
2
3
4
5
6
caffe.set_device(0)
caffe.set_mode_gpu()

### load the solver and create train and test nets
solver = None # ignore this workaround for lmdb data (can't instantiate two solvers on the same data)
solver = caffe.SGDSolver('mnist/lenet_auto_solver.prototxt')
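The other solver types mentioned above are exposed in pycaffe as well. A hedged sketch of two ways to select one; caffe.get_solver builds whatever the prototxt's optional type field names (defaulting to SGD), while the dedicated classes construct a specific solver directly:

# Equivalent to the SGDSolver call above: get_solver reads the (optional) `type:` field
# from the prototxt and defaults to SGD when it is absent.
solver = caffe.get_solver('mnist/lenet_auto_solver.prototxt')

# The dedicated classes construct a specific solver type directly, e.g.:
# solver = caffe.NesterovSolver('mnist/lenet_auto_solver.prototxt')
# solver = caffe.AdaGradSolver('mnist/lenet_auto_solver.prototxt')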
  • To get an idea of the architecture of our net, we can check the dimensions of the intermediate features (blobs) and parameters (these will also be useful to refer to when manipulating data later).
1
2
# each output is (batch size, feature dim, spatial dim)
[(k, v.data.shape) for k, v in solver.net.blobs.items()]
[('data', (64, 1, 28, 28)),
 ('label', (64,)),
 ('conv1', (64, 20, 24, 24)),
 ('pool1', (64, 20, 12, 12)),
 ('conv2', (64, 50, 8, 8)),
 ('pool2', (64, 50, 4, 4)),
 ('fc1', (64, 500)),
 ('score', (64, 10)),
 ('loss', ())]
1
2
# just print the weight sizes (we'll omit the biases)
[(k, v[0].data.shape) for k, v in solver.net.params.items()]
[('conv1', (20, 1, 5, 5)),
 ('conv2', (50, 20, 5, 5)),
 ('fc1', (500, 800)),
 ('score', (10, 500))]
  • Before taking off, let’s check that everything is loaded as we expect. We’ll run a forward pass on the train and test nets and check that they contain our data.
1
2
solver.net.forward()  # train net
solver.test_nets[0].forward() # test net (there can be more than one)
{'loss': array(2.365971088409424, dtype=float32)}
1
2
3
# we use a little trick to tile the first eight images
imshow(solver.net.blobs['data'].data[:8, 0].transpose(1, 0, 2).reshape(28, 8*28), cmap='gray'); axis('off')
print 'train labels:', solver.net.blobs['label'].data[:8]
train labels: [ 5.  0.  4.  1.  9.  2.  1.  3.]

png

1
2
imshow(solver.test_nets[0].blobs['data'].data[:8, 0].transpose(1, 0, 2).reshape(28, 8*28), cmap='gray'); axis('off')
print 'test labels:', solver.test_nets[0].blobs['label'].data[:8]
test labels: [ 7.  2.  1.  0.  4.  1.  4.  9.]

png

Stepping the solver

Both train and test nets seem to be loading data, and to have correct labels.

  • Let’s take one step of (minibatch) SGD and see what happens.
1
solver.step(1)

Do we have gradients propagating through our filters? Let’s see the updates to the first layer, shown here as a $4 \times 5$ grid of $5 \times 5$ filters.

1
2
imshow(solver.net.params['conv1'][0].diff[:, 0].reshape(4, 5, 5, 5)
       .transpose(0, 2, 1, 3).reshape(4*5, 5*5), cmap='gray'); axis('off')
(-0.5, 24.5, 19.5, -0.5)

png

Writing a custom training loop

Something is happening. Let’s run the net for a while, keeping track of a few things as it goes.
Note that this process will be the same as if training through the caffe binary. In particular:

  • logging will continue to happen as normal
  • snapshots will be taken at the interval specified in the solver prototxt (here, every 5000 iterations)
  • testing will happen at the interval specified (here, every 500 iterations)

Since we have control of the loop in Python, we’re free to compute additional things as we go, as we show below. We can do many other things as well, for example:

  • write a custom stopping criterion (a sketch follows the training run below)
  • change the solving process by updating the net in the loop
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
%%time
niter = 200
test_interval = 25
# losses will also be stored in the log
train_loss = zeros(niter)
test_acc = zeros(int(np.ceil(niter / test_interval)))
output = zeros((niter, 8, 10))

# the main solver loop
for it in range(niter):
    solver.step(1) # SGD by Caffe

    # store the train loss
    train_loss[it] = solver.net.blobs['loss'].data

    # store the output on the first test batch
    # (start the forward pass at conv1 to avoid loading new data)
    solver.test_nets[0].forward(start='conv1')
    output[it] = solver.test_nets[0].blobs['score'].data[:8]

    # run a full test every so often
    # (Caffe can also do this for us and write to a log, but we show here
    #  how to do it directly in Python, where more complicated things are easier.)
    if it % test_interval == 0:
        print 'Iteration', it, 'testing...'
        correct = 0
        for test_it in range(100):
            solver.test_nets[0].forward()
            correct += sum(solver.test_nets[0].blobs['score'].data.argmax(1)
                           == solver.test_nets[0].blobs['label'].data)
        test_acc[it // test_interval] = correct / 1e4
Iteration 0 testing...
Iteration 25 testing...
Iteration 50 testing...
Iteration 75 testing...
Iteration 100 testing...
Iteration 125 testing...
Iteration 150 testing...
Iteration 175 testing...
CPU times: user 12.6 s, sys: 2.4 s, total: 15 s
Wall time: 14.4 s
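As one example of the extra control this Python loop gives you (the custom stopping criterion mentioned above), here is a hedged sketch of early stopping once the training loss stops improving; the patience and tolerance values are arbitrary choices, not part of the original example.

# Sketch: stop once the train loss has not improved for `patience` consecutive steps.
# Assumes a freshly loaded `solver`; the thresholds are illustrative only.
best_loss = float('inf')
since_improvement = 0
patience = 50

for it in range(1000):
    solver.step(1)
    loss = float(solver.net.blobs['loss'].data)
    if loss < best_loss - 1e-3:        # counts as a meaningful improvement
        best_loss = loss
        since_improvement = 0
    else:
        since_improvement += 1
    if since_improvement >= patience:  # the custom stopping criterion
        print('stopping early at iteration %d (best loss %.3f)' % (it, best_loss))
        break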
  • Let’s plot the train loss and test accuracy.
1
2
3
4
5
6
7
8
_, ax1 = subplots()
ax2 = ax1.twinx()
ax1.plot(arange(niter), train_loss)
ax2.plot(test_interval * arange(len(test_acc)), test_acc, 'r')
ax1.set_xlabel('iteration')
ax1.set_ylabel('train loss')
ax2.set_ylabel('test accuracy')
ax2.set_title('Test Accuracy: {:.2f}'.format(test_acc[-1]))
<matplotlib.text.Text at 0x7f5199b33610>

png

The loss seems to have dropped quickly and converged (except for stochasticity), while the accuracy rose correspondingly. Hooray!

  • Since we saved the results on the first test batch, we can watch how our prediction scores evolved. We’ll plot time on the $x$ axis and each possible label on the $y$, with lightness indicating confidence.
1
2
3
4
5
6
7
for i in range(8):
    figure(figsize=(2, 2))
    imshow(solver.test_nets[0].blobs['data'].data[i, 0], cmap='gray')
    figure(figsize=(10, 2))
    imshow(output[:50, i].T, interpolation='nearest', cmap='gray')
    xlabel('iteration')
    ylabel('label')

png

png

png

png

png

png

png

png

png

png

png

png

png

png

png

png

We started with little idea about any of these digits, and ended up with correct classifications for each. If you’ve been following along, you’ll see the last digit is the most difficult, a slanted “9” that’s (understandably) most confused with “4”.

  • Note that these are the “raw” output scores rather than the softmax-computed probability vectors. The latter, shown below, make it easier to see the confidence of our net (but harder to see the scores for less likely digits).
1
2
3
4
5
6
7
for i in range(8):
    figure(figsize=(2, 2))
    imshow(solver.test_nets[0].blobs['data'].data[i, 0], cmap='gray')
    figure(figsize=(10, 2))
    imshow(exp(output[:50, i].T) / exp(output[:50, i].T).sum(0), interpolation='nearest', cmap='gray')
    xlabel('iteration')
    ylabel('label')

png

png

png

png

png

png

png

png

png

png

png

png

png

png

png

png

Experiment with architecture and optimization

Now that we’ve defined, trained, and tested LeNet, there are many possible next steps:

  • Define new architectures for comparison
  • Tune optimization by adjusting base_lr and the like, or simply by training longer
  • Switch the solver type from SGD to an adaptive method such as AdaDelta or Adam

Feel free to explore these directions by editing the all-in-one example that follows.
Look for "EDIT HERE" comments for suggested choice points.

By default this defines a simple linear classifier as a baseline.

In case your coffee hasn’t kicked in and you’d like inspiration, try out

  1. Switch the nonlinearity from ReLU to ELU or a saturating nonlinearity like Sigmoid
  2. Stack more fully connected and nonlinear layers
  3. Search over learning rate 10x at a time (trying 0.1 and 0.001); a sweep sketch follows the example’s output below
  4. Switch the solver type to Adam (this adaptive solver type should be less sensitive to hyperparameters, but no guarantees…)
  5. Solve for longer by setting niter higher (to 500 or 1,000 for instance) to better show training differences
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
train_net_path = 'mnist/custom_auto_train.prototxt'
test_net_path = 'mnist/custom_auto_test.prototxt'
solver_config_path = 'mnist/custom_auto_solver.prototxt'

### define net
def custom_net(lmdb, batch_size):
    # define your own net!
    n = caffe.NetSpec()

    # keep this data layer for all networks
    n.data, n.label = L.Data(batch_size=batch_size, backend=P.Data.LMDB, source=lmdb,
                             transform_param=dict(scale=1./255), ntop=2)

    # EDIT HERE to try different networks
    # this single layer defines a simple linear classifier
    # (in particular this defines a multiway logistic regression)
    n.score = L.InnerProduct(n.data, num_output=10, weight_filler=dict(type='xavier'))

    # EDIT HERE this is the LeNet variant we have already tried
    # n.conv1 = L.Convolution(n.data, kernel_size=5, num_output=20, weight_filler=dict(type='xavier'))
    # n.pool1 = L.Pooling(n.conv1, kernel_size=2, stride=2, pool=P.Pooling.MAX)
    # n.conv2 = L.Convolution(n.pool1, kernel_size=5, num_output=50, weight_filler=dict(type='xavier'))
    # n.pool2 = L.Pooling(n.conv2, kernel_size=2, stride=2, pool=P.Pooling.MAX)
    # n.fc1 = L.InnerProduct(n.pool2, num_output=500, weight_filler=dict(type='xavier'))
    # EDIT HERE consider L.ELU or L.Sigmoid for the nonlinearity
    # n.relu1 = L.ReLU(n.fc1, in_place=True)
    # n.score = L.InnerProduct(n.fc1, num_output=10, weight_filler=dict(type='xavier'))

    # keep this loss layer for all networks
    n.loss = L.SoftmaxWithLoss(n.score, n.label)

    return n.to_proto()

with open(train_net_path, 'w') as f:
    f.write(str(custom_net('mnist/mnist_train_lmdb', 64)))
with open(test_net_path, 'w') as f:
    f.write(str(custom_net('mnist/mnist_test_lmdb', 100)))

### define solver
from caffe.proto import caffe_pb2
s = caffe_pb2.SolverParameter()

# Set a seed for reproducible experiments:
# this controls for randomization in training.
s.random_seed = 0xCAFFE

# Specify locations of the train and (maybe) test networks.
s.train_net = train_net_path
s.test_net.append(test_net_path)
s.test_interval = 500 # Test after every 500 training iterations.
s.test_iter.append(100) # Test on 100 batches each time we test.

s.max_iter = 10000 # no. of times to update the net (training iterations)

# EDIT HERE to try different solvers
# solver types include "SGD", "Adam", and "Nesterov" among others.
s.type = "SGD"

# Set the initial learning rate for SGD.
s.base_lr = 0.01 # EDIT HERE to try different learning rates
# Set momentum to accelerate learning by
# taking weighted average of current and previous updates.
s.momentum = 0.9
# Set weight decay to regularize and prevent overfitting
s.weight_decay = 5e-4

# Set `lr_policy` to define how the learning rate changes during training.
# This is the same policy as our default LeNet.
s.lr_policy = 'inv'
s.gamma = 0.0001
s.power = 0.75
# EDIT HERE to try the fixed rate (and compare with adaptive solvers)
# `fixed` is the simplest policy that keeps the learning rate constant.
# s.lr_policy = 'fixed'

# Display the current training loss and accuracy every 1000 iterations.
s.display = 1000

# Snapshots are files used to store networks we've trained.
# We'll snapshot every 5K iterations -- twice during training.
s.snapshot = 5000
s.snapshot_prefix = 'mnist/custom_net'

# Train on the GPU
s.solver_mode = caffe_pb2.SolverParameter.GPU

# Write the solver definition to disk as a prototxt file.
with open(solver_config_path, 'w') as f:
    f.write(str(s))

### load the solver and create train and test nets
solver = None # ignore this workaround for lmdb data (can't instantiate two solvers on the same data)
solver = caffe.get_solver(solver_config_path)

### solve
niter = 250 # EDIT HERE increase to train for longer
test_interval = niter / 10
# losses will also be stored in the log
train_loss = zeros(niter)
test_acc = zeros(int(np.ceil(niter / test_interval)))

# the main solver loop
for it in range(niter):
    solver.step(1) # SGD by Caffe

    # store the train loss
    train_loss[it] = solver.net.blobs['loss'].data

    # run a full test every so often
    # (Caffe can also do this for us and write to a log, but we show here
    #  how to do it directly in Python, where more complicated things are easier.)
    if it % test_interval == 0:
        print 'Iteration', it, 'testing...'
        correct = 0
        for test_it in range(100):
            solver.test_nets[0].forward()
            correct += sum(solver.test_nets[0].blobs['score'].data.argmax(1)
                           == solver.test_nets[0].blobs['label'].data)
        test_acc[it // test_interval] = correct / 1e4

_, ax1 = subplots()
ax2 = ax1.twinx()
ax1.plot(arange(niter), train_loss)
ax2.plot(test_interval * arange(len(test_acc)), test_acc, 'r')
ax1.set_xlabel('iteration')
ax1.set_ylabel('train loss')
ax2.set_ylabel('test accuracy')
ax2.set_title('Custom Test Accuracy: {:.2f}'.format(test_acc[-1]))
Iteration 0 testing...
Iteration 25 testing...
Iteration 50 testing...
Iteration 75 testing...
Iteration 100 testing...
Iteration 125 testing...
Iteration 150 testing...
Iteration 175 testing...
Iteration 200 testing...
Iteration 225 testing...





<matplotlib.text.Text at 0x7f5199af9f50>

png
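If you want to run the learning-rate search suggested above programmatically rather than by hand, here is a hedged sketch that reuses the SolverParameter s and solver_config_path defined in the example; the candidate rates, run length, and number of evaluation batches are arbitrary choices for illustration only.

# Sketch: a crude learning-rate sweep around the solver defined above.
results = {}
for lr in [0.1, 0.01, 0.001]:
    s.base_lr = lr
    with open(solver_config_path, 'w') as f:
        f.write(str(s))
    solver = None  # same LMDB workaround as above before creating a new solver
    solver = caffe.get_solver(solver_config_path)
    solver.step(250)  # short run; increase for a more meaningful comparison
    # quick accuracy estimate on a handful of test batches (100 images each)
    correct = 0
    for _ in range(10):
        solver.test_nets[0].forward()
        correct += (solver.test_nets[0].blobs['score'].data.argmax(1)
                    == solver.test_nets[0].blobs['label'].data).sum()
    results[lr] = correct / (10 * 100.0)
print(results)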

Reference

History

  • 20180808: created.