0%

Tutorial

In this example we’ll classify an image with the bundled CaffeNet model (which is based on the network architecture of Krizhevsky et al. for ImageNet).

We’ll compare CPU and GPU modes and then dig into the model to inspect features and the output.

Setup

  • First, set up Python, numpy, and matplotlib.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
# set up Python environment: numpy for numerical routines, and matplotlib for plotting
import numpy as np
import matplotlib.pyplot as plt
# display plots in this notebook
%matplotlib inline

# set display defaults
plt.rcParams['figure.figsize'] = (10, 10) # large images
plt.rcParams['image.interpolation'] = 'nearest' # don't interpolate: show square pixels
plt.rcParams['image.cmap'] = 'gray' # use grayscale output rather than a (potentially misleading) color heatmap


# The caffe module needs to be on the Python path;
# we'll add it here explicitly.
import sys
caffe_root = '../' # this file should be run from {caffe_root}/examples (otherwise change this line)
sys.path.insert(0, caffe_root + 'python')

import caffe
# If you get "No module named _caffe", either you have not built pycaffe or you have the wrong path.

import os
if os.path.isfile(caffe_root + 'models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel'):
print 'CaffeNet found.'
else:
print 'Downloading pre-trained CaffeNet model...'
!../scripts/download_model_binary.py ../models/bvlc_reference_caffenet
CaffeNet found.

Load net and set up input preprocessing

  • Set Caffe to CPU mode and load the net from disk.
1
2
3
4
5
6
7
8
caffe.set_mode_cpu()

model_def = caffe_root + 'models/bvlc_reference_caffenet/deploy.prototxt'
model_weights = caffe_root + 'models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel'

net = caffe.Net(model_def, # defines the structure of the model
model_weights, # contains the trained weights
caffe.TEST) # use test mode (e.g., don't perform dropout)
  • Set up input preprocessing. (We’ll use Caffe’s caffe.io.Transformer to do this, but this step is independent of other parts of Caffe, so any custom preprocessing code may be used).

    Our default CaffeNet is configured to take images in BGR format. Values are expected to start in the range [0, 255] and then have the mean ImageNet pixel value subtracted from them. In addition, the channel dimension is expected as the first (outermost) dimension.

    As matplotlib will load images with values in the range [0, 1] in RGB format with the channel as the innermost dimension, we are arranging for the needed transformations here.

1
2
3
4
5
6
7
8
9
10
11
12
13
# load the mean ImageNet image (as distributed with Caffe) for subtraction
mu = np.load(caffe_root + 'python/caffe/imagenet/ilsvrc_2012_mean.npy')
mu = mu.mean(1).mean(1) # average over pixels to obtain the mean (BGR) pixel values
#mu = np.array([ 104.00698793, 116.66876762, 122.67891434])
print 'mean-subtracted values:', zip('BGR', mu)

# create transformer for the input called 'data'
transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape})

transformer.set_transpose('data', (2,0,1)) # move image channels to outermost dimension
transformer.set_mean('data', mu) # subtract the dataset-mean value in each channel
transformer.set_raw_scale('data', 255) # rescale from [0, 1] to [0, 255]
transformer.set_channel_swap('data', (2,1,0)) # swap channels from RGB to BGR
mean-subtracted values: [('B', 104.0069879317889), ('G', 116.66876761696767), ('R', 122.6789143406786)]

CPU classification

  • Now we’re ready to perform classification. Even though we’ll only classify one image, we’ll set a batch size of 50 to demonstrate batching.
1
2
3
4
5
# set the size of the input (we can skip this if we're happy
# with the default; we can also change it later, e.g., for different batch sizes)
net.blobs['data'].reshape(50, # batch size
3, # 3-channel (BGR) images
227, 227) # image size is 227x227
  • Load an image (that comes with Caffe) and perform the preprocessing we’ve set up.
1
2
3
image = caffe.io.load_image(caffe_root + 'examples/images/cat.jpg')
transformed_image = transformer.preprocess('data', image)
plt.imshow(image)
/usr/local/lib/python2.7/dist-packages/skimage/transform/_warps.py:84: UserWarning: The default mode, 'constant', will be changed to 'reflect' in skimage 0.15.
  warn("The default mode, 'constant', will be changed to 'reflect' in "





<matplotlib.image.AxesImage at 0x7f2088044450>

png

  • Adorable! Let’s classify it!
1
2
3
4
5
6
7
8
9
10
# copy the image data into the memory allocated for the net
net.blobs['data'].data[...] = transformed_image

### perform classification
output = net.forward()

output_prob = output['prob'][0] # the output probability vector for the first image in the batch

print output_prob.shape
print 'predicted class is:', output_prob.argmax()
(1000,)
predicted class is: 281
  • The net gives us a vector of probabilities; the most probable class was the 281st one. But is that correct? Let’s check the ImageNet labels…
1
2
3
4
5
6
7
8
9
# load ImageNet labels
labels_file = caffe_root + 'data/ilsvrc12/synset_words.txt'
if not os.path.exists(labels_file):
!../data/ilsvrc12/get_ilsvrc_aux.sh

labels = np.loadtxt(labels_file, str, delimiter='\t')

print labels.shape
print 'output label:', labels[output_prob.argmax()]
(1000,)
output label: n02123045 tabby, tabby cat
  • “Tabby cat” is correct! But let’s also look at other top (but less confident predictions).
1
2
3
4
5
# sort top five predictions from softmax output
top_inds = output_prob.argsort()[::-1][:5] # reverse sort and take five largest items

print 'probabilities and labels:'
zip(output_prob[top_inds], labels[top_inds])
probabilities and labels:





[(0.31243625, 'n02123045 tabby, tabby cat'),
 (0.23797157, 'n02123159 tiger cat'),
 (0.12387245, 'n02124075 Egyptian cat'),
 (0.10075716, 'n02119022 red fox, Vulpes vulpes'),
 (0.070957333, 'n02127052 lynx, catamount')]
  • We see that less confident predictions are sensible.

Switching to GPU mode

  • Let’s see how long classification took, and compare it to GPU mode.
1
%timeit net.forward()
1 loop, best of 3: 4.26 s per loop
  • That’s a while, even for a batch of 50 images. Let’s switch to GPU mode.
1
2
3
4
caffe.set_device(0)  # if we have multiple GPUs, pick the first one
caffe.set_mode_gpu()
net.forward() # run once before timing to set up memory
%timeit net.forward()
10 loops, best of 3: 29.6 ms per loop
  • That should be much faster!

Examining intermediate output

  • A net is not just a black box; let’s take a look at some of the parameters and intermediate activations.

First we’ll see how to read out the structure of the net in terms of activation and parameter shapes.

  • For each layer, let’s look at the activation shapes, which typically have the form (batch_size, channel_dim, height, width).

    The activations are exposed as an OrderedDict, net.blobs.

1
2
3
# for each layer, show the output shape
for layer_name, blob in net.blobs.iteritems():
print layer_name + '\t' + str(blob.data.shape)
data	(50, 3, 227, 227)
conv1	(50, 96, 55, 55)
pool1	(50, 96, 27, 27)
norm1	(50, 96, 27, 27)
conv2	(50, 256, 27, 27)
pool2	(50, 256, 13, 13)
norm2	(50, 256, 13, 13)
conv3	(50, 384, 13, 13)
conv4	(50, 384, 13, 13)
conv5	(50, 256, 13, 13)
pool5	(50, 256, 6, 6)
fc6	(50, 4096)
fc7	(50, 4096)
fc8	(50, 1000)
prob	(50, 1000)
  • Now look at the parameter shapes. The parameters are exposed as another OrderedDict, net.params. We need to index the resulting values with either [0] for weights or [1] for biases.

    The param shapes typically have the form (output_channels, input_channels, filter_height, filter_width) (for the weights) and the 1-dimensional shape (output_channels,) (for the biases).

1
2
for layer_name, param in net.params.iteritems():
print layer_name + '\t' + str(param[0].data.shape), str(param[1].data.shape)
conv1	(96, 3, 11, 11) (96,)
conv2	(256, 48, 5, 5) (256,)
conv3	(384, 256, 3, 3) (384,)
conv4	(384, 192, 3, 3) (384,)
conv5	(256, 192, 3, 3) (256,)
fc6	(4096, 9216) (4096,)
fc7	(4096, 4096) (4096,)
fc8	(1000, 4096) (1000,)
  • Since we’re dealing with four-dimensional data here, we’ll define a helper function for visualizing sets of rectangular heatmaps.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
def vis_square(data):
"""Take an array of shape (n, height, width) or (n, height, width, 3)
and visualize each (height, width) thing in a grid of size approx. sqrt(n) by sqrt(n)"""

# normalize data for display
data = (data - data.min()) / (data.max() - data.min())

# force the number of filters to be square
n = int(np.ceil(np.sqrt(data.shape[0])))
padding = (((0, n ** 2 - data.shape[0]),
(0, 1), (0, 1)) # add some space between filters
+ ((0, 0),) * (data.ndim - 3)) # don't pad the last dimension (if there is one)
data = np.pad(data, padding, mode='constant', constant_values=1) # pad with ones (white)

# tile the filters into an image
data = data.reshape((n, n) + data.shape[1:]).transpose((0, 2, 1, 3) + tuple(range(4, data.ndim + 1)))
data = data.reshape((n * data.shape[1], n * data.shape[3]) + data.shape[4:])

plt.imshow(data); plt.axis('off')
  • First we’ll look at the first layer filters, conv1
1
2
3
# the parameters are a list of [weights, biases]
filters = net.params['conv1'][0].data
vis_square(filters.transpose(0, 2, 3, 1))

png

  • The first layer output, conv1 (rectified responses of the filters above, first 36 only)
1
2
feat = net.blobs['conv1'].data[0, :36]
vis_square(feat)

png

  • The fifth layer after pooling, pool5
1
2
feat = net.blobs['pool5'].data[0]
vis_square(feat)

png

  • The first fully connected layer, fc6 (rectified)

    We show the output values and the histogram of the positive values

1
2
3
4
5
feat = net.blobs['fc6'].data[0]
plt.subplot(2, 1, 1)
plt.plot(feat.flat)
plt.subplot(2, 1, 2)
_ = plt.hist(feat.flat[feat.flat > 0], bins=100)

png

  • The final probability output, prob
1
2
3
feat = net.blobs['prob'].data[0]
plt.figure(figsize=(15, 3))
plt.plot(feat.flat)
[<matplotlib.lines.Line2D at 0x7f2060250650>]

png

Note the cluster of strong predictions; the labels are sorted semantically. The top peaks correspond to the top predicted labels, as shown above.

Try your own image

Now we’ll grab an image from the web and classify it using the steps above.

  • Try setting my_image_url to any JPEG image URL.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
# download an image
my_image_url = "..." # paste your URL here
# for example:
# my_image_url = "https://upload.wikimedia.org/wikipedia/commons/b/be/Orang_Utan%2C_Semenggok_Forest_Reserve%2C_Sarawak%2C_Borneo%2C_Malaysia.JPG"
!wget -O image.jpg $my_image_url

# transform it and copy it into the net
image = caffe.io.load_image('image.jpg')
net.blobs['data'].data[...] = transformer.preprocess('data', image)

# perform classification
net.forward()

# obtain the output probabilities
output_prob = net.blobs['prob'].data[0]

# sort top five predictions from softmax output
top_inds = output_prob.argsort()[::-1][:5]

plt.imshow(image)

print 'probabilities and labels:'
zip(output_prob[top_inds], labels[top_inds])

Reference

History

  • 20180807: created.

LeNet

The design of LeNet contains the essence of CNNs that are still used in larger models such as the ones in ImageNet. In general, it consists of a convolutional layer followed by a pooling layer, another convolution layer followed by a pooling layer, and then two fully connected layers similar to the conventional multilayer perceptrons.

经典LeNet结构:

input->conv1(20,)-pool1-conv2(50,)-pool2-f1(500,ReLU)-f2(10,softmax)->output

lenet_train_test.prototxt

  • batch size设置在net.prototxt中而不是solver.prototxt中,用以明确blob的dims
  • bottom: layer的input blob; top: layer的output blob
  • 对于0-255区间的pixel,需要归一化到0-1区间,scale = 1/256. = 0.00390625
  • lr_mult: 1表示learning时,weight的learning rate需要x1;
  • lr_mult: 2表示learning时,bias的learning rate需要x2 (this usually leads to better convergence rates)
  • InnerProduct默认输出的是z,而不是a=sigmoid(z)
  • ReLU是Inplace操作,输入输出blob都是ip1,对于其他Layer,input和output的blob不能是相同的

Input Layer types

Input Layer types for train_val.prototxt

``python
solver.net.forward() # load mini-batch images from training data

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51

**Data**
```prototxt
name: "mnist"
layer {
name: "mnist"
type: "Data"
top: "data"
top: "label"
include {
phase: TRAIN
}
transform_param {
scale: 0.00390625
}
data_param {
source: "examples/mnist/mnist_train_lmdb"
batch_size: 64
backend: LMDB
}
}


name: "CaffeNet"
layer {
name: "data"
type: "Data"
top: "data"
top: "label"
include {
phase: TRAIN
}
transform_param {
mirror: true
crop_size: 227
mean_file: "data/ilsvrc12/imagenet_mean.binaryproto"
}
# mean pixel / channel-wise mean instead of mean image
# transform_param {
# crop_size: 227
# mean_value: 104
# mean_value: 117
# mean_value: 123
# mirror: true
# }
data_param {
source: "examples/imagenet/ilsvrc12_train_lmdb"
batch_size: 256
backend: LMDB
}
}

** ImageData**: read raw images.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
layer {
name: "data"
type: "ImageData"
top: "data"
top: "label"
include {
phase: TRAIN
}
transform_param {
mirror: true
crop_size: 227
mean_file: "data/ilsvrc12/imagenet_mean.binaryproto"
}
image_data_param {
source: "data/flickr_style/train.txt"
batch_size: 50
new_height: 256
new_width: 256
}
}

Input Layer types for deploy.prototxt

** DummyData **: no labels, only for forward and get probs

1
2
3
4
5
6
7
8
9
10
11
12
13
layer {
name: "data"
type: "DummyData"
top: "data"
dummy_data_param {
shape {
dim: 1
dim: 3
dim: 227
dim: 227
}
}
}

** Input** : typically used for networks that are being deployed.

1
2
3
4
5
6
7
8
9
10
11
12
layer {
name: "data"
type: "Input"
top: "data"
input_param {
shape {
dim: 10
dim: 3
dim: 227
dim: 227
}
}

LeNet train_val.prototxt

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
name: "LeNet"
layer {
name: "mnist"
type: "Data"
top: "data"
top: "label"
include {
phase: TRAIN
}
transform_param {
scale: 0.00390625
}
data_param {
source: "examples/mnist/mnist_train_lmdb"
batch_size: 64
backend: LMDB
}
}
layer {
name: "mnist"
type: "Data"
top: "data"
top: "label"
include {
phase: TEST
}
transform_param {
scale: 0.00390625
}
data_param {
source: "examples/mnist/mnist_test_lmdb"
batch_size: 100
backend: LMDB
}
}
layer {
name: "conv1"
type: "Convolution"
bottom: "data"
top: "conv1"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
convolution_param {
num_output: 20
kernel_size: 5
stride: 1
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "pool1"
type: "Pooling"
bottom: "conv1"
top: "pool1"
pooling_param {
pool: MAX
kernel_size: 2
stride: 2
}
}
layer {
name: "conv2"
type: "Convolution"
bottom: "pool1"
top: "conv2"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
convolution_param {
num_output: 50
kernel_size: 5
stride: 1
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "pool2"
type: "Pooling"
bottom: "conv2"
top: "pool2"
pooling_param {
pool: MAX
kernel_size: 2
stride: 2
}
}
layer {
name: "ip1"
type: "InnerProduct"
bottom: "pool2"
top: "ip1"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
inner_product_param {
num_output: 500
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "relu1"
type: "ReLU"
bottom: "ip1"
top: "ip1"
}
layer {
name: "ip2"
type: "InnerProduct"
bottom: "ip1"
top: "ip2"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
inner_product_param {
num_output: 10
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "accuracy"
type: "Accuracy"
bottom: "ip2"
bottom: "label"
top: "accuracy"
include {
phase: TEST
}
}
layer {
name: "loss"
type: "SoftmaxWithLoss"
bottom: "ip2"
bottom: "label"
top: "loss"
}

LeNet solver.prototxt

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
# The train/test net protocol buffer definition
net: "examples/mnist/lenet_train_test.prototxt"

# batch_size定义在net.prototxt中,train_mini_batch_size = 64,test_mini_batch_size = 100

# test_iter specifies how many forward passes the test should carry out.
# In the case of MNIST, we have test batch size 100 and 100 test iterations,
# covering the full 10,000 testing images.
test_iter: 100 # test_iter = num_test_images/test_mini_batch_size = 10000/100
# Carry out testing every 500 training iterations.
test_interval: 500
# The base learning rate, momentum and the weight decay of the network.
base_lr: 0.01
momentum: 0.9
weight_decay: 0.0005
# The learning rate policy
lr_policy: "inv"
gamma: 0.0001
power: 0.75
# Display every 100 iterations
display: 100
# The maximum number of iterations
max_iter: 10000 # epoch =
# snapshot intermediate results
snapshot: 5000
snapshot_prefix: "examples/mnist/lenet"
# solver mode: CPU or GPU
solver_mode: GPU

learning rate policy (todo…)

This is the same policy as our default LeNet.

1
2
3
s.lr_policy = 'inv'
s.gamma = 0.0001
s.power = 0.75

EDIT HERE to try the fixed rate (and compare with adaptive solvers)
fixed is the simplest policy that keeps the learning rate constant.

1
s.lr_policy = 'fixed'

Set lr_policy to define how the learning rate changes during training.

1
2
3
4
5
# Here, we 'step' the learning rate by multiplying it by a factor `gamma`
# every `stepsize` iterations.
s.lr_policy = 'step'
s.gamma = 0.1
s.stepsize = 20000

solver types (todo…)

solver types include “SGD”, “Adam”, and “Nesterov” among others.

1
s.type = "SGD"

Train LeNet

1
2
cd $CAFFE_ROOT
./examples/mnist/train_lenet.sh
#!/usr/bin/env sh
set -e

./build/tools/caffe train --solver=examples/mnist/lenet_solver.prototxt $@

train output

I0807 16:15:29.555564  4273 solver.cpp:310] Iteration 10000, loss = 0.00251452
I0807 16:15:29.555619  4273 solver.cpp:330] Iteration 10000, Testing net (#0)
I0807 16:15:29.634243  4281 data_layer.cpp:73] Restarting data prefetching from start.
I0807 16:15:29.635372  4273 solver.cpp:397]     Test net output #0: accuracy = 0.9909
I0807 16:15:29.635409  4273 solver.cpp:397]     Test net output #1: loss = 0.0302912 (* 1 = 0.0302912 loss)
I0807 16:15:29.635416  4273 solver.cpp:315] Optimization Done.
I0807 16:15:29.635439  4273 caffe.cpp:259] Optimization Done.

Deploy model

  • for train, train_test.prototxt + solver.prototxt
  • for deploy, deploy.prototxt+ model.caffemodel

depoly: no weight_filler,bias_filler, loaded from weights.caffemodel. if not set weights file, w,b default to 0s

PyCaffe

pycaffe interfaces

The Python interface – pycaffe – is the caffe module and its scripts in caffe/python. import caffe to load models, do forward and backward, handle IO, visualize networks, and even instrument model solving. All model data, derivatives, and parameters are exposed for reading and writing.

  • caffe.Net is the central interface for loading, configuring, and running models.
  • caffe.Classifier and caffe.Detector provide convenience interfaces for common tasks.
  • caffe.SGDSolver exposes the solving interface.
  • caffe.io handles input / output with preprocessing and protocol buffers.
  • caffe.draw visualizes network architectures.
  • Caffe blobs are exposed as numpy ndarrays for ease-of-use and efficiency.

Tutorial IPython notebooks are found in caffe/examples: do ipython notebook caffe/examples to try them. For developer reference docstrings can be found throughout the code.

Compile pycaffe by make pycaffe. Add the module directory to your $PYTHONPATH by export PYTHONPATH=/path/to/caffe/python:$PYTHONPATH or the like for import caffe.

Reference

History

  • 20180807: created.

Series

Guide

requirements:

  • NVIDIA driver 396.54
  • CUDA 8.0 + cudnn 6.0.21
  • CUDA 9.2 +cudnn 7.1.4
  • opencv 3.1.0 —>3.3.0
  • python 2.7 + numpy 1.15.1
  • python 3.5.2 + numpy 1.16.2
  • protobuf 3.6.1 (static)
  • caffe latest

默认的protobuf,2.6.1测试通过。
此处,使用最新的3.6.1 也可以,编译caffe需要加上-std=c++11

install CUDA + cudnn

see install and configure cuda 9.2 with cudnn 7.1 on ubuntu 16.04
tips: we need to recompile caffe with cudnn 7.1

before we compile caffe, move caffe/python/caffe/selective_search_ijcv_with_python folder outside caffe source folder, otherwise error occurs.

recompile caffe with cudnn

install protobuf

see Part 1: compile protobuf-cpp on ubuntu 16.04

1
2
3
4
5
which protoc 
/usr/local/bin/protoc

protoc --version
libprotoc 3.6.1

caffe使用static的libprotoc 3.6.1

install opencv

see compile opencv on ubuntu 16.04

1
2
3
4
5
which opencv_version
/usr/local/bin/opencv_version

opencv_version
3.3.0

python

1
2
python --version
Python 2.7.12

check numpy version

1
2
3
4
5
6
7
8
import numpy
numpy.__version__
'1.15.1'

import numpy
import inspect
inspect.getfile(numpy)
'/usr/local/lib/python2.7/dist-packages/numpy/__init__.pyc'

compile caffe

clone repo

1
2
git clone https://github.com/BVLC/caffe.git 
cd caffe

update repo

update at 20180822.

if you change your local Makefile and git pull origin master merge conflict, solution

1
2
git checkout HEAD Makefile
git pull origin master

configure

1
mkdir build && cd build && cmake-gui ..

cmake-gui options

USE_CUDNN ON
USE_OPENCV ON
Build_python ON
Build_python_layer ON

BLAS atlas
CMAKE_CXX_FLGAS -std=c++11

CMAKE_INSTALL_PREFIX /home/kezunlin/program/caffe/build/install

使用-std=c++11

configure output

Dependencies:
  BLAS              :   Yes (Atlas)
  Boost             :   Yes (ver. 1.66)
  glog              :   Yes
  gflags            :   Yes
  protobuf          :   Yes (ver. 3.6.1)
  lmdb              :   Yes (ver. 0.9.17)
  LevelDB           :   Yes (ver. 1.18)
  Snappy            :   Yes (ver. 1.1.3)
  OpenCV            :   Yes (ver. 3.1.0)
  CUDA              :   Yes (ver. 9.2)

NVIDIA CUDA:
  Target GPU(s)     :   Auto
  GPU arch(s)       :   sm_61
  cuDNN             :   Yes (ver. 7.1.4)

Python:
  Interpreter       :   /usr/bin/python2.7 (ver. 2.7.12)
  Libraries         :   /usr/lib/x86_64-linux-gnu/libpython2.7.so (ver 2.7.12)
  NumPy             :   /usr/lib/python2.7/dist-packages/numpy/core/include (ver 1.51.1)

Documentaion:
  Doxygen           :   /usr/bin/doxygen (1.8.11)
  config_file       :   /home/kezunlin/program/caffe/.Doxyfile

Install:
  Install path      :   /home/kezunlin/program/caffe/build/install

Configuring done

we can also use python3.5 and numpy 1.16.2

Python:
  Interpreter       :   /usr/bin/python3 (ver. 3.5.2)
  Libraries         :   /usr/lib/x86_64-linux-gnu/libpython3.5m.so (ver 3.5.2)
  NumPy             :   /home/kezunlin/.local/lib/python3.5/site-packages/numpy/core/include (ver 1.16.2)

use -std=c++11, otherwise errors occur

make -j8
[  1%] Running C++/Python protocol buffer compiler on /home/kezunlin/program/caffe/src/caffe/proto/caffe.proto
Scanning dependencies of target caffeproto
[  1%] Building CXX object src/caffe/CMakeFiles/caffeproto.dir/__/__/include/caffe/proto/caffe.pb.cc.o
In file included from /usr/include/c++/5/mutex:35:0,
                 from /usr/local/include/google/protobuf/stubs/mutex.h:33,
                 from /usr/local/include/google/protobuf/stubs/common.h:52,
                 from /home/kezunlin/program/caffe/build/include/caffe/proto/caffe.pb.h:9,
                 from /home/kezunlin/program/caffe/build/include/caffe/proto/caffe.pb.cc:4:
/usr/include/c++/5/bits/c++0x_warning.h:32:2: error: #error This file requires compiler and library support for the ISO C++ 2011 standard. This support must be enabled with the -std=c++11 or -std=gnu++11 compiler options.
 #error This file requires compiler and library support \

fix gcc error

edit /usr/local/cuda/include/host_config.h

将其中的第115行注释掉:

1
2
3
#error-- unsupported GNU version! gcc versions later than 4.9 are not supported!
======>
//#error-- unsupported GNU version! gcc versions later than 4.9 are not supported!

fix gflags error

  • caffe/include/caffe/common.hpp
  • caffe/examples/mnist/convert_mnist_data.cpp

Comment out the ifndef

1
2
3
// #ifndef GFLAGS_GFLAGS_H_
namespace gflags = google;
// #endif // GFLAGS_GFLAGS_H_

compile

1
2
3
make clean
make -j8
make pycaffe

output

[  1%] Running C++/Python protocol buffer compiler on /home/kezunlin/program/caffe/src/caffe/proto/caffe.proto
Scanning dependencies of target caffeproto
[  1%] Building CXX object src/caffe/CMakeFiles/caffeproto.dir/__/__/include/caffe/proto/caffe.pb.cc.o
[  1%] Linking CXX static library ../../lib/libcaffeproto.a
[  1%] Built target caffeproto

libcaffeproto.a static library

install

1
2
3
4
make install

ls build/install
bin include lib python share

install to build/install folder

1
2
ls build/install/lib
libcaffeproto.a libcaffe.so libcaffe.so.1.0.0

advanced

  • INTERFACE_INCLUDE_DIRECTORIES
  • INTERFACE_LINK_LIBRARIES

Target “caffe” has an INTERFACE_LINK_LIBRARIES property which differs from its LINK_INTERFACE_LIBRARIES properties.

Play with Caffe

python caffe

fix python caffe

fix ipython 6.1 version conflict

vim caffe/python/requirements.txt

1
2
3
ipython>=3.0.0
====>
ipython==5.4.1

reinstall ipython

1
2
3
4
5
pip install -r requirements.txt

cd caffe/python
python
>>>import caffe

python draw net

1
2
3
4
5
6
7
8
9
10
11
sudo apt-get install graphviz
sudo pip install theano=0.9

# for theano d3viz
sudo pip install pydot==1.1.0
sudo pip install pydot-ng


# other usefull tools
sudo pip install jupyter
sudo pip install seaborn

we need to install graphviz, otherwise we get ERROR:”dot” not found in path

draw net

1
2
3
4
cd $CAFFE_HOME
./python/draw_net.py ./examples/mnist/lenet.prototxt ./examples/mnist/lenet.png

eog ./examples/mnist/lenet.png

lenet

cpp caffe

train net

1
2
3
4
5
6
cd caffe
./examples/mnist/create_mnist.sh
./examples/mnist/train_lenet.sh

cat ./examples/mnist/train_lenet.sh
./build/tools/caffe train --solver=examples/mnist/lenet_solver.prototxt $@

output results

I0912 15:57:28.812655 14094 solver.cpp:327] Iteration 10000, loss = 0.00272129
I0912 15:57:28.812675 14094 solver.cpp:347] Iteration 10000, Testing net (#0)
I0912 15:57:28.891481 14100 data_layer.cpp:73] Restarting data prefetching from start.
I0912 15:57:28.893678 14094 solver.cpp:414]     Test net output #0: accuracy = 0.9904
I0912 15:57:28.893707 14094 solver.cpp:414]     Test net output #1: loss = 0.0276084 (* 1 = 0.0276084 loss)
I0912 15:57:28.893714 14094 solver.cpp:332] Optimization Done.
I0912 15:57:28.893719 14094 caffe.cpp:250] Optimization Done.

tips, for caffe, errors because no imdb data.

I0417 13:28:17.764714 35030 layer_factory.hpp:77] Creating layer mnist
F0417 13:28:17.765067 35030 db_lmdb.hpp:15] Check failed: mdb_status == 0 (2 vs. 0) No such file or directory
--------------------- 

upgrade net

1
2
./tools/upgrade_net_proto_text  old.prototxt new.prototxt
./tools/upgrade_net_proto_binary old.caffemodel new.caffemodel

caffe time

yolov3

1
2
3
4
5
./build/tools/caffe time --model='det/yolov3/yolov3.prototxt' --iterations=100 --gpu=0

I0313 10:15:41.888208 12527 caffe.cpp:408] Average Forward pass: 49.7012 ms.
I0313 10:15:41.888213 12527 caffe.cpp:410] Average Backward pass: 84.946 ms.
I0313 10:15:41.888248 12527 caffe.cpp:412] Average Forward-Backward: 134.85 ms.

yolov3 5 class

1
2
3
4
5
./build/tools/caffe time --model='det/autotrain/yolo3-5c.prototxt' --iterations=100 --gpu=0

I0313 10:19:27.283625 12894 caffe.cpp:408] Average Forward pass: 38.4823 ms.
I0313 10:19:27.283630 12894 caffe.cpp:410] Average Backward pass: 74.1656 ms.
I0313 10:19:27.283638 12894 caffe.cpp:412] Average Forward-Backward: 112.732 ms.

Example

Caffe Classifier

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
#include <caffe/caffe.hpp>
#ifdef USE_OPENCV
#include <opencv2/core/core.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#endif // USE_OPENCV
#include <algorithm>
#include <iosfwd>
#include <memory>
#include <string>
#include <utility>
#include <vector>

#ifdef USE_OPENCV
using namespace caffe; // NOLINT(build/namespaces)
using std::string;

/* Pair (label, confidence) representing a prediction. */
typedef std::pair<string, float> Prediction;

class Classifier {
public:
Classifier(const string& model_file,
const string& trained_file,
const string& mean_file,
const string& label_file);

std::vector<Prediction> Classify(const cv::Mat& img, int N = 5);

private:
void SetMean(const string& mean_file);

std::vector<float> Predict(const cv::Mat& img);

void WrapInputLayer(std::vector<cv::Mat>* input_channels);

void Preprocess(const cv::Mat& img,
std::vector<cv::Mat>* input_channels);

private:
shared_ptr<Net<float> > net_;
cv::Size input_geometry_;
int num_channels_;
cv::Mat mean_;
std::vector<string> labels_;
};

Classifier::Classifier(const string& model_file,
const string& trained_file,
const string& mean_file,
const string& label_file) {
#ifdef CPU_ONLY
Caffe::set_mode(Caffe::CPU);
#else
Caffe::set_mode(Caffe::GPU);
#endif

/* Load the network. */
net_.reset(new Net<float>(model_file, TEST));
net_->CopyTrainedLayersFrom(trained_file);

CHECK_EQ(net_->num_inputs(), 1) << "Network should have exactly one input.";
CHECK_EQ(net_->num_outputs(), 1) << "Network should have exactly one output.";

Blob<float>* input_layer = net_->input_blobs()[0];
num_channels_ = input_layer->channels();
CHECK(num_channels_ == 3 || num_channels_ == 1)
<< "Input layer should have 1 or 3 channels.";
input_geometry_ = cv::Size(input_layer->width(), input_layer->height());

/* Load the binaryproto mean file. */
SetMean(mean_file);

/* Load labels. */
std::ifstream labels(label_file.c_str());
CHECK(labels) << "Unable to open labels file " << label_file;
string line;
while (std::getline(labels, line))
labels_.push_back(string(line));

Blob<float>* output_layer = net_->output_blobs()[0];
CHECK_EQ(labels_.size(), output_layer->channels())
<< "Number of labels is different from the output layer dimension.";
}

static bool PairCompare(const std::pair<float, int>& lhs,
const std::pair<float, int>& rhs) {
return lhs.first > rhs.first;
}

/* Return the indices of the top N values of vector v. */
static std::vector<int> Argmax(const std::vector<float>& v, int N) {
std::vector<std::pair<float, int> > pairs;
for (size_t i = 0; i < v.size(); ++i)
pairs.push_back(std::make_pair(v[i], i));
std::partial_sort(pairs.begin(), pairs.begin() + N, pairs.end(), PairCompare);

std::vector<int> result;
for (int i = 0; i < N; ++i)
result.push_back(pairs[i].second);
return result;
}

/* Return the top N predictions. */
std::vector<Prediction> Classifier::Classify(const cv::Mat& img, int N) {
std::vector<float> output = Predict(img);

N = std::min<int>(labels_.size(), N);
std::vector<int> maxN = Argmax(output, N);
std::vector<Prediction> predictions;
for (int i = 0; i < N; ++i) {
int idx = maxN[i];
predictions.push_back(std::make_pair(labels_[idx], output[idx]));
}

return predictions;
}

/* Load the mean file in binaryproto format. */
void Classifier::SetMean(const string& mean_file) {
BlobProto blob_proto;
ReadProtoFromBinaryFileOrDie(mean_file.c_str(), &blob_proto);

/* Convert from BlobProto to Blob<float> */
Blob<float> mean_blob;
mean_blob.FromProto(blob_proto);
CHECK_EQ(mean_blob.channels(), num_channels_)
<< "Number of channels of mean file doesn't match input layer.";

/* The format of the mean file is planar 32-bit float BGR or grayscale. */
std::vector<cv::Mat> channels;
float* data = mean_blob.mutable_cpu_data();
for (int i = 0; i < num_channels_; ++i) {
/* Extract an individual channel. */
cv::Mat channel(mean_blob.height(), mean_blob.width(), CV_32FC1, data);
channels.push_back(channel);
data += mean_blob.height() * mean_blob.width();
}

/* Merge the separate channels into a single image. */
cv::Mat mean;
cv::merge(channels, mean);

/* Compute the global mean pixel value and create a mean image
* filled with this value. */
cv::Scalar channel_mean = cv::mean(mean);
mean_ = cv::Mat(input_geometry_, mean.type(), channel_mean);
}

std::vector<float> Classifier::Predict(const cv::Mat& img) {
Blob<float>* input_layer = net_->input_blobs()[0];
input_layer->Reshape(1, num_channels_,
input_geometry_.height, input_geometry_.width);
/* Forward dimension change to all layers. */
net_->Reshape();

std::vector<cv::Mat> input_channels;
WrapInputLayer(&input_channels);

Preprocess(img, &input_channels);

net_->Forward();

/* Copy the output layer to a std::vector */
Blob<float>* output_layer = net_->output_blobs()[0];
const float* begin = output_layer->cpu_data();
const float* end = begin + output_layer->channels();
return std::vector<float>(begin, end);
}

/* Wrap the input layer of the network in separate cv::Mat objects
* (one per channel). This way we save one memcpy operation and we
* don't need to rely on cudaMemcpy2D. The last preprocessing
* operation will write the separate channels directly to the input
* layer. */
void Classifier::WrapInputLayer(std::vector<cv::Mat>* input_channels) {
Blob<float>* input_layer = net_->input_blobs()[0];

int width = input_layer->width();
int height = input_layer->height();
float* input_data = input_layer->mutable_cpu_data();
for (int i = 0; i < input_layer->channels(); ++i) {
cv::Mat channel(height, width, CV_32FC1, input_data);
input_channels->push_back(channel);
input_data += width * height;
}
}

void Classifier::Preprocess(const cv::Mat& img,
std::vector<cv::Mat>* input_channels) {
/* Convert the input image to the input image format of the network. */
cv::Mat sample;
if (img.channels() == 3 && num_channels_ == 1)
cv::cvtColor(img, sample, cv::COLOR_BGR2GRAY);
else if (img.channels() == 4 && num_channels_ == 1)
cv::cvtColor(img, sample, cv::COLOR_BGRA2GRAY);
else if (img.channels() == 4 && num_channels_ == 3)
cv::cvtColor(img, sample, cv::COLOR_BGRA2BGR);
else if (img.channels() == 1 && num_channels_ == 3)
cv::cvtColor(img, sample, cv::COLOR_GRAY2BGR);
else
sample = img;

cv::Mat sample_resized;
if (sample.size() != input_geometry_)
cv::resize(sample, sample_resized, input_geometry_);
else
sample_resized = sample;

cv::Mat sample_float;
if (num_channels_ == 3)
sample_resized.convertTo(sample_float, CV_32FC3);
else
sample_resized.convertTo(sample_float, CV_32FC1);

cv::Mat sample_normalized;
cv::subtract(sample_float, mean_, sample_normalized);

/* This operation will write the separate BGR planes directly to the
* input layer of the network because it is wrapped by the cv::Mat
* objects in input_channels. */
cv::split(sample_normalized, *input_channels);

CHECK(reinterpret_cast<float*>(input_channels->at(0).data)
== net_->input_blobs()[0]->cpu_data())
<< "Input channels are not wrapping the input layer of the network.";
}

int main(int argc, char** argv) {
if (argc != 6) {
std::cerr << "Usage: " << argv[0]
<< " deploy.prototxt network.caffemodel"
<< " mean.binaryproto labels.txt img.jpg" << std::endl;
return 1;
}

::google::InitGoogleLogging(argv[0]);

string model_file = argv[1];
string trained_file = argv[2];
string mean_file = argv[3];
string label_file = argv[4];
Classifier classifier(model_file, trained_file, mean_file, label_file);

string file = argv[5];

std::cout << "---------- Prediction for "
<< file << " ----------" << std::endl;

cv::Mat img = cv::imread(file, -1);
CHECK(!img.empty()) << "Unable to decode image " << file;
std::vector<Prediction> predictions = classifier.Classify(img);

/* Print the top N predictions. */
for (size_t i = 0; i < predictions.size(); ++i) {
Prediction p = predictions[i];
std::cout << std::fixed << std::setprecision(4) << p.second << " - \""
<< p.first << "\"" << std::endl;
}
}
#else
int main(int argc, char** argv) {
LOG(FATAL) << "This example requires OpenCV; compile with USE_OPENCV.";
}
#endif // USE_OPENCV

CMakeLists.txt

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
find_package(OpenCV REQUIRED)


set(Caffe_DIR "/home/kezunlin/program/caffe/build/install/share/Caffe") # caffe caffe

# for CaffeConfig.cmake/ caffe-config.cmake
find_package(Caffe)
# offical caffe : There is no Caffe_INCLUDE_DIRS and Caffe_DEFINITIONS
# refinedet caffe: OK.

add_definitions(${Caffe_DEFINITIONS})

MESSAGE( [Main] " Caffe_INCLUDE_DIRS = ${Caffe_INCLUDE_DIRS}")
MESSAGE( [Main] " Caffe_DEFINITIONS = ${Caffe_DEFINITIONS}")
MESSAGE( [Main] " Caffe_LIBRARIES = ${Caffe_LIBRARIES}") # caffe
MESSAGE( [Main] " Caffe_CPU_ONLY = ${Caffe_CPU_ONLY}")
MESSAGE( [Main] " Caffe_HAVE_CUDA = ${Caffe_HAVE_CUDA}")
MESSAGE( [Main] " Caffe_HAVE_CUDNN = ${Caffe_HAVE_CUDNN}")


include_directories(${Caffe_INCLUDE_DIRS})

target_link_libraries(demo
${OpenCV_LIBS}
${Caffe_LIBRARIES}
)

run

1
./demo

if error occurs:

libcaffe.so.1.0.0 => not found

edit .bashrc

1
2
# for caffe
export LD_LIBRARY_PATH=/home/kezunlin/program/caffe/build/install/lib:$LD_LIBRARY_PATH

Reference

History

  • 20180807: created.
  • 20180822: update cmake-gui for caffe

Getting Started

Shared variables tips

We encourage you to store the dataset into shared variables and access it based on the minibatch index, given a fixed and known batch size. The reason behind shared variables is related to using the GPU. There is a large overhead when copying data into the GPU memory.

将数据存储在shared variable便于加速GPU计算,避免数据从CPU拷贝到GPU。

If you have your data in Theano shared variables though, you give Theano the possibility to copy the entire data on the GPU in a single call when the shared variables are constructed.

shared构建的时候,theano一次性讲所有数据拷贝至GPU.

Because the datapoints and their labels are usually of different nature (labels are usually integers while datapoints are real numbers) we suggest to use different variables for label and data. Also we recommend using different variables for the training set, validation set and testing set to make the code more readable (resulting in 6 different shared variables).

data,label使用2个shared variables,train,valid,test使用不同的shared variables,总计6个

Mini-batch data

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
def shared_dataset(data_xy):
""" Function that loads the dataset into shared variables

The reason we store our dataset in shared variables is to allow
Theano to copy it into the GPU memory (when code is run on GPU).
Since copying data into the GPU is slow, copying a minibatch everytime
is needed (the default behaviour if the data is not in a shared
variable) would lead to a large decrease in performance.
"""
data_x, data_y = data_xy
shared_x = theano.shared(numpy.asarray(data_x, dtype=theano.config.floatX))
shared_y = theano.shared(numpy.asarray(data_y, dtype=theano.config.floatX))
# shared变量中的数据在GPU上必须是float32类型,但是计算阶段可能需要int类型(y),所以需要
# 将float32--->int.
# When storing data on the GPU it has to be stored as floats
# therefore we will store the labels as ``floatX`` as well
# (``shared_y`` does exactly that). But during our computations
# we need them as ints (we use labels as index, and if they are
# floats it doesn't make sense) therefore instead of returning
# ``shared_y`` we will have to cast it to int. This little hack
# lets us get around this issue
return shared_x, T.cast(shared_y, 'int32')

test_set_x, test_set_y = shared_dataset(test_set)
valid_set_x, valid_set_y = shared_dataset(valid_set)
train_set_x, train_set_y = shared_dataset(train_set)

batch_size = 500 # size of the minibatch

# accessing the third minibatch of the training set

data = train_set_x[2 * batch_size: 3 * batch_size]
label = train_set_y[2 * batch_size: 3 * batch_size]

SGD pseudo code in theano

Traditional GD (m=N)

1
2
3
4
5
6
7
8
# GRADIENT DESCENT

while True:
loss = f(params)
d_loss_wrt_params = ... # compute gradient
params -= learning_rate * d_loss_wrt_params
if <stopping condition is met>:
return params

Stochastic gradient descent (SGD) works according to the same principles as ordinary gradient descent, but proceeds more quickly by estimating the gradient from just a few examples at a time instead of the entire training set. In its purest form, we estimate the gradient from just a single example at a time.

Online Learning SGD (m=1)

1
2
3
4
5
6
7
8
9
# STOCHASTIC GRADIENT DESCENT
for (x_i,y_i) in training_set:
# imagine an infinite generator
# that may repeat examples (if there is only a finite training set)
loss = f(params, x_i, y_i)
d_loss_wrt_params = ... # compute gradient
params -= learning_rate * d_loss_wrt_params
if <stopping condition is met>:
return params

The variant that we recommend for deep learning is a further twist on stochastic gradient descent using so-called “minibatches”. Minibatch SGD (MSGD) works identically to SGD, except that we use more than one training example to make each estimate of the gradient. This technique reduces variance in the estimate of the gradient, and often makes better use of the hierarchical memory organization in modern computers.

Minibatch SGD (m)

1
2
3
4
5
6
7
8
for (x_batch,y_batch) in train_batches:
# imagine an infinite generator
# that may repeat examples
loss = f(params, x_batch, y_batch)
d_loss_wrt_params = ... # compute gradient using theano
params -= learning_rate * d_loss_wrt_params
if <stopping condition is met>:
return params

Theano pseudo code of Minibatch SGD

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
# Minibatch Stochastic Gradient Descent

# assume loss is a symbolic description of the loss function given
# the symbolic variables params (shared variable), x_batch, y_batch;

# compute gradient of loss with respect to params
d_loss_wrt_params = T.grad(loss, params)

# compile the MSGD step into a theano function
updates = [(params, params - learning_rate * d_loss_wrt_params)]
MSGD = theano.function([x_batch,y_batch], loss, updates=updates)

for (x_batch, y_batch) in train_batches:
# here x_batch and y_batch are elements of train_batches and
# therefore numpy arrays; function MSGD also updates the params
print('Current loss is ', MSGD(x_batch, y_batch))
if stopping_condition_is_met:
return params

Regularization

L1/L2 regularization

L1/L2 regularization and early-stopping.

Commonly used values for p are 1 and 2, hence the L1/L2 nomenclature. If p=2, then the regularizer is also called “weight decay”.

To follow Occam’s razor principle, this minimization should find us the simplest solution (as measured by our simplicity criterion) that fits the training data.

1
2
3
4
5
6
7
8
# symbolic Theano variable that represents the L1 regularization term
L1 = T.sum(abs(param))

# symbolic Theano variable that represents the squared L2 term
L2 = T.sum(param ** 2)

# the loss
loss = NLL + lambda_1 * L1 + lambda_2 * L2

Early-Stopping

Early-stopping combats overfitting by monitoring the model’s performance on a validation set. A validation set is a set of examples that we never use for gradient descent, but which is also not a part of the test set.

所谓early stopping,即在每一个epoch结束时(一个epoch即对所有训练数据的一轮遍历)计算 validation data的accuracy,当accuracy不再提高时,就停止训练。这是很自然的做法,因为accuracy不再提高了,训练下去也没用。另外,这样做还能防止overfitting。

那么,怎么样才算是validation accuracy不再提高呢?并不是说validation accuracy一降下来,它就是“不再提高”,因为可能经过这个epoch后,accuracy降低了,但是随后的epoch又让accuracy升上去了,所以不能根据一两次的连续降低就判断“不再提高”。正确的做法是,在训练的过程中,记录最佳的validation accuracy,当连续10次epoch(或者更多次)没达到最佳accuracy时,你可以认为“不再提高”,此时使用early stopping。这个策略就叫“ no-improvement-in-n”,n即epoch的次数,可以根据实际情况取10、20、30….

Variable learning rate

  • Decreasing the learning rate over time is sometimes a good idea. eta = eta0/(1+d*epoch) (d: eta decrease constant, d=0.001)

  • Early stopping + decrease learning rate. eta = eta0/2 until eta= eta0/1024

一个简单有效的做法就是,当validation accuracy满足 no-improvement-in-n规则时,本来我们是要early stopping的,但是我们可以不stop,而是让learning rate减半,之后让程序继续跑。下一次validation accuracy又满足no-improvement-in-n规则时,我们同样再将learning rate减半(此时变为原始learni rate的四分之一)…继续这个过程,直到learning rate变为原来的1/1024再终止程序。(1/1024还是1/512还是其他可以根据实际确定)。【PS:也可以选择每一次将learning rate除以10,而不是除以2.】

实践中,eta/2变化太快,eta0/(1+d*epoch),d=0.001比较合适。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
# early-stopping parameters
patience = 5000 # look as this many examples regardless
patience_increase = 2 # wait this much longer when a new best is
# found
improvement_threshold = 0.995 # a relative improvement of this much is
# considered significant
validation_frequency = min(n_train_batches, patience/2) = 2500 # for iters
# go through this many
# minibatches before checking the network
# on the validation set; in this case we
# check every epoch

best_params = None
best_validation_loss = numpy.inf
test_score = 0.
start_time = time.clock()

done_looping = False
epoch = 0
while (epoch < n_epochs) and (not done_looping):
# Report "1" for first epoch, "n_epochs" for last epoch
epoch = epoch + 1
for minibatch_index in range(n_train_batches):

d_loss_wrt_params = ... # compute gradient
params -= learning_rate * d_loss_wrt_params # gradient descent

# iteration number. We want it to start at 0.
iter = (epoch - 1) * n_train_batches + minibatch_index
# note that if we do `iter % validation_frequency` it will be
# true for iter = 0 which we do not want. We want it true for
# iter = validation_frequency - 1.
if (iter + 1) % validation_frequency == 0:

this_validation_loss = ... # compute zero-one loss on validation set

if this_validation_loss < best_validation_loss:

# improve patience if loss improvement is good enough
if this_validation_loss < best_validation_loss * improvement_threshold:
patience = max(patience, iter * patience_increase)

best_params = copy.deepcopy(params)
best_validation_loss = this_validation_loss

if patience <= iter:
done_looping = True
break

# POSTCONDITION:
# best_params refers to the best out-of-sample parameters observed during the optimization

Theano/Python Tips

Loading and Saving Models

  • DO: Pickle the numpy ndarrays from your shared variables

  • DON’T: Do not pickle your training or test functions for long-term storage

1
2
3
4
5
6
import cPickle
save_file = open('path', 'wb') # this will overwrite current contents
cPickle.dump(w.get_value(borrow=True), save_file, -1) # the -1 is for HIGHEST_PROTOCOL
cPickle.dump(v.get_value(borrow=True), save_file, -1) # .. and it triggers much more efficient
cPickle.dump(u.get_value(borrow=True), save_file, -1) # .. storage than numpy's default
save_file.close()

Then later, you can load your data back like this:

1
2
3
4
save_file = open('path')
w.set_value(cPickle.load(save_file), borrow=True)
v.set_value(cPickle.load(save_file), borrow=True)
u.set_value(cPickle.load(save_file), borrow=True)

Plotting Intermediate Results

If you have enough disk space, your training script should save intermediate models and a visualization script should process those saved models.

MLP

see here

An MLP can be viewed as a logistic regression classifier where the input is first transformed using a learnt non-linear transformation sigmoid. This transformation projects the input data into a space where it becomes linearly separable. This intermediate layer is referred to as a hidden layer. A single hidden layer is sufficient to make MLPs a universal approximator.

weight initializations

  • old version: 1/sqrt(n_in)

The initial values for the weights of a hidden layer i should be uniformly sampled from a symmetric interval that depends on the activation function. weight的初始化依赖于activation

  • tanh: uniformely sampled from -sqrt(6./(n_in+n_hidden)) and sqrt(6./(n_in+n_hidden))
  • sigmoid : use 4 times larger initial weights for sigmoid compared to tanh
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
# `W` is initialized with `W_values` which is uniformely sampled
# from sqrt(-6./(n_in+n_hidden)) and sqrt(6./(n_in+n_hidden))
# for tanh activation function
# the output of uniform if converted using asarray to dtype
# theano.config.floatX so that the code is runable on GPU
# Note : optimal initialization of weights is dependent on the
# activation function used (among other things).
# For example, results presented in [Xavier10] suggest that you
# should use 4 times larger initial weights for sigmoid
# compared to tanh
# We have no info for other function, so we use the same as
# tanh.
if W is None:
W_values = numpy.asarray(
rng.uniform(
low=-numpy.sqrt(6. / (n_in + n_out)),
high=numpy.sqrt(6. / (n_in + n_out)),
size=(n_in, n_out)
),
dtype=theano.config.floatX
)
if activation == theano.tensor.nnet.sigmoid:
W_values *= 4

W = theano.shared(value=W_values, name='W', borrow=True)

if b is None:
b_values = numpy.zeros((n_out,), dtype=theano.config.floatX)
b = theano.shared(value=b_values, name='b', borrow=True)

self.W = W
self.b = b

Tips and Tricks for training MLPs

Nonlinearity

Two of the most common ones are the sigmoid and the tanh function. nonlinearities that are symmetric around the origin are preferred because they tend to produce zero-mean inputs to the next layer (which is a desirable property). Empirically, we have observed that the tanh has better convergence properties.

Weight initialization

At initialization we want the weights to be small enough around the origin so that the activation function operates in its linear regime, where gradients are the largest. weight的初始化依赖于activation

Learning rate

  • The simplest solution is to simply have a constant rate. Rule of thumb: try several log-spaced values (10^{-1},10^{-2},\ldots) and narrow the (logarithmic) grid search to the region where you obtain the lowest validation error.

  • Decreasing the learning rate over time is sometimes a good idea. eta = eta0/(1+d*epoch) (d: decrease constant, 0.001)

  • Early stopping + decrease learning rate. eta = eta0/2 until eta= eta0/1024

Regularization parameter

Typical values to try for the L1/L2 regularization parameter \lambda are 10^{-2},10^{-3},\ldots. In the framework that we described so far, optimizing this parameter will not lead to significantly better solutions, but is worth exploring nonetheless.

Number of hidden units

This hyper-parameter is very much dataset-dependent. hidden neurons的数量依赖于具体的数据集。Unless we employ some regularization scheme (early stopping or L1/L2 penalties), a typical number of hidden units vs. generalization performance graph will be U-shaped.

CNN

The Convolution and Pool Operator

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
import theano
from theano import tensor as T
from theano.tensor.nnet import conv2d,sigmoid
from theano.tensor.signal.pool import pool_2d

import numpy
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image

rng = numpy.random.RandomState(23455)

# instantiate 4D tensor for input
input = T.tensor4(name='input')

# initialize shared variable for weights.
w_shp = (2, 3, 9, 9)
w_bound = numpy.sqrt(3 * 9 * 9)
W = theano.shared( numpy.asarray(
rng.uniform(
low=-1.0 / w_bound,
high=1.0 / w_bound,
size=w_shp),
dtype=input.dtype), name ='W')

# initialize shared variable for bias (1D tensor) with random values
# IMPORTANT: biases are usually initialized to zero. However in this
# particular application, we simply apply the convolutional layer to
# an image without learning the parameters. We therefore initialize
# them to random values to "simulate" learning.
b_shp = (2,)
b = theano.shared(numpy.asarray(
rng.uniform(low=-.5, high=.5, size=b_shp),
dtype=input.dtype), name ='b')

# build symbolic expression that computes the convolution of input with filters in w
conv_out = conv2d(input, W)

poolsize=(2,2)
pooled_out = pool_2d( input=conv_out, ws=poolsize, ignore_border=True)

conv_activations = sigmoid(conv_out + b.dimshuffle('x', 0, 'x', 'x'))
# create theano function to compute filtered images
f = theano.function([input], conv_activations)

pooled_activations = sigmoid(pooled_out + b.dimshuffle('x', 0, 'x', 'x'))
f2 = theano.function([input], pooled_activations)


#===========================================================
# processing image file
#===========================================================
# open random image of dimensions 639x516
img = Image.open(open('./3wolfmoon.jpg'))
# dimensions are (height, width, channel)
img = numpy.asarray(img, dtype=theano.config.floatX) / 256.

# put image in 4D tensor of shape (1, 3, height, width)
input_img_ = img.transpose(2, 0, 1).reshape(1, 3, 639, 516)
filtered_img = f(input_img_)
pooled_img = f2(input_img_)
print filtered_img.shape # (1, 2, 631, 508) 2 feature maps
print pooled_img.shape # (1, 2, 315, 254) 2 feature maps

fig = plt.figure(figsize=(16,8))
# (1)
# plot original image and first and second components of output
plt.subplot(2, 3, 1); plt.axis('off'); plt.imshow(img)
plt.gray();
# recall that the convOp output (filtered image) is actually a "minibatch",
# of size 1 here, so we take index 0 in the first dimension:
plt.subplot(2, 3, 2); plt.axis('off'); plt.imshow(filtered_img[0, 0, :, :])
plt.subplot(2, 3, 3); plt.axis('off'); plt.imshow(filtered_img[0, 1, :, :])


# (2)
# plot original image and first and second components of output
plt.subplot(2, 3, 4); plt.axis('off'); plt.imshow(img)
plt.gray();
# recall that the convOp output (filtered image) is actually a "minibatch",
# of size 1 here, so we take index 0 in the first dimension:
plt.subplot(2, 3, 5); plt.axis('off'); plt.imshow(pooled_img[0, 0, :, :])
plt.subplot(2, 3, 6); plt.axis('off'); plt.imshow(pooled_img[0, 1, :, :])
plt.show()

# Notice that a randomly initialized filter acts very much like an edge detector!
WARNING (theano.sandbox.cuda): The cuda backend is deprecated and will be removed in the next release (v0.10).  Please switch to the gpuarray backend. You can get more information about how to switch at this URL:
 https://github.com/Theano/Theano/wiki/Converting-to-the-new-gpu-back-end%28gpuarray%29

Using gpu device 0: GeForce GTX 1060 (CNMeM is enabled with initial size: 80.0% of memory, cuDNN 5105)


(1, 2, 631, 508)
(1, 2, 315, 254)

png

Reference

History

  • 20180807: created.

Theano conv pool example

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
import theano
from theano import tensor as T
from theano.tensor.nnet import conv2d,sigmoid
from theano.tensor.signal.pool import pool_2d

import numpy
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image

rng = numpy.random.RandomState(23455)

# instantiate 4D tensor for input
input = T.tensor4(name='input')

# initialize shared variable for weights.
w_shp = (2, 3, 9, 9)
w_bound = numpy.sqrt(3 * 9 * 9)
W = theano.shared( numpy.asarray(
rng.uniform(
low=-1.0 / w_bound,
high=1.0 / w_bound,
size=w_shp),
dtype=input.dtype), name ='W')

# initialize shared variable for bias (1D tensor) with random values
# IMPORTANT: biases are usually initialized to zero. However in this
# particular application, we simply apply the convolutional layer to
# an image without learning the parameters. We therefore initialize
# them to random values to "simulate" learning.
b_shp = (2,)
b = theano.shared(numpy.asarray(
rng.uniform(low=-.5, high=.5, size=b_shp),
dtype=input.dtype), name ='b')

# build symbolic expression that computes the convolution of input with filters in w
conv_out = conv2d(input, W)

poolsize=(2,2)
pooled_out = pool_2d( input=conv_out, ws=poolsize, ignore_border=True)

conv_activations = sigmoid(conv_out + b.dimshuffle('x', 0, 'x', 'x'))
# create theano function to compute filtered images
f = theano.function([input], conv_activations)

pooled_activations = sigmoid(pooled_out + b.dimshuffle('x', 0, 'x', 'x'))
f2 = theano.function([input], pooled_activations)


#===========================================================
# processing image file
#===========================================================
# open random image of dimensions 639x516
img = Image.open(open('./3wolfmoon.jpg'))
# dimensions are (height, width, channel)
img = numpy.asarray(img, dtype=theano.config.floatX) / 256.

# put image in 4D tensor of shape (1, 3, height, width)
input_img_ = img.transpose(2, 0, 1).reshape(1, 3, 639, 516)
filtered_img = f(input_img_)
pooled_img = f2(input_img_)
print filtered_img.shape # (1, 2, 631, 508) 2 feature maps
print pooled_img.shape # (1, 2, 315, 254) 2 feature maps

fig = plt.figure(figsize=(16,8))
# (1)
# plot original image and first and second components of output
plt.subplot(2, 3, 1); plt.axis('off'); plt.imshow(img)
plt.gray();
# recall that the convOp output (filtered image) is actually a "minibatch",
# of size 1 here, so we take index 0 in the first dimension:
plt.subplot(2, 3, 2); plt.axis('off'); plt.imshow(filtered_img[0, 0, :, :])
plt.subplot(2, 3, 3); plt.axis('off'); plt.imshow(filtered_img[0, 1, :, :])


# (2)
# plot original image and first and second components of output
plt.subplot(2, 3, 4); plt.axis('off'); plt.imshow(img)
plt.gray();
# recall that the convOp output (filtered image) is actually a "minibatch",
# of size 1 here, so we take index 0 in the first dimension:
plt.subplot(2, 3, 5); plt.axis('off'); plt.imshow(pooled_img[0, 0, :, :])
plt.subplot(2, 3, 6); plt.axis('off'); plt.imshow(pooled_img[0, 1, :, :])
plt.show()

# Notice that a randomly initialized filter acts very much like an edge detector!
(1, 2, 631, 508)
(1, 2, 315, 254)

png

Pool example

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
from theano.tensor.signal import pool

input = T.dtensor4('input')
maxpool_shape = (2, 2)
pool_out = pool.pool_2d(input, maxpool_shape, ignore_border=True)
f = theano.function([input],pool_out)

#invals = numpy.random.RandomState(1).rand(3, 2, 5, 5)
invals = np.arange(50).reshape(1, 2, 5, 5)
print 'With ignore_border set to True:'
print 'invals[0, 0, :, :] =\n', invals[0, 0, :, :]
print 'output[0, 0, :, :] =\n', f(invals)[0, 0, :, :]

pool_out = pool.pool_2d(input, maxpool_shape, ignore_border=False)
f = theano.function([input],pool_out)
print
print 'With ignore_border set to False:'
print 'invals[1, 0, :, :] =\n ', invals[0, 0, :, :]
print 'output[1, 0, :, :] =\n ', f(invals)[0, 0, :, :]
With ignore_border set to True:
invals[0, 0, :, :] =
[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]
 [15 16 17 18 19]
 [20 21 22 23 24]]
output[0, 0, :, :] =
[[  6.   8.]
 [ 16.  18.]]

With ignore_border set to False:
invals[1, 0, :, :] =
  [[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]
 [15 16 17 18 19]
 [20 21 22 23 24]]
output[1, 0, :, :] =
  [[  6.   8.   9.]
 [ 16.  18.  19.]
 [ 21.  23.  24.]]

Reference

History

  • 20180807: created.

network3.py

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
#!/usr/bin/python
# -*- coding: utf-8 -*-
"""network3.py
~~~~~~~~~~~~~~

A Theano-based program for training and running simple neural
networks.

Supports several layer types (fully connected, convolutional, max
pooling, softmax), and activation functions (sigmoid, tanh, and
rectified linear units, with more easily added).

When run on a CPU, this program is much faster than network.py and
network2.py. However, unlike network.py and network2.py it can also
be run on a GPU, which makes it faster still.

Because the code is based on Theano, the code is different in many
ways from network.py and network2.py. However, where possible I have
tried to maintain consistency with the earlier programs. In
particular, the API is similar to network2.py. Note that I have
focused on making the code simple, easily readable, and easily
modifiable. It is not optimized, and omits many desirable features.

This program incorporates ideas from the Theano documentation on
convolutional neural nets (notably,
http://deeplearning.net/tutorial/lenet.html ), from Misha Denil's
implementation of dropout (https://github.com/mdenil/dropout ), and
from Chris Olah (http://colah.github.io ).

Written for Theano 0.6 and 0.7, needs some changes for more recent
versions of Theano.

对于N=50000数据全部参与训练,time(python) = 7分钟; time(theano) = 1分钟。


But the big win is the ability to do fast symbolic differentiation,
using a very general form of the backpropagation algorithm.
This is extremely useful for applying stochastic gradient
descent to a wide variety of network architectures.
"""

#### Libraries
# Standard library
import cPickle
import gzip
import time
import copy

# Third-party libraries
import numpy as np
import theano
import theano.tensor as T
from theano.tensor.nnet import conv
from theano.tensor.nnet import softmax
from theano.tensor import shared_randomstreams
#from theano.tensor.signal.downsample import max_pool_2d # for version theano-0.7
from theano.tensor.signal.pool import pool_2d # for version theano-0.9

# Activation functions for neurons
def linear(z): return z
def ReLU(z): return T.maximum(0.0, z)
from theano.tensor.nnet import sigmoid
from theano.tensor import tanh

#### Load the MNIST data
def load_data_shared(filename="../data/mnist.pkl.gz",training_set_size=1000):
print 'loading data from {0} of #{1}'.format(filename,training_set_size)
f = gzip.open(filename, 'rb')
training_data, validation_data, test_data = cPickle.load(f) # float32(N,784); int64(N,)
f.close()
def shared(data):
"""Place the data into shared variables. This allows Theano to copy
the data to the GPU, if one is available.

shared_x.get_value().shape float32(50000, 784)
shared_y.get_value().shape float32(50000,)

y_cast = T.cast(shared_y, "int8") # float32--->int8

shared_x.type TensorType(float32, matrix) theano.tensor.sharedvar.TensorSharedVariable
shared_y.type TensorType(float32, vector) theano.tensor.sharedvar.TensorSharedVariable
y_cast.type TensorType(int32, vector) theano.tensor.var.TensorVariable (y_cast不是shared变量)
"""

# 默认floatX = float64,在运行的时候需要设置floatX = float32
# 取x[N,784],y[N]的前training_set_size个样本参与训练
shared_x = theano.shared(np.asarray(data[0][:training_set_size,],dtype=theano.config.floatX), borrow=True)
shared_y = theano.shared(np.asarray(data[1][:training_set_size], dtype=theano.config.floatX), borrow=True)

# shared变量中的数据在GPU上必须是float32类型,但是计算阶段可能需要int类型(y),所以需要将float32--->int.
# 并且int8类型需要和 self.y = T.bvector("y")的b类型一样。
# When storing data on the GPU it has to be stored as floats
# therefore we will store the labels as ``floatX`` as well
# (``shared_y`` does exactly that). But during our computations
# we need them as ints (we use labels as index, and if they are
# floats it doesn't make sense) therefore instead of returning
# ``shared_y`` we will have to cast it to int. This little hack
# lets us get around this issue
return shared_x, T.cast(shared_y, 'int8')
return [shared(training_data), shared(validation_data), shared(test_data)]

def load_data_expanded(filename="../data/mnist_expanded.pkl.gz",training_set_size=1000):
return load_data_shared(filename=filename,training_set_size=training_set_size)

#### Main class used to construct and train networks
class Network(object):

def __init__(self, layers, mini_batch_size):
"""Takes a list of `layers`, describing the network architecture, and
a value for the `mini_batch_size` to be used during training
by stochastic gradient descent.

"""
self.layers = layers
assert len(self.layers)>=2
self.mini_batch_size = mini_batch_size
self.params = [param for layer in self.layers for param in layer.params]
self.x = T.matrix("x") # batch x float32,(m,784) 不需要指定fmatrix
self.y = T.bvector("y") # batch y int8,(m,)

# first layer init with inpt=x,inpt_dropout=x
init_layer = self.layers[0]
init_layer.set_inpt(self.x, self.x, self.mini_batch_size)
for j in xrange(1, len(self.layers)):
prev_layer, layer = self.layers[j-1], self.layers[j]
layer.set_inpt(prev_layer.output, prev_layer.output_dropout, self.mini_batch_size)

self.output = self.layers[-1].output
self.output_dropout = self.layers[-1].output_dropout

def SGD(self, training_data, epochs, mini_batch_size, eta,
validation_data, test_data, lmbda=0.0,
no_improvement_in_n=20,use_constant_eta=True, # default not vary eta because accuracy not imporved too much
eta_shrink_times=10,eta_descrease_factor = 0.0001):

"""Train the network using mini-batch stochastic gradient descent."""
training_x, training_y = training_data # (N,784) (N,)
validation_x, validation_y = validation_data
test_x, test_y = test_data

# compute number of minibatches for training, validation and testing
num_training_batches = size(training_data)/mini_batch_size
num_validation_batches = size(validation_data)/mini_batch_size
num_test_batches = size(test_data)/mini_batch_size

# define the (regularized) cost function, symbolic gradients, and updates
l2_norm_squared = sum([(layer.w**2).sum() for layer in self.layers])
cost0 = self.layers[-1].cost(self) # 计算最后一层的输出代价,传递Network作为net参数

cost = cost0 + 0.5*lmbda*l2_norm_squared/size(training_data) # ??? N instead of num_training_batches
grads = T.grad(cost, self.params)

shared_eta = theano.shared(eta,borrow=True) #(same as shared_b) use SharedVariable instead of value

updates = [(param, param-T.cast(shared_eta*grad,dtype=theano.config.floatX)) for param, grad in zip(self.params, grads)]

"""
grad(float32),没有指定floatX=float32,则eta*grad(float64),指定之后eta*grad(float32),无需cast

#for param, grad in zip(self.params, grads):
# print param.type,grad.type,(eta*grad).type

# updates = [(param, T.cast(param-eta*grad,'float32') ) for param, grad in zip(self.params, grads)]
"""

# define functions to train a mini-batch, and to compute the
# accuracy in validation and test mini-batches.
i = T.lscalar() # mini-batch index
train_mb = theano.function(
[i], cost, updates=updates, # 给定i,===>x,y===>cost中的x,y被替换掉,从而计算mini-batch的代价,最后updates
givens={
self.x:
training_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size],
self.y:
training_y[i*self.mini_batch_size: (i+1)*self.mini_batch_size]
})

# cost and accuracy for train,val,test
# (1) train
train_mb_cost = theano.function(
[i], cost,
givens={
self.x:
training_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size],
self.y:
training_y[i*self.mini_batch_size: (i+1)*self.mini_batch_size]
})
train_mb_accuracy = theano.function(
[i], self.layers[-1].accuracy(self.y), # y(m,)
givens={
self.x:
training_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size],
self.y:
training_y[i*self.mini_batch_size: (i+1)*self.mini_batch_size]
})
# (2) val
validate_mb_cost = theano.function(
[i], cost,
givens={
self.x:
validation_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size],
self.y:
validation_y[i*self.mini_batch_size: (i+1)*self.mini_batch_size]
})
validate_mb_accuracy = theano.function(
[i], self.layers[-1].accuracy(self.y), # y(m,)
givens={
self.x:
validation_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size],
self.y:
validation_y[i*self.mini_batch_size: (i+1)*self.mini_batch_size]
})
# (3) test
test_mb_cost = theano.function(
[i], cost,
givens={
self.x:
test_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size],
self.y:
test_y[i*self.mini_batch_size: (i+1)*self.mini_batch_size]
})
test_mb_accuracy = theano.function(
[i], self.layers[-1].accuracy(self.y), # y(m,)
givens={
self.x:
test_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size],
self.y:
test_y[i*self.mini_batch_size: (i+1)*self.mini_batch_size]
})
self.test_mb_predictions = theano.function(
[i], self.layers[-1].y_out, # y(m,) m个样本的预测结果
givens={
self.x:
test_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size]
})

"""
def shuffle_data(x,y):
seed = int(time.time())
np.random.seed(seed)
np.random.shuffle(x)

np.random.seed(seed)
np.random.shuffle(y)

def shuffle_training_data(training_x,training_y):
# CPU, OK; GPU, FAILED (在GPU中borrow失效)
originX = training_x.get_value(borrow=True) # shared---> nparray
originY = training_y.get_value(borrow=True) # shared---> nparray
shuffle_data(originX,originY)
"""

evaluation_costs, evaluation_accuracys = [], []
training_costs, training_accuracys = [], []

# use no-improvement-in-n early stopping
# 记录best_validation_accuracy,best_epoch,如果epoch-best_epoch>=no_improvement_in_n,stop
best_epoch = 0
cur_eta_shrink_times = 0 # if cur_eta_shrink_times>=eta_shrink_times,stop
best_validation_accuracy = 0.0 # with gpu, numpy.float64

for epoch in xrange(epochs):
#random.shuffle(training_data) # for list[(x1,y1),(x2,y2),...] 此处training_data是(X,Y)
# shuffle_training_data(training_x,training_y) # FAILED on GPU

for minibatch_index in xrange(num_training_batches):
# iteration记录训练次数,每训练1000次输出一次
iteration = num_training_batches*epoch+minibatch_index
if iteration % 1000 == 0:
print("Training mini-batch number {0}".format(iteration))
cost_ij = train_mb(minibatch_index)

# 一个epoch训练结束,训练了num_training_batches次,iterration=4999。利用w,b计算一次验证accuracy
#if (iteration+1) % num_training_batches == 0:
validation_cost = np.mean( [validate_mb_cost(j) for j in xrange(num_validation_batches)] )
validation_accuracy = np.mean( [validate_mb_accuracy(j) for j in xrange(num_validation_batches)] )
print("\nEpoch {0}: validation accuracy {1:.2%}".format(epoch, validation_accuracy))

train_cost = np.mean( [train_mb_cost(j) for j in xrange(num_training_batches)] )
train_accuracy = np.mean( [train_mb_accuracy(j) for j in xrange(num_training_batches)] )

# save 4 return lists
evaluation_costs.append(validation_cost)
evaluation_accuracys.append(validation_accuracy)
training_costs.append(train_cost)
training_accuracys.append(train_accuracy)

#记录best_validation_accuracy
# 关键在于<,满足足够多的NIIN,才能满足eta_shrink_times>=10
if best_validation_accuracy - validation_accuracy < 0.0: # <=
print("This is the best validation accuracy to date.")
best_validation_accuracy = validation_accuracy
best_epoch = epoch
best_iteration = iteration

# save best network
best_net = copy.deepcopy(self)

#计算在val取得最佳accuracy情况下,test数据集的accuracy
if test_data:
test_accuracy = np.mean( [test_mb_accuracy(j) for j in xrange(num_test_batches)] )
print('The corresponding test accuracy is {0:.2%}'.format(test_accuracy))

#============================================================================================
# early stopping with variable learning rate
# (1) (epoch - best_epoch) >= no_improvement_in_n: stop NIIN = 20
# (2) new_eta = 1/2*eta until new_eta<=1/1024*eta ETA_SHRINK_TIME = 10
#============================================================================================

# check in last epoch of NIIN stage
if (epoch+1) % no_improvement_in_n == 0:
# (1) check NIIN
if (epoch - best_epoch) >= no_improvement_in_n:
# stop learning
print '!'*100
print '[HIT] Early stopping at epoch #{0},best_epoch #{1},iteration #{2},validation accuracy {3:.2%}'.format(epoch,best_epoch,best_iteration,best_validation_accuracy)
print '!'*100
break;

#******************************************************************************
if use_constant_eta:
break # goto (2) instead of break
else:
# (2) shrink eta to 1/2*eta  (accuracy not improved too much)
print 'cur_eta_shrink_times = {0}'.format(cur_eta_shrink_times)
if cur_eta_shrink_times >= eta_shrink_times:
print '+'*100
print '[HIT] Eta shrink OK. at epoch #{0},best_epoch #{1},iteration #{2},validation accuracy {3:.2%}'.format(epoch,best_epoch,best_iteration,best_validation_accuracy)
print '+'*100
break;

cur_eta_shrink_times +=1

# update eta every epoch
eta_descrease_factor = 0.0001
new_eta = eta/(1.0+eta_descrease_factor*(epoch+1))
shared_eta.set_value(np.asarray(new_eta,dtype=theano.config.floatX),borrow=True) # update eta

#eta = eta/2.0
#shared_eta.set_value(np.asarray(eta,dtype=theano.config.floatX),borrow=True) # update eta
#******************************************************************************
#============================================================================================


# once early stopping, we save the best model to file
with open('best_model.pkl', 'wb') as fp:
print 'Saving best mode to best_model.pkl...'
cPickle.dump(best_net, fp)

print("\nFinished training network.")
print("Best validation accuracy of {0:.2%} obtained at best_epoch {1}".format(best_validation_accuracy, best_epoch))
print("Corresponding test accuracy of {0:.2%}".format(test_accuracy))

return evaluation_costs, evaluation_accuracys, training_costs, training_accuracys,best_epoch # for plot


#********************************************************
# load model and predict on test data
#********************************************************
def load_network_and_predict():
"""
An example of how to load a trained model and use it
to predict labels.
"""
# load the saved model
net = cPickle.load(open('best_model.pkl'))

# predict
training_set_size = 50000
train_data,val_data,test_data = load_data_shared(training_set_size=training_set_size)
test_x,test_y = test_data

mini_batch_size = 10
num_test_batches = size(test_data)/mini_batch_size

i = T.lscalar()
# test predict
test_mb_predictions = theano.function(
[i], net.layers[-1].y_out, # y(m,) m个样本的预测结果
givens={
net.x:
test_x[i*mini_batch_size: (i+1)*mini_batch_size]
})
# test accuracy
test_mb_accuracy = theano.function(
[i], net.layers[-1].accuracy(net.y), # y(m,)
givens={
net.x:
test_x[i*mini_batch_size: (i+1)*mini_batch_size],
net.y:
test_y[i*mini_batch_size: (i+1)*mini_batch_size]
})

test_predictions = test_mb_predictions(0)
print 'real values of first 10: ',test_y[:10].eval()
print 'predictions of first 10: ',test_predictions

test_accuracy = np.mean( [test_mb_accuracy(j) for j in xrange(num_test_batches)] )
print 'test_accuracy ',test_accuracy

#********************************************************
# end of predict
#********************************************************


#### Define layer types

class ConvPoolLayer(object):
"""Used to create a combination of a convolutional and a max-pooling
layer. A more sophisticated implementation would separate the
two, but for our purposes we'll always use them together, and it
simplifies the code, so it makes sense to combine them.

"""

def __init__(self, filter_shape, image_shape, poolsize=(2, 2),
activation_fn=sigmoid):
"""`filter_shape` is a tuple of length 4, whose entries are the number
of filters, the number of input feature maps, the filter height, and the
filter width.

`image_shape` is a tuple of length 4, whose entries are the
mini-batch size, the number of input feature maps, the image
height, and the image width.

`poolsize` is a tuple of length 2, whose entries are the y and
x pooling sizes.

np.prod((2,2)) = 4 # int64

ConvPoolLayer1
image_shape=(m,1,28,28) 1*28*28 (1 input feature map)
filter_shape=(20,1,5,5) 20*24*24
poolsize=(2,2) 20*12*12

ConvPoolLayer2
image_shape=(m,20,12,12) 20*12*12 (20 input feature map)
filter_shape=(40,20,5,5) 40*8*8
poolsize=(2,2) 40*4*4

ConvPoolLayer1
(20,1,5,5)
20指定当前ConvLayer1的features的数量: c1_f1,c1_f2,....c1_f19,c1_f20。
(1,5,5)指定feature的一个pixel所对应的local receptive field(LRF),此处对应1个input feature的5*5区域。
对应的w: w1,w2,...w19,w20 of size(1,5,5)===>w(20,1,5,5) filter_shape
对应的b: b1,b2,...b19,b20 of size() ===>b(20,)

ConvPoolLayer2
(40,20,5,5)
40指定当前ConvLayer2的features的数量: c2_f1,c2_f2,....c2_f39,c2_f40。
(20,5,5)指定feature的一个pixel所对应的local receptive field(LRF),此处对应20个input feature的5*5区域。
对应的w: w1,w2,...w39,w40 of size(20,5,5)===>w(40,20,5,5) filter_shape
对应的b: b1,b2,...b39,b40 of size() ===>b(40,)
"""
assert image_shape[1] == filter_shape[1] # input feature maps
self.filter_shape = filter_shape
self.image_shape = image_shape
self.poolsize = poolsize
self.activation_fn=activation_fn

# initialize weights and biases
# 20*(5*5)/(2*2) = 500/4 = 125
# 40*(5*5)/(2*2) = 1000/4 = 250
#n_out = (filter_shape[0]*np.prod(filter_shape[2:])/np.prod(poolsize)) # 125 250 (why???)

# for tanh: w_bound = numpy.sqrt(6./(n_in+n_out))
# for sigmoid: w_bound = 4*w_bound(tanh)
# for ReLU: w = 0

# there are "num input feature maps * filter height * filter width" inputs to each hidden unit
n_in = np.prod(filter_shape[1:]) # LRF
# each unit in the lower layer receives a gradient from:
# "num output feature maps * filter height * filter width" / pooling size
n_out = (filter_shape[0] * np.prod(filter_shape[2:]) // np.prod(poolsize))

w_bound = np.sqrt(6./(n_in+n_out))
if activation_fn == sigmoid:
w_bound = 4*w_bound

self.w = theano.shared(
np.asarray(
#np.random.normal(loc=0, scale=np.sqrt(1.0/n_out),
np.random.uniform(low=-w_bound,high=w_bound,
size=filter_shape),
# w(20,1,5,5) w(40,20,5,5)
dtype=theano.config.floatX),
borrow=True)
self.b = theano.shared(
np.asarray(
np.random.normal(loc=0, scale=1.0, size=(filter_shape[0],)),
# b(20,) b(40,)
dtype=theano.config.floatX),
borrow=True)
self.params = [self.w, self.b]

def set_inpt(self, inpt, inpt_dropout, mini_batch_size):
"""
inpt = x: fmatrix(m,784)
ConvPoolLayer1
image_shape=(m,1,28,28) m,1*28*28 (1 input feature map)
filter_shape=(20,1,5,5) m,20*24*24 w(20,1,5,5) b(20,)
poolsize=(2,2) m,20*12*12

ConvPoolLayer2
image_shape=(m,20,12,12) m,20*12*12 (20 input feature map)
filter_shape=(40,20,5,5) m,40*8*8 w(40,20,5,5) b(40,)
poolsize=(2,2) m,40*4*4


ConvPoolLayer1
inpt(m,784)--->inpt(m,1,28,28)
conv_out(m,20,24,24)
pooled_out(m,20,12,12)
output(m,20,12,12)

ConvPoolLayer2
inpt(m,20,12,12)
conv_out(m,40,8,8)
pooled_out(m,40,4,4)
output(m,40,4,4)
"""

self.inpt = inpt.reshape(self.image_shape)
conv_out = conv.conv2d( input=self.inpt, image_shape=self.image_shape,
filters=self.w, filter_shape=self.filter_shape)

#conv_out = conv.conv2d(input=self.inpt,filters=self.w)
#theano.tensor.var.TensorVariable float32 TensorType(float32, 4D)

pooled_out = pool_2d( input=conv_out, ws=self.poolsize, ignore_border=True)
#theano.tensor.var.TensorVariable float32 TensorType(float32, 4D)

b_shuffle = self.b.dimshuffle('x', 0, 'x', 'x')
# TensorVariable TensorType(float32, (True, False, True, True))
# ConvPoolLayer1: b(20,) 20个feature map分别增加b0,b1,...b19,b20
# 对于pooled_out=(m,20,12,12)而言,('x', 0, 'x', 'x')的dim2=0,其他为x

# ConvPoolLayer2: b(40,) 40个feature map分别增加b0,b1,...b39,b40
# 对于pooled_out=(m,40,4,4)而言,('x', 0, 'x', 'x')的dim2=0,其他为x

self.output = self.activation_fn( pooled_out + b_shuffle )
#theano.tensor.var.TensorVariable float32 TensorType(float32, 4D)

self.output_dropout = self.output # no dropout in the convolutional layers

class FullyConnectedLayer(object):

def __init__(self, n_in, n_out, activation_fn=sigmoid, p_dropout=0.0):
self.n_in = n_in
self.n_out = n_out
self.activation_fn = activation_fn
self.p_dropout = p_dropout

#rng = numpy.random.RandomState(1234) # for w initialization

# for tanh: w_bound = numpy.sqrt(6./(n_in+n_out))
# for sigmoid: w_bound = 4*w_bound(tanh)
# for ReLU: w = 0

w_bound = np.sqrt(6./(n_in+n_out))
if activation_fn == sigmoid:
w_bound = 4*w_bound

# Initialize weights and biases
self.w = theano.shared(
np.asarray(
#np.random.normal(loc=0.0, scale=np.sqrt(1.0/n_in),
np.random.uniform(low=-w_bound,high=w_bound,
size=(n_in, n_out)),
dtype=theano.config.floatX),
name='w', borrow=True)
self.b = theano.shared(
np.asarray(np.random.normal(loc=0.0, scale=1.0, size=(n_out,)),
dtype=theano.config.floatX),
name='b', borrow=True)
self.params = [self.w, self.b]

def set_inpt(self, inpt, inpt_dropout, mini_batch_size):
"""
(1) inpt,output for validating and testing
(2) inpt_dropout,output_dropout for training (output_dropout--->[cost]--->grad--->params)

以 ConvPoolLayer1(m,20,12,12),ConvPoolLayer2(m,40,4,4),[640,30,10]网络结构为例说明:
************************************************************************************************
X(m,784),Y(m,)

ConvPoolLayer1:
当前层的inpt是前一层的output,因为是第一层,所以初始化为inpt = X(m,784)
inpt(m,784)--->inpt(m,1,28,28)
conv_out(m,20,24,24)
pooled_out(m,20,12,12)
output(m,20,12,12)

ConvPoolLayer2:
inpt(m,20,12,12)
conv_out(m,40,8,8)
pooled_out(m,40,4,4)
output(m,40,4,4)
************************************************************************************************

对于FullyConnectedLayer而言,inpt是ConvPoolLayer2的output=(m,40,4,4)
================================================================================================
Layer1:
inpt=(m,40,4,4)--->inpt(m,640) a1(m,640)即:m个样本,每个样本640个neurons
output = sigmoid(input*w+b) ===> a2 = sigmoid(a1*w+b)
a2(m,30) = sigmoid( a1(m,640)* w(640,30)+ b(30,) )

Layer2:
当前层的inpt是前一层的output,即是FullyConnectedLayer1的output,包含30个hidden neurons输出 a2(m,30)
output = SOFTMAX(input*w+b) ===> a3 = SOFTMAX(a2*w+b)
a3(m,10) = SOFTMAX( a2(m,30)* w(30,10)+ b(10,) )

output是m个样本对应的10个概率,y_out是m个样本对应的真实数值。
================================================================================================
"""

self.inpt = inpt.reshape((mini_batch_size, self.n_in))
#self.output = self.activation_fn((1-self.p_dropout)*T.dot(self.inpt, self.w) + self.b)
self.output = self.activation_fn(T.dot(self.inpt, self.w) + self.b)

#self.y_out = T.argmax(self.output, axis=1) # 暂时不用,只是用最后一层的y_out作为输出结果

self.inpt_dropout = dropout_layer( inpt_dropout.reshape((mini_batch_size, self.n_in)), self.p_dropout)
self.output_dropout = self.activation_fn(T.dot(self.inpt_dropout, self.w) + self.b)

#def accuracy(self, y):
# "Return the accuracy for the mini-batch."
# # 暂时不用,只是用最后一层
# return T.mean(T.eq(y, self.y_out))

class SoftmaxLayer(object):

def __init__(self, n_in, n_out, p_dropout=0.0):
self.n_in = n_in
self.n_out = n_out
self.activation_fn = softmax # default to softmax
self.p_dropout = p_dropout

# Initialize weights and biases
# for sigmoid neurons,w--->(0, 1/sqrt(n_in)) b--->(0,1)
# for softmax neurons,w = 0,b = 0, no need using suitably parameteried normal random variables
self.w = theano.shared(
np.zeros((n_in, n_out), dtype=theano.config.floatX),
name='w', borrow=True)
self.b = theano.shared(
np.zeros((n_out,), dtype=theano.config.floatX),
name='b', borrow=True)
self.params = [self.w, self.b]

def set_inpt(self, inpt, inpt_dropout, mini_batch_size):
"""
(1) inpt,output for validating and testing
(2) inpt_dropout,output_dropout for training (output_dropout--->[cost]--->grad--->params)


在Python中,a = sigmoid(w*a+b), w=(30,784),a=(784,1)一次使用一个样本参与计算。
在Theano中修改为,a = sigmoid(a*w+b) a=(m,784),w=(784,30)一次使用m个样本参与计算。

以[784,30,10]网络结构为例说明:
Layer1:
当前层的inpt是前一层的output,因为是第一层,所以初始化为a1 = X(m,784) Matrix,每一个样本包含784个输入neurons
output = sigmoid(input*w+b) ===> a2 = sigmoid(a1*w+b)
a2(m,30) = sigmoid( a1(m,784)* w(784,30)+ b(30,) )

Layer2:
当前层的inpt是前一层的output,即是FullyConnectedLayer的output,包含30个hidden neurons输出 a2(m,30)
output = SOFTMAX(input*w+b) ===> a3 = SOFTMAX(a2*w+b)
a3(m,10) = SOFTMAX( a2(m,30)* w(30,10)+ b(10,) )

output是m个样本对应的10个概率,y_out是m个样本对应的真实数值。
"""
self.inpt = inpt.reshape((mini_batch_size, self.n_in)) # tesorvariable Matrix(m,n_in)
#self.output = self.activation_fn((1-self.p_dropout)*T.dot(self.inpt, self.w) + self.b)
self.output = self.activation_fn(T.dot(self.inpt, self.w) + self.b)

"""
input--> output ---> y_out
X1---> [y0,y1,...y9] ---> 1
X2---> [y0,y1,...y9] ---> 0
...
Xm---> [y0,y1,...y9] ---> 2

axis沿着row作为一个整体进行,y_out作为最终的输出=vector(m,)。
"""
self.y_out = T.argmax(self.output, axis=1) # 对应的数值 [2,1,...7]

self.inpt_dropout = dropout_layer( inpt_dropout.reshape((mini_batch_size, self.n_in)), self.p_dropout)
self.output_dropout = self.activation_fn(T.dot(self.inpt_dropout, self.w) + self.b)

def cost(self, net):
"Return the log-likelihood cost."

"""
使用output_dropout用于train

(1) 一个样本对应的代价Cx
C = -log(a[i])
i = np.argmax(y) # a(10,1) y(10,1)
return -np.log(a[i,0])

(2) m个样本的平均代价
计算代价的时候,传递Network作为参数,方便获取net.y

output(m,10) net.y cost
X1---> [y0,y1,...y9] ---> 1 -log a[1,1]
X2---> [y0,y1,...y9] ---> 0 -log a[2,0]
Xm---> [y0,y1,...y9] ---> 2 -log a[m,2]


a = np.array([[0, 0.8, 0, 0,...],
[0.9, 0, 0, 0,...],
[0, 0, 0.7, 0...]])
y = [1,0,2]
a[[0,1,2],y]

> array([ 0.8, 0.9, 0.7])
"""

m = net.y.shape[0]
rows = T.arange(m)
return -T.mean(T.log( self.output_dropout[rows, net.y] ))

def accuracy(self, y):
"Return the accuracy for the mini-batch."

"""
使用output,y_out用于test

y(m,) 对应m个样本的真实结果
y_out(m,) 对应m个样本的预测结果
如果mini_batch_size = 5

y = np.array([2,1,7,8,9])
y_out = np.array([2,1,7,6,9])
np.mean(np.equal(y,y_out)) # [1,1,1,0,1] 0.80
"""
return T.mean(T.eq(y, self.y_out))


#### Miscellanea
def size(data):
"Return the size of the dataset `data`."
return data[0].get_value(borrow=True).shape[0] # N = 50000

def dropout_layer(layer, p_dropout):
"""
对于[784,30,10]
Layer1:
layer= float32 (m,784), p_dropout = 0.2,对每个节点以一定的概率进行drop

参考:http://www.jianshu.com/p/ba9ca3b07922

Inverted Dropout
我们稍微将 Dropout 方法改进一下,使得我们只需要在训练阶段缩放激活函数的输出值,而不用在测试阶段改变什么。
这个改进的 Dropout 方法就被称之为 Inverted Dropout 。

在各种深度学习框架的实现中,我们都是用 Inverted Dropout 来代替 Dropout,因为这种方式有助于模型的完整性,
我们只需要修改一个参数(保留/丢弃概率),而整个模型都不用修改。
"""
srng = shared_randomstreams.RandomStreams( np.random.RandomState(0).randint(999999) )
retain_prob = 1. - p_dropout # retain probility theano.config.floatX
#mask = srng.binomial(n=1, p=retain_prob, size=layer.shape,dtype='int8') # int8

#mask: <class 'theano.tensor.var.TensorVariable'> TensorType(float32, vector)
mask = srng.binomial(n=1, p=retain_prob, size=layer.shape,dtype=theano.config.floatX)
mask_layer = layer*mask
return mask_layer/retain_prob #在train阶段除以retain_prob,以便test阶段每一个Layer的output形式保持不变。

Test Network3

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
import random
import numpy as np
random.seed(12345678)
np.random.seed(12345678)

#from ke_network3 import *
epochs = 3
training_set_size = 100
mini_batch_size = 10
train_data,val_data,test_data = load_data_shared(training_set_size=training_set_size)

# for conv pool layer
image_shape=(mini_batch_size,1,28,28)
filter_shape=(20,1,5,5)
poolsize=(2,2)
convpool_layer1 = ConvPoolLayer(image_shape=image_shape,filter_shape=filter_shape, poolsize=poolsize)
n_in = 20*12*12


image_shape=(mini_batch_size,20,12,12)
filter_shape=(40,20,5,5)
poolsize=(2,2)
n_in = 40*4*4
convpool_layer2 = ConvPoolLayer(image_shape=image_shape,filter_shape=filter_shape, poolsize=poolsize)


full_layer = FullyConnectedLayer(n_in=n_in,n_out=30)
softmax_layer = SoftmaxLayer(n_in=30,n_out=10)
#net = Network([convpool_layer1,full_layer,softmax_layer],10)
net = Network([convpool_layer1,convpool_layer2,full_layer,softmax_layer],10)
net.SGD(train_data,epochs,mini_batch_size,0.3,val_data,test_data,lmbda=0)
updates TensorType(float32, 4D) TensorType(float32, 4D)
updates TensorType(float32, vector) TensorType(float32, vector)
updates TensorType(float32, 4D) TensorType(float32, 4D)
updates TensorType(float32, vector) TensorType(float32, vector)
updates TensorType(float32, matrix) TensorType(float32, matrix)
updates TensorType(float32, vector) TensorType(float32, vector)
updates TensorType(float32, matrix) TensorType(float32, matrix)
updates TensorType(float32, vector) TensorType(float32, vector)
Training mini-batch number 0
Epoch 0: validation accuracy 10.00%

This is the best validation accuracy to date.
The corresponding test accuracy is 8.00%
Epoch 1: validation accuracy 10.00%

This is the best validation accuracy to date.
The corresponding test accuracy is 8.00%
Epoch 2: validation accuracy 10.00%

This is the best validation accuracy to date.
The corresponding test accuracy is 8.00%

Finished training network.
Best validation accuracy of 10.00% obtained at iteration 29
Corresponding test accuracy of 8.00%





([2.2949765, 2.2951121, 2.2958748],
 [0.10000000000000001, 0.10000000000000001, 0.10000000000000001],
 [2.2682509, 2.2655275, 2.2644706],
 [0.13, 0.13, 0.13])

Basic Test of Network3.py

(1) load data

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
from ke_network3 import *
filename="../data/mnist.pkl.gz"
filename="../data/mnist_expanded.pkl.gz"
f = gzip.open(filename, 'rb')
training_data, validation_data, test_data = cPickle.load(f)
f.close()

x = training_data[0] # (m,784)
y = training_data[1] # (m,)
print type(x),type(y)
print type(x[0]),type(y[0])
print x.shape,y.shape
print x[0].shape,y[0].shape
x2 = x[:10,]

set_size = 10
x = training_data[0] # float32 (50000, 784)
y = training_data[1] # int64 (50000,)

training_x = theano.shared( training_data[0][:set_size,], borrow=True) #float32
training_y = theano.shared( np.asarray(training_data[0][:set_size,],dtype='int8'), borrow=True) # int8
#training_x2 = theano.shared(np.asarray(training_data[0], dtype=theano.config.floatX), borrow=True) # float64
print training_x.type
print training_y.type
#print training_x2.type

# 乘法可能会改变TensorVariable的类型
new_x = training_x*0.1 # float32--->float64
print training_x.type,new_x.type
<type 'numpy.ndarray'> <type 'numpy.ndarray'>
<type 'numpy.ndarray'> <type 'numpy.int64'>
(50, 784) (50,)
(784,) ()
TensorType(float32, matrix)
TensorType(int8, matrix)
TensorType(float32, matrix) TensorType(float64, matrix)

(2) dimshuffle b to match pooled_out

1
2
3
4
5
6
7
8
9
10
11
12
13
pooled_out = np.arange(18).reshape(1,2,3,3)
print pooled_out
b = np.array([0.0,1.0],dtype='float32') # [0,1]

# shuffle b to match pooled_out
sb = theano.shared(np.asarray(b,dtype='float32'))
y = sb.dimshuffle('x', 0, 'x', 'x') # TensorVariable TensorType(float32, (True, False, True, True))
# 2个feature map分别增加b0,b1
print type(y),y.type,y.shape.eval()

b_value = y.eval()
print b_value
pooled_out + b_value
[[[[ 0  1  2]
   [ 3  4  5]
   [ 6  7  8]]

  [[ 9 10 11]
   [12 13 14]
   [15 16 17]]]]
<class 'theano.tensor.var.TensorVariable'> TensorType(float32, (True, False, True, True)) [1 2 1 1]
[[[[ 0.]]

  [[ 1.]]]]





array([[[[  0.,   1.,   2.],
         [  3.,   4.,   5.],
         [  6.,   7.,   8.]],

        [[ 10.,  11.,  12.],
         [ 13.,  14.,  15.],
         [ 16.,  17.,  18.]]]])

Reference

History

  • 20180807: created.

network2.py

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
#!/usr/bin/python
# -*- coding: utf-8 -*-
"""network2.py
~~~~~~~~~~~~~~

An improved version of network.py, implementing the stochastic
gradient descent learning algorithm for a feedforward neural network.
Improvements include the addition of the cross-entropy cost function,
regularization, and better initialization of network weights. Note
that I have focused on making the code simple, easily readable, and
easily modifiable. It is not optimized, and omits many desirable
features.

"""

#### Libraries
# Standard library
import json
import random
import sys

# Third-party libraries
import numpy as np


#### Define the quadratic and cross-entropy cost functions

class QuadraticCost(object):

@staticmethod
def fn(a, y):
"""Return the cost associated with an output ``a`` and desired output ``y``."""
return 0.5*np.linalg.norm(a-y)**2

@staticmethod
def delta(z, a, y):
"""Return the error delta from the output layer."""
return (a-y) * sigmoid_prime(z)


class CrossEntropyCost(object):

@staticmethod
def fn(a, y):
"""Return the cost associated with an output ``a`` and desired output
``y``. Note that np.nan_to_num is used to ensure numerical
stability. In particular, if both ``a`` and ``y`` have a 1.0
in the same slot, then the expression (1-y)*np.log(1-a)
returns nan. The np.nan_to_num ensures that that is converted
to the correct value (0.0).

"""
return np.sum(np.nan_to_num(-y*np.log(a)-(1-y)*np.log(1-a)))

@staticmethod
def delta(z, a, y):
"""Return the error delta from the output layer. Note that the
parameter ``z`` is not used by the method. It is included in
the method's parameters in order to make the interface
consistent with the delta method for other cost classes.

"""
return (a-y)

#********************************************************
class LogLikelihoodCost(object):

@staticmethod
def fn(a, y):
"""
C = -log(a[i])

a(10,1) y(10,1)
y = [0,0,1,0,0,0,0,0,0,0,0]
i = 2
C = -log a[2,0]
"""
i = np.argmax(y)
return -np.log(a[i,0])

@staticmethod
def delta(z, a, y):
"""
delta_j = aj-yj
"""
return (a-y)
#********************************************************

#### Main Network class
class Network(object):

def __init__(self, sizes, cost=CrossEntropyCost):
"""The list ``sizes`` contains the number of neurons in the respective
layers of the network. For example, if the list was [2, 3, 1]
then it would be a three-layer network, with the first layer
containing 2 neurons, the second layer 3 neurons, and the
third layer 1 neuron. The biases and weights for the network
are initialized randomly, using
``self.default_weight_initializer`` (see docstring for that
method).

"""
self.num_layers = len(sizes)
self.sizes = sizes
self.cost=cost
# init use_softmax with cost type
if self.cost == LogLikelihoodCost:
self.use_softmax = True
else:
self.use_softmax = False

# init weight initializer and feedforward method
if self.use_softmax:
self.default_weight_initializer = self.default_weight_initializer_with_softmax
self.feedforward = self.feedforward_with_softmax
else:
self.default_weight_initializer = self.default_weight_initializer_1
self.feedforward = self.feedforward_1
# init weights and biases
self.default_weight_initializer()
# at least an input and output layers
assert self.num_layers>=2

def default_weight_initializer_with_softmax(self):
# sigmoid neurons: w(0,1/sqrt(n_in)) b(0,1)
# softmax neurons: w = b = 0
# len(sizes)>=2
# (1) for sigmoid neuros
self.biases = [np.random.randn(y, 1) for y in self.sizes[1:-1]]
self.weights = [np.random.randn(y, x)/np.sqrt(x)
for x, y in zip(self.sizes[:-1], self.sizes[1:-1])]
#(2) for last somtmax neurons
x = self.sizes[-2]
y = self.sizes[-1]
last_b = np.zeros((y, 1))
last_w = np.zeros((y, x))

self.biases.append(last_b)
self.weights.append(last_w)


def default_weight_initializer_1(self):
"""Initialize each weight using a Gaussian distribution with mean 0
and standard deviation 1 over the square root of the number of
weights connecting to the same neuron. Initialize the biases
using a Gaussian distribution with mean 0 and standard
deviation 1.

Note that the first layer is assumed to be an input layer, and
by convention we won't set any biases for those neurons, since
biases are only ever used in computing the outputs from later
layers.

"""
self.biases = [np.random.randn(y, 1) for y in self.sizes[1:]]
self.weights = [np.random.randn(y, x)/np.sqrt(x)
for x, y in zip(self.sizes[:-1], self.sizes[1:])]

def large_weight_initializer(self):
"""Initialize the weights using a Gaussian distribution with mean 0
and standard deviation 1. Initialize the biases using a
Gaussian distribution with mean 0 and standard deviation 1.

Note that the first layer is assumed to be an input layer, and
by convention we won't set any biases for those neurons, since
biases are only ever used in computing the outputs from later
layers.

This weight and bias initializer uses the same approach as in
Chapter 1, and is included for purposes of comparison. It
will usually be better to use the default weight initializer
instead.

"""
self.biases = [np.random.randn(y, 1) for y in self.sizes[1:]]
self.weights = [np.random.randn(y, x)
for x, y in zip(self.sizes[:-1], self.sizes[1:])]

def feedforward_with_softmax(self, a):
"""Return the output of the network if ``a`` is input."""
for b, w in zip(self.biases[:-1], self.weights[:-1]):
a = sigmoid(np.dot(w, a)+b)
# last layer
b,w = self.biases[-1],self.weights[-1]
last_z = np.dot(w, a)+b
last_a = softmax(last_z)
return last_a

def feedforward_1(self, a):
"""Return the output of the network if ``a`` is input."""
for b, w in zip(self.biases, self.weights):
a = sigmoid(np.dot(w, a)+b)
return a

def SGD(self, training_data, epochs, mini_batch_size, eta,
lmbda = 0.0,
evaluation_data=None,
monitor_evaluation_cost=False,
monitor_evaluation_accuracy=False,
monitor_training_cost=False,
monitor_training_accuracy=False):
"""Train the neural network using mini-batch stochastic gradient
descent. The ``training_data`` is a list of tuples ``(x, y)``
representing the training inputs and the desired outputs. The
other non-optional parameters are self-explanatory, as is the
regularization parameter ``lmbda``. The method also accepts
``evaluation_data``, usually either the validation or test
data. We can monitor the cost and accuracy on either the
evaluation data or the training data, by setting the
appropriate flags. The method returns a tuple containing four
lists: the (per-epoch) costs on the evaluation data, the
accuracies on the evaluation data, the costs on the training
data, and the accuracies on the training data. All values are
evaluated at the end of each training epoch. So, for example,
if we train for 30 epochs, then the first element of the tuple
will be a 30-element list containing the cost on the
evaluation data at the end of each epoch. Note that the lists
are empty if the corresponding flag is not set.

"""
if evaluation_data: n_data = len(evaluation_data)
n = len(training_data)
num_batches = n/mini_batch_size
evaluation_cost, evaluation_accuracy = [], []
training_cost, training_accuracy = [], []
for j in xrange(epochs):
random.shuffle(training_data)
for k in xrange(0,num_batches):
mini_batch = training_data[k*mini_batch_size : (k+1)*mini_batch_size]
self.update_mini_batch(mini_batch, eta, lmbda, len(training_data))
print "Epoch %s training complete" % j
if monitor_training_cost:
cost = self.total_cost(training_data, lmbda)
training_cost.append(cost)
print "Cost on training data: {}".format(cost)
if monitor_training_accuracy:
accuracy = self.accuracy(training_data, convert=True)
training_accuracy.append(accuracy)
print "Accuracy on training data: {} / {}".format(
accuracy, n)
if monitor_evaluation_cost:
cost = self.total_cost(evaluation_data, lmbda, convert=True)
evaluation_cost.append(cost)
print "Cost on evaluation data: {}".format(cost)
if monitor_evaluation_accuracy:
accuracy = self.accuracy(evaluation_data)
evaluation_accuracy.append(accuracy)
print "Accuracy on evaluation data: {} / {}".format(
self.accuracy(evaluation_data), n_data)
print
return evaluation_cost, evaluation_accuracy, training_cost, training_accuracy

def calculate_sum_derivatives_of_mini_batch(self,mini_batch):
"""
计算m个样本的总梯度和。
利用反向传播计算每一个样本(x,y)对应的梯度。
"""
nabla_b = [np.zeros(b.shape) for b in self.biases]
nabla_w = [np.zeros(w.shape) for w in self.weights]
for x, y in mini_batch:
# 给定一个样本X,利用反向传播算法计算对应w,b的梯度
delta_nabla_b, delta_nabla_w = self.backprop(x, y)
# 对m个样本的梯度进行累计求和
nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
return nabla_b,nabla_w

def update_mini_batch(self, mini_batch, eta, lmbda, n):
"""Update the network's weights and biases by applying gradient
descent using backpropagation to a single mini batch. The
``mini_batch`` is a list of tuples ``(x, y)``, ``eta`` is the
learning rate, ``lmbda`` is the regularization parameter, and
``n`` is the total size of the training data set.

"""
m = len(mini_batch)
nabla_b,nabla_w = self.calculate_sum_derivatives_of_mini_batch(mini_batch)

#self.weights = [w-(eta/m)*nw for w, nw in zip(self.weights, nabla_w)]
#self.biases = [b-(eta/m)*nb for b, nb in zip(self.biases, nabla_b)]

# L2 regularization
weight_decay = 1-eta*(lmbda/n)
self.weights = [weight_decay*w-(eta/m)*nw
for w, nw in zip(self.weights, nabla_w)]
self.biases = [b-(eta/m)*nb
for b, nb in zip(self.biases, nabla_b)]

def backprop(self, x, y):
"""Return a tuple "(nabla_b, nabla_w)" representing the
gradient for the cost function C_x. "nabla_b" and
"nabla_w" are layer-by-layer lists of numpy arrays, similar
to "self.biases" and "self.weights"."""
# 初始化nb,nw,结构和b,w一样
nabla_b = [np.zeros(b.shape) for b in self.biases]
nabla_w = [np.zeros(w.shape) for w in self.weights]

# feedforward
# 执行算法的feedforward阶段
# (1)初始化x作为a_1
activation = x
activations = [x] # list to store all the activations, layer by layer
zs = [] # list to store all the z vectors, layer by layer
# (2)l=2,....L层,分别计算z_l,a_l并且保存下来。
#*************************************************************
if self.use_softmax:
for b, w in zip(self.biases[:-1], self.weights[:-1]):
z = np.dot(w, activation)+b
zs.append(z)
activation = sigmoid(z)
activations.append(activation)
#last layer
b,w = self.biases[-1],self.weights[-1]
last_z = np.dot(w, activation)+b
last_a = softmax(last_z)
zs.append(last_z)
activations.append(last_a)
#*************************************************************
else:
for b, w in zip(self.biases, self.weights):
z = np.dot(w, activation)+b
zs.append(z)
activation = sigmoid(z)
activations.append(activation)

#========================================================================
# 先计算所有的误差delta,最后计算所有层的梯度nb,nw,代码可读性更高一些
#========================================================================
# method2
# backward pass
# 执行算法的backward阶段
# (3)初始化第L层的误差,delta_L = cost(z,a,y)
l = -1
#***************************************************
delta = self.cost.delta(zs[l],activations[l], y)
#***************************************************
deltas = [delta] # list to store all the errors,layer by layer
# (4)初始化l=L-1,....2层的误差,delta_l = np.dot(w_l+1^T,delta_l+1)* sigmoid_prime(z_l)
for i in xrange(2, self.num_layers):
l = -i #(-2代表L-1,-3代表L-2,-(L-1)代表2)
delta = np.dot(self.weights[l+1].transpose(), deltas[l+1]) * sigmoid_prime(zs[l])
deltas.insert(0,delta) # 确保误差的顺序,从后往前计算,所以需要insert在数组的最前面

#(5)l=L,L-1,....2层,计算所有的梯度向量nb,nw
for i in xrange(1, self.num_layers):
l = -i #(-1,-2,....-(L-1))
nabla_b[l] = deltas[l]
nabla_w[l] = np.dot(deltas[l], activations[l-1].transpose())

return (nabla_b, nabla_w)

def accuracy(self, data, convert=False):
"""Return the number of inputs in ``data`` for which the neural
network outputs the correct result. The neural network's
output is assumed to be the index of whichever neuron in the
final layer has the highest activation.

The flag ``convert`` should be set to False if the data set is
validation or test data (the usual case), and to True if the
data set is the training data. The need for this flag arises
due to differences in the way the results ``y`` are
represented in the different data sets. In particular, it
flags whether we need to convert between the different
representations. It may seem strange to use different
representations for the different data sets. Why not use the
same representation for all three data sets? It's done for
efficiency reasons -- the program usually evaluates the cost
on the training data and the accuracy on other data sets.
These are different types of computations, and using different
representations speeds things up. More details on the
representations can be found in
mnist_loader.load_data_wrapper.

"""
if convert: # train data
results = [(np.argmax(self.feedforward(x)), np.argmax(y))
for (x, y) in data]
else: # val/test data
results = [(np.argmax(self.feedforward(x)), y)
for (x, y) in data]
return sum(int(x == y) for (x, y) in results)

def total_cost(self, data, lmbda, convert=False):
"""Return the total cost for the data set ``data``. The flag
``convert`` should be set to False if the data set is the
training data (the usual case), and to True if the data set is
the validation or test data. See comments on the similar (but
reversed) convention for the ``accuracy`` method, above.
"""
cost = 0.0
n = len(data)
for x, y in data:
a = self.feedforward(x)
if convert: # val/test data
y = vectorized_result(y)
cost += self.cost.fn(a, y)/n
# L2 term = lmbda/(2n)*sum(w**2)
l2_term = 0.5*(lmbda/n)*sum(np.linalg.norm(w)**2 for w in self.weights)
cost += l2_term
return cost

def save(self, filename):
"""Save the neural network to the file ``filename``."""
data = {"sizes": self.sizes,
"weights": [w.tolist() for w in self.weights],
"biases": [b.tolist() for b in self.biases],
"cost": str(self.cost.__name__)}
f = open(filename, "w")
json.dump(data, f)
f.close()

#### Loading a Network
def load(filename):
"""Load a neural network from the file ``filename``. Returns an
instance of Network.

"""
f = open(filename, "r")
data = json.load(f)
f.close()
name = sys.modules[__name__]
cost = getattr(name, data["cost"])
net = Network(data["sizes"], cost=cost)
net.weights = [np.array(w) for w in data["weights"]]
net.biases = [np.array(b) for b in data["biases"]]
return net

#### Miscellaneous functions
def vectorized_result(j):
"""Return a 10-dimensional unit vector with a 1.0 in the j'th position
and zeroes elsewhere. This is used to convert a digit (0...9)
into a corresponding desired output from the neural network.

"""
e = np.zeros((10, 1))
e[j] = 1.0
return e

def sigmoid(z):
"""The sigmoid function."""
return 1.0/(1.0+np.exp(-z))

def sigmoid_prime(z):
"""Derivative of the sigmoid function."""
return sigmoid(z)*(1-sigmoid(z))

def softmax(z):
#e^z/ sum(e^z)
ez = np.exp(z)
sum_ez = sum(ez)
return ez/sum_ez
1
2
import mnist_loader
training_data, validation_data, test_data = mnist_loader.load_data_wrapper()
1
2
net = load("./1.json")
net.accuracy(validation_data)
9385
1
2
net = load("./0.json")
net.accuracy(validation_data)
8904

Reference

History

  • 20180807: created.

network1.py

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
#!/usr/bin/python
# -*- coding: utf-8 -*-
"""
network.py
~~~~~~~~~~

A module to implement the stochastic gradient descent learning
algorithm for a feedforward neural network. Gradients are calculated
using backpropagation. Note that I have focused on making the code
simple, easily readable, and easily modifiable. It is not optimized,
and omits many desirable features.
"""

#### Libraries
# Standard library
import random

# Third-party libraries
import numpy as np

class Network(object):

def __init__(self, sizes):
"""The list ``sizes`` contains the number of neurons in the
respective layers of the network. For example, if the list
was [2, 3, 1] then it would be a three-layer network, with the
first layer containing 2 neurons, the second layer 3 neurons,
and the third layer 1 neuron. The biases and weights for the
network are initialized randomly, using a Gaussian
distribution with mean 0, and variance 1. Note that the first
layer is assumed to be an input layer, and by convention we
won't set any biases for those neurons, since biases are only
ever used in computing the outputs from later layers."""
self.num_layers = len(sizes)
self.sizes = sizes
self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
self.weights = [np.random.randn(y, x)
for x, y in zip(sizes[:-1], sizes[1:])]

def feedforward(self, a):
"""Return the output of the network if ``a`` is input."""
for b, w in zip(self.biases, self.weights):
a = sigmoid(np.dot(w, a)+b)
return a

def SGD(self, training_data, epochs, mini_batch_size, eta,
test_data=None):
"""Train the neural network using mini-batch stochastic
gradient descent. The ``training_data`` is a list of tuples
``(x, y)`` representing the training inputs and the desired
outputs. The other non-optional parameters are
self-explanatory. If ``test_data`` is provided then the
network will be evaluated against the test data after each
epoch, and partial progress printed out. This is useful for
tracking progress, but slows things down substantially."""
if test_data: n_test = len(test_data)
n = len(training_data)
num_batches = n/mini_batch_size
for j in xrange(epochs):
random.shuffle(training_data)
for k in xrange(0,num_batches):
mini_batch = training_data[k*mini_batch_size : (k+1)*mini_batch_size]
self.update_mini_batch(mini_batch, eta)
if test_data:
print "Epoch {0}: {1} / {2}".format(j, self.evaluate(test_data), n_test)
else:
print "Epoch {0} complete".format(j)

def calculate_sum_derivatives_of_mini_batch(self,mini_batch):
"""
计算m个样本的总梯度和。
利用反向传播计算每一个样本(x,y)对应的梯度。
"""
nabla_b = [np.zeros(b.shape) for b in self.biases]
nabla_w = [np.zeros(w.shape) for w in self.weights]
for x, y in mini_batch:
# 给定一个样本X,利用反向传播算法计算对应w,b的梯度
delta_nabla_b, delta_nabla_w = self.backprop(x, y)
# 对m个样本的梯度进行累计求和
nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
return nabla_b,nabla_w

def update_mini_batch(self, mini_batch, eta):
"""Update the network's weights and biases by applying
gradient descent using backpropagation to a single mini batch.
The "mini_batch" is a list of tuples "(x, y)", and "eta"
is the learning rate."""
m = len(mini_batch)
nabla_b,nabla_w = self.calculate_sum_derivatives_of_mini_batch(mini_batch)

self.weights = [w-(eta/m)*nw for w, nw in zip(self.weights, nabla_w)]
self.biases = [b-(eta/m)*nb for b, nb in zip(self.biases, nabla_b)]

def backprop(self, x, y):
"""Return a tuple "(nabla_b, nabla_w)" representing the
gradient for the cost function C_x. "nabla_b" and
"nabla_w" are layer-by-layer lists of numpy arrays, similar
to "self.biases" and "self.weights"."""
# 初始化nb,nw,结构和b,w一样
nabla_b = [np.zeros(b.shape) for b in self.biases]
nabla_w = [np.zeros(w.shape) for w in self.weights]

# feedforward
# 执行算法的feedforward阶段
# (1)初始化x作为a_1
activation = x
activations = [x] # list to store all the activations, layer by layer
zs = [] # list to store all the z vectors, layer by layer
# (2)l=2,....L层,分别计算z_l,a_l并且保存下来。
for b, w in zip(self.biases, self.weights):
z = np.dot(w, activation)+b
zs.append(z)
activation = sigmoid(z)
activations.append(activation)

#========================================================================
# 先计算所有的误差delta,最后计算所有层的梯度nb,nw,代码可读性更高一些
#========================================================================
# method2
# backward pass
# 执行算法的backward阶段
# (3)初始化第L层的误差,delta_L = cost(a_L,y) * sigmoid_prime(z_L)
l = -1
delta = self.cost_derivative_of_a_L(activations[l], y) * sigmoid_prime(zs[l])
deltas = [delta] # list to store all the errors,layer by layer
# (4)初始化l=L-1,....2层的误差,delta_l = np.dot(w_l+1^T,delta_l+1)* sigmoid_prime(z_l)
for i in xrange(2, self.num_layers):
l = -i #(-2代表L-1,-3代表L-2,-(L-1)代表2)
delta = np.dot(self.weights[l+1].transpose(), deltas[l+1]) * sigmoid_prime(zs[l])
deltas.insert(0,delta) # 确保误差的顺序,从后往前计算,所以需要insert在数组的最前面

#(5)l=L,L-1,....2层,计算所有的梯度向量nb,nw
for i in xrange(1, self.num_layers):
l = -i #(-1,-2,....-(L-1))
nabla_b[l] = deltas[l]
nabla_w[l] = np.dot(deltas[l], activations[l-1].transpose())

return (nabla_b, nabla_w)

def evaluate(self, test_data):
"""Return the number of test inputs for which the neural
network outputs the correct result. Note that the neural
network's output is assumed to be the index of whichever
neuron in the final layer has the highest activation."""

"""
l = [0,1,0,0,0,0,0,0,0,0]
a = np.array(l).reshape(10,1)
np.argmax(a) #输出向量对应的数字1

test_results = [(1,1),(2,2),(3,3),(1,9)]
[int(x == y) for (x, y) in test_results]
#[1, 1, 1, 0]
sum([int(x == y) for (x, y) in test_results])
#3
"""

test_results = [(np.argmax(self.feedforward(x)), y)
for (x, y) in test_data]
return sum(int(x == y) for (x, y) in test_results)

def cost_derivative_of_a_L(self, output_activations, y):
"""Return the vector of partial derivatives \partial C_x /
\partial a for the output activations."""
return (output_activations-y)

#### Miscellaneous functions
def sigmoid(z):
"""The sigmoid function."""
return 1.0/(1.0+np.exp(-z))

def sigmoid_prime(z):
"""Derivative of the sigmoid function."""
return sigmoid(z)*(1-sigmoid(z))

mnist_loader

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
"""
mnist_loader
~~~~~~~~~~~~

A library to load the MNIST image data. For details of the data
structures that are returned, see the doc strings for ``load_data``
and ``load_data_wrapper``. In practice, ``load_data_wrapper`` is the
function usually called by our neural network code.
"""

#### Libraries
# Standard library
import cPickle
import gzip

# Third-party libraries
import numpy as np

def load_data():
"""Return the MNIST data as a tuple containing the training data,
the validation data, and the test data.

The ``training_data`` is returned as a tuple with two entries.
The first entry contains the actual training images. This is a
numpy ndarray with 50,000 entries. Each entry is, in turn, a
numpy ndarray with 784 values, representing the 28 * 28 = 784
pixels in a single MNIST image.

The second entry in the ``training_data`` tuple is a numpy ndarray
containing 50,000 entries. Those entries are just the digit
values (0...9) for the corresponding images contained in the first
entry of the tuple.

The ``validation_data`` and ``test_data`` are similar, except
each contains only 10,000 images.

This is a nice data format, but for use in neural networks it's
helpful to modify the format of the ``training_data`` a little.
That's done in the wrapper function ``load_data_wrapper()``, see
below.
"""
f = gzip.open('../data/mnist.pkl.gz', 'rb')
training_data, validation_data, test_data = cPickle.load(f)
f.close()
return (training_data, validation_data, test_data)

def load_data_wrapper():
"""Return a tuple containing ``(training_data, validation_data,
test_data)``. Based on ``load_data``, but the format is more
convenient for use in our implementation of neural networks.

In particular, ``training_data`` is a list containing 50,000
2-tuples ``(x, y)``. ``x`` is a 784-dimensional numpy.ndarray
containing the input image. ``y`` is a 10-dimensional
numpy.ndarray representing the unit vector corresponding to the
correct digit for ``x``.

``validation_data`` and ``test_data`` are lists containing 10,000
2-tuples ``(x, y)``. In each case, ``x`` is a 784-dimensional
numpy.ndarry containing the input image, and ``y`` is the
corresponding classification, i.e., the digit values (integers)
corresponding to ``x``.

Obviously, this means we're using slightly different formats for
the training data and the validation / test data. These formats
turn out to be the most convenient for use in our neural network
code."""
tr_d, va_d, te_d = load_data()
# train
training_inputs = [np.reshape(x, (784, 1)) for x in tr_d[0]]
# vector train_y, while val and test y are integers.
training_results = [vectorized_result(y) for y in tr_d[1]]
training_data = zip(training_inputs, training_results)

# val
validation_inputs = [np.reshape(x, (784, 1)) for x in va_d[0]]
validation_data = zip(validation_inputs, va_d[1])
# test
test_inputs = [np.reshape(x, (784, 1)) for x in te_d[0]]
test_data = zip(test_inputs, te_d[1])

return (training_data, validation_data, test_data)

def vectorized_result(j):
"""Return a 10-dimensional unit vector with a 1.0 in the jth
position and zeroes elsewhere. This is used to convert a digit
(0...9) into a corresponding desired output from the neural
network."""
e = np.zeros((10, 1))
e[j] = 1.0
return e

test

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
#!/usr/bin/python
# -*- coding: utf-8 -*-

import mnist_loader
from ke_network import Network

def test():
train,val,test = mnist_loader.load_data_wrapper()
epoch = 30
mini_batch_size = 10
eta = 3.0
net = Network([784,30,10])
net.SGD(train,epoch,mini_batch_size,eta,test_data=test)

def main():
test()

if __name__=="__main__":
main()
Epoch 0: 9135 / 10000
Epoch 1: 9238 / 10000
Epoch 2: 9337 / 10000
Epoch 3: 9345 / 10000
Epoch 4: 9393 / 10000
Epoch 5: 9389 / 10000
Epoch 6: 9419 / 10000
Epoch 7: 9410 / 10000

Reference

History

  • 20180807: created.

Examples of shuffle

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
"""
numpy.random.shuffle(x)
Modify a sequence in-place by shuffling its contents.

random.shuffle(list)只能对list进行随机打乱。

Parameters:
x : array_like
The array or list to be shuffled.

Returns: None

This function only shuffles the array along the first index of a multi-dimensional array
(多维矩阵中,只对第一维(行)做打乱顺序操作)
"""
import numpy as np
arr = np.arange(10)
arr
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
1
2
np.random.shuffle(arr)
arr
array([1, 4, 7, 3, 0, 9, 5, 8, 2, 6])
1
2
arr = np.arange(9).reshape((3, 3))
arr
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
1
2
3
# 对于矩阵,按照row进行打乱。
np.random.shuffle(arr)
arr
array([[3, 4, 5],
       [0, 1, 2],
       [6, 7, 8]])
1
2
3
# 对于矩阵,按照row进行打乱。
np.random.shuffle(arr)
arr
array([[3, 4, 5],
       [6, 7, 8],
       [0, 1, 2]])
1
2
# random.shuffle(list)
l = [i for i in xrange(0,10)]
1
2
3
import random
random.shuffle(l)
l
[2, 9, 5, 8, 6, 3, 7, 1, 4, 0]
1
2
random.shuffle(l)
l
[3, 9, 2, 1, 5, 7, 8, 4, 0, 6]

Shuffe two sequences at the same time

shuffle的状态依赖于random.seed,np.random.seed,可以使用time作为seed

random.seed(time.time())

2次shuffle的seed不一样,结果不一致

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
import random
import numpy as np
train_data = (np.array([[1,1],
[2,2],
[3,3],
[4,4]]), np.array([11,22,33,44]))
x = train_data[0]
y = train_data[1]

# shuffle x,y的时候,只要保证seed相同,那么shuffle之后的x,y对应元素的顺序保持一样
#np.random.seed(1)
np.random.shuffle(x)
print x

#np.random.seed(1)
np.random.shuffle(y)
print y

print
print train_data
[[3 3]
 [1 1]
 [2 2]
 [4 4]]
[22 11 33 44]

(array([[3, 3],
       [1, 1],
       [2, 2],
       [4, 4]]), array([22, 11, 33, 44]))

2次shuffle的seed一样,结果保持一致

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
import random
import numpy as np
train_data = (np.array([[1,1],
[2,2],
[3,3],
[4,4]]), np.array([11,22,33,44]))
x = train_data[0]
y = train_data[1]

# shuffle x,y的时候,只要保证seed相同,那么shuffle之后的x,y对应元素的顺序保持一样
seed = 1

np.random.seed(seed)
np.random.shuffle(x)
print x

np.random.seed(seed)
np.random.shuffle(y)
print y

print
print train_data
[[4 4]
 [3 3]
 [1 1]
 [2 2]]
[44 33 11 22]

(array([[4, 4],
       [3, 3],
       [1, 1],
       [2, 2]]), array([44, 33, 11, 22]))

shuffle_data(x,y)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
import time
import random
import numpy as np
train_data = (np.array([[1,1],
[2,2],
[3,3],
[4,4]]), np.array([11,22,33,44]))
x = train_data[0]
y = train_data[1]

def shuffle_data(x,y):
#seed = int(time.time())

seed = 1
np.random.seed(seed)
np.random.shuffle(x)

np.random.seed(seed)
np.random.shuffle(y)

shuffle_data(x,y)
print x
print y
print
print train_data
[[4 4]
 [3 3]
 [1 1]
 [2 2]]
[44 33 11 22]

(array([[4, 4],
       [3, 3],
       [1, 1],
       [2, 2]]), array([44, 33, 11, 22]))

Shuffle in theano with TensorVariable

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
import time
import random
import numpy as np
import theano
import theano.tensor as T

train_data = (np.array([[1,1],
[2,2],
[3,3],
[4,4]],dtype='float64'), np.array([11,22,33,44],dtype='int32'))
x = train_data[0]
y = train_data[1]

def shuffle_data(x,y):
#seed = int(time.time())

seed = 1
np.random.seed(seed)
np.random.shuffle(x)

np.random.seed(seed)
np.random.shuffle(y)

# OK because of shared<---->train data
shared_x = theano.shared(train_data[0], borrow=True)
shared_y = theano.shared(train_data[1], borrow=True)

#shared_x = theano.shared(np.asarray(train_data[0], dtype=theano.config.floatX), borrow=True) # no copy
#shared_y = theano.shared(np.asarray(train_data[1], dtype=theano.config.floatX), borrow=True) # no copy

y_cast = T.cast(shared_y,"int32")
# shared_y dtype int32, no copy, y_cast is TensorSharedVariable(int32,vector)
# shared_y dtype float64, copy, y_cast is TensorVariable(int32,vector)

print shared_x.type,shared_y.type,y_cast.type
print type(shared_x),type(shared_y),type(y_cast)
print shared_y is y_cast

print 'old train'
print train_data

#print '\nupdate train'
#x[0] = 100
#y[0] = 100

print '\nshuffle train'
#shuffle_data(train_data[0],train_data[1])

originX = shared_x.get_value(borrow=True)
originY = shared_y.get_value(borrow=True)
print originX is x # true
print originY is y # true
shuffle_data(originX,originY)

print train_data

print '\nshared x and y'
print shared_x.get_value()
print shared_y.get_value()
WARNING (theano.sandbox.cuda): The cuda backend is deprecated and will be removed in the next release (v0.10).  Please switch to the gpuarray backend. You can get more information about how to switch at this URL:
 https://github.com/Theano/Theano/wiki/Converting-to-the-new-gpu-back-end%28gpuarray%29

WARNING (theano.sandbox.cuda): CUDA is installed, but device gpu is not available  (error: Unable to get the number of gpus available: no CUDA-capable device is detected)


TensorType(float64, matrix) TensorType(int32, vector) TensorType(int32, vector)
<class 'theano.tensor.sharedvar.TensorSharedVariable'> <class 'theano.tensor.sharedvar.TensorSharedVariable'> <class 'theano.tensor.sharedvar.TensorSharedVariable'>
True
old train
(array([[ 1.,  1.],
       [ 2.,  2.],
       [ 3.,  3.],
       [ 4.,  4.]]), array([11, 22, 33, 44], dtype=int32))

shuffle train
True
True
(array([[ 4.,  4.],
       [ 3.,  3.],
       [ 1.,  1.],
       [ 2.,  2.]]), array([44, 33, 11, 22], dtype=int32))

shared x and y
[[ 4.  4.]
 [ 3.  3.]
 [ 1.  1.]
 [ 2.  2.]]
[44 33 11 22]

Reference

History

  • 20180807: created.

Series

code example

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
import numpy
import numpy as np
import theano
import theano.tensor as T
rng = numpy.random

N = 400 # training sample size
feats = 784 # number of input variables

# generate a dataset: D = (input_values, target_class)
D = (rng.randn(N, feats), rng.randint(size=N, low=0, high=2))
# (400, 784) (400,)
training_steps = 10000

# Declare Theano symbolic variables
x = T.dmatrix("x") # (400,784)
y = T.dvector("y") # (400,)

# initialize the weight vector w randomly
#
# this and the following bias variable b
# are shared so they keep their values
# between training iterations (updates)
w = theano.shared(rng.randn(feats), name="w") #vector (784,)

# initialize the bias term
b = theano.shared(0., name="b") #scalar ()

print("Initial model:")
print(w.get_value().shape)
print(b.get_value().shape)

# Construct Theano expression graph
p_1 = 1 / (1 + T.exp(-T.dot(x, w) - b)) # Probability that target = 1 = vector(400,)
prediction = p_1 > 0.5 # The prediction thresholded = vector(400,) of bool
xent = -y * T.log(p_1) - (1-y) * T.log(1-p_1) # Cross-entropy loss function = vector(400,) of Cx for given x
cost = xent.mean() + 0.01 * (w ** 2).sum()# The cost to minimize = scalar f(x,y,w,b)
gw, gb = T.grad(cost, [w, b]) # Compute the gradient of the cost
# w.r.t weight vector w and
# bias term b
# (we shall return to this in a
# following section of this tutorial)

updates = [(w, w - 0.1 * gw),(b, b - 0.1 * gb)]
# Compile
train = theano.function(
inputs=[x,y],
outputs=[prediction, xent,cost],
updates= updates
)

predict = theano.function(inputs=[x], outputs=prediction)

# Train
for i in range(training_steps):
pred, xcent,cost_t = train(D[0], D[1])

prediction = predict(D[0]) # vector(400,) of bool


#a = np.array([1,2,3,4])
#b = np.array([2,2,4,3])
#np.mean( np.equal(a,b) ) #[0,1,0,0] 0.25

accuracy = 100* np.mean( np.equal(prediction,D[1]) )

print("Final model:")
print(w.get_value().shape)
print(b.get_value().shape)

print "\nCost = ",cost_t
print "Accuracy = ", accuracy,"%"

print("\ntarget values for D:")
#print(D[1])
print("prediction on D:")
#print(prediction)
Initial model:
(784,)
()
Final model:
(784,)
()

Cost =  0.121553645129
Accuracy =  100.0 %

target values for D:
prediction on D:

Reference

History

  • 20180807: created.