Classification: Instant Recognition with Caffe

Posted on 2018-08-07 Edited on 2024-10-14 In deep learning

Tutorial

In this example we’ll classify an image with the bundled CaffeNet model (which is based on the network architecture of Krizhevsky et al. for ImageNet).

We’ll compare CPU and GPU modes and then dig into the model to inspect features and the output.

Setup

First, set up Python, numpy, and matplotlib.

# set up Python environment: numpy for numerical routines, and matplotlib for plotting
import numpy as np
import matplotlib.pyplot as plt
# display plots in this notebook
%matplotlib inline

# set display defaults
plt.rcParams['figure.figsize'] = (10, 10)        # large images
plt.rcParams['image.interpolation'] = 'nearest'  # don't interpolate: show square pixels
plt.rcParams['image.cmap'] = 'gray'  # use grayscale output rather than a (potentially misleading) color heatmap


# The caffe module needs to be on the Python path;
#  we'll add it here explicitly.
import sys
caffe_root = '../'  # this file should be run from {caffe_root}/examples (otherwise change this line)
sys.path.insert(0, caffe_root + 'python')

import caffe
# If you get "No module named _caffe", either you have not built pycaffe or you have the wrong path.

import os
if os.path.isfile(caffe_root + 'models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel'):
    print 'CaffeNet found.'
else:
    print 'Downloading pre-trained CaffeNet model...'
    !../scripts/download_model_binary.py ../models/bvlc_reference_caffenet

CaffeNet found.

Load net and set up input preprocessing

Set Caffe to CPU mode and load the net from disk.

caffe.set_mode_cpu()

model_def = caffe_root + 'models/bvlc_reference_caffenet/deploy.prototxt'
model_weights = caffe_root + 'models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel'

net = caffe.Net(model_def,      # defines the structure of the model
                model_weights,  # contains the trained weights
                caffe.TEST)     # use test mode (e.g., don't perform dropout)

Set up input preprocessing. (We’ll use Caffe’s caffe.io.Transformer to do this, but this step is independent of other parts of Caffe, so any custom preprocessing code may be used).

Our default CaffeNet is configured to take images in BGR format. Values are expected to start in the range [0, 255] and then have the mean ImageNet pixel value subtracted from them. In addition, the channel dimension is expected as the first (outermost) dimension.

As matplotlib will load images with values in the range [0, 1] in RGB format with the channel as the innermost dimension, we are arranging for the needed transformations here.

# load the mean ImageNet image (as distributed with Caffe) for subtraction
mu = np.load(caffe_root + 'python/caffe/imagenet/ilsvrc_2012_mean.npy')
mu = mu.mean(1).mean(1)  # average over pixels to obtain the mean (BGR) pixel values
#mu = np.array([ 104.00698793,  116.66876762,  122.67891434]) 
print 'mean-subtracted values:', zip('BGR', mu)

# create transformer for the input called 'data'
transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape})

transformer.set_transpose('data', (2,0,1))  # move image channels to outermost dimension
transformer.set_mean('data', mu)            # subtract the dataset-mean value in each channel
transformer.set_raw_scale('data', 255)      # rescale from [0, 1] to [0, 255]
transformer.set_channel_swap('data', (2,1,0))  # swap channels from RGB to BGR

mean-subtracted values: [('B', 104.0069879317889), ('G', 116.66876761696767), ('R', 122.6789143406786)]

CPU classification

Now we’re ready to perform classification. Even though we’ll only classify one image, we’ll set a batch size of 50 to demonstrate batching.

# set the size of the input (we can skip this if we're happy
#  with the default; we can also change it later, e.g., for different batch sizes)
net.blobs['data'].reshape(50,        # batch size
                          3,         # 3-channel (BGR) images
                          227, 227)  # image size is 227x227

Load an image (that comes with Caffe) and perform the preprocessing we’ve set up.

1
2
3

image = caffe.io.load_image(caffe_root + 'examples/images/cat.jpg')
transformed_image = transformer.preprocess('data', image)
plt.imshow(image)

/usr/local/lib/python2.7/dist-packages/skimage/transform/_warps.py:84: UserWarning: The default mode, 'constant', will be changed to 'reflect' in skimage 0.15.
  warn("The default mode, 'constant', will be changed to 'reflect' in "





<matplotlib.image.AxesImage at 0x7f2088044450>

Adorable! Let’s classify it!

# copy the image data into the memory allocated for the net
net.blobs['data'].data[...] = transformed_image

### perform classification
output = net.forward()

output_prob = output['prob'][0]  # the output probability vector for the first image in the batch

print output_prob.shape
print 'predicted class is:', output_prob.argmax()

(1000,)
predicted class is: 281

The net gives us a vector of probabilities; the most probable class was the 281st one. But is that correct? Let’s check the ImageNet labels…

# load ImageNet labels
labels_file = caffe_root + 'data/ilsvrc12/synset_words.txt'
if not os.path.exists(labels_file):
    !../data/ilsvrc12/get_ilsvrc_aux.sh
    
labels = np.loadtxt(labels_file, str, delimiter='\t')

print labels.shape
print 'output label:', labels[output_prob.argmax()]

(1000,)
output label: n02123045 tabby, tabby cat

“Tabby cat” is correct! But let’s also look at other top (but less confident predictions).

# sort top five predictions from softmax output
top_inds = output_prob.argsort()[::-1][:5]  # reverse sort and take five largest items

print 'probabilities and labels:'
zip(output_prob[top_inds], labels[top_inds])

probabilities and labels:





[(0.31243625, 'n02123045 tabby, tabby cat'),
 (0.23797157, 'n02123159 tiger cat'),
 (0.12387245, 'n02124075 Egyptian cat'),
 (0.10075716, 'n02119022 red fox, Vulpes vulpes'),
 (0.070957333, 'n02127052 lynx, catamount')]

We see that less confident predictions are sensible.

Switching to GPU mode

Let’s see how long classification took, and compare it to GPU mode.

1	%timeit net.forward()

1 loop, best of 3: 4.26 s per loop

That’s a while, even for a batch of 50 images. Let’s switch to GPU mode.

caffe.set_device(0)  # if we have multiple GPUs, pick the first one
caffe.set_mode_gpu()
net.forward()  # run once before timing to set up memory
%timeit net.forward()

10 loops, best of 3: 29.6 ms per loop

That should be much faster!

Examining intermediate output

A net is not just a black box; let’s take a look at some of the parameters and intermediate activations.

First we’ll see how to read out the structure of the net in terms of activation and parameter shapes.

For each layer, let’s look at the activation shapes, which typically have the form (batch_size, channel_dim, height, width).

The activations are exposed as an OrderedDict, net.blobs.

1
2
3

# for each layer, show the output shape
for layer_name, blob in net.blobs.iteritems():
    print layer_name + '\t' + str(blob.data.shape)

data	(50, 3, 227, 227)
conv1	(50, 96, 55, 55)
pool1	(50, 96, 27, 27)
norm1	(50, 96, 27, 27)
conv2	(50, 256, 27, 27)
pool2	(50, 256, 13, 13)
norm2	(50, 256, 13, 13)
conv3	(50, 384, 13, 13)
conv4	(50, 384, 13, 13)
conv5	(50, 256, 13, 13)
pool5	(50, 256, 6, 6)
fc6	(50, 4096)
fc7	(50, 4096)
fc8	(50, 1000)
prob	(50, 1000)

Now look at the parameter shapes. The parameters are exposed as another OrderedDict, net.params. We need to index the resulting values with either [0] for weights or [1] for biases.

The param shapes typically have the form (output_channels, input_channels, filter_height, filter_width) (for the weights) and the 1-dimensional shape (output_channels,) (for the biases).

1 2	for layer_name, param in net.params.iteritems(): print layer_name + '\t' + str(param[0].data.shape), str(param[1].data.shape)

conv1	(96, 3, 11, 11) (96,)
conv2	(256, 48, 5, 5) (256,)
conv3	(384, 256, 3, 3) (384,)
conv4	(384, 192, 3, 3) (384,)
conv5	(256, 192, 3, 3) (256,)
fc6	(4096, 9216) (4096,)
fc7	(4096, 4096) (4096,)
fc8	(1000, 4096) (1000,)

Since we’re dealing with four-dimensional data here, we’ll define a helper function for visualizing sets of rectangular heatmaps.

def vis_square(data):
    """Take an array of shape (n, height, width) or (n, height, width, 3)
       and visualize each (height, width) thing in a grid of size approx. sqrt(n) by sqrt(n)"""
    
    # normalize data for display
    data = (data - data.min()) / (data.max() - data.min())
    
    # force the number of filters to be square
    n = int(np.ceil(np.sqrt(data.shape[0])))
    padding = (((0, n ** 2 - data.shape[0]),
               (0, 1), (0, 1))                 # add some space between filters
               + ((0, 0),) * (data.ndim - 3))  # don't pad the last dimension (if there is one)
    data = np.pad(data, padding, mode='constant', constant_values=1)  # pad with ones (white)
    
    # tile the filters into an image
    data = data.reshape((n, n) + data.shape[1:]).transpose((0, 2, 1, 3) + tuple(range(4, data.ndim + 1)))
    data = data.reshape((n * data.shape[1], n * data.shape[3]) + data.shape[4:])
    
    plt.imshow(data); plt.axis('off')

First we’ll look at the first layer filters, conv1

1
2
3

# the parameters are a list of [weights, biases]
filters = net.params['conv1'][0].data
vis_square(filters.transpose(0, 2, 3, 1))

The first layer output, conv1 (rectified responses of the filters above, first 36 only)

1 2	feat = net.blobs['conv1'].data[0, :36] vis_square(feat)

The fifth layer after pooling, pool5

1 2	feat = net.blobs['pool5'].data[0] vis_square(feat)

The first fully connected layer, fc6 (rectified)

We show the output values and the histogram of the positive values

feat = net.blobs['fc6'].data[0]
plt.subplot(2, 1, 1)
plt.plot(feat.flat)
plt.subplot(2, 1, 2)
_ = plt.hist(feat.flat[feat.flat > 0], bins=100)

The final probability output, prob

1
2
3

feat = net.blobs['prob'].data[0]
plt.figure(figsize=(15, 3))
plt.plot(feat.flat)

[<matplotlib.lines.Line2D at 0x7f2060250650>]

Note the cluster of strong predictions; the labels are sorted semantically. The top peaks correspond to the top predicted labels, as shown above.

Try your own image

Now we’ll grab an image from the web and classify it using the steps above.

Try setting my_image_url to any JPEG image URL.

# download an image
my_image_url = "..."  # paste your URL here
# for example:
# my_image_url = "https://upload.wikimedia.org/wikipedia/commons/b/be/Orang_Utan%2C_Semenggok_Forest_Reserve%2C_Sarawak%2C_Borneo%2C_Malaysia.JPG"
!wget -O image.jpg $my_image_url

# transform it and copy it into the net
image = caffe.io.load_image('image.jpg')
net.blobs['data'].data[...] = transformer.preprocess('data', image)

# perform classification
net.forward()

# obtain the output probabilities
output_prob = net.blobs['prob'].data[0]

# sort top five predictions from softmax output
top_inds = output_prob.argsort()[::-1][:5]

plt.imshow(image)

print 'probabilities and labels:'
zip(output_prob[top_inds], labels[top_inds])

Reference

History

20180807: created.

Tutorial for Training LeNet on MNIST with Caffe

Posted on 2018-08-07 Edited on 2024-10-14 In deep learning

LeNet

The design of LeNet contains the essence of CNNs that are still used in larger models such as the ones in ImageNet. In general, it consists of a convolutional layer followed by a pooling layer, another convolution layer followed by a pooling layer, and then two fully connected layers similar to the conventional multilayer perceptrons.

经典LeNet结构:

input->conv1(20,)-pool1-conv2(50,)-pool2-f1(500,ReLU)-f2(10,softmax)->output

lenet_train_test.prototxt

batch size设置在net.prototxt中而不是solver.prototxt中,用以明确blob的dims
bottom: layer的input blob; top: layer的output blob
对于0-255区间的pixel，需要归一化到0-1区间，scale = 1/256. = 0.00390625
lr_mult: 1表示learning时，weight的learning rate需要x1;
lr_mult: 2表示learning时，bias的learning rate需要x2 (this usually leads to better convergence rates)
InnerProduct默认输出的是z,而不是a=sigmoid(z)
ReLU是Inplace操作，输入输出blob都是ip1,对于其他Layer,input和output的blob不能是相同的

Input Layer types

Input Layer types for train_val.prototxt

``python
solver.net.forward() # load mini-batch images from training data


**Data** 
```prototxt
name: "mnist"
layer {
  name: "mnist"
  type: "Data"
  top: "data"
  top: "label"
  include {
    phase: TRAIN
  }
  transform_param {
    scale: 0.00390625
  }
  data_param {
    source: "examples/mnist/mnist_train_lmdb"
    batch_size: 64
    backend: LMDB
  }
}


name: "CaffeNet"
layer {
  name: "data"
  type: "Data"
  top: "data"
  top: "label"
  include {
    phase: TRAIN
  }
  transform_param {
    mirror: true
    crop_size: 227
    mean_file: "data/ilsvrc12/imagenet_mean.binaryproto"
  }
# mean pixel / channel-wise mean instead of mean image
#  transform_param {
#    crop_size: 227
#    mean_value: 104
#    mean_value: 117
#    mean_value: 123
#    mirror: true
#  }
  data_param {
    source: "examples/imagenet/ilsvrc12_train_lmdb"
    batch_size: 256
    backend: LMDB
  }
}

** ImageData**: read raw images.

layer {
  name: "data"
  type: "ImageData"
  top: "data"
  top: "label"
  include {
    phase: TRAIN
  }
  transform_param {
    mirror: true
    crop_size: 227
    mean_file: "data/ilsvrc12/imagenet_mean.binaryproto"
  }
  image_data_param {
    source: "data/flickr_style/train.txt"
    batch_size: 50
    new_height: 256
    new_width: 256
  }
}

Input Layer types for deploy.prototxt

** DummyData **: no labels, only for forward and get probs

layer {
  name: "data"
  type: "DummyData"
  top: "data"
  dummy_data_param {
    shape {
      dim: 1
      dim: 3
      dim: 227
      dim: 227
    }
  }
}

** Input** : typically used for networks that are being deployed.

layer {
  name: "data"
  type: "Input"
  top: "data"
  input_param { 
  shape {
      dim: 10
      dim: 3
      dim: 227
      dim: 227
    }
}

LeNet train_val.prototxt

name: "LeNet"
layer {
  name: "mnist"
  type: "Data"
  top: "data"
  top: "label"
  include {
    phase: TRAIN
  }
  transform_param {
    scale: 0.00390625
  }
  data_param {
    source: "examples/mnist/mnist_train_lmdb"
    batch_size: 64
    backend: LMDB
  }
}
layer {
  name: "mnist"
  type: "Data"
  top: "data"
  top: "label"
  include {
    phase: TEST
  }
  transform_param {
    scale: 0.00390625
  }
  data_param {
    source: "examples/mnist/mnist_test_lmdb"
    batch_size: 100
    backend: LMDB
  }
}
layer {
  name: "conv1"
  type: "Convolution"
  bottom: "data"
  top: "conv1"
  param {
    lr_mult: 1
  }
  param {
    lr_mult: 2
  }
  convolution_param {
    num_output: 20
    kernel_size: 5
    stride: 1
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
    }
  }
}
layer {
  name: "pool1"
  type: "Pooling"
  bottom: "conv1"
  top: "pool1"
  pooling_param {
    pool: MAX
    kernel_size: 2
    stride: 2
  }
}
layer {
  name: "conv2"
  type: "Convolution"
  bottom: "pool1"
  top: "conv2"
  param {
    lr_mult: 1
  }
  param {
    lr_mult: 2
  }
  convolution_param {
    num_output: 50
    kernel_size: 5
    stride: 1
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
    }
  }
}
layer {
  name: "pool2"
  type: "Pooling"
  bottom: "conv2"
  top: "pool2"
  pooling_param {
    pool: MAX
    kernel_size: 2
    stride: 2
  }
}
layer {
  name: "ip1"
  type: "InnerProduct"
  bottom: "pool2"
  top: "ip1"
  param {
    lr_mult: 1
  }
  param {
    lr_mult: 2
  }
  inner_product_param {
    num_output: 500
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
    }
  }
}
layer {
  name: "relu1"
  type: "ReLU"
  bottom: "ip1"
  top: "ip1"
}
layer {
  name: "ip2"
  type: "InnerProduct"
  bottom: "ip1"
  top: "ip2"
  param {
    lr_mult: 1
  }
  param {
    lr_mult: 2
  }
  inner_product_param {
    num_output: 10
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
    }
  }
}
layer {
  name: "accuracy"
  type: "Accuracy"
  bottom: "ip2"
  bottom: "label"
  top: "accuracy"
  include {
    phase: TEST
  }
}
layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "ip2"
  bottom: "label"
  top: "loss"
}

LeNet solver.prototxt

# The train/test net protocol buffer definition
net: "examples/mnist/lenet_train_test.prototxt"
    
# batch_size定义在net.prototxt中,train_mini_batch_size = 64,test_mini_batch_size = 100

# test_iter specifies how many forward passes the test should carry out.
# In the case of MNIST, we have test batch size 100 and 100 test iterations,
# covering the full 10,000 testing images.
test_iter: 100 # test_iter = num_test_images/test_mini_batch_size = 10000/100
# Carry out testing every 500 training iterations.
test_interval: 500
# The base learning rate, momentum and the weight decay of the network.
base_lr: 0.01
momentum: 0.9
weight_decay: 0.0005
# The learning rate policy
lr_policy: "inv"
gamma: 0.0001
power: 0.75
# Display every 100 iterations
display: 100
# The maximum number of iterations
max_iter: 10000  # epoch = 
# snapshot intermediate results
snapshot: 5000
snapshot_prefix: "examples/mnist/lenet"
# solver mode: CPU or GPU
solver_mode: GPU

learning rate policy (todo…)

This is the same policy as our default LeNet.

1
2
3

s.lr_policy = 'inv'
s.gamma = 0.0001
s.power = 0.75

EDIT HERE to try the fixed rate (and compare with adaptive solvers)
fixed is the simplest policy that keeps the learning rate constant.

1	s.lr_policy = 'fixed'

Set lr_policy to define how the learning rate changes during training.

# Here, we 'step' the learning rate by multiplying it by a factor `gamma`
# every `stepsize` iterations.
s.lr_policy = 'step'
s.gamma = 0.1
s.stepsize = 20000

solver types (todo…)

solver types include “SGD”, “Adam”, and “Nesterov” among others.

1	s.type = "SGD"

Train LeNet

1 2	cd $CAFFE_ROOT ./examples/mnist/train_lenet.sh

#!/usr/bin/env sh
set -e

./build/tools/caffe train --solver=examples/mnist/lenet_solver.prototxt $@

train output

I0807 16:15:29.555564  4273 solver.cpp:310] Iteration 10000, loss = 0.00251452
I0807 16:15:29.555619  4273 solver.cpp:330] Iteration 10000, Testing net (#0)
I0807 16:15:29.634243  4281 data_layer.cpp:73] Restarting data prefetching from start.
I0807 16:15:29.635372  4273 solver.cpp:397]     Test net output #0: accuracy = 0.9909
I0807 16:15:29.635409  4273 solver.cpp:397]     Test net output #1: loss = 0.0302912 (* 1 = 0.0302912 loss)
I0807 16:15:29.635416  4273 solver.cpp:315] Optimization Done.
I0807 16:15:29.635439  4273 caffe.cpp:259] Optimization Done.

Deploy model

for train, train_test.prototxt + solver.prototxt
for deploy, deploy.prototxt+ model.caffemodel

depoly: no weight_filler,bias_filler, loaded from weights.caffemodel. if not set weights file, w,b default to 0s

PyCaffe

pycaffe interfaces

The Python interface – pycaffe – is the caffe module and its scripts in caffe/python. import caffe to load models, do forward and backward, handle IO, visualize networks, and even instrument model solving. All model data, derivatives, and parameters are exposed for reading and writing.

caffe.Net is the central interface for loading, configuring, and running models.
caffe.Classifier and caffe.Detector provide convenience interfaces for common tasks.
caffe.SGDSolver exposes the solving interface.
caffe.io handles input / output with preprocessing and protocol buffers.
caffe.draw visualizes network architectures.
Caffe blobs are exposed as numpy ndarrays for ease-of-use and efficiency.

Tutorial IPython notebooks are found in caffe/examples: do ipython notebook caffe/examples to try them. For developer reference docstrings can be found throughout the code.

Compile pycaffe by make pycaffe. Add the module directory to your $PYTHONPATH by export PYTHONPATH=/path/to/caffe/python:$PYTHONPATH or the like for import caffe.

Reference

History

20180807: created.

Install and Configure Caffe on ubuntu 16.04

Posted on 2018-08-07 Edited on 2024-10-14 In deep learning

Series

Guide

requirements:

NVIDIA driver 396.54
~~CUDA 8.0 + cudnn 6.0.21~~
CUDA 9.2 +cudnn 7.1.4
opencv 3.1.0 —>3.3.0
python 2.7 + numpy 1.15.1
~~python 3.5.2 + numpy 1.16.2~~
protobuf 3.6.1 （static）
caffe latest

默认的protobuf,2.6.1测试通过。
此处，使用最新的3.6.1 也可以，编译caffe需要加上-std=c++11

install CUDA + cudnn

see install and configure cuda 9.2 with cudnn 7.1 on ubuntu 16.04
tips: we need to recompile caffe with cudnn 7.1

before we compile caffe, move caffe/python/caffe/selective_search_ijcv_with_python folder outside caffe source folder, otherwise error occurs.

install protobuf

see Part 1: compile protobuf-cpp on ubuntu 16.04

which protoc 
/usr/local/bin/protoc

protoc --version
libprotoc 3.6.1

caffe使用static的libprotoc 3.6.1

install opencv

see compile opencv on ubuntu 16.04

which opencv_version
/usr/local/bin/opencv_version

opencv_version 
3.3.0

python

1 2	python --version Python 2.7.12

check numpy version

import numpy
numpy.__version__
'1.15.1'

import numpy
import inspect
inspect.getfile(numpy)
'/usr/local/lib/python2.7/dist-packages/numpy/__init__.pyc'

compile caffe

clone repo

1 2	git clone https://github.com/BVLC/caffe.git cd caffe

update repo

update at 20180822.

if you change your local Makefile and git pull origin master merge conflict, solution

1 2	git checkout HEAD Makefile git pull origin master

configure

1	mkdir build && cd build && cmake-gui ..

cmake-gui options

USE_CUDNN ON
USE_OPENCV ON
Build_python ON
Build_python_layer ON

BLAS atlas
CMAKE_CXX_FLGAS -std=c++11

CMAKE_INSTALL_PREFIX /home/kezunlin/program/caffe/build/install

使用-std=c++11

configure output

Dependencies:
  BLAS              :   Yes (Atlas)
  Boost             :   Yes (ver. 1.66)
  glog              :   Yes
  gflags            :   Yes
  protobuf          :   Yes (ver. 3.6.1)
  lmdb              :   Yes (ver. 0.9.17)
  LevelDB           :   Yes (ver. 1.18)
  Snappy            :   Yes (ver. 1.1.3)
  OpenCV            :   Yes (ver. 3.1.0)
  CUDA              :   Yes (ver. 9.2)

NVIDIA CUDA:
  Target GPU(s)     :   Auto
  GPU arch(s)       :   sm_61
  cuDNN             :   Yes (ver. 7.1.4)

Python:
  Interpreter       :   /usr/bin/python2.7 (ver. 2.7.12)
  Libraries         :   /usr/lib/x86_64-linux-gnu/libpython2.7.so (ver 2.7.12)
  NumPy             :   /usr/lib/python2.7/dist-packages/numpy/core/include (ver 1.51.1)

Documentaion:
  Doxygen           :   /usr/bin/doxygen (1.8.11)
  config_file       :   /home/kezunlin/program/caffe/.Doxyfile

Install:
  Install path      :   /home/kezunlin/program/caffe/build/install

Configuring done

we can also use python3.5 and numpy 1.16.2

Python:
  Interpreter       :   /usr/bin/python3 (ver. 3.5.2)
  Libraries         :   /usr/lib/x86_64-linux-gnu/libpython3.5m.so (ver 3.5.2)
  NumPy             :   /home/kezunlin/.local/lib/python3.5/site-packages/numpy/core/include (ver 1.16.2)

use -std=c++11, otherwise errors occur

make -j8
[  1%] Running C++/Python protocol buffer compiler on /home/kezunlin/program/caffe/src/caffe/proto/caffe.proto
Scanning dependencies of target caffeproto
[  1%] Building CXX object src/caffe/CMakeFiles/caffeproto.dir/__/__/include/caffe/proto/caffe.pb.cc.o
In file included from /usr/include/c++/5/mutex:35:0,
                 from /usr/local/include/google/protobuf/stubs/mutex.h:33,
                 from /usr/local/include/google/protobuf/stubs/common.h:52,
                 from /home/kezunlin/program/caffe/build/include/caffe/proto/caffe.pb.h:9,
                 from /home/kezunlin/program/caffe/build/include/caffe/proto/caffe.pb.cc:4:
/usr/include/c++/5/bits/c++0x_warning.h:32:2: error: #error This file requires compiler and library support for the ISO C++ 2011 standard. This support must be enabled with the -std=c++11 or -std=gnu++11 compiler options.
 #error This file requires compiler and library support \

fix gcc error

edit /usr/local/cuda/include/host_config.h

将其中的第115行注释掉：

1
2
3

#error-- unsupported GNU version! gcc versions later than 4.9 are not supported!
======>
//#error-- unsupported GNU version! gcc versions later than 4.9 are not supported!

fix gflags error

caffe/include/caffe/common.hpp
caffe/examples/mnist/convert_mnist_data.cpp

Comment out the ifndef

1
2
3

// #ifndef GFLAGS_GFLAGS_H_
namespace gflags = google;
// #endif  // GFLAGS_GFLAGS_H_

compile

1
2
3

make clean
make -j8 
make pycaffe

output

[  1%] Running C++/Python protocol buffer compiler on /home/kezunlin/program/caffe/src/caffe/proto/caffe.proto
Scanning dependencies of target caffeproto
[  1%] Building CXX object src/caffe/CMakeFiles/caffeproto.dir/__/__/include/caffe/proto/caffe.pb.cc.o
[  1%] Linking CXX static library ../../lib/libcaffeproto.a
[  1%] Built target caffeproto

libcaffeproto.a　static library

install

make install

ls build/install
bin  include  lib  python  share

install to build/install folder

1 2	ls build/install/lib libcaffeproto.a libcaffe.so libcaffe.so.1.0.0

advanced

INTERFACE_INCLUDE_DIRECTORIES
INTERFACE_LINK_LIBRARIES

Target “caffe” has an INTERFACE_LINK_LIBRARIES property which differs from its LINK_INTERFACE_LIBRARIES properties.

Play with Caffe

python caffe

fix python caffe

fix ipython 6.1 version conflict

vim caffe/python/requirements.txt

1
2
3

ipython>=3.0.0
====>
ipython==5.4.1

reinstall ipython

pip install -r requirements.txt

cd caffe/python
python
>>>import caffe

python draw net

sudo apt-get install graphviz
sudo pip install theano=0.9

# for theano d3viz
sudo pip install pydot==1.1.0
sudo pip install pydot-ng


# other usefull tools
sudo pip install jupyter
sudo pip install seaborn

we need to install graphviz, otherwise we get ERROR:”dot” not found in path

draw net

cd $CAFFE_HOME
./python/draw_net.py ./examples/mnist/lenet.prototxt ./examples/mnist/lenet.png

eog ./examples/mnist/lenet.png

cpp caffe

train net

cd caffe
./examples/mnist/create_mnist.sh
./examples/mnist/train_lenet.sh 

cat ./examples/mnist/train_lenet.sh 
./build/tools/caffe train --solver=examples/mnist/lenet_solver.prototxt $@

output results

I0912 15:57:28.812655 14094 solver.cpp:327] Iteration 10000, loss = 0.00272129
I0912 15:57:28.812675 14094 solver.cpp:347] Iteration 10000, Testing net (#0)
I0912 15:57:28.891481 14100 data_layer.cpp:73] Restarting data prefetching from start.
I0912 15:57:28.893678 14094 solver.cpp:414]     Test net output #0: accuracy = 0.9904
I0912 15:57:28.893707 14094 solver.cpp:414]     Test net output #1: loss = 0.0276084 (* 1 = 0.0276084 loss)
I0912 15:57:28.893714 14094 solver.cpp:332] Optimization Done.
I0912 15:57:28.893719 14094 caffe.cpp:250] Optimization Done.

tips, for caffe, errors because no imdb data.

I0417 13:28:17.764714 35030 layer_factory.hpp:77] Creating layer mnist
F0417 13:28:17.765067 35030 db_lmdb.hpp:15] Check failed: mdb_status == 0 (2 vs. 0) No such file or directory
---------------------

upgrade net

1 2	./tools/upgrade_net_proto_text old.prototxt new.prototxt ./tools/upgrade_net_proto_binary old.caffemodel new.caffemodel

caffe time

yolov3

./build/tools/caffe time --model='det/yolov3/yolov3.prototxt' --iterations=100 --gpu=0

I0313 10:15:41.888208 12527 caffe.cpp:408] Average Forward pass: 49.7012 ms.
I0313 10:15:41.888213 12527 caffe.cpp:410] Average Backward pass: 84.946 ms.
I0313 10:15:41.888248 12527 caffe.cpp:412] Average Forward-Backward: 134.85 ms.

yolov3 5 class

./build/tools/caffe time --model='det/autotrain/yolo3-5c.prototxt' --iterations=100 --gpu=0

I0313 10:19:27.283625 12894 caffe.cpp:408] Average Forward pass: 38.4823 ms.
I0313 10:19:27.283630 12894 caffe.cpp:410] Average Backward pass: 74.1656 ms.
I0313 10:19:27.283638 12894 caffe.cpp:412] Average Forward-Backward: 112.732 ms.

Example

Caffe Classifier

#include <caffe/caffe.hpp>
#ifdef USE_OPENCV
#include <opencv2/core/core.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#endif  // USE_OPENCV
#include <algorithm>
#include <iosfwd>
#include <memory>
#include <string>
#include <utility>
#include <vector>

#ifdef USE_OPENCV
using namespace caffe;  // NOLINT(build/namespaces)
using std::string;

/* Pair (label, confidence) representing a prediction. */
typedef std::pair<string, float> Prediction;

class Classifier {
 public:
  Classifier(const string& model_file,
             const string& trained_file,
             const string& mean_file,
             const string& label_file);

  std::vector<Prediction> Classify(const cv::Mat& img, int N = 5);

 private:
  void SetMean(const string& mean_file);

  std::vector<float> Predict(const cv::Mat& img);

  void WrapInputLayer(std::vector<cv::Mat>* input_channels);

  void Preprocess(const cv::Mat& img,
                  std::vector<cv::Mat>* input_channels);

 private:
  shared_ptr<Net<float> > net_;
  cv::Size input_geometry_;
  int num_channels_;
  cv::Mat mean_;
  std::vector<string> labels_;
};

Classifier::Classifier(const string& model_file,
                       const string& trained_file,
                       const string& mean_file,
                       const string& label_file) {
#ifdef CPU_ONLY
  Caffe::set_mode(Caffe::CPU);
#else
  Caffe::set_mode(Caffe::GPU);
#endif

  /* Load the network. */
  net_.reset(new Net<float>(model_file, TEST));
  net_->CopyTrainedLayersFrom(trained_file);

  CHECK_EQ(net_->num_inputs(), 1) << "Network should have exactly one input.";
  CHECK_EQ(net_->num_outputs(), 1) << "Network should have exactly one output.";

  Blob<float>* input_layer = net_->input_blobs()[0];
  num_channels_ = input_layer->channels();
  CHECK(num_channels_ == 3 || num_channels_ == 1)
    << "Input layer should have 1 or 3 channels.";
  input_geometry_ = cv::Size(input_layer->width(), input_layer->height());

  /* Load the binaryproto mean file. */
  SetMean(mean_file);

  /* Load labels. */
  std::ifstream labels(label_file.c_str());
  CHECK(labels) << "Unable to open labels file " << label_file;
  string line;
  while (std::getline(labels, line))
    labels_.push_back(string(line));

  Blob<float>* output_layer = net_->output_blobs()[0];
  CHECK_EQ(labels_.size(), output_layer->channels())
    << "Number of labels is different from the output layer dimension.";
}

static bool PairCompare(const std::pair<float, int>& lhs,
                        const std::pair<float, int>& rhs) {
  return lhs.first > rhs.first;
}

/* Return the indices of the top N values of vector v. */
static std::vector<int> Argmax(const std::vector<float>& v, int N) {
  std::vector<std::pair<float, int> > pairs;
  for (size_t i = 0; i < v.size(); ++i)
    pairs.push_back(std::make_pair(v[i], i));
  std::partial_sort(pairs.begin(), pairs.begin() + N, pairs.end(), PairCompare);

  std::vector<int> result;
  for (int i = 0; i < N; ++i)
    result.push_back(pairs[i].second);
  return result;
}

/* Return the top N predictions. */
std::vector<Prediction> Classifier::Classify(const cv::Mat& img, int N) {
  std::vector<float> output = Predict(img);

  N = std::min<int>(labels_.size(), N);
  std::vector<int> maxN = Argmax(output, N);
  std::vector<Prediction> predictions;
  for (int i = 0; i < N; ++i) {
    int idx = maxN[i];
    predictions.push_back(std::make_pair(labels_[idx], output[idx]));
  }

  return predictions;
}

/* Load the mean file in binaryproto format. */
void Classifier::SetMean(const string& mean_file) {
  BlobProto blob_proto;
  ReadProtoFromBinaryFileOrDie(mean_file.c_str(), &blob_proto);

  /* Convert from BlobProto to Blob<float> */
  Blob<float> mean_blob;
  mean_blob.FromProto(blob_proto);
  CHECK_EQ(mean_blob.channels(), num_channels_)
    << "Number of channels of mean file doesn't match input layer.";

  /* The format of the mean file is planar 32-bit float BGR or grayscale. */
  std::vector<cv::Mat> channels;
  float* data = mean_blob.mutable_cpu_data();
  for (int i = 0; i < num_channels_; ++i) {
    /* Extract an individual channel. */
    cv::Mat channel(mean_blob.height(), mean_blob.width(), CV_32FC1, data);
    channels.push_back(channel);
    data += mean_blob.height() * mean_blob.width();
  }

  /* Merge the separate channels into a single image. */
  cv::Mat mean;
  cv::merge(channels, mean);

  /* Compute the global mean pixel value and create a mean image
   * filled with this value. */
  cv::Scalar channel_mean = cv::mean(mean);
  mean_ = cv::Mat(input_geometry_, mean.type(), channel_mean);
}

std::vector<float> Classifier::Predict(const cv::Mat& img) {
  Blob<float>* input_layer = net_->input_blobs()[0];
  input_layer->Reshape(1, num_channels_,
                       input_geometry_.height, input_geometry_.width);
  /* Forward dimension change to all layers. */
  net_->Reshape();

  std::vector<cv::Mat> input_channels;
  WrapInputLayer(&input_channels);

  Preprocess(img, &input_channels);

  net_->Forward();

  /* Copy the output layer to a std::vector */
  Blob<float>* output_layer = net_->output_blobs()[0];
  const float* begin = output_layer->cpu_data();
  const float* end = begin + output_layer->channels();
  return std::vector<float>(begin, end);
}

/* Wrap the input layer of the network in separate cv::Mat objects
 * (one per channel). This way we save one memcpy operation and we
 * don't need to rely on cudaMemcpy2D. The last preprocessing
 * operation will write the separate channels directly to the input
 * layer. */
void Classifier::WrapInputLayer(std::vector<cv::Mat>* input_channels) {
  Blob<float>* input_layer = net_->input_blobs()[0];

  int width = input_layer->width();
  int height = input_layer->height();
  float* input_data = input_layer->mutable_cpu_data();
  for (int i = 0; i < input_layer->channels(); ++i) {
    cv::Mat channel(height, width, CV_32FC1, input_data);
    input_channels->push_back(channel);
    input_data += width * height;
  }
}

void Classifier::Preprocess(const cv::Mat& img,
                            std::vector<cv::Mat>* input_channels) {
  /* Convert the input image to the input image format of the network. */
  cv::Mat sample;
  if (img.channels() == 3 && num_channels_ == 1)
    cv::cvtColor(img, sample, cv::COLOR_BGR2GRAY);
  else if (img.channels() == 4 && num_channels_ == 1)
    cv::cvtColor(img, sample, cv::COLOR_BGRA2GRAY);
  else if (img.channels() == 4 && num_channels_ == 3)
    cv::cvtColor(img, sample, cv::COLOR_BGRA2BGR);
  else if (img.channels() == 1 && num_channels_ == 3)
    cv::cvtColor(img, sample, cv::COLOR_GRAY2BGR);
  else
    sample = img;

  cv::Mat sample_resized;
  if (sample.size() != input_geometry_)
    cv::resize(sample, sample_resized, input_geometry_);
  else
    sample_resized = sample;

  cv::Mat sample_float;
  if (num_channels_ == 3)
    sample_resized.convertTo(sample_float, CV_32FC3);
  else
    sample_resized.convertTo(sample_float, CV_32FC1);

  cv::Mat sample_normalized;
  cv::subtract(sample_float, mean_, sample_normalized);

  /* This operation will write the separate BGR planes directly to the
   * input layer of the network because it is wrapped by the cv::Mat
   * objects in input_channels. */
  cv::split(sample_normalized, *input_channels);

  CHECK(reinterpret_cast<float*>(input_channels->at(0).data)
        == net_->input_blobs()[0]->cpu_data())
    << "Input channels are not wrapping the input layer of the network.";
}

int main(int argc, char** argv) {
  if (argc != 6) {
    std::cerr << "Usage: " << argv[0]
              << " deploy.prototxt network.caffemodel"
              << " mean.binaryproto labels.txt img.jpg" << std::endl;
    return 1;
  }

  ::google::InitGoogleLogging(argv[0]);

  string model_file   = argv[1];
  string trained_file = argv[2];
  string mean_file    = argv[3];
  string label_file   = argv[4];
  Classifier classifier(model_file, trained_file, mean_file, label_file);

  string file = argv[5];

  std::cout << "---------- Prediction for "
            << file << " ----------" << std::endl;

  cv::Mat img = cv::imread(file, -1);
  CHECK(!img.empty()) << "Unable to decode image " << file;
  std::vector<Prediction> predictions = classifier.Classify(img);

  /* Print the top N predictions. */
  for (size_t i = 0; i < predictions.size(); ++i) {
    Prediction p = predictions[i];
    std::cout << std::fixed << std::setprecision(4) << p.second << " - \""
              << p.first << "\"" << std::endl;
  }
}
#else
int main(int argc, char** argv) {
  LOG(FATAL) << "This example requires OpenCV; compile with USE_OPENCV.";
}
#endif  // USE_OPENCV

CMakeLists.txt

find_package(OpenCV REQUIRED)


set(Caffe_DIR "/home/kezunlin/program/caffe/build/install/share/Caffe")  # caffe caffe

# for CaffeConfig.cmake/ caffe-config.cmake
find_package(Caffe)
# offical caffe  : There is no Caffe_INCLUDE_DIRS and Caffe_DEFINITIONS
# refinedet caffe: OK.

add_definitions(${Caffe_DEFINITIONS})

MESSAGE( [Main] " Caffe_INCLUDE_DIRS = ${Caffe_INCLUDE_DIRS}")
MESSAGE( [Main] " Caffe_DEFINITIONS = ${Caffe_DEFINITIONS}")
MESSAGE( [Main] " Caffe_LIBRARIES = ${Caffe_LIBRARIES}") # caffe
MESSAGE( [Main] " Caffe_CPU_ONLY = ${Caffe_CPU_ONLY}")
MESSAGE( [Main] " Caffe_HAVE_CUDA = ${Caffe_HAVE_CUDA}")
MESSAGE( [Main] " Caffe_HAVE_CUDNN = ${Caffe_HAVE_CUDNN}")


include_directories(${Caffe_INCLUDE_DIRS})

target_link_libraries(demo 
    ${OpenCV_LIBS}
    ${Caffe_LIBRARIES}
)

run

./demo

if error occurs:

libcaffe.so.1.0.0 => not found

edit .bashrc

1 2	# for caffe export LD_LIBRARY_PATH=/home/kezunlin/program/caffe/build/install/lib:$LD_LIBRARY_PATH

Reference

History

20180807: created.
20180822: update cmake-gui for caffe

tutorials on deep learning for image classification

Posted on 2018-08-07 Edited on 2024-10-14 In deep learning

Getting Started

Shared variables tips

We encourage you to store the dataset into shared variables and access it based on the minibatch index, given a fixed and known batch size. The reason behind shared variables is related to using the GPU. There is a large overhead when copying data into the GPU memory.

将数据存储在shared variable便于加速GPU计算，避免数据从CPU拷贝到GPU。

If you have your data in Theano shared variables though, you give Theano the possibility to copy the entire data on the GPU in a single call when the shared variables are constructed.

shared构建的时候，theano一次性讲所有数据拷贝至GPU.

Because the datapoints and their labels are usually of different nature (labels are usually integers while datapoints are real numbers) we suggest to use different variables for label and data. Also we recommend using different variables for the training set, validation set and testing set to make the code more readable (resulting in 6 different shared variables).

data,label使用２个shared variables，train,valid,test使用不同的shared variables，总计6个

Mini-batch data

def shared_dataset(data_xy):
    """ Function that loads the dataset into shared variables

    The reason we store our dataset in shared variables is to allow
    Theano to copy it into the GPU memory (when code is run on GPU).
    Since copying data into the GPU is slow, copying a minibatch everytime
    is needed (the default behaviour if the data is not in a shared
    variable) would lead to a large decrease in performance.
    """
    data_x, data_y = data_xy
    shared_x = theano.shared(numpy.asarray(data_x, dtype=theano.config.floatX))
    shared_y = theano.shared(numpy.asarray(data_y, dtype=theano.config.floatX))
    # shared变量中的数据在GPU上必须是float32类型，但是计算阶段可能需要int类型(y)，所以需要
    # 将float32--->int.
    # When storing data on the GPU it has to be stored as floats
    # therefore we will store the labels as ``floatX`` as well
    # (``shared_y`` does exactly that). But during our computations
    # we need them as ints (we use labels as index, and if they are
    # floats it doesn't make sense) therefore instead of returning
    # ``shared_y`` we will have to cast it to int. This little hack
    # lets us get around this issue
    return shared_x, T.cast(shared_y, 'int32')

test_set_x, test_set_y = shared_dataset(test_set)
valid_set_x, valid_set_y = shared_dataset(valid_set)
train_set_x, train_set_y = shared_dataset(train_set)

batch_size = 500    # size of the minibatch

# accessing the third minibatch of the training set

data  = train_set_x[2 * batch_size: 3 * batch_size]
label = train_set_y[2 * batch_size: 3 * batch_size]

SGD pseudo code in theano

Traditional GD (m=N)

# GRADIENT DESCENT

while True:
    loss = f(params)
    d_loss_wrt_params = ... # compute gradient
    params -= learning_rate * d_loss_wrt_params
    if <stopping condition is met>:
        return params

Stochastic gradient descent (SGD) works according to the same principles as ordinary gradient descent, but proceeds more quickly by estimating the gradient from just a few examples at a time instead of the entire training set. In its purest form, we estimate the gradient from just a single example at a time.

Online Learning SGD (m=1)

# STOCHASTIC GRADIENT DESCENT
for (x_i,y_i) in training_set:
                            # imagine an infinite generator
                            # that may repeat examples (if there is only a finite training set)
    loss = f(params, x_i, y_i)
    d_loss_wrt_params = ... # compute gradient
    params -= learning_rate * d_loss_wrt_params
    if <stopping condition is met>:
        return params

The variant that we recommend for deep learning is a further twist on stochastic gradient descent using so-called “minibatches”. Minibatch SGD (MSGD) works identically to SGD, except that we use more than one training example to make each estimate of the gradient. This technique reduces variance in the estimate of the gradient, and often makes better use of the hierarchical memory organization in modern computers.

Minibatch SGD (m)

for (x_batch,y_batch) in train_batches:
                            # imagine an infinite generator
                            # that may repeat examples
    loss = f(params, x_batch, y_batch)
    d_loss_wrt_params = ... # compute gradient using theano
    params -= learning_rate * d_loss_wrt_params
    if <stopping condition is met>:
        return params

Theano pseudo code of Minibatch SGD

# Minibatch Stochastic Gradient Descent

# assume loss is a symbolic description of the loss function given
# the symbolic variables params (shared variable), x_batch, y_batch;

# compute gradient of loss with respect to params
d_loss_wrt_params = T.grad(loss, params)

# compile the MSGD step into a theano function
updates = [(params, params - learning_rate * d_loss_wrt_params)]
MSGD = theano.function([x_batch,y_batch], loss, updates=updates)

for (x_batch, y_batch) in train_batches:
    # here x_batch and y_batch are elements of train_batches and
    # therefore numpy arrays; function MSGD also updates the params
    print('Current loss is ', MSGD(x_batch, y_batch))
    if stopping_condition_is_met:
        return params

Regularization

L1/L2 regularization

L1/L2 regularization and early-stopping.

Commonly used values for p are 1 and 2, hence the L1/L2 nomenclature. If p=2, then the regularizer is also called “weight decay”.

To follow Occam’s razor principle, this minimization should find us the simplest solution (as measured by our simplicity criterion) that fits the training data.

# symbolic Theano variable that represents the L1 regularization term
L1  = T.sum(abs(param))

# symbolic Theano variable that represents the squared L2 term
L2 = T.sum(param ** 2)

# the loss
loss = NLL + lambda_1 * L1 + lambda_2 * L2

Early-Stopping

Early-stopping combats overfitting by monitoring the model’s performance on a validation set. A validation set is a set of examples that we never use for gradient descent, but which is also not a part of the test set.

所谓early stopping，即在每一个epoch结束时（一个epoch即对所有训练数据的一轮遍历）计算 validation data的accuracy，当accuracy不再提高时，就停止训练。这是很自然的做法，因为accuracy不再提高了，训练下去也没用。另外，这样做还能防止overfitting。

那么，怎么样才算是validation accuracy不再提高呢？并不是说validation accuracy一降下来，它就是“不再提高”，因为可能经过这个epoch后，accuracy降低了，但是随后的epoch又让accuracy升上去了，所以不能根据一两次的连续降低就判断“不再提高”。正确的做法是，在训练的过程中，记录最佳的validation accuracy，当连续10次epoch（或者更多次）没达到最佳accuracy时，你可以认为“不再提高”，此时使用early stopping。这个策略就叫“ no-improvement-in-n”，n即epoch的次数，可以根据实际情况取10、20、30….

Variable learning rate

Decreasing the learning rate over time is sometimes a good idea. eta = eta0/(1+d*epoch) (d: eta decrease constant, d=0.001)
Early stopping + decrease learning rate. eta = eta0/2 until eta= eta0/1024

一个简单有效的做法就是，当validation accuracy满足 no-improvement-in-n规则时，本来我们是要early stopping的，但是我们可以不stop，而是让learning rate减半，之后让程序继续跑。下一次validation accuracy又满足no-improvement-in-n规则时，我们同样再将learning rate减半（此时变为原始learni rate的四分之一）…继续这个过程，直到learning rate变为原来的1/1024再终止程序。（1/1024还是1/512还是其他可以根据实际确定）。【PS：也可以选择每一次将learning rate除以10，而不是除以2.】

实践中，eta/2变化太快，eta0/(1+d*epoch),d=0.001比较合适。

# early-stopping parameters
patience = 5000  # look as this many examples regardless
patience_increase = 2     # wait this much longer when a new best is
                              # found
improvement_threshold = 0.995  # a relative improvement of this much is
                               # considered significant
validation_frequency = min(n_train_batches, patience/2) = 2500 # for iters
                              # go through this many
                              # minibatches before checking the network
                              # on the validation set; in this case we
                              # check every epoch

best_params = None
best_validation_loss = numpy.inf
test_score = 0.
start_time = time.clock()

done_looping = False
epoch = 0
while (epoch < n_epochs) and (not done_looping):
    # Report "1" for first epoch, "n_epochs" for last epoch
    epoch = epoch + 1
    for minibatch_index in range(n_train_batches):

        d_loss_wrt_params = ... # compute gradient
        params -= learning_rate * d_loss_wrt_params # gradient descent

        # iteration number. We want it to start at 0.
        iter = (epoch - 1) * n_train_batches + minibatch_index
        # note that if we do `iter % validation_frequency` it will be
        # true for iter = 0 which we do not want. We want it true for
        # iter = validation_frequency - 1.
        if (iter + 1) % validation_frequency == 0:

            this_validation_loss = ... # compute zero-one loss on validation set

            if this_validation_loss < best_validation_loss:

                # improve patience if loss improvement is good enough
                if this_validation_loss < best_validation_loss * improvement_threshold:
                    patience = max(patience, iter * patience_increase)
                    
                best_params = copy.deepcopy(params)
                best_validation_loss = this_validation_loss

        if patience <= iter:
            done_looping = True
            break

# POSTCONDITION:
# best_params refers to the best out-of-sample parameters observed during the optimization

Theano/Python Tips

Loading and Saving Models

DO: Pickle the numpy ndarrays from your shared variables
DON’T: Do not pickle your training or test functions for long-term storage

import cPickle
save_file = open('path', 'wb')  # this will overwrite current contents
cPickle.dump(w.get_value(borrow=True), save_file, -1)  # the -1 is for HIGHEST_PROTOCOL
cPickle.dump(v.get_value(borrow=True), save_file, -1)  # .. and it triggers much more efficient
cPickle.dump(u.get_value(borrow=True), save_file, -1)  # .. storage than numpy's default
save_file.close()

Then later, you can load your data back like this:

save_file = open('path')
w.set_value(cPickle.load(save_file), borrow=True)
v.set_value(cPickle.load(save_file), borrow=True)
u.set_value(cPickle.load(save_file), borrow=True)

Plotting Intermediate Results

If you have enough disk space, your training script should save intermediate models and a visualization script should process those saved models.

MLP

see here

An MLP can be viewed as a logistic regression classifier where the input is first transformed using a learnt non-linear transformation sigmoid. This transformation projects the input data into a space where it becomes linearly separable. This intermediate layer is referred to as a hidden layer. A single hidden layer is sufficient to make MLPs a universal approximator.

weight initializations

old version: 1/sqrt(n_in)

The initial values for the weights of a hidden layer i should be uniformly sampled from a symmetric interval that depends on the activation function. weight的初始化依赖于activation

tanh: uniformely sampled from -sqrt(6./(n_in+n_hidden)) and sqrt(6./(n_in+n_hidden))
sigmoid : use 4 times larger initial weights for sigmoid compared to tanh

# `W` is initialized with `W_values` which is uniformely sampled
        # from sqrt(-6./(n_in+n_hidden)) and sqrt(6./(n_in+n_hidden))
        # for tanh activation function
        # the output of uniform if converted using asarray to dtype
        # theano.config.floatX so that the code is runable on GPU
        # Note : optimal initialization of weights is dependent on the
        #        activation function used (among other things).
        #        For example, results presented in [Xavier10] suggest that you
        #        should use 4 times larger initial weights for sigmoid
        #        compared to tanh
        #        We have no info for other function, so we use the same as
        #        tanh.
        if W is None:
            W_values = numpy.asarray(
                rng.uniform(
                    low=-numpy.sqrt(6. / (n_in + n_out)),
                    high=numpy.sqrt(6. / (n_in + n_out)),
                    size=(n_in, n_out)
                ),
                dtype=theano.config.floatX
            )
            if activation == theano.tensor.nnet.sigmoid:
                W_values *= 4

            W = theano.shared(value=W_values, name='W', borrow=True)

        if b is None:
            b_values = numpy.zeros((n_out,), dtype=theano.config.floatX)
            b = theano.shared(value=b_values, name='b', borrow=True)

        self.W = W
        self.b = b

Tips and Tricks for training MLPs

Nonlinearity

Two of the most common ones are the sigmoid and the tanh function. nonlinearities that are symmetric around the origin are preferred because they tend to produce zero-mean inputs to the next layer (which is a desirable property). Empirically, we have observed that the tanh has better convergence properties.

Weight initialization

At initialization we want the weights to be small enough around the origin so that the activation function operates in its linear regime, where gradients are the largest. weight的初始化依赖于activation

Learning rate

The simplest solution is to simply have a constant rate. Rule of thumb: try several log-spaced values (10^{-1},10^{-2},\ldots) and narrow the (logarithmic) grid search to the region where you obtain the lowest validation error.
Decreasing the learning rate over time is sometimes a good idea. eta = eta0/(1+d*epoch) (d: decrease constant, 0.001)
Early stopping + decrease learning rate. eta = eta0/2 until eta= eta0/1024

Regularization parameter

Typical values to try for the L1/L2 regularization parameter \lambda are 10^{-2},10^{-3},\ldots. In the framework that we described so far, optimizing this parameter will not lead to significantly better solutions, but is worth exploring nonetheless.

Number of hidden units

This hyper-parameter is very much dataset-dependent. hidden neurons的数量依赖于具体的数据集。Unless we employ some regularization scheme (early stopping or L1/L2 penalties), a typical number of hidden units vs. generalization performance graph will be U-shaped.

CNN

LeNet

The Convolution and Pool Operator

import theano
from theano import tensor as T
from theano.tensor.nnet import conv2d,sigmoid
from theano.tensor.signal.pool import pool_2d

import numpy
import numpy as np
import matplotlib.pyplot as plt  
from PIL import Image

rng = numpy.random.RandomState(23455)

# instantiate 4D tensor for input
input = T.tensor4(name='input')

# initialize shared variable for weights.
w_shp = (2, 3, 9, 9)
w_bound = numpy.sqrt(3 * 9 * 9)
W = theano.shared( numpy.asarray(
            rng.uniform(
                low=-1.0 / w_bound,
                high=1.0 / w_bound,
                size=w_shp),
            dtype=input.dtype), name ='W')

# initialize shared variable for bias (1D tensor) with random values
# IMPORTANT: biases are usually initialized to zero. However in this
# particular application, we simply apply the convolutional layer to
# an image without learning the parameters. We therefore initialize
# them to random values to "simulate" learning.
b_shp = (2,)
b = theano.shared(numpy.asarray(
            rng.uniform(low=-.5, high=.5, size=b_shp),
            dtype=input.dtype), name ='b')

# build symbolic expression that computes the convolution of input with filters in w
conv_out = conv2d(input, W)

poolsize=(2,2)
pooled_out = pool_2d( input=conv_out, ws=poolsize, ignore_border=True) 

conv_activations = sigmoid(conv_out + b.dimshuffle('x', 0, 'x', 'x'))
# create theano function to compute filtered images
f = theano.function([input], conv_activations)

pooled_activations = sigmoid(pooled_out + b.dimshuffle('x', 0, 'x', 'x'))
f2 = theano.function([input], pooled_activations)


#===========================================================
# processing image file
#===========================================================
# open random image of dimensions 639x516
img = Image.open(open('./3wolfmoon.jpg'))
# dimensions are (height, width, channel)
img = numpy.asarray(img, dtype=theano.config.floatX) / 256.

# put image in 4D tensor of shape (1, 3, height, width)
input_img_ = img.transpose(2, 0, 1).reshape(1, 3, 639, 516)
filtered_img = f(input_img_)
pooled_img = f2(input_img_)
print filtered_img.shape # (1, 2, 631, 508) 2 feature maps
print pooled_img.shape   # (1, 2, 315, 254) 2 feature maps

fig = plt.figure(figsize=(16,8))
# (1)
# plot original image and first and second components of output
plt.subplot(2, 3, 1); plt.axis('off'); plt.imshow(img)
plt.gray();
# recall that the convOp output (filtered image) is actually a "minibatch",
# of size 1 here, so we take index 0 in the first dimension:
plt.subplot(2, 3, 2); plt.axis('off'); plt.imshow(filtered_img[0, 0, :, :])
plt.subplot(2, 3, 3); plt.axis('off'); plt.imshow(filtered_img[0, 1, :, :])


# (2)
# plot original image and first and second components of output
plt.subplot(2, 3, 4); plt.axis('off'); plt.imshow(img)
plt.gray();
# recall that the convOp output (filtered image) is actually a "minibatch",
# of size 1 here, so we take index 0 in the first dimension:
plt.subplot(2, 3, 5); plt.axis('off'); plt.imshow(pooled_img[0, 0, :, :])
plt.subplot(2, 3, 6); plt.axis('off'); plt.imshow(pooled_img[0, 1, :, :])
plt.show()

# Notice that a randomly initialized filter acts very much like an edge detector!

WARNING (theano.sandbox.cuda): The cuda backend is deprecated and will be removed in the next release (v0.10).  Please switch to the gpuarray backend. You can get more information about how to switch at this URL:
 https://github.com/Theano/Theano/wiki/Converting-to-the-new-gpu-back-end%28gpuarray%29

Using gpu device 0: GeForce GTX 1060 (CNMeM is enabled with initial size: 80.0% of memory, cuDNN 5105)


(1, 2, 631, 508)
(1, 2, 315, 254)

Reference

History

20180807: created.

theano conv pool example

Posted on 2018-08-07 Edited on 2024-10-14 In deep learning

Theano conv pool example

import theano
from theano import tensor as T
from theano.tensor.nnet import conv2d,sigmoid
from theano.tensor.signal.pool import pool_2d

import numpy
import numpy as np
import matplotlib.pyplot as plt  
from PIL import Image

rng = numpy.random.RandomState(23455)

# instantiate 4D tensor for input
input = T.tensor4(name='input')

# initialize shared variable for weights.
w_shp = (2, 3, 9, 9)
w_bound = numpy.sqrt(3 * 9 * 9)
W = theano.shared( numpy.asarray(
            rng.uniform(
                low=-1.0 / w_bound,
                high=1.0 / w_bound,
                size=w_shp),
            dtype=input.dtype), name ='W')

# initialize shared variable for bias (1D tensor) with random values
# IMPORTANT: biases are usually initialized to zero. However in this
# particular application, we simply apply the convolutional layer to
# an image without learning the parameters. We therefore initialize
# them to random values to "simulate" learning.
b_shp = (2,)
b = theano.shared(numpy.asarray(
            rng.uniform(low=-.5, high=.5, size=b_shp),
            dtype=input.dtype), name ='b')

# build symbolic expression that computes the convolution of input with filters in w
conv_out = conv2d(input, W)

poolsize=(2,2)
pooled_out = pool_2d( input=conv_out, ws=poolsize, ignore_border=True) 

conv_activations = sigmoid(conv_out + b.dimshuffle('x', 0, 'x', 'x'))
# create theano function to compute filtered images
f = theano.function([input], conv_activations)

pooled_activations = sigmoid(pooled_out + b.dimshuffle('x', 0, 'x', 'x'))
f2 = theano.function([input], pooled_activations)


#===========================================================
# processing image file
#===========================================================
# open random image of dimensions 639x516
img = Image.open(open('./3wolfmoon.jpg'))
# dimensions are (height, width, channel)
img = numpy.asarray(img, dtype=theano.config.floatX) / 256.

# put image in 4D tensor of shape (1, 3, height, width)
input_img_ = img.transpose(2, 0, 1).reshape(1, 3, 639, 516)
filtered_img = f(input_img_)
pooled_img = f2(input_img_)
print filtered_img.shape # (1, 2, 631, 508) 2 feature maps
print pooled_img.shape   # (1, 2, 315, 254) 2 feature maps

fig = plt.figure(figsize=(16,8))
# (1)
# plot original image and first and second components of output
plt.subplot(2, 3, 1); plt.axis('off'); plt.imshow(img)
plt.gray();
# recall that the convOp output (filtered image) is actually a "minibatch",
# of size 1 here, so we take index 0 in the first dimension:
plt.subplot(2, 3, 2); plt.axis('off'); plt.imshow(filtered_img[0, 0, :, :])
plt.subplot(2, 3, 3); plt.axis('off'); plt.imshow(filtered_img[0, 1, :, :])


# (2)
# plot original image and first and second components of output
plt.subplot(2, 3, 4); plt.axis('off'); plt.imshow(img)
plt.gray();
# recall that the convOp output (filtered image) is actually a "minibatch",
# of size 1 here, so we take index 0 in the first dimension:
plt.subplot(2, 3, 5); plt.axis('off'); plt.imshow(pooled_img[0, 0, :, :])
plt.subplot(2, 3, 6); plt.axis('off'); plt.imshow(pooled_img[0, 1, :, :])
plt.show()

# Notice that a randomly initialized filter acts very much like an edge detector!

(1, 2, 631, 508)
(1, 2, 315, 254)

Pool example

from theano.tensor.signal import pool

input = T.dtensor4('input')
maxpool_shape = (2, 2)
pool_out = pool.pool_2d(input, maxpool_shape, ignore_border=True)
f = theano.function([input],pool_out)

#invals = numpy.random.RandomState(1).rand(3, 2, 5, 5)
invals = np.arange(50).reshape(1, 2, 5, 5)
print 'With ignore_border set to True:'
print 'invals[0, 0, :, :] =\n', invals[0, 0, :, :]
print 'output[0, 0, :, :] =\n', f(invals)[0, 0, :, :]

pool_out = pool.pool_2d(input, maxpool_shape, ignore_border=False)
f = theano.function([input],pool_out)
print 
print 'With ignore_border set to False:'
print 'invals[1, 0, :, :] =\n ', invals[0, 0, :, :]
print 'output[1, 0, :, :] =\n ', f(invals)[0, 0, :, :]

With ignore_border set to True:
invals[0, 0, :, :] =
[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]
 [15 16 17 18 19]
 [20 21 22 23 24]]
output[0, 0, :, :] =
[[  6.   8.]
 [ 16.  18.]]

With ignore_border set to False:
invals[1, 0, :, :] =
  [[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]
 [15 16 17 18 19]
 [20 21 22 23 24]]
output[1, 0, :, :] =
  [[  6.   8.   9.]
 [ 16.  18.  19.]
 [ 21.  23.  24.]]

Reference

theano conv

History

20180807: created.

network3.py

Posted on 2018-08-07 Edited on 2024-10-14 In deep learning

network3.py

#!/usr/bin/python
# -*- coding: utf-8 -*-
"""network3.py
~~~~~~~~~~~~~~

A Theano-based program for training and running simple neural
networks.

Supports several layer types (fully connected, convolutional, max
pooling, softmax), and activation functions (sigmoid, tanh, and
rectified linear units, with more easily added).

When run on a CPU, this program is much faster than network.py and
network2.py.  However, unlike network.py and network2.py it can also
be run on a GPU, which makes it faster still.

Because the code is based on Theano, the code is different in many
ways from network.py and network2.py.  However, where possible I have
tried to maintain consistency with the earlier programs.  In
particular, the API is similar to network2.py.  Note that I have
focused on making the code simple, easily readable, and easily
modifiable.  It is not optimized, and omits many desirable features.

This program incorporates ideas from the Theano documentation on
convolutional neural nets (notably,
http://deeplearning.net/tutorial/lenet.html ), from Misha Denil's
implementation of dropout (https://github.com/mdenil/dropout ), and
from Chris Olah (http://colah.github.io ).

Written for Theano 0.6 and 0.7, needs some changes for more recent
versions of Theano.

对于N=50000数据全部参与训练，time(python) = 7分钟; time(theano) = 1分钟。


But the big win is the ability to do fast symbolic differentiation, 
using a very general form of the backpropagation algorithm. 
This is extremely useful for applying stochastic gradient
descent to a wide variety of network architectures.
"""

#### Libraries
# Standard library
import cPickle
import gzip
import time
import copy

# Third-party libraries
import numpy as np
import theano
import theano.tensor as T
from theano.tensor.nnet import conv
from theano.tensor.nnet import softmax
from theano.tensor import shared_randomstreams
#from theano.tensor.signal.downsample import max_pool_2d  # for version theano-0.7
from theano.tensor.signal.pool import pool_2d # for version theano-0.9

# Activation functions for neurons
def linear(z): return z
def ReLU(z): return T.maximum(0.0, z)
from theano.tensor.nnet import sigmoid
from theano.tensor import tanh

#### Load the MNIST data
def load_data_shared(filename="../data/mnist.pkl.gz",training_set_size=1000):
    print 'loading data from {0} of #{1}'.format(filename,training_set_size)
    f = gzip.open(filename, 'rb')
    training_data, validation_data, test_data = cPickle.load(f) # float32(N,784); int64(N,)
    f.close()
    def shared(data):
        """Place the data into shared variables.  This allows Theano to copy
        the data to the GPU, if one is available.

        shared_x.get_value().shape   float32(50000, 784)
        shared_y.get_value().shape   float32(50000,)   
        
        y_cast = T.cast(shared_y, "int8") # float32--->int8
        
        shared_x.type  TensorType(float32, matrix)  theano.tensor.sharedvar.TensorSharedVariable
        shared_y.type  TensorType(float32, vector)  theano.tensor.sharedvar.TensorSharedVariable
        y_cast.type    TensorType(int32, vector)    theano.tensor.var.TensorVariable  (y_cast不是shared变量)
        """
        
        # 默认floatX = float64,在运行的时候需要设置floatX = float32
        # 取x[N,784],y[N]的前training_set_size个样本参与训练
        shared_x = theano.shared(np.asarray(data[0][:training_set_size,],dtype=theano.config.floatX), borrow=True)  
        shared_y = theano.shared(np.asarray(data[1][:training_set_size], dtype=theano.config.floatX), borrow=True)
        
        # shared变量中的数据在GPU上必须是float32类型，但是计算阶段可能需要int类型(y)，所以需要将float32--->int.
        # 并且int8类型需要和　self.y = T.bvector("y")的b类型一样。
        # When storing data on the GPU it has to be stored as floats
        # therefore we will store the labels as ``floatX`` as well
        # (``shared_y`` does exactly that). But during our computations
        # we need them as ints (we use labels as index, and if they are
        # floats it doesn't make sense) therefore instead of returning
        # ``shared_y`` we will have to cast it to int. This little hack
        # lets us get around this issue
        return shared_x, T.cast(shared_y, 'int8')
    return [shared(training_data), shared(validation_data), shared(test_data)]

def load_data_expanded(filename="../data/mnist_expanded.pkl.gz",training_set_size=1000):
    return load_data_shared(filename=filename,training_set_size=training_set_size)

#### Main class used to construct and train networks
class Network(object):

    def __init__(self, layers, mini_batch_size):
        """Takes a list of `layers`, describing the network architecture, and
        a value for the `mini_batch_size` to be used during training
        by stochastic gradient descent.

        """
        self.layers = layers
        assert len(self.layers)>=2
        self.mini_batch_size = mini_batch_size
        self.params = [param for layer in self.layers for param in layer.params]
        self.x = T.matrix("x")  # batch x  float32,(m,784) 不需要指定fmatrix
        self.y = T.bvector("y")  # batch y   int8,(m,)
        
        # first layer init with inpt=x,inpt_dropout=x
        init_layer = self.layers[0]
        init_layer.set_inpt(self.x, self.x, self.mini_batch_size)
        for j in xrange(1, len(self.layers)):
            prev_layer, layer  = self.layers[j-1], self.layers[j]
            layer.set_inpt(prev_layer.output, prev_layer.output_dropout, self.mini_batch_size)
            
        self.output = self.layers[-1].output
        self.output_dropout = self.layers[-1].output_dropout

    def SGD(self, training_data, epochs, mini_batch_size, eta,
            validation_data, test_data, lmbda=0.0,
            no_improvement_in_n=20,use_constant_eta=True, # default not vary eta because accuracy not imporved too much
            eta_shrink_times=10,eta_descrease_factor = 0.0001):
        
        """Train the network using mini-batch stochastic gradient descent."""
        training_x, training_y = training_data       # (N,784) (N,)
        validation_x, validation_y = validation_data
        test_x, test_y = test_data

        # compute number of minibatches for training, validation and testing
        num_training_batches = size(training_data)/mini_batch_size
        num_validation_batches = size(validation_data)/mini_batch_size
        num_test_batches = size(test_data)/mini_batch_size

        # define the (regularized) cost function, symbolic gradients, and updates
        l2_norm_squared = sum([(layer.w**2).sum() for layer in self.layers])
        cost0 = self.layers[-1].cost(self) # 计算最后一层的输出代价，传递Network作为net参数
        
        cost = cost0 + 0.5*lmbda*l2_norm_squared/size(training_data)  # ??? N instead of num_training_batches
        grads = T.grad(cost, self.params)
        
        shared_eta = theano.shared(eta,borrow=True) #(same as shared_b) use SharedVariable instead of value
        
        updates = [(param, param-T.cast(shared_eta*grad,dtype=theano.config.floatX)) for param, grad in zip(self.params, grads)] 
        
        """
        grad(float32),没有指定floatX=float32,则eta*grad(float64),指定之后eta*grad(float32)，无需cast
        
        #for param, grad in zip(self.params, grads):
        #    print param.type,grad.type,(eta*grad).type
        
        # updates = [(param, T.cast(param-eta*grad,'float32') ) for param, grad in zip(self.params, grads)]
        """
        
        # define functions to train a mini-batch, and to compute the
        # accuracy in validation and test mini-batches.
        i = T.lscalar() # mini-batch index
        train_mb = theano.function(
            [i], cost, updates=updates, # 给定i,===>x,y===>cost中的x,y被替换掉，从而计算mini-batch的代价，最后updates
            givens={
                self.x:
                training_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size],
                self.y:
                training_y[i*self.mini_batch_size: (i+1)*self.mini_batch_size]
            })
        
        # cost and accuracy for train,val,test
        # (1) train
        train_mb_cost = theano.function(
            [i], cost,
            givens={
                self.x:
                training_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size],
                self.y:
                training_y[i*self.mini_batch_size: (i+1)*self.mini_batch_size]
            }) 
        train_mb_accuracy = theano.function(
            [i], self.layers[-1].accuracy(self.y), # y(m,)
            givens={
                self.x:
                training_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size],
                self.y:
                training_y[i*self.mini_batch_size: (i+1)*self.mini_batch_size]
            }) 
        # (2) val
        validate_mb_cost = theano.function(
            [i], cost,
            givens={
                self.x:
                validation_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size],
                self.y:
                validation_y[i*self.mini_batch_size: (i+1)*self.mini_batch_size]
            })
        validate_mb_accuracy = theano.function(
            [i], self.layers[-1].accuracy(self.y), # y(m,)
            givens={
                self.x:
                validation_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size],
                self.y:
                validation_y[i*self.mini_batch_size: (i+1)*self.mini_batch_size]
            })
        # (3) test
        test_mb_cost = theano.function(
            [i], cost,
            givens={
                self.x:
                test_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size],
                self.y:
                test_y[i*self.mini_batch_size: (i+1)*self.mini_batch_size]
            })
        test_mb_accuracy = theano.function(
            [i], self.layers[-1].accuracy(self.y), # y(m,)
            givens={
                self.x:
                test_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size],
                self.y:
                test_y[i*self.mini_batch_size: (i+1)*self.mini_batch_size]
            })
        self.test_mb_predictions = theano.function(
            [i], self.layers[-1].y_out,   # y(m,)　m个样本的预测结果
            givens={
                self.x:
                test_x[i*self.mini_batch_size: (i+1)*self.mini_batch_size]
            })
        
        """
         def shuffle_data(x,y):
            seed = int(time.time()) 
            np.random.seed(seed)
            np.random.shuffle(x)

            np.random.seed(seed)
            np.random.shuffle(y)
        
        def shuffle_training_data(training_x,training_y):
            # CPU, OK; GPU, FAILED (在GPU中borrow失效)
            originX = training_x.get_value(borrow=True) # shared---> nparray
            originY = training_y.get_value(borrow=True) # shared---> nparray
            shuffle_data(originX,originY)
        """
        
        evaluation_costs, evaluation_accuracys = [], []
        training_costs, training_accuracys = [], []
        
        # use no-improvement-in-n early stopping
        # 记录best_validation_accuracy,best_epoch，如果epoch-best_epoch>=no_improvement_in_n,stop
        best_epoch = 0
        cur_eta_shrink_times = 0 # if cur_eta_shrink_times>=eta_shrink_times,stop
        best_validation_accuracy = 0.0 # with gpu, numpy.float64
        
        for epoch in xrange(epochs):
            #random.shuffle(training_data) # for list[(x1,y1),(x2,y2),...] 此处training_data是(X,Y)
            # shuffle_training_data(training_x,training_y) # FAILED on GPU
            
            for minibatch_index in xrange(num_training_batches):
                # iteration记录训练次数，每训练1000次输出一次
                iteration = num_training_batches*epoch+minibatch_index
                if iteration % 1000 == 0:
                    print("Training mini-batch number {0}".format(iteration))
                cost_ij = train_mb(minibatch_index)
           
            # 一个epoch训练结束，训练了num_training_batches次，iterration=4999。利用w,b计算一次验证accuracy
            #if (iteration+1) % num_training_batches == 0:
            validation_cost = np.mean( [validate_mb_cost(j) for j in xrange(num_validation_batches)] )
            validation_accuracy = np.mean( [validate_mb_accuracy(j) for j in xrange(num_validation_batches)] )
            print("\nEpoch {0}: validation accuracy {1:.2%}".format(epoch, validation_accuracy))

            train_cost = np.mean( [train_mb_cost(j) for j in xrange(num_training_batches)] )
            train_accuracy = np.mean( [train_mb_accuracy(j) for j in xrange(num_training_batches)] )
            
            # save 4 return lists
            evaluation_costs.append(validation_cost)
            evaluation_accuracys.append(validation_accuracy)
            training_costs.append(train_cost)
            training_accuracys.append(train_accuracy)
            
            #记录best_validation_accuracy
            # 关键在于<,满足足够多的NIIN,才能满足eta_shrink_times>=10
            if best_validation_accuracy - validation_accuracy < 0.0:  # <=
                print("This is the best validation accuracy to date.")
                best_validation_accuracy = validation_accuracy
                best_epoch = epoch
                best_iteration = iteration
                
                # save best network
                best_net = copy.deepcopy(self)
                
                #计算在val取得最佳accuracy情况下，test数据集的accuracy
                if test_data:
                    test_accuracy = np.mean( [test_mb_accuracy(j) for j in xrange(num_test_batches)] )
                    print('The corresponding test accuracy is {0:.2%}'.format(test_accuracy))
                    
            #============================================================================================
            # early stopping with variable learning rate
            # (1) (epoch - best_epoch) >= no_improvement_in_n: stop   NIIN = 20
            # (2) new_eta = 1/2*eta until new_eta<=1/1024*eta         ETA_SHRINK_TIME = 10 
            #============================================================================================
            
            # check in last epoch of NIIN stage
            if (epoch+1) % no_improvement_in_n == 0:
                # (1) check NIIN
                if (epoch - best_epoch) >= no_improvement_in_n:
                    # stop learning
                    print '!'*100
                    print '[HIT] Early stopping at epoch #{0},best_epoch #{1},iteration #{2},validation accuracy {3:.2%}'.format(epoch,best_epoch,best_iteration,best_validation_accuracy)
                    print '!'*100
                    break;
                    
                    #******************************************************************************
                    if use_constant_eta:
                        break # goto (2) instead of break
                    else:
                        # (2) shrink eta to 1/2*eta　　(accuracy not improved too much)
                        print 'cur_eta_shrink_times = {0}'.format(cur_eta_shrink_times)
                        if cur_eta_shrink_times >= eta_shrink_times:
                            print '+'*100
                            print '[HIT] Eta shrink OK. at epoch #{0},best_epoch #{1},iteration #{2},validation accuracy {3:.2%}'.format(epoch,best_epoch,best_iteration,best_validation_accuracy)
                            print '+'*100
                            break;

                        cur_eta_shrink_times +=1 

                        # update eta every epoch
                        eta_descrease_factor = 0.0001
                        new_eta = eta/(1.0+eta_descrease_factor*(epoch+1))
                        shared_eta.set_value(np.asarray(new_eta,dtype=theano.config.floatX),borrow=True) # update eta

                        #eta = eta/2.0 
                        #shared_eta.set_value(np.asarray(eta,dtype=theano.config.floatX),borrow=True) # update eta
                    #******************************************************************************
            #============================================================================================
            
            
        # once early stopping, we save the best model to file
        with open('best_model.pkl', 'wb') as fp:
            print 'Saving best mode to best_model.pkl...'
            cPickle.dump(best_net, fp)
                    
        print("\nFinished training network.")
        print("Best validation accuracy of {0:.2%} obtained at best_epoch {1}".format(best_validation_accuracy, best_epoch))
        print("Corresponding test accuracy of {0:.2%}".format(test_accuracy))

        return evaluation_costs, evaluation_accuracys, training_costs, training_accuracys,best_epoch # for plot


#********************************************************
# load model and predict on test data
#********************************************************
def load_network_and_predict():
    """
    An example of how to load a trained model and use it
    to predict labels.
    """
    # load the saved model
    net = cPickle.load(open('best_model.pkl'))
    
    # predict
    training_set_size = 50000
    train_data,val_data,test_data = load_data_shared(training_set_size=training_set_size)
    test_x,test_y = test_data
    
    mini_batch_size = 10
    num_test_batches = size(test_data)/mini_batch_size 
    
    i = T.lscalar()
    # test predict
    test_mb_predictions = theano.function(
        [i], net.layers[-1].y_out,   # y(m,)　m个样本的预测结果
        givens={
            net.x:
            test_x[i*mini_batch_size: (i+1)*mini_batch_size]
        })
    # test accuracy
    test_mb_accuracy = theano.function(
        [i], net.layers[-1].accuracy(net.y), # y(m,)
        givens={
            net.x:
            test_x[i*mini_batch_size: (i+1)*mini_batch_size],
            net.y:
            test_y[i*mini_batch_size: (i+1)*mini_batch_size]
        })
    
    test_predictions = test_mb_predictions(0)
    print 'real values of first 10: ',test_y[:10].eval()
    print 'predictions of first 10: ',test_predictions
    
    test_accuracy = np.mean( [test_mb_accuracy(j) for j in xrange(num_test_batches)] )
    print 'test_accuracy ',test_accuracy
    
#********************************************************
# end of predict
#********************************************************


#### Define layer types

class ConvPoolLayer(object):
    """Used to create a combination of a convolutional and a max-pooling
    layer.  A more sophisticated implementation would separate the
    two, but for our purposes we'll always use them together, and it
    simplifies the code, so it makes sense to combine them.

    """

    def __init__(self, filter_shape, image_shape, poolsize=(2, 2),
                 activation_fn=sigmoid):
        """`filter_shape` is a tuple of length 4, whose entries are the number
        of filters, the number of input feature maps, the filter height, and the
        filter width.
        
        `image_shape` is a tuple of length 4, whose entries are the
        mini-batch size, the number of input feature maps, the image
        height, and the image width.

        `poolsize` is a tuple of length 2, whose entries are the y and
        x pooling sizes.

        np.prod((2,2)) = 4 # int64
        
        ConvPoolLayer1
        image_shape=(m,1,28,28)  1*28*28   (1 input feature map)
        filter_shape=(20,1,5,5)  20*24*24 
        poolsize=(2,2)           20*12*12 
        
        ConvPoolLayer2
        image_shape=(m,20,12,12) 20*12*12  (20 input feature map)
        filter_shape=(40,20,5,5) 40*8*8
        poolsize=(2,2)           40*4*4
        
        ConvPoolLayer1
        (20,1,5,5) 
        20指定当前ConvLayer1的features的数量: c1_f1,c1_f2,....c1_f19,c1_f20。
        (1,5,5)指定feature的一个pixel所对应的local receptive field(LRF),此处对应1个input feature的5*5区域。
        对应的w: w1,w2,...w19,w20 of size(1,5,5)===>w(20,1,5,5) filter_shape
        对应的b: b1,b2,...b19,b20 of size()     ===>b(20,)

        ConvPoolLayer2
        (40,20,5,5) 
        40指定当前ConvLayer2的features的数量: c2_f1,c2_f2,....c2_f39,c2_f40。
        (20,5,5)指定feature的一个pixel所对应的local receptive field(LRF),此处对应20个input feature的5*5区域。
        对应的w: w1,w2,...w39,w40 of size(20,5,5)===>w(40,20,5,5) filter_shape
        对应的b: b1,b2,...b39,b40 of size()     ===>b(40,)
        """
        assert image_shape[1] == filter_shape[1] # input feature maps
        self.filter_shape = filter_shape
        self.image_shape = image_shape
        self.poolsize = poolsize
        self.activation_fn=activation_fn
        
        # initialize weights and biases
        # 20*(5*5)/(2*2) = 500/4 = 125
        # 40*(5*5)/(2*2) = 1000/4 = 250
        #n_out = (filter_shape[0]*np.prod(filter_shape[2:])/np.prod(poolsize)) # 125  250 (why???)
        
        # for tanh: w_bound = numpy.sqrt(6./(n_in+n_out))
        # for sigmoid: w_bound = 4*w_bound(tanh)
        # for ReLU: w = 0
        
        # there are "num input feature maps * filter height * filter width" inputs to each hidden unit
        n_in = np.prod(filter_shape[1:]) # LRF
        # each unit in the lower layer receives a gradient from:
        # "num output feature maps * filter height * filter width" / pooling size
        n_out = (filter_shape[0] * np.prod(filter_shape[2:]) // np.prod(poolsize))
        
        w_bound = np.sqrt(6./(n_in+n_out))
        if activation_fn == sigmoid:
            w_bound = 4*w_bound
            
        self.w = theano.shared(
            np.asarray(
                #np.random.normal(loc=0, scale=np.sqrt(1.0/n_out), 
                np.random.uniform(low=-w_bound,high=w_bound, 
                                 size=filter_shape), 
                # w(20,1,5,5) w(40,20,5,5)
                dtype=theano.config.floatX),
            borrow=True) 
        self.b = theano.shared(
            np.asarray(
                np.random.normal(loc=0, scale=1.0, size=(filter_shape[0],)), 
                # b(20,) b(40,)
                dtype=theano.config.floatX),
            borrow=True)
        self.params = [self.w, self.b]

    def set_inpt(self, inpt, inpt_dropout, mini_batch_size):
        """
        inpt = x:  fmatrix(m,784)
        ConvPoolLayer1
        image_shape=(m,1,28,28)  m,1*28*28   (1 input feature map)
        filter_shape=(20,1,5,5)  m,20*24*24   w(20,1,5,5) b(20,)
        poolsize=(2,2)           m,20*12*12 
        
        ConvPoolLayer2
        image_shape=(m,20,12,12) m,20*12*12  (20 input feature map)
        filter_shape=(40,20,5,5) m,40*8*8     w(40,20,5,5) b(40,)
        poolsize=(2,2)           m,40*4*4
        
        
        ConvPoolLayer1
        inpt(m,784)--->inpt(m,1,28,28)
        conv_out(m,20,24,24)
        pooled_out(m,20,12,12)
        output(m,20,12,12)
        
        ConvPoolLayer2
        inpt(m,20,12,12)
        conv_out(m,40,8,8)
        pooled_out(m,40,4,4)
        output(m,40,4,4)
        """
        
        self.inpt = inpt.reshape(self.image_shape)
        conv_out = conv.conv2d( input=self.inpt, image_shape=self.image_shape, 
                               filters=self.w, filter_shape=self.filter_shape) 
        
        #conv_out = conv.conv2d(input=self.inpt,filters=self.w) 
        #theano.tensor.var.TensorVariable float32 TensorType(float32, 4D)
        
        pooled_out = pool_2d( input=conv_out, ws=self.poolsize, ignore_border=True) 
        #theano.tensor.var.TensorVariable float32 TensorType(float32, 4D)
        
        b_shuffle = self.b.dimshuffle('x', 0, 'x', 'x')  
        # TensorVariable TensorType(float32, (True, False, True, True))
        # ConvPoolLayer1: b(20,) 20个feature map分别增加b0,b1,...b19,b20
        # 对于pooled_out=(m,20,12,12)而言，('x', 0, 'x', 'x')的dim2=0，其他为x
        
        # ConvPoolLayer2: b(40,) 40个feature map分别增加b0,b1,...b39,b40
        # 对于pooled_out=(m,40,4,4)而言，('x', 0, 'x', 'x')的dim2=0，其他为x
        
        self.output = self.activation_fn( pooled_out + b_shuffle )
        #theano.tensor.var.TensorVariable float32 TensorType(float32, 4D)
        
        self.output_dropout = self.output # no dropout in the convolutional layers

class FullyConnectedLayer(object):

    def __init__(self, n_in, n_out, activation_fn=sigmoid, p_dropout=0.0):
        self.n_in = n_in
        self.n_out = n_out
        self.activation_fn = activation_fn
        self.p_dropout = p_dropout
        
        #rng = numpy.random.RandomState(1234) # for w initialization
        
        # for tanh: w_bound = numpy.sqrt(6./(n_in+n_out))
        # for sigmoid: w_bound = 4*w_bound(tanh)
        # for ReLU: w = 0
        
        w_bound = np.sqrt(6./(n_in+n_out))
        if activation_fn == sigmoid:
            w_bound = 4*w_bound
        
        # Initialize weights and biases
        self.w = theano.shared(
            np.asarray(
                #np.random.normal(loc=0.0, scale=np.sqrt(1.0/n_in),
                np.random.uniform(low=-w_bound,high=w_bound,          
                size=(n_in, n_out)),
                dtype=theano.config.floatX),
            name='w', borrow=True)
        self.b = theano.shared(
            np.asarray(np.random.normal(loc=0.0, scale=1.0, size=(n_out,)),
                       dtype=theano.config.floatX),
            name='b', borrow=True)
        self.params = [self.w, self.b]

    def set_inpt(self, inpt, inpt_dropout, mini_batch_size):
        """
        (1) inpt,output for validating and testing
        (2) inpt_dropout,output_dropout for training (output_dropout--->[cost]--->grad--->params)
        
        以 ConvPoolLayer1(m,20,12,12),ConvPoolLayer2(m,40,4,4),[640,30,10]网络结构为例说明：
        ************************************************************************************************
        X(m,784),Y(m,)
        
        ConvPoolLayer1:
        当前层的inpt是前一层的output，因为是第一层，所以初始化为inpt = X(m,784)
        inpt(m,784)--->inpt(m,1,28,28)
        conv_out(m,20,24,24)
        pooled_out(m,20,12,12)
        output(m,20,12,12)
        
        ConvPoolLayer2:
        inpt(m,20,12,12)
        conv_out(m,40,8,8)
        pooled_out(m,40,4,4)
        output(m,40,4,4)
        ************************************************************************************************
        
        对于FullyConnectedLayer而言，inpt是ConvPoolLayer2的output=(m,40,4,4) 
        ================================================================================================
        Layer1:
        inpt=(m,40,4,4)--->inpt(m,640)    a1(m,640)即：m个样本，每个样本640个neurons
        output = sigmoid(input*w+b) ===> a2 = sigmoid(a1*w+b)
        a2(m,30) = sigmoid(  a1(m,640)* w(640,30)+ b(30,) ) 
        
        Layer2:
        当前层的inpt是前一层的output，即是FullyConnectedLayer1的output，包含30个hidden neurons输出 a2(m,30)
        output = SOFTMAX(input*w+b) ===> a3 = SOFTMAX(a2*w+b)
        a3(m,10) = SOFTMAX(  a2(m,30)* w(30,10)+ b(10,) ) 
        
        output是m个样本对应的10个概率,y_out是m个样本对应的真实数值。
        ================================================================================================
        """
        
        self.inpt = inpt.reshape((mini_batch_size, self.n_in))
        #self.output = self.activation_fn((1-self.p_dropout)*T.dot(self.inpt, self.w) + self.b)  
        self.output = self.activation_fn(T.dot(self.inpt, self.w) + self.b)
       
        #self.y_out = T.argmax(self.output, axis=1) # 暂时不用，只是用最后一层的y_out作为输出结果
        
        self.inpt_dropout = dropout_layer( inpt_dropout.reshape((mini_batch_size, self.n_in)), self.p_dropout)
        self.output_dropout = self.activation_fn(T.dot(self.inpt_dropout, self.w) + self.b)

    #def accuracy(self, y):
    #    "Return the accuracy for the mini-batch."
    #    # 暂时不用，只是用最后一层
    #    return T.mean(T.eq(y, self.y_out))

class SoftmaxLayer(object):

    def __init__(self, n_in, n_out, p_dropout=0.0):
        self.n_in = n_in
        self.n_out = n_out
        self.activation_fn = softmax # default to softmax
        self.p_dropout = p_dropout
        
        # Initialize weights and biases
        # for sigmoid neurons,w--->(0, 1/sqrt(n_in)) b--->(0,1)
        # for softmax neurons,w = 0,b = 0, no need using suitably parameteried normal random variables
        self.w = theano.shared(
            np.zeros((n_in, n_out), dtype=theano.config.floatX),
            name='w', borrow=True)
        self.b = theano.shared(
            np.zeros((n_out,), dtype=theano.config.floatX),
            name='b', borrow=True)
        self.params = [self.w, self.b]

    def set_inpt(self, inpt, inpt_dropout, mini_batch_size):
        """
        (1) inpt,output for validating and testing
        (2) inpt_dropout,output_dropout for training (output_dropout--->[cost]--->grad--->params)
        
        
        在Python中，a = sigmoid(w*a+b), w=(30,784),a=(784,1)一次使用一个样本参与计算。
        在Theano中修改为,a = sigmoid(a*w+b) a=(m,784),w=(784,30)一次使用m个样本参与计算。
        
        以[784,30,10]网络结构为例说明：
        Layer1:
        当前层的inpt是前一层的output，因为是第一层，所以初始化为a1 = X(m,784) Matrix，每一个样本包含784个输入neurons
        output = sigmoid(input*w+b) ===> a2 = sigmoid(a1*w+b)
        a2(m,30) = sigmoid(  a1(m,784)* w(784,30)+ b(30,) ) 
        
        Layer2:
        当前层的inpt是前一层的output，即是FullyConnectedLayer的output，包含30个hidden neurons输出 a2(m,30)
        output = SOFTMAX(input*w+b) ===> a3 = SOFTMAX(a2*w+b)
        a3(m,10) = SOFTMAX(  a2(m,30)* w(30,10)+ b(10,) ) 
        
        output是m个样本对应的10个概率,y_out是m个样本对应的真实数值。
        """
        self.inpt = inpt.reshape((mini_batch_size, self.n_in))  # tesorvariable Matrix(m,n_in)
        #self.output = self.activation_fn((1-self.p_dropout)*T.dot(self.inpt, self.w) + self.b)
        self.output = self.activation_fn(T.dot(self.inpt, self.w) + self.b)
        
        """
        input-->    output   ---> y_out
        X1---> [y0,y1,...y9] --->  1
        X2---> [y0,y1,...y9] --->  0
        ...
        Xm---> [y0,y1,...y9] --->  2
        
        axis沿着row作为一个整体进行，y_out作为最终的输出=vector(m,)。
        """
        self.y_out = T.argmax(self.output, axis=1) # 对应的数值 [2,1,...7]
        
        self.inpt_dropout = dropout_layer( inpt_dropout.reshape((mini_batch_size, self.n_in)), self.p_dropout)
        self.output_dropout = self.activation_fn(T.dot(self.inpt_dropout, self.w) + self.b)

    def cost(self, net):
        "Return the log-likelihood cost."
        
        """
        使用output_dropout用于train
        
        (1) 一个样本对应的代价Cx
        C = -log(a[i])
        i = np.argmax(y)  # a(10,1) y(10,1)
        return -np.log(a[i,0])
        
        (2) m个样本的平均代价
        计算代价的时候，传递Network作为参数，方便获取net.y
       
                output(m,10)     net.y   cost
        X1---> [y0,y1,...y9] --->  1     -log a[1,1]
        X2---> [y0,y1,...y9] --->  0     -log a[2,0]
        Xm---> [y0,y1,...y9] --->  2     -log a[m,2]
        
        
        a = np.array([[0,   0.8, 0,   0,...],
                      [0.9, 0,   0,   0,...],
                      [0,   0,   0.7, 0...]])
        y = [1,0,2]
        a[[0,1,2],y]
        
        > array([ 0.8,  0.9,  0.7])
        """
        
        m = net.y.shape[0]
        rows = T.arange(m)
        return -T.mean(T.log( self.output_dropout[rows, net.y] ))

    def accuracy(self, y):
        "Return the accuracy for the mini-batch."
        
        """
        使用output,y_out用于test
        
        y(m,) 对应m个样本的真实结果
        y_out(m,)　对应m个样本的预测结果
        如果mini_batch_size = 5
        
        y = np.array([2,1,7,8,9])
        y_out = np.array([2,1,7,6,9])
        np.mean(np.equal(y,y_out))  # [1,1,1,0,1] 0.80
        """
        return T.mean(T.eq(y, self.y_out))


#### Miscellanea
def size(data):
    "Return the size of the dataset `data`."
    return data[0].get_value(borrow=True).shape[0]  # N = 50000

def dropout_layer(layer, p_dropout):
    """
    对于[784,30,10]
    Layer1:
    layer= float32 (m,784), p_dropout = 0.2,对每个节点以一定的概率进行drop
    
    参考：http://www.jianshu.com/p/ba9ca3b07922
    
    Inverted Dropout 
    我们稍微将 Dropout 方法改进一下，使得我们只需要在训练阶段缩放激活函数的输出值，而不用在测试阶段改变什么。
    这个改进的 Dropout 方法就被称之为 Inverted Dropout 。
    
    在各种深度学习框架的实现中，我们都是用 Inverted Dropout 来代替 Dropout，因为这种方式有助于模型的完整性，
    我们只需要修改一个参数（保留/丢弃概率），而整个模型都不用修改。
    """
    srng = shared_randomstreams.RandomStreams( np.random.RandomState(0).randint(999999) )
    retain_prob = 1. - p_dropout # retain probility  theano.config.floatX
    #mask = srng.binomial(n=1, p=retain_prob, size=layer.shape,dtype='int8') # int8
    
    #mask: <class 'theano.tensor.var.TensorVariable'> TensorType(float32, vector)
    mask = srng.binomial(n=1, p=retain_prob, size=layer.shape,dtype=theano.config.floatX)
    mask_layer = layer*mask
    return mask_layer/retain_prob #在train阶段除以retain_prob，以便test阶段每一个Layer的output形式保持不变。

Test Network3

import random
import numpy as np
random.seed(12345678)
np.random.seed(12345678)

#from ke_network3 import *
epochs = 3
training_set_size = 100
mini_batch_size = 10
train_data,val_data,test_data = load_data_shared(training_set_size=training_set_size)

# for conv pool layer
image_shape=(mini_batch_size,1,28,28)
filter_shape=(20,1,5,5)
poolsize=(2,2)
convpool_layer1 = ConvPoolLayer(image_shape=image_shape,filter_shape=filter_shape, poolsize=poolsize)
n_in = 20*12*12


image_shape=(mini_batch_size,20,12,12)
filter_shape=(40,20,5,5)
poolsize=(2,2)
n_in = 40*4*4
convpool_layer2 = ConvPoolLayer(image_shape=image_shape,filter_shape=filter_shape, poolsize=poolsize)


full_layer = FullyConnectedLayer(n_in=n_in,n_out=30)
softmax_layer = SoftmaxLayer(n_in=30,n_out=10)
#net = Network([convpool_layer1,full_layer,softmax_layer],10)
net = Network([convpool_layer1,convpool_layer2,full_layer,softmax_layer],10)
net.SGD(train_data,epochs,mini_batch_size,0.3,val_data,test_data,lmbda=0)

updates TensorType(float32, 4D) TensorType(float32, 4D)
updates TensorType(float32, vector) TensorType(float32, vector)
updates TensorType(float32, 4D) TensorType(float32, 4D)
updates TensorType(float32, vector) TensorType(float32, vector)
updates TensorType(float32, matrix) TensorType(float32, matrix)
updates TensorType(float32, vector) TensorType(float32, vector)
updates TensorType(float32, matrix) TensorType(float32, matrix)
updates TensorType(float32, vector) TensorType(float32, vector)
Training mini-batch number 0
Epoch 0: validation accuracy 10.00%

This is the best validation accuracy to date.
The corresponding test accuracy is 8.00%
Epoch 1: validation accuracy 10.00%

This is the best validation accuracy to date.
The corresponding test accuracy is 8.00%
Epoch 2: validation accuracy 10.00%

This is the best validation accuracy to date.
The corresponding test accuracy is 8.00%

Finished training network.
Best validation accuracy of 10.00% obtained at iteration 29
Corresponding test accuracy of 8.00%





([2.2949765, 2.2951121, 2.2958748],
 [0.10000000000000001, 0.10000000000000001, 0.10000000000000001],
 [2.2682509, 2.2655275, 2.2644706],
 [0.13, 0.13, 0.13])

Basic Test of Network3.py

(1) load data

from ke_network3 import *
filename="../data/mnist.pkl.gz"
filename="../data/mnist_expanded.pkl.gz"
f = gzip.open(filename, 'rb')
training_data, validation_data, test_data = cPickle.load(f)
f.close()

x = training_data[0] # (m,784)
y = training_data[1] # (m,)
print type(x),type(y)
print type(x[0]),type(y[0])
print x.shape,y.shape
print x[0].shape,y[0].shape
x2 = x[:10,]

set_size = 10
x = training_data[0] # float32  (50000, 784)
y = training_data[1] # int64  (50000,)

training_x = theano.shared( training_data[0][:set_size,],  borrow=True) #float32
training_y = theano.shared( np.asarray(training_data[0][:set_size,],dtype='int8'),  borrow=True) # int8
#training_x2 = theano.shared(np.asarray(training_data[0], dtype=theano.config.floatX), borrow=True) # float64
print training_x.type
print training_y.type
#print training_x2.type

# 乘法可能会改变TensorVariable的类型
new_x = training_x*0.1 # float32--->float64
print training_x.type,new_x.type

<type 'numpy.ndarray'> <type 'numpy.ndarray'>
<type 'numpy.ndarray'> <type 'numpy.int64'>
(50, 784) (50,)
(784,) ()
TensorType(float32, matrix)
TensorType(int8, matrix)
TensorType(float32, matrix) TensorType(float64, matrix)

(2) dimshuffle b to match pooled_out

pooled_out = np.arange(18).reshape(1,2,3,3)
print pooled_out
b = np.array([0.0,1.0],dtype='float32') # [0,1]

# shuffle b to match pooled_out
sb = theano.shared(np.asarray(b,dtype='float32')) 
y = sb.dimshuffle('x', 0, 'x', 'x')  # TensorVariable TensorType(float32, (True, False, True, True))
# 2个feature map分别增加b0,b1
print type(y),y.type,y.shape.eval()

b_value = y.eval()
print b_value
pooled_out + b_value

[[[[ 0  1  2]
   [ 3  4  5]
   [ 6  7  8]]

  [[ 9 10 11]
   [12 13 14]
   [15 16 17]]]]
<class 'theano.tensor.var.TensorVariable'> TensorType(float32, (True, False, True, True)) [1 2 1 1]
[[[[ 0.]]

  [[ 1.]]]]





array([[[[  0.,   1.,   2.],
         [  3.,   4.,   5.],
         [  6.,   7.,   8.]],

        [[ 10.,  11.,  12.],
         [ 13.,  14.,  15.],
         [ 16.,  17.,  18.]]]])

Reference

History

20180807: created.

network2.py

Posted on 2018-08-07 Edited on 2024-10-14 In deep learning

network2.py

#!/usr/bin/python
# -*- coding: utf-8 -*-
"""network2.py
~~~~~~~~~~~~~~

An improved version of network.py, implementing the stochastic
gradient descent learning algorithm for a feedforward neural network.
Improvements include the addition of the cross-entropy cost function,
regularization, and better initialization of network weights.  Note
that I have focused on making the code simple, easily readable, and
easily modifiable.  It is not optimized, and omits many desirable
features.

"""

#### Libraries
# Standard library
import json
import random
import sys

# Third-party libraries
import numpy as np


#### Define the quadratic and cross-entropy cost functions

class QuadraticCost(object):

    @staticmethod
    def fn(a, y):
        """Return the cost associated with an output ``a`` and desired output ``y``."""
        return 0.5*np.linalg.norm(a-y)**2

    @staticmethod
    def delta(z, a, y):
        """Return the error delta from the output layer."""
        return (a-y) * sigmoid_prime(z)


class CrossEntropyCost(object):

    @staticmethod
    def fn(a, y):
        """Return the cost associated with an output ``a`` and desired output
        ``y``.  Note that np.nan_to_num is used to ensure numerical
        stability.  In particular, if both ``a`` and ``y`` have a 1.0
        in the same slot, then the expression (1-y)*np.log(1-a)
        returns nan.  The np.nan_to_num ensures that that is converted
        to the correct value (0.0).

        """
        return np.sum(np.nan_to_num(-y*np.log(a)-(1-y)*np.log(1-a)))

    @staticmethod
    def delta(z, a, y):
        """Return the error delta from the output layer.  Note that the
        parameter ``z`` is not used by the method.  It is included in
        the method's parameters in order to make the interface
        consistent with the delta method for other cost classes.

        """
        return (a-y)

#********************************************************    
class LogLikelihoodCost(object):

    @staticmethod
    def fn(a, y):
        """
        C = -log(a[i])
        
        a(10,1) y(10,1)
        y = [0,0,1,0,0,0,0,0,0,0,0] 
        i = 2
        C = -log a[2,0]
        """
        i = np.argmax(y)  
        return -np.log(a[i,0])
    
    @staticmethod
    def delta(z, a, y):
        """
        delta_j = aj-yj
        """
        return (a-y)
#********************************************************  
    
#### Main Network class
class Network(object):

    def __init__(self, sizes, cost=CrossEntropyCost):
        """The list ``sizes`` contains the number of neurons in the respective
        layers of the network.  For example, if the list was [2, 3, 1]
        then it would be a three-layer network, with the first layer
        containing 2 neurons, the second layer 3 neurons, and the
        third layer 1 neuron.  The biases and weights for the network
        are initialized randomly, using
        ``self.default_weight_initializer`` (see docstring for that
        method).

        """
        self.num_layers = len(sizes)
        self.sizes = sizes
        self.cost=cost
        # init use_softmax with cost type
        if self.cost == LogLikelihoodCost:
            self.use_softmax = True
        else:
            self.use_softmax = False
        
        # init weight initializer and feedforward method
        if self.use_softmax:
            self.default_weight_initializer = self.default_weight_initializer_with_softmax
            self.feedforward = self.feedforward_with_softmax
        else:
            self.default_weight_initializer = self.default_weight_initializer_1
            self.feedforward = self.feedforward_1
        # init weights and biases
        self.default_weight_initializer()
        # at least an input and output layers
        assert self.num_layers>=2

    def default_weight_initializer_with_softmax(self):
        # sigmoid neurons: w(0,1/sqrt(n_in)) b(0,1)
        # softmax neurons: w = b = 0
        # len(sizes)>=2
        # (1) for sigmoid neuros
        self.biases = [np.random.randn(y, 1) for y in self.sizes[1:-1]]
        self.weights = [np.random.randn(y, x)/np.sqrt(x)
                        for x, y in zip(self.sizes[:-1], self.sizes[1:-1])]
        #(2) for last somtmax neurons
        x = self.sizes[-2]
        y = self.sizes[-1]
        last_b = np.zeros((y, 1))
        last_w = np.zeros((y, x))
        
        self.biases.append(last_b)
        self.weights.append(last_w)

        
    def default_weight_initializer_1(self):
        """Initialize each weight using a Gaussian distribution with mean 0
        and standard deviation 1 over the square root of the number of
        weights connecting to the same neuron.  Initialize the biases
        using a Gaussian distribution with mean 0 and standard
        deviation 1.

        Note that the first layer is assumed to be an input layer, and
        by convention we won't set any biases for those neurons, since
        biases are only ever used in computing the outputs from later
        layers.

        """
        self.biases = [np.random.randn(y, 1) for y in self.sizes[1:]]
        self.weights = [np.random.randn(y, x)/np.sqrt(x)
                        for x, y in zip(self.sizes[:-1], self.sizes[1:])]

    def large_weight_initializer(self):
        """Initialize the weights using a Gaussian distribution with mean 0
        and standard deviation 1.  Initialize the biases using a
        Gaussian distribution with mean 0 and standard deviation 1.

        Note that the first layer is assumed to be an input layer, and
        by convention we won't set any biases for those neurons, since
        biases are only ever used in computing the outputs from later
        layers.

        This weight and bias initializer uses the same approach as in
        Chapter 1, and is included for purposes of comparison.  It
        will usually be better to use the default weight initializer
        instead.

        """
        self.biases = [np.random.randn(y, 1) for y in self.sizes[1:]]
        self.weights = [np.random.randn(y, x)
                        for x, y in zip(self.sizes[:-1], self.sizes[1:])]

    def feedforward_with_softmax(self, a):
        """Return the output of the network if ``a`` is input."""
        for b, w in zip(self.biases[:-1], self.weights[:-1]):
            a = sigmoid(np.dot(w, a)+b)
        # last layer
        b,w = self.biases[-1],self.weights[-1]
        last_z = np.dot(w, a)+b
        last_a = softmax(last_z)
        return last_a
    
    def feedforward_1(self, a):
        """Return the output of the network if ``a`` is input."""
        for b, w in zip(self.biases, self.weights):
            a = sigmoid(np.dot(w, a)+b)
        return a

    def SGD(self, training_data, epochs, mini_batch_size, eta,
            lmbda = 0.0,
            evaluation_data=None,
            monitor_evaluation_cost=False,
            monitor_evaluation_accuracy=False,
            monitor_training_cost=False,
            monitor_training_accuracy=False):
        """Train the neural network using mini-batch stochastic gradient
        descent.  The ``training_data`` is a list of tuples ``(x, y)``
        representing the training inputs and the desired outputs.  The
        other non-optional parameters are self-explanatory, as is the
        regularization parameter ``lmbda``.  The method also accepts
        ``evaluation_data``, usually either the validation or test
        data.  We can monitor the cost and accuracy on either the
        evaluation data or the training data, by setting the
        appropriate flags.  The method returns a tuple containing four
        lists: the (per-epoch) costs on the evaluation data, the
        accuracies on the evaluation data, the costs on the training
        data, and the accuracies on the training data.  All values are
        evaluated at the end of each training epoch.  So, for example,
        if we train for 30 epochs, then the first element of the tuple
        will be a 30-element list containing the cost on the
        evaluation data at the end of each epoch. Note that the lists
        are empty if the corresponding flag is not set.

        """
        if evaluation_data: n_data = len(evaluation_data)
        n = len(training_data)
        num_batches = n/mini_batch_size
        evaluation_cost, evaluation_accuracy = [], []
        training_cost, training_accuracy = [], []
        for j in xrange(epochs):
            random.shuffle(training_data)
            for k in xrange(0,num_batches):
                mini_batch = training_data[k*mini_batch_size : (k+1)*mini_batch_size]
                self.update_mini_batch(mini_batch, eta, lmbda, len(training_data))
            print "Epoch %s training complete" % j
            if monitor_training_cost:
                cost = self.total_cost(training_data, lmbda)
                training_cost.append(cost)
                print "Cost on training data: {}".format(cost)
            if monitor_training_accuracy:
                accuracy = self.accuracy(training_data, convert=True)
                training_accuracy.append(accuracy)
                print "Accuracy on training data: {} / {}".format(
                    accuracy, n)
            if monitor_evaluation_cost:
                cost = self.total_cost(evaluation_data, lmbda, convert=True)
                evaluation_cost.append(cost)
                print "Cost on evaluation data: {}".format(cost)
            if monitor_evaluation_accuracy:
                accuracy = self.accuracy(evaluation_data)
                evaluation_accuracy.append(accuracy)
                print "Accuracy on evaluation data: {} / {}".format(
                    self.accuracy(evaluation_data), n_data)
            print
        return evaluation_cost, evaluation_accuracy, training_cost, training_accuracy

    def calculate_sum_derivatives_of_mini_batch(self,mini_batch):
        """
        计算m个样本的总梯度和。
        利用反向传播计算每一个样本(x,y)对应的梯度。
        """
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            # 给定一个样本X,利用反向传播算法计算对应w,b的梯度
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            # 对m个样本的梯度进行累计求和
            nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        return nabla_b,nabla_w
    
    def update_mini_batch(self, mini_batch, eta, lmbda, n):
        """Update the network's weights and biases by applying gradient
        descent using backpropagation to a single mini batch.  The
        ``mini_batch`` is a list of tuples ``(x, y)``, ``eta`` is the
        learning rate, ``lmbda`` is the regularization parameter, and
        ``n`` is the total size of the training data set.

        """
        m = len(mini_batch)
        nabla_b,nabla_w = self.calculate_sum_derivatives_of_mini_batch(mini_batch)
        
        #self.weights = [w-(eta/m)*nw for w, nw in zip(self.weights, nabla_w)]
        #self.biases =  [b-(eta/m)*nb for b, nb in zip(self.biases,  nabla_b)]
        
        # L2 regularization
        weight_decay = 1-eta*(lmbda/n)
        self.weights = [weight_decay*w-(eta/m)*nw
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b-(eta/m)*nb
                       for b, nb in zip(self.biases, nabla_b)]

    def backprop(self, x, y):
        """Return a tuple "(nabla_b, nabla_w)" representing the
        gradient for the cost function C_x.  "nabla_b" and
        "nabla_w" are layer-by-layer lists of numpy arrays, similar
        to "self.biases" and "self.weights"."""
        # 初始化nb,nw,结构和b,w一样
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        
        # feedforward
        # 执行算法的feedforward阶段
        #　(1)初始化x作为a_1
        activation = x
        activations = [x] # list to store all the activations, layer by layer
        zs = [] # list to store all the z vectors, layer by layer
        # (2)l=2,....L层，分别计算z_l,a_l并且保存下来。
        #*************************************************************
        if self.use_softmax:
            for b, w in zip(self.biases[:-1], self.weights[:-1]):
                z = np.dot(w, activation)+b
                zs.append(z)
                activation = sigmoid(z)
                activations.append(activation)
            #last layer
            b,w = self.biases[-1],self.weights[-1]
            last_z = np.dot(w, activation)+b
            last_a = softmax(last_z)
            zs.append(last_z)
            activations.append(last_a)
        #*************************************************************
        else:
            for b, w in zip(self.biases, self.weights):
                z = np.dot(w, activation)+b
                zs.append(z)
                activation = sigmoid(z)
                activations.append(activation)
        
        #========================================================================
        # 先计算所有的误差delta，最后计算所有层的梯度nb,nw，代码可读性更高一些
        #========================================================================
        # method2
        # backward pass
        # 执行算法的backward阶段
        # (3)初始化第L层的误差,delta_L　= cost(z,a,y)
        l = -1
        #***************************************************
        delta = self.cost.delta(zs[l],activations[l], y)
        #***************************************************
        deltas = [delta] # list to store all the errors,layer by layer
        # (4)初始化l=L-1,....2层的误差,delta_l = np.dot(w_l+1^T,delta_l+1)* sigmoid_prime(z_l)
        for i in xrange(2, self.num_layers):
            l = -i #(-２代表L-1,-3代表L-2,-(L-1)代表2)
            delta = np.dot(self.weights[l+1].transpose(), deltas[l+1]) * sigmoid_prime(zs[l])
            deltas.insert(0,delta) # 确保误差的顺序，从后往前计算，所以需要insert在数组的最前面
        
        #(5)l=L,L-1,....2层，计算所有的梯度向量nb,nw
        for i in xrange(1, self.num_layers):
            l = -i #(-1,-2,....-(L-1))
            nabla_b[l] = deltas[l]
            nabla_w[l] = np.dot(deltas[l], activations[l-1].transpose())
            
        return (nabla_b, nabla_w)

    def accuracy(self, data, convert=False):
        """Return the number of inputs in ``data`` for which the neural
        network outputs the correct result. The neural network's
        output is assumed to be the index of whichever neuron in the
        final layer has the highest activation.

        The flag ``convert`` should be set to False if the data set is
        validation or test data (the usual case), and to True if the
        data set is the training data. The need for this flag arises
        due to differences in the way the results ``y`` are
        represented in the different data sets.  In particular, it
        flags whether we need to convert between the different
        representations.  It may seem strange to use different
        representations for the different data sets.  Why not use the
        same representation for all three data sets?  It's done for
        efficiency reasons -- the program usually evaluates the cost
        on the training data and the accuracy on other data sets.
        These are different types of computations, and using different
        representations speeds things up.  More details on the
        representations can be found in
        mnist_loader.load_data_wrapper.

        """
        if convert: # train data
            results = [(np.argmax(self.feedforward(x)), np.argmax(y))
                       for (x, y) in data]
        else: # val/test data
            results = [(np.argmax(self.feedforward(x)), y)
                        for (x, y) in data]
        return sum(int(x == y) for (x, y) in results)

    def total_cost(self, data, lmbda, convert=False):
        """Return the total cost for the data set ``data``.  The flag
        ``convert`` should be set to False if the data set is the
        training data (the usual case), and to True if the data set is
        the validation or test data.  See comments on the similar (but
        reversed) convention for the ``accuracy`` method, above.
        """
        cost = 0.0
        n = len(data)
        for x, y in data:
            a = self.feedforward(x)
            if convert: # val/test data
                y = vectorized_result(y)
            cost += self.cost.fn(a, y)/n
        # L2 term = lmbda/(2n)*sum(w**2)
        l2_term = 0.5*(lmbda/n)*sum(np.linalg.norm(w)**2 for w in self.weights)
        cost += l2_term
        return cost

    def save(self, filename):
        """Save the neural network to the file ``filename``."""
        data = {"sizes": self.sizes,
                "weights": [w.tolist() for w in self.weights],
                "biases": [b.tolist() for b in self.biases],
                "cost": str(self.cost.__name__)}
        f = open(filename, "w")
        json.dump(data, f)
        f.close()

#### Loading a Network
def load(filename):
    """Load a neural network from the file ``filename``.  Returns an
    instance of Network.

    """
    f = open(filename, "r")
    data = json.load(f)
    f.close()
    name = sys.modules[__name__]
    cost = getattr(name, data["cost"])
    net = Network(data["sizes"], cost=cost)
    net.weights = [np.array(w) for w in data["weights"]]
    net.biases = [np.array(b) for b in data["biases"]]
    return net

#### Miscellaneous functions
def vectorized_result(j):
    """Return a 10-dimensional unit vector with a 1.0 in the j'th position
    and zeroes elsewhere.  This is used to convert a digit (0...9)
    into a corresponding desired output from the neural network.

    """
    e = np.zeros((10, 1))
    e[j] = 1.0
    return e

def sigmoid(z):
    """The sigmoid function."""
    return 1.0/(1.0+np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid function."""
    return sigmoid(z)*(1-sigmoid(z))

def softmax(z):
    #e^z/ sum(e^z)
    ez = np.exp(z)
    sum_ez = sum(ez)
    return ez/sum_ez

1 2	import mnist_loader training_data, validation_data, test_data = mnist_loader.load_data_wrapper()

1 2	net = load("./1.json") net.accuracy(validation_data)

1 2	net = load("./0.json") net.accuracy(validation_data)

Reference

History

20180807: created.

network1.py

Posted on 2018-08-07 Edited on 2024-10-14 In deep learning

network1.py

#!/usr/bin/python
# -*- coding: utf-8 -*-
"""
network.py
~~~~~~~~~~

A module to implement the stochastic gradient descent learning
algorithm for a feedforward neural network.  Gradients are calculated
using backpropagation.  Note that I have focused on making the code
simple, easily readable, and easily modifiable.  It is not optimized,
and omits many desirable features.
"""

#### Libraries
# Standard library
import random

# Third-party libraries
import numpy as np

class Network(object):

    def __init__(self, sizes):
        """The list ``sizes`` contains the number of neurons in the
        respective layers of the network.  For example, if the list
        was [2, 3, 1] then it would be a three-layer network, with the
        first layer containing 2 neurons, the second layer 3 neurons,
        and the third layer 1 neuron.  The biases and weights for the
        network are initialized randomly, using a Gaussian
        distribution with mean 0, and variance 1.  Note that the first
        layer is assumed to be an input layer, and by convention we
        won't set any biases for those neurons, since biases are only
        ever used in computing the outputs from later layers."""
        self.num_layers = len(sizes)
        self.sizes = sizes
        self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
        self.weights = [np.random.randn(y, x)
                        for x, y in zip(sizes[:-1], sizes[1:])]

    def feedforward(self, a):
        """Return the output of the network if ``a`` is input."""
        for b, w in zip(self.biases, self.weights):
            a = sigmoid(np.dot(w, a)+b)
        return a

    def SGD(self, training_data, epochs, mini_batch_size, eta,
            test_data=None):
        """Train the neural network using mini-batch stochastic
        gradient descent.  The ``training_data`` is a list of tuples
        ``(x, y)`` representing the training inputs and the desired
        outputs.  The other non-optional parameters are
        self-explanatory.  If ``test_data`` is provided then the
        network will be evaluated against the test data after each
        epoch, and partial progress printed out.  This is useful for
        tracking progress, but slows things down substantially."""
        if test_data: n_test = len(test_data)
        n = len(training_data)
        num_batches = n/mini_batch_size
        for j in xrange(epochs):
            random.shuffle(training_data)
            for k in xrange(0,num_batches):
                mini_batch = training_data[k*mini_batch_size : (k+1)*mini_batch_size]
                self.update_mini_batch(mini_batch, eta)
            if test_data:
                print "Epoch {0}: {1} / {2}".format(j, self.evaluate(test_data), n_test)
            else:
                print "Epoch {0} complete".format(j)

    def calculate_sum_derivatives_of_mini_batch(self,mini_batch):
        """
        计算m个样本的总梯度和。
        利用反向传播计算每一个样本(x,y)对应的梯度。
        """
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            # 给定一个样本X,利用反向传播算法计算对应w,b的梯度
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            # 对m个样本的梯度进行累计求和
            nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        return nabla_b,nabla_w
        
    def update_mini_batch(self, mini_batch, eta):
        """Update the network's weights and biases by applying
        gradient descent using backpropagation to a single mini batch.
        The "mini_batch" is a list of tuples "(x, y)", and "eta"
        is the learning rate."""
        m = len(mini_batch)
        nabla_b,nabla_w = self.calculate_sum_derivatives_of_mini_batch(mini_batch)
        
        self.weights = [w-(eta/m)*nw for w, nw in zip(self.weights, nabla_w)]
        self.biases =  [b-(eta/m)*nb for b, nb in zip(self.biases,  nabla_b)]
    
    def backprop(self, x, y):
        """Return a tuple "(nabla_b, nabla_w)" representing the
        gradient for the cost function C_x.  "nabla_b" and
        "nabla_w" are layer-by-layer lists of numpy arrays, similar
        to "self.biases" and "self.weights"."""
        # 初始化nb,nw,结构和b,w一样
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        
        # feedforward
        # 执行算法的feedforward阶段
        #　(1)初始化x作为a_1
        activation = x
        activations = [x] # list to store all the activations, layer by layer
        zs = [] # list to store all the z vectors, layer by layer
        # (2)l=2,....L层，分别计算z_l,a_l并且保存下来。
        for b, w in zip(self.biases, self.weights):
            z = np.dot(w, activation)+b
            zs.append(z)
            activation = sigmoid(z)
            activations.append(activation)
        
        #========================================================================
        # 先计算所有的误差delta，最后计算所有层的梯度nb,nw，代码可读性更高一些
        #========================================================================
        # method2
        # backward pass
        # 执行算法的backward阶段
        # (3)初始化第L层的误差,delta_L　= cost(a_L,y) * sigmoid_prime(z_L)
        l = -1
        delta = self.cost_derivative_of_a_L(activations[l], y) * sigmoid_prime(zs[l])
        deltas = [delta] # list to store all the errors,layer by layer
        # (4)初始化l=L-1,....2层的误差,delta_l = np.dot(w_l+1^T,delta_l+1)* sigmoid_prime(z_l)
        for i in xrange(2, self.num_layers):
            l = -i #(-２代表L-1,-3代表L-2,-(L-1)代表2)
            delta = np.dot(self.weights[l+1].transpose(), deltas[l+1]) * sigmoid_prime(zs[l])
            deltas.insert(0,delta) # 确保误差的顺序，从后往前计算，所以需要insert在数组的最前面
        
        #(5)l=L,L-1,....2层，计算所有的梯度向量nb,nw
        for i in xrange(1, self.num_layers):
            l = -i #(-1,-2,....-(L-1))
            nabla_b[l] = deltas[l]
            nabla_w[l] = np.dot(deltas[l], activations[l-1].transpose())
            
        return (nabla_b, nabla_w)

    def evaluate(self, test_data):
        """Return the number of test inputs for which the neural
        network outputs the correct result. Note that the neural
        network's output is assumed to be the index of whichever
        neuron in the final layer has the highest activation."""
        
        """
        l = [0,1,0,0,0,0,0,0,0,0]
        a = np.array(l).reshape(10,1)
        np.argmax(a) #输出向量对应的数字１

        test_results = [(1,1),(2,2),(3,3),(1,9)]
        [int(x == y) for (x, y) in test_results] 
        #[1, 1, 1, 0]
        sum([int(x == y) for (x, y) in test_results])
        #3
        """
        
        test_results = [(np.argmax(self.feedforward(x)), y)
                        for (x, y) in test_data]
        return sum(int(x == y) for (x, y) in test_results)

    def cost_derivative_of_a_L(self, output_activations, y):
        """Return the vector of partial derivatives \partial C_x /
        \partial a for the output activations."""
        return (output_activations-y)

#### Miscellaneous functions
def sigmoid(z):
    """The sigmoid function."""
    return 1.0/(1.0+np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid function."""
    return sigmoid(z)*(1-sigmoid(z))

mnist_loader

"""
mnist_loader
~~~~~~~~~~~~

A library to load the MNIST image data.  For details of the data
structures that are returned, see the doc strings for ``load_data``
and ``load_data_wrapper``.  In practice, ``load_data_wrapper`` is the
function usually called by our neural network code.
"""

#### Libraries
# Standard library
import cPickle
import gzip

# Third-party libraries
import numpy as np

def load_data():
    """Return the MNIST data as a tuple containing the training data,
    the validation data, and the test data.

    The ``training_data`` is returned as a tuple with two entries.
    The first entry contains the actual training images.  This is a
    numpy ndarray with 50,000 entries.  Each entry is, in turn, a
    numpy ndarray with 784 values, representing the 28 * 28 = 784
    pixels in a single MNIST image.

    The second entry in the ``training_data`` tuple is a numpy ndarray
    containing 50,000 entries.  Those entries are just the digit
    values (0...9) for the corresponding images contained in the first
    entry of the tuple.

    The ``validation_data`` and ``test_data`` are similar, except
    each contains only 10,000 images.

    This is a nice data format, but for use in neural networks it's
    helpful to modify the format of the ``training_data`` a little.
    That's done in the wrapper function ``load_data_wrapper()``, see
    below.
    """
    f = gzip.open('../data/mnist.pkl.gz', 'rb')
    training_data, validation_data, test_data = cPickle.load(f)
    f.close()
    return (training_data, validation_data, test_data)

def load_data_wrapper():
    """Return a tuple containing ``(training_data, validation_data,
    test_data)``. Based on ``load_data``, but the format is more
    convenient for use in our implementation of neural networks.

    In particular, ``training_data`` is a list containing 50,000
    2-tuples ``(x, y)``.  ``x`` is a 784-dimensional numpy.ndarray
    containing the input image.  ``y`` is a 10-dimensional
    numpy.ndarray representing the unit vector corresponding to the
    correct digit for ``x``.

    ``validation_data`` and ``test_data`` are lists containing 10,000
    2-tuples ``(x, y)``.  In each case, ``x`` is a 784-dimensional
    numpy.ndarry containing the input image, and ``y`` is the
    corresponding classification, i.e., the digit values (integers)
    corresponding to ``x``.

    Obviously, this means we're using slightly different formats for
    the training data and the validation / test data.  These formats
    turn out to be the most convenient for use in our neural network
    code."""
    tr_d, va_d, te_d = load_data()
    # train
    training_inputs = [np.reshape(x, (784, 1)) for x in tr_d[0]]
    # vector train_y, while val and test y are integers.
    training_results = [vectorized_result(y) for y in tr_d[1]] 
    training_data = zip(training_inputs, training_results)
    
    # val
    validation_inputs = [np.reshape(x, (784, 1)) for x in va_d[0]]
    validation_data = zip(validation_inputs, va_d[1])
    # test
    test_inputs = [np.reshape(x, (784, 1)) for x in te_d[0]]
    test_data = zip(test_inputs, te_d[1])
    
    return (training_data, validation_data, test_data)

def vectorized_result(j):
    """Return a 10-dimensional unit vector with a 1.0 in the jth
    position and zeroes elsewhere.  This is used to convert a digit
    (0...9) into a corresponding desired output from the neural
    network."""
    e = np.zeros((10, 1))
    e[j] = 1.0
    return e

test

#!/usr/bin/python
# -*- coding: utf-8 -*-

import mnist_loader
from ke_network import Network

def test():
    train,val,test = mnist_loader.load_data_wrapper()
    epoch = 30
    mini_batch_size  = 10
    eta = 3.0
    net = Network([784,30,10])
    net.SGD(train,epoch,mini_batch_size,eta,test_data=test)

def main():
    test()

if __name__=="__main__":
    main()

Epoch 0: 9135 / 10000
Epoch 1: 9238 / 10000
Epoch 2: 9337 / 10000
Epoch 3: 9345 / 10000
Epoch 4: 9393 / 10000
Epoch 5: 9389 / 10000
Epoch 6: 9419 / 10000
Epoch 7: 9410 / 10000

Reference

History

20180807: created.

python numpy random shuffle and theano shuffle

Posted on 2018-08-07 Edited on 2024-10-14 In python

Examples of shuffle

"""
numpy.random.shuffle(x)
Modify a sequence in-place by shuffling its contents.

random.shuffle(list)只能对list进行随机打乱。

Parameters:
x : array_like
    The array or list to be shuffled.
    
Returns: None

This function only shuffles the array along the first index of a multi-dimensional array
（多维矩阵中，只对第一维（行）做打乱顺序操作）
"""
import numpy as np
arr = np.arange(10)
arr

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

1 2	np.random.shuffle(arr) arr

array([1, 4, 7, 3, 0, 9, 5, 8, 2, 6])

1 2	arr = np.arange(9).reshape((3, 3)) arr

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

1
2
3

# 对于矩阵，按照row进行打乱。
np.random.shuffle(arr)
arr

array([[3, 4, 5],
       [0, 1, 2],
       [6, 7, 8]])

1
2
3

# 对于矩阵，按照row进行打乱。
np.random.shuffle(arr)
arr

array([[3, 4, 5],
       [6, 7, 8],
       [0, 1, 2]])

1 2	# random.shuffle(list) l = [i for i in xrange(0,10)]

1
2
3

import random
random.shuffle(l)
l

[2, 9, 5, 8, 6, 3, 7, 1, 4, 0]

1 2	random.shuffle(l) l

[3, 9, 2, 1, 5, 7, 8, 4, 0, 6]

Shuffe two sequences at the same time

shuffle的状态依赖于random.seed,np.random.seed，可以使用time作为seed

random.seed(time.time())

2次shuffle的seed不一样，结果不一致

import random
import numpy as np
train_data = (np.array([[1,1],
                       [2,2],
                       [3,3],
                       [4,4]]), np.array([11,22,33,44]))
x = train_data[0]
y = train_data[1]

# shuffle x,y的时候，只要保证seed相同，那么shuffle之后的x,y对应元素的顺序保持一样
#np.random.seed(1)
np.random.shuffle(x)
print x

#np.random.seed(1)
np.random.shuffle(y)
print y

print
print train_data

[[3 3]
 [1 1]
 [2 2]
 [4 4]]
[22 11 33 44]

(array([[3, 3],
       [1, 1],
       [2, 2],
       [4, 4]]), array([22, 11, 33, 44]))

2次shuffle的seed一样，结果保持一致

import random
import numpy as np
train_data = (np.array([[1,1],
                       [2,2],
                       [3,3],
                       [4,4]]), np.array([11,22,33,44]))
x = train_data[0]
y = train_data[1]

# shuffle x,y的时候，只要保证seed相同，那么shuffle之后的x,y对应元素的顺序保持一样
seed = 1

np.random.seed(seed)
np.random.shuffle(x)
print x

np.random.seed(seed)
np.random.shuffle(y)
print y

print
print train_data

[[4 4]
 [3 3]
 [1 1]
 [2 2]]
[44 33 11 22]

(array([[4, 4],
       [3, 3],
       [1, 1],
       [2, 2]]), array([44, 33, 11, 22]))

shuffle_data(x,y)

import time
import random
import numpy as np
train_data = (np.array([[1,1],
                       [2,2],
                       [3,3],
                       [4,4]]), np.array([11,22,33,44]))
x = train_data[0]
y = train_data[1]

def shuffle_data(x,y):
    #seed = int(time.time()) 
    
    seed = 1
    np.random.seed(seed)
    np.random.shuffle(x)

    np.random.seed(seed)
    np.random.shuffle(y)

shuffle_data(x,y)
print x
print y
print 
print train_data

[[4 4]
 [3 3]
 [1 1]
 [2 2]]
[44 33 11 22]

(array([[4, 4],
       [3, 3],
       [1, 1],
       [2, 2]]), array([44, 33, 11, 22]))

Shuffle in theano with TensorVariable

import time
import random
import numpy as np
import theano
import theano.tensor as T

train_data = (np.array([[1,1],
                       [2,2],
                       [3,3],
                       [4,4]],dtype='float64'), np.array([11,22,33,44],dtype='int32'))
x = train_data[0]
y = train_data[1]

def shuffle_data(x,y):
    #seed = int(time.time()) 
    
    seed = 1
    np.random.seed(seed)
    np.random.shuffle(x)

    np.random.seed(seed)
    np.random.shuffle(y)

# OK because of shared<---->train data
shared_x = theano.shared(train_data[0],  borrow=True) 
shared_y = theano.shared(train_data[1],  borrow=True) 

#shared_x = theano.shared(np.asarray(train_data[0], dtype=theano.config.floatX), borrow=True)  # no copy
#shared_y = theano.shared(np.asarray(train_data[1], dtype=theano.config.floatX), borrow=True)  # no copy

y_cast = T.cast(shared_y,"int32") 
# shared_y dtype int32, no copy, y_cast is  TensorSharedVariable(int32,vector)
# shared_y dtype float64, copy,  y_cast is  TensorVariable(int32,vector)

print shared_x.type,shared_y.type,y_cast.type
print type(shared_x),type(shared_y),type(y_cast)
print shared_y is y_cast
    
print 'old train'
print train_data

#print '\nupdate train'
#x[0] = 100
#y[0] = 100

print '\nshuffle train'
#shuffle_data(train_data[0],train_data[1])

originX = shared_x.get_value(borrow=True)
originY = shared_y.get_value(borrow=True)
print originX is x # true 
print originY is y # true
shuffle_data(originX,originY)

print train_data

print '\nshared x and y'
print shared_x.get_value() 
print shared_y.get_value()

WARNING (theano.sandbox.cuda): The cuda backend is deprecated and will be removed in the next release (v0.10).  Please switch to the gpuarray backend. You can get more information about how to switch at this URL:
 https://github.com/Theano/Theano/wiki/Converting-to-the-new-gpu-back-end%28gpuarray%29

WARNING (theano.sandbox.cuda): CUDA is installed, but device gpu is not available  (error: Unable to get the number of gpus available: no CUDA-capable device is detected)


TensorType(float64, matrix) TensorType(int32, vector) TensorType(int32, vector)
<class 'theano.tensor.sharedvar.TensorSharedVariable'> <class 'theano.tensor.sharedvar.TensorSharedVariable'> <class 'theano.tensor.sharedvar.TensorSharedVariable'>
True
old train
(array([[ 1.,  1.],
       [ 2.,  2.],
       [ 3.,  3.],
       [ 4.,  4.]]), array([11, 22, 33, 44], dtype=int32))

shuffle train
True
True
(array([[ 4.,  4.],
       [ 3.,  3.],
       [ 1.,  1.],
       [ 2.,  2.]]), array([44, 33, 11, 22], dtype=int32))

shared x and y
[[ 4.  4.]
 [ 3.  3.]
 [ 1.  1.]
 [ 2.  2.]]
[44 33 11 22]

Reference

History

20180807: created.

Example 3- A Real Example Logistic Regression with theano

Posted on 2018-08-07 Edited on 2024-10-14 In machine learning

Series

code example

import numpy
import numpy as np
import theano
import theano.tensor as T
rng = numpy.random

N = 400                                   # training sample size
feats = 784                               # number of input variables

# generate a dataset: D = (input_values, target_class)
D = (rng.randn(N, feats), rng.randint(size=N, low=0, high=2))
# (400, 784) (400,)
training_steps = 10000

# Declare Theano symbolic variables
x = T.dmatrix("x") # (400,784)
y = T.dvector("y") # (400,)

# initialize the weight vector w randomly
#
# this and the following bias variable b
# are shared so they keep their values
# between training iterations (updates)
w = theano.shared(rng.randn(feats), name="w") #vector (784,)

# initialize the bias term
b = theano.shared(0., name="b") #scalar ()

print("Initial model:")
print(w.get_value().shape)
print(b.get_value().shape)

# Construct Theano expression graph
p_1 = 1 / (1 + T.exp(-T.dot(x, w) - b))   # Probability that target = 1        = vector(400,)
prediction = p_1 > 0.5                    # The prediction thresholded         = vector(400,) of bool
xent = -y * T.log(p_1) - (1-y) * T.log(1-p_1) # Cross-entropy loss function    = vector(400,) of Cx for given x
cost = xent.mean() + 0.01 * (w ** 2).sum()# The cost to minimize               = scalar   f(x,y,w,b)
gw, gb = T.grad(cost, [w, b])             # Compute the gradient of the cost
                                          # w.r.t weight vector w and
                                          # bias term b
                                          # (we shall return to this in a
                                          # following section of this tutorial)

updates = [(w, w - 0.1 * gw),(b, b - 0.1 * gb)]                
# Compile
train = theano.function(
          inputs=[x,y],
          outputs=[prediction, xent,cost],
          updates= updates
        )

predict = theano.function(inputs=[x], outputs=prediction)

# Train
for i in range(training_steps):
    pred, xcent,cost_t = train(D[0], D[1])

prediction = predict(D[0])  # vector(400,) of bool


#a = np.array([1,2,3,4])
#b = np.array([2,2,4,3])
#np.mean( np.equal(a,b) ) #[0,1,0,0] 0.25

accuracy = 100* np.mean( np.equal(prediction,D[1]) )

print("Final model:")
print(w.get_value().shape)
print(b.get_value().shape)

print "\nCost = ",cost_t
print "Accuracy = ", accuracy,"%"

print("\ntarget values for D:")
#print(D[1])
print("prediction on D:")
#print(prediction)

Initial model:
(784,)
()
Final model:
(784,)
()

Cost =  0.121553645129
Accuracy =  100.0 %

target values for D:
prediction on D:

Reference

History

20180807: created.