Getting Started
Shared variables tips
We encourage you to store the dataset in shared variables and access it by minibatch index, given a fixed and known batch size. The reason for using shared variables is GPU performance: there is a large overhead every time data is copied into GPU memory.
If the data is kept in Theano shared variables, however, Theano can copy the entire dataset onto the GPU in a single call, at the time the shared variables are constructed, instead of copying each minibatch from CPU to GPU during training.
Because datapoints and their labels are usually of a different nature (labels are usually integers while datapoints are real numbers), we suggest using separate variables for data and labels. We also recommend separate variables for the training set, validation set and test set to keep the code readable, resulting in 6 different shared variables in total.
Mini-batch data
```python
def shared_dataset(data_xy):
```
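A fuller sketch of such a helper, following the usual pattern of the Theano tutorials (labels are stored as floatX so they can live on the GPU, then cast back to int32 for use as indices; the split names in the comments are illustrative):

```python
import numpy
import theano
import theano.tensor as T

def shared_dataset(data_xy, borrow=True):
    """Load a (data, labels) pair into GPU-friendly shared variables."""
    data_x, data_y = data_xy
    shared_x = theano.shared(numpy.asarray(data_x, dtype=theano.config.floatX),
                             borrow=borrow)
    shared_y = theano.shared(numpy.asarray(data_y, dtype=theano.config.floatX),
                             borrow=borrow)
    # labels live on the GPU as floats, but are needed as ints for indexing
    return shared_x, T.cast(shared_y, 'int32')

# one call per split -> 6 shared variables in total, e.g.
# train_set_x, train_set_y = shared_dataset(train_set)
# valid_set_x, valid_set_y = shared_dataset(valid_set)
# test_set_x,  test_set_y  = shared_dataset(test_set)
```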
SGD pseudocode in Theano
Traditional GD (m=N)
```python
# GRADIENT DESCENT
```
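As a runnable toy illustration of the batch update rule (a one-parameter least-squares fit; the data and hyper-parameters are made up):

```python
import numpy

# toy batch gradient descent: fit w in y ~ w*x by minimizing mean squared error
rng = numpy.random.RandomState(0)
x = rng.randn(1000)
y = 3.0 * x + 0.1 * rng.randn(1000)

w, learning_rate = 0.0, 0.1
for epoch in range(100):
    # gradient of the loss computed over the ENTIRE training set (m = N)
    d_loss_wrt_w = numpy.mean(2.0 * (w * x - y) * x)
    w -= learning_rate * d_loss_wrt_w
print(w)   # converges to roughly 3.0
```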
Stochastic gradient descent (SGD) works according to the same principles as ordinary gradient descent, but proceeds more quickly by estimating the gradient from just a few examples at a time instead of the entire training set. In its purest form, we estimate the gradient from just a single example at a time.
Online Learning SGD (m=1)
```python
# STOCHASTIC GRADIENT DESCENT
```
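The same toy problem with a pure online update, one example at a time (again an illustrative sketch, not the tutorial's exact code):

```python
import numpy

# pure online SGD: estimate the gradient from a single example (m = 1)
rng = numpy.random.RandomState(0)
x = rng.randn(1000)
y = 3.0 * x + 0.1 * rng.randn(1000)

w, learning_rate = 0.0, 0.01
for epoch in range(10):
    for x_i, y_i in zip(x, y):
        # noisy gradient estimate from one example
        d_loss_wrt_w = 2.0 * (w * x_i - y_i) * x_i
        w -= learning_rate * d_loss_wrt_w
print(w)   # roughly 3.0, reached after far fewer full passes than batch GD
```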
The variant that we recommend for deep learning is a further twist on stochastic gradient descent using so-called “minibatches”. Minibatch SGD (MSGD) works identically to SGD, except that we use more than one training example to make each estimate of the gradient. This technique reduces variance in the estimate of the gradient, and often makes better use of the hierarchical memory organization in modern computers.
Minibatch SGD (batch size m, 1 < m < N)
```python
for (x_batch, y_batch) in train_batches:
```
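And the same toy problem again, this time estimating the gradient from small batches (illustrative values for the batch size and learning rate):

```python
import numpy

# minibatch SGD: estimate the gradient from a small batch of examples
rng = numpy.random.RandomState(0)
x = rng.randn(1000)
y = 3.0 * x + 0.1 * rng.randn(1000)

batch_size, w, learning_rate = 20, 0.0, 0.05
for epoch in range(20):
    for start in range(0, len(x), batch_size):
        x_batch = x[start:start + batch_size]
        y_batch = y[start:start + batch_size]
        # lower-variance gradient estimate than pure SGD
        d_loss_wrt_w = numpy.mean(2.0 * (w * x_batch - y_batch) * x_batch)
        w -= learning_rate * d_loss_wrt_w
print(w)   # roughly 3.0
```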
Theano pseudocode for minibatch SGD
```python
# Minibatch Stochastic Gradient Descent
```
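A compact, runnable instantiation of this pattern for a logistic-regression model (the dimensions, names and learning rate are illustrative; `train_batches` is assumed to yield numpy arrays):

```python
import numpy
import theano
import theano.tensor as T

# Minibatch Stochastic Gradient Descent for a softmax / logistic regression model
n_in, n_out, learning_rate = 784, 10, 0.1

x = T.matrix('x')    # a minibatch of inputs, shape (batch_size, n_in)
y = T.ivector('y')   # the corresponding integer labels

W = theano.shared(numpy.zeros((n_in, n_out), dtype=theano.config.floatX), name='W')
b = theano.shared(numpy.zeros((n_out,), dtype=theano.config.floatX), name='b')

p_y_given_x = T.nnet.softmax(T.dot(x, W) + b)
# negative log-likelihood of the minibatch
loss = -T.mean(T.log(p_y_given_x)[T.arange(y.shape[0]), y])

# one gradient step per call; the updates modify the shared parameters in place
g_W, g_b = T.grad(loss, [W, b])
updates = [(W, W - learning_rate * g_W), (b, b - learning_rate * g_b)]
MSGD = theano.function([x, y], loss, updates=updates)

# training loop (train_batches is assumed to yield (x_batch, y_batch) numpy arrays)
# for x_batch, y_batch in train_batches:
#     print('current loss:', MSGD(x_batch, y_batch))
```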
Regularization
L1/L2 regularization
There are several techniques for combating overfitting; the ones covered here are L1/L2 regularization and early-stopping.
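In its standard form, L1/L2 regularization adds a p-norm penalty on the parameters to the training loss; a sketch of the usual objective (with $\lambda$ the regularization strength and $\mathrm{NLL}$ the negative log-likelihood used elsewhere in these notes):

$$E(\theta, \mathcal{D}) = \mathrm{NLL}(\theta, \mathcal{D}) + \lambda \|\theta\|_p^p,
\qquad
\|\theta\|_p = \Big(\sum_j |\theta_j|^p\Big)^{1/p}$$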
Commonly used values for p are 1 and 2, hence the L1/L2 nomenclature. If p=2, then the regularizer is also called “weight decay”.
To follow Occam’s razor principle, this minimization should find us the simplest solution (as measured by our simplicity criterion) that fits the training data.
```python
# symbolic Theano variable that represents the L1 regularization term
```
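In Theano this typically looks like the following sketch (here `W` stands for the model's weight matrix and `nll` for its negative log-likelihood; the shapes and coefficients are illustrative):

```python
import numpy
import theano
import theano.tensor as T

# assume `W` is the weight matrix of the model
W = theano.shared(numpy.zeros((784, 10), dtype=theano.config.floatX), name='W')
lambda_1, lambda_2 = 1e-3, 1e-4   # regularization strengths

# symbolic expressions for the L1 norm and the squared L2 norm of the weights
L1 = T.sum(abs(W))
L2_sqr = T.sum(W ** 2)

# regularized training objective: negative log-likelihood plus weighted penalties
# loss = nll + lambda_1 * L1 + lambda_2 * L2_sqr
```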
Early-Stopping
Early-stopping combats overfitting by monitoring the model’s performance on a validation set. A validation set is a set of examples that we never use for gradient descent, but which is also not a part of the test set.
Early stopping means computing the accuracy on the validation data at the end of every epoch (one epoch being one full pass over the training data) and stopping training once that accuracy no longer improves. This is a natural thing to do: if accuracy is no longer improving, further training is pointless, and stopping also helps prevent overfitting.
So what counts as validation accuracy "no longer improving"? It does not mean stopping as soon as the accuracy drops once: it may fall after one epoch and rise again over the following epochs, so one or two consecutive drops are not enough to conclude that it has stopped improving. The right approach is to keep track of the best validation accuracy seen so far during training, and to declare "no improvement" only when no new best has been reached for, say, 10 consecutive epochs (or more); at that point early stopping kicks in. This strategy is called "no-improvement-in-n", where n is the number of epochs and can be set to 10, 20, 30, ... depending on the situation.
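A minimal sketch of the no-improvement-in-n rule (the helpers `train_one_epoch` and `validation_accuracy` are hypothetical stand-ins for the model's training and evaluation code):

```python
n = 10            # "no-improvement-in-n": stop after n epochs without a new best
max_epochs = 200
best_accuracy = 0.0
bad_epochs = 0

for epoch in range(max_epochs):
    train_one_epoch()                  # hypothetical: one pass over the training set
    accuracy = validation_accuracy()   # hypothetical: evaluate on the validation set
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        bad_epochs = 0                 # new best: reset the counter
    else:
        bad_epochs += 1
    if bad_epochs >= n:
        break                          # validation accuracy has stopped improving
```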
Variable learning rate
Decreasing the learning rate over time is sometimes a good idea, e.g. eta = eta0 / (1 + d * epoch), where d is a decrease constant such as 0.001.
Early stopping can also be combined with a decreasing learning rate: a simple and effective trick is that when validation accuracy satisfies the no-improvement-in-n rule, instead of stopping, halve the learning rate and keep training. The next time the rule triggers, halve it again (now a quarter of the original), and so on, terminating only when the learning rate has fallen to eta0/1024 (1/1024, 1/512, or another threshold can be chosen as appropriate; one can also divide by 10 instead of 2 each time).
In practice, halving decays the rate too quickly; eta = eta0 / (1 + d * epoch) with d = 0.001 tends to work better.
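The two schedules above, sketched in Python (eta0 and d are illustrative values):

```python
eta0, d = 0.1, 0.001   # illustrative base rate and decrease constant

# schedule 1: smooth decay over epochs
def learning_rate(epoch):
    return eta0 / (1.0 + d * epoch)

# schedule 2: halve the rate each time no-improvement-in-n fires,
# and stop once it has shrunk to eta0 / 1024
eta = eta0
while eta > eta0 / 1024:
    # ... train until the no-improvement-in-n rule triggers ...
    eta = eta / 2.0
```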
```python
# early-stopping parameters
```
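A sketch of the patience-based early-stopping loop used throughout the Theano tutorials; `train_model` and `validate_model` are assumed to be compiled Theano functions, and `n_train_batches`, `n_valid_batches`, `n_epochs` the usual counts from the rest of the script:

```python
import numpy

# early-stopping parameters
patience = 5000                 # look at this many minibatches regardless
patience_increase = 2           # wait this much longer when a new best is found
improvement_threshold = 0.995   # a relative improvement of this much is significant
validation_frequency = min(n_train_batches, patience // 2)

best_validation_loss = numpy.inf
done_looping = False
epoch = 0
while epoch < n_epochs and not done_looping:
    epoch += 1
    for minibatch_index in range(n_train_batches):
        train_model(minibatch_index)   # one MSGD step on one minibatch
        iter = (epoch - 1) * n_train_batches + minibatch_index
        if (iter + 1) % validation_frequency == 0:
            this_validation_loss = numpy.mean(
                [validate_model(i) for i in range(n_valid_batches)])
            if this_validation_loss < best_validation_loss:
                # a significant improvement: allow training to run longer
                if this_validation_loss < best_validation_loss * improvement_threshold:
                    patience = max(patience, iter * patience_increase)
                best_validation_loss = this_validation_loss
        if patience <= iter:
            done_looping = True        # ran out of patience: stop training
            break
```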
Theano/Python Tips
Loading and Saving Models
DO: Pickle the numpy ndarrays from your shared variables
DON’T: Do not pickle your training or test functions for long-term storage
```python
import cPickle
```
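A sketch of the saving pattern (here `w`, `v`, `u` stand for the model's shared variables; the file name is illustrative, and on Python 3 `pickle` replaces `cPickle`):

```python
import cPickle   # Python 2; use `pickle` on Python 3

# w, v, u are assumed to be the model's Theano shared variables
save_file = open('path', 'wb')   # this will overwrite any existing file
cPickle.dump(w.get_value(borrow=True), save_file, -1)   # -1 = highest protocol:
cPickle.dump(v.get_value(borrow=True), save_file, -1)   # much more compact storage
cPickle.dump(u.get_value(borrow=True), save_file, -1)   # for numpy ndarrays
save_file.close()
```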
Then later, you can load your data back like this:
```python
save_file = open('path')
```
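And a matching sketch for loading the values back into the shared variables (again `w`, `v`, `u` are assumed to be the model's shared variables):

```python
import cPickle   # Python 2; use `pickle` on Python 3

save_file = open('path')
w.set_value(cPickle.load(save_file), borrow=True)
v.set_value(cPickle.load(save_file), borrow=True)
u.set_value(cPickle.load(save_file), borrow=True)
save_file.close()
```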
Plotting Intermediate Results
If you have enough disk space, your training script should save intermediate models and a visualization script should process those saved models.
MLP
See the MLP chapter of the Theano deep learning tutorials.
An MLP can be viewed as a logistic regression classifier where the input is first transformed using a learnt non-linear transformation (e.g. a tanh or sigmoid layer). This transformation projects the input data into a space where it becomes linearly separable. The intermediate layer is referred to as a hidden layer. A single hidden layer is sufficient to make an MLP a universal approximator.
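In symbols, a single-hidden-layer MLP computes (a standard formulation; $s$ is the hidden-layer non-linearity such as tanh or sigmoid, $G$ the output non-linearity such as softmax, and $W^{(i)}$, $b^{(i)}$ the weights and biases of layer $i$):

$$f(x) = G\big(b^{(2)} + W^{(2)}\, s(b^{(1)} + W^{(1)} x)\big)$$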
Weight initialization
- old heuristic: sample uniformly from ±1/sqrt(n_in)
The initial values for the weights of a hidden layer should be uniformly sampled from a symmetric interval that depends on the activation function:
- tanh: sample uniformly from [-sqrt(6/(n_in+n_hidden)), sqrt(6/(n_in+n_hidden))]
- sigmoid: use initial weights 4 times larger than for tanh
```python
# `W` is initialized with `W_values` which is uniformly sampled
```
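A sketch of the corresponding initialization code, in the style of the Theano MLP tutorial (the layer sizes are illustrative):

```python
import numpy
import theano
import theano.tensor as T

rng = numpy.random.RandomState(1234)
n_in, n_out = 784, 500          # illustrative layer sizes
activation = T.tanh

# `W` is initialized with values uniformly sampled from
# [-sqrt(6/(n_in+n_out)), sqrt(6/(n_in+n_out))] for a tanh activation
W_values = numpy.asarray(
    rng.uniform(
        low=-numpy.sqrt(6. / (n_in + n_out)),
        high=numpy.sqrt(6. / (n_in + n_out)),
        size=(n_in, n_out)
    ),
    dtype=theano.config.floatX
)
# for the sigmoid the interval is 4 times larger
if activation == T.nnet.sigmoid:
    W_values *= 4

W = theano.shared(value=W_values, name='W', borrow=True)
```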
Tips and Tricks for training MLPs
Nonlinearity
Two of the most common nonlinearities are the sigmoid and the tanh function. Nonlinearities that are symmetric around the origin are preferred because they tend to produce zero-mean inputs to the next layer, which is a desirable property. Empirically, tanh has been observed to have better convergence properties.
Weight initialization
At initialization we want the weights to be small enough around the origin that the activation function operates in its linear regime, where gradients are largest. As noted above, the appropriate initialization depends on the activation function.
Learning rate
The simplest solution is to use a constant learning rate. Rule of thumb: try several log-spaced values (10^{-1},10^{-2},\ldots) and narrow the (logarithmic) grid search to the region where you obtain the lowest validation error.
Decreasing the learning rate over time is sometimes a good idea: eta = eta0/(1+d*epoch), where d is a decrease constant such as 0.001.
Early stopping can also be combined with a decreasing learning rate: halve eta each time the no-improvement rule triggers, until eta = eta0/1024.
Regularization parameter
Typical values to try for the L1/L2 regularization parameter \lambda are 10^{-2},10^{-3},\ldots. In the framework that we described so far, optimizing this parameter will not lead to significantly better solutions, but is worth exploring nonetheless.
Number of hidden units
This hyper-parameter is very much dataset-dependent. Unless we employ some regularization scheme (early stopping or L1/L2 penalties), the typical plot of number of hidden units vs. generalization performance is U-shaped.
CNN
The Convolution and Pool Operator
```python
import theano
```
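A sketch consistent with the shapes printed below, assuming (as in the Theano CNN tutorial) a 3-channel 639×516 input, two 9×9 filters, 2×2 max-pooling, and Theano 0.9's `conv2d`/`pool_2d` interface; a random image stands in for whatever image the original script loaded:

```python
import numpy
import theano
import theano.tensor as T
from theano.tensor.nnet import conv2d
from theano.tensor.signal.pool import pool_2d

rng = numpy.random.RandomState(23455)

# symbolic 4D tensor: (batch size, channels, height, width)
input = T.tensor4(name='input')

# two 9x9 filters over a 3-channel input
w_shp = (2, 3, 9, 9)
w_bound = numpy.sqrt(3 * 9 * 9)
W = theano.shared(
    numpy.asarray(rng.uniform(low=-1.0 / w_bound, high=1.0 / w_bound, size=w_shp),
                  dtype=theano.config.floatX),
    name='W')

conv_out = conv2d(input, W)                                    # 'valid' convolution
pooled_out = pool_2d(conv_out, ws=(2, 2), ignore_border=True)  # 2x2 max-pooling

f = theano.function([input], [conv_out, pooled_out])

# a random stand-in for a 639x516 RGB image, shaped (1, 3, 639, 516)
img = numpy.asarray(rng.rand(1, 3, 639, 516), dtype=theano.config.floatX)
conv_val, pooled_val = f(img)
print(conv_val.shape)     # (1, 2, 631, 508)
print(pooled_val.shape)   # (1, 2, 315, 254)
```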
```
WARNING (theano.sandbox.cuda): The cuda backend is deprecated and will be removed in the next release (v0.10). Please switch to the gpuarray backend. You can get more information about how to switch at this URL:
https://github.com/Theano/Theano/wiki/Converting-to-the-new-gpu-back-end%28gpuarray%29
Using gpu device 0: GeForce GTX 1060 (CNMeM is enabled with initial size: 80.0% of memory, cuDNN 5105)
(1, 2, 631, 508)
(1, 2, 315, 254)
```
Reference
History
- 20180807: created.