Understanding the difficulty of training deep feedforward neural networks
Xavier Glorot
Yoshua Bengio
DIRO, Université de Montréal, Montréal, Québec, Canada
Abstract

Whereas before 2006 it appears that deep multi-layer neural networks were not successfully trained, since then several algorithms have been shown to successfully train them, with experimental results showing the superiority of deeper vs less deep architectures. All these experimental results were obtained with new initialization or training mechanisms. Our objective here is to understand better why standard gradient descent from random initialization is doing so poorly with deep neural networks, to better understand these recent relative successes and help design better algorithms in the future. We first observe the influence of the non-linear activation functions. We find that the logistic sigmoid activation is unsuited for deep networks with random initialization because of its mean value, which can drive especially the top hidden layer into saturation. Surprisingly, we find that saturated units can move out of saturation by themselves, albeit slowly, which explains the plateaus sometimes seen when training neural networks. We find that a new non-linearity that saturates less can often be beneficial. Finally, we study how activations and gradients vary across layers and during training, with the idea that training may be more difficult when the singular values of the Jacobian associated with each layer are far from 1. Based on these considerations, we propose a new initialization scheme that brings substantially faster convergence.

1 Deep Neural Networks

Deep learning methods aim at learning feature hierarchies with features from higher levels of the hierarchy formed by the composition of lower level features. They include learning methods for a wide array of deep architectures, including neural networks with many hidden layers (Vincent et al., 2008) and graphical models with many levels of hidden variables (Hinton et al., 2006), among others (Zhu et al., 2009; Weston et al., 2008). Much attention has recently been devoted to them (see (Bengio, 2009) for a review), because of their theoretical appeal, inspiration from biology and human cognition, and because of empirical success in vision (Ranzato et al., 2007; Larochelle et al., 2007; Vincent et al., 2008) and natural language processing (NLP) (Collobert & Weston, 2008; Mnih & Hinton, 2009). Theoretical results reviewed and discussed by Bengio (2009) suggest that in order to learn the kind of complicated functions that can represent high-level abstractions (e.g. in vision, language, and other AI-level tasks), one may need deep architectures.

Most of the recent experimental results with deep architectures are obtained with models that can be turned into deep supervised neural networks, but with initialization or training schemes different from the classical feedforward neural networks (Rumelhart et al., 1986). Why are these new algorithms working so much better than the standard random initialization and gradient-based optimization of a supervised training criterion? Part of the answer may be found in recent analyses of the effect of unsupervised pre-training (Erhan et al., 2009), showing that it acts as a regularizer that initializes the parameters in a “better” basin of attraction of the optimization procedure, corresponding to an apparent local minimum associated with better generalization. But earlier work (Bengio et al., 2007) had shown that even a purely supervised but greedy layer-wise procedure would give better results. So here, instead of focusing on what unsupervised pre-training or semi-supervised criteria bring to deep architectures, we focus on analyzing what may be going wrong with good old (but deep) multi-layer neural networks.

Our analysis is driven by investigative experiments to monitor activations (watching for saturation of hidden units) and gradients, across layers and across training iterations. We also evaluate the effects on these of choices of activation function (with the idea that it might affect saturation) and initialization procedure (since unsupervised pre-training is a particular form of initialization and it has a drastic impact).

Appearing in Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS) 2010, Chia Laguna Resort, Sardinia, Italy. Volume 9 of JMLR: W&CP 9. Copyright 2010 by the authors.
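The initialization scheme the abstract announces is the one now widely known as “Xavier” or “Glorot” initialization: each weight matrix is drawn from a uniform distribution whose scale depends on the layer's fan-in and fan-out, chosen so that activation and gradient variances (and hence the singular values of each layer's Jacobian) stay close to 1 across layers. A minimal NumPy sketch of that sampling rule; the function name and the 3-layer example network are illustrative, not from the paper:

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng=None):
    """Normalized initialization:
    W ~ U[-sqrt(6/(fan_in+fan_out)), +sqrt(6/(fan_in+fan_out))].
    The scale keeps the variance of activations (forward pass) and of
    back-propagated gradients roughly constant from layer to layer."""
    if rng is None:
        rng = np.random.default_rng()
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

# Example: initialize a 784 -> 256 -> 64 -> 10 feedforward network.
sizes = [784, 256, 64, 10]
weights = [glorot_uniform(a, b) for a, b in zip(sizes[:-1], sizes[1:])]
```

Note that the limit shrinks as layers get wider, so wide layers start with smaller weights; this is exactly what keeps pre-activation variance independent of layer width.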
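Section 1 says the analysis rests on monitoring hidden-unit activations for saturation across layers and training iterations. A hypothetical probe of that kind, not the paper's actual experimental code: it forward-propagates a batch through sigmoid layers and reports, per layer, the fraction of activations sitting in the flat tails of the sigmoid where its gradient is near zero (the thresholds 0.05/0.95 are an arbitrary choice for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def saturation_per_layer(x, weights, low=0.05, high=0.95):
    """Forward-propagate a batch through sigmoid layers and return, for
    each layer, the fraction of activations in the saturated tails
    (< low or > high), where the sigmoid's derivative is tiny."""
    fractions = []
    h = x
    for W in weights:
        h = sigmoid(h @ W)
        fractions.append(float(np.mean((h < low) | (h > high))))
    return fractions

rng = np.random.default_rng(0)
sizes = [100, 100, 100, 100, 100]
# A common pre-2010 default: W ~ U[-1/sqrt(n), 1/sqrt(n)] with n = fan-in.
weights = [rng.uniform(-1, 1, (a, b)) / np.sqrt(a)
           for a, b in zip(sizes[:-1], sizes[1:])]
x = rng.standard_normal((256, sizes[0]))
print(saturation_per_layer(x, weights))
```

Tracked over training iterations rather than at a single point, this is the kind of statistic behind the paper's observation that the top hidden layer is driven into saturation.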
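The abstract's criterion — training may be hard when the singular values of each layer's Jacobian are far from 1 — can be checked numerically. A sketch (illustrative, not from the paper) for a single sigmoid layer h = sigmoid(x @ W): the Jacobian dh/dx has entries J[j, i] = sigmoid'(a_j) · W[i, j], so it can be formed explicitly and decomposed with an SVD:

```python
import numpy as np

def layer_jacobian_singvals(x, W):
    """Singular values of the Jacobian dh/dx of one sigmoid layer
    h = sigmoid(x @ W), evaluated at a single input x (1-D array).
    Values far below 1 mean the layer shrinks back-propagated
    gradients; values far above 1 mean it amplifies them."""
    a = x @ W
    s = 1.0 / (1.0 + np.exp(-a))
    J = (s * (1.0 - s))[:, None] * W.T   # J[j, i] = sigmoid'(a_j) * W[i, j]
    return np.linalg.svd(J, compute_uv=False)

rng = np.random.default_rng(0)
W = rng.uniform(-1, 1, (50, 50)) / np.sqrt(50)
x = rng.standard_normal(50)
print(layer_jacobian_singvals(x, W)[:3])  # a few largest singular values
```

Since sigmoid'(a) ≤ 1/4, the factor s·(1−s) alone already pulls these singular values well below 1 unless the weight scale compensates, which is one way to motivate the normalized initialization.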