
Understanding the Difficulty of Training Deep Neural Networks

Updated: 2019-12-05 09:40:27 | Size: 2M | Uploaded by: 梦留香 | Tags: deep neural networks | Download points: 2

Resource Description

Whereas before 2006 it appears that deep multi-layer neural networks were not successfully trained, since then several algorithms have been shown to train them successfully, with experimental results demonstrating the superiority of deeper over less deep architectures. All of these experimental results were obtained with new initialization or training mechanisms. Our objective here is to better understand why standard gradient descent from random initialization does so poorly with deep neural networks, in order to better understand these recent relative successes and help design better algorithms in the future. We first observe the influence of the non-linear activation functions. We find that the logistic sigmoid activation is unsuited for deep networks with random initialization because of its mean value, which can drive especially the top hidden layer into saturation. Surprisingly, we find that saturated units can move out of saturation by themselves, albeit slowly, which sometimes explains the plateaus seen when training neural networks. We find that a new non-linearity that saturates less can often be beneficial. Finally, we study how activations and gradients vary across layers and during training, with the idea that training may be more difficult when the singular values of the Jacobian associated with each layer are far from 1. Based on these considerations, we propose a new initialization scheme that brings substantially faster convergence.
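
The saturation claim above can be illustrated with the kind of monitoring experiment the paper describes. The sketch below is not taken from the paper: the layer sizes, batch size, and the 1/sqrt(fan_in) uniform "standard" initialization are illustrative assumptions, and it only inspects a randomly initialized network (in the paper, the top hidden layer is driven into saturation during training, not at initialization).

```python
# Minimal monitoring sketch (illustrative only, not the paper's code): push a random
# batch through a randomly initialized deep sigmoid network and report per-layer
# activation statistics. Layer sizes and the 1/sqrt(fan_in) uniform "standard"
# initialization are assumptions made for illustration.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

layer_sizes = [784, 500, 500, 500, 500]          # hypothetical input + hidden sizes
h = rng.standard_normal((256, layer_sizes[0]))   # fake input batch

for depth, (n_in, n_out) in enumerate(zip(layer_sizes[:-1], layer_sizes[1:]), start=1):
    W = rng.uniform(-1.0 / np.sqrt(n_in), 1.0 / np.sqrt(n_in), size=(n_in, n_out))
    h = sigmoid(h @ W)                            # zero biases for simplicity
    saturated = np.mean((h < 0.01) | (h > 0.99))  # fraction pinned near 0 or 1
    print(f"layer {depth}: mean={h.mean():.3f}, std={h.std():.3f}, saturated={saturated:.3f}")
```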


1 Deep Neural Networks

Deep learning methods aim at learning feature hierarchies, with features from higher levels of the hierarchy formed by the composition of lower-level features. They include ...

Appearing in Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS) 2010, Chia Laguna Resort, Sardinia, Italy. Volume 9 of JMLR: W&CP 9. Copyright 2010 by the authors.


Partial File List

Filename                               Size
对深度神经网络训练难点的认识.pdf       2M

Partial Page Preview

(Please download the file to view the full content)
Understanding the difficulty of training deep feedforward neural networks  
Xavier Glorot  
Yoshua Bengio  
DIRO, Université de Montréal, Montréal, Québec, Canada
Abstract  
Whereas before 2006 it appears that deep multi-layer neural networks were not successfully trained, since then several algorithms have been shown to successfully train them, with experimental results showing the superiority of deeper vs less deep architectures. All these experimental results were obtained with new initialization or training mechanisms. Our objective here is to understand better why standard gradient descent from random initialization is doing so poorly with deep neural networks, to better understand these recent relative successes and help design better algorithms in the future. We first observe the influence of the non-linear activation functions. We find that the logistic sigmoid activation is unsuited for deep networks with random initialization because of its mean value, which can drive especially the top hidden layer into saturation. Surprisingly, we find that saturated units can move out of saturation by themselves, albeit slowly, which sometimes explains the plateaus seen when training neural networks. We find that a new non-linearity that saturates less can often be beneficial. Finally, we study how activations and gradients vary across layers and during training, with the idea that training may be more difficult when the singular values of the Jacobian associated with each layer are far from 1. Based on these considerations, we propose a new initialization scheme that brings substantially faster convergence.
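
The initialization scheme announced in the abstract's last sentence is derived later in the paper and does not appear in this excerpt. As a hedged sketch, the fan-in/fan-out scaled ("normalized") uniform initialization commonly attributed to this work can be written as follows; the helper name and layer sizes are illustrative assumptions.

```python
# Sketch of a fan-in/fan-out scaled ("normalized") uniform initialization, as commonly
# attributed to Glorot & Bengio (2010). The exact scheme is given later in the paper,
# so treat this as an assumption-based illustration rather than a quotation.
import numpy as np

def normalized_init(n_in: int, n_out: int, rng=None):
    """Draw W ~ U[-sqrt(6/(n_in+n_out)), +sqrt(6/(n_in+n_out))]."""
    rng = rng or np.random.default_rng()
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

W = normalized_init(784, 500)                  # hypothetical layer sizes
print(W.std(), np.sqrt(2.0 / (784 + 500)))     # empirical std ~ sqrt(2/(n_in+n_out))
```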
1 Deep Neural Networks

Deep learning methods aim at learning feature hierarchies with features from higher levels of the hierarchy formed by the composition of lower level features. They include learning methods for a wide array of deep architectures, including neural networks with many hidden layers (Vincent et al., 2008) and graphical models with many levels of hidden variables (Hinton et al., 2006), among others (Zhu et al., 2009; Weston et al., 2008). Much attention has recently been devoted to them (see (Bengio, 2009) for a review), because of their theoretical appeal, inspiration from biology and human cognition, and because of empirical success in vision (Ranzato et al., 2007; Larochelle et al., 2007; Vincent et al., 2008) and natural language processing (NLP) (Collobert & Weston, 2008; Mnih & Hinton, 2009). Theoretical results reviewed and discussed by Bengio (2009) suggest that in order to learn the kind of complicated functions that can represent high-level abstractions (e.g. in vision, language, and other AI-level tasks), one may need deep architectures.
Most of the recent experimental results with deep architecture are obtained with models that can be turned into deep supervised neural networks, but with initialization or training schemes different from the classical feedforward neural networks (Rumelhart et al., 1986). Why are these new algorithms working so much better than the standard random initialization and gradient-based optimization of a supervised training criterion? Part of the answer may be found in recent analyses of the effect of unsupervised pre-training (Erhan et al., 2009), showing that it acts as a regularizer that initializes the parameters in a “better” basin of attraction of the optimization procedure, corresponding to an apparent local minimum associated with better generalization. But earlier work (Bengio et al., 2007) had shown that even a purely supervised but greedy layer-wise procedure would give better results. So here instead of focusing on what unsupervised pre-training or semi-supervised criteria bring to deep architectures, we focus on analyzing what may be going wrong with good old (but deep) multi-layer neural networks.
Our analysis is driven by investigative experiments to monitor activations (watching for saturation of hidden units) and gradients, across layers and across training iterations. We also evaluate the effects on these of choices of activation function (with the idea that it might affect saturation) and initialization procedure (since unsupervised pre-training is a particular form of initialization and it has a drastic impact).
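
As a rough companion to the monitoring just described, and to the abstract's remark about the singular values of the per-layer Jacobian, the sketch below computes the singular values of one sigmoid layer's input-output Jacobian at a single input; values far from 1 mean the layer shrinks or amplifies signals (and, through the transposed Jacobian, gradients). The layer width and initialization are assumptions, and this is not the paper's own derivation.

```python
# Minimal sketch: singular values of one sigmoid layer's Jacobian at a single input.
# For h = sigmoid(x @ W), the Jacobian dh/dx equals diag(sigmoid'(a)) @ W.T.
# Illustrative assumptions: width 200, uniform 1/sqrt(fan_in) initialization.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_in, n_out = 200, 200
W = rng.uniform(-1.0 / np.sqrt(n_in), 1.0 / np.sqrt(n_in), size=(n_in, n_out))
x = rng.standard_normal(n_in)

a = x @ W                                  # pre-activations
s = sigmoid(a)
J = (s * (1.0 - s))[:, None] * W.T         # diag(s') @ W.T, shape (n_out, n_in)
sv = np.linalg.svd(J, compute_uv=False)    # singular values, largest first
print("max / median / min singular value:", sv[0], np.median(sv), sv[-1])
```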
Appearing in Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS) 2010, Chia Laguna Resort, Sardinia, Italy. Volume 9 of JMLR: W&CP 9. Copyright 2010 by the authors.

All Comments (1)

  • 2019-12-12 12:16:06 suxindg

    Thanks for sharing