DL Related Network Part

CNN

Neural Network (MLP, multilayer perceptron / fully connected network)
- Goal: introducing an activation function turns the model into a nonlinear classifier
- Activation functions
- Process: forward pass + backpropagation (gradient descent)
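
To illustrate why the activation function matters, here is a minimal sketch of a two-layer MLP forward pass. The weights `W1`, `b1`, `W2`, `b2` are hand-picked for illustration (not trained, and not from these notes). With ReLU the network computes XOR, which no linear classifier can; with the activation replaced by the identity, the same weights collapse to a single affine function.

```python
def relu(v):
    """Elementwise ReLU activation."""
    return [max(0.0, x) for x in v]

def identity(v):
    """No activation: used to show the network collapsing to an affine map."""
    return list(v)

def linear(W, b, x):
    """Affine layer: W @ x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

# Hand-picked (hypothetical) weights, chosen only so that the ReLU network computes XOR.
W1, b1 = [[1.0, 1.0], [1.0, 1.0]], [0.0, -1.0]
W2, b2 = [[1.0, -2.0]], [0.0]

def mlp(x, activation=relu):
    """Forward pass: linear -> activation -> linear."""
    hidden = activation(linear(W1, b1, x))
    return linear(W2, b2, hidden)[0]

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
with_relu = [mlp(x) for x in inputs]                    # XOR pattern
without_activation = [mlp(x, identity) for x in inputs]  # equals 2 - x1 - x2
```

Without the activation the output is `2 - x1 - x2` for every input, so stacking more linear layers would not add expressive power.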
Convolutional Neural Networks
- Overview
  - Convolution layer
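    - Notation
      - $N$: number of images in the batch
      - $C_{in}$: number of input channels (3 for an RGB image)
      - $H$, $W$: height and width of the image
      - $C_{out}$: number of output activation maps
      - $K_w$, $K_h$: width and height of the filter
      - $H'$, $W'$: height and width of the output activation map after convolution
    - Output size without padding: $\frac{N-F}{stride} + 1$, where $N$ here is the input height or width (not the batch size) and $F$ is the filter size
    - Output size with padding $P$: $\frac{N+2P-F}{stride} + 1$
    - Has parameters: $para\_num = C_{in}*C_{out}*K_w*K_h + 1 * C_{out}(bias)$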
  - Pooling layer
    - makes the representations smaller and more manageable
    - operates over each activation map independently(layer by layer)
    - No learnable parameters
  - Fully-connected layer
    - The reduced form of our image is flattened into a column vector and is fed through a feed forward neural network
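
The output-size and parameter-count formulas above can be checked with a short sketch. The function names are illustrative, and the strict divisibility check in `conv_output_size` is an assumption made here for clarity (deep learning frameworks typically round down instead). `max_pool2d` operates on one activation map at a time and has no parameters, matching the pooling notes.

```python
def conv_output_size(n, f, stride=1, padding=0):
    """Output size (N + 2P - F) / stride + 1 along one spatial dimension."""
    span = n + 2 * padding - f
    if span < 0 or span % stride != 0:
        raise ValueError("filter does not tile the input with this stride/padding")
    return span // stride + 1

def conv_param_count(c_in, c_out, k_w, k_h):
    """Weights (C_in * C_out * K_w * K_h) plus one bias per output channel."""
    return c_in * c_out * k_w * k_h + c_out

def max_pool2d(activation_map, size=2, stride=2):
    """Max pooling over a single 2-D activation map given as a list of rows."""
    out_h = conv_output_size(len(activation_map), size, stride)
    out_w = conv_output_size(len(activation_map[0]), size, stride)
    return [
        [
            max(
                activation_map[i * stride + di][j * stride + dj]
                for di in range(size)
                for dj in range(size)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

# A 32x32 input with a 5x5 filter, stride 1:
no_pad = conv_output_size(32, 5)                 # 28
same_pad = conv_output_size(32, 5, padding=2)    # 32
# Ten 5x5 filters over a 3-channel image:
params = conv_param_count(3, 10, 5, 5)           # 760
pooled = max_pool2d([[1, 1, 2, 4],
                     [5, 6, 7, 8],
                     [3, 2, 1, 0],
                     [1, 2, 3, 4]])              # [[6, 8], [3, 4]]
```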

CNN architecture

RNN-GAN

RNN (Recurrent Neural Network): short-term memory
- Markov assumption: $p(w_t|w_1,w_2,\cdots,w_{t-1}) = p(w_t|w_{t-1},w_{t-2},w_{t-3})$
- Formula: $h_t = f_W(h_{t-1}, x_t)$
  - $h_t = tanh(W_{hh}h_{t-1} +W_{xh}x_t)$
  - $y_t = W_{hy}h_t$
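- The same parameters and activation function are shared across all time steps
- Can generally be run in a self-feeding way: the output of the previous step is used as the input of the current step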
- Different types
- Training example
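  - For the loss, run the forward pass over the entire sequence; for gradient descent, backpropagate through the entire sequence
  - Optimization:
    - chunks: run the forward and backward passes over chunks of the sequence rather than the whole sequence
  - Problem:
    - Short-term memory
    - Vanishing and exploding gradients
      - Backpropagation is linear, whereas the forward pass is nonlinear because of the activation function, so the gradient is multiplied by the same weights at every step with nothing to bound it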
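
The recurrence above, and the vanishing/exploding gradient problem, can be sketched in plain Python. The helper names are illustrative. `scalar_gradient_through_time` uses the fact that for a one-dimensional RNN, $\partial h_t / \partial h_{t-1} = W_{hh}(1 - h_t^2)$, so the gradient through $T$ steps is a product of $T$ such factors. The all-zero input sequence in the demo is an assumption chosen so the factors reduce to $W_{hh}$ exactly.

```python
import math

def matvec(W, v):
    """Matrix-vector product with W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rnn_step(h_prev, x_t, W_hh, W_xh):
    """h_t = tanh(W_hh h_{t-1} + W_xh x_t)."""
    pre = [a + b for a, b in zip(matvec(W_hh, h_prev), matvec(W_xh, x_t))]
    return [math.tanh(p) for p in pre]

def rnn_forward(xs, h0, W_hh, W_xh, W_hy):
    """Apply the same weights at every time step; returns hidden states and outputs."""
    h, hs, ys = h0, [], []
    for x_t in xs:
        h = rnn_step(h, x_t, W_hh, W_xh)
        hs.append(h)
        ys.append(matvec(W_hy, h))  # y_t = W_hy h_t
    return hs, ys

def scalar_gradient_through_time(w_hh, hs):
    """d h_T / d h_0 for a 1-D RNN: product over t of w_hh * (1 - h_t^2)."""
    grad = 1.0
    for h_t in hs:
        grad *= w_hh * (1.0 - h_t ** 2)
    return grad

# Demo with 20 zero inputs, so every hidden state stays at 0.
xs = [[0.0]] * 20
hs_small, _ = rnn_forward(xs, [0.0], [[0.5]], [[1.0]], [[1.0]])
hs_large, _ = rnn_forward(xs, [0.0], [[2.0]], [[1.0]], [[1.0]])
vanishing = scalar_gradient_through_time(0.5, [h[0] for h in hs_small])  # 0.5 ** 20
exploding = scalar_gradient_through_time(2.0, [h[0] for h in hs_large])  # 2.0 ** 20
```

With a recurrent weight below 1 in magnitude the gradient shrinks geometrically, and above 1 it grows geometrically, which is why long sequences are hard to train on.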
LSTM: long-term memory (adds an extra memory signal on top of the RNN)
- Forget gate
- Input gate
- Update gate
- Output gate
- Eg.
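
As a rough illustration of how the four gates interact, here is a sketch of one LSTM step for scalar input and state, using the commonly cited LSTM equations. These notes do not spell the equations out, so the details are assumptions: in particular, the "update gate" is mapped here to the tanh candidate value, and the `params` layout (gate name to `(w_x, w_h, b)`) is hypothetical.

```python
import math

def sigmoid(z):
    """Logistic function, squashing z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step for scalar input x_t, hidden state h_prev, cell state c_prev.

    params maps each gate name to (w_x, w_h, b); the names are illustrative.
    """
    def pre(name):
        w_x, w_h, b = params[name]
        return w_x * x_t + w_h * h_prev + b

    f = sigmoid(pre("forget"))    # how much of the old cell state to keep
    i = sigmoid(pre("input"))     # how much of the candidate to write
    g = math.tanh(pre("update"))  # candidate value for the cell state (assumed mapping)
    o = sigmoid(pre("output"))    # how much of the cell state to expose
    c_t = f * c_prev + i * g      # memory signal carried across time steps
    h_t = o * math.tanh(c_t)
    return h_t, c_t

# Forget gate nearly open and input gate nearly closed: the cell state is preserved.
params = {
    "forget": (0.0, 0.0, 10.0),
    "input": (0.0, 0.0, -10.0),
    "update": (0.0, 0.0, 0.0),
    "output": (0.0, 0.0, 0.0),
}
h_t, c_t = lstm_step(1.0, 0.0, 0.7, params)
```

Because the cell state is updated additively and gated by `f`, it can carry information over many steps, which is the long-term memory the heading refers to.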
Attention model
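- $h^{(t)}$ is the intermediate hidden-layer output of the encoder network
- $C_t$ is the context vector at time $t$: $C_t = \sum_{i=1}^{T_x}\alpha_{ti}h^{(i)}$. The assigned weights $\alpha$ are called global alignment weights
  - $e_{ij} = score(H^{(i-1)},h^{(j)})$
    - Additive model
    - Multiplicative model
    - Dot-product model
    - Scaled dot-product model
  - $\alpha_{ij} =\frac{exp(e_{ij})}{\sum_{k=1}^{T_x}exp(e_{ik})}$ , i.e. a softmax
- $H^{(t)}$ is the hidden-layer output of the decoder network
- In a plain RNN model, by contrast, the intermediate state comes from the last hidden layer of the input (encoder) network, so it is generally a fixed-size vector. A fixed-size vector can store only a limited amount of information. As sentences get longer, the intermediate state has to express more and more, because all the information the downstream decoder receives comes from it.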
- Process
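  - 1. The encoder network computes $h^{(1)}, \cdots, h^{(T_x)}$ as before
  - 2. In the decoder network, for the $k$-th output word
    - compute the global alignment weights for $C_k$
    - compute $C_k$ as the weighted sum
  - 3. Compute the new $H^{(k)}$ , using $H^{(k-1)},\ y^{(k-1)},\ C_k$
  - 4. Compute the final $y^{(k)}$
  - 5. k++, repeat until the network outputs $<end>$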
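
The score, softmax, and weighted-sum steps for a single decoder position can be sketched as follows. The function names are illustrative. Only the dot-product and scaled dot-product scores are shown, since the additive and multiplicative variants need extra learned weights that these notes do not define.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot_score(H_prev, h_j):
    """Dot-product model: e_ij = H^(i-1) . h^(j)."""
    return sum(a * b for a, b in zip(H_prev, h_j))

def scaled_dot_score(H_prev, h_j):
    """Scaled dot-product model: the dot product divided by sqrt(dimension)."""
    return dot_score(H_prev, h_j) / math.sqrt(len(h_j))

def attention_context(H_prev, encoder_states, score=dot_score):
    """Return (alpha, C): alignment weights and context vector for one decoder step."""
    e = [score(H_prev, h_j) for h_j in encoder_states]
    alpha = softmax(e)
    dim = len(encoder_states[0])
    C = [sum(a * h_j[d] for a, h_j in zip(alpha, encoder_states)) for d in range(dim)]
    return alpha, C

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# A zero decoder state scores every encoder state equally, so C is their plain average.
uniform_alpha, uniform_C = attention_context([0.0, 0.0], encoder_states)
# A decoder state aligned with the first axis mostly ignores the second encoder state.
focused_alpha, focused_C = attention_context([10.0, 0.0], encoder_states)
```

Unlike the fixed-size intermediate state of a plain encoder-decoder RNN, the context vector is recomputed for every output word, so each step can draw on different encoder positions.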
DGN
1. Auto-encoder
  - A form of unsupervised learning
  - Can also be used in supervised models: the decoder is obtained by training on data with the labels removed
2. VAE(Variational Autoencoder)
3. GAN(Generative Adversarial Networks)
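  - For the generator, we try to produce more realistic images in order to increase the discriminator's loss.
  - For the discriminator, we try to tell real from generated images more accurately, in order to reduce its loss.
  - Training can be viewed as a two-player zero-sum game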
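
The zero-sum view can be made concrete with a small sketch of the minimax value function $V(D, G) = \mathbb{E}[\log D(x)] + \mathbb{E}[\log(1 - D(G(z)))]$. This assumes the standard minimax formulation, which these notes do not write out. The inputs `d_real` and `d_fake` are hypothetical discriminator outputs (probabilities that a sample is real), not results from any trained model. In practice the generator is often trained with a different, non-saturating loss, so treat this only as an illustration of the game structure.

```python
import math

def value_function(d_real, d_fake):
    """V(D, G): mean log D(x) over real samples plus mean log(1 - D(G(z))) over fakes."""
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def minimax_losses(d_real, d_fake):
    """Discriminator minimizes -V, generator minimizes V: the two losses sum to zero."""
    v = value_function(d_real, d_fake)
    return -v, v

# Discriminator is confident and correct: its loss is low.
d_loss_easy, g_loss_easy = minimax_losses([0.9, 0.9], [0.1, 0.1])
# Generator produces more convincing fakes: the discriminator's loss goes up.
d_loss_hard, g_loss_hard = minimax_losses([0.9, 0.9], [0.6, 0.6])
```

Whatever one player gains the other loses, which is what makes the game zero-sum: more realistic fakes raise the discriminator's loss by exactly the amount they lower the generator's.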