Deeplearning Algorithms tutorial
Google's AI sits at the global forefront, with image recognition, speech recognition, and autonomous driving already deployed in practice. Baidu has, in effect, taken up the banner of AI in China, covering autonomous driving, intelligent assistants, image recognition, and much more. Apple has also begun to embrace machine learning across the board, entering the smart speaker market and building workstation-class Macs. Meanwhile, Tencent's deep learning platform Mariana already powers WeChat's speech recognition (the voice input method, the open speech platform, and long-press voice-message-to-text) and is being applied to image recognition inside WeChat. All of the world's top ten technology companies are pushing hard on both AI theory and applications. Getting started is hard, but once you are in, the experts are not far ahead of you!
Machine learning is practiced in three main settings: supervised learning, unsupervised learning, and semi-supervised learning.
(1) Supervised learning: learn a function from a given training set so that, when new data arrives, the function can predict the corresponding output. The training set must contain both inputs and outputs, i.e., features and targets, and the targets are labeled. Well-established supervised algorithms include classifiers such as naive Bayes, SVM, the ID3 and C4.5 decision trees, and the currently popular artificial neural networks such as BP networks, RBF networks, Hopfield networks, deep belief networks, and convolutional neural networks. An artificial neural network imitates the way the human brain analyzes information: it has a visible (input) layer, hidden layers, and an output layer, each made of neurons whose on/off state is determined by the data. Supervised algorithms can also perform regression, the most commonly used being logistic regression.
(2) Unsupervised learning: in contrast to supervised learning, the class labels of the training set are unknown, and the number (or set) of classes to be learned may not be known in advance. Common unsupervised algorithms include clustering and association-rule mining, e.g., k-means and Apriori.
(3) Semi-supervised learning: sits between supervised and unsupervised learning; the EM algorithm is a typical example. A minimal sketch contrasting the first two settings follows this list.
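To make the distinction concrete, here is a minimal numpy sketch (my own illustration, not part of the original text): a supervised least-squares fit that learns from labeled targets, next to an unsupervised k-means loop that only ever sees the inputs.

import numpy as np

rng = np.random.RandomState(0)

# Supervised: features X come with labeled targets y; we learn a mapping X -> y.
X = rng.randn(100, 3)
true_w = np.array([1.5, -2.0, 0.5])
y = X.dot(true_w) + 0.1 * rng.randn(100)       # labeled targets
w_hat = np.linalg.pinv(X).dot(y)               # least-squares estimate of the mapping
print('recovered weights:', w_hat)

# Unsupervised: only X2 is given, with no labels; k-means groups it into k clusters.
X2 = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + 5.0])
k = 2
centers = X2[rng.choice(len(X2), k, replace=False)]
for _ in range(10):
    labels = ((X2[:, None, :] - centers) ** 2).sum(-1).argmin(1)     # assignment step
    centers = np.array([X2[labels == j].mean(0) for j in range(k)])  # update step
print('cluster centers:', centers)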
Research in machine learning today proceeds mainly along three lines: 1) task-oriented research, i.e., studying and improving learning systems that carry out a predefined set of tasks; 2) cognitive modelling, i.e., studying the human learning process and simulating it computationally; 3) theoretical analysis, i.e., exploring possible algorithms, and domain-independent algorithms, at the theoretical level.
Autoencoder
An autoencoder is an unsupervised learning algorithm, used mainly for dimensionality reduction and feature extraction. In deep learning, an autoencoder can be used before the training phase begins to determine initial values for the weight matrices.
A neural network's weight matrices can be viewed as a feature transformation of the input: the data is first encoded into another representation, and learning then proceeds on top of that encoding. When we initialize the weights, however, we do not know what role the initial values will play during training, nor how the weights will change as training proceeds. A reasonable requirement is therefore that, when the initial weight matrix is used to encode the data, the encoded representation should preserve the main features of the original data as well as possible. How do we measure whether the encoded data retains enough information? If the encoded data can be decoded back into the original data fairly easily, we consider the information to be well preserved.
In an autoencoder, the original input x is weighted (W, b) and passed through a mapping (sigmoid) to obtain y; y is then weighted and mapped back in the reverse direction to produce z.
Both sets of weights (W, b) are trained by repeated iteration so as to minimize the error function, i.e., to keep z as close to x as possible, ideally a perfect reconstruction of x.
If that happens, the first, forward set of weights (W, b) has done its job: it has learned the key features of the input, otherwise the reconstruction could not be this good.
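As a concrete illustration of this forward pass and reconstruction, here is a minimal numpy sketch (my own illustration, not part of the tutorial's example below; the symbols follow the description above and the sizes are arbitrary):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.RandomState(42)
x = rng.rand(100, 20)                          # 100 samples of 20-dimensional input

W1, b1 = 0.1 * rng.randn(20, 8), np.zeros(8)   # first set of (W, b): the encoder
W2, b2 = 0.1 * rng.randn(8, 20), np.zeros(20)  # second set of (W, b): the decoder

y = sigmoid(x.dot(W1) + b1)                    # encode: x -> y
z = sigmoid(y.dot(W2) + b2)                    # decode: y -> z, the reconstruction
loss = np.mean((z - x) ** 2)                   # error minimized by iterating on both (W, b)
print('reconstruction error:', loss)

Gradient descent (or any other optimizer) would then repeatedly update W1, b1, W2, b2 to drive this reconstruction error down.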
Two things make this procedure interesting. First, no data labels are used in computing the error that updates the parameters, so this is unsupervised learning.
Second, it uses something like a neural network with two hidden layers to extract features from the samples in a rather brute-force way.
Whether two sets of weights are really needed is debatable. The earliest autoencoders did use two sets of (W, b), but Vincent's 2010 paper showed that a single W is sufficient:
set W' = W^T; W and W' are then called tied weights. Experiments show that a separately trained W' contributes essentially nothing, so there is no need to train it.
The reverse reconstruction matrix is reminiscent of an inverse matrix: if W^-1 = W^T then W is orthogonal, so W can in effect be trained towards an approximately orthogonal matrix.
Since W' adds nothing of its own, it plays no further role once training is finished. The forward pass only needs W, which in effect pre-encodes the input before handing it to the next layer. That is why this is called an autoencoder rather than an auto-encoder-decoder.
An autoencoder effectively creates one hidden layer, so a natural idea is to place it at the front of a deep network as a first-stage filter on the raw signal, reducing dimensionality and extracting features. There is a catch, though: an autoencoder can be seen as a nonlinear, beefed-up version of PCA, and the benefit of PCA comes precisely from reducing dimensionality.
Now consider a CNN-like architecture, where the number of neurons per layer grows as the layers go deeper. If an autoencoder is used to pretrain such layers, aren't we increasing the dimensionality? Is that really all right?
The experimental results reported in the related papers suggest that dimension-increasing autoencoders still work quite well. A likely reason is that a nonlinear network is very expressive: even though the number of neurons grows, the contribution of each individual neuron shrinks.
At the same time, stochastic gradient descent gives the subsequent supervised training a good starting point. On balance, increasing the dimensionality does more good than harm. The sketch below illustrates how the pretrained encoder is reused.
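To make the pretraining idea concrete, here is a minimal sketch (again my own illustration with made-up sizes): the encoder half of a trained autoencoder is kept and reused as the first layer of the supervised network, even when it increases the dimensionality.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# W_enc and b_enc stand in for weights obtained by autoencoder pretraining on
# unlabeled data; here they are just random placeholders.
rng = np.random.RandomState(1)
W_enc = 0.1 * rng.randn(20, 32)                # 20 -> 32: the dimension actually increases
b_enc = np.zeros(32)

def pretrained_first_layer(x):
    # acts as the initial filter at the front of the deep network
    return sigmoid(x.dot(W_enc) + b_enc)

x_batch = rng.rand(5, 20)                      # a batch of labeled inputs
h = pretrained_first_layer(x_batch)            # features handed to the following layers
print(h.shape)                                 # (5, 32)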
Example application
from __future__ import division
import tensorflow as tf
import numpy as np
import logging
import json
import os
class TextAutoencoder(object):
"""
Class that encapsulates the encoder-decoder architecture to
reconstruct pieces of text.
"""
def __init__(self, lstm_units, embeddings, go, train=True,
train_embeddings=False, bidirectional=True):
"""
        Initialize the encoder/decoder and create the Tensor objects.
:param lstm_units: number of LSTM units
:param embeddings: numpy array with initial embeddings
:param go: index of the GO symbol in the embedding matrix
:param train_embeddings: whether to adjust embeddings during training
:param bidirectional: whether to create a bidirectional autoencoder
(if False, a simple linear LSTM is used)
"""
# EOS and GO share the same symbol. Only GO needs to be embedded, and
# only EOS exists as a possible network output
self.go = go
self.eos = go
self.bidirectional = bidirectional
self.vocab_size = embeddings.shape[0]
self.embedding_size = embeddings.shape[1]
self.global_step = tf.Variable(0, name='global_step', trainable=False)
# the sentence is the object to be memorized
self.sentence = tf.placeholder(tf.int32, [None, None], 'sentence')
self.sentence_size = tf.placeholder(tf.int32, [None],
'sentence_size')
self.l2_constant = tf.placeholder(tf.float32, name='l2_constant')
self.clip_value = tf.placeholder(tf.float32, name='clip')
self.learning_rate = tf.placeholder(tf.float32, name='learning_rate')
self.dropout_keep = tf.placeholder(tf.float32, name='dropout_keep')
self.decoder_step_input = tf.placeholder(tf.int32,
[None],
'prediction_step')
name = 'decoder_fw_step_state_c'
self.decoder_fw_step_c = tf.placeholder(tf.float32,
[None, lstm_units], name)
name = 'decoder_fw_step_state_h'
self.decoder_fw_step_h = tf.placeholder(tf.float32,
[None, lstm_units], name)
self.decoder_bw_step_c = tf.placeholder(tf.float32,
[None, lstm_units],
'decoder_bw_step_state_c')
self.decoder_bw_step_h = tf.placeholder(tf.float32,
[None, lstm_units],
'decoder_bw_step_state_h')
with tf.variable_scope('autoencoder') as self.scope:
self.embeddings = tf.Variable(embeddings, name='embeddings',
trainable=train_embeddings)
initializer = tf.glorot_normal_initializer()
self.lstm_fw = tf.nn.rnn_cell.LSTMCell(lstm_units,
initializer=initializer)
self.lstm_bw = tf.nn.rnn_cell.LSTMCell(lstm_units,
initializer=initializer)
embedded = tf.nn.embedding_lookup(self.embeddings, self.sentence)
embedded = tf.nn.dropout(embedded, self.dropout_keep)
# encoding step
if bidirectional:
bdr = tf.nn.bidirectional_dynamic_rnn
ret = bdr(self.lstm_fw, self.lstm_bw,
embedded, dtype=tf.float32,
sequence_length=self.sentence_size,
scope=self.scope)
else:
ret = tf.nn.dynamic_rnn(self.lstm_fw, embedded,
dtype=tf.float32,
sequence_length=self.sentence_size,
scope=self.scope)
_, self.encoded_state = ret
if bidirectional:
encoded_state_fw, encoded_state_bw = self.encoded_state
# set the scope name used inside the decoder.
# maybe there's a more elegant way to do it?
fw_scope_name = self.scope.name + '/fw'
bw_scope_name = self.scope.name + '/bw'
else:
encoded_state_fw = self.encoded_state
fw_scope_name = self.scope
self.scope.reuse_variables()
# generate a batch of embedded GO
# sentence_size has the batch dimension
go_batch = self._generate_batch_go(self.sentence_size)
embedded_eos = tf.nn.embedding_lookup(self.embeddings,
go_batch)
embedded_eos = tf.reshape(embedded_eos,
[-1, 1, self.embedding_size])
decoder_input = tf.concat([embedded_eos, embedded], axis=1)
# decoding step
# We give the same inputs to the forward and backward LSTMs,
# but each one has its own hidden state
# their outputs are concatenated and fed to the softmax layer
if bidirectional:
outputs, _ = tf.nn.bidirectional_dynamic_rnn(
self.lstm_fw, self.lstm_bw, decoder_input,
self.sentence_size, encoded_state_fw, encoded_state_bw)
# concat fw and bw outputs
outputs = tf.concat(outputs, -1)
else:
outputs, _ = tf.nn.dynamic_rnn(
self.lstm_fw, decoder_input, self.sentence_size,
encoded_state_fw)
self.decoder_outputs = outputs
# now project the outputs to the vocabulary
with tf.variable_scope('projection') as self.projection_scope:
            # logits has shape (batch, max_sentence_size, vocab_size)
self.logits = tf.layers.dense(outputs, self.vocab_size)
# tensors for running a model
embedded_step = tf.nn.embedding_lookup(self.embeddings,
self.decoder_step_input)
state_fw = tf.nn.rnn_cell.LSTMStateTuple(self.decoder_fw_step_c,
self.decoder_fw_step_h)
state_bw = tf.nn.rnn_cell.LSTMStateTuple(self.decoder_bw_step_c,
self.decoder_bw_step_h)
with tf.variable_scope(fw_scope_name, reuse=True):
ret_fw = self.lstm_fw(embedded_step, state_fw)
step_output_fw, self.decoder_fw_step_state = ret_fw
if bidirectional:
with tf.variable_scope(bw_scope_name, reuse=True):
ret_bw = self.lstm_bw(embedded_step, state_bw)
step_output_bw, self.decoder_bw_step_state = ret_bw
step_output = tf.concat(axis=1, values=[step_output_fw,
step_output_bw])
else:
step_output = step_output_fw
with tf.variable_scope(self.projection_scope, reuse=True):
self.projected_step_output = tf.layers.dense(step_output,
self.vocab_size)
if train:
self._create_training_tensors()
def _create_training_tensors(self):
"""
Create member variables related to training.
"""
eos_batch = self._generate_batch_go(self.sentence_size)
eos_batch = tf.reshape(eos_batch, [-1, 1])
decoder_labels = tf.concat([self.sentence, eos_batch], -1)
projection_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES,
scope=self.projection_scope.name)
# a bit ugly, maybe we should improve this?
projection_w = [var for var in projection_vars
if 'kernel' in var.name][0]
projection_b = [var for var in projection_vars
if 'bias' in var.name][0]
# set the importance of each time step
# 1 if before sentence end or EOS itself; 0 otherwise
max_len = tf.shape(self.sentence)[1]
mask = tf.sequence_mask(self.sentence_size + 1, max_len + 1, tf.float32)
num_actual_labels = tf.reduce_sum(mask)
projection_w_t = tf.transpose(projection_w)
# reshape to have batch and time steps in the same dimension
decoder_outputs2d = tf.reshape(self.decoder_outputs,
[-1, tf.shape(self.decoder_outputs)[-1]])
labels = tf.reshape(decoder_labels, [-1, 1])
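        # sampled softmax: approximate the full softmax over the vocabulary by
        # contrasting each target with 100 sampled classes, which is much cheaper
        # for large vocabularies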
sampled_loss = tf.nn.sampled_softmax_loss(
projection_w_t, projection_b, labels, decoder_outputs2d, 100,
self.vocab_size)
masked_loss = tf.reshape(mask, [-1]) * sampled_loss
self.loss = tf.reduce_sum(masked_loss) / num_actual_labels
optimizer = tf.train.AdamOptimizer(self.learning_rate)
gradients, v = zip(*optimizer.compute_gradients(self.loss))
gradients, _ = tf.clip_by_global_norm(gradients, self.clip_value)
self.train_op = optimizer.apply_gradients(zip(gradients, v),
global_step=self.global_step)
def get_trainable_variables(self):
"""
Return all trainable variables inside the model
"""
return tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES)
def train(self, session, save_path, train_data, valid_data,
batch_size, epochs, learning_rate, dropout_keep,
clip_value, report_interval):
"""
Train the model
:param session: tensorflow session
:param train_data: Dataset object with training data
:param valid_data: Dataset object with validation data
:param batch_size: batch size
:param learning_rate: initial learning rate
:param dropout_keep: the probability that each LSTM input/output is kept
:param epochs: how many epochs to train for
:param clip_value: value to clip tensor norm during training
:param save_path: folder to save the model
:param report_interval: report after that many batches
"""
saver = tf.train.Saver(self.get_trainable_variables(),
max_to_keep=1)
best_loss = 10000
accumulated_loss = 0
batch_counter = 0
num_sents = 0
# get all data at once. we need all matrices with the same size,
# or else they don't fit the placeholders
# train_sents, train_sizes = train_data.join_all(self.go,
# self.num_time_steps,
# shuffle=True)
# del train_data # save memory...
valid_sents, valid_sizes = valid_data.join_all(self.go,
shuffle=True)
train_data.reset_epoch_counter()
feeds = {self.clip_value: clip_value,
self.dropout_keep: dropout_keep,
self.learning_rate: learning_rate}
while train_data.epoch_counter < epochs:
batch_counter += 1
train_sents, train_sizes = train_data.next_batch(batch_size)
feeds[self.sentence] = train_sents
feeds[self.sentence_size] = train_sizes
_, loss = session.run([self.train_op, self.loss], feeds)
# multiply by len because some batches may be smaller
# (due to bucketing), then take the average
accumulated_loss += loss * len(train_sents)
num_sents += len(train_sents)
if batch_counter % report_interval == 0:
avg_loss = accumulated_loss / num_sents
accumulated_loss = 0
num_sents = 0
# we can't use all the validation at once, since it would
# take too much memory. running many small batches would
# instead take too much time. So let's just sample it.
sample_indices = np.random.randint(0, len(valid_data),
5000)
validation_feeds = {
self.sentence: valid_sents[sample_indices],
self.sentence_size: valid_sizes[sample_indices],
self.dropout_keep: 1}
loss = session.run(self.loss, validation_feeds)
msg = '%d epochs, %d batches\t' % (train_data.epoch_counter,
batch_counter)
msg += 'Avg batch loss: %f\t' % avg_loss
msg += 'Validation loss: %f' % loss
if loss < best_loss:
best_loss = loss
self.save(saver, session, save_path)
msg += '\t(saved model)'
logging.info(msg)
def save(self, saver, session, directory):
"""
Save the autoencoder model and metadata to the specified
directory.
"""
model_path = os.path.join(directory, 'model')
saver.save(session, model_path)
metadata = {'vocab_size': self.vocab_size,
'embedding_size': self.embedding_size,
'num_units': self.lstm_fw.output_size,
'go': self.go,
'bidirectional': self.bidirectional
}
metadata_path = os.path.join(directory, 'metadata.json')
with open(metadata_path, 'wb') as f:
json.dump(metadata, f)
@classmethod
def load(cls, directory, session, train=False):
"""
Load an instance of this class from a previously saved one.
:param directory: directory with the model files
:param session: tensorflow session
:param train: if True, also create training tensors
:return: a TextAutoencoder instance
"""
model_path = os.path.join(directory, 'model')
metadata_path = os.path.join(directory, 'metadata.json')
with open(metadata_path, 'rb') as f:
metadata = json.load(f)
dummy_embeddings = np.empty((metadata['vocab_size'],
metadata['embedding_size'],),
dtype=np.float32)
ae = TextAutoencoder(metadata['num_units'], dummy_embeddings,
metadata['go'], train=train,
bidirectional=metadata['bidirectional'])
vars_to_load = ae.get_trainable_variables()
if not train:
# if not flagged for training, the embeddings won't be in
# the list
vars_to_load.append(ae.embeddings)
saver = tf.train.Saver(vars_to_load)
saver.restore(session, model_path)
return ae
def encode(self, session, inputs, sizes):
"""
Run the encoder to obtain the encoded hidden state
:param session: tensorflow session
:param inputs: 2-d array with the word indices
:param sizes: 1-d array with size of each sentence
:return: a 2-d numpy array with the hidden state
"""
feeds = {self.sentence: inputs,
self.sentence_size: sizes,
self.dropout_keep: 1}
state = session.run(self.encoded_state, feeds)
if self.bidirectional:
state_fw, state_bw = state
return np.hstack((state_fw.c, state_bw.c))
return state.c
def run(self, session, inputs, sizes):
"""
Run the autoencoder with the given data
:param session: tensorflow session
:param inputs: 2-d array with the word indices
:param sizes: 1-d array with size of each sentence
:return: a 2-d array (batch, output_length) with the answer
produced by the autoencoder. The output length is not
fixed; it stops after producing EOS for all items in the
batch or reaching two times the maximum number of time
steps in the inputs.
"""
feeds = {self.sentence: inputs,
self.sentence_size: sizes,
self.dropout_keep: 1}
state = session.run(self.encoded_state, feeds)
if self.bidirectional:
state_fw, state_bw = state
else:
state_fw = state
time_steps = 0
max_time_steps = 2 * len(inputs[0])
answer = []
input_symbol = self.go * np.ones_like(sizes, dtype=np.int32)
# this array control which sequences have already been finished by the
# decoder, i.e., for which ones it already produced the END symbol
sequences_done = np.zeros_like(sizes, dtype=np.bool)
while True:
# we could use tensorflow's rnn_decoder, but this gives us
# finer control
feeds = {self.decoder_fw_step_c: state_fw.c,
self.decoder_fw_step_h: state_fw.h,
self.decoder_step_input: input_symbol,
self.dropout_keep: 1}
if self.bidirectional:
feeds[self.decoder_bw_step_c] = state_bw.c
feeds[self.decoder_bw_step_h] = state_bw.h
ops = [self.projected_step_output,
self.decoder_fw_step_state,
self.decoder_bw_step_state]
outputs, state_fw, state_bw = session.run(ops, feeds)
else:
ops = [self.projected_step_output,
self.decoder_fw_step_state]
outputs, state_fw = session.run(ops, feeds)
input_symbol = outputs.argmax(1)
answer.append(input_symbol)
            # accumulate with a bitwise or so that sequences which already
            # produced EOS stay marked as finished and we avoid infinite loops
sequences_done |= (input_symbol == self.eos)
if sequences_done.all() or time_steps > max_time_steps:
break
else:
time_steps += 1
return np.hstack(answer)
def _generate_batch_go(self, like):
"""
        Generate a 1-d tensor with copies of the GO/EOS index, one per batch item.
        :param like: a tensor whose shape the returned tensor should match
        :return: a tensor with the same shape as `like`, filled with the GO index
"""
ones = tf.ones_like(like)
return ones * self.go
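Appended right after the listing above (so its imports and the class are already in scope), a hedged usage sketch: the embedding matrix and the GO index are random stand-ins, and the Dataset objects expected by train() (providing next_batch, join_all, epoch_counter and reset_epoch_counter) must come from the surrounding project, so the actual training call is left commented out.

embeddings = np.random.uniform(-0.1, 0.1, (10000, 300)).astype(np.float32)  # stand-in matrix
model = TextAutoencoder(lstm_units=500, embeddings=embeddings, go=1,
                        train=True, bidirectional=True)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # with real Dataset objects in hand, training would look like:
    # model.train(sess, 'saved-model', train_data, valid_data,
    #             batch_size=32, epochs=10, learning_rate=0.001,
    #             dropout_keep=0.8, clip_value=5.0, report_interval=100)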