
Language Model Using Differentiable Neural Computer Based on Forget Gate-Based Memory Deallocation


Computers, Materials & Continua, 2021, Issue 7

Donghyun Lee, Hosung Park, Soonshin Seo, Changmin Kim, Hyunsoo Son, Gyujin Kim and Ji-Hwan Kim

Department of Computer Science and Engineering, Sogang University, Seoul, 04107, Korea

Abstract: A differentiable neural computer (DNC) is analogous to a Von Neumann machine with a neural network controller that interacts with an external memory through an attention mechanism. Such DNCs offer a generalized method for task-specific deep learning models and have demonstrated reliability on reasoning problems. In this study, we apply a DNC to a language model (LM) task. The LM task is a reasoning problem, because it predicts the next word using the previous word sequence. However, memory deallocation is a problem in DNCs: information unrelated to the input sequence is not deallocated and remains in the external memory, which degrades performance. Therefore, we propose a forget gate-based memory deallocation (FMD) method, which searches for the minimum value of the elements in a forget gate-based retention vector. The forget gate-based retention vector indicates the retention degree of the information stored in each external memory address. In experiments, we applied our proposed DNC architecture to LM tasks as a task-specific example and to rescoring for speech recognition as a general-purpose example. For the LM tasks, we evaluated the DNC on the Penn Treebank and enwik8 LM tasks. Although it does not yield state-of-the-art (SOTA) results on LM tasks, the FMD method exhibits relatively improved performance compared with the DNC in terms of bits-per-character. For the speech recognition rescoring task, FMD again showed a relative improvement on the LibriSpeech data in terms of word error rate.

Keywords: Forget gate-based memory deallocation; differentiable neural computer; language model; forget gate-based retention vector

1 Introduction

Various deep learning models have been used for a variety of task-specific problems [1]. For example, convolutional neural networks (CNNs) are commonly used for image and video tasks, and recurrent neural networks (RNNs) have been used for speech recognition and language model (LM) tasks. Deep learning models are usable for specific tasks if they are trained with pertinent datasets, but models adapted to specific tasks have difficulties with general-purpose tasks. General-purpose tasks are those whose training and test sets differ. For example, in a question-answering task, a story composed of multiple sentences is used to train deep learning models, which are then able to predict the next word or sentence. In the test stage (i.e., actual application), however, presenting a question to the trained model can leave the deep learning model unable to predict an answer. Despite various pre-training methods, the performance of deep learning models does not reach the best available performance on general-purpose tasks [2].

A differentiable neural computer (DNC) has been proposed for general-purpose tasks [3]. A DNC comprises a controller and an external memory. The controller is equivalent to a deep learning model, and the external memory is a matrix of M-dimensional vectors. On general-purpose tasks, the DNC stores information about the training data in the external memory. The controller writes its output vector to the external memory and reads the information stored therein. When the DNC performs read and write operations, attention vectors are generated using both content-based addressing and temporal linking methods [4]. In the test stage, the DNC predicts the answer using the information stored in the external memory. Previous work [5] showed that the DNC achieves performance close to the best available performance on the question-answering bAbI task [6].

However, DNCs exhibit a problem with memory deallocation during write operations. During each write operation, the external memory at time (t−1) is multiplied by an erase vector, and the attention vector is generated from the current external memory. If the i-th address of the external memory must be deallocated, the i-th element of the erase vector should be 1 during the write operation. However, the erase vector is generated with a sigmoid function; therefore, its elements can only equal 1 when the input is infinity. For example, in a question-answering task, if a new story is used as training data for a pre-trained DNC, information unrelated to the story should be erased from the external memory. However, the DNC does not do so reliably; the residual information becomes a garbage value that affects the attention vector. This causes performance degradation on general-purpose tasks.
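To make this limitation concrete, the following minimal sketch (our illustration, not code from the paper) shows that a sigmoid-generated erase gate is strictly less than 1 for any finite pre-activation, so a fraction of the old memory content always survives the write:

```python
import torch

# Hypothetical erase-gate pre-activations emitted by a controller.
logits = torch.tensor([2.0, 4.0, 8.0])
erase = torch.sigmoid(logits)            # strictly < 1 for finite inputs

old_value = torch.tensor(5.0)            # content stored at one memory slot
residual = old_value * (1.0 - erase)     # what survives the erase step

print(erase)      # tensor([0.8808, 0.9820, 0.9997])
print(residual)   # tensor([0.5960, 0.0899, 0.0017]) -- non-zero garbage remains
```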

In this study, we propose a DNC using forget gate-based memory deallocation (FMD). The FMD method searches for the minimum value of the elements in a forget gate-based retention vector, a vector indicating the retention degree of the information stored in each external memory address. Elements of the forget gate-based retention vector with a value of 0 imply that a read operation has already been performed on the external memory addresses related to these elements, and these addresses are to be deallocated. The minimum value of the elements is converted to 0, and the values of the external memory at time (t−1) are multiplied by the converted forget gate-based retention vector.

We evaluate our proposed DNC on benchmark LM tasks. Previous studies have found that DNCs are reliable on reasoning problems such as LM, which predicts the next word using the previous word sequence. Although previous studies on the Transformer have demonstrated performance competitive with the best available performance on LM tasks, we evaluate our proposed DNC on the Penn Treebank (PTB) and enwik8 LM tasks. In all our tests, our proposed DNC outperforms the unmodified DNC.

We organize our article as follows. Section 2 presents other work related to deep learning-based LMs. Section 3 provides a brief description of the DNC. Section 4 analyzes our proposed FMD method in detail. Finally, we present our experimental results and conclusions in Sections 5 and 6, respectively.

2 Related Works

An LM assigns the probability of the next word or character based on the previous word or character sequence. Conventionally, an n-gram LM based on the Markov assumption is used. However, the n-gram LM has two drawbacks. First, it assigns a probability of 0 to an unseen n-gram: if an n-gram is not found in the training text, it becomes an unseen n-gram. Second, the value of n is limited. If n is 10 and the vocabulary size is 1000, then the number of possible n-grams is 1000^10. Thus, the LM is affected by the curse of dimensionality, limiting its modeling ability on large-scale datasets [7].

A deep neural network (DNN) was applied to the LM to solve the unseen n-gram problem [8]. The DNN-based LM represents each word or character in a high-dimensional vector space, and the probability of an unseen n-gram can be calculated in the high-dimensional vector space generated by the DNN-based LM. Continuous bag-of-words and skip-gram are both representative DNN-based LMs [9]. However, DNN-based LMs present a few disadvantages. As n increases, the number of input layers increases as well, which causes the number of weight parameters to be trained to grow in proportion to the number of input layers. The DNN can only be trained within a limited context of length n [10]. Therefore, the DNN-based LM cannot solve the curse of dimensionality.

To address the limited context size, RNNs have been used to train the LM [10]. The RNN can model long-range sequences because of its recurrent hidden layers. In an RNN-based LM, the input of a recurrent hidden layer is one word or character at time t together with the output vector of the recurrent hidden layer at time (t−1). Therefore, theoretically, the RNN-based LM solves the problem of limited context size. Previous studies proposed a bidirectional RNN-based LM that demonstrated improved performance compared with the unmodified RNN-based LM [11]. The bidirectional RNN-based LM is trained not only with the forward context but also with the backward context. However, it is affected by the vanishing and exploding gradient problems. The vanishing gradient problem occurs when the gradients of an activation function become 0, whereas the exploding gradient problem occurs when the gradients of the activation function become infinite [12].

To address the vanishing gradient problem, the RNN-based LM uses long short-term memory (LSTM) [13]. The LSTM comprises one or more memory cells and input, output, and forget gates, and it controls the amplification and reduction of information. The LSTM RNN-based LM achieved higher performance than the RNN-based LM [14]. However, training the LSTM RNN is time-consuming because each of the three gates has to be trained. In addition, the LSTM RNN-based LM uses gradient clipping to prevent the exploding gradient problem [15]. The clipping factor for the gradient clipping technique is chosen by the developers and depends on the training datasets. In particular, recent studies have shown that an LSTM RNN-based LM cannot be trained with more than 200 words or characters in its input sequence [16].
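For reference, gradient clipping is typically applied as in the PyTorch sketch below; the model, data, and the clipping factor of 0.25 are placeholders rather than values taken from this paper:

```python
import torch
import torch.nn as nn

model = nn.LSTM(input_size=50, hidden_size=1024, num_layers=3)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.MSELoss()

x = torch.randn(120, 6, 50)            # (BPTT length, batch, features) -- dummy data
target = torch.randn(120, 6, 1024)

output, _ = model(x)
loss = criterion(output, target)

optimizer.zero_grad()
loss.backward()
# Rescale gradients so their global norm does not exceed the clipping factor,
# preventing exploding gradients during back-propagation through time.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.25)
optimizer.step()
```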

To address the problem of limited context size, attention mechanisms have also been used in the LM [17]. The next word or character is related to some words or characters in the context, but not to all of them. The attention mechanism determines the words or characters that must be attended to via the attention vector generated from the input sequence [18]. The Transformer is the most common model using an attention mechanism [19]. It is an encoder-decoder model that also uses a multi-head self-attention mechanism and positional encoding. The multi-head self-attention mechanism allows the Transformer to attend to context information in different high-dimensional vector spaces of the input sequence [20]. Positional encoding applies sine and cosine functions to the input sequence for long-context dependency [21]. Although the Transformer encodes a longer context into a fixed-size chunk, various Transformer-based models have also been proposed [22].
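A minimal sketch of the sine/cosine positional encoding described above (the standard Transformer formulation; the sequence length and model dimension are illustrative only):

```python
import math
import torch

def positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Return the (seq_len, d_model) sinusoidal positional encoding matrix."""
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
    return pe

# Added to the (seq_len, d_model) token embeddings before the first attention layer.
pe = positional_encoding(seq_len=120, d_model=512)
print(pe.shape)  # torch.Size([120, 512])
```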

The most widely used Transformer-based models are bidirectional encoder representations from Transformers (BERT), the generative pretrained Transformer 2 (GPT-2), and Transformer extra-long (Transformer-XL). BERT, a multi-layer bidirectional Transformer encoder, often functions as a pre-training model in natural language processing tasks [23]. The deep bidirectional encoder can be trained with the left and right context of the input sequence in all layers. BERT also demonstrates the best available performance on question-answering and named entity tasks. GPT-2 differs from BERT by using a multi-layer Transformer decoder [24]. On LM tasks, GPT-2 achieves the best available performance in terms of bits-per-character (BPC): 0.93 with 1.5 B weight parameters. That study used 12-layer decoder blocks with 12 heads. In contrast, Transformer-XL maintains a longer context using hidden states computed at the previous time step [22]. These hidden states represent the previous context. Transformer-XL demonstrated a perplexity of 54.52 on the word-level PTB LM task and a BPC of 0.99 on the character-level enwik8 LM task.

3 Differentiable Neural Computer

Read and write operations on the external memory require read and write weighting vectors. The DNC treats the read and write weighting vectors as attention vectors. A content-based addressing method has been used in previous studies to generate these vectors. This method calculates the cosine similarity between every memory vector and a key vector k_t, where k_t (∈ R^W) is one of the elements of the interface vector IF_t. The cosine similarity values are normalized using the softmax function. The DNC uses not only the content-based addressing method but also two further addressing methods: a temporal linking addressing method and a memory allocation-based addressing method. The temporal linking addressing method uses a temporal link matrix TLM_t (∈ R^{H×H}) to determine the memory vector written after or before the read operation in the previous time step; this method is used to generate the read weighting vector. The memory allocation-based method maintains a usage value for every memory vector. The usage values are incremented on write operations and decremented on read operations, and the memory address with the lowest usage value is the location for new information. The memory allocation-based method is used to generate the write weighting vector.
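As an illustration, content-based addressing can be sketched as follows; the key strength beta is part of the standard DNC interface, and the toy dimensions are ours:

```python
import torch
import torch.nn.functional as F

def content_addressing(memory: torch.Tensor, key: torch.Tensor, beta: float = 1.0) -> torch.Tensor:
    """Content-based addressing: cosine similarity between the key and every
    memory vector, sharpened by a key strength beta and normalized by softmax.

    memory: (H, W) external memory, key: (W,) key vector -> returns (H,) weighting.
    """
    similarity = F.cosine_similarity(memory, key.unsqueeze(0), dim=1)  # (H,)
    return F.softmax(beta * similarity, dim=0)

H, W = 4, 8                        # toy sizes matching Fig. 1
memory = torch.randn(H, W)
key = torch.randn(W)
weighting = content_addressing(memory, key)
print(weighting, weighting.sum())  # attention over 4 addresses, sums to 1
```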

Figure 1: Simplified diagram of the DNC. In this figure, the external memory has four memory vectors, and the dimension of each memory vector is eight. The temporal link matrix is a 4 × 4 matrix

During the read operation, a specific memory area is identified by the read weighting vector, and the DNC generates read vectors. A read vector is defined as a weighted summation of all external memory vectors. The i-th read vector at time t is defined as

r^i_t = (EM_t)^T w^{r,i}_t,    (1)

where (EM_t)^T (∈ R^{W×H}) and w^{r,i}_t (∈ R^H) are the transposed external memory and the i-th read weighting vector at time t, respectively.
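In code, the read operation of Eq. (1) reduces to a single matrix-vector product; the sketch below uses toy sizes for illustration only:

```python
import torch

H, W = 4, 8                          # number of memory vectors and their width
memory = torch.randn(H, W)           # external memory EM_t
read_weighting = torch.softmax(torch.randn(H), dim=0)  # i-th read weighting w^{r,i}_t

# Eq. (1): the read vector is the weighted sum of all memory vectors.
read_vector = memory.t() @ read_weighting   # (W,)
print(read_vector.shape)             # torch.Size([8])
```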

During the write operation, the DNC generates the write weighting vector, which is used in the write operation as

EM_t = EM_{t−1} ° (OM − w^w_t (e_t)^T) + w^w_t (v_t)^T,    (2)

where ° is the element-wise product and OM is a matrix with all values equal to 1. The size of OM is the same as that of the external memory. w^w_t (∈ R^H) is the write weighting vector at time t, (e_t)^T (∈ R^W) is the transposed erase vector at time t, and (v_t)^T (∈ R^W) is the transposed converted input vector of the controller at time t. In Eq. (2), the term (OM − w^w_t (e_t)^T) determines the ratio at which information in the external memory is deallocated.
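A minimal sketch of the write operation in Eq. (2); the weightings and vectors are random stand-ins for the outputs of the addressing mechanisms and the controller:

```python
import torch

H, W = 4, 8
memory = torch.randn(H, W)                    # EM_{t-1}
write_weighting = torch.softmax(torch.randn(H), dim=0)   # w^w_t, (H,)
erase = torch.sigmoid(torch.randn(W))         # e_t, (W,)
write_vector = torch.randn(W)                 # v_t, (W,)

ones = torch.ones(H, W)                       # the matrix OM in Eq. (2)
# Eq. (2): erase old content in proportion to the write weighting and the
# erase vector, then add the new content.
memory = memory * (ones - torch.outer(write_weighting, erase)) \
         + torch.outer(write_weighting, write_vector)
print(memory.shape)                           # torch.Size([4, 8])
```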

4 Differentiable Neural Computer Using Forget Gate-Based Memory Deallocation

During the write operation, the DNC uses a retention vector ψ_t (∈ [0,1]^H) to determine whether the usage of the information stored in the external memory should be increased or decreased. Thus, ψ_t indicates the retention degree of the information stored in each memory address. In the actual implementation, ψ_t is defined in Eq. (3) as a product of a sigmoid-generated gate term and a softmax-generated weighting term.

However, ψ_t is only used to determine whether to increase or decrease the usage of the information stored in the external memory; it does not modify the information stored in the external memory itself. Assuming that ψ_t[0] = 0, memory deallocation should be performed at the first external memory address. However, as shown in Fig. 2, the first external memory address is not deallocated because ψ_t only affects the generation of the write weighting vector. Therefore, the value of the first external memory address is maintained until training is complete.

Figure 2:Example of memory deallocation in the write operation of DNC

Hence, we introduce a DNC using FMD. A memory deallocation method similar to that shown in Eq. (5) has been proposed previously [4]. However, to deallocate the i-th external memory address with certainty, ψ_t[i] must be 0. In Eq. (3), one of the two factors has to be 0, but the gate term is generated using the sigmoid function and can therefore be 0 only when its input is negative infinity. In addition, because the weighting term is generated using the softmax function, a value of 0 is difficult to obtain.

In the proposed FMD, to obtain a 0 in ψ_t, the FMD searches for the minimum element of ψ_t and then converts it to 0. This process is the main difference between the previous methods and our proposed method, as the unmodified DNC does not select the minimum value of ψ_t for memory deallocation. We define the FMD as

EM_t = EM_{t−1} ° (ψ'_t 1^T) ° (OM − w^w_t (e_t)^T) + w^w_t (v_t)^T,    (5)

where 1 (∈ R^W) is a vector of ones and the converted retention vector ψ'_t is defined as

ψ'_t[i] = 0 if i = argmin_j ψ_t[j], and ψ'_t[i] = ψ_t[i] otherwise.

We assume that ψ'_t[0] = 0. In Fig. 3, the first external memory address is deallocated because ψ'_t directly affects the external memory. Therefore, the value of the first external memory address is not maintained; it is deallocated.

Figure 3:Example of forget gate-based memory deallocation in the write operation of DNC

The FMD searches for the minimum value of ψ_t. Because ψ_t is not sorted, this search has a time complexity of O(H), where H is the number of vectors in the external memory. Therefore, the time complexity of the FMD is O(H).
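The sketch below illustrates our reading of the FMD conversion step (find the minimum element of ψ_t in O(H), set it to 0, and scale the external memory rows by the converted vector); it is an illustrative reconstruction of Section 4, not the authors' released code:

```python
import torch

def fmd_deallocate(memory: torch.Tensor, retention: torch.Tensor):
    """Sketch of the FMD step described in Section 4 (our reading of the text).

    memory:    (H, W) external memory at time t-1
    retention: (H,) forget gate-based retention vector psi_t in [0, 1]
    """
    converted = retention.clone()
    # O(H) search for the minimum element, which is then converted to 0.
    idx = torch.argmin(converted)
    converted[idx] = 0.0
    # Multiply each memory row by its retention degree, so the address with
    # the converted value of 0 is fully deallocated.
    return converted.unsqueeze(1) * memory, converted

H, W = 4, 8
memory = torch.randn(H, W)
retention = torch.tensor([0.9, 0.2, 0.7, 0.5])
new_memory, converted = fmd_deallocate(memory, retention)
print(converted)            # tensor([0.9000, 0.0000, 0.7000, 0.5000])
print(new_memory[1])        # the second address is now all zeros
```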

5 Experiments and Discussion

We evaluated our proposed FMD-DNC on the character-level PTB and enwik8 LM tasks as task-specific examples. We also evaluated our proposed FMD-DNC-based LM on the rescoring task of speech recognition, a general-purpose example.

5.1 Experimental Environment

The character-level PTB LM task comprises characters collected from the Wall Street Journal domain [25]. The basic character-level PTB LM task does not contain a beginning-of-sentence marker or space markers between characters, making it difficult to distinguish word boundaries. Hence, in this experiment, a beginning-of-sentence marker and a space marker were added to the basic character-level PTB LM task. The character vocabulary used for the experiments contained 50 symbols. The numbers of characters in the training, validation, and test datasets were 4.88, 0.38, and 0.43 million, respectively. We repeated the experiments for the character-level PTB LM task five times to verify the stability of the LMs and test their generalization.

The enwik8 LM task contains 100 million characters of unprocessed Wikipedia text [20]. The enwik8 dataset is split into 90, 5, and 5 million characters for the training, validation, and test datasets, respectively, preserving the experimental environment of previous studies. Experiments for the enwik8 LM task were repeated three times.

For the rescoring task of speech recognition, we used the LibriSpeech corpus to train an acoustic model (AM) and LMs. It consists of 920 h of speech from an audiobook domain. To train the AM, we used the Kaldi speech recognition toolkit. We used an AM based on a DNN combined with a hidden Markov model. The numbers of hidden layers and hidden nodes were 6 and 3500, respectively. The learning rate was initialized at 1.5 × 10−3. The test set consists of four subsets: dev_clean, dev_other, test_clean, and test_other. We generated 100-best lists from the speech recognition results of each test subset. To generate the 100-best lists, we set the acoustic scale to 12. To rescore the 100-best lists, we used Eq. (8) to calculate the likelihood L of each sentence:

where ascore is the acoustic score generated by the AM and lmscore_nn is the language score generated by a neural network-based LM. All neural network-based LMs were trained with the transcriptions of the training set for the AM. This text consisted of 40 million characters, with 30, 5, and 5 million characters used to construct the training, validation, and test sets, respectively.
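The 100-best rescoring step can be sketched as follows; since Eq. (8) is not reproduced here, the combination below assumes a simple weighted sum of ascore and lmscore_nn with a placeholder weight lm_scale:

```python
from typing import List, Tuple

def rescore_nbest(nbest: List[Tuple[str, float, float]], lm_scale: float = 1.0):
    """Pick the best hypothesis from an n-best list.

    nbest: list of (sentence, ascore, lmscore_nn) triples, where ascore is the
    acoustic score from the AM and lmscore_nn the score from a neural LM.
    The combination below is an assumed form of Eq. (8), not the paper's exact one.
    """
    best_sentence, best_L = None, float("-inf")
    for sentence, ascore, lmscore_nn in nbest:
        L = ascore + lm_scale * lmscore_nn   # assumed likelihood combination
        if L > best_L:
            best_sentence, best_L = sentence, L
    return best_sentence

nbest = [("the cat sat", -120.3, -15.2), ("the cat sad", -119.8, -21.7)]
print(rescore_nbest(nbest))   # "the cat sat" under this toy scoring
```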

We used BPC and inference time as the evaluation metrics for the LM tasks. BPC is the average number of bits needed to encode one character, with the bit as the unit of entropy [20]. We defined BPC as loss/log(2). We measured the inference time per batch. On the rescoring task of speech recognition, we used the word error rate (WER) as the evaluation metric. Our system used a 3.40 GHz Intel Xeon E5-2643 v4 CPU and four Nvidia GTX 1080 Ti GPUs.
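As a concrete reference, BPC can be computed from the character-level cross-entropy loss as in the sketch below (toy tensors, our illustration):

```python
import math
import torch
import torch.nn.functional as F

def bits_per_character(logits: torch.Tensor, targets: torch.Tensor) -> float:
    """BPC = mean cross-entropy loss (in nats) / log(2).

    logits: (N, vocab_size) model outputs, targets: (N,) character indices.
    """
    loss = F.cross_entropy(logits, targets)      # natural-log loss per character
    return loss.item() / math.log(2)

logits = torch.randn(10, 50)                     # 10 characters, 50-symbol vocabulary
targets = torch.randint(0, 50, (10,))
print(bits_per_character(logits, targets))       # ~log2(50) ≈ 5.6 for random logits
```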

5.2 Character-Level PTB LM Task

5.2.1 Experimental Setup

The baseline LM was the LSTM RNN-based LM, which we trained using PyTorch with the following hyper-parameters: number of hidden layers, 3; number of hidden nodes in each hidden layer, 1024; number of nodes in the embedding layer, 50; learning rate initialized at 1×10−1; batch size, 6; weight decay, 1×10−6; length of back-propagation through time (BPTT), 120. We used the following hyper-parameters for training the Transformer-based LM: number of hidden layers, 3; number of hidden nodes in each hidden layer, 1024; number of nodes in the embedding layer, 50; number of heads in the encoder and decoder, 4; learning rate initialized at 1×10−3; batch size, 6; weight decay, 1×10−6; length of input chunks, 120. The experimental results were the same as those of our previous work [26].

On the character-level PTB LM task, the SOTA LMs were the trellis network and the AWD-LSTM network. We compared the LM based on our proposed DNC with these best available LMs. The hyper-parameters of the trellis and AWD-LSTM networks were the same as those used in previous studies. In the experiments, we used a batch size of 6 and a BPTT length of 120. We applied a dropout factor of 0.5, which was not applied to the embedding, input, and output layers.

To train the basic DNC-based LM, we used the LSTM RNN as the controller. The following hyper-parameters were used: number of hidden layers, 3; number of nodes in the embedding layer, 50; numbers of hidden nodes in the hidden layers, 1024, 512, and 512. The external memory used 1024 memory vectors with 512 dimensions each. The learning rate was initialized to 1×10−3. In PyTorch, we used a scheduler module to reduce the learning rate when the objective function plateaued, with a reduction rate of 1×10−1. We used a batch size of 6, a weight decay of 1×10−7, and a BPTT length of 120.
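The learning-rate schedule described above corresponds to PyTorch's ReduceLROnPlateau; the sketch below shows the pattern with a stand-in model, using a reduction factor of 0.1 to match the 1×10−1 reduction rate:

```python
import torch
import torch.nn as nn

model = nn.LSTM(input_size=50, hidden_size=1024, num_layers=3)   # stand-in for the controller
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-7)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1)

for epoch in range(3):
    validation_loss = 1.0      # placeholder; computed on the validation set in practice
    # The scheduler lowers the learning rate by the given factor when the
    # monitored objective stops improving (plateaus).
    scheduler.step(validation_loss)
```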

To train the FMD-DNC-based LM, we used the LSTM RNN as the controller. The following hyper-parameters were used: number of hidden layers, 2; number of nodes in the embedding layer, 50; number of hidden nodes in the hidden layer, 1024. The external memory used 128 memory vectors with 256 dimensions each. The learning rate was initialized to 1×10−6. We again used a scheduler module to reduce the learning rate when the objective function plateaued, with a reduction rate of 9×10−1. The batch size was 10, the weight decay was 1×10−7, and the BPTT length was 120. To train the DNC using the memory deallocation (MD) method [4], we used the LSTM RNN as the controller and the same hyper-parameters as for the FMD-DNC-based LM.

5.2.2 Experimental Results

We evaluated the performance of the FMD-DNC-based LM with respect to the number of read vectors and the value of the weight decay. The FMD-DNC-based LM with a single read vector outperformed the other FMD-DNC-based LMs, with a BPC of 1.5920. While it did not reach the best available performance, that of the Transformer-based LM, the FMD method showed a relative improvement of 0.41% compared with the DNC in terms of BPC. We analyzed the performance of the FMD-DNC-based LM with respect to the number of read vectors in three ways. First, we observed that the number of weight parameters increased with the number of read vectors. The controller interface layer generated key vectors for the read vectors, with the number of read vectors equal to the number of key vectors. In the experiments, the key and read vectors were 256-dimensional vectors. In addition, the read vectors generated at time (t−1) affected the size of the input layer of the controller. Second, the BPC performance degraded as the number of read vectors increased. We found that the number of weight parameters was proportional to the number of read vectors; therefore, the character-level PTB LM task was insufficient for training the larger FMD-DNC-based LMs. Third, the inference time was proportional to the number of read vectors. This is related to the first observation, because the number of weight parameters increased with the number of read vectors. In addition, to obtain the minimum element of ψ_t, we used a search algorithm for the FMD. Therefore, the total time complexity of the FMD-DNC-based LM was the sum of the time complexity of the plain DNC and O(H).

We also evaluated the performance of the FMD-DNC-based LM in terms of the weight decay, which reduces overfitting by adding larger penalties as the weight parameters increase [27]. To explain, we denote the set of weight parameters as P and add a penalty term λ‖P‖² to the loss function. As λ increases, an increasingly large penalty is applied to P. In our experiments, we set λ to 1×10−6 and 1×10−7. As shown in Tab. 1, when we used λ = 1×10−6 with FMD-DNC-based LMs using one to four read vectors, we obtained BPCs of 1.5934 to 1.5980. The performance of the FMD-DNC-based LM with weight decay λ = 1×10−6 was lower than that with λ = 1×10−7. Two important findings were observed in the experiments. First, the BPC was affected by the weight decay: when the λ used in the FMD-DNC-based LM increased from 1×10−7 to 1×10−6, the performance degraded. This means that the LMs were trained to underfit at an extremely high λ; when λ was extremely low, the LMs were trained to overfit. Therefore, the FMD-DNC-based LM with λ = 1×10−6 showed underfitting in the experiments. Second, the inference time was independent of λ. Regardless of the λ with which the FMD-DNC-based LM was trained, the inference time was the same, because the number of weight parameters did not change. Tab. 1 shows the evaluation results of the FMD-DNC-based LMs on the character-level PTB LM task.
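In PyTorch, the λ penalty discussed above is exposed through the optimizer's weight_decay argument; a minimal sketch of the two settings compared in the experiments:

```python
import torch
import torch.nn as nn

model = nn.Linear(256, 50)    # placeholder module standing in for the LM

# Weight decay adds a penalty proportional to the squared norm of the
# parameters P, scaled by lambda; a larger lambda pushes weights toward zero.
optimizer_small_penalty = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-7)
optimizer_large_penalty = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-6)
```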

Table 1: Evaluation results of FMD-DNC-based LMs on the character-level PTB LM task (TF, the Transformer; Trellis, the trellis network; AWD, the AWD-LSTM network; DNC, the vanilla DNC; MD, the MD-DNC; FMD, the FMD-DNC; nWP, number of weight parameters; nRV, number of read vectors; nVEM, number of vectors in the external memory; WD, weight decay; IT, inference time (ms/batch); μ, mean of BPC results; σ, standard deviation of BPC results)


5.3 enwik8 LM Task

5.3.1 Experimental Setup

For the enwik8 task, we again used the LSTM RNN-based LM as the baseline, but we used the previously published experimental result for that model [28]. In addition, we used the previously published results of the Transformer-based LM [22]. To train the plain DNC-based LM, we used the LSTM RNN as the controller. The number of hidden layers was 3, and the number of hidden nodes in each hidden layer was 1024. The number of nodes in the embedding layer was 128. The external memory consisted of 128 memory vectors with 256 dimensions each. The learning rate was initialized to 1×10−3. We again used a PyTorch scheduler module to reduce the learning rate when the objective function plateaued, with a reduction rate of 1×10−1. The weight decay was 1×10−7, the BPTT length was 120, and the batch size was 5.

To train the FMD-DNC-based LM, we used the LSTM RNN as the controller with the following hyper-parameters: number of hidden layers, 3; number of nodes in the embedding layer, 128; number of hidden nodes in each hidden layer, 1024. The external memory consisted of 128 memory vectors with 256 dimensions each. The learning rate was initialized to 1×10−3. The scheduler function was again used to reduce the learning rate, with a reduction rate of 9×10−1. The batch size was 5, the weight decay was 1×10−7, and the BPTT length was 120. To train the MD-DNC [4], we used the LSTM RNN as the controller and the same hyper-parameters as for the FMD-DNC-based LM.

5.3.2 Experimental Results

Tab. 2 shows the evaluation results of the FMD-DNC-based LMs on the enwik8 LM task. We evaluated the performance of the FMD-DNC-based LM based on the number of read vectors and the weight decay value. The FMD-DNC-based LM using four read vectors outperformed the other FMD-DNC-based LMs, with a BPC of 1.3860. Although it does not match the results of the Transformer-based LM, the FMD method showed a relative improvement of 0.45% compared with the DNC in terms of BPC.

Table 2: Evaluation results of FMD-DNC-based LMs on the enwik8 LM task (TF, the Transformer; DNC, the vanilla DNC; MD, the MD-DNC; FMD, the FMD-DNC; nWP, number of weight parameters; nRV, number of read vectors; nVEM, number of vectors in the external memory; WD, weight decay; IT, inference time (ms/batch); μ, mean of BPC results; σ, standard deviation of BPC results)

Our general observations about the performance of the FMD-DNC-based LM with respect to the number of read vectors are as follows. First, the number of weight parameters was twice that of the Transformer. We expected that involving more weight parameters would yield better performance because the enwik8 LM task dataset is large-scale; however, in the experiments, the Transformer-based LM still showed the best available performance. Second, we found that the BPC improved when more read vectors were used. Although it was difficult to train the FMD-DNC-based LM with five or more read vectors, the FMD-DNC-based LM using four read vectors outperformed the plain DNC-based LM in terms of BPC. Third, the BPC was affected by the weight decay. When the λ used in the FMD-DNC-based LM increased from 1×10−7 to 1×10−6, the performance degraded, as we also observed in the experiments on the PTB LM task. This means that the LMs were trained to overfit at an extremely low λ, even though we used a large-scale LM task dataset.

5.4 Rescoring Task of Speech Recognition

5.4.1 Experimental Setup

We calculated the lmscore_nn in Eq. (8) using the LSTM RNN, Transformer, plain DNC, MD-DNC, and FMD-DNC. The number of weight parameters for the LSTM RNN was 10.2 million, producing BPC values of 1.6428 and 1.6437 on the validation and test sets, respectively. The number of weight parameters for the Transformer-based LM was 10.4 million, producing BPC values of 1.3179 and 1.3184 on the validation and test sets, respectively. The numbers of weight parameters for the plain DNC, MD-DNC, and FMD-DNC were also 10.4 million. The plain DNC showed BPC values of 1.5825 and 1.5813 on the validation and test sets, respectively; the MD-DNC showed BPC values of 1.5722 and 1.5729; and the FMD-DNC showed BPC values of 1.5687 and 1.5690.

5.4.2 Experimental Results

Tab. 3 presents the 100-best rescoring results using the neural network-based LMs trained with the AM transcriptions of the LibriSpeech dataset. The experimental results for the AM are the rescoring results using the acoustic score alone. The FMD-DNC-based LM showed WERs of 11.16, 33.02, and 34.58 on the dev_clean, dev_other, and test_other sets, respectively. This was a relative improvement of 0.18, 0.24, and 0.03% compared with the Transformer on the dev_clean, dev_other, and test_other sets, respectively. However, on the test_clean set, the Transformer-based LM exhibited a relative improvement of 0.08% compared with the FMD-DNC.

Table 3: 100-best rescoring results using the neural network-based LMs

We summarize our performance findings as follows. First, although the Transformer-based LM showed a relative improvement of 19.03% compared with the FMD-DNC-based LM in terms of BPC, the FMD-DNC-based LM exhibited better performance than the Transformer-based LM on the rescoring tasks. We originally anticipated that the Transformer would perform better on the rescoring tasks based on its best available performance on the LM tasks; however, in the actual experiments, the FMD-DNC showed the best performance on the rescoring tasks. Second, the FMD-DNC-based LM outperformed the MD-DNC-based LM on the rescoring tasks. Therefore, our proposed FMD method outperformed the MD method.

6 Conclusions and Future Work

As the basic DNC has a disadvantage in memory deallocation, we proposed an LM using the FMD-DNC to address this problem. The FMD method searches for the minimum value of the elements in a forget gate-based retention vector, which indicates the retention degree of the information stored in each external memory address. Our method converts the minimum element to 0 and subsequently multiplies the values of the external memory at time (t−1) by the converted forget gate-based retention vector. In experiments, we applied our proposed DNC architecture to LM tasks as a task-specific domain and to the rescoring of speech recognition as a general-purpose domain. On the LM tasks, tests of our proposed DNC on the Penn Treebank and enwik8 LM tasks did not achieve the best available results. However, the FMD method showed a relative improvement of 0.41%–0.45% compared with the DNC in terms of BPC. To test rescoring ability in speech recognition, we evaluated our proposed DNC-based LM on the LibriSpeech task, where the FMD method showed a relative improvement of 0.03%–0.24% compared with the Transformer in terms of WER. In future work, we will improve the extremely long inference time of the FMD-DNC-based LM. Furthermore, we will evaluate the FMD-DNC-based LM on the WikiText-103 and One Billion Word LM tasks.

Funding Statement: This work was supported by the ICT R&D program of the Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) [Project Number: 2020-0-00113, Project Name: Development of data augmentation technology by using heterogeneous information and data fusions].

Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.