Deep learning methods for extractive text summarization

Author:
  1. Akanksha Joshi
Supervised by:
  1. Enrique Alegre Gutiérrez (Director)
  2. Eduardo Fidalgo Fernández (Director)

Defence university: Universidad de León

Year of defence: 2021

Committee:
  1. Ana M. García Serrano (Chair)
  2. Víctor González Castro (Secretary)
  3. Luis Fernando D'Haro Enríquez (Committee member)

Type: Thesis

Abstract

This thesis presents new algorithms, methods, and datasets for extractive summarization of single documents using deep learning and fusion-based approaches. Our first contribution is SummCoder, an unsupervised method for extractive text summarization that does not depend on the large labeled datasets required for supervised learning. SummCoder generates a summary according to three sentence selection metrics: content relevance, novelty, and position relevance. Content relevance is measured with a deep auto-encoder network; novelty is derived from the similarity among sentences represented as embeddings in a distributed semantic space; and position relevance is a hand-designed feature that assigns more weight to the first few sentences through a dynamic weight function regulated by document length. We also developed a sentence ranking and selection technique that produces the document summary by ranking sentences according to the final score obtained by fusing the three sentence selection metrics.

We further introduce a new summarization benchmark, the Tor Illegal Documents Summarization (TIDSumm) dataset, designed mainly to assist Law Enforcement Agencies (LEAs). It contains two sets of manually created ground-truth summaries for 100 web documents extracted from onion websites on the Tor (The Onion Router) network. Evaluated on the DUC 2002, CNN/DailyMail, Blog Summarization, and TIDSumm datasets, SummCoder achieves a remarkable improvement in ROUGE scores over other state-of-the-art systems.

To further improve accuracy on the text summarization task, we propose DeepSumm, a summarization framework that exploits the topic information in documents together with sequence-to-sequence networks.
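The score fusion at the heart of SummCoder can be sketched as follows. This is a minimal illustration only: the specific position-weight function, the equal fusion weights, and the function names are assumptions, not the thesis implementation.

```python
import numpy as np

def position_scores(n_sentences: int) -> np.ndarray:
    """Positional relevance: earlier sentences get more weight, with the
    decay rate regulated by document length (an assumed form of the
    thesis's dynamic weight function)."""
    positions = np.arange(n_sentences)
    return np.exp(-positions / max(n_sentences / 3.0, 1.0))

def novelty_scores(embeddings: np.ndarray) -> np.ndarray:
    """Novelty = 1 - maximum cosine similarity to any other sentence
    embedding, so near-duplicate sentences score close to zero."""
    norms = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = norms @ norms.T
    np.fill_diagonal(sim, -1.0)  # ignore self-similarity
    return 1.0 - sim.max(axis=1)

def summcoder_rank(content, novelty, position, weights=(1.0, 1.0, 1.0)):
    """Fuse the three sentence selection metrics into one score and
    return sentence indices ranked most-important-first."""
    fused = weights[0] * content + weights[1] * novelty + weights[2] * position
    return np.argsort(-fused)
```

A summary would then be formed by taking the top-ranked indices in their original document order.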
Topic vectors capture long-range semantic information in the document that is not otherwise encapsulated by other document representations. In DeepSumm, we use the latent information estimated via topic vectors and sequence networks to improve the quality and accuracy of the summarized text. Each sentence is encoded by two different recurrent neural networks, one based on probabilistic topic distributions and the other on word embeddings, and a sequence-to-sequence network is applied to each sentence encoding. The outputs of the encoder and decoder in the sequence-to-sequence networks are weighted by an attention mechanism, combined, and converted into a score through a multi-layer perceptron network. The scores based on topic, sentence embeddings, position, and novelty are then fused into a rank that indicates each sentence's importance. We demonstrate empirically that DeepSumm captures both the global and the local semantic information of the document, outperforming existing state-of-the-art approaches for extractive text summarization on the DUC 2002 and CNN/DailyMail datasets.

Our final contribution aims to increase the accuracy of the text summarization task without any supervision. To this end, we designed RankSum, a fusion-based approach that exploits multidimensional features of the sentences in a document. The proposed methodology uses heterogeneous sentence features, such as topic information, semantic content, significant keywords, and position, to rank sentences according to their importance. We use probabilistic topic models to determine the topic rank, whereas semantic information is captured with sentence embeddings. To derive rankings from sentence embeddings, we use Siamese networks to produce abstractive sentence representations and then formulate a novel strategy to arrange them in order of importance.
A graph-based strategy identifies the significant keywords in the document and the corresponding sentence rankings. We also formulate a sentence novelty measure based on bigrams, trigrams, and sentence embeddings to eliminate redundant sentences from the summary. A rank is computed for every sentence in the document from each of these features, and these ranks are finally fused into a single score per sentence. Experimental results on the CNN/DailyMail and DUC 2002 datasets show that RankSum performs on par with or better than existing state-of-the-art summarization methods.
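As an illustration of RankSum's final step, combining per-feature sentence rankings and filtering redundant sentences, here is a minimal sketch. Reciprocal-rank fusion and the n-gram overlap measure below are stand-in formulations chosen for illustration; they are not the exact fusion scheme or novelty measure from the thesis.

```python
from collections import defaultdict

def fuse_ranks(rankings, k=60):
    """Fuse several per-feature sentence rankings (e.g. topic, embedding,
    keyword, position) into one final ranking via reciprocal-rank fusion:
    each feature contributes 1 / (k + rank) to a sentence's score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, sent_id in enumerate(ranking):
            scores[sent_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

def ngram_overlap(a_tokens, b_tokens, n=2):
    """Fraction of n-grams of sentence a that also occur in sentence b;
    a high overlap flags sentence a as redundant (a stand-in for the
    bigram/trigram part of the novelty measure)."""
    a = {tuple(a_tokens[i:i + n]) for i in range(len(a_tokens) - n + 1)}
    b = {tuple(b_tokens[i:i + n]) for i in range(len(b_tokens) - n + 1)}
    return len(a & b) / len(a) if a else 0.0
```

In a full pipeline, sentences would be admitted to the summary in fused-rank order, skipping any candidate whose n-gram or embedding overlap with already-selected sentences exceeds a threshold.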