TextRank text

11/6/2021 2-minute read

Starting on June 7th, I spent three days reading about and implementing a PageRank-based text summarization algorithm.


TextRank is a graph-based ranking algorithm for text. It partitions the text into constituent units (sentences), builds a graph with those units as nodes, and uses the similarity between sentences as the edge weights. It then computes a TextRank value for each sentence through repeated iterations, and finally extracts the highest-ranked sentences and combines them into a summary.
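For reference, the iteration can be written as the weighted score update from the original TextRank paper (Mihalcea and Tarau, 2004). The symbols below come from that paper, not from this post: WS(V_i) is the score of sentence V_i, w_{ji} is the similarity weight on the edge between sentences j and i, and d is a damping factor, commonly set to 0.85. In an undirected sentence graph, In(V) and Out(V) are simply the neighbors of V.

```latex
WS(V_i) = (1 - d) + d \sum_{V_j \in \mathrm{In}(V_i)}
          \frac{w_{ji}}{\sum_{V_k \in \mathrm{Out}(V_j)} w_{jk}} \, WS(V_j)
```

The update is applied to all sentences repeatedly until the scores stop changing by more than a small threshold.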


The first step is to merge all articles into a single body of text. Then we split the text into individual sentences and find a vector representation (word vectors) for each sentence. After that, we calculate the similarity between the sentence vectors and store the results in a matrix. In the next step, the similarity matrix is converted into a graph with sentences as nodes and similarity scores as edge weights, on which the TextRank value of each sentence is calculated. Finally, a fixed number of the highest-ranked sentences make up the final summary.
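The steps above can be sketched in plain Python. This is a minimal illustration, not the code from my notebook: it assumes bag-of-words counts as a stand-in for word vectors, cosine similarity as the edge weight, and a fixed number of iterations. All function names and parameters (`split_sentences`, `damping`, `top_n`, and so on) are made up for this example.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def sentence_vector(sentence):
    """Bag-of-words counts (assumption: a simple stand-in for word vectors)."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_matrix(sentences):
    """Pairwise similarities; the diagonal is zero so no sentence links to itself."""
    vectors = [sentence_vector(s) for s in sentences]
    n = len(vectors)
    return [
        [0.0 if i == j else cosine(vectors[i], vectors[j]) for j in range(n)]
        for i in range(n)
    ]


def textrank_scores(weights, damping=0.85, iterations=50):
    """Weighted PageRank-style update, run for a fixed number of rounds."""
    n = len(weights)
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores


def summarize(text, top_n=2):
    """Return the top_n highest-scoring sentences in their original order."""
    sentences = split_sentences(text)
    scores = textrank_scores(similarity_matrix(sentences))
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:top_n])]
```

Calling `summarize(article_text, top_n=3)` returns the three highest-ranked sentences. A sentence that shares no words with any other sentence gets no incoming weight and stays at the minimum score of `1 - damping`, so it is selected last.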


In our experiments, we found that the real problem lies in the evaluation. Traditional extractive summaries are evaluated with BERT scores or similar mechanisms, which compare the similarity of the summary to the original text. This approach originated in the field of machine translation and has since been adopted for text summarization. In practice, however, we have found that these traditional evaluation methods are mostly applied to general articles, such as entertainment and sports news, political commentary, or fiction. As mentioned earlier, such articles contain a large amount of redundant information, and the same fact or knowledge is often mentioned repeatedly, so this evaluation method works for them. In the specialized field of motor design, however, the texts have low information redundancy and high logical coherence, so similarity can hardly be used to assess the quality of a summary.

Based on these facts, we may have to resort to expert evaluation, which relies on the LCM motor engineers to give their subjective comments.




The implementation can be found in Colab or on my GitHub.


I wrote two new subsections about TextRank text summarization and added some material to the LESK section.


