Plugin for Similar Content Retrieval from Wikipedia Citations and Summarization of Articles

Rahul Gupta
10 min readMay 11, 2021

The content available on the web has grown drastically in the past few years. Wikipedia has seen exponential growth since 2006, and the total number of articles on the site has reached 55 million, of which 6.2 million are in English. This massive collection of articles poses new kinds of difficulties for users, such as retrieving semantically similar information: lexically similar texts may not be related semantically, while words like "food" and "bread" are semantically related even though they are lexically different.
It is time-consuming to visit the various cited links and read their content manually when the data is vast and diverse. Even when we do read the content manually, not all cited text is relevant, which leads to wasted time. Sometimes the user needs context for a quoted passage and has to search for the cited text manually, and may end up with only partially relevant data or drift away from the original context. This calls for a system that is smarter and more efficient, and that improves readability and the user experience.

We propose an information retrieval system that retrieves semantically similar data from the noted links, where a noted link means one of Wikipedia's highlighted internal links or external citations. We built this as a Chrome extension plugin which:

  1. Can return the most relevant article from the citations based on the user's query (the selected line).
  2. Can generate a summary of the current article.
  3. Can answer the user's query from a selected paragraph that has external citations.
  4. Can answer the user's query from the current article's summary.

Our approach saves users precious time by sparing them from repeatedly browsing for the cited context. It helps the user extract the vital information in a more engaging way and avoids the situation in which the user visits multiple articles for the cited text and drifts away from the context. It also improves the user's overall understanding of the context.

Extracting Similar Paragraphs from Citations

We extract the paragraphs that are most contextually similar to the query from the cited articles, and recommend contextually similar articles from the current article. For a given input query we fetch the citations present in the query, both external and internal links; internal links point to other Wikipedia pages, while external links point to sites on the internet cited by the authors. We scrape document IDs for the internal links, then use a Doc2Vec model on the documents to get their embeddings. For the query we use Word2Vec to get its vector embedding. To find the similarity between the document vectors and the input query we use cosine similarity and WMD (Word Mover's Distance).

To recommend contextually similar articles for the query or for the current page, depending on the user's choice, we use topic modeling. This can also be done by computing the similarity between the introduction paragraphs of different articles and the given query (or the introduction paragraph of the current page) and ranking those articles to obtain the top five most relevant articles for the user to read. This requires scraping and extracting topics from the Wikipedia dumps, creating vectors of the current and other articles using the models above (Word2Vec and Doc2Vec), and finally computing the similarity between the vectors and ranking them to recommend the most similar topics.

Input from Front End- The user can highlight some part of the Wikipedia text, and our plugin takes that highlighted text as input, along with other required details of the article such as its citations and hyperlinks.

Backend Flow

Scraping data- The query and page title are fetched using JavaScript, which sends a POST request to the Flask server. Using the page title, citations are scraped from the current page, and the citations occurring in the query are extracted. The documents behind those citations are then scraped, and the extracted articles are appended together so that they can later be broken into paragraphs.
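The sketch below shows roughly how such a Flask endpoint might look; the route name, field names, and the rank_paragraphs helper are illustrative placeholders, not the plugin's actual code.

```python
# Hypothetical Flask endpoint receiving the highlighted text and page title
# sent by the Chrome extension (route and field names are illustrative).
from flask import Flask, request, jsonify

app = Flask(__name__)

def rank_paragraphs(query, page_title):
    # Placeholder: the real pipeline scrapes the cited documents and ranks
    # their paragraphs against the query, as described in the next sections.
    return []

@app.route("/similar", methods=["POST"])
def similar_paragraphs():
    data = request.get_json()
    query = data["query"]        # text highlighted by the user
    page_title = data["title"]   # title of the current Wikipedia article
    return jsonify(results=rank_paragraphs(query, page_title))

if __name__ == "__main__":
    app.run(port=5000)
```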

Cleaning- The appended articles are broken down into paragraphs, and we apply some pre-processing to clean the data: removal of junk symbols, tokenization, stop-word and punctuation removal, and lemmatization.
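A minimal sketch of this cleaning step, assuming NLTK (the post does not name the library used):

```python
# Illustrative pre-processing with NLTK: junk removal, tokenization,
# stop-word/punctuation removal, and lemmatization.
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean(paragraph):
    text = re.sub(r"[^a-zA-Z\s]", " ", paragraph.lower())  # drop junk symbols and punctuation
    tokens = nltk.word_tokenize(text)                      # tokenize
    tokens = [t for t in tokens if t not in stop_words]    # remove stop words
    return [lemmatizer.lemmatize(t) for t in tokens]       # lemmatize

print(clean("The cars were driving quickly; see citation [1]."))
```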

Embeddings- We used three types of embeddings, described below:

Word2Vec- Word2Vec is a combination of two techniques: CBOW (Continuous Bag of Words) and Skip-Gram. It captures semantic information and preserves relations between different words. We get a 300-dimensional embedding for each word, then take the mean of all the word vectors to obtain a 300-dimensional query vector.
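A sketch of building the query vector this way, assuming the pretrained Google News vectors loaded through gensim's downloader (the post does not say which pretrained model was used):

```python
# Query embedding as the mean of pretrained Word2Vec word vectors.
import numpy as np
import gensim.downloader as api

w2v = api.load("word2vec-google-news-300")   # 300-dimensional pretrained vectors (assumed model)

def query_vector(tokens):
    vectors = [w2v[t] for t in tokens if t in w2v]   # skip out-of-vocabulary tokens
    return np.mean(vectors, axis=0) if vectors else np.zeros(300)

print(query_vector(["food", "bread"]).shape)   # (300,)
```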

Doc2Vec- Doc2Vec creates embeddings for a group of words taken collectively, representing each document as a single vector. The document is split into words and tokenized, and the token list is passed as input to the Doc2Vec model, which gives a 300-dimensional embedding for each paragraph.
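A sketch of training gensim's Doc2Vec on the scraped paragraphs (the hyperparameters here are illustrative, and the gensim >= 4 API is assumed):

```python
# 300-dimensional paragraph embeddings with gensim Doc2Vec.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

paragraphs = [["car", "engine", "fuel"], ["bread", "food", "wheat"]]   # cleaned token lists
tagged = [TaggedDocument(words=p, tags=[i]) for i, p in enumerate(paragraphs)]

model = Doc2Vec(vector_size=300, min_count=1, epochs=40)
model.build_vocab(tagged)
model.train(tagged, total_examples=model.corpus_count, epochs=model.epochs)

print(model.dv[0].shape)                    # vector of the first paragraph: (300,)
print(model.infer_vector(["food"]).shape)   # an unseen query embedded the same way
```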

TF-IDF- We applied TF-IDF to both the query and the documents after building a corpus from various Wikipedia documents. We then vectorized the input query and the document, creating a separate vector for each paragraph in the document; these are later checked for similarity against the query vector.
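A sketch of this step with scikit-learn's TfidfVectorizer (the post does not name the implementation used):

```python
# TF-IDF vectors for the paragraphs and the query, sharing one vocabulary.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

paragraphs = [
    "A car is a wheeled motor vehicle used for transportation.",
    "Bread is a staple food prepared from a dough of flour and water.",
]
query = "staple food made from flour"

vectorizer = TfidfVectorizer(stop_words="english")
doc_matrix = vectorizer.fit_transform(paragraphs)   # one row per paragraph
query_vec = vectorizer.transform([query])           # same vocabulary as the corpus

print(cosine_similarity(query_vec, doc_matrix)[0])  # higher score = more similar paragraph
```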

Similarity Measures- We applied two similarity measures: cosine similarity, a metric that measures the similarity between the document vector and the query vector by computing the cosine of the angle between them, and Word Mover's Distance (WMD), which measures the similarity between the document and the query.
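A sketch of both measures using gensim's pretrained vectors (note that WMD is a distance, so lower values mean more similar, and it needs an optimal-transport backend such as POT installed):

```python
# Cosine similarity and Word Mover's Distance between a query and a paragraph.
import numpy as np
import gensim.downloader as api

w2v = api.load("word2vec-google-news-300")   # assumed pretrained model

query = ["bread", "food"]
paragraph = ["wheat", "flour", "baking"]

def mean_vec(tokens):
    return np.mean([w2v[t] for t in tokens if t in w2v], axis=0)

q, p = mean_vec(query), mean_vec(paragraph)
cosine = np.dot(q, p) / (np.linalg.norm(q) * np.linalg.norm(p))   # higher = more similar

wmd = w2v.wmdistance(query, paragraph)                            # lower = more similar
print(cosine, wmd)
```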

Output to Frontend- The results are sorted by cosine similarity in descending order and the top results are returned, which helps the user find similar paragraphs. Word Mover's Distance is used in the same way: the paragraphs are ranked by their scores and the most relevant ones are returned.

Pipeline for extraction of paragraphs from citations
Backend Flow

Text Summarization

Text summarization is the process of creating a summary of a document: presenting the data in a concise form, highlighting the parts that convey facts and information while preserving the meaning. It is divided into two classes:

  1. Extractive Summarization- Extractive summarization picks up sentences directly from the original document depending on their importance.
  2. Abstractive Summarization- Abstractive summarization tries to produce a bottom-up summary using sentences or verbal annotations that might not be a part of the original document.
Text Summarization Algorithms

We have used the following approaches for Extractive summarization :-

  1. TF-IDF

Term frequency-inverse document frequency is a numeric measure used to score the importance of a word in a document based on its appearance in that document and in a given collection of documents. If a word appears frequently in a document, it is probably important and should be given a high score. But if the word also appears in many other documents, it is probably not a unique identifier and should be given a low score.

Steps :-

(a) The sentences are converted into tokens.

(b) A frequency table is created for each sentence, which stores each term and its frequency.

(c) The term frequency is calculated for each term of the sentence: the number of times term t appears in the document divided by the total number of terms in the document.

(d) The number of sentences containing each word is calculated. This is used to compute the inverse document frequency (IDF).

(e) The inverse document frequency is calculated for each term: the log of the total number of documents divided by the number of documents containing term t.

(f) TF-IDF scores are calculated by multiplying the TF and IDF scores for each term.

(g) Each sentence is given a score by averaging the scores of the terms present in the sentence.

(h) The summary is generated by selecting the top-scoring sentences based on a certain threshold.

Pipeline to generate summary of Article
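A compact sketch of steps (a)-(h), treating each sentence as a "document" for the IDF computation; the 1.2 × average-score threshold is an illustrative choice:

```python
# Extractive summarization by TF-IDF sentence scoring.
import math
from collections import Counter
import nltk

nltk.download("punkt")

def tfidf_summary(text, factor=1.2):
    sentences = nltk.sent_tokenize(text)                               # (a) split into sentences
    token_lists = [nltk.word_tokenize(s.lower()) for s in sentences]

    # (b)-(c) term frequency for each term of each sentence
    tf = [{t: c / max(len(toks), 1) for t, c in Counter(toks).items()}
          for toks in token_lists]

    # (d)-(e) inverse document frequency, treating sentences as documents
    n = len(sentences)
    df = Counter(t for toks in token_lists for t in set(toks))
    idf = {t: math.log(n / df[t]) for t in df}

    # (f)-(g) score each sentence by the average TF-IDF of its terms
    scores = [sum(w * idf[t] for t, w in tf_s.items()) / max(len(tf_s), 1)
              for tf_s in tf]

    # (h) keep sentences scoring above the threshold
    avg = sum(scores) / max(n, 1)
    return " ".join(s for s, sc in zip(sentences, scores) if sc >= factor * avg)

print(tfidf_summary("Cars are wheeled vehicles. Cars run on roads. "
                    "Bread is a food. Most cars carry people."))
```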

2. TextRank

It is an unsupervised graph-based technique based on the PageRank algorithm and is used to rank text sentences. Each sentence in the document is represented by a node in the graph, and the edges denote the similarity between two nodes, i.e. sentences. The algorithm assigns a score to each sentence and picks the top-ranked sentences to form the summary.

Steps :-

(a) A graph is created where nodes denote the sentences. The edges in the graph denote the similarity between the two nodes i.e. sentences.

(b) PageRank algorithm is applied on the weighted graph.

(c) The highest-scoring nodes are added to the summary. We have used the Gensim TextRank module.
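A sketch of this step; gensim's TextRank lives in gensim.summarization, which was removed in gensim 4.0, so the snippet assumes gensim 3.x:

```python
# TextRank summarization with gensim (requires gensim < 4.0).
from gensim.summarization import summarize

text = open("car_article.txt").read()   # the scraped article text (illustrative path)
print(summarize(text, ratio=0.2))       # keep roughly the top 20% of sentences by TextRank score
```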

3. BERT

BERT stands for Bidirectional Encoder Representations from Transformers. It relies on a mechanism that learns contextual relations between words (or sub-words) in a text. The Transformer includes two parts: an encoder that reads the text input and a decoder that produces a prediction for the task. Bidirectional means that the encoder reads the entire sequence of words at once. We have used a BERT module from Hugging Face Transformers.
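The post only says a BERT module from Hugging Face was used; one common wrapper for BERT-based extractive summaries is the bert-extractive-summarizer package, which clusters BERT sentence embeddings. A sketch under that assumption:

```python
# BERT-based extractive summary via the bert-extractive-summarizer package (assumed wrapper).
from summarizer import Summarizer

model = Summarizer()                    # loads a BERT model under the hood
text = open("car_article.txt").read()   # illustrative path
print(model(text, ratio=0.2))           # keep the most representative ~20% of sentences
```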

We have used the following approaches for Abstractive summarization :-

  1. XLNet

It is a bidirectional transformer in which the next tokens are predicted in random order. It is an extension of the Transformer-XL model, pre-trained using an autoregressive method to learn bidirectional contexts by maximizing the expected likelihood over all permutations of the input sequence factorization order. It combines the advantages of both autoregressive and autoencoding methods in its pre-training, which helps it overcome the pretrain-finetune discrepancy. We have used an XLNet module from Hugging Face Transformers for summarization.

2. GPT-2

It is the Generative Pre-trained Transformer 2, a variant of the Transformer model that has only the decoder part of the Transformer network. It uses multi-headed masked self-attention, which allows it to attend only to the tokens before the current position, so it behaves like a unidirectional language model. The model processes tokens in parallel, i.e. it predicts tokens for all time steps at once. We have used a GPT-2 module from Hugging Face Transformers.

3. XLM Transformer

It is a Transformer-based architecture that is pre-trained using one of three language modelling objectives: Causal Language Modeling, which models the probability of a word given the previous words in a sentence; Masked Language Modeling, the masked language modeling objective of BERT; and Translation Language Modeling, a (new) translation language modeling objective for improving cross-lingual pre-training [17]. We have used the XLM module from Hugging Face Transformers.
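Reference [7], which the post points to for these models, uses the TransformerSummarizer wrapper from the bert-extractive-summarizer package, so a sketch along those lines (package choice and model keys are assumptions) is:

```python
# Summaries with XLNet, GPT-2, and XLM via TransformerSummarizer (assumed wrapper, per [7]).
from summarizer import TransformerSummarizer

text = open("car_article.txt").read()   # illustrative path

models = {
    "XLNet": TransformerSummarizer(transformer_type="XLNet", transformer_model_key="xlnet-base-cased"),
    "GPT-2": TransformerSummarizer(transformer_type="GPT2", transformer_model_key="gpt2-medium"),
    "XLM":   TransformerSummarizer(transformer_type="XLM", transformer_model_key="xlm-mlm-en-2048"),
}

for name, model in models.items():
    print(name, ":", model(text, ratio=0.2))
```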

Question Answering Module

The user can either select a passage as the context or use the generated summary as the context, and then ask a question. The context and the question are passed to the Transformers question-answering pipeline, which models the interactions between them and predicts the answer.
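A sketch of this step with the Hugging Face Transformers question-answering pipeline (the default model is left to the library):

```python
# Question answering over a selected context with the Transformers pipeline.
from transformers import pipeline

qa = pipeline("question-answering")   # default extractive QA model

context = ("A car is a wheeled motor vehicle used for transportation. "
           "Most definitions of cars say that they run primarily on roads.")
result = qa(question="What do cars run on?", context=context)
print(result["answer"], result["score"])
```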

Question Answering Module

Plugin Flow

For getting a relevant answer to your question from a selected paragraph that contains external citations.

Getting relevant answer to your question from Paragraph selected that contains external citations

For getting a relevant answer to the question from the summary of the article.

Getting relevant answer to the question from the summary of the article

Evaluation for similar paragraph retrieval from citations

Results of the different models.

Evaluation of summarization model using Rouge-1 and Rouge-L metric

Metrics used: ROUGE-1 refers to the overlap of unigrams (individual words) between the system output (results from the various algorithms) and the reference summaries (ground truth). ROUGE-L is based on Longest Common Subsequence (LCS) statistics: the longest common subsequence naturally takes sentence-level structural similarity into account and automatically identifies the longest in-sequence co-occurring n-grams.
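A sketch of computing these scores with the rouge-score package (the post does not say which ROUGE implementation was used):

```python
# ROUGE-1 and ROUGE-L between a generated summary and the reference summary.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(
    "a car is a wheeled motor vehicle used for transportation",   # reference (ground truth)
    "a car is a motor vehicle used for transport",                # system output
)
print(scores["rouge1"].fmeasure, scores["rougeL"].fmeasure)
```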

Performance of models

Plugin Screenshots

Summary Returned
Retrieved answer to user Query from selected Paragraph
Retrieved answer to user Query from Summary of Article

Plugin For similar content retrieval from Wikipedia Citations and Summarization of Article — YouTube

Conclusion

Extracting Similar Articles

We have solved the problem of extracting similar content from the sites cited by Wikipedia articles and displayed the results in the plugin application.

Text Summarization

We have generated summaries using various models: TF-IDF, BERT, TextRank, GPT-2, XLNet, and XLM Transformer. Each model generates a slightly different summary for the same article, 'Car'. Summarization based on TF-IDF outperformed the other methods, with an F-measure of 0.3405 on the ROUGE-1 metric and 0.3393 on ROUGE-L. Another way to judge which summary is best is to ask a human evaluator. We also tried to implement a question-answering module using these summaries.

Contributors

  1. Rahul Gupta
  2. Deepti Gupta
  3. Palak Tiwari
  4. Mohit Ghai
  5. Ravi Rathee

Acknowledgement

Special Thanks to our Advisor Dr. Rajiv Ratn Shah for guiding us. Also, we would like to thank our TA Rahul Kukreja for advising us in different phases of development of the Plugin.

References

[1] P. Arnold and E. Rahm, "Automatic extraction of semantic relations from Wikipedia," International Journal on Artificial Intelligence Tools, vol. 24, p. 1540010, 2015.

[2] P. Banik, S. Gaikwad, A. Awate, S. Shaikh, P. Gunjgur, and P. Padiya, "Semantic analysis of Wikipedia documents using ontology," in 2018 IEEE International Conference on System, Computation, Automation and Networking (ICSCA), 2018, pp. 1–6.

[3] K. U. Manjari, S. Rousha, D. Sumanth, and J. Sirisha Devi, "Extractive text summarization from web pages using selenium and tf-idf algorithm," in 2020 4th International Conference on Trends in Electronics and Informatics (ICOEI), 2020, pp. 648–652.

[4] M. R. Ramadhan, S. N. Endah, and A. B. J. Mantau, "Implementation of textrank algorithm in product review summarization," in 2020 4th International Conference on Informatics and Computational Sciences (ICICoS), 2020, pp. 1–5.

[5] H. Gupta and M. Patel, "Method of text summarization using lsa and sentence based topic modelling with bert," in 2021 International Conference on Artificial Intelligence and Smart Systems (ICAIS), 2021, pp. 511–517.

[6] R. Horev, "Bert explained: State of the art language model for nlp," Nov 2018. [Online]. Available: https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270

[7] M. Singh, "Summarize reddit comments using t5, bart, gpt-2, xlnet models," Jan 2021. [Online]. Available: https://towardsdatascience.com/summarize-reddit-comments-using-t5-bart-gpt-2-xlnet-models-a3e78a5ab944

[8] "Papers with code — xlm explained." [Online]. Available: https://paperswithcode.com/method/xlm
