02. Text Summarization
- BERT Test on Colab (please do not make public)
- OpenAI Test on Colab (please do not make public)
- PaLM Test on Colab
The table below lists candidate tools for text summarization and key entity extraction. The next step is to validate most of them, find the best model, and fine-tune it to fit our own needs (a quick keyword-extraction sketch follows the table).
| Name | Website | Type |
|---|---|---|
| GPT-2 | GitHub | Open source |
| XLNet | GitHub | Open source |
| BERT | GitHub of BERT; Bert-Summarizer | Open source |
| KeyBERT | GitHub of KeyBERT | Open source |
| TextRank | GitHub; site; another example | Open source |
| TF-IDF | Unknown | |
| Word2Vec | GitHub; site | Open source |
| Gensim | GitHub; site | Open source |
| Sumy | GitHub; site | Open source |
| NLTK | GitHub; site | Open source |
| T5 | GitHub; site | Open source |
| GPT-3/4 | GitHub sample; OpenAI documentation | Commercial |
| AWS Service | AWS Bedrock | Commercial |
| PaLM | PaLM API: Text Quickstart with Python | Commercial |
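As a first validation step for the key entity extraction side of the table, KeyBERT can be exercised in a few lines. This is a minimal sketch, assuming `pip install keybert`; the sample document is a placeholder.

```python
from keybert import KeyBERT

# Placeholder document; replace with our own content when validating.
doc = "Text summarization condenses a document while preserving its key information."

# KeyBERT ranks candidate keywords/keyphrases by their BERT-embedding
# similarity to the whole document.
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(
    doc,
    keyphrase_ngram_range=(1, 2),  # single words and two-word phrases
    stop_words="english",
    top_n=5,
)
print(keywords)  # list of (phrase, similarity score) pairs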
Annotated bibliography
BERT Extractive Summarizer vs Word2Vec Extractive Summarizer: Which one is better and faster?
Extractive Summarization with BERT Extractive Summarizer
Info
A very general description of summarization.
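For reference while validating, the `bert-extractive-summarizer` package these articles discuss exposes a very small API. A minimal sketch, assuming `pip install bert-extractive-summarizer` and a placeholder `body` text:

```python
from summarizer import Summarizer

# Placeholder input; replace with the document to summarize.
body = """Text summarization is the task of producing a shorter version of a
document while preserving its most important information. Extractive methods
select existing sentences; abstractive methods generate new ones."""

# The BERT extractive summarizer clusters sentence embeddings and picks
# the sentences closest to the cluster centroids.
model = Summarizer()
summary = model(body, num_sentences=2)  # or ratio=0.2 for a proportional summary
print(summary)
```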
5 Powerful Text Summarization Techniques in Python
Gensim

```python
from gensim.summarization.summarizer import summarize
from gensim.summarization import keywords
import wikipedia
import en_core_web_sm

# Note: gensim.summarization was removed in Gensim 4.x,
# so this snippet requires gensim < 4.0.

# To import the wikipedia content:
wikisearch = wikipedia.page("")
wikicontent = wikisearch.content
nlp = en_core_web_sm.load()
doc = nlp(wikicontent)

# To summarize based on percentage:
summ_per = summarize(wikicontent, ratio="")  # placeholder: pass a float ratio
print("Percent summary")
print(summ_per)

# To summarize based on word count:
summ_words = summarize(wikicontent, word_count="")  # placeholder: pass an int
print("Word count summary")
print(summ_words)
```

Sumy

LexRank: LexRank is a graph-based summarizer.
```python
from sumy.summarizers.lex_rank import LexRankSummarizer

# `parser` is the PlaintextParser created in the TextRank snippet below.
summarizer_lex = LexRankSummarizer()

# Summarize using sumy LexRank
summary = summarizer_lex(parser.document, 2)
lex_summary = ""
for sentence in summary:
    lex_summary += str(sentence)
print(lex_summary)
```

Luhn: Developed by an IBM researcher of the same name, Luhn is one of the oldest summarization algorithms; it ranks sentences based on a frequency criterion for words.
```python
from sumy.summarizers.luhn import LuhnSummarizer

summarizer_1 = LuhnSummarizer()
summary_1 = summarizer_1(parser.document, 2)
for sentence in summary_1:
    print(sentence)
```

LSA: Latent semantic analysis is an automated method of summarization that combines term frequency with singular value decomposition. It has become one of the most used summarizers in recent years.
```python
from sumy.summarizers.lsa import LsaSummarizer

summarizer_lsa = LsaSummarizer()

# Summarize using sumy LSA
summary = summarizer_lsa(parser.document, 2)
lsa_summary = ""
for sentence in summary:
    lsa_summary += str(sentence)
print(lsa_summary)
```

TextRank: Last but not least, there is TextRank, which works exactly the same as in Gensim.
```python
# Load packages
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.text_rank import TextRankSummarizer

# For strings
parser = PlaintextParser.from_string(text, Tokenizer("english"))

# Summarize using sumy TextRank
summarizer = TextRankSummarizer()
summary = summarizer(parser.document, 2)
text_summary = ""
for sentence in summary:
    text_summary += str(sentence)
print(text_summary)
```

NLTK

The 'Natural Language Toolkit' is an NLP toolkit in Python that helps with text summarization.
```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

# Input your text for summarizing below:
text = """ """

# Next, tokenize the text:
stopWords = set(stopwords.words("english"))
words = word_tokenize(text)

# Create a frequency table to keep a score of each word:
freqTable = dict()
for word in words:
    word = word.lower()
    if word in stopWords:
        continue
    if word in freqTable:
        freqTable[word] += 1
    else:
        freqTable[word] = 1

# Create a dictionary to keep the score of each sentence:
sentences = sent_tokenize(text)
sentenceValue = dict()
for sentence in sentences:
    for word, freq in freqTable.items():
        if word in sentence.lower():
            if sentence in sentenceValue:
                sentenceValue[sentence] += freq
            else:
                sentenceValue[sentence] = freq

sumValues = 0
for sentence in sentenceValue:
    sumValues += sentenceValue[sentence]

# Define the average score from the original text:
average = int(sumValues / len(sentenceValue))

# Lastly, store the highest-scoring sentences in our summary:
summary = ''
for sentence in sentences:
    if (sentence in sentenceValue) and (sentenceValue[sentence] > (1.2 * average)):
        summary += " " + sentence
print(summary)
```

T5

To make use of Google's T5 summarizer, there are a few prerequisites.
First, you will need to install PyTorch and Hugging Face's Transformers. You can install Transformers with the command below:

```
pip install transformers
```

Next, import PyTorch along with the AutoTokenizer and AutoModelWithLMHead objects:
```python
import torch
from transformers import AutoTokenizer, AutoModelWithLMHead
```

Next, you need to initialize the tokenizer and model:
```python
tokenizer = AutoTokenizer.from_pretrained('t5-base')
model = AutoModelWithLMHead.from_pretrained('t5-base', return_dict=True)
```

From here, you can use any data you like to summarize. Once you have gathered your data, use the code below to tokenize it:
```python
inputs = tokenizer.encode("summarize: " + text,
                          return_tensors='pt',
                          max_length=512,
                          truncation=True)
```

Now, you can generate the summary by using the model.generate function on T5:
```python
summary_ids = model.generate(inputs,
                             max_length=150,
                             min_length=80,
                             length_penalty=5.,
                             num_beams=2)
```

Feel free to replace the values above with your desired values. Once it's ready, decode the tokenized summary using the tokenizer.decode function:
```python
summary = tokenizer.decode(summary_ids[0])
```

And there you have it: a text summarizer built with Google's T5. You can replace the texts and values at any time to summarize various arrays of data.
GPT-3

GPT-3 is the successor to the GPT-2 API and is much more capable and functional. Let's take a look at how to get it running in Python, with an example that downloads and summarizes PDF research papers.
First, you will need to import all dependencies:
```python
import openai
import wget
import pathlib
import pdfplumber
import numpy as np
```

You will need openai to interact with GPT-3, so make sure you have an API key. You can get one here.
You will also need wget to download PDFs from the internet, and pdfplumber to convert them back to text. Install all three with pip:

```
pip install openai
pip install wget
pip install pdfplumber
```

To download a PDF and return its local path, enter the following:
```python
def getPaper(paper_url, filename="random_paper.pdf"):
    """
    Downloads a paper from the given url and returns the local
    path to that file.
    """
    downloadedPaper = wget.download(paper_url, filename)
    downloadedPaperFilePath = pathlib.Path(downloadedPaper)
    return downloadedPaperFilePath
```

Now, you need to convert the PDF into text so GPT-3 can read it:
```python
paperFilePath = "random_paper.pdf"
paperContent = pdfplumber.open(paperFilePath).pages

def displayPaperContent(paperContent, page_start=0, page_end=5):
    for page in paperContent[page_start:page_end]:
        print(page.extract_text())

displayPaperContent(paperContent)
```

Now that you have the text, it's time to start summarizing it:
```python
def showPaperSummary(paperContent):
    tldr_tag = "\n tl;dr:"
    openai.organization = 'organization key'
    openai.api_key = "your api key"
    engine_list = openai.Engine.list()
```

Here, we are letting the GPT-3 model know that we require a summary via the tl;dr tag. Then, we set up the environment to use the openai API.
```python
    for page in paperContent:
        text = page.extract_text() + tldr_tag
        response = openai.Completion.create(
            engine="davinci",
            prompt=text,
            temperature=0.3,
            max_tokens=140,
            top_p=1,
            frequency_penalty=0,
            presence_penalty=0,
            stop=["\n"],
        )
        print(response["choices"][0]["text"])
```

This code extracts the text from each page, sends it to the GPT-3 model with a cap on tokens per page, and prints the result to the terminal.
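Note that the snippet above uses the legacy `Completion`/`Engine` interface; more recent OpenAI SDKs expose chat models instead. A hedged sketch of the equivalent call (assuming openai < 1.0 and a `gpt-3.5-turbo`-style model, with the same `text` variable as above):

```python
# Assumes openai < 1.0; the tl;dr prompt pattern is carried over from above.
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": text}],
    temperature=0.3,
    max_tokens=140,
)
print(response["choices"][0]["message"]["content"])
```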
Now that everything is set up, we can run the summarizer:

```python
paperContent = pdfplumber.open(paperFilePath).pages
showPaperSummary(paperContent)
```

Text summarization is very useful for people dealing with large amounts of written data on a daily basis, such as online magazines and research sites, and even for teachers in schools.
While there are simple methods of text summarization in Python, such as Gensim and Sumy, there are far more powerful, if slightly more complicated, summarizers such as T5 and GPT-3.
Which technique to choose really comes down to preference and the use case for each of these summarizers. In theory, AI-based summarizers will prove better in the long run, as they constantly learn and provide superior results.

How to do text summarization with deep learning and Python
Info
Summarizes text based on a frequency metric. Not valuable.
Abstractive Summarization
```python
# Step 1: Import the required libraries.
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

# Step 2: Remove the stop words and store them in a separate array of words.
# Stop words such as "is", "an", "a", "the", and "for" do not add value to the
# meaning of a sentence. For example, take the following sentence:
#   GreatLearning is one of the most valuable websites for ArtificialIntelligence aspirants.
# After removing the stop words, we can narrow the number of words and preserve
# the meaning as follows:
#   ['GreatLearning', 'one', 'useful', 'website', 'ArtificialIntelligence', 'aspirants', '.']

# Step 3: Create a frequency table of the words.
stopwords = set(stopwords.words("english"))
words = word_tokenize(text)
freqTable = dict()

# Step 4: Assign a score to each sentence depending on the words it contains
# and the frequency table.
sentences = sent_tokenize(text)
sentenceValue = dict()

# Step 5: Assign a score to compare the sentences within the text.
sumValues = 0
for sentence in sentenceValue:
    sumValues += sentenceValue[sentence]
average = int(sumValues / len(sentenceValue))
```
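The steps above only score sentences by word frequency (an extractive approach); for actual abstractive summarization, a generative model is required. A minimal sketch using the Hugging Face `pipeline` API, with `facebook/bart-large-cnn` as an illustrative checkpoint choice:

```python
from transformers import pipeline

# Abstractive summarization: the model generates new sentences rather than
# selecting existing ones.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
result = summarizer(text, max_length=130, min_length=30, do_sample=False)
print(result[0]["summary_text"])
```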
Summarize a Text with Python — Continued

Info
Text summarization using NLTK.
Summarize text content using Generative AI
Info
Summarization using Google generative AI. (Useful, but a huge project; it will take a lot of time to validate.)
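For a quick smoke test before committing to the full project, the PaLM text quickstart referenced in the table boils down to a few calls. A minimal sketch, assuming the `google-generativeai` package and a valid API key (placeholders below):

```python
import google.generativeai as palm

palm.configure(api_key="YOUR_API_KEY")  # placeholder; use a real PaLM API key

# Placeholder prompt; append the document to summarize.
prompt = "Summarize the following text in two sentences:\n" + "..."

# text-bison-001 is the text-generation model used in the PaLM quickstart.
completion = palm.generate_text(
    model="models/text-bison-001",
    prompt=prompt,
    temperature=0.2,
    max_output_tokens=200,
)
print(completion.result)
```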
Info
A valuable project using XLNet. It will take a lot of time to validate.
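For scoping that validation, the `bert-extractive-summarizer` package also wraps XLNet behind the same interface, which may be the quickest way to try it. A minimal sketch, assuming the `xlnet-base-cased` checkpoint and a placeholder `body`:

```python
from summarizer import TransformerSummarizer

# Placeholder input; replace with the document to validate against.
body = """XLNet is an autoregressive pretraining method that models all
permutations of the factorization order, and it outperformed BERT on a
number of benchmarks when released."""

# Same extractive clustering approach as the BERT summarizer, but with
# XLNet embeddings.
model = TransformerSummarizer(transformer_type="XLNet",
                              transformer_model_key="xlnet-base-cased")
summary = "".join(model(body, min_length=20))
print(summary)
```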
XLNet — A new pre-training method outperforming BERT on 20 tasks
Info
Describes another approach (XLNet) compared to BERT.
The Illustrated GPT-2 (Visualizing Transformer Language Models)
Info
Introduces the underlying principles.
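Since GPT-2 also appears in the model table but has no snippet in these notes, here is a hedged sketch of the classic GPT-2 "TL;DR:" summarization trick via Hugging Face Transformers; the generation parameters are illustrative assumptions:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

text = "..."  # placeholder: document to summarize

# GPT-2 was never trained for summarization explicitly; appending "TL;DR:"
# prompts it to continue with a summary-like completion.
inputs = tokenizer.encode(text + "\nTL;DR:", return_tensors="pt")
outputs = model.generate(
    inputs,
    max_new_tokens=60,
    num_beams=4,
    no_repeat_ngram_size=3,
    pad_token_id=tokenizer.eos_token_id,
)
# Decode only the newly generated tokens after the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))
```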
Harnessing the Power of Google Bard with Python: A Comprehensive Guide
