Models credited to Xin Pan, Peter Liu, and Google Brain
Papers and references: Awesome text summarization
- Table of Contents
- Data Preparation
- Pipeline for Financial Dataset
- Seq2Seq with Attention
- Pointer-generator
- Non-financial
- CNN and Daily Mail
- Yelp Review Dataset
- Financial
- Non-financial
Pretrained model
Use the pretrained model for financial news (currently trained on non-financial news from CNN/Daily Mail)
Tokenize the test financial news using Stanford CoreNLP
Preprocess the tokenized financial news and store it in test.bin
Use the pointer-generator network to load the pretrained model and decode (generate the summary)
python --mode=decode --data_path=/path/to/data/test.bin --vocab_path=/path/to/data/vocab --log_root=/path/to/directory/containing/pretrained_model --exp_name=pretrained_model --max_enc_steps=400 --max_dec_steps=100 --coverage=1
Adjust the number of encoder steps (input passage length, --max_enc_steps) and decoder steps (output summary length, --max_dec_steps)
Visualize the result
Sample abstractive summary for CNN news: Here
Visualize the attention network with this tool
For Python 3, run: python -m http.server
Result with coverage, output of 100 words
Result with coverage, output of 50 words
The model copies whole sentences from the paragraph...
Result without coverage, output of 100 words
The model copies whole sentences from the paragraph...
For convenience, sumy can extract news from a URL or a text document, and it offers several algorithms for selecting the important sentences of an article: luhn | edmundson | lsa | text-rank | lex-rank | sum-basic | kl
pip install sumy
sumy sum-basic --length=2 --url=
Like Zhang, Ye has recovered from the deadly novel coronavirus.
"So please go to the hospital for examination as soon as possible when you got it.
The encoder takes the input words to be transformed (translated or summarized). Each word is a vector that passes through forward and backward activations in a bi-directional RNN. An attention value is then calculated for each encoder word, reflecting its importance in the sentence. The decoder generates the output one word at a time, taking the dot product of the feature vectors and their corresponding attention values at each timestep.
Encoder: bi-directional RNN. The feature vector at timestep t is the concatenation of the forward RNN and backward RNN activations.
Decoder: RNN fed with the dot product between the attention values and the encoder activations.
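The encoder and decoder steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the repository's implementation: the toy state values are made up, the names `attend`, `forward`, and `backward` are hypothetical, and the scoring function is a plain dot product chosen for brevity (the pointer-generator section refers to additive attention instead).

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(encoder_states, decoder_state):
    """Return (attention weights, context vector) for one decoder step."""
    # Assumption: score each encoder position by a dot product with the decoder state.
    scores = [sum(h * s for h, s in zip(state, decoder_state))
              for state in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # Context vector: attention-weighted sum of the encoder feature vectors.
    context = [sum(w * state[d] for w, state in zip(weights, encoder_states))
               for d in range(dim)]
    return weights, context


# Toy forward and backward RNN states for a 3-word input (made-up numbers).
forward = [[0.1, 0.3], [0.7, 0.2], [0.4, 0.9]]
backward = [[0.5, 0.1], [0.2, 0.8], [0.6, 0.3]]
# Encoder feature vector per timestep: concatenation of both directions.
encoder_states = [f + b for f, b in zip(forward, backward)]

weights, context = attend(encoder_states, decoder_state=[1.0, 0.0, 0.0, 1.0])
```

In this toy input the second word receives the largest attention weight, so it contributes most to the context vector handed to the decoder.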
Beam search is used in the decoder to keep up to the k most likely word choices at each step, where k is a user-specified parameter (beam width).
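A minimal sketch of beam search over a toy decoder may make this concrete. Everything here is an assumption for illustration: `beam_search`, `TABLE`, and the probabilities are hypothetical, and a real decoder would condition on the full history and the encoder output rather than on the last word only.

```python
import math


def beam_search(next_probs, start, end, k=2, max_len=5):
    """Return up to k hypotheses as (log-probability, tokens), best first."""
    beams = [(0.0, [start])]
    for _ in range(max_len):
        candidates = []
        for score, tokens in beams:
            if tokens[-1] == end:
                candidates.append((score, tokens))  # finished hypothesis is kept as-is
                continue
            for word, p in next_probs(tokens).items():
                candidates.append((score + math.log(p), tokens + [word]))
        # Keep only the k highest-scoring hypotheses (the beam width).
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:k]
        if all(tokens[-1] == end for _, tokens in beams):
            break
    return beams


# Hypothetical toy "decoder": next-word probabilities depend on the last word only.
TABLE = {
    "<s>": {"stocks": 0.6, "markets": 0.4},
    "stocks": {"fell": 0.5, "rose": 0.5},
    "markets": {"rallied": 0.9, "fell": 0.1},
    "fell": {"</s>": 1.0},
    "rose": {"</s>": 1.0},
    "rallied": {"</s>": 1.0},
}

best = beam_search(lambda tokens: TABLE[tokens[-1]], "<s>", "</s>", k=2)
```

With k=2 the search keeps "markets" alive after the first step even though "stocks" is more likely on its own, and ends up preferring "markets rallied" (probability 0.36) over any continuation of "stocks" (0.30). A greedy decoder (k=1) would miss it.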
Abstractive text summarization relies on sequence-to-sequence models, which have two shortcomings: they are liable to reproduce factual details inaccurately, and they tend to repeat themselves. The pointer-generator model, introduced in 2017 by researchers at Stanford and Google Brain and state of the art at the time, addresses both problems. On top of the attention model it adds two features. First, it copies words from the source text via pointing, which aids accurate reproduction of information. Second, it uses coverage to keep track of what has been summarized, which discourages repetition.
In addition to attention, we add two things:
Copy mechanism
- Copy words that occur in the source text by summing the attention distribution over every position where the same word appears.
- Combine this copy distribution with the general vocabulary distribution $P_{vocab}$ (computed by the attention decoder earlier), using the weight $p_{gen}$.
- $p_{gen} \in [0, 1]$ for timestep t is calculated from the context vector $a^*$, the decoder state $s$, and the decoder input $c$.
- Training: a sigmoid over these inputs gives the probability $p_{gen}$.
Coverage mechanism
- Keep a record of the parts of the source that have already appeared in the decoder output many times.
- Sum the attention distributions over all previous decoder timesteps. The resulting coverage vector represents the degree of coverage that each source word has received from the attention mechanism so far.
- The additive attention of the earlier seq2seq attention model is changed to take the coverage vector as an extra input.
- One more term, a coverage loss, is added to the training loss.
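The two additions can be sketched numerically. The code below follows the formulation in the pointer-generator paper as I understand it: the final distribution is p_gen times the vocabulary distribution plus (1 - p_gen) times the attention summed over every occurrence of a word, and the coverage loss is the sum over source positions of min(attention, coverage). The function names and all numbers are hypothetical, and p_gen is passed in as a fixed value rather than computed by a learned sigmoid layer.

```python
def final_distribution(p_gen, p_vocab, attention, source_words):
    """Mix the vocabulary and copy distributions for one decoder step."""
    # Generation part: p_gen * P_vocab(w).
    final = {w: p_gen * p for w, p in p_vocab.items()}
    # Copy part: (1 - p_gen) * attention, summed over every occurrence of the word.
    for word, a in zip(source_words, attention):
        final[word] = final.get(word, 0.0) + (1.0 - p_gen) * a
    return final


def coverage_step(coverage, attention):
    """Return (coverage loss for this step, updated coverage vector)."""
    # Penalise attending again to positions that are already covered.
    loss = sum(min(a, c) for a, c in zip(attention, coverage))
    # Coverage is the running sum of all previous attention distributions.
    updated = [c + a for c, a in zip(coverage, attention)]
    return loss, updated


source = ["apple", "shares", "rose", "apple"]
attention = [0.4, 0.1, 0.2, 0.3]                     # made-up attention over source positions
p_vocab = {"shares": 0.5, "rose": 0.3, "the": 0.2}   # "apple" is out of vocabulary

dist = final_distribution(0.6, p_vocab, attention, source)
loss0, coverage = coverage_step([0.0] * len(source), attention)
loss1, coverage = coverage_step(coverage, attention)  # same attention again
```

Here "apple" gets probability mass only through copying, which is how out-of-vocabulary source words can still be produced. Repeating the same attention on the second step yields a large coverage loss, while the first step costs nothing because coverage starts at zero.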
Training from scratch: GitHub Code Here
Transfer learning
Use a pre-trained model (version for TensorFlow 1.2.1), which is a saved network that was previously trained by others on a large dataset. That way I don't need to re-train the model from scratch, which takes many hours (around 7 days for this model), and a model pre-trained on a massive dataset can already serve effectively as a generic summarization model.
Model Performance
Metrics used:
ROUGE-1: overlap of unigrams between the system-generated summary and the reference summary / number of unigrams in the reference summary
ROUGE-2: overlap of bigrams between the system-generated summary and the reference summary / number of bigrams in the reference summary
ROUGE-L: length of the LCS (Longest Common Subsequence) between the system-generated summary and the reference summary / number of unigrams in the reference summary
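The three definitions above are recall-style ratios, and a small sketch shows how they can be computed for a single reference. This is a simplified illustration, not the official ROUGE toolkit: tokenization is a plain whitespace split, there is no stemming, and the helper names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(system, reference, n):
    """Overlapping n-grams divided by the number of n-grams in the reference."""
    sys_counts, ref_counts = ngrams(system, n), ngrams(reference, n)
    overlap = sum((sys_counts & ref_counts).values())  # clipped (min) counts
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]


def rouge_l(system, reference):
    """LCS length divided by the number of unigrams in the reference."""
    return lcs_length(system, reference) / len(reference) if reference else 0.0


reference = "the cat sat on the mat".split()
system = "the cat lay on the mat".split()

r1 = rouge_n(system, reference, 1)   # 5 of 6 reference unigrams matched
r2 = rouge_n(system, reference, 2)   # 3 of 5 reference bigrams matched
rl = rouge_l(system, reference)      # LCS "the cat on the mat" has length 5
```

Changing one word costs one unigram but two bigrams, which is why ROUGE-2 drops faster than ROUGE-1 on near-identical summaries.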
Example from Paper: