
Implement IR based Supervised Sentence Ranker #314

Open
wants to merge 32 commits into base: develop
a9f9162
Update .gitattribute with model extension of supervised summarizer
dafajon Jan 27, 2022
892c33e
LGBMRanker summarizer model
dafajon Jan 27, 2022
e2cc13f
Add new config for supervised summarization
dafajon Jan 27, 2022
df47f3c
Commit initial implementation of ranker summarizer with builder pattern
dafajon Jan 27, 2022
c03012b
Add ranker to summarizer module init
dafajon Jan 27, 2022
4d847d1
Fix a typo in cluster summarizer docstring
dafajon Jan 28, 2022
9449c56
Remove summarizer setting from config.
dafajon Jan 28, 2022
a5b67ad
Update supervised summarizer implemantation. Remove builder pattern. …
dafajon Jan 28, 2022
ea98020
Update summarize module init
dafajon Jan 28, 2022
de6e877
Implement optimizer for tuning rankers for different embeddings or su…
dafajon Jan 28, 2022
7da5b24
Implement placeholders for the optuna utility methods for tuning.
dafajon Jan 28, 2022
fd056c5
Implement utility functions for optimization, training and dump of ne…
dafajon Feb 1, 2022
9b18e2e
Implement training dataset prep and optimizer methods of optimizer class
dafajon Feb 1, 2022
032ed2d
Update summarizer README with summarizer description and usage
dafajon Feb 1, 2022
40f35c2
Update new requirements for ranker summarization.
dafajon Feb 2, 2022
f377b15
Implement Live UI for tuning progress.
dafajon Feb 2, 2022
4852d2f
Add new reuirements for the supervised ranker summarizer.
dafajon Feb 4, 2022
aefd07d
Implement components for ranker tuning with optuna.
dafajon Feb 4, 2022
11012bd
Implement SentenceRanker class and a RankerOptimizer class that inher…
dafajon Feb 4, 2022
9b14cbf
Add import statements for SupervisedSentenceRanker and RankerOptimize…
dafajon Feb 4, 2022
aa45f92
Updade summarize module README with supervised ranker usage and its s…
dafajon Feb 4, 2022
0d7f546
Add the default ranker model to new summarize/models directory
dafajon Feb 4, 2022
869af98
Fix a typo in cluster summarizer docstring
dafajon Feb 4, 2022
0ae8284
Fix merge conflict at cluster summarizer docstring
dafajon Feb 4, 2022
f4daa56
Revert "Implement SentenceRanker class and a RankerOptimizer class th…
dafajon Feb 4, 2022
aafbb11
Implement SupervisedSentenceRanker and child RankerOptimizer class.
dafajon Feb 4, 2022
fd91f21
Implement tests for supervised ranker.
dafajon Feb 4, 2022
91ea522
Merge branch 'develop' into feature/ranker_summarizer
dafajon Feb 11, 2022
da29598
Fix lint error due to assert statement left in code.
dafajon Feb 11, 2022
e11069a
Fix import statement and test skips based on optional pandas dependency.
dafajon Feb 11, 2022
706100f
Fix import statement and test skips based on optional optuna dependency.
dafajon Feb 11, 2022
7c47a79
Fix type hinting that requires pandas dependency.
dafajon Feb 11, 2022
1 change: 1 addition & 0 deletions .gitattributes
@@ -5,3 +5,4 @@ sadedegel/prebuilt/model/*.joblib filter=lfs diff=lfs merge=lfs -text
sadedegel/bblock/data/bert/vocabulary.hdf5 filter=lfs diff=lfs merge=lfs -text
sadedegel/bblock/data/icu/vocabulary.hdf5 filter=lfs diff=lfs merge=lfs -text
sadedegel/bblock/data/simple/vocabulary.hdf5 filter=lfs diff=lfs merge=lfs -text
sadedegel/summarize/model/*.joblib filter=lfs diff=lfs merge=lfs -text
5 changes: 4 additions & 1 deletion prod.requirements.txt
@@ -14,4 +14,7 @@ sadedegel-icu
requests
rich
cached-property
h5py>=3.1.0,<=3.2.1
h5py>=3.1.0,<=3.2.1

lightgbm
randomname
2 changes: 1 addition & 1 deletion sadedegel/default.ini
@@ -35,4 +35,4 @@ method = smooth
[bm25]
k1 = 1.25
b = 0.75
delta = 0
delta = 0
63 changes: 63 additions & 0 deletions sadedegel/summarize/README.md
@@ -10,6 +10,64 @@ by recording the **Round** in which each sentence is eliminated.

The later a sentence is eliminated, the higher its relative score within a given news document.
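This elimination-round scheme can be sketched with a tiny example (illustrative only; the mapping below is an assumption, not necessarily the exact formula SadedeGel uses): sentences eliminated in later rounds receive higher relevance.

```python
# Illustrative only: map elimination rounds to relevance scores.
# Assumption: a sentence eliminated in a later round is more relevant.
rounds = {"sent_a": 1, "sent_b": 3, "sent_c": 2}  # elimination round per sentence

n_rounds = max(rounds.values())
relevance = {sent: r / n_rounds for sent, r in rounds.items()}

print(relevance)  # sent_b survives the longest, so it scores highest
```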

## Summarizer Usage

SadedeGel summarizers share the same interface.

First a `sadedegel.summarize.ExtractiveSummarizer` instance is constructed.
```python
from sadedegel.summarize import LengthSummarizer, TFIDFSummarizer, DecomposedKMeansSummarizer

lsum = LengthSummarizer(normalize=True)
tfidf_sum = TFIDFSummarizer(normalize=True)
kmsum = DecomposedKMeansSummarizer(n_components=200, n_clusters=10)
```

Next, create a `sadedegel.Doc` instance from the single document to be summarized.
```python
from sadedegel import Doc

d = Doc("ABD'li yayın organı New York Times, yaklaşık 3 ay içinde kullanıcı sayısını sıfırdan milyonlara çıkaran kelime oyunu Wordle’ı satın aldığını duyurdu. New York Times kısa bir süre önce de spor haberleri sitesi The Athletic'i satın almak için 550 milyon doları gözden çıkarmış ve bu satın alma ile birlikte 1.2 milyon abone kazanmıştı. ...")
```

To obtain a summary of k sentences (where k < n_sentences), call the instance with a `Document` object or a `List[Sentences]`:

```python
summary1 = lsum(d, k=2)
summary2 = tfidf_sum(d, k=4)
summary3 = kmsum(d, k=5)
```
Alternatively, you can obtain the relevance scores of all sentences, which are used to rank them before the top k are selected.

```python
relevance_scores = kmsum.predict(d)
```
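The returned scores can then be turned into a summary by hand. A minimal sketch using NumPy (`scores` and `sentences` below are made-up stand-ins for the output of `kmsum.predict(d)` and the document's sentences):

```python
import numpy as np

scores = np.array([0.12, 0.85, 0.40, 0.67])  # hypothetical relevance scores
sentences = ["s0", "s1", "s2", "s3"]

k = 2
top_idx = np.argsort(scores)[::-1][:k]  # indices of the k highest-scoring sentences
summary = [sentences[i] for i in sorted(top_idx)]  # keep original document order

print(summary)  # ['s1', 's3']
```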

#### Supervised Ranker
All sadedegel summarizers rank sentences with unsupervised or rule-based methods before extracting the top k as the summary. This release adds a ranker model trained on the **SadedeGel Annotated Corpus**, in which each sentence carries a relevance label assigned by human annotators through a process of repeated elimination.

The ranker uses document-sentence embedding pairs from transformer-based pre-trained models as features. Future releases will accommodate BoW-based and decomposition-based embeddings as well.
Pre-trained embedding types supported by sadedegel are `bert_32k_cased`, `bert_128k_cased`, `bert_32k_uncased`, `bert_128k_uncased` and `distilbert`.

```python
from sadedegel.summarize import SupervisedSentenceRanker

ranker = SupervisedSentenceRanker(vector_type="bert_32k_cased")
```

The supervised ranker can be tuned for optimal performance over an embedding type and summarization percentage. The current ranker is optimized with `bert_128k_cased` for average summarization performance over 10%, 50% and 80% of full document length.

**Example**: Specific fine-tuning for short summaries with a smaller embedding extraction model.
```python
from sadedegel.summarize.supervised import RankerOptimizer

fine_tuner = RankerOptimizer(vector_type="distilbert",
                             summarization_perc=0.1,
                             n_trials=20)

fine_tuner.optimize()
```

## Summarizer Performance

Given this [Model Definition](#sadedegel-model),
@@ -28,6 +86,11 @@ ground truth human annotation (Best possible total `relevance` score that can be

### Performance Table

#### Release 0.21.1
| Method | Parameter | ndcg(optimized for k=0.1) | ndcg(optimized for k=0.5) | ndcg(optimized for k=0.8) |
|--------------------------|--------------------------------------|---------------------------|---------------------------|---------------------------|
| SupervisedSentenceRanker | `{"vector_type": "bert_128k_cased"}` | 0.7620 | 0.7269 | 0.8163 |
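ndcg in the table is normalized discounted cumulative gain over the human-annotated relevance labels. A simplified reference implementation of NDCG@k (a sketch of the standard metric, not necessarily SadedeGel's exact evaluation code):

```python
import numpy as np

def ndcg(relevance_in_predicted_order, k):
    """Simplified NDCG@k: DCG of the predicted order divided by DCG of the ideal order."""
    rel = np.asarray(relevance_in_predicted_order, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))  # positions 1..k -> log2(2)..log2(k+1)
    dcg = float((rel * discounts).sum())
    ideal = np.sort(np.asarray(relevance_in_predicted_order, dtype=float))[::-1][:k]
    idcg = float((ideal * discounts).sum())
    return dcg / idcg if idcg > 0 else 0.0
```

An already-ideal ordering scores 1.0, e.g. `ndcg([3, 2, 1], 3)`; any misordering scores strictly below it.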

#### Release 0.18

By 0.18 we have significantly changed the way we evaluate our summarizers.
1 change: 1 addition & 0 deletions sadedegel/summarize/__init__.py
@@ -4,3 +4,4 @@
from .rank import TextRank, LexRankSummarizer # noqa: F401
from .tf_idf import TFIDFSummarizer # noqa: F401
from .bm25 import BM25Summarizer # noqa: F401
from .supervised import SupervisedSentenceRanker, RankerOptimizer  # noqa: F401
7 changes: 3 additions & 4 deletions sadedegel/summarize/cluster.py
@@ -58,10 +58,9 @@ def _predict(self, sentences: List[Sentences]):

class DecomposedKMeansSummarizer(ExtractiveSummarizer):
    """BERT embeddings are high in dimension and potentially carry redundant information that can cause
    overfitting or curse of dimensionality effecting in clustering embeddings.

    DecomposedKMeansSummarizer adds a PCA step (or any othe lsinear/non-linear dimensionality reduction technique)
    before clustering to obtain highest variance in vector fed into clustering
    overfitting or curse of dimensionality effecting in clustering embeddings.
    DecomposedKMeansSummarizer adds a PCA step (or any other linear/non-linear dimensionality reduction technique)
    before clustering to obtain highest variance in vector fed into clustering
    """

    tags = ExtractiveSummarizer.tags + ['cluster', 'ml']
3 changes: 3 additions & 0 deletions sadedegel/summarize/model/ranker_bert_128k_cased.joblib
Git LFS file not shown
160 changes: 160 additions & 0 deletions sadedegel/summarize/supervised.py
@@ -0,0 +1,160 @@
from os.path import dirname
from pathlib import Path
from itertools import tee
import randomname

import numpy as np
from typing import List
import joblib
from rich.console import Console
from rich.progress import track

from ._base import ExtractiveSummarizer
from ..bblock.util import __transformer_model_mapper__
from ..bblock import Sentences
from ..bblock.doc import DocBuilder
from .util.supervised_tuning import optuna_handler, create_empty_model, fit_ranker, save_ranker


__vector_types__ = list(__transformer_model_mapper__.keys()) + ["tfidf", "bm25"]
console = Console()

try:
    import pandas as pd
except ImportError:
    console.log(("pandas package is not a general sadedegel dependency."
                 " However, it is required for building the supervised ranker model."))


def load_model(vector_type, debug=False):
    name = f"ranker_{vector_type}.joblib"

    if vector_type == "bert_128k_cased":
        path = (Path(dirname(__file__)) / 'model' / name).absolute()
    else:
        path = Path(f"~/.sadedegel_data/models/{name}").expanduser()

    if not debug:
        try:
            model = joblib.load(path)
            console.log(f"Initializing ranker model ranker_{vector_type}...", style="blue")
        except Exception as e:
            raise FileNotFoundError(f"A model trained for {vector_type} is not found. Please optimize one with "
                                    f"sadedegel.summarize.RankerOptimizer. {e}")
    else:
        model = name

    return model


class SupervisedSentenceRanker(ExtractiveSummarizer):
    model = None
    vector_type = None
    debug = False
    tags = ExtractiveSummarizer.tags + ["ml", "supervised", "rank"]

    def __init__(self, normalize=True, vector_type="bert_128k_cased", **kwargs):
        super().__init__(normalize)
        self.debug = kwargs.get("debug", False)
        self.init_model(vector_type, self.debug)

    @classmethod
    def init_model(cls, vector_type, debug):
        db_switch = False
        if vector_type not in __vector_types__:
            raise ValueError(f"Not a valid vectorization for input sequence. Valid types are {__vector_types__}")
        if cls.debug != debug:
            cls.debug = debug
            db_switch = True
            if cls.debug:
                console.log("SupervisedSentenceRanker: Switching debug mode ON.")
            else:
                console.log("SupervisedSentenceRanker: Switching debug mode OFF.")
        if cls.vector_type is not None and not db_switch:
            if cls.vector_type == vector_type:
                return 0

        cls.model = load_model(vector_type, debug)
        cls.vector_type = vector_type

    def _predict(self, sents: List[Sentences]) -> np.ndarray:
        if self.vector_type not in ["tfidf", "bm25"]:
            doc_sent_embeddings = self._get_pretrained_embeddings(sents)
        else:
            raise NotImplementedError("BoW interface for SupervisedSentenceRanker is not yet implemented.")

        if self.model is not None:
            scores = self.model.predict(doc_sent_embeddings)
        else:
            raise ValueError("A ranker model is not found.")

        return scores

    def _get_pretrained_embeddings(self, sents: List[Sentences]) -> np.ndarray:
        doc_embedding = sents[0].document.get_pretrained_embedding(architecture=self.vector_type, do_sents=False)
        doc_embedding = np.vstack(len(sents) * [doc_embedding])
        sent_embeddings = sents[0].document.get_pretrained_embedding(architecture=self.vector_type, do_sents=True)

        return np.hstack([doc_embedding, sent_embeddings])

    def _get_bow_vectors(self, sents: List[Sentences]) -> np.ndarray:
        pass


class RankerOptimizer(SupervisedSentenceRanker):
    def __init__(self, n_trials: int, vector_type: str, summarization_perc: float, **kwargs):
        self.n_trials = n_trials
        self.vector_type = vector_type
        self.summarization_perc = summarization_perc

    def optimize(self):
        """Optimize the ranker model for a custom summarization percentage. Optimize and dump a new model."""
        run_name = randomname.get_name()
        df, vecs = self._prepare_dataset()

        optuna_handler(n_trials=self.n_trials, run_name=run_name,
                       metadata=df, vectors=vecs, k=self.summarization_perc)

        model = create_empty_model(run_name)
        ranker = fit_ranker(ranker=model, vectors=vecs, metadata=df)
        save_ranker(ranker, name=self.vector_type)

    def _prepare_dataset(self):
        try:
            from sadedegel.dataset import load_raw_corpus, load_annotated_corpus
        except Exception as e:
            raise ValueError(f"Cannot import raw and annotated corpora. {e}")

        annot = load_annotated_corpus()
        annot_, annot = tee(annot)

        embs = []
        metadata = []
        Doc = DocBuilder()
        for doc_id, doc in track(enumerate(annot), description="Processing documents", total=len(list(annot_))):
            relevance_scores = doc["relevance"]
            d = Doc.from_sentences(doc["sentences"])
            sents = list(d)

            for sent_id, sent in enumerate(sents):
                instance = dict()
                instance["doc_id"] = doc_id
                instance["sent_id"] = sent_id
                instance["relevance"] = relevance_scores[sent_id]

                metadata.append(instance)

            if self.vector_type not in ["tfidf", "bm25"]:
                doc_sent_embeddings = self._get_pretrained_embeddings(sents)
            else:
                raise NotImplementedError("BoW interface for SupervisedSentenceRanker is not yet implemented.")

            embs.append(doc_sent_embeddings)

        df = pd.DataFrame.from_records(metadata)
        vecs = np.vstack(embs)

        return df, vecs