Implement IR based Supervised Sentence Ranker #314

Open
wants to merge 32 commits into
base: develop

dafajon
Contributor

@dafajon dafajon commented Feb 11, 2022

  • This PR adds a retrieval-based supervised summarizer implemented with a LightGBM ranker.
  • The ranker is trained on the (sentence, relevance) pairs from sadedegel.dataset.annotated.
  • Evaluation uses leave-one-out cross-validation because the number of documents is small (~100).
  • Optuna-based optimization is also implemented for a user-specified summarization length or a chosen embedding type.
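
The evaluation scheme in the list above can be sketched in plain Python. This is a minimal illustration, not the code in this PR: the data layout (one list of relevance labels per document) and the helper names `leave_one_document_out` and `group_sizes` are assumptions made for the example.

```python
from typing import Iterator, List, Sequence, Tuple

# Assumed layout: one entry per document, each a list of relevance labels
# (one label per sentence). The real annotated dataset may be shaped differently.
Document = Sequence[int]


def leave_one_document_out(
    docs: Sequence[Document],
) -> Iterator[Tuple[List[int], int]]:
    """Yield (train_indices, held_out_index) once per document."""
    for held_out in range(len(docs)):
        train = [i for i in range(len(docs)) if i != held_out]
        yield train, held_out


def group_sizes(docs: Sequence[Document], indices: Sequence[int]) -> List[int]:
    """Number of sentences per training document, in training order.

    A learning-to-rank model uses these sizes so that sentences are only
    compared with other sentences from the same document.
    """
    return [len(docs[i]) for i in indices]


# Three toy documents with 3, 2 and 4 sentences.
docs = [[2, 0, 1], [1, 0], [0, 3, 1, 0]]
folds = list(leave_one_document_out(docs))
first_train, first_test = folds[0]
first_groups = group_sizes(docs, first_train)
```

With LightGBM, sizes like `first_groups` would typically be passed as the `group` argument of `LGBMRanker.fit`, so each fold trains on all documents except the held-out one.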

test_supervised.py

  • Implement test for initializing ranker with lazy loading of the appropriate model.
  • Test re-loading of model when embedding type is switched.
  • Test for summary output with specified sentence length.
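
The lazy-loading behaviour exercised by these tests could look roughly like the sketch below. `LazyRanker`, `fake_loader`, and the embedding names are hypothetical stand-ins, not the PR's `SupervisedSentenceRanker`; the point is only that the model is loaded on first access, cached at class level, and reloaded when the embedding type changes.

```python
from typing import Any, Callable, List, Optional


class LazyRanker:
    """Toy stand-in for a ranker that loads its model lazily."""

    # Class-level cache shared by all instances: the loaded model and the
    # embedding type it was loaded for.
    _model: Optional[Any] = None
    _loaded_for: Optional[str] = None

    def __init__(self, vector_type: str, loader: Callable[[str], Any]):
        self.vector_type = vector_type
        self._loader = loader  # hypothetical: maps vector_type to a model object

    @property
    def model(self) -> Any:
        cls = type(self)
        # Load on first access, or reload if the embedding type has changed.
        if cls._model is None or cls._loaded_for != self.vector_type:
            cls._model = self._loader(self.vector_type)
            cls._loaded_for = self.vector_type
        return cls._model


load_calls: List[str] = []


def fake_loader(vector_type: str) -> str:
    """Record each load and return a placeholder instead of a real model."""
    load_calls.append(vector_type)
    return f"model-for-{vector_type}"


# Constructing the ranker does not trigger a load.
ranker = LazyRanker("bert_128k_cased", fake_loader)
```

A test in this style would assert that `load_calls` is empty after construction, grows by one on the first `ranker.model` access, stays the same on repeated access, and grows again after `ranker.vector_type` is switched.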

supervised.py

  • Implement the SupervisedSentenceRanker class as a child of ExtractiveSummarizer.
  • The embedding generation phase converts string input into the doc-sentence representation for the LGBMRanker. Embedding generation for transformer-based and BoW-based representations is decoupled from the predict method.
  • Implement a tuner class, RankerOptimizer, for users who need a ranker optimized for a given summarization_percentage and a different embedding selected with vector_type. It inherits from SupervisedSentenceRanker to reuse its embedding extraction methods.
  • _prepare_dataset uses the extraction methods to put the dataset into the format required by LGBMRanker.
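
As a rough illustration of the kind of format a learning-to-rank model expects, the sketch below flattens per-document (sentence, relevance) pairs into a feature list, a label list, and per-document group sizes. The function name `prepare_ranking_dataset`, the input layout, and the toy embedding are assumptions for the example; the actual `_prepare_dataset` may build its doc-sentence representation differently.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def prepare_ranking_dataset(
    docs: Sequence[Sequence[Tuple[str, int]]],
    embed: Callable[[str], Vector],
) -> Tuple[List[Vector], List[int], List[int]]:
    """Flatten documents into (features, labels, group sizes)."""
    features: List[Vector] = []
    labels: List[int] = []
    group: List[int] = []
    for doc in docs:
        for sentence, relevance in doc:
            features.append(embed(sentence))
            labels.append(relevance)
        # One group per document, sized by its sentence count.
        group.append(len(doc))
    return features, labels, group


def toy_embed(sentence: str) -> Vector:
    """Placeholder features: character count and word count."""
    return [float(len(sentence)), float(len(sentence.split()))]


docs = [
    [("A short one.", 2), ("Another sentence here.", 0)],
    [("Only sentence.", 1)],
]
X, y, group = prepare_ranking_dataset(docs, toy_embed)
```

Keeping the embedding function as a parameter mirrors the decoupling described above: the same flattening step works whether the vectors come from a transformer or a BoW model.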

util/supervised_tuning.py

  • Implement components for optimizing the ranker.
  • Implement logging and parsing of the best trial's parameters.
  • Implement the objective function for Optuna, with sampling of the parameter space.
  • Implement a callback for live status updates in place of Optuna's verbose output.
  • Implement fitting and saving of the model with the best hyperparameters.
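
A live-status callback of the kind listed above might be sketched as follows. Optuna invokes callbacks as `callback(study, trial)`; the class name `LiveStatus` and the message format are assumptions, and the demo uses `SimpleNamespace` stand-ins so the block runs without Optuna installed.

```python
from types import SimpleNamespace
from typing import Any, List


class LiveStatus:
    """Callable following Optuna's callback convention: callback(study, trial)."""

    def __init__(self, n_trials: int):
        self.n_trials = n_trials
        self.lines: List[str] = []  # kept so the output can be inspected later

    def __call__(self, study: Any, trial: Any) -> None:
        # Assumes the trial completed, so trial.value is a float.
        line = (
            f"trial {trial.number + 1}/{self.n_trials} "
            f"score={trial.value:.4f} best={study.best_value:.4f}"
        )
        self.lines.append(line)
        # Carriage return overwrites the same terminal line on each update.
        print(line, end="\r", flush=True)


# Stand-ins for an Optuna study and a finished trial.
status = LiveStatus(n_trials=2)
fake_study = SimpleNamespace(best_value=0.75)
status(fake_study, SimpleNamespace(number=0, value=0.5))
```

With Optuna itself, such an object would be passed as `study.optimize(objective, n_trials=2, callbacks=[status])`, usually together with `optuna.logging.set_verbosity(optuna.logging.WARNING)` to silence the default per-trial log lines.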

README.md

  • Update with usage of all summarizers.
  • Add usage of supervised sentence ranker and tuner.
  • Add scores for the ranker.

model/ranker_bert_128k_cased.joblib

  • Add default model for the ranker.
  • Custom rankers trained by the user via RankerOptimizer are serialized to ~/.sadedegel_data/models.

…Implement class variable for lazy loading of model.
…w ranker models for different tfm embeddings and summarization lengths.
…at inherits it."

Implement DEBUG mode for tests in the SupervisedSentenceRanker class. Merge it with previous commits.

This reverts commit 11012bd.
@dafajon dafajon requested a review from husnusensoy February 11, 2022 12:55
@askarbozcan
Member

ToDo: More data for supervised ranker summarizers.

@askarbozcan askarbozcan added the cleanup-stay Issues that won't be removed as part of cleanup label Aug 14, 2022