Merge pull request #153 from HLasse/tutorial_quality
Added quality tutorial
HLasse authored Jan 16, 2023
2 parents 246365c + e89a80c commit 21499a6
Showing 8 changed files with 1,632 additions and 278 deletions.
39 changes: 21 additions & 18 deletions docs/quality.rst
@@ -71,28 +71,30 @@ If you want to specify the thresholds for the quality metrics, you can do so by
 # set thresholds for quality metrics (these are just the default)
 thresholds = QualityThresholds(
-    n_stop_words=(2, None),
-    alpha_ratio=(0.8, None),
-    mean_word_length=(3, 10),
-    doc_length= (10, 100_000),
-    symbol_hashtag_to_word_ratio=(None, 0.1),
+    n_stop_words=(2, None),  # at least 2 stop words, no upper bound
+    alpha_ratio=(0.7, None),
+    mean_word_length=(3, 10),  # mean word length between 3 and 10 characters
+    doc_length=(10, 100000),
+    symbol_to_word_ratio={"#": (None, 0.1)},
     proportion_ellipsis=(None, 0.3),
     proportion_bullet_points=(None, 0.8),
+    contains={"lorem ipsum": False},
     duplicate_line_chr_fraction=(None, 0.2),
     duplicate_paragraph_chr_fraction=(None, 0.2),
-    duplicate_5gram_chr_fraction=(None, 0.15),
-    duplicate_6gram_chr_fraction=(None, 0.14),
-    duplicate_7gram_chr_fraction=(None, 0.13),
-    duplicate_8gram_chr_fraction=(None, 0.12),
-    duplicate_9gram_chr_fraction=(None, 0.11),
-    duplicate_10gram_chr_fraction=(None, 0.1),
-    top_2gram_chr_fraction=(None, 0.20),
-    top_3gram_chr_fraction=(None, 0.18),
-    top_4gram_chr_fraction=(None, 0.16),
-    contains_lorem_ipsum=False
+    duplicate_ngram_chr_fraction={
+        "5": (None, 0.15),
+        "6": (None, 0.14),
+        "7": (None, 0.13),
+        "8": (None, 0.12),
+        "9": (None, 0.11),
+        "10": (None, 0.1),
+    },
+    top_ngram_chr_fraction={"2": (None, 0.2), "3": (None, 0.18), "4": (None, 0.16)},
 )
-nlp.add_pipe("textdescriptives.quality", config={"quality_thresholds": thresholds.dict()})
+quality_pipe = nlp.add_pipe("textdescriptives.quality")
+quality_pipe.set_quality_thresholds(thresholds)  # update the quality thresholds
 doc = nlp("The world is changed. I feel it in the water. I feel it in the earth. I smell it in the air. Much that once was is lost, for none now live who remember it.")
 # all attributes are stored as a dict in the ._.quality attribute
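Note on the threshold tuples in this hunk: the inline comments describe each `(lower, upper)` pair as an interval in which `None` leaves that side unbounded. A minimal sketch of that reading follows. `within_interval` is a hypothetical helper written for illustration, not part of textdescriptives, and treating both bounds as inclusive is an assumption that the diff does not confirm.

```python
from typing import Optional, Tuple

# (lower, upper); None means "no bound on this side".
Interval = Tuple[Optional[float], Optional[float]]


def within_interval(value: float, bounds: Interval) -> bool:
    """Hypothetical check: True if value lies inside bounds (inclusive, by assumption)."""
    lower, upper = bounds
    if lower is not None and value < lower:
        return False
    if upper is not None and value > upper:
        return False
    return True


# Bounds copied from the thresholds in the hunk; the measured values are made up.
stop_words_ok = within_interval(5, (2, None))      # at least 2 stop words
word_length_ok = within_interval(12.0, (3, 10))    # above the upper bound of 10
hashtag_ratio_ok = within_interval(0.05, (None, 0.1))  # at most 0.1

print(stop_words_ok, word_length_ok, hashtag_ratio_ok)
```

Under this reading a document with a mean word length of 12 would fail the `mean_word_length=(3, 10)` threshold, while the other two example values pass.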
@@ -112,5 +114,6 @@ Component
 Data Classes
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-.. autopydantic_model:: textdescriptives.components.quality.QualityThresholds
-
+.. autopydantic_model:: textdescriptives.components.quality_data_classes.QualityThresholds
+.. autopydantic_model:: textdescriptives.components.quality_data_classes.QualityOutput
+.. autopydantic_model:: textdescriptives.components.quality_data_classes.ThresholdsOutput
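Note for code written against the old keyword names: the docs/quality.rst example drops the flat per-n-gram keywords (`duplicate_5gram_chr_fraction`, `top_2gram_chr_fraction`, ...) in favour of nested dicts keyed by n-gram size, and replaces `symbol_hashtag_to_word_ratio` and `contains_lorem_ipsum` with `symbol_to_word_ratio` and `contains`. The sketch below converts a dict of old-style keyword arguments into the new layout. `to_nested_thresholds` is a hypothetical helper, not something the library provides, and the old-to-new pairing is inferred from this diff only.

```python
import re
from typing import Any, Dict

# Matches the old flat names, e.g. "duplicate_5gram_chr_fraction" or "top_2gram_chr_fraction".
_FLAT_NGRAM = re.compile(r"^(duplicate|top)_(\d+)gram_chr_fraction$")


def to_nested_thresholds(flat: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical migration helper: regroup old flat keys into the nested layout."""
    nested: Dict[str, Any] = {}
    for key, value in flat.items():
        match = _FLAT_NGRAM.match(key)
        if match:
            kind, n = match.groups()
            # The new layout uses string keys such as "5" for the n-gram size.
            nested.setdefault(f"{kind}_ngram_chr_fraction", {})[n] = value
        elif key == "symbol_hashtag_to_word_ratio":
            nested.setdefault("symbol_to_word_ratio", {})["#"] = value
        elif key == "contains_lorem_ipsum":
            nested.setdefault("contains", {})["lorem ipsum"] = value
        else:
            # Names that did not change in the diff pass through untouched.
            nested[key] = value
    return nested


old_style = {
    "doc_length": (10, 100_000),
    "symbol_hashtag_to_word_ratio": (None, 0.1),
    "duplicate_5gram_chr_fraction": (None, 0.15),
    "top_2gram_chr_fraction": (None, 0.2),
    "contains_lorem_ipsum": False,
}
new_style = to_nested_thresholds(old_style)
print(new_style)
```

The resulting dict has the same shape as the keyword arguments passed to `QualityThresholds` in the updated example, so it could in principle be unpacked with `**new_style`; that usage is untested here.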
1 change: 1 addition & 0 deletions docs/tutorial.rst
@@ -10,4 +10,5 @@ locally.
 :caption: Tutorials

 tutorials/introductory_tutorial.ipynb
+tutorials/filter_corpus_using_quality.ipynb
