From 3697e596cc397c6b22be04f9841f2d55b4721903 Mon Sep 17 00:00:00 2001 From: Kenneth Enevoldsen Date: Mon, 9 Jan 2023 11:40:17 +0100 Subject: [PATCH 01/14] docs: Added new tutorial for quality filtering --- .../filter_corpus_using_quality.ipynb | 615 ++++++++++++++++++ docs/tutorials/introductory_tutorial.ipynb | 2 +- pyproject.toml | 1 + src/textdescriptives/components/quality.py | 2 - 4 files changed, 617 insertions(+), 3 deletions(-) create mode 100644 docs/tutorials/filter_corpus_using_quality.ipynb diff --git a/docs/tutorials/filter_corpus_using_quality.ipynb b/docs/tutorials/filter_corpus_using_quality.ipynb new file mode 100644 index 00000000..ff7566f1 --- /dev/null +++ b/docs/tutorials/filter_corpus_using_quality.ipynb @@ -0,0 +1,615 @@ +{ + "cells": [ + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Filtering corpora using Quality\n", + "\n", + "\n", + " \"Open\n", + "\n", + "\n", + "In many cases if you want to analyse tweets, train a model on text scraped from the web or similar, it is important to filter out low-quality texts.\n", + "\n", + "TextDescriptives implements a series of heuristic filters for removing low-quality text. This tutorial will take you through how to use these to filter\n", + "your text corpora." + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Setup\n", + "\n", + "For this we will use the [Danish Gigaword](https://sprogteknologi.dk/dataset/danish-gigaword) available on [Huggingface Datasets](DDSC/partial-danish-gigaword-no-twitter). A large collection of Danish texts collected from a variety of domains. To download it we will need the `datasets` package. 
To install it please run:\n", + "\n", + "```python\n", + "!pip install datasets\n", + "```\n", + "\n", + "We can now easily download the dataset using the following command:" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Using custom data configuration DDSC--partial-danish-gigaword-no-twitter-d9c41a85c2339e48\n", + "Found cached dataset parquet (/Users/au561649/.cache/huggingface/datasets/DDSC___parquet/DDSC--partial-danish-gigaword-no-twitter-d9c41a85c2339e48/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)\n" + ] + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "ae17b122ba40474e83cb277625131e0f", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + " 0%| | 0/1 [00:00\n", + "\n", + "\n",
|  | text | source | doc_id | LICENSE | uri | date_built |
| 0 | JØRGINE JØRGINE KØBENHAVN HAGE & CLAUSENS FORL... | jvj | jvj_Jørgine | Attribution-ShareAlike 4.0 International | NA | Fri Jun 26 13:06:11 2020 CEST +0200 |
| 1 | MYTER MYTER NY SAMLING GYLDENDALSKE BOGHANDEL... | jvj | jvj_Myter-ny-samling | Attribution-ShareAlike 4.0 International | NA | Fri Jun 26 13:06:01 2020 CEST +0200 |
| 2 | DEN NY VERDEN DEN NY VERDEN TIL INTERNATIONAL ... | jvj | jvj_Den-ny-Verden | Attribution-ShareAlike 4.0 International | NA | Fri Jun 26 13:06:45 2020 CEST +0200 |
| 3 | CIMBRERNES TOG TIL EMMERIK JENSEN F . 15 . MAJ... | jvj | jvj_Cimbrernes-Tog | Attribution-ShareAlike 4.0 International | NA | Fri Jun 26 13:05:56 2020 CEST +0200 |
| 4 | OM SPROGET OG UNDERVISNINGEN OM SPROGET OG UND... | jvj | jvj_Om-Sproget-og-Undervisningen | Attribution-ShareAlike 4.0 International | NA | Fri Jun 26 13:05:49 2020 CEST +0200 |
| 5 | GÆST KOMMER TIL VERDEN HAN var født paa Sjælan... | jvj | jvj_Gæst-kommer-til-Verden | Attribution-ShareAlike 4.0 International | NA | Fri Jun 26 13:06:21 2020 CEST +0200 |
| 6 | MYTER OG JAGTER MYTER OG JAGTER GYLDENDALSKE B... | jvj | jvj_Myter-og-Jagter | Attribution-ShareAlike 4.0 International | NA | Fri Jun 26 13:06:14 2020 CEST +0200 |
| 7 | DET TABTE LAND DET TABTE LAND, MENNESKET FØR I... | jvj | jvj_Det-tabte-Land | Attribution-ShareAlike 4.0 International | NA | Fri Jun 26 13:06:50 2020 CEST +0200 |
| 8 | SANGERINDEN SANGERINDEN (MADAME D'ORA) DRAMA I... | jvj | jvj_Sangerinden | Attribution-ShareAlike 4.0 International | NA | Fri Jun 26 13:06:52 2020 CEST +0200 |
| 9 | DYRENES FORVANDLING DYRENES FORVANDLING TIL UD... | jvj | jvj_Dyrenes-Forvandling | Attribution-ShareAlike 4.0 International | NA | Fri Jun 26 13:06:48 2020 CEST +0200 |
\n", + "" + ], + "text/plain": [ + " text source \\\n", + "0 JØRGINE JØRGINE KØBENHAVN HAGE & CLAUSENS FORL... jvj \n", + "1 MYTER MYTER NY SAMLING GYLDENDALSKE BOGHANDEL... jvj \n", + "2 DEN NY VERDEN DEN NY VERDEN TIL INTERNATIONAL ... jvj \n", + "3 CIMBRERNES TOG TIL EMMERIK JENSEN F . 15 . MAJ... jvj \n", + "4 OM SPROGET OG UNDERVISNINGEN OM SPROGET OG UND... jvj \n", + "5 GÆST KOMMER TIL VERDEN HAN var født paa Sjælan... jvj \n", + "6 MYTER OG JAGTER MYTER OG JAGTER GYLDENDALSKE B... jvj \n", + "7 DET TABTE LAND DET TABTE LAND, MENNESKET FØR I... jvj \n", + "8 SANGERINDEN SANGERINDEN (MADAME D'ORA) DRAMA I... jvj \n", + "9 DYRENES FORVANDLING DYRENES FORVANDLING TIL UD... jvj \n", + "\n", + " doc_id LICENSE \\\n", + "0 jvj_Jørgine Attribution-ShareAlike 4.0 International \n", + "1 jvj_Myter-ny-samling Attribution-ShareAlike 4.0 International \n", + "2 jvj_Den-ny-Verden Attribution-ShareAlike 4.0 International \n", + "3 jvj_Cimbrernes-Tog Attribution-ShareAlike 4.0 International \n", + "4 jvj_Om-Sproget-og-Undervisningen Attribution-ShareAlike 4.0 International \n", + "5 jvj_Gæst-kommer-til-Verden Attribution-ShareAlike 4.0 International \n", + "6 jvj_Myter-og-Jagter Attribution-ShareAlike 4.0 International \n", + "7 jvj_Det-tabte-Land Attribution-ShareAlike 4.0 International \n", + "8 jvj_Sangerinden Attribution-ShareAlike 4.0 International \n", + "9 jvj_Dyrenes-Forvandling Attribution-ShareAlike 4.0 International \n", + "\n", + " uri date_built \n", + "0 NA Fri Jun 26 13:06:11 2020 CEST +0200 \n", + "1 NA Fri Jun 26 13:06:01 2020 CEST +0200 \n", + "2 NA Fri Jun 26 13:06:45 2020 CEST +0200 \n", + "3 NA Fri Jun 26 13:05:56 2020 CEST +0200 \n", + "4 NA Fri Jun 26 13:05:49 2020 CEST +0200 \n", + "5 NA Fri Jun 26 13:06:21 2020 CEST +0200 \n", + "6 NA Fri Jun 26 13:06:14 2020 CEST +0200 \n", + "7 NA Fri Jun 26 13:06:50 2020 CEST +0200 \n", + "8 NA Fri Jun 26 13:06:52 2020 CEST +0200 \n", + "9 NA Fri Jun 26 13:06:48 2020 CEST +0200 " + ] + }, + "execution_count": 2, 
+ "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# We can take a look at one of the examples:\n", + "ten_samples = dataset.select(range(10))\n", + "ten_samples.to_pandas()" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As previously mentioned the Danish Gigaword consist of multiple domains. For this tutorial, we will look at three of these domains. `retsinformationdk` which consist of legal documents, `wiki` which contain Wikipedia articles and `spont` which contains texts transcriped from spontaneous speech." + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Loading cached processed dataset at /Users/au561649/.cache/huggingface/datasets/DDSC___parquet/DDSC--partial-danish-gigaword-no-twitter-d9c41a85c2339e48/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-6e6efda35614635a.arrow\n", + "Loading cached processed dataset at /Users/au561649/.cache/huggingface/datasets/DDSC___parquet/DDSC--partial-danish-gigaword-no-twitter-d9c41a85c2339e48/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-3ce9447c21439e3f.arrow\n", + "Loading cached processed dataset at /Users/au561649/.cache/huggingface/datasets/DDSC___parquet/DDSC--partial-danish-gigaword-no-twitter-d9c41a85c2339e48/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-6528b379c635e45c.arrow\n" + ] + } + ], + "source": [ + "# we can filter out these three datasets based on the \"source\"\n", + "legal = dataset.filter(lambda x: x[\"source\"] == \"retsinformationdk\")\n", + "wiki = dataset.filter(lambda x: x[\"source\"] == \"wiki\")\n", + "speech = dataset.filter(lambda x: x[\"source\"] == \"spont\")" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can now examine these datasets a bit more:" + ] + }, + { + 
"cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Legal contains 64043 examples\n", + "Wiki contains 425938 examples\n", + "Speech contains 411 examples\n" + ] + } + ], + "source": [ + "print(f\"Legal contains {len(legal)} examples\")\n", + "print(f\"Wiki contains {len(wiki)} examples\")\n", + "print(f\"Speech contains {len(speech)} examples\")" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can for example see that the speech dataset contains notably fewer sampels than the rest. So let us downsample the rest to ~1000 samples each before we start the analysis." + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Legal now contains 1000 examples\n", + "Wiki now contains 1000 examples\n" + ] + } + ], + "source": [ + "legal = legal.select(range(1000))\n", + "wiki = wiki.select(range(1000))\n", + "\n", + "print(f\"Legal now contains {len(legal)} examples\")\n", + "print(f\"Wiki now contains {len(wiki)} examples\")" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Quality Filtering\n", + "After we have prepared our datasets we can now start with the quality filtering. Using Textdescriptives this is extremely simple. We need to do 3 thing:\n", + "\n", + "1) Create a pipeline\n", + "2) Add the quality component from textdescriptives to it\n", + "3) Apply the pipeline to the dataset\n" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [], + "source": [ + "import spacy\n", + "import textdescriptives as td\n", + "\n", + "# 1. 
Create a blank spaCy model with a sentencizer\n", + "nlp = spacy.blank(\"da\")\n", + "nlp.add_pipe(\"sentencizer\")\n", + "nlp.max_length = 2000000 # some of the documents are quite long, so we increase the max length\n", + "# however, it might be worth filtering out very long documents beforehand.\n", + "\n", + "# 2. Add the textdescriptives pipeline\n", + "quality_pipe = nlp.add_pipe(\"textdescriptives/quality\")\n", + "\n", + "# 3. Apply the pipeline to the legal documents\n", + "legal_docs = nlp.pipe(legal[\"text\"], batch_size=100, n_process=4)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If we check now, we can see that `legal_docs` is a generator. This can be quite an efficient format, but for now we just want to process all the texts, so we simply convert it to a list:" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "legal_docs" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [], + "source": [ + "legal_docs = list(legal_docs)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can now inspect the output here:" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Den fulde tekst Pressenævnets kendelse i sag nr. 15-70-00822\n", + "Resumé\n", + "Foreningen for Skånsomt Kystfiskeri har ikke retlig interesse\n", + "DR bragte et radioindslag om Natur- og Erhvervsstyrelsens fiskeriinspektorats fangst af ulovlige ålefælder. Foreningen for Skånsomt Kystfiskeri klagede blandt andet med den begrundelse, at betegnelsen ” ålefælder ” er forkert, idet ålene selv kan svømme ind og ud. 
Pressenævnet afviser at behandle klagen, da foreningen ikke er omtalt i udsendelsen og derfor ikke har retlig interesse.\n", + "Pressenævnets formand udtaler:\n", + "Det er en betingelse for at klage til Pressenævnet, at\n", + "----\n", + "Does this pass the quality filter:\n" + ] + }, + { + "data": { + "text/plain": [ + "False" + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "legal_doc = legal_docs[0]\n", + "\n", + "print(legal_doc[:100]) # print the first 100 tokens\n", + "print(\"----\")\n", + "print(\"Does this pass the quality filter:\")\n", + "legal_doc._.passed_quality_check" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here we see that the text did not pass the quality filter. We can now examine why using the following code:" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'n_stop_words': 192,\n", + " 'alpha_ratio': 0.804,\n", + " 'mean_word_length': 4.546,\n", + " 'doc_length': 500,\n", + " 'proportion_ellipsis': 0.0,\n", + " 'proportion_bullet_points': 0.0,\n", + " 'duplicate_line_chr_fraction': 0.25737766156144937,\n", + " 'duplicate_paragraph_chr_fraction': 0.0,\n", + " 'duplicate_5-gram_chr_fraction': 0.5401568920433321,\n", + " 'duplicate_6-gram_chr_fraction': 0.519237952932387,\n", + " 'duplicate_7-gram_chr_fraction': 0.519237952932387,\n", + " 'duplicate_8-gram_chr_fraction': 0.519237952932387,\n", + " 'duplicate_9-gram_chr_fraction': 0.519237952932387,\n", + " 'duplicate_10-gram_chr_fraction': 0.519237952932387,\n", + " 'top_2-gram_chr_fraction': 0.017930519237952934,\n", + " 'top_3-gram_chr_fraction': 0.042958535674262235,\n", + " 'top_4-gram_chr_fraction': 0.0653716847217034,\n", + " 'symbol_#_to_word_ratio': 0.0,\n", + " 'contains_lorem ipsum': False}" + ] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" + } + ], + 
"source": [ + "legal_doc._.quality" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here we see that the fraction of characters that are part of a duplicate 10-gram is above 50%. This is likely the reason why the sample was filtered out. This is not uncommon for legal documents, which contain a lot of standard phrases. However, you might wish to change the threshold for this filter. You can see an example of how to do this in the [documentation](file:///Users/au561649/Github/TextDescriptives/docs/_build/html/quality.html).\n", + "\n", + "You can also inspect the existing thresholds:" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "QualityThresholds(n_stop_words=(2, None), alpha_ratio=(0.8, None), mean_word_length=(3, 10), doc_length=(10, 100000), symbol_to_word_ratio={'#': (None, 0.1)}, proportion_ellipsis=(None, 0.3), proportion_bullet_points=(None, 0.8), contains={'lorem ipsum': False}, duplicate_line_chr_fraction=(None, 0.2), duplicate_paragraph_chr_fraction=(None, 0.2), duplicate_ngram_chr_fraction={'5': (None, 0.15), '6': (None, 0.14), '7': (None, 0.13), '8': (None, 0.12), '9': (None, 0.11), '10': (None, 0.1)}, top_ngram_chr_fraction={'2': (None, 0.2), '3': (None, 0.18), '4': (None, 0.16)})" + ] + }, + "execution_count": 25, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "quality_pipe.quality_thresholds" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here we see that the upper threshold of `duplicate_ngram_chr_fraction` for 10-grams is 0.1. This means that a text is filtered out if more than 10% of its characters are part of a duplicate 10-gram." 
+ ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Filtering out the text\n", + "Assuming we don't want to change the thresholds, we can now use the filter to keep only the texts that pass the quality check:" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": {}, + "outputs": [], + "source": [ + "# 4. Keep only the documents that pass the quality check\n", + "legal_docs_filtered = [doc for doc in legal_docs if doc._.passed_quality_check]\n" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "We had a total of 1000 which we filtered down to 68.\n" + ] + } + ], + "source": [ + "print(f\"We had a total of {len(legal['text'])} which we filtered down to {len(legal_docs_filtered)}.\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "textdescriptives", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.15" + }, + "orig_nbformat": 4, + "vscode": { + "interpreter": { + "hash": "31387647799921bb85032eec7bb02e281325ae7f8ffa6f9cd7cdead815b36c88" + } + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/docs/tutorials/introductory_tutorial.ipynb b/docs/tutorials/introductory_tutorial.ipynb index 95af3fd5..75ec99ea 100644 --- a/docs/tutorials/introductory_tutorial.ipynb +++ b/docs/tutorials/introductory_tutorial.ipynb @@ -608,7 +608,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.8.15" + "version": "3.8.15 (default, Oct 11 2022, 21:31:25) \n[Clang 14.0.0 
(clang-1400.0.29.102)]" }, 
"orig_nbformat": 4, "vscode": { diff --git a/pyproject.toml b/pyproject.toml index c7b9d3ba..668818c8 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -71,6 +71,7 @@ tutorials = [ "jupyter", "seaborn", "matplotlib", + "datasets>=2.8.0,<2.9.0", ] [project.readme] diff --git a/src/textdescriptives/components/quality.py b/src/textdescriptives/components/quality.py index 70742d64..34343228 100644 --- a/src/textdescriptives/components/quality.py +++ b/src/textdescriptives/components/quality.py @@ -612,8 +612,6 @@ def set_extensions(self): """Set required extensions.""" for ext_name, span_getter in self.extensions.items(): - # doc_getter = span_getter_to_doc_getter(span_getter) - if not Span.has_extension(ext_name) or self.force is True: Span.set_extension(ext_name, getter=span_getter, force=True) if not Doc.has_extension(ext_name) or self.force is True: From fa2d65c1114cd6ae03d4638a44e7fe0c525aa80f Mon Sep 17 00:00:00 2001 From: Kenneth Enevoldsen Date: Mon, 9 Jan 2023 14:17:04 +0100 Subject: [PATCH 02/14] docs: added tutorial --- .../filter_corpus_using_quality.ipynb | 529 +++++++++++++----- src/textdescriptives/components/quality.py | 35 +- 2 files changed, 404 insertions(+), 160 deletions(-) diff --git a/docs/tutorials/filter_corpus_using_quality.ipynb b/docs/tutorials/filter_corpus_using_quality.ipynb index ff7566f1..81951628 100644 --- a/docs/tutorials/filter_corpus_using_quality.ipynb +++ b/docs/tutorials/filter_corpus_using_quality.ipynb @@ -24,7 +24,7 @@ "source": [ "## Setup\n", "\n", - "For this we will use the [Danish Gigaword](https://sprogteknologi.dk/dataset/danish-gigaword) available on [Huggingface Datasets](DDSC/partial-danish-gigaword-no-twitter). A large collection of Danish texts collected from a variety of domains. To download it we will need the `datasets` package. 
To install it please run:\n", + "For this we will use the [Danish Gigaword](https://sprogteknologi.dk/dataset/danish-gigaword), a large collection of Danish texts collected from a variety of domains, available on [Huggingface Datasets](DDSC/partial-danish-gigaword-no-twitter). In this tutorial we will only use a small test sample of it, but you could change the code to use the whole dataset. To download it we will need the `datasets` package. To install it please run:\n", "\n", "```python\n", "!pip install datasets\n", @@ -35,21 +35,21 @@ }, { "cell_type": "code", - "execution_count": 1, + "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ - "Using custom data configuration DDSC--partial-danish-gigaword-no-twitter-d9c41a85c2339e48\n", - "Found cached dataset parquet (/Users/au561649/.cache/huggingface/datasets/DDSC___parquet/DDSC--partial-danish-gigaword-no-twitter-d9c41a85c2339e48/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)\n" + "Using custom data configuration DDSC--partial-danish-gigaword-small-test-sample-6518b630de09688d\n", + "Found cached dataset parquet (/Users/au561649/.cache/huggingface/datasets/DDSC___parquet/DDSC--partial-danish-gigaword-small-test-sample-6518b630de09688d/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { - "model_id": "ae17b122ba40474e83cb277625131e0f", + "model_id": "7ecf4788a2a8499b82115927bf126ccd", "version_major": 2, "version_minor": 0 }, @@ -65,7 +65,7 @@ "from datasets import load_dataset\n", "\n", "# note this can take quite a while\n", - "dataset = load_dataset(\"DDSC/partial-danish-gigaword-no-twitter\")\n", + "dataset = load_dataset(\"DDSC/partial-danish-gigaword-small-test-sample\")\n", "\n", "# All of the dataset is available in the train split - we can simply:\n", "dataset = dataset[\"train\"]" @@ -73,7 +73,7 @@ }, { "cell_type": "code", - 
"execution_count": 2, + "execution_count": 29, "metadata": {}, "outputs": [ { @@ -108,137 +108,161 @@ " \n", " \n", " 0\n", - " JØRGINE JØRGINE KØBENHAVN HAGE & CLAUSENS FORL...\n", - " jvj\n", - " jvj_Jørgine\n", - " Attribution-ShareAlike 4.0 International\n", - " NA\n", - " Fri Jun 26 13:06:11 2020 CEST +0200\n", + " Den fulde tekst Pressenævnets kendelse i sag n...\n", + " retsinformationdk\n", + " retsinformationdk_173889\n", + " Danish Copyright law at https://www.retsinform...\n", + " https://www.retsinformation.dk/Forms/R0710.asp...\n", + " Fri Nov 22 00:51:31 2019 +0100\n", " \n", " \n", " 1\n", - " MYTER MYTER NY SAMLING GYLDENDALSKE BOGHANDEL...\n", - " jvj\n", - " jvj_Myter-ny-samling\n", - " Attribution-ShareAlike 4.0 International\n", - " NA\n", - " Fri Jun 26 13:06:01 2020 CEST +0200\n", + " Resume\\n\\nEfter at der var sket afskedigelser ...\n", + " retsinformationdk\n", + " retsinformationdk_39059\n", + " Danish Copyright law at https://www.retsinform...\n", + " https://www.retsinformation.dk/Forms/R0710.asp...\n", + " Fri Nov 22 00:51:14 2019 +0100\n", " \n", " \n", " 2\n", - " DEN NY VERDEN DEN NY VERDEN TIL INTERNATIONAL ...\n", - " jvj\n", - " jvj_Den-ny-Verden\n", - " Attribution-ShareAlike 4.0 International\n", - " NA\n", - " Fri Jun 26 13:06:45 2020 CEST +0200\n", + " Resume\\n\\nContainere kunne ikke anses som genb...\n", + " retsinformationdk\n", + " retsinformationdk_15045\n", + " Danish Copyright law at https://www.retsinform...\n", + " https://www.retsinformation.dk/Forms/R0710.asp...\n", + " Fri Nov 22 00:51:28 2019 +0100\n", " \n", " \n", " 3\n", - " CIMBRERNES TOG TIL EMMERIK JENSEN F . 15 . 
MAJ...\n", - " jvj\n", - " jvj_Cimbrernes-Tog\n", - " Attribution-ShareAlike 4.0 International\n", - " NA\n", - " Fri Jun 26 13:05:56 2020 CEST +0200\n", + " Resume\\n\\nEn forhandler ved »home-parties« af ...\n", + " retsinformationdk\n", + " retsinformationdk_37261\n", + " Danish Copyright law at https://www.retsinform...\n", + " https://www.retsinformation.dk/Forms/R0710.asp...\n", + " Fri Nov 22 00:49:27 2019 +0100\n", " \n", " \n", " 4\n", - " OM SPROGET OG UNDERVISNINGEN OM SPROGET OG UND...\n", - " jvj\n", - " jvj_Om-Sproget-og-Undervisningen\n", - " Attribution-ShareAlike 4.0 International\n", - " NA\n", - " Fri Jun 26 13:05:49 2020 CEST +0200\n", + " Den fulde tekst\\n\\nSkrivelse om lov om fleksyd...\n", + " retsinformationdk\n", + " retsinformationdk_19415\n", + " Danish Copyright law at https://www.retsinform...\n", + " https://www.retsinformation.dk/Forms/R0710.asp...\n", + " Fri Nov 22 00:52:27 2019 +0100\n", " \n", " \n", " 5\n", - " GÆST KOMMER TIL VERDEN HAN var født paa Sjælan...\n", - " jvj\n", - " jvj_Gæst-kommer-til-Verden\n", - " Attribution-ShareAlike 4.0 International\n", - " NA\n", - " Fri Jun 26 13:06:21 2020 CEST +0200\n", + " Resume\\n\\nResumé\\n\\nKlage over påbud om særlig...\n", + " retsinformationdk\n", + " retsinformationdk_31217\n", + " Danish Copyright law at https://www.retsinform...\n", + " https://www.retsinformation.dk/Forms/R0710.asp...\n", + " Fri Nov 22 00:49:18 2019 +0100\n", " \n", " \n", " 6\n", - " MYTER OG JAGTER MYTER OG JAGTER GYLDENDALSKE B...\n", - " jvj\n", - " jvj_Myter-og-Jagter\n", - " Attribution-ShareAlike 4.0 International\n", - " NA\n", - " Fri Jun 26 13:06:14 2020 CEST +0200\n", + " Resume\\n\\nResumé\\n\\nI en række af de af Danmar...\n", + " retsinformationdk\n", + " retsinformationdk_14387\n", + " Danish Copyright law at https://www.retsinform...\n", + " https://www.retsinformation.dk/Forms/R0710.asp...\n", + " Fri Nov 22 00:49:49 2019 +0100\n", " \n", " \n", " 7\n", - " DET TABTE LAND DET TABTE LAND, 
MENNESKET FØR I...\n", - " jvj\n", - " jvj_Det-tabte-Land\n", - " Attribution-ShareAlike 4.0 International\n", - " NA\n", - " Fri Jun 26 13:06:50 2020 CEST +0200\n", + " Oversigt (indholdsfortegnelse)\\n\\nBilag 1\\n\\nD...\n", + " retsinformationdk\n", + " retsinformationdk_166197\n", + " Danish Copyright law at https://www.retsinform...\n", + " https://www.retsinformation.dk/Forms/R0710.asp...\n", + " Fri Nov 22 00:49:44 2019 +0100\n", " \n", " \n", " 8\n", - " SANGERINDEN SANGERINDEN (MADAME D'ORA) DRAMA I...\n", - " jvj\n", - " jvj_Sangerinden\n", - " Attribution-ShareAlike 4.0 International\n", - " NA\n", - " Fri Jun 26 13:06:52 2020 CEST +0200\n", + " Den fulde tekst\\n\\nBekendtgørelse om afregning...\n", + " retsinformationdk\n", + " retsinformationdk_76994\n", + " Danish Copyright law at https://www.retsinform...\n", + " https://www.retsinformation.dk/Forms/R0710.asp...\n", + " Fri Nov 22 00:52:52 2019 +0100\n", " \n", " \n", " 9\n", - " DYRENES FORVANDLING DYRENES FORVANDLING TIL UD...\n", - " jvj\n", - " jvj_Dyrenes-Forvandling\n", - " Attribution-ShareAlike 4.0 International\n", - " NA\n", - " Fri Jun 26 13:06:48 2020 CEST +0200\n", + " Den fulde tekst Ligebehandlingsnævnets afgørel...\n", + " retsinformationdk\n", + " retsinformationdk_192325\n", + " Danish Copyright law at https://www.retsinform...\n", + " https://www.retsinformation.dk/Forms/R0710.asp...\n", + " Fri Nov 22 00:51:41 2019 +0100\n", " \n", " \n", "\n", "" ], "text/plain": [ - " text source \\\n", - "0 JØRGINE JØRGINE KØBENHAVN HAGE & CLAUSENS FORL... jvj \n", - "1 MYTER MYTER NY SAMLING GYLDENDALSKE BOGHANDEL... jvj \n", - "2 DEN NY VERDEN DEN NY VERDEN TIL INTERNATIONAL ... jvj \n", - "3 CIMBRERNES TOG TIL EMMERIK JENSEN F . 15 . MAJ... jvj \n", - "4 OM SPROGET OG UNDERVISNINGEN OM SPROGET OG UND... jvj \n", - "5 GÆST KOMMER TIL VERDEN HAN var født paa Sjælan... jvj \n", - "6 MYTER OG JAGTER MYTER OG JAGTER GYLDENDALSKE B... 
jvj \n", - "7 DET TABTE LAND DET TABTE LAND, MENNESKET FØR I... jvj \n", - "8 SANGERINDEN SANGERINDEN (MADAME D'ORA) DRAMA I... jvj \n", - "9 DYRENES FORVANDLING DYRENES FORVANDLING TIL UD... jvj \n", + " text source \\\n", + "0 Den fulde tekst Pressenævnets kendelse i sag n... retsinformationdk \n", + "1 Resume\\n\\nEfter at der var sket afskedigelser ... retsinformationdk \n", + "2 Resume\\n\\nContainere kunne ikke anses som genb... retsinformationdk \n", + "3 Resume\\n\\nEn forhandler ved »home-parties« af ... retsinformationdk \n", + "4 Den fulde tekst\\n\\nSkrivelse om lov om fleksyd... retsinformationdk \n", + "5 Resume\\n\\nResumé\\n\\nKlage over påbud om særlig... retsinformationdk \n", + "6 Resume\\n\\nResumé\\n\\nI en række af de af Danmar... retsinformationdk \n", + "7 Oversigt (indholdsfortegnelse)\\n\\nBilag 1\\n\\nD... retsinformationdk \n", + "8 Den fulde tekst\\n\\nBekendtgørelse om afregning... retsinformationdk \n", + "9 Den fulde tekst Ligebehandlingsnævnets afgørel... retsinformationdk \n", + "\n", + " doc_id \\\n", + "0 retsinformationdk_173889 \n", + "1 retsinformationdk_39059 \n", + "2 retsinformationdk_15045 \n", + "3 retsinformationdk_37261 \n", + "4 retsinformationdk_19415 \n", + "5 retsinformationdk_31217 \n", + "6 retsinformationdk_14387 \n", + "7 retsinformationdk_166197 \n", + "8 retsinformationdk_76994 \n", + "9 retsinformationdk_192325 \n", + "\n", + " LICENSE \\\n", + "0 Danish Copyright law at https://www.retsinform... \n", + "1 Danish Copyright law at https://www.retsinform... \n", + "2 Danish Copyright law at https://www.retsinform... \n", + "3 Danish Copyright law at https://www.retsinform... \n", + "4 Danish Copyright law at https://www.retsinform... \n", + "5 Danish Copyright law at https://www.retsinform... \n", + "6 Danish Copyright law at https://www.retsinform... \n", + "7 Danish Copyright law at https://www.retsinform... \n", + "8 Danish Copyright law at https://www.retsinform... 
\n", + "9 Danish Copyright law at https://www.retsinform... \n", "\n", - " doc_id LICENSE \\\n", - "0 jvj_Jørgine Attribution-ShareAlike 4.0 International \n", - "1 jvj_Myter-ny-samling Attribution-ShareAlike 4.0 International \n", - "2 jvj_Den-ny-Verden Attribution-ShareAlike 4.0 International \n", - "3 jvj_Cimbrernes-Tog Attribution-ShareAlike 4.0 International \n", - "4 jvj_Om-Sproget-og-Undervisningen Attribution-ShareAlike 4.0 International \n", - "5 jvj_Gæst-kommer-til-Verden Attribution-ShareAlike 4.0 International \n", - "6 jvj_Myter-og-Jagter Attribution-ShareAlike 4.0 International \n", - "7 jvj_Det-tabte-Land Attribution-ShareAlike 4.0 International \n", - "8 jvj_Sangerinden Attribution-ShareAlike 4.0 International \n", - "9 jvj_Dyrenes-Forvandling Attribution-ShareAlike 4.0 International \n", + " uri \\\n", + "0 https://www.retsinformation.dk/Forms/R0710.asp... \n", + "1 https://www.retsinformation.dk/Forms/R0710.asp... \n", + "2 https://www.retsinformation.dk/Forms/R0710.asp... \n", + "3 https://www.retsinformation.dk/Forms/R0710.asp... \n", + "4 https://www.retsinformation.dk/Forms/R0710.asp... \n", + "5 https://www.retsinformation.dk/Forms/R0710.asp... \n", + "6 https://www.retsinformation.dk/Forms/R0710.asp... \n", + "7 https://www.retsinformation.dk/Forms/R0710.asp... \n", + "8 https://www.retsinformation.dk/Forms/R0710.asp... \n", + "9 https://www.retsinformation.dk/Forms/R0710.asp... 
\n", "\n", - " uri date_built \n", - "0 NA Fri Jun 26 13:06:11 2020 CEST +0200 \n", - "1 NA Fri Jun 26 13:06:01 2020 CEST +0200 \n", - "2 NA Fri Jun 26 13:06:45 2020 CEST +0200 \n", - "3 NA Fri Jun 26 13:05:56 2020 CEST +0200 \n", - "4 NA Fri Jun 26 13:05:49 2020 CEST +0200 \n", - "5 NA Fri Jun 26 13:06:21 2020 CEST +0200 \n", - "6 NA Fri Jun 26 13:06:14 2020 CEST +0200 \n", - "7 NA Fri Jun 26 13:06:50 2020 CEST +0200 \n", - "8 NA Fri Jun 26 13:06:52 2020 CEST +0200 \n", - "9 NA Fri Jun 26 13:06:48 2020 CEST +0200 " + " date_built \n", + "0 Fri Nov 22 00:51:31 2019 +0100 \n", + "1 Fri Nov 22 00:51:14 2019 +0100 \n", + "2 Fri Nov 22 00:51:28 2019 +0100 \n", + "3 Fri Nov 22 00:49:27 2019 +0100 \n", + "4 Fri Nov 22 00:52:27 2019 +0100 \n", + "5 Fri Nov 22 00:49:18 2019 +0100 \n", + "6 Fri Nov 22 00:49:49 2019 +0100 \n", + "7 Fri Nov 22 00:49:44 2019 +0100 \n", + "8 Fri Nov 22 00:52:52 2019 +0100 \n", + "9 Fri Nov 22 00:51:41 2019 +0100 " ] }, - "execution_count": 2, + "execution_count": 29, "metadata": {}, "output_type": "execute_result" } @@ -254,28 +278,28 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "As previously mentioned the Danish Gigaword consist of multiple domains. For this tutorial, we will look at three of these domains. `retsinformationdk` which consist of legal documents, `wiki` which contain Wikipedia articles and `spont` which contains texts transcriped from spontaneous speech." + "As previously mentioned, the Danish Gigaword consists of multiple domains. For this tutorial, we will look at three of these domains: `retsinformationdk`, which consists of legal documents, `tv2r`, which contains news articles, and `spont`, which contains texts transcribed from spontaneous speech." 
] }, { "cell_type": "code", - "execution_count": 3, + "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ - "Loading cached processed dataset at /Users/au561649/.cache/huggingface/datasets/DDSC___parquet/DDSC--partial-danish-gigaword-no-twitter-d9c41a85c2339e48/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-6e6efda35614635a.arrow\n", - "Loading cached processed dataset at /Users/au561649/.cache/huggingface/datasets/DDSC___parquet/DDSC--partial-danish-gigaword-no-twitter-d9c41a85c2339e48/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-3ce9447c21439e3f.arrow\n", - "Loading cached processed dataset at /Users/au561649/.cache/huggingface/datasets/DDSC___parquet/DDSC--partial-danish-gigaword-no-twitter-d9c41a85c2339e48/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-6528b379c635e45c.arrow\n" + "Loading cached processed dataset at /Users/au561649/.cache/huggingface/datasets/DDSC___parquet/DDSC--partial-danish-gigaword-small-test-sample-6518b630de09688d/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-beca55bc168c3e3d.arrow\n", + "Loading cached processed dataset at /Users/au561649/.cache/huggingface/datasets/DDSC___parquet/DDSC--partial-danish-gigaword-small-test-sample-6518b630de09688d/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-be9e6b466f0d4ee9.arrow\n", + "Loading cached processed dataset at /Users/au561649/.cache/huggingface/datasets/DDSC___parquet/DDSC--partial-danish-gigaword-small-test-sample-6518b630de09688d/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-56a5eac62a6adddf.arrow\n" ] } ], "source": [ "# we can filter out these three datasets based on the \"source\"\n", "legal = dataset.filter(lambda x: x[\"source\"] == \"retsinformationdk\")\n", - "wiki = dataset.filter(lambda x: x[\"source\"] == \"wiki\")\n", + "news = 
dataset.filter(lambda x: x[\"source\"] == \"tv2r\")\n", "speech = dataset.filter(lambda x: x[\"source\"] == \"spont\")" ] }, @@ -289,22 +313,22 @@ }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "Legal contains 64043 examples\n", - "Wiki contains 425938 examples\n", + "Legal contains 1000 examples\n", + "News contains 1000 examples\n", "Speech contains 411 examples\n" ] } ], "source": [ "print(f\"Legal contains {len(legal)} examples\")\n", - "print(f\"Wiki contains {len(wiki)} examples\")\n", + "print(f\"News contains {len(news)} examples\")\n", "print(f\"Speech contains {len(speech)} examples\")" ] }, @@ -313,29 +337,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "We can for example see that the speech dataset contains notably fewer sampels than the rest. So let us downsample the rest to ~1000 samples each before we start the analysis." - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Legal now contains 1000 examples\n", - "Wiki now contains 1000 examples\n" - ] - } - ], - "source": [ - "legal = legal.select(range(1000))\n", - "wiki = wiki.select(range(1000))\n", - "\n", - "print(f\"Legal now contains {len(legal)} examples\")\n", - "print(f\"Wiki now contains {len(wiki)} examples\")" + "We can see, for example, that the speech dataset contains notably fewer samples than the others, while the news and legal datasets each contain 1000 samples."
] }, { @@ -353,7 +355,7 @@ }, { "cell_type": "code", - "execution_count": 20, + "execution_count": 32, "metadata": {}, "outputs": [], "source": [ @@ -383,16 +385,16 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "" + "" ] }, - "execution_count": 7, + "execution_count": 33, "metadata": {}, "output_type": "execute_result" } @@ -403,7 +405,7 @@ }, { "cell_type": "code", - "execution_count": 8, + "execution_count": 34, "metadata": {}, "outputs": [], "source": [ @@ -420,7 +422,7 @@ }, { "cell_type": "code", - "execution_count": 17, + "execution_count": 35, "metadata": {}, "outputs": [ { @@ -443,7 +445,7 @@ "False" ] }, - "execution_count": 17, + "execution_count": 35, "metadata": {}, "output_type": "execute_result" } @@ -467,7 +469,7 @@ }, { "cell_type": "code", - "execution_count": 18, + "execution_count": 36, "metadata": {}, "outputs": [ { @@ -494,7 +496,7 @@ " 'contains_lorem ipsum': False}" ] }, - "execution_count": 18, + "execution_count": 36, "metadata": {}, "output_type": "execute_result" } @@ -508,23 +510,23 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Here we see that fraction of characters which is a part of a duplicate 10 gram is >50%. This is likely the reason why the sample was filtered out. This is not uncommon for legal documents which contain a lot of standard phrases. However you might wish to change the threshold for this filter. You can see an example of how to do this in the [documentation](file:///Users/au561649/Github/TextDescriptives/docs/_build/html/quality.html).\n", + "Here we see that fraction of characters which is a part of a duplicate 10 gram is >50%. This is likely the reason why the sample was filtered out. This is not uncommon for legal documents which contain a lot of standard phrases. However you might wish to change the threshold for this filter. 
You can see an example of how to do this in the [documentation](https://hlasse.github.io/TextDescriptives/quality.html). We also see that the `alpha_ratio` is close to 0.8. This means that roughly 80% of the tokens contain at least one alphabetic character.\n", "\n", "You can also inspect the existing thresholds:" ] }, { "cell_type": "code", - "execution_count": 25, + "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "QualityThresholds(n_stop_words=(2, None), alpha_ratio=(0.8, None), mean_word_length=(3, 10), doc_length=(10, 100000), symbol_to_word_ratio={'#': (None, 0.1)}, proportion_ellipsis=(None, 0.3), proportion_bullet_points=(None, 0.8), contains={'lorem ipsum': False}, duplicate_line_chr_fraction=(None, 0.2), duplicate_paragraph_chr_fraction=(None, 0.2), duplicate_ngram_chr_fraction={'5': (None, 0.15), '6': (None, 0.14), '7': (None, 0.13), '8': (None, 0.12), '9': (None, 0.11), '10': (None, 0.1)}, top_ngram_chr_fraction={'2': (None, 0.2), '3': (None, 0.18), '4': (None, 0.16)})" + "QualityThresholds(n_stop_words=(2, None), alpha_ratio=(0.6, None), mean_word_length=(3, 10), doc_length=(10, 100000), symbol_to_word_ratio={'#': (None, 0.1)}, proportion_ellipsis=(None, 0.3), proportion_bullet_points=(None, 0.8), contains={'lorem ipsum': False}, duplicate_line_chr_fraction=(None, 0.2), duplicate_paragraph_chr_fraction=(None, 0.2), duplicate_ngram_chr_fraction={'5': (None, 0.15), '6': (None, 0.14), '7': (None, 0.13), '8': (None, 0.12), '9': (None, 0.11), '10': (None, 0.1)}, top_ngram_chr_fraction={'2': (None, 0.2), '3': (None, 0.18), '4': (None, 0.16)})" ] }, - "execution_count": 25, + "execution_count": 37, "metadata": {}, "output_type": "execute_result" } @@ -552,7 +554,7 @@ }, { "cell_type": "code", - "execution_count": 26, + "execution_count": 38, "metadata": {}, "outputs": [], "source": [ @@ -562,14 +564,14 @@ }, { "cell_type": "code", - "execution_count": 29, + "execution_count": 39, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": 
"stream", "text": [ - "We had a total of 1000 which we filtered down to 68.\n" + "We had a total of 1000 which we filtered down to 335.\n" ] } ], @@ -577,12 +579,235 @@ "print(f\"We had a total of {len(legal['text'])} which we filtered down to {len(legal_docs_filtered)}.\")" ] }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "That seems like a lot, we should probably check why that is. We can do this by looking at the distribution of the scores:" + ] + }, + { + "cell_type": "code", + "execution_count": 40, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 40, + "metadata": {}, + "output_type": "execute_result" + }, + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "import seaborn as sns\n", + "\n", + "duplicate_10_gram_fraction = [doc._.quality[\"duplicate_10-gram_chr_fraction\"] for doc in legal_docs]\n", + "sns.histplot(duplicate_10_gram_fraction)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This seems fine like it explains a lot of the texts which were filtered out, but does not explain everything. Let us take a look at the `alpha_ratio` as well:" + ] + }, + { + "cell_type": "code", + "execution_count": 41, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 41, + "metadata": {}, + "output_type": "execute_result" + }, + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "alpha_ratio = [doc._.quality[\"alpha_ratio\"] for doc in legal_docs]\n", + "sns.histplot(alpha_ratio)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We see that most of the text does not pass the `alpha_ratio` filter of 0.8 or higher. This is not uncommon for legal documents as e.g. the paragraph sign `§` is not an alphabetic character. It might be relevant to change the thresholdhold to 0.7 or lower." + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Comparing across domains\n", + "We see that legal documents have quite a few perculiarities let us examine how the `alpha_ratio` behaves across different domains:" + ] + }, + { + "cell_type": "code", + "execution_count": 42, + "metadata": {}, + "outputs": [], + "source": [ + "# first we apply the pipeline to the other domains\n", + "news_docs = nlp.pipe(news[\"text\"], batch_size=100, n_process=4)\n", + "news_docs = list(news_docs)" + ] + }, + { + "cell_type": "code", + "execution_count": 43, + "metadata": {}, + "outputs": [], + "source": [ + "speech_docs = nlp.pipe(speech[\"text\"], batch_size=100, n_process=4)\n", + "speech_docs = list(speech_docs)" + ] + }, { "cell_type": "code", - "execution_count": null, + "execution_count": 44, "metadata": {}, "outputs": [], - "source": [] + "source": [ + "# etract alpha ratio:\n", + "news_alpha_ratio = [doc._.quality[\"alpha_ratio\"] for doc in news_docs]\n", + "speech_alpha_ratio = [doc._.quality[\"alpha_ratio\"] for doc in speech_docs]" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now that we have the metrics we can plot a histogram comparing the metrics:" + ] + }, + { + "cell_type": "code", + "execution_count": 45, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 45, + "metadata": {}, + 
"output_type": "execute_result" + }, + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "import matplotlib.pyplot as plt\n", + "# histogram\n", + "sns.histplot(news_alpha_ratio, label=\"News\", alpha=0.5, binwidth=0.05)\n", + "sns.histplot(speech_alpha_ratio, label=\"Speech\", alpha=0.5, binwidth=0.05)\n", + "sns.histplot(alpha_ratio, label=\"Legal\", alpha=0.5, binwidth=0.05)\n", + "\n", + "# add labels\n", + "plt.xlabel(\"Alpha ratio\")\n", + "plt.ylabel(\"Count\")\n", + "plt.legend()" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here we see a couple of things:\n", + "- Spontanous speech have a notably low alpha ratio. We should probably look into it.\n", + "- A reasonable amount of legal documents have an alpha ratio above 0.6.\n", + "- Almost no news text have a alpha ratio below 0.6.\n", + "\n", + "Let us examine the spontaneous speech a bit more:" + ] + }, + { + "cell_type": "code", + "execution_count": 46, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[Taler, 6, :, mm, \n", + ", Taler, 7, :, er, du, klar, ?, \n", + ", Taler, 6, :, ja, \n", + ", Taler, 7, :, så, er, spørgsmålet, om, vi, skal-, om, det, er, sådan, her, ja, det, kunne, man, godt, okay, \n", + ", Taler, 7, :, okay, så, det, er, ignore, tab, kill, og, kill, tab, \n", + ", Taler, 6, :, NA, \n", + ", Taler, 6, :, kill, \n", + ", Taler, 6, :, kill, tab, \n", + ", Taler, 7, :, super, \n", + ", Taler, 7, :, okay, det, er, det, hun, lige, har, sagt, \n", + ", Taler, 6, :, ja, \n", + ", Taler, 6, :, ja, \n", + ", Taler, 6, :, NA]\n" + ] + } + ], + "source": [ + "# Examing the first speech document\n", + "doc = speech_docs[0]\n", + "print([t for t in doc[:100]]) # print the first 100 tokens" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "From this we can see that a high proportion of the tokens in the speech dataset actually denotes the speeaker. 
This might or might not be problematic for the dataset of interesting, but it does indeed make sense that it inflates the number of tokens.\n", + "\n", + "Therefore it is important to note that while these filters are useful for filtering large amount of texts it is also important to know that they should probably the adjusted to the target domain." + ] } ], "metadata": { diff --git a/src/textdescriptives/components/quality.py b/src/textdescriptives/components/quality.py index 34343228..33b479a2 100644 --- a/src/textdescriptives/components/quality.py +++ b/src/textdescriptives/components/quality.py @@ -20,10 +20,12 @@ class QualityThresholds(BaseModel): + "at least 2 stop words, but no upper limit.", ) alpha_ratio: Interval = Field( - (0.8, None), - description="A Range for the alpha ratio. Default: (0.8, None), i.e. at " - + r"least 80% of tokens contain at least one alphabetic character, but no " - + "upper limit.", + (0.6, None), + description="A Range for the alpha ratio. Default: (0.6, None), i.e. at " + + r"least 60% of tokens contain at least one alphabetic character, but no " + + "upper limit. Note this is lowered from the original 0.8 to account for a " + + "different definition of word boundaries. E.g. in spaCy a punctuation is " + + "not a part of a word.", ) mean_word_length: Interval = Field( (3, 10), @@ -499,11 +501,14 @@ def __init__( # pylint: disable=dangerous-default-value self.set_extensions() - def quality_getter(self, span: Span) -> Dict[str, Union[float, int, bool]]: + def quality_getter( + self, + span: Union[Span, Doc], + ) -> Dict[str, Union[float, int, bool]]: """Apply quality functions to doc. 
Args: - span (Span): spaCy span object + span (Union[Span, Doc]): spaCy span or doc object Returns: Dict[str, Union[float, int, bool]]: dictionary of quality metrics @@ -522,6 +527,15 @@ def quality_getter(self, span: Span) -> Dict[str, Union[float, int, bool]]: quality[name] = getter(span) # type: ignore return quality + def set_quality(self, doc: Doc) -> None: + """Set the quality attribute on a doc. + + Args: + doc (Doc): spaCy doc object + """ + doc._.quality = self.quality_getter(doc) + doc._.passed_quality_check = self.passed_quality_thresholds(doc) + @staticmethod def is_within_range(rangetuple: Interval, value: float) -> bool: """Check if a value is within a range tuple. If one of the values in @@ -614,11 +628,16 @@ def set_extensions(self): for ext_name, span_getter in self.extensions.items(): if not Span.has_extension(ext_name) or self.force is True: Span.set_extension(ext_name, getter=span_getter, force=True) - if not Doc.has_extension(ext_name) or self.force is True: - Doc.set_extension(ext_name, getter=span_getter, force=True) + if ext_name == "quality": + if not Doc.has_extension(ext_name) or self.force is True: + Doc.set_extension(ext_name, default=None, force=True) + else: + if not Doc.has_extension(ext_name) or self.force is True: + Doc.set_extension(ext_name, getter=span_getter, force=True) def __call__(self, doc: Doc): """Run the pipeline component.""" + self.set_quality(doc) return doc From 9f5d22f459fc55b35ce61942b8511dbd9c222a37 Mon Sep 17 00:00:00 2001 From: Lasse Date: Mon, 9 Jan 2023 15:07:44 +0100 Subject: [PATCH 03/14] tutorial: minor updates to quality tutorial --- .../filter_corpus_using_quality.ipynb | 374 +++++++++++++----- 1 file changed, 275 insertions(+), 99 deletions(-) diff --git a/docs/tutorials/filter_corpus_using_quality.ipynb b/docs/tutorials/filter_corpus_using_quality.ipynb index 81951628..4f6c3238 100644 --- a/docs/tutorials/filter_corpus_using_quality.ipynb +++ b/docs/tutorials/filter_corpus_using_quality.ipynb @@ -24,56 
+24,94 @@ "source": [ "## Setup\n", "\n", - "For this we will use the [Danish Gigaword](https://sprogteknologi.dk/dataset/danish-gigaword) available on [Huggingface Datasets](DDSC/partial-danish-gigaword-no-twitter). Actually in this tutorial we will just use a small test version of it, but you could change it to use the whole dataset. A large collection of Danish texts collected from a variety of domains. To download it we will need the `datasets` package. To install it please run:\n", + "For this we will use the [Danish Gigaword](https://sprogteknologi.dk/dataset/danish-gigaword) available on [Huggingface Datasets](DDSC/partial-danish-gigaword-no-twitter). For the purpose of this tutorial we will just use a small test version of it containing around 2500 examples, but you could easily change it to use the whole dataset. Danish Gigaword is a large collection of Danish texts collected from a variety of domains. To download it we will need the `datasets` package. Which you can install by running\n", "\n", "```python\n", "!pip install datasets\n", "```\n", "\n", + "Or by installing textdescriptives with the `[tutorials]` option as below" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "try:\n", + " import textdescriptives\n", + "except ImportError:\n", + " !pip install \"textdescriptives[tutorials]\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ "We can now easily donwload the dataset using the following command:" ] }, { "cell_type": "code", - "execution_count": 28, + "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ - "Using custom data configuration DDSC--partial-danish-gigaword-small-test-sample-6518b630de09688d\n", - "Found cached dataset parquet 
(/Users/au561649/.cache/huggingface/datasets/DDSC___parquet/DDSC--partial-danish-gigaword-small-test-sample-6518b630de09688d/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)\n" + "/Users/au554730/Desktop/Projects/TextDescriptives/.venv/lib/python3.10/site-packages/tqdm/auto.py:22: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", + " from .autonotebook import tqdm as notebook_tqdm\n", + "Downloading readme: 100%|██████████| 1.42k/1.42k [00:00<00:00, 486kB/s]\n", + "Using custom data configuration DDSC--partial-danish-gigaword-small-test-sample-6518b630de09688d\n" ] }, { - "data": { - "application/vnd.jupyter.widget-view+json": { - "model_id": "7ecf4788a2a8499b82115927bf126ccd", - "version_major": 2, - "version_minor": 0 - }, - "text/plain": [ - " 0%| | 0/1 [00:00" + "" ] }, - "execution_count": 33, + "execution_count": 6, "metadata": {}, "output_type": "execute_result" } @@ -405,7 +443,7 @@ }, { "cell_type": "code", - "execution_count": 34, + "execution_count": 7, "metadata": {}, "outputs": [], "source": [ @@ -422,7 +460,7 @@ }, { "cell_type": "code", - "execution_count": 35, + "execution_count": 8, "metadata": {}, "outputs": [ { @@ -436,7 +474,7 @@ "Pressenævnets formand udtaler:\n", "Det er en betingelse for at klage til Pressenævnet, at\n", "----\n", - "This is pass the quality filter:\n" + "This passed the quality filter:\n" ] }, { @@ -445,7 +483,7 @@ "False" ] }, - "execution_count": 35, + "execution_count": 8, "metadata": {}, "output_type": "execute_result" } @@ -455,7 +493,7 @@ "\n", "print(legal_doc[:100]) # print the first 100 tokens\n", "print(\"----\")\n", - "print(\"This is pass the quality filter:\")\n", + "print(\"This passed the quality filter:\")\n", "legal_doc._.passed_quality_check" ] }, @@ -469,7 +507,7 @@ }, { "cell_type": "code", - "execution_count": 36, + "execution_count": 9, "metadata": {}, "outputs": [ { @@ -496,7 +534,7 @@ " 
'contains_lorem ipsum': False}" ] }, - "execution_count": 36, + "execution_count": 9, "metadata": {}, "output_type": "execute_result" } @@ -517,7 +555,7 @@ }, { "cell_type": "code", - "execution_count": 37, + "execution_count": 10, "metadata": {}, "outputs": [ { @@ -526,7 +564,7 @@ "QualityThresholds(n_stop_words=(2, None), alpha_ratio=(0.6, None), mean_word_length=(3, 10), doc_length=(10, 100000), symbol_to_word_ratio={'#': (None, 0.1)}, proportion_ellipsis=(None, 0.3), proportion_bullet_points=(None, 0.8), contains={'lorem ipsum': False}, duplicate_line_chr_fraction=(None, 0.2), duplicate_paragraph_chr_fraction=(None, 0.2), duplicate_ngram_chr_fraction={'5': (None, 0.15), '6': (None, 0.14), '7': (None, 0.13), '8': (None, 0.12), '9': (None, 0.11), '10': (None, 0.1)}, top_ngram_chr_fraction={'2': (None, 0.2), '3': (None, 0.18), '4': (None, 0.16)})" ] }, - "execution_count": 37, + "execution_count": 10, "metadata": {}, "output_type": "execute_result" } @@ -554,7 +592,7 @@ }, { "cell_type": "code", - "execution_count": 38, + "execution_count": 11, "metadata": {}, "outputs": [], "source": [ @@ -564,7 +602,7 @@ }, { "cell_type": "code", - "execution_count": 39, + "execution_count": 12, "metadata": {}, "outputs": [ { @@ -589,7 +627,7 @@ }, { "cell_type": "code", - "execution_count": 40, + "execution_count": 13, "metadata": {}, "outputs": [ { @@ -598,7 +636,7 @@ "" ] }, - "execution_count": 40, + "execution_count": 13, "metadata": {}, "output_type": "execute_result" }, @@ -625,12 +663,12 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "This seems fine like it explains a lot of the texts which were filtered out, but does not explain everything. Let us take a look at the `alpha_ratio` as well:" + "This seems like it explains a lot of the texts which were filtered out, but does not explain everything. 
Let us take a look at the `alpha_ratio` as well:" ] }, { "cell_type": "code", - "execution_count": 41, + "execution_count": 14, "metadata": {}, "outputs": [ { @@ -639,7 +677,7 @@ "" ] }, - "execution_count": 41, + "execution_count": 14, "metadata": {}, "output_type": "execute_result" }, @@ -664,7 +702,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "We see that most of the text does not pass the `alpha_ratio` filter of 0.8 or higher. This is not uncommon for legal documents as e.g. the paragraph sign `§` is not an alphabetic character. It might be relevant to change the thresholdhold to 0.7 or lower." + "We see that most of the text does not pass the `alpha_ratio` filter of 0.8 or higher. This is not uncommon for legal documents as e.g. the paragraph sign `§` is not an alphabetic character. It might be relevant to change the threshold to 0.7 or lower." ] }, { @@ -678,7 +716,7 @@ }, { "cell_type": "code", - "execution_count": 42, + "execution_count": 15, "metadata": {}, "outputs": [], "source": [ @@ -689,23 +727,12 @@ }, { "cell_type": "code", - "execution_count": 43, - "metadata": {}, - "outputs": [], - "source": [ - "speech_docs = nlp.pipe(speech[\"text\"], batch_size=100, n_process=4)\n", - "speech_docs = list(speech_docs)" - ] - }, - { - "cell_type": "code", - "execution_count": 44, + "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "# etract alpha ratio:\n", - "news_alpha_ratio = [doc._.quality[\"alpha_ratio\"] for doc in news_docs]\n", - "speech_alpha_ratio = [doc._.quality[\"alpha_ratio\"] for doc in speech_docs]" + "news_alpha_ratio = [doc._.quality[\"alpha_ratio\"] for doc in news_docs]" ] }, { @@ -718,22 +745,22 @@ }, { "cell_type": "code", - "execution_count": 45, + "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "" + "" ] }, - "execution_count": 45, + "execution_count": 17, "metadata": {}, "output_type": "execute_result" }, { "data": { - "image/png": "", + "image/png": "", "text/plain": [ "
" ] @@ -746,7 +773,6 @@ "import matplotlib.pyplot as plt\n", "# histogram\n", "sns.histplot(news_alpha_ratio, label=\"News\", alpha=0.5, binwidth=0.05)\n", - "sns.histplot(speech_alpha_ratio, label=\"Speech\", alpha=0.5, binwidth=0.05)\n", "sns.histplot(alpha_ratio, label=\"Legal\", alpha=0.5, binwidth=0.05)\n", "\n", "# add labels\n", @@ -761,42 +787,187 @@ "metadata": {}, "source": [ "Here we see a couple of things:\n", - "- Spontanous speech have a notably low alpha ratio. We should probably look into it.\n", - "- A reasonable amount of legal documents have an alpha ratio above 0.6.\n", + "- A fair number of legal documents have an alpha ratio above 0.6.\n", "- Almost no news text have a alpha ratio below 0.6.\n", "\n", - "Let us examine the spontaneous speech a bit more:" + "Let us examine one of the legal documents with a low alpha ratio in more depth:" ] }, { "cell_type": "code", - "execution_count": 46, + "execution_count": 19, "metadata": {}, "outputs": [ { - "name": "stdout", - "output_type": "stream", - "text": [ - "[Taler, 6, :, mm, \n", - ", Taler, 7, :, er, du, klar, ?, \n", - ", Taler, 6, :, ja, \n", - ", Taler, 7, :, så, er, spørgsmålet, om, vi, skal-, om, det, er, sådan, her, ja, det, kunne, man, godt, okay, \n", - ", Taler, 7, :, okay, så, det, er, ignore, tab, kill, og, kill, tab, \n", - ", Taler, 6, :, NA, \n", - ", Taler, 6, :, kill, \n", - ", Taler, 6, :, kill, tab, \n", - ", Taler, 7, :, super, \n", - ", Taler, 7, :, okay, det, er, det, hun, lige, har, sagt, \n", - ", Taler, 6, :, ja, \n", - ", Taler, 6, :, ja, \n", - ", Taler, 6, :, NA]\n" - ] + "data": { + "text/plain": [ + "Oversigt (indholdsfortegnelse)\n", + "\n", + "Den fulde tekst\n", + "\n", + "Bekendtgørelse om\n", + "Fanefjord-Grønsund Vildtreservat\n", + "\n", + "I medfør af § 33 og § 49, stk. 1 og 3, i lov om jagt og\n", + "vildtforvaltning, jf. lovbekendtgørelse nr. 114 af 28. januar 1997,\n", + "fastsættes:\n", + "Formål\n", + "\n", + "§ 1. 
Bekendtgørelsen har til formål\n", + "at sikre Fanefjord og en del af Grønsund som yngle-, raste- og\n", + "fourageringsområde for vandfugle .\n", + "\n", + "Afgrænsning\n", + "\n", + "End of \"§ 1\"\n", + "\n", + "§ 2. Fanefjord-Grønsund Vildtreservat i\n", + "Storstrøms Amt omfatter, som angivet på kortbilag:\n", + "1)\tLandarealer ved Fanefjord:\n", + "a)\tMatr. nr. 12 b , del af 14 b , del af 20 b ,\n", + "20 e , 23 a , 23 b , 23 c, 55 a , 55 b ,\n", + "55 c og 64 Hårbølle By, Fanefjord. Den del af matr. nr.\n", + "8 f og 23 d Hårbølle By, Fanefjord, som er\n", + "beliggende nord for den øst-vestgående markvej til\n", + "Færgensvænge samt den del af matr. nr. 12 c\n", + "Hårbølle By, Fanefjord, som er beliggende indenfor en afstand af\n", + "200 m fra Fanefjord. De dele af matr. nr. 1, 4 e , 6 c ,\n", + "19 c , 20 a og 20 af Hårbølle By, Fanefjord,\n", + "som er beliggende indenfor en afstand af 100 m fra Fanefjord.\n", + "b)\tMatr. nr. 12 e , 12 k , 12 o , 13 d ,\n", + "13 f , 13 g og 13 k Kokseby By, Fanefjord, og de dele af\n", + "matr. nr. 13 c og 13 e Kokseby By, Fanefjord, som er beliggende\n", + "syd for diget mellem Færgegården og Vollerup Græsgange. De\n", + "dele af matr. nr. 12 u , 12 i og 12 f Kokseby By, Fanefjord,\n", + "som er beliggende syd for en ret linie fra vejen syd for Kirkegården\n", + "(60 m nord for Lammehavevej) til et punkt i matrikelskellet mellem 12 f\n", + "og 12 k Kokseby By, Fanefjord, beliggende i en afstand af ca. 35 m fra,\n", + "hvor matrikelskellet skærer kystlinien ved Fanefjord.\n", + "c)\tDe dele af matr. nr. 
1 c og 1 b Grønsund\n", + "Færgegård, Fanefjord (herunder Malurt-holm), som er beliggende\n", + "syd for diget mellem Færgegården og Vollerup\n", + "Græsgange.\n", + "2)\tFanefjord og Grønsund afgrænset:\n", + "a)\tMod sydøst af en ret linie mellem den nordlige mole ved\n", + "lystbådehavnen ved Hårbøllebro og Skansepynt ved\n", + "Grønsund,\n", + "b)\tmod vest af en ret linie mellem høfden ved Ore Strand og det\n", + "punkt på kysten ved Bogø, hvor dæmningen møder\n", + "kysten ved Gundernæs, og\n", + "c)\tmod nord af Bogødæmningen.\n", + "3)\tDen del af Bogø Letten som er beliggende indenfor en afstand\n", + "af 200 meter fra Bogødæmningen.\n", + "Stk. 2. Mod land afgrænses de i stk. 1, nr. 2 og 3\n", + "nævnte dele af søterritoriet af højeste, daglige\n", + "vandstandslinie.\n", + "Jagt\n", + "\n", + "End of \"§ 2\"\n", + "\n", + "§ 3. Det er forbudt at udøve jagt på\n", + "eller på anden måde at ombringe, indfange eller forjage vandfugle\n", + "på de i § 2, stk. 1, nr. 1, nævnte landarealer.\n", + "End of \"§ 3\"\n", + "\n", + "§ 4. Det er forbudt at udøve jagt på\n", + "eller på anden måde at ombringe, indfange eller forjage pattedyr\n", + "og fugle på:\n", + "1)\tDen i § 2, stk. 1, nr. 2, nævnte del af søterritoriet,\n", + "der er beliggende nord og øst for en ret linie mellem positionerne\n", + "54 °\n", + "53,40 N. 12 °\n", + "07,80 E (200 m sydvest for Hårbøllebro), og 54 °\n", + "53,68 N. 12 °\n", + "07,35 E (250 m sydvest for Færgensvænge) og en ret linie derfra\n", + "til position 54 °\n", + "54,31 N. 12 °\n", + "04,63 E (300 m syd for Gundernæs på Bogø), og\n", + "2)\tden i § 2, stk. 1, nr. 3, nævnte del af søterritoriet,\n", + "der er beliggende indenfor en afstand af 200 m fra\n", + "Bogødæmningen.\n", + "Stk. 2. Færdsel med ladt skydevåben er\n", + "forbudt på de i stk. 1 nævnte dele af søterritoriet.\n", + "Stk. 3. De anvendte koordinater er geografiske positioner\n", + "i henhold til projektion WGS-84.\n", + "End of \"§ 4\"\n", + "\n", + "§ 5. 
Det er forbudt at udøve jagt fra\n", + "motordrevet fartøj på den i § 2, stk. 1, nr. 2, nævnte del\n", + "af søterritoriet, der er beliggende syd for de i § 4 stk. 1, nr. 1,\n", + "beskrevne linier.\n", + "Færdsel\n", + "\n", + "End of \"§ 5\"\n", + "\n", + "§ 6. Sejlads med motordrevet fartøj med\n", + "højere hastighed end 6 knob er forbudt på de i § 4, stk. 1,\n", + "nævnte dele af søterritoriet.\n", + "End of \"§ 6\"\n", + "\n", + "§ 7. Brætsejlads er forbudt på de i §\n", + "4, stk. 1, nævnte dele af søterritoriet fra 1. september til 30.\n", + "april.\n", + "End of \"§ 7\"\n", + "\n", + "§ 8. Færdsel er forbudt fra 1. april til 15.\n", + "juli på Malurtholm og på søterritoriet omkring øen\n", + "indenfor en afstand af 50 meter fra højeste, daglige\n", + "vandstandslinie.\n", + "Stk. 2. Bestemmelsen i stk. 1 gælder ikke\n", + "for:\n", + "1)\tEjere og brugere samt disses husstand og personale.\n", + "2)\tSejlads på den nævnte del af søterritoriet i\n", + "forbindelse med erhvervsfiskeri.\n", + "Dispensation og tilsyn\n", + "\n", + "End of \"§ 8\"\n", + "\n", + "§ 9. Skov- og Naturstyrelsen kan, når\n", + "særlige forhold taler derfor, dispensere fra bestemmelserne i §§\n", + "3-8.\n", + "Stk. 2. Skov- og Naturstyrelsens afgørelser efter\n", + "stk. 1 kan ikke indbringes for anden administrativ myndighed.\n", + "Stk. 3. Uanset bestemmelserne i § 6 og § 8 kan\n", + "Farvandsvæsenet eller andre (f.eks. havne) udføre arbejder i\n", + "forbindelse med redningsopgaver og den for sejladsen nødvendige\n", + "afmærkning m.v.\n", + "End of \"§ 9\"\n", + "\n", + "§ 10. Skov- og Naturstyrelsen fører tilsyn\n", + "med, at reservatbestemmelserne overholdes.\n", + "Straf og ikrafttrædelse\n", + "\n", + "End of \"§ 10\"\n", + "\n", + "§ 11. Efter § 54, stk. 1, nr. 5 og nr. 7, i lov om\n", + "jagt og vildtforvaltning, jf. lovbekendtgørelse nr. 114 af 28. 
januar\n", + "1997, straffes overtrædelse af bestemmelserne i §§ 3-8 eller\n", + "tilsidesættelse af vilkår, der er fastsat i en dispensation i\n", + "medfør af § 9 med bøde, medmindre strengere straf er forskyldt\n", + "efter den øvrige lovgivning.\n", + "End of \"§ 11\"\n", + "\n", + "§ 12. Bekendtgørelsen træder i kraft\n", + "den 1. september 1999.\n", + "End of \"§ 12\"\n", + "\n", + "Miljø- og Energiministeriet, den 28. juni\n", + "1999\n", + "Svend Auken\n", + "/Jens Peter Simonsen\n", + "End of \"GIVET\"" + ] + }, + "execution_count": 19, + "metadata": {}, + "output_type": "execute_result" } ], "source": [ - "# Examing the first speech document\n", - "doc = speech_docs[0]\n", - "print([t for t in doc[:100]]) # print the first 100 tokens" + "# Find docs with an alpha ratio below 0.6\n", + "low_legal_alpha_ratio = [doc for doc in legal_docs if doc._.quality[\"alpha_ratio\"] < 0.6]\n", + "\n", + "low_legal_alpha_ratio[0]" ] }, { @@ -804,15 +975,20 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "From this we can see that a high proportion of the tokens in the speech dataset actually denotes the speeaker. This might or might not be problematic for the dataset of interesting, but it does indeed make sense that it inflates the number of tokens.\n", + "From this we can see that a high proportion of the tokens in this legal document are paragraph signs, paragraph numbers or cadastral parcel numbers (20 e, 23 a etc.). This might or might not be problematic for the task at hand.\n", "\n", - "Therefore it is important to note that while these filters are useful for filtering large amount of texts it is also important to know that they should probably the adjusted to the target domain." 
+ "**These filters are useful for filtering large amounts of text, but they should probably be adjusted to the target domain.**" ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [] } ], "metadata": { "kernelspec": { - "display_name": "textdescriptives", + "display_name": "Python 3.10.9 ('.venv': venv)", "language": "python", "name": "python3" }, @@ -826,12 +1002,12 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.8.15" + "version": "3.10.9" }, "orig_nbformat": 4, "vscode": { "interpreter": { - "hash": "31387647799921bb85032eec7bb02e281325ae7f8ffa6f9cd7cdead815b36c88" + "hash": "1fec3abd59d8d4e793464ce299b69082c8b9c618d555ba6df7044c7d7b4183f8" } } }, From 2017fb351c9b2157bb59525322f509e8367792a6 Mon Sep 17 00:00:00 2001 From: Lasse Date: Mon, 9 Jan 2023 15:09:29 +0100 Subject: [PATCH 04/14] docs: added quality tutorial to docs --- docs/tutorial.rst | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/tutorial.rst b/docs/tutorial.rst index afada036..333c209e 100644 --- a/docs/tutorial.rst +++ b/docs/tutorial.rst @@ -10,4 +10,5 @@ locally. 
:caption: Tutorials tutorials/introductory_tutorial.ipynb + tutorials/filter_corpus_using_quality.ipynb From 57d00543c1ce0525fee6570b8bb843bc62295bf7 Mon Sep 17 00:00:00 2001 From: Kenneth Enevoldsen Date: Tue, 10 Jan 2023 11:48:10 +0100 Subject: [PATCH 05/14] docs: updated tutorial --- .../filter_corpus_using_quality.ipynb | 701 +++++++++++------- 1 file changed, 433 insertions(+), 268 deletions(-) diff --git a/docs/tutorials/filter_corpus_using_quality.ipynb b/docs/tutorials/filter_corpus_using_quality.ipynb index 4f6c3238..51582d55 100644 --- a/docs/tutorials/filter_corpus_using_quality.ipynb +++ b/docs/tutorials/filter_corpus_using_quality.ipynb @@ -24,7 +24,7 @@ "source": [ "## Setup\n", "\n", - "For this we will use the [Danish Gigaword](https://sprogteknologi.dk/dataset/danish-gigaword) available on [Huggingface Datasets](DDSC/partial-danish-gigaword-no-twitter). For the purpose of this tutorial we will just use a small test version of it containing around 2500 examples, but you could easily change it to use the whole dataset. Danish Gigaword is a large collection of Danish texts collected from a variety of domains. To download it we will need the `datasets` package. Which you can install by running\n", + "For this we will use datasets available on [Huggingface Datasets](https://huggingface.co/datasets). 
Thus we will need the `datasets` package, which you can install by running\n", "\n", "```python\n", "!pip install datasets\n", @@ -35,68 +35,410 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "try:\n", - " import textdescriptives\n", + " import textdescriptives as td\n", "except:\n", - " !pip install \"textdescriptives[tutorials]\"" + " !pip install \"textdescriptives[tutorials]\"\n", + " import textdescriptives as td" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Filtering Web content\n" ] }, { + "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ - "We can now easily donwload the dataset using the following command:" + "\n", + "### The Data\n", + "For our first example we will filter web content. For this we will use the [mC4 dataset](https://huggingface.co/datasets/mc4). It would take ages to download the whole dataset, so we will stream the first 1000 samples from it." ] }, { "cell_type": "code", - "execution_count": 1, + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "from datasets import load_dataset\n", + "\n", + "# stream in the dataset\n", + "dataset = load_dataset(\"mc4\", \"en\", streaming=True, split=\"train\")\n", + "\n", + "# take the first 1000 samples\n", + "dataset = dataset.take(1000)\n", + "\n", + "# extract the text from each sample\n", + "texts = [sample[\"text\"] for sample in dataset]\n" + ] + }, + { + "cell_type": "code", + "execution_count": 3, "metadata": {}, "outputs": [ { - "name": "stderr", + "name": "stdout", "output_type": "stream", "text": [ - "/Users/au554730/Desktop/Projects/TextDescriptives/.venv/lib/python3.10/site-packages/tqdm/auto.py:22: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. 
See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", - " from .autonotebook import tqdm as notebook_tqdm\n", - "Downloading readme: 100%|██████████| 1.42k/1.42k [00:00<00:00, 486kB/s]\n", - "Using custom data configuration DDSC--partial-danish-gigaword-small-test-sample-6518b630de09688d\n" + "Posts 4,362\tMore Info\n", + "Okay so to those of you that were very helpful this is not to you but for those of you that laugh when I ask about ohms or powering LSi15's this is to you. If you know a book, website, or someone to talk to to get more info that I seek so I know what some of you are talking about please share it with me. I ask questions to gain more info on audio thats all. Not to get laughed\n" ] - }, + } + ], + "source": [ + "# let us look at the first part (400 characters) of the first text\n", + "print(texts[0][:400])\n" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Filtering\n", + "\n", + "To filter domains using `textdescriptives` we need to first set up the pipeline:" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "import spacy\n", + "\n", + "# create the spacy nlp pipeline\n", + "nlp = spacy.blank(\"en\")\n", + "# add a component for sentence segmentation\n", + "nlp.add_pipe(\"sentencizer\")\n", + "# add a component for quality filtering\n", + "quality_pipe = nlp.add_pipe(\"textdescriptives/quality\")\n", + "\n", + "# apply the pipeline to the texts\n", + "docs = nlp.pipe(texts)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You will note here that docs is a generator. 
This can be quite useful (especially when streaming texts in one at a time), but for this example we can simply convert it to a list:" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "Downloading and preparing dataset None/None to /Users/au554730/.cache/huggingface/datasets/DDSC___parquet/DDSC--partial-danish-gigaword-small-test-sample-6518b630de09688d/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec...\n" + "docs is type \n", + "docs is type \n" ] - }, + } + ], + "source": [ + "print(f\"docs is type {type(docs)}\")\n", + "docs = list(docs)\n", + "print(f\"docs is type {type(docs)}\")" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now it is easy to examine the documents using the `doc._.quality` or `doc._.passed_quality_check` extensions:" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ { - "name": "stderr", + "name": "stdout", "output_type": "stream", "text": [ - "Downloading data: 100%|██████████| 11.7M/11.7M [00:00<00:00, 20.8MB/s]\n", - "Downloading data files: 100%|██████████| 1/1 [00:01<00:00, 1.86s/it]\n", - "Extracting data files: 100%|██████████| 1/1 [00:00<00:00, 585.80it/s]\n", - " \r" + "Posts 4,362\tMore Info\n", + "Okay so to those of you that were very helpful this is not to you but for those of you that laugh when I ask about ohms or powering LSi15's this is to you. If you know a book, website, or someone to talk to to get more info that I seek so I know what some of you are talking about please share it with me. I ask questions to gain more info on audio thats all. 
Not to get laughed at when asking it.\n" ] - }, + } + ], + "source": [ + "# examine the first document\n", + "doc = docs[0]\n", + "print(doc[:100])" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "False" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "doc._.passed_quality_check" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "It seems like this document did not pass the quality check. Let us examine why that is:" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'n_stop_words': 435,\n", + " 'alpha_ratio': 0.7919463087248322,\n", + " 'mean_word_length': 3.523489932885906,\n", + " 'doc_length': 894,\n", + " 'proportion_ellipsis': 0.0,\n", + " 'proportion_bullet_points': 0.0,\n", + " 'duplicate_line_chr_fraction': 0.0,\n", + " 'duplicate_paragraph_chr_fraction': 0.0,\n", + " 'duplicate_5-gram_chr_fraction': 0.42479253112033194,\n", + " 'duplicate_6-gram_chr_fraction': 0.41649377593361,\n", + " 'duplicate_7-gram_chr_fraction': 0.3757780082987552,\n", + " 'duplicate_8-gram_chr_fraction': 0.36410788381742737,\n", + " 'duplicate_9-gram_chr_fraction': 0.36410788381742737,\n", + " 'duplicate_10-gram_chr_fraction': 0.3571058091286307,\n", + " 'top_2-gram_chr_fraction': 0.008817427385892116,\n", + " 'top_3-gram_chr_fraction': 0.011670124481327801,\n", + " 'top_4-gram_chr_fraction': 0.014004149377593362,\n", + " 'symbol_#_to_word_ratio': 0.0,\n", + " 'contains_lorem ipsum': False}" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "doc._.quality" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Naturally we realize that you might not know what all of these mean, but you can easily check it on [the 
documentation site](https://hlasse.github.io/TextDescriptives/quality.html). Examining these we see that this text has a high proportion of characters which appear in duplicate n-grams (e.g. `duplicate_10-gram_chr_fraction`). When this fraction is high, the text contains a high proportion of repetitions. This is often a sign of low-quality text.\n", + "\n", + "If we examine the quality thresholds of the pipeline we can see that the max allowed value for `duplicate_10-gram_chr_fraction` is 0.1:" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "Dataset parquet downloaded and prepared to /Users/au554730/.cache/huggingface/datasets/DDSC___parquet/DDSC--partial-danish-gigaword-small-test-sample-6518b630de09688d/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec. Subsequent calls will reuse this data.\n" + "n_stop_words=(2, None) alpha_ratio=(0.7, None) mean_word_length=(3, 10) doc_length=(10, 100000) symbol_to_word_ratio={'#': (None, 0.1)} proportion_ellipsis=(None, 0.3) proportion_bullet_points=(None, 0.8) contains={'lorem ipsum': False} duplicate_line_chr_fraction=(None, 0.2) duplicate_paragraph_chr_fraction=(None, 0.2) duplicate_ngram_chr_fraction={'5': (None, 0.15), '6': (None, 0.14), '7': (None, 0.13), '8': (None, 0.12), '9': (None, 0.11), '10': (None, 0.1)} top_ngram_chr_fraction={'2': (None, 0.2), '3': (None, 0.18), '4': (None, 0.16)}\n", + "---\n", + "The thresholds for Duplicate n-grams:\n", + "{'5': (None, 0.15), '6': (None, 0.14), '7': (None, 0.13), '8': (None, 0.12), '9': (None, 0.11), '10': (None, 0.1)}\n" ] - }, + } + ], + "source": [ + "print(quality_pipe.quality_thresholds)\n", + "\n", + "print(\"---\")\n", + "print(\"The thresholds for Duplicate n-grams:\")\n", + "print(quality_pipe.quality_thresholds.duplicate_ngram_chr_fraction)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": 
[ + "### Extracting high quality texts\n", + "We are typically interested in the texts which are not of low quality. We can extract these by keeping only the texts which passed the quality check." + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [], + "source": [ + "filtered_texts = [doc for doc in docs if doc._.passed_quality_check]" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "A total of 1000 texts were processed and 576 passed the quality check.\n" + ] + } + ], + "source": [ + "print(f\"A total of {len(docs)} texts were processed and {len(filtered_texts)} passed the quality check.\")" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Changing the filters\n", + "Naturally, in some cases you might want to apply other filters. For instance, the current filter sets a `symbol_to_word_ratio` threshold of 0.1 for hashtags `#`. This means that a text which contains a lot of hashtags will be filtered out. However, if you are working on e.g. tweets, this is an unreasonable filter and you might want to adjust it. 
You can do this by overwriting the quality_thresholds:" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [], + "source": [ + "new_thresholds = td.QualityThresholds(\n", + " n_stop_words=(2, None), # at least 2 stop words, no upper bound\n", + " alpha_ratio=(0.7, None),\n", + " mean_word_length=(3, 10), # mean word length between 3 and 10 characters\n", + " doc_length=(10, 100_000),\n", + " symbol_to_word_ratio={}, # don't filter based on symbol to word ratio.\n", + " proportion_ellipsis=(None, 0.3),\n", + " proportion_bullet_points=(None, 0.8),\n", + " contains={\"lorem ipsum\": False}, # remove texts which contain the string \"lorem ipsum\"\n", + " duplicate_line_chr_fraction=(None, 0.2),\n", + " duplicate_paragraph_chr_fraction=(None, 0.2),\n", + " duplicate_ngram_chr_fraction={}, # don't filter based on duplicate n-grams\n", + " top_ngram_chr_fraction={\"2\": (None, 0.2), \"3\": (None, 0.18), \"4\": (None, 0.16)}\n", + ")\n", + "\n", + "# overwrite the existing thresholds\n", + "quality_pipe.set_quality_thresholds(new_thresholds)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If you want to read more about what each argument does, please check out the [documentation](https://hlasse.github.io/TextDescriptives/quality.html#data-classes)." + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "True" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# check if the text now passes the quality filter\n", + "doc._.passed_quality_check" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Comparing Domains\n", + "\n", + "These quality metrics are heuristic-based and thus, while they are reasonable for one domain, might not be reasonable for another. 
We will explore this a bit further in this section. These filters are specifically tuned for the web domain and this can lead to problems if applied directly to other domains.\n", + "\n" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Data\n", + "\n", + "For this we will use the [Danish Gigaword](https://sprogteknologi.dk/dataset/danish-gigaword) available on [Huggingface Datasets](DDSC/partial-danish-gigaword-no-twitter). For the purpose of this tutorial we will just use a small test version of it containing around 2500 examples, but you could easily change it to use the whole dataset. Danish Gigaword is a large collection of Danish texts collected from a variety of domains." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "We can download the dataset using the following command:" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ - "100%|██████████| 1/1 [00:00<00:00, 233.71it/s]\n" + "Using custom data configuration DDSC--partial-danish-gigaword-small-test-sample-6518b630de09688d\n", + "Found cached dataset parquet (/Users/au561649/.cache/huggingface/datasets/DDSC___parquet/DDSC--partial-danish-gigaword-small-test-sample-6518b630de09688d/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)\n" ] + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "ff5e5f41e3414f82aa9694629ce34413", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + " 0%| | 0/1 [00:00" + "" ] }, - "execution_count": 6, + "execution_count": 19, "metadata": {}, "output_type": "execute_result" } @@ -443,7 +782,7 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 20, "metadata": {}, "outputs": [], "source": [ @@ -460,7 +799,7 @@ }, { "cell_type": "code", - "execution_count": 8, + "execution_count": 21, "metadata": {}, "outputs": [ { 
@@ -483,7 +822,7 @@ "False" ] }, - "execution_count": 8, + "execution_count": 21, "metadata": {}, "output_type": "execute_result" } @@ -507,7 +846,7 @@ }, { "cell_type": "code", - "execution_count": 9, + "execution_count": 22, "metadata": {}, "outputs": [ { @@ -534,7 +873,7 @@ " 'contains_lorem ipsum': False}" ] }, - "execution_count": 9, + "execution_count": 22, "metadata": {}, "output_type": "execute_result" } @@ -548,37 +887,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Here we see that fraction of characters which is a part of a duplicate 10 gram is >50%. This is likely the reason why the sample was filtered out. This is not uncommon for legal documents which contain a lot of standard phrases. However you might wish to change the threshold for this filter. You can see an example of how to do this in the [documentation](https://hlasse.github.io/TextDescriptives/quality.html). We also see that the `alpha_ratio` is close 0.8. This means that the text is mostly made up of alphabetic characters.\n", - "\n", - "You can also inspect the existing thresholds:" - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "QualityThresholds(n_stop_words=(2, None), alpha_ratio=(0.6, None), mean_word_length=(3, 10), doc_length=(10, 100000), symbol_to_word_ratio={'#': (None, 0.1)}, proportion_ellipsis=(None, 0.3), proportion_bullet_points=(None, 0.8), contains={'lorem ipsum': False}, duplicate_line_chr_fraction=(None, 0.2), duplicate_paragraph_chr_fraction=(None, 0.2), duplicate_ngram_chr_fraction={'5': (None, 0.15), '6': (None, 0.14), '7': (None, 0.13), '8': (None, 0.12), '9': (None, 0.11), '10': (None, 0.1)}, top_ngram_chr_fraction={'2': (None, 0.2), '3': (None, 0.18), '4': (None, 0.16)})" - ] - }, - "execution_count": 10, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "quality_pipe.quality_thresholds" - ] - }, - { - "attachments": {}, - "cell_type": 
"markdown", - "metadata": {}, - "source": [ - "Here we see that the `duplicate_ngram_chr_fraction` for 10-grams is 0.1. This means that if a text contains more than 10% of characters which are a part of a duplicate 10-gram it will be filtered out." + "Here we see that the fraction of characters which are part of a duplicate 10-gram is >50%. This is one reason why the sample was filtered out. This is not uncommon for legal documents, which contain a lot of standard phrases. However, you might wish to change the threshold for this filter. We showed how to do this in the previous section. We also see that the `alpha_ratio` is close to 0.8. This means that most words contain at least one alphabetic character. This is good, but as we will see later, by no means common for legal texts." ] }, { @@ -592,7 +901,7 @@ }, { "cell_type": "code", - "execution_count": 11, + "execution_count": 23, "metadata": {}, "outputs": [], "source": [ @@ -602,14 +911,14 @@ }, { "cell_type": "code", - "execution_count": 12, + "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "We had a total of 1000 which we filtered down to 335.\n" + "We had a total of 1000 which we filtered down to 264.\n" ] } ], @@ -627,7 +936,7 @@ }, { "cell_type": "code", - "execution_count": 13, + "execution_count": 25, "metadata": {}, "outputs": [ { @@ -636,7 +945,7 @@ "" ] }, - "execution_count": 13, + "execution_count": 25, "metadata": {}, "output_type": "execute_result" }, @@ -663,12 +972,12 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "This seems like it explains a lot of the texts which were filtered out, but does not explain everything. Let us take a look at the `alpha_ratio` as well:" + "This seems like it explains a lot of the texts which were filtered out, but does not explain everything. 
Let us take a look at the `alpha_ratio` (the proportion of words which contain at least one alphabetic character) as well:" ] }, { "cell_type": "code", - "execution_count": 14, + "execution_count": 26, "metadata": {}, "outputs": [ { @@ -677,7 +986,7 @@ "" ] }, - "execution_count": 14, + "execution_count": 26, "metadata": {}, "output_type": "execute_result" }, @@ -698,41 +1007,42 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ - "We see that most of the text does not pass the `alpha_ratio` filter of 0.8 or higher. This is not uncommon for legal documents as e.g. the paragraph sign `§` is not an alphabetic character. It might be relevant to change the threshold to 0.7 or lower." + "We see that most of the texts do not pass the `alpha_ratio` threshold of 0.7 or higher. This is not uncommon for legal documents as e.g. the paragraph sign `§` is not an alphabetic character. It might be relevant to lower the threshold for this domain." ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ - "# Comparing across domains\n", + "### Comparing across domains\n", "We see that legal documents have quite a few perculiarities let us examine how the `alpha_ratio` behaves across different domains:" ] }, { "cell_type": "code", - "execution_count": 15, + "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "# first we apply the pipeline to the other domains\n", "news_docs = nlp.pipe(news[\"text\"], batch_size=100, n_process=4)\n", - "news_docs = list(news_docs)" + "news_docs = list(news_docs)\n", + "speech_docs = nlp.pipe(speech[\"text\"], batch_size=100, n_process=4)\n", + "speech_docs = list(speech_docs)" ] }, { "cell_type": "code", - "execution_count": 16, + "execution_count": 28, "metadata": {}, "outputs": [], "source": [ - "# etract alpha ratio:\n", - "news_alpha_ratio = 
[doc._.quality[\"alpha_ratio\"] for doc in news_docs]\n", + "speech_alpha_ratio = [doc._.quality[\"alpha_ratio\"] for doc in speech_docs]" ] }, { @@ -745,22 +1055,22 @@ }, { "cell_type": "code", - "execution_count": 17, + "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "" + "" ] }, - "execution_count": 17, + "execution_count": 29, "metadata": {}, "output_type": "execute_result" }, { "data": { - "image/png": "iVBORw0KGgoAAAANSUhEUgAAAjsAAAGzCAYAAADJ3dZzAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjYuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8o6BhiAAAACXBIWXMAAA9hAAAPYQGoP6dpAAA950lEQVR4nO3deVxWdf7//yc7ggIqyGLglgtMmuaKbWaMVOZYOZMamTWWo4GNUmaWpVFJOo06FurUILbo2PTNnMYcU3FpETcazQUtzbqYFBAVEJFFOL8/+nlN18clwQvO5fFxv93O7cY5533Oeb3foT47q5thGIYAAAAsyt3sAgAAAOoTYQcAAFgaYQcAAFgaYQcAAFgaYQcAAFgaYQcAAFgaYQcAAFgaYQcAAFgaYQcAAFgaYQcAAFiap5kHb926tX744Ydzlj/++ONKS0tTeXm5nnzySS1dulQVFRWKj4/XvHnzFBoaam9rs9k0duxYrV+/Xo0bN9bIkSOVmpoqT89L71pNTY0OHz6sJk2ayM3NzSl9AwAA9cswDJ08eVIRERFyd7/I+RvDRAUFBcaRI0fs05o1awxJxvr16w3DMIwxY8YYkZGRRmZmprF9+3ajT58+Rt++fe3bnzlzxrjuuuuMuLg44z//+Y+xcuVKIzg42Jg8eXKt6sjNzTUkMTExMTExMV2BU25u7kX/nXczDNf5EOj48eO1YsUKffvttyopKVFISIiWLFmi3/72t5Kkffv2KTo6WllZWerTp4/+/e9/6+6779bhw4ftZ3sWLFigSZMm6ejRo/L29r6k4xYXFysoKEi5ubkKCAiot/4BAADnKSkpUWRkpIqKihQYGHjBdqZexvq5yspKvffee0pOTpabm5uys7NVVVWluLg4e5tOnTopKirKHnaysrLUuXNnh8ta8fHxGjt2rPbs2aNu3bqd91gVFRWqqKiwz588eVKSFBAQQNgBAOAK80u3oLjMDcrLly9XUVGRHn74YUlSXl6evL29FRQU5NAuNDRUeXl59jY/Dzpn159ddyGpqakKDAy0T5GRkc7rCAAAcCkuE3bS09N15513KiIiot6PNXnyZBUXF9un3Nzcej8mAAAwh0tcxvrhhx+0du1aLVu2zL4sLCxMlZWVKioqcji7k5+fr7CwMHubrVu3OuwrPz/fvu5CfHx85OPj48QeAAAAV+USYScjI0MtWrTQwIED7cu6d+8uLy8vZWZmasiQIZKk/fv3y2azKTY2VpIUGxurV155RQUFBWrRooUkac2aNQoICFBMTIxTa6ypqVFlZaVT93k18/LykoeHh9llAACuAqaHnZqaGmVkZGjkyJEO78YJDAzUqFGjlJycrGbNmikgIEDjxo1TbGys+vTpI0kaMGCAYmJiNGLECM2cOVN5eXmaMmWKEhMTnXrmprKyUocOHVJNTY3T9gkpKChIYWFhvNsIAFCvTA87a9eulc1m0+9///tz1s2ePVvu7u4aMmSIw0sFz/Lw8NCKFSs0duxYxcbGyt/fXyNHjlRKSorT6jMMQ0eOHJGHh4ciIyMv/t
IiXBLDMFRWVqaCggJJUnh4uMkVAQCszKXes2OWkpISBQYGqri4+JxHz6uqqnTgwAFFRERc9Bl+1N6xY8dUUFCgDh06cEkLAFBrF/v3++c4TfELqqurJemSX1CIS+fn5yfpp0AJAEB9IexcIu4rcT7GFADQEAg7AADA0ky/QflKZbPZVFhY2GDHCw4OVlRUVIMdDwAAqyDs1IHNZlOn6GidLitrsGM28vPTvpwcAg8AALVE2KmDwsJCnS4rU8KkPyk0ql29Hy/fdlCLZ0xUYWHhJYedhx9+WG+//bZSU1P1zDPP2JcvX75c9957r3gIDwBwtSDsXIbQqHa6pv2vzC7jgnx9fTVjxgz94Q9/UNOmTc0uBwAAUxB2LCwuLk4HDhxQamqqZs6ced42X3zxhSZPnqzt27crODhY9957r1JTU+Xv76833nhDCxYs0O7duyX976zQ/PnzNWbMGPsx+vTpo5dfflk7d+7U+PHjtX37drm5ual9+/b661//qh49ejRYnwGgvjX0PZuXins7L4ywY2EeHh6aPn26HnjgAT3xxBO65pprHNYfPHhQd9xxh15++WUtXLhQR48eVVJSkpKSkpSRkaFbb71VTzzxhI4ePaqQkBBt3LhRwcHB2rBhg8aMGaOqqiplZWXZL5MlJCSoW7dumj9/vjw8PLRjxw55eXmZ0XUAqBdm3LN5qbi388IIOxZ37733qmvXrpo6darS09Md1qWmpiohIUHjx4+XJLVv315z587Vrbfeqvnz5+u6665Ts2bNtHHjRv32t7/Vhg0b9OSTT+ovf/mLJGnr1q2qqqpS3759Jf30l8DEiRPVqVMn+/4AwEoa+p7NS1WXezuvJoSdq8CMGTPUv39/PfXUUw7Ld+7cqa+//lqLFy+2LzMMQzU1NTp06JCio6N1yy23aMOGDYqLi9PevXv1+OOPa+bMmdq3b582btyonj172t+EnJycrEcffVTvvvuu4uLi9Lvf/U7t2rnOXwYA4Cyufs8mHPFSwavALbfcovj4eE2ePNlheWlpqf7whz9ox44d9mnnzp369ttv7SGlX79+2rBhgz7//HN169ZNAQEB9gC0ceNG3Xrrrfb9TZs2TXv27NHAgQO1bt06xcTE6KOPPmrQvgIA8H9xZucq8eqrr6pr167q2LGjfdkNN9ygvXv36tprr73gdrfeeqvGjx+vDz74QP369ZP0UwBau3atvvzySz355JMO7Tt06KAOHTpowoQJGj58uDIyMnTvvffWS58AALgUhJ3LkG87eMUcp3PnzkpISNDcuXPtyyZNmqQ+ffooKSlJjz76qPz9/bV3716tWbNGb7zxhiSpS5cuatq0qZYsWaIVK1ZI+insPPXUU3Jzc9ONN94oSTp9+rQmTpyo3/72t2rTpo3++9//atu2bRoyZMhl1w4AwOUg7NRBcHCwGvn5afGMiQ12zEZ+fgoODr6sfaSkpOj999+3z3fp0kUbN27Uc889p5tvvlmGYahdu3YaOnSovY2bm5tuvvlmffLJJ7rpppvs2wUEBKhjx47y9/eX9NOTX8eOHdNDDz2k/Px8BQcH67777tOLL754WTUDAHC5CDt1EBUVpX05OS79baxFixads6x169aqqKhwWNazZ0+tXr36ovtavny5w7y7u7uOHz/usMzb21t///vfL7k+AAAaCmGnjqKioni8DwCAKwBPYwEAAEsj7AAAAEsj7AAAAEsj7AAAAEsj7AAAAEsj7AAAAEsj7AAAAEvjPTt1ZLPZXPqlggAA4CeEnTqw2WyKju6ksrLTDXZMP79GysnZd8UEng0bNui2227TiRMnFBQUZHY5AICrGGGnDgoLC1VWdlrvPXu/oqNC6v14ObajenD6P1RYWHjJYefhhx9WUVHROZ96AADgakPYuQzRUSG6oUNLs8sAAAAXwQ3KV6Hdu3frzjvvVOPGjRUaGqoRI0Y43H908uRJJSQkyN/fX+Hh4Zo9e7b69eun8ePH29u8++
676tGjh5o0aaKwsDA98MADKigoMKE3AABcHGHnKlNUVKT+/furW7du2r59u1atWqX8/Hzdf//99jbJycn68ssv9fHHH2vNmjX6/PPP9dVXXznsp6qqSi+99JJ27typ5cuX6/vvv9fDDz/cwL0BAOCXcRnrKvPGG2+oW7dumj59un3ZwoULFRkZqW+++Ubh4eF6++23tWTJEt1+++2SpIyMDEVERDjs5/e//73957Zt22ru3Lnq2bOnSktL1bhx44bpDAAAl4Cwc5XZuXOn1q9ff95AcvDgQZ0+fVpVVVXq1auXfXlgYKA6duzo0DY7O1vTpk3Tzp07deLECdXU1Ej66Um1mJiY+u0EAAC1QNi5ypSWlmrQoEGaMWPGOevCw8N14MCBX9zHqVOnFB8fr/j4eC1evFghISGy2WyKj49XZWVlfZQNAECdEXauMjfccIM+/PBDtW7dWp6e5/7nb9u2rby8vLRt2zb7Y+7FxcX65ptvdMstt0iS9u3bp2PHjunVV19VZGSkJGn79u0N1wkAAGqBsHMZcmxHXfo4xcXF2rFjh8Oy0aNH66233tLw4cP19NNPq1mzZjpw4ICWLl2qv/3tb2rSpIlGjhypiRMnqlmzZmrRooWmTp0qd3d3ubm5SZKioqLk7e2t119/XWPGjNHu3bv10ksvXW43AQCoF4SdOggODpafXyM9OP0fDXZMP79GCg4OrtU2GzZsULdu3RyWjRo1Sl9++aUmTZqkAQMGqKKiQq1atdIdd9whd/efHs6bNWuWxowZo7vvvlsBAQF6+umnlZubK19fX0lSSEiIFi1apGeffVZz587VDTfcoNdee02/+c1vnNNZAACciLBTB1FRUcrJ2efS38ZatGiRFi1adMH1y5Ytu+C6Jk2aaPHixfb5U6dO6cUXX9To0aPty4YPH67hw4c7bGcYhv3nfv36OcwDAGAWwk4dRUVFXTHfqaqt//znP9q3b5969eql4uJipaSkSJIGDx5scmUAANQeYQfn9dprr2n//v3y9vZW9+7d9fnnn9f6MhoAAK6AsINzdOvWTdnZ2WaXAQCAU/C5iEvE/SfOx5gCABqC6WHnxx9/1IMPPqjmzZurUaNG6ty5s8M7WwzD0AsvvKDw8HA1atRIcXFx+vbbbx32cfz4cSUkJCggIEBBQUEaNWqUSktLnVKfh4eHJPGyvHpQVlYmSfLy8jK5EgCAlZl6GevEiRO68cYbddttt+nf//63QkJC9O2336pp06b2NjNnztTcuXP19ttvq02bNnr++ecVHx+vvXv32h+FTkhI0JEjR7RmzRpVVVXpkUce0ejRo7VkyZLLrtHT01N+fn46evSovLy87I9no+4Mw1BZWZkKCgoUFBRkD5QAANQHU8POjBkzFBkZqYyMDPuyNm3a2H82DENz5szRlClT7E8CvfPOOwoNDdXy5cs1bNgw5eTkaNWqVdq2bZt69OghSXr99dd111136bXXXjvnA5a15ebmpvDwcB06dEg//PDDZe0LjoKCghQWFmZ2GQAAizM17Hz88ceKj4/X7373O23cuFEtW7bU448/rscee0ySdOjQIeXl5SkuLs6+TWBgoHr37q2srCwNGzZMWVlZCgoKsgcdSYqLi5O7u7u2bNmie++995zjVlRUqKKiwj5fUlJy0Tq9vb3Vvn17LmU5kZeXF2d0AAANwtSw891332n+/PlKTk7Ws88+q23btumJJ56Qt7e3Ro4cqby8PElSaGiow3ahoaH2dXl5eWrRooXDek9PTzVr1sze5v9KTU3Viy++WKta3d3d7ZfNAADAlcPUG1Bqamp0ww03aPr06erWrZtGjx6txx57TAsWLKjX406ePFnFxcX2KTc3t16PBwAAzGNq2AkPD1dMTIzDsujoaNlsNkmy38+Rn5/v0CY/P9++LiwsTAUFBQ7rz5w5o+PHj1/wfhAfHx8FBAQ4TAAAwJpMDTs33nij9u/f77Dsm2++UatWrST9dLNyWF
iYMjMz7etLSkq0ZcsWxcbGSpJiY2NVVFTk8BK8devWqaamRr17926AXgAAAFdm6j07EyZMUN++fTV9+nTdf//92rp1q9588029+eabkn56Emr8+PF6+eWX1b59e/uj5xEREbrnnnsk/XQm6I477rBf/qqqqlJSUpKGDRt22U9iAQCAK5+pYadnz5766KOPNHnyZKWkpKhNmzaaM2eOEhIS7G2efvppnTp1SqNHj1ZRUZFuuukmrVq1yuFm4cWLFyspKUm333673N3dNWTIEM2dO9eMLgEAABdj+rex7r77bt19990XXO/m5qaUlBT7l7fPp1mzZk55gSAAALAeXgcMAAAsjbADAAAsjbADAAAsjbADAAAsjbADAAAsjbADAAAsjbADAAAsjbADAAAsjbADAAAsjbADAAAsjbADAAAsjbADAAAsjbADAAAsjbADAAAsjbADAAAsjbADAAAsjbADAAAsjbADAAAsjbADAAAsjbADAAAsjbADAAAsjbADAAAsjbADAAAsjbADAAAsjbADAAAsjbADAAAsjbADAAAsjbADAAAsjbADAAAsjbADAAAsjbADAAAsjbADAAAsjbADAAAsjbADAAAsjbADAAAsjbADAAAsjbADAAAsjbADAAAsjbADAAAsjbADAAAsjbADAAAsjbADAAAszdSwM23aNLm5uTlMnTp1sq8vLy9XYmKimjdvrsaNG2vIkCHKz8932IfNZtPAgQPl5+enFi1aaOLEiTpz5kxDdwUAALgoT7ML+NWvfqW1a9fa5z09/1fShAkT9Mknn+iDDz5QYGCgkpKSdN999+nLL7+UJFVXV2vgwIEKCwvTpk2bdOTIET300EPy8vLS9OnTG7wvAADA9Zgedjw9PRUWFnbO8uLiYqWnp2vJkiXq37+/JCkjI0PR0dHavHmz+vTpo9WrV2vv3r1au3atQkND1bVrV7300kuaNGmSpk2bJm9v7/Mes6KiQhUVFfb5kpKS+ukcAAAwnen37Hz77beKiIhQ27ZtlZCQIJvNJknKzs5WVVWV4uLi7G07deqkqKgoZWVlSZKysrLUuXNnhYaG2tvEx8erpKREe/bsueAxU1NTFRgYaJ8iIyPrqXcAAMBspoad3r17a9GiRVq1apXmz5+vQ4cO6eabb9bJkyeVl5cnb29vBQUFOWwTGhqqvLw8SVJeXp5D0Dm7/uy6C5k8ebKKi4vtU25urnM7BgAAXIapl7HuvPNO+89dunRR79691apVK/3jH/9Qo0aN6u24Pj4+8vHxqbf9AwAA12H6ZayfCwoKUocOHXTgwAGFhYWpsrJSRUVFDm3y8/Pt9/iEhYWd83TW2fnz3QcEAACuPi4VdkpLS3Xw4EGFh4ere/fu8vLyUmZmpn39/v37ZbPZFBsbK0mKjY3Vrl27VFBQYG+zZs0aBQQEKCYmpsHrBwAArsfUy1hPPfWUBg0apFatWunw4cOaOnWqPDw8NHz4cAUGBmrUqFFKTk5Ws2bNFBAQoHHjxik2NlZ9+vSRJA0YMEAxMTEaMWKEZs6cqby8PE2ZMkWJiYlcpgIAAJJMDjv//e9/NXz4cB07dkwhISG66aabtHnzZoWEhEiSZs+eLXd3dw0ZMkQVFRWKj4/XvHnz7Nt7eHhoxYoVGjt2rGJjY+Xv76+RI0cqJSXFrC4BAAAXY2rYWbp06UXX+/r6Ki0tTWlpaRds06pVK61cudLZpQEAAItwqXt2AAAAnI2wAwAALI2wAwAALI2wAwAALI2wAwAALI2wAwAALI2wAwAALI2wAwAALI2wAwAALI2wAwAALI2wAwAALI2wAwAALI2wAwAALI2wAwAALI2wAwAALI2wAwAALI2wAwAALI2wAwAALI2wAwAALI2wAwAALI2wAwAALI2wAwAALI2wAwAALI2wAwAALI2wAwAALI2wAwAALI2wAwAALI2wAwAALI2wAwAALI2wAwAALI2wAwAALI2wAwAALI2wAwAALI2wAwAALI
09p+27776r1q1bKzAwUMOGDdPJkyft61atWqWbbrpJQUFBat68ue6++24dPHjwosfu16+fkpKSNH78eAUHBys+Pl6SNGvWLHXu3Fn+/v6KjIzU448/rtLSUkk/XYZ75JFHVFxcLDc3N7m5uWnatGmSzr2MZbPZNHjwYDVu3FgBAQG6//77lZ+ffwmjCMBZCDsAnO4f//iHOnXqpI4dO+rBBx/UwoUL9UvfHC4rK9Mrr7yid955R19++aWKioo0bNgwhzYHDx7U8uXLtWLFCq1YsUIbN27Uq6++al9/6tQpJScna/v27crMzJS7u7vuvfde1dTUXPTYb7/9try9vfXll19qwYIFkiR3d3fNnTtXe/bs0dtvv61169bp6aefliT17dtXc+bMUUBAgI4cOaIjR47oqaeeOme/NTU1Gjx4sI4fP66NGzdqzZo1+u677zR06NBLGkcATmIAgJP17dvXmDNnjmEYhlFVVWUEBwcb69evt69fv369Ick4ceKEYRiGkZGRYUgyNm/ebG+Tk5NjSDK2bNliGIZhTJ061fDz8zNKSkrsbSZOnGj07t37gnUcPXrUkGTs2rXrgm1uvfVWo1u3br/Ypw8++MBo3ry5fT4jI8MIDAw8p12rVq2M2bNnG4ZhGKtXrzY8PDwMm81mX79nzx5DkrF169ZfPCYA5+DMDgCn2r9/v7Zu3arhw4dLkjw9PTV06FClp6dfdDtPT0/17NnTPt+pUycFBQUpJyfHvqx169Zq0qSJfT48PFwFBQX2+W+//VbDhw9X27ZtFRAQoNatW0v66VLSxXTv3v2cZWvXrtXtt9+uli1bqkmTJhoxYoSOHTumsrKyi+7r53JychQZGanIyEj7spiYmHP6BaB+EXYAOFV6errOnDmjiIgIeXp6ytPTU/Pnz9eHH36o4uLiy9q3l5eXw7ybm5vDJapBgwbp+PHjeuutt7RlyxZt2bJF0i/fdOzv7+8w//333+vuu+9Wly5d9OGHHyo7O1tpaWmXtC8AroewA8Bpzpw5o3feeUd//vOftWPHDvu0c+dORURE6O9///tFt92+fbt9fv/+/SoqKlJ0dPQlHfvYsWPav3+/pkyZottvv13R0dE6ceJEnfqRnZ2tmpoa/fnPf1afPn3UoUOHc54o8/b2VnV19UX3Ex0drdzcXIcbrffu3auioiLFxMTUqTYAtedpdgEArGPFihU6ceKERo0apcDAQId1Q4YMUXp6usaMGXPebb28vDRu3DjNnTtXnp6eSkpKUp8+fdSrV69LOnbTpk3VvHlzvfnmmwoPD5fNZtMzzzxTp35ce+21qqqq0uuvv65BgwY53Lh8VuvWrVVaWqrMzExdf/318vPzO+eR87i4OHXu3FkJCQmaM2eOzpw5o8cff1y33nqrevToUafaANQeZ3YAOE16erri4uLOCTrST2Fn+/bt+vrrr8+7rZ+fnyZNmqQHHnhAN954oxo3bqz333//ko/t7u6upUuXKjs7W9ddd50mTJigP/3pT3Xqx/XXX69Zs2ZpxowZuu6667R48WKlpqY6tOnbt6/GjBmjoUOHKiQkRDNnzjxnP25ubvrnP/+ppk2b6pZbblFcXJzatm1bq34BuHxuhvELz4MCQD1btGiRxo8fX6vPRwDApeLMDgAAsDTCDgAAsDQuYwEAAEvjzA4AALA0wg4AALA0wg4AALA0wg4AALA0wg4AALA0wg4AALA0wg4AALA0wg4AALC0/w8OXsSXmefTjwAAAABJRU5ErkJggg==", + "image/png": "", "text/plain": [ "
" ] @@ -774,6 +1084,7 @@ "# histogram\n", "sns.histplot(news_alpha_ratio, label=\"News\", alpha=0.5, binwidth=0.05)\n", "sns.histplot(alpha_ratio, label=\"Legal\", alpha=0.5, binwidth=0.05)\n", + "sns.histplot(speech_alpha_ratio, label=\"Speech\", alpha=0.5, binwidth=0.05)\n", "\n", "# add labels\n", "plt.xlabel(\"Alpha ratio\")\n", @@ -782,192 +1093,46 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Here we see a couple of things:\n", "- A fair amount of legal documents have an alpha ratio above 0.6.\n", "- Almost no news text have a alpha ratio below 0.6.\n", + "- The alpha ratio for the Speech corpus is suspicously low\n", "\n", - "Let us examine one of the legal with a low alpha-ratio a bit more in-depth:" + "Let us examine one of the speech samples a bit more in-depth:" ] }, { "cell_type": "code", - "execution_count": 19, + "execution_count": 30, "metadata": {}, "outputs": [ { - "data": { - "text/plain": [ - "Oversigt (indholdsfortegnelse)\n", - "\n", - "Den fulde tekst\n", - "\n", - "Bekendtgørelse om\n", - "Fanefjord-Grønsund Vildtreservat\n", - "\n", - "I medfør af § 33 og § 49, stk. 1 og 3, i lov om jagt og\n", - "vildtforvaltning, jf. lovbekendtgørelse nr. 114 af 28. januar 1997,\n", - "fastsættes:\n", - "Formål\n", - "\n", - "§ 1. Bekendtgørelsen har til formål\n", - "at sikre Fanefjord og en del af Grønsund som yngle-, raste- og\n", - "fourageringsområde for vandfugle .\n", - "\n", - "Afgrænsning\n", - "\n", - "End of \"§ 1\"\n", - "\n", - "§ 2. Fanefjord-Grønsund Vildtreservat i\n", - "Storstrøms Amt omfatter, som angivet på kortbilag:\n", - "1)\tLandarealer ved Fanefjord:\n", - "a)\tMatr. nr. 12 b , del af 14 b , del af 20 b ,\n", - "20 e , 23 a , 23 b , 23 c, 55 a , 55 b ,\n", - "55 c og 64 Hårbølle By, Fanefjord. Den del af matr. nr.\n", - "8 f og 23 d Hårbølle By, Fanefjord, som er\n", - "beliggende nord for den øst-vestgående markvej til\n", - "Færgensvænge samt den del af matr. nr. 
12 c\n", - "Hårbølle By, Fanefjord, som er beliggende indenfor en afstand af\n", - "200 m fra Fanefjord. De dele af matr. nr. 1, 4 e , 6 c ,\n", - "19 c , 20 a og 20 af Hårbølle By, Fanefjord,\n", - "som er beliggende indenfor en afstand af 100 m fra Fanefjord.\n", - "b)\tMatr. nr. 12 e , 12 k , 12 o , 13 d ,\n", - "13 f , 13 g og 13 k Kokseby By, Fanefjord, og de dele af\n", - "matr. nr. 13 c og 13 e Kokseby By, Fanefjord, som er beliggende\n", - "syd for diget mellem Færgegården og Vollerup Græsgange. De\n", - "dele af matr. nr. 12 u , 12 i og 12 f Kokseby By, Fanefjord,\n", - "som er beliggende syd for en ret linie fra vejen syd for Kirkegården\n", - "(60 m nord for Lammehavevej) til et punkt i matrikelskellet mellem 12 f\n", - "og 12 k Kokseby By, Fanefjord, beliggende i en afstand af ca. 35 m fra,\n", - "hvor matrikelskellet skærer kystlinien ved Fanefjord.\n", - "c)\tDe dele af matr. nr. 1 c og 1 b Grønsund\n", - "Færgegård, Fanefjord (herunder Malurt-holm), som er beliggende\n", - "syd for diget mellem Færgegården og Vollerup\n", - "Græsgange.\n", - "2)\tFanefjord og Grønsund afgrænset:\n", - "a)\tMod sydøst af en ret linie mellem den nordlige mole ved\n", - "lystbådehavnen ved Hårbøllebro og Skansepynt ved\n", - "Grønsund,\n", - "b)\tmod vest af en ret linie mellem høfden ved Ore Strand og det\n", - "punkt på kysten ved Bogø, hvor dæmningen møder\n", - "kysten ved Gundernæs, og\n", - "c)\tmod nord af Bogødæmningen.\n", - "3)\tDen del af Bogø Letten som er beliggende indenfor en afstand\n", - "af 200 meter fra Bogødæmningen.\n", - "Stk. 2. Mod land afgrænses de i stk. 1, nr. 2 og 3\n", - "nævnte dele af søterritoriet af højeste, daglige\n", - "vandstandslinie.\n", - "Jagt\n", - "\n", - "End of \"§ 2\"\n", - "\n", - "§ 3. Det er forbudt at udøve jagt på\n", - "eller på anden måde at ombringe, indfange eller forjage vandfugle\n", - "på de i § 2, stk. 1, nr. 1, nævnte landarealer.\n", - "End of \"§ 3\"\n", - "\n", - "§ 4. 
Det er forbudt at udøve jagt på\n", - "eller på anden måde at ombringe, indfange eller forjage pattedyr\n", - "og fugle på:\n", - "1)\tDen i § 2, stk. 1, nr. 2, nævnte del af søterritoriet,\n", - "der er beliggende nord og øst for en ret linie mellem positionerne\n", - "54 °\n", - "53,40 N. 12 °\n", - "07,80 E (200 m sydvest for Hårbøllebro), og 54 °\n", - "53,68 N. 12 °\n", - "07,35 E (250 m sydvest for Færgensvænge) og en ret linie derfra\n", - "til position 54 °\n", - "54,31 N. 12 °\n", - "04,63 E (300 m syd for Gundernæs på Bogø), og\n", - "2)\tden i § 2, stk. 1, nr. 3, nævnte del af søterritoriet,\n", - "der er beliggende indenfor en afstand af 200 m fra\n", - "Bogødæmningen.\n", - "Stk. 2. Færdsel med ladt skydevåben er\n", - "forbudt på de i stk. 1 nævnte dele af søterritoriet.\n", - "Stk. 3. De anvendte koordinater er geografiske positioner\n", - "i henhold til projektion WGS-84.\n", - "End of \"§ 4\"\n", - "\n", - "§ 5. Det er forbudt at udøve jagt fra\n", - "motordrevet fartøj på den i § 2, stk. 1, nr. 2, nævnte del\n", - "af søterritoriet, der er beliggende syd for de i § 4 stk. 1, nr. 1,\n", - "beskrevne linier.\n", - "Færdsel\n", - "\n", - "End of \"§ 5\"\n", - "\n", - "§ 6. Sejlads med motordrevet fartøj med\n", - "højere hastighed end 6 knob er forbudt på de i § 4, stk. 1,\n", - "nævnte dele af søterritoriet.\n", - "End of \"§ 6\"\n", - "\n", - "§ 7. Brætsejlads er forbudt på de i §\n", - "4, stk. 1, nævnte dele af søterritoriet fra 1. september til 30.\n", - "april.\n", - "End of \"§ 7\"\n", - "\n", - "§ 8. Færdsel er forbudt fra 1. april til 15.\n", - "juli på Malurtholm og på søterritoriet omkring øen\n", - "indenfor en afstand af 50 meter fra højeste, daglige\n", - "vandstandslinie.\n", - "Stk. 2. Bestemmelsen i stk. 
1 gælder ikke\n", - "for:\n", - "1)\tEjere og brugere samt disses husstand og personale.\n", - "2)\tSejlads på den nævnte del af søterritoriet i\n", - "forbindelse med erhvervsfiskeri.\n", - "Dispensation og tilsyn\n", - "\n", - "End of \"§ 8\"\n", - "\n", - "§ 9. Skov- og Naturstyrelsen kan, når\n", - "særlige forhold taler derfor, dispensere fra bestemmelserne i §§\n", - "3-8.\n", - "Stk. 2. Skov- og Naturstyrelsens afgørelser efter\n", - "stk. 1 kan ikke indbringes for anden administrativ myndighed.\n", - "Stk. 3. Uanset bestemmelserne i § 6 og § 8 kan\n", - "Farvandsvæsenet eller andre (f.eks. havne) udføre arbejder i\n", - "forbindelse med redningsopgaver og den for sejladsen nødvendige\n", - "afmærkning m.v.\n", - "End of \"§ 9\"\n", - "\n", - "§ 10. Skov- og Naturstyrelsen fører tilsyn\n", - "med, at reservatbestemmelserne overholdes.\n", - "Straf og ikrafttrædelse\n", - "\n", - "End of \"§ 10\"\n", - "\n", - "§ 11. Efter § 54, stk. 1, nr. 5 og nr. 7, i lov om\n", - "jagt og vildtforvaltning, jf. lovbekendtgørelse nr. 114 af 28. januar\n", - "1997, straffes overtrædelse af bestemmelserne i §§ 3-8 eller\n", - "tilsidesættelse af vilkår, der er fastsat i en dispensation i\n", - "medfør af § 9 med bøde, medmindre strengere straf er forskyldt\n", - "efter den øvrige lovgivning.\n", - "End of \"§ 11\"\n", - "\n", - "§ 12. Bekendtgørelsen træder i kraft\n", - "den 1. september 1999.\n", - "End of \"§ 12\"\n", - "\n", - "Miljø- og Energiministeriet, den 28. 
juni\n", - "1999\n", - "Svend Auken\n", - "/Jens Peter Simonsen\n", - "End of \"GIVET\"" - ] - }, - "execution_count": 19, - "metadata": {}, - "output_type": "execute_result" + "name": "stdout", + "output_type": "stream", + "text": [ + "Taler 6: mm\n", + "Taler 7: er du klar?\n", + "Taler 6: ja\n", + "Taler 7: så er spørgsmålet om vi skal- om det er sådan her ja det kunne man godt okay\n", + "Taler 7: okay så det er ignore tab kill og kill tab\n", + "Taler 6: NA\n", + "Taler 6: kill\n", + "Taler 6: kill tab\n", + "Taler 7: super\n", + "Taler 7: okay det er det hun lige har sagt\n", + "Taler 6: ja\n", + "Taler 6: ja\n", + "Taler 6: NA\n" + ] } ], "source": [ - "# Findings docs with alpha ratio below 0.6\n", - "low_legal_alpha_ratio = [doc for doc in legal_docs if doc._.quality[\"alpha_ratio\"] < 0.6]\n", - "\n", - "low_legal_alpha_ratio[0]" + "doc = speech_docs[0]\n", + "# examine the first 100 tokens in the first document\n", + "print(doc[:100])" ] }, { @@ -975,7 +1140,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "From this we can see that a high proportion of the tokens in the legal dataset are paragraph signs, paragraph numbers or numbers related to addresses (20 e, 23 a etc.). This might or might not be problematic for the task at hand.\n", + "From this we can see that a high proportion of the tokens in the speech dataset denote the speaker, and tokens such as `:` then lower the alpha ratio.
This might or might not be problematic for the task at hand.\n", "\n", "**Therefore it is important to note that while these filters are useful for filtering large amount of texts it is also important to know that they should probably be adjusted to the target domain.**" ] @@ -988,7 +1153,7 @@ ], "metadata": { "kernelspec": { - "display_name": "Python 3.10.9 ('.venv': venv)", + "display_name": "textdescriptives", "language": "python", "name": "python3" }, @@ -1002,12 +1167,12 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.10.9" + "version": "3.8.15" }, "orig_nbformat": 4, "vscode": { "interpreter": { - "hash": "1fec3abd59d8d4e793464ce299b69082c8b9c618d555ba6df7044c7d7b4183f8" + "hash": "31387647799921bb85032eec7bb02e281325ae7f8ffa6f9cd7cdead815b36c88" } } }, From f7991864ba62a4dc0a36f42f2a13ad3079909a02 Mon Sep 17 00:00:00 2001 From: Kenneth Enevoldsen Date: Tue, 10 Jan 2023 11:48:37 +0100 Subject: [PATCH 06/14] feat: Updated way that that quality thresholds is set --- src/textdescriptives/components/quality.py | 31 +++++++++------------- tests/test_quality.py | 6 ++--- 2 files changed, 15 insertions(+), 22 deletions(-) diff --git a/src/textdescriptives/components/quality.py b/src/textdescriptives/components/quality.py index 33b479a2..c28958d8 100644 --- a/src/textdescriptives/components/quality.py +++ b/src/textdescriptives/components/quality.py @@ -20,9 +20,9 @@ class QualityThresholds(BaseModel): + "at least 2 stop words, but no upper limit.", ) alpha_ratio: Interval = Field( - (0.6, None), - description="A Range for the alpha ratio. Default: (0.6, None), i.e. at " - + r"least 60% of tokens contain at least one alphabetic character, but no " + (0.7, None), + description="A Range for the alpha ratio. Default: (0.7, None), i.e. at " + + r"least 70% of tokens contain at least one alphabetic character, but no " + "upper limit. 
Note this is lowered from the original 0.8 to account for a" + "different definition of word boundaries. E.g. in spaCy a punctuation is" + "not a part of a word.", @@ -635,6 +635,14 @@ def set_extensions(self): if not Doc.has_extension(ext_name) or self.force is True: Doc.set_extension(ext_name, getter=span_getter, force=True) + def set_quality_thresholds(self, thresholds: QualityThresholds) -> None: + """Sets the quality thresholds. + + Args: + thresholds (QualityThresholds): The desired quality thresholds. + """ + self.quality_thresholds = thresholds + def __call__(self, doc: Doc): """Run the pipeline component.""" self.set_quality(doc) @@ -656,7 +664,6 @@ def __call__(self, doc: Doc): "top_ngram_min_count": 3, "duplicate_n_gram_fraction_range": [5, 10], "force": True, - "quality_thresholds": None, }, ) def create_quality_component( @@ -667,7 +674,6 @@ def create_quality_component( top_ngram_range: Tuple[int, int], top_ngram_min_count: int, duplicate_n_gram_fraction_range: Tuple[int, int], - quality_thresholds: Optional[dict] = None, force: bool = True, ) -> Callable[[Doc], Doc]: """Allows Quality to be added to a spaCy pipe using @@ -712,12 +718,6 @@ def create_quality_component( be considered a top n-gram. Defaults to 3. duplicate_n_gram_fraction_range (Tuple[int]): range of n-grams to calculate the proportion of duplicate n-grams. Defaults to [5, 10]. - quality_thresholds (Optional[dict]): A dictionary object containing the - thresholds indicated by either an interval (Tuple) or a boolean. We - recommend using the QualityThresholds class to create this dictionary by - calling QualityThresholds(...).dict(). This ensures that all the thresholds - are validated. Defaults to None in which case the default for - QualityThresholds is used. force (bool): whether to overwrite existing extensions. Defaults to True. 
@@ -735,13 +735,6 @@ def create_quality_component( >>> # check whether the document passed the quality thresholds >>> doc._.passed_quality_check """ - # recons quality_thresholds since it needs to be json serializable for the config - # in the nlp.add_pipe call - if quality_thresholds is not None: - quality_thresholds_ = QualityThresholds(**quality_thresholds) - else: - quality_thresholds_ = None - return Quality( nlp, name=name, @@ -750,6 +743,6 @@ def create_quality_component( top_ngram_range=top_ngram_range, top_ngram_min_count=top_ngram_min_count, duplicate_n_gram_fraction_range=duplicate_n_gram_fraction_range, - quality_thresholds=quality_thresholds_, + quality_thresholds=None, force=force, ) diff --git a/tests/test_quality.py b/tests/test_quality.py index f042240b..f5467c52 100644 --- a/tests/test_quality.py +++ b/tests/test_quality.py @@ -210,14 +210,14 @@ def test_quality_component_with_config(nlp: spacy.Language): contains={"lorem ipsum": False}, ) - nlp.add_pipe( + quality_pipe = nlp.add_pipe( "textdescriptives/quality", config={ "symbols": ["."], - "quality_thresholds": quality_thresholds.dict(), "force": True, }, ) + quality_pipe.set_quality_thresholds(quality_thresholds) doc = nlp("This is a test. This is a test. 
This is a test.") assert doc._.quality["n_stop_words"] == 9 @@ -261,7 +261,7 @@ def test_quality_multi_process(nlp): "A couple of texts here, yeah yeah yeah.", "This is a second text, no repetition what so ever.", ] - + nlp.add_pipe("textdescriptives/quality", config={"force": True}) docs = nlp.pipe(texts, n_process=2) for doc in docs: assert doc._.quality From 580bea199e91cb95e1876a0e4b7ea0ca5439129d Mon Sep 17 00:00:00 2001 From: Kenneth Enevoldsen Date: Tue, 10 Jan 2023 11:52:10 +0100 Subject: [PATCH 07/14] docs: Updated docs with changes to the API --- docs/quality.rst | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/docs/quality.rst b/docs/quality.rst index 77fa0c09..431eed63 100644 --- a/docs/quality.rst +++ b/docs/quality.rst @@ -92,7 +92,8 @@ If you want to specify the thresholds for the quality metrics, you can do so by contains_lorem_ipsum=False ) - nlp.add_pipe("textdescriptives.quality", config={"quality_thresholds": thresholds.dict()}) + quality_pipe = nlp.add_pipe("textdescriptives.quality") + quality_pipe.set_quality_thresholds(thresholds) # update the quality thresholds doc = nlp("The world is changed. I feel it in the water. I feel it in the earth. I smell it in the air. 
Much that once was is lost, for none now live who remember it.") # all attributes are stored as a dict in the ._.quality attribute From c5e9eff8cbe82b31e21d37dab8c1be8c7f16abf2 Mon Sep 17 00:00:00 2001 From: Lasse Date: Wed, 11 Jan 2023 10:23:57 +0100 Subject: [PATCH 08/14] tutorial: minor --- docs/tutorials/filter_corpus_using_quality.ipynb | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/docs/tutorials/filter_corpus_using_quality.ipynb b/docs/tutorials/filter_corpus_using_quality.ipynb index 51582d55..79bf2f83 100644 --- a/docs/tutorials/filter_corpus_using_quality.ipynb +++ b/docs/tutorials/filter_corpus_using_quality.ipynb @@ -78,7 +78,7 @@ "# download the first 1 000\n", "dataset = dataset.take(1000)\n", "\n", - "# extract the text and remove text which are too long\n", + "# extract the text and remove texts which are too long\n", "texts = [sample [\"text\"] for sample in dataset]\n" ] }, @@ -257,7 +257,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Naturally we realize that you might not know what all of these mean, but you can easily check it on [the documentation site](https://hlasse.github.io/TextDescriptives/quality.html). Examining these we see that this text has a high proportion of character which appear in duplicate n-grams `duplicate_10-gram_chr_fraction`. When this fraction is really high it means that the text contains a high proportion of repititions. This is often a sign of low quality text.\n", + "Naturally, you might not know what all of these mean, but you can easily check it on [the documentation site](https://hlasse.github.io/TextDescriptives/quality.html). Examining these we see that this text has a high proportion of characters which appear in duplicate n-grams `duplicate_10-gram_chr_fraction`. When this fraction is really high it means that the text contains a high proportion of repetitions.
This is often a sign of low quality text.\n", "\n", "If we examine the quality thresholds of the pipeline we can see that the max allowed value for `duplicate_10-gram_chr_fraction` is 0.1:" ] @@ -292,7 +292,7 @@ "metadata": {}, "source": [ "### Extracting high quality texts\n", - "Naturally we are typically interested in text which are not of low quality. Thus we can extract these by filtering out the text which did not pass the quality check." + "We are typically interested in texts that are not of low quality. Thus we can extract these by filtering out the texts that did not pass the quality check." ] }, { @@ -327,7 +327,7 @@ "metadata": {}, "source": [ "### Changing the filters\n", - "Naturally, in some cases you might want to apply other filters. For instance the current filter sets a `symbol_to_word_ratio` threshold of 0.1 for hashtags `#`. This means that if a text contains a lot of hashtags will be filtered out. However if you are working on e.g. tweets this is an unreasonable filter and you might want to adjust that. You can do this by overwriting the quality_thresholds:" + "In some cases you might want to apply other filters. For instance the current filter sets a `symbol_to_word_ratio` threshold of 0.1 for hashtags `#`. This means that if a text contains a lot of hashtags it will be filtered out. However if you are working on e.g. tweets this is an unreasonable filter and you might want to adjust that. You can do this by overwriting the quality_thresholds:" ] }, { @@ -391,7 +391,7 @@ "source": [ "## Comparing Domains\n", "\n", - "These quality metrics are heuristic based an thus, while they are reasonable for one domain, might not be reasonable for another. We will explore this a bit further in this section. These filters are specifically tuned for the web domain and this can lead to problems in applied directly to other domains.\n", + "These quality metrics are heuristic based and need to be tuned.
While the defaults are reasonable for some domains, they may not be for others. We will explore this a bit further in this section. These filters are specifically tuned for the web domain and this can lead to problems when applied directly to other domains.\n", "\n" ] }, @@ -1153,7 +1153,7 @@ ], "metadata": { "kernelspec": { - "display_name": "textdescriptives", + "display_name": "Python 3.10.9 ('.venv': venv)", "language": "python", "name": "python3" }, @@ -1167,12 +1167,12 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.8.15" + "version": "3.10.9" }, "orig_nbformat": 4, "vscode": { "interpreter": { - "hash": "31387647799921bb85032eec7bb02e281325ae7f8ffa6f9cd7cdead815b36c88" + "hash": "1fec3abd59d8d4e793464ce299b69082c8b9c618d555ba6df7044c7d7b4183f8" } } }, From 18392c589a1d81c2e97220267b72ceb507444e36 Mon Sep 17 00:00:00 2001 From: Kenneth Enevoldsen Date: Fri, 13 Jan 2023 09:28:19 +0100 Subject: [PATCH 09/14] Fixed number of cores when filtering --- docs/tutorials/filter_corpus_using_quality.ipynb | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/tutorials/filter_corpus_using_quality.ipynb b/docs/tutorials/filter_corpus_using_quality.ipynb index 51582d55..dd09a271 100644 --- a/docs/tutorials/filter_corpus_using_quality.ipynb +++ b/docs/tutorials/filter_corpus_using_quality.ipynb @@ -678,9 +678,9 @@ ], "source": [ "# we can filter out these three datasets based on the \"source\"\n", - "legal = dataset.filter(lambda x: x[\"source\"] == \"retsinformationdk\")\n", - "news = dataset.filter(lambda x: x[\"source\"] == \"tv2r\")\n", - "speech = dataset.filter(lambda x: x[\"source\"] == \"spont\")" + "legal = dataset.filter(lambda x: x[\"source\"] == \"retsinformationdk\", num_proc=1)\n", + "news = dataset.filter(lambda x: x[\"source\"] == \"tv2r\", num_proc=1)\n", + "speech = dataset.filter(lambda x: x[\"source\"] == \"spont\", num_proc=1)" ] }, { @@ -1167,7 +1167,7 @@ "name": "python", 
"nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.8.15" + "version": "3.8.15 (default, Oct 11 2022, 21:31:25) \n[Clang 14.0.0 (clang-1400.0.29.102)]" }, "orig_nbformat": 4, "vscode": { From c0fb63c671ed1e4ccbe75afa4fb3301104a1ad0e Mon Sep 17 00:00:00 2001 From: Kenneth Enevoldsen Date: Fri, 13 Jan 2023 16:35:22 +0100 Subject: [PATCH 10/14] feat: Added QualityOutput --- docs/quality.rst | 36 +- .../filter_corpus_using_quality.ipynb | 139 +++---- src/textdescriptives/components/quality.py | 340 ++++++------------ .../components/quality_data_classes.py | 259 +++++++++++++ src/textdescriptives/extractors.py | 2 +- tests/test_quality.py | 39 +- 6 files changed, 483 insertions(+), 332 deletions(-) create mode 100644 src/textdescriptives/components/quality_data_classes.py diff --git a/docs/quality.rst b/docs/quality.rst index 431eed63..fa2619ca 100644 --- a/docs/quality.rst +++ b/docs/quality.rst @@ -71,27 +71,28 @@ If you want to specify the thresholds for the quality metrics, you can do so by # set thresholds for quality metrics (these are just the default) thresholds = QualityThresholds( - n_stop_words=(2, None), - alpha_ratio=(0.8, None), - mean_word_length=(3, 10), - doc_length= (10, 100_000), - symbol_hashtag_to_word_ratio=(None, 0.1), + n_stop_words=(2, None), # at least 2 stop words, no upper bound + alpha_ratio=(0.7, None), + mean_word_length=(3, 10), # mean word length between 3 and 10 characters + doc_length=(10, 100000), + symbol_to_word_ratio={"#": (None, 0.1)}, proportion_ellipsis=(None, 0.3), proportion_bullet_points=(None, 0.8), + contains={"lorem ipsum": False}, duplicate_line_chr_fraction=(None, 0.2), duplicate_paragraph_chr_fraction=(None, 0.2), - duplicate_5gram_chr_fraction=(None, 0.15), - duplicate_6gram_chr_fraction=(None, 0.14), - duplicate_7gram_chr_fraction=(None, 0.13), - duplicate_8gram_chr_fraction=(None, 0.12), - duplicate_9gram_chr_fraction=(None, 0.11), - duplicate_10gram_chr_fraction=(None, 0.1), - 
top_2gram_chr_fraction=(None, 0.20), - top_3gram_chr_fraction=(None, 0.18), - top_4gram_chr_fraction=(None, 0.16), - contains_lorem_ipsum=False + duplicate_ngram_chr_fraction={ + "5": (None, 0.15), + "6": (None, 0.14), + "7": (None, 0.13), + "8": (None, 0.12), + "9": (None, 0.11), + "10": (None, 0.1), + }, + top_ngram_chr_fraction={"2": (None, 0.2), "3": (None, 0.18), "4": (None, 0.16)}, ) + quality_pipe = nlp.add_pipe("textdescriptives.quality") quality_pipe.set_quality_thresholds(thresholds) # update the quality thresholds doc = nlp("The world is changed. I feel it in the water. I feel it in the earth. I smell it in the air. Much that once was is lost, for none now live who remember it.") @@ -113,5 +114,6 @@ Component Data Classes ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -.. autopydantic_model:: textdescriptives.components.quality.QualityThresholds - +.. autopydantic_model:: textdescriptives.components.quality_data_classes.QualityThresholds +.. autopydantic_model:: textdescriptives.components.quality_data_classes.QualityOutput +.. 
autopydantic_model:: textdescriptives.components.quality_data_classes.ThresholdsOutput diff --git a/docs/tutorials/filter_corpus_using_quality.ipynb b/docs/tutorials/filter_corpus_using_quality.ipynb index d1f7490c..6c182c50 100644 --- a/docs/tutorials/filter_corpus_using_quality.ipynb +++ b/docs/tutorials/filter_corpus_using_quality.ipynb @@ -206,6 +206,11 @@ "doc._.passed_quality_check" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [] + }, { "attachments": {}, "cell_type": "markdown", @@ -222,25 +227,7 @@ { "data": { "text/plain": [ - "{'n_stop_words': 435,\n", - " 'alpha_ratio': 0.7919463087248322,\n", - " 'mean_word_length': 3.523489932885906,\n", - " 'doc_length': 894,\n", - " 'proportion_ellipsis': 0.0,\n", - " 'proportion_bullet_points': 0.0,\n", - " 'duplicate_line_chr_fraction': 0.0,\n", - " 'duplicate_paragraph_chr_fraction': 0.0,\n", - " 'duplicate_5-gram_chr_fraction': 0.42479253112033194,\n", - " 'duplicate_6-gram_chr_fraction': 0.41649377593361,\n", - " 'duplicate_7-gram_chr_fraction': 0.3757780082987552,\n", - " 'duplicate_8-gram_chr_fraction': 0.36410788381742737,\n", - " 'duplicate_9-gram_chr_fraction': 0.36410788381742737,\n", - " 'duplicate_10-gram_chr_fraction': 0.3571058091286307,\n", - " 'top_2-gram_chr_fraction': 0.008817427385892116,\n", - " 'top_3-gram_chr_fraction': 0.011670124481327801,\n", - " 'top_4-gram_chr_fraction': 0.014004149377593362,\n", - " 'symbol_#_to_word_ratio': 0.0,\n", - " 'contains_lorem ipsum': False}" + "QualityOutput(passed=False, n_stop_words=ThresholdsOutput(value=435.0, passed=True, threshold=(2.0, None)), alpha_ratio=ThresholdsOutput(value=0.79, passed=True, threshold=(0.7, None)), mean_word_length=ThresholdsOutput(value=3.52, passed=True, threshold=(3.0, 10.0)), doc_length=ThresholdsOutput(value=894.0, passed=True, threshold=(10.0, 100000.0)), symbol_to_word_ratio={'#': ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.1))}, proportion_ellipsis=ThresholdsOutput(value=0.0, passed=True, 
threshold=(None, 0.3)), proportion_bullet_points=ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.8)), contains={'lorem ipsum': ThresholdsOutput(value=0.0, passed=True, threshold=False)}, duplicate_line_chr_fraction=ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.2)), duplicate_paragraph_chr_fraction=ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.2)), duplicate_ngram_chr_fraction={'5': ThresholdsOutput(value=0.42, passed=False, threshold=(None, 0.15)), '6': ThresholdsOutput(value=0.42, passed=False, threshold=(None, 0.14)), '7': ThresholdsOutput(value=0.38, passed=False, threshold=(None, 0.13)), '8': ThresholdsOutput(value=0.36, passed=False, threshold=(None, 0.12)), '9': ThresholdsOutput(value=0.36, passed=False, threshold=(None, 0.11)), '10': ThresholdsOutput(value=0.36, passed=False, threshold=(None, 0.1))}, top_ngram_chr_fraction={'2': ThresholdsOutput(value=0.01, passed=True, threshold=(None, 0.2)), '3': ThresholdsOutput(value=0.01, passed=True, threshold=(None, 0.18)), '4': ThresholdsOutput(value=0.01, passed=True, threshold=(None, 0.16))})" ] }, "execution_count": 8, @@ -313,7 +300,7 @@ "name": "stdout", "output_type": "stream", "text": [ - "A total of 1000 texts were processed and 576 passed the quality check.\n" + "A total of 1000 texts were processed and 572 passed the quality check.\n" ] } ], @@ -371,7 +358,7 @@ { "data": { "text/plain": [ - "True" + "False" ] }, "execution_count": 13, @@ -429,7 +416,7 @@ { "data": { "application/vnd.jupyter.widget-view+json": { - "model_id": "ff5e5f41e3414f82aa9694629ce34413", + "model_id": "2c78220f7f1e4c119901389899b11a7b", "version_major": 2, "version_minor": 0 }, @@ -667,13 +654,46 @@ "metadata": {}, "outputs": [ { - "name": "stderr", - "output_type": "stream", - "text": [ - "Loading cached processed dataset at 
/Users/au561649/.cache/huggingface/datasets/DDSC___parquet/DDSC--partial-danish-gigaword-small-test-sample-6518b630de09688d/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-beca55bc168c3e3d.arrow\n", - "Loading cached processed dataset at /Users/au561649/.cache/huggingface/datasets/DDSC___parquet/DDSC--partial-danish-gigaword-small-test-sample-6518b630de09688d/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-be9e6b466f0d4ee9.arrow\n", - "Loading cached processed dataset at /Users/au561649/.cache/huggingface/datasets/DDSC___parquet/DDSC--partial-danish-gigaword-small-test-sample-6518b630de09688d/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-56a5eac62a6adddf.arrow\n" - ] + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "cf05115807f14affa8d479778d1c466a", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + " 0%| | 0/3 [00:00" + "" ] }, "execution_count": 19, @@ -852,25 +872,7 @@ { "data": { "text/plain": [ - "{'n_stop_words': 192,\n", - " 'alpha_ratio': 0.804,\n", - " 'mean_word_length': 4.546,\n", - " 'doc_length': 500,\n", - " 'proportion_ellipsis': 0.0,\n", - " 'proportion_bullet_points': 0.0,\n", - " 'duplicate_line_chr_fraction': 0.25737766156144937,\n", - " 'duplicate_paragraph_chr_fraction': 0.0,\n", - " 'duplicate_5-gram_chr_fraction': 0.5401568920433321,\n", - " 'duplicate_6-gram_chr_fraction': 0.519237952932387,\n", - " 'duplicate_7-gram_chr_fraction': 0.519237952932387,\n", - " 'duplicate_8-gram_chr_fraction': 0.519237952932387,\n", - " 'duplicate_9-gram_chr_fraction': 0.519237952932387,\n", - " 'duplicate_10-gram_chr_fraction': 0.519237952932387,\n", - " 'top_2-gram_chr_fraction': 0.017930519237952934,\n", - " 'top_3-gram_chr_fraction': 0.042958535674262235,\n", - " 'top_4-gram_chr_fraction': 0.0653716847217034,\n", - " 'symbol_#_to_word_ratio': 0.0,\n", - " 'contains_lorem ipsum': False}" + "QualityOutput(passed=False, 
n_stop_words=ThresholdsOutput(value=192.0, passed=True, threshold=(2.0, None)), alpha_ratio=ThresholdsOutput(value=0.8, passed=True, threshold=(0.7, None)), mean_word_length=ThresholdsOutput(value=4.55, passed=True, threshold=(3.0, 10.0)), doc_length=ThresholdsOutput(value=500.0, passed=True, threshold=(10.0, 100000.0)), symbol_to_word_ratio={'#': ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.1))}, proportion_ellipsis=ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.3)), proportion_bullet_points=ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.8)), contains={'lorem ipsum': ThresholdsOutput(value=0.0, passed=True, threshold=False)}, duplicate_line_chr_fraction=ThresholdsOutput(value=0.26, passed=False, threshold=(None, 0.2)), duplicate_paragraph_chr_fraction=ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.2)), duplicate_ngram_chr_fraction={'5': ThresholdsOutput(value=0.54, passed=False, threshold=(None, 0.15)), '6': ThresholdsOutput(value=0.52, passed=False, threshold=(None, 0.14)), '7': ThresholdsOutput(value=0.52, passed=False, threshold=(None, 0.13)), '8': ThresholdsOutput(value=0.52, passed=False, threshold=(None, 0.12)), '9': ThresholdsOutput(value=0.52, passed=False, threshold=(None, 0.11)), '10': ThresholdsOutput(value=0.52, passed=False, threshold=(None, 0.1))}, top_ngram_chr_fraction={'2': ThresholdsOutput(value=0.02, passed=True, threshold=(None, 0.2)), '3': ThresholdsOutput(value=0.04, passed=True, threshold=(None, 0.18)), '4': ThresholdsOutput(value=0.07, passed=True, threshold=(None, 0.16))})" ] }, "execution_count": 22, @@ -936,7 +938,7 @@ }, { "cell_type": "code", - "execution_count": 25, + "execution_count": 29, "metadata": {}, "outputs": [ { @@ -945,7 +947,7 @@ "" ] }, - "execution_count": 25, + "execution_count": 29, "metadata": {}, "output_type": "execute_result" }, @@ -963,7 +965,12 @@ "source": [ "import seaborn as sns\n", "\n", - "duplicate_10_gram_fraction = 
[doc._.quality[\"duplicate_10-gram_chr_fraction\"] for doc in legal_docs]\n", + "def get_duplicate_10_gram_fraction(doc):\n", + " quality = doc._.quality\n", + " duplicate_10_gram_fraction = quality.duplicate_ngram_chr_fraction[\"10\"]\n", + " return duplicate_10_gram_fraction.value\n", + "\n", + "duplicate_10_gram_fraction = [get_duplicate_10_gram_fraction(doc) for doc in legal_docs]\n", "sns.histplot(duplicate_10_gram_fraction)" ] }, @@ -977,7 +984,7 @@ }, { "cell_type": "code", - "execution_count": 26, + "execution_count": 31, "metadata": {}, "outputs": [ { @@ -986,7 +993,7 @@ "" ] }, - "execution_count": 26, + "execution_count": 31, "metadata": {}, "output_type": "execute_result" }, @@ -1002,7 +1009,7 @@ } ], "source": [ - "alpha_ratio = [doc._.quality[\"alpha_ratio\"] for doc in legal_docs]\n", + "alpha_ratio = [doc._.quality.alpha_ratio.value for doc in legal_docs]\n", "sns.histplot(alpha_ratio)" ] }, @@ -1023,7 +1030,7 @@ }, { "cell_type": "code", - "execution_count": 27, + "execution_count": 32, "metadata": {}, "outputs": [], "source": [ @@ -1036,13 +1043,13 @@ }, { "cell_type": "code", - "execution_count": 28, + "execution_count": 36, "metadata": {}, "outputs": [], "source": [ "# extract alpha ratio:\n", - "news_alpha_ratio = [doc._.quality[\"alpha_ratio\"] for doc in news_docs]\n", - "speech_alpha_ratio = [doc._.quality[\"alpha_ratio\"] for doc in speech_docs]" + "news_alpha_ratio = [doc._.quality.alpha_ratio.value for doc in news_docs]\n", + "speech_alpha_ratio = [doc._.quality.alpha_ratio.value for doc in speech_docs]" ] }, { @@ -1055,16 +1062,16 @@ }, { "cell_type": "code", - "execution_count": 29, + "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "" + "" ] }, - "execution_count": 29, + "execution_count": 37, "metadata": {}, "output_type": "execute_result" }, @@ -1106,7 +1113,7 @@ }, { "cell_type": "code", - "execution_count": 30, + "execution_count": 38, "metadata": {}, "outputs": [ { @@ -1153,7 +1160,7 @@ ], 
"metadata": { "kernelspec": { - "display_name": "Python 3.10.9 ('.venv': venv)", + "display_name": "textdescriptives", "language": "python", "name": "python3" }, @@ -1167,12 +1174,12 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.10.9" + "version": "3.8.15" }, "orig_nbformat": 4, "vscode": { "interpreter": { - "hash": "1fec3abd59d8d4e793464ce299b69082c8b9c618d555ba6df7044c7d7b4183f8" + "hash": "31387647799921bb85032eec7bb02e281325ae7f8ffa6f9cd7cdead815b36c88" } } }, diff --git a/src/textdescriptives/components/quality.py b/src/textdescriptives/components/quality.py index c28958d8..e4860d9d 100644 --- a/src/textdescriptives/components/quality.py +++ b/src/textdescriptives/components/quality.py @@ -1,109 +1,12 @@ """Component for calculating quality metrics.""" from collections import Counter, defaultdict -from functools import partial from typing import Callable, Dict, List, Optional, Tuple, Union import numpy as np -from pydantic import BaseModel, Field from spacy.language import Language from spacy.tokens import Doc, Span -Interval = Tuple[Optional[float], Optional[float]] - - -class QualityThresholds(BaseModel): - """Thresholds for quality metrics.""" - - n_stop_words: Interval = Field( - (2, None), - description="A Range for the number of stop words. Default: (2, None), i.e. " - + "at least 2 stop words, but no upper limit.", - ) - alpha_ratio: Interval = Field( - (0.7, None), - description="A Range for the alpha ratio. Default: (0.7, None), i.e. at " - + r"least 70% of tokens contain at least one alphabetic character, but no " - + "upper limit. Note this is lowered from the original 0.8 to account for a" - + "different definition of word boundaries. E.g. in spaCy a punctuation is" - + "not a part of a word.", - ) - mean_word_length: Interval = Field( - (3, 10), - description="A Range for the mean word length. Default: (3, 10), i.e. 
between" - + " 3 and 10 characters.", - ) - doc_length: Interval = Field( - (10, 100_000), - description="A Range for the document length. Default: (10, 100_000), i.e." - + " between 10 and 100_000 characters.", - ) - symbol_to_word_ratio: Dict[str, Interval] = Field( - {"#": (None, 0.1)}, - description="A dict of symbols and the allowed range for the " - + r"symbol-to-word-ratio. The symbol-to-word-ratio is the ratio between symbol" - + "occurrence and word occurrence. Defaults to {'#': (None, 0.1)} i.e. no lower" - + r" limit, but there must at most be a ratio of 0.1 between the number of of " - + "words and hashtags. i.e. if we have 100 words the symbol should appear no " - + "more than 10 times. Values not in the dict are not checked.", - ) - proportion_ellipsis: Interval = Field( - (None, 0.3), - description="A Range for the proportion of lines which end with ellipsis. " - + "Default: (None, 0.3), " - + r"i.e. no lower limit, but at most 30% of lines end with an ellipsis.", - ) - proportion_bullet_points: Interval = Field( - (None, 0.8), - description="A Range for the proportion lines which start with a bullet " - + r"points. Default: (None, 0.8), i.e. no lower limit, but at most 80% of lines" - + " start with a bullet point.", - ) - contains: Dict[str, bool] = Field( - {"lorem ipsum": False}, - description="A dictionary of strings and whether they should be contained in " - + "the document. Default: {'lorem ipsum': False}, i.e. the document should not" - + " contain the string 'lorem ipsum'.", - ) - duplicate_line_chr_fraction: Interval = Field( - (None, 0.2), - description="A Range for the duplicate line character fraction. Default: " - + r"(None, 0.2), i.e. no lower limit, but at most 20% of characters are" - + " duplicates.", - ) - duplicate_paragraph_chr_fraction: Interval = Field( - (None, 0.2), - description="A Range for the duplicate paragraph character fraction. Default:" - + r" (None, 0.2), i.e. 
no lower limit, but at most 20% of characters are " - + "duplicates.", - ) - duplicate_ngram_chr_fraction: Dict[str, Interval] = Field( - { - "5": (None, 0.15), - "6": (None, 0.14), - "7": (None, 0.13), - "8": (None, 0.12), - "9": (None, 0.11), - "10": (None, 0.1), - }, - description="A dictionary of n-gram lengths and the allowed range for the " - + "duplicate n-gram character fraction. Default: {5: (None, 0.15), 6: (None, " - + "0.14), 7: (None, 0.13), 8: (None, 0.12), 9: (None, 0.11), 10: (None, 0.1)}, " - + r"i.e. no lower limit, but at most 15% of characters are duplicates for " - + r"5-grams, 14% for 6-grams, 13% for 7-grams, 12% for 8-grams, 11% for 9-grams" - + r" and 10% for 10-grams.", - ) - top_ngram_chr_fraction: Dict[str, Interval] = Field( - { - "2": (None, 0.2), - "3": (None, 0.18), - "4": (None, 0.16), - }, - description="A dictionary of n-gram lengths and the allowed range for the " - + "top n-gram character fraction. Default: {2: (None, 0.2), 3: (None, 0.18)" - + r", 4: (None, 0.16)}, i.e. no lower limit, but at most 20% of characters " - + r"are contained within a duplicate for 2-grams, 18% for 3-grams and 16% " - + "for 4-grams.", - ) +from .quality_data_classes import QualityOutput, QualityThresholds, ThresholdsOutput def n_stop_words(span: Union[Doc, Span]) -> int: @@ -463,7 +366,28 @@ def __init__( # pylint: disable=dangerous-default-value quality_thresholds = QualityThresholds() self.quality_thresholds = quality_thresholds - self.getters = { + self.set_extensions() + + def quality_setter( + self, + span: Union[Span, Doc], + ) -> QualityOutput: + """Apply quality functions to doc. 
+ + Args: + span (Union[Span, Doc]): spaCy span or doc object + + Returns: + QualityOutput: The quality metrics + """ + threshold = self.quality_thresholds + + thresholds_outputs: Dict[ + str, + Union[Dict[str, ThresholdsOutput], ThresholdsOutput], + ] = {} + # filter with only one threshold + getters = { # heuristic quality filters "n_stop_words": n_stop_words, "alpha_ratio": alpha_ratio, @@ -474,58 +398,75 @@ def __init__( # pylint: disable=dangerous-default-value # text repetition "duplicate_line_chr_fraction": duplicate_line_chr_fraction, "duplicate_paragraph_chr_fraction": duplicate_paragraph_chr_fraction, - "duplicate_ngram_chr_fraction": partial( - duplicate_ngram_fraction, - ngram_range=duplicate_n_gram_fraction_range, - ), - "top_ngram_chr_fraction": partial( - top_ngram_chr_fraction, - ngram_range=top_ngram_range, - min_count=top_ngram_min_count, - ), } - # add symbol to word ratio - for symbol in symbols: - self.getters[f"symbol_{symbol}_to_word_ratio"] = partial( - symbol_to_word_ratio, - symbol=symbol, + + for name, getter in getters.items(): + thresholds_outputs[name] = ThresholdsOutput( + value=getter(span), # type: ignore + threshold=getattr(threshold, name), ) - # add contains - for string in contains: - self.getters[f"contains_{string}"] = partial(contains_string, string=string) - self.extensions = { - "passed_quality_check": self.passed_quality_thresholds, - "quality": self.quality_getter, + thresholds_outputs["contains"] = { + string: ThresholdsOutput( + value=contains_string(span, string), + threshold=threshold.contains.get(string, None), + ) + for string in self.contains + } + thresholds_outputs["symbol_to_word_ratio"] = { + symbol: ThresholdsOutput( + value=symbol_to_word_ratio(span, symbol), + threshold=threshold.symbol_to_word_ratio.get(symbol, None), + ) + for symbol in self.symbols } - self.set_extensions() + chr_frac = top_ngram_chr_fraction( + span, + ngram_range=self.top_ngram_range, + min_count=self.top_ngram_min_count, + ) - def 
quality_getter( - self, - span: Union[Span, Doc], - ) -> Dict[str, Union[float, int, bool]]: - """Apply quality functions to doc. + thresholds_outputs["top_ngram_chr_fraction"] = { + str(n_gram): ThresholdsOutput( + value=frac, + threshold=threshold.top_ngram_chr_fraction.get( + str(n_gram), + (None, None), + ), + ) + for n_gram, frac in chr_frac.items() + } + + duplicate_ngram_chr_fraction = duplicate_ngram_fraction( + span, + ngram_range=self.duplicate_n_gram_fraction_range, + ) + thresholds_outputs["duplicate_ngram_chr_fraction"] = { + str(n_gram): ThresholdsOutput( + value=frac, + threshold=threshold.duplicate_ngram_chr_fraction.get( + str(n_gram), + (None, None), + ), + ) + for n_gram, frac in duplicate_ngram_chr_fraction.items() + } + + return QualityOutput(**thresholds_outputs) + + def quality_getter(self, span: Union[Span, Doc]) -> QualityOutput: + """Get quality metrics from doc. Args: span (Union[Span, Doc]): spaCy span or doc object Returns: - Dict[str, Union[float, int, bool]]: dictionary of quality metrics + QualityOutput: The quality metrics """ - quality = {} - for name, getter in self.getters.items(): - if name == "top_ngram_chr_fraction": - chr_frac = getter(span) # type: ignore - for n_gram, frac in chr_frac.items(): - quality[f"top_{n_gram}-gram_chr_fraction"] = frac - elif name == "duplicate_ngram_chr_fraction": - chr_frac = getter(span) # type: ignore - for n_gram, frac in chr_frac.items(): - quality[f"duplicate_{n_gram}-gram_chr_fraction"] = frac - else: - quality[name] = getter(span) # type: ignore - return quality + if not hasattr(span._, "_quality"): + return self.quality_setter(span) + return QualityOutput(**span._._quality) def set_quality(self, doc: Doc) -> None: """Set the quality attribute on a doc. 
@@ -533,107 +474,48 @@ def set_quality(self, doc: Doc) -> None: Args: doc (Doc): spaCy doc object """ - doc._.quality = self.quality_getter(doc) + # to allow the variable to json serializable we convert it to json + # it is then converted back into a quality output object in the getter + + doc._._quality = self.quality_setter(doc).dict() doc._.passed_quality_check = self.passed_quality_thresholds(doc) - @staticmethod - def is_within_range(rangetuple: Interval, value: float) -> bool: - """Check if a value is within a range tuple. If one of the values in - the range tuple is None it is considered to be unbounded. + def passed_quality_thresholds(self, span: Union[Span, Doc]) -> bool: + """Check if a span passes the quality thresholds. Args: - rangetuple (Interval): range tuple - value (float): value to check + span (Union[Span, Doc]): spaCy span or doc object Returns: - bool: True if value is within range + bool: True if span passes quality thresholds """ - return (rangetuple[0] is None or rangetuple[0] <= value) and ( - rangetuple[1] is None or value <= rangetuple[1] - ) - - def passed_quality_thresholds(self, span: Span) -> bool: - """Checks whether a span passed the quality thresholds.""" - quality = span._.quality - qt = self.quality_thresholds - - # heuristic quality filters - if not self.is_within_range(qt.n_stop_words, quality["n_stop_words"]): - return False - if not self.is_within_range(qt.alpha_ratio, quality["alpha_ratio"]): - return False - if not self.is_within_range(qt.mean_word_length, quality["mean_word_length"]): - return False - if not self.is_within_range(qt.doc_length, quality["doc_length"]): - return False - if not self.is_within_range( - qt.proportion_ellipsis, - quality["proportion_ellipsis"], - ): - return False - if not self.is_within_range( - qt.proportion_bullet_points, - quality["proportion_bullet_points"], - ): - return False - - for symbol in self.symbols: - if symbol in qt.symbol_to_word_ratio: - if not self.is_within_range( - 
qt.symbol_to_word_ratio[symbol], - quality[f"symbol_{symbol}_to_word_ratio"], - ): - return False - - for string in self.contains: - if string in qt.contains and ( - qt.contains[string] is not quality[f"contains_{string}"] - ): - return False - - # text repetition - if not self.is_within_range( - qt.duplicate_line_chr_fraction, - quality["duplicate_line_chr_fraction"], - ): - return False - if not self.is_within_range( - qt.duplicate_paragraph_chr_fraction, - quality["duplicate_paragraph_chr_fraction"], - ): - return False - - for ngram in qt.duplicate_ngram_chr_fraction: - key = f"duplicate_{ngram}-gram_chr_fraction" - if key in quality: - if not self.is_within_range( - qt.duplicate_ngram_chr_fraction[ngram], - quality[key], - ): - return False - - for n_gram in qt.top_ngram_chr_fraction: - if n_gram in quality: - if not self.is_within_range( - qt.top_ngram_chr_fraction[n_gram], - quality[n_gram], - ): - return False - - return True + quality_output = self.quality_getter(span) + return quality_output.passed def set_extensions(self): """Set required extensions.""" - for ext_name, span_getter in self.extensions.items(): - if not Span.has_extension(ext_name) or self.force is True: - Span.set_extension(ext_name, getter=span_getter, force=True) - if ext_name == "quality": - if not Doc.has_extension(ext_name) or self.force is True: - Doc.set_extension(ext_name, default=None, force=True) - else: - if not Doc.has_extension(ext_name) or self.force is True: - Doc.set_extension(ext_name, getter=span_getter, force=True) + ext_name = "passed_quality_check" + if not Span.has_extension(ext_name) or self.force is True: + Span.set_extension( + ext_name, + getter=self.passed_quality_thresholds, + force=True, + ) + if not Doc.has_extension(ext_name) or self.force is True: + Doc.set_extension( + ext_name, + getter=self.passed_quality_thresholds, + force=True, + ) + + ext_name = "quality" + if not Doc.has_extension(ext_name) or self.force is True: + Doc.set_extension(ext_name, 
getter=self.quality_getter, force=True) + Doc.set_extension("_" + ext_name, default=None, force=True) + if not Span.has_extension(ext_name) or self.force is True: + Span.set_extension(ext_name, getter=self.quality_getter, force=True) + Span.set_extension("_" + ext_name, default=None, force=True) def set_quality_thresholds(self, thresholds: QualityThresholds) -> None: """Sets the quality thresholds. diff --git a/src/textdescriptives/components/quality_data_classes.py b/src/textdescriptives/components/quality_data_classes.py new file mode 100644 index 00000000..df72ab47 --- /dev/null +++ b/src/textdescriptives/components/quality_data_classes.py @@ -0,0 +1,259 @@ +"""Data classes used for the quality component.""" +from typing import Any, Dict, Optional, Tuple, Union + +from pydantic import BaseModel, Extra, Field + +Interval = Tuple[Optional[float], Optional[float]] + + +class ThresholdsOutput(BaseModel): + """An output which contains three items. 1) a threshold which is either + an interval or an accepted boolean value. 2) a value which is the value of + the metric. 3) a boolean which is True if the value is within the + thresholds. 
+ + Example: + >>> t_out = ThresholdsOutput(threshold=(0, 2), value=2) + >>> t_out + ThresholdsOutput(value=2.0, passed=True, threshold=(0.0, 2.0)) + >>> t_out.passed + True + """ + + class Config: + extra = Extra.forbid + + threshold: Union[Interval, bool, None] + value: float + + @property + def passed(self) -> bool: + """Return True if the value is within the thresholds.""" + if self.threshold is None: + return True + if isinstance(self.threshold, bool): + return self.threshold == self.value + lower, upper = self.threshold + return (lower is None or lower <= self.value) and ( + upper is None or self.value <= upper + ) + + def __repr_str__(self, join_str: str) -> str: + value = round(self.value, 2) if isinstance(self.value, float) else self.value + return join_str.join( + repr(v) if a is None else f"{a}={v!r}" + for a, v in [ + ("value", value), + ("passed", self.passed), + ("threshold", self.threshold), + ] + ) + + def __eq__(self, other: Any) -> bool: + if isinstance(other, ThresholdsOutput): + return self.value == other.value and self.threshold == other.threshold + return self.value == other + + +class QualityThresholds(BaseModel): + """Thresholds for quality metrics.""" + + class Config: + extra = Extra.forbid + + n_stop_words: Interval = Field( + (2, None), + description="A Range for the number of stop words. Default: (2, None), i.e. " + + "at least 2 stop words, but no upper limit.", + ) + alpha_ratio: Interval = Field( + (0.7, None), + description="A Range for the alpha ratio. Default: (0.7, None), i.e. at " + + r"least 70% of tokens contain at least one alphabetic character, but no " + + "upper limit. Note this is lowered from the original 0.8 to account for a " + + "different definition of word boundaries. E.g. in spaCy punctuation is " + + "not a part of a word.", + ) + mean_word_length: Interval = Field( + (3, 10), + description="A Range for the mean word length. Default: (3, 10), i.e. 
between" + + " 3 and 10 characters.", + ) + doc_length: Interval = Field( + (10, 100_000), + description="A Range for the document length. Default: (10, 100_000), i.e." + + " between 10 and 100_000 characters.", + ) + symbol_to_word_ratio: Dict[str, Interval] = Field( + {"#": (None, 0.1)}, + description="A dict of symbols and the allowed range for the " + + r"symbol-to-word-ratio. The symbol-to-word-ratio is the ratio between symbol " + + "occurrence and word occurrence. Defaults to {'#': (None, 0.1)} i.e. no lower" + + r" limit, but there must at most be a ratio of 0.1 between the number of " + + "words and hashtags. i.e. if we have 100 words the symbol should appear no " + + "more than 10 times. Values not in the dict are not checked.", + ) + proportion_ellipsis: Interval = Field( + (None, 0.3), + description="A Range for the proportion of lines which end with ellipsis. " + + "Default: (None, 0.3), " + + r"i.e. no lower limit, but at most 30% of lines end with an ellipsis.", + ) + proportion_bullet_points: Interval = Field( + (None, 0.8), + description="A Range for the proportion of lines which start with a bullet " + + r"point. Default: (None, 0.8), i.e. no lower limit, but at most 80% of lines" + + " start with a bullet point.", + ) + contains: Dict[str, bool] = Field( + {"lorem ipsum": False}, + description="A dictionary of strings and whether they should be contained in " + + "the document. Default: {'lorem ipsum': False}, i.e. the document should not" + + " contain the string 'lorem ipsum'.", + ) + duplicate_line_chr_fraction: Interval = Field( + (None, 0.2), + description="A Range for the duplicate line character fraction. Default: " + + r"(None, 0.2), i.e. no lower limit, but at most 20% of characters are" + + " duplicates.", + ) + duplicate_paragraph_chr_fraction: Interval = Field( + (None, 0.2), + description="A Range for the duplicate paragraph character fraction. Default:" + + r" (None, 0.2), i.e. 
no lower limit, but at most 20% of characters are " + + "duplicates.", + ) + duplicate_ngram_chr_fraction: Dict[str, Interval] = Field( + { + "5": (None, 0.15), + "6": (None, 0.14), + "7": (None, 0.13), + "8": (None, 0.12), + "9": (None, 0.11), + "10": (None, 0.1), + }, + description="A dictionary of n-gram lengths and the allowed range for the " + + "duplicate n-gram character fraction. Default: {5: (None, 0.15), 6: (None, " + + "0.14), 7: (None, 0.13), 8: (None, 0.12), 9: (None, 0.11), 10: (None, 0.1)}, " + + r"i.e. no lower limit, but at most 15% of characters are duplicates for " + + r"5-grams, 14% for 6-grams, 13% for 7-grams, 12% for 8-grams, 11% for 9-grams" + + r" and 10% for 10-grams.", + ) + top_ngram_chr_fraction: Dict[str, Interval] = Field( + { + "2": (None, 0.2), + "3": (None, 0.18), + "4": (None, 0.16), + }, + description="A dictionary of n-gram lengths and the allowed range for the " + + "top n-gram character fraction. Default: {2: (None, 0.2), 3: (None, 0.18)" + + r", 4: (None, 0.16)}, i.e. 
no lower limit, but at most 20% of characters " + + r"are contained within a duplicate for 2-grams, 18% for 3-grams and 16% " + + "for 4-grams.", + ) + + +class QualityOutput(BaseModel): + """The output of the quality function.""" + + class Config: + extra = Extra.forbid + + n_stop_words: ThresholdsOutput = Field( + ..., + description="The thresholds output for the number of stop words.", + ) + alpha_ratio: ThresholdsOutput = Field( + ..., + description="The thresholds output for the alpha ratio.", + ) + mean_word_length: ThresholdsOutput = Field( + ..., + description="The thresholds output for the mean word length.", + ) + doc_length: ThresholdsOutput = Field( + ..., + description="The thresholds output for the document length.", + ) + symbol_to_word_ratio: Dict[str, ThresholdsOutput] = Field( + ..., + description="The thresholds output for the symbol-to-word-ratio.", + ) + proportion_ellipsis: ThresholdsOutput = Field( + ..., + description="The thresholds output for the proportion of lines ending with " + + "ellipsis.", + ) + proportion_bullet_points: ThresholdsOutput = Field( + ..., + description="The thresholds output for the proportion of lines starting with " + + "bullet points.", + ) + contains: Dict[str, ThresholdsOutput] = Field( + ..., + description="The thresholds output for the presence of strings.", + ) + duplicate_line_chr_fraction: ThresholdsOutput = Field( + ..., + description="The thresholds output for the duplicate line character fraction.", + ) + duplicate_paragraph_chr_fraction: ThresholdsOutput = Field( + ..., + description="The thresholds output for the duplicate paragraph character " + + "fraction.", + ) + duplicate_ngram_chr_fraction: Dict[str, ThresholdsOutput] = Field( + ..., + description="The thresholds output for the duplicate n-gram character " + + "fraction.", + ) + top_ngram_chr_fraction: Dict[str, ThresholdsOutput] = Field( + ..., + description="The thresholds output for the top n-gram character fraction.", + ) + + @property + def 
passed(self) -> bool: + """ + Returns: + bool: Whether all thresholds have been passed. + """ + return all( + [ + self.n_stop_words.passed, + self.alpha_ratio.passed, + self.mean_word_length.passed, + self.doc_length.passed, + all(v.passed for v in self.symbol_to_word_ratio.values()), + self.proportion_ellipsis.passed, + self.proportion_bullet_points.passed, + all(v.passed for v in self.contains.values()), + self.duplicate_line_chr_fraction.passed, + self.duplicate_paragraph_chr_fraction.passed, + all(v.passed for v in self.duplicate_ngram_chr_fraction.values()), + all(v.passed for v in self.top_ngram_chr_fraction.values()), + ], + ) + + def __repr_str__(self, join_str: str) -> str: + return join_str.join( + repr(v) if a is None else f"{a}={v!r}" + for a, v in [ + ("passed", self.passed), + ] + + list(self.__repr_args__()) + ) + + def to_flat_value_dict(self) -> Dict[str, Any]: + """Creates a flat dictionary representation of the object to allow for + easy conversion to a pandas DataFrame.""" + flat_dict = {"passed_quality_check": self.passed} + + for k, v in self.__dict__.items(): + if isinstance(v, dict): + for k2, v2 in v.items(): + flat_dict[f"{k}_{k2}"] = v2.value + else: + flat_dict[k] = v.value + + return flat_dict diff --git a/src/textdescriptives/extractors.py b/src/textdescriptives/extractors.py index 1dea6ead..540cfb1d 100644 --- a/src/textdescriptives/extractors.py +++ b/src/textdescriptives/extractors.py @@ -14,7 +14,7 @@ def __get_quality(doc: Doc) -> dict: """Get quality metrics as well as boolean indicator for passing filters.""" - return {**doc._.quality, "passed_quality_check": doc._.passed_quality_check} + return doc._.quality.to_flat_value_dict() def __get_descriptive_stats_dict(doc: Doc) -> dict: diff --git a/tests/test_quality.py b/tests/test_quality.py index f5467c52..4778a4e8 100644 --- a/tests/test_quality.py +++ b/tests/test_quality.py @@ -4,7 +4,6 @@ import pytest import spacy - import textdescriptives as td from 
textdescriptives.components.quality import ( alpha_ratio, @@ -181,15 +180,17 @@ def test_quality_component(nlp: spacy.Language): """Test the quality component.""" nlp.add_pipe("textdescriptives/quality", config={"force": True}) doc = nlp("This is a test. This is a test. This is a test.") - assert doc._.quality["n_stop_words"] == 9 - assert doc._.quality["mean_word_length"] == 2.4 - assert doc._.quality["alpha_ratio"] == 0.8 - assert doc._.quality["proportion_bullet_points"] == 0 - assert doc._.quality["proportion_ellipsis"] == 0 - assert doc._.quality["symbol_#_to_word_ratio"] == 0 - assert doc._.quality["duplicate_5-gram_chr_fraction"] == 1 - assert abs(doc._.quality["top_2-gram_chr_fraction"] - 0.44) < 0.01 + quality = doc._.quality + assert quality.n_stop_words == 9 + assert quality.mean_word_length == 2.4 + assert quality.alpha_ratio == 0.8 + assert quality.proportion_bullet_points == 0 + assert quality.proportion_ellipsis == 0 + assert quality.symbol_to_word_ratio["#"] == 0 + assert quality.duplicate_ngram_chr_fraction["5"] == 1 + assert abs(quality.top_ngram_chr_fraction["2"].value - 0.44) < 0.01 assert doc._.passed_quality_check is False + assert quality.passed is False def test_quality_component_with_config(nlp: spacy.Language): @@ -200,7 +201,7 @@ def test_quality_component_with_config(nlp: spacy.Language): alpha_ratio=(None, 0.8), mean_word_length=(1, 10), doc_length=(10, 100_000), - symbols_to_word_ratio={".": (None, 0.3)}, + symbol_to_word_ratio={".": (None, 0.3)}, proportion_ellipsis=(None, 0.3), proportion_bullet_points=(None, 0.8), duplicate_line_chr_fraction=(None, 0.2), @@ -220,15 +221,15 @@ def test_quality_component_with_config(nlp: spacy.Language): quality_pipe.set_quality_thresholds(quality_thresholds) doc = nlp("This is a test. This is a test. 
This is a test.") - assert doc._.quality["n_stop_words"] == 9 - assert doc._.quality["mean_word_length"] == 2.4 - assert doc._.quality["alpha_ratio"] == 0.8 - assert doc._.quality["proportion_bullet_points"] == 0 - assert doc._.quality["proportion_ellipsis"] == 0 - assert doc._.quality["symbol_._to_word_ratio"] == 0.25 - assert doc._.quality["duplicate_5-gram_chr_fraction"] == 1 - assert doc._.quality["duplicate_8-gram_chr_fraction"] == 1 - assert abs(doc._.quality["top_3-gram_chr_fraction"] - 0.57) < 0.01 + assert doc._.quality.n_stop_words == 9 + assert doc._.quality.mean_word_length == 2.4 + assert doc._.quality.alpha_ratio == 0.8 + assert doc._.quality.proportion_bullet_points == 0 + assert doc._.quality.proportion_ellipsis == 0 + assert doc._.quality.symbol_to_word_ratio["."] == 0.25 + assert doc._.quality.duplicate_ngram_chr_fraction["5"] == 1 + assert doc._.quality.duplicate_ngram_chr_fraction["8"] == 1 + assert abs(doc._.quality.top_ngram_chr_fraction["3"].value - 0.57) < 0.01 assert doc._.passed_quality_check is True From 4ddebdf3be5124b92cb4bafe5b69fe1632143aa6 Mon Sep 17 00:00:00 2001 From: Kenneth Enevoldsen Date: Fri, 13 Jan 2023 20:58:05 +0100 Subject: [PATCH 11/14] docs: fixed multiprocessing in tutorial --- docs/tutorials/filter_corpus_using_quality.ipynb | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/docs/tutorials/filter_corpus_using_quality.ipynb b/docs/tutorials/filter_corpus_using_quality.ipynb index 6c182c50..cbf79a40 100644 --- a/docs/tutorials/filter_corpus_using_quality.ipynb +++ b/docs/tutorials/filter_corpus_using_quality.ipynb @@ -319,7 +319,7 @@ }, { "cell_type": "code", - "execution_count": 12, + "execution_count": 42, "metadata": {}, "outputs": [], "source": [ @@ -769,7 +769,7 @@ "quality_pipe = nlp.add_pipe(\"textdescriptives/quality\")\n", "\n", "# 3. 
Apply the pipeline to the legal documents\n", - "legal_docs = nlp.pipe(legal[\"text\"], batch_size=100, n_process=4)" + "legal_docs = nlp.pipe(legal[\"text\"], batch_size=100, n_process=1)" ] }, { @@ -866,7 +866,7 @@ }, { "cell_type": "code", - "execution_count": 22, + "execution_count": 43, "metadata": {}, "outputs": [ { @@ -875,7 +875,7 @@ "QualityOutput(passed=False, n_stop_words=ThresholdsOutput(value=192.0, passed=True, threshold=(2.0, None)), alpha_ratio=ThresholdsOutput(value=0.8, passed=True, threshold=(0.7, None)), mean_word_length=ThresholdsOutput(value=4.55, passed=True, threshold=(3.0, 10.0)), doc_length=ThresholdsOutput(value=500.0, passed=True, threshold=(10.0, 100000.0)), symbol_to_word_ratio={'#': ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.1))}, proportion_ellipsis=ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.3)), proportion_bullet_points=ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.8)), contains={'lorem ipsum': ThresholdsOutput(value=0.0, passed=True, threshold=False)}, duplicate_line_chr_fraction=ThresholdsOutput(value=0.26, passed=False, threshold=(None, 0.2)), duplicate_paragraph_chr_fraction=ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.2)), duplicate_ngram_chr_fraction={'5': ThresholdsOutput(value=0.54, passed=False, threshold=(None, 0.15)), '6': ThresholdsOutput(value=0.52, passed=False, threshold=(None, 0.14)), '7': ThresholdsOutput(value=0.52, passed=False, threshold=(None, 0.13)), '8': ThresholdsOutput(value=0.52, passed=False, threshold=(None, 0.12)), '9': ThresholdsOutput(value=0.52, passed=False, threshold=(None, 0.11)), '10': ThresholdsOutput(value=0.52, passed=False, threshold=(None, 0.1))}, top_ngram_chr_fraction={'2': ThresholdsOutput(value=0.02, passed=True, threshold=(None, 0.2)), '3': ThresholdsOutput(value=0.04, passed=True, threshold=(None, 0.18)), '4': ThresholdsOutput(value=0.07, passed=True, threshold=(None, 0.16))})" ] }, - "execution_count": 22, + 
"execution_count": 43, "metadata": {}, "output_type": "execute_result" } @@ -1035,9 +1035,9 @@ "outputs": [], "source": [ "# first we apply the pipeline to the other domains\n", - "news_docs = nlp.pipe(news[\"text\"], batch_size=100, n_process=4)\n", + "news_docs = nlp.pipe(news[\"text\"], batch_size=100, n_process=1)\n", "news_docs = list(news_docs)\n", - "speech_docs = nlp.pipe(speech[\"text\"], batch_size=100, n_process=4)\n", + "speech_docs = nlp.pipe(speech[\"text\"], batch_size=100, n_process=1)\n", "speech_docs = list(speech_docs)" ] }, @@ -1149,7 +1149,7 @@ "source": [ "From this we can see that a high proportion of the tokens in the speech dataset dentoes the speaker such and tokens such as `:` then lower the alpa ratio. This might or might not be problematic for the task at hand.\n", "\n", - "**Therefore it is important to note that while these filters are useful for filtering large amount of texts it is also important to know that they should probably be adjusted to the target domain.**" + "**Therefore it is important to note that while these filters are useful for filtering large amount of texts it is also important to know that they should be adjusted to the target domain.**" ] }, { From deb148bcbe9014c5285fe5a062320f7541b1c1d5 Mon Sep 17 00:00:00 2001 From: Kenneth Enevoldsen Date: Mon, 16 Jan 2023 10:11:22 +0100 Subject: [PATCH 12/14] docs: changed print of quality --- .../filter_corpus_using_quality.ipynb | 20 +++++++++++++------ .../components/quality_data_classes.py | 2 +- 2 files changed, 15 insertions(+), 7 deletions(-) diff --git a/docs/tutorials/filter_corpus_using_quality.ipynb b/docs/tutorials/filter_corpus_using_quality.ipynb index cbf79a40..015b682a 100644 --- a/docs/tutorials/filter_corpus_using_quality.ipynb +++ b/docs/tutorials/filter_corpus_using_quality.ipynb @@ -206,11 +206,6 @@ "doc._.passed_quality_check" ] }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [] - }, { "attachments": {}, "cell_type": "markdown", @@ 
-227,7 +222,20 @@ { "data": { "text/plain": [ - "QualityOutput(passed=False, n_stop_words=ThresholdsOutput(value=435.0, passed=True, threshold=(2.0, None)), alpha_ratio=ThresholdsOutput(value=0.79, passed=True, threshold=(0.7, None)), mean_word_length=ThresholdsOutput(value=3.52, passed=True, threshold=(3.0, 10.0)), doc_length=ThresholdsOutput(value=894.0, passed=True, threshold=(10.0, 100000.0)), symbol_to_word_ratio={'#': ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.1))}, proportion_ellipsis=ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.3)), proportion_bullet_points=ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.8)), contains={'lorem ipsum': ThresholdsOutput(value=0.0, passed=True, threshold=False)}, duplicate_line_chr_fraction=ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.2)), duplicate_paragraph_chr_fraction=ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.2)), duplicate_ngram_chr_fraction={'5': ThresholdsOutput(value=0.42, passed=False, threshold=(None, 0.15)), '6': ThresholdsOutput(value=0.42, passed=False, threshold=(None, 0.14)), '7': ThresholdsOutput(value=0.38, passed=False, threshold=(None, 0.13)), '8': ThresholdsOutput(value=0.36, passed=False, threshold=(None, 0.12)), '9': ThresholdsOutput(value=0.36, passed=False, threshold=(None, 0.11)), '10': ThresholdsOutput(value=0.36, passed=False, threshold=(None, 0.1))}, top_ngram_chr_fraction={'2': ThresholdsOutput(value=0.01, passed=True, threshold=(None, 0.2)), '3': ThresholdsOutput(value=0.01, passed=True, threshold=(None, 0.18)), '4': ThresholdsOutput(value=0.01, passed=True, threshold=(None, 0.16))})" + "QualityOutput(\n", + "\tpassed=False, \n", + "\tn_stop_words=ThresholdsOutput(value=435.0, passed=True, threshold=(2.0, None)), \n", + "\talpha_ratio=ThresholdsOutput(value=0.79, passed=True, threshold=(0.7, None)), \n", + "\tmean_word_length=ThresholdsOutput(value=3.52, passed=True, threshold=(3.0, 10.0)), \n", + 
"\tdoc_length=ThresholdsOutput(value=894.0, passed=True, threshold=(10.0, 100000.0)), \n", + "\tsymbol_to_word_ratio={'#': ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.1))}, \n", + "\tproportion_ellipsis=ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.3)), \n", + "\tproportion_bullet_points=ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.8)), \n", + "\tcontains={'lorem ipsum': ThresholdsOutput(value=0.0, passed=True, threshold=False)}, \n", + "\tduplicate_line_chr_fraction=ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.2)), \n", + "\tduplicate_paragraph_chr_fraction=ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.2)), \n", + "\tduplicate_ngram_chr_fraction={'5': ThresholdsOutput(value=0.42, passed=False, threshold=(None, 0.15)), '6': ThresholdsOutput(value=0.42, passed=False, threshold=(None, 0.14)), '7': ThresholdsOutput(value=0.38, passed=False, threshold=(None, 0.13)), '8': ThresholdsOutput(value=0.36, passed=False, threshold=(None, 0.12)), '9': ThresholdsOutput(value=0.36, passed=False, threshold=(None, 0.11)), '10': ThresholdsOutput(value=0.36, passed=False, threshold=(None, 0.1))}, \n", + "\ttop_ngram_chr_fraction={'2': ThresholdsOutput(value=0.01, passed=True, threshold=(None, 0.2)), '3': ThresholdsOutput(value=0.01, passed=True, threshold=(None, 0.18)), '4': ThresholdsOutput(value=0.01, passed=True, threshold=(None, 0.16))})" ] }, "execution_count": 8, diff --git a/src/textdescriptives/components/quality_data_classes.py b/src/textdescriptives/components/quality_data_classes.py index df72ab47..840d203f 100644 --- a/src/textdescriptives/components/quality_data_classes.py +++ b/src/textdescriptives/components/quality_data_classes.py @@ -237,7 +237,7 @@ def passed(self) -> bool: def __repr_str__(self, join_str: str) -> str: return join_str.join( - repr(v) if a is None else f"{a}={v!r}" + repr(v) if a is None else f"\n\t{a}={v!r}" for a, v in [ ("passed", self.passed), ] From 
c22458097d7ca97679da748e63be6ad1523c9a41 Mon Sep 17 00:00:00 2001 From: Kenneth Enevoldsen Date: Mon, 16 Jan 2023 10:12:44 +0100 Subject: [PATCH 13/14] docs: removed multiprocessing from pipes --- docs/tutorials/filter_corpus_using_quality.ipynb | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/tutorials/filter_corpus_using_quality.ipynb b/docs/tutorials/filter_corpus_using_quality.ipynb index 015b682a..d419aaa4 100644 --- a/docs/tutorials/filter_corpus_using_quality.ipynb +++ b/docs/tutorials/filter_corpus_using_quality.ipynb @@ -777,7 +777,7 @@ "quality_pipe = nlp.add_pipe(\"textdescriptives/quality\")\n", "\n", "# 3. Apply the pipeline to the legal documents\n", - "legal_docs = nlp.pipe(legal[\"text\"], batch_size=100, n_process=1)" + "legal_docs = nlp.pipe(legal[\"text\"])" ] }, { @@ -1043,9 +1043,9 @@ "outputs": [], "source": [ "# first we apply the pipeline to the other domains\n", - "news_docs = nlp.pipe(news[\"text\"], batch_size=100, n_process=1)\n", + "news_docs = nlp.pipe(news[\"text\"])\n", "news_docs = list(news_docs)\n", - "speech_docs = nlp.pipe(speech[\"text\"], batch_size=100, n_process=1)\n", + "speech_docs = nlp.pipe(speech[\"text\"])\n", "speech_docs = list(speech_docs)" ] }, From e89a80cbfcb548778f99e0843bea0d6d1e8e7fa5 Mon Sep 17 00:00:00 2001 From: Lasse Date: Mon, 16 Jan 2023 12:36:55 +0100 Subject: [PATCH 14/14] tutorial: minor descriptions in tutorial --- .../filter_corpus_using_quality.ipynb | 29 ++++++++++--------- 1 file changed, 15 insertions(+), 14 deletions(-) diff --git a/docs/tutorials/filter_corpus_using_quality.ipynb b/docs/tutorials/filter_corpus_using_quality.ipynb index d419aaa4..8ed71383 100644 --- a/docs/tutorials/filter_corpus_using_quality.ipynb +++ b/docs/tutorials/filter_corpus_using_quality.ipynb @@ -61,7 +61,7 @@ "source": [ "\n", "### The Data\n", - "For our first example we will filter web content. For this we will use the [mC4 dataset](https://huggingface.co/datasets/mc4). 
It would take ages to download the whole data thus we will stream down 1000 samples from the dataset." + "For our first example we will filter web content. For this we will use the [mC4 dataset](https://huggingface.co/datasets/mc4). It would take ages to download the whole data so instead we will stream down 1000 samples from the dataset." ] }, { @@ -78,7 +78,7 @@ "# download the first 1 000\n", "dataset = dataset.take(1000)\n", "\n", - "# extract the text and remove texts which are too long\n", + "# extract the text\n", "texts = [sample [\"text\"] for sample in dataset]\n" ] }, @@ -108,7 +108,7 @@ "source": [ "### Filtering\n", "\n", - "To filter domains using `textdescriptives` we need to first set up the pipeline:" + "To filter texts using `textdescriptives` we need to first set up the pipeline:" ] }, { @@ -211,7 +211,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "It seems like this documents did no pass the quality check. Let us examine why that is:" + "It seems like this document did not pass the quality check. Let us examine why that is:" ] }, { @@ -287,7 +287,7 @@ "metadata": {}, "source": [ "### Extracting high quality texts\n", - "We are typically interested in text which are not of low quality. Thus we can extract these by filtering out the text which did not pass the quality check." + "We are typically interested in texts which are not of low quality. We can extract these by filtering out the texts which did not pass the quality check." ] }, { @@ -355,7 +355,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "If you want to read more about what each argument does, please check out the [documentation](https://hlasse.github.io/TextDescriptives/quality.html#data-classes)." 
+ "If you want to read more about what each argument does, please check out the [documentation](https://hlasse.github.io/TextDescriptives/quality.html#data-classes).\n", + "All the `passed` values and `passed_quality_check` attributes are dynamically updated when you call `.set_quality_thresholds`." ] }, { @@ -405,7 +406,7 @@ "metadata": {}, "source": [ "\n", - "We can donwload the dataset using the following command:" + "We can download the dataset using the following command:" ] }, { @@ -745,7 +746,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "We can for example see that the speech dataset contains notably fewer samples than the others. For that reason, let's only concern ourself with `legal` and `hest`" + "We can for example see that the speech dataset contains notably fewer samples than the others. " ] }, { @@ -754,10 +755,10 @@ "metadata": {}, "source": [ "### Quality Filtering\n", - "After we have prepared our datasets we can now start with the quality filtering. Using Textdescriptives this is extremely simple. We need to do 3 things:\n", + "After we have prepared our datasets we can now start with the quality filtering. Using TextDescriptives, this is extremely simple. We need to do 3 things:\n", "\n", "1) Create a pipeline\n", - "2) Add the quality component from textdescriptives to it\n", + "2) Add the quality component to it\n", "3) Apply the pipeline to the dataset\n" ] }, @@ -897,7 +898,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Here we see that fraction of characters which is a part of a duplicate 10 gram is >50%. This is a reason why the sample was filtered out. This is not uncommon for legal documents which contain a lot of standard phrases. However you might wish to change the threshold for this filter. We showed you have to do this in the previous section. We also see that the `alpha_ratio` is close 0.8. This means that the text is mostly made up of alphabetic characters. 
This is good, but as we will see later by no mean common for legal texts." + "Here we see that the fraction of characters which are part of a duplicate 10-gram is >50%. This is a reason why the sample was filtered out. This is not uncommon for legal documents which contain a lot of standard phrases. However you might wish to change the threshold for this filter. We showed you how to do this in the previous section. We also see that the `alpha_ratio` is close to 0.8. This means that the text is mostly made up of alphabetic characters. This is good, but as we will see later, this is not common for legal texts." ] }, { @@ -906,7 +907,7 @@ "metadata": {}, "source": [ "### Filtering out the text\n", - "Assuming we don't want to change the filter we can now use it to filter out the texts that we want to keep:" + "Assuming we don't want to change the filters we can now use them to select the texts that we want to keep:" ] }, { @@ -941,7 +942,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "That seems like a lot, we should probably check why that is. We can do this by looking at the distribution of the scores:" + "That seems like a lot, we should probably check why that is. We can do this by looking at the distribution of the scores, e.g. the duplicate 10-gram fraction:" ] }, { @@ -1025,7 +1026,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "We see that most of the text does not pass the `alpha_ratio` threshold of 0.7 or higher. This is not uncommon for legal documents as e.g. the paragraph sign `§` is not an alphabetic character. It might be relevant to change the threshold to 0.7 or lower." + "We see that most of the texts do not pass the `alpha_ratio` threshold of 0.7 or higher. This is not uncommon for legal documents as e.g. the paragraph sign `§` is not an alphabetic character. It might be relevant to change the threshold to 0.7 or lower." ] }, {