Commit

tutorial: minor descriptions in tutorial
HLasse committed Jan 16, 2023
1 parent c224580 commit e89a80c
Showing 1 changed file with 15 additions and 14 deletions.
29 changes: 15 additions & 14 deletions docs/tutorials/filter_corpus_using_quality.ipynb
@@ -61,7 +61,7 @@
"source": [
"\n",
"### The Data\n",
"For our first example we will filter web content. For this we will use the [mC4 dataset](https://huggingface.co/datasets/mc4). It would take ages to download the whole data thus we will stream down 1000 samples from the dataset."
"For our first example we will filter web content. For this we will use the [mC4 dataset](https://huggingface.co/datasets/mc4). Downloading the whole dataset would take a very long time, so instead we will stream 1000 samples from it."
]
},
{
@@ -78,7 +78,7 @@
"# download the first 1 000\n",
"dataset = dataset.take(1000)\n",
"\n",
"# extract the text and remove texts which are too long\n",
"# extract the text\n",
"texts = [sample[\"text\"] for sample in dataset]\n"
]
},
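The `take(1000)` call in the hunk above reads only the first 1,000 samples from the stream instead of downloading the full corpus. A minimal standard-library sketch of the same idea follows. The `fake_stream` generator is a hypothetical stand-in for the streamed dataset and is not part of `datasets` or of the tutorial.

```python
from itertools import islice
from typing import Dict, Iterator


def fake_stream() -> Iterator[Dict[str, str]]:
    """Hypothetical stand-in for a streamed dataset: yields samples lazily, without end."""
    i = 0
    while True:
        yield {"text": f"document {i}"}
        i += 1


# Lazily take the first 1,000 samples; nothing beyond them is ever produced.
first_samples = islice(fake_stream(), 1000)

# Extract the text field from each sample.
texts = [sample["text"] for sample in first_samples]
```

Because the source is consumed lazily, the cost depends on the number of samples taken, not on the size of the underlying corpus.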
@@ -108,7 +108,7 @@
"source": [
"### Filtering\n",
"\n",
"To filter domains using `textdescriptives` we need to first set up the pipeline:"
"To filter texts using `textdescriptives` we need to first set up the pipeline:"
]
},
{
@@ -211,7 +211,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"It seems like this documents did no pass the quality check. Let us examine why that is:"
"It seems like this document did not pass the quality check. Let us examine why that is:"
]
},
{
@@ -287,7 +287,7 @@
"metadata": {},
"source": [
"### Extracting high quality texts\n",
"We are typically interested in text which are not of low quality. Thus we can extract these by filtering out the text which did not pass the quality check."
"We are typically interested in texts which are not of low quality. We can extract these by filtering out the texts which did not pass the quality check."
]
},
{
@@ -355,7 +355,8 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want to read more about what each argument does, please check out the [documentation](https://hlasse.github.io/TextDescriptives/quality.html#data-classes)."
"If you want to read more about what each argument does, please check out the [documentation](https://hlasse.github.io/TextDescriptives/quality.html#data-classes).\n",
"All the `passed` values and `passed_quality_check` attributes are dynamically updated when you call `.set_quality_thresholds`."
]
},
{
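One way such dynamic updating can work is to compute `passed` on access from the current threshold instead of storing it. The sketch below illustrates that general pattern only. `ThresholdedValue` is a hypothetical class written for this note, and it is an assumption that the library behaves similarly; it is not the TextDescriptives data class.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ThresholdedValue:
    """Hypothetical illustration only; not the TextDescriptives data class."""

    value: float
    # (minimum, maximum); None means that side is unbounded.
    threshold: Tuple[Optional[float], Optional[float]]

    @property
    def passed(self) -> bool:
        # Recomputed on every access, so it always reflects the current threshold.
        low, high = self.threshold
        if low is not None and self.value < low:
            return False
        if high is not None and self.value > high:
            return False
        return True


alpha = ThresholdedValue(value=0.65, threshold=(0.7, None))
before = alpha.passed  # evaluated against the 0.7 minimum

alpha.threshold = (0.6, None)  # relax the minimum
after = alpha.passed  # re-evaluated against the new threshold
```

With this design there is no cached result that could go stale when thresholds change.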
@@ -405,7 +406,7 @@
"metadata": {},
"source": [
"\n",
"We can donwload the dataset using the following command:"
"We can download the dataset using the following command:"
]
},
{
@@ -745,7 +746,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We can for example see that the speech dataset contains notably fewer samples than the others. For that reason, let's only concern ourself with `legal` and `hest`"
"We can for example see that the speech dataset contains notably fewer samples than the others. "
]
},
{
@@ -754,10 +755,10 @@
"metadata": {},
"source": [
"### Quality Filtering\n",
"After we have prepared our datasets we can now start with the quality filtering. Using Textdescriptives this is extremely simple. We need to do 3 things:\n",
"After we have prepared our datasets we can now start with the quality filtering. Using TextDescriptives, this is extremely simple. We need to do 3 things:\n",
"\n",
"1) Create a pipeline\n",
"2) Add the quality component from textdescriptives to it\n",
"2) Add the quality component to it\n",
"3) Apply the pipeline to the dataset\n"
]
},
@@ -897,7 +898,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we see that fraction of characters which is a part of a duplicate 10 gram is >50%. This is a reason why the sample was filtered out. This is not uncommon for legal documents which contain a lot of standard phrases. However you might wish to change the threshold for this filter. We showed you have to do this in the previous section. We also see that the `alpha_ratio` is close 0.8. This means that the text is mostly made up of alphabetic characters. This is good, but as we will see later by no mean common for legal texts."
"Here we see that the fraction of characters which are part of a duplicate 10-gram is >50%. This is one reason why the sample was filtered out. This is not uncommon for legal documents, which contain a lot of standard phrases. However, you might wish to change the threshold for this filter. We showed you how to do this in the previous section. We also see that the `alpha_ratio` is close to 0.8. This means that the text is mostly made up of alphabetic characters. This is good, but as we will see later, it is not common for legal texts."
]
},
{
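To make the duplicate 10-gram measure in the hunk above concrete, here is a simplified sketch. It assumes whitespace tokenization and counts only token characters, so it approximates the idea and is not the library's implementation. The function name `duplicate_ngram_char_fraction` is hypothetical.

```python
from collections import Counter


def duplicate_ngram_char_fraction(text: str, n: int = 10) -> float:
    """Fraction of token characters covered by an n-gram that occurs more than once.

    Simplified sketch: whitespace tokens, spaces between tokens are ignored.
    """
    tokens = text.split()
    if len(tokens) < n:
        return 0.0

    # All overlapping n-grams, indexed by their start position.
    ngrams = [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(ngrams)

    # Mark every token that lies inside at least one repeated n-gram.
    covered = [False] * len(tokens)
    for start, gram in enumerate(ngrams):
        if counts[gram] > 1:
            for j in range(start, start + n):
                covered[j] = True

    total_chars = sum(len(tok) for tok in tokens)
    duplicate_chars = sum(len(tok) for tok, hit in zip(tokens, covered) if hit)
    return duplicate_chars / total_chars if total_chars else 0.0


repeated = "a b c " * 4
unique = "one two three four"
repeated_fraction = duplicate_ngram_char_fraction(repeated, n=3)
unique_fraction = duplicate_ngram_char_fraction(unique, n=3)
```

Text built from repeated standard phrases, as is common in legal documents, pushes this fraction up, while text without repeated n-grams scores zero.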
@@ -906,7 +907,7 @@
"metadata": {},
"source": [
"### Filtering out the text\n",
"Assuming we don't want to change the filter we can now use it to filter out the texts that we want to keep:"
"Assuming we don't want to change the filters, we can now use them to select the texts that we want to keep:"
]
},
{
@@ -941,7 +942,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"That seems like a lot, we should probably check why that is. We can do this by looking at the distribution of the scores:"
"That seems like a lot, so we should probably check why that is. We can do this by looking at the distribution of scores for, e.g., the duplicate 10-gram fraction:"
]
},
{
@@ -1025,7 +1026,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We see that most of the text does not pass the `alpha_ratio` threshold of 0.7 or higher. This is not uncommon for legal documents as e.g. the paragraph sign `§` is not an alphabetic character. It might be relevant to change the threshold to 0.7 or lower."
"We see that most of the texts do not pass the `alpha_ratio` threshold of 0.7 or higher. This is not uncommon for legal documents as e.g. the paragraph sign `§` is not an alphabetic character. It might be relevant to lower the threshold below 0.7."
]
},
{
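The effect of symbols such as `§` on `alpha_ratio` can be illustrated with a simplified sketch. It assumes the ratio is the share of whitespace-separated tokens that contain at least one alphabetic character. The library's tokenization may differ, and the example sentences are made up for illustration.

```python
def alpha_ratio(text: str) -> float:
    """Share of whitespace-separated tokens containing at least one alphabetic character.

    Simplified sketch; a real pipeline would use its own tokenizer.
    """
    tokens = text.split()
    if not tokens:
        return 0.0
    alpha_tokens = sum(1 for tok in tokens if any(ch.isalpha() for ch in tok))
    return alpha_tokens / len(tokens)


# Made-up examples: a legal-style line full of signs and numbers, and plain prose.
legal = "§ 1 . Loven gælder fra 1 . januar 2023 ."
plain = "The law applies from the first of January"

legal_ratio = alpha_ratio(legal)
plain_ratio = alpha_ratio(plain)
```

Paragraph signs, section numbers and separated punctuation all count as non-alphabetic tokens here, which is why legal-style text can fall below a 0.7 minimum even when it is otherwise well formed.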