This repository was archived by the owner on May 4, 2021. It is now read-only.

Commit 7cb4aa5

Added phase 2 description to readme
1 parent cd6378d commit 7cb4aa5

1 file changed: 8 additions, 1 deletion

README.md

@@ -8,10 +8,17 @@ Collecting data for machine translation training from CommonCrawl is a two-phase
 
 The first phase detects the languages of the web pages contained in the crawl and other meta-data. A database is built from this data that can be accessed via a RESTful web API.
 
-In this phase monolingual data for language model training can be generated. The data for some of the CommonCrawl crawls and some languages can be found on:
+The [metadata documentation](/metadata/metadata.md) describes phase 1 step-by-step.
+
+In this phase monolingual data for language model training can be extracted. The data for some of the CommonCrawl crawls and some languages can be found on:
 
 * http://statmt.org/ngrams/
 * http://www.statmt.org/wmt16/translation-task.html
 
 For more details on the monolingual data see [ModernMT Deliverable 2.1](http://www.modernmt.eu/deliverables/mmt-d2-1-report-on-data-repository/).
 
+## Phase 2: Extracting parallel data and optional cleaning
+
+In the second phase the meta-data collected in phase 1 is used to extract parallel data from CommonCrawl data based on URL pattern matching. Phase 2 is documented step-by-step in the [baseline documentation](/baseline/baseline.md)
+
+For the language pairs en↔it, en↔fr and en↔it matched URL data is available for quick data extraction in release 0.1.0 https://github.com/ModernMT/DataCollection/releases/tag/0.1.0
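
The Phase 2 paragraph added in this commit mentions URL pattern matching without showing what it involves. The following is a minimal Python sketch of the general idea, not the project's actual implementation: language markers in a URL are replaced with a placeholder, and two URLs are treated as a candidate pair when their normalized forms coincide. The marker lists, the function names `normalize` and `match_urls`, and the example URLs are all hypothetical; the real patterns are described in the baseline documentation.

```python
import re
from collections import defaultdict

# Hypothetical language markers (assumption); the project's real pattern
# list is not given in the README text shown in this commit.
LANG_MARKERS = {
    "en": ["en", "eng", "english"],
    "fr": ["fr", "fra", "french", "francais"],
}


def normalize(url, lang):
    """Replace stand-alone language markers in `url` with '*'.

    A marker only counts when it is not adjacent to a letter or digit,
    so 'en' in '/en/' matches while 'en' in 'content' does not.
    """
    markers = sorted(LANG_MARKERS[lang], key=len, reverse=True)
    alternatives = "|".join(re.escape(m) for m in markers)
    pattern = re.compile(
        rf"(?<![A-Za-z0-9])(?:{alternatives})(?![A-Za-z0-9])",
        re.IGNORECASE,
    )
    return pattern.sub("*", url)


def match_urls(source_urls, target_urls, source_lang, target_lang):
    """Return (source, target) URL pairs whose normalized forms are equal."""
    by_key = defaultdict(list)
    for url in target_urls:
        by_key[normalize(url, target_lang)].append(url)

    pairs = []
    for url in source_urls:
        key = normalize(url, source_lang)
        if key == url:
            # No language marker found: the URL gives no evidence of a translation.
            continue
        for candidate in by_key.get(key, []):
            pairs.append((url, candidate))
    return pairs


if __name__ == "__main__":
    english = ["http://example.com/en/about.html", "http://example.com/contact.html"]
    french = ["http://example.com/fr/about.html", "http://example.com/fr/news.html"]
    for pair in match_urls(english, french, "en", "fr"):
        print(pair)
```

A sketch like this over-matches in some cases (for instance, a `.fr` top-level domain is also treated as a marker), which is one reason an optional cleaning step after extraction is useful.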
