This repository was archived by the owner on May 4, 2021. It is now read-only.
README.md (+8 -1)
@@ -8,10 +8,17 @@ Collecting data for machine translation training from CommonCrawl is a two-phase

 The first phase detects the languages of the web pages contained in the crawl and other meta-data. A database is built from this data that can be accessed via a RESTful web API.

-In this phase monolingual data for language model training can be generated. The data for some of the CommonCrawl crawls and some languages can be found on:
+The [metadata documentation](/metadata/metadata.md) describes phase 1 step-by-step.
+
+In this phase monolingual data for language model training can be extracted. The data for some of the CommonCrawl crawls and some languages can be found on:
 For more details on the monolingual data see [ModernMT Deliverable 2.1](http://www.modernmt.eu/deliverables/mmt-d2-1-report-on-data-repository/).

+## Phase 2: Extracting parallel data and optional cleaning
+
+In the second phase the meta-data collected in phase 1 is used to extract parallel data from CommonCrawl data based on URL pattern matching. Phase 2 is documented step-by-step in the [baseline documentation](/baseline/baseline.md)
+
+For the language pairs en↔it, en↔fr and en↔it matched URL data is available for quick data extraction in release 0.1.0 https://github.com/ModernMT/DataCollection/releases/tag/0.1.0
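
The URL pattern matching mentioned for phase 2 can be illustrated with a minimal sketch. It is not the repository's implementation (the baseline documentation describes that); it only assumes that two pages are candidate translations when their URLs become identical once a language identifier is replaced by a placeholder. The function names `strip_language` and `match_url_pairs`, the regular expression, and the example URLs are all hypothetical.

```python
import re


def strip_language(url, lang):
    """Replace a standalone language code in the URL with '*'.

    Returns None when the URL contains no such code. The pattern is an
    assumption for illustration: the code must not be adjacent to letters.
    """
    pattern = re.compile(r"(?<![a-z])" + re.escape(lang) + r"(?![a-z])", re.IGNORECASE)
    stripped, count = pattern.subn("*", url)
    return stripped if count else None


def match_url_pairs(source_urls, target_urls, source_lang, target_lang):
    """Pair URLs whose language-stripped forms coincide."""
    index = {}
    for url in source_urls:
        key = strip_language(url, source_lang)
        if key is not None:
            # Keep the first source URL seen for each stripped form.
            index.setdefault(key, url)

    pairs = []
    for url in target_urls:
        key = strip_language(url, target_lang)
        if key is not None and key in index:
            pairs.append((index[key], url))
    return pairs


# Hypothetical example URLs, not taken from CommonCrawl.
english = ["http://example.com/en/about.html", "http://example.com/en/news.html"]
french = ["http://example.com/fr/about.html", "http://example.com/contact.html"]
pairs = match_url_pairs(english, french, "en", "fr")
```

Here only the two `about.html` pages end up paired, since the news page has no French counterpart and the contact page carries no language code. Real crawl data would need more careful handling, for example longer language names or codes that appear in query parameters.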