API reference: Dataset Data Format V0.2a
The data structures used to store training and testing datasets are not defined as a dedicated class; they are built directly from plain lists and tuples.
Data in this format is passed to the functions and methods of the Aligner.
This page describes module version 0.2a.
- The old formats, including Bitext and Tritest, are replaced by Dataset.
Dataset is the data format for training and testing datasets, with the following components:
- source language information;
- target language information; and
- optionally, alignment information.
Dataset is in fact a list whose elements are sentences (each itself a list).
Each sentence has three parts, stored in a list:
- Source language information, typically the original text plus additional information such as POS tags. This is stored as a list in which each element is a tuple holding one word and its supplemental information.
- Target language information, typically the original text plus additional information such as POS tags. This is stored as a list in which each element is a tuple holding one word and its supplemental information.
- Optionally, a SentenceAlignment giving the alignment of the sentence, as specified in Alignment Data Format V0.2a.
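The layout described above can be checked with a small structural validator. This is only a sketch: the names is_valid_sentence and is_valid_dataset are hypothetical and not part of the module, and it assumes that the first element of every word tuple is the word text. The optional alignment entry is treated as opaque, since its format is defined in Alignment Data Format V0.2a.

```python
from typing import Any


def is_valid_sentence(sentence: Any) -> bool:
    """Check one Dataset entry: [source, target] or [source, target, alignment]."""
    if not isinstance(sentence, list) or len(sentence) not in (2, 3):
        return False
    for side in sentence[:2]:
        # Each language side is a list of word tuples.
        if not isinstance(side, list):
            return False
        for token in side:
            # Assumption: a word is a non-empty tuple whose first item is the text.
            if not (isinstance(token, tuple) and token and isinstance(token[0], str)):
                return False
    # The optional third element (the alignment) is deliberately not inspected.
    return True


def is_valid_dataset(dataset: Any) -> bool:
    """Check that a Dataset is a list of well-formed sentence entries."""
    return isinstance(dataset, list) and all(is_valid_sentence(s) for s in dataset)
```

A check like this catches a common slip in hand-written data: in Python, ("Leise") is a plain string, whereas ("Leise",) is a one-element tuple.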
For example, the following dataset:
Source(DE):
Leise flehen meine Lieder
Durch die Nacht zu Dir
Target(EN):
Becken softly my songs
Through the night to you
will be stored as:
dataset = [ [[("Leise",), ("flehen",), ("meine",), ("Lieder",)],
             [("Becken",), ("softly",), ("my",), ("songs",)]],
            [[("Durch",), ("die",), ("Nacht",), ("zu",), ("Dir",)],
             [("Through",), ("the",), ("night",), ("to",), ("you",)]] ]
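A structure like this can be built from parallel lines of text. The following is a minimal sketch, assuming whitespace tokenisation and no supplemental information; make_dataset is a hypothetical helper, not a function provided by the module, and real loading should go through the module's own file I/O.

```python
def make_dataset(source_lines, target_lines):
    """Build a Dataset (without alignments) from parallel lists of sentences."""
    if len(source_lines) != len(target_lines):
        raise ValueError("source and target must have the same number of sentences")
    dataset = []
    for src, tgt in zip(source_lines, target_lines):
        # Each word becomes a one-element tuple; note the trailing comma.
        source = [(word,) for word in src.split()]
        target = [(word,) for word in tgt.split()]
        dataset.append([source, target])
    return dataset


dataset = make_dataset(
    ["Leise flehen meine Lieder", "Durch die Nacht zu Dir"],
    ["Becken softly my songs", "Through the night to you"],
)
```

Each entry of the result is a two-element list, [source, target], with the optional alignment omitted.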
For a Dataset that carries additional information such as POS tags, an example is given below:
Source(DE):
Leise[JJ] flehen[VBP] meine[PRP$] Lieder[NN]
Durch[IN] die[DT] Nacht[NN] zu[TO] Dir[PRP]
Target(EN):
Becken[VBP] softly[JJ] my[PRP$] songs[NN]
Through[IN] the[DT] night[NN] to[TO] you[PRP]
It will be stored as:
dataset = [ [[("Leise", "JJ"), ("flehen", "VBP"), ("meine", "PRP$"), ("Lieder", "NN")],
             [("Becken", "VBP"), ("softly", "JJ"), ("my", "PRP$"), ("songs", "NN")]],
            [[("Durch", "IN"), ("die", "DT"), ("Nacht", "NN"), ("zu", "TO"), ("Dir", "PRP")],
             [("Through", "IN"), ("the", "DT"), ("night", "NN"), ("to", "TO"), ("you", "PRP")]] ]
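To illustrate how the word[TAG] notation used on this page maps onto the tuples, here is a rough sketch that converts one annotated line into a list of (word, tag) tuples and reads the words and tags back out. The helpers parse_tagged, words and tags are hypothetical, and the bracket notation is taken from the example above; it is not necessarily a file format the module reads.

```python
import re

# Matches tokens of the form word[TAG], e.g. "meine[PRP$]".
_TAGGED = re.compile(r"^(?P<word>.+)\[(?P<tag>[^\[\]]+)\]$")


def parse_tagged(line):
    """Turn 'Leise[JJ] flehen[VBP]' into [('Leise', 'JJ'), ('flehen', 'VBP')]."""
    tokens = []
    for raw in line.split():
        match = _TAGGED.match(raw)
        if match:
            tokens.append((match.group("word"), match.group("tag")))
        else:
            # No tag present: fall back to a one-element tuple.
            tokens.append((raw,))
    return tokens


def words(side):
    """Return only the word text of one language side."""
    return [token[0] for token in side]


def tags(side):
    """Return the first supplemental field (here the POS tag) of each word."""
    return [token[1] for token in side]


source = parse_tagged("Leise[JJ] flehen[VBP] meine[PRP$] Lieder[NN]")
```

Because the word is always the first element of the tuple, code that only needs the text can ignore however many supplemental fields follow it.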
Please refer to FileIO v0.3a for more detail.