API reference: Alignment Data Format V0.1a
The data structures used to store alignments are not defined as classes; they are plain lists, tuples, and dicts.
Data in these formats is passed to the functions and methods of the Aligner.
The description here is of module version 0.1a.
SentenceAlignment
This is the only data format the Aligner uses to store the alignment of a single sentence.
It is a list in which each element is a tuple of two integers. The following is an example:
sentenceAlignment = [ (f_1, e_1),
                      (f_2, e_2),
                      ... ...
                      (f_n, e_n) ]
Each tuple above represents a pair of aligned words from the source sentence and target sentence, marked by their original positions in the sentence.
For example, for the following alignment:
The sentenceAlignment would be (order doesn't matter):
sentenceAlignment = [ (1, 4),
                      (2, 3),
                      (3, 1),
                      (4, 2) ]
Note that word indices in this data format are 1-based: to match the data format of gold alignments, the first word of a sentence has index 1, not 0.
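Because the indices are 1-based, they need to be shifted before they are used to index an ordinary Python list. The following is a minimal sketch of that conversion. The helper name `aligned_word_pairs` and the toy sentence pair are hypothetical and are not part of the Aligner.

```python
def aligned_word_pairs(source_words, target_words, sentence_alignment):
    """Map 1-based (f, e) index pairs back to (source word, target word) pairs."""
    pairs = []
    for f, e in sentence_alignment:
        # Indices in a SentenceAlignment start at 1, Python lists start at 0.
        pairs.append((source_words[f - 1], target_words[e - 1]))
    return pairs


# Hypothetical toy sentence pair, not taken from the Aligner's data.
source = ["das", "Haus", "ist", "klein"]
target = ["the", "house", "is", "small"]
sentence_alignment = [(1, 1), (2, 2), (3, 3), (4, 4)]

word_pairs = aligned_word_pairs(source, target, sentence_alignment)
```

Here `word_pairs` holds one (source word, target word) tuple per entry of the sentenceAlignment, in the same order as the entries.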
If multiple target words are aligned to the same source word, that source word appears in one entry per aligned target word. For example:
There would be two entries for the source word "gefällt", a verb meaning "attracts/makes one like":
sentenceAlignment = [ (1, 5),
                      (2, 2),
                      (2, 4),
                      (3, 1),
                      (4, 3) ]
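When one source word has several entries, it can be handy to group the target positions by source position. The sketch below does this for the example above. The function name `targets_by_source` is an assumption made for illustration and is not an Aligner function.

```python
from collections import defaultdict


def targets_by_source(sentence_alignment):
    """Group target positions by the source position they are aligned to."""
    grouped = defaultdict(list)
    for f, e in sentence_alignment:
        grouped[f].append(e)
    # Sort each list so the result does not depend on the entry order.
    return {f: sorted(es) for f, es in grouped.items()}


sentence_alignment = [(1, 5), (2, 2), (2, 4), (3, 1), (4, 3)]
grouped = targets_by_source(sentence_alignment)
```

In `grouped`, source position 2 maps to the two target positions 2 and 4, while every other source position maps to a single target position.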
Alignment
This is the data format used to store the alignments produced by the Aligner. It is the most commonly used alignment data format in the Aligner.
An alignment is an ordered list in which each element is in SentenceAlignment format. The following is an example:
alignment = [ sentenceAlignment1,
              sentenceAlignment2,
              ... ... ,
              sentenceAlignmentN ]
Unlike in a SentenceAlignment, the order of the list matters here, because each element corresponds to one sentence.
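Since these formats are plain lists and tuples rather than classes, nothing enforces their shape. A small structural check such as the sketch below can catch malformed data early. The names `is_sentence_alignment` and `is_alignment` are hypothetical; the Aligner itself is not known to provide such checks.

```python
def is_sentence_alignment(obj):
    """Return True if obj is a list of tuples of two 1-based integers."""
    if not isinstance(obj, list):
        return False
    for entry in obj:
        if not (isinstance(entry, tuple) and len(entry) == 2):
            return False
        if not all(isinstance(i, int) and i >= 1 for i in entry):
            return False
    return True


def is_alignment(obj):
    """Return True if obj is a list whose elements are all SentenceAlignments."""
    return isinstance(obj, list) and all(is_sentence_alignment(s) for s in obj)


# One SentenceAlignment per sentence, in sentence order.
alignment = [
    [(1, 4), (2, 3), (3, 1), (4, 2)],
    [(1, 5), (2, 2), (2, 4), (3, 1), (4, 3)],
]
alignment_is_valid = is_alignment(alignment)
```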
GoldAlignment
This is the data format used to store the reference alignments. It differs from Alignment for the following reasons:
- The current Aligner does not produce certain alignments and probable alignments separately, whereas a gold alignment usually contains both.
- A goldAlignment is not a tuple of two alignments, because keeping the certain and probable alignments of each sentence together is more convenient and makes it a bit easier to compare a goldAlignment with an alignment, as is done in Evaluators.
A goldAlignment is a list of dicts:
goldAlignment = [ goldAlignmentEntry1,
                  goldAlignmentEntry2,
                  ... ... ,
                  goldAlignmentEntryN ]
Each dict has the following format:
goldAlignmentEntry = { "certain": sentenceAlignment1,
                       "probable": sentenceAlignment2 }
in which sentenceAlignment1 and sentenceAlignment2 are both in SentenceAlignment format.
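To illustrate why the per-sentence layout makes comparison easy, here is a rough sketch of scoring an alignment against a goldAlignment. It is not the code of the Aligner's Evaluators. The function name `evaluate` is hypothetical, the metrics are the commonly used precision, recall, and alignment error rate (AER), and the sketch assumes that the probable set should be treated as including the certain links.

```python
def evaluate(alignment, gold_alignment):
    """Return (precision, recall, AER) of an alignment against a goldAlignment."""
    if len(alignment) != len(gold_alignment):
        raise ValueError("alignment and goldAlignment must cover the same sentences")

    total_a = total_s = a_and_s = a_and_p = 0
    # Both lists are ordered by sentence, so entries can be paired with zip.
    for predicted, gold in zip(alignment, gold_alignment):
        a = set(predicted)
        s = set(gold["certain"])
        # Assumption: every certain link also counts as a probable link.
        p = s | set(gold["probable"])
        total_a += len(a)
        total_s += len(s)
        a_and_s += len(a & s)
        a_and_p += len(a & p)

    precision = a_and_p / total_a if total_a else 0.0
    recall = a_and_s / total_s if total_s else 0.0
    denominator = total_a + total_s
    aer = 1.0 - (a_and_s + a_and_p) / denominator if denominator else 0.0
    return precision, recall, aer


# Hypothetical data for a single sentence.
alignment = [[(1, 1), (2, 2), (3, 4)]]
gold_alignment = [{"certain": [(1, 1), (2, 2)], "probable": [(3, 3)]}]
precision, recall, aer = evaluate(alignment, gold_alignment)
```

Because the order of entries inside a SentenceAlignment does not matter, each one is converted to a set before comparison.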