Tutorial: Training Sequence V0.5a

Introduction

This tutorial works with Alignment Model version V0.5a.

The Aligner aims to provide an easy-to-use interface for people to implement their own HMM models for word alignment. But how does it work?

Workflow

Here is the workflow for HMMWithAlignmentType: every model has its own training sequence, and in this one IBM model 1 is used to initialise the HMM model's translation probability table, after which the HMM model is trained first on POS tags and then on FORM.

Get Started!

In this example, we'll be looking at the baseline HMM model, as it is simple.

In every model file, there is a class named AlignmentModel, and the training sequence is defined in AlignmentModel.train. In the case of the baseline HMM, it looks like this:

# AlignerIBM1 is the IBM1 model's AlignmentModel, imported in the HMM model file,
# e.g.: from models.IBM1 import AlignmentModel as AlignerIBM1
class AlignmentModel(Base):
    def train(self, dataset, iterations):
        # Build the dictionary and replace the text in the dataset with indices
        dataset = self.initialiseLexikon(dataset)

        # Train IBM model 1, sharing this model's dictionary
        alignerIBM1 = AlignerIBM1()
        alignerIBM1.sharedLexikon(self)
        alignerIBM1.initialiseBiwordCount(dataset)
        alignerIBM1.EM(dataset, iterations)

        # Use IBM1's translation probability table as the HMM's initial values
        self.t = alignerIBM1.t

        # Train the HMM with the Baum-Welch algorithm
        self.baumWelch(dataset, iterations=iterations)

Let's take it apart, shall we? The first line in AlignmentModel.train initialises the dictionary and replaces the text in the dataset with indices. After that, the AlignmentModel has a dictionary stored inside. To allow other models to use this same dictionary, alignerIBM1.sharedLexikon(self) is called. Note that this function merely creates a reference in alignerIBM1 to the dictionary, meaning that any change to the dictionary affects both models.
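
To illustrate the reference semantics in plain Python (a minimal sketch of how shared references behave, not the Aligner's actual internals):

lexikon = {"house": 0}   # stands in for the dictionary built by initialiseLexikon
shared = lexikon         # sharing a reference, as sharedLexikon effectively does
shared["tree"] = 1       # a change made through either name...
print(lexikon)           # ...is visible through both: {'house': 0, 'tree': 1}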

Then the IBM1 model is trained by calling:

alignerIBM1.initialiseBiwordCount(dataset)
alignerIBM1.EM(dataset, iterations)

The first line initialises alignerIBM1's translation probability table (alignerIBM1.t), and the second line performs the actual training.

After IBM1 is trained, we use alignerIBM1's translation probability table (alignerIBM1.t) as the initial values of the HMM model. This is done by:

self.t = alignerIBM1.t

Then the Baum-Welch algorithm is called, and the HMM model is trained:

self.baumWelch(dataset, iterations=iterations)

Training with Index

The default training index in the IBM1 and HMM models is 0, which means training is done on the FORM (original text) of the dataset. If index is set to 1, training is done on POS tags. Should one wish to include more information, this can also be done by setting the index. For details on the dataset format and how indices work, please refer to Dataset Format V0.2a.
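
For example (a fragment, assuming alignerIBM1 and dataset are set up inside a train method like the ones in this tutorial), the index argument selects which part of the dataset to train on:

alignerIBM1.EM(dataset, iterations, index=0)             # IBM1 on FORM (the default)
alignerIBM1.EM(dataset, iterations, index=1)             # IBM1 on POS tags
self.baumWelch(dataset, iterations=iterations, index=1)  # HMM on POS tags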

Your own models with a customised training sequence

This is an example of customising the HMM model.

Suppose one doesn't think it necessary to run IBM1 for the same number of iterations to initialise the HMM's translation table, and also wishes to train on index=1 (POS tags). This can easily be done with the following code:

from models.HMM import AlignmentModel as Base
from models.IBM1 import AlignmentModel as AlignerIBM1

class AlignmentModel(Base):
    def train(self, dataset, iterations):
        dataset = self.initialiseLexikon(dataset)

        # Run IBM1 for only 3 iterations, on POS tags (index=1)
        alignerIBM1 = AlignerIBM1()
        alignerIBM1.sharedLexikon(self)
        alignerIBM1.initialiseBiwordCount(dataset)
        alignerIBM1.EM(dataset, 3, index=1)

        self.t = alignerIBM1.t
        self.initialiseBiwordCount(dataset)

        # Train the HMM on POS tags as well
        self.baumWelch(dataset, iterations=iterations, index=1)

Note that at the end of this training sequence, the translation table trained on POS tags is still stored in self.t. If one wishes to keep tables for multiple indices, each table has to be saved under its own attribute. For example, after training on POS tags, the model below copies the table to self.t2 and then retrains on FORM:

from models.HMM import AlignmentModel as Base
from models.IBM1 import AlignmentModel as AlignerIBM1

class AlignmentModel(Base):
    def train(self, dataset, iterations):
        dataset = self.initialiseLexikon(dataset)

        # First pass: IBM1 and HMM on POS tags (index=1)
        alignerIBM1 = AlignerIBM1()
        alignerIBM1.sharedLexikon(self)
        alignerIBM1.initialiseBiwordCount(dataset)
        alignerIBM1.EM(dataset, 3, index=1)

        self.t = alignerIBM1.t
        self.initialiseBiwordCount(dataset)

        self.baumWelch(dataset, iterations=iterations, index=1)

        # Keep the POS-tag table before it is overwritten
        self.t2 = self.t

        # Second pass: IBM1 and HMM on FORM (index=0)
        alignerIBM1 = AlignerIBM1()
        alignerIBM1.sharedLexikon(self)
        alignerIBM1.initialiseBiwordCount(dataset)
        alignerIBM1.EM(dataset, 3, index=0)

        self.t = alignerIBM1.t
        self.initialiseBiwordCount(dataset)

        self.baumWelch(dataset, iterations=iterations, index=0)

After training, the translation table for POS tags is stored in self.t2, and the table for FORM in self.t.
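
As a usage sketch (assuming dataset has already been loaded in the format described in Dataset Format V0.2a), both tables can then be read off the trained model:

model = AlignmentModel()
model.train(dataset, iterations=5)

posTable = model.t2   # translation table trained on POS tags (index=1)
formTable = model.t   # translation table trained on FORM (index=0)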