API reference: FileIO V0.3a

Jetic Gu edited this page Jun 25, 2017 · 2 revisions

Introduction

FileIO is located in src/fileIO.py. It contains functions used to handle file operations in the Aligner.

The description here is of module version 0.3a.

Dependency

Supported Data Format

Changes from v0.2a

  • Support for Dataset formats, including bitext and tritext, has been dropped. All datasets will now use Dataset

Functions

def exportToFile(result, fileName)

Parameters:

def loadDataset(fFiles, eFiles, alignmentFile="", linesToLoad=sys.maxint)

Parameters:

  • fFiles: list of str, the source language files to read. Order: original text, POS tags.
  • eFiles: list of str, the target language files to read. Order: original text, POS tags.
  • alignmentFile: str, optional, the alignment file to read
  • linesToLoad: int, optional, the maximum number of lines to read. The default, sys.maxint, effectively reads the whole file.

Return:

def loadAlignment(fileName, linesToLoad=sys.maxint)

Parameters:

  • fileName: str, the alignment file to read
  • linesToLoad: int, optional, the maximum number of lines to read. The default, sys.maxint, effectively reads the whole file.

Return:

File formats

Source/target language text files

UTF-8 text files. Each line contains one sentence. Sentences are already segmented, with words separated by spaces. Each file contains one language only.
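As an illustration of this format, the snippet below is a minimal sketch of reading such a file into lists of words. The helper name read_sentences and the sample sentences are hypothetical and are not part of fileIO; the lines_to_load argument only mirrors the idea behind linesToLoad and is not taken from the module's implementation.

```python
from itertools import islice


def read_sentences(lines, lines_to_load=None):
    """Hypothetical helper: split each line into a list of words.

    `lines` is any iterable of text lines, e.g. an open file object.
    `lines_to_load` caps how many lines are read; None reads everything.
    """
    # str.split() with no argument splits on any run of whitespace and
    # drops the trailing newline. Empty lines become empty sentences so
    # that line numbers stay in step with the parallel file.
    return [line.split() for line in islice(lines, lines_to_load)]


# Made-up sample data standing in for the contents of a source language file.
sample_lines = ["das ist ein Haus\n", "das Buch ist klein\n"]

first_only = read_sentences(sample_lines, lines_to_load=1)
everything = read_sentences(sample_lines)
```

With a real file, the same helper could be applied to a file object opened with UTF-8 decoding, for example via io.open(path, encoding="utf-8").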

Gold Alignment files (.wa)

UTF-8 text files. Each line contains the alignments for one sentence pair. The word alignments within a sentence are separated by spaces. Each alignment is written in one of the following formats:

  1. "NN-MM", where NN and MM are integers, means that there is a certain alignment between the NNth word of the source sentence and the MMth word of the target sentence. MM may also be a comma-separated list, "M1,M2,M3,...", which means that there is a certain alignment between the NNth word of the source sentence and each of the listed words (M1th, M2th, M3th, ...) of the target sentence.

  2. "NN?MM", where NN and MM are integers, means that there is a probable alignment between the NNth word of the source sentence and the MMth word of the target sentence. MM may also be a comma-separated list, "M1,M2,M3,...", which means that there is a probable alignment between the NNth word of the source sentence and each of the listed words (M1th, M2th, M3th, ...) of the target sentence.

  3. "NN-MM-TT", where NN and MM are integers and TT is a str representing the type of the alignment, means that there is a certain alignment of type TT between the NNth word of the source sentence and the MMth word of the target sentence.
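The three token formats above can be parsed with a single regular expression. The following is a minimal sketch, not the code used by loadAlignment: the function name parse_alignment_line, the tuple layout of the result, and the tag value "SEM" in the sample are all assumptions made for illustration. Indices are returned exactly as written in the file, without any offset adjustment.

```python
import re

# NN, then "-" (certain) or "?" (probable), then one or more
# comma-separated target indices, then an optional "-TT" type tag.
# Assumption: a tag after a "?" token is also accepted here, although
# the format description only mentions tags on "-" tokens.
TOKEN = re.compile(r"^(\d+)([-?])(\d+(?:,\d+)*)(?:-(.+))?$")


def parse_alignment_line(line):
    """Hypothetical helper: parse one line of a .wa file.

    Returns a list of (source, target, kind, tag) tuples, where kind is
    "certain" or "probable" and tag is the TT string or None.
    """
    links = []
    for token in line.split():
        match = TOKEN.match(token)
        if match is None:
            raise ValueError("unrecognised alignment token: %r" % token)
        source, separator, targets, tag = match.groups()
        kind = "certain" if separator == "-" else "probable"
        # "M1,M2,M3" expands into one link per target index.
        for target in targets.split(","):
            links.append((int(source), int(target), kind, tag))
    return links


# Made-up sample line covering all three formats.
links = parse_alignment_line("1-1 2?3,4 5-6-SEM")
# links == [(1, 1, "certain", None), (2, 3, "probable", None),
#           (2, 4, "probable", None), (5, 6, "certain", "SEM")]
```

Expanding the comma-separated form into one tuple per target keeps downstream code simple, since every link then has the same shape regardless of how it was written in the file.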
