Skip to content

Latest commit

 

History

History
495 lines (382 loc) · 17.3 KB

PyCIFRW_Overview.md

File metadata and controls

495 lines (382 loc) · 17.3 KB

Programming with PyCIFRW

PyCIFRW provides facilities for reading, manipulating and writing CIF and STAR files. In addition, CIF files and dictionaries may be validated against DDL1/2/m dictionaries.

Installing and Initialising PyCIFRW

(Note: these instructions refer to version 4.0 and higher. For older versions, see the documentation provided with those versions).

As of version 4.0, it is sufficient to install the PyCIFRW “wheel” using pip, for example:

pip install --use-wheel PyCifRW-4.2-cp27-none-linux_i686.whl

or using the platform independent source package found on PyPI:

pip install pycifrw

If you want to include PyCIFRW with your package, you can install the PyCIFRW wheel into your development environment and then bundle the contents of the CifFile directory found in the Python local libraries directory (usually site-packages).

If PyCIFRW has installed properly, the following command should complete without any errors:

  import CifFile

Working with CIF files

Reading CIF files

CIF files are represented in PyCIFRW as CifFile objects. These objects behave identically to Python dictionaries, with some additional methods. CifFile objects can be created by calling the ReadCif function on a filename or URL:

      from CifFile import ReadCif
      cf = ReadCif("mycif.cif")    
      df = ReadCif("ftp://ftp.iucr.org/pub/cifdics/cifdic.register")

Errors are raised if CIF syntax/grammar violations are encountered in the input file or line length limits are exceeded.

A compiled extension (StarScan.so) is available in binary distributions which increases parsing speed by a factor of three or more. To use this facility, include the keyword argument scantype='flex' in ReadCif commands:

      cf = ReadCif("mycif.cif",scantype="flex")

Binary distributions are generally only provided for the 'manylinux' target, but may also be generated from the source distribution for any platform if the appropriate compilers are available on that platform.

Alternatively, you may initialise a CifFile object with the URI:

      cf = CifFile("mycif.cif",scantype="flex")

If your CIF file contains characters that are not encoded in UTF8 or ASCII, you may pass the 'permissive' option to ReadCif, which will try other encodings (currently only latin1). Use of this option is not encouraged.

Grammar options

There are three variations in CIF file syntax. An early, little-used version of the standard allowed non-quoted data strings to begin with square bracket characters ('['). This was disallowed in version 1.1 in order to reserve such usage for later developments. The recently introduced CIF2 standard adds list and table datastructures to CIF1. Detection of the appropriate CIF grammar is automatic, but potentially time-consuming for multiple files, so specification of the particular version to use is possible with the grammar keyword:

     cf = ReadCif('oldcif.cif',grammar='1.0') #oldest CIF syntax      
     cf = ReadCif('normcif.cif',grammar='1.1') #widespread
     cf = ReadCif('future.cif',grammar='2.0') #latest standard
     cf = ReadCif('unknown.cif',grammar='auto') #try 2.0->1.1->1.0

Reading of STAR2 files is also possible by setting grammar='STAR2'. Currently, the default is set to 'auto'.

Creating a new CifFile

A new CifFile object is usually created empty:

        from CifFile import CifFile
        cf = CifFile()

You will need to create at least one CifBlock object to hold your data. The CifBlock is then added to the CifFile using the usual Python dictionary notation. The dictionary 'key' becomes the blockname used for output.

        from CifFile import CifBlock
        myblock = CifBlock()        
        cf['a_block'] = myblock

A CifBlock object may be initialised with another CifBlock, in which case a copy operation is performed.

Note that most operations on data provided by PyCIFRW involve CifBlock objects.

Manipulating values in a CIF file

Accessing data

The simplest form of access is using standard Python square bracket notation. Data blocks and data names within each data block are referenced identically to normal Python dictionaries:

      my_data = cf['a_data_block']['_a_data_name']

All values read in are stored as strings ^[ This deviates from the current CIF standard, which mandates interpreting unquoted strings as numbers where possible and in the absence of dictionary definitions to the contrary (International Tables, Vol. G., p24). ], with CIF syntactical elements stripped, that is, no enclosing quotation marks or semicolons are included in the values. The value associated with a CifFile dictionary key is always a CifBlock object. All standard Python dictionary methods (e.g. get, update, items(), keys()) are available for both CifFile and CifBlock objects. Note also the convenience method first_block(), which will return the first datablock stored which is not necessarily the first datablock in the physical file:

    my_data = cf.first_block()

If a data name occurs in a loop, a list of values is returned for the value of that dataname - the next section describes ways to access looped data.

Tabular (“looped”) data

For the purpose of the examples, we use the following example CIF file:

data_testblock
loop_
  _item_5   
  _item_7   
  _item_6    
  1  a  5    
  2  b  6    
  3  c  7    
  4  d  8 

Any table can be interacted with in a column-based or a row-based way. A PyCIFRW CifBlock object provides column-based access using normal square bracket syntax as described above: for example cf['testblock']['_item_6'] will return ['5','6','7','8'].

Table row access

The CifLoopBlock object represents a loop structure in the CIF file and facilitates row-based access. A CifLoopBlock object can be obtained by calling the CifBlock method GetLoop(dataname). Column-based access remains available for this object (e.g. keys() returns a list of datanames in the loop and square bracket notation returns a list of column values for that column).

A particular row can be selected using the CifLoopBlock GetKeyedPacket method:

    >>> lb = cf['testblock'].GetLoop('_item_6')
    >>> myrow = lb.GetKeyedPacket('_item_7','c') 
    >>> myrow._item_5
    '3'

In this example, the single packet with a value of 'c' for _item_7 is returned, and packet values can then be accessed using the dataname as an attribute of the packet. Note that a KeyError is raised if more than one packet matches, or no packets match, and that the packet returned is a copy of the data read in from the file, and therefore can be changed without affecting the CifBlock object.

You may also access the nth value in this CifLoopBlock object. ^[Warning: row and column order in a CIF loop is arbitrary; while PyCIFRW currently maintains the row order seen in the input file, there is nothing in the CIF standards which mandates this behaviour, and later implementations may change this behaviour ], and values can be obtained from these packets as attributes.

    >>> lb = cb.GetLoop("_item_5")
    >>> lb[0]
    ['1', 'a', '5']
    >>> lb[0]._item_7
    'a'

An alternative way of accessing loop data uses Python iterators, allowing the following syntax:

    >>> for a in lb: print `a["_item_7"]` 
    'a' 'b' 'c' 'd' 

Note that in both the above examples the row packet is a copy of the looped data, and therefore changes to it will not silently alter the contents of the original CifFile object, unlike the lists returned when column-based access is used.

Changing or adding data values

If many operations are going to be performed on a single data block, it is convenient to assign that block to a new variable:

    cb = cf['my_block']

A new data name and value may be added, or the value of an existing name changed, by straight assignment:

    cb['_new_data_name'] = 4.5
    cb['_old_data_name'] = 'cucumber'

By default, old values are overwritten silently. To instead raise an error when an item value is going to be overwritten, set attribute 'overwrite' to False:

    cb.overwrite = False
    cb['_old_data_name'] = 'cucumber' # Error is raised

To return to the original behaviour, set overwrite to True. To allow/disallow overwriting for all blocks in a file, call methods unlock()/lock() respectively.

Note that values may be strings or numbers.

Creating loops

To create a loop, simply set the column values to same-length lists, and then call the CifBlock method CreateLoop with a list of the looped datanames as a single argument. This method will raise an error if the datanames have different length columns assigned to them. For example, the following commands create the example loop above:

    cb['_item_5']  = [1,2,3,4]
    cb['_item_7']  = ['a','b','c','d']
    cb['_item_6']  = [5,6,7,8]
    cb.CreateLoop(['_item_5','_item_7','_item_6'])

As a special case, if CreateLoop is called with data names that are not list-valued, these items will be first placed into single-element lists before creating the loop, resulting in a loop with one row.

Another method, AddToLoop(dataname,newdata), adds columns in newdata to the pre-existing loop containing dataname, silently overwriting duplicate data. newdata should be a Python dictionary of dataname - datavalue pairs.

Note that lists (and other listlike objects except packets) returned by PyCIFRW actually point to the list currently inside the CifBlock object, and therefore any modification to them will modify the stored list. While this is often the desired behaviour, if you intend to manipulate such a list in other parts of your program while preserving the original CIF information, you should first copy the list to avoid destroying the loop structure:

    mysym = cb['_symmetry_ops'][:]
    mysym.append('x-1/2,y+1/2,z')

Changing item order

Item (and block) order has no semantic significance in CIF files. However, the readability of CIF files in simple text editors leads to a desire to organise the output order for human readers. The ChangeItemOrder method allows the order in which data items appear in the printed file to be changed:

    mycif['testblock'].ChangeItemOrder('_item_5',0)

will move _item_5 to the beginning of the datablock. When changing the order inside a loop block, the loop block's method must be called i.e.:

aloop = mycif['testblock'].GetLoop('_loop_item_1')
aloop.ChangeItemOrder('_loop_item_1',4)

Note also that the position of a loop within the file can be changed in this way as well, by passing the 'block number' object as the first argument. Each loop is assigned a simple integer number, which can be found by calling FindLoop with the name of a column in that loop:

loop_id = mycif['testblock'].FindLoop('_item_6')
mycif['testblock'].ChangeItemOrder(loop_id,0)

will move the loop block to the beginning of the printed datablock.

Adding and removing table rows

While it is most efficient to add columns to the CifBlock and then bind them together once into a loop, it is possible to add a new row into an existing loop using the AddPacket(packet) method of CifLoopBlock objects:

    aloop = mycif['testblock'].GetLoop('_item_7')
    template = aloop.GetKeyedPacket('_item_7','d')
    template._item_5 = '5'
    template._item_7 = 'e'
    template._item_6 = '9'
    aloop.AddPacket(template)

Note we use an existing packet as a template in this example. If you wish to create a packet from scratch, you should instantiate a StarPacket:

    from CifFile import StarFile   #installed with PyCIFRW
    newpack = StarFile.StarPacket()
    newpack._item_5 = '5'  
    ...
    aloop.AddPacket(newpack)

Note that an error will be raised when calling AddPacket if the packet attributes do not exactly match the item names in the loop.

A packet may be removed using the RemoveKeyedPacket method, which chooses the packet to be removed based on the value of the given dataname:

    aloop.RemoveKeyedPacket('_item_7','a')

Writing CIF Files

The CifFile method WriteOut returns a string which may be passed to an open file descriptor:

    outfile = open("mycif.cif")
    outfile.write(cf.WriteOut())

Or the built-in Python str() function can be used:

outfile.write(str(cf))

WriteOut takes an optional keyword argument, comment, which should be a string containing a comment which will be placed at the top of the output file. This comment string must already contain # characters at the beginning of lines:

    outfile.write(cf.WriteOut("#This is a test file"))

Two additional keyword arguments control line length in the output file: wraplength and maxoutlength. Lines in the output file are guaranteed to be shorter than maxoutlength characters, and PyCIFRW will additionally insert a line break if putting two data values or a dataname/datavalue pair together on the same line would exceed wraplength. In other words, unless data values are longer than maxoutlength characters long, no line breaks will be inserted into those datavalues the output file. By default, wraplength = 80 and maxoutlength = 2048. Note that the CIF line folding protocol is used, which makes wrapping of long datavalues reversible.

These values may be set on a per block basis by calling the SetOutputLength method of the block.

The order of output of items within a CifFile or CifBlock is specified using the ChangeItemOrder method (see above). The default order is the order that items were first added in to the CifFile/CifBlock. Note that this order is not guaranteed to be the order in which they appear in the input file.

Templating system

If you want precise control of the layout of your CIF file, you can pass a template file to the CifBlock.process_template method. A 'template' is a CIF file containing a single block, where the datanames are laid out in the way that the user desires. The layout elements that are picked up from this template are:

  1. order (overrides current order of CifBlock)
  2. column position of datavalues (only the first row of a loop block is inspected)
  3. delimiters
  4. If a semicolon-delimited string outside a loop contains 3 or more spaces in a row at the beginning of a line, that datavalue will be wrapped and indented by the same amount on output

Constraints on the template:

  1. There should only ever be one dataname on each line
  2. loop_ and and datablock tokens should appear as the only non-blank characters on their lines
  3. Comments are flagged by a '#' as the first character in the line
  4. Blank lines are acceptable (and ignored)
  5. The dummy datavalues should use only alphanumeric characters
  6. Semicolon-delimited strings are not allowed in loops

After calling process_template with the template file as the argument, subsequent calls to WriteOut will respect the template information, and revert to default behaviour for any datanames that were not found in the template. Templating is most useful when formatting CIF dictionaries which are read heavily by human readers, and have many (thousands!) of datablocks, each containing the same limited number of datanames.

Output format

CIF files are output by default in CIF2 grammar, but with the CIF2-only triple quotes avoided unless explicitly requested through a template. Therefore, as long as CIF2-only datastructures (lists and tables) are absent, the output CIF files will conform to 1.0,1.1 and 2.0 grammar. The grammar of the output files can be changed by calling CifFile.set_grammar with the choices being 1.0,1.1,2.0 or STAR2.

Deprecated classes

The ValidCifFile class is deprecated and will be removed in a future version.

Example programs

A program which uses PyCIFRW for validation, validate_cif.py, is included in the distribution in the Programs subdirectory. It will validate a CIF file (including dictionaries) against one or more dictionaries which may be specified by name and version or as a filename on the local disk. If name and version are specified, the IUCr canonical registry or a local registry is used to find the dictionary and download it if necessary.

Usage

python validate_cif.py [options] ciffile

Options

--version show version number and exit
-h,--help print short help message
-d dirname directory to find/store dictionary files
-f dictname filename of locally-stored dictionary
-u version dictionary version to resolve using registry
-n name dictionary name to resolve using registry
-s store downloaded dictionary locally (default True)
-c fetch and use canonical registry from IUCr
-r registry location of registry as filename or URL
-t The file to be checked is itself a DDL2 dictionary

Further information

The source files are in a literate programming format (noweb) with file extension .nw. HTML documentation generated from these files and containing both code and copious comments is included in the downloaded package. Details of interpretation of the current standards as relates to validation can be found in these files.