Merge pull request #2 from weaverba137/tests-and-docs
Add tests and documentation.
Showing 22 changed files with 929 additions and 211 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,200 @@ | ||
==================== | ||
HPSSPy Configuration | ||
==================== | ||
|
||
Introduction | ||
++++++++++++ | ||
|
||
The primary HPSSPy command-line program :command:`missing_from_hpss` is | ||
configured with a JSON_ file. Both the JSON standard and the | ||
Python :mod:`json` library are very strict. However, there is a quick way | ||
to check the validity of a JSON file:: | ||
|
||
python -c 'import json; j = open("config.json"); data = json.load(j); j.close()' | ||
|
||
where ``"config.json"`` should be replaced with the name of the file to be | ||
tested. | ||
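The one-liner above stops with a traceback on the first error. As a minimal sketch of the same check with a more compact report, the following uses only the standard :mod:`json` library. The function names ``check_json_text`` and ``check_json_file`` are hypothetical and are not part of HPSSPy.

```python
import json


def check_json_text(text):
    """Return None if text is valid JSON, otherwise a short error message."""
    try:
        json.loads(text)
    except json.JSONDecodeError as e:
        # lineno, colno and msg are attributes of json.JSONDecodeError.
        return "line {0}, column {1}: {2}".format(e.lineno, e.colno, e.msg)
    return None


def check_json_file(path):
    """Apply check_json_text() to the contents of the file at path."""
    with open(path) as j:
        return check_json_text(j.read())


# A trailing comma is legal in a Python dict literal, but not in JSON.
print(check_json_text('{"config": {"root": "/projects/my_project",}}'))
# A well-formed file produces None.
print(check_json_text('{"config": {"root": "/projects/my_project"}}'))
```

Calling ``check_json_file("config.json")`` would perform the same test on a file on disk.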
|
||
The top-level JSON container should be an "object", equivalent to a Python | ||
:class:`dict`. The simplest possible file that satisfies this requirement | ||
is:: | ||
|
||
{ | ||
} | ||
|
||
Obviously, that's not very much to go on. You will need further data | ||
described below. | ||
|
||
.. _JSON: http://json.org | ||
|
||
Metadata | ||
++++++++ | ||
|
||
The configuration file should contain a top-level keyword ``"config"``. | ||
The value should itself be a :class:`dict`, containing some important | ||
metadata:: | ||
|
||
{ | ||
"config": { | ||
"root": "/global/project/projectdirs/my_project", | ||
"hpss_root": "/nersc/projects/my_project", | ||
"physical_disks": ["my_project"] | ||
} | ||
} | ||
|
||
root/ | ||
The directory that contains *all* the data associated with the project. | ||
|
||
hpss\_root/ | ||
The path on the HPSS tape system that will contain the backups. | ||
|
||
physical\_disks/ | ||
If the data are spread across several physical disks and linked into | ||
the root path via symlinks, the various physical disks need to be listed | ||
here. If the value is equivalent to ``False``, *e.g.*, | ||
``null``, ``false`` or ``[]``, this means that the | ||
``"root"`` disk contains all the physical data. If the value | ||
is equivalent to a one-item list containing ``os.path.basename(root)``, | ||
then this *also* means that the ``"root"`` disk contains all the physical | ||
data. A list of simple names generates the physical disks by | ||
substitution on the basename of the ``"root"`` value. More complicated | ||
configurations are possible, see :func:`hpsspy.scan.physical_disks`. | ||
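The three simple cases described for ``"physical_disks"`` can be sketched as follows. This is only an illustration of the rules as stated here, not the implementation of :func:`hpsspy.scan.physical_disks`. The helper name ``simple_physical_disks`` is hypothetical, and "substitution on the basename" is assumed to mean replacing the last path component of ``"root"`` with each listed name.

```python
import posixpath


def simple_physical_disks(root, disks):
    """Hypothetical helper covering only the simple cases described above."""
    if not disks:
        # null, false or []: the root disk holds all the physical data.
        return [root]
    base = posixpath.basename(root)
    if len(disks) == 1 and disks[0] == base:
        # A one-item list naming the root itself means the same thing.
        return [root]
    # Assumption: each simple name replaces the basename of root.
    parent = posixpath.dirname(root)
    return [posixpath.join(parent, d) for d in disks]


root = "/global/project/projectdirs/my_project"
print(simple_physical_disks(root, ["my_project"]))
print(simple_physical_disks(root, ["my_project_01", "my_project_02"]))
```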
|
||
Sections | ||
++++++++ | ||
|
||
Inside the root directory, as described above, there may be several top-level | ||
directories. For the purposes of this documentation, these are called | ||
"sections" or "releases". The terms are interchangeable. Each section | ||
has configuration items that describe its structure:: | ||
|
||
{ | ||
"config": { | ||
"root": "/projects/my_project", | ||
"hpss_root": "/hpss/projects/my_project", | ||
"physical_disks": ["my_project"] | ||
}, | ||
"data": { | ||
"exclude": [], | ||
"d1": { | ||
"d1/batch/.*$": "d1/batch.tar", | ||
"d1/([^/]+\\.txt)$": "d1/\\1", | ||
"d1/templates/[^/]+$": "d1/templates/templates_files.tar" | ||
} | ||
} | ||
} | ||
|
||
The :command:`missing_from_hpss` command works on one section at a time. | ||
The name of the section is passed on the command-line:: | ||
|
||
missing_from_hpss config.json data | ||
|
||
This would read the ``"data"`` section above. | ||
|
||
Each section should have an ``"exclude"`` keyword, whose value is a list | ||
of files to be ignored. In the example above, in order to ignore the file | ||
``/projects/my_project/data/d1/README.html``, the ``"exclude"`` value | ||
would be ``["d1/README.html"]``. Note that this is relative to the | ||
path ``/projects/my_project/data``, since ``"data"`` is the section being | ||
processed. | ||
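As a small sketch of how an ``"exclude"`` entry relates to the full path of a file, the following computes the path relative to the section directory. The helper name ``exclude_entry`` is hypothetical; only :mod:`posixpath` from the standard library is used.

```python
import posixpath


def exclude_entry(root, section, full_path):
    """Return full_path relative to the section directory under root."""
    section_dir = posixpath.join(root, section)
    return posixpath.relpath(full_path, section_dir)


entry = exclude_entry("/projects/my_project", "data",
                      "/projects/my_project/data/d1/README.html")
print(entry)  # d1/README.html
```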
|
||
Mapping File Names to HPSS Archives | ||
+++++++++++++++++++++++++++++++++++ | ||
|
||
Within a section, each immediate subdirectory should be described with | ||
a keyword in the configuration file. :command:`missing_from_hpss` will | ||
complain if one is missing, but that won't necessarily cause it to fail. In the | ||
example above, ``/projects/my_project/data/d1`` is configured. | ||
|
||
There are many possible ways to bundle files for archiving. Generally you | ||
want to make archives as large as possible, without spilling onto multiple | ||
tapes. However, with highly structured, deeply-nested directory structures, | ||
this isn't always the best way to do it from a data *retrieval* viewpoint. | ||
|
||
Consider this scenario. ``/projects/my_project/data`` has been archived to | ||
ten tape archives called ``data00.tar``, ``data01.tar``, ... ``data09.tar``. | ||
The file ``/projects/my_project/data/d1/templates/d1_template_05.fits`` | ||
needs to be recovered. Which tape archive contains it? | ||
|
||
Now consider the scenario where the files in | ||
``/projects/my_project/data/d1/templates`` have been archived to | ||
``/hpss/projects/my_project/data/d1/templates/d1_templates_files.tar``. | ||
Now is it easier to recover the file? | ||
|
||
One should still try to make archives as big as possible, but generally | ||
speaking, long-term archiving of large, complex data sets should be | ||
done by **someone who actually knows the structure of the data set**. | ||
|
||
In coding terms we describe a portion of a directory tree hierarchy | ||
using regular expressions to match *files* in that portion. Then we map | ||
files that match that regular expression to tape archive files. | ||
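The mapping from file names to archive names can be sketched with the standard :mod:`re` library. This is an illustration only, not necessarily how :command:`missing_from_hpss` performs the mapping internally. The function name ``archive_for`` is hypothetical, and file names are assumed to be relative to the section directory. The patterns are those of the ``"d1"`` example above, written as they appear after JSON decoding.

```python
import re

# The "d1" mapping from the example configuration, after JSON decoding
# (each "\\" in the JSON file has become a single backslash).
d1_map = {
    r"d1/batch/.*$": "d1/batch.tar",
    r"d1/([^/]+\.txt)$": r"d1/\1",
    r"d1/templates/[^/]+$": "d1/templates/templates_files.tar",
}


def archive_for(filename, mapping):
    """Return the archive name for filename, or None if nothing matches."""
    for pattern, archive in mapping.items():
        m = re.match(pattern, filename)
        if m is not None:
            # Expand \1, \2, ... in the archive name from the matched groups.
            return m.expand(archive)
    return None


for f in ("d1/batch/run1/output.dat",
          "d1/README.txt",
          "d1/templates/d1_template_05.fits",
          "d1/unmapped/file.dat"):
    print(f, "->", archive_for(f, d1_map))
```

A file that matches no pattern maps to ``None`` in this sketch, which corresponds to a file that has not been configured.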
|
||
Regular Expression Details | ||
++++++++++++++++++++++++++ | ||
|
||
The HPSSPy package and :command:`missing_from_hpss` will validate the | ||
regular expressions used in the configuration file, in addition to checking | ||
the overall validity of the JSON file itself. That is, a bad regular | ||
expression will be rejected before it has any chance to "touch" any real data. | ||
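A comparable check can be run by hand before handing a configuration to :command:`missing_from_hpss`. The sketch below is not the package's own validation code. The function name ``check_patterns`` is hypothetical, and treating a missing trailing ``$`` as a problem is an assumption based on the requirements listed in this section.

```python
import json
import re


def check_patterns(mapping):
    """Return a list of (pattern, problem) tuples; empty if all is well."""
    problems = []
    for pattern in mapping:
        try:
            re.compile(pattern)
        except re.error as e:
            problems.append((pattern, str(e)))
            continue
        if not pattern.endswith("$"):
            problems.append((pattern, "does not end with $"))
    return problems


# JSON text as it would appear in a file: backslashes are doubled.
# The second pattern has an unbalanced parenthesis.
text = r'{"d1/([^/]+\\.txt)$": "d1/\\1", "d1/(batch/.*$": "d1/batch.tar"}'
for pattern, problem in check_patterns(json.loads(text)):
    print(pattern, ":", problem)
```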
|
||
The regular expressions should follow Python's conventions, | ||
described in :mod:`re`. In addition to those conventions, this package | ||
imposes some additional requirements, conventions and idioms: | ||
|
||
* Requirements | ||
|
||
- Backslashes must be escaped in JSON files. For example, the | ||
metacharacter ``\d`` (match a single decimal digit) becomes ``\\d``. | ||
- Regular expressions should end with the end-of-line marker ``$``. | ||
|
||
* Conventions | ||
|
||
- Any archive file name ending in ``.tar`` is assumed to be an HTAR file, | ||
and that command will be used to construct it. | ||
- Any archive file *not* ending in ``.tar`` will simply be copied to | ||
HPSS as is. | ||
- When constructing an archive file, :command:`missing_from_hpss` will | ||
obtain the directory it needs to archive from the name of the *archive* | ||
file, not the regular expression itself. This is because regular | ||
expression *substitution* is performed on the archive file name. | ||
For example ``batch.tar`` means "archive a batch/ directory". | ||
For longer file names, the "suffix" of the file will be used. | ||
``data_d1_batch.tar`` also means "archive a batch/ directory", because | ||
``data_d1_`` is stripped off. | ||
- An archive filename that ends with ``_files.tar``, *e.g.*, ``foo/bar_files.tar``, | ||
is a signal to :command:`missing_from_hpss` to construct | ||
the archive file in a certain way, not by descending into a directory, | ||
but by constructing an explicit list of files and building an archive | ||
file out of that. | ||
|
||
* Idioms | ||
|
||
- Archive the entire contents of a directory into a single file: | ||
``"foo/.*$" : "foo.tar"``. | ||
- Archive several subdirectories of a directory, each into their own file: | ||
``"foo/(bar|baz|flub)/.*$" : "foo/foo_\\1.tar"``. The name of the | ||
directory matched in parentheses will be substituted into the file name. | ||
- Archive arbitrary subdirectories of a *set* of subdirectories: | ||
``"d1/foo/(ab|bc|cd|de|ef)/([^/]+)/.*$":"d1/foo/\\1/d1_foo_\\1_\\2.tar"`` | ||
- Match files in a directory, but not any files in any | ||
subdirectory: ``"foo/[^/]+$" : "foo_files.tar"``. See also the | ||
``_files.tar`` convention mentioned above. | ||
- Group some but not all subdirectories in a directory into a single | ||
archive file for efficiency: ``"foo/([0-9])([0-9][0-9])/.*$" : "foo/foo_\\1XX.tar"``. | ||
Note the ending of the archive file, and that the directories have to | ||
have a very uniform naming convention (three and only three digits | ||
in this example). | ||
- Do not create an archive file, just copy the file, as is, to HPSS: | ||
``"d1/README\\.txt$" : "d1/README.txt"``. Similarly, for a set of TXT files: | ||
``"d1/([^/]+\\.txt)$" : "d1/\\1"``. | ||
- An example with lots of substitutions:: | ||
|
||
"d1/foo/([0-9a-zA-Z_-]+)/sub-([0-9]+)/([0-9]+)/.*$" : "d1/foo/\\1/spectra-\\2/\\1_spectra-\\2_\\3.tar" | ||
|
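Several of the idioms above can be tried out with the standard :mod:`re` library before they go into a configuration file. This is a sketch only. The helper names ``expand_idiom`` and ``archive_kind`` and the sample file names are hypothetical, and ``archive_kind`` merely restates the ``.tar`` and ``_files.tar`` conventions listed above rather than reproducing the package's own logic. Patterns are written as they appear after JSON decoding, so ``\\1`` in the file is ``\1`` here.

```python
import re


def expand_idiom(pattern, archive, filename):
    """Return the archive name for filename, or None if it does not match."""
    m = re.match(pattern, filename)
    return None if m is None else m.expand(archive)


def archive_kind(archive):
    """Classify an archive name according to the conventions listed above."""
    if archive.endswith("_files.tar"):
        return "HTAR file built from an explicit list of files"
    if archive.endswith(".tar"):
        return "HTAR file built from a directory"
    return "file copied to HPSS as is"


# (pattern, archive, hypothetical sample file name)
idioms = [
    (r"foo/(bar|baz|flub)/.*$", r"foo/foo_\1.tar", "foo/baz/a/b.dat"),
    (r"foo/[^/]+$", "foo_files.tar", "foo/notes.dat"),
    (r"foo/([0-9])([0-9][0-9])/.*$", r"foo/foo_\1XX.tar", "foo/123/x.dat"),
    (r"d1/([^/]+\.txt)$", r"d1/\1", "d1/README.txt"),
    (r"d1/foo/([0-9a-zA-Z_-]+)/sub-([0-9]+)/([0-9]+)/.*$",
     r"d1/foo/\1/spectra-\2/\1_spectra-\2_\3.tar",
     "d1/foo/v1_0/sub-100/55000/spec.fits"),
]

for pattern, archive, filename in idioms:
    name = expand_idiom(pattern, archive, filename)
    print(filename, "->", name, "(" + archive_kind(name) + ")")
```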
||
Finally, for truly monumentally-complicated directory trees, there is a | ||
`JSON file`_ included with this distribution describing the SDSS_ data tree | ||
that can be used for examples. To view the equivalent files and directories | ||
for section ``"dr12"``, for example, visit https://data.sdss.org/sas/dr12. | ||
|
||
.. _SDSS: https://www.sdss.org | ||
.. _`JSON file`: https://github.com/weaverba137/hpsspy/blob/master/hpsspy/data/sdss.json |