Merge pull request #160 from CDRH/feature/csv_webs_fields
CSV and Custom Formats!
jduss4 authored Apr 24, 2020
2 parents fcbc112 + 224bb3e commit 66a13b7
Showing 41 changed files with 1,051 additions and 269 deletions.
2 changes: 1 addition & 1 deletion .ruby-version
Original file line number Diff line number Diff line change
@@ -1 +1 @@
2.7.0
2.7.1
8 changes: 4 additions & 4 deletions Gemfile.lock
@@ -20,16 +20,16 @@ GEM
mini_portile2 (2.4.0)
minitest (5.14.0)
netrc (0.11.0)
nokogiri (1.10.7)
nokogiri (1.10.9)
mini_portile2 (~> 2.4.0)
rake (10.5.0)
rake (13.0.1)
rest-client (2.0.2)
http-cookie (>= 1.0.2, < 2.0)
mime-types (>= 1.16, < 4.0)
netrc (~> 0.8)
unf (0.1.4)
unf_ext
unf_ext (0.0.7.6)
unf_ext (0.0.7.7)

PLATFORMS
ruby
@@ -38,7 +38,7 @@ DEPENDENCIES
bundler (>= 1.16.0, < 3.0)
datura!
minitest (~> 5.0)
rake (~> 10.0)
rake (~> 13.0)

BUNDLED WITH
2.1.4
38 changes: 26 additions & 12 deletions README.md
@@ -2,16 +2,34 @@

Welcome to this temporary documentation for Datura, a gem dedicated to transforming and posting data sources from CDRH projects. This gem is intended to be used with a collection containing TEI, VRA, CSVs, and more.

## Install
Looking for information about how to post documents? Check out the
[documentation for posting](/docs/3_manage/post.md).

## Install / Set Up Data Repo

Gemfile:
Check that Ruby is installed, preferably 2.7.x or up.

If your project already has a Gemfile, add the `gem "datura"` line. If not, create a new directory and add a file named `Gemfile` (no extension).

```
source "https://rubygems.org"
# fill in the latest available release for the tag
gem "datura", git: "https://github.com/CDRH/datura.git", tag: "v0.0.0"
```

If this is the first datura repository on your machine, install saxon as a system wide executable. [Saxon setup documentation](docs/4_developers/saxon.md).

Then, in the directory with the Gemfile, run the following:

```
gem "datura", git: "https://github.com/CDRH/data.git", branch: "datura"
gem install bundler
bundle install
bundle exec setup
```

Next, install saxon as a system wide executable. [Saxon setup documentation](docs/4_developers/saxon.md).
The last step should add files and some basic directories. Have a look at the [setup instructions](/docs/1_setup/collection_setup.md) to learn how to add your files and start working with the data!

## Local Development

@@ -28,21 +46,17 @@ Then in your repo you can run:

```
bundle install
# create the gem package if the above doesn't work
gem install --local path/to/local/datura/pkg/datura-0.x.x.gem
```

If for some reason that is not working, you can instead run the following each time you make a change in datura:
You will need to recreate your gem package for some changes you make in Datura. From the DATURA directory, NOT your data repo directory, run:

```
bundle exec rake install
```

then from the collection (sub in the correct version):

```
gem install --local path/to/local/datura/pkg/datura-0.1.2.gem
```

Note: You may need to delete your `scripts/.xslt-datura` folder as well.
Note: You may also need to delete your `scripts/.xslt-datura` folder if you are making changes to the default Datura scripts.

## First Steps

2 changes: 1 addition & 1 deletion datura.gemspec
@@ -59,5 +59,5 @@ Gem::Specification.new do |spec|
spec.add_runtime_dependency "rest-client", "~> 2.0.2"
spec.add_development_dependency "bundler", ">= 1.16.0", "< 3.0"
spec.add_development_dependency "minitest", "~> 5.0"
spec.add_development_dependency "rake", "~> 10.0"
spec.add_development_dependency "rake", "~> 13.0"
end
6 changes: 4 additions & 2 deletions docs/2_customization/all_types.md
@@ -5,11 +5,13 @@ There are a number of ways you can customize the transformations. Please refer
### To Elasticsearch

- [XML based (HTML / TEI / VRA / webs (Web Scraped HTML))](xml_to_es.md)
- [CSV](csv_to_es.md)
- CSV (Pending)
- [Custom Formats](custom_to_es.md) (those which Datura does not support but which a collection may need)

### To Solr / HTML

- Pending docs TODO
- Pending docs for most formats TODO
- [CSV](csv_to_solr.md)

### To IIIF

170 changes: 170 additions & 0 deletions docs/2_customization/custom_to_es.md
@@ -0,0 +1,170 @@
# Custom Formats to Elasticsearch

Datura provides minimal support for formats other than TEI, VRA,
HTML, and CSV through basic infrastructure to support overrides.

## The Basics

If you want to add a custom format such as YAML, XLS spreadsheets, or if you
want to add some highly customized version of HTML or CSV in addition to an
existing batch of CSVs, you need to create a directory in source with a unique name.

*The name you select should not be `authority` or `annotations`*. Those names
are reserved for projects which require authority files such as gazetteers and
scholarly notes about items.

Let's say you need to index `.txt` files. Once you have created the directory
`source/txt` and populated it with a few files, you can run the Datura scripts
with:

```
post -f txt
```

That will start off the process of grabbing the files and reading them.
Unfortunately, Datura has no idea what sort of format to prepare for, nor
how many items you might need per file (for example, a PDF might be one item
per file while a tab-separated doc could hold dozens or hundreds).

Additionally, once Datura reads in a file, it doesn't know how or what
information to extract, so it looks like it's time to start writing your own
code!

## Reading Your Format and Prepping for Launch

Just a note before we begin to clarify some of the variables that you may come
across while you're setting up your custom format:

- `@file_location` -- the fullpath to the specific file being processed
- `/var/local/www/data/collections/source/[custom_format]/test.json`
- `@filename` -- the specific file without a path
- `test.json`
- `self.filename()` -- method specific to FileType and subclasses to get the filename
- `@file` -- very generically named, `@file` is the version of your file that has been read in by Ruby
- override the `read_file` method to make `@file` into an XML / JSON / YAML / etc object as needed by your custom class (see below)

### read_file

In [file_custom.rb](/lib/datura/file_types/file_custom.rb), Datura reads in a
file as text and makes a new CustomToEs object from it, which is stored as `@file`. You may wish to
override the following to accommodate your format:

```
class FileCustom < FileType
def read_file
File.read(@file_location)
end
end
```

Currently, this is just straight up attempting to read a file's text. However,
if you are working with XML / HTML, JSON, CSV, YAML, etc, there is likely a
better, format-specific parser that will give you more control. For example,
you might change `read_file` to:

```
# note: may need to require libraries / modules
require "yaml"
class FileCustom < FileType
def read_file
YAML.load_file(@file_location)
end
end
```
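
The same pattern applies to other formats. Below is a minimal sketch for JSON, assuming nothing about Datura beyond what is described above: the `FileType` class here is only a stand-in so the example runs by itself, and its constructor is an assumption rather than Datura's real signature.

```ruby
require "json"
require "tempfile"

# Stand-in for Datura's FileType so this sketch is runnable on its own.
# The real class does much more and its constructor may differ.
class FileType
  attr_reader :file

  def initialize(file_location)
    @file_location = file_location
    @filename = File.basename(file_location)
    @file = read_file
  end

  # default behavior: read the file as plain text
  def read_file
    File.read(@file_location)
  end
end

class FileCustom < FileType
  # parse the file as JSON instead of keeping the raw text
  def read_file
    JSON.parse(File.read(@file_location))
  end
end

# try it out with a throwaway file
sample = Tempfile.new(["test", ".json"])
sample.write('{"texts": [{"identifier": "001"}, {"identifier": "002"}]}')
sample.close

custom = FileCustom.new(sample.path)
puts custom.file["texts"].length
```

In a real collection only the `FileCustom` override would go in your overrides directory; the stand-in class and the temporary file exist purely for illustration.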

### subdocs

If your format needs to be split into multiple documents (such as
personography files, spreadsheets, database dumps, etc), the next thing to
address is how to split up a file. By default, Datura assumes your file is one
item. If that is not the case, override `subdocs`:

```
def subdocs
Array(@file)
end
```

Change that to something which will return an array of items. For example, from
our YAML example, you might have:

```
def subdocs
@file["texts"]
end
```
Or for an XML file:
```
def subdocs
@file.xpath("//grouping")
end
```
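
For a tab-separated file, where each row should become its own item, `read_file` and `subdocs` can work together. The following is a rough sketch under the same caveat: `FileType` is a hypothetical stand-in, not Datura's actual class, and the column names are made up.

```ruby
require "tempfile"

# Stand-in for Datura's FileType (assumed constructor, for illustration only).
class FileType
  def initialize(file_location)
    @file_location = file_location
    @file = read_file
  end

  def read_file
    File.read(@file_location)
  end

  # default: the whole file is a single item
  def subdocs
    Array(@file)
  end
end

class FileCustom < FileType
  # turn each row into a hash keyed by the header row
  def read_file
    lines = File.read(@file_location).lines.map(&:chomp).reject(&:empty?)
    headers = lines.shift.split("\t")
    lines.map { |line| headers.zip(line.split("\t", -1)).to_h }
  end

  # every row is already its own item
  def subdocs
    @file
  end
end

sample = Tempfile.new(["letters", ".tsv"])
sample.write("identifier\ttitle\n001\tFirst letter\n002\tSecond letter\n")
sample.close

docs = FileCustom.new(sample.path).subdocs
docs.each { |doc| puts "#{doc["identifier"]}: #{doc["title"]}" }
```

Each element returned by `subdocs` is what later shows up as `@item` in the transformation class.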

### build_es_documents

You're almost done with `file_custom.rb`. You just need to kick off a class
that will handle the transformation per sub-document. For simplicity's sake, if
this is a totally new format that Elasticsearch hasn't seen before, I recommend
leaving this method alone. You can move onto the next step,
[CustomToEs](#customtoes).

If you want to try to piggyback off of an existing Datura class, then you may
need to override this method. Instead of calling `CustomToEs.new()` in it, you
would instead need to add a `require_relative` path at the top of the file to
your new class, and then call `YournewclassToEs.new()` from `build_es_documents`.

In your new class, you could presumably do something like

```
class YournewclassToEs < XmlToEs
# now you have access to XmlToEs helpers for xpaths, etc
end
```

## CustomToEs

The files in the [custom_to_es](/lib/datura/to_es/custom_to_es) directory and
[custom_to_es.rb](/lib/datura/to_es/custom_to_es.rb) give you the basic
structure you need to create your own version of these files. Since
Datura has no way of knowing what format might come its way, the majority of the
methods in `custom_to_es/fields.rb` are empty.

The only thing you **MUST** override is `get_id`.

Create a file in your overrides directory called `custom_to_es.rb` and add the
following:

```
class CustomToEs
def get_id
# include code here that returns an id
# it could be the @filename(false) to get a filename without extension
# or it could be `@item["identifier"]` to get the value of a column, etc
# you may want to prepend a collection abbreviation to your id, like
# "nei.#{some_value}"
end
end
```
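
As one possible shape for that override, here is a sketch that prefers an `identifier` column and falls back to the filename without its extension. The first `CustomToEs` definition is a stand-in with an assumed three-argument initializer so the example can run outside Datura; the `nei` prefix is borrowed from the comment above.

```ruby
# Stand-in for Datura's CustomToEs; the real initializer takes
# different arguments and sets up much more state.
class CustomToEs
  def initialize(item, file, filename)
    @item = item
    @file = file
    @filename = filename
  end
end

# The override itself, reopening the class as an overrides file would.
class CustomToEs
  def get_id
    # use the identifier column if present, otherwise the bare filename
    value = @item["identifier"] || File.basename(@filename, ".*")
    "nei.#{value}"
  end
end

with_id = CustomToEs.new({ "identifier" => "001" }, nil, "letters.tsv")
without_id = CustomToEs.new({}, nil, "letters.tsv")

puts with_id.get_id     # nei.001
puts without_id.get_id  # nei.letters
```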

You can also add preprocessing or postprocessing here by overriding `create_json`.

It is expected that you will override most of the methods in `fields.rb`. For
example, you might set a category like:

```
def category
# your code here, referencing @item if necessary
end
```
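
To make that concrete, a small sketch of a `category` override follows. It assumes each item is a hash with a `type` column, which is purely an invented example, and it again uses a stand-in `CustomToEs` with a simplified initializer.

```ruby
# Stand-in for Datura's CustomToEs (simplified, assumed initializer).
class CustomToEs
  def initialize(item, file = item)
    @item = item
    @file = file
  end
end

class CustomToEs
  def category
    # tidy up the hypothetical "type" column and supply a default when it is blank
    value = @item["type"].to_s.strip
    value.empty? ? "Uncategorized" : value.capitalize
  end
end

letter = CustomToEs.new({ "type" => " letter " })
unknown = CustomToEs.new({})

puts letter.category   # Letter
puts unknown.category  # Uncategorized
```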

One more note: due to how `CustomToEs` is created, it expects both a subdoc
and the original file. This accommodates something like a personography file,
where you may want to deal with an individual person as `@item` but need to
reference `@file` to get information about the repository or rightsholder,
etc. If your format does not use sub-documents, then you may simply refer to
`@item` throughout and ignore `@file`, which should be identical.
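
The relationship between `@item` and `@file` might look like the sketch below for a personography-style file. Everything here is illustrative: the stand-in `CustomToEs` initializer, the `people` and `repository` keys, and the `rights_holder` method name are assumptions, not confirmed parts of Datura.

```ruby
# Stand-in for Datura's CustomToEs (assumed two-argument initializer).
class CustomToEs
  def initialize(item, file)
    @item = item  # one sub-document, e.g. a single person
    @file = file  # the whole parsed file
  end
end

class CustomToEs
  def get_id
    "nei.#{@item["id"]}"
  end

  # hypothetical field: a value shared by every item in the file
  def rights_holder
    @file["repository"]
  end
end

file = {
  "repository" => "Example Archive",
  "people" => [
    { "id" => "p1", "name" => "Ada" },
    { "id" => "p2", "name" => "Ben" }
  ]
}

docs = file["people"].map { |person| CustomToEs.new(person, file) }
docs.each { |doc| puts "#{doc.get_id}: #{doc.rights_holder}" }
```

Each person gets an id from `@item`, while the repository comes from the shared `@file`.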
66 changes: 20 additions & 46 deletions lib/datura/common_xml.rb
@@ -20,7 +20,7 @@ def self.convert_tags(xml)
ele.delete("rend")
end
xml = CommonXml.sub_corrections(xml)
return xml
xml
end

# wrap in order to make valid xml
@@ -29,7 +29,7 @@ def self.convert_tags(xml)
def self.convert_tags_in_string(text)
xml = Nokogiri::XML("<xml>#{text}</xml>")
converted = convert_tags(xml)
return converted.xpath("//xml").inner_html
converted.xpath("//xml").inner_html
end

def self.create_html_object(filepath, remove_ns=true)
@@ -45,59 +45,24 @@ def self.create_xml_object(filepath, remove_ns=true)
file_xml
end

# pass in a date and identify whether it should be before or after
# in order to fill in dates (ex: 2014 => 2014-12-31)

# deprecated method
def self.date_display(date, nd_text="N.D.")
date_hyphen = CommonXml.date_standardize(date)
if date_hyphen
y, m, d = date_hyphen.split("-").map { |s| s.to_i }
date_obj = Date.new(y, m, d)
return date_obj.strftime("%B %-d, %Y")
else
return nd_text
end
Datura::Helpers.date_display(date, nd_text)
end

# automatically defaults to setting incomplete dates to the earliest
# date (2016-07 becomes 2016-07-01) but pass in "false" in order
# to set it to the latest available date
# deprecated method
def self.date_standardize(date, before=true)
return_date = nil
if date
y, m, d = date.split(/-|\//)
if y && y.length == 4
# use -1 to indicate that this will be the last possible
m_default = before ? "01" : "-1"
d_default = before ? "01" : "-1"
m = m_default if !m
d = d_default if !d
# TODO clean this up because man it sucks
if Date.valid_date?(y.to_i, m.to_i, d.to_i)
date = Date.new(y.to_i, m.to_i, d.to_i)
month = date.month.to_s.rjust(2, "0")
day = date.day.to_s.rjust(2, "0")
return_date = "#{date.year}-#{month}-#{day}"
end
end
end
return_date
Datura::Helpers.date_standardize(date, before)
end

# deprecated method
def self.normalize_name(abnormal)
# put in lower case
# remove starting a, an, or the
down = abnormal.downcase
down.gsub(/^the |^a |^an /, "")
Datura::Helpers.normalize_name(abnormal)
end

# imitates xslt fn:normalize-space
# removes leading / trailing whitespace, newlines, repeating whitespace, etc
# deprecated method
def self.normalize_space(abnormal)
if abnormal
normal = abnormal.strip.gsub(/\s+/, " ")
end
normal || abnormal
Datura::Helpers.normalize_space(abnormal)
end

# saxon accepts params in following manner
@@ -107,7 +72,7 @@ def self.stringify_params(param_hash)
if param_hash
params = param_hash.map{ |k, v| "#{k}=#{v}" }.join(" ")
end
return params
params
end

def self.sub_corrections(aXml)
@@ -122,4 +87,13 @@ def self.to_display_text(aXml)
CommonXml.sub_corrections(aXml).text
end

# TODO remove in 2021
class << self
extend Gem::Deprecate
deprecate :date_display, :"Datura::Helpers.date_display", 2021, 1
deprecate :date_standardize, :"Datura::Helpers.date_standardize", 2021, 1
deprecate :normalize_name, :"Datura::Helpers.normalize_name", 2021, 1
deprecate :normalize_space, :"Datura::Helpers.normalize_space", 2021, 1
end

end