diff --git a/.ruby-version b/.ruby-version
index 24ba9a38d..860487ca1 100644
--- a/.ruby-version
+++ b/.ruby-version
@@ -1 +1 @@
-2.7.0
+2.7.1
diff --git a/Gemfile.lock b/Gemfile.lock
index f6371ab75..e79532f80 100644
--- a/Gemfile.lock
+++ b/Gemfile.lock
@@ -20,16 +20,16 @@ GEM
mini_portile2 (2.4.0)
minitest (5.14.0)
netrc (0.11.0)
- nokogiri (1.10.7)
+ nokogiri (1.10.9)
mini_portile2 (~> 2.4.0)
- rake (10.5.0)
+ rake (13.0.1)
rest-client (2.0.2)
http-cookie (>= 1.0.2, < 2.0)
mime-types (>= 1.16, < 4.0)
netrc (~> 0.8)
unf (0.1.4)
unf_ext
- unf_ext (0.0.7.6)
+ unf_ext (0.0.7.7)
PLATFORMS
ruby
@@ -38,7 +38,7 @@ DEPENDENCIES
bundler (>= 1.16.0, < 3.0)
datura!
minitest (~> 5.0)
- rake (~> 10.0)
+ rake (~> 13.0)
BUNDLED WITH
2.1.4
diff --git a/README.md b/README.md
index 0b603ffd3..2aea9e408 100644
--- a/README.md
+++ b/README.md
@@ -2,16 +2,34 @@
Welcome to this temporary documentation for Datura, a gem dedicated to transforming and posting data sources from CDRH projects. This gem is intended to be used with a collection containing TEI, VRA, CSVs, and more.
-## Install
+Looking for information about how to post documents? Check out the
+[documentation for posting](/docs/3_manage/post.md).
+## Install / Set Up Data Repo
-Gemfile:
+Check that Ruby is installed, preferably version 2.7.x or higher.
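+
+A quick way to check the installed version (assuming a typical shell):
+
+```
+ruby -v
+```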
+
+If your project already has a Gemfile, add the `gem "datura"` line. If not, create a new directory and add a file named `Gemfile` (no extension).
+
+```
+source "https://rubygems.org"
+
+# fill in the latest available release for the tag
+gem "datura", git: "https://github.com/CDRH/datura.git", tag: "v0.0.0"
+```
+
+If this is the first Datura repository on your machine, install Saxon as a system-wide executable. See the [Saxon setup documentation](docs/4_developers/saxon.md).
+
+Then, in the directory with the Gemfile, run the following:
```
-gem "datura", git: "https://github.com/CDRH/data.git", branch: "datura"
+gem install bundler
+bundle install
+
+bundle exec setup
```
-Next, install saxon as a system wide executable. [Saxon setup documentation](docs/4_developers/saxon.md).
+The last step should add files and some basic directories. Have a look at the [setup instructions](/docs/1_setup/collection_setup.md) to learn how to add your files and start working with the data!
## Local Development
@@ -28,21 +46,17 @@ Then in your repo you can run:
```
bundle install
+# or install from the local gem package if the above doesn't work
+gem install --local path/to/local/datura/pkg/datura-0.x.x.gem
```
-If for some reason that is not working, you can instead run the following each time you make a change in datura:
+Some changes in Datura require rebuilding the gem package. From the DATURA directory, NOT your data repo directory, run:
```
bundle exec rake install
```
-then from the collection (sub in the correct version):
-
-```
-gem install --local path/to/local/datura/pkg/datura-0.1.2.gem
-```
-
-Note: You may need to delete your `scripts/.xslt-datura` folder as well.
+Note: You may also need to delete your `scripts/.xslt-datura` folder if you are making changes to the default Datura scripts.
## First Steps
diff --git a/datura.gemspec b/datura.gemspec
index 1735714ea..ef85aa47d 100644
--- a/datura.gemspec
+++ b/datura.gemspec
@@ -59,5 +59,5 @@ Gem::Specification.new do |spec|
spec.add_runtime_dependency "rest-client", "~> 2.0.2"
spec.add_development_dependency "bundler", ">= 1.16.0", "< 3.0"
spec.add_development_dependency "minitest", "~> 5.0"
- spec.add_development_dependency "rake", "~> 10.0"
+ spec.add_development_dependency "rake", "~> 13.0"
end
diff --git a/docs/2_customization/all_types.md b/docs/2_customization/all_types.md
index 604b588a4..deee371e9 100644
--- a/docs/2_customization/all_types.md
+++ b/docs/2_customization/all_types.md
@@ -5,11 +5,13 @@ There are a number of ways you can customize the transformations. Please refer
### To Elasticsearch
- [XML based (HTML / TEI / VRA / webs (Web Scraped HTML))](xml_to_es.md)
-- [CSV](csv_to_es.md)
+- CSV (Pending)
+- [Custom Formats](custom_to_es.md) (those which Datura does not support but which a collection may need)
### To Solr / HTML
-- Pending docs TODO
+- Pending docs for most formats TODO
+- [CSV](csv_to_solr.md)
### To IIIF
diff --git a/docs/2_customization/custom_to_es.md b/docs/2_customization/custom_to_es.md
new file mode 100644
index 000000000..b9ef0f232
--- /dev/null
+++ b/docs/2_customization/custom_to_es.md
@@ -0,0 +1,170 @@
+# Custom Formats to Elasticsearch
+
+Datura provides minimal support for formats other than TEI, VRA,
+HTML, and CSV through basic infrastructure for overrides.
+
+## The Basics
+
+If you want to add a custom format such as YAML or XLS spreadsheets, or a
+highly customized version of HTML or CSV in addition to an existing batch of
+CSVs, you need to create a directory in `source` with a unique name.
+
+*The name you select should not be `authority` or `annotations`*. Those names
+are reserved for projects which require authority files such as gazetteers and
+scholarly notes about items.
+
+Let's say you need to index `.txt` files. Once you have created the directory
+`source/txt` and populated it with a few files, you can run the Datura scripts
+with:
+
+```
+post -f txt
+```
+
+That will start off the process of grabbing the files and reading them.
+Unfortunately, Datura has no idea what sort of format to prepare for, nor
+how many items each file might contain (for example, a PDF might be one item
+per file while a tab-separated doc could hold dozens or hundreds per file).
+
+Additionally, once Datura reads in a file, it doesn't know how or what
+information to extract, so it looks like it's time to start writing your own
+code!
+
+## Reading Your Format and Prepping for Launch
+
+Just a note before we begin to clarify some of the variables that you may come
+across while you're setting up your custom format:
+
+- `@file_location` -- the fullpath to the specific file being processed
+ - `/var/local/www/data/collections/source/[custom_format]/test.json`
+- `@filename` -- the specific file without a path
+ - `test.json`
+- `self.filename()` -- method specific to FileType and subclasses to get the filename
+- `@file` -- very generically named, `@file` is the version of your file that has been read in by Ruby
+ - override the `read_file` method to make `@file` into an XML / JSON / YAML / etc object as needed by your custom class (see below)
+
+### read_file
+
+In [file_custom.rb](/lib/datura/file_types/file_custom.rb), Datura reads in a
+file as plain text and stores the result as `@file`; a `CustomToEs` object is
+then created from each sub-document. You may wish to override the following
+method to accommodate your format:
+
+```
+class FileCustom < FileType
+ def read_file
+ File.read(@file_location)
+ end
+end
+```
+
+Currently, this simply reads the file's raw text. However,
+if you are working with XML / HTML, JSON, CSV, YAML, etc, there is likely a
+better, format-specific parser that will give you more control. For example,
+you might change `read_file` to:
+
+```
+# note: may need to require libraries / modules
+require "yaml"
+
+class FileCustom < FileType
+ def read_file
+ YAML.load_file(@file_location)
+ end
+end
+```
+
+### subdocs
+
+The next thing you will need to address, if your format needs to be split into
+multiple documents (such as personography files, spreadsheets, database dumps,
+etc), is how to split up a file. By default, Datura assumes your file is one
+item. If that is not the case, override `subdocs`:
+
+```
+def subdocs
+ Array(@file)
+end
+```
+
+Change that to something which returns an array of items. For example,
+continuing the YAML example above, you might have:
+
+```
+def subdocs
+ @file["texts"]
+end
+```
+Or for an XML file:
+```
+def subdocs
+ @file.xpath("//grouping")
+end
+```
+
+### build_es_documents
+
+You're almost done with `file_custom.rb`. You just need to kick off a class
+that will handle the transformation per sub-document. For simplicity's sake, if
+this is a totally new format that Elasticsearch hasn't seen before, I recommend
+leaving this method alone. You can move on to the next step,
+[CustomToEs](#customtoes).
+
+If you want to piggyback off of an existing Datura class, then you may
+need to override this method. Instead of calling `CustomToEs.new()`, add a
+`require_relative` path at the top of the file pointing to your new class, and
+then call `YournewclassToEs.new()` from `build_es_documents`.
+
+In your new class, you could presumably do something like
+
+```
+class YournewclassToEs < XmlToEs
+ # now you have access to XmlToEs helpers for xpaths, etc
+end
+```
+
+## CustomToEs
+
+The files in the [custom_to_es](/lib/datura/to_es/custom_to_es) directory and
+[custom_to_es.rb](/lib/datura/to_es/custom_to_es.rb) give you the basic
+structure you need to create your own version of these files. Since
+Datura has no way of knowing what format might come its way, the majority of the
+methods in `custom_to_es/fields.rb` are empty.
+
+The only thing you **MUST** override is `get_id`.
+
+Create a file in your overrides directory called `custom_to_es.rb` and add the
+following:
+
+```
+class CustomToEs
+
+ def get_id
+ # include code here that returns an id
+    # it could be the filename without its extension (see `@filename`)
+    # or it could be `@item["identifier"]` to get the value of a column, etc.
+
+ # you may want to prepend a collection abbreviation to your id, like
+ # "nei.#{some_value}"
+ end
+
+end
+```
+
+You can also add preprocessing or postprocessing here by overriding `create_json`.
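+
+For example, a minimal sketch of the two hooks (assuming `CustomToEs` mirrors
+the `preprocessing` / `postprocessing` hooks used by `CsvToEs`; the instance
+variables and values below are purely illustrative):
+
+```
+class CustomToEs
+
+  def preprocessing
+    # example: read shared annotations into memory before the fields run
+    @annotations = {}
+  end
+
+  def postprocessing
+    # example: adjust the assembled @json hash after the fields run
+    @json["category"] = "writings" if !@json["category"]
+  end
+
+end
+```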
+
+It is expected that you will override most of the methods in `fields.rb`. For
+example, you might set a category like:
+
+```
+def category
+ # your code here, referencing @item if necessary
+end
+```
+
+One more note: due to how `CustomToEs` is created, it expects a subdoc
+and the original file. This is to accommodate something like a
+personography file, where you may want to deal with an individual person as
+`@item` but need to reference `@file` to get information about the repository
+or rightsholder, etc. If your format does not use sub-documents, then you
+may simply refer to `@item` throughout and ignore `@file`, which should be
+identical.
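+
+As a sketch of using both, with invented XPaths and field names (assuming an
+XML personography where `@item` is a single person node):
+
+```
+class CustomToEs
+
+  def person
+    # information about this particular entry
+    @item.xpath("persName").text
+  end
+
+  def rights_holder
+    # shared metadata lives at the file level
+    @file.xpath("//repository").text
+  end
+
+end
+```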
diff --git a/lib/datura/common_xml.rb b/lib/datura/common_xml.rb
index d4cc8351d..d83aafb39 100644
--- a/lib/datura/common_xml.rb
+++ b/lib/datura/common_xml.rb
@@ -20,7 +20,7 @@ def self.convert_tags(xml)
ele.delete("rend")
end
xml = CommonXml.sub_corrections(xml)
- return xml
+ xml
end
# wrap in order to make valid xml
@@ -29,7 +29,7 @@ def self.convert_tags(xml)
def self.convert_tags_in_string(text)
xml = Nokogiri::XML("#{text}")
converted = convert_tags(xml)
- return converted.xpath("//xml").inner_html
+ converted.xpath("//xml").inner_html
end
def self.create_html_object(filepath, remove_ns=true)
@@ -45,59 +45,24 @@ def self.create_xml_object(filepath, remove_ns=true)
file_xml
end
- # pass in a date and identify whether it should be before or after
- # in order to fill in dates (ex: 2014 => 2014-12-31)
-
+ # deprecated method
def self.date_display(date, nd_text="N.D.")
- date_hyphen = CommonXml.date_standardize(date)
- if date_hyphen
- y, m, d = date_hyphen.split("-").map { |s| s.to_i }
- date_obj = Date.new(y, m, d)
- return date_obj.strftime("%B %-d, %Y")
- else
- return nd_text
- end
+ Datura::Helpers.date_display(date, nd_text)
end
- # automatically defaults to setting incomplete dates to the earliest
- # date (2016-07 becomes 2016-07-01) but pass in "false" in order
- # to set it to the latest available date
+ # deprecated method
def self.date_standardize(date, before=true)
- return_date = nil
- if date
- y, m, d = date.split(/-|\//)
- if y && y.length == 4
- # use -1 to indicate that this will be the last possible
- m_default = before ? "01" : "-1"
- d_default = before ? "01" : "-1"
- m = m_default if !m
- d = d_default if !d
- # TODO clean this up because man it sucks
- if Date.valid_date?(y.to_i, m.to_i, d.to_i)
- date = Date.new(y.to_i, m.to_i, d.to_i)
- month = date.month.to_s.rjust(2, "0")
- day = date.day.to_s.rjust(2, "0")
- return_date = "#{date.year}-#{month}-#{day}"
- end
- end
- end
- return_date
+ Datura::Helpers.date_standardize(date, before)
end
+ # deprecated method
def self.normalize_name(abnormal)
- # put in lower case
- # remove starting a, an, or the
- down = abnormal.downcase
- down.gsub(/^the |^a |^an /, "")
+ Datura::Helpers.normalize_name(abnormal)
end
- # imitates xslt fn:normalize-space
- # removes leading / trailing whitespace, newlines, repeating whitespace, etc
+ # deprecated method
def self.normalize_space(abnormal)
- if abnormal
- normal = abnormal.strip.gsub(/\s+/, " ")
- end
- normal || abnormal
+ Datura::Helpers.normalize_space(abnormal)
end
# saxon accepts params in following manner
@@ -107,7 +72,7 @@ def self.stringify_params(param_hash)
if param_hash
params = param_hash.map{ |k, v| "#{k}=#{v}" }.join(" ")
end
- return params
+ params
end
def self.sub_corrections(aXml)
@@ -122,4 +87,13 @@ def self.to_display_text(aXml)
CommonXml.sub_corrections(aXml).text
end
+ # TODO remove in 2021
+ class << self
+ extend Gem::Deprecate
+    deprecate :date_display, :"Datura::Helpers.date_display", 2021, 1
+    deprecate :date_standardize, :"Datura::Helpers.date_standardize", 2021, 1
+    deprecate :normalize_name, :"Datura::Helpers.normalize_name", 2021, 1
+ deprecate :normalize_space, :"Datura::Helpers.normalize_space", 2021, 1
+ end
+
end
diff --git a/lib/datura/data_manager.rb b/lib/datura/data_manager.rb
index 861d71da6..9ae304a43 100644
--- a/lib/datura/data_manager.rb
+++ b/lib/datura/data_manager.rb
@@ -18,13 +18,15 @@ class Datura::DataManager
attr_accessor :collection
def self.format_to_class
- {
+ classes = {
"csv" => FileCsv,
"html" => FileHtml,
"tei" => FileTei,
"vra" => FileVra,
"webs" => FileWebs
}
+ classes.default = FileCustom
+ classes
end
def initialize
@@ -63,7 +65,7 @@ def load_collection_classes
def print_options
pretty = JSON.pretty_generate(@options)
puts "Options: #{pretty}"
- return pretty
+ pretty
end
def run
@@ -179,7 +181,7 @@ def get_files
found = Datura::Helpers.get_directory_files(File.join(@options["collection_dir"], "source", format))
files += found if found
end
- return files
+ files
end
def options_msg
@@ -196,7 +198,7 @@ def options_msg
if @options["verbose"]
print_options
end
- return msg
+ msg
end
# override this step in project specific files
@@ -241,7 +243,7 @@ def prepare_files
@log.error(msg)
end
end
- return file_classes
+ file_classes
end
def prepare_xslt
@@ -293,7 +295,7 @@ def set_up_logger
def should_transform?(type)
# adjust default transformation type in params parser
- return @options["transform_types"].include?(type)
+ @options["transform_types"].include?(type)
end
def transform_and_post(file)
diff --git a/lib/datura/file_type.rb b/lib/datura/file_type.rb
index 6077dd8f2..236369a30 100644
--- a/lib/datura/file_type.rb
+++ b/lib/datura/file_type.rb
@@ -102,11 +102,11 @@ def post_solr(url=nil)
def print_es
json = transform_es
- return pretty_json(json)
+ pretty_json(json)
end
def print_solr
- return transform_solr
+ transform_solr
end
# these rules apply to all XML files (HTML / TEI / VRA)
@@ -156,7 +156,7 @@ def transform_solr
else
req = exec_xsl(@file_location, @script_solr, "xml", nil, @options["variables_solr"])
end
- return req
+ req
end
private
diff --git a/lib/datura/file_types/file_csv.rb b/lib/datura/file_types/file_csv.rb
index e7c01f2a8..65655a940 100644
--- a/lib/datura/file_types/file_csv.rb
+++ b/lib/datura/file_types/file_csv.rb
@@ -34,25 +34,21 @@ def present?(item)
# override to change encoding
def read_csv(file_location, encoding="utf-8")
- return CSV.read(file_location, {
+ CSV.read(file_location, {
encoding: encoding,
headers: true,
return_headers: true
})
end
- # most basic implementation assumes column header is the es field name
- # operates with no logic on the fields
- # YOU MUST OVERRIDE FOR CSVS WHICH DO NOT HAVE BESPOKE HEADINGS FOR API
+ # NOTE previously this blindly took column headings and tried
+ # to send them to Elasticsearch, but this will make a mess of
+ # our index mapping, so instead prefer to only push specific fields
+ # leaving "headers" in method arguments for backwards compatibility
+ #
+ # override as necessary per project
def row_to_es(headers, row)
- doc = {}
- headers.each do |column|
- doc[column] = row[column] if row[column]
- end
- if doc.key?("text") && doc.key?("title")
- doc["text"] << " #{doc["title"]}"
- end
- doc
+ CsvToEs.new(row, options, @csv, self.filename(false)).json
end
# most basic implementation assumes column header is the solr field name
@@ -61,7 +57,7 @@ def row_to_solr(doc, headers, row)
headers.each do |column|
doc.add_child("#{row[column]}") if row[column]
end
- return doc
+ doc
end
def transform_es
@@ -111,7 +107,7 @@ def transform_solr
filepath = "#{@out_solr}/#{self.filename(false)}.xml"
File.open(filepath, "w") { |f| f.write(solr_doc.root.to_xml) }
end
- return { "doc" => solr_doc.root.to_xml }
+ { "doc" => solr_doc.root.to_xml }
end
def write_html_to_file(builder, index)
diff --git a/lib/datura/file_types/file_custom.rb b/lib/datura/file_types/file_custom.rb
new file mode 100644
index 000000000..28725d02b
--- /dev/null
+++ b/lib/datura/file_types/file_custom.rb
@@ -0,0 +1,78 @@
+require_relative "../helpers.rb"
+require_relative "../file_type.rb"
+
+require "rest-client"
+
+class FileCustom < FileType
+ attr_reader :es_req, :format
+
+ def initialize(file_location, options)
+ super(file_location, options)
+ @format = get_format
+ @file = read_file
+ end
+
+ def build_es_documents
+ # currently assuming that the file has one document to post
+ # but since some may include more (personographies, spreadsheets, etc)
+ # this should return an array of documents
+ # NOTE this would also be a pretty reasonable method to override
+ # if you need to split your documents into classes of your own creation
+ # like "YamlToEs" or "XlsToEs", etc
+ docs = []
+ subdocs.each do |subdoc|
+ docs << CustomToEs.new(
+ subdoc,
+ options: @options,
+ file: @file,
+ filename: self.filename,
+ file_type: @format)
+ .json
+ end
+ docs.compact
+ end
+
+ def get_format
+ # assumes that the format is in the directory structure
+ File.dirname(@file_location).split("/").last
+ end
+
+ # NOTE: you will likely need to override this method
+ # depending on the format in question
+ def read_file
+ File.read(@file_location)
+ end
+
+ def subdocs
+ # if the file should be split into components (such as a CSV row
+ # or personography person entry), override this method to return
+ # an array of items
+ Array(@file)
+ end
+
+ def transform_es
+ puts "transforming #{self.filename}"
+ # expecting an array
+ es_doc = build_es_documents
+
+ if @options["output"]
+ filepath = "#{@out_es}/#{self.filename(false)}.json"
+ File.open(filepath, "w") { |f| f.write(pretty_json(es_doc)) }
+ end
+ es_doc
+ end
+
+ # CURRENTLY NO SUPPORT FOR FOLLOWING TRANSFORMATIONS
+ def transform_html
+ raise "Custom format to HTML transformation must be implemented in collection"
+ end
+
+ def transform_iiif
+ raise "Custom format to IIIF transformation must be implemented in collection"
+ end
+
+ def transform_solr
+ raise "Custom format to Solr transformation must be implemented in collection"
+ end
+end
diff --git a/lib/datura/file_types/file_tei.rb b/lib/datura/file_types/file_tei.rb
index 66fd4a970..d756450f2 100644
--- a/lib/datura/file_types/file_tei.rb
+++ b/lib/datura/file_types/file_tei.rb
@@ -17,7 +17,7 @@ def initialize(file_location, options)
def subdoc_xpaths
# match subdocs against classes
- return {
+ {
"/TEI" => TeiToEs,
# "//listPerson/person" => TeiToEsPersonography,
}
diff --git a/lib/datura/file_types/file_vra.rb b/lib/datura/file_types/file_vra.rb
index cf8b9bd31..e48e4587b 100644
--- a/lib/datura/file_types/file_vra.rb
+++ b/lib/datura/file_types/file_vra.rb
@@ -11,7 +11,7 @@ def initialize(file_location, options)
def subdoc_xpaths
# planning ahead on this one, but not necessary at the moment
- return {
+ {
"/vra" => VraToEs,
"//listPerson/person" => VraToEsPersonography
}
diff --git a/lib/datura/helpers.rb b/lib/datura/helpers.rb
index 1b56fb33f..2e841d267 100644
--- a/lib/datura/helpers.rb
+++ b/lib/datura/helpers.rb
@@ -5,6 +5,46 @@
module Datura::Helpers
+ # date_display
+ # pass in a date and identify whether it should be before or after
+ # in order to fill in dates (ex: 2014 => 2014-12-31)
+ def self.date_display(date, nd_text="N.D.")
+ date_hyphen = self.date_standardize(date)
+ if date_hyphen
+ y, m, d = date_hyphen.split("-").map { |s| s.to_i }
+ date_obj = Date.new(y, m, d)
+ date_obj.strftime("%B %-d, %Y")
+ else
+ nd_text
+ end
+ end
+
+ # date_standardize
+ # automatically defaults to setting incomplete dates to the earliest
+ # date (2016-07 becomes 2016-07-01) but pass in "false" in order
+ # to set it to the latest available date
+ def self.date_standardize(date, before=true)
+ return_date = nil
+ if date
+ y, m, d = date.split(/-|\//)
+ if y && y.length == 4
+ # use -1 to indicate that this will be the last possible
+ m_default = before ? "01" : "-1"
+ d_default = before ? "01" : "-1"
+ m = m_default if !m
+ d = d_default if !d
+ # TODO clean this up because man it sucks
+ if Date.valid_date?(y.to_i, m.to_i, d.to_i)
+ date = Date.new(y.to_i, m.to_i, d.to_i)
+ month = date.month.to_s.rjust(2, "0")
+ day = date.day.to_s.rjust(2, "0")
+ return_date = "#{date.year}-#{month}-#{day}"
+ end
+ end
+ end
+ return_date
+ end
+
# get_directory_files
# Note: do not end with /
# params: directory (string)
@@ -14,10 +54,10 @@ def self.get_directory_files(directory, verbose_flag=false)
exists = File.directory?(directory)
if exists
files = Dir["#{directory}/*"] # grab all the files inside that directory
- return files
+ files
else
puts "Unable to find a directory at #{directory}" if verbose_flag
- return nil
+ nil
end
end
# end get_directory_files
@@ -30,14 +70,14 @@ def self.get_input(original_input, msg)
puts "#{msg}: \n"
new_input = STDIN.gets.chomp
if !new_input.nil? && new_input.length > 0
- return new_input
+ new_input
else
# keep bugging the user until they answer or despair
puts "Please enter a valid response"
get_input(nil, msg)
end
else
- return original_input
+ original_input
end
end
@@ -55,6 +95,23 @@ def self.make_dirs(*args)
FileUtils.mkdir_p(args)
end
+ # normalize_name
+ # lowercase and remove articles from front
+ def self.normalize_name(abnormal)
+ down = abnormal.downcase
+ down.gsub(/^the |^a |^an /, "")
+ end
+
+ # normalize_space
+ # imitates xslt fn:normalize-space
+ # removes leading / trailing whitespace, newlines, repeating whitespace, etc
+ def self.normalize_space(abnormal)
+ if abnormal
+ normal = abnormal.strip.gsub(/\s+/, " ")
+ end
+ normal || abnormal
+ end
+
# regex_files
# looks through a directory's files for those matching the regex
# params: files (array of file names), regex (regular expression)
@@ -79,11 +136,11 @@ def self.regex_files(files, regex=nil)
def self.should_update?(file, since_date=nil)
if since_date.nil?
# if there is no specified date, then update everything
- return true
+ true
else
# if a file has been updated since a time specified by user
file_date = File.mtime(file)
- return file_date > since_date
+ file_date > since_date
end
end
diff --git a/lib/datura/options.rb b/lib/datura/options.rb
index 25ce6b352..36d4e47e2 100644
--- a/lib/datura/options.rb
+++ b/lib/datura/options.rb
@@ -70,7 +70,7 @@ def remove_environments(config)
end
end
end
- return new_config
+ new_config
end
# remove the unneeded environment and put everything at the first level
@@ -85,7 +85,7 @@ def smash_configs
collection = c.merge(d)
# collection overrides general config
- return general.merge(collection)
+ general.merge(collection)
end
end
diff --git a/lib/datura/parser.rb b/lib/datura/parser.rb
index 8b5655c52..b66fcc4d9 100644
--- a/lib/datura/parser.rb
+++ b/lib/datura/parser.rb
@@ -25,7 +25,7 @@ def self.argv_collection_dir(argv)
puts @usage
exit
end
- return collection_dir
+ collection_dir
end
# take a string in utc and create a time object with it
diff --git a/lib/datura/parser_options/post.rb b/lib/datura/parser_options/post.rb
index 6f52cf2ad..daa9b7408 100644
--- a/lib/datura/parser_options/post.rb
+++ b/lib/datura/parser_options/post.rb
@@ -22,14 +22,16 @@ def self.post_params
# default to no restricted format
options["format"] = nil
- opts.on( '-f', '--format [input]', 'Restrict to one format (csv, html, tei, vra, webs)') do |input|
- if %w[csv html tei vra webs].include?(input)
- options["format"] = input
- else
- puts "Format #{input} is not recognized.".red
- puts "Allowed formats are csv, html, tei, vra, and webs (web-scraped html)"
+ opts.on( '-f', '--format [input]', 'Supported formats (csv, html, tei, vra, webs)') do |input|
+ if %w[authority annotations].include?(input)
+ puts "'authority' and 'annotations' are invalid formats".red
+ puts "Please select a supported format or rename your custom format"
exit
+ elsif !%w[csv html tei vra webs].include?(input)
+ puts "Caution: Requested custom format #{input}.".red
+ puts "See FileCustom class for implementation instructions"
end
+ options["format"] = input
end
options["commit"] = true
@@ -86,6 +88,6 @@ def self.post_params
# magic
optparse.parse!
- return options
+ options
end
end
diff --git a/lib/datura/parser_options/solr_create_api_ore.rb b/lib/datura/parser_options/solr_create_api_core.rb
similarity index 97%
rename from lib/datura/parser_options/solr_create_api_ore.rb
rename to lib/datura/parser_options/solr_create_api_core.rb
index 41bacb101..134e45707 100644
--- a/lib/datura/parser_options/solr_create_api_ore.rb
+++ b/lib/datura/parser_options/solr_create_api_core.rb
@@ -28,6 +28,6 @@ def self.solr_create_api_core_params
exit
end
- return options
+ options
end
end
diff --git a/lib/datura/parser_options/solr_manage_schema.rb b/lib/datura/parser_options/solr_manage_schema.rb
index 605921082..0721b693b 100644
--- a/lib/datura/parser_options/solr_manage_schema.rb
+++ b/lib/datura/parser_options/solr_manage_schema.rb
@@ -32,6 +32,6 @@ def self.solr_manage_schema_params
optparse.parse!
- return options
+ options
end
end
diff --git a/lib/datura/requirer.rb b/lib/datura/requirer.rb
index b50190822..75c7bb247 100644
--- a/lib/datura/requirer.rb
+++ b/lib/datura/requirer.rb
@@ -5,17 +5,11 @@
current_dir = File.expand_path(File.dirname(__FILE__))
-require_relative "to_es/html_to_es.rb"
+require_relative "to_es/es_request.rb"
-require_relative "to_es/tei_to_es.rb"
-require_relative "to_es/tei_to_es/tei_to_es_personography.rb"
-
-require_relative "to_es/webs_to_es.rb"
-
-require_relative "to_es/vra_to_es.rb"
-require_relative "to_es/vra_to_es/vra_to_es_personography.rb"
-
-# Dir["#{current_dir}/tei_to_es/*.rb"].each {|f| require f }
+# x_to_es classes
+Dir["#{current_dir}/to_es/*.rb"].each { |f| require f }
+Dir["#{current_dir}/to_es/**/*.rb"].each { |f| require f }
# file types
-Dir["#{current_dir}/file_types/*.rb"].each {|f| require f }
+Dir["#{current_dir}/file_types/*.rb"].each { |f| require f }
diff --git a/lib/datura/solr_poster.rb b/lib/datura/solr_poster.rb
index 71066d8b4..eb4434a88 100644
--- a/lib/datura/solr_poster.rb
+++ b/lib/datura/solr_poster.rb
@@ -23,7 +23,7 @@ def clear_index
else
puts "Unable to clear index!"
end
- return res
+ res
end
def clear_index_by_regex(field, regex)
@@ -37,7 +37,7 @@ def clear_index_by_regex(field, regex)
else
puts "Unable to clear files from index!"
end
- return res
+ res
end
# returns an error or nil
@@ -49,7 +49,7 @@ def commit_solr
puts "UNABLE TO COMMIT YOUR CHANGES TO SOLR. Please commit manually"
end
end
- return commit_res
+ commit_res
end
def post(content, type)
@@ -60,7 +60,7 @@ def post(content, type)
request = Net::HTTP::Post.new(url.request_uri)
request.body = content
request["Content-Type"] = type
- return http.request(request)
+ http.request(request)
end
# post_file
@@ -68,7 +68,7 @@ def post(content, type)
# TODO refactor?
def post_file(file_location)
file = IO.read(file_location)
- return post_xml(file)
+ post_xml(file)
end
# post_json
@@ -91,7 +91,7 @@ def post_xml(content)
if content.nil? || content.empty?
puts "Missing content to index to Solr. Please check that files are"
puts "available to be converted to Solr format and that they were transformed."
- return nil
+ nil
else
post(content, "application/xml")
end
diff --git a/lib/datura/to_es/csv_to_es.rb b/lib/datura/to_es/csv_to_es.rb
new file mode 100644
index 000000000..cf92fec0f
--- /dev/null
+++ b/lib/datura/to_es/csv_to_es.rb
@@ -0,0 +1,54 @@
+require_relative "../helpers.rb"
+require_relative "csv_to_es/fields.rb"
+require_relative "csv_to_es/request.rb"
+
+#########################################
+# NOTE: DO NOT EDIT THIS FILE!!!!!!!!! #
+#########################################
+# (unless you are a CDRH dev and then you may do so very cautiously)
+# this file provides defaults for ALL of the collections included
+# in the API and changing it could alter dozens of sites unexpectedly!
+# PLEASE RUN LOADS OF TESTS AFTER A CHANGE BEFORE PUSHING TO PRODUCTION
+
+# WHAT IS THIS FILE?
+# This file sets up default behavior for transforming CSV
+# documents to Elasticsearch JSON documents
+
+class CsvToEs
+
+ attr_reader :json, :row, :csv
+ # variables
+ # id, row, csv, options
+
+ def initialize(row, options={}, csv=nil, filename=nil)
+ @row = row
+ @options = options
+ @csv = csv
+ @filename = filename
+ @id = get_id
+
+ create_json
+ end
+
+ # getter for @json response object
+ def create_json
+ @json = {}
+ # if anything needs to be done before processing
+ # do it here (ex: reading in annotations into memory)
+ preprocessing
+ assemble_json
+ postprocessing
+ end
+
+ def get_id
+ @row["id"] || @row["identifier"] || nil
+ end
+
+ def preprocessing
+ # copy this in your csv_to_es collection file to customize
+ end
+
+ def postprocessing
+ # copy this in your csv_to_es collection file to customize
+ end
+end
diff --git a/lib/datura/to_es/csv_to_es/fields.rb b/lib/datura/to_es/csv_to_es/fields.rb
new file mode 100644
index 000000000..96e26db2e
--- /dev/null
+++ b/lib/datura/to_es/csv_to_es/fields.rb
@@ -0,0 +1,187 @@
+class CsvToEs
+ # Note to add custom fields, use "assemble_collection_specific" from request.rb
+ # and be sure to either use the _d, _i, _k, or _t to use the correct field type
+
+ ##########
+ # FIELDS #
+ ##########
+ def id
+ @id
+ end
+
+ def id_dc
+ "https://cdrhapi.unl.edu/doc/#{@id}"
+ end
+
+ def annotations_text
+ # TODO what should default behavior be?
+ end
+
+ def category
+ @row["category"]
+ end
+
+ # nested field
+ def creator
+ # TODO
+ end
+
+ # returns ; delineated string of alphabetized creators
+ def creator_sort
+ # TODO
+ end
+
+ def collection
+ @options["collection"]
+ end
+
+ def collection_desc
+ @options["collection_desc"] || @options["collection"]
+ end
+
+ def contributor
+ # TODO
+ end
+
+ def data_type
+ "csv"
+ end
+
+ def date(before=true)
+ Datura::Helpers.date_standardize(@row["date"], before)
+ end
+
+ def date_display
+ Datura::Helpers.date_display(date)
+ end
+
+ def date_not_after
+ date(false)
+ end
+
+ def date_not_before
+ date(true)
+ end
+
+ def description
+ # Note: override per collection as needed
+ end
+
+ def format
+ @row["format"]
+ end
+
+ def image_id
+ # TODO
+ end
+
+ def keywords
+ # TODO
+ end
+
+ def language
+ # TODO
+ end
+
+ def languages
+ # TODO
+ end
+
+ def medium
+ # Default behavior is the same as "format" method
+ format
+ end
+
+ def person
+ # TODO
+ end
+
+ def people
+ # TODO
+ end
+
+ def places
+ # TODO
+ end
+
+ def publisher
+ # TODO
+ end
+
+ def recipient
+ # TODO
+ end
+
+ def rights
+ # Note: override by collection as needed
+ "All Rights Reserved"
+ end
+
+ def rights_holder
+ # TODO
+ end
+
+ def rights_uri
+ # TODO
+ end
+
+ def source
+ @row["source"]
+ end
+
+ def subjects
+ # TODO
+ end
+
+ def subcategory
+ @row["subcategory"]
+ end
+
+ # text is generally going to be pulled from the row's "text" column
+ def text
+ text_all = [ @row["text"] ]
+
+ text_all += text_additional
+ text_all = text_all.compact
+ Datura::Helpers.normalize_space(text_all.join(" "))
+ end
+
+ # override and add by collection as needed
+ def text_additional
+ [ title ]
+ end
+
+ def title
+ @row["title"]
+ end
+
+ def title_sort
+ Datura::Helpers.normalize_name(title) if title
+ end
+
+ def topics
+ @row["topics"]
+ end
+
+ def uri
+ # override per collection
+ # should point at the live website view of resource
+ end
+
+ def uri_data
+ base = @options["data_base"]
+ subpath = "data/#{@options["collection"]}/source/csv"
+ "#{base}/#{subpath}/#{@filename}.csv"
+ end
+
+ def uri_html
+ base = @options["data_base"]
+ subpath = "data/#{@options["collection"]}/output/#{@options["environment"]}/html"
+ "#{base}/#{subpath}/#{@id}.html"
+ end
+
+ def works
+ @row["works"]
+ end
+
+end
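A sketch of the collection-override pattern these comments describe: a collection's own csv_to_es file reopens `CsvToEs` and redefines or adds field methods. This is a self-contained stand-in, not the gem's real constructor; "price" and "notes" are hypothetical CSV columns, and the `_i`/`_t` suffixes mark the Elasticsearch field type.

```ruby
# Stand-in mirroring the CsvToEs field pattern: methods read from @row,
# and assemble_collection_specific adds typed custom fields to @json.
class CsvToEs
  def initialize(row)
    @row = row
    @json = {}
  end

  # override of the default no-op description field
  def description
    @row["description"]
  end

  # collection-specific fields must carry a _d, _i, _k, or _t suffix
  def assemble_collection_specific
    @json["price_i"] = @row["price"].to_i if @row["price"]
    @json["notes_t"] = @row["notes"]
    @json
  end
end
```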
diff --git a/lib/datura/to_es/csv_to_es/request.rb b/lib/datura/to_es/csv_to_es/request.rb
new file mode 100644
index 000000000..b361f3f04
--- /dev/null
+++ b/lib/datura/to_es/csv_to_es/request.rb
@@ -0,0 +1,8 @@
+class CsvToEs
+ include EsRequest
+
+ # please refer to the generic es_request.rb file
+ # and override the JSON being sent to Elasticsearch here, if needed
+ # project specific overrides should go in the COLLECTION's overrides!
+
+end
diff --git a/lib/datura/to_es/custom_to_es.rb b/lib/datura/to_es/custom_to_es.rb
new file mode 100644
index 000000000..fe39b5690
--- /dev/null
+++ b/lib/datura/to_es/custom_to_es.rb
@@ -0,0 +1,57 @@
+require_relative "../helpers.rb"
+require_relative "custom_to_es/fields.rb"
+require_relative "custom_to_es/request.rb"
+
+#########################################
+# NOTE: DO NOT EDIT THIS FILE!!!!!!!!! #
+#########################################
+# (unless you are a CDRH dev and then you may do so very cautiously)
+# this file provides defaults for ALL of the collections included
+# in the API and changing it could alter dozens of sites unexpectedly!
+# PLEASE RUN LOADS OF TESTS AFTER A CHANGE BEFORE PUSHING TO PRODUCTION
+
+# WHAT IS THIS FILE?
+# This file sets up default behavior for transforming custom
+# documents to Elasticsearch JSON documents
+
+class CustomToEs
+
+ attr_reader :json, :item, :file_type
+
+ def initialize(item, options: {}, file: nil, filename: nil, file_type: nil)
+ @item = item
+ @options = options
+ # behaves similarly to parent_xml in that it represents
+ # the entire file, whereas item MAY represent a portion
+ # of a file (as is the case with a csv row, personography
+ # //person path, etc)
+ @file = file
+ @filename = filename
+ @file_type = file_type
+ @id = get_id
+
+ create_json
+ end
+
+ # builds and populates the @json response object
+ def create_json
+ @json = {}
+ # if anything needs to be done before processing
+ # do it here (ex: reading in annotations into memory)
+ preprocessing
+ assemble_json
+ postprocessing
+ end
+
+ def get_id
+ nil
+ end
+
+ def preprocessing
+ # copy this in your custom_to_es collection file to customize
+ end
+
+ def postprocessing
+ # copy this in your custom_to_es collection file to customize
+ end
+end
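The preprocessing/postprocessing hooks above are no-ops by default; a collection copies them into its own custom_to_es file to customize the pipeline around `assemble_json`. A self-contained stand-in of that flow (simplified constructor; the whitespace-trimming override is a hypothetical example):

```ruby
# Stand-in for the hook flow: create_json runs preprocessing,
# assemble_json, then postprocessing.
class CustomToEs
  attr_reader :json

  def initialize(item)
    @item = item
    create_json
  end

  def create_json
    @json = {}
    preprocessing
    assemble_json
    postprocessing
  end

  def assemble_json
    @json["title"] = @item["title"]
  end

  # defaults: collections override these in their own file
  def preprocessing; end
  def postprocessing; end
end

# hypothetical collection override: clean up values after assembly
class CustomToEs
  def postprocessing
    @json.transform_values!(&:strip)
  end
end
```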
diff --git a/lib/datura/to_es/custom_to_es/fields.rb b/lib/datura/to_es/custom_to_es/fields.rb
new file mode 100644
index 000000000..a0d068308
--- /dev/null
+++ b/lib/datura/to_es/custom_to_es/fields.rb
@@ -0,0 +1,186 @@
+class CustomToEs
+ # Note: to add custom fields, use "assemble_collection_specific" from request.rb
+ # and be sure to use the _d, _i, _k, or _t suffix for the correct field type
+
+ ##########
+ # FIELDS #
+ ##########
+ def id
+ @id
+ end
+
+ def id_dc
+ "https://cdrhapi.unl.edu/doc/#{@id}"
+ end
+
+ def annotations_text
+ # TODO what should default behavior be?
+ end
+
+ def category
+ # TODO
+ end
+
+ # nested field
+ def creator
+ # TODO
+ end
+
+ # returns a ";"-delimited string of alphabetized creators
+ def creator_sort
+ # TODO
+ end
+
+ def collection
+ @options["collection"]
+ end
+
+ def collection_desc
+ @options["collection_desc"] || @options["collection"]
+ end
+
+ def contributor
+ # TODO
+ end
+
+ def data_type
+ @file_type
+ end
+
+ def date(before=true)
+ # TODO
+ # Datura::Helpers.date_standardize(??, before)
+ end
+
+ def date_display
+ Datura::Helpers.date_display(date) if date
+ end
+
+ def date_not_after
+ date(false)
+ end
+
+ def date_not_before
+ date(true)
+ end
+
+ def description
+ # Note: override per collection as needed
+ end
+
+ def format
+ # TODO
+ end
+
+ def image_id
+ # TODO
+ end
+
+ def keywords
+ # TODO
+ end
+
+ def language
+ # TODO
+ end
+
+ def languages
+ # TODO
+ end
+
+ def medium
+ # Default behavior is the same as the "format" method
+ format
+ end
+
+ def person
+ # TODO
+ end
+
+ def people
+ # TODO
+ end
+
+ def places
+ # TODO
+ end
+
+ def publisher
+ # TODO
+ end
+
+ def recipient
+ # TODO
+ end
+
+ def rights
+ # Note: override by collection as needed
+ "All Rights Reserved"
+ end
+
+ def rights_holder
+ # TODO
+ end
+
+ def rights_uri
+ # TODO
+ end
+
+ def source
+ # TODO
+ end
+
+ def subjects
+ # TODO
+ end
+
+ def subcategory
+ # TODO
+ end
+
+ # text is generally going to be pulled from the source item
+ def text
+ # TODO
+ # get text, add text_additional
+ # Datura::Helpers.normalize_space(your_text.join(" ")))
+ end
+
+ # override and add by collection as needed
+ def text_additional
+ [ title ]
+ end
+
+ def title
+ # TODO
+ end
+
+ def title_sort
+ Datura::Helpers.normalize_name(title) if title
+ end
+
+ def topics
+ # TODO
+ end
+
+ def uri
+ # override per collection
+ # should point at the live website view of resource
+ end
+
+ def uri_data
+ base = @options["data_base"]
+ subpath = "data/#{@options["collection"]}/source/#{@file_type}"
+ "#{base}/#{subpath}/#{@filename}"
+ end
+
+ def uri_html
+ base = @options["data_base"]
+ subpath = "data/#{@options["collection"]}/output/#{@options["environment"]}/html"
+ "#{base}/#{subpath}/#{@id}.html"
+ end
+
+ def works
+ # TODO
+ end
+
+end
diff --git a/lib/datura/to_es/custom_to_es/request.rb b/lib/datura/to_es/custom_to_es/request.rb
new file mode 100644
index 000000000..804ddcfd3
--- /dev/null
+++ b/lib/datura/to_es/custom_to_es/request.rb
@@ -0,0 +1,7 @@
+class CustomToEs
+ include EsRequest
+ # please refer to the generic es_request.rb file
+ # and override the JSON being sent to Elasticsearch here, if needed
+ # project specific overrides should go in the COLLECTION's overrides!
+
+end
diff --git a/lib/datura/to_es/xml_to_es_request.rb b/lib/datura/to_es/es_request.rb
similarity index 87%
rename from lib/datura/to_es/xml_to_es_request.rb
rename to lib/datura/to_es/es_request.rb
index 051ae5b3f..89dc35625 100644
--- a/lib/datura/to_es/xml_to_es_request.rb
+++ b/lib/datura/to_es/es_request.rb
@@ -1,8 +1,14 @@
-# request creation portion of Xml to ES transformation
-# override for VRA / TEI concerns in [type]_to_es.rb
-# files or in collection specific overrides
-
-class XmlToEs
+# assemble_json sets up the JSON structure that will be
+# used to create Elasticsearch documents. However, the JSON
+# structure depends on the including classes to define methods
+# like "category" and "subcategory" to populate the JSON.
+#
+# This module itself is not standalone, but by putting
+# the JSON structure in a common place, those classes
+# including it do not each need to redefine the JSON
+# request structure
+
+module EsRequest
def assemble_json
# Note: if your collection does not require a specific field
@@ -27,7 +33,7 @@ def assemble_json
assemble_text
assemble_collection_specific
- return @json
+ @json
end
##############
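The mixin design the new comment describes can be sketched as follows: the module owns the shared JSON skeleton, and each including class supplies the field methods it calls. `EsRequestSketch` and `CsvDoc` are made-up names, and the real `assemble_json` covers many more fields.

```ruby
# The module is not standalone: it calls methods the host class must define.
module EsRequestSketch
  def assemble_json
    @json ||= {}
    @json["category"] = category
    @json["subcategory"] = subcategory
    @json
  end
end

# each including class (CsvToEs, CustomToEs, XmlToEs, ...) provides the fields
class CsvDoc
  include EsRequestSketch

  def category
    "writings"
  end

  def subcategory
    "letters"
  end
end
```

This is why the request.rb files are nearly empty: they only include the module and leave room for overrides.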
diff --git a/lib/datura/to_es/html_to_es/fields.rb b/lib/datura/to_es/html_to_es/fields.rb
index 9852b5c4f..94c35db68 100644
--- a/lib/datura/to_es/html_to_es/fields.rb
+++ b/lib/datura/to_es/html_to_es/fields.rb
@@ -137,20 +137,18 @@ def subcategory
def text
# handling separate fields in array
# means no worrying about handling spacing between words
- text = []
+ text_all = []
body = get_text(@xpaths["text"], false)
- text << body
- text += text_additional
- return CommonXml.normalize_space(text.join(" "))
+ text_all << body
+ text_all += text_additional
+ Datura::Helpers.normalize_space(text_all.join(" "))
end
def text_additional
# Note: Override this per collection if you need additional
# searchable fields or information for collections
# just make sure you return an array at the end!
-
- text = []
- text << title
+ [ title ]
end
def title
@@ -158,8 +156,7 @@ def title
end
def title_sort
- t = title
- CommonXml.normalize_name(t)
+ Datura::Helpers.normalize_name(title)
end
def topics
@@ -172,9 +169,7 @@ def uri
end
def uri_data
- base = @options["data_base"]
- subpath = "data/#{@options["collection"]}/tei"
- "#{base}/#{subpath}/#{@id}.xml"
+ # TODO per repository
end
def uri_html
diff --git a/lib/datura/to_es/html_to_es/request.rb b/lib/datura/to_es/html_to_es/request.rb
index e33f905b6..d8929f0d8 100644
--- a/lib/datura/to_es/html_to_es/request.rb
+++ b/lib/datura/to_es/html_to_es/request.rb
@@ -1,7 +1,7 @@
class HtmlToEs < XmlToEs
- # please refer to generic xml to es request file, request.rb
- # and override methods specific to HTML transformation here
+ # please refer to the generic es_request.rb file
+ # and override the JSON being sent to Elasticsearch here, if needed
# project specific overrides should go in the COLLECTION's overrides!
end
diff --git a/lib/datura/to_es/tei_to_es/fields.rb b/lib/datura/to_es/tei_to_es/fields.rb
index 6c80601d2..b33596771 100644
--- a/lib/datura/to_es/tei_to_es/fields.rb
+++ b/lib/datura/to_es/tei_to_es/fields.rb
@@ -20,19 +20,19 @@ def annotations_text
end
def category
- category = get_text(@xpaths["category"])
- return category.length > 0 ? CommonXml.normalize_space(category) : "none"
+ cat = get_text(@xpaths["category"])
+ cat.length > 0 ? Datura::Helpers.normalize_space(cat) : "none"
end
# note this does not sort the creators
def creator
creators = get_list(@xpaths["creators"])
- return creators.map { |creator| { "name" => CommonXml.normalize_space(creator) } }
+ creators.map { |c| { "name" => Datura::Helpers.normalize_space(c) } }
end
# returns ; delineated string of alphabetized creators
def creator_sort
- return get_text(@xpaths["creators"])
+ get_text(@xpaths["creators"])
end
def collection
@@ -50,8 +50,8 @@ def contributor
eles.each do |ele|
contribs << {
"id" => ele["id"],
- "name" => CommonXml.normalize_space(ele.text),
- "role" => CommonXml.normalize_space(ele["role"])
+ "name" => Datura::Helpers.normalize_space(ele.text),
+ "role" => Datura::Helpers.normalize_space(ele["role"])
}
end
end
@@ -64,11 +64,11 @@ def data_type
def date(before=true)
datestr = get_text(@xpaths["date"])
- return CommonXml.date_standardize(datestr, before)
+ Datura::Helpers.date_standardize(datestr, before)
end
def date_display
- date = get_text(@xpaths["date_display"])
+ get_text(@xpaths["date_display"])
end
def date_not_after
@@ -121,22 +121,21 @@ def person
# and put in the xpaths above, also for attributes, etc
# should contain name, id, and role
eles = @xml.xpath(@xpaths["person"])
- people = eles.map do |p|
+ eles.map do |p|
{
"id" => "",
- "name" => CommonXml.normalize_space(p.text),
- "role" => CommonXml.normalize_space(p["role"])
+ "name" => Datura::Helpers.normalize_space(p.text),
+ "role" => Datura::Helpers.normalize_space(p["role"])
}
end
- return people
end
def people
- @json["person"].map { |p| CommonXml.normalize_space(p["name"]) }
+ @json["person"].map { |p| Datura::Helpers.normalize_space(p["name"]) }
end
def places
- return get_list(@xpaths["places"])
+ get_list(@xpaths["places"])
end
def publisher
@@ -145,14 +144,13 @@ def publisher
def recipient
eles = @xml.xpath(@xpaths["recipient"])
- people = eles.map do |p|
+ eles.map do |p|
{
"id" => "",
- "name" => CommonXml.normalize_space(p.text),
+ "name" => Datura::Helpers.normalize_space(p.text),
"role" => "recipient"
}
end
- return people
end
def rights
@@ -179,20 +177,20 @@ def subjects
end
def subcategory
- subcategory = get_text(@xpaths["subcategory"])
- subcategory.length > 0 ? subcategory : "none"
+ subcat = get_text(@xpaths["subcategory"])
+ subcat.length > 0 ? subcat : "none"
end
def text
# handling separate fields in array
# means no worrying about handling spacing between words
- text = []
+ text_all = []
body = get_text(@xpaths["text"], false)
- text << body
+ text_all << body
# TODO: do we need to preserve tags like in text? if so, turn get_text to true
- # text << CommonXml.convert_tags_in_string(body)
- text += text_additional
- return CommonXml.normalize_space(text.join(" "))
+ # text_all << CommonXml.convert_tags_in_string(body)
+ text_all += text_additional
+ Datura::Helpers.normalize_space(text_all.join(" "))
end
def text_additional
@@ -200,21 +198,19 @@ def text_additional
# searchable fields or information for collections
# just make sure you return an array at the end!
- text = []
- text << title
+ [ title ]
end
def title
- title = get_text(@xpaths["titles"]["main"])
- if title.empty?
- title = get_text(@xpaths["titles"]["alt"])
+ title_disp = get_text(@xpaths["titles"]["main"])
+ if title_disp.empty?
+ title_disp = get_text(@xpaths["titles"]["alt"])
end
- return title
+ title_disp
end
def title_sort
- t = title
- CommonXml.normalize_name(t)
+ Datura::Helpers.normalize_name(title)
end
def topics
@@ -229,13 +225,13 @@ def uri
def uri_data
base = @options["data_base"]
subpath = "data/#{@options["collection"]}/source/tei"
- return "#{base}/#{subpath}/#{@id}.xml"
+ "#{base}/#{subpath}/#{@id}.xml"
end
def uri_html
base = @options["data_base"]
subpath = "data/#{@options["collection"]}/output/#{@options["environment"]}/html"
- return "#{base}/#{subpath}/#{@id}.html"
+ "#{base}/#{subpath}/#{@id}.html"
end
def works
diff --git a/lib/datura/to_es/tei_to_es/request.rb b/lib/datura/to_es/tei_to_es/request.rb
index 14f3b7438..c416d8bba 100644
--- a/lib/datura/to_es/tei_to_es/request.rb
+++ b/lib/datura/to_es/tei_to_es/request.rb
@@ -1,6 +1,6 @@
class TeiToEs < XmlToEs
- # please refer to generic xml to es request file, request.rb
+ # please refer to the generic es_request.rb file
# and override methods specific to TEI transformation here
# project specific overrides should go in the COLLECTION's overrides!
diff --git a/lib/datura/to_es/tei_to_es/tei_to_es_personography.rb b/lib/datura/to_es/tei_to_es/tei_to_es_personography.rb
index 81105ec64..7e4ff79be 100644
--- a/lib/datura/to_es/tei_to_es/tei_to_es_personography.rb
+++ b/lib/datura/to_es/tei_to_es/tei_to_es_personography.rb
@@ -1,7 +1,7 @@
class TeiToEsPersonography < TeiToEs
def override_xpaths
- return {
+ {
"titles" => {
"main" => "persName[@type='display']",
"alt" => "persName"
@@ -16,16 +16,16 @@ def category
def creator
creators = get_list(@xpaths["creators"], false, @parent_xml)
- return creators.map { |creator| { "name" => creator } }
+ creators.map { |c| { "name" => c } }
end
def creators
- return get_text(@xpaths["creators"], false, @parent_xml)
+ get_text(@xpaths["creators"], false, @parent_xml)
end
def get_id
person = @xml["id"]
- return "#{@filename}_#{person}"
+ "#{@filename}_#{person}"
end
def person
diff --git a/lib/datura/to_es/vra_to_es/fields.rb b/lib/datura/to_es/vra_to_es/fields.rb
index 2ca37a155..ed3d2e45e 100644
--- a/lib/datura/to_es/vra_to_es/fields.rb
+++ b/lib/datura/to_es/vra_to_es/fields.rb
@@ -26,12 +26,12 @@ def category
# note this does not sort the creators
def creator
creators = get_list(@xpaths["creators"])
- return creators.map { |creator| { "name" => CommonXml.normalize_space(creator) } }
+ creators.map { |c| { "name" => Datura::Helpers.normalize_space(c) } }
end
# returns ; delineated string of alphabetized creators
def creator_sort
- return get_text(@xpaths["creators"])
+ get_text(@xpaths["creators"])
end
def collection
@@ -48,11 +48,11 @@ def contributor
contributors.each do |ele|
contrib_list << {
"id" => "",
- "name" => CommonXml.normalize_space(ele.xpath("name").text),
- "role" => CommonXml.normalize_space(ele.xpath("role").text)
+ "name" => Datura::Helpers.normalize_space(ele.xpath("name").text),
+ "role" => Datura::Helpers.normalize_space(ele.xpath("role").text)
}
end
- return contrib_list
+ contrib_list
end
def data_type
@@ -61,7 +61,7 @@ def data_type
def date(before=true)
datestr = get_text(@xpaths["dates"]["earliest"])
- CommonXml.date_standardize(datestr, before)
+ Datura::Helpers.date_standardize(datestr, before)
end
def date_display
@@ -112,17 +112,17 @@ def person
# and put in the xpaths above, also for attributes, etc
# should contain name, id, and role
eles = @xml.xpath(@xpaths["person"])
- return eles.map do |p|
+ eles.map do |p|
{
"id" => "",
- "name" => CommonXml.normalize_space(p.text),
- "role" => CommonXml.normalize_space(p["role"])
+ "name" => Datura::Helpers.normalize_space(p.text),
+ "role" => Datura::Helpers.normalize_space(p["role"])
}
end
end
def people
- @json["person"].map { |p| CommonXml.normalize_space(p["name"]) }
+ @json["person"].map { |p| Datura::Helpers.normalize_space(p["name"]) }
end
def places
@@ -135,14 +135,13 @@ def publisher
def recipient
eles = @xml.xpath(@xpaths["recipient"])
- people = eles.map do |p|
+ eles.map do |p|
{
"id" => "",
- "name" => CommonXml.normalize_space(p.text),
- "role" => CommonXml.normalize_space(p["role"]),
+ "name" => Datura::Helpers.normalize_space(p.text),
+ "role" => Datura::Helpers.normalize_space(p["role"]),
}
end
- return people
end
def rights
@@ -175,12 +174,12 @@ def subjects
def text
# handling separate fields in array
# means no worrying about handling spacing between words
- text = []
- text << get_text(@xpaths["text"], false)
+ text_all = []
+ text_all << get_text(@xpaths["text"], false)
# TODO: do we need to preserve tags like in text? if so, turn get_text to true
- # text << CommonXml.convert_tags_in_string(body)
- text += text_additional
- return CommonXml.normalize_space(text.join(" "))
+ # text_all << CommonXml.convert_tags_in_string(body)
+ text_all += text_additional
+ Datura::Helpers.normalize_space(text_all.join(" "))
end
def text_additional
@@ -188,8 +187,7 @@ def text_additional
# searchable fields or information for collections
# just make sure you return an array at the end!
- text = []
- text << title
+ [ title ]
end
def title
@@ -197,8 +195,7 @@ def title
end
def title_sort
- t = title
- CommonXml.normalize_name(t)
+ Datura::Helpers.normalize_name(title)
end
def topics
@@ -213,13 +210,13 @@ def uri
def uri_data
base = @options["data_base"]
subpath = "data/#{@options["collection"]}/source/vra"
- return "#{base}/#{subpath}/#{@id}.xml"
+ "#{base}/#{subpath}/#{@id}.xml"
end
def uri_html
base = @options["data_base"]
subpath = "data/#{@options["collection"]}/output/#{@options["environment"]}/html"
- return "#{base}/#{subpath}/#{@id}.html"
+ "#{base}/#{subpath}/#{@id}.html"
end
def works
diff --git a/lib/datura/to_es/vra_to_es/request.rb b/lib/datura/to_es/vra_to_es/request.rb
index 974ac6be7..e8d0c1b69 100644
--- a/lib/datura/to_es/vra_to_es/request.rb
+++ b/lib/datura/to_es/vra_to_es/request.rb
@@ -1,7 +1,7 @@
class VraToEs < XmlToEs
- # please refer to generic xml to es request file, request.rb
- # and override methods specific to VRA transformation here
+ # please refer to the generic es_request.rb file
+ # and override the JSON being sent to Elasticsearch here, if needed
# project specific overrides should go in the COLLECTION's overrides!
end
diff --git a/lib/datura/to_es/vra_to_es/vra_to_es_personography.rb b/lib/datura/to_es/vra_to_es/vra_to_es_personography.rb
index 9ae5d718c..e4fd6f442 100644
--- a/lib/datura/to_es/vra_to_es/vra_to_es_personography.rb
+++ b/lib/datura/to_es/vra_to_es/vra_to_es_personography.rb
@@ -1,7 +1,7 @@
class VraToEsPersonography < TeiToEs
def override_xpaths
- return {
+ {
"titles" => {
"main" => "persName[@type='display']",
"alt" => "persName"
@@ -16,16 +16,16 @@ def category
def creator
creators = get_list(@xpaths["creators"], false, @parent_xml)
- return creators.map { |creator| { "name" => creator } }
+ creators.map { |c| { "name" => c } }
end
def creator_sort
- return get_text(@xpaths["creators"], false, @parent_xml)
+ get_text(@xpaths["creators"], false, @parent_xml)
end
def get_id
person = @xml["id"]
- return "#{@filename}_#{person}"
+ "#{@filename}_#{person}"
end
def person
diff --git a/lib/datura/to_es/webs_to_es/fields.rb b/lib/datura/to_es/webs_to_es/fields.rb
index 88a0cf760..aff7d27eb 100644
--- a/lib/datura/to_es/webs_to_es/fields.rb
+++ b/lib/datura/to_es/webs_to_es/fields.rb
@@ -149,7 +149,7 @@ def text
body = get_text(@xpaths["text"], false)
text << body
text += text_additional
- return CommonXml.normalize_space(text.join(" "))
+ return Datura::Helpers.normalize_space(text.join(" "))
end
def text_additional
@@ -167,7 +167,7 @@ def title
def title_sort
t = title
- CommonXml.normalize_name(t)
+ Datura::Helpers.normalize_name(t)
end
def topics
diff --git a/lib/datura/to_es/webs_to_es/request.rb b/lib/datura/to_es/webs_to_es/request.rb
index af67228dc..330795d28 100644
--- a/lib/datura/to_es/webs_to_es/request.rb
+++ b/lib/datura/to_es/webs_to_es/request.rb
@@ -1,6 +1,6 @@
class WebsToEs < XmlToEs
- # please refer to generic xml to es request file, request.rb
+ # please refer to the generic es_request.rb file
# and override methods specific to Web Scraped HTML transformation here
# project specific overrides should go in the COLLECTION's overrides!
diff --git a/lib/datura/to_es/xml_to_es.rb b/lib/datura/to_es/xml_to_es.rb
index 38aec2bc9..19324853c 100644
--- a/lib/datura/to_es/xml_to_es.rb
+++ b/lib/datura/to_es/xml_to_es.rb
@@ -1,7 +1,6 @@
require "nokogiri"
require_relative "../helpers.rb"
require_relative "../common_xml.rb"
-require_relative "xml_to_es_request.rb"
#########################################
# NOTE: DO NOT EDIT THIS FILE!!!!!!!!! #
@@ -20,6 +19,7 @@
# about altering their behavior, customizing xpaths, etc
class XmlToEs
+ include EsRequest
attr_reader :json, :xml
# variables
@@ -51,7 +51,7 @@ def create_json
end
def get_id
- return @filename
+ @filename
end
def override_xpaths
@@ -74,7 +74,7 @@ def override_xpaths
# returns an array with the html value in xpath
def get_list(xpaths, keep_tags=false, xml=nil)
xpath_array = xpaths.class == Array ? xpaths : [xpaths]
- return get_xpaths(xpath_array, keep_tags, xml)
+ get_xpaths(xpath_array, keep_tags, xml)
end
# get_text
@@ -87,7 +87,7 @@ def get_text(xpaths, keep_tags=false, xml=nil, delimiter=";")
xpath_array = xpaths.class == Array ? xpaths : [xpaths]
list = get_xpaths(xpath_array, keep_tags, xml)
sorted = list.sort
- return sorted.join("#{delimiter} ")
+ sorted.join("#{delimiter} ")
end
# Note: Recommend that collection team do NOT use this method directly
@@ -111,13 +111,13 @@ def get_xpaths(xpaths, keep_tags=false, xml=nil)
text = CommonXml.to_display_text(content)
end
# remove whitespace of all kinds from the text
- text = CommonXml.normalize_space(text)
+ text = Datura::Helpers.normalize_space(text)
if text.length > 0
list << text
end
end
end
- return list.uniq
+ list.uniq
end
def preprocessing
diff --git a/test/common_xml_test.rb b/test/common_xml_test.rb
index 05c4879a1..765f85c97 100644
--- a/test/common_xml_test.rb
+++ b/test/common_xml_test.rb
@@ -44,50 +44,6 @@ def test_create_xml_object
# TODO
end
- def test_date_display
- # normal dates
- assert_equal "December 2, 2016", CommonXml.date_display("2016-12-02")
- assert_equal "January 31, 2014", CommonXml.date_display("2014-01-31", "no date")
- # no date
- assert_equal "N.D.", CommonXml.date_display(nil)
- assert_equal "no date", CommonXml.date_display("20143183", "no date")
- assert_equal "", CommonXml.date_display(nil, "")
- end
-
- def test_date_standardize
- # missing month and day
- assert_equal "2016-01-01", CommonXml.date_standardize("2016")
- assert_equal "2016-12-31", CommonXml.date_standardize("2016", false)
- # missing day
- assert_nil CommonXml.date_standardize("01-12")
- assert_equal "2014-01-01", CommonXml.date_standardize("2014-01")
- assert_equal "2014-01-31", CommonXml.date_standardize("2014-01", false)
- # complete date
- assert_equal "2014-01-12", CommonXml.date_standardize("2014-01-12")
- # invalid date
- assert_nil CommonXml.date_standardize("2014-30-31")
- # February final day
- assert_equal "2015-02-28", CommonXml.date_standardize("2015-2", false)
- assert_equal "2016-02-29", CommonXml.date_standardize("2016-02", false)
- end
-
- def test_normalize_name
- assert_equal "title", CommonXml.normalize_name("The Title")
- assert_equal "anne of green gables", CommonXml.normalize_name("Anne of Green Gables")
- assert_equal "fancy party", CommonXml.normalize_name("A Fancy Party")
- assert_equal "hour", CommonXml.normalize_name("An Hour")
- end
-
- def test_normalize_space
- # ensure that return characters are replaced by spaces, and multispaces squashed
- test1 = " \rExample \n \n "
- assert_equal " Example ", CommonXml.normalize_space(test1)
-
- # check that newlines are dead regardless
- test2 = "\rExa\rmple\n"
- assert_equal " Exa mple ", CommonXml.normalize_space(test2)
- end
-
def test_sub_corrections
xml_string = "Somethng Something"
xml = Nokogiri::XML xml_string
diff --git a/test/helpers_test.rb b/test/helpers_test.rb
index 7c3c72440..357c64e01 100644
--- a/test/helpers_test.rb
+++ b/test/helpers_test.rb
@@ -3,6 +3,33 @@
class Datura::HelpersTest < Minitest::Test
+ def test_date_display
+ # normal dates
+ assert_equal "December 2, 2016", Datura::Helpers.date_display("2016-12-02")
+ assert_equal "January 31, 2014", Datura::Helpers.date_display("2014-01-31", "no date")
+ # no date
+ assert_equal "N.D.", Datura::Helpers.date_display(nil)
+ assert_equal "no date", Datura::Helpers.date_display("20143183", "no date")
+ assert_equal "", Datura::Helpers.date_display(nil, "")
+ end
+
+ def test_date_standardize
+ # missing month and day
+ assert_equal "2016-01-01", Datura::Helpers.date_standardize("2016")
+ assert_equal "2016-12-31", Datura::Helpers.date_standardize("2016", false)
+ # missing day
+ assert_nil Datura::Helpers.date_standardize("01-12")
+ assert_equal "2014-01-01", Datura::Helpers.date_standardize("2014-01")
+ assert_equal "2014-01-31", Datura::Helpers.date_standardize("2014-01", false)
+ # complete date
+ assert_equal "2014-01-12", Datura::Helpers.date_standardize("2014-01-12")
+ # invalid date
+ assert_nil Datura::Helpers.date_standardize("2014-30-31")
+ # February final day
+ assert_equal "2015-02-28", Datura::Helpers.date_standardize("2015-2", false)
+ assert_equal "2016-02-29", Datura::Helpers.date_standardize("2016-02", false)
+ end
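The expectations above pin down the fill-in behavior of `date_standardize`: missing month or day parts snap to the first or last valid value depending on the `before` flag. A minimal sketch of that logic (not the gem's implementation) using Ruby's stdlib `Date`, where a day of `-1` selects a month's final day and so handles leap years:

```ruby
require "date"

# "not before" (before = true) fills missing parts with the earliest value;
# "not after" (before = false) fills with the latest valid value.
def standardize(datestr, before = true)
  year, month = datestr.split("-").map(&:to_i)
  month = before ? 1 : 12 if month.nil? || month.zero?
  day = before ? 1 : -1  # -1 = last day of the month, leap-year aware
  Date.new(year, month, day).strftime("%Y-%m-%d")
end
```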
+
def test_get_directory_files
# real directory
files = Datura::Helpers.get_directory_files("#{File.dirname(__FILE__)}/fixtures")
@@ -25,6 +52,23 @@ def test_make_dirs
# TODO
end
+ def test_normalize_name
+ assert_equal "title", Datura::Helpers.normalize_name("The Title")
+ assert_equal "anne of green gables", Datura::Helpers.normalize_name("Anne of Green Gables")
+ assert_equal "fancy party", Datura::Helpers.normalize_name("A Fancy Party")
+ assert_equal "hour", Datura::Helpers.normalize_name("An Hour")
+ end
+
+ def test_normalize_space
+ # ensure that return characters are replaced by spaces, and multispaces squashed
+ test1 = " \rExample \n \n "
+ assert_equal " Example ", Datura::Helpers.normalize_space(test1)
+
+ # check that newlines are dead regardless
+ test2 = "\rExa\rmple\n"
+ assert_equal " Exa mple ", Datura::Helpers.normalize_space(test2)
+ end
+
def test_regex_files
test_files = %w[
/path/to/cody.book.001.xml