diff --git a/.ruby-version b/.ruby-version
index 24ba9a38d..860487ca1 100644
--- a/.ruby-version
+++ b/.ruby-version
@@ -1 +1 @@
-2.7.0
+2.7.1
diff --git a/Gemfile.lock b/Gemfile.lock
index f6371ab75..e79532f80 100644
--- a/Gemfile.lock
+++ b/Gemfile.lock
@@ -20,16 +20,16 @@ GEM
     mini_portile2 (2.4.0)
     minitest (5.14.0)
     netrc (0.11.0)
-    nokogiri (1.10.7)
+    nokogiri (1.10.9)
       mini_portile2 (~> 2.4.0)
-    rake (10.5.0)
+    rake (13.0.1)
     rest-client (2.0.2)
       http-cookie (>= 1.0.2, < 2.0)
       mime-types (>= 1.16, < 4.0)
       netrc (~> 0.8)
     unf (0.1.4)
       unf_ext
-    unf_ext (0.0.7.6)
+    unf_ext (0.0.7.7)

 PLATFORMS
   ruby
@@ -38,7 +38,7 @@ DEPENDENCIES
   bundler (>= 1.16.0, < 3.0)
   datura!
   minitest (~> 5.0)
-  rake (~> 10.0)
+  rake (~> 13.0)

 BUNDLED WITH
    2.1.4
diff --git a/README.md b/README.md
index 0b603ffd3..2aea9e408 100644
--- a/README.md
+++ b/README.md
@@ -2,16 +2,34 @@
 Welcome to this temporary documentation for Datura, a gem dedicated to transforming and posting data sources from CDRH projects. This gem is intended to be used with a collection containing TEI, VRA, CSVs, and more.
 
-## Install
+Looking for information about how to post documents? Check out the
+[documentation for posting](/docs/3_manage/post.md).
 
+## Install / Set Up Data Repo
 
-Gemfile:
+Check that Ruby is installed, preferably 2.7.x or up.
+
+If your project already has a Gemfile, add the `gem "datura"` line. If not, create a new directory and add a file named `Gemfile` (no extension).
+
+```
+source "https://rubygems.org"
+
+# fill in the latest available release for the tag
+gem "datura", git: "https://github.com/CDRH/datura.git", tag: "v0.0.0"
+```
+
+If this is the first datura repository on your machine, install saxon as a system-wide executable. [Saxon setup documentation](docs/4_developers/saxon.md).
+
+Then, in the directory with the Gemfile, run the following:
 
 ```
-gem "datura", git: "https://github.com/CDRH/data.git", branch: "datura"
+gem install bundler
+bundle install
+
+bundle exec setup
 ```
 
-Next, install saxon as a system wide executable. [Saxon setup documentation](docs/4_developers/saxon.md).
+The last step should add files and some basic directories. Have a look at the [setup instructions](/docs/1_setup/collection_setup.md) to learn how to add your files and start working with the data!
 
 ## Local Development
 
@@ -28,21 +46,17 @@ Then in your repo you can run:
 
 ```
 bundle install
+# create the gem package if the above doesn't work
+gem install --local path/to/local/datura/pkg/datura-0.x.x.gem
 ```
 
-If for some reason that is not working, you can instead run the following each time you make a change in datura:
+You will need to recreate your gem package for some changes you make in Datura. From the Datura directory (NOT your data repo directory), run:
 
 ```
 bundle exec rake install
 ```
 
-then from the collection (sub in the correct version):
-
-```
-gem install --local path/to/local/datura/pkg/datura-0.1.2.gem
-```
-
-Note: You may need to delete your `scripts/.xslt-datura` folder as well.
+Note: You may also need to delete your `scripts/.xslt-datura` folder if you are making changes to the default Datura scripts.
 ## First Steps
diff --git a/datura.gemspec b/datura.gemspec
index 1735714ea..ef85aa47d 100644
--- a/datura.gemspec
+++ b/datura.gemspec
@@ -59,5 +59,5 @@ Gem::Specification.new do |spec|
   spec.add_runtime_dependency "rest-client", "~> 2.0.2"
   spec.add_development_dependency "bundler", ">= 1.16.0", "< 3.0"
   spec.add_development_dependency "minitest", "~> 5.0"
-  spec.add_development_dependency "rake", "~> 10.0"
+  spec.add_development_dependency "rake", "~> 13.0"
 end
diff --git a/docs/2_customization/all_types.md b/docs/2_customization/all_types.md
index 604b588a4..deee371e9 100644
--- a/docs/2_customization/all_types.md
+++ b/docs/2_customization/all_types.md
@@ -5,11 +5,13 @@ There are a number of ways you can customize the transformations. Please refer
 ### To Elasticsearch
 
 - [XML based (HTML / TEI / VRA / webs (Web Scraped HTML))](xml_to_es.md)
-- [CSV](csv_to_es.md)
+- CSV (Pending)
+- [Custom Formats](custom_to_es.md) (those which Datura does not support but which a collection may need)
 
 ### To Solr / HTML
 
-- Pending docs TODO
+- Pending docs for most formats TODO
+- [CSV](csv_to_solr.md)
 
 ### To IIIF
 
diff --git a/docs/2_customization/custom_to_es.md b/docs/2_customization/custom_to_es.md
new file mode 100644
index 000000000..b9ef0f232
--- /dev/null
+++ b/docs/2_customization/custom_to_es.md
@@ -0,0 +1,170 @@
+# Custom Formats to Elasticsearch
+
+Datura provides minimal support for formats other than TEI, VRA,
+HTML, and CSV through basic infrastructure to support overrides.
+
+## The Basics
+
+If you want to add a custom format such as YAML or XLS spreadsheets, or a
+highly customized version of HTML or CSV alongside an existing batch of CSVs,
+you need to create a directory in `source` with a unique name.
+
+*The name you select should not be `authority` or `annotations`*. Those names
+are reserved for projects which require authority files such as gazetteers and
+scholarly notes about items.
+
+Let's say you need to index `.txt` files. Once you have created the directory
+`source/txt` and populated it with a few files, you can run the Datura scripts
+with:
+
+```
+post -f txt
+```
+
+That will start off the process of grabbing the files and reading them.
+Unfortunately, Datura has no idea what sort of format to prepare for, nor
+how many items each file might contain (for example, a PDF might be one item
+per file while a tab-separated doc could be dozens or hundreds per file).
+
+Additionally, once Datura reads in a file, it doesn't know how or what
+information to extract, so it looks like it's time to start writing your own
+code!
+
+## Reading Your Format and Prepping for Launch
+
+Just a note before we begin to clarify some of the variables that you may come
+across while you're setting up your custom format:
+
+- `@file_location` -- the full path to the specific file being processed
+  - `/var/local/www/data/collections/source/[custom_format]/test.json`
+- `@filename` -- the specific file without a path
+  - `test.json`
+- `self.filename()` -- method specific to FileType and subclasses to get the filename
+- `@file` -- very generically named, `@file` is the version of your file that has been read in by Ruby
+  - override the `read_file` method to make `@file` into an XML / JSON / YAML / etc object as needed by your custom class (see below)
+
+### read_file
+
+In [file_custom.rb](/lib/datura/file_types/file_custom.rb), Datura reads the
+file in as text and stores it as `@file`; CustomToEs objects are created from
+it later. You may wish to override the following to accommodate your format:
+
+```
+class FileCustom < FileType
+  def read_file
+    File.read(@file_location)
+  end
+end
+```
+
+Currently, this simply reads in the file's raw text. However,
+if you are working with XML / HTML, JSON, CSV, YAML, etc, there is likely a
+better, format-specific parser that will give you more control. For example,
+you might change `read_file` to:
+
+```
+# note: may need to require libraries / modules
+require "yaml"
+
+class FileCustom < FileType
+  def read_file
+    YAML.load_file(@file_location)
+  end
+end
+```
+
+### subdocs
+
+The next thing you will need to address, if your format needs to be split into
+multiple documents (such as personography files, spreadsheets, database dumps,
+etc.), is how to split up a file. By default, Datura assumes your file is one
+item. If that is not the case, override `subdocs`:
+
+```
+def subdocs
+  Array(@file)
+end
+```
+
+Change that to something which will return an array of items. For example, from
+our YAML example, you might have:
+
+```
+def subdocs
+  @file["texts"]
+end
+```
+
+Or for an XML file:
+
+```
+def subdocs
+  @file.xpath("//grouping")
+end
+```
+
+### build_es_documents
+
+You're almost done with `file_custom.rb`. You just need to kick off a class
+that will handle the transformation per sub-document. For simplicity's sake, if
+this is a totally new format that Elasticsearch hasn't seen before, I recommend
+leaving this method alone. You can move on to the next step,
+[CustomToEs](#customtoes).
+
+If you want to try to piggyback off of an existing Datura class, then you may
+need to override this method. Instead of calling `CustomToEs.new()` in it, you
+would instead need to add a `require_relative` path at the top of the file to
+your new class, and then call `YournewclassToEs.new()` from `build_es_documents`.
+
+In your new class, you could presumably do something like:
+
+```
+class YournewclassToEs < XmlToEs
+  # now you have access to XmlToEs helpers for xpaths, etc
+end
+```
+
+## CustomToEs
+
+The files in the [custom_to_es](/lib/datura/to_es/custom_to_es) directory and
+[custom_to_es.rb](/lib/datura/to_es/custom_to_es.rb) give you the basic
+structure you need to create your own version of these files. Since
+Datura has no way of knowing what format might come its way, the majority of the
+methods in `custom_to_es/fields.rb` are empty.
+
+The only thing you **MUST** override is `get_id`.
+
+Create a file in your overrides directory called `custom_to_es.rb` and add the
+following:
+
+```
+class CustomToEs
+
+  def get_id
+    # include code here that returns an id
+    # it could be filename(false) to get the filename without its extension
+    # or it could be @item["identifier"] to get the value of a column, etc
+
+    # you may want to prepend a collection abbreviation to your id, like
+    # "nei.#{some_value}"
+  end
+
+end
+```
+
+You can also add preprocessing or postprocessing here by overriding `create_json`.
+
+It is expected that you will override most of the methods in `fields.rb`. For
+example, you might set a category like:
+
+```
+def category
+  # your code here, referencing @item if necessary
+end
+```
+
+One more note: due to how `CustomToEs` is created, it expects a subdoc
+and the original file. This is because it accommodates something like a
+personography file, where you may want to deal with an individual person as
+`@item` but need to reference `@file` to get information about the repository
+or rightsholder, etc. If your format does not use sub-documents, then you
+may simply refer to `@item` throughout and ignore `@file`, which should be
+identical.
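Editor's note: to make the override pattern above concrete, here is a minimal standalone sketch of the kind of logic a collection's `get_id` and `category` overrides might contain. It does not load Datura itself; `SketchCustomToEs`, the `nei.` prefix, and the field names are hypothetical stand-ins, and `@item` is assumed to be a plain hash parsed from a YAML subdocument.

```ruby
require "yaml"

# hypothetical sketch mimicking a collection's CustomToEs overrides;
# in a real collection these would be methods on CustomToEs itself
class SketchCustomToEs
  def initialize(item)
    # the subdoc for this record, e.g. one entry from a YAML file
    @item = item
  end

  # return a collection-prefixed id, e.g. "nei.doc001"
  def get_id
    "nei.#{@item["identifier"]}"
  end

  # fall back to a default when the source data has no category
  def category
    @item["category"] || "Uncategorized"
  end
end

item = YAML.safe_load(<<~YAML)
  identifier: doc001
  category: Writings
YAML

sketch = SketchCustomToEs.new(item)
puts sketch.get_id    # => nei.doc001
puts sketch.category  # => Writings
```

The same shape works for any parsed format: whatever `subdocs` returns becomes `@item`, and each field method pulls what it needs from it.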
diff --git a/lib/datura/common_xml.rb b/lib/datura/common_xml.rb
index d4cc8351d..d83aafb39 100644
--- a/lib/datura/common_xml.rb
+++ b/lib/datura/common_xml.rb
@@ -20,7 +20,7 @@ def self.convert_tags(xml)
       ele.delete("rend")
     end
     xml = CommonXml.sub_corrections(xml)
-    return xml
+    xml
   end
 
   # wrap in order to make valid xml
@@ -29,7 +29,7 @@ def self.convert_tags(xml)
   def self.convert_tags_in_string(text)
     xml = Nokogiri::XML("<xml>#{text}</xml>")
     converted = convert_tags(xml)
-    return converted.xpath("//xml").inner_html
+    converted.xpath("//xml").inner_html
   end
 
   def self.create_html_object(filepath, remove_ns=true)
@@ -45,59 +45,24 @@ def self.create_xml_object(filepath, remove_ns=true)
     file_xml
   end
 
-  # pass in a date and identify whether it should be before or after
-  # in order to fill in dates (ex: 2014 => 2014-12-31)
-
+  # deprecated method
   def self.date_display(date, nd_text="N.D.")
-    date_hyphen = CommonXml.date_standardize(date)
-    if date_hyphen
-      y, m, d = date_hyphen.split("-").map { |s| s.to_i }
-      date_obj = Date.new(y, m, d)
-      return date_obj.strftime("%B %-d, %Y")
-    else
-      return nd_text
-    end
+    Datura::Helpers.date_display(date, nd_text)
   end
 
-  # automatically defaults to setting incomplete dates to the earliest
-  # date (2016-07 becomes 2016-07-01) but pass in "false" in order
-  # to set it to the latest available date
+  # deprecated method
   def self.date_standardize(date, before=true)
-    return_date = nil
-    if date
-      y, m, d = date.split(/-|\//)
-      if y && y.length == 4
-        # use -1 to indicate that this will be the last possible
-        m_default = before ? "01" : "-1"
-        d_default = before ? "01" : "-1"
-        m = m_default if !m
-        d = d_default if !d
-        # TODO clean this up because man it sucks
-        if Date.valid_date?(y.to_i, m.to_i, d.to_i)
-          date = Date.new(y.to_i, m.to_i, d.to_i)
-          month = date.month.to_s.rjust(2, "0")
-          day = date.day.to_s.rjust(2, "0")
-          return_date = "#{date.year}-#{month}-#{day}"
-        end
-      end
-    end
-    return_date
+    Datura::Helpers.date_standardize(date, before)
   end
 
+  # deprecated method
   def self.normalize_name(abnormal)
-    # put in lower case
-    # remove starting a, an, or the
-    down = abnormal.downcase
-    down.gsub(/^the |^a |^an /, "")
+    Datura::Helpers.normalize_name(abnormal)
   end
 
-  # imitates xslt fn:normalize-space
-  # removes leading / trailing whitespace, newlines, repeating whitespace, etc
+  # deprecated method
   def self.normalize_space(abnormal)
-    if abnormal
-      normal = abnormal.strip.gsub(/\s+/, " ")
-    end
-    normal || abnormal
+    Datura::Helpers.normalize_space(abnormal)
   end
 
   # saxon accepts params in following manner
@@ -107,7 +72,7 @@ def self.stringify_params(param_hash)
     if param_hash
       params = param_hash.map{ |k, v| "#{k}=#{v}" }.join(" ")
     end
-    return params
+    params
   end
 
   def self.sub_corrections(aXml)
@@ -122,4 +87,13 @@ def self.to_display_text(aXml)
     CommonXml.sub_corrections(aXml).text
   end
 
+  # TODO remove in 2021
+  class << self
+    extend Gem::Deprecate
+    deprecate :date_display, :"Datura::Helpers.date_display", 2021, 1
+    deprecate :date_standardize, :"Datura::Helpers.date_standardize", 2021, 1
+    deprecate :normalize_name, :"Datura::Helpers.normalize_name", 2021, 1
+    deprecate :normalize_space, :"Datura::Helpers.normalize_space", 2021, 1
+  end
+
 end
diff --git a/lib/datura/data_manager.rb b/lib/datura/data_manager.rb
index 861d71da6..9ae304a43 100644
--- a/lib/datura/data_manager.rb
+++ b/lib/datura/data_manager.rb
@@ -18,13 +18,15 @@ class Datura::DataManager
   attr_accessor :collection
 
   def self.format_to_class
-    {
+    classes = {
       "csv" => FileCsv,
       "html" => FileHtml,
       "tei" => FileTei,
       "vra" => FileVra,
       "webs" => FileWebs
     }
+    classes.default = FileCustom
+    classes
   end
 
   def initialize
@@ -63,7 +65,7 @@ def load_collection_classes
   def print_options
     pretty = JSON.pretty_generate(@options)
     puts "Options: #{pretty}"
-    return pretty
+    pretty
   end
 
   def run
@@ -179,7 +181,7 @@ def get_files
       found = Datura::Helpers.get_directory_files(File.join(@options["collection_dir"], "source", format))
       files += found if found
     end
-    return files
+    files
   end
 
   def options_msg
@@ -196,7 +198,7 @@ def options_msg
     if @options["verbose"]
       print_options
     end
-    return msg
+    msg
  end
 
   # override this step in project specific files
@@ -241,7 +243,7 @@ def prepare_files
         @log.error(msg)
       end
     end
-    return file_classes
+    file_classes
   end
 
   def prepare_xslt
@@ -293,7 +295,7 @@ def set_up_logger
 
   def should_transform?(type)
     # adjust default transformation type in params parser
-    return @options["transform_types"].include?(type)
+    @options["transform_types"].include?(type)
   end
 
   def transform_and_post(file)
diff --git a/lib/datura/file_type.rb b/lib/datura/file_type.rb
index 6077dd8f2..236369a30 100644
--- a/lib/datura/file_type.rb
+++ b/lib/datura/file_type.rb
@@ -102,11 +102,11 @@ def post_solr(url=nil)
 
   def print_es
     json = transform_es
-    return pretty_json(json)
+    pretty_json(json)
   end
 
   def print_solr
-    return transform_solr
+    transform_solr
   end
 
   # these rules apply to all XML files (HTML / TEI / VRA)
@@ -156,7 +156,7 @@ def transform_solr
     else
       req = exec_xsl(@file_location, @script_solr, "xml", nil, @options["variables_solr"])
     end
-    return req
+    req
   end
 
   private
diff --git a/lib/datura/file_types/file_csv.rb b/lib/datura/file_types/file_csv.rb
index e7c01f2a8..65655a940 100644
--- a/lib/datura/file_types/file_csv.rb
+++ b/lib/datura/file_types/file_csv.rb
@@ -34,25 +34,21 @@ def present?(item)
 
   # override to change encoding
   def read_csv(file_location, encoding="utf-8")
-    return CSV.read(file_location, {
+    CSV.read(file_location, {
       encoding: encoding,
       headers: true,
       return_headers: true
     })
   end
 
-  # most basic implementation assumes column header is the es field name
-  # operates with no logic on the fields
-  # YOU MUST OVERRIDE FOR CSVS WHICH DO NOT HAVE BESPOKE HEADINGS FOR API
+  # NOTE previously this blindly took column headings and tried
+  # to send them to Elasticsearch, but this will make a mess of
+  # our index mapping, so instead prefer to only push specific fields
+  # leaving "headers" in method arguments for backwards compatibility
+  #
+  # override as necessary per project
   def row_to_es(headers, row)
-    doc = {}
-    headers.each do |column|
-      doc[column] = row[column] if row[column]
-    end
-    if doc.key?("text") && doc.key?("title")
-      doc["text"] << " #{doc["title"]}"
-    end
-    doc
+    CsvToEs.new(row, options, @csv, self.filename(false)).json
   end
 
   # most basic implementation assumes column header is the solr field name
@@ -61,7 +57,7 @@ def row_to_solr(doc, headers, row)
     headers.each do |column|
       doc.add_child("<field name=\"#{column}\">#{row[column]}</field>") if row[column]
     end
-    return doc
+    doc
   end
 
   def transform_es
@@ -111,7 +107,7 @@ def transform_solr
       filepath = "#{@out_solr}/#{self.filename(false)}.xml"
       File.open(filepath, "w") { |f| f.write(solr_doc.root.to_xml) }
     end
-    return { "doc" => solr_doc.root.to_xml }
+    { "doc" => solr_doc.root.to_xml }
   end
 
   def write_html_to_file(builder, index)
diff --git a/lib/datura/file_types/file_custom.rb b/lib/datura/file_types/file_custom.rb
new file mode 100644
index 000000000..28725d02b
--- /dev/null
+++ b/lib/datura/file_types/file_custom.rb
@@ -0,0 +1,78 @@
+require_relative "../helpers.rb"
+require_relative "../file_type.rb"
+
+require "rest-client"
+
+class FileCustom < FileType
+  attr_reader :es_req, :format
+
+  def initialize(file_location, options)
+    super(file_location, options)
+    @format = get_format
+    @file = read_file
+  end
+
+  def build_es_documents
+    # currently assuming that the file has one document to post
+    # but since some may include more (personographies, spreadsheets, etc)
+    # this should return an array of documents
+    # NOTE this would also be a pretty reasonable method to override
+    # if you need to split your documents into classes of your own creation
+    # like "YamlToEs" or "XlsToEs", etc
+    docs = []
+    subdocs.each do |subdoc|
+      docs << CustomToEs.new(
+        subdoc,
+        options: @options,
+        file: @file,
+        filename: self.filename,
+        file_type: @format)
+        .json
+    end
+    docs.compact
+  end
+
+  def get_format
+    # assumes that the format is in the directory structure
+    File.dirname(@file_location).split("/").last
+  end
+
+  # NOTE: you will likely need to override this method
+  # depending on the format in question
+  def read_file
+    File.read(@file_location)
+  end
+
+  def subdocs
+    # if the file should be split into components (such as a CSV row
+    # or personography person entry), override this method to return
+    # an array of items
+    Array(@file)
+  end
+
+  def transform_es
+    puts "transforming #{self.filename}"
+    # expecting an array
+    es_doc = build_es_documents
+
+    if @options["output"]
+      filepath = "#{@out_es}/#{self.filename(false)}.json"
+      File.open(filepath, "w") { |f| f.write(pretty_json(es_doc)) }
+    end
+    es_doc
+  end
+
+  # CURRENTLY NO SUPPORT FOR FOLLOWING TRANSFORMATIONS
+  def transform_html
+    raise "Custom format to HTML transformation must be implemented in collection"
+  end
+
+  def transform_iiif
+    raise "Custom format to IIIF transformation must be implemented in collection"
+  end
+
+  def transform_solr
+    raise "Custom format to Solr transformation must be implemented in collection"
+  end
+end
diff --git a/lib/datura/file_types/file_tei.rb b/lib/datura/file_types/file_tei.rb
index 66fd4a970..d756450f2 100644
--- a/lib/datura/file_types/file_tei.rb
+++ b/lib/datura/file_types/file_tei.rb
@@ -17,7 +17,7 @@ def initialize(file_location, options)
 
   def subdoc_xpaths
     # match subdocs against classes
-    return {
+    {
       "/TEI" => TeiToEs,
       # "//listPerson/person" => TeiToEsPersonography,
     }
diff --git a/lib/datura/file_types/file_vra.rb b/lib/datura/file_types/file_vra.rb
index cf8b9bd31..e48e4587b 100644
--- a/lib/datura/file_types/file_vra.rb
+++ b/lib/datura/file_types/file_vra.rb
@@ -11,7 +11,7 @@ def initialize(file_location, options)
 
   def subdoc_xpaths
     # planning ahead on this one, but not necessary at the moment
-    return {
+    {
       "/vra" => VraToEs,
       "//listPerson/person" => VraToEsPersonography
     }
diff --git a/lib/datura/helpers.rb b/lib/datura/helpers.rb
index 1b56fb33f..2e841d267 100644
--- a/lib/datura/helpers.rb
+++ b/lib/datura/helpers.rb
@@ -5,6 +5,46 @@
 
 module Datura::Helpers
 
+  # date_display
+  # pass in a date and identify whether it should be before or after
+  # in order to fill in dates (ex: 2014 => 2014-12-31)
+  def self.date_display(date, nd_text="N.D.")
+    date_hyphen = self.date_standardize(date)
+    if date_hyphen
+      y, m, d = date_hyphen.split("-").map { |s| s.to_i }
+      date_obj = Date.new(y, m, d)
+      date_obj.strftime("%B %-d, %Y")
+    else
+      nd_text
+    end
+  end
+
+  # date_standardize
+  # automatically defaults to setting incomplete dates to the earliest
+  # date (2016-07 becomes 2016-07-01) but pass in "false" in order
+  # to set it to the latest available date
+  def self.date_standardize(date, before=true)
+    return_date = nil
+    if date
+      y, m, d = date.split(/-|\//)
+      if y && y.length == 4
+        # use -1 to indicate that this will be the last possible
+        m_default = before ? "01" : "-1"
+        d_default = before ? "01" : "-1"
+        m = m_default if !m
+        d = d_default if !d
+        # TODO clean this up because man it sucks
+        if Date.valid_date?(y.to_i, m.to_i, d.to_i)
+          date = Date.new(y.to_i, m.to_i, d.to_i)
+          month = date.month.to_s.rjust(2, "0")
+          day = date.day.to_s.rjust(2, "0")
+          return_date = "#{date.year}-#{month}-#{day}"
+        end
+      end
+    end
+    return_date
+  end
+
   # get_directory_files
   # Note: do not end with /
   # params: directory (string)
@@ -14,10 +54,10 @@ def self.get_directory_files(directory, verbose_flag=false)
     exists = File.directory?(directory)
     if exists
       files = Dir["#{directory}/*"] # grab all the files inside that directory
-      return files
+      files
     else
       puts "Unable to find a directory at #{directory}" if verbose_flag
-      return nil
+      nil
     end
   end # end get_directory_files
 
@@ -30,14 +70,14 @@ def self.get_input(original_input, msg)
       puts "#{msg}: \n"
       new_input = STDIN.gets.chomp
       if !new_input.nil? && new_input.length > 0
-        return new_input
+        new_input
       else
         # keep bugging the user until they answer or despair
         puts "Please enter a valid response"
         get_input(nil, msg)
       end
     else
-      return original_input
+      original_input
     end
   end
 
@@ -55,6 +95,23 @@ def self.make_dirs(*args)
     FileUtils.mkdir_p(args)
   end
 
+  # normalize_name
+  # lowercase and remove articles from front
+  def self.normalize_name(abnormal)
+    down = abnormal.downcase
+    down.gsub(/^the |^a |^an /, "")
+  end
+
+  # normalize_space
+  # imitates xslt fn:normalize-space
+  # removes leading / trailing whitespace, newlines, repeating whitespace, etc
+  def self.normalize_space(abnormal)
+    if abnormal
+      normal = abnormal.strip.gsub(/\s+/, " ")
+    end
+    normal || abnormal
+  end
+
   # regex_files
   # looks through a directory's files for those matching the regex
   # params: files (array of file names), regex (regular expression)
@@ -79,11 +136,11 @@ def self.regex_files(files, regex=nil)
   def self.should_update?(file, since_date=nil)
     if since_date.nil?
      # if there is no specified date, then update everything
-      return true
+      true
     else
       # if a file has been updated since a time specified by user
       file_date = File.mtime(file)
-      return file_date > since_date
+      file_date > since_date
     end
   end
diff --git a/lib/datura/options.rb b/lib/datura/options.rb
index 25ce6b352..36d4e47e2 100644
--- a/lib/datura/options.rb
+++ b/lib/datura/options.rb
@@ -70,7 +70,7 @@ def remove_environments(config)
         end
       end
     end
-    return new_config
+    new_config
   end
 
   # remove the unneeded environment and put everything at the first level
@@ -85,7 +85,7 @@ def smash_configs
     collection = c.merge(d)
 
     # collection overrides general config
-    return general.merge(collection)
+    general.merge(collection)
   end
 
 end
diff --git a/lib/datura/parser.rb b/lib/datura/parser.rb
index 8b5655c52..b66fcc4d9 100644
--- a/lib/datura/parser.rb
+++ b/lib/datura/parser.rb
@@ -25,7 +25,7 @@ def self.argv_collection_dir(argv)
       puts @usage
       exit
     end
-    return collection_dir
+    collection_dir
   end
 
   # take a string in utc and create a time object with it
diff --git a/lib/datura/parser_options/post.rb b/lib/datura/parser_options/post.rb
index 6f52cf2ad..daa9b7408 100644
--- a/lib/datura/parser_options/post.rb
+++ b/lib/datura/parser_options/post.rb
@@ -22,14 +22,16 @@ def self.post_params
       # default to no restricted format
       options["format"] = nil
 
-      opts.on( '-f', '--format [input]', 'Restrict to one format (csv, html, tei, vra, webs)') do |input|
-        if %w[csv html tei vra webs].include?(input)
-          options["format"] = input
-        else
-          puts "Format #{input} is not recognized.".red
-          puts "Allowed formats are csv, html, tei, vra, and webs (web-scraped html)"
+      opts.on( '-f', '--format [input]', 'Supported formats (csv, html, tei, vra, webs)') do |input|
+        if %w[authority annotations].include?(input)
+          puts "'authority' and 'annotations' are invalid formats".red
+          puts "Please select a supported format or rename your custom format"
           exit
+        elsif !%w[csv html tei vra webs].include?(input)
+          puts "Caution: Requested custom format #{input}.".red
+          puts "See FileCustom class for implementation instructions"
         end
+        options["format"] = input
       end
 
       options["commit"] = true
@@ -86,6 +88,6 @@ def self.post_params
     # magic
     optparse.parse!
 
-    return options
+    options
   end
 end
diff --git a/lib/datura/parser_options/solr_create_api_ore.rb b/lib/datura/parser_options/solr_create_api_core.rb
similarity index 97%
rename from lib/datura/parser_options/solr_create_api_ore.rb
rename to lib/datura/parser_options/solr_create_api_core.rb
index 41bacb101..134e45707 100644
--- a/lib/datura/parser_options/solr_create_api_ore.rb
+++ b/lib/datura/parser_options/solr_create_api_core.rb
@@ -28,6 +28,6 @@ def self.solr_create_api_core_params
       exit
     end
 
-    return options
+    options
   end
 end
diff --git a/lib/datura/parser_options/solr_manage_schema.rb b/lib/datura/parser_options/solr_manage_schema.rb
index 605921082..0721b693b 100644
--- a/lib/datura/parser_options/solr_manage_schema.rb
+++ b/lib/datura/parser_options/solr_manage_schema.rb
@@ -32,6 +32,6 @@ def self.solr_manage_schema_params
 
     optparse.parse!
 
-    return options
+    options
   end
 end
diff --git a/lib/datura/requirer.rb b/lib/datura/requirer.rb
index b50190822..75c7bb247 100644
--- a/lib/datura/requirer.rb
+++ b/lib/datura/requirer.rb
@@ -5,17 +5,11 @@
 
 current_dir = File.expand_path(File.dirname(__FILE__))
 
-require_relative "to_es/html_to_es.rb"
+require_relative "to_es/es_request.rb"
 
-require_relative "to_es/tei_to_es.rb"
-require_relative "to_es/tei_to_es/tei_to_es_personography.rb"
-
-require_relative "to_es/webs_to_es.rb"
-
-require_relative "to_es/vra_to_es.rb"
-require_relative "to_es/vra_to_es/vra_to_es_personography.rb"
-
-# Dir["#{current_dir}/tei_to_es/*.rb"].each {|f| require f }
+# x_to_es classes
+Dir["#{current_dir}/to_es/*.rb"].each { |f| require f }
+Dir["#{current_dir}/to_es/**/*.rb"].each { |f| require f }
 
 # file types
-Dir["#{current_dir}/file_types/*.rb"].each {|f| require f }
+Dir["#{current_dir}/file_types/*.rb"].each { |f| require f }
diff --git a/lib/datura/solr_poster.rb b/lib/datura/solr_poster.rb
index 71066d8b4..eb4434a88 100644
--- a/lib/datura/solr_poster.rb
+++ b/lib/datura/solr_poster.rb
@@ -23,7 +23,7 @@ def clear_index
     else
       puts "Unable to clear index!"
     end
-    return res
+    res
   end
 
   def clear_index_by_regex(field, regex)
@@ -37,7 +37,7 @@ def clear_index_by_regex(field, regex)
     else
       puts "Unable to clear files from index!"
     end
-    return res
+    res
   end
 
   # returns an error or nil
@@ -49,7 +49,7 @@ def commit_solr
         puts "UNABLE TO COMMIT YOUR CHANGES TO SOLR. Please commit manually"
       end
     end
-    return commit_res
+    commit_res
   end
 
   def post(content, type)
@@ -60,7 +60,7 @@ def post(content, type)
     request = Net::HTTP::Post.new(url.request_uri)
     request.body = content
     request["Content-Type"] = type
-    return http.request(request)
+    http.request(request)
   end
 
   # post_file
@@ -68,7 +68,7 @@ def post(content, type)
   # TODO refactor?
   def post_file(file_location)
     file = IO.read(file_location)
-    return post_xml(file)
+    post_xml(file)
   end
 
   # post_json
@@ -91,7 +91,7 @@ def post_xml(content)
     if content.nil? || content.empty?
       puts "Missing content to index to Solr. Please check that files are"
       puts "available to be converted to Solr format and that they were transformed."
-      return nil
+      nil
     else
       post(content, "application/xml")
     end
diff --git a/lib/datura/to_es/csv_to_es.rb b/lib/datura/to_es/csv_to_es.rb
new file mode 100644
index 000000000..cf92fec0f
--- /dev/null
+++ b/lib/datura/to_es/csv_to_es.rb
@@ -0,0 +1,54 @@
+require_relative "../helpers.rb"
+require_relative "csv_to_es/fields.rb"
+require_relative "csv_to_es/request.rb"
+
+#########################################
+#  NOTE: DO NOT EDIT THIS FILE!!!!!!!!! #
+#########################################
+# (unless you are a CDRH dev and then you may do so very cautiously)
+# this file provides defaults for ALL of the collections included
+# in the API and changing it could alter dozens of sites unexpectedly!
+# PLEASE RUN LOADS OF TESTS AFTER A CHANGE BEFORE PUSHING TO PRODUCTION
+
+# WHAT IS THIS FILE?
+# This file sets up default behavior for transforming CSV
+# documents to Elasticsearch JSON documents
+
+class CsvToEs
+
+  attr_reader :json, :row, :csv
+  # variables
+  # id, row, csv, options
+
+  def initialize(row, options={}, csv=nil, filename=nil)
+    @row = row
+    @options = options
+    @csv = csv
+    @filename = filename
+    @id = get_id
+
+    create_json
+  end
+
+  # getter for @json response object
+  def create_json
+    @json = {}
+    # if anything needs to be done before processing
+    # do it here (ex: reading in annotations into memory)
+    preprocessing
+    assemble_json
+    postprocessing
+  end
+
+  def get_id
+    @row["id"] || @row["identifier"] || nil
+  end
+
+  def preprocessing
+    # copy this in your csv_to_es collection file to customize
+  end
+
+  def postprocessing
+    # copy this in your csv_to_es collection file to customize
+  end
+end
diff --git a/lib/datura/to_es/csv_to_es/fields.rb b/lib/datura/to_es/csv_to_es/fields.rb
new file mode 100644
index 000000000..96e26db2e
--- /dev/null
+++ b/lib/datura/to_es/csv_to_es/fields.rb
@@ -0,0 +1,187 @@
+class CsvToEs
+  # Note to add custom fields, use "assemble_collection_specific" from request.rb
+  # and be sure to either use the _d, _i, _k, or _t to use the correct field type
+
+  ##########
+  # FIELDS #
+  ##########
+  def id
+    @id
+  end
+
+  def id_dc
+    "https://cdrhapi.unl.edu/doc/#{@id}"
+  end
+
+  def annotations_text
+    # TODO what should default behavior be?
+  end
+
+  def category
+    @row["category"]
+  end
+
+  # nested field
+  def creator
+    # TODO
+  end
+
+  # returns ; delineated string of alphabetized creators
+  def creator_sort
+    # TODO
+  end
+
+  def collection
+    @options["collection"]
+  end
+
+  def collection_desc
+    @options["collection_desc"] || @options["collection"]
+  end
+
+  def contributor
+    # TODO
+  end
+
+  def data_type
+    "csv"
+  end
+
+  def date(before=true)
+    Datura::Helpers.date_standardize(@row["date"], before)
+  end
+
+  def date_display
+    Datura::Helpers.date_display(date)
+  end
+
+  def date_not_after
+    date(false)
+  end
+
+  def date_not_before
+    date(true)
+  end
+
+  def description
+    # Note: override per collection as needed
+  end
+
+  def format
+    @row["format"]
+  end
+
+  def image_id
+    # TODO
+  end
+
+  def keywords
+    # TODO
+  end
+
+  def language
+    # TODO
+  end
+
+  def languages
+    # TODO
+  end
+
+  def medium
+    # Default behavior is the same as "format" method
+    format
+  end
+
+  def person
+    # TODO
+  end
+
+  def people
+    # TODO
+  end
+
+  def places
+    # TODO
+  end
+
+  def publisher
+    # TODO
+  end
+
+  def recipient
+    # TODO
+  end
+
+  def rights
+    # Note: override by collection as needed
+    "All Rights Reserved"
+  end
+
+  def rights_holder
+    # TODO
+  end
+
+  def rights_uri
+    # TODO
+  end
+
+  def source
+    @row["source"]
+  end
+
+  def subjects
+    # TODO
+  end
+
+  def subcategory
+    @row["subcategory"]
+  end
+
+  # text is generally going to be pulled from
+  def text
+    text_all = [ @row["text"] ]
+
+    text_all += text_additional
+    text_all = text_all.compact
Datura::Helpers.normalize_space(text_all.join(" ")) + end + + # override and add by collection as needed + def text_additional + [ title ] + end + + def title + @row["title"] + end + + def title_sort + Datura::Helpers.normalize_name(title) if title + end + + def topics + @row["topics"] + end + + def uri + # override per collection + # should point at the live website view of resource + end + + def uri_data + base = @options["data_base"] + subpath = "data/#{@options["collection"]}/source/csv" + "#{base}/#{subpath}/#{@filename}.csv" + end + + def uri_html + base = @options["data_base"] + subpath = "data/#{@options["collection"]}/output/#{@options["environment"]}/html" + "#{base}/#{subpath}/#{@id}.html" + end + + def works + @row["works"] + end + +end diff --git a/lib/datura/to_es/csv_to_es/request.rb b/lib/datura/to_es/csv_to_es/request.rb new file mode 100644 index 000000000..b361f3f04 --- /dev/null +++ b/lib/datura/to_es/csv_to_es/request.rb @@ -0,0 +1,8 @@ +class CsvToEs + include EsRequest + + # please refer to generic es_request.rb file + # and override the JSON being sent to elasticsearch here, if needed + # project specific overrides should go in the COLLECTION's overrides! + +end diff --git a/lib/datura/to_es/custom_to_es.rb b/lib/datura/to_es/custom_to_es.rb new file mode 100644 index 000000000..fe39b5690 --- /dev/null +++ b/lib/datura/to_es/custom_to_es.rb @@ -0,0 +1,57 @@ +require_relative "../helpers.rb" +require_relative "custom_to_es/fields.rb" +require_relative "custom_to_es/request.rb" + +######################################### +# NOTE: DO NOT EDIT THIS FILE!!!!!!!!! # +######################################### +# (unless you are a CDRH dev and then you may do so very cautiously) +# this file provides defaults for ALL of the collections included +# in the API and changing it could alter dozens of sites unexpectedly! +# PLEASE RUN LOADS OF TESTS AFTER A CHANGE BEFORE PUSHING TO PRODUCTION + +# WHAT IS THIS FILE? 
+# This file sets up default behavior for transforming custom +# documents to Elasticsearch JSON documents + +class CustomToEs + + attr_reader :json, :item, :file_type + + def initialize(item, options: {}, file: nil, filename: nil, file_type: nil) + @item = item + @options = options + # behaves similarly to parent_xml in that it represents + # the entire file, whereas item MAY represent a portion + # of a file (as is the case with a csv row, personography + # //person path, etc) + @file = file + @filename = filename + @file_type = file_type + @id = get_id + + create_json + end + + # getter for @json response object + def create_json + @json = {} + # if anything needs to be done before processing + # do it here (ex: reading in annotations into memory) + preprocessing + assemble_json + postprocessing + end + + def get_id + nil + end + + def preprocessing + # copy this in your custom_to_es collection file to customize + end + + def postprocessing + # copy this in your custom_to_es collection file to customize + end +end diff --git a/lib/datura/to_es/custom_to_es/fields.rb b/lib/datura/to_es/custom_to_es/fields.rb new file mode 100644 index 000000000..a0d068308 --- /dev/null +++ b/lib/datura/to_es/custom_to_es/fields.rb @@ -0,0 +1,186 @@ +class CustomToEs + # Note: to add custom fields, use "assemble_collection_specific" from request.rb + # and be sure to use one of the suffixes _d, _i, _k, or _t so the correct field type is applied + + ########## + # FIELDS # + ########## + def id + @id + end + + def id_dc + "https://cdrhapi.unl.edu/doc/#{@id}" + end + + def annotations_text + # TODO what should default behavior be? 
+ end + + def category + # TODO + end + + # nested field + def creator + # TODO + end + + # returns ; delineated string of alphabetized creators + def creator_sort + # TODO + end + + def collection + @options["collection"] + end + + def collection_desc + @options["collection_desc"] || @options["collection"] + end + + def contributor + # TODO + end + + def data_type + @file_type + end + + def date(before=true) + # TODO + # Datura::Helpers.date_standardize(??, before) + end + + def date_display + Datura::Helpers.date_display(date) if date + end + + def date_not_after + date(false) + end + + def date_not_before + date(true) + end + + def description + # Note: override per collection as needed + end + + def format + # TODO + end + + def image_id + # TODO + end + + def keywords + # TODO + end + + def language + # TODO + end + + def languages + # TODO + end + + def medium + # Default behavior is the same as "format" method + format + end + + def person + # TODO + end + + def people + # TODO + end + + def places + # TODO + end + + def publisher + # TODO + end + + def recipient + # TODO + end + + def rights + # Note: override by collection as needed + "All Rights Reserved" + end + + def rights_holder + # TODO + end + + def rights_uri + # TODO + end + + def source + # TODO + end + + def subjects + # TODO + end + + def subcategory + # TODO + end + + # text is generally going to be pulled from + def text + # TODO + # get text, add text_additional + # Datura::Helpers.normalize_space(your_text.join(" "))) + end + + # override and add by collection as needed + def text_additional + [ title ] + end + + def title + # TODO + end + + def title_sort + Datura::Helpers.normalize_name(title) if title + end + + def topics + # TODO + end + + def uri + # override per collection + # should point at the live website view of resource + end + + def uri_data + base = @options["data_base"] + subpath = "data/#{@options["collection"]}/source/#{@file_type}" + "#{base}/#{subpath}/#{@filename}" + end 
+ + def uri_html + base = @options["data_base"] + subpath = "data/#{@options["collection"]}/output/#{@options["environment"]}/html" + "#{base}/#{subpath}/#{@id}.html" + end + + def works + # TODO + end + +end diff --git a/lib/datura/to_es/custom_to_es/request.rb b/lib/datura/to_es/custom_to_es/request.rb new file mode 100644 index 000000000..804ddcfd3 --- /dev/null +++ b/lib/datura/to_es/custom_to_es/request.rb @@ -0,0 +1,7 @@ +class CustomToEs + include EsRequest + + # please refer to generic es_request.rb file + # and override the JSON being sent to elasticsearch here, if needed + # project specific overrides should go in the COLLECTION's overrides! + +end diff --git a/lib/datura/to_es/xml_to_es_request.rb b/lib/datura/to_es/es_request.rb similarity index 87% rename from lib/datura/to_es/xml_to_es_request.rb rename to lib/datura/to_es/es_request.rb index 051ae5b3f..89dc35625 100644 --- a/lib/datura/to_es/xml_to_es_request.rb +++ b/lib/datura/to_es/es_request.rb @@ -1,8 +1,14 @@ -# request creation portion of Xml to ES transformation -# override for VRA / TEI concerns in [type]_to_es.rb -# files or in collection specific overrides - -class XmlToEs +# assemble_json sets up the JSON structure that will be +# used to create elasticsearch documents. However, the JSON +# structure depends on subclasses to define methods like +# "category" and "subcategory" to populate the JSON.
+# +# This module itself is not standalone, but by putting +# the JSON structure in a common place, those classes +# including it do not each need to redefine the JSON +# request structure + +module EsRequest def assemble_json # Note: if your collection does not require a specific field @@ -27,7 +33,7 @@ def assemble_json assemble_text assemble_collection_specific - return @json + @json end ############## diff --git a/lib/datura/to_es/html_to_es/fields.rb b/lib/datura/to_es/html_to_es/fields.rb index 9852b5c4f..94c35db68 100644 --- a/lib/datura/to_es/html_to_es/fields.rb +++ b/lib/datura/to_es/html_to_es/fields.rb @@ -137,20 +137,18 @@ def subcategory def text # handling separate fields in array # means no worrying about handling spacing between words - text = [] + text_all = [] body = get_text(@xpaths["text"], false) - text << body - text += text_additional - return CommonXml.normalize_space(text.join(" ")) + text_all << body + text_all += text_additional + Datura::Helpers.normalize_space(text_all.join(" ")) end def text_additional # Note: Override this per collection if you need additional # searchable fields or information for collections # just make sure you return an array at the end! 
- - text = [] - text << title + [ title ] end def title @@ -158,8 +156,7 @@ def title end def title_sort - t = title - CommonXml.normalize_name(t) + Datura::Helpers.normalize_name(title) end def topics @@ -172,9 +169,7 @@ def uri end def uri_data - base = @options["data_base"] - subpath = "data/#{@options["collection"]}/tei" - "#{base}/#{subpath}/#{@id}.xml" + # TODO per repository end def uri_html diff --git a/lib/datura/to_es/html_to_es/request.rb b/lib/datura/to_es/html_to_es/request.rb index e33f905b6..d8929f0d8 100644 --- a/lib/datura/to_es/html_to_es/request.rb +++ b/lib/datura/to_es/html_to_es/request.rb @@ -1,7 +1,7 @@ class HtmlToEs < XmlToEs - # please refer to generic xml to es request file, request.rb - # and override methods specific to HTML transformation here + # please refer to generic es_request.rb file + # and override the JSON being sent to elasticsearch here, if needed # project specific overrides should go in the COLLECTION's overrides! end diff --git a/lib/datura/to_es/tei_to_es/fields.rb b/lib/datura/to_es/tei_to_es/fields.rb index 6c80601d2..b33596771 100644 --- a/lib/datura/to_es/tei_to_es/fields.rb +++ b/lib/datura/to_es/tei_to_es/fields.rb @@ -20,19 +20,19 @@ def annotations_text end def category - category = get_text(@xpaths["category"]) - return category.length > 0 ? CommonXml.normalize_space(category) : "none" + cat = get_text(@xpaths["category"]) + cat.length > 0 ? 
Datura::Helpers.normalize_space(cat) : "none" end # note this does not sort the creators def creator creators = get_list(@xpaths["creators"]) - return creators.map { |creator| { "name" => CommonXml.normalize_space(creator) } } + creators.map { |c| { "name" => Datura::Helpers.normalize_space(c) } } end # returns ; delineated string of alphabetized creators def creator_sort - return get_text(@xpaths["creators"]) + get_text(@xpaths["creators"]) end def collection @@ -50,8 +50,8 @@ def contributor eles.each do |ele| contribs << { "id" => ele["id"], - "name" => CommonXml.normalize_space(ele.text), - "role" => CommonXml.normalize_space(ele["role"]) + "name" => Datura::Helpers.normalize_space(ele.text), + "role" => Datura::Helpers.normalize_space(ele["role"]) } end end @@ -64,11 +64,11 @@ def data_type def date(before=true) datestr = get_text(@xpaths["date"]) - return CommonXml.date_standardize(datestr, before) + Datura::Helpers.date_standardize(datestr, before) end def date_display - date = get_text(@xpaths["date_display"]) + get_text(@xpaths["date_display"]) end def date_not_after @@ -121,22 +121,21 @@ def person # and put in the xpaths above, also for attributes, etc # should contain name, id, and role eles = @xml.xpath(@xpaths["person"]) - people = eles.map do |p| + eles.map do |p| { "id" => "", - "name" => CommonXml.normalize_space(p.text), - "role" => CommonXml.normalize_space(p["role"]) + "name" => Datura::Helpers.normalize_space(p.text), + "role" => Datura::Helpers.normalize_space(p["role"]) } end - return people end def people - @json["person"].map { |p| CommonXml.normalize_space(p["name"]) } + @json["person"].map { |p| Datura::Helpers.normalize_space(p["name"]) } end def places - return get_list(@xpaths["places"]) + get_list(@xpaths["places"]) end def publisher @@ -145,14 +144,13 @@ def publisher def recipient eles = @xml.xpath(@xpaths["recipient"]) - people = eles.map do |p| + eles.map do |p| { "id" => "", - "name" => CommonXml.normalize_space(p.text), + "name" 
=> Datura::Helpers.normalize_space(p.text), "role" => "recipient" } end - return people end def rights @@ -179,20 +177,20 @@ def subjects end def subcategory - subcategory = get_text(@xpaths["subcategory"]) - subcategory.length > 0 ? subcategory : "none" + subcat = get_text(@xpaths["subcategory"]) + subcat.length > 0 ? subcat : "none" end def text # handling separate fields in array # means no worrying about handling spacing between words - text = [] + text_all = [] body = get_text(@xpaths["text"], false) - text << body + text_all << body # TODO: do we need to preserve tags like in text? if so, turn get_text to true - # text << CommonXml.convert_tags_in_string(body) - text += text_additional - return CommonXml.normalize_space(text.join(" ")) + # text_all << CommonXml.convert_tags_in_string(body) + text_all += text_additional + Datura::Helpers.normalize_space(text_all.join(" ")) end def text_additional @@ -200,21 +198,19 @@ def text_additional # searchable fields or information for collections # just make sure you return an array at the end! - text = [] - text << title + [ title ] end def title - title = get_text(@xpaths["titles"]["main"]) - if title.empty? - title = get_text(@xpaths["titles"]["alt"]) + title_disp = get_text(@xpaths["titles"]["main"]) + if title_disp.empty? 
+ title_disp = get_text(@xpaths["titles"]["alt"]) end - return title + title_disp end def title_sort - t = title - CommonXml.normalize_name(t) + Datura::Helpers.normalize_name(title) end def topics @@ -229,13 +225,13 @@ def uri def uri_data base = @options["data_base"] subpath = "data/#{@options["collection"]}/source/tei" - return "#{base}/#{subpath}/#{@id}.xml" + "#{base}/#{subpath}/#{@id}.xml" end def uri_html base = @options["data_base"] subpath = "data/#{@options["collection"]}/output/#{@options["environment"]}/html" - return "#{base}/#{subpath}/#{@id}.html" + "#{base}/#{subpath}/#{@id}.html" end def works diff --git a/lib/datura/to_es/tei_to_es/request.rb b/lib/datura/to_es/tei_to_es/request.rb index 14f3b7438..c416d8bba 100644 --- a/lib/datura/to_es/tei_to_es/request.rb +++ b/lib/datura/to_es/tei_to_es/request.rb @@ -1,6 +1,6 @@ class TeiToEs < XmlToEs - # please refer to generic xml to es request file, request.rb + # please refer to generic es_request.rb file # and override methods specific to TEI transformation here # project specific overrides should go in the COLLECTION's overrides! 
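The rename of `xml_to_es_request.rb` to `es_request.rb` turns the shared request logic into an `EsRequest` mixin: `assemble_json` lives in one place, and each `*ToEs` class that includes it supplies the field methods. A minimal, self-contained sketch of that pattern (the class name, field values, and the two fields shown are invented for illustration; the real module assembles many more fields):

```ruby
# Sketch of the EsRequest mixin pattern: shared JSON assembly in a module,
# field methods supplied by the including class. Not the gem's actual code.
module EsRequestSketch
  # build the document hash from field methods the includer defines
  def assemble_json
    @json["category"] = category
    @json["subcategory"] = subcategory
    @json
  end
end

class CsvToEsSketch
  include EsRequestSketch

  def initialize
    @json = {}
  end

  # field methods the mixin relies on (hypothetical values)
  def category; "Writings"; end
  def subcategory; "none"; end
end

p CsvToEsSketch.new.assemble_json
```

Because the structure lives in the module, `CsvToEs`, `CustomToEs`, and the `XmlToEs` subclasses no longer each need their own copy of the request layout.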
diff --git a/lib/datura/to_es/tei_to_es/tei_to_es_personography.rb b/lib/datura/to_es/tei_to_es/tei_to_es_personography.rb index 81105ec64..7e4ff79be 100644 --- a/lib/datura/to_es/tei_to_es/tei_to_es_personography.rb +++ b/lib/datura/to_es/tei_to_es/tei_to_es_personography.rb @@ -1,7 +1,7 @@ class TeiToEsPersonography < TeiToEs def override_xpaths - return { + { "titles" => { "main" => "persName[@type='display']", "alt" => "persName" @@ -16,16 +16,16 @@ def category def creator creators = get_list(@xpaths["creators"], false, @parent_xml) - return creators.map { |creator| { "name" => creator } } + creators.map { |c| { "name" => c } } end def creators - return get_text(@xpaths["creators"], false, @parent_xml) + get_text(@xpaths["creators"], false, @parent_xml) end def get_id person = @xml["id"] - return "#{@filename}_#{person}" + "#{@filename}_#{person}" end def person diff --git a/lib/datura/to_es/vra_to_es/fields.rb b/lib/datura/to_es/vra_to_es/fields.rb index 2ca37a155..ed3d2e45e 100644 --- a/lib/datura/to_es/vra_to_es/fields.rb +++ b/lib/datura/to_es/vra_to_es/fields.rb @@ -26,12 +26,12 @@ def category # note this does not sort the creators def creator creators = get_list(@xpaths["creators"]) - return creators.map { |creator| { "name" => CommonXml.normalize_space(creator) } } + creators.map { |c| { "name" => Datura::Helpers.normalize_space(c) } } end # returns ; delineated string of alphabetized creators def creator_sort - return get_text(@xpaths["creators"]) + get_text(@xpaths["creators"]) end def collection @@ -48,11 +48,11 @@ def contributor contributors.each do |ele| contrib_list << { "id" => "", - "name" => CommonXml.normalize_space(ele.xpath("name").text), - "role" => CommonXml.normalize_space(ele.xpath("role").text) + "name" => Datura::Helpers.normalize_space(ele.xpath("name").text), + "role" => Datura::Helpers.normalize_space(ele.xpath("role").text) } end - return contrib_list + contrib_list end def data_type @@ -61,7 +61,7 @@ def data_type def 
date(before=true) datestr = get_text(@xpaths["dates"]["earliest"]) - CommonXml.date_standardize(datestr, before) + Datura::Helpers.date_standardize(datestr, before) end def date_display @@ -112,17 +112,17 @@ def person # and put in the xpaths above, also for attributes, etc # should contain name, id, and role eles = @xml.xpath(@xpaths["person"]) - return eles.map do |p| + eles.map do |p| { "id" => "", - "name" => CommonXml.normalize_space(p.text), - "role" => CommonXml.normalize_space(p["role"]) + "name" => Datura::Helpers.normalize_space(p.text), + "role" => Datura::Helpers.normalize_space(p["role"]) } end end def people - @json["person"].map { |p| CommonXml.normalize_space(p["name"]) } + @json["person"].map { |p| Datura::Helpers.normalize_space(p["name"]) } end def places @@ -135,14 +135,13 @@ def publisher def recipient eles = @xml.xpath(@xpaths["recipient"]) - people = eles.map do |p| + eles.map do |p| { "id" => "", - "name" => CommonXml.normalize_space(p.text), - "role" => CommonXml.normalize_space(p["role"]), + "name" => Datura::Helpers.normalize_space(p.text), + "role" => Datura::Helpers.normalize_space(p["role"]), } end - return people end def rights @@ -175,12 +174,12 @@ def subjects def text # handling separate fields in array # means no worrying about handling spacing between words - text = [] - text << get_text(@xpaths["text"], false) + text_all = [] + text_all << get_text(@xpaths["text"], false) # TODO: do we need to preserve tags like in text? if so, turn get_text to true - # text << CommonXml.convert_tags_in_string(body) - text += text_additional - return CommonXml.normalize_space(text.join(" ")) + # text_all << CommonXml.convert_tags_in_string(body) + text_all += text_additional + Datura::Helpers.normalize_space(text_all.join(" ")) end def text_additional @@ -188,8 +187,7 @@ def text_additional # searchable fields or information for collections # just make sure you return an array at the end! 
- text = [] - text << title + [ title ] end def title @@ -197,8 +195,7 @@ def title end def title_sort - t = title - CommonXml.normalize_name(t) + Datura::Helpers.normalize_name(title) end def topics @@ -213,13 +210,13 @@ def uri def uri_data base = @options["data_base"] subpath = "data/#{@options["collection"]}/source/vra" - return "#{base}/#{subpath}/#{@id}.xml" + "#{base}/#{subpath}/#{@id}.xml" end def uri_html base = @options["data_base"] subpath = "data/#{@options["collection"]}/output/#{@options["environment"]}/html" - return "#{base}/#{subpath}/#{@id}.html" + "#{base}/#{subpath}/#{@id}.html" end def works diff --git a/lib/datura/to_es/vra_to_es/request.rb b/lib/datura/to_es/vra_to_es/request.rb index 974ac6be7..e8d0c1b69 100644 --- a/lib/datura/to_es/vra_to_es/request.rb +++ b/lib/datura/to_es/vra_to_es/request.rb @@ -1,7 +1,7 @@ class VraToEs < XmlToEs - # please refer to generic xml to es request file, request.rb - # and override methods specific to VRA transformation here + # please refer to generic es_request.rb file + # and override the JSON being sent to elasticsearch here, if needed # project specific overrides should go in the COLLECTION's overrides! 
end diff --git a/lib/datura/to_es/vra_to_es/vra_to_es_personography.rb b/lib/datura/to_es/vra_to_es/vra_to_es_personography.rb index 9ae5d718c..e4fd6f442 100644 --- a/lib/datura/to_es/vra_to_es/vra_to_es_personography.rb +++ b/lib/datura/to_es/vra_to_es/vra_to_es_personography.rb @@ -1,7 +1,7 @@ class VraToEsPersonography < TeiToEs def override_xpaths - return { + { "titles" => { "main" => "persName[@type='display']", "alt" => "persName" @@ -16,16 +16,16 @@ def category def creator creators = get_list(@xpaths["creators"], false, @parent_xml) - return creators.map { |creator| { "name" => creator } } + creators.map { |c| { "name" => c } } end def creator_sort - return get_text(@xpaths["creators"], false, @parent_xml) + get_text(@xpaths["creators"], false, @parent_xml) end def get_id person = @xml["id"] - return "#{@filename}_#{person}" + "#{@filename}_#{person}" end def person diff --git a/lib/datura/to_es/webs_to_es/fields.rb b/lib/datura/to_es/webs_to_es/fields.rb index 88a0cf760..aff7d27eb 100644 --- a/lib/datura/to_es/webs_to_es/fields.rb +++ b/lib/datura/to_es/webs_to_es/fields.rb @@ -149,7 +149,7 @@ def text body = get_text(@xpaths["text"], false) text << body text += text_additional - return CommonXml.normalize_space(text.join(" ")) + return Datura::Helpers.normalize_space(text.join(" ")) end def text_additional @@ -167,7 +167,7 @@ def title def title_sort t = title - CommonXml.normalize_name(t) + Datura::Helpers.normalize_name(t) end def topics diff --git a/lib/datura/to_es/webs_to_es/request.rb b/lib/datura/to_es/webs_to_es/request.rb index af67228dc..330795d28 100644 --- a/lib/datura/to_es/webs_to_es/request.rb +++ b/lib/datura/to_es/webs_to_es/request.rb @@ -1,6 +1,6 @@ class WebsToEs < XmlToEs - # please refer to generic xml to es request file, request.rb + # please refer to generic es_request.rb file # and override methods specific to Web Scraped HTML transformation here # project specific overrides should go in the COLLECTION's overrides! 
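The `CommonXml.normalize_space` / `normalize_name` calls above are being replaced by `Datura::Helpers` equivalents. A rough sketch of the behavior the relocated tests in `helpers_test.rb` expect from these helpers; this is an approximation for illustration, not the gem's implementation:

```ruby
# Approximation of Datura::Helpers.normalize_name / normalize_space,
# matching the expectations in helpers_test.rb (not the gem's actual code).
module HelpersSketch
  # lowercase and drop a leading English article, producing a sort key
  def self.normalize_name(name)
    name.downcase.sub(/\A(the|an|a) /, "")
  end

  # turn carriage returns and newlines into spaces, squash repeated spaces
  def self.normalize_space(text)
    text.tr("\r\n", "  ").squeeze(" ")
  end
end

HelpersSketch.normalize_name("The Title")           # => "title"
HelpersSketch.normalize_space(" \rExample \n \n ")  # => " Example "
```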
diff --git a/lib/datura/to_es/xml_to_es.rb b/lib/datura/to_es/xml_to_es.rb index 38aec2bc9..19324853c 100644 --- a/lib/datura/to_es/xml_to_es.rb +++ b/lib/datura/to_es/xml_to_es.rb @@ -1,7 +1,6 @@ require "nokogiri" require_relative "../helpers.rb" require_relative "../common_xml.rb" -require_relative "xml_to_es_request.rb" ######################################### # NOTE: DO NOT EDIT THIS FILE!!!!!!!!! # @@ -20,6 +19,7 @@ # about altering their behavior, customizing xpaths, etc class XmlToEs + include EsRequest attr_reader :json, :xml # variables @@ -51,7 +51,7 @@ def create_json end def get_id - return @filename + @filename end def override_xpaths @@ -74,7 +74,7 @@ def override_xpaths # returns an array with the html value in xpath def get_list(xpaths, keep_tags=false, xml=nil) xpath_array = xpaths.class == Array ? xpaths : [xpaths] - return get_xpaths(xpath_array, keep_tags, xml) + get_xpaths(xpath_array, keep_tags, xml) end # get_text @@ -87,7 +87,7 @@ def get_text(xpaths, keep_tags=false, xml=nil, delimiter=";") xpath_array = xpaths.class == Array ? 
xpaths : [xpaths] list = get_xpaths(xpath_array, keep_tags, xml) sorted = list.sort - return sorted.join("#{delimiter} ") + sorted.join("#{delimiter} ") end # Note: Recommend that collection team do NOT use this method directly @@ -111,13 +111,13 @@ def get_xpaths(xpaths, keep_tags=false, xml=nil) text = CommonXml.to_display_text(content) end # remove whitespace of all kinds from the text - text = CommonXml.normalize_space(text) + text = Datura::Helpers.normalize_space(text) if text.length > 0 list << text end end end - return list.uniq + list.uniq end def preprocessing diff --git a/test/common_xml_test.rb b/test/common_xml_test.rb index 05c4879a1..765f85c97 100644 --- a/test/common_xml_test.rb +++ b/test/common_xml_test.rb @@ -44,50 +44,6 @@ def test_create_xml_object # TODO end - def test_date_display - # normal dates - assert_equal "December 2, 2016", CommonXml.date_display("2016-12-02") - assert_equal "January 31, 2014", CommonXml.date_display("2014-01-31", "no date") - # no date - assert_equal "N.D.", CommonXml.date_display(nil) - assert_equal "no date", CommonXml.date_display("20143183", "no date") - assert_equal "", CommonXml.date_display(nil, "") - end - - def test_date_standardize - # missing month and day - assert_equal "2016-01-01", CommonXml.date_standardize("2016") - assert_equal "2016-12-31", CommonXml.date_standardize("2016", false) - # missing day - assert_nil CommonXml.date_standardize("01-12") - assert_equal "2014-01-01", CommonXml.date_standardize("2014-01") - assert_equal "2014-01-31", CommonXml.date_standardize("2014-01", false) - # complete date - assert_equal "2014-01-12", CommonXml.date_standardize("2014-01-12") - # invalid date - assert_nil CommonXml.date_standardize("2014-30-31") - # February final day - assert_equal "2015-02-28", CommonXml.date_standardize("2015-2", false) - assert_equal "2016-02-29", CommonXml.date_standardize("2016-02", false) - end - - def test_normalize_name - assert_equal "title", CommonXml.normalize_name("The 
Title") - assert_equal "anne of green gables", CommonXml.normalize_name("Anne of Green Gables") - assert_equal "fancy party", CommonXml.normalize_name("A Fancy Party") - assert_equal "hour", CommonXml.normalize_name("An Hour") - end - - def test_normalize_space - # ensure that return characters are replaced by spaces, and multispaces squashed - test1 = " \rExample \n \n " - assert_equal " Example ", CommonXml.normalize_space(test1) - - # check that newlines are dead regardless - test2 = "\rExa\rmple\n" - assert_equal " Exa mple ", CommonXml.normalize_space(test2) - end - def test_sub_corrections xml_string = "Somethng Something" xml = Nokogiri::XML xml_string diff --git a/test/helpers_test.rb b/test/helpers_test.rb index 7c3c72440..357c64e01 100644 --- a/test/helpers_test.rb +++ b/test/helpers_test.rb @@ -3,6 +3,33 @@ class Datura::HelpersTest < Minitest::Test + def test_date_display + # normal dates + assert_equal "December 2, 2016", Datura::Helpers.date_display("2016-12-02") + assert_equal "January 31, 2014", Datura::Helpers.date_display("2014-01-31", "no date") + # no date + assert_equal "N.D.", Datura::Helpers.date_display(nil) + assert_equal "no date", Datura::Helpers.date_display("20143183", "no date") + assert_equal "", Datura::Helpers.date_display(nil, "") + end + + def test_date_standardize + # missing month and day + assert_equal "2016-01-01", Datura::Helpers.date_standardize("2016") + assert_equal "2016-12-31", Datura::Helpers.date_standardize("2016", false) + # missing day + assert_nil Datura::Helpers.date_standardize("01-12") + assert_equal "2014-01-01", Datura::Helpers.date_standardize("2014-01") + assert_equal "2014-01-31", Datura::Helpers.date_standardize("2014-01", false) + # complete date + assert_equal "2014-01-12", Datura::Helpers.date_standardize("2014-01-12") + # invalid date + assert_nil Datura::Helpers.date_standardize("2014-30-31") + # February final day + assert_equal "2015-02-28", Datura::Helpers.date_standardize("2015-2", false) + 
assert_equal "2016-02-29", Datura::Helpers.date_standardize("2016-02", false) + end + def test_get_directory_files # real directory files = Datura::Helpers.get_directory_files("#{File.dirname(__FILE__)}/fixtures") @@ -25,6 +52,23 @@ def test_make_dirs # TODO end + def test_normalize_name + assert_equal "title", Datura::Helpers.normalize_name("The Title") + assert_equal "anne of green gables", Datura::Helpers.normalize_name("Anne of Green Gables") + assert_equal "fancy party", Datura::Helpers.normalize_name("A Fancy Party") + assert_equal "hour", Datura::Helpers.normalize_name("An Hour") + end + + def test_normalize_space + # ensure that return characters are replaced by spaces, and multispaces squashed + test1 = " \rExample \n \n " + assert_equal " Example ", Datura::Helpers.normalize_space(test1) + + # check that newlines are dead regardless + test2 = "\rExa\rmple\n" + assert_equal " Exa mple ", Datura::Helpers.normalize_space(test2) + end + def test_regex_files test_files = %w[ /path/to/cody.book.001.xml