Commit

Merge pull request #159 from CDRH/bugfix/webs_html
addresses need for webscraping to handle HTML not XML
techgique authored Feb 14, 2020
2 parents 010ca7a + 2082b22 commit fcbc112
Showing 5 changed files with 27 additions and 3 deletions.
7 changes: 7 additions & 0 deletions CHANGELOG.md
@@ -9,6 +9,13 @@ Versioning](https://semver.org/spec/v2.0.0.html).
 ## [Unreleased]
 Changelog up to date

+## [v0.1.6](https://github.com/CDRH/datura/compare/v0.1.5...v0.1.6) - 2020-02-11 - WEBS HTML Object
+
+### Changed
+- FileType elasticsearch transform now has swappable component when reading
+  XML-type files. Webscraping script altered to manipulate HTML instead of
+  XML object type
+
 ## [v0.1.5](https://github.com/CDRH/datura/compare/v0.1.4...v0.1.5) - 2020-02-03 - VRA to Solr

 ### Added
8 changes: 7 additions & 1 deletion lib/datura/common_xml.rb
@@ -32,11 +32,17 @@ def self.convert_tags_in_string(text)
     return converted.xpath("//xml").inner_html
   end

+  def self.create_html_object(filepath, remove_ns=true)
+    file_html = File.open(filepath) { |f| Nokogiri::HTML(f, &:noblanks) }
+    file_html.remove_namespaces! if remove_ns
+    file_html
+  end
+
   def self.create_xml_object(filepath, remove_ns=true)
     file_xml = File.open(filepath) { |f| Nokogiri::XML(f, &:noblanks) }
     # TODO is this a good idea?
     file_xml.remove_namespaces! if remove_ns
-    return file_xml
+    file_xml
   end

   # pass in a date and identify whether it should be before or after
9 changes: 8 additions & 1 deletion lib/datura/file_type.rb
@@ -42,6 +42,13 @@ def filename(ext=true)
     end
   end

+  # typically assumed to be an XML file, parsed as XML
+  # but in some cases (for example, web scraping) this needs
+  # to be overridden to parse HTML instead
+  def parse_markup_lang_file
+    CommonXml.create_xml_object(self.file_location)
+  end
+
   def post_es(url=nil)
     url = url || "#{@options["es_path"]}/#{@options["es_index"]}"
     begin
@@ -108,7 +115,7 @@ def print_solr
   def transform_es
     es_req = []
     begin
-      file_xml = CommonXml.create_xml_object(self.file_location)
+      file_xml = parse_markup_lang_file
       # check if any xpaths hit before continuing
       results = file_xml.xpath(*subdoc_xpaths.keys)
       if results.length == 0
4 changes: 4 additions & 0 deletions lib/datura/file_types/file_webs.rb
@@ -10,6 +10,10 @@ def initialize(file_location, options)
     super(file_location, options)
   end

+  def parse_markup_lang_file
+    CommonXml.create_html_object(self.file_location)
+  end
+
   def subdoc_xpaths
     { "/" => WebsToEs }
   end
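
The override mechanism in this commit is a parsing hook on the base class that a subclass replaces. A minimal sketch of that shape follows, assuming nothing about Datura beyond what the diff shows: `SketchFileType`, `SketchFileWebs`, and `transform` are hypothetical stand-ins, and the hook returns a symbol rather than a parsed document so the example runs without any gems.

```ruby
# Hypothetical stand-in for the base file type: transform calls the
# hook instead of hard-coding which parser to use
class SketchFileType
  def initialize(file_location)
    @file_location = file_location
  end

  # Default hook: treat the file as XML
  def parse_markup_lang_file
    :xml
  end

  def transform
    kind = parse_markup_lang_file
    "parsed #{@file_location} as #{kind}"
  end
end

# Hypothetical stand-in for the web scraping type: only the hook
# changes, transform is inherited untouched
class SketchFileWebs < SketchFileType
  def parse_markup_lang_file
    :html
  end
end

puts SketchFileType.new("letter.xml").transform
puts SketchFileWebs.new("page.html").transform
```

Keeping the parser choice in one small method means other file types can swap it without copying the rest of the transform logic.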
2 changes: 1 addition & 1 deletion lib/datura/version.rb
@@ -1,3 +1,3 @@
 module Datura
-  VERSION = "0.1.5"
+  VERSION = "0.1.6"
 end
