This gem allows Ruby to efficiently extract information from PDF files.
It currently has only very rudimentary PDF editing capabilities.
API documentation is also available, and the test directory has usage examples.
The gem requires both the PDFium and FreeImage libraries.
An Ubuntu PPA is available for PDFium.
FreeImage should be installable via system packages.
# Assuming AWS::S3 is already authorized elsewhere
bucket = AWS::S3.new.buckets['my-pdfs']
pdf = PDFium::Document.from_memory bucket.objects['secrets.pdf'].read
pdf.pages.each do |page|
  # Render the complete page as a PNG with the height locked to 1000 pixels.
  # The width will be calculated to maintain the proper aspect ratio.
  path = "secrets/page-#{page.number}.png"
  bucket.objects[path].write page.as_image(height: 1000).data(:png)

  # Extract and save each embedded image as a PNG
  page.images.each do |image|
    path = "secrets/page-#{page.number}-image-#{image.index}.png"
    bucket.objects[path].write image.data(:png)
  end

  # Extract text from the page. It will be encoded as UTF-16LE by default.
  path = "secrets/page-#{page.number}-text.txt"
  bucket.objects[path].write page.text
end
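The comments above note that when only the height is given, the width is calculated to keep the aspect ratio. The arithmetic behind that can be sketched in plain Ruby. `scaled_width` is a hypothetical helper written for illustration, not part of the gem, and the gem's own rounding behaviour may differ:

```ruby
# Hypothetical helper (not part of the PDFium gem): given a page's original
# dimensions, compute the width that preserves the aspect ratio at a fixed height.
def scaled_width(orig_width, orig_height, height:)
  unless [orig_width, orig_height, height].all?(&:positive?)
    raise ArgumentError, "dimensions must be positive"
  end

  # Scale the width by the same factor applied to the height, then round
  # to a whole number of pixels.
  (orig_width.to_f / orig_height * height).round
end

# A US Letter page measures 612 x 792 points.
puts scaled_width(612, 792, height: 1000) # prints 773
```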
pdf = PDFium::Document.new("test.pdf")
pdf.save
Page count:
pdf.page_count
PDF Metadata:
pdf.metadata
Returns a hash with the keys :title, :author, :subject, :keywords, :creator, :producer, :creation_date, and :mod_date.
def print_bookmarks(list, indent = 0)
  list.bookmarks.each do |bm|
    print ' ' * indent
    puts bm.title
    # Recurse into the child bookmarks, indenting one level deeper
    print_bookmarks(bm.children, indent + 2)
  end
end
print_bookmarks(pdf.bookmarks)
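The recursive walk can be tried without a PDF by substituting plain objects for the bookmarks. In this minimal sketch the `Bookmark` Struct and `outline_lines` are hypothetical stand-ins that only mimic the `title` and `children` shape used above, and the sample titles are made up for illustration:

```ruby
# Stand-in for a bookmark: a title plus an array of child bookmarks.
# This Struct is hypothetical; it is not a class provided by the gem.
Bookmark = Struct.new(:title, :children)

# Walk the tree depth-first and return one indented line per bookmark.
def outline_lines(bookmarks, indent = 0)
  bookmarks.flat_map do |bm|
    ["#{' ' * indent}#{bm.title}"] + outline_lines(bm.children, indent + 2)
  end
end

tree = [
  Bookmark.new("Chapter 1", [
    Bookmark.new("Section 1.1", []),
    Bookmark.new("Section 1.2", [])
  ]),
  Bookmark.new("Chapter 2", [])
]

puts outline_lines(tree)
```

Returning the lines as an array rather than printing inside the recursion makes the traversal easy to check in a test.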
pdf.each_page do |page|
  page.as_image(width: 800).save("test-#{page.number}.png")
end
doc = PDFium::Document.new("test.pdf")
page = doc.page_at(0)
page.images.each do |image|
  image.save("page-0-image-#{image.index}.png")
end
Text is returned as a UTF-16LE encoded string. Future versions may return position information as well.
pdf.page_at(0).text.encode!("ASCII-8BIT")
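If the text may contain non-ASCII characters, converting to UTF-8 is often the more convenient target. The sketch below uses only Ruby's built-in `String#encode`; the `raw` string is a made-up stand-in for what `page.text` returns, so no PDF is needed to run it:

```ruby
# Stand-in for the string returned by page.text: sample text in UTF-16LE.
raw = "Caf\u00E9 menu".encode("UTF-16LE")

# encode returns a new string and leaves raw untouched;
# encode! would convert the receiver in place instead.
utf8 = raw.encode("UTF-8")

puts raw.encoding  # prints UTF-16LE
puts utf8.encoding # prints UTF-8
puts utf8
```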