Document parsing
avvertix committed Oct 1, 2024
1 parent 28f0602 commit 070007c
Showing 26 changed files with 970 additions and 16 deletions.
1 change: 0 additions & 1 deletion .github/FUNDING.yml

This file was deleted.

97 changes: 93 additions & 4 deletions README.md
@@ -4,7 +4,9 @@
[![Tests](https://img.shields.io/github/actions/workflow/status/oneofftech/oneofftech-parse-client/run-tests.yml?branch=main&label=tests&style=flat-square)](https://github.com/oneofftech/oneofftech-parse-client/actions/workflows/run-tests.yml)
[![Total Downloads](https://img.shields.io/packagist/dt/oneofftech/oneofftech-parse-client.svg?style=flat-square)](https://packagist.org/packages/oneofftech/oneofftech-parse-client)

Parse client is a library to interact with OneOffTech PDF Parsing service based on [PDFAct](https://github.com/data-house/pdfact). OneOffTech Parse is designed to extract text from PDF files maintaining the structure of the document to improve interaction with Large Language Models (LLMs).
Parse client is a library to interact with the [OneOffTech Parse](https://parse.oneofftech.de) service. OneOffTech Parse extracts text from PDF files while preserving the [structure of the document](#document-structure), which improves interaction with Large Language Models (LLMs).

OneOffTech Parse is based on [PDF Text extractor](https://github.com/data-house/pdf-text-extractor). The client can also connect to self-hosted instances of the [PDF Text extractor](https://github.com/data-house/pdf-text-extractor).


> [!INFO]
@@ -13,22 +15,109 @@ Parse client is a library to interact with OneOffTech PDF Parsing service based

## Installation

You can install the package via composer:
You can install the package via Composer:

```bash
composer require oneofftech/parse-client
```

## Usage

The Parse client can connect to self-hosted instances of the [PDF Text extractor](https://github.com/data-house/pdf-text-extractor) service or to the cloud-hosted [OneOffTech Parse](https://parse.oneofftech.de) service.

### Use with a self-hosted instance

You need a running instance of the [PDF Text extractor](https://github.com/data-house/pdf-text-extractor) before proceeding. Once it is running, create an instance of the connector client, passing the URL on which your instance is listening.

```php
use OneOffTech\Parse\Client\Connectors\ParseConnector;

$client = new ParseConnector(baseUrl: "http://localhost:5000");

/** @var \OneOffTech\Parse\Client\Dto\DocumentDto */
$document = $client->parse("https://domain.internal/document.pdf");
```

> [!INFO]
> - The URL of the document must be accessible without authentication.
> - Documents are downloaded only for the duration of processing; the file is deleted immediately afterwards.

### Use the cloud-hosted service

Go to [parse.oneofftech.de](https://parse.oneofftech.de) and obtain an access token. Instantiate the client with the token and provide the URL of a PDF document.

```php
use OneOffTech\Parse\Client\Connectors\ParseConnector;

$client = new ParseConnector("token");

/** @var \OneOffTech\Parse\Client\Dto\DocumentDto */
$document = $client->parse("https://domain.internal/document.pdf");
```

> [!INFO]
> - The URL of the document must be accessible without authentication.
> - Documents are downloaded only for the duration of processing; the file is deleted immediately afterwards.

### Specify the preferred extraction method

The Parse service supports two processors: [`pymupdf`](https://github.com/pymupdf/PyMuPDF) and [`pdfact`](https://github.com/data-house/pdfact). You can specify the preferred processor for each request. When no processor is specified, the client requests `pdfact`.

```php
...
use OneOffTech\Parse\Client\ParseOption;
use OneOffTech\Parse\Client\DocumentProcessor;
use OneOffTech\Parse\Client\Connectors\ParseConnector;

$client = new ParseConnector("token");

/** @var \OneOffTech\Parse\Client\Dto\DocumentDto */
$document = $client->parse(
    url: "https://domain.internal/document.pdf",
    options: new ParseOption(DocumentProcessor::PYMUPDF)
);
```

### PDFAct vs PyMuPDF

PDFAct identifies more of the document's structure than PyMuPDF. Evaluate which extraction method best suits your application. Here is a small comparison of the two methods.

| feature | PDFAct | PyMuPDF |
|-----------------------------------|--------|---------|
| Text extraction | :white_check_mark: | :white_check_mark: |
| Pagination | :white_check_mark: | :white_check_mark: |
| Headings identification | :white_check_mark: | - |
| Text styles (e.g. bold or italic) | :white_check_mark: | - |
| Page header | :white_check_mark: | - |
| Page footer | :white_check_mark: | - |




## Document structure

Parse is designed to preserve the document's structure, so the content is returned as a hierarchy.

```
Document
├─Page
│ ├─Text (category: heading)
│ └─Text (category: body)
└─Page
  ├─Text (category: heading)
  └─Text (category: body)
```

For a more in-depth explanation of the structure see [Parse Document Model](https://github.com/OneOffTech/parse-document-model-python).
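
As a rough illustration of the hierarchy, the sketch below walks a hand-written nested array and collects every non-empty `text` leaf in document order. The payload is hypothetical: only the `doc` category and the `category`, `content` and `text` keys appear in the client code, while the `page`, `heading` and `body` values and the exact nesting are assumptions made for this example. `collectText()` is an illustrative helper, not part of the package.

```php
<?php

// Hypothetical payload: category names other than 'doc' and the exact
// nesting are assumptions made for illustration only.
$document = [
    'category' => 'doc',
    'content' => [
        [
            'category' => 'page',
            'content' => [
                ['category' => 'heading', 'text' => 'Introduction'],
                ['category' => 'body', 'text' => 'First paragraph.'],
            ],
        ],
        [
            'category' => 'page',
            'content' => [
                ['category' => 'body', 'text' => ''],
                ['category' => 'body', 'text' => 'Second page text.'],
            ],
        ],
    ],
];

/**
 * Collect every non-empty 'text' leaf, in document order.
 *
 * @return string[]
 */
function collectText(array $node): array
{
    // LEAVES_ONLY visits scalar values only; the key is the leaf's own key.
    $iterator = new RecursiveIteratorIterator(
        new RecursiveArrayIterator($node),
        RecursiveIteratorIterator::LEAVES_ONLY
    );

    $lines = [];

    foreach ($iterator as $key => $value) {
        if ($key === 'text' && !empty($value)) {
            $lines[] = $value;
        }
    }

    return $lines;
}

$lines = collectText($document['content']);

echo implode(PHP_EOL, $lines), PHP_EOL;
```

Because the walk only looks at leaf keys, it does not depend on how deeply text nodes are nested inside a page.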


## Testing

Parse client is tested using [PEST](https://pestphp.com/). Tests run for each commit and pull request.

To execute the test suite run:

```bash
composer test
```
@@ -39,7 +128,7 @@ Please see [CHANGELOG](CHANGELOG.md) for more information on what has changed re

## Contributing

Thank you for considering contributing to the Librarian client! The contribution guide can be found in the [CONTRIBUTING.md](./.github/CONTRIBUTING.md) file.
Thank you for considering contributing to the Parse client! The contribution guide can be found in the [CONTRIBUTING.md](./.github/CONTRIBUTING.md) file.

## Security Vulnerabilities

13 changes: 8 additions & 5 deletions composer.json
@@ -1,9 +1,11 @@
{
"name": "oneofftech/parse-client",
"description": "This is my package oneofftech-parse-client",
"description": "Parse PDF document keeping the structure.",
"keywords": [
"OneOffTech",
"oneofftech-parse-client"
"pdf",
"parse",
"parsing",
"text-extract"
],
"homepage": "https://github.com/oneofftech/oneofftech-parse-client",
"license": "MIT",
@@ -19,8 +21,9 @@
"saloonphp/saloon": "^3.10"
},
"require-dev": {
"pestphp/pest": "^2.20",
"laravel/pint": "^1.0"
"jonpurvis/lawman": "^1.2",
"laravel/pint": "^1.0",
"pestphp/pest": "^2.20"
},
"autoload": {
"psr-4": {
89 changes: 89 additions & 0 deletions src/Connectors/ParseConnector.php
Original file line number Diff line number Diff line change
@@ -0,0 +1,89 @@
<?php

namespace OneOffTech\Parse\Client\Connectors;

use OneOffTech\Parse\Client\DocumentProcessor;
use OneOffTech\Parse\Client\Dto\DocumentDto;
use OneOffTech\Parse\Client\ParseOption;
use OneOffTech\Parse\Client\Requests\ExtractTextRequest;
use OneOffTech\Parse\Client\Responses\ParseResponse;
use Saloon\Contracts\Authenticator;
use Saloon\Http\Auth\NullAuthenticator;
use Saloon\Http\Auth\TokenAuthenticator;
use Saloon\Http\Connector;
use Saloon\Http\Response;
use Saloon\Traits\Plugins\AcceptsJson;
use Saloon\Traits\Plugins\AlwaysThrowOnErrors;
use Saloon\Traits\Plugins\HasTimeout;
use SensitiveParameter;

class ParseConnector extends Connector
{
    use AcceptsJson;
    use AlwaysThrowOnErrors;
    use HasTimeout;

    protected int $connectTimeout = 30;

    protected int $requestTimeout = 120;

    protected ?string $response = ParseResponse::class;

    public function __construct(

        /**
         * The authentication token
         */
        #[SensitiveParameter]
        public readonly ?string $token = null,

        /**
         * The base URL on which the API listens
         */
        protected readonly string $baseUrl = 'https://parse.oneofftech.de/api/v0',
    ) {
        //
    }

    public function resolveBaseUrl(): string
    {
        return $this->baseUrl;
    }

    protected function defaultAuth(): Authenticator
    {
        if (is_null($this->token)) {
            return new NullAuthenticator;
        }

        return new TokenAuthenticator($this->token);
    }

    /**
     * Determine if the request has failed.
     */
    public function hasRequestFailed(Response $response): ?bool
    {
        return $response->serverError() || $response->clientError();
    }

    // Resources and helper methods

    /**
     * Parse a document hosted on a web server
     *
     * @param string $url The URL under which the document is accessible
     * @param string $mimeType The mime type of the document. Default application/pdf
     * @param \OneOffTech\Parse\Client\ParseOption|null $options Specify additional options for the parsing processor. Defaults to the PDFAct processor when omitted
     */
    public function parse(string $url, string $mimeType = 'application/pdf', ?ParseOption $options = null): DocumentDto
    {
        return $this
            ->send((new ExtractTextRequest(
                url: $url,
                mimeType: $mimeType,
                preferredDocumentProcessor: $options?->processor?->value ?? DocumentProcessor::PDFACT->value,
            ))->validate())
            ->dto();
    }
}
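
The processor fallback in `parse()` relies on the nullsafe operator combined with `??`. The following is a minimal, self-contained sketch of that pattern. `Processor`, `Options` and `resolveProcessor()` are hypothetical stand-ins rather than the package's classes, and the backing values `'pdfact'` and `'pymupdf'` are assumed, since the enum definition is not shown here. It requires PHP 8.1 or later.

```php
<?php

// Hypothetical stand-ins for the package's enum and option class.
// The backing values are assumptions made for illustration.
enum Processor: string
{
    case PDFACT = 'pdfact';
    case PYMUPDF = 'pymupdf';
}

final class Options
{
    public function __construct(public readonly ?Processor $processor = null) {}
}

function resolveProcessor(?Options $options = null): string
{
    // The nullsafe chain yields null when the options object or its
    // processor is missing, and ?? then supplies the default.
    return $options?->processor?->value ?? Processor::PDFACT->value;
}

$default = resolveProcessor();
$unset = resolveProcessor(new Options());
$explicit = resolveProcessor(new Options(Processor::PYMUPDF));

echo $default, ' ', $unset, ' ', $explicit, PHP_EOL;
```

With this shape, a caller who passes no options and a caller who passes options without a processor both end up with the same default.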
115 changes: 115 additions & 0 deletions src/DocumentFormat/DocumentNode.php
Original file line number Diff line number Diff line change
@@ -0,0 +1,115 @@
<?php

namespace OneOffTech\Parse\Client\DocumentFormat;

use Countable;
use OneOffTech\Parse\Client\Exceptions\EmptyDocumentException;
use OneOffTech\Parse\Client\Exceptions\InvalidDocumentFormatException;
use RecursiveArrayIterator;
use RecursiveIteratorIterator;

class DocumentNode implements Countable
{
    public function __construct(
        public readonly array $content,
        public readonly array $attributes = [],
    ) {}

    public function type(): string
    {
        return 'doc';
    }

    /**
     * The number of pages in this document as extracted by the parser.
     */
    public function count(): int
    {
        return count($this->content);
    }

    /**
     * Test if the document is empty, i.e. contains no pages or has no textual content on any of the pages
     */
    public function isEmpty(): bool
    {
        return $this->count() === 0 || !$this->hasContent();
    }

    /**
     * Test if the document has discernible textual content on any of the pages
     */
    public function hasContent(): bool
    {
        foreach (new RecursiveIteratorIterator(new RecursiveArrayIterator($this->content), RecursiveIteratorIterator::LEAVES_ONLY) as $key => $value) {
            if ($key === 'text' && !empty($value)) {
                return true;
            }
        }

        return false;
    }

    /**
     * The pages in this document
     *
     * @return \OneOffTech\Parse\Client\DocumentFormat\PageNode[]
     */
    public function pages(): array
    {
        return array_map(fn ($page) => PageNode::fromArray($page), $this->content);
    }

    /**
     * The textual content of the document, one non-empty text entry per line
     */
    public function text(): string
    {
        $text = [];

        foreach (new RecursiveIteratorIterator(new RecursiveArrayIterator($this->content), RecursiveIteratorIterator::LEAVES_ONLY) as $key => $value) {
            if ($key === 'text' && !empty($value)) {
                $text[] = $value;
            }
        }

        return join(PHP_EOL, $text);
    }

    /**
     * Throw exception if document has no textual content
     *
     * @throws \OneOffTech\Parse\Client\Exceptions\EmptyDocumentException when document has no textual content
     */
    public function throwIfNoContent(): self
    {
        if (!$this->hasContent()) {
            throw new EmptyDocumentException("Document has no textual content.");
        }

        return $this;
    }

    /**
     * Create a document node from associative array
     *
     * @throws \OneOffTech\Parse\Client\Exceptions\InvalidDocumentFormatException when the array is not a valid document
     */
    public static function fromArray(array $data): DocumentNode
    {
        if (!(isset($data['category']) && isset($data['content']))) {
            throw new InvalidDocumentFormatException("Unexpected document structure. Missing category or content.");
        }

        if ($data['category'] !== 'doc') {
            throw new InvalidDocumentFormatException("Unexpected node category. Expecting [doc] found [{$data['category']}].");
        }

        if (!is_array($data['content'])) {
            throw new InvalidDocumentFormatException("Unexpected content format. Expecting [array].");
        }

        return new DocumentNode($data['content'] ?? [], $data['attributes'] ?? []);
    }
}
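
To see which payloads the three checks in `fromArray()` reject, here is a small standalone sketch that reproduces them as a plain function returning a description of the first problem found. `documentProblem()` is a hypothetical helper written for this example, not part of the package, and the sample payloads are invented.

```php
<?php

/**
 * Return a description of the first structural problem, or null when the
 * array passes all three checks. Hypothetical helper, not a package API.
 */
function documentProblem(array $data): ?string
{
    // isset() is false for missing keys and for keys holding null.
    if (!(isset($data['category']) && isset($data['content']))) {
        return 'missing category or content';
    }

    if ($data['category'] !== 'doc') {
        return "unexpected category [{$data['category']}]";
    }

    if (!is_array($data['content'])) {
        return 'content is not an array';
    }

    return null;
}

$results = [
    'valid' => documentProblem(['category' => 'doc', 'content' => []]),
    'no content' => documentProblem(['category' => 'doc']),
    'null content' => documentProblem(['category' => 'doc', 'content' => null]),
    'wrong category' => documentProblem(['category' => 'page', 'content' => []]),
    'scalar content' => documentProblem(['category' => 'doc', 'content' => 'text']),
];

foreach ($results as $label => $problem) {
    echo $label, ': ', $problem ?? 'ok', PHP_EOL;
}
```

Since `isset()` returns false for a key that holds `null`, a payload with `'content' => null` is already rejected by the first check, so the `?? []` fallback on `content` in the final constructor call is never reached for such input.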