Added ParsedDOM API to safely manipulate XML
JoshyPHP committed Jun 8, 2023
1 parent 052cde4 commit 5b58f44
Showing 12 changed files with 849 additions and 6 deletions.
2 changes: 1 addition & 1 deletion composer.json
@@ -6,7 +6,7 @@
"keywords": ["bbcode","bbcodes","blog","censor","embed","emoji","emoticons","engine","forum","html","markdown","markup","media","parser","shortcodes"],
"license": "MIT",
"require": {
-"php": ">=7.4",
+"php": "^8.0",
"ext-dom": "*",
"ext-filter": "*",
"lib-pcre": ">=8.13",
89 changes: 89 additions & 0 deletions docs/Utils/ParsedDOM.md
@@ -0,0 +1,89 @@
### Getting started

The ParsedDOM utility allows you to load the [parsed representation of a text](/Getting_started/How_it_works/) (XML that is usually stored in a database) into a DOM document and operate on it with regular DOM methods as well as a specialized API. Compared to plain string manipulation, it provides better guarantees that the resulting XML will match what the parser would normally produce. It is best suited for maintenance tasks. For lightweight, real-time operations, prefer the limited but more efficient [Utils](https://s9e.github.io/TextFormatter/api/s9e/TextFormatter/Utils.html) class where possible.

```php
// Start with the parsed representation of the text
$xml = '<r><p>Hello <EM><s>*</s>world<e>*</e></EM> &#128512;</p></r>';

// Load it into a DOM document
$dom = s9e\TextFormatter\Utils\ParsedDOM::loadXML($xml);

// Select each EM element using XPath...
foreach ($dom->query('//EM') as $em)
{
	// ...and unparse it
	$em->unparse();
}

// Converting the document to a string will serialize it back to XML in a way that
// matches what the parser would output. This is different from calling saveXML()
echo '__toString() ', (string) $dom . "\n";
echo 'saveXML() ', $dom->saveXML();
```
```
__toString() <t><p>Hello *world* &#128512;</p></t>
saveXML() <?xml version="1.0"?>
<t><p>Hello *world* &#x1F600;</p></t>
```
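
The two outputs above differ in the XML declaration and in how the emoji is serialized: `__toString()` emits a decimal entity, as the parser does. The following is a minimal sketch of that second step, assuming the mbstring extension is available. The function name `encodeSupplementaryCharacters` is hypothetical and this is not the library's own implementation, which lives in `Utils::encodeUnicodeSupplementaryCharacters()`.

```php
<?php

// Hypothetical helper, not the library's code: replaces every character outside of the
// Basic Multilingual Plane (U+10000 and above) with a decimal numeric entity
function encodeSupplementaryCharacters(string $xml): string
{
	return (string) preg_replace_callback(
		'/[\\x{10000}-\\x{10FFFF}]/u',
		fn(array $m): string => '&#' . mb_ord($m[0], 'UTF-8') . ';',
		$xml
	);
}

echo encodeSupplementaryCharacters("<t>Hello \u{1F600}</t>"), "\n";
```
```
<t>Hello &#128512;</t>
```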


### Replacing a tag and its markup

In the following example, we replace Markdown-style emphasis with an `I` BBCode. The Litedown plugin uses `EM` tags for emphasis whereas the BBCodes plugin uses `I` tags for `I` BBCodes, so we have to replace the element with a new tag and replace its markup without touching its content.

```php
$xml = '<r><p>Hello <EM><s>*</s>world<e>*</e></EM></p></r>';
$dom = s9e\TextFormatter\Utils\ParsedDOM::loadXML($xml);

// Select each EM element
foreach ($dom->query('//EM') as $em)
{
	// Replace it with what an I tag would generate (an I element)
	$i = $em->replaceTag('I');

	// Set the markup for this new element/tag; it will be placed in the appropriate location
	$i->setMarkupStart('[i]');
	$i->setMarkupEnd('[/i]');
}

echo $dom;
```
```
<r><p>Hello <I><s>[i]</s>world<e>[/i]</e></I></p></r>
```
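
To illustrate what "the appropriate location" means for start markup, here is a rough sketch written against the plain DOM extension. It assumes the behavior described in the test suite (an `s` element is created, replaced if it already exists, and removed when the markup is empty). The function `setStartMarkup` is a hypothetical stand-in and may differ from how `setMarkupStart()` is actually implemented.

```php
<?php

// Hypothetical stand-in for setMarkupStart(), using only the DOM extension
function setStartMarkup(DOMElement $element, string $markup): void
{
	// Remove the existing start markup, if any. Only direct children are considered
	foreach (iterator_to_array($element->childNodes) as $child)
	{
		if ($child->nodeName === 's')
		{
			$element->removeChild($child);
		}
	}

	// An empty string means no markup at all
	if ($markup === '')
	{
		return;
	}

	// Start markup goes before any other content
	$s = $element->ownerDocument->createElement('s');
	$s->appendChild($element->ownerDocument->createTextNode($markup));
	$element->insertBefore($s, $element->firstChild);
}

$dom = new DOMDocument;
$dom->loadXML('<r><EM><s>*</s>world<e>*</e></EM></r>');
setStartMarkup($dom->documentElement->firstChild, '[i]');

echo $dom->saveXML($dom->documentElement), "\n";
```
```
<r><EM><s>[i]</s>world<e>*</e></EM></r>
```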


### Replacing a tag and its content

In the following example, we replace an embedded YouTube video with a normal text link using BBCode markup. Here we set its text content to be the YouTube URL, but it could be replaced by something more meaningful such as the video's title.


```php
$xml = '<r><YOUTUBE id="QH2-TGUlwu4">https://www.youtube.com/watch?v=QH2-TGUlwu4</YOUTUBE></r>';
$dom = s9e\TextFormatter\Utils\ParsedDOM::loadXML($xml);

// Select each YOUTUBE element with an id attribute
foreach ($dom->query('//YOUTUBE[@id]') as $youtubeElement)
{
	// Generate a URL for the original video
	$url = str_starts_with($youtubeElement->textContent, 'https://')
	     ? $youtubeElement->textContent
	     : 'https://youtu.be/' . $youtubeElement->getAttribute('id');

	// Replace the YOUTUBE element with what a [url] BBCode would produce. The default [url]
	// BBCode uses a URL tag with a url attribute
	$urlElement = $youtubeElement->replaceTag('URL', ['url' => $url]);

	// Reset its text content and add the appropriate markup. The order is important here as
	// overwriting the text content of an element will remove its markup
	$urlElement->textContent = $url;
	$urlElement->setMarkupStart('[url]');
	$urlElement->setMarkupEnd('[/url]');
}

echo $dom;
```
```
<r><URL url="https://www.youtube.com/watch?v=QH2-TGUlwu4"><s>[url]</s>https://www.youtube.com/watch?v=QH2-TGUlwu4<e>[/url]</e></URL></r>
```
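
The note about ordering in the example above can be checked with the plain DOM extension alone. This small sketch assumes that ParsedDOM elements inherit the standard `textContent` behavior of `DOMElement`; the XML and the URL are made up for illustration.

```php
<?php

$dom = new DOMDocument;
$dom->loadXML('<r><URL url="https://example.org"><s>[url]</s>link<e>[/url]</e></URL></r>');
$url = $dom->documentElement->firstChild;

// Assigning textContent replaces every child node, including the s and e elements
$url->textContent = 'new text';

echo $dom->saveXML($dom->documentElement), "\n";
```
```
<r><URL url="https://example.org">new text</URL></r>
```

This is why the markup has to be set after the text content, not before.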
38 changes: 36 additions & 2 deletions docs/testdox.txt
@@ -3780,7 +3780,7 @@ Configurable (s9e\TextFormatter\Tests\Configurator\Traits\Configurable)
[x] __get() throws a RuntimeException if the property does not exist
[x] __get($k) returns null if the property is null
[x] __set('foo', 'bar') calls setFoo('bar') if it exists
-[ ] __set() can create new properties
+[x] __set() can create new properties
[x] __set() can replace an instance of Foo with another instance of Foo
[x] __set() can replace an instance of Foo with an instance of FooPlus, which extends Foo
[x] __set() throws an exception if an instance of Foo would be replaced by an instance of Bar
@@ -7866,7 +7866,7 @@ Unformatted (s9e\TextFormatter\Tests\Renderers\Unformatted)
XSLT (s9e\TextFormatter\Tests\Renderers\XSLT)
[x] Is serializable
[x] Does not serialize the XSLTProcessor instance
-[ ] Preserves other properties during serialization
+[x] Preserves other properties during serialization
[x] setParameter() accepts values that contain both types of quotes but replaces ASCII character " with Unicode character 0xFF02 because of https://bugs.php.net/64137
[x] Does not output </embed> end tags
[x] Does not improperly replace single quotes inside attribute values
@@ -7942,6 +7942,40 @@ Http (s9e\TextFormatter\Tests\Utils\Http)
[x] getClient() returns an instance of s9e\TextFormatter\Utils\Http\Client
[x] getCachingClient() returns an instance of s9e\TextFormatter\Utils\Http\Clients\Cached that implements s9e\TextFormatter\Utils\Http\Client

Document (s9e\TextFormatter\Tests\Utils\ParsedDOM\Document)
[x] createTagElement('b') normalizes tag name to 'B'
[x] createTagElement('foo:BAR') creates a namespaced tag
[x] createTagElement() sets attributes
[x] createTagElement() normalizes attribute names
[x] normalizeDocument() normalizes elements
[x] normalizeDocument() removes superfluous namespaces
[x] __toString() returns a string without an XML declaration
[x] __toString() returns "<t></t>" for completely empty content
[x] __toString() removes empty markup
[x] __toString() encodes SMP characters

Element (s9e\TextFormatter\Tests\Utils\ParsedDOM\Element)
[x] normalize() removes empty s elements
[x] normalize() removes empty e elements
[x] normalize() removes empty i elements
[x] normalize() does not remove e/i/s elements that only contain whitespace
[x] normalize() does not remove empty E elements
[x] normalize() sorts attributes by name
[x] normalize() runs recursively
[x] setMarkupEnd('*') creates an 'e' element
[x] setMarkupEnd('*') replaces the 'e' element if it exists
[x] setMarkupEnd('') removes the 'e' element
[x] setMarkupStart('*') creates an 's' element
[x] setMarkupStart('*') replaces the 's' element if it exists
[x] setMarkupStart('') removes the 's' element
[x] unparse() does not remove any content
[x] unparse() does not apply recursively
[x] replaceTag() replaces a tag and its attributes
[x] replaceTag() normalizes tag names and attribute names

Parsed DOM (s9e\TextFormatter\Tests\Utils\ParsedDOM)
[x] loadXML() returns an instance of s9e\TextFormatter\Utils\ParsedDOM\Document

XPath (s9e\TextFormatter\Utils\XPath)
[x] export('foo') returns 'foo'
[x] export("d'oh") returns "d'oh"
2 changes: 2 additions & 0 deletions mkdocs.yml
@@ -112,6 +112,8 @@ nav:
- Introduction: JavaScript/Introduction.md
- Live preview attributes: JavaScript/Live_preview_attributes.md
- Minifiers: JavaScript/Minifiers.md
+- Utilities:
+  - ParsedDOM: Utils/ParsedDOM.md
- Internals:
- API docs: Internals/API_docs.md
- API changes: Internals/API_changes.md
2 changes: 1 addition & 1 deletion src/Bundles/Fatdown.php

Large diffs are not rendered by default.

4 changes: 2 additions & 2 deletions src/Bundles/Forum.php

Large diffs are not rendered by default.

22 changes: 22 additions & 0 deletions src/Utils/ParsedDOM.php
@@ -0,0 +1,22 @@
<?php declare(strict_types=1);

/**
 * @package s9e\TextFormatter
 * @copyright Copyright (c) The s9e authors
 * @license http://www.opensource.org/licenses/mit-license.php The MIT License
 */
namespace s9e\TextFormatter\Utils;

use const LIBXML_NONET;
use s9e\TextFormatter\Utils\ParsedDOM\Document;

abstract class ParsedDOM
{
	public static function loadXML(string $xml): Document
	{
		$dom = new Document;
		$dom->loadXML($xml, LIBXML_NONET);

		return $dom;
	}
}
115 changes: 115 additions & 0 deletions src/Utils/ParsedDOM/Document.php
@@ -0,0 +1,115 @@
<?php declare(strict_types=1);

/**
 * @package s9e\TextFormatter
 * @copyright Copyright (c) The s9e authors
 * @license http://www.opensource.org/licenses/mit-license.php The MIT License
 */
namespace s9e\TextFormatter\Utils\ParsedDOM;

use const LIBXML_NSCLEAN, SORT_STRING, false;
use function ksort, substr, strpos;
use s9e\SweetDOM\Document as SweetDocument;
use s9e\TextFormatter\Configurator\Validators\TagName;
use s9e\TextFormatter\Configurator\Validators\AttributeName;
use s9e\TextFormatter\Utils;

class Document extends SweetDocument
{
	/**
	 * @link https://www.php.net/manual/domdocument.construct.php
	 *
	 * @param string $version Version number of the document
	 * @param string $encoding Encoding of the document
	 */
	public function __construct(string $version = '1.0', string $encoding = 'utf-8')
	{
		parent::__construct($version, $encoding);

		$this->registerNodeClass('DOMElement', Element::class);
	}

	public function __toString(): string
	{
		$this->formatOutput = false;
		$this->normalizeDocument();

		$xml = $this->saveXML($this->documentElement, LIBXML_NSCLEAN);
		$xml = Utils::encodeUnicodeSupplementaryCharacters($xml);

		return ($xml === '<t/>') ? '<t></t>' : $xml;
	}

	/**
	 * @link https://www.php.net/manual/en/domdocument.normalizedocument.php
	 */
	public function normalizeDocument(): void
	{
		parent::normalizeDocument();
		$this->documentElement->normalize();

		$nodeName = $this->documentElement->firstOf('.//*[name() != "br"][name() != "p"]') ? 'r' : 't';

		$root = $this->createElement($nodeName);
		while (isset($this->documentElement->firstChild))
		{
			$root->appendChild($this->documentElement->firstChild);
		}
		$this->documentElement->replaceWith($root);
	}

	/**
	 * Create an element that represents a tag
	 *
	 * @param string $tagName
	 * @param array<string, string> $attributes
	 * @return Element
	 */
	public function createTagElement(string $tagName, array $attributes = []): Element
	{
		$tagName = TagName::normalize($tagName);
		$pos = strpos($tagName, ':');

		if ($pos === false)
		{
			$element = $this->createElement($tagName);
		}
		else
		{
			$prefix = substr($tagName, 0, $pos);
			$namespaceURI = 'urn:s9e:TextFormatter:' . $prefix;
			$this->documentElement->setAttributeNS(
				'http://www.w3.org/2000/xmlns/',
				'xmlns:' . $prefix,
				$namespaceURI
			);

			$element = $this->createElementNS($namespaceURI, $tagName);
		}

		foreach ($this->normalizeAttributeMap($attributes) as $attrName => $attrValue)
		{
			$element->setAttribute($attrName, $attrValue);
		}

		return $element;
	}

	/**
	 * @param array<string, string> $attributes
	 * @return array<string, string> $attributes
	 */
	protected function normalizeAttributeMap(array $attributes): array
	{
		$map = [];
		foreach ($attributes as $attrName => $attrValue)
		{
			$attrName = AttributeName::normalize($attrName);
			$map[$attrName] = (string) $attrValue;

		}
		ksort($map, SORT_STRING);

		return $map;
	}
}
}
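
The root element selection in `normalizeDocument()` can be reproduced outside of the class. The sketch below uses the same XPath expression with a plain `DOMXPath` instead of SweetDOM's `firstOf()`. The helper name `chooseRootName` is hypothetical, and unlike the real method it does not first remove empty markup elements.

```php
<?php

// Hypothetical helper mirroring the XPath test used by normalizeDocument():
// any element other than br or p means the text is rich ("r"), otherwise plain ("t")
function chooseRootName(string $xml): string
{
	$dom = new DOMDocument;
	$dom->loadXML($xml, LIBXML_NONET);

	$xpath = new DOMXPath($dom);
	$nodes = $xpath->query('.//*[name() != "br"][name() != "p"]', $dom->documentElement);

	return ($nodes->length > 0) ? 'r' : 't';
}

echo chooseRootName('<r><p>Hello <EM><s>*</s>world<e>*</e></EM></p></r>'), "\n";
echo chooseRootName('<r><p>Hello *world*</p></r>'), "\n";
```
```
r
t
```

This matches the first documentation example, where unparsing the only `EM` element turns the `r` root into a `t` root.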