
Commit ec5ef05

Optimize memory usage (#6)
Parsing:
* Add iterative parsing as an optional behavior
* Customized hash function

Inserting:
* Add optional max_lines in insert

Model config:
* Add configurable metadata
* Allow custom indices to be added
1 parent b547653 commit ec5ef05


42 files changed: +1331 -1044 lines changed

.github/workflows/python-package.yml (+1 -1)

@@ -17,7 +17,7 @@ jobs:
     strategy:
       fail-fast: false
       matrix:
-        python-version: ["3.9", "3.10", "3.11"]
+        python-version: ["3.9", "3.10", "3.11", "3.12"]
 
     steps:
       - uses: actions/checkout@v3

docs/api/overview.md (+11 -5)

@@ -28,15 +28,22 @@
 ## *Advanced use:* loading data into the database
 
 The flow chart below presents data conversions used to load an XML file into the database, showing the functions used
-for lower level steps. It can be useful for advanced use case if you want for instance to transform the data in
-intermediate steps.
+for lower level steps. It can be useful for advanced use cases, for instance:
+
+* transforming the data in intermediate steps,
+* adding logging,
+* limiting concurrent access to the database within a multiprocess setup, etc.
+
+For those scenarios you can easily reimplement
+[`Document.insert_into_target_tables`](document.md/#xml2db.document.Document.insert_into_target_tables) to suit your
+needs, using lower level functions.
 
 ```mermaid
 flowchart TB
 subgraph "<a href='../data_model/#xml2db.model.DataModel.parse_xml' style='color:var(--md-code-fg-color)'>DataModel.parse_xml</a>"
 direction TB
 A[XML file]-- "<a href='../xml_converter/#xml2db.xml_converter.XMLConverter.parse_xml' style='color:var(--md-code-fg-color)'>XMLConverter.parse_xml</a>" -->B[Document tree]
-B-- "Document._compute_records_hashes\n<a href='../document/#xml2db.document.Document.doc_tree_to_flat_data' style='color:var(--md-code-fg-color)'>Document.doc_tree_to_flat_data</a>" -->C[Flat data model]
+B-- "<a href='../document/#xml2db.document.Document.doc_tree_to_flat_data' style='color:var(--md-code-fg-color)'>Document.doc_tree_to_flat_data</a>" -->C[Flat data model]
 end
 C -.- D
 subgraph "<a href='../document/#xml2db.document.Document.insert_into_target_tables' style='color:var(--md-code-fg-color)'>Document.insert_into_target_tables</a>"

@@ -49,8 +56,7 @@ flowchart TB
 ## *Advanced use:* get data from the database back to XML
 
 The flow chart below presents data conversions used to get back data from the database into XML, showing the functions
-used for lower level steps. It can be useful for advanced use case if you want for instance to transform the data in
-intermediate steps.
+used for lower level steps.
 
 ```mermaid
 flowchart TB

docs/configuring.md (+72 -17)

@@ -16,9 +16,62 @@ The column types can also be configured to override the default type mapping, us
 diagram (see the [Getting started](getting_started.md) page for directions on how to visualize data models) and
 then adapt the configuration if need be.
 
-Configuration options are described below.
+Configuration options are described below. Some options can be set at the model level, others at the table level and
+others at the field level. The general structure of the configuration dict is the following:
+
+```py title="Model config general structure" linenums="1"
+{
+    "document_tree_hook": None,
+    "document_tree_node_hook": None,
+    "row_numbers": False,
+    "as_columnstore": False,
+    "metadata_columns": None,
+    "tables": {
+        "table1": {
+            "reuse": True,
+            "choice_transform": False,
+            "as_columnstore": False,
+            "fields": {
+                "my_column": {
+                    "type": None  # default type
+                }
+            },
+            "extra_args": [],
+        }
+    }
+}
+```
+
+## Model configuration
 
-## Field level config
+The following options can be passed as top-level keys of the model configuration `dict`:
+
+* `document_tree_hook` (`Callable`): sets a hook function which can modify the data extracted from the XML. It gives direct
+access to the underlying tree data structure just before it is extracted to be loaded to the database. This can be used,
+for instance, to prune or modify some parts of the document tree before loading it into the database. The document tree
+should of course stay compatible with the data model.
+* `document_tree_node_hook` (`Callable`): sets a hook function which can modify the data extracted from the XML. It is
+similar to `document_tree_hook`, but it is called as soon as a node is completed, without waiting for the entire parsing
+to finish. It is especially useful if you intend to filter out some nodes and reduce memory footprint while parsing.
+* `row_numbers` (`bool`): adds `xml2db_row_number` columns either to `n-n` relationships tables, or directly to data tables when
+deduplication of rows is opted out. This allows recording the original order of elements in the source XML, which is not
+always respected otherwise. It was implemented primarily for round-trip tests, but could serve other purposes. The
+default value is `False` (disabled).
+* `as_columnstore` (`bool`): for MS SQL Server, create clustered columnstore indexes on all tables. This can also be set up at
+the table level for each table. However, for `n-n` relationships tables, this option is the only way to configure the
+clustered columnstore indexes. The default value is `False` (disabled).
+* `metadata_columns` (`list`): a list of extra columns that you want to add to the root table of your model. This is
+useful for instance to add the name of the file which has been parsed, or a timestamp, etc. Columns should be specified
+as dicts; the only required keys are `name` and `type` (a SQLAlchemy type object); other keys will be passed directly
+as keyword arguments to `sqlalchemy.Column`. Actual values need to be passed to
+[`Document.insert_into_target_tables`](api/document.md#xml2db.document.Document.insert_into_target_tables) for each
+parsed document, as a `dict`, using the `metadata` argument.
+* `record_hash_column_name`: the column name to use to store records hash data (defaults to `xml2db_record_hash`).
+* `record_hash_constructor`: a function used to build a hash, with a signature similar to `hashlib` constructor
+functions (defaults to `hashlib.sha1`).
+* `record_hash_size`: the byte size of the record hash (defaults to 20, which is the size of a `sha-1` hash).
+
+## Fields configuration
 
 These configuration options are defined for a specific field of a specific table. A "field" refers to a column in the
 table, or a child table.

@@ -140,7 +193,7 @@ timeInterval_end[1, 1]: string
 }
 ```
 
-## Table level config
+## Tables configuration
 
 ### Simplify "choice groups"
 

@@ -226,20 +279,22 @@ With MS SQL Server database backend, `xml2db` can create
 on tables. However, for `n-n` relationships tables, this option needs to be set globally (see below). The default value
 is `False` (disabled).
 
-Configuration: `"as_columnstore":` `False` (default) or `True`
+### Extra arguments
 
-## Global options
+Extra arguments can be passed to `sqlalchemy.Table` constructors, for instance if you want to customize indexes. These
+can be passed in an iterable (e.g. `tuple` or `list`) which will be simply unpacked into the `sqlalchemy.Table`
+constructor when building the table.
 
-These options can be passed as a top-level keys of the model configuration `dict`:
+Configuration: `"extra_args": []` (default)
 
-* `document_tree_hook` (`Callable`): sets a hook function which can modify the data extracted from the XML. It gives direct
-access to the underlying tree data structure just before it is extracted to be loaded to the database. This can be used,
-for instance, to prune or modify some parts of the document tree before loading it into the database. The document tree
-should of course stay compatible with the data model.
-* `row_numbers` (`bool`): adds `xml2db_row_number` columns either to `n-n` relationships tables, or directly to data tables when
-deduplication of rows is opted out. This allows recording the original order of elements in the source XML, which is not
-always respected otherwise. It was implemented primarily for round-trip tests, but could serve other purposes. The
-default value is `False` (disabled).
-* `as_columnstore` (`bool`): for MS SQL Server, create clustered columnstore indexes on all tables. This can be also set up at
-the table level for each table. However, for `n-n` relationships tables, this option is the only way to configure the
-clustered columnstore indexes. The default value is `False` (disabled).
+!!! example
+    Adding an index on a specific column:
+    ``` python
+    model_config = {
+        "tables": {
+            "my_table": {
+                "extra_args": [sqlalchemy.Index("my_index", "my_column1", "my_column2")],
+            }
+        }
+    }
+    ```
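Editor's note on the record hash options above: they interact, since swapping the hash constructor requires the stored hash size to match the new digest size. A minimal sketch, assuming only the documented option keys (`record_hash_column_name`, `record_hash_constructor`, `record_hash_size`):

```python
import hashlib

# Hypothetical model config switching the record hash from the default
# sha-1 (20 bytes) to sha-256 (32 bytes). Key names come from the docs above.
model_config = {
    "record_hash_column_name": "xml2db_record_hash",   # default name
    "record_hash_constructor": hashlib.sha256,         # default: hashlib.sha1
    "record_hash_size": hashlib.sha256().digest_size,  # 32 bytes for sha-256
}
```

Keeping `record_hash_size` tied to `digest_size` avoids a mismatch between the column width and the actual hash length.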

docs/getting_started.md (+3 -5)

@@ -117,11 +117,9 @@ Please read the [How it works](how_it_works.md) page to learn more about the pro
 troubleshooting if need be.
 
 !!! note
-    `xml2db` saves metadata for all loaded XML files. These are currently not configurable and create two additional
-    columns in the root table:
-
-    * `xml2db_input_file_path`: the file path provided to `DataModel.parse_xml`,
-    * `xml2db_processed_at`: the timestamp at which `DataModel.parse_xml` was called.
+    `xml2db` can save metadata for each loaded XML file. These can be configured using the
+    [`metadata_columns` option](configuring.md#model-configuration) and create additional columns in the root table.
+    It can be used, for instance, to save the file name or a loading timestamp.
 
 ## Getting back the data into XML
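Editor's note: per the configuring docs changed in this commit, per-document values for these metadata columns are passed via the `metadata` argument of `Document.insert_into_target_tables`. A hedged sketch of just the values dict (the column names here are illustrative, not built into `xml2db`):

```python
from datetime import datetime, timezone

# Hypothetical values for two user-defined metadata columns declared via
# the `metadata_columns` model option (names are illustrative only):
metadata = {
    "input_file_path": "data/sample.xml",
    "processed_at": datetime.now(timezone.utc),
}
# These values would then be passed as `metadata=metadata` when calling
# Document.insert_into_target_tables for the parsed document.
```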

docs/how_it_works.md (+5 -3)

@@ -151,8 +151,8 @@ in memory makes the processing way simpler and faster. We handle files with a si
 
 ### Computing hashes
 
-We compute tree hashes (`sha-1`) recursively by adding to each node's hash the hashes of its children element, be it
-simple types, attributes or complex types. Children are processed in the specific order they appeared in the XSD schema,
+We compute tree hashes recursively by adding to each node's hash the hashes of its children elements, be they simple
+types, attributes or complex types. Children are processed in the specific order they appear in the XSD schema,
 so that hashing is really deterministic.
 
 Right after this step, a hook function is called if provided in the configuration (top-level `document_tree_hook` option

@@ -187,7 +187,9 @@ We keep the primary keys from the flat data model created at the previous stage,
 The last step is to merge the temporary tables data into the target tables, while enforcing deduplication, keeping
 relationships, etc.
 
-This is done by issuing a sequence of `update` and `insert` SQL statements using `sqlalchemy`, in a single transaction.
+This is done by issuing a sequence of `update` and `insert` SQL statements using `sqlalchemy`, in a single transaction
+(default) or in multiple transactions.
+
 The process boils down to:
 
 * inserting missing records into the target tables,
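Editor's note: the deterministic tree hashing described in the first hunk above can be illustrated with a toy sketch (not `xml2db`'s actual implementation). Each node's hash folds in its children's hashes in a fixed order, so identical trees always produce the same digest and reordering children changes it:

```python
import hashlib

def node_hash(node: dict, hash_constructor=hashlib.sha1) -> bytes:
    """Hash a node by folding in its own value and its children's hashes."""
    h = hash_constructor()
    h.update(str(node.get("value", "")).encode("utf-8"))
    # Children are visited in a fixed order, which makes the hash deterministic
    for child in node.get("children", []):
        h.update(node_hash(child, hash_constructor))
    return h.digest()

# Two structurally identical trees, and one with children reordered
tree = {"value": "root", "children": [{"value": "a"}, {"value": "b"}]}
same = {"value": "root", "children": [{"value": "a"}, {"value": "b"}]}
reordered = {"value": "root", "children": [{"value": "b"}, {"value": "a"}]}
```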

pyproject.toml (+1 -1)

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "xml2db"
-version = "0.10.1"
+version = "0.11.0"
 authors = [
   { name="Commission de régulation de l'énergie", email="opensource@cre.fr" },
 ]

requirements.txt (+17 -19)

@@ -1,18 +1,18 @@
-Babel==2.14.0
-certifi==2024.2.2
+Babel==2.15.0
+certifi==2024.6.2
 charset-normalizer==3.3.2
 click==8.1.7
 colorama==0.4.6
 elementpath==4.4.0
-exceptiongroup==1.2.0
+exceptiongroup==1.2.1
 ghp-import==2.1.0
 greenlet==3.0.3
-griffe==0.45.0
-idna==3.6
+griffe==0.45.2
+idna==3.7
 iniconfig==2.0.0
-Jinja2==3.1.3
+Jinja2==3.1.4
 lxml==5.1.0
-Markdown==3.5.2
+Markdown==3.6
 MarkupSafe==2.1.5
 mergedeep==1.3.4
 mkdocs==1.6.0

@@ -25,22 +25,20 @@ mkdocstrings-python==1.10.2
 packaging==24.0
 paginate==0.5.6
 pathspec==0.12.1
-platformdirs==4.2.0
-pluggy==1.4.0
-psycopg2==2.9.9
-Pygments==2.17.2
-pymdown-extensions==10.7.1
-pyodbc==5.1.0
-pytest==8.1.1
+platformdirs==4.2.2
+pluggy==1.5.0
+Pygments==2.18.0
+pymdown-extensions==10.8.1
+pytest==8.2.2
 python-dateutil==2.9.0.post0
 PyYAML==6.0.1
 pyyaml_env_tag==0.1
-regex==2023.12.25
-requests==2.31.0
+regex==2024.5.15
+requests==2.32.3
 six==1.16.0
-SQLAlchemy==2.0.28
+SQLAlchemy==2.0.30
 tomli==2.0.1
-typing_extensions==4.10.0
+typing_extensions==4.12.1
 urllib3==2.2.1
-watchdog==4.0.0
+watchdog==4.0.1
 xmlschema==3.1.0

src/xml2db/__init__.py (+3 -3)

@@ -1,6 +1,6 @@
-from xml2db.model import DataModel
-from xml2db.document import Document
-from xml2db.table import (
+from .model import DataModel
+from .document import Document
+from .table import (
     DataModelTable,
     DataModelTableReused,
     DataModelTableDuplicated,

0 commit comments
