+
+
+
+
+
+
+
diff --git a/docs/antora/modules/ROOT/attachments/aws-infra-docs/TED-SWS Installation manual v2.0.2.pdf b/docs/antora/modules/ROOT/attachments/aws-infra-docs/TED-SWS-Installation-manual-v2.0.2.pdf
similarity index 100%
rename from docs/antora/modules/ROOT/attachments/aws-infra-docs/TED-SWS Installation manual v2.0.2.pdf
rename to docs/antora/modules/ROOT/attachments/aws-infra-docs/TED-SWS-Installation-manual-v2.0.2.pdf
diff --git a/docs/antora/modules/ROOT/attachments/aws-infra-docs/TED-SWS-Installation-manual-v2.5.0.pdf b/docs/antora/modules/ROOT/attachments/aws-infra-docs/TED-SWS-Installation-manual-v2.5.0.pdf
new file mode 100644
index 000000000..2630eab63
Binary files /dev/null and b/docs/antora/modules/ROOT/attachments/aws-infra-docs/TED-SWS-Installation-manual-v2.5.0.pdf differ
diff --git a/docs/antora/modules/ROOT/nav.adoc b/docs/antora/modules/ROOT/nav.adoc
index 50f58783c..2b65a493f 100644
--- a/docs/antora/modules/ROOT/nav.adoc
+++ b/docs/antora/modules/ROOT/nav.adoc
@@ -1,8 +1,30 @@
* xref:index.adoc[Home]
-* link:{attachmentsdir}/ted-sws-architecture/index.html[Preliminary Project Architecture^]
-* xref:mapping_suite_cli_toolchain.adoc[Mapping Suite CLI Toolchain]
-* xref:demo_installation.adoc[Instructions for Software Engineers]
-* xref:user_manual.adoc[User manual]
-* xref:system_arhitecture.adoc[System architecture overview]
-* xref:using_procurement_data.adoc[Using procurement data]
+
+* [.separated]#**General References**#
+** xref:ted-sws-introduction.adoc[About TED-SWS]
+** xref:glossary.adoc[Glossary]
+
+* [.separated]#**For TED-SWS Operators**#
+** xref:user_manual/getting_started_user_manual.adoc[Getting started]
+** xref:user_manual/system-overview.adoc[System overview]
+** xref:user_manual/access-security.adoc[Security and access]
+** xref:user_manual/workflow-management-airflow.adoc[Workflow management with Airflow]
+** xref:user_manual/system-monitoring-metabase.adoc[System monitoring with Metabase]
+
+* [.separated]#**For DevOps**#
+
+** link:{attachmentsdir}/aws-infra-docs/TED-SWS-Installation-manual-v2.5.0.pdf[AWS installation manual (v2.5.0)^]
+** link:{attachmentsdir}/aws-infra-docs/TED-SWS-AWS-Infrastructure-architecture-overview-v0.9.pdf[AWS infrastructure architecture (v0.9)^]
+
+* [.separated]#**For End User Developers**#
+** xref:ted_data/using_procurement_data.adoc[Accessing data in Cellar]
+** link:https://docs.ted.europa.eu/EPO/latest/index.html[eProcurement ontology (latest)^]
+
+* [.separated]#**For TED-SWS Developers**#
+** xref:technical/mapping_suite_cli_toolchain.adoc[Mapping suite toolchain]
+** xref:technical/demo_installation.adoc[Development installation instructions]
+** xref:technical/event_manager.adoc[Event manager description]
+** xref:architecture/arhitecture_overview.adoc[System architecture overview]
+** link:{attachmentsdir}/ted-sws-architecture/index.html[Enterprise architecture model^]
+** xref:architecture/arhitecture_choices.adoc[Architectural choices]
diff --git a/docs/antora/modules/ROOT/pages/architecture/arhitecture_choices.adoc b/docs/antora/modules/ROOT/pages/architecture/arhitecture_choices.adoc
new file mode 100644
index 000000000..b1ee7031a
--- /dev/null
+++ b/docs/antora/modules/ROOT/pages/architecture/arhitecture_choices.adoc
@@ -0,0 +1,351 @@
+== Architectural choices
+
+This section explains the main architectural choices:
+
+* Why a service-oriented architecture (SOA) rather than REST microservices?
+* Why a NoSQL data model rather than a SQL data model?
+* Why an ETL/ELT approach rather than Event Sourcing?
+* Why batch processing rather than Event Streams?
+* Why Airflow?
+* Why Metabase?
+* Why a quick deduplication process, and what are the plans for the
+future?
+
+=== Why a service-oriented architecture (SOA)?
+
+ETL (Extract, Transform, Load) architecture is considered
+state-of-the-art for batch processing tasks using Airflow as pipeline
+management for several reasons:
+
+[arabic]
+. *Flexibility*: ETL architecture allows for flexibility in the data
+pipeline as it separates the data extraction, transformation, and
+loading processes. This allows for easy modification and maintenance of
+each individual step without affecting the entire pipeline.
+. *Scalability*: ETL architecture allows for the easy scaling of data
+processing tasks, as new data sources can be added or removed without
+impacting the entire pipeline.
+. *Error Handling*: ETL architecture allows for easy error handling as
+each step of the pipeline can be monitored and errors can be isolated to
+a specific step.
+. *Reusability:* ETL architecture allows for the reuse of existing data
+pipelines, as new data sources can be added without modifying existing
+pipelines.
+. *System management*: Airflow is an open-source workflow management
+system that allows for easy scheduling, monitoring, and management of
+data pipelines. It integrates seamlessly with ETL architecture and
+allows for easy management of complex data pipelines.
+
+Overall, ETL architecture combined with Airflow as pipeline management
+provides a robust and efficient solution for batch processing tasks.
+
+=== Why Monolithic Architecture vs Microservices Architecture?
+
+There are several reasons why a monolithic architecture may be more
+suitable for an ETL architecture with batch processing pipeline using
+Airflow as the pipeline management tool:
+
+[arabic]
+. *Simplicity*: A monolithic architecture is simpler to design and
+implement as it involves a single codebase and a single deployment
+process. This makes it easier to manage and maintain the ETL pipeline.
+. *Performance*: A monolithic architecture may be more performant than a
+microservices architecture as it allows for more efficient communication
+between the different components of the pipeline. This is particularly
+important for batch processing pipelines, where speed and efficiency are
+crucial.
+. *Scalability*: Monolithic architectures can be scaled horizontally by
+adding more resources to the system, such as more servers or more
+processing power. This allows for the system to handle larger amounts of
+data and handle more complex processing tasks.
+. *Airflow Integration*: Airflow is designed to work with monolithic
+architectures, and it can be more difficult to integrate with a
+microservices architecture. Airflow's DAGs and tasks are designed to
+work with a single codebase, and it may be more challenging to manage
+different services and pipelines across multiple microservices.
+
+Overall, a monolithic architecture may be more suitable for an ETL
+architecture with batch processing pipeline using Airflow as the
+pipeline management tool due to its simplicity, performance,
+scalability, and ease of integration with Airflow.
+
+=== Why ETL/ELT approach vs Event Sourcing?
+
+ETL (Extract, Transform, Load) architecture is typically used for moving
+and transforming data from one system to another, for example, from a
+transactional database to a data warehouse for reporting and analysis.
+It is a batch-oriented process that is typically scheduled to run at
+specific intervals.
+
+Event sourcing architecture, on the other hand, is a way of storing and
+managing the state of an application by keeping track of all the changes
+to the state as a sequence of events. This allows for better auditing
+and traceability of the state of the application over time, as well as
+the ability to replay past events to reconstruct the current state.
+Event sourcing is often used in systems that require high performance,
+scalability, and fault tolerance.
+
+In summary, ETL architecture is mainly used for data integration and
+data warehousing, while Event Sourcing is mainly used for building
+highly scalable and fault-tolerant systems that need to store and manage
+the state of an application over time.
+
+A hybrid architecture is implemented in the TED-SWS pipeline, based on
+an ETL architecture but with state storage to repeat a pipeline sequence
+as needed.
+
+=== Why Batch processing vs Event Streams?
+
+Batch processing architecture and Event Streams architecture are two
+different approaches to processing data in code.
+
+Batch processing architecture is a traditional approach where data is
+processed in batches. This means that data is collected over a period of
+time and then processed all at once in a single operation. This approach
+is typically used for tasks such as data analysis, data mining, and
+reporting. It is best suited for tasks that can be done in a single pass
+and do not require real-time processing.
+
+Event Streams architecture, on the other hand, is a more modern approach
+where data is processed in real-time as it is generated. This means that
+data is processed as soon as it is received, rather than waiting for a
+batch to be collected. This approach is typically used for tasks such as
+real-time monitoring, data analytics, and fraud detection. It is best
+suited for tasks that require real-time processing and cannot be done in
+a single pass.
+
+In summary, Batch processing architecture is best suited for tasks that
+can be done in a single pass and do not require real-time processing,
+whereas Event Streams architecture is best suited for tasks that require
+real-time processing and cannot be done in a single pass.
+
+Because the TED-SWS pipeline has an ETL architecture, data processing is
+done in batches. Batches of notices are formed per day: all the notices
+published on a given day form one batch to be processed. Alternatively,
+a batch can be created by grouping notices by status and executing the
+pipeline according to that status, as sketched below.
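+
+Below is a minimal Python sketch, illustrative only and not the actual
+TED-SWS code, of how such batches could be formed either per publication
+day or per notice status. The notice field names are assumptions made
+for the example.
+
+[source,python]
+----
+from collections import defaultdict
+from datetime import date
+
+# Hypothetical notice records; field names are illustrative only.
+notices = [
+    {"ted_id": "001-2023", "publication_date": date(2023, 1, 3), "status": "RAW"},
+    {"ted_id": "002-2023", "publication_date": date(2023, 1, 3), "status": "RAW"},
+    {"ted_id": "003-2023", "publication_date": date(2023, 1, 4), "status": "TRANSFORMED"},
+]
+
+def batch_by_day(notices):
+    """Group notices into one batch per publication day."""
+    batches = defaultdict(list)
+    for notice in notices:
+        batches[notice["publication_date"]].append(notice)
+    return dict(batches)
+
+def batch_by_status(notices):
+    """Group notices into one batch per pipeline status."""
+    batches = defaultdict(list)
+    for notice in notices:
+        batches[notice["status"]].append(notice)
+    return dict(batches)
+
+for day, batch in batch_by_day(notices).items():
+    print(day, [n["ted_id"] for n in batch])
+----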
+
+=== Why NoSQL data model vs SQL data model?
+
+There are several reasons why a NoSQL data model may be more suitable
+for an ETL architecture with batch processing pipeline compared to a SQL
+data model:
+
+[arabic]
+. *Scalability*: NoSQL databases are designed to handle large amounts of
+data and can scale horizontally, allowing for the easy addition of more
+resources as the amount of data grows. This is particularly useful for
+batch processing pipelines that need to handle large amounts of data.
+. *Flexibility*: NoSQL databases are schema-less, which means that the
+data structure can change without having to modify the database schema.
+This allows for more flexibility when processing data, as new data types
+or fields can be easily added without having to make changes to the
+database.
+. *Performance*: NoSQL databases are designed for high-performance and can
+handle high levels of read and write operations. This is particularly
+useful for batch processing pipelines that need to process large amounts
+of data in a short period of time.
+
+. *Handling Unstructured Data*: NoSQL databases are well suited for
+handling unstructured data, such as JSON or XML, that can't be handled
+by SQL databases. This is particularly useful for ETL pipelines that
+need to process unstructured data.
+
+. *Handling Distributed Data*: NoSQL databases are designed to handle
+distributed data, which allows for data to be stored and processed on
+multiple servers. This can help to improve performance and scalability,
+as well as provide fault tolerance.
+
+. *Cost*: NoSQL databases are generally less expensive than SQL databases,
+as they don't require expensive hardware or specialized software. This
+can make them a more cost-effective option for ETL pipelines that need
+to handle large amounts of data.
+
+Overall, a NoSQL data model may be more suitable than a SQL data model
+for an ETL architecture with a batch processing pipeline due to its
+scalability, flexibility, performance, support for unstructured and
+distributed data, and cost-effectiveness. The choice of a NoSQL data
+model satisfies the specific requirements of the TED-SWS processing
+pipeline and the nature of the data to be processed.
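+
+The schema-less argument above can be illustrated with a short, hedged
+pymongo sketch: two notice documents at different pipeline stages carry
+different fields and can still be queried flexibly. The connection
+string, database, collection and field names are assumptions for the
+example, not project settings.
+
+[source,python]
+----
+from pymongo import MongoClient
+
+# Illustrative local MongoDB instance.
+collection = MongoClient("mongodb://localhost:27017")["ted_sws_demo"]["notices"]
+
+# Documents with different shapes coexist without any schema migration.
+collection.insert_many([
+    {"ted_id": "001-2023", "status": "RAW", "xml_manifestation": "<NOTICE/>"},
+    {"ted_id": "002-2023", "status": "TRANSFORMED",
+     "xml_manifestation": "<NOTICE/>", "rdf_manifestation": "@prefix epo: <...> ."},
+])
+
+# Flexible querying: fetch only notices that already have an RDF manifestation.
+for doc in collection.find({"rdf_manifestation": {"$exists": True}}):
+    print(doc["ted_id"], doc["status"])
+----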
+
+=== Why Airflow?
+
+Airflow is a great solution for ETL pipeline and batch processing
+architecture because it provides several features that are well-suited
+to these types of tasks. First, Airflow provides a powerful scheduler
+that allows you to define and schedule ETL jobs to run at specific
+intervals. This means that you can set up your pipeline to run on a
+regular schedule, such as every day or every hour, without having to
+manually trigger the jobs. Second, Airflow provides a web-based user
+interface that makes it easy to monitor and manage your pipeline.
+
+Both aspects of Airflow are perfectly compatible with the needs of the
+TED-SWS architecture and the use cases required for an Operations
+Manager that will interact with the system. Airflow therefore covers the
+needs of batch processing management and ETL pipeline management.
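+
+As an illustration of the scheduling side, the sketch below shows a
+minimal, hypothetical Airflow DAG (assuming Airflow 2.x) with daily
+scheduling, automatic retries and failure alerting. The DAG id, callable
+and e-mail address are invented for the example; this is not the
+production TED-SWS DAG.
+
+[source,python]
+----
+from datetime import datetime, timedelta
+
+from airflow import DAG
+from airflow.operators.python import PythonOperator
+
+def fetch_daily_notices(**context):
+    # Placeholder for the real fetching step.
+    print("Fetching notices published on", context["ds"])
+
+default_args = {
+    "owner": "operations-manager",
+    "retries": 2,                              # re-run a failed task automatically
+    "retry_delay": timedelta(minutes=10),
+    "email_on_failure": True,                  # alert the Operations Manager
+    "email": ["ops@example.org"],              # illustrative address
+}
+
+with DAG(
+    dag_id="fetch_notices_by_date_demo",       # hypothetical DAG id
+    schedule_interval="@daily",                # run once per day
+    start_date=datetime(2023, 1, 1),
+    catchup=False,
+    default_args=default_args,
+) as dag:
+    PythonOperator(task_id="fetch_daily_notices",
+                   python_callable=fetch_daily_notices)
+----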
+
+Airflow provides good coverage of the use cases relevant to an
+Operations Manager, in particular:
+
+[arabic]
+. *Monitoring pipeline performance*: An operations manager can use Airflow
+to monitor the performance of the ETL pipeline and identify any
+bottlenecks or issues that may be impacting the pipeline's performance.
+They can then take steps to optimize the pipeline to improve its
+performance and ensure that data is being processed in a timely and
+efficient manner.
+
+. *Managing pipeline schedule*: The operations manager can use Airflow to
+schedule the pipeline to run at specific times, such as during off-peak
+hours or when resources are available. This can help to minimize the
+impact of the pipeline on other systems and ensure that data is
+processed in a timely manner.
+
+. *Managing pipeline resources*: The operations manager can use Airflow to
+manage the resources used by the pipeline, such as CPU, memory, and
+storage. They can also use Airflow to scale the pipeline up or down as
+needed to meet changing resource requirements.
+
+. *Managing pipeline failures*: Airflow allows the operations manager to
+set up notifications and alerts for when a pipeline fails or a task
+fails. This allows them to quickly identify and address any issues that
+may be impacting the pipeline's performance.
+
+. *Managing pipeline dependencies*: The operations manager can use Airflow
+to manage the dependencies between different tasks in the pipeline, such
+as ensuring that notice fetching is completed before notice indexing or
+notice metadata normalization.
+
+. *Managing pipeline versioning*: Airflow allows the operations manager to
+maintain different versions of the pipeline, which can be useful for
+testing new changes before rolling them out to production.
+
+. *Managing pipeline security*: Airflow allows the operations manager to
+set up security controls to protect the pipeline and the data it
+processes. They can also use Airflow to audit and monitor access to the
+pipeline and the data it processes.
+
+=== Why Metabase?
+
+Metabase is an excellent solution for data analysis and KPI monitoring
+for a batch processing system, as it offers several key features that
+make it well suited for this type of use case required within the
+TED-SWS system.
+
+First, Metabase is highly customizable, allowing users to create and
+modify dashboards, reports, and visualizations to suit their specific
+needs. This makes it easy to track and monitor the key performance
+indicators (KPIs) that are most important for the batch processing
+system, such as the number of jobs processed, the average processing
+time, and the success rate of job runs.
+
+Second, Metabase offers a wide range of data connectors, allowing users
+to easily connect to and query data sources such as SQL databases, NoSQL
+databases, CSV files, and APIs. This makes it easy to access and analyze
+the data that is relevant to the batch processing system. In TED-SWS the
+data domain model is realized by a document-based data model, not a
+tabular relational data model, so Metabase is a good tool for analyzing
+data with a document-based model.
+
+Third, Metabase has a user-friendly interface that makes it easy to
+navigate and interact with data, even for users with little or no
+technical experience. This makes it accessible to a wide range of users,
+including business analysts, data scientists, and other stakeholders who
+need to monitor and analyse the performance of the batch processing
+system.
+
+Finally, Metabase offers robust security and collaboration features,
+making it easy to share and collaborate on data and insights with team
+members and stakeholders. This makes it an ideal solution for
+organizations that need to monitor and analyse the performance of a
+batch processing system across multiple teams or departments.
+
+=== Why a quick deduplication process?
+
+One of the main challenges in entity deduplication in the semantic web
+domain is dealing with the complexity and diversity of the data. This
+includes dealing with different data formats, schemas, and vocabularies,
+as well as handling missing or incomplete data. Additionally, entities
+may have multiple identities or representations, making it difficult to
+determine which entities are duplicates and which are distinct. Another
+difficulty is scalability: the algorithm must remain efficient and
+accurate when handling a very large number of entities.
+
+There are several approaches and solutions for entity deduplication in
+the semantic web. Some of the top solutions include:
+
+[arabic]
+. *String-based methods*: These methods use string comparison techniques
+such as Jaccard similarity, Levenshtein distance, and cosine similarity
+to identify duplicates based on the similarity of their string
+representations.
+. *Machine learning-based methods*: These methods use machine learning
+algorithms such as decision trees, random forests, and neural networks
+to learn patterns in the data and identify duplicates.
+
+. *Knowledge-based methods*: These methods use external knowledge sources
+such as ontologies, taxonomies, and linked data to disambiguate entities
+and identify duplicates.
+
+. *Hybrid methods*: These methods combine multiple techniques, such as
+string-based and machine learning-based methods, to improve the accuracy
+of deduplication.
+
+. *Blocking Method*: This method is used to reduce the number of entities
+that need to be compared by grouping similar entities together.
+
+In the TED-SWS pipeline, the deduplication of Organization type entities
+is performed using string-based methods. String-based methods are often
+used for organization entity deduplication because of their simplicity
+and effectiveness.
+
+TED Europe data often contains information about tenders and public
+procurement, where organizations are identified by their names.
+Organization names are often unique and can be used to identify
+duplicates with high accuracy. String-based methods can be used to
+compare the similarity of different organization names, which can be
+effective in identifying duplicates.
+
+Additionally, the TED Europe data is highly structured, so it is easy to
+extract and compare the names of organizations. String-based methods are
+also relatively fast and easy to implement, making them a good choice
+for large data sets. These methods may not be as effective for other
+types of entities, such as individuals, where additional information may
+be needed to identify duplicates. It is also important to note that
+string-based methods may not work as well for misspelled or abbreviated
+names.
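+
+The sketch below illustrates the string-based idea on organisation names
+with a character-level similarity ratio and a token-level Jaccard
+overlap. The names and thresholds are illustrative assumptions, not the
+values used in production.
+
+[source,python]
+----
+from difflib import SequenceMatcher
+
+def name_similarity(a: str, b: str) -> float:
+    """Character-level similarity ratio between two organisation names."""
+    return SequenceMatcher(None, a.lower(), b.lower()).ratio()
+
+def jaccard_tokens(a: str, b: str) -> float:
+    """Token-level Jaccard similarity between two organisation names."""
+    ta, tb = set(a.lower().split()), set(b.lower().split())
+    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0
+
+pairs = [
+    ("Ministry of Finance", "MINISTRY OF FINANCE"),
+    ("Ministry of Finance", "Ministry of Foreign Affairs"),
+]
+for a, b in pairs:
+    # Illustrative decision rule combining both similarity measures.
+    duplicate = name_similarity(a, b) > 0.9 or jaccard_tokens(a, b) > 0.8
+    print(f"{a!r} vs {b!r} -> duplicate: {duplicate}")
+----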
+
+Using a quick and dirty deduplication approach instead of a complex
+system at the first iteration of a system implementation can be
+beneficial for several reasons:
+
+[arabic]
+. *Speed*: A quick approach can be implemented rapidly and helps to
+identify and remove duplicates early. This can be particularly useful
+when working with large and complex data sets, where a more complex
+approach may take a long time to implement and test.
+. *Cost*: A quick and dirty approach is generally less expensive to
+implement than a complex system, as it requires fewer resources and less
+development time.
+. *Simplicity*: A quick and dirty approach is simpler and easier to
+implement than a complex system, which can reduce the risk of errors and
+bugs.
+. *Flexibility*: A quick and dirty approach allows starting with a basic
+system and adapting it as needed, which can be more flexible than a
+complex system that is difficult to change.
+
+. *Testing*: A quick and dirty approach allows the system to be tested
+quickly and feedback to be gathered from users and stakeholders, which
+can then be used to improve the system.
+
+
+However, it is worth noting that the quick and dirty approach is not a
+long-term solution and should be used only as a first step towards a
+Master Data Registry (MDR) system. This approach can help to quickly
+identify and remove duplicates and establish a basic system, but it may
+not be able to handle all the complexity and diversity of the data, so
+it is important to plan for and implement more advanced techniques as
+the system matures.
diff --git a/docs/antora/modules/ROOT/pages/architecture/arhitecture_overview.adoc b/docs/antora/modules/ROOT/pages/architecture/arhitecture_overview.adoc
new file mode 100644
index 000000000..9b922c392
--- /dev/null
+++ b/docs/antora/modules/ROOT/pages/architecture/arhitecture_overview.adoc
@@ -0,0 +1,476 @@
+== TED-SWS Architecture
+[width="100%",cols="25%,75%",options="header",]
+
+=== Use Cases
+
+The Operations Manager is the main actor that will interact with the
+TED-SWS system. When presenting the system architecture, we rely
+strongly on the perspective of this actor.
+
+For the Operations Manager, the following use cases are relevant:
+
+* to fetch notices from the TED website based on a query
+* to fetch notices from the TED website based on a date range
+* to fetch notices from the TED website based on date
+* to load a Mapping Suite into the system
+* to reprocess non-normalized notices from the backlog
+* to reprocess untransformed notices from the backlog
+* to reprocess unvalidated notices from the backlog
+* to reprocess unpackaged notices from the backlog
+* to reprocess the notices we published from the backlog
+
+=== System architecture
+
+The main points of architecture for a system that will transform TED
+notices from XML format to RDF format using an ETL architecture with
+batch processing pipeline are:
+
+[arabic]
+. *Data collection*: An API would be used to collect the
+daily notices from the TED website in XML format and store them in a
+data warehouse.
+. *Metadata management*: A metadata management module would collect, store and provide filtering capabilities for notices based on their features, such as form number, date of publication, XSD schema version, subform type, etc.
+. *Data transformation*: A data transformation module would be used to
+convert the XML data into RDF format.
+. *Data loading*: The transformed RDF data would be loaded into a triple
+store, such as Cellar, for further analysis or reporting.
+. *Pipeline management*: Airflow would be used to schedule and manage the
+pipeline, ensuring that the pipeline is run on a daily basis to process
+the latest batch of notices from the TED website. Airflow would also be
+used to monitor the pipeline and provide real-time status updates.
+. *Data access*: A SPARQL endpoint or an API would be used to access the
+RDF data stored in the triple store. This would allow external systems
+to query the data and retrieve the information they need (a query sketch
+follows this list).
+. *Security*: The system would be protected by a firewall and would use
+secure protocols (e.g. HTTPS) for data transfer. Access to the data
+would be controlled by authentication and authorization mechanisms.
+
+. *Scalability*: The architecture should be designed to handle large
+amounts of data and easily scale horizontally by adding more resources
+as the amount of data grows.
+. *Flexibility*: The architecture should be flexible to handle changes in
+the data structure without having to modify the database schema.
+. *Performance*: The architecture should be designed for high-performance
+to handle high levels of read and write operations to process data in a
+short period of time.
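+
+The query sketch below illustrates the data access point with the
+SPARQLWrapper library. The endpoint URL is assumed to be the public
+Cellar SPARQL endpoint and the query is a generic placeholder, not a
+TED-SWS-specific one.
+
+[source,python]
+----
+from SPARQLWrapper import SPARQLWrapper, JSON
+
+# Assumed public Cellar SPARQL endpoint.
+endpoint = SPARQLWrapper("https://publications.europa.eu/webapi/rdf/sparql")
+endpoint.setQuery("""
+    SELECT ?s ?p ?o
+    WHERE { ?s ?p ?o }
+    LIMIT 5
+""")
+endpoint.setReturnFormat(JSON)
+
+results = endpoint.query().convert()
+for row in results["results"]["bindings"]:
+    print(row["s"]["value"], row["p"]["value"], row["o"]["value"])
+----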
+
+Figure 1.1 shows the compact, general image of the TED-SWS system
+architecture from the system's business point of view. The system
+represents a pipeline for processing notices from the TED Website and
+publishing them to the CELLAR service.
+
+For the monitoring and management of internal processes, the system
+offers two interfaces. The first is for data monitoring; in the diagram
+it is represented by the name “Data Monitoring Interface”. The second is
+for the monitoring and management of system processes; in the diagram it
+is represented by the name “Workflow Management Interface”. The
+Operations Manager will use these two interfaces for system monitoring
+and management.
+
+The element of the system that will process the notices is the TED-SWS
+pipeline. The input data for this pipeline will be the notices in XML
+format from the TED website. The result of this pipeline is a METS
+package for each processed notice and its publication in CELLAR, from
+where the end user will be able to access notices in RDF format.
+
+The compact view of the TED-SWS system architecture at the business
+level provided in Figure 1.1 is useful because it allows stakeholders
+and decision-makers to quickly and easily understand how the system
+works and how it supports the business goals and objectives.
+A compact view of the architecture can help to communicate the key
+components of the system and how they interact with each other, making
+it easier to understand the system's capabilities and limitations.
+Additionally, a compact view of the architecture can help to identify
+any areas where the system could be improved or where additional
+capabilities are needed to support the business. By providing a clear
+and concise overview of the system architecture, stakeholders can make
+more informed decisions about how to use the system, how to improve it,
+and how to align it with the business objectives.
+
+Figure 1.1 also shows the input and output dependencies of the TED-SWS
+system architecture. This is useful because it helps to identify
+the data sources and data destinations that the system relies on, as
+well as the data that the system produces. This information can be used
+to understand the data flows within the system, how the system is
+connected to other systems, and how the system supports the business.
+Input dependencies help to identify the data sources that the system
+relies on, such as external systems, databases, or other data sources.
+This information can be used to understand how the system is connected
+to other systems and how it receives data. Output dependencies help to
+identify the data destinations that the system produces, such as
+external systems, databases, or other data destinations. This
+information can be used to understand how the system is connected to
+other systems and how it sends data. By providing input and output
+dependencies for the TED-SWS system architecture, stakeholders can make
+more informed decisions about how to use the system, how to improve it,
+and how to align it with the business objectives.
+
+image:system_arhitecture/media/image1.png[image,width=100%,height=366]
+
+Figure 1.1 Compact view of system architecture at the business level
+
+In Figure 1.2 the general extended architecture of the TED-SWS system is
+represented, in this diagram, the internal components of the TED-SWS
+pipeline are also included.
+
+image:system_arhitecture/media/image8.png[image,width=100%,height=270]
+
+Figure 1.2 Extended view of system architecture at business level
+
+Figure 1.3 shows the architecture of the TED-SWS system without its
+peripheral elements. This diagram is intended to highlight the services
+that serve the internal components of the pipeline.
+
+*Workflow Management Service* is a service external to the TED-SWS
+pipeline that performs pipeline management. It provides a control
+interface, represented in the figure as the Workflow Management
+Interface.
+
+*Workflow Management Interface* represents an internal process control
+interface; this component will be analysed in a separate diagram.
+
+*Data Visualization Service* is a service that manages logs and pipeline
+data and presents them in the form of dashboards.
+
+*Data Monitoring Interface* is a data visualization and dashboard
+editing interface offered by the Data Visualization Service.
+
+*Message Digest Service* is a service that serves the transformation
+component of the TED-SWS pipeline. The transformation relies on custom
+RML functions, which are provided by this external service.
+
+*Master Data Management & URI Allocation Service* is a service for
+storing and managing unique URIs, this service performs URI
+deduplication.
+
+The *TED-SWS pipeline* contains a set of components, all of which access
+Notice Aggregate and Mapping Suite objects.
+
+image:system_arhitecture/media/image4.png[image,width=100%,height=318]
+
+Figure 1.3 TED-SWS architecture at business level
+
+Figure 1.4 shows the TED-SWS pipeline and its components; this view
+aims to show the connections between the components.
+
+The pipeline has the following components:
+
+* Fetching Service
+* XML Indexing Service
+* Metadata Normalization Service
+* Transformation Service
+* Entity Resolution & Deduplication Service
+* Validation Service
+* Packaging Service
+* Publishing Service
+* Mapping Suite Loading Service
+
+*Fetching Service* is a service that extracts notices from the TED
+website and stores them in the database.
+
+*XML Indexing Service* is a service that extracts all unique XPaths from
+a notice XML and stores them as metadata. The unique XPaths are used
+later to verify whether the transformation to RDF covers all XPaths
+present in the XML notice, as sketched below.
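+
+A minimal sketch of the XPath-indexing idea is given below, using lxml
+to collect the distinct element paths of a small, invented notice-like
+XML. It illustrates the concept only and is not the service
+implementation.
+
+[source,python]
+----
+from lxml import etree
+
+xml = b"""<NOTICE>
+  <TITLE>Example</TITLE>
+  <BODY><ORGANISATION>ACME</ORGANISATION></BODY>
+</NOTICE>"""
+
+root = etree.fromstring(xml)
+# Collect the path of every element; duplicates collapse into a set.
+unique_xpaths = {root.getroottree().getpath(element) for element in root.iter()}
+print(sorted(unique_xpaths))
+# ['/NOTICE', '/NOTICE/BODY', '/NOTICE/BODY/ORGANISATION', '/NOTICE/TITLE']
+----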
+
+*Metadata Normalization Service* is a service that normalises the
+metadata of a notice into an internal working format. This normalised
+metadata is used in other processes on a notice, such as the selection
+of a Mapping Suite for transformation or the validation of a notice.
+
+*Transformation Service* is the service that transforms a notice from
+XML format into RDF format, using a Mapping Suite that contains the RML
+transformation rules to be applied.
+
+*Entity Resolution & Deduplication Service* is a service that performs
+the deduplication of entities from RDF manifestation, namely
+Organization and Procedure entities.
+
+*Validation Service* is a service that validates a notice in RDF format
+using several types of validation, namely validation against SHACL
+shapes, validation using SPARQL tests, and XPath coverage verification.
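+
+The sketch below illustrates the SHACL part of this validation with the
+pyshacl library. The tiny data and shapes graphs are invented for the
+example and are not the TED-SWS test suites.
+
+[source,python]
+----
+from pyshacl import validate
+from rdflib import Graph
+
+data_graph = Graph().parse(data="""
+@prefix ex: <http://example.org/> .
+ex:notice1 a ex:Notice .
+""", format="turtle")
+
+shapes_graph = Graph().parse(data="""
+@prefix sh: <http://www.w3.org/ns/shacl#> .
+@prefix ex: <http://example.org/> .
+ex:NoticeShape a sh:NodeShape ;
+    sh:targetClass ex:Notice ;
+    sh:property [ sh:path ex:title ; sh:minCount 1 ] .
+""", format="turtle")
+
+conforms, report_graph, report_text = validate(data_graph, shacl_graph=shapes_graph)
+print(conforms)       # False: the example notice has no ex:title
+print(report_text)
+----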
+
+*Packaging Service* is a service that creates a METS package that will
+contain notice RDF manifestation.
+
+*Publishing Service* is a service that publishes a notice RDF
+manifestation in the required format; in the case of Cellar, the
+publication takes place as a METS package.
+
+image:system_arhitecture/media/image5.png[image,width=100%,height=154]
+
+Figure 1.4 TED-SWS pipeline architecture at business level
+
+=== Processing single notice (BPMN perspective)
+
+The pipeline for processing a notice is the key element in the TED-SWS
+system, the architecture of this pipeline from the business point of
+view is represented in Figure 2. Unlike the previously presented
+figures, in Figure 2 the pipeline is rendered in greater detail and are
+presented relationships between pipeline steps and the artefacts that
+produce or use them.
+
+As Figure 2 shows, the pipeline is not linear: it contains control steps
+that check whether the following steps should be executed for a notice.
+
+There are 3 control steps in the pipeline, namely:
+
+* Check notice eligibility for transformation
+* Check notice eligibility for packaging
+* Check notice availability in Cellar
+
+The “Check notice eligibility for transformation” step checks whether a
+notice can be transformed with a Mapping Suite. If it can, the notice
+goes on to the transformation step; otherwise the notice is stored for
+future processing.
+
+The “Check notice eligibility for packaging” step checks whether, after
+the validation step, a notice RDF manifestation is valid for packaging
+in a METS package. If it is valid, the pipeline proceeds to the
+packaging step; otherwise, the intermediate result is stored for further
+analysis.
+
+The “Check notice availability in Cellar” step checks, after the
+publication step in Cellar, if a published notice is already accessible
+in Cellar. If the notice is accessible, then the pipeline is finished,
+otherwise the published notice is stored for further analysis.
+
+Pipeline steps produce and use artefacts such as:
+
+* TED-XML notice & metadata;
+* Mapping rules
+* TED-RDF notice
+* Test suites
+* Validation report
+* METS Package activation
+
+image:system_arhitecture/media/image2.png[image,width=100%,height=177]
+
+Figure 2 Single notice processing pipeline at business level
+
+Based on Figure 2, we can see that the artefacts of a notice appear as
+it passes through certain steps of the pipeline. To manage conveniently
+the state of a notice and all its artefacts, a notice is modelled as an
+aggregate of artefacts and a state, which changes dynamically as the
+pipeline progresses.
+
+== Application architecture
+
+In this section, we address the following questions:
+
+* How is the data organised?
+* How does the data structure evolve within the process?
+* What does the business process look like?
+* How is the business process realised in the Application?
+
+=== Notice status transition map
+
+The TED-SWS pipeline implements a hybrid architecture based on an ETL
+pipeline with a status transition map for each notice. The TED-SWS
+pipeline has many steps and is not linear; for such a complex pipeline
+with multiple steps and ramifications, using a notice status transition
+map is a good architectural choice for several reasons:
+
+[arabic]
+. *Visibility*: A notice status transition map provides a clear and visual
+representation of the different stages that a notice goes through in the
+pipeline. This allows for better visibility into the pipeline, making it
+easier to understand the flow of data and to identify any issues or
+bottlenecks.
+
+. *Traceability*: A notice status transition map allows for traceability
+of notices in the pipeline, which means that it's possible to track a
+notice as it goes through the different stages of the pipeline. This can
+be useful for troubleshooting, as it allows for the identification of
+which stage the notice failed or had an issue.
+
+. *Error Handling*: A notice status transition map allows for the
+definition of error handling procedures for each stage in the pipeline.
+This can be useful for identifying and resolving errors that occur in
+the pipeline, as it allows for a clear understanding of what went wrong
+and what needs to be done to resolve the issue.
+
+. *Auditing*: A notice status transition map allows for the auditing of
+notices in the pipeline, which means that it's possible to track the
+history of a notice, including when it was processed, by whom, and
+whether it was successful or not.
+
+. *Monitoring*: A notice status transition map allows for the monitoring
+of notices in the pipeline, which means that it's possible to track the
+status of a notice, including how many notices are currently being
+processed, how many have been processed successfully, and how many have
+failed.
+
+. *Automation*: A notice status transition map can be used to automate
+some of the process, by defining rules or triggers to move notices
+between different stages of the pipeline, depending on the status of the
+notice.
+
+
+Each notice has a status during the pipeline; a status corresponds to a
+step in the pipeline that the notice has passed. Figure 3.1 shows the
+transition flow of the status of a notice; note that a notice can only
+be in one status at a given time. Initially, each notice has the RAW
+status, and the last status, which marks the end of the pipeline, is
+PUBLICLY_AVAILABLE. A minimal sketch of such a transition map is given
+after the status list below.
+
+Based on the use cases of this pipeline, the following statuses of a
+notice are of interest to the end user:
+
+* RAW
+* NORMALISED_METADATA
+* INELIGIBLE_FOR_TRANSFORMATION
+* TRANSFORMED
+* VALIDATED
+* INELIGIBLE_FOR_PACKAGING
+* PACKAGED
+* INELIGIBLE_FOR_PUBLISHING
+* PUBLISHED
+* PUBLICLY_UNAVAILABLE
+* PUBLICLY_AVAILABLE
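+
+A minimal sketch of such a status transition map is shown below. The
+statuses mirror the ones discussed in this section; the set of allowed
+transitions is a simplified reading of Figure 3.1, not an exhaustive
+copy of it.
+
+[source,python]
+----
+from enum import Enum, auto
+
+class NoticeStatus(Enum):
+    RAW = auto()
+    INDEXED = auto()
+    NORMALISED_METADATA = auto()
+    INELIGIBLE_FOR_TRANSFORMATION = auto()
+    TRANSFORMED = auto()
+    DISTILLED = auto()
+    VALIDATED = auto()
+    INELIGIBLE_FOR_PACKAGING = auto()
+    PACKAGED = auto()
+    INELIGIBLE_FOR_PUBLISHING = auto()
+    PUBLISHED = auto()
+    PUBLICLY_UNAVAILABLE = auto()
+    PUBLICLY_AVAILABLE = auto()
+
+# Simplified transition map: each status lists the statuses it may move to.
+ALLOWED_TRANSITIONS = {
+    NoticeStatus.RAW: {NoticeStatus.INDEXED},
+    NoticeStatus.INDEXED: {NoticeStatus.NORMALISED_METADATA},
+    NoticeStatus.NORMALISED_METADATA: {NoticeStatus.TRANSFORMED,
+                                       NoticeStatus.INELIGIBLE_FOR_TRANSFORMATION},
+    NoticeStatus.TRANSFORMED: {NoticeStatus.DISTILLED},
+    NoticeStatus.DISTILLED: {NoticeStatus.VALIDATED},
+    NoticeStatus.VALIDATED: {NoticeStatus.PACKAGED,
+                             NoticeStatus.INELIGIBLE_FOR_PACKAGING},
+    NoticeStatus.PACKAGED: {NoticeStatus.PUBLISHED,
+                            NoticeStatus.INELIGIBLE_FOR_PUBLISHING},
+    NoticeStatus.PUBLISHED: {NoticeStatus.PUBLICLY_AVAILABLE,
+                             NoticeStatus.PUBLICLY_UNAVAILABLE},
+}
+
+def update_status(current: NoticeStatus, new: NoticeStatus) -> NoticeStatus:
+    """Allow the change only if the transition map permits it."""
+    if new not in ALLOWED_TRANSITIONS.get(current, set()):
+        raise ValueError(f"Transition {current.name} -> {new.name} is not allowed")
+    return new
+
+print(update_status(NoticeStatus.RAW, NoticeStatus.INDEXED).name)  # INDEXED
+----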
+
+image:system_arhitecture/media/image6.png[image,width=546,height=402]
+
+Figure 3.1 Notice status transition
+
+The names of the statuses are self-descriptive, but attention should be
+drawn to some statuses, namely:
+
+* INDEXED
+* NORMALISED_METADATA
+* DISTILLED
+* PUBLISHED
+* PUBLICLY_UNAVAILABLE
+* PUBLICLY_AVAILABLE
+
+The INDEXED status means that the set of unique XPaths appearing in the
+notice's XML manifestation has been calculated. This set of unique
+XPaths is subsequently required when calculating the XPath coverage
+indicator for the transformation.
+
+The NORMALISED_METADATA status means that for a notice, its metadata has
+been normalised. The metadata of a notice is normalised in an internal
+format to be able to check the eligibility of a notice to be transformed
+with a Mapping Suite package.
+
+The DISTILLED status indicates that the RDF manifestation of a notice
+has been post-processed. The post-processing of an RDF manifestation
+covers the deduplication of Procedure and Organization type entities and
+the insertion of the corresponding triples into this RDF manifestation.
+
+The PUBLISHED status means that a notice has been sent to Cellar, which
+does not mean that it is already available in Cellar. Since there is a
+time interval between the transmission and the actual appearance in
+Cellar, it is necessary to check later whether a notice is available in
+Cellar or not. If the verification has taken place and the notice is
+available in Cellar, it is assigned the PUBLICLY_AVAILABLE status;
+otherwise it is assigned the PUBLICLY_UNAVAILABLE status.
+
+=== Notice structure
+
+The notice structure follows a NoSQL data model. This architectural
+choice is based on the dynamic behaviour of the notice structure, which
+evolves over time while the TED-SWS pipeline runs, and on several other
+reasons:
+
+[arabic]
+. *Schema-less*: NoSQL databases are schema-less, which means that the
+data structure can change without having to modify the database schema.
+This allows for more flexibility when processing data, as new data types
+or fields can be easily added without having to make changes to the
+database. This is particularly useful for notices that are likely to
+evolve over time, as the structure of the notices can change without
+having to make changes to the database.
+
+. *Handling Unstructured Data*: NoSQL databases are well suited for
+handling unstructured data, such as JSON or XML, that can't be handled
+by SQL databases. This is particularly useful for ETL pipelines that
+need to process unstructured data, as notices are often unstructured and
+may evolve over time.
+. *Handling Distributed Data*: NoSQL databases are designed to handle
+distributed data, which allows for data to be stored and processed on
+multiple servers. This can help to improve performance and scalability,
+as well as provide fault tolerance. This is particularly useful for
+notices that are likely to evolve over time, as the volume of data may
+increase and need to be distributed.
+
+. *Flexible Querying*: NoSQL databases allow for flexible querying, which
+means that the data can be queried in different ways, including by
+specific fields, by specific values, and by ranges. This allows for more
+flexibility when querying the data, as the structure of the notices may
+evolve over time.
+. *Cost-effective*: NoSQL databases are generally less expensive than SQL
+databases, as they don't require expensive hardware or specialized
+software. This can make them a more cost-effective option for ETL
+pipelines that need to handle large amounts of data and that are likely
+to evolve over time.
+
+
+Overall, a NoSQL data model is a good choice for the notice structure in
+an ETL pipeline that is likely to evolve over time: it allows for more
+flexibility when processing data, handles unstructured and distributed
+data, supports flexible querying, and is cost-effective.
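+
+The hedged pymongo sketch below illustrates this dynamic structure: the
+same notice document gains new fields as its status advances, without
+any schema change. Field names, values and the connection string are
+assumptions made for the example.
+
+[source,python]
+----
+from pymongo import MongoClient
+
+notices = MongoClient("mongodb://localhost:27017")["ted_sws_demo"]["notices"]
+
+# The notice as stored after metadata normalisation.
+notices.insert_one({"ted_id": "003-2023", "status": "NORMALISED_METADATA",
+                    "original_metadata": {}, "normalised_metadata": {},
+                    "xml_manifestation": "<NOTICE/>"})
+
+# After transformation the RDF manifestation appears on the same document.
+notices.update_one({"ted_id": "003-2023"},
+                   {"$set": {"status": "TRANSFORMED",
+                             "rdf_manifestation": "@prefix epo: <...> ."}})
+
+# After validation the validation reports are attached as well.
+notices.update_one({"ted_id": "003-2023"},
+                   {"$set": {"status": "VALIDATED",
+                             "xpath_coverage_validation": {"coverage": 0.97},
+                             "shacl_validation": {"conforms": True},
+                             "sparql_validation": {"passed": 42}}})
+
+print(sorted(notices.find_one({"ted_id": "003-2023"}).keys()))
+----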
+
+Figure 3.2 shows the structure of a notice and its evolution depending
+on the state the notice is in. In this figure, the emphasis is placed on
+the state from which a certain part of the notice structure becomes
+present. Note that once an element of the notice structure is present
+for a certain state, it is also present for all the states derived from
+it, following the flow of states presented in Figure 3.1.
+
+image:system_arhitecture/media/image3.png[image,width=567,height=350]
+
+Figure 3.2 Dynamic behaviour of notice structure based on status
+
+Based on Figure 3.2, we can see that the structure of a notice evolves
+as it transitions to other states.
+
+For a notice in the state of NORMALISED_METADATA, we can access the
+following fields of a notice:
+
+* Original Metadata
+* Normalised Metadata
+* XML Manifestation
+
+For a notice in the TRANSFORMED state, we can access all the previous
+fields and the following new field of a notice:
+
+* RDF Manifestation
+
+For a notice in the VALIDATED state, we can access all the previous
+fields and the following new fields of a notice:
+
+* XPath Coverage Validation
+
+* SHACL Validation
+* SPARQL Validation
+
+For a notice in the PACKAGED state, we can access all the previous
+fields and the following new fields of a notice:
+
+* METS Manifestation
+
+=== Application view of the process
+
+The primary actor of the TED-SWS system is the Operations Manager, who
+interacts with the system. Application-level pipeline control is
+achieved through the Airflow stack. Figure 4 shows the AirflowUser actor
+representing the Operations Manager; this diagram is at the application
+level of the process.
+
+image:system_arhitecture/media/image7.png[image,width=534,height=585]
+
+Figure 4 Dependencies between Airflow DAGs
+
+Based on the use cases defined for an Operations Manager, Figure 4 shows
+the control functionality of the TED-SWS pipeline available to this
+actor. In addition to the functionality available to the AirflowUser
+actor, the dependencies between DAGs are also rendered. Note that
+another actor named AirflowScheduler is defined; this actor represents
+an automatic mechanism that executes certain DAGs at a certain time
+interval.
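+
+As an illustration of such a dependency between DAGs, the hedged sketch
+below shows one DAG triggering another with Airflow's
+TriggerDagRunOperator (assuming a recent Airflow 2.x). The DAG and task
+ids are invented for the example.
+
+[source,python]
+----
+from datetime import datetime
+
+from airflow import DAG
+from airflow.operators.empty import EmptyOperator
+from airflow.operators.trigger_dagrun import TriggerDagRunOperator
+
+with DAG(dag_id="fetch_notices_demo", start_date=datetime(2023, 1, 1),
+         schedule_interval="@daily", catchup=False) as fetch_dag:
+    fetch = EmptyOperator(task_id="fetch_notices")
+    trigger_processing = TriggerDagRunOperator(
+        task_id="trigger_worker_dag",
+        trigger_dag_id="process_notices_demo",   # the downstream DAG to start
+    )
+    fetch >> trigger_processing
+
+with DAG(dag_id="process_notices_demo", start_date=datetime(2023, 1, 1),
+         schedule_interval=None, catchup=False) as process_dag:
+    EmptyOperator(task_id="process_batch")
+----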
+
diff --git a/docs/antora/modules/ROOT/pages/future_work.adoc b/docs/antora/modules/ROOT/pages/future_work.adoc
new file mode 100644
index 000000000..3dc112542
--- /dev/null
+++ b/docs/antora/modules/ROOT/pages/future_work.adoc
@@ -0,0 +1,59 @@
+== Future work
+
+In the future, another system of the Master Data Registry (MDR) type
+will be used to deduplicate entities in the TED-SWS system; it will be
+implemented according to the requirements for the deduplication of
+entities from notices.
+
+The future Master Data Registry (MDR) system for entity deduplication
+should have the following architecture:
+
+[arabic]
+. *Data Ingestion*: This component is responsible for extracting and
+collecting data from various sources, such as databases, files, and
+APIs. The data is then transformed, cleaned, and consolidated into a
+single format before it is loaded into the MDR.
+
+. *Data Quality*: This component is responsible for enforcing data quality
+rules, such as format, completeness, and consistency, on the data before
+it is entered into the MDR. This can include tasks such as data
+validation, data standardization, and data cleansing.
+
+. *Entity Deduplication*: This component is responsible for identifying and
+removing duplicate entities in the MDR. This can be done using a
+combination of techniques such as string-based, machine learning-based,
+or knowledge-based methods.
+
+. *Data Governance*: This component is responsible for ensuring that the
+data in the MDR is accurate, complete, and up-to-date. This can include
+processes for data validation, data reconciliation, and data
+maintenance.
+
+. *Data Access and Integration*: This component provides access to the MDR
+data through a user interface and APIs, and integrates the MDR data
+with other systems and applications.
+
+. *Data Security*: This component is responsible for ensuring that the
+data in the MDR is secure, and that only authorized users can access it.
+This can include tasks such as authentication, access control, and
+encryption.
+
+. *Data Management*: This component is responsible for managing the data
+in the MDR, including tasks such as data archiving, data backup, and
+data recovery.
+
+. *Monitoring and Analytics*: This component is responsible for monitoring
+and analysing the performance of the MDR system, and for providing
+insights into the data to help improve the system.
+
+. *Services layer*: This component is responsible for providing services
+such as indexing, search and query functionalities over the data.
+
+
+All these components should be integrated and work together to provide a
+comprehensive and efficient MDR system for entity deduplication. The
+system should be scalable and flexible enough to handle large amounts of
+data and adapt to changing business requirements.
+
+
+
diff --git a/docs/antora/modules/ROOT/pages/glossary.adoc b/docs/antora/modules/ROOT/pages/glossary.adoc
new file mode 100644
index 000000000..09def2255
--- /dev/null
+++ b/docs/antora/modules/ROOT/pages/glossary.adoc
@@ -0,0 +1,23 @@
+== Glossary
+
+*Airflow* - an open-source platform for developing, scheduling, and
+monitoring batch-oriented pipelines. Its web interface helps manage and
+monitor the state of your pipelines.
+
+*Metabase* - a BI tool with a friendly UX and integrated tooling that
+lets you explore the data gathered by running the pipelines available in
+Airflow.
+
+*Cellar* - the central content and metadata repository of the
+Publications Office of the European Union.
+
+*TED-SWS* - a pipeline system that continuously converts the public
+procurement notices (in XML format) available on the TED website into
+RDF format and publishes them into CELLAR.
+
+*DAG* (Directed Acyclic Graph) - the core concept of Airflow, collecting
+Tasks together and organising them with dependencies and relationships
+that define how they should run. In this project, the DAGs are the
+pipelines that convert the public procurement notices from XML to RDF
+and publish them into CELLAR.
+
diff --git a/docs/antora/modules/ROOT/pages/index.adoc b/docs/antora/modules/ROOT/pages/index.adoc
index 632edc364..dec49b1b7 100644
--- a/docs/antora/modules/ROOT/pages/index.adoc
+++ b/docs/antora/modules/ROOT/pages/index.adoc
@@ -1,20 +1,6 @@
= TED-RDF Conversion Pipeline Documentation
-The TED-RDF Conversion Pipeline, which is part of the TED Semantic Web Services, aka TED-SWS system, provides tools an infrastructure to convert TED notices available in XML format into RDF. This conversion pipeline is designed to work with the https://docs.ted.europa.eu/rdf-mapping/index.html[TED-RDF Mappings].
-
-== Quick references for users
-
-* xref:mapping_suite_cli_toolchain.adoc[Installation and usage instructions for the Mapping Suite CLI toolchain]
-* link:{attachmentsdir}/ted-sws-architecture/index.html[Preliminary project architecture (in progress)^]
-
-
-== Developer pages
-
-xref:demo_installation.adoc[Installation instructions for development and testing for software engineers]
-
-xref:attachment$/aws-infra-docs/TED-SWS-AWS-Infrastructure-architecture-overview-v0.9.pdf[TED-SWS AWS Infrastructure architecture overview v0.9]
-
-xref:attachment$/aws-infra-docs/TED-SWS Installation manual v2.0.2.pdf[TED-SWS AWS Installation manual v2.0.2]
+The TED-RDF Conversion Pipeline is part of the TED Semantic Web Services (TED-SWS system) and provides tools and infrastructure to convert TED notices available in XML format into RDF. This conversion pipeline is designed to work with the https://docs.ted.europa.eu/rdf-mapping/index.html[TED-SWS Mapping Suites] - self-contained packages with transformation rules and resources.
== Project roadmap
@@ -23,8 +9,7 @@ xref:attachment$/aws-infra-docs/TED-SWS Installation manual v2.0.2.pdf[TED-SWS A
| Phase 1 | The first phase places high priority on the deployment into the OP AWS Cloud environment.| August 2022 | xref:attachment$/FATs/2022-08-29-report/index.html[2022-08-29 report] | 29 August 2022 | link:https://github.com/OP-TED/ted-rdf-conversion-pipeline/releases/tag/0.0.9-beta[0.0.9-beta]
| Phase 2 | Provided that the deployment in the acceptance environment is successful, the delivery of Phase 2 aims to provide the first production version of the TED SWS system. | Nov 2022 | xref:attachment$/FATs/2022-11-22-TED-SWS-FAT-complete.html[2022-11-22 report] | 20 Nov 2022 | https://github.com/OP-TED/ted-rdf-conversion-pipeline/releases/tag/1.0.0-beta[1.0.0-beta]
-| Phase 3 | This phase delivers the documentation and components and improvements that could not be covered in the previous phases. | Feb 2023 | --- | --- | ---
-
+| Phase 3 | This phase delivers the documentation and components and improvements that could not be covered in the previous phases. | Feb 2023 | xref:attachment$/FATs/2023-02-20-TED-SWS-FAT-complete.html[2023-02-20 report] | 21 Feb 2023 | https://github.com/OP-TED/ted-rdf-conversion-pipeline/releases/tag/1.1.0-beta[1.1.0-beta]
|===
@@ -32,3 +17,21 @@ xref:attachment$/aws-infra-docs/TED-SWS Installation manual v2.0.2.pdf[TED-SWS A
+//
+// == Quick references for Developers
+//
+// == Quick references for DevOps
+//
+// == Quick references for TED-SWS Developers
+//
+// * xref:mapping_suite_cli_toolchain.adoc[Installation and usage instructions for the Mapping Suite CLI toolchain]
+// * link:{attachmentsdir}/ted-sws-architecture/index.html[Preliminary project architecture (in progress)^]
+//
+//
+// == Developer pages
+//
+// xref:demo_installation.adoc[Installation instructions for development and testing for software engineers]
+//
+// xref:attachment$/aws-infra-docs/TED-SWS-AWS-Infrastructure-architecture-overview-v0.9.pdf[TED-SWS AWS Infrastructure architecture overview v0.9]
+//
+// xref:attachment$/aws-infra-docs/TED-SWS Installation manual v2.5.0.pdf[TED-SWS AWS Installation manual v2.5.0]
\ No newline at end of file
diff --git a/docs/antora/modules/ROOT/pages/system_arhitecture.adoc b/docs/antora/modules/ROOT/pages/system_arhitecture.adoc
deleted file mode 100644
index 0e7ea1cc6..000000000
--- a/docs/antora/modules/ROOT/pages/system_arhitecture.adoc
+++ /dev/null
@@ -1,972 +0,0 @@
-= TED-SWS System Architecture
-
-[width="100%",cols="25%,75%",options="header",]
-|===
-|*Editors* |Dragos Paun
- +
-Eugeniu Costetchi
-
-|*Version* |1.0.0
-
-|*Date* |20/02/2023
-|===
-== Introduction
-
-Although TED notice data is already available to the general public
-through the search API provided by the TED website, the current offering
-has many limitations that impede access to and reuse of the data. One
-such important impediment is for example the current format of the data.
-
-Historical TED data come in various XML formats that evolved together
-with the standard TED XML schema. The imminent introduction of eForms
-will also introduce further diversity in the XML data formats available
-through TED's search API. This makes it practically impossible for users
-to consume and process data that span across several years, as
-their information systems must be able to process several different
-flavours of the available XML schemas as well as to keep up with the
-schema's continuous evolution. Their search capabilities are therefore
-confined to a very limited set of metadata.
-
-The TED Semantic Web Service will remove these barriers by providing one
-common format for accessing and reusing all TED data. Coupled with the
-eProcurement Ontology, the TED data will also have semantics attached to
-them allowing users to directly link them with other datasets.
-Moreover, users will now be able to perform much more elaborate
-queries directly on the data source (through the SPARQL endpoint). This
-will reduce their need for data warehousing in order to perform complex
-queries.
-
-These developments, by lowering the barriers, will give rise to a vast
-number of new use-cases that will enable stakeholders and end-users to
-benefit from increased availability of analytics. The ability to perform
-complex queries on public procurement data will be equally open to large
-information systems as well as to simple desktop users with a copy of
-Excel and an internet connection.
-
-To summarize, the TED Semantic Web Service (TED SWS) is a pipeline
-system that continuously converts the public procurement notices (in XML
-format) available on the TED Website into RDF format, publishes them
-into CELLAR and makes them available to the public through CELLAR’s
-SPARQL endpoint.
-
-=== Document overview
-
-This document describes the architecture of the TED-SWS system.
-
-It describes:
-
-* A general description of the system
-* A general architecture
-* A process single notice
-
-=== Glossary
-
-*Airflow* - an open-source platform for developing, scheduling, and
-monitoring batch-oriented pipelines. The web interface helps manage the
-state and monitoring of your pipelines.
-
-*Metabase* - is the BI tool with the friendly UX and integrated tooling
-to let you explore data gathered by running the pipelines available in
-Airflow.
-
-*Cellar* - is the central content and metadata repository of the
-Publications Office of the European Union
-
-*TED-SWS* - is a pipeline system that continuously converts the public
-procurement notices (in XML format) available on the TED Website into
-RDF format and publishes them into CELLAR
-
-*DAG* - (Directed Acyclic Graph) is the core concept of Airflow,
-collecting Tasks together, organized with dependencies and relationships
-to say how they should run. The DAGS are basically the pipelines that
-run in this project to get the public procurement notices from XML to
-RDF and to be published them into CELLAR.
-
-== Architecture
-
-=== System use cases
-
-Operations Manager is the main actor that will interact with the TED-SWS
-system. For these reasons, the use cases of the system will be focused
-on the foreground for this actor.
-
-For Operations Manager are the following use cases:
-
-* to load a Mapping Suite into the database
-* to reprocess non-normalized notices from the backlog
-* to reprocess untransformed notices from the backlog
-* to reprocess unvalidated notices from the backlog
-* to reprocess unpackaged notices from the backlog
-* to reprocess the notices we published from the backlog
-* to fetch notices from the TED website based on a query
-* to fetch notices from the TED website based on a date range
-* to fetch notices from the TED website based on date
-
-=== Architecture overview
-
-The main points of architecture for a system that will transform TED
-notices from XML format to RDF format using an ETL architecture with
-batch processing pipeline are:
-
-[arabic]
-. *Data collection*: A web scraper or API would be used to collect the
-daily notices from the TED website in XML format and store them in a
-data warehouse.
-. *Data cleansing*: A data cleansing module would be used to clean and
-validate the data, removing any invalid or duplicate entries
-. *Data transformation*: A data transformation module would be used to
-convert the XML data into RDF format.
-. *Data loading*: The transformed RDF data would be loaded into a triple
-store, such as Cellar, for further analysis or reporting.
-. *Pipeline management*: Airflow would be used to schedule and manage the
-pipeline, ensuring that the pipeline is run on a daily basis to process
-the latest batch of notices from the TED website. Airflow would also be
-used to monitor the pipeline and provide real-time status updates.
-. *Data access*: A SPARQL endpoint or an API would be used to access the
-RDF data stored in the triple store. This would allow external systems
-to query the data and retrieve the information they need.
-. *Security*: The system would be protected by a firewall and would use
-secure protocols (e.g. HTTPS) for data transfer. Access to the data
-would be controlled by authentication and authorization mechanisms.
-
-. *Scalability*: The architecture should be designed to handle large
-amounts of data and easily scale horizontally by adding more resources
-as the amount of data grows.
-. *Flexibility*: The architecture should be flexible to handle changes in
-the data structure without having to modify the database schema.
-. *Performance*: The architecture should be designed for high-performance
-to handle high levels of read and write operations to process data in a
-short period of time.
-
-Figure 1.1 shows a compact, general view of the TED-SWS system
-architecture from the system's business point of view. The system
-represents a pipeline for processing notices from the TED Website and
-publishing them to the CELLAR service.
-
-For the monitoring and management of internal processes, the system
-offers two interfaces. One interface is for data monitoring; in the
-diagram it is labelled “Data Monitoring Interface”. The other interface
-is for the monitoring and management of system processes; in the diagram
-it is labelled “Workflow Management Interface”. The Operations Manager
-will use these two interfaces for system monitoring and management.
-
-The element of the system that will process the notices is the TED-SWS
-pipeline. The input data for this pipeline will be the notices in XML
-format from the TED website. The result of this pipeline is a METS
-package for each processed notice and its publication in CELLAR, from
-where the end user will be able to access notices in RDF format.
-
-Providing, in Figure 1.1, a compact view of the TED-SWS system
-architecture at the business level is useful because it allows
-stakeholders and decision-makers to quickly and easily understand how
-the system works and how it supports the business goals and objectives.
-A compact view of the architecture can help to communicate the key
-components of the system and how they interact with each other, making
-it easier to understand the system's capabilities and limitations.
-Additionally, a compact view of the architecture can help to identify
-any areas where the system could be improved or where additional
-capabilities are needed to support the business. By providing a clear
-and concise overview of the system architecture, stakeholders can make
-more informed decisions about how to use the system, how to improve it,
-and how to align it with the business objectives.
-
-Figure 1.1 also shows the input and output dependencies of the
-TED-SWS system architecture. This is useful because it helps to identify
-the data sources and data destinations that the system relies on, as
-well as the data that the system produces. This information can be used
-to understand the data flows within the system, how the system is
-connected to other systems, and how the system supports the business.
-Input dependencies help to identify the data sources that the system
-relies on, such as external systems, databases, or other data sources.
-This information can be used to understand how the system is connected
-to other systems and how it receives data. Output dependencies help to
-identify the data destinations that the system produces, such as
-external systems, databases, or other data destinations. This
-information can be used to understand how the system is connected to
-other systems and how it sends data. By providing input and output
-dependencies for the TED-SWS system architecture, stakeholders can make
-more informed decisions about how to use the system, how to improve it,
-and how to align it with the business objectives.
-
-image:system_arhitecture/media/image1.png[image,width=100%,height=366]
-
-Figure 1.1 Compact view of system architecture at the business level
-
-Figure 1.2 shows the general extended architecture of the TED-SWS
-system; this diagram also includes the internal components of the
-TED-SWS pipeline.
-
-image:system_arhitecture/media/image8.png[image,width=100%,height=270]
-
-Figure 1.2 Extended view of system architecture at business level
-
-Figure 1.3 shows the architecture of the TED-SWS system without its
-peripheral elements. This diagram is intended to highlight the services
-that serve the internal components of the pipeline.
-
-*Workflow Management Service* is an external TED-SWS pipeline service
-that performs pipeline management. This service provides a control
-interface, represented in the figure by the Workflow Management
-Interface.
-
-*Workflow Management Interface* represents an internal process control
-interface; this component is analysed in a separate diagram.
-
-*Data Visualization Service* is a service that manages logs and pipeline
-data to present them in the form of dashboards.
-
-*Data Monitoring Interface* is a data visualization and dashboard
-editing interface offered by the Data Visualization Service.
-
-*Message Digest Service* is a service that serves the transformation
-component of the TED-SWS pipeline; the transformation relies on custom
-RML functions, which require an external service to implement them.
-
-*Master Data Management & URI Allocation Service* is a service for
-storing and managing unique URIs; this service performs URI
-deduplication.
-
-The *TED-SWS pipeline* contains a set of components, all of which access
-Notice Aggregate and Mapping Suite objects.
-
-image:system_arhitecture/media/image4.png[image,width=100%,height=318]
-
-Figure 1.3 TED-SWS architecture at business level
-
-Figure 1.4 shows the TED-SWS pipeline and its components, and this view
-aims to show the connection between the components.
-
-The pipeline has the following components:
-
-* Fetching Service
-* XML Indexing Service
-* Metadata Normalization Service
-* Transformation Service
-* Entity Resolution & Deduplication Service
-* Validation Service
-* Packaging Service
-* Publishing Service
-* Mapping Suite Loading Service
-
-*Fetching Service* is a service that extracts notices from the TED
-website and stores them in the database.
-
-*XML Indexing Service* is a service that extracts all unique XPaths from
-an XML and stores them as metadata. Unique XPaths are used later to
-validate whether the transformation to RDF format has covered all XPaths
-from a notice in XML format.
-
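-A minimal sketch of how a set of unique XPaths could be extracted from a
-notice XML is shown below. It is an illustration using only the Python
-standard library, not the actual implementation of the XML Indexing
-Service; the sample XML is invented for the example.
-
-[source,python]
-----
-from xml.etree import ElementTree
-
-
-def unique_xpaths(xml_string: str) -> set:
-    """Collect the set of unique element paths appearing in an XML document."""
-    root = ElementTree.fromstring(xml_string)
-    paths = set()
-
-    def walk(element, prefix):
-        path = f"{prefix}/{element.tag}"
-        paths.add(path)
-        for child in element:
-            walk(child, path)
-
-    walk(root, "")
-    return paths
-
-
-# Example result: {'/notice', '/notice/title', '/notice/cpv'}
-print(unique_xpaths("<notice><title>Example</title><cpv/></notice>"))
-----
-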
-*Metadata Normalization Service* is a service that normalises the
-metadata of a notice into an internal working format. This normalised
-metadata will be used in other processes on a notice, such as the
-selection of a Mapping Suite for transformation or validation of a
-notice.
-
-*Transformation Service* is the service that transforms a notice from
-the XML format into the RDF format, using for this a Mapping Suite that
-contains the RML transformation rules that will be applied.
-
-*Entity Resolution & Deduplication Service* is a service that performs
-the deduplication of entities from RDF manifestation, namely
-Organization and Procedure entities.
-
-*Validation Service* is a service that validates a notice in RDF format,
-using for this several types of validations, namely validation using
-SHACL shapes, validation using SPARQL tests and XPath coverage
-verification.
-
-*Packaging Service* is a service that creates a METS package that will
-contain notice RDF manifestation.
-
-*Publishing Service* is a service that publishes a notice RDF
-manifestation in the required format; in the case of Cellar, the
-publication takes place with a METS package.
-
-image:system_arhitecture/media/image5.png[image,width=100%,height=154]
-
-Figure 1.4 TED-SWS pipeline architecture at business level
-
-=== Process single notice pipeline architecture
-
-The pipeline for processing a notice is the key element in the TED-SWS
-system; the architecture of this pipeline from the business point of
-view is represented in Figure 2. Unlike the previously presented
-figures, Figure 2 renders the pipeline in greater detail and presents
-the relationships between pipeline steps and the artefacts that they
-produce or use.
-
-Based on Figure 2, it can be noted that the pipeline is not linear;
-within the pipeline there are control steps that check whether the
-following steps should be executed for a notice.
-
-There are 3 control steps in the pipeline, namely:
-
-* Check notice eligibility for transformation
-* Check notice eligibility for packaging
-* Check notice availability in Cellar
-
-The “Check notice eligibility for transformation” step checks whether a
-notice can be transformed with a Mapping Suite. If it can be
-transformed, the notice moves to the transformation step; otherwise the
-notice is stored for future processing.
-
-The “Check notice eligibility for packaging” step checks whether a
-notice RDF manifestation, after the validation step, is valid for
-packaging in a METS package. If it is valid, the notice proceeds to the
-packaging step; otherwise the intermediate result is stored for further
-analysis.
-
-The “Check notice availability in Cellar” step checks, after the
-publication step in Cellar, if a published notice is already accessible
-in Cellar. If the notice is accessible, then the pipeline is finished,
-otherwise the published notice is stored for further analysis.
-
-Pipeline steps produce and use artefacts such as:
-
-* TED-XML notice & metadata
-* Mapping rules
-* TED-RDF notice
-* Test suites
-* Validation report
-* METS Package activation
-
-image:system_arhitecture/media/image2.png[image,width=100%,height=177]
-
-Figure 2 Single notice processing pipeline at business level
-
-Based on Figure 2, we can see that the artefacts for a notice appear as
-it passes through certain steps of the pipeline. To conveniently manage
-the state of a notice and all its artefacts, a notice is modelled as an
-aggregate of artefacts and a state, which changes dynamically as the
-pipeline runs.
-
-== Dynamic behaviour of architecture
-
-In this section, we address the following questions:
-
-* How is the data organised?
-* How does the data structure evolve within the process?
-* What does the business process look like?
-* How is the business process realised in the Application?
-
-=== Notice status transition map
-
-The TED-SWS pipeline implements a hybrid architecture based on an ETL
-pipeline with a status transition map for each notice. The TED-SWS
-pipeline has many steps and is not linear; for a complex pipeline with
-multiple steps and branches such as the TED-SWS pipeline, using a notice
-status transition map is a good architectural choice for several
-reasons:
-
-[arabic]
-. *Visibility*: A notice status transition map provides a clear and visual
-representation of the different stages that a notice goes through in the
-pipeline. This allows for better visibility into the pipeline, making it
-easier to understand the flow of data and to identify any issues or
-bottlenecks.
-
-. *Traceability*: A notice status transition map allows for traceability
-of notices in the pipeline, which means that it's possible to track a
-notice as it goes through the different stages of the pipeline. This can
-be useful for troubleshooting, as it allows for the identification of
-the stage at which a notice failed or had an issue.
-
-. *Error Handling*: A notice status transition map allows for the
-definition of error handling procedures for each stage in the pipeline.
-This can be useful for identifying and resolving errors that occur in
-the pipeline, as it allows for a clear understanding of what went wrong
-and what needs to be done to resolve the issue.
-
-. *Auditing*: A notice status transition map allows for the auditing of
-notices in the pipeline, which means that it's possible to track the
-history of a notice, including when it was processed, by whom, and
-whether it was successful or not.
-
-. *Monitoring*: A notice status transition map allows for the monitoring
-of notices in the pipeline, which means that it's possible to track the
-status of a notice, including how many notices are currently being
-processed, how many have been processed successfully, and how many have
-failed.
-
-. *Automation*: A notice status transition map can be used to automate
-parts of the process, by defining rules or triggers to move notices
-between different stages of the pipeline, depending on the status of the
-notice.
-
-
-Each notice has a status during the pipeline; a status corresponds to a
-step in the pipeline that the notice has passed. Figure 3.1 shows the
-transition flow of the status of a notice; note that a notice can only
-be in one status at a given time. Initially, each notice has the status
-RAW, and the last status, which marks the end of the pipeline, is
-PUBLICLY_AVAILABLE.
-
-Based on the use cases of this pipeline, the following statuses of a
-notice are of interest to the end user:
-
-* RAW
-* NORMALISED_METADATA
-* INELIGIBLE_FOR_TRANSFORMATION
-* TRANSFORMED
-* VALIDATED
-* INELIGIBLE_FOR_PACKAGING
-* PACKAGED
-* INELIGIBLE_FOR_PUBLISHING
-* PUBLISHED
-* PUBLICLY_UNAVAILABLE
-* PUBLICLY_AVAILABLE
-
-image:system_arhitecture/media/image6.png[image,width=546,height=402]
-
-Figure 3.1 Notice status transition
-
-The names of the statuses are self-descriptive, but attention should be
-drawn to some statuses, namely:
-
-* INDEXED
-* NORMALISED_METADATA
-* DISTILLED
-* PUBLISHED
-* PUBLICLY_UNAVAILABLE
-* PUBLICLY_AVAILABLE
-
-The INDEXED status means that, for a notice, the set of unique XPaths
-appearing in its XML manifestation has been calculated. The unique set
-of XPaths is subsequently required when calculating the XPath coverage
-indicator for the transformation.
-
-The NORMALISED_METADATA status means that for a notice, its metadata has
-been normalised. The metadata of a notice is normalised in an internal
-format to be able to check the eligibility of a notice to be transformed
-with a Mapping Suite package.
-
-The DISTILLED status is used to indicate that the RDF manifestation of a
-notice has been post-processed. The post-processing of an RDF
-manifestation provides for the deduplication of the Procedure and
-Organization type entities and the insertion of the corresponding
-triples within this RDF manifestation.
-
-The PUBLISHED status means that a notice has been sent to Cellar, which
-does not mean that it is already available in Cellar. Since there is a
-time interval between the transmission and the actual appearance in
-Cellar, it is necessary to check later whether a notice is available in
-Cellar or not. If the verification has taken place and the notice is
-available in Cellar, it is assigned the status PUBLICLY_AVAILABLE; if it
-is not available in Cellar, it is assigned the status
-PUBLICLY_UNAVAILABLE.
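-
-As an illustration only (not the actual project code), the allowed
-transitions can be thought of as a simple mapping that is checked before
-every status change. The statuses below are taken from the list above,
-while the transition table is a partial, assumed extract of Figure 3.1.
-
-[source,python]
-----
-from enum import Enum
-
-
-class NoticeStatus(Enum):
-    RAW = "RAW"
-    NORMALISED_METADATA = "NORMALISED_METADATA"
-    TRANSFORMED = "TRANSFORMED"
-    VALIDATED = "VALIDATED"
-    PACKAGED = "PACKAGED"
-    PUBLISHED = "PUBLISHED"
-    PUBLICLY_AVAILABLE = "PUBLICLY_AVAILABLE"
-    PUBLICLY_UNAVAILABLE = "PUBLICLY_UNAVAILABLE"
-
-
-# Partial transition table, assumed from Figure 3.1 for illustration.
-ALLOWED_TRANSITIONS = {
-    NoticeStatus.PUBLISHED: {NoticeStatus.PUBLICLY_AVAILABLE,
-                             NoticeStatus.PUBLICLY_UNAVAILABLE},
-    NoticeStatus.PUBLICLY_UNAVAILABLE: {NoticeStatus.PUBLICLY_AVAILABLE},
-}
-
-
-def update_status(current: NoticeStatus, new: NoticeStatus) -> NoticeStatus:
-    """Return the new status if the transition is allowed, otherwise raise."""
-    if new not in ALLOWED_TRANSITIONS.get(current, set()):
-        raise ValueError(f"Transition {current.name} -> {new.name} is not allowed")
-    return new
-
-
-update_status(NoticeStatus.PUBLISHED, NoticeStatus.PUBLICLY_AVAILABLE)  # allowed
-----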
-
-=== Notice structure
-
-The notice structure uses a NoSQL data model. This architectural choice
-is based on the dynamic behaviour of the notice structure, which evolves
-over time while the TED-SWS pipeline runs; there are also other reasons:
-
-[arabic]
-. *Schema-less*: NoSQL databases are schema-less, which means that the
-data structure can change without having to modify the database schema.
-This allows for more flexibility when processing data, as new data types
-or fields can be easily added without having to make changes to the
-database. This is particularly useful for notices that are likely to
-evolve over time, as the structure of the notices can change without
-having to make changes to the database.
-
-. *Handling Unstructured Data*: NoSQL databases are well suited for
-handling unstructured data, such as JSON or XML, that can't be handled
-by SQL databases. This is particularly useful for ETL pipelines that
-need to process unstructured data, as notices are often unstructured and
-may evolve over time.
-. *Handling Distributed Data*: NoSQL databases are designed to handle
-distributed data, which allows for data to be stored and processed on
-multiple servers. This can help to improve performance and scalability,
-as well as provide fault tolerance. This is particularly useful for
-notices that are likely to evolve over time, as the volume of data may
-increase and need to be distributed.
-
-. *Flexible Querying*: NoSQL databases allow for flexible querying, which
-means that the data can be queried in different ways, including by
-specific fields, by specific values, and by ranges. This allows for more
-flexibility when querying the data, as the structure of the notices may
-evolve over time.
-. *Cost-effective*: NoSQL databases are generally less expensive than SQL
-databases, as they don't require expensive hardware or specialized
-software. This can make them a more cost-effective option for ETL
-pipelines that need to handle large amounts of data and that are likely
-to evolve over time.
-
-
-Overall, a NoSQL data model is a good choice for the notice structure in
-an ETL pipeline that is likely to evolve over time, because it allows
-for more flexibility when processing data, handles unstructured and
-distributed data, supports flexible querying, and is cost-effective.
-
-Figure 3.2 shows the structure of a notice and its evolution depending
-on the state in which the notice is located. In the figure, the emphasis
-is placed on the states from which a certain part of the notice
-structure becomes present. Note that once an element of the notice
-structure is present in a certain state, it will also be present in all
-the states derived from it, following the flow of states presented in
-Figure 3.1.
-
-image:system_arhitecture/media/image3.png[image,width=567,height=350]
-
-Figure 3.2 Dynamic behaviour of notice structure based on status
-
-Based on Figure 3.2, it is noted that the structure of a notice evolves
-with the transition to other states.
-
-For a notice in the state of NORMALISED_METADATA, we can access the
-following fields of a notice:
-
-* Original Metadata
-* Normalised Metadata
-* XML Manifestation
-
-For a notice in the TRANSFORMED state, we can access all the previous
-fields and the following new fields of a notice:
-
-* RDF Manifestation.
-
-For a notice in the VALIDATED state, we can access all the previous
-fields and the following new fields of a notice:
-
-* XPath Coverage Validation
-
-* SHACL Validation
-* SPARQL Validation
-
-For a notice in the PACKAGED state, we can access all the previous
-fields and the following new fields of a notice:
-
-* METS Manifestation
-
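-As a purely illustrative sketch of this document-oriented structure, a
-notice in the VALIDATED state could look like the following document;
-the field names are indicative only and do not reflect the exact schema
-used in the database.
-
-[source,python]
-----
-# Illustrative shape of a notice document in the VALIDATED state
-# (field names are indicative, not the exact database schema).
-validated_notice = {
-    "status": "VALIDATED",
-    "original_metadata": {"title": "...", "publication_date": "..."},
-    "normalised_metadata": {"publication_number": "...", "form_type": "..."},
-    "xml_manifestation": "<notice>...</notice>",
-    "rdf_manifestation": "@prefix epo: <...> .",  # present since TRANSFORMED
-    "validation": {                               # added in VALIDATED
-        "xpath_coverage": {"coverage": 0.97},
-        "shacl": {"conforms": True},
-        "sparql": {"passed": 42, "failed": 0},
-    },
-    # "mets_manifestation" is added only once the notice reaches PACKAGED.
-}
-----
-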
-=== Application view of the process
-
-The primary actor of the TED-SWS system will be the Operations Manager,
-who will interact with the system. Application-level pipeline control is
-achieved through the Airflow stack. Figure 4 shows the AirflowUser actor
-representing the Operations Manager; this diagram is at the application
-level of the process.
-
-image:system_arhitecture/media/image7.png[image,width=534,height=585]
-
-Figure 4 Dependencies between Airflow DAGs
-
-Based on the use cases defined for an Operations Manager, Figure 4 shows
-the control functionality of the TED-SWS pipeline that they can use. In
-addition to the functionality available to the AirflowUser actor, the
-dependencies between DAGs are also rendered. We can note that another
-actor named AirflowScheduler is defined; this actor represents a
-mechanism that automatically executes certain DAGs at a certain time
-interval.
-
-== Architectural choices
-
-This section describes the following architectural choices:
-
-* Is this an SOA? It is an SOA, but not REST microservices. Why not
-microservices?
-* Why a NoSQL data model vs. an SQL data model?
-* Why an ETL/ELT approach vs. event sourcing?
-* Why batch processing vs. event streams?
-* Why Airflow?
-* Why Metabase?
-* Why a quick deduplication process? And what are the plans for the
-future?
-
-=== Why is this an SOA (service-oriented architecture)?
-
-ETL (Extract, Transform, Load) architecture is considered
-state-of-the-art for batch processing tasks, using Airflow for pipeline
-management, for several reasons:
-
-[arabic]
-. *Flexibility*: ETL architecture allows for flexibility in the data
-pipeline as it separates the data extraction, transformation, and
-loading processes. This allows for easy modification and maintenance of
-each individual step without affecting the entire pipeline.
-. *Scalability*: ETL architecture allows for the easy scaling of data
-processing tasks, as new data sources can be added or removed without
-impacting the entire pipeline.
-. *Error Handling*: ETL architecture allows for easy error handling as
-each step of the pipeline can be monitored and errors can be isolated to
-a specific step.
-. *Reusability:* ETL architecture allows for the reuse of existing data
-pipelines, as new data sources can be added without modifying existing
-pipelines.
-. *System management*: Airflow is an open-source workflow management
-system that allows for easy scheduling, monitoring, and management of
-data pipelines. It integrates seamlessly with ETL architecture and
-allows for easy management of complex data pipelines.
-
-Overall, ETL architecture combined with Airflow as pipeline management
-provides a robust and efficient solution for batch processing tasks.
-
-=== Why a Monolithic Architecture vs a Microservices Architecture?
-
-There are several reasons why a monolithic architecture may be more
-suitable for an ETL architecture with a batch processing pipeline using
-Airflow as the pipeline management tool:
-
-[arabic]
-. *Simplicity*: A monolithic architecture is simpler to design and
-implement as it involves a single codebase and a single deployment
-process. This makes it easier to manage and maintain the ETL pipeline.
-. *Performance*: A monolithic architecture may be more performant than a
-microservices architecture as it allows for more efficient communication
-between the different components of the pipeline. This is particularly
-important for batch processing pipelines, where speed and efficiency are
-crucial.
-. *Scalability*: Monolithic architectures can be scaled horizontally by
-adding more resources to the system, such as more servers or more
-processing power. This allows for the system to handle larger amounts of
-data and handle more complex processing tasks.
-. *Airflow Integration*: Airflow is designed to work with monolithic
-architectures, and it can be more difficult to integrate with a
-microservices architecture. Airflow's DAGs and tasks are designed to
-work with a single codebase, and it may be more challenging to manage
-different services and pipelines across multiple microservices.
-
-Overall, a monolithic architecture may be more suitable for an ETL
-architecture with a batch processing pipeline using Airflow as the
-pipeline management tool due to its simplicity, performance,
-scalability, and ease of integration with Airflow.
-
-=== Why ETL/ELT approach vs Event Sourcing?
-
-ETL (Extract, Transform, Load) architecture is typically used for moving
-and transforming data from one system to another, for example, from a
-transactional database to a data warehouse for reporting and analysis.
-It is a batch-oriented process that is typically scheduled to run at
-specific intervals.
-
-Event sourcing architecture, on the other hand, is a way of storing and
-managing the state of an application by keeping track of all the changes
-to the state as a sequence of events. This allows for better auditing
-and traceability of the state of the application over time, as well as
-the ability to replay past events to reconstruct the current state.
-Event sourcing is often used in systems that require high performance,
-scalability, and fault tolerance.
-
-In summary, ETL architecture is mainly used for data integration and
-data warehousing, while event sourcing is mainly used for building
-highly scalable and fault-tolerant systems that need to store and manage
-the state of an application over time.
-
-A hybrid architecture is implemented in the TED-SWS pipeline, based on
-an ETL architecture but with state storage to repeat a pipeline sequence
-as needed.
-
-=== Why Batch processing vs Event Streams?
-
-Batch processing architecture and Event Streams architecture are two
-different approaches to processing data.
-
-Batch processing architecture is a traditional approach where data is
-processed in batches. This means that data is collected over a period of
-time and then processed all at once in a single operation. This approach
-is typically used for tasks such as data analysis, data mining, and
-reporting. It is best suited for tasks that can be done in a single pass
-and do not require real-time processing.
-
-Event Streams architecture, on the other hand, is a more modern approach
-where data is processed in real-time as it is generated. This means that
-data is processed as soon as it is received, rather than waiting for a
-batch to be collected. This approach is typically used for tasks such as
-real-time monitoring, data analytics, and fraud detection. It is best
-suited for tasks that require real-time processing and cannot be done in
-a single pass.
-
-In summary, Batch processing architecture is best suited for tasks that
-can be done in a single pass and do not require real-time processing,
-whereas Event Streams architecture is best suited for tasks that require
-real-time processing and cannot be done in a single pass.
-
-Because the TED-SWS pipeline has an ETL architecture, data processing is
-done in batches. The batches of notices are formed per day: all the
-notices of a day form a batch that will be processed. Another method of
-creating a batch is grouping notices by status and executing the
-pipeline depending on their status, as illustrated below.
-
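-For illustration only, selecting such a status-based batch from a
-document store could look like the sketch below; the connection string,
-database and collection names are assumptions, not the project's actual
-configuration.
-
-[source,python]
-----
-from pymongo import MongoClient
-
-# Connection string, database and collection names are illustrative.
-client = MongoClient("mongodb://localhost:27017")
-notices = client["ted_sws"]["notices"]
-
-# Form a batch of all notices that are ready to be transformed.
-batch = list(notices.find({"status": "NORMALISED_METADATA"}))
-print(f"Batch size: {len(batch)}")
-----
-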
-=== Why NoSQL data model vs SQL data model?
-
-There are several reasons why a NoSQL data model may be more suitable
-for an ETL architecture with a batch processing pipeline compared to an
-SQL data model:
-
-[arabic]
-. *Scalability*: NoSQL databases are designed to handle large amounts of
-data and can scale horizontally, allowing for the easy addition of more
-resources as the amount of data grows. This is particularly useful for
-batch processing pipelines that need to handle large amounts of data.
-. *Flexibility*: NoSQL databases are schema-less, which means that the
-data structure can change without having to modify the database schema.
-This allows for more flexibility when processing data, as new data types
-or fields can be easily added without having to make changes to the
-database.
-. *Performance*: NoSQL databases are designed for high-performance and can
-handle high levels of read and write operations. This is particularly
-useful for batch processing pipelines that need to process large amounts
-of data in a short period of time.
-
-. *Handling Unstructured Data*: NoSQL databases are well suited for
-handling unstructured data, such as JSON or XML, that can't be handled
-by SQL databases. This is particularly useful for ETL pipelines that
-need to process unstructured data.
-
-. *Handling Distributed Data*: NoSQL databases are designed to handle
-distributed data, which allows for data to be stored and processed on
-multiple servers. This can help to improve performance and scalability,
-as well as provide fault tolerance.
-
-. *Cost*: NoSQL databases are generally less expensive than SQL databases,
-as they don't require expensive hardware or specialized software. This
-can make them a more cost-effective option for ETL pipelines that need
-to handle large amounts of data.
-
-Overall, a NoSQL data model may be more suitable for an ETL architecture
-with a batch processing pipeline compared to an SQL data model due to
-its scalability, flexibility, performance, handling of unstructured and
-distributed data, and cost-effectiveness. It is important to note that
-the choice of a NoSQL data model satisfies the specific requirements of
-the TED-SWS processing pipeline and the nature of the data to be
-processed.
-
-=== Why Airflow?
-
-Airflow is a great solution for ETL pipeline and batch processing
-architecture because it provides several features that are well-suited
-to these types of tasks. First, Airflow provides a powerful scheduler
-that allows you to define and schedule ETL jobs to run at specific
-intervals. This means that you can set up your pipeline to run on a
-regular schedule, such as every day or every hour, without having to
-manually trigger the jobs. Second, Airflow provides a web-based user
-interface that makes it easy to monitor and manage your pipeline.
-
-Both aspects of Airflow are perfectly compatible with the needs of the
-TED-SWS architecture and the use cases required for an Operations
-Manager that will interact with the system. Airflow therefore covers the
-needs of batch processing management and ETL pipeline management.
-
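-A minimal sketch of how such a scheduled pipeline is declared in Airflow
-is shown below. The DAG id, schedule and task bodies are illustrative
-placeholders, not the project's actual DAG definitions.
-
-[source,python]
-----
-from datetime import datetime
-
-from airflow import DAG
-from airflow.operators.python import PythonOperator
-
-
-def fetch_notices():
-    print("fetching notices from the TED API ...")
-
-
-def transform_notices():
-    print("transforming notices from XML to RDF ...")
-
-
-# Illustrative DAG: runs daily and chains two steps of the pipeline.
-with DAG(
-    dag_id="illustrative_notice_pipeline",
-    start_date=datetime(2023, 1, 1),
-    schedule_interval="@daily",
-    catchup=False,
-) as dag:
-    fetch = PythonOperator(task_id="fetch_notices", python_callable=fetch_notices)
-    transform = PythonOperator(task_id="transform_notices", python_callable=transform_notices)
-
-    fetch >> transform
-----
-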
-Airflow provides good coverage of the use cases for an Operations
-Manager, in particular the following:
-
-[arabic]
-. *Monitoring pipeline performance*: An operations manager can use Airflow
-to monitor the performance of the ETL pipeline and identify any
-bottlenecks or issues that may be impacting the pipeline's performance.
-They can then take steps to optimize the pipeline to improve its
-performance and ensure that data is being processed in a timely and
-efficient manner.
-
-. *Managing pipeline schedule*: The operations manager can use Airflow to
-schedule the pipeline to run at specific times, such as during off-peak
-hours or when resources are available. This can help to minimize the
-impact of the pipeline on other systems and ensure that data is
-processed in a timely manner.
-
-. *Managing pipeline resources*: The operations manager can use Airflow to
-manage the resources used by the pipeline, such as CPU, memory, and
-storage. They can also use Airflow to scale the pipeline up or down as
-needed to meet changing resource requirements.
-
-. *Managing pipeline failures*: Airflow allows the operations manager to
-set up notifications and alerts for when a pipeline fails or a task
-fails. This allows them to quickly identify and address any issues that
-may be impacting the pipeline's performance.
-
-. *Managing pipeline dependencies*: The operations manager can use Airflow
-to manage the dependencies between different tasks in the pipeline, such
-as ensuring that notice fetching is completed before notice indexing or
-notice metadata normalization.
-
-. *Managing pipeline versioning*: Airflow allows the operations manager to
-maintain different versions of the pipeline, which can be useful for
-testing new changes before rolling them out to production.
-
-. *Managing pipeline security*: Airflow allows the operations manager to
-set up security controls to protect the pipeline and the data it
-processes. They can also use Airflow to audit and monitor access to the
-pipeline and the data it processes.
-
-=== Why Metabase?
-
-Metabase is an excellent solution for data analysis and KPI monitoring
-for a batch processing system, as it offers several key features that
-make it well suited for this type of use case required within the
-TED-SWS system.
-
-First, Metabase is highly customizable, allowing users to create and
-modify dashboards, reports, and visualizations to suit their specific
-needs. This makes it easy to track and monitor the key performance
-indicators (KPIs) that are most important for the batch processing
-system, such as the number of jobs processed, the average processing
-time, and the success rate of job runs.
-
-Second, Metabase offers a wide range of data connectors, allowing users
-to easily connect to and query data sources such as SQL databases, NoSQL
-databases, CSV files, and APIs. This makes it easy to access and analyze
-the data that is relevant to the batch processing system. In TED-SWS the
-data domain model is realized by a document-based data model, not a
-tabular relational data model, so Metabase is a good tool for analyzing
-data with a document-based model.
-
-Third, Metabase has a user-friendly interface that makes it easy to
-navigate and interact with data, even for users with little or no
-technical experience. This makes it accessible to a wide range of users,
-including business analysts, data scientists, and other stakeholders who
-need to monitor and analyse the performance of the batch processing
-system.
-
-Finally, Metabase offers robust security and collaboration features,
-making it easy to share and collaborate on data and insights with team
-members and stakeholders. This makes it an ideal solution for
-organizations that need to monitor and analyse the performance of a
-batch processing system across multiple teams or departments.
-
-=== Why quick deduplication process?
-
-One of the main challenges in entity deduplication in the semantic web
-domain is dealing with the complexity and diversity of the data.
-This can include dealing with different data formats, schemas, and
-vocabularies, as well as handling missing or incomplete data.
-Additionally, entities may have multiple identities or representations,
-making it difficult to determine which entities are duplicates and which
-are distinct. Another difficulty is the scalability of the algorithm to
-handle large amounts of data. The algorithm should be efficient and
-accurate enough to handle a huge number of entities.
-
-There are several approaches and solutions for entity deduplication in
-the semantic web. Some of the top solutions include:
-
-[arabic]
-. *String-based methods*: These methods use string comparison techniques
-such as Jaccard similarity, Levenshtein distance, and cosine similarity
-to identify duplicates based on the similarity of their string
-representations.
-. *Machine learning-based methods*: These methods use machine learning
-algorithms such as decision trees, random forests, and neural networks
-to learn patterns in the data and identify duplicates.
-
-. *Knowledge-based methods*: These methods use external knowledge sources
-such as ontologies, taxonomies, and linked data to disambiguate entities
-and identify duplicates.
-
-. *Hybrid methods*: These methods combine multiple techniques, such as
-string-based and machine learning-based methods, to improve the accuracy
-of deduplication.
-
-. *Blocking Method*: This method is used to reduce the number of entities
-that need to be compared by grouping similar entities together.
-
-In the TED-SWS pipeline, the deduplication of Organization type entities
-is performed using string-based methods. String-based methods are
-often used for organization entity deduplication because of their
-simplicity and effectiveness.
-
-TED data often contains information about tenders and public
-procurement, where organizations are identified by their names.
-Organization names are often unique and can be used to identify
-duplicates with high accuracy. String-based methods can be used to
-compare the similarity of different organization names, which can be
-effective in identifying duplicates.
-
-Additionally, the TED data is highly structured, so it is easy to
-extract and compare the names of organizations. String-based methods are
-also relatively fast and easy to implement, making them a good choice
-for large data sets. These methods may not be as effective for other
-types of entities, such as individuals, where additional information may
-be needed to identify duplicates. It is also important to note that
-string-based methods may not work as well for misspelled or abbreviated
-names.
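-
-A minimal sketch of such a string-based comparison, using only the
-Python standard library, is shown below; the normalisation rules and the
-similarity threshold are illustrative assumptions, not the values used
-by the pipeline.
-
-[source,python]
-----
-from difflib import SequenceMatcher
-
-
-def normalise(name: str) -> str:
-    """Apply very light normalisation before comparison."""
-    return " ".join(name.lower().replace(",", " ").replace(".", " ").split())
-
-
-def are_duplicates(name_a: str, name_b: str, threshold: float = 0.9) -> bool:
-    """Treat two organisation names as duplicates above a similarity threshold."""
-    ratio = SequenceMatcher(None, normalise(name_a), normalise(name_b)).ratio()
-    return ratio >= threshold
-
-
-print(are_duplicates("Ministry of Finance", "Ministry of Finance."))   # True
-print(are_duplicates("Ministry of Finance", "Ministry of Education"))  # False
-----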
-
-Using a quick and dirty deduplication approach instead of a complex
-system at the first iteration of a system implementation can be
-beneficial for several reasons:
-
-[arabic]
-. *Speed*: A quick approach can be implemented quickly and can
-help to identify and remove duplicates quickly. This can be particularly
-useful when working with large and complex data sets, where a more
-complex approach may take a long time to implement and test.
-. *Cost*: A quick and dirty approach is generally less expensive to
-implement than a complex system, as it requires fewer resources and less
-development time.
-. *Simplicity*: A quick and dirty approach is simpler and easier to
-implement than a complex system, which can reduce the risk of errors and
-bugs.
-. *Flexibility*: A quick and dirty approach allows you to start with a
-basic system and adapt it as needed, which can be more flexible than a
-complex system that is difficult to change.
-
-. *Testing*: A quick and dirty approach allows you to test the system
-quickly, get feedback from users and stakeholders, and then use that
-feedback to improve the system.
-
-
-However, it is worth noting that the quick and dirty approach is not a
-long-term solution and should be used only as a first step in the
-implementation of an MDR system. This approach can help to quickly
-identify and remove duplicates and establish a basic system, but it may
-not be able to handle all the complexity and diversity of the data, so
-it is important to plan for and implement more advanced techniques as
-the system matures.
-
-=== What are the plans for future deduplication?
-
-In the future, another Master Data Registry type system will be used to
-deduplicate entities in the TED-SWS system, which will be implemented
-according to the requirements for deduplication of entities from
-notices.
-
-The future Master Data Registry (MDR) system for entity deduplication
-should have the following architecture:
-
-[arabic]
-. *Data Ingestion*: This component is responsible for extracting and
-collecting data from various sources, such as databases, files, and
-APIs. The data is then transformed, cleaned, and consolidated into a
-single format before it is loaded into the MDR.
-
-. *Data Quality*: This component is responsible for enforcing data quality
-rules, such as format, completeness, and consistency, on the data before
-it is entered into the MDR. This can include tasks such as data
-validation, data standardization, and data cleansing.
-
-. *Entity Dedup*: This component is responsible for identifying and
-removing duplicate entities in the MDR. This can be done using a
-combination of techniques such as string-based, machine learning-based,
-or knowledge-based methods.
-
-. *Data Governance*: This component is responsible for ensuring that the
-data in the MDR is accurate, complete, and up-to-date. This can include
-processes for data validation, data reconciliation, and data
-maintenance.
-
-. *Data Access and Integration*: This component provides access to the
-MDR data through a user interface and APIs, and integrates the MDR data
-with other systems and applications.
-
-. *Data Security*: This component is responsible for ensuring that the
-data in the MDR is secure, and that only authorized users can access it.
-This can include tasks such as authentication, access control, and
-encryption.
-
-. *Data Management*: This component is responsible for managing the data
-in the MDR, including tasks such as data archiving, data backup, and
-data recovery.
-
-. *Monitoring and Analytics*: This component is responsible for monitoring
-and analysing the performance of the MDR system, and for providing
-insights into the data to help improve the system.
-
-. *Services layer*: This component is responsible for providing services
-such as indexing, search, and query functionalities over the data.
-
-
-All these components should be integrated and work together to provide a
-comprehensive and efficient MDR system for entity deduplication. The
-system should be scalable and flexible enough to handle large amounts of
-data and adapt to changing business requirements.
-
-
-
diff --git a/docs/antora/modules/ROOT/pages/demo_installation.adoc b/docs/antora/modules/ROOT/pages/technical/demo_installation.adoc
similarity index 100%
rename from docs/antora/modules/ROOT/pages/demo_installation.adoc
rename to docs/antora/modules/ROOT/pages/technical/demo_installation.adoc
diff --git a/docs/antora/modules/ROOT/pages/event_manager.adoc b/docs/antora/modules/ROOT/pages/technical/event_manager.adoc
similarity index 100%
rename from docs/antora/modules/ROOT/pages/event_manager.adoc
rename to docs/antora/modules/ROOT/pages/technical/event_manager.adoc
diff --git a/docs/antora/modules/ROOT/pages/mapping_suite_cli_toolchain.adoc b/docs/antora/modules/ROOT/pages/technical/mapping_suite_cli_toolchain.adoc
similarity index 99%
rename from docs/antora/modules/ROOT/pages/mapping_suite_cli_toolchain.adoc
rename to docs/antora/modules/ROOT/pages/technical/mapping_suite_cli_toolchain.adoc
index af1253057..34df96423 100644
--- a/docs/antora/modules/ROOT/pages/mapping_suite_cli_toolchain.adoc
+++ b/docs/antora/modules/ROOT/pages/technical/mapping_suite_cli_toolchain.adoc
@@ -10,8 +10,8 @@ Open a Linux terminal and clone the `ted-rdf-mapping` project.
[source,bash]
----
-git clone https://github.com/OP-TED/ted-rdf-mapping
-cd ted-rdf-mapping
+git clone https://github.com/meaningfy-ws/mapping-workbench
+cd mapping-workbench
----
Create a virtual Python environment and activate it.
@@ -34,7 +34,8 @@ Install the TED-SWS CLIs as a Python package using the `pip` package manager.
[source,bash]
----
-pip install git+https://github.com/OP-TED/ted-rdf-conversion-pipeline#egg=ted-sws
+make install
+make local-dotenv-file
----
== Usage
diff --git a/docs/antora/modules/ROOT/pages/ted-sws-introduction.adoc b/docs/antora/modules/ROOT/pages/ted-sws-introduction.adoc
new file mode 100644
index 000000000..a10fb6567
--- /dev/null
+++ b/docs/antora/modules/ROOT/pages/ted-sws-introduction.adoc
@@ -0,0 +1,38 @@
+== Introduction
+
+Although TED notice data is already available to the general public
+through the search API provided by the TED website, the current offering
+has many limitations that impede access to and reuse of the data. One
+such important impediment is, for example, the current format of the data.
+
+Historical TED data come in various XML formats that evolved together
+with the standard TED XML schema. The imminent introduction of eForms
+will also introduce further diversity in the XML data formats available
+through TED's search API. This makes it practically impossible for users
+to consume and process data that span across several years, as
+their information systems must be able to process several different
+flavours of the available XML schemas as well as to keep up with the
+schema's continuous evolution. Their search capabilities are therefore
+confined to a very limited set of metadata.
+
+The TED Semantic Web Service will remove these barriers by providing one
+common format for accessing and reusing all TED data. Coupled with the
+eProcurement Ontology, the TED data will also have semantics attached to
+them allowing users to directly link them with other datasets.
+Moreover, users will now be able to perform much more elaborate
+queries directly on the data source (through the SPARQL endpoint). This
+will reduce their need for data warehousing in order to perform complex
+queries.
+
+These developments, by lowering the barriers, will give rise to a vast
+number of new use-cases that will enable stakeholders and end-users to
+benefit from increased availability of analytics. The ability to perform
+complex queries on public procurement data will be equally open to large
+information systems as well as to simple desktop users with a copy of
+Excel and an internet connection.
+
+To summarize, the TED Semantic Web Service (TED SWS) is a pipeline
+system that continuously converts the public procurement notices (in XML
+format) available on the TED Website into RDF format, publishes them
+into CELLAR and makes them available to the public through CELLAR’s
+SPARQL endpoint.
\ No newline at end of file
diff --git a/docs/antora/modules/ROOT/pages/using_procurement_data.adoc b/docs/antora/modules/ROOT/pages/ted_data/using_procurement_data.adoc
similarity index 91%
rename from docs/antora/modules/ROOT/pages/using_procurement_data.adoc
rename to docs/antora/modules/ROOT/pages/ted_data/using_procurement_data.adoc
index 4d8c924c6..129f3652f 100644
--- a/docs/antora/modules/ROOT/pages/using_procurement_data.adoc
+++ b/docs/antora/modules/ROOT/pages/ted_data/using_procurement_data.adoc
@@ -1,25 +1,19 @@
= Using procurement data
+This page explains how to use procurement data accessed from *Cellar* with Microsoft Excel, Python and R. There are different ways to access TED notices in CELLAR
+and use the data. The methods described below work with TED notices and other types of semantic assets.
+We use a sample SPARQL query that returns a list of countries; users should use TED-specific SPARQL queries to fetch the data they need.
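+
+For orientation, the sketch below shows the kind of query used in the examples, executed from Python with the SPARQLWrapper library. The endpoint URL and the query itself are illustrative and may need to be adapted to your needs.
+
+[source,python]
+----
+from SPARQLWrapper import SPARQLWrapper, JSON
+
+# Cellar / EU Publications Office SPARQL endpoint (verify before use).
+endpoint = SPARQLWrapper("https://publications.europa.eu/webapi/rdf/sparql")
+endpoint.setReturnFormat(JSON)
+
+# Sample query: list countries from the Publications Office country authority table.
+endpoint.setQuery("""
+    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
+    SELECT DISTINCT ?country ?label
+    WHERE {
+      ?country skos:inScheme <http://publications.europa.eu/resource/authority/country> ;
+               skos:prefLabel ?label .
+      FILTER (lang(?label) = "en")
+    }
+    ORDER BY ?label
+""")
+
+for row in endpoint.query().convert()["results"]["bindings"]:
+    print(row["label"]["value"])
+----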
-This page explains how to use procurement data accessed from Cellar with Excel, Python, R
-and Power BI.
-There are different ways to access TED notices in CELLAR
-and use the data. As scenarios, each method presented in this page
-will take over the list of European countries and shows them in one
-column.
-
*Note:* Jupyter Notebook samples are explained with assumption that a
code editor is already prepared. For example VS Code or Pycharm, or
Jupyter server. Examples are explained using
https://code.visualstudio.com/docs[[.underline]#Visual Studio Code#].
-== Excel
+== Microsoft Excel
-This chapter shows an example using Excel. Microsoft Excel is a
-spreadsheet developed by Microsoft through which we will use the
-interface to query CELLAR repository to see an example.
+This chapter shows an example of getting data from Cellar using Microsoft Excel.
[arabic]
. Prepare link with necessary query:
diff --git a/docs/antora/modules/ROOT/pages/user_manual.adoc b/docs/antora/modules/ROOT/pages/user_manual.adoc
deleted file mode 100644
index a15c376c0..000000000
--- a/docs/antora/modules/ROOT/pages/user_manual.adoc
+++ /dev/null
@@ -1,1338 +0,0 @@
-= TED-SWS User manual
-
-[width="100%",cols="25%,75%",options="header",]
-|===
-|*Editors* |Dragos Paun
- +
-Eugeniu Costetchi
-
-|*Version* |1.0.0
-
-|*Date* |20/02/2023
-|===
-
-== Glossary [[glossary]]
-
-*Airflow* - an open-source platform for developing, scheduling, and
-monitoring batch-oriented pipelines. The web interface helps manage the
-state and monitoring of your pipelines.
-
-*Metabase* - Metabase is the BI tool with the friendly UX and integrated
-tooling to let you explore data gathered by running the pipelines
-available in Airflow.
-
-*Cellar* - is the central content and metadata repository of the
-Publications Office of the European Union
-
-*TED-SWS* - is a pipeline system that continuously converts the public
-procurement notices (in XML format) available on the TED Website into
-RDF format and publishes them into CELLAR
-
-*DAG* - (Directed Acyclic Graph) is the core concept of Airflow,
-collecting Tasks together, organized with dependencies and relationships
-that say how they should run. The DAGs are basically the pipelines that
-run in this project to get the public procurement notices from XML to
-RDF and publish them into CELLAR.
-
-== Introduction
-
-Although TED notice data is already available to the general public
-through the search API provided by the TED website, the current offering
-has many limitations that impede access to and reuse of the data. One
-such important impediment is for example the current format of the data.
-
-Historical TED data come in various XML formats that evolved together
-with the standard TED XML schema. The imminent introduction of eForms
-will also introduce further diversity in the XML data formats available
-through TED's search API. This makes it practically impossible for
-reusers to consume and process data that span across several years, as
-their information systems must be able to process several different
-flavors of the available XML schemas as well as to keep up with the
-schema's continuous evolution. Their search capabilities are therefore
-confined to a very limited set of metadata.
-
-The TED Semantic Web Service will remove these barriers by providing one
-common format for accessing and reusing all TED data. Coupled with the
-eProcurement Ontology, the TED data will also have semantics attached to
-them allowing reusers to directly link them with other datasets.
-Moreover, reusers will now be able to perform much more elaborate
-queries directly on the data source (through the SPARQL endpoint). This
-will reduce their need for data warehousing in order to perform complex
-queries.
-
-These developments, by lowering the barriers, will give rise to a vast
-number of new use-cases that will enable stakeholders and end-users to
-benefit from increased availability of analytics. The ability to perform
-complex queries on public procurement data will be equally open to large
-information systems as well as to simple desktop users with a copy of
-Excel and an internet connection.
-
-To summarize the TED Semantic Web Service (TED SWS) is a pipeline system
-that continuously converts the public procurement notices (in XML
-format) available on the TED Website into RDF format, publishes them
-into CELLAR and makes them available to the public through CELLAR’s
-SPARQL endpoint.
-
-=== Purpose of the document
-
-The purpose of this document is to explain how to use Airflow and
-Metabase to control and monitor the TED-SWS system. This document may be
-updated by the development team as the system evolves.
-
-=== Intended audience
-
-This document is intended for persons involved in controlling and
-monitoring the services offered by the TED-SWS system.
-
-==== Useful Resources [[useful-resources]]
-
-https://www.metabase.com/learn/getting-started/tour-of-metabase[[.underline]#https://www.metabase.com/learn/getting-started/tour-of-metabase#]
-
-https://www.metabase.com/docs/latest/exploration-and-organization/start[[.underline]#https://www.metabase.com/docs/latest/exploration-and-organization/start#]
-
-https://airflow.apache.org/docs/apache-airflow/2.4.3/ui.html[[.underline]#https://airflow.apache.org/docs/apache-airflow/2.4.3/ui.html#]
-(only UI / Screenshots section)
-
-== Architectural overview
-
-This section provides a high-level overview of the TED-SWS system and
-its components. As presented in the image below, the system is built
-from a multitude of services / components grouped together to reach the
-end goal. The system can be divided into 2 main parts:
-
-* Controlling and monitoring
-* Core functionality (code base / TED SWS pipeline)
-
-Each part of the system is formed by a group of components.
-
-Controlling and monitoring, performed by an Operations Manager, contains
-a workflow / pipeline management service (Airflow) and a data
-visualization service (Metabase). Using this group of services, a user
-should be able to control the execution of the existing pipelines and
-also monitor the execution results.
-
-The core functionality comprises many services developed to accommodate
-the entire process of transforming a public procurement notice (in XML
-format) available on the TED Website into RDF format and publishing it
-into CELLAR. Here is a short description of some of the main services:
-
-* fetching service - fetching the notice from TED website
-* indexing service - getting the unique XPATHs in a notice XML
-* metadata normalisation service - extract notice metadata from the XML
-* transformation service - transform the XML to RDF
-* entity resolution and deduplication service - resolve duplicated
-entities in the RDF
-* validation service - validating the RDF transformation
-* packaging service - creating the METS package
-* publishing service - sending the METS package to CELLAR
-
-image:user_manual/media/image59.png[image,width=100%,height=270]
-
-
-=== Pipelines architecture ( Airflow DAGs )
-
-In this section we present a graphic representation that shows the
-flow and dependencies of the available pipelines (DAGs) in Airflow. In
-this representation we see two users, AirflowUser and AirflowScheduler,
-where AirflowUser is the user that enables and triggers the DAGs and
-AirflowScheduler is the Airflow component that starts the DAGs
-automatically following a schedule.
-
-The automatically triggered DAGs controlled by the Airflow Scheduler are:
-
-* fetch_notices_by_date
-* daily_check_notices_availibility_in_cellar
-* daily_materialized_views_update
-
-image:user_manual/media/image63.png[image,width=100%,height=382]
-
-The DAGs marked with _purple_ (load_mapping_suite_in_database), _yellow_
-(reprocess_unnormalised_notices_from_backlog,reprocess_unpackaged_notices_from_backlog,
-reprocess_unpublished_notices_from_backlog,reprocess_untransformed_notices_from_backlog,
-reprocess_unvalidated_notices_from_backlog) and _green_
-(fetch_notices_by_date, fetch_notices_by_date_range,
-fetch_notices_by_query) will trigger automatically the
-*notice_processing_pipeline* marked with _blue_, and this will take care
-of the entire processing steps for a notice. These can be used by a user
-by manually triggering these DAGs with or without configuration.
-
-The DAGs marked with _green_ (fetch_notices_by_date,
-fetch_notices_by_date_range, fetch_notices_by_query) are in charge of
-fetching the notices from TED API. The ones marked with _yellow_ (
-reprocess_unnormalised_notices_from_backlog,
-reprocess_unpackaged_notices_from_backlog,
-reprocess_unpublished_notices_from_backlog,
-reprocess_untransformed_notices_from_backlog,
-reprocess_unvalidated_notices_from_backlog) will handle the reprocessing
-of notices from the backlog. The purple marked DAG
-(load_mapping_suite_in_database) will handle the loading of mapping
-suites in the database that will be used to transform the notices.
-
-image:user_manual/media/image11.png[image,width=100%,height=660]
-
-== Notice statuses
-
-During the transformation process through the TED-SWS system, a notice
-will start with a certain status and it will transition to other
-statuses when a particular step of the pipeline
-(notice_processing_pipeline) offered by the system has completed
-successfully or unsuccessfully. This transition is done automatically
-and it will change the _status_ property of a notice. The system has the
-following statuses:
-
-* RAW
-* INDEXED
-* NORMALISED_METADATA
-* INELIGIBLE_FOR_TRANSFORMATION
-* ELIGIBLE_FOR_TRANSFORMATION
-* PREPROCESSED_FOR_TRANSFORMATION
-* TRANSFORMED
-* DISTILLED
-* VALIDATED
-* INELIGIBLE_FOR_PACKAGING
-* ELIGIBLE_FOR_PACKAGING
-* PACKAGED
-* INELIGIBLE_FOR_PUBLISHING
-* ELIGIBLE_FOR_PUBLISHING
-* PUBLISHED
-* PUBLICLY_UNAVAILABLE
-* PUBLICLY_AVAILABLE
-
-The transition from one status to another is decided by the system and
-can be viewed in the graphic representation below.
-
-image:user_manual/media/image14.png[image,width=100%,height=444]
-
-== Notice structure
-
-This section aims at presenting the anatomy of a Notice in the TED-SWS
-system and the dependence of structural elements on the phase of the
-transformation process. This is useful for the user to understand what
-happens behind the scenes and what information is available in the
-database in order to build analytics dashboards.
-
-The structure of a notice within the TED-SWS system consists of the
-following structural elements:
-
-* Status
-* Metadata
-** Original Metadata
-** Normalised Metadata
-* Manifestation
-** XMLManifestation
-** RDFManifestation
-** METSManifestation
-* Validation Report
-** XPATH Coverage Validation
-** SHACL Validation
-** SPARQL Validation
-
-The diagram below shows the high level structure of the Notice object
-and that certain structural parts of a notice within the system are
-dependent on its state. This means that as the transformation process
-runs through its steps the Notice state changes and new structural parts
-are added. For example, for a notice in the NORMALISED status we can
-access the Original Metadata, Normalised Metadata and XMLManifestation
-fields, for a notice in the TRANSFORMED status we can access in addition
-the RDFManifestation field and similarly for the rest of the statuses.
-
-The diagram depicts states as swim-lanes while the structural elements
-are depicted as ArchiMate Business Objects [cite ArchiMate]. The
-relations we use are composition (arrow with diamond ending) and
-inheritance (arrow with full triangle ending).
-
-As mentioned above about the states through which a notice can
-transition, if a certain structural field is present in a certain state,
-then all the states originating from this state will also have this
-field. Not all possible states are depicted. For brevity, we chose
-only the most significant ones, which segment the transformation process
-into stages.
-
-image:user_manual/media/image94.png[image,width=100%,height=390]
-
-== Security credentials
-
-The security credentials will be provided by the infrastructure team
-that installed the necessary infrastructure for this project. Some credentials are set in the environment file necessary for the
-infrastructure installation and others by manually creating a user by
-infra team.
-
-Bellow are the credentials that should be provided
-
-[width="100%",cols="25%,36%,39%",options="header",]
-|===
-|Name |Description |Comment
-|Metabase user |Metabase user for login. This should be an email address
-|This user was manually created by the infrastructure team
-
-|Metabase password |The temporary password that was set by the infra
-team for the user above |This user was manually created by the
-infrastructure team
-
-|Airflow user |Airflow UI user for login |This is the value of
-_AIRFLOW_WWW_USER_USERNAME variable from the env file
-
-|Airflow password |Airflow UI password for login |This is the value of
-_AIRFLOW_WWW_USER_PASSWORD variable from the env file
-
-|Fuseki user |Fuseki user for login |The login should be for admin user
-
-|Fuseki password |Fuseki password for login |This is the value of
-ADMIN_PASSWORD variable from the env file
-
-|Mongo-express user |Mongo-express user for login |This is the value of
-ME_CONFIG_BASICAUTH_USERNAME variable from the env file
-
-|Mongo-express password |Mongo-express password for login |This is the
-value of ME_CONFIG_BASICAUTH_PASSWORD variable from the env file
-|===
-
-== Workflow management with Airflow
-
-The management of the workflow is made available through the user
-interface of the Airflow system. This section describes the provided
-pipelines, and how to operate them in Airflow.
-
-=== Airflow DAG control board
-
-In this section we explain the most important elements to pay attention
-to when operating the pipelines. +
-In software engineering, a pipeline consists of a chain of processing
-elements (processes, threads, coroutines, functions, etc.), arranged so
-that the output of each element is the input of the next. In our case,
-as an example, look at the notice_processing_pipeline, which has this
-chain of processes that takes as input a notice from the TED website and
-as the final output (if every process from this pipeline runs
-successfully) a METS package with a transformed notice in the RDF
-format. Between the processes the input will always be a batch of
-notices. Batch processing is a method of processing large amounts of
-data in a single, pre-defined process. Batch processing is typically
-used for tasks that are performed periodically, such as daily, weekly,
-or monthly. Each step of the pipeline can have a successful or failure
-result, and as such the pipeline can be stopped at any step if something
-went wrong with one of its processes. In Airflow terminology a pipeline
-will be a DAG. He are the processes that will create our
-notice_processing_pipeline DAG:
-
-* notice normalisation
-* notice transformation
-* notice distillation
-* notice validation
-* notice packaging
-* notice publishing
-
-==== Enable / disable switch
-
-In Airflow all the DAGs can be enabled or disabled. If a dag is disabled
-that will stop the DAG from running even if that DAG is scheduled.
-
-When a dag is enabled the switch button will be blue and grey when it is
-disabled.
-
-To enable or disable a dag use the following switch button:
-
-image:user_manual/media/image21.png[image,width=100%,height=32]
-
-image:user_manual/media/image69.png[image,width=56,height=55]
-disabled position
-
-image:user_manual/media/image3.png[image,width=52,height=56]
-enabled position
-
-==== DAG Runs
-
-A DAG Run is an object representing an instantiation of the DAG in time.
-Any time the DAG is executed, a DAG Run is created and all tasks inside
-it are executed. The status of the DAG Run depends on the tasks states.
-Each DAG Run is run separately from one another, meaning that you can
-have many runs of a DAG at the same time.
-
-DAG Run Status
-
-A DAG Run status is determined when the execution of the DAG is
-finished. The execution of the DAG depends on its containing tasks and
-their dependencies. The status is assigned to the DAG Run when all of
-the tasks are in one of the terminal states (i.e. if there is no
-possible transition to another state) like success, failed or skipped.
-
-There are two possible terminal states for the DAG Run:
-
-* success if all the pipeline processes are either success or skipped,
-* failed if any of the pipeline processes is either failed or
-upstream_failed.
-
-In the runs column in the Airflow user interface we can see the state of
-the DAG run, and this can be one of the following:
-
-* queued
-* success
-* running
-* failed
-
-
-Here is an example of this different states
-
-image:user_manual/media/image54.png[image,width=422,height=315]
-
-The transitions for these states will start from queuing, then will go
-to running, and after will either go to success or failure.
-
-Clicking on the numbers associated with a particular DAG run state will
-show you a list of the DAG runs in that state.
-
-==== DAG actions
-
-In the Airflow user interface we have a run button in the Actions column
-that will allow you to trigger a specific DAG with or without specific
-configuration. When clicking on the run button a list of options will
-appear:
-
-* Trigger DAG (triggering DAG without config)
-* Trigger DAG w/ config (triggering DAG with config)
-
-
-image:user_manual/media/image24.png[image,width=378,height=165]
-
-==== DAG Run overview
-
-In the Airflow user interface, when clicking on the DAG name, an
-overview of the runs for that DAG will be available. This will include
-schema of the processes that are a part of the pipeline, task durations,
-code for the DAG, etc. To learn more about Airflow interface please
-refer to the Airflow user manual
-(link:#useful-resources[[.underline]#Useful Resources#])
-
-image:user_manual/media/image74.png[image,width=601,height=281]
-
-
-
-=== Available pipelines
-
-In this section we provide a brief inventory of provided pipelines
-including their names, a short description and a high level diagram.
-
-[arabic]
-
-. *notice_processing_pipeline* - this DAG performs the processing of a
-batch of notices, where the stages take place: normalization,
-transformation, validation, packaging, publishing. This is scheduled and
-automatically started by other DAGs.
-
-
-image:user_manual/media/image31.png[image,width=100%,height=176]
-
-image:user_manual/media/image25.png[image,width=100%,height=162]
-
-
-[arabic, start=2]
-
-. *load_mapping_suite_in_database* - this DAG performs the loading of a
-mapping suite or all mapping suites from a branch on GitHub, with the
-mapping suite the test data from it can also be loaded, if the test data
-is loaded the notice_processing_pipeline DAG will be triggered.
-
-
-
-*Config DAG params:*
-
-
-* mapping_suite_package_name: string
-* load_test_data: boolean
-* branch_or_tag_name: string
-* github_repository_url: string
-
-*Default values:*
-
-* mapping_suite_package_name = None (it will take all available mapping
-suites on that branch or tag)
-* load_test_data = false
-* branch_or_tag_name = "main"
-* github_repository_url= "https://github.com/OP-TED/ted-rdf-mapping.git"
-
-
-image:user_manual/media/image96.png[image,width=100%,height=56]
-
-[arabic, start=3]
-. *fetch_notices_by_query -* this DAG fetches notices from TED by using a
-query and, depending on an additional parameter, triggers the
-notice_processing_pipeline DAG in full or partial mode (execution of
-only one step).
-
-*Config DAG params:*
-
-* query : string
-* trigger_complete_workflow : boolean
-
-*Default values:*
-
-* trigger_complete_workflow = true
-
-image:user_manual/media/image56.png[image,width=100%,height=92]
-
-[arabic, start=4]
-. *fetch_notices_by_date -* this DAG fetches notices from TED for a day
-and, depending on an additional parameter, triggers the
-notice_processing_pipeline DAG in full or partial mode (execution of
-only one step).
-
-*Config DAG params:*
-
-* wild_card : string with date format %Y%m%d*
-* trigger_complete_workflow : boolean
-
-*Default values:*
-
-* trigger_complete_workflow = true
-
-image:user_manual/media/image33.png[image,width=100%,height=100]
-
-[arabic, start=5]
-. *fetch_notices_by_date_range -* this DAG receives a date range and
-triggers the fetch_notices_by_date DAG for each day in the date range.
-
-*Config DAG params:*
-
-
-* start_date : string with date format %Y%m%d
-* end_date : string with date format %Y%m%d
-
-image:user_manual/media/image75.png[image,width=601,height=128]
-
-[arabic, start=6]
-. *reprocess_unnormalised_notices_from_backlog -* this DAG selects all
-notices that are in RAW state and need to be processed and triggers the
-notice_processing_pipeline DAG to process them.
-
-*Config DAG params:*
-
-* start_date : string with date format %Y-%m-%d
-* end_date : string with date format %Y-%m-%d
-
-*Default values:*
-
-* start_date = None , because this param is optional
-* end_date = None, because this param is optional
-
-image:user_manual/media/image60.png[image,width=601,height=78]
-
-[arabic, start=7]
-. *reprocess_unpackaged_notices_from_backlog -* this DAG selects all
-notices to be repackaged and triggers the notice_processing_pipeline DAG
-to repackage them.
-
-*Config DAG params:*
-
-* start_date : string with date format %Y-%m-%d
-* end_date : string with date format %Y-%m-%d
-* form_number : string
-* xsd_version : string
-
-*Default values:*
-
-* start_date = None , because this param is optional
-* end_date = None, because this param is optional
-* form_number = None, because this param is optional
-* xsd_version = None, because this param is optional
-
-image:user_manual/media/image81.png[image,width=100%,height=73]
-
-[arabic, start=8]
-. *reprocess_unpublished_notices_from_backlog -* this DAG selects all
-notices to be republished and triggers the notice_processing_pipeline
-DAG to republish them.
-
-*Config DAG params:*
-
-
-* start_date : string with date format %Y-%m-%d
-* end_date : string with date format %Y-%m-%d
-* form_number : string
-* xsd_version : string
-
-*Default values:*
-
-
-* start_date = None , because this param is optional
-* end_date = None, because this param is optional
-* form_number = None, because this param is optional
-* xsd_version = None, because this param is optional
-
-image:user_manual/media/image37.png[image,width=100%,height=70]
-
-[arabic, start=9]
-. *reprocess_untransformed_notices_from_backlog -* this DAG selects all
-notices to be retransformed and triggers the notice_processing_pipeline
-DAG to retransform them.
-
-*Config DAG params:*
-
-
-* start_date : string with date format %Y-%m-%d
-* end_date : string with date format %Y-%m-%d
-* form_number : string
-* xsd_version : string
-
-*Default values:*
-
-* start_date = None , because this param is optional
-* end_date = None, because this param is optional
-* form_number = None, because this param is optional
-* xsd_version = None, because this param is optional
-
-
-image:user_manual/media/image102.png[image,width=100%,height=69]
-
-[arabic, start=10]
-. *reprocess_unvalidated_notices_from_backlog -* this DAG selects all
-notices to be revalidated and triggers the notice_processing_pipeline
-DAG to revalidate them.
-
-*Config DAG params:*
-
-* start_date : string with date format %Y-%m-%d
-* end_date : string with date format %Y-%m-%d
-* form_number : string
-* xsd_version : string
-
-*Default values:*
-
-
-* start_date = None , because this param is optional
-* end_date = None, because this param is optional
-* form_number = None, because this param is optional
-* xsd_version = None, because this param is optional
-
-image:user_manual/media/image102.png[image,width=100%,height=69]
-
-[arabic, start=11]
-. *daily_materialized_views_update -* this DAG selects all notices to be
-revalidated and triggers the notice_processing_pipeline DAG to
-revalidate them.
-
-*This DAG has no config or default params.*
-
-image:user_manual/media/image98.png[image,width=100%,height=90]
-
-[arabic, start=12]
-. *daily_check_notices_availability_in_cellar -* this DAG selects all
-notices to be revalidated and triggers the notice_processing_pipeline
-DAG to revalidate them.
-
-*This DAG has no config or default params.*
-
-
-image:user_manual/media/image67.png[image,width=339,height=81]
-
-=== Batch processing
-
-=== Running pipelines (How to)
-
-This chapter explains the basic utilization of Ted SWS Airflow pipelines
-by presenting in the format of answering the questions. Basic
-functionality can be used by running DAGs: a core concept of Airflow.
-For advanced documentation access:
-
-https://airflow.apache.org/docs/apache-airflow/stable/concepts/dags.html[[.underline]#https://airflow.apache.org/docs/apache-airflow/stable/concepts/DAGs.html#]
-
-==== UC1: How to load a mapping suite or mapping suites?
-
-As a user I want to load one or several mapping suites into the system
-so that notices can be transformed and validated with them.
-
-==== UC1.a To load all mapping suites
-
-[arabic]
-. Run *load_mapping_suite_in_database* DAG:
-[loweralpha]
-.. Enable DAG
-.. Click Run on Actions column (Play symbol button)
-.. Click Trigger DAG
-
-
-image:user_manual/media/image84.png[image,width=100%,height=61]
-
-==== UC1.b To load specific mapping suite
-
-[arabic]
-. Run *load_mapping_suite_in_database* DAG with configurations:
-[loweralpha]
-.. Enable DAG
-.. Click Run on Actions column (Play symbol button)
-.. Click Trigger DAG w/ config.
-
-image:user_manual/media/image36.png[image,width=100%,height=55]
-
-[arabic, start=2]
-. In the next screen
-
-[loweralpha]
-. In the configuration JSON text box insert the config:
-
-[source,python]
-{"mapping_suite_package_name": "package_F03"}
-
-[loweralpha, start=2]
-. Click Trigger button after inserting the configuration
-
-image:user_manual/media/image27.png[image,width=100%,height=331]
-
-[arabic, start=3]
-. Optional if you want to transform the available test notices that were
-used for development of the mapping suite you can add to configuration
-the *load_test_data* parameter with the value *true*
-
-image:user_manual/media/image103.png[image,width=100%,height=459]
-
-==== UC2: How to fetch and process notices for a day?
-
-As a user I want to fetch and process notices from a selected day so
-that they get published in Cellar and be available to the public in RDF
-format.
-
-UC2.a To fetch and transform notices for a day:
-
-[arabic]
-. Enable *notice_processing_pipeline* DAG
-. Run *fetch_notices_by_date* DAG with configurations:
-[loweralpha]
-.. Enable DAG
-.. Click Run on Actions column
-.. Click Trigger DAG w/ config
-
-image:user_manual/media/image26.png[image,width=100%,height=217]
-
-[arabic, start=3]
-. In the next screen
-
-[loweralpha]
-. In the configuration JSON text box insert the config:
-[source,python]
-{"wild_card ": "20220921*"}*
-
-The value *20220921** is the date of the day to fetch and transform with
-format: yyyymmdd*.
-
-
-[loweralpha, start=2]
-. Click Trigger button after inserting the configuration
-
-image:user_manual/media/image1.png[image,width=100%,height=310]
-
-[arabic, start=4]
-. Optional: It is possible to only fetch notices without transformation.
-To do so add *trigger_complete_workflow* configuration parameter and set
-its value to “false”. +
-[source,python]
-{"wild_card ": "20220921*", "trigger_complete_workflow": false}
-
-image:user_manual/media/image4.png[image,width=100%,height=358]
-
-
-==== UC3: How to fetch and process notices for date range?
-
-As a user I want to fetch and process notices published within a dare
-range so that they are published in Cellar and available to the public
-in RDF format.
-
-UC3.a To fetch for multiple days:
-
-[arabic]
-. Enable *notice_processing_pipeline* DAG
-. Run *fetch_notices_by_date_range* DAG with configurations:
-[loweralpha]
-.. Enable DAG
-.. Click Run on Actions column
-.. Click Trigger DAG w/ config.
-
-image:user_manual/media/image79.png[image,width=100%,height=205]
-
-[arabic, start=3]
-. In the next screen, in the configuration JSON text box insert the
-config:
-[source,python]
-{ "start_date": "20220920", "end_date": "20220920" }
-
-20220920 is the start date and 20220920 is the end date of the days to
-be fetched and transformed with format: yyyymmdd.
-
-[arabic, start=4]
-. Click Trigger button after inserting the configuration
-
-image:user_manual/media/image51.png[image,width=100%,height=331]
-
-==== UC4: How to fetch and process notices using a query?
-
-As a user I want to fetch and process notices published by specific
-filters that are available from the TED API so that they are published
-in Cellar and available to the public in RDF format.
-
-To fetch and transform notices by using a query follow the instructions
-below:
-
-[arabic]
-. Enable *notice_processing_pipeline* DAG
-. Run *fetch_notices_by_query* DAG with configurations:
-.. Enable DAG
-.. Click Run on Actions column
-.. Click Trigger DAG w/ config.
-
-image:user_manual/media/image61.png[image,width=100%,height=200]
-[arabic, start=3]
-. In the next screen
-
-[loweralpha]
-. In the configuration JSON text box insert the config:
-
-[source,python]
-{"query": "ND=[163-2021]"}
-
-
-ND=[163-2021] is the query that will run against the TED API to get
-notices that will match that query
-
-[loweralpha, start=2]
-. Click Trigger button after inserting the configuration
-
-image:user_manual/media/image93.png[image,width=100%,height=378]
-
-[arabic, start=4]
-. Optional: If you need to only fetch notices without
-transformation, add *trigger_complete_workflow* configuration as *false*
-
-image:user_manual/media/image49.png[image,width=100%,height=357]
-
-==== UC5: How to deal with notices that are in the backlog and what to run?
-
-As a user I want to reprocess notices that are in the backlog so that
-they are published in Cellar and available to the public in RDF format.
-
-Notices that have failed running a complete and successful
-notice_processing_pipeline run will be added to the backlog by using
-different statuses that will be added to these notices. The status of a
-notice will be automatically determined by the system. The backlog could
-have multiple notices in different statuses.
-
-The backlog is divided in five categories as follows:
-
-* notices that couldn’t be normalised
-* notices that couldn’t be transformed
-* notices that couldn’t be validated
-* notices that couldn’t be packaged
-* notices that couldn’t be published
-
-===== UC5.a Deal with notices that couldn't be normalised
-
-In the case that the backlog contains notices that couldn’t be
-normalised at some point and will want to try to reprocess those notices
-just run the *reprocess_unnormalised_notices_from_backlog* DAG following
-the instructions below.
-
-[arabic]
-. Enable the reprocess_unnormalised_notices_from_backlog DAG
-
-image:user_manual/media/image92.png[image,width=100%,height=44]
-
-[arabic, start=2]
-. Trigger DAG
-
-image:user_manual/media/image76.png[image,width=100%,height=54]
-
-===== UC5.b: Deal with notices that couldn't be transformed
-
-In the case that the backlog contains notices that couldn’t be
-transformed at some point and will want to try to reprocess those
-notices just run the *reprocess_untransformed_notices_from_backlog* DAG
-following the instructions below.
-
-[arabic]
-. Enable the reprocess_untransformed_notices_from_backlog DAG
-image:user_manual/media/image85.png[image,width=100%,height=36]
-
-[arabic, start=2]
-. Trigger DAG
-
-image:user_manual/media/image77.png[image,width=100%,height=54]
-
-===== UC5.c: Deal with notices that couldn’t be validated
-
-In the case that the backlog contains notices that couldn’t be
-normalised at some point and will want to try to reprocess those notices
-just run the *reprocess_unvalidated_notices_from_backlog* DAG following
-the instructions below.
-
-[arabic]
-. Enable the reprocess_unvalidated_notices_from_backlog DAG
-
-image:user_manual/media/image66.png[image,width=100%,height=41]
-
-[arabic, start=2]
-. Trigger DAG
-
-image:user_manual/media/image52.png[image,width=100%,height=52]
-
-===== UC5.d: Deal with notices that couldn't be published
-
-In the case that the backlog contains notices that couldn’t be
-normalised at some point and will want to try to reprocess those notices
-just run the *reprocess_unpackaged_notices_from_backlog* DAG following
-the instructions below.
-
-[arabic]
-. Enable the reprocess_unpackaged_notices_from_backlog DAG
-
-image:user_manual/media/image29.png[image,width=100%,height=36]
-
-[arabic, start=2]
-. Trigger DAG
-
-image:user_manual/media/image71.png[image,width=100%,height=49]
-
-===== UC5.e: Deal with notices that couldn't be published
-
-In the case that the backlog contains notices that couldn’t be
-normalised at some point and will want to try to reprocess those notices
-just run the *reprocess_unpublished_notices_from_backlog* DAG following
-the instructions below.
-
-[arabic]
-. Enable the reprocess_unpublished_notices_from_backlog DAG
-
-image:user_manual/media/image38.png[image,width=100%,height=38]
-
-[arabic, start=2]
-. Trigger DAG
-
-image:user_manual/media/image19.png[image,width=100%,height=57]
-
-=== Scheduled pipelines
-
-
-Scheduled pipelines are DAGs that are set to run periodically at fixed
-times, dates, or intervals. The DAG schedule can be read in the column
-“Schedule” and if any is set then the value is different from None.
-The scheduled execution is indicated as “cron expressions” [cire cron
-expressions manual]. A cron expression is a string comprising five or
-six fields separated by white space that represents a set of times,
-normally as a schedule to execute some routine. In our context examples
-of daily executions are provided below.
-
-image:user_manual/media/image34.png[image,width=83,height=365,float="right"]
-
-* None - DAG with no Schedule
-* 0 0 * * * - DAG that will run every day at 24:00 UTC
-* 0 6 * * * - DAG that will run every day at 06:00 UTC
-* 0 1 * * * - DAG that will run every day at 01:00 UTC
-
-
-{nbsp}
-
-{nbsp}
-
-{nbsp}
-
-{nbsp}
-
-{nbsp}
-
-=== Operational rules and recommendations
-
-
-Note: Every action that was not described in the previous chapters can
-lead to unpredictable situations.
-
-* Do not stop a DAG when it is in running state. Let it finish. In case
-you need to disable or stop a DAG, then make sure that in the column
-Recent Tasks no numbers in the light green circle are present. Figure
-below depicts one such example.
-image:user_manual/media/image72.png[image,width=601,height=164]
-
-* Do not run reprocess DAGs when notice_processing_pipeline is in running
-state. This will produce errors as the reprocessing DAGs are searching
-for notices in a specific status available in the database. When the
-notice_processing_pipeline is running the notices are transitioning
-between different statuses and that will make it possible to get the
-same notice to be processed twice in the same time, which will produce
-an error. Make sure that in the column Runs for
-notice_processing_pipeline you don’t have any numbers in a light green
-circle before running any reprocess DAGs.
-image:user_manual/media/image30.png[image,width=601,height=162]
-
-
-* Do not manually trigger notice_processing_pipeline as this DAG is
-triggered automatically by other DAGs. This will produce an error as
-this DAG needs to know what batch of notices it is processing (this is
-automatically done by the system). This DAG should only be enabled.
-image:user_manual/media/image18.png[image,width=602,height=29]
-
-* To start any notice processing and transformation make sure that you
-have mapping suites available in the database. You should have at least
-one successful run of the *load_mapping_suite_in_database* DAG and check
-Metabase to see what mapping suites are available.
-image:user_manual/media/image32.png[image,width=653,height=30]
-
-* Do not manually trigger scheduled DAGs unless you use a specific
-configuration and that DAG supports running with specific configuration.
-The scheduled dags should be only enabled.
-image:user_manual/media/image87.png[image,width=601,height=77]
-
-* It is not recommended to load mapping suites while
-notice_processing_pipeline is running. First make sure that there are no
-running tasks and then load other mapping suites.
-image:user_manual/media/image35.png[image,width=601,height=256] {nbsp}
-image:user_manual/media/image91.png[image,width=601,height=209]
-
-* It is recommended to start processing / transforming notices for a short
-period of time e.g fetch notices for a day, week, month but not year.
-The system can handle processing for a longer period but it will take
-time and you will not be able to load other mapping suites while
-processing is running.
-
-
-== Metabase
-
-This section describes how to work with Metabase, exploring user
-interface, accessing dashboards, creating questions, and adding new data
-sources. This description uses examples with real data and data sources
-that are used on TED-SWS project. For advanced documentation access
-link:
-
-https://www.metabase.com/docs/latest/[[.underline]#https://www.metabase.com/docs/latest/#]
-
-=== Main concepts in Metabase
-
-==== What is a question?
-
-In Metabase, a question is a query, its results, and its visualization.
-
-If you’re trying to figure something out about your data in Metabase,
-you’re probably either asking a question or viewing a question that
-someone else on your team created. In everyday usage, a question is
-pretty much synonymous with a query.
-
-==== What is a dashboard?
-
-A dashboard is a data visualization tool that holds important charts and
-text, collected and arranged on a single screen. Dashboards provide a
-high-level, centralized look at KPIs and other business metrics, and can
-cover everything from overall business health to the success of a
-specific project.
-
-The term comes from the automotive dashboard, which like its business
-intelligence counterpart provides status updates and warnings about
-important functions.
-
-==== What is a collection?
-
-In Metabase, a collection is a set of items like questions, dashboards
-and subcollections, that are stored together for some organizational
-purpose. You can think of collections like folders within a file system.
-The root collection in Metabase is called Our Analytics, and it holds
-every other collection that you and others at your organization create.
-
-You may keep a collection titled “Operations” that holds all of the
-questions, dashboards, and models that your organization’s ops team
-uses, so people in that department know where to find the items they
-need to do their jobs. And if there are specific items within a
-collection that your team uses most frequently, you can pin those to the
-top of the collection page for easy reference. Pinned questions in a
-collection will also render a preview of their visualization.
-
-==== What is a card?
-
-A card is a component of a dashboard that displays data or text.
-
-Metabase dashboards are made up of cards, with each card displaying some
-data (visualized as a table, chart, map, or number) or text (like
-headings, descriptive information, or relevant links).
-
-=== User interface
-
-After successful authorization, metabase redirects to main page that is
-composed of the following elements:
-
-image:user_manual/media/image22.png[image,width=633,height=294]
-
-[arabic]
-. Slidebar with collections
-. Settings, searching and adding new questions
-. Home page (Quick last accessed dashboards or questions)
-
-==== UC1 Manually updating the data
-
-As a user I want to manually update the data so I will see the
-questions/dashboards on the latest data.
-
-For *updating data*:
-
-[arabic]
-. Click Settings -> Admin settings -> Databases
-
-image:user_manual/media/image99.png[image,width=448,height=373]
-
-[arabic, start=2]
-. Go to Databases in the top menu
-
-image:user_manual/media/image15.png[image,width=601,height=142]
-
-[arabic, start=3]
-. To *update* the existing data source, click on the name of the necessary
-database and then click on both actions: “Sync database schema now” and
-“Re-scan field values now”. This will be done automatically but if you
-want to have the latest data (i.e the processing is still running) you
-could follow the steps below. However this is not considered a good
-practice.
-
-image:user_manual/media/image78.png[image,width=354,height=162]
-
-image:user_manual/media/image86.png[image,width=280,height=244]
-
-==== UC2: Use existing dashboards
-
-As a user I want to browse through and view dashboards so that I can
-answer business or operational questions about pipelines or notices.
-
-[arabic]
-. To access existing questions / dashboards, click:
-
-Sidebar button -> Necessary collection folder (ex: TED SWS KPI ->
-Pipeline KPI)
-
-image:user_manual/media/image68.png[image,width=189,height=242]
-
-[arabic, start=2]
-. To access the dashboard / question click on the element name in the main
-screen
-
-image:user_manual/media/image50.png[image,width=572,height=227]
-
-==== UC2: Customize a collection
-
-As a user I want to customize my collection preview so I can access
-quickly certain dashboards / questions and clean the unwanted content
-
-[arabic]
-. When opening a collection the main screen will be divided into to
-sections
-
-
-[loweralpha]
-. Pin section - where dashboards and questions can be pinned for easy
-access
-
-. List with dashboards and questions.
-
-
-image:user_manual/media/image46.png[image,width=601,height=341]
-
-[arabic, start=2]
-. Drag the dashboard or question elements from list (2) to
-section (1) to pin them. The element will be moved to the pin section,
-and will be displayed.
-
-. To *delete / move* a dashboard or question:
-
-[loweralpha]
-. Click on checkbox of the elements to be deleted;
-. Click archive or move (this can move the content to another collection)
-
-image:user_manual/media/image17.png[image,width=461,height=282]
-
-==== UC3: Create new question
-
-As a user I want to create a new question so I can explore the available
-data
-
-To *create* question:
-
-[arabic]
-. Click New
-(image:user_manual/media/image65.png[image,width=45,height=27]),
-then Question
-(image:user_manual/media/image83.png[image,width=71,height=22]).
-
-image:user_manual/media/image100.png[image,width=261,height=194]
-
-[arabic, start=2]
-. Select Data source (TEDSWS MongoDB - database name)
-
-image:user_manual/media/image7.png[image,width=353,height=210]
-
-[arabic, start=3]
-. Select Data collection (Notice Collection Materialized View
-
-image:user_manual/media/image28.png[image,width=266,height=307]
-
-*Note:* Always select “Notices Collection Materialised View” collection
-for questions. This collection was created specifically for metabase.
-Using other collections may increase response time of a question.
-
-[arabic, start=4]
-. Select necessary columns to display (ex: Notice status)
-
-image:user_manual/media/image95.png[image,width=397,height=365]
-
-
-[arabic, start=5]
-. (Optional) Select filter (ex: Form number is F03)
-
-image:user_manual/media/image40.png[image,width=275,height=304]
-
-image:user_manual/media/image70.png[image,width=353,height=214]
-
-[arabic, start=6]
-. (Optional) Select Summarize (ex: Count of rows)
-
-image:user_manual/media/image82.png[image,width=273,height=299]
-
-[arabic, start=7]
-. (Optional) Select a column to group by (ex: Notice Status)
-
-image:user_manual/media/image10.png[image,width=389,height=310]
-
-[arabic, start=8]
-. Click Visualize
-image:user_manual/media/image16.png[image,width=143,height=32]
-
-
-image:user_manual/media/image9.png[image,width=268,height=180]
-
-*Note:* This loading page means that questing is requesting an answer.
-Wait until it disappears.After the request is done, the page with
-response and editing a question will appear.
-
-
-[arabic, start=9]
-. Customizing the question
-
-
-Question page is divided into:
-
-* Edit question (name and logic)
-
-* Question visualisation (can be table or chart)
-
-* Visualisation settings (settings for table or chart)
-
-image:user_manual/media/image55.png[image,width=601,height=277]
-
-Tips on *editing* page:
-
-* To *export* the question:
-** Click on Download full results
-
-image:user_manual/media/image89.png[image,width=372,height=286]
-
-* To *edit question*:
-** Click on Show editor
-
-image:user_manual/media/image43.png[image,width=394,height=182]
-
-
-* To *change visualization type*
-** Click on visualization and then on Done once the type was chosen
-
-image:user_manual/media/image39.png[image,width=392,height=345]
-
-* To *edit visualization settings*
-
-** Click on Settings
-
-image:user_manual/media/image5.png[image,width=303,height=346]
-
-
-* To show values on dashboard: Click Show values on data points
-
-image:user_manual/media/image104.png[image,width=255,height=331]
-
-
-* To *save* question just Click Save button
-
-image:user_manual/media/image48.png[image,width=324,height=198]
-
-* Insert question name, description (optional) and collection to save into
-
-image:user_manual/media/image101.png[image,width=305,height=230]
-
-==== UC4: Create dashboard
-
-As a user I want to create a dashboard so I can group a set of questions
-that are of interest to me.
-
-To *create* dashboard:
-
-[arabic]
-. Click New -> Dashboard
-
-image:user_manual/media/image12.png[image,width=548,height=295]
-
-
-[arabic, start=2]
-. Insert Name, Description (optional) and collection where to save
-
-image:user_manual/media/image44.png[image,width=370,height=279]
-
-
-[loweralpha]
-. To select subfolder of the collection, click in arrow on collection
-field:
-
-image:user_manual/media/image13.png[image,width=395,height=199]
-
-[arabic, start=3]
-. Click Create
-
-. To *add* questions on dashboard:
-
-[loweralpha]
-. Click Add questions
-
-image:user_manual/media/image42.png[image,width=285,height=158]
-
-[loweralpha, start=2]
-. Click on the name of necessary question or drag & drop it
-
-image:user_manual/media/image57.png[image,width=307,height=392]
-
-In the dashboard you can add multiple questions, resize and move where
-it needs to be.
-[arabic, start=5]
-. To *save* dashboard:
-
-[loweralpha]
-
-. Click Save button in right top corner of the current screen
-
-image:user_manual/media/image53.png[image,width=171,height=96]
-
-==== UC5: Create user
-
-As a user I want to create another user so that I can share the work
-with others in my team
-
-[arabic]
-. Go to Admin settings by pressing the setting wheel button in the top
-right of the screen and then click Admin settings.
-
-image:user_manual/media/image64.png[image,width=544,height=180]
-
-
-[arabic, start=2]
-. On the next screen go to People in the top menu and click Invite someone
-button
-
-image:user_manual/media/image97.png[image,width=539,height=137]
-
-
-[arabic, start=3]
-. Complete the mandatory fields and put the user in the Administrator if
-you want that user to be an admin or in the All Users group
-
-image:user_manual/media/image73.png[image,width=601,height=345]
-
-[arabic, start=4]
-. Once you click on create a temporary password will be created for this
-user. Save this password and user details as you will need to share
-these with the new user. After this just click Done.
-
-image:user_manual/media/image20.png[image,width=601,height=362]
-
diff --git a/docs/antora/modules/ROOT/pages/user_manual/access-security.adoc b/docs/antora/modules/ROOT/pages/user_manual/access-security.adoc
new file mode 100644
index 000000000..00c59e995
--- /dev/null
+++ b/docs/antora/modules/ROOT/pages/user_manual/access-security.adoc
@@ -0,0 +1,36 @@
+= Security and access
+
+The security credentials are provided by the infrastructure team that
+installed the infrastructure for this project. Some credentials are set
+in the environment file used for the infrastructure installation, while
+others are created manually by the infrastructure team.
+
+Below is the list of credentials that should be available:
+
+[width="100%",cols="25%,36%,39%",options="header",]
+|===
+|Name |Description |Comment
+|Metabase user |Metabase user for login. This should be an email address
+|This user was manually created by the infrastructure team
+
+|Metabase password |The temporary password that was set by the infra
+team for the user above |This user was manually created by the
+infrastructure team
+
+|Airflow user |Airflow UI user for login |This is the value of the
+_AIRFLOW_WWW_USER_USERNAME variable from the env file
+
+|Airflow password |Airflow UI password for login |This is the value of the
+_AIRFLOW_WWW_USER_PASSWORD variable from the env file
+
+|Fuseki user |Fuseki user for login |The login should be for the admin user
+
+|Fuseki password |Fuseki password for login |This is the value of the
+ADMIN_PASSWORD variable from the env file
+
+|Mongo-express user |Mongo-express user for login |This is the value of the
+ME_CONFIG_BASICAUTH_USERNAME variable from the env file
+
+|Mongo-express password |Mongo-express password for login |This is the
+value of the ME_CONFIG_BASICAUTH_PASSWORD variable from the env file
+|===
\ No newline at end of file
diff --git a/docs/antora/modules/ROOT/pages/user_manual/getting_started_user_manual.adoc b/docs/antora/modules/ROOT/pages/user_manual/getting_started_user_manual.adoc
new file mode 100644
index 000000000..c8b62d41b
--- /dev/null
+++ b/docs/antora/modules/ROOT/pages/user_manual/getting_started_user_manual.adoc
@@ -0,0 +1,27 @@
+= Getting started with TED-SWS
+
+The purpose of this section is to explain how to monitor and control the TED-SWS system using the Airflow and Metabase interfaces. This page may be updated by the development team as the system evolves.
+
+== Intended audience
+
+This document is intended for persons involved in controlling and
+monitoring the services offered by the TED-SWS system.
+
+== Getting started
+To gain access to and control of the TED-SWS system, the user must be provided with access URLs and credentials by the infrastructure team. Please make sure that you know xref:user_manual/access-security.adoc[all the security credentials].
+
+== User Manual
+This user manual is divided into three parts. We advise getting familiar with them in the following order:
+
+* xref:user_manual/system-overview.adoc[system overview],
+* xref:user_manual/workflow-management-airflow.adoc[workflow management with Airflow], and
+* xref:user_manual/system-monitoring-metabase.adoc[system monitoring with Metabase].
+
+
+== Additional resources [[useful-resources]]
+
+link:https://airflow.apache.org/docs/apache-airflow/2.4.3/ui.html[Apache Airflow User Interface]
+
+link:https://www.metabase.com/learn/getting-started/tour-of-metabase[Tour of Metabase]
+
+link:https://www.metabase.com/docs/latest/exploration-and-organization/start[Metabase organisation and exploration]
diff --git a/docs/antora/modules/ROOT/pages/user_manual/system-monitoring-metabase.adoc b/docs/antora/modules/ROOT/pages/user_manual/system-monitoring-metabase.adoc
new file mode 100644
index 000000000..a47c5bf4d
--- /dev/null
+++ b/docs/antora/modules/ROOT/pages/user_manual/system-monitoring-metabase.adoc
@@ -0,0 +1,345 @@
+= System monitoring with Metabase
+
+This section describes how to work with Metabase: exploring the user
+interface, accessing dashboards, creating questions, and adding new data
+sources. This description uses examples with real data and data sources
+that are used in the TED-SWS project. For advanced documentation, please see the link:https://www.metabase.com/docs/latest/[Metabase user manual (latest)].
+
+== Main concepts in Metabase
+
+=== What is a question?
+
+In Metabase, a question is a query, its results, and its visualization.
+
+If you’re trying to figure something out about your data in Metabase,
+you’re probably either asking a question or viewing a question that
+someone else on your team created. In everyday usage, a question is
+pretty much synonymous with a query.
+
+=== What is a dashboard?
+
+A dashboard is a data visualization tool that holds important charts and
+text, collected and arranged on a single screen. Dashboards provide a
+high-level, centralized look at KPIs and other business metrics, and can
+cover everything from overall business health to the success of a
+specific project.
+
+The term comes from the automotive dashboard, which like its business
+intelligence counterpart provides status updates and warnings about
+important functions.
+
+=== What is a collection?
+
+In Metabase, a collection is a set of items like questions, dashboards
+and subcollections, that are stored together for some organizational
+purpose. You can think of collections like folders within a file system.
+The root collection in Metabase is called Our Analytics, and it holds
+every other collection that you and others at your organization create.
+
+You may keep a collection titled “Operations” that holds all of the
+questions, dashboards, and models that your organization’s ops team
+uses, so people in that department know where to find the items they
+need to do their jobs. And if there are specific items within a
+collection that your team uses most frequently, you can pin those to the
+top of the collection page for easy reference. Pinned questions in a
+collection will also render a preview of their visualization.
+
+=== What is a card?
+
+A card is a component of a dashboard that displays data or text.
+
+Metabase dashboards are made up of cards, with each card displaying some
+data (visualized as a table, chart, map, or number) or text (like
+headings, descriptive information, or relevant links).
+
+== Metabase user interface
+
+After successful authorization, Metabase redirects to the main page,
+which is composed of the following elements:
+
+image:user_manual/media/image22.png[image,width=633,height=294]
+
+[arabic]
+. Sidebar with collections
+. Settings, searching and adding new questions
+. Home page (quick access to the last accessed dashboards and questions)
+
+=== UC1: Manually updating the data
+
+As a user, I want to manually update the data so that questions and
+dashboards reflect the latest data.
+
+For *updating data*:
+
+[arabic]
+. Click Settings -> Admin settings -> Databases
+
+image:user_manual/media/image99.png[image,width=448,height=373]
+
+[arabic, start=2]
+. Go to Databases in the top menu
+
+image:user_manual/media/image15.png[image,width=601,height=142]
+
+[arabic, start=3]
+. To *update* an existing data source, click on the name of the relevant
+database and then click both actions: “Sync database schema now” and
+“Re-scan field values now”. These actions also run automatically on a
+schedule, so triggering them manually is only needed if you want the very
+latest data (i.e. while processing is still running); doing this often is
+not considered good practice.
+
+image:user_manual/media/image78.png[image,width=354,height=162]
+
+image:user_manual/media/image86.png[image,width=280,height=244]
+
+=== UC2: Use existing dashboards
+
+As a user I want to browse through and view dashboards so that I can
+answer business or operational questions about pipelines or notices.
+
+[arabic]
+. To access existing questions / dashboards, click:
+
+Sidebar button -> Necessary collection folder (ex: TED SWS KPI ->
+Pipeline KPI)
+
+image:user_manual/media/image68.png[image,width=189,height=242]
+
+[arabic, start=2]
+. To access a dashboard / question, click on the element name in the main
+screen
+
+image:user_manual/media/image50.png[image,width=572,height=227]
+
+=== UC3: Customize a collection
+
+As a user I want to customize my collection preview so that I can quickly
+access certain dashboards / questions and clean up the unwanted content.
+
+[arabic]
+. When opening a collection, the main screen is divided into two
+sections:
+
+
+[loweralpha]
+. Pin section - where dashboards and questions can be pinned for easy
+access
+
+. List with dashboards and questions.
+
+
+image:user_manual/media/image46.png[image,width=601,height=341]
+
+[arabic, start=2]
+. Drag a dashboard or question from the list (2) to the pin
+section (1) to pin it. The element will be moved to the pin section
+and displayed there.
+
+. To *delete / move* a dashboard or question:
+
+[loweralpha]
+. Click the checkbox of the elements to be deleted or moved;
+. Click Archive or Move (Move relocates the content to another collection)
+
+image:user_manual/media/image17.png[image,width=461,height=282]
+
+=== UC4: Create a new question
+
+As a user I want to create a new question so that I can explore the
+available data.
+
+To *create* a question, follow the steps below (a sketch of the
+equivalent query appears after the walkthrough):
+
+[arabic]
+. Click New
+(image:user_manual/media/image65.png[image,width=45,height=27]),
+then Question
+(image:user_manual/media/image83.png[image,width=71,height=22]).
+
+image:user_manual/media/image100.png[image,width=261,height=194]
+
+[arabic, start=2]
+. Select Data source (TEDSWS MongoDB - database name)
+
+image:user_manual/media/image7.png[image,width=353,height=210]
+
+[arabic, start=3]
+. Select Data collection (Notice Collection Materialized View)
+
+image:user_manual/media/image28.png[image,width=266,height=307]
+
+*Note:* Always select the “Notices Collection Materialised View” collection
+for questions. This collection was created specifically for Metabase.
+Using other collections may increase the response time of a question.
+
+[arabic, start=4]
+. Select necessary columns to display (ex: Notice status)
+
+image:user_manual/media/image95.png[image,width=397,height=365]
+
+
+[arabic, start=5]
+. (Optional) Select filter (ex: Form number is F03)
+
+image:user_manual/media/image40.png[image,width=275,height=304]
+
+image:user_manual/media/image70.png[image,width=353,height=214]
+
+[arabic, start=6]
+. (Optional) Select Summarize (ex: Count of rows)
+
+image:user_manual/media/image82.png[image,width=273,height=299]
+
+[arabic, start=7]
+. (Optional) Select a column to group by (ex: Notice Status)
+
+image:user_manual/media/image10.png[image,width=389,height=310]
+
+[arabic, start=8]
+. Click Visualize
+image:user_manual/media/image16.png[image,width=143,height=32]
+
+
+image:user_manual/media/image9.png[image,width=268,height=180]
+
+*Note:* This loading page means that the question is being computed.
+Wait until it disappears. After the request is done, the page for viewing
+the response and editing the question will appear.
+
+
+[arabic, start=9]
+. Customizing the question
+
+
+The question page is divided into:
+
+* Edit question (name and logic)
+
+* Question visualisation (can be table or chart)
+
+* Visualisation settings (settings for table or chart)
+
+image:user_manual/media/image55.png[image,width=601,height=277]
+
+Tips on the *editing* page:
+
+* To *export* the question:
+** Click on Download full results
+
+image:user_manual/media/image89.png[image,width=372,height=286]
+
+* To *edit question*:
+** Click on Show editor
+
+image:user_manual/media/image43.png[image,width=394,height=182]
+
+
+* To *change visualization type*
+** Click on visualization and then on Done once the type was chosen
+
+image:user_manual/media/image39.png[image,width=392,height=345]
+
+* To *edit visualization settings*
+
+** Click on Settings
+
+image:user_manual/media/image5.png[image,width=303,height=346]
+
+
+* To show values on dashboard: Click Show values on data points
+
+image:user_manual/media/image104.png[image,width=255,height=331]
+
+
+* To *save* question just Click Save button
+
+image:user_manual/media/image48.png[image,width=324,height=198]
+
+* Insert question name, description (optional) and collection to save into
+
+image:user_manual/media/image101.png[image,width=305,height=230]
+
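+Under the hood, the steps above amount to a filtered, grouped count over
+the notices collection. The following pymongo sketch shows a hypothetical
+equivalent of such a question; the connection string, database name and
+field names are assumptions for illustration, not the deployed values.
+
+[source,python]
+----
+from pymongo import MongoClient
+
+# Placeholder connection string, database and collection names (assumptions).
+client = MongoClient("mongodb://localhost:27017")
+collection = client["ted_analytics"]["notice_collection_materialised_view"]
+
+pipeline = [
+    {"$match": {"form_number": "F03"}},                    # filter (step 5)
+    {"$group": {"_id": "$status", "count": {"$sum": 1}}},  # group by status, count rows
+]
+for row in collection.aggregate(pipeline):
+    print(row["_id"], row["count"])
+----
+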
+=== UC5: Create a dashboard
+
+As a user I want to create a dashboard so I can group a set of questions
+that are of interest to me.
+
+To *create* dashboard:
+
+[arabic]
+. Click New -> Dashboard
+
+image:user_manual/media/image12.png[image,width=548,height=295]
+
+
+[arabic, start=2]
+. Insert Name, Description (optional) and collection where to save
+
+image:user_manual/media/image44.png[image,width=370,height=279]
+
+
+[loweralpha]
+. To select a subfolder of the collection, click the arrow on the
+collection field:
+
+image:user_manual/media/image13.png[image,width=395,height=199]
+
+[arabic, start=3]
+. Click Create
+
+. To *add* questions to the dashboard:
+
+[loweralpha]
+. Click Add questions
+
+image:user_manual/media/image42.png[image,width=285,height=158]
+
+[loweralpha, start=2]
+. Click on the name of the necessary question or drag & drop it
+
+image:user_manual/media/image57.png[image,width=307,height=392]
+
+You can add multiple questions to a dashboard, then resize and move them
+where they need to be.
+[arabic, start=5]
+. To *save* dashboard:
+
+[loweralpha]
+
+. Click Save button in right top corner of the current screen
+
+image:user_manual/media/image53.png[image,width=171,height=96]
+
+=== UC6: Create a user
+
+As a user I want to create another user so that I can share the work
+with others in my team.
+
+[arabic]
+. Go to Admin settings by pressing the settings wheel button in the top
+right of the screen and then clicking Admin settings.
+
+image:user_manual/media/image64.png[image,width=544,height=180]
+
+
+[arabic, start=2]
+. On the next screen, go to People in the top menu and click the Invite
+someone button.
+
+image:user_manual/media/image97.png[image,width=539,height=137]
+
+
+[arabic, start=3]
+. Complete the mandatory fields and put the user in the Administrators
+group if you want that user to be an admin, or in the All Users group otherwise.
+
+image:user_manual/media/image73.png[image,width=601,height=345]
+
+[arabic, start=4]
+. Once you click Create, a temporary password will be generated for this
+user. Save this password and the user details, as you will need to share
+them with the new user. After this, just click Done.
+
+image:user_manual/media/image20.png[image,width=601,height=362]
+
diff --git a/docs/antora/modules/ROOT/pages/user_manual/system-overview.adoc b/docs/antora/modules/ROOT/pages/user_manual/system-overview.adoc
new file mode 100644
index 000000000..d66ac577f
--- /dev/null
+++ b/docs/antora/modules/ROOT/pages/user_manual/system-overview.adoc
@@ -0,0 +1,155 @@
+= System overview
+
+This section provides a high-level overview of the TED-SWS system and
+its components. As presented in the image below, the system is built from
+a multitude of services and components grouped together to reach the
+end goal. The system can be divided into two main parts:
+
+* Controlling and monitoring
+* Core functionality (code base / TED SWS pipeline)
+
+Each part of the system is formed by a group of components.
+
+The controlling and monitoring part, operated by an operations manager,
+contains a workflow / pipeline management service (Airflow) and a data
+visualization service (Metabase). Using this group of services, a user
+should be able to control the execution of the existing pipelines and
+monitor the execution results.
+
+The core functionality comprises the services that implement the entire
+transformation of a public procurement notice (in XML format) available
+on the TED website into RDF format, and its publication in CELLAR. Here
+is a short description of some of the main services (a simplified sketch
+of how they chain together is given below):
+
+* fetching service - fetches the notice from the TED website
+* indexing service - extracts the unique XPATHs present in a notice XML
+* metadata normalisation service - extracts the notice metadata from the XML
+* transformation service - transforms the XML into RDF
+* entity resolution and deduplication service - resolves duplicated
+entities in the RDF
+* validation service - validates the RDF transformation
+* packaging service - creates the METS package
+* publishing service - sends the METS package to CELLAR
+
+image:user_manual/media/image59.png[image,width=100%,height=270]
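+
+The sketch below illustrates how these services chain together, with the
+output of each service feeding the next. The function names are
+placeholder stubs for illustration only, not the actual project code.
+
+[source,python]
+----
+# Placeholder stubs standing in for the real services (illustration only).
+def fetch(xml): return xml
+def index(notice): return notice
+def normalise_metadata(notice): return notice
+def transform(notice): return notice
+def deduplicate(rdf): return rdf
+def validate(rdf): return rdf
+def package(rdf): return rdf
+def publish(mets_package): return mets_package
+
+def process(notice_xml):
+    """Chain the core services: each one consumes the previous output."""
+    result = notice_xml
+    for service in (fetch, index, normalise_metadata, transform,
+                    deduplicate, validate, package, publish):
+        result = service(result)
+    return result
+----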
+
+== Pipelines structure (Airflow DAGs)
+
+This section provides a graphic representation of the flow and
+dependencies of the available pipelines (DAGs) in Airflow. The
+representation shows two actors, AirflowUser and AirflowScheduler:
+AirflowUser is the user who enables and triggers the DAGs, while
+AirflowScheduler is the Airflow component that starts the DAGs
+automatically following a schedule.
+
+The automatically triggered DAGs controlled by the Airflow scheduler are:
+
+* fetch_notices_by_date
+* daily_check_notices_availability_in_cellar
+* daily_materialized_views_update
+
+image:user_manual/media/image63.png[image,width=100%,height=382]
+
+The DAGs marked with _purple_ (load_mapping_suite_in_database), _yellow_
+(reprocess_unnormalised_notices_from_backlog,
+reprocess_unpackaged_notices_from_backlog,
+reprocess_unpublished_notices_from_backlog,
+reprocess_untransformed_notices_from_backlog,
+reprocess_unvalidated_notices_from_backlog) and _green_
+(fetch_notices_by_date, fetch_notices_by_date_range,
+fetch_notices_by_query) automatically trigger the
+*notice_processing_pipeline* DAG marked with _blue_, which takes care
+of all the processing steps for a notice. A user can run them by
+manually triggering these DAGs, with or without configuration.
+
+The DAGs marked with _green_ (fetch_notices_by_date,
+fetch_notices_by_date_range, fetch_notices_by_query) are in charge of
+fetching the notices from the TED API. The ones marked with _yellow_
+(reprocess_unnormalised_notices_from_backlog,
+reprocess_unpackaged_notices_from_backlog,
+reprocess_unpublished_notices_from_backlog,
+reprocess_untransformed_notices_from_backlog,
+reprocess_unvalidated_notices_from_backlog) handle the reprocessing
+of notices from the backlog. The purple-marked DAG
+(load_mapping_suite_in_database) handles the loading into the database
+of the mapping suites that are used to transform the notices.
+
+image:user_manual/media/image11.png[image,width=100%,height=660]
+
+== Notice statuses
+
+During the transformation process through the TED-SWS system, a notice
+starts with a certain status and transitions to other statuses when a
+particular step of the pipeline (notice_processing_pipeline) offered by
+the system completes successfully or unsuccessfully. This transition
+happens automatically and changes the _status_ property of the notice.
+The system uses the following statuses (also sketched as an enumeration
+after the list):
+* RAW
+* INDEXED
+* NORMALISED_METADATA
+* INELIGIBLE_FOR_TRANSFORMATION
+* ELIGIBLE_FOR_TRANSFORMATION
+* PREPROCESSED_FOR_TRANSFORMATION
+* TRANSFORMED
+* DISTILLED
+* VALIDATED
+* INELIGIBLE_FOR_PACKAGING
+* ELIGIBLE_FOR_PACKAGING
+* PACKAGED
+* INELIGIBLE_FOR_PUBLISHING
+* ELIGIBLE_FOR_PUBLISHING
+* PUBLISHED
+* PUBLICLY_UNAVAILABLE
+* PUBLICLY_AVAILABLE
+
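+For illustration, the same statuses can be read as a controlled
+vocabulary for the _status_ property. The Python enumeration below is a
+compact sketch of that vocabulary, not the project's actual data model.
+
+[source,python]
+----
+from enum import Enum, auto
+
+class NoticeStatus(Enum):
+    """Statuses a notice can hold in TED-SWS (illustrative sketch)."""
+    RAW = auto()
+    INDEXED = auto()
+    NORMALISED_METADATA = auto()
+    INELIGIBLE_FOR_TRANSFORMATION = auto()
+    ELIGIBLE_FOR_TRANSFORMATION = auto()
+    PREPROCESSED_FOR_TRANSFORMATION = auto()
+    TRANSFORMED = auto()
+    DISTILLED = auto()
+    VALIDATED = auto()
+    INELIGIBLE_FOR_PACKAGING = auto()
+    ELIGIBLE_FOR_PACKAGING = auto()
+    PACKAGED = auto()
+    INELIGIBLE_FOR_PUBLISHING = auto()
+    ELIGIBLE_FOR_PUBLISHING = auto()
+    PUBLISHED = auto()
+    PUBLICLY_UNAVAILABLE = auto()
+    PUBLICLY_AVAILABLE = auto()
+----
+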
+The transition from one status to another is decided by the system and
+can be viewed in the graphic representation below.
+
+image:user_manual/media/image14.png[image,width=100%,height=444]
+
+== Notice structure
+
+This section presents the anatomy of a Notice in the TED-SWS system and
+the dependence of structural elements on the phase of the transformation
+process. This helps the user understand what happens behind the scenes
+and what information is available in the database when building
+analytics dashboards.
+
+The structure of a notice within the TED-SWS system consists of the
+following structural elements (a minimal sketch follows the list):
+
+* Status
+* Metadata
+** Original Metadata
+** Normalised Metadata
+* Manifestation
+** XMLManifestation
+** RDFManifestation
+** METSManifestation
+* Validation Report
+** XPATH Coverage Validation
+** SHACL Validation
+** SPARQL Validation
+
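+A minimal sketch of this structure, assuming hypothetical class and field
+names (the real model in the code base may differ), could look as follows:
+
+[source,python]
+----
+from dataclasses import dataclass
+from typing import Optional
+
+@dataclass
+class ValidationReport:
+    xpath_coverage: Optional[dict] = None  # XPATH coverage validation
+    shacl: Optional[dict] = None           # SHACL validation
+    sparql: Optional[dict] = None          # SPARQL validation
+
+@dataclass
+class Notice:
+    status: str                                   # e.g. "RAW", "TRANSFORMED"
+    original_metadata: Optional[dict] = None
+    normalised_metadata: Optional[dict] = None
+    xml_manifestation: Optional[str] = None       # source XML
+    rdf_manifestation: Optional[str] = None       # present once transformed
+    mets_manifestation: Optional[bytes] = None    # present once packaged
+    validation_report: Optional[ValidationReport] = None
+----
+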
+The diagram below shows the high-level structure of the Notice object
+and illustrates that certain structural parts of a notice within the
+system depend on its state. This means that as the transformation process
+runs through its steps, the Notice state changes and new structural parts
+are added. For example, for a notice in the NORMALISED status we can
+access the Original Metadata, Normalised Metadata and XMLManifestation
+fields; for a notice in the TRANSFORMED status we can additionally access
+the RDFManifestation field, and similarly for the rest of the statuses.
+
+The diagram depicts states as swim-lanes while the structural elements
+are depicted as ArchiMate Business Objects [cite ArchiMate]. The
+relations we use are composition (arrow with diamond ending) and
+inheritance (arrow with full triangle ending).
+
+As mentioned above regarding the states through which a notice can
+transition, if a structural field is present at a certain state, then all
+the states originating from that state will also have this field. Not all
+possible states are depicted; for brevity, we chose only the most
+significant ones, which segment the transformation process into stages.
+
+image:user_manual/media/image94.png[image,width=100%,height=390]
diff --git a/docs/antora/modules/ROOT/pages/user_manual/workflow-management-airflow.adoc b/docs/antora/modules/ROOT/pages/user_manual/workflow-management-airflow.adoc
new file mode 100644
index 000000000..432edfbd7
--- /dev/null
+++ b/docs/antora/modules/ROOT/pages/user_manual/workflow-management-airflow.adoc
@@ -0,0 +1,686 @@
+= Workflow management with Airflow
+
+The management of the workflow is made available through the user
+interface of the Airflow system. This section describes the provided
+pipelines, and how to operate them in Airflow.
+
+== Airflow DAG control board
+
+In this section we explain the most important elements to pay attention
+to when operating the pipelines. +
+In software engineering, a pipeline consists of a chain of processing
+elements (processes, threads, coroutines, functions, etc.), arranged so
+that the output of each element is the input of the next. In our case,
+take the notice_processing_pipeline as an example: its chain of
+processes takes as input a notice from the TED website and produces as
+final output (if every process in the pipeline runs successfully) a METS
+package containing the notice transformed into RDF format. Between the
+processes, the input is always a batch of notices. Batch processing is a
+method of processing large amounts of data in a single, pre-defined
+process, typically used for tasks performed periodically, such as daily,
+weekly, or monthly. Each step of the pipeline can succeed or fail, so
+the pipeline can stop at any step if something goes wrong in one of its
+processes. In Airflow terminology, a pipeline is a DAG. Here are the
+processes that make up the notice_processing_pipeline DAG (a minimal
+sketch follows the list below):
+
+* notice normalisation
+* notice transformation
+* notice distillation
+* notice validation
+* notice packaging
+* notice publishing
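+
+For orientation only, the sketch below shows how such a chain of
+processes is expressed as an Airflow DAG. It is a simplified stand-in
+that assumes Airflow 2.4+ and uses empty placeholder tasks; the real
+notice_processing_pipeline is considerably more complex.
+
+[source,python]
+----
+# Simplified, illustrative sketch of a pipeline expressed as an Airflow DAG.
+# Placeholder tasks stand in for the real processing steps listed above.
+from datetime import datetime
+
+from airflow import DAG
+from airflow.operators.empty import EmptyOperator  # Airflow 2.4+ assumed
+
+with DAG(
+    dag_id="notice_processing_pipeline_sketch",
+    start_date=datetime(2022, 9, 1),
+    schedule=None,   # this pipeline is triggered by other DAGs, not by a schedule
+    catchup=False,
+) as dag:
+    steps = [
+        EmptyOperator(task_id=name)
+        for name in (
+            "notice_normalisation",
+            "notice_transformation",
+            "notice_distillation",
+            "notice_validation",
+            "notice_packaging",
+            "notice_publishing",
+        )
+    ]
+    # Chain the steps so that the output of each one is the input of the next.
+    for upstream, downstream in zip(steps, steps[1:]):
+        upstream >> downstream
+----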
+
+=== Enable / disable switch
+
+In Airflow every DAG can be enabled or disabled. Disabling a DAG stops
+it from running, even if it is scheduled.
+
+The switch button is blue when a DAG is enabled and grey when it is
+disabled.
+
+To enable or disable a DAG, use the following switch button:
+
+image:user_manual/media/image21.png[image,width=100%,height=32]
+
+image:user_manual/media/image69.png[image,width=56,height=55]
+disabled position
+
+image:user_manual/media/image3.png[image,width=52,height=56]
+enabled position
+
+=== DAG Runs
+
+A DAG Run is an object representing an instantiation of the DAG in time.
+Any time the DAG is executed, a DAG Run is created and all tasks inside
+it are executed. The status of the DAG Run depends on the states of its
+tasks. DAG Runs are independent of one another, meaning that many runs
+of the same DAG can exist at the same time.
+
+*DAG Run status*
+
+A DAG Run status is determined when the execution of the DAG is
+finished. The execution of the DAG depends on its containing tasks and
+their dependencies. The status is assigned to the DAG Run when all of
+the tasks are in one of the terminal states (i.e. if there is no
+possible transition to another state) like success, failed or skipped.
+
+There are two possible terminal states for the DAG Run (a small sketch
+of this rule follows the list):
+
+* success if all the pipeline processes are either success or skipped,
+* failed if any of the pipeline processes is either failed or
+upstream_failed.
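+
+The rule above can be summarised in a few lines of code. This is only an
+illustrative sketch of the logic, not the actual Airflow implementation.
+
+[source,python]
+----
+# Illustrative sketch of the terminal-state rule described above
+# (not the actual Airflow implementation).
+def dag_run_state(task_states):
+    if any(state in ("failed", "upstream_failed") for state in task_states):
+        return "failed"
+    if all(state in ("success", "skipped") for state in task_states):
+        return "success"
+    return "running"  # some tasks are still queued or executing
+
+print(dag_run_state(["success", "skipped", "success"]))  # success
+print(dag_run_state(["success", "upstream_failed"]))     # failed
+----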
+
+In the runs column in the Airflow user interface we can see the state of
+the DAG run, and this can be one of the following:
+
+* queued
+* success
+* running
+* failed
+
+
+Here is an example of these different states:
+
+image:user_manual/media/image54.png[image,width=422,height=315]
+
+A DAG run starts in the queued state, then moves to running, and finally
+ends in either success or failed.
+
+Clicking on the numbers associated with a particular DAG run state will
+show you a list of the DAG runs in that state.
+
+=== DAG actions
+
+In the Airflow user interface we have a run button in the Actions column
+that will allow you to trigger a specific DAG with or without specific
+configuration. When clicking on the run button a list of options will
+appear:
+
+* Trigger DAG (triggering DAG without config)
+* Trigger DAG w/ config (triggering DAG with config)
+
+
+image:user_manual/media/image24.png[image,width=378,height=165]
+
+=== DAG Run overview
+
+In the Airflow user interface, when clicking on the DAG name, an
+overview of the runs for that DAG becomes available. This includes a
+diagram of the processes that are part of the pipeline, task durations,
+the code of the DAG, etc. To learn more about the Airflow interface,
+please refer to the Airflow user manual
+(link:#useful-resources[[.underline]#Useful Resources#]).
+
+image:user_manual/media/image74.png[image,width=601,height=281]
+
+
+
+== Available pipelines
+
+In this section we provide a brief inventory of the available pipelines,
+including their names, a short description and a high-level diagram.
+
+[arabic]
+
+. *notice_processing_pipeline* - this DAG processes a batch of notices
+through the following stages: normalisation, transformation, validation,
+packaging and publishing. It is scheduled and started automatically by
+other DAGs.
+
+
+image:user_manual/media/image31.png[image,width=100%,height=176]
+
+image:user_manual/media/image25.png[image,width=100%,height=162]
+
+
+[arabic, start=2]
+
+. *load_mapping_suite_in_database* - this DAG loads one mapping suite,
+or all mapping suites, from a branch on GitHub. The test data bundled
+with a mapping suite can also be loaded; if it is, the
+notice_processing_pipeline DAG will be triggered. (A sketch of
+triggering this DAG with a configuration follows its diagram below.)
+
+
+
+*Config DAG params:*
+
+
+* mapping_suite_package_name: string
+* load_test_data: boolean
+* branch_or_tag_name: string
+* github_repository_url: string
+
+*Default values:*
+
+* mapping_suite_package_name = None (it will take all available mapping
+suites on that branch or tag)
+* load_test_data = false
+* branch_or_tag_name = "main"
+* github_repository_url = "https://github.com/OP-TED/ted-rdf-mapping.git"
+
+
+image:user_manual/media/image96.png[image,width=100%,height=56]
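+
+For completeness, the same DAG can also be triggered with a
+configuration outside the user interface. The sketch below uses the
+Airflow stable REST API; it assumes that the API and a basic-auth
+backend are enabled, and the base URL and credentials are placeholders.
+
+[source,python]
+----
+# Illustrative sketch: triggering load_mapping_suite_in_database with a config payload
+# through the Airflow stable REST API. Base URL and credentials are placeholders.
+import requests
+
+AIRFLOW_URL = "http://localhost:8080"  # placeholder: your Airflow webserver
+
+response = requests.post(
+    f"{AIRFLOW_URL}/api/v1/dags/load_mapping_suite_in_database/dagRuns",
+    auth=("airflow_user", "airflow_password"),  # placeholder credentials
+    json={"conf": {"mapping_suite_package_name": "package_F03",
+                   "load_test_data": True}},
+)
+response.raise_for_status()
+print(response.json()["dag_run_id"])
+----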
+
+[arabic, start=3]
+. *fetch_notices_by_query -* this DAG fetches notices from TED by using a
+query and, depending on an additional parameter, triggers the
+notice_processing_pipeline DAG in full or partial mode (execution of
+only one step).
+
+*Config DAG params:*
+
+* query : string
+* trigger_complete_workflow : boolean
+
+*Default values:*
+
+* trigger_complete_workflow = true
+
+image:user_manual/media/image56.png[image,width=100%,height=92]
+
+[arabic, start=4]
+. *fetch_notices_by_date -* this DAG fetches notices from TED for a day
+and, depending on an additional parameter, triggers the
+notice_processing_pipeline DAG in full or partial mode (execution of
+only one step).
+
+*Config DAG params:*
+
+* wild_card : string with date format %Y%m%d*
+* trigger_complete_workflow : boolean
+
+*Default values:*
+
+* trigger_complete_workflow = true
+
+image:user_manual/media/image33.png[image,width=100%,height=100]
+
+[arabic, start=5]
+. *fetch_notices_by_date_range -* this DAG receives a date range and
+triggers the fetch_notices_by_date DAG for each day in the date range.
+
+*Config DAG params:*
+
+
+* start_date : string with date format %Y%m%d
+* end_date : string with date format %Y%m%d
+
+image:user_manual/media/image75.png[image,width=601,height=128]
+
+[arabic, start=6]
+. *reprocess_unnormalised_notices_from_backlog -* this DAG selects all
+notices that are in RAW state and need to be processed and triggers the
+notice_processing_pipeline DAG to process them.
+
+*Config DAG params:*
+
+* start_date : string with date format %Y-%m-%d
+* end_date : string with date format %Y-%m-%d
+
+*Default values:*
+
+* start_date = None, because this param is optional
+* end_date = None, because this param is optional
+
+image:user_manual/media/image60.png[image,width=601,height=78]
+
+[arabic, start=7]
+. *reprocess_unpackaged_notices_from_backlog -* this DAG selects all
+notices to be repackaged and triggers the notice_processing_pipeline DAG
+to repackage them.
+
+*Config DAG params:*
+
+* start_date : string with date format %Y-%m-%d
+* end_date : string with date format %Y-%m-%d
+* form_number : string
+* xsd_version : string
+
+*Default values:*
+
+* start_date = None, because this param is optional
+* end_date = None, because this param is optional
+* form_number = None, because this param is optional
+* xsd_version = None, because this param is optional
+
+image:user_manual/media/image81.png[image,width=100%,height=73]
+
+[arabic, start=8]
+. *reprocess_unpublished_notices_from_backlog -* this DAG selects all
+notices to be republished and triggers the notice_processing_pipeline
+DAG to republish them.
+
+*Config DAG params:*
+
+
+* start_date : string with date format %Y-%m-%d
+* end_date : string with date format %Y-%m-%d
+* form_number : string
+* xsd_version : string
+
+*Default values:*
+
+
+* start_date = None, because this param is optional
+* end_date = None, because this param is optional
+* form_number = None, because this param is optional
+* xsd_version = None, because this param is optional
+
+image:user_manual/media/image37.png[image,width=100%,height=70]
+
+[arabic, start=9]
+. *reprocess_untransformed_notices_from_backlog -* this DAG selects all
+notices to be retransformed and triggers the notice_processing_pipeline
+DAG to retransform them.
+
+*Config DAG params:*
+
+
+* start_date : string with date format %Y-%m-%d
+* end_date : string with date format %Y-%m-%d
+* form_number : string
+* xsd_version : string
+
+*Default values:*
+
+* start_date = None, because this param is optional
+* end_date = None, because this param is optional
+* form_number = None, because this param is optional
+* xsd_version = None, because this param is optional
+
+
+image:user_manual/media/image102.png[image,width=100%,height=69]
+
+[arabic, start=10]
+. *reprocess_unvalidated_notices_from_backlog -* this DAG selects all
+notices to be revalidated and triggers the notice_processing_pipeline
+DAG to revalidate them.
+
+*Config DAG params:*
+
+* start_date : string with date format %Y-%m-%d
+* end_date : string with date format %Y-%m-%d
+* form_number : string
+* xsd_version : string
+
+*Default values:*
+
+
+* start_date = None, because this param is optional
+* end_date = None, because this param is optional
+* form_number = None, because this param is optional
+* xsd_version = None, because this param is optional
+
+image:user_manual/media/image102.png[image,width=100%,height=69]
+
+[arabic, start=11]
+. *daily_materialized_views_update -* this DAG updates the materialised
+views in the database on a daily schedule.
+
+*This DAG has no config or default params.*
+
+image:user_manual/media/image98.png[image,width=100%,height=90]
+
+[arabic, start=12]
+. *daily_check_notices_availability_in_cellar -* this DAG checks, on a
+daily schedule, whether published notices are available in Cellar and
+updates their statuses accordingly.
+
+*This DAG has no config or default params.*
+
+
+image:user_manual/media/image67.png[image,width=339,height=81]
+
+== Batch processing
+
+== Running pipelines (How to)
+
+This chapter explains the basic use of the TED-SWS Airflow pipelines in
+a question-and-answer format. Basic functionality is used by running
+DAGs, a core concept of Airflow. For advanced documentation see:
+
+https://airflow.apache.org/docs/apache-airflow/stable/concepts/dags.html[[.underline]#https://airflow.apache.org/docs/apache-airflow/stable/concepts/DAGs.html#]
+
+=== UC1: How to load a mapping suite or mapping suites?
+
+As a user I want to load one or several mapping suites into the system
+so that notices can be transformed and validated with them.
+
+=== UC1.a To load all mapping suites
+
+[arabic]
+. Run *load_mapping_suite_in_database* DAG:
+[loweralpha]
+.. Enable DAG
+.. Click Run on Actions column (Play symbol button)
+.. Click Trigger DAG
+
+
+image:user_manual/media/image84.png[image,width=100%,height=61]
+
+=== UC1.b To load specific mapping suite
+
+[arabic]
+. Run *load_mapping_suite_in_database* DAG with configurations:
+[loweralpha]
+.. Enable DAG
+.. Click Run on Actions column (Play symbol button)
+.. Click Trigger DAG w/ config.
+
+image:user_manual/media/image36.png[image,width=100%,height=55]
+
+[arabic, start=2]
+. In the next screen
+
+[loweralpha]
+. In the configuration JSON text box insert the config:
+
+[source,python]
+{"mapping_suite_package_name": "package_F03"}
+
+[loweralpha, start=2]
+. Click Trigger button after inserting the configuration
+
+image:user_manual/media/image27.png[image,width=100%,height=331]
+
+[arabic, start=3]
+. Optional: if you want to transform the test notices that were used for
+the development of the mapping suite, you can add the *load_test_data*
+parameter with the value *true* to the configuration.
+
+image:user_manual/media/image103.png[image,width=100%,height=459]
+
+=== UC2: How to fetch and process notices for a day?
+
+As a user I want to fetch and process notices from a selected day so
+that they get published in Cellar and become available to the public in
+RDF format.
+
+UC2.a To fetch and transform notices for a day:
+
+[arabic]
+. Enable *notice_processing_pipeline* DAG
+. Run *fetch_notices_by_date* DAG with configurations:
+[loweralpha]
+.. Enable DAG
+.. Click Run on Actions column
+.. Click Trigger DAG w/ config
+
+image:user_manual/media/image26.png[image,width=100%,height=217]
+
+[arabic, start=3]
+. In the next screen
+
+[loweralpha]
+. In the configuration JSON text box insert the config:
+[source,python]
+{"wild_card ": "20220921*"}*
+
+The value *20220921** is the date of the day to fetch and transform, in
+the format yyyymmdd followed by a trailing wildcard (see the sketch at
+the end of this use case).
+
+
+[loweralpha, start=2]
+. Click Trigger button after inserting the configuration
+
+image:user_manual/media/image1.png[image,width=100%,height=310]
+
+[arabic, start=4]
+. Optional: It is possible to only fetch notices without transformation.
+To do so, add the *trigger_complete_workflow* configuration parameter
+and set its value to *false*. +
+[source,python]
+{"wild_card": "20220921*", "trigger_complete_workflow": false}
+
+image:user_manual/media/image4.png[image,width=100%,height=358]
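+
+As a small aid, the sketch below shows how the wild_card value can be
+built from a date in Python, following the yyyymmdd* format described
+above. The helper name is purely illustrative.
+
+[source,python]
+----
+# Illustrative helper: build the wild_card value (yyyymmdd followed by "*")
+# used when triggering the fetch_notices_by_date DAG.
+from datetime import date
+
+def build_wild_card(day):
+    return day.strftime("%Y%m%d") + "*"
+
+print(build_wild_card(date(2022, 9, 21)))  # 20220921*
+----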
+
+
+=== UC3: How to fetch and process notices for date range?
+
+As a user I want to fetch and process notices published within a date
+range so that they are published in Cellar and available to the public
+in RDF format.
+
+UC3.a To fetch for multiple days:
+
+[arabic]
+. Enable *notice_processing_pipeline* DAG
+. Run *fetch_notices_by_date_range* DAG with configurations:
+[loweralpha]
+.. Enable DAG
+.. Click Run on Actions column
+.. Click Trigger DAG w/ config.
+
+image:user_manual/media/image79.png[image,width=100%,height=205]
+
+[arabic, start=3]
+. In the next screen, in the configuration JSON text box insert the
+config:
+[source,python]
+{ "start_date": "20220920", "end_date": "20220920" }
+
+20220920 is the start date and 20220920 is the end date of the days to
+be fetched and transformed with format: yyyymmdd.
+
+[arabic, start=4]
+. Click Trigger button after inserting the configuration
+
+image:user_manual/media/image51.png[image,width=100%,height=331]
+
+=== UC4: How to fetch and process notices using a query?
+
+As a user I want to fetch and process notices published by specific
+filters that are available from the TED API so that they are published
+in Cellar and available to the public in RDF format.
+
+To fetch and transform notices by using a query follow the instructions
+below:
+
+[arabic]
+. Enable *notice_processing_pipeline* DAG
+. Run *fetch_notices_by_query* DAG with configurations:
+.. Enable DAG
+.. Click Run on Actions column
+.. Click Trigger DAG w/ config.
+
+image:user_manual/media/image61.png[image,width=100%,height=200]
+[arabic, start=3]
+. In the next screen
+
+[loweralpha]
+. In the configuration JSON text box insert the config:
+
+[source,python]
+{"query": "ND=[163-2021]"}
+
+
+ND=[163-2021] is the query that will be run against the TED API to
+retrieve the notices matching it.
+
+[loweralpha, start=2]
+. Click Trigger button after inserting the configuration
+
+image:user_manual/media/image93.png[image,width=100%,height=378]
+
+[arabic, start=4]
+. Optional: If you need to only fetch notices without
+transformation, add *trigger_complete_workflow* configuration as *false*
+
+image:user_manual/media/image49.png[image,width=100%,height=357]
+
+=== UC5: How to deal with notices that are in the backlog and what to run?
+
+As a user I want to reprocess notices that are in the backlog so that
+they are published in Cellar and available to the public in RDF format.
+
+Notices that fail to complete a successful notice_processing_pipeline
+run are added to the backlog, identified by the statuses assigned to
+them. The status of a notice is determined automatically by the system.
+The backlog can contain multiple notices in different statuses.
+
+The backlog is divided into five categories as follows:
+
+* notices that couldn’t be normalised
+* notices that couldn’t be transformed
+* notices that couldn’t be validated
+* notices that couldn’t be packaged
+* notices that couldn’t be published
+
+==== UC5.a Deal with notices that couldn't be normalised
+
+If the backlog contains notices that could not be normalised and you
+want to reprocess them, run the
+*reprocess_unnormalised_notices_from_backlog* DAG following the
+instructions below.
+
+[arabic]
+. Enable the reprocess_unnormalised_notices_from_backlog DAG
+
+image:user_manual/media/image92.png[image,width=100%,height=44]
+
+[arabic, start=2]
+. Trigger DAG
+
+image:user_manual/media/image76.png[image,width=100%,height=54]
+
+==== UC5.b: Deal with notices that couldn't be transformed
+
+If the backlog contains notices that could not be transformed and you
+want to reprocess them, run the
+*reprocess_untransformed_notices_from_backlog* DAG following the
+instructions below.
+
+[arabic]
+. Enable the reprocess_untransformed_notices_from_backlog DAG
+
+image:user_manual/media/image85.png[image,width=100%,height=36]
+
+[arabic, start=2]
+. Trigger DAG
+
+image:user_manual/media/image77.png[image,width=100%,height=54]
+
+==== UC5.c: Deal with notices that couldn’t be validated
+
+If the backlog contains notices that could not be validated and you
+want to reprocess them, run the
+*reprocess_unvalidated_notices_from_backlog* DAG following the
+instructions below.
+
+[arabic]
+. Enable the reprocess_unvalidated_notices_from_backlog DAG
+
+image:user_manual/media/image66.png[image,width=100%,height=41]
+
+[arabic, start=2]
+. Trigger DAG
+
+image:user_manual/media/image52.png[image,width=100%,height=52]
+
+==== UC5.d: Deal with notices that couldn't be packaged
+
+If the backlog contains notices that could not be packaged and you want
+to reprocess them, run the *reprocess_unpackaged_notices_from_backlog*
+DAG following the instructions below.
+
+[arabic]
+. Enable the reprocess_unpackaged_notices_from_backlog DAG
+
+image:user_manual/media/image29.png[image,width=100%,height=36]
+
+[arabic, start=2]
+. Trigger DAG
+
+image:user_manual/media/image71.png[image,width=100%,height=49]
+
+==== UC5.e: Deal with notices that couldn't be published
+
+If the backlog contains notices that could not be published and you
+want to reprocess them, run the
+*reprocess_unpublished_notices_from_backlog* DAG following the
+instructions below.
+
+[arabic]
+. Enable the reprocess_unpublished_notices_from_backlog DAG
+
+image:user_manual/media/image38.png[image,width=100%,height=38]
+
+[arabic, start=2]
+. Trigger DAG
+
+image:user_manual/media/image19.png[image,width=100%,height=57]
+
+== Scheduled pipelines
+
+Scheduled pipelines are DAGs that are set to run periodically at fixed
+times, dates, or intervals. The schedule of a DAG can be read in the
+“Schedule” column; if a schedule is set, the value is different from
+None. The schedule is expressed as a “cron expression” [cite cron
+expressions manual]. A cron expression is a string of five or six fields
+separated by white space that represents a set of times, normally used
+as a schedule for executing some routine. Examples of the daily
+schedules used in our context are provided below (see the sketch after
+the list).
+
+image:user_manual/media/image34.png[image,width=83,height=365]
+
+* None - DAG with no Schedule
+* 0 0 * * * - DAG that will run every day at 00:00 UTC (midnight)
+* 0 6 * * * - DAG that will run every day at 06:00 UTC
+* 0 1 * * * - DAG that will run every day at 01:00 UTC
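+
+To check when a given cron expression will fire next, the sketch below
+uses the third-party croniter package. This is only an illustration; the
+package is an assumption and is not required to operate the system.
+
+[source,python]
+----
+# Illustrative sketch: compute the next run time of the daily schedules listed above.
+# Assumes the third-party "croniter" package is installed (pip install croniter).
+from datetime import datetime
+
+from croniter import croniter
+
+now = datetime.utcnow()
+for expression in ("0 0 * * *", "0 6 * * *", "0 1 * * *"):
+    next_run = croniter(expression, now).get_next(datetime)
+    print(expression, "->", next_run.isoformat())
+----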
+
+== Operational rules and recommendations
+
+Note: any action not described in the previous chapters can lead to
+unpredictable situations.
+
+* Do not stop a DAG when it is in running state. Let it finish. In case
+you need to disable or stop a DAG, make sure that the Recent Tasks
+column shows no numbers in a light green circle. The figure below
+depicts one such example.
+image:user_manual/media/image72.png[image,width=601,height=164]
+
+* Do not run reprocess DAGs when notice_processing_pipeline is in running
+state. This will produce errors, because the reprocessing DAGs search
+for notices in a specific status in the database. While the
+notice_processing_pipeline is running, notices are transitioning between
+statuses, so the same notice could end up being processed twice at the
+same time, which will produce an error. Make sure that the Runs column
+for notice_processing_pipeline shows no numbers in a light green circle
+before running any reprocess DAGs.
+image:user_manual/media/image30.png[image,width=601,height=162]
+
+
+* Do not manually trigger notice_processing_pipeline as this DAG is
+triggered automatically by other DAGs. This will produce an error as
+this DAG needs to know what batch of notices it is processing (this is
+automatically done by the system). This DAG should only be enabled.
+image:user_manual/media/image18.png[image,width=602,height=29]
+
+* To start any notice processing and transformation make sure that you
+have mapping suites available in the database. You should have at least
+one successful run of the *load_mapping_suite_in_database* DAG and check
+Metabase to see what mapping suites are available.
+image:user_manual/media/image32.png[image,width=653,height=30]
+
+* Do not manually trigger scheduled DAGs unless you use a specific
+configuration and the DAG supports running with one. Scheduled DAGs
+should only be enabled.
+image:user_manual/media/image87.png[image,width=601,height=77]
+
+* It is not recommended to load mapping suites while
+notice_processing_pipeline is running. First make sure that there are no
+running tasks and then load other mapping suites.
+image:user_manual/media/image35.png[image,width=601,height=256] {nbsp}
+image:user_manual/media/image91.png[image,width=601,height=209]
+
+* It is recommended to process / transform notices for a short period of
+time, e.g. fetch notices for a day, a week or a month, but not a year.
+The system can handle processing for a longer period, but it will take
+time and you will not be able to load other mapping suites while the
+processing is running.