diff --git a/docs/antora/modules/ROOT/attachments/FATs/2023-02-20-TED-SWS-FAT-complete.html b/docs/antora/modules/ROOT/attachments/FATs/2023-02-20-TED-SWS-FAT-complete.html new file mode 100644 index 000000000..6a14a8898 --- /dev/null +++ b/docs/antora/modules/ROOT/attachments/FATs/2023-02-20-TED-SWS-FAT-complete.html @@ -0,0 +1,17158 @@ + + + + + +Allure Report + + + + + +
+
+ + + +
+ + + + + + +
diff --git a/docs/antora/modules/ROOT/attachments/aws-infra-docs/TED-SWS Installation manual v2.0.2.pdf b/docs/antora/modules/ROOT/attachments/aws-infra-docs/TED-SWS-Installation-manual-v2.0.2.pdf
similarity index 100%
rename from docs/antora/modules/ROOT/attachments/aws-infra-docs/TED-SWS Installation manual v2.0.2.pdf
rename to docs/antora/modules/ROOT/attachments/aws-infra-docs/TED-SWS-Installation-manual-v2.0.2.pdf
diff --git a/docs/antora/modules/ROOT/attachments/aws-infra-docs/TED-SWS-Installation-manual-v2.5.0.pdf b/docs/antora/modules/ROOT/attachments/aws-infra-docs/TED-SWS-Installation-manual-v2.5.0.pdf
new file mode 100644
index 000000000..2630eab63
Binary files /dev/null and b/docs/antora/modules/ROOT/attachments/aws-infra-docs/TED-SWS-Installation-manual-v2.5.0.pdf differ
diff --git a/docs/antora/modules/ROOT/nav.adoc b/docs/antora/modules/ROOT/nav.adoc
index 50f58783c..2b65a493f 100644
--- a/docs/antora/modules/ROOT/nav.adoc
+++ b/docs/antora/modules/ROOT/nav.adoc
@@ -1,8 +1,30 @@
 * xref:index.adoc[Home]
-* link:{attachmentsdir}/ted-sws-architecture/index.html[Preliminary Project Architecture^]
-* xref:mapping_suite_cli_toolchain.adoc[Mapping Suite CLI Toolchain]
-* xref:demo_installation.adoc[Instructions for Software Engineers]
-* xref:user_manual.adoc[User manual]
-* xref:system_arhitecture.adoc[System architecture overview]
-* xref:using_procurement_data.adoc[Using procurement data]
+
+* [.separated]#**General References**#
+** xref:ted-sws-introduction.adoc[About TED-SWS]
+** xref:glossary.adoc[Glossary]
+
+* [.separated]#**For TED-SWS Operators**#
+** xref:user_manual/getting_started_user_manual.adoc[Getting started]
+** xref:user_manual/system-overview.adoc[System overview]
+** xref:user_manual/access-security.adoc[Security and access]
+** xref:user_manual/workflow-management-airflow.adoc[Workflow management with Airflow]
+** xref:user_manual/system-monitoring-metabase.adoc[System monitoring with Metabase]
+
+* [.separated]#**For DevOps**#
+** link:{attachmentsdir}/aws-infra-docs/TED-SWS-Installation-manual-v2.5.0.pdf[AWS installation manual (v2.5.0)^]
+** link:{attachmentsdir}/aws-infra-docs/TED-SWS-AWS-Infrastructure-architecture-overview-v0.9.pdf[AWS infrastructure architecture (v0.9)^]
+
+* [.separated]#**For End User Developers**#
+** xref:ted_data/using_procurement_data.adoc[Accessing data in Cellar]
+** link:https://docs.ted.europa.eu/EPO/latest/index.html[eProcurement ontology (latest)^]
+
+* [.separated]#**For TED-SWS Developers**#
+** xref:technical/mapping_suite_cli_toolchain.adoc[Mapping suite toolchain]
+** xref:technical/demo_installation.adoc[Development installation instructions]
+** xref:technical/event_manager.adoc[Event manager description]
+** xref:architecture/arhitecture_overview.adoc[System architecture overview]
+** link:{attachmentsdir}/ted-sws-architecture/index.html[Enterprise architecture model^]
+** xref:architecture/arhitecture_choices.adoc[Architectural choices]
diff --git a/docs/antora/modules/ROOT/pages/architecture/arhitecture_choices.adoc b/docs/antora/modules/ROOT/pages/architecture/arhitecture_choices.adoc
new file mode 100644
index 000000000..b1ee7031a
--- /dev/null
+++ b/docs/antora/modules/ROOT/pages/architecture/arhitecture_choices.adoc
@@ -0,0 +1,351 @@
+== Architectural choices
+
+This section motivates the main architectural choices made in the
+TED-SWS system:
+
+* Why is this a service-oriented architecture (SOA), and why not REST
+microservices?
+* Why a NoSQL data model vs. a SQL data model?
+* Why an ETL/ELT approach vs.
Event Sourcing +* Why Batch processing vs. Event Streams. +* Why Airflow ? +* Why Metabase? +* Why quick deduplication process? And what are the plans for the +future? + +=== Why is this SOA (Service-oriented architecture) architecture? + +ETL (Extract, Transform, Load) architecture is considered +state-of-the-art for batch processing tasks using Airflow as pipeline +management for several reasons: + +[arabic] +. *Flexibility*: ETL architecture allows for flexibility in the data +pipeline as it separates the data extraction, transformation, and +loading processes. This allows for easy modification and maintenance of +each individual step without affecting the entire pipeline. +. *Scalability*: ETL architecture allows for the easy scaling of data +processing tasks, as new data sources can be added or removed without +impacting the entire pipeline. +. *Error Handling*: ETL architecture allows for easy error handling as +each step of the pipeline can be monitored and errors can be isolated to +a specific step. +. *Reusability:* ETL architecture allows for the reuse of existing data +pipelines, as new data sources can be added without modifying existing +pipelines. +. *System management*: Airflow is an open-source workflow management +system that allows for easy scheduling, monitoring, and management of +data pipelines. It integrates seamlessly with ETL architecture and +allows for easy management of complex data pipelines. + +Overall, ETL architecture combined with Airflow as pipeline management +provides a robust and efficient solution for batch processing tasks. + +=== Why Monolithic Architecture vs Micro Services Architecture? + +There are several reasons why a monolithic architecture may be more +suitable for an ETL architecture with batch processing pipeline using +Airflow as the pipeline management tool: + +[arabic] +. *Simplicity*: A monolithic architecture is simpler to design and +implement as it involves a single codebase and a single deployment +process. This makes it easier to manage and maintain the ETL pipeline. +. *Performance*: A monolithic architecture may be more performant than a +microservices architecture as it allows for more efficient communication +between the different components of the pipeline. This is particularly +important for batch processing pipelines, where speed and efficiency are +crucial. +. *Scalability*: Monolithic architectures can be scaled horizontally by +adding more resources to the system, such as more servers or more +processing power. This allows for the system to handle larger amounts of +data and handle more complex processing tasks. +. *Airflow Integration*: Airflow is designed to work with monolithic +architectures, and it can be more difficult to integrate with a +microservices architecture. Airflow's DAGs and tasks are designed to +work with a single codebase, and it may be more challenging to manage +different services and pipelines across multiple microservices. + +Overall, a monolithic architecture may be more suitable for an ETL +architecture with batch processing pipeline using Airflow as the +pipeline management tool due to its simplicity, performance, +scalability, and ease of integration with Airflow. + +=== Why ETL/ELT approach vs Event Sourcing ? + +ETL (Extract, Transform, Load) architecture is typically used for moving +and transforming data from one system to another, for example, from a +transactional database to a data warehouse for reporting and analysis. 
+It is a batch-oriented process that is typically scheduled to run at +specific intervals. + +Event sourcing architecture, on the other hand, is a way of storing and +managing the state of an application by keeping track of all the changes +to the state as a sequence of events. This allows for better auditing +and traceability of the state of the application over time, as well as +the ability to replay past events to reconstruct the current state. +Event sourcing is often used in systems that require high performance, +scalability, and fault tolerance. + +In summary, ETL architecture is mainly used for data integration and +data warehousing, Event sourcing is mainly used for building highly +scalable and fault-tolerant systems that need to store and manage the +state of an application over time. + +A hybrid architecture is implemented in the TED-SWS pipeline, based on +an ETL architecture but with state storage to repeat a pipeline sequence +as needed. + +=== Why Batch processing vs Event Streams? + +Batch processing architecture and Event Streams architecture are two +different approaches to processing data in code. + +Batch processing architecture is a traditional approach where data is +processed in batches. This means that data is collected over a period of +time and then processed all at once in a single operation. This approach +is typically used for tasks such as data analysis, data mining, and +reporting. It is best suited for tasks that can be done in a single pass +and do not require real-time processing. + +Event Streams architecture, on the other hand, is a more modern approach +where data is processed in real-time as it is generated. This means that +data is processed as soon as it is received, rather than waiting for a +batch to be collected. This approach is typically used for tasks such as +real-time monitoring, data analytics, and fraud detection. It is best +suited for tasks that require real-time processing and cannot be done in +a single pass. + +In summary, Batch processing architecture is best suited for tasks that +can be done in a single pass and do not require real-time processing, +whereas Event Streams architecture is best suited for tasks that require +real-time processing and cannot be done in a single pass. + +Due to the fact that the TED-SWS pipeline has an ETL architecture, the +data processing is done in batches, the batches of notices are formed +per day, all the notices of a day form a batch that will be processed. +Another method of creating a batch is grouping notices by status and +executing the pipeline depending on their status. + +=== Why NoSQL data model vs SQL data model? + +There are several reasons why a NoSQL data model may be more suitable +for an ETL architecture with batch processing pipeline compared to a SQL +data model: + +[arabic] +. *Scalability*: NoSQL databases are designed to handle large amounts of +data and can scale horizontally, allowing for the easy addition of more +resources as the amount of data grows. This is particularly useful for +batch processing pipelines that need to handle large amounts of data. +. *Flexibility*: NoSQL databases are schema-less, which means that the +data structure can change without having to modify the database schema. +This allows for more flexibility when processing data, as new data types +or fields can be easily added without having to make changes to the +database. +. *Performance*: NoSQL databases are designed for high-performance and can +handle high levels of read and write operations. 
This is particularly +useful for batch processing pipelines that need to process large amounts +of data in a short period of time. + +. *Handling Unstructured Data*: NoSQL databases are well suited for +handling unstructured data, such as JSON or XML, that can't be handled +by SQL databases. This is particularly useful for ETL pipelines that +need to process unstructured data. + +. *Handling Distributed Data*: NoSQL databases are designed to handle +distributed data, which allows for data to be stored and processed on +multiple servers. This can help to improve performance and scalability, +as well as provide fault tolerance. + +. *Cost*: NoSQL databases are generally less expensive than SQL databases, +as they don't require expensive hardware or specialized software. This +can make them a more cost-effective option for ETL pipelines that need +to handle large amounts of data. + +Overall, a NoSQL data model may be more suitable for an ETL architecture +with batch processing pipeline compared to a SQL data model due to its +scalability, flexibility, performance, handling unstructured data, +handling distributed data and the cost-effectiveness. It is important to +note that the choice to use a NoSQL data model satisfies the specific +requirements of the TED-SWS processing pipeline and the nature of the +data to be processed. + +=== Why Airflow? + +Airflow is a great solution for ETL pipeline and batch processing +architecture because it provides several features that are well-suited +to these types of tasks. First, Airflow provides a powerful scheduler +that allows you to define and schedule ETL jobs to run at specific +intervals. This means that you can set up your pipeline to run on a +regular schedule, such as every day or every hour, without having to +manually trigger the jobs. Second, Airflow provides a web-based user +interface that makes it easy to monitor and manage your pipeline. + +Both aspects of Airflow are perfectly compatible with the needs of the +TED-SWS architecture and the use cases required for an Operations +Manager that will interact with the system. Airflow therefore covers the +needs of batch processing management and ETL pipeline management. + +Airflow provide good coverage of use cases for an Operations Manager, +specialized for this use cases: + +[arabic] +. *Monitoring pipeline performance*: An operations manager can use Airflow +to monitor the performance of the ETL pipeline and identify any +bottlenecks or issues that may be impacting the pipeline's performance. +They can then take steps to optimize the pipeline to improve its +performance and ensure that data is being processed in a timely and +efficient manner. + +. *Managing pipeline schedule*: The operations manager can use Airflow to +schedule the pipeline to run at specific times, such as during off-peak +hours or when resources are available. This can help to minimize the +impact of the pipeline on other systems and ensure that data is +processed in a timely manner. + +. *Managing pipeline resources*: The operations manager can use Airflow to +manage the resources used by the pipeline, such as CPU, memory, and +storage. They can also use Airflow to scale the pipeline up or down as +needed to meet changing resource requirements. + +. *Managing pipeline failures*: Airflow allows the operations manager to +set up notifications and alerts for when a pipeline fails or a task +fails. This allows them to quickly identify and address any issues that +may be impacting the pipeline's performance. + +. 
*Managing pipeline dependencies*: The operations manager can use Airflow +to manage the dependencies between different tasks in the pipeline, such +as ensuring that notice fetching is completed before notice indexing or +notice metadata normalization. + +. *Managing pipeline versioning*: Airflow allows the operations manager to +maintain different versions of the pipeline, which can be useful for +testing new changes before rolling them out to production. + +. *Managing pipeline security*: Airflow allows the operations manager to +set up security controls to protect the pipeline and the data it +processes. They can also use Airflow to audit and monitor access to the +pipeline and the data it processes. + +=== Why Metabase? + +Metabase is an excellent solution for data analysis and KPI monitoring +for a batch processing system, as it offers several key features that +make it well suited for this type of use case required within the +TED-SWS system. + +First, Metabase is highly customizable, allowing users to create and +modify dashboards, reports, and visualizations to suit their specific +needs. This makes it easy to track and monitor the key performance +indicators (KPIs) that are most important for the batch processing +system, such as the number of jobs processed, the average processing +time, and the success rate of job runs. + +Second, Metabase offers a wide range of data connectors, allowing users +to easily connect to and query data sources such as SQL databases, NoSQL +databases, CSV files, and APIs. This makes it easy to access and analyze +the data that is relevant to the batch processing system. In TED-SWS the +data domain model is realized by a document-based data model, not a +tabular relational data model, so Metabase is a good tool for analyzing +data with a document-based model. + +Third, Metabase has a user-friendly interface that makes it easy to +navigate and interact with data, even for users with little or no +technical experience. This makes it accessible to a wide range of users, +including business analysts, data scientists, and other stakeholders who +need to monitor and analyse the performance of the batch processing +system. + +Finally, Metabase offers robust security and collaboration features, +making it easy to share and collaborate on data and insights with team +members and stakeholders. This makes it an ideal solution for +organizations that need to monitor and analyse the performance of a +batch processing system across multiple teams or departments. + +=== Why quick deduplication process? + +One of the main challenges in entities deduplication from the semantic +web domain is dealing with the complexity and diversity of the data. +This can include dealing with different data formats, schemas, and +vocabularies, as well as handling missing or incomplete data. +Additionally, entities may have multiple identities or representations, +making it difficult to determine which entities are duplicates and which +are distinct. Another difficulty is the scalability of the algorithm to +handle large amount of data. The performance of the algorithm should be +efficient and accurate to handle huge number of entities. + +There are several approaches and solutions for entities deduplication in +the semantic web. Some of the top solutions include: + +[arabic] +. 
*String-based methods*: These methods use string comparison techniques
+such as Jaccard similarity, Levenshtein distance, and cosine similarity
+to identify duplicates based on the similarity of their string
+representations.
+. *Machine learning-based methods*: These methods use machine learning
+algorithms such as decision trees, random forests, and neural networks
+to learn patterns in the data and identify duplicates.
+
+. *Knowledge-based methods*: These methods use external knowledge sources
+such as ontologies, taxonomies, and linked data to disambiguate entities
+and identify duplicates.
+
+. *Hybrid methods*: These methods combine multiple techniques, such as
+string-based and machine learning-based methods, to improve the accuracy
+of deduplication.
+
+. *Blocking Method*: This method is used to reduce the number of entities
+that need to be compared by grouping similar entities together.
+
+In the TED-SWS pipeline, the deduplication of Organization type entities
+is performed using string-based methods. String-based methods are often
+used for organization entity deduplication because of their simplicity
+and effectiveness.
+
+TED data often contains information about tenders and public
+procurement, where organizations are identified by their names.
+Organization names are often unique and can be used to identify
+duplicates with high accuracy. String-based methods can be used to
+compare the similarity of different organization names, which is
+effective in identifying duplicates.
+
+Additionally, TED data is highly structured, so it is easy to extract
+and compare the names of organizations. String-based methods are also
+relatively fast and easy to implement, making them a good choice for
+large data sets. These methods may not be as effective for other types
+of entities, such as individuals, where additional information may be
+needed to identify duplicates. It is also important to note that
+string-based methods may not work as well for misspelled or abbreviated
+names.
+
+Using a quick-and-dirty deduplication approach instead of a complex
+system at the first iteration of a system implementation can be
+beneficial for several reasons:
+
+[arabic]
+. *Speed*: A quick approach can be implemented and can start identifying
+and removing duplicates in little time. This can be particularly useful
+when working with large and complex data sets, where a more complex
+approach may take a long time to implement and test.
+. *Cost*: A quick-and-dirty approach is generally less expensive to
+implement than a complex system, as it requires fewer resources and less
+development time.
+. *Simplicity*: A quick-and-dirty approach is simpler and easier to
+implement than a complex system, which can reduce the risk of errors and
+bugs.
+. *Flexibility*: A quick-and-dirty approach makes it possible to start
+with a basic system and adapt it as needed, which can be more flexible
+than a complex system that is difficult to change.
+. *Testing*: A quick-and-dirty approach makes it possible to test the
+system quickly, get feedback from users and stakeholders, and then use
+that feedback to improve the system.
+
+However, it is worth noting that the quick-and-dirty approach is not a
+long-term solution and should be used only as a first step in the
+implementation of an MDR (Master Data Registry) system.
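+
+To make the string-based approach described above concrete, the sketch
+below groups candidate duplicate organization names by string similarity.
+It is an illustrative example only: it uses Python's standard-library
+`difflib` ratio as a stand-in for metrics such as Jaccard or Levenshtein,
+and the normalisation rule and the 0.9 threshold are assumptions, not the
+values used by the actual TED-SWS deduplication service.
+
+[source,python]
+----
+from difflib import SequenceMatcher
+from itertools import combinations
+
+
+def normalise(name: str) -> str:
+    """Crude name normalisation before comparison (illustrative only)."""
+    return " ".join(name.lower().replace(",", " ").replace(".", " ").split())
+
+
+def similarity(a: str, b: str) -> float:
+    """Similarity score in [0, 1] between two organization names."""
+    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()
+
+
+def find_duplicate_pairs(names, threshold=0.9):
+    """Return the pairs of names that are considered duplicates."""
+    return [(a, b) for a, b in combinations(names, 2)
+            if similarity(a, b) >= threshold]
+
+
+print(find_duplicate_pairs([
+    "Ministry of Finance",
+    "Ministry of Finance.",
+    "Ministere des Finances",   # different string, should not match
+]))
+# [('Ministry of Finance', 'Ministry of Finance.')]
+----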
+
+Such a first-pass approach can help to quickly identify and remove
+duplicates and establish a basic system, but it may not be able to
+handle all the complexity and diversity of the data, so it is important
+to plan for and implement more advanced techniques as the system
+matures.
diff --git a/docs/antora/modules/ROOT/pages/architecture/arhitecture_overview.adoc b/docs/antora/modules/ROOT/pages/architecture/arhitecture_overview.adoc
new file mode 100644
index 000000000..9b922c392
--- /dev/null
+++ b/docs/antora/modules/ROOT/pages/architecture/arhitecture_overview.adoc
@@ -0,0 +1,476 @@
+== TED-SWS Architecture
+
+=== Use Cases
+
+Operations Manager is the main actor that will interact with the TED-SWS
+system. When presenting the system architecture, we strongly rely on the
+perspective of this actor.
+
+For the Operations Manager, the following use cases are relevant:
+
+* to fetch notices from the TED website based on a query
+* to fetch notices from the TED website based on a date range
+* to fetch notices from the TED website based on a single date
+* to load a Mapping Suite into the system
+* to reprocess non-normalized notices from the backlog
+* to reprocess untransformed notices from the backlog
+* to reprocess unvalidated notices from the backlog
+* to reprocess unpackaged notices from the backlog
+* to reprocess the notices we published from the backlog
+
+=== System architecture
+
+The main points of architecture for a system that will transform TED
+notices from XML format to RDF format using an ETL architecture with a
+batch processing pipeline are:
+
+[arabic]
+. *Data collection*: An API would be used to collect the daily notices
+from the TED website in XML format and store them in a data warehouse.
+. *Metadata management*: A metadata management module would collect,
+store, and provide filtering capabilities for notices based on their
+features, such as form number, date of publication, XSD schema version,
+subform type, etc.
+. *Data transformation*: A data transformation module would be used to
+convert the XML data into RDF format.
+. *Data loading*: The transformed RDF data would be loaded into a triple
+store, such as Cellar, for further analysis or reporting.
+. *Pipeline management*: Airflow would be used to schedule and manage the
+pipeline, ensuring that the pipeline is run on a daily basis to process
+the latest batch of notices from the TED website. Airflow would also be
+used to monitor the pipeline and provide real-time status updates (a
+minimal DAG sketch follows this list).
+. *Data access*: A SPARQL endpoint or an API would be used to access the
+RDF data stored in the triple store. This would allow external systems
+to query the data and retrieve the information they need.
+. *Security*: The system would be protected by a firewall and would use
+secure protocols (e.g. HTTPS) for data transfer. Access to the data
+would be controlled by authentication and authorization mechanisms.
+. *Scalability*: The architecture should be designed to handle large
+amounts of data and easily scale horizontally by adding more resources
+as the amount of data grows.
+. *Flexibility*: The architecture should be flexible to handle changes in
+the data structure without having to modify the database schema.
+. *Performance*: The architecture should be designed for high performance
+to handle high levels of read and write operations to process data in a
+short period of time.
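+
+To illustrate the pipeline-management point above, the following is a
+minimal Airflow DAG sketch. It is not the actual TED-SWS DAG: the DAG id,
+task names, schedule and placeholder callables are assumptions made for
+the example only.
+
+[source,python]
+----
+from datetime import datetime
+
+from airflow import DAG
+from airflow.operators.python import PythonOperator
+
+
+# Placeholder callables standing in for the real pipeline services.
+def fetch_notices(**context): ...
+def normalise_metadata(**context): ...
+def transform_to_rdf(**context): ...
+def validate_rdf(**context): ...
+def package_mets(**context): ...
+def publish_to_cellar(**context): ...
+
+
+with DAG(
+    dag_id="ted_sws_daily_batch_sketch",      # hypothetical DAG id
+    start_date=datetime(2023, 1, 1),
+    schedule_interval="@daily",               # one batch per publication day
+    catchup=False,
+) as dag:
+    tasks = [
+        PythonOperator(task_id=fn.__name__, python_callable=fn)
+        for fn in (fetch_notices, normalise_metadata, transform_to_rdf,
+                   validate_rdf, package_mets, publish_to_cellar)
+    ]
+    # Chain the ETL stages in their happy-path order.
+    for upstream, downstream in zip(tasks, tasks[1:]):
+        upstream >> downstream
+----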
+ +Figure 1.1 shows the compact, general image of the TED-SWS system +architecture from the system's business point of view. The system +represents a pipeline for processing notices from the TED Website and +publishing them to the CELLAR service. + +For the monitoring and management of internal processes, the system +offers two interfaces. An interface for data monitoring, in the diagram, +the given interface is represented by the name of “Data Monitoring +Interface”. Another interface is for the monitoring and management of +system processes; in the diagram, the given interface is represented by +the name “Workflow Management Interface”. Operations Manager will use +these two interfaces for system monitoring and management. + +The element of the system that will process the notices is the TED-SWS +pipeline. The input data for this pipeline will be the notices in XML +format from the TED website. The result of this pipeline is a METS +package for each processed notice and its publication in CELLAR, from +where the end user will be able to access notices in RDF format. + +Providing, in Figure 1.1, a compact view of the TED-SWS system +architecture at the business level is useful because it allows +stakeholders and decision-makers to quickly and easily understand how +the system works and how it supports the business goals and objectives. +A compact view of the architecture can help to communicate the key +components of the system and how they interact with each other, making +it easier to understand the system's capabilities and limitations. +Additionally, a compact view of the architecture can help to identify +any areas where the system could be improved or where additional +capabilities are needed to support the business. By providing a clear +and concise overview of the system architecture, stakeholders can make +more informed decisions about how to use the system, how to improve it, +and how to align it with the business objectives. + +In Figure 1.1 also is provided, input and output dependencies for a +TED-SWS system architecture. This is useful because it helps to identify +the data sources and data destinations that the system relies on, as +well as the data that the system produces. This information can be used +to understand the data flows within the system, how the system is +connected to other systems, and how the system supports the business. +Input dependencies help to identify the data sources that the system +relies on, such as external systems, databases, or other data sources. +This information can be used to understand how the system is connected +to other systems and how it receives data. Output dependencies help to +identify the data destinations that the system produces, such as +external systems, databases, or other data destinations. This +information can be used to understand how the system is connected to +other systems and how it sends data. By providing input and output +dependencies for the TED-SWS system architecture, stakeholders can make +more informed decisions about how to use the system, how to improve it, +and how to align it with the business objectives. + +image:system_arhitecture/media/image1.png[image,width=100%,height=366] + +Figure 1.1 Compact view of system architecture at the business level + +In Figure 1.2 the general extended architecture of the TED-SWS system is +represented, in this diagram, the internal components of the TED-SWS +pipeline are also included. 
+ +image:system_arhitecture/media/image8.png[image,width=100%,height=270] + +Figure 1.2 Extended view of system architecture at business level + +Figure 1.3 shows the architecture of the TED-SWS system without its +peripheral elements. This diagram is intended to highlight the services +that serve the internal components of the pipeline. + +*Workflow Management Service* is an external TED-SWS pipeline service +that performs pipeline management. This service provides a control +interface, in the figure it is represented by Workflow Management +Interface. + +*Workflow Management Interface* represents an internal process control +interface, this component will be analysed in a separate diagram. + +*Data Visualization Service* is a service that manages logs and pipeline +data to present them in a form of dashboards. + +*Data Monitoring Interface* is a data visualization and dashboard +editing interface offered by the Data Visualization Service. + +*Message Digest Service* is a service that serves the transformation +component of the TED-SWS pipeline, within the transformation to ensure +custom RML functions, an external service is needed that will implement +them. + +*Master Data Management & URI Allocation Service* is a service for +storing and managing unique URIs, this service performs URI +deduplication. + +The *TED-SWS pipeline* contains a set of components, all of which access +Notice Aggregate and Mapping Suite objects. + +image:system_arhitecture/media/image4.png[image,width=100%,height=318] + +Figure 1.3 TED-SWS architecture at business level + +Figure 1.4 shows the TED-SWS pipeline and its components, and this view +aims to show the connection between the components. + +The pipeline has the following components: + +* Fetching Service +* XML Indexing Service +* Metadata Normalization Service +* Transformation Service; +* Entity Resolution & Deduplication Service +* Validation Service +* Packaging Service +* Publishing Service +* Mapping Suite Loading Service + +*Fetching Service* is a service that extracts notices from the TED +website and stores them in the database. + +*XML Indexing Service* is a service that extracts all unique XPaths from +an XML and stores them as metadata. Unique XPaths are used later to +validate if the transformation to RDF format, has been done for all +XPaths from a notice in XML format. + +*Metadata Normalization Service* is a service that normalises the +metadata of a notice in an internal work format. This normalised +metadata will be used in other processes on a notice, such as the +selection of a Mapping Suite for transformation or validation of a +notice. + +*Transformation Service* is the service that transforms a notice from +the XML format into the RDF format, using for this a Mapping Suite that +contains the RML transformation rules that will be applied. + +*Entity Resolution & Deduplication Service* is a service that performs +the deduplication of entities from RDF manifestation, namely +Organization and Procedure entities. + +*Validation Service* is a service that validates a notice in RDF format, +using for this several types of validations, namely validation using +SHACL shapes, validation using SPARQL tests and XPath coverage +verification. + +*Packaging Service* is a service that creates a METS package that will +contain notice RDF manifestation. + +*Publishing Service* is a service that publishes a notice RDF +manifestation in the required format, in the case of Cellar the +publication takes place with a METS package. 
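+
+As an illustration of what the XML Indexing Service described above does,
+the sketch below extracts the set of unique element XPaths from a notice
+XML. It is a simplified example assuming the `lxml` library; stripping
+the positional predicates is an illustrative simplification, not
+necessarily the exact indexing rule used by TED-SWS.
+
+[source,python]
+----
+import re
+
+from lxml import etree
+
+
+def unique_xpaths(xml_bytes: bytes) -> set:
+    """Return the unique element XPaths present in an XML notice,
+    with positional predicates such as [2] removed."""
+    root = etree.fromstring(xml_bytes)
+    tree = root.getroottree()
+    xpaths = set()
+    for element in root.iter():
+        if not isinstance(element.tag, str):
+            continue  # skip comments and processing instructions
+        path = tree.getpath(element)  # e.g. /TED_EXPORT/FORM_SECTION/...
+        xpaths.add(re.sub(r"\[\d+\]", "", path))
+    return xpaths
+
+
+# Usage with a hypothetical notice file:
+# with open("notice.xml", "rb") as f:
+#     print(sorted(unique_xpaths(f.read())))
+----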
+
+image:system_arhitecture/media/image5.png[image,width=100%,height=154]
+
+Figure 1.4 TED-SWS pipeline architecture at business level
+
+=== Processing single notice (BPMN perspective)
+
+The pipeline for processing a notice is the key element in the TED-SWS
+system; its architecture from the business point of view is represented
+in Figure 2. Unlike the previously presented figures, Figure 2 renders
+the pipeline in greater detail and presents the relationships between
+pipeline steps and the artefacts that they produce or use.
+
+Based on Figure 2, it can be noted that the pipeline is not a linear
+one: within the pipeline there are control steps that check whether the
+following steps should be executed for a notice.
+
+There are three control steps in the pipeline, namely:
+
+* Check notice eligibility for transformation
+* Check notice eligibility for packaging
+* Check notice availability in Cellar
+
+The “Check notice eligibility for transformation” step checks whether a
+notice can be transformed with a Mapping Suite. If it can, the notice
+moves on to the transformation step; otherwise, it is stored for future
+processing.
+
+The “Check notice eligibility for packaging” step checks whether a
+notice RDF manifestation, after the validation step, is valid for
+packaging in a METS package. If it is valid, the notice proceeds to the
+packaging step; otherwise, the intermediate result is stored for further
+analysis.
+
+The “Check notice availability in Cellar” step checks, after the
+publication step, whether a published notice is already accessible in
+Cellar. If the notice is accessible, the pipeline is finished;
+otherwise, the published notice is stored for further analysis.
+
+Pipeline steps produce and use artefacts such as:
+
+* TED-XML notice & metadata
+* Mapping rules
+* TED-RDF notice
+* Test suites
+* Validation report
+* METS Package
+
+image:system_arhitecture/media/image2.png[image,width=100%,height=177]
+
+Figure 2 Single notice processing pipeline at business level
+
+Based on Figure 2, we can notice that the artefacts for a notice appear
+as the notice passes through certain steps of the pipeline. To be able
+to conveniently manage the state of a notice and all its artefacts
+depending on its state, a notice represents an aggregate of artefacts
+and a state, which changes dynamically during the pipeline.
+
+== Application architecture
+
+In this section, we address the following questions:
+
+* How is the data organised?
+* How does the data structure evolve within the process?
+* What does the business process look like?
+* How is the business process realised in the application?
+
+=== Notice status transition map
+
+The TED-SWS pipeline implements a hybrid architecture: an ETL pipeline
+combined with a status transition map for each notice. The pipeline has
+many steps and is not linear, and for such a complex pipeline with
+multiple steps and ramifications, using a notice status transition map
+is a good architectural choice for several reasons:
+
+[arabic]
+. *Visibility*: A notice status transition map provides a clear and visual
+representation of the different stages that a notice goes through in the
+pipeline. This allows for better visibility into the pipeline, making it
+easier to understand the flow of data and to identify any issues or
+bottlenecks.
+
+. 
*Traceability*: A notice status transition map allows for traceability +of notices in the pipeline, which means that it's possible to track a +notice as it goes through the different stages of the pipeline. This can +be useful for troubleshooting, as it allows for the identification of +which stage the notice failed or had an issue. + +. *Error Handling*: A notice status transition map allows for the +definition of error handling procedures for each stage in the pipeline. +This can be useful for identifying and resolving errors that occur in +the pipeline, as it allows for a clear understanding of what went wrong +and what needs to be done to resolve the issue. + +. *Auditing*: A notice status transition map allows for the auditing of +notices in the pipeline, which means that it's possible to track the +history of a notice, including when it was processed, by whom, and +whether it was successful or not. + +. *Monitoring*: A notice status transition map allows for the monitoring +of notices in the pipeline, which means that it's possible to track the +status of a notice, including how many notices are currently being +processed, how many have been processed successfully, and how many have +failed. + +. *Automation*: A notice status transition map can be used to automate +some of the process, by defining rules or triggers to move notices +between different stages of the pipeline, depending on the status of the +notice. + + +Each notice has a status during the pipeline, a status corresponds to a +step in the pipeline that the notice passed. Figure 3.1 shows the +transition flow of the status of a notice, as a note we must take into +account that a notice can only be in one status at a given time. +Initially, each notice has the status of RAW and the last status, which +means finishing the pipeline, is the status of PUBLICLY_AVAILABLE. + +Based on the use cases of this pipeline, the following statuses of a +notice are of interest to the end user: + +* RAW +* NORMALISED_METADATA +* INELIGIBLE_FOR_TRANSFORMATION +* TRANSFORMED +* VALIDATED +* INELIGIBLE_FOR_PACKAGING +* PACKAGED +* INELIGIBLE_FOR_PUBLISHING +* PUBLISHED +* PUBLICLY_UNAVAILABLE +* PUBLICLY_AVAILABLE + +image:system_arhitecture/media/image6.png[image,width=546,height=402] + +Figure 3.1 Notice status transition + +The names of the statuses are self-descriptive, but attention should be +drawn to some statuses, namely: + +* INDEXED +* NORMALISED_METADATA +* DISTILLED +* PUBLISHED +* PUBLICLY_UNAVAILABLE +* PUBLICLY_AVAILABLE + +The INDEXED status means that the set of unique XPaths appearing in its +XML manifestation has been calculated for a notice. The unique set of +XPaths is subsequently required when calculating the XPath coverage +indicator for the transformation. + +The NORMALISED_METADATA status means that for a notice, its metadata has +been normalised. The metadata of a notice is normalised in an internal +format to be able to check the eligibility of a notice to be transformed +with a Mapping Suite package. + +The status DISTILLED is used to indicate that the RDF manifestation of a +notice has been post processed. The post-processing of an RDF +manifestation provides for the deduplication of the Procedure or +Organization type entities and the insertion of corresponding triplets +within this RDF manifestation. + +The PUBLISHED status means that a notice has been sent to Cellar, which +does not mean that it is already available in Cellar. 
Since there is a +time interval between the transmission and the actual appearance in the +Cellar, it is necessary to check later whether a notice is available in +the Cellar or not. If the verification has taken place and the notice is +available in the Cellar, it is assigned the status of +PUBLICLY_AVAILABLE, if it is not available in the Cellar, the status of +PUBLICLY_UNAVAILABLE is assigned to it. + +=== Notice structure + +Notice structure has a NoSQL data model, this architecture choice is +based on dynamic behaviour of notice structure which evolves over time +while TED-SWS pipeline running and besides that there are other reasons: + +[arabic] +. *Schema-less*: NoSQL databases are schema-less, which means that the +data structure can change without having to modify the database schema. +This allows for more flexibility when processing data, as new data types +or fields can be easily added without having to make changes to the +database. This is particularly useful for notices that are likely to +evolve over time, as the structure of the notices can change without +having to make changes to the database. + +. *Handling Unstructured Data*: NoSQL databases are well suited for +handling unstructured data, such as JSON or XML, that can't be handled +by SQL databases. This is particularly useful for ETL pipelines that +need to process unstructured data, as notices are often unstructured and +may evolve over time. +. *Handling Distributed Data*: NoSQL databases are designed to handle +distributed data, which allows for data to be stored and processed on +multiple servers. This can help to improve performance and scalability, +as well as provide fault tolerance. This is particularly useful for +notices that are likely to evolve over time, as the volume of data may +increase and need to be distributed. + +. *Flexible Querying*: NoSQL databases allow for flexible querying, which +means that the data can be queried in different ways, including by +specific fields, by specific values, and by ranges. This allows for more +flexibility when querying the data, as the structure of the notices may +evolve over time. +. *Cost-effective*: NoSQL databases are generally less expensive than SQL +databases, as they don't require expensive hardware or specialized +software. This can make them a more cost-effective option for ETL +pipelines that need to handle large amounts of data and that are likely +to evolve over time. + + +Overall, a NoSQL data model is a good choice for notice structure in an +ETL pipeline that is likely to evolve over time because it allows for +more flexibility when processing data, handling unstructured data, +handling distributed data, flexible querying and it's cost-effective. + +Figure 3.2 shows the structure of a notice and its evolution depending +on the state in which a notice is located. In the given figure, the +emphasis is placed on the states from which a certain part of the +structure of a notice is present. As a remark, it should be taken into +account that once an element of the structure of a notice is present for +a certain state, it will also be present for all the states derived from +it, such as the flow of states presented in Figure 3.1. + +image:system_arhitecture/media/image3.png[image,width=567,height=350] + +Figure 3.2 Dynamic behaviour of notice structure based on status + +Based on Figure 3.2, it is noted that the structure of a notice evolves +with the transition to other states. 
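+
+As an illustration of this evolving, document-oriented structure, the
+sketch below shows what a stored notice document might look like at two
+different statuses. The field names and values are simplified,
+hypothetical stand-ins rather than the exact schema used by the
+pipeline; the point is that fields accumulate as the status advances, as
+listed in the paragraphs that follow.
+
+[source,python]
+----
+# Hypothetical notice documents as they might appear in a document store.
+notice_normalised = {
+    "ted_id": "00042-2023",                  # made-up notice identifier
+    "status": "NORMALISED_METADATA",
+    "original_metadata": {"...": "..."},
+    "normalised_metadata": {"publication_date": "2023-02-20",
+                            "form_number": "F03"},
+    "xml_manifestation": "<TED_EXPORT>...</TED_EXPORT>",
+}
+
+# After transformation and validation, new fields appear; none are removed.
+notice_validated = {
+    **notice_normalised,
+    "status": "VALIDATED",
+    "rdf_manifestation": "@prefix epo: <...> .",
+    "validation": {
+        "xpath_coverage": {"covered": 120, "total": 125},
+        "shacl": {"conforms": True},
+        "sparql": {"passed": 40, "failed": 0},
+    },
+}
+----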
+ +For a notice in the state of NORMALISED_METADATA, we can access the +following fields of a notice: + +* Original Metadata +* Normalised Metadata +* XML Manifestation + +For a notice in the TRANSFORMED state, we can access all the previous +fields and the following new fields of a notice: + +* RDF Manifestation. + +For a notice in the VALIDATED state, we can access all the previous +fields and the following new fields of a notice: + +* XPath Coverage Validation + +* SHACL Validation +* SPARQL Validation + +For a notice in the PACKAGED state, we can access all the previous +fields and the following new fields of a notice: + +* METS Manifestation + +=== Application view of the process + +The primary actor of the TED-SWS system will be the Operations Manager, +who will interact with the system. Application-level pipeline control is +achieved through the Airflow stack. Figure 4 shows the AirflowUser actor +representing Operations Manager, this diagram is at the application +level of the process. + +image:system_arhitecture/media/image7.png[image,width=534,height=585] + +Figure 4 Dependencies between Airflow DAGs + +Based on the use cases defined for an Operations Manger, Figure 4 shows +the control functionality of the TED-SWS pipeline that it can use. In +addition to the functionality available for the AirflowUser actor, the +dependency between DAGs is also rendered. We can note that another actor +named AirflowScheduler is defined, this actor represents an automatic +execution mechanism at a certain time interval of certain DAGs. + diff --git a/docs/antora/modules/ROOT/pages/future_work.adoc b/docs/antora/modules/ROOT/pages/future_work.adoc new file mode 100644 index 000000000..3dc112542 --- /dev/null +++ b/docs/antora/modules/ROOT/pages/future_work.adoc @@ -0,0 +1,59 @@ +== Future work + +In the future, another Master Data Registry type system will be used to +deduplicate entities in the TED-SWS system, which will be implemented +according to the requirements for deduplication of entities from +notices. + +The future Master Data Registry (MDR) system for entity deduplication +should have the following architecture: + +[arabic] +. *Data Ingestion*: This component is responsible for extracting and +collecting data from various sources, such as databases, files, and +APIs. The data is then transformed, cleaned, and consolidated into a +single format before it is loaded into the MDR. + +. *Data Quality*: This component is responsible for enforcing data quality +rules, such as format, completeness, and consistency, on the data before +it is entered into the MDR. This can include tasks such as data +validation, data standardization, and data cleansing. + +. *Entity Dedup*: This component is responsible for identifying and +removing duplicate entities in the MDR. This can be done using a +combination of techniques such as string-based, machine learning-based, +or knowledge-based methods. + +. *Data Governance*: This component is responsible for ensuring that the +data in the MDR is accurate, complete, and up-to-date. This can include +processes for data validation, data reconciliation, and data +maintenance. + +. *Data Access and Integration*: This component provides access to the MDR +data through a user interface and API's, and integrates the MDR data +with other systems and applications. + +. *Data Security*: This component is responsible for ensuring that the +data in the MDR is secure, and that only authorized users can access it. 
+This can include tasks such as authentication, access control, and
+encryption.
+
+. *Data Management*: This component is responsible for managing the data
+in the MDR, including tasks such as data archiving, data backup, and
+data recovery.
+
+. *Monitoring and Analytics*: This component is responsible for monitoring
+and analysing the performance of the MDR system, and for providing
+insights into the data to help improve the system.
+
+. *Services layer*: This component is responsible for providing services
+such as indexing, search, and query functionalities over the data.
+
+All these components should be integrated and work together to provide a
+comprehensive and efficient MDR system for entity deduplication. The
+system should be scalable and flexible enough to handle large amounts of
+data and adapt to changing business requirements.
+
+
diff --git a/docs/antora/modules/ROOT/pages/glossary.adoc b/docs/antora/modules/ROOT/pages/glossary.adoc
new file mode 100644
index 000000000..09def2255
--- /dev/null
+++ b/docs/antora/modules/ROOT/pages/glossary.adoc
@@ -0,0 +1,23 @@
+== Glossary
+
+*Airflow* - an open-source platform for developing, scheduling, and
+monitoring batch-oriented pipelines. The web interface helps manage the
+state and monitoring of your pipelines.
+
+*Metabase* - the BI tool with a friendly UX and integrated tooling that
+lets you explore data gathered by running the pipelines available in
+Airflow.
+
+*Cellar* - the central content and metadata repository of the
+Publications Office of the European Union.
+
+*TED-SWS* - a pipeline system that continuously converts the public
+procurement notices (in XML format) available on the TED Website into
+RDF format and publishes them into CELLAR.
+
+*DAG* - (Directed Acyclic Graph) the core concept of Airflow, collecting
+Tasks together, organized with dependencies and relationships to say how
+they should run. The DAGs are basically the pipelines that run in this
+project to get the public procurement notices from XML to RDF and to
+publish them into CELLAR.
+
diff --git a/docs/antora/modules/ROOT/pages/index.adoc b/docs/antora/modules/ROOT/pages/index.adoc
index 632edc364..dec49b1b7 100644
--- a/docs/antora/modules/ROOT/pages/index.adoc
+++ b/docs/antora/modules/ROOT/pages/index.adoc
@@ -1,20 +1,6 @@
 = TED-RDF Conversion Pipeline Documentation
 
-The TED-RDF Conversion Pipeline, which is part of the TED Semantic Web Services, aka TED-SWS system, provides tools an infrastructure to convert TED notices available in XML format into RDF. This conversion pipeline is designed to work with the https://docs.ted.europa.eu/rdf-mapping/index.html[TED-RDF Mappings].
-
-== Quick references for users
-
-* xref:mapping_suite_cli_toolchain.adoc[Installation and usage instructions for the Mapping Suite CLI toolchain]
-* link:{attachmentsdir}/ted-sws-architecture/index.html[Preliminary project architecture (in progress)^]
-
-
-== Developer pages
-
-xref:demo_installation.adoc[Installation instructions for development and testing for software engineers]
-
-xref:attachment$/aws-infra-docs/TED-SWS-AWS-Infrastructure-architecture-overview-v0.9.pdf[TED-SWS AWS Infrastructure architecture overview v0.9]
-
-xref:attachment$/aws-infra-docs/TED-SWS Installation manual v2.0.2.pdf[TED-SWS AWS Installation manual v2.0.2]
+The TED-RDF Conversion Pipeline is part of the TED Semantic Web Services (TED-SWS system) and provides tools and infrastructure to convert TED notices available in XML format into RDF.
This conversion pipeline is designed to work with the https://docs.ted.europa.eu/rdf-mapping/index.html[TED-SWS Mapping Suites] - self containing packages with transformation rules and resources. == Project roadmap @@ -23,8 +9,7 @@ xref:attachment$/aws-infra-docs/TED-SWS Installation manual v2.0.2.pdf[TED-SWS A | Phase 1 | The first phase places high priority on the deployment into the OP AWS Cloud environment.| August 2022 | xref:attachment$/FATs/2022-08-29-report/index.html[2022-08-29 report] | 29 August 2022 | link:https://github.com/OP-TED/ted-rdf-conversion-pipeline/releases/tag/0.0.9-beta[0.0.9-beta] | Phase 2 | Provided that the deployment in the acceptance environment is successful, the delivery of Phase 2 aims to provide the first production version of the TED SWS system. | Nov 2022 | xref:attachment$/FATs/2022-11-22-TED-SWS-FAT-complete.html[2022-11-22 report] | 20 Nov 2022 | https://github.com/OP-TED/ted-rdf-conversion-pipeline/releases/tag/1.0.0-beta[1.0.0-beta] -| Phase 3 | This phase delivers the documentation and components and improvements that could not be covered in the previous phases. | Feb 2023 | --- | --- | --- - +| Phase 3 | This phase delivers the documentation and components and improvements that could not be covered in the previous phases. | Feb 2023 | xref:attachment$/FATs/2023-02-20-TED-SWS-FAT-complete.html[2023-02-20 report] | 21 Feb 2023 | https://github.com/OP-TED/ted-rdf-conversion-pipeline/releases/tag/1.1.0-beta[1.1.0-beta] |=== @@ -32,3 +17,21 @@ xref:attachment$/aws-infra-docs/TED-SWS Installation manual v2.0.2.pdf[TED-SWS A +// +// == Quick references for Developers +// +// == Quick references for DevOps +// +// == Quick references for TED-SWS Developers +// +// * xref:mapping_suite_cli_toolchain.adoc[Installation and usage instructions for the Mapping Suite CLI toolchain] +// * link:{attachmentsdir}/ted-sws-architecture/index.html[Preliminary project architecture (in progress)^] +// +// +// == Developer pages +// +// xref:demo_installation.adoc[Installation instructions for development and testing for software engineers] +// +// xref:attachment$/aws-infra-docs/TED-SWS-AWS-Infrastructure-architecture-overview-v0.9.pdf[TED-SWS AWS Infrastructure architecture overview v0.9] +// +// xref:attachment$/aws-infra-docs/TED-SWS Installation manual v2.5.0.pdf[TED-SWS AWS Installation manual v2.5.0] \ No newline at end of file diff --git a/docs/antora/modules/ROOT/pages/system_arhitecture.adoc b/docs/antora/modules/ROOT/pages/system_arhitecture.adoc deleted file mode 100644 index 0e7ea1cc6..000000000 --- a/docs/antora/modules/ROOT/pages/system_arhitecture.adoc +++ /dev/null @@ -1,972 +0,0 @@ -= TED-SWS System Architecture - -[width="100%",cols="25%,75%",options="header",] -|=== -|*Editors* |Dragos Paun - + -Eugeniu Costetchi - -|*Version* |1.0.0 - -|*Date* |20/02/2023 -|=== -== Introduction - -Although TED notice data is already available to the general public -through the search API provided by the TED website, the current offering -has many limitations that impede access to and reuse of the data. One -such important impediment is for example the current format of the data. - -Historical TED data come in various XML formats that evolved together -with the standard TED XML schema. The imminent introduction of eForms -will also introduce further diversity in the XML data formats available -through TED's search API. 
This makes it practically impossible for users -to consume and process data that span across several years, as -their information systems must be able to process several different -flavours of the available XML schemas as well as to keep up with the -schema's continuous evolution. Their search capabilities are therefore -confined to a very limited set of metadata. - -The TED Semantic Web Service will remove these barriers by providing one -common format for accessing and reusing all TED data. Coupled with the -eProcurement Ontology, the TED data will also have semantics attached to -them allowing users to directly link them with other datasets. -Moreover, users will now be able to perform much more elaborate -queries directly on the data source (through the SPARQL endpoint). This -will reduce their need for data warehousing in order to perform complex -queries. - -These developments, by lowering the barriers, will give rise to a vast -number of new use-cases that will enable stakeholders and end-users to -benefit from increased availability of analytics. The ability to perform -complex queries on public procurement data will be equally open to large -information systems as well as to simple desktop users with a copy of -Excel and an internet connection. - -To summarize, the TED Semantic Web Service (TED SWS) is a pipeline -system that continuously converts the public procurement notices (in XML -format) available on the TED Website into RDF format, publishes them -into CELLAR and makes them available to the public through CELLAR’s -SPARQL endpoint. - -=== Document overview - -This document describes the architecture of the TED-SWS system. - -It describes: - -* A general description of the system -* A general architecture -* A process single notice - -=== Glossary - -*Airflow* - an open-source platform for developing, scheduling, and -monitoring batch-oriented pipelines. The web interface helps manage the -state and monitoring of your pipelines. - -*Metabase* - is the BI tool with the friendly UX and integrated tooling -to let you explore data gathered by running the pipelines available in -Airflow. - -*Cellar* - is the central content and metadata repository of the -Publications Office of the European Union - -*TED-SWS* - is a pipeline system that continuously converts the public -procurement notices (in XML format) available on the TED Website into -RDF format and publishes them into CELLAR - -*DAG* - (Directed Acyclic Graph) is the core concept of Airflow, -collecting Tasks together, organized with dependencies and relationships -to say how they should run. The DAGS are basically the pipelines that -run in this project to get the public procurement notices from XML to -RDF and to be published them into CELLAR. - -== Architecture - -=== System use cases - -Operations Manager is the main actor that will interact with the TED-SWS -system. For these reasons, the use cases of the system will be focused -on the foreground for this actor. 
- -For Operations Manager are the following use cases: - -* to load a Mapping Suite into the database -* to reprocess non-normalized notices from the backlog -* to reprocess untransformed notices from the backlog -* to reprocess unvalidated notices from the backlog -* to reprocess unpackaged notices from the backlog -* to reprocess the notices we published from the backlog -* to fetch notices from the TED website based on a query -* to fetch notices from the TED website based on a date range -* to fetch notices from the TED website based on date - -=== Architecture overview - -The main points of architecture for a system that will transform TED -notices from XML format to RDF format using an ETL architecture with -batch processing pipeline are: - -[arabic] -. *Data collection*: A web scraper or API would be used to collect the -daily notices from the TED website in XML format and store them in a -data warehouse. -. *Data cleansing*: A data cleansing module would be used to clean and -validate the data, removing any invalid or duplicate entries -. *Data transformation*: A data transformation module would be used to -convert the XML data into RDF format. -. *Data loading*: The transformed RDF data would be loaded into a triple -store, such as Cellar, for further analysis or reporting. -. *Pipeline management*: Airflow would be used to schedule and manage the -pipeline, ensuring that the pipeline is run on a daily basis to process -the latest batch of notices from the TED website. Airflow would also be -used to monitor the pipeline and provide real-time status updates. -. *Data access*: A SPARQL endpoint or an API would be used to access the -RDF data stored in the triple store. This would allow external systems -to query the data and retrieve the information they need. -. *Security*: The system would be protected by a firewall and would use -secure protocols (e.g. HTTPS) for data transfer. Access to the data -would be controlled by authentication and authorization mechanisms. - -. *Scalability*: The architecture should be designed to handle large -amounts of data and easily scale horizontally by adding more resources -as the amount of data grows. -. *Flexibility*: The architecture should be flexible to handle changes in -the data structure without having to modify the database schema. -. *Performance*: The architecture should be designed for high-performance -to handle high levels of read and write operations to process data in a -short period of time. - -Figure 1.1 shows the compact, general image of the TED-SWS system -architecture from the system's business point of view. The system -represents a pipeline for processing notices from the TED Website and -publishing them to the CELLAR service. - -For the monitoring and management of internal processes, the system -offers two interfaces. An interface for data monitoring, in the diagram, -the given interface is represented by the name of “Data Monitoring -Interface”. Another interface is for the monitoring and management of -system processes; in the diagram, the given interface is represented by -the name “Workflow Management Interface”. Operations Manager will use -these two interfaces for system monitoring and management. - -The element of the system that will process the notices is the TED-SWS -pipeline. The input data for this pipeline will be the notices in XML -format from the TED website. 
The result of this pipeline is a METS -package for each processed notice and its publication in CELLAR, from -where the end user will be able to access notices in RDF format. - -Providing, in Figure 1.1, a compact view of the TED-SWS system -architecture at the business level is useful because it allows -stakeholders and decision-makers to quickly and easily understand how -the system works and how it supports the business goals and objectives. -A compact view of the architecture can help to communicate the key -components of the system and how they interact with each other, making -it easier to understand the system's capabilities and limitations. -Additionally, a compact view of the architecture can help to identify -any areas where the system could be improved or where additional -capabilities are needed to support the business. By providing a clear -and concise overview of the system architecture, stakeholders can make -more informed decisions about how to use the system, how to improve it, -and how to align it with the business objectives. - -In Figure 1.1 also is provided, input and output dependencies for a -TED-SWS system architecture. This is useful because it helps to identify -the data sources and data destinations that the system relies on, as -well as the data that the system produces. This information can be used -to understand the data flows within the system, how the system is -connected to other systems, and how the system supports the business. -Input dependencies help to identify the data sources that the system -relies on, such as external systems, databases, or other data sources. -This information can be used to understand how the system is connected -to other systems and how it receives data. Output dependencies help to -identify the data destinations that the system produces, such as -external systems, databases, or other data destinations. This -information can be used to understand how the system is connected to -other systems and how it sends data. By providing input and output -dependencies for the TED-SWS system architecture, stakeholders can make -more informed decisions about how to use the system, how to improve it, -and how to align it with the business objectives. - -image:system_arhitecture/media/image1.png[image,width=100%,height=366] - -Figure 1.1 Compact view of system architecture at the business level - -In Figure 1.2 the general extended architecture of the TED-SWS system is -represented, in this diagram, the internal components of the TED-SWS -pipeline are also included. - -image:system_arhitecture/media/image8.png[image,width=100%,height=270] - -Figure 1.2 Extended view of system architecture at business level - -Figure 1.3 shows the architecture of the TED-SWS system without its -peripheral elements. This diagram is intended to highlight the services -that serve the internal components of the pipeline. - -*Workflow Management Service* is an external TED-SWS pipeline service -that performs pipeline management. This service provides a control -interface, in the figure it is represented by Workflow Management -Interface. - -*Workflow Management Interface* represents an internal process control -interface, this component will be analysed in a separate diagram. - -*Data Visualization Service* is a service that manages logs and pipeline -data to present them in a form of dashboards. - -*Data Monitoring Interface* is a data visualization and dashboard -editing interface offered by the Data Visualization Service. 
- -*Message Digest Service* is a service that serves the transformation -component of the TED-SWS pipeline, within the transformation to ensure -custom RML functions, an external service is needed that will implement -them. - -*Master Data Management & URI Allocation Service* is a service for -storing and managing unique URIs, this service performs URI -deduplication. - -The *TED-SWS pipeline* contains a set of components, all of which access -Notice Aggregate and Mapping Suite objects. - -image:system_arhitecture/media/image4.png[image,width=100%,height=318] - -Figure 1.3 TED-SWS architecture at business level - -Figure 1.4 shows the TED-SWS pipeline and its components, and this view -aims to show the connection between the components. - -The pipeline has the following components: - -* Fetching Service -* XML Indexing Service -* Metadata Normalization Service -* Transformation Service; -* Entity Resolution & Deduplication Service -* Validation Service -* Packaging Service -* Publishing Service -* Mapping Suite Loading Service - -*Fetching Service* is a service that extracts notices from the TED -website and stores them in the database. - -*XML Indexing Service* is a service that extracts all unique XPaths from -an XML and stores them as metadata. Unique XPaths are used later to -validate if the transformation to RDF format, has been done for all -XPaths from a notice in XML format. - -*Metadata Normalization Service* is a service that normalises the -metadata of a notice in an internal work format. This normalised -metadata will be used in other processes on a notice, such as the -selection of a Mapping Suite for transformation or validation of a -notice. - -*Transformation Service* is the service that transforms a notice from -the XML format into the RDF format, using for this a Mapping Suite that -contains the RML transformation rules that will be applied. - -*Entity Resolution & Deduplication Service* is a service that performs -the deduplication of entities from RDF manifestation, namely -Organization and Procedure entities. - -*Validation Service* is a service that validates a notice in RDF format, -using for this several types of validations, namely validation using -SHACL shapes, validation using SPARQL tests and XPath coverage -verification. - -*Packaging Service* is a service that creates a METS package that will -contain notice RDF manifestation. - -*Publishing Service* is a service that publishes a notice RDF -manifestation in the required format, in the case of Cellar the -publication takes place with a METS package. - -image:system_arhitecture/media/image5.png[image,width=100%,height=154] - -Figure 1.4 TED-SWS pipeline architecture at business level - -=== Process single notice pipeline architecture - -The pipeline for processing a notice is the key element in the TED-SWS -system, the architecture of this pipeline from the business point of -view is represented in Figure 2. Unlike the previously presented -figures, in Figure 2 the pipeline is rendered in greater detail and are -presented relationships between pipeline steps and the artefacts that -produce or use them. - -Based on Figure 2, it can be noted that the pipeline is not a linear -one, within the pipeline there are control steps that check whether the -following steps should be executed for a notice. 
-
-There are 3 control steps in the pipeline, namely:
-
-* Check notice eligibility for transformation
-* Check notice eligibility for packaging
-* Check notice availability in Cellar
-
-The “Check notice eligibility for transformation” step checks whether a
-notice can be transformed with a Mapping Suite. If it can be
-transformed, it goes on to the transformation step; otherwise
-the notice is stored for future processing.
-
-The “Check notice eligibility for packaging” step checks if a notice RDF
-manifestation, after the validation step, is valid for packaging in a METS
-package. If it is valid, the pipeline proceeds to the packaging step;
-otherwise, the intermediate result is stored for further analysis.
-
-The “Check notice availability in Cellar” step checks, after the
-publication step in Cellar, if a published notice is already accessible
-in Cellar. If the notice is accessible, then the pipeline is finished;
-otherwise the published notice is stored for further analysis.
-
-Pipeline steps produce and use artefacts such as:
-
-* TED-XML notice & metadata
-* Mapping rules
-* TED-RDF notice
-* Test suites
-* Validation report
-* METS Package activation
-
-image:system_arhitecture/media/image2.png[image,width=100%,height=177]
-
-Figure 2 Single notice processing pipeline at business level
-
-Based on Figure 2, we can observe that the artefacts of a notice appear
-as the notice passes certain steps in the pipeline. To be able to
-conveniently manage the state of a notice and all its artefacts
-depending on its state, a notice is modelled as an aggregate of artefacts
-and a state, which changes dynamically during the pipeline.
-
-== Dynamic behaviour of architecture
-
-In this section, we address the following questions:
-
-* How is the data organised?
-* How does the data structure evolve within the process?
-* What does the business process look like?
-* How is the business process realised in the application?
-
-=== Notice status transition map
-
-The TED-SWS pipeline implements a hybrid architecture based on an ETL
-pipeline combined with a status transition map for each notice. The
-TED-SWS pipeline has many steps and is not linear; for such a complex
-pipeline, with multiple steps and branches, using a notice status
-transition map is a good architectural choice for several reasons:
-
-[arabic]
-. *Visibility*: A notice status transition map provides a clear and visual
-representation of the different stages that a notice goes through in the
-pipeline. This allows for better visibility into the pipeline, making it
-easier to understand the flow of data and to identify any issues or
-bottlenecks.
-
-. *Traceability*: A notice status transition map allows for traceability
-of notices in the pipeline, which means that it's possible to track a
-notice as it goes through the different stages of the pipeline. This can
-be useful for troubleshooting, as it allows for the identification of
-the stage at which the notice failed or had an issue.
-
-. *Error Handling*: A notice status transition map allows for the
-definition of error handling procedures for each stage in the pipeline.
-This can be useful for identifying and resolving errors that occur in
-the pipeline, as it allows for a clear understanding of what went wrong
-and what needs to be done to resolve the issue.
-
-. 
*Auditing*: A notice status transition map allows for the auditing of -notices in the pipeline, which means that it's possible to track the -history of a notice, including when it was processed, by whom, and -whether it was successful or not. - -. *Monitoring*: A notice status transition map allows for the monitoring -of notices in the pipeline, which means that it's possible to track the -status of a notice, including how many notices are currently being -processed, how many have been processed successfully, and how many have -failed. - -. *Automation*: A notice status transition map can be used to automate -some of the process, by defining rules or triggers to move notices -between different stages of the pipeline, depending on the status of the -notice. - - -Each notice has a status during the pipeline, a status corresponds to a -step in the pipeline that the notice passed. Figure 3.1 shows the -transition flow of the status of a notice, as a note we must take into -account that a notice can only be in one status at a given time. -Initially, each notice has the status of RAW and the last status, which -means finishing the pipeline, is the status of PUBLICLY_AVAILABLE. - -Based on the use cases of this pipeline, the following statuses of a -notice are of interest to the end user: - -* RAW -* NORMALISED_METADATA -* INELIGIBLE_FOR_TRANSFORMATION -* TRANSFORMED -* VALIDATED -* INELIGIBLE_FOR_PACKAGING -* PACKAGED -* INELIGIBLE_FOR_PUBLISHING -* PUBLISHED -* PUBLICLY_UNAVAILABLE -* PUBLICLY_AVAILABLE - -image:system_arhitecture/media/image6.png[image,width=546,height=402] - -Figure 3.1 Notice status transition - -The names of the statuses are self-descriptive, but attention should be -drawn to some statuses, namely: - -* INDEXED -* NORMALISED_METADATA -* DISTILLED -* PUBLISHED -* PUBLICLY_UNAVAILABLE -* PUBLICLY_AVAILABLE - -The INDEXED status means that the set of unique XPaths appearing in its -XML manifestation has been calculated for a notice. The unique set of -XPaths is subsequently required when calculating the XPath coverage -indicator for the transformation. - -The NORMALISED_METADATA status means that for a notice, its metadata has -been normalised. The metadata of a notice is normalised in an internal -format to be able to check the eligibility of a notice to be transformed -with a Mapping Suite package. - -The status DISTILLED is used to indicate that the RDF manifestation of a -notice has been post processed. The post-processing of an RDF -manifestation provides for the deduplication of the Procedure or -Organization type entities and the insertion of corresponding triplets -within this RDF manifestation. - -The PUBLISHED status means that a notice has been sent to Cellar, which -does not mean that it is already available in Cellar. Since there is a -time interval between the transmission and the actual appearance in the -Cellar, it is necessary to check later whether a notice is available in -the Cellar or not. If the verification has taken place and the notice is -available in the Cellar, it is assigned the status of -PUBLICLY_AVAILABLE, if it is not available in the Cellar, the status of -PUBLICLY_UNAVAILABLE is assigned to it. - -=== Notice structure - -Notice structure has a NoSQL data model, this architecture choice is -based on dynamic behaviour of notice structure which evolves over time -while TED-SWS pipeline running and besides that there are other reasons: - -[arabic] -. 
*Schema-less*: NoSQL databases are schema-less, which means that the -data structure can change without having to modify the database schema. -This allows for more flexibility when processing data, as new data types -or fields can be easily added without having to make changes to the -database. This is particularly useful for notices that are likely to -evolve over time, as the structure of the notices can change without -having to make changes to the database. - -. *Handling Unstructured Data*: NoSQL databases are well suited for -handling unstructured data, such as JSON or XML, that can't be handled -by SQL databases. This is particularly useful for ETL pipelines that -need to process unstructured data, as notices are often unstructured and -may evolve over time. -. *Handling Distributed Data*: NoSQL databases are designed to handle -distributed data, which allows for data to be stored and processed on -multiple servers. This can help to improve performance and scalability, -as well as provide fault tolerance. This is particularly useful for -notices that are likely to evolve over time, as the volume of data may -increase and need to be distributed. - -. *Flexible Querying*: NoSQL databases allow for flexible querying, which -means that the data can be queried in different ways, including by -specific fields, by specific values, and by ranges. This allows for more -flexibility when querying the data, as the structure of the notices may -evolve over time. -. *Cost-effective*: NoSQL databases are generally less expensive than SQL -databases, as they don't require expensive hardware or specialized -software. This can make them a more cost-effective option for ETL -pipelines that need to handle large amounts of data and that are likely -to evolve over time. - - -Overall, a NoSQL data model is a good choice for notice structure in an -ETL pipeline that is likely to evolve over time because it allows for -more flexibility when processing data, handling unstructured data, -handling distributed data, flexible querying and it's cost-effective. - -Figure 3.2 shows the structure of a notice and its evolution depending -on the state in which a notice is located. In the given figure, the -emphasis is placed on the states from which a certain part of the -structure of a notice is present. As a remark, it should be taken into -account that once an element of the structure of a notice is present for -a certain state, it will also be present for all the states derived from -it, such as the flow of states presented in Figure 3.1. - -image:system_arhitecture/media/image3.png[image,width=567,height=350] - -Figure 3.2 Dynamic behaviour of notice structure based on status - -Based on Figure 3.2, it is noted that the structure of a notice evolves -with the transition to other states. - -For a notice in the state of NORMALISED_METADATA, we can access the -following fields of a notice: - -* Original Metadata -* Normalised Metadata -* XML Manifestation - -For a notice in the TRANSFORMED state, we can access all the previous -fields and the following new fields of a notice: - -* RDF Manifestation. 
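To make the schema-less representation more tangible, the sketch below shows what a notice document might roughly look like once the TRANSFORMED state is reached. The field names, nesting and example values are illustrative assumptions and do not necessarily match the actual TED-SWS collections.

[source,python]
----
# Hypothetical shape of a notice document after the TRANSFORMED state is reached.
# Field names and values are assumptions for illustration only.
transformed_notice = {
    "ted_id": "067623-2022",  # example notice identifier
    "status": "TRANSFORMED",
    "original_metadata": {"publication_date": "2022-02-07", "languages": ["EN", "FR"]},
    "normalised_metadata": {"xsd_version": "R2.0.9.S05"},
    "xml_manifestation": {"object_data": "<TED_EXPORT>...</TED_EXPORT>"},
    "rdf_manifestation": {"object_data": "@prefix epo: <...> ."},
    # Fields added only by later states are simply absent here, e.g.:
    # "validation_report": {...},
    # "mets_manifestation": {...},
}

# A later pipeline step can extend the same document without any schema migration:
transformed_notice["validation_report"] = {"sparql": [], "shacl": [], "xpath_coverage": {}}
transformed_notice["status"] = "VALIDATED"
----

Because the store is schema-less, the fields described next for the VALIDATED and PACKAGED states can be added to the same document as the notice progresses.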
-
-For a notice in the VALIDATED state, we can access all the previous
-fields and the following new fields of a notice:
-
-* XPath Coverage Validation
-* SHACL Validation
-* SPARQL Validation
-
-For a notice in the PACKAGED state, we can access all the previous
-fields and the following new fields of a notice:
-
-* METS Manifestation
-
-=== Application view of the process
-
-The primary actor of the TED-SWS system will be the Operations Manager,
-who will interact with the system. Application-level pipeline control is
-achieved through the Airflow stack. Figure 4 shows the AirflowUser actor
-representing the Operations Manager; this diagram is at the application
-level of the process.
-
-image:system_arhitecture/media/image7.png[image,width=534,height=585]
-
-Figure 4 Dependencies between Airflow DAGs
-
-Based on the use cases defined for an Operations Manager, Figure 4 shows
-the control functionality of the TED-SWS pipeline that this actor can use. In
-addition to the functionality available for the AirflowUser actor, the
-dependencies between DAGs are also rendered. We can note that another actor
-named AirflowScheduler is defined; this actor represents the automatic
-execution of certain DAGs at scheduled time intervals.
-
-== Architectural choices
-
-This section describes the following architectural choices:
-
-* How is this an SOA? (It is service-oriented, but not built on REST
-microservices.) Why not microservices?
-* Why a NoSQL data model vs an SQL data model?
-* Why an ETL/ELT approach vs event sourcing?
-* Why batch processing vs event streams?
-* Why Airflow?
-* Why Metabase?
-* Why a quick deduplication process? And what are the plans for the
-future?
-
-=== Why is this an SOA (service-oriented architecture)?
-
-ETL (Extract, Transform, Load) architecture is considered
-state-of-the-art for batch processing tasks using Airflow as pipeline
-management for several reasons:
-
-[arabic]
-. *Flexibility*: ETL architecture allows for flexibility in the data
-pipeline as it separates the data extraction, transformation, and
-loading processes. This allows for easy modification and maintenance of
-each individual step without affecting the entire pipeline.
-. *Scalability*: ETL architecture allows for the easy scaling of data
-processing tasks, as new data sources can be added or removed without
-impacting the entire pipeline.
-. *Error Handling*: ETL architecture allows for easy error handling as
-each step of the pipeline can be monitored and errors can be isolated to
-a specific step.
-. *Reusability*: ETL architecture allows for the reuse of existing data
-pipelines, as new data sources can be added without modifying existing
-pipelines.
-. *System management*: Airflow is an open-source workflow management
-system that allows for easy scheduling, monitoring, and management of
-data pipelines. It integrates seamlessly with ETL architecture and
-allows for easy management of complex data pipelines.
-
-Overall, ETL architecture combined with Airflow as pipeline management
-provides a robust and efficient solution for batch processing tasks.
-
-=== Why Monolithic Architecture vs Microservices Architecture?
-
-There are several reasons why a monolithic architecture may be more
-suitable for an ETL architecture with a batch processing pipeline using
-Airflow as the pipeline management tool:
-
-[arabic]
-. *Simplicity*: A monolithic architecture is simpler to design and
-implement as it involves a single codebase and a single deployment
-process. This makes it easier to manage and maintain the ETL pipeline.
-. 
*Performance*: A monolithic architecture may be more performant than a -microservices architecture as it allows for more efficient communication -between the different components of the pipeline. This is particularly -important for batch processing pipelines, where speed and efficiency are -crucial. -. *Scalability*: Monolithic architectures can be scaled horizontally by -adding more resources to the system, such as more servers or more -processing power. This allows for the system to handle larger amounts of -data and handle more complex processing tasks. -. *Airflow Integration*: Airflow is designed to work with monolithic -architectures, and it can be more difficult to integrate with a -microservices architecture. Airflow's DAGs and tasks are designed to -work with a single codebase, and it may be more challenging to manage -different services and pipelines across multiple microservices. - -Overall, a monolithic architecture may be more suitable for an ETL -architecture with batch processing pipeline using Airflow as the -pipeline management tool due to its simplicity, performance, -scalability, and ease of integration with Airflow. - -=== Why ETL/ELT approach vs Event Sourcing ? - -ETL (Extract, Transform, Load) architecture is typically used for moving -and transforming data from one system to another, for example, from a -transactional database to a data warehouse for reporting and analysis. -It is a batch-oriented process that is typically scheduled to run at -specific intervals. - -Event sourcing architecture, on the other hand, is a way of storing and -managing the state of an application by keeping track of all the changes -to the state as a sequence of events. This allows for better auditing -and traceability of the state of the application over time, as well as -the ability to replay past events to reconstruct the current state. -Event sourcing is often used in systems that require high performance, -scalability, and fault tolerance. - -In summary, ETL architecture is mainly used for data integration and -data warehousing, Event sourcing is mainly used for building highly -scalable and fault-tolerant systems that need to store and manage the -state of an application over time. - -A hybrid architecture is implemented in the TED-SWS pipeline, based on -an ETL architecture but with state storage to repeat a pipeline sequence -as needed. - -=== Why Batch processing vs Event Streams? - -Batch processing architecture and Event Streams architecture are two -different approaches to processing data in code. - -Batch processing architecture is a traditional approach where data is -processed in batches. This means that data is collected over a period of -time and then processed all at once in a single operation. This approach -is typically used for tasks such as data analysis, data mining, and -reporting. It is best suited for tasks that can be done in a single pass -and do not require real-time processing. - -Event Streams architecture, on the other hand, is a more modern approach -where data is processed in real-time as it is generated. This means that -data is processed as soon as it is received, rather than waiting for a -batch to be collected. This approach is typically used for tasks such as -real-time monitoring, data analytics, and fraud detection. It is best -suited for tasks that require real-time processing and cannot be done in -a single pass. 
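As a small illustration of the batch notion used throughout this section, the sketch below groups notices into daily batches by publication date. The input structure is an assumption made for the example, not the actual TED-SWS data model.

[source,python]
----
from collections import defaultdict

# Minimal sketch: group notice identifiers into daily batches.
# The notice dictionaries are illustrative; the real pipeline reads them from its store.
notices = [
    {"ted_id": "067623-2022", "publication_date": "2022-02-07"},
    {"ted_id": "067624-2022", "publication_date": "2022-02-07"},
    {"ted_id": "068101-2022", "publication_date": "2022-02-08"},
]

daily_batches = defaultdict(list)
for notice in notices:
    daily_batches[notice["publication_date"]].append(notice["ted_id"])

for day, batch in sorted(daily_batches.items()):
    # Each batch would be handed to one run of the processing pipeline.
    print(day, batch)
----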
- -In summary, Batch processing architecture is best suited for tasks that -can be done in a single pass and do not require real-time processing, -whereas Event Streams architecture is best suited for tasks that require -real-time processing and cannot be done in a single pass. - -Due to the fact that the TED-SWS pipeline has an ETL architecture, the -data processing is done in batches, the batches of notices are formed -per day, all the notices of a day form a batch that will be processed. -Another method of creating a batch is grouping notices by status and -executing the pipeline depending on their status. - -=== Why NoSQL data model vs SQL data model? - -There are several reasons why a NoSQL data model may be more suitable -for an ETL architecture with batch processing pipeline compared to a SQL -data model: - -[arabic] -. *Scalability*: NoSQL databases are designed to handle large amounts of -data and can scale horizontally, allowing for the easy addition of more -resources as the amount of data grows. This is particularly useful for -batch processing pipelines that need to handle large amounts of data. -. *Flexibility*: NoSQL databases are schema-less, which means that the -data structure can change without having to modify the database schema. -This allows for more flexibility when processing data, as new data types -or fields can be easily added without having to make changes to the -database. -. *Performance*: NoSQL databases are designed for high-performance and can -handle high levels of read and write operations. This is particularly -useful for batch processing pipelines that need to process large amounts -of data in a short period of time. - -. *Handling Unstructured Data*: NoSQL databases are well suited for -handling unstructured data, such as JSON or XML, that can't be handled -by SQL databases. This is particularly useful for ETL pipelines that -need to process unstructured data. - -. *Handling Distributed Data*: NoSQL databases are designed to handle -distributed data, which allows for data to be stored and processed on -multiple servers. This can help to improve performance and scalability, -as well as provide fault tolerance. - -. *Cost*: NoSQL databases are generally less expensive than SQL databases, -as they don't require expensive hardware or specialized software. This -can make them a more cost-effective option for ETL pipelines that need -to handle large amounts of data. - -Overall, a NoSQL data model may be more suitable for an ETL architecture -with batch processing pipeline compared to a SQL data model due to its -scalability, flexibility, performance, handling unstructured data, -handling distributed data and the cost-effectiveness. It is important to -note that the choice to use a NoSQL data model satisfies the specific -requirements of the TED-SWS processing pipeline and the nature of the -data to be processed. - -=== Why Airflow? - -Airflow is a great solution for ETL pipeline and batch processing -architecture because it provides several features that are well-suited -to these types of tasks. First, Airflow provides a powerful scheduler -that allows you to define and schedule ETL jobs to run at specific -intervals. This means that you can set up your pipeline to run on a -regular schedule, such as every day or every hour, without having to -manually trigger the jobs. Second, Airflow provides a web-based user -interface that makes it easy to monitor and manage your pipeline. 
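To illustrate the scheduling aspect, a minimal Airflow DAG definition is sketched below. The DAG id, schedule and task body are assumptions made for the example; they are not the actual TED-SWS DAG definitions.

[source,python]
----
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def fetch_daily_notices():
    # Placeholder task body; the real fetching logic lives in the TED-SWS code base.
    print("fetching yesterday's notices ...")


# A DAG scheduled to run once a day; the Airflow web UI then shows every run and task state.
with DAG(
    dag_id="example_fetch_notices_daily",  # illustrative name, not a real TED-SWS DAG
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="fetch_notices", python_callable=fetch_daily_notices)
----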
-
-Both aspects of Airflow are perfectly compatible with the needs of the
-TED-SWS architecture and the use cases required for an Operations
-Manager who will interact with the system. Airflow therefore covers the
-needs of batch processing management and ETL pipeline management.
-
-Airflow provides good coverage of the use cases relevant to an Operations
-Manager, in particular the following:
-
-[arabic]
-. *Monitoring pipeline performance*: An operations manager can use Airflow
-to monitor the performance of the ETL pipeline and identify any
-bottlenecks or issues that may be impacting the pipeline's performance.
-They can then take steps to optimize the pipeline to improve its
-performance and ensure that data is being processed in a timely and
-efficient manner.
-
-. *Managing pipeline schedule*: The operations manager can use Airflow to
-schedule the pipeline to run at specific times, such as during off-peak
-hours or when resources are available. This can help to minimize the
-impact of the pipeline on other systems and ensure that data is
-processed in a timely manner.
-
-. *Managing pipeline resources*: The operations manager can use Airflow to
-manage the resources used by the pipeline, such as CPU, memory, and
-storage. They can also use Airflow to scale the pipeline up or down as
-needed to meet changing resource requirements.
-
-. *Managing pipeline failures*: Airflow allows the operations manager to
-set up notifications and alerts for when a pipeline fails or a task
-fails. This allows them to quickly identify and address any issues that
-may be impacting the pipeline's performance.
-
-. *Managing pipeline dependencies*: The operations manager can use Airflow
-to manage the dependencies between different tasks in the pipeline, such
-as ensuring that notice fetching is completed before notice indexing or
-notice metadata normalization.
-
-. *Managing pipeline versioning*: Airflow allows the operations manager to
-maintain different versions of the pipeline, which can be useful for
-testing new changes before rolling them out to production.
-
-. *Managing pipeline security*: Airflow allows the operations manager to
-set up security controls to protect the pipeline and the data it
-processes. They can also use Airflow to audit and monitor access to the
-pipeline and the data it processes.
-
-=== Why Metabase?
-
-Metabase is an excellent solution for data analysis and KPI monitoring
-for a batch processing system, as it offers several key features that
-make it well suited to the type of use cases required within the
-TED-SWS system.
-
-First, Metabase is highly customizable, allowing users to create and
-modify dashboards, reports, and visualizations to suit their specific
-needs. This makes it easy to track and monitor the key performance
-indicators (KPIs) that are most important for the batch processing
-system, such as the number of jobs processed, the average processing
-time, and the success rate of job runs.
-
-Second, Metabase offers a wide range of data connectors, allowing users
-to easily connect to and query data sources such as SQL databases, NoSQL
-databases, CSV files, and APIs. This makes it easy to access and analyze
-the data that is relevant to the batch processing system. In TED-SWS the
-data domain model is realized by a document-based data model, not a
-tabular relational data model, so Metabase is a good tool for analyzing
-data with a document-based model. A sketch of the kind of document query
-that could back such a dashboard is shown below.
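For illustration only, a dashboard question such as "how many notices are in each status?" could be backed by a document aggregation like the one sketched below. The connection string, database and collection names are assumptions, not the actual deployment values.

[source,python]
----
from pymongo import MongoClient

# Assumed connection details and collection name, for illustration only.
client = MongoClient("mongodb://localhost:27017")
notices = client["ted_sws_db"]["notices"]

# Count notices per status - the kind of aggregation a Metabase card can visualise.
pipeline = [
    {"$group": {"_id": "$status", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
]
for row in notices.aggregate(pipeline):
    print(row["_id"], row["count"])
----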
-
-Third, Metabase has a user-friendly interface that makes it easy to
-navigate and interact with data, even for users with little or no
-technical experience. This makes it accessible to a wide range of users,
-including business analysts, data scientists, and other stakeholders who
-need to monitor and analyse the performance of the batch processing
-system.
-
-Finally, Metabase offers robust security and collaboration features,
-making it easy to share and collaborate on data and insights with team
-members and stakeholders. This makes it an ideal solution for
-organizations that need to monitor and analyse the performance of a
-batch processing system across multiple teams or departments.
-
-=== Why a quick deduplication process?
-
-One of the main challenges in entity deduplication in the semantic
-web domain is dealing with the complexity and diversity of the data.
-This can include dealing with different data formats, schemas, and
-vocabularies, as well as handling missing or incomplete data.
-Additionally, entities may have multiple identities or representations,
-making it difficult to determine which entities are duplicates and which
-are distinct. Another difficulty is the scalability of the algorithm to
-handle large amounts of data. The algorithm should remain efficient and
-accurate when handling a huge number of entities.
-
-There are several approaches and solutions for entity deduplication in
-the semantic web. Some of the top solutions include:
-
-[arabic]
-. *String-based methods*: These methods use string comparison techniques
-such as Jaccard similarity, Levenshtein distance, and cosine similarity
-to identify duplicates based on the similarity of their string
-representations.
-. *Machine learning-based methods*: These methods use machine learning
-algorithms such as decision trees, random forests, and neural networks
-to learn patterns in the data and identify duplicates.
-
-. *Knowledge-based methods*: These methods use external knowledge sources
-such as ontologies, taxonomies, and linked data to disambiguate entities
-and identify duplicates.
-
-. *Hybrid methods*: These methods combine multiple techniques, such as
-string-based and machine learning-based methods, to improve the accuracy
-of deduplication.
-
-. *Blocking method*: This method is used to reduce the number of entities
-that need to be compared by grouping similar entities together.
-
-In the TED-SWS pipeline, the deduplication of Organization type entities
-is performed using string-based methods. String-based methods are
-often used for organization entity deduplication because of their
-simplicity and effectiveness.
-
-TED data often contains information about tenders and public
-procurement, where organizations are identified by their names.
-Organization names are often unique and can be used to identify
-duplicates with high accuracy. String-based methods can be used to
-compare the similarity of different organization names, which can be
-effective in identifying duplicates.
-
-Additionally, the TED data is highly structured, so it is easy to
-extract and compare the names of organizations. String-based methods are
-also relatively fast and easy to implement, making them a good choice
-for large data sets. These methods may not be as effective for other
-types of entities, such as individuals, where additional information may
-be needed to identify duplicates. A minimal sketch of this kind of name
-comparison is given below.
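The snippet below is a minimal sketch of such a string-based comparison of organisation names, combining a token-set (Jaccard) score with a character-based ratio from the Python standard library. It is illustrative only and is not the actual TED-SWS deduplication implementation; the threshold value is an arbitrary assumption.

[source,python]
----
from difflib import SequenceMatcher


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the token sets of two names."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def looks_like_same_organisation(name_a: str, name_b: str, threshold: float = 0.85) -> bool:
    """Combine a token-based and a character-based similarity score."""
    char_ratio = SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()
    score = max(jaccard(name_a, name_b), char_ratio)
    return score >= threshold


# Example: two slightly different spellings of the same buyer name.
print(looks_like_same_organisation("Ministry of Finance", "Ministry  of  Finance"))
----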
It's also important to note that -string-based methods may not work as well for misspelled or abbreviated -names. - -Using a quick and dirty deduplication approach instead of a complex -system at the first iteration of a system implementation can be -beneficial for several reasons: - -[arabic] -. *Speed*: A quick approach can be implemented quickly and can -help to identify and remove duplicates quickly. This can be particularly -useful when working with large and complex data sets, where a more -complex approach may take a long time to implement and test. -. *Cost*: A quick and dirty approach is generally less expensive to -implement than a complex system, as it requires fewer resources and less -development time. -. *Simplicity*: A quick and dirty approach is simpler and easier to -implement than a complex system, which can reduce the risk of errors and -bugs. -. *Flexibility*: A quick and dirty approach allows to start with a basic -system and adapt it as needed, which can be more flexible than a complex -system that is difficult to change. - -. *Testing*: A quick and dirty approach allows to test the system quickly, -and get feedback from the users and stakeholders, and then use that -feedback to improve the system. - - -However, it's worth noting that the quick and dirty approach is not a -long-term solution and should be used only as a first step in the -implementation of a MDR system. This approach can help to quickly -identify and remove duplicates and establish a basic system, but it may -not be able to handle all the complexity and diversity of the data, so -it's important to plan for and implement more advanced techniques as the -system matures. - -=== What are the plans for the future deduplication? - -In the future, another Master Data Registry type system will be used to -deduplicate entities in the TED-SWS system, which will be implemented -according to the requirements for deduplication of entities from -notices. - -The future Master Data Registry (MDR) system for entity deduplication -should have the following architecture: - -[arabic] -. *Data Ingestion*: This component is responsible for extracting and -collecting data from various sources, such as databases, files, and -APIs. The data is then transformed, cleaned, and consolidated into a -single format before it is loaded into the MDR. - -. *Data Quality*: This component is responsible for enforcing data quality -rules, such as format, completeness, and consistency, on the data before -it is entered into the MDR. This can include tasks such as data -validation, data standardization, and data cleansing. - -. *Entity Dedup*: This component is responsible for identifying and -removing duplicate entities in the MDR. This can be done using a -combination of techniques such as string-based, machine learning-based, -or knowledge-based methods. - -. *Data Governance*: This component is responsible for ensuring that the -data in the MDR is accurate, complete, and up-to-date. This can include -processes for data validation, data reconciliation, and data -maintenance. - -. *Data Access and Integration*: This component provides access to the MDR -data through a user interface and API's, and integrates the MDR data -with other systems and applications. - -. *Data Security*: This component is responsible for ensuring that the -data in the MDR is secure, and that only authorized users can access it. -This can include tasks such as authentication, access control, and -encryption. - -. 
*Data Management*: This component is responsible for managing the data
-in the MDR, including tasks such as data archiving, data backup, and
-data recovery.
-
-. *Monitoring and Analytics*: This component is responsible for monitoring
-and analysing the performance of the MDR system, and for providing
-insights into the data to help improve the system.
-
-. *Services layer*: This component is responsible for providing services
-such as indexing, search, and query functionalities over the data.
-
-
-All these components should be integrated and work together to provide a
-comprehensive and efficient MDR system for entity deduplication. The
-system should be scalable and flexible enough to handle large amounts of
-data and adapt to changing business requirements.
-
-
-
diff --git a/docs/antora/modules/ROOT/pages/demo_installation.adoc b/docs/antora/modules/ROOT/pages/technical/demo_installation.adoc
similarity index 100%
rename from docs/antora/modules/ROOT/pages/demo_installation.adoc
rename to docs/antora/modules/ROOT/pages/technical/demo_installation.adoc
diff --git a/docs/antora/modules/ROOT/pages/event_manager.adoc b/docs/antora/modules/ROOT/pages/technical/event_manager.adoc
similarity index 100%
rename from docs/antora/modules/ROOT/pages/event_manager.adoc
rename to docs/antora/modules/ROOT/pages/technical/event_manager.adoc
diff --git a/docs/antora/modules/ROOT/pages/mapping_suite_cli_toolchain.adoc b/docs/antora/modules/ROOT/pages/technical/mapping_suite_cli_toolchain.adoc
similarity index 99%
rename from docs/antora/modules/ROOT/pages/mapping_suite_cli_toolchain.adoc
rename to docs/antora/modules/ROOT/pages/technical/mapping_suite_cli_toolchain.adoc
index af1253057..34df96423 100644
--- a/docs/antora/modules/ROOT/pages/mapping_suite_cli_toolchain.adoc
+++ b/docs/antora/modules/ROOT/pages/technical/mapping_suite_cli_toolchain.adoc
@@ -10,8 +10,8 @@ Open a Linux terminal and clone the `ted-rdf-mapping` project.
 
 [source,bash]
 ----
-git clone https://github.com/OP-TED/ted-rdf-mapping
-cd ted-rdf-mapping
+git clone https://github.com/meaningfy-ws/mapping-workbench
+cd mapping-workbench
 ----
 
 Create a virtual Python environment and activate it.
@@ -34,7 +34,8 @@ Install the TED-SWS CLIs as a Python package using the `pip` package manager.
 
 [source,bash]
 ----
-pip install git+https://github.com/OP-TED/ted-rdf-conversion-pipeline#egg=ted-sws
+make install
+make local-dotenv-file
 ----
 
 == Usage
diff --git a/docs/antora/modules/ROOT/pages/ted-sws-introduction.adoc b/docs/antora/modules/ROOT/pages/ted-sws-introduction.adoc
new file mode 100644
index 000000000..a10fb6567
--- /dev/null
+++ b/docs/antora/modules/ROOT/pages/ted-sws-introduction.adoc
@@ -0,0 +1,38 @@
+== Introduction
+
+Although TED notice data is already available to the general public
+through the search API provided by the TED website, the current offering
+has many limitations that impede access to and reuse of the data. One
+important impediment, for example, is the current format of the data.
+
+Historical TED data come in various XML formats that evolved together
+with the standard TED XML schema. The imminent introduction of eForms
+will also introduce further diversity in the XML data formats available
+through TED's search API. This makes it practically impossible for users
+to consume and process data that span across several years, as
+their information systems must be able to process several different
+flavours of the available XML schemas as well as to keep up with the
+schema's continuous evolution.
Their search capabilities are therefore
+confined to a very limited set of metadata.
+
+The TED Semantic Web Service will remove these barriers by providing one
+common format for accessing and reusing all TED data. Coupled with the
+eProcurement Ontology, the TED data will also have semantics attached to
+them, allowing users to directly link them with other datasets.
+Moreover, users will now be able to perform much more elaborate
+queries directly on the data source (through the SPARQL endpoint). This
+will reduce their need for data warehousing in order to perform complex
+queries.
+
+These developments, by lowering the barriers, will give rise to a vast
+number of new use-cases that will enable stakeholders and end-users to
+benefit from increased availability of analytics. The ability to perform
+complex queries on public procurement data will be equally open to large
+information systems as well as to simple desktop users with a copy of
+Excel and an internet connection.
+
+To summarize, the TED Semantic Web Service (TED SWS) is a pipeline
+system that continuously converts the public procurement notices (in XML
+format) available on the TED Website into RDF format, publishes them
+into CELLAR and makes them available to the public through CELLAR’s
+SPARQL endpoint.
\ No newline at end of file
diff --git a/docs/antora/modules/ROOT/pages/using_procurement_data.adoc b/docs/antora/modules/ROOT/pages/ted_data/using_procurement_data.adoc
similarity index 91%
rename from docs/antora/modules/ROOT/pages/using_procurement_data.adoc
rename to docs/antora/modules/ROOT/pages/ted_data/using_procurement_data.adoc
index 4d8c924c6..129f3652f 100644
--- a/docs/antora/modules/ROOT/pages/using_procurement_data.adoc
+++ b/docs/antora/modules/ROOT/pages/ted_data/using_procurement_data.adoc
@@ -1,25 +1,19 @@
 = Using procurement data
+This page explains how to use procurement data accessed from *Cellar* with Microsoft Excel, Python and R. There are different ways to access TED notices in CELLAR
+and use the data. The methods described below work with TED notices and other types of semantic assets.
+We use a sample SPARQL query which returns a list of countries. Users should use TED-specific SPARQL queries to fetch the data they need.
 
-This page explains how to use procurement data accessed from Cellar with Excel, Python, R
-and Power BI.
-There are different ways to access TED notices in CELLAR
-and use the data. As scenarios, each method presented in this page
-will take over the list of European countries and shows them in one
-column.
-
 *Note:* Jupyter Notebook samples are explained with assumption that a
 code editor is already prepared. For example VS Code or Pycharm, or
 Jupyter server. Examples are explained using
 https://code.visualstudio.com/docs[[.underline]#Visual Studio Code#].
 
-== Excel
+== Microsoft Excel
 
-This chapter shows an example using Excel. Microsoft Excel is a
-spreadsheet developed by Microsoft through which we will use the
-interface to query CELLAR repository to see an example.
+This chapter shows an example of getting data from Cellar using Microsoft Excel.
 
 [arabic]
 . 
Prepare link with necessary query: diff --git a/docs/antora/modules/ROOT/pages/user_manual.adoc b/docs/antora/modules/ROOT/pages/user_manual.adoc deleted file mode 100644 index a15c376c0..000000000 --- a/docs/antora/modules/ROOT/pages/user_manual.adoc +++ /dev/null @@ -1,1338 +0,0 @@ -= TED-SWS User manual - -[width="100%",cols="25%,75%",options="header",] -|=== -|*Editors* |Dragos Paun - + -Eugeniu Costetchi - -|*Version* |1.0.0 - -|*Date* |20/02/2023 -|=== - -== Glossary [[glossary]] - -*Airflow* - an open-source platform for developing, scheduling, and -monitoring batch-oriented pipelines. The web interface helps manage the -state and monitoring of your pipelines. - -*Metabase* - Metabase is the BI tool with the friendly UX and integrated -tooling to let you explore data gathered by running the pipelines -available in Airflow. - -*Cellar* - is the central content and metadata repository of the -Publications Office of the European Union - -*TED-SWS* - is a pipeline system that continuously converts the public -procurement notices (in XML format) available on the TED Website into -RDF format and publishes them into CELLAR - -*DAG* - (Directed Acyclic Graph) is the core concept of Airflow, -collecting Tasks together, organized with dependencies and relationships -to say how they should run. The DAGS are basically the pipelines that -run in this project to get the public procurement notices from XML to -RDF and to be published them into CELLAR. - -== Introduction - -Although TED notice data is already available to the general public -through the search API provided by the TED website, the current offering -has many limitations that impede access to and reuse of the data. One -such important impediment is for example the current format of the data. - -Historical TED data come in various XML formats that evolved together -with the standard TED XML schema. The imminent introduction of eForms -will also introduce further diversity in the XML data formats available -through TED's search API. This makes it practically impossible for -reusers to consume and process data that span across several years, as -their information systems must be able to process several different -flavors of the available XML schemas as well as to keep up with the -schema's continuous evolution. Their search capabilities are therefore -confined to a very limited set of metadata. - -The TED Semantic Web Service will remove these barriers by providing one -common format for accessing and reusing all TED data. Coupled with the -eProcurement Ontology, the TED data will also have semantics attached to -them allowing reusers to directly link them with other datasets. -Moreover, reusers will now be able to perform much more elaborate -queries directly on the data source (through the SPARQL endpoint). This -will reduce their need for data warehousing in order to perform complex -queries. - -These developments, by lowering the barriers, will give rise to a vast -number of new use-cases that will enable stakeholders and end-users to -benefit from increased availability of analytics. The ability to perform -complex queries on public procurement data will be equally open to large -information systems as well as to simple desktop users with a copy of -Excel and an internet connection. 
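As a flavour of what such a query looks like, the sketch below sends a simple SPARQL request that lists countries from the EU Publications Office authority table. The endpoint URL and vocabulary URIs are the commonly documented ones but should be treated as assumptions here, and the query is a generic illustration rather than a TED-specific one.

[source,python]
----
import requests

# Cellar SPARQL endpoint (assumed; verify against the official documentation).
ENDPOINT = "http://publications.europa.eu/webapi/rdf/sparql"

QUERY = """
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT DISTINCT ?country
WHERE {
  ?country a skos:Concept ;
           skos:inScheme <http://publications.europa.eu/resource/authority/country> .
}
LIMIT 10
"""

response = requests.get(
    ENDPOINT,
    params={"query": QUERY},
    headers={"Accept": "application/sparql-results+json"},
    timeout=30,
)
response.raise_for_status()

# Print the URI of each country concept returned by the endpoint.
for binding in response.json()["results"]["bindings"]:
    print(binding["country"]["value"])
----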
-
-To summarize, the TED Semantic Web Service (TED SWS) is a pipeline system
-that continuously converts the public procurement notices (in XML
-format) available on the TED Website into RDF format, publishes them
-into CELLAR and makes them available to the public through CELLAR’s
-SPARQL endpoint.
-
-=== Purpose of the document
-
-The purpose of this document is to explain how to use Airflow and
-Metabase to control and monitor the TED-SWS system. This document may be
-updated by the development team as the system evolves.
-
-=== Intended audience
-
-This document is intended for persons involved in controlling and
-monitoring the services offered by the TED-SWS system.
-
-==== Useful Resources [[useful-resources]]
-
-https://www.metabase.com/learn/getting-started/tour-of-metabase[[.underline]#https://www.metabase.com/learn/getting-started/tour-of-metabase#]
-
-https://www.metabase.com/docs/latest/exploration-and-organization/start[[.underline]#https://www.metabase.com/docs/latest/exploration-and-organization/start#]
-
-https://airflow.apache.org/docs/apache-airflow/2.4.3/ui.html[[.underline]#https://airflow.apache.org/docs/apache-airflow/2.4.3/ui.html#]
-(only UI / Screenshots section)
-
-== Architectural overview
-
-This section provides a high level overview of the TED-SWS system and
-its components. As presented in the image below, the system is built from
-a multitude of services / components grouped together to reach the
-end goal. The system can be divided into 2 main parts:
-
-* Controlling and monitoring
-* Core functionality (code base / TED SWS pipeline)
-
-Each part of the system is formed by a group of components.
-
-The controlling and monitoring part, operated by an Operations Manager,
-contains a workflow / pipeline management service (Airflow) and a data
-visualization service (Metabase). Using this group of services, a user
-can control the execution of the existing pipelines and also
-monitor the execution results.
-
-The core functionality has many services developed to accommodate the
-entire transformation process of a public procurement notice (in XML
-format) available on the TED Website into RDF format and to publish it
-into CELLAR. Here is a short description of some of the main services:
-
-* fetching service - fetching the notice from the TED website
-* indexing service - getting the unique XPATHs in a notice XML
-* metadata normalisation service - extracting notice metadata from the XML
-* transformation service - transforming the XML to RDF
-* entity resolution and deduplication service - resolving duplicated
-entities in the RDF
-* validation service - validating the RDF transformation
-* packaging service - creating the METS package
-* publishing service - sending the METS package to CELLAR
-
-image:user_manual/media/image59.png[image,width=100%,height=270]
-
-
-=== Pipelines architecture (Airflow DAGs)
-
-This section presents a graphic representation of the flow and
-dependencies of the available pipelines (DAGs) in Airflow. The
-representation includes two actors, AirflowUser and
-AirflowScheduler, where AirflowUser is the user that enables and
-triggers the DAGs, while AirflowScheduler is the Airflow component that
-starts the DAGs automatically following a schedule. A schematic sketch of
-how such scheduling and DAG chaining can be expressed is shown below.
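The sketch below illustrates, in simplified form, how a scheduled DAG can hand work over to another DAG, which is the pattern described in this section. The code is illustrative only; the real TED-SWS DAG definitions differ, and only the DAG names mentioned in this manual are reused here.

[source,python]
----
from datetime import datetime

from airflow import DAG
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

# Simplified stand-in for fetch_notices_by_date: runs daily and then triggers
# the notice_processing_pipeline DAG, passing a (hypothetical) batch reference.
with DAG(
    dag_id="fetch_notices_by_date",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    TriggerDagRunOperator(
        task_id="trigger_notice_processing_pipeline",
        trigger_dag_id="notice_processing_pipeline",
        # Assumes conf templating is available (Airflow 2.x); "{{ ds }}" is the run date.
        conf={"batch_date": "{{ ds }}"},
    )
----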
- -The automatic triggered DAGs controlled by the Airflow Scheduler are: - -* fetch_notices_by_date -* daily_check_notices_availibility_in_cellar -* daily_materialized_views_update - -image:user_manual/media/image63.png[image,width=100%,height=382] - -The DAGs marked with _purple_ (load_mapping_suite_in_database), _yellow_ -(reprocess_unnormalised_notices_from_backlog,reprocess_unpackaged_notices_from_backlog, -reprocess_unpublished_notices_from_backlog,reprocess_untransformed_notices_from_backlog, -reprocess_unvalidated_notices_from_backlog) and _green_ -(fetch_notices_by_date, fetch_notices_by_date_range, -fetch_notices_by_query) will trigger automatically the -*notice_processing_pipeline* marked with _blue_, and this will take care -of the entire processing steps for a notice. These can be used by a user -by manually triggering these DAGs with or without configuration. - -The DAGs marked with _green_ (fetch_notices_by_date, -fetch_notices_by_date_range, fetch_notices_by_query) are in charge of -fetching the notices from TED API. The ones marked with _yellow_ ( -reprocess_unnormalised_notices_from_backlog, -reprocess_unpackaged_notices_from_backlog, -reprocess_unpublished_notices_from_backlog, -reprocess_untransformed_notices_from_backlog, -reprocess_unvalidated_notices_from_backlog) will handle the reprocessing -of notices from the backlog. The purple marked DAG -(load_mapping_suite_in_database) will handle the loading of mapping -suites in the database that will be used to transform the notices. - -image:user_manual/media/image11.png[image,width=100%,height=660] - -== Notice statuses - -During the transformation process through the TED-SWS system, a notice -will start with a certain status and it will transition to other -statuses when a particular step of the pipeline -(notice_processing_pipeline) offered by the system has completed -successfully or unsuccessfully. This transition is done automatically -and it will change the _status_ property of a notice. The system has the -following statuses: - -* RAW -* INDEXED -* NORMALISED_METADATA -* INELIGIBLE_FOR_TRANSFORMATION -* ELIGIBLE_FOR_TRANSFORMATION -* PREPROCESSED_FOR_TRANSFORMATION -* TRANSFORMED -* DISTILLED -* VALIDATED -* INELIGIBLE_FOR_PACKAGING -* ELIGIBLE_FOR_PACKAGING -* PACKAGED -* INELIGIBLE_FOR_PUBLISHING -* ELIGIBLE_FOR_PUBLISHING -* PUBLISHED -* PUBLICLY_UNAVAILABLE -* PUBLICLY_AVAILABLE - -The transition from one status to another is decided by the system and -can be viewed in the graphic representation below. - -image:user_manual/media/image14.png[image,width=100%,height=444] - -== Notice structure - -This section aims at presenting the anatomy of a Notice in the TED-SWS -system and the dependence of structural elements on the phase of the -transformation process. This is useful for the user to understand what -happens behind the scene and what information is available in the -database, to build analytics dashboards. - -The structure of a notice within the TED-SWS system consists of the -following structural elements: - -* Status -* Metadata -** Original Metadata -** Normalised Metadata -* Manifestation -** XMLManifestation -** RDFManifestation -** METSManifestation -* Validation Report -** XPATH Coverage Validation -** SHACL Validation -** SPARQL Validation - -The diagram below shows the high level structure of the Notice object -and that certain structural parts of a notice within the system are -dependent on its state. 
This means that as the transformation process -runs through its steps the Notice state changes and new structural parts -are added. For example, for a notice in the NORMALISED status we can -access the Original Metadata, Normalised Metadata and XMLManifestation -fields, for a notice in the TRANSFORMED status we can access in addition -the RDFManifestation field and similarly for the rest of the statuses. - -The diagram depicts states as swim-lanes while the structural elements -are depicted as ArchiMate Business Objects [cite ArchiMate]. The -relations we use are composition (arrow with diamond ending) and -inheritance (arrow with full triangle ending). - -As was mentioned above about the states through which a notice can -transition, a certain structural field if it is present at a certain -state, then all the states originating from this state will also have -this field. Not all possible states are depicted. For brevity, we chose -only the most significant ones, which segment the transformation process -into stages. - -image:user_manual/media/image94.png[image,width=100%,height=390] - -== Security credentials - -The security credentials will be provided by the infrastructure team -that installed the necessary infrastructure for this project. Some credentials are set in the environment file necessary for the -infrastructure installation and others by manually creating a user by -infra team. - -Bellow are the credentials that should be provided - -[width="100%",cols="25%,36%,39%",options="header",] -|=== -|Name |Description |Comment -|Metabase user |Metabase user for login. This should be an email address -|This user was manually created by the infrastructure team - -|Metabase password |The temporary password that was set by the infra -team for the user above |This user was manually created by the -infrastructure team - -|Airflow user |Airflow UI user for login |This is the value of -_AIRFLOW_WWW_USER_USERNAME variable from the env file - -|Airflow password |Airflow UI password for login |This is the value of -_AIRFLOW_WWW_USER_PASSWORD variable from the env file - -|Fuseki user |Fuseki user for login |The login should be for admin user - -|Fuseki password |Fuseki password for login |This is the value of -ADMIN_PASSWORD variable from the env file - -|Mongo-express user |Mongo-express user for login |This is the value of -ME_CONFIG_BASICAUTH_USERNAME variable from the env file - -|Mongo-express password |Mongo-express password for login |This is the -value of ME_CONFIG_BASICAUTH_PASSWORD variable from the env file -|=== - -== Workflow management with Airflow - -The management of the workflow is made available through the user -interface of the Airflow system. This section describes the provided -pipelines, and how to operate them in Airflow. - -=== Airflow DAG control board - -In this section we explain the most important elements to pay attention -to when operating the pipelines. + -In software engineering, a pipeline consists of a chain of processing -elements (processes, threads, coroutines, functions, etc.), arranged so -that the output of each element is the input of the next. In our case, -as an example, look at the notice_processing_pipeline, which has this -chain of processes that takes as input a notice from the TED website and -as the final output (if every process from this pipeline runs -successfully) a METS package with a transformed notice in the RDF -format. Between the processes the input will always be a batch of -notices. 
Batch processing is a method of processing large amounts of -data in a single, pre-defined process. Batch processing is typically -used for tasks that are performed periodically, such as daily, weekly, -or monthly. Each step of the pipeline can have a successful or failure -result, and as such the pipeline can be stopped at any step if something -went wrong with one of its processes. In Airflow terminology a pipeline -will be a DAG. He are the processes that will create our -notice_processing_pipeline DAG: - -* notice normalisation -* notice transformation -* notice distillation -* notice validation -* notice packaging -* notice publishing - -==== Enable / disable switch - -In Airflow all the DAGs can be enabled or disabled. If a dag is disabled -that will stop the DAG from running even if that DAG is scheduled. - -When a dag is enabled the switch button will be blue and grey when it is -disabled. - -To enable or disable a dag use the following switch button: - -image:user_manual/media/image21.png[image,width=100%,height=32] - -image:user_manual/media/image69.png[image,width=56,height=55] -disabled position - -image:user_manual/media/image3.png[image,width=52,height=56] -enabled position - -==== DAG Runs - -A DAG Run is an object representing an instantiation of the DAG in time. -Any time the DAG is executed, a DAG Run is created and all tasks inside -it are executed. The status of the DAG Run depends on the tasks states. -Each DAG Run is run separately from one another, meaning that you can -have many runs of a DAG at the same time. - -DAG Run Status - -A DAG Run status is determined when the execution of the DAG is -finished. The execution of the DAG depends on its containing tasks and -their dependencies. The status is assigned to the DAG Run when all of -the tasks are in one of the terminal states (i.e. if there is no -possible transition to another state) like success, failed or skipped. - -There are two possible terminal states for the DAG Run: - -* success if all the pipeline processes are either success or skipped, -* failed if any of the pipeline processes is either failed or -upstream_failed. - -In the runs column in the Airflow user interface we can see the state of -the DAG run, and this can be one of the following: - -* queued -* success -* running -* failed - - -Here is an example of this different states - -image:user_manual/media/image54.png[image,width=422,height=315] - -The transitions for these states will start from queuing, then will go -to running, and after will either go to success or failure. - -Clicking on the numbers associated with a particular DAG run state will -show you a list of the DAG runs in that state. - -==== DAG actions - -In the Airflow user interface we have a run button in the Actions column -that will allow you to trigger a specific DAG with or without specific -configuration. When clicking on the run button a list of options will -appear: - -* Trigger DAG (triggering DAG without config) -* Trigger DAG w/ config (triggering DAG with config) - - -image:user_manual/media/image24.png[image,width=378,height=165] - -==== DAG Run overview - -In the Airflow user interface, when clicking on the DAG name, an -overview of the runs for that DAG will be available. This will include -schema of the processes that are a part of the pipeline, task durations, -code for the DAG, etc. 
To learn more about Airflow interface please -refer to the Airflow user manual -(link:#useful-resources[[.underline]#Useful Resources#]) - -image:user_manual/media/image74.png[image,width=601,height=281] - - - -=== Available pipelines - -In this section we provide a brief inventory of provided pipelines -including their names, a short description and a high level diagram. - -[arabic] - -. *notice_processing_pipeline* - this DAG performs the processing of a -batch of notices, where the stages take place: normalization, -transformation, validation, packaging, publishing. This is scheduled and -automatically started by other DAGs. - - -image:user_manual/media/image31.png[image,width=100%,height=176] - -image:user_manual/media/image25.png[image,width=100%,height=162] - - -[arabic, start=2] - -. *load_mapping_suite_in_database* - this DAG performs the loading of a -mapping suite or all mapping suites from a branch on GitHub, with the -mapping suite the test data from it can also be loaded, if the test data -is loaded the notice_processing_pipeline DAG will be triggered. - - - -*Config DAG params:* - - -* mapping_suite_package_name: string -* load_test_data: boolean -* branch_or_tag_name: string -* github_repository_url: string - -*Default values:* - -* mapping_suite_package_name = None (it will take all available mapping -suites on that branch or tag) -* load_test_data = false -* branch_or_tag_name = "main" -* github_repository_url= "https://github.com/OP-TED/ted-rdf-mapping.git" - - -image:user_manual/media/image96.png[image,width=100%,height=56] - -[arabic, start=3] -. *fetch_notices_by_query -* this DAG fetches notices from TED by using a -query and, depending on an additional parameter, triggers the -notice_processing_pipeline DAG in full or partial mode (execution of -only one step). - -*Config DAG params:* - -* query : string -* trigger_complete_workflow : boolean - -*Default values:* - -* trigger_complete_workflow = true - -image:user_manual/media/image56.png[image,width=100%,height=92] - -[arabic, start=4] -. *fetch_notices_by_date -* this DAG fetches notices from TED for a day -and, depending on an additional parameter, triggers the -notice_processing_pipeline DAG in full or partial mode (execution of -only one step). - -*Config DAG params:* - -* wild_card : string with date format %Y%m%d* -* trigger_complete_workflow : boolean - -*Default values:* - -* trigger_complete_workflow = true - -image:user_manual/media/image33.png[image,width=100%,height=100] - -[arabic, start=5] -. *fetch_notices_by_date_range -* this DAG receives a date range and -triggers the fetch_notices_by_date DAG for each day in the date range. - -*Config DAG params:* - - -* start_date : string with date format %Y%m%d -* end_date : string with date format %Y%m%d - -image:user_manual/media/image75.png[image,width=601,height=128] - -[arabic, start=6] -. *reprocess_unnormalised_notices_from_backlog -* this DAG selects all -notices that are in RAW state and need to be processed and triggers the -notice_processing_pipeline DAG to process them. - -*Config DAG params:* - -* start_date : string with date format %Y-%m-%d -* end_date : string with date format %Y-%m-%d - -*Default values:* - -* start_date = None , because this param is optional -* end_date = None, because this param is optional - -image:user_manual/media/image60.png[image,width=601,height=78] - -[arabic, start=7] -. 
*reprocess_unpackaged_notices_from_backlog -* this DAG selects all -notices to be repackaged and triggers the notice_processing_pipeline DAG -to repackage them. - -*Config DAG params:* - -* start_date : string with date format %Y-%m-%d -* end_date : string with date format %Y-%m-%d -* form_number : string -* xsd_version : string - -*Default values:* - -* start_date = None , because this param is optional -* end_date = None, because this param is optional -* form_number = None, because this param is optional -* xsd_version = None, because this param is optional - -image:user_manual/media/image81.png[image,width=100%,height=73] - -[arabic, start=8] -. *reprocess_unpublished_notices_from_backlog -* this DAG selects all -notices to be republished and triggers the notice_processing_pipeline -DAG to republish them. - -*Config DAG params:* - - -* start_date : string with date format %Y-%m-%d -* end_date : string with date format %Y-%m-%d -* form_number : string -* xsd_version : string - -*Default values:* - - -* start_date = None , because this param is optional -* end_date = None, because this param is optional -* form_number = None, because this param is optional -* xsd_version = None, because this param is optional - -image:user_manual/media/image37.png[image,width=100%,height=70] - -[arabic, start=9] -. *reprocess_untransformed_notices_from_backlog -* this DAG selects all -notices to be retransformed and triggers the notice_processing_pipeline -DAG to retransform them. - -*Config DAG params:* - - -* start_date : string with date format %Y-%m-%d -* end_date : string with date format %Y-%m-%d -* form_number : string -* xsd_version : string - -*Default values:* - -* start_date = None , because this param is optional -* end_date = None, because this param is optional -* form_number = None, because this param is optional -* xsd_version = None, because this param is optional - - -image:user_manual/media/image102.png[image,width=100%,height=69] - -[arabic, start=10] -. *reprocess_unvalidated_notices_from_backlog -* this DAG selects all -notices to be revalidated and triggers the notice_processing_pipeline -DAG to revalidate them. - -*Config DAG params:* - -* start_date : string with date format %Y-%m-%d -* end_date : string with date format %Y-%m-%d -* form_number : string -* xsd_version : string - -*Default values:* - - -* start_date = None , because this param is optional -* end_date = None, because this param is optional -* form_number = None, because this param is optional -* xsd_version = None, because this param is optional - -image:user_manual/media/image102.png[image,width=100%,height=69] - -[arabic, start=11] -. *daily_materialized_views_update -* this DAG selects all notices to be -revalidated and triggers the notice_processing_pipeline DAG to -revalidate them. - -*This DAG has no config or default params.* - -image:user_manual/media/image98.png[image,width=100%,height=90] - -[arabic, start=12] -. *daily_check_notices_availability_in_cellar -* this DAG selects all -notices to be revalidated and triggers the notice_processing_pipeline -DAG to revalidate them. - -*This DAG has no config or default params.* - - -image:user_manual/media/image67.png[image,width=339,height=81] - -=== Batch processing - -=== Running pipelines (How to) - -This chapter explains the basic utilization of Ted SWS Airflow pipelines -by presenting in the format of answering the questions. Basic -functionality can be used by running DAGs: a core concept of Airflow. 
-For advanced documentation access: - -https://airflow.apache.org/docs/apache-airflow/stable/concepts/dags.html[[.underline]#https://airflow.apache.org/docs/apache-airflow/stable/concepts/DAGs.html#] - -==== UC1: How to load a mapping suite or mapping suites? - -As a user I want to load one or several mapping suites into the system -so that notices can be transformed and validated with them. - -==== UC1.a To load all mapping suites - -[arabic] -. Run *load_mapping_suite_in_database* DAG: -[loweralpha] -.. Enable DAG -.. Click Run on Actions column (Play symbol button) -.. Click Trigger DAG - - -image:user_manual/media/image84.png[image,width=100%,height=61] - -==== UC1.b To load specific mapping suite - -[arabic] -. Run *load_mapping_suite_in_database* DAG with configurations: -[loweralpha] -.. Enable DAG -.. Click Run on Actions column (Play symbol button) -.. Click Trigger DAG w/ config. - -image:user_manual/media/image36.png[image,width=100%,height=55] - -[arabic, start=2] -. In the next screen - -[loweralpha] -. In the configuration JSON text box insert the config: - -[source,python] -{"mapping_suite_package_name": "package_F03"} - -[loweralpha, start=2] -. Click Trigger button after inserting the configuration - -image:user_manual/media/image27.png[image,width=100%,height=331] - -[arabic, start=3] -. Optional if you want to transform the available test notices that were -used for development of the mapping suite you can add to configuration -the *load_test_data* parameter with the value *true* - -image:user_manual/media/image103.png[image,width=100%,height=459] - -==== UC2: How to fetch and process notices for a day? - -As a user I want to fetch and process notices from a selected day so -that they get published in Cellar and be available to the public in RDF -format. - -UC2.a To fetch and transform notices for a day: - -[arabic] -. Enable *notice_processing_pipeline* DAG -. Run *fetch_notices_by_date* DAG with configurations: -[loweralpha] -.. Enable DAG -.. Click Run on Actions column -.. Click Trigger DAG w/ config - -image:user_manual/media/image26.png[image,width=100%,height=217] - -[arabic, start=3] -. In the next screen - -[loweralpha] -. In the configuration JSON text box insert the config: -[source,python] -{"wild_card ": "20220921*"}* - -The value *20220921** is the date of the day to fetch and transform with -format: yyyymmdd*. - - -[loweralpha, start=2] -. Click Trigger button after inserting the configuration - -image:user_manual/media/image1.png[image,width=100%,height=310] - -[arabic, start=4] -. Optional: It is possible to only fetch notices without transformation. -To do so add *trigger_complete_workflow* configuration parameter and set -its value to “false”. + -[source,python] -{"wild_card ": "20220921*", "trigger_complete_workflow": false} - -image:user_manual/media/image4.png[image,width=100%,height=358] - - -==== UC3: How to fetch and process notices for date range? - -As a user I want to fetch and process notices published within a dare -range so that they are published in Cellar and available to the public -in RDF format. - -UC3.a To fetch for multiple days: - -[arabic] -. Enable *notice_processing_pipeline* DAG -. Run *fetch_notices_by_date_range* DAG with configurations: -[loweralpha] -.. Enable DAG -.. Click Run on Actions column -.. Click Trigger DAG w/ config. - -image:user_manual/media/image79.png[image,width=100%,height=205] - -[arabic, start=3] -. 
In the next screen, in the configuration JSON text box insert the -config: -[source,python] -{ "start_date": "20220920", "end_date": "20220920" } - -20220920 is the start date and 20220920 is the end date of the days to -be fetched and transformed with format: yyyymmdd. - -[arabic, start=4] -. Click Trigger button after inserting the configuration - -image:user_manual/media/image51.png[image,width=100%,height=331] - -==== UC4: How to fetch and process notices using a query? - -As a user I want to fetch and process notices published by specific -filters that are available from the TED API so that they are published -in Cellar and available to the public in RDF format. - -To fetch and transform notices by using a query follow the instructions -below: - -[arabic] -. Enable *notice_processing_pipeline* DAG -. Run *fetch_notices_by_query* DAG with configurations: -.. Enable DAG -.. Click Run on Actions column -.. Click Trigger DAG w/ config. - -image:user_manual/media/image61.png[image,width=100%,height=200] -[arabic, start=3] -. In the next screen - -[loweralpha] -. In the configuration JSON text box insert the config: - -[source,python] -{"query": "ND=[163-2021]"} - - -ND=[163-2021] is the query that will run against the TED API to get -notices that will match that query - -[loweralpha, start=2] -. Click Trigger button after inserting the configuration - -image:user_manual/media/image93.png[image,width=100%,height=378] - -[arabic, start=4] -. Optional: If you need to only fetch notices without -transformation, add *trigger_complete_workflow* configuration as *false* - -image:user_manual/media/image49.png[image,width=100%,height=357] - -==== UC5: How to deal with notices that are in the backlog and what to run? - -As a user I want to reprocess notices that are in the backlog so that -they are published in Cellar and available to the public in RDF format. - -Notices that have failed running a complete and successful -notice_processing_pipeline run will be added to the backlog by using -different statuses that will be added to these notices. The status of a -notice will be automatically determined by the system. The backlog could -have multiple notices in different statuses. - -The backlog is divided in five categories as follows: - -* notices that couldn’t be normalised -* notices that couldn’t be transformed -* notices that couldn’t be validated -* notices that couldn’t be packaged -* notices that couldn’t be published - -===== UC5.a Deal with notices that couldn't be normalised - -In the case that the backlog contains notices that couldn’t be -normalised at some point and will want to try to reprocess those notices -just run the *reprocess_unnormalised_notices_from_backlog* DAG following -the instructions below. - -[arabic] -. Enable the reprocess_unnormalised_notices_from_backlog DAG - -image:user_manual/media/image92.png[image,width=100%,height=44] - -[arabic, start=2] -. Trigger DAG - -image:user_manual/media/image76.png[image,width=100%,height=54] - -===== UC5.b: Deal with notices that couldn't be transformed - -In the case that the backlog contains notices that couldn’t be -transformed at some point and will want to try to reprocess those -notices just run the *reprocess_untransformed_notices_from_backlog* DAG -following the instructions below. - -[arabic] -. Enable the reprocess_untransformed_notices_from_backlog DAG -image:user_manual/media/image85.png[image,width=100%,height=36] - -[arabic, start=2] -. 
Trigger DAG - -image:user_manual/media/image77.png[image,width=100%,height=54] - -===== UC5.c: Deal with notices that couldn’t be validated - -In the case that the backlog contains notices that couldn’t be -normalised at some point and will want to try to reprocess those notices -just run the *reprocess_unvalidated_notices_from_backlog* DAG following -the instructions below. - -[arabic] -. Enable the reprocess_unvalidated_notices_from_backlog DAG - -image:user_manual/media/image66.png[image,width=100%,height=41] - -[arabic, start=2] -. Trigger DAG - -image:user_manual/media/image52.png[image,width=100%,height=52] - -===== UC5.d: Deal with notices that couldn't be published - -In the case that the backlog contains notices that couldn’t be -normalised at some point and will want to try to reprocess those notices -just run the *reprocess_unpackaged_notices_from_backlog* DAG following -the instructions below. - -[arabic] -. Enable the reprocess_unpackaged_notices_from_backlog DAG - -image:user_manual/media/image29.png[image,width=100%,height=36] - -[arabic, start=2] -. Trigger DAG - -image:user_manual/media/image71.png[image,width=100%,height=49] - -===== UC5.e: Deal with notices that couldn't be published - -In the case that the backlog contains notices that couldn’t be -normalised at some point and will want to try to reprocess those notices -just run the *reprocess_unpublished_notices_from_backlog* DAG following -the instructions below. - -[arabic] -. Enable the reprocess_unpublished_notices_from_backlog DAG - -image:user_manual/media/image38.png[image,width=100%,height=38] - -[arabic, start=2] -. Trigger DAG - -image:user_manual/media/image19.png[image,width=100%,height=57] - -=== Scheduled pipelines - - -Scheduled pipelines are DAGs that are set to run periodically at fixed -times, dates, or intervals. The DAG schedule can be read in the column -“Schedule” and if any is set then the value is different from None. -The scheduled execution is indicated as “cron expressions” [cire cron -expressions manual]. A cron expression is a string comprising five or -six fields separated by white space that represents a set of times, -normally as a schedule to execute some routine. In our context examples -of daily executions are provided below. - -image:user_manual/media/image34.png[image,width=83,height=365,float="right"] - -* None - DAG with no Schedule -* 0 0 * * * - DAG that will run every day at 24:00 UTC -* 0 6 * * * - DAG that will run every day at 06:00 UTC -* 0 1 * * * - DAG that will run every day at 01:00 UTC - - -{nbsp} - -{nbsp} - -{nbsp} - -{nbsp} - -{nbsp} - -=== Operational rules and recommendations - - -Note: Every action that was not described in the previous chapters can -lead to unpredictable situations. - -* Do not stop a DAG when it is in running state. Let it finish. In case -you need to disable or stop a DAG, then make sure that in the column -Recent Tasks no numbers in the light green circle are present. Figure -below depicts one such example. -image:user_manual/media/image72.png[image,width=601,height=164] - -* Do not run reprocess DAGs when notice_processing_pipeline is in running -state. This will produce errors as the reprocessing DAGs are searching -for notices in a specific status available in the database. When the -notice_processing_pipeline is running the notices are transitioning -between different statuses and that will make it possible to get the -same notice to be processed twice in the same time, which will produce -an error. 
Make sure that in the column Runs for -notice_processing_pipeline you don’t have any numbers in a light green -circle before running any reprocess DAGs. -image:user_manual/media/image30.png[image,width=601,height=162] - - -* Do not manually trigger notice_processing_pipeline as this DAG is -triggered automatically by other DAGs. This will produce an error as -this DAG needs to know what batch of notices it is processing (this is -automatically done by the system). This DAG should only be enabled. -image:user_manual/media/image18.png[image,width=602,height=29] - -* To start any notice processing and transformation make sure that you -have mapping suites available in the database. You should have at least -one successful run of the *load_mapping_suite_in_database* DAG and check -Metabase to see what mapping suites are available. -image:user_manual/media/image32.png[image,width=653,height=30] - -* Do not manually trigger scheduled DAGs unless you use a specific -configuration and that DAG supports running with specific configuration. -The scheduled dags should be only enabled. -image:user_manual/media/image87.png[image,width=601,height=77] - -* It is not recommended to load mapping suites while -notice_processing_pipeline is running. First make sure that there are no -running tasks and then load other mapping suites. -image:user_manual/media/image35.png[image,width=601,height=256] {nbsp} -image:user_manual/media/image91.png[image,width=601,height=209] - -* It is recommended to start processing / transforming notices for a short -period of time e.g fetch notices for a day, week, month but not year. -The system can handle processing for a longer period but it will take -time and you will not be able to load other mapping suites while -processing is running. - - -== Metabase - -This section describes how to work with Metabase, exploring user -interface, accessing dashboards, creating questions, and adding new data -sources. This description uses examples with real data and data sources -that are used on TED-SWS project. For advanced documentation access -link: - -https://www.metabase.com/docs/latest/[[.underline]#https://www.metabase.com/docs/latest/#] - -=== Main concepts in Metabase - -==== What is a question? - -In Metabase, a question is a query, its results, and its visualization. - -If you’re trying to figure something out about your data in Metabase, -you’re probably either asking a question or viewing a question that -someone else on your team created. In everyday usage, a question is -pretty much synonymous with a query. - -==== What is a dashboard? - -A dashboard is a data visualization tool that holds important charts and -text, collected and arranged on a single screen. Dashboards provide a -high-level, centralized look at KPIs and other business metrics, and can -cover everything from overall business health to the success of a -specific project. - -The term comes from the automotive dashboard, which like its business -intelligence counterpart provides status updates and warnings about -important functions. - -==== What is a collection? - -In Metabase, a collection is a set of items like questions, dashboards -and subcollections, that are stored together for some organizational -purpose. You can think of collections like folders within a file system. -The root collection in Metabase is called Our Analytics, and it holds -every other collection that you and others at your organization create. 
- -You may keep a collection titled “Operations” that holds all of the -questions, dashboards, and models that your organization’s ops team -uses, so people in that department know where to find the items they -need to do their jobs. And if there are specific items within a -collection that your team uses most frequently, you can pin those to the -top of the collection page for easy reference. Pinned questions in a -collection will also render a preview of their visualization. - -==== What is a card? - -A card is a component of a dashboard that displays data or text. - -Metabase dashboards are made up of cards, with each card displaying some -data (visualized as a table, chart, map, or number) or text (like -headings, descriptive information, or relevant links). - -=== User interface - -After successful authorization, metabase redirects to main page that is -composed of the following elements: - -image:user_manual/media/image22.png[image,width=633,height=294] - -[arabic] -. Slidebar with collections -. Settings, searching and adding new questions -. Home page (Quick last accessed dashboards or questions) - -==== UC1 Manually updating the data - -As a user I want to manually update the data so I will see the -questions/dashboards on the latest data. - -For *updating data*: - -[arabic] -. Click Settings -> Admin settings -> Databases - -image:user_manual/media/image99.png[image,width=448,height=373] - -[arabic, start=2] -. Go to Databases in the top menu - -image:user_manual/media/image15.png[image,width=601,height=142] - -[arabic, start=3] -. To *update* the existing data source, click on the name of the necessary -database and then click on both actions: “Sync database schema now” and -“Re-scan field values now”. This will be done automatically but if you -want to have the latest data (i.e the processing is still running) you -could follow the steps below. However this is not considered a good -practice. - -image:user_manual/media/image78.png[image,width=354,height=162] - -image:user_manual/media/image86.png[image,width=280,height=244] - -==== UC2: Use existing dashboards - -As a user I want to browse through and view dashboards so that I can -answer business or operational questions about pipelines or notices. - -[arabic] -. To access existing questions / dashboards, click: - -Sidebar button -> Necessary collection folder (ex: TED SWS KPI -> -Pipeline KPI) - -image:user_manual/media/image68.png[image,width=189,height=242] - -[arabic, start=2] -. To access the dashboard / question click on the element name in the main -screen - -image:user_manual/media/image50.png[image,width=572,height=227] - -==== UC2: Customize a collection - -As a user I want to customize my collection preview so I can access -quickly certain dashboards / questions and clean the unwanted content - -[arabic] -. When opening a collection the main screen will be divided into to -sections - - -[loweralpha] -. Pin section - where dashboards and questions can be pinned for easy -access - -. List with dashboards and questions. - - -image:user_manual/media/image46.png[image,width=601,height=341] - -[arabic, start=2] -. Drag the dashboard or question elements from list (2) to -section (1) to pin them. The element will be moved to the pin section, -and will be displayed. - -. To *delete / move* a dashboard or question: - -[loweralpha] -. Click on checkbox of the elements to be deleted; -. 
Click archive or move (this can move the content to another collection) - -image:user_manual/media/image17.png[image,width=461,height=282] - -==== UC3: Create new question - -As a user I want to create a new question so I can explore the available -data - -To *create* question: - -[arabic] -. Click New -(image:user_manual/media/image65.png[image,width=45,height=27]), -then Question -(image:user_manual/media/image83.png[image,width=71,height=22]). - -image:user_manual/media/image100.png[image,width=261,height=194] - -[arabic, start=2] -. Select Data source (TEDSWS MongoDB - database name) - -image:user_manual/media/image7.png[image,width=353,height=210] - -[arabic, start=3] -. Select Data collection (Notice Collection Materialized View - -image:user_manual/media/image28.png[image,width=266,height=307] - -*Note:* Always select “Notices Collection Materialised View” collection -for questions. This collection was created specifically for metabase. -Using other collections may increase response time of a question. - -[arabic, start=4] -. Select necessary columns to display (ex: Notice status) - -image:user_manual/media/image95.png[image,width=397,height=365] - - -[arabic, start=5] -. (Optional) Select filter (ex: Form number is F03) - -image:user_manual/media/image40.png[image,width=275,height=304] - -image:user_manual/media/image70.png[image,width=353,height=214] - -[arabic, start=6] -. (Optional) Select Summarize (ex: Count of rows) - -image:user_manual/media/image82.png[image,width=273,height=299] - -[arabic, start=7] -. (Optional) Select a column to group by (ex: Notice Status) - -image:user_manual/media/image10.png[image,width=389,height=310] - -[arabic, start=8] -. Click Visualize -image:user_manual/media/image16.png[image,width=143,height=32] - - -image:user_manual/media/image9.png[image,width=268,height=180] - -*Note:* This loading page means that questing is requesting an answer. -Wait until it disappears.After the request is done, the page with -response and editing a question will appear. - - -[arabic, start=9] -. Customizing the question - - -Question page is divided into: - -* Edit question (name and logic) - -* Question visualisation (can be table or chart) - -* Visualisation settings (settings for table or chart) - -image:user_manual/media/image55.png[image,width=601,height=277] - -Tips on *editing* page: - -* To *export* the question: -** Click on Download full results - -image:user_manual/media/image89.png[image,width=372,height=286] - -* To *edit question*: -** Click on Show editor - -image:user_manual/media/image43.png[image,width=394,height=182] - - -* To *change visualization type* -** Click on visualization and then on Done once the type was chosen - -image:user_manual/media/image39.png[image,width=392,height=345] - -* To *edit visualization settings* - -** Click on Settings - -image:user_manual/media/image5.png[image,width=303,height=346] - - -* To show values on dashboard: Click Show values on data points - -image:user_manual/media/image104.png[image,width=255,height=331] - - -* To *save* question just Click Save button - -image:user_manual/media/image48.png[image,width=324,height=198] - -* Insert question name, description (optional) and collection to save into - -image:user_manual/media/image101.png[image,width=305,height=230] - -==== UC4: Create dashboard - -As a user I want to create a dashboard so I can group a set of questions -that are of interest to me. - -To *create* dashboard: - -[arabic] -. 
Click New -> Dashboard - -image:user_manual/media/image12.png[image,width=548,height=295] - - -[arabic, start=2] -. Insert Name, Description (optional) and collection where to save - -image:user_manual/media/image44.png[image,width=370,height=279] - - -[loweralpha] -. To select subfolder of the collection, click in arrow on collection -field: - -image:user_manual/media/image13.png[image,width=395,height=199] - -[arabic, start=3] -. Click Create - -. To *add* questions on dashboard: - -[loweralpha] -. Click Add questions - -image:user_manual/media/image42.png[image,width=285,height=158] - -[loweralpha, start=2] -. Click on the name of necessary question or drag & drop it - -image:user_manual/media/image57.png[image,width=307,height=392] - -In the dashboard you can add multiple questions, resize and move where -it needs to be. -[arabic, start=5] -. To *save* dashboard: - -[loweralpha] - -. Click Save button in right top corner of the current screen - -image:user_manual/media/image53.png[image,width=171,height=96] - -==== UC5: Create user - -As a user I want to create another user so that I can share the work -with others in my team - -[arabic] -. Go to Admin settings by pressing the setting wheel button in the top -right of the screen and then click Admin settings. - -image:user_manual/media/image64.png[image,width=544,height=180] - - -[arabic, start=2] -. On the next screen go to People in the top menu and click Invite someone -button - -image:user_manual/media/image97.png[image,width=539,height=137] - - -[arabic, start=3] -. Complete the mandatory fields and put the user in the Administrator if -you want that user to be an admin or in the All Users group - -image:user_manual/media/image73.png[image,width=601,height=345] - -[arabic, start=4] -. Once you click on create a temporary password will be created for this -user. Save this password and user details as you will need to share -these with the new user. After this just click Done. - -image:user_manual/media/image20.png[image,width=601,height=362] - diff --git a/docs/antora/modules/ROOT/pages/user_manual/access-security.adoc b/docs/antora/modules/ROOT/pages/user_manual/access-security.adoc new file mode 100644 index 000000000..00c59e995 --- /dev/null +++ b/docs/antora/modules/ROOT/pages/user_manual/access-security.adoc @@ -0,0 +1,36 @@ +== Security and Access + +The security credentials will be provided by the infrastructure team +that installed the necessary infrastructure for this project. Some credentials are set in the environment file necessary for the +infrastructure installation and others by manually creating a user by +infra team. + +Bellow is the list of credentials that should be available + +[width="100%",cols="25%,36%,39%",options="header",] +|=== +|Name |Description |Comment +|Metabase user |Metabase user for login. 
This should be an email address
+|This user was manually created by the infrastructure team
+
+|Metabase password |The temporary password that was set by the infra
+team for the user above |This user was manually created by the
+infrastructure team
+
+|Airflow user |Airflow UI user for login |This is the value of
+_AIRFLOW_WWW_USER_USERNAME variable from the env file
+
+|Airflow password |Airflow UI password for login |This is the value of
+_AIRFLOW_WWW_USER_PASSWORD variable from the env file
+
+|Fuseki user |Fuseki user for login |The login should be for the admin user
+
+|Fuseki password |Fuseki password for login |This is the value of
+ADMIN_PASSWORD variable from the env file
+
+|Mongo-express user |Mongo-express user for login |This is the value of
+ME_CONFIG_BASICAUTH_USERNAME variable from the env file
+
+|Mongo-express password |Mongo-express password for login |This is the
+value of ME_CONFIG_BASICAUTH_PASSWORD variable from the env file
+|===
\ No newline at end of file
diff --git a/docs/antora/modules/ROOT/pages/user_manual/getting_started_user_manual.adoc b/docs/antora/modules/ROOT/pages/user_manual/getting_started_user_manual.adoc
new file mode 100644
index 000000000..c8b62d41b
--- /dev/null
+++ b/docs/antora/modules/ROOT/pages/user_manual/getting_started_user_manual.adoc
@@ -0,0 +1,27 @@
+= Getting started with TED-SWS
+
+The purpose of this section is to explain how to monitor and control the TED-SWS system using the Airflow and Metabase interfaces. This page may be updated by the development team as the system evolves.
+
+== Intended audience
+
+This document is intended for persons involved in controlling and
+monitoring the services offered by the TED-SWS system.
+
+== Getting started
+To gain access to and control of the TED-SWS system, the user shall be provided with access URLs and credentials by the infrastructure team. Please make sure that you know xref:user_manual/access-security.adoc[all the security credentials].
+
+== User Manual
+This user manual is divided into three parts. We advise getting familiar with them in the following order:
+
+* xref:user_manual/system-overview.adoc[system overview],
+* xref:user_manual/workflow-management-airflow.adoc[workflow management with Airflow], and
+* xref:user_manual/system-monitoring-metabase.adoc[system monitoring with Metabase].
+
+
+== Additional resources [[useful-resources]]
+
+link:https://airflow.apache.org/docs/apache-airflow/2.4.3/ui.html[Apache Airflow User Interface]
+
+link:https://www.metabase.com/learn/getting-started/tour-of-metabase[Tour of Metabase]
+
+link:https://www.metabase.com/docs/latest/exploration-and-organization/start[Metabase organisation and exploration]
diff --git a/docs/antora/modules/ROOT/pages/user_manual/system-monitoring-metabase.adoc b/docs/antora/modules/ROOT/pages/user_manual/system-monitoring-metabase.adoc
new file mode 100644
index 000000000..a47c5bf4d
--- /dev/null
+++ b/docs/antora/modules/ROOT/pages/user_manual/system-monitoring-metabase.adoc
@@ -0,0 +1,345 @@
+= System monitoring with Metabase
+
+This section describes how to work with Metabase: exploring the user
+interface, accessing dashboards, creating questions, and adding new data
+sources. This description uses examples with real data and data sources
+that are used in the TED-SWS project. For advanced documentation please see link:https://www.metabase.com/docs/latest/[Metabase user manual (latest)].
+
+== Main concepts in Metabase
+
+=== What is a question?
+
+In Metabase, a question is a query, its results, and its visualization.
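+
+For example, an operator question such as “How many notices are in each
+status?” is, underneath, just a query against the notice data. The sketch
+below shows roughly the same query run directly against MongoDB with
+pymongo; the connection URL, database name, collection name and field name
+are illustrative assumptions, not the exact identifiers used by the TED-SWS
+deployment.
+
+[source,python]
+----
+from pymongo import MongoClient
+
+# Placeholder connection details -- use the values from the env file.
+client = MongoClient("mongodb://localhost:27017")
+db = client["ted_analytics"]              # assumed database name
+notices = db["notice_materialised_view"]  # assumed collection name
+
+# Count notices per status, which is what a Metabase question with
+# "Summarize: Count of rows" grouped by "Notice status" computes.
+pipeline = [
+    {"$group": {"_id": "$notice_status", "count": {"$sum": 1}}},
+    {"$sort": {"count": -1}},
+]
+for row in notices.aggregate(pipeline):
+    print(row["_id"], row["count"])
+----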
+ +If you’re trying to figure something out about your data in Metabase, +you’re probably either asking a question or viewing a question that +someone else on your team created. In everyday usage, a question is +pretty much synonymous with a query. + +=== What is a dashboard? + +A dashboard is a data visualization tool that holds important charts and +text, collected and arranged on a single screen. Dashboards provide a +high-level, centralized look at KPIs and other business metrics, and can +cover everything from overall business health to the success of a +specific project. + +The term comes from the automotive dashboard, which like its business +intelligence counterpart provides status updates and warnings about +important functions. + +=== What is a collection? + +In Metabase, a collection is a set of items like questions, dashboards +and subcollections, that are stored together for some organizational +purpose. You can think of collections like folders within a file system. +The root collection in Metabase is called Our Analytics, and it holds +every other collection that you and others at your organization create. + +You may keep a collection titled “Operations” that holds all of the +questions, dashboards, and models that your organization’s ops team +uses, so people in that department know where to find the items they +need to do their jobs. And if there are specific items within a +collection that your team uses most frequently, you can pin those to the +top of the collection page for easy reference. Pinned questions in a +collection will also render a preview of their visualization. + +=== What is a card? + +A card is a component of a dashboard that displays data or text. + +Metabase dashboards are made up of cards, with each card displaying some +data (visualized as a table, chart, map, or number) or text (like +headings, descriptive information, or relevant links). + +== Metabase user interface + +After successful authorization, metabase redirects to main page that is +composed of the following elements: + +image:user_manual/media/image22.png[image,width=633,height=294] + +[arabic] +. Slidebar with collections +. Settings, searching and adding new questions +. Home page (Quick last accessed dashboards or questions) + +=== UC1 Manually updating the data + +As a user I want to manually update the data so I will see the +questions/dashboards on the latest data. + +For *updating data*: + +[arabic] +. Click Settings -> Admin settings -> Databases + +image:user_manual/media/image99.png[image,width=448,height=373] + +[arabic, start=2] +. Go to Databases in the top menu + +image:user_manual/media/image15.png[image,width=601,height=142] + +[arabic, start=3] +. To *update* the existing data source, click on the name of the necessary +database and then click on both actions: “Sync database schema now” and +“Re-scan field values now”. This will be done automatically but if you +want to have the latest data (i.e the processing is still running) you +could follow the steps below. However this is not considered a good +practice. + +image:user_manual/media/image78.png[image,width=354,height=162] + +image:user_manual/media/image86.png[image,width=280,height=244] + +=== UC2: Use existing dashboards + +As a user I want to browse through and view dashboards so that I can +answer business or operational questions about pipelines or notices. + +[arabic] +. 
To access existing questions / dashboards, click:
+
+Sidebar button -> Necessary collection folder (ex: TED SWS KPI ->
+Pipeline KPI)
+
+image:user_manual/media/image68.png[image,width=189,height=242]
+
+[arabic, start=2]
+. To access the dashboard / question, click on the element name in the main
+screen
+
+image:user_manual/media/image50.png[image,width=572,height=227]
+
+=== UC2: Customize a collection
+
+As a user I want to customize my collection preview so I can quickly
+access certain dashboards / questions and clean up unwanted content.
+
+[arabic]
+. When opening a collection, the main screen will be divided into two
+sections
+
+
+[loweralpha]
+. Pin section - where dashboards and questions can be pinned for easy
+access
+
+. List with dashboards and questions.
+
+
+image:user_manual/media/image46.png[image,width=601,height=341]
+
+[arabic, start=2]
+. Drag the dashboard or question elements from list (2) to
+section (1) to pin them. The element will be moved to the pin section,
+and will be displayed.
+
+. To *delete / move* a dashboard or question:
+
+[loweralpha]
+. Click on the checkbox of the elements to be deleted;
+. Click archive or move (this can move the content to another collection)
+
+image:user_manual/media/image17.png[image,width=461,height=282]
+
+=== UC3: Create new question
+
+As a user I want to create a new question so I can explore the available
+data.
+
+To *create* a question:
+
+[arabic]
+. Click New
+(image:user_manual/media/image65.png[image,width=45,height=27]),
+then Question
+(image:user_manual/media/image83.png[image,width=71,height=22]).
+
+image:user_manual/media/image100.png[image,width=261,height=194]
+
+[arabic, start=2]
+. Select Data source (TEDSWS MongoDB - database name)
+
+image:user_manual/media/image7.png[image,width=353,height=210]
+
+[arabic, start=3]
+. Select Data collection (Notice Collection Materialized View)
+
+image:user_manual/media/image28.png[image,width=266,height=307]
+
+*Note:* Always select the “Notices Collection Materialised View” collection
+for questions. This collection was created specifically for Metabase.
+Using other collections may increase the response time of a question.
+
+[arabic, start=4]
+. Select necessary columns to display (ex: Notice status)
+
+image:user_manual/media/image95.png[image,width=397,height=365]
+
+
+[arabic, start=5]
+. (Optional) Select filter (ex: Form number is F03)
+
+image:user_manual/media/image40.png[image,width=275,height=304]
+
+image:user_manual/media/image70.png[image,width=353,height=214]
+
+[arabic, start=6]
+. (Optional) Select Summarize (ex: Count of rows)
+
+image:user_manual/media/image82.png[image,width=273,height=299]
+
+[arabic, start=7]
+. (Optional) Select a column to group by (ex: Notice Status)
+
+image:user_manual/media/image10.png[image,width=389,height=310]
+
+[arabic, start=8]
+. Click Visualize
+image:user_manual/media/image16.png[image,width=143,height=32]
+
+
+image:user_manual/media/image9.png[image,width=268,height=180]
+
+*Note:* This loading page means that the question is requesting an answer.
+Wait until it disappears. After the request is done, the page with the
+response and options for editing the question will appear.
+
+
+[arabic, start=9]
+.
Customizing the question + + +Question page is divided into: + +* Edit question (name and logic) + +* Question visualisation (can be table or chart) + +* Visualisation settings (settings for table or chart) + +image:user_manual/media/image55.png[image,width=601,height=277] + +Tips on *editing* page: + +* To *export* the question: +** Click on Download full results + +image:user_manual/media/image89.png[image,width=372,height=286] + +* To *edit question*: +** Click on Show editor + +image:user_manual/media/image43.png[image,width=394,height=182] + + +* To *change visualization type* +** Click on visualization and then on Done once the type was chosen + +image:user_manual/media/image39.png[image,width=392,height=345] + +* To *edit visualization settings* + +** Click on Settings + +image:user_manual/media/image5.png[image,width=303,height=346] + + +* To show values on dashboard: Click Show values on data points + +image:user_manual/media/image104.png[image,width=255,height=331] + + +* To *save* question just Click Save button + +image:user_manual/media/image48.png[image,width=324,height=198] + +* Insert question name, description (optional) and collection to save into + +image:user_manual/media/image101.png[image,width=305,height=230] + +=== UC4: Create dashboard + +As a user I want to create a dashboard so I can group a set of questions +that are of interest to me. + +To *create* dashboard: + +[arabic] +. Click New -> Dashboard + +image:user_manual/media/image12.png[image,width=548,height=295] + + +[arabic, start=2] +. Insert Name, Description (optional) and collection where to save + +image:user_manual/media/image44.png[image,width=370,height=279] + + +[loweralpha] +. To select subfolder of the collection, click in arrow on collection +field: + +image:user_manual/media/image13.png[image,width=395,height=199] + +[arabic, start=3] +. Click Create + +. To *add* questions on dashboard: + +[loweralpha] +. Click Add questions + +image:user_manual/media/image42.png[image,width=285,height=158] + +[loweralpha, start=2] +. Click on the name of necessary question or drag & drop it + +image:user_manual/media/image57.png[image,width=307,height=392] + +In the dashboard you can add multiple questions, resize and move where +it needs to be. +[arabic, start=5] +. To *save* dashboard: + +[loweralpha] + +. Click Save button in right top corner of the current screen + +image:user_manual/media/image53.png[image,width=171,height=96] + +=== UC5: Create user + +As a user I want to create another user so that I can share the work +with others in my team + +[arabic] +. Go to Admin settings by pressing the setting wheel button in the top +right of the screen and then click Admin settings. + +image:user_manual/media/image64.png[image,width=544,height=180] + + +[arabic, start=2] +. On the next screen go to People in the top menu and click Invite someone +button + +image:user_manual/media/image97.png[image,width=539,height=137] + + +[arabic, start=3] +. Complete the mandatory fields and put the user in the Administrator if +you want that user to be an admin or in the All Users group + +image:user_manual/media/image73.png[image,width=601,height=345] + +[arabic, start=4] +. Once you click on create a temporary password will be created for this +user. Save this password and user details as you will need to share +these with the new user. After this just click Done. 
+
+image:user_manual/media/image20.png[image,width=601,height=362]
+
diff --git a/docs/antora/modules/ROOT/pages/user_manual/system-overview.adoc b/docs/antora/modules/ROOT/pages/user_manual/system-overview.adoc
new file mode 100644
index 000000000..d66ac577f
--- /dev/null
+++ b/docs/antora/modules/ROOT/pages/user_manual/system-overview.adoc
@@ -0,0 +1,155 @@
+= System overview
+
+This section provides a high level overview of the TED-SWS system and
+its components. As presented in the image below, the system is built from
+a multitude of services / components grouped together to reach the
+end goal. The system can be divided into two main parts:
+
+* Controlling and monitoring
+* Core functionality (code base / TED SWS pipeline)
+
+Each part of the system is formed by a group of components.
+
+Controlling and monitoring, operated by an operation manager, contains
+a workflow / pipeline management service (Airflow) and a data
+visualization service (Metabase). Using this group of services, any user
+should be able to control the execution of the existing pipelines and
+monitor the execution results.
+
+The core functionality comprises many services developed to accommodate
+the entire transformation of a public procurement notice (in XML
+format) available on the TED website into RDF format and its publication
+in CELLAR. Here is a short description of some of the main services:
+
+* fetching service - fetching the notice from the TED website
+* indexing service - getting the unique XPATHs in a notice XML
+* metadata normalisation service - extracting notice metadata from the XML
+* transformation service - transforming the XML to RDF
+* entity resolution and deduplication service - resolving duplicated
+entities in the RDF
+* validation service - validating the RDF transformation
+* packaging service - creating the METS package
+* publishing service - sending the METS package to CELLAR
+
+image:user_manual/media/image59.png[image,width=100%,height=270]
+
+== Pipelines structure (Airflow DAGs)
+
+This section gives a graphic representation showing the
+flow and dependencies of the available pipelines (DAGs) in Airflow. The
+representation includes two actors, AirflowUser and
+AirflowScheduler, where AirflowUser is the user that enables and
+triggers the DAGs and AirflowScheduler is the Airflow component that
+starts the DAGs automatically following a schedule.
+
+The automatically triggered DAGs controlled by the Airflow Scheduler are:
+
+* fetch_notices_by_date
+* daily_check_notices_availibility_in_cellar
+* daily_materialized_views_update
+
+image:user_manual/media/image63.png[image,width=100%,height=382]
+
+The DAGs marked with _purple_ (load_mapping_suite_in_database), _yellow_
+(reprocess_unnormalised_notices_from_backlog, reprocess_unpackaged_notices_from_backlog,
+reprocess_unpublished_notices_from_backlog, reprocess_untransformed_notices_from_backlog,
+reprocess_unvalidated_notices_from_backlog) and _green_
+(fetch_notices_by_date, fetch_notices_by_date_range,
+fetch_notices_by_query) automatically trigger the
+*notice_processing_pipeline* DAG marked with _blue_, which takes care
+of all the processing steps for a notice. A user can also run these DAGs
+by triggering them manually, with or without configuration.
+
+The DAGs marked with _green_ (fetch_notices_by_date,
+fetch_notices_by_date_range, fetch_notices_by_query) are in charge of
+fetching the notices from the TED API.
The ones marked with _yellow_ (
+reprocess_unnormalised_notices_from_backlog,
+reprocess_unpackaged_notices_from_backlog,
+reprocess_unpublished_notices_from_backlog,
+reprocess_untransformed_notices_from_backlog,
+reprocess_unvalidated_notices_from_backlog) handle the reprocessing
+of notices from the backlog. The DAG marked with _purple_
+(load_mapping_suite_in_database) handles the loading into the database of
+the mapping suites that will be used to transform the notices.
+
+image:user_manual/media/image11.png[image,width=100%,height=660]
+
+== Notice statuses
+
+During the transformation process through the TED-SWS system, a notice
+starts with a certain status and transitions to other
+statuses when a particular step of the pipeline
+(notice_processing_pipeline) offered by the system has completed
+successfully or unsuccessfully. This transition is done automatically
+and changes the _status_ property of the notice. The system has the
+following statuses:
+
+* RAW
+* INDEXED
+* NORMALISED_METADATA
+* INELIGIBLE_FOR_TRANSFORMATION
+* ELIGIBLE_FOR_TRANSFORMATION
+* PREPROCESSED_FOR_TRANSFORMATION
+* TRANSFORMED
+* DISTILLED
+* VALIDATED
+* INELIGIBLE_FOR_PACKAGING
+* ELIGIBLE_FOR_PACKAGING
+* PACKAGED
+* INELIGIBLE_FOR_PUBLISHING
+* ELIGIBLE_FOR_PUBLISHING
+* PUBLISHED
+* PUBLICLY_UNAVAILABLE
+* PUBLICLY_AVAILABLE
+
+The transition from one status to another is decided by the system and
+can be viewed in the graphic representation below.
+
+image:user_manual/media/image14.png[image,width=100%,height=444]
+
+== Notice structure
+
+This section presents the anatomy of a Notice in the TED-SWS
+system and the dependence of structural elements on the phase of the
+transformation process. This is useful for the user to understand what
+happens behind the scenes and what information is available in the
+database in order to build analytics dashboards.
+
+The structure of a notice within the TED-SWS system consists of the
+following structural elements:
+
+* Status
+* Metadata
+** Original Metadata
+** Normalised Metadata
+* Manifestation
+** XMLManifestation
+** RDFManifestation
+** METSManifestation
+* Validation Report
+** XPATH Coverage Validation
+** SHACL Validation
+** SPARQL Validation
+
+The diagram below shows the high level structure of the Notice object
+and that certain structural parts of a notice within the system are
+dependent on its state. This means that as the transformation process
+runs through its steps, the Notice state changes and new structural parts
+are added. For example, for a notice in the NORMALISED status we can
+access the Original Metadata, Normalised Metadata and XMLManifestation
+fields; for a notice in the TRANSFORMED status we can additionally access
+the RDFManifestation field, and similarly for the rest of the statuses.
+
+The diagram depicts states as swim-lanes while the structural elements
+are depicted as ArchiMate Business Objects [cite ArchiMate]. The
+relations we use are composition (arrow with diamond ending) and
+inheritance (arrow with full triangle ending).
+
+As mentioned above regarding the states through which a notice can
+transition: if a structural field is present in a certain state, then
+all the states originating from that state will also have this field.
+Not all possible states are depicted. For brevity, we chose
+only the most significant ones, which segment the transformation process
+into stages.
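+
+To make this accumulation of structural parts more tangible, the sketch
+below models the notice anatomy as Python dataclasses. The class and field
+names are simplified assumptions for illustration only and do not mirror the
+actual ted-sws code base; the point is that later-stage parts are optional
+and only get populated once the corresponding pipeline step has run.
+
+[source,python]
+----
+from dataclasses import dataclass
+from typing import Optional
+
+
+@dataclass
+class ValidationReport:
+    xpath_coverage: Optional[str] = None   # XPATH coverage validation result
+    shacl_report: Optional[str] = None     # SHACL validation result
+    sparql_report: Optional[str] = None    # SPARQL validation result
+
+
+@dataclass
+class Notice:
+    status: str                                  # e.g. "RAW", "TRANSFORMED", "PUBLISHED"
+    xml_manifestation: str                       # present from the start (fetched XML)
+    original_metadata: Optional[dict] = None     # added once the notice is indexed
+    normalised_metadata: Optional[dict] = None   # added in the NORMALISED_METADATA status
+    rdf_manifestation: Optional[str] = None      # added in the TRANSFORMED status
+    mets_manifestation: Optional[bytes] = None   # added in the PACKAGED status
+    validation_report: Optional[ValidationReport] = None  # added after validation
+----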
+
+image:user_manual/media/image94.png[image,width=100%,height=390]
diff --git a/docs/antora/modules/ROOT/pages/user_manual/workflow-management-airflow.adoc b/docs/antora/modules/ROOT/pages/user_manual/workflow-management-airflow.adoc
new file mode 100644
index 000000000..432edfbd7
--- /dev/null
+++ b/docs/antora/modules/ROOT/pages/user_manual/workflow-management-airflow.adoc
@@ -0,0 +1,686 @@
+= Workflow management with Airflow
+
+The management of the workflow is made available through the user
+interface of the Airflow system. This section describes the provided
+pipelines, and how to operate them in Airflow.
+
+== Airflow DAG control board
+
+In this section we explain the most important elements to pay attention
+to when operating the pipelines.
+
+In software engineering, a pipeline consists of a chain of processing
+elements (processes, threads, coroutines, functions, etc.), arranged so
+that the output of each element is the input of the next. In our case,
+as an example, look at the notice_processing_pipeline: a
+chain of processes that takes as input a notice from the TED website and
+produces as the final output (if every process of this pipeline runs
+successfully) a METS package with the notice transformed into RDF
+format. Between the processes the input will always be a batch of
+notices. Batch processing is a method of processing large amounts of
+data in a single, pre-defined process. Batch processing is typically
+used for tasks that are performed periodically, such as daily, weekly,
+or monthly. Each step of the pipeline can have a successful or failed
+result, and as such the pipeline can be stopped at any step if something
+went wrong with one of its processes. In Airflow terminology a pipeline
+is a DAG. Here are the processes that make up our
+notice_processing_pipeline DAG:
+
+* notice normalisation
+* notice transformation
+* notice distillation
+* notice validation
+* notice packaging
+* notice publishing
+
+=== Enable / disable switch
+
+In Airflow all the DAGs can be enabled or disabled. If a DAG is disabled,
+it will not run even if it is scheduled.
+
+The switch button is blue when a DAG is enabled and grey when it is
+disabled.
+
+To enable or disable a DAG use the following switch button:
+
+image:user_manual/media/image21.png[image,width=100%,height=32]
+
+image:user_manual/media/image69.png[image,width=56,height=55]
+disabled position
+
+image:user_manual/media/image3.png[image,width=52,height=56]
+enabled position
+
+=== DAG Runs
+
+A DAG Run is an object representing an instantiation of the DAG in time.
+Any time the DAG is executed, a DAG Run is created and all tasks inside
+it are executed. The status of the DAG Run depends on the states of its
+tasks. Each DAG Run is run separately from one another, meaning that you
+can have many runs of a DAG at the same time.
+
+DAG Run Status
+
+A DAG Run status is determined when the execution of the DAG is
+finished. The execution of the DAG depends on its containing tasks and
+their dependencies. The status is assigned to the DAG Run when all of
+the tasks are in one of the terminal states (i.e. if there is no
+possible transition to another state) like success, failed or skipped.
+
+There are two possible terminal states for the DAG Run:
+
+* success if all the pipeline processes are either success or skipped,
+* failed if any of the pipeline processes is either failed or
+upstream_failed.
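+
+Besides the UI, the state of DAG runs can also be read programmatically.
+The sketch below is a minimal example against the Airflow stable REST API;
+it assumes that the REST API is enabled with basic authentication on your
+deployment, and the base URL and credentials are placeholders to be replaced
+with the Airflow URL and the AIRFLOW_WWW_USER_* values from the env file.
+
+[source,python]
+----
+import requests
+
+AIRFLOW_API = "http://localhost:8080/api/v1"      # placeholder base URL
+AUTH = ("airflow_user", "airflow_password")       # placeholder credentials
+
+# List the latest runs of the notice_processing_pipeline DAG and print
+# their state (queued, running, success or failed).
+response = requests.get(
+    f"{AIRFLOW_API}/dags/notice_processing_pipeline/dagRuns",
+    params={"limit": 10},
+    auth=AUTH,
+)
+response.raise_for_status()
+for dag_run in response.json()["dag_runs"]:
+    print(dag_run["dag_run_id"], dag_run["state"])
+----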

In the Runs column in the Airflow user interface we can see the state of
a DAG run, which can be one of the following:

* queued
* success
* running
* failed

Here is an example of these different states:

image:user_manual/media/image54.png[image,width=422,height=315]

A run starts in the queued state, moves to running, and ends in either
success or failed.

Clicking on the numbers associated with a particular DAG run state will
show you a list of the DAG runs in that state.

=== DAG actions

In the Airflow user interface there is a run button in the Actions
column that allows you to trigger a specific DAG with or without a
specific configuration. When clicking on the run button, a list of
options appears:

* Trigger DAG (triggering the DAG without config)
* Trigger DAG w/ config (triggering the DAG with config)

image:user_manual/media/image24.png[image,width=378,height=165]

=== DAG Run overview

In the Airflow user interface, clicking on a DAG name opens an overview
of the runs for that DAG. This includes a diagram of the processes that
are part of the pipeline, task durations, the code of the DAG, etc. To
learn more about the Airflow interface, please refer to the Airflow user
manual (link:#useful-resources[[.underline]#Useful Resources#]).

image:user_manual/media/image74.png[image,width=601,height=281]

== Available pipelines

In this section we provide a brief inventory of the provided pipelines,
including their names, a short description and a high-level diagram.

[arabic]
. *notice_processing_pipeline* - this DAG performs the processing of a
batch of notices, in which the following stages take place:
normalisation, transformation, validation, packaging and publishing. It
is scheduled and automatically started by other DAGs.

image:user_manual/media/image31.png[image,width=100%,height=176]

image:user_manual/media/image25.png[image,width=100%,height=162]

[arabic, start=2]
. *load_mapping_suite_in_database* - this DAG loads one mapping suite,
or all mapping suites, from a GitHub branch or tag. The test data
included in a mapping suite can also be loaded; if it is, the
notice_processing_pipeline DAG will be triggered.

*Config DAG params:*

* mapping_suite_package_name: string
* load_test_data: boolean
* branch_or_tag_name: string
* github_repository_url: string

*Default values:*

* mapping_suite_package_name = None (it will take all available mapping
suites on that branch or tag)
* load_test_data = false
* branch_or_tag_name = "main"
* github_repository_url = "https://github.com/OP-TED/ted-rdf-mapping.git"

image:user_manual/media/image96.png[image,width=100%,height=56]

[arabic, start=3]
. *fetch_notices_by_query* - this DAG fetches notices from TED by using
a query and, depending on an additional parameter, triggers the
notice_processing_pipeline DAG in full or partial mode (execution of
only one step).

*Config DAG params:*

* query : string
* trigger_complete_workflow : boolean

*Default values:*

* trigger_complete_workflow = true

image:user_manual/media/image56.png[image,width=100%,height=92]

[arabic, start=4]
. *fetch_notices_by_date* - this DAG fetches notices from TED for a day
and, depending on an additional parameter, triggers the
notice_processing_pipeline DAG in full or partial mode (execution of
only one step).

*Config DAG params:*

* wild_card : string with date format %Y%m%d*
* trigger_complete_workflow : boolean

*Default values:*

* trigger_complete_workflow = true

image:user_manual/media/image33.png[image,width=100%,height=100]

[arabic, start=5]
. *fetch_notices_by_date_range* - this DAG receives a date range and
triggers the fetch_notices_by_date DAG for each day in the date range.

*Config DAG params:*

* start_date : string with date format %Y%m%d
* end_date : string with date format %Y%m%d

image:user_manual/media/image75.png[image,width=601,height=128]

[arabic, start=6]
. *reprocess_unnormalised_notices_from_backlog* - this DAG selects all
notices that are in RAW state and need to be processed and triggers the
notice_processing_pipeline DAG to process them.

*Config DAG params:*

* start_date : string with date format %Y-%m-%d
* end_date : string with date format %Y-%m-%d

*Default values:*

* start_date = None, because this param is optional
* end_date = None, because this param is optional

image:user_manual/media/image60.png[image,width=601,height=78]

[arabic, start=7]
. *reprocess_unpackaged_notices_from_backlog* - this DAG selects all
notices to be repackaged and triggers the notice_processing_pipeline DAG
to repackage them.

*Config DAG params:*

* start_date : string with date format %Y-%m-%d
* end_date : string with date format %Y-%m-%d
* form_number : string
* xsd_version : string

*Default values:*

* start_date = None, because this param is optional
* end_date = None, because this param is optional
* form_number = None, because this param is optional
* xsd_version = None, because this param is optional

image:user_manual/media/image81.png[image,width=100%,height=73]

[arabic, start=8]
. *reprocess_unpublished_notices_from_backlog* - this DAG selects all
notices to be republished and triggers the notice_processing_pipeline
DAG to republish them.

*Config DAG params:*

* start_date : string with date format %Y-%m-%d
* end_date : string with date format %Y-%m-%d
* form_number : string
* xsd_version : string

*Default values:*

* start_date = None, because this param is optional
* end_date = None, because this param is optional
* form_number = None, because this param is optional
* xsd_version = None, because this param is optional

image:user_manual/media/image37.png[image,width=100%,height=70]

[arabic, start=9]
. *reprocess_untransformed_notices_from_backlog* - this DAG selects all
notices to be retransformed and triggers the notice_processing_pipeline
DAG to retransform them.

*Config DAG params:*

* start_date : string with date format %Y-%m-%d
* end_date : string with date format %Y-%m-%d
* form_number : string
* xsd_version : string

*Default values:*

* start_date = None, because this param is optional
* end_date = None, because this param is optional
* form_number = None, because this param is optional
* xsd_version = None, because this param is optional

image:user_manual/media/image102.png[image,width=100%,height=69]

[arabic, start=10]
. *reprocess_unvalidated_notices_from_backlog* - this DAG selects all
notices to be revalidated and triggers the notice_processing_pipeline
DAG to revalidate them.

*Config DAG params:*

* start_date : string with date format %Y-%m-%d
* end_date : string with date format %Y-%m-%d
* form_number : string
* xsd_version : string

*Default values:*

* start_date = None, because this param is optional
* end_date = None, because this param is optional
* form_number = None, because this param is optional
* xsd_version = None, because this param is optional

image:user_manual/media/image102.png[image,width=100%,height=69]

[arabic, start=11]
. *daily_materialized_views_update* - this scheduled DAG refreshes the
materialised views in the database that are used for monitoring and
reporting. It runs daily.

*This DAG has no config or default params.*

image:user_manual/media/image98.png[image,width=100%,height=90]

[arabic, start=12]
. *daily_check_notices_availability_in_cellar* - this scheduled DAG
checks whether the published notices are available in Cellar. It runs
daily.

*This DAG has no config or default params.*

image:user_manual/media/image67.png[image,width=339,height=81]

== Batch processing

== Running pipelines (How to)

This chapter explains the basic use of the TED-SWS Airflow pipelines in
a question-and-answer format. Basic functionality is used by running
DAGs, a core concept of Airflow. For more advanced documentation, see:

https://airflow.apache.org/docs/apache-airflow/stable/concepts/dags.html[[.underline]#https://airflow.apache.org/docs/apache-airflow/stable/concepts/dags.html#]

=== UC1: How to load a mapping suite or mapping suites?

As a user I want to load one or several mapping suites into the system
so that notices can be transformed and validated with them.

=== UC1.a To load all mapping suites

[arabic]
. Run the *load_mapping_suite_in_database* DAG:
[loweralpha]
.. Enable the DAG
.. Click Run in the Actions column (play symbol button)
.. Click Trigger DAG

image:user_manual/media/image84.png[image,width=100%,height=61]

=== UC1.b To load a specific mapping suite

[arabic]
. Run the *load_mapping_suite_in_database* DAG with configurations:
[loweralpha]
.. Enable the DAG
.. Click Run in the Actions column (play symbol button)
.. Click Trigger DAG w/ config

image:user_manual/media/image36.png[image,width=100%,height=55]

[arabic, start=2]
. In the next screen

[loweralpha]
. In the configuration JSON text box, insert the config:

[source,json]
{"mapping_suite_package_name": "package_F03"}

[loweralpha, start=2]
. Click the Trigger button after inserting the configuration

image:user_manual/media/image27.png[image,width=100%,height=331]

[arabic, start=3]
. Optional: if you want to transform the test notices that were used for
the development of the mapping suite, add the *load_test_data* parameter
with the value *true* to the configuration

image:user_manual/media/image103.png[image,width=100%,height=459]

=== UC2: How to fetch and process notices for a day?

As a user I want to fetch and process notices from a selected day so
that they are published in Cellar and available to the public in RDF
format.

UC2.a To fetch and transform notices for a day:

[arabic]
. Enable the *notice_processing_pipeline* DAG
. Run the *fetch_notices_by_date* DAG with configurations:
[loweralpha]
.. Enable the DAG
.. Click Run in the Actions column
.. Click Trigger DAG w/ config

image:user_manual/media/image26.png[image,width=100%,height=217]

[arabic, start=3]
. In the next screen

[loweralpha]
. In the configuration JSON text box, insert the config:

[source,json]
{"wild_card": "20220921*"}

The value *20220921** is the day to fetch and transform, in the format
*yyyymmdd**.

[loweralpha, start=2]
. Click the Trigger button after inserting the configuration

image:user_manual/media/image1.png[image,width=100%,height=310]

[arabic, start=4]
. Optional: it is possible to only fetch notices without transformation.
To do so, add the *trigger_complete_workflow* configuration parameter
and set its value to *false*.

[source,json]
{"wild_card": "20220921*", "trigger_complete_workflow": false}

image:user_manual/media/image4.png[image,width=100%,height=358]

=== UC3: How to fetch and process notices for a date range?

As a user I want to fetch and process notices published within a date
range so that they are published in Cellar and available to the public
in RDF format.

UC3.a To fetch notices for multiple days:

[arabic]
. Enable the *notice_processing_pipeline* DAG
. Run the *fetch_notices_by_date_range* DAG with configurations:
[loweralpha]
.. Enable the DAG
.. Click Run in the Actions column
.. Click Trigger DAG w/ config

image:user_manual/media/image79.png[image,width=100%,height=205]

[arabic, start=3]
. In the next screen, in the configuration JSON text box insert the
config:
[source,json]
{ "start_date": "20220920", "end_date": "20220920" }

start_date and end_date are the first and last days to be fetched and
transformed, in the format yyyymmdd (in this example a single day).

[arabic, start=4]
. Click the Trigger button after inserting the configuration

image:user_manual/media/image51.png[image,width=100%,height=331]

=== UC4: How to fetch and process notices using a query?

As a user I want to fetch and process notices that match specific
filters available in the TED API so that they are published in Cellar
and available to the public in RDF format.

To fetch and transform notices by using a query, follow the instructions
below:

[arabic]
. Enable the *notice_processing_pipeline* DAG
. Run the *fetch_notices_by_query* DAG with configurations:
.. Enable the DAG
.. Click Run in the Actions column
.. Click Trigger DAG w/ config

image:user_manual/media/image61.png[image,width=100%,height=200]
[arabic, start=3]
. In the next screen

[loweralpha]
. In the configuration JSON text box, insert the config:

[source,json]
{"query": "ND=[163-2021]"}

ND=[163-2021] is the query that will be run against the TED API to
retrieve the notices matching it.

[loweralpha, start=2]
. Click the Trigger button after inserting the configuration

image:user_manual/media/image93.png[image,width=100%,height=378]

[arabic, start=4]
. Optional: if you only need to fetch notices without transformation,
set the *trigger_complete_workflow* configuration parameter to *false*

image:user_manual/media/image49.png[image,width=100%,height=357]

=== UC5: How to deal with notices that are in the backlog and what to run?

As a user I want to reprocess notices that are in the backlog so that
they are published in Cellar and available to the public in RDF format.

Notices that fail to complete a successful notice_processing_pipeline
run are added to the backlog, distinguished by the status they ended up
in. The status of a notice is determined automatically by the system, so
the backlog can contain notices in several different statuses.
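If you have direct access to the system's document database, the distribution of
the backlog can also be inspected with a short script. The sketch below is only
an illustration and assumes the notices are stored in MongoDB; the connection
string, database name, collection name and `status` field are placeholders to
adapt to your own deployment.

[source,python]
----
from collections import Counter

from pymongo import MongoClient

# Placeholders: adapt the connection string, database and collection
# names to your own deployment.
client = MongoClient("mongodb://localhost:27017")
collection = client["ted_sws_db"]["notice_collection"]

# Count notices per status to see how the backlog is distributed.
status_counts = Counter(
    doc["status"] for doc in collection.find({}, {"status": 1})
)
for status, count in sorted(status_counts.items()):
    print(f"{status}: {count}")
----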

The backlog is divided into five categories, as follows:

* notices that couldn't be normalised
* notices that couldn't be transformed
* notices that couldn't be validated
* notices that couldn't be packaged
* notices that couldn't be published

==== UC5.a: Deal with notices that couldn't be normalised

If the backlog contains notices that could not be normalised and you
want to try to reprocess them, run the
*reprocess_unnormalised_notices_from_backlog* DAG following the
instructions below.

[arabic]
. Enable the reprocess_unnormalised_notices_from_backlog DAG

image:user_manual/media/image92.png[image,width=100%,height=44]

[arabic, start=2]
. Trigger the DAG

image:user_manual/media/image76.png[image,width=100%,height=54]

==== UC5.b: Deal with notices that couldn't be transformed

If the backlog contains notices that could not be transformed and you
want to try to reprocess them, run the
*reprocess_untransformed_notices_from_backlog* DAG following the
instructions below.

[arabic]
. Enable the reprocess_untransformed_notices_from_backlog DAG
image:user_manual/media/image85.png[image,width=100%,height=36]

[arabic, start=2]
. Trigger the DAG

image:user_manual/media/image77.png[image,width=100%,height=54]

==== UC5.c: Deal with notices that couldn't be validated

If the backlog contains notices that could not be validated and you want
to try to reprocess them, run the
*reprocess_unvalidated_notices_from_backlog* DAG following the
instructions below.

[arabic]
. Enable the reprocess_unvalidated_notices_from_backlog DAG

image:user_manual/media/image66.png[image,width=100%,height=41]

[arabic, start=2]
. Trigger the DAG

image:user_manual/media/image52.png[image,width=100%,height=52]

==== UC5.d: Deal with notices that couldn't be packaged

If the backlog contains notices that could not be packaged and you want
to try to reprocess them, run the
*reprocess_unpackaged_notices_from_backlog* DAG following the
instructions below.

[arabic]
. Enable the reprocess_unpackaged_notices_from_backlog DAG

image:user_manual/media/image29.png[image,width=100%,height=36]

[arabic, start=2]
. Trigger the DAG

image:user_manual/media/image71.png[image,width=100%,height=49]

==== UC5.e: Deal with notices that couldn't be published

If the backlog contains notices that could not be published and you want
to try to reprocess them, run the
*reprocess_unpublished_notices_from_backlog* DAG following the
instructions below.

[arabic]
. Enable the reprocess_unpublished_notices_from_backlog DAG

image:user_manual/media/image38.png[image,width=100%,height=38]

[arabic, start=2]
. Trigger the DAG

image:user_manual/media/image19.png[image,width=100%,height=57]

== Scheduled pipelines

Scheduled pipelines are DAGs that are set to run periodically at fixed
times, dates, or intervals. The schedule of a DAG can be read in the
"Schedule" column; if a schedule is set, the value is different from
None. The scheduled execution is expressed as a "cron expression" [cite
cron expressions manual]. A cron expression is a string comprising five
or six fields separated by white space that represents a set of times,
normally used as a schedule to execute some routine. Examples of the
daily executions used in our context are provided below.

image:user_manual/media/image34.png[image,width=83,height=365]

* None - DAG with no schedule
* 0 0 * * * - DAG that will run every day at 00:00 UTC
* 0 6 * * * - DAG that will run every day at 06:00 UTC
* 0 1 * * * - DAG that will run every day at 01:00 UTC

== Operational rules and recommendations

Note: Every action that was not described in the previous chapters can
lead to unpredictable situations.

* Do not stop a DAG while it is in the running state; let it finish. If
you need to disable or stop a DAG, first make sure that the Recent Tasks
column shows no numbers in a light green circle. The figure below
depicts one such example.
image:user_manual/media/image72.png[image,width=601,height=164]

* Do not run reprocess DAGs while notice_processing_pipeline is in the
running state. This will produce errors, because the reprocessing DAGs
search the database for notices in a specific status. While
notice_processing_pipeline is running, notices are transitioning between
statuses, so the same notice could be picked up for processing twice at
the same time, which will produce an error. Make sure that the Runs
column for notice_processing_pipeline shows no numbers in a light green
circle before running any reprocess DAGs.
image:user_manual/media/image30.png[image,width=601,height=162]

* Do not manually trigger notice_processing_pipeline, as this DAG is
triggered automatically by other DAGs. Triggering it manually will
produce an error, because this DAG needs to know which batch of notices
it is processing (this is handled automatically by the system). This DAG
should only be enabled.
image:user_manual/media/image18.png[image,width=602,height=29]

* Before starting any notice processing and transformation, make sure
that mapping suites are available in the database. You should have at
least one successful run of the *load_mapping_suite_in_database* DAG;
check Metabase to see which mapping suites are available.
image:user_manual/media/image32.png[image,width=653,height=30]

* Do not manually trigger scheduled DAGs unless you use a specific
configuration and the DAG supports running with one. Scheduled DAGs
should only be enabled.
image:user_manual/media/image87.png[image,width=601,height=77]

* It is not recommended to load mapping suites while
notice_processing_pipeline is running. First make sure that there are no
running tasks and then load other mapping suites.
image:user_manual/media/image35.png[image,width=601,height=256] {nbsp}
image:user_manual/media/image91.png[image,width=601,height=209]

* It is recommended to process / transform notices for a short period of
time, e.g. fetch notices for a day, a week or a month, but not a whole
year (see the example configuration below). The system can handle
processing for a longer period, but it will take time and you will not
be able to load other mapping suites while the processing is running.
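For example, to process one month of notices, trigger the
*fetch_notices_by_date_range* DAG with a configuration such as the one below,
using the start_date and end_date parameters described in the pipeline
inventory above. The dates shown are only an illustration.

[source,json]
{"start_date": "20220901", "end_date": "20220930"}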