Motivation

The current ddf validator has a few issues:
- It is very slow, or breaks with memory errors, on medium-sized datasets. For example, the ddf--unfao--faostat dataset is 1 GB of data. If we are going to add subnational data to our datasets, dataset sizes will grow even larger, and this kind of performance is not acceptable.
- There are uncaught exceptions, for example in the validation of ddf--worldbank--povcalnet, and I've seen other cases with different errors.
I think the problem lies in the design of the ddf validator:
- It reads all files before doing the actual validation, which is unnecessary. We only need to know all the concepts and entities to be able to validate all the data.
- The validation is based on the rules we want to check, not on the actual spec of a DDF dataset. This means every "rule validator" has to handle all kinds of unexpected data on its own, which leads to duplicated checking between rules and leaves uncaught exceptions hiding in the validators.
- Also, as a result of the above, if we want to change something in the DDF specification, it is not easy to propagate that change into the validator.
So I suggest a rewrite of the validator. I would especially like to do this in PureScript, a language well suited for a validator. PureScript's syntax is very expressive and its type system is very strong. It lets us write code that reads like a spec, and the type system helps ensure the data types of the inputs are correct. For example, in the Concept module we can see that to create a Concept we must provide a concept id and a concept type, and the concept id must be an Identifier, which, as its own module shows, must be a NonEmptyString.
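To illustrate the idea, here is a minimal sketch of such types; the module layout, constructor names, and the list of concept types are assumptions for illustration, not the actual code on the purescript branch:

```purescript
module Concept where

import Prelude
import Data.Maybe (Maybe)
import Data.String.NonEmpty (NonEmptyString, fromString)

-- An Identifier can only be built from a NonEmptyString,
-- so an empty concept id is unrepresentable.
newtype Identifier = Identifier NonEmptyString

-- Hypothetical subset of DDF concept types, for illustration only.
data ConceptType = StringType | Measure | EntityDomain | EntitySet

-- Creating a Concept requires both a valid id and a concept type.
newtype Concept = Concept { conceptId :: Identifier, conceptType :: ConceptType }

-- Turning a raw csv cell into an Identifier can fail,
-- and that possibility is visible in the type.
mkIdentifier :: String -> Maybe Identifier
mkIdentifier s = Identifier <$> fromString s
```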
And we can parse the dataset bottom-up, which means we start validating right after we read data from the csv files, so that no illegal data can reach later steps. I learned this way of doing validation from this article.
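For instance, parsing at the csv boundary could look roughly like this; the row representation, error type, and function names are assumptions for illustration:

```purescript
module ParseRow where

import Prelude
import Data.Either (Either, note)
import Data.String.NonEmpty (NonEmptyString, fromString)
import Foreign.Object (Object, lookup)

-- One csv record as read from disk: column name -> raw cell text.
type Row = Object String

newtype EntityId = EntityId NonEmptyString

-- Reject a bad row with a readable message at the moment it is read;
-- every later step only ever receives an EntityId, never a raw string.
parseEntityId :: String -> Row -> Either String EntityId
parseEntityId column row = do
  cell     <- note ("missing column: " <> column) (lookup column row)
  nonEmpty <- note (column <> " must not be empty") (fromString cell)
  pure (EntityId nonEmpty)
```

Returning Either here keeps each failure tied to the row that caused it, which also makes compiler-style error reporting with file positions straightforward later on.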
TODOs
Currently the purescript branch has a basic validator that can validate concepts/entities/datapoints. I tried to run it against the testing datasets and it mostly works, and it successfully detected a few issues that the old validator couldn't detect. I will publish it as the ddf-validation-ng npm package until it is ready to replace the old one.
To install, run: npm install -g ddf-validation-ng
Validation running time is halved (for SG and fasttrack) and memory usage is very low (under 100 MB for ddf--unfao--faostat).
I changed the error reporting in the cli to something like what a compiler would produce, so that if you have an editor that supports file links you can jump to the error location with one click.
Next steps:
- add datapackage.json validation
- add more types of issues, so that we can separate errors from warnings in more cases
- add translation/synonyms checking
- polish the command line options