
Rewriting in Purescript #565

Open · 3 of 4 tasks
semio opened this issue Jan 20, 2024 · 0 comments

Motivation

The current ddf validator has a few issues:

  • It is very slow, or crashes with memory errors, on medium-sized datasets. For example, the ddf--unfao--faostat dataset is 1GB of data. If we start adding subnational data to our datasets, dataset sizes will only grow, so this kind of performance is not acceptable.
  • There are uncaught exceptions, for example in the validation of ddf--worldbank--povcalnet, and I've seen other cases with different errors.

I think the problem lies in the design of the ddf validator:

  • It reads all files before doing any actual validation, which is unnecessary: we only need to know all the concepts and entities to be able to validate all the data.
  • The validation is based on the rules we want to check, not on the actual spec of a DDF dataset. This forces each "rule validator" to handle every kind of unexpected data itself, which leads to duplicated checks between rules and leaves uncaught exceptions hiding in the validators.
  • Also, as a result of the above, if we want to change something in the DDF specification, it's not easy to propagate that change into the validator.

So I suggest a rewrite of the validator. I would especially like to do this in PureScript, a language well suited for a validator. PureScript's syntax is very expressive, and it has a very strong type system. It lets us write code that reads like a spec, and the type system helps ensure that the data types of the inputs are correct. For example, in the Concept module we can see that to create a Concept we must provide a concept id and a concept type, and the concept id must be an Identifier, which, as its own module shows, must be a NonEmptyString.
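As a rough illustration of the idea (a minimal sketch, not the actual code on the purescript branch; the module and constructor names are hypothetical, and on the branch Identifier lives in its own module):

```purescript
module Ddf.Concept
  ( Identifier
  , identifier
  , ConceptType(..)
  , Concept
  ) where

import Prelude
import Data.Maybe (Maybe)
import Data.String.NonEmpty (NonEmptyString)
import Data.String.NonEmpty as NES

-- The Identifier constructor is not exported, so the only way to
-- obtain one is through the smart constructor below; an empty id
-- is simply unrepresentable.
newtype Identifier = Identifier NonEmptyString

-- Parsing a raw String may fail, and the caller must handle that.
identifier :: String -> Maybe Identifier
identifier s = Identifier <$> NES.fromString s

-- An illustrative subset of concept types.
data ConceptType = EntityDomain | Measure | StringType

-- A Concept cannot be constructed without an id and a type,
-- so the record reads like the spec itself.
type Concept =
  { conceptId :: Identifier
  , conceptType :: ConceptType
  }
```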

And we can parse the dataset from the bottom up, meaning we start validating right after reading data from CSV, so that no illegal data can flow into later steps. I learned this way of doing validation from this article.
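Building on the sketch above, parsing at the CSV boundary might look like this (again hypothetical; the row shape and error type are illustrative):

```purescript
module Ddf.Parse where

import Prelude
import Data.Either (Either(..), note)
import Ddf.Concept (Concept, ConceptType(..), identifier)

-- Turn a raw CSV row into a well-typed Concept immediately after
-- reading it; a Left carries a descriptive error, and no unchecked
-- strings survive past this point.
parseConceptRow :: { concept :: String, concept_type :: String } -> Either String Concept
parseConceptRow row = do
  cid <- note ("invalid concept id: " <> show row.concept) (identifier row.concept)
  ctype <- parseConceptType row.concept_type
  pure { conceptId: cid, conceptType: ctype }

parseConceptType :: String -> Either String ConceptType
parseConceptType = case _ of
  "entity_domain" -> Right EntityDomain
  "measure" -> Right Measure
  "string" -> Right StringType
  other -> Left ("unknown concept type: " <> other)
```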

TODOs

Currently the purescript branch has a basic validator that can validate concepts/entities/datapoints. I tried running it on the testing datasets and it mostly works; it even detected a few issues that the old validator couldn't. I will publish it as the ddf-validation-ng npm package until it's ready to replace the old one.

To install, run `npm install -g ddf-validation-ng`.

The validation running time is halved (for SG and fasttrack) and memory usage is very small (under 100MB for ddf--unfao--faostat).

I changed the error reporting in the CLI to something like what a compiler would produce:

[screenshot of the new compiler-style error output]

That way, if your editor supports file links, you can jump to the error location with one click.

Next steps:

  • Add datapackage.json validation
  • Add more types of issues, so that we can separate errors from warnings in more cases
  • Add translation/synonyms checking
  • Polish the command-line options