
wish list #123

Open
semio opened this issue Mar 20, 2020 · 5 comments

Comments

semio (Owner) commented Mar 20, 2020

  • ideally users should only need to import ddf_utils/pandas to create a DDF dataset
    • an easy way to import a CSV, convert it to types in the DDF model (Entity/DataPoint), and save them to files with the correct file names
  • a reader for Gapminder spreadsheets
  • chef: support downloading a specific version of a dataset
  • chef: add an @procedure wrapper to make it easier to create custom procedures
  • ddf build: build a dataset (load new sources and dependencies, run the ETL script) in one command
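As a rough illustration of the @procedure wrapper idea, a decorator could register custom procedures in a lookup table that a recipe runner dispatches on. This is a hypothetical sketch, not ddf_utils' actual API; `PROCEDURES`, `procedure`, and `double_values` are made-up names:

```python
# Hypothetical sketch: a decorator that registers custom procedures by name,
# so a recipe runner could dispatch on the procedure name in a recipe file.
PROCEDURES = {}

def procedure(func):
    """Register the wrapped function as a custom procedure."""
    PROCEDURES[func.__name__] = func
    return func

@procedure
def double_values(data):
    # Example custom procedure: multiply every value by two.
    return {k: v * 2 for k, v in data.items()}

# A runner would look the procedure up by name and call it:
result = PROCEDURES["double_values"]({"population": 10})
```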
@jheeffer

With import, you mean when writing a Python script, right? Not when executing a command?

With reader, are you referring to the fasttrack spreadsheets? How would that differ from the current fasttrack code?

Other things sound good : )!

jheeffer commented Apr 17, 2020

Another feature to think about; I'm not sure whether it belongs in chef or some other tool, or whether it's even feasible given the current setup.

Given an indicator, retrace where it originally came from, i.e. draw its path through the dataset tree. With some of our procedures that might be quite difficult at a detailed level, but it may be quite doable at a high level (e.g. finding out that mcv_immunized_percent_of_one_year_olds in SG comes from the gapminder world dataset).

semio (Owner) commented May 21, 2020

> With import, you mean when writing a python script right? Not when executing a command?

Right, I am thinking about writing a python script. I'd like to write scripts like this:

```python
from typing import List

import pandas as pd

from ddf_utils.model.ddf import DDF, Concept, EntityDomain, DataPoints

source_file = '../source/some_file.csv'

def extract_concepts(df) -> List[Concept]:
    # process to extract concepts...
    return concepts

def extract_entities(df) -> List[EntityDomain]:
    # process to extract entity domains...
    return entity_domains

def extract_datapoints(df) -> List[DataPoints]:
    # process to extract datapoints...
    return datapoints

def main():
    df = pd.read_csv(source_file)
    concepts = extract_concepts(df)
    domains = extract_entities(df)
    datapoints = extract_datapoints(df)

    ddf = DDF(concepts=concepts, domains=domains, datapoints=datapoints)
    ddf.to_csv('output/dir')
```

In the extract functions, we would use functions in pandas/ddf_utils to extract data from / transform the dataframe. So it would be more like recipes, where we have processes for datapoints/concepts/entities.
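For instance, an extract function could be sketched with plain pandas, reshaping a wide table into the long (entity, time, value) layout that DDF datapoint files use. The column names and data here are made up for illustration:

```python
import pandas as pd

# Hypothetical wide-format source: one column per indicator.
df = pd.DataFrame({
    "country": ["swe", "nor"],
    "year": [2000, 2000],
    "population": [8_870_000, 4_490_000],
})

def extract_datapoints(df):
    # Melt to long format, then split into one frame per indicator,
    # ready to be written out as ddf--datapoints--... files.
    long = df.melt(id_vars=["country", "year"],
                   var_name="concept", value_name="value")
    return {concept: group[["country", "year", "value"]]
            for concept, group in long.groupby("concept")}

datapoints = extract_datapoints(df)
```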

semio (Owner) commented May 21, 2020

> With reader, are you referring to the spreadsheets of fasttrack? How is that different from current fasttrack code?

Yes, if we only need to support the current fasttrack format, then it won't be much different. But if we want to handle multiple datasets or support different formats, making it a library should help.

semio (Owner) commented May 21, 2020

> retrace where it came from originally

Right, I think it's not easy for some procedures. For example run_op: we need to parse the operation string (e.g. "co2_emissions / population * 1000") to get the 2 base indicators, and then the co2_per_capita indicator will have 2 parent datasets.
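Parsing the operation string to find the base indicators seems doable with the standard library, since run_op expressions are ordinary arithmetic. A minimal sketch using Python's `ast` module (the function name `base_indicators` is made up):

```python
import ast

def base_indicators(op: str) -> set:
    """Collect the variable names referenced in an arithmetic expression."""
    tree = ast.parse(op, mode="eval")
    # ast.Name nodes are the indicator references; literals like 1000
    # show up as ast.Constant and are skipped.
    return {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}

base_indicators("co2_emissions / population * 1000")
# → {'co2_emissions', 'population'}
```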

I think we will have to run the recipe once to build a graph. Each procedure should inspect its inputs and outputs and update the graph. We can cache this graph somewhere in the etl/ folder to speed up later queries.
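One way the lineage graph could look, as a sketch with made-up names (`LineageGraph`, `record`, `trace`): each procedure records which inputs its outputs were derived from while the recipe runs, and tracing an indicator walks back to the roots. The cache-to-JSON part stands in for storing the graph under etl/:

```python
import json
from collections import defaultdict
from pathlib import Path

class LineageGraph:
    """Sketch of a lineage graph built while a recipe runs."""

    def __init__(self):
        self.parents = defaultdict(list)

    def record(self, output, inputs):
        # Called by each procedure with its output indicator and inputs.
        self.parents[output].extend(inputs)

    def trace(self, indicator):
        # Walk back through recorded parents to the root indicators
        # (those with no recorded parents of their own).
        stack, roots = [indicator], set()
        while stack:
            node = stack.pop()
            if self.parents.get(node):
                stack.extend(self.parents[node])
            else:
                roots.add(node)
        return roots

    def save(self, path):
        # Cache the graph (e.g. under etl/) so later queries can skip
        # re-running the whole recipe.
        Path(path).write_text(json.dumps(self.parents))

g = LineageGraph()
g.record("co2_per_capita", ["co2_emissions", "population"])
g.record("co2_emissions", ["co2_emissions@source-dataset"])
g.trace("co2_per_capita")
# → {'co2_emissions@source-dataset', 'population'}
```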
