diff --git a/references.bib b/references.bib index 51e5bb2..7eaea4e 100644 --- a/references.bib +++ b/references.bib @@ -1,3 +1,39 @@ + +@ARTICLE{Rubel2022NWB, + title = "The Neurodata Without Borders ecosystem for neurophysiological + data science", + author = "R{\"u}bel, Oliver and Tritt, Andrew and Ly, Ryan and Dichter, + Benjamin K and Ghosh, Satrajit and Niu, Lawrence and Baker, + Pamela and Soltesz, Ivan and Ng, Lydia and Svoboda, Karel and + Frank, Loren and Bouchard, Kristofer E", + abstract = "The neurophysiology of cells and tissues are monitored + electrophysiologically and optically in diverse experiments and + species, ranging from flies to humans. Understanding the brain + requires integration of data across this diversity, and thus + these data must be findable, accessible, interoperable, and + reusable (FAIR). This requires a standard language for data and + metadata that can coevolve with neuroscience. We describe design + and implementation principles for a language for neurophysiology + data. Our open-source software (Neurodata Without Borders, NWB) + defines and modularizes the interdependent, yet separable, + components of a data language. We demonstrate NWB's impact + through unified description of neurophysiology data across + diverse modalities and species. NWB exists in an ecosystem, which + includes data management, analysis, visualization, and archive + tools. Thus, the NWB data language enables reproduction, + interchange, and reuse of diverse neurophysiology data. More + broadly, the design principles of NWB are generally applicable to + enhance discovery across biology through data FAIRness.", + journal = "eLife", + volume = 11, + month = oct, + year = 2022, + keywords = "FAIR data; Neurophysiology; archive; data ecosystem; data + language; data standard; human; mouse; neuroscience; rat", + language = "en" +} + + @ARTICLE{Gorgolewski2016BIDS, title = "The {Brain} {Imaging} {Data} {Structure}, a format for organizing and describing outputs of neuroimaging experiments", diff --git a/sections/01-introduction.qmd b/sections/01-introduction.qmd index 415c614..a117b99 100644 --- a/sections/01-introduction.qmd +++ b/sections/01-introduction.qmd @@ -9,7 +9,7 @@ machine learning techniques, these datasets can help us understand everything from the cellular operations of the human body, through business transactions on the internet, to the structure and history of the universe. However, the development of new machine learning methods, and data-intensive discovery more -generally, rely heavily on Findability, Accessibility, Interoperability and +generally, depend on Findability, Accessibility, Interoperability and Reusability (FAIR) of data [@Wilkinson2016FAIR]. One of the main mechanisms through which the FAIR principles are promoted is the @@ -24,79 +24,51 @@ The importance of standards stems not only from discussions within research fields about how research can best be conducted to take advantage of existing and growing datasets, but also arises from an ongoing series of policy discussions that address the interactions between research communities and the -general public. In the United States, memos issued in 2013 and 2022 by the -directors of the White House Office of Science and Technology Policy (OSTP), -James Holdren (2013) and Alondra Nelson (2022).
While these memos focused -primarily on making peer-reviewed publications funded by the US Federal -government available to the general public, they also lay an increasingly -detailed path towards the publication and general availability of the data that -is collected as part of the research that is funded by the US government. +general public. In the United States, these policies are expressed, for example, +in memos issued by the directors of the White House Office of Science and +Technology Policy (OSTP), James Holdren (in 2013) and Alondra Nelson (in 2022). +While these memos focused primarily on making peer-reviewed publications funded +by the US Federal government available to the general public, they also lay an +increasingly detailed path toward the publication and general availability of +the data that is collected in research that is funded by the US government. The +general guidance and overall spirit of these memos dovetail with more specific +policy guidance related to data and metadata standards. The importance of +standards was underscored in a recent report by the Subcommittee on Open +Science of the National Science and Technology Council on the "Desirable +characteristics of data repositories for federally funded research" +[@nstc2022desirable]. The report explicitly called out the importance of +"allow[ing] datasets and metadata to be accessed, downloaded, or exported from +the repository in widely used, preferably non-proprietary, formats consistent +with standards used in the disciplines the repository serves." This highlights +the need for data and metadata standards across a variety of different kinds of +data. In addition, a report from the National Institute of Standards and +Technology on "U.S. Leadership in AI: A Plan for Federal Engagement in +Developing Technical Standards and Related Tools" emphasized that -- +specifically for the case of AI -- "U.S. government agencies should prioritize +AI standards efforts that are [...] Consensus-based, [...] Inclusive and +accessible, [...] Multi-path, [...] Open and transparent, [...] and [that] +Result in globally relevant and non-discriminatory standards..." [@NIST2019]. +The converging characteristics of standards that arise from these reports +suggest that considerable thought needs to be given to how standards arise, so +that these goals are achieved. -The general guidance and overall spirit of these memos dovetail with more -specific policy discussions that put meat on the bones of the general guidance. -The importance of data and metadata standards, for example, was underscored in -a recent report by the Subcommittee on Open Science of the National Science and -Technology Council on the "Desirable characteristics of data repositories for -federally funded research" [@nstc2022desirable]. The report explicitly called -out the importance of "allow[ing] datasets and metadata to be accessed, -downloaded, or exported from the repository in widely used, preferably -non-proprietary, formats consistent with standards used in the disciplines the -repository serves." This highlights the need for data and metadata standards -across a variety of different kinds of data. In addition, a report from the -National Institute of Standards and Technology on "U.S. Leadership in AI: A -Plan for Federal Engagement in Developing Technical Standards and Related -Tools" emphasized that -- specifically for the case of AI -- "U.S. government -agencies should prioritize AI standards efforts that are [...] Consensus-based, -[...]
Inclusive and accessible, [...] Multi-path, [...] Open and transparent, -[...] and [that] Result in globally relevant and non-discriminatory -standards..." [@NIST2019]. The converging characteristics of standards that -arise from these reports suggest that considerable thought needs to be given to -the manner in which standards arise, so that these goals are achieved. - -Standards for a specific domain can come about in various ways. Broadly -speaking two kinds of mechanisms can generate a standard for a specific type of -data: (i) top-down: in this case a (usually) small group of people develop the -standard and disseminate it to the communities of interest with very little -input from these communities. An example of this mode of standards development -can occur when an instrument is developed by a manufacturer and users of this -instrument receive the data in a particular format that was developed in tandem -with the instrument; and (ii) bottom-up: in this case, standards are developed -by a larger group of people that convene and reach consensus about the details -of the standard in an attempt to cover a large range of use-cases. Most -standards are developed through an interplay between these two modes, and -understanding how to make the best of these modes is critical in advancing the -development of data and metadata standards. - -One source of inspiration for bottom-up development of robust, adaptable and -useful standards comes from open-source software (OSS). OSS has a long history -going back to the development of the Unix operating system in the late 1960s. -Over the time since its inception, the large community of developers and users -of OSS have have developed a host of socio-technical mechanisms that support +One source of inspiration for community-driven development of robust, adaptable +and useful standards comes from open-source software (OSS). OSS has a long +history going back to the development of the Unix operating system in the late +1960s. In the time since its inception, the large community of developers and +users of OSS have developed a host of socio-technical mechanisms that support the development and use of OSS. For example, the Open Source Initiative (OSI), -a non-profit organization that was founded in 1990s has evolved a set of +a non-profit organization that was founded in the 1990s, developed a set of guidelines for licensing of OSS that is designed to protect the rights of -developers and users. Technical tools to support the evolution of open-source -software include software for distributed version control, such as the Git -Source-code management system. When these social and technical innovations are -put together they enable a host of positive defining features of OSS, such as -transparency, collaboration, and decentralization. These features allow OSS to -have a remarkable level of dynamism and productivity, while also retaining the -ability of a variety of stakeholders to guide the evolution of the software to -take their needs and interests into account. +developers and users. On the more technical side, tools such as the Git +source-code management system also support open-source development workflows. +When these social and technical innovations are put together, they enable a host +of positive defining features of OSS, such as transparency, collaboration, and +decentralization.
These features allow OSS to have a remarkable level of +dynamism and productivity, while also retaining the ability of a variety of +stakeholders to guide the evolution of the software to take their needs and +interests into account. -A necessary complement to these technical tools and legal instruments have been -a host of practices that define the social interactions \emph{within} -communities of OSS developers and users, and structures for governing these -communities. While many OSS communities started as projects led by individual -founders (so-called benevolent dictators for life, or BDFL; a title first -bestowed on the originator of the Python programming language, Guido Van Rossum -\cite{Van_Rossum2008BDFL}), recent years have led to an increased understanding -that minimal standards of democratic governance are required in order for OSS -communities to develop and flourish. This has led to the adoption of codes of -conduct that govern the standards of behavior and communication among project -stakeholders. It has also led to the establishment of democratically elected -steering councils/committees from among the members and stakeholders of an OSS -project's community. It was also within the Python community that an orderly process for community-guided evolution of an open-source software project emerged, through diff --git a/sections/02-challenges.qmd b/sections/02-challenges.qmd index c47d619..c04c7b8 100644 --- a/sections/02-challenges.qmd +++ b/sections/02-challenges.qmd @@ -22,17 +22,17 @@ about the practical implications of changes to the standards. ## Unclear pathways for standards success -Standards typically develop organically through sustained and persistent efforts from dedicated -groups of data practitioneers. These include scientists and the broader ecosystem of data curators and users. However there is no playbook on the structure and components of a data standard, or the pathway that moves a data implementation to a data standard. -As a result, data standardization lacks formal avenues for research grants. +Standards typically develop organically through sustained and persistent efforts from dedicated +groups of data practitioners. These include scientists and the broader ecosystem of data curators and users. However, there is no playbook describing the structure and components of a data standard, or the pathway that turns a data implementation into a data standard. +As a result, data standardization lacks formal avenues for research grants. ## Cross domain funding gaps -Data standardization investment is justified if the standard is generalizable beyond any specific science domain. However while the use cases are domain sciences based, data standardization is seen as a data infrastrucutre and not a science investment. Moreover due to how science research funding works, scientists lack incentives to work across domains, or work on infrastructure problems. +Data standardization investment is justified if the standard is generalizable beyond any specific science domain. However, while the use cases are grounded in specific science domains, data standardization is seen as data infrastructure rather than as a science investment. Moreover, because of how science research funding works, scientists lack incentives to work across domains or on infrastructure problems. -## Data instrumentation issues +## Data instrumentation issues -Data for scientific observations are often generated by proprietary instrumentation due to commercialization or other profit driven incentives.
There islack of regulatory oversight to adhere to available standards or evolve Significant data transformation is required to get data to a state that is amenable to standards, if available. If not available, there is lack of incentive to set aside investment or resources to invest in establishing data standards. +Data for scientific observations are often generated by proprietary instrumentation due to commercialization or other profit-driven incentives. There is a lack of regulatory oversight to ensure adherence to available standards or their evolution. Significant data transformation is required to get data to a state that is amenable to standards, where such standards are available. Where they are not available, there is a lack of incentive to set aside investment or resources to establish data standards. ## Sustainability diff --git a/sections/xx-use-cases.qmd b/sections/xx-use-cases.qmd new file mode 100644 index 0000000..9497b11 --- /dev/null +++ b/sections/xx-use-cases.qmd @@ -0,0 +1,34 @@ +# Use cases + +The importance of standards is also increasingly understood in +research communities that are learning about the value of shared data +resources. While some fields, such as astronomy, high-energy physics, and earth +sciences, have a relatively long history of shared data resources from +organizations such as LSST and CERN, other fields have only relatively recently +become aware of the value of data sharing and its impact. + +For example, neuroscience has traditionally been a "cottage industry", where +individual labs have generated experimental data designed to answer specific +experimental questions. While this model still exists, the field has also seen +the emergence of new modes of data production that focus on generating large +shared datasets designed to answer many different questions, more akin to the +data generated in large astronomy data collection efforts. This change has been +brought about by a combination of technical advances in data acquisition +techniques, which now generate large, high-dimensional, information-rich +datasets; cultural changes, which have ushered in new norms of transparency and +reproducibility (related to the policy discussions mentioned above); and +funding initiatives that have encouraged this kind of data collection +(including the US BRAIN Initiative and the Allen Institute for Brain Science). +Neuroscience presents an interesting example because, in response to these new +data resources, the field has had to establish new standards for data and +metadata that facilitate the sharing and use of these data. Two examples are the +Neurodata Without Borders file format for neurophysiology data [@Rubel2022NWB] +and the Brain Imaging Data Structure standard for neuroimaging data +[@Gorgolewski2016BIDS]. + + + +## Automated discovery + +## Citizen science