More stuff.
arokem committed Jun 5, 2024
1 parent aefc5ed commit 4b54902
Showing 4 changed files with 118 additions and 76 deletions.
36 changes: 36 additions & 0 deletions references.bib
@@ -1,3 +1,39 @@

@ARTICLE{Rubel2022NWB,
title = "The Neurodata Without Borders ecosystem for neurophysiological
data science",
author = "R{\"u}bel, Oliver and Tritt, Andrew and Ly, Ryan and Dichter,
Benjamin K and Ghosh, Satrajit and Niu, Lawrence and Baker,
Pamela and Soltesz, Ivan and Ng, Lydia and Svoboda, Karel and
Frank, Loren and Bouchard, Kristofer E",
abstract = "The neurophysiology of cells and tissues are monitored
electrophysiologically and optically in diverse experiments and
species, ranging from flies to humans. Understanding the brain
requires integration of data across this diversity, and thus
these data must be findable, accessible, interoperable, and
reusable (FAIR). This requires a standard language for data and
metadata that can coevolve with neuroscience. We describe design
and implementation principles for a language for neurophysiology
data. Our open-source software (Neurodata Without Borders, NWB)
defines and modularizes the interdependent, yet separable,
components of a data language. We demonstrate NWB's impact
through unified description of neurophysiology data across
diverse modalities and species. NWB exists in an ecosystem, which
includes data management, analysis, visualization, and archive
tools. Thus, the NWB data language enables reproduction,
interchange, and reuse of diverse neurophysiology data. More
broadly, the design principles of NWB are generally applicable to
enhance discovery across biology through data FAIRness.",
journal = "eLife",
volume = 11,
month = oct,
year = 2022,
keywords = "FAIR data; Neurophysiology; archive; data ecosystem; data
language; data standard; human; mouse; neuroscience; rat",
language = "en"
}


@ARTICLE{Gorgolewski2016BIDS,
title = "The {Brain} {Imaging} {Data} {Structure}, a format for organizing and
describing outputs of neuroimaging experiments",
112 changes: 42 additions & 70 deletions sections/01-introduction.qmd
@@ -9,7 +9,7 @@ machine learning techniques, these datasets can help us understand everything
from the cellular operations of the human body, through business transactions
on the internet, to the structure and history of the universe. However, the
development of new machine learning methods, and data-intensive discovery more
generally depend on Findability, Accessibility, Interoperability and
Reusability (FAIR) of data [@Wilkinson2016FAIR].

One of the main mechanisms through which the FAIR principles are promoted is the
@@ -24,79 +24,51 @@ The importance of standards stems not only
fields about how research can best be conducted to take advantage of existing
and growing datasets, but also arises from an ongoing series of policy
discussions that address the interactions between research communities and the
general public. In the United States, these policies are expressed, for example
in memos issued by the directors of the White House Office of Science and
Technology Policy (OSTP), James Holdren (in 2013) and Alondra Nelson (in 2022).
While these memos focused primarily on making peer-reviewed publications funded
by the US Federal government available to the general public, they also lay an
increasingly detailed path toward the publication and general availability of
the data that is collected in research that is funded by the US government. The
general guidance and overall spirit of these memos dovetail with more specific
policy guidance related to data and metadata standards. The importance of
standards was underscored in a recent report by the Subcommittee on Open
Science of the National Science and Technology Council on the "Desirable
characteristics of data repositories for federally funded research"
[@nstc2022desirable]. The report explicitly called out the importance of
"allow[ing] datasets and metadata to be accessed, downloaded, or exported from
the repository in widely used, preferably non-proprietary, formats consistent
with standards used in the disciplines the repository serves." This highlights
the need for data and metadata standards across a variety of different kinds of
data. In addition, a report from the National Institute of Standards and
Technology on "U.S. Leadership in AI: A Plan for Federal Engagement in
Developing Technical Standards and Related Tools" emphasized that --
specifically for the case of AI -- "U.S. government agencies should prioritize
AI standards efforts that are [...] Consensus-based, [...] Inclusive and
accessible, [...] Multi-path, [...] Open and transparent, [...] and [that]
Result in globally relevant and non-discriminatory standards..." [@NIST2019].
The converging characteristics of standards that arise from these reports
suggest that considerable thought needs to be given to how standards arise, so
that these goals are achieved.

Standards for a specific domain can come about in various ways. Broadly
speaking, two kinds of mechanisms can generate a standard for a specific type
of data: (i) top-down: in this case a (usually) small group of people develops
the standard and disseminates it to the communities of interest with very
little input from these communities. An example of this mode of standards
development occurs when an instrument is developed by a manufacturer and users
of this instrument receive the data in a particular format that was developed
in tandem with the instrument; and (ii) bottom-up: in this case, standards are developed
by a larger group of people that convene and reach consensus about the details
of the standard in an attempt to cover a large range of use-cases. Most
standards are developed through an interplay between these two modes, and
understanding how to make the best of these modes is critical in advancing the
development of data and metadata standards.

One source of inspiration for community-driven development of robust, adaptable
and useful standards comes from open-source software (OSS). OSS has a long
history going back to the development of the Unix operating system in the late
1960s. Since its inception, the large community of developers and
users of OSS has developed a host of socio-technical mechanisms that support
the development and use of OSS. For example, the Open Source Initiative (OSI),
a non-profit organization that was founded in the 1990s, developed a set of
guidelines for licensing of OSS that is designed to protect the rights of
developers and users. On the more technical side, tools such as the Git
Source-code management system also support open-source development workflows.
When these social and technical innovations are put together they enable a host
of positive defining features of OSS, such as transparency, collaboration, and
decentralization. These features allow OSS to have a remarkable level of
dynamism and productivity, while also retaining the ability of a variety of
stakeholders to guide the evolution of the software to take their needs and
interests into account.

A necessary complement to these technical tools and legal instruments has been
a host of practices that define the social interactions \emph{within}
communities of OSS developers and users, and structures for governing these
communities. While many OSS communities started as projects led by individual
founders (so-called benevolent dictators for life, or BDFL; a title first
bestowed on the originator of the Python programming language, Guido van Rossum
[@Van_Rossum2008BDFL]), recent years have brought an increased understanding
that minimal standards of democratic governance are required for OSS
communities to develop and flourish. This has led to the adoption of codes of
conduct that govern the standards of behavior and communication among project
stakeholders. It has also led to the establishment of democratically elected
steering councils/committees from among the members and stakeholders of an OSS
project's community.

It was also within the Python community that an orderly process for
community-guided evolution of an open-source software project emerged, through
12 changes: 6 additions & 6 deletions sections/02-challenges.qmd
@@ -22,17 +22,17 @@ about the practical implications of changes to the standards.

## Unclear pathways for standards success

Standards typically develop organically through sustained and persistent efforts from dedicated
groups of data practitioners. These include scientists and the broader ecosystem of data curators and users. However, there is no playbook for the structure and components of a data standard, or for the pathway that moves a data implementation to a data standard.
As a result, data standardization lacks formal avenues for research grants.

## Cross-domain funding gaps

Data standardization investment is justified if the standard is generalizable beyond any specific science domain. However, while the use cases are grounded in domain sciences, data standardization is seen as data infrastructure rather than as a science investment. Moreover, because of how science research funding works, scientists lack incentives to work across domains or on infrastructure problems.

## Data instrumentation issues

Data for scientific observations are often generated by proprietary instrumentation, due to commercialization or other profit-driven incentives. There is a lack of regulatory oversight requiring adherence to available standards, or their evolution. Significant data transformation is required to bring data to a state that is amenable to standards, where these exist. Where they do not, there is little incentive to set aside the investment or resources needed to establish data standards.

## Sustainability

34 changes: 34 additions & 0 deletions sections/xx-use-cases.qmd
@@ -0,0 +1,34 @@
# Use cases

Meanwhile, the importance of standards is also increasingly understood in
research communities that are learning about the value of shared data
resources. While some fields, such as astronomy, high-energy physics and earth
sciences, have a relatively long history of shared data resources from
organizations such as LSST and CERN, other fields have only relatively recently
become aware of the value of data sharing and its impact.

For example, neuroscience has traditionally been a "cottage industry", where
individual labs have generated experimental data designed to answer specific
experimental questions. While this model still exists, the field has also seen
the emergence of new modes of data production that focus on generating large
shared datasets designed to answer many different questions, more akin to the
data generated in large astronomy data collection efforts. This change has been
brought on through a combination of technical advances in data acquisition
techniques, which now generate large and very high-dimensional/information-rich
datasets, cultural changes, which have ushered in new norms of transparency and
reproducibility (related to the policy discussions mentioned above), and
funding initiatives that have encouraged this kind of data collection
(including the US BRAIN Initiative and the Allen Institute for Brain Science).
Neuroscience presents an interesting example, because in response to these new
data resources, the field has had to establish new standards for data and
metadata that facilitate sharing and use of these data. Two examples are the
Neurodata Without Borders file format for neurophysiology data [@Rubel2022NWB]
and the Brain Imaging Data Structure standard for neuroimaging data
[@Gorgolewski2016BIDS].
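To make concrete what such a standard specifies, the sketch below lays out a
minimal BIDS-style directory and parses the metadata that BIDS encodes
directly in file names. This is an illustration, not a validator: the subject
label, the simplified pattern, and the temporary directory are all chosen for
the example.

```python
import re
import tempfile
from pathlib import Path

# Lay out a minimal BIDS-style tree: one subject, one anatomical scan.
# The file is empty; only the names matter for this sketch.
root = Path(tempfile.mkdtemp())
anat = root / "sub-01" / "anat"
anat.mkdir(parents=True)
(anat / "sub-01_T1w.nii.gz").touch()

# BIDS encodes metadata in the file name itself; this simplified pattern
# captures the subject entity and the data-type suffix.
pattern = re.compile(r"sub-(?P<subject>[0-9A-Za-z]+)_(?P<suffix>\w+)\.nii\.gz")
for f in sorted(root.rglob("*.nii.gz")):
    m = pattern.match(f.name)
    print(f.name, "->", m.groupdict())
```

Because the layout and names are standardized, tools that have never seen a
particular dataset can still discover its contents programmatically, which is
precisely the interoperability that FAIR calls for.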



## Automated discovery

## Citizen science
