More edits towards finalization #17

Merged
merged 8 commits on Sep 9, 2024
123 changes: 123 additions & 0 deletions references.bib
@@ -1,3 +1,126 @@
@ARTICLE{Hanisch2015-cu,
title = "The Virtual Astronomical Observatory: Re-engineering access to
astronomical data",
author = "Hanisch, R J and Berriman, G B and Lazio, T J W and Emery Bunn, S
and Evans, J and McGlynn, T A and Plante, R",
journal = "Astron. Comput.",
publisher = "Elsevier BV",
volume = 11,
pages = "190--209",
abstract = "The US Virtual Astronomical Observatory was a software
infrastructure and development project designed both to begin the
establishment of an operational Virtual Observatory (VO) and to
provide the US coordination with the international VO effort. The
concept of the VO is to provide the means by which an astronomer
is able to discover, access, and process data seamlessly,
regardless of its physical location. This paper describes the
origins of the VAO, including the predecessor efforts within the
US National Virtual Observatory, and summarizes its main
accomplishments. These accomplishments include the development of
both scripting toolkits that allow scientists to incorporate VO
data directly into their reduction and analysis environments and
high-level science applications for data discovery, integration,
analysis, and catalog cross-comparison. Working with the
international community, and based on the experience from the
software development, the VAO was a major contributor to
international standards within the International Virtual
Observatory Alliance. The VAO also demonstrated how an
operational virtual observatory could be deployed, providing a
robust operational environment in which VO services worldwide
were routinely checked for aliveness and compliance with
international standards. Finally, the VAO engaged in community
outreach, developing a comprehensive web site with on-line
tutorials, announcements, links to both US and internationally
developed tools and services, and exhibits and hands-on training
at annual meetings of the American Astronomical Society and
through summer schools and community days. All digital products
of the VAO Project, including software, documentation, and
tutorials, are stored in a repository for community access. The
enduring legacy of the VAO is an increasing expectation that new
telescopes and facilities incorporate VO capabilities during the
design of their data management systems.",
month = jun,
year = 2015,
language = "en"
}

@ARTICLE{Larobina2023-vq,
title = "Thirty years of the {DICOM} standard",
author = "Larobina, Michele",
journal = "Tomography",
publisher = "mdpi.com",
volume = 9,
number = 5,
pages = "1829--1838",
abstract = "Digital Imaging and Communications in Medicine (DICOM) is an
international standard that defines a format for storing medical
images and a protocol to enable and facilitate data communication
among medical imaging systems. The DICOM standard has been
instrumental in transforming the medical imaging world over the
last three decades. Its adoption has been a significant
experience for manufacturers, healthcare users, and research
scientists. In this review, thirty years after introducing the
standard, we discuss the innovation, advantages, and limitations
of adopting the DICOM and its possible future directions.",
month = oct,
year = 2023,
keywords = "DICOM; communication protocols; file formats; metadata;
quantitative imaging",
language = "en"
}

@INPROCEEDINGS{Mustra2008-xk,
title = "Overview of the {DICOM} standard",
author = "Mustra, Mario and Delac, Kresimir and Grgic, Mislav",
booktitle = "2008 50th International Symposium ELMAR",
publisher = "IEEE",
volume = 1,
pages = "39--44",
abstract = "Digital technology has in the last few decades entered almost
every aspect of medicine. There has been a huge development in
noninvasive medical imaging equipment. Because there are many
medical equipment manufacturers, a standard for storage and
exchange of medical images needed to be developed. DICOM (Digital
Imaging and Communication in Medicine) makes medical image
exchange more easy and independent of the imaging equipment
manufacturer. Besides the image data, DICOM file format supports
other information useful to describe the image. This makes DICOM
easy to use and the data exchange fast and safe while avoiding
possible confusion caused by multiple files for the same study.",
month = sep,
year = 2008
}


@ARTICLE{Scroggins2020-ut,
title = "Once {FITS}, Always {FITS}? Astronomical Infrastructure in
Transition",
author = "Scroggins, Michael and Boscoe, Bernadette M",
journal = "IEEE Ann. Hist. Comput.",
publisher = "IEEE",
volume = 42,
number = 2,
pages = "42--54",
abstract = "The flexible interchange transport system (FITS) file format has
become the de facto standard for sharing, analyzing, and
archiving astronomical data over the last four decades. FITS was
adopted by astronomers in the early 1980s to overcome
incompatibilities between operating systems. On the back of FITS’
success, astronomical data became both backward compatible and
easily shareable. However, new advances in the astronomical
instrumentation, computational technologies, and analytic
techniques have resulted in new data that do not work well within
the traditional FITS format. Tensions have arisen between the
desire to update the format to meet new analytic challenges and
adherence to the original edict for the FITS file format to be
backward compatible. We examine three inflection points in the
governance of FITS: first, initial development and success,
second, widespread acceptance and governance by the working
group, and third, the challenges to FITS in a new era of
increasing data and computational complexity within astronomy.",
year = 2020
}


@ARTICLE{Musen2022metadata,
title = "Without appropriate metadata, data-sharing mandates are
75 changes: 59 additions & 16 deletions sections/02-use-cases.qmd
@@ -20,17 +20,34 @@ Image Transport System) file format standard, which was developed in the late
astronomy data preservation and exchange. Essentially every software platform
used in astronomy reads and writes the FITS format. It was developed by
observatories in the 1980s to store image data in the visible and x-ray
spectrum. It has been endorsed by the International Astronomical Union (IAU),
as well as funding agencies. Though the format has evolved over time, “once
FITS, always FITS”. That is, the format cannot be evolved to introduce changes
that break backward compatibility. Among the features that make FITS so durable
is that it was designed originally to have a very restricted metadata schema.
That is, FITS records were designed to be the lowest common denominator of word
lengths in computer systems at the time. However, while FITS is compact, its
ability to encode the coordinate frame and pixels means that data from
different observational instruments can be stored in this format and related to
each other, rendering manual and error-prone procedures for conforming images
obsolete. Nevertheless, this stability has also raised some issues as the field
continues to adapt to new measurement methods and to the demands of
ever-increasing data volumes and complex data-analysis use-cases, such as
interchange with other data and the use of complex databases to store and share
data [@Scroggins2020-ut].
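
As an illustration of this design, a FITS header is a flat sequence of
fixed-width keyword/value records. The following is a minimal sketch of
inspecting one with the community-developed astropy package (the file name is
hypothetical):

```python
# A minimal sketch of inspecting FITS metadata with astropy
# (the file name "observation.fits" is hypothetical).
from astropy.io import fits

with fits.open("observation.fits") as hdul:
    header = hdul[0].header  # the primary header: 80-character keyword records
    print(header["NAXIS"])   # number of data axes
    print(header["BITPIX"])  # word length of each pixel value
    # World Coordinate System (WCS) keywords tie pixel indices to sky coordinates
    print(header.get("CTYPE1"), header.get("CRVAL1"))
```
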
Another prominent example of the use of open-source processes to develop
standards in astronomy is the set of tools and protocols developed by the
International Virtual Observatory Alliance (IVOA) and its national
implementations, e.g., the US Virtual Astronomical Observatory
[@Hanisch2015-cu]. Virtual observatories facilitate discovery of and access to
data across observatories around the world and underpin data discovery in
astronomy. The IVOA took inspiration from the World Wide Web Consortium (W3C)
and adopted its process for developing standards (i.e., Working Drafts
$\rightarrow$ Proposed Recommendations $\rightarrow$ Recommendations), with
individual standards developed by inter-institutional and international working
groups. One outcome of this coordination effort is an ecosystem of software
tools, developed both within observatory teams and within the user community,
that interoperates with the standards adopted by the observatories.
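
As a concrete illustration of this interoperability, the following is a minimal
sketch of querying a service that implements the IVOA Table Access Protocol
(TAP) Recommendation, using the community-developed pyvo package (the endpoint
URL is hypothetical):

```python
# A minimal sketch of querying a Virtual Observatory service with pyvo
# (the endpoint URL is hypothetical; ivoa.obscore is the IVOA ObsCore table).
import pyvo

# TAP (Table Access Protocol) is one of the IVOA Recommendations
service = pyvo.dal.TAPService("https://example-observatory.org/tap")
results = service.search("SELECT TOP 10 obs_id, s_ra, s_dec FROM ivoa.obscore")
print(results.to_table())
```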

## High-energy physics (HEP)

@@ -47,13 +64,38 @@ data is shared (i.e., in a standards-compliant manner).

## Earth sciences

The need for geospatial data exchange between different systems began to be
recognized in the 1970s and 1980s, but proprietary formats still dominated.
Coordinated standardization efforts brought the establishment of the Open
Geospatial Consortium (OGC) in the 1990s, a critical step towards open
standards for geospatial data. The 1990s also saw the development of key
standards such as the Network Common Data Form (NetCDF), developed by the
University Corporation for Atmospheric Research (UCAR), and the Hierarchical
Data Format (HDF), a set of file formats (HDF4, HDF5) that are widely used,
particularly in climate research. The GeoTIFF format, which originated at NASA
in the late 1990s, is extensively used to share image data. In the 1990s, open
web mapping also began with MapServer (https://mapserver.org) and continued
later with other projects such as OpenStreetMap
(https://www.openstreetmap.org). The following two decades, the 2000s-2020s,
brought an expansion of open standards and integration with web technologies
developed by OGC, as well as other standards such as the Keyhole Markup
Language (KML) for displaying geographic data in Earth browsers. Formats
suitable for cloud computing also emerged, such as the Cloud Optimized GeoTIFF
(COG), followed by Zarr and Apache Parquet for array and tabular data,
respectively. In 2006, the Open Source Geospatial Foundation (OSGeo,
https://www.osgeo.org) was established, demonstrating the community's
commitment to the development of open-source geospatial technologies. While
some standards have been developed in industry (e.g., KML by Keyhole Inc.,
which Google later acquired), they later became international standards of the
OGC, which now encompasses more than 450 commercial, governmental, nonprofit,
and research organizations working together on the development and
implementation of open standards (https://www.ogc.org).
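
To make the contrast between these formats concrete, the following is a minimal
sketch of writing the same gridded data to both NetCDF and Zarr using the
xarray package (file names are hypothetical; assumes the netCDF4 and zarr
backends are installed):

```python
# A minimal sketch of writing the same gridded data to NetCDF and Zarr
# (file names are hypothetical; requires the netCDF4 and zarr backends).
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"temperature": (("lat", "lon"), np.random.rand(180, 360))},
    coords={"lat": np.arange(-90, 90), "lon": np.arange(0, 360)},
)
ds.to_netcdf("climate.nc")   # NetCDF: a single self-describing binary file
ds.to_zarr("climate.zarr")   # Zarr: a chunked store suited to cloud object storage
```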

## Neuroscience

In contrast to the previously mentioned fields, neuroscience has traditionally
been a "cottage industry", where individual labs have generated experimental
data designed to answer specific experimental questions. While this model still
exists, the field has also seen the emergence of new modes of data production
that focus on generating large shared datasets designed to answer many
different questions, more akin to the data generated in large astronomy data
@@ -72,7 +114,7 @@ success to the adoption of OSS development mechanisms [@Poldrack2024BIDS]. For
example, small changes to the standard are managed through the GitHub pull
request mechanism; larger changes are managed through a BIDS Enhancement
Proposal (BEP) process that is directly inspired by the Python programming
language community's Python Enhancement Proposal procedure, which is used to
introduce new ideas into the language. Though the BEP mechanism takes a
slightly different technical approach, it tries to emulate the open-ended and
community-driven aspects of Python development to accept contributions from a
@@ -102,3 +144,4 @@ if the standard is developed using git/GitHub for versioning, this would
require learning the complex and obscure technical aspects of these systems that
are far from easy to adopt, even for many professional scientists.


19 changes: 18 additions & 1 deletion sections/03-challenges.qmd
@@ -31,6 +31,12 @@ community, and migration away from the standard. Similarly, if a standard
evolves too rapidly, users may choose to stick to an outdated version of a
standard for a long time, creating strains on the community of developers and
maintainers of a standard who will need to accommodate long deprecation cycles.
On the other hand, in cases in which some forms of dynamic change are prohibited
-- as in the case of the FITS file format, which prohibits changes that break
backward compatibility -- there is also a cost associated with this stability
[@Scroggins2020-ut]: limiting the adoption and combination of new types of
measurements, new analysis methods, and new modes of data storage and data
sharing.

## Mismatches between standards developers and user communities

@@ -56,6 +62,18 @@ have not yet had significant adoption as tools of day-to-day computational
practice. At the same time, it provides clarity and robustness for standards
developers communities that are well-versed in these tools.

Another layer of potential mismatches arises when a more complex set of
stakeholders needs to be considered. For example, the Group on Earth
Observations (GEO) is a network that aims to coordinate decision-making around
satellite missions and to standardize the data that result from these missions.
Because this group involves a range of different stakeholders, including
individuals who more closely understand potential legal issues and researchers
who are better equipped to evaluate technical and domain questions,
communication is slower and more difficult. As the group aims to move forward
by consensus, these communication difficulties can slow progress. This is just
one example of the many cases in which an OSS process that strives for
consensus can slow progress.


## Cross-domain gaps

@@ -146,6 +164,5 @@ grants (and see @sec-cross-sector). This hampers the long-term trajectory that
is needed to inculcate a standard into the day-to-day practice of researchers.




21 changes: 18 additions & 3 deletions sections/04-cross-sector.qmd
@@ -91,9 +91,24 @@ provide specific sources of friction. This is because proprietary/closed
formats of data can create difficulty at various transition points: from one
instrument vendor to another, from data producer to downstream recipient/user,
etc. On the other hand, in some cases, cross-sector collaborations with
commercial entities may pave the way to robust and useful standards. For
example, imaging measurements in human subjects (e.g., in brain imaging
experiments) interact significantly with standards for medical imaging, chiefly
the Digital Imaging and Communications in Medicine (DICOM) standard, which is
widely used in a range of medical imaging applications, including in clinical
settings [@Larobina2023-vq; @Mustra2008-xk]. The standard emerged from the
demands of clinical practice in the 1980s, as digital technologies came into
widespread use in medical imaging, through joint work of industry
organizations: the American College of Radiology and the National Electrical
Manufacturers Association. One of the defining features of the DICOM standard
is that it allows manufacturers of instruments to define "private fields" that
are compliant with the standard, but which may include idiosyncratically
organized data and/or metadata. This provides significant flexibility, but can
also easily lead to the loss of important information. Nevertheless, human
brain imaging exemplifies a case in which industry standards and research
standards coexist and need to communicate with each other effectively to
advance research use-cases, while keeping up with the rapid development of the
technologies.
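
To illustrate what such a private field looks like in practice, the following
is a minimal sketch using the community-developed pydicom package (the group
number, creator string, and values are hypothetical):

```python
# A minimal sketch of a vendor-defined "private field" with pydicom
# (the group number, creator string, and values are hypothetical).
from pydicom.dataset import Dataset

ds = Dataset()
ds.PatientName = "Anonymous"  # a standard, public DICOM attribute
# A vendor reserves a private block in an odd-numbered group by
# registering a "private creator" string, then adds elements to it.
block = ds.private_block(0x0009, "ACME_SCANNER", create=True)
block.add_new(0x01, "LO", "vendor-specific acquisition setting")
print(ds)
```

Standard-compliant readers can safely skip such a block, but the information it
carries is typically interpretable only with vendor documentation, which is how
flexibility can shade into information loss.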


