MetPy and Siphon: What goes where? #1678
-
If it's not clear from what I wrote above, my personal leaning is the last option: pull back the met-specific remote data functionality and leave Siphon focused on THREDDS.
-
I continue to be keen on exposing other big and unique archives I maintain at Iowa State for usage within these tools. PS: I've recently been approached about writing a JSON representation of BUFKIT files in my archive for exposing in a web service. I haven't fully boggled the format yet, but you are right, it is a lot more structured than many things we deal with.
-
Thanks for kicking off this discussion @dopplershift. My reaction to Siphon is that it's a (very) useful thing for wrangling specific protocols for data access, but many of the details of those protocols leak through to the user, and I often find I have to stop and think about THREDDS and how an API request works to understand how to do anything that isn't copied directly from a recipe.

Also, as noted by @akrherz, a standard container for data is a key piece. xarray/pandas is the obvious choice in my mind. I continue to claim xarray is CF+NetCDF done right for humans: the presence of metadata finally leads to less, not more, computering for the user. I like how @akrherz puts together those two ideas: move everything having to do with data access and its translation into xarray into one place. That one place could be Siphon, and then MetPy deals only with calculations and plotting. However, that violates the principle that "MetPy is supposed to be the oracle for parsing well-known metr data formats."

Another thing to consider here is the wider ecosystem for earth science data. For instance, how would these decisions interface with the Intake effort that grew out of the Pangeo community? I don't think meteorological data formats are likely to grow more domain-specific with time; I'd think it's important to make them as easy to give to the rest of the world as possible.

I would also be totally fine with Siphon becoming tightly focused on THREDDS-only functionality and adding a more formal access layer to MetPy, which could have a thin shim to expose standard THREDDS servers (e.g., the one run by Unidata) in a somewhat more digested, user-readable form. Then MetPy's access layer becomes the thing that is widely used by the whole earth science community.

PS: We can also take a moment to appreciate how separate small packages for each format would be a bad idea. For example, with BUFKIT, one would need a BUFKIT-KIT. After the inevitable fork to beef up functionality, it would become BUF-BUFKIT-KIT. QED.
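To make the "less computering" claim concrete, here is a minimal, self-contained sketch (made-up values; not tied to any particular server or MetPy/Siphon API) of how an xarray Dataset carrying CF-style coordinate and unit metadata lets a user work in physical terms rather than array indices:

```python
import xarray as xr

# A tiny stand-in for what a data-access layer might hand back: values plus
# the coordinate and unit metadata needed to interpret them.
ds = xr.Dataset(
    {"temperature": (("pressure",), [15.0, 8.0, -5.0],
                     {"units": "degC", "standard_name": "air_temperature"})},
    coords={"pressure": ("pressure", [1000.0, 850.0, 700.0], {"units": "hPa"})},
)

# With the metadata attached, selection is by physical coordinate value,
# not by remembering which array index corresponds to 850 hPa.
t850 = ds["temperature"].sel(pressure=850.0)
```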
-
While I mull over the excellent other points raised, I want to clarify one thing right away: our current target container(s) for returning data from anywhere are xarray and pandas. My goal would be to get that to only xarray, since we get better unit support and it would be simpler for our community, IMO. I just haven't looked at what the ramifications actually would be for returning a sounding or METAR data as a `Dataset` rather than a `DataFrame`. Now, the actual current state of data returns:
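As one concrete illustration of where things currently stand: Siphon's NCSS client hands back a netCDF4.Dataset, and getting from there to xarray is left to the user. A rough sketch of that dance today (catalog URL, variable name, and point are just examples):

```python
import xarray as xr
from siphon.catalog import TDSCatalog

# Point at a THREDDS catalog and grab the first dataset's NCSS access point.
cat = TDSCatalog("https://thredds.ucar.edu/thredds/catalog/grib/"
                 "NCEP/GFS/Global_0p25deg/latest.xml")
ncss = cat.datasets[0].subset()

# Build a subset request for one variable at one point.
query = ncss.query()
query.variables("Temperature_isobaric").lonlat_point(-93.6, 41.5).accept("netcdf4")

# get_data returns a netCDF4.Dataset; wrapping it in xarray is a manual step,
# and it sometimes fails on what THREDDS returns.
nc = ncss.get_data(query)
ds = xr.open_dataset(xr.backends.NetCDF4DataStore(nc))
```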
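And to make the `Dataset`-versus-`DataFrame` question above concrete, here is a made-up mini-sounding in both containers (purely illustrative; neither library returns exactly this today):

```python
import pandas as pd
import xarray as xr

# Made-up mini-sounding
pressure = [1000.0, 850.0, 700.0]     # hPa
temperature = [22.1, 12.3, 4.5]       # degC
dewpoint = [18.0, 9.0, -2.0]          # degC

# As a DataFrame: simple and tabular, but units live only in our heads.
df = pd.DataFrame({"pressure": pressure,
                   "temperature": temperature,
                   "dewpoint": dewpoint})

# As a Dataset: pressure becomes a coordinate and units ride along as attrs,
# which is what gives us the better unit support mentioned above.
ds = xr.Dataset(
    {"temperature": ("pressure", temperature, {"units": "degC"}),
     "dewpoint": ("pressure", dewpoint, {"units": "degC"})},
    coords={"pressure": ("pressure", pressure, {"units": "hPa"})},
)
```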
-
I've become a big fan of working towards integrated efforts in the broader community, so I'd like to offer up another (though perhaps too radical) option: integrate MetPy with Intake and deprecate Siphon in favor of Intake-adjacent package(s).

My best read on the broader Pangeo ecosystem is that Intake is where the largest community effort towards the remote data access problem is going. If we're putting what Siphon does, as Unidata's Python remote data access package, under reconsideration, then taking the same opportunity to better integrate with community efforts seems like a good idea to me. Intake's big philosophy is breaking apart that "data source" to "Python data object" pipeline into generalized, interchangeable components. So, if we wanted to leverage that, we would then have to determine how your typical "Siphon + MetPy" workflow translates into Intake's components of catalogs and drivers. My take on this would be:
So, here, BUFKIT parsing would be entirely in MetPy, but remote access to community archives would be mediated through Intake catalogs (powered by the appropriate Intake drivers).

Moving beyond BUFKIT, most of Siphon's simplewebservice utilities that target remote web services could presumably be handled the same way. Then comes THREDDS. intake-thredds exists, but currently really only as a way to interpret THREDDS catalogs as Intake catalog objects and provide easy access via OPeNDAP. For an Intake-based replacement of what Siphon can do, at least an NCSS driver and a RadarServer catalog class would need to be implemented. The former strikes me as something that would be a fairly hefty task. The package also seems to still be in a fairly rough/early state given the lack of documentation.

It's also worth considering the standard data container problem raised by @akrherz and @deeplycloudy. Intake is nice in that it makes this more pluggable, but at least some target data format still needs to be decided on so that we can make sure Intake would have a suitable driver in all the cases we need to handle. Luckily, as far as MetPy v1 goes, gridded data is pretty much solved (xarray Datasets with CF-compliant metadata, which can be served by the intake-xarray drivers). Other data types are more nebulous, as pointed out in #1678 (comment), however. I'd also like to see MetPy's target container be xarray with CF metadata for pretty much everything, but for this to be provided by an Intake-based workflow from data formats that aren't already netCDF-like would require new Intake drivers to be created (though they could easily live in MetPy itself).

Some questions that would need resolution for this to be a viable path forward:
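For a rough feel of the user-facing side of that, here is a sketch using Intake's catalog API; the catalog URL, entry name, parameters, and the BUFKIT driver behind it are all hypothetical:

```python
import intake

# Hypothetical community catalog describing, e.g., the IEM/ISU archives.
cat = intake.open_catalog("https://example.com/iem_archives.yml")

# Each catalog entry is backed by a driver; a (not-yet-written) BUFKIT driver
# would be responsible for handing back the agreed-upon container,
# ideally an xarray.Dataset with CF metadata.
source = cat.bufkit_gfs(station="KDSM", run="2021-04-01T12:00")
ds = source.read()
```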
-
The outcome from our last developer call was:
-
So I am quasi on-the-hook to do something in this space and am boggling over what to do with my blunt-hammer coding skills. The well-thought-out and elegant solutions suggested by @jthielen and @dopplershift are a bit nebulous to me. I am tempted to just-write-code and do the following within MetPy.
Would this be welcome, or do I need to spend more time to understand the @jthielen solution? Since I control the server side too for my IEM APIs, my current goal there is Uffffffffffffff, signed confused in Iowa.
-
History
Once upon a time, MetPy had functions for getting soundings from Wyoming/Iowa State. Because we were re-inventing a lot of testing infrastructure that Siphon already had for mocking out server access, we decided to move them to Siphon, and draw the dividing line between MetPy and Siphon as:
Since that reformulation, Siphon has released support for the IGRA2 upper-air database, and Siphon's git repo has support for downloading SPC storm reports and hurricane track info from an NHC archive.
Data Containers and Interoperability
Since it's been a subsequent topic of conversation, I'll add here a note about data models/containers for the libraries. Our current target containers for returning data from both MetPy and Siphon are xarray and pandas. My goal would be to get that to only xarray, since we get better unit support and it would be simpler for our community, IMO. (It's not clear what the ramifications would be for returning a sounding or METAR data as a `Dataset` rather than a `DataFrame`.) Whether we go for full CF metadata compliance in those xarray representations is an open question, but we certainly would want to support the metadata we use. 😉 That target of containers is somewhat aspirational at the moment: Siphon doesn't return `Dataset`s by default, and in some places things can't even be opened as a `Dataset` because xarray chokes on some netCDF representations returned by THREDDS.
BUFKIT
BUFKIT is a format that we'd love to add support for (we've actually been approached by someone with existing code), but the format/access presents some things that muddy the waters:
What to do
Based on the current delineation for MetPy/Siphon, I'm having trouble deciding where functionality should live. What I don't want is for Siphon to have code for accessing some of the remote sounding files while MetPy has others. Here are some ideas that I kicked around with @lesserwhirls:
Put it in Siphon
Under this option, we put BUFKIT support in Siphon just like we do for the other sources. We break up the API such that there is a function/method that would allow someone to do `open('mydata.buf')` and pass it in to e.g. `read_bufkit`. It feels weird to have API in Siphon that works with a local file supplied by the user, given that MetPy is supposed to be the oracle for parsing well-known metr data formats.
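A minimal sketch of the shape of that API (the names, parser body, file, and URL are purely illustrative; no read_bufkit exists in either package yet):

```python
from io import StringIO

import requests


def read_bufkit(fobj):
    """Hypothetical parser: a real one would return an xarray/pandas object."""
    return fobj.read()


# Local file supplied by the user...
with open("mydata.buf") as f:
    data = read_bufkit(f)

# ...while remote access would wrap the same parser around a downloaded file.
resp = requests.get("https://example.com/archive/gfs3_kdsm.buf")
data = read_bufkit(StringIO(resp.text))
```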
Split up BUFKIT parsing and access
Put BUFKIT parsing in MetPy and put remote server access in Siphon. This would then require Siphon to gain a (maybe optional) dependency on MetPy. It feels odd to me to split up the functionality like this, but maybe there's some sense to it? Maybe?
Merge Siphon into MetPy
Just eliminate any distinction between MetPy and Siphon. Siphon's been in need of some love for a while, and it has languished with the effort focused on MetPy. Merging them would help solve that and decrease the maintenance burden of having two independent but highly related projects. The downsides of this are:
Put BUFKIT in MetPy and pull in others from Siphon
Under this, go ahead and put BUFKIT parsing and remote data access into MetPy. Also, to not confuse the community, bring back to MetPy the code to access some remote data sources that are metr-specific:
Future Python functionality to access metr-specific data sources (e.g. any clients for simple access to NEXRAD and GOES archives on S3) would also find a home in MetPy.
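As a sketch of the kind of thing such a client might wrap, using the public noaa-nexrad-level2 bucket (the listing/selection details are assumptions; a real client would filter out non-volume keys and offer a friendlier query interface):

```python
import s3fs
from metpy.io import Level2File

# Browse the public NEXRAD Level 2 archive on S3 anonymously.
fs = s3fs.S3FileSystem(anon=True)
keys = fs.ls("noaa-nexrad-level2/2021/04/01/KDMX")

# Parse the first key in the listing with MetPy's existing Level 2 reader
# (assuming it is a regular volume file).
with fs.open(keys[0], "rb") as f:
    radar = Level2File(f)
```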
I would also strongly suggest taking the opportunity to make sure we actually like the APIs we put in front of those data sources rather than just blindly dropping them in from Siphon. In adding this functionality, MetPy would gain dependencies on `beautifulsoup4` and `requests`, as well as a test dependency on `vcrpy`.

Siphon could hold onto the functionality of what's already been released and mark it as deprecated, not removing it for an extended period (though, to be fair, Siphon is still 0.x). Things like SPC and NHC, which have not been included in a released version, I would advocate we go ahead and remove.
This would redraw the dividing line as:
With that as Siphon's reduced mission, it likely would not suffer as badly from infrequent maintenance.
Conclusion
Those are my thoughts. Am I missing any ramifications or benefits to any of those options? Is there another option I'm missing?
Explicitly tagging @lesserwhirls, @dcamron, @jrleeman, @jthielen, @kgoebber, @deeplycloudy for thoughts and input. Happy to have feedback/input from any other members of our community who read the novella I've written above.