Skip to content

Conversation

rcaneill
Copy link
Collaborator

This PR makes possible to give a dataset as argument to gsw functions, and based on cf standard name the cf_xarray wrapper will pick the proper arguments. Example:

import xarray as xr
import gsw_xarray as gsw

ds = xr.open_dataset('the_dataset.nc')
gsw.sigma0(ds.SA, ds.CT)
# Or you can now use
gsw.sigma0(ds=ds)

@rcaneill
Copy link
Collaborator Author

@DocOtak What do you think of the API (I haven't written the doc yet)? To use the option one can simply give a dataset as keyword argument, i.e. gsw.sigma0(ds=ds).

@rcaneill
Copy link
Collaborator Author

rcaneill commented Jul 30, 2022

Many arguments don't have a cf standard name. It is possible to add custom criteria in cf-xarray, so we could write our own standard names (and open a request to add them in the cf-convention) https://cf-xarray.readthedocs.io/en/latest/custom-criteria.html.
Or another option is to make the autodetect option only available for the parameters that have a standard name. What do you think?

I would be in favor of the 1st option (our own criteria), this will make things way easier for #30 (regarding the solving of the graph)

@DocOtak
Copy link
Owner

DocOtak commented Aug 1, 2022

@rcaneill apologies for delayed responses, I have been on a long holiday and have been attempting to actually "relax"... I'll be back "full time" next week.

I'm not sure about the gsw.func(ds=ds) type interface, all the functions (that I can recall) which take an entire Dataset does something to all variables in the dataset. Are you aware of any non "built in" numpy ufuncs that take entire datasets?

Elegant data selection is a hard problem faced by my seagoing group at Scripps... the functions that I have seen which take entire pandas dataframes are usually very messy.

@rcaneill
Copy link
Collaborator Author

rcaneill commented Aug 4, 2022

I see your point. Would you then not be in favor of having this kind of detection, or is it more an API problem (that we could replace by something like ds.gsw['sigma0'] or ds.gsw.sigma0())?

@rcaneill
Copy link
Collaborator Author

Hello @DocOtak , I'm pinging you about my last question :)

@DocOtak
Copy link
Owner

DocOtak commented Nov 22, 2022

Hi @rcaneill, greetings from pre cruise quarantine in Tahiti.

I think that we should avoid changing the function signature of the wrapped upstream functions and instead rely on the "accessor" pattern as you suggest: ds.gsw.sigma0().

I like the ds.gsw["sigma0"] form for simplicity. However, also supporting the function call way might allow some selection of which variable to use for input properties:

ds.gsw.sigma0(SP="psal1")

Where psal1 is one of many practical salinity variables we have in hypothetical dataset ds. The func then calculates SA using this psal variable as an input for sigma0.

@rcaneill
Copy link
Collaborator Author

I like your idea of using function call to use variable selection. That would easily tackle the issue of setting which variable to use when multiple fields are present with the same standard name.
I'll try to give a try to add an accessor (when I'll have time)

Enjoy your future field work :)

@rcaneill
Copy link
Collaborator Author

I had some (hopefully good) idea while cycling to university today:

  • I like to keep the ability to use __getitem__ calls: ds.gsw['sigma0']. It makes it easy to e.g. loop through a list of strings of properties we want to compute
  • It makes sense to me to also allow to pass a list of strings to getitem: ds.gsw[['sigma0', 'sigma1']]
  • We keep the possibility of function call: ds.gsw.sigma0(SP="psal1")
  • To tackle the choice of e.g. default practical salinity (CTD, bottle, etc), we add a property/list/dict (as we prefer) on the gsw accessor with default variables to use:
ds.gsw.defaults['sea_water_practical_salinity'] = 'psal1'
  • (This point is more relevant to Auto Discovery of Calculable Values #30 ) We can cache intermediate variables used for computation (if a variable needs steps in between), cf xarray doc on accessor. This can be set as an option (for users with small datasets it is a nice option, but for big datasets it can be a bad option), either through context, settings, or arguments: ds.gsw.sigma0(use_cache=False)

@rcaneill rcaneill changed the title Autodect parameters from dataset Autodect parameters from dataset -- add dataset accessor Nov 24, 2022
@rcaneill rcaneill requested a review from DocOtak November 24, 2022 10:16
@rcaneill
Copy link
Collaborator Author

rcaneill commented Nov 24, 2022

At commit e81063c we have a working version of the accessor! (we can't use getitem with lists, and there is no default option yet)
Could you have a look?

@rcaneill
Copy link
Collaborator Author

rcaneill commented Nov 24, 2022

(We can now use lists in getitem)

@rcaneill
Copy link
Collaborator Author

rcaneill commented Jan 10, 2023

(cf #57 (comment))

Roadmap of this PR:

  • Make it work for functions using cf standard names, and let it fail for other functions

Optional:

  • add detection of reference scale t90 and t68
    (we assume all in situ temperature are ITS-90)
  • deal with t argument that is temperature of sea-ice or of sea-water depending on the function's name
  • deal with p which is in situ pressure of either water or ice, minus 10.1325 dbar
    (we do same as for t, the standard name sea_water_pressure is used, as e.g. in ARGO this is what is used for the pressure induced by water, i.e. is 0 at surface. There is no standard name for sea ice pressure.)
  • Create a way to the user to configure the variable's names (e.g. gsw.set_non_cf_name({'entropy':'entropy_name_in_ds', 'SA_seaice':'salinity_of_sea_ice_name_in_ds'}) or ds.gsw.defaults['sea_water_practical_salinity'] = 'psal1')

@rcaneill

This comment was marked as resolved.

@rcaneill
Copy link
Collaborator Author

What are your thoughts on this figuring out intermediate parameters not already present in the dataset? e.g. ds.gsw.CT_from_t() calculates SA from SP if not already present. Somewhat related, right now the get item syntax uses the function name: ds.gsw["SA_from_SP"] if there is something that can figure out the intermediate properties, it might be better to just have the property name: ds.gsw["SA"]. If you think this is too difficult, don't worry about it for this round of changes.

About this, it is issue #30 that I decided to split into 2 different PR. The 1st one (this PR) to create the accessor and possibility to use the standard names, and a 2nd one to have the autodiscovery (to be done, but I had a good draft of the algo here https://gist.github.com/rcaneill/0aa8b9e72112d079c4919e462a4bb378)

@rcaneill
Copy link
Collaborator Author

@DocOtak
I updated the code with your comments. I let you try it with your data (or check the notebook about ARGO), but it should work smoothly :)

I use an internal function from cf_xarray for performance, I opened an issue there: xarray-contrib/cf-xarray#465

@rcaneill
Copy link
Collaborator Author

I use an internal function from cf_xarray for performance, I opened an issue there: xarray-contrib/cf-xarray#465

Had an answer and problem solved!

@DocOtak
Copy link
Owner

DocOtak commented Aug 18, 2023

Oh awesome, in that gist. I'm not familiar with that graph library. Is it able to tell you if cycles or loops exist in the graph?

@rcaneill
Copy link
Collaborator Author

Oh awesome, in that gist. I'm not familiar with that graph library. Is it able to tell you if cycles or loops exist in the graph?

Hm I don't remember but I guess that yes. As far as I remember (I did this notebook 1 year ago), in the end the easiest was not to find a path within the graph with one classical algorithm but simply brute force to find all possible variables that one can compute with a certain set of inputs.

@rcaneill rcaneill mentioned this pull request Aug 29, 2023
@DocOtak
Copy link
Owner

DocOtak commented Aug 30, 2023

I played with the latest commits, I really like it and think this is the right track. There are two things I noticed:

  • Do you think the accessor API is stable? i.e. should we say it is "provisional" while we make sure it is ok in more "real world" testing. Part of this I think would be documenting our intent (vs the implementation that might not match due to bugs), I can take a first crack at it if you'd like.
  • That one comment I made about missing name, I've had bad luck with making reviews actually visible to not me...

@rcaneill
Copy link
Collaborator Author

I played with the latest commits, I really like it and think this is the right track. There are two things I noticed:

  • Do you think the accessor API is stable? i.e. should we say it is "provisional" while we make sure it is ok in more "real world" testing. Part of this I think would be documenting our intent (vs the implementation that might not match due to bugs), I can take a first crack at it if you'd like.

I think that the API is stable if using the accessor + providing the name of each variable inside the dataset. If only relying on standard names (but practical salinity, as 2 standard names are used so it may raise issues) I think it is also robust. And the mixture of standard and non standard names using the options manager may need more focus. I tried to write good tests, but maybe we could write some more.
We could raise a warning when the accessor is used, saying that it may be unable to autodetect the arguments in certain unusual cases (so versio 0.4.0 of gsw-xarray would have the warning, maybe also 0.4.1 if 0.4.1 adds the autodiscovery, and 0.4.2 would be considered as stable if we have enough feedback)

If you have time to document our intent, I'm pleased if you do so. I can try to write some more tests + resolve your comment about modules

@rcaneill
Copy link
Collaborator Author

rcaneill commented Aug 31, 2023

This comment is to list all tests with should have for the accessor. Feel free to add more configurations if needed. I'll try to implement them tomorrow:

Basic tests

  • give names
  • with standard names (getitem)
  • with standard names (method)

More advanced

  • with standard names + option to choose from dataset which array to pick (getitem)
  • with standard names + option to choose from dataset which array to pick (method)
  • with non cf names + option set to say which array to pick (method)
  • mixture of standard names + decide with array to pick + non cf names (method)

With Pint

  • Test with a pint quantified dataset (method)
  • Test with a pint quantified dataset (getitem)

These tests should be done with the method ds.gsw.sigma0() and the getitem ds.gsw["sigmao"]

@rcaneill
Copy link
Collaborator Author

rcaneill commented Sep 1, 2023

  • We should add in the doc that the names we provide to the accessor with getitem are the names or the underlying function, and not the name of the variables (this is confusing as many function have the same name as the output variable). This is probably clear when using the method, as it is an explicit call to a function.

@DocOtak
Copy link
Owner

DocOtak commented Nov 25, 2023

Any reason to not merge (so we can not let this prevent forward movement)?

@rcaneill
Copy link
Collaborator Author

I think it was ready, I was waiting for your review but it seems that I forgot to ask you...
So if you can have a last check, if seems good you can merge afterward

@DocOtak
Copy link
Owner

DocOtak commented Nov 25, 2023

We can do any cleanups in future PRs I think, tests are passing and I don't think we have broken any currently used public interfaces. Maybe we could mark the "auto detector" as experimental. I also suspect a refactor of things might be worth it eventually, there are some deep indentations that are indicative of a "code smell" but I am not worried about that for now.

@DocOtak DocOtak merged commit b20db8d into master Nov 25, 2023
@rcaneill rcaneill deleted the use-dataset branch May 29, 2024 08:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants