Autodect parameters from dataset -- add dataset accessor #53

rcaneill · 2022-07-30T19:42:46Z

This PR makes possible to give a dataset as argument to gsw functions, and based on cf standard name the cf_xarray wrapper will pick the proper arguments. Example:

import xarray as xr
import gsw_xarray as gsw

ds = xr.open_dataset('the_dataset.nc')
gsw.sigma0(ds.SA, ds.CT)
# Or you can now use
gsw.sigma0(ds=ds)

rcaneill · 2022-07-30T19:49:42Z

@DocOtak What do you think of the API (I haven't written the doc yet)? To use the option one can simply give a dataset as keyword argument, i.e. gsw.sigma0(ds=ds).

rcaneill · 2022-07-30T19:52:56Z

Many arguments don't have a cf standard name. It is possible to add custom criteria in cf-xarray, so we could write our own standard names (and open a request to add them in the cf-convention) https://cf-xarray.readthedocs.io/en/latest/custom-criteria.html.
Or another option is to make the autodetect option only available for the parameters that have a standard name. What do you think?

I would be in favor of the 1st option (our own criteria), this will make things way easier for #30 (regarding the solving of the graph)

DocOtak · 2022-08-01T23:53:35Z

@rcaneill apologies for delayed responses, I have been on a long holiday and have been attempting to actually "relax"... I'll be back "full time" next week.

I'm not sure about the gsw.func(ds=ds) type interface, all the functions (that I can recall) which take an entire Dataset does something to all variables in the dataset. Are you aware of any non "built in" numpy ufuncs that take entire datasets?

Elegant data selection is a hard problem faced by my seagoing group at Scripps... the functions that I have seen which take entire pandas dataframes are usually very messy.

rcaneill · 2022-08-04T13:58:36Z

I see your point. Would you then not be in favor of having this kind of detection, or is it more an API problem (that we could replace by something like ds.gsw['sigma0'] or ds.gsw.sigma0())?

rcaneill · 2022-11-16T13:07:53Z

Hello @DocOtak , I'm pinging you about my last question :)

DocOtak · 2022-11-22T21:06:37Z

Hi @rcaneill, greetings from pre cruise quarantine in Tahiti.

I think that we should avoid changing the function signature of the wrapped upstream functions and instead rely on the "accessor" pattern as you suggest: ds.gsw.sigma0().

I like the ds.gsw["sigma0"] form for simplicity. However, also supporting the function call way might allow some selection of which variable to use for input properties:

ds.gsw.sigma0(SP="psal1")

Where psal1 is one of many practical salinity variables we have in hypothetical dataset ds. The func then calculates SA using this psal variable as an input for sigma0.

rcaneill · 2022-11-23T08:35:47Z

I like your idea of using function call to use variable selection. That would easily tackle the issue of setting which variable to use when multiple fields are present with the same standard name.
I'll try to give a try to add an accessor (when I'll have time)

Enjoy your future field work :)

rcaneill · 2022-11-24T08:30:25Z

I had some (hopefully good) idea while cycling to university today:

I like to keep the ability to use __getitem__ calls: ds.gsw['sigma0']. It makes it easy to e.g. loop through a list of strings of properties we want to compute
It makes sense to me to also allow to pass a list of strings to getitem: ds.gsw[['sigma0', 'sigma1']]
We keep the possibility of function call: ds.gsw.sigma0(SP="psal1")
To tackle the choice of e.g. default practical salinity (CTD, bottle, etc), we add a property/list/dict (as we prefer) on the gsw accessor with default variables to use:

ds.gsw.defaults['sea_water_practical_salinity'] = 'psal1'

(This point is more relevant to Auto Discovery of Calculable Values #30 ) We can cache intermediate variables used for computation (if a variable needs steps in between), cf xarray doc on accessor. This can be set as an option (for users with small datasets it is a nice option, but for big datasets it can be a bad option), either through context, settings, or arguments: ds.gsw.sigma0(use_cache=False)

rcaneill · 2022-11-24T10:17:25Z

At commit e81063c we have a working version of the accessor! (we can't use getitem with lists, and there is no default option yet)
Could you have a look?

rcaneill · 2022-11-24T16:46:10Z

(We can now use lists in getitem)

rcaneill · 2023-01-10T10:42:33Z

(cf #57 (comment))

Roadmap of this PR:

Make it work for functions using cf standard names, and let it fail for other functions

Optional:

add detection of reference scale t90 and t68
(we assume all in situ temperature are ITS-90)
deal with t argument that is temperature of sea-ice or of sea-water depending on the function's name
deal with p which is in situ pressure of either water or ice, minus 10.1325 dbar
(we do same as for t, the standard name sea_water_pressure is used, as e.g. in ARGO this is what is used for the pressure induced by water, i.e. is 0 at surface. There is no standard name for sea ice pressure.)
Create a way to the user to configure the variable's names (e.g. gsw.set_non_cf_name({'entropy':'entropy_name_in_ds', 'SA_seaice':'salinity_of_sea_ice_name_in_ds'}) or ds.gsw.defaults['sea_water_practical_salinity'] = 'psal1')

…e or water

rcaneill · 2023-08-17T12:27:13Z

What are your thoughts on this figuring out intermediate parameters not already present in the dataset? e.g. ds.gsw.CT_from_t() calculates SA from SP if not already present. Somewhat related, right now the get item syntax uses the function name: ds.gsw["SA_from_SP"] if there is something that can figure out the intermediate properties, it might be better to just have the property name: ds.gsw["SA"]. If you think this is too difficult, don't worry about it for this round of changes.

About this, it is issue #30 that I decided to split into 2 different PR. The 1st one (this PR) to create the accessor and possibility to use the standard names, and a 2nd one to have the autodiscovery (to be done, but I had a good draft of the algo here https://gist.github.com/rcaneill/0aa8b9e72112d079c4919e462a4bb378)

rcaneill · 2023-08-17T16:09:46Z

@DocOtak
I updated the code with your comments. I let you try it with your data (or check the notebook about ARGO), but it should work smoothly :)

I use an internal function from cf_xarray for performance, I opened an issue there: xarray-contrib/cf-xarray#465

rcaneill · 2023-08-17T16:27:14Z

I use an internal function from cf_xarray for performance, I opened an issue there: xarray-contrib/cf-xarray#465

Had an answer and problem solved!

DocOtak · 2023-08-18T03:49:38Z

Oh awesome, in that gist. I'm not familiar with that graph library. Is it able to tell you if cycles or loops exist in the graph?

rcaneill · 2023-08-18T06:59:33Z

Oh awesome, in that gist. I'm not familiar with that graph library. Is it able to tell you if cycles or loops exist in the graph?

Hm I don't remember but I guess that yes. As far as I remember (I did this notebook 1 year ago), in the end the easiest was not to find a path within the graph with one classical algorithm but simply brute force to find all possible variables that one can compute with a certain set of inputs.

gsw_xarray/_util.py

DocOtak · 2023-08-30T20:30:26Z

I played with the latest commits, I really like it and think this is the right track. There are two things I noticed:

Do you think the accessor API is stable? i.e. should we say it is "provisional" while we make sure it is ok in more "real world" testing. Part of this I think would be documenting our intent (vs the implementation that might not match due to bugs), I can take a first crack at it if you'd like.
That one comment I made about missing name, I've had bad luck with making reviews actually visible to not me...

rcaneill · 2023-08-31T07:15:53Z

I played with the latest commits, I really like it and think this is the right track. There are two things I noticed:

Do you think the accessor API is stable? i.e. should we say it is "provisional" while we make sure it is ok in more "real world" testing. Part of this I think would be documenting our intent (vs the implementation that might not match due to bugs), I can take a first crack at it if you'd like.

I think that the API is stable if using the accessor + providing the name of each variable inside the dataset. If only relying on standard names (but practical salinity, as 2 standard names are used so it may raise issues) I think it is also robust. And the mixture of standard and non standard names using the options manager may need more focus. I tried to write good tests, but maybe we could write some more.
We could raise a warning when the accessor is used, saying that it may be unable to autodetect the arguments in certain unusual cases (so versio 0.4.0 of gsw-xarray would have the warning, maybe also 0.4.1 if 0.4.1 adds the autodiscovery, and 0.4.2 would be considered as stable if we have enough feedback)

If you have time to document our intent, I'm pleased if you do so. I can try to write some more tests + resolve your comment about modules

rcaneill · 2023-08-31T15:43:33Z

This comment is to list all tests with should have for the accessor. Feel free to add more configurations if needed. I'll try to implement them tomorrow:

Basic tests

give names
with standard names (getitem)
with standard names (method)

More advanced

with standard names + option to choose from dataset which array to pick (getitem)
with standard names + option to choose from dataset which array to pick (method)
with non cf names + option set to say which array to pick (method)
mixture of standard names + decide with array to pick + non cf names (method)

With Pint

Test with a pint quantified dataset (method)
Test with a pint quantified dataset (getitem)

These tests should be done with the method ds.gsw.sigma0() and the getitem ds.gsw["sigmao"]

rcaneill · 2023-09-01T09:28:13Z

We should add in the doc that the names we provide to the accessor with getitem are the names or the underlying function, and not the name of the variables (this is confusing as many function have the same name as the output variable). This is probably clear when using the method, as it is an explicit call to a function.

…performance test

DocOtak · 2023-11-25T16:51:59Z

Any reason to not merge (so we can not let this prevent forward movement)?

rcaneill · 2023-11-25T16:59:35Z

I think it was ready, I was waiting for your review but it seems that I forgot to ask you...
So if you can have a last check, if seems good you can merge afterward

DocOtak · 2023-11-25T17:15:46Z

We can do any cleanups in future PRs I think, tests are passing and I don't think we have broken any currently used public interfaces. Maybe we could mark the "auto detector" as experimental. I also suspect a refactor of things might be worth it eventually, there are some deep indentations that are indicative of a "code smell" but I am not worried about that for now.

rcaneill added 6 commits July 29, 2022 20:00

rename arguments dict

7991b05

add a test for dataset

470bde7

add cf-xarray

0d64bcc

autodetect test

53cbbbb

add option to autodetect missing args from dataset

818efcd

linting

8f8db85

rcaneill mentioned this pull request Jul 30, 2022

Auto Discovery of Calculable Values #30

Open

change test for accessor (test will fail now)

3757e6c

add dataset accessor

e81063c

rcaneill changed the title ~~Autodect parameters from dataset~~ Autodect parameters from dataset -- add dataset accessor Nov 24, 2022

rcaneill requested a review from DocOtak November 24, 2022 10:16

rcaneill added 2 commits November 24, 2022 17:40

allow for use of string in accessor arguments

c348c02

allow to use list in accessor

5f3f1c3

rcaneill mentioned this pull request Nov 29, 2022

What is left for version 1.0 ? #56

Open

7 tasks

DocOtak mentioned this pull request Nov 29, 2022

add flake8 in pre-commit #54

Closed

rcaneill mentioned this pull request Jan 10, 2023

Inputs arguments that have no cf standard names #57

Open

This comment was marked as resolved.

Sign in to view

rcaneill added 2 commits January 10, 2023 13:47

tackle problem of argument t that is either in situ temperature of ic…

9a46c93

…e or water

Correct minor typo in docstring

95a010d

add wraps to get docstring

879dda2

rcaneill added 7 commits August 17, 2023 17:57

add line profiler and dask

be92fbb

update example

1ba5172

update tests

7931f0a

change order of priority based on user set options

1fa7e6d

update README with options

63e985f

add order of priority

5cb0dad

reran notebook

68d791e

remove using of cf_xarray internal function

45f5fb6

rcaneill mentioned this pull request Aug 29, 2023

Add citation #67

Closed

DocOtak reviewed Aug 30, 2023

View reviewed changes

gsw_xarray/_util.py Show resolved Hide resolved

rcaneill added 5 commits September 1, 2023 11:32

add test for pint quantified array

831a601

add more tests

983be08

emphasis on the fact that the accessor maps functions

f9ad776

does not provide access to module with accessor

ba635d5

add doc without output, use help for functions, and add upstream gsw …

27ca29d

…performance test

DocOtak merged commit b20db8d into master Nov 25, 2023

rcaneill deleted the use-dataset branch May 29, 2024 08:27

Autodect parameters from dataset -- add dataset accessor #53

Autodect parameters from dataset -- add dataset accessor #53

Uh oh!

Conversation

rcaneill commented Jul 30, 2022

Uh oh!

rcaneill commented Jul 30, 2022

Uh oh!

rcaneill commented Jul 30, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

DocOtak commented Aug 1, 2022

Uh oh!

rcaneill commented Aug 4, 2022

Uh oh!

rcaneill commented Nov 16, 2022

Uh oh!

DocOtak commented Nov 22, 2022

Uh oh!

rcaneill commented Nov 23, 2022

Uh oh!

rcaneill commented Nov 24, 2022

Uh oh!

rcaneill commented Nov 24, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rcaneill commented Nov 24, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rcaneill commented Jan 10, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

This comment was marked as resolved.

rcaneill commented Aug 17, 2023

Uh oh!

rcaneill commented Aug 17, 2023

Uh oh!

rcaneill commented Aug 17, 2023

Uh oh!

DocOtak commented Aug 18, 2023

Uh oh!

rcaneill commented Aug 18, 2023

Uh oh!

Uh oh!

DocOtak commented Aug 30, 2023

Uh oh!

rcaneill commented Aug 31, 2023

Uh oh!

rcaneill commented Aug 31, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Basic tests

More advanced

With Pint

Uh oh!

rcaneill commented Sep 1, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

DocOtak commented Nov 25, 2023

Uh oh!

rcaneill commented Nov 25, 2023

Uh oh!

DocOtak commented Nov 25, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

rcaneill commented Jul 30, 2022 •

edited

Loading

rcaneill commented Nov 24, 2022 •

edited

Loading

rcaneill commented Nov 24, 2022 •

edited

Loading

rcaneill commented Jan 10, 2023 •

edited

Loading

rcaneill commented Aug 31, 2023 •

edited

Loading

rcaneill commented Sep 1, 2023 •

edited

Loading