cab specification 2.0 #637

SpheMakh · 2020-06-04T17:21:16Z

The cab specification has a few obvious flaws. These ones are particularly bad:

does not distinguish between files and directories
does not allow a pattern specification (pinging @bennahugo ;))
It would be more readable if the parameters list was specified as:

param1:
  type: foo
  desc: blah blah 
  default: bar
  io: input | output | mixed
param2:
.
.
.

The text was updated successfully, but these errors were encountered:

o-smirnov · 2020-07-20T08:45:38Z

First of all, as a thought experiment, I suggest we try to formulate this as a generic YML validator that can replace PyKwalify, so that we can use it for Caracal too.

Things like the io attribute would be Stimela-specific extension attributes. Extension attributes aside, is a Stimela cab parameters list the same kind of beast as a worker schema in Caracal? If yes, then we're making one tool for both those jobs.

SpheMakh · 2020-07-20T08:55:46Z

Agreed, but let's just do this. It will simplify the integration of stimela and caracal

is a Stimela cab parameters list the same kind of beast as a worker schema in Caracal?

In principle yes.

o-smirnov · 2020-07-20T09:01:49Z

Second, a major feature I would like to see is include tags, to include YML files into YML files. There are some straightforward implementations discussed here, we could just reuse them or cut and paste: https://stackoverflow.com/questions/528281/how-can-i-include-a-yaml-file-inside-another

Use cases for includes:

avoiding duplication (which means avoiding errors!) For example, Stimela has a whole family of casa-derived cabs with mostly duplicate parameters. These could all be defined in one place (in the base casa cab), and included into the derived cabs.
likewise, avoiding duplication in Caracal schema files
structuring Caracal config files better. For example, I have a fairly elaborate inspect worker config which I've customized (differently) for MeerKAT and uGMRT data. I continue to tweak it, but changes are increasingly difficult to propagate consistently. If I could just include it into my Caracal configs instead of cut-and-pasting it, this problem goes away.
perhaps more fancily: a Caracal schema could choose to import a (section of) a Stimela parameter file, to have a 1-to-1 mapping of parameters. This would mean tweaking the !include implementation to specify a (sub)section to be included, but why not?

o-smirnov · 2020-08-12T14:43:05Z

I volunteer to get a start on this, since I'm the one who seems to be accumulating the requirements...

SpheMakh · 2020-12-03T16:40:09Z

Here is a first draft. The include tags don't really work for what we want, but we can just have an include section in the definition

schema: 2.0.0
task: casa_gaincal
base: stimela/casa:1.6.9
binary: giancal
description: Determine temporal gains from calibrator observations
prefix: null

include_inputs: 
  - ../misc/casagain.yml # this file contains parameters that are common between gain calibration tasks
include_outputs:
  - ../misc/casagain.yml

inputs:
  gaintype:
    info: Type of gain solution
    type: str 
    default: G
    choices:
      - G 
      - T 
      - GSPLINE
      - K 
      - KCROSS
      - XY+QU
      - XYf+QU
  splinetime:
    info: Spline timescale(sec); All spw's are first averaged.
    type: float
    default: 3600.0
  npointaver:
    info: The phase-unwrapping algorithm
    type: int 
    default: 3
  phasewrap:
    info: Wrap the phase for jumps greater than this value (degrees)
    type: float
    default: 180.0
  smodel:
    info: Point source Stokes parameters for source model.
    type:
      - List[str]
      - List[float]
  save_result:
    info: Name of pickle file to save the task result into
    type: str 

outputs:
  pickled_gains:
    name: "{inputs.save_result.value}"
    location: "{outdir}"
    required: false

o-smirnov · 2020-12-07T10:14:40Z

OK, so you're suggesting that any include_XYZ is automatically loaded and added onto XYZ, right?

But then why not do the same via an include_yml keyword that can appear inside any mapping? E.g.

inputs:
  include_yml:
    - ../misc/casagain.yml
  gaintype:
    info: Type of gain solution
    type: str

Or perhaps (I still want to be able to pull specific sections in):

inputs:
  include_yml:
    path: ../misc/casagain.yml
    section: section_name

Also, I would suggest supporting "{}"-substitutions inside included files. I.e. if casagain.html contains "{foo}" and "{bar}" at any point, these can be substituted in:

inputs:
  include_yml:
    path: ../misc/casagain.yml
    section: section_name
    vars:
      foo: 1
      bar: 2

SpheMakh · 2020-12-07T10:31:04Z

OK, so you're suggesting that any include_XYZ is automatically loaded and added onto XYZ, right?

Yes

Or perhaps (I still want to be able to pull specific sections in):

inputs:
  include_yml:
    path: ../misc/casagain.yml
    section: section_name

This is more versatile, especially if we want to use the same specification for caracal.

Also, I would suggest supporting "{}"-substitutions inside included files. I.e. if casagain.html contains "{foo}" and "{bar}" at any point, these can be substituted in:

Agreed.

I should've made this explicit in my previous comment, but you can also substitute some stimela variables like msdir indir outdir via "{}"

o-smirnov · 2020-12-07T10:50:47Z

I should've made this explicit in my previous comment, but you can also substitute some stimela variables like msdir indir outdir via "{}"

Yeah, I only saw that after I posted. That's a very useful feature (see e.g. CubiCal's default --j-save-to settings).

BTW, if you put yml directly after your opening triple quote, you get syntax highlighting (see e.g. I edited your post above).

SpheMakh · 2020-12-08T09:18:36Z

Also better to specify positional parameters in the YAML

inputs:
  gaintype:
    positional: false  # <----
    info: Type of gain solution
    type: str

SpheMakh · 2020-12-08T09:22:02Z

@paoloserra maybe you want to have some input here as we are thinking of also replacing pykwalify in caracal with this specification.

paoloserra · 2020-12-08T11:14:37Z

@SpheMakh I've had a look but I'm not sure what exactly (if anything) we would need to change in the caracal schema files.

o-smirnov · 2020-12-08T11:18:37Z

Well the schema should be backwards-compatible for the most part, but it should allow for more options and flexibility. What sort of limitations were we running into with the caracal schemas?

Personally, I think the include capability will be very useful.

Off the top of my head I can think of permitting a true None value, but that's minor.

paoloserra · 2020-12-08T11:22:50Z

So include is a way to give common settings to, e.g., multiple runs of the flag worker within a single caracal run?

I'm not aware of significant limitations. We had to use example to set the default values, which was a little odd, but that's all.

SpheMakh · 2020-12-08T11:33:03Z

Something that often takes too much time when making a new worker is writing the parameters, which are already defined in stimela. With this, we can just use the include feature and give the appropriate cab file. The changes to caracal would be minor but would make life easier moving forward.

paoloserra · 2020-12-08T11:37:59Z

I'm probably missing something, but CARACal has its own parameter names and default values, which do not always coincide with those in Stimela.

And anyway, very few parameters are actually required in CARACal, most workers can be added to a pipeline run adding very few parameters to the CARACal config file. This is shown by https://github.com/caracal-pipeline/caracal/blob/master/caracal/sample_configurations/minimalConfig.yml .

So I think we need to be careful about transferring Stimela par names / default values to CARACal.

o-smirnov · 2020-12-08T12:05:40Z

I think for future packages (and perhaps Caracal R2.0), it would make a lot of sense to lean on include heavily. Some of the current implementations I find annoying, confusing and error-prone. For example, it would be a lot "cleaner" if wsclean settings were their own subsection in selfcal, something like:

selfcal:
  imager: wsclean
  img_wsclean:
    include_yml:
      path: stimela/wsclean.yml
      section: parameters
    robust: -1   # overriding whatever was pulled from stimela

This way Caracal could be free to have different and perhaps more sensible defaults (and the {}-subst mechanism would allow for various games with the remapping of parameter names), but at the same time there would be a consistent 1:1:1 mapping between wsclean parameter names, Stimela parameter names, and settings in the appropriate subsection, which makes life easier for everyone, no?

(CubiCal and the coming QuartiCal is another package-of-100-parameters that could benefit from a consistent naming mechanism...)

Another thing that include would be great for is providing "standard libraries". For example:

flag:
  include_yml:
    path: caracal/meerkat-flagging.yml
    section: aggresive_4k_flagging

inspect:
  include_yml:
    path: caracal/meerkat-inspect.yml
    section: detailed_1gc_plots

Not saying things should literally be implemented like this -- just throwing food for thought out there.

paoloserra · 2020-12-08T12:14:04Z

I don't know about parameter name matching. Stimela tends to stick to the native parameter names. CARACal does not need to. In fact, we've changed most of the parameter names in order to have some internal CARACal consistency. So I'd say a Stimela-Caracal match is never going to happen.

I fully agree that the CARACal selfcal parameters need a lot of cleaning up -- things need to be more modular.

Overall there's plenty to improve in CARACal, and it's good to think of this for a 2.0 release. But I also think CARACal should target users who know nothing about Stimela rather than Stimela developers. So parameter names, nesting etc etc should not follow Stimela's unless that's what makes more sense for a CARACal user.

o-smirnov · 2020-12-08T12:31:00Z

But I also think CARACal should target users who know nothing about Stimela rather than Stimela developers. So parameter names, nesting etc etc should not follow Stimela's unless that's what makes more sense for a CARACal user.

Well, it's not about following Stimela naming. Rather, for complicated packages such as WSClean, DDF and CubiCal, I'd like to (ideally) see a system where Stimela names are consistent with the native package parameter names, and there's a way to import this schema into Caracal "as is". Ideally, users who know WSClean should be able to piss out Caracal configs at will without having to look up what we decided to rename each parameter to. If you're going to give the user rope, make it familiar rope.

This does not have to preclude a simple and self-consistent subset of parameters to buttress the Caracal newbie.

SpheMakh · 2020-12-08T12:43:03Z

I also lean towards keeping the names the same as the packages, but this will be optional. We can have an optional overrides feature

selfcal:
  imager: wsclean
  img_wsclean:
    include_yml:
      path: stimela/wsclean.yml
      section: parameters
    pixel:
      info: Image size in pixels
      overrides: size  # <--------------
      .
      .
      .

SpheMakh · 2020-12-08T12:47:31Z

Thinking about this a bit more, it would be annoying to have to change standard imaging parameters names like cellsize npixel niter when I decide to use casa or ddfacet instead of wsclean for imaging. So we should definitely have the option of diverging from stimela and native package names.

o-smirnov · 2020-12-08T13:01:01Z

But that divergence can be accomplished via override and {}-substitutions. E.g.:

selfcal:
  cellsize:
     ...
  img_wsclean:
    include_yml:
      path: stimela/wsclean.yml
      section: parameters
    scale:                                    # "scale" is a wsclean-specific paramater
      default: {selfcal.cellsize}     # ...which we set from a Caracal parameter by default (but the end user can still override)

paoloserra · 2020-12-08T13:36:50Z

Thinking about this a bit more, it would be annoying to have to change standard imaging parameters names like cellsize npixel niter when I decide to use casa or ddfacet instead of wsclean for imaging. So we should definitely have the option of diverging from stimela and native package names.

Yes, that was one of the reasons.

o-smirnov · 2020-12-28T11:13:49Z

Actually, I don't think we need to reinvent the wheel here. Both includes and substitutions (of a sort) are already standard YaML features. See merge and anchors, https://yaml.org/type/merge.html.

The somewhat annoying thing is that anchors have to be explicitly set with "&" at point of origin. And it knows nothing about including files -- the YaML code in the file has to be loaded in the stream before. So the example above would have to look something like

selfcal: 
  cellsize: &selfcal_cellsize
     ...
  img_wsclean:
    <<: *stimela__cabs__wsclean__parameters
    scale:                                    
      default: *selfcal_cellsize

(NB, this is cartoon code, it can't actually work like that, but I don't want to cloud the picture with details...)

There may be a way to automatically set anchor names if we use ruamel.yaml, as it has a set_yaml_anchor feature. The loading code would then look something like:

Load all stimela yaml files into a single nested dist, e.g. the contents of wsclean.yml would end up under stimela: cabs: wsclean.
Run a loop to assign anchors based on section names, thus automatically creating anchors such as e.g. stimela__cabs__wsclean__parameters.
Dump the dict to a string yaml buffer.
Load config from buffer + Caracal yaml.

In principle, this is a generic pattern to support a layer of config files Y that can reference sections from a layer of config files X via automatically-generated anchor names. You can keep on adding layers, the only limitation is that each layer can reference auto-generated anchors from the previous layers, but not from within itself.

o-smirnov · 2020-12-28T19:14:26Z

This sounds complicated, but I think I can make a simple proof-of-concept on am experimental Stimela branch.

I think Stimela itself needs this: there's currently a lot of repetition in the cabs parameters (for example, the entire casa_xxx family has many common parameters) which could be eliminated by defining common stuff in config files under base, and pulling this into config files under cabs using the merge mechanism. In terms of my previous comment, the base configs would be layer X (loaded into a stimela__base namespace), and the cab configs would be layer Y (stimela__cabs). And Caracal configs could then layer on top of that using the same mechanism.

o-smirnov · 2020-12-30T15:48:46Z

OK, I got tired of fighting Caracal and data and flags, so I spent a day or so hacking on a couple of proofs of concept. Checked it into the configuratt branch.

Approach 1: anchors and aliases

This uses anchors and aliases, as discussed above, plus some extra code to automatically set anchors, and merge YML configs. See here, and run as python stimela/configuratt/demo.py.

I've made example .yml files for the casa base image and for the applycal cab to illustrate.

Pros: the syntax is nice and clean, and is a standard YML feature, e.g.:

inputs:
  <<: *stimela-base-casa-mssel_inputs

  docallib:
    info: "Use callib or traditional cal apply parameters"
    type: bool
    default: false

Cons: the functionality is limited, and it all feels a bit klunky. Also, applycal.yml can't be read on its own by a standard YML parser because it contains these undefined aliases which my code generates auto-magically. It can only be read together with casa.yml by the magic code in demo.py.

Approach 2: use OmegaConf

This is a neat new package, and provides almost all the features we need (plus it's evolving quickly, so will probably have the rest soon). It's a lot more powerful than that: you can specify a schema programmatically using the standard Python typing module, and the resulting schema can be a lot richer than what pykwalify offers. See https://github.com/ratt-ru/Stimela/blob/configuratt/stimela/configuratt/demo_omega.py, it's practically self-documenting.

The >> feature is not quite in it yet, but it has ${a.b.c} interpolation, and also direct command-line support, and lots of other good stuff. See applycl.yaml for a taste (note that the demo files for approach 2 are called yaml as opposed to yml).

Pros: powerful and elegant schemas, explicitly designed for the situation where a config is spread across many files and we want to merge them all together, built-in interpolation (${section.variable}) is very useful. Seems to be developing quickly. Explicit command-line support, we can lose all that horrible parsing code.

Cons: some schema errors have been utterly incomprehensible (if you thought pykwalify was bad...) It is also an option to use omegaconf without the schema mechanism (essentially, it's then a fancy YML reader providing variable interpolation), and then on top of that use something like pykwalify to process and validate the config afterwards.

o-smirnov · 2020-12-31T09:29:23Z

The >> feature is not quite in it yet,

After a discussion with the developer, I realized this is something relatively simple to bolt on externally, so I checked in a modified demo. This works and looks quite nice now:

inputs:
  _use:
    - base.casa.params.mssel_inputs

  docallib:
    info: "Use callib or traditional cal apply parameters"
    type: bool
    default: false

I think this sort of thing would be a very nice addition to Caracal R2. I'm finding my overall collection of config files to be very repetitive and error-prone. I'd really like to develop a library of standard config snippets, which could be pulled in piecewise, and then only the necessary parameters modified. Something like:

crosscal:
  _use: lib.meerkat.crosscal.4k
  secondary:
    order: KGIAKF

inspect:
  _use: lib.meerkat.plots.cal.phaseballs, lib.meerkat.plots.cal.spectra, lib.meerkat.plots.cal.detailed
  
  shadems:
    plots: 
      my_extra_plots: 
        - some_extra_plots

P.S. This needs omegaconf master to run, as it uses some features from the upcoming 2.1 release.

gigjozsa · 2021-01-01T12:30:19Z

@o-smirnov You mention pykwalify vs typing. I was wondering if overloading a parameter in a schema, by which I mean allowing it to be multiple types, e.g.:

None
int
list of ints
list of n strings
list of list of floats

all allowed, but not a float, a list of floats, a list of n+1 or n-1 strings etc.

o-smirnov · 2021-01-02T13:32:35Z

Good question, but it's turning into a Caracal discussion, so let's discuss in in caracal-pipeline/caracal#1303 rather.

gigjozsa · 2021-01-04T10:10:08Z

I am not particularly concerned where a validation takes place, so this is good for me, but in principle this could (should?) happen as early as possible, i.e. Stimela (where the question turned up for me).

SpheMakh · 2021-01-04T13:14:48Z

I go for approach 2. OmegaConf is exactly what we need

SpheMakh · 2021-01-04T13:30:46Z

I think this sort of thing would be a very nice addition to Caracal R2. I'm finding my overall collection of config files to be very repetitive and error-prone. I'd really like to develop a library of standard config snippets, which could be pulled in piecewise, and then only the necessary parameters modified. Something like:

Yes, this is the direction we should move. For example, the casa base file I use is below. Let me incorporate the mergeConf shindig to my branch

  vis:
    type: Directory
    info: Name of input visibility file
    default: null
    required: true
    positional: false
    prefix: null
  field:
    info: Select field using field id(s) or field name(s)
    type: str
    default: null 
  spw: 
    info: Select spectral window/channels
    type : str
    default: null
  selectdata:
    info: Other data selection parameters
    type: bool 
    default: true
  timerange:
    info: Select data based on time range
    type: str
    default: null
  uvrange:
    info": Select data within uvrange (default units meters)
    dtype: str 
    default: null
  antenna:
    info: Select data based on antenna/baseline
    type: str
    default: null
  scan:
    info: Scan number range
    type: str 
    default: null
  observation:
    info: Select by observation ID(s)
    type: str
    default: null
  msselect:
    info: Optional complex data selection (ignore for now)
    type: str
    default: null

o-smirnov · 2021-01-04T17:17:20Z

That's almost word-for-word what I wrote for the omegaconf demo: https://github.com/ratt-ru/Stimela/blob/configuratt/stimela/cargo/base/casa/casa.yaml

o-smirnov · 2021-01-04T17:50:26Z

I go for approach 2. OmegaConf is exactly what we need

OK then let's start a branch and get moving. Setting up the infrastructure is very little work, I think most of it is already in my demo. Porting all the parameters.json files will take some effort, but they'll be so nice and compact when it's done, it's bound to be aesthetically rewarding.

Implementing #692 will also be pretty straightforward then.

SpheMakh · 2021-01-04T20:01:35Z

Let's continue in your configuratt branch

o-smirnov · 2021-01-05T20:34:06Z

OK cool, well I moved all the generic config merging infrastructure into the configuratt module, the init now is as simple as it can be: https://github.com/ratt-ru/Stimela/blob/configuratt/stimela/configuratt/demo_omega.py#L109.

If you check the schemas and start working on the image and cab ymls, I volunteer to work on #692.

Also, #598 is naturally merged into this...

SpheMakh · 2021-01-05T21:05:33Z

o-smirnov · 2021-01-29T16:34:56Z

This is probably better served by a discussion thread: #696

o-smirnov · 2021-02-09T09:23:31Z

And we should include #598 in these deliberations.

SpheMakh added the enhancement label Jun 4, 2020

SpheMakh self-assigned this Jun 4, 2020

This was referenced Aug 11, 2020

Define empty input in schema and consistently use it in the pipeline caracal-pipeline/caracal#1144

Open

wsclean needs a three-way mf-weighting option #668

Open

o-smirnov pinned this issue Jan 29, 2021

o-smirnov unpinned this issue Jan 29, 2021

cab specification 2.0 #637

cab specification 2.0 #637

Comments

SpheMakh commented Jun 4, 2020

o-smirnov commented Jul 20, 2020 • edited Loading

SpheMakh commented Jul 20, 2020

o-smirnov commented Jul 20, 2020

o-smirnov commented Aug 12, 2020

SpheMakh commented Dec 3, 2020 • edited by o-smirnov Loading

o-smirnov commented Dec 7, 2020 • edited Loading

SpheMakh commented Dec 7, 2020

o-smirnov commented Dec 7, 2020

SpheMakh commented Dec 8, 2020

SpheMakh commented Dec 8, 2020

paoloserra commented Dec 8, 2020

o-smirnov commented Dec 8, 2020

paoloserra commented Dec 8, 2020

SpheMakh commented Dec 8, 2020 • edited Loading

paoloserra commented Dec 8, 2020

o-smirnov commented Dec 8, 2020

paoloserra commented Dec 8, 2020

o-smirnov commented Dec 8, 2020

SpheMakh commented Dec 8, 2020

SpheMakh commented Dec 8, 2020

o-smirnov commented Dec 8, 2020

paoloserra commented Dec 8, 2020

o-smirnov commented Dec 28, 2020

o-smirnov commented Dec 28, 2020

o-smirnov commented Dec 30, 2020

Approach 1: anchors and aliases

Approach 2: use OmegaConf

o-smirnov commented Dec 31, 2020 • edited Loading

gigjozsa commented Jan 1, 2021

o-smirnov commented Jan 2, 2021

gigjozsa commented Jan 4, 2021

SpheMakh commented Jan 4, 2021

SpheMakh commented Jan 4, 2021

o-smirnov commented Jan 4, 2021

o-smirnov commented Jan 4, 2021

SpheMakh commented Jan 4, 2021

o-smirnov commented Jan 5, 2021

SpheMakh commented Jan 5, 2021

o-smirnov commented Jan 29, 2021

o-smirnov commented Feb 9, 2021

o-smirnov commented Jul 20, 2020 •

edited

Loading

SpheMakh commented Dec 3, 2020 •

edited by o-smirnov

Loading

o-smirnov commented Dec 7, 2020 •

edited

Loading

SpheMakh commented Dec 8, 2020 •

edited

Loading

o-smirnov commented Dec 31, 2020 •

edited

Loading