Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[wip] add subset string view comparator #248

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

captain-yoshi
Copy link
Contributor

@captain-yoshi captain-yoshi commented Apr 13, 2022

This PR implements a subset comparator for string views. It is really helpful to find mispelled yaml keys that are usually silently ignored when deserializing objects that have optional keys. The is_subset_strview method is highly dependant on formaters so not really usable, but the is_subset_strview_skipval is usefull for my context because my keys don't have floating numbers.

collision_detection::CollisionRequest req;

ryml::Tree t = ryml::parse_in_arena(ryml::csubstr(
R"(
collision_request:
  distance: true
  cost: true
  contacts: true
  mispelled_key: true
)"));

// deserialize
ryml::NodeRef n = t.rootref();
n["collision_request"] >> req;

// reserialize
ryml::Tree t1;
ryml::NodeRef n1 = t1.rootref();
n1 << ryml::key("collision_request") << req;

// t.is_subset_strview(&t1);
t.is_subset_strview_skipval(&t1);
// would return 'false' because the 'mispelled_key' is not a subset of the deserialized object

I have tried to follow your C++ style. You can close if you don't want this subset implementation. Will change code to your liking. Tags are not compared at the moment. Are they resolved before being added in the string views ?

I will also implement a is_subset method in another PR, which will need decoding with variadic templates.
EDIT: Not needed, can be implemented here.

TODO:

  • Add proper tests in test/test_subset.cpp (have tested against 15 tests internally)
  • Check equality for key and val ref

@biojppm
Copy link
Owner

biojppm commented Apr 13, 2022

Sorry, I'm having trouble understanding what you want to achieve. It does not help that the proposed name seems to be clashing with what it is actually doing: is_subset_strview suggests some sort of query whether a substr is a sub- or super-view of another, but yet the functions seem to be using operator== which is actually a string comparison.

So before we go into details, can you please explain what is the problem you're trying to solve? Is it ensuring that all member fields are accounted for in a serialization roundtrip? If so, why the subset_strview name? Or is it something else?

@captain-yoshi
Copy link
Contributor Author

captain-yoshi commented Apr 14, 2022

So before we go into details, can you please explain what is the problem you're trying to solve?

I want to catch misspelled keys entered by the user that is not used when deserializing an object (without applying that logic in any deserialization code). To solve this problem, I though of checking if a node is a subset of another node. This uses the operator== for each key and values (with some logic to validate node type). For floating numbers it is hard to do some equality due to precision error, that is why I had a version with skipval.

If so, why the subset_strview name?

I meant that there is no decoding, that it is strictly done in the string view which is a string comparison. I wanted to have is_subset which decodes all key/val to all fundamental types and compares for equality. Also another one which only compares the string view key/val (much faster, less reliable) thus the is_subset_strview.


New Idea: I could reuse the logic implemented in the PR and try decoding for all fundamental types

bool is_subset(...) {
    // before checking for equality try decoding key/val with these fundamentals types in order
    //   string  -> eq operator in stringview
    //   double  -> atod() would work for floats also ?
    //   int64   -> atoi() would work for all signed integral values
    //   uint64  -> atoi() would work for all unsigned integral values
    //   bool    -> from_chars(csubstr buf, bool * C4_RESTRICT v)
}

// Would it be better if the user could control which fundamentals he wants to decode from ?

Hope this helps! I can give a more concrete example if you need.

@biojppm
Copy link
Owner

biojppm commented Apr 14, 2022

So in other words, you want to verify that a given node has child keys only from within a certain set, and/or that any child key must be in that certain set?

Then these would be the main predicates (improvised here, untested):

struct Tree
{
    // ....

    // non-recursive predicates:

    /// return true if subject_node has all the keys or indices in refnode from a reftree
    /// @note does not check values, only keys (for maps) or indices (for seqs)
    bool has_all(size_t subject_node, Tree const* reftree, size_t refnode) const;
    bool has_exactly(size_t subject_node, Tree const* reftree, size_t refnode) const;

    // and helper functions to drive the recursive descent:

    bool has_all_recursive(size_t subject_node, Tree const* reftree, size_t refnode) const;
    bool has_exactly_recursive(size_t subject_node, Tree const* reftree, size_t refnode) const;

    // ....
};

and the predicate implementation would look like this:

bool Tree::has_all(size_t subject_node, Tree const* reftree, size_t refnode) const
{
    C4_ASSERT(is_container(subject_node));
    C4_ASSERT(reftree->is_container(refnode));
    if(is_map(subject_node))
    {
        C4_ASSERT(reftree->is_map(refnode));
        for(size_t ch = reftree->first_child(refnode); ch != NONE; ch = reftree->next_sibling(ch))
            if( ! has_child(reftree->key(ch)))
                return false;
        return true;
    }
    else if(is_seq(subject_node))
    {
        C4_ASSERT(reftree->is_seq(refnode));
        return num_children(subject_node) >= reftree->num_children(refnode);
    }
    else
    {
        C4_NEVER_REACH();
    }
}

bool Tree::has_exactly(size_t subject_node, Tree const* reftree, size_t refnode) const
{
    C4_ASSERT(is_container(subject_node));
    C4_ASSERT(reftree->is_container(refnode));
    if(is_map(subject_node))
    {
        C4_ASSERT(reftree->is_map(refnode));
        for(size_t ch = reftree->first_child(refnode); ch != NONE; ch = reftree->next_sibling(ch))
            if( ! has_child(reftree->key(ch)))
                return false;
        for(size_t ch = first_child(subject_node); ch != NONE; ch = next_sibling(ch))
            if( ! reftree->has_child(key(ch)))
                return false;
        return true;
    }
    else if(is_seq(subject_node))
    {
        C4_ASSERT(reftree->is_seq(refnode));
        return num_children(subject_node) == reftree->num_children(refnode);
    }
    else
    {
        C4_NEVER_REACH();
    }
}

As for the values, I don't see the usefulness in the comparison. Comparison of map keys or seq indices makes sense to ensure that a data tree is complete. But if you also have value equality requirements, doesn't that mean that the ref tree is the actual data tree? 😕

And as for the float comparison, let's leave that question for later. It seems very specific, and related to the value question, and I want to make sure the main functionality is there before looking at that side.

@captain-yoshi
Copy link
Contributor Author

But if you also have value equality requirements, doesn't that mean that the ref tree is the actual data tree?

Well I think it depends on the formatters used. These yaml documents would be considered equals. So in that case, the ref tree would not be the actual data tree.

---
bool: 1 
---
bool: true
---
bool: True

Even for keys, should we not decode them before doing the equality ? It could be opt-in, default only compare in the string view, else try decoding to fundamentals before comparing.

---
1: mario
255: luigi
---
true: mario
0xff: luigi
---
True: mario
0o377: luigi

I'm all for simplicity, but validating keys and/or val, with or without decoding does not change code that much... except for tests.

struct Tree
{
    // ....
    // The way I see it there would be 

    bool is_subset(size_t subject_node, Tree const* reftree, size_t refnode, bool skipval = true, bool decode = false) const;
    bool is_superset(size_t subject_node, Tree const* reftree, size_t refnode, bool skipval = true, bool decode = false) const;

    // and helper functions to drive the recursive descent:

    bool is_subset_recursive(size_t subject_node, Tree const* reftree, size_t refnode, bool skipval, bool decode) const;
    bool is_superset_recursive(size_t subject_node, Tree const* reftree, size_t refnode, bool skipval, bool decode) const;

    // ....
};

Is is_subset easier to interpret then has_all ?

@captain-yoshi
Copy link
Contributor Author

I think I understand know!

has_all: check only keys from containers (without decoding), abort if not a container
is_subset: check key and/or val, w/wo decoding. Also checks anchors/ref, stream, docval ...

Do we even need a has_all? With the subset method, I can achieve the same thing (albeit maybe slower but not that much). And it will not abort (except if the pointer is invalid, or a usecase is not taking into consideration).

@biojppm
Copy link
Owner

biojppm commented Apr 14, 2022

EDITED:

I think there's some confusion here. .has_all() vs .is_subset() is a question of structure, ie what members are present at a particular node as compared to a node from a different tree. And that has is not the same as comparing.

Let's leave the 255 vs 0xff vs 0o377 equivalence question for another time. For now let's treat them from the point of view of the string mindset, which is that used by rapidyaml. Notice that rapidyaml has no knowledge of the deserialized data types; it only acquires this knowledge inside the user's operator>> request (and only there). Eg, outside of operator>>, in bool: 1 is 1 a string, a bool, an int, an unsigned, or some other type constructible from any of the former? That's an unsolvable problem for this library. It may not be unsolvable for your use case, but doing any assumption like that would leave out a lot of other possible competing assumptions.

I know you want to get closer to value semantics, but for now that's just not here - only string semantics. That may still be not enough for what you need, but at least this much will be available for you and other users. This is because adding the value semantics is too disruptive for now, and also a really hard problem from multiple points of view. So that one will be best left to a different PR.

Also, focusing on string semantics allows us to focus on the structure, using the plain csubstr::operator== which is the infrastructure used by find_child() et al.

So what comes next assumes only the existing string semantics. And for now, all of those keys are necessarily different, just as the string 1 is not equal to true or True.

Notice that for given nodes a and b, a.has_all(b) means a.is_superset_of(b), and if that's true, then necessarily b.is_subset_of(a). That is: depending on how it is invoked, .has_all() covers both .is_subset() or .is_superset(). That's why .has_all() was chosen.

Further, .has_all() cannot give set equivalence (and neither do their close equivalents .is_superset() or .is_subset()). The only way to test for equivalence between a and b using .has_all() is a.has_all(b) && b.has_all(a).

That's why the complement to .has_all() (is superset) is .has_exactly() (is same set), which does both the tests in one swoop.

Now, I agree the names are a bit cryptic. Maybe .has_all_children() (is superset) and .has_same_children() (is same set) would be better.

@biojppm
Copy link
Owner

biojppm commented Apr 14, 2022

To clarify, I am stating that:

  1. has_all_children() and .has_same_children() are the functions we're looking for (but I'm open to different/better names)
  2. Those functions use the existing string-compare, and not value-compare.
  3. Those functions may or may not (string-)compare the VALs of the structure, but for now, let's assume we don't need that. Let's think about VALs after we have more agreement on the scope of this PR regarding KEYs or indices.
  4. We should leave aside the decode part, and only look at that in a different PR subsequent to this. By the way, that is actually a comparison function that should be provided by the user, eg, template<class T> operator==(csubstr lhs, csubstr rhs), and nothing like that is available yet to accomodate this. That would make this PR much larger than desirable.

@captain-yoshi
Copy link
Contributor Author

Thx for the detailed explanation and for your patience!

That's an unsolvable problem for this library.

Agreed ! For my usecase it made sense, but is not true for everyone. I imagine comparing a key anchor/ref is an unsolvable problem also...

Now, I agree the names are a bit cryptic. Maybe .has_all_children() (is superset) and .has_same_children() (is same set) would be better.

The has_all and has_exactly names are fine with me. As I don't have better names, I will let you decide!

Those functions may or may not (string-)compare the VALs of the structure, but for now, let's assume we don't need that. Let's think about VALs after we have more agreement on the scope of this PR regarding KEYs or indices.

Fair enough. Should I ensure that the node type is equal between the subject and the reference or just the containers like your snippet above? E.g. take into account the key anchor/ref, docval.

# tree_a
--- docval
mario: 1up

# tree_b
---
mario: 1up

# Should `tree_a->has_all( 0, tree_b, 0)` return true ?

# -----------------------------------------------------

# tree_c
mario: &life 1up

# tree_d
mario: 1up

# Should `tree_c->has_all( 0, tree_d, 0)` return true ?

@biojppm
Copy link
Owner

biojppm commented Apr 14, 2022

Anchors are meta-structure, so:

Should tree_a->has_all( 0, tree_b, 0) return true ?

Assuming that docval in tree_a is meant to be &docval, ie an anchor, then --- yes, tree_a has all keys of tree_b. So it should return true.

Should tree_c->has_all( 0, tree_d, 0) return true ?
Likewise, tree_c has all the keys in tree_d. So the result is true.

Ideally, if anchors are playing a role, then you should resolve the trees before making this sort of query.

But to clarify, what sort of role do you want anchors to play here? I may be missing something.

@captain-yoshi
Copy link
Contributor Author

Ideally, if anchors are playing a role, then you should resolve the trees before making this sort of query.

This is what I had in mind, I just wanted your wisdom on this!

Assuming that docval in tree_a is meant to be &docval, ie an anchor, then --- yes

It was really meant as a doc value "myvalue". Should we return true for this case also ?


Maybe we should call these functions has_all_keys and has_same_keys ?

Last question, how do you start a specific test in linux, e.g. if I wanted to test the test_merge.cpp ?

@captain-yoshi
Copy link
Contributor Author

Forgot to mention, what is also important to me is the structure of a sequence which has keys and nested keys. For values I don't check them, but the order and type MUST match, e.g.

# tree_a
- key1:
    nested_key: 2
- val1
- val2

# tree_b
- key1:
    nested_key: 5
- dontcare

# `tree_a->has_all( 0, tree_b, 0)` would return false
# `tree_b->has_all( 0, tree_a, 0)` would return true

The second commit account for that. I don't know if you had that in mind ?

@biojppm
Copy link
Owner

biojppm commented Apr 14, 2022

# tree_a
--- docval
mario: 1up

It was really meant as a doc value "myvalue".

That's not valid YAML. It cannot be both a doc value and a doc map. Eg, see this.

@captain-yoshi
Copy link
Contributor Author

This is really helpfull!

Maybe it has been updated but getting an error with rapidyaml in sandbox for this case:

a: 2
- 1

@biojppm
Copy link
Owner

biojppm commented Apr 14, 2022

Maybe we should call these functions has_all_keys and has_same_keys ?

The _keys suffix is misleading because they apply to SEQs as well, which have no keys, only indices.

# tree_a
- key1:
    nested_key: 2
- val1
- val2

# tree_b
- key1:
    nested_key: 5
- dontcare

# tree_c
- key1:
    nested_key: 5

tree_a->has_all(0, tree_b, 0) would return false
tree_b->has_all(0, tree_a, 0) would return true

Actually, it's different; .has_all() means "has all the children of". So given that these are SEQs, we only care about the number of children. Hence, we would have:

assert(   tree_a->has_all(0, tree_b, 0)); // root a has at least as many members as root b
assert(   tree_a->has_all(0, tree_c, 0)); // root a has at least as many members as root c
assert( ! tree_b->has_all(0, tree_a, 0)); // root b has less members than root a
assert(   tree_b->has_all(0, tree_c, 0)); // root b has at least as many members as root c
assert( ! tree_c->has_all(0, tree_a, 0)); // root c has less members than root a
assert( ! tree_c->has_all(0, tree_b, 0)); // root c has less members than root b

@biojppm
Copy link
Owner

biojppm commented Apr 14, 2022

Maybe it has been updated but getting an error with rapidyaml in sandbox for this case

Actually rapidyaml does NOT error out, and that's the problem. Indeed, that's being worked on.

@biojppm
Copy link
Owner

biojppm commented Apr 14, 2022

Last question, how do you start a specific test in linux, e.g. if I wanted to test the test_merge.cpp ?

To enable the tests, add the cmake flag -DRYML_DEV=ON; this will enable a bunch of development targets; if you're using Makefile generator type make help to see the list of targets.

Next, just run the target ryml-test-merge.

Or use ctest: go into the build directory, cd into test and run ctest -R merge and possibly other ctest flags like --output-on-failure.

@captain-yoshi
Copy link
Contributor Author

I think there's some confusion here. .has_all() vs .is_subset() is a question of structure, ie what members are present at a particular node as compared to a node from a different tree. And that has is not the same as comparing.

OK! I really do understand now! I was wondering why my PR and your proposed code was so different.

Is there any reason not to traverse sequences that have keys ? This is what I need and implemented. I'm totally fine with keeping this in my code. I don't see any usecase for the has_all(), but maybe I don't get the full picture...

So in other words, you want to verify that a given node has child keys only from within a certain set, and/or that any child key must be in that certain set?

Well my example was simplified, sorry for that. I need to know if a key, in a map or a sequence container recursively has been misspelled:

# tree_a
- key1:
    nested_key1: ~
- key2:
    nested_key1: ~
    nested_key2: ~
    misspelled: ~

# tree_b
- key1:
    nested_key1: ~
    nested_key2: ~
- key2:
    nested_key1: ~
    nested_key2: ~
    nested_key3: ~
// I think my reference of `has_all` is reversed compare to yours. The 'tree_a' is the one that will be looped against the 'tree_b'.
assert( ! tree_a->has_all(0, tree_b, 0)); // root a has at least as many members as root b, but 'misspelled' key is missing from tree_b
assert( ! tree_b->has_all(0, tree_a, 0)); // root b has less members then root a:  'nested_key2' from key1 and 'nested_key3' from key2 is missing from tree_a

@biojppm biojppm force-pushed the master branch 4 times, most recently from d128813 to 507400e Compare April 16, 2024 23:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants