Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature request: Add + for rulesets #9

Merged
merged 7 commits into from
Oct 17, 2023

Conversation

beniaminogreen
Copy link
Contributor

Hi there, thanks for writing such a fantastic package!

I'm writing to ask if you can implement the + operator to combine several rulesets. This would allow users to combine several rulesets programatically and concisely.

I often find I would like to do this if I have sets of rules to check several smaller datasets which then get combined into a larger dataset . I would like to be able to easily define checks on the smaller datasets and then combine them all and run them on the larger datasets using the + operator.

Here's a quick example on how this could work.

ruleset_a <- ruleset(
  rule(mpg > 10 & mpg < 30) # mpg goes up to 34
)
rulesset_b <- ruleset(
  rule(cyl %in% c(4, 8)), # missing 6 cyl
  rule(vs %in% c(0, 1), allow_na = TRUE)
)

ruleset_a+ruleset_b 
#> <Verification Ruleset with 3 elements>
#>   [1] 'Rule for: mpg' matching `mpg > 10 & mpg < 30` (allow_na: FALSE)
#>   [2] 'Rule for: cyl' matching `cyl %in% c(4, 8)` (allow_na: FALSE)
#>   [3] 'Rule for: vs' matching `vs %in% c(0, 1)` (allow_na: TRUE)

Threw this together in about 10 mins, so I am ofc happy to go back and forth and make iterations until it's something you are happy with, if you think this is a good direction to go in.

Best,
Ben

@DavZim
Copy link
Owner

DavZim commented Oct 16, 2023

Hi Ben,

thanks for the PR (and on a related note congrats to the sucess of zoomerjoin <3).

I like the idea of having an option to combine rulesets. I am at the moment unsure whether I would want to overload the + operator or have a function like data.table::rbindlist() (maybe bindlist(list(ruleset1, ruleset2, ruleset3))). Maybe both would be possible.

Also, I would like to have a couple of tests that cover some use-cases.
And last but not least, feel free to add yourself to the DESCRIPTION file.

Best,
David

@beniaminogreen
Copy link
Contributor Author

I very much like the suggestion to add a bind_rules or something similar to allow users to programatically combine a list of rulsets and rules. I've gone ahead and implemented this. Under the hood, I've done this by improving the overloading of + so that you can now also add:

  • a rule with a rule
  • a rule with a ruleset
  • a ruleset with a rule
  • a ruleset with a ruleset

Here's a snippet showing off this new behavior.

rule_1 <- rule(mpg > 10 & mpg < 30) # mpg goes up to 34
rule_2 <- rule(cyl %in% c(4, 8)) # missing 6 cyl
rule_3 <- rule(vs %in% c(0, 1), allow_na = TRUE)

# combine rules with rules
rule_1 + rule_2 
#> <Verification Ruleset with 2 elements>
#>   [1] 'Rule for: mpg' matching `mpg > 10 & mpg < 30` (allow_na: FALSE)
#>   [2] 'Rule for: cyl' matching `cyl %in% c(4, 8)` (allow_na: FALSE)

# combine rulesets with rules
ruleset(rule_1, rule_2) + rule_3
#> <Verification Ruleset with 3 elements>
#>   [1] 'Rule for: mpg' matching `mpg > 10 & mpg < 30` (allow_na: FALSE)
#>   [2] 'Rule for: cyl' matching `cyl %in% c(4, 8)` (allow_na: FALSE)
#>   [3] 'Rule for: vs' matching `vs %in% c(0, 1)` (allow_na: TRUE)

# combine rules with rulesets
rule_1 + ruleset(rule_2, rule_3) 
#> <Verification Ruleset with 3 elements>
#>   [1] 'Rule for: mpg' matching `mpg > 10 & mpg < 30` (allow_na: FALSE)
#>   [2] 'Rule for: cyl' matching `cyl %in% c(4, 8)` (allow_na: FALSE)
#>   [3] 'Rule for: vs' matching `vs %in% c(0, 1)` (allow_na: TRUE)

# rulesets with rulesets
ruleset(rule_1) + ruleset(rule_2, rule_3) 
#> <Verification Ruleset with 3 elements>
#>   [1] 'Rule for: mpg' matching `mpg > 10 & mpg < 30` (allow_na: FALSE)
#>   [2] 'Rule for: cyl' matching `cyl %in% c(4, 8)` (allow_na: FALSE)
#>   [3] 'Rule for: vs' matching `vs %in% c(0, 1)` (allow_na: TRUE)

# combine a list of rulsets and rules with `bind_rules`
bind_rules(list(rule_1, rule_2, ruleset(rule_3)))
#> <Verification Ruleset with 3 elements>
#>   [1] 'Rule for: mpg' matching `mpg > 10 & mpg < 30` (allow_na: FALSE)
#>   [2] 'Rule for: cyl' matching `cyl %in% c(4, 8)` (allow_na: FALSE)
#>   [3] 'Rule for: vs' matching `vs %in% c(0, 1)` (allow_na: TRUE)

Created on 2023-10-16 with reprex v2.0.2

If you think this is a promising direction, I would love to spruce up the documentation, and can add some examples. I've also gone ahead and added some tests to verify that the + method and bind_rules are working as expected.

# [double dispatch](https://yutani.rbind.io/post/double-dispatch-of-s3-method/)
# semantics.

if (class(a) == "ruleset" & class(b)=="ruleset") {
Copy link
Owner

@DavZim DavZim Oct 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about something like this?

if (class(a) != "ruleset") a <- list(a)
if (class(b) != "ruleset") b <- list(b)
out <- list(a, b)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't work in the case that both a and b are rules, but I like the suggestion - this could have been more concise.

This should work:

if (class(a) == "rule" & class(b) == "rule") {
    out <- list(a,b)
} else {
    if (class(a) == "rule") a <- list(a)
    if (class(b) == "rule") b <- list(b)
    out <- c(a, b)
}

rule_1 <- rule(mpg >10)
rule_2 <- rule(hp >10)
expect_equal(
rule_1 + rule_2,
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you reindent the lines? (If you are using RStudio: CTRL + A then CTRL + I)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure thing! Assuming we want what's inside the tests indented another time.

@DavZim
Copy link
Owner

DavZim commented Oct 17, 2023

I have added two comments to your code. I like where this is going, the only thing that I think is missing are duplications in rules.
How do we want to handle them? Drop duplications or do we want to keep them? What do you think makes more sense?

@beniaminogreen
Copy link
Contributor Author

Just made both of those changes - thanks for taking a look at them.

How to handle duplicates is a good question. I think that keeping or removing duplicates makes sense as long as the documentation is very explicit about it.

I have just made a change to remove duplicates when adding two sets together. This makes the behavior consistent with what a user might expect when taking the union of two sets - duplicates will automatically be removed, because sets don't contain duplicates.

@DavZim
Copy link
Owner

DavZim commented Oct 17, 2023

Looks good to me. I'll run the tests and then merge the PR.

@DavZim
Copy link
Owner

DavZim commented Oct 17, 2023

@beniaminogreen
Copy link
Contributor Author

beniaminogreen commented Oct 17, 2023

Thanks for flagging this. I think the github runner is set to be more strict than the one on my PC. Just fixed the warning and the note it raised.

Edit: I think the issue was I was using devtools::test() not devtools::check() because I did not have duckdb installed on the machine I am using.

@beniaminogreen
Copy link
Contributor Author

Sorry, realized there was another issue. Just tested with devtools::check_man() and devtools::check() and it all looks good now.

@DavZim DavZim merged commit 7b6ef92 into DavZim:main Oct 17, 2023
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants