SQL window functions #1760
-
For purposes of example, suppose we have the csv file
In SQL one can add sequence numbers by group using windowing functions like this:
but it seems this does not work in Polars, or if it does, it is not supported by qsv. I would like to be able to do this using SQL. If Polars does not support this, would it be feasible to add SQLite to qsv, or another database that supports this?
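For reference, the kind of window function being asked for works in SQLite (3.25+) out of the box. A minimal sketch via Python's sqlite3 module, with a hypothetical Name,Value table standing in for the csv file above:

```python
import sqlite3

# In-memory SQLite table standing in for the example csv file
# (Name,Value columns, as in the examples later in this thread).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE file1 (Name TEXT, Value INTEGER)")
con.executemany("INSERT INTO file1 VALUES (?, ?)",
                [("a", 1), ("a", 2), ("b", 3), ("b", 4), ("b", 5)])

# ROW_NUMBER() OVER (PARTITION BY ...) restarts the numbering per group.
rows = con.execute("""
    SELECT Name, Value,
           ROW_NUMBER() OVER (PARTITION BY Name ORDER BY Value) AS seq
    FROM file1
    ORDER BY Name, Value
""").fetchall()
print(rows)  # [('a', 1, 1), ('a', 2, 2), ('b', 3, 1), ('b', 4, 2), ('b', 5, 3)]
```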
-
You can add a seq column with 1 or (rather) 2 QSV passes:
-
That does not sequence by group. The SQL statement shown would produce
-
There is no SQL statement there; given an SQL statement producing Name,Value, the pipe will put the seq column in front.
-
My initial post in this thread shows the SQL statement that gives sequence numbers by group. Here is that same statement giving sequence numbers by the grouping column Name, using R and SQLite.
-
As you pointed out @ggrothendieck , it's apparently not currently supported in Polars. I did re-ping the request for more extensive window function support in Polars (pola-rs/polars#7227 (comment)). Given how fast the team iterates, it shouldn't be too long until it's supported, and qsv will leverage it as soon as it's released.
-
I found a workaround if there are only a small number of rows in each group (which is my case). It is fairly awkward and complex, it is probably pretty inefficient, it involves generating a bunch of temporary files which would later have to be deleted, and it seems it can't be represented as a single pipeline, but it does work on this small example. Suppose we have this file1.csv
The code follows, assuming MS-Windows. First we create ordinary sequence numbers in tmp1.csv and then make a copy in tmp2.csv, except we make the names distinct from tmp1.csv. We then join tmp1.csv and tmp2.csv by group (the Name column) and only keep rows for which the index in the second file in the join is less than or equal to the first.
This gives:
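The join-and-count logic behind this workaround can be sketched in plain Python (a hypothetical file1.csv with Name,Value columns stands in for the example file; the qsv commands themselves are elided above):

```python
import csv, io

# Hypothetical stand-in for file1.csv (the real contents are in the post above).
file1 = io.StringIO("Name,Value\na,1\na,2\nb,3\nb,4\nb,5\n")
rows = list(csv.DictReader(file1))

# Pass 1: attach an overall row index (what the first qsv pass writes to tmp1.csv).
indexed = list(enumerate(rows, start=1))

# Self-join on Name, keeping pairs whose second index is <= the first;
# the count of surviving pairs per left-hand row is the per-group sequence.
result = []
for i, r in indexed:
    seq = sum(1 for j, s in indexed if s["Name"] == r["Name"] and j <= i)
    result.append((r["Name"], r["Value"], seq))
print(result)
```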
-
Your workaround really exercises qsv's composability goal :) I'll move this to Discussions so other folks can learn from your example...
Beta Was this translation helpful? Give feedback.
-
I have simplified it down from 6 qsv calls to 3, and from 3 temporary files to 1. Things that made this possible were:
On the other hand:
^ is the Windows continuation character as before.
-
I found that this works too, but through trial and error it seems that Polars via qsv does not handle:
giving
Here is the SQL statement from above, formatted over several lines:
Name,Value
-
It's an extended version of polars-cli, and most of its limits, especially on the SQL parsing front, are largely dictated by Polars' capabilities and limitations. I'll try to see if the limitations you found above can be overcome through Polars SQL. With your help, maybe we can create an extended Wiki article on
-
Just one correction. It seems Polars does support multiple WITH clauses, so the qsv select at the end could be absorbed into the SQL statement:
which looks like this on multiple lines:
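The multi-line statement itself is elided above. As a rough stand-in for the chained-WITH idea, here is an analogous query run against SQLite rather than Polars (the table and column names are assumptions carried over from the earlier examples):

```python
import sqlite3

# Same hypothetical Name,Value data as before, in an in-memory SQLite table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE file1 (Name TEXT, Value INTEGER)")
con.executemany("INSERT INTO file1 VALUES (?, ?)", [("a", 1), ("a", 2), ("b", 3)])

# Two chained WITH clauses: the first adds the per-group sequence number,
# the second does the column selection that the trailing qsv select used to do.
sql = """
WITH numbered AS (
    SELECT Name, Value,
           ROW_NUMBER() OVER (PARTITION BY Name ORDER BY Value) AS seq
    FROM file1
),
final AS (
    SELECT Name, seq FROM numbered
)
SELECT * FROM final ORDER BY Name, seq
"""
rows = con.execute(sql).fetchall()
print(rows)  # [('a', 1), ('a', 2), ('b', 1)]
```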
-
Regarding your comments: developing an SQL engine is a large amount of work, and given the number of problems found here, it suggests there are others in Polars that I didn't find. There is an SQLite crate, and I wonder how much work it would be to support two SQL engines. I have used SQLite a fair bit and know of no bugs in it.
-
I'm partial to Polars since it's written in Rust and it seems to be the fastest SQL engine at the moment in the Rust ecosystem, if not of all dataframe libraries, as performance is qsv's top goal (https://github.com/jqnatividad/qsv?tab=readme-ov-file#goals--non-goals). SQLite is rock solid and stable, but for the OLAP-type queries that I'm targeting qsv at, duckdb seems a better fit as an alternative engine, both in performance and in its focus on OLAP workloads. Also, I ran some simple benchmarks comparing

As for duckdb, we actually use it to enable the

But since duckdb already has a powerful, composable CLI, I'm not convinced it's worthwhile. Already, several qsv users use duckdb in their pipelines, and that's perfectly fine by me. Why not just use sqlite in your pipelines and only use qsv as required?
-
Yes, duckdb would be even better.
-
I found another approach to this problem, assuming the input csv is sorted. Suppose we have this input csv and lua file.
Then
gives
-
Here is a shorter version. It is now short enough that we don't need a lua file for it. Would be nice if it were simpler though. In lua
-
This gives us a way to get the first n rows of each group. Here we use n=1.
Actually for n=1 we could simplify it to:
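The qsv/lua commands themselves are elided above, but the underlying first-n-rows-per-group logic can be sketched in Python (sorted input and the Name column are assumptions from the earlier examples):

```python
# Keep only the first n rows of each group (input sorted by the group column).
rows = [("a", 1), ("a", 2), ("b", 3), ("b", 4), ("b", 5)]
n = 1

prev, seq, kept = None, 0, []
for name, value in rows:
    seq = 1 if name != prev else seq + 1
    prev = name
    if seq <= n:  # for n=1 this reduces to "group key differs from previous row"
        kept.append((name, value))
print(kept)  # [('a', 1), ('b', 3)]
```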
-
Thanks for your reply. I tried the table part of the example using the command line and got the code shown below. At any rate, I will look closer at the material you mention, although I think it is unlikely I will want to write significant lua scripts, given that I know R. The main uses I would have for qsv involve speed and simplicity:
-
One thing that this brought out is that if one does a luau map to a column name that already exists, it creates a second column with the same name. Is it the intention that columns be immutable? This is data.csv from your table lookup example.
Note that it did not replace the data in the State column with 3, but created a second State column. The documentation does say that new columns are added after existing columns, but it wasn't clear that doing this defines a new column. Is this intentional?
-
I usually map to uppercase column names, which I then select and rename, so I never noticed.
-
Here is the us-states-example.csv example table slightly improved.
To get the column indexes, qsv headers was used, but unexpectedly it does not default to stdin and one must specify -. (If you are using numeric column indexing, a typical action would presumably be to pipe the result to qsv headers before writing the qsv command in the pipeline that uses such indexes.)
-
Regarding your comment about adding sequence numbers to the ends of duplicate column names in luau map: I would have expected that if you assign to a column name that already exists, that column is overwritten with the new values, just as in most programming languages assigning to an existing variable clobbers it.
Beta Was this translation helpful? Give feedback.
-
In looking at the us-states example more, it seems the key complication is calculating the cumulative sums. Not only is the calculation itself somewhat complex, but the need for a separate line of code to format it seems unfortunate. If there were a luau function to perform a cumulative sum, the code would be a bit shorter and simpler. We show what it would look like below. This is not executable, because the cumsum function does not currently exist.
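For comparison, here is what such a stateful cumsum helper could look like as a Python closure; the luau version would follow the same shape. The cumsum function itself remains hypothetical, as noted above:

```python
def make_cumsum():
    # Stateful accumulator mirroring the proposed (hypothetical) luau cumsum():
    # each call adds its argument to a running total and returns the new total.
    total = 0
    def cumsum(x):
        nonlocal total
        total += x
        return total
    return cumsum

cs = make_cumsum()
print([cs(v) for v in [1, 2, 3, 4]])  # [1, 3, 6, 10]
```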
-
If you are implementing cumsum, there are a number of other related functions that have proven useful in other languages; they follow a similar pattern and so could readily be implemented at the same time: cumprod, cumany, cumall, cummax, cummin, and accumulate. The following are also useful, and are related insofar as they also involve storing the previous value:

lag. With lag we could omit setting prev.

diff.
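The store-the-previous-value pattern behind lag and diff can be sketched the same way as cumsum, again as hypothetical helpers in Python closures:

```python
def make_lag(default=None):
    # lag(x) returns the previously seen value (default on the first call)
    # and stores x for the next call -- the "previous value" pattern above.
    prev = default
    def lag(x):
        nonlocal prev
        prev, out = x, prev
        return out
    return lag

def make_diff():
    # diff(x) is x minus the lagged value; None on the first call.
    lag = make_lag()
    def diff(x):
        p = lag(x)
        return None if p is None else x - p
    return diff

lg, df = make_lag(), make_diff()
print([lg(v) for v in [1, 3, 6]])  # [None, 1, 3]
print([df(v) for v in [1, 3, 6]])  # [None, 2, 3]
```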