rpolars

Use awesome polars DataFrame library from R!

rpolars is not completely translated yet - aim to finish March 2023

See what is currently translated in latest documentation:

install latest rpolars release

No dependencies other than R (≥ 4.1.0)

Macbbook x86_64 install.packages(repos=NULL, "https://github.com/rpolars/rpolars/releases/latest/download/rpolars__x86_64-apple-darwin17.0.tgz")
Linux x86_64 install.packages(repos=NULL, "https://github.com/rpolars/rpolars/releases/latest/download/rpolars__x86_64-pc-linux-gnu.gz")
Windows install.packages(repos=NULL, "https://github.com/rpolars/rpolars/releases/latest/download/rpolars__x86_64-w64-mingw32.zip")
Other targets? Start a new issue.

Documentation:

Latest docs found here

news:

update 24th November: minipolars is getting bigger and is changing name to rpolars and is hosted on github.com/rpolars/rpolars. Translation, testing and documenting progress is unfortunately not fast enough to finish in 2022. Goal postponed to March 2023. rlang is dropped as install dependency. No dependencies should make it very easy to install and manage versions long term.
update 10th November 2022: Full support for Windows, see installation section. After digging through gnu ld linker documentation and R source code idiosyncrasies, rpolars, can now be build for windows (nighly-gnu). In the end adding this super simple linker export definition file prevented the linker from trying to export all +160_000 internal variables into a 16bit symbol table maxing out at 65000 variables. Many thanks for 24-hour support from extendr-team <3.
update 4th November 2022: Latest documentation shows half (125) of all expression functions are now ported. Automatic binary release for Mac and Linux. Windows still pending. It is now very easy to install rpolars from binary. See install section.
update: 5th October 2022 Currently ~20% of features have been translated. To make polars call R multi-threaded was a really hard nut to crack as R has no Global-interpreter-lock feature. My solution is to have a main thread in charge of R calls, and any abitrary polars child threads can request to have R user functions executed. Implemented with flume mpsc channels. No serious obstacles left known to me. Just a a lot of writing. Priliminary perfomance benchmarking promise rpolars is going to perform just as fast pypolars.

What is polars

Polars is the fastest data table query library. The syntax is related to Spark, but column oriented and not row oriented. All R libraries are also column oriented so this should feel familiar. Unlike Spark, polars is natively multithreaded instead of multinode(d). This make polars simple to install and use as any other R package. Like Spark and SQL-variants polars optimizes queries for memory consuption and speed so you don’t have to. Expect 5-10 speedup compared to dplyr on simple transformations from >500Mb data. When chaining many operations the speedup due to optimization can be even higher. Polars is built on the apache-arrow memory model.

This port relies on extendr https://github.com/extendr which is the R equivalent to pyo3+maturin. Extendr is very convenient for calling rust from R and the reverse.

Build from source

Install rust + set buildchain to nightly + 3rd party dependencies. See installation workflows/pkgdown.yaml for Windows, Linux and Mac.

install rust with rustup
rustup default nightly
clone repo
on Windows rtools42 must be in path
source("./renv/activate.R") to install and set up R packages (likely automatically triggered by .Rprofile)
rextendr::document() to compile rust code and quick build package
or R CMD INSTALL --no-multiarch --with-keep.source rpolars to build final package
devtools::test() to run all unit tests.

rpolars_teaser

================ Søren Welling 11/24/2022

Hello world

#loading the package rpolars only exposes a few functions 
library(rpolars)

#all constructors are accessed via pl

#Here we go, Hello world written with polars expressions
pl$col("hello")$sum()$over("world","from")$alias("polars")

## polars Expr: col("hello").sum().over([col("world"), col("from")]).alias("polars")

Typical ussage

Where dplyr has %>%-piping and `data.table has [,]-indexing, method chaining object$m1()$m2() is the bread and butter syntax of polars. For now the best learning material to understand the syntax and the power of polars is the official user guide for python. As rpolars syntax is the same ( except $ instead of .) the guide should be quite useful. The following example shows a typical ‘polar_frame’ method together with chained expressions.

#create polar_frames from iris
df = pl$DataFrame(iris)

#make selection (similar to dplyr mutute() and data.table [,.()] ) and use expressions or strings.
df = df$select(
  pl$col("Sepal.Width")$sum()$over("Species")$alias("sw_sum_over_species"),
  pl$col("Sepal.Length")$sum()$over("Species")$alias("sl_sum_over_species"),
  "Petal.Width"
)

#polars expressions are column instructions

#1 take the column named Sepal.Width
#2 sum it...
#3 over(by) the column  Species
#4 rename/alias to sw_sum_over_species


#convert back to data.frame
head(df$as_data_frame())

##   sw_sum_over_species sl_sum_over_species Petal.Width
## 1               171.4               250.3         0.2
## 2               171.4               250.3         0.2
## 3               171.4               250.3         0.2
## 4               171.4               250.3         0.2
## 5               171.4               250.3         0.2
## 6               171.4               250.3         0.4

`polar_frame` from `series` and R vectors

#a single column outside a polars_frame is called a series
pl$Series((1:5) * 5,"my_series")

## polars Series: shape: (5,)
## Series: 'my_series' [f64]
## [
##  5.0
##  10.0
##  15.0
##  20.0
##  25.0
## ]

#Create polar_From from a mix of series and/or plain R vectors.
pl$DataFrame(
  newname = pl$Series(c(1,2,3,4,5),name = "b"), #overwrite name b with 'newname'
  pl$Series((1:5) * 5,"a"),
  pl$Series(letters[1:5],"b"),
  c(5,4,3,2,1), #unnamed vector
  named_vector = c(15,14,13,12,11) ,#named provide
  c(5,4,3,2,0)
)

## polars DataFrame: shape: (5, 6)
## ┌─────────┬──────┬─────┬────────────┬──────────────┬──────────────┐
## │ newname ┆ a    ┆ b   ┆ new_column ┆ named_vector ┆ new_column_1 │
## │ ---     ┆ ---  ┆ --- ┆ ---        ┆ ---          ┆ ---          │
## │ f64     ┆ f64  ┆ str ┆ f64        ┆ f64          ┆ f64          │
## ╞═════════╪══════╪═════╪════════════╪══════════════╪══════════════╡
## │ 1.0     ┆ 5.0  ┆ a   ┆ 5.0        ┆ 15.0         ┆ 5.0          │
## ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
## │ 2.0     ┆ 10.0 ┆ b   ┆ 4.0        ┆ 14.0         ┆ 4.0          │
## ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
## │ 3.0     ┆ 15.0 ┆ c   ┆ 3.0        ┆ 13.0         ┆ 3.0          │
## ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
## │ 4.0     ┆ 20.0 ┆ d   ┆ 2.0        ┆ 12.0         ┆ 2.0          │
## ├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
## │ 5.0     ┆ 25.0 ┆ e   ┆ 1.0        ┆ 11.0         ┆ 0.0          │
## └─────────┴──────┴─────┴────────────┴──────────────┴──────────────┘

Data types

#polars is strongly typed. Data-types can be created like this:
pl$dtypes$Float64

## polars DataType: Float64

# currently translated dtypes
pl$dtypes

$Boolean
polars DataType: Boolean

$Float32
polars DataType: Float32

$Float64
polars DataType: Float64

$Int32
polars DataType: Int32

$Int64
polars DataType: Int64

$UInt32
polars DataType: UInt32

$UInt64
polars DataType: UInt64

$Utf8
polars DataType: Utf8

$Categorical
polars DataType: Categorical(
    None,
)

Read csv and the `polars_lazy_frame`

  #using iris.csv as example
  write.csv(iris, "iris.csv",row.names = FALSE)

  #read csv into a lazy_polar_frame and compute sum of Sepal.Width over Species
  lpf = pl$lazy_csv_reader("iris.csv")$select(
    pl$col("Sepal.Width")$sum()$over("Species")
  )
  
  #a lazy frame is only a tree of instructions
  print(lpf) #same as lpf$describe_plan()

## [1] "polars LazyFrame naive plan: (run ldf$describe_optimized_plan() to see the optimized plan)"
##    SELECT [col("Sepal.Width").sum().over([col("Species")])] FROM
##     CSV SCAN iris.csv; PROJECT */5 COLUMNS; SELECTION: None

  #read plan from bottom to top, says:  "read entire csv, then compute sum x over y"
  
  #what polars actually will do is the optimized plan
  
  lpf$describe_optimized_plan()

##    SELECT [col("Sepal.Width").sum().over([col("Species")])] FROM
##     CSV SCAN iris.csv; PROJECT 2/5 COLUMNS; SELECTION: None

## NULL

  #optimized plan says:  "read only column x and y from csv, compute sum x over y"
  
  #Only reading some columns or in other cases some row in to memory can save speed downstream operations. This is called peojection. 
  
  
  #to execute plan, simply call $collect() and get a polars_frame as result
  
  lpf$collect()

## polars DataFrame: shape: (150, 1)
## ┌─────────────┐
## │ Sepal.Width │
## │ ---         │
## │ f64         │
## ╞═════════════╡
## │ 171.4       │
## ├╌╌╌╌╌╌╌╌╌╌╌╌╌┤
## │ 171.4       │
## ├╌╌╌╌╌╌╌╌╌╌╌╌╌┤
## │ 171.4       │
## ├╌╌╌╌╌╌╌╌╌╌╌╌╌┤
## │ 171.4       │
## ├╌╌╌╌╌╌╌╌╌╌╌╌╌┤
## │ ...         │
## ├╌╌╌╌╌╌╌╌╌╌╌╌╌┤
## │ 148.7       │
## ├╌╌╌╌╌╌╌╌╌╌╌╌╌┤
## │ 148.7       │
## ├╌╌╌╌╌╌╌╌╌╌╌╌╌┤
## │ 148.7       │
## ├╌╌╌╌╌╌╌╌╌╌╌╌╌┤
## │ 148.7       │
## └─────────────┘

User pass user defined functions to polars

It is possible to mix R code with polars by passing R functions to polars. R functions are slower. Use native polar functions/expressions where possible.

    pl$DataFrame(iris)$select(
      pl$col("Sepal.Length")$map(\(s) { #map with a R function
        x = s$to_r_vector() #convert from Series to a native R vector
        x[x>=5] = 10
        x[1:10] # if return is R vector, it will automatically be converted to Series again
      })
    )$as_data_frame()

##    Sepal.Length
## 1          10.0
## 2           4.9
## 3           4.7
## 4           4.6
## 5          10.0
## 6          10.0
## 7           4.6
## 8          10.0
## 9           4.4
## 10          4.9

Name		Name	Last commit message	Last commit date
Latest commit History 390 Commits
.github		.github
R		R
man		man
misc		misc
renv		renv
src		src
test		test
tests		tests
.Rbuildignore		.Rbuildignore
.Rprofile		.Rprofile
.gitignore		.gitignore
DESCRIPTION		DESCRIPTION
LICENSE		LICENSE
NAMESPACE		NAMESPACE
README.md		README.md
_pkgdown.yml		_pkgdown.yml
minipolars.Rproj		minipolars.Rproj
renv.lock		renv.lock
test.py		test.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

rpolars

install latest rpolars release

Documentation:

news:

What is polars

Build from source

rpolars_teaser

Hello world

Typical ussage

`polar_frame` from `series` and R vectors

Data types

Read csv and the `polars_lazy_frame`

User pass user defined functions to polars

About

Releases 4

Packages

Languages

License

rpolars/rpolars_old

Folders and files

Latest commit

History

Repository files navigation

rpolars

install latest rpolars release

Documentation:

news:

What is polars

Build from source

rpolars_teaser

Hello world

Typical ussage

polar_frame from series and R vectors

Data types

Read csv and the polars_lazy_frame

User pass user defined functions to polars

About

Resources

License

Stars

Watchers

Forks

Releases 4

Packages 0

Languages

`polar_frame` from `series` and R vectors

Read csv and the `polars_lazy_frame`

Packages