
feat: optimize table info() and describe() for large column tables efficiently #9639

Open · Tracked by #9828 · May be fixed by #9684
jitingxu1 opened this issue Jul 19, 2024 · 7 comments
Labels: feature (Features or general enhancements), performance (Issues related to ibis's performance)

Comments

@jitingxu1 (Contributor)

Is your feature request related to a problem?

Ibis tables provide table.info() and table.describe(). Both loop over every column, perform multiple aggregations per column, and build the output by unioning the per-column stats.

Data scientists often use these two functions for univariate analysis. When I tried them on a larger dataset (465 columns, 1.5M rows), I hit two issues:

  • table.info() took forever to return results,
  • describe() threw an out-of-memory exception.

info() mostly computes sums and means over per-column null indicators, yet it is very slow. Is this really expensive to compute on DuckDB? Here is the relevant loop:

# body of Table.info(): one select + aggregate per column, unioned at the end
aggs = []
for pos, colname in enumerate(self.columns):
    col = self[colname]
    typ = col.type()
    agg = self.select(
        isna=ibis.case().when(col.isnull(), 1).else_(0).end()
    ).agg(
        name=lit(colname),
        type=lit(str(typ)),
        nullable=lit(typ.nullable),
        nulls=lambda t: t.isna.sum(),
        non_nulls=lambda t: (1 - t.isna).sum(),
        null_frac=lambda t: t.isna.mean(),
        pos=lit(pos, type=dt.int16),
    )
    aggs.append(agg)
return ibis.union(*aggs).order_by(ibis.asc("pos"))

describe() throws an out-of-memory exception. It generates 465 small tables for the union, and I'm not sure why that takes so much memory:

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /Users/claypot/miniconda3/envs/ibisml-dev/lib/python3.12/site-packages/ibis/expr/types/core.py:9 │
│ 9 in __rich_console__                                                                            │
│                                                                                                  │
│    96 │   │                                                                                      │
│    97 │   │   try:                                                                               │
│    98 │   │   │   if opts.interactive:                                                           │
│ ❱  99 │   │   │   │   rich_object = to_rich(self, console_width=console_width)                   │
│   100 │   │   │   else:                                                                          │
│   101 │   │   │   │   rich_object = Text(self._noninteractive_repr())                            │
│   102 │   │   except Exception as e:                                                             │
│                                                                                                  │
│ /Users/claypot/miniconda3/envs/ibisml-dev/lib/python3.12/site-packages/ibis/expr/types/pretty.py │
│ :273 in to_rich                                                                                  │
│                                                                                                  │
│   270 │   │   │   expr, max_length=max_length, max_string=max_string, max_depth=max_depth        │
│   271 │   │   )                                                                                  │
│   272 │   else:                                                                                  │
│ ❱ 273 │   │   return _to_rich_table(                                                             │
│   274 │   │   │   expr,                                                                          │
│   275 │   │   │   max_rows=max_rows,                                                             │
│   276 │   │   │   max_columns=max_columns,                                                       │
│                                                                                                  │
│ /Users/claypot/miniconda3/envs/ibisml-dev/lib/python3.12/site-packages/ibis/expr/types/pretty.py │
│ :345 in _to_rich_table                                                                           │
│                                                                                                  │
│   342 │   │   if orig_ncols > len(computed_cols):                                                │
│   343 │   │   │   table = table.select(*computed_cols)                                           │
│   344 │                                                                                          │
│ ❱ 345 │   result = table.limit(max_rows + 1).to_pyarrow()                                        │
│   346 │   # Now format the columns in order, stopping if the console width would                 │
│   347 │   # be exceeded.                                                                         │
│   348 │   col_info = []                                                                          │
│                                                                                                  │
│ /Users/claypot/miniconda3/envs/ibisml-dev/lib/python3.12/site-packages/ibis/expr/types/core.py:4 │
│ 86 in to_pyarrow                                                                                 │
│                                                                                                  │
│   483 │   │   Table                                                                              │
│   484 │   │   │   A pyarrow table holding the results of the executed expression.                │
│   485 │   │   """                                                                                │
│ ❱ 486 │   │   return self._find_backend(use_default=True).to_pyarrow(                            │
│   487 │   │   │   self, params=params, limit=limit, **kwargs                                     │
│   488 │   │   )                                                                                  │
│   489                                                                                            │
│                                                                                                  │
│ /Users/claypot/miniconda3/envs/ibisml-dev/lib/python3.12/site-packages/ibis/backends/duckdb/__in │
│ it__.py:1375 in to_pyarrow                                                                       │
│                                                                                                  │
│   1372 │   │   limit: int | str | None = None,                                                   │
│   1373 │   │   **_: Any,                                                                         │
│   1374 │   ) -> pa.Table:                                                                        │
│ ❱ 1375 │   │   table = self._to_duckdb_relation(expr, params=params, limit=limit).arrow()        │
│   1376 │   │   return expr.__pyarrow_result__(table)                                             │
│   1377 │                                                                                         │
│   1378 │   def execute(                                                                          │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
OutOfMemoryException: Out of Memory Error: failed to offload data block of size 256.0 KiB (19.6 GiB/19.6 GiB used).
This limit was set by the 'max_temp_directory_size' setting.
By default, this setting utilizes the available disk space on the drive where the 'temp_directory' is located.
You can adjust this setting, by using (for example) PRAGMA max_temp_directory_size='10GiB'

What is the motivation behind your request?

These two functions are very useful for univariate analysis.

Describe the solution you'd like

Could we do some batch or parallel computing?
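
For example, the null statistics could in principle be gathered in one aggregation over all columns, instead of one select/aggregate/union branch per column. A rough sketch of the idea, not an implementation (the two-column schema is a stand-in for the real 465-column table):

import ibis

t = ibis.table({"col_0": "float64", "col_1": "float64"}, name="t")  # stand-in

# one aggregation node computing the null count for every column at once,
# instead of one union branch per column
expr = t.aggregate(
    **{f"{name}_nulls": t[name].isnull().sum() for name in t.columns}
)
print(ibis.to_sql(expr))  # compiles to a single SELECT over the table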

What version of ibis are you running?

9.1.0

What backend(s) are you using, if any?

DuckDB

Code of Conduct

  • I agree to follow this project's Code of Conduct
@jitingxu1 jitingxu1 added the feature Features or general enhancements label Jul 19, 2024
@jcrist jcrist added the performance Issues related to ibis's performance label Jul 22, 2024
@jcrist (Member) commented Jul 22, 2024

Can you separate out the time it takes to execute from the time it takes to construct the expression? Timing:

# how long the expression takes to construct
expr = t.describe()

# how long it takes to compile
ibis.to_sql(expr)

# and how long it takes to execute (time for below, minus time for compile above)
expr.execute()
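
For example, each stage can be timed with time.perf_counter (a sketch; t stands for the large table from the report):

import time

import ibis

start = time.perf_counter()
expr = t.describe()        # expression construction
t_construct = time.perf_counter() - start

start = time.perf_counter()
sql = ibis.to_sql(expr)    # compilation
t_compile = time.perf_counter() - start

start = time.perf_counter()
expr.execute()             # compilation + execution
t_execute = time.perf_counter() - start - t_compile

print(f"construct={t_construct:.1f}s compile={t_compile:.1f}s execute≈{t_execute:.1f}s")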

@jitingxu1 (Contributor, Author)

import ibis
import pandas as pd
import numpy as np

num_rows = 1500000
num_cols = 450
data = np.random.rand(num_rows, num_cols)
columns = [f'col_{i}' for i in range(num_cols)]
df = pd.DataFrame(data, columns=columns)
memtable = ibis.memtable(df)

I tested table.info() on this memtable with 1.5M rows and 450 columns. Here are the times for compilation and execution:

Compile time: 5.6186909675598145 seconds
Compile + Running time: 109.90575790405273 seconds

The above running time is acceptable for this size.

The issue I hit was in a Kaggle competition. I ran df_train.info(), but df_train was generated by joining multiple tables, each of which underwent operations such as casts and aggregations. The additional transformations made the final query in df_train.info() enormous, and execution failed with this error:

File /opt/conda/lib/python3.10/site-packages/ibis/backends/duckdb/__init__.py:1409, in Backend.execute(self, expr, params, limit, **_)
   1406 import pandas as pd
   1407 import pyarrow.types as pat
-> 1409 table = self._to_duckdb_relation(expr, params=params, limit=limit).arrow()
   1411 df = pd.DataFrame(
   1412     {
   1413         name: (
   (...)
   1425     }
   1426 )
   1427 df = DuckDBPandasData.convert_table(df, expr.as_table().schema())

OutOfMemoryException: Out of Memory Error: failed to offload data block of size 256.0 KiB (17.5 GiB/17.5 GiB used).
This limit was set by the 'max_temp_directory_size' setting.
By default, this setting utilizes the available disk space on the drive where the 'temp_directory' is located.
You can adjust this setting, by using (for example) PRAGMA max_temp_directory_size='10GiB'

I set the DuckDB configuration:

con.raw_sql("PRAGMA max_temp_directory_size='50GiB';")
con.raw_sql("PRAGMA max_memory='30GiB';")

but it did not help.

@cpcloud (Member) commented Jul 24, 2024

Looking into this, a large source of the Ibis overhead here is the 450 relations we are constructing, each of which must go through binding and dereferencing, neither of which is particularly cheap.

I poked around DuckDB's SUMMARIZE command, and looking at EXPLAIN SUMMARIZE t I saw a curious-looking UNNEST on an ungrouped aggregate, which made me remember that for pivot_longer we collect (in the generic sense, unrelated to execution) the aggregates into an array expression and then do a joint unnest on all the arrays we've constructed.
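
Schematically: fold every per-column statistic into a single ungrouped aggregate that returns an array (here, of structs), then unnest that array back into one row per column. A minimal sketch of the shape, assuming ibis's array/struct/unnest APIs on a made-up two-column table (this is not the code from #9684):

import ibis

t = ibis.table({"a": "int64", "b": "float64"}, name="t")  # hypothetical

# single aggregate: one array element (a struct of stats) per column
agg = t.aggregate(
    info=ibis.array(
        [
            ibis.struct(
                {
                    "name": ibis.literal(name),
                    "nulls": t[name].isnull().sum(),
                    "non_nulls": t[name].count(),
                }
            )
            for name in t.columns
        ]
    )
)

# explode the array into one row per column, then unpack the struct fields
expr = agg.select(info=agg.info.unnest()).unpack("info")

This keeps the whole computation in one aggregation node instead of hundreds of unioned relations.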

I tried a similar approach here, and after adding some benchmarks I'm seeing about an 11x (!) speed up in expression construction:

----------------------------------------------------------------------------- benchmark 'test_summarize_construct[describe]': 2 tests ------------------------------------------------------------------------------
Name (time in s)                                          Min                Max               Mean            StdDev             Median               IQR            Outliers     OPS            Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_summarize_construct[describe] (0001_fda484c)     12.6991 (11.46)    12.8223 (10.70)    12.7558 (11.14)    0.0453 (1.0)      12.7573 (11.46)    0.0546 (1.0)           2;0  0.0784 (0.09)          5           1
test_summarize_construct[describe] (NOW)               1.1084 (1.0)       1.1988 (1.0)       1.1450 (1.0)      0.0467 (1.03)      1.1130 (1.0)      0.0841 (1.54)          2;0  0.8734 (1.0)           5           1
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

The caveat here is that not all backends support arrays, so we won't be able to see this improvement in every backend.

@cpcloud (Member) commented Jul 24, 2024

For the curious, here's the DuckDB plan where you can see the UNNEST:

D create table penguins as from read_csv('/data/penguins.csv');
D explain summarize (select year from penguins);

┌─────────────────────────────┐
│┌───────────────────────────┐│
││       Physical Plan       ││
│└───────────────────────────┘│
└─────────────────────────────┘
┌───────────────────────────┐
│         PROJECTION        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│        column_name        │
│        column_type        │
│            min            │
│            max            │
│       approx_unique       │
│            avg            │
│            std            │
│            q25            │
│            q50            │
│            q75            │
│           count           │
│      null_percentage      │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│           UNNEST          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│    UNGROUPED_AGGREGATE    │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│          min(#0)          │
│          max(#1)          │
│ approx_count_distinct(#2) │
│          avg(#3)          │
│         stddev(#4)        │
│    approx_quantile(#5)    │
│    approx_quantile(#6)    │
│    approx_quantile(#7)    │
│        count_star()       │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         PROJECTION        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│            year           │
│            year           │
│            year           │
│            year           │
│    CAST(year AS DOUBLE)   │
│            year           │
│            year           │
│            year           │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         SEQ_SCAN          │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│          penguins         │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│            year           │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│          EC: 344          │
└───────────────────────────┘

@cpcloud (Member) commented Jul 24, 2024

A follow-up to #9684 might actually make a proper Summarize operation in Ibis that, at least in the DuckDB case, would have an extremely small constant overhead to construct the expression.

@cpcloud (Member) commented Jul 24, 2024

Even before that, we could probably move this entire thing to compilation time, which would make this much cheaper for both the union case and the array case ... I'll look into it.

@jitingxu1 (Contributor, Author)

Hi @cpcloud, thank you so much for your prompt check and fix.
