Nanocubes for time ranges? #45

Open
hagmonk opened this issue Feb 5, 2016 · 8 comments


hagmonk commented Feb 5, 2016

Hi there! I'm grappling with an OLAP style problem and I'm hoping to apply nanocubes, but I'm not entirely sure how well my problem maps to this domain.

I've got an event stream representing changes to a set of entities. Something like 30 million entities, each of which might have a dozen dimensions. New events for each entity could arrive years or seconds apart. There is no spatial component to the data.

I mostly answer queries along the lines of 'at midnight every day between 2015-01-01 and 2015-07-31, how many entities had dimensions A = 1, B = 8, C = 3'. Maybe a colloquial way of stating the problem could be 'at midnight each day, how many people are watching netflix, eating popcorn, and wearing red socks'. My event stream only tells me when events change.

So in Postgres (after months of research into the validity of this approach) I end up building table partitions for each dimension, each row containing the entity id, the dimension's value, and the tsrange for which this fact was true. Then the problem reduces to intersecting time ranges and building plain old macro-scale cubes to cache aggregation results. But the bloat is staggering: ~6 GB of compressed data, once unpacked this way and indexed, tops 120 GB, and I'm not even considering all the possible dimensions yet. I feel like I'm forcing myself towards a big data problem I shouldn't have.
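Roughly, the per-dimension layout looks something like this (a minimal sketch with illustrative names, not my real schema):

-- one table (partition) per dimension; each row records the interval
-- over which a given value was true for an entity
create table dimension_a (
    entity_id   bigint  not null,
    value       text    not null,
    event_range tsrange not null
);

-- range containment and intersection queries lean on a range index
create index on dimension_a using spgist (event_range);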

How might one introduce the concept of an event with a duration into a nanocube? If you can point me in the right direction I'll be sure to contribute some sample code back to the repo :)

@mehmetminanc

Hi,

That's honestly a hard problem. Given my limited (a few months) experience with nanocubes, I can confidently claim that what you are asking for is not really possible with nanocubes as they stand. The reason is that data points in a nanocube cannot carry time intervals; each data point is one and only one occurrence of an event.

My naive suggestion, conceptually pretty similar to yours, is to join the latest change of each dimension of an entity with the entity itself and then filter based on your criteria. The contrast with your solution is that I am omitting explicit time intervals as a possible optimization: whatever happened last to dimension A of an entity before midnight of 2015-01-01 is the value of dimension A for that entity. I am really not sure this would make any difference, but I like to think it would limit the size of your dataset to the number of changes, asymptotically, and make the problem simpler, since you would be working on a time series.
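A rough sketch of that lookup in Postgres, assuming a hypothetical per-dimension change log with a changed_at column (the names are illustrative, not from your schema):

-- latest value of dimension A per entity as of midnight 2015-01-01
select distinct on (entity_id)
       entity_id, value
from dimension_a_changes
where changed_at < timestamp '2015-01-01 00:00:00'
order by entity_id, changed_at desc;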

What I would do if I encountered such a problem is dump the aforementioned joined entity-change table to a file on an HDFS cluster and run a Spark job to filter out the rows of interest, i.e. the latest rows of each entity before the time interval we are interested in.

That's my $.02, hope it helps.

Best wishes,
Mehmet M. INANC.


hagmonk commented Feb 5, 2016

@mehmetminanc, thanks for the comments. It is something of a hard problem :)

I do wonder whether nanocubes are truly unable to handle this. Thinking more, my instinct tells me that the spatial dimensions could be used/abused to model time. You could map the tsrange into latitude (leaving the longitude dimension at 0), then map the unique entity identifier to a time value. A query for spatial coordinates across the full dataset would normally give you the list of events that occurred in that region … in this model it would return you a list of entities that experienced changes in that time range.

I remembered this general approach from

Goh, Cheng Hian, et al. "Indexing temporal data using existing B+-trees." Data & Knowledge Engineering 18.2 (1996): 147-165. https://www.comp.nus.edu.sg/%7Eooibc/stbtree95.pdf

They suggest modeling time ranges as a (time,duration) tuple which makes them mappable onto a coordinate system.
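For concreteness, producing the (time, duration) points from the per-dimension tables could start out something like this (the epoch-second encoding is just a guess at a workable mapping; it would still need quantizing down to the cube's spatial resolution):

-- each tsrange becomes a (start, duration) point that could be fed
-- into the "spatial" dimension of a nanocube
select entity_id,
       extract(epoch from lower(event_range))                      as start_secs,
       extract(epoch from upper(event_range) - lower(event_range)) as duration_secs
from dimension_a;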

I feel like the nanocube structure itself is quite generalized and my problem is likely a case of not knowing the right question to ask :)

@mehmetminanc

Now, that is some proper abuse! I really liked the idea.

I have several concerns though. The main one is that this is theoretically the same as indexing the tsrange, or, in my solution, the latest state. If you set aside the fact that the nanocube is in memory and PostgreSQL is on disk (by default), they will both have more or less the same tree structure, and hence very similar (log(N)) computational complexity for retrieving an event, whether via tsrange, latest-state, or spatial lookup. If you think this is a practical problem of disk lookup vs. memory lookup, you can cache your PostgreSQL tables in memory altogether and still use the queries you've already come up with. There are definitely ways to load Postgres tables entirely into memory, even at the given size.

Another problem is that the nanocube, almost by definition, will aggregate tsranges that occurred at the same or very close times. Even though you have quite a bit of precision (2^25) in the latitude dimension, this is still likely to happen. It does not look insurmountable, since you've still got the longitude dimension ready to be abused, but it will make the mapping function non-trivial.

Good luck, and please keep us posted.


Best wishes,
Mehmet M. INANC.


hagmonk commented Feb 5, 2016

Abusing databases is a favorite pastime :)

Here's a little more detail that might make things more concrete (and sometimes ideas materialize during explanation). One note: I'm actually only interested in the aggregations across time; I can afford to dip back into the source tables to get a specific set of entities (as long as I know the dimensions that were used).

What kills me in Postgres with tsranges ends up being the joins. Once you are interested in four or five dimensions (not uncommon at all in my use case), you are starting to stress the query planner. Once you start down that path, you're starting to use CTEs and other optimization fences to influence the planner's decisions. It gets ugly.

[side note: I watched a talk by someone who 'solved' this problem quite creatively. Dispatch multiple combinations of the query with the CTE order shuffled around, then just use results from the quickest query and kill the rest! A distributed Darwinian query planner …]

A basic query might look like

select 
  dim1.entity_id, 
  (dim1.event_range * dim2.event_range) 
from dimension_a dim1 
  join dimension_b dim2 using (entity_id) 
where 
  dim1.value = 'red' 
  and dim2.value = 'netflix' 
  and not isempty(dim1.event_range * dim2.event_range)

After the joins and intersections are complete, I'm left with rows that give me the entity ID and the time range over which all the dimensions were true for this entity. The two ways I've dealt with this are:

  1. use generate_series to step over each bucket of interest, and count the number of matching rows in my intersected table. Basically:
select 
  g.s, count(*) 
from 
  generate_series(now() - interval '1 year', now(), '1 day') g(s), 
  intersected_ranges r 
where 
  r.event_range @> g.s::timestamp  -- cast so containment matches the tsrange
group by 1 order by 1

You can speed this up in Postgres by using a transaction in which you build a "temporary on commit drop" intersected_ranges table (it won't be in memory, but it's not logged and will be dropped on commit). Then slap an SP-GiST index on the event_range column and run the query above. It sounds crazy to build a temp table and index mid-query, but this works very, very well.
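Roughly, the whole trick looks like this (same illustrative tables and predicates as above):

begin;

-- transaction-scoped scratch table holding the intersected ranges
create temporary table intersected_ranges on commit drop as
select dim1.entity_id,
       (dim1.event_range * dim2.event_range) as event_range
from dimension_a dim1
  join dimension_b dim2 using (entity_id)
where dim1.value = 'red'
  and dim2.value = 'netflix'
  and not isempty(dim1.event_range * dim2.event_range);

-- index built mid-transaction, then used by the bucketed count
create index on intersected_ranges using spgist (event_range);

select g.s, count(*)
from generate_series(now() - interval '1 year', now(), '1 day') g(s),
     intersected_ranges r
where r.event_range @> g.s::timestamp
group by 1 order by 1;

commit;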

  2. build the same intersected_ranges table and then use unnest to translate each tsrange into a start/stop event pair. So

['2015-01-01 12:34:56', '2015-02-02 12:12:12')

becomes a pair of rows:

t                      u
'2015-01-01 12:34:56'   1
'2015-02-02 12:12:12'  -1
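One way to write that unnest step (assuming the intersected_ranges table from above; the paired arrays attach a sign to each range boundary):

-- explode each range into (timestamp, +1/-1) start/stop events
create temporary table unnested_table as
select x.t, x.u
from intersected_ranges r,
     lateral unnest(
       array[lower(r.event_range), upper(r.event_range)],
       array[1, -1]
     ) as x(t, u);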

Now it's normally fast to do an aggregation and window query something like:

  select 
    t::date as day, 
    sum(u) as "daily total", 
    sum(sum(u)) over (order by t::date) as "rolling total"
  from
    unnested_table 
  group by 1 order by 1

day           daily total   rolling total
'2015-01-01'    5             5
'2015-02-02'    3             8
'2015-02-02'   -2             6

The downside of this approach is that if there's a source data error (which can happen) that error will be carried throughout the rest of the calculation, whereas dipping into the events at intervals is somewhat more resilient.

In both cases you still have to wear the cost of the joins and the intersections, and then manage the growth of the roll-up tables and the requests for new intersections.

One potential nanocube approach would be to materialize the entire state of all the entities at some point in time, on every single day, and have it compute aggregates as if they were daily events. That could mean tens of millions of data points each day, with a dozen or so dimensions. These entities never go away or expire, either. I'm not sure if that starts to overwhelm things, and it would be expensive to compute, but I could experiment to see where the boundaries are.


cscheid commented Feb 5, 2016

Are you willing to hack around the source code? If you made a few changes to the TimeSeriesEntryType class to support signed integers, then maybe you could think of "started eating popcorn" as an event of type popcorn and value 1, and "ended eating popcorn" as an event of type popcorn and value -1.

Now you have your time series as usual, but to say "between 12:00 and 13:00" you could look at the difference between a query at 13:00 and another at 12:00. EDIT: No, I'm wrong. A single query at 12:00 is enough to give you the number of "active events".

This is sorta kinda modeling things as Dirac deltas and letting the aggregation infrastructure "integrate them into Heaviside steps".
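In the relational framing from earlier in the thread, that single query is just a sum of the signed events up to the instant of interest (a sketch assuming the unnested_table of (t, u) rows above):

-- number of "active" events at a single instant
select sum(u) as active_at_noon
from unnested_table
where t <= timestamp '2015-01-01 12:00:00';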


hagmonk commented Feb 6, 2016

Oh nice, let me experiment with that. I can totally hack something like that together. Thanks for the pointer!

This wouldn't necessarily get you the intersections for free though, would it? If I streamed in signed events this way, maybe for "eating popcorn" and "watching netflix", and I know those events came from the same entity via a unique ID, the dream would be to query for any set at a given time: people eating popcorn but not watching netflix, people doing both, people watching netflix without popcorn.

(yes I totally wrote "people eating netflix" as a set but caught it just in time ;)


cscheid commented Feb 6, 2016

This wouldn't necessarily get you the intersections for free though, would it?

Not for free, nor easily. And it's worse than that: if you want to keep the intersections around, you'll have to add them as categories in their own right: "netflix_and_popcorn", etc., and that means that if you query for the sum over all categories, you'll get the wrong result. So you'll have to be careful with the way the presentation layer displays the data as well.


hagmonk commented Feb 6, 2016

And it's worse than that: if you want to keep the intersections around, you'll have to add them as categories by themselves

This isn't much of a regression from what I'm dealing with now. We have a giant roll-up table with all the interesting intersections materialized on a day by day basis. If I can get a much more compact and speedy version of that, I'm winning.

If we go right down to netflix_and_popcorn, that means you can drop the entity ID altogether and just accumulate unique identifiers that represent each possible combination. A single entity that had these attributes would generate the set {netflix, popcorn, netflix_and_popcorn}, and you could go off and count those. That makes the code slightly simpler, I guess.
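A tiny illustration of that expansion (illustrative names, pairwise combinations only):

-- expand each entity's active attributes into single and pairwise categories,
-- e.g. {netflix, popcorn} -> netflix, popcorn, netflix_and_popcorn
with attrs(entity_id, attr) as (
  values (1, 'netflix'), (1, 'popcorn')
)
select entity_id, attr as category from attrs
union all
select a.entity_id, a.attr || '_and_' || b.attr
from attrs a
  join attrs b on a.entity_id = b.entity_id and a.attr < b.attr;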
