Nanocubes for time ranges? #45
Comments
Hi,

That's honestly a hard problem. Given my limited (a few months) experience [...] My naive suggestion, conceptually pretty similar to yours, is joining all [...] What I would do if I encountered such a problem is that I would dump the [...]

That's my $.02, hope it helps.
@mehmetminanc, thanks for the comments. It is something of a hard problem :)

I do wonder whether nanocubes are truly unable to handle this. Thinking more, my instinct tells me that the spatial dimensions could be used/abused to model time. You could map the tsrange into latitude (leaving the longitude dimension at 0), then map the unique entity identifier to a time value. A query for spatial coordinates across the full dataset would normally give you the list of events that occurred in that region … in this model it would return you a list of entities that experienced changes in that time range.

I remembered this general approach from … they suggest modeling time ranges as a (time, duration) tuple, which makes them mappable onto a coordinate system.

I feel like the nanocube structure itself is quite generalized and my problem is likely a case of not knowing the right question to ask :)
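As a minimal sketch of that (time, duration) mapping, assuming a hypothetical per-dimension table `dim_popcorn(entity_id, valid tsrange)`:

```sql
-- Hypothetical: project each tsrange onto a (start, duration) point, so a
-- time range becomes something you can place on a 2-D coordinate system.
SELECT entity_id,
       EXTRACT(EPOCH FROM lower(valid))                  AS t_start,    -- "x"
       EXTRACT(EPOCH FROM (upper(valid) - lower(valid))) AS duration_s  -- "y"
FROM   dim_popcorn;
```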
Now, that is some proper abuse! I really liked the idea.

I have several concerns though, the main one being the fact that this is [...] Another problem is that nanocube, almost by definition, will aggregate [...]

Good luck, and please keep it up.
Abusing databases is a favorite pastime :) Here's a little more detail that might make things more concrete (and sometimes ideas materialize during explanation).

One note is that I'm actually only interested in the aggregations across time. I can afford to dip back into the source tables to get a specific set of entities (as long as I know the dimensions that were used).

What kills me in Postgres with tsranges ends up being the joins. Once you are interested in four or five dimensions (not uncommon at all in my use case), you are starting to stress the query planner. Once you start down that path, you're starting to use CTEs and other optimization fences to influence the planner's decisions. It gets ugly. [Side note: I watched a talk by someone who "solved" this problem quite creatively. Dispatch multiple combinations of the query with the CTE order shuffled around, then just use results from the quickest query and kill the rest! A distributed Darwinian query planner …]

A basic query might look like the sketch below.
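Roughly, as a sketch (the `dim_a`/`dim_b`/`dim_c` tables and their columns are hypothetical stand-ins for the per-dimension partitions, each holding an entity id, a value, and the tsrange over which that value was true):

```sql
-- Join one partition per dimension on the entity and intersect the ranges;
-- what survives is the window in which all three facts held simultaneously.
SELECT a.entity_id,
       a.valid * b.valid * c.valid AS event_range       -- tsrange intersection
FROM   dim_a a
JOIN   dim_b b ON b.entity_id = a.entity_id
JOIN   dim_c c ON c.entity_id = a.entity_id
WHERE  a.value = 1
  AND  b.value = 8
  AND  c.value = 3
  AND  NOT isempty(a.valid * b.valid * c.valid);         -- ranges actually overlap
```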
After the joins and intersections are complete, I'm left with rows that give me the entity ID and the time range over which all the dimensions were true for that entity. I've dealt with those rows in two ways, sketched below.
One way is simply to probe the intersected ranges at each timestamp I care about (every midnight, say). You can speed this up in Postgres by using a transaction in which you build a "temporary on commit drop" intersected_ranges table (it won't be in memory, but it's not logged and will be dropped on commit). Then slap an SP-GiST index on the event_range column and perform the query above. Crazy to build a temp table and index mid-query, but this works very, very well.
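A rough sketch of that temp-table trick, under the same hypothetical schema as above (the final probe query is just an illustration of how the index gets used):

```sql
BEGIN;

-- Materialize the intersected ranges once, inside the transaction; the table
-- is not WAL-logged and vanishes at COMMIT.
CREATE TEMPORARY TABLE intersected_ranges ON COMMIT DROP AS
SELECT a.entity_id,
       a.valid * b.valid * c.valid AS event_range
FROM   dim_a a
JOIN   dim_b b ON b.entity_id = a.entity_id
JOIN   dim_c c ON c.entity_id = a.entity_id
WHERE  a.value = 1 AND b.value = 8 AND c.value = 3
  AND  NOT isempty(a.valid * b.valid * c.valid);

CREATE INDEX ON intersected_ranges USING spgist (event_range);

-- "How many entities had all three facts true at this instant?"
SELECT count(*)
FROM   intersected_ranges
WHERE  event_range @> TIMESTAMP '2015-01-01 00:00';

COMMIT;
```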
The other way is to flatten things out, so that each intersected range becomes a row marking when the combined condition started being true and when it stopped.
Now it's normally fast to do an aggregation and window query, something like the sketch below.
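A minimal sketch of that aggregation, assuming the flattened rows carry a +1 at each range start and a -1 at each range end (built here from the `intersected_ranges` table above):

```sql
-- Running total of "how many entities have all dimensions true" over time:
-- each range opens with +1 and closes with -1, and a window SUM over the
-- ordered boundaries gives the active count at every transition point.
WITH boundaries AS (
    SELECT lower(event_range) AS ts, +1 AS delta FROM intersected_ranges
    UNION ALL
    SELECT upper(event_range) AS ts, -1 AS delta FROM intersected_ranges
)
SELECT ts,
       SUM(delta) OVER (ORDER BY ts
                        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS entities_active
FROM   boundaries
ORDER  BY ts;
```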
The downside of this approach is that if there's a source data error (which can happen), that error will be carried throughout the rest of the calculation, whereas dipping into the events at intervals is somewhat more resilient. In both cases you still have to wear the cost of the joins and the intersections, and then manage the growth of the roll-up tables and requests for new intersections.

One potential nanocube approach would be to materialize the entire state of all the entities at some point in time, on every single day, and have it compute aggregates as if they were daily events. That could mean tens of millions of data points each day, with a dozen or so dimensions. These entities never go away or expire, either. I'm not sure if that starts to overwhelm things, and it would be expensive to compute, but I could experiment to see where the boundaries are.
Are you willing to hack around the source code? If you made a few changes to the TimeSeriesEntryType class to support signed integers, then maybe you could think of "started eating popcorn" as an event of type popcorn and value 1, and "ended eating popcorn" as an event of type popcorn and value -1. Now you have your time series as usual, but to say "between 12:00 and 13:00" you could look at the difference between a query at 13:00 and another at 12:00.

EDIT: No, I'm wrong. A single query at 12:00 is enough to give you the number of "active events". This is sorta kinda modeling things as Dirac deltas and letting the aggregation infrastructure "integrate them into Heaviside steps".
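In SQL terms (a hypothetical `signed_events(event_type, event_time, value)` table with +1 on start and -1 on end), the single-query observation is just this:

```sql
-- The number of "active" popcorn events at 12:00 is simply the cumulative
-- sum of the signed values up to that instant.
SELECT SUM(value) AS active_popcorn_eaters
FROM   signed_events
WHERE  event_type = 'popcorn'
  AND  event_time <= TIMESTAMP '2016-02-05 12:00';
```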
Oh nice, let me experiment with that. I can totally hack something like that together. Thanks for the pointer!

This wouldn't necessarily get you the intersections for free though, would it? If I streamed in signed events this way, maybe for "eating popcorn" and "watching netflix", and I know those events came from the same entity via a unique ID, the dream would be to query for any set at a given time: people eating popcorn but not watching netflix, people doing both, people watching netflix without popcorn. (Yes, I totally wrote "people eating netflix" as a set but caught it just in time ;)
Not for free, nor easily. And it's worse than that: if you want to keep the intersections around, you'll have to add them as categories by themselves ("netflix_and_popcorn", etc.), and that means that if you query for the sum over all categories, you'll get the wrong result. So you'll have to be careful with the way the presentation layer will display the data as well.
This isn't much of a regression from what I'm dealing with now. We have a giant roll-up table with all the interesting intersections materialized on a day-by-day basis. If I can get a much more compact and speedy version of that, I'm winning.

If we go right down to netflix_and_popcorn, that means you can drop the entity ID altogether and just accumulate unique identifiers that represent each possible combination. A single entity that had these attributes would generate the set {netflix, popcorn, netflix_and_popcorn} and you could go off and count those. That makes the code slightly simpler, I guess.
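A little sketch of that combination-tag idea, using a hypothetical `entity_state(entity_id, watching_netflix, eating_popcorn)` snapshot with boolean flags:

```sql
-- Each entity emits one tag per category it belongs to, including the explicit
-- intersection tag, so counts can be accumulated without keeping the entity ID.
SELECT s.entity_id, t.tag
FROM   entity_state s,
       LATERAL (VALUES
           ('netflix',             s.watching_netflix),
           ('popcorn',             s.eating_popcorn),
           ('netflix_and_popcorn', s.watching_netflix AND s.eating_popcorn)
       ) AS t(tag, applies)
WHERE  t.applies;
```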
Hi there! I'm grappling with an OLAP style problem and I'm hoping to apply nanocubes, but I'm not entirely sure how well my problem maps to this domain.
I've got an event stream representing changes to a set of entities. Something like 30 million entities, each of which might have a dozen dimensions. New events for each entity could arrive years or seconds apart. There is no spatial component to the data.
I mostly answer queries along the lines of 'at midnight every day between 2015-01-01 and 2015-07-31, how many entities had dimensions A = 1, B = 8, C = 3'. Maybe a colloquial way of stating the problem could be 'at midnight each day, how many people are watching netflix, eating popcorn, and wearing red socks'. My event stream only tells me when events change.
So in Postgres (after months of research into the validity of this approach) I end up building table partitions for each dimension, each row containing the entity ID, the dimension's value, and the tsrange for which this fact was true. Then the problem reduces to intersecting time ranges and building plain old macro-scale cubes to cache aggregation results. But the bloat is staggering: ~6 GB of compressed data, unpacked this way and indexed, tops 120 GB, and I'm not even considering all the possible dimensions yet. I feel like I'm forcing myself towards a big data problem I shouldn't have.
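For concreteness, a rough sketch of the shape of one such partition (names are illustrative, not my actual schema):

```sql
-- One table per dimension: which value each entity had, and the tsrange over
-- which that value was true. The range index is what makes "true at time T"
-- lookups workable, and also a big part of the on-disk bloat.
CREATE TABLE dim_watching_netflix (
    entity_id  bigint  NOT NULL,
    value      integer NOT NULL,
    valid      tsrange NOT NULL
);

CREATE INDEX ON dim_watching_netflix USING spgist (valid);
```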
How might one introduce the concept of an event with a duration into a nanocube? If you can point me in the right direction I'll be sure to contribute some sample code back to the repo :)