C (compression) and M (maxDiscrete) parameters return the same results #16
Regarding compression: if you want to observe different results, try setting max-discrete to values like 25 or 50 (or zero), and vary the compression values while keeping them <= 1. Let me know if you continue having issues.
Mr. Erlandson, thank you for the quick response and detailed explanation. We've observed some tiny differences in our TDigest using M = 25 and M = 0, though changing C has not made any difference so far. The aggregated results* still remain exactly the same even with extreme values like C = 0.000001 or M = 100000, and they under-estimate the true results.

We counted the uniques via some Spark SQL functions: most fruit_type values have anywhere between 100 and 3,000 unique discrete values. The distributions are extremely right-skewed, so most rows fall somewhere between 0 and 10, with outliers in the range of many thousands. Is there something in the package that particularly excels at dealing with distributions this skewed? I think I've heard you mention working on something similar in your Spark Summit talk, but I couldn't grasp the entirety of it.

*We use the cdfInverse() function in this package to obtain a "threshold number", which describes the floor value for the top 30% of consumers of any fruit_type, so the program can give those consumers a label (if they are in the top 30%). For example, if the threshold number returned by cdfInverse() for Avocados is 60, then whoever has bought more than 60 Avocados is given a label like "Avocado enjoyer". As you may have guessed by now, this process involves a few other table joins and transformations that I have not disclosed here; I could see about revealing some of that if you think it would help the explanation. Thank you in advance.
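For concreteness, the labeling workflow described above can be sketched without any sketching library at all, by computing the exact threshold from sorted data. This is my own illustrative Python (the function names `threshold_for` and `top_labels` are hypothetical, not part of the isarn API), using the same strict ">" semantics described later in the thread:

```python
def threshold_for(counts, top_fraction=0.30):
    """Return the exact empirical threshold: consumers strictly above
    this value are in the top `top_fraction` of the population."""
    s = sorted(counts)
    # index of the (1 - top_fraction) quantile in the sorted data
    idx = int((1.0 - top_fraction) * len(s))
    return s[min(idx, len(s) - 1)]

def top_labels(consumer_counts, top_fraction=0.30):
    """consumer_counts: dict of consumer id -> count for one fruit_type.
    Returns the set of ids to label, using a strict ">" comparison."""
    t = threshold_for(list(consumer_counts.values()), top_fraction)
    return {cid for cid, c in consumer_counts.items() if c > t}
```

On a large dataframe this exact computation is what cdfInverse() is approximating; comparing the two on a sample is a cheap sanity check of sketch accuracy.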
@velwu can you attach some example data from your use case? A single column of numeric values that you are trying to sketch, and which is giving you trouble, would be sufficient for me to get a feel for what the algorithm is actually doing and for possible solutions. Speaking in generalities: the t-digest is designed to allocate more resolution at the tails of a distribution (by making cluster sizes smaller at the tails). However, it is still a sketch, and if a distribution has an extremely long tail, I am unsure what its behavior would be.
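The "smaller clusters at the tails" behavior comes from the t-digest scale function. A rough Python illustration (my own sketch, not the isarn implementation; I am assuming the arcsine-based scale function from the original t-digest design, with `delta` playing the role of the compression parameter):

```python
import math

def k1(q, delta):
    # Scale function mapping quantile q into "k-space"; a cluster may
    # span at most one unit of k, so where k changes quickly (near the
    # tails), clusters are forced to hold less mass.
    return (delta / (2 * math.pi)) * (math.asin(2 * q - 1) + math.pi / 2)

def cluster_capacity(q, delta):
    # Proportional to 1 / k'(q): large in the middle, tiny at the tails.
    return 2 * math.pi * math.sqrt(q * (1 - q)) / delta
```

So clusters near the median may absorb far more mass than clusters near q = 0.001, which is why extreme quantiles stay sharp even at modest compression settings.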
Mr. Erlandson (@erikerlandson), yes, thank you. Here is some of that example data:

- top30percent_num is a "threshold number", meaning 30% of the population is strictly greater than (a ">" comparison, not ">=") the value 4.0
- centroids is an array obtained using TDigest.getCentUnsafe.mkString("Array(", ", ", ")")
- masses is an array obtained using TDigest.getMassUnsafe.mkString("Array(", ", ", ")")

===== DATA BELOW =====

fruit_type = Tayberry

top30percent_num = 4.0

centroids = Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.000073393521917, 7.000145750045635, 8.001188117725917, 9.003333396428566, 10.008484086672457, 11.035982725595119, 12.436242825159047, 13.935279735232925, 15.098994579197218, 16.511259867781114, 18.260182142253313, 19.86941640294012, 21.362021349000482, 22.841753038589435, 24.317914199903715, 25.96390094634534, 27.983366298230738, 30.489735175798966, 33.51140603472986, 37.2055873257278, 41.44547946599917, 46.304810343084505, 51.57747302397325, 56.94515134503221, 62.41283791796801, 68.20791010944029, 74.6865420828331, 82.69414647439031, 92.770435998285, 104.70655644456826, 119.0220386505123, 136.4907987446255, 159.05861224356724, 186.68287985110416, 217.53985402457968, 247.63889721835227, 274.4762959357546, 299.3535887854102, 324.0647354917961, 350.8301681602125, 382.09159905518254, 411.7050695585377, 439.39454418428744, 467.6237175057026, 499.6498284103393, 537.3330704370983, 579.8583082086442, 632.3023366378322, 693.3623139333414, 751.1584889780127, 801.4966764041216, 845.7856180034878, 891.0909090909093, 939.8145365243353, 998.8138400744841, 1066.5521025124158, 1131.6498041132077, 1198.63848992374, 1254.5925924620633, 1303.8717434385364, 1360.8831857872362, 1416.3451150421936, 1480.1245331341975, 1533.5714285714287, 1566.9512172885884, 1599.0512769215163, 1661.2500000000002, 1744.473668569509, 1787.3076868071576, 1822.0, 1861.6666619188447, 1881.999986646775, 1928.0, 1974.0, 2064.0, 2070.4000021711554, 2178.666659149131, 2219.99998081348, 2282.0, 2293.4285750385197, 2304.0, 2421.0, 3396.0)

masses = Array(5398281.0, 3633088.0, 2460545.0, 1689881.0, 1179151.0, 842740.8517191977, 610634.4571238874, 453918.6911569149, 346840.2754764726, 271727.31541708275, 222724.34757794722, 290800.7335715829, 138476.42502237728, 119751.83242871228, 140158.8590997329, 117470.43421967147, 77971.99400099477, 71570.91652402612, 55204.863267924055, 53772.480326603574, 51903.72173838791, 58408.46021909827, 53086.57077887626, 54061.016527306994, 47315.581558571066, 40205.249888977414, 34011.498867577844, 25979.426857923092, 19550.889272686665, 16287.614592011409, 13231.863246870456, 12709.078315434372, 12420.159968650645, 11807.999980939347, 10246.373476650368, 9220.554967032673, 8640.146864860595, 7896.745173329689, 6409.272525296329, 4516.915520130643, 3173.5667872684385, 2161.224538285732, 1699.762085969193, 1497.8649416683236, 1483.7521192429406, 1347.542964248235, 987.6982021801848, 826.1687999458092, 778.5582398326427, 796.893712157504, 752.4167177490278, 717.6838723017004, 682.2497454108154, 528.4997272212621, 356.5002727787379, 186.75017624016272, 149.24982375983728, 132.0, 130.74996542265004, 126.25003457734996, 88.749984826661, 64.250015173339, 53.24999256593924, 33.75000743406076, 26.31250730264361, 21.93748591423703, 21.187496363307265, 17.562510419812092, 7.0, 10.249997731867541, 9.750002268132459, 8.0, 4.749997848168963, 3.250002151831037, 3.0, 2.9999973293525932, 3.0000026706474068, 2.0, 3.0, 0.7500004240536535, 1.2499995759463465, 2.999998990182134, 1.0000010098178662, 0.2500005527731277, 1.7499994472268723, 1.0, 1.0, 1.0)
Thanks @velwu - can you attach a file with the raw count values? If it is too large to attach here, email me at
I ran a sketch of the "tayberry" data, with the following results. I'd consider the following to be a good sketch of the data.

scala> val data = io.Source.fromFile("/home/eje/tayberry_raw.csv").getLines.toVector.map(_.toInt)
data: scala.collection.immutable.Vector[Int] = Vector(1, 5, 2, 2, ...
scala> import org.isarnproject.sketches.TDigest
import org.isarnproject.sketches.TDigest
scala> val td = TDigest.sketch(data)
td: org.isarnproject.sketches.TDigest = TDigest(0.5,0,63,TDigestMap(1.0 -> (5644016.0, 5644016.0), 2.0 -> (3410123.0, 9054139.0), 3.0 -> (2086517.0, 1.1140656E7), 4.0 -> (1314806.0, 1.2455462E7), ...
scala> val datasort = data.sortWith((a, b)=>(a < b))
datasort: scala.collection.immutable.Vector[Int] = Vector(1, 1, 1,...
scala> def quantiles(td:TDigest, sorted: IndexedSeq[Int], q: Double) = (sorted((q * sorted.length).toInt), td.cdfInverse(q), td.cdfDiscreteInverse(q))
quantiles: (td: org.isarnproject.sketches.TDigest, sorted: IndexedSeq[Int], q: Double)(Int, Double, Double)
scala> quantiles(td, datasort, 0.1)
res8: (Int, Double, Double) = (1, 1.2197759786857603, 1.0)
scala> quantiles(td, datasort, 0.3)
res9: (Int, Double, Double) = (1, 1.6593279360572806, 1.0)
scala> quantiles(td, datasort, 0.5)
res10: (Int, Double, Double) = (2, 2.0, 2.0)
scala> quantiles(td, datasort, 0.7)
res11: (Int, Double, Double) = (4, 3.710698395888893, 4.0)
scala> quantiles(td, datasort, 0.9)
res12: (Int, Double, Double) = (8, 8.0, 8.0)
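The pattern in the REPL session above is easy to reproduce: for heavily discrete integer data, the exact empirical quantile is always one of the observed integer values, which is exactly what cdfDiscreteInverse() is designed to return, while the continuous cdfInverse() interpolates between centroids and can land between integers. A self-contained Python analogue (synthetic data and helper names are my own, not the isarn API):

```python
import random

random.seed(0)
# Synthetic right-skewed integer counts, roughly the shape of the raw
# tayberry data: value v occurs with weight proportional to 2^(10 - v).
data = sorted(random.choices(range(1, 11),
                             weights=[2 ** (10 - i) for i in range(1, 11)],
                             k=100_000))

def empirical_quantile(sorted_data, q):
    # Exact quantile from fully sorted data, matching the first element
    # of the `quantiles` tuple in the REPL session above.
    return sorted_data[min(int(q * len(sorted_data)), len(sorted_data) - 1)]
```

Every value `empirical_quantile` returns here is an observed integer, so a discrete-aware inverse CDF can match it exactly; a continuous interpolation generally cannot.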
Then it would seem this has been a case of identifying the right function for the job all along. The discrete variant cdfDiscreteInverse() returns exactly the same values as the ones from the sorted data, which is encouraging. Thank you. I will notify my team about this and observe what improvements our use case gets from it; I will post another comment here soon.
Greetings, Mr. Erlandson (@erikerlandson),
The team I represent recently started using this solution and we've been trying to tune accuracy and performance via
different C (compression) and M (maxDiscrete) parameters.
In our case, we have a dataframe called "originDf" with 3 columns ("id": String, "fruit_type": String, "count": Integer):
+---+------------+-----+
|id |fruit_type  |count|
+---+------------+-----+
|001|Apples      |2    |
|002|Apricots    |79   |
|001|Avocados    |4    |
|003|Watermelon  |13   |
|007|Blueberries |5    |
|007|Cherries    |6    |
|007|Clementine  |41   |
|007|Cucumbers   |5    |
|007|Elderberry  |3    |
|007|Eggfruit    |1    |
|008|Eggfruit    |19   |
|012|Clementine  |61   |
|013|Blueberries |21   |
|014|Blueberries |4    |
|...|Lime        |4    |
|...|Rambutan    |3    |
|...|Strawberries|6    |
|...|Watermelon  |5    |
|...|Tangerine   |3    |
|...|Tangerine   |6    |
+---+------------+-----+
This dataframe has roughly 200 million rows in total.
We then did the following:
val t_digest_udf1 = TDigestAggregator.udf[Double](compression = C, maxDiscrete = M)
val groupDf1 = originDf.groupBy(col("fruit_type")).agg(t_digest_udf1(col("count")) as "t_digests")
val row1 = groupDf1.first()
val t1 = row1.getAs[TDigest]("t_digests")
where (C, M) is (100, 100), (0.01, 100), and (100, 10000) respectively, with the resulting single TDigest from each setting named t1, t2, and t3:
t1: org.isarnproject.sketches.java.TDigest = TDigest(1.0 -> (22240.0, 22240.0), 2.0 -> (6509.0, 28749.0), 3.0 -> (2936.0, 31685.0), 4.0 -> (1594.0, 33279.0), 5.0 -> (1096.0, 34375.0), 6.0 -> (767.0, 35142.0), 7.0 -> (523.0, 35665.0), 8.0 -> (404.0, 36069.0), 9.0 -> (358.0, 36427.0), 10.0 -> (284.0, 36711.0), 11.0 -> (201.0, 36912.0), 12.0 -> (189.0, 37101.0), 13.0 -> (162.0, 37263.0), 14.0 -> (120.0, 37383.0), 15.0 -> (98.0, 37481.0), 16.0 -> (86.0, 37567.0), 17.0 -> (68.0, 37635.0), 18.0 -> (69.0, 37704.0), 19.0 -> (63.0, 37767.0), 20.0 -> (50.0, 37817.0), 21.0 -> (50.0, 37867.0), 22.0 -> (44.0, 37911.0), 23.0 -> (37.0, 37948.0), 24.0 -> (31.0, 37979.0), 25.0 -> (29.0, 38008.0), 26.0 -> (24.0, 38032.0) ...)
t2: org.isarnproject.sketches.java.TDigest = TDigest(1.0 -> (22240.0, 22240.0), 2.0 -> (6509.0, 28749.0), 3.0 -> (2936.0, 31685.0), 4.0 -> (1594.0, 33279.0), 5.0 -> (1096.0, 34375.0), 6.0 -> (767.0, 35142.0), 7.0 -> (523.0, 35665.0), 8.0 -> (404.0, 36069.0), 9.0 -> (358.0, 36427.0), 10.0 -> (284.0, 36711.0), 11.0 -> (201.0, 36912.0), 12.0 -> (189.0, 37101.0), 13.0 -> (162.0, 37263.0), 14.0 -> (120.0, 37383.0), 15.0 -> (98.0, 37481.0), 16.0 -> (86.0, 37567.0), 17.0 -> (68.0, 37635.0), 18.0 -> (69.0, 37704.0), 19.0 -> (63.0, 37767.0), 20.0 -> (50.0, 37817.0), 21.0 -> (50.0, 37867.0), 22.0 -> (44.0, 37911.0), 23.0 -> (37.0, 37948.0), 24.0 -> (31.0, 37979.0), 25.0 -> (29.0, 38008.0), 26.0 -> (24.0, 38032.0) ...)
t3: org.isarnproject.sketches.java.TDigest = TDigest(1.0 -> (22240.0, 22240.0), 2.0 -> (6509.0, 28749.0), 3.0 -> (2936.0, 31685.0), 4.0 -> (1594.0, 33279.0), 5.0 -> (1096.0, 34375.0), 6.0 -> (767.0, 35142.0), 7.0 -> (523.0, 35665.0), 8.0 -> (404.0, 36069.0), 9.0 -> (358.0, 36427.0), 10.0 -> (284.0, 36711.0), 11.0 -> (201.0, 36912.0), 12.0 -> (189.0, 37101.0), 13.0 -> (162.0, 37263.0), 14.0 -> (120.0, 37383.0), 15.0 -> (98.0, 37481.0), 16.0 -> (86.0, 37567.0), 17.0 -> (68.0, 37635.0), 18.0 -> (69.0, 37704.0), 19.0 -> (63.0, 37767.0), 20.0 -> (50.0, 37817.0), 21.0 -> (50.0, 37867.0), 22.0 -> (44.0, 37911.0), 23.0 -> (37.0, 37948.0), 24.0 -> (31.0, 37979.0), 25.0 -> (29.0, 38008.0), 26.0 -> (24.0, 38032.0) ...)
It turns out t1, t2, and t3 (all of the org.isarnproject.sketches.java.TDigest class) look exactly the same value-wise. However, Boolean checks such as t1.equals(t2) all return false, which suggests these are somehow distinct TDigest instances.
Do you see anything off in this usage? Did we use the TDigest correctly and, more importantly, is our understanding of the algorithm correct?
Thank you; we look forward to your response.
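One plausible reason the parameters appear to do nothing (consistent with the maintainer's reply at the top of this thread, and assuming my reading of maxDiscrete is right) is that while the number of distinct values in a partition stays at or below maxDiscrete, the sketch can keep exact (value, count) pairs and never consults compression at all. A toy Python model of that behavior (this is my own simplification, not isarn's implementation):

```python
class ToyDigest:
    """Toy model: exact discrete histogram until the number of distinct
    values exceeds max_discrete; compression is irrelevant until then."""

    def __init__(self, compression, max_discrete):
        self.compression = compression
        self.max_discrete = max_discrete
        self.counts = {}       # exact histogram while in discrete mode
        self.discrete = True

    def update(self, x):
        if self.discrete:
            self.counts[x] = self.counts.get(x, 0) + 1
            if len(self.counts) > self.max_discrete:
                # A real t-digest would switch to continuous clustering
                # here, and only then would compression start to matter.
                self.discrete = False
        # (continuous clustering path omitted in this toy model)

# Two digests with wildly different compression see identical data:
a = ToyDigest(compression=100, max_discrete=100)
b = ToyDigest(compression=0.01, max_discrete=100)
for x in [1, 2, 2, 3, 3, 3]:
    a.update(x)
    b.update(x)
```

Under this model, `a.counts == b.counts` despite the 10,000x difference in compression, while `a` and `b` remain distinct objects, mirroring both observations in the question.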