Tcl runtime performance improvements #21
BTW, apropos "TBD" and "We need help with the benchmark"... How such a benchmark can be made you can see, for example, in sebres/tcl#2 (see the script there). Additionally I can imagine that many performance test cases could be extracted (also half-automatically) from Tcl's own test cases (the test folder). |
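As an illustration of that last idea, here is a minimal sketch (the helper name and the benchmarked body are hypothetical, not taken from the thread) of how the body of an existing tcltest case could be re-run under timerate:

```tcl
# Minimal sketch: re-run the body of an existing tcltest case under timerate
# to turn it into a micro-benchmark. timerate ships with Tcl 8.7; the builds
# used elsewhere in this thread provide it as well.
proc bench {name setup body} {
    uplevel #0 $setup                    ;# prepare any variables the body needs
    timerate -calibrate {}               ;# measure and subtract the empty-body overhead
    puts "$name: [uplevel #0 [list timerate $body 1000]]"
}

# Body lifted (hypothetically) from a string.test case:
bench string-first {
    set s [string repeat xAx 1000]
} {
    string first A $s 2000
}
```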
I found my old ticket, which can explain how easily a large performance improvement can be made...
|
The above-mentioned results were from an i5-6500. For an i5-2400 it looks even worse.
Because the two systems differ almost only in memory bandwidth, the ensemble overhead also seems to be memory-related (very strange in this constellation). |
Apropos "issue" from above (ensemble overhead) in tcl8.5 and my SE-fork - below you'll find results, compared with current trunk on i5-6500, 3.2GHz. tcl8.5: % info patchlevel
- 8.7a0
+ 8.5.19
% timerate -calibrate {}
- 0.04537374883248807 µs/#-overhead, 0.045506 µs/#, 42895062 #, 21974929 #/sec
+ 0.030185025513108005 µs/#-overhead 0.030270 µs/# 64487338 # 33036546 #/sec
%
% timerate {string first A $s 2000}
- 0.321560 µs/#, 2725287 #, 3109837 #/sec
+ 0.256630 µs/# 3192446 # 3896667 #/sec 819.276 nett-ms
% timerate {::tcl::string::first A $s 2000}
- 0.191541 µs/#, 4220926 #, 5220810 #/sec
+ 0.133809 µs/# 5251583 # 7473339 #/sec 702.709 nett-ms tclSE: % info patchlevel
- 8.7a0
+ 9.0-SE.18
% timerate -calibrate {}
- 0.04537374883248807 µs/#-overhead, 0.045506 µs/#, 42895062 #, 21974929 #/sec
+ 0.030022483025454204 µs/#-overhead 0.030022 µs/# 65017940 # 33308370 #/sec
%
% timerate {string first A $s 2000}
- 0.321560 µs/#, 2725287 #, 3109837 #/sec
+ 0.122230 µs/# 6568019 # 8181266 #/sec 802.812 nett-ms
% timerate {::tcl::string::first A $s 2000}
- 0.191541 µs/#, 4220926 #, 5220810 #/sec
+ 0.117670 µs/# 6770841 # 8498362 #/sec 796.723 nett-ms |
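For reference, a hedged reconstruction of the setup behind these numbers (the contents of `$s` are not shown in the thread; any sufficiently long string works for the given start index):

```tcl
# Assumed setup: $s is just some long string with a match past index 2000.
set s [string repeat x 4000]
append s A

timerate -calibrate {}
puts [timerate {string first A $s 2000}]           ;# through the [string] ensemble
puts [timerate {::tcl::string::first A $s 2000}]   ;# direct call, no ensemble dispatch
```

The gap between the two timings is essentially the ensemble-dispatch overhead being discussed.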
I've found a difference between my tclSE and tcl-core regarding the overhead of ensembles. Because clock is also an ensemble command, it is thereby additionally faster now (up to 2x). Below you'll find new measurement results for the clock command. Of course this concerns all ensemble commands in Tcl. I'll wait for feedback... because so far there has been no reaction at all here (neither have my questions been answered, nor has anything been said about performance test scripts, etc.). |
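A minimal sketch of how such a clock measurement could be made (the concrete sub-commands and formats are assumptions; the original results are not reproduced in this excerpt):

```tcl
# [clock] is an ensemble, so its dispatch cost is affected as well.
set now [clock seconds]
timerate -calibrate {}
puts [timerate {clock format $now -format "%Y-%m-%d %H:%M:%S"}]
puts [timerate {clock scan "2017-01-01 00:00:00" -format "%Y-%m-%d %H:%M:%S"}]
```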
Closed as not interested |
Sorry for not getting back to you on this; I've just started taking over from Karl as the front guy on the bounties. Let me start with these questions:
Benchmark program to be determined. So it doesn't exist at this point.
We're kind of expecting a 10x speedup to require JIT native code generation. Both speedups would be expected to be more apparent in heavily algorithmic and numeric code, rather than code that spends most of its time calling out to libraries and extensions. |
Hi Peter, well anyhow, I'm back now and will try to rebase all of this as a "sebres-performance-branch" onto the tcl-core fossil repository. BTW, have you had a chance in the meantime to take a look at my clock solution? |
Gah, forgotten! Recently I took a closer look at the ticket "Exploding runtime with growing list of after idle events" and found a very large difference to my fork branch. And it is not only the mentioned issue, but a fundamental problem with timer/idle events. If you also develop event-driven code, this may be very helpful and can provide a very large performance increase (up to several hundred thousand times, depending on the size of the event queue). I've created several performance test cases, whose results you can find below. Diff: event-perf-test.diff.txt. If also interested, I can rebase it into my new performance branch in fossil together with the ensemble improvement. The overhead costs in the original tcl-core do not grow linearly (sometimes they grow exponentially); please find below a small excerpt of the total summary for 10000 and 60000 events in the queue ("-" is original tcl-core, "+" is my branch):

10000 events in queue:

    -Total 8 cases in 0.00 sec. (3.84 nett-sec.):
    +Total 8 cases in 0.00 sec. (0.02 nett-sec.):
    -3840255.000000 µs/# 8 # 2.083 #/sec 3840.255 nett-ms
    +23607.000000 µs/# 8 # 338.883 #/sec 23.607 nett-ms
    Average:
    -480031.875000 µs/# 1 # 2 #/sec 480.032 nett-ms
    +2950.875000 µs/# 1 # 339 #/sec 2.951 nett-ms
    Max:
    -1489702 µs/# 1 # 0.671 #/sec 1489.702 nett-ms
    +4178.00 µs/# 1 # 239.35 #/sec 4.178 nett-ms

60000 events in queue:

    -Total 8 cases in 0.00 sec. (352.27 nett-sec.):
    +Total 8 cases in 0.00 sec. (0.16 nett-sec.):
    -352267646.000000 µs/# 8 # 0.023 #/sec 352267.646 nett-ms
    +157668.000000 µs/# 8 # 50.740 #/sec 157.668 nett-ms
    Average:
    -44033455.750000 µs/# 1 # 0 #/sec 44033.456 nett-ms
    +19708.500000 µs/# 1 # 51 #/sec 19.709 nett-ms
    Max:
    -169016982 µs/# 1 # 0.006 #/sec 169016.982 nett-ms
    +27429.0 µs/# 1 # 36.458 #/sec 27.429 nett-ms
|
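A minimal sketch of the kind of measurement involved (assumed methodology; the author's actual tests are in the referenced event-perf-test.diff.txt):

```tcl
# Fill the event queue with N idle events, then measure how long it takes to
# set them all up and to drain them again; repeat for growing N to see whether
# the cost grows linearly with the queue size.
proc fill-idle {n} {
    for {set i 0} {$i < $n} {incr i} {
        after idle [list incr ::done]
    }
}
proc drain {n} {
    set ::done 0
    while {$::done < $n} { update idletasks }
}
foreach n {10000 60000} {
    puts "queue $n: fill [time [list fill-idle $n]] / drain [time [list drain $n]]"
}
```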
This looks promising. We also do a lot of event-driven code, and I've had to play games moving events from one queue to another to avoid starvation but haven't dug deeper because the behavior I saw was easily explained simply by the lower priority of timer events. A 2x speedup for ensemble commands by itself isn't likely to get a 2x speedup overall, since only a fairly small number of commands are ensembles. [string subcommand] and [array subcommand] seem the most likely candidates to have an impact, and they're only a few percent (a quick check of a sample of my code found them in about 5% of non-comment non-empty lines). Combining all these together might get there. |
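For reference, one way such a quick check could be scripted (an assumed methodology, not necessarily the one actually used above):

```tcl
# Count the fraction of non-comment, non-empty lines in *.tcl files that
# invoke [string ...] or [array ...]. Assumes the current directory holds
# the sample of sources and contains at least one such line.
set total 0; set hits 0
foreach f [glob -nocomplain *.tcl] {
    set fd [open $f]
    while {[gets $fd line] >= 0} {
        set line [string trim $line]
        if {$line eq "" || [string index $line 0] eq "#"} continue
        incr total
        if {[regexp {\m(?:string|array)\M} $line]} { incr hits }
    }
    close $fd
}
puts [format "%.1f%% (%d of %d lines)" [expr {100.0 * $hits / $total}] $hits $total]
```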
By the way: I've found another screw that could be turned to improve the whole Tcl performance (byte-code execution) by up to 2x and higher (I had even measured 6x - 10x on some complex cases, but I'm not sure all of it is portable to the core without extremely large effort). Just as reference values:

    % info patchlevel
    -8.6.7
    +8.6-SE.10b3
    % timerate -calibrate {}
    -0.06132125572047973 µs/#-overhead 0.061321 µs/# 31832339 # 16307559 #/sec
    +0.03671978085047272 µs/#-overhead 0.036918 µs/# 52874530 # 27087375 #/sec
    % timerate {set a 1}
    -0.067728 µs/# 7749007 # 14765019 #/sec 524.822 nett-ms
    +0.010424 µs/# 13938133 # 95928567 #/sec 145.297 nett-ms
    % timerate {set a}
    -0.033416 µs/# 10555488 # 29925545 #/sec 352.725 nett-ms
    +0.005117 µs/# 15051566 # 195424123 #/sec 77.020 nett-ms

Please note that already the execution of compiled empty code is about two times faster (see the calibration overhead, ~0.0367 µs vs. 0.061 µs). On top of my latest above-mentioned event-performance branch (and fix-ensemble-overhead), as a combination of all these, it should exceed a 2x speed increase on almost all tasks and has very considerable potential to reach the 10x bound. The problems I currently have, in addition to the general complexity of back-porting this "screw" to the standard tcl-core:
Thus I'm really not sure I want to do this step and open up yet another construction site without prospects of sustained success. |
Thus closing, sorry for the noise. |
Because this is about performance... just as a reference, BTW: sebres/tcl#5. I tried it with one "old" (but big) application of mine, which uses regexp in the wild (directly and indirectly), sometimes very intensively (and bound into sqlite as functions): after switching to pcre in init.tcl using a one-line mod, rewriting some loops that search via regexp for multiple matches from the Tcl NFA to a DFA (with exactly one match-search over multiple alternatives and a loop over the results of that single match, instead of multiple incremental searches in the loop) increases the performance of this functionality by up to 1000x! Regards, Sergey. |
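A small illustrative sketch of that loop restructuring in plain Tcl (names and patterns are made up, and the pcre/DFA binding itself is not shown; the point is one combined match instead of many incremental searches):

```tcl
# Before: one incremental search per keyword, restarted after every hit.
proc count-hits-slow {text keywords} {
    set n 0
    foreach kw $keywords {
        set idx 0
        while {[regexp -start $idx -indices "\\m$kw\\M" $text m]} {
            incr n
            set idx [expr {[lindex $m 1] + 1}]
        }
    }
    return $n
}

# After: all alternatives combined into a single pattern, one pass over the text.
proc count-hits-fast {text keywords} {
    set re "\\m(?:[join $keywords |])\\M"
    llength [regexp -all -inline -- $re $text]
}
```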
I can try to back-port several functionalities of my tclSE to the current trunk/tcl8.6 after I get the tcl-clock speed-up production-ready (or accepted).
So I think that https://github.com/flightaware/Tcl-bounties#tcl-runtime-performance-improvements would be reasonably feasible at least to 2x. But 10x (at least partially) is also imaginable.
For example, I have there another (faster) lexer, very fast byte code, a tclOO engine, etc.
A couple of questions in between: if you use clock scan somewhere, you already have a speed-up of up to 100% ;-) Or, if you use extremely slow regexes somewhere, improving the code base all around won't really affect the whole performance.
So which functionalities are more important to you?