|Alexander Motin||Feb 5, 2012 11:04 pm|
|David Xu||Feb 5, 2012 11:59 pm|
|Gary Jennejohn||Feb 6, 2012 2:08 am|
|Alexander Best||Feb 6, 2012 8:01 am|
|Alexander Motin||Feb 6, 2012 8:28 am|
|Tijl Coosemans||Feb 6, 2012 9:37 am|
|Alexander Motin||Feb 6, 2012 9:54 am|
|Florian Smeets||Feb 6, 2012 11:07 am|
|Alexander Best||Feb 6, 2012 11:10 am|
|Alexander Motin||Feb 6, 2012 11:18 am|
|Julian Elischer||Feb 6, 2012 10:10 pm|
|Ivan Voras||Feb 8, 2012 3:06 am|
|Andriy Gapon||Feb 11, 2012 5:34 am|
|Alexander Motin||Feb 11, 2012 6:21 am|
|Konstantin Belousov||Feb 11, 2012 7:35 am|
|Andriy Gapon||Feb 11, 2012 9:04 am|
|Alexander Motin||Feb 13, 2012 11:56 am|
|Jeff Roberson||Feb 13, 2012 12:23 pm|
|Alexander Motin||Feb 13, 2012 12:54 pm|
|Jeff Roberson||Feb 13, 2012 1:39 pm|
|Alexander Motin||Feb 13, 2012 2:38 pm|
|Alexander Motin||Feb 15, 2012 11:46 am|
|Jeff Roberson||Feb 15, 2012 11:54 am|
|Alexander Motin||Feb 15, 2012 12:06 pm|
|Alexander Motin||Feb 15, 2012 8:41 pm|
|Alexander Motin||Feb 16, 2012 12:48 am|
|Alexander Motin||Feb 16, 2012 2:58 am|
|Florian Smeets||Feb 16, 2012 1:28 pm|
|Alexander Motin||Feb 17, 2012 8:29 am|
|Arnaud Lacombe||Feb 17, 2012 8:52 am|
|Alexander Motin||Feb 17, 2012 9:02 am|
|George Mitchell||Feb 26, 2012 4:32 pm|
|George Mitchell||Feb 26, 2012 4:37 pm|
|Olivier Smedts||Feb 27, 2012 2:34 am|
|George Mitchell||Feb 27, 2012 3:23 am|
|Olivier Smedts||Feb 27, 2012 3:27 am|
|Andriy Gapon||Feb 27, 2012 4:41 am|
|George Mitchell||Feb 27, 2012 3:54 pm|
|Adrian Chadd||Mar 2, 2012 3:05 pm|
|George Mitchell||Mar 2, 2012 4:14 pm|
|Adrian Chadd||Mar 2, 2012 7:24 pm|
|Alexander Motin||Mar 2, 2012 11:40 pm|
|Ivan Klymenko||Mar 3, 2012 12:18 am|
|Adrian Chadd||Mar 3, 2012 12:59 am|
|Alexander Motin||Mar 3, 2012 1:12 am|
|Alexander Motin||Mar 3, 2012 4:53 am|
|Ivan Klymenko||Mar 3, 2012 7:25 am|
|Alexander Motin||Mar 3, 2012 8:30 am|
|Mario Lobo||Mar 3, 2012 8:56 am|
|Alexander Motin||Mar 3, 2012 9:56 am|
|Ivan Klymenko||Mar 3, 2012 11:15 am|
|Arnaud Lacombe||Apr 5, 2012 11:11 am|
|Alexander Motin||Apr 5, 2012 11:45 am|
|Attilio Rao||Apr 6, 2012 7:12 am|
|Alexander Motin||Apr 6, 2012 7:26 am|
|Attilio Rao||Apr 6, 2012 7:30 am|
|Alexander Motin||Apr 6, 2012 7:40 am|
|Alexander Motin||Apr 9, 2012 12:57 pm|
|Arnaud Lacombe||Apr 10, 2012 9:57 am|
|Alexander Motin||Apr 10, 2012 10:18 am|
|Alexander Motin||Apr 10, 2012 10:53 am|
|Arnaud Lacombe||Apr 10, 2012 11:45 am|
|Alexander Motin||Apr 10, 2012 12:13 pm|
|Mike Meyer||Apr 10, 2012 1:04 pm|
|Arnaud Lacombe||Apr 10, 2012 1:50 pm|
|Mike Meyer||Apr 10, 2012 2:19 pm|
|Adrian Chadd||Apr 11, 2012 3:19 pm|
|Subject:||Re: [RFT][patch] Scheduling for HTT and not only|
|From:||Alexander Motin (ma...@FreeBSD.org)|
|Date:||Feb 11, 2012 6:21:02 am|
On 02/11/12 15:35, Andriy Gapon wrote:
on 06/02/2012 09:04 Alexander Motin said the following:
I've analyzed the scheduler behavior and think I've found the problem with HTT. SCHED_ULE knows about HTT, and when doing load balancing once a second it does the right things. Unluckily, if some other thread gets in the way, a process can easily be pushed out to another CPU, where it will stay for another second because of CPU affinity, possibly sharing a physical core with something else without need.
I've made a patch, reworking SCHED_ULE affinity code, to fix that: http://people.freebsd.org/~mav/sched.htt.patch
This patch does three things:
- Disables the strict affinity optimization when HTT is detected, to let the more sophisticated code take the load of the other logical core(s) into account.
- Adds affinity support to the sched_lowest() function to prefer the specified (last used) CPU (and the CPU groups it belongs to) in case of equal load. The previous code always selected the first valid CPU among equals, which caused threads to migrate to lower-numbered CPUs without need.
- If the current CPU group has no CPU where the process with its priority can run now, sequentially check parent CPU groups before doing a global search. That should improve affinity for the next cache levels.
I know that you are working on improving this patch and we have already discussed some ideas via out-of-band channels.
I've heavily rewritten the patch already. So at least some of the ideas are already addressed. :) At this moment I am mostly satisfied with results and after final tests today I'll probably publish new version.
Here's some additional ideas. They are in part inspired by inspecting OpenSolaris code.
Let's assume that one of the goals of a scheduler is to maximize system
performance / computational throughput[*]. I think that modern SMP-aware
schedulers try to employ the following two SMP-specific techniques to achieve
that:
- take advantage of thread-to-cache affinity to minimize "cold cache" time
- distribute the threads over logical CPUs to optimize system resource usage by minimizing[**] sharing of / contention over the resources, which could be caches, instruction pipelines (for HTT threads), FPUs (for AMD Bulldozer "cores"), etc.
1. Affinity. It seems that on modern CPUs the caches are either inclusive or some smart "as if inclusive" caches. As a result, if two cores have a shared cache at any level, then it should be relatively cheap to move a thread from one core to the other. E.g. if logical CPUs P0 and P1 have private L1 and L2 caches and a shared L3 cache, then on modern processors it should be much cheaper to move a thread from P0 to P1 than to some processor P2 that doesn't share the L3 cache.
Absolutely true! In smack-mysql indexed-select benchmarks I've found that on an Atom CPU with two cores and no L3 it is cheaper to move two mysql threads onto one physical core (sharing the L2 cache), suffering from SMT, than to bounce data between the cores. At the same time, on a Core i7 with a shared L3 and also SMT, the results are exactly the opposite.
If this assumption is really true, then we can track only an affinity of a thread with relation to a top level shared cache. E.g. if migration within an L3 cache is cheap, then we don't have any reason to constrain a migration scope to an L2 cache, let alone L1.
In the present patch version I've implemented two different thresholds: one for the last-level cache and one for the rest. That's why I am waiting for your patch to properly detect cache topologies. :)
2. Balancing. I think that the current balancing code is pretty good, but it can be augmented with the following: A. In the longer term, the SMP topology should include other important shared resources, not only caches. We already have this in some form via CG_FLAG_THREAD, which implies instruction pipeline sharing.
At this moment I am using different penalty coefficients for SMT and shared caches (for unrelated processes such sharing is not good). No problem to add more types there. A separate flag for a shared FPU could be used to apply different penalty coefficients to usual threads and FPU-less kernel threads.
B. Given the affinity assumptions, sched_pickcpu can pick the best CPU only among CPUs sharing a top level cache if a thread still has an affinity to it or among all CPUs otherwise. This should reduce temporary imbalances.
I've done it in a more complicated way. I apply cache affinity with weight 2 to all paths with currently running threads of the same process, and with weight 1 to the previous path where the thread was running. I believe that constant cache thrashing between two running threads is much worse than a single jump from one CPU to another on some context switches. Though it could be made configurable.
C. I think that we should eliminate the bias in the sched_lowest() family of functions. I like how your patch started addressing this. For the cases where the hint (cg_prefer) cannot be reasonably picked, it should be a pseudo-random value. OpenSolaris does it the following way: http://fxr.watson.org/fxr/ident?v=OPENSOLARIS;im=10;i=CPU_PSEUDO_RANDOM
With the new math, cases of equal load estimation are more rare, but I've also added a similar mechanism. I've just made it not exactly random, but dependent on the calling thread ID, to avoid extra jumping.
Footnotes: [*] Goals of a scheduler could be controlled via policies. E.g. there could be a policy to reduce power usage.
[**] Given a possibility of different policies a scheduler may want to concentrate threads. E.g. if a system has two packages with two cores each and there are two CPU-hungry threads, then the system may place them both on the same package to reduce power usage.
Good idea, but if one CPU is burning that much, the difference between C1 and C6 on the idle cores/packages will be minimal for the system's total. Another question is concentrating periodically run interactive threads to let the other cores/packages not wake up at all and go into the deepest sleep state.
Another interesting case is threads that share a VM space or otherwise share some non-trivial amount of memory. As you have suggested, it might make sense to concentrate those threads so that they share a cache.
I am doing it now for threaded processes. It works in the described smack-mysql-on-Atom case if I hint the scheduler to be more aggressive with affinity and less with SMT penalties. Unluckily, I have no real multi-socket system to test that properly. Also, I have no idea what to do with processes using shared memory but not using threads (IIRC, like PostgreSQL). If only there were some correlation key...
-- Alexander Motin
_______________________________________________ free...@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to "free...@freebsd.org"