Subject: Re: [RFT][patch] Scheduling for HTT and not only
From: Florian Smeets (fl@FreeBSD.org)
Date: Feb 6, 2012 11:07:37 am
List: org.freebsd.freebsd-hackers

On 06.02.12 08:59, David Xu wrote:

On 2012/2/6 15:44, Alexander Motin wrote:

On 06.02.2012 09:40, David Xu wrote:

On 2012/2/6 15:04, Alexander Motin wrote:

Hi.

I've analyzed scheduler behavior and think I've found the problem with HTT. SCHED_ULE knows about HTT, and when it does load balancing once a second, it does the right thing. Unfortunately, if some other thread gets in the way, a process can easily be pushed out to another CPU, where it will stay for another second because of CPU affinity, possibly sharing a physical core with something else unnecessarily.

I've made a patch, reworking SCHED_ULE affinity code, to fix that: http://people.freebsd.org/~mav/sched.htt.patch

This patch does three things:

- Disables the strict affinity optimization when HTT is detected, to let more sophisticated code take into account the load of the other logical core(s).

Yes, the HTT level should be skipped first, looking in the upper layer for a more idle physical core. At least, if the system is a dual-core, 4-thread CPU and there are two busy threads, they should run on different physical cores.

- Adds affinity support to the sched_lowest() function to prefer the specified (last used) CPU (and the CPU groups it belongs to) in case of equal load. The previous code always selected the first valid CPU among equals. That caused threads to migrate to lower-numbered CPUs unnecessarily.

Some level of imbalance can even be tolerated until it exceeds a threshold; that at least does not trash another CPU's cache, whereas pushing a new thread to another CPU trashes its cache. The CPUs and groups can be arranged in a circular list, so that the search for the lowest-load CPU always starts from the right neighbor and runs to the tail, then wraps around from the head to the left neighbor.

- If the current CPU group has no CPU where the process with its priority can run now, sequentially check parent CPU groups before doing a global search. That should improve affinity for the next cache levels.
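The tie-breaking rule from the second item can be illustrated with a small user-space sketch. This is not the patch and not kernel code: the `Topology` layout, the function names, and the integer loads are all assumptions made for illustration, and the real sched_lowest() works on the kernel's CPU group tree rather than on Python sets.

```python
from typing import List, Sequence, Set

# Hypothetical topology description: one entry per level, innermost first.
# Each level is a list of CPU sets (e.g. HTT siblings, then the whole package).
Topology = List[List[Set[int]]]


def shared_level(a: int, b: int, topo: Topology) -> int:
    """Return the index of the innermost level where a and b share a group."""
    for level, groups in enumerate(topo):
        if any(a in g and b in g for g in groups):
            return level
    return len(topo)  # no common group at any level


def pick_first_lowest(loads: Sequence[int]) -> int:
    """Behaviour described for the old code: first CPU among equally loaded ones."""
    return min(range(len(loads)), key=lambda cpu: loads[cpu])


def pick_affine_lowest(loads: Sequence[int], topo: Topology, last_cpu: int) -> int:
    """Lowest load wins; ties go to last_cpu, then to the groups closest to it."""
    def key(cpu: int):
        distance = 0 if cpu == last_cpu else 1 + shared_level(cpu, last_cpu, topo)
        return (loads[cpu], distance, cpu)
    return min(range(len(loads)), key=key)


if __name__ == "__main__":
    # Assumed layout: 2 physical cores with 2 logical CPUs each.
    topo = [[{0, 1}, {2, 3}], [{0, 1, 2, 3}]]
    loads = [1, 1, 1, 1]
    print(pick_first_lowest(loads), pick_affine_lowest(loads, topo, last_cpu=2))
```

With four equally loaded CPUs and a thread that last ran on CPU 2, the first function picks CPU 0 (the unneeded migration to a lower CPU), while the second keeps the thread on CPU 2. A strictly lower load elsewhere still wins over affinity in this sketch.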

I've run several different benchmarks to test it, and so far the results look promising:

- On an Atom D525 (2 physical cores + HTT) I've tested HTTP receive with fetch and FTP transmit with ftpd. On receive I've got 103MB/s on the interface; on transmit somewhat less, about 85MB/s. In both cases the scheduler kept the interrupt thread and the application on different physical cores. Without the patch, speed fluctuates between about 80 and 103MB/s on receive and is about 85MB/s on transmit.

- On the same Atom I've tested TCP speed with iperf and got mostly the same results:
  - receive to the Atom: 755-765Mbit/s with the patch, 531-765Mbit/s without;
  - transmit from the Atom: 679Mbit/s in both cases.

  I think the fluctuating receive behavior in both tests can be explained by some heavy callout handled by the swi4:clock process, called on receive (seen in top and schedgraph) but not on transmit. Maybe it is specific to the Realtek NIC driver.

- On the same Atom I've tested the number of 512-byte reads from an SSD with dd in 1 and 32 streams. I found no regressions, but no benefits either: with one stream there is no congestion, and with multiple streams all cores are congested.

- On a Core i7-2600K (4 physical cores + HTT) I've run more than 20 `make buildworld`s with different -j values (1,2,4,6,8,12,16) for both the original and the patched kernel. I've found no performance regressions, while for -j4 I've got a 10% improvement:

# ministat -w 65 res4A res4B
x res4A
+ res4B
[ministat ASCII distribution plot; column alignment lost in the archive]
    N           Min           Max        Median           Avg        Stddev
x   3       1554.86       1617.43       1571.62     1581.3033     32.389449
+   3       1420.69        1423.1       1421.36     1421.7167     1.2439587
Difference at 95.0% confidence
        -159.587 ± 51.9496
        -10.0921% ± 3.28524%
        (Student's t, pooled s = 22.9197)

and for -j6 a 3.6% improvement:

# ministat -w 65 res6A res6B
x res6A
+ res6B
[ministat ASCII distribution plot; column alignment lost in the archive]
    N           Min           Max        Median           Avg        Stddev
x   3       1381.17       1402.94        1400.3     1394.8033     11.880372
+   3        1340.4       1349.34       1341.23     1343.6567     4.9393758
Difference at 95.0% confidence
        -51.1467 ± 20.6211
        -3.66694% ± 1.47842%
        (Student's t, pooled s = 9.09782)
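The "Difference at 95.0% confidence" lines for -j4 can be re-derived from the N, Avg and Stddev columns alone. The sketch below assumes the standard pooled-variance two-sample Student's t interval; the function name is made up for illustration, and the critical value 2.776 is the usual table value for 4 degrees of freedom, hard-coded here rather than computed.

```python
import math

# Two-sided 95% Student's t critical value for 4 degrees of freedom
# (table value; an assumption about what the tool uses, not computed here).
T_CRIT_95_DF4 = 2.776


def pooled_difference(n1, avg1, sd1, n2, avg2, sd2, t_crit):
    """Return (diff, half_width, pooled_sd, diff_pct, half_width_pct).

    diff is avg2 - avg1; percentages are relative to avg1 (the baseline).
    """
    df = n1 + n2 - 2
    # Pooled standard deviation of the two samples.
    pooled_sd = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df)
    diff = avg2 - avg1
    # Half-width of the confidence interval for the difference of means.
    half_width = t_crit * pooled_sd * math.sqrt(1.0 / n1 + 1.0 / n2)
    return diff, half_width, pooled_sd, 100.0 * diff / avg1, 100.0 * half_width / avg1


if __name__ == "__main__":
    # Summary statistics for the -j4 run: res4A (original), res4B (patched).
    diff, half, pooled, pct, half_pct = pooled_difference(
        3, 1581.3033, 32.389449, 3, 1421.7167, 1.2439587, T_CRIT_95_DF4
    )
    print(f"{diff:.3f} +/- {half:.4f}")
    print(f"{pct:.4f}% +/- {half_pct:.5f}%")
    print(f"pooled s = {pooled:.4f}")
```

Because the interval around the difference does not include zero, the -j4 improvement counts as proven at the 95% level; the computed values agree with the quoted ones up to rounding of the inputs.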

Who wants to do independent testing to verify my results or do some more interesting benchmarks? :)

PS: Sponsored by iXsystems, Inc.

The benchmark is incomplete; a complete benchmark should at least include CPU-intensive applications. Testing with real-world databases, web servers, and other important applications is needed.

I plan to do this, but you may help. ;)

Thanks, I need to find time. I have cc'ed hackers@; it seems I forgot to include it in my first mail. I think designing an SMP scheduler is dirty work: lots of testing and refining, and you may still get an imperfect result. ;-)

Here are my tests for PostgreSQL (I still use r229659, as the baseline was taken with that revision). This is on a 2x4 core box, no HTT. Max throughput is at 10 threads, so that is what I used for ministat.

x 229659
+ 229659+mav-ule
[ministat ASCII distribution plot; column alignment lost in the archive]
    N           Min           Max        Median           Avg        Stddev
x  10     49647.932     50376.405     50194.668     50093.065     240.47236
+  10     49482.234     50359.181     50159.422     49936.298     341.25592
No difference proven at 95.0% confidence

All the numbers are here https://docs.google.com/spreadsheet/ccc?key=0Ai0N1xDe3uNAdDRxcVFiYjNMSnJWOTZhUWVWWlBlemc&hl=en_US#gid=4

I'll update the pbzip2 tab in the document later today.

Florian