Thread:
Shasi Thati, Mar 31, 2009 8:15 pm
David Schwartz, Mar 31, 2009 10:06 pm
Geoff Thorpe, Mar 31, 2009 10:07 pm
Subject: Re: Openssl Engine Performance Benchmarks
From: Geoff Thorpe (geo@geoffthorpe.net)
Date: Mar 31, 2009 10:07:33 pm
List: org.openssl.openssl-users

On Tuesday 31 March 2009 23:16:10 Shasi Thati wrote:

> Hi,
>
> I have a question regarding the openssl speed command. When I use this command to test the crypto offload engine performance what is the right command to use?
>
> Is it
>
> openssl speed -evp aes-128-cbc -engine xxxxxx -elapsed
>
> or
>
> openssl speed -evp aes-128-cbc -engine xxxxxx
>
> I have seen examples with both of them on the internet and I get different results with each of them. What exactly does "elapsed" option add here?

It means "elapsed". :-) I.e. how much wall-clock time elapsed during the benchmark. The default measurement is CPU time, which is less than or equal to the elapsed time. If the benchmark used half the available CPU cycles during the elapsed period (according to scheduler stats, accurate or otherwise), the time given would be half the elapsed time.
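To make the distinction concrete, here is a minimal Python sketch (not anything openssl itself does) that times the same call both ways. The `busy` function and the `time.sleep` call are hypothetical stand-ins for the software side of a crypto operation and for waiting on an accelerator. Note that `time.process_time` counts user plus system CPU time, so it is only an analogy for what `openssl speed` reports by default.

```python
import time


def measure(work, wait_seconds):
    """Run work(), then sleep to mimic waiting on an accelerator.

    Returns (elapsed_seconds, cpu_seconds) for the whole call.
    """
    wall_start = time.perf_counter()   # wall-clock timer
    cpu_start = time.process_time()    # CPU time used by this process
    work()
    time.sleep(wait_seconds)           # sleeping consumes almost no CPU time
    cpu = time.process_time() - cpu_start
    elapsed = time.perf_counter() - wall_start
    return elapsed, cpu


def busy():
    # Hypothetical stand-in for the CPU-bound part of a crypto operation.
    total = 0
    for i in range(200_000):
        total += i * i
    return total


elapsed, cpu = measure(busy, wait_seconds=0.2)
print(f"elapsed: {elapsed:.3f}s  cpu: {cpu:.3f}s")
```

The elapsed figure includes the sleep; the CPU figure mostly does not, which is the same gap you see between the two forms of the speed command when an engine spends time waiting on hardware.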

The usefulness of CPU time (instead of "-elapsed") is that it eliminates (a) skewed statistics due to the system running other tasks while the benchmark was in progress (i.e. you're only billed for what you use), and (b) time the s/w (and driver) spent waiting for the crypto accelerator to respond to crypto operations.

The value of (b) is that it lets you extrapolate certain theoretical limits. I.e. if 80% of the time is spent waiting on the accelerator, the CPU time for the benchmark run would be 1/5 of the elapsed time, and so the calculated number of crypto ops per second would be 5 times what actually happened in real/elapsed time. If the latency of the accelerator is roughly constant but it can process multiple things at once due to having multiple execution units, then this inflated number is a useful "estimate" of how much you could theoretically process if you had multiple threads/processes keeping the CPU busy rather than waiting. In this example you'd need at least 5 threads to achieve such a performance level. (This also assumes the accelerator performance would continue to scale up that far.)
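The arithmetic in the 80% example can be sketched in a few lines of Python. The function name, its parameters, and the sample figures (10 seconds, 1000 ops) are made up for illustration; only the 80% wait fraction comes from the example above.

```python
import math


def cpu_time_projection(elapsed_seconds, ops, wait_fraction):
    """Compare the ops/sec rate computed from elapsed time with the one
    computed from CPU time, given the fraction of time spent waiting.

    Returns (actual_rate, reported_rate, inflation, threads_needed).
    """
    cpu_seconds = elapsed_seconds * (1.0 - wait_fraction)
    actual_rate = ops / elapsed_seconds   # what really happened
    reported_rate = ops / cpu_seconds     # what a CPU-time benchmark shows
    inflation = reported_rate / actual_rate
    # Round before taking the ceiling so floating-point noise
    # (e.g. 5.000000000000001) does not bump the count up by one.
    threads_needed = math.ceil(round(inflation, 6))
    return actual_rate, reported_rate, inflation, threads_needed


# Hypothetical run: 1000 ops in 10 s of elapsed time, 80% of it spent waiting.
actual, reported, inflation, threads = cpu_time_projection(10.0, 1000, 0.8)
print(f"actual: {actual:.0f} ops/s, reported: {reported:.0f} ops/s")
print(f"inflation: {inflation:.1f}x, threads needed: {threads}")
```

The thread count is only a lower bound under the stated assumptions: roughly constant accelerator latency, and an accelerator that keeps scaling as more requests are in flight.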

Cheers, Geoff