098afa1e1b
OpenSSL is built with the generic Linux settings for most targets,
including aarch64. These generic settings are designed for 32-bit CPUs
and provide no assembler optimization, which is far from optimal on
aarch64. This patch simply switches to the aarch64 settings that are
already available in OpenSSL.

Here is the output of "openssl speed" before the optimization, with
"(...)" representing build flags that didn't change:

    OpenSSL 1.0.2l  25 May 2017
    options:bn(64,32) rc4(ptr,char) des(idx,cisc,2,int) aes(partial) blowfish(ptr)
    compiler: aarch64-openwrt-linux-musl-gcc (...)

And after this patch, OpenSSL uses 64-bit mode and assembler
optimizations:

    OpenSSL 1.0.2l  25 May 2017
    options:bn(64,64) rc4(ptr,char) des(idx,cisc,2,int) aes(partial) blowfish(ptr)
    compiler: aarch64-openwrt-linux-musl-gcc (...) -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM

Here are some benchmarks on a pine64+ running latest LEDE master
r5142-20d363aed3:

    before# openssl speed sha aes blowfish
    The 'numbers' are in 1000s of bytes per second processed.
    type                 16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
    sha1                 3918.89k     9982.43k    19148.03k    24933.03k    27325.78k
    sha256               4604.51k    10240.64k    17472.51k    21355.18k    22801.07k
    sha512               3662.19k    14539.41k    21443.16k    29544.11k    33177.60k
    blowfish cbc        16266.63k    16940.86k    17176.92k    17237.33k    17252.35k
    aes-128 cbc         19712.95k    21447.40k    22091.09k    22258.35k    22304.09k
    aes-192 cbc         17680.12k    19064.47k    19572.14k    19703.13k    19737.26k
    aes-256 cbc         15986.67k    17132.48k    17537.28k    17657.17k    17689.26k

    after# openssl speed sha aes blowfish
    type                 16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
    sha1                 6770.87k    26172.80k    86878.38k   205649.58k   345978.20k
    sha256              20913.93k    74663.85k   184658.18k   290891.09k   351032.66k
    sha512               7633.10k    30110.14k    50083.24k    71883.43k    82485.25k
    blowfish cbc        16224.93k    16933.55k    17173.76k    17234.94k    17252.35k
    aes-128 cbc         19425.74k    21193.31k    22065.74k    22304.77k    22380.54k
    aes-192 cbc         17452.29k    18883.84k    19536.90k    19741.70k    19800.06k
    aes-256 cbc         15815.89k    17003.01k    17530.03k    17695.40k    17746.60k

In this (non-EVP) mode, AES and Blowfish do not benefit for some reason,
but SHA performance improves between 1.7x and 15x. SHA256 clearly
benefits the most from the optimization (4.5x on small blocks, 15x on
large blocks!).

When using EVP (with "openssl speed -evp <algo>"):

    # Before, EVP mode
    type                 16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
    sha1                 3824.46k    10049.66k    19170.56k    24947.03k    27325.78k
    sha256               3368.33k     8511.15k    16061.44k    20772.52k    22721.88k
    sha512               2845.23k    11381.57k    19467.69k    28512.26k    33008.30k
    bf-cbc              15146.74k    16623.83k    17092.01k    17211.39k    17249.62k
    aes-128-cbc         17873.03k    20870.61k    21933.65k    22216.36k    22301.35k
    aes-192-cbc         16184.18k    18607.15k    19447.13k    19670.02k    19737.26k
    aes-256-cbc         14774.06k    16757.25k    17457.58k    17639.42k    17686.53k

    # After, EVP mode
    type                 16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
    sha1                 7056.97k    27142.10k    89515.86k   209155.41k   347419.99k
    sha256               7745.70k    29750.06k    95341.48k   211001.69k   332376.75k
    sha512               4550.47k    18086.06k    39997.10k    65880.75k    81431.21k
    bf-cbc              15129.20k    16619.03k    17090.56k    17212.76k    17246.89k
    aes-128-cbc         99619.74k   269032.34k   450214.23k   567353.00k   613933.06k
    aes-192-cbc         93180.74k   231017.79k   361766.66k   433671.51k   461731.16k
    aes-256-cbc         89343.23k   209858.58k   310160.04k   362234.88k   380878.85k

Blowfish does not seem to have assembler optimization at all. SHA still
benefits (between 1.6x and 14.5x) but is generally slower than in
non-EVP mode. AES performance, however, improves between 5.5x and
27.5x, which is really impressive! For aes-128-cbc on large blocks, a
Core i7-6600U @ 2.60GHz is only twice as fast...

Signed-off-by: Baptiste Jonglez <git@bitsofnetworks.org>
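The speedup factors quoted in the commit message can be rechecked from the benchmark figures. The following is a minimal sketch, not part of the patch: it copies only the 16-byte and 8192-byte columns of the non-EVP tables for a few algorithms, and the helper name `speedups` is hypothetical.

```python
# Throughput in thousands of bytes per second, copied from the non-EVP
# "openssl speed" tables in the commit message.
# Each tuple is (16-byte blocks, 8192-byte blocks).
before = {
    "sha1": (3918.89, 27325.78),
    "sha256": (4604.51, 22801.07),
    "sha512": (3662.19, 33177.60),
    "aes-128 cbc": (19712.95, 22304.09),
}
after = {
    "sha1": (6770.87, 345978.20),
    "sha256": (20913.93, 351032.66),
    "sha512": (7633.10, 82485.25),
    "aes-128 cbc": (19425.74, 22380.54),
}


def speedups(before, after):
    """Return after/before throughput ratios per algorithm and block size."""
    return {
        algo: tuple(a / b for a, b in zip(after[algo], before[algo]))
        for algo in before
    }


ratios = speedups(before, after)
for algo, (small, large) in ratios.items():
    print(f"{algo:12s} 16 B: {small:5.2f}x   8192 B: {large:5.2f}x")
```

A ratio close to 1 means the algorithm did not benefit in that mode. The EVP tables can be checked the same way by swapping in their figures.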
16 lines
2.3 KiB
Diff
--- a/Configure
+++ b/Configure
@@ -470,6 +470,13 @@ my %table=(
 "linux-alpha-ccc","ccc:-fast -readonly_strings -DL_ENDIAN::-D_REENTRANT:::SIXTY_FOUR_BIT_LONG RC4_CHUNK DES_INT DES_PTR DES_RISC1 DES_UNROLL:${alpha_asm}",
 "linux-alpha+bwx-ccc","ccc:-fast -readonly_strings -DL_ENDIAN::-D_REENTRANT:::SIXTY_FOUR_BIT_LONG RC4_CHAR RC4_CHUNK DES_INT DES_PTR DES_RISC1 DES_UNROLL:${alpha_asm}",
 
+# OpenWrt targets
+"linux-armv4-openwrt","gcc:-DTERMIOS \$(OPENWRT_OPTIMIZATION_FLAGS) -fomit-frame-pointer -Wall::-D_REENTRANT::-ldl:BN_LLONG RC4_CHAR RC4_CHUNK DES_INT DES_UNROLL BF_PTR:${armv4_asm}:dlfcn:linux-shared:-fPIC::.so.\$(SHLIB_MAJOR).\$(SHLIB_MINOR)",
+"linux-aarch64-openwrt","gcc:-DTERMIOS \$(OPENWRT_OPTIMIZATION_FLAGS) -fomit-frame-pointer -Wall::-D_REENTRANT::-ldl:SIXTY_FOUR_BIT_LONG RC4_CHAR RC4_CHUNK DES_INT DES_UNROLL BF_PTR:${aarch64_asm}:linux64:dlfcn:linux-shared:-fPIC::.so.\$(SHLIB_MAJOR).\$(SHLIB_MINOR)",
+"linux-x86_64-openwrt", "gcc:-m64 -DL_ENDIAN -DTERMIOS \$(OPENWRT_OPTIMIZATION_FLAGS) -fomit-frame-pointer -Wall::-D_REENTRANT::-ldl:SIXTY_FOUR_BIT_LONG RC4_CHUNK DES_INT DES_UNROLL:${x86_64_asm}:elf:dlfcn:linux-shared:-fPIC:-m64:.so.\$(SHLIB_MAJOR).\$(SHLIB_MINOR):::64",
+"linux-mips-openwrt","gcc:-DTERMIOS \$(OPENWRT_OPTIMIZATION_FLAGS) -fomit-frame-pointer -Wall::-D_REENTRANT::-ldl:BN_LLONG RC4_CHAR RC4_CHUNK DES_INT DES_UNROLL BF_PTR:${mips32_asm}:o32:dlfcn:linux-shared:-fPIC::.so.\$(SHLIB_MAJOR).\$(SHLIB_MINOR)",
+"linux-generic-openwrt","gcc:-DTERMIOS \$(OPENWRT_OPTIMIZATION_FLAGS) -fomit-frame-pointer -Wall::-D_REENTRANT::-ldl:BN_LLONG RC4_CHAR RC4_CHUNK DES_INT DES_UNROLL BF_PTR:${no_asm}:dlfcn:linux-shared:-fPIC::.so.\$(SHLIB_MAJOR).\$(SHLIB_MINOR)",
+
 # Android: linux-* but without pointers to headers and libs.
 "android","gcc:-mandroid -I\$(ANDROID_DEV)/include -B\$(ANDROID_DEV)/lib -O3 -fomit-frame-pointer -Wall::-D_REENTRANT::-ldl:BN_LLONG RC4_CHAR RC4_CHUNK DES_INT DES_UNROLL BF_PTR:${no_asm}:dlfcn:linux-shared:-fPIC::.so.\$(SHLIB_MAJOR).\$(SHLIB_MINOR)",
 "android-x86","gcc:-mandroid -I\$(ANDROID_DEV)/include -B\$(ANDROID_DEV)/lib -O3 -fomit-frame-pointer -Wall::-D_REENTRANT::-ldl:BN_LLONG ${x86_gcc_des} ${x86_gcc_opts}:".eval{my $asm=${x86_elf_asm};$asm=~s/:elf/:android/;$asm}.":dlfcn:linux-shared:-fPIC::.so.\$(SHLIB_MAJOR).\$(SHLIB_MINOR)",
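
Each target entry in the table is a colon-separated string. As a rough illustration of how the generic and aarch64 OpenWrt targets differ, the sketch below splits the leading fields of two entries copied from the diff (with the Perl `\$` escapes dropped). The field names in `FIELDS` are an assumption about the field order used by the OpenSSL 1.0.2 `Configure` script, and `parse_target` is a hypothetical helper, not something the patch or OpenSSL provides.

```python
# Assumed names of the first seven colon-separated fields of a
# Configure table entry; later fields (asm objects, dso scheme, shared
# library settings) are ignored in this sketch.
FIELDS = ("cc", "cflags", "unistd", "thread_cflag", "sys_id", "lflags", "bn_ops")


def parse_target(entry):
    """Map the leading fields of a target string to the assumed names."""
    return dict(zip(FIELDS, entry.split(":")))


# Entries copied from the diff, without the Perl backslash escapes.
generic = (
    "gcc:-DTERMIOS $(OPENWRT_OPTIMIZATION_FLAGS) -fomit-frame-pointer -Wall"
    "::-D_REENTRANT::-ldl:BN_LLONG RC4_CHAR RC4_CHUNK DES_INT DES_UNROLL BF_PTR"
    ":${no_asm}:dlfcn:linux-shared:-fPIC::.so.$(SHLIB_MAJOR).$(SHLIB_MINOR)"
)
aarch64 = (
    "gcc:-DTERMIOS $(OPENWRT_OPTIMIZATION_FLAGS) -fomit-frame-pointer -Wall"
    "::-D_REENTRANT::-ldl:SIXTY_FOUR_BIT_LONG RC4_CHAR RC4_CHUNK DES_INT DES_UNROLL BF_PTR"
    ":${aarch64_asm}:linux64:dlfcn:linux-shared:-fPIC::.so.$(SHLIB_MAJOR).$(SHLIB_MINOR)"
)

generic_ops = set(parse_target(generic)["bn_ops"].split())
aarch64_ops = set(parse_target(aarch64)["bn_ops"].split())

# Options present in one target but not the other.
only_generic = generic_ops - aarch64_ops
only_aarch64 = aarch64_ops - generic_ops
print("generic only:", sorted(only_generic))
print("aarch64 only:", sorted(only_aarch64))
```

Apart from the assembler variable (`${no_asm}` versus `${aarch64_asm}` with the `linux64` scheme), the two entries differ only in their big-number option. That difference is presumably what changes the reported `bn(64,32)` to `bn(64,64)`.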