Advanced Vector Extensions (AVX) is an extension to the x86 instruction set architecture for microprocessors from Intel and AMD.


SSE (Streaming SIMD Extensions) was the first of many similarly named vector extensions to the x86 instruction set.

L1d cache CPUs

Quality Example

"For small buffers hot in L1d cache, AVX can copy significantly faster than SSE on CPUs like Haswell, where 256b loads/stores really do use a 256b data path to L1d cache instead of splitting into two 128b operations."

from question "SSE-copy, AVX-copy and std::copy performance"

"So congratulations - you can pat yourself on the back: your AVX routine is indeed about a third faster than the SSE routine (tested on a Haswell i7 here)."

from question "AVX 256-bit code performing slightly worse than equivalent 128-bit SSSE3 code"
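
The data-path point in the first quote above can be made concrete with a minimal sketch in C. It assumes GCC or Clang on x86 (the `target` attribute and `__builtin_cpu_supports` are compiler extensions), and the names `copy_sse`, `copy_avx` and `copy_dispatch` are hypothetical, not taken from the quoted questions. The two loops differ only in the width of each load/store; whether the 256-bit loop is actually faster depends on the CPU's data path to L1d, as the quote notes.

```c
#include <immintrin.h>
#include <stddef.h>
#include <string.h>

/* 128-bit copy: 16 bytes per load/store (SSE2). */
__attribute__((target("sse2")))
static void copy_sse(void *dst, const void *src, size_t n)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(s + i));
        _mm_storeu_si128((__m128i *)(d + i), v);
    }
    memcpy(d + i, s + i, n - i);  /* copy the remaining tail bytes */
}

/* 256-bit copy: 32 bytes per load/store (AVX). */
__attribute__((target("avx")))
static void copy_avx(void *dst, const void *src, size_t n)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;
    size_t i = 0;
    for (; i + 32 <= n; i += 32) {
        __m256i v = _mm256_loadu_si256((const __m256i *)(s + i));
        _mm256_storeu_si256((__m256i *)(d + i), v);
    }
    memcpy(d + i, s + i, n - i);  /* copy the remaining tail bytes */
}

/* Hypothetical helper: use the AVX loop only if the CPU reports AVX. */
static void copy_dispatch(void *dst, const void *src, size_t n)
{
    if (__builtin_cpu_supports("avx"))
        copy_avx(dst, src, n);
    else
        copy_sse(dst, src, n);
}
```

To compare the two on a given machine, time both loops over a buffer small enough to stay in L1d and repeat the measurement many times; a single run says little.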

Smaller values fast

Quality Example
Clearly faster

"Now for SSE is clearly faster, and for the smaller values it's nearly as fast as AVX."

from question "Performance of SSE and AVX when both Memory-band width limited"


"So the AVX version does indeed appear to be faster than the SSE version, both for the original implementations and the optimised implementations."

from question "Float point multiplication: LOSING speed with AVX against SSE?"

"As expected, the performance got better with both, and AVX2 was faster than SSE4.2, but when I profiled the code with PAPI I found out that the total number of misses (mainly L1 and L2) increased a lot."

from question "Increased number of cache misses when vectorizing code"

Scalar faster

"The question is: AVX scalar is 2.7x faster than SSE; when I vectorized it, the speedup is 3x (matrix size is 128x128 for this question)."

from question "What is the benefits of using vaddss instead of addss in scalar matrix addition?"


Quality Example

"The underlying reason for this and various other AVX limitations is that architecturally AVX is little more than two SSE execution units side by side - you will notice that virtually no AVX instructions operate horizontally across the boundary between the two 128-bit halves of a vector, which is particularly annoying in the case of vpalignr."

from question "Intel AVX : Why is there no 256-bits version of dot product for double precision floating point variables?"

"And SIMD math libraries for SSE and AVX; however, they seem to be more SSE than AVX2."

from question "Efficient implementation of log2(__m256d) in AVX2"
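
The earlier quote about the two 128-bit halves can be illustrated with a short sketch of a 4-element double-precision dot product. It assumes GCC or Clang on x86, and the function names `dot4_scalar`, `dot4_avx` and `dot4` are hypothetical. `_mm256_hadd_pd` adds pairs only within each 128-bit half, so the final sum needs an explicit step that pulls the upper half down with `_mm256_extractf128_pd`.

```c
#include <immintrin.h>

/* Plain scalar reference, also used as the fallback. */
static double dot4_scalar(const double *a, const double *b)
{
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}

__attribute__((target("avx")))
static double dot4_avx(const double *a, const double *b)
{
    /* prod = [p0, p1, p2, p3] */
    __m256d prod = _mm256_mul_pd(_mm256_loadu_pd(a), _mm256_loadu_pd(b));

    /* hadd stays inside each 128-bit half:
       sums = [p0+p1, p0+p1, p2+p3, p2+p3] */
    __m256d sums = _mm256_hadd_pd(prod, prod);

    /* Crossing the lane boundary takes a separate extract. */
    __m128d lo = _mm256_castpd256_pd128(sums);    /* lower half */
    __m128d hi = _mm256_extractf128_pd(sums, 1);  /* upper half */

    return _mm_cvtsd_f64(_mm_add_sd(lo, hi));     /* (p0+p1) + (p2+p3) */
}

/* Hypothetical helper: use the AVX path only if the CPU reports AVX. */
static double dot4(const double *a, const double *b)
{
    return __builtin_cpu_supports("avx") ? dot4_avx(a, b)
                                         : dot4_scalar(a, b);
}
```

For longer vectors it is usually better to keep a vector accumulator inside the loop and do this cross-lane reduction once at the end, rather than once per four elements.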


"Also note that the fact that AVX is newer than SSE doesn't make AVX faster. Whatever you are planning to use, the number of cycles taken by a function is probably more important than the AVX vs SSE argument; for example, see this answer."

from question "SSE-copy, AVX-copy and std::copy performance"


"I expected AVX to be about 1.5x faster than SSE."

from question "AVX vs. SSE: expect to see a larger speedup"


"I have code that does the same thing, but the AVX version is considerably slower than the SSE version."

from question "Float point multiplication: LOSING speed with AVX against SSE?"

More time

"buf1, buf2 and buf3 are small enough to be located in L1 cache and L2 cache (L2 cache 1MB). Both SSE and AVX are bandwidth limited, but as datalen increases, why does AVX need more time than SSE?"

from question "Performance of SSE and AVX when both Memory-band width limited"

Data comes from Stack Exchange with CC-BY-SA-3.0