Often, achieving significant performance boosts in your code doesn’t require a complete overhaul or migration to GPUs. Sometimes, the power of your CPU, when properly unleashed, can deliver astonishing speedups with minimal effort. Let’s delve into how Julia, a language renowned for its performance, leverages CPU vectorization to rival GPUs, and in some cases eliminate the need for them, in specific computational tasks.
Unleashing CPU Power: The Dot Product Example
Consider a seemingly simple operation: the dot product of two vectors. In Julia, a straightforward implementation looks like this:
function dot_product(x, y)
    out = zero(promote_type(eltype(x), eltype(y)))
    @inbounds @simd for i in eachindex(x, y)
        out += x[i] * y[i]
    end
    out
end
x, y = randn(256), randn(256);
Now, let’s examine the assembly code generated for this function on a machine equipped with AVX512 capabilities. The @code_native macro in Julia allows us to peek under the hood:
julia> @code_native dot_product(x, y)
.text
; Function dot_product {
; Location: REPL[4]:3
; Function eachindex; {
; Location: abstractarray.jl:207
; Function eachindex; {
; Location: abstractarray.jl:217
; Function eachindex; {
; Location: abstractarray.jl:214
; Function axes1; {
; Location: abstractarray.jl:93
; Function axes; {
; Location: abstractarray.jl:75
; Function size; {
; Location: REPL[4]:2
subq $40, %rsp
movq 24(%rdi), %rax
;} ; Function map; {
; Location: tuple.jl:165
; Function Type; {
; Location: range.jl:314
; Function Type; {
; Location: range.jl:305
; Function max; {
; Location: promotion.jl:414
movq %rax, %rcx
sarq $63, %rcx
andnq %rax, %rcx, %r8
;}}}}}}} ; Location: abstractarray.jl:218
; Function _all_match_first; {
; Location: abstractarray.jl:224
; Function #89; {
; Location: abstractarray.jl:218
; Function eachindex; {
; Location: abstractarray.jl:214
; Function axes1; {
; Location: abstractarray.jl:93
; Function axes; {
; Location: abstractarray.jl:75
; Function size; {
; Location: array.jl:155
movq 24(%rsi), %rcx
;} ; Function map; {
; Location: tuple.jl:165
; Function Type; {
; Location: range.jl:314
; Function Type; {
; Location: range.jl:305
; Function max; {
; Location: promotion.jl:414
movq %rcx, %rdx
sarq $63, %rdx
andnq %rcx, %rdx, %rcx
;}}}}}}}}} ; Function _all_match_first; {
; Location: promotion.jl:403
cmpq %rcx, %r8
;}
jne L300
; Location: abstractarray.jl:217
; Function eachindex; {
; Location: abstractarray.jl:214
; Function axes1; {
; Location: abstractarray.jl:93
; Function axes; {
; Location: abstractarray.jl:75
; Function map; {
; Location: tuple.jl:165
; Function Type; {
; Location: range.jl:314
; Function Type; {
; Location: range.jl:305
; Function max; {
; Location: promotion.jl:414
testq %rax, %rax
;}}}}}}}}}
jle L76
movq (%rdi), %rcx
movq (%rsi), %rdx
; Location: REPL[4]:3
cmpq $32, %r8
jae L88
vxorpd %xmm0, %xmm0, %xmm0
movl $1, %esi
jmp L259
L76:
vxorps %xmm0, %xmm0, %xmm0
; Location: REPL[4]:6
addq $40, %rsp
vzeroupper
retq
; Location: REPL[4]:3
L88:
movabsq $9223372036854775776, %rdi # imm = 0x7FFFFFFFFFFFFFE0
andq %r8, %rdi
leaq 1(%rdi), %rsi
vxorpd %xmm0, %xmm0, %xmm0
xorl %eax, %eax
vxorpd %xmm1, %xmm1, %xmm1
vxorpd %xmm2, %xmm2, %xmm2
vxorpd %xmm3, %xmm3, %xmm3
nopl (%rax,%rax)
; Location: REPL[4]:4
; Function getindex; {
; Location: array.jl:739
L128:
vmovupd (%rdx,%rax,8), %zmm4
vmovupd 64(%rdx,%rax,8), %zmm5
vmovupd 128(%rdx,%rax,8), %zmm6
vmovupd 192(%rdx,%rax,8), %zmm7
;} ; Function +; {
; Location: float.jl:395
vfmadd231pd (%rcx,%rax,8), %zmm4, %zmm0
vfmadd231pd 64(%rcx,%rax,8), %zmm5, %zmm1
vfmadd231pd 128(%rcx,%rax,8), %zmm6, %zmm2
vfmadd231pd 192(%rcx,%rax,8), %zmm7, %zmm3
addq $32, %rax
cmpq %rax, %rdi
jne L128
vaddpd %zmm0, %zmm1, %zmm0
vaddpd %zmm0, %zmm2, %zmm0
vaddpd %zmm0, %zmm3, %zmm0
vextractf64x4 $1, %zmm0, %ymm1
vaddpd %zmm1, %zmm0, %zmm0
vextractf128 $1, %ymm0, %xmm1
vaddpd %zmm1, %zmm0, %zmm0
vpermilpd $1, %xmm0, %xmm1 # xmm1 = xmm0[1,0]
vaddpd %zmm1, %zmm0, %zmm0
cmpq %rdi, %r8
;} ; Location: REPL[4]:3
je L292
L259:
addq $-1, %rsi
nopw (%rax,%rax)
; Location: REPL[4]:4
; Function getindex; {
; Location: array.jl:739
L272:
vmovsd (%rdx,%rsi,8), %xmm1 # xmm1 = mem[0],zero
;} ; Function +; {
; Location: float.jl:395
vfmadd231sd (%rcx,%rsi,8), %xmm1, %xmm0
;} ; Function iterate; {
; Location: range.jl:591
; Function ==; {
; Location: promotion.jl:403
addq $1, %rsi
cmpq %rsi, %r8
;}}
jne L272
; Location: REPL[4]:6
L292:
addq $40, %rsp
vzeroupper
retq
; Location: REPL[4]:3
; Function eachindex; {
; Location: abstractarray.jl:207
; Function eachindex; {
; Location: abstractarray.jl:218
L300:
movabsq $jl_system_image_data, %rax
movq %rax, 8(%rsp)
movabsq $jl_system_image_data, %rax
movq %rax, 16(%rsp)
movq %rdi, 24(%rsp)
movq %rsi, 32(%rsp)
movabsq $jl_invoke, %rax
movabsq $140450899053840, %rdi # imm = 0x7FBD45F24D10
leaq 8(%rsp), %rsi
movl $4, %edx
callq *%rax
ud2
nopw %cs:(%rax,%rax)
;} ;}}}
Looking at the assembly, particularly the loop labeled L128, we see instructions like vmovupd and vfmadd231pd operating on zmm registers. These zmm registers are 512 bits wide, indicating the use of AVX512 instructions; ymm registers are 256 bits, and xmm registers are 128 bits. The core loop loads and processes 32 doubles (4 registers * 512 bits / 64 bits per double) in each iteration:
L128:
vmovupd (%rdx,%rax,8), %zmm4
vmovupd 64(%rdx,%rax,8), %zmm5
vmovupd 128(%rdx,%rax,8), %zmm6
vmovupd 192(%rdx,%rax,8), %zmm7
;} ; Function +; {
; Location: float.jl:395
vfmadd231pd (%rcx,%rax,8), %zmm4, %zmm0
vfmadd231pd 64(%rcx,%rax,8), %zmm5, %zmm1
vfmadd231pd 128(%rcx,%rax,8), %zmm6, %zmm2
vfmadd231pd 192(%rcx,%rax,8), %zmm7, %zmm3
addq $32, %rax
cmpq %rax, %rdi
jne L128
This loop efficiently loads data and performs fused multiply-add operations (vfmadd), dramatically accelerating the dot product calculation. Furthermore, this implementation implicitly utilizes a more accurate summation algorithm similar to pairwise summation, enhancing both speed and numerical precision compared to a naive approach.
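To see why the multiple accumulators help, here is a rough scalar sketch of the transformation the compiler performs. The function name dot_product_4acc and the use of exactly four accumulators are purely illustrative; the actual vectorized code keeps 32 partial sums, eight per zmm register.
function dot_product_4acc(x, y)
    T = promote_type(eltype(x), eltype(y))
    acc1 = acc2 = acc3 = acc4 = zero(T)
    n = length(x)
    i = 1
    # main loop: four independent running sums, mirroring the four zmm accumulators
    while i + 3 <= n
        acc1 += x[i]   * y[i]
        acc2 += x[i+1] * y[i+1]
        acc3 += x[i+2] * y[i+2]
        acc4 += x[i+3] * y[i+3]
        i += 4
    end
    # combine the partial sums, as the horizontal vaddpd reduction does
    out = (acc1 + acc2) + (acc3 + acc4)
    # scalar remainder loop for lengths not divisible by 4
    while i <= n
        out += x[i] * y[i]
        i += 1
    end
    out
end
Because the elements are accumulated into several partial sums before being combined, rounding errors grow more slowly than with a single running total, which is what gives the pairwise-summation-like accuracy benefit.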
Benchmarking the Speed: CPU Vectorization in Action
Let’s quantify the performance gains with benchmarks:
julia> using BenchmarkTools
julia> @btime dot_product($x, $y)
12.979 ns (0 allocations: 0 bytes)
4.738430453861962
julia> @btime $x' * $y # Built-in dot product for comparison
25.089 ns (0 allocations: 0 bytes)
4.738430453861962
Our hand-crafted dot_product function, thanks to CPU vectorization, is significantly faster than even Julia’s already optimized built-in dot product for this specific case. For context, on a similarly clocked Ryzen CPU without AVX512, the minimum time for the built-in dot product might be around 34 ns, highlighting the substantial advantage of AVX512 and effective vectorization.
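For reference, the built-in operation can also be called through the LinearAlgebra standard library; the snippet below is simply an alternative way to run the same comparison (timings depend on the machine, so none are quoted here):
using LinearAlgebra, BenchmarkTools

@btime dot($x, $y)  # equivalent to x' * y for real-valued vectors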
Beyond Simple Kernels: Vectorizing Complex Computations
The benefits of CPU vectorization extend far beyond basic operations like dot products. Consider a more complex scenario: calculating unnormalized conditional probabilities in a Gibbs sampler for a mixture of zero-mean tri-variate T distributions. A standard Julia implementation might look like this:
using Random, BenchmarkTools, SpecialFunctions
function update_individual_probs_v1!(mt::MersenneTwister, probabilities::AbstractMatrix{T}, baseπ, Li::AbstractMatrix{T}, ν, x::AbstractMatrix{T}, ::Val{NG}) where {T,NG}
    @inbounds for g ∈ 1:NG
        Li11 = Li[1,g]
        Li21 = Li[2,g]
        Li31 = Li[3,g]
        Li22 = Li[4,g]
        Li32 = Li[5,g]
        Li33 = Li[6,g]
        νfactor = (ν[g] - 2) / ν[g]
        exponent = T(-1.5) - T(0.5) * ν[g]
        base = log(baseπ[g]) + log(Li11) + log(Li22) + log(Li33) + lgamma(-exponent) - lgamma(T(0.5)*ν[g]) - T(1.5)*log(ν[g])
        for i ∈ 1:size(probabilities,1)
            lx₁ = Li11 * x[i,1]
            lx₂ = Li21 * x[i,1] + Li22 * x[i,2]
            lx₃ = Li31 * x[i,1] + Li32 * x[i,2] + Li33 * x[i,3]
            probabilities[i,g] = exp(base + exponent * log(one(T) + νfactor * (lx₁*lx₁ + lx₂*lx₂ + lx₃*lx₃)))
        end
    end
end
function update_individual_probs_v1!(probabilities::AbstractMatrix{T}, baseπ, Li::AbstractMatrix{T}, ν, x::AbstractMatrix{T}, ::Val{NG}) where {T,NG}
    update_individual_probs_v1!(Random.GLOBAL_RNG, probabilities, baseπ, Li, ν, x, Val(NG))
end
Let’s create some synthetic data to test performance:
T = Float32
NG = 6; N = 1024;
X = randn(T, N,3);
probabilities = Matrix{T}(undef, N, NG);
Li = rand(T, 6,NG);
ν = T(N / NG) + 4 .+ rand(T, NG);
baseπ = rand(T, NG);
baseπ ./= sum(baseπ);
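The trial below was presumably produced with an invocation along these lines (the exact call is not shown in the original, so treat this as an assumption; BenchmarkTools was already loaded above):
@benchmark update_individual_probs_v1!($probabilities, $baseπ, $Li, $ν, $X, $(Val(NG)))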
Benchmarking this version yields:
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
---------------
minimum time: 118.813 μs (0.00% GC)
median time: 121.939 μs (0.00% GC)
mean time: 127.839 μs (0.00% GC)
maximum time: 195.095 μs (0.00% GC)
---------------
samples: 10000
evals/sample: 1
While the @simd macro might offer limited help here, we can leverage the SLEEFwrap.jl package and its @restrict_simd macro to achieve significant vectorization, even with special functions like exp, log, and lgamma:
using SIMDPirates, SLEEFwrap
using SLEEFwrap: @restrict_simd

@generated function update_individual_probs!(mt::MersenneTwister, probabilities::AbstractMatrix{T}, baseπ::AbstractVector{T}, Li::AbstractMatrix{T}, ν, x::AbstractMatrix{T}, ::Val{NG}) where {T,NG}
    quote
        @inbounds for g ∈ 1:NG
            Li11 = Li[1,g]
            Li21 = Li[2,g]
            Li31 = Li[3,g]
            Li22 = Li[4,g]
            Li32 = Li[5,g]
            Li33 = Li[6,g]
            νfactor = (ν[g] - 2) / ν[g]
            exponent = T(-1.5) - T(0.5) * ν[g]
            base = log(baseπ[g]) + log(Li11) + log(Li22) + log(Li33) + lgamma(-exponent) - lgamma(T(0.5)*ν[g]) - T(1.5)*log(ν[g])
            @restrict_simd $T for i ∈ 1:size(probabilities,1)
                lx₁ = Li11 * x[i,1]
                lx₂ = Li21 * x[i,1] + Li22 * x[i,2]
                lx₃ = Li31 * x[i,1] + Li32 * x[i,2] + Li33 * x[i,3]
                probabilities[i,g] = exp(base + exponent * log(one(T) + νfactor * (lx₁*lx₁ + lx₂*lx₂ + lx₃*lx₃)))
            end
        end
    end
end
function update_individual_probs!(probabilities::AbstractMatrix{T}, baseπ, Li::AbstractMatrix{T}, ν, x::AbstractMatrix{T}, ::Val{NG}) where {T,NG}
    update_individual_probs!(Random.GLOBAL_RNG, probabilities, baseπ, Li, ν, x, Val(NG))
end
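As before, the timings that follow were presumably gathered with a call of roughly this form (again an assumption, since the invocation is not shown):
@benchmark update_individual_probs!($probabilities, $baseπ, $Li, $ν, $X, $(Val(NG)))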
Benchmarking the vectorized version shows a remarkable speed improvement:
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
---------------
minimum time: 7.363 μs (0.00% GC)
median time: 7.867 μs (0.00% GC)
mean time: 7.990 μs (0.00% GC)
maximum time: 16.006 μs (0.00% GC)
---------------
samples: 10000
evals/sample: 4
The vectorized code is now roughly sixteen times faster, more than an order of magnitude of improvement! This demonstrates that even for complex numerical computations, CPU vectorization can be a game-changer, offering performance levels that can rival, and sometimes even surpass, the benefits of offloading to a GPU, especially when considering the overhead of data transfer to and from the GPU.
CPU vs GPU: Choosing the Right Tool
While GPUs excel in massively parallel computations and are indispensable for tasks like deep learning and large-scale simulations, CPU vectorization provides a powerful and often overlooked avenue for performance optimization. For many numerical tasks, especially those that are data-parallel but not massively so, or those involving complex control flow or special functions, optimizing for CPU vectorization can yield impressive results, sometimes even outperforming GPU solutions when overhead is considered.
When deciding between GPU acceleration and CPU optimization, consider:
- Problem Parallelism: Is the problem massively parallel or more moderately parallel? CPU vectorization shines in moderate parallelism.
- Data Transfer Overhead: Moving data to and from GPUs can be a bottleneck. CPU optimization avoids this.
- Code Complexity: Vectorizing for CPU can sometimes be simpler and more direct than writing and managing GPU kernels.
- Special Functions and Control Flow: CPU vectorization can handle these efficiently, while GPU implementations might be more complex.
In conclusion, before immediately jumping to GPU acceleration, explore the potential of CPU vectorization. Julia, with its powerful compiler and tools like SLEEFwrap, makes it easier than ever to tap into this often underutilized resource and achieve remarkable performance gains, demonstrating that in many scenarios, the CPU remains a formidable computational powerhouse.