CPU vs. GPU: Unleashing CPU Performance with Julia – Vectorization and Optimization Techniques

In the ever-evolving landscape of computing, the debate between CPUs and GPUs often takes center stage. While GPUs are renowned for their parallel processing prowess, particularly in graphics and machine learning, the Central Processing Unit (CPU) remains a vital workhorse for a vast array of computational tasks. Understanding how to maximize CPU performance is crucial, and in this article, we delve into how Julia, a high-performance programming language, empowers developers to achieve remarkable CPU optimizations. We will explore techniques like vectorization, demonstrating how seemingly simple code transformations can lead to significant speed improvements, effectively showcasing the power of modern CPUs when harnessed correctly.

Understanding CPU Vectorization for Performance

Modern CPUs are incredibly sophisticated, equipped with features that go beyond simple scalar operations. One such feature is vectorization, also known as SIMD (Single Instruction, Multiple Data). Vectorization allows the CPU to perform the same operation on multiple data points simultaneously, drastically increasing throughput for suitable workloads. Instruction sets like AVX-512 let a CPU process 512 bits of data in a single instruction: eight double-precision (or sixteen single-precision) floating-point numbers per operation, all within a single core. This is a key technique for bridging the performance gap and keeping CPUs highly competitive even in computationally intensive scenarios.
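To see vectorization in action on your own machine, you can ask Julia for the LLVM IR it generates for a simple loop. The following is a minimal sketch (the function name and array sizes are my own choices for illustration); on an AVX-512 machine, vector types such as <8 x double> in the output indicate that the loop was vectorized:

function axpy!(z, a, x, y)
    # @inbounds removes bounds checks and @simd permits reordering,
    # so LLVM is free to vectorize this loop
    @inbounds @simd for i in eachindex(z, x, y)
        z[i] = a * x[i] + y[i]
    end
    z
end

z, x, y = zeros(256), randn(256), randn(256);
@code_llvm axpy!(z, 2.0, x, y)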

Dot Product Optimization: A Practical Example in Julia

Let’s illustrate CPU vectorization with a concrete example: calculating the dot product of two vectors. A naive implementation might look like this:

function dot_product_naive(x, y)
    out = zero(promote_type(eltype(x), eltype(y)))
    for i in eachindex(x,y)
        out += x[i] * y[i]
    end
    out
end
x, y = randn(256), randn(256);

However, Julia’s compiler is intelligent enough to automatically vectorize this loop when possible. Let’s examine the assembly generated for dot_product, a version whose only change is the @inbounds annotation, on a machine with AVX-512 support:

function dot_product(x, y)
    out = zero(promote_type(eltype(x), eltype(y)))
    # eachindex(x, y) guarantees the arrays share valid indices, making @inbounds safe;
    # eliding the bounds checks frees LLVM to vectorize and unroll the reduction
    @inbounds for i in eachindex(x,y)
        out += x[i] * y[i]
    end
    out
end

By using @code_native dot_product(x, y), we can inspect the generated machine code. A snippet reveals the use of zmm registers:

L128:
  vmovupd (%rdx,%rax,8), %zmm4
  vmovupd 64(%rdx,%rax,8), %zmm5
  vmovupd 128(%rdx,%rax,8), %zmm6
  vmovupd 192(%rdx,%rax,8), %zmm7
  vfmadd231pd (%rcx,%rax,8), %zmm4, %zmm0
  vfmadd231pd 64(%rcx,%rax,8), %zmm5, %zmm1
  vfmadd231pd 128(%rcx,%rax,8), %zmm6, %zmm2
  vfmadd231pd 192(%rcx,%rax,8), %zmm7, %zmm3
  addq $32, %rax
  cmpq %rax, %rdi
  jne L128

The zmm registers are 512 bits wide, so each holds eight double-precision floats. The loop body at L128 is also unrolled four ways: each iteration loads 32 doubles from one vector into zmm4 through zmm7 and uses vfmadd231pd, a fused multiply-add instruction, to multiply them by the matching 32 doubles of the other vector while accumulating into four independent partial sums (zmm0 through zmm3). The addq $32 confirms the index advances by 32 elements per iteration, and the four separate accumulators break the floating-point dependency chain so the CPU can keep several FMAs in flight. This significantly speeds up the dot product.

Benchmarking this vectorized dot_product function against Julia’s built-in x' * y, which dispatches to an optimized BLAS routine, demonstrates the performance gain:

julia> using BenchmarkTools

julia> @btime dot_product($x, $y)
  12.979 ns (0 allocations: 0 bytes)
  4.738430453861962

julia> @btime $x' * $y
  25.089 ns (0 allocations: 0 bytes)
  4.738430453861962

The hand-written dot_product is almost twice as fast as the BLAS-backed x' * y at this small size, where the overhead of the library call matters, highlighting the power of CPU vectorization achieved with minimal code effort in Julia.
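The compiler arrived at that four-accumulator pattern on its own, but it can be instructive to spell it out by hand. The sketch below is my own illustration of what the assembly is doing, with four independent partial sums; for brevity it assumes the length is a multiple of four, which a real implementation would not:

function dot_product_unrolled(x, y)
    T = promote_type(eltype(x), eltype(y))
    s1 = s2 = s3 = s4 = zero(T)
    # Four independent accumulators, mirroring zmm0 through zmm3 in the assembly above;
    # they break the floating-point dependency chain so multiple FMAs can be in flight.
    # Assumes length(x) is a multiple of 4; a robust version would handle the remainder.
    @inbounds for i in 1:4:length(x)-3
        s1 += x[i]   * y[i]
        s2 += x[i+1] * y[i+1]
        s3 += x[i+2] * y[i+2]
        s4 += x[i+3] * y[i+3]
    end
    (s1 + s2) + (s3 + s4)
end

dot_product_unrolled(x, y) ≈ dot_product(x, y)  # true, up to floating-point rounding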

Advanced Vectorization: Gibbs Sampler Example with SLEEFwrap

For more complex scenarios, achieving vectorization may require more explicit direction. Consider a Gibbs sampler involving a mixture of zero-mean trivariate t distributions. A straightforward, non-vectorized implementation in Julia might look like update_individual_probs_v1!:

using Random, BenchmarkTools, SpecialFunctions

function update_individual_probs_v1!(mt::MersenneTwister, probabilities::AbstractMatrix{T}, baseπ, Li::AbstractMatrix{T}, ν, x::AbstractMatrix{T}, ::Val{NG}) where {T,NG}
    @inbounds for g ∈ 1:NG
        Li11 = Li[1,g]
        Li21 = Li[2,g]
        Li31 = Li[3,g]
        Li22 = Li[4,g]
        Li32 = Li[5,g]
        Li33 = Li[6,g]
        νfactor = (ν[g] - 2) / ν[g]
        exponent = T(-1.5) - T(0.5) * ν[g]
        base = log(baseπ[g]) + log(Li11) + log(Li22) + log(Li33) + lgamma(-exponent) - lgamma(T(0.5)*ν[g]) - T(1.5)*log(ν[g])
        for i ∈ 1:size(probabilities,1)
            lx₁ = Li11 * x[i,1]
            lx₂ = Li21 * x[i,1] + Li22 * x[i,2]
            lx₃ = Li31 * x[i,1] + Li32 * x[i,2] + Li33 * x[i,3]
            probabilities[i,g] = exp(base + exponent * log(one(T) + νfactor * (lx₁*lx₁ + lx₂*lx₂ + lx₃*lx₃)))
        end
    end
end

function update_individual_probs_v1!(probabilities::AbstractMatrix{T}, baseπ, Li::AbstractMatrix{T}, ν, x::AbstractMatrix{T}, ::Val{NG}) where {T,NG}
    update_individual_probs_v1!(Random.GLOBAL_RNG, probabilities, baseπ, Li, ν, x, Val(NG))
end

T = Float32
NG = 6; N = 1024;
X = randn(T, N,3);
probabilities = Matrix{T}(undef, N, NG);
Li = rand(T, 6,NG);
ν = T(N / NG) + 4 .+ rand(T, NG);
baseπ = rand(T, NG);
baseπ ./= sum(baseπ);

@btime update_individual_probs_v1!(probabilities, baseπ, Li, ν, X, Val(NG))

This version might not vectorize even with @simd, because LLVM cannot auto-vectorize loops that call scalar special functions such as log, exp, and lgamma. However, with the help of the SLEEFwrap.jl package, which wraps the SLEEF vectorized math library, and its @restrict_simd macro, we can vectorize loops containing these special functions. Here’s the vectorized version, update_individual_probs!:

using SIMDPirates, SLEEFwrap
using SLEEFwrap: @restrict_simd

@generated function update_individual_probs!(mt::MersenneTwister, probabilities::AbstractMatrix{T}, baseπ::AbstractVector{T}, Li::AbstractMatrix{T}, ν, x::AbstractMatrix{T}, ::Val{NG}) where {T,NG}
    quote
        @inbounds for g ∈ 1:NG
            Li11 = Li[1,g]
            Li21 = Li[2,g]
            Li31 = Li[3,g]
            Li22 = Li[4,g]
            Li32 = Li[5,g]
            Li33 = Li[6,g]
            νfactor = (ν[g] - 2) / ν[g]
            exponent = T(-1.5) - T(0.5) * ν[g]
            base = log(baseπ[g]) + log(Li11) + log(Li22) + log(Li33) + lgamma(-exponent) - lgamma(T(0.5)*ν[g]) - T(1.5)*log(ν[g])
            @restrict_simd $T for i ∈ 1:size(probabilities,1)
                lx₁ = Li11 * x[i,1]
                lx₂ = Li21 * x[i,1] + Li22 * x[i,2]
                lx₃ = Li31 * x[i,1] + Li32 * x[i,2] + Li33 * x[i,3]
                probabilities[i,g] = exp(base + exponent * log(one(T) + νfactor * (lx₁*lx₁ + lx₂*lx₂ + lx₃*lx₃)))
            end
        end
    end
end

function update_individual_probs!(probabilities::AbstractMatrix{T}, baseπ, Li::AbstractMatrix{T}, ν, x::AbstractMatrix{T}, ::Val{NG}) where {T,NG}
    update_individual_probs!(Random.GLOBAL_RNG, probabilities, baseπ, Li, ν, x, Val(NG))
end


@btime update_individual_probs!(probabilities, baseπ, Li, ν, X, Val(NG))

Benchmarking both versions reveals a dramatic speedup:

julia> @btime update_individual_probs_v1!(probabilities, baseπ, Li, ν, X, Val(NG))
  121.939 μs (0 allocations: 0 bytes)

julia> @btime update_individual_probs!(probabilities, baseπ, Li, ν, X, Val(NG))
  7.867 μs (0 allocations: 0 bytes)

The vectorized version is roughly fifteen times faster, showcasing the effectiveness of SLEEFwrap in enabling vectorization for complex numerical code in Julia, even when special functions are involved.
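Aggressive vectorization of transcendental functions can perturb results at the level of floating-point rounding, so it is worth confirming that the two versions agree. A quick sanity check along these lines (my own addition, reusing the setup variables defined above):

probs_v1 = Matrix{T}(undef, N, NG);
probs_v2 = Matrix{T}(undef, N, NG);
update_individual_probs_v1!(probs_v1, baseπ, Li, ν, X, Val(NG))
update_individual_probs!(probs_v2, baseπ, Li, ν, X, Val(NG))
probs_v1 ≈ probs_v2  # expect true, up to small rounding differences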

CPU vs GPU: Choosing the Right Tool

While GPUs excel in massively parallel computations, CPUs remain highly relevant and powerful, especially when optimized. CPU optimization, as demonstrated through vectorization in Julia, is crucial in scenarios where:

  • Workloads are not inherently massively parallel: Some algorithms are not easily parallelizable to the degree GPUs require to be efficient.
  • Latency is critical: CPUs often offer lower latency for individual tasks compared to offloading to a GPU and retrieving results.
  • Tasks benefit from fast scalar and vector operations: CPUs are highly optimized for scalar and vector operations, making them ideal for many general-purpose computations and algorithms that can leverage vectorization.
  • Overhead of GPU offloading is significant: For small to medium-sized tasks, the overhead of transferring data to the GPU and back can outweigh the benefits of GPU parallelism, as the sketch after this list illustrates.
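To make the last point concrete, here is one way to measure that offload overhead yourself. This sketch is not part of the original benchmark: it assumes a CUDA-capable GPU and the CUDA.jl package, and the same pattern applies with AMDGPU.jl or Metal.jl on other hardware:

using CUDA, LinearAlgebra, BenchmarkTools

x, y = randn(256), randn(256);

# CPU: the data is already in cache; the dot product itself takes nanoseconds.
@btime dot($x, $y)

# GPU: the timing includes the host-to-device copies, which dominate at this size.
@btime CUDA.@sync begin
    xd, yd = CuArray($x), CuArray($y)
    dot(xd, yd)
end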

Conclusion: Maximizing CPU Performance with Julia

This exploration highlights that CPUs are far from obsolete in performance computing. With the right tools and techniques, like Julia and vectorization, CPUs can deliver exceptional performance, often rivaling or even exceeding GPUs in specific workloads. Julia’s ability to generate highly optimized, vectorized code with minimal programmer effort makes it an excellent choice for harnessing the full potential of modern CPUs. Understanding the strengths of both CPUs and GPUs, and knowing how to optimize for each, is key to efficient and high-performance computing in today’s diverse hardware landscape.
