Friday, December 30, 2011

Template based vector library with SSE support

This article explains how to set up a vector math library that performs mathematical operations on float arrays (vectors) of arbitrary size. It handles aligned and unaligned pointers with minimal code overhead but optimal runtime performance. This is achieved through template functions and compile-time arguments.
This tutorial is based on a simple code example: adding two arrays and storing the result in a third array. Step by step, we will introduce loop unrolling, SSE intrinsics, and functor-based concepts that allow us to build a library supporting different operations.
We start with a simple function for adding two arrays:
void add_vectors(float* C, const float* A, const float* B, size_t N)
{
  for(size_t i = 0; i < N; i++)
  {
    C[i] = A[i] + B[i];
  }
}
This code simply sums the arrays element-wise. In a previous article about loop unrolling we explained how to make the loop more efficient by reducing the number of comparisons and conditional jumps:
void add_vectors(float* C, const float* A, const float* B, size_t N)
{
  size_t i = 0;
  // unrolled loop
  for(; i < ROUND_DOWN(N, 4); i+=4)
  {
    C[i+0] = A[i+0] + B[i+0];
    C[i+1] = A[i+1] + B[i+1];
    C[i+2] = A[i+2] + B[i+2];
    C[i+3] = A[i+3] + B[i+3];
  }
  // process remaining elements
  for(; i < N; i++)
  {
    C[i] = A[i] + B[i];
  }
}
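The ROUND_DOWN macro used above comes from that earlier article; one possible definition (a sketch, assuming the second argument is a power of two) is:
// Rounds N down to the nearest multiple of n; n must be a power of two.
#define ROUND_DOWN(N, n) ((N) & ~((n) - 1))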
Modern processors have special SIMD units that can process more than one float value at a time. In the unrolled loop we can therefore use SSE intrinsics that process all four values simultaneously:
#include <xmmintrin.h> // SSE intrinsics: __m128, _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps, ...

void add_vectors(float* C, const float* A, const float* B, size_t N)
{
  size_t i = 0;
  // unrolled loop with SSE commands
  for(; i < ROUND_DOWN(N, 4); i+=4)
  {
    __m128 a = _mm_loadu_ps(A + i);
    __m128 b = _mm_loadu_ps(B + i);
    __m128 c = _mm_add_ps(a, b);
    _mm_storeu_ps(C + i, c);
  }
  // process remaining elements
  for(; i < N; i++)
  {
    C[i] = A[i] + B[i];
  }
}
Note that we use the unaligned load and store commands, since we cannot assume that the pointers A, B, or C point to memory aligned on a 16-byte address, which is required for the much faster aligned versions _mm_load_ps and _mm_store_ps. Writing special implementations for aligned memory access by hand is cumbersome, since there are 2^3 = 8 possible combinations of aligned and unaligned pointers in this function. Therefore, we present a template-based approach that allows the method to be parametrized at compile time. At run time the appropriate load/store function is used with no additional overhead, since the calls can be inlined. We define template functions load_ps and store_ps with an alignment parameter and specialize each:
enum Alignment
{
  Aligned,
  Unaligned
};

template<Alignment type> inline __m128 load_ps(const float* p);
template<> inline __m128 load_ps<Aligned>(const float* p) { return _mm_load_ps(p); }
template<> inline __m128 load_ps<Unaligned>(const float* p) { return _mm_loadu_ps(p); }

template<Alignment type> inline void store_ps(float* p, __m128 a);
template<> inline void store_ps<Aligned>(float* p, __m128 a) { _mm_store_ps(p, a); }
template<> inline void store_ps<Unaligned>(float* p, __m128 a) { _mm_storeu_ps(p, a); }
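As a quick illustration (the variable names here are ours, not from the library), the compiler resolves the specialization at compile time, so each call inlines directly to the corresponding intrinsic with no run-time branch:
_MM_ALIGN16 float in[4] = { 1.0f, 2.0f, 3.0f, 4.0f }; // 16-byte aligned stack array
float out[4];                                         // alignment not guaranteed
__m128 v = load_ps<Aligned>(in); // inlines to _mm_load_ps
store_ps<Unaligned>(out, v);     // inlines to _mm_storeu_ps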
Now, we can rewrite our add_vectors function with three template parameters, each specifying the alignment of the corresponding pointer.
template<Alignment stC, Alignment ldA, Alignment ldB>
void add_vectors(float* C, const float* A, const float* B, size_t N)
{
  size_t i = 0;
  // unrolled loop with template loading/storing
  for(; i < ROUND_DOWN(N, 4); i+=4)
  {
    __m128 a = load_ps<ldA>(A + i);
    __m128 b = load_ps<ldB>(B + i);
    __m128 c = _mm_add_ps(a, b);
    store_ps<stC>(C + i, c);
  }
  // process remaining elements
  for(; i < N; i++)
  {
    C[i] = A[i] + B[i];
  }
}
A call to this function with known alignment information may look like this:
float* A = new float[100]; // operator new: not guaranteed to be 16-byte aligned
float* B = (float*)_aligned_malloc(100 * sizeof(float), 16); // 16-byte aligned memory
float C[100]; // allocated on the stack, alignment not guaranteed
add_vectors<Unaligned, Unaligned, Aligned>(C, A, B, 100);
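If the alignment is only known at run time, a small helper (not part of the original article, a hedged sketch) can classify a pointer before choosing an instantiation; note that dispatching all three pointers this way means selecting among the eight instantiations:
#include <cstdint>

// Hypothetical helper: classify a pointer as 16-byte aligned or not at run time.
inline Alignment get_alignment(const void* p)
{
  return (reinterpret_cast<std::uintptr_t>(p) % 16 == 0) ? Aligned : Unaligned;
}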
When developing a complete library of array functions (such as difference, product, etc.), it is desirable to reuse the code of add_vectors and not rewrite it for every operator. By using functors we make the operation itself a parameter (op).
class plus
{
public:
  __m128 eval_ps(__m128 a, __m128 b) const { return _mm_add_ps(a, b); }
  __m128 eval_ss(__m128 a, __m128 b) const { return _mm_add_ss(a, b); }
};

template<Alignment stC, Alignment ldA, Alignment ldB, class Op>
void process_vectors(float* C, const float* A, const float* B, size_t N, const Op& op)
{
  size_t i = 0;
  // unrolled loop with template loading/storing
  for(; i < ROUND_DOWN(N, 4); i+=4)
  {
    __m128 a = load_ps<ldA>(A + i);
    __m128 b = load_ps<ldB>(B + i);
    __m128 c = op.eval_ps(a, b);
    store_ps<stC>(C + i, c);
  }
  // process remaining elements
  for(; i < N; i++)
  {
    __m128 a = _mm_load_ss(A + i);
    __m128 b = _mm_load_ss(B + i);
    __m128 c = op.eval_ss(a, b);
    _mm_store_ss(C + i, c);
  }
}
The class plus is used as a functor; the same mechanism works for any operation with two arguments. We require two evaluation methods: eval_ps for packed operations (four elements at a time) and eval_ss for operating on single elements. Two vectors can now be added by the following call:
float *A, *B, *C; // pointers to aligned/unaligned memory
process_vectors<Unaligned, Aligned, Aligned>(C, A, B, 100, plus());
The initial coding effort quickly pays off when we extend the library with different operators, such as minus:
class minus
{
public:
  __m128 eval_ps(__m128 a, __m128 b) const { return _mm_sub_ps(a, b); }
  __m128 eval_ss(__m128 a, __m128 b) const { return _mm_sub_ss(a, b); }
};

// ...

// perform C = A - B
process_vectors<Unaligned, Aligned, Aligned>(C, A, B, 100, minus());
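Other element-wise operations follow the same two-method pattern; for example, a hypothetical times functor (not shown in the original article) for element-wise multiplication would be:
class times
{
public:
  __m128 eval_ps(__m128 a, __m128 b) const { return _mm_mul_ps(a, b); }
  __m128 eval_ss(__m128 a, __m128 b) const { return _mm_mul_ss(a, b); }
};

// perform C = A * B (element-wise)
process_vectors<Unaligned, Aligned, Aligned>(C, A, B, 100, times());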
The functor concept has additional benefits. For example, we can define an operator that computes a weighted sum (linear combination) of its two arguments and use it with process_vectors without any modifications:
_MM_ALIGN16 class scaled_plus
{
public:
  scaled_plus(float scale1, float scale2) : s1_(_mm_set1_ps(scale1)), s2_(_mm_set1_ps(scale2)) {}
  __m128 eval_ps(__m128 a, __m128 b) const { return _mm_add_ps(_mm_mul_ps(a, s1_), _mm_mul_ps(b, s2_)); }
  __m128 eval_ss(__m128 a, __m128 b) const { return _mm_add_ss(_mm_mul_ss(a, s1_), _mm_mul_ss(b, s2_)); }
private:
  __m128 s1_;
  __m128 s2_;
};

// ...

// perform C = 0.3 * A + 0.7 * B
process_vectors<Unaligned, Aligned, Aligned>(C, A, B, 100, scaled_plus(0.3f, 0.7f));
Note that the _MM_ALIGN16 prefix tells the compiler that a scaled_plus object must be aligned to a 16-byte boundary, which is required for the aligned accesses to the __m128 members s1_ and s2_. (This guarantee applies to stack and static objects; heap allocations via new would need an aligned allocator.) When compiling this code in a release build there is no overhead from calling the functor object, since everything can be inlined.
In summary, we showed how to make the alignment of pointers a compile time argument to ensure optimal load and store operations on an array of float values. By using functors, we are able to create a generic library which supports arbitrary mathematical operations and enables passing of additional parameters.
