FastC++: Coding Cpp Efficiently: Simple vector3 class with SSE support

In this post we show how to write a simple class that represents a 3D vector and uses SSE operations for fast calculations. The class stores three float values (x, y and z) and implements the basic vector operations: addition, subtraction, multiplication, division, cross product, dot product and length. Its storage is a 16-byte-aligned 128-bit value, so SSE intrinsics can be applied to it directly. In addition, it overloads the array forms of the new and delete operators (new[] and delete[]) so that heap-allocated arrays of vector3 are aligned as well.
The following code can be used in a stand-alone header file (e.g. vector3.h):

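#include <stddef.h>    // size_t
#include <malloc.h>    // _aligned_malloc, _aligned_free (MSVC)
#include <smmintrin.h> // SSE4.1 intrinsics (_mm_dp_ps); also pulls in the earlier SSE headers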

// Simple vector class
_MM_ALIGN16 class vector3
{
 public:
  // constructors
  inline vector3() : mmvalue(_mm_setzero_ps()) {}
  inline vector3(float x, float y, float z) : mmvalue(_mm_set_ps(0, z, y, x)) {}
  inline vector3(__m128 m) : mmvalue(m) {}

  // arithmetic operators with vector3
  inline vector3 operator+(const vector3& b) const { return _mm_add_ps(mmvalue, b.mmvalue); }
  inline vector3 operator-(const vector3& b) const { return _mm_sub_ps(mmvalue, b.mmvalue); }
  inline vector3 operator*(const vector3& b) const { return _mm_mul_ps(mmvalue, b.mmvalue); }
  inline vector3 operator/(const vector3& b) const { return _mm_div_ps(mmvalue, b.mmvalue); }

  // op= operators
  inline vector3& operator+=(const vector3& b) { mmvalue = _mm_add_ps(mmvalue, b.mmvalue); return *this; }
  inline vector3& operator-=(const vector3& b) { mmvalue = _mm_sub_ps(mmvalue, b.mmvalue); return *this; }
  inline vector3& operator*=(const vector3& b) { mmvalue = _mm_mul_ps(mmvalue, b.mmvalue); return *this; }
  inline vector3& operator/=(const vector3& b) { mmvalue = _mm_div_ps(mmvalue, b.mmvalue); return *this; }

  // arithmetic operators with float
  inline vector3 operator+(float b) const { return _mm_add_ps(mmvalue, _mm_set1_ps(b)); }
  inline vector3 operator-(float b) const { return _mm_sub_ps(mmvalue, _mm_set1_ps(b)); }
  inline vector3 operator*(float b) const { return _mm_mul_ps(mmvalue, _mm_set1_ps(b)); }
  inline vector3 operator/(float b) const { return _mm_div_ps(mmvalue, _mm_set1_ps(b)); }

  // op= operators with float
  inline vector3& operator+=(float b) { mmvalue = _mm_add_ps(mmvalue, _mm_set1_ps(b)); return *this; }
  inline vector3& operator-=(float b) { mmvalue = _mm_sub_ps(mmvalue, _mm_set1_ps(b)); return *this; }
  inline vector3& operator*=(float b) { mmvalue = _mm_mul_ps(mmvalue, _mm_set1_ps(b)); return *this; }
  inline vector3& operator/=(float b) { mmvalue = _mm_div_ps(mmvalue, _mm_set1_ps(b)); return *this; }

  // cross product
  inline vector3 cross(const vector3& b) const 
  {
   return _mm_sub_ps(
    _mm_mul_ps(_mm_shuffle_ps(mmvalue, mmvalue, _MM_SHUFFLE(3, 0, 2, 1)), _mm_shuffle_ps(b.mmvalue, b.mmvalue, _MM_SHUFFLE(3, 1, 0, 2))), 
    _mm_mul_ps(_mm_shuffle_ps(mmvalue, mmvalue, _MM_SHUFFLE(3, 1, 0, 2)), _mm_shuffle_ps(b.mmvalue, b.mmvalue, _MM_SHUFFLE(3, 0, 2, 1)))
   );
  }

  // dot product with another vector
  // (mask 0x71: multiply lanes x, y, z and write the sum to lane 0)
  inline float dot(const vector3& b) const { return _mm_cvtss_f32(_mm_dp_ps(mmvalue, b.mmvalue, 0x71)); }
  // length of the vector
  inline float length() const { return _mm_cvtss_f32(_mm_sqrt_ss(_mm_dp_ps(mmvalue, mmvalue, 0x71))); }
  // approximate 1/length() of the vector (_mm_rsqrt_ss is a fast, reduced-precision estimate)
  inline float rlength() const { return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_dp_ps(mmvalue, mmvalue, 0x71))); }
  // returns the vector scaled to approximately unit length
  // (mask 0x7F broadcasts the dot product to all lanes; _mm_rsqrt_ps is an estimate)
  inline vector3 normalize() const { return _mm_mul_ps(mmvalue, _mm_rsqrt_ps(_mm_dp_ps(mmvalue, mmvalue, 0x7F))); }

  // overloaded array new/delete that ensure 16-byte alignment (_aligned_malloc is MSVC-specific)
  inline void* operator new[](size_t x) { return _aligned_malloc(x, 16); }
  inline void operator delete[](void* x) { if (x) _aligned_free(x); }

  // Member variables
  union
  {
   struct { float x, y, z; };    
   __m128 mmvalue;
  };
};

inline vector3 operator+(float a, const vector3& b) { return b + a; }
inline vector3 operator-(float a, const vector3& b) { return vector3(_mm_set1_ps(a)) - b; }
inline vector3 operator*(float a, const vector3& b) { return b * a; }
inline vector3 operator/(float a, const vector3& b) { return vector3(_mm_set1_ps(a)) / b; }

This class can be used as follows:

vector3 a(1, 2, 3);
vector3 b = 2 * a + 4;
vector3 c = b.cross(a.normalize());
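The cross product in the class depends on two shuffle patterns that are easy to misread. As a minimal sketch, the same computation is written out below on raw __m128 values with the intermediate registers named. The functions cross3 and check_cross3 are hypothetical helpers introduced for illustration and are not part of the header above.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE: _mm_shuffle_ps, _mm_mul_ps, _mm_sub_ps

// Hypothetical free function mirroring vector3::cross on raw registers.
// Lanes are laid out as (x, y, z, w), with w unused.
inline __m128 cross3(__m128 a, __m128 b)
{
  // _MM_SHUFFLE(3, 0, 2, 1) reorders (x, y, z, w) into (y, z, x, w).
  const __m128 a_yzx = _mm_shuffle_ps(a, a, _MM_SHUFFLE(3, 0, 2, 1));
  const __m128 b_yzx = _mm_shuffle_ps(b, b, _MM_SHUFFLE(3, 0, 2, 1));
  // _MM_SHUFFLE(3, 1, 0, 2) reorders (x, y, z, w) into (z, x, y, w).
  const __m128 a_zxy = _mm_shuffle_ps(a, a, _MM_SHUFFLE(3, 1, 0, 2));
  const __m128 b_zxy = _mm_shuffle_ps(b, b, _MM_SHUFFLE(3, 1, 0, 2));
  // Lane by lane: (ay*bz - az*by, az*bx - ax*bz, ax*by - ay*bx, aw*bw - aw*bw)
  return _mm_sub_ps(_mm_mul_ps(a_yzx, b_zxy), _mm_mul_ps(a_zxy, b_yzx));
}

// Small self-check: (1, 2, 3) x (4, 5, 6) = (-3, 6, -3).
inline void check_cross3()
{
  float r[4];
  _mm_storeu_ps(r, cross3(_mm_set_ps(0, 3, 2, 1), _mm_set_ps(0, 6, 5, 4)));
  assert(r[0] == -3.0f && r[1] == 6.0f && r[2] == -3.0f);
}
```

Note that _mm_set_ps takes its arguments from the highest lane down, which is why the constructor in the class passes (0, z, y, x).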

When code that uses vector3 is compiled with optimizations enabled, the compiler can inline these small functions and keep vector3 objects in SSE registers where possible, so the class should add little or no overhead compared with hand-written intrinsics.
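The header relies on _MM_ALIGN16 and _aligned_malloc, which come from Microsoft's toolchain. The following is a rough sketch of how the same alignment guarantees might be expressed for other compilers, assuming C++11 alignas and the _mm_malloc/_mm_free pair that the common x86 compilers ship with their SSE headers. The type vec3_portable and the helpers is_aligned16 and check_array_alignment are hypothetical stand-ins, not the class from this post.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>          // std::bad_alloc
#include <xmmintrin.h>  // SSE intrinsics; assumed to declare _mm_malloc/_mm_free

// Hypothetical, trimmed-down stand-in for vector3.
struct alignas(16) vec3_portable
{
  __m128 mmvalue;

  vec3_portable() : mmvalue(_mm_setzero_ps()) {}
  vec3_portable(float x, float y, float z) : mmvalue(_mm_set_ps(0, z, y, x)) {}

  // Array forms, as in the post, but built on _mm_malloc/_mm_free.
  static void* operator new[](std::size_t bytes)
  {
    void* p = _mm_malloc(bytes, 16);
    if (!p) throw std::bad_alloc();  // operator new must not return null
    return p;
  }
  static void operator delete[](void* p) { _mm_free(p); }
};

// True if the pointer sits on a 16-byte boundary.
inline bool is_aligned16(const void* p)
{
  return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Allocates a small array and checks that every element is aligned.
inline void check_array_alignment()
{
  vec3_portable* arr = new vec3_portable[4];
  for (int i = 0; i < 4; ++i) assert(is_aligned16(arr + i));
  delete[] arr;
}
```

Compilers in C++17 mode also honor alignas for plain new expressions, in which case the custom operators may be unnecessary; the sketch keeps them to stay close to the original design.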

11 comments:

dsnettleton, July 3, 2012 at 7:11 AM
Great post. This was very helpful for me.
Richard Evans, August 20, 2012 at 10:46 AM
Thanks, this is awesome and just what I am looking for to start implementing SSE in the performance-critical bit of my code. I do have a question though: is there any magic required to make this perform better with AVX, i.e. by operating on 2 sets of 3 floats simultaneously, or is it better to use ILP within the outer loop, by calling two instructions for each (double) iteration?

Richard.
Anonymous, January 10, 2013 at 3:44 AM
Horrible GNU license, so code is unusable.
theowl84, March 11, 2013 at 8:48 AM
The code is free to use, but comes with no warranty.
Robin Vierich, March 28, 2013 at 6:37 PM
Thanks for this! How would I access the individual elements of the vector? (ex. how would I access the X value to use elsewhere?)
theowl84, May 2, 2013 at 8:33 PM
Since vector3 uses a union structure, it should be as easy as vector3 v; v.x = 123;
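Reading x after writing mmvalue relies on an anonymous struct and on union type punning, which the common compilers accept as extensions rather than something the C++ standard guarantees. As a hedged alternative, lanes can also be read and written with intrinsics alone. The helpers get_x, get_y, get_z and with_x below are hypothetical and not part of the class.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE: _mm_cvtss_f32, _mm_shuffle_ps, _mm_move_ss

// Hypothetical accessors operating on the raw register (lanes: x, y, z, w).
inline float get_x(__m128 v) { return _mm_cvtss_f32(v); }

// Broadcast lane 1 (y) to every lane, then read lane 0.
inline float get_y(__m128 v)
{
  return _mm_cvtss_f32(_mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 1, 1, 1)));
}

// Broadcast lane 2 (z) to every lane, then read lane 0.
inline float get_z(__m128 v)
{
  return _mm_cvtss_f32(_mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 2, 2, 2)));
}

// Returns a copy of v with lane 0 (x) replaced; the other lanes are kept.
inline __m128 with_x(__m128 v, float x)
{
  return _mm_move_ss(v, _mm_set_ss(x));
}

// Small self-check on the vector (1, 2, 3).
inline void check_accessors()
{
  const __m128 v = _mm_set_ps(0, 3, 2, 1);
  assert(get_x(v) == 1.0f && get_y(v) == 2.0f && get_z(v) == 3.0f);
}
```

With these, the equivalent of v.x = 123 would be v.mmvalue = with_x(v.mmvalue, 123).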
Unknown, August 19, 2013 at 3:13 PM
Hello theo,

Why did you use _MM_ALIGN16 ?

Thx !
Anonymous, September 11, 2013 at 6:02 PM
Wouldn't the division operation result in a divide by zero, since the spare element of the b vector is likely to be zero? It seems the code should be modified to copy b to a temporary __m128, set R3 to 1.0, then perform the division by the temporary.
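A minimal sketch of the fix proposed in this comment, assuming SSE2 is available for the lane mask. In the default floating-point environment, where SSE exceptions are masked, the 0/0 in the spare lane does not trap, but it does leave a NaN in that lane. The helpers with_unit_w, div3 and lane are hypothetical and not part of the class.

```cpp
#include <cassert>
#include <cmath>
#include <emmintrin.h>  // SSE2: _mm_set_epi32, _mm_castsi128_ps (includes SSE1)

// Hypothetical helper: returns b with its unused fourth lane forced to 1.0f.
inline __m128 with_unit_w(__m128 b)
{
  // All bits set in lanes x, y, z; all bits clear in lane w.
  const __m128 keep_xyz = _mm_castsi128_ps(_mm_set_epi32(0, -1, -1, -1));
  const __m128 one_in_w = _mm_set_ps(1.0f, 0.0f, 0.0f, 0.0f);
  return _mm_or_ps(_mm_and_ps(b, keep_xyz), one_in_w);
}

// Hypothetical variant of the vector/vector division that never divides by b.w.
inline __m128 div3(__m128 a, __m128 b)
{
  return _mm_div_ps(a, with_unit_w(b));
}

// Reads lane i (0..3) of v through a temporary array.
inline float lane(__m128 v, int i)
{
  float out[4];
  _mm_storeu_ps(out, v);
  return out[i];
}

// Compares the plain division with the guarded one on (2, 4, 6) / (1, 2, 3).
inline void check_div3()
{
  const __m128 a = _mm_set_ps(0, 6, 4, 2);
  const __m128 b = _mm_set_ps(0, 3, 2, 1);
  assert(std::isnan(lane(_mm_div_ps(a, b), 3)));  // 0/0 in the spare lane
  assert(lane(div3(a, b), 3) == 0.0f);            // 0/1 in the spare lane
}
```

Masking the lane rather than OR-ing 1.0f into it matters here because operations such as a + 4 in the class leave a nonzero value in the fourth lane.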
Waldemar Bancewicz, November 10, 2013 at 3:57 AM
The SSE registers are 128 bits, and the overhead of loading/storing an extra 4 bytes actually drops the performance down considerably. Based on my tests, there is almost no performance improvement for vector addition. It only makes sense to work with 4-dimensional vectors.
faBRicio, August 12, 2016 at 10:37 PM
Have you ever benchmarked this? In my tests, it performs the same as a non-SSE implementation. Probably the reason is the problem pointed out in Reed's blog: http://www.reedbeta.com/blog/2013/12/28/on-vector-math-libraries/


Wednesday, December 28, 2011
