Originally published on X/Twitter and LinkedIn on October 25, 2024
Recently, I was optimizing an algorithm in C++ for our WebAR engine, which needs to perform well across a wide variety of devices. The goal was to deliver smooth AR experiences, ideally hitting 30 FPS even on low-end hardware.
When we talk about optimization, the first thing that usually comes to mind is improving time complexity. But there’s a lot more to optimization than just that.
Common Optimization Approaches
- Algorithmic Optimization: The first thing we tend to look at is improving time complexity, like reducing O(n²) to O(n log n). There are also space-time trade-offs and considerations around data structure selection.
- Domain-Specific Optimization: Optimizations that exploit the specific characteristics of the problem domain. If you’re working with a geometric problem, for instance, a 3D rectangular object might not need many triangles to represent its surface, so reducing the triangle count without sacrificing rendering quality can be an effective optimization.
- System-Level Optimization: This typically involves cache optimization, memory alignment, and loop unrolling. These days, compilers handle much of this, but there’s always room for manual tweaking.
- Parallel Processing: Threads and async operations can speed things up by processing tasks concurrently. However, for WebAssembly (WASM), threading has its limitations.
- SIMD Vectorization: This is where register-level parallelism comes in, allowing the CPU to handle multiple operations at once. SIMD (Single Instruction, Multiple Data) can be incredibly powerful when leveraged correctly.
My Optimization Journey
In my case, I started with the usual approach that every engineer tries first—optimizing time and space complexity. Once that was sorted, I looked at any geometric optimizations I could make. But even after that, it still wasn’t enough.
I was processing streams of pixels from a 2D image, extracting various features and calculating feature scores, then converting them into descriptors on the fly. In computer vision, descriptors are like fingerprints for specific points in an image: they help us identify and match these points across different frames. Creating these descriptors involves complex mathematical calculations on pixel neighborhoods, making it a computationally intensive operation.
So I thought, “Why not use threads?” Threads in WASM are relatively new, available through pthreads, but they come with their own limitations—especially when you’re building an SDK that needs to run on client servers locally or via CDN.
“The main issue with WASM threads is the requirement for Cross-Origin-Opener-Policy (COOP) and Cross-Origin-Embedder-Policy (COEP) headers. These headers are necessary to enable threading but require control over web server configurations. This makes WASM threads incompatible with popular hosting services, and developers have to coordinate with IT teams to implement the required headers—an extra hassle. As we strive to make AR developers’ lives easier, this is a point of concern.”
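For context, these are the two response headers a page has to be served with to become cross-origin isolated, which is what unlocks SharedArrayBuffer and, with it, WASM threads:

Cross-Origin-Opener-Policy: same-origin
Cross-Origin-Embedder-Policy: require-corp

Without control over these headers, the browser keeps SharedArrayBuffer disabled and WASM threads stay unavailable.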
Once I had exhausted the traditional methods, SIMD—Single Instruction, Multiple Data—came into play and made a huge difference. SIMD allows you to leverage the CPU’s registers to handle multiple operations simultaneously, which is perfect for performance-hungry tasks like mine.
Understanding SIMD
What is SIMD?
SIMD stands for Single Instruction, Multiple Data. It allows a single instruction to be applied to multiple pieces of data simultaneously, making it incredibly useful for repetitive computations, such as vector math, image processing, or AI workloads.
How SIMD Works with Registers
SIMD leverages the CPU’s registers to perform multiple operations at once. A register is a small, fast storage space the CPU uses during operations. In normal scalar code, each register holds a single value. Think of it as processing one item at a time.
With SIMD, a single register can hold multiple values at once. For example, a 128-bit register can hold four 32-bit floats or integers or eight 16-bit values. The processor can then execute the same operation on all the values in the register simultaneously, providing a substantial performance boost, particularly for tasks involving large datasets or repetitive operations like matrix multiplications or pixel transformations.
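To put that into numbers: adding two arrays of 1,024 floats takes 1,024 iterations in scalar code, but only 256 iterations with 4-wide SIMD, since each vector instruction handles four elements at once (plus a small scalar loop for any leftover elements when the size isn’t a multiple of four).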
Writing SIMD Code
Platform-Specific Intrinsics
Writing SIMD code requires understanding the target architecture—whether it’s ARM, x86, or WASM. Each platform has its own set of intrinsics, which are functions that let you directly interact with SIMD instructions at the hardware level. Here are some examples:
- ARM: NEON intrinsics
- x86: SSE, AVX intrinsics
- WASM: WASM SIMD intrinsics (wasm_simd128.h), a portable 128-bit instruction set that the runtime lowers to native SIMD on both ARM and x86 chips
These intrinsics allow you to instruct the CPU to perform the same operation across multiple data points simultaneously.
Practical Example: Vector Addition
Here’s an example of SIMD in action. First, I’ll show a scalar version of a simple vector addition, and then the SIMD-optimized version using ARM NEON intrinsics.
#include <arm_neon.h>
#include <iostream>
#include <chrono>

// Regular scalar addition
void addArraysScalar(const float* arr1, const float* arr2, float* result, int size) {
    for (int i = 0; i < size; i++) {
        result[i] = arr1[i] + arr2[i];
    }
}

// NEON SIMD addition - processes 4 floats at once
void addArraysNEON(const float* arr1, const float* arr2, float* result, int size) {
    int i;
    for (i = 0; i <= size - 4; i += 4) {
        float32x4_t v1 = vld1q_f32(arr1 + i);   // load 4 floats from arr1
        float32x4_t v2 = vld1q_f32(arr2 + i);   // load 4 floats from arr2
        float32x4_t sum = vaddq_f32(v1, v2);    // add all 4 lanes in one instruction
        vst1q_f32(result + i, sum);             // store 4 results
    }
    // Scalar tail loop for leftover elements when size isn't a multiple of 4
    for (; i < size; i++) {
        result[i] = arr1[i] + arr2[i];
    }
}

int main() {
    const int SIZE = 1024;
    float* array1 = new float[SIZE];
    float* array2 = new float[SIZE];
    float* result1 = new float[SIZE];
    float* result2 = new float[SIZE];

    // Fill the input arrays with test data
    for (int i = 0; i < SIZE; i++) {
        array1[i] = i * 1.1f;
        array2[i] = i * 2.2f;
    }

    // Time the scalar version
    auto start = std::chrono::high_resolution_clock::now();
    addArraysScalar(array1, array2, result1, SIZE);
    auto end = std::chrono::high_resolution_clock::now();
    auto scalarTime = std::chrono::duration_cast<std::chrono::microseconds>(end - start);

    // Time the NEON version
    start = std::chrono::high_resolution_clock::now();
    addArraysNEON(array1, array2, result2, SIZE);
    end = std::chrono::high_resolution_clock::now();
    auto simdTime = std::chrono::duration_cast<std::chrono::microseconds>(end - start);

    std::cout << "Scalar time: " << scalarTime.count() << " microseconds\n";
    std::cout << "SIMD time: " << simdTime.count() << " microseconds\n";
    std::cout << "Speedup: " << float(scalarTime.count()) / simdTime.count() << "x\n";

    delete[] array1;
    delete[] array2;
    delete[] result1;
    delete[] result2;
    return 0;
}
When I ran this on my system (M2 Pro chip, 16 GB) with the Apple Clang compiler (clang-1600.0.26.3), SIMD delivered a speedup of 2x over the scalar code.
But take it with a pinch of salt - SIMD code is not always faster.
When SIMD Doesn’t Help
While SIMD can be powerful, it’s not always the right choice. Here are a few cases where SIMD may not provide the expected benefits:
- Small Data Sets: SIMD has setup overhead. For small arrays (typically less than 1000 elements), the overhead might exceed the performance benefits. In these cases, regular scalar code might be faster.
- Non-Sequential Memory Access: If your data access pattern is random or scattered, such as processing sparse arrays, SIMD may not be helpful.
- Complex Conditional Logic: SIMD thrives on repetitive operations. If your code involves complex conditional logic (e.g., many if-else conditions), it might not be vectorized efficiently.
- Already Optimized or I/O-Bound Code: When the compiler’s auto-vectorization is already doing a good job, or when the code is I/O bound rather than CPU bound, hand-written SIMD adds little.
- Platform Portability Concerns: When you can’t guarantee SIMD support on your target platforms because the code needs to run on many different architectures.
Always profile your code before and after applying SIMD to ensure you’re getting the expected performance boost.
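On the auto-vectorization point: if you’re compiling with clang (or with emcc, which uses clang under the hood), you can ask the compiler to report which loops it managed to vectorize before reaching for intrinsics. An example invocation (the source file name here is just a placeholder):

emcc -O3 -msimd128 -Rpass=loop-vectorize -Rpass-missed=loop-vectorize feature_extractor.cpp

If the hot loop already shows up as vectorized, hand-written intrinsics may buy you very little.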
Real-World Results
In my scenario, switching to SIMD for vectorizing these operations led to significant performance gains. When I ran the scalar code compiled to WASM with Emscripten (emcc) locally on an M2 Pro chip (16 GB), it took around 7ms.
After vectorization, the time dropped to just 1ms on my profiler—a massive improvement that helped push the FPS to 30 even on low-end Android devices.
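For anyone curious what the vectorized form looks like on the WASM side, here’s a minimal sketch of the same vector-addition example from earlier, rewritten with Emscripten’s portable SIMD intrinsics from wasm_simd128.h. This is an illustration of the technique, not the actual feature-extraction code from our engine.

#include <wasm_simd128.h>

// WASM SIMD addition - processes 4 floats per 128-bit vector
void addArraysWasmSimd(const float* arr1, const float* arr2, float* result, int size) {
    int i;
    for (i = 0; i <= size - 4; i += 4) {
        v128_t v1 = wasm_v128_load(arr1 + i);   // load 4 floats from arr1
        v128_t v2 = wasm_v128_load(arr2 + i);   // load 4 floats from arr2
        v128_t sum = wasm_f32x4_add(v1, v2);    // 4 additions in one instruction
        wasm_v128_store(result + i, sum);       // store 4 results
    }
    // Scalar tail for leftover elements
    for (; i < size; i++) {
        result[i] = arr1[i] + arr2[i];
    }
}

Build it with the -msimd128 flag (e.g., emcc -O3 -msimd128), and the WASM engine lowers these 128-bit operations to NEON on ARM and SSE on x86.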
The takeaway? SIMD won’t solve every problem, but when used in the right context, it can be a game-changer.
Related Articles
If you enjoyed this performance optimization deep-dive, you might also like:
- Understanding Virtual and Physical Addresses in Operating Systems - A comprehensive guide to memory management, debugging techniques, and how modern operating systems handle memory through virtual and physical addresses.
If you found this article helpful, feel free to connect with me on X/Twitter or LinkedIn.