Making JavaScript 6x faster with CPU-level optimizations
While learning high-performance computing theory, I wondered: how many of those low-level optimization techniques could actually work in JavaScript and in the browser?
How much faster can we get this function that sums a list of numbers?
function sum(nums) {
  let result = 0;
  for (let val of nums) {
    result += val;
  }
  return result;
}
Measuring performance
Performance benchmarking across different devices is notoriously unreliable—your laptop and my desktop will produce different absolute numbers. To get statistically meaningful comparisons, I’m using Tachometer, which uses repeated sampling to reliably identify performance differences.
All benchmark results below come from running each technique 50 times and comparing the statistical distributions. If you want to reproduce these results, this gist contains the test files, which can be benchmarked with:
npx tachometer 1_simple.html 2_chunks.html 3_simple_wasm.html 4_wasm_simd.html
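For context, each test page is a small HTML file whose inline module script drives the benchmark. Below is a minimal sketch, assuming Tachometer's manual timing mode; the real pages are in the gist.

  // Sketch of the inline module script inside each test page (e.g. 1_simple.html).
  // Tachometer serves a /bench.js module whose start() and stop() delimit the measured span.
  import * as bench from '/bench.js';

  bench.start();
  // ...the code being benchmarked runs here...
  bench.stop();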
Summation techniques
The benchmark measures how long it takes to sum 1 million random integers between 0 and 99. We only benchmark the summation itself, none of the boilerplate. All numbers are from running on my machine and will differ on yours.
Technique 1 - A naive sum
The naive sum function looks like so:
/// Implementation in 1_simple.html
function sum(nums) {
  let result = 0;
  for (let val of nums) {
    result += val;
  }
  return result;
}
Given a list of `nums`, we iterate over each number and add it to the `result` accumulator. At the end of the loop, the `result` accumulator contains the total sum of all the elements in the `nums` array.
The function is benchmarked with the following code:
const randomNumbers = Array.from(
  { length: 1_000_000 },
  () => Math.floor(Math.random() * 100));

bench.start(); // Tachometer benchmark span start
console.log(sum(randomNumbers));
bench.stop(); // Tachometer benchmark span end
And running `npx tachometer 1_simple.html` returns an average time of 7.04ms - 7.15ms.
So there is our baseline. Let’s see if we can do better.
Technique 2 - Chunking summation into four accumulators
Instead of the loop advancing one item at a time, we can jump four items at a time and use four accumulators.
/// Implementation in 2_chunks.html
function sum(nums) {
  let a = 0, b = 0, c = 0, d = 0;
  for (let i = 0; i < nums.length; i += 4) {
    a += nums[i];
    b += nums[i + 1] ?? 0;
    c += nums[i + 2] ?? 0;
    d += nums[i + 3] ?? 0;
  }
  return a + b + c + d;
}
Measuring this with Tachometer against the simple sum technique gives the following result:
| Benchmark | Bytes | Avg time | vs 1_simple.html | vs 2_chunks.html |
|---|---|---|---|---|
| 1_simple.html | 1.00 KiB | 7.02ms - 7.08ms | - | slower 437% - 453% (5.72ms - 5.80ms) |
| 2_chunks.html | 1.20 KiB | 1.28ms - 1.31ms | faster 81% - 82% (5.72ms - 5.80ms) | - |
We have improved performance by roughly 5x: the original implementation takes about 7ms on my machine, whilst the chunking implementation takes only 1.28ms.
Why do four accumulators make the summation so much faster? It comes down to the underlying CPU hardware, specifically instruction-level parallelism and pipeline hazards.
Modern CPUs provide instruction-level parallelism that you get for free: they will try to run several instructions at once, unless they hit a hazard. When that happens, the CPU must wait for the hazard to resolve before computation continues. The simple sum function creates exactly such a data hazard: every addition writes to the same accumulator, so each one has to wait for the previous result, and they cannot run in parallel. With four independent accumulators, the CPU is free to work on all four additions simultaneously.
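To make the difference concrete, here is a tiny illustration (not benchmark code) of the two dependency structures:

  // Toy example with 8 values, just to show the dependency chains.
  const n = [3, 1, 4, 1, 5, 9, 2, 6];

  // One accumulator: a single serial chain, each addition waits for the last.
  const serial = ((((((n[0] + n[1]) + n[2]) + n[3]) + n[4]) + n[5]) + n[6]) + n[7];

  // Four accumulators: four short, independent chains the CPU can overlap.
  const a = n[0] + n[4];
  const b = n[1] + n[5];
  const c = n[2] + n[6];
  const d = n[3] + n[7];

  console.log(serial === a + b + c + d); // true: same result, different dependency structure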
But can we do even better?
Technique 3 - Rewrite the sum function in WebAssembly
WebAssembly, or Wasm for short, is a binary instruction format that essentially lets you write assembly-level code that runs in the browser. Ideally, you write your code in C++ or Rust and compile it to Wasm, which lets C++ or Rust run in the browser side by side with JavaScript.
To inspect the Wasm you’ll need to look at the gist, because it’s quite long.
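For context, the JavaScript glue that loads and calls the module looks roughly like the sketch below. The export names, file name, and calling convention here are assumptions on my part; see the gist for the real code.

  // Assumed exports: a `memory` large enough for the input and a
  // `sum(byteOffset, length)` function. The real names are in the gist.
  const { instance } = await WebAssembly.instantiateStreaming(
    fetch('sum_scalar.wasm'));
  const { memory, sum: wasmSum } = instance.exports;

  // Copy the numbers into Wasm linear memory as 32-bit integers,
  // then ask the module to sum that many values starting at offset 0.
  const input = new Int32Array(memory.buffer, 0, randomNumbers.length);
  input.set(randomNumbers);
  const total = wasmSum(0, randomNumbers.length);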
The Tachometer results are:
| Benchmark | Bytes | Avg time | vs 1_simple.html | vs 3_simple_wasm.html |
|---|---|---|---|---|
| 1_simple.html | 1.00 KiB | 7.04ms - 7.12ms | - | slower 199% - 209% (4.70ms - 4.80ms) |
| 3_simple_wasm.html | 3.38 KiB | 2.29ms - 2.36ms | faster 67% - 68% (4.70ms - 4.80ms) | - |
The simple Wasm solution is slower than the chunking solution, but still faster than the naive JavaScript solution. However, Wasm has another secret weapon: "Single instruction, multiple data", or SIMD.
Technique 4 - Sum function in WebAssembly using SIMD
As a quick caveat: based on CanIUse statistics, 93.27% of users have Wasm SIMD support. If you do use this, you may still want to feature-detect it and fall back to the scalar Wasm solution, as sketched below. Still, most users can benefit from this optimization.
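One way to do that detection (an assumption on my part, using the third-party wasm-feature-detect package rather than anything from the original benchmarks) might look like this:

  // Probe for SIMD support and pick a module accordingly.
  // The file names here are hypothetical.
  import { simd } from 'wasm-feature-detect';

  const wasmUrl = (await simd()) ? 'sum_simd.wasm' : 'sum_scalar.wasm';
  const { instance } = await WebAssembly.instantiateStreaming(fetch(wasmUrl));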
So what even is SIMD? It provides instructions that operate on groups of data at once, rather than one value at a time.
Taking the prior chunking example code:
function sum(nums) {
  let a = 0, b = 0, c = 0, d = 0;
  for (let i = 0; i < nums.length; i += 4) {
    a += nums[i];
    b += nums[i + 1] ?? 0;
    c += nums[i + 2] ?? 0;
    d += nums[i + 3] ?? 0;
  }
  return a + b + c + d;
}
JavaScript does not support SIMD directly, so below is pseudocode of what SIMD roughly looks like:
function sum(nums) {
  // Pseudocode SIMD vector of 4 zeros.
  let accumulator = new Int32x4(0, 0, 0, 0);
  for (let i = 0; i < nums.length; i += 4) {
    // Load 4 numbers at once into a SIMD vector.
    let chunk = new Int32x4(
      nums[i] ?? 0,
      nums[i + 1] ?? 0,
      nums[i + 2] ?? 0,
      nums[i + 3] ?? 0
    );
    // With one instruction, add all four numbers.
    accumulator = Int32x4.add(accumulator, chunk);
  }
  // Extract the 4 values and sum them.
  return accumulator.x + accumulator.y + accumulator.z + accumulator.w;
}
And when the `sum` function is implemented in Wasm using real Wasm SIMD instructions, we get:
| Benchmark | Bytes | Avg time | vs 1_simple.html | vs 4_wasm_simd.html |
|---|---|---|---|---|
| 1_simple.html | 1.00 KiB | 7.06ms - 7.13ms | - | slower 502% - 525% (5.90ms - 5.98ms) |
| 4_wasm_simd.html | 5.21 KiB | 1.14ms - 1.18ms | faster 83% - 84% (5.90ms - 5.98ms) | - |
We’ve made it to the 6 times speedup promised by the title!
Bonus technique: Array.prototype.reduce
You may be wondering about the one-liner solution:
function sum(nums) {
  return nums.reduce((acc, v) => acc + v);
}
It is actually faster than the simple loop, but not as fast as chunking, coming in at 5.56ms vs the naive loop's 7.03ms:
| Benchmark | Bytes | Avg time | vs 1_simple.html | vs reduce implementation |
|---|---|---|---|---|
| 1_simple.html | 1.00 KiB | 7.03ms - 7.45ms | - | slower 25% - 33% (1.43ms - 1.86ms) |
| reduce implementation | 0.96 KiB | 5.56ms - 5.63ms | faster 20% - 25% (1.43ms - 1.86ms) | - |
Conclusion
I’ve spent most of my professional career as a frontend developer and have rarely considered the underlying hardware beneath the browser or Node.js.
I did not anticipate how much faster code could get with changes that look like they should be slower.
This post explores a problem that often doesn't need to be optimized; the naive solution is likely fast enough. But I think learning how code can be made faster is a gateway to all of us being more critical about the code we write, and to software becoming faster.
Edit (2025-05-26): After talking to a V8 engineer
Thank you to Peter Marshall, who went through and highlighted the assembly created by V8 using Compiler Explorer.
Thanks to him, I now have answers as to why the first sum example is so slow.
The first `sum` function uses `for-of`, so the compiled assembly within the loop body contains the following checks:
- Check that the array still contains only integers (no non-integer values have appeared).
- Check array length didn’t change.
- Check that there are no holes in it.
- Check that the array still has the Array prototype.
- Check that the Array prototype has the original `next` method.
Changing the first `sum` function to use `for (let i = 0; i < nums.length; i++)` reduces the time to around 1.5ms. On Chrome and Safari it's still faster to do the super-scalar accumulating, but the benefits are no longer as dramatic.
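For reference, that indexed-loop variant of the naive sum looks like this:

  // Same naive sum, but with an indexed for loop instead of for-of,
  // which avoids the iterator-protocol checks listed above.
  function sum(nums) {
    let result = 0;
    for (let i = 0; i < nums.length; i++) {
      result += nums[i];
    }
    return result;
  }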
Surprisingly, Firefox regressed in speed slightly when using the multiple accumulator approach.
All browsers have significant speed improvements when using Wasm with SIMD to calculate the sum.
Additional notes
I tried to code a compute shader that would sum the numbers, but I could not make one that was performant. Someone else may be able to implement something faster than the SIMD solution.
If you're still reading, you may be interested in the source material that inspired this post.
- video: WebAssembly at Google by Thomas Steiner & Thomas Nattestad @ Wasm I/O 2024.
- Great video that shares concrete Google features built using Wasm and SIMD.
- Draft Free Book: Algorithms for Modern Hardware by Sergey Slotin.
- video: The Art of SIMD Programming by Sergey Slotin