Making JavaScript 6x faster with CPU-level optimizations

While learning high-performance computing theory, I wondered: how many of those low-level optimization techniques could actually work in JavaScript and in the browser?

How much faster can we get this function that sums a list of numbers?

function sum(nums) {
  let result = 0;
  for (let val of nums) {
    result += val;
  }
  return result;
}

Measuring performance

Performance benchmarking across different devices is notoriously unreliable—your laptop and my desktop will produce different absolute numbers. To get statistically meaningful comparisons, I’m using Tachometer, which uses repeated sampling to reliably identify performance differences.

All benchmark results below come from running each technique 50 times and comparing the statistical distributions. If you want to reproduce these results, this gist contains the test files, which can be benchmarked with:

npx tachometer 1_simple.html 2_chunks.html 3_simple_wasm.html 4_wasm_simd.html

Summation techniques

The benchmark is how long it takes to sum 1 million random integers between 0 and 99. We will only benchmark the summation, and none of the boilerplate. All numbers are from running on my machine and will differ on your machine.

Technique 1 - A naive sum

The naive sum function looks like so:

/// Implementation in 1_simple.html
function sum(nums) {
  let result = 0;
  for (let val of nums) {
    result += val;
  }
  return result;
}

Given a list of nums, we iterate over each number and add it to the result accumulator. At the end of the loop, the result accumulator contains the total sum of all the elements in the nums array.

The function is benchmarked with the following code:

const randomNumbers = Array.from(
  { length: 1_000_000 },
  () => Math.floor(Math.random() * 100));

bench.start(); // Tachometer benchmark span start
console.log(sum(randomNumbers));
bench.stop(); // Tachometer benchmark span end

And running npx tachometer 1_simple.html returns an average time of 7.04ms - 7.15ms.

So there is our baseline. Let’s see if we can do better.

Technique 2 - Chunking summation into four accumulators

Instead of the loop going one item at a time, we can jump four items at a time and use four accumulators.

/// Implementation in 2_chunks.html
function sum(nums) {
  let a = 0, b = 0, c = 0, d = 0;
  for (let i = 0; i < nums.length; i += 4) {
    a += nums[i];
    b += nums[i + 1] ?? 0;
    c += nums[i + 2] ?? 0;
    d += nums[i + 3] ?? 0;
  }
  return a + b + c + d;
}

Measuring this with Tachometer against the simple sum technique gives the following result:

Benchmark       Bytes     Avg time         vs 1_simple.html                      vs 2_chunks.html
1_simple.html   1.00 KiB  7.02ms - 7.08ms  -                                     slower: 437% - 453% (5.72ms - 5.80ms)
2_chunks.html   1.20 KiB  1.28ms - 1.31ms  faster: 81% - 82% (5.72ms - 5.80ms)   -

We have improved our performance by roughly a factor of five. The original implementation takes around 7ms on my machine, whilst the chunking implementation takes only 1.28ms.

Why do four accumulators make the summation code so much faster? It’s due to the underlying CPU hardware.

For background, read about Instruction-Level Parallelism and Pipeline Hazards.

Modern CPUs provide instruction-level parallelism that you get for free. However, there are pipeline hazards that can prevent the CPU from executing in parallel.

In short, the CPU will try and run instructions in parallel, unless it encounters a hazard. The CPU must then wait for the hazard to be resolved, and then computation continues. The simple sum function contains a hazard because multiple instructions will try to add to the same accumulator – which cannot happen in parallel. By having four accumulators, we allow the CPU to write to all four simultaneously if it can.
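To make the hazard concrete, compare the data dependencies in the two loop bodies:

// Naive loop body: a single serial dependency chain.
// Each addition must wait for the previous write to result.
result += nums[i];
result += nums[i + 1];

// Chunked loop body: four independent chains.
// The CPU is free to execute these additions in parallel.
a += nums[i];
b += nums[i + 1];
c += nums[i + 2];
d += nums[i + 3];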

But can we do even better?

Technique 3 - Rewrite the sum function in WebAssembly

WebAssembly, or Wasm for short, is a binary instruction format that essentially lets you write assembly that runs in the browser. Ideally, you’d write your code in C++ or Rust and then compile it to Wasm. This allows running C++ or Rust in the browser side by side with JavaScript.

To inspect the Wasm you’ll need to look at the gist, because it’s quite long.
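For context, here is a rough sketch of how the page could load and call such a module from JavaScript. The file name, the export names (memory and sum) and the (pointer, length) calling convention are assumptions for illustration; the actual module and glue code live in the gist.

// Rough sketch of calling the Wasm sum from JavaScript.
// File name, export names and calling convention are assumptions.
async function sumWithWasm(nums) {
  const { instance } = await WebAssembly.instantiateStreaming(
    fetch('3_simple_wasm.wasm')
  );
  // Copy the numbers into the module's linear memory as 32-bit integers.
  // (The memory must already be large enough to hold them.)
  const i32 = new Int32Array(instance.exports.memory.buffer, 0, nums.length);
  i32.set(nums);
  // A single call crosses the JS/Wasm boundary; the loop runs entirely in Wasm.
  return instance.exports.sum(0, nums.length);
}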

The Tachometer results are:

Benchmark           Bytes     Avg time         vs 1_simple.html                      vs 3_simple_wasm.html
1_simple.html       1.00 KiB  7.04ms - 7.12ms  -                                     slower: 199% - 209% (4.70ms - 4.80ms)
3_simple_wasm.html  3.38 KiB  2.29ms - 2.36ms  faster: 67% - 68% (4.70ms - 4.80ms)   -

The simple Wasm solution is slower than the chunking solution, but still faster than the naive JavaScript solution. However, Wasm has another secret weapon, “Single instruction, multiple data” or SIMD.

Technique 4 - Sum function in WebAssembly using SIMD

As a quick caveat, based on CanIUse statistics, 93.27% of users have Wasm SIMD support. If you do use this, you may still want to feature detect it and fall back on the scalar Wasm solution. However, most users can benefit from this optimization.
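One way to feature detect is sketched below, using the wasm-feature-detect library; the library choice and the file names are assumptions for illustration.

// Sketch: choose the SIMD build when supported, otherwise fall back to the
// scalar Wasm build. Library and file names are assumptions.
import { simd } from 'wasm-feature-detect';

async function loadSumModule() {
  const hasSimd = await simd();
  const file = hasSimd ? '4_wasm_simd.wasm' : '3_simple_wasm.wasm';
  const { instance } = await WebAssembly.instantiateStreaming(fetch(file));
  return instance;
}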

So what even is SIMD? It provides instructions that operate on groups of values at once, rather than one value at a time.

Taking the prior chunking example code:

function sum(nums) {
  let a = 0, b = 0, c = 0, d = 0;
  for (let i = 0; i < nums.length; i += 4) {
    a += nums[i];
    b += nums[i + 1] ?? 0;
    c += nums[i + 2] ?? 0;
    d += nums[i + 3] ?? 0;
  }
  return a + b + c + d;
}

JavaScript does not support SIMD directly, so below is pseudocode of roughly what SIMD looks like:

function sum(nums) {
  // Pseudocode SIMD vector of 4 zeros.
  let accumulator = new Int32x4(0, 0, 0, 0);
  
  for (let i = 0; i < nums.length; i += 4) {
    // Load 4 numbers at once into a SIMD vector.
    let chunk = new Int32x4(
      nums[i] ?? 0,
      nums[i + 1] ?? 0, 
      nums[i + 2] ?? 0,
      nums[i + 3] ?? 0
    );
    
    // With one instruction, add all four numbers.
    accumulator = Int32x4.add(accumulator, chunk);
  }
  
  // Extract the 4 values and sum them.
  return accumulator.x + accumulator.y + accumulator.z + accumulator.w;
}

And when the sum function is implemented in Wasm using real Wasm SIMD functions, we get:

Benchmark         Bytes     Avg time         vs 1_simple.html                      vs 4_wasm_simd.html
1_simple.html     1.00 KiB  7.06ms - 7.13ms  -                                     slower: 502% - 525% (5.90ms - 5.98ms)
4_wasm_simd.html  5.21 KiB  1.14ms - 1.18ms  faster: 83% - 84% (5.90ms - 5.98ms)   -

We’ve made it to the 6 times speedup promised by the title!

Bonus technique: Array.prototype.reduce

You may be wondering about the one-liner solution:

function sum(nums) {
  return nums.reduce((acc, v) => acc + v);
}

It is actually faster than the simple loop, but not as fast as chunking, coming in at 5.56ms versus the naive loop’s 7.03ms:

Benchmark              Bytes     Avg time         vs 1_simple.html                      vs reduce implementation
1_simple.html          1.00 KiB  7.03ms - 7.45ms  -                                     slower: 25% - 33% (1.43ms - 1.86ms)
reduce implementation  0.96 KiB  5.56ms - 5.63ms  faster: 20% - 25% (1.43ms - 1.86ms)   -

Conclusion

I’ve spent most of my professional career as a frontend developer and have rarely considered the underlying hardware beneath the browser or Node.js.

I did not anticipate how much faster code could get with changes that look like they should be slower.

This post explores a problem that often doesn’t need to be optimized; the naive solution is likely fast enough. But I think learning how code can be made faster is a gateway to all of us being more critical about the code we write, and to software becoming faster.

Edit (2025-05-26): After talking to a V8 engineer

Thank you to Peter Marshall, who went through and highlighted the assembly generated by V8 using Compiler Explorer.

Thanks to him, I now have answers as to why the first sum example is so slow.

The first sum function uses a for-of loop, so the compiled assembly within the loop body carries the overhead of the iterator protocol checks on every iteration.

Changing the first sum function to use a plain for (let i = 0; i < nums.length; i++) loop, shown below, reduces the time to around 1.5ms. On Chrome and Safari it’s still faster to do the superscalar accumulating, but the benefits are no longer orders of magnitude.
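That indexed variant looks like this:

function sum(nums) {
  let result = 0;
  // Plain indexed loop: no iterator protocol inside the loop body.
  for (let i = 0; i < nums.length; i++) {
    result += nums[i];
  }
  return result;
}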

Surprisingly, Firefox regressed in speed slightly when using the multiple accumulator approach.

All browsers have significant speed improvements when using Wasm with SIMD to calculate the sum.

Additional notes

I tried to code a compute shader that would sum the numbers, but I could not make one that was performant. Someone else may be able to implement something faster than the SIMD solution.

If you’re still reading, you may be interested in the source material that inspired this post.