James Randall Musings on software development, business and technology.
Building a Real-Time Path Tracer in WebGPU

Having built a raycaster for my Wolfenstein 3D recreation I wanted to take things to the next level and build a “simple” ray tracer. The result is a real-time path tracer written entirely in WebGPU compute shaders, running in a browser, rendering levels loaded from actual Doom WAD files. No hardware ray tracing cores, no ML denoisers, no engine — just maths, triangles, and a GPU doing a LOT of work.

But when you try to figure out what happens when you fire a ray of light from a virtual camera into a scene made of triangles, things get complicated fast. You get physically correct lighting — soft shadows, colour bleeding, natural falloff — the kind of rendering that looks right in a way that traditional rasterisation never quite manages. You also get noise, performance problems, and a sharp reminder that even a £1,600 GPU can’t brute-force the physics. In fact it still can’t even come close.

This post walks through the construction of that renderer: from the rendering equation through BVH acceleration, Monte Carlo importance sampling, temporal accumulation, and spatial denoising — everything needed to go from a single noisy ray to a converged image. The Doom WAD loader parses the original 1993 level geometry and rebuilds it as a triangle mesh that the path tracer can chew on, which means you can wander through E1M1 lit by nothing but a handheld torch and physically correct light transport. The renderer is embedded below — you can interact with it directly. These demos are computationally expensive and so you’ll need to press PLAY first — I didn’t want the first thing you knew about these interactive demos to be fan noise! If it runs really slowly… drop the resolution. On my Mac I get 60fps with these defaults.

I’m not going to pretend this came out as a coherent whole. Getting a simple scene on the screen was (relatively) simple but I’ve backed and forthed with denoisers, different optimisations, tweaking numbers, chunking things differently - the list is endless. I started tinkering with it a couple of months ago. Even writing it all up took an age.

I was learning as I went, rewriting and rewriting code with the help of an LLM. It’s been fascinating, and the slightly messy code is the result of that learning process - but that learning process hopefully helps me explain how it works and why real-time path tracing without tricks is still a pipe dream on a GPU.

In any case - I hope I’ve done the topic justice.


The Rendering Equation

Every pixel on screen represents a measurement of light. The question is: how much light arrives at the camera through this pixel? The answer is the rendering equation, introduced by James Kajiya in 1986:

$$L_o(\mathbf{x}, \omega_o) = L_e(\mathbf{x}, \omega_o) + \int_{\Omega} f_r(\mathbf{x}, \omega_i, \omega_o) \, L_i(\mathbf{x}, \omega_i) \, (\omega_i \cdot \mathbf{n}) \, d\omega_i$$

In plain terms: the light leaving a point $\mathbf{x}$ in direction $\omega_o$ is the light emitted at that point, plus all the light arriving from every direction in the hemisphere above the surface, scaled by the material’s reflectance function and the angle of incidence.

That integral over the hemisphere is the hard part. There’s no closed-form solution for arbitrary scenes — you can’t solve it algebraically. Instead, we must estimate it by random sampling: fire rays in random directions, see what light they find, and average the results. This is Monte Carlo integration, and it’s the foundation of everything that follows.
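To make that concrete, here’s a toy sketch (TypeScript, not renderer code) of Monte Carlo integration on a one-dimensional integral with a known answer, ∫₀^π sin(x) dx = 2: average f(x)/p(x) over uniform random samples and watch the estimate converge.

```typescript
// Toy Monte Carlo estimator for ∫₀^π sin(x) dx = 2.
// Sampling x uniformly over [0, π] means p(x) = 1/π for every sample,
// so each sample contributes sin(x) / (1/π) = π·sin(x).
function monteCarloSin(n: number): number {
  // Small deterministic PRNG (mulberry32-style) so the run is reproducible.
  let state = 0x9e3779b9;
  const rand = (): number => {
    state = (state + 0x6d2b79f5) | 0;
    let t = Math.imul(state ^ (state >>> 15), 1 | state);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };

  let sum = 0;
  for (let i = 0; i < n; i++) {
    const x = rand() * Math.PI;         // sample x_i uniformly in [0, π]
    sum += Math.sin(x) / (1 / Math.PI); // f(x_i) / p(x_i)
  }
  return sum / n;                       // average of the weighted samples
}

const estimate = monteCarloSin(100_000); // approaches 2 as n grows
```

The same estimator, swapped to a hemisphere domain and a radiance integrand, is exactly what each pixel of the path tracer computes.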

And I’m not going to lie: I’m not a mathematician (I know just enough to be dangerous and can usually get there, I just have to frown a lot) and had to go round this loop a few times before I understood what was happening. It took me a while to appreciate that you can’t solve it as such; you have to approximate it. I got there by backtracking from the noise: asking why the noise was there in the first place.


Anatomy of a Ray

A ray is defined by an origin point and a direction:

$$\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}, \quad t > 0$$

For the primary ray — camera to scene — the origin is the camera position and the direction is computed from the pixel coordinates and the field of view:

fn generate_ray(pixel: vec2f) -> vec3f {
  let aspect = camera.resolution.x / camera.resolution.y;
  let fov_scale = tan(camera.fov * 0.5);

  let ndc = vec2f(
    (2.0 * pixel.x / camera.resolution.x - 1.0) * aspect * fov_scale,
    (1.0 - 2.0 * pixel.y / camera.resolution.y) * fov_scale
  );

  let forward = normalize(camera.direction);
  let right = normalize(cross(forward, camera.up));
  let up = cross(right, forward);

  return normalize(forward + right * ndc.x + up * ndc.y);
}

The pixel coordinates are mapped to Normalised Device Coordinates (NDC), scaled by the field of view and aspect ratio, then transformed into a world-space direction using the camera’s orientation vectors. Each pixel gets a slightly different direction, creating the perspective projection.


Finding What a Ray Hits

Once we have a ray, we need to find the closest surface it intersects. Every surface in the scene is made of triangles, so this reduces to: for each triangle, does this ray hit it, and if so, at what distance $t$?

The standard ray–triangle intersection test is the Möller–Trumbore Algorithm. Given a ray $(\mathbf{o}, \mathbf{d})$ and a triangle with vertices $\mathbf{v}_0, \mathbf{v}_1, \mathbf{v}_2$, we compute:

$$\mathbf{e}_1 = \mathbf{v}_1 - \mathbf{v}_0, \quad \mathbf{e}_2 = \mathbf{v}_2 - \mathbf{v}_0$$

$$\mathbf{h} = \mathbf{d} \times \mathbf{e}_2, \quad a = \mathbf{e}_1 \cdot \mathbf{h}$$

If $a \approx 0$ (the squiggly equals means approximate), then the ray is parallel to the triangle. Otherwise, we compute the barycentric coordinates $(u, v)$ of the intersection point:

$$f = \frac{1}{a}, \quad \mathbf{s} = \mathbf{o} - \mathbf{v}_0$$

$$u = f(\mathbf{s} \cdot \mathbf{h}), \quad \mathbf{q} = \mathbf{s} \times \mathbf{e}_1, \quad v = f(\mathbf{d} \cdot \mathbf{q})$$

$$t = f(\mathbf{e}_2 \cdot \mathbf{q})$$

If $u \in [0, 1]$, $v \in [0, 1]$, $u + v \leq 1$, and $t > 0$, we have a valid hit. The barycentric coordinates (ahh, my old friend: I first came across these while working on Wolfenstein, which isn’t surprising given it’s a raycaster and this is a path tracer, and now they seem to pop up everywhere) also let us interpolate texture coordinates and normals across the triangle surface.
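As a sketch of that interpolation (in TypeScript rather than WGSL, and outside the renderer): the hit point’s attribute is a weighted sum of the three vertex attributes, with v0 weighted by the remainder 1 − u − v.

```typescript
type Vec2 = { x: number; y: number };

// Interpolate a per-vertex attribute (here a UV texture coordinate) at the
// hit point using the barycentric coordinates (u, v) returned by the
// Möller–Trumbore test. v1 gets weight u, v2 gets weight v, and v0 gets
// the remainder 1 - u - v.
function interpolateUV(uv0: Vec2, uv1: Vec2, uv2: Vec2, u: number, v: number): Vec2 {
  const w = 1 - u - v;
  return {
    x: w * uv0.x + u * uv1.x + v * uv2.x,
    y: w * uv0.y + u * uv1.y + v * uv2.y,
  };
}

// Hitting the exact centroid (u = v = 1/3) averages the three corners.
const centroidUV = interpolateUV({ x: 0, y: 0 }, { x: 1, y: 0 }, { x: 0, y: 1 }, 1 / 3, 1 / 3);
```

The same weighted sum works for normals, colours, or any other per-vertex data.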

fn intersect_triangle_uv(ray_origin: vec3f, ray_dir: vec3f,
                          v0: vec3f, v1: vec3f, v2: vec3f) -> TriangleHitResult {
  var result: TriangleHitResult;
  result.t = -1.0;

  let edge1 = v1 - v0;
  let edge2 = v2 - v0;
  let h = cross(ray_dir, edge2);
  let a = dot(edge1, h);

  if (abs(a) < 0.00001) { return result; }

  let f = 1.0 / a;
  let s = ray_origin - v0;
  let u = f * dot(s, h);
  if (u < 0.0 || u > 1.0) { return result; }

  let q = cross(s, edge1);
  let v = f * dot(ray_dir, q);
  if (v < 0.0 || u + v > 1.0) { return result; }

  let t = f * dot(edge2, q);
  if (t > 0.0001) {
    result.t = t;
    result.u = u;
    result.v = v;
  }
  return result;
}

This is elegant and fast for a single triangle; the problem is that a scene can contain thousands of them.


The Brute Force Problem

The naive approach is to test every ray against every triangle. For a scene with $N$ triangles and an image of $W \times H$ pixels, each with $S$ samples and $B$ bounces, that’s:

$$\text{Tests} = W \times H \times S \times B \times N$$

For our dungeon scene at 1280×800, 4 samples per pixel, 4 bounces, and 3000 triangles:

$$1280 \times 800 \times 4 \times 4 \times 3000 = 49{,}152{,}000{,}000$$

Forty-nine billion intersection tests per frame. At maybe 1 nanosecond per test, that’s 49 seconds per frame. Not exactly real-time. Though that said it still renders faster than a Mandelbrot on my BBC Micro - which is kind of astounding!
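The arithmetic, as a throwaway sketch you can check:

```typescript
// Brute-force cost: every ray tests every triangle.
// tests = W × H × S × B × N, straight from the formula above.
function bruteForceTests(w: number, h: number,
                         samples: number, bounces: number, tris: number): number {
  return w * h * samples * bounces * tris;
}

const tests = bruteForceTests(1280, 800, 4, 4, 3000); // 49,152,000,000 tests
const secondsPerFrame = tests * 1e-9;                 // ~49s at 1ns per test
```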


Bounding Volume Hierarchies

The solution is not to test most triangles at all: organise them so that each ray only ever tests a small fraction. A Bounding Volume Hierarchy (BVH) organises triangles into a tree of axis-aligned bounding boxes (AABBs). Each node contains a box that tightly encloses all the triangles in its subtree. Before testing any triangles, we test the ray against the box. If it misses the box, we skip the entire subtree — potentially thousands of triangles — in a single cheap test.

You’ll come across similar spatial organisation strategies everywhere.

Ray–AABB Intersection

Testing a ray against an axis-aligned box is much cheaper than testing against a triangle (again, shades of Wolfenstein, which is naturally axis-aligned given its grid-based maps). We compute the entry and exit distances along each axis:

fn intersect_aabb(ray_origin: vec3f, ray_dir_inv: vec3f,
                  box_min: vec3f, box_max: vec3f, max_t: f32) -> bool {
  let t1 = (box_min - ray_origin) * ray_dir_inv;
  let t2 = (box_max - ray_origin) * ray_dir_inv;

  let tmin = max(max(min(t1.x, t2.x), min(t1.y, t2.y)), min(t1.z, t2.z));
  let tmax = min(min(max(t1.x, t2.x), max(t1.y, t2.y)), max(t1.z, t2.z));

  return tmax >= tmin && tmin < max_t && tmax > 0.0;
}

Note ray_dir_inv — the reciprocal of the direction, precomputed once per ray. This avoids a division per axis per box test.

Building the Tree

The BVH is constructed on the CPU using the Surface Area Heuristic (SAH). At each node, we evaluate every possible split position along each axis. The cost of a split is:

$$C = C_{\text{traversal}} + \frac{SA_L}{SA_P} \cdot N_L \cdot C_{\text{intersect}} + \frac{SA_R}{SA_P} \cdot N_R \cdot C_{\text{intersect}}$$

Where $SA_L, SA_R, SA_P$ are the surface areas of the left, right, and parent bounding boxes, and $N_L, N_R$ are the triangle counts in each child. We pick the split that minimises this cost. The surface area ratio estimates the probability of a random ray hitting each child, so the heuristic minimises the expected number of intersection tests.
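Sketched in TypeScript (the names here are hypothetical; the post doesn’t show the builder’s internals), the cost function reads almost exactly like the formula:

```typescript
interface AABB { min: [number, number, number]; max: [number, number, number]; }

// Surface area of an axis-aligned box: 2(wd + wh + dh).
function surfaceArea(b: AABB): number {
  const w = b.max[0] - b.min[0];
  const h = b.max[1] - b.min[1];
  const d = b.max[2] - b.min[2];
  return 2 * (w * d + w * h + d * h);
}

// SAH cost of one candidate split: a fixed traversal cost, plus each child's
// intersection work weighted by its hit probability (the area ratio).
// The traversal/intersection cost constants are illustrative defaults.
function sahCost(parent: AABB, left: AABB, nLeft: number,
                 right: AABB, nRight: number,
                 cTraversal = 1.0, cIntersect = 2.0): number {
  const saP = surfaceArea(parent);
  return cTraversal
    + (surfaceArea(left) / saP) * nLeft * cIntersect
    + (surfaceArea(right) / saP) * nRight * cIntersect;
}
```

The builder evaluates this for every candidate split and keeps the minimum.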

The tree is flattened into a contiguous array for GPU consumption. Each node stores its AABB bounds plus either child indices (for internal nodes) or a triangle range (for leaves). A high bit flag distinguishes the two:

struct BVHNode {
  min_bounds: vec3f,
  left_child_or_first_tri: u32,
  max_bounds: vec3f,
  right_child_or_count: u32,   // High bit set = leaf
}
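On the CPU side, packing and testing that high bit might look like this (a sketch; the helper names are mine, not from the renderer):

```typescript
const LEAF_FLAG = 0x80000000;

// Mark a leaf by setting the high bit of the count field.
// >>> 0 keeps the result as an unsigned 32-bit value in JavaScript.
function packLeafCount(triCount: number): number {
  return (triCount | LEAF_FLAG) >>> 0;
}

function isLeaf(rightChildOrCount: number): boolean {
  return (rightChildOrCount & LEAF_FLAG) !== 0;
}

// For a leaf, the low 31 bits hold the triangle count.
function leafTriCount(rightChildOrCount: number): number {
  return rightChildOrCount & 0x7fffffff;
}
```

Stealing a bit this way keeps the node at a GPU-friendly 32 bytes instead of adding a separate type field.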

Walking the Tree

Traversal uses an explicit stack on the GPU as we can’t use recursion in WGSL compute shaders:

fn trace_bvh_accel(ray_origin: vec3f, ray_dir: vec3f) -> HitInfo {
  var closest_hit: HitInfo;
  closest_hit.t = 1e30;
  closest_hit.hit = false;
  let ray_dir_inv = 1.0 / ray_dir;

  var stack: array<u32, 32>;
  var stack_ptr = 0u;
  stack[0] = 0u;  // Start at root
  stack_ptr = 1u;

  while (stack_ptr > 0u) {
    stack_ptr -= 1u;
    let node = bvh_nodes[stack[stack_ptr]];

    // Skip if ray misses this node's box
    if (!intersect_aabb(ray_origin, ray_dir_inv,
        node.min_bounds, node.max_bounds, closest_hit.t)) {
      continue;
    }

    if (is_leaf(node)) {
      // Test each triangle in this leaf
      for (var i = 0u; i < get_tri_count(node); i++) {
        let verts = tri_verts[node.left_child_or_first_tri + i];
        let hit = intersect_triangle_uv(ray_origin, ray_dir,
            verts.v0, verts.v1, verts.v2);
        if (hit.t > 0.0 && hit.t < closest_hit.t) {
          closest_hit.t = hit.t;
          closest_hit.hit = true;
          // Store triangle index and barycentrics...
        }
      }
    } else {
      // Push both children, nearest first
      // (ordered traversal for better early termination)
      // ...
    }
  }
  return closest_hit;
}

The closest_hit.t threshold is critical: every time we find a closer hit, the threshold tightens, and subsequent AABB tests reject more nodes. Visiting the nearest child first (ordered traversal) maximises the chance of finding a close hit early, which maximises the pruning of the far child.

Visualising the BVH

The debug visualiser makes traversal efficiency visible. The heatmap mode colours each pixel by how many BVH nodes were visited — cool blues for efficient rays, hot reds for expensive ones. You can try it out below; it’s worth experimenting with the different scenes and visualisations. The debugger will start in windowed mode so you can see the scene and its BVH simultaneously, but you can also overlay it.

Notice how rays that pass through the dense cluster of cubes on the right are more expensive (warmer colours) than rays hitting the well-separated cubes on the left. Same number of triangles, very different traversal cost, purely because of spatial arrangement.

The wireframe mode shows the bounding boxes at each level of the tree. Step through the depths to see how the BVH recursively subdivides space:

At depth 0, one box surrounds the entire scene. At depth 1, the first split separates the room roughly in half. By depth 4 or 5, individual objects have their own tight boxes and the spatial partitioning follows the geometry closely.

The diagonal slab cutting across the scene is deliberately pathological — its AABB spans nearly the entire room, overlapping everything. The heatmap shows the cost: rays near the diagonal visit more nodes because the BVH can’t exclude it from any spatial query. This is why axis-aligned BVHs struggle with geometry that doesn’t align with any axis.

A BVH Per Tile: Making the Player’s Torch Work

Everything described so far assumes a static scene — the BVH is built once on the CPU, uploaded to the GPU, and never changes. That works for the Cornell box and the BVH teaching scene. But in the dungeon the player is a light source: they’re carrying a torch and can roam around the map.

The torch itself is straightforward: it’s passed to the shader as a set of uniforms — position, colour, radius, falloff — and the path tracing loop treats it as a point light at the camera position. No geometry, no emissive triangles, just a light that happens to be wherever the player is standing. But the problem is the BVH.

A single global BVH contains every triangle in the dungeon. For a 16×16 tile map that’s around 3,000 triangles — walls, floors, ceilings, torch panels. But the player’s torch has a finite range. Light from a handheld torch doesn’t illuminate walls four rooms away. Most of those 3,000 triangles are irrelevant at any given moment, and the BVH wastes time testing nodes that enclose distant, unlit geometry.

The solution is to precompute a separate BVH for every walkable tile. At initialisation, we iterate over each open cell in the dungeon map and build a distance-culled subset of the scene:

private precomputeBVHsForPositions(): void {
  const positions = this.walkablePositions!;
  const dist = this.renderDistance;
  const distSq = dist * dist;

  for (const pos of positions) {
    // Cull triangles by distance from this tile center (XZ plane)
    const culled: Triangle[] = [];
    for (const tri of this.allTriangles) {
      const d0 = (tri.v0.x - pos.x) ** 2 + (tri.v0.z - pos.z) ** 2;
      const d1 = (tri.v1.x - pos.x) ** 2 + (tri.v1.z - pos.z) ** 2;
      const d2 = (tri.v2.x - pos.x) ** 2 + (tri.v2.z - pos.z) ** 2;
      if (Math.min(d0, d1, d2) <= distSq) {
        culled.push(tri);
      }
    }

    // Build BVH for this subset
    const builder = new BVHBuilder();
    const { nodes, orderedTriangles } = builder.build(culled);
    const flat = flattenBVH(nodes);
    // Store for later swap...
  }
}

The culling is conservative: a triangle is included if any vertex falls within the render distance. This means the BVH for each tile contains only the geometry that could plausibly be illuminated by the player’s torch at that position. A render distance of 10 world units (5 tiles) is enough to capture everything the torch can reach while excluding the far side of the dungeon.

For each walkable tile we store the packed BVH nodes, the reordered triangle data (both hot vertex positions and cold attributes), and the emissive light list — everything the GPU needs, precomputed and ready to upload.

At render time, when the player steps onto a new tile, we detect the change and hot-swap the buffers:

private swapBVHForTile(): void {
  // Find nearest walkable position to camera
  const cx = this.camera.position.x;
  const cz = this.camera.position.z;
  let bestKey = this.currentTileKey;
  let bestDist = Infinity;

  for (const pos of this.walkablePositions!) {
    const dx = pos.x - cx;
    const dz = pos.z - cz;
    const d = dx * dx + dz * dz;
    if (d < bestDist) {
      bestDist = d;
      bestKey = `${pos.x},${pos.z}`;
    }
  }

  if (bestKey === this.currentTileKey) return;

  const entry = this.precomputedBVHs.get(bestKey);
  if (!entry) return;

  // Upload this tile's BVH, triangles, and light list
  this.device.queue.writeBuffer(this.bvhBuffer, 0, entry.bvhData);
  this.device.queue.writeBuffer(this.triangleBuffer, 0, entry.triVertsData.buffer);
  this.device.queue.writeBuffer(this.triAttribsBuffer, 0, entry.triAttribsData.buffer);
  this.device.queue.writeBuffer(this.lightsBuffer, 0, entry.lightData.buffer);
  this.currentTileKey = bestKey;
}

The GPU buffers are pre-allocated to the size of the largest per-tile BVH, so the swap is just a series of writeBuffer calls — no reallocation, no pipeline rebuild. In practice, for our 16×16 dungeon with around 130 walkable tiles, precomputing all the BVHs takes a few hundred milliseconds at startup and the per-tile swap is imperceptible.

The tradeoff is memory: we’re storing ~130 copies of partially overlapping BVH data. For a dungeon this size it’s negligible. For a sprawling open world it wouldn’t scale — you’d want streaming or incremental BVH updates instead. But for a tile-based dungeon where movement is discrete and the render distance is bounded, precomputation is the simplest thing that works, and given everything else going on here I’ll take simple.

You can see this in action in the demonstration below. As you move around the map you’ll see the BVH changing.


Path Tracing

Finding the closest surface is just the beginning. Now we need to compute the light arriving at that surface point — which requires solving the rendering equation.

Monte Carlo Integration and Importance Sampling

Why the integral can’t be solved analytically

The hemisphere integral asks: what’s the total light arriving from every direction above this surface point? For that to have a closed-form solution, you’d need to know the incoming radiance function $L_i(\omega_i)$ everywhere — which itself depends on the geometry of the entire scene, every other surface’s reflectance, every light source, every occlusion. The incoming light at one point depends on the outgoing light at every other point, which depends on the incoming light at those points. It’s recursive and scene-dependent. There’s no formula you can write down and evaluate. You have to go and look.

The basic Monte Carlo idea

Monte Carlo integration is a general technique: if you can’t evaluate an integral, you can estimate it by sampling. The simplest version — pick $N$ random points in the domain, evaluate the function at each, and average — gives you an estimate that converges to the true answer as $N$ grows. The formal statement is:

$$\int f(x) \, dx \approx \frac{1}{N} \sum_{i=1}^{N} \frac{f(x_i)}{p(x_i)}$$

where $p(x_i)$ is the probability density of picking sample $x_i$. If you sample uniformly over the hemisphere, $p$ is a constant ($1/2\pi$ for a uniform hemisphere), and you’re just averaging scaled function evaluations.

This works but it’s inefficient. Most of the light contribution at a diffuse surface comes from directions near the surface normal (overhead), because the $\cos\theta$ term in the rendering equation means glancing rays contribute almost nothing. Uniform sampling wastes effort on those near-horizon directions that barely matter.

And so we factor in importance.

What importance sampling does

The idea is straightforward even if the name sounds fancy: if you sample more densely in the directions that contribute the most, each sample gives you more useful information and the estimate converges faster — less noise for the same sample count. You bias your random sampling toward the regions that matter.

The ideal is to sample proportionally to the integrand itself. For a diffuse surface the integrand includes that $\cos\theta$ factor, so you choose a probability distribution that’s proportional to $\cos\theta$:

$$p(\omega) = \frac{\cos\theta}{\pi}$$

The $\pi$ is just the normalisation constant that makes it integrate to 1 over the hemisphere. Nothing deeper going on there.

The cancellation

Now plug everything into the Monte Carlo estimator. The rendering equation integrand for a diffuse surface is:

$$f_r \cdot L_i(\omega_i) \cdot \cos\theta_i$$

where $f_r = \text{albedo}/\pi$ for a Lambertian BRDF. The Monte Carlo estimator divides by the sampling probability:

$$\frac{f_r \cdot L_i(\omega_i) \cdot \cos\theta_i}{p(\omega_i)} = \frac{(\text{albedo}/\pi) \cdot L_i(\omega_i) \cdot \cos\theta_i}{\cos\theta_i / \pi}$$

The $\cos\theta_i$ cancels top and bottom. The $\pi$ cancels top and bottom. You’re left with:

$$\text{albedo} \cdot L_i(\omega_i)$$

So after all that setup, each sample just multiplies the surface colour by whatever light arrived from the sampled direction. No trigonometry at shading time, no normalisation constants: the importance sampling has absorbed all of it into the sampling distribution itself.

Why dividing by $p(\omega_i)$ matters

This is the bit that can feel counterintuitive. You’re sampling directions near the normal more often — so why do you divide by the probability, which makes each of those frequently-sampled directions count for less?

Because without the division, directions near the normal would be both sampled more often and weighted equally. They’d be double-counted. The $1/p$ correction is what keeps the estimator unbiased: you sample those directions more, so each individual sample has to count for less, and on average it converges to the same answer as if you’d sampled uniformly — just with less noise along the way.

The practical consequence

Your shader code doesn’t need to compute any cosine weighting at shading time. All the geometric complexity is baked into the cosine_hemisphere function that generates the directions — it naturally produces more directions near the normal and fewer near the horizon, and the maths guarantees the estimator stays correct.

fn cosine_hemisphere(normal: vec3f, r1: f32, r2: f32) -> vec3f {
  let phi = 2.0 * PI * r1;
  let cos_theta = sqrt(r2);
  let sin_theta = sqrt(1.0 - r2);

  let x = cos(phi) * sin_theta;
  let y = sin(phi) * sin_theta;
  let z = cos_theta;

  // Build tangent space from normal
  var tangent: vec3f;
  if (abs(normal.y) > 0.999) {
    tangent = normalize(cross(vec3f(1.0, 0.0, 0.0), normal));
  } else {
    tangent = normalize(cross(vec3f(0.0, 1.0, 0.0), normal));
  }
  let bitangent = cross(normal, tangent);

  return tangent * x + bitangent * y + normal * z;
}

The two random inputs r1 and r2 (uniform in $[0, 1]$) are transformed into a direction on the hemisphere. The sqrt(r2) for cos_theta is where the cosine weighting lives — a uniform random variable passed through a square root produces a cosine-weighted distribution. Without that sqrt, you’d get uniform hemisphere sampling. The rest of the function builds a coordinate frame from the surface normal and transforms the local-space direction into world space.
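A direct TypeScript port (handy for poking at the distribution outside the shader) makes the two key properties easy to verify: every generated direction is unit length, and every direction lies in the hemisphere on the normal’s side of the surface.

```typescript
type Vec3 = [number, number, number];

const cross = (a: Vec3, b: Vec3): Vec3 => [
  a[1] * b[2] - a[2] * b[1],
  a[2] * b[0] - a[0] * b[2],
  a[0] * b[1] - a[1] * b[0],
];
const dot = (a: Vec3, b: Vec3): number => a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
const normalize = (a: Vec3): Vec3 => {
  const l = Math.hypot(a[0], a[1], a[2]);
  return [a[0] / l, a[1] / l, a[2] / l];
};

// Cosine-weighted hemisphere sample about `normal`, mirroring the WGSL above.
function cosineHemisphere(normal: Vec3, r1: number, r2: number): Vec3 {
  const phi = 2 * Math.PI * r1;
  const cosTheta = Math.sqrt(r2);     // the sqrt is the cosine weighting
  const sinTheta = Math.sqrt(1 - r2);

  const x = Math.cos(phi) * sinTheta;
  const y = Math.sin(phi) * sinTheta;
  const z = cosTheta;

  // Build a tangent frame, avoiding the degenerate cross product when the
  // normal is (nearly) parallel to the world up axis.
  const tangent = Math.abs(normal[1]) > 0.999
    ? normalize(cross([1, 0, 0], normal))
    : normalize(cross([0, 1, 0], normal));
  const bitangent = cross(normal, tangent);

  // Transform the local-space direction into world space.
  return [
    tangent[0] * x + bitangent[0] * y + normal[0] * z,
    tangent[1] * x + bitangent[1] * y + normal[1] * z,
    tangent[2] * x + bitangent[2] * y + normal[2] * z,
  ];
}
```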

That covers a single bounce: fire a ray, hit a surface, pick a cosine-weighted direction, see what light comes back. But one bounce only captures direct illumination — light that travels straight from a source to a surface to the camera. The rendering equation is recursive. Light bouncing off a wall illuminates the floor, which illuminates the ceiling, which illuminates the wall again. To capture this indirect illumination you need to follow the path further: bounce again, and again, accumulating the effect of each surface’s colour on the light passing through it. That chain of bounces — and knowing when to stop — is what the path tracing loop handles.

Ooof. That’s a lot. As I’ve said before, I didn’t get here in one go. This is the product of experimenting, asking questions, and trying different things, with the goal of finding a decent set of trade-offs between image quality and performance.

The Path Tracing Loop

A single sample traces a path through the scene: camera → surface → bounce → surface → bounce → … until the ray hits a light, escapes the scene, or is terminated by Russian roulette.

At each surface interaction, the throughput tracks how much light is attenuated by the path so far. Each diffuse bounce multiplies the throughput by the surface colour. When the path eventually hits a light source, the emitted light is scaled by the accumulated throughput:

fn path_trace(...) -> vec3f {
  var throughput = vec3f(1.0);
  var radiance = vec3f(0.0);

  for (var bounce = 0u; bounce < max_bounces; bounce++) {
    let hit = trace_scene(ray_origin, ray_dir, bounce);
    if (!hit.hit) { break; }

    let mat = materials[hit.material_index];

    // If we hit a light, add its contribution
    if (mat.material_type == MATERIAL_EMISSIVE) {
      radiance += throughput * mat.emissive;
      break;
    }

    // Attenuate throughput by surface colour
    throughput *= surface_color;

    // Bounce in a random direction
    ray_dir = cosine_hemisphere(normal, random(), random());
    ray_origin = hit_pos + normal * 0.001;

    // Russian roulette: randomly terminate dim paths
    if (bounce > 2u) {
      let p = max(throughput.x, max(throughput.y, throughput.z));
      if (random() > p) { break; }
      throughput /= p;  // Compensate for termination probability
    }
  }
  return radiance;
}

Russian roulette is the mathematically correct way to terminate paths, which to me wasn’t particularly intuitive. To understand why, consider the alternative: capping at a fixed bounce count, say 4 bounces. Any light that would have arrived via a 5th or 6th bounce is simply thrown away, and in a torch-lit stone corridor, where most illumination reaches surfaces after multiple bounces off walls, that’s a lot of lost energy. The image would be systematically too dark: not because of noise, but because of bias. No amount of additional samples will fix it, because every sample is missing the same light.

Russian roulette avoids this by never unconditionally terminating a path. Instead, at each bounce beyond the first few, we randomly terminate paths with a probability proportional to their remaining energy. A tempered randomisation so to speak.

After several bounces, the throughput — the cumulative product of all surface colours along the path — tells you how much light this path could still contribute. A ray that has bounced off three dark stone walls has a throughput near zero. Even if it eventually finds a light source, its contribution to the final pixel will be negligible and continuing to trace it is wasted computation.

So at each bounce we roll the dice. If the maximum component of the throughput is 0.1, there’s a 90% chance we terminate the path right there. The 10% of paths that survive are boosted by $1/p$ — in this case multiplied by 10 — to compensate for all the paths that were killed. Any individual surviving path overestimates its contribution, but averaged across many samples, the overestimation cancels the energy lost from terminated paths. The estimator remains unbiased — on average, it converges to the same answer as if we’d traced every path to infinite depth.

And so with this approach we’re spending our computation budget on the paths that matter.

if (bounce > 2u) {
  let p = max(throughput.x, max(throughput.y, throughput.z));
  if (pcg(rng_state) > p) {
    break;  // Path terminated — it wasn't going to contribute much
  }
  throughput /= p;  // Survivors are boosted to compensate
}

The bounce > 2u guard ensures the first few bounces always run — you don’t want to randomly kill a primary ray. My first cut was missing this guard and I got a very “partial” screen: lots of empty space. But after that, dim paths are likely to be terminated and bright paths continue until they find light or fade to nothing.
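A quick standalone simulation (TypeScript, not renderer code) makes the unbiasedness tangible: kill a fixed contribution with probability 1 − p, boost the survivors by 1/p, and the average over many trials still matches the original value.

```typescript
// Russian roulette on a fixed path contribution: survive with probability p,
// boosted by 1/p to compensate; terminated paths contribute zero. The mean
// over many trials should match the untruncated contribution.
function rouletteMean(contribution: number, p: number, trials: number): number {
  // Small deterministic PRNG (mulberry32-style) for reproducibility.
  let state = 1;
  const rand = (): number => {
    state = (state + 0x6d2b79f5) | 0;
    let t = Math.imul(state ^ (state >>> 15), 1 | state);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };

  let sum = 0;
  for (let i = 0; i < trials; i++) {
    if (rand() <= p) {          // the path survives...
      sum += contribution / p;  // ...boosted to compensate for the kills
    }                           // otherwise it contributes nothing
  }
  return sum / trials;
}
```

With p = 0.25 the estimator does a quarter of the work but, on average, returns the same answer — just with more variance, which the accumulated samples absorb.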


Why Is Everything So Noisy?

At low sample counts, each pixel’s estimate is based on a handful of random paths. Some paths happen to hit a light source — they return bright values. Other paths bounce into shadows — they return dark values. The average of these few samples is far from the true answer, and neighbouring pixels get different random paths, so they disagree. That disagreement is noise.

The dungeon at 1 sample per pixel. Every pixel fired a single random path. The image is technically correct — each pixel is an unbiased estimate of the true colour — but the variance is enormous.

The noise follows a $1/\sqrt{N}$ convergence rate. Doubling your samples halves the standard deviation. To reduce noise by a factor of 10, you need 100× more samples. This is why brute-force path tracing is so expensive: visually clean images require hundreds or thousands of samples per pixel.

At 1280×800 with 4 bounces, the numbers are unforgiving. Each sample per pixel traces a path of up to 4 rays through the BVH, and there are 1,024,000 pixels:

Samples/pixel | Rays per frame | Time at 330M rays/sec | Framerate | Noise relative to 1 spp
1             | ~4M            | 12ms                  | ~80fps    | 1.0×
16            | ~65M           | 200ms                 | ~5fps     | 0.25× (4× cleaner)
512           | ~2.1B          | 6.4s                  | ~0.15fps  | 0.044× (23× cleaner)

At 1 sample per pixel you get 80fps and an image that’s pure noise. At 16 samples you can start to make out the scene, but you’re at 5fps — not interactive, and still visibly grainy. To get a genuinely clean image you need hundreds or thousands of samples, which means seconds per frame. The path tracer at 512 spp takes over six seconds to render a single frame, and even that isn’t fully converged in the darkest corners.

With enough samples — either from a single high-spp frame or accumulated over time via temporal blending — the image converges to something beautiful:

The path between 1 noisy sample and a converged image is where all the engineering lives. We don’t have the computational power to do lots of samples even on a modern GPU.


The Ghost in the Machine

There’s a particular artefact worth understanding because it reveals how temporal accumulation works — and fails.

Dynamic objects in the scene (like the phantom enemy) appear semi-transparent when temporal blending is active. This isn’t a rendering bug in the traditional sense: the temporal accumulator is doing exactly what it’s designed to do, with unintended consequences.

When the camera is static, each frame’s output is blended with the accumulated history using a $1/N$ weighting:

$$\text{result} = \frac{1}{N+1} \cdot \text{current} + \frac{N}{N+1} \cdot \text{history}$$

After 20 frames, the current frame contributes only 5% to the result. If the phantom moves into a pixel’s location at frame 21 then that pixel’s value is 95% wall colour (from the previous 20 frames of history) and 5% phantom colour. The phantom is literally transparent — you’re seeing it blended with what used to be behind it.

It takes many frames for the phantom to become fully opaque. At a blend factor of 0.05, reaching 90% opacity takes about 45 frames. At 88fps, that’s half a second of ghosting. And if the phantom is moving, it never becomes fully opaque — it’s a perpetual ghost.
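The half-second figure falls straight out of the blend factor (a quick sketch): after n frames the old history retains a weight of (1 − α)ⁿ, so reaching 90% phantom opacity means solving (1 − α)ⁿ ≤ 0.1.

```typescript
// Under exponential blending, old content decays as (1 - alpha)^n per frame.
// Frames needed for new content to reach the target opacity:
function framesToOpacity(alpha: number, targetOpacity: number): number {
  return Math.ceil(Math.log(1 - targetOpacity) / Math.log(1 - alpha));
}

const frames = framesToOpacity(0.05, 0.9); // 45 frames at a 0.05 blend factor
const seconds = frames / 88;               // ~0.51s of ghosting at 88fps
```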

The fix requires per-pixel change detection: if a pixel’s brightness has changed significantly from its history, blend aggressively toward the current frame regardless of the static frame count. But distinguishing “the scene changed” from “the noise pattern changed” is hard when your input is noisy. This is the fundamental tension of temporal accumulation — the thing that makes it converge (heavy reliance on history) is the same thing that makes it resist change.


Denoising: Inventing the Rest

At real-time sample counts (1–8 spp), the raw path-traced image is too noisy to be usable. A spatial denoiser smooths the noise using information from the G-buffer — the normal and depth at each pixel — to blur within surfaces while preserving edges.

The À-Trous Wavelet Filter

This is a fairly standard spatial denoiser: a series of increasingly spaced bilateral filter passes, where each pass applies a 5×5 kernel with a step size that doubles each iteration:

| Pass | Step Size | Effective Radius |
|------|-----------|------------------|
| 0    | 1         | 2 pixels         |
| 1    | 2         | 4 pixels         |
| 2    | 4         | 8 pixels         |
| 3    | 8         | 16 pixels        |
| 4    | 16        | 32 pixels        |

Five passes give a 32-pixel blur radius using only a 5×5 kernel — much cheaper than a single 65×65 convolution.
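The doubling schedule is easy to verify with a one-liner (a TypeScript sketch, function name mine):

```typescript
// Each a-trous pass uses a 5x5 kernel (2 taps either side of centre)
// at step size 2^pass, so the effective radius of pass p is 2 * 2^p.
function effectiveRadius(pass: number): number {
  return 2 * 2 ** pass;
}

const radii = [0, 1, 2, 3, 4].map(effectiveRadius);
console.log(radii); // [2, 4, 8, 16, 32]
```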

At each sample position, three weights control the blending. The normal weight prevents blurring across geometric edges — if two pixels have very different surface normals, they belong to different surfaces and shouldn’t be mixed:

$$w_{\text{normal}} = \max(0, \mathbf{n}_c \cdot \mathbf{n}_s)^{128}$$

The depth weight prevents blurring across depth discontinuities — a wall and a distant corridor behind a doorway shouldn’t blend together:

$$w_{\text{depth}} = \exp\left(-\frac{(\Delta d / d_c)^2}{2\sigma_d^2}\right)$$

The spatial weight comes from the à-trous kernel itself, a separable $[1, 4, 6, 4, 1]/16$ distribution.

// In the inner loop of each à-trous pass:
let spatial_weight = kernel_weights[i] * kernel_weights[j];
let normal_dot = max(0.0, dot(center_normal, sample_normal));
let normal_weight = pow(normal_dot, 128.0);
let relative_depth = abs(center_depth - sample_depth) / max(center_depth, 0.001);
let depth_weight = exp(-relative_depth * relative_depth * 100.0);

let weight = spatial_weight * normal_weight * depth_weight;

The challenge is balancing blur against detail. Too aggressive and the image loses texture; too conservative and noise remains. I found that removing the colour similarity weight entirely and relying purely on geometric (normal + depth) weights produced the best results: the colour weight was systematically biased toward darker values, causing energy loss.

I’ll be honest: this is the bit I’m least satisfied with. I’ve not got to what feels like a good balance between blur and detail, but without going down a denoising / ML rabbit hole that’s probably the best I’m going to do for now.


Temporal Accumulation: Memory Between Frames

In contrast to spatial denoising, which operates within a single frame, temporal accumulation operates across frames: each frame’s result is blended with the accumulated history from previous frames.

When the camera is static, this is equivalent to gathering more samples over time. Frame 1 has 4 spp. After blending with frame 2 (also 4 spp), you effectively have 8 spp. After 10 frames, 40 spp. The image progressively converges toward the 512 spp quality shown earlier.
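The $1/N$ blend is just an incremental mean. A quick TypeScript check (function name is mine) confirms that blending frame by frame reproduces the plain average of all frames:

```typescript
// Incremental mean: result = current/(N+1) + history * N/(N+1).
// Applying this per frame is mathematically identical to averaging
// every frame seen so far - which is why a static camera converges.
function accumulate(frames: number[]): number {
  let history = 0;
  frames.forEach((current, n) => {
    history = current / (n + 1) + (history * n) / (n + 1);
  });
  return history;
}

const avg = accumulate([1, 2, 3, 4, 5]);
console.log(avg); // 3 - the arithmetic mean
```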

When the camera moves, the history must be reprojected — each pixel’s world position from the current frame is transformed into the previous frame’s screen space to find where it was. This requires the previous frame’s view-projection matrix:

fn reproject(world_pos: vec3f) -> vec2f {
  let clip = matrices.prev_view_proj * vec4f(world_pos, 1.0);
  let ndc = clip.xy / clip.w;
  return vec2f(
    (ndc.x * 0.5 + 0.5) * screen_width,
    (0.5 - ndc.y * 0.5) * screen_height
  );
}

If the reprojected position is off-screen, or the depth at that position doesn’t match (indicating a different surface — disocclusion), the history is rejected and the current frame is used alone.
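Sketched in TypeScript (the function name and the 1% depth tolerance are my assumptions, not necessarily what the renderer uses):

```typescript
// History rejection sketch: discard history for off-screen
// reprojections and for depth mismatches (disocclusion).
// The 1% relative-depth tolerance is an illustrative choice.
function historyValid(
  reprojected: { x: number; y: number },
  width: number, height: number,
  currentDepth: number, historyDepth: number,
): boolean {
  const onScreen =
    reprojected.x >= 0 && reprojected.x < width &&
    reprojected.y >= 0 && reprojected.y < height;
  const depthMatch =
    Math.abs(currentDepth - historyDepth) / Math.max(currentDepth, 1e-3) < 0.01;
  return onScreen && depthMatch;
}
```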

To prevent ghosting from stale history during movement, a neighbourhood clamp restricts the history colour to the range observed in the current frame’s 3×3 neighbourhood, computed in YCoCg colour space for better perceptual uniformity:

let stats = get_neighbourhood_stats(pixel_coord);  // [mean, stddev]
let history_ycocg = rgb_to_ycocg(history_col);
let clamped = clamp(history_ycocg,
    stats[0] - stats[1] * 1.25,
    stats[0] + stats[1] * 1.25);
history_col = ycocg_to_rgb(clamped);

This is the same technique used by every modern game engine with temporal anti-aliasing (TAA). The tradeoff is always the same: tighter clamping reduces ghosting but increases flicker; looser clamping smooths noise but smears moving objects.

Again, this is another rabbit hole I’d need to go down to get better results.


The Pipeline

The full render pipeline executes three compute passes per frame, followed by a full-screen blit:

  1. Path trace — fire rays, compute lighting, write colour + G-buffer (normals, depth)
  2. Temporal accumulation — blend current frame with reprojected history
  3. Spatial denoise — à-trous wavelet filter guided by the G-buffer
  4. Blit — tone map (Reinhard) and gamma correct, display on screen

Each pass reads from textures written by the previous pass. The temporal pass ping-pongs between two history textures so the output of frame N becomes the input history for frame N+1.
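The ping-pong can be sketched as a tiny TypeScript helper (names are mine, assuming the two history textures are indexed 0 and 1):

```typescript
// Ping-pong: frame N writes history texture N % 2 and reads the other
// one, so frame N's output becomes frame N+1's input history.
function historyIndices(frame: number): { read: number; write: number } {
  const write = frame % 2;
  return { read: 1 - write, write };
}
```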


Memory Layout: Hot and Cold Data

A ray traversing the BVH tests dozens of triangles. Each intersection test only needs the three vertex positions — 48 bytes. But the original Triangle struct also contained normals, material indices, UV coordinates, and texture indices — another 48 bytes that are only needed for the single closest hit.

Splitting the triangle data into two separate GPU buffers halves the memory traffic during traversal:

// Hot path: read during every intersection test
struct TriangleVerts {
  v0: vec3f, _pad0: f32,
  v1: vec3f, _pad1: f32,
  v2: vec3f, _pad2: f32,
}

// Cold path: read once per ray, only for the winner
struct TriangleAttribs {
  normal: vec3f,
  material_index: u32,
  uv0: vec2f, uv1: vec2f,
  uv2: vec2f,
  texture_index: i32,
  _pad: f32,
}

After finding the closest hit (by index), a single read from the attribute buffer resolves everything needed for shading:

fn resolve_hit(hit: HitInfo) -> ResolvedHit {
  let attrib = tri_attribs[hit.tri_index];
  // Interpolate UVs using stored barycentrics
  let w = 1.0 - hit.bary_u - hit.bary_v;
  r.uv = w * attrib.uv0 + hit.bary_u * attrib.uv1 + hit.bary_v * attrib.uv2;
  // ...
}

Randomness on the GPU

Path tracing requires enormous quantities of random numbers — two per bounce direction, one for Russian roulette, plus sub-pixel jitter. GPU compute shaders don’t have access to system random number generators, so we use a PCG (Permuted Congruential Generator) hash function (honestly, I’ve lost track of the number of different approaches I’ve tried for generating random numbers on a GPU):

fn pcg(state: ptr<function, u32>) -> f32 {
  *state = *state * 747796405u + 2891336453u;
  let word = ((*state >> ((*state >> 28u) + 4u)) ^ *state) * 277803737u;
  return f32((word >> 22u) ^ word) / 4294967295.0;
}

The seed combines the pixel coordinates and the frame number, ensuring that each pixel gets a unique sequence and each frame produces a different noise pattern:

fn rand_seed(pixel: vec2u, frame: u32) -> u32 {
  return pixel.x * 1973u + pixel.y * 9277u + frame * 26699u;
}

The frame number is critical for temporal accumulation — if every frame produced identical noise, averaging them would give the same noisy result. Different seeds per frame mean independent noise patterns, and independent samples converge via the law of large numbers.
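For reference, here is a TypeScript port of the same hash, using `Math.imul` and `>>> 0` to emulate 32-bit unsigned arithmetic; treat it as a sketch of the WGSL above rather than production code:

```typescript
// TypeScript port of the shader's PCG hash (same constants).
// Math.imul gives 32-bit multiplication; >>> 0 wraps to unsigned.
function pcg(state: { value: number }): number {
  state.value = (Math.imul(state.value, 747796405) + 2891336453) >>> 0;
  const s = state.value;
  const word = Math.imul((s >>> ((s >>> 28) + 4)) ^ s, 277803737) >>> 0;
  return (((word >>> 22) ^ word) >>> 0) / 4294967295;
}

// Same seeding scheme as the shader: pixel coordinates plus frame number.
function randSeed(pixelX: number, pixelY: number, frame: number): number {
  return (
    (Math.imul(pixelX, 1973) + Math.imul(pixelY, 9277) + Math.imul(frame, 26699)) >>> 0
  );
}

// Usage: seed from pixel + frame, then draw values in [0, 1].
const rng = { value: randSeed(100, 50, 0) };
const u1 = pcg(rng);
const u2 = pcg(rng);
```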


What Games Actually Do (And Why It’s Worse Than You Think)

Every real-time ray-traced game you’ve seen — Cyberpunk 2077, Quake II RTX, Metro Exodus — uses the same fundamental approach described here. The differences are in scale and sophistication, but also (in the long tradition of videogames) in honesty.

Hardware RT cores accelerate BVH traversal and ray–triangle intersection in dedicated silicon, achieving maybe 10× the throughput of our software compute shader. But RT cores are a small fraction of the die area doing a small fraction of the visual work. They get top billing in the marketing because “AI upscaling ON” doesn’t sell £1,600 graphics cards and in fact many find it noticeable, detrimental and jarring (myself included).

Neural denoisers replace the simple à-trous filter with trained neural networks that can reconstruct a plausible image from 1 sample per pixel. The word “plausible” is doing a lot of heavy lifting: these reconstructions shimmer on thin geometry, smear during movement, and produce a subtle temporal instability that your visual system registers even if you can’t articulate what’s wrong. Follow independent reviewers on YouTube and you’ll quickly find people bemoaning that things look messy and that we’ve lost the sharpness we used to have.

Temporal upscaling (DLSS, FSR, XeSS) runs the renderer at a fraction of the display resolution — often 540p or 720p internally — and uses a temporal neural network to hallucinate a 4K output. These are presented as features but they’re really admissions of failure. They exist because the raw silicon isn’t advancing fast enough to brute-force the problems they’re designed to mask. Each GPU generation now delivers less actual hardware improvement and more software gimmicks, while prices go up. The RTX 5080 outperforms the previous generation by as little as 2.5% in some titles. Its headline feature — Multi Frame Generation — inserts AI-generated fake frames that add latency and visual artefacts.

Hybrid rendering traces rays only for specific effects — reflections, shadows, global illumination — while primary visibility is rasterised. So “RTX ON” typically means a handful of noisy ray-traced contributions composited onto a clean rasterised base, then run through three layers of neural reconstruction to hide the noise.

The result is that the image you see is, in a very real sense, a hallucination — the AI is inventing most of what’s displayed. Once you’ve seen it you can’t unsee it. Hair ghosts. Fences shimmer. Text on in-game signs goes soft. Edges trail during fast camera movement. Games from ten years ago — running at native resolution on GPUs that cost a quarter of the price — actually often looked sharper and more stable than what we get today from a £1,600 card that’s spending half its transistor budget on generating frames that never existed.

I had to turn DLSS on to get decent performance out of The Outer Worlds 2 on a 5080 — a card that should be running rings around anything Obsidian can throw at it. The distortion was distracting. And that game isn’t even graphically ambitious. If a 5080 can’t run a narrative RPG cleanly without AI reconstruction, what exactly is the hardware for? The answer, increasingly, is that the GPU is a DLSS delivery mechanism that happens to have some rasterisation hardware attached, and one that lets developers get away with shipping sloppy, poorly performing code. We are paying for poor coding, and paying twice: once for the game, once for the hardware that papers over it.

To be fair, graphics in games have always been about trickery. What’s changed, I think, is that the trickery used to feel like genuinely novel and innovative use of the hardware, people pushing limits. Now it often feels like it’s being used to push up prices and ship poorly optimised code. That’s a big difference.

What you see here is the approach without these tricks: rays, noise, and no hallucination. At 512 spp the image is genuinely path-traced — every pixel computed, nothing invented. At 4 spp it’s grainy but truthful. The noise from indirect illumination in a torch-lit dungeon reads as atmosphere rather than artefact. Sure, it’s not going to ship as a viable game (though I think you could lean into the aesthetic), but it’s a good representation of what’s going on at the bottom.


Try It Yourself

The renderer below is the live WebGPU path tracer described in this post. Click the play button to start rendering, then use WASD to move and Q/E to turn. The settings panel lets you adjust samples per pixel, bounce count, resolution, and toggle the temporal and spatial denoising passes.

The source code is available on GitHub. The entire renderer — path tracing, BVH construction, temporal accumulation, spatial denoising, and the Doom WAD loader — is roughly 2,500 lines of TypeScript and WGSL.

Built by James Randall — tool-maker, system builder, and occasional cyclist. Walking the hills with my four-legged friend when I'm not building worlds.
© 2025