CS180 Project 4 — Neural Radiance Fields

Neural Radiance Fields

By Kourosh Salahi · CS180 / 280A

This project implements NeRF completely from scratch: camera calibration, pose estimation, ray sampling, neural fields, volume rendering, and training NeRFs on both the classic Lego dataset and my own captured data. This page contains animations

Part 0 — Camera Calibration & Dataset Capture

Camera Frustums Visualization

After calibrating my camera through a grid of aruco tags, and taking multiple photos of my object from different views, I was able to estimate camera poses, as visualized below.

Calibration Visualization

Detected ArUco corners used for calibration

Part 1 — Neural Field Fitting (2D)

Model Architecture

For image fitting, I used a fully-connected network with width 256 and positional encoding L = 10. The model takes the 2D coordinates (after PE, dimension 4L + 2) and passes them through three hidden layers:

Input: 4L + 2-dim positional-encoded coordinates
Hidden layers: 3 × (Linear → ReLU), each of width 256
Output layer: Linear → Sigmoid (predicts RGB in [0,1])

Training used Adam with learning rate 1e-2 and MSE loss.

Training Progression — Fox (Width=256, L=10)

Training iterations shown: 0 → 50 → 100 → 200 → 500 → 1000 → 1999.

Final Results

Four combinations of PE frequency (L ∈ {4,10}) and width (W ∈ {64,256}), all at iter1999. It seems as though Positional encoding has the highest effect on our overall image ouput quality.

PSNR Curves (Fox)

Part 2 — Neural Radiance Field on Multi-View Lego

In this section, I implemented a complete NeRF pipeline: converting image pixels to 3D world-space rays, sampling points along those rays, evaluating a radiance field network, and performing differentiable volume rendering to optimize the model against multiple posed RGB images.

2.1 Camera → Ray Pipeline

pixel_to_camera(K, uv): I convert pixel coordinates to camera coordinates by analytically inverting the pinhole intrinsics. For each pixel (u, v), I compute: (x, y, z) = ((u - ox)/fx, (v - oy)/fy, 1).
transform(c2w, x_c): My world-space transform builds homogeneous coordinates explicitly. Given x_c ∈ ℝ^N×3, I append a column of ones, reshape to (N, 4, 1), then left-multiply by the camera-to-world matrix:
```
c2w @ [x, y, z, 1]^T → [X, Y, Z, 1]^T
```
The result is squeezed back to (N, 3). This implementation matches the mathematical camera transform exactly and is fully vectorized across all points.
pixel_to_ray(K, c2w, uv): After converting pixel locations to camera coordinates and transforming them into world space, (using the previous two functions) I compute:
ray_o = c2w[:3, 3] as the origin would simply be the translation of the camera to world frame, and ray_d = normalize(x_w − ray_o), as the ray direction would be the difference between our world point and the position of the camera in the world frame. This yields the origin and direction of each ray used during training.

Together, these steps implement the full pixel → camera → world → ray pipeline required for NeRF training.

2.2 Sampling Rays & Points

I use global ray sampling, where all pixels from all training images are flattened into a single pool. Each pixel is recorded with a +0.5 offset so that sampling occurs at the pixel center. At every iteration, I randomly choose N pixel indices, giving uniformly distributed rays across all views.

For each selected pixel, I compute its ray origin and direction from the previous pixel_to_ray method. Each ray is then discretized into n_samples = 64 points between near = 2.0 and far = 6.0 using:

x = r_o + r_d · t

During training I ensure that every portion of the ray is eventually sampled and prevent the network from overfitting to fixed depths by adding random noise within each interval (i.e., t_i ← t_i + ε·Δt). This ensures that over training, every part of the ray contributes to the reconstruction. The values n_samples, near, and far are parameters in my implementation, and I later change them for different scenes (e.g., my real dataset uses a smaller range), since the optimal sampling bounds depend heavily on object scale and the physical capture setup.

2.3 Dataloader + Precomputation + Viser Visualizations

To accelerate training, I built a preprocessing stage that:

enumerates every pixel in every image with a +0.5 center offset,
stores their RGB values,
stores a repeated copy of the camera extrinsic for each pixel,
precomputes world-space ray_o and ray_d for the entire dataset.

The dataloader then samples rays by simply indexing into these large flattened arrays. I validated correctness of UV alignment, ray orientation, and frustum consistency using Viser visualizations.

2.4 Neural Radiance Field MLP

My NeRF model closely follows the original architecture: a deep point MLP with a skip connection, a density head, and a separate view-dependent color head. Both 3D sample locations and ray directions use sinusoidal positional encoding before being fed into the network.

Positional Encoding: For my implementation, 3D points are encoded using L_x=10, giving 3 + 6L = 63 channels. Ray directions use L_r=4, producing 3 + 6L = 27 channels.
Point MLP (σ + features): The encoded 3D point is processed by an 8-layer MLP with width 256. At the 4th hidden layer (index 3), I concatenate the original positional encoding back into the network, exactly matching:
Linear(256 + input_dim_x → 256). This skip connection helps preserve high-frequency spatial detail.
Density Head (σ): After the final point-layer, a 256 → 1 linear projection followed by ReLU produces the volume density σ.
Color Head (rgb): The point-MLP output is first transformed into a 256-D feature vector. This feature is concatenated with the encoded ray direction (27-D) and passed through:
- (256 + input_dim_r) → 128 with ReLU,
- 128 → 3 with Sigmoid to produce RGB in [0,1].
Outputs: The network returns σ (density) and rgb (color).

2.5 Volume Rendering

I implemented volume rendering entirely in a vectorized form. For each ray, the sampled densities σ and colors rgb are combined using the weighting:

w_i = T_i · (1 − exp(−σ_i Δt))

where the transmittance T_i is computed using a shifted cumulative sum (up until but excluding value i) of the densities:

T_i = exp( − cumulative_sum(σ · Δt) )

This matches the continuous volume rendering equation while remaining efficient, since all rays and samples are processed in parallel. The final pixel color is:

C = Σ w_i · rgb_i

Training Progression (0 → 1400 iterations)

Below are renders from the validation camera at key training iterations: 0, 200, 400, 600, 800, 1000, 1200, 1400.

PSNR Curves

As you can see, the PSNR curve reaches above the goal of 23, and the average PSNR for the 10 images of the validation set with the trained model is 23.91.

Spherical Rendering (Novel Views)

Part 2.6 — NeRF Trained on My Own Captured Object

For this section, I trained a full Neural Radiance Field using the dataset captured in Part 0. This involved using my own calibrated images and camera poses, then training a NeRF model identical in structure to Part 2, with adjustments to near/far ranges and sampling strategy due to the much smaller physical scale of the scene.

Model Architecture & Training Setup

I used a custom NeRF architecture with skip connection, separate density/color branches, and independent positional encoding frequencies for 3D sample points and view directions.

Hyperparameter & Implementation Adjustments

near = 0.2 and far = 0.5 (real-world scene was much smaller than Lego dataset)
64 samples per ray, jittered during training
Positional encoding: L_x = 10 for positions, L_r = 4 for directions
Adam optimizer: lr = 5e-4

These adjustments were essential to avoid the model sampling empty space and to keep rays bounded within the tight capture volume around the object.

Training Loss Over Time

Intermediate Training Renders

Below are snapshots of the NeRF’s output at various stages of training. These demonstrate how density and color predictions refine over time.

Reference Image

Novel-View Rendering (Final)

After training, I generated a 360° orbit animation by using the provided “look-at-origin” camera generation code, producing a smooth camera path around the object. Because my Aruco tag was so big, I pointed it towards a different corner rather than the origin corner. Thus, I had to change the logic for the roation function to rotate along a different world point. The final rendered GIF is shown below:

Novel-view orbit of my reconstructed NeRF

Conclusion

I implemented a complete Neural Radiance Field pipeline from scratch. From camera calibration and PnP, to neural fields, ray sampling, and full volumetric rendering, this project deepened my understanding of differentiable rendering and scene representation. THANKS FOR VIEWING MY PROJECT!!!

CS180 • Project 4 · NeRF