A View-consistent Sampling Method for Regularized Training of Neural Radiance Fields

Computer Vision Laboratory, EPFL, Switzerland

ICCV 2025

Abstract

Neural Radiance Fields (NeRF) has emerged as a compelling framework for scene representation and 3D recovery. To improve its performance on real-world data, depth regularization has proven to be the most effective approach. However, depth estimation models not only require expensive 3D supervision during training, but also suffer from generalization issues. As a result, depth estimates can be erroneous in practice, especially for outdoor unbounded scenes. In this paper, we propose to employ view-consistency distributions instead of fixed depth estimates to regularize NeRF training. Specifically, the distribution is computed by utilizing both low-level color features and high-level features distilled from foundation models at the 2D pixel locations obtained by projecting 3D points sampled along each ray. By sampling from the view-consistency distributions, an implicit regularization is imposed on NeRF training. We also introduce a depth-pushing loss that works in conjunction with the sampling technique to jointly provide effective regularization for eliminating failure modes. Extensive experiments on various scenes from public datasets demonstrate that our method generates significantly better novel view synthesis results than state-of-the-art NeRF variants as well as existing depth regularization methods.

View-consistent Sampling

Our central idea is to pre-compute a view-consistency distribution along rays and to perform importance sampling according to this distribution. As a result, the sampling will concentrate around surface points instead of random points in the capture volume.
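As a minimal sketch of this step, the snippet below draws depth samples along a single ray from a discretized view-consistency distribution via the inverse CDF. The function names and the binning scheme are illustrative, not our released implementation.

    # Minimal sketch of per-ray importance sampling from a precomputed
    # view-consistency distribution via the inverse CDF. Names and the
    # binning scheme are illustrative.
    import torch

    def sample_along_ray(consistency, t_vals, n_samples):
        """Draw depth samples t along one ray.

        consistency: (n_bins,) non-negative view-consistency score per bin.
        t_vals:      (n_bins + 1,) bin edges along the ray.
        """
        pdf = consistency / consistency.sum().clamp_min(1e-8)        # normalize to a PDF
        cdf = torch.cat([torch.zeros(1), torch.cumsum(pdf, dim=0)])  # (n_bins + 1,)
        u = torch.rand(n_samples)                                    # uniform draws in [0, 1)
        idx = torch.searchsorted(cdf, u, right=True).clamp(1, len(t_vals) - 1)
        lo, hi = cdf[idx - 1], cdf[idx]
        frac = (u - lo) / (hi - lo).clamp_min(1e-8)                  # position inside the bin
        return t_vals[idx - 1] + frac * (t_vals[idx] - t_vals[idx - 1])

This mirrors the hierarchical sampling of the original NeRF, except that the PDF comes from view consistency rather than from a coarse network's weights.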


Distilling Geometric Information from DINOv2

We distill the foundation model DINOv2 for geometric information by tuning our projection network on real images from the MegaDepth dataset.
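
The sketch below shows what such a projection network and its training objective could look like. Only the 384-to-32 dimensionality reduction and the use of ground-truth correspondences are fixed by the description above; the two-layer architecture and the InfoNCE objective are illustrative assumptions.

    # Sketch of a projection head distilling frozen DINOv2 features, trained
    # with an InfoNCE loss over ground-truth correspondences. The hidden layer
    # size and the loss choice are assumptions.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ProjectionHead(nn.Module):
        """Maps 384-d DINOv2 features to compact 32-d unit-norm descriptors."""
        def __init__(self, in_dim=384, out_dim=32):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                     nn.Linear(128, out_dim))

        def forward(self, x):
            return F.normalize(self.net(x), dim=-1)

    def correspondence_loss(f1, f2, temperature=0.07):
        """InfoNCE over N matched locations in two views.

        f1, f2: (N, 32) projected features at corresponding pixels; row i of
        f1 should match only row i of f2, so the target is the diagonal.
        """
        logits = f1 @ f2.t() / temperature       # (N, N) similarity matrix
        target = torch.arange(f1.shape[0])       # diagonal = true matches
        return F.cross_entropy(logits, target)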

The figure below visualizes the feature distillation process. For two test images from the MegaDepth dataset, we first randomly generate $50$ ground-truth correspondences (as in the training process), shown as colored dots, and then extract vanilla DINOv2 features (384-dimensional) and the proposed distilled DINOv2 features (32-dimensional) at these locations. We compute the feature similarities across the two views and show the resulting similarity matrices on the right, where an ideal correspondence would yield the identity matrix.
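
To reproduce this kind of visualization, features must be read out at sub-pixel correspondence locations. A minimal sketch using bilinear interpolation follows; `feat_map` and `pixels` are illustrative names.

    # Sketch: bilinearly sample a dense feature map at sub-pixel locations.
    import torch
    import torch.nn.functional as F

    def sample_features(feat_map, pixels):
        """feat_map: (1, D, H, W) dense features (D = 384 for vanilla DINOv2,
        D = 32 after distillation); pixels: (N, 2) as (x, y) in pixel units.
        Returns (N, D) features interpolated at each location."""
        _, d, h, w = feat_map.shape
        grid = pixels.clone().float()
        grid[:, 0] = 2.0 * grid[:, 0] / (w - 1) - 1.0   # x to [-1, 1]
        grid[:, 1] = 2.0 * grid[:, 1] / (h - 1) - 1.0   # y to [-1, 1]
        grid = grid.view(1, 1, -1, 2)                   # (1, 1, N, 2) for grid_sample
        out = F.grid_sample(feat_map, grid, align_corners=True)  # (1, D, 1, N)
        return out.view(d, -1).t()                      # (N, D)

Given the resulting unit-norm descriptor matrices f1, f2 of shape (N, 32) from the two views, the displayed similarity matrix is simply f1 @ f2.t().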

We also visualize the effectiveness of the view-consistency metric on the Bonsai scene from the Mip-NeRF360 dataset. We shoot a ray from the reference point in the leftmost image, compute the view-consistency distribution along the ray, and reproject the distribution's peak onto the other views. The projections of the peak are consistent across views and correspond to a surface point.
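
A sketch of how such per-ray scores can be computed follows, reusing sample_features() from the previous snippet and assuming simple pinhole cameras. Averaging feature similarities over neighboring views is an illustrative choice, and this sketch uses only the distilled high-level features, omitting the low-level color term.

    # Sketch of view-consistency scores along one ray; assumes pinhole
    # cameras and reuses sample_features() from the previous snippet.
    import torch

    def project(points, K, w2c):
        """Project (M, 3) world points with intrinsics K (3, 3) and
        world-to-camera extrinsics w2c (3, 4) to (M, 2) pixels."""
        cam = points @ w2c[:, :3].t() + w2c[:, 3]        # world -> camera frame
        uv = cam @ K.t()
        return uv[:, :2] / uv[:, 2:3].clamp_min(1e-6)    # perspective divide

    def view_consistency(ray_o, ray_d, t_vals, ref_feat, views):
        """ray_o, ray_d: (3,) ray origin and direction; t_vals: (M,) depths.
        ref_feat: (D,) unit-norm distilled feature at the reference pixel.
        views: list of (feat_map, K, w2c) for neighboring cameras.
        Returns (M,) consistency scores, one per depth sample."""
        points = ray_o + t_vals[:, None] * ray_d         # (M, 3) samples on the ray
        scores = torch.zeros(len(t_vals))
        for feat_map, K, w2c in views:
            pix = project(points, K, w2c)                # (M, 2) projections
            feats = sample_features(feat_map, pix)       # (M, D)
            scores += feats @ ref_feat                   # cosine similarity
        return scores / len(views)

Normalizing these scores along the ray yields the per-ray distribution consumed by the importance sampler sketched earlier.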

Qualitative Results

We show comparisons of VS-NeRF to its main competitors and the corresponding ground-truth images from held-out test views. The scenes are, from top to bottom: Bicycle with 60 training views, Stump with 110 training views, and Counter with 70 training views from the Mip-NeRF360 dataset, and Francis with 70 training views from Tanks&Temples. The '+' prefix denotes an additional component added on top of Nerfacto.

Quantitative Results

We also report the performance of VS-NeRF and its competitors with an increasing number of training views on the Mip-NeRF360 and Tanks&Temples datasets, in terms of PSNR. The '+' prefix denotes an additional component added on top of Nerfacto.

BibTeX


    @inproceedings{fan2025vcs,
      title={A View-consistent Sampling Method for Regularized Training of Neural Radiance Fields},
      author={Fan, Aoxiang and Dumery, Corentin and Talabot, Nicolas and Fua, Pascal},
      booktitle={International Conference on Computer Vision (ICCV)},
      year={2025}
    }