Neural Scene Representations with Single-Evaluation Rendering

Inferring representations of 3D scenes from 2D observations is a fundamental problem in computer graphics, computer vision, and artificial intelligence. Emerging 3D-structured neural scene representations are a promising approach to 3D scene understanding. In this work, we propose a novel neural scene representation, Light Field Networks (LFNs), which represent both geometry and appearance of the underlying 3D scene in a 360-degree, four-dimensional light field parameterized via a neural implicit representation. Rendering a ray from an LFN requires only a *single* network evaluation, as opposed to the hundreds of evaluations per ray required by the ray-marching or volume-rendering-based renderers of 3D-structured neural scene representations. In the setting of simple scenes, we leverage meta-learning to learn a prior over LFNs that enables multi-view-consistent light field reconstruction from as few as a single image observation. This results in dramatic reductions in time and memory complexity, and enables real-time rendering. The cost of storing a 360-degree light field via an LFN is two orders of magnitude lower than that of conventional methods such as the Lumigraph. Utilizing the analytical differentiability of neural implicit representations and a novel parameterization of light space, we further demonstrate the extraction of sparse depth maps from LFNs.

3D-structured Neural Scene Representations have recently enjoyed great success, and have enabled new applications
in graphics, vision, and artificial intelligence.
However, they are currently severely limited by the neural rendering algorithms they require for training and
testing: Both volumetric rendering as in NeRF and sphere-tracing as in Scene Representation Networks
require **hundreds** of evaluations of the representation per ray, resulting in forward-pass times on the order of
**tens of seconds** for a single 256x256 image. This cost is incurred at both training and test time,
and training requires backpropagating through this extremely expensive rendering algorithm.
This makes these neural scene representations borderline infeasible for all but computer graphics applications, and even there,
additional tricks - such as caching and hybrid explicit-implicit representations - are required to achieve real-time
framerates.

We introduce Light Field Networks, or LFNs for short. Instead of mapping a 3D world coordinate to whatever exists at that
coordinate, LFNs directly map an **oriented ray** to whatever is **observed** along that ray.
In this manner, LFNs parameterize the **full 360-degree light field** of the underlying 3D scene.
As a consequence, LFNs require only a **single** evaluation of the neural implicit representation per ray.
This unlocks rendering at framerates above 500 FPS with a minimal memory footprint. Below,
we compare rendering an LFN to the recently proposed PixelNeRF: LFNs accelerate rendering by a factor of roughly 15,000.
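The single-evaluation rendering scheme can be sketched as follows. This is a toy stand-in, not the trained architecture from the paper: the network here is a hypothetical, randomly initialized two-layer MLP, and only illustrates the data flow of one network evaluation per ray.

```python
import numpy as np

def plucker_embed(origins, directions):
    """Embed rays as 6D Pluecker coordinates (d, o x d)."""
    d = directions / np.linalg.norm(directions, axis=-1, keepdims=True)
    m = np.cross(origins, d)  # moment vector; independent of the chosen point on the ray
    return np.concatenate([d, m], axis=-1)

class ToyLFN:
    """Stand-in for a trained Light Field Network: one MLP evaluation per ray."""
    def __init__(self, hidden=64, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(size=(6, hidden)); self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(size=(hidden, 3)); self.b2 = np.zeros(3)

    def render(self, origins, directions):
        x = plucker_embed(origins, directions)
        h = np.maximum(x @ self.w1 + self.b1, 0.0)          # hidden layer, ReLU
        return 1.0 / (1.0 + np.exp(-(h @ self.w2 + self.b2)))  # RGB in (0, 1)

# Render a tiny 4-ray "image": one forward pass per ray, no ray-marching.
rays_o = np.zeros((4, 3))
rays_d = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
colors = ToyLFN().render(rays_o, rays_d)
print(colors.shape)  # (4, 3)
```

Contrast this with a volumetric renderer, which would evaluate the network at hundreds of sample points along each of these rays before compositing.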

Unintuitively, LFNs encode **not only** the appearance of the underlying 3D scene, **but also its geometry**.
Our novel parameterization of light fields via the mathematically convenient Plücker
coordinates, together with the unique properties of neural implicit representations, allows us to **extract sparse depth maps**
of the underlying 3D scene in **constant time, without ray-marching!** We achieve this by deriving a
relationship between an LFN's derivatives and the scene's geometry: at a high level, the geometry of the level sets of the
4D light field encodes the geometry of the underlying scene, and these level sets can be efficiently accessed via automatic
differentiation. This is in contrast to 3D-structured
representations, which require ray-marching to extract any representation of the scene's geometry.
Below, we show sparse depth maps extracted from LFNs that were trained to represent simple room-scale environments.
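The key property that makes Plücker coordinates convenient here is that they identify a ray itself, not a particular point on it: sliding the origin along the ray or rescaling the direction leaves the coordinates unchanged, so the light field is well-defined as a function of rays. A minimal check (function name ours):

```python
import numpy as np

def plucker(origin, direction):
    # Normalize the direction; the moment m = o x d is then independent of
    # which point on the ray is used as the origin.
    d = direction / np.linalg.norm(direction)
    return np.concatenate([d, np.cross(origin, d)])

o = np.array([1.0, -2.0, 0.5])
d = np.array([0.3, 0.4, 1.2])

r1 = plucker(o, d)
r2 = plucker(o + 3.7 * d, d)   # slide the origin along the ray
r3 = plucker(o, 2.5 * d)       # rescale the direction

assert np.allclose(r1, r2) and np.allclose(r1, r3)
print("same ray -> same Pluecker coordinates")
```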

While 3D-structured neural representations ensure multi-view consistency via ray-marching, Light Field Networks
are not naturally multi-view consistent. To overcome this challenge, we leverage meta-learning via hypernetworks
to learn a space of multi-view-consistent light fields.
As a corollary, we can leverage this learned prior to reconstruct an LFN from **only a single image observation**!
In this regime, LFNs outperform existing globally conditioned neural scene representations such as Scene Representation Networks
and the Differentiable Volumetric Renderer, while rendering in real time and requiring orders of magnitude less memory.
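The hypernetwork idea can be sketched as follows: a per-scene latent code is mapped to the full weight set of an LFN, so that optimizing the latent code (rather than the LFN weights directly) keeps reconstructions inside the learned space of scenes. All names, sizes, and the linear hypernetwork below are illustrative assumptions, not the paper's exact architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

LATENT, IN, HID, OUT = 16, 6, 32, 3  # 6D Pluecker ray in, RGB out

# Hypernetwork: one linear map per parameter tensor of the target LFN.
hyper = {
    "w1": rng.normal(scale=0.1, size=(LATENT, IN * HID)),
    "b1": rng.normal(scale=0.1, size=(LATENT, HID)),
    "w2": rng.normal(scale=0.1, size=(LATENT, HID * OUT)),
    "b2": rng.normal(scale=0.1, size=(LATENT, OUT)),
}

def lfn_weights(z):
    """Map a scene latent code z to the weights of a scene-specific LFN."""
    return {
        "w1": (z @ hyper["w1"]).reshape(IN, HID),
        "b1": z @ hyper["b1"],
        "w2": (z @ hyper["w2"]).reshape(HID, OUT),
        "b2": z @ hyper["b2"],
    }

def lfn_forward(params, rays):
    """Evaluate the generated LFN on a batch of (N, 6) Pluecker rays."""
    h = np.maximum(rays @ params["w1"] + params["b1"], 0.0)
    return h @ params["w2"] + params["b2"]

z = rng.normal(size=LATENT)  # per-scene latent, fit from image observations
colors = lfn_forward(lfn_weights(z), rng.normal(size=(8, 6)))
print(colors.shape)  # (8, 3)
```

Single-image reconstruction then amounts to optimizing `z` so the rendered rays match the observed image, while the hypernetwork (trained across many scenes) keeps the resulting light field multi-view consistent.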

Check out our related projects on neural scene representations!

We propose a new neural network architecture for implicit neural representations that can accurately
fit complex signals, such as room-scale SDFs, video, and audio, and allows us to supervise implicit
representations via their gradients to solve boundary value problems!

We identify a key relationship between generalization across implicit neural representations and
meta-learning, and propose to leverage gradient-based meta-learning for learning priors over deep
signed distance functions. This allows us to reconstruct SDFs an order of magnitude faster than the
auto-decoder framework, with no loss in performance!

A continuous, 3D-structure-aware neural scene representation that encodes both geometry and
appearance, is supervised only in 2D via a neural renderer, and generalizes to 3D reconstruction
from a single posed 2D image.

We demonstrate that the features learned by neural implicit scene representations are useful for
downstream tasks, such as semantic segmentation, and propose a model that learns to perform
continuous 3D semantic segmentation on a class of objects (such as chairs) given only a single,
2D (!) semantic label map!

@inproceedings{sitzmann2021lfns,
  author    = {Sitzmann, Vincent and Rezchikov, Semon and Freeman, William T.
               and Tenenbaum, Joshua B. and Durand, Fredo},
  title     = {Light Field Networks: Neural Scene Representations
               with Single-Evaluation Rendering},
  booktitle = {Proc. NeurIPS},
  year      = {2021}
}