LGTM: Apple's 4K Gaussian Splatting Without the Compute Explosion
Feed-forward 3D Gaussian Splatting has a scaling problem. Existing methods predict pixel-aligned primitives, meaning the number of Gaussians grows with the square of the output resolution. Double the resolution on each axis and you need four times as many Gaussians. Scale from 512 to 4K and the primitive count explodes by a factor of 64. In practice, methods like NoPoSplat need over 60 GB of GPU memory just to train at 1024x576, and fail entirely at higher resolutions. Apple's LGTM, published at ICLR 2026, solves this by splitting geometry and texture into two separate networks. The result: native 4K feed-forward novel view synthesis that actually fits on a single GPU.
The Quadratic Scaling Problem
3D Gaussian Splatting represents a scene as a collection of ellipsoidal primitives, each carrying position, scale, rotation, opacity, and colour. Feed-forward methods like NoPoSplat, DepthSplat, and Flash3D predict these parameters directly from input images, bypassing the minutes of per-scene optimisation the original 3DGS needs before it can render a new scene.
The problem is that these methods typically predict one Gaussian per pixel of their internal feature map. If the feature map is 1024x576, that is roughly 590,000 primitives. Push it to 4096x2304 and you are looking at 9.4 million Gaussians just for the primitives, before accounting for the attention and transformer layers needed to predict them. Training NoPoSplat at 1024x576 already requires 61.85 GB of GPU memory with a batch size of one. At 2048x1152 it runs out of memory entirely. No amount of gradient checkpointing saves you; the bottleneck is the quadratic relationship between resolution and primitive count, not intermediate activations.
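To make the scaling concrete, here is a quick back-of-the-envelope calculation (a plain Python sketch, not from the paper) of how many pixel-aligned primitives each resolution implies:

```python
# One Gaussian per pixel of the feature map: the primitive count scales
# directly with pixel count, i.e. quadratically in the linear resolution.
BASE = 512 * 288
for w, h in [(512, 288), (1024, 576), (2048, 1152), (4096, 2304)]:
    n = w * h
    print(f"{w:>4}x{h:<4}: {n:>9,} primitives  ({n // BASE:>2}x the 512x288 count)")
```

That is 147,456 primitives at 512x288, roughly 590,000 at 1024x576, and 9.4 million at 4096x2304, matching the 64x blow-up described above.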
Standard 3DGS compounds this by coupling appearance and geometry inside each primitive. A flat wall with detailed brickwork needs thousands of tiny Gaussians to capture the texture, even though the geometry is trivially simple. The representation conflates two things that scale at very different rates.
LGTM's Dual-Network Architecture
LGTM, short for "Less Gaussians, Texture More", attacks the scaling problem by decoupling geometry from appearance. Instead of predicting a massive grid of pixel-aligned primitives, it uses two separate networks that operate at different resolutions.
The Primitive Network takes low-resolution input images (typically 512x288 or 256x144) and predicts a compact set of 2D Gaussian primitives. These capture the scene's geometry: positions, scales, rotations, opacities, and low-frequency colour through spherical harmonics. Critically, this network trains with high-resolution supervision. It consumes low-res inputs but renders at full target resolution, forcing it to learn appropriate scale parameters that produce correct results when rasterised at 4K. This high-res supervision prevents holes and aliasing artefacts that would otherwise appear when upscaling a sparse primitive grid.
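As a rough sketch of what "a compact set of primitives from a low-resolution grid" looks like in code, here is a minimal PyTorch head that predicts one 2D Gaussian per cell of a 512x288 feature map. The channel layout and activations are illustrative assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrimitiveHead(nn.Module):
    """Illustrative head: one 2D Gaussian per low-resolution feature cell.

    Assumed channel layout: 1 depth + 2 in-plane scale + 4 rotation (quaternion)
    + 1 opacity + 3 base colour = 11 channels.
    """
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.proj = nn.Conv2d(feat_dim, 11, kernel_size=1)

    def forward(self, feats: torch.Tensor) -> dict:
        raw = self.proj(feats)                   # (B, 11, H, W)
        raw = raw.flatten(2).transpose(1, 2)     # (B, H*W, 11), one row per Gaussian
        depth, scale, rot, opacity, colour = raw.split([1, 2, 4, 1, 3], dim=-1)
        return {
            "depth":   depth.exp(),              # positive depth along each pixel ray
            "scale":   scale.exp(),              # positive in-plane extents
            "rot":     F.normalize(rot, dim=-1), # unit quaternion
            "opacity": opacity.sigmoid(),
            "colour":  colour.sigmoid(),         # low-frequency base colour
        }

# The primitive count is fixed by the low-res grid (512 * 288 = 147,456 Gaussians),
# no matter what resolution the rasteriser is asked to render for supervision.
head = PrimitiveHead()
out = head(torch.randn(1, 256, 288, 512))
print(out["opacity"].shape)   # torch.Size([1, 147456, 1])
```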
The Texture Network takes the high-resolution source images and enriches each primitive with a per-primitive texture map. Each Gaussian gets a small T x T texture (where T is 2, 4, or 8 depending on the target resolution) that encodes both colour detail and an alpha mask. The texture network combines three feature sources: patchified features from the high-res image, projective features computed by "inverse-rendering" the source image back onto the Gaussian planes, and backbone features shared from the primitive network.
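To see why per-primitive textures are so much cheaper than more primitives, here is an illustrative size comparison; the RGBA texel layout and the 59-floats-per-Gaussian figure (the standard 3DGS parameterisation with degree-3 spherical harmonics) are assumptions, not numbers from the paper:

```python
import torch

N = 512 * 288            # primitive count, fixed by the low-res grid
T = 8                    # per-primitive texture side length used for 4K output
textures = torch.zeros(N, 4, T, T)   # RGB colour detail + alpha per primitive
texel_mb = textures.numel() * 4 / 1024**2
print(f"Textures: {texel_mb:.0f} MB of fp32 texels for {N:,} primitives at T={T}")

# Versus the pixel-aligned route: one Gaussian per 4K pixel, ~59 floats each
# (3 pos + 3 scale + 4 rot + 1 opacity + 48 SH coeffs; standard 3DGS, an assumption).
pixel_aligned_gb = 4096 * 2304 * 59 * 4 / 1024**3
print(f"Pixel-aligned: {pixel_aligned_gb:.1f} GB of parameters for 9.4 M Gaussians")
```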
The projective texture mapping step works by computing, for each Gaussian, the inverse homography that maps from the primitive's local coordinates back to source image pixels. This gives the network a strong prior for high-frequency appearance without learnable parameters, which the texture network then refines into per-primitive textures.
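Concretely, that per-primitive mapping is a single 3x3 homography. The sketch below shows one way to build it for a planar primitive and sample a T x T texture from the source image; the parameterisation (centre plus two scaled tangent axes) and the camera convention are illustrative assumptions rather than the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def projective_texture(image, K, R, t, centre, tan_u, tan_v, T=8):
    """Sample a T x T texture for one planar Gaussian by projecting its plane
    into the source view (illustrative sketch, not the paper's code).

    image:  (3, H, W) source image
    K:      (3, 3) intrinsics;  R, t: world-to-camera rotation / translation
    centre: (3,) primitive centre in world space
    tan_u, tan_v: (3,) scaled tangent axes spanning the primitive's plane
    """
    _, H, W = image.shape
    # A plane point X(u, v) = centre + u * tan_u + v * tan_v projects as
    # x ~ K (R X + t), so the (u, v, 1) -> pixel map is a homography:
    Hmg = K @ torch.stack([R @ tan_u, R @ tan_v, R @ centre + t], dim=-1)  # (3, 3)

    # Regular (u, v) grid over the primitive's local extent [-1, 1]^2.
    u = torch.linspace(-1.0, 1.0, T)
    v = torch.linspace(-1.0, 1.0, T)
    vv, uu = torch.meshgrid(v, u, indexing="ij")
    uv1 = torch.stack([uu, vv, torch.ones_like(uu)], dim=-1)               # (T, T, 3)

    # Project to pixels and normalise to [-1, 1] for grid_sample.
    pix = uv1 @ Hmg.T                                                      # (T, T, 3)
    pix = pix[..., :2] / pix[..., 2:3].clamp(min=1e-6)
    grid_x = 2.0 * pix[..., 0] / (W - 1) - 1.0
    grid_y = 2.0 * pix[..., 1] / (H - 1) - 1.0
    grid = torch.stack([grid_x, grid_y], dim=-1)[None]                     # (1, T, T, 2)

    return F.grid_sample(image[None], grid, mode="bilinear",
                         align_corners=True, padding_mode="border")[0]     # (3, T, T)
```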
Training follows a staged recipe. Stage one pre-trains the primitive network alone to build a stable geometric foundation. Stage two jointly trains both networks, with the primitive network's learning rate reduced to one-tenth to prevent the new texture signal from destabilising the geometry. Colour textures are zero-initialised so they start as additive detail on top of the spherical harmonic base colour.
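In optimiser terms the second stage is straightforward to express. A minimal PyTorch sketch, with hypothetical module names (`primitive_net`, `texture_net`, `colour_head`) and learning rates chosen only for illustration:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in modules, for illustration only.
primitive_net = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.Conv2d(64, 11, 1))
texture_net = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1))
colour_head = nn.Conv2d(64, 3, 1)   # final layer producing the colour texture

# Stage 1: pre-train the primitive network alone (geometric foundation).
stage1_opt = torch.optim.AdamW(primitive_net.parameters(), lr=1e-4)

# Stage 2: joint training. The primitive network's LR drops to one tenth so the
# new texture gradients cannot destabilise the geometry it has already learned.
nn.init.zeros_(colour_head.weight)  # colour textures start at zero, i.e. purely
nn.init.zeros_(colour_head.bias)    # additive detail on top of the SH base colour
stage2_opt = torch.optim.AdamW([
    {"params": primitive_net.parameters(), "lr": 1e-5},
    {"params": list(texture_net.parameters()) + list(colour_head.parameters()),
     "lr": 1e-4},
])
```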
Benchmark Results
The numbers tell the story clearly. Here is the head-to-head comparison from the paper's pilot study, measured on the DL3DV dataset with two context views:
| Resolution | Method | Memory (GB) | LPIPS | SSIM | PSNR |
|---|---|---|---|---|---|
| 1024x576 | NoPoSplat | 61.85 | 0.239 | 0.716 | 23.17 |
| 1024x576 | LGTM | 20.16 | 0.213 | 0.816 | 25.61 |
| 2048x1152 | NoPoSplat | OOM | - | - | - |
| 2048x1152 | LGTM | 21.39 | 0.176 | 0.810 | 25.33 |
| 4096x2304 | NoPoSplat | OOM | - | - | - |
| 4096x2304 | LGTM | 28.23 | 0.200 | 0.803 | 24.49 |
NoPoSplat cannot train at all beyond 1K. LGTM reaches 4K on a single NVIDIA A100 with under 30 GB of memory. And it does not just render at 4K; it actually improves perceptual quality. LGTM's LPIPS drops from 0.213 at 1K to 0.176 at 2K, and remains strong at 4K. That perceptual improvement comes from the richer texture maps having more room to encode fine detail.
Inference scaling is similarly efficient. Rendering at 4K (a 64x pixel increase over the 512x288 baseline) requires only 1.8x the peak GPU memory and 1.47x the total inference time. The texture maps grow linearly with resolution, not quadratically, and projective feature extraction runs in a few milliseconds even for 4K source images.
LGTM improves on every baseline it touches. On DepthSplat at 4K, it pushes LPIPS from 0.198 to 0.170. On Flash3D (single-view input) at 4K, the improvement is even more dramatic, dropping LPIPS from 0.371 to 0.219. The framework works with or without known camera poses, with monocular or multi-view input.
Why This Matters for Vision Pro
Apple Vision Pro has a 23 million pixel display (roughly 3660x3200 per eye). Spatial photography and video captured on the device already produce high-resolution assets, but reconstructing those scenes into splats that can re-render from arbitrary viewpoints has been computationally out of reach for on-device inference. The existing pipeline requires minutes of per-scene optimisation on a desktop GPU.
LGTM changes the calculus. A 4K novel view synthesises in roughly 175 milliseconds on an A100, with the bulk of that time in the network forward pass (142 ms) rather than the rasteriser (33 ms). The primitive count is fixed by the 512x288 grid (roughly 147,000 Gaussians) regardless of output resolution. That is a small enough model footprint that on-device deployment becomes a realistic near-term target, especially with Apple's Neural Engine and the GPU optimisations that typically follow Apple Research publications into shipping frameworks.
Passthrough quality on Vision Pro depends on rendering fidelity matching what the physical eyes would see. Low-resolution splat reconstructions produce visible blur and ghosting when composited into the real-time camera feed. Room scanning, spatial photography, and any mixed-reality use case that needs to reconstruct and re-project the physical environment benefits directly from higher-resolution output. LGTM makes feed-forward reconstruction at the display's native resolution possible, closing the quality gap between reconstructed and observed reality.
How LGTM Compares to Other Approaches
The landscape of efficient Gaussian Splatting splits into two camps: post-hoc compression methods and feed-forward methods.
Post-hoc and per-scene approaches like Compact3D and Scaffold-GS prune, merge, or restructure an optimised 3DGS model into something more compact, while Mip-Splatting reworks its filtering to remove aliasing across scales. They produce excellent quality but still need 10 to 30 minutes of per-scene optimisation. They are optimisation-time solutions, not inference-time solutions.
Feed-forward methods like NoPoSplat, DepthSplat, and Flash3D predict Gaussians directly from images, generalising across scenes with no per-scene training. But until LGTM, they were stuck at low resolutions because of the quadratic scaling problem. You could not simply upscale the output; the primitive grid determined both geometry and appearance simultaneously.
LGTM occupies a unique position: it is the first feed-forward method to use textured Gaussians. Existing textured splatting methods (BBSplat, Textured Gaussians for 3DGS) achieve similar geometry-appearance decoupling but all require per-scene optimisation. LGTM brings the efficiency of textured representations into the feed-forward regime for the first time, and it does so as a general framework that plugs into existing baselines rather than replacing them entirely.
Apple's Broader Gaussian Splatting Portfolio
LGTM fits into a growing body of Gaussian Splatting research from Apple. HUGS (Human Gaussian Splats, December 2023) addressed animatable human reconstruction, combining deformable Gaussians with a learned body model so freely moving people could be rendered alongside static scenes. SHARP looked at removing aliasing artefacts in splat rendering. More broadly, Apple has published on perceptual metrics for 3DGS that correlate better with human judgement than standard PSNR and SSIM.
The thread connecting these projects is practical deployment. Every paper moves the field toward on-device, real-time, high-fidelity 3D reconstruction. LGTM is the latest and arguably the most significant, because it rethinks the fundamental representation bottleneck rather than optimising within existing constraints.
Code for LGTM is listed as "coming soon" on the project page.
Feed-forward rendering at native display resolution is no longer a research curiosity. With LGTM, it is a solved architectural problem. As spatial computing hardware matures and display resolutions climb higher, the feed-forward approach (predict once, render from any viewpoint) looks set to become the standard pipeline for consumer spatial applications. Per-scene optimisation will retain its place for production-quality offline assets, but for real-time capture, room scanning, and interactive spatial media, the future is feed-forward.
Paper: arxiv.org/abs/2603.25745
Project page: yxlao.github.io/lgtm
Apple ML Research: machinelearning.apple.com/research/less-gaussians-texture-more