Naturally Supervised 3D Visual Grounding with
Language-Regularized Concept Learners


CVPR 2024

[arXiv]  [code]


Abstract

3D visual grounding is a challenging task that often requires direct and dense supervision, notably the semantic label for each object in the scene. In this paper, we instead study the naturally supervised setting that learns from only 3D scene and QA pairs, where prior works underperform. We propose the Language-Regularized Concept Learner (LARC), which uses constraints from language as regularization to significantly improve the accuracy of neuro-symbolic concept learners in the naturally supervised setting. Our approach is based on two core insights: the first is that language constraints (e.g., a word's relation to another) can serve as effective regularization for structured representations in neuro-symbolic models; the second is that we can query large language models to distill such constraints from language properties. We show that LARC improves performance of prior works in naturally supervised 3D visual grounding, and demonstrates a wide range of 3D visual reasoning capabilities — from zero-shot composition to data efficiency and transferability. Our method represents a promising step towards regularizing structured visual reasoning frameworks with language-based priors, for learning in settings without dense supervision.




Naturally Supervised 3D Visual Grounding

We study the task of referring expression comprehension (3D-REC): locating the target object referred to by a language utterance in a 3D scene. Unlike existing methods that require direct and dense supervision, LARC completes this task in a naturally supervised setting, learning 3D visual grounding from only scenes and question-answer pairs. This removes the need for pre-segmented object point clouds and object-level classification supervision.


Method

To learn in the naturally supervised setting, LARC combines language priors and neuro-symbolic concept learners. Neuro-symbolic concept learners decompose visual reasoning queries into modular functions, and execute them with neural networks. Each network outputs representations that can be indexed by its concept name and arity. Due to this modularity and structured representation, prior neuro-symbolic works have shown strong generalization, data efficiency, and transferability in the 3D domain.

LARC builds on a structured neuro-symbolic model, and importantly regularizes the structured and interpretable representations based on explicit language constraints, in order to learn without direct supervision.

3D Neuro-Symbolic Model
A neuro-symbolic concept learner for 3D grounding consists of three components:
1. A semantic parser parses the input utterance into a symbolic program that represents the reasoning process underlying the utterance.
2. A 3D feature encoder extracts object features f_obj, binary relational features f_binary, and ternary relational features f_ternary, which represent per-object attributes, binary relations (e.g., beside), and ternary relations (e.g., between) among objects, respectively.
3. A fully differentiable program executor takes the program and the extracted features f_obj, f_binary, and f_ternary, and executes the program over them to find the target object.
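To make the pipeline concrete, here is a minimal, runnable sketch of how a parsed program can be executed over these features; the similarity-based scoring, the helper names, and the toy program are our assumptions rather than the released implementation.

```python
import torch

def concept_score(f_obj, embed, concept):
    # Score each of the N objects for a unary concept (e.g., "chair")
    # via similarity to a learned concept embedding.
    return torch.sigmoid(f_obj @ embed[concept])             # (N,)

def execute(program, f_obj, f_binary, embed):
    # The executor keeps a soft attention over objects; each module updates it.
    attn = torch.ones(f_obj.shape[0])                        # attend to all objects initially
    for op, concept in program:
        if op == "filter":
            attn = attn * concept_score(f_obj, embed, concept)
        elif op == "relate":
            prob = torch.sigmoid(f_binary @ embed[concept])  # (N, N) relation probabilities
            attn = prob @ attn                               # move attention along the relation
    return attn

# Toy usage: "the chair near the table" on a random scene of 5 objects.
N, D = 5, 16
embed = {c: torch.randn(D) for c in ["table", "near", "chair"]}
f_obj, f_binary = torch.randn(N, D), torch.randn(N, N, D)
program = [("filter", "table"), ("relate", "near"), ("filter", "chair")]
target = execute(program, f_obj, f_binary, embed).argmax()   # predicted target index
```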



To enable learning in naturally supervised settings, LARC introduces regularization on intermediate representations based on language constraints.

Constraints Generation with LLMs
LARC leverages general constraints derived from well-studied semantic relations between words, for example symmetry, exclusivity, and synonymy, which are broadly applicable across language-driven tasks. Notably, the concepts satisfying these constraints can be distilled explicitly from LLMs, which encode such language priors.
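As an illustration, constraint distillation can be implemented by querying an LLM once per concept and constraint type. The prompt wording and the query_llm placeholder below are our assumptions, and any chat-completion API could fill that role.

```python
SYMMETRY_PROMPT = (
    "Is the spatial relation '{concept}' symmetric? That is, if object A "
    "is {concept} object B, is object B necessarily {concept} object A? "
    "Answer yes or no."
)

def query_llm(prompt: str) -> str:
    # Placeholder for a call to an LLM API; returns the model's text reply.
    raise NotImplementedError

def is_symmetric(concept: str) -> bool:
    # Distill the symmetry constraint for one concept from language priors.
    answer = query_llm(SYMMETRY_PROMPT.format(concept=concept))
    return answer.strip().lower().startswith("yes")

# e.g., is_symmetric("near") should come back True, is_symmetric("above") False.
```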



Constraints on Structured Representations
LARC effectively encodes semantic contexts from language into neuro-symbolic concept learners through regularization.

Specifically, during the execution of neuro-symbolic programs, relations between objects are represented as probability matrices, whose elements give the likelihood that the named relation holds between a pair or triple of objects. For example, prob^near_{i,j} is the probability that object i is near object j, where i and j index objects in the scene. For each type of concept, we then encourage its probability matrix to satisfy the constraint distilled for that concept.

For example, for symmetric concepts (e.g., near), we encourage the probability matrices to be symmetric and penalize asymmetry, implemented as a regularization loss in LARC.
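Below is a minimal PyTorch sketch of what such regularizers can look like; the function names and the exact penalty forms (an L1 asymmetry penalty and a product-based exclusivity penalty) are our assumptions, not necessarily the losses used in the paper.

```python
import torch

def symmetry_loss(prob: torch.Tensor) -> torch.Tensor:
    # For a symmetric concept (e.g., "near"), penalize the gap between
    # prob[i, j] and prob[j, i] over all object pairs.
    return (prob - prob.transpose(-1, -2)).abs().mean()

def exclusivity_loss(prob_a: torch.Tensor, prob_b: torch.Tensor) -> torch.Tensor:
    # For mutually exclusive concepts (e.g., "left" vs. "right"),
    # discourage both relations from holding for the same object pair.
    return (prob_a * prob_b).mean()

# Toy usage: regularize a random (N, N) relation matrix.
prob_near = torch.rand(6, 6)
reg = symmetry_loss(prob_near)   # added to the grounding loss during training
```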




Zero-Shot Generalization

We evaluate LARC's ability to generalize zero-shot to unseen concepts via language composition rules. We create two test sets with concepts held out during training: ternary relations (e.g., center, between) and antonyms (e.g., not behind, not left); queries containing these concepts are removed from the training set.
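For intuition, antonym composition can be as simple as complementing a learned relation's probability matrix; this sketch is illustrative and does not cover the composition rules LARC uses for ternary relations.

```python
import torch

def antonym_prob(prob_base: torch.Tensor) -> torch.Tensor:
    # If prob_base[i, j] = P(object i is behind object j), the unseen
    # antonym "not behind" is composed as its complement.
    return 1.0 - prob_base

prob_behind = torch.rand(4, 4)   # toy pairwise probabilities for 4 objects
prob_not_behind = antonym_prob(prob_behind)
```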

Performance on unseen concepts

We also evaluate LARC's transfer performance on an unseen dataset, training on ReferIt3D and testing on ScanRefer, which contains new utterances over scenes from ScanNet. We retrieve a subset that references the same object categories and relations as ReferIt3D, so that all methods can be run inference-only.

Performance on unseen dataset



Data Efficiency

We demonstrate that LARC retains strong data efficiency due to its modular concept learning framework. In the following table, LARC is significantly more data efficient than prior works when trained on 5%, 10%, 15%, 20%, and 25% of the training data. Notably, with 10% of the data, LARC achieves a 6.8 percentage-point gain over the top-performing prior work.




Visualization

Here, we present visualizations of LARC's and NS3D's learned features for symmetric (left two columns) and exclusive (right two columns) concepts; each matrix shows the likelihood that pairs of objects satisfy the given relational concept. LARC's features encode the language-derived constraints significantly more effectively than those of the neuro-symbolic baseline, NS3D.
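Such matrices are straightforward to render as heatmaps; the snippet below uses random placeholder data and is only a sketch of how a figure like this can be produced.

```python
import matplotlib.pyplot as plt
import torch

prob = torch.rand(8, 8)   # placeholder (N, N) relation probabilities
plt.imshow(prob.numpy(), cmap="viridis", vmin=0.0, vmax=1.0)
plt.colorbar(label="P(relation holds)")
plt.xlabel("object j")
plt.ylabel("object i")
plt.title("near (symmetric concept)")
plt.show()
```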





Code Release

You can find our code here.