Self-Supervised Visual Representation Learning from Hierarchical Grouping

We create a framework for bootstrapping visual representation learning from a primitive visual grouping capability. We operationalize grouping via a contour detector that partitions an image into regions, followed by merging of those regions into a tree hierarchy. A small supervised dataset suffices for training this grouping primitive. Across a large unlabeled dataset, we apply this learned primitive to automatically predict hierarchical region structure. These predictions serve as guidance for self-supervised contrastive feature learning: we task a deep network with producing per-pixel embeddings whose pairwise distances respect the region hierarchy. Experiments demonstrate that our approach can serve as state-of-the-art generic pre-training, benefiting downstream tasks. We additionally explore applications to semantic region search and video-based object instance tracking.

1 Introduction

The ability to learn from large unlabeled datasets appears crucial to deploying machine learning techniques across many application domains for which annotating data is too costly or simply infeasible due to scale. For visual data, quantity and collection rate often far outpace ability to annotate, making self-supervised approaches particularly crucial to future advancement of the field.

Recent efforts on self-supervised visual learning fall into several broad camps. Among them, Kingma  et al.   [ 24 ] and Donahue  et al.   [ 12 ] design general architectures to learn latent feature representations, driven by modeling image distributions. Another group of approaches  [ 16 , 10 , 28 , 35 ] leverage, as supervision, pseudo-labels automatically generated from hand-designed proxy tasks. Here, the general strategy is to split data examples into two parts and predict one from the other. Alternatively, Wu  et al.   [ 45 ] and Zhuang  et al.   [ 52 ] learn visual features by asking a deep network to embed the entire training dataset, mapping each image to a location different from others, and relying on this constraint to drive the emergence of a topology that reflects semantics. Across many of these camps, technical improvements to the scale and efficiency of learning further boost results  [ 19 , 7 , 36 , 13 ] . Section  2 provides a more complete background.


We approach self-supervised learning using a strategy somewhat different from those approaches outlined above. While incorporating aspects of proxy tasks and embedding objectives, a key distinction is that our system’s proxy task is itself generated by a (simpler) trained vision system. We thus seek to bootstrap visual representation learning in a manner loosely inspired by, though certainly not accurately mirroring, a progression of simple to complex biological vision systems. This is an under-explored, though not unrecognized, strategy in computer vision. Serving as a noteworthy example is Li  et al. ’s  [ 30 ] approach of using motion as a readily available, automatically-derived, supervisory signal for learning to detect contours in static images. We focus on the next logical stage in such a bootstrapping sequence: using a pre-existing contour detector to automatically define a task objective for learning semantic visual representations. Figure  1 illustrates how this primitive visual system, combined with a modern contrastive feature learning framework, trains a convolutional neural network (CNN) to produce semantic embeddings (Figure  2 ). We defer full details to Section  3 .

Our system not only leverages contours to learn visual semantics, but also leverages a small amount of annotated data to learn from a vastly larger pool of unlabeled data. Our visual primitive of contour detection is trained in a supervised manner from only 500 annotated images [34]. This primitive then drives self-supervised learning across datasets ranging from tens of thousands to millions of images; in this latter phase, our system does not utilize any annotations and trains from randomly initialized parameters. This is a crucial distinction from SegSort [21], whose pipeline bears coarse resemblance to our Figure 1. SegSort’s “unsupervised” learning setting still relies on starting from ImageNet pre-trained CNNs; its “unsupervised” aspect is only with respect to forgoing use of segmentation ground-truth. In contrast, we address the problem of representation learning entirely from scratch, save for the 500 annotated images of the Berkeley Segmentation Dataset [34].

ImageNet, even without labels, is curated: most ImageNet images contain a single category. This provides some implicit supervision, which may bias the self-supervised work that experimentally targets learning from ImageNet, including MoCo [19], InstFeat [48], and others [45, 7]. Many use cases for self-supervision will lack such curation. As our bootstrapping strategy utilizes a visual primitive geared toward partitioning complex scenes into meaningful components, it is a better fit for learning from unlabeled examples drawn from datasets containing complex scenes (e.g., PASCAL [14], COCO [31]).

Using a similar siamese network, we outperform InstFeat  [ 48 ] by a large margin on the task of learning transferable representations from PASCAL and COCO images alone (disregarding labels). In this setting, our results are competitive with those of the state-of-the-art MoCo system  [ 19 ] , while our method remains simpler. We do not rely on a momentum encoder or memory bank. Even with this simpler training architecture, our segmentation-aware approach enhances the efficiency of learning from complex scenes. Here, our pre-training converges in under half the epochs needed by MoCo to learn representations with comparable transfer performance on downstream tasks.

In addition to evaluating learned representation quality on standard classification tasks, Sections  4 and  5 explore applications to semantic region search and instance tracking in video. Using similarity in our learned feature space to conduct matching across images and frames, we outperform competing methods in both applications. Our results point to a promising new pathway of crafting self-supervised learning strategies around bootstrapping the training of one visual module from another.

2 Related Work

Self-Supervised Representation Learning. Approaches to self-supervised visual learning that train networks to predict one aspect of the data from another have utilized a variety of proxy tasks, including prediction of context  [ 38 ] , colorization  [ 50 , 29 ] , cross-channels  [ 51 ] , optical-flow  [ 49 ] , and rotation  [ 16 ] . Oord  et al.   [ 36 ] train to predict future representations in the latent space of an autoregressive model. As objects move coherently in video, Mahendran  et al.   [ 33 ] learn pixel-wise embeddings for static frames, such that their pair-wise similarity mirrors that of optical flow vectors.

Another family of methods casts representation learning in terms of clustering or embedding objectives. DeepCluster  [ 5 ] iterates between clustering CNN output representations to define target classes and re-training the CNN to better predict those targets. Contrastive multiview coding  [ 42 ] sets the objective as mapping different views of the same scene to a common embedding location, distinct from other scenes. Ye  et al.   [ 48 ] apply similar intuition with respect to data augmentation. Instance discrimination  [ 45 ] formulates feature learning as a non-parametric softmax prediction problem, enforcing consistency between a predicted hypersphere embedding and a counterpart maintained in memory banks. Following Wu  et al.   [ 45 ] , Zhuang  et al. ’s local aggregation approach  [ 52 ] uses additional clustering steps to reason about embedding targets.

Differing from the siamese network of Ye  et al.   [ 48 ] and the dataset-level memory bank of Wu  et al.   [ 45 ] , momentum contrast (MoCo)  [ 19 ] uses a moving-average encoder for reference embeddings. This offers scalability superior to a memory bank. Self-supervised networks trained with MoCo outperform their ImageNet-supervised counterparts, as benchmarked by fine-tuning to multiple tasks.

Like MoCo, our approach also benefits from a feature learning paradigm that jointly considers augmentation invariance and negative example sampling. But, instead of learning feature representation only at image level, our method learns pixel-wise semantic affinity in the context of regions. Our system relies on a simpler siamese architecture, rather than requiring a moving-average encoder.

Image Segmentation. The classic notion of image segmentation – partitioning into meaningful regions without necessarily labeling according to known semantic classes – has a rich history. Given the duality between regions and their bounding contours, modern approaches often focus on the problem of contour detection. Though deep neural networks are now the dominant tool for contour detection [41, 4, 46, 25, 47], the best approaches predating deep learning [2, 3, 11] still deliver respectable results.

Pre-training CNNs on ImageNet before fine-tuning them for contour detection provides accuracy gains [46]. But, as our purpose is to bootstrap representation learning from contours, obviating the need for ImageNet supervision, we do not want ImageNet labels used in our contour detector training pipeline. For experiments, we therefore select an older detector, based on random forests [11], along with traditional machinery (OWT-UCM) [2] for converting contours into region hierarchies. Both components are trained using only the 500 labeled images of the BSDS [34].

Representation Learning using Segmentation. Several works incorporate segmentation into representation learning. Fathi et al. [15] formulate the task of instance segmentation as metric learning. They train a network for instance-aware embedding by optimizing feature similarity of pixels sampled within or across instances. Kong et al. [27], adopting a similar objective, combine a CNN with a differentiable mean-shift clustering approach to learn instance segmentation. Chen et al. [8] address video instance segmentation via pixel embedding with a modified triplet loss. Pathak et al. [37] learn representations for recognition tasks by predicting moving object segmentation from static frames.

SegSort [21] utilizes an iterative grouping algorithm in EM (expectation maximization) fashion, learning a segmentation-aware embedding of pixels onto a hypersphere. Specifically, it leverages regions (computed by OWT-UCM [2] on HED [46] contours) as separation criteria, maximizing pairwise similarity for pixels within the same region and contrast for pixels in different regions. Unlike its supervised counterpart, SegSort can learn semantic segmentation without ground-truth masks, on top of ImageNet [9] pre-trained CNNs. In this mode, contours (as opposed to ground-truth per-pixel class labels) provide the only additional supervisory signal for learning region semantics.

While our work shares a similar spirit with SegSort  [ 21 ] , we aim to bootstrap learning of semantic region representations entirely from scratch, removing the dependence on ImageNet pre-training for both the primary CNN and the contour detection component.

3 Bootstrapping Semantics from Grouping

We train a convolutional neural network $\phi(\cdot)$, which maps an input image $\mathbb{I}$ into a spatially-extended feature representation $\mathbb{F}=\phi(\mathbb{I})$. Let $\mathbb{F}(i)\in\mathbb{R}^{d}$ denote the output $d$-dimensional feature embedding of pixel $i$. We adopt a contrastive learning objective that operates on a pixel level. Defining $\mathrm{sim}(\mathbb{F}(i),\mathbb{F}(j))=\mathbb{F}(i)^{T}\mathbb{F}(j)/(\|\mathbb{F}(i)\|\cdot\|\mathbb{F}(j)\|)$ as the cosine similarity between feature vectors $\mathbb{F}(i)$ and $\mathbb{F}(j)$, we want to learn optimal network parameters $\phi^{*}$ as follows:

where $Pos(i)$ and $Neg(i)$ denote the pixels in the same or different semantic categories as pixel $i$.
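The objective referenced as Equation (1) is not reproduced above. Purely for reference, the following is a minimal PyTorch sketch of one standard pixel-level contrastive (InfoNCE-style) loss built from the cosine similarity and the $Pos(i)$/$Neg(i)$ sets; the temperature value and the exact form used in the paper are assumptions.

```python
import torch
import torch.nn.functional as F

def pixel_contrastive_loss(feats, pos_idx, neg_idx, temperature=0.1):
    """Contrastive loss over per-pixel embeddings (illustrative sketch).

    feats:   (N, d) embeddings of all sampled pixels; the first A rows are anchors.
    pos_idx: (A, P) indices into feats giving Pos(i) for each anchor i.
    neg_idx: (A, Q) indices into feats giving Neg(i) for each anchor i.
    """
    feats = F.normalize(feats, dim=1)            # unit norm, so dot product = cosine sim
    A = pos_idx.shape[0]
    anchors = feats[:A]                          # (A, d)
    sim_pos = torch.einsum('ad,apd->ap', anchors, feats[pos_idx]) / temperature  # (A, P)
    sim_neg = torch.einsum('ad,aqd->aq', anchors, feats[neg_idx]) / temperature  # (A, Q)
    # Each positive is contrasted against all negatives of the same anchor.
    logits = torch.cat([sim_pos.unsqueeze(2),
                        sim_neg.unsqueeze(1).expand(-1, sim_pos.shape[1], -1)], dim=2)
    log_prob = sim_pos - torch.logsumexp(logits, dim=2)
    return -log_prob.mean()

# Toy usage: 20 sampled pixels, 4 anchors, 3 positives and 5 negatives each.
loss = pixel_contrastive_loss(torch.randn(20, 32),
                              torch.randint(0, 20, (4, 3)),
                              torch.randint(0, 20, (4, 5)))
```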

Unlike the supervised setting, where annotations determine $Pos(i)$ and $Neg(i)$, we must automate estimation of these relationships. In designing such a procedure, we operate under stringent assumptions: (1) the network is initialized from scratch, and (2) training images may contain complex scenes.

Grouping Primitive. We derive guidance from a visual grouping primitive to sample $Pos(i)$ and $Neg(i)$. Specifically, we adopt a contour detector $\phi_{E}$, which, acting on image $I$, produces an edge strength map $E=\phi_{E}(I)$. We then convert $E$ into a region map $R$ using a hierarchical merging process which repeatedly removes the weakest edge separating two regions. The real-valued edge strength at which two distinct regions merge defines a distance metric, which we extend to a notion of distance between pixels. This is precisely the procedure for constructing an ultrametric contour map (UCM) [1], which can equivalently be regarded as both a reweighted edge map and a tree $T$ defining a region hierarchy. The leaves of $T$ are the initial, finest-scale regions; interior nodes represent larger regions formed by merging their children, with a real-valued height in the hierarchy equal to the distance between their child regions. Figure 1 (top, center) shows an example.

As remarked upon earlier, we utilize the structured forest edge detector of Dollár and Zitnick  [ 11 ] , trained on a small dataset (BSDS  [ 34 ] ). In constructing the region tree, we follow Arbeláez  et al.   [ 2 ] , applying their variant of the oriented watershed transform (OWT) prior to computing the UCM.
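To make the induced metric concrete, the sketch below computes the tree distance between two leaf regions (denoted $d_{T}$ in what follows) as the UCM height of their lowest common ancestor; the data layout (parent pointers and per-node merge heights) is an illustrative assumption, not the authors' implementation.

```python
def region_distance(tree_parent, height, a, b):
    """Ultrametric distance d_T between leaf regions a and b.

    tree_parent[n] is the parent node of n (the root maps to itself);
    height[n] is the UCM edge strength at which node n's children merge
    (0 for leaves). d_T(a, b) is the height of the lowest common ancestor.
    """
    ancestors_a = set()
    n = a
    while True:
        ancestors_a.add(n)
        if tree_parent[n] == n:
            break
        n = tree_parent[n]
    n = b
    while n not in ancestors_a:
        n = tree_parent[n]
    return height[n]          # note d_T(a, a) = 0 for a leaf

# Example: two leaf regions merging directly under node 2 at height 0.3.
parent = {0: 2, 1: 2, 2: 2}
h = {0: 0.0, 1: 0.0, 2: 0.3}
assert region_distance(parent, h, 0, 1) == 0.3
```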

where $\sigma_{p}$ is a temperature parameter controlling the concentration with respect to region distance, and we manually set the self-similarity term to one, $\exp(-d_{T}(R_{a},R_{a})/\sigma_{p})\rightarrow 1$, to further concentrate positive sampling within the anchor region. In experiments, we find this trick leads to better positive sampling and performance. Similarly, we can define the probability of assigning $j$ to $i$'s negative set as:

Note that we do not formulate $P(j\in Neg(i))$ as the complement of $P(j\in Pos(i))$; this choice excludes pixels whose assignments are ambiguous. With these distributions defined, we can sample the corresponding reference pixels from the image.
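The equations for these assignment probabilities are not reproduced above. Purely to illustrate the sampling mechanism, the sketch below assumes $P(j\in Pos(i))\propto\exp(-d_{T}(R_{a},R_{b})/\sigma_{p})$ with the self-term clamped to one, and a form $1-\exp(-d_{T}(R_{a},R_{b})/\sigma_{n})$ for negatives with a separate temperature $\sigma_{n}$; the paper's exact definitions may differ.

```python
import numpy as np

def sample_reference_regions(anchor, regions, d_T, sigma_p=0.8, sigma_n=0.8,
                             n_pos=10, n_neg=5, rng=np.random):
    """Sample positive / negative reference regions for one anchor region.

    d_T(a, b) is the ultrametric tree distance between regions a and b.
    Pixels would then be drawn uniformly from the chosen regions.
    """
    p_pos = np.array([1.0 if r == anchor                      # self-term clamped to 1
                      else np.exp(-d_T(anchor, r) / sigma_p) for r in regions])
    p_neg = np.array([0.0 if r == anchor
                      else 1.0 - np.exp(-d_T(anchor, r) / sigma_n) for r in regions])
    p_pos /= p_pos.sum()
    p_neg /= p_neg.sum()                                      # assumes more than one region
    pos = rng.choice(len(regions), size=n_pos, p=p_pos)
    neg = rng.choice(len(regions), size=n_neg, p=p_neg)
    return pos, neg
```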

Augmented Reference Sampling. Asking vision systems to produce feature representations that are invariant to common image transformations ( e.g.,  color change, object deformation, occlusion) appears to be extremely helpful in learning semantic features. To this end, we augment our sampling of positive and negative pixel pairs in two ways:

Image Augmentation. We impose data augmentation on training images to collect extra positive and negative pairs. Specifically, suppose we have sampled $Pos(i)$ and $Neg(i)$ for pixel $i$, and augmented the input image with transformation $\tau$. We then obtain augmented pixel pairs $Pos(i)_{aug}=\tau(Pos(i))$ and $Neg(i)_{aug}=\tau(Neg(i))$, used to sample reference features on $\phi(\tau(\mathbb{I}))$.

Crossing-image Negative Sampling ($cns$). Capturing object-level, rather than merely instance-level, semantics requires considering relationships across the entire dataset; we need a signal relating reference pixels from different images. We adopt an approach similar to [45, 19, 44] and randomly sample negative features from the dataset (but operationalized on a per-pixel level). We denote these randomly sampled negatives as $Neg(i)_{cns}$ for a selected anchor pixel $i$.

Combining these sources gives the full negative set $Neg^{+}(i)=Neg(i)\cup Neg(i)_{aug}\cup Neg(i)_{cns}$. During training, we optimize Equation (1) end-to-end, starting from a randomly initialized network $\phi$, so as to produce feature vectors $\mathbb{F}(i)$ that encode fine-grained pixel-wise semantic representations.
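A small sketch of how these reference sets might be assembled in practice is given below; all function and variable names are assumptions made for illustration, and the geometric part of $\tau$ is abstracted as a coordinate-mapping function.

```python
import torch

def gather_references(feat, feat_aug, pos_coords, neg_coords, tau_coords, bank, n_cns=16):
    """Collect positive / negative reference features for one anchor pixel.

    feat, feat_aug:         (d, H, W) embeddings of the image and its augmented copy.
    pos_coords, neg_coords: (P, 2) / (Q, 2) integer (y, x) pixel locations.
    tau_coords:             maps (y, x) locations through the geometric part of tau.
    bank:                   (M, d) features drawn from other images (cross-image negatives).
    """
    def grab(fmap, coords):
        return fmap[:, coords[:, 0], coords[:, 1]].t()          # (K, d)

    pos = torch.cat([grab(feat, pos_coords),                    # Pos(i)
                     grab(feat_aug, tau_coords(pos_coords))])   # Pos(i)_aug
    neg = torch.cat([grab(feat, neg_coords),                    # Neg(i)
                     grab(feat_aug, tau_coords(neg_coords)),    # Neg(i)_aug
                     bank[torch.randint(len(bank), (n_cns,))]]) # Neg(i)_cns
    return pos, neg                                             # neg plays the role of Neg+(i)
```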

4 Experiment Settings

4.1 Datasets and Preprocessing

Datasets. We experiment on datasets of complex scenes, with variable numbers of object instances: PASCAL  [ 14 ] and COCO  [ 31 ] . PASCAL provides 1464 and 1449 pixel-wise annotated images for training and validation, respectively. Hariharan  et al.   [ 17 ] extend this set with extra annotations, yielding 10582 training images, denoted as train_aug . The COCO-2014  [ 31 ] dataset provides instance and semantic segmentations for 81 foreground object classes on over 80K training images. In unsupervised experiments, we train the network on the union of images in the train_aug set of PASCAL and train set of COCO. We evaluate learned embeddings on the PASCAL val set by training a pixel-wise classifier for semantic segmentation on PASCAL train_aug , set atop frozen features.

We also benchmark learned embeddings on the DAVIS-2017 [40] dataset for the task of instance mask tracking. We train only on PASCAL train_aug and COCO train, without including any images from DAVIS-2017. After self-supervised training, we directly evaluate on the DAVIS-2017 val set by propagating initial instance masks to subsequent frames through embedding similarity.

Edges and Regions. Our proposed self-supervised learning framework starts with an edge predictor. We avoid neural network based edge predictors like HED [46], which depend upon an ImageNet pre-trained backbone. We instead turn to structured edges (SE) [11], which leverages only the small supervised BSDS [34] for training. From SE-produced edges, we additionally compute a spectral edge signal (following [2]), and feed a combination of both edge maps into the processing steps of OWT-UCM [2]. This converts the predicted edges into a region tree $T$, which we constrain to have no more than 40 regions at its finest level.

4.2 Implementation Details

Network Design. We use a randomly initialized ResNet-50 [20] backbone. To produce high fidelity semantic features, we adopt a hypercolumn design [18] that combines feature maps coming from different blocks. We keep the stride of ResNet-50 unchanged and adopt a single $3\times 3$ Conv-BN-ReLU block to project the last outputs of the Res3 and Res4 blocks into 256-channel feature maps. These two feature maps are both interpolated to match the spatial resolution of a $4\times$ downsampled input image before concatenation. Finally, we project the resulting feature map using a single $3\times 3$ convolution layer, to produce a final 32-channel feature map as our output semantic embedding $\mathbb{F}$.
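A minimal PyTorch sketch of this embedding head, built on torchvision's ResNet-50, is given below; mapping "Res3"/"Res4" to torchvision's layer3/layer4 is an assumption made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class HypercolumnEmbedding(nn.Module):
    """Hypercolumn embedding head over a randomly initialized ResNet-50 (sketch)."""

    def __init__(self, dim=32):
        super().__init__()
        resnet = torchvision.models.resnet50(weights=None)   # random initialization
        self.stem = nn.Sequential(resnet.conv1, resnet.bn1, resnet.relu,
                                  resnet.maxpool, resnet.layer1, resnet.layer2)
        self.res3 = resnet.layer3                             # 1024-channel output
        self.res4 = resnet.layer4                             # 2048-channel output
        self.proj3 = nn.Sequential(nn.Conv2d(1024, 256, 3, padding=1),
                                   nn.BatchNorm2d(256), nn.ReLU(inplace=True))
        self.proj4 = nn.Sequential(nn.Conv2d(2048, 256, 3, padding=1),
                                   nn.BatchNorm2d(256), nn.ReLU(inplace=True))
        self.head = nn.Conv2d(512, dim, 3, padding=1)         # final 32-channel embedding

    def forward(self, x):
        size = (x.shape[2] // 4, x.shape[3] // 4)             # 4x-downsampled resolution
        c2 = self.stem(x)
        c3 = self.res3(c2)
        c4 = self.res4(c3)
        f3 = F.interpolate(self.proj3(c3), size=size, mode='bilinear', align_corners=False)
        f4 = F.interpolate(self.proj4(c4), size=size, mode='bilinear', align_corners=False)
        return self.head(torch.cat([f3, f4], dim=1))          # (B, 32, H/4, W/4)

emb = HypercolumnEmbedding()(torch.randn(1, 3, 224, 224))     # -> (1, 32, 56, 56)
```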

Training. We use Adam [23] to train our model for 80 epochs with batch size 70. We initialize the learning rate to 1e-2 and decay it by a factor of 0.1 at epochs 25, 45, and 60. We perform data augmentation including random resized cropping, random horizontal flipping, and color jittering on input images, which are then resized to $224\times 224$ before being fed into the network. For each image, we randomly sample 7 regions and, for each region, sample 10 positive pixels and 5 negative pixels. We use $\sigma_{p}=0.8$ for all experiments.
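For reference, a sketch of the corresponding augmentation pipeline and optimizer schedule in PyTorch/torchvision follows; the color-jitter strengths and the placeholder model are assumed values, as the text does not specify them.

```python
import torch
import torch.nn as nn
import torchvision.transforms as T

# Augmentations as described above (jitter strengths are assumed, not from the paper).
train_transform = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
    T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1),
    T.ToTensor(),
])

# Optimizer and step schedule matching the stated hyperparameters.
model = nn.Linear(8, 8)  # placeholder standing in for the embedding network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[25, 45, 60], gamma=0.1)  # decay by 0.1 at epochs 25/45/60
```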

In experiments fine-tuning on PASCAL train_aug, we freeze all the base features and only update parameters of a newly added head, the Atrous Spatial Pyramid Pooling (ASPP) module from DeepLab V3 [6]. Here, we use SGD with weight decay $5e^{-4}$ and momentum 0.9 to optimize the pixel-wise cross entropy loss for 20K iterations with batch size 20. We randomly crop and resize images to $384\times 384$ patches. The learning rate starts at 0.03 and decays by 0.1 at 10K and 15K iterations.

5 Results and Analysis

5.1 Evaluation of Learned Self-Supervised Representations

To compare with other representation learning approaches (InstFeat [48], MoCo [19], and SegSort [21]), we use the official code released by the respective authors to train a ResNet-50 backbone on the same unified dataset of the COCO train and PASCAL train_aug sets. We preserve the default parameters of these approaches. For InstFeat [48] and SegSort [21], we set the total training epochs to 80 and rescale the timing of learning rate decay accordingly. We find 80 epochs is sufficient for convergence of these models.

For SegSort [21], we replace their network, which is built upon an ImageNet pre-trained ResNet-101, with a ResNet-50 initialized from scratch. We update the related parameters and region maps (e.g., learning rate, decay epochs, total epochs) to be consistent with our setting. This allows us to compare with SegSort as a truly unsupervised approach, rather than one that relies on ImageNet pre-training. Table 1 reports results after fine-tuning for PASCAL semantic segmentation.

Training with a similar siamese architecture on PASCAL+COCO, our method outperforms InstFeat  [ 48 ] by a large margin (46.51  vs.  38.11 mIoU). We outperform SegSort  [ 21 ] , which neither leverages information from the region hierarchy nor samples negatives in a cross-image fashion, by a larger margin (46.51 vs.  36.15).

Though equipped with neither a momentum encoder nor a memory bank, our method achieves comparable results to a variant of MoCo  [ 19 ] (46.51  vs.  47.01 mIoU), while requiring far fewer training epochs (80  vs.  200). Restricted to only 80 pre-training epochs, we significantly outperform MoCo (46.51  vs.  42.07). Given less pre-training data (PASCAL only, Table  1 bottom), we similarly observe a substantial advantage over MoCo (43.51 vs.  32.4 mIoU). Together, faster convergence and superior performance with less data suggest that our ability to exploit regions drastically improves the efficiency of unsupervised visual representation learning.

This efficiency hypothesis is further supported by the fact that our method continues to improve when it can take advantage of higher quality edges. A variant built using the more advanced edge detector HED [46] reaches 48.82 mIoU, outperforming all competing approaches. Note, however, that HED itself utilizes ImageNet pre-training, so this is an ablation-style comparison that does not obey the same strict restriction (relying only on unlabeled PASCAL+COCO images) as all other entries in the corresponding section of Table 1.


Besides freezing the backbone, we also run an end-to-end fine-tuning experiment on a PASCAL+COCO pre-trained backbone, following the evaluation protocol in He  et al.   [ 19 ] . Our method achieves comparable results with MoCo  [ 19 ] (47.2 vs.  46.9 mIoU) when both are trained for 80 epochs. Here, MoCo  [ 19 ] reaches 55.0 when trained for 200 epochs.

We perform ablations on cross-image negative sampling ($cns$), reported in Table 1 (bottom). Figure 3 shows that increasing $cns$ yields feature embeddings more consistent with semantic partitioning.

We also experiment with unsupervised training over ImageNet, where MoCo performs well under the implicit assumption that most unlabeled images contain only a single category. Simply adding our pixel-wise contrastive target on top of MoCo outperforms the MoCo baseline when benchmarked at 30 epochs (55.13 vs. 53.84), but, as shown in Table 1 (top, Ours$^{\dagger}$), we do not witness relative improvement at 200 epochs. However, this does suggest that region awareness can help to boost unsupervised learning efficiency, even on ImageNet.

5.2 Semantic Segment Retrieval


We also adopt a direct approach to examine our learned embeddings. We first partition the images of the PASCAL train_aug and val sets into a fixed number of segments by running K-means (K=15) clustering on the embedding. Then we use a region feature descriptor, computed by averaging the feature vectors over all pixels within a segment, to retrieve nearest neighbors of the val set regions from the train_aug set. We report qualitative (Figure 4) and quantitative (Table 2) results. Without any supervised fine-tuning, our learned representations reflect semantic categories and object shape.
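The sketch below illustrates this retrieval setup (per-image K-means over pixel embeddings, mean-pooled region descriptors, cosine nearest-neighbor lookup); it is a simplified illustration, not the authors' code.

```python
import numpy as np
from sklearn.cluster import KMeans

def region_descriptors(feat_map, k=15):
    """Partition one image's embedding into K segments and return per-segment mean features.

    feat_map: (d, H, W) array of per-pixel embeddings.
    """
    d, h, w = feat_map.shape
    pixels = feat_map.reshape(d, -1).T                         # (H*W, d)
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(pixels)
    descs = np.stack([pixels[labels == c].mean(axis=0) for c in range(k)])
    descs /= np.linalg.norm(descs, axis=1, keepdims=True)      # unit-normalize descriptors
    return labels.reshape(h, w), descs

def retrieve(query_desc, gallery_descs, topk=5):
    """Nearest-neighbor segment retrieval by cosine similarity."""
    sims = gallery_descs @ query_desc
    return np.argsort(-sims)[:topk]

# Toy usage: a 32-dimensional embedding on a 64x64 grid.
segments, descriptors = region_descriptors(np.random.rand(32, 64, 64))
```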

5.3 Instance Mask Tracking

In this task, we track instance masks by fetching cross-frame neighboring pixels measured under the feature similarity induced by our output embedding. Specifically, we predict the instance class of pixel $i$ at time step $t$ by $y_{t}(i)=\sum_{k}\sum_{j}\mathrm{sim}(\mathbb{F}_{t}(i),\mathbb{F}_{t-k}(j))\,y_{t-k}(j)$, where $\mathbb{F}_{t}(i)$ denotes the feature vector of pixel $i$ at time step $t$, which we additionally augment with the spatial coordinates of $i$. We utilize the $k$ previous frames to propagate labels. $y_{t}(i)$ is a one-hot vector indicating the instance class assignment for pixel $i$ at time frame $t$. In our experiments, we choose $k=2$ and take $j$ over the set of 10 nearest neighbors of pixel $i$.
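A compact PyTorch sketch of this propagation rule follows; the pooling of pixels from the previous frames and the final argmax over accumulated votes are written out explicitly, while the spatial-coordinate augmentation is assumed to already be folded into the features.

```python
import torch
import torch.nn.functional as F

def propagate_labels(feat_t, past_feats, past_labels, topk=10):
    """Propagate instance labels from previous frames (sketch of the rule above).

    feat_t:      (N, d) embeddings of pixels in the current frame
                 (assumed to already include any spatial-coordinate augmentation).
    past_feats:  (M, d) embeddings pooled from the k previous frames.
    past_labels: (M, C) one-hot instance labels for those pixels.
    """
    a = F.normalize(feat_t, dim=1)
    b = F.normalize(past_feats, dim=1)
    sim = a @ b.t()                                   # (N, M) cosine similarities
    vals, idx = sim.topk(topk, dim=1)                 # restrict j to the 10 nearest neighbors
    votes = (vals.unsqueeze(2) * past_labels[idx]).sum(dim=1)   # similarity-weighted label votes
    return votes.argmax(dim=1)                        # predicted instance per pixel

# Toy usage: 100 current pixels, 500 pooled past pixels, 3 instance classes.
pred = propagate_labels(torch.randn(100, 16), torch.randn(500, 16),
                        F.one_hot(torch.randint(0, 3, (500,)), num_classes=3).float())
```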


We evaluate performance using region similarity 𝒥 𝒥 \mathcal{J} and contour accuracy ℱ ℱ \mathcal{F} (as defined by  [ 39 ] ), with Table  3 reporting results. Our feature representation, learned by respecting the region hierarchy, benefits temporal matching for difficult cases, e.g.,  large motion, where local intensity is not reliable. Though not optimized for precise temporal matching, our approach still outperforms recent state-of-the-art video-based unsupervised approaches in both region and boundary quality benchmarks.

6 Conclusion

We propose a self-supervised learning framework that, leveraging only a visual primitive predictor trained on a small dataset, bootstraps visual feature representation learning on large-scale unlabeled image sets by optimizing a pixel-wise contrastive loss to respect a primitive grouping hierarchy. We demonstrate the effectiveness of our pixel-wise learning target, as deployed on unlabeled images of complex scenes with multiple objects, through fine-tuning to predict semantic segmentation. We also show that our learned features directly benefit the downstream applications of segment search and instance mask tracking.

7 Broader Impact

As an advance in self-supervised visual representation learning, our work may serve as a technical approach for a wide variety of applications that learn from unlabeled image datasets, with impact as varied as the potential applications. We believe that a compelling and practical use case is likely to be in domains where human annotation is especially difficult, such as medical imaging, and are hopeful that further development of these techniques will eventually have a positive impact in medical and scientific domains.

Acknowledgments and Disclosure of Funding

The authors have no competing interests.

  • [1] Arbeláez, P.: Boundary extraction in natural images using ultrametric contour maps. In: IEEE Conference on Computer Vision and Pattern Recognition Workshop (CVPRW) (2006)
  • [2] Arbeláez, P., Maire, M., Fowlkes, C., Malik, J.: Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2010)
  • [3] Arbeláez, P., Pont-Tuset, J., Barron, J.T., Marques, F., Malik, J.: Multiscale combinatorial grouping. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2014)
  • [4] Bertasius, G., Shi, J., Torresani, L.: High-for-low and low-for-high: Efficient boundary detection from deep object features and its applications to high-level vision. In: Proceedings of the IEEE International Conference on Computer Vision (2015)
  • [5] Caron, M., Bojanowski, P., Joulin, A., Douze, M.: Deep clustering for unsupervised learning of visual features. In: Proceedings of the European Conference on Computer Vision (ECCV) (2018)
  • [6] Chen, L.C., Papandreou, G., Schroff, F., Adam, H.: Rethinking atrous convolution for semantic image segmentation. arXiv:1706.05587 (2017)
  • [7] Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning (2020)
  • [8] Chen, Y., Pont-Tuset, J., Montes, A., Van Gool, L.: Blazingly fast video object segmentation with pixel-wise metric learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)
  • [9] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: IEEE Conference on Computer Vision and Pattern Recognition (2009)
  • [10] Doersch, C., Gupta, A., Efros, A.A.: Unsupervised visual representation learning by context prediction. In: Proceedings of the IEEE International Conference on Computer Vision (2015)
  • [11] Dollár, P., Zitnick, C.L.: Fast edge detection using structured forests. IEEE Transactions on Pattern Analysis and Machine Intelligence (2014)
  • [12] Donahue, J., Krähenbühl, P., Darrell, T.: Adversarial feature learning. In: International Conference on Learning Representations (2017)
  • [13] Donahue, J., Simonyan, K.: Large scale adversarial representation learning. In: Advances in Neural Information Processing Systems (2019)
  • [14] Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The Pascal visual object classes (VOC) challenge. International Journal of Computer Vision (2010)
  • [15] Fathi, A., Wojna, Z., Rathod, V., Wang, P., Song, H.O., Guadarrama, S., Murphy, K.P.: Semantic instance segmentation via deep metric learning. arXiv:1703.10277 (2017)
  • [16] Gidaris, S., Singh, P., Komodakis, N.: Unsupervised representation learning by predicting image rotations. In: International Conference on Learning Representations (2018)
  • [17] Hariharan, B., Arbeláez, P., Bourdev, L., Maji, S., Malik, J.: Semantic contours from inverse detectors. In: International Conference on Computer Vision (2011)
  • [18] Hariharan, B., Arbeláez, P., Girshick, R., Malik, J.: Hypercolumns for object segmentation and fine-grained localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015)
  • [19] He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2020)
  • [20] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
  • [21] Hwang, J.J., Yu, S.X., Shi, J., Collins, M.D., Yang, T.J., Zhang, X., Chen, L.C.: SegSort: Segmentation by discriminative sorting of segments. In: Proceedings of the IEEE International Conference on Computer Vision (2019)
  • [22] Ilg, E., Mayer, N., Saikia, T., Keuper, M., Dosovitskiy, A., Brox, T.: Flownet 2.0: Evolution of optical flow estimation with deep networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017)
  • [23] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: International Conference on Learning Representations (2015)
  • [24] Kingma, D.P., Welling, M.: Auto-encoding variational bayes. arXiv:1312.6114 (2013)
  • [25] Kokkinos, I.: Pushing the boundaries of boundary detection using deep learning. arXiv:1511.07386 (2015)
  • [26] Kong, S., Fowlkes, C.: Multigrid predictive filter flow for unsupervised learning on videos. arXiv:1904.01693 (2019)
  • [27] Kong, S., Fowlkes, C.C.: Recurrent pixel embedding for instance grouping. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)
  • [28] Larsson, G., Maire, M., Shakhnarovich, G.: Learning representations for automatic colorization. In: European Conference on Computer Vision (2016)
  • [29] Larsson, G., Maire, M., Shakhnarovich, G.: Colorization as a proxy task for visual understanding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017)
  • [30] Li, Y., Paluri, M., Rehg, J.M., Dollár, P.: Unsupervised learning of edges. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
  • [31] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: European Conference on Computer Vision (2014)
  • [32] Liu, C., Yuen, J., Torralba, A., Sivic, J., Freeman, W.T.: SIFT flow: Dense correspondence across different scenes. In: European Conference on Computer Vision (2008)
  • [33] Mahendran, A., Thewlis, J., Vedaldi, A.: Cross pixel optical-flow similarity for self-supervised learning. In: Asian Conference on Computer Vision (2018)
  • [34] Martin, D., Fowlkes, C., Tal, D., Malik, J.: A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In: IEEE International Conference on Computer Vision (2001)
  • [35] Noroozi, M., Favaro, P.: Unsupervised learning of visual representations by solving jigsaw puzzles. In: European Conference on Computer Vision (2016)
  • [36] Oord, A.v.d., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. arXiv:1807.03748 (2018)
  • [37] Pathak, D., Girshick, R., Dollár, P., Darrell, T., Hariharan, B.: Learning features by watching objects move. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017)
  • [38] Pathak, D., Krähenbühl, P., Donahue, J., Darrell, T., Efros, A.: Context encoders: Feature learning by inpainting. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
  • [39] Perazzi, F., Pont-Tuset, J., McWilliams, B., Van Gool, L., Gross, M., Sorkine-Hornung, A.: A benchmark dataset and evaluation methodology for video object segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
  • [40] Pont-Tuset, J., Perazzi, F., Caelles, S., Arbeláez, P., Sorkine-Hornung, A., Van Gool, L.: The 2017 DAVIS challenge on video object segmentation. arXiv:1704.00675 (2017)
  • [41] Shen, W., Wang, X., Wang, Y., Bai, X., Zhang, Z.: Deepcontour: A deep convolutional feature learned by positive-sharing loss for contour detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015)
  • [42] Tian, Y., Krishnan, D., Isola, P.: Contrastive multiview coding. arXiv:1906.05849 (2019)
  • [43] Vondrick, C., Shrivastava, A., Fathi, A., Guadarrama, S., Murphy, K.: Tracking emerges by colorizing videos. In: Proceedings of the European Conference on Computer Vision (2018)
  • [44] Wang, X., Jabri, A., Efros, A.A.: Learning correspondence from the cycle-consistency of time. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2019)
  • [45] Wu, Z., Xiong, Y., Yu, S.X., Lin, D.: Unsupervised feature learning via non-parametric instance discrimination. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)
  • [46] Xie, S., Tu, Z.: Holistically-nested edge detection. In: Proceedings of the IEEE International Conference on Computer Vision (2015)
  • [47] Yang, J., Price, B., Cohen, S., Lee, H., Yang, M.H.: Object contour detection with a fully convolutional encoder-decoder network. In: IEEE Conference on Computer Vision and Pattern Recognition (2016)
  • [48] Ye, M., Zhang, X., Yuen, P.C., Chang, S.F.: Unsupervised embedding learning via invariant and spreading instance feature. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2019)
  • [49] Zhan, X., Pan, X., Liu, Z., Lin, D., Loy, C.C.: Self-supervised learning via conditional motion propagation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2019)
  • [50] Zhang, R., Isola, P., Efros, A.A.: Colorful image colorization. In: European Conference on Computer Vision (2016)
  • [51] Zhang, R., Isola, P., Efros, A.A.: Split-brain autoencoders: Unsupervised learning by cross-channel prediction. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017)
  • [52] Zhuang, C., Zhai, A.L., Yamins, D.: Local aggregation for unsupervised learning of visual embeddings. In: Proceedings of the IEEE International Conference on Computer Vision (2019)

ar5iv homepage

Self-Supervised Visual Representation Learning from Hierarchical Grouping

Part of Advances in Neural Information Processing Systems 33 (NeurIPS 2020)

Xiao Zhang, Michael Maire

We create a framework for bootstrapping visual representation learning from a primitive visual grouping capability. We operationalize grouping via a contour detector that partitions an image into regions, followed by merging of those regions into a tree hierarchy. A small supervised dataset suffices for training this grouping primitive. Across a large unlabeled dataset, we apply this learned primitive to automatically predict hierarchical region structure. These predictions serve as guidance for self-supervised contrastive feature learning: we task a deep network with producing per-pixel embeddings whose pairwise distances respect the region hierarchy. Experiments demonstrate that our approach can serve as state-of-the-art generic pre-training, benefiting downstream tasks. We additionally explore applications to semantic region search and video-based object instance tracking.

Name Change Policy

Requests for name changes in the electronic proceedings will be accepted with no questions asked. However name changes may cause bibliographic tracking issues. Authors are asked to consider this carefully and discuss it with their co-authors prior to requesting a name change in the electronic proceedings.

Use the "Report an Issue" link to request a name change.

self supervised visual representation learning from hierarchical grouping

  • I have joined the University of Chicago as an Assistant Professor in Computer Science!
  • Two papers appearing at ECCV 2018: Sparsely Aggregated CNNs and Self-Supervised Relative Depth Learning
  • New arXiv paper on sparsely aggregated CNNs, an architectural improvement over DenseNet.
  • Our work on a new method for regularizing deep networks will appear at CVPR 2018.
  • Joint Workshop of the COCO and Places Challenges to be held October 29 at ICCV 2017 . A new stuff segmentation challenge has been added alongside the COCO detection and keypoint challenges.
  • Multigrid CNN code now available.
  • Colorization as a proxy task for self-supervised learning of visual representations
  • Multigrid neural architectures (with implications for learning attention)
  • FractalNet paper updated with results on ImageNet (accepted to ICLR 2017).
  • Common Visual Data Foundation launched!
  • Our colorization work will be presented at ECCV 2016.
  • Joint ImageNet and COCO workshop to be held October 9 at ECCV 2016 .
  • New arXiv paper on fractal networks, an alternative ultra-deep architecture.
  • Our colorization work is featured in the NVIDIA Accelerated Computing Newsletter .
  • New arXiv paper on automatic colorization.
  • Affinity CNN arXiv paper available (accepted to CVPR 2016).
  • ICCV'15 paper on direct intrinsics now available.
  • MS COCO and ImageNet LSVRC workshop to be held at ICCV 2015 .
  • CVPR'15 paper on learning lightness now available.
  • ACCV'14 paper on reconstructive sparse code transfer available below.
  • I joined TTI-Chicago in Fall 2014.
  • I co-organized the Perceptual Organization in Computer Vision (POCV) workshop at CVPR 2014 . Slides for many talks are now available on the program .
  • Xin Yuan (University of Chicago)
  • Xiao Zhang (University of Chicago)
  • Sudarshan Babu (TTIC)
  • Pedro Savarese (TTIC)
  • Tri Huynh (University of Chicago), Ph.D., 2021 → Google Thesis: Universal Neural Memory Architectures: Multigrid Connectivity, Domain-Agnostic Geometry, and Local Operators

self supervised visual representation learning from hierarchical grouping

Information-Theoretic Segmentation by Inpainting Error Maximization

self supervised visual representation learning from hierarchical grouping

Domain-Independent Dominance of Adaptive Methods

self supervised visual representation learning from hierarchical grouping

Multimodal Contrastive Training for Visual Representation Learning

self supervised visual representation learning from hierarchical grouping

Are Machine Learning Cloud APIs Used Correctly?

self supervised visual representation learning from hierarchical grouping

Growing Efficient Deep Networks by Structured Continuous Sparsification

self supervised visual representation learning from hierarchical grouping

Winning the Lottery with Continuous Sparsification

self supervised visual representation learning from hierarchical grouping

Self-Supervised Visual Representation Learning from Hierarchical Grouping

self supervised visual representation learning from hierarchical grouping

Multigrid Neural Memory

self supervised visual representation learning from hierarchical grouping

Orthogonalized SGD and Nested Architectures for Anytime Neural Networks

self supervised visual representation learning from hierarchical grouping

ALERT: Accurate Anytime Learning for Energy and Timeliness

self supervised visual representation learning from hierarchical grouping

Pixel Consensus Voting for Panoptic Segmentation

self supervised visual representation learning from hierarchical grouping

Learning Implicitly Recurrent CNNs Through Parameter Sharing

self supervised visual representation learning from hierarchical grouping

Sparsely Aggregated Convolutional Networks

self supervised visual representation learning from hierarchical grouping

Self-Supervised Relative Depth Learning for Urban Scene Understanding

self supervised visual representation learning from hierarchical grouping

Regularizing Deep Networks by Modeling and Predicting Label Structure

self supervised visual representation learning from hierarchical grouping

Colorization as a Proxy Task for Visual Understanding

self supervised visual representation learning from hierarchical grouping

Multigrid Neural Architectures

self supervised visual representation learning from hierarchical grouping

FractalNet: Ultra-Deep Neural Networks without Residuals

self supervised visual representation learning from hierarchical grouping

Learning Representations for Automatic Colorization

self supervised visual representation learning from hierarchical grouping

Affinity CNN: Learning Pixel-Centric Pairwise Relations for Figure/Ground Embedding

self supervised visual representation learning from hierarchical grouping

Direct Intrinsics: Learning Albedo-Shading Decomposition by Convolutional Regression

self supervised visual representation learning from hierarchical grouping

Learning Lightness from Human Judgement on Relative Reflectance

self supervised visual representation learning from hierarchical grouping

Reconstructive Sparse Code Transfer for Contour Detection and Semantic Labeling

self supervised visual representation learning from hierarchical grouping

Microsoft COCO: Common Objects in Context

self supervised visual representation learning from hierarchical grouping

Progressive Multigrid Eigensolvers for Multiscale Spectral Segmentation

self supervised visual representation learning from hierarchical grouping

Hierarchical Scene Annotation

self supervised visual representation learning from hierarchical grouping

Object Detection and Segmentation from Joint Embedding of Parts and Pixels

self supervised visual representation learning from hierarchical grouping

Occlusion Boundary Detection and Figure/Ground Assignment from Optical Flow

self supervised visual representation learning from hierarchical grouping

Contour Detection and Hierarchical Image Segmentation

self supervised visual representation learning from hierarchical grouping

Simultaneous Segmentation and Figure/Ground Organization using Angular Embedding

self supervised visual representation learning from hierarchical grouping

From Contours to Regions: An Empirical Evaluation

self supervised visual representation learning from hierarchical grouping

Using Contours to Detect and Localize Junctions in Natural Images

self supervised visual representation learning from hierarchical grouping

SVM-KNN: Discriminative Nearest Neighbor Classification for Visual Category Recognition

self supervised visual representation learning from hierarchical grouping

Making Latin Manuscripts Searchable using gHMM's

self supervised visual representation learning from hierarchical grouping

Names and Faces in the News

self supervised visual representation learning from hierarchical grouping

Recognition by Probabilistic Hypothesis Construction

self supervised visual representation learning from hierarchical grouping

Contour Detection and Image Segmentation

self supervised visual representation learning from hierarchical grouping

Dynamic Code Updates

Self-Supervised Visual Representation Learning with Semantic Grouping

Part of Advances in Neural Information Processing Systems 35 (NeurIPS 2022) Main Conference Track

Xin Wen, Bingchen Zhao, Anlin Zheng, Xiangyu Zhang, Xiaojuan Qi

In this paper, we tackle the problem of learning visual representations from unlabeled scene-centric data. Existing works have demonstrated the potential of utilizing the underlying complex structure within scene-centric data; still, they commonly rely on hand-crafted objectness priors or specialized pretext tasks to build a learning framework, which may harm generalizability. Instead, we propose contrastive learning from data-driven semantic slots, namely SlotCon, for joint semantic grouping and representation learning. The semantic grouping is performed by assigning pixels to a set of learnable prototypes, which can adapt to each sample by attentive pooling over the feature and form new slots. Based on the learned data-dependent slots, a contrastive objective is employed for representation learning, which enhances the discriminability of features, and conversely facilitates grouping semantically coherent pixels together. Compared with previous efforts, by simultaneously optimizing the two coupled objectives of semantic grouping and contrastive learning, our approach bypasses the disadvantages of hand-crafted priors and is able to learn object/group-level representations from scene-centric images. Experiments show our approach effectively decomposes complex scenes into semantic groups for feature learning and significantly benefits downstream tasks, including object detection, instance segmentation, and semantic segmentation. Code is available at: https://github.com/CVMI-Lab/SlotCon.

Name Change Policy

Requests for name changes in the electronic proceedings will be accepted with no questions asked. However name changes may cause bibliographic tracking issues. Authors are asked to consider this carefully and discuss it with their co-authors prior to requesting a name change in the electronic proceedings.

Use the "Report an Issue" link to request a name change.

Thank you for visiting nature.com. You are using a browser version with limited support for CSS. To obtain the best experience, we recommend you use a more up to date browser (or turn off compatibility mode in Internet Explorer). In the meantime, to ensure continued support, we are displaying the site without styles and JavaScript.

  • View all journals
  • My Account Login
  • Explore content
  • About the journal
  • Publish with us
  • Sign up for alerts
  • Open access
  • Published: 25 January 2022

A self-supervised domain-general learning framework for human ventral stream representation

  • Talia Konkle   ORCID: orcid.org/0000-0003-1738-4744 1 &
  • George A. Alvarez 1  

Nature Communications volume  13 , Article number:  491 ( 2022 ) Cite this article

13k Accesses

27 Citations

65 Altmetric

Metrics details

  • Learning algorithms
  • Object vision

Anterior regions of the ventral visual stream encode substantial information about object categories. Are top-down category-level forces critical for arriving at this representation, or can this representation be formed purely through domain-general learning of natural image structure? Here we present a fully self-supervised model which learns to represent individual images, rather than categories, such that views of the same image are embedded nearby in a low-dimensional feature space, distinctly from other recently encountered views. We find that category information implicitly emerges in the local similarity structure of this feature space. Further, these models learn hierarchical features which capture the structure of brain responses across the human ventral visual stream, on par with category-supervised models. These results provide computational support for a domain-general framework guiding the formation of visual representation, where the proximate goal is not explicitly about category information, but is instead to learn unique, compressed descriptions of the visual world.

Similar content being viewed by others

self supervised visual representation learning from hierarchical grouping

Conscious perception of natural images is constrained by category-related visual features

Daniel Lindh, Ilja G. Sligte, … Ian Charest

self supervised visual representation learning from hierarchical grouping

Orthogonal Representations of Object Shape and Category in Deep Convolutional Neural Networks and Human Visual Cortex

Astrid A. Zeman, J. Brendan Ritchie, … Hans Op de Beeck

self supervised visual representation learning from hierarchical grouping

A map of object space in primate inferotemporal cortex

Pinglei Bao, Liang She, … Doris Y. Tsao

Introduction

Patterned light hitting the retina is transformed through a hierarchy of processing stages in the ventral visual stream, driving to a representational format that enables us to discriminate, identify, categorize, and remember thousands of different objects 1 , 2 , 3 , 4 , 5 , 6 . What pressures govern the formation of this visual representation? This is a deeply debated question, with proposals balancing the relative guiding influence of innate biases versus the structure of the visual input statistics, and the degree to which learning operates over more domain-specialized versus domain-general architectures 3 , 7 , 8 , 9 , 10 , 11 , 12 , 13 .

For example, some prominent theoretical accounts of high-level visual system organization assert that the representations are explicitly about object categories, and that category-level (“domain-level”) forces are critical for guiding this visual representational format 7 , 11 , 14 , 15 , 16 . For example, these theories argue that visual representation formation is guided by distinct long-range network connectivity to help learn features in support of broad conceptual distinctions (e.g. for animate or inanimate entities 14 , 15 ), or to guide the learning of a specific set of categories with particular functional relevance (e.g., faces, bodies, and scenes 3 , 11 ).

And, in what has sometimes been taken as converging support for the role of category-level pressures in forming visual representations, deep convolutional neural network models—trained directly to support object categorization—develop hierarchical feature spaces that show an emergent match with brain responses 17 , 18 , 19 , 20 , 21 , 22 , 23 , 24 (see refs. 25 , 26 for review). However, on deeper examination, it is not clear that the category-level supervisory signals involved in training deep neural networks are a good proxy for the representational pressures implied in the domain-level cognitive neuroscience theories. In particular, these deep neural networks models are trained with much finer-grained distinctions at the subordinate category level (e.g., with features that are explicitly guided to differentiate among three different kinds of crabs, not to mention the 90 breeds of dogs, present among the 1000 categories of the ImageNet database 27 ). Thus, it is clear these category-supervised deepnet models are operationalizing category-level pressures in a different way than is generally posited by most cognitive neuroscience theories.

Alternate theories of visual representation formation put relatively more weight on the structure of the natural image statistics, and relatively less weight on downstream output needs driving visual representation formation 9 , 10 , 12 , 28 , 29 . These theoretical proposals argue that the visual cortex is a generic covariance extractor and that there are systematic differences in the way things look (e.g., among faces and scenes; among animals, big objects, and small objects)–it is these perceptual feature differences that actually underlie the ‘categorical’ distinctions of high-level visual responses 28 , 30 , 31 . On this account, visual learning is less a process of enrichment (i.e., building new features for each new category) and more a process of differentiation (i.e., learning to seek out distinguishing features that are already present in the visual input 32 , 33 ). A strong version of this domain-general theoretical framework posits that learning good visual representation does not at all rely on presupposing categories, leveraging labels, or otherwise drawing on any specialized architectural constraints for some kinds of stimuli over others. However, a key challenge remains to make this domain-general proposal more computationally explicit: what is an alternative representation goal, if not category-supervision, that might serve as an internal learning signal to draw out useful structure from natural image statistics?

Key insight into this challenge comes from work that changed the objective from learning features that can discriminate all categories from each other to instead learning features that can discriminate every view from every other view 34 . The logic here is that that views of objects from the same category will naturally project nearby in such a feature space, due solely to the statistical structure of the input and the generic architectural prior, without explicit category-level pressure. Inspired by this insight, here we developed a learning framework that is fully self-supervised, called instance-prototype contrastive learning (IPCL). The model operates by taking multiple samples over an image and projecting these through a deep convolutional neural network backbone into a low-dimensional embedding space. To learn instance-level structure, the model tries to map these multiple samples of the views nearby by in the latent space (towards the “instance-prototype”), while also making this embedding distinct from the representations of recently encountered views (“contrastive learning”). As such, the final representational format can be conceived of as a high-fidelity perceptual representation, capable of fine-grained discrimination between views. Within the broader cognitive neuroscience context, this model thus operationalizes a domain-general view of visual representation learning, where no specialized pressures beyond the visual system are required to guide the format of visual representation.

In the present work, we show that our instance-prototype contrastive learning models indeed have naturally emergent category structure in the latent space. Further, these models learn hierarchical visual feature spaces that can capture brain response structure, on par with category-supervised models, at or near the noise ceiling of the data in most regions, in two condition-rich fMRI datasets. Concurrent with the present work, and at an extremely rapid pace, new self-supervised instance-level contrastive learning models have been introduced which have even higher emergent categorization accuracy 35 , 36 , 37 , 38 , 39 , 40 ; however, we find the representational spaces learned in these more performant feature spaces are not increasingly more brain-like (c.f. ref. 23 ). As a whole, this work invites a shift away from the category-based specialized framework that has been dominant in high-level visual cognitive neuroscience, providing an alternative conceptual framework in which the representational goal of the visual system is to capture fine-grained visual differences in a useful compressed format, learnable with domain-general mechanisms. Critically, for this argument, the models are not intended to be direct models of the biological brain per se, but rather to serve as computational existence proofs of what kind of representational formats are learnable from the input given certain constraints. As such, the degree to which these models learn representational formats that show correspondence with the visual system provides computational-empirical plausibility for a domain-general view of the formation of visual system representation.

Instance-prototype contrastive learning

We designed an instance-prototype contrastive learning algorithm (IPCL) to learn a representation of visual object information in a fully self-supervised manner, depicted in Fig. 1a. The overarching goal is to learn a low-dimensional embedding of natural images, in which sampled views of the same image lie near each other in this space and are also separable from the embeddings of all other images.

Figure 1

a Overview of the self-supervised instance-prototype contrastive learning (IPCL) model which learns instance-level representations without category or instance labels. b t-SNE visualization of 500 images from ten ImageNet categories, showing emergent category clusters in deepnet feature space. c All stimuli for the two fMRI datasets. Note that in this figure, the face image has been covered, to remove identifying information. d View from the bottom of the brain, showing voxel-wise reliability across the ventral visual stream for the Object Orientation dataset (top) and Inanimate Objects dataset (bottom). The color bar indicates the Pearson correlation between odd and even halves of the data. e Overview of the voxel-wise encoding RSA procedure. Source data are provided as a Source Data file.

To do so, each image is sampled with 5 augmentations, allowing for crops, rescaling, and color jitter (following the same parameters as in ref. 41 ). These samples are passed through a deep convolutional neural network backbone and projected into a 128-dimensional embedding space, which is L2-normed so that all image embeddings lie on the unit hypersphere. The contrastive learning objective has two component terms. First, the model tries to make the embeddings of these augmented views similar to each other by moving them towards the average representation among these views—the “instance prototype.” Simultaneously, the model tries to make these representations dissimilar from those of recently encountered items, which are stored in a lightweight memory queue of the most recent 4096 images—the “contrastive” component. See the  Supplementary Information for the more precise mathematical formulation of this loss function.
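As a concrete illustration, a minimal PyTorch-style sketch of such an instance-prototype contrastive loss is shown below (the temperature value, tensor layout, and function names are illustrative assumptions rather than the exact implementation; the precise loss is given in the Supplementary Information):

```python
import torch
import torch.nn.functional as F

def ipcl_loss(embeddings, queue, n_samples=5, temperature=0.07):
    """Instance-prototype contrastive loss (illustrative sketch).

    embeddings: (batch * n_samples, 128) L2-normalized outputs for the
                augmented views, grouped so consecutive rows share an image.
    queue:      (queue_size, 128) L2-normalized embeddings of recently
                encountered images, used as negatives.
    """
    d = embeddings.shape[1]
    views = embeddings.view(-1, n_samples, d)            # (batch, n_samples, d)

    # Instance prototype: the average of the augmented views, re-normalized
    # back onto the unit hypersphere.
    prototypes = F.normalize(views.mean(dim=1), dim=1)   # (batch, d)

    # Similarity of every view to its own prototype (positive term) ...
    pos = torch.einsum('bvd,bd->bv', views, prototypes) / temperature

    # ... and to the recent-history negatives in the memory queue.
    neg = torch.einsum('bvd,qd->bvq', views, queue) / temperature

    # Softmax over [positive, negatives]: pull views toward the prototype,
    # push them away from recently encountered items.
    logits = torch.cat([pos.unsqueeze(-1), neg], dim=-1)
    labels = torch.zeros(logits.shape[:2], dtype=torch.long,
                         device=embeddings.device)
    return F.cross_entropy(logits.flatten(0, 1), labels.flatten())
```

In this sketch the prototype is simply the re-normalized mean of the augmented views, and the memory queue supplies the negative terms of the softmax.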

For the convolutional neural network backbone, we used an AlexNet architecture 42, modified to have group-normalization layers (N = 32 groups; see Supplementary Fig. 1) rather than standard batch normalization, which was important for stabilizing the learning process. While traditional batch normalization normalizes each channel across the full image batch, group normalization normalizes across groups of feature channels within each individual image 43, with intriguing parallels to divisive normalization operations in the visual system 44, 45. Five IPCL models were trained under this learning scheme, with slightly different training variations; all training details can be found in the Supplementary Information.
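As an illustration of this architectural change, a convolutional block of the form below swaps batch normalization for group normalization (the channel counts and keyword arguments are placeholders; the full AlexNet-gn configuration is given in Supplementary Fig. 1):

```python
import torch.nn as nn

def conv_block(in_channels, out_channels, num_groups=32, **conv_kwargs):
    """Conv -> GroupNorm -> ReLU, normalizing groups of channels within each
    image rather than each channel across the batch (cf. batchnorm)."""
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, **conv_kwargs),
        nn.GroupNorm(num_groups, out_channels),  # N = 32 groups, per image
        nn.ReLU(inplace=True),
    )

# Example: the first AlexNet-style convolution with group normalization.
block = conv_block(3, 64, kernel_size=11, stride=4, padding=2)
```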

Emergent object category information

To examine whether these self-supervised models show any emergent object category similarity structure in the embedding space, we used two standard methods to assess 1000-way classification accuracy on ImageNet. The k-nearest neighbor (kNN) method assigns each image a label by finding the k (=200) nearest neighbors in the feature space, assigning each of the 1000 possible labels a weight based on their prevalence amongst the neighbors (scaled by similarity to the target), and scoring classification as correct when the top-weighted class matches the correct class (top-1 kNN accuracy 41). The linear evaluation protocol trains a new 1000-way classification layer over the features of the penultimate layer to estimate how often the top predicted label matches the actual label of each image 38, 39 (see Supplementary Information for method details).
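The weighted kNN evaluation can be sketched as follows (assuming L2-normalized features and integer class labels; the exponential similarity weighting and temperature follow the general recipe of ref. 41, but the exact values here are assumptions):

```python
import torch

def knn_top1_accuracy(test_feats, test_labels, bank_feats, bank_labels,
                      k=200, num_classes=1000, temperature=0.07):
    """Weighted kNN classification over L2-normalized features.

    Each test image gathers votes from its k nearest neighbors in the
    feature bank, weighted by exponentiated cosine similarity; the class
    with the largest total weight is the prediction.
    """
    sims = test_feats @ bank_feats.T                 # cosine similarities
    topk_sims, topk_idx = sims.topk(k, dim=1)
    topk_labels = bank_labels[topk_idx]              # (n_test, k), long dtype

    weights = (topk_sims / temperature).exp()        # similarity weighting
    votes = torch.zeros(test_feats.shape[0], num_classes)
    votes.scatter_add_(1, topk_labels, weights)      # accumulate class votes

    predictions = votes.argmax(dim=1)
    return (predictions == test_labels).float().mean().item()
```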

Object category read-out from the primary IPCL models achieved an average top-1 kNN accuracy of 37.3% (35.4−38.4%) from the embedding space and 37.1% (32.2−39.7%) from the penultimate layer (fc7). In contrast, untrained models with a matched architecture show minimal object categorization capacity, with a top-1 kNN accuracy of 3.5% (3.3−3.8%) and a top-1 linear evaluation accuracy of 7.2% (fc7). Figure 1b visualizes the category structure of an IPCL model, showing a t-SNE plot with a random selection of 500 images from ten categories, arranged so that images with similar IPCL activations in the final output layer are nearby in the plot. It is clear that images from the same category cluster together. Thus, these fully self-supervised IPCL models have learned a feature space that implicitly captures some object category structure, with no explicit representational pressure to do so.

For comparison, we trained a category-supervised model with matched architecture and visual diet and tested its categorization accuracy with the same metrics as the self-supervised models. The kNN top-1 accuracy was 58.8%, with a linear readout of 55.7% from the penultimate layer (fc7). An additional category-supervised matched-architecture model, trained with only one augmentation per image rather than 5 (one augmentation being the more standard supervised training protocol), showed similar classification accuracy (readout from fc7: kNN top-1: 55.5%; linear evaluation top-1: 54.5%). Thus, these matched-architecture category-supervised models have notably better categorization accuracy on the ImageNet database than our IPCL-trained models. Supplementary Table 1 reports the categorization accuracies for all of the individual models.

Relationship to the structure of human brain responses

To the extent that categorization capacity is indicative of brain-like representation in this accuracy regime 23 , we would expect these fully self-supervised models to have feature spaces with at least some emergent brain-like correspondence, but not as strong as category-supervised models. However, it is also possible that feature spaces learned in these self-supervised models have comparable or even more brain-like feature spaces than category-supervised models (e.g., if the instance-level representational goal more closely aligns with that driving visual system tuning). Thus, we next examined the degree to which the IPCL feature spaces have an emergent brain-like correspondence, relative to the category-supervised models.

Brain responses were measured using functional magnetic resonance imaging (fMRI) in two different condition-rich experiments (Fig. 1c; see Methods and Supplementary Information), using a powerful 4 s mini-block design that provides reliable estimates of responses to individual items 46. The Object Orientation dataset included images of eight items presented at five different in-plane orientations; this stimulus set probes for item-level orientation tolerance along the ventral visual hierarchy, while spanning the animate/inanimate domain. The Inanimate Objects dataset included images of 72 everyday objects; this stimulus set probes finer-grained distinctions within the inanimate domain. Thus, these two stimulus sets provide complementary views into object similarity structure. The resulting data revealed reliable voxel-level responses along the ventral visual stream (Fig. 1d; see Methods). To delineate brain regions along the hierarchical axis of the ventral stream, we defined three brain sectors reflecting the early visual areas (V1–V3), the posterior occipito-temporal cortex (pOTC), and the anterior occipito-temporal cortex (aOTC; see Methods). Within these sectors, individual subject representational geometries were reliable and consistent across subjects, yielding highly reliable group-averaged representational geometries (EarlyV split-half reliability: r = 0.86–0.90; pOTC: r = 0.75–0.90; aOTC: r = 0.60–0.89), providing a robust target to predict with different deep neural networks.

To relate the representations learned by these deep neural networks to brain sector responses along the ventral visual hierarchy, we used an approach that leveraged both voxel-wise encoding methods 47, 48 and representational similarity 49, which we subsequently refer to as voxel-wise-encoding RSA (veRSA; Fig. 1e; see Methods; see also refs. 50, 51). This method fits an encoding model at each voxel independently, using weighted combinations of deepnet units ( \(W\) ), to predict the univariate response profile. Then, the set of voxel encoding models is used to predict multi-voxel pattern responses to new items ( \(\hat{R}\) ) and to derive the predicted representational geometry in this encoded space ( \(\hat{G}\) ). This predicted RDM is then compared to the RDM of the brain sector ( \(G\) ), as the key measure of how well the layer's features fit that brain region. This analysis choice places theoretical value on the response magnitude of a voxel as an informative brain signature, while also reflecting the theoretical position that neurons across the cortex participate in a unified population code.

The brain predictivity of the models is depicted in Fig. 2. The results show that the IPCL models achieve parity with the category-supervised models in accounting for the structure of brain responses, evident across both datasets and at all three levels of the hierarchy. Each plot shows the layer-wise correlations between the predicted and measured brain representational geometry, with all IPCL models in blue (with multiple lines reflecting replicates of the same model with slight training variations, see Methods), and category-supervised models in orange. The adjacent plots show the maximum model correlation, reflecting the layer with the strongest correlation with the brain RDM, computed with a cross-validated procedure to prevent double-dipping (cv max-r; see Methods), plotted for IPCL models, category-supervised models, and an untrained model. Supplementary Table 2 reports the statistical tests comparing the brain predictivity between IPCL and category-supervised models: in 56/60 comparisons, the cross-validated max correlation for the IPCL models is greater than or not significantly different from that of the category-supervised models, and, with Bonferroni correction for multiple comparisons, category-supervised models never showed a significantly higher correlation than an IPCL model.

Figure 2

a Visualization of the ventral stream regions of interest spanning the visual hierarchy from posterior to anterior (EarlyV, pOTC, aOTC). b and c show the veRSA results for the Object Orientation and Inanimate Objects datasets, respectively. Each panel plots the mean correlation between model RDMs and neural RDMs (y-axis), averaged over split-halves of the brain data, shown separately for each model layer (x-axis) and brain region (rows). All IPCL models are in blue, and category-supervised models are in orange. The thickness of each line reflects 95% confidence intervals based on 1000 bootstrapped samples across split-halves. Bar plots show cross-validated estimates of the maximum correlation across model layers for each model class (IPCL in blue, category-supervised in orange, and an untrained model in gray). Error bars reflect a mirrored density plot (violin plot) showing the distribution of correlations across all split-halves, aggregated across instances of a given model type. Distributions are cut off at ±1.5 IQR (interquartile range, Q3–Q1). Source data are provided as a Source Data file.

Further, all models account for a large proportion of the explainable variance in these highly reliable brain representational geometries, though with a noticeable difference between the two datasets. Considering the Object Orientation dataset, the proportion of explainable variance accounted for approached the noise ceiling in all sectors for both IPCL and the category-supervised models (mean IPCL: 88, 84, 94%; category-supervised: 82, 91, 87%; noise ceiling: r = 0.90, 0.90, 0.89; for EarlyV, pOTC, and aOTC, respectively). However, considering the Inanimate Objects dataset, neither the IPCL nor the category-supervised counterpart models learned feature spaces that reached as close to the noise ceiling, leaving increasingly more variance unaccounted for along the hierarchy (mean IPCL: 74, 47, 32%; category-supervised: 65, 41, 28%; noise ceiling: r = 0.86, 0.74, 0.60; for EarlyV, pOTC, and aOTC, respectively). These results reveal that the particular stimulus distinctions emphasized in the dataset matter, as they dramatically impact whether the representations learned by these models appear fully brain-like or fall short of the noise ceiling.

Finally, these results also generally show a hierarchical convergence between brains and deep neural networks, with earlier layers capturing the structure best in the early visual cortex, and later layers capturing the structure best in the occipito-temporal cortex. Unexpectedly, we also found that the untrained models were competitive with the trained models in accounting for responses in EarlyV, and partially in pOTC, whereas both IPCL and category-supervised models clearly outperform untrained models in aOTC. A deeper inspection revealed that the predicted representational distances in untrained models hover around zero, which is consistent with the fact that they cannot classify object categories very well. However, these feature spaces nevertheless contain small differences that are consistent with the brain data, amplified by the voxel-wise encoding procedure. Further, the use of group-normalization layers also boosts untrained models; for example, local response normalization or batch normalization generally fits brain responses less well, particularly in the early visual cortex (see Supplementary Fig. 2). These findings highlight that there are useful architectural inductive biases present in untrained networks.

Overall, these results show that our instance-prototype contrastive learning models, trained without category-level labels, can capture the structure of human brain responses to objects along the visual hierarchy, on par with the category-supervised models. This pattern holds even in later stages of the ventral visual stream, where inductive biases alone are not sufficient to predict brain responses.

Varying the visual diet

As some of the reliable brain responses in the later hierarchical stages of the Inanimate Objects dataset were unexplained, we next explored whether variations in the visual diet of the IPCL models might increase their brain predictivity. For example, the pressure to learn instance-level representations over a more diverse diet of visual input might result in richer feature representations that better capture the structure of neural representations, particularly in the later brain stages reflecting finer-grained inanimate object distinctions. However, it is also possible that the relatively close-scale and centered views of objects present in the ImageNet database are critical for learning object-relevant feature spaces, and that introducing additional content (e.g., from faces and scenes) will detrimentally affect the capacity of the learned feature space to account for these object-focused brain datasets.

To probe the influence of visual diet, we trained six new IPCL models over different training image sets (Fig. 3a; see Methods, Supplementary Information, and Supplementary Table 1), and compared their brain predictivity to the ImageNet-trained baseline. First, because we made some changes to the image augmentations to accommodate all image sets, we trained a new baseline IPCL model on ImageNet. Second, we used object-focused images from a different dataset as a test of near transfer (OpenImages 52, 53). The third dataset was scene images (Places2; ref. 54), which we consider an intermediate-transfer test, as models trained to do scene categorization also learn object-selective features 55. The fourth dataset was faces (VGGFace2; ref. 56), a far-transfer test that allows us to explore whether a visual diet composed purely of close-up faces learns features that are sufficient to capture the structure of brain responses to isolated objects. The fifth dataset included a mixture of objects, faces, and places, which provides a richer diet that spans traditional visual domains, with the total number of images per epoch matched to the ImageNet dataset. The sixth dataset had the same mixture but used three times as many images per epoch, to test whether increased exposure is necessary to learn useful representations with this more diverse dataset.

Figure 3

a Example images in the style of each dataset are shown. For OpenImagesV6 and Places2, similar images were found from commons.wikimedia.org. For VGGFace2, images were generated from thispersondoesnotexist.com. b The cross-validated maximum correlation (cv max-r) between model RDMs and neural RDMs for each dataset (rows) and each brain region (columns). Mean scores are shown with a black dot at the center of a mirrored density plot (violin plot) showing the distribution of correlations across all split-halves (distributions are cut off at ±1.5 IQR, interquartile range, Q3–Q1). The dashed black lines indicate the ±1.5 IQR range for the matched baseline IPCL model trained on ImageNet. Source data are provided as a Source Data file.

For each of these six models trained with different kinds of visual experience, we used the same veRSA approach and then calculated the cross-validated maximum correlation across layers (see Methods). The results are plotted in Fig.  3b , where the five IPCL models with different visual experiences (colored violin plots) are plotted in the context of the new baseline IPCL model trained on ImageNet (black dashed lines).

The overarching pattern of results shows that the visual diet actually had very little effect on how well the learned feature spaces could capture the object similarity structure measured in the brain responses. Quantitatively, the mean absolute difference in brain predictivity from the baseline ImageNet-trained model was ∆r < 0.044 (range of signed differences −0.202 to 0.040). The visible outlier is the model trained only with views of faces. The features learned by this model were significantly less able to capture the structure of the Object Orientation dataset in both the posterior and anterior occipito-temporal cortex, with a difference from the baseline model >2.5 standard deviations from the mean difference across all comparisons (pOTC: z = 3.67; aOTC: z = 3.21). However, the feature spaces of this model were still able to capture the differences among objects in the Inanimate Objects dataset in EarlyV and pOTC (though with a small reliable difference in pOTC), and were not different from the ImageNet-trained baseline in aOTC (corrected t < 1). The full set of results is reported in Supplementary Table 3.

Overall, this second set of IPCL models suggests that the statistics of most natural input contain the relevant relationships needed to capture these brain signatures comparably well. Further, these models also highlight the general nature of the learning objective, demonstrating that it can be applied over richer and more variable image content, which is traditionally learned separately in supervised learning.

Accuracy vs brain predictivity

The analyses so far demonstrate that, while category-supervised models show better object categorization capacity, IPCL models still achieve parity in their correspondence with the visual hierarchy. However, neither the category-supervised nor the IPCL models are able to fully capture the structure of the measured brain responses, particularly in the later hierarchical stage of the Inanimate Objects dataset, which captures many finer-grained object relationships. This predictivity gap raises a new question: if instance-level contrastive learning systems advance to the point of achieving classification accuracy comparable to that of category-supervised models, will even more brain-like representation emerge?

Concurrently, a number of new instance-level contrastive learning models have been developed, which allow us to test this possibility (e.g., SimCLR 39, MoCo 37, MoCoV2 38, and SwAV 40). For example, SimCLR leverages principles similar to those of our IPCL network, with a few notable differences: it uses two augmentations per image (rather than an instance prototype), a more compute-intensive system for storing negative samples (in contrast to our lightweight memory queue), and a more powerful architectural backbone (ResNet-50; ref. 57). Critically, this model, and others like MoCoV2 and SwAV, now achieve object classification performance that rivals their category-supervised counterparts. Do these models show more brain-like representation, specifically in their responses to inanimate objects, where the later hierarchical brain structure was reliable yet unaccounted for?

The results indicate that these newer models do not close this gap. Figure 4 depicts the relationship between top-1 accuracy and the strength of the brain correspondence for the Inanimate Objects dataset. All instance-level contrastive learning models are plotted with colored markers, while category-supervised models are plotted with open markers; different base architectures are indicated by the marker shape. These scatter plots highlight that, across these models, top-1 accuracy ranges from 26% to 73%; however, improved categorization capacity is not accompanied by a more brain-like feature space. Further, these plots suggest that these particular variations in architecture, including higher-powered ResNet 57 and ResNeXt 58 models, also do not seem to close this gap.

Figure 4

The x-axis plots top-1 classification accuracy, and the y-axis plots the cross-validated max correlation with the Inanimate Objects dataset, in each of the three brain sectors. Self-supervised contrastive learning models are shown with colored markers and category-supervised models with open markers. Model architecture is indicated by marker shape. Red dashed lines and double-headed arrows draw attention to the gap between these model fits and the reliability ceiling of these brain data. Source data are provided as a Source Data file.

Finally, we also asked whether a recent self-supervised model trained on an even more ecological visual diet (images sampled from baby head-mounted cameras) might show better brain predictivity (TC-MoCo 59, SAYCam dataset 60). The visual experience of toddlers involves extensive experience with very few things rather than an equal distribution over many categories, a visual curriculum that may be important for visual representation learning 61. However, this particular model also did not close the brain predictivity gap evident in the similarity structure of inanimate objects at the later stages of the visual hierarchy (Fig. 4; purple diamond). Note, though, that this model does not yet take advantage of temporal information in videos beyond a few frames; building effective systems that use contrastive learning over video is an active frontier 62, 63, 64.

Overall, the Inanimate Objects dataset has revealed some reliable representational structure in the object-selective cortex that is not easily captured by current deepnet models, even across these broadly sampled variations in learning algorithm, architecture, and visual diet. Further, these aggregated results complement the emerging trend that overall object categorization accuracy is not indicative of overall brain predictivity 23 , here considering a variety of other instance-level contrastive learning methods, over a much wider range of top-1 accuracy levels.

Auxiliary results

For reference, we also conducted the same analyses using classic representational similarity analysis (rather than veRSA), in which there were no voxel-wise encoding models, nor any deepnet-unit feature re-weighting (Supplementary Figs. 2–4). Overall, the magnitude of the correlation between model layers and the brain RDMs was systematically lower than when using veRSA. Despite this general main effect, the primary claims were also evident with this simpler analysis method: IPCL models showed parity with (or even superior performance to) category-supervised models, across brain sectors and datasets, with one notable exception. That is, in the aOTC, and when considering the Object Orientation dataset, the category-supervised model showed better correspondence with the brain than the IPCL models (Supplementary Fig. 3). This discrepancy between classic RSA and veRSA highlights that veRSA can effectively adjust the representational space to better capture the brain data, whereas classic RSA weights all features equally. We discuss these results in the context of the open challenge of specifying linking hypotheses between deepnet features and brain responses.

Here we introduced instance-prototype contrastive learning models, trained with no labels of any kind, which learn a hierarchy of visual feature spaces that (i) have emergent categorization capacity based on the local similarity structure of the latent space and (ii) predict the representational geometry of hierarchical ventral visual stream processing in the human brain, on par with category-supervised counterparts. This correspondence with the structure of human visual system responses held in two datasets, considering both object orientation variation and finer-grained inanimate object distinctions, and also held over self-supervised models trained with different visual input diets. Finally, we highlight that there is representational structure in the brain that was not well accounted for by any model tested, particularly in the anterior region of the ventral visual stream, related to finer-grained differences among inanimate objects. Broadly, these results provide computational support for instance-level separability (that is, telling apart every view from every other view) as a plausible goal of ventral visual stream representation, reflecting a shift away from the category-based framework that has been dominant in high-level visual cognitive neuroscience research.

Implications for the biological visual system

The primary advance of this work for insights into the visual system is to make a computationally supported learnability argument: it is possible to achieve some category-level similarity structure without presupposing explicit category-level pressures. Items with similar visual features are likely to be from similar categories, and we show that the goal of instance-level representation allows the natural covariance of the data to emerge in the latent space of the model, a result that is further supported by the expanding set of self-supervised models with emergent object categorization accuracy comparable to category-supervised systems 36, 37, 38, 39, 40. Our work adds further support for the viability of this hypothesis of visual system representation by demonstrating an emergent correspondence with the similarity structure measured from brain responses; that is, it is not the case that our self-supervised models learn a representational format that is decidedly un-brain-like. Indeed, recent work suggests that not all self-supervised learning objectives achieve brain-like representation with parity to category-supervised models 65.

Our model invites an interpretation of the visual system as a very domain-general learning function 10, which maps undifferentiated, unlabeled input into a useful representational format. On this view, the embedding space can be thought of as a high-fidelity perceptual interface, with useful visual primitives over which separate conceptual representational systems can operate. For example, explicit object category-level information may be the purview of more discrete compositional representational systems that can provide “conceptual hooks” into different parts of the embedding space 66, 67, 68. Intriguingly, new theoretical work suggests that instance-level contrastive learning may implicitly be learning to invert the generative process (i.e., mapping from pixels to the latent dimensions of the environment that give rise to the projected images) 69, implying that contrastive learning may be particularly well suited for extracting meaningful representations from images.

What does the failure of these models to predict reliable variance in aOTC for the Inanimate Objects dataset tell us about the nature of representations in this region? Using this same brain dataset, we have found that behavioral judgments related to shape similarity, rather than semantic similarity, show better correspondence with aOTC 70, 71, 72. This result raises the possibility that the deepnets tested here are missing aspects of shape reflected in aOTC responses (e.g., structural representations 73, global form 74, or configural representations 75), which resonates with the fact that deep convolutional neural networks (CNNs) operate more as local texture analyzers 76, 77 and may be architecturally unable to explicitly represent global shape 78. Taken together, these results indicate that the success of CNNs in predicting ventral stream responses is driven by their ability to capture texture-based representations that are also extensively present throughout the ventral stream 28, but they fall short where more explicit shape representations are emphasized. Capturing brain-like finer-grained distinctions among inanimate objects is thus an important frontier that is currently beyond the scope of both contrastive and category-supervised CNN models.

Components of the learning objective

Why is instance-prototype contrastive learning so effective in forming useful visual representations, and what insights might this provide with respect to biological mechanisms of information processing? Recent theoretical work 79 has revealed that the two components of the contrastive objective function have two distinct and important representational consequences, which they refer to as alignment (similarity across views) and uniformity (using all parts of the feature space equally). To satisfy the alignment requirement, the model must learn what it means for images to be similar. For IPCL, the model takes five samples from the world and tries to move them to a common part of the embedding space, forcing the model to learn that the perceptual features shared across these augmentations are important to preserve identity, while the unshared perceptual features can be discarded. Interpreted with a biological lens, these augmentations are like proto-eye movements, and this analogy highlights how this model can integrate more active sensing and predictive coding 80 . For example, augmentations could sample over translation and rotation shifts of the kind that occur with eye and head movements. Further, “efference copy” signals 81 , 82 , which signal the magnitude and direction of movements between samples, might also lead to predictable shifts in the embedding space. This intrinsic information about the sampling process could enable the system to learn representations that are “equivariant”, as opposed to “invariant”, over identity-preserving transformations 83 , 84 .
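For reference, these two terms are often formalized (following the cited theoretical work; the exponents and the temperature-like constant below are the default choices in that formulation and should be treated as illustrative here) as \( \mathcal{L}_{\mathrm{align}} = \mathbb{E}_{(x,x^{+})}\big[\lVert f(x)-f(x^{+})\rVert_2^{2}\big] \) and \( \mathcal{L}_{\mathrm{uniform}} = \log \, \mathbb{E}_{(x,y)}\big[e^{-2\lVert f(x)-f(y)\rVert_2^{2}}\big] \), where \( f \) maps images onto the unit hypersphere, \( x^{+} \) is an augmented view of \( x \), and \( x \) and \( y \) are independently sampled images.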

The second component of the objective function enforces representational uniformity; that is, the set of all images should have uniform coverage over the hypersphere embedding space. In IPCL this is accomplished by storing a modest set of “recent views” in a memory queue to serve as negative samples; other successful contrastive learning models use a much larger set of negatives (either in a batch or a queue), which presumably helps enforce this goal 38, 39. The memory queue also has biological undertones: the human and non-human primate ventral streams are effectively a highway to the hippocampus 85. Through this lens, the recent memory queue of IPCL is a stand-in for the traces that would be accessible in a hippocampal memory system, inviting further modifications that vary the weight of the contrast with fading negative samples, or negative sample replay. However, we are not committed to the memory queue data structure per se. Given that its functional role is to give rise to good representational coverage over the latent space, there may be other architectural mechanisms by which item separability can be achieved, e.g., enforcing independence across feature channels 86, or including a predictor head 87. Indeed, there is an ongoing debate about whether instance-level separability requires these negative samples at all 88, 89, 90.

While these instance-level contrastive learning systems advance a domain-general learning algorithm guiding visual representation formation, they are by no means a perfect model of how the brain learns, nor are they a direct model of the biological system per se. We instead see them as a testbed for broader learnability arguments and as useful for providing cognitive insights into visual representations and formats (e.g., clusters on an L2-normed hypersphere can easily be read out with local linear hyperplanes, which is not true of Euclidean spaces 79); as such, they serve as a useful computational abstraction.

Concurrent work in non-human primate vision

In highly related recent work, Zhuang et al. (2021) explored a variety of self-supervised vision models and whether they have brain-like representation, using single-unit responses of the non-human primate ventral visual stream. Broadly, they found that models using instance-level contrastive approaches similar to ours achieved parity with category-supervised models in predicting single-unit neural response profiles in areas V1, V4, and IT, exceeding the capacities of other kinds of self-supervised models with different goals, including an autoencoder (a reconstructive goal), next-frame prediction (PredNet 91), and other non-relational objectives like depth labeling and colorization 92, 93. Further, they also capitalized on the generality of this objective, developing variations of their instance-level contrastive learning model to learn over video from the SAYCam baby head-cam dataset 60, finding weaker but generally maintained neural predictivity. While almost every methodological detail differs from the work here, including the theoretical assumptions implied by their methods for relating model feature spaces to single-unit responses, these two studies drive toward very similar broad claims. That is, both provide empirical support for moving away from category supervision towards instance-level contrastive learning. Further, the differences between our approaches reveal an expansive new empirical space to explore, considering different methods (fMRI, electrophysiology), models (IPCL, Local Aggregation), and model organisms (humans, monkeys), and, critically, the linking hypotheses (veRSA, encoding models) that operationalize our understanding of the neural code of object representation.

Analytical linking hypotheses between model and brain activations

The question of how feature spaces learned in deep neural networks should be linked to brain responses measured with fMRI is an ongoing analytical frontier; different methods are abundant 21, 22, 24, 28, 94, 95, each making different implicit assumptions about the nature of the link between model feature spaces and brain responses. In the present work, we assume a voxel is best understood as a weighted combination of deepnet features; this is intuitive given the coarse sampling of a voxel over the neural population code. However, note that even single-neuron responses (measured with electrophysiology in the primate brain) are typically modeled as weighted combinations of deepnet units, or even as weights on the principal components throughout the deepnet feature space 96. In general, exactly how deepnet units are conceived of (e.g., how the tuning of any one deepnet unit is related to single-neuron firing) is still coming into theoretical focus, with different hypotheses implicit in the kind of regression model used (e.g., whether encoding weights should be sparse and positive, or low in magnitude and distributed across many deepnet units).

To arrive at a single aggregate measure of neural predictivity, encoding-model approaches simply average across the set of individual neuron fits 23, 65. In contrast, we considered these voxel-wise encoding models together as an integrated population code, in which items vary in the similarity of their activation profiles, which focuses on the representational geometry of the embedding 49. One motivation for this shift to representational similarity as the critical neural target to predict is that fMRI allows for relatively extensive spatial coverage, providing access to a population-level code at a different scale than is possible with dozens to hundreds of single-unit recordings; indeed, trying to predict the RDM of a brain region is now the de facto standard in visual cognitive neuroscience. However, note that our approach differs from other kinds of weighted RSA analyses that are often employed on fMRI data 24, 94, which fit the representational geometry directly by re-weighting feature-based RDMs, discarding univariate activation profiles entirely. Finally, for RSA approaches, exactly how distances in high-dimensional feature space are conceived of and computed is a further open frontier 97, where different hypotheses about the way information is evident in the neural code are implicitly embedded in the choice of distance metric (e.g., the Euclidean distance or the angle between vectors 6, 98).

At stake with these different analytical approaches is that the choices influence the pattern of results and the subsequent inferences. For example, in the present data, model features are much more strongly related to brain RDMs when using veRSA than when using classic RSA, which makes sense considering that this method can recover true relationships that have been blurred by voxel-level sampling; however, untrained models also improve dramatically under this method, raising the question of whether the flexibility of re-weighting the feature space is too great (or whether the Pearson-r scoring method is too lenient). As another example, in the present data, the IPCL features were able to comparably capture responses in aOTC to objects at different orientations, but only with veRSA, and not with classic RSA. This discrepancy between the analysis approaches suggests that the brain-like orientation information is embedded in the feature space, but requires voxel-wise encoding models to draw out those relationships; these pairwise relationships are less strongly evident in the unweighted feature space. Why? One possibility is that these IPCL models do not currently experience any orientation jitter across the samples (only crops, resizes, and color variation), and thus orientation tolerance cannot enter into the instance-prototype representations. In ongoing work, we are adding orientation augmentation to the IPCL samples to explore this possibility. More broadly, we highlight these analytic complexities for two reasons. First, to be transparent about the untidy patterns in our data and the current state of our thinking in motivating the analysis decisions of the present work. And second, to open the conversation for the field to understand more deeply the ways in which deepnet models show brain-like representation of visual information under different analysis assumptions, especially as these new interdisciplinary analytical standards are being developed.

A domain-general account of visual representation learning

The pressures guiding the tuning along the ventral visual stream, and the formation of object category information, have been deeply debated, with some theories proposing that category-level (or “domain-level”) forces may be critical in driving the organization of this cortex. That instance-level contrastive learning can result in emergent categorical representation supports an alternative theoretical viewpoint, in which category-specialized learning mechanisms are not necessary to learn representations with categorical structure. On this generalist account, visual mechanisms operate similarly over all kinds of input, and the goal is to learn hierarchical visual features that simply try to discriminate each view from every other view of the world, regardless of the visual content or domain. We further show that these instance-level contrastive learning systems can have representations that are as brain-like as category-supervised systems, increasing the viability of this general learning account. This generalist view does not deny the importance of abstract categories in higher-level cognition, but instead introduces the instance-level learning objective as a proximate goal that learns compact perceptual representations able to support a wide variety of downstream tasks, including but not limited to object recognition and categorization.

IPCL and category-supervised comparison models were implemented in PyTorch 99 , based on the codebase of Wu et al. ( https://github.com/zhirongw/lemniscate.pytorch ). Code and models are available here: ( https://github.com/harvard-visionlab/open_ipcl ).

For our primary models, we trained five models with an AlexNet-gn architecture (Supplementary Fig. 1), using instance-prototype contrastive learning (see Supplementary Methods for details), on the ImageNet-1k dataset 100. We used the data augmentation scheme of ref. 41, with both spatial augmentation (random crop and resize; horizontal flip) and pixelwise augmentation (random grayscale; random brightness, contrast, saturation, and hue variation). These augmentations require the network to learn a representation that treats images as similar across these transformations. The replications reflect explorations through different training hyper-parameters. See the Supplementary Methods for extended details about the architecture, augmentations, loss function, and training parameters.

For the category-supervised model, we used the same AlexNet-gn architecture as in the primary IPCL models (minus the final L2-norm layer), but with a 1000-dimensional final fully-connected layer corresponding to the 1000 ImageNet classes. The standard cross-entropy loss function was used to train the model on the ImageNet classification task. Otherwise, training was identical to the IPCL models, with the same visual diet (i.e., the same batch size and number of augmented samples per image, using the same augmentation scheme), and the same optimization and learning rate settings.

We trained six additional IPCL models to examine the impact of visual diet on learned representations, using datasets that focus on objects, places, faces, or a mixture of these image types: (i) ImageNet: ∼1.28 million images spanning 1000 object categories 100. (ii) Objects: OpenImagesV6, ∼1.74 million training images spanning 600 boxable object classes 52, 53. (iii) Faces: VGGFace2, ∼3.14 million training images spanning 8631 face identities 56. (iv) Places: Places2, ∼1.80 million images of scenes/places spanning 365 categories 54. (v) Faces-Places-Objects-1x: a mixture of ImageNet, VGGFace2, and Places2, randomly sampling images across all sets, limited to ∼1.28 million images per epoch to match the size of the ImageNet training set. (vi) Faces-Places-Objects-3x: the same mixture, limited to 3.6 million images per epoch. We used less extreme cropping parameters for all of these models than for the primary models so that the faces in the VGGFace2 dataset would not be too zoomed in (as images in this dataset tend to be already tightly cropped views of heads and faces). We used identical normalization statistics for each model (rather than tailoring the normalization statistics to each training set). Finally, we had to reduce the learning rate of the Faces model to 0.001 in order to stabilize learning. Otherwise, all other training details were identical to those of the primary models.

We also analyzed the representations of several concurrently developed instance-level contrastive learning models: SimCLR 39, MoCoV2 38, and SwAV 40, which are trained on ImageNet; and TC-MoCo 59, trained on baby head-cam video data 60. These models were downloaded from official public releases.

To extract activations from a model, images were resized to 224 × 224 pixels and then normalized using the same normalization statistics used to train the model. The images were passed through the model, and activations from each model layer were retained for analysis. The activation maps from convolutional layers were flattened over both space and channel dimensions yielding a feature vector with a length equal to NumChannels × Height × Width, while the output of the fully-connected layers provided a flattened feature vector with a length equal to NumChannels.
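A minimal sketch of this activation-extraction step, using forward hooks, is shown below (the layer names, the use of torchvision transforms, and the helper name are assumptions; the normalization statistics should match those used to train each model):

```python
import torch
from torchvision import transforms
from PIL import Image

def get_layer_activations(model, image_paths, layer_names, norm_mean, norm_std):
    """Return flattened per-layer activations for a list of images."""
    preprocess = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
        transforms.Normalize(norm_mean, norm_std),  # match training stats
    ])

    activations = {name: [] for name in layer_names}
    hooks = []
    for name in layer_names:
        layer = dict(model.named_modules())[name]
        hooks.append(layer.register_forward_hook(
            # Flatten conv maps over channel x height x width; fc outputs
            # are already (batch, channels), so flattening is a no-op there.
            lambda _module, _inputs, output, name=name:
                activations[name].append(output.flatten(start_dim=1).detach())
        ))

    model.eval()
    with torch.no_grad():
        for path in image_paths:
            x = preprocess(Image.open(path).convert('RGB')).unsqueeze(0)
            model(x)

    for hook in hooks:
        hook.remove()
    return {name: torch.cat(acts) for name, acts in activations.items()}
```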

fMRI experiments

The Object Orientation fMRI dataset reflects brain responses measured in seven participants while viewing images of eight items presented at five different in-plane orientations (0, 45, 90, 135, and 180 degrees), yielding a total of 40 image conditions. These images were presented in a mini-blocked design, wherein, in each 6 min 12 s run, each image was flashed four times (600 ms on, 400 ms off) in a 4 s block and was followed by 4 s of fixation. All 40 conditions were presented in each run; the order was determined using the optseq2 software and was additionally constrained so that no item appeared in consecutive blocks (e.g., an upright dog followed by an inverted dog). Two additional 20 s rest periods were distributed throughout the run. Participants completed 12 runs. Their task was to pay attention to each image and complete a vigilance task (press a button when a red circle appeared around an object), which happened 12 times in the run. Participants (ages 20–35, four female, unknown racial distribution) were recruited through the Department of Psychology at Harvard University and gave informed consent according to procedures approved by the Harvard University Internal Review Board.

The Inanimate Objects fMRI dataset reflects brain responses measured in ten participants while viewing images depicting 72 inanimate items. In each 8-min run, each image was flashed four times (600 ms on, 400 ms off) in a 4 s block, with all 72 images presented in a block in each run (randomly ordered), and with 4 × 15 s rest periods interleaved throughout. Participants completed six runs. Their task was to pay attention to each image and complete a vigilance task (press a button when a red frame appeared around an object, which happened 12 times in the run). Participants (ages 19–32; eight female; unknown racial distribution) gave informed consent approved by the Internal Review Board at the University of Trento, Italy.

All fMRI protocols were presented using scripts written in MATLAB using PsychToolbox. Functional data were analyzed using Brain Voyager QX software and MATLAB, with standard preprocessing procedures and general linear modeling analyses to estimate voxel-wise responses to each condition at the single-subject level. Details related to the acquisition and preprocessing steps can be found in the Supplementary Information. All analyses were conducted in MATLAB and Python using custom analysis code.

Brain sectors

First, the EarlyV sector was defined for each individual to include areas V1–V3, which were delineated based on activations from a separate retinotopy protocol. Next, an occipito-temporal cortex mask was drawn by hand on each hemisphere (excluding the EarlyV sector), within which the 1000 most active voxels were included, based on the contrast [all objects > rest] at the group level. To divide this cortex into posterior and anterior OTC sectors, we used an anatomical cutoff (TAL Y: −53), motivated by a systematic dip in local-regional reliability at this anatomical location identified in concurrent work also analyzing this Inanimate Objects dataset 70. The same posterior-anterior division was applied to define the sectors and extract data from the Object Orientation dataset.

Data reliability

The noise ceiling was defined in each sector based on splitting participants into two groups and averaging over all possible split-halves. Specifically, we computed all of the subject-specific RDMs for each sector. Then, on a given iteration, we split the participants in half and computed the average sector-level brain RDMs for each of these two groups. We computed the similarity of these two RDMs by correlating the elements of the lower triangular matrix (excluding the diagonal). The correlation distance (1 − Pearson) was used for creating and comparing RDMs. This procedure was repeated for all possible split-halves over subjects. The noise ceiling was estimated as the mean correlation across splits (averaging Fisher-z transformed correlation values), with an adjusted 95% confidence interval that takes into account the non-independence of the samples 101. This particular method was used to dovetail with the model-brain correlations, described next.
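A sketch of this split-half reliability computation is shown below (the subject RDMs are assumed to be stacked in a 3-D array; the adjusted confidence interval of ref. 101 is omitted for brevity):

```python
import numpy as np
from itertools import combinations

def lower_tri(rdm):
    """Vectorize the lower triangle of an RDM, excluding the diagonal."""
    i, j = np.tril_indices(rdm.shape[0], k=-1)
    return rdm[i, j]

def split_half_noise_ceiling(subject_rdms):
    """Average split-half correlation of group-averaged RDMs.

    subject_rdms: (n_subjects, n_conditions, n_conditions) array of
                  correlation-distance (1 - Pearson) RDMs.
    """
    n = subject_rdms.shape[0]
    zs = []
    for half in combinations(range(n), n // 2):
        other = [s for s in range(n) if s not in half]
        rdm_a = subject_rdms[list(half)].mean(axis=0)
        rdm_b = subject_rdms[other].mean(axis=0)
        r = np.corrcoef(lower_tri(rdm_a), lower_tri(rdm_b))[0, 1]
        zs.append(np.arctanh(r))                # Fisher-z before averaging
    return np.tanh(np.mean(zs))                 # back to correlation units
```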

Model-brain analyses

The first key dependent measure (veRSA correlation) reflects the suitability of the features learned in a layer to predict the multivariate response structure in that brain sector. To compute this, we used the following procedure.

Voxel-wise encoding

For each deepnet layer, subject, and sector, each voxel's response profile (over 40 or 72 image conditions, depending on the dataset) was fit with an encoding model. Specifically, in a leave-one-out procedure, a single image was held out, and ridge regression was used to find the optimal weights for predicting each voxel's response to the remaining images. We used sklearn's 102 cross-validated ridge regression to find the optimal lambda parameter. The response to the held-out item was then predicted using the learned regression weights. Each item was held out once, providing a cross-validated estimate of the response to each image in every voxel, which together forms a model-based prediction of neural responses in each brain region. Based on these predicted responses, a model-predicted RDM was computed for each participant.
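A sketch of this leave-one-image-out encoding step, using sklearn's cross-validated ridge regression, is shown below (the alpha grid, array layout, and function name are assumptions):

```python
import numpy as np
from sklearn.linear_model import RidgeCV

def loo_encoded_responses(layer_feats, voxel_responses,
                          alphas=(0.01, 0.1, 1.0, 10.0, 100.0, 1000.0)):
    """Leave-one-image-out encoding predictions for every voxel.

    layer_feats:     (n_images, n_units) deepnet features for one layer.
    voxel_responses: (n_images, n_voxels) beta estimates for one subject/sector.
    Returns an (n_images, n_voxels) matrix of cross-validated predictions,
    from which a model-predicted RDM can then be computed.
    """
    n_images = layer_feats.shape[0]
    predicted = np.zeros_like(voxel_responses, dtype=float)
    for held_out in range(n_images):
        train = np.setdiff1d(np.arange(n_images), [held_out])
        # RidgeCV selects the regularization strength internally and fits
        # all voxels jointly (one set of weights per voxel).
        model = RidgeCV(alphas=alphas)
        model.fit(layer_feats[train], voxel_responses[train])
        predicted[held_out] = model.predict(layer_feats[[held_out]])
    return predicted
```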

Layerwise RSA analysis

Next, for each sector and layer, the model-predicted RDMs of the subjects were divided into two groups and averaged, yielding two average model-predicted RDMs from two independent halves of the data. Each RDM was correlated with the actual brain RDM, where the brain RDM was computed from the same set of participants. This analysis was repeated for all possible split-halves of the participants. The average Fisher-z transformed correlation (with an adjusted 95% confidence interval 101) was taken as the key measure of layer-sector correspondence.

Note that this average correlation reflects the similarity between the model-predicted RDMs and the brain RDMs, where only half of the subjects' brain data are used. This method of splitting the data into two halves was designed to increase the reliability of the data; we found that the RDMs were more stable with the benefit of averaging across subjects, while any one individual's brain data were generally less reliable. Additionally, this procedure allows for some generality across subjects. Finally, we did not adjust the fit values to correct for the fact that the model-to-brain fit reflects only half of the brain data; instead, we kept it as is, which also allows the average layer-sector correlation to be directly compared to the similarly estimated noise ceiling of the brain data.

Cross-validated max-layer estimation

The second key dependent measure relating model-brain correspondence reflects the strength of the best-fitting layer for a given sector. To compute this measure, we again used the same technique of splitting the data in half by two groups of subjects (this time to prevent double-dipping). Specifically, for each model and sector, the veRSA correlation was computed for all layers, and then the layer with the highest veRSA correlation was selected. Then, in the independent half of the data (from new participants), the veRSA correlation was computed for this selected layer and taken as the measure of the highest correspondence between the model and the sector. As above, this procedure was repeated for all possible split-halves of the subjects, and the cross-validated max-r measure was taken as the average across splits (averaging Fisher-z transformed correlation values, with an adjusted 95% confidence interval that takes into account the non-independence of the samples). This procedure ensures an independent estimate of the maximum correspondence across layers.
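A sketch of this cross-validated max-layer procedure is shown below (it assumes the per-split veRSA correlations have already been computed and that splits are stored in complementary pairs; both the array layout and the function name are illustrative):

```python
import numpy as np

def cv_max_layer_correlation(layer_split_correlations):
    """Cross-validated max-layer correlation (cv max-r).

    layer_split_correlations: (n_layers, n_splits) array where entry (l, s)
    is the veRSA correlation of layer l with the brain RDM computed from
    split s of the participants; splits are assumed to be arranged in
    complementary pairs (s, s + n_splits // 2).
    """
    n_layers, n_splits = layer_split_correlations.shape
    half = n_splits // 2
    zs = []
    for s in range(half):
        # Select the best layer on one half of the participants ...
        best_layer = np.argmax(layer_split_correlations[:, s])
        # ... and score that same layer on the complementary half.
        r = layer_split_correlations[best_layer, s + half]
        zs.append(np.arctanh(r))
    return np.tanh(np.mean(zs))
```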

Classic RSA

For comparison, we also computed and compared RDMs in both layerwise feature spaces and brain sectors using classic RSA. In this case, RDMs were computed directly from the deepnet activations (across units) and the brain activation patterns (across voxels), with no encoding model or feature weighting.

Statistical comparisons

To compare the cross-validated max correlation values between models, we used paired t-tests over all split-halves of the data, with a correction for the non-independence of the samples, following ref. 101 (tests based on repeated k-fold cross-validation) for the corrected variance estimate and adjusted t-values. Comparisons between IPCL and category-supervised models are found in Supplementary Table 2; comparisons between IPCL and an untrained model are found in Supplementary Table 2; comparisons between models trained with different visual diets and the baseline IPCL model trained on ImageNet are reported in Supplementary Table 3. Statistical significance for these paired t-tests was determined using a Bonferroni-corrected α level of 0.05/30 = 0.00167, where 30 corresponds to the number of family-wise tests across all reported tests.
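A sketch of a corrected paired t-test of this general form is shown below (the specific variance-correction factor follows the correction commonly used for repeated random splits and is an assumption here; the paper's exact procedure follows ref. 101):

```python
import numpy as np
from scipy import stats

def corrected_paired_ttest(scores_a, scores_b, test_train_ratio=1.0):
    """Paired t-test over repeated splits, with a variance correction for
    the non-independence of the splits (illustrative sketch only).

    scores_a, scores_b: per-split correlations for the two models.
    test_train_ratio:   ratio of held-out to training subjects per split
                        (1.0 for split-halves).
    """
    d = np.asarray(scores_a) - np.asarray(scores_b)
    n = d.shape[0]
    # Inflate the variance term to account for overlapping splits.
    corrected_var = d.var(ddof=1) * (1.0 / n + test_train_ratio)
    t = d.mean() / np.sqrt(corrected_var)
    p = 2 * stats.t.sf(np.abs(t), df=n - 1)
    return t, p
```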

Reporting summary

Further information on research design is available in the  Nature Research Reporting Summary linked to this article.

Data availability

Brain data, analysis code, and figure-plotting code are available on the Open Science Framework ( https://osf.io/trne8/ ). Public image datasets used to train the models include: ImageNet ( https://image-net.org/ ), OpenImagesV6 ( https://storage.googleapis.com/openimages/web/index.html ), VggFace2 ( https://github.com/ox-vgg/vgg_face2 ), and Places2 ( http://places2.csail.mit.edu/ ).  Source data are provided with this paper.

Code availability

Model training scripts and pretrained models are available on Github ( https://github.com/harvard-visionlab/open_ipcl ; https://doi.org/10.5281/zenodo.5719364 ; ref. 103 ).

Mishkin, M., Ungerleider, L. G. & Macko, K. A. Object vision and spatial vision: two cortical pathways. Trends Neurosci. 6 , 414–417 (1983).

Haxby, J. V. et al. Distributed and overlapping representations of faces and objects in ventral temporal cortex. Science 293 , 2425–2430 (2001).

Kanwisher, N. Functional specificity in the human brain: a window into the functional architecture of the mind. Proc. Natl Acad. Sci. USA 107, 11163–11170 (2010).

DiCarlo, J. J. & Cox, D. D. Untangling invariant object recognition. Trends Cogn. Sci. 11 , 333–341 (2007).

Grill-Spector, K. & Weiner, K. S. The functional architecture of the ventral temporal cortex and its role in categorization. Nat. Rev. Neurosci. 15 , 536–548 (2014).

Meyer, T. & Rust, N. C. Single-exposure visual memory judgments are reflected in inferotemporal cortex. eLife 7, e32259 (2018).

Op de Beeck, H. P., Pillet, I. & Ritchie, J. B. Factors determining where category-selective areas emerge in visual cortex. Trends Cogn. Sci. 23 , 784–797 (2019).

Powell, L. J., Kosakowski, H. L. & Saxe, R. Social origins of cortical face areas. Trends Cogn. Sci. 22 , 752–763 (2018).

Livingstone, M. S., Arcaro, M. J. & Schade, P. F. Cortex is cortex: ubiquitous principles drive face-domain development. Trends Cogn. Sci 23 , 3 (2019).

Arcaro, M. J. & Livingstone, M. S. On the relationship between maps and domains in inferotemporal cortex. Nat. Rev. Neurosci. 22, 573–583 (2021).

Kamps, F. S., Hendrix, C. L., Brennan, P. A. & Dilks, D. D. Connectivity at the origins of domain specificity in the cortical face and place networks. Proc. Natl Acad. Sci. USA 117 , 6163–6169 (2020).

Konkle, T. & Oliva, A. A real-world size organization of object responses in occipitotemporal cortex. Neuron 74 , 1114–1124 (2012).

Konkle, T. & Caramazza, A. The large-scale organization of object-responsive cortex is reflected in resting-state network architecture. Cereb. Cortex 27 , 4933–4945 (2017).

Mahon, B. Z. & Caramazza, A. What drives the organization of object knowledge in the brain? Trends Cogn. Sci. 15 , 97–103 (2011).

Peelen, M. V. & Downing, P. E. Category selectivity in human visual cortex: beyond visual object recognition. Neuropsychologia 105 , 177–183 (2017).

Bracci, S., Ritchie, J. B. & de Beeck, H. O. On the partnership between neural representations of object categories and visual features in the ventral visual pathway. Neuropsychologia 105 , 153–164 (2017).

Khaligh-Razavi, S.-M. & Kriegeskorte, N. Deep supervised, but not unsupervised, models may explain IT cortical representation. PLoS Comput. Biol. 10 , e1003915 (2014).

Yamins, D. L. et al. Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proc. Natl Acad. Sci. USA 111 , 8619–8624 (2014).

Güçlü, U. & van Gerven, M. A. Deep neural networks reveal a gradient in the complexity of neural representations across the ventral stream. J. Neurosci. 35 , 10005–10014 (2015).

Cichy, R. M., Khosla, A., Pantazis, D., Torralba, A., & Oliva, A. Comparison of deep neural networks to spatio-temporal cortical dynamics of human visual object recognition reveals hierarchical correspondence. Sci. Rep. 6 , 1–13 (2016).

Eickenberg, M., Gramfort, A., Varoquaux, G. & Thirion, B. Seeing it all: convolutional network layers map the function of the human visual system. NeuroImage 152 , 184–194 (2017).

Wen, H., Shi, J., Chen, W. & Liu, Z. Deep residual network predicts cortical representation and organization of visual features for rapid categorization. Sci. Rep. 8 , 1–17 (2018).

Schrimpf, M. et al. Brain-score: which artificial neural network for object recognition is most brain-like? Preprint at bioRxiv https://www.biorxiv.org/content/10.1101/407007v2 (2018).

Storrs, K. R., Kietzmann, T. C., Walther, A., Mehrer, J., & Kriegeskorte, N. Diverse deep neural networks all predict human inferior temporal cortex well, after training and fitting. J. Cogn. Neurosci. 33 , 2044–2064 (2021).

Kriegeskorte, N. Deep neural networks: a new framework for modeling biological vision and brain information processing. Annu. Rev. Vis. Sci. 1 , 417–446 (2015).

Serre, T. Deep learning: the good, the bad, and the ugly. Annu. Rev. Vis. Sci. 5 , 399–426 (2019).

Russakovsky, O. et al. Imagenet large scale visual recognition challenge. Int. J. Comput. Vis. 115 , 211–252 (2015).

Long, B., Yu, C.-P. & Konkle, T. Mid-level visual features underlie the high-level categorical organization of the ventral stream. Proc. Natl Acad. Sci. USA 115 , E9015–E9024 (2018).

Janini, D. & Konkle, T. A Pokémon-sized window into the human brain. Nat. Hum. Behav. 3 , 552–553 (2019).

Long, B., Störmer, V. S. & Alvarez, G. A. Mid-level perceptual features contain early cues to animacy. J. Vis. 17 , 20–20 (2017).

Malcolm, G. L., Groen, I. I. & Baker, C. I. Making sense of real-world scenes. Trends Cogn. Sci. 20 , 843–856 (2016).

Gibson, J. J. The Ecological Approach to Visual Perception (Psychology Press, 2014).

Baggs, E. & Chemero, A. in Perception as Information Detection (Wagman, J. B. & Blau, J. J. C.) Ch. 1 (Routledge, 2019).

Wu, Y. & He, K. Group normalization. In Proceedings of the European Conference on Computer Vision ( ECCV ). 3–19 (Springer, 2018).

Zhuang, C., Zhai, A. L. & Yamins, D. Local aggregation for unsupervised learning of visual embeddings. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2019).

Tian, Y., Krishnan, D. & Isola, P. Contrastive multiview coding. In Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XI. 776–794 (Springer, 2020).

He, K., Fan, H., Wu, Y., Xie, S., & Girshick, R. Momentum contrast for unsupervised visual representation learning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2020).

Chen, X., Fan, H., Girshick, R. & He, K. Improved baselines with momentum contrastive learning. Preprint at arXiv:2003.04297 (2020).

Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. A simple framework for contrastive learning of visual representations. International Conference on Machine Learning PMLR (2020).

Caron, M. et al. Unsupervised learning of visual features by contrasting cluster assignments. Proceedings of Advances in Neural Information Processing Systems (NeurIPS) (2020).

Wu, Z., Xiong, Y., Yu, S. X., & Lin, D. Unsupervised feature learning via non-parametric instance discrimination. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018).

Krizhevsky, A., Sutskever, I. & Hinton, G. E. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems . 1097–1105 (ACM, 2012).

Ioffe, S., & Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. International Conference on Machine Learning (PMLR) (2015).

Heeger, D. J. Normalization of cell responses in cat striate cortex. Vis. Neurosci. 9 , 181–197 (1992).

Carandini, M. & Heeger, D. J. Normalization as a canonical neural computation. Nat. Rev. Neurosci. 13 , 51–62 (2012).

Tarhan, L. & Konkle, T. Reliability-based voxel selection. NeuroImage 207 , 116350 (2020).

Mitchell, T. M. et al. Predicting human brain activity associated with the meanings of nouns. Science 320 , 1191–1195 (2008).

Naselaris, T., Kay, K. N., Nishimoto, S. & Gallant, J. L. Encoding and decoding in fMRI. NeuroImage 56 , 400–410 (2011).

Kriegeskorte, N., Mur, M. & Bandettini, P. A. Representational similarity analysis-connecting the branches of systems neuroscience. Front. Sys. Neurosci. 2 , 4 (2008).

Khaligh-Razavi, S.-M., Henriksson, L., Kay, K. & Kriegeskorte, N. Fixed versus mixed RSA: explaining visual representations by fixed and mixed feature sets from shallow and deep computational models. J. Math. Psychol. 76 , 184–197 (2017).

Kriegeskorte, N. & Wei, X.-X. Neural tuning and representational geometry. Nat. Rev. Neurosci. 22 , 703–718, https://doi.org/10.1038/s41583-021-00502-3 (2021).

Krasin, I. et al. Openimages: a public dataset for large-scale multi-label and multi-class image classification. https://github.com/openimages (2017).

Kuznetsova, A. et al. The open images dataset v4. Int. J. Comput. Vis. 128 , 1956–1981 (2020).

Zhou, B., Lapedriza, A., Khosla, A., Oliva, A. & Torralba, A. Places: A 10 million image database for scene recognition. IEEE Trans. Pattern Anal. Mach. Intell. 40 , 1452–1464 (2017).

Zhou, B., Khosla, A., Lapedriza, A., Oliva, A. & Torralba, A. Object detectors emerge in deep scene cnns. International Conference on Learning Representations (ICLR) (2015).

Cao, Q., Shen, L., Xie, W., Parkhi, O. M. & Zisserman, A. Vggface2: a dataset for recognising faces across pose and age. In 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition ( FG 2018 ). 67–74 (IEEE, 2018).

He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778 (IEEE, 2016).

Xie, S., Girshick, R., Dollár, P., Tu, Z. & He, K. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1492–1500 (IEEE, 2017).

Orhan, A. E., Gupta, V. V. & Lake, B. M. Self-supervised learning through the eyes of a child. Conference on Neural Information Processing Systems, NeurIPS (2020).

Sullivan, J., Mei, M., Perfors, A., Wojcik, E., & Frank, M. C. SAYCam: A large, longitudinal audiovisual dataset recorded from the infant’s perspective. Open Mind , 1–10 (2020).

Smith, L. B. & Slone, L. K. A developmental approach to machine learning? Front. Psychol. 8 , 2124 (2017).

Sermanet, P. et al. Time-contrastive networks: Self-supervised learning from video. In 2018 IEEE International Conference on Robotics and Automation ( ICRA ). 1134–1141 (IEEE, 2018).

Zhuang, C., She, T., Andonian, A., Mark, M. S. & Yamins, D. Unsupervised learning from video with deep neural embeddings. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition . 9563–9572 (IEEE, 2020).

Knights, J. et al. Temporally coherent embeddings for self-supervised video representation learning. In 2020 25th International Conference on Pattern Recognition ( ICPR ). 8914–8921 (IEEE, 2021).

Zhuang, C. et al. Unsupervised neural network models of the ventral visual stream. Proc. Natl Acad. Sci . USA 118 , e2014196118 (2021).

Konkle, T., Brady, T. F., Alvarez, G. A. & Oliva, A. Conceptual distinctiveness supports detailed visual long-term memory for real-world objects. J. Exp. Psychol. Gen. 139 , 558 (2010).

Gärdenfors, P. From sensations to concepts: a proposal for two learning processes. Rev. Phil. Psychol. 10 , 441–464 (2019).

Solomon, S. & Schapiro, A. Structure shapes the representation of a novel category. Preprint at PsyArXiv (2021).

Zimmermann, R. S., Sharma, Y., Schneider, S., Bethge, M. & Brendel, W. Contrastive learning inverts the data generating process. International Conference on Machine Learning (ICML) (2021).

Magri, C. & Konkle, T. Object-selective cortex shows distinct representational formats along the posterior-to-anterior axis: evidence from brain-behavior correlations. J. Vis. 20 , 185–185 (2020).

Baldassi, C. et al. Shape similarity, better than semantic membership, accounts for the structure of visual object representations in a population of monkey inferotemporal neurons. PLoS Comput. Biol. 9 , e1003167 (2013).

Jozwik, K. M., Kriegeskorte, N. & Mur, M. Visual features as stepping stones toward semantics: explaining object similarity in it and perception with non-negative least squares. Neuropsychologia 83 , 201–226 (2016).

Lescroart, M. D. & Biederman, I. Cortical representation of medial axis structure. Cereb. Cortex 23 , 629–637 (2013).

Ostwald, D., Lam, J. M., Li, S. & Kourtzi, Z. Neural coding of global form in the human visual cortex. J. Neurophysiol. 99 , 2456–2469 (2008).

Wilson, H. R. & Wilkinson, F. From orientations to objects: configural processing in the ventral stream. J. Vis. 15 , 4–4 (2015).

Geirhos, R. et al. Imagenet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness. International Conference on Learning Representations (ICLR) (2019).

Brendel, W. & Bethge, M. Approximating cnns with bag-of-local-features models works surprisingly well on imagenet. Preprint at arXiv:1904.00760 (2019).

Doerig, A., Bornet, A., Choung, O.-H. & Herzog, M. H. Crowding reveals fundamental differences in local vs. global processing in humans and machines. Vis. Res. 167 , 39–45 (2020).

Wang, T. & Isola, P. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In International Conference on Machine Learning . 9929–9939 (PMLR, 2020).

Rao, R. P. & Ballard, D. H. Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nat. Neurosci. 2 , 79–87 (1999).

Colby, C. et al. The updating of the representation of visual space in parietal cortex by intended eye movements. Science 255 , 90–92 (1992).

Crapse, T. B. & Sommer, M. A. Corollary discharge across the animal kingdom. Nat. Rev. Neurosci. 9 , 587–600 (2008).

Lenc, K. & Vedaldi, A. Understanding image representations by measuring their equivariance and equivalence. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015).

Bouchacourt, D., Ibrahim, M. & Deny, S. Addressing the topological defects of disentanglement via distributed operators. Preprint at arXiv:2102.05623 (2021).

Van Essen, D. C. & Maunsell, J. H. Hierarchical organization and functional streams in the visual cortex. Trends Neurosci. 6 , 370–375 (1983).

Zbontar, J., Jing, L., Misra, I., LeCun, Y. & Deny, S. Barlow twins: self-supervised learning via redundancy reduction. Preprint at arXiv:2103.03230 (2021).

Chen, X. & He, K. Exploring simple siamese representation learning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition . (2021).

Grill, J.-B. et al. Bootstrap your own latent: a new approach to self-supervised learning. Preprint at arXiv:2006.07733 (2020).

Chen, X., & He, K. Exploring simple siamese representation learning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021).

Tsai, Y.-H. H., Bai, S., Morency, L.-P. & Salakhutdinov, R. A note on connecting Barlow twins with negative-sample-free contrastive learning. Preprint at arXiv:2104.13712 (2021).

Lotter, W., Kreiman, G. & Cox, D. A neural network trained for prediction mimics diverse features of biological neurons and perception. Nat. Mach. Intell. 2 , 210–219 (2020).

Laina, I., Rupprecht, C., Belagiannis, V., Tombari, F. & Navab, N. Deeper depth prediction with fully convolutional residual networks. In 2016 Fourth international Conference on 3D Vision ( 3DV ). 239–248 (IEEE, 2016).

Zhang, R., Isola, P. & Efros, A. A. Colorful image colorization. In European Conference on Computer Vision. 649–666 (Springer, 2016).

Jozwik, K. M., Kriegeskorte, N., Storrs, K. R. & Mur, M. Deep convolutional neural networks outperform feature-based but not categorical models in explaining object similarity judgments. Front. Psychol. 8 , 1726 (2017).

Zeman, A. A., Ritchie, J. B., Bracci, S. & de Beeck, H. O. Orthogonal representations of object shape and category in deep convolutional neural networks and human visual cortex. Sci. Rep. 10 , 1–12 (2020).

Klindt, D. A., Ecker, A. S., Euler, T., & Bethge, M. Neural system identification for large populations separating what and where. Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS) (2017).

Stringer, C., Pachitariu, M., Steinmetz, N., Carandini, M. & Harris, K. D. High-dimensional geometry of population responses in visual cortex. Nature 571 , 361–365 (2019).

Diedrichsen, J. et al. Comparing representational geometries using whitened unbiased-distance-matrix similarity. Preprint at arXiv:2007.02789 (2020).

Paszke, A., et al. Pytorch: An imperative style, high-performance deep learning library. Adv. Neural. Inf. Process. Syst. 32 , 8026–8037 (2019).

Deng, J. et al. Imagenet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition . 248–255 (IEEE, 2009).

Bouckaert, R. R. & Frank, E. Evaluating the replicability of significance tests for comparing learning algorithms. In Pacific-Asia Conference on Knowledge Discovery and Data Mining. 3–12 (Springer, 2004).

Pedregosa, F. et al. Scikit-learn: Machine learning in python. J. Mach. Learn. Res. 12 , 2825–2830 (2011).

Konkle, T. & Alvarez, G. A. A self-supervised domain-general learning framework for human ventral stream representation. Code repository: harvard-visionlab/open_ipcl, https://doi.org/10.5281/zenodo.5719364 (2021).

Acknowledgements

Funding for this project was provided by an Amazon Cloud Credits for Research Grant to TK and GAA, an NSF CAREER BCS-1942438 to TK, and an NSF PAC COMP-COG 1946308 to GAA. Thank you to Morgan Henry who collected the Object Orientation dataset.

Author information

Authors and affiliations.

Department of Psychology & Center for Brain Science, Harvard University, Cambridge, MA, USA

Talia Konkle & George A. Alvarez

Contributions

Both authors contributed extensively to this work. TK collected and pre-processed the Inanimate Object Dataset. TK and GAA supervised the collection and preprocessing of the Object Orientation dataset. TK organized all brain data for analysis. GAA implemented and trained all models. TK and GAA jointly developed the self-supervised model, designed the experiments and analytical procedures, created the figures, and wrote the manuscript.

Corresponding authors

Correspondence to Talia Konkle or George A. Alvarez .

Ethics declarations

Competing interests.

The authors declare no competing interests.

Peer review

Peer review information.

Nature Communications thanks the anonymous reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

  • Supplementary Information
  • Peer Review File
  • Reporting Summary
  • Source Data

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ .

Reprints and permissions

About this article

Cite this article.

Konkle, T., Alvarez, G.A. A self-supervised domain-general learning framework for human ventral stream representation. Nat Commun 13 , 491 (2022). https://doi.org/10.1038/s41467-022-28091-4

Received : 03 June 2021

Accepted : 13 December 2021

Published : 25 January 2022

DOI : https://doi.org/10.1038/s41467-022-28091-4
