Self-Supervised Learning: Code Series
# 01、MoCo: Momentum Contrast for Unsupervised Visual Representation Learning
# 1.1 MoCo v1 & MoCo v2
https://github.com/facebookresearch/moco
Models
Our pre-trained ResNet-50 models can be downloaded as follows:
model | epochs | mlp | aug+ | cos | top-1 acc. | download | md5 |
---|---|---|---|---|---|---|---|
MoCo v1 | 200 | | | | 60.6 | download | b251726a |
MoCo v2 | 200 | ✓ | ✓ | ✓ | 67.7 | download | 59fd9945 |
MoCo v2 | 800 | ✓ | ✓ | ✓ | 71.1 | download | a04e12f8 |
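The two mechanisms behind the numbers above are a slowly updated momentum (key) encoder and a FIFO queue of negative keys. A minimal NumPy sketch of both, assuming L2-normalized features; class and function names are illustrative, not from the repo:

```python
import numpy as np

def momentum_update(q_params, k_params, m=0.999):
    """EMA update of the key encoder from the query encoder (MoCo)."""
    return [m * k + (1.0 - m) * q for q, k in zip(q_params, k_params)]

class FeatureQueue:
    """Fixed-size FIFO queue of negative keys, stored as a (dim, K) matrix."""
    def __init__(self, dim, K):
        self.queue = np.random.randn(dim, K)
        self.queue /= np.linalg.norm(self.queue, axis=0, keepdims=True)
        self.ptr = 0
        self.K = K

    def dequeue_and_enqueue(self, keys):
        """keys: (batch, dim), L2-normalized; overwrite the oldest slots."""
        b = keys.shape[0]
        assert self.K % b == 0  # simplifying assumption, as in the paper
        self.queue[:, self.ptr:self.ptr + b] = keys.T
        self.ptr = (self.ptr + b) % self.K
```

With m = 0.999 the key encoder changes very slowly, which keeps the keys in the queue consistent across iterations.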
# 1.2 MoCo v3
https://github.com/facebookresearch/moco-v3
ResNet-50, linear classification
pretrain epochs | pretrain crops | linear acc |
---|---|---|
100 | 2x224 | 68.9 |
300 | 2x224 | 72.8 |
1000 | 2x224 | 74.6 |
ViT, linear classification
model | pretrain epochs | pretrain crops | linear acc |
---|---|---|---|
ViT-Small | 300 | 2x224 | 73.2 |
ViT-Base | 300 | 2x224 | 76.7 |
ViT, end-to-end fine-tuning
model | pretrain epochs | pretrain crops | e2e acc |
---|---|---|---|
ViT-Small | 300 | 2x224 | 81.4 |
ViT-Base | 300 | 2x224 | 83.2 |
# 02、SimCLR - A Simple Framework for Contrastive Learning of Visual Representations
https://github.com/google-research/simclr
# Pre-trained models for SimCLRv1
The pre-trained models (base network with linear classifier layer) can be found below. Note that for these SimCLRv1 checkpoints, the projection head is not available.
Model checkpoint and hub-module | ImageNet Top-1 |
---|---|
ResNet50 (1x) | 69.1 |
ResNet50 (2x) | 74.2 |
ResNet50 (4x) | 76.6 |
# Pre-trained models for SimCLRv2
Depth | Width | SK | Param (M) | F-T (1%) | F-T (10%) | F-T (100%) | Linear eval | Supervised |
---|---|---|---|---|---|---|---|---|
50 | 1X | False | 24 | 57.9 | 68.4 | 76.3 | 71.7 | 76.6 |
50 | 1X | True | 35 | 64.5 | 72.1 | 78.7 | 74.6 | 78.5 |
50 | 2X | False | 94 | 66.3 | 73.9 | 79.1 | 75.6 | 77.8 |
50 | 2X | True | 140 | 70.6 | 77.0 | 81.3 | 77.7 | 79.3 |
101 | 1X | False | 43 | 62.1 | 71.4 | 78.2 | 73.6 | 78.0 |
101 | 1X | True | 65 | 68.3 | 75.1 | 80.6 | 76.3 | 79.6 |
101 | 2X | False | 170 | 69.1 | 75.8 | 80.7 | 77.0 | 78.9 |
101 | 2X | True | 257 | 73.2 | 78.8 | 82.4 | 79.0 | 80.1 |
152 | 1X | False | 58 | 64.0 | 73.0 | 79.3 | 74.5 | 78.3 |
152 | 1X | True | 89 | 70.0 | 76.5 | 81.3 | 77.2 | 79.9 |
152 | 2X | False | 233 | 70.2 | 76.6 | 81.1 | 77.4 | 79.1 |
152 | 2X | True | 354 | 74.2 | 79.4 | 82.9 | 79.4 | 80.4 |
152 | 3X | True | 795 | 74.9 | 80.1 | 83.1 | 79.8 | 80.5 |
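All SimCLR variants above are trained with the NT-Xent (normalized temperature-scaled cross-entropy) contrastive loss over two augmented views per image. A minimal NumPy sketch, assuming the two views of image i sit at rows i and i+N; the function name is illustrative:

```python
import numpy as np

def nt_xent_loss(z, tau=0.5):
    """NT-Xent loss for 2N embeddings, where z[i] and z[i+N] are the two
    augmented views of the same image (SimCLR)."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # L2-normalize
    n2 = z.shape[0]
    N = n2 // 2
    sim = z @ z.T / tau                    # pairwise cosine similarities
    np.fill_diagonal(sim, -np.inf)         # exclude self-similarity
    # positive index for each row: the other view of the same image
    pos = np.concatenate([np.arange(N, n2), np.arange(0, N)])
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    loss = -(sim[np.arange(n2), pos] - logsumexp)
    return loss.mean()
```

The loss pulls the two views of each image together while pushing away the other 2N-2 examples in the batch, which is why SimCLR benefits from large batch sizes.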
# 03、SimSiam: Exploring Simple Siamese Representation Learning
https://github.com/facebookresearch/simsiam
# Models and Logs
Our pre-trained ResNet-50 models and logs:
pre-train epochs | batch size | pre-train ckpt | pre-train log | linear cls. ckpt | linear cls. log | top-1 acc. |
---|---|---|---|---|---|---|
100 | 512 | link | link | link | link | 68.1 |
100 | 256 | link | link | link | link | 68.3 |
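SimSiam reaches the accuracies above without negative pairs, a momentum encoder, or large batches; the key ingredients are a predictor head and a stop-gradient on the other branch. A minimal NumPy sketch of the symmetrized loss, where treating the z inputs as constants stands in for stop-gradient; names are illustrative:

```python
import numpy as np

def neg_cosine(p, z):
    """Negative cosine similarity; z is treated as a constant
    (the stop-gradient branch in SimSiam)."""
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    return -(p * z).sum(axis=1).mean()

def simsiam_loss(p1, p2, z1, z2):
    """Symmetrized SimSiam loss: predictor outputs p of one view are
    compared against (stop-gradient) projector outputs z of the other."""
    return 0.5 * neg_cosine(p1, z2) + 0.5 * neg_cosine(p2, z1)
```

The loss is bounded below by -1, reached when predictor and projector outputs are perfectly aligned; the stop-gradient is what prevents the trivial collapsed solution.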
# 04、Understanding Dimensional Collapse in Contrastive Self-supervised Learning
# 05、Improving Contrastive Learning by Visualizing Feature Transformation
https://github.com/DTennant/CL-Visualizing-Feature-Transformation
Models
For your convenience, we provide the following pre-trained models on ImageNet-1K and ImageNet-100.
pre-train method | pre-train dataset | backbone | #epoch | ImageNet-1K | VOC det AP50 | COCO det AP | Link |
---|---|---|---|---|---|---|---|
Supervised | ImageNet-1K | ResNet-50 | - | 76.1 | 81.3 | 38.2 | download |
MoCo-v1 | ImageNet-1K | ResNet-50 | 200 | 60.6 | 81.5 | 38.5 | download |
MoCo-v1+FT | ImageNet-1K | ResNet-50 | 200 | 61.9 | 82.0 | 39.0 | download |
MoCo-v2 | ImageNet-1K | ResNet-50 | 200 | 67.5 | 82.4 | 39.0 | download |
MoCo-v2+FT | ImageNet-1K | ResNet-50 | 200 | 69.6 | 83.3 | 39.5 | download |
MoCo-v1+FT | ImageNet-100 | ResNet-50 | 200 | 77.2 (IN-100) | - | - | download |
# 06、Propagate Yourself: Exploring Pixel-Level Consistency for Unsupervised Visual Representation Learning
# 6.1 Pascal VOC object detection
# Faster-RCNN with C4
Method | Epochs | Arch | AP | AP50 | AP75 | Download |
---|---|---|---|---|---|---|
Scratch | - | ResNet-50 | 33.8 | 60.2 | 33.1 | - |
Supervised | 100 | ResNet-50 | 53.5 | 81.3 | 58.8 | - |
MoCo | 200 | ResNet-50 | 55.9 | 81.5 | 62.6 | - |
SimCLR | 1000 | ResNet-50 | 56.3 | 81.9 | 62.5 | - |
MoCo v2 | 800 | ResNet-50 | 57.6 | 82.7 | 64.4 | - |
InfoMin | 200 | ResNet-50 | 57.6 | 82.7 | 64.6 | - |
InfoMin | 800 | ResNet-50 | 57.5 | 82.5 | 64.0 | - |
PixPro (ours) | 100 | ResNet-50 | 58.8 | 83.0 | 66.5 | config / model |
PixPro (ours) | 400 | ResNet-50 | 60.2 | 83.8 | 67.7 | config / model |
# 6.2 COCO object detection
# Mask-RCNN with FPN
Method | Epochs | Arch | Schedule | bbox AP | mask AP | Download |
---|---|---|---|---|---|---|
Scratch | - | ResNet-50 | 1x | 32.8 | 29.9 | - |
Supervised | 100 | ResNet-50 | 1x | 39.7 | 35.9 | - |
MoCo | 200 | ResNet-50 | 1x | 39.4 | 35.6 | - |
SimCLR | 1000 | ResNet-50 | 1x | 39.8 | 35.9 | - |
MoCo v2 | 800 | ResNet-50 | 1x | 40.4 | 36.4 | - |
InfoMin | 200 | ResNet-50 | 1x | 40.6 | 36.7 | - |
InfoMin | 800 | ResNet-50 | 1x | 40.4 | 36.6 | - |
PixPro (ours) | 100 | ResNet-50 | 1x | 40.8 | 36.8 | config / model |
PixPro (ours) | 100* | ResNet-50 | 1x | 41.3 | 37.1 | - |
PixPro (ours) | 400* | ResNet-50 | 1x | 41.4 | 37.4 | - |
* Indicates methods with instance branch.
# Mask-RCNN with C4
Method | Epochs | Arch | Schedule | bbox AP | mask AP | Download |
---|---|---|---|---|---|---|
Scratch | - | ResNet-50 | 1x | 26.4 | 29.3 | - |
Supervised | 100 | ResNet-50 | 1x | 38.2 | 33.3 | - |
MoCo | 200 | ResNet-50 | 1x | 38.5 | 33.6 | - |
SimCLR | 1000 | ResNet-50 | 1x | 38.4 | 33.6 | - |
MoCo v2 | 800 | ResNet-50 | 1x | 39.5 | 34.5 | - |
InfoMin | 200 | ResNet-50 | 1x | 39.0 | 34.1 | - |
InfoMin | 800 | ResNet-50 | 1x | 38.8 | 33.8 | - |
PixPro (ours) | 100 | ResNet-50 | 1x | 40.0 | 34.8 | config / model |
PixPro (ours) | 400 | ResNet-50 | 1x | 40.5 | 35.3 | config / model |
# 07、CVPR2021 | Online Bag-of-Visual-Words Generation for Unsupervised Representation Learning
https://github.com/valeoai/obow
# 7.1 ResNet50 pre-trained model
Method | Epochs | Batch-size | Dataset | ImageNet linear acc. | Links to pre-trained weights |
---|---|---|---|---|---|
OBoW | 200 | 256 | ImageNet | 73.8 | entire model / feature extractor only |
# 08、NeurIPS 2020 | Unsupervised Learning of Visual Features by Contrasting Cluster Assignments
https://github.com/facebookresearch/swav
# 09、ECCV 2020 | Learning to Classify Images without Labels
https://github.com/wvangansbeke/Unsupervised-Classification
We also train SCAN on ImageNet with 1000 clusters, using 10 clustering heads and keeping the head with the lowest loss. The reported metrics are accuracy (ACC), normalized mutual information (NMI), adjusted mutual information (AMI), and adjusted Rand index (ARI):
Method | ACC | NMI | AMI | ARI | Download link |
---|---|---|---|---|---|
SCAN (ResNet50) | 39.9 | 72.0 | 51.2 | 27.5 | Download |
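The ACC metric above requires matching predicted cluster indices to ground-truth classes, usually with the Hungarian algorithm, since cluster labels are arbitrary permutations. A minimal sketch using SciPy; the function name is illustrative (NMI, AMI, and ARI are the standard scikit-learn metrics and need no matching):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """Unsupervised clustering accuracy (ACC): best one-to-one mapping
    between predicted clusters and ground-truth classes."""
    D = max(y_pred.max(), y_true.max()) + 1
    w = np.zeros((D, D), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        w[p, t] += 1                      # confusion counts
    row, col = linear_sum_assignment(-w)  # maximize matched counts
    return w[row, col].sum() / y_true.size
```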
# 10、ICML 2020 | Self-Supervised Prototypical Transfer Learning for Few-Shot Classification
https://github.com/indy-lab/ProtoTransfer
# 11、NeurIPS 2020 | Bootstrap Your Own Latent
https://github.com/deepmind/deepmind-research/tree/master/byol
Using this implementation, you should reach an ImageNet top-1 accuracy between 74.0% and 74.5% after about 8 hours of training on 512 Cloud TPU v3 cores.
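BYOL's target network is an exponential moving average (EMA) of the online network, with the momentum coefficient annealed toward 1.0 over training by a cosine schedule. A minimal sketch; the base value 0.996 follows the paper's default, but the function names are illustrative:

```python
import math

def byol_target_momentum(step, total_steps, base_tau=0.996):
    """Cosine schedule for the BYOL target-network EMA coefficient:
    tau goes from base_tau at step 0 to 1.0 at the end of training."""
    progress = step / total_steps
    return 1.0 - (1.0 - base_tau) * (math.cos(math.pi * progress) + 1) / 2

def ema_update(online, target, tau):
    """EMA update of target-network parameters from the online network."""
    return [tau * t + (1.0 - tau) * o for o, t in zip(online, target)]
```

As tau approaches 1.0 late in training, the target network freezes, which stabilizes the regression targets for the online predictor.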
# 12、Efficient Self-Supervised Vision Transformers
https://github.com/microsoft/esvit
# 12.1 Pretrained models
You can download the full checkpoint (trained on ImageNet-1K with both view-level and region-level tasks, batch size 512), which contains backbone and projection-head weights for both the student and teacher networks.
- EsViT (Swin) at increasing model capacities, pre-trained with both view-level and region-level tasks; ResNet-50 trained with both tasks is shown as a reference.
# 13、Emerging Properties in Self-Supervised Vision Transformers.
https://github.com/facebookresearch/dino
# 13.1 Pretrained models
You can choose to download only the weights of the pretrained backbone used for downstream tasks, or the full checkpoint, which contains backbone and projection-head weights for both the student and teacher networks. We also provide the backbone in ONNX format, as well as detailed arguments and training/evaluation logs. Note that the DeiT-S and ViT-S names refer to exactly the same architecture.
We also release XCiT models ([arXiv] [code]) trained with DINO:
arch | params | k-nn | linear | download |
---|---|---|---|---|
xcit_small_12_p16 | 26M | 76.0% | 77.8% | backbone only / full ckpt / args / logs / eval |
xcit_small_12_p8 | 26M | 77.1% | 79.2% | backbone only / full ckpt / args / logs / eval |
xcit_medium_24_p16 | 84M | 76.4% | 78.8% | backbone only / full ckpt / args / logs / eval |
xcit_medium_24_p8 | 84M | 77.9% | 80.3% | backbone only / full ckpt / args / logs / eval |
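DINO trains the student to match a centered and sharpened teacher distribution via cross-entropy, with the center maintained as an EMA of the teacher's batch outputs to avoid collapse. A minimal NumPy sketch; the temperatures are the paper's defaults and the names are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dino_loss(student_logits, teacher_logits, center,
              tau_s=0.1, tau_t=0.04):
    """Cross-entropy between the centered, sharpened teacher distribution
    and the student distribution. The teacher is treated as a constant
    (no gradient flows through it in the real implementation)."""
    t = softmax((teacher_logits - center) / tau_t)   # sharpen + center
    log_s = np.log(softmax(student_logits / tau_s))
    return -(t * log_s).sum(axis=-1).mean()

def update_center(center, teacher_logits, m=0.9):
    """EMA of the teacher output batch mean, used for centering."""
    return m * center + (1 - m) * teacher_logits.mean(axis=0)
```

Sharpening (small tau_t) and centering push in opposite directions, and together they prevent both uniform and one-hot collapsed outputs.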