-
Towards Unified INT8 Training for Convolutional Neural Network
-
🔑 Key:
- Mainly dealing with gradient quantization
- 4 Empirical Rules of Gradients
- Theoretical Convergence Bound & 2 Principles
- 2 Techniques: Directional-Sensitive Gradient Clipping + Deviation Counteractive LR Scaling
-
🎓 Source:
- CVPR 2020 SenseTime + BUAA
-
🌱 Motivation:
-
💊 Methodology:
- Symmetric Uniform Quantization with Stochastic Rounding (see the first sketch below)
- Challenges of quantizing gradients
- Small perturbations can change the gradient's direction
- Sharp and wide distribution (unlike weights/activations)
- Evolutionary: the distribution gets even sharper as training goes on
- Layer depth: closely related to network depth (the shallower the layer, the sharper its distribution)
- Special block: DW (depthwise) layers are always sharp
- Theoretical bound affected by 3 terms (mainly quantization error, LR, and the L2-norm)
- Useful tricks: 1. minimize the quantization error; 2. scale down the LR
- Directional Sensitive Gradient Clipping
- Actually it's just plain gradient clipping
- Finding the clipping value: cosine distance instead of MSE (avoids the influence of the gradient's magnitude); see the second sketch below
- Deviation Counteractive LR Scaling
- Balances the exponentially accumulated gradient error (deviation) by exponentially decreasing the LR accordingly
- f(deviation) = max(e^(-α·deviation), β)
- β controls the lower bound of the LR
- α controls the decay degree
- Stochastic Rounding
- curandGenerator
- Linear Congruential Generator, yields a sequence of pseudo-random numbers
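- A minimal PyTorch sketch (my own, not the paper's code) of symmetric uniform INT8 quantization with stochastic rounding; the clipping value `clip` is whatever the clipping search in the next sketch returns.

```python
import torch

def quantize_int8_stochastic(x: torch.Tensor, clip: float):
    """Symmetric uniform INT8 quantization with stochastic rounding.
    x    : tensor to quantize (e.g. a gradient)
    clip : clipping value c, so [-c, c] maps onto [-127, 127]
    """
    s = clip / 127.0                                  # quantization step
    x = x.clamp(-clip, clip) / s                      # scale into [-127, 127]
    floor = x.floor()
    prob_up = x - floor                               # fractional part
    # stochastic rounding: round up with probability equal to the fraction
    q = floor + (torch.rand_like(x) < prob_up).float()
    return q.to(torch.int8), s                        # int8 values + scale

def dequantize(q: torch.Tensor, s: float) -> torch.Tensor:
    return q.float() * s
```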
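- And a sketch of the two techniques as I read them from the notes above, reusing the helpers from the previous sketch; `find_clip_by_cosine` and the α/β defaults are my own placeholders, not the paper's settings.

```python
import math
import torch

def cosine_distance(a, b, eps=1e-12):
    a, b = a.flatten(), b.flatten()
    return 1.0 - torch.dot(a, b) / (a.norm() * b.norm() + eps)

def find_clip_by_cosine(grad, candidates):
    """Pick the clipping value whose clipped + quantized gradient stays closest
    in direction (cosine distance, not MSE) to the original gradient."""
    best_c, best_d = None, float("inf")
    for c in candidates:
        q, s = quantize_int8_stochastic(grad, c)      # from the sketch above
        d = float(cosine_distance(grad, dequantize(q, s)))
        if d < best_d:
            best_c, best_d = c, d
    return best_c

def lr_scale(deviation, alpha=20.0, beta=0.1):
    """Deviation Counteractive LR Scaling: f(d) = max(exp(-alpha * d), beta)."""
    return max(math.exp(-alpha * deviation), beta)
```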
-
📐 Exps:
-
💡 Ideas:
- (Found that with a smaller LR, MobileNetV2 training didn't crash, although performance degraded)
- The gradient's deviation accumulates exponentially since it is propagated through the layers
-
Improving Neural Network Quantization without Retraining using Outlier Channel Splitting
-
🔑 Key:
- Outlier Channel Splitting
-
🎓 Source:
- Zhiru
-
🌱 Motivation:
- Weight/activation distributions are bell-shaped, while hardware handles linear (uniform) quantization better
- So the outliers become a problem
-
💊 Methodology:
- Duplicate the outlier channels, then halve their values (see the sketch below)
- Similar to Net2WiderNet in "Net2Net"
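- A small NumPy sketch of the idea as I understand it (names are mine): split the input channel holding the largest-magnitude weight, halve both copies, and duplicate the matching input row so the layer's output is unchanged.

```python
import numpy as np

def ocs_split_once(W, X):
    """Outlier channel splitting on a (out_channels, in_channels) weight matrix.
    Duplicates the input channel containing the largest |w|, halves both copies,
    and duplicates the matching row of X, so W @ X is preserved while the
    largest weight magnitude is halved."""
    c = np.unravel_index(np.argmax(np.abs(W)), W.shape)[1]   # outlier's channel
    W_split = np.concatenate([W, W[:, c:c + 1]], axis=1)     # duplicate channel
    W_split[:, c] *= 0.5
    W_split[:, -1] *= 0.5                                    # halve both copies
    X_split = np.concatenate([X, X[c:c + 1, :]], axis=0)     # duplicate input row
    return W_split, X_split

# sanity check: functionally equivalent
W, X = np.random.randn(4, 8), np.random.randn(8, 3)
W2, X2 = ocs_split_once(W, X)
assert np.allclose(W @ X, W2 @ X2)
```

- In a real network the extra input row comes from duplicating the previous layer's output channel (the Net2WiderNet trick).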
-
📐 Exps:
-
💡 Ideas:
- Mainstream post-training quantization: first clipping, then symmetric linear quantization
- Activation clipping - uses a subset of input samples
- Earlier work: minimize the L2 norm of the quantization error
- ACIQ: fits a Gaussian/Laplacian, then uses the fitted curve to analytically compute the optimal threshold
- SAWB: linearly extrapolates from 6 distributions
- TensorRT: profiles the distribution and minimizes the KL divergence between the original and quantized distributions (see the sketch below)
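- A rough sketch of the TensorRT-style idea (my simplification, not NVIDIA's exact algorithm): build a histogram of |activations| and pick the clipping threshold that minimizes the KL divergence between the original distribution and its quantized-then-expanded version.

```python
import numpy as np
from scipy.stats import entropy   # entropy(p, q) = KL(p || q)

def kl_calibrate(activations, n_bins=2048, n_levels=128):
    hist, edges = np.histogram(np.abs(activations), bins=n_bins)
    best_t, best_kl = edges[-1], float("inf")
    for i in range(n_levels, n_bins + 1):
        p = hist[:i].astype(np.float64)
        p[-1] += hist[i:].sum()                      # fold the clipped tail in
        # simulate quantization to n_levels bins, then expand back to i bins
        chunks = np.array_split(p, n_levels)
        q = np.concatenate([np.full(len(c), c.sum() / max((c > 0).sum(), 1)) * (c > 0)
                            for c in chunks])
        kl = entropy(p + 1e-12, q + 1e-12)
        if kl < best_kl:
            best_kl, best_t = kl, edges[i]           # edges[i] = candidate threshold
    return best_t
```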
-
Scalable Methods for 8-bit Training of Neural Networks
- Intel (AIPG)
- Touches on several other works:
- Mixed Precision Training of Convnet
- 16-bit training with no accuracy loss
- Uses DFP (Dynamic Fixed Point)
- L1-Norm Batch Normalization for Efficient Training of Deep Neural Networks (Shuang Wu et al.)
- RangeBN (see the sketch below)
- Part 5: Theoretical Analysis
- GEMMLOWP
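- A sketch of Range BN as I remember it (the C(n) = 1/sqrt(2·ln n) constant is from memory, so treat it as an assumption): replace the variance with a range-based estimate, which behaves better under low-precision arithmetic.

```python
import torch

def range_bn(x, gamma, beta, eps=1e-5):
    """Range BN sketch for x of shape (N, C) with per-channel gamma/beta.
    Normalizes by range(x) * C(n) instead of the standard deviation; for
    Gaussian inputs, range * C(n) approximates the std."""
    n = x.shape[0]
    mu = x.mean(dim=0)
    centered = x - mu
    rng = centered.max(dim=0).values - centered.min(dim=0).values
    c_n = 1.0 / (2.0 * torch.log(torch.tensor(float(n)))) ** 0.5
    return gamma * centered / (rng * c_n + eps) + beta
```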
-
Learning to Quantize Deep Networks by Optimizing Quantization Intervals with Task Loss
- Trains the quantization interval; pruning & clipping are handled together
- Splits a quantizer into a transformer (maps elements into [-1,1] or [0,1]) and a discretizer (sketch below)
- My understanding: the quantization step qD is treated as a parameter (parameterize it)?
- Do c_x and d_x also have to be learned? (The authors don't mention how these two are obtained)
- The illustration only shows the positive half-axis; the negative half is symmetric
- The authors stress parameterizing the step as well and learning these parameters directly from the final loss, instead of matching the full-precision counterpart under an L2 norm (similar in spirit to "And The Bit Goes Down")
- Although the transformation is quite complex, the authors argue that at inference the forward pass uses the fixed, already-learned parameters, so it doesn't matter
- The update also relies on STE, but given how nonlinear this is, can STE still work?
- The activation quantizer is different: its exponent parameter gamma is fixed to 1
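- My sketch of a QIL-style quantizer with a learnable interval (center c, half-width d, exponent gamma), trained end-to-end through an STE; the exact parameterization in the paper may differ.

```python
import torch

class QILQuantizer(torch.nn.Module):
    """Learnable-interval quantizer sketch: a transformer maps |x| into [0, 1]
    (below the interval -> pruned to 0, above -> clipped to 1, sharpened by
    gamma), then a discretizer rounds to 2^bits - 1 levels with an STE."""
    def __init__(self, bits=2):
        super().__init__()
        self.c = torch.nn.Parameter(torch.tensor(0.5))      # interval center
        self.d = torch.nn.Parameter(torch.tensor(0.5))      # interval half-width
        self.gamma = torch.nn.Parameter(torch.tensor(1.0))  # fixed to 1 for activations
        self.levels = 2 ** bits - 1

    def forward(self, x):
        # transformer: linear ramp over [c - d, c + d], then power gamma
        t = ((x.abs() - (self.c - self.d)) / (2 * self.d)).clamp(0, 1) ** self.gamma
        # discretizer with straight-through estimator (forward: round, backward: identity)
        q = torch.round(t * self.levels) / self.levels
        t = t + (q - t).detach()
        return t * x.sign()
```

- All of c, d and gamma receive gradients from the task loss, which is the point the notes make about learning the interval end-to-end.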
-
Post Training 4 Bit Quantization of Conv Network For Rapid Deployment
- Intel (AIPG)
- No need for fine-tuning / the whole dataset
- Privacy & off-the-shelf (avoids retraining) - could already achieve 8 bit
- 3 Methods
-
- Analytical Clipping for Integer Quantization (ACIQ)
- Analytical threshold for the clipping value
- Assumes the quantization noise is a function of a Gaussian or Laplacian distribution
- For the conventional case of uniform quantization between min/max, rounding to the middle point
- The MSE is then the sum of two terms:
- Quantization noise
- Clipping noise
- Optimal clipping (see the sketch below)
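- My reconstruction of the "optimal clipping" step under a Laplace(0, b) assumption: clipping noise ≈ 2·b²·e^(-α/b) and quantization noise ≈ α²/(3·4^M); minimizing their sum numerically gives the threshold (for 4 bits this lands around 5.03·b, the value I recall the paper reporting).

```python
import numpy as np

def aciq_laplace_clip(b, num_bits):
    """Numerically minimize MSE(alpha) = 2*b^2*exp(-alpha/b) + alpha^2/(3*4^M)."""
    alphas = np.linspace(0.01 * b, 20.0 * b, 20000)
    mse = 2 * b**2 * np.exp(-alphas / b) + alphas**2 / (3 * 4**num_bits)
    return alphas[np.argmin(mse)]

print(aciq_laplace_clip(b=1.0, num_bits=4))   # ~5.03 * b
```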
-
- Per Channel Bit Allocation
-
- After quantization there will be a bias in the mean/variance (a correction sketch below)
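- A sketch of the kind of mean/variance correction I'd expect here (my own, per output channel; in practice the shift/scale would be folded into the following bias and scale rather than materialized):

```python
import numpy as np

def bias_correct(W, W_q, eps=1e-8):
    """Per output channel, shift/rescale the quantized weights W_q so their
    mean and std match the original weights W."""
    W_c = np.empty_like(W_q, dtype=np.float64)
    for i in range(W.shape[0]):                    # loop over output channels
        mu, mu_q = W[i].mean(), W_q[i].mean()
        sd, sd_q = W[i].std(), W_q[i].std()
        W_c[i] = (W_q[i] - mu_q) * (sd / (sd_q + eps)) + mu
    return W_c
```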
-
- Related Works
- Extensions of some early binarized networks
- The XNOR-Net paper proposes both Binary-Weight-Networks and XNOR-Net
- Vector multiplication becomes an XNOR + bitcount over binary vectors (see the sketch below)
- During backprop the gradients can also be made ternary, but the max has to be used as the scaling factor rather than the mean
- Quite surprising
- Source Code
- A third-party reimplementation
- According to this issue, it takes part in the forward computation
- Note that only XNOR-Net binarizes both the inputs and the weights, which is why it can use bitcount operations! The later TWN and TTQ never mention this, since they only ternarize the weights!
- Next: Ternary Weight Networks (TWN)
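- A toy illustration (plain Python, mine, not the paper's kernel) of why binarizing both inputs and weights enables bitcount arithmetic: pack the signs into bitmasks, then dot(sign(w), sign(x)) = 2·popcount(XNOR) - n, scaled by alpha = mean(|w|).

```python
def pack_signs(vec):
    """Pack the signs of a float vector into an integer bitmask (1 = +1, 0 = -1)."""
    bits = 0
    for i, v in enumerate(vec):
        if v >= 0:
            bits |= 1 << i
    return bits

def binary_dot(a_bits, b_bits, n):
    """dot(sign(a), sign(b)) over n lanes via XNOR + popcount."""
    xnor = ~(a_bits ^ b_bits) & ((1 << n) - 1)
    return 2 * bin(xnor).count("1") - n

w = [0.3, -1.2, 0.7, -0.1]
x = [1.0, 0.5, -0.4, -0.9]
alpha = sum(abs(v) for v in w) / len(w)                       # XNOR-Net scale
approx = binary_dot(pack_signs(w), pack_signs(x), len(w)) * alpha
```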
- ABCNet (Accurate Binary Conv)
- Represents full-precision weights with a linear combination of multiple binary bases
- DoReFa
- Can also accelerate the backward gradients
- Difference from XNOR-Net: XNOR-Net's scale factor is per-channel, while DoReFa's is per-layer
- The DoReFa authors say XNOR-Net's scheme cannot accelerate backprop, but the XNOR-Net authors state in the original paper that they also binarize the gradients
- TTQ (trains everything)
- BinaryConnect: Training Deep Neural Networks with binary weights during propagations
- 2015
- Matthieu Courbariaux, Yoshua Bengio, Jean-Pierre David
- Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1
- 2016
- Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, Yoshua Bengio
- XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks
- 2016
- Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, Ali Farhadi
- Convolutional Networks for Fast, Energy-Efficient Neuromorphic Computing
- 2016
- Steven K. Esser, Paul A. Merolla, John V. Arthur, Andrew S. Cassidy, Rathinakumar Appuswamy, Alexander Andreopoulos, David J. Berg, Jeffrey L. McKinstry, Timothy Melano, Davis R. Barch, Carmelo di Nolfo, Pallab Datta, Arnon Amir, Brian Taba, Myron D. Flickner, Dharmendra S. Modha
- Ternary Weight Networks
- 2016
- Fengfu Li, Bo Zhang, Bin Liu
- DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients
- 2016
- Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, Yuheng Zou
- Flexible Network Binarization with Layer-wise Priority
- 2017
- Lixue Zhuang, Yi Xu, Bingbing Ni, Hongteng Xu
- ReBNet: Residual Binarized Neural Network
- 2017
- Mohammad Ghasemzadeh, Mohammad Samragh, Farinaz Koushanfar
- Towards Accurate Binary Convolutional Neural Network
- 2017
- Xiaofan Lin, Cong Zhao, Wei Pan
- Training a Binary Weight Object Detector by Knowledge Transfer for Autonomous Driving
- 2018
- Jiaolong Xu, Peng Wang, Heng Yang, Antonio M. López
- Self-Binarizing Networks
- 2019
- Fayez Lahoud, Radhakrishna Achanta, Pablo Márquez-Neila, Sabine Süsstrunk
- Latent Weights Do Not Exist: Rethinking Binarized Neural Network Optimization
- 2019
- Koen Helwegen, James Widdicombe, Lukas Geiger, Zechun Liu, Kwang-Ting Cheng, Roeland Nusselder
- Least squares binary quantization of neural networks
- 2020
- Hadi Pouransari, Oncel Tuzel
- Widening and Squeezing: Towards Accurate and Efficient QNNs
- 2020
- Chuanjian Liu, Kai Han, Yunhe Wang, Hanting Chen, Chunjing Xu, Qi Tian
- Bi-Real Net: Enhancing the Performance of 1-bit CNNs With Improved Representational Capability and Advanced Training Algorithm
- Training Competitive Binary Neural Networks from Scratch
- 2018
- Stresses that many current methods (not so current anymore, in hindsight) rely on 2-stage training or a full-precision model
- So they propose a relatively simple training method that works directly from scratch
- Also the first to propose the binary + dense (DenseNet-style) pattern
- Mentions Bi-Real Net
- A binarized ResNet with additional shortcuts
- At the cost of one extra real-valued addition
- Shortcuts the part before the Sign function to after the BN
- a change of gradient computation
- complex training strategy, finetuning from full-precision
- Slightly modifies the STE
- No weight decay
- How to choose the scaling factor
- XNORNet++
- BMVC
- The original version computes the scaling factor for the activation feature map on the fly for every input, which is computationally expensive
- Fuses the weight & activation scaling factors into one
- (Feels quite trivial and empirical)
- MeliusNet: Can Binary Neural Networks Achieve MobileNet-level Accuracy?
- 2020
- Joseph Bethge, Christian Bartz, Haojin Yang, Ying Chen, Christoph Meinel
- Essentially proposes a new block
- Points out that existing ways of improving BNNs include:
- Increasing the number of channels
- Multiple binary bases
- Argues that the main losses in a BNN come from:
- The error between FP32 multiplication and binary multiplication
- The limited feature-map space caused by a shared scaling factor
- IR-Net: Forward and Backward Information Retention for Accurate Binary Neural Networks
- IR-Net (Information Retention Network)
- Two main techniques, targeting the forward and backward passes respectively
- Libra Parameter Binarization - minimizes the quantization error and information loss
- Balances & standardizes the weights (see the sketch below)
- Error Decay Estimator (EDE)
- On Libra Parameter Binarization: merely minimizing the quantization error ||A - Q(A)|| does not always work
- EDE
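- My reading of the Libra Parameter Binarization bullet as a sketch (details may differ from the paper): balance the weights to zero mean, standardize them, then binarize with sign() and a power-of-two scale so the scaling stays a bit-shift.

```python
import torch

def libra_pb(w):
    """Balance (zero-mean) + standardize the weights, then sign() with a
    power-of-two scaling factor (cheap to apply as a bit-shift)."""
    w_std = (w - w.mean()) / (w.std() + 1e-8)                 # balanced, standardized
    s = 2.0 ** torch.round(torch.log2(w_std.abs().mean()))    # power-of-two scale
    return torch.sign(w_std) * s
```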
-
Multi-Precision Quantized Neural Networks via Encoding Decomposition of -1 and +1
-
🔑 Key:
- Decomposes a multi-precision NN into multiple binary NNs for more efficient deployment
-
🎓 Source:
-
🌱 Motivation:
-
💊 Methodology:
-
📐 Exps:
-
💡 Ideas:
-
Novelty
-
Decompose the NN into multiple BNNs
-
M-bit Encoding Function
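- How I understand the M-bit encoding function (the paper's exact encoding may differ): write each M-bit value as M bit-planes over {-1, +1}, so an M-bit × K-bit product decomposes into M·K binary (XNOR/bitcount-friendly) operations.

```python
import numpy as np

def encode_pm1(q, num_bits):
    """Decompose integers q in [0, 2^M - 1] into M bit-planes over {-1, +1}.
    With b in {0,1} mapped to 2b - 1, q = sum_m 2^m * (plane_m + 1) / 2."""
    return np.stack([((q >> m) & 1) * 2 - 1 for m in range(num_bits)])  # LSB first

def decode_pm1(planes):
    return sum((2 ** m) * (planes[m] + 1) // 2 for m in range(planes.shape[0]))

q = np.array([0, 3, 5, 7])
assert np.array_equal(decode_pm1(encode_pm1(q, 3)), q)
```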
-
Support Mixed Precisions
-
Advantages
- Many tasks, generality
-
Questions
- Typo in Table3 "Encoded Activation and Weights"
- Periodical
- Does the reported speed-up account for encoding/decoding and the scale multiplication (although these may not cost much)?
- Hardware cost of the decomposition method
-