  • Towards Unified INT8 Training for Convolutional Neural Network

  • 🔑 Key:

    • Mainly Dealing with the Gradient Quantization
    • Empirical 4 Rules of Gradient
    • Theoretical Convergence Bound & 2 Principles
    • 2 Techniques: Directional-Sensitive Gradient Clipping + Deviation Counteractive LR Scaling
  • 🎓 Source:

    • CVPR 2020 SenseTime + BUAA
  • 🌱 Motivation:

  • 💊 Methodology:

    • Symmetric Uniform Quantization with Stochastic Rounding
    • Challenges of quantizing gradients
      • A small perturbation can change the gradient direction
      • Sharp and wide distribution (unlike weights/activations)
      • Evolutionary: as training goes on, the distribution becomes even sharper
      • Layer depth: closely tied to depth in the network (the shallower the layer, the sharper the distribution)
      • Special blocks: depthwise (DW) layers are always sharp
    • Theoretical bound affected by 3 terms (mainly the quantization error, the LR, and the gradient L2 norm)
      • Useful tricks: 1. minimize the quantization error 2. scale down the LR
    • Directional Sensitive Gradient Clipping
      • Actually it is just plain gradient clipping
      • Find the clipping value via cosine distance instead of MSE (to avoid being dominated by the gradient magnitude)
    • Deviation Counteractive LR Scaling
      • Balances the exponentially accumulated gradient error (deviation) by exponentially decreasing the LR accordingly (see the sketch after this list)
      • f(deviation) = max(e^(-\alpha*deviation), \beta)
        • \beta controls the lower bound of lr
        • \alpha controls the decay degree
    • Stochastic Rounding
      • curandGenerator
      • Linear Congruential Generator, yielding a sequence of pseudo-random numbers
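A minimal NumPy sketch of the pieces above, assuming the usual symmetric INT8 mapping; the "deviation" here is just the cosine distance between the full-precision and dequantized gradient, and the alpha/beta values are illustrative, not the paper's:

```python
import numpy as np

def quantize_int8_stochastic(x, clip_value):
    """Symmetric uniform INT8 quantization with stochastic rounding."""
    scale = clip_value / 127.0
    y = np.clip(x, -clip_value, clip_value) / scale
    floor = np.floor(y)
    y_int = floor + (np.random.rand(*y.shape) < (y - floor))  # round up with prob = fraction
    return y_int.astype(np.int8), scale

def lr_scale(deviation, alpha=20.0, beta=0.1):
    """Deviation-counteractive LR scaling: f(d) = max(exp(-alpha * d), beta)."""
    return max(np.exp(-alpha * deviation), beta)

g = np.random.randn(1024) * 1e-2
q, s = quantize_int8_stochastic(g, clip_value=np.abs(g).max())
dq = q.astype(np.float32) * s
deviation = 1.0 - np.dot(g, dq) / (np.linalg.norm(g) * np.linalg.norm(dq) + 1e-12)
print(lr_scale(deviation))
```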
  • 📐 Exps:

  • 💡 Ideas:

    • (Found that with a smaller LR, MobileNetV2 training did not crash, although performance decayed)
    • The gradient deviation accumulates exponentially because it is propagated through the layers
  • Improving Neural Network Quantization without Retraining using Outlier Channel Splitting

  • 🔑 Key:

    • Outlier Channel Splitting
  • 🎓 Source:

    • Zhiru
  • 🌱 Motivation:

    • The tensors quantized post-training follow bell-shaped distributions, while hardware handles linear (uniform) quantization better
      • so the outliers become a problem
  • 💊 Methodology:

    • Duplicate the outlier channels, then halve their values (see the sketch below)
    • Similar to Net2WiderNet in 《Net2Net》
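A rough sketch of the splitting step on a 2-D weight matrix (my own illustration; in the full method the duplicated input channel must be matched by duplicating the producing layer's output channel, which is omitted here):

```python
import numpy as np

def outlier_channel_split(W, num_splits=1):
    """Duplicate the input channel holding the largest |w| and halve both copies.
    With the corresponding input duplicated upstream, the layer output is unchanged,
    but the weight outlier's magnitude is halved."""
    W = W.copy()
    for _ in range(num_splits):
        c = np.unravel_index(np.argmax(np.abs(W)), W.shape)[1]  # column = input channel
        W[:, c] *= 0.5
        W = np.concatenate([W, W[:, c:c + 1]], axis=1)          # append the halved duplicate
    return W

W = np.random.randn(4, 8)
W[2, 5] = 10.0                                  # plant an outlier
W_split = outlier_channel_split(W)
print(W.shape, W_split.shape, np.abs(W).max(), np.abs(W_split).max())
```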
  • 📐 Exps:

  • 💡 Ideas:

    • Mainstream post-training quantization: first clip, then symmetric linear quantization
      • Activation clipping: use a subset of input samples
      • Earlier work: minimize the L2 norm of the quantization error
      • ACIQ: fit a Gaussian or Laplacian, and analytically compute the optimal threshold from the fitted curve
      • SAWB: linearly extrapolate over 6 distributions
      • TensorRT: profile the distribution, minimize the KL divergence between the original and quantized distributions (a rough sketch follows)
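A heavily simplified sketch of the TensorRT-style KL calibration idea (not the actual implementation; the histogram sizes, the folding of the clipped tail, and the re-quantization step are all simplifications):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    p = p / (p.sum() + eps)
    q = q / (q.sum() + eps)
    mask = p > 0
    return np.sum(p[mask] * np.log((p[mask] + eps) / (q[mask] + eps)))

def find_clip_threshold(acts, num_bins=2048, num_levels=128):
    """Sweep candidate clip thresholds; keep the one whose coarsely re-quantized
    histogram has the smallest KL divergence to the original distribution."""
    hist, edges = np.histogram(np.abs(acts), bins=num_bins)
    best_t, best_kl = edges[-1], np.inf
    for i in range(num_levels, num_bins + 1, 16):
        ref = hist[:i].astype(float)
        ref[-1] += hist[i:].sum()                      # fold the clipped tail into the last bin
        chunks = np.array_split(ref, num_levels)       # re-quantize down to num_levels bins
        quant = np.concatenate([np.full(len(c), c.mean()) for c in chunks])
        kl = kl_divergence(ref, quant)
        if kl < best_kl:
            best_kl, best_t = kl, edges[i]
    return best_t

acts = np.random.laplace(size=100000)
print(find_clip_threshold(acts))
```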
  • Training Quantized Network with Auxiliary Gradient Module

    • An extra full-precision gradient module (to fix the problem that the residual skip connections are hard to quantize; the goal is a fully fixed-point forward pass). It has the flavor of using one FP network to leverage a low-bit network
      • The Adapter here is a 1x1 Conv
      • F and H share the conv-layer weights but have separate BNs, and are jointly optimized during training
      • The network H bypasses the shortcuts and lets the gradients flow back
    • Seems like a rather long-winded method...
  • Scalable Methods for 8-bit Training of Neural Networks

    • Intel (AIPG)
    • Touches on several other works
    • RangeBN
      • (So it is normal that our earlier runs blew up; but why was quantizing BN in the forward pass, without quantizing the gradient, fine?)
      • Core idea: replace the variance with the input's max-min range, times a scale-adjust term (the constant 1/sqrt(2*ln(n))); see the sketch below
        • Shown to hold under a Gaussian assumption
      • In backward, only the gradient w.r.t. y is low-bit, while the gradient w.r.t. W is full precision
        • (The authors argue that only y's computation is sequential, so the other component's computation does not need to be fast and can stay full precision...)
    • Part5 Theoretical Analysis
      • ⭐ The per-layer sensitivity analysis is a useful reference
    • GEMMLOWP
      • Scales are determined per chunk, to avoid a large dynamic range
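A small sketch of the Range BN estimate described above, assuming the constant 1/sqrt(2*ln(n)) is applied directly to the max-min range (the paper may place the constant slightly differently):

```python
import numpy as np

def range_bn(x, gamma, beta, eps=1e-5):
    """Range BN sketch: estimate the per-feature scale from the range (max - min)
    times 1/sqrt(2*ln(n)) instead of computing a variance, which avoids the
    square/sqrt dynamic-range issues in low precision."""
    n = x.shape[0]
    xc = x - x.mean(axis=0)
    sigma_hat = (xc.max(axis=0) - xc.min(axis=0)) / np.sqrt(2.0 * np.log(n))
    return gamma * xc / (sigma_hat + eps) + beta

x = np.random.randn(256, 64)
y = range_bn(x, gamma=np.ones(64), beta=np.zeros(64))
print(y.std(axis=0)[:4])
```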
  • Learning to Quantize Deep Networks by Optimizing Quantization Intervals with Task Loss

    • Trains the quantization intervals; pruning & clipping are handled together
    • Splits a quantizer into a transformer (mapping elements to [-1,1] or [0,1]) and a discretizer
      • My understanding: the quantization step qD is itself treated as a parameter (parameterize it)?
      • The illustration only shows the positive half-axis; the negative half is symmetric
      • Do cx and dx also have to be learned? (The authors do not mention how these two are obtained)
      • The authors stress parameterizing the step as well, and learning these parameters directly from the final loss, instead of matching the full-precision copy under an L2 norm (similar in spirit to And The Bit Goes Down); a sketch follows below
        • Although the transformation is quite convoluted, the authors argue that at inference the forward pass uses the fixed parameters, so it does not matter
        • The update also relies on STE, but given how nonlinear this is, can STE still work?
      • The activation quantizer is different: its exponent parameter gamma is fixed to 1
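A PyTorch sketch of the idea of a learnable quantization interval trained through STE (my simplification: gamma fixed to 1, and "center"/"distance" stand in for the paper's cx/dx parameterization):

```python
import torch

class RoundSTE(torch.autograd.Function):
    """Round in the forward pass, pass the gradient straight through in backward."""
    @staticmethod
    def forward(ctx, x):
        return torch.round(x)
    @staticmethod
    def backward(ctx, grad_out):
        return grad_out

def interval_quantizer(w, center, distance, bits=2):
    """Transformer: map |w| inside the learnable interval
    [center - distance, center + distance] to [0, 1], restore the sign,
    then discretize with a straight-through round."""
    levels = 2 ** (bits - 1) - 1
    t = torch.clamp((w.abs() - (center - distance)) / (2 * distance), 0.0, 1.0)
    t = t * torch.sign(w)
    return RoundSTE.apply(t * levels) / levels

w = torch.randn(16, requires_grad=True)
center = torch.tensor(0.5, requires_grad=True)
distance = torch.tensor(0.4, requires_grad=True)
q = interval_quantizer(w, center, distance)
q.sum().backward()                       # gradients reach the interval parameters
print(center.grad, distance.grad)
```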

  • [Mixed Precision Training of CNN using Integer]
    • DFP (Dynamic Fixed Point)

  • Post Training 4 Bit Quantization of Conv Network For Rapid Deployment

    • Intel (AIPG)
    • No need for fine-tuning / the whole dataset
      • Privacy & off-the-shelf (avoids retraining) - could already achieve 8-bit this way
    • 3 Methods
        1. Analytical Clipping for Integer Quantization (ACIQ)
        • Analytical threshold for the clipping value
        • Assume the quantization noise is a function of a Gaussian or Laplacian distribution
        • For the conventional case of uniform quantization between min/max, rounding to the bin midpoint
        • The MSE then decomposes into
        • Quantization noise (inside the clipping range)
        • Clipping noise (outside the range)
        • Optimal clipping value
          • We could just use this
        2. Per-Channel Bit Allocation
        • Minimize the overall MSE
        • Regular per-channel quantization gives each channel its own scale and offset; this paper additionally assigns different bitwidths to different channels
        • Only requires that the average over channels is still 4 bits (a bit of a cheat)
        • Cast as an optimization problem: B bits in total to allocate across N channels so that the final MSE is minimized (see the sketch below)
        • Lagrange multipliers
          • yield the final optimal allocation, a B_i for each channel
        • Assumes/derives that the optimal quantization step size scales as (range)^(2/3)
        3. Bias correction: after quantization there is a bias in the mean/variance
    • Related Works
      • ACIQ is an earlier method by the same authors (proposed post-training activation clipping)
        • Even earlier, people used KL divergence to find the clipping value (needs the full-precision model to be run, which takes time)
        • In essence these all handle statistical outliers
          • Some work splits channels
            • Details unclear; noting it here to check later if there is time
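A sketch of the per-channel bit allocation, following the note's claim that the optimal step scales as range^(2/3) and solving for the proportionality constant so the average bitwidth meets the budget (my reading, not the paper's exact formula):

```python
import numpy as np

def allocate_bits(ranges, avg_bits=4.0):
    """Set each channel's quantization step proportional to range**(2/3),
    then pick the constant so the average bitwidth over channels hits the budget.
    Continuous bits: B_i = log2(2*R_i / step_i) with step_i = c * R_i**(2/3)
                         = 1 + log2(R_i)/3 - log2(c)."""
    ranges = np.asarray(ranges, dtype=float)
    log2_c = 1 + np.log2(ranges).mean() / 3 - avg_bits   # solve log2(c) from the budget
    bits = 1 + np.log2(ranges) / 3 - log2_c
    return np.clip(np.round(bits), 1, 8).astype(int)

ranges = np.array([0.1, 0.2, 0.5, 1.0, 4.0])   # per-channel dynamic ranges
bits = allocate_bits(ranges, avg_bits=4.0)
print(bits, bits.mean())                        # wider channels get more bits, average ~4
```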
  • Accurate & Efficient 2 bit QNN

    • PACT + SAWB
      • SAWB (Statistics-Aware Weight Binning): aims to make effective use of the distribution's statistics (really just the first- and second-order moments)
      • The objective is still minimizing the quantization error (the L2 norm between the new weights and the original weights)
      • Argues that the earlier practice of deriving the scale from the mean of all parameters only holds when the distribution is Normal
      • The coefficients C1, C2 are obtained by linear fitting, per bitwidth (see the sketch after this list)
        • Several fairly common distributions are used (Gaussian, Uniform, Laplace, Logistic, Triangle, von Mises)
        • (My understanding: this mostly shows that the first and second moments are enough for the fit.) The points in the figure are the optimal scale for each distribution
          • The experiment above covers a single quantization interval; the authors' later experiments show the same holds for multiple intervals
          • So the biggest contribution is really: for the actual distribution, use its statistics to find the optimal scaling
    • The paper analyzes that PACT has the same expressive power as ReLU
    • Also gives a system-design insight
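A sketch of the SAWB-style scale from the first and second moments, followed by plain symmetric 2-bit quantization; the c1/c2 defaults below are placeholders, not the paper's fitted coefficients:

```python
import numpy as np

def sawb_scale(w, c1=3.2, c2=-2.1):
    """Clipping scale from the first and second moments:
    alpha = c1 * sqrt(E[w^2]) + c2 * E[|w|].
    c1/c2 are fit offline per bitwidth against synthetic distributions;
    the defaults here are illustrative placeholders."""
    return c1 * np.sqrt(np.mean(w ** 2)) + c2 * np.mean(np.abs(w))

def quantize_2bit(w, alpha):
    """Symmetric 2-bit uniform quantization to the 4 levels {-a, -a/3, a/3, a}."""
    step = 2 * alpha / 3
    q = np.clip(w, -alpha, alpha)
    return np.round((q + alpha) / step) * step - alpha

w = np.random.randn(4096) * 0.05
alpha = sawb_scale(w)
print(alpha, np.mean((w - quantize_2bit(w, alpha)) ** 2))
```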

  • TTQ
    • Han
    • DoReFa (layer-wise scaling factor: mean of the L1 norm)
    • TWN (Ternary Weight Network) (viewed as an optimization problem: minimize the L2 norm between the ternary and full-precision weights; see the sketch below)
      • t is a hyperparameter, kept the same across all layers
    • The TWN procedure is folded into quantization-aware training
    • Scaling factor
      • DoReFa directly takes the mean of the L1 norm
      • TWN minimizes the L2 norm against the fp32 weights (XNOR does too)
      • In TTQ the scaling factors are trained (so they do not come from the overall weight distribution; they are independent parameters)
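A NumPy sketch of TWN-style ternarization with the global threshold hyperparameter t (the commonly cited value 0.7 is used here as an assumption):

```python
import numpy as np

def twn_ternarize(w, t=0.7):
    """Threshold delta = t * E[|w|]; weights become {-alpha, 0, +alpha} with
    scale alpha = mean |w| over the weights above the threshold."""
    delta = t * np.mean(np.abs(w))
    mask = np.abs(w) > delta
    alpha = np.abs(w[mask]).mean() if mask.any() else 0.0
    return alpha * np.sign(w) * mask

w = np.random.randn(3, 3, 64, 64) * 0.02
wt = twn_ternarize(w)
print(np.unique(np.round(wt, 6)).size)   # 3 distinct values: {-alpha, 0, +alpha}
```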

  • TernGrad
    • Mathematically proves convergence (under the assumption of bounded gradients)
    • Trained from scratch
    • Layer-wise ternarization plus gradient clipping (see the sketch below)
      • where b_t is a random binary vector
      • Stochastic rounding
    • Scale sharing
    • Why does it work for them and not for us? (Because their forward pass is full precision?)
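A NumPy sketch of the TernGrad ternarization (gradient clipping omitted): the shared scale is max|g| and the random binary vector b makes the quantization unbiased in expectation:

```python
import numpy as np

def terngrad(g):
    """Ternarize a gradient to s * sign(g) * b, where s = max(|g|) is the shared
    scale and b is a random binary vector with P(b_i = 1) = |g_i| / s."""
    s = np.abs(g).max()
    if s == 0:
        return g
    b = np.random.rand(*g.shape) < (np.abs(g) / s)   # stochastic rounding
    return s * np.sign(g) * b

g = np.random.randn(1000) * 1e-3
print(g.mean(), np.mean([terngrad(g).mean() for _ in range(200)]))  # means roughly agree
```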

Binary Related

  • Extensions of some early binarized networks
    • The XNOR-Net paper proposed both Binary-Weight-Networks and XNOR-Net
      • Vector multiplication becomes a bitcount over binary vectors
      • In backprop the gradients can also be made ternary, but the scaling factor then has to be taken as the max rather than the mean
    • Note that only XNOR-Net binarizes both the inputs and the weights, which is what enables the bitcount operations! The later TWN and TTQ never mention this, and they only ternarize the weights! (A small sketch follows this list)
    • Then comes TernaryNet (TWN)
      • from binary to ternary
    • ABCNet (Accurate Binary Conv)
      • Uses a linear combination of multiple binary bases to represent the full-precision weights
    • DoReFa
      • Can also accelerate the backward gradients
      • Difference from XNOR: XNOR takes the scale factor per channel, while DoReFa takes it per layer
        • The DoReFa authors say XNOR's scheme cannot accelerate backprop, but the XNOR authors state in their paper that they binarize the gradients too
    • TTQ (train everything)
      • Backprop
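A small sketch of the two ingredients above: per-output-channel weight binarization with a mean-|W| scale, and the XNOR + bitcount identity for binary dot products (my own illustration):

```python
import numpy as np

def binarize_weights(W):
    """BWN/XNOR-style weight binarization: per output channel,
    B = sign(W) and alpha = mean(|W|), so W is approximated by alpha * B."""
    alpha = np.abs(W).reshape(W.shape[0], -1).mean(axis=1).reshape(-1, 1, 1, 1)
    return alpha * np.sign(W)

def binary_dot_bitcount(a_bits, w_bits):
    """dot(sign_a, sign_w) = 2 * popcount(XNOR(a_bits, w_bits)) - n,
    for bit vectors encoding +1 as 1 and -1 as 0."""
    xnor = ~(a_bits ^ w_bits) & 1
    return 2 * int(xnor.sum()) - a_bits.size

W = np.random.randn(8, 3, 3, 3)
print(np.abs(W - binarize_weights(W)).mean())      # approximation error

a = np.where(np.random.randn(64) > 0, 1, -1)
w = np.where(np.random.randn(64) > 0, 1, -1)
print(int(a @ w), binary_dot_bitcount((a > 0).astype(int), (w > 0).astype(int)))  # should match
```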

  • Multi-Precision Quantized Neural Networks via Encoding Decomposition of -1 and +1

  • 🔑 Key:

    • Decompose a multi-precision NN into multiple binary NNs, for more efficient deployment
  • 🎓 Source:

  • 🌱 Motivation:

  • 💊 Methodology:

  • 📐 Exps:

  • 💡 Ideas:

  • Contributions

    • Decompose the NN into multiple BNNs

    • M-bit Encoding Function (see the sketch at the end of these notes)

    • Support Mixed Precisions

      • Advantages

      • Many tasks, generality
    • Questions

      • Typo in Table3 "Encoded Activation and Weights"
      • Periodical
      • Does the reported speed-up account for the encoding/decoding and the scale multiplications (though these may not cost much)?
      • Hardware cost of the decomposition method
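A sketch of the decomposition idea: an M-bit unsigned tensor split into M {-1,+1} bit planes, so that a multi-bit product reduces to weighted binary (XNOR/bitcount-friendly) products; this is my illustration of the idea, not the paper's exact encoding function:

```python
import numpy as np

def encode_pm1(x_int, m):
    """Write x in {0, ..., 2^m - 1} as a base-2 expansion and map each
    bit plane b in {0,1} to 2b - 1 in {-1,+1}."""
    planes = [((x_int >> i) & 1) * 2 - 1 for i in range(m)]
    return np.stack(planes)                      # shape: (m, *x_int.shape)

def decode_pm1(planes):
    """Inverse: x = sum_i 2^i * (plane_i + 1) / 2."""
    m = planes.shape[0]
    weights = (2 ** np.arange(m)).reshape(m, *([1] * (planes.ndim - 1)))
    return (weights * (planes + 1) // 2).sum(axis=0)

x = np.random.randint(0, 16, size=(2, 3))        # 4-bit unsigned values
planes = encode_pm1(x, m=4)
assert np.array_equal(decode_pm1(planes), x)     # decomposition is lossless
print(planes.shape)
```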