  • Towards Unified INT8 Training for Convolutional Neural Network

  • 🔑 Key:

    • Mainly Dealing with the Gradient Quantization
    • Empirical 4 Rules of Gradient
    • Theoretical Convergence Bound & 2 Principles
    • 2 Techniques: Directional-Sensitive Gradient Clipping + Deviation Counteractive LR Scaling
  • 🎓 Source:

    • CVPR 2020 SenseTime + BUAA
  • 🌱 Motivation:

  • 💊 Methodology:

    • Symmetric Uniform Quantization with Stochastic Rounding
    • Challenges of quantizing gradients
      • A small perturbation can change the gradient direction
      • Sharp and wide distribution (unlike weights/activations)
      • Evolutionary: as training goes on, the distribution becomes even sharper
      • Layer depth: closely tied to depth in the network (the shallower the layer, the sharper the distribution)
      • Special blocks: depthwise (DW) layers are always sharp
    • Theoretical bound affected by 3 terms (mainly the quantization error, the LR, and the gradient L2 norm)
      • Useful tricks: 1. minimize the quantization error 2. scale down the LR
    • Directional Sensitive Gradient Clipping
      • Actually it is just plain gradient clipping
      • Find the clipping value via cosine distance instead of MSE (to avoid being dominated by the gradient magnitude)
    • Deviation Counteractive LR Scaling
      • Balances the exponentially accumulated gradient error (deviation) by exponentially decreasing the LR accordingly (see the sketch after this list)
      • f(deviation) = max(e^(-\alpha*deviation), \beta)
        • \beta controls the lower bound of lr
        • \alpha controls the decay degree
    • Stochastic Rounding
      • curandGenerator
      • Linear Congruential Generator, yielding a sequence of pseudo-random numbers
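A minimal NumPy sketch of the pieces above, assuming the usual symmetric INT8 mapping; the "deviation" here is just the cosine distance between the full-precision and dequantized gradient, and the alpha/beta values are illustrative, not the paper's:

```python
import numpy as np

def quantize_int8_stochastic(x, clip_value):
    """Symmetric uniform INT8 quantization with stochastic rounding."""
    scale = clip_value / 127.0
    y = np.clip(x, -clip_value, clip_value) / scale
    floor = np.floor(y)
    y_int = floor + (np.random.rand(*y.shape) < (y - floor))  # round up with prob = fraction
    return y_int.astype(np.int8), scale

def lr_scale(deviation, alpha=20.0, beta=0.1):
    """Deviation-counteractive LR scaling: f(d) = max(exp(-alpha * d), beta)."""
    return max(np.exp(-alpha * deviation), beta)

g = np.random.randn(1024) * 1e-2
q, s = quantize_int8_stochastic(g, clip_value=np.abs(g).max())
dq = q.astype(np.float32) * s
deviation = 1.0 - np.dot(g, dq) / (np.linalg.norm(g) * np.linalg.norm(dq) + 1e-12)
print(lr_scale(deviation))
```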
  • 📐 Exps:

  • 💡 Ideas:

    • (Found that with a smaller LR, MobileNetV2 training did not crash, although performance decayed)
    • The gradient deviation accumulates exponentially because it is propagated through the layers
  • Improving Neural Network Quantization without Retraining using Outlier Channel Splitting

  • 🔑 Key:

    • Outlier Channel Splitting
  • 🎓 Source:

    • Zhiru
  • 🌱 Motivation:

    • The tensors quantized post-training follow bell-shaped distributions, while hardware handles linear (uniform) quantization better
      • so the outliers become a problem
  • 💊 Methodology:

    • Duplicate the outlier channels, then halve their values (see the sketch below)
    • Similar to Net2WiderNet in 《Net2Net》
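A rough sketch of the splitting step on a 2-D weight matrix (my own illustration; in the full method the duplicated input channel must be matched by duplicating the producing layer's output channel, which is omitted here):

```python
import numpy as np

def outlier_channel_split(W, num_splits=1):
    """Duplicate the input channel holding the largest |w| and halve both copies.
    With the corresponding input duplicated upstream, the layer output is unchanged,
    but the weight outlier's magnitude is halved."""
    W = W.copy()
    for _ in range(num_splits):
        c = np.unravel_index(np.argmax(np.abs(W)), W.shape)[1]  # column = input channel
        W[:, c] *= 0.5
        W = np.concatenate([W, W[:, c:c + 1]], axis=1)          # append the halved duplicate
    return W

W = np.random.randn(4, 8)
W[2, 5] = 10.0                                  # plant an outlier
W_split = outlier_channel_split(W)
print(W.shape, W_split.shape, np.abs(W).max(), np.abs(W_split).max())
```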
  • 📐 Exps:

  • 💡 Ideas:

    • Mainstream post-training quantization: first clip, then symmetric linear quantization
      • Activation clipping: use a subset of input samples
      • Earlier work: minimize the L2 norm of the quantization error
      • ACIQ: fit a Gaussian or Laplacian, and analytically compute the optimal threshold from the fitted curve
      • SAWB: linearly extrapolate over 6 distributions
      • TensorRT: profile the distribution, minimize the KL divergence between the original and quantized distributions (a rough sketch follows)
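A heavily simplified sketch of the TensorRT-style KL calibration idea (not the actual implementation; the histogram sizes, the folding of the clipped tail, and the re-quantization step are all simplifications):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    p = p / (p.sum() + eps)
    q = q / (q.sum() + eps)
    mask = p > 0
    return np.sum(p[mask] * np.log((p[mask] + eps) / (q[mask] + eps)))

def find_clip_threshold(acts, num_bins=2048, num_levels=128):
    """Sweep candidate clip thresholds; keep the one whose coarsely re-quantized
    histogram has the smallest KL divergence to the original distribution."""
    hist, edges = np.histogram(np.abs(acts), bins=num_bins)
    best_t, best_kl = edges[-1], np.inf
    for i in range(num_levels, num_bins + 1, 16):
        ref = hist[:i].astype(float)
        ref[-1] += hist[i:].sum()                      # fold the clipped tail into the last bin
        chunks = np.array_split(ref, num_levels)       # re-quantize down to num_levels bins
        quant = np.concatenate([np.full(len(c), c.mean()) for c in chunks])
        kl = kl_divergence(ref, quant)
        if kl < best_kl:
            best_kl, best_t = kl, edges[i]
    return best_t

acts = np.random.laplace(size=100000)
print(find_clip_threshold(acts))
```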
  • Training Quantized Network with Auxiliary Gradient Module

    • An extra full-precision gradient module (to fix the problem that the residual skip connections are hard to quantize; the goal is a fully fixed-point forward pass). It has the flavor of using one FP network to leverage a low-bit network
      • The Adapter here is a 1x1 Conv
      • F and H share the conv-layer weights but have separate BNs, and are jointly optimized during training
      • The network H bypasses the shortcuts and lets the gradients flow back
    • Seems like a rather long-winded method...
  • Scalable Methods for 8-bit Training of Neural Networks

    • Intel (AIPG)
    • Touches on several other works
    • RangeBN
      • (So it is normal that our earlier runs blew up; but why was quantizing BN in the forward pass, without quantizing the gradient, fine?)
      • Core idea: replace the variance with the input's max-min range, times a scale-adjust term (the constant 1/sqrt(2*ln(n))); see the sketch below
        • Shown to hold under a Gaussian assumption
      • In backward, only the gradient w.r.t. y is low-bit, while the gradient w.r.t. W is full precision
        • (The authors argue that only y's computation is sequential, so the other component's computation does not need to be fast and can stay full precision...)
    • Part5 Theoretical Analysis
      • ⭐ The per-layer sensitivity analysis is a useful reference
    • GEMMLOWP
      • Scales are determined per chunk, to avoid a large dynamic range
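A small sketch of the Range BN estimate described above, assuming the constant 1/sqrt(2*ln(n)) is applied directly to the max-min range (the paper may place the constant slightly differently):

```python
import numpy as np

def range_bn(x, gamma, beta, eps=1e-5):
    """Range BN sketch: estimate the per-feature scale from the range (max - min)
    times 1/sqrt(2*ln(n)) instead of computing a variance, which avoids the
    square/sqrt dynamic-range issues in low precision."""
    n = x.shape[0]
    xc = x - x.mean(axis=0)
    sigma_hat = (xc.max(axis=0) - xc.min(axis=0)) / np.sqrt(2.0 * np.log(n))
    return gamma * xc / (sigma_hat + eps) + beta

x = np.random.randn(256, 64)
y = range_bn(x, gamma=np.ones(64), beta=np.zeros(64))
print(y.std(axis=0)[:4])
```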
  • Learning to Quantize Deep Networks by Optimizing Quantization Intervals with Task Loss

    • Trains the quantization intervals; pruning & clipping are handled together
    • Splits a quantizer into a transformer (mapping elements to [-1,1] or [0,1]) and a discretizer
      • My understanding: the quantization step qD is itself treated as a parameter (parameterize it)?
      • The illustration only shows the positive half-axis; the negative half is symmetric
      • Do cx and dx also have to be learned? (The authors do not mention how these two are obtained)
      • The authors stress parameterizing the step as well, and learning these parameters directly from the final loss, instead of matching the full-precision copy under an L2 norm (similar in spirit to And The Bit Goes Down); a sketch follows below
        • Although the transformation is quite convoluted, the authors argue that at inference the forward pass uses the fixed parameters, so it does not matter
        • The update also relies on STE, but given how nonlinear this is, can STE still work?
      • The activation quantizer is different: its exponent parameter gamma is fixed to 1
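A PyTorch sketch of the idea of a learnable quantization interval trained through STE (my simplification: gamma fixed to 1, and "center"/"distance" stand in for the paper's cx/dx parameterization):

```python
import torch

class RoundSTE(torch.autograd.Function):
    """Round in the forward pass, pass the gradient straight through in backward."""
    @staticmethod
    def forward(ctx, x):
        return torch.round(x)
    @staticmethod
    def backward(ctx, grad_out):
        return grad_out

def interval_quantizer(w, center, distance, bits=2):
    """Transformer: map |w| inside the learnable interval
    [center - distance, center + distance] to [0, 1], restore the sign,
    then discretize with a straight-through round."""
    levels = 2 ** (bits - 1) - 1
    t = torch.clamp((w.abs() - (center - distance)) / (2 * distance), 0.0, 1.0)
    t = t * torch.sign(w)
    return RoundSTE.apply(t * levels) / levels

w = torch.randn(16, requires_grad=True)
center = torch.tensor(0.5, requires_grad=True)
distance = torch.tensor(0.4, requires_grad=True)
q = interval_quantizer(w, center, distance)
q.sum().backward()                       # gradients reach the interval parameters
print(center.grad, distance.grad)
```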

  • [Mixed Precision Training of CNN using Integer]
    • DFP (Dynamic Fixed Point)

  • Post Training 4 Bit Quantization of Conv Network For Rapid Deployment

    • Intel (AIPG)
    • No need for fine-tuning / the whole dataset
      • Privacy & off-the-shelf (avoids retraining) - could already achieve 8-bit this way
    • 3 Methods
        1. Analytical Clipping for Integer Quantization (ACIQ)
        • Analytical threshold for the clipping value
        • Assume the quantization noise is a function of a Gaussian or Laplacian distribution
        • For the conventional case of uniform quantization between min/max, rounding to the bin midpoint
        • The MSE then decomposes into
        • Quantization noise (inside the clipping range)
        • Clipping noise (outside the range)
        • Optimal clipping value
          • We could just use this
        2. Per-Channel Bit Allocation
        • Minimize the overall MSE
        • Regular per-channel quantization gives each channel its own scale and offset; this paper additionally assigns different bitwidths to different channels
        • Only requires that the average over channels is still 4 bits (a bit of a cheat)
        • Cast as an optimization problem: B bits in total to allocate across N channels so that the final MSE is minimized (see the sketch below)
        • Lagrange multipliers
          • yield the final optimal allocation, a B_i for each channel
        • Assumes/derives that the optimal quantization step size scales as (range)^(2/3)
        3. Bias correction: after quantization there is a bias in the mean/variance
    • Related Works
      • ACIQ is an earlier method by the same authors (proposed post-training activation clipping)
        • Even earlier, people used KL divergence to find the clipping value (needs the full-precision model to be run, which takes time)
        • In essence these all handle statistical outliers
          • Some work splits channels
            • Details unclear; noting it here to check later if there is time
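A sketch of the per-channel bit allocation, following the note's claim that the optimal step scales as range^(2/3) and solving for the proportionality constant so the average bitwidth meets the budget (my reading, not the paper's exact formula):

```python
import numpy as np

def allocate_bits(ranges, avg_bits=4.0):
    """Set each channel's quantization step proportional to range**(2/3),
    then pick the constant so the average bitwidth over channels hits the budget.
    Continuous bits: B_i = log2(2*R_i / step_i) with step_i = c * R_i**(2/3)
                         = 1 + log2(R_i)/3 - log2(c)."""
    ranges = np.asarray(ranges, dtype=float)
    log2_c = 1 + np.log2(ranges).mean() / 3 - avg_bits   # solve log2(c) from the budget
    bits = 1 + np.log2(ranges) / 3 - log2_c
    return np.clip(np.round(bits), 1, 8).astype(int)

ranges = np.array([0.1, 0.2, 0.5, 1.0, 4.0])   # per-channel dynamic ranges
bits = allocate_bits(ranges, avg_bits=4.0)
print(bits, bits.mean())                        # wider channels get more bits, average ~4
```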
  • Accurate & Efficient 2 bit QNN

    • PACT + SAWB
      • SAWB (Statistics-Aware Weight Binning): aims to make effective use of the distribution's statistics (really just the first- and second-order moments)
      • The objective is still minimizing the quantization error (the L2 norm between the new weights and the original weights)
      • Argues that the earlier practice of deriving the scale from the mean of all parameters only holds when the distribution is Normal
      • The coefficients C1, C2 are obtained by linear fitting, per bitwidth (see the sketch after this list)
        • Several fairly common distributions are used (Gaussian, Uniform, Laplace, Logistic, Triangle, von Mises)
        • (My understanding: this mostly shows that the first and second moments are enough for the fit.) The points in the figure are the optimal scale for each distribution
          • The experiment above covers a single quantization interval; the authors' later experiments show the same holds for multiple intervals
          • So the biggest contribution is really: for the actual distribution, use its statistics to find the optimal scaling
    • The paper analyzes that PACT has the same expressive power as ReLU
    • Also gives a system-design insight
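A sketch of the SAWB-style scale from the first and second moments, followed by plain symmetric 2-bit quantization; the c1/c2 defaults below are placeholders, not the paper's fitted coefficients:

```python
import numpy as np

def sawb_scale(w, c1=3.2, c2=-2.1):
    """Clipping scale from the first and second moments:
    alpha = c1 * sqrt(E[w^2]) + c2 * E[|w|].
    c1/c2 are fit offline per bitwidth against synthetic distributions;
    the defaults here are illustrative placeholders."""
    return c1 * np.sqrt(np.mean(w ** 2)) + c2 * np.mean(np.abs(w))

def quantize_2bit(w, alpha):
    """Symmetric 2-bit uniform quantization to the 4 levels {-a, -a/3, a/3, a}."""
    step = 2 * alpha / 3
    q = np.clip(w, -alpha, alpha)
    return np.round((q + alpha) / step) * step - alpha

w = np.random.randn(4096) * 0.05
alpha = sawb_scale(w)
print(alpha, np.mean((w - quantize_2bit(w, alpha)) ** 2))
```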

  • TTQ
    • Han
    • DoReFa (layer-wise scaling factor: mean of the L1 norm)
    • TWN (Ternary Weight Network) (viewed as an optimization problem: minimize the L2 norm between the ternary and full-precision weights; see the sketch below)
      • t is a hyperparameter, kept the same across all layers
    • The TWN procedure is folded into quantization-aware training
    • Scaling factor
      • DoReFa directly takes the mean of the L1 norm
      • TWN minimizes the L2 norm against the fp32 weights (XNOR does too)
      • In TTQ the scaling factors are trained (so they do not come from the overall weight distribution; they are independent parameters)
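A NumPy sketch of TWN-style ternarization with the global threshold hyperparameter t (the commonly cited value 0.7 is used here as an assumption):

```python
import numpy as np

def twn_ternarize(w, t=0.7):
    """Threshold delta = t * E[|w|]; weights become {-alpha, 0, +alpha} with
    scale alpha = mean |w| over the weights above the threshold."""
    delta = t * np.mean(np.abs(w))
    mask = np.abs(w) > delta
    alpha = np.abs(w[mask]).mean() if mask.any() else 0.0
    return alpha * np.sign(w) * mask

w = np.random.randn(3, 3, 64, 64) * 0.02
wt = twn_ternarize(w)
print(np.unique(np.round(wt, 6)).size)   # 3 distinct values: {-alpha, 0, +alpha}
```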

  • TernGrad
    • Mathematically proves convergence (under the assumption of bounded gradients)
    • Trained from scratch
    • Layer-wise ternarization plus gradient clipping (see the sketch below)
      • where b_t is a random binary vector
      • Stochastic rounding
    • Scale sharing
    • Why does it work for them and not for us? (Because their forward pass is full precision?)
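A NumPy sketch of the TernGrad ternarization (gradient clipping omitted): the shared scale is max|g| and the random binary vector b makes the quantization unbiased in expectation:

```python
import numpy as np

def terngrad(g):
    """Ternarize a gradient to s * sign(g) * b, where s = max(|g|) is the shared
    scale and b is a random binary vector with P(b_i = 1) = |g_i| / s."""
    s = np.abs(g).max()
    if s == 0:
        return g
    b = np.random.rand(*g.shape) < (np.abs(g) / s)   # stochastic rounding
    return s * np.sign(g) * b

g = np.random.randn(1000) * 1e-3
print(g.mean(), np.mean([terngrad(g).mean() for _ in range(200)]))  # means roughly agree
```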

Binary Related

  • Extensions of some early binarized networks
    • The XNOR-Net paper proposed both Binary-Weight-Networks and XNOR-Net
      • Vector multiplication becomes a bitcount over binary vectors
      • In backprop the gradients can also be made ternary, but the scaling factor then has to be taken as the max rather than the mean
    • Note that only XNOR-Net binarizes both the inputs and the weights, which is what enables the bitcount operations! The later TWN and TTQ never mention this, and they only ternarize the weights! (A small sketch follows this list)
    • Then comes TernaryNet (TWN)
      • from binary to ternary
    • ABCNet (Accurate Binary Conv)
      • Uses a linear combination of multiple binary bases to represent the full-precision weights
    • DoReFa
      • Can also accelerate the backward gradients
      • Difference from XNOR: XNOR takes the scale factor per channel, while DoReFa takes it per layer
        • The DoReFa authors say XNOR's scheme cannot accelerate backprop, but the XNOR authors state in their paper that they binarize the gradients too
    • TTQ (train everything)
      • Backprop
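A small sketch of the two ingredients above: per-output-channel weight binarization with a mean-|W| scale, and the XNOR + bitcount identity for binary dot products (my own illustration):

```python
import numpy as np

def binarize_weights(W):
    """BWN/XNOR-style weight binarization: per output channel,
    B = sign(W) and alpha = mean(|W|), so W is approximated by alpha * B."""
    alpha = np.abs(W).reshape(W.shape[0], -1).mean(axis=1).reshape(-1, 1, 1, 1)
    return alpha * np.sign(W)

def binary_dot_bitcount(a_bits, w_bits):
    """dot(sign_a, sign_w) = 2 * popcount(XNOR(a_bits, w_bits)) - n,
    for bit vectors encoding +1 as 1 and -1 as 0."""
    xnor = ~(a_bits ^ w_bits) & 1
    return 2 * int(xnor.sum()) - a_bits.size

W = np.random.randn(8, 3, 3, 3)
print(np.abs(W - binarize_weights(W)).mean())      # approximation error

a = np.where(np.random.randn(64) > 0, 1, -1)
w = np.where(np.random.randn(64) > 0, 1, -1)
print(int(a @ w), binary_dot_bitcount((a > 0).astype(int), (w > 0).astype(int)))  # should match
```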

  • Multi-Precision Quantized Neural Networks via Encoding Decomposition of -1 and +1

  • 🔑 Key:

    • Decompose a multi-precision NN into multiple binary NNs, for more efficient deployment
  • 🎓 Source:

  • 🌱 Motivation:

  • 💊 Methodology:

  • 📐 Exps:

  • 💡 Ideas:

  • Contributions

    • Decompose the NN into multiple BNNs

    • M-bit Encoding Function (see the sketch at the end of these notes)

    • Support Mixed Precisions

      • Advantages

      • Many tasks, generality
    • Questions

      • Typo in Table3 "Encoded Activation and Weights"
      • Periodical
      • Does the reported speed-up account for the encoding/decoding and the scale multiplications (though these may not cost much)?
      • Hardware cost of the decomposition method
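A sketch of the decomposition idea: an M-bit unsigned tensor split into M {-1,+1} bit planes, so that a multi-bit product reduces to weighted binary (XNOR/bitcount-friendly) products; this is my illustration of the idea, not the paper's exact encoding function:

```python
import numpy as np

def encode_pm1(x_int, m):
    """Write x in {0, ..., 2^m - 1} as a base-2 expansion and map each
    bit plane b in {0,1} to 2b - 1 in {-1,+1}."""
    planes = [((x_int >> i) & 1) * 2 - 1 for i in range(m)]
    return np.stack(planes)                      # shape: (m, *x_int.shape)

def decode_pm1(planes):
    """Inverse: x = sum_i 2^i * (plane_i + 1) / 2."""
    m = planes.shape[0]
    weights = (2 ** np.arange(m)).reshape(m, *([1] * (planes.ndim - 1)))
    return (weights * (planes + 1) // 2).sum(axis=0)

x = np.random.randint(0, 16, size=(2, 3))        # 4-bit unsigned values
planes = encode_pm1(x, m=4)
assert np.array_equal(decode_pm1(planes), x)     # decomposition is lossless
print(planes.shape)
```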