Metadata
aliases: []
Notes about MatMul-free neural networks
These notes are about this paper: link.
Introduction
- MatMul is prevalent in neural networks
- dense layers use vector-matrix multiplication (VMM)
- convolutional neural nets use block-sparse VMM with shared parameters
- self-attention requires matrix-matrix multiplication
- mostly implemented in CUDA
- since MatMul consumes most of the computational cost, it might be worth substituting it with simpler elementary operations
- one approach is to quantize the weights or activations (or both) to binary or ternary values
- previous attempts did this for dense layers but still had to keep some MatMul in the self-attention mechanism
- attempts that also removed MatMul from attention failed to converge
Two separate directions:
Low-precision quantization for language models
- one method incrementally quantizes weights: 32-bit → 4-bit → 2-bit → binary
- another method introduced quantization-aware training to train with 2-bit weights
- the other direction uses spiking neural networks (SNNs)
- some applications in vision transformers
- applied in sentiment analysis
- these approaches have had issues scaling to language models
Method
This part describes the method that the authors proposed.
- BitLinear layers (MatMul-free dense layers): ternary weights constrained to the set $\{-1, 0, +1\}$
- additional quantization techniques
- MatMul replaced with addition and negation
- MatMul-free token mixing and channel mixing
MatMul-free Dense Layers with Ternary Weights
Here we get a comprehensive description of MatMul-free dense layers:
- a standard MatMul-based dense layer with input $\mathbf{x} \in \mathbb{R}^{1 \times d}$ and weight matrix $\mathbf{W} \in \mathbb{R}^{d \times m}$, where the output is $\mathbf{y} \in \mathbb{R}^{1 \times m}$:
  $$\mathbf{y} = \mathbf{x}\mathbf{W}, \qquad y_j = \sum_{i=1}^{d} x_i W_{ij}$$
- this is avoided by adopting BitNet: dense layers become BitLinear modules
- the MatMul then reduces to additions and subtractions with accumulation (i.e. ternary accumulation)
- elements of the weight matrix are constrained to $\{-1, 0, +1\}$
- $\widetilde{\mathbf{W}} \in \{-1, 0, +1\}^{d \times m}$ is the ternary weight matrix
- MatMul with ternary weights becomes the following, where $\circledast$ denotes the ternary MatMul:
  $$\widetilde{\mathbf{y}} = \mathbf{x} \circledast \widetilde{\mathbf{W}}$$
- this can be simplified to accumulation, since multiplication by a value in $\{-1, 0, +1\}$ is just a signed addition (or a skip): $x_i \widetilde{W}_{ij} \in \{-x_i, 0, +x_i\}$
- so the ternary MatMul can just be written as:
  $$\widetilde{y}_j = \sum_{i:\,\widetilde{W}_{ij} = 1} x_i \;-\; \sum_{i:\,\widetilde{W}_{ij} = -1} x_i$$
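To make the ternary accumulation concrete, here is a minimal NumPy sketch (my own illustration, not the authors' kernel); the names `ternary_matmul`, `x` and `W_ternary` are invented for this example:

```python
import numpy as np

def ternary_matmul(x, W_ternary):
    """Vector-'matrix' product where W_ternary only holds values in {-1, 0, +1}.

    Instead of multiplying, each output element adds the inputs where the
    weight is +1 and subtracts them where it is -1 (zeros are skipped).
    """
    d, m = W_ternary.shape
    y = np.zeros(m, dtype=x.dtype)
    for j in range(m):
        y[j] = x[W_ternary[:, j] == 1].sum() - x[W_ternary[:, j] == -1].sum()
    return y

# Tiny check against an ordinary MatMul
x = np.array([0.5, -1.0, 2.0, 0.25])
W_ternary = np.array([[ 1,  0],
                      [-1,  1],
                      [ 0,  1],
                      [ 1, -1]])
assert np.allclose(ternary_matmul(x, W_ternary), x @ W_ternary)
```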
The rest of the details of the architecture are omitted from these notes.
Quantization for MatMul-free dense layers
In this part, the ternary quantization technique is described.
($\epsilon$ will be a small number to prevent overflow during clipping.)
Weight quantization
First, weights are quantized to $\{-1, 0, +1\}$ using absmean quantization:
- scale the weight matrix by its average absolute value
- then round each element to the closest value in $\{-1, 0, +1\}$
$$\widetilde{\mathbf{W}} = \mathrm{RoundClip}\!\left(\frac{\mathbf{W}}{\gamma + \epsilon},\, -1,\, 1\right), \qquad \gamma = \frac{1}{dm}\sum_{ij} |W_{ij}|$$
where $\gamma$ is the absmean of the matrix elements, and $\mathrm{RoundClip}(x, a, b) = \max\big(a, \min(b, \mathrm{round}(x))\big)$ first rounds $x$, then clips its value between $a$ and $b$.
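A minimal NumPy sketch of this absmean quantization step (my own illustration; `round_clip` and `absmean_quantize_weights` are invented names):

```python
import numpy as np

def round_clip(x, a, b):
    """round(x), then clip the result into [a, b]."""
    return np.clip(np.round(x), a, b)

def absmean_quantize_weights(W, eps=1e-6):
    """Ternarize W: scale by its mean absolute value, then round-clip to {-1, 0, +1}."""
    gamma = np.mean(np.abs(W))                      # absmean of the matrix elements
    W_ternary = round_clip(W / (gamma + eps), -1, 1)
    return W_ternary, gamma

W = np.random.randn(4, 3)
W_ternary, gamma = absmean_quantize_weights(W)
print(np.unique(W_ternary))                         # subset of {-1, 0, 1}
```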
Activation quantization
- activations are quantized to 8-bit precision (as with BitNet)
- absmax quantization is used
- it scales activations into the range $[-Q_b, Q_b]$, where $b$ is the number of bits and $Q_b = 2^{b-1}$:
$$\widetilde{\mathbf{x}} = \mathrm{Clip}\!\left(\mathbf{x} \times \frac{Q_b}{\|\mathbf{x}\|_\infty},\, -Q_b + \epsilon,\, Q_b - \epsilon\right)$$
where $\mathrm{Clip}(x, a, b) = \max\big(a, \min(b, x)\big)$ clips $x$ between $a$ and $b$, and $\|\mathbf{x}\|_\infty$ is the maximum absolute value of the elements of $\mathbf{x}$ (infinity norm).
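Similarly, a hedged NumPy sketch of absmax activation quantization (my own illustration; the guard against an all-zero input is an assumption, not from the paper):

```python
import numpy as np

def absmax_quantize_activations(x, bits=8, eps=1e-6):
    """Scale x into [-Q_b, Q_b] by its largest absolute value, then clip.

    Q_b = 2**(bits - 1); eps keeps the result strictly inside the range.
    """
    Qb = 2 ** (bits - 1)
    scale = Qb / max(np.max(np.abs(x)), eps)        # guard against all-zero input
    x_q = np.clip(x * scale, -Qb + eps, Qb - eps)
    return x_q, scale

x = np.array([0.1, -0.7, 0.3, 1.2])
x_q, scale = absmax_quantize_activations(x)
print(x_q.min(), x_q.max())                         # stays within (-128, 128) for 8 bits
```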
To maintain numerical stability, $\mathbf{x}$ is RMS-normalized (RMSNorm) before quantization and the ternary MatMul operation:
$$\mathbf{y} = \widetilde{\mathbf{x}} \circledast \widetilde{\mathbf{W}} \times \frac{\beta\, \|\mathbf{x}\|_\infty}{Q_b}, \qquad \widetilde{\mathbf{x}} = \mathrm{Quant}\big(\mathrm{RMSNorm}(\mathbf{x})\big)$$
where $\beta$ is the mean absolute value of the weight matrix elements, and the trailing factor rescales (dequantizes) the output back towards its original range.
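Putting the pieces together, here is a rough sketch of a BitLinear-style forward pass in the spirit of the steps above (RMSNorm, activation quantization, ternary accumulation, rescaling). It is my own illustration, not the paper's fused implementation, and the final rescaling follows the BitNet-style dequantization assumed here:

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    """Divide x by the root mean square of its elements."""
    return x / np.sqrt(np.mean(x ** 2) + eps)

def bitlinear_forward(x, W, bits=8, eps=1e-6):
    """Rough sketch of a MatMul-free (BitLinear-style) dense layer: y ≈ x @ W."""
    Qb = 2 ** (bits - 1)
    # Ternarize the weights with absmean quantization (beta = weight absmean).
    beta = np.mean(np.abs(W))
    W_t = np.clip(np.round(W / (beta + eps)), -1, 1)
    # RMS-normalize, then absmax-quantize the activations into [-Qb, Qb].
    x_n = rms_norm(x)
    x_scale = np.max(np.abs(x_n)) + eps
    x_q = np.clip(x_n * Qb / x_scale, -Qb + eps, Qb - eps)
    # Ternary accumulation: add inputs where the weight is +1, subtract where it is -1.
    y = (np.where(W_t == 1, x_q[:, None], 0.0).sum(axis=0)
         - np.where(W_t == -1, x_q[:, None], 0.0).sum(axis=0))
    # Rescale (dequantize) the output back towards the original range
    # (assumed BitNet-style scaling, not taken verbatim from the paper).
    return y * beta * x_scale / Qb

x = np.random.randn(16)
W = np.random.randn(16, 8)
print(bitlinear_forward(x, W).shape)   # (8,)
```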