Metadata
aliases: []
Notes about MatMul-free neural networks
These notes are about this paper: link.
Introduction
- MatMul is prevalent in neural networks
- dense layers use vector-matrix multiplication (VMM)
- convolutional neural nets use block-sparse VMM with shared parameters
- self-attention requires matrix-matrix multiplication
- mostly implemented in CUDA
- since MatMul consumes most of the computational cost, it might be worth substituting it with simpler elementary operations
- one approach is to quantize the weights or activations (or both) to binary or ternary values
- previous attempts did this for dense layers but still had to keep some MatMul in the self-attention mechanism
- attempts that also removed MatMul from attention failed to converge
Two separate directions:
Low-precision quantization for language models
- one method incrementally quantizes weights: 32-bit → 4-bit → 2-bit → binary
- another method introduced quantization-aware training to train with 2-bit weights
- the other direction uses spiking neural networks (SNNs)
- some applications in vision transformers
- applied in sentiment analysis
- these approaches have had issues scaling to language models
Method
This part describes the method that the authors proposed.
- BitLinear layers (MatMul-free dense layers): ternary weights constrained to the set $\{-1, 0, +1\}$
- additional quantization techniques
- MatMul replaced with addition and negation
- MatMul-free token mixing and channel mixing
MatMul-free Dense Layers with Ternary Weights
Here we get a comprehensive description of MatMul-free dense layers:
- a standard MatMul-based dense layer with input $\mathbf{x} \in \mathbb{R}^{1 \times d}$ and weight matrix $\mathbf{W} \in \mathbb{R}^{d \times m}$, where the output is $\mathbf{y} \in \mathbb{R}^{1 \times m}$:
  $$\mathbf{y} = \mathbf{x}\mathbf{W}, \qquad y_j = \sum_{i=1}^{d} x_i W_{ij}$$
- this is avoided by adopting BitNet: dense layers become BitLinear modules
- the MatMul then reduces to additions and subtractions with accumulation (i.e. ternary accumulation)
- elements of the weight matrix are constrained to $\{-1, 0, +1\}$
- $\widetilde{\mathbf{W}} \in \{-1, 0, +1\}^{d \times m}$ is the ternary weight matrix
- MatMul with ternary weights becomes the following, where $\circledast$ denotes the ternary MatMul:
  $$\widetilde{\mathbf{y}} = \mathbf{x} \circledast \widetilde{\mathbf{W}}$$
- this can be simplified to accumulation, since multiplication by a value in $\{-1, 0, +1\}$ is just a signed addition (or a skip): $x_i \widetilde{W}_{ij} \in \{-x_i, 0, +x_i\}$
- so the ternary MatMul can just be written as:
  $$\widetilde{y}_j = \sum_{i:\,\widetilde{W}_{ij} = 1} x_i \;-\; \sum_{i:\,\widetilde{W}_{ij} = -1} x_i$$
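To make the ternary accumulation concrete, here is a minimal NumPy sketch (my own illustration, not the authors' kernel); the names `ternary_matmul`, `x` and `W_ternary` are invented for this example:

```python
import numpy as np

def ternary_matmul(x, W_ternary):
    """Vector-'matrix' product where W_ternary only holds values in {-1, 0, +1}.

    Instead of multiplying, each output element adds the inputs where the
    weight is +1 and subtracts them where it is -1 (zeros are skipped).
    """
    d, m = W_ternary.shape
    y = np.zeros(m, dtype=x.dtype)
    for j in range(m):
        y[j] = x[W_ternary[:, j] == 1].sum() - x[W_ternary[:, j] == -1].sum()
    return y

# Tiny check against an ordinary MatMul
x = np.array([0.5, -1.0, 2.0, 0.25])
W_ternary = np.array([[ 1,  0],
                      [-1,  1],
                      [ 0,  1],
                      [ 1, -1]])
assert np.allclose(ternary_matmul(x, W_ternary), x @ W_ternary)
```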
The rest of the details of the architecture are omitted from these notes.
Quantization for MatMul-free dense layers
In this part, the ternary quantization technique is described.
($\epsilon$ will be a small number to prevent overflow during clipping.)
Weight quantization
First, weights are quantized to $\{-1, 0, +1\}$ using absmean quantization:
- scale the weight matrix by its average absolute value
- then round each element to the closest value in $\{-1, 0, +1\}$
$$\widetilde{\mathbf{W}} = \mathrm{RoundClip}\!\left(\frac{\mathbf{W}}{\gamma + \epsilon},\, -1,\, 1\right), \qquad \gamma = \frac{1}{dm}\sum_{ij} |W_{ij}|$$
where $\gamma$ is the absmean of the matrix elements, and $\mathrm{RoundClip}(x, a, b) = \max\big(a, \min(b, \mathrm{round}(x))\big)$ first rounds $x$, then clips its value between $a$ and $b$.
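A minimal NumPy sketch of this absmean quantization step (my own illustration; `round_clip` and `absmean_quantize_weights` are invented names):

```python
import numpy as np

def round_clip(x, a, b):
    """round(x), then clip the result into [a, b]."""
    return np.clip(np.round(x), a, b)

def absmean_quantize_weights(W, eps=1e-6):
    """Ternarize W: scale by its mean absolute value, then round-clip to {-1, 0, +1}."""
    gamma = np.mean(np.abs(W))                      # absmean of the matrix elements
    W_ternary = round_clip(W / (gamma + eps), -1, 1)
    return W_ternary, gamma

W = np.random.randn(4, 3)
W_ternary, gamma = absmean_quantize_weights(W)
print(np.unique(W_ternary))                         # subset of {-1, 0, 1}
```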
Activation quantization
- activations are quantized to 8-bit precision (as with BitNet)
- absmax quantization is used
- it scales activations into the range $[-Q_b, Q_b]$, where $b$ is the number of bits and $Q_b = 2^{b-1}$:
$$\widetilde{\mathbf{x}} = \mathrm{Clip}\!\left(\mathbf{x} \times \frac{Q_b}{\|\mathbf{x}\|_\infty},\, -Q_b + \epsilon,\, Q_b - \epsilon\right)$$
where $\mathrm{Clip}(x, a, b) = \max\big(a, \min(b, x)\big)$ clips $x$ between $a$ and $b$, and $\|\mathbf{x}\|_\infty$ is the maximum absolute value of the elements of $\mathbf{x}$ (infinity norm).
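Similarly, a hedged NumPy sketch of absmax activation quantization (my own illustration; the guard against an all-zero input is an assumption, not from the paper):

```python
import numpy as np

def absmax_quantize_activations(x, bits=8, eps=1e-6):
    """Scale x into [-Q_b, Q_b] by its largest absolute value, then clip.

    Q_b = 2**(bits - 1); eps keeps the result strictly inside the range.
    """
    Qb = 2 ** (bits - 1)
    scale = Qb / max(np.max(np.abs(x)), eps)        # guard against all-zero input
    x_q = np.clip(x * scale, -Qb + eps, Qb - eps)
    return x_q, scale

x = np.array([0.1, -0.7, 0.3, 1.2])
x_q, scale = absmax_quantize_activations(x)
print(x_q.min(), x_q.max())                         # stays within (-128, 128) for 8 bits
```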
To maintain numerical stability, $\mathbf{x}$ is RMS-normalized (RMSNorm) before quantization and the ternary MatMul operation:
$$\mathbf{y} = \widetilde{\mathbf{x}} \circledast \widetilde{\mathbf{W}} \times \frac{\beta\, \|\mathbf{x}\|_\infty}{Q_b}, \qquad \widetilde{\mathbf{x}} = \mathrm{Quant}\big(\mathrm{RMSNorm}(\mathbf{x})\big)$$
where $\beta$ is the mean absolute value of the weight matrix elements, and the trailing factor rescales (dequantizes) the output back towards its original range.
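Putting the pieces together, here is a rough sketch of a BitLinear-style forward pass in the spirit of the steps above (RMSNorm, activation quantization, ternary accumulation, rescaling). It is my own illustration, not the paper's fused implementation, and the final rescaling follows the BitNet-style dequantization assumed here:

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    """Divide x by the root mean square of its elements."""
    return x / np.sqrt(np.mean(x ** 2) + eps)

def bitlinear_forward(x, W, bits=8, eps=1e-6):
    """Rough sketch of a MatMul-free (BitLinear-style) dense layer: y ≈ x @ W."""
    Qb = 2 ** (bits - 1)
    # Ternarize the weights with absmean quantization (beta = weight absmean).
    beta = np.mean(np.abs(W))
    W_t = np.clip(np.round(W / (beta + eps)), -1, 1)
    # RMS-normalize, then absmax-quantize the activations into [-Qb, Qb].
    x_n = rms_norm(x)
    x_scale = np.max(np.abs(x_n)) + eps
    x_q = np.clip(x_n * Qb / x_scale, -Qb + eps, Qb - eps)
    # Ternary accumulation: add inputs where the weight is +1, subtract where it is -1.
    y = (np.where(W_t == 1, x_q[:, None], 0.0).sum(axis=0)
         - np.where(W_t == -1, x_q[:, None], 0.0).sum(axis=0))
    # Rescale (dequantize) the output back towards the original range
    # (assumed BitNet-style scaling, not taken verbatim from the paper).
    return y * beta * x_scale / Qb

x = np.random.randn(16)
W = np.random.randn(16, 8)
print(bitlinear_forward(x, W).shape)   # (8,)
```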