2024-06-14

Metadata
aliases: []

Notes about MatMul-free neural networks

These notes are about this paper: link.

Introduction

The paper combines two previously separate research directions:

Low-precision quantization for language models

MatMul-free transformers

Method

This section describes the method proposed by the authors.

MatMul-free Dense Layers with Ternary Weights

Here we get a comprehensive description of MatMul-free dense layers: the standard dense layer $y = Wx$ is replaced by one whose weights are ternary, $W_{ij} \in \{-1, 0, +1\}$, so every multiplication in the matrix product reduces to an addition, a subtraction, or a skip.

The rest of the details of the architecture are omitted from these notes.
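The ternary dense layer can be sketched in plain Python (my own illustration, not the authors' code): with weights restricted to $\{-1, 0, +1\}$, each output element is just a signed sum of inputs, with no multiplications.

```python
# Sketch of a MatMul-free dense layer: W holds only {-1, 0, +1},
# so the matrix product needs additions and subtractions only.
def ternary_dense(x, W):
    """x: list of n inputs; W: m x n matrix with entries in {-1, 0, +1}."""
    out = []
    for row in W:
        acc = 0.0
        for w, xi in zip(row, x):
            if w == 1:       # add instead of multiplying by +1
                acc += xi
            elif w == -1:    # subtract instead of multiplying by -1
                acc -= xi
            # w == 0: skip the element entirely
        out.append(acc)
    return out

y = ternary_dense([2.0, -1.0, 3.0], [[1, 0, -1], [-1, 1, 1]])
# y == [-1.0, 0.0]
```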

Quantization for MatMul-free dense layers

In this part, the ternary quantization technique is described.

(Throughout, $\epsilon$ is a small number that prevents overflow during clipping.)

Weight quantization

First, weights are quantized to $\{-1, 0, +1\}$ using absmean quantization, which:

  1. scales the weight matrix by its average absolute value
  2. then rounds each element to the closest number in $\{-1, 0, +1\}$

$\widetilde{W} = \mathrm{RoundClip}\left(\frac{W}{\gamma + \epsilon},\, -1,\, 1\right), \quad \mathrm{RoundClip}(x, a, b) = \max(a, \min(b, \mathrm{round}(x))), \quad \gamma = \frac{1}{nm}\sum_{ij} |W_{ij}|$

where $\gamma$ is the absmean of the matrix elements, and $\mathrm{RoundClip}$ first rounds $x$, then clips its value between $a$ and $b$.
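The absmean procedure above can be sketched as follows (`round_clip` and `quantize_weights` are my own helper names, and `eps` is the small clipping constant):

```python
# Sketch of absmean ternary quantization: scale by the mean absolute
# value of the matrix, then round each element into {-1, 0, +1}.
def round_clip(x, a, b):
    return max(a, min(b, round(x)))

def quantize_weights(W, eps=1e-6):
    """W: list of rows. Returns a ternary matrix with entries in {-1, 0, +1}."""
    n = sum(len(row) for row in W)
    gamma = sum(abs(w) for row in W for w in row) / n  # absmean of elements
    return [[round_clip(w / (gamma + eps), -1, 1) for w in row] for row in W]

W_t = quantize_weights([[0.9, -0.05], [-1.2, 0.4]])
# W_t == [[1, 0], [-1, 1]]
```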

Activation quantization

Activations are quantized using absmax quantization:

$\widetilde{x} = \mathrm{Clip}\left(\frac{x \cdot Q_b}{\gamma},\, -Q_b + \epsilon,\, Q_b - \epsilon\right), \quad \mathrm{Clip}(x, a, b) = \max(a, \min(b, x)), \quad \gamma = \|x\|_\infty$

where $\mathrm{Clip}$ clips $x$ between $a$ and $b$, and $\gamma$ is the maximum absolute value of the elements of $x$ (the infinity norm).

To maintain numerical stability, $x$ is root-mean-square (RMS) normed before the MatMul operation:

$\mathrm{RMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{n}\sum_i x_i^2 + \epsilon}}$

and the output of the quantized MatMul is rescaled back:

$Y = \frac{\widetilde{x}\,\widetilde{W}\,\beta\gamma}{Q_b}, \quad \beta = \frac{1}{nm}\sum_{ij}|W_{ij}|$

where $\beta$ is the mean absolute value of the weight matrix elements.
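These stability pieces can be sketched together, assuming a BitNet-style output rescaling by $\beta\gamma / Q_b$; the helper names are my own, not the paper's:

```python
import math

# Sketch: RMS-norm the input before quantization, and compute the
# factor beta * gamma / Q_b used to rescale the quantized MatMul output.
def rms_norm(x, eps=1e-6):
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def dequant_scale(W, gamma, qb=127):
    """W: full-precision weights; gamma: infinity norm of the activations."""
    n = sum(len(row) for row in W)
    beta = sum(abs(w) for row in W for w in row) / n  # mean |W_ij|
    return beta * gamma / qb
```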