# Contents

Quantization is a process that maps a continous or discrete set of values into approximations that belong to a smaller set. Quantization is a lossy: some information about the original data is lost in the process. The key to a successful quantization is therefore the selection of an error criterion – such as entropy and signal-to-noise ratio – and the development of optimal quantizers for this criterion.

# Principles of Scalar Quantization quantization of a time-varying value by a 4-bit midtread uniform quantization on $$[-1.0, 1.0]$$

## Quantizers

A scalar quantizer $$[ \, \cdot \, ]$$ is an idempotent mapping from $$\mathbb{R}$$ to a countable subset of $$\mathbb{R}$$:

$| \{[x]\, , \, x \in \mathbb{R}\} | \leq |\mathbb{N}| \; \mbox{ and } \; \forall \, x \in \mathbb{R}, \; [[x]] = [x]$

This definition should be taken with a grain of salt as variants of the real line are often used, including the extended real line $$\mathbb{R} \cup \{-\infty, +\infty\}$$, the real line with signed zeros $$\mathbb{R}\cup\{0^-, 0^+\}$$, the real line plus the undefined symbol $$\bot$$, or a combination thereof.

The countability assumption is what makes the quantizer useful as an attempt to approximate a continous value by a discrete set that can be encoded as an integer. A quantizer is meant to be split into a forward and inverse quantizer: the forward quantizer builds from $$x$$ an integer code that refers to $$[x]$$ without ambiguity and the inverse quantizer builds the approximation $$[x]$$ back from the code.

Formally, a forward quantizer for $$[\cdot]$$ is a mapping $$i[\, \cdot \,]: \mathbb{R} \to \mathbb{Z}$$ such that $$[x] = [y]$$ implies $$i[x]=i[y]$$. Because of this property, $$i[\, \cdot \,]$$ may be factored into $$i[\, \cdot \,] = i \circ [\, \cdot \,]$$ where $i : \mathrm{rng} \, [\, \cdot\, ] \to \mathbb{Z}.$ The notation for the forward quantizer is therefore consistent with the use as $$f[x]$$ as a shortcut for $$f([x])$$. The associated inverse quantizer, denoted $$i^{-1}$$, is a left inverse of $$i$$: a mapping whose domain is a subset of $$\mathbb{Z}$$ that contains $$\mathrm{rng} i$$ and such that $\forall \, x \in \mathbb{R}, \; (i^{-1} \circ i) [x] = [x]$ The first step of this quantizer composition partitions the real line into the family of sets $$(I_n)_n$$ with $I_n = \{x\in\mathbb{R}, \; i[x] = n\}, \; n \in \mathrm{rng} i$ The second step associates to any set into this partition a unique representative element. In every practical case we will encounter, the sets $$I_n$$ are – possibly unbounded – intervals, either open, half-open or closed. In this context, we associate to $$x$$ the decision values $$[x]^-$$ and $$[x]^+$$ to be $[x]^- = \inf \, \{y \in \mathbb{R}, \; [x] = [y]\} \; \mbox{ and } \; [x]^+ = \sup \, \{y \in \mathbb{R}, \; [x] = [y]\}$ and the step of the quantization at point $$x$$ is $\Delta(x) = [x]^+ - [x]^-$

### Example – Integer Rounding

The floor function $$\lfloor \, \cdot \, \rfloor$$ is a scalar quantizer that maps a real number to the largest previous integer: $\forall \, x \in \mathbb{R}, \; \lfloor x \rfloor \in \mathbb{Z} \; \mbox{ and } \; \lfloor x \rfloor \leq x < \lfloor x \rfloor + 1$ A natural forward quantizer for $$\lfloor \, \cdot \, \rfloor$$ is … itself ! The identity $$n \mapsto n$$ is the corresponding inverse mapping. This quantizer partitions the real-line into the half-open intervals $$I_n = [n, n+1)$$ for any $$i \in \mathbb{Z}$$.

The floor function of NumPy is an finite-precision implementation of this function. Its argument and return value are (arrays of) 64-bits floating-point numbers.

To obtain a (forward) quantizer with a finite range indexable on $$32$$ bits, we may modify the initial quantizer specification so that the data outside of the range $$[-2^{31}, 2^{31} - 1]$$ – the range of 32-bit signed integers – is clipped: $\lfloor x \rfloor_{32} = \left| \begin{array}{rl} -2^{31} & \mbox{if } \, x \leq -2^{31} \\ 2^{31} - 1 & \mbox{if } \, x \geq 2^{31} - 1 \\ \lfloor x \rfloor & \mbox{otherwise.} \end{array} \right.$ Given those modifications, a suitable finite implementation of the forward and inverse quantizers is the following code/decode pair:

from numpy import *

def encode(x):
n = floor(x)
n = clip(n, -2**31, 2**31 - 1)
return int32(n)

def decode(n):
return float64(n)

def quantize(x):
return decode(encode(x))

The step function $$\Delta$$ of this quantization is defined by: $\Delta (x) = \left| \begin{array}{rl} +\infty & \mbox{if } \; x < -2^{31} + 1 \\ 1 & \mbox{if } \; -2^{31} + 1 \leq x < 2^{31}-1 \\ +\infty & \mbox{if } \; 2^{31} - 1 \leq x \end{array} \right.$

Other rounding functions may serve as the basis for similar schemes: the ceiling function $$\lceil \cdot \rceil$$ (NumPy function ceil) defined by: $\forall \, x \in \mathbb{R}, \; \lceil x \rceil \in \mathbb{Z} \; \mbox{ and } \; \lceil x \rceil - 1 < x \leq \lceil x \rceil$ Instead of selecting the lower or upper integer approximation of $$x$$ we may also select the nearest: $\forall \, x \in \mathbb{R}, \; | x - [x] | = \min \, \{|x - n|, \, n \in \mathbb{Z}\}$ The value $$[x]$$ is not defined by this relation when $$x = n + 1/2$$, $$n$$ being an integer. The NumPy function round\_ rounds for example such real number to the nearest even integer.

This example suggests a general interface for quantizers. Such objects would provide an encode method for the forward quantization, a decode method for the inverse quantization and would be callable so that quantizer(x) would apply both steps to the data x. Such objects could inherit the following Quantizer base class:

class Quantizer(object):
"Quantizers Base Class."
def encode(self, data):
raise NotImplementedError("undefined forward quantizer")

def decode(self, data):
raise NotImplementedError("undefined inverse quantizer")

def __call__(self, data):
return self.decode(self.encode(data))

We can then rewrite the above integer approximation quantizer as:

class RoundingQuantizer(Quantizer):
def __init__(self, rounding=floor, integer_type=int32):
self.rounding = rounding
self.integer_type

def encode(self, x):
x = array(x)
n = self.rounding(x)
n = clip(n, -2**31, 2**31 - 1)
return n.astype(self.integer_type)

def decode(self, n):
n = array(n)
return n.astype(float64)

rounding_quantizer = RoundingQuantizer()

Note that this version of the quantizer is also vectorized: several values grouped in a NumPy array may be used as arguments to encode and decode. This is an implicit requirement that we expect all quantizer classes to follow for convenience.

## Uniform Quantization

A quantizer is uniform in an interval with lower bound $$a$$ and higher bound $$b$$ if its step function is constant in the interval. The size of the step is then directly connected to the width of the interval and the number $$N$$ of distinct values of $$[x]$$ by $\Delta(x) = \frac{b-a}{N}$

The final option that characterizes the quantizer is the choice of the base rounding function. A reference implementation is then given by:

class Uniform(Quantizer):
def __init__(self, low=0.0, high=1.0, N=2**8, rounding=round_):
self.low = float(low)
self.high = float(high)
self.N = N
self.delta = (high - low) / self.N
self.rounding = rounding

def encode(self, data):
low, high, delta = self.low, self.high, self.delta
data = clip(data, low + delta/2.0, high - delta/2)
flints = self.rounding((data - low) / delta - 0.5)
return array(flints, dtype=long)

def decode(self, i):
return self.low + (i + 0.5) * self.delta

Note that if the default value of N is selected – or more generally any even value – $$ \neq 0$$: the approximation error for $$0$$ is not zero. When this property may be an issues, odd values of N may be selected – for example $$2^8 - 1$$ so that $$0$$ is correctly approximated ; such a quantizer is called a midtread quantizer – opposed to the original midrise quantizer. 4-bit uniform encoder on $$(0,1)$$: forward quantizer 4-bit uniform decoder on $$(0,1)$$: inverse quantizer

### Quantization of Random Variables

Consider a random variable $$X$$ with values $$x \in \mathbb{R}$$ and a density of probability $$p(x)$$. For any $$[x]$$, we may consider the event $$[X] = [x]$$ with probability $P([X]=[x]) = \int_{\{y \in \mathbb{R}, \; [y] = [x]\}} p(y) \, dy = \int_{[x]^-}^{[x]^+} p(y) \, dy$ If the density $$p$$ is constant on every interval associated to the quantization, this equation may be simplified into: $P([X]=[x]) = p(x) \times \Delta(x)$ More generally, if the quantizer values $$[x]$$ are dense enough – we say that the high resolution assumption is satisfied – then this relation holds approximately.

The entropy attached to this collection of events is maximal when every event is equally likely, that is, under this approximation, when the step $$\Delta(x)$$ is proportional to the inverse of $$p(x)$$ $\Delta(x) \propto \frac{1}{p(x)}$

### Implementation of Non-Uniform Quantizers

Non-uniform quantizers may be – at least conceptually – simply generated from uniform quantizers and non-linear transformations. If $$[\, \cdot \,]$$ denotes a uniform quantizer and $$f$$ is an increasing mapping, the function $$[\, \cdot \,]_f$$ defined by the equation $[x]_f = (f^{-1} \circ [ \, \cdot \, ] \circ f) (x)$ and displayed in figure is a nonlinear quantizer. The function $$f$$ is called the characteristic function of the quantizer. Depending on the selected range for the uniform quantizer, it is determined up to an affine transformation. Nonlinear quantizer implementation

Note that if $$f$$ is linear or affine, that is $$f(x) = ax+b$$, the quantizer $$[\, \cdot \,]_f$$ is still uniform – that’s a reason why uniform quantizers are sometimes called linear quantizers.

Let $$\Delta$$ be the step of the uniform quantizer et let’s determine what quantization step $$\Delta_f(x)$$ is attached to this scheme.

For every value of $$x$$, the decision values attached to $$y = f(x)$$ by the uniform quantizer are $$[y]^-$$ and $$[y]^+$$. Hence, the decision values for $$x$$ and the non-linear quantization are $[x]^-_f = f^{-1}([y]^-) \, \mbox{ and } \, [x]^+_f = f^{-1}([y]^+)$ and if the high resolution assumption is satisfied the step $$\Delta_f(x)$$ is : $\Delta_f(x) = f^{-1}([y]^- + \Delta) - f^{-1}([y]^-) \simeq (f^{-1})'(f(x)) \Delta = \frac{\Delta}{f'(x)}$ something that is remembered as $\Delta_f(x) \propto \frac{1}{f'(x)}$ The proportionaly constant may be easily recovered by noting that when $$f(x)=x$$, $$[\, \cdot \, ]_f = [\, \cdot \,]$$ and therefore $$\Delta(x) = \Delta$$. If we impose moreover $$f(0) = 0$$, we find $\label{BOOGA} f(x) \propto \int_0^x \frac{ds}{\Delta(s)}$ If the quantizer is to maximize the entropy for the random variable $$X$$ with density $$p(x)$$ we obtain

$\label{BOOGAD} f(x) \propto \int_0^x p(y) \, dy$

#### Example

Let’s consider the digital audio signal displayed in figure .

The uniform quantization on $$(-1,1)$$ with step $$\Delta = 10^{-1}$$ is dense enough so that the associated histogram may be considered as a continuous function of the parameter $$x$$. We observe in figure that this partition generates – for a large range of values of $$x$$ – a counting measure $$n(x)$$ of a few thousands. The ratio $$n(x)/n$$ where $$n$$ is the total number of samples should therefore generate a good approximation of the density of the signal, considered as a sequence of independent and identically distributed values. Around 20 seconds of audio data Audio Data Histogram

The logarithm of the histogram is similar to a function of the type $$-a |x| +b$$, $$a>0$$ (cf fig. ). We therefore select $$p(x) \propto \exp(-a|x|)$$. The optimal quantization – for the entropy criterion – and the corresponding characteristic function $$f$$ such that $$f(0)=0$$ are therefore given by: $\Delta(x) \propto e^{a|x|} \; \mbox{ and } \; f(x) \propto \mathrm{sign}\,(x) (1-e^{-a|x|})$ Log plot of the audio data histogram

# Logarithmic Quantization

We consider in this section several related quantizers whose characteristic function is – roughly speaking – the logarithm of their argument.

## The $$\mu$$-law Quantizer

Consider the probability law $\label{PL} p(x) \propto \left| \begin{array}{cl} \displaystyle \frac{1}{1+\mu|x|/A} & \mbox{ if } \, |x|\leq A, \\ 0 & \mbox{ otherwise. } \end{array} \right.$

The threshold $$A$$ is necessary as otherwise the right-hand side of the equation would not be summable. The parameter $$a$$ controls directly the relative probability of low and high amplitude values as $$p(\pm A)/p(0)= 1/(1 + \mu)$$. In the limit case $$\mu=0$$, we end up with a uniform probability distribution on $$[-A, A]$$.

The optimal quantizer for the entropy criterion satisfies () and therefore the characteristic function $$f$$ such that $$f(0)=0$$ satisfies $f(x) \propto \mathrm{sign}\,(x) \ln \left(1 + \mu \frac{x}{A}\right).$ If we limit the range of the quantizer to $$[-1,1]$$ (we set $$A=1$$) and enforce the constraint $$f([-1, 1]) = [-1, 1]$$, we end up with $f(x) = \mathrm{sign}\,(x) \frac{\log \left(1 + \mu|x| \right)}{\log(1+\mu)}$ This quantization scheme is called $$\mu$$-law and is for example used in the NeXT/Sun AU audio file format (files with extension .au or .snd). The actual implementation of the law, specified in the ITU-T G.711 standard – differs slightly from the theoretical formulas. A reference implementation is given in the code below:

class MuLaw(Quantizer):
"""
Mu-law quantizer
"""
scale  = 32768
iscale = 1.0 / scale
bias   = 132
clip   = 32635
etab   = array([0, 132, 396, 924, 1980, 4092, 8316, 16764])

@staticmethod
def sign(data):
"""
Sign function such that sign(+0) = 1 and sign(-0) = -1
"""
data = array(data, dtype=float)
s = numpy_sign(data)
i = where(s==0)
s[i] = numpy_sign(1.0 / data[i])
return s

def encode(self, data):
data = array(data)
s = MuLaw.scale * data
s = minimum(abs(s), MuLaw.clip)
[f,e] = frexp(s + MuLaw.bias)

step  = floor(32*f) - 16    # 4 bits
chord = e - 8               # 3 bits
sgn   = (MuLaw.sign(data) == 1)   # 1 bit

mu = 16 * chord + step      # 7-bit coding
mu = 127 - mu               # bits inversion
mu = 128 * sgn + mu         # final 8-bit coding

return array(mu, dtype=uint8)

def decode(self, i):
i = array(i)
i = 255 - i
sgn = i > 127
e = array(floor(i / 16.0) - 8 * sgn + 1, dtype=uint8)
f = i % 16
data = ldexp(f, e + 2)
e = MuLaw.etab[e-1]
data = MuLaw.iscale * (1 - 2 * sgn) * (e + data)

return data

mulaw = MuLaw()

Note that this code is applied to values between $$-1$$ and $$1$$ and uses 8 bits. The most significant bit encodes the sign; the amplitude of the signal is coded by the 7 remaining bits. The effective value of $$\mu$$ is approximately 250 but instead of using the expression $$\log(1+\mu|x|)$$, we prefer a piecewise affine approximation of it (see fig ). The values $$[x]$$ are then all multiples of $$2^{-13}$$ which limits the additional quantization error when the original signal is initially encoded with a uniform law using 14 bits or more. To ease the error correction when transmitted the bits other than the sign bit are finally inverted.  ## IEEE754 Floating-Point Numbers and $$A$$-law

All scientific computing applications use implicitely a quantizer: the quantizer that represents approximation of real numbers in the floating-point arithmetic. The description of two types of numbers – single and double (or rather, single and double-precision numbers) – is detailled in the IEEE 754 standard. In both cases, 1 bit is allocated to code the sign of the number, $$m$$ bits for the exponent part and $$n$$ bits for the fraction part, $s \in \{0,1\}, \; e \in \{0, \cdots, 2^{m}-1\}, \; f \in \{0,\cdots, 2^{n}-1\}$ consequently any real number is represented by an integer in $$\{0, \cdots, 2^{m+n+1}\}$$ according to: $n = s \times 2^{m+n} + e \times 2^n + f \in \{0,\cdots, 2^{m+n+1}\}$ The single type is defined by $$(m,n)=(8,23)$$ and the double type by $$(m,n)=(11,52)$$ ; they are respectively coded on 32 and 64 bits.

We define $e_0 = 2^{m-1} - 1$ so that the value of the actual exponent $$e-e_0$$ range (almost symmetrically) from $$2^{m-1}$$ to $$-2^{m-1}+1$$. The inverse quantizer attached to the standard floating point number representation is defined as follows: for an integer $$n$$, $$[x] = i^{-1}(n)$$ is given by $[x] = \left| \begin{array}{rc} NaN & \mbox{if } e = 2^{m}-1 \mbox{ and } f \neq 0 \\ (-1)^s \infty & \mbox{if } e = 2^{m}-1 \mbox{ and } f = 0 \\ (-1)^s (1 + f / 2^{n}) \times 2^{e - e_0} & \mbox{if } \; 0 < e < 2^{m}-1 \\ (-1)^s ( f / 2^{n}) \times 2^{1 - e_0} & \mbox{if } \; e = 0\; \end{array} \right.$

The structure of theses inverse quantizers are displayed in the figure ; they are piecewise affine approximation of an exponential with a base of 2, except in the range $$e=0$$ (the so-called denormalized numbers) where the graph is linear.

[graph of the inverse quantizer for a floating point representation such that $$(m,n)=(4,3)$$] (images/float.pdf)

The $$A$$-law is a variant of the $$\mu$$-law that has a structure similar the single and double types of floating point arithmetic but with a base different from 2. Given a value of $$A$$ (often $$87.7$$), the inverse of its characteristic function is defined on $$[-1, 1]$$ by $f^{-1}(x) = \mathrm{sgn}\, (x) \times \left| \begin{array}{rl} (1+\ln A) |x|/A & \mbox{if } \; |x| < \frac{1}{1+\ln A} \\ \exp(x(1+\ln A)-1)/A & \mbox{otherwise.} \end{array} \right.$

# Signal-to-Noise Ratio

## Computation of the signal-to-noise ratio

For a given sequence of $$k$$ values $$x_n$$, the output $$[x_n]$$ of a quantizer may be interpreted as the sum of the original value and a perturbation sequence $$b_n= [x_n] - x_n$$ called a noise. The square of the signal-to-noise ratio – or SNR – is simply the ratio between the energies of those two values: $\mathrm{SNR}^2 = \frac{ \displaystyle\mathbb{E}sp \left( \sum_{n=0}^{k-1} x_n^2 \right) }{ \displaystyle \mathbb{E}sp \left( \sum_{n=0}^{k-1} b_n^2 \right) }$ The SNR is often measured in decibels (dB): $\mathrm{SNR} \; \mbox{[dB]} = 20 \log_{10} \mathrm{SNR} = 10 \log_{10} \mathrm{SNR}^2$

When the values $$x_n$$ are independent and follow the same probability law $$p(x)$$, this energy is given by $\mathbb{E} \left( \sum_{n=0}^{p-1} x_n^2 \right) = k \, \mathbb{E} \left( x_n^2 \right) = k \int_{-\infty}^{+\infty} x^2 p(x) \, dx$ and under a high resolution assumption we have $\begin{eqnarray*} \mathbb{E}sp( b_n^2) &=& \int_{-\infty}^{+\infty} ([x]-x)^2 p(x)\, dx \\ &=& \sum_y \int_{y}^{y+\Delta(y)} ([x]-x)^2 p(x)\, dx \\ &\simeq& \sum_y p(y) \int_{y}^{y+\Delta(y)} (y + \Delta(y)/2 -x)^2 \, dx \\ &=& \sum_y p(y) \frac{\Delta(y)^3}{12} \\ &\simeq& \sum_y \int_{y}^{y+\Delta(y)} \frac{\Delta(x)^2}{12} p(x) \, dx \\ &=& \int_{-\infty}^{+\infty} \frac{\Delta(x)^2}{12} p(x)\, dx \\ &=& \frac{1}{12} \mathbb{E}sp \left( \Delta(x_n)^2 \right) \label{noise-square} \end{eqnarray*}$

Finally $\label{SNRSNR} \mathrm{SNR}^2 = 12\frac{\mathbb{E}sp(x_n^2)}{\mathbb{E}sp(\Delta(x_n)^2)} = 12 \frac { \displaystyle \int_{\mathbb{R}} x^2 p(x) \, dx } { \displaystyle \int_{\mathbb{R}} \Delta(x)^2 p(x)\, dx }$ In the typical case where the probability density of the signal is uniform on $$[-A, A]$$ and the quantization is uniform on this range with a step $$\Delta$$, we end up with ${\mathrm{SNR} = 2 A/\Delta}$

### Maximization of the SNR

For a given density of probability, how can we select the quantization scheme so that the SNR is maximal ? Formulated like that, this problem is not well-posed because the quantization noise may be made a small as possible with a decrease of the quantization step. The significant problem is to solve this problem under a constant bit budget. Without any loss of generality, we may assume that the signal has values in $$[-1, 1]$$ and that the characteristic function of the searched quantization satisfies $$f([-1,1])=[-1,1]$$. If we allocate $$N$$ bits to the quantization scheme, the step $$\Delta(x)$$ is determined by $\Delta(x) = \frac{2^{-N+1}}{f'(x)}$ The SNR then takes the form $\mathrm{SNR} = \kappa 2^N$ where the value of $$\kappa$$ depend only from the probability law of the signal and of the choice of $$f$$. In decibels, this equation is written as $\mathrm{SNR} \; \mbox{[dB]} \simeq 6.02 \times N + \kappa'$ that is, every extra bit increase the SNR by approximately 6 dB. To maximize the SNR, we then have to solve $\min_{f'} \int_{-1}^1 \frac{1}{f'(x)^2} p(x)\, dx \; \mbox{ subject to } \; f(1)-f(-1) = 2$ or even, with $$\psi = f'$$ $\min_{\psi} J(\psi) = \int_{-1}^1 \frac{1}{\psi(x)^2} p(x)\, dx \; \mbox{ with } \; K(\psi)= \int_{-1}^1 \psi(x) \, dx = 2$ At the optimum, there is a $$\lambda \in \mathbb{R}$$ such that the lagrangian $$L(\psi) = J(\psi)+\lambda K(\psi)$$ satisfies $$d L(\psi) =0$$, that is $\mbox{for all } \, \delta \psi: [-1,1] \to \mathbb{R}, \; \int_{-1}^1 \left(-\frac{2}{\psi(x)^3} p(x) + \lambda \right) (\delta \psi)(x)\, dx =0$ and that implies $-\frac{2}{\psi(x)^3} p(x) + \lambda = 0$ and hence $f'(x) \propto (p(x))^{\frac{1}{3}}.$