shithub: opus

ref: 9a6c4966309edea641602fea556e5ef654f95f08
dir: /doc/ietf/draft-valin-celt-codec.xml/

View raw version
<?xml version='1.0'?>
<!DOCTYPE rfc SYSTEM 'rfc2629.dtd'>
<?rfc toc="yes" ?>

<rfc ipr="full3978" docName="Constrained-Energy Lapped Transform (CELT) Codec">

<front>
<title>draft-valin-celt-codec-00</title>



<author initials="J-M" surname="Valin" fullname="Jean-Marc Valin">
<organization>Octasic Semiconductor</organization>
<address>
<postal>
<street>4101, Molson Street, suite 300</street>
<city>Montreal</city>
<region>Quebec</region>
<code>H1Y 3L1</code>
<country>Canada</country>
</postal>
<email>[email protected]</email>
</address>
</author>

<!-- <author initials="et" surname="al." fullname="et al.">
<organization></organization>
</author>
-->

<date day="18" month="December" year="2008" />

<area>General</area>

<workgroup>AVT Working Group</workgroup>
<keyword>audio codec</keyword>
<keyword>low delay</keyword>
<keyword>Internet-Draft</keyword>
<keyword>CELT</keyword>

<abstract>
<t>
CELT <xref target="celt-website"/> is an open-source voice codec suitable for use in very low delay 
Voice over IP (VoIP) type applications.  This document describes the encoding
and decoding process. 
</t>
</abstract>
</front>

<middle>

<section anchor="Introduction" title="Introduction">
<t>
This document describes the CELT codec, which is designed for transmitting full-bandwidth
audio with very low delay. It is suitable for encoding both
speech and music and rates starting at 32 kbit/s. It is primarly designed for transmission
over packet networks and protocols such as RTP <xref target="rfc3550"/>, but also includes
a certain amount of robustness to bit errors, where this could be done at no significant
cost. The codec features are:
</t>

<t>
<list style="symbols">
<t>Ultra-low algorithmic delay (typically 3 to 9 ms)</t>
<t>Full audio bandwidth (44.1 kHz and 48 kHz)</t>
<t>Support for both voice and music</t>
<t>Stereo support</t>
<t>Packet loss concealment</t>
<t>Constant bit-rates from 32 kbps to 128 kbps and above</t>
<t>Free software/open-source/royalty-free</t>
</list>
</t>

<t>The novel aspect of CELT compared to most other codecs is its very low delay,
below 10 ms. There are two main advantages to having a very low delay audio link.
The lower delay itself is important some interactions, such as playing music
remotely. Another advantage is the behaviour in presence of acoustic echo. When
the round-trip audio delay is sufficiently low, acoustic echo is no longer
perceived as a distinct repetition, but as extra reverberation. Applications
of CELT include:</t>
<t>
<list style="symbols">
<t>Live network music performance</t>
<t>High-quality teleconferencing</t>
<t>Wireless audio equipment</t>
<t>Low-delay links for broadcast applications</t>
</list>
</t>

<t>
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 <xref target="rfc2119"/>.
</t>
</section>

<section anchor="Overview of the CELT Codec" title="Overview of the CELT Codec">

<t>
CELT stands for "Constrained Energy Lapped Transform". This is
the fundamental princple of the codec: the quantization process is designed in such a way
as to preserve the energy in a certain number of bands.
</t>

<t>CELT is a transform codec, based on the Modified Discrete Cosine Transform 
<xref target="mdct"/>, which is based on a DCT-IV, with overlap and time-domain
aliasing calcellation.</t>




</section>

<section anchor="CELT Modes" title="CELT Modes">
<t>
The operation of both the encoder and decoder depend on the 
mode data. This data includes:
<list style="symbols">
<t>Frame size</t>
<t>Sampling rate</t>
<t>Windowing overlap</t>
<t>Number of channels</t>
<t>Definition of the bands</t>
<t>Definition of the "pitch bands"</t>
<t>Decay coefficients of the Laplace distributions for coarse energy</t>
<t>Fine energy allocation data</t>
<t>Pulse allocation data</t>
</list>
</t>
</section>

<section anchor="CELT Encoder" title="CELT Encoder">

<t>Insert encoder overview</t>

<t>The input audio first goes through a pre-emphasis filter, which attenuates the
"spectral tilt". The filter is has the transfer function A(z)=1-alpha_p*z^-1, with
alpha_p=0.8. The inverse of the pre-emphasis is applied at the decoder.</t>

<t>The top-level function for encoding a CELT frame is celt_encode() 
(<xref target="celt.c">celt.c</xref>).
</t>

<section anchor="Range Coder" title="Range Coder">
<t>
derf?
</t>
</section>

<section anchor="Forward MDCT" title="Forward MDCT">

<t>The MDCT implementation has no special characteristic. The
input is a windowed signal (after pre-emphasis) of 2*N samples and the output is N
frequency-domain samples. A "low-overlap" window is used to reduce the algorithmc delay. 
It is composed of a smaller window with symmetric zero padding on both sides. The window
is the same as the one used in the Vorbis codec and defined as: 
W(n)=[sin(pi/2*sin(pi/2*(n+.5)/L))]^2. The MDCT is computed in mdct_forward() 
(<xref target="mdct.c">mdct.c</xref>), and includes the windowing.
</t>

</section>

<section anchor="Bands and Normalization" title="Bands and Normalization">
<t>
The MDCT output is divided into bands that are designed to match the ear's critical bands,
with the exception that they have to be at least 3 bins wide. For each band, the encoder
computes the energy, that will later be encoded. Each band is then normalized by the 
square root of the *unquantized* energy, such that each band now forms a unit vector.
The energy and the normalization are computed by compute_band_energies()
and normalise_bands() (<xref target="bands.c">bands.c</xref>), respectively.
</t>
</section>

<section anchor="Energy Envelope Quantization" title="Energy Envelope Quantization">

<t>
It is important to quantize the energy with sufficient resolution because
any quantization error in the energy cannot be compensated for at a later
stage. Regardless of the resolution used for encoding the shape of a band,
it is perceptually important to preserve the energy in each band. We use a
coarse-fine strategy for encoding the energy in the log domain (dB), 
implemented in quant_coarse_energy_mono() and quant_coarse_energy() 
(<xref target="quant_bands.c">quant_bands.c</xref>)</t>

<t>
The coarse quantization of the energy uses a fixed resolution of
6 dB and is the only place where prediction and entropy coding are used.
The prediction is applied both in time (using the previous frame)
and in frequency (using the previous band). The 2-D z-transform of
the prediction filter is: A(z_l, z_b)=(1-a*z_l^-1)*(1-z_b^-1)/(1-b*z_b^-1)
where b is the band index and l is the frame index. We have obtained
good results with a=0.8 and b=0.7. To prevent error accumu-
lation, the prediction is applied on the quantized log-energy. The
prediction step reduces the entropy of the coarsely-quantized energy
from 61 to 30 bits. Of this 31-bit reduction, 12 are due to inter-frame
prediction. We approximate the ideal probability distribution of the
prediction error using a Laplace distribution, which results in an average 
of 33 bits per frame to encode the energy of all 19 bands at a
6 dB resolution. Because of the short frames, this represents a
15% bitrate savings in a typical configuration.
</t>

<t>
The Laplace distribution for each band is defined by a 16-bit (Q15) decay parameter.
Thus, the value 0 has a probability of p[0]=32767*(16384-decay)/(16384+decay). The 
values +/-i each have a probability p[i] = (p[i-1]*decay)>>14. The value of p[i] is always
rounded down (to avoid exceeding 32767 as the sum of all probabilities), so it is possible
for the sum to be less than 32767. There is thus is small range of values that are impossible.
The signed values corresponding to symbols 0, 1, 2, 3, 4, ... are [0, +1, -1, +2, -2, ...].
The encoding of the Laplace-distributed values is implemented in ec_laplace_encode() (<xref target="laplace.c">laplace.c</xref>).
</t>

</section>

<section anchor="Bit Allocation" title="Bit Allocation">
<t>Bit allocation is performed based only on information available to both
the encoder and decoder. The same calculations are performed in a bit-exact
manner in both the encoder and decoder to ensure that the result is always
exactly the same. Any mismatch would cause an error in the decoded output.
The allocation is computed by compute_allocation() (<xref target="rate.c">rate.c</xref>),
which is used in both the encoder and the decoder.</t>

<t>For a given band, the bit allocation is nearly constant across
frames that use the same number of bits for Q1 , yielding a pre-
defined signal-to-mask ratio (SMR) for each band. Because the
bands have a width of one Bark, this is equivalent to modelling the
masking occurring within each critical band, while ignoring inter-
band masking and tone-vs-noise characteristics. While this is not an
optimal bit allocation, it provides good results without requiring the
transmission of any allocation information.
</t>

</section>

<section anchor="Pitch Prediction" title="Pitch Prediction">
<t>
The pitch period is computed by find_spectral_pitch()
(<xref target="pitch.c">pitch.c</xref>) and the pitch gain is computed by
compute_pitch_gain() (<xref target="bands.c">bands.c</xref>).
</t>

</section>

<section anchor="Spherical Vector Quantization" title="Spherical Vector Quantization">
<t>CELT uses a Pyramid Vector Quantization (PVQ) <xref target="PVQ"></xref>
codebook for quantising the details of the spectrum in each band that have not
been predicted by the pitch predictor. The PVQ codebook consists of all combinations
of K pulses signed in a vector of N samples. 
</t>

<t>
The search is performed by alg_quant() (<xref target="vq.c">vq.c</xref>).
</t>

<section anchor="Index Encoding" title="Index Encoding">
<t>
Derf?? The index is encoded by encode_pulses() (<xref target="cwrs.c">cwrs.c</xref>).
</t>
</section>

</section>

<section anchor="Short windows" title="Short windows">
</section>


</section>

<section anchor="CELT Decoder" title="CELT Decoder">

<t>
Like for most audio codecs, the CELT decoder is less complex than the encoder. 
</t>

<section anchor="Range Decoder" title="Range Decoder">
<t>
derf?
</t>
</section>

<section anchor="Energy Envelope Decoding" title="Energy Envelope Decoding">
<t>
If the decoded range is within the "impossible range" of the encoder, then
the decoder knows there has been an error in the coding, decoding or transmission
and MAY take measures to conceal the error and/or report that a problem has occured.
</t>
</section>

<section anchor="Spherical VQ Decoder" title="Spherical VQ Decoder">
<t>
The spherical codebook is decoded by alg_unquant() (<xref target="vq.c">vq.c</xref>).
The index of the PVQ entry is obtained from the range coder and converted to 
a pulse vector by decode_pulses() (<xref target="cwrs.c">cwrs.c</xref>). Derf??
</t>

<t>
mix_pitch_and_residual() (<xref target="vq.c">vq.c</xref>).
</t>
</section>

<section anchor="Index Decoding" title="Index Decoding">
</section>

<section anchor="Denormalization" title="Denormalization">
<t>
Just like each band was normalised in the encoder, the last step of the decoder before
the inverse MDCT is to denormalize the bands. Each decoded normalized band is
multiplied by the square root of the decoded energy. This is done by denormalise_bands()
(<xref target="bands.c">bands.c</xref>).
</t>
</section>

<section anchor="Inverse MDCT" title="Inverse MDCT">
<t>The inverse MDCT implementation has no special characteristic. The
input is N frequency-domain samples and the output is 2*N time-domain 
samples. The output is windowed using the same "low-overlap" window 
as the encoder. The IMDCT and windowing are performed by mdct_backward
(<xref target="mdct.c">mdct.c</xref>). After the overlap-add process, 
the signal is de-emphasised using the inverse of the pre-emphasis filter 
used in the encoder: 1/A(z)=1/(1-alpha_p*z^-1).
</t>
</section>

<section anchor="Packet Loss Concealment" title="Packet Loss Concealment (PLC)">
<t>
Packet loss concealment (PLC) is an optional decoder-side feature which 
SHOULD be included when transmitting over an unreliable channel. Because 
PLC is not part of the bit-stream, there are several possible ways to 
implement PLC with different complexity/quality trade-offs. The PLC in
the reference implementation simply finds a periodicity in the decoded
signal and repeats the windowed waveform using the pitch offset. Care
must be taken to preserve the time-domain aliasing cancellation property
of the inverse MDCT. This is implemented in celt_decode_lost() 
(<xref target="celt.c">mdct.c</xref>).
</t>
</section>

</section>



<section anchor="Security Considerations" title="Security Considerations">

<t>
A potential denial-of-service threat exists for data encodings using
compression techniques that have non-uniform receiver-end
computational load.  The attacker can inject pathological datagrams
into the stream which are complex to decode and cause the receiver to
be overloaded.  However, this encoding does not exhibit any
significant non-uniformity.
</t>

</section> 

<section anchor="Evaluation of CELT Implementations" title="Evaluation of CELT Implementations">

<t>
Insert some text here.
</t>

</section>



<section anchor="Issues that need to be addressed" title="Issues that need to be addressed">

<t>
<list>
<t>Dynamic bit allocation</t>
<t>Stereo coupling</t>
</list>
</t>

</section>


<section anchor="Acknowledgments" title="Acknowledgments">

<t>
The authors would also like to thank the following members of the 
CELT and AVT communities for their input:
</t>
</section> 

</middle>

<back>

<references title="Normative References">

<reference anchor="rfc2119">
<front>
<title>Key words for use in RFCs to Indicate Requirement Levels </title>
<author initials="S." surname="Bradner" fullname="Scott Bradner"><organization/></author>
</front>
<seriesInfo name="RFC" value="2119" />
</reference> 

<reference anchor="rfc3550">
<front>
<title>RTP: A Transport Protocol for real-time applications</title>
<author initials="H." surname="Schulzrinne" fullname=""><organization/></author>
<author initials="S." surname="Casner" fullname=""><organization/></author>
<author initials="R." surname="Frederick" fullname=""><organization/></author>
<author initials="V." surname="Jacobson" fullname=""><organization/></author>
</front>
<seriesInfo name="RFC" value="3550" />
</reference> 


</references> 

<references title="Informative References">

<reference anchor="celt-website">
<front>
<title>The CELT ultra-low delay audio codec</title>
<author><organization/></author>
</front>
<seriesInfo name="CELT website" value="http://www.celt-codec.org/" />
</reference> 

<reference anchor="mdct">
<front>
<title>Modified Discrete Cosine Transform</title>
<author><organization/></author>
</front>
<seriesInfo name="MDCT" value="http://en.wikipedia.org/wiki/Modified_discrete_cosine_transform" />
</reference> 

<reference anchor="PVQ">
<front>
<title>A Pyramid Vector Quantizer</title>
<author initials="T." surname="Fischer" fullname=""><organization/></author>
<date month="July" year="1986" />
</front>
<seriesInfo name="Pyramid Vector Quantizer" value="http://en.wikipedia.org/wiki/Modified_discrete_cosine_transform" />
</reference> 

</references>

<section anchor="Reference Implementation" title="Reference Implementation">

<t>This appendix contains the complete source code for a reference
implementation of the CELT codec written in C. This implementation
can be compiled for either floating-point or fixed-point machines.
Floating-point is the default and fixed-point can be enabled by
defining FIXED_POINT when compiling.
</t>

<t>The implementation can be compiled with either a C89 or a C99
compiler. It is reasonably optimized for most platforms such that
only architecture-specific optimizations are likely to be useful.
The FFT used is a slightly modified version of the KISS-FFT package,
but it is easy to substitute any other FFT library.
</t>

<t>
The testcelt executable can be used to test the encoding and decoding
process:
<list style="empty">
<t><![CDATA[ testcelt <rate> <channels> <frame size> <bytes per packet> [<complexity> [packet loss rate]] <input> <output> ]]></t>
</list>
where "rate" is the sampling rate in Hz, "channels" is the number of
channels (1 or 2), "frame size" is the number of samples in a frame 
(64 to 512) and "bytes per packet" is the number of bytes desired for each
compressed frame. The input and output files are assumed to be a 16-bit
PCM file in the machine native endianness. The optional "complexity" argument
can select the quality vs complexity tradeoff (0-10) and the "packet loss rate"
argument simulates random packet loss (argument is in tenths or a percent).
</t>

<?rfc include="source/testcelt.c"?>
<?rfc include="source/celt.h"?>
<?rfc include="source/celt.c"?>
<?rfc include="source/modes.h"?>
<?rfc include="source/modes.c"?>
<?rfc include="source/bands.h"?>
<?rfc include="source/bands.c"?>
<?rfc include="source/cwrs.h"?>
<?rfc include="source/cwrs.c"?>
<?rfc include="source/vq.h"?>
<?rfc include="source/vq.c"?>
<?rfc include="source/pitch.h"?>
<?rfc include="source/pitch.c"?>
<?rfc include="source/rate.h"?>
<?rfc include="source/rate.c"?>
<?rfc include="source/psy.h"?>
<?rfc include="source/psy.c"?>
<?rfc include="source/mdct.h"?>
<?rfc include="source/mdct.c"?>
<?rfc include="source/ecintrin.h"?>
<?rfc include="source/entcode.h"?>
<?rfc include="source/entcode.c"?>
<?rfc include="source/entenc.h"?>
<?rfc include="source/entenc.c"?>
<?rfc include="source/entdec.h"?>
<?rfc include="source/entdec.c"?>
<?rfc include="source/mfrngcod.h"?>
<?rfc include="source/rangeenc.c"?>
<?rfc include="source/rangedec.c"?>
<?rfc include="source/laplace.h"?>
<?rfc include="source/laplace.c"?>
<?rfc include="source/quant_bands.h"?>
<?rfc include="source/quant_bands.c"?>
<?rfc include="source/quant_pitch.h"?>
<?rfc include="source/quant_pitch.c"?>
<?rfc include="source/pgain_table.h"?>
<?rfc include="source/arch.h"?>
<?rfc include="source/fixed_generic.h"?>
<?rfc include="source/fixed_debug.h"?>
<?rfc include="source/mathops.h"?>
<?rfc include="source/os_support.h"?>
<?rfc include="source/float_cast.h"?>
<?rfc include="source/stack_alloc.h"?>
<?rfc include="source/celt_types.h"?>
<?rfc include="source/_kiss_fft_guts.h"?>
<?rfc include="source/kiss_fft.h"?>
<?rfc include="source/kiss_fft.c"?>
<?rfc include="source/kiss_fftr.h"?>
<?rfc include="source/kiss_fftr.c"?>
<?rfc include="source/kfft_single.h"?>
<?rfc include="source/kfft_single.c"?>
<?rfc include="source/kfft_double.h"?>

</section>


</back>

</rfc>