shithub: opus

--- a/doc/ietf/draft-valin-celt-codec.xml

+++ b/doc/ietf/draft-valin-celt-codec.xml

@@ -96,7 +96,7 @@

 of CELT include:</t>

<t>

 <list style="symbols">

-<t>Live network music performance</t>

+<t>Collaborative network music performance</t>

 <t>High-quality teleconferencing</t>

 <t>Wireless audio equipment</t>

 <t>Low-delay links for broadcast applications</t>

@@ -117,8 +117,8 @@

 CELT stands for <spanx style="emph">Constrained Energy Lapped Transform</spanx>. This is

 the fundamental principle of the codec: the quantization process is designed in such a way

 as to preserve the energy in a certain number of bands. The theoretical aspects of the

-codec is described in greater details <xref target="celt-tasl"/> and

-<xref target="celt-eusipco"/>. Although these papers describe a slightly older version of

+codec are described in greater detail <xref target="celt-tasl"/> and

+<xref target="celt-eusipco"/>. Although these papers describe slightly older versions of

 the codec (version 0.3.2 and 0.5.1, respectively), the principles remain the same.

 </t>

@@ -148,13 +148,13 @@

<t>

 The CELT codec does not use a standard <spanx style="emph">bit-packer</spanx>,

 but rather uses a range coder to pack both integers and entropy-coded symbols.

-In mono mode, the bit-stream generated by the encoder contains (in the same order) the

-following parameters:

+In mono mode, the bit-stream generated by the encoder contains the

+following parameters (in order):

 </t>

<t>

 <list style="symbols">

-<t>Feature flags (2-4 bits)</t>

+<t>Feature flags I, P, S, F (2-4 bits)</t>

 <t>if P=1

    <list style="symbols">

       <t>Pitch period</t>

@@ -218,11 +218,11 @@

 </t>

<t>

-The windowing overlap is the amount of overlap between the frames. CELT uses a low-overlap window that is typically half of the frame size. For a frame size of 256 samples, the overlap is 128 samples, so the total algorithmic delay is 256+128=384. CELT divides the audio into frequency bands, for which the energy is preserved. These bands are chosen to follow the ear's critical bands (Bark scale), with the exception that each band has to contain at least 3 frequency bins.

+The windowing overlap is the amount of overlap between the frames. CELT uses a low-overlap window that is typically half of the frame size. For a frame size of 256 samples, the overlap is 128 samples, so the total algorithmic delay is 256+128=384. CELT divides the audio into frequency bands, for which the energy is preserved. These bands are chosen to follow the ear's critical bands, with the exception that each band has to contain at least 3 frequency bins.

 </t>

<t>

-The bands used for coding in CELT are based on the Bark scale. The Bark band edges (in Hz) are defined as:

+The energy bands are based on the Bark scale. The Bark band edges (in Hz) are defined as

 [0, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270,  1480,  1720,  2000,  2320,

 2700, 3150, 3700, 4400, 5300, 6400,  7700, 9500, 12000, 15500, 20000]. The actual bands used by the codec

 depend on the sampling rate and the frame size being used. The mapping from Hz to MDCT bins is done by

@@ -235,8 +235,8 @@

 CELT includes a pitch predictor for which the gains are defined over

 a set of <spanx style="emph">pitch bands</spanx>. The pitch bands are defined

 (in Hz) as [0, 345, 689, 1034, 1378, 2067, 3273, 5340, 6374]. The Hz values

-are converted to MDCT bins in the same was as for the Bark bands. The pitch

-bands boundaries are aligned to Bark band boundaries. The definition of the pitch

+are mapped to MDCT bins in the same was as the energy bands. The pitch

+band boundaries are aligned to energy band boundaries. The definition of the pitch

 bands is computed in compute_pbands() (<xref target="modes.c">modes.c</xref>).

 </t>

 </section>

@@ -331,26 +331,153 @@

 CELT uses an entropy coder based upon <xref target="range-coding"></xref>,

 which is itself a rediscovery of the FIFO arithmetic code introduced by <xref target="coding-thesis"></xref>.

 It is very similar to arithmetic encoding, except that encoding is done with

-digits in any base, instead of with bits, and so it is faster when using

-larger bases (i.e.: a byte).

+digits in any base, instead of with bits,

+so it is faster when using larger bases (i.e.: a byte). All of the

+calculations in the range coder must use bit-exact integer arithmetic.

 </t>

<t>

 The range coder also acts as the bit-packer for CELT. It is

-used in three different ways to encode:

+used in three different ways, to encode:

 <list style="symbols">

-<t>entropy-coded symbols with a fixed probability model using ec_encode() (<xref target="rangeenc.c">rangeenc.c</xref>)</t>

-<t>integers from 0 to 2^M-1 using ec_enc_uint() or ec_enc_bits() (<xref target="entenc.c">entenc.c</xref>)</t>

-<t>integers from 0 to N-1 (where N is not a power of two) using ec_enc_uint() (<xref target="entenc.c">entenc.c</xref>)</t>

+<t>entropy-coded symbols with a fixed probability model using ec_encode(), (<xref target="rangeenc.c">rangeenc.c</xref>)</t>

+<t>integers from 0 to 2^M-1 using ec_enc_uint() or ec_enc_bits(), (<xref target="entenc.c">entenc.c</xref>)</t>

+<t>integers from 0 to N-1 (where N is not a power of two) using ec_enc_uint(). (<xref target="entenc.c">entenc.c</xref>)</t>

 </list>

 </t>

-<t>The main encoding function is ec_encode(), which takes as an argument the range of the symbol to be encoded. [[Describe how ec_encode() works]]

+<t>

+The range encoder maintains an internal state vector composed of the

+four-tuple (low,rng,rem,ext), representing the low end of the current

+range, the size of the current range, a single buffered output byte,

+and a count of additional carry-propagating output bytes. Both rng

+and low are 32-bit unsigned integer values, rem is a byte value, or

+the special value -1, and ext is an integer with at least 16 bits.

+This state vector is initialized at the start of each each frame to

+the value (0,2^31,-1,0).

 </t>

<t>

-Functions ec_enc_uint() or ec_enc_bits() are based on ec_encode() and consider N equiprobable symbols, each with a frequency of 1. [[Describe the handling of larger integers]]

+Each symbol is drawn from a finite alphabet and coded in a separate

+context which describes the size of the alphabet and the relative

+frequency of each symbol in that alphabet. CELT only uses static

+contexts; they are not adapted to the statistics of the data that is

+coded.

 </t>

+<section anchor="encoding-symbols" title="Encoding Symbols">

+<t>

+   The main encoding function is ec_encode() (rangeenc.c (Appendix

+   A.29)), which takes as an argument a three-tuple (fl,fh,ft)

+   describing the range of the symbol to be encoded in the current

+   context, with 0 &lt;= fl &lt; fh &lt;= ft &lt;= 65535. The values of this tuple

+   are derived from the probability model for the symbol. Let f(i) be

+   the frequency of the ith symbol in the current context. Then the

+   three-tuple corresponding to the kth symbol is given by

+   <![CDATA[

+fl=sum(f(i),i<k), fh=fl+f(i), and ft=sum(f(i)).

+]]>

+</t>

+<t>

+   ec_encode() updates the state of the encoder as follows. If fl is

+   greater than zero, then low = low + rng - (rng/ft)*(ft-fl) and

+   rng = (rng/ft)*(fh-fl). Otherwise, low is unchanged and

+   rng = rng - (rng/ft)*(fh-fl). The divisions here are exact integer

+   division. After this update, the range is normalized.

+</t>

+<t>

+   To normalize the range, the following process is repeated until

+   rng > 2^23. First, the top 9 bits of low, (low>>23), are placed into

+   a carry buffer. Then, low is set to <![CDATA[(low << 8 & 0x7FFFFFFF) and rng

+   is set to (rng<<8)]]>. This process is carried out by

+   ec_enc_normalize() (rangeenc.c (Appendix A.29)).

+</t>

+<t>

+   The 9 bits produced in each iteration of the normalization loop

+   consist of 8 data bits and a carry flag. The final value of the

+   output bits is not determined until carry propagation is accounted

+   for. Therefore the reference implementation buffers a single

+   (non-propagating) output byte and keeps a count of additional

+   propagating (0xFF) output bytes. An implementation MAY choose to use

+   any mathematically equivalent scheme to perform carry propagation.

+</t>

+<t>

+   The function ec_enc_carry_out() (rangeenc.c (Appendix A.29)) performs

+   this buffering. It takes a 9-bit input value, c, from the normalization

+   8-bit output and a carry bit. If c is 0xFF, then ext is incremented

+   and no bytes are output. Otherwise, if rem is not the special value

+   -1, then the byte (rem+(c>>8)) is output. Then ext bytes are output

+   with the value 0 if the carry bit is set, or 0xFF if it is not, and

+   rem is set to the lower 8 bits of c.

+</t>

+<t>

+   In the reference implementation, a special version of ec_encode()

+   called ec_encode_bin() (rangeenc.c (Appendix A.29)) is defined to

+   take a two-tuple (fl,ftb), where <![CDATA[0 <= fl < 2^ftb and ftb < 16. It is

+   mathematically equivalent to calling ec_encode() with the three-tuple

+   (fl,fl+1,1<<ftb)]]>, but avoids using division.

+</t>

+</section>

+<section anchor="encoding-ints" title="Encoding Uniformly Distributed Integers">

+<t>

+   Functions ec_enc_uint() or ec_enc_bits() are based on ec_encode() and

+   encode one of N equiprobable symbols, each with a frequency of 1,

+   where N may be as large as 2^32-1. Because ec_encode() is limited to

+   a total frequency of 2^16-1, this is done by encoding a series of

+   symbols in smaller contexts.

+</t>

+<t>

+   ec_enc_bits() (entenc.c (Appendix A.25)) is defined, like

+   ec_encode_bin(), to take a two-tuple (fl,ftb), with <![CDATA[0 <= fl < 2^ftb

+   and ftb < 32. While ftb is greater than 8, it encodes bits (ftb-8) to

+   (ftb-1) of fl, e.g., (fl>>ftb-8&0xFF) using ec_encode_bin() and

+   subtracts 8 from ftb. Then, it encodes the remaining bits of fl, e.g.,

+   (fl&(1<<ftb)-1)]]>, again using ec_encode_bin().

+</t>

+<t>

+   ec_enc_uint() (entenc.c (Appendix A.25)) takes a two-tuple (fl,ft),

+   where ft is not necessarily a power of two. Let ftb be the location

+   of the highest one bit in the two's-complement representation of

+   (ft-1), or -1 if no bits are set. If ftb>8, then the top 8 bits of fl

+   are encoded using ec_encode() with the three-tuple

+   (fl>>ftb-8,(fl>>ftb-8)+1,(ft-1>>ftb-8)+1), and the remaining bits

+   are encoded with ec_enc_bits using the two-tuple

+   <![CDATA[(fl&(1<<ftb-8)-1,ftb-8). Otherwise, fl is encoded with ec_encode()

+   directly using the three-tuple (fl,fl+1,ft)]]>.

+</t>

+</section>

+<section anchor="encoder-finalizing" title="Finalizing the Stream">

+<t>

+   After all symbols are encoded, the stream must be finalized by

+   outputting a value inside the current range. Let end be the integer

+   in the interval [low,low+rng) with the largest number of trailing

+   zero bits. Then while end is not zero, the top 9 bits of end, e.g.,

+   <![CDATA[(end>>23), are sent to the carry buffer, and end is replaced by

+   (end<<8&0x7FFFFFFF). Finally, if the value in carry buffer, rem, is]]>

+   neither zero nor the special value -1, or the carry count, ext, is

+   greater than zero, then 9 zero bits are sent to the carry buffer.

+   After the carry buffer is finished outputting bytes, the rest of the

+   output buffer is padded with zero bytes. Finally, rem is set to the

+   special value -1. This process is implemented by ec_enc_done()

+   (rangeenc.c (Appendix A.29)).

+</t>

+</section>

+<section anchor="encoder-tell" title="Current Bit Usage">

+<t>

+   The bit allocation routines in CELT need to be able to determine a

+   conservative upper bound on the number of bits that have been used

+   to encode the current frame thus far. This drives allocation

+   decisions and ensures that the range code will not overflow the

+   output buffer. This is computed in the reference implementation to

+   fractional bit precision by the function ec_enc_tell() (rangeenc.c

+   (Appendix A.29)). Like all operations in the range encoder, it must

+   be implemented in a bit-exact manner.

+</t>

+</section>

 </section>