ref: 8858cac6bc904a5972b26678194b81c0eedee2fe
parent: 14d63d18795a2081adb8a3f9c714e94d021a666b
author: Jean-Marc Valin <[email protected]>
date: Thu May 17 15:45:10 EDT 2012
Sync with draft -14
--- a/Makefile.draft
+++ b/Makefile.draft
@@ -20,7 +20,7 @@
###################### END OF OPTIONS ######################
-CFLAGS += -DOPUS_VERSION='"0.9.11"'
+CFLAGS += -DOPUS_VERSION='"0.9.14"'
include silk_sources.mk
include celt_sources.mk
include opus_sources.mk
--- a/configure.ac
+++ b/configure.ac
@@ -9,7 +9,7 @@
OPUS_MAJOR_VERSION=0
OPUS_MINOR_VERSION=9
-OPUS_MICRO_VERSION=11
+OPUS_MICRO_VERSION=14
OPUS_EXTRA_VERSION=
OPUS_VERSION="$OPUS_MAJOR_VERSION.$OPUS_MINOR_VERSION.$OPUS_MICRO_VERSION$OPUS_EXTRA_VERSION"
--- a/doc/draft-ietf-codec-opus.xml
+++ b/doc/draft-ietf-codec-opus.xml
@@ -2,7 +2,7 @@
<!DOCTYPE rfc SYSTEM 'rfc2629.dtd'>
<?rfc toc="yes" symrefs="yes" ?>
-<rfc ipr="trust200902" category="std" docName="draft-ietf-codec-opus-13">
+<rfc ipr="trust200902" category="std" docName="draft-ietf-codec-opus-14">
<front>
<title abbrev="Interactive Audio Codec">Definition of the Opus Audio Codec</title>
@@ -53,7 +53,7 @@
</address>
</author>
-<date day="15" month="May" year="2012" />
+<date day="17" month="May" year="2012" />
<area>General</area>
@@ -563,9 +563,10 @@
</t>
<section anchor="toc_byte" title="The TOC Byte">
-<t>
-An Opus packet begins with a single-byte table-of-contents (TOC) header that
- signals which of the various modes and configurations a given packet uses.
+<t anchor="R1">
+A well-formed Opus packet MUST contain at least one byte [R1].
+This byte forms a table-of-contents (TOC) header that signals which of the
+ various modes and configurations a given packet uses.
It is composed of a configuration number, "config", a stereo flag, "s", and a
frame count code, "c", arranged as illustrated in
<xref target="toc_byte_fig"/>.
@@ -572,7 +573,7 @@
A description of each of these fields follows.
</t>
-<figure anchor="toc_byte_fig" title="The TOC byte">
+<figure anchor="toc_byte_fig" title="The TOC Byte">
<artwork align="center"><![CDATA[
0
0 1 2 3 4 5 6 7
@@ -638,11 +639,6 @@
the value of "c".
</t>
-<t anchor="R1">
-A well-formed Opus packet MUST contain at least one byte with the TOC
- information [R1], though the frame(s) within a packet MAY be zero bytes
- long.
-</t>
</section>
<section title="Frame Packing">
@@ -668,7 +664,7 @@
The special length 0 indicates that no frame is available, either because it
was dropped during transmission by some intermediary or because the encoder
chose not to transmit it.
-A length of 0 is valid for any Opus frame in any mode.
+Any Opus frame in any mode MAY have a length of 0.
</t>
<t>
@@ -1048,7 +1044,7 @@
which is itself a rediscovery of the FIFO arithmetic code introduced by <xref target="coding-thesis"></xref>.
It is very similar to arithmetic encoding, except that encoding is done with
digits in any base instead of with bits,
-so it is faster when using larger bases (i.e., an octet). All of the
+so it is faster when using larger bases (i.e., a byte). All of the
calculations in the range coder must use bit-exact integer arithmetic.
</t>
<t>
@@ -1117,11 +1113,11 @@
<section anchor="range-decoder-init" title="Range Decoder Initialization">
<t>
-Let b0 be the first input octet (or zero if there are no octets in this Opus
+Let b0 be the first input byte (or zero if there are no bytes in this Opus
frame).
The decoder initializes rng to 128 and initializes val to
(127 - (b0>>1)), where (b0>>1) is the top 7 bits of the
- first input octet.
+ first input byte.
It saves the remaining bit, (b0&1), for use in the renormalization
procedure described in <xref target="range-decoder-renorm"/>, which the
decoder invokes immediately after initialization to read additional bits and
@@ -1202,14 +1198,14 @@
by ec_dec_normalize() (entdec.c), until rng > 2**23.
If rng is already greater than 2**23, the entire process is skipped.
First, it sets rng to (rng<<8).
-Then it reads the next octet of the Opus frame and forms an 8-bit value sym,
- using the left-over bit buffered from the previous octet as the high bit
- and the top 7 bits of the octet just read as the other 7 bits of sym.
-The remaining bit in the octet just read is buffered for use in the next
+Then it reads the next byte of the Opus frame and forms an 8-bit value sym,
+ using the left-over bit buffered from the previous byte as the high bit
+ and the top 7 bits of the byte just read as the other 7 bits of sym.
+The remaining bit in the byte just read is buffered for use in the next
iteration.
-If no more input octets remain, it uses zero bits instead.
+If no more input bytes remain, it uses zero bits instead.
See <xref target="range-decoder-init"/> for the initialization used to process
- the first octet.
+ the first byte.
Then, it sets
<figure align="center">
<artwork align="center"><![CDATA[
@@ -1559,7 +1555,7 @@
2: Coded parameters
3: Pulses, LSBs, and signs
4: Pitch lags, Long-Term Prediction (LTP) coefficients
-5: Linear Prediction Coefficients (LPC) and gains
+5: Linear Predictive Coding (LPC) coefficients and gains
6: Decoded signal (mono or mid-side stereo)
7: Unmixed signal (mono or left-right stereo)
8: Resampled signal
@@ -1804,7 +1800,7 @@
When switching from 20 ms to 10 ms, the 10 ms Opus frame can
contain an LBRR frame covering at most half the prior 20 ms Opus frame,
potentially leaving a hole that needs to be concealed from even a single
- packet loss.
+ packet loss (see <xref target="Packet Loss Concealment"/>).
When switching from mono to stereo, the LBRR frames in the first stereo Opus
frame MAY contain a non-trivial side channel.
</t>
@@ -2329,10 +2325,13 @@
The first VQ stage uses a 32-element codebook, coded with one of the PDFs in
<xref target="silk_nlsf_stage1_pdfs"/>, depending on the audio bandwidth and
the signal type of the current SILK frame.
-This yields a single index, I1, for the entire frame.
-This indexes an element in a coarse codebook, selects the PDFs for the
- second stage of the VQ, and selects the prediction weights used to remove
- intra-frame redundancy from the second stage.
+This yields a single index, I1, for the entire frame, which
+<list style="numbers">
+<t>Indexes an element in a coarse codebook,</t>
+<t>Selects the PDFs for the second stage of the VQ, and</t>
+<t>Selects the prediction weights used to remove intra-frame redundancy from
+ the second stage.</t>
+</list>
The actual codebook elements are listed in
<xref target="silk_nlsf_nbmb_codebook"/> and
<xref target="silk_nlsf_wb_codebook"/>, but they are not needed until the last
@@ -4563,9 +4562,9 @@
<xref target="silk_ltp_params"/> to produce an LPC residual.
The LTP filter requires LPC residual values from before the current subframe as
input.
-However, since the LPCs may have changed, it obtains this residual by
- "rewhitening" the corresponding output signal using the LPCs from the current
- subframe.
+However, since the LPC coefficients may have changed, it obtains this residual
+ by "rewhitening" the corresponding output signal using the LPC coefficients
+ from the current subframe.
Let out[i] for
(j - pitch_lags[s] - d_LPC - 2) <= i < j
be the fully reconstructed output signal from the last
@@ -4824,8 +4823,9 @@
<xref target='MDCT'/> with partially overlapping windows of 5 to 22.5 ms.
The main principle behind CELT is that the MDCT spectrum is divided into
bands that (roughly) follow the Bark scale, i.e., the scale of the ear's
-critical bands. The normal CELT layer uses 21 of those bands, though Opus
+critical bands <xref target="Zwicker61"/>. The normal CELT layer uses 21 of those bands, though Opus
Custom (see <xref target="opus-custom"/>) may use a different number of bands.
+In Hybrid mode, the first 17 bands (up to 8 kHz) are not coded.
A band can contain as little as one MDCT bin per channel, and as many as 176
bins per channel, as detailed in <xref target="celt_band_sizes"/>.
In each band, the gain (energy) is coded separately from
@@ -5081,7 +5081,7 @@
Often this control is only indirect, and must be exercised carefully to
achieve the desired rate constraints.
The CELT layer, however, can adapt over a very wide range of rates, and thus
- has a large number of codebooks sizes to choose from for each band.
+ has a large number of codebook sizes to choose from for each band.
Explicitly signaling the size of each of these codebooks would impose
considerable overhead, even though the allocation is relatively static from
frame to frame.
@@ -5203,7 +5203,7 @@
may result in waste: bitstream capacity available at the end
of the frame which can not be put to any use. The maximums
specified by the codec reflect the average maximum. In the reference
-implementation, the maximums in bit/sample are precomputed in a static table
+implementation, the maximums in bits/sample are precomputed in a static table
(see cache_caps50[] in static_modes_float.h) for each band,
for each value of LM, and for both mono and stereo.
@@ -5239,11 +5239,11 @@
size of the frame in 8th bits, 'total_boost' to zero, and 'tell' to the total number
of 8th bits decoded
so far. For each band from the coding start (0 normally, but 17 in Hybrid mode)
-to the coding end (which changes depending on the signaled bandwidth): set 'width'
-to the number of MDCT bins in this band for all channels. Take the larger of width
-and 64, then the minimum of that value and the width times eight and set 'quanta'
-to the result. This represents a boost step size of six bits subject to limits
-of 1/bit/sample and 1/8th bit/sample. Set 'boost' to zero and 'dynalloc_loop_logp'
+to the coding end (which changes depending on the signaled bandwidth), the boost quanta
+in units of 1/8 bit is calculated as quanta = min(8*N, max(48, N)), where N is the
+number of MDCT bins in this band for all channels. This represents a boost step size of
+six bits, subject to a lower limit of 1/8th bit/sample and an upper limit of 1 bit/sample.
+Set 'boost' to zero and 'dynalloc_loop_logp'
to dynalloc_logp. While dynalloc_loop_log (the current worst case symbol cost) in
8th bits plus tell is less than total_bits plus total_boost and boost is less than cap[] for this
band: Decode a bit from the bitstream with a with dynalloc_loop_logp as the cost
@@ -6160,9 +6160,9 @@
<t>
The range encoder maintains an internal state vector composed of the four-tuple
(val, rng, rem, ext) representing the low end of the current
- range, the size of the current range, a single buffered output octet, and a
- count of additional carry-propagating output octets.
-Both val and rng are 32-bit unsigned integer values, rem is an octet value or
+ range, the size of the current range, a single buffered output byte, and a
+ count of additional carry-propagating output bytes.
+Both val and rng are 32-bit unsigned integer values, rem is a byte value or
less than 255 or the special value -1, and ext is an unsigned integer with at
least 11 bits.
This state vector is initialized at the start of each each frame to the value
@@ -6181,11 +6181,11 @@
These are used to perform carry propagation in the renormalization loop below.
Each iteration of this loop produces 9 bits of output, consisting of 8 data
bits and a carry flag.
-The encoder cannot determine the final value of the output octets until it
+The encoder cannot determine the final value of the output bytes until it
propagates these carry flags.
Therefore the reference implementation buffers a single non-propagating output
- octet (i.e., one less than 255) in rem and keeps a count of additional
- propagating (i.e., 255) output octets in ext.
+ byte (i.e., one less than 255) in rem and keeps a count of additional
+ propagating (i.e., 255) output bytes in ext.
An implementation may choose to use any mathematically equivalent scheme to
perform carry propagation.
</t>
@@ -6254,12 +6254,12 @@
Then,
<list style="symbols">
<t>
-If the buffered octet rem contains a value other than -1, the encoder outputs
- the octet (rem + b).
-Otherwise, if rem is -1, no octet is output.
-</t>
+If the buffered byte rem contains a value other than -1, the encoder outputs
+ the byte (rem + b).
+Otherwise, if rem is -1, no byte is output.
+</t>
<t>
-If ext is non-zero, then the encoder outputs ext octets---all with a value of 0
+If ext is non-zero, then the encoder outputs ext bytes---all with a value of 0
if b is set, or 255 if b is unset---and sets ext to 0.
</t>
<t>
@@ -6329,7 +6329,7 @@
ec_enc_bits() (entenc.c).
Because the raw bits may continue into the last byte output by the range coder
if there is room in the low-order bits, the encoder must be prepared to merge
- these values into a single octet.
+ these values into a single byte.
The procedure in <xref target="encoder-finalizing"/> does this in a way that
ensures both the range coded data and the raw bits can be decoded
successfully.
@@ -6384,7 +6384,7 @@
end = (end<<8) & 0x7FFFFFFF .
]]></artwork>
</figure>
-Finally, if the buffered output octet, rem, is neither zero nor the special
+Finally, if the buffered output byte, rem, is neither zero nor the special
value -1, or the carry count, ext, is greater than zero, then 9 zero bits are
sent to the carry buffer to flush it to the output buffer.
When outputting the final byte from the range coder, if it would overlap any
@@ -6963,9 +6963,9 @@
The LTP coefficients are quantized using the method described in
<xref target='ltp_quantizer_overview_section'/>, and the quantized LTP
coefficients are used to compute the LTP residual signal.
- This LTP residual signal is the input to an LPC analysis where the LPCs are
+ This LTP residual signal is the input to an LPC analysis where the LPC coefficients are
estimated using Burg's method <xref target="Burg"/>, such that the residual energy is minimized.
- The estimated LPCs are converted to a Line Spectral Frequency (LSF) vector
+ The estimated LPC coefficients are converted to a Line Spectral Frequency (LSF) vector
and quantized as described in <xref target='lsf_quantizer_overview_section'/>.
After quantization, the quantized LSF vector is converted back to LPC
coefficients using the full procedure in <xref target="silk_nlsfs"/>.
@@ -6992,7 +6992,7 @@
</t>
<section title="Burg's Method">
<t>
-The main purpose of LPC coding in SILK is to reduce the bitrate by
+The main purpose of linear prediction in SILK is to reduce the bitrate by
minimizing the residual energy.
At least at high bitrates, perceptual aspects are handled
independently by the noise shaping filter.
@@ -7523,12 +7523,12 @@
<t>In addition to indicating whether the test vector comparison passes, the opus_compare tool
outputs an "Opus quality metric" that indicates how well the tested decoder matches the
reference implementation. A quality of 0 corresponds to the passing threshold, while
-a quality of 100 means that the output of the tested decoder is identical to the reference
-implementation. The passing threshold was calibrated in such a way that it corresponds to
+a quality of 100 is the highest possible value and means that the output of the tested decoder is identical to the reference
+implementation. The passing threshold (quality 0) was calibrated in such a way that it corresponds to
additive white noise with a 48 dB SNR (similar to what can be obtained on a cassette deck).
It is still possible for an implementation to sound very good with such a low quality measure
(e.g. if the deviation is due to inaudible phase distortion), but unless this is verified by
-listening tests, it is RECOMMENDED that implementations achive a quality above 90 for 48 kHz
+listening tests, it is RECOMMENDED that implementations achieve a quality above 90 for 48 kHz
decoding. For other sampling rates, it is normal for the quality metric to be lower
(typically as low as 50 even for a good implementation) because of harmless mismatch with
the delay and phase of the internal sampling rate conversion.
@@ -7971,6 +7971,16 @@
<author initials="G." surname="Maxwell" fullname="Gregory Maxwell"><organization/></author>
</front>
<seriesInfo name="IEEE Trans. on Audio, Speech and Language Processing, Vol. 18, No. 1, pp. 58-67" value="2010" />
+</reference>
+
+
+<reference anchor="Zwicker61">
+<front>
+<title>Subdivision of the audible frequency range into critical bands</title>
+<author initials="E." surname="Zwicker" fullname="E. Zwicker"><organization/></author>
+<date month="February" year="1961" />
+</front>
+<seriesInfo name="The Journal of the Acoustical Society of America, Vol. 33, No. 2" value="p. 248" />
</reference>