Which Audio Codec to Choose So an IP Camera Records Decent Sound

Audio as a Forgotten Component of IP Video Surveillance

In the architecture of IP video surveillance, audio has historically played a secondary role. System design focused on the video stream: bitrate, resolution, storage, and network bandwidth. The audio channel was treated as an optional add-on, often enabled as an afterthought. As a result, most IP cameras and surveillance systems transmit audio at the minimum acceptable quality, using outdated codecs and conservative sampling parameters.

The situation changed with the spread of audio analytics: ASR (automatic speech recognition) and detectors for screams, gunshots, conflicts, crying babies, and other audio-dependent events. Under these conditions, audio quality is no longer a matter of convenience; it is part of the system’s functional architecture. Poor audio directly reduces analytics accuracy, complicates incident investigations, and can make archives nearly useless.

In practice, however, audio problems most often stem not from the microphone or acoustics but from the choice of audio codec, the sampling frequency, and the way audio is packaged for transport over RTSP, exposed through ONVIF, and relayed by cloud gateways.

General Architecture of the Audio Stream in an IP Camera

A typical audio processing chain in an IP camera looks as follows:

Analog microphone or MEMS microphone
Analog-to-digital converter (ADC)
Pre-processing (AGC, noise reduction, filtering)
Encoding of the audio stream with the selected codec
Multiplexing with the video stream (or, over RTP, transport as a separate synchronized stream)
Transmission via RTSP, HTTP, or a proprietary protocol
Decoding on the NVR, VMS, or client side

The key point is that the choice of codec and sampling frequency affects several layers at once: network load, compatibility with the receiving side, detector quality, and the ability to process and analyze the audio archive later.

Audio Codecs Used in IP Cameras

PCM (LPCM)

PCM is an uncompressed digital representation of an audio signal. The most common variants in cameras are 8, 16, or 24 bits at sampling rates of 8, 16, or 48 kHz.

Technical characteristics:

Bitrate scales linearly with sampling frequency and bit depth
No loss during encoding
Minimal latency

Drawbacks in network systems:

Extremely high bitrate
Significant load on the network and storage
Limited support in NVRs and cloud platforms
Issues with RTP payloads and buffering

PCM works well in laboratory and closed systems where the developer controls the entire transmission chain. In real distributed video surveillance systems, PCM often leads to unstable playback, missing audio during remote access, and compatibility issues.
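
To make the linear scaling concrete, here is a minimal sketch of the bitrate arithmetic for uncompressed PCM. The function name pcm_bitrate_kbps and the mono default are illustrative assumptions, not part of any camera API, and the figures cover the audio payload only, without RTP/UDP/IP overhead.

```python
def pcm_bitrate_kbps(sample_rate_hz: int, bit_depth: int, channels: int = 1) -> float:
    """Raw PCM payload bitrate in kbps (1 kbps = 1000 bit/s)."""
    return sample_rate_hz * bit_depth * channels / 1000


if __name__ == "__main__":
    # Combinations of the sampling rates and bit depths mentioned above.
    for rate, depth in [(8000, 16), (16000, 16), (48000, 16), (48000, 24)]:
        print(f"{rate} Hz, {depth}-bit mono: {pcm_bitrate_kbps(rate, depth):.0f} kbps")
```

Even mono 16-bit PCM at 48 kHz comes to 768 kbps, which is on the order of a low-resolution video substream.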

G.711 (A-law and μ-law)

G.711 is one of the oldest and most widely used audio codecs, originating from telephony.

Parameters:

Sampling frequency: 8 kHz
Effective bandwidth: up to 3.4 kHz
Bitrate: 64 kbps

Pros:

Near-universal support
Minimal computational load
Predictable RTP behavior

Cons:

Very limited quality
Poor suitability for analytics and ASR

G.711 remains the de facto compatibility standard, but by modern standards its quality sits at the very bottom of what is acceptable.
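
The 64 kbps figure is simply 8000 samples per second at 8 bits per sample. What makes 8 bits usable at all is logarithmic companding. The sketch below uses the continuous μ-law curve with μ = 255 to illustrate the idea; the real G.711 bitstream uses a segmented approximation of this curve, and the 127-level quantizer and all function names here are simplifying assumptions for illustration only.

```python
import math

MU = 255  # companding constant of the mu-law variant of G.711


def mulaw_compress(x: float) -> float:
    """Compress a sample in [-1.0, 1.0] with the continuous mu-law curve."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def mulaw_expand(y: float) -> float:
    """Inverse of mulaw_compress."""
    return math.copysign(((1 + MU) ** abs(y) - 1) / MU, y)


def roundtrip_companded(x: float, levels: int = 127) -> float:
    """Quantize in the companded domain (simplified 8-bit code), then expand."""
    code = round(mulaw_compress(x) * levels)
    return mulaw_expand(code / levels)


def roundtrip_linear(x: float, levels: int = 127) -> float:
    """Quantize the raw sample uniformly with the same number of levels."""
    return round(x * levels) / levels


if __name__ == "__main__":
    quiet = 0.01  # a quiet sample at 1% of full scale
    print("companded error:", abs(roundtrip_companded(quiet) - quiet))
    print("linear error:   ", abs(roundtrip_linear(quiet) - quiet))
    print("G.711 bitrate:", 8000 * 8 // 1000, "kbps")
```

Companding spends most of the code range on quiet samples, which is why quiet speech survives 8-bit quantization. It does nothing about the 3.4 kHz bandwidth ceiling, which is the real limitation for analytics.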

G.726

G.726 uses ADPCM compression and offers several bitrate modes.

Typical parameters:

Sampling frequency: 8 kHz
Bitrate: 16, 24, 32, or 40 kbps

Quality is comparable to G.711 at best and degrades at the lower bitrate modes; what G.726 gains is bandwidth, not fidelity. The codec remains narrowband and is suitable mainly for simple monitoring.
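
The bitrate modes listed above follow directly from the ADPCM code word size: G.726 emits 2, 3, 4, or 5 bits per sample at 8 kHz, against 8 bits per sample for G.711. A short sketch of that arithmetic (the constant and function names are illustrative):

```python
G726_SAMPLE_RATE_HZ = 8000
G726_BITS_PER_SAMPLE = (2, 3, 4, 5)  # ADPCM code word sizes defined by G.726


def g726_bitrates_kbps() -> list[int]:
    """Bitrate of each G.726 mode in kbps."""
    return [G726_SAMPLE_RATE_HZ * bits // 1000 for bits in G726_BITS_PER_SAMPLE]


if __name__ == "__main__":
    print("G.726 modes:", g726_bitrates_kbps(), "kbps")
    print("G.711 reference:", G726_SAMPLE_RATE_HZ * 8 // 1000, "kbps")
```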

G.722 and G.722.1

G.722 became the first widely adopted wideband speech codec.

G.722:

Sampling frequency: 16 kHz
Effective bandwidth: up to 7 kHz

G.722.1:

Improved compression at the same 7 kHz bandwidth
Lower bitrates: 24 or 32 kbps, versus 48–64 kbps for G.722

In practice, these codecs deliver good speech quality but suffer from fragmented support. Many cameras claim G.722 support but implement it with non-standard RTP profiles, leading to decoding problems in third-party VMS platforms.
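
One documented quirk that may contribute to this confusion: RFC 3551 assigns G.722 an RTP clock rate of 8000 Hz even though the audio is sampled at 16 kHz, so a correct SDP line reads G722/8000. The sketch below parses the audio rtpmap entries of an SDP description and applies that correction. The sample SDP and the function names are hypothetical and not taken from any specific camera; inspecting the SDP a camera actually returns to an RTSP DESCRIBE request is a reasonable first step when a VMS fails to decode its audio.

```python
def audio_rtpmaps(sdp: str) -> list[tuple[int, str, int]]:
    """Return (payload_type, encoding_name, rtp_clock_rate) for audio media only."""
    result = []
    in_audio = False
    for line in sdp.splitlines():
        line = line.strip()
        if line.startswith("m="):
            in_audio = line.startswith("m=audio")
        elif in_audio and line.startswith("a=rtpmap:"):
            payload_type, _, mapping = line[len("a=rtpmap:"):].partition(" ")
            name, clock, *_ = mapping.split("/")
            result.append((int(payload_type), name.upper(), int(clock)))
    return result


def actual_sample_rate_hz(name: str, rtp_clock_rate: int) -> int:
    """G.722 is sampled at 16 kHz but advertised with an 8000 Hz RTP clock (RFC 3551)."""
    return 16000 if name == "G722" else rtp_clock_rate


# Hypothetical SDP fragment, shortened for illustration.
SAMPLE_SDP = """v=0
m=video 0 RTP/AVP 96
a=rtpmap:96 H264/90000
m=audio 0 RTP/AVP 9
a=rtpmap:9 G722/8000
"""

if __name__ == "__main__":
    for pt, name, clock in audio_rtpmaps(SAMPLE_SDP):
        print(pt, name, "RTP clock:", clock, "sampled at:", actual_sample_rate_hz(name, clock))
```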

AAC (AAC-LC, HE-AAC)

AAC is the most widely supported modern codec in video surveillance.

Supported sampling rates:

8, 16, 32, 44.1, 48 kHz

Advantages:

High quality at moderate bitrates
Good noise handling
Excellent compatibility with MP4, RTSP, and HLS
Supported by virtually all modern players

AAC fits well into IP video surveillance architectures, especially when recording to MP4 and fMP4 containers.

Opus

Technically, Opus outperforms most of the codecs listed here.

Key features:

Sampling frequencies from 8 to 48 kHz
Excellent speech quality
Low latency

However, in the video surveillance industry, Opus remains exotic due to the lack of widespread support in cameras and recorders.

Sampling Frequency: Why It Matters More Than It Seems

Sampling frequency directly determines the spectrum of the transmitted audio signal and its suitability for analytics.

8 kHz

Telephone-quality audio
Suitable only for basic speech intelligibility
Performs poorly for ASR and event detectors

16 kHz

Minimum acceptable level for analytics
Significantly improved intelligibility
Optimal balance between quality and bitrate

32 kHz

Improved detail
Better performance in noisy environments
Suitable for more complex detectors

44.1 and 48 kHz

Excessive for most video surveillance tasks
Increased load on network and storage
Little to no benefit for speech

In practice, 16 or 32 kHz is the best choice for IP cameras.
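
The reasoning behind these tiers is the Nyquist limit: a sampled signal cannot carry frequencies above half the sampling frequency. The sketch below picks the lowest common rate that covers a required upper frequency. The 6 kHz requirement in the usage example is a hypothetical value, not a measured property of any detector, and the function names are illustrative.

```python
from typing import Optional

COMMON_RATES_HZ = (8000, 16000, 32000, 44100, 48000)


def nyquist_limit_hz(sample_rate_hz: int) -> float:
    """Highest frequency a given sampling rate can represent."""
    return sample_rate_hz / 2


def min_rate_for(required_hz: float) -> Optional[int]:
    """Lowest common sampling rate whose Nyquist limit covers required_hz."""
    for rate in COMMON_RATES_HZ:
        if nyquist_limit_hz(rate) >= required_hz:
            return rate
    return None  # not representable at any of the common rates


if __name__ == "__main__":
    for rate in COMMON_RATES_HZ:
        print(f"{rate} Hz -> up to {nyquist_limit_hz(rate):.0f} Hz")
    print("Hypothetical detector needing 6 kHz:", min_rate_for(6000), "Hz")
```

The usable bandwidth is somewhat below the Nyquist limit because of anti-aliasing filtering, which is why G.711 delivers 3.4 kHz from 8 kHz sampling and G.722 delivers 7 kHz from 16 kHz.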

Licensing Constraints and Legal Aspects

Free codecs

PCM
G.711
G.722
Opus
Speex

These codecs do not require royalty payments but do not always provide optimal quality or compatibility.

Patented codecs

AAC
AMR / AMR-WB

For IP cameras, the AAC license is typically included in the hardware cost, so end users face no additional legal risk. Server-side transcoders and cloud services are different: there, licensing may need to be accounted for separately.

Impact of Audio Codec Choice on Network and Storage

The codec directly affects:

RTP bitrate
Buffering behavior
Latency
Archive size
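
As a rough illustration of the latency and buffering items, the sketch below computes how much audio one codec frame or packet holds. It assumes the common 1024-sample AAC-LC frame and a 20 ms G.711 packet; both are typical values rather than settings read from a specific camera, and network jitter buffers add delay on top.

```python
AAC_LC_FRAME_SAMPLES = 1024  # common AAC-LC frame length, in samples


def frame_duration_ms(samples_per_frame: int, sample_rate_hz: int) -> float:
    """Duration of audio carried by one frame, in milliseconds."""
    return 1000 * samples_per_frame / sample_rate_hz


if __name__ == "__main__":
    for rate in (8000, 16000, 32000, 48000):
        ms = frame_duration_ms(AAC_LC_FRAME_SAMPLES, rate)
        print(f"AAC-LC at {rate} Hz: {ms:.1f} ms per frame")
    # G.711 has no fixed frame; 160 samples per packet corresponds to 20 ms.
    print(f"G.711, 160 samples: {frame_duration_ms(160, 8000):.1f} ms per packet")
```

Because the AAC frame length is fixed in samples, lowering the sampling frequency lengthens each frame in time: 64 ms at 16 kHz versus 32 ms at 32 kHz.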

AAC at 16 kHz with a bitrate of 32–64 kbps provides an optimal balance between quality and load. Using PCM or unnecessarily high sampling rates inflates traffic and archive size with no practical benefit.
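
The effect on the archive can be estimated with simple arithmetic. The sketch below converts an audio bitrate into megabytes per camera per day of continuous recording; it counts the audio payload only and ignores container overhead. The 48 kbps AAC figure is one point inside the 32–64 kbps range above, and the function name is illustrative.

```python
SECONDS_PER_DAY = 86_400


def archive_mb_per_day(bitrate_kbps: float) -> float:
    """Audio payload per camera per day, in MB (10**6 bytes)."""
    bytes_per_second = bitrate_kbps * 1000 / 8
    return bytes_per_second * SECONDS_PER_DAY / 1_000_000


if __name__ == "__main__":
    profiles = {
        "AAC 16 kHz, 48 kbps": 48,
        "G.711, 64 kbps": 64,
        "PCM 48 kHz 16-bit mono, 768 kbps": 768,
    }
    for label, kbps in profiles.items():
        print(f"{label}: {archive_mb_per_day(kbps):.1f} MB/day")
```

Multiplied by the number of cameras and the retention period, the gap between compressed audio and PCM becomes a storage planning item rather than a rounding error.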

Practical Recommendations for System Design

Avoid PCM in distributed systems
Do not use G.711 for analytics
Choose AAC as the baseline codec
Set sampling frequency to 16 or 32 kHz
Verify real codec support in the VMS and NVR
Test audio in remote access scenarios
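
The first five items of this checklist can be turned into a simple pre-deployment check. The sketch below is one possible encoding of them; the AudioConfig fields, the review_audio_config function, and the warning texts are all hypothetical and do not correspond to any VMS or ONVIF API. Remote-access testing still has to be done on a live system.

```python
from dataclasses import dataclass


@dataclass
class AudioConfig:
    codec: str                        # e.g. "AAC", "G.711", "PCM"
    sample_rate_hz: int
    distributed: bool = True          # audio crosses a WAN or cloud gateway
    used_for_analytics: bool = False


def review_audio_config(cfg: AudioConfig, vms_codecs: set[str]) -> list[str]:
    """Return warnings for settings that contradict the recommendations above."""
    warnings = []
    codec = cfg.codec.upper()
    if codec in {"PCM", "LPCM"} and cfg.distributed:
        warnings.append("PCM in a distributed system: high bitrate, playback issues")
    if codec == "G.711" and cfg.used_for_analytics:
        warnings.append("G.711 is narrowband: poor fit for analytics and ASR")
    if cfg.sample_rate_hz not in (16000, 32000):
        warnings.append("sampling frequency outside the recommended 16 or 32 kHz")
    if codec not in {c.upper() for c in vms_codecs}:
        warnings.append("codec not confirmed as supported by the VMS/NVR")
    return warnings


if __name__ == "__main__":
    cfg = AudioConfig("G.711", 8000, used_for_analytics=True)
    for warning in review_audio_config(cfg, {"G.711", "AAC"}):
        print("WARNING:", warning)
```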

Modern IP cameras support a wide range of audio codecs, a range that reflects the industry's historical layering rather than a clean evolution. When designing video surveillance systems, the choice of audio codec and sampling frequency should be treated as an architectural decision rather than a secondary setting. At present, AAC at 16 or 32 kHz remains the most balanced and predictable option for network video surveillance systems, providing acceptable quality, stability, and compatibility at every level of the system.