Audio as a Forgotten Component of IP Video Surveillance
In the architecture of IP video surveillance, audio has historically played a secondary role. System design focused on the video stream: bitrate, resolution, storage, and network bandwidth. The audio channel was treated as an optional add-on, often enabled as an afterthought. As a result, many IP cameras and surveillance systems transmit audio at the minimum acceptable quality, using outdated codecs and conservative sampling parameters.
The situation changed with the spread of video analytics, ASR (Automatic Speech Recognition), and detectors for screaming, gunshots, conflicts, baby crying, and other audio-dependent events. Under these conditions, audio quality stopped being a matter of convenience and became part of the system’s functional architecture. Poor audio directly reduces analytics accuracy, complicates incident investigations, and sharply reduces the value of the archive.
In practice, however, audio problems are most often related not to the microphone or acoustics, but to the choice of audio codec, the sampling frequency, and the way audio data is packaged for delivery over RTSP/RTP, ONVIF-based integrations, and cloud gateways.
General Architecture of the Audio Stream in an IP Camera
A typical audio processing chain in an IP camera looks as follows:
- Analog microphone or MEMS microphone
- Analog-to-digital converter (ADC)
- Pre-processing (AGC, noise reduction, filtering)
- Encoding of the audio stream with the selected codec
- Multiplexing with the video stream
- Transmission via RTSP, HTTP, or a proprietary protocol
- Decoding on the NVR, VMS, or client side
The key point is that the choice of codec and sampling frequency parameters affects several layers at once: network load, compatibility with the receiving side, detector quality, and the ability to process and analyze the audio archive later.
Audio Codecs Used in IP Cameras
PCM (LPCM)
PCM is an uncompressed digital representation of an audio signal. The most common variants in cameras are 8, 16, or 24 bits at sampling rates of 8, 16, or 48 kHz.
Technical characteristics:
- Bitrate scales linearly with sampling frequency and bit depth
- No loss during encoding
- Minimal latency
Drawbacks in network systems:
- Extremely high bitrate
- Significant load on the network and storage
- Limited support in NVRs and cloud platforms
- Issues with RTP payloads and buffering
PCM works well in laboratory and closed systems where the developer controls the entire transmission chain. In real distributed video surveillance systems, PCM often leads to unstable playback, missing audio during remote access, and compatibility issues.
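To make the linear scaling concrete, the following sketch computes the raw PCM bitrate and the audio payload carried by a single packet. The 20 ms packet interval and the function names are illustrative assumptions, not values taken from any particular camera, and RTP/UDP/IP header overhead is ignored.

```python
def pcm_bitrate_bps(sample_rate_hz: int, bit_depth: int, channels: int = 1) -> int:
    """Raw PCM bitrate in bits per second (no RTP/UDP/IP overhead)."""
    return sample_rate_hz * bit_depth * channels


def pcm_payload_bytes(sample_rate_hz: int, bit_depth: int,
                      channels: int = 1, packet_ms: int = 20) -> int:
    """Audio payload bytes in one packet; packet_ms = 20 is an assumed interval."""
    samples = sample_rate_hz * packet_ms // 1000
    return samples * (bit_depth // 8) * channels


if __name__ == "__main__":
    for rate, depth in [(8000, 16), (16000, 16), (48000, 16), (48000, 24)]:
        kbps = pcm_bitrate_bps(rate, depth) / 1000
        payload = pcm_payload_bytes(rate, depth)
        print(f"{rate} Hz, {depth}-bit mono: {kbps:.0f} kbps, "
              f"{payload} payload bytes per 20 ms")
```

Under these assumptions, 16-bit mono at 48 kHz already comes to 768 kbps and 1920 payload bytes per 20 ms packet, which does not fit into a typical 1500-byte Ethernet MTU without shorter packets or fragmentation.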
G.711 (A-law and μ-law)
G.711 is one of the oldest and most widely used audio codecs, originating from telephony.
Parameters:
- Sampling frequency: 8 kHz
- Effective bandwidth: up to 3.4 kHz
- Bitrate: 64 kbps
Pros:
- Near-universal support
- Minimal computational load
- Predictable RTP behavior
Cons:
- Very limited quality
- Poor suitability for analytics and ASR
G.711 remains the de facto compatibility standard, but by modern requirements its quality sits at the lower limit of acceptability.
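The 64 kbps figure follows from 8000 samples per second at 8 bits each, and the 8-bit samples are companded so that quiet sounds get finer resolution than loud ones. The sketch below uses the continuous μ-law formula with μ = 255 and a plain uniform quantizer. It is a simplified illustration of the idea, not the bit-exact segmented encoding specified by G.711, and the function names are hypothetical.

```python
import math

MU = 255                  # mu-law companding parameter
G711_BITRATE = 8000 * 8   # 8 kHz sampling x 8 bits per sample = 64000 bps


def mu_law_compress(x: float) -> float:
    """Map a linear sample in [-1, 1] to the companded domain [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def mu_law_expand(y: float) -> float:
    """Inverse of mu_law_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


def quantize_8bit(y: float) -> int:
    """Uniformly quantize a companded value in [-1, 1] to a code in 0..255.

    Simplification: real G.711 uses a segmented (piecewise-linear) code.
    """
    return min(255, int((y + 1.0) / 2.0 * 256))


def dequantize_8bit(code: int) -> float:
    """Return the companded value at the centre of the given 8-bit code."""
    return (code + 0.5) / 256 * 2.0 - 1.0


if __name__ == "__main__":
    for x in (0.001, 0.01, 0.1, 0.9):
        code = quantize_8bit(mu_law_compress(x))
        restored = mu_law_expand(dequantize_8bit(code))
        print(f"in={x:<6} code={code:<4} out={restored:.5f} "
              f"abs_error={abs(restored - x):.5f}")
```

Running the loop shows the absolute error growing with signal level, which is the intended trade-off for telephone speech but leaves little headroom for analytics.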
G.726
G.726 uses ADPCM compression and offers several bitrate modes.
Typical parameters:
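- Sampling frequency: 8 kHz
- Bitrate: 16, 24, 32, or 40 kbps
At 32–40 kbps, quality approaches that of G.711, and at lower bitrates it degrades noticeably. The gain is reduced bandwidth rather than better audio, so fundamentally the situation does not change. The codec remains narrowband and is suitable mainly for simple monitoring.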
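The bitrate modes of G.726 correspond to the number of bits spent on each 8 kHz sample (2, 3, 4, or 5). A minimal sketch of that arithmetic, with hypothetical names, compared against the 64 kbps of G.711:

```python
SAMPLE_RATE_HZ = 8000     # G.726 operates on narrowband 8 kHz audio
G711_BITRATE_BPS = 64000  # reference point: 8 bits per sample


def adpcm_bitrate_bps(bits_per_sample: int) -> int:
    """Bitrate of an ADPCM mode that spends bits_per_sample bits per sample."""
    return SAMPLE_RATE_HZ * bits_per_sample


# Bits per sample -> bitrate for the four G.726 modes.
G726_MODES = {bits: adpcm_bitrate_bps(bits) for bits in (2, 3, 4, 5)}

if __name__ == "__main__":
    for bits, bps in G726_MODES.items():
        share = bps / G711_BITRATE_BPS
        print(f"{bits} bits/sample: {bps // 1000} kbps "
              f"({share:.0%} of the G.711 bitrate)")
```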
G.722 and G.722.1
G.722 became the first widely adopted wideband speech codec.
G.722:
- Sampling frequency: 16 kHz
- Effective bandwidth: up to 7 kHz
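G.722.1:
- Transform-based compression that is more efficient than the sub-band ADPCM used in G.722
- Lower bitrates (24 and 32 kbps) at the same 16 kHz sampling frequency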
In practice, these codecs deliver good speech quality but suffer from fragmented support. Many cameras claim G.722 support but implement it with non-standard RTP profiles, leading to decoding problems in third-party VMS platforms.
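One well-known source of confusion, whether or not it is the cause in a given camera, is that RFC 3551 registers G.722 with an RTP clock rate of 8000 Hz even though the codec samples at 16 kHz. A receiver that takes the clock rate from the SDP at face value will misinterpret the stream. The sketch below parses an `a=rtpmap` line and corrects for that exception. The function name and the exception table are assumptions made for illustration.

```python
import re

# Encodings whose RTP clock rate differs from the actual sampling rate.
# G722 is registered with an 8000 Hz clock (RFC 3551) but samples at 16000 Hz.
CLOCK_RATE_EXCEPTIONS = {"G722": 16000}

RTPMAP_RE = re.compile(r"^a=rtpmap:(\d+)\s+([^/\s]+)/(\d+)(?:/(\d+))?\s*$")


def parse_rtpmap(line: str) -> dict:
    """Parse an SDP rtpmap attribute and derive the real audio sampling rate."""
    match = RTPMAP_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not an rtpmap attribute: {line!r}")
    payload_type, encoding, clock_rate, channels = match.groups()
    clock = int(clock_rate)
    return {
        "payload_type": int(payload_type),
        "encoding": encoding,
        "rtp_clock_hz": clock,
        "sample_rate_hz": CLOCK_RATE_EXCEPTIONS.get(encoding.upper(), clock),
        "channels": int(channels) if channels else 1,
    }


if __name__ == "__main__":
    for sdp_line in ("a=rtpmap:0 PCMU/8000", "a=rtpmap:9 G722/8000"):
        print(parse_rtpmap(sdp_line))
```

A camera that advertises `G722/16000` instead deviates from the registered form, and a strict third-party decoder may then reject the stream or play it at the wrong speed.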
AAC (AAC-LC, HE-AAC)
AAC is the most universal modern codec used in video surveillance.
Supported sampling rates:
- 8, 16, 32, 44.1, 48 kHz
Advantages:
- High quality at moderate bitrates
- Good noise handling
- Excellent compatibility with MP4, RTSP, and HLS
- Supported by virtually all modern players
AAC fits optimally into IP video surveillance architectures, especially when using MP4 and fMP4 containers.
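When AAC is carried over RTSP using the `mpeg4-generic` payload format (RFC 3640), the SDP usually includes a `config=` parameter holding the AudioSpecificConfig in hex. Decoding its first two bytes shows which profile, sampling rate, and channel layout the camera actually sends, which may differ from what the web interface claims. The sketch below is a simplified parser: it reads only the first 13 bits, ignores extension fields and escape values, and the function name is hypothetical.

```python
# MPEG-4 sampling frequency index table (index -> Hz).
AAC_SAMPLE_RATES = [96000, 88200, 64000, 48000, 44100, 32000, 24000,
                    22050, 16000, 12000, 11025, 8000, 7350]

# Only the object types discussed in the text are named here.
AAC_OBJECT_TYPES = {2: "AAC-LC", 5: "SBR (HE-AAC)"}


def parse_audio_specific_config(config_hex: str) -> dict:
    """Decode object type, sample rate, and channels from a config= hex string."""
    data = bytes.fromhex(config_hex)
    if len(data) < 2:
        raise ValueError("AudioSpecificConfig needs at least two bytes")
    bits = int.from_bytes(data[:2], "big")
    object_type = bits >> 11            # first 5 bits
    freq_index = (bits >> 7) & 0x0F     # next 4 bits
    channel_config = (bits >> 3) & 0x0F  # next 4 bits
    if object_type == 31 or freq_index >= len(AAC_SAMPLE_RATES):
        raise ValueError("escape values are not handled in this sketch")
    return {
        "object_type": object_type,
        "profile": AAC_OBJECT_TYPES.get(object_type, "other"),
        "sample_rate_hz": AAC_SAMPLE_RATES[freq_index],
        "channels": channel_config,
    }


if __name__ == "__main__":
    # Example config values chosen for illustration.
    for config in ("1408", "1210"):
        print(config, parse_audio_specific_config(config))
```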
Opus
Opus is technically superior to most of the codecs listed above and covers everything from narrowband speech to fullband audio in a single codec.
Key features:
- Wide range of sampling frequencies (8 to 48 kHz)
- Excellent speech quality
- Low latency
However, in the video surveillance industry, Opus remains exotic due to the lack of widespread support in cameras and recorders.
Sampling Frequency: Why It Matters More Than It Seems
Sampling frequency directly determines the spectrum of the transmitted audio signal and its suitability for analytics.
8 kHz
- Telephone-quality audio
- Suitable only for basic speech intelligibility
- Performs poorly for ASR and event detectors
16 kHz
- Minimum acceptable level for analytics
- Significantly improved intelligibility
- Optimal balance between quality and bitrate
32 kHz
- Improved detail
- Better performance in noisy environments
- Suitable for more complex detectors
44.1 and 48 kHz
- Excessive for most video surveillance tasks
- Increased load on network and storage
- Little to no benefit for speech
In practice, 16 or 32 kHz is the optimal choice for most IP cameras.
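The link between sampling frequency and usable spectrum is the Nyquist limit: a stream sampled at a given rate can only represent components below half that rate. The short sketch below tabulates the limit for the rates discussed above. The 6 kHz test component is an arbitrary illustrative value, not a measured property of any particular sound.

```python
SAMPLE_RATES_HZ = (8000, 16000, 32000, 44100, 48000)


def nyquist_hz(sample_rate_hz: int) -> float:
    """Highest frequency that can be represented at the given sampling rate."""
    return sample_rate_hz / 2


def can_represent(sample_rate_hz: int, component_hz: float) -> bool:
    """True if a component at component_hz lies below the Nyquist limit."""
    return component_hz < nyquist_hz(sample_rate_hz)


if __name__ == "__main__":
    component_hz = 6000  # illustrative assumption
    for rate in SAMPLE_RATES_HZ:
        print(f"{rate} Hz sampling: spectrum up to {nyquist_hz(rate):.0f} Hz, "
              f"captures {component_hz} Hz: {can_represent(rate, component_hz)}")
```

In real codecs the effective bandwidth is somewhat lower than the Nyquist limit because of anti-aliasing filters, which is why G.711 at 8 kHz delivers about 3.4 kHz rather than 4 kHz.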
Licensing Constraints and Legal Aspects
Free codecs
- PCM
- G.711
- G.722
- Opus
- Speex
These codecs do not require royalty payments but do not always provide optimal quality or compatibility.
Patented codecs
- AAC
- AMR / AMR-WB
In the case of IP cameras, AAC licensing is typically included in the hardware cost. For end users, this creates no additional legal risks, unlike server-side transcoders or cloud services, where licensing may require separate accounting.
Impact of Audio Codec Choice on Network and Storage
The codec directly affects:
- RTP bitrate
- Buffering behavior
- Latency
- Archive size
AAC at 16 kHz with a bitrate of 32–64 kbps provides an optimal balance between quality and load. Using PCM or unnecessarily high sampling rates inflates traffic and archive size without a corresponding benefit.
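The effect on the archive is simple arithmetic: a constant bitrate multiplied by recording time. The sketch below estimates audio-only archive size per camera per day. It assumes continuous 24-hour recording at a constant bitrate and ignores container and packet overhead. The profile list and names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Bitrates in kbps; the PCM value is 48000 Hz x 16 bits x 1 channel.
PROFILES_KBPS = {
    "PCM 48 kHz 16-bit mono": 768,
    "G.711": 64,
    "AAC 16 kHz at 64 kbps": 64,
    "AAC 16 kHz at 32 kbps": 32,
}


def archive_megabytes_per_day(bitrate_kbps: float) -> float:
    """Audio archive size in megabytes (10^6 bytes) for one day of recording."""
    bytes_per_second = bitrate_kbps * 1000 / 8
    return bytes_per_second * SECONDS_PER_DAY / 1e6


if __name__ == "__main__":
    for name, kbps in PROFILES_KBPS.items():
        print(f"{name}: {archive_megabytes_per_day(kbps):.1f} MB per day")
```

Under these assumptions, a 64 kbps stream takes about 691 MB per camera per day, while uncompressed 48 kHz PCM takes roughly twelve times as much.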
Practical Recommendations for System Design
- Avoid PCM in distributed systems
- Do not use G.711 for analytics
- Choose AAC as the baseline codec
- Set sampling frequency to 16 or 32 kHz
- Verify real codec support in the VMS and NVR
- Test audio in remote access scenarios
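Verifying what a camera really sends can be partly automated. The sketch below assumes that `ffprobe` (from FFmpeg) is installed and that the stream URL is reachable. It reads the first audio stream's parameters and compares them with the recommendations above. The allowed sets and function names are assumptions encoding this article's policy, not a standard.

```python
import json
import subprocess
from typing import List

ALLOWED_CODECS = {"aac"}               # assumption: baseline codec policy
ALLOWED_SAMPLE_RATES = {16000, 32000}  # assumption: recommended rates in Hz


def probe_audio(url: str) -> dict:
    """Return ffprobe's description of the first audio stream at url."""
    result = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "a:0",
         "-show_entries", "stream=codec_name,sample_rate,channels",
         "-of", "json", url],
        capture_output=True, text=True, check=True, timeout=30,
    )
    streams = json.loads(result.stdout).get("streams", [])
    if not streams:
        raise RuntimeError("no audio stream found")
    return streams[0]


def check_audio(stream: dict) -> List[str]:
    """Compare stream parameters with the policy; return a list of problems."""
    problems = []
    codec = stream.get("codec_name")
    # ffprobe reports sample_rate as a string in its JSON output.
    rate = int(stream.get("sample_rate", 0))
    if codec not in ALLOWED_CODECS:
        problems.append(f"codec {codec!r} is not in {sorted(ALLOWED_CODECS)}")
    if rate not in ALLOWED_SAMPLE_RATES:
        problems.append(f"sample rate {rate} Hz is not in "
                        f"{sorted(ALLOWED_SAMPLE_RATES)}")
    return problems


if __name__ == "__main__":
    # Sample ffprobe-style records for illustration; no camera is contacted.
    print(check_audio({"codec_name": "aac", "sample_rate": "16000", "channels": 1}))
    print(check_audio({"codec_name": "pcm_mulaw", "sample_rate": "8000", "channels": 1}))
```

Running the same check against the stream as delivered through the NVR, VMS, or remote-access path, rather than only directly from the camera, shows whether audio is being transcoded or dropped along the way.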
Modern IP cameras support a wide range of audio codecs that reflect not a clean evolution, but the historical layers of the industry. When designing video surveillance systems, the choice of audio codec and sampling frequency should be treated as an architectural decision rather than a secondary setting. At present, AAC at 16 or 32 kHz remains the most balanced and predictable option for network video surveillance systems, providing acceptable quality, stability, and compatibility from the camera through the network to the recorder and client.