Video streaming involves several protocols and transport options, and the combinations are confusing at first sight. This section provides an overview. For in-depth discussions, refer to the links provided; from there you can visit the RFCs themselves if you want.
Let's start with the individual protocols that form the basis for video transport mechanisms.
Protocols
RTP
The Real-time Transport Protocol is used to transmit data together with timing information. That timing information is crucial: it enables the receiver to do a time-accurate reconstruction of the media. This matters because on an IP network some packets travel faster than others. You can't simply play the media as it is received, because the video would stutter and the audio would be unusable.
RTP originates from VoIP, which is probably why all the useful functions for RTP analysis in Wireshark are found in a menu called "Telephony". But it is used for video too, and even for plain data: RTP is useful for any data with timestamps.
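To make the timestamp story concrete, below is a minimal sketch (in Python, not a complete implementation) that unpacks the fixed 12-byte RTP header of a received packet; the field layout comes from RFC 3550.

```python
import struct

def parse_rtp_header(packet: bytes) -> dict:
    """Unpack the fixed 12-byte RTP header (RFC 3550)."""
    if len(packet) < 12:
        raise ValueError("too short to be an RTP packet")
    b0, b1, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    return {
        "version": b0 >> 6,          # should be 2
        "payload_type": b1 & 0x7F,   # refers back to the SDP rtpmap entry
        "marker": bool(b1 & 0x80),   # e.g. set on the last packet of a video frame
        "sequence": seq,             # detects loss and reordering
        "timestamp": timestamp,      # media clock, e.g. 90 kHz for video
        "ssrc": ssrc,                # identifies the sender
    }
```

The timestamp field is what lets the receiver reconstruct the original timing, regardless of when each packet happened to arrive.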
RTSP
The Real Time Streaming Protocol is used to set up and manage RTP streams between a source and a destination. What SIP (Session Initiation Protocol) is to telephony, RTSP is to IP cameras: both manage RTP streams. Compared to SIP, RTSP is a lot simpler. Still, several things are taken care of in this protocol:
- Media negotiation: the source describes its media format, so that the destination knows how to decode it (see: SDP)
- Transport negotiation: the destination tells the source what transport should be used
- Transport setup: source and destination exchange details (like port numbers) to set up that transport
- Playback control: the destination controls the source like a media server: play, pause, go to a specific time
- Transport: optionally, the RTSP connection itself can be used to transmit the media stream. This is probably why it is called the 'Streaming' and not the 'Setup' protocol.
A camera can be seen as an RTSP media server, waiting for commands on TCP port 554. Without commands it doesn't do a thing: an RTSP camera never starts streaming to a unicast destination on its own initiative. The protocol doesn't work like that. The camera waits until a stream is pulled by a client. Only then does it start to deliver, and only as long as the receiver repeatedly confirms it is still listening (keep-alive).
Some cameras do have a special function to unconditionally start multicasting after power-up. This does not involve RTSP.
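As an illustration of that pull model, here is a rough sketch of the first step a client takes: asking the camera to describe its stream. The address and path are made up, many cameras will answer 401 and require (Digest) authentication, and a real client would continue with SETUP, PLAY and periodic keep-alives.

```python
import socket

CAMERA = "192.168.1.64"            # hypothetical camera address
URL = f"rtsp://{CAMERA}/stream1"   # hypothetical stream path

request = (
    f"DESCRIBE {URL} RTSP/1.0\r\n"
    "CSeq: 1\r\n"
    "Accept: application/sdp\r\n"
    "\r\n"
)

with socket.create_connection((CAMERA, 554), timeout=5) as s:
    s.sendall(request.encode("ascii"))
    reply = s.recv(65535).decode("ascii", errors="replace")

print(reply)   # status line, headers and, if allowed, the SDP body
```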
SDP
The Session Description Protocol is a text format that describes a media stream in detail. One way of looking at RTSP is as an 'SDP exchange mechanism'. Once a receiver has the SDP, it knows the parameters required to decode the stream.
RTSP is one way to get the SDP; there can be others. Take for example a camera transmitting a multicast stream while not being unicast-reachable by the receiver, so there can't be an RTSP exchange. When the receiver obtains the SDP through other means (e.g. copying it as a file), it is still able to decode the stream. VLC, for example, supports this.
Note that this is no longer a good example of how multicast works or should work; more about that on the multicast page. It was just an attempt to explain the purpose of SDP.
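As an impression of what such a description looks like, here is a hand-written SDP for a hypothetical H.264 multicast stream, plus a few lines that pick out what a decoder needs; the addresses and payload number are invented.

```python
# Hand-written SDP for a hypothetical H.264 multicast stream.
SDP = """\
v=0
o=- 0 0 IN IP4 192.168.1.64
s=Example camera stream
c=IN IP4 239.1.2.3/32
t=0 0
m=video 5000 RTP/AVP 96
a=rtpmap:96 H264/90000
a=recvonly
"""

for line in SDP.splitlines():
    if line.startswith("m=video"):
        # media type, UDP port, profile and dynamic payload type
        _, port, _, payload_type = line[2:].split()
        print("video on port", port, "with payload type", payload_type)
    elif line.startswith("a=rtpmap:"):
        # maps that payload type to a codec name and clock rate
        print("codec mapping:", line[len("a=rtpmap:"):])
```

Saved as a .sdp file, a player such as VLC can open a description like this directly, which is the 'other means' mentioned above.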
RTCP
The Real Time Control Protocol is used for the runtime exchange of performance data. Cameras send Sender Reports, and receivers send Receiver Reports. The most important parts are keepalive and clock drift correction. All participants (typically a source and a destination, but with multicast there can be more) exchange their time information, so they learn how much they drift from each other. This allows the decoder to correct for that drift and play the media accurately.
RTCP covers many things. In theory a source could lower its quality when the Receiver Reports indicate packet loss. But most RTCP implementations are rather incomplete and only take care of keepalive and clock drift. Keepalive can also be done over RTSP, which 'should' be supported by the server.
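The key ingredient for drift correction is the Sender Report: it pairs an NTP wall-clock timestamp with the RTP media timestamp of the same instant. A sketch of unpacking that pair (packet type 200, layout per RFC 3550):

```python
import struct

def parse_sender_report(packet: bytes) -> dict:
    """Unpack the fixed part of an RTCP Sender Report (RFC 3550, PT=200)."""
    (b0, pt, length, ssrc,
     ntp_msw, ntp_lsw, rtp_ts,
     packets_sent, octets_sent) = struct.unpack("!BBHIIIIII", packet[:28])
    if pt != 200:
        raise ValueError("not a Sender Report")
    return {
        "ssrc": ssrc,
        "ntp_time": ntp_msw + ntp_lsw / 2**32,  # wall clock (seconds since 1900)
        "rtp_timestamp": rtp_ts,                # media clock at that same instant
        "packets_sent": packets_sent,
        "octets_sent": octets_sent,
    }
```

A receiver that collects a few of these pairs can estimate how fast the sender's clock runs relative to its own and compensate accordingly.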
HTTP
Hyper Text Transfer Protocol, web language number one, wasn't designed for media transport, just as it wasn't designed for a lot of other purposes it now serves. But in practice it gets used for everything. As HTTP became ubiquitous, benefits emerged that make it worth trying to squeeze timestamped streaming media into it: passing firewalls, passing HTTP proxies, having encryption. In short: getting a stream across the internet. There is a small document from Apple describing how to do it. It is very clever, it is an afterthought, and it is a kludge.
The above refers to 'tunneling' a real RTP stream through HTTP. There is another use of HTTP for video, probably the original way of 'streaming' video on the internet: a never-ending HTTP response with a series of JPEG images inside. Because it is so basic and widely supported by web browsers, it is still supported by many devices. There are drawbacks. First, JPEG is not efficient at all for video. Second, a more purist argument: even though it looks like video, it isn't. It is a series of pictures. What is the difference? Timestamps. An RTP stream contains capture timestamps that allow accurate playback; JPEG-over-HTTP doesn't, and the timestamps inside JPEG images lack sufficient resolution. Streaming JPEG with timestamps does exist: it is called MJPEG, and MJPEG can be sent over RTP. MJPEG is not JPEG-over-HTTP. One can do MJPEG over HTTP, but then it is RTP-over-RTSP-over-HTTP with MJPEG payload. This part is especially confusing and is discussed further below under transports.
NTP
Unrelated to streaming itself is the Network Time Protocol. For proper rendering, camera and receiver must have synchronised clocks. The clock inside most electronics doesn't tick very accurately, so external synchronisation is required. Note that the actual time isn't important; the devices just need to have the same time.
It isn't common for NTP-synchronised devices to achieve the level of synchronisation advocated in the protocol. For best accuracy, devices inside a LAN should synchronise with a local NTP source, to minimise variations in round-trip times. Every PC can act as a time source, and that system can in turn be synchronised with an external NTP source. So part of properly deploying a video system is installing a time server and synchronising all equipment with it. For forensic purposes you want the local time server to synchronise with an external one, like the NTP Pool.
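As a quick sanity check of such a setup, the offset between a machine and the local time server can be queried. The sketch below assumes the third-party ntplib package and a made-up server address; tools like ntpq or chronyc report the same information.

```python
import ntplib  # third-party package: pip install ntplib

client = ntplib.NTPClient()
response = client.request("192.168.1.10", version=3)  # hypothetical local time server

# Offset between this machine's clock and the time server, in seconds.
print(f"clock offset: {response.offset * 1000:.1f} ms")
```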
Transports
In this section I assume you know the two main means of IP transport: UDP and TCP. Follow the link if you don't.
RTP-UDP
Streaming RTP over UDP is the original intent of the protocol. RTP is designed to gracefully handle the lack of error correction at the UDP level. This is an efficient way of transmitting media. In a well-designed network the lack of error correction is not a problem, as errors don't occur often. For live streams, when errors do happen, what we want is to see the next frame, not a retry of aging data. So UDP is fine. The RTP packets are designed to fit nicely in the packet size offered by the underlying network. Lost data is handled efficiently, as it is easy to get back on track again using the sequence number in the RTP header, present in every packet. There will be a temporary glitch and all is well again.

UDP streams are lightweight for most hardware. An added benefit is that most switches prioritize UDP over TCP traffic, which acts as an implicit Quality of Service when competing against, for example, webpage traffic. UDP implies quick delivery of packets, which translates into a jitter-free stream that can be rendered with low latency.
Multicast can only be done using RTP-UDP streaming.
For UDP streams, the RTCP data is typically sent on the stream's UDP port + 1.
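To show how 'getting back on track' works, the sketch below watches the 16-bit sequence numbers and reports gaps; a real player would also handle reordering more carefully.

```python
def packets_lost(prev_seq: int, seq: int) -> int:
    """Packets missing between two successive RTP sequence numbers (16-bit wrap)."""
    return (seq - prev_seq - 1) % 65536

prev = None
for seq in (1000, 1001, 1004, 1005, 65535, 0):  # example sequence numbers
    if prev is not None and packets_lost(prev, seq):
        print(f"gap after {prev}: {packets_lost(prev, seq)} packet(s) lost, glitch and move on")
    prev = seq
```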
RTP-TCP
Streaming RTP over TCP has many names: RTP-over-RTSP, RTSP interleaving or RTSP tunneling. It's all the same. In this method the RTP packets are sent over the TCP connection of the RTSP session. Special bytes separate the media, the RTSP commands and the RTCP control data: these three types of data are 'interleaved' on the same connection.
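The interleaved framing itself is simple: each chunk starts with a '$' byte, a one-byte channel number (agreed during SETUP) and a two-byte length, followed by one RTP or RTCP packet. A minimal sketch of splitting such a byte stream:

```python
import struct

def split_interleaved(buf: bytes):
    """Yield (channel, payload) pairs from RTSP interleaved data (RFC 2326)."""
    pos = 0
    while pos + 4 <= len(buf):
        if buf[pos] != 0x24:          # not '$': a plain-text RTSP message follows instead
            break
        channel, length = struct.unpack("!BH", buf[pos + 1:pos + 4])
        payload = buf[pos + 4:pos + 4 + length]
        if len(payload) < length:     # incomplete chunk, wait for more data
            break
        yield channel, payload        # by convention: even channel RTP, odd channel RTCP
        pos += 4 + length
```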
Advantages of RTP-TCP are:
- Error correction
- Easier to pass firewalls and/or routers
The error correction can help to get a stream across an unreliable link, as long as the frequency of errors is limited. When there are too many errors, the number of retries rises with it, increasing the use of network capacity and worsening the problem.
There are disadvantages as well:
- Higher overhead for the devices, although on modern embedded CPUs this doesn't count for much anymore
- Switches sometimes cluster TCP packets for efficiency reasons. When a stream is small this may lead to, for example, a handful of packets with 160 milliseconds worth of video clustered and delivered at once. This obviously adds to jitter and latency.
- No multicast
With TCP error recovery in place, the RTP error resilience against losing individual packets isn't needed anymore. Work can be saved by not cutting the payload into thousands of small RTP packets, each with its own header. The result is a non-standard stream with far fewer, but huge, RTP packets. On Axis cameras this is enabled by sending the parameter:
Blocksize: 65535
in the RTSP SETUP command. This is implemented by llv.
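What such a request could look like on the wire is sketched below; the URL and CSeq are made up, and the Transport header requests interleaved delivery over the RTSP connection.

```python
# Sketch of an interleaved SETUP request carrying the Blocksize parameter.
setup_request = (
    "SETUP rtsp://192.168.1.64/stream1/trackID=1 RTSP/1.0\r\n"
    "CSeq: 3\r\n"
    "Transport: RTP/AVP/TCP;unicast;interleaved=0-1\r\n"
    "Blocksize: 65535\r\n"   # ask the server for media packets up to 64 kB
    "\r\n"
)
```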
HTTP(S) tunneling
An RTP-over-RTSP stream can in turn be tunneled over HTTP, yielding an RTP-over-RTSP-over-HTTP stream. It brings all the advantages and disadvantages of RTP-TCP, combined with the benefits of HTTP already listed above: passing firewalls and proxies. Additionally, it is a small step to RTP-over-RTSP-over-HTTP-over-TLS, making all the protection layers of HTTP available to the RTSP stream too.
Overhead on the device is obviously even higher, with a weird bypass between the HTTP server and the RTSP server.
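For reference, the Apple mechanism ties two HTTP connections together with a session cookie: the client reads the stream from a long-lived GET response and writes base64-encoded RTSP commands into a never-finishing POST body. A rough sketch of the two requests, with a made-up cookie and path:

```python
import base64

COOKIE = "8e6a3f2b"  # made-up value; must be identical on both connections

# Connection 1: the server keeps this GET response open and streams into it.
get_request = (
    "GET /stream1 HTTP/1.0\r\n"
    f"x-sessioncookie: {COOKIE}\r\n"
    "Accept: application/x-rtsp-tunnelled\r\n"
    "\r\n"
)

# Connection 2: RTSP commands travel base64-encoded in an open-ended POST body.
rtsp_command = "OPTIONS rtsp://192.168.1.64/stream1 RTSP/1.0\r\nCSeq: 1\r\n\r\n"
post_request = (
    "POST /stream1 HTTP/1.0\r\n"
    f"x-sessioncookie: {COOKIE}\r\n"
    "Content-Type: application/x-rtsp-tunnelled\r\n"
    "Content-Length: 32767\r\n"   # deliberately large; the body never really ends
    "\r\n"
) + base64.b64encode(rtsp_command.encode("ascii")).decode("ascii")
```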
RTSPS, SRTP
RTSP and RTP bring their own encryption mechanisms too. Encrypting media is a more complex subject than encrypting HTTP because of the different types of transport. For example, on a UDP connection packets may easily be lost, and this should not affect the decryption on the client side.
There is a simple case: for RTP-over-RTSP, with all data over a single connection, a straightforward TLS layer can be applied. You get RTP-over-RTSP-over-TLS, commonly abbreviated as RTSPS. Some devices support this, but in other cases you get more than just RTSP inside TLS: you get SRTP as well. This is for example the case with Axis devices.
SRTP, Secure RTP, deals with the aspect mentioned above: encrypting RTP data over UDP. It can do more: it also enables multicast connections, where multiple clients need to possess the same key. Additionally, SRTP brings the benefit that the client can store the keys, so that it does not need to decrypt immediately. This allows large-scale recording of video that is typically never watched, so decryption is normally not necessary. With TLS, immediate decryption is mandatory because the symmetric key protecting the payload is negotiated on the fly. With SRTP the client is in full control of the keys, to allow for multicast and this "no decryption when not necessary" use case. The drawback is that the SRTP client has to do this work: it must implement key distribution logic itself; the encryption layer can't "just deal" with the encryption like it does on TLS-based connections.
So, to encrypt an RTP connection without complex key management, HTTPS tunneling is the way to go: it essentially means plain RTSP inside TLS, without the dependency on SRTP that RTSPS has.
Wrap up
This page discussed the five protocols related to video streaming: RTP, RTSP, RTCP, SDP and HTTP. After that we discussed the streaming itself: it always uses RTP, either over UDP directly or wrapped in one of the TCP-based protocols. When to choose one over the other? It depends on the use case.
- Large-scale, low-latency video in a LAN likely benefits from RTP-UDP. Especially when multicast is needed: then it is the only way
- Smaller networks with aspects suboptimal for video, like video combined with lots of other traffic, limited uplinks (read: wireless) or low-quality cabling, may benefit from one of the tunneling methods because they bring error correction
- A stream across the internet most likely needs HTTP tunneling
- We had a short look at encrypted streams using HTTPS, RTSPS and SRTP
JPEG
Having discussed all the protocols and transports, it makes sense to briefly touch on JPEG again. This compression causes confusion because it was already around before RTP existed.
- JPEG-over-HTTP. A never-ending list of JPEG images in an eternal HTTP response, violating, or at least stretching a bit, the HTTP standard. This method predates everything else. Because every stone-age browser supports HTTP and JPEG, it is the only method near-guaranteed to work everywhere except Internet Explorer.
- MJPEG. JPEG-encoded video frames over RTP. To reduce the size a little, some headers are removed. An MJPEG frame is not a JPEG picture, as this header is lacking, though it can be reconstructed. Being an RTP stream, it can be tunneled (over RTSP) over HTTP.
So there are two ways of sending JPEG-encoded material over HTTP: plain JPEG-over-HTTP, and RTP-over-RTSP-over-HTTP with MJPEG payload. For purists, the first one is a series of pictures and the second one is video (using an inefficient compression).
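To make the first variant concrete, here is a rough sketch of reading JPEG-over-HTTP: the never-ending response is a multipart stream (typically multipart/x-mixed-replace) in which every part is a complete JPEG. The URL is invented, and instead of parsing the multipart boundary the sketch simply scans for the JPEG start and end markers.

```python
import urllib.request

URL = "http://192.168.1.64/mjpg/video.mjpg"  # hypothetical camera URL

def jpeg_over_http(url):
    """Yield complete JPEG images from a never-ending multipart HTTP response."""
    stream = urllib.request.urlopen(url)
    buf = b""
    while True:
        buf += stream.read(4096)
        start = buf.find(b"\xff\xd8")            # JPEG start-of-image marker
        end = buf.find(b"\xff\xd9", start + 2)   # JPEG end-of-image marker
        if start != -1 and end != -1:
            yield buf[start:end + 2]             # one picture; note: no capture timestamp
            buf = buf[end + 2:]

# Example: save the first frames to disk.
# for i, frame in enumerate(jpeg_over_http(URL)):
#     open(f"frame_{i:05d}.jpg", "wb").write(frame)
```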