Unfinished
Latency is the time delay of a live video image: the display of a scene always lags behind the real scene. The delay can vary, from small, say 100 milliseconds, to large, like two seconds or more. Latency is the only area where digital literally lags behind analog. Why does latency occur and how can it be minimized? There's a lot to tell about latency.
What contributes to latency?
Camera | A camera needs time to process an image and encode it. On recent hardware, say post-2015, this takes well under 100 milliseconds for a 1080p stream. Newer hardware is not necessarily faster, because the extra processing power is used for further image improvement. Latency may depend on scene complexity and on the overall workload of the camera. This may also cause the latency to vary from frame to frame, which we call jitter. Jitter is explained in more detail later on.
Network | At each hop in the network, e.g. a switch or router, packets may sit in memory for a while before they are sent out to the next hop. The added latency using UDP in a well-designed LAN is typically small and can be ignored. But when a switch faces a link that is used at full capacity it may need to buffer packets longer. This occurs a lot earlier than you might think, and it introduces significant jitter.
Receiver | Decoding takes time.
Display device | The display device is often overlooked! It has latency too. To start with, it has a refresh rate, e.g. 60 Hz, which means the image is redrawn every 16 milliseconds. So, worst case, a pixel becomes visible 16 milliseconds after it was written by software. Furthermore, delays are caused by the physical pixels not adjusting immediately to their new brightness level. At low end-to-end latencies the display device may take 10-20% of the total budget and a faster device might make sense.
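To get a feel for how these contributions add up, here is a minimal sketch of an end-to-end latency budget. All per-stage numbers are assumptions picked for illustration; your own chain will differ:

```python
# Illustrative end-to-end latency budget. Every number below is an
# assumption for the sake of the example, not a measured value.
budget_ms = {
    "camera capture + encode":        60.0,          # assumed
    "network (well-designed LAN)":     5.0,          # assumed
    "receiver dejitter buffer":       40.0,          # assumed buffer depth
    "decoding":                       10.0,          # assumed
    "display refresh (60 Hz, worst)": 1000.0 / 60,   # ~16.7 ms
    "panel response":                  5.0,          # assumed
}

total = sum(budget_ms.values())
for stage, ms in budget_ms.items():
    print(f"{stage:32s} {ms:6.1f} ms  ({ms / total:5.1%})")
print(f"{'total':32s} {total:6.1f} ms")
```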
Jitter
Closely related to latency, and often more important, is jitter. Jitter means that the latency differs from frame to frame. Jitter is generated nearly everywhere in the processing chain:
- Encoding time in the camera may depend on scene complexity or motion
- Some packets may get buffered inside a switch longer than other packets
- Transmitting an I-frame takes longer than a P-frame
- Decoding an I-frame takes a lot longer than a P-frame
- Decoding speed may vary because the receiver system is task switching
- Visualisation is irregular because the monitor refresh rate is not aligned with the video framerate
The net effect is that the time between video frames being available for rendering varies. Just writing out the pixel values to the display memory as fast as possible will result in a stuttering view.
Latency versus jitter
Jittering video is fatiguing to look at. It must be prevented or compensated for. Prevention is difficult and the only way to compensate is... to add latency! A decoder must buffer data internally and then deliver that data from the buffer to the display device at regular intervals. By adding only a small delay to a late frame and a bigger delay to an early frame, the intervals between frames become equal. The RTP protocol makes this possible: RTP carries the original capture times together with the media, so a decoder can see the original interval between frames and try to recreate it.
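As a sketch of how a receiver can use the RTP timestamps to recreate the original frame spacing, consider the following. The class name and the fixed 100 ms dejitter delay are assumptions for illustration, not how any particular decoder is implemented:

```python
import time

CLOCK_RATE = 90_000   # standard RTP clock rate for video (90 kHz)
DEJITTER_MS = 100     # assumed fixed dejitter delay

class DejitterBuffer:
    """Delay each frame so frames are presented at the intervals encoded
    in their RTP timestamps, plus a fixed margin (wraparound ignored)."""

    def __init__(self):
        self.base_rtp = None     # RTP timestamp of the first frame
        self.base_local = None   # local arrival time of the first frame

    def presentation_time(self, rtp_timestamp, arrival=None):
        arrival = time.monotonic() if arrival is None else arrival
        if self.base_rtp is None:
            self.base_rtp, self.base_local = rtp_timestamp, arrival
        # Offset of this frame relative to the first one, in seconds,
        # exactly as the camera intended it (RTP ticks / clock rate).
        offset = (rtp_timestamp - self.base_rtp) / CLOCK_RATE
        return self.base_local + offset + DEJITTER_MS / 1000.0

buf = DejitterBuffer()
# A late frame gets a small extra delay, an early frame a larger one,
# so the render intervals come out equal (40 ms at 25 fps):
for rtp_ts, arrival in [(0, 0.000), (3600, 0.055), (7200, 0.071)]:
    render_at = buf.presentation_time(rtp_ts, arrival)
    print(f"arrived at {arrival:.3f} s -> render at {render_at:.3f} s")
```

In the example the late frame is held for a shorter time than the early one, and the display sees a steady 40 ms cadence at the cost of the fixed delay.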
Video players like VLC or QuickTime introduce a lot of latency. They are media players, made to play recorded movies streamed from the internet, which have considerable jitter. That typically means seconds of buffering in order to play a movie as smoothly as possible, which makes them unsuitable for low-latency live view.
Minimizing latency
What are things that can be done to reduce latency?
- Decrease receiver-side buffering. An occasional stutter may be more acceptable than a high-latency image.
- Higher framerate. Doubling the framerate halves the delay between frames, thus reducing the perceived latency (see the sketch after this list).
- Low compression. Typically faster than high compression, but the effect is small and jitter may occur due to the extra data.
- Decrease camera-side buffering. WDR and noise reduction typically require multiple frames to be buffered for processing. Disabling these makes the image pipeline faster at the price of losing image quality.
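The framerate effect from the list above is plain arithmetic:

```python
# Doubling the framerate halves the time between frames, and with it
# the waiting component of the perceived latency.
for fps in (12.5, 25, 30, 50, 60):
    print(f"{fps:5.1f} fps -> a new frame every {1000 / fps:5.1f} ms")
```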
At the settings level there is not much more to work on than these. Of course, in product development there are many more factors: various steps in a receiver can be parallelized, hardware accelerators may be used. Tricks can be applied too: shorten play times to catch up on an earlier delay without anyone noticing, skip rendering an entire frame, skip decoding the remaining P-frames when an I-frame comes in, etcetera.
Minimizing jitter
Low jitter allows a low 'dejitter' latency at the receiver, so low jitter is beneficial for latency. Some sources of jitter are harder to avoid than others. One obvious thing to do is to take care of proper dimensioning of the system components.
- Network design. When a switch needs to buffer, jitter will occur. The only way to avoid excessive buffering is to use equipment of sufficient capacity. When a system is very large there is an economic limit to that, and other strategies must be applied as well.
- Traffic shaping at the edge. When many cameras share an uplink, the capacity needed at a specific point in time may exceed the average collective bitrate by a factor of 10 to 20, or even 100, due to bursting. This is explained on the linked page. When bursting is reduced, network jitter is reduced because the switch doesn't need to buffer on the uplink. Transmission latency increases, but receiver-side dejitter can remain low. The net result can work out positive.
- UDP streaming. UDP is normally prioritized over TCP, so streaming video over UDP gives a form of QoS over TCP-based office/internet traffic. More importantly, TCP may cluster packets for throughput reasons. The receiver then gets a bunch of packets delivered at once, instead of each one as fast as possible. This has a detrimental effect on jitter. Especially on small streams the interval may vary from 0 to 160 ms in practice; at 25 fps this means you get 4 video frames at once.
Of course, TCP has its advantages. You can read more about that on the transport page.
- Longer GOP. More specifically: avoid I-frames. An I-frame takes considerably more time to send and decode, hence jitter. Obviously, I-frames are very necessary, and too long a GOP may degrade video quality. Some cameras can apply a dynamic GOP: they postpone the I-frame as long as possible and send more P-frames instead. Only when the changes in the scene become too big is an I-frame inserted.
I-frames are also important to recover from packet loss that has left a stream undecodeable. Long GOPs may cause such a situation to last for quite some time, which is a reason to choose short GOPs in critical situations. Here a decoder implementation can help: after a few frames of decoding errors it can request the camera to insert a new I-frame, correcting the situation. This is not standardized and is manufacturer specific. As you may expect, the llv decoder applies this strategy in combination with Axis cameras.
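To see why long GOPs hurt recovery after packet loss, consider the worst case: the loss happens right after an I-frame, so the stream stays broken until the next scheduled I-frame, almost a full GOP later. The GOP lengths below are just examples:

```python
# Worst-case time until the next scheduled I-frame repairs the stream.
fps = 25
for gop_length in (32, 64, 128, 256):
    worst_case_s = (gop_length - 1) / fps
    print(f"GOP {gop_length:3d} at {fps} fps -> up to {worst_case_s:4.1f} s of broken video")
```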
How to measure latency?
Latency is typically measured by changing something in a scene and tracking how much later that change appears in the rendered image. The easiest way is to clap your hands and look at the video: you will immediately get a feel for the delay between feeling the clap and seeing your own hands stop moving on the display. Different devices and viewers can be compared this way to see which end-to-end chain is faster than the other.
Putting a number to it requires more work. Hardware exists for this purpose, but for practical purposes software-only solutions are accurate enough, and that is what is discussed here. Facing a delay of 400 milliseconds, an error of 10 or 20 milliseconds is not significant, and working carefully, software-only solutions yield results more accurate than that.
A word of caution: many processes will interfere during the procedure outlined below (rolling shutter, various jitters, WDR, the monitor refresh cycle, display delay) and it can be difficult to grasp what's going on.
Outline:
- You need a clock. It can be a software clock, but not all software clocks are equal: it should update the display at strict intervals. The one I have the best experience with simply runs in the browser: https://www.online-stopwatch.com/. Native applications exist that perform worse.
- Point the camera to the clock
- Put a video receiver on the display and make it show the camera image
- Align the setup so that both the clock and the display of the clock are in the scene
- Often you can fit multiple copies of the clock display in the scene with sufficient resolution and contrast. This allows multiple measurements in a single image, so try to get as many as possible.
- Validate the setup. There are things you may have done wrong. See the pitfalls below.
- Take screenshots and save them
- Inspect all screenshots, calculate the time differences between the clocks by hand and take the average: that is the end-to-end latency
Here is an example of how your screen may look. The viewing client is of course the Low Latency Viewer. This example uses a different clock than the one mentioned above. You can see that the time difference between the first and second image is 84 milliseconds, between the second and third 79 milliseconds, and between the third and fourth 85 milliseconds. After that it becomes difficult.
So, 3 usable measurements from this one. Making complete screenshots using the OS (here the 'save bitmap' function of llv was used) would give you an extra one, because it also gives the difference between the clock itself and the first image of the clock.
You will soon notice that not all screenshots are usable because two clock values are blended in a single image. You may also notice weird spikes upwards (likely I-frames), and in some cases even downwards. A fast PC will help to prevent the latter.
So you need to apply averaging to get at a realistic value.
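The bookkeeping itself is trivial; a helper like the sketch below is enough. The three readings are the usable differences from the example above, extend the list with your own:

```python
from statistics import mean, stdev

# Time differences (ms) between consecutive clock images, read by hand
# from the screenshots; the values are the three from the example above.
readings_ms = [84, 79, 85]

print(f"samples : {len(readings_ms)}")
print(f"average : {mean(readings_ms):.0f} ms  (the end-to-end latency)")
print(f"spread  : {stdev(readings_ms):.0f} ms  (gives a feel for the jitter)")
```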
Pitfalls
- Most cameras have a rolling shutter. This means the first line of the image is captured a lot earlier than the last line; as a consequence, the first line has a higher latency than the bottom line. You can play tricks here, and some manufacturers actually do that when communicating numbers. Play fair and align everything horizontally.
- A monitor also refreshes its display in a rolling fashion, line by line, so similar latency aspects apply. Make sure every relevant part of the image is aligned horizontally. Play fair and keep everything aligned in the middle, both in the camera image and on the monitor.
- High-megapixel cameras may resolve the individual pixels of the monitor; this yields an unrealistic scene complexity and that may affect latency.
- If possible, use an overlay in the image that shows the bitrate and framerate, like "#b kbit/s #r f/s" for Axis cameras.
- Put the camera slightly out of focus to get a realistic bitrate by blurring the pixel pattern.
- Power-saving features like clock frequency scaling on the PC must be switched off. Decoding a single stream is often not enough for a multicore CPU to consider itself busy; the clock frequency drops and decoder latency increases.
Can we express latency in a single figure?
It's tempting to say 'this system has an end-to-end latency of 100 milliseconds'. But you will understand by now that it doesn't work that way. The I-frame alone may spike the latency by tens of milliseconds. At best one can say 'this system has an end-to-end average, median, ... latency of such and such a number'. But there's more:
- The camera has 33 ms (at 30 fps) or 40 ms (at 25 fps) available to capture an image using its rolling shutter, so the top line might be almost 33 or 40 ms older than the bottom one. That's a large share of a 100-150 ms end-to-end latency budget! (The sketch after this list makes this concrete.)
- Most modern cameras apply some form of WDR algorithm. WDR implies that multiple captures are performed to capture both highlights and lowlights, so a WDR image contains components from different captures and therefore different latencies. Which pixel has which latency becomes scene dependent.
- The encoder may decide to reuse part of the previous image, meaning that part has a higher latency, though one could see the encoder's decision as a validation that this part still more or less represents the older version.
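A sketch of the rolling-shutter arithmetic from the first bullet above; the line count is just an example:

```python
# With a rolling shutter the sensor is read out line by line over (at most)
# one frame time, so the top of the image is older than the bottom.
fps = 30
lines = 1080                   # example sensor readout height
frame_time_ms = 1000 / fps     # ~33.3 ms at 30 fps
for row in (0, 270, 540, 810, 1079):
    extra_age_ms = (1 - row / (lines - 1)) * frame_time_ms
    print(f"row {row:4d} is captured up to {extra_age_ms:4.1f} ms before the last row")
```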
As a consequence, each pixel may have its own latency. Image pipeline processes partly average pixel values into their neighbours, with the result that even a single pixel contains information with different latencies. The result is a heavily processed scene representation. It has a high forensic value, but its latency is less certain. The latency of each pixel is at best expressed as lying within a certain range with a certain likelihood, as if it were a quantum process.
Reporting a latency figure for modern devices is therefore a bit pointless. Just clap your hands and check whether the visualisation is acceptable. Control a PTZ camera and check that the delays are acceptable for fine control. Done.
Dejitter in the Low Latency Viewer
The main point of the llv utility is to just dump the video as fast as it can, to try to visualise what is happening in a system with a minimum amount of processing in between. It has a dejitter mode, but it is far from perfect in dealing with a phenomenon called clock drift. Devices have a different perception of time because their crystals tick differently, caused by differences from sample to sample, different operating temperatures, and so on. The effect is that 40 ms on one device is, for example, 39.997 ms on another. This quickly adds up. If a video receiver respects the timestamps but its clock ticks slightly slower than the source's, a huge delay will build up over time.
This is why NTP is important. All equipment in a video system should be NTP synchronised so that clock drift is compensated for. For streaming media, source and receiver exchange their 'wallclock' time over RTCP.
A decoder should use the RTCP Sender Reports emitted by the camera. Llv doesn't do this but observes the apparent drift in the RTP timestamps instead. This saves an RTCP implementation and makes it work with sources that lack a proper RTCP implementation themselves. Unfortunately, in a high-jitter environment it takes a while before the running average of RTP-reported camera time ticks and local time ticks settles on a consistent drift.
The effect is that dejitter works reasonably well for short durations, but will drift over time. It is capped at 300 milliseconds maximum. Depending on the direction of the drift the result will be slow video, or jittering video as if the function was not enabled.
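Below is a sketch of the kind of drift observation described above. It is not llv's actual code; it only shows the principle of comparing elapsed RTP time with elapsed local time and smoothing the ratio, with an assumed smoothing factor:

```python
CLOCK_RATE = 90_000   # standard RTP clock rate for video
ALPHA = 0.05          # assumed smoothing factor for the running average

class DriftEstimator:
    """Estimate how fast the sender's clock runs relative to the local
    clock, from RTP timestamps and local arrival times."""

    def __init__(self):
        self.prev_rtp = None
        self.prev_local = None
        self.ratio = 1.0   # sender seconds per local second

    def update(self, rtp_timestamp, local_time):
        if self.prev_rtp is not None:
            sender_dt = (rtp_timestamp - self.prev_rtp) / CLOCK_RATE
            local_dt = local_time - self.prev_local
            if local_dt > 0:
                # Single samples are noisy in a high-jitter environment,
                # so the running average needs a while to settle.
                self.ratio += ALPHA * (sender_dt / local_dt - self.ratio)
        self.prev_rtp, self.prev_local = rtp_timestamp, local_time
        return self.ratio

# Sender nominally at 25 fps (3600 RTP ticks per frame); the local clock
# measures 40.003 ms per frame, i.e. the sender appears ~75 ppm slow.
est = DriftEstimator()
local, ratio = 0.0, 1.0
for i in range(500):
    ratio = est.update(i * 3600, local)
    local += 0.040003
print(f"estimated drift: {(ratio - 1) * 1e6:+.0f} ppm")
```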
To be continued