Unleashing the Ad Creative: Client-Side Dynamic Linear Ad Replacement with Interactive Creatives

The holy grail of media, and ads in particular, is viewer engagement. Personalization, hyper-localization, and other strategies which make content more relevant to a viewer are key to improving engagement. In the ideal case, each viewer is presented with a unique experience based on that viewer’s specific characteristics.

There’s An App For That!

App-based advertising has the potential to deliver a hyper-personalized experience. Traditional ads consist of video and audio of a particular length, with all viewers getting the same tracks. App-based ads consist of an app which runs for a particular time. What the app does is limited only by the creativity of its author. It can, as with traditional ads, present audio and video. However, it could present individualized audio and video tracks depending on information about the viewer such as preferred language, location, or age. It could add dynamically generated hyper-localized text indicating shopping opportunities. With the incorporation of generative AI, dynamically generated audio, video, and graphics are real possibilities.

For this post we’re assuming the app is an HTML5 app running in a web view.

Video and graphics layers depicting an HTML 5 ad – which may consist of any number of video and graphics layers itself – running on top of a network feed. Applications control whether the ad or network feed is presented to the viewer.

Key Points

  • Signaling of ID3 Timed Metadata by the system players (Apple AVPlayer, Android Media Player) to a native application is done in a relatively timely manner, generally within 30 ms.
  • Native applications can typically perform actions with 16 ms accuracy (i.e. frame accurate at 60 Hz). However, due to a lack of real-time execution guarantees, execution may be delayed.
  • HTML5 applications running in a web view are subject to comparatively high (up to 100 ms) scheduling jitter.
  • For dynamic ad replacement with an interactive ad running in a web view, muting and unmuting the network ad should be performed by a native application so that these operations occur within a 30 ms window. The web view’s visibility should also be controlled by a native application to give the appearance of a seamless transition.
  • Because the ad avail period is determined by the network feed, and execution of the replacement ad may be delayed due to scheduling jitter, ads should be structured so that the last 100 ms may be muted. Ideally the last 100 ms would be audiovisual padding.
  • Experiments with ExoPlayer on Fire TV and AVPlayer on iOS suggest that native applications may not be able to react to timed metadata within 16 ms. To accommodate 60 Hz video, it may be desirable to mute the network feed 16 ms (one frame time) earlier, effectively truncating the previous ad by 16 ms. In the case of back-to-back ads with no intervening black frames this minimizes the chance of a one-frame flash of the replaced network ad.

Receiver/Mobile App Architecture

With traditional broadcast the “receiver” is generally considered to be a television or set-top box: hardware and software purpose-built to receive and present broadcast television. With control at all levels from the SoC to the user interface, receiver developers have a great degree of control over the reception and presentation of broadcast television.

However, viewers are increasingly consuming media on phones, tablets, and other devices. Even in the television and set-top space, viewers are consuming live streams and VOD via streaming apps. Broadcast television can participate in this environment via gateway devices – boxes such as the HDHomeRun which receive broadcast television and make it available as a live stream.

Phones, tablets, and streaming boxes are not purpose-built for viewing broadcast television and place limitations on how an application running on those devices can operate. Because presentation and synchronization of audio and video is a time-critical process, the system software for these devices provides components for presenting media – for example the Android media framework and iOS AVPlayer.

An application interacts with the system player to control presentation – for example pausing or seeking through media. The separation into a system component responsible for acquiring and presenting media and an application which operates at a higher level is particularly relevant for protected content, where the application code never touches the encrypted media. Only components in the player, operating in a protected environment, have access to the decrypted stream.

Mobile app architecture for DLAR. The application utilizes the core OS media player for presentation of the network feed while the ad runs in a web view.

A key difference in this app/player architecture is how much insight the app has into what is being rendered at a particular time. In a traditional receiver stack, the software (and hardware) architecture is under the control of the receiver manufacturer, and so a design could allow any layer to have access to information anywhere in the stack. Any component could also operate at favored “real-time” priorities. In the app/player architecture the player is aware of what is being rendered. However, whether an app knows what is rendered, and with what precision, depends on what facilities the player makes available.

Timed Metadata

SCTE-35 has long been used as a way to signal events in a broadcast stream. A similar mechanism has emerged as a standard in the streaming and mobile media player ecosystem: timed metadata. Timed metadata matures in the same way that audio, video, and other program components do. When timed metadata matures it is presented by the player to the application. In this way an application can synchronize with a point in the stream.

The player typically runs with a favorable priority, so it can dispatch the timed metadata frame accurately. Whether the application can receive the timed metadata and take action with frame accuracy depends on the operating system and the application run-time environment. Real-time scheduling is generally not available to applications in mobile operating systems, although they may provide a notion of priority which can allow time-sensitive code to run more favorably.

Ad Replacement Opportunity Signaling

One of the primary mechanisms for signaling ad opportunities within a broadcast (or streaming) airchain is SCTE-35. SCTE-35 provides a mechanism for decorating a stream with markers that signal, for example, ad start and end opportunities. MPEG transport streams decorated with SCTE-35 are common in the OTA and cable broadcast environments and can also be applied to streaming clients, where the SCTE-35 markers are embedded into the stream as timed metadata.

For purposes of this discussion ID3 Timed Metadata is used to carry SCTE-35 splice commands. ID3 Timed Metadata provides a way to insert items into MPEG-TS and DASH streams. The ID3 metadata has an associated presentation timestamp allowing the metadata to be synchronized with audio and video presentation. Both Apple’s iOS AVPlayer and ExoPlayer support ID3 Timed Metadata.
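To make the idea of a presentation timestamp "maturing" concrete, here is a minimal sketch. It assumes an MPEG-TS source, where presentation timestamps count ticks of a 90 kHz clock. The `TimedMetadataItem` shape and the function names are hypothetical, and in practice the system player performs this comparison rather than the application.

```typescript
const MPEG_TS_CLOCK_HZ = 90000; // MPEG-TS PTS values count 90 kHz ticks

// Hypothetical in-memory form of one timed metadata item.
interface TimedMetadataItem {
  pts: number;         // presentation timestamp, in 90 kHz ticks
  payload: Uint8Array; // ID3 tag bytes carrying the SCTE-35 splice command
}

// Convert a 90 kHz presentation timestamp to milliseconds.
function ptsToMs(pts: number): number {
  return (pts * 1000) / MPEG_TS_CLOCK_HZ;
}

// An item has matured once playback has reached its presentation time.
// The 33-bit PTS wraparound is ignored to keep the sketch short.
function hasMatured(item: TimedMetadataItem, playbackPts: number): boolean {
  return playbackPts >= item.pts;
}
```

At 60 Hz one frame lasts 1500 ticks of the 90 kHz clock, so the timestamp itself is more than precise enough to identify a single frame.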

Frame Accuracy

One concern for frame-accurate tasks such as ad replacement is how accurately the player can signal the application. Accuracy involves both latency (the delay in signaling the application that an event’s presentation time has been reached, or “matured”) and determinism (how much jitter there is in receipt of the event by the application).

The core presentation elements of media playback operate at a favored priority, so there is generally no problem in detecting an ID3 timed metadata tag and tying it to its presentation time. Once the timed metadata has matured, however, the application must be signaled. Applications, particularly mobile applications, operate at a less-favored priority than media presentation, so they may not be able to respond to the event immediately. In addition, presentation of the replacement ad may involve “non-native” applications, such as HTML5 applications running in a web view.

To examine signaling accuracy, a stream with burned-in timecode was decorated with timed metadata. Native applications using AVPlayer (for iOS/iPadOS) and ExoPlayer2 (Fire OS) were used to listen for the timed metadata and trigger a visual change when the timed metadata was received. The event was then forwarded to an HTML5 app running in a web view, simulating receipt of an ad replacement opportunity. Because the HTML5 app can only schedule to system clock time and has no practical way of synchronizing to playback of the network stream, the native application timestamps the event with the system clock time as soon as it is received. This establishes a relationship between the system clock and the timed metadata’s maturation time. The accuracy of this relationship depends on the precision with which the timed metadata is delivered to the native app.

Upon receipt of the timed metadata event, the HTML5 app schedules a timeout five seconds in the future, using the timestamp provided by the native app as a basis. This simulates processing of an ad replacement opportunity with a five second preroll. When the timer fires the HTML5 app triggers a visual change.
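The scheduling step described above might look roughly like the following. This is a sketch, not the code used in the experiment: the `AdOpportunityEvent` shape, the `receivedAtMs` field, and the function names are assumptions. It also assumes the native timestamp and the web view's `Date.now()` are both milliseconds on the same system clock.

```typescript
// Hypothetical message forwarded by the native app to the web view.
interface AdOpportunityEvent {
  // System clock time (ms since epoch) stamped by the native app on receipt.
  receivedAtMs: number;
}

const PREROLL_MS = 5000; // five second preroll, as in the experiment

// Milliseconds to wait from `nowMs` until the opportunity starts.
// Clamped at zero so a late event fires immediately rather than in the past.
function delayUntilStart(
  event: AdOpportunityEvent,
  nowMs: number,
  prerollMs: number = PREROLL_MS
): number {
  const startAtMs = event.receivedAtMs + prerollMs;
  return Math.max(0, startAtMs - nowMs);
}

// Schedule `onStart` relative to the native timestamp rather than to the
// moment the web view happened to process the message.
function scheduleAdStart(
  event: AdOpportunityEvent,
  onStart: () => void
): ReturnType<typeof setTimeout> {
  return setTimeout(onStart, delayUntilStart(event, Date.now()));
}
```

Basing the timeout on the native timestamp means that time spent forwarding the message into the web view shortens the wait instead of shifting the start time. The timer itself is still subject to the web view's scheduling jitter.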

Timed Metadata Maturation Precision

On both the iOS and Fire TV devices, the timed metadata was delivered to the native application with a significant delay. It is not understood why this delay exists, but it was constant for timed metadata muxed into the transport streams anywhere between two seconds and 100 ms early. The delay differed between device families (iOS vs Fire TV), but within each family it was fairly constant, staying within a 30 ms window. As a result the delay can be compensated for within the native application.

Latency of timed metadata received by app and signaled by changing the color of a view. Timed metadata matured at frame 55976 (green line). Both iOS and Fire TV show significant latency. Both types of devices show up to 30 ms of jitter in the view’s appearance changing.

On a positive note, the 30 ms window represents only about two frame times at 60 Hz, so if the delay is compensated for, the native application can take action with two-frame accuracy.
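One way to apply the compensation described above is to subtract a per-device-family constant from the time at which the native app received the event. The sketch below is illustrative only. The `DeviceFamily` and `DelayTable` types and the function names are hypothetical, and the delay values must come from measurement on real devices, so none are assumed here.

```typescript
type DeviceFamily = "ios" | "firetv";

// Measured delivery delay per device family, in ms, supplied by the caller.
type DelayTable = Record<DeviceFamily, number>;

// Estimate the system clock time at which the metadata actually matured.
function estimateMaturationMs(
  receivedAtMs: number,
  family: DeviceFamily,
  delays: DelayTable
): number {
  return receivedAtMs - delays[family];
}

// Express a jitter window as a whole number of frame times, rounding up.
function windowInFrames(windowMs: number, frameRateHz: number = 60): number {
  return Math.ceil((windowMs * frameRateHz) / 1000);
}
```

With the default 60 Hz frame rate, `windowInFrames(30)` gives 2, matching the two-frame figure quoted above. Compensation removes the constant part of the delay but not the residual jitter.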

The web view inherits the delay in receipt of the event from the native app. Of greater interest is the jitter in the HTML5 app’s signaling of the ad opportunity start. Execution within a web view is generally less performant than in native applications. We observed that on iOS and Fire TV devices the ad opportunity start was signaled within a two to four frame (60 Hz) window. The source of the jitter could be OS scheduling or rendering (e.g. there could be a delay between the HTML5 app changing the color of a view and the DOM actually rendering it).

Range of times when a view in a web view changed color, indicating the start of an ad replacement opportunity. Both iOS and Fire TV devices show a range of up to 66 ms in which the change becomes visible.

Recommendations

A four frame jitter at the HTML5 app level (approximately 66 ms) can be managed with minimal impact to the viewer. Because native apps can generally run with greater timeliness than HTML5 apps, it is recommended that the native application control display of the network feed vs the web view (i.e. the native app controls the splice). In addition, because ads may be packed back-to-back in the network feed with no intervening black frame(s), and because there may be a one frame jitter in the native app processing the timed metadata and causing any visual change, it is suggested that the native app start the ad replacement one frame time early and end it one frame time later (effectively increasing the replacement duration by two frame times and potentially truncating the previous network ad by one frame and the following network ad by up to two frames).
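The widened splice window can be computed as in the sketch below. It assumes the avail start has already been mapped to system clock milliseconds. The `SpliceWindow` type and `widenSpliceWindow` function are hypothetical names used for illustration, not part of any player API.

```typescript
interface SpliceWindow {
  startMs: number; // when the native app mutes/hides the network feed
  endMs: number;   // when the native app restores the network feed
}

// Widen the signaled avail by `guardFrames` frame times on each side.
function widenSpliceWindow(
  availStartMs: number,
  availDurationMs: number,
  frameRateHz: number = 60,
  guardFrames: number = 1
): SpliceWindow {
  const guardMs = (guardFrames * 1000) / frameRateHz;
  return {
    startMs: availStartMs - guardMs,
    endMs: availStartMs + availDurationMs + guardMs,
  };
}
```

For a 30 second avail at 60 Hz with one guard frame, the window grows by two frame times in total, roughly 33 ms.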

It is also recommended that the replacement ad be authored so that the last 100 ms can be truncated. This accounts for a small amount of nondeterminism at the web view level in scheduling the start of the ad (or subsequently when scheduling events within the ad).
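A simple check of whether a late start stays within the truncatable tail is sketched below. The `TailCheck` type and `checkTail` function are hypothetical names. The 100 ms default reflects the padding recommended above, and all times are assumed to be system clock milliseconds.

```typescript
const TRUNCATABLE_TAIL_MS = 100; // padding recommended at the end of the ad

interface TailCheck {
  overrunMs: number;      // how much of the ad falls past the end of the avail
  withinPadding: boolean; // true if only padding would be cut off
}

// Compare where the ad would finish against where the avail ends.
function checkTail(
  adDurationMs: number,
  actualStartMs: number,
  availEndMs: number,
  paddingMs: number = TRUNCATABLE_TAIL_MS
): TailCheck {
  const overrunMs = Math.max(0, actualStartMs + adDurationMs - availEndMs);
  return { overrunMs, withinPadding: overrunMs <= paddingMs };
}
```

For example, a 30 second ad that starts 66 ms late in a 30 second avail overruns by 66 ms, which the 100 ms tail absorbs. A start that is 150 ms late would cut into the ad's substantive content.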