
Just-in-Time Transformation to Deliver Quality, Cut Cost, and Spread Joy

We apply just-in-time transformations to keep our media stack lean and flexible. This helps us increase efficiency, lower storage costs, and spread joy.

With the Loom desktop apps, users can record their screens at 30 FPS at resolutions up to 4K. High-quality recording enables users to present their work accurately, with sharp images and legible text.

While high-resolution screen recordings are always welcome, high resolution is not always necessary for playback. Playing back a full 4K video on a small-screen device is overkill in practice. Moreover, higher resolution implies a higher bitrate, which demands more network bandwidth for transit and more computational power to decode, process, and present. As a result, less powerful devices struggle to play such videos under low-bandwidth network conditions.

To solve this problem, the standard practice in streaming is to serve clients a multi-rendition playlist. Both the DASH and HLS streaming protocols support multi-rendition playlists.

A multi-rendition playlist is a manifest that specifies multiple “versions” or “quality levels” of a video, called renditions. Each rendition contains the same video content at a different bitrate and resolution. The client player can dynamically choose which rendition to download and play back based on current network conditions and device resources.
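To make the idea concrete, here is a minimal sketch (in TypeScript, not Loom's actual code) that builds an HLS-style master playlist from a handful of hypothetical rendition descriptors; the bitrates and resolutions are illustrative only:

```typescript
// Minimal sketch (not Loom's production code): build an HLS master playlist
// from a list of hypothetical rendition descriptors.
interface Rendition {
  width: number;
  height: number;
  bandwidth: number; // peak bitrate in bits per second
  uri: string;       // path to this rendition's media playlist
}

const renditions: Rendition[] = [
  { width: 640,  height: 360,  bandwidth: 800_000,    uri: "360p.m3u8" },
  { width: 1280, height: 720,  bandwidth: 2_500_000,  uri: "720p.m3u8" },
  { width: 1920, height: 1080, bandwidth: 5_000_000,  uri: "1080p.m3u8" },
  { width: 3840, height: 2160, bandwidth: 16_000_000, uri: "2160p.m3u8" },
];

function buildMasterPlaylist(rs: Rendition[]): string {
  const lines = ["#EXTM3U", "#EXT-X-VERSION:6"];
  for (const r of rs) {
    lines.push(
      `#EXT-X-STREAM-INF:BANDWIDTH=${r.bandwidth},RESOLUTION=${r.width}x${r.height}`,
      r.uri,
    );
  }
  return lines.join("\n");
}

console.log(buildMasterPlaylist(renditions));
```

The player reads this manifest once and then switches between the listed renditions as conditions change.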

A multi-rendition playlist is like a multi-lane highway. The video player is like a car driving down the highway, picking the best lane to travel based on the traffic conditions.

Multi-rendition in action based on network condition
The video player is like a car driving down the highway and switching lanes based on the road condition.

Our Loom clients record and upload videos in the highest resolution the platform supports. Once the video upload is completed, our backend takes the raw video, transcodes it into multiple renditions, and creates a multi-rendition playlist for playback.

The Deficiencies of Building Multi-Rendition Playlists Upfront

The system has worked well for us. It produces smooth, high-quality playback for our users. However, this approach has several drawbacks. To understand these deficiencies, we need to examine our services' data access patterns.

The Data Pattern of Video Messaging

Loom is a video messaging platform for work. While it is built on top of streaming technology, it is very different from other video streaming services like YouTube, Netflix, or TikTok.

On a typical video streaming platform, videos are long-lived content played by thousands or millions of viewers once uploaded. Creating the multi-rendition playlist upfront with optimal settings ensures the best viewing experience for all viewers.

When using Loom as a video messaging service, a power user might create many videos daily. A typical video is viewed by a handful of people shortly (within a day or two) after it is recorded. As a result, performing extensive preprocessing and transcoding upfront is wasteful in practice.

Don’t Build the Road Not Traveled

Let us illustrate this using our imaginary multi-lane highway example. It makes sense for a standard streaming service to permanently build the multi-lane highway because we expect many cars to drive on this road. The cost of preprocessing and transcoding is well-amortized.

In contrast, for Loom, if only one car will ever drive down the road, its tires will only touch the red lanes. There is little reason to build any of the blue lanes.

Multi-rendition usage in a single playback session
Only red lanes are traveled on our imaginary multi-lane highway. There is no reason to build the blue ones.

Furthermore, after creating the multi-rendition playlist, we need to store the content on S3 permanently. We will pay the storage cost for these derived media files forever, even though they might never be accessed after a few days.

Time-to-view (TTV) and Latency

If you are not entirely convinced that creating the multi-rendition playlist upfront is a bad idea from a resource standpoint, there are also latency and the time-to-view metric to consider.

Time-to-view measures the time it takes for a Loom video to become viewable after a user hits the “Stop Recording” button. At Loom, we strive to keep this number as low as possible because we are the fastest video messaging service in the market.

It took time to create the multi-rendition playlists, so they were usually unavailable right after the video was uploaded. We could not prioritize processing videos that would be viewed immediately after upload over those that would not be viewed for a while. As a result, the best viewing experience was usually not immediately attainable, especially during peak hours when system load was high.

Consider the following scenario: a user creates a 4K recording and immediately sends it to another user, who watches the Loom on their phone. The video buffers heavily because only the raw 4K recording is available at the time. The recipient cannot enjoy the Loom at a quality suited to their cellular connection and screen size until later. From the recipient’s perspective, the video is not ready for viewing, resulting in a long perceived TTV.

To make matters worse, every time the creator edited the video (e.g., by trimming it), we had to recreate the multi-rendition playlist, incurring additional latency and cost.

Transcoding Just-in-Time

In early 2021, we launched instant editing, a feature that significantly reduced the time-to-view (to near zero) when the user edited a video. The core idea of instant editing is to avoid running a trimming job whenever the user makes an edit. Instead, we defer the trimming operation until the video is first viewed and do it just in time.

We are pushing this idea further to create multiple renditions just in time. The new architecture expanded the infrastructure of the instant editing project and introduced a new computing service, the AVServer.

The main idea behind just-in-time transcoding is that every time a resource is requested, it is served from the CDN if it is already there; if not, it is dynamically generated on demand. The AVServer acts as a special origin for CloudFront (our CDN provider), so whenever there is a cache miss in the CDN, AVServer generates the requested asset in real time. See the following figure for the high-level architecture.

AVServer Architecture
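Sketching this flow in TypeScript (an Express-style handler with placeholder helpers; none of these function names are Loom's actual API), the origin side might look roughly like this:

```typescript
import express from "express";

// Placeholder helpers standing in for the real pipeline; they are illustrative only.
async function fetchOriginalFromS3(videoId: string): Promise<Buffer> {
  // Same-region S3 download of the raw recording.
  throw new Error("placeholder");
}
async function generateAsset(original: Buffer, assetName: string): Promise<Buffer> {
  // Produce the requested playlist or media segment on the fly.
  throw new Error("placeholder");
}

const app = express();

// CloudFront forwards cache misses to this origin route; the asset is generated
// on demand and handed back so the CDN can serve subsequent requests itself.
app.get("/:videoId/:revisionId/:asset", async (req, res) => {
  const { videoId, asset } = req.params;
  const original = await fetchOriginalFromS3(videoId);
  const generated = await generateAsset(original, asset);
  res.send(generated);
});

app.listen(8080);
```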

The Cost Savings

AVServer must download the original media files from S3 to generate the requested resource. Since AVServer is colocated in the same data center as S3, this operation is fast and costs nothing. We set the CDN cache to expire in 24 hours and do not persist the generated media files in permanent storage. By making the multi-rendition playlists and media files ephemeral artifacts on the CDN, we minimize our permanent storage cost over the long term.
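As a small illustration of the ephemeral-artifact idea (the header value is an assumption, not necessarily Loom's exact setting), the origin response only needs to carry a 24-hour cache lifetime; nothing is ever written back to S3:

```typescript
// Illustrative sketch: let the CDN keep a generated asset for 24 hours.
// When the cache entry expires, the asset is simply regenerated on the next request.
function cacheHeadersForGeneratedAsset(): Record<string, string> {
  const oneDayInSeconds = 24 * 60 * 60;
  return { "Cache-Control": `public, max-age=${oneDayInSeconds}` };
}
```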

Besides saving on storage costs, the just-in-time transformation has another benefit. Without AVServer, every time the user edited the video (e.g., trimming), we would recreate a new multi-rendition playlist and a new set of media files. With AVServer, we do not need to reprocess the video until it is viewed for the first time, regardless of how many edits the user makes in between. We can now lazily apply user edits to the video and save additional computational resources.

Video Processing, Revision, and Caching

The processing information of a video contains a set of modifiers for the video. It can change over time as users edit the video, for example by applying filters or trims. When the video player requests the CloudFront path of the video, the main Loom service creates a revision ID and caches the processing information in Redis. The revision ID is an arbitrary string that is part of the CloudFront path. As AVServer dynamically creates a resource (a playlist or a video file), it uses the revision ID to look up the processing information in Redis and applies all the modifiers to the file. The file is then cached on CloudFront.
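A minimal sketch of the lookup side, assuming the ioredis client and an invented key scheme and modifier shape (Loom's actual schema is not shown here):

```typescript
import Redis from "ioredis";

const redis = new Redis();

// Assumed shape of the processing information cached by the main Loom service.
interface ProcessingInfo {
  trims?: { startMs: number; endMs: number }[];
  filters?: string[];
}

// AVServer side: the revision ID is parsed out of the requested CloudFront path
// and used to fetch the modifiers to apply while generating the asset.
async function getProcessingInfo(revisionId: string): Promise<ProcessingInfo | null> {
  const raw = await redis.get(`processing-info:${revisionId}`); // hypothetical key scheme
  return raw ? (JSON.parse(raw) as ProcessingInfo) : null;
}
```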

Every time a user edits the video, the edit updates the processing information. The main Loom service then generates a new revision ID and returns a new path to the resource. Accessing the new path results in a cache miss on CloudFront, which falls back to AVServer to generate a new resource with the new revision ID.
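The edit side can then be sketched as the mirror image: store the updated processing information under a fresh revision ID and return a new path, guaranteeing a CloudFront cache miss (again, the key scheme and path layout are illustrative, not Loom's actual ones):

```typescript
import { randomUUID } from "crypto";
import Redis from "ioredis";

const redis = new Redis();

// Illustrative: store the updated modifiers under a brand-new revision ID and
// hand back a playback path that is guaranteed to miss the CloudFront cache.
async function publishRevision(videoId: string, processingInfo: object): Promise<string> {
  const revisionId = randomUUID();
  await redis.set(`processing-info:${revisionId}`, JSON.stringify(processingInfo));
  return `/${videoId}/${revisionId}/playlist.m3u8`; // hypothetical path layout
}
```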

The following diagram shows the client video player and our backend interactions.

just-in-time transformation workflow
If the multi-rendition playlist and media files of different bitrates and resolutions are not already present in the cache (CDN), they are generated on the fly.

Conversion Between Streaming Protocols

Our Loom Chrome Extension produces video in the DASH protocol with WebM files. However, DASH is not natively supported by iOS’s AVPlayer. Without just-in-time transcoding, for iOS clients to play the video we would first have to transcode it into a container format that AVPlayer can handle natively (e.g., MP4 or HLS), resulting in a longer time-to-view for iOS clients when viewing videos produced by our Chrome Extension.

With just-in-time transformation, we can also support conversion between different streaming protocols. We can convert DASH to HLS on the fly, effectively making the time-to-view for iOS clients instant.
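As a hedged illustration of this kind of on-the-fly conversion (a generic ffmpeg invocation, not a description of AVServer's internals), converting a WebM source into an HLS rendition might look like this:

```typescript
import { spawn } from "child_process";

// Illustrative only: transcode a WebM source into an HLS rendition with ffmpeg.
// AVServer's real pipeline and flags are not shown here; this is a generic example
// of the kind of conversion a just-in-time layer can perform.
function webmToHls(input: string, outputPlaylist: string): Promise<void> {
  return new Promise((resolve, reject) => {
    const ffmpeg = spawn("ffmpeg", [
      "-i", input,
      "-c:v", "libx264",           // HLS players broadly support H.264
      "-c:a", "aac",
      "-f", "hls",
      "-hls_time", "6",            // ~6-second segments
      "-hls_playlist_type", "vod",
      outputPlaylist,
    ]);
    ffmpeg.on("error", reject);
    ffmpeg.on("close", (code) =>
      code === 0 ? resolve() : reject(new Error(`ffmpeg exited with code ${code}`)),
    );
  });
}
```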

Spreading Joy Just-in-Time

Having a just-in-time layer allows us to modify videos and generate new content on the fly, which lets us build new features that delight our users and spread joy. Several new features were built on top of AVServer.

Noise Suppression

Recording in a noisy environment? Whether you are at the airport, next to a road, or in a coffee shop, Loom has you covered. With the flip of a switch, we can run a state-of-the-art noise suppression algorithm on AVServer to clean up the noise in your Loom, so you can be confident that your audience can hear you clearly wherever you are.

Trimming and Stitching Videos

Do you need an intro or outro for your Loom? Or do you want to create a multi-scene video? Leveraging the power of just-in-time transformation and AVServer, we can dynamically generate composite playlists that combine multiple video sources, instantly creating new Looms from existing ones. Trimming and stitching Looms becomes instantly available.

Text, Images, and Drawings

We can let users edit their Looms post-recording, for example by adding watermarks, images, text, and drawings. Just-in-time transformation makes all of these edits instant.

Dynamic Composition and Filtering

In the future, once we separate the camera and screen video streams, we can use AVServer and just-in-time transformation to enable dynamic changes to the video layout. Additionally, we can apply special-effect filters to the camera stream, allowing users to add and edit them after recording.

Posted: May 4, 2022