In this episode of the Lex Fridman Podcast, guests Jean-Baptiste Kempf and Kieran Kunhya explore the technology behind FFmpeg and VLC, two open-source projects that power video playback across the internet. The conversation covers the technical foundations of video codecs and compression, explaining how containers differ from codecs and how compression algorithms exploit human sensory limits to achieve extraordinary file size reductions while maintaining quality.
Beyond the technical details, the episode examines the philosophy and challenges of open-source development. Kempf and Kunhya discuss what motivates volunteer contributors, the role of hand-written assembly code in achieving performance gains, and the sustainability challenges facing critical infrastructure projects. They also look toward the future of multimedia technology, including applications in ultra-low-latency systems, brain-computer interfaces, and emerging formats for robotics and extended reality.

Video playback involves multiple intricate stages. The process begins with data retrieval, followed by stream demultiplexing, where the container format (MP4, MOV, MKV) is parsed to separate audio, video, and subtitle tracks. During content decoding, the player determines whether to use GPU hardware acceleration or fall back to software decoding, which entails entropy decoding, prediction to reconstruct frames, and inverse transforms to recover pixel information. The decoded raw samples are then sent to the graphics and audio cards for rendering.
A crucial distinction exists between containers and codecs. Containers organize and synchronize multiple media streams within a single file, while codecs are the compression and decompression algorithms for those streams. Because file extensions are frequently misleading, tools like VLC and FFmpeg parse the file's content to determine the true format, ensuring robust compatibility with mislabeled files.
Video and audio codecs achieve extraordinary compression—100 to 200 times for video—by exploiting human sensory limits. Video codecs shift from RGB to YUV color space, preserving luminance while reducing chrominance resolution to match the eye's sensitivity. Each codec generation achieves about 30% better compression through more advanced prediction and transform algorithms. Encoding is far more computationally intensive than decoding, but since content is encoded once and decoded millions of times during distribution, platforms like YouTube accept high encoder complexity in exchange for distribution efficiency.
H.264 revolutionized video compression by introducing psychovisual rate distortion optimization, focusing on perceptual visual quality rather than mathematical metrics. AV1 emerged as a royalty-free alternative to HEVC, developed by the Alliance for Open Media to avoid patent costs. However, newer codecs require dramatically higher computational effort, with encoding times sometimes two orders of magnitude longer than H.264.
Not all codecs are designed for streaming. Editing codecs like Apple ProRes use only I-frames, making seeking and cutting fast for video editors at the expense of file size. Screen recordings and anime require unique optimizations to handle their distinct visual characteristics. Proprietary codecs like those in GoToMeeting often must be reverse engineered through painstaking analysis of binary code—a process likened to digital archaeology—to ensure long-term media accessibility.
Open source projects like FFmpeg, VLC, and x264 thrive on community-driven development centered on openness and meritocratic principles.
Jean-Baptiste Kempf shares that volunteer developers are primarily motivated by passion for the subject matter and intellectual challenge rather than financial incentives. Working on open source multimedia software provides unique pride and visibility, with code used by billions globally offering a sense of achievement that commercial programming rarely provides. The FFmpeg and VLC communities function as elite programming schools, where contributors receive rigorous code reviews from world-class engineers. Andrew Kelley, creator of the Zig language, was trained in this "FFmpeg school."
Copyleft licenses like GPL and LGPL require modifications to be shared back with the community, unlike permissive MIT and BSD licenses that allow proprietary use. Relicensing VLC from GPL to LGPL required contacting over 350 contributors for legal permission, reflecting the collective nature of open source copyright. VideoLAN operates as a distributed non-profit without offices or employees, making it resilient against governmental pressures and ensuring project continuity even if individuals are removed.
Core teams prioritize long-term code quality over speed, with around five maintainers for VLC and 10-15 for FFmpeg. Code review is rigorous, focusing solely on quality regardless of the developer's status or employer. With contributors from across the globe, the community is highly resilient but must remain vigilant about security, as past incidents with maliciously modified VLC versions have demonstrated the importance of trusted distribution channels.
Low-level assembly programming in projects like dav1d (the AV1 decoder) demonstrates incredible performance gains when humans directly leverage CPU capabilities.
Kieran Kunhya and Jean-Baptiste Kempf assert that hand-written assembly dramatically outperforms C and compiler auto-vectorization, with performance improvements up to 62x in SIMD workloads. The dav1d project contains over 240,000 lines of handwritten assembly and only 30,000 lines of C, enabling real-time playback on modest hardware where software decoders are essential. Modern compilers cannot match these optimizations because they lack deep awareness of CPU pipeline characteristics, cache architecture, and instruction-level parallelism.
Assembly programming for SIMD lets a single instruction operate on entire vectors of data, unlike scalar operations. Software like FFmpeg and dav1d performs runtime processor detection to choose optimal code paths based on detected CPU features. Profound understanding of cache, memory hierarchy, and architectural specifics is essential for maximizing performance in ways no high-level language can replicate.
For internal calls, dav1d breaks with traditional OS calling conventions, designing custom lightweight conventions that reduce register save-and-restore overhead. Kunhya describes creatively repurposing cryptographic instructions for video processing, representing the artistry central to low-level programming. Supporting various platforms requires maintaining separate assembly implementations for each instruction set, dramatically increasing effort but ensuring optimal performance across diverse hardware.
Critical projects like FFmpeg and VLC face serious challenges related to maintainer burnout, government and corporate pressures, and financial sustainability.
These essential projects rely almost entirely on a small number of unpaid volunteers. The security industry frequently generates floods of AI-generated vulnerability reports, resembling denial of service attacks on developer attention. Kempf received death threats for ceasing PowerPC support, highlighting the emotional labor burdens. The recent XZ backdoor incident dramatically illustrated these dangers when a single overwhelmed maintainer, under sustained social engineering, relinquished control to attackers. Major corporations like Microsoft and Google often treat open source projects as conventional vendors, demanding urgent action while providing little meaningful support.
Governments have sought to introduce backdoors into VLC for surveillance purposes, which Kempf states the project has firmly refused, preferring to shut down rather than compromise integrity. Traditional video codecs have grown plagued with expensive patent pools, leading to the formation of the Alliance for Open Media to develop royalty-free alternatives like AV1. France's legal rejection of software patents has helped projects like VLC avoid some patent challenges.
Donations for FFmpeg and VLC are insufficient to fund even a single full-time developer. Some projects have adopted dual-licensing models, offering both GPL and commercial licenses to generate revenue from commercial users while keeping the software freely available. Additionally, some maintainers establish consulting companies providing specialized support around their open source projects.
Multimedia technology is expanding beyond traditional streams, with innovations in open-source frameworks, ultra-low-latency systems, and brain-computer interfaces.
Kempf defines multimedia as any set of synchronized data streams addressed to human senses, envisioning FFmpeg handling future sensory data like odor or brainwaves through its modular architecture. VLC already supports plugins for 4D cinema physical movement data. Both platforms are being adapted to manage point cloud codecs, volumetric video, and RGBD data vital for robotics and 3D experiences. The archiving community has funded development of FFV1, a mathematically lossless codec critical for digital preservation, democratizing access for institutions worldwide.
Kempf discusses Kyber, which targets under 10 milliseconds of glass-to-glass latency for remotely controlling drones, robots, and vehicles. Current progress achieves 6-7 millisecond latencies, approaching a four-millisecond-per-frame target at 240 Hz. Synchronizing multiple camera and sensor feeds requires advanced timestamping to prevent clock drift. Kyber uses UDP with forward error correction instead of TCP, sending redundant data so lost packets can be reconstructed instantly without retransmission delays.
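The XOR-parity idea behind forward error correction can be sketched in a few lines. This is a minimal single-parity scheme for illustration only; the episode does not specify which FEC Kyber actually uses, and production systems typically rely on stronger codes such as Reed-Solomon:

```python
def make_parity(packets):
    """Single XOR-parity packet over equal-length data packets."""
    parity = bytes(len(packets[0]))
    for pkt in packets:
        parity = bytes(a ^ b for a, b in zip(parity, pkt))
    return parity

def recover(survivors, parity):
    """Rebuild the one lost packet by XOR-ing parity with the survivors."""
    lost = parity
    for pkt in survivors:
        lost = bytes(a ^ b for a, b in zip(lost, pkt))
    return lost

packets = [b"frame-00", b"frame-01", b"frame-02", b"frame-03"]
parity = make_parity(packets)
# Packet 2 is dropped in transit; rebuild it from the other three + parity.
rebuilt = recover([packets[0], packets[1], packets[3]], parity)
assert rebuilt == b"frame-02"
```

Because the receiver never asks for a retransmission, the latency cost is fixed (one extra packet per group) rather than a round trip, which is the point of using FEC over TCP-style retransmits.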
Fridman and Kempf anticipate FFmpeg and VLC will need to standardize encoding for neural data from brain-computer interfaces. Work is underway on streaming volumetric video to AR glasses, which lack computational power for local rendering. Kempf observes that each new media stream triggers initial format incompatibility before convergence on standards, with open-source tools like FFmpeg and VLC accelerating this process and shaping multimedia interoperability.
1-Page Summary
Video playback in a player like VLC involves multiple intricate stages. The process starts with data retrieval, where the software interacts with the operating system to access the media file or stream from sources such as HTTP URLs, local files, or DVDs. This source provides a raw byte stream.
Next, stream demultiplexing (demuxing) occurs. Here, the container format (such as MP4, MOV, or MKV) is parsed to separate individual audio, video, and subtitle tracks from the multiplexed stream. Each track is then identified, and further information is obtained to determine how it needs to be decoded.
The content decoding stage follows. The player probes the video frames to decide whether they can be handled by GPU hardware acceleration or must fall back to software decoding. Not all files are GPU-decodable, so detection is essential. Some files may mix codec variants, requiring the player to select an appropriate path. When software decoding is needed, the player performs entropy decoding (reversing bitstream compression such as Huffman or arithmetic coding), applies intra and inter prediction to reconstruct frames, handles residual frequency-domain data, and performs an inverse transform to recover pixel information.
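The prediction-plus-residual step can be illustrated with a toy DC intra predictor. Real decoders predict from previously decoded neighboring pixels and obtain the residual from an inverse transform; the plain lists and clipping here are a simplified stand-in:

```python
# Toy decoder-side reconstruction: a DC intra predictor plus a residual.
# Real codecs derive the residual from an inverse transform of decoded
# coefficients; here it is given directly as small numbers.

def dc_predict(left, top):
    """Predict every pixel of a block as the mean of its border pixels."""
    border = left + top
    return round(sum(border) / len(border))

def reconstruct(dc, residual):
    """pixel = clip(prediction + residual), per pixel."""
    return [[max(0, min(255, dc + r)) for r in row] for row in residual]

left, top = [100, 102, 101, 99], [98, 100, 103, 101]
dc = dc_predict(left, top)          # rounds to 100
residual = [[0, 1, -2, 0],
            [3, 0, 0, -1],
            [0, 0, 2, 0],
            [-1, 0, 0, 1]]
block = reconstruct(dc, residual)
assert block[0] == [100, 101, 98, 100]
```

The residual is cheap to compress precisely because prediction makes it mostly small numbers near zero.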
Once decoded, the audio and video data exist as raw samples—pixel data for video and PCM for audio. These are sent to the graphics card and audio card, respectively, for rendering on the screen and speakers.
A crucial distinction exists between containers and codecs. Containers (MP4, MOV, MKV, AVI) are formats that store, organize, and synchronize multiple media streams—video, audio, subtitles—within a single file. Codecs, on the other hand, are algorithms for the compression and decompression (encoding and decoding) of those individual streams. The industry has contributed to confusion, as container extensions (.mp4, .mov, .mkv) often hide which codecs are actually inside. For example, H.264 (MPEG-4 Part 10, or AVC) is a codec usually found within MP4 containers, but a file with a .mp4 extension might contain any of a number of codecs.
Because file extensions are frequently misleading, tools like VLC and FFmpeg parse the file’s content to determine the true format. While the extension suggests a likely container, both tools will open the file, attempt to demux it based on container contents, and prioritize decoding modules as needed. This ensures support for files with mismatched or incorrect extensions, offering robust real-world compatibility—a necessity, given the prevalence of malformed or mislabeled files.
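A toy version of content-based probing might look like the following. Real probing is far more thorough (FFmpeg, for instance, scores the data against every registered demuxer), and the signatures here cover only a few common containers:

```python
def sniff_container(data: bytes) -> str:
    """Guess the container from leading bytes, ignoring the file extension."""
    if data[4:8] == b"ftyp":                   # ISO BMFF box: MP4/MOV family
        return "mp4/mov"
    if data[:4] == b"\x1aE\xdf\xa3":           # EBML header: Matroska/WebM
        return "mkv/webm"
    if data[:4] == b"RIFF" and data[8:12] == b"AVI ":
        return "avi"
    return "unknown"

# A file misnamed video.avi that actually starts with an MP4 'ftyp' box
# is still identified by its contents, not its extension:
assert sniff_container(b"\x00\x00\x00\x20ftypisom....") == "mp4/mov"
```

This is why VLC and FFmpeg shrug off mislabeled files: the bytes, not the name, decide which demuxer runs.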
Video and audio codecs achieve extraordinary compression—100 to 200 times for video—by exploiting human sensory limits. Compression algorithms remove details unlikely to be perceived by the viewer or listener: for video, this means shifting from the RGB color space to YUV, which better matches the eye’s sensitivity (luminance is preserved at higher fidelity, while chrominance is subsampled and reduced in resolution). Audio codecs similarly mimic the auditory system’s frequency response, shaping output so that information outside human perception is discarded.
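The RGB-to-YUV split and the savings from chroma subsampling can be shown directly. The coefficients below are the standard BT.601 full-range ones; exact constants vary by standard (BT.709, BT.2020) and range convention:

```python
def rgb_to_yuv(r, g, b):
    """BT.601 full-range RGB -> YUV: the luma/chroma split codecs rely on."""
    y = 0.299 * r + 0.587 * g + 0.114 * b       # luminance, kept at full res
    u = -0.169 * r - 0.331 * g + 0.5 * b + 128  # blue-difference chroma
    v = 0.5 * r - 0.419 * g - 0.081 * b + 128   # red-difference chroma
    return round(y), round(u), round(v)

# Pure gray carries no color: U and V sit at the 128 midpoint.
assert rgb_to_yuv(128, 128, 128) == (128, 128, 128)

# 4:2:0 subsampling: full-res Y, quarter-res U and V.
w, h = 1920, 1080
rgb_bytes = w * h * 3
yuv420_bytes = w * h + 2 * (w // 2) * (h // 2)
assert yuv420_bytes / rgb_bytes == 0.5   # half the raw size before any codec runs
```

Chroma subsampling alone halves the raw data, before prediction and transforms do the rest of the 100-200x.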
Each generation of codecs—MPEG-2 to H.264 to HEVC (H.265) to VVC (H.266), or VP8 to VP9 to AV1 and AV2—achieves about 30% better compression for the same subjective quality, thanks to more advanced prediction and transform algorithms. New codecs integrate collections of specialized tools and coding strategies tailored to manage a wide range of content, be it natural video, animation, or screen recordings. The trade-off is increasing complexity: the encoder must search many more possibilities and apply more advanced processes, requiring significantly more computing power.
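The roughly 30% per-generation saving compounds across codec generations; a quick calculation shows how. The 30% figure is the episode's rule of thumb, and real-world gains vary by content and encoder settings:

```python
# Relative bitrate at the same subjective quality after n generational
# jumps, assuming each generation cuts bitrate by ~30%.
def relative_bitrate(generations, per_gen_saving=0.30):
    return (1 - per_gen_saving) ** generations

# MPEG-2 -> H.264 -> HEVC -> VVC is three jumps:
assert round(relative_bitrate(3), 3) == 0.343   # ~1/3 of the MPEG-2 bitrate
```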
Compression (encoding) is computationally far more intensive than decompression (decoding) because encoding is typically done once—but decoding happens millions of times when content is distributed. Encoders need to exhaustively analyze and test many parameter combinations, consuming substantial CPU and energy resources, while decoders are optimized for fast, lightweight playback. Major platforms like YouTube re-encode popular videos using heavy-duty newer codecs to reduce long-term bandwidth and storage needs, accepting high encoder complexity for optimal distribution efficiency.
H.264 (AVC) was a turning point for video codecs. It introduced psychovisual rate distortion optimization, which targets perceptual visual quality rather than mathematical metrics like peak signal-to-noise ratio (PSNR), a measure that had often rewarded visually unsatisfactory results with "good" scores. H.264 development prioritized placing artifacts where viewers are least likely to notice them, drawing on extensive subjective testing and feedback, and drove the HD video boom.
As codecs became ever more sophisticated, patent licensing grew more onerous. AV1, developed by the Alliance for Open Media (Google, Netflix, Amazon, Apple, VideoLAN and others), emerged as a next-generation, royalty-free standard—comparable to HEVC (H.265) and with similar or better compression, but without the high licensing costs plaguing H.264 and HEVC. AV1’s deployment is increasing, but it takes years for widespread adoption in hardware and software. ...
Video and Audio Codecs
Open source projects like FFmpeg, VLC, x264, and VideoLAN thrive on community-driven development. Their philosophy centers on openness, collaborative excellence, and a meritocratic approach that has attracted thousands of contributors from every corner of the world.
Volunteer developers in open source communities are primarily motivated by their passion for the subject matter and the intellectual challenge it presents, rather than by financial incentives. Jean-Baptiste Kempf shares that many contributors began working on multimedia projects like FFmpeg and VLC out of their love for watching video or anime. They get involved because the topic interests them deeply, and they keep contributing because the work itself is rewarding. This intrinsic motivation is more meaningful than what commercial programming jobs offer, where building software for billing systems or corporate portals brings little pride or personal satisfaction.
Working on open source multimedia software gives contributors a unique pride and visibility that commercial programming rarely provides. The impact of their code—used by billions globally, from home video enthusiasts to trillion-dollar corporations—offers a sense of achievement and societal value. Telling a family member that you helped code VLC, a program enabling millions to watch videos, is relatable and impressive in a way that many standard corporate projects are not.
The FFmpeg and VLC communities function as elite schools of programming. Contributors receive rigorous code reviews from some of the world's best engineers, forcing them to confront and improve their weaknesses in real-world, high-impact projects. For example, Andrew Kelley, creator of the Zig language, was trained in the “FFmpeg school.” This environment encourages transparency, humility, and growth, as participants learn from constructive criticism and are held to a global standard of technical excellence.
Open source flourishes under a spectrum of licenses. Permissive licenses like MIT and BSD allow anyone to use, modify, and relicense the code—including for proprietary purposes or without attribution. In contrast, copyleft licenses, such as the GPL (General Public License) and LGPL (Lesser General Public License), require any modifications or derivatives to be shared back with the community under the same license, ensuring that improvements remain open. These licenses function as a social contract, aligning a diverse global community around shared values.
Changing the license of a software project like VLC from GPL to LGPL is a legally and logistically daunting task because every contributor retains copyright to their own code. For relicensing, all contributors—at times more than 350 individuals—must be contacted to obtain legal permission, reflecting the collective nature of open source copyright. This sometimes requires extraordinary efforts, including locating contributors or their families years later, and underscores the collaborative and deeply personal commitments within these communities.
VideoLAN, the entity behind VLC, has no office or employees, operating as a distributed non-profit. This organizational structure makes it resilient against governmental or legal pressures directed at individuals or centralized entities. The open source codebases remain accessible and survivable even if any person is removed from the project, ensuring project continuity and legal flexibility against shutdown attempts or restrictive regulations.
The governance of open source communities like FFmpeg and VLC centers on long-term code quality rather than rapid accumulation of features or speed of merging contributions. The core team—around five for VLC and 10–15 for FFmpeg—are responsible for maintaining the codebase and thus enforce uncomprom ...
Open Source Philosophy and Community
Low-level assembly programming, especially in video and multimedia projects like dav1d (the AV1 decoder), stands as a testament to the incredible performance gains and artistry possible when humans directly leverage CPU capabilities. Practitioners in this field defy conventional assumptions about compiler optimization, hardware abstraction, and the boundaries of computational creativity.
Kieran Kunhya and Jean-Baptiste Kempf assert that hand-written assembly, especially in SIMD (Single Instruction, Multiple Data) workloads, dramatically outperforms C and even the best compiler auto-vectorization. Lex Fridman notes performance improvements of up to 62x compared to C code—a difference measured not in percentages but in orders of magnitude. Despite debates in the software community, with many claiming that compilers and intrinsics can match handwritten assembly, repeated benchmarking across hundreds of examples shows that human-crafted assembly consistently wins, particularly where every CPU cycle matters. Kunhya explains that some SIMD-based functions achieve 10x to 50x speedups, with 62x not unheard of, enabling real-time performance on modest hardware that would otherwise require far more powerful machines.
The dav1d project illustrates the scope and necessity of extreme hand-tuned optimization. Kempf describes dav1d as "beyond insane," with over 240,000 lines of handwritten assembly and only 30,000 lines of C code. This level of optimization is required because the decoder runs on billions of devices for video playback of formats like AV1, where there may be no hardware decoder available. Efficient decoding on CPUs—often only one or two cores—enables playback on devices ranging from streaming sticks to smartphones. Netflix and YouTube, for instance, rely extensively on software decoders, and every saved cycle translates into saved power, resources, and user satisfaction.
Contrary to common belief, modern compilers—despite aggressive auto-vectorization—cannot match the intricate optimizations possible in hand-written assembly. Kunhya and Kempf explain that compilers lack deep, practical awareness of exact CPU pipeline characteristics, cache architecture, memory bus bottlenecks, and subtle instruction-level parallelism. The flexibility of assembly lets programmers optimally pack registers, avoid memory stalls, and schedule instructions far more efficiently, exploiting every potential of the hardware for maximum throughput. This level of optimization is indispensable in real-time or resource-constrained applications.
Assembly programming for SIMD fundamentally transforms how computation is performed. Kunhya clarifies that, unlike scalar code (which processes one element at a time), SIMD lets a single instruction operate on an entire vector (such as 16 pixels), making it ideal for video, audio, and multimedia. Programmers must learn specialized instructions and architectures to utilize vector registers across different platforms (x86, ARM, etc.), as each has distinct methods for exploiting parallelism.
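Real SIMD requires platform intrinsics or assembly, but the core idea, one operation updating many packed lanes at once, can be demonstrated portably with SWAR ("SIMD within a register"), packing four 8-bit pixels into one integer. Note this sketch does a wrapping add, whereas video kernels often use saturating adds:

```python
HI = 0x80808080          # top bit of each 8-bit lane
LO = 0x7F7F7F7F          # low 7 bits of each lane

def add_4_pixels(a, b):
    """Add four packed 8-bit pixels in one pass, wrapping per lane.

    The low 7 bits add normally; each lane's top bit is fixed up with XOR
    so no carry ever leaks into the neighboring pixel. SIMD hardware
    applies the same lane isolation across 16, 32, or 64 lanes per
    instruction.
    """
    return ((a & LO) + (b & LO)) ^ ((a ^ b) & HI)

a = 0x10FF2030           # pixels 0x10, 0xFF, 0x20, 0x30
b = 0x01020304
# 0xFF + 0x02 wraps to 0x01 inside its lane without touching pixel 0x10:
assert add_4_pixels(a, b) == 0x11012334
```

One `add_4_pixels` call does the work of four scalar additions; a 512-bit AVX-512 register does the work of sixty-four, which is where the order-of-magnitude speedups come from.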
Software like FFmpeg and dav1d performs runtime processor detection to ensure that the best possible code path is chosen for each execution environment. Depending on the detected CPU features—such as AVX, AVX2, NEON, SVE, or others—function pointers and code paths are assigned to leverage every available hardware extension. This approach maximizes performance across both new and legacy hardware, but requires maintaining multiple optimized code paths for different instruction sets.
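The function-pointer dispatch pattern can be sketched as follows. CPU-feature detection is stubbed out here (real code reads CPUID on x86 or getauxval on ARM Linux), and the "AVX2" kernel is just a placeholder that honors the same contract as the scalar baseline:

```python
# Sketch of dav1d/FFmpeg-style runtime dispatch: probe CPU features once
# at startup, then point each hot function at the best available kernel.

def detect_cpu_features():
    return {"avx2"}          # stub: pretend this machine reports AVX2

def add_scalar(xs, ys):
    return [x + y for x, y in zip(xs, ys)]

def add_avx2(xs, ys):
    # Stand-in for a hand-written AVX2 kernel; same contract, faster path.
    return [x + y for x, y in zip(xs, ys)]

def pick_kernels(features):
    table = {"add": add_scalar}              # safe baseline, always present
    if "avx2" in features:
        table["add"] = add_avx2              # upgrade when hardware allows
    return table

dsp = pick_kernels(detect_cpu_features())
assert dsp["add"]([1, 2], [3, 4]) == [4, 6]
assert dsp["add"] is add_avx2
```

The detection cost is paid once; after that, every call goes straight through the chosen pointer with no per-call branching.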
Profound understanding of cache, memory hierarchy (L1, L2, L3), and architectural specifics is essential for performance. Kempf recounts his experience with the Itanium processor, where the floating-point throughput could exceed memory bandwidth by a 4:1 ratio, requiring intricate reuse of registers and memory packing. Hand-written assembly can address such scenarios with tailored memory access patterns, minimizing cache misses, and optimizing register allocation in ways no high-level language or compiler abstraction can replicate.
Low-level Optimization and Assembly Programming
Critical digital infrastructure projects like FFmpeg and VLC underpin much of the modern multimedia and software ecosystem, yet face serious challenges related to maintainer burnout, government and corporate pressures, and financial sustainability.
Projects such as FFmpeg and VLC, essential to global digital infrastructure, rely almost entirely on a small number of unpaid volunteers. Often, just 10-15 core developers sustain massive software ecosystems; even more strikingly, crucial components like libxml or XZ have had only a single maintainer at times. When these individuals burn out or are driven away, major security and functionality risks arise for the vast web of software that depends upon them.
FFmpeg and VLC receive a flood of unfunded requests and bug reports, driven even further in recent years by AI-generated vulnerability reports. The security industry frequently marks trivial or highly niche issues as urgent, not accounting for the realities of volunteer-driven maintenance. This can resemble a denial of service attack on developer attention, compounding psychological stress and increasing burnout risk. Maintainers are also vulnerable to hostile behavior; for instance, Jean-Baptiste Kempf received death threats for ceasing support for VLC on PowerPC, highlighting not just technical but emotional labor burdens.
The recent XZ backdoor incident dramatically illustrated the dangers of this model. XZ, used on millions of installations, was maintained by just one person who, under sustained social engineering and pressure, relinquished control to attackers. This incident exposed the fragility of infrastructure when critical projects rely on overwhelmed, unsupported volunteers.
Major corporations, including Microsoft and Google, often treat open source projects as if they are conventional vendors, demanding urgent action on their priorities but providing little meaningful support. Microsoft even offered only a token one-time payment after pressing XZ maintainers for critical bug fixes, exemplifying broader systemic disregard for the human limits and needs behind open source software.
Open source projects also face significant pressure from governments and large companies. For instance, various governments have sought to introduce backdoors into VLC for surveillance purposes. The project has always firmly refused such requests, with Jean-Baptiste Kempf stating that VLC would rather shut down than compromise its integrity. VLC’s offline, telemetry-free model is designed specifically to protect end users, resisting weaponization or censorship requests regardless of their source.
Corporations like Microsoft and Google depend heavily on FFmpeg and similar projects but often do not reciprocate with proportional support. Companies sometimes conflate public bug trackers with service desks, misunderstanding volunteer-driven ecosystems and failing to engage through appropriate channels or contracts. The result is sustained pressure on maintainers who must field high-priority demands without organizational backing.
Licensing costs and intellectual property also drive technical and organizational choices. Traditional video codecs like H.264 and especially HEVC have grown plagued with expensive, complicated patent pools, making licensing prohibitively costly for wide distribution. Massive streaming and hardware companies—facing fees that could exceed hundreds of millions of dollars annually—have responded by forming the Alliance for Open Media and developing royalty-free alternatives like AV1 and AV2. These new codecs, designed to sidestep patent traps, are increasingly vital for sustainable infrastructure.
France’s legal rejecti ...
Sustainability and Challenges of Critical Infrastructure
The trajectory of multimedia technology is rapidly expanding beyond traditional audio and video streams, thanks to innovations in open-source frameworks, ultra-low-latency systems, and the early stages of brain-computer interface (BCI) adoption. Thought leaders like Jean-Baptiste Kempf, Kieran Kunhya, and Lex Fridman discuss how tools such as VLC and FFmpeg are evolving to meet complex sensory, archival, and control needs of the future.
Jean-Baptiste Kempf defines multimedia as any digital representation of multiple synchronized data streams for human senses, not simply audio and video. He envisions a near future where FFmpeg could handle "odour sensors," diffusers, and new data types. He emphasizes modularity: “We need to work with the architecture so that modules can be added to add future capabilities. And if it’s brainwaves, it’s going to be brainwaves.” Kieran Kunhya humorously speculates on formats like “stereo smell” via left and right nose tracks, illustrating the inevitability of frameworks accommodating exotic sensory streams.
VLC already features a plugin for Aptiq, used in 4D cinemas to synchronize and transport physical movement data along with media, even if such plugins are not in the mainstream release. Datasets representing touch, scent, or future sensory streams can be synchronized just as audio and video are today.
Kempf highlights active research in codecs for point cloud and volumetric video, as well as RGBD (color plus depth) data, all vital for applications in robotics and advanced 3D experiences. Both VLC and FFmpeg are being adapted to manage and archive 3D data, supporting future workflows.
Digital preservation is a critical concern for archival communities, who partner with open-source projects to ensure long-term playback and integrity of digital information. Kieran Kunhya points to FFmpeg serving as a “Rosetta Stone” for future multimedia accessibility. Dave Rice and his colleagues in the archiving community funded development of the FFV1 codec: a fast, mathematically lossless codec that preserves every bit of original data, vital for historical, scientific, and artistic records. FFV1 supports GPU encoding for speed and is resilient: if some data is corrupted in storage, the rest can still be recovered rather than the entire file being lost.
These open workflows democratize preservation, allowing even low-budget or volunteer-run institutions—like those teaching FFmpeg in India—to archive and recover unique analog and digital media. This ethos protects against scenarios like the UK’s BBC Domesday Project, whose media became unreadable due to hardware obsolescence, and serves as a model for future digital stewardship.
Lex Fridman and Jean-Baptiste Kempf discuss Kyber, a new initiative targeting ultra-low-latency streaming for real-time control of robots, drones, or remote vehicles. Achieving under 10 millisecond “glass-to-glass” latency (from camera capture to display) is critical, as milliseconds directly influence safety, reaction time, and usability in feedback scenarios—unlike passive media consumption.
Kempf details Kyber’s progress: with current encoders and decoders (NVIDIA, Intel), total latencies around 6-7 milliseconds are achievable, approaching the target of four milliseconds per frame at 240 Hz. Faster encoders and specialized codecs are needed for further reductions.
A major challenge is synchronizing data streams from multiple sensors and cameras, as clock drift leads to desynchronization over time. Kyber employs broadcast-standard timestamping and server-side mechanisms to align time across feeds. Consistent and precise alignment is crucial for real-time AI model training, playback, and control—especially as robots routinely operate with many cameras and sensors.
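A minimal sketch of cross-clock timestamp alignment: model the sensor clock as offset-plus-drift relative to a master clock using two sync points, then map every sensor timestamp onto the master timeline. Production systems (e.g., PTP or broadcast timestamping) do this continuously with filtering rather than from two samples:

```python
def fit_clock(master_times, sensor_times):
    """Linear clock model from two sync exchanges: sensor = offset + rate * master."""
    (m0, m1), (s0, s1) = master_times, sensor_times
    rate = (s1 - s0) / (m1 - m0)       # sensor seconds per master second (drift)
    offset = s0 - rate * m0
    return rate, offset

def to_master(sensor_t, rate, offset):
    """Map a sensor timestamp back onto the master timeline."""
    return (sensor_t - offset) / rate

# Sensor clock starts 2.0 s ahead and runs 0.01% fast:
rate, offset = fit_clock((0.0, 100.0), (2.0, 102.01))
assert abs(to_master(102.01, rate, offset) - 100.0) < 1e-6
assert abs(to_master(52.005, rate, offset) - 50.0) < 1e-6
```

Without the rate term, the 0.01% drift alone would desynchronize feeds by a full video frame (about 4 ms at 240 Hz) every 40 seconds, which is why offset-only correction is not enough.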
Kyber replaces cl ...
Future of Multimedia and Emerging Applications
