Engineering Feed
A curated reading feed of engineering blogs I follow — from teams building real systems at scale.
Engineering and algorithmic interventions for multimodal post-training at Microsoft scale
Aditya Challapally leads post-training research and infrastructure for Copilot agent capabilities that process millions of multimodal interactions. This post builds on the diagnostics from Diagnosing instability in production-scale agent reinforcement learning with the engineering and algorithmic interventions we developed to get the best results out of post-training at scale. Post-training multimodal agents at scale […] The post Engineering and algorithmic interventions for multimodal post-training at Microsoft scale appeared first on Engineering@Microsoft.
Intercepting messages inside IsDialogMessage, fine-tuning the message filter
Making sure it triggers when you need it, and not when you don't. The post Intercepting messages inside IsDialogMessage, fine-tuning the message filter appeared first on The Old New Thing.
Digital Transformation at Santander: How Platform Engineering is Revolutionizing Cloud Infrastructure
Santander faced a significant technical challenge in managing an infrastructure that processes billions of daily transactions across more than 200 critical systems. The solution emerged through an innovative platform engineering initiative called Catalyst, which transformed the bank's cloud infrastructure and development management. This post analyzes the main cases, benefits, and results obtained with this initiative.
Using LLMs to amplify human labeling and improve Dash search relevance
How we train Dash's search ranking models with a mix of human and LLM-assisted labeling.
Intercepting messages inside IsDialogMessage, installing the message filter
Using an IsDialogMessage extension point. The post Intercepting messages inside IsDialogMessage, installing the message filter appeared first on The Old New Thing.
Superuser Gateway: Guardrails for Privileged Command Execution
Learn how Uber’s new Superuser Guardrails turn risky manual commands into peer-reviewed, machine-validated changes, and how to apply this pattern to your own systems.
6,000 AWS accounts, three people, one platform: Lessons learned
This post describes why ProGlove chose an account-per-tenant approach for our serverless SaaS architecture and how it changes the operational model. It covers the challenges you need to anticipate around automation, observability and cost. We will also discuss how the approach can affect other operational models in different environments like an enterprise context.
Improving Search Ranking for Maps
How Airbnb is adapting ranking for our map interface.
Building a Next-Generation Key-Value Store at Airbnb
How we completely rearchitected Mussel, our storage engine for derived data, and lessons learned from the migration from Mussel V1 to V2.
From Static Rate Limiting to Adaptive Traffic Management in Airbnb’s Key-Value Store
How Airbnb hardened Mussel, our key-value store, with smarter traffic controls to stay fast and reliable during traffic spikes.
Pay as a Local
How Airbnb rolled out 20+ locally relevant payment methods worldwide in just 14 months
Academic Publications & Airbnb Tech: 2025 Year in Review
2025 was a big year for research at Airbnb, as we made significant progress toward our mission to use AI, data science, and machine learning to become the best travel and living platform.
Safeguarding Dynamic Configuration Changes at Scale
How Airbnb ships dynamic config changes safely and reliably
My Journey to Airbnb — Anna Sulkina
Anna Sulkina has always been a traveler, and we’re lucky her travels have brought her to Airbnb. Anna is a Senior Director of Engineering, and she’s responsible for Application & Cloud infrastructure.
Intercepting messages before IsDialogMessage can process them
Process the message before you let IsDialogMessage see it. The post Intercepting messages before IsDialogMessage can process them appeared first on The Old New Thing.
RCCLX: Innovating GPU Communications on AMD Platforms
We are open-sourcing the initial version of RCCLX – an enhanced version of RCCL that we developed and tested on Meta’s internal workloads. RCCLX is fully integrated with Torchcomms and aims to empower researchers and developers to accelerate innovation, regardless of their chosen backend. Communication patterns for AI models are constantly evolving, as are hardware [...] The post RCCLX: Innovating GPU Communications on AMD Platforms appeared first on Engineering at Meta.
Customizing the ways the dialog manager dismisses itself: Isolating the Close pathway
Intercepting the flow in your message loop. The post Customizing the ways the dialog manager dismisses itself: Isolating the Close pathway appeared first on The Old New Thing.
MediaFM: The Multimodal AI Foundation for Media Understanding at Netflix
Customizing the ways the dialog manager dismisses itself: Detecting the ESC key, second (failed) attempt
Sniffing the synchronous keyboard state is still not precise enough. The post Customizing the ways the dialog manager dismisses itself: Detecting the ESC key, second (failed) attempt appeared first on The Old New Thing.
The 2026/2027 Seattle Symphony subscription season at a glance
The pocket reference guide for 2026/2027. The post The 2026/2027 Seattle Symphony subscription season at a glance appeared first on The Old New Thing.
Customizing the ways the dialog manager dismisses itself: Detecting the ESC key, first (failed) attempt
Sniffing the asynchronous keyboard state. The post Customizing the ways the dialog manager dismisses itself: Detecting the ESC key, first (failed) attempt appeared first on The Old New Thing.
Our Multi-Agent Architecture for Smarter Advertising
When we kicked this off, we weren’t trying to ship an “AI feature.” We were trying to fix a structural... The post Our Multi-Agent Architecture for Smarter Advertising appeared first on Spotify Engineering.
Exploring the signals the dialog manager uses for dismissing a dialog
Summarizing the flow. The post Exploring the signals the dialog manager uses for dismissing a dialog appeared first on The Old New Thing.
Database Federation: Decentralized and ACL-Compliant Hive™ Databases
Uber’s 10PB, 16K-dataset Hive monolith for the Delivery business had huge limitations. See how we transformed it into a secure, scalable, decentralized platform with zero downtime and saved more than 1PB along the way.
Could WriteProcessMemory be made faster by avoiding the intermediate buffer?
I guess it could, but why bother? The post Could WriteProcessMemory be made faster by avoiding the intermediate buffer? appeared first on The Old New Thing.
Teaching AI to read a map
Machine Perception
Microspeak: Escrow
Final build, final, final, final 2, ship this one. The post Microspeak: Escrow appeared first on The Old New Thing.
It rather involved being on the other side of the airtight hatchway: Tricking(?) a program into reading files
Is it really a trick when reading the file is the purpose of the program? The post It rather involved being on the other side of the airtight hatchway: Tricking(?) a program into reading files appeared first on The Old New Thing.
How can I distinguish between the numeric keypad 0 and the top-row 0 in the WM_CHAR message?
See if it matches the scan code. The post How can I distinguish between the numeric keypad 0 and the top-row 0 in the WM_CHAR message? appeared first on The Old New Thing.
Scaling LLM Post-Training at Netflix
How low-bit inference enables efficient AI
Making products like Dropbox Dash accessible to individuals and businesses means tackling new challenges around efficiency and resource use.
How can I distinguish between the numeric keypad 0 and the top-row 0 in the WM_KEYDOWN message?
Check whether it is an extended key. The post How can I distinguish between the numeric keypad 0 and the top-row 0 in the WM_KEYDOWN message? appeared first on The Old New Thing.
Uber’s Rate Limiting System
Discover how Uber built and automated a global rate-limiting system that protects millions of RPCs per second, improving reliability, reducing latency, and simplifying operations across our service mesh.
Automating RDS Postgres to Aurora Postgres Migration
How we built the Microsoft Learn MCP Server
When we launched the Microsoft Learn Model Context Protocol (MCP) Server in June 2025, our goal was simple: make it effortless for AI agents to use trusted, up-to-date Microsoft Learn documentation. GitHub Copilot and other agents are increasingly common, and they need to be able to ground responses just like humans with browsers do. Learn […] The post How we built the Microsoft Learn MCP Server appeared first on Engineering@Microsoft.
The Death of Traditional Testing: Agentic Development Broke a 50-Year-Old Field, JiTTesting Can Revive It
The rise of agentic software development means code is being written, reviewed, and shipped faster than ever before across the entire industry. It also means that testing frameworks need to evolve for this rapidly changing landscape. Faster development demands faster testing that can catch bugs as they land in a codebase, without [...] The post The Death of Traditional Testing: Agentic Development Broke a 50-Year-Old Field, JiTTesting Can Revive It appeared first on Engineering at Meta.
Insights from our executive roundtable on AI and engineering productivity
From Claude Code to Cursor, we're big adopters of AI coding tools at Dropbox. The early results have been promising, but there are still a lot of open questions about how to work with these tools most effectively and where they can have the most impact. To push this conversation forward, we hosted an executive roundtable at our San Francisco studio. Here's how it went.
How do I suppress the hover effects when I put a Win32 common controls ListView in single-click mode?
You can prevent the item from becoming hot-tracked. The post How do I suppress the hover effects when I put a Win32 common controls ListView in single-click mode? appeared first on The Old New Thing.
Scheduling in a changing world: Maximizing throughput with time-varying capacity
Algorithms & Theory
Beyond one-on-one: Authoring, simulating, and testing dynamic human-AI group conversations
Human-Computer Interaction and Visualization
How did Windows 95 get permission to put the Weezer video Buddy Holly on the CD?
Asking nicely, and asking a lot of people. The post How did Windows 95 get permission to put the Weezer video Buddy Holly on the CD? appeared first on The Old New Thing.
How AI trained on birds is surfacing underwater mysteries
Climate & Sustainability
Building Prometheus: How Backend Aggregation Enables Gigawatt-Scale AI Clusters
We’re sharing details of the role backend aggregation (BAG) plays in building Meta’s gigawatt-scale AI clusters like Prometheus. BAG allows us to seamlessly connect thousands of GPUs across multiple data centers and regions. Our BAG implementation is connecting two different network fabrics – Disaggregated Schedule Fabric (DSF) and Non-Scheduled Fabric (NSF). Once it’s complete our AI [...] The post Building Prometheus: How Backend Aggregation Enables Gigawatt-Scale AI Clusters appeared first on Engineering at Meta.
What should I do if a wait call reports WAIT_ABANDONED?
It's your one chance to make amends. The post What should I do if a wait call reports WAIT_ABANDONED? appeared first on The Old New Thing.
How We Release the Spotify App: A Look Under the Hood (Part 2)
In Part 2, we will peek under the hood at the tooling that makes the Spotify release process possible. The post How We Release the Spotify App: A Look Under the Hood (Part 2) appeared first on Spotify Engineering.
How can I prevent the user from changing the widths of ListView columns in version 5 of the common controls?, part 2
Preventing the resize cursor from appearing. The post How can I prevent the user from changing the widths of ListView columns in version 5 of the common controls?, part 2 appeared first on The Old New Thing.
How Convera built fine-grained API authorization with Amazon Verified Permissions
In this post, we share how Convera used Amazon Verified Permissions to build a fine-grained authorization model for their API platform.
How can I prevent the user from changing the widths of ListView columns in version 5 of the common controls?
Deny changes to the width. The post How can I prevent the user from changing the widths of ListView columns in version 5 of the common controls? appeared first on The Old New Thing.
Introducing uForwarder: The Consumer Proxy for Kafka Async Queuing
Uber processes trillions of Kafka messages per day on a push-based consumer proxy in real time. Read this blog to learn about the thinking behind open source uForwarder before applying it to your use cases.
How AI tools can redefine universal design to increase accessibility
Education Innovation
No Display? No Problem: Cross-Device Passkey Authentication for XR Devices
We’re sharing a novel approach to enabling cross-device passkey authentication for devices with inaccessible displays (like XR devices). Our approach bypasses the use of QR codes and enables cross-device authentication without the need for an on-device display, while still complying with all trust and proximity requirements. This approach builds on work done by the FIDO [...] The post No Display? No Problem: Cross-Device Passkey Authentication for XR Devices appeared first on Engineering at Meta.
Mastering millisecond latency and millions of events: The event-driven architecture behind the Amazon Key Suite
In this post, we explore how the Amazon Key team used Amazon EventBridge to modernize their architecture, transforming a tightly coupled monolithic system into a resilient, event-driven solution. We explore the technical challenges we faced, our implementation approach, and the architectural patterns that helped us achieve improved reliability and scalability. The post covers our solutions for managing event schemas at scale, handling multiple service integrations efficiently, and building an extensible architecture that accommodates future growth.
Sequential Attention: Making AI models leaner and faster without sacrificing accuracy
Algorithms & Theory
Super Bowl LX creates an opportunity for symphonic friendly wagering
Betting classical music. The post Super Bowl LX creates an opportunity for symphonic friendly wagering appeared first on The Old New Thing.
How can I prevent the user from changing the widths of ListView columns?
You can ask the header to be non-resizing. The post How can I prevent the user from changing the widths of ListView columns? appeared first on The Old New Thing.
Collaborating on a nationwide randomized study of AI in real-world virtual care
Generative AI
Some small stories about the giant satellite dish antenna that was behind Microsoft Building 11
A little trivia. The post Some small stories about the giant satellite dish antenna that was behind Microsoft Building 11 appeared first on The Old New Thing.
Studying compiler error messages closely: Input file paths
Are you even compiling the correct file? The post Studying compiler error messages closely: Input file paths appeared first on The Old New Thing.
Sovereign failover – Design for digital sovereignty using the AWS European Sovereign Cloud
This post explores the architectural patterns, challenges, and best practices for building cross-partition failover, covering network connectivity, authentication, and governance. By understanding these constraints, you can design resilient cloud-native applications that balance regulatory compliance with operational continuity.
My Journey to Airbnb: Peter Coles
The story of Airbnb’s Head Economist for Policy and Director of Data Science involves geology, co-teaching with a Nobel Prize winner, and CSI. (No, not the hit TV franchise.)
Why not store the SAFEARRAY reference count as a hidden allocation next to the SAFEARRAY?
The case of "Bring your own SAFEARRAY." The post Why not store the SAFEARRAY reference count as a hidden allocation next to the SAFEARRAY? appeared first on The Old New Thing.
Announcing the AWS Digital Sovereignty Well-Architected Lens
As organizations accelerate cloud adoption, meeting digital sovereignty requirements has become essential to build trust with customers and regulators worldwide. The challenge isn’t whether to adopt the cloud—it’s how to do so while meeting sovereignty requirements, using a multidisciplinary approach. Even though requirements vary by geography, organizations commonly address them through technical and operational controls […]
How Artera enhances prostate cancer diagnostics using AWS
In this post, we explore how Artera used Amazon Web Services (AWS) to develop and scale their AI-powered prostate cancer test, accelerating time to results and enabling personalized treatment recommendations for patients.
How can I retain access to the data in a SAFEARRAY after my method returns?
Find a way to take ownership. The post How can I retain access to the data in a SAFEARRAY after my method returns? appeared first on The Old New Thing.
How Uber Scaled Data Replication to Move Petabytes Every Day
Uber prioritizes a reliable data lake, which is distributed across on-premises and cloud environments. This multi-region setup presents challenges for ensuring reliable and timely data access due to limited network bandwidth and the need for seamless data availability, particularly for disaster recovery. Uber uses the Hive Sync service, which uses Apache Hadoop® Distcp (Distributed Copy) for data replication. However, with Uber’s Data Lake exceeding 350 PB, Distcp’s limitations became apparent. This blog explores the optimizations made to Distcp to enhance its performance and meet Uber’s growing data replication and disaster recovery needs across its distributed infrastructure.
Diagnosing instability in production-scale agent reinforcement learning
On January 28, 2026, Hugging Face announced that they have upstreamed the Post-Training Toolkit into TRL as a first-party integration, making these diagnostics directly usable in production RL and agent post-training pipelines. This enables closed-loop monitoring and control patterns that are increasingly necessary for long-running and continuously adapted agent systems. Documentation @ https://huggingface.co/docs/trl/main/en/ptt_integration. Overview In […] The post Diagnosing instability in production-scale agent reinforcement learning appeared first on Engineering@Microsoft.
Engineering VP Josh Clemm on how we use knowledge graphs, MCP, and DSPy in Dash
Engineering VP Josh Clemm deep-dives into how we think about knowledge graphs, indexes, MCP, and prompt optimization using tools like DSPy.
Why did I lose the data even though I called SafeArrayAddRef?
You have to use the original pointer, but even that won't be good enough. The post Why did I lose the data even though I called SafeArrayAddRef? appeared first on The Old New Thing.
Towards a science of scaling agent systems: When and why agent systems work
Generative AI
ATLAS: Practical scaling laws for multilingual models
Generative AI
Rust at Scale: An Added Layer of Security for WhatsApp
WhatsApp has adopted and rolled out a new layer of security for users – built with Rust – as part of its effort to harden defenses against malware threats. WhatsApp’s experience creating and distributing our media consistency library in Rust to billions of devices and browsers proves Rust is production ready at a global scale. [...] The post Rust at Scale: An Added Layer of Security for WhatsApp appeared first on Engineering at Meta.
A digression on the design and implementation of SafeArrayAddRef and extending APIs in general
The concerns when adding a feature to an existing API. The post A digression on the design and implementation of SafeArrayAddRef and extending APIs in general appeared first on The Old New Thing.
The AI Evolution of Graph Search at Netflix
What’s the difference between SafeArrayAccessData and SafeArrayAddRef?
Two ways of preserving the data. The post What’s the difference between SafeArrayAccessData and SafeArrayAddRef? appeared first on The Old New Thing.
Introducing GIST: The next stage in smart sampling
Algorithms & Theory
C++ has scope_exit for running code at scope exit. C# says “We have scope_exit at home.”
You can wrap it in an IDisposable. The post C++ has scope_exit for running code at scope exit. C# says “We have scope_exit at home.” appeared first on The Old New Thing.
Small models, big results: Achieving superior intent extraction through decomposition
Generative AI
Congratulations to the recipients of the 2025 Spotify FOSS Fund
TL;DR Established in 2022 as a way to help support the great open source ecosystem projects that Spotify... The post Congratulations to the recipients of the 2025 Spotify FOSS Fund appeared first on Spotify Engineering.
A simple helper function for attaching a progress handler to a Windows Runtime IAsyncActionWithProgress or IAsyncOperationWithProgress
It doesn't do much, but it saves typing. The post A simple helper function for attaching a progress handler to a Windows Runtime IAsyncActionWithProgress or IAsyncOperationWithProgress appeared first on The Old New Thing.
On the proper usage of a custom Win32 dialog class
You are replacing the window procedure, not the dialog procedure. The post On the proper usage of a custom Win32 dialog class appeared first on The Old New Thing.
Microspeak: On fire, putting out fires
Dealing with emergencies. The post Microspeak: On fire, putting out fires appeared first on The Old New Thing.
What was the secret sauce that allows for a faster restart of Windows 95 if you hold the shift key?
An old flag from 16-bit Windows. The post What was the secret sauce that allows for a faster restart of Windows 95 if you hold the shift key? appeared first on The Old New Thing.
How can I get the tab index number from a dialog box control?
The tab index number is an authoring concept, not a runtime concept. The post How can I get the tab index number from a dialog box control? appeared first on The Old New Thing.
Apache Hudi™ at Uber: Engineering for Trillion-Record-Scale Data Lake Operations
Check out this deep dive into how Uber runs Apache Hudi™ at extreme scale—handling trillions of records, petabytes of data, and high-concurrency table services across regions.
Unlocking health insights: Estimating advanced walking metrics with smartwatches
Health & Bioscience
When programs assume that the system will never change, episode 4: Stealing strings
The strings are an implementation detail. The post When programs assume that the system will never change, episode 4: Stealing strings appeared first on The Old New Thing.
Adapting the Facebook Reels RecSys AI Model Based on User Feedback
We’ve improved personalized video recommendations on Facebook Reels by moving beyond metrics such as likes and watch time and directly leveraging user feedback. Our new User True Interest Survey (UTIS) model now helps surface more niche, high-quality content and boosts engagement, retention, and satisfaction. We’re doubling down on personalization, tackling challenges like sparse user data [...] The post Adapting the Facebook Reels RecSys AI Model Based on User Feedback appeared first on Engineering at Meta.
Clipping the focus item when looking for its on-screen location, part 3
Finding all the clipping parents. The post Clipping the focus item when looking for its on-screen location, part 3 appeared first on The Old New Thing.
Hard-braking events as indicators of road segment crash risk
Algorithms & Theory
Next generation medical image interpretation with MedGemma 1.5 and medical speech to text with MedASR
Generative AI
Dynamic surface codes open new avenues for quantum error correction
Quantum
Clipping the focus item when looking for its on-screen location, part 2
Finding the correct clipping parent. The post Clipping the focus item when looking for its on-screen location, part 2 appeared first on The Old New Thing.
How Uber Conquered Database Overload: The Journey from Static Rate-Limiting to Intelligent Load Management
🧠 Overload in stateful databases isn’t one-dimensional. See how we built an intelligent load manager that sheds smarter, adapts to diverse signals, and stays fair under pressure. Even better, this approach led to a ~70% reduction in P99 latency.
How Salesforce migrated from Cluster Autoscaler to Karpenter across their fleet of 1,000 EKS clusters
This blog post examines how Salesforce, operating one of the world's largest Kubernetes deployments, successfully migrated from Cluster Autoscaler to Karpenter across their fleet of 1,000-plus Amazon Elastic Kubernetes Service (Amazon EKS) clusters.
CSS at Scale With StyleX
Build a large enough website with a large enough codebase, and you’ll eventually find that CSS presents challenges at scale. It’s no different at Meta, which is why we open-sourced StyleX, a solution for CSS at scale. StyleX combines the ergonomics of CSS-in-JS with the performance of static CSS. It allows atomic styling of components [...] The post CSS at Scale With StyleX appeared first on Engineering at Meta.
NeuralGCM harnesses AI to better simulate long-range global precipitation
Climate & Sustainability
Clipping the focus item when looking for its on-screen location
Preventing the cursor from pointing to nothing. The post Clipping the focus item when looking for its on-screen location appeared first on The Old New Thing.
Using Active Accessibility to find out where the focus item is
Looking at child objects. The post Using Active Accessibility to find out where the focus item is appeared first on The Old New Thing.
Using Active Accessibility to find out where the Windows caret is
It's old and rather simple, but we like simple. The post Using Active Accessibility to find out where the Windows caret is appeared first on The Old New Thing.
Code of conduct
Airbnb's Code of conduct for Open Source.
How can I find out where the Windows caret is?
You'll have to go to a larger scope. The post How can I find out where the Windows caret is? appeared first on The Old New Thing.
Why We Use Separate Tech Stacks for Personalization and Experimentation
The technical and practical rationale for a clear separation between these domains. The post Why We Use Separate Tech Stacks for Personalization and Experimentation appeared first on Spotify Engineering.
From Monitoring to Observability: Our Ultra-Marathon to a Cloud-Native Platform
Managing a global corporate network at Uber’s scale can feel a bit like running an ultra-marathon. There are long stretches of smooth sailing, but you’re always preparing for the unexpected mountain pass or sudden change in weather. For years, our engineering teams have navigated this terrain with a traditional, monolithic monitoring system. We knew we needed to switch to a modern pair of carbon-fiber running shoes. This meant a complete overhaul: a journey to replace our legacy system with a cloud-native observability platform built for speed, flexibility, and endurance on an open-source stack.
Swapping two blocks of memory that reside inside a larger block, in constant memory, refinement
Could do with a little less rotating. The post Swapping two blocks of memory that reside inside a larger block, in constant memory, refinement appeared first on The Old New Thing.
How can you swap two non-adjacent blocks of memory using only forward iterators?
Applying the rotation trick to our new problem. The post How can you swap two non-adjacent blocks of memory using only forward iterators? appeared first on The Old New Thing.
How can you swap two adjacent blocks of memory using only forward iterators?
A different algorithm, employing a different kind of cleverness. The post How can you swap two adjacent blocks of memory using only forward iterators? appeared first on The Old New Thing.
Swapping two blocks of memory that reside inside a larger block, in constant memory
A variation on the constant-memory rotation. The post Swapping two blocks of memory that reside inside a larger block, in constant memory appeared first on The Old New Thing.
2025 year-end link clearance
Another year gets relegated to history. The post 2025 year-end link clearance appeared first on The Old New Thing.
Understanding and mitigating a stack overflow in our task sequencer
The recurring problem of synchronous resumption. The post Understanding and mitigating a stack overflow in our task sequencer appeared first on The Old New Thing.
Additional notes on color-keyed overlays as a way of doing smooth video rendering
Choosing the color-key and other brief discussions. The post Additional notes on color-keyed overlays as a way of doing smooth video rendering appeared first on The Old New Thing.
The Gävle Goat (Gävlebocken) succumbs in 2025 to a new menace
You could blow me over. The post The Gävle Goat (Gävlebocken) succumbs in 2025 to a new menace appeared first on The Old New Thing.
How can I detect that the system is running low on memory? Or that my job is running low on memory?
You can register for a memory notification. The post How can I detect that the system is running low on memory? Or that my job is running low on memory? appeared first on The Old New Thing.
Why are we worried about memory access semantics? Full barriers should be enough for anybody
You have to find new ways of going faster. The post Why are we worried about memory access semantics? Full barriers should be enough for anybody appeared first on The Old New Thing.
Reading the fine print, episode 4: Holiday promotions
Checking those validity dates. The post Reading the fine print, episode 4: Holiday promotions appeared first on The Old New Thing.
Why is the last letter of my string not making it to the clipboard?
The struggle for null termination. The post Why is the last letter of my string not making it to the clipboard? appeared first on The Old New Thing.
Why does my Ctrl+M accelerator key activate when I press the Enter key?
Understanding the difference between keys and characters for accelerators. The post Why does my Ctrl+M accelerator key activate when I press the Enter key? appeared first on The Old New Thing.
When irate product support customers demand to speak to Bill Gates
So transfer them to his office, or so it seems. The post When irate product support customers demand to speak to Bill Gates appeared first on The Old New Thing.
All the other cool languages have try…finally. C++ says “We have try…finally at home.”
The destructor serves as the "finally". The post All the other cool languages have try…finally. C++ says “We have try…finally at home.” appeared first on The Old New Thing.
Python Typing Survey 2025: Code Quality and Flexibility As Top Reasons for Typing Adoption
The 2025 Typed Python Survey, conducted by contributors from JetBrains, Meta, and the broader Python typing community, offers a comprehensive look at the current state of Python’s type system and developer tooling. With 1,241 responses (a 15% increase from last year), the survey captures the evolving sentiment, challenges, and opportunities around Python typing in the [...] The post Python Typing Survey 2025: Code Quality and Flexibility As Top Reasons for Typing Adoption appeared first on Engineering at Meta.
DrP: Meta’s Root Cause Analysis Platform at Scale
Incident investigation can be a daunting task in today’s digital landscape, where large-scale systems comprise numerous interconnected components and dependencies. DrP is a root cause analysis (RCA) platform, designed by Meta, to programmatically automate the investigation process, significantly reducing the mean time to resolve (MTTR) for incidents and alleviating on-call toil. Today, DrP is used [...] The post DrP: Meta’s Root Cause Analysis Platform at Scale appeared first on Engineering at Meta.
A shortcut gives me a weird path for a program shortcut that doesn’t point to the executable, so what is it?
It's a placeholder because the shortcut is to an MSI application. The post A shortcut gives me a weird path for a program shortcut that doesn’t point to the executable, so what is it? appeared first on The Old New Thing.
Google Research 2025: Bolder breakthroughs, bigger impact
Year in Review
Inside the feature store powering real-time AI in Dropbox Dash
The feature store is a critical part of how we rank and retrieve the right context across your work.
Concluding thoughts on our deep dive into Windows clipboard text conversion
Stick to Unicode and you'll be fine. The post Concluding thoughts on our deep dive into Windows clipboard text conversion appeared first on The Old New Thing.
Powering Billion-Scale Vector Search with OpenSearch
Uber powers billion-scale vector search with OpenSearch. Discover the innovative optimizations we designed to boost search efficiency, scalability, and reliability for massive datasets.
Deducing the consequences of Windows clipboard text formats on UTF-8
Working out the implications. The post Deducing the consequences of Windows clipboard text formats on UTF-8 appeared first on The Old New Thing.
How We Built Meta Ray-Ban Display: From Zero to Polish
We’re going behind the scenes of the Meta Ray-Ban Display, Meta’s most advanced AI glasses yet. In a previous episode we met the team behind the Meta Neural Band, the EMG wristband packaged with the Ray-Ban Display. Now we’re delving into the glasses themselves. Kenan and Emanuel, from Meta’s Wearables org, join Pascal Hartig on [...] The post How We Built Meta Ray-Ban Display: From Zero to Polish appeared first on Engineering at Meta.
Why is the Windows clipboard taking the scenic route when converting from CF_TEXT to CF_OEMTEXT?
Something is forcing it down an alternate path. The post Why is the Windows clipboard taking the scenic route when converting from CF_TEXT to CF_OEMTEXT? appeared first on The Old New Thing.
How Uber Indexes Streaming Data with Pull-Based Ingestion in OpenSearch™
Discover how Uber uses OpenSearch™’s streaming ingestion architecture for powerful search, and learn about our contributions to a pull-based ingestion framework in the OpenSearch project.
How Temporal Powers Reliable Cloud Operations at Netflix
Netflix Live Origin
Gemini provides automated feedback for theoretical computer scientists at STOC 2026
Algorithms & Theory
How AI Is Transforming the Adoption of Secure-by-Default Mobile Frameworks
Meta’s secure-by-default frameworks wrap potentially unsafe OS and third-party functions, making security the default while preserving developer speed and usability. These frameworks are designed to closely mirror existing APIs, rely on public and stable interfaces, and maximize developer adoption by minimizing friction and complexity. Generative AI and automation accelerate the adoption of secure frameworks at [...] The post How AI Is Transforming the Adoption of Secure-by-Default Mobile Frameworks appeared first on Engineering at Meta.
Misunderstanding what the Cricket Celebration Bowl is
Apparently, not a bowl of crickets. The post Misunderstanding what the Cricket Celebration Bowl is appeared first on The Old New Thing.
The Windows clipboard automatic text conversion algorithm is path-dependent
When the journey is not half of the fun. The post The Windows clipboard automatic text conversion algorithm is path-dependent appeared first on The Old New Thing.
How Uber, OCI™, and Ampere® Co-Optimized OCI AmpereOne® M A4 Compute
By co-designing the new OCI™ AmpereOne-M® A4 instances, Uber has moved beyond adapting to existing hardware and is now shaping the next generation of compute, optimized for its real-world workloads.
Resolving an ambiguity in the Windows clipboard automated text conversion table
Who goes first? The post Resolving an ambiguity in the Windows clipboard automated text conversion table appeared first on The Old New Thing.
Spotlight on innovation: Google-sponsored Data Science for Health Ideathon across Africa
Conferences & Events
Studying the various locale mismatch scenarios in Windows clipboard text format synthesis
If they don't match, then the 8-bit strings are basically broken already. The post Studying the various locale mismatch scenarios in Windows clipboard text format synthesis appeared first on The Old New Thing.
Architecting conversational observability for cloud applications
In this post, we walk through building a generative AI–powered troubleshooting assistant for Kubernetes. The goal is to give engineers a faster, self-service way to diagnose and resolve cluster issues, cut down Mean Time to Recovery (MTTR), and reduce the cycles experts spend finding the root cause of issues in complex distributed systems.
From Batch to Streaming: Accelerating Data Freshness in Uber’s Data Lake
Learn how Uber moved from batch to streaming ingestion with IngestionNext, reducing data latency and unlocking real-time analytics across its petabyte-scale data lake.
A differentially private framework for gaining insights into AI chatbot use
Generative AI
How BASF’s Agriculture Solutions drives traceability and climate action by tokenizing cotton value chains using Amazon Managed Blockchain
BASF Agricultural Solutions combines innovative products and digital tools with practical farmer knowledge. This post explores how Amazon Managed Blockchain can drive a positive change in the agricultural industry by tokenizing food and cotton value chains for traceability, climate action, and circularity.
How does Windows synthesize the CF_LOCALE clipboard format?
Getting it from a place that might have been obvious in the past, but maybe not today. The post How does Windows synthesize the CF_LOCALE clipboard format? appeared first on The Old New Thing.
Background Coding Agents: Predictable Results Through Strong Feedback Loops (Honk, Part 3)
The system we built to ensure our AI agents produce predictable, trustworthy code. The post Background Coding Agents: Predictable Results Through Strong Feedback Loops (Honk, Part 3) appeared first on Spotify Engineering.
How does Windows synthesize CF_UNICODETEXT from CF_TEXT and vice versa?
Let's ask the locale. The post How does Windows synthesize CF_UNICODETEXT from CF_TEXT and vice versa? appeared first on The Old New Thing.
Blazing Fast OLAP on Uber’s Inventory and Catalog Data with Apache Pinot™
Discover how Uber used Apache Pinot™ to build a real-time index on its massive inventory of billions of items to power search use cases, internal tools, and operational workflows.
She architects: Bringing unique perspectives to innovative solutions at AWS
Have you ever wondered what it is really like to be a woman in tech at one of the world's leading cloud companies? Or maybe you are curious about how diverse perspectives drive innovation beyond the buzzwords? Today, we are providing an insider's perspective on the role of a solutions architect (SA) at Amazon Web Services (AWS). However, this is not a typical corporate success story. We are three women who have navigated challenges, celebrated wins, and found our unique paths in the world of cloud architecture, and we want to share our real stories with you.
How does Windows synthesize CF_OEMTEXT from CF_TEXT and vice versa?
Starting with the easy case, or at least it looks easy. The post How does Windows synthesize CF_OEMTEXT from CF_TEXT and vice versa? appeared first on The Old New Thing.
How can my process read its own standard output?
You'll have to trick yourself before anybody notices, which may not be possible. The post How can my process read its own standard output? appeared first on The Old New Thing.
AV1 — Now Powering 30% of Netflix Streaming
Titans + MIRAS: Helping AI have long-term memory
Generative AI
How can I read the standard output of an already-running process?
You can't. You'll have to do it before the process starts. The post How can I read the standard output of an already-running process? appeared first on The Old New Thing.
Improving MySQL® Cluster Uptime: Making MGR Viable at Scale
Dive into the implementation, automation and failover logic that made MySQL® Group Replication viable at Uber scale.
From Waveforms to Wisdom: The New Benchmark for Auditory Intelligence
Machine Intelligence
How do I check whether the user has permission to create files in a directory?
Request the directory security attributes that correspond to your proposed operation. The post How do I check whether the user has permission to create files in a directory? appeared first on The Old New Thing.
The Interaction Changes Everything: Treating AI Agents as Collaborators, Not Automation
Discover how treating AI agents as collaborators, not automation, transforms engineering workflows and accelerates complex projects. The post The Interaction Changes Everything: Treating AI Agents as Collaborators, Not Automation appeared first on Engineering@Microsoft.
Microspeak: Big rocks
The large obstacles. The post Microspeak: Big rocks appeared first on The Old New Thing.
Improving MySQL® Cluster Uptime: Designing Advanced Detection, Mitigation, and Consensus with Group Replication
At Uber, high availability is non-negotiable. Learn how we’ve adopted MySQL® Group Replication in single-primary mode to achieve a less than 10 second failover time and massively improve reliability and write availability during failures.
Streamlining Security Investigations with Agents
Slack’s Security Engineering team is responsible for protecting Slack’s core infrastructure and services. Our security event ingestion pipeline handles billions of events per day from a diverse array of data sources. Reviewing alerts produced by our security detection system is our primary responsibility during on-call shifts. We’re going to show you how we’re using AI…
How do I get my edit control text to be autoselected when I choose it to be the default focus in my dialog?
Remembering some old APIs. The post How do I get my edit control text to be autoselected when I choose it to be the default focus in my dialog? appeared first on The Old New Thing.
How can I have a Win32 drop-down combo box with a read-only edit control?
You can ask for its handle and mark it read-only. The post How can I have a Win32 drop-down combo box with a read-only edit control? appeared first on The Old New Thing.
Message-only windows are for messaging, not as a convenient victim for hosting UI
If you want to host UI, use a real window (possibly hidden). The post Message-only windows are for messaging, not as a convenient victim for hosting UI appeared first on The Old New Thing.
Building the future: highlights from Dropbox’s 2025 summer intern class
The Dropbox Intern Program is thoughtfully designed to cultivate growth, spark innovation, and build lasting connections.
Secure Amazon Elastic VMware Service (Amazon EVS) with AWS Network Firewall
In this post, we demonstrate how to utilize AWS Network Firewall to secure an Amazon EVS environment, using a centralized inspection architecture across an EVS cluster, VPCs, on-premises data centers and the internet. We walk through the implementation steps to deploy this architecture using AWS Network Firewall and AWS Transit Gateway.
At what point in the Windows development cycle is it too late to change the text of a translatable string?
The translation team sets the deadline. The post At what point in the Windows development cycle is it too late to change the text of a translatable string? appeared first on The Old New Thing.
The apocryphal origins of the Hot Dog Stand color scheme
Challenge accepted. The post The apocryphal origins of the Hot Dog Stand color scheme appeared first on The Old New Thing.
Why does XAML break down when I have an element that is half a billion pixels tall?
You've far exceeded the design goals and have even exceeded the expressive ability of a float. The post Why does XAML break down when I have an element that is half a billion pixels tall? appeared first on The Old New Thing.
Background Coding Agents: Context Engineering (Honk, Part 2)
We explore context engineering for background coding agents and what makes a good migration prompt. The post Background Coding Agents: Context Engineering (Honk, Part 2) appeared first on Spotify Engineering.
Evolution and Scale of Uber’s Delivery Search Platform
How does Uber Eats power search across billions of stores, dishes, and grocery items? We built a next-gen semantic search platform that understands meaning, not just keywords—handling typos, synonyms, and multiple languages.
Zoomer: Powering AI Performance at Meta’s Scale Through Intelligent Debugging and Optimization
We’re introducing Zoomer, Meta’s comprehensive, automated debugging and optimization platform for AI. Zoomer works across all of our training and inference workloads at Meta and provides deep performance insights that enable energy savings, workflow acceleration, and efficiency gains in our AI infrastructure. Zoomer has delivered training time reductions, and significant QPS improvements, making it the [...] The post Zoomer: Powering AI Performance at Meta’s Scale Through Intelligent Debugging and Optimization appeared first on Engineering at Meta.
Reducing EV range anxiety: How a simple AI model predicts port availability
Algorithms & Theory
Maybe somebody can explain to me how weak references solve the ODR problem
I don't see it. The post Maybe somebody can explain to me how weak references solve the ODR problem appeared first on The Old New Thing.
Key Transparency Comes to Messenger
We’re excited to share another advancement in the security of your conversations on Messenger: the launch of key transparency verification for end-to-end encrypted chats. This new feature enables an additional level of assurance that only you — and the people you’re communicating with — can see or listen to what is sent, and that no [...] The post Key Transparency Comes to Messenger appeared first on Engineering at Meta.
In the commit-on-demand pattern, what happens if an access violation straddles multiple pages?
The access violation exceptions will continue until commit improves. The post In the commit-on-demand pattern, what happens if an access violation straddles multiple pages? appeared first on The Old New Thing.
Ceilometer: Uber’s Adaptive Benchmarking Framework
Dig into Ceilometer, Uber’s adaptive benchmarking framework for ensuring system performance and reliability at scale. Learn how it automates performance testing while providing production-like insights and continuous validation.
Building an AI gateway to Amazon Bedrock with Amazon API Gateway
In this post, we'll explore a reference architecture that helps enterprises govern their Amazon Bedrock implementations using Amazon API Gateway. This pattern enables key capabilities like authorization controls, usage quotas, and real-time response streaming. We'll examine the architecture, provide deployment steps, and discuss potential enhancements to help you implement AI governance at scale.
Architecting for AI excellence: AWS launches three Well-Architected Lenses at re:Invent 2025
At re:Invent 2025, we introduce one new lens and two significant updates to the AWS Well-Architected Lenses specifically focused on AI workloads: the Responsible AI Lens, the Machine Learning (ML) Lens, and the Generative AI Lens. Together, these lenses provide comprehensive guidance for organizations at different stages of their AI journey, whether you're just starting to experiment with machine learning or already deploying complex AI applications at scale.
Announcing the updated AWS Well-Architected Generative AI Lens
We are delighted to announce an update to the AWS Well-Architected Generative AI Lens. This update features several new sections of the Well-Architected Generative AI Lens, including new best practices, advanced scenario guidance, and improved preambles on responsible AI, data architecture, and agentic workflows.
Announcing the updated AWS Well-Architected Machine Learning Lens
We are excited to announce the updated AWS Well-Architected Machine Learning Lens, now enhanced with the latest capabilities and best practices for building machine learning (ML) workloads on AWS.
Android VPAT journey
Background: A Voluntary Product Accessibility Template (VPAT) is a document that outlines how well a product aligns with accessibility (a11y) standards. Its primary purpose is to inform customers about a product’s a11y features, enabling them to make informed decisions before purchasing software. At Slack, we conducted a VPAT by a third party a11y vendor in…
Real-time speech-to-speech translation
Algorithms & Theory
Efficient Optimization With Ax, an Open Platform for Adaptive Experimentation
We’ve released Ax 1.0, an open-source platform that uses machine learning to automatically guide complex, resource-intensive experimentation. Ax is used at scale across Meta to improve AI models, tune production infrastructure, and accelerate advances in ML and even hardware design. Our accompanying paper, “Ax: A Platform for Adaptive Experimentation” explains Ax’s architecture, methodology, and how it [...] The post Efficient Optimization With Ax, an Open Platform for Adaptive Experimentation appeared first on Engineering at Meta.
Build priority-based message processing with Amazon MQ and AWS App Runner
In this post, we show you how to build a priority-based message processing system using Amazon MQ for priority queuing, Amazon DynamoDB for data persistence, and AWS App Runner for serverless compute. We demonstrate how to implement application-level delays that high-priority messages can bypass, create real-time UIs with WebSocket connections, and configure dual-layer retry mechanisms for maximum reliability.
Generative UI: A rich, custom, visual interactive user experience for any prompt
Generative AI
Enhancing Uber’s Guidance Heatmap with Deep Probabilistic Models
How does Uber forecast earnings for drivers? Our latest deep dive explores the probabilistic prediction models behind earnings insights in the Driver app, from dynamic heatmaps to trends and alerts.
Announcing the Completion of the Core 2Africa System: Building the Future of Connectivity Together
Connecting Africa and the World: We’re excited to share the completion of the core 2Africa infrastructure, the world’s longest open access subsea cable system. 2Africa is a landmark subsea cable system that sets a new standard for global connectivity. This project is the result of years of collaboration, innovation, and a shared vision to connect [...] The post Announcing the Completion of the Core 2Africa System: Building the Future of Connectivity Together appeared first on Engineering at Meta.
How Dash uses context engineering for smarter AI
Building effective, agentic AI isn’t just about adding more; it’s about helping the model focus on what matters most.
Enhancing HDR on Instagram for iOS With Dolby Vision
We’re sharing how we’ve enabled Dolby Vision and ambient viewing environment (amve) on the Instagram iOS app to enhance the video viewing experience. HDR videos created on iPhones contain unique Dolby Vision and amve metadata that we needed to support end-to-end. Instagram for iOS is now the first Meta app to support Dolby Vision video, [...] The post Enhancing HDR on Instagram for iOS With Dolby Vision appeared first on Engineering at Meta.
Open Source Is Good for the Environment
Most people have heard of open-source software. But have you heard about open hardware? And did you know open source can have a positive impact on the environment? On this episode of the Meta Tech Podcast, Pascal Hartig sits down with Dharmesh and Lisa to talk about all things open hardware, and Meta’s biggest announcements [...] The post Open Source Is Good for the Environment appeared first on Engineering at Meta.
Know before you go – AWS re:Invent 2025 guide to Well-Architected and Cloud Optimization sessions
Are you ready to maximize your Well-Architected and Cloud Optimization learning and networking time at re:Invent 2025? We have put together this comprehensive guide to help you plan your schedule and make the most of the Well-Architected and cloud optimization sessions available this year. These sessions will deliver the practical guidance your teams need to lead strategic cloud initiatives, design next-generation architectures, optimize costs, or secure AI-powered systems.
Separating natural forests from other tree cover with AI for deforestation-free supply chains
Climate & Sustainability
I/O Observability for Uber’s Massive Petabyte-Scale Data Lake
Learn how Uber powers real-time, petabyte-scale I/O observability for its data lake powering 2 million compute jobs.
Shuffle: Making Random Feel More Human
Shuffle has always been one of Spotify’s most-used features, and also one of the most misunderstood. For... The post Shuffle: Making Random Feel More Human appeared first on Spotify Engineering.
A new quantum toolkit for optimization
Algorithms & Theory
Differentially private machine learning at scale with JAX-Privacy
Algorithms & Theory
StyleX: A Styling Library for CSS at Scale
StyleX is Meta’s styling system for large-scale applications. It combines the ergonomics of CSS-in-JS with the performance of static CSS, generating collision-free atomic CSS while allowing for expressive, type-safe style authoring. StyleX was open sourced at the end of 2023 and has since become the standard styling system across Meta products like Facebook, Instagram, WhatsApp, [...] The post StyleX: A Styling Library for CSS at Scale appeared first on Engineering at Meta.
Meta’s Generative Ads Model (GEM): The Central Brain Accelerating Ads Recommendation AI Innovation
We’re sharing details about Meta’s Generative Ads Recommendation Model (GEM), a new foundation model that delivers increased ad performance and advertiser ROI by enhancing other ads recommendation models’ ability to serve relevant ads. GEM’s novel architecture allows it to scale with an increasing number of parameters while consistently generating more precise predictions efficiently. GEM propagates [...] The post Meta’s Generative Ads Model (GEM): The Central Brain Accelerating Ads Recommendation AI Innovation appeared first on Engineering at Meta.
Introducing Nested Learning: A new ML paradigm for continual learning
Algorithms & Theory
1,500+ PRs Later: Spotify’s Journey with Our Background Coding Agent (Part 1)
Thousands of merged AI-generated pull requests and the future of large-scale software maintenance. The post 1,500+ PRs Later: Spotify’s Journey with Our Background Coding Agent (Part 1) appeared first on Spotify Engineering.
DS-STAR: A state-of-the-art versatile data science agent
Data Mining & Modeling
Build better software to build software better
We manage the build pipeline that delivers Quip and Slack Canvas’s backend. A year ago, we were chasing exciting ideas to help engineers ship better code, faster. But we had one huge problem: builds took 60 minutes. With a build that slow, the whole pipeline gets less agile, and feedback doesn’t come to engineers until…
Building Zone Failure Resilience in Apache Pinot™ at Uber
By building zone failure resilience into Apache Pinot™, Uber strengthened reliability for real-time analytics, sped up release cycles, and created a foundation for future failure recovery. Now queries and ingestion stay strong, even when zones go dark.
Forecasting the future of forests with AI: From counting losses to predicting risk
Climate & Sustainability
GraphQL Data Mocking at Scale with LLMs and @generateMock
How Airbnb combines GraphQL infra, product context, and LLMs to generate and maintain convincing, type-safe mock data using a new directive.
Supercharging the ML and AI Development Experience at Netflix
Video Invisible Watermarking at Scale
At Meta, we use invisible watermarking for a variety of content provenance use cases on our platforms. Invisible watermarking serves a number of use cases, including detecting AI-generated videos, verifying who posted a video first, and identifying the source and tools used to create a video. We’re sharing how we overcame the challenges of scaling [...] The post Video Invisible Watermarking at Scale appeared first on Engineering at Meta.
Exploring a space-based, scalable AI infrastructure system design
General Science
Accelerating the magic cycle of research breakthroughs and real-world applications
Climate & Sustainability
Raising the Bar on ML Model Deployment Safety
How do you safely ship thousands of ML models without slowing teams down? At Uber, we’ve built guardrails that catch issues early, prevent rollbacks, and raise the bar on reliability. Discover how safety became a measurable standard across our ML ecosystem.
Toward provably private insights into AI use
Generative AI
StreetReaderAI: Towards making street view accessible via context-aware multimodal AI
Generative AI
How we are building the personal health coach
Generative AI
Post-Training Generative Recommenders with Advantage-Weighted Supervised Finetuning
Advancing Our Chef Infrastructure: Safety Without Disruption
Last year, I wrote a blog post titled Advancing Our Chef Infrastructure, where we explored the evolution of our Chef infrastructure over the years. We talked about the shift from a single Chef stack to a multi-stack model, and the challenges that came with it – from updating how we handle cookbook uploads to navigating…
With Mobius Labs' Aana models, we're bringing deeper multimodal understanding to Dropbox Dash
Dropbox welcomes Mobius Labs to advance Dash’s multimodal AI, integrating Aana’s efficient architecture to enhance photo and video understanding at Dropbox scale.
Enabling Deep Model Explainability with Integrated Gradients at Uber
Uber’s ML platform Michelangelo now supports Integrated Gradients, enabling scalable, interpretable deep model explainability across TensorFlow™ and PyTorch™. Learn how this powers trust, debugging, and decision-making throughout the ML life cycle.
Google Earth AI: Unlocking geospatial insights with foundation models and cross-modal reasoning
Climate & Sustainability
BASF Digital Farming builds a STAC-based solution on Amazon EKS
This post was co-written with Frederic Haase and Julian Blau with BASF Digital Farming GmbH. At xarvio – BASF Digital Farming, our mission is to empower farmers around the world with cutting-edge digital agronomic decision-making tools. Central to this mission is our crop optimization platform, xarvio FIELD MANAGER, which delivers actionable insights through a range […]
A verifiable quantum advantage
Quantum
Half-Quadratic Quantization of large machine learning models
Learn how Half-Quadratic Quantization (HQQ) makes it easy to compress large AI models without sacrificing accuracy—no calibration data required.
Requirement Adherence: Boosting Data Labeling Quality Using LLMs
Harnessing the power of LLMs, Uber AI Solutions developed a system to reduce data labeling audits by 80%. Learn how the system detects labeling errors, boosting data quality.
Behind the Streams: Real-Time Recommendations for Live Events Part 3
A picture's worth a thousand (private) words: Hierarchical generation of coherent synthetic photo albums
Generative AI
Teaching Gemini to spot exploding stars with just a few examples
General Science
How and Why Netflix Built a Real-Time Distributed Graph: Part 1 — Ingesting and Processing Data…
Solving virtual machine puzzles: How AI is optimizing cloud computing
Algorithms & Theory
Using AI to identify genetic variants in tumors with DeepSomatic
General Science
Rebuilding Uber’s Apache Pinot™ Query Architecture
The next chapter of real-time analytics at Uber. Uncover how Uber restructured its Apache Pinot™ query architecture to unlock a ton of new features, redefining the capabilities of a mature OLAP platform.
Coral NPU: A full-stack platform for Edge AI
Generative AI
XR Blocks: Accelerating AI + XR innovation
Generative AI
Deploy Safety: Reducing customer impact from change
It’s mid 2023 and we’ve identified some opportunities to improve our reliability. Fast forward to January 2025. Customer impact hours are reduced from the peak by 90% and continuing to trend downward. We’re a year and half into the Deploy Safety Program at Slack, improving the way we deploy, uplifting our safety culture and continuing…
Speech-to-Retrieval (S2R): A new approach to voice search
Machine Intelligence
Cadence Workflow Joins the Cloud Native Computing Foundation
Cadence Workflow is now part of the Cloud Native Computing Foundation®. This milestone strengthens our commitment to open source and ensures continued investment in the project’s future.
A collaborative approach to image generation
Generative AI
A practical blueprint for evaluating conversational AI at scale
Building Dropbox Dash taught us that in the foundation-model era, AI evaluations matter just as much as model training.
How Uber Standardized Mobile Analytics for Cross-Platform Insights
Follow Uber’s journey of standardizing mobile analytics. We unified event instrumentation, collected consistent metadata, and provided sampled event coverage to reduce dev effort and deliver quality, cross-platform insights.
Modernization of real-time payment orchestration on AWS
The global real-time payments market is experiencing significant growth. According to Fortune Business Insights, the market was valued at USD 24.91 billion in 2024 and is projected to grow to USD 284.49 billion by 2032, with a CAGR of 35.4%. Similarly, Grand View Research reports that the global mobile payment market, valued at USD 88.50 […]
Introducing interactive on-device segmentation in Snapseed
Human-Computer Interaction and Visualization
AI as a research partner: Advancing theoretical computer science with AlphaEvolve
Algorithms & Theory
Build resilient generative AI agents
Generative AI agents in production environments demand resilience strategies that go beyond traditional software patterns. AI agents make autonomous decisions, consume substantial computational resources, and interact with external systems in unpredictable ways. These characteristics create failure modes that conventional resilience approaches might not address. This post presents a framework for AI agent resilience risk analysis […]
The anatomy of a personal health agent
Generative AI
100X Faster: How We Supercharged Netflix Maestro’s Workflow Engine
Building a Resilient Data Platform with Write-Ahead Log at Netflix
Uber’s Strategy to Upgrading 2M+ Spark Jobs
Discover how Uber migrated 2M daily Apache Spark™ jobs to Spark 3.3 with automation and safe shadow testing, achieving over $4M in savings.
Towards better health conversations: Research insights on a “wayfinding” AI agent based on Gemini
Generative AI
AfriMed-QA: Benchmarking large language models for global health
Generative AI
Time series foundation models can be few-shot learners
Generative AI
Beyond Winning: Spotify’s Experiments with Learning Framework
TL;DR Spotify’s experimentation platform, Confidence, scaled product decision-making across hundreds of... The post Beyond Winning: Spotify’s Experiments with Learning Framework appeared first on Spotify Engineering.
Scaling Muse: How Netflix Powers Data-Driven Creative Insights at Trillion-Row Scale
A scalable, elastic database and search solution for 1B+ vectors built on LanceDB and Amazon S3
In this post, we explore how Metagenomi built a scalable database and search solution for over 1 billion protein vectors using LanceDB and Amazon S3. The solution enables rapid enzyme discovery by transforming proteins into vector embeddings and implementing a serverless architecture that combines AWS Lambda, AWS Step Functions, and Amazon S3 for efficient nearest neighbor searches.
Deep researcher with test-time diffusion
Machine Intelligence
Empowering Netflix Engineers with Incident Management
Sensible Agent: A framework for unobtrusive interaction with proactive AR agents
Human-Computer Interaction and Visualization
Adding Determinism and Safety to Uber IAM Policy Changes
Uber’s Policy Simulator tool enhances the safety and predictability of IAM policy changes by allowing policy authors to preview the impact of their modifications prior to deployment, ensuring deterministic outcomes after policy change deployment.
Viaduct, Five Years On: Modernizing the Data-Oriented Service Mesh
A more powerful engine and a simpler API for our data-oriented mesh
Migrating Airbnb’s JVM Monorepo to Bazel
By: Jack Dai, Howard Ho, Loc Dinh, Stepan Goncharov, Ted Tenedorio, and Thomas Bao At Airbnb, we recently completed migrating our largest repo, the JVM monorepo, to Bazel. This repo contains tens of millions of lines of Java, Kotlin, and Scala code that power the vast array of backend services and data pipelines behind airbnb.com. Migration in numbers […]
Making LLMs more accurate by using all of their layers
Algorithms & Theory
Learn Your Way: Reimagining textbooks with generative AI
Education Innovation
VaultGemma: The world's most capable differentially private LLM
Generative AI
Speculative cascades — A hybrid approach for smarter, faster LLM inference
Generative AI
Smarter nucleic acid design with NucleoBench and AdaBeam
Health & Bioscience
Open-Sourcing Starlark Worker: Define Cadence Workflows with Starlark
Starlark meets Cadence. We’re excited to announce the open-source release of Starlark Worker, a powerful integration between Cadence workflow orchestration and the Starlark scripting language to simplify workflow execution.
Accelerating scientific discovery with AI-powered empirical software
General Science
Building Uber’s Data Lake: Batch Data Replication Using HiveSync
Go behind the scenes of how Uber powers batch data replication at scale using HiveSync to keep its data lake consistent, reliable, and performant.
Building Slack’s Anomaly Event Response
As cyberattacks evolve to unprecedented levels of sophistication and speed, the time gap between breach detection and response has never been more critical. Traditional security approaches often operate reactively, identifying compromises only after damage has occurred. This delay grants attackers a tactical advantage, forcing security teams to focus on damage assessment and remediation rather than…
Controlling the Rollout of Large-Scale Monorepo Changes
Discover how Uber controls the blast radius of large-scale commits with cross-cutting service deployment orchestration. As Uber embraces fully automated continuous deployment, strong safety practices are more critical than ever.
How Google’s AI can help transform health professions education
Education Innovation
Hack Week 2025: How these engineers liquid-cooled a GPU server
How our engineers designed a custom liquid cooling system for high-powered GPU servers to tackle the rising thermal demands of AI workloads.
A scalable framework for evaluating health language models
Generative AI
Simplify multi-tenant encryption with a cost-conscious AWS KMS key strategy
In this post, we explore an efficient approach to managing encryption keys in a multi-tenant SaaS environment through centralization, addressing challenges like key proliferation, rising costs, and operational complexity across multiple AWS accounts and services. We demonstrate how implementing a centralized key management strategy using a single AWS KMS key per tenant can maintain security and compliance while reducing operational overhead as organizations scale.
From massive models to mobile magic: The tech behind YouTube real-time generative AI effects
Generative AI
From Facts & Metrics to Media Machine Learning: Evolving the Data Engineering Function at Netflix
Securing private data at scale with differentially private partition selection
Algorithms & Theory
How CommBank made their CommSec trading platform highly available and operationally resilient
In this post, we explore how CommSec, Australia's leading online broker, transitioned from a multicloud environment to AWS as their sole cloud provider while implementing Amazon Application Recovery Controller (ARC) zonal shift to maintain high availability and operational resilience. The consolidation resulted in significant benefits including 25% base capacity reduction, two times faster deployments, and improved failover capabilities through ARC zonal shift, enabling CommSec to continue serving millions of customers while meeting strict regulatory requirements.
Driving AI adoption at Dropbox: a conversation with CTO Ali Dasdan
How Dropbox approached AI adoption not just as a tool for automation, but as a catalyst for rethinking the entire software development lifecycle.
ML Observability: Bringing Transparency to Payments and Beyond
Beyond billion-parameter burdens: Unlocking data synthesis with a conditional generator
Generative AI
How Karrot built a feature platform on AWS, Part 1: Motivation and feature serving
This two-part series shows how Karrot developed a new feature platform, which consists of three main components: feature serving, a stream ingestion pipeline, and a batch ingestion pipeline. This post starts by presenting our motivation, our requirements, and the solution architecture, focusing on feature serving.
How Karrot built a feature platform on AWS, Part 2: Feature ingestion
This two-part series shows how Karrot developed a new feature platform, which consists of three main components: feature serving, a stream ingestion pipeline, and a batch ingestion pipeline. This post covers the process of collecting features in real-time and batch ingestion into an online store, and the technical approaches for stable operation.
Deploy LLMs on Amazon EKS using vLLM Deep Learning Containers
In this post, we demonstrate how to deploy the DeepSeek-R1-Distill-Qwen-32B model using AWS DLCs for vLLM on Amazon EKS, showcasing how these purpose-built containers simplify deployment of this powerful open source inference engine. This solution can help you solve the complex infrastructure challenges of deploying LLMs while maintaining performance and cost-efficiency.
Enabling physician-centered oversight for AMIE
Generative AI
Seamless Istio Upgrades at Scale
How Airbnb upgrades tens of thousands of pods on dozens of Kubernetes clusters to new Istio versions
Achieving High Availability with distributed database on Kubernetes at Airbnb
How to achieve high availability with a distributed database on Kubernetes.
Understanding and Improving SwiftUI Performance
New techniques we’re using at Airbnb to improve and maintain performance of SwiftUI features at scale
Load Testing with Impulse at Airbnb
Comprehensive Load Testing with Load Generator, Dependency Mocker, Traffic Collector, and More
Achieving 10,000x training data reduction with high-fidelity labels
Human-Computer Interaction and Visualization
Insulin resistance prediction from wearables and routine blood biomarkers
Generative AI
Highly accurate genome polishing with DeepPolisher: Enhancing the foundation of genomic research
General Science
Maximizing Business Value Through Strategic Cloud Optimization
As cloud spending continues to surge, organizations must focus on strategic cloud optimization to maximize business value. This blog post explores key insights from MIT Technology Review's publication on cloud optimization, highlighting the importance of viewing optimization as a continuous process that encompasses all six AWS Well-Architected pillars.
MLE-STAR: A state-of-the-art machine learning engineering agent
Machine Intelligence
Simulating large systems with Regression Language Models
Generative AI
SensorLM: Learning the language of wearable sensors
Generative AI
How Zapier runs isolated tasks on AWS Lambda and upgrades functions at scale
In this post, you’ll learn how Zapier has built their serverless architecture focusing on three key aspects: using Lambda functions to build isolated Zaps, operating over a hundred thousand Lambda functions through Zapier's control plane infrastructure, and enhancing security posture while reducing maintenance efforts by introducing automated function upgrades and cleanup workflows into their platform architecture.
How HashiCorp made cross-Region switchover seamless with Amazon Application Recovery Controller
In this post, we discuss HashiCorp’s journey from manual, stress-inducing failover procedures to a streamlined, confident approach that fundamentally changed how they deliver on their enterprise-grade resilience promises.
Synthetic and federated: Privacy-preserving domain adaptation with LLMs for mobile applications
Generative AI
LSM-2: Learning from incomplete wearable sensor data
Generative AI
Implement monitoring for Amazon EKS with managed services
In this post, we show you how to implement comprehensive monitoring for Amazon Elastic Kubernetes Service (Amazon EKS) workloads using AWS managed services. This solution demonstrates building an EKS platform that combines flexible compute options with enterprise-grade observability using AWS native services and OpenTelemetry.
Measuring heart rate with consumer ultra-wideband radar
Hardware & Architecture
Android Earthquake Alerts: A global system for early warning
Data Mining & Modeling
Enhancing Code Quality at Scale with AI-Powered Code Reviews
Microsoft’s AI-powered code review assistant has transformed pull request workflows by automating routine checks, suggesting improvements, and enabling conversational Q&A, leading to faster PR completion, improved code quality, and enhanced developer onboarding. Its seamless integration and customizability have driven widespread adoption within Microsoft. The post Enhancing Code Quality at Scale with AI-Powered Code Reviews appeared first on Engineering@Microsoft.
Making file encryption fast and secure for teams with advanced key management
We developed features to help teams limit their security risks and respond more effectively to potential threats or breaches.
Graph foundation models for relational data
Algorithms & Theory
MedGemma: Our most capable open models for health AI development
Generative AI
Seventh-generation server hardware at Dropbox: our most efficient and capable architecture yet
This generation represents our most efficient, capable, and scalable architecture yet—and it’ll help us as we continue to build AI products like Dropbox Dash.
Making group conversations more accessible with sound localization
Human-Computer Interaction and Visualization
How we created HOV-specific ETAs in Google Maps
Algorithms & Theory
REGEN: Empowering personalized recommendations with natural language
Data Mining & Modeling
MUVERA: Making multi-vector retrieval as fast as single-vector search
Algorithms & Theory
From research to climate resilience
Climate & Sustainability
Unlocking rich genetic insights through multimodal AI with M-REGLE
Generative AI
A colorful quantum future
Quantum
Optimizing LLM-based trip planning
Algorithms & Theory
Zooming in: Efficient regional environmental risk assessment with generative AI
Climate & Sustainability
Listening, Learning, and Helping at Scale: How Machine Learning Transforms Airbnb’s Voice Support Experience
A look into how Airbnb uses speech recognition, intent detection, and language models to understand users and assist agents more effectively.
Learning to clarify: Multi-turn conversations with Action-Based Contrastive Self-Training
Generative AI
How we brought multimedia search to Dropbox Dash
Our multimedia retrieval features allow users to find images, video, and audio just as easily as they find documents.
Fine-tuning LLMs with user-level differential privacy
Algorithms & Theory
Google Research at Google I/O 2025
Climate & Sustainability
Deeper insights into retrieval augmented generation: The role of sufficient context
Data Mining & Modeling
Differential privacy on trust graphs
Algorithms & Theory
Bringing 3D shoppable products online with generative AI
Generative AI
Incident Report: Spotify Outage on April 16, 2025
On April 16, Spotify experienced an outage that affected users worldwide. Here is what happened and what we... The post Incident Report: Spotify Outage on April 16, 2025 appeared first on Spotify Engineering.
A new light on neural connections
General Science
Making complex text understandable: Minimally-lossy text simplification with Gemini
Generative AI
Amplify Initiative: Localized data for globalized AI
Generative AI
AMIE gains vision: A research AI agent for multimodal diagnostic dialogue
Generative AI
Benchmarking LLMs for global health
Generative AI
Improving brain models with ZAPBench
General Science
Introducing Mobility AI: Advancing urban transportation
Algorithms & Theory
A new hybrid platform for quantum simulation of magnetism
Quantum
InstructPipe: Generating Visual Blocks pipelines with human instructions and LLMs
Human-Computer Interaction and Visualization
Teaching machines the language of biology: Scaling large language models for next-generation single-cell analysis
Health & Bioscience
Optimizing Our E2E Pipeline
In the world of DevOps and Developer Experience (DevXP), speed and efficiency can make a big difference on an engineer’s day-to-day tasks. Today, we’ll dive into how Slack’s DevXP team took some existing tools and used them to optimize an end-to-end (E2E) testing pipeline. This lowered build times and reduced redundant processes, saving both time…
Geospatial Reasoning: Unlocking insights with generative AI and multiple foundation models
Climate & Sustainability
Evaluating progress of LLMs on scientific problem-solving
General Science
ECLeKTic: A novel benchmark for evaluating cross-lingual knowledge transfer in LLMs
Generative AI
Accelerating Large-Scale Test Migration with LLMs
How Airbnb migrated nearly 3.5K Enzyme test files to React Testing Library in just 6 weeks using automation and LLMs
Embedding-Based Retrieval for Airbnb Search
Our journey in applying embedding-based retrieval techniques to build an accurate and scalable candidate retrieval system for Airbnb Homes search
The evolution of graph learning
Algorithms & Theory
Deciphering language processing in the human brain through LLM representations
General Science
Load balancing with random job arrivals
Algorithms & Theory
Loss of Pulse Detection on the Google Pixel Watch 3
Health & Bioscience
Generating synthetic data with differentially private LLM inference
Machine Intelligence
How we built enterprise search to be secure and private
Many don’t know that “Slack” is in fact a backronym—it stands for “Searchable Log of all Communication and Knowledge”. And these days, it’s not just a searchable log: with Slack AI, Slack is now an intelligent log, leveraging the latest in generative AI to securely surface powerful, time-saving insights. We built Slack AI from the…
From diagnosis to treatment: Advancing AMIE for longitudinal disease management
Generative AI
Discovering new words with confidential federated analytics
Mobile Systems
How Microsoft Engineers Build AI: Learn about scalable RAG-enabled AI Apps
For developers, the emphasis on building intelligence into apps has never been clearer. Over the next three years, 92% of companies plan on investing in AI to achieve business outcomes like enhancing productivity and delivering better customer service. At Microsoft, developers and engineers are pushing the boundaries of AI at scale, crafting applications that harness […] The post How Microsoft Engineers Build AI: Learn about scalable RAG-enabled AI Apps appeared first on Engineering@Microsoft.
Mind the GAP: Geometry Aware Passthrough mitigates cybersickness
Human-Computer Interaction and Visualization
Accelerating scientific breakthroughs with an AI co-scientist
Generative AI
Mechanism design for large language models
Algorithms & Theory
Building AI for the pluralistic society
Generative AI
Urban mobility solutions: Calibrating digital twins at scale
Algorithms & Theory
Chain of Agents: Large language models collaborating on long-context tasks
Generative AI
Parfait: Enabling private AI with research tools
Distributed Systems & Parallel Computing
Zero-shot mono-to-binaural speech synthesis
Sound & Acoustics
Automated Accessibility Testing at Slack
At Slack, customer love is our first priority and accessibility is a core tenet of customer trust. We have our own Slack Accessibility Standards that product teams follow to guarantee their features are compliant with Web Content Accessibility Guidelines (WCAG). Our dedicated accessibility team supports developers in following these guidelines throughout the development process. We…
Dev Box Ready-To-Code Dev Box images template
Microsoft One Engineering System (1ES) team shares a sample for building Ready-To-Code Dev Box environments pre-configured with the necessary tools, repositories, and settings, ensuring consistency and reliability across teams. The post Dev Box Ready-To-Code Dev Box images template appeared first on Engineering@Microsoft.
Common annotated security keys
In April 2021, GitHub announced changes to their security token format that significantly enhanced security. The improvement leveraged two straightforward techniques: a fixed signature in the generated token and a checksum – both of which are highly effective in eliminating false positives (noise) and false negatives (missed findings). Microsoft also implements these techniques widely in […] The post Common annotated security keys appeared first on Engineering@Microsoft.
Managed DevOps Pools – The Origin Story
Learn about how Microsoft's 1ES organization developed an internal service called "1ES Hosted Pools" to manage Microsoft's diverse Engineering system infrastructure and how it helped make significant improvements to productivity, cost savings, and security. This solution will soon be available as a third-party offering named "Managed DevOps Pools". The post Managed DevOps Pools – The Origin Story appeared first on Engineering@Microsoft.
Developing with Accessibility in Mind at Microsoft
Celebrate Global Accessibility Awareness Day (GAAD) by taking easy, actionable steps to build accessibility into your development lifecycle! Learn how tools like Accessibility Insights & Visual Studio can help find accessibility issues in development. The post Developing with Accessibility in Mind at Microsoft appeared first on Engineering@Microsoft.
Copy-on-Write performance and debugging
This is a follow-up to our previous coverage of Dev Drive and copy-on-write (CoW) linking. See our previous articles from May 24, 2023, October 13, 2023, and November 2, 2023. Dev Drive was released in Windows 11 in October, 2023, and will be part of Windows Server 2025 this fall. Server 2025 and Windows 11 […] The post Copy-on-Write performance and debugging appeared first on Engineering@Microsoft.
How we built “Ask Learn”, the RAG-based knowledge service
My name is Bob Tabor and I’m a member of Microsoft’s Skilling organization. We create documentation and training content about Azure, developer tooling and languages, AI, Windows and much more hosted at Microsoft Learn. Our organization also develops and maintains the content publishing platform, the content hosting platform, the interactivity, and popular sites like Microsoft […] The post How we built “Ask Learn”, the RAG-based knowledge service appeared first on Engineering@Microsoft.
Enhancing reliability in Microsoft Fabric and Azure Synapse through load testing
Microsoft has employed Azure Load Testing to enhance the reliability of Microsoft Fabric and Azure Synapse, ensuring they can handle high loads. Azure Synapse integrates various data analytics technologies, while Microsoft Fabric offers a full enterprise analytics solution. Through rigorous daily and weekly load testing, involving complex scenarios and extensive data sizes, Microsoft aims to identify and rectify potential issues, ensuring optimal performance. This testing, integrated within their development pipelines, supports continuous improvement, leverages Azure's scalability, and utilizes Power BI for detailed reporting, ultimately enhancing service reliability and user experience. The post Enhancing reliability in Microsoft Fabric and Azure Synapse through load testing appeared first on Engineering@Microsoft.