Engineering Feed

A curated reading feed of engineering blogs I follow — from teams building real systems at scale.

Context Search

199 articles

Ranked by recency by default. Add a concept query to surface related context.

Slack

Slack AI: The Path to Multi-Cloud

Useful angle: systems performance via performance

In early 2023, Slack faced a foundational challenge: serving Large Language Models (LLMs) at enterprise scale with the security, reliability, and performance our customers expect. Over three years, we evolved from basic infrastructure to orchestrating a sophisticated multi-cloud architecture. We didn’t just want shiny new models; we needed a system resilient to regional outages and…

Meta

SilverTorch: Index as Model — A New Retrieval Paradigm for Recommendation Systems

Useful angle: systems performance via throughput

We’re introducing SilverTorch, a reimagining of recommendation systems that unifies all retrieval components for user-generated content under a single architecture. SilverTorch shows up to 23.7x higher throughput than state-of-the-art approaches and 20.9x better compute cost efficiency than a CPU-based solution, while also improving accuracy. Our research paper, “SilverTorch: A [...]

AWS

Cyber resilience on AWS: A reference approach for recovery from ransomware and destructive events

Useful angle: real-time and reliable systems via resilience, recovery

Cyber resilience is the ability to recover workloads to a known-good state after an adversary has affected the environment. Prevention works to keep threat actors out and detection works to find them quickly. Cyber resilience focuses on recovery: restoring a trustworthy environment when backups, credentials, or parts of the infrastructure can no longer be assumed […]

AWS

How Synthesia optimizes generative AI video inference on Amazon EC2 G7e instances

Useful angle: systems performance via latency, throughput

This post introduces a video decoding optimization technique, developed in collaboration with the Synthesia Research Engineering team, which we call the Asynchronous Frame Generation Pipeline. Adopting this technique allows you to overlap GPU compute, device-to-host (D2H) data transfer, and host-side post-processing. In this post, we apply this technique to the VAE decoder of a Wan video generation model as an example, where our benchmarks on G7e show GPU kernel utilization increasing from 82% to 99.9%, in turn leading to an 8.2% decrease in latency (and increase in throughput) for video decoding. We expect this technique to benefit any customer with a chunked video generation pipeline that transfers frames to host memory.
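The overlap described in this excerpt can be illustrated with a small CPU-only sketch. It is not the implementation from the AWS post: it assumes three stand-in stage functions (`decode`, `transfer`, `postprocess` are hypothetical names) and connects them with threads and bounded queues, so that one chunk can be decoding while the previous chunk is transferring. A real pipeline would use GPU streams and asynchronous D2H copies instead, and in plain Python the stages only overlap in practice when they release the GIL.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the chunk stream


def stage(fn, inbox, outbox):
    """Apply fn to each item from inbox and forward the result to outbox."""
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)  # pass the sentinel on so later stages stop too
            return
        outbox.put(fn(item))


def run_pipeline(chunks, decode, transfer, postprocess, depth=2):
    """Run three stages concurrently; results come back in input order.

    depth bounds how many chunks may wait between stages, which caps
    memory use. Error handling is omitted: an exception inside a stage
    would stall this sketch.
    """
    q_in, q_dec, q_xfer = (queue.Queue(maxsize=depth) for _ in range(3))
    q_out = queue.Queue()  # unbounded so the last stage never blocks

    def feed():
        for chunk in chunks:
            q_in.put(chunk)
        q_in.put(_DONE)

    workers = [
        threading.Thread(target=feed),
        threading.Thread(target=stage, args=(decode, q_in, q_dec)),
        threading.Thread(target=stage, args=(transfer, q_dec, q_xfer)),
        threading.Thread(target=stage, args=(postprocess, q_xfer, q_out)),
    ]
    for worker in workers:
        worker.start()

    results = []
    while True:
        item = q_out.get()
        if item is _DONE:
            break
        results.append(item)
    for worker in workers:
        worker.join()
    return results


if __name__ == "__main__":
    # Stand-in stages: each is a trivial function in place of real GPU work.
    frames = run_pipeline(range(4), lambda c: c * 2, lambda c: c + 1, str)
    print(frames)
```

Each stage is a single thread reading a FIFO queue, so chunk order is preserved without extra bookkeeping.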

Microsoft

What is the history of the ERROR_ARENA_TRASHED error code?

Useful angle: systems performance via performance

The storage control blocks were destroyed.

Spotify

Better Experiments with LLM Evals — A funnel, not a fork

TL;DR: LLM evals, automated judges that assess relevance, coherence, and quality at scale, are a powerful new...

AWS

Streaming CloudWatch metrics to VPC-based OpenTelemetry collectors using Lambda

Useful angle: systems performance via latency

In this post, we demonstrate an approach we used to address this challenge for a customer by implementing an AWS Lambda transformation function that streams Amazon CloudWatch metrics directly to internal OpenTelemetry collectors running within a VPC.

Meta

Reel Friends: Building Social Discovery that Scales to Billions

On its face, the new Friend Bubbles feature looks simple enough. It highlights Reels your friends have watched and reacted to. But sometimes the features that seem the most straightforward require the deepest engineering work. On this episode of the Meta Tech Podcast, Pascal Hartig chats with Subasree and Joseph, two software engineers from the Facebook [...]

Meta

Migrating Data Ingestion Systems at Meta Scale

Useful angle: real-time and reliable systems via reliability

Meta’s data ingestion system, which our engineering teams leverage for up-to-date snapshots of the social graph, has recently undergone a significant revamp to enhance its reliability at scale. Moving from our legacy system to our new architecture required a large-scale migration of our entire data ingestion system. We’re sharing the solutions and strategies that enabled [...]

AWS

Building hybrid multi-tenant architecture for stateful services on AWS

Useful angle: systems performance via performance

In this post, we show you how to build a hybrid multi-tenant architecture that provides strong tenant isolation without requiring per-tenant AWS accounts. You learn how to configure Route 53 weighted routing to distribute traffic across multiple accounts, deploy Application Load Balancer listener rules for tenant-specific routing, create dedicated ECS clusters per tenant, and establish AWS PrivateLink connectivity to shared dependencies.

AWS

Choosing between single or multiple organizations in AWS Organizations

Useful angle: architecture depth via architecture

Organizations face critical architectural decisions that can impact their operations for years to come. One such decision: is it better to maintain a single organization or implement multiple organizations? In this post, I explain the key advantages and disadvantages of both approaches and the scenarios where each model fits best.

Meta

Labyrinth 1.1: Making End-to-End Encrypted Backups Even More Reliable

Useful angle: real-time and reliable systems via reliability

We’re rolling out version 1.1 of Labyrinth, the encrypted storage system and protocol that secures messages and history on Messenger. Labyrinth 1.1 enhances the reliability of end-to-end encrypted backups with a new sub-protocol that helps messages survive the loss of a device, a switched device, and long gaps between sign-ins. Read our updated white paper, [...]

Airbnb

Monitoring reliably at scale

Useful angle: real-time and reliable systems via incident, reliability

Designing monitoring that works when everything else doesn’t.

Slack

From SSH to REST: A Security-Driven Modernization of Slack’s EMR Data Pipelines

By 2024, Slack’s data platform had accumulated 700+ SSH-based operators orchestrating critical data pipelines. We’re talking daily search indexing that processed terabytes of data, analytics jobs powering business intelligence, the whole shebang. Every single one of these jobs required direct SSH access to production AWS Elastic MapReduce (EMR) clusters. We had a massive security…

Meta

How Meta Is Strengthening End-to-End Encrypted Backups

Useful angle: real-time and reliable systems via resilience, recovery

The HSM-based Backup Key Vault: Meta’s HSM-based Backup Key Vault provides the foundation for end-to-end encrypted backups for WhatsApp and Messenger. The system allows people to protect their backed-up message history with a recovery code, ensuring that the recovery code is stored in tamper-resistant hardware security modules (HSMs) and is inaccessible to Meta, cloud storage [...]

AWS

Modernizing KYC with AWS serverless solutions and agentic AI for financial services

Useful angle: systems performance via latency

This post extends IBM's approach to real-time KYC validation using generative AI, previously discussed in the post IBM Digital KYC on AWS uses Generative AI to transform Client Onboarding and KYC Operations. The solution uses agentic AI, event-driven architecture, and AWS serverless services to address the fundamental limitations of traditional rule-based systems, providing autonomous decision-making, dynamic adaptation, and intelligent automation for compliance operations.

AWS

PACIFIC enables multi-tenant, sovereign product carbon footprint exchange on the Catena-X data space using AWS

Useful angle: architecture depth via architecture

This post explores how PACIFIC enables multi-tenant, sovereign PCF exchange on the Catena-X data space using Amazon Elastic Container Service (Amazon ECS) on AWS Fargate, Amazon Cognito, and AWS Identity and Access Management (IAM) to deliver measurable environmental impact and competitive advantage in a carbon-conscious marketplace.

AWS

Real-time analytics: Oldcastle integrates Infor with Amazon Aurora and Amazon Quick Sight

Useful angle: systems performance via performance

This post explores how Oldcastle used AWS services to transform their analytics and AI capabilities by integrating Infor ERP with Amazon Aurora and Amazon Quick Sight. We discuss how they overcame the limitations of traditional cloud ERP reporting to deploy real-time dashboards and build a scalable analytics system. This practical, enterprise-grade approach offers a blueprint that organizations can adapt when extending ERP capabilities with cloud-native analytics and AI.

Meta

Modernizing the Facebook Groups Search to Unlock the Power of Community Knowledge

Useful angle: computational geometry and graphics via surface

We’ve fundamentally transformed Facebook Groups Search to help people more reliably discover, sort through, and validate community content that’s most relevant to them. We’ve adopted a new hybrid retrieval architecture and implemented automated model-based evaluation to address the major friction points people experience when searching community content. Under this new framework, we’ve made tangible improvements [...]

Meta

Capacity Efficiency at Meta: How Unified AI Agents Optimize Performance at Hyperscale

Useful angle: systems performance via performance, scaling

We’re sharing insights into Meta’s Capacity Efficiency Program, where we’ve built an AI agent platform that helps automate finding and fixing performance issues throughout our infrastructure. By leveraging encoded domain expertise across a unified, standardized tool interface, these agents help save power and free up engineers’ time away from addressing performance issues to innovating on [...]

Meta

Post-Quantum Cryptography Migration at Meta: Framework, Lessons, and Takeaways

Useful angle: real-time and reliable systems via resilience

We’re sharing lessons learned from Meta’s post-quantum cryptography (PQC) migration to help other organizations strengthen their resilience as the industry transitions to post-quantum cryptography standards. We’re proposing the idea of PQC Migration Levels to help teams within organizations manage the complexity of PQC migration for their various use cases. By outlining Meta’s approach to this work [...]

Slack

Managing context in long-run agentic applications

In complex, long-running agentic systems, maintaining alignment and coherent reasoning between agents requires careful design. In this second article of our series, we explore these challenges and the mechanisms we built to keep teams of agents working productively over long time spans. We present a range of complementary techniques that balance the conflicting requirements…

Meta

Escaping the Fork: How Meta Modernized WebRTC Across 50+ Use Cases

Useful angle: systems performance via performance, latency

At Meta, WebRTC powers real-time audio and video across various platforms. But forking a large open-source project like WebRTC within our monorepo presents unique challenges: over time, an internal fork can drift behind upstream, cutting itself off from community upgrades. We’re sharing how we escaped this “forking trap”, from building a dual-stack architecture [...]

AWS

Build a multi-tenant configuration system with tagged storage patterns

Useful angle: systems performance via performance, scaling

In this post, we demonstrate how you can build a scalable, multi-tenant configuration service using the tagged storage pattern, an architectural approach that uses key prefixes (like tenant_config_ or param_config_) to automatically route configuration requests to the most appropriate AWS storage service. This pattern maintains strict tenant isolation and supports real-time, zero-downtime configuration updates through event-driven architecture, alleviating the cache staleness problem.
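The prefix-based routing mentioned in this excerpt can be sketched in a few lines. This is only an illustration under stated assumptions, not the service from the AWS post: the prefixes `tenant_config_` and `param_config_` come from the excerpt, while `InMemoryStore` and `ConfigRouter` are hypothetical names, and plain dictionaries stand in for the real AWS storage services. Tenant isolation is modelled by making the tenant ID part of every storage key.

```python
class InMemoryStore:
    """Hypothetical stand-in for a real storage backend."""

    def __init__(self, name):
        self.name = name
        self._data = {}

    def get(self, tenant_id, key):
        # The tenant ID is part of the lookup key, so tenants never collide.
        return self._data.get((tenant_id, key))

    def put(self, tenant_id, key, value):
        self._data[(tenant_id, key)] = value


class ConfigRouter:
    """Route each configuration key to a backend based on its prefix."""

    def __init__(self, routes):
        # Check longer prefixes first so the most specific one wins.
        self._routes = sorted(
            routes.items(), key=lambda pair: len(pair[0]), reverse=True
        )

    def backend_for(self, key):
        for prefix, store in self._routes:
            if key.startswith(prefix):
                return store
        raise KeyError(f"no backend registered for key {key!r}")

    def get(self, tenant_id, key):
        return self.backend_for(key).get(tenant_id, key)

    def put(self, tenant_id, key, value):
        self.backend_for(key).put(tenant_id, key, value)


if __name__ == "__main__":
    router = ConfigRouter({
        "tenant_config_": InMemoryStore("tenant-store"),
        "param_config_": InMemoryStore("param-store"),
    })
    router.put("tenant-a", "tenant_config_theme", "dark")
    print(router.get("tenant-a", "tenant_config_theme"))
    print(router.get("tenant-b", "tenant_config_theme"))
```

Rejecting unknown prefixes with an error, rather than falling back to a default store, keeps a mistyped key from silently landing in the wrong backend.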

Spotify

Let’s Talk Agentic Development: Spotify x Anthropic Live

AI agents are transforming the way we build, and even how we think of ourselves as software developers. Both...

AWS

Automate safety monitoring with computer vision and generative AI

Useful angle: systems performance via scaling

This post describes a solution that uses fixed camera networks to monitor operational environments in near real-time, detecting potential safety hazards while capturing object floor projections and their relationships to floor markings. While we illustrate the approach through distribution center deployment examples, the underlying architecture applies broadly across industries. We explore the architectural decisions, strategies for scaling to hundreds of sites, reducing site onboarding time, synthetic data generation using generative AI tools like GLIGEN, and other critical technical hurdles we overcame.

AWS

Streamlining access to powerful disaster recovery capabilities of AWS

Useful angle: real-time and reliable systems via resilience, recovery

In this blog post, we take a building-blocks approach. Starting with tools like AWS Backup to protect your data, we then add protection for Amazon Elastic Compute Cloud (Amazon EC2) compute using AWS Elastic Disaster Recovery (AWS DRS). Finally, we show how to use the full capabilities of AWS to restore your entire workload (data, infrastructure, networking, and configuration) using Arpio disaster recovery automation.

Slack

From Custom to Open: Scalable Network Probing and HTTP/3 Readiness with Prometheus

Useful angle: real-time and reliable systems via availability

The Problem: Legacy Tooling and Its Limitations. Currently, Slack uses a hybrid approach to network measurement, incorporating both internal (such as traffic between AWS Availability Zones) and external (monitoring traffic from the public internet into Slack’s infrastructure) solutions. These tools comprise a combination of commercial SaaS offerings and custom-built network testing solutions developed by our…

AWS

How Aigen transformed agricultural robotics for sustainable farming with Amazon SageMaker AI

Useful angle: systems performance via throughput, scaling

In this post, you will learn how Aigen modernized its machine learning (ML) pipeline with Amazon SageMaker AI to overcome industry-wide agricultural robotics challenges and scale sustainable farming. This post focuses on the strategies and architecture patterns that enabled Aigen to modernize its pipeline across hundreds of distributed edge solar robots and showcase the significant business outcomes unlocked through this transformation. By adopting automated data labeling and human-in-the-loop validation, Aigen increased image labeling throughput by 20x while reducing image labeling costs by 22.5x.

AWS

Architecting for agentic AI development on AWS

Useful angle: architecture depth via architecture

In this post, we demonstrate how to architect AWS systems that enable AI agents to iterate rapidly through design patterns for both system architecture and code base structure. We first examine the architectural problems that limit agentic development today. We then walk through system architecture patterns that support rapid experimentation, followed by codebase patterns that help AI agents understand, modify, and validate your applications with confidence.

AWS

How Generali Malaysia optimizes operations with Amazon EKS

Useful angle: systems performance via performance

In this post, we look at how Generali is using Amazon EKS Auto Mode and its integration with other AWS services to enhance performance while reducing operational overhead, optimizing costs, and enhancing security.

Slack

How Slack Rebuilt Notifications 📣

At Slack, notifications are how teams stay in the loop, but they can also become overwhelming when not designed with intention. Our goal was to make staying informed feel effortless. We set out to rebuild one of Slack’s most complicated systems from the ground up by bringing calm, consistency, and clarity to the…

AWS

AI-powered event response for Amazon EKS

Useful angle: systems performance via performance

In this post, you'll learn how AWS DevOps Agent integrates with your existing observability stack to provide intelligent, automated responses to system events.

Dropbox

How we optimized Dash's relevance judge with DSPy

Useful angle: systems performance via performance

We used DSPy to turn prompt engineering for our relevance judge into a measurable, automated optimization loop, improving task performance, cost, and how reliably it works in production.

AWS

The Hidden Price Tag: Uncovering Hidden Costs in Cloud Architectures with the AWS Well-Architected Framework

Useful angle: systems performance via performance

In this post, we discuss how following the AWS Cloud Adoption Framework (AWS CAF) and AWS Well-Architected Framework can help reduce these risks through proper implementation of AWS guidance and best practices while taking into consideration the practical challenges organizations face in implementing these best practices, including resource constraints, evaluating trade-offs and competing business priorities.

Microsoft

Engineering and algorithmic interventions for multimodal post-training at Microsoft scale

Useful angle: systems performance via latency

Aditya Challapally leads post-training research and infrastructure for Copilot agent capabilities that process millions of multimodal interactions. This post builds on the diagnostics from Diagnosing instability in production-scale agent reinforcement learning with the engineering and algorithmic interventions we developed to get the best results out of post-training at scale. Post-training multimodal agents at scale […]

AWS

Digital Transformation at Santander: How Platform Engineering is Revolutionizing Cloud Infrastructure

Useful angle: architecture depth via architecture

Santander faced a significant technical challenge in managing an infrastructure that processes billions of daily transactions across more than 200 critical systems. The solution emerged through an innovative platform engineering initiative called Catalyst, which transformed the bank's cloud infrastructure and development management. This post analyzes the main cases, benefits, and results obtained with this initiative.

Google

Teaching AI to read a map

Useful angle: computational geometry and graphics via geometric

Machine Perception

Dropbox

How low-bit inference enables efficient AI

Making products like Dropbox Dash accessible to individuals and businesses means tackling new challenges around efficiency and resource use.

Microsoft

How we built the Microsoft Learn MCP Server

Useful angle: optimization techniques via optimized

When we launched the Microsoft Learn Model Context Protocol (MCP) Server in June 2025, our goal was simple: make it effortless for AI agents to use trusted, up-to-date Microsoft Learn documentation. GitHub Copilot and other agents are increasingly common, and they need to be able to ground responses just like humans with browsers do. Learn […]

Dropbox

Insights from our executive roundtable on AI and engineering productivity

From Claude Code to Cursor, we're big adopters of AI coding tools at Dropbox. The early results have been promising, but there are still a lot of open questions about how to work with these tools most effectively and where they can have the most impact. To push this conversation forward, we hosted an executive roundtable at our San Francisco studio. Here's how it went.

Microsoft

Diagnosing instability in production-scale agent reinforcement learning

Useful angle: computational geometry and graphics via surface

On January 28, 2026, Hugging Face announced that they have upstreamed the Post-Training Toolkit into TRL as a first-party integration, making these diagnostics directly usable in production RL and agent post-training pipelines. This enables closed-loop monitoring and control patterns that are increasingly necessary for long-running and continuously adapted agent systems. Documentation @ https://huggingface.co/docs/trl/main/en/ptt_integration. Overview In […]

Microsoft

The Interaction Changes Everything: Treating AI Agents as Collaborators, Not Automation

Useful angle: architecture depth via architecture

Discover how treating AI agents as collaborators, not automation, transforms engineering workflows and accelerates complex projects.

Slack

Streamlining Security Investigations with Agents

Useful angle: implementation detail via pipeline

Slack’s Security Engineering team is responsible for protecting Slack’s core infrastructure and services. Our security event ingestion pipeline handles billions of events per day from a diverse array of data sources. Reviewing alerts produced by our security detection system is our primary responsibility during on-call shifts. We’re going to show you how we’re using AI…

Slack

Android VPAT journey

A Voluntary Product Accessibility Template (VPAT) is a document that outlines how well a product aligns with accessibility (a11y) standards. Its primary purpose is to inform customers about a product’s a11y features, enabling them to make informed decisions before purchasing software. At Slack, we conducted a VPAT by a third party a11y vendor in…

Slack

Build better software to build software better

Useful angle: implementation detail via pipeline

We manage the build pipeline that delivers Quip and Slack Canvas’s backend. A year ago, we were chasing exciting ideas to help engineers ship better code, faster. But we had one huge problem: builds took 60 minutes. With a build that slow, the whole pipeline gets less agile, and feedback doesn’t come to engineers until…

Slack

Advancing Our Chef Infrastructure: Safety Without Disruption

Last year, I wrote a blog post titled Advancing Our Chef Infrastructure, where we explored the evolution of our Chef infrastructure over the years. We talked about the shift from a single Chef stack to a multi-stack model, and the challenges that came with it – from updating how we handle cookbook uploads to navigating…

Microsoft

Enhancing Code Quality at Scale with AI-Powered Code Reviews

Useful angle: benchmarking data via experiment

Microsoft’s AI-powered code review assistant has transformed pull request workflows by automating routine checks, suggesting improvements, and enabling conversational Q&A, leading to faster PR completion, improved code quality, and enhanced developer onboarding. Its seamless integration and customizability have driven widespread adoption within Microsoft.

Microsoft

How Microsoft Engineers Build AI: Learn about scalable RAG-enabled AI Apps

Useful angle: systems performance via performance

For developers, the emphasis on building intelligence into apps has never been clearer. Over the next three years, 92% of companies plan on investing in AI to achieve business outcomes like enhancing productivity and delivering better customer service. At Microsoft, developers and engineers are pushing the boundaries of AI at scale, crafting applications that harness […]

Microsoft

Dev Box Ready-To-Code Dev Box images template

Useful angle: real-time and reliable systems via reliability

The Microsoft One Engineering System (1ES) team shares a sample for building Ready-To-Code Dev Box environments pre-configured with the necessary tools, repositories, and settings, ensuring consistency and reliability across teams.

Microsoft

Common annotated security keys

In April 2021, GitHub announced changes to their security token format that significantly enhanced security. The improvement leveraged two straightforward techniques: a fixed signature in the generated token and a checksum, both of which are highly effective in eliminating false positives (noise) and false negatives (missed findings). Microsoft also implements these techniques widely in […]
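A minimal sketch of the two techniques named in this excerpt, a fixed signature plus a checksum, might look like the following. The layout is an assumption made for illustration and is neither GitHub's nor Microsoft's actual token format: `SIGNATURE`, `make_token`, and `looks_like_token` are hypothetical names, and the checksum here is simply a CRC32 of the rest of the token written as eight hex characters.

```python
import secrets
import zlib

SIGNATURE = "XMPL"  # hypothetical fixed marker that scanners can search for
CHECKSUM_LEN = 8    # CRC32 rendered as eight lowercase hex characters


def _checksum(payload):
    return format(zlib.crc32(payload.encode("utf-8")), "08x")


def make_token(body=None):
    """Build a token as: fixed signature + random body + checksum."""
    if body is None:
        body = secrets.token_hex(16)
    payload = SIGNATURE + body
    return payload + _checksum(payload)


def looks_like_token(candidate):
    """Return True only if the signature is present and the checksum matches.

    The signature lets a scanner find candidates cheaply; the checksum
    lets it discard look-alike strings without calling any service.
    """
    if not candidate.startswith(SIGNATURE):
        return False
    if len(candidate) <= len(SIGNATURE) + CHECKSUM_LEN:
        return False
    payload = candidate[:-CHECKSUM_LEN]
    return _checksum(payload) == candidate[-CHECKSUM_LEN:]


if __name__ == "__main__":
    token = make_token()
    print(looks_like_token(token))
    print(looks_like_token("XMPL" + "0" * 40))
```

The checksum only confirms that a string is well formed; it is not a secret and does not show that a token is valid or active.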

Microsoft

Managed DevOps Pools – The Origin Story

Useful angle: real-time and reliable systems via reliability

Learn how Microsoft's 1ES organization developed an internal service called "1ES Hosted Pools" to manage Microsoft's diverse engineering system infrastructure, and how it helped make significant improvements to productivity, cost savings, and security. This solution will soon be available as a third-party offering named "Managed DevOps Pools".

Microsoft

Developing with Accessibility in Mind at Microsoft

Celebrate Global Accessibility Awareness Day (GAAD) by taking easy, actionable steps to build accessibility into your development life cycle! Learn how tools like Accessibility Insights & Visual Studio can help find accessibility issues in development.