Engineering Feed

A curated reading feed of engineering blogs I follow — from teams building real systems at scale.

474 articles

Ranked by recency by default. Add a concept query to surface related context.

Meta

Capacity Efficiency at Meta: How Unified AI Agents Optimize Performance at Hyperscale

Useful angle: systems performance via performance, scaling

We’re sharing insights into Meta’s Capacity Efficiency Program, where we’ve built an AI agent platform that helps automate finding and fixing performance issues throughout our infrastructure. By leveraging encoded domain expertise across a unified, standardized tool interface these agents help save power and free up engineers’ time away from addressing performance issues to innovating on [...] Read More... The post Capacity Efficiency at Meta: How Unified AI Agents Optimize Performance at Hyperscale appeared first on Engineering at Meta.

Meta

Post-Quantum Cryptography Migration at Meta: Framework, Lessons, and Takeaways

Useful angle: real-time and reliable systems via resilience

We’re sharing lessons learned from Meta’s post-quantum cryptography (PQC) migration to help other organizations strengthen their resilience as the industry transitions to post-quantum cryptography standards. We’re proposing the idea of PQC Migration Levels to help teams within organizations manage the complexity of PQC migration for their various use cases. By outlining Meta’s approach to this work [...]

Slack

Managing context in long-run agentic applications

In complex, long-running agentic systems, maintaining alignment and coherent reasoning between agents requires careful design. In this second article of our series, we explore these challenges and the mechanisms we built to keep teams of agents working productively over long time spans. We present a range of complementary techniques that balance the conflicting requirements…

Meta

Escaping the Fork: How Meta Modernized WebRTC Across 50+ Use Cases

Useful angle: systems performance via performance, latency

At Meta, WebRTC powers real-time audio and video across various platforms. But forking a large open-source project like WebRTC within our monorepo presents unique challenges – over time, an internal fork can drift behind upstream, cutting itself off from community upgrades. We’re sharing how we escaped this “forking trap” – from building a dual-stack architecture [...]

AWS

Build a multi-tenant configuration system with tagged storage patterns

Useful angle: systems performance via performance, scaling

In this post, we demonstrate how you can build a scalable, multi-tenant configuration service using the tagged storage pattern, an architectural approach that uses key prefixes (like tenant_config_ or param_config_) to automatically route configuration requests to the most appropriate AWS storage service. This pattern maintains strict tenant isolation and supports real-time, zero-downtime configuration updates through event-driven architecture, alleviating the cache staleness problem.
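
As a rough illustration of the prefix-routing idea in the excerpt above, here is a minimal sketch. The two prefixes come from the excerpt. The backend functions, the tenant-scoped key layout, and the name `route_config_request` are hypothetical stand-ins, since the excerpt does not say which AWS storage service serves which prefix.

```python
from typing import Callable, Dict


# Hypothetical backends: placeholders for whichever storage services
# the real system uses. They only echo the key they would look up.
def read_from_table_store(key: str) -> str:
    return f"table-store:{key}"


def read_from_parameter_store(key: str) -> str:
    return f"parameter-store:{key}"


# Key prefix -> backend. The prefixes are the ones named in the excerpt;
# the mapping to a particular backend is an assumption.
PREFIX_ROUTES: Dict[str, Callable[[str], str]] = {
    "tenant_config_": read_from_table_store,
    "param_config_": read_from_parameter_store,
}


def route_config_request(tenant_id: str, key: str) -> str:
    """Send a configuration lookup to the backend registered for the key's prefix."""
    for prefix, backend in PREFIX_ROUTES.items():
        if key.startswith(prefix):
            # Scope every lookup by tenant so one tenant's request
            # never addresses another tenant's keys.
            return backend(f"{tenant_id}/{key}")
    raise KeyError(f"no storage backend registered for key {key!r}")
```

Keeping the routing table as plain data means a new prefix can be added without touching the lookup logic; unknown prefixes fail loudly instead of falling through to a default store.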

Meta

Trust But Canary: Configuration Safety at Scale

Useful angle: real-time and reliable systems via incident

As AI increases developer speed and productivity, it also increases the need for safeguards. On this episode of the Meta Tech Podcast, Pascal Hartig sits down with Ishwari and Joe from Meta’s Configurations team to discuss how Meta makes config rollouts safe at scale. Listen in to learn about canarying and progressive rollouts, the health checks [...]

Spotify

Let’s Talk Agentic Development: Spotify x Anthropic Live

AI agents are transforming the way we build — and even how we think of ourselves as software developers. Both... The post Let’s Talk Agentic Development: Spotify x Anthropic Live appeared first on Spotify Engineering.

Meta

How Meta Used AI to Map Tribal Knowledge in Large-Scale Data Pipelines

Useful angle: architecture depth via design choices

AI coding assistants are powerful but only as good as their understanding of your codebase. When we pointed AI agents at one of Meta’s large-scale data processing pipelines – spanning four repositories, three languages, and over 4,100 files – we quickly found that they weren’t making useful edits quickly enough. We fixed this by building [...]

Meta

KernelEvolve: How Meta’s Ranking Engineer Agent Optimizes AI Infrastructure

Useful angle: systems performance via performance, throughput

This is the second post in the Ranking Engineer Agent blog series exploring the autonomous AI capabilities accelerating Meta’s Ads Ranking innovation. The previous post introduced Ranking Engineer Agent’s ML exploration capability, which autonomously designs, executes, and analyzes ranking model experiments. This post covers how to optimize the low-level infrastructure that makes those models run [...]

AWS

Automate safety monitoring with computer vision and generative AI

Useful angle: systems performance via scaling

This post describes a solution that uses fixed camera networks to monitor operational environments in near real-time, detecting potential safety hazards while capturing object floor projections and their relationships to floor markings. While we illustrate the approach through distribution center deployment examples, the underlying architecture applies broadly across industries. We explore the architectural decisions, strategies for scaling to hundreds of sites, reducing site onboarding time, synthetic data generation using generative AI tools like GLIGEN, and other critical technical hurdles we overcame.

AWS

Streamlining access to powerful disaster recovery capabilities of AWS

Useful angle: real-time and reliable systems via resilience, recovery

In this blog post, we take a building blocks approach. Starting with the tools like AWS Backup to protect your data, we then add protection for Amazon Elastic Compute Cloud (Amazon EC2) compute using AWS Elastic Disaster Recovery (AWS DRS). Finally, we show how to use the full capabilities of AWS to restore your entire workload—data, infrastructure, networking, and configuration, using Arpio disaster recovery automation.

Slack

From Custom to Open: Scalable Network Probing and HTTP/3 Readiness with Prometheus

Useful angle: real-time and reliable systems via availability

The Problem: Legacy Tooling and Its Limitations. Currently, Slack utilizes a hybrid approach to network measurement, incorporating both internal (such as traffic between AWS Availability Zones) and external (monitoring traffic from the public internet into Slack’s infrastructure) solutions. These tools comprise a combination of commercial SaaS offerings and custom-built network testing solutions developed by our…

Meta

Meta Adaptive Ranking Model: Bending the Inference Scaling Curve to Serve LLM-Scale Models for Ads

Useful angle: systems performance via performance, latency

Meta continues to lead the industry in utilizing groundbreaking AI Recommendation Systems (RecSys) to deliver better experiences for people, and better results for advertisers. To reach the next frontier of performance, we are scaling Meta’s Ads Recommender runtime models to LLM-scale & complexity to further a deeper understanding of people’s interests and intent. This increase [...]

Meta

AI for American-Produced Cement and Concrete

Useful angle: systems performance via performance

Meta is continuing its long-term roadmap to help the construction industry leverage AI to produce high-quality and more sustainable concrete mixes, as well as those exclusively produced in the United States. Concurrent with the 2026 American Concrete Institute (ACI) Spring Convention, Meta is releasing a new AI model for designing concrete mixes – Bayesian Optimization [...]

AWS

How Aigen transformed agricultural robotics for sustainable farming with Amazon SageMaker AI

Useful angle: systems performance via throughput, scaling

In this post, you will learn how Aigen modernized its machine learning (ML) pipeline with Amazon SageMaker AI to overcome industry-wide agricultural robotics challenges and scale sustainable farming. This post focuses on the strategies and architecture patterns that enabled Aigen to modernize its pipeline across hundreds of distributed edge solar robots and showcase the significant business outcomes unlocked through this transformation. By adopting automated data labeling and human-in-the-loop validation, Aigen increased image labeling throughput by 20x while reducing image labeling costs by 22.5x.

AWS

Architecting for agentic AI development on AWS

Useful angle: architecture depth via architecture

In this post, we demonstrate how to architect AWS systems that enable AI agents to iterate rapidly through design patterns for both system architecture and code base structure. We first examine the architectural problems that limit agentic development today. We then walk through system architecture patterns that support rapid experimentation, followed by codebase patterns that help AI agents understand, modify, and validate your applications with confidence.

AWS

How Generali Malaysia optimizes operations with Amazon EKS

Useful angle: systems performance via performance

In this post, we look at how Generali is using Amazon EKS Auto Mode and its integration with other AWS services to enhance performance while reducing operational overhead, optimizing costs, and enhancing security.

Slack

How Slack Rebuilt Notifications 📣

At Slack, notifications are how teams stay in the loop, but they can also become overwhelming when not designed with intention. Our goal was to make staying informed feel effortless. We set out to rebuild one of Slack’s most complicated systems from the ground up by bringing calm, consistency, and clarity to the…

AWS

AI-powered event response for Amazon EKS

Useful angle: systems performance via performance

In this post, you'll learn how AWS DevOps Agent integrates with your existing observability stack to provide intelligent, automated responses to system events.

Meta

Friend Bubbles: Enhancing Social Discovery on Facebook Reels

Useful angle: computational geometry and graphics via surface

Friend bubbles in Facebook Reels highlight Reels your friends have liked or reacted to, helping you discover new content and making it easier to connect over shared interests. This article explains the technical architecture behind friend bubbles, including how machine learning estimates relationship strength and ranks content your friends have interacted with to create more [...]

Microsoft

Windows stack limit checking retrospective: Alpha AXP

Useful angle: memory and parallelism via memory

Double the size, double the fun. The post Windows stack limit checking retrospective: Alpha AXP appeared first on The Old New Thing.

Meta

Ranking Engineer Agent (REA): The Autonomous AI Agent Accelerating Meta’s Ads Ranking Innovation

Useful angle: benchmarking data via experiment

Meta’s Ranking Engineer Agent (REA) autonomously executes key steps across the end-to-end machine learning (ML) lifecycle for ads ranking models. This post covers REA’s ML experimentation capabilities: autonomously generating hypotheses, launching training jobs, debugging failures, and iterating on results. Future posts will cover additional REA capabilities. REA reduces the need for manual intervention. It manages [...]

Dropbox

How we optimized Dash's relevance judge with DSPy

Useful angle: systems performance via performance

We used DSPy to turn prompt engineering for our relevance judge into a measurable, automated optimization loop, improving task performance, cost, and how reliably it works in production.

Meta

Patch Me If You Can: AI Codemods for Secure-by-Default Android Apps

Even seemingly simple engineering tasks, like updating an API, can become monumental undertakings when you’re dealing with millions of lines of code and thousands of engineers, especially if the changes are security-related. Nowhere is this more apparent than in mobile security, where a single class of vulnerability can be replicated across hundreds of [...]

Microsoft

Windows stack limit checking retrospective: MIPS

Useful angle: memory and parallelism via memory, cache

Optimizing out the unnecessary probes comes with its own complexity.

Uber

How Uber Built an Agentic System to Automate Design Specs in Minutes

Useful angle: implementation detail via implementation, pipeline

Uber is setting a new standard for design systems by using the Figma Console MCP to shatter the manual documentation bottleneck. By letting AI agents pull directly from design data, weeks of spec writing turn into minutes of automated precision.

Meta

How Advanced Browsing Protection Works in Messenger

Useful angle: optimization techniques via improved

We’re sharing the technical details behind how Advanced Browsing Protection (ABP) in Messenger protects the privacy of the links clicked on within chats while still warning people about malicious links. We hope that this post has helped to illuminate some of the engineering challenges and infrastructure components involved for providing this feature for our users. [...]

Uber

Building High Throughput Payment Account Processing

Useful angle: systems performance via throughput

Uber’s Payment Account Batch Processing system handles over 30 financial update operations per second for hot accounts with sub-second batching and strict consistency. Learn how we built it without using special hardware or software.

AWS

The Hidden Price Tag: Uncovering Hidden Costs in Cloud Architectures with the AWS Well-Architected Framework

Useful angle: systems performance via performance

In this post, we discuss how following the AWS Cloud Adoption Framework (AWS CAF) and AWS Well-Architected Framework can help reduce these risks through proper implementation of AWS guidance and best practices while taking into consideration the practical challenges organizations face in implementing these best practices, including resource constraints, evaluating trade-offs and competing business priorities.

Meta

FFmpeg at Meta: Media Processing at Scale

Useful angle: real-time and reliable systems via real time, reliability

FFmpeg is truly a multi-tool for media processing. As an industry-standard tool it supports a wide variety of audio and video codecs and container formats. It can also orchestrate complex chains of filters for media editing and manipulation. For the people who use our apps, FFmpeg plays an important role in enabling new video experiences [...]

Meta

Investing in Infrastructure: Meta’s Renewed Commitment to jemalloc

Useful angle: systems performance via performance

Meta recognizes the long-term benefits of jemalloc, a high-performance memory allocator, in its software infrastructure. We are renewing focus on jemalloc, aiming to reduce maintenance needs and modernize the codebase while continuing to evolve the allocator to adapt to the latest hardware and workloads. We are committed to continuing jemalloc development with the [...]

Microsoft

Engineering and algorithmic interventions for multimodal post-training at Microsoft scale

Useful angle: systems performance via latency

Aditya Challapally leads post-training research and infrastructure for Copilot agent capabilities that process millions of multimodal interactions. This post builds on the diagnostics from Diagnosing instability in production-scale agent reinforcement learning with the engineering and algorithmic interventions we developed to get the best results out of post training at scale. Post-training multimodal agents at scale […] The post Engineering and algorithmic interventions for multimodal post-training at Microsoft scale appeared first on Engineering@Microsoft.

AWS

Digital Transformation at Santander: How Platform Engineering is Revolutionizing Cloud Infrastructure

Useful angle: architecture depth via architecture

Santander faced a significant technical challenge in managing an infrastructure that processes billions of daily transactions across more than 200 critical systems. The solution emerged through an innovative platform engineering initiative called Catalyst, which transformed the bank's cloud infrastructure and development management. This post analyzes the main cases, benefits, and results obtained with this initiative.

Uber

Superuser Gateway: Guardrails for Privileged Command Execution

Useful angle: real-time and reliable systems via recovery

Learn how Uber’s new Superuser Guardrails turn risky manual commands into peer-reviewed, machine-validated changes, and how to apply this pattern to your own systems.

AWS

6,000 AWS accounts, three people, one platform: Lessons learned

Useful angle: real-time and reliable systems via real time

This post describes why ProGlove chose an account-per-tenant approach for our serverless SaaS architecture and how it changes the operational model. It covers the challenges you need to anticipate around automation, observability, and cost. We will also discuss how the approach can affect other operational models in different environments like an enterprise context.

Airbnb

Improving Search Ranking for Maps

Useful angle: implementation detail via algorithm

How Airbnb is adapting ranking for our map interface.

Airbnb

Building a Next-Generation Key-Value Store at Airbnb

Useful angle: systems performance via performance, latency

How we completely rearchitected Mussel, our storage engine for derived data, and lessons learned from the migration from Mussel V1 to V2.

Airbnb

Pay as a Local

Useful angle: architecture depth via architecture

How Airbnb rolled out 20+ locally relevant payment methods worldwide in just 14 months

Airbnb

Academic Publications & Airbnb Tech: 2025 Year in Review

Useful angle: benchmarking data via measurement, evaluation

2025 was a big year for research at Airbnb, as we made significant progress toward our mission to use AI, data science, and machine learning to become the best travel and living platform.

Airbnb

My Journey to Airbnb — Anna Sulkina

Useful angle: memory and parallelism via parallel

Anna Sulkina has always been a traveler, and we’re lucky her travels have brought her to Airbnb. Anna is a Senior Director of Engineering, and she’s responsible for Application & Cloud infrastructure.

Meta

RCCLX: Innovating GPU Communications on AMD Platforms

Useful angle: systems performance via performance, latency

We are open-sourcing the initial version of RCCLX – an enhanced version of RCCL that we developed and tested on Meta’s internal workloads. RCCLX is fully integrated with Torchcomms and aims to empower researchers and developers to accelerate innovation, regardless of their chosen backend. Communication patterns for AI models are constantly evolving, as are hardware [...]

Spotify

Our Multi-Agent Architecture for Smarter Advertising

Useful angle: architecture depth via architecture

When we kicked this off, we weren’t trying to ship an “AI feature.” We were trying to fix a structural...

Uber

Database Federation: Decentralized and ACL-Compliant Hive™ Databases

Useful angle: systems performance via latency

Uber’s 10PB, 16K-dataset Hive monolith for the Delivery business had huge limitations. See how we transformed it into a secure, scalable, decentralized platform with zero downtime and saved more than 1PB along the way.

Google

Teaching AI to read a map

Useful angle: computational geometry and graphics via geometric

Machine Perception

Microsoft

Microspeak: Escrow

Final build, final, final, final 2, ship this one.

Dropbox

How low-bit inference enables efficient AI

Making products like Dropbox Dash accessible to individuals and businesses means tackling new challenges around efficiency and resource use.

Uber

Uber’s Rate Limiting System

Useful angle: systems performance via latency, capacity

Discover how Uber built and automated a global rate-limiting system that protects millions of RPCs per second, improving reliability, reducing latency, and simplifying operations across our service mesh.

Microsoft

How we built the Microsoft Learn MCP Server

Useful angle: optimization techniques via optimized

When we launched the Microsoft Learn Model Context Protocol (MCP) Server in June 2025, our goal was simple: make it effortless for AI agents to use trusted, up-to-date Microsoft Learn documentation. GitHub Copilot and other agents are increasingly common, and they need to be able to ground responses just like humans with browsers do. Learn […]

Meta

The Death of Traditional Testing: Agentic Development Broke a 50-Year-Old Field, JiTTesting Can Revive It

Useful angle: systems performance via scaling

The rise of agentic software development means code is being written, reviewed, and shipped faster than ever before across the entire industry. It also means that testing frameworks need to evolve for this rapidly changing landscape. Faster development demands faster testing that can catch bugs as they land in a codebase, without [...]

Dropbox

Insights from our executive roundtable on AI and engineering productivity

From Claude Code to Cursor, we're big adopters of AI coding tools at Dropbox. The early results have been promising, but there are still a lot of open questions about how to work with these tools most effectively and where they can have the most impact. To push this conversation forward, we hosted an executive roundtable at our San Francisco studio. Here's how it went.

Meta

Building Prometheus: How Backend Aggregation Enables Gigawatt-Scale AI Clusters

Useful angle: systems performance via performance, latency

We’re sharing details of the role backend aggregation (BAG) plays in building Meta’s gigawatt-scale AI clusters like Prometheus. BAG allows us to seamlessly connect thousands of GPUs across multiple data centers and regions. Our BAG implementation is connecting two different network fabrics – Disaggregated Schedule Fabric (DSF) and Non-Scheduled Fabric (NSF). Once it’s complete our AI [...]

Uber

Introducing uForwarder: The Consumer Proxy for Kafka Async Queuing

Useful angle: real-time and reliable systems via real time

Uber processes trillions of Kafka messages per day on a push-based consumer proxy in real time. Read this blog to learn about the thinking behind open source uForwarder before applying it to your use cases.

Meta

No Display? No Problem: Cross-Device Passkey Authentication for XR Devices

We’re sharing a novel approach to enabling cross-device passkey authentication for devices with inaccessible displays (like XR devices). Our approach bypasses the use of QR codes and enables cross-device authentication without the need for an on-device display, while still complying with all trust and proximity requirements. This approach builds on work done by the FIDO [...]

AWS

Mastering millisecond latency and millions of events: The event-driven architecture behind the Amazon Key Suite

Useful angle: systems performance via latency

In this post, we explore how the Amazon Key team used Amazon EventBridge to modernize their architecture, transforming a tightly coupled monolithic system into a resilient, event-driven solution. We explore the technical challenges we faced, our implementation approach, and the architectural patterns that helped us achieve improved reliability and scalability. The post covers our solutions for managing event schemas at scale, handling multiple service integrations efficiently, and building an extensible architecture that accommodates future growth.

AWS

Sovereign failover – Design for digital sovereignty using the AWS European Sovereign Cloud

Useful angle: real-time and reliable systems via recovery

This post explores the architectural patterns, challenges, and best practices for building cross-partition failover, covering network connectivity, authentication, and governance. By understanding these constraints, you can design resilient cloud-native applications that balance regulatory compliance with operational continuity.

Airbnb

My Journey to Airbnb: Peter Coles

The story of Airbnb’s Head Economist for Policy and Director of Data Science involves geology, co-teaching with a Nobel Prize winner, and CSI. (No, not the hit TV franchise.)

AWS

Announcing the AWS Digital Sovereignty Well-Architected Lens

Useful angle: architecture depth via architecture

As organizations accelerate cloud adoption, meeting digital sovereignty requirements has become essential to build trust with customers and regulators worldwide. The challenge isn’t whether to adopt the cloud—it’s how to do so while meeting sovereignty requirements, using a multidisciplinary approach. Even though requirements vary by geography, organizations commonly address them through technical and operational controls […]

AWS

How Artera enhances prostate cancer diagnostics using AWS

Useful angle: architecture depth via architecture

In this post, we explore how Artera used Amazon Web Services (AWS) to develop and scale their AI-powered prostate cancer test, accelerating time to results and enabling personalized treatment recommendations for patients.

Uber

How Uber Scaled Data Replication to Move Petabytes Every Day

Useful angle: systems performance via performance

Uber prioritizes a reliable data lake, which is distributed across on-premise and cloud environments. This multi-region setup presents challenges for ensuring reliable and timely data access due to limited network bandwidth and the need for seamless data availability, particularly for disaster recovery. Uber uses the Hive Sync service, which uses Apache Hadoop® Distcp (Distributed Copy) for data replication. However, with Uber’s Data Lake exceeding 350 PB, Distcp’s limitations became apparent. This blog explores the optimizations made to Distcp to enhance its performance and meet Uber’s growing data replication and disaster recovery needs across its distributed infrastructure.

Microsoft

Diagnosing instability in production-scale agent reinforcement learning

Useful angle: computational geometry and graphics via surface

On January 28, 2026, Hugging Face announced that they have upstreamed the Post-Training Toolkit into TRL as a first-party integration, making these diagnostics directly usable in production RL and agent post-training pipelines. This enables closed-loop monitoring and control patterns that are increasingly necessary for long-running and continuously adapted agent systems. Documentation @ https://huggingface.co/docs/trl/main/en/ptt_integration. Overview In […]

Meta

Rust at Scale: An Added Layer of Security for WhatsApp

WhatsApp has adopted and rolled out a new layer of security for users – built with Rust – as part of its effort to harden defenses against malware threats. WhatsApp’s experience creating and distributing our media consistency library in Rust to billions of devices and browsers proves Rust is production ready at a global scale. [...]

Meta

Adapting the Facebook Reels RecSys AI Model Based on User Feedback

Useful angle: computational geometry and graphics via surface

We’ve improved personalized video recommendations on Facebook Reels by moving beyond metrics such as likes and watch time and directly leveraging user feedback. Our new User True Interest Survey (UTIS) model now helps surface more niche, high-quality content and boosts engagement, retention, and satisfaction. We’re doubling down on personalization, tackling challenges like sparse user data [...]

Meta

CSS at Scale With StyleX

Useful angle: systems performance via performance

Build a large enough website with a large enough codebase, and you’ll eventually find that CSS presents challenges at scale. It’s no different at Meta, which is why we open-sourced StyleX, a solution for CSS at scale. StyleX combines the ergonomics of CSS-in-JS with the performance of static CSS. It allows atomic styling of components [...]

Airbnb

Code of conduct

Airbnb's Code of conduct for Open Source.

Uber

From Monitoring to Observability: Our Ultra-Marathon to a Cloud-Native Platform

Useful angle: production lessons via operational

Managing a global corporate network at Uber’s scale can feel a bit like running an ultra-marathon. There are long stretches of smooth sailing, but you’re always preparing for the unexpected mountain pass or sudden change in weather. For years, our engineering teams have navigated this terrain with a traditional, monolithic monitoring system. We knew we needed to switch to a modern pair of carbon-fiber running shoes. This meant a complete overhaul: a journey to replace our legacy system with a cloud-native observability platform built for speed, flexibility, and endurance on an open-source stack.

Microsoft

2025 year-end link clearance

Another year gets relegated to history.

Meta

Python Typing Survey 2025: Code Quality and Flexibility As Top Reasons for Typing Adoption

The 2025 Typed Python Survey, conducted by contributors from JetBrains, Meta, and the broader Python typing community, offers a comprehensive look at the current state of Python’s type system and developer tooling. With 1,241 responses (a 15% increase from last year), the survey captures the evolving sentiment, challenges, and opportunities around Python typing in the [...] Read More... The post Python Typing Survey 2025: Code Quality and Flexibility As Top Reasons for Typing Adoption appeared first on Engineering at Meta.

Meta

DrP: Meta’s Root Cause Analysis Platform at Scale

Useful angle: real-time and reliable systems via incident

Incident investigation can be a daunting task in today’s digital landscape, where large-scale systems comprise numerous interconnected components and dependencies. DrP is a root cause analysis (RCA) platform, designed by Meta, to programmatically automate the investigation process, significantly reducing the mean time to resolve (MTTR) for incidents and alleviating on-call toil. Today, DrP is used [...] Read More... The post DrP: Meta’s Root Cause Analysis Platform at Scale appeared first on Engineering at Meta.

Uber

Powering Billion-Scale Vector Search with OpenSearch

Useful angle: systems performance via performance

Uber powers billion-scale vector search with OpenSearch. Discover the innovative optimizations we designed to boost search efficiency, scalability, and reliability for massive datasets.

Meta

How We Built Meta Ray-Ban Display: From Zero to Polish

We’re going behind the scenes of the Meta Ray-Ban Display, Meta’s most advanced AI glasses yet. In a previous episode we met the team behind the Meta Neural Band, the EMG wristband packaged with the Ray-Ban Display. Now we’re delving into the glasses themselves. Kenan and Emanuel, from Meta’s Wearables org, join Pascal Hartig on [...] Read More... The post How We Built Meta Ray-Ban Display: From Zero to Polish appeared first on Engineering at Meta.

Meta

How AI Is Transforming the Adoption of Secure-by-Default Mobile Frameworks

Meta’s secure-by-default frameworks wrap potentially unsafe OS and third-party functions, making security the default while preserving developer speed and usability. These frameworks are designed to closely mirror existing APIs, rely on public and stable interfaces, and maximize developer adoption by minimizing friction and complexity. Generative AI and automation accelerate the adoption of secure frameworks at [...] Read More... The post How AI Is Transforming the Adoption of Secure-by-Default Mobile Frameworks appeared first on Engineering at Meta.

AWS

Architecting conversational observability for cloud applications

Useful angle: real-time and reliable systems via recovery

In this post, we walk through building a generative AI–powered troubleshooting assistant for Kubernetes. The goal is to give engineers a faster, self-service way to diagnose and resolve cluster issues, cut down Mean Time to Recovery (MTTR), and reduce the cycles experts spend finding the root cause of issues in complex distributed systems.

AWS

How BASF’s Agriculture Solutions drives traceability and climate action by tokenizing cotton value chains using Amazon Managed Blockchain

Useful angle: architecture depth via architecture

BASF Agricultural Solutions combines innovative products and digital tools with practical farmer knowledge. This post explores how Amazon Managed Blockchain can drive a positive change in the agricultural industry by tokenizing food and cotton value chains for traceability, climate action, and circularity.

Microsoft

How does Windows synthesize the CF_LOCALE clipboard format?

Getting it from a place that might have been obvious in the past, but maybe not today. The post How does Windows synthesize the CF_LOCALE clipboard format? appeared first on The Old New Thing.

AWS

She architects: Bringing unique perspectives to innovative solutions at AWS

Useful angle: architecture depth via architecture

Have you ever wondered what it is really like to be a woman in tech at one of the world's leading cloud companies? Or maybe you are curious about how diverse perspectives drive innovation beyond the buzzwords? Today, we are providing an insider's perspective on the role of a solutions architect (SA) at Amazon Web Services (AWS). However, this is not a typical corporate success story. We are three women who have navigated challenges, celebrated wins, and found our unique paths in the world of cloud architecture, and we want to share our real stories with you.

Microsoft

How can my process read its own standard output?

You'll have to trick yourself before anybody notices, which may not be possible. The post How can my process read its own standard output? appeared first on The Old New Thing.

Microsoft

The Interaction Changes Everything: Treating AI Agents as Collaborators, Not Automation

Useful angle: architecture depth via architecture

Discover how treating AI agents as collaborators, not automation, transforms engineering workflows and accelerates complex projects. The post The Interaction Changes Everything: Treating AI Agents as Collaborators, Not Automation appeared first on Engineering@Microsoft.

Microsoft

Microspeak: Big rocks

The large obstacles. The post Microspeak: Big rocks appeared first on The Old New Thing.

Slack

Streamlining Security Investigations with Agents

Useful angle: implementation detail via pipeline

Slack’s Security Engineering team is responsible for protecting Slack’s core infrastructure and services. Our security event ingestion pipeline handles billions of events per day from a diverse array of data sources. Reviewing alerts produced by our security detection system is our primary responsibility during on-call shifts. We’re going to show you how we’re using AI…

AWS

Secure Amazon Elastic VMware Service (Amazon EVS) with AWS Network Firewall

Useful angle: systems performance via performance

In this post, we demonstrate how to utilize AWS Network Firewall to secure an Amazon EVS environment, using a centralized inspection architecture across an EVS cluster, VPCs, on-premises data centers and the internet. We walk through the implementation steps to deploy this architecture using AWS Network Firewall and AWS Transit Gateway.

Uber

Evolution and Scale of Uber’s Delivery Search Platform

How does Uber Eats power search across billions of stores, dishes, and grocery items? We built a next-gen semantic search platform that understands meaning, not just keywords—handling typos, synonyms, and multiple languages.

Meta

Zoomer: Powering AI Performance at Meta’s Scale Through Intelligent Debugging and Optimization

Useful angle: systems performance via performance

We’re introducing Zoomer, Meta’s comprehensive, automated debugging and optimization platform for AI. Zoomer works across all of our training and inference workloads at Meta and provides deep performance insights that enable energy savings, workflow acceleration, and efficiency gains in our AI infrastructure. Zoomer has delivered training time reductions and significant QPS improvements, making it the [...] Read More... The post Zoomer: Powering AI Performance at Meta’s Scale Through Intelligent Debugging and Optimization appeared first on Engineering at Meta.

Meta

Key Transparency Comes to Messenger

We’re excited to share another advancement in the security of your conversations on Messenger: the launch of key transparency verification for end-to-end encrypted chats.  This new feature enables an additional level of assurance that only you — and the people you’re communicating with — can see or listen to what is sent, and that no [...] Read More... The post Key Transparency Comes to Messenger appeared first on Engineering at Meta.

Uber

Ceilometer: Uber’s Adaptive Benchmarking Framework

Useful angle: systems performance via performance

Dig into Ceilometer, Uber’s adaptive benchmarking framework for ensuring system performance and reliability at scale. Learn how it automates performance testing while providing production-like insights and continuous validation.

AWS

Building an AI gateway to Amazon Bedrock with Amazon API Gateway

Useful angle: real-time and reliable systems via real time

In this post, we'll explore a reference architecture that helps enterprises govern their Amazon Bedrock implementations using Amazon API Gateway. This pattern enables key capabilities like authorization controls, usage quotas, and real-time response streaming. We'll examine the architecture, provide deployment steps, and discuss potential enhancements to help you implement AI governance at scale.

AWS

Architecting for AI excellence: AWS launches three Well-Architected Lenses at re:Invent 2025

Useful angle: systems performance via performance

At re:Invent 2025, we introduce one new lens and two significant updates to the AWS Well-Architected Lenses specifically focused on AI workloads: the Responsible AI Lens, the Machine Learning (ML) Lens, and the Generative AI Lens. Together, these lenses provide comprehensive guidance for organizations at different stages of their AI journey, whether you're just starting to experiment with machine learning or already deploying complex AI applications at scale.

AWS

Announcing the updated AWS Well-Architected Generative AI Lens

Useful angle: architecture depth via architecture

We are delighted to announce an update to the AWS Well-Architected Generative AI Lens. This update features several new sections of the Well-Architected Generative AI Lens, including new best practices, advanced scenario guidance, and improved preambles on responsible AI, data architecture, and agentic workflows.

AWS

Announcing the updated AWS Well-Architected Machine Learning Lens

Useful angle: systems performance via performance

We are excited to announce the updated AWS Well-Architected Machine Learning Lens, now enhanced with the latest capabilities and best practices for building machine learning (ML) workloads on AWS.

Slack

Android VPAT journey

Background: A Voluntary Product Accessibility Template (VPAT) is a document that outlines how well a product aligns with accessibility (a11y) standards. Its primary purpose is to inform customers about a product’s a11y features, enabling them to make informed decisions before purchasing software. At Slack, we conducted a VPAT by a third-party a11y vendor in…

Meta

Efficient Optimization With Ax, an Open Platform for Adaptive Experimentation

Useful angle: architecture depth via architecture

We’ve released Ax 1.0, an open-source platform that uses machine learning to automatically guide complex, resource-intensive experimentation. Ax is used at scale across Meta to improve AI models, tune production infrastructure, and accelerate advances in ML and even hardware design. Our accompanying paper, “Ax: A Platform for Adaptive Experimentation” explains Ax’s architecture, methodology, and how it [...] Read More... The post Efficient Optimization With Ax, an Open Platform for Adaptive Experimentation appeared first on Engineering at Meta.

AWS

Build priority-based message processing with Amazon MQ and AWS App Runner

Useful angle: systems performance via scaling

In this post, we show you how to build a priority-based message processing system using Amazon MQ for priority queuing, Amazon DynamoDB for data persistence, and AWS App Runner for serverless compute. We demonstrate how to implement application-level delays that high-priority messages can bypass, create real-time UIs with WebSocket connections, and configure dual-layer retry mechanisms for maximum reliability.

Meta

Announcing the Completion of the Core 2Africa System: Building the Future of Connectivity Together

Connecting Africa and the World: We’re excited to share the completion of the core 2Africa infrastructure, the world’s longest open access subsea cable system. 2Africa is a landmark subsea cable system that sets a new standard for global connectivity. This project is the result of years of collaboration, innovation, and a shared vision to connect [...] Read More... The post Announcing the Completion of the Core 2Africa System: Building the Future of Connectivity Together appeared first on Engineering at Meta.

Meta

Enhancing HDR on Instagram for iOS With Dolby Vision

We’re sharing how we’ve enabled Dolby Vision and ambient viewing environment (amve) on the Instagram iOS app to enhance the video viewing experience. HDR videos created on iPhones contain unique Dolby Vision and amve metadata that we needed to support end-to-end. Instagram for iOS is now the first Meta app to support Dolby Vision video, [...] Read More... The post Enhancing HDR on Instagram for iOS With Dolby Vision appeared first on Engineering at Meta.

Meta

Open Source Is Good for the Environment

Most people have heard of open-source software. But have you heard about open hardware? And did you know open source can have a positive impact on the environment? On this episode of the Meta Tech Podcast, Pascal Hartig sits down with Dharmesh and Lisa to talk about all things open hardware, and Meta’s biggest announcements [...] Read More... The post Open Source Is Good for the Environment appeared first on Engineering at Meta.

AWS

Know before you go – AWS re:Invent 2025 guide to Well-Architected and Cloud Optimization sessions

Useful angle: architecture depth via architecture

Are you ready to maximize your Well-Architected and Cloud Optimization learning and networking time at re:Invent 2025? We have put together this comprehensive guide to help you plan your schedule and make the most of the Well-Architected and cloud optimization sessions available this year. These sessions will deliver the practical guidance your teams need to lead strategic cloud initiatives, design next-generation architectures, optimize costs, or secure AI-powered systems.

Spotify

Shuffle: Making Random Feel More Human

Shuffle has always been one of Spotify’s most-used features, and also one of the most misunderstood. For... The post Shuffle: Making Random Feel More Human appeared first on Spotify Engineering.

Meta

StyleX: A Styling Library for CSS at Scale

Useful angle: systems performance via performance

StyleX is Meta’s styling system for large-scale applications. It combines the ergonomics of CSS-in-JS with the performance of static CSS, generating collision-free atomic CSS while allowing for expressive, type-safe style authoring. StyleX was open sourced at the end of 2023 and has since become the standard styling system across Meta products like Facebook, Instagram, WhatsApp, [...] Read More... The post StyleX: A Styling Library for CSS at Scale appeared first on Engineering at Meta.

Meta

Meta’s Generative Ads Model (GEM): The Central Brain Accelerating Ads Recommendation AI Innovation

Useful angle: systems performance via performance

We’re sharing details about Meta’s Generative Ads Recommendation Model (GEM), a new foundation model that delivers increased ad performance and advertiser ROI by enhancing other ads recommendation models’ ability to serve relevant ads. GEM’s novel architecture allows it to scale with an increasing number of parameters while consistently generating more precise predictions efficiently. GEM propagates [...] Read More... The post Meta’s Generative Ads Model (GEM): The Central Brain Accelerating Ads Recommendation AI Innovation appeared first on Engineering at Meta.

Slack

Build better software to build software better

Useful angle: implementation detail via pipeline

We manage the build pipeline that delivers Quip and Slack Canvas’s backend. A year ago, we were chasing exciting ideas to help engineers ship better code, faster. But we had one huge problem: builds took 60 minutes. With a build that slow, the whole pipeline gets less agile, and feedback doesn’t come to engineers until…

Uber

Building Zone Failure Resilience in Apache Pinot™ at Uber

Useful angle: real-time and reliable systems via real time, reliability

By building zone failure resilience into Apache Pinot™, Uber strengthened reliability for real-time analytics, sped up release cycles, and created a foundation for future failure recovery. Now queries and ingestion stay strong, even when zones go dark.

Meta

Video Invisible Watermarking at Scale

Useful angle: systems performance via scaling

At Meta, we use invisible watermarking for a variety of content provenance use cases on our platforms. Invisible watermarking serves a number of use cases, including detecting AI-generated videos, verifying who posted a video first, and identifying the source and tools used to create a video. We’re sharing how we overcame the challenges of scaling [...] Read More... The post Video Invisible Watermarking at Scale appeared first on Engineering at Meta.

Uber

Raising the Bar on ML Model Deployment Safety

Useful angle: real-time and reliable systems via reliability

How do you safely ship thousands of ML models without slowing teams down? At Uber, we’ve built guardrails that catch issues early, prevent rollbacks, and raise the bar on reliability. Discover how safety became a measurable standard across our ML ecosystem.

Slack

Advancing Our Chef Infrastructure: Safety Without Disruption

Last year, I wrote a blog post titled Advancing Our Chef Infrastructure, where we explored the evolution of our Chef infrastructure over the years. We talked about the shift from a single Chef stack to a multi-stack model, and the challenges that came with it – from updating how we handle cookbook uploads to navigating…

Uber

Enabling Deep Model Explainability with Integrated Gradients at Uber

Useful angle: implementation detail via debugging

Uber’s ML platform Michelangelo now supports Integrated Gradients, enabling scalable, interpretable deep model explainability across TensorFlow™ and PyTorch™. Learn how this powers trust, debugging, and decision-making throughout the ML life cycle.

AWS

BASF Digital Farming builds a STAC-based solution on Amazon EKS

Useful angle: architecture depth via architecture

This post was co-written with Frederic Haase and Julian Blau with BASF Digital Farming GmbH. At xarvio – BASF Digital Farming, our mission is to empower farmers around the world with cutting-edge digital agronomic decision-making tools. Central to this mission is our crop optimization platform, xarvio FIELD MANAGER, which delivers actionable insights through a range […]

Uber

Rebuilding Uber’s Apache Pinot™ Query Architecture

Useful angle: real-time and reliable systems via real time

The next chapter of real-time analytics at Uber. Uncover how Uber restructured its Apache Pinot™ query architecture to unlock a ton of new features, redefining the capabilities of a mature OLAP platform.

Slack

Deploy Safety: Reducing customer impact from change

Useful angle: real-time and reliable systems via reliability

It’s mid 2023 and we’ve identified some opportunities to improve our reliability. Fast forward to January 2025. Customer impact hours are reduced from the peak by 90% and continuing to trend downward. We’re a year and a half into the Deploy Safety Program at Slack, improving the way we deploy, uplifting our safety culture and continuing…

AWS

Modernization of real-time payment orchestration on AWS

Useful angle: real-time and reliable systems via real time

The global real-time payments market is experiencing significant growth. According to Fortune Business Insights, the market was valued at USD 24.91 billion in 2024 and is projected to grow to USD 284.49 billion by 2032, with a CAGR of 35.4%. Similarly, Grand View Research reports that the global mobile payment market, valued at USD 88.50 […]

AWS

Build resilient generative AI agents

Useful angle: real-time and reliable systems via resilience

Generative AI agents in production environments demand resilience strategies that go beyond traditional software patterns. AI agents make autonomous decisions, consume substantial computational resources, and interact with external systems in unpredictable ways. These characteristics create failure modes that conventional resilience approaches might not address. This post presents a framework for AI agent resilience risk analysis […]

AWS

A scalable, elastic database and search solution for 1B+ vectors built on LanceDB and Amazon S3

Useful angle: architecture depth via architecture

In this post, we explore how Metagenomi built a scalable database and search solution for over 1 billion protein vectors using LanceDB and Amazon S3. The solution enables rapid enzyme discovery by transforming proteins into vector embeddings and implementing a serverless architecture that combines AWS Lambda, AWS Step Functions, and Amazon S3 for efficient nearest neighbor searches.

Uber

Adding Determinism and Safety to Uber IAM Policy Changes

Uber’s Policy Simulator tool enhances the safety and predictability of IAM policy changes by allowing policy authors to preview the impact of their modifications prior to deployment, ensuring deterministic outcomes after policy change deployment.

Airbnb

Migrating Airbnb’s JVM Monorepo to Bazel

By: Jack Dai, Howard Ho, Loc Dinh, Stepan Goncharov, Ted Tenedorio, and Thomas Bao. At Airbnb, we recently completed migrating our largest repo, the JVM monorepo, to Bazel. This repo contains tens of millions of lines of Java, Kotlin, and Scala code that power the vast array of backend services and data pipelines behind airbnb.com. Migration in numbers […]

Slack

Building Slack’s Anomaly Event Response

As cyberattacks evolve to unprecedented levels of sophistication and speed, the time gap between breach detection and response has never been more critical. Traditional security approaches often operate reactively, identifying compromises only after damage has occurred. This delay grants attackers a tactical advantage, forcing security teams to focus on damage assessment and remediation rather than…

Uber

Controlling the Rollout of Large-Scale Monorepo Changes

Discover how Uber controls the blast radius of large-scale commits with cross-cutting service deployment orchestration. As Uber embraces fully automated continuous deployment, strong safety practices are more critical than ever.

AWS

Simplify multi-tenant encryption with a cost-conscious AWS KMS key strategy

Useful angle: architecture depth via architecture

In this post, we explore an efficient approach to managing encryption keys in a multi-tenant SaaS environment through centralization, addressing challenges like key proliferation, rising costs, and operational complexity across multiple AWS accounts and services. We demonstrate how implementing a centralized key management strategy using a single AWS KMS key per tenant can maintain security and compliance while reducing operational overhead as organizations scale.

AWS

How CommBank made their CommSec trading platform highly available and operationally resilient

Useful angle: systems performance via capacity

In this post, we explore how CommSec, Australia's leading online broker, transitioned from a multicloud environment to AWS as their sole cloud provider while implementing Amazon Application Recovery Controller (ARC) zonal shift to maintain high availability and operational resilience. The consolidation resulted in significant benefits including 25% base capacity reduction, two times faster deployments, and improved failover capabilities through ARC zonal shift, enabling CommSec to continue serving millions of customers while meeting strict regulatory requirements.

AWS

How Karrot built a feature platform on AWS, Part 1: Motivation and feature serving

Useful angle: architecture depth via architecture

This two-part series shows how Karrot developed a new feature platform, which consists of three main components: feature serving, a stream ingestion pipeline, and a batch ingestion pipeline. This post starts by presenting our motivation, our requirements, and the solution architecture, focusing on feature serving.

AWS

How Karrot built a feature platform on AWS, Part 2: Feature ingestion

Useful angle: real-time and reliable systems via real time

This two-part series shows how Karrot developed a new feature platform, which consists of three main components: feature serving, a stream ingestion pipeline, and a batch ingestion pipeline. This post covers the process of collecting features in real-time and batch ingestion into an online store, and the technical approaches for stable operation.

AWS

Deploy LLMs on Amazon EKS using vLLM Deep Learning Containers

Useful angle: systems performance via performance

In this post, we demonstrate how to deploy the DeepSeek-R1-Distill-Qwen-32B model using AWS DLCs for vLLMs on Amazon EKS, showcasing how these purpose-built containers simplify deployment of this powerful open source inference engine. This solution can help you solve the complex infrastructure challenges of deploying LLMs while maintaining performance and cost-efficiency.

Airbnb

Seamless Istio Upgrades at Scale

How Airbnb upgrades tens of thousands of pods on dozens of Kubernetes clusters to new Istio versions

AWS

Maximizing Business Value Through Strategic Cloud Optimization

Useful angle: architecture depth via architecture

As cloud spending continues to surge, organizations must focus on strategic cloud optimization to maximize business value. This blog post explores key insights from MIT Technology Review's publication on cloud optimization, highlighting the importance of viewing optimization as a continuous process that encompasses all six AWS Well-Architected pillars.

AWS

How Zapier runs isolated tasks on AWS Lambda and upgrades functions at scale

Useful angle: architecture depth via architecture

In this post, you’ll learn how Zapier has built their serverless architecture focusing on three key aspects: using Lambda functions to build isolated Zaps, operating over a hundred thousand Lambda functions through Zapier's control plane infrastructure, and enhancing security posture while reducing maintenance efforts by introducing automated function upgrades and cleanup workflows into their platform architecture.

AWS

Implement monitoring for Amazon EKS with managed services

Useful angle: architecture depth via architecture

In this post, we show you how to implement comprehensive monitoring for Amazon Elastic Kubernetes Service (Amazon EKS) workloads using AWS managed services. This solution demonstrates building an EKS platform that combines flexible compute options with enterprise-grade observability using AWS native services and OpenTelemetry.

Microsoft

Enhancing Code Quality at Scale with AI-Powered Code Reviews

Useful angle: benchmarking data via experiment

Microsoft’s AI-powered code review assistant has transformed pull request workflows by automating routine checks, suggesting improvements, and enabling conversational Q&A, leading to faster PR completion, improved code quality, and enhanced developer onboarding. Its seamless integration and customizability have driven widespread adoption within Microsoft. The post Enhancing Code Quality at Scale with AI-Powered Code Reviews appeared first on Engineering@Microsoft.

Spotify

Incident Report: Spotify Outage on April 16, 2025

Useful angle: real-time and reliable systems via incident

On April 16, Spotify experienced an outage that affected users worldwide. Here is what happened and what we... The post Incident Report: Spotify Outage on April 16, 2025 appeared first on Spotify Engineering.

Slack

Optimizing Our E2E Pipeline

Useful angle: implementation detail via pipeline

In the world of DevOps and Developer Experience (DevXP), speed and efficiency can make a big difference on an engineer’s day-to-day tasks. Today, we’ll dive into how Slack’s DevXP team took some existing tools and used them to optimize an end-to-end (E2E) testing pipeline. This lowered build times and reduced redundant processes, saving both time…

Airbnb

Embedding-Based Retrieval for Airbnb Search

Our journey in applying embedding-based retrieval techniques to build an accurate and scalable candidate retrieval system for Airbnb Homes search

Slack

How we built enterprise search to be secure and private

Useful angle: computational geometry and graphics via surface

Many don’t know that “Slack” is in fact a backronym—it stands for “Searchable Log of all Communication and Knowledge”. And these days, it’s not just a searchable log: with Slack AI, Slack is now an intelligent log, leveraging the latest in generative AI to securely surface powerful, time-saving insights. We built Slack AI from the…

Microsoft

How Microsoft Engineers Build AI: Learn about scalable RAG-enabled AI Apps

Useful angle: systems performance via performance

For developers, the emphasis on building intelligence into apps has never been clearer. Over the next three years, 92% of companies plan on investing in AI to achieve business outcomes like enhancing productivity and delivering better customer service. At Microsoft, developers and engineers are pushing the boundaries of AI at scale, crafting applications that harness […] The post How Microsoft Engineers Build AI: Learn about scalable RAG-enabled AI Apps appeared first on Engineering@Microsoft.

Slack

Automated Accessibility Testing at Slack

At Slack, customer love is our first priority and accessibility is a core tenet of customer trust. We have our own Slack Accessibility Standards that product teams follow to guarantee their features are compliant with Web Content Accessibility Guidelines (WCAG). Our dedicated accessibility team supports developers in following these guidelines throughout the development process. We…

Microsoft

Dev Box Ready-To-Code Dev Box images template

Useful angle: real-time and reliable systems via reliability

Microsoft One Engineering System (1ES) team shares a sample for building Ready-To-Code Dev Box environments pre-configured with the necessary tools, repositories, and settings, ensuring consistency and reliability across teams. The post Dev Box Ready-To-Code Dev Box images template appeared first on Engineering@Microsoft.

Microsoft

Common annotated security keys

In April 2021, GitHub announced changes to their security token format that significantly enhanced security. The improvement leveraged two straightforward techniques: a fixed signature in the generated token and a checksum – both of which are highly effective in eliminating false positives (noise) and false negatives (missed findings). Microsoft also implements these techniques widely in […] The post Common annotated security keys appeared first on Engineering@Microsoft.

Microsoft

Managed DevOps Pools – The Origin Story

Useful angle: real-time and reliable systems via reliability

Learn about how Microsoft's 1ES organization developed an internal service called "1ES Hosted Pools" to manage Microsoft's diverse Engineering system infrastructure and how it helped make significant improvements to productivity, cost savings, and security. This solution will soon be available as a third-party offering named "Managed DevOps Pools". The post Managed DevOps Pools – The Origin Story appeared first on Engineering@Microsoft.

Microsoft

Developing with Accessibility in Mind at Microsoft

Celebrate Global Accessibility Awareness Day (GAAD) by taking actionable and easy steps to build accessibility into your development life-cycle! Learn how tools like Accessibility Insights & Visual Studio can help find accessibility issues in development. The post Developing with Accessibility in Mind at Microsoft appeared first on Engineering@Microsoft.

Microsoft

Copy-on-Write performance and debugging

Useful angle: systems performance via performance

This is a follow-up to our previous coverage of Dev Drive and copy-on-write (CoW) linking. See our previous articles from May 24, 2023, October 13, 2023, and November 2, 2023. Dev Drive was released in Windows 11 in October, 2023, and will be part of Windows Server 2025 this fall. Server 2025 and Windows 11 […] The post Copy-on-Write performance and debugging appeared first on Engineering@Microsoft.

Microsoft

How we built “Ask Learn”, the RAG-based knowledge service

My name is Bob Tabor and I’m a member of Microsoft’s Skilling organization. We create documentation and training content about Azure, developer tooling and languages, AI, Windows and much more hosted at Microsoft Learn. Our organization also develops and maintains the content publishing platform, the content hosting platform, the interactivity, and popular sites like Microsoft […] The post How we built “Ask Learn”, the RAG-based knowledge service appeared first on Engineering@Microsoft.

Microsoft

Enhancing reliability in Microsoft Fabric and Azure Synapse through load testing

Useful angle: systems performance via performance

Microsoft has employed Azure Load Testing to enhance the reliability of Microsoft Fabric and Azure Synapse, ensuring they can handle high loads. Azure Synapse integrates various data analytics technologies, while Microsoft Fabric offers a full enterprise analytics solution. Through rigorous daily and weekly load testing, involving complex scenarios and extensive data sizes, Microsoft aims to identify and rectify potential issues, ensuring optimal performance. This testing, integrated within their development pipelines, supports continuous improvement, leverages Azure's scalability, and utilizes Power BI for detailed reporting, ultimately enhancing service reliability and user experience. The post Enhancing reliability in Microsoft Fabric and Azure Synapse through load testing appeared first on Engineering@Microsoft.