Engineering Feed

A curated reading feed of engineering blogs I follow — from teams building real systems at scale.

209 articles
AWS

Secure Amazon Elastic VMware Service (Amazon EVS) with AWS Network Firewall

In this post, we demonstrate how to utilize AWS Network Firewall to secure an Amazon EVS environment, using a centralized inspection architecture across an EVS cluster, VPCs, on-premises data centers and the internet. We walk through the implementation steps to deploy this architecture using AWS Network Firewall and AWS Transit Gateway.

Spotify

Background Coding Agents: Context Engineering (Part 2)

In Part 2, we explore context engineering for background coding agents and what makes a good migration prompt. The post Background Coding Agents: Context Engineering (Part 2) appeared first on Spotify Engineering.

Meta

Zoomer: Powering AI Performance at Meta’s Scale Through Intelligent Debugging and Optimization

We’re introducing Zoomer, Meta’s comprehensive, automated debugging and optimization platform for AI. Zoomer works across all of our training and inference workloads at Meta and provides deep performance insights that enable energy savings, workflow acceleration, and efficiency gains in our AI infrastructure. Zoomer has delivered training time reductions and significant QPS improvements, making it the [...] The post Zoomer: Powering AI Performance at Meta’s Scale Through Intelligent Debugging and Optimization appeared first on Engineering at Meta.

Meta

Key Transparency Comes to Messenger

We’re excited to share another advancement in the security of your conversations on Messenger: the launch of key transparency verification for end-to-end encrypted chats. This new feature enables an additional level of assurance that only you — and the people you’re communicating with — can see or listen to what is sent, and that no [...]

Uber

Ceilometer: Uber’s Adaptive Benchmarking Framework

Dig into Ceilometer, Uber’s adaptive benchmarking framework for ensuring system performance and reliability at scale. Learn how it automates performance testing while providing production-like insights and continuous validation.

AWS

Building an AI gateway to Amazon Bedrock with Amazon API Gateway

In this post, we'll explore a reference architecture that helps enterprises govern their Amazon Bedrock implementations using Amazon API Gateway. This pattern enables key capabilities like authorization controls, usage quotas, and real-time response streaming. We'll examine the architecture, provide deployment steps, and discuss potential enhancements to help you implement AI governance at scale.

AWS

Architecting for AI excellence: AWS launches three Well-Architected Lenses at re:Invent 2025

At re:Invent 2025, we introduce one new lens and two significant updates to the AWS Well-Architected Lenses specifically focused on AI workloads: the Responsible AI Lens, the Machine Learning (ML) Lens, and the Generative AI Lens. Together, these lenses provide comprehensive guidance for organizations at different stages of their AI journey, whether you're just starting to experiment with machine learning or already deploying complex AI applications at scale.

AWS

Announcing the updated AWS Well-Architected Generative AI Lens

We are delighted to announce an update to the AWS Well-Architected Generative AI Lens. This update features several new sections of the Well-Architected Generative AI Lens, including new best practices, advanced scenario guidance, and improved preambles on responsible AI, data architecture, and agentic workflows.

Meta

Efficient Optimization With Ax, an Open Platform for Adaptive Experimentation

We’ve released Ax 1.0, an open-source platform that uses machine learning to automatically guide complex, resource-intensive experimentation. Ax is used at scale across Meta to improve AI models, tune production infrastructure, and accelerate advances in ML and even hardware design. Our accompanying paper, “Ax: A Platform for Adaptive Experimentation” explains Ax’s architecture, methodology, and how it [...]

AWS

Build priority-based message processing with Amazon MQ and AWS App Runner

In this post, we show you how to build a priority-based message processing system using Amazon MQ for priority queuing, Amazon DynamoDB for data persistence, and AWS App Runner for serverless compute. We demonstrate how to implement application-level delays that high-priority messages can bypass, create real-time UIs with WebSocket connections, and configure dual-layer retry mechanisms for maximum reliability.

Microsoft

Microspeak: Little-r

Harkening back to a very old mail program. The post Microspeak: Little-r appeared first on The Old New Thing.

Meta

Announcing the Completion of the Core 2Africa System: Building the Future of Connectivity Together

Connecting Africa and the World: We’re excited to share the completion of the core 2Africa infrastructure, the world’s longest open access subsea cable system. 2Africa is a landmark subsea cable system that sets a new standard for global connectivity. This project is the result of years of collaboration, innovation, and a shared vision to connect [...]

Meta

Enhancing HDR on Instagram for iOS With Dolby Vision

We’re sharing how we’ve enabled Dolby Vision and ambient viewing environment (amve) on the Instagram iOS app to enhance the video viewing experience. HDR videos created on iPhones contain unique Dolby Vision and amve metadata that we needed to support end-to-end. Instagram for iOS is now the first Meta app to support Dolby Vision video, [...]

Meta

Open Source Is Good for the Environment

Most people have heard of open-source software. But have you heard about open hardware? And did you know open source can have a positive impact on the environment? On this episode of the Meta Tech Podcast, Pascal Hartig sits down with Dharmesh and Lisa to talk about all things open hardware, and Meta’s biggest announcements [...]

AWS

Know before you go – AWS re:Invent 2025 guide to Well-Architected and Cloud Optimization sessions

Are you ready to maximize your Well-Architected and Cloud Optimization learning and networking time at re:Invent 2025? We have put together this comprehensive guide to help you plan your schedule and make the most of the Well-Architected and cloud optimization sessions available this year. These sessions will deliver the practical guidance your teams need to lead strategic cloud initiatives, design next-generation architectures, optimize costs, or secure AI-powered systems.

Spotify

Shuffle: Making Random Feel More Human

Shuffle has always been one of Spotify’s most-used features, and also one of the most misunderstood. For...

Meta

StyleX: A Styling Library for CSS at Scale

StyleX is Meta’s styling system for large-scale applications. It combines the ergonomics of CSS-in-JS with the performance of static CSS, generating collision-free atomic CSS while allowing for expressive, type-safe style authoring. StyleX was open sourced at the end of 2023 and has since become the standard styling system across Meta products like Facebook, Instagram, WhatsApp, [...]

Meta

Meta’s Generative Ads Model (GEM): The Central Brain Accelerating Ads Recommendation AI Innovation

We’re sharing details about Meta’s Generative Ads Recommendation Model (GEM), a new foundation model that delivers increased ad performance and advertiser ROI by enhancing other ads recommendation models’ ability to serve relevant ads. GEM’s novel architecture allows it to scale with an increasing number of parameters while consistently generating more precise predictions efficiently. GEM propagates [...]

Slack

Build better software to build software better

We manage the build pipeline that delivers Quip and Slack Canvas’s backend. A year ago, we were chasing exciting ideas to help engineers ship better code, faster. But we had one huge problem: builds took 60 minutes. With a build that slow, the whole pipeline gets less agile, and feedback doesn’t come to engineers until…

Uber

Building Zone Failure Resilience in Apache Pinot™ at Uber

By building zone failure resilience into Apache Pinot™, Uber strengthened reliability for real-time analytics, sped up release cycles, and created a foundation for future failure recovery. Now queries and ingestion stay strong, even when zones go dark.

Meta

Video Invisible Watermarking at Scale

At Meta, we use invisible watermarking for a variety of content provenance use cases on our platforms. Invisible watermarking serves a number of use cases, including detecting AI-generated videos, verifying who posted a video first, and identifying the source and tools used to create a video. We’re sharing how we overcame the challenges of scaling [...]

Uber

Raising the Bar on ML Model Deployment Safety

How do you safely ship thousands of ML models without slowing teams down? At Uber, we’ve built guardrails that catch issues early, prevent rollbacks, and raise the bar on reliability. Discover how safety became a measurable standard across our ML ecosystem.

Slack

Advancing Our Chef Infrastructure: Safety Without Disruption

Last year, I wrote a blog post titled Advancing Our Chef Infrastructure, where we explored the evolution of our Chef infrastructure over the years. We talked about the shift from a single Chef stack to a multi-stack model, and the challenges that came with it – from updating how we handle cookbook uploads to navigating…

Uber

Enabling Deep Model Explainability with Integrated Gradients at Uber

Uber’s ML platform Michelangelo now supports Integrated Gradients, enabling scalable, interpretable deep model explainability across TensorFlow™ and PyTorch™. Learn how this powers trust, debugging, and decision-making throughout the ML life cycle.

AWS

BASF Digital Farming builds a STAC-based solution on Amazon EKS

This post was co-written with Frederic Haase and Julian Blau with BASF Digital Farming GmbH. At xarvio – BASF Digital Farming, our mission is to empower farmers around the world with cutting-edge digital agronomic decision-making tools. Central to this mission is our crop optimization platform, xarvio FIELD MANAGER, which delivers actionable insights through a range […]

Slack

Deploy Safety: Reducing customer impact from change

It’s mid-2023 and we’ve identified some opportunities to improve our reliability. Fast forward to January 2025. Customer impact hours are reduced from the peak by 90% and continuing to trend downward. We’re a year and a half into the Deploy Safety Program at Slack, improving the way we deploy, uplifting our safety culture and continuing…

AWS

Modernization of real-time payment orchestration on AWS

The global real-time payments market is experiencing significant growth. According to Fortune Business Insights, the market was valued at USD 24.91 billion in 2024 and is projected to grow to USD 284.49 billion by 2032, with a CAGR of 35.4%. Similarly, Grand View Research reports that the global mobile payment market, valued at USD 88.50 […]

AWS

Build resilient generative AI agents

Generative AI agents in production environments demand resilience strategies that go beyond traditional software patterns. AI agents make autonomous decisions, consume substantial computational resources, and interact with external systems in unpredictable ways. These characteristics create failure modes that conventional resilience approaches might not address. This post presents a framework for AI agent resilience risk analysis […]

AWS

A scalable, elastic database and search solution for 1B+ vectors built on LanceDB and Amazon S3

In this post, we explore how Metagenomi built a scalable database and search solution for over 1 billion protein vectors using LanceDB and Amazon S3. The solution enables rapid enzyme discovery by transforming proteins into vector embeddings and implementing a serverless architecture that combines AWS Lambda, AWS Step Functions, and Amazon S3 for efficient nearest neighbor searches.

Uber

Adding Determinism and Safety to Uber IAM Policy Changes

Uber’s Policy Simulator tool enhances the safety and predictability of IAM policy changes by allowing policy authors to preview the impact of their modifications prior to deployment, ensuring deterministic outcomes after policy change deployment.

Airbnb

Migrating Airbnb’s JVM Monorepo to Bazel

By Jack Dai, Howard Ho, Loc Dinh, Stepan Goncharov, Ted Tenedorio, and Thomas Bao. At Airbnb, we recently completed migrating our largest repo, the JVM monorepo, to Bazel. This repo contains tens of millions of lines of Java, Kotlin, and Scala code that power the vast array of backend services and data pipelines behind airbnb.com. Migration in numbers […]

Slack

Building Slack’s Anomaly Event Response

As cyberattacks evolve to unprecedented levels of sophistication and speed, the time gap between breach detection and response has never been more critical. Traditional security approaches often operate reactively, identifying compromises only after damage has occurred. This delay grants attackers a tactical advantage, forcing security teams to focus on damage assessment and remediation rather than…

Uber

Controlling the Rollout of Large-Scale Monorepo Changes

Discover how Uber controls the blast radius of large-scale commits with cross-cutting service deployment orchestration. As Uber embraces fully automated continuous deployment, strong safety practices are more critical than ever.

AWS

Simplify multi-tenant encryption with a cost-conscious AWS KMS key strategy

In this post, we explore an efficient approach to managing encryption keys in a multi-tenant SaaS environment through centralization, addressing challenges like key proliferation, rising costs, and operational complexity across multiple AWS accounts and services. We demonstrate how implementing a centralized key management strategy using a single AWS KMS key per tenant can maintain security and compliance while reducing operational overhead as organizations scale.

AWS

How CommBank made their CommSec trading platform highly available and operationally resilient

In this post, we explore how CommSec, Australia's leading online broker, transitioned from a multicloud environment to AWS as their sole cloud provider while implementing Amazon Application Recovery Controller (ARC) zonal shift to maintain high availability and operational resilience. The consolidation resulted in significant benefits including 25% base capacity reduction, two times faster deployments, and improved failover capabilities through ARC zonal shift, enabling CommSec to continue serving millions of customers while meeting strict regulatory requirements.

AWS

How Karrot built a feature platform on AWS, Part 2: Feature ingestion

This two-part series shows how Karrot developed a new feature platform, which consists of three main components: feature serving, a stream ingestion pipeline, and a batch ingestion pipeline. This post covers the process of collecting features in real-time and batch ingestion into an online store, and the technical approaches for stable operation.

AWS

Deploy LLMs on Amazon EKS using vLLM Deep Learning Containers

In this post, we demonstrate how to deploy the DeepSeek-R1-Distill-Qwen-32B model using AWS DLCs for vLLMs on Amazon EKS, showcasing how these purpose-built containers simplify deployment of this powerful open source inference engine. This solution can help you solve the complex infrastructure challenges of deploying LLMs while maintaining performance and cost-efficiency.

AWS

Maximizing Business Value Through Strategic Cloud Optimization

As cloud spending continues to surge, organizations must focus on strategic cloud optimization to maximize business value. This blog post explores key insights from MIT Technology Review's publication on cloud optimization, highlighting the importance of viewing optimization as a continuous process that encompasses all six AWS Well-Architected pillars.

AWS

How Zapier runs isolated tasks on AWS Lambda and upgrades functions at scale

In this post, you’ll learn how Zapier has built their serverless architecture focusing on three key aspects: using Lambda functions to build isolated Zaps, operating over a hundred thousand Lambda functions through Zapier's control plane infrastructure, and enhancing security posture while reducing maintenance efforts by introducing automated function upgrades and cleanup workflows into their platform architecture.

AWS

Implement monitoring for Amazon EKS with managed services

In this post, we show you how to implement comprehensive monitoring for Amazon Elastic Kubernetes Service (Amazon EKS) workloads using AWS managed services. This solution demonstrates building an EKS platform that combines flexible compute options with enterprise-grade observability using AWS native services and OpenTelemetry.

Microsoft

Enhancing Code Quality at Scale with AI-Powered Code Reviews

Microsoft’s AI-powered code review assistant has transformed pull request workflows by automating routine checks, suggesting improvements, and enabling conversational Q&A, leading to faster PR completion, improved code quality, and enhanced developer onboarding. Its seamless integration and customizability have driven widespread adoption within Microsoft. The post Enhancing Code Quality at Scale with AI-Powered Code Reviews appeared first on Engineering@Microsoft.

Spotify

Incident Report: Spotify Outage on April 16, 2025

On April 16, Spotify experienced an outage that affected users worldwide. Here is what happened and what we...

Slack

Optimizing Our E2E Pipeline

In the world of DevOps and Developer Experience (DevXP), speed and efficiency can make a big difference in an engineer’s day-to-day tasks. Today, we’ll dive into how Slack’s DevXP team took some existing tools and used them to optimize an end-to-end (E2E) testing pipeline. This lowered build times and reduced redundant processes, saving both time…

Slack

How we built enterprise search to be secure and private

Many don’t know that “Slack” is in fact a backronym—it stands for “Searchable Log of all Communication and Knowledge”. And these days, it’s not just a searchable log: with Slack AI, Slack is now an intelligent log, leveraging the latest in generative AI to securely surface powerful, time-saving insights. We built Slack AI from the…

Microsoft

How Microsoft Engineers Build AI: Learn about scalable RAG-enabled AI Apps

For developers, the emphasis on building intelligence into apps has never been clearer. Over the next three years, 92% of companies plan on investing in AI to achieve business outcomes like enhancing productivity and delivering better customer service. At Microsoft, developers and engineers are pushing the boundaries of AI at scale, crafting applications that harness […]

Slack

Automated Accessibility Testing at Slack

At Slack, customer love is our first priority and accessibility is a core tenet of customer trust. We have our own Slack Accessibility Standards that product teams follow to guarantee their features are compliant with Web Content Accessibility Guidelines (WCAG). Our dedicated accessibility team supports developers in following these guidelines throughout the development process. We…

Slack

Migration Automation: Easing the Jenkins → GHA shift with help from AI

The past few months have been exciting times for Slack’s CI infrastructure. After years of developer frustration with Jenkins (everything from security issues to downtime to generally poor UX) internal pressure led us to move a majority of Slack’s CI jobs from Jenkins to GitHub Actions. My intern project at Slack this summer involved…

Microsoft

Dev Box Ready-To-Code Dev Box images template

Microsoft One Engineering System (1ES) team shares a sample for building Ready-To-Code Dev Box environments pre-configured with the necessary tools, repositories, and settings, ensuring consistency and reliability across teams.

Microsoft

Common annotated security keys

In April 2021, GitHub announced changes to their security token format that significantly enhanced security. The improvement leveraged two straightforward techniques: a fixed signature in the generated token and a checksum – both of which are highly effective in eliminating false positives (noise) and false negatives (missed findings). Microsoft also implements these techniques widely in […]

Microsoft

Managed DevOps Pools – The Origin Story

Learn about how Microsoft's 1ES organization developed an internal service called "1ES Hosted Pools" to manage Microsoft's diverse Engineering system infrastructure and how it helped make significant improvements to productivity, cost savings, and security. This solution will soon be available as a third-party offering named "Managed DevOps Pools".

Microsoft

Developing with Accessibility in Mind at Microsoft

Celebrate Global Accessibility Awareness Day (GAAD) by taking actionable and easy steps to build accessibility into your development life-cycle! Learn how tools like Accessibility Insights & Visual Studio can help find accessibility issues in development.

Microsoft

Copy-on-Write performance and debugging

This is a follow-up to our previous coverage of Dev Drive and copy-on-write (CoW) linking. See our previous articles from May 24, 2023, October 13, 2023, and November 2, 2023. Dev Drive was released in Windows 11 in October 2023 and will be part of Windows Server 2025 this fall. Server 2025 and Windows 11 […]

Microsoft

How we built “Ask Learn”, the RAG-based knowledge service

My name is Bob Tabor and I’m a member of Microsoft’s Skilling organization. We create documentation and training content about Azure, developer tooling and languages, AI, Windows and much more hosted at Microsoft Learn. Our organization also develops and maintains the content publishing platform, the content hosting platform, the interactivity, and popular sites like Microsoft […]

Microsoft

Enhancing reliability in Microsoft Fabric and Azure Synapse through load testing

Microsoft has employed Azure Load Testing to enhance the reliability of Microsoft Fabric and Azure Synapse, ensuring they can handle high loads. Azure Synapse integrates various data analytics technologies, while Microsoft Fabric offers a full enterprise analytics solution. Through rigorous daily and weekly load testing, involving complex scenarios and extensive data sizes, Microsoft aims to identify and rectify potential issues, ensuring optimal performance. This testing, integrated within their development pipelines, supports continuous improvement, leverages Azure's scalability, and utilizes Power BI for detailed reporting, ultimately enhancing service reliability and user experience.

Microsoft

Accessibility Insights now supports WCAG 2.2 AA

To celebrate the International Day for Persons with Disabilities on December 3rd we have some exciting new announcements for Accessibility Insights, Microsoft’s open-source suite of tools to help developers deliver accessible software! Technology plays a huge role in empowering everyone, including people with disabilities around the globe. Developers can now build with more accessibility in […]