TheCyberExpress

SRE And Security Engineering: Trends And Challenges


The convergence of SRE and Security Engineering is reshaping how organizations build, operate, and protect modern cloud environments. As infrastructure grows more complex and distributed, reliability, security, identity management, and observability are becoming increasingly interconnected disciplines rather than separate functions.

Advait Patel, Senior Site Reliability Engineer at Broadcom and author of DockSec, has witnessed this shift firsthand. With experience spanning cloud infrastructure, DevSecOps, and observability platforms such as Wavefront (Tanzu Observability), he has worked on systems processing more than 10 million data points per second while leading initiatives in IAM, cloud migration, and security engineering.

In this interview, Patel shares his insights on securing observability platforms at scale, managing identity across multi-cloud environments, balancing automation with human oversight, and the role AI is playing in the future of DevSecOps and incident response.

Advait Patel Breaks Down SRE and Security Engineering 

TCE: How are you seeing SRE and security engineering converge in modern cloud environments? 

Advait Patel: The short version is that the failure modes started overlapping and the org charts are catching up. A misconfigured IAM policy that takes down a service and a misconfigured IAM policy that exposes data are usually the same mistake. SREs already own the deployment pipeline, the observability stack, and the incident process, which happen to be the three places security has to live if it wants to be effective instead of decorative. 

What changed it for me was security as code. When I ran the zero-downtime migration of our observability platform from AWS to GCP, security could not be a review step bolted on at the end. It had to be expressed the same way reliability was, as policy in the pipeline, with the same testing and the same rollback story.  

That is the real convergence. Not security and SRE attending the same standup, but security becoming something you can measure and enforce the way you measure latency or error rate. We are not all the way there as an industry. Plenty of shops still treat security as a gate at the end. But the teams moving fastest have stopped pretending the two disciplines are separate. 

TCE: What are the biggest challenges in securing large-scale observability platforms handling high-volume data streams? 

Advait Patel: This one is close to home, since I spent a long time on a platform ingesting north of 10 million data points per second. A few things make it genuinely hard. 

First, telemetry is one of the most underrated attack surfaces in a company. Your metrics, traces, and logs describe your entire architecture. Get read access to that and you do not need to break into anything; the map is already drawn for you. And secrets leak into logs constantly. Someone logs a full request, the token rides along with it, and now your observability store is a credential store you never meant to build.

Second, at that volume you cannot inspect everything inline. Any control you add has to be cheap or it becomes the exact bottleneck you were hired to prevent. That single constraint rules out a lot of textbook advice. 

Third is tenant isolation. When many teams share one pipeline, one team seeing another team’s data is both a security incident and a trust failure at once. Getting that right without wrecking throughput was one of the harder problems in that migration. 
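To make the log-leak point above concrete, here is a minimal Python sketch of redacting credentials before a log line leaves the process. It is an illustration under stated assumptions, not the approach used on the platform Patel describes: the regular expressions and the names `SECRET_PATTERNS`, `redact`, and `RedactingFilter` are hypothetical, and real token formats would need their own patterns.

```python
import logging
import re

# Hypothetical patterns for illustration; tune these to the token formats you actually use.
SECRET_PATTERNS = [
    # "Authorization: Bearer <token>" headers
    re.compile(r"(?i)(authorization:\s*bearer\s+)[A-Za-z0-9._~+/=-]+"),
    # key=value or key: value pairs such as token=..., api_key=..., password=...
    re.compile(r"(?i)((?:api[_-]?key|token|password)\s*[=:]\s*)[^\s&,;]+"),
]


def redact(text: str) -> str:
    """Replace the secret part of each match, keeping the key so the line stays readable."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub(r"\1[REDACTED]", text)
    return text


class RedactingFilter(logging.Filter):
    """Logging filter that scrubs the fully formatted message of every record."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Format first so secrets passed as %-style arguments are scrubbed too.
        record.msg = redact(record.getMessage())
        record.args = None  # the message is already formatted
        return True  # never drop the record, only rewrite it
```

A filter like this could be attached with `handler.addFilter(RedactingFilter())`. Regex matching is relatively cheap per line, which matters given the volume constraint described above, but it only catches the patterns it knows about; it complements, rather than replaces, not logging full requests in the first place.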

TCE: How do you approach identity and access management (IAM/CIAM/WIAM) in multi-cloud architectures? 

Advait Patel: I have spent enough time here to have written a couple of books on identity in the cloud, and the honest summary is that multi-cloud IAM is hard mostly because the providers disagree with each other. AWS, GCP, and Azure each have a different mental model for what an identity even is and how permissions attach to it. The abstractions do not map cleanly, so anyone selling you one tidy policy language across all three is usually hiding the seams. 

The part I am most interested in right now is workload identity. For years, we secured machines the same way we secured people, with long-lived static credentials sitting in config files waiting to leak. That model is finally dying. Short-lived, attested identities through approaches like SPIFFE and workload identity federation are a much better answer, because the credential expires before an attacker can do much with it. 

For human and customer identity, the rules are simpler, but the stakes are higher. Kill static keys, federate to one source of truth, and treat access review as something continuous rather than an annual audit nobody reads. Entitlement creep is the quiet killer here. People accumulate access and almost never lose it. 
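The argument for short-lived credentials can be sketched in a few lines of Python. This is a toy HMAC-signed token with an expiry, not SPIFFE and not any cloud provider's workload identity federation; `SIGNING_KEY`, `issue_token`, and `verify_token` are hypothetical names, and the hard-coded key is for demonstration only.

```python
import hashlib
import hmac
import time
from typing import Optional

SIGNING_KEY = b"demo-only-key"  # hypothetical; a real issuer would never hard-code a key


def _sign(payload: str) -> str:
    return hmac.new(SIGNING_KEY, payload.encode(), hashlib.sha256).hexdigest()


def issue_token(workload: str, ttl_seconds: int, now: Optional[float] = None) -> str:
    """Issue a token of the form '<workload>|<expiry>|<signature>'."""
    issued = time.time() if now is None else now
    payload = f"{workload}|{int(issued + ttl_seconds)}"
    return f"{payload}|{_sign(payload)}"


def verify_token(token: str, now: Optional[float] = None) -> Optional[str]:
    """Return the workload name if the token is authentic and unexpired, else None."""
    current = time.time() if now is None else now
    try:
        workload, expires_raw, signature = token.rsplit("|", 2)
        expires_at = int(expires_raw)
    except ValueError:
        return None  # malformed token
    expected = _sign(f"{workload}|{expires_raw}")
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        return None  # tampered or signed with a different key
    if current >= expires_at:
        return None  # expired: a leaked copy is now useless
    return workload
```

With a five-minute TTL, a token copied out of a config file or log stops verifying after five minutes, which is the property the static-key model lacks. Real systems add attestation of the workload before issuance, which this sketch does not attempt.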

TCE: What role do you see AI playing in improving reliability and security operations (AIOps/DevSecOps)? 

Advait Patel: I will give you the unfashionable version. AI is genuinely good at one specific thing in security operations and oversold at most of the rest. 

The thing it is good at is the layer between detection and action. You run a scan, you get 200 findings, and historically, a human burns half a day working out which three actually matter for their system. AI is very good at that triage and at explaining a finding in the context of your specific setup. That is most of the real value, and it is the whole reason I built DockSec the way I did. 

Where it gets oversold is autonomous action in production and the idea that it replaces the analyst. It does not. The right pattern is AI sitting on top of deterministic signals, not in place of them. A coding assistant telling you a Dockerfile looks fine does not survive an auditor’s first question. You still need the scanner underneath and the human judgment on top. 

And there is a twist people forget. AI is also a new attack surface. Agentic systems can be manipulated through their own inputs in ways we are only starting to score properly, which is part of why I put time into AI-specific vulnerability scoring. We are adding capability and risk in the same motion. 

TCE: How can teams balance automation with human oversight in incident response? 

Advait Patel: My rule of thumb is to automate the reversible and the boring and keep humans on the irreversible and the ambiguous. 

Automation is excellent at the parts of incident response that are well understood and repetitive. Detect a known pattern, enrich it, page the right person, contain something you have contained a hundred times. That should all run at machine speed.

Where I get nervous is letting automation take actions with real blast radius on its own, because automation fails confidently and at scale. A human making a bad call breaks one thing. A bad automated remediation can take the whole fleet down before anyone has read the alert.

So I think of it as trust earned in increments. New automation runs in suggest mode first, where it only tells you what it would have done. Once it has been right enough times on low-risk actions, you let it act on those, and you keep the high-consequence decisions with a person. The piece people skip is the after. Humans own the retro and the learning. You do not automate understanding why it broke. 
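The "trust earned in increments" rule can be written down as a small decision function. The Python sketch below is one possible reading of it, not a description of any real incident tooling: the `Remediation` type, the `decide` function, and the `PROMOTION_THRESHOLD` value are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical: how many human-confirmed suggestions an action needs before it may run alone.
PROMOTION_THRESHOLD = 20


@dataclass
class Remediation:
    name: str
    reversible: bool            # can the action be rolled back cleanly?
    confirmed_correct: int = 0  # times a human agreed with the suggestion in suggest mode


def decide(action: Remediation) -> str:
    """Return how the automation may handle this action."""
    if not action.reversible:
        # Irreversible or high-consequence actions always stay with a person.
        return "escalate_to_human"
    if action.confirmed_correct < PROMOTION_THRESHOLD:
        # New automation only reports what it would have done.
        return "suggest_only"
    # Reversible and proven on past incidents: allowed to act at machine speed.
    return "auto_execute"
```

For example, `decide(Remediation("restart_pod", reversible=True, confirmed_correct=3))` returns `"suggest_only"`, while the same action with 25 confirmations returns `"auto_execute"`. An action marked irreversible is escalated regardless of its track record.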

TCE: What are the most important security practices for containerized environments today? 

Advait Patel: A few that matter more than the rest. 

Start small. Minimal base images and multi-stage builds do more for your posture than almost any tool you can buy, because you cannot be vulnerable to something that is not in your image. Most containers ship with a full operating system that they never touch. 

Do not run as root, and drop the capabilities you do not need. It is basic, and people still skip it. 

Care about provenance. Sign your images, generate an SBOM, and know where your base layers came from, because you inherit every vulnerability in them, whether you wrote that code or not. The supply chain is where the interesting attacks are now. 

But the practice I would push hardest is making your scanning actionable. A report with 200 CVEs that nobody can act on is security theater. The problem most teams actually have is not detection; it is prioritization and remediation. Coverage without a path to a fix just manufactures guilt. Closing that gap between found and fixed is what genuinely moves your risk down, and it is the problem I have spent the most time on.
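As a rough illustration of turning a long scan report into a short, fixable list, the Python sketch below ranks findings by severity, whether the affected package is actually used, and whether a fix exists. It is not DockSec's algorithm: the `Finding` fields, the score weights, and the placeholder identifiers are assumptions chosen only to show the idea.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Finding:
    cve_id: str           # placeholder identifiers in this example, not real CVEs
    cvss: float           # base severity score, 0.0 to 10.0
    package_in_use: bool  # is the vulnerable package actually exercised by the workload?
    fix_available: bool   # is there a patched version to upgrade to?


def score(finding: Finding) -> float:
    """Hypothetical weighting: reachability outranks raw severity, fixability breaks ties."""
    total = finding.cvss
    if finding.package_in_use:
        total += 5.0
    if finding.fix_available:
        total += 2.0
    return total


def prioritize(findings: List[Finding], top_n: int = 3) -> List[Finding]:
    """Return the top_n findings most worth acting on first."""
    return sorted(findings, key=score, reverse=True)[:top_n]
```

Under these weights, a medium-severity finding in a package the service actually uses, with a fix available, ranks above a critical finding in a package that is never exercised. The weights are arbitrary; the design choice worth keeping is that the output is a short ordered list with a remediation path, not a count.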

Conclusion

From securing observability platforms handling millions of data points per second to managing identity across multi-cloud environments, Advait Patel’s experience highlights the practical challenges facing today’s infrastructure teams. His views on automation, AI, incident response, and container security reinforce a common theme throughout the discussion: the growing overlap between SRE and Security Engineering.

As organizations continue to modernize their cloud environments, the ability to balance reliability, security, and operational efficiency will become increasingly important. For teams navigating that shift, Patel’s insights offer a grounded perspective on what it takes to build and secure systems at scale.


