A Data Context Engine Built for AWS

Bedrock Data gives security teams the context AWS-native tools and traditional DSPM solutions miss. Instead of stopping at “this bucket contains sensitive data,” Bedrock models the full AWS environment around that data: who can reach it, what infrastructure can touch it, where copies have spread, which vulnerabilities matter most, and what AI agents could expose.
June 9, 2026 | 11 min read
Pranava Adduri

CTO, Co-founder

Every DSPM can tell you that a bucket contains Social Security numbers. That is visibility, but visibility alone is not security. Security lives in the context around each finding: who can reach the data, where it came from and where it went, what vulnerable infrastructure can touch it, which exposure would hurt most, what policy says about it and what an AI agent could surface in a response.

Context cannot be bolted on through a connector, because every one of those questions requires modeling the platform itself: its policy evaluation logic, its storage layouts, its enforcement primitives, its AI services. In AWS, the answer to "who can reach this bucket" lives in the interaction of identity policies, resource policies, SCPs, permission boundaries and trust relationships, far beyond what any single document records.

Bedrock Data models AWS at that depth.

The foundation: complete, intelligent visibility

Workloads on AWS spread data at scale across DynamoDB, DocumentDB, RDS and Redshift, across unmanaged databases teams stand up on EC2 and across S3 and EFS. Incumbent approaches such as Amazon Macie fall short in three ways: their data source integrations are limited (Macie covers S3 only), they do not scale cost-effectively to multi-petabyte environments and they struggle to understand nuanced enterprise data.

Bedrock Data scans structured, semi-structured and unstructured stores across most AWS data workloads. It reads managed relational databases through snapshot side-scanning against a temporary clone, so production never feels it, and reads other stores directly through native read-only APIs that self-throttle to preserve quality of service. Everything runs inside the customer's environment through the Bedrock Data Outpost: Bedrock Data's SaaS control plane never sees customer data.

At scale, data sources such as data lakes and observability logs are highly partitioned. Such formats repeat the same structure across millions of objects, and reading every byte of a petabyte-scale lake mostly re-confirms patterns the scanner has already seen. Bedrock Data's patented Adaptive Sampling clusters partitioned objects by folder structure, file structure and data type into one logical dataset, reducing the scan to a representative set of files and cutting scanning costs by a factor of 10 to 100 or more against brute-force full reads. For scenarios requiring deeper scans, Bedrock Data also offers tunable scan depth, up to and including a scan of the full data source.
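The idea of grouping partitioned objects before sampling can be illustrated with a minimal Python sketch. It is not Bedrock Data's Adaptive Sampling, whose details are not described here. The function names `key_pattern` and `sample_by_pattern`, and the choice to collapse every digit run into a placeholder, are assumptions made for illustration.

```python
import random
import re
from collections import defaultdict


def key_pattern(key: str) -> str:
    """Collapse partition values so keys that differ only by date or id match."""
    parts = key.split("/")
    folders, filename = parts[:-1], parts[-1]
    # Assumption: any run of digits in a folder name is a partition value.
    normalized = [re.sub(r"\d+", "#", folder) for folder in folders]
    extension = filename.rsplit(".", 1)[-1] if "." in filename else ""
    return "/".join(normalized) + "/*." + extension


def sample_by_pattern(keys, per_group=2, seed=0):
    """Group object keys by pattern, then pick a few representatives per group."""
    groups = defaultdict(list)
    for key in keys:
        groups[key_pattern(key)].append(key)
    rng = random.Random(seed)
    return {
        pattern: rng.sample(sorted(members), min(per_group, len(members)))
        for pattern, members in groups.items()
    }


# 300 partitioned log objects plus one standalone export.
keys = [
    f"logs/dt=2024-01-{day:02d}/part-{i}.parquet"
    for day in range(1, 31)
    for i in range(10)
] + ["exports/customers.csv"]

picked = sample_by_pattern(keys)
for pattern, sample in picked.items():
    print(pattern, "->", len(sample), "file(s) to scan")
```

In this toy input, 301 objects reduce to two logical datasets and three files to read. A real system would also compare file structure and data types before treating a group as one dataset.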

After discovering data, understanding it is the next challenge, and this is where rules-based classification falters. Macie matches content against managed data identifiers plus custom regex an administrator writes by hand. Because the matching knows nothing about why the data exists, it fails with false positives and false negatives. For example, a rule for account numbers flags internal cost-center codes with the same urgency as customer financial accounts, and proprietary identifiers that match no published pattern never surface at all. Writing more rules sharpens the patterns without ever adding the context.
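Both failure modes are easy to reproduce with a toy rule. The sketch below uses a hypothetical hand-written regex and made-up values. It is not one of Macie's managed data identifiers.

```python
import re

# Hypothetical custom rule: treat any standalone 10-digit run as an account number.
ACCOUNT_RULE = re.compile(r"\b\d{10}\b")

# Made-up sample values for illustration only.
records = {
    "customer_account": "4481203377",   # a customer financial account
    "cost_center_code": "1000420017",   # internal code with the same shape
    "proprietary_id": "ACME-7Q2-XK91",  # in-house identifier with no rule
}

flagged = {name for name, value in records.items() if ACCOUNT_RULE.search(value)}
missed = set(records) - flagged

print("flagged:", sorted(flagged))
print("missed:", sorted(missed))
```

The rule flags the cost-center code alongside the real account (a false positive) and never sees the proprietary identifier (a false negative). Tightening the pattern changes which strings match, but it cannot tell the rule what either dataset is for.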

Bedrock Data's topic modeling infers each dataset's data domain and business purpose, so a finding reads as data of this type, used for this purpose, likely owned by this team. The same mechanism resolves both failures at once. A cost-center code stops resembling a customer account once the model understands the dataset around it, and inference surfaces the proprietary identifiers and internal schemas no rules-based classifier had a rule for.

Who can reach it?

Risk lives at the intersection of data and reach, and in AWS the question of reach has no single answer document. Effective access is the net of granting layers and capping layers evaluated together: identity-based and resource-based policies grant; SCPs, permission boundaries and session policies cap and never grant. A bucket policy can read as a clean grant of s3:GetObject while an SCP in the account denies it or a permission boundary on the role omits it, and the effective answer is deny. The inverse trap exists too, where a resource policy granting directly to a user produces an allow the user's identity policy never mentions. Cross-account access doubles the work: AWS runs separate evaluations in the calling account and the resource account and allows only when both agree, with the role's trust policy, the caller's sts:AssumeRole permission and the SCPs of each account all in play, and every hop in an AssumeRole chain multiplies the checks.
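The grant-and-cap model described above can be written down as a short Python sketch. It is a deliberate simplification: each layer is reduced to a single boolean, the names `Evaluation` and `effective_allow` are hypothetical, and real AWS evaluation has further nuances (for example, how resource-policy grants to specific principals interact with boundaries and sessions) that are not modeled here.

```python
from dataclasses import dataclass


@dataclass
class Evaluation:
    """One request, with each policy layer reduced to a boolean (a simplification)."""
    identity_allows: bool = False   # identity-based policy grants the action
    resource_allows: bool = False   # resource-based policy grants the action
    scp_allows: bool = True         # SCPs permit it (caps, never grants)
    boundary_allows: bool = True    # permission boundary permits it, or none attached
    session_allows: bool = True     # session policy permits it, or none passed
    explicit_deny: bool = False     # any applicable policy explicitly denies


def effective_allow(e: Evaluation, cross_account: bool = False) -> bool:
    if e.explicit_deny:
        return False
    # Capping layers: every one must permit the action.
    if not (e.scp_allows and e.boundary_allows and e.session_allows):
        return False
    if cross_account:
        # Both the calling account and the resource account must agree.
        return e.identity_allows and e.resource_allows
    # Same account: a grant from either granting layer is enough.
    return e.identity_allows or e.resource_allows


# A clean bucket-policy grant that an SCP caps: effective answer is deny.
print(effective_allow(Evaluation(resource_allows=True, scp_allows=False)))
# The inverse trap: the resource policy grants what the identity policy never mentions.
print(effective_allow(Evaluation(resource_allows=True)))
# Cross-account with only one side granting.
print(effective_allow(Evaluation(resource_allows=True), cross_account=True))
```

Even in this reduced form, no single field determines the result, which is the reason a document-at-a-time reading gives wrong answers in both directions.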

This is why a connector reading policy documents one at a time cannot answer the question. Bedrock Data deploys across every account in the AWS Organization through CloudFormation StackSets and resolves effective access by evaluating those layers together, including cross-account AssumeRole chains, then attributes each path to the specific user or role at its end.

IAM Access Analyzer covers part of this ground well: automated reasoning over resource policies finds externally shared resources, unused-access analysis flags dormant roles and keys, while policy checks validate new policies. But its model stops at the resource boundary. It can prove a principal reaches a bucket while carrying no sensitivity layer that says whether the bucket holds payroll records or build logs, and a security team cannot prioritize access findings without sensitivity. Bedrock Data supplies the layer Access Analyzer reasons without.

For least privilege, access activity tracking on S3 closes the loop by showing who uses the access they hold; access that sits unexercised is access a team can prune. One caution: break-glass and disaster-recovery roles are designed to sit unused, so activity-based pruning warrants a review step before any revoke. And the map is only as complete as the deployment; an account outside the StackSet is a blind spot, which is why deployment spans the Organization.

Where did it come from, and where did it go?

A point-in-time inventory records where data sits today while saying nothing about how it got there. Sensitive data is copied, transformed and re-exported across services constantly, which is how a single customer identifier can end up in four places.

Bedrock Data finds similar data across the environment, inside AWS and beyond it, and uses that similarity to reconstruct lineage, so a team can trace a dataset back to its origin and forward to every copy, including the unexpected fourth copy in a store nobody flagged. Similarity is the mechanism's strength and its limit: exports and direct copies match strongly, while heavily transformed or aggregated derivatives carry a weaker signal, so lineage reads as a map of likely flows for a team to confirm.
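Why direct copies match strongly and aggregates match weakly can be seen with a stand-in similarity measure. The sketch below uses Jaccard similarity over sets of values. That choice is an assumption for illustration, since the article does not say which similarity measure Bedrock Data uses, and the sample values are made up.

```python
def jaccard(a, b) -> float:
    """Share of distinct values two datasets have in common (0.0 to 1.0)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Made-up identifiers standing in for the values of a column.
source = {"c1", "c2", "c3", "c4"}
direct_export = {"c1", "c2", "c3", "c4"}   # an unmodified copy
aggregate = {"c1", "total"}                # a heavily transformed derivative

print(jaccard(source, direct_export))
print(jaccard(source, aggregate))
```

The export scores 1.0 and the aggregate scores 0.2, so a threshold that catches copies reliably will treat derivatives as candidates for a person to confirm.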

What threatens it?

AWS environments produce more vulnerability findings than any team can fix. Amazon Inspector scans EC2 instances, ECR container images and Lambda functions for CVEs and network exposure, and its risk score adjusts CVSS for reachability and exploitability. What that score lacks is data sensitivity, so two functions with identical ratings can carry very different business risk depending on what each can reach.

Bedrock Data maps each service (EC2, ECS, Lambda and the rest) to the data it can access, then joins Inspector's findings to that map, producing a remediation queue ranked by data impact. A medium-severity flaw on a Lambda function with a path to a sensitive datastore outranks a high-severity flaw on infrastructure that touches nothing important. The queue inherits Inspector's coverage, so resources Inspector does not scan will not appear in it, and its accuracy rests on the access map and classification beneath it, one more reason the deployment spans the full Organization.
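The reordering described above amounts to a join followed by a sort. The following Python sketch assumes a hypothetical finding shape and a hypothetical 0 to 10 sensitivity value per resource. It does not reproduce Inspector's finding schema or Bedrock Data's actual ranking.

```python
SEVERITY_RANK = {"LOW": 1, "MEDIUM": 2, "HIGH": 3, "CRITICAL": 4}

# Hypothetical findings, reduced to the two fields the sketch needs.
findings = [
    {"resource": "ec2:build-runner", "severity": "HIGH"},
    {"resource": "lambda:orders-export", "severity": "MEDIUM"},
]

# Hypothetical access map: highest sensitivity of any data each resource can reach.
reachable_sensitivity = {"lambda:orders-export": 9, "ec2:build-runner": 0}


def rank_by_data_impact(findings, reach):
    """Sort by the data behind the resource first, then by severity."""
    return sorted(
        findings,
        key=lambda f: (reach.get(f["resource"], 0), SEVERITY_RANK[f["severity"]]),
        reverse=True,
    )


queue = rank_by_data_impact(findings, reachable_sensitivity)
for finding in queue:
    print(finding["resource"], finding["severity"])
```

A resource missing from the access map defaults to zero here, which mirrors the caveat in the text: the queue is only as good as the map and the scanner coverage beneath it.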

Which exposure would hurt most?

Each of the preceding questions produces findings; the synthesis question is where limited remediation time goes first. Bedrock Data assigns every datastore and dataset an Impact Score built from the volume of sensitive data present and the sensitivity of the types found, with sensitivity weighting drawn from the same topic modeling that classified the data in the first place.

The easy case is easy anywhere: ten thousand customer records outrank a single test file. The case that breaks naive scoring is a small store of crown-jewel data (credentials, deal documents) sitting beside an enormous store of low-grade marketing contacts. Business context is what lets the small store win that comparison, because the score reflects what the data is for. The same scoring applies to identities: each user and role receives an Impact Score tracking the volume and sensitivity of data it can reach, so the identities carrying the largest breach liability surface first.
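One way a score can let a small crown-jewel store outrank a huge low-grade one is to weight by sensitivity and dampen raw volume. The formula and weights below are assumptions chosen to show the effect. The article does not publish how the Impact Score is calculated.

```python
import math

# Hypothetical sensitivity weights per data type.
WEIGHTS = {
    "credentials": 10.0,
    "customer_record": 5.0,
    "marketing_contact": 1.0,
    "test_data": 0.1,
}


def impact_score(type_counts: dict) -> float:
    """Sum of sensitivity weight times log-dampened volume for each data type."""
    return sum(
        WEIGHTS.get(data_type, 0.0) * math.log10(1 + count)
        for data_type, count in type_counts.items()
    )


crown_jewels = {"credentials": 200}
marketing = {"marketing_contact": 5_000_000}

print(round(impact_score(crown_jewels), 2))
print(round(impact_score(marketing), 2))
```

With these weights, 200 credentials outscore five million marketing contacts, while ten thousand customer records still outscore a single test file. The same function could be summed over every store an identity can reach to rank users and roles.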

A score remains a model, so it informs triage before it drives automation. Scores can also enrich a SIEM or ticketing system so each alert arrives pre-ranked by the sensitivity of the data it touches.

From context to action: policy and enforcement

Context earns its keep when it changes what the platform allows. Bedrock Data ships detections for common anti-patterns (sensitive data in non-production environments, AI agents positioned to expose sensitive information to models) and lets organizations write their own policies in plain language: no SSNs reachable by user accounts; sensitive data belongs only in this bucket. Each policy becomes a live detection that alerts on violation. A translated policy is still a detection rule, so the sound practice is to run it in alert-only mode, watch the volume and tune it before wiring it to any response.

Enforcement itself runs through AWS-native machinery. Bedrock Data pushes tags onto datastores recording what they hold and how sensitive it is, and those tags plug directly into IAM through attribute-based access control: a condition on aws:ResourceTag can restrict S3 buckets tagged as containing PII to a defined set of service accounts, with ABAC switched on per bucket for S3 general-purpose buckets. Bedrock Data does the discovery and classification; IAM itself performs the deny.
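A tag-conditioned deny of the kind described might look like the policy built below. The tag key, tag value, account ID and role pattern are placeholders, and the exact way bucket tags are evaluated for object actions should be checked against current AWS documentation before use. The condition keys `aws:ResourceTag` and `aws:PrincipalArn` are standard IAM global condition keys.

```python
import json

TAG_KEY = "data-classification"  # hypothetical tag key pushed onto datastores
ALLOWED_PRINCIPALS = [
    "arn:aws:iam::111122223333:role/pii-service-*",  # placeholder role pattern
]

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyPiiBucketsExceptServiceRoles",
            "Effect": "Deny",
            "Action": "s3:GetObject",
            "Resource": "*",
            "Condition": {
                # Both conditions must hold for the deny to apply.
                "StringEquals": {f"aws:ResourceTag/{TAG_KEY}": "pii"},
                "ArnNotLike": {"aws:PrincipalArn": ALLOWED_PRINCIPALS},
            },
        }
    ],
}

print(json.dumps(policy, indent=2))
```

Writing the rule as a deny with an exception list means a missing tag fails open, so it is worth pairing with a detection for untagged stores.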

Tag-based enforcement carries two sharp edges anyone proposing it should name. First, a principal that can write tags can rewrite its own access: whoever holds s3:PutBucketTagging or ec2:CreateTags alongside a tag-conditioned grant can re-tag a resource until the condition matches. AWS's recommended mitigations apply: SCPs that deny changes to authorization tags except by designated administrators, plus separation of duties so no principal holds tagging permissions and data permissions on the same resources. Second, a misclassification driving an automated deny can lock a production pipeline out of its own data. Tags should drive alerts first and automated denies only after a team has reviewed the store's classification, beginning with non-production resources where the blast radius of a mistake is a ticket instead of an outage.
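The first mitigation can be sketched as an SCP that reserves tagging for a designated administrator role. The role ARN is a placeholder and the sketch is intentionally coarse: it blocks all tag changes by other principals, whereas a production policy would likely narrow the deny to the authorization tag keys where the service supports that.

```python
import json

TAG_ADMINS = [
    "arn:aws:iam::*:role/tag-administrator",  # placeholder administrator role
]

scp = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ProtectAuthorizationTags",
            "Effect": "Deny",
            # Tagging actions named in the text, plus EC2 tag deletion.
            "Action": ["s3:PutBucketTagging", "ec2:CreateTags", "ec2:DeleteTags"],
            "Resource": "*",
            "Condition": {"ArnNotLike": {"aws:PrincipalArn": TAG_ADMINS}},
        }
    ],
}

print(json.dumps(scp, indent=2))
```

An SCP like this only caps permissions. Separation of duties still has to be checked in the identity policies themselves.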

What could AI combine and reveal?

This is the question where every layer above gets used at once. Enterprises building AI on AWS build it on Amazon Bedrock, and an agent's risk is the intersection of what it can reach and how sensitive that reach is. The reach is layered: an Amazon Bedrock agent acts through an execution role evaluated by standard IAM logic, invokes action-group Lambda functions that carry execution roles of their own and queries knowledge bases fed from sources like S3, Confluence, SharePoint, Salesforce or a web crawler.

Knowledge bases deserve particular care because they ingest at sync time: the connector crawls and embeds source content into the vector store, and from then on anyone with bedrock:Retrieve permission can pull it, regardless of the per-document permissions the source system enforced. A SharePoint library restricted to a handful of executives, once synced, answers to every principal the knowledge base answers to.

Bedrock Data enumerates the agents created in Amazon Bedrock, resolves each one's effective reach across execution roles, action-group Lambdas and knowledge bases (the same effective-access resolution that answered "who can reach it") and shows the sensitive data behind that reach, using the same classification that built the foundation. What it surfaces is precisely the data the agent could surface in a response.

It then evaluates the Amazon Bedrock Guardrails configured to constrain those responses. Guardrails offer content filters, denied topics, word filters, sensitive-information filters with a fixed list of built-in PII types plus custom regex, contextual grounding checks and automated reasoning checks. Three properties make a coverage review worth running. Guardrails apply per call or per agent, so any invocation path or new agent left unwired is unprotected. The built-in PII list is fixed, so proprietary and locale-specific identifiers go undetected until someone writes a custom rule. And the PII filter reads text output only, leaving sensitive values passed as tool-call parameters unfiltered. Bedrock Data performs static analysis of agent and knowledge-base configuration together with the associated IAM roles, compares the sensitive data types each agent can reach against the filters its guardrail defines, flags types with no corresponding filter and suggests revisions.
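The comparison at the end of that analysis is, at its core, a set difference per agent. The Python sketch below uses hypothetical agent names and illustrative type labels. It shows the shape of the check without reading any real Bedrock configuration.

```python
# Hypothetical result of reach analysis: sensitive types each agent can reach.
reachable_types = {
    "agent-support": {"EMAIL", "US_SOCIAL_SECURITY_NUMBER", "INTERNAL_CUSTOMER_ID"},
    "agent-unwired": {"EMAIL"},
}

# Hypothetical guardrail configuration: types each agent's guardrail filters.
# An agent with no guardrail attached is simply absent.
guardrail_filters = {
    "agent-support": {"EMAIL", "US_SOCIAL_SECURITY_NUMBER"},
}


def coverage_gaps(reachable, filters):
    """Return, per agent, the reachable sensitive types with no matching filter."""
    gaps = {}
    for agent, types in reachable.items():
        uncovered = types - filters.get(agent, set())
        if uncovered:
            gaps[agent] = uncovered
    return gaps


gaps = coverage_gaps(reachable_types, guardrail_filters)
for agent, uncovered in sorted(gaps.items()):
    print(agent, "is missing filters for", sorted(uncovered))
```

The proprietary identifier shows up as a gap because no built-in type covers it, and the agent without a guardrail shows every reachable type as uncovered.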

Static analysis describes configuration coverage, while the filters themselves remain probabilistic at inference time, so a clean report is the start of assurance and a red one is a finding to fix today.

One system, modeled on the platform

The agent analysis is the proof of the thesis: it only works because classification, effective access, Impact Scores and policy evaluation already exist and feed each other. That compounding runs through everything above. Classification feeds Impact Scores, scores prioritize entitlement analysis, entitlements drive least-privilege pruning, lineage maps the copies, Inspector findings get ranked by the data behind them and tags turn policy into IAM enforcement.

Each layer answers a question that lives inside AWS itself, in the SCPs, trust policies, partition layouts and guardrail configurations no connector contains. Classification tells you where the data is. Modeling the platform tells you whether it is safe.
