This is the second part of my CIEM series. In the first part, we covered the least-privilege principle and found out that one important piece was missing: How can we measure the probability of an IAM Role or IAM User being a target of an attack?
This article will analyze different types of Tasks and their characteristics. The knowledge will be mapped to a prioritized list of attack vectors and the expected blast radius. This helps us to classify IAM Roles or IAM User candidates for an attack. Before callers can be classified there is a need to understand more about what we can do with the AWS API:
Having a look at AWS Actions
Typically, AWS actions determine what we can do when we call any API endpoint provided by AWS. In the beginning, I had a very hard time understanding the essentials of AWS actions. In my humble opinion, the way AWS writes policies is often misleading when you try to understand the core concept. I have tried to visualize my understanding of AWS actions and hope this also helps you comprehend the big picture:
The picture shows variable properties at the top and fixed properties at the bottom. As you can see, resources are instances of a given resource type. Each resource must be placed inside exactly one AWS account and one AWS region. What we know as supported actions for a resource are typically the actions inherited from the resource type. However, the implementation of AWS may differ from this theoretical abstraction. List actions aren't always inherited; they are often bound to the resource type itself. This means the list action must be specified against the resource type rather than against an instance of a resource. Some implementations also ask us to map the list actions against a "*" resource.
For each action, AWS provides a classification (access level) with the values List, Read, Write, Tagging, and Permissions management. I typically extend this basic classification by further splitting the Read and Write types into data plane and control plane activities. If you ask yourself why I am doing this - be patient, you will need it later in this post :). By data plane activities I mean every task that is typically needed at runtime, for example InvokeFunction, PutObject, GetObject, StartInstances, and StopInstances. Control plane activities, on the other hand, are all about the lifecycle and configuration of our objects. This includes create, configure, and delete actions such as CreateFunction, PutLifecyclePolicy, and UpdateTable.
Each action results in an AWS event, which is normally recorded in AWS CloudTrail. Events are categorized into data events and management events, and only the latter are recorded in AWS CloudTrail by default.
Data plane and control plane are not the same as data events and management events. Data events are typically high-volume calls that aren't recorded in AWS CloudTrail by default.
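To make the extended classification more tangible, here is a minimal Python sketch. It only covers the example actions named above; the service prefixes (for example `ecr:` for PutLifecyclePolicy), the names `Plane`, `classify`, and `is_data_event`, and the hand-written sets are my assumptions for illustration and not an official AWS classification.

```python
from enum import Enum
from typing import Optional


class Plane(Enum):
    DATA = "data plane"
    CONTROL = "control plane"


# Hand-maintained illustration built from the examples in the text.
# Service prefixes are assumptions; this is not an official AWS dataset.
DATA_PLANE = {
    "lambda:InvokeFunction",
    "s3:PutObject",
    "s3:GetObject",
    "ec2:StartInstances",
    "ec2:StopInstances",
}
CONTROL_PLANE = {
    "lambda:CreateFunction",
    "ecr:PutLifecyclePolicy",
    "dynamodb:UpdateTable",
}

# Actions from the sets above that CloudTrail treats as data events,
# which are not recorded unless explicitly enabled.
DATA_EVENTS = {"lambda:InvokeFunction", "s3:PutObject", "s3:GetObject"}


def classify(action: str) -> Optional[Plane]:
    """Return the plane of a known example action, or None if unknown."""
    if action in DATA_PLANE:
        return Plane.DATA
    if action in CONTROL_PLANE:
        return Plane.CONTROL
    return None


def is_data_event(action: str) -> bool:
    """True if the example action is logged as a CloudTrail data event."""
    return action in DATA_EVENTS


if __name__ == "__main__":
    for name in ("s3:GetObject", "ec2:StartInstances", "dynamodb:UpdateTable"):
        print(name, classify(name), "data event" if is_data_event(name) else "management event")
```

Note how `ec2:StartInstances` ends up in the data plane in my classification while CloudTrail still records it as a management event. This is exactly why the two categorizations should not be mixed up.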
In the previous post, I also talked about conditions in policy statements. The available condition keys are also inherited from global conditions, a resource type, or directly mapped to a given action. A simplified data model of an action could look like this (don't analyze the diagram in detail; I just want to show the most important objects and links and didn't model the real implementation):
Please also note that a resource type always supports a specific Amazon Resource Name (ARN) format. The big picture above shows how an ARN is structured and shows some example formats.
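The general ARN layout is `arn:partition:service:region:account-id:resource`. As a small sketch of how such a string can be split into its parts, consider the following Python helper. The names `Arn` and `parse_arn` and the example ARNs are hypothetical and only serve as an illustration.

```python
from typing import NamedTuple


class Arn(NamedTuple):
    partition: str
    service: str
    region: str      # empty for global services such as IAM
    account_id: str  # empty for some resources such as S3 buckets
    resource: str    # format depends on the resource type


def parse_arn(arn: str) -> Arn:
    """Split an ARN into its six colon-separated parts."""
    # Limit the split so that colons inside the resource part survive.
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"not a valid ARN: {arn!r}")
    return Arn(*parts[1:])


if __name__ == "__main__":
    print(parse_arn("arn:aws:lambda:eu-central-1:123456789012:function:demo"))
    print(parse_arn("arn:aws:iam::123456789012:role/demo-role"))
    print(parse_arn("arn:aws:s3:::demo-bucket/some/key"))
```

The resource part is deliberately kept as one string, because its inner format (`function:demo`, `role/demo-role`, `bucket/key`) differs per resource type.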
During my (still short) career as a Cloud Engineer, I have already faced a lot of challenges. Let me share some relevant pieces of information with you:
The level of integration varies between services: Whereas DynamoDB, Aurora, or S3 are truly native AWS services, we also observe the existence of "wrapper" services. An RDS service for a MySQL database will never reach the feature set and compatibility of an AWS-native service. This has an impact on the blast radius and least privilege. Some specific edge cases may be configurable via IAM policies only if the service supports a certain level of integration.
The implementation and behavior of services vary: Let's take the create action as an example. Some services treat the create action as a resource-bound action, and other services treat it as an action bound to a resource type. This small difference results in different behavior if you are using ABAC, for example. If the create action doesn't support a resource tag, you need to specify an additional statement to cover this action.
The documentation of dependent actions is not 100% complete: Typically the AWS documentation only refers to PassRole actions, and even there it is not consistent. Two examples: The CreateStack (CloudFormation) action may have an execution role defined, which needs the PassRole action - the documentation as of today shows no such dependency. The StartQueryExecution (Athena) call depends on the workgroup used and may depend on a Glue table and S3 object-level access, including KMS actions to decrypt objects. You could argue that the call is asynchronous and the job does start. However, you will end up with an unsuccessful query if you do not grant the dependent actions.
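The ABAC point from the second item can be sketched as a policy with two statements: one that matches existing resources by their resource tag, and an additional one that covers the create action via the tags sent in the request. The Python snippet below only builds the JSON document. The tag key `team`, the function name `abac_policy`, and the choice of Secrets Manager actions are my assumptions for illustration; which condition keys an action really supports must be checked in the service documentation.

```python
import json


def abac_policy(tag_key: str) -> dict:
    """Build an illustrative ABAC policy for a hypothetical tag key."""
    # Policy variable that resolves to the caller's own tag value.
    principal_tag = f"${{aws:PrincipalTag/{tag_key}}}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Existing resources: compare the resource tag with the caller's tag.
                "Sid": "AccessExistingByResourceTag",
                "Effect": "Allow",
                "Action": [
                    "secretsmanager:GetSecretValue",
                    "secretsmanager:DescribeSecret",
                ],
                "Resource": "*",
                "Condition": {
                    "StringEquals": {f"aws:ResourceTag/{tag_key}": principal_tag}
                },
            },
            {
                # Create: the resource does not exist yet, so there is no
                # resource tag to match. Compare the tag sent in the request.
                # TagResource is included on the assumption that tagging at
                # creation needs it.
                "Sid": "CreateByRequestTag",
                "Effect": "Allow",
                "Action": [
                    "secretsmanager:CreateSecret",
                    "secretsmanager:TagResource",
                ],
                "Resource": "*",
                "Condition": {
                    "StringEquals": {f"aws:RequestTag/{tag_key}": principal_tag}
                },
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(abac_policy("team"), indent=2))
```

Treat this as a starting point for a discussion rather than a production policy. A real ABAC setup usually also restricts which tag keys may be set or changed.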
With this understanding, you are ready to continue to the next element in our supply chain: The internals of a task.
The Internals of a Task and the impact of infrastructure choice
The type and scope of the task influence the probability of an attack. An attacker will always try to find the easiest task to hijack with the biggest blast radius. So to understand which IAM Principal may be a target of an attack we can try to analyze the different types of tasks and their characteristics. We will not explore the task itself, but the impact of the underlying hosting infrastructure/caller on the task. The following graph shows some typical tasks and tries to visualize the difference between them. Even though the Actions do not change, we can influence the scope and visibility of a task by choosing the right technology/caller type.
"Vendor lock-in" and "Kubernetes" users should take a close look at this section! The following part shows why cloud-native deployments must be considered in a solid architecture.
Each graph shows typical hosting infrastructure/callers, the Task itself which can be further split into subtasks, and a timeline with typical API Actions for the given task.
The examples above are meant to be read as average use cases. Of course, there are always exceptions and other valid use cases. The goal is to capture what the majority of use cases look like. Also, the attack vectors are not complete and should just give us some understanding of the impact of an attack and the ease of exploiting vulnerabilities.
Let me try to summarize the characteristics of each example use case: As you can see each type of infrastructure or caller has different properties. We need to take all of those properties into consideration when we think about the probability of an attack. The following tables show my personal view on the most important metrics:
Simple Task (Lambda):
Execution Environment | Single session per invocation |
Scope of Task | Single defined tasks |
Task duration | 1 second up to 15 minutes |
API Call Pattern | Predictable (Action / Time) |
Visibility of API Actions per Task | Fully visible via CloudTrail (if data events included) |
Behavior lock |
|
Runtime Environment | Micro Container managed by AWS |
Invocation behavior | Scheduled, On Demand (load varies based on Use Case) |
Action Types (typical) | List, Read (data plane) and write (data plane) |
Attack Vector | Prerequisite: Ability to invoke function (directly or via 3rd Party Service). Application-level vulnerability with limited possibilities through event manipulation. API Level Access - Ability to change codebase. For example: Create a new function and assign an existing execution role with more privileges. |
Blast Radius | Small (For example: Part of an Application component) |
Orchestrated Task (Step Function):
Execution Environment | Multiple Sessions (depending on Workflow) |
Scope of Task | Multiple defined tasks |
Task duration | Seconds to hours |
API Call Pattern | Predictable for subtasks (Action / Time) and Parent tasks (Action / Time). Patterns are more difficult to consolidate compared to a single task. Dependency on Workflow. |
Visibility of API Actions per Task | Fully visible via CloudTrail (if data events included). Workflow visible to the user. |
Behavior lock |
|
Runtime Environment | AWS Native Services - typically locked |
Invocation behavior | Scheduled, On Demand (load varies based on Use Case) |
Action Types (typical) | List, Read (data plane) and write (data plane) |
Attack Vector | Prerequisite: Ability to invoke step function (directly or via 3rd Party Service). Application-level vulnerability with limited possibilities through event manipulation. In comparison to the simple task, the possibilities are even more limited as we are only able to manipulate the entry point. Actions inside the step function (lambda invocation) are locked by the step function definition. API Level Access - Ability to change step function definition. For example: Create a new step function and assign an existing execution role with more privileges. |
Blast Radius | Small-Medium (For example: A component of an Application) |
Container Task (ECS):
Execution Environment | Single session per container runtime |
Scope of Task | Multiple defined tasks |
Task duration | Minutes to hours |
API Call Pattern | Predictable for Task (Action) |
Visibility of API Actions per Task | Visibility per subtask is not available in CloudTrail -> Relies on application logging. |
Behavior lock |
|
Runtime Environment | See container Image |
Invocation behavior | Scheduled, On Demand |
Action Types (typical) | List, Read (data plane) and write (data plane) |
Attack Vector | Prerequisite: Network access to the container endpoint. Application-level attacks (CVEs/exploits in application code or a vulnerable library/installed extension). A successful attack results in full API access for all tasks the container owns. |
Blast Radius | Medium (for example: The whole application) |
Virtual Machine Task (EC2):
Execution Environment | Session(s) during EC2 runtime |
Scope of Task | Multiple defined tasks (middle-big) |
Task duration | Hours to always on |
API Call Pattern | Predictable for Task (Action) |
Visibility of API Actions per Task | Visibility per subtask is not available in CloudTrail -> Relies on application logging. |
Behavior lock |
|
Runtime Environment | See Operating System |
Invocation behavior | Scheduled, On Demand |
Action Types (typical) | List, Read (data plane) and write (data plane) |
Attack Vector | Prerequisite: Network access to the EC2 machine. Application-level attacks (CVEs/exploits in application code or a vulnerable library/installed extension). A successful attack results in full API access for all tasks the EC2 machine owns. |
Blast Radius | Medium (for example: The whole application includes all data and apps running on the EC2 Machine) |
External Task (On-Prem Machine via IAM User or IAM Roles Anywhere Role):
Execution Environment | Either role session for an IAM Roles Anywhere role or permanent access (IAM User) |
Scope of Task | Single defined tasks (small) |
Task duration | Minutes |
API Call Pattern | Predictable for Task (Action) |
Visibility of API Actions per Task | Fully visible via CloudTrail |
Behavior lock |
|
Runtime Environment | Operating System |
Invocation behavior | Scheduled, On Demand |
Action Types (typical) | List, Read (data plane) and write (data plane) |
Attack Vector | Compromised keys result in the ability to hijack role |
Blast Radius | Small to medium |
IaC Task (Cloudformation, Terraform, ...):
Execution Environment | Single session per container runtime/deployment |
Scope of Task | Single defined tasks (big) |
Task duration | Seconds to Minutes |
API Call Pattern | Unpredictable |
Visibility of API Actions per Task | Fully visible via CloudTrail |
Behavior lock |
|
Runtime Environment | See container Image |
Invocation behavior | Scheduled, On Demand |
Action Types (typical) | List, Create, Delete, Read (control plane), Write (control plane) |
Attack Vector | Hijacked deployment pipeline (execution environment). Hijacked code repository |
Blast Radius | Big (whole deployments) |
External Task (User via IAM User):
Execution Environment | Either role session for an IAM Roles Anywhere role or permanent access (IAM User) |
Scope of Task | Task definition unpredictable, limited by policy |
Task duration | Minutes to hours |
API Call Pattern | Unpredictable |
Visibility of API Actions per Task | Fully visible via CloudTrail |
Behavior lock |
|
Runtime Environment | Operating System |
Invocation behavior | Unpredictable |
Action Types (typical) | List, Read (data plane, control plane), Write (data plane) |
Attack Vector | Compromised keys result in the ability to hijack role |
Blast Radius | Small to medium |
External Task (User via federated access/SSO):
Execution Environment | Session(s) per login |
Scope of Task | Task definition unpredictable, limited by policy |
Task duration | Minutes to hours |
API Call Pattern | Unpredictable |
Visibility of API Actions per Task | Fully visible via CloudTrail |
Behavior lock |
|
Runtime Environment | Operating System |
Invocation behavior | Unpredictable |
Action Types (typical) | List, Read (data plane, control plane), Write (data plane) |
Attack Vector | Compromised session keys result in the ability to hijack role |
Blast Radius | Varies (small for purpose-driven roles to big for general roles) |
Additional Metrics
The probability of an attack can already be derived via the caller - however, in most cases, there are additional metrics that influence the decision of an attack target:
Environment: It should be clear that a productive environment is typically more interesting than a test environment. Tip: I always recommend defining a guideline regarding test and productive environments in AWS. Typically each environment will be hosted in a different account.
Account scope and type: Attackers will do their homework and profile your account's scope. Do you gather a lot of Workloads in single accounts? Do you separate each Application by account? Do you host shared services inside an Account? This shows once again the importance of a fully automated landing zone.
Application relevance for the business: Is the account hosting some critical assets that are needed for your business core processes or are there non-critical apps deployed?
People interacting with the account: Can I expect that developers also have access to productive environments? Maybe some managers have way too many privileges just because they are paying the bill. What is the skill set of the users interacting with the account? Are they still learning how to use AWS, or have they been developing cloud-native apps for years?
Pivoting possibilities: Some roles may have access to assume roles in other accounts. These interfaces can result in serious problems and should be secured and monitored with additional focus.
Data: Can I expect business-critical data in the account? Most people forget that a read-only role also has access rights to download S3 objects. Attackers often exploit the limited knowledge of cloud beginners and go for such quick wins.
Interim Result
All in all, our picture of the probability of an attack is becoming clearer. The bad news is that the number of metrics is too high to derive a simple rule. My recommendation: Follow the money. If you understand your business, you will know what you need to protect. Invest your time in securing those assets with additional focus. Try to apply organizational measures if possible, and solve the problem rather than just one instance of it.
We've also seen that there are poisonous combinations and critical actions that should get additional attention. I will dive into those combinations in the next blog posts in this series.
Last, consider cloud-native metrics when you build applications on AWS. The task comparison shows the strength of cloud-native vs. other architectures. Whereas cloud-native applications (when done right) offer a minimal attack surface with locked behavior, task visibility, and short role sessions, other architectures offer a bigger attack surface with less visibility.
My prioritization so far would be (1 low risk, 5 high risk):
Federated User (4-5): Typically, federated users have more access than needed. If I can hijack one session, I may get access to all of the accounts available to the attacked user. MFA is also misleading - if a session runs for 8 hours with one MFA login for X accounts, I have plenty of time to exploit the machine of one key person.
IaC (3-5): This is the heart of our deployment. IaC roles are powerful by nature. Even though an attack may be more complicated, it's worth considering it a top target.
Roles with trust policies allowing access from outside your own organization (3-4): Third-party companies are often not "cloud native" and demand overpowered IAM roles. This results in a huge attack surface. Since you cannot control the source account that assumes a role in your environment, it's an additional threat that is not under your control.
IAM User - machine- or user-driven (3): It's not the first time that access keys have been stored in public repositories. Even though many people apply least privilege, the attack vector of IAM Users may still be easier to exploit because the credentials are long-lived.
VM Task (2-3): The attacker needs to find an exploit in an app, a misconfigured machine, or a firewall that exposes too much. However, in most cases the blast radius is too small, and the attack will most probably be a targeted one.
Container Task (2): Same as for the VM task. However, the possibilities are more limited since a container typically doesn't host a full-blown OS and offers a smaller attack surface and fewer attack vectors due to missing utilities.
Simple Task (1-2): Harder to exploit, with an even smaller blast radius. An attacker may hope for a lucky shot and find an overpermissive Lambda role.
Orchestrated Task (1-2): Small blast radius and locked behavior via the step function definition. Not my first target!
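If you want to use the prioritization above as input for a review backlog, it can be written down as plain data. The following Python sketch does that. The names `CallerRisk`, `RANKING`, and `review_order`, the shortened caller labels, and the idea of sorting by the worst-case score first are my assumptions; the scores themselves are taken from the list above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CallerRisk:
    caller: str
    low: int   # lower bound of the risk range (1 = low risk)
    high: int  # upper bound of the risk range (5 = high risk)


# Scores copied from the prioritization in the text.
RANKING = [
    CallerRisk("Federated User", 4, 5),
    CallerRisk("IaC", 3, 5),
    CallerRisk("Role trusted outside own organization", 3, 4),
    CallerRisk("IAM User", 3, 3),
    CallerRisk("VM Task", 2, 3),
    CallerRisk("Container Task", 2, 2),
    CallerRisk("Simple Task", 1, 2),
    CallerRisk("Orchestrated Task", 1, 2),
]


def review_order(ranking: List[CallerRisk]) -> List[CallerRisk]:
    """Sort callers by worst-case risk first, then by best-case risk."""
    # Assumption: the upper bound matters most when deciding what to review.
    return sorted(ranking, key=lambda r: (r.high, r.low), reverse=True)


if __name__ == "__main__":
    for entry in review_order(RANKING):
        print(f"{entry.low}-{entry.high}  {entry.caller}")
```

Keeping the ranges instead of a single number preserves the uncertainty of the estimate, which matters once you combine it with the additional metrics from the previous section.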
A deeper insight into the IAM Role Trust policy document
My first blog post has shown that the AWS principal and the actual caller are connected via a trust policy document. This policy document is the key to an AWS environment if you depend on IAM Roles (in the case of IAM Users, the trust is built upon a shared secret between the caller and AWS). My next blog post will show you what you need to do to specify a good trust policy. Beginners and advanced AWS users alike often run into security issues without even knowing it, due to missing knowledge and best practices. As an example: Did you ever consider securing the PassRole action when you use compute services like Lambda, ECS, or CodeBuild together with IAM roles? Do you understand the ARN format of different IAM roles, and are you able to apply it in your trust policy with least privilege in mind?
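To give the PassRole question a concrete shape, here is a minimal sketch of a policy statement that allows passing exactly one role to exactly one service, using the `iam:PassedToService` condition key. The function name `pass_role_statement`, the account ID, and the role name are hypothetical and only serve as an example.

```python
import json


def pass_role_statement(role_arn: str, service_principal: str) -> dict:
    """Allow iam:PassRole for a single role and a single target service."""
    return {
        "Sid": "PassOnlyThisRoleToOneService",
        "Effect": "Allow",
        "Action": "iam:PassRole",
        # Name the role explicitly instead of using "*".
        "Resource": role_arn,
        # Restrict which service the role may be handed to.
        "Condition": {
            "StringEquals": {"iam:PassedToService": service_principal}
        },
    }


if __name__ == "__main__":
    statement = pass_role_statement(
        "arn:aws:iam::123456789012:role/demo-lambda-execution",  # hypothetical role
        "lambda.amazonaws.com",
    )
    print(json.dumps(statement, indent=2))
```

Without such a restriction, a caller with a wildcard PassRole permission could attach a more privileged role to a new function or task, which is the attack vector described for the simple task above.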
AWS IAM Access Analyzer is a great AWS-native service that can help you a lot in building visibility in your AWS organization. Use it and see your potential attack surface - it's free of charge.
Next steps
Now you've read a lot about my theoretical view on the basic elements of AWS IAM. What is still missing is a proper best practice and some guidelines on how to secure your IAM landscape and manage cloud identity entitlements with AWS-native services. I have two more posts to share in this series, which focus on developer teams in AWS that write IAM policies and on a centralized platform team implementing measures at an organizational level.
Both are needed to secure your environment. I hope you have enjoyed this article and look forward to the next part coming soon. Until then - reflect on the content and build your view!