Terraform Scenarios
AI Assisted (Grok 3)
Common Scenarios and how to handle them.
Project Structure Reference
data-infra/
├── modules/ # Reusable Terraform modules (e.g., s3_bucket, rds, glue_job)
├── environments/ # Environment-specific configurations
│ ├── dev/
│ │ ├── main.tf
│ │ ├── variables.tf
│ │ ├── outputs.tf
│ │ ├── backend.tf
│ │ └── terraform.tfvars
│ ├── staging/
│ └── prod/
├── scripts/
├── .gitignore
├── README.md
└── .gitlab-ci.yml # CI/CD pipeline with validate, plan, and apply stages
1. What happens if your state file is accidentally deleted?
Answer: If the Terraform state file (terraform.tfstate) is deleted, Terraform loses track of managed infrastructure. The next terraform apply assumes resources don’t exist and attempts to recreate them, potentially causing duplicates or failures. Recovery involves restoring a backup or manually importing resources with terraform import. Always enable versioning on remote state storage (e.g., S3).
Example:
- Project Context: The project uses an S3 backend for state storage (`s3://my-terraform-state/data-infra/dev/terraform.tfstate`) with versioning enabled.
- Scenario: The `dev` state file is deleted.
- Recovery:
# Restore the deleted state file from an S3 versioned backup
aws s3api get-object --bucket my-terraform-state --key data-infra/dev/terraform.tfstate --version-id <version-id> terraform.tfstate
# Put the restored file back at the backend location
aws s3 cp terraform.tfstate s3://my-terraform-state/data-infra/dev/terraform.tfstate
cd environments/dev
terraform init
terraform plan -var-file=terraform.tfvars   # Confirm the restored state matches real infrastructure
# If a resource is missing from the restored state, re-import it
terraform import module.data_lake.aws_s3_bucket.bucket dev-data-lake
Then re-run `apply` through the pipeline (see the `apply_dev` job in `.gitlab-ci.yml`).
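The recovery above depends on versioning being enabled for the state bucket. A minimal sketch of what the backend and the versioning configuration might look like is shown below; the region, the DynamoDB lock table, and the idea of managing the state bucket in a separate bootstrap configuration are assumptions, not details from the project.

# environments/dev/backend.tf (sketch; the key matches the path used above)
terraform {
  backend "s3" {
    bucket         = "my-terraform-state"
    key            = "data-infra/dev/terraform.tfstate"
    region         = "us-east-1"        # assumed region
    encrypt        = true
    dynamodb_table = "terraform-locks"  # assumed lock table for state locking
  }
}

# Managed in a separate bootstrap configuration (assumption): versioning is what
# makes the get-object --version-id recovery above possible.
resource "aws_s3_bucket_versioning" "state" {
  bucket = "my-terraform-state"
  versioning_configuration {
    status = "Enabled"
  }
}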
2. How do you handle large-scale refactoring without downtime?
Answer: For most resources, use terraform state mv to rename them in the state file without destruction. For S3 buckets, which have immutable names in AWS, create a new bucket, copy data, and update the state. Split refactoring into smaller, non-destructive pull requests (PRs), use targeted applies (terraform apply -target), and verify plans to prevent resource destruction. Test in dev before staging or prod.
Example:
- Project Context: Refactor the `s3_bucket` module to rename the bucket from `dev-data-lake` to `dev-data-lake-v2` in `environments/dev/main.tf`. Since S3 bucket names are immutable, a new bucket is created and data is copied.
- Step-by-Step Process:

1. Add New Bucket Module:
   - Update `environments/dev/main.tf` to define a temporary module (`data_lake_v2`) alongside `data_lake`:

     # environments/dev/main.tf
     module "data_lake" {
       source      = "../../modules/s3_bucket"
       bucket_name = "data-lake"      # Original: dev-data-lake
       environment = var.environment
     }

     module "data_lake_v2" {
       source      = "../../modules/s3_bucket"
       bucket_name = "data-lake-v2"   # New: dev-data-lake-v2
       environment = var.environment
     }

   - Run:

     cd environments/dev
     terraform init
     terraform apply -var-file=terraform.tfvars

   - Outcome: Creates `dev-data-lake-v2`. `dev-data-lake` remains active, ensuring no downtime.

2. Copy Data:
   - Sync data from `dev-data-lake` to `dev-data-lake-v2`:

     aws s3 sync s3://dev-data-lake s3://dev-data-lake-v2

   - Outcome: `dev-data-lake-v2` contains all data. Services using `dev-data-lake` are unaffected.

3. Remove Old Bucket from State:
   - Remove `dev-data-lake` from state so Terraform stops managing it (the bucket itself is not deleted):

     terraform state rm module.data_lake.aws_s3_bucket.bucket

   - Outcome: Terraform no longer manages `dev-data-lake`, which remains in AWS.

4. Update State:
   - Move the new bucket's state entry to the original address so it replaces `dev-data-lake`:

     terraform state mv module.data_lake_v2.aws_s3_bucket.bucket module.data_lake.aws_s3_bucket.bucket

   - Outcome: State now maps `module.data_lake.aws_s3_bucket.bucket` to `dev-data-lake-v2`.

5. Update Configuration:
   - Revise `main.tf` to keep only the `data_lake` module, pointing at the new name, and remove the temporary `data_lake_v2` module:

     # environments/dev/main.tf
     module "data_lake" {
       source      = "../../modules/s3_bucket"
       bucket_name = "data-lake-v2"
       environment = var.environment
     }

   - Run:

     terraform plan -var-file=terraform.tfvars   # Verify no destroy
     terraform apply -var-file=terraform.tfvars

   - Outcome: Terraform manages `dev-data-lake-v2` under `module.data_lake`.

6. Update Dependencies:
   - Modify dependent resources (e.g., Glue jobs) to reference `dev-data-lake-v2`:

     # modules/glue_job/main.tf
     data "aws_s3_bucket" "bucket" {
       bucket = "${var.environment}-data-lake-v2"
     }

   - Apply the change in a separate PR.
   - Outcome: Services transition to `dev-data-lake-v2` without disruption.

7. Delete Old Bucket (Optional):
   - If safe, delete `dev-data-lake`:

     aws s3 rb s3://dev-data-lake --force

   - Keep `force_destroy = false` in `modules/s3_bucket/main.tf` to prevent accidental deletion through Terraform:

     resource "aws_s3_bucket" "bucket" {
       bucket        = "${var.environment}-${var.bucket_name}"
       force_destroy = false
     }

   - Outcome: The old bucket is removed after confirmation.

- Best Practice: Test in `dev` using the `plan_dev` job, automate the data sync in `apply_dev`, document the steps in PRs, and keep `force_destroy = false`. Update `README.md`:

  ## S3 Bucket Renaming
  - Add the new module in `main.tf` and apply.
  - Sync data: `aws s3 sync s3://dev-data-lake s3://dev-data-lake-v2`.
  - Remove the old bucket from state: `terraform state rm`.
  - Move the new bucket's state entry: `terraform state mv`.
  - Remove the temporary module from the configuration.
  - Delete the old bucket if safe.
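Note that the manual state rm/mv steps are only needed here because the bucket's AWS name changes. For a pure code-level rename, where a module call gets a new label but the underlying bucket keeps its name, Terraform 1.1+ can record the rename declaratively with a moved block; a sketch follows (the new label `raw_zone` is hypothetical).

# environments/dev/main.tf (sketch; requires Terraform >= 1.1)
# Declares that the state previously addressed as module.data_lake now belongs
# to module.raw_zone, so each workspace migrates automatically on the next
# plan/apply instead of someone running terraform state mv by hand.
moved {
  from = module.data_lake
  to   = module.raw_zone
}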
3. What happens if a resource fails halfway through a terraform apply?
Answer: If a resource fails during terraform apply, you are left with a partial deployment: resources that succeeded are recorded in the state file, while a resource that was created but failed to provision is marked as tainted (a resource whose creation failed outright is simply absent from state). Use targeted applies (terraform apply -target) or a -refresh-only plan to assess the situation and recover systematically, addressing failures one at a time.
Example:
- Project Context: The `apply_dev` job fails when creating an RDS instance due to an invalid parameter.
- Scenario: `module.rds_postgres.aws_db_instance.rds_instance` fails, but `module.data_lake.aws_s3_bucket.bucket` is created.
- Recovery:
cd environments/dev
terraform init
terraform plan -var-file=terraform.tfvars # Check tainted resources
terraform apply -target=module.rds_postgres.aws_db_instance.rds_instance # Retry specific resource
Review the `apply.log` artifact from the `apply_dev` job in GitLab to diagnose the error.
- Best Practice: Use the pipeline’s `apply.log` (stored in S3 at `s3://my-audit-logs/terraform-apply/dev/`) for debugging, and target specific resources to minimize disruption.
4. How do you manage secrets in Terraform?
Answer: Store secrets in external systems like AWS Secrets Manager or HashiCorp Vault, use encrypted remote state, mark outputs as sensitive, and integrate with CI/CD securely. Avoid hardcoding secrets in .tfvars or code, and consider managing highly sensitive values outside Terraform.
Example:
- Project Context: The RDS module in `environments/dev/main.tf` requires a `db_password`.
- Implementation:
# environments/dev/main.tf
data "aws_secretsmanager_secret_version" "db_password" {
secret_id = "dev/rds/password"
}
module "rds_postgres" {
source = "../../modules/rds"
db_name = "datawarehouse"
environment = var.environment
db_username = var.db_username
db_password = data.aws_secretsmanager_secret_version.db_password.secret_string
subnet_ids = data.aws_subnet_ids.default.ids
security_group_ids = [aws_security_group.rds_sg.id]
}
# outputs.tf
output "rds_endpoint" {
value = module.rds_postgres.rds_endpoint # reference the module's output by name (assumes the rds module exposes an "rds_endpoint" output), not the internal resource path
sensitive = true
}
Store `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` in GitLab CI/CD variables for pipeline authentication.
- Best Practice: Use the S3 backend defined in `backend.tf` with encryption enabled and restrict access via IAM policies. Avoid storing secrets in `terraform.tfvars`.
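As a complementary sketch, the module input that receives the secret can be marked sensitive so Terraform redacts it from plan and apply output; this assumes the rds module declares a `db_password` variable, which the module call above implies but the original does not show.

# modules/rds/variables.tf (sketch)
variable "db_password" {
  description = "Master password for the RDS instance"
  type        = string
  sensitive   = true  # redact the value in plan/apply output
}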
5. What happens if terraform plan shows no changes but infrastructure was modified outside Terraform?
Answer: Terraform is unaware of external changes (state drift) until terraform refresh updates the state file. Implement regular drift detection in CI/CD pipelines to catch unauthorized modifications and reconcile them with terraform apply or terraform import.
Example:
- Project Context: An S3 bucket (`dev-data-lake`) is manually modified in the AWS Console to change its ACL.
- Detection:
# .gitlab-ci.yml
drift_detection:
  stage: validate
  extends: .terraform_base
  script:
    - cd environments/dev
    - terraform init
    - terraform refresh -var-file=terraform.tfvars
    - terraform plan -var-file=terraform.tfvars -out=tfplan
    - terraform show -no-color tfplan > tfplan.txt
  artifacts:
    paths:
      - environments/dev/tfplan.txt
    expire_in: 1 week
- Reconciliation (locally or via the pipeline):
cd environments/dev
terraform init
terraform refresh -var-file=terraform.tfvars
terraform plan -var-file=terraform.tfvars # Shows drift
terraform apply -var-file=terraform.tfvars # Reconcile changes
- Best Practice: Schedule the `drift_detection` job weekly in `.gitlab-ci.yml` and review the `tfplan.txt` artifacts to identify drift.
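When drift on a particular attribute is expected and intentional, it can also be excluded from future plans with ignore_changes rather than reconciled on every run. A sketch, assuming the ACL is managed as a separate `aws_s3_bucket_acl` resource (AWS provider v4+ style) and that the manual changes are deliberate:

# modules/s3_bucket/main.tf (sketch)
resource "aws_s3_bucket_acl" "bucket_acl" {
  bucket = aws_s3_bucket.bucket.id
  acl    = "private"

  lifecycle {
    # Assumption: the ACL is adjusted outside Terraform on purpose, so do not
    # report or revert drift on this attribute.
    ignore_changes = [acl]
  }
}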
6. What happens if you delete a resource definition from your configuration?
Answer: Removing a resource from Terraform configuration causes terraform apply to destroy the corresponding infrastructure. Use terraform state rm to remove the resource from state without destroying it, or add lifecycle { prevent_destroy = true } for critical resources.
Example:
- Project Context: `module.data_lake.aws_s3_bucket.bucket` is removed from `environments/prod/main.tf`.
- Prevention:
# modules/s3_bucket/main.tf
resource "aws_s3_bucket" "bucket" {
bucket = "${var.environment}-${var.bucket_name}"
lifecycle {
prevent_destroy = true
}
}
cd environments/prod
terraform state rm module.data_lake.aws_s3_bucket.bucket
terraform plan -var-file=terraform.tfvars # Verify no destroy
- Best Practice: Apply `prevent_destroy` to critical resources such as production S3 buckets and RDS instances in the `modules/` directory.
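On Terraform 1.7 or later, the same intent ("stop managing this, but do not destroy it") can be expressed in code with a removed block, which has the advantage of going through PR review instead of an ad-hoc terraform state rm. A sketch:

# environments/prod/main.tf (sketch; requires Terraform >= 1.7)
# Replaces the deleted module call: Terraform forgets the objects it managed
# without destroying them in AWS.
removed {
  from = module.data_lake

  lifecycle {
    destroy = false
  }
}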
7. What happens if Terraform provider APIs change between versions?
Answer: Provider API changes can break compatibility, causing errors in resource creation or updates. Read release notes, pin provider versions, test upgrades in dev, and use targeted applies for gradual migration.
Example:
- Project Context: Upgrade the AWS provider from `~> 4.0` to `~> 5.0` in `environments/dev/main.tf`.
- Implementation:
# environments/dev/main.tf
terraform {
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.0" # Upgraded from 4.0
}
}
}
cd environments/dev
terraform init -upgrade
terraform plan -var-file=terraform.tfvars # Check for breaking changes
terraform apply -var-file=terraform.tfvars
- Best Practice: Test upgrades via the `plan_dev` job, review release notes (e.g., the AWS provider changelog), and update one environment at a time.
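Alongside the provider constraint, pinning the Terraform core version and committing the dependency lock file keeps CI and local runs on the same provider build. A sketch; the version range shown is an assumption, not a project requirement:

# environments/dev/versions.tf (sketch)
terraform {
  required_version = ">= 1.5.0, < 2.0.0"
}

# Also commit .terraform.lock.hcl so `terraform init` verifies the exact provider
# checksums that were tested, rather than silently picking up a newer 5.x build.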
8. How do you implement zero-downtime infrastructure updates?
Answer: Use create_before_destroy lifecycle blocks, blue-green deployments, health checks, and state manipulation. For databases, leverage replicas or managed services with failover capabilities to avoid downtime.
Example:
- Project Context: Update the RDS instance class in `environments/prod/main.tf` without downtime.
- Implementation:
# modules/rds/main.tf
resource "aws_db_instance" "rds_instance" {
  # Use identifier_prefix rather than a fixed identifier so a replacement
  # instance can be created before the old one is destroyed without a name clash.
  identifier_prefix = "${var.environment}-${var.db_name}-"
  instance_class    = var.prod_instance_class
  allocated_storage = var.allocated_storage
  multi_az          = true  # Enable for failover
  apply_immediately = false # Apply during the maintenance window
  lifecycle {
    create_before_destroy = true
  }
}
cd environments/prod
terraform init
terraform plan -var-file=terraform.tfvars -out=tfplan
terraform apply tfplan
- Best Practice: Enable `multi_az` for production RDS (as in the cheatsheet), use `apply_immediately = false`, and test in `staging` via the `apply_staging` job.
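For changes that do force a replacement rather than an in-place modification, a read replica gives a warm target to cut over to. A rough sketch, assuming a replica fits the workload; the resource name is illustrative:

# modules/rds/main.tf (sketch; names are illustrative)
resource "aws_db_instance" "rds_replica" {
  identifier          = "${var.environment}-${var.db_name}-replica"
  replicate_source_db = aws_db_instance.rds_instance.identifier
  instance_class      = var.prod_instance_class
  # Engine, storage, and credentials are inherited from the source instance.
}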
9. What happens if you have circular dependencies in your Terraform modules?
Answer: Circular dependencies cause Terraform to fail with "dependency cycle" errors. Refactor modules using data sources, outputs, or restructured resources to establish a clear dependency hierarchy.
Example:
- Project Context: The `s3_bucket` module depends on a `glue_job` module, which in turn references the S3 bucket’s ARN, creating a cycle.
- Resolution:
# modules/s3_bucket/main.tf
resource "aws_s3_bucket" "bucket" {
bucket = "${var.environment}-${var.bucket_name}"
}
output "bucket_arn" {
value = aws_s3_bucket.bucket.arn
}
# modules/glue_job/main.tf
data "aws_s3_bucket" "bucket" {
bucket = "${var.environment}-${var.bucket_name}"
}
resource "aws_glue_job" "pyspark_job" {
name = "${var.environment}-${var.job_name}"
role_arn = var.glue_role_arn
command {
script_location = "s3://${data.aws_s3_bucket.bucket.bucket}/${var.script_path}"
}
}
- Best Practice: Use data sources to fetch existing resources, avoiding direct dependencies. Validate with the `validate` job in `.gitlab-ci.yml`.
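An alternative to the data-source lookup is to wire the dependency in one direction at the root: the root module reads the bucket module's output and passes it to the Glue module as a plain input, so neither module references the other. A sketch; it assumes the glue_job module accepts a `script_bucket` variable and the s3_bucket module exposes a `bucket_name` output, neither of which is in the original modules:

# environments/dev/main.tf (sketch; script_bucket and bucket_name are assumed additions)
module "glue_job" {
  source        = "../../modules/glue_job"
  environment   = var.environment
  job_name      = "etl"                         # illustrative name
  glue_role_arn = var.glue_role_arn
  script_bucket = module.data_lake.bucket_name  # one-way dependency, no cycle
}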
10. What happens if you rename a resource in your Terraform code?
Answer: Renaming a resource in Terraform code is interpreted as destroying and recreating the resource. Use terraform state mv to update the state file, preserving the existing infrastructure and avoiding rebuilds or downtime. For S3 buckets, create a new bucket and copy data, as bucket names are immutable.
Example:
- Project Context: Rename `module.data_lake.aws_s3_bucket.bucket` to `module.data_lake.aws_s3_bucket.data_lake`; the resource itself is defined in `modules/s3_bucket/main.tf`.
- Steps:
# environments/dev/main.tf (module call, unchanged by the rename)
module "data_lake" {
source = "../../modules/s3_bucket"
bucket_name = "data-lake"
environment = var.environment
}
# Updated: modules/s3_bucket/main.tf
resource "aws_s3_bucket" "data_lake" { # Renamed from bucket
bucket = "${var.environment}-${var.bucket_name}"
}
cd environments/dev
terraform init
terraform state mv module.data_lake.aws_s3_bucket.bucket module.data_lake.aws_s3_bucket.data_lake
terraform plan -var-file=terraform.tfvars # Verify no destroy
terraform apply -var-file=terraform.tfvars
- Best Practice: Preview the result with the `plan_dev` job after running `terraform state mv`, and document the rename in PRs for team review.
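On Terraform 1.1 or later, the rename can also be declared next to the resource itself with a moved block, so every environment that uses the module migrates its state automatically and nobody has to run terraform state mv by hand. A sketch:

# modules/s3_bucket/main.tf (sketch; requires Terraform >= 1.1)
# Tells Terraform that the resource formerly addressed as aws_s3_bucket.bucket
# is now aws_s3_bucket.data_lake, so existing state entries are re-addressed
# instead of the bucket being destroyed and recreated.
moved {
  from = aws_s3_bucket.bucket
  to   = aws_s3_bucket.data_lake
}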
Additional Notes
- Pipeline Integration: Use the `.gitlab-ci.yml` from the main cheatsheet to automate validation, planning, and applying changes, ensuring safe handling of state, secrets, and drift detection.
- Documentation: Update `README.md` to include recovery steps for each scenario (e.g., state restoration, drift detection setup):
## Handling Terraform Scenarios
- State Deletion: Restore from `s3://my-terraform-state/backups/<env>/<timestamp>.tfstate`.
- S3 Bucket Renaming: See Scenario 2 for detailed steps.
- Drift Detection: Run the `drift_detection` job in GitLab CI/CD.
- Zero-Downtime Updates: Enable `create_before_destroy` and `multi_az` for RDS in `modules/rds/main.tf`.
- Testing: Test all changes in the `dev` or `staging` environment using the `plan_dev` and `apply_dev` jobs before applying to `prod`.