ref(docs): restructure core infrastructure section for improved readability

403-html · 403-html · commit 4d58e57d8d15 · 2025-02-22T09:18:24.000+01:00
- Consolidated tips and tooling into bullet points
- Streamlined language for conciseness
- Added sections on Environment Parity, Chaos Engineering, Ephemeral Test Environments, and Test Data Management
diff --git a/docs/core-infrastructure.md b/docs/core-infrastructure.md
@@ -1,200 +1,92 @@
-# Section 2: Core infrastructure - monitoring and early detection
+# Section 2: Core Infrastructure - Monitoring and Early Detection
 
-The core infrastructure of a QA architecture is the backbone that supports the
-quality-first culture and ensures the smooth operation of your development
-processes. It encompasses the tools, systems, and practices that enable
-continuous deployments, continuous monitoring, and early detection of issues. By
-establishing a rock solid infrastructure, you can proactively identify problems,
-ensure security, and deliver high-quality products to your customers.
+The core infrastructure of a QA architecture supports a quality-first culture and ensures smooth development processes. It encompasses the tools, systems, and practices that enable continuous deployments, continuous monitoring, and early detection of issues. A solid infrastructure allows proactive problem identification, ensures security, and delivers high-quality products.
 
 ## 2.1 From "Reactive" to "Proactive"
 
-In many organizations, monitoring and issue detection are often reactive
-processes that occur after problems have already surfaced. This approach
-introduces risks of delayed responses, customer dissatisfaction, and potential
-security breaches. To build a quality-first culture, it is essential to shift
-from a reactive to a proactive approach to monitoring and issue detection. By
-implementing proactive monitoring practices, you can identify issues early,
-prevent incidents, and ensure the stability and security of your systems.
-
-### 2.1.1 Integration and collaboration
-
-To build a solid core infrastructure, it is essential to integrate monitoring
-and issue detection practices seamlessly into your development processes and
-foster collaboration between teams. By aligning monitoring tools, practices, and
-workflows with your development lifecycle, you can ensure that issues are
-detected early, resolved quickly, and prevented from recurring. Integration and
-collaboration also help teams to share knowledge, best practices, and insights
-to improve the reliability and security of your systems.
-
-## 2.2 Key practices and tools
-
-### 2.2.1 Continuous deployments
-
-Continuous deployments are a key practice in modern software development that
-enables teams to deliver changes to production frequently and reliably. By
-automating the deployment process, you can reduce the risk of human errors,
-streamline releases, and ensure that new features are delivered to customers
-quickly. Continuous deployments also facilitate early issue detection by
-allowing teams to monitor changes in real-time and roll back deployments if
-necessary.
-
-#### 2.2.1.1 Quick tips and sample tooling
-
-- **Automate your deployment pipeline**: Use tools like Jenkins, GitLab CI/CD,
-  or CircleCI to automate your deployment pipeline and ensure that changes are
-  deployed consistently and reliably.
-- **Implement blue-green deployments**: Set up blue-green deployments to reduce
-  downtime and minimize the impact of deployments on your users. This practice
-  allows you to switch between two identical production environments seamlessly.
-- **Monitor deployments in real-time**: Use monitoring tools like Datadog, New
-  Relic, or Prometheus to track the performance and health of your deployments
-  in real-time. Set up alerts to notify your team of any issues immediately.
-- **Leverage feature flags**: Implement feature flags to enable or disable
-  features in production without deploying code changes. This practice allows
-  you to control the release of new features and test them with a subset of
-  users before rolling them out to everyone.
-- **Automate rollbacks**: Create automated rollback procedures to revert changes
-  quickly in case of deployment failures or issues. Test your rollback process
-  regularly to ensure that it works as expected.
-
-### 2.2.2 Continuous monitoring
-
-Continuous monitoring is a critical practice that enables teams to track the
-performance, availability, and security of their systems in real-time. By
-monitoring key metrics and logs continuously, you can proactively identify
-issues, troubleshoot problems, and ensure the reliability of your applications.
-Continuous monitoring also helps teams to detect anomalies, predict failures,
-and optimize the performance of their systems.
-
-#### 2.2.2.1 Quick tips and sample tooling
-
-- **Set up monitoring dashboards**: Create monitoring dashboards using tools
-  like Grafana, Kibana, or Splunk to visualize key metrics and logs in
-  real-time. Customize your dashboards to display relevant information and set
-  up alerts for critical thresholds.
-- **Monitor key performance indicators (KPIs)**: Define key performance
-  indicators (KPIs) for your applications and services to track their
-  performance and health. Monitor metrics like response time, error rate,
-  throughput, and resource utilization to ensure that your systems are operating
-  optimally.
-- **Implement log aggregation**: Use log aggregation tools like ELK Stack, Sumo
-  Logic, or Graylog to centralize and analyze logs from your applications and
-  infrastructure. Search, filter, and correlate logs to troubleshoot issues and
-  gain insights into the behavior of your systems.
-- **Use tracing and profiling tools**: Leverage distributed tracing tools like
-  Jaeger, Zipkin, or OpenTelemetry to trace requests across microservices and
-  identify performance bottlenecks. Profile your applications using tools like
-  YourKit, VisualVM, or JProfiler to optimize resource usage and improve
-  performance.
-- **Automate incident response**: Implement incident response automation using
-  tools like PagerDuty, OpsGenie, or VictorOps to notify your team of critical
-  issues and coordinate response efforts. Create runbooks and playbooks to guide
-  your team through incident resolution and post-incident analysis.
-
-### 2.2.3 Error tracking and alerting
-
-Error tracking and alerting are essential practices that help teams to identify,
-prioritize, and resolve issues quickly. By tracking errors in real-time and
-setting up alerts for critical issues, you can ensure that your team is notified
-of problems immediately and can take action to address them. Error tracking also
-provides valuable insights into the root causes of issues, trends in error
-rates, and areas for improvement.
-
-#### 2.2.3.1 Quick tips and sample tooling
-
-- **Integrate error tracking tools**: Use error tracking tools like Sentry,
-  Rollbar, or Raygun to capture and aggregate errors from your applications.
-  Monitor error rates, trends, and impact to prioritize and resolve issues
-  effectively.
-- **Set up alerting rules**: Define alerting rules based on error severity,
-  frequency, and impact to notify your team of critical issues. Configure alerts
-  to trigger notifications via email, Slack, or SMS and escalate alerts to
-  on-call responders if necessary.
-- **Automate error resolution**: Implement error resolution automation using
-  tools like Rollbar Deploy Tracking, Sentry Releases, or Raygun Real User
-  Monitoring to correlate errors with code changes and releases. Identify the
-  root cause of errors quickly and roll back changes if needed.
-- **Analyze error trends**: Analyze error trends and patterns to identify
-  recurring issues, common root causes, and areas for improvement. Use error
-  data to prioritize bug fixes, optimize code quality, and prevent similar
-  issues in the future.
-- **Integrate error tracking with monitoring**: Integrate error tracking tools
-  with monitoring systems to correlate errors with performance metrics and logs.
-  Gain a holistic view of your systems and applications to troubleshoot issues
-  effectively and improve the reliability of your services.
-
-### 2.2.4 Static code analysis and security scanning
-
-Static code analysis and security scanning are essential practices that help
-teams to identify vulnerabilities, code smells, and quality issues early in the
-development process. By analyzing code statically and scanning for security
-vulnerabilities, you can prevent defects, improve code quality, and ensure the
-security of your applications. Static code analysis and security scanning also
-help teams to enforce coding standards, identify performance bottlenecks, and
-optimize code for maintainability.
-
-#### 2.2.4.1 Quick tips and sample tooling
-
-- **Run static code analysis**: Use static code analysis tools like SonarQube,
-  CodeClimate, or ESLint to analyze your codebase for quality issues, code
-  smells, and security vulnerabilities. Configure rules, quality gates, and
-  thresholds to enforce coding standards and prevent defects.
-- **Perform security scanning**: Conduct security scanning using tools like
-  OWASP ZAP, Burp Suite, or Checkmarx to identify vulnerabilities in your
-  applications. Scan for common security issues like SQL injection, cross-site
-  scripting (XSS), and sensitive data exposure to protect your systems from
-  attacks.
-- **Integrate security checks in CI/CD pipelines**: Integrate security checks
-  into your continuous integration and continuous deployment pipelines to
-  automate code analysis and scanning. Fail builds or deployments that violate
-  security policies and require developers to address security findings before
-  merging code.
-- **Enforce secure coding practices**: Educate developers on secure coding
-  practices and provide training on common security vulnerabilities. Use secure
-  coding guidelines, checklists, and best practices to prevent security issues
-  and ensure that your applications are secure by design.
-- **Monitor security alerts and advisories**: Stay informed about security
-  alerts, advisories, and vulnerabilities in third-party libraries and
-  dependencies. Subscribe to security mailing lists, CVE databases, and
-  vulnerability databases to receive updates and patches for known security
-  issues.
-- **Conduct security reviews and audits**: Conduct regular security reviews and
-  audits of your applications, infrastructure, and processes to identify
-  security risks and compliance gaps. Perform penetration testing, threat
-  modeling, and security assessments to validate the security of your systems
-  and applications.
-
-### 2.2.5 Disaster recovery and business continuity
-
-Disaster recovery and business continuity planning are essential practices that
-help teams to prepare for and respond to unexpected events and disruptions. By
-developing robust disaster recovery plans and business continuity strategies,
-you can minimize downtime, protect critical data, and ensure the resilience of
-your systems. Disaster recovery and business continuity planning also help teams
-to recover from incidents quickly, maintain service availability, and safeguard
-the continuity of your operations.
-
-#### 2.2.5.1 Quick tips and sample tooling
-
-- **Define recovery objectives and priorities**: Define recovery objectives,
-  priorities, and critical systems to guide your disaster recovery and business
-  continuity planning. Identify key dependencies, resources, and stakeholders to
-  ensure that your plans are comprehensive and effective.
-- **Develop recovery plans and playbooks**: Develop detailed recovery plans and
-  playbooks that outline the steps, procedures, and responsibilities for
-  responding to incidents. Document recovery strategies, communication
-  protocols, and escalation paths to guide your team through recovery efforts.
-- **Conduct disaster recovery drills**: With SRE team, conduct disaster recovery
-  drills and tabletop exercises to simulate incidents and test your recovery
-  plans. Practice incident response, coordination, and communication to validate
-  the effectiveness of your plans and identify areas for improvement.
-- **Automate recovery procedures**: Automate recovery procedures using tools
-  like AWS CloudFormation, Terraform, or Ansible to streamline recovery efforts
-  and reduce manual intervention. Create runbooks and playbooks to automate
-  incident response, recovery tasks, and failover processes.
-- **Monitor recovery metrics and performance**: Monitor recovery metrics like
-  recovery time objective (RTO), recovery point objective (RPO), and mean time
-  to recover (MTTR) to measure the effectiveness of your recovery plans. Analyze
-  performance data, conduct post-incident reviews, and iterate on your plans to
-  improve resilience and response capabilities.
+Many organizations use reactive monitoring, addressing issues only after they surface. This risks delayed responses, customer dissatisfaction, and security breaches. A quality-first culture requires a proactive approach, identifying issues early to prevent incidents and ensure system stability and security.
+
+**2.1.1 Integration and Collaboration:** Integrating monitoring and issue detection into development processes, along with fostering team collaboration, is essential. Aligning tools, practices, and workflows with the development lifecycle ensures early issue detection and resolution. This also facilitates knowledge sharing and best practice adoption for improved system reliability and security.
+
+## 2.2 Key Practices and Tools
+
+Here are the key practices and tools to establish a solid core infrastructure for monitoring and early issue detection:
+
+**2.2.1 Continuous Deployments:** Enables frequent, reliable production changes, reducing human error and streamlining releases. Real-time monitoring allows for quick rollbacks if necessary.
+
+* **Quick Tips and Sample Tooling:**
+  * **Automate your deployment pipeline:** Jenkins, GitLab CI/CD, CircleCI.
+  * **Implement blue-green deployments:** Minimize downtime by switching between identical production environments.
+  * **Monitor deployments in real-time:** Datadog, New Relic, Prometheus.
+  * **Leverage feature flags:** Control feature releases and test with user subsets. **Remember:** This shouldn't be your primary testing strategy.
+  * **Automate rollbacks:** Revert changes quickly in case of failures.
+  * **Changelog and release notes:** Document changes and updates for transparency and communication.
+
+**2.2.2 Continuous Monitoring:** Tracks system performance, availability, and security in real-time, enabling proactive issue identification, troubleshooting, and anomaly detection.
+
+* **Quick Tips and Sample Tooling:**
+  * **Set up monitoring dashboards:** Grafana, Kibana, Splunk.
+  * **Monitor key performance indicators (KPIs):** Response time, error rate, throughput, resource utilization.
+  * **Implement log aggregation:** ELK Stack, Sumo Logic, Graylog.
+  * **Use tracing and profiling tools:** Jaeger, Zipkin, OpenTelemetry; YourKit, VisualVM, JProfiler.
+  * **Automate incident response:** PagerDuty, OpsGenie, VictorOps.
+
+**2.2.3 Error Tracking and Alerting:** Identifies, prioritizes, and resolves issues quickly through real-time tracking and alerts. Provides insights into root causes, error trends, and areas for improvement.
+
+* **Quick Tips and Sample Tooling:**
+  * **Integrate error tracking tools:** Sentry, Rollbar, Raygun.
+  * **Set up alerting rules:** Based on severity, frequency, and impact.
+  * **Automate error resolution:** Rollbar Deploy Tracking, Sentry Releases, Raygun Real User Monitoring.
+  * **Analyze error trends:** Identify recurring issues and areas for improvement.
+  * **Integrate error tracking with monitoring:** Correlate errors with performance metrics and logs.
+
+**2.2.4 Static Code Analysis and Security Scanning:** Identifies vulnerabilities, code smells, and quality issues early in development. Enforces coding standards and improves code quality and security.
+
+* **Quick Tips and Sample Tooling:**
+  * **Run static code analysis:** SonarQube, CodeClimate, ESLint.
+  * **Perform security scanning:** OWASP ZAP, Burp Suite, Checkmarx.
+  * **Integrate security checks in CI/CD pipelines:** Automate code analysis and scanning.
+  * **Enforce secure coding practices:** Training and guidelines on common vulnerabilities.
+  * **Monitor security alerts and advisories:** Stay updated on patches and vulnerabilities.
+  * **Conduct security reviews and audits:** Penetration testing, threat modeling, security assessments.
+
+**2.2.5 Environment Parity:** Maintaining consistency across development, testing, and production environments is crucial for minimizing environment-specific issues.
+
+* **Key Considerations:**
+  * **Development Environments:** Should closely mirror production to reduce integration issues.
+  * **On-Demand Environments (Dev-X):** Provide developers, designers, testers and QAs with easily spinnable, isolated environments for feature development and testing.
+  * **No Need For Staging:** In scenario with environment parity achieved through robust dev-x for acceptance criteria manual testing and comprehensive automated testing (including integration and regression tests), a separate staging environment becomes unnecessary. Smaller, more frequent releases tested thoroughly in production-like dev-x environments, coupled with advanced deployment strategies like canary releases or blue/green deployments, can replace traditional staging cycles which takes days/weeks.
+  * **Partner Integration Environments:** Dedicated environments for partners to integrate and test their systems with yours.
+  * **Tooling:** Docker, Kubernetes, Terraform, Vagrant, CloudFormation.
+
+**2.2.6 Chaos Engineering:** Introduce controlled disruptions into your systems to identify weaknesses and improve resilience.
+
+* **Key Considerations:**
+  * **Planned Experiments:** Design experiments to target specific failure scenarios.
+  * **Monitoring and Analysis:** Observe system behavior during experiments to identify vulnerabilities.
+  * **Blast Radius Control:** Limit the impact of experiments to prevent widespread outages.
+  * **Tooling:** Chaos Monkey, Gremlin, LitmusChaos.
+
+**2.2.7 Ephemeral Test Environments:** Leverage containerization and automation to create and destroy test environments on demand. This ensures consistency and reduces environment maintenance overhead.
+
+* **Key Considerations:**
+  * **Containerization:** Docker, Kubernetes.
+  * **Infrastructure as Code:** Terraform, Ansible, CloudFormation.
+  * **Automated Provisioning:** Scripts and tools to automate environment creation and teardown.
+
+**2.2.8 Test Data Management:** Efficiently manage test data in ephemeral environments.
+
+* **Key Considerations:**
+  * **Data Generation:** Tools and techniques for generating realistic test data.
+  * **Data Masking:** Protect sensitive data by masking or anonymizing it.
+  * **Data Subsetting:** Create smaller, representative datasets for testing.
+  * **Data Versioning:** Track changes to test data and revert to previous versions if needed.
+
+**2.2.9 Disaster Recovery and Business Continuity:** Prepares for and responds to unexpected events, minimizing downtime and protecting critical data. Ensures system resilience and operational continuity.
+
+* **Quick Tips and Sample Tooling:**
+  * **Define recovery objectives and priorities:** Identify critical systems and dependencies.
+  * **Develop recovery plans and playbooks:** Outline steps, procedures, and responsibilities.
+  * **Conduct disaster recovery drills:** Simulate incidents and test recovery plans with the SRE team.
+  * **Automate recovery procedures:** AWS CloudFormation, Terraform, Ansible.
+  * **Monitor recovery metrics and performance:** RTO, RPO, MTTR.