Electronics Guide

Hazard Analysis and Risk Assessment

Hazard analysis and risk assessment form the cornerstone of safety-critical system development. These systematic processes identify potential hazards, evaluate associated risks, and guide the implementation of appropriate mitigations. Without rigorous hazard analysis, safety-critical systems may harbor unrecognized dangers that could lead to catastrophic consequences when deployed in the field.

The discipline has evolved from early fault tree methods developed for aerospace and nuclear industries into a comprehensive set of complementary techniques applicable across all safety-critical domains. Modern safety standards mandate specific hazard analysis activities at each lifecycle phase, ensuring that safety considerations drive system design from concept through decommissioning. This article explores the fundamental principles and practical techniques essential for conducting effective hazard analysis and risk assessment in embedded systems.

Fundamental Concepts

Understanding the terminology and principles underlying hazard analysis is essential before applying specific techniques. Clear definitions ensure consistent communication among safety engineers, system designers, and regulatory authorities.

Hazards, Risks, and Accidents

A hazard is a potential source of harm, representing a condition or situation with the inherent potential to cause injury, damage, or loss. Hazards themselves do not cause harm; rather, they create the potential for harm when combined with exposure and triggering conditions. In embedded systems, hazards might include unintended motor activation, loss of braking capability, or incorrect medication dosing.

Risk quantifies the combination of hazard severity and likelihood of occurrence. Risk assessment considers both how bad the consequences could be if a hazard leads to an accident and how probable that outcome is. A hazard with catastrophic potential but extremely low probability may present lower overall risk than a moderately severe hazard that occurs frequently.

An accident occurs when a hazard is triggered and leads to actual harm. Accidents result from chains of events that begin with initiating causes, progress through hazardous states, and culminate in harmful outcomes. Effective safety engineering aims to break these chains at multiple points, preventing hazards from ever leading to accidents.

Safety Functions and Safety Integrity

Safety functions are specific actions or features implemented to achieve or maintain a safe state. These functions may prevent hazardous conditions from arising, detect hazardous conditions and trigger appropriate responses, or mitigate consequences when hazardous conditions occur. Examples include emergency shutdown systems, overspeed protection, and fail-safe defaults.

Safety integrity refers to the probability that a safety function will be performed satisfactorily under all stated conditions within a specified period of time. Higher safety integrity implies greater confidence that the safety function will work when needed. Safety Integrity Levels (SILs) provide standardized categories for specifying required safety integrity based on the risk reduction needed from each safety function.

Tolerable Risk and ALARP

Tolerable risk represents the level of risk that society is willing to accept for a given activity in return for its benefits. Determining tolerable risk involves ethical, economic, and societal considerations beyond pure engineering analysis. Different application domains establish different tolerable risk thresholds based on societal expectations and historical precedent.

The ALARP (As Low As Reasonably Practicable) principle requires that risks be reduced below the intolerable threshold and then further reduced until the cost of additional reduction becomes grossly disproportionate to the safety benefit gained. ALARP recognizes that absolute safety is unattainable and that resources for safety improvement are limited. This principle guides engineers in making rational decisions about safety investments.

Defense in Depth

Defense in depth implements multiple independent layers of protection against hazards. If one protective layer fails, subsequent layers continue to provide protection. This principle recognizes that no single protection mechanism is perfectly reliable and that diverse approaches provide greater overall protection than reliance on any single measure.

In embedded systems, defense in depth might include input validation at multiple processing stages, independent monitoring of safety-critical parameters, hardware interlocks supplementing software protections, and mechanical safeguards backing up electronic controls. Each layer addresses the possibility that previous layers might fail or be bypassed.

Hazard Identification Techniques

Hazard identification is the critical first step in safety analysis. Techniques range from informal brainstorming to highly structured methodologies. The choice of technique depends on system complexity, available information, and lifecycle phase.

Preliminary Hazard Analysis

Preliminary Hazard Analysis (PHA) provides early identification of potential hazards before detailed design begins. PHA examines system concepts, intended functions, and operating environments to identify hazardous conditions that might arise. The analysis considers energy sources, hazardous materials, environmental factors, human interactions, and interfaces with other systems.

PHA typically produces a list of identified hazards with preliminary severity assessments and recommended design constraints or safety requirements. This early analysis influences fundamental architectural decisions, potentially eliminating hazards through design choices rather than requiring later mitigation. PHA results guide resource allocation for more detailed analysis as design progresses.

The technique examines categories of potential hazards systematically. For embedded systems, these categories include electrical hazards such as shock and fire, mechanical hazards from moving parts, thermal hazards from heating or cooling, radiation hazards if applicable, software hazards from erroneous outputs or timing failures, and interface hazards at system boundaries.

Hazard and Operability Study

Hazard and Operability Study (HAZOP) provides a structured methodology for identifying hazards and operability problems in process systems. Originally developed for chemical plants, HAZOP has been adapted for other domains including embedded systems. The technique systematically examines deviations from design intent to identify potential problems.

HAZOP applies guide words to system parameters to generate deviation scenarios. Standard guide words include NO (complete negation), MORE (quantitative increase), LESS (quantitative decrease), AS WELL AS (qualitative addition), PART OF (qualitative reduction), REVERSE (logical opposite), and OTHER THAN (complete substitution). For each deviation, the team identifies possible causes, consequences, existing safeguards, and recommendations.
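The guide-word crossing lends itself to mechanical generation of worksheet prompts. The sketch below, with hypothetical parameter names, shows one way to seed a HAZOP study sheet; a real study would then discuss each prompt with the team rather than treat the list as complete.

```python
from itertools import product

# Standard HAZOP guide words as listed above.
GUIDE_WORDS = ["NO", "MORE", "LESS", "AS WELL AS", "PART OF", "REVERSE", "OTHER THAN"]

def generate_deviations(parameters):
    """Cross each system parameter with every guide word to seed the worksheet."""
    return [f"{word} {param}" for param, word in product(parameters, GUIDE_WORDS)]

# Hypothetical embedded-system parameters for illustration.
deviations = generate_deviations(["sensor signal", "temperature reading"])
print(len(deviations))  # 2 parameters x 7 guide words = 14 deviation prompts
```

Each generated prompt (e.g. "NO sensor signal") then becomes a row in the study worksheet, with causes, consequences, safeguards, and recommendations filled in by the team.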

For embedded systems, HAZOP examines deviations in signals, data flows, timing, and functional behavior. A HAZOP study might consider scenarios such as "no sensor signal received," "higher than expected temperature reading," "command executed late," or "incorrect mode transition." The systematic nature of HAZOP helps ensure comprehensive coverage of potential deviations.

Failure Mode and Effects Analysis

Failure Mode and Effects Analysis (FMEA) examines potential failure modes of system components and their effects on system behavior. The analysis proceeds bottom-up, starting with individual components and tracing the effects of their failures through the system hierarchy. FMEA provides systematic coverage of hardware failure modes and identifies single points of failure requiring additional protection.

For each component, FMEA identifies possible failure modes based on component type and function. Common failure modes include open circuit, short circuit, drift, stuck-at, intermittent operation, and degraded performance. The analysis then traces the local effect of each failure mode, the effect on higher-level assemblies, and ultimately the end effect on system safety and operation.

FMEA results include severity ratings, occurrence probability estimates, and detection capability assessments. The Risk Priority Number (RPN), calculated as the product of severity, occurrence, and detection ratings, helps prioritize failure modes for mitigation. While RPN has limitations, it provides a useful first-pass ranking for allocating analysis resources.
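The RPN calculation and ranking can be sketched as follows; the failure modes and ratings here are hypothetical, and the 1-10 rating scale reflects common FMEA practice rather than a universal rule.

```python
def risk_priority_number(severity, occurrence, detection):
    """RPN = S x O x D, each conventionally rated on a 1-10 scale."""
    for rating in (severity, occurrence, detection):
        if not 1 <= rating <= 10:
            raise ValueError("FMEA ratings are conventionally 1-10")
    return severity * occurrence * detection

# Hypothetical failure modes: (name, severity, occurrence, detection)
failure_modes = [
    ("sensor open circuit", 8, 3, 2),
    ("ADC drift", 5, 5, 6),
    ("watchdog stuck-at", 9, 2, 4),
]

# Rank by RPN, highest first, to allocate mitigation attention.
ranked = sorted(failure_modes, key=lambda fm: risk_priority_number(*fm[1:]), reverse=True)
for name, s, o, d in ranked:
    print(name, risk_priority_number(s, o, d))
```

Note how the ranking illustrates the RPN's limitation: a moderately severe but hard-to-detect drift can outrank a high-severity failure mode, which is why RPN is only a first-pass filter.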

Design FMEA (DFMEA) focuses on product design weaknesses, while Process FMEA (PFMEA) examines manufacturing process failures that could compromise product safety. Both types contribute to comprehensive safety assurance for embedded systems.

System-Theoretic Process Analysis

System-Theoretic Process Analysis (STPA) takes a systems-thinking approach to hazard analysis, focusing on inadequate control rather than component failure. STPA recognizes that accidents often result from unsafe interactions among components rather than individual component failures. The technique is particularly valuable for complex software-intensive systems where traditional failure-based methods may miss important hazards.

STPA begins by identifying accidents and system-level hazards, then models the system as a hierarchical control structure. The analysis examines four ways a control action could be unsafe: a required control action is not provided; an unsafe control action is provided; a potentially safe control action is provided too early, too late, or out of sequence; or a control action is stopped too soon or applied too long.

For each unsafe control action, STPA identifies causal scenarios that could lead to its occurrence. These scenarios consider controller failures, feedback inadequacies, communication problems, and flawed mental models. The resulting causal scenarios inform safety requirements and architectural decisions for preventing unsafe control.

What-If Analysis

What-if analysis uses structured brainstorming to identify hazards by posing questions about potential deviations and failures. A team of experts systematically asks "what if" questions about system behavior under various conditions. This technique leverages team experience and creativity while providing structured documentation of the analysis process.

What-if questions address various aspects of system behavior: What if the sensor fails? What if the operator enters incorrect data? What if power is interrupted during a critical operation? What if software executes an unexpected branch? The team discusses each scenario, identifying potential consequences and existing protections.

What-if analysis works well for early hazard identification when detailed design information is limited. The technique also complements more structured methods by capturing hazards that formal procedures might miss. Effective what-if analysis requires experienced participants with diverse perspectives on system behavior and potential failure modes.

Risk Assessment Methods

Risk assessment evaluates identified hazards to determine which require mitigation and to what degree. Assessment methods range from qualitative categorization to detailed quantitative analysis, with the choice depending on available data and required precision.

Risk Matrix Approach

Risk matrices provide a semi-quantitative framework for classifying risks based on severity and likelihood. The matrix rows represent severity categories ranging from negligible to catastrophic, while columns represent likelihood categories from incredible to frequent. Cell intersections indicate risk levels that map to required actions.

Typical severity categories for embedded systems include catastrophic (death or system loss), critical (severe injury or major system damage), marginal (minor injury or significant system degradation), and negligible (less than minor injury or minor system impairment). Likelihood categories might span from frequent (likely to occur often in system life) to incredible (so unlikely as to be practically impossible).

Risk matrices enable rapid risk classification without detailed quantitative analysis. However, they have limitations including subjective categorization, boundary effects between categories, and inability to represent continuous risk variation. Despite these limitations, risk matrices provide useful first-pass assessments and facilitate communication about risk levels.
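A minimal matrix lookup can be sketched as below. The category names follow the text; the cell-to-risk-level calibration is purely illustrative, since real matrices are defined per project or standard, and a simple index sum cannot capture every calibration an organization might choose.

```python
# Severity and likelihood categories, ordered from least to most severe/likely.
SEVERITY = ["negligible", "marginal", "critical", "catastrophic"]
LIKELIHOOD = ["incredible", "remote", "probable", "frequent"]

def risk_level(severity, likelihood):
    """Classify by summed category indices; thresholds are illustrative only."""
    score = SEVERITY.index(severity) + LIKELIHOOD.index(likelihood)
    if score >= 5:
        return "intolerable"
    if score >= 3:
        return "undesirable"
    if score >= 2:
        return "tolerable with review"
    return "acceptable"

print(risk_level("catastrophic", "remote"))   # severe but unlikely
print(risk_level("marginal", "frequent"))     # mild but frequent
```

The two printed cases land in the same band, illustrating the severity/likelihood trade-off described earlier, and also the boundary effects that make matrix calibration contentious.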

Quantitative Risk Analysis

Quantitative risk analysis calculates numerical risk values from probability and consequence data. This approach enables precise risk comparison and verification that risks meet numerical targets. Quantitative analysis requires failure rate data, consequence severity estimates, and mathematical models relating failures to outcomes.

Risk calculations typically express results as probability of harm per unit time or per operation. For example, a safety target might specify that the probability of a fatal accident shall not exceed 10^-9 per hour of operation. Quantitative analysis demonstrates compliance with such targets by calculating the probability of accident scenarios and summing across all identified scenarios.
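Summing scenario frequencies against a target is arithmetically simple, as the sketch below shows with hypothetical scenario names and frequencies chosen for illustration only.

```python
# Hypothetical per-hour accident-scenario frequencies from quantitative analysis.
scenario_frequencies = {
    "undetected brake command loss": 4e-10,
    "dual sensor failure":           2e-10,
    "software mode confusion":       1e-10,
}

TARGET = 1e-9  # e.g. fatal-accident probability shall not exceed 1e-9 per hour

# Compliance check: the sum across all identified scenarios must meet the target.
total = sum(scenario_frequencies.values())
print(f"total = {total:.1e}/h, target met: {total <= TARGET}")
```

The check is only as good as the scenario list: an unidentified scenario contributes nothing to the sum, which is why completeness of hazard identification matters more than the arithmetic.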

Quantitative analysis depends heavily on input data quality. Failure rate databases provide component reliability data, but application-specific factors may require adjustment. Uncertainty analysis, often using Monte Carlo simulation, characterizes confidence bounds on calculated risk values. Conservative assumptions ensure that calculated risks do not underestimate actual risks.
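As a toy illustration of Monte Carlo uncertainty propagation, the sketch below samples two lognormally distributed failure rates through a deliberately simple risk model; the nominal rates, spread, and exposure factor are all assumptions made up for this example.

```python
import random

def monte_carlo_risk(n_samples=100_000, seed=1):
    """Propagate lognormal uncertainty in two failure rates through a
    simple product model (risk ~ lambda_a * lambda_b * exposure).
    All parameter values here are hypothetical."""
    random.seed(seed)
    samples = []
    for _ in range(n_samples):
        lam_a = random.lognormvariate(mu=-14, sigma=0.5)  # ~1e-6/h nominal rate
        lam_b = random.lognormvariate(mu=-14, sigma=0.5)
        samples.append(lam_a * lam_b * 1e4)               # assumed exposure factor
    samples.sort()
    # Report median and an upper confidence bound on the calculated risk.
    return samples[int(0.5 * n_samples)], samples[int(0.95 * n_samples)]

median, p95 = monte_carlo_risk()
print(f"median {median:.1e}, 95th percentile {p95:.1e}")
```

Reporting the upper percentile rather than the median is one way of building the conservative assumptions mentioned above directly into the result.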

Fault Tree Analysis

Fault Tree Analysis (FTA) models the logical relationships between component failures and system-level hazards. Starting from a top event representing a hazardous condition, the analysis works backward to identify combinations of basic events that could cause the top event. The resulting tree structure visualizes failure logic and enables quantitative probability calculation.

FTA uses Boolean logic gates to represent relationships between events. AND gates indicate that all input events must occur for the output event to occur, while OR gates indicate that any single input event causes the output. More complex gates represent voting logic, priority relationships, and other conditional dependencies.

Minimal cut sets identify the smallest combinations of basic events sufficient to cause the top event. Single-point cut sets, containing only one basic event, represent critical vulnerabilities requiring additional protection. Common cause analysis examines whether apparently independent basic events might share common failure mechanisms that would defeat redundancy.

Quantitative FTA calculates top event probability from basic event probabilities using the tree logic. This calculation supports verification of probabilistic safety targets and identification of dominant contributors to risk. Importance measures quantify each basic event's contribution to overall risk, guiding mitigation priorities.
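The gate arithmetic for quantitative FTA is shown below for a tiny hypothetical tree; both gates assume independent basic events, which is exactly the assumption that common cause analysis must challenge.

```python
def or_gate(*probs):
    """P(at least one input event), assuming independence: 1 - prod(1 - p_i)."""
    q = 1.0
    for p in probs:
        q *= (1.0 - p)
    return 1.0 - q

def and_gate(*probs):
    """P(all input events), assuming independence: prod(p_i)."""
    q = 1.0
    for p in probs:
        q *= p
    return q

# Hypothetical tree: top event = (sensor A fails AND sensor B fails) OR power loss
p_top = or_gate(and_gate(1e-3, 1e-3), 1e-6)
print(f"{p_top:.2e}")  # redundant pair and power loss contribute ~1e-6 each
```

Here the single basic event "power loss" contributes as much as the entire redundant sensor pair, flagging it as the kind of single-point cut set the minimal cut set analysis is designed to expose.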

Event Tree Analysis

Event Tree Analysis (ETA) models accident sequences starting from initiating events and progressing through safety function responses. Each branch point represents a safety function that either succeeds or fails, with different outcomes depending on the combination of successes and failures. The tree structure visualizes the range of possible accident sequences and their probabilities.

Event trees complement fault trees by modeling success paths as well as failure paths. While fault trees focus on how hazardous conditions arise, event trees show how initiating events progress or are arrested depending on safety function performance. The combination of both techniques provides comprehensive accident modeling.

Constructing event trees requires identifying initiating events, determining the safety functions that respond to each initiator, establishing the sequence in which safety functions act, and estimating success and failure probabilities for each function. Dependent failures between safety functions must be considered, as common cause failures can compromise multiple branches simultaneously.
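The enumeration of accident sequences can be sketched as follows; the initiator, safety functions, and probabilities are hypothetical, and the sketch assumes independent branches, so it deliberately ignores the dependent-failure caveat above.

```python
from itertools import product

def event_tree(initiator_freq, safety_functions):
    """Enumerate all success/failure combinations of the ordered safety
    functions and compute each sequence's frequency (independence assumed)."""
    sequences = {}
    for outcomes in product([True, False], repeat=len(safety_functions)):
        freq = initiator_freq
        labels = []
        for (name, pfd), ok in zip(safety_functions, outcomes):
            freq *= (1 - pfd) if ok else pfd
            labels.append(f"{name}:{'ok' if ok else 'FAIL'}")
        sequences[", ".join(labels)] = freq
    return sequences

# Hypothetical: overpressure initiator at 1e-2/yr with two responding layers.
seqs = event_tree(1e-2, [("alarm+operator", 1e-1), ("relief valve", 1e-2)])
worst = "alarm+operator:FAIL, relief valve:FAIL"
print(f"{seqs[worst]:.1e} per year")  # frequency of the all-layers-fail sequence
```

The sequence frequencies sum back to the initiating event frequency, a useful sanity check when building real trees.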

Bow-Tie Analysis

Bow-tie analysis combines fault tree and event tree methods into an integrated visualization centered on a hazardous event. The left side of the bow-tie (the fault tree portion) shows how initiating causes can lead to the central hazardous event. The right side (the event tree portion) shows how the hazardous event can progress to various consequences.

Prevention barriers appear on the left side, blocking paths from causes to the hazardous event. Mitigation barriers appear on the right side, preventing the hazardous event from leading to harmful consequences. This visualization clearly shows the complete barrier structure protecting against the hazard.

Bow-tie diagrams communicate safety architecture effectively to diverse stakeholders. The visual format helps non-specialists understand how multiple barriers contribute to safety. Bow-ties also facilitate barrier management by clearly identifying each barrier and its role in the overall protection scheme.

Safety Integrity Level Determination

Safety Integrity Level (SIL) determination assigns numerical integrity requirements to safety functions based on the risk reduction they must provide. Proper SIL determination ensures that safety functions receive appropriate development rigor and achieve necessary reliability.

Risk Graph Method

The risk graph method, described in IEC 61508, determines SIL requirements through a structured decision tree. The method considers consequence severity, frequency of exposure to hazard, possibility of avoiding the hazard, and probability of unwanted occurrence without the safety function. Each factor selects a branch in the decision tree, with the path through the tree determining the required SIL.

Consequence severity categories range from minor injury to multiple fatalities or catastrophic environmental damage. Exposure frequency considers how often people are in the hazard zone, from rare to continuous presence. Avoidance possibility reflects whether harm can be escaped once a hazardous situation develops. Demand probability estimates how often the safety function must act.

The risk graph approach is straightforward to apply but provides only coarse SIL discrimination. The method works best for initial SIL estimation during early development phases. More refined analysis may be needed when risk graph results indicate SIL boundaries or when consequences span multiple categories.

Risk Matrix Method

The risk matrix method for SIL determination maps risk levels directly to required SIL values. The organization defines acceptable risk thresholds and the risk reduction each SIL provides. Comparing unmitigated risk levels to acceptable thresholds determines the SIL needed for adequate risk reduction.

This method requires calibrating the risk matrix to quantitative risk values and establishing the risk reduction factor associated with each SIL. Typical risk reduction factors are 10 to 100 for SIL 1, 100 to 1000 for SIL 2, 1000 to 10000 for SIL 3, and 10000 to 100000 for SIL 4. The required SIL is the level that reduces risk from unmitigated levels to tolerable thresholds.
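The band lookup can be sketched directly from the risk reduction factors above; treating band edges as belonging to the higher level is one conservative convention (an RRF of exactly 10^4 then maps to SIL 4), though practice varies at the boundaries.

```python
def sil_for_rrf(rrf):
    """Map a required risk reduction factor to a SIL using the bands in the
    text (IEC 61508 low-demand convention). Band edges are assigned
    conservatively to the higher level."""
    if rrf > 100_000:
        raise ValueError("beyond SIL 4: seek risk reduction by other means")
    if rrf >= 10_000:
        return 4
    if rrf >= 1_000:
        return 3
    if rrf >= 100:
        return 2
    if rrf >= 10:
        return 1
    return 0  # no SIL-rated safety function required for this hazard

print(sil_for_rrf(5_000))  # an RRF of 5000 falls in the SIL 3 band
```

An RRF demand above 10^5 is a signal to redesign rather than to specify an impossibly reliable safety function, which is why the sketch raises an error instead of returning a level.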

Quantitative Method

The quantitative method calculates required SIL from numerical safety targets and hazard analysis results. Given a tolerable risk target and calculated unmitigated risk, the required risk reduction factor determines SIL. This method provides the most precise SIL determination but requires quantitative hazard analysis.

For example, if tolerable risk is 10^-8 fatalities per hour and unmitigated risk is 10^-4 fatalities per hour, the required risk reduction factor is 10^4, corresponding to SIL 3 or SIL 4. The safety function must achieve at least this risk reduction factor to bring overall risk within tolerability limits.

Quantitative SIL determination should account for multiple hazards protected by the same safety function and multiple safety functions protecting against the same hazard. The combined effect of all safety measures must achieve tolerable risk for each identified hazard.

Layers of Protection Analysis

Layers of Protection Analysis (LOPA) provides a semi-quantitative approach to SIL determination popular in process industries. LOPA examines each hazard scenario, identifying independent protection layers (IPLs) and their probability of failure on demand (PFD). The analysis determines whether existing layers provide sufficient protection or whether additional safety-instrumented functions are needed.

Each IPL receives a PFD credit based on its design and independence. The product of initiating event frequency and all IPL PFDs yields the mitigated event frequency. Comparing this frequency to target frequency thresholds determines whether additional protection is required and at what SIL.
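The LOPA arithmetic is a straightforward product, as the sketch below shows; the initiating frequency, IPL credits, and target are hypothetical values of the kind a process-industry LOPA might use, and the calculation assumes the layers are genuinely independent.

```python
def lopa_mitigated_frequency(initiating_freq, ipl_pfds):
    """Mitigated event frequency = initiating frequency x product of IPL PFDs,
    valid only if the protection layers are truly independent."""
    freq = initiating_freq
    for pfd in ipl_pfds:
        freq *= pfd
    return freq

# Hypothetical scenario: 0.1/yr initiator; basic process control (PFD 0.1),
# operator response to alarm (0.1), relief device (0.01) as IPLs.
mitigated = lopa_mitigated_frequency(0.1, [0.1, 0.1, 0.01])  # ~1e-5 per year

target = 1e-6  # illustrative corporate target frequency
additional_rrf = mitigated / target  # ~10: a SIL 1 instrumented function would close the gap
print(f"{mitigated:.0e}/yr vs target {target:.0e}/yr; extra RRF needed ~ {additional_rrf:.0f}")
```

If a common cause could defeat two of these layers at once, their PFDs cannot simply be multiplied, which is the caveat raised in the paragraph above.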

LOPA works well when independent protection layers can be clearly identified and characterized. The method forces explicit consideration of layer independence and failure modes. However, LOPA requires careful attention to common cause failures that might defeat multiple layers simultaneously.

Software Hazard Analysis

Software-intensive embedded systems require specific attention to software-related hazards. Traditional hardware-focused techniques may miss hazards arising from software behavior, making specialized software hazard analysis essential.

Software FMEA

Software FMEA adapts the traditional FMEA methodology for software components. Rather than examining physical failure modes, software FMEA considers functional failure modes such as incorrect output, missing output, output out of range, output at wrong time, or unexpected additional output. Each software function is analyzed for these potential failure modes and their system effects.

Software FMEA examines interfaces between software components, identifying data corruption, timing violations, and protocol errors. The analysis also considers configuration errors, resource exhaustion, and interaction failures between concurrent processes. Results guide defensive programming requirements and error handling design.

Software Fault Tree Analysis

Software FTA extends fault tree methods to software failures. Top events represent hazardous software behaviors, with the tree decomposing these into contributing software faults. Unlike hardware fault trees, software fault trees typically cannot be quantified probabilistically since software failures are deterministic given specific inputs and states.

Software fault trees identify conditions under which software could produce hazardous outputs. These conditions might include specific input combinations, particular state sequences, resource conditions, or timing relationships. The identified conditions inform test case development and defensive measure design.

Interface Hazard Analysis

Interface hazard analysis examines boundaries between system components where communication or interaction occurs. Interfaces between hardware and software, between software modules, and between the system and external entities all merit careful analysis. Interface failures often cause accidents because they involve assumptions that may not be validated.

The analysis considers data format mismatches, timing incompatibilities, protocol violations, error handling inconsistencies, and assumption mismatches. For each interface, potential hazards from both directions are examined: hazards from upstream component failures affecting downstream components and hazards from downstream component behaviors affecting upstream components.

Code-Level Safety Analysis

Code-level safety analysis examines implementation details that could contribute to hazardous behavior. Static analysis tools identify potentially dangerous coding patterns such as uninitialized variables, buffer overflows, integer overflows, null pointer dereferences, and race conditions. Dynamic analysis through testing exercises code paths related to safety-critical functions.

Safety-critical code often requires adherence to coding standards such as MISRA C that prohibit language features with known safety risks. Compliance checking verifies that code meets applicable standard requirements. Additional analysis may examine worst-case execution time for real-time safety functions and stack usage to prevent overflow.

Common Cause Failure Analysis

Common cause failures defeat redundancy by causing multiple supposedly independent components to fail simultaneously. Identifying and mitigating common cause failures is essential when redundancy provides safety protection.

Sources of Common Cause Failure

Common cause failures arise from shared vulnerabilities among redundant components. Environmental factors such as temperature extremes, humidity, electromagnetic interference, or power supply disturbances can affect multiple components simultaneously. Design errors replicated across redundant channels defeat redundancy when the same fault activates in each channel.

Manufacturing defects from shared production processes may affect batches of components used in redundant channels. Maintenance errors affecting multiple channels simultaneously, such as incorrect calibration procedures, create common cause vulnerabilities. Software executing on redundant processors represents a systematic common cause that no amount of hardware redundancy can address.

Beta Factor Method

The beta factor method provides a simple model for quantifying common cause failure probability. The beta factor represents the fraction of total failures that affect multiple redundant channels. For example, a beta factor of 0.1 indicates that 10% of failures are common cause failures affecting all redundant channels simultaneously.

Industry databases and standards provide guidance on beta factor selection based on system design features. Factors that reduce beta include physical separation, diverse designs, independent power supplies, different environmental exposures, and staggered testing. The beta factor method provides first-pass common cause estimates, with more detailed methods available when greater precision is needed.
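A simplified beta-factor calculation for a redundant pair can be sketched as below; the channel PFD and beta value are hypothetical, and the formula is the common first-order approximation rather than a full reliability model.

```python
def dual_channel_failure_prob(pfd_channel, beta):
    """Approximate failure probability of a 1-out-of-2 redundant pair under the
    beta-factor model: independent coincident failures plus the common cause
    fraction that defeats both channels at once. First-order sketch only."""
    independent = ((1 - beta) * pfd_channel) ** 2
    common_cause = beta * pfd_channel
    return independent + common_cause

# Hypothetical channel PFD of 1e-3 with beta = 0.1:
result = dual_channel_failure_prob(1e-3, 0.1)
print(f"{result:.2e}")  # the common cause term (1e-4) swamps the independent term (~8e-7)
```

The example makes the key point numerically: with a beta of 0.1, redundancy improves the pair by roughly a factor of ten rather than the factor of a thousand that naive squaring would suggest.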

Defense Against Common Cause Failure

Defending against common cause failures requires multiple strategies. Diversity uses different technologies, designs, or implementations for redundant functions, preventing single design errors from affecting all channels. Physical separation protects against environmental common causes. Staggered testing intervals reduce vulnerability to systematic test-induced failures.

Functional diversity implements redundant functions through different algorithms or approaches. For example, a primary protection function might use calculated parameters while a backup uses direct measurements. Hardware diversity might combine electronic and mechanical protection mechanisms. Software diversity through independently developed programs provides defense against systematic software faults.

Human Factors in Hazard Analysis

Human operators interact with embedded systems as users, maintainers, and administrators. Human errors can initiate hazardous conditions or prevent recovery from system failures. Comprehensive hazard analysis must consider human factors.

Human Error Identification

Human error analysis identifies potential operator mistakes and their consequences. Task analysis breaks down human interactions with the system into discrete steps, examining each for potential errors. Error types include slips (correct intention, incorrect action), mistakes (incorrect intention), and violations (deliberate deviation from procedures).

Common human errors in embedded system interaction include incorrect parameter entry, wrong mode selection, misinterpretation of display information, failure to respond to alarms, and incorrect maintenance actions. The analysis identifies which errors could contribute to hazardous conditions and estimates their likelihood based on task complexity, training, and interface design.

Human Reliability Analysis

Human Reliability Analysis (HRA) quantifies human error probabilities for inclusion in probabilistic safety assessment. Techniques such as THERP (Technique for Human Error Rate Prediction) and HEART (Human Error Assessment and Reduction Technique) provide frameworks for estimating error probabilities based on task characteristics and performance shaping factors.

Performance shaping factors include time available, training, procedures, human-machine interface design, stress level, and environmental conditions. Favorable factors reduce error probability while unfavorable factors increase it. HRA results enable inclusion of human actions in fault trees and event trees alongside hardware and software failures.
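A HEART-style calculation can be sketched as below. This is a simplified illustration: the nominal error probability and multipliers are hypothetical, and the multipliers are assumed to already incorporate the assessor's proportion-of-effect weighting that full HEART applies to each error-producing condition.

```python
def human_error_probability(nominal_hep, condition_multipliers):
    """HEART-style sketch: scale a nominal human error probability by the
    assessed effect of each unfavorable performance shaping factor.
    Multipliers here are assumed to include the assessor's weighting."""
    hep = nominal_hep
    for m in condition_multipliers:
        hep *= m
    return min(hep, 1.0)  # a probability cannot exceed 1

# Hypothetical: routine task (nominal HEP 0.003) degraded by time shortage (x3)
# and a poor human-machine interface (x2), per the assessor's judgment.
print(human_error_probability(0.003, [3, 2]))  # ~0.018
```

The resulting probability can then be attached to a human-action event in a fault tree or event tree, alongside hardware and software failure probabilities.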

Designing for Human Reliability

System design can reduce human error probability and mitigate error consequences. Clear, consistent user interfaces reduce cognitive load and interpretation errors. Confirmation requirements for critical actions prevent inadvertent activations. Feedback mechanisms inform operators of system state and action results.

Error tolerance designs detect human errors and either prevent their propagation or enable easy correction. Input validation catches out-of-range entries before they affect system behavior. Undo capabilities allow recovery from incorrect actions. Safety interlocks prevent dangerous actions regardless of operator commands. These design features complement operator training and procedures to minimize human contribution to accidents.

Documentation and Traceability

Hazard analysis produces essential documentation that demonstrates safety due diligence and supports ongoing safety management. Proper documentation enables review, audit, and maintenance of safety analysis throughout system lifecycle.

Hazard Log

The hazard log serves as the central repository for all identified hazards and their management status. Each hazard entry includes unique identification, description, severity classification, current status, responsible party, and links to supporting analysis. The hazard log tracks hazards from initial identification through resolution and verification.

Hazard status categories typically include open (identified but not yet addressed), mitigated (controls implemented but not verified), closed (controls verified effective), or transferred (responsibility assigned to another party). The hazard log provides visibility into safety analysis progress and outstanding safety concerns.
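A minimal hazard log record can be sketched as a data structure with the fields and status categories described above; the identifiers and hazard descriptions are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum

class Status(Enum):
    OPEN = "open"               # identified but not yet addressed
    MITIGATED = "mitigated"     # controls implemented but not verified
    CLOSED = "closed"           # controls verified effective
    TRANSFERRED = "transferred" # responsibility assigned to another party

@dataclass
class HazardEntry:
    hazard_id: str
    description: str
    severity: str
    status: Status = Status.OPEN
    owner: str = "unassigned"
    analysis_refs: list = field(default_factory=list)  # links to supporting FTA/FMEA records

# Hypothetical log entries:
log = [
    HazardEntry("HZ-001", "unintended motor activation", "critical",
                Status.MITIGATED, "controls team", ["FTA-12"]),
    HazardEntry("HZ-002", "loss of braking capability", "catastrophic"),
]

# Visibility into outstanding safety concerns:
open_items = [h.hazard_id for h in log if h.status is Status.OPEN]
print(open_items)
```

Real hazard logs live in requirements-management or dedicated safety tools, but the essential content is exactly this set of fields plus the traceability links discussed in the next subsection's sense: identification, classification, status, ownership, and supporting analysis.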

Safety Requirements Traceability

Safety requirements trace from hazards through design solutions to verification evidence. Forward traceability ensures that each hazard has corresponding safety requirements and that each requirement has implementing design elements. Backward traceability confirms that all safety design features address identified hazards rather than implementing unnecessary protections.

Traceability matrices document relationships between hazards, requirements, design elements, and verification activities. These matrices support impact analysis when changes occur, identifying which safety requirements might be affected by proposed modifications. Maintaining traceability throughout development and modification enables ongoing safety assurance.

Safety Case Structure

A safety case presents the structured argument that a system is acceptably safe for its intended use. The safety case compiles evidence from hazard analysis, design documentation, verification results, and operational experience into a coherent argument. Goal Structuring Notation (GSN) and Claims-Arguments-Evidence (CAE) provide standard formats for presenting safety arguments.

The safety case demonstrates that all significant hazards have been identified, that appropriate mitigations address each hazard, that mitigations have been correctly implemented, and that residual risks are tolerable. The safety case provides the foundation for regulatory approval and ongoing safety management throughout operational life.

Lifecycle Integration

Hazard analysis activities occur throughout the system lifecycle, with different techniques appropriate at different phases. Integrating hazard analysis with development processes ensures that safety considerations influence design decisions at appropriate points.

Concept Phase Analysis

Concept phase analysis identifies hazards early when fundamental design decisions are being made. Preliminary Hazard Analysis examines the system concept to identify inherent hazards and influence architecture selection. Early hazard identification enables hazard elimination through design rather than requiring later mitigation of inherent hazards.

Concept phase results include initial hazard lists, preliminary safety requirements, and architecture constraints. These results guide subsequent detailed design and establish safety targets for later verification. Investing in thorough concept phase analysis pays dividends through reduced rework and safer fundamental designs.

Design Phase Analysis

Design phase analysis examines detailed designs for compliance with safety requirements and identification of additional hazards. FMEA, FTA, and HAZOP provide systematic examination of design details. Interface analysis verifies safe interaction between components. SIL verification confirms that safety function designs meet integrity requirements.

Design phase hazard analysis is iterative, with results feeding back into design modifications. Design reviews incorporate hazard analysis results, requiring resolution of identified concerns before proceeding. The hazard log tracks design-phase hazards and their resolution through design changes or acceptance rationale.

Implementation and Test Phase Analysis

Implementation phase analysis verifies that detailed implementations match analyzed designs. Code-level analysis examines software implementations for hazardous patterns. Hardware inspections verify that built systems match designs and specifications. Test results provide evidence that safety functions perform as required.

Testing specifically exercises safety functions and hazard mitigations. Test cases derived from hazard analysis verify that identified hazard scenarios are properly handled. Fault injection testing demonstrates correct response to failure conditions. Integration testing verifies safe interaction between system components.
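A fault injection test derived from hazard analysis might look like the following sketch. The controller, its API, and the sensor range are invented for illustration; the point is the pattern: inject the analyzed failure condition and assert that the system reaches its defined safe state.

```python
# Hypothetical motor controller: out-of-range speed readings indicate a
# failed sensor, and the analyzed safe state is motor de-energized.
class MotorController:
    SPEED_MAX = 5000  # assumed valid sensor range, rpm

    def __init__(self):
        self.safe_state = False
        self.output = 0

    def update(self, speed_reading):
        # Plausibility check on the sensor input
        if not (0 <= speed_reading <= self.SPEED_MAX):
            self.enter_safe_state()
            return
        self.output = speed_reading

    def enter_safe_state(self):
        self.safe_state = True
        self.output = 0  # de-energize: motor off is the safe state here

def test_sensor_fault_forces_safe_state():
    ctrl = MotorController()
    ctrl.update(1200)    # normal operation
    assert not ctrl.safe_state
    ctrl.update(65535)   # injected fault: stuck-high sensor value
    assert ctrl.safe_state and ctrl.output == 0
```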

Operational Phase Analysis

Operational phase analysis monitors safety performance and identifies emerging hazards. Incident investigation examines near-misses and accidents to identify previously unrecognized hazards or inadequate mitigations. Safety performance indicators track leading indicators of safety degradation. Periodic safety reviews assess whether original hazard analysis remains valid.

Changes during operation require impact assessment against the hazard analysis. Modifications might introduce new hazards, affect existing mitigations, or invalidate analysis assumptions. Change management processes ensure that hazard analysis is updated to reflect system changes and that safety is maintained throughout modifications.

Industry-Specific Considerations

Different industries have developed specific hazard analysis practices aligned with their regulatory frameworks and technical characteristics. Understanding these domain-specific approaches is essential for practitioners working across industries.

Automotive Safety Analysis

Automotive functional safety under ISO 26262 emphasizes Hazard Analysis and Risk Assessment (HARA) as a key activity. HARA evaluates vehicle-level hazards considering severity, exposure probability, and controllability. The Automotive Safety Integrity Level (ASIL) rating derives from these factors, ranging from ASIL A (lowest) to ASIL D (highest), with QM (Quality Management) indicating that standard quality processes suffice and no ASIL-specific requirements apply.

Automotive analysis must consider diverse operating conditions, driver populations, and road environments. The controllability factor reflects whether a typical driver could maintain vehicle control if a hazard occurs. Fault tolerant time intervals (FTTI) define how quickly safety mechanisms must respond to faults.
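The HARA classification can be sketched as a lookup over the severity, exposure, and controllability classes. The ISO 26262-3 determination table happens to coincide with a simple sum over the class indices (S1–S3, E1–E4, C1–C3), which the sketch below exploits; verify any real implementation against the table in the standard itself.

```python
# ASIL determination sketch: the ISO 26262-3 table is reproduced by
# summing the class indices, with totals 7..10 mapping to ASIL A..D
# and lower totals to QM. Check against the standard before real use.
def determine_asil(s, e, c):
    """s in 1..3 (severity), e in 1..4 (exposure), c in 1..3 (controllability)."""
    if not (1 <= s <= 3 and 1 <= e <= 4 and 1 <= c <= 3):
        raise ValueError("class index out of range")
    return {7: "ASIL A", 8: "ASIL B", 9: "ASIL C", 10: "ASIL D"}.get(s + e + c, "QM")
```

For example, a hazard classified S3/E4/C3 (severe, high exposure, uncontrollable) yields ASIL D, the most demanding level.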

Medical Device Risk Analysis

Medical device risk management under ISO 14971 requires systematic risk analysis considering device failures and user errors. Medical device hazards include direct harm from device malfunction as well as indirect harm from diagnostic errors or treatment delays. Risk acceptability considers both individual patient risk and population-level risk-benefit balance.

Medical device analysis involves clinical risk assessment requiring medical domain expertise alongside engineering analysis. Post-market surveillance provides ongoing hazard identification from field experience. Software of Unknown Provenance (SOUP) analysis addresses risks from third-party software components.
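Risk acceptability under ISO 14971 is typically judged with a manufacturer-defined matrix of severity against probability of occurrence. The scales and acceptance regions below are illustrative examples a manufacturer would define, not values prescribed by the standard.

```python
# Illustrative risk acceptability matrix in the spirit of ISO 14971.
# Scale labels and region boundaries are assumptions for this sketch.
SEVERITY = ["negligible", "minor", "serious", "critical", "catastrophic"]
PROBABILITY = ["improbable", "remote", "occasional", "probable", "frequent"]

def risk_region(severity, probability):
    score = SEVERITY.index(severity) + PROBABILITY.index(probability)
    if score <= 2:
        return "acceptable"
    if score <= 5:
        return "ALARP"        # reduce as low as reasonably practicable
    return "unacceptable"     # risk control measures required
```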

Aerospace Safety Assessment

Aerospace safety assessment follows ARP 4761 for system safety and ARP 4754A for development assurance. Functional Hazard Assessment (FHA) identifies failure conditions and classifies their severity from minor to catastrophic. Preliminary System Safety Assessment (PSSA) allocates safety requirements to system elements. System Safety Assessment (SSA) verifies that implemented designs meet requirements.

Aerospace safety analysis must demonstrate independence between redundant systems and address common cause concerns rigorously. Development Assurance Levels (DAL) correspond to failure condition severity, imposing development process requirements similar to SIL requirements. The combination of system-level probability targets and process-level assurance provides comprehensive safety demonstration.
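The correspondence between failure condition severity, Development Assurance Level, and quantitative probability targets can be tabulated as below. The per-flight-hour targets shown are the values commonly cited from AMC 25.1309 and ARP 4761; confirm them against the applicable certification basis for a given program.

```python
# Failure condition severity -> (DAL, maximum probability per flight hour).
# "No effect" conditions carry no quantitative requirement.
DAL_TABLE = {
    "catastrophic": ("DAL A", 1e-9),
    "hazardous":    ("DAL B", 1e-7),
    "major":        ("DAL C", 1e-5),
    "minor":        ("DAL D", 1e-3),
    "no effect":    ("DAL E", None),
}

def assurance_target(severity):
    """Look up the development assurance level and probability target."""
    return DAL_TABLE[severity]
```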

Industrial Control Safety Analysis

Industrial safety instrumented systems under IEC 61511 implement safety functions protecting against process hazards. HAZOP analysis identifies process deviations that could lead to hazardous conditions. Layer of Protection Analysis (LOPA) determines the required SIL for safety instrumented functions based on initiating event frequency and existing protection layers.

Industrial applications must consider systematic failures in safety systems and dangerous undetected failures that could prevent response to demands. Proof test intervals and diagnostic coverage significantly affect achievable SIL for safety instrumented systems. Safety requirements specifications document the safety instrumented function requirements derived from hazard analysis.
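The LOPA calculation reduces to simple arithmetic: the initiating event frequency is multiplied by the probability of failure on demand (PFD) of each existing protection layer, and the remaining gap to the tolerable frequency sets the PFD, and hence the SIL band, required of the safety instrumented function. The sketch below uses illustrative frequencies (per year) and the IEC 61508 SIL bands; boundary-value conventions should be checked against project practice.

```python
# Map a required PFD onto IEC 61508/61511 SIL bands.
def sil_band(pfd_required):
    if pfd_required >= 1e-1:
        return 0  # required risk reduction is below the SIL 1 range
    for sil, lower in ((1, 1e-2), (2, 1e-3), (3, 1e-4), (4, 1e-5)):
        if pfd_required >= lower:
            return sil
    raise ValueError("required risk reduction beyond SIL 4")

def required_sil(initiating_freq, ipl_pfds, tolerable_freq):
    """LOPA sketch: credit independent protection layers, then size the SIF."""
    mitigated = initiating_freq
    for pfd in ipl_pfds:          # each IPL reduces the event frequency
        mitigated *= pfd
    if mitigated <= tolerable_freq:
        return None               # existing layers suffice; no SIF needed
    return sil_band(tolerable_freq / mitigated)
```

For instance, a 0.1/yr initiating event with one existing layer (PFD 0.1) against a tolerable frequency of 3e-5/yr leaves a required PFD of about 3e-3, a SIL 2 function.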

Tools and Automation

Software tools support hazard analysis by managing information, performing calculations, and maintaining traceability. While tools do not replace engineering judgment, they enable efficient analysis of complex systems.

Hazard Analysis Tools

Dedicated hazard analysis tools provide structured support for techniques such as FMEA, FTA, and HAZOP. These tools enforce consistent methodology, maintain analysis databases, and generate reports. Integration between tools enables propagation of results from one analysis type to another.

Fault tree analysis tools include graphical editors for tree construction, libraries of common gate types, and calculation engines for quantitative analysis. Event tree tools similarly support graphical construction and probability calculation. Some tools integrate fault tree and event tree analysis for comprehensive accident modeling.
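The calculation engines mentioned above rest on simple gate arithmetic: for independent basic events, an AND gate multiplies input probabilities, and an OR gate combines them via the complement. A minimal sketch, with illustrative failure probabilities:

```python
# Quantitative fault tree gates, assuming independent basic events.
def gate_probability(gate, inputs):
    if gate == "AND":
        p = 1.0
        for q in inputs:
            p *= q                  # all inputs must fail
        return p
    if gate == "OR":
        p_none = 1.0
        for q in inputs:
            p_none *= (1.0 - q)     # probability that no input fails
        return 1.0 - p_none
    raise ValueError(f"unknown gate type: {gate}")

# Top event: primary pump fails AND backup fails, where each pump fails
# if its motor OR its controller fails (probabilities are illustrative).
pump = gate_probability("OR", [1e-3, 1e-4])
backup = gate_probability("OR", [1e-3, 1e-4])
top = gate_probability("AND", [pump, backup])
```

The redundancy shows up directly in the numbers: each pump fails with probability near 1.1e-3, but the top event requires both, giving roughly 1.2e-6.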

Model-Based Safety Analysis

Model-based safety analysis derives safety analysis artifacts from system models. Failure annotations on model elements enable automatic generation of FMEA tables and fault trees. This approach ensures consistency between system models and safety analysis while reducing manual analysis effort.

Tools such as AADL (Architecture Analysis and Design Language) with error annexes support modeling of fault behavior alongside normal system behavior. Automated analysis extracts fault propagation paths and generates fault trees from annotated models. Model-based approaches particularly benefit systems undergoing frequent modification, as safety analysis updates automatically with model changes.
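The core of such automated analysis is a traversal over failure annotations: each component declares its local failure modes and its upstream inputs, and the tool collects every basic failure that can propagate to a point of interest, effectively the leaf events of a derived fault tree. A minimal sketch with an invented three-component model:

```python
# Toy annotated model: each component lists its local failure modes and
# the upstream components whose failures propagate to its output.
MODEL = {
    "sensor":     {"fails": ["sensor_stuck"], "inputs": []},
    "filter":     {"fails": ["filter_crash"], "inputs": ["sensor"]},
    "controller": {"fails": ["cpu_fault"],    "inputs": ["filter"]},
}

def reachable_failures(component, model):
    """Collect all basic failures that can propagate to this component."""
    causes = set(model[component]["fails"])
    for upstream in model[component]["inputs"]:
        causes |= reachable_failures(upstream, model)
    return causes
```

When the model changes, for example a redundant sensor path is added, rerunning the traversal regenerates the analysis, which is the consistency benefit noted above.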

Requirements Management Integration

Integration between hazard analysis tools and requirements management systems maintains traceability throughout development. Safety requirements derived from hazard analysis link to their originating hazards. Design elements link to the safety requirements they implement. Verification activities link to the requirements they verify.

Change impact analysis uses these links to identify safety analysis affected by proposed changes. Traceability reports demonstrate completeness of safety requirement implementation and verification. This integration supports both development efficiency and regulatory compliance demonstration.
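Change impact analysis over these links amounts to a graph traversal: starting from the modified artifact, follow the traceability links downstream and collect every requirement, design element, and verification activity that may need re-examination. The identifiers below are invented for illustration.

```python
# Traceability links: hazard -> requirements -> design elements -> tests.
LINKS = {
    "HAZ-001": ["REQ-010"],
    "REQ-010": ["DES-021", "DES-022"],
    "DES-021": ["TEST-101"],
    "DES-022": ["TEST-102"],
}

def impacted(artifact, links):
    """Return all artifacts reachable downstream of a changed artifact."""
    seen = set()
    stack = [artifact]
    while stack:
        node = stack.pop()
        for child in links.get(node, []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen
```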

Best Practices

Analysis Team Composition

Effective hazard analysis requires diverse expertise including system design, domain knowledge, safety engineering, and human factors. Team composition should include people with different perspectives who can challenge assumptions and identify overlooked hazards. Independent review by safety engineers not involved in design provides additional hazard identification.

Appropriate Technique Selection

Different hazard analysis techniques have different strengths and limitations. Selecting appropriate techniques requires considering system characteristics, lifecycle phase, available information, and regulatory requirements. Most safety-critical systems benefit from multiple complementary techniques rather than reliance on any single method.

Iterative Analysis

Hazard analysis should be iterative, with results from each cycle informing subsequent analysis and design. Initial broad analysis identifies major hazards and guides architecture decisions. Subsequent detailed analysis examines specific design solutions. Analysis continues through testing and operation as new information becomes available.

Conservative Assumptions

When uncertainty exists, hazard analysis should make conservative assumptions that do not underestimate risk. Severity assessments should consider worst-case consequences. Probability estimates should not rely on unverified assumptions about component reliability or human performance. Conservative analysis ensures that safety measures provide adequate protection even when analysis inputs prove optimistic.

Summary

Hazard analysis and risk assessment provide the foundation for safety-critical embedded system development. Through systematic identification of hazards, rigorous evaluation of risks, and determination of appropriate safety integrity requirements, these processes ensure that safety considerations drive design decisions throughout development.

The discipline encompasses diverse techniques from preliminary hazard analysis through detailed fault tree and event tree methods. Each technique addresses particular aspects of system safety, and comprehensive analysis typically employs multiple complementary approaches. Software-specific and human factors analyses extend traditional methods to address the full scope of embedded system hazards.

Effective hazard analysis requires integration with system development processes, proper documentation for traceability and audit, and ongoing attention throughout the operational lifecycle. By following established methodologies and best practices, engineers can develop embedded systems that protect human life and critical infrastructure while meeting increasingly demanding regulatory requirements.