Electronics Guide

Human Reliability Analysis

Human Reliability Analysis encompasses the systematic methods and techniques used to identify, model, and quantify human contributions to system performance and risk. While electronic systems are often designed with careful attention to hardware reliability, the humans who operate, maintain, and interact with these systems can be significant contributors to both successful operation and system failures. Research across industries consistently shows that human factors are involved in a substantial proportion of accidents and incidents, making human reliability analysis an essential complement to traditional hardware-focused reliability engineering.

The field of human reliability analysis emerged from the nuclear power industry in the 1970s and 1980s, driven by incidents that highlighted the critical role of human operators in maintaining safety. The Three Mile Island accident in 1979 demonstrated how human errors, combined with inadequate instrumentation and procedures, could lead to catastrophic consequences. Since then, human reliability analysis methods have been refined and applied across aviation, chemical processing, healthcare, transportation, and increasingly in complex electronic systems where human-machine interaction is integral to system performance.

For electronics engineers, human reliability analysis provides tools to quantify the human contribution to overall system reliability, identify scenarios where human errors are most likely, design interfaces that reduce error probability, and develop procedures and training that enhance human performance. This article provides comprehensive coverage of human reliability analysis methods, from foundational concepts through advanced techniques, enabling engineers to integrate human factors into reliability programs for safety-critical electronic systems.

Fundamentals of Human Error

Understanding Human Error

Human error refers to human actions or decisions that lead to outcomes other than those intended. This definition encompasses both active errors, where the immediate consequence is apparent, and latent errors, where the consequence may not become apparent until later when combined with other factors. Understanding the nature of human error is foundational to human reliability analysis because it determines how errors are classified, modeled, and addressed.

The concept of error implies deviation from some standard of correct performance. In well-defined tasks, this standard may be explicit: following a procedure, achieving a specified measurement, or responding within a required time. In more complex situations, the standard of correct performance may be implicit or subject to interpretation, making error classification more challenging. Human reliability analysis must account for both well-defined tasks where errors are clear and complex situations where the boundary between error and reasonable judgment is less distinct.

Human error should not be conflated with human blame. Modern safety science recognizes that errors are inevitable consequences of human cognitive architecture operating in complex environments. Humans evolved cognitive mechanisms optimized for survival in natural environments, not for operating complex technological systems. Understanding this evolutionary mismatch helps engineers design systems that accommodate human cognitive limitations rather than demanding superhuman performance.

The frequency and consequences of human error depend heavily on system design, procedural quality, training effectiveness, and organizational factors. Blaming individuals for errors that the system design made likely accomplishes nothing except to discourage error reporting and impede organizational learning. Effective human reliability analysis focuses on system-level factors that can be modified to reduce error likelihood and consequence rather than on assigning individual blame.

Error Classification Taxonomies

Human reliability analysis benefits from systematic classification of error types because different error types have different causes and require different interventions. The most influential error taxonomy, developed by cognitive psychologist James Reason, distinguishes between slips, lapses, and mistakes based on the underlying cognitive mechanism.

Slips are errors of execution where the intention is correct but the action is carried out incorrectly. A technician who intends to turn a dial clockwise but turns it counterclockwise has committed a slip. Slips typically occur when attention is diverted from a routine task, when similar actions are required in different contexts, or when interruptions disrupt the sequence of actions. Slips are associated with skill-based behavior where actions are largely automatic.

Lapses are errors of memory where intended actions are omitted or performed out of sequence. A maintenance technician who forgets to replace a safety guard after servicing equipment has committed a lapse. Lapses occur when memory demands exceed capacity, when interruptions disrupt task completion, or when tasks have many steps. Unlike slips, which involve incorrect execution of the intended action, lapses involve failure to execute intended actions at all.

Mistakes are errors of planning or problem-solving where the intention itself is inappropriate for the situation. Mistakes occur when operators misdiagnose situations, apply inappropriate rules, or make flawed judgments under uncertainty. A control room operator who misinterprets instrument readings and takes actions appropriate for a different scenario has committed a mistake. Mistakes are associated with rule-based and knowledge-based behavior rather than the automatic processing that underlies skill-based slips and lapses.

Violations represent a distinct category where rules or procedures are deliberately not followed. Violations may be routine, where non-compliance has become normalized; situational, where circumstances seem to justify deviation; or exceptional, where unusual circumstances lead to one-time non-compliance. While violations involve intentional deviation, they rarely involve intent to cause harm; operators typically believe the violation is justified or low-risk. Understanding violation patterns is essential for designing effective compliance systems.

Performance Shaping Factors

Performance shaping factors are conditions that influence human performance, increasing or decreasing the likelihood of error. Human reliability analysis methods use performance shaping factors to adjust base error probabilities for specific contexts. Understanding these factors enables engineers to design systems and procedures that optimize conditions for human performance.

Task-related factors include complexity, time pressure, workload, and the clarity of task requirements. Complex tasks with many steps, decision points, or interdependencies increase error likelihood. Severe time pressure forces operators to take shortcuts and reduces opportunity for error detection. High workload consumes cognitive resources that might otherwise be available for error checking. Ambiguous task requirements increase the probability of mistakes.

Environmental factors include noise, lighting, temperature, and workspace design. Excessive noise interferes with communication and increases cognitive load. Poor lighting affects visual inspection accuracy and increases fatigue. Temperature extremes degrade cognitive and physical performance. Workspace designs that require awkward postures or excessive reaching increase error likelihood for physical tasks.

Individual factors include training, experience, fatigue, and stress. Inadequate training leaves operators unprepared for situations they encounter. Lack of experience means operators have not developed the pattern recognition that experts use to identify problems quickly. Fatigue degrades vigilance, reaction time, and decision-making quality. Stress, whether from work demands or personal factors, narrows attention and degrades performance on complex tasks.

Organizational factors include safety culture, management commitment, resource allocation, and communication patterns. Organizations that tolerate procedural violations normalize behaviors that increase risk. Management decisions that prioritize production over safety send signals that influence operator behavior. Inadequate staffing increases workload and reduces opportunity for peer checking. Poor communication leads to misunderstandings that propagate through the organization.

Human Error in Electronic Systems

Electronic systems present particular human error challenges and opportunities. The increasing sophistication of electronic control systems has automated many tasks previously performed by humans, but automation creates new human factors challenges even as it addresses others. Understanding how humans interact with electronic systems is essential for effective human reliability analysis.

Automation can reduce error by taking over routine tasks that humans perform unreliably, but it introduces new error modes related to monitoring, mode confusion, and automation surprises. Operators monitoring automated systems may fail to detect automation failures because sustained vigilance is cognitively demanding. Complex automation with multiple operating modes creates opportunities for mode confusion where operators believe the system is in a different state than it actually is. Automation surprises occur when automated systems behave in ways operators do not expect.

Interface design directly affects human error probability in electronic systems. Displays that present information clearly and in formats aligned with operator mental models support accurate situation assessment. Controls that provide clear feedback about system state reduce errors of execution. Alarm systems that prioritize alerts and suppress nuisance alarms help operators identify and respond to genuine problems. Poor interface design increases workload and error probability.

Electronic systems increasingly incorporate software that mediates human interaction with hardware. Software-related human errors include data entry errors, navigation errors in complex menu structures, and misunderstanding of software functions. Software can also incorporate error prevention and detection features such as input validation, confirmation dialogs, and automated checking. Human reliability analysis for electronic systems must address both the human-hardware and human-software interfaces.

Human Error Probability Assessment

Principles of Probability Assessment

Human error probability assessment quantifies the likelihood that human errors will occur in specific tasks and contexts. Unlike component failure rates derived from accelerated testing or field data, human error probabilities cannot be determined experimentally for most safety-critical tasks because errors are rare and the consequences of inducing errors would be unacceptable. Human reliability analysis methods therefore rely on structured expert judgment, anchoring on data from analogous situations, and theoretical models of human cognition.

Base human error probabilities represent error rates for generic task types under nominal conditions. These base rates are derived from empirical studies, operational data, and expert consensus. For simple tasks performed frequently under good conditions, base error probabilities may be on the order of one in a thousand. For complex tasks performed under stress with poor interface design, error probabilities may approach certainty. Human reliability analysis adjusts these base rates for the specific performance shaping factors present in each analyzed situation.

Uncertainty in human error probability estimates is substantial and must be acknowledged. Unlike component failure rates that may be known within a factor of two, human error probabilities may be uncertain by an order of magnitude or more. This uncertainty reflects variability in human performance, limitations in the data used to derive base rates, and difficulty in precisely characterizing performance shaping factors. Human reliability analysis should report uncertainty bounds alongside point estimates.
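
One common way of expressing such bounds is to characterize the error probability by a median value and an error factor on a lognormal distribution. The short Python sketch below shows the arithmetic, under the assumption that the error factor is the ratio of the 95th percentile to the median; the numerical values are illustrative only.

# Sketch: expressing HEP uncertainty as a lognormal distribution defined by a
# median and an error factor (EF), assuming EF = 95th percentile / median.
import math

def hep_bounds(median_hep, error_factor):
    """Approximate 5th and 95th percentile bounds for the HEP."""
    return median_hep / error_factor, median_hep * error_factor

def hep_mean(median_hep, error_factor):
    """Mean of the implied lognormal; sigma follows from EF = exp(1.645 * sigma)."""
    sigma = math.log(error_factor) / 1.645
    return median_hep * math.exp(0.5 * sigma ** 2)

median, ef = 1e-3, 10  # illustrative values only
low, high = hep_bounds(median, ef)
print(f"median={median:.1e}  5th={low:.1e}  95th={high:.1e}  mean={hep_mean(median, ef):.1e}")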

The appropriate precision of human error probability estimates depends on the purpose of the analysis. For screening analyses intended to identify dominant risk contributors, order-of-magnitude estimates may be sufficient. For detailed analyses supporting safety cases, greater precision and explicit uncertainty treatment are required. Analysts should match analytical effort to decision needs rather than pursuing false precision.

Data Sources for Human Error Probability

Several sources provide data for human error probability estimation, each with strengths and limitations. Operational experience data from incident reports, maintenance records, and quality records provides information about actual error rates in specific contexts. This data has the advantage of relevance to actual operations but may suffer from incomplete reporting, especially for errors that are caught before causing consequences.

Simulator studies provide controlled environments where error rates can be measured for specific scenarios. Simulator data can address error types that are rare in actual operations and can examine performance under conditions too dangerous to create in reality. However, simulator behavior may differ from real-world behavior because operators know they are being observed and because simulators cannot fully replicate operational stress.

Task analysis data from human factors studies provides information about error likelihood for generic task types. This data supports estimation of base error probabilities that are adjusted for specific applications. The generic nature of this data is both a strength, enabling application across contexts, and a limitation, requiring careful consideration of how specific contexts differ from the conditions under which data was collected.

Expert judgment provides essential input for human error probability estimation, particularly for unique or complex situations not well represented in empirical data. Structured elicitation methods such as absolute probability judgment, paired comparisons, and the Delphi technique help experts provide useful estimates while managing biases. Expert judgment should be documented, including the basis for estimates and the qualifications of the experts consulted.

Quantification Methods

Several approaches exist for quantifying human error probability. Direct estimation methods ask experts to provide probability estimates for specific errors in specific contexts. This approach is flexible but depends heavily on expert quality and is vulnerable to biases in probability estimation. Decomposition methods break complex tasks into simpler elements with better-established error probabilities, then combine element probabilities to estimate overall task error probability.
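
As a minimal sketch of the decomposition approach, the Python fragment below combines hypothetical element error probabilities into an overall task failure probability, assuming the elements are independent and that any single element error fails the task; dependencies, discussed later in this article, would change this calculation.

# Sketch: combining element HEPs for a decomposed task, assuming independence
# and that any element error causes task failure. Values are hypothetical.
from math import prod

def task_failure_probability(element_heps):
    return 1.0 - prod(1.0 - p for p in element_heps)

# Hypothetical elements: read step, select control, set value, verify result
element_heps = [0.001, 0.003, 0.01, 0.1]
print(f"Overall task failure probability: {task_failure_probability(element_heps):.4f}")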

Simulation-based methods use computational models of human cognition to predict error probabilities. These models incorporate theories of human information processing, decision-making, and action execution. While theoretically attractive, simulation methods require substantial expertise to apply and validate, and their predictions should be verified against empirical data where possible.

Bayesian methods provide frameworks for combining prior information with new evidence to update probability estimates. Prior distributions may be derived from generic data or expert judgment. Likelihood functions represent the probability of observing specific evidence given different underlying error rates. Posterior distributions represent updated beliefs after incorporating evidence. Bayesian methods provide explicit treatment of uncertainty and support rational integration of multiple information sources.
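
A minimal sketch of such an update is shown below, assuming a conjugate Beta prior on the error probability and binomial evidence in the form of a count of errors observed over a number of task opportunities; the prior parameters and evidence counts are hypothetical.

# Sketch: Bayesian update of an HEP with a Beta prior and binomial evidence.
def beta_update(alpha_prior, beta_prior, errors, opportunities):
    """Posterior Beta parameters after observing the evidence."""
    return alpha_prior + errors, beta_prior + (opportunities - errors)

def beta_mean(alpha, beta):
    return alpha / (alpha + beta)

alpha0, beta0 = 0.5, 49.5        # illustrative prior, mean HEP of 0.01
alpha1, beta1 = beta_update(alpha0, beta0, errors=2, opportunities=400)
print(f"prior mean HEP:     {beta_mean(alpha0, beta0):.4f}")
print(f"posterior mean HEP: {beta_mean(alpha1, beta1):.4f}")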

Whatever quantification method is used, results should be subjected to reasonableness checks. Do the resulting probabilities fall within plausible ranges based on general experience? Are relative probabilities across different errors consistent with intuition about their relative likelihood? Do results respond appropriately to changes in performance shaping factors? Such checks help identify errors in analysis and build confidence in results.

Treatment of Dependencies

Dependencies between human errors complicate probability assessment. If errors in different tasks are independent, the probability of both errors occurring is simply the product of individual probabilities. However, human errors are rarely truly independent. Common causes such as poor training, fatigue, or misleading displays can make multiple errors more likely. Errors in sequential tasks may be dependent if the first error affects the conditions under which the second task is performed.

Positive dependence increases the probability of combined errors above what independence would predict. If an operator's fatigue causes one error, the same fatigue makes subsequent errors more likely. Negative dependence decreases combined error probability; if an operator catches one error, heightened awareness may reduce subsequent error probability. Failing to account for dependencies can significantly bias human reliability analysis results.

Dependency models provide frameworks for adjusting combined probabilities to account for dependence. The simplest approach classifies dependence as zero, low, medium, high, or complete and applies corresponding adjustment factors. More sophisticated models derive dependency adjustments from analysis of shared performance shaping factors or from explicit modeling of common causes.
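
For illustration, the sketch below implements the conditional-probability adjustments commonly associated with this five-level classification (the equations originate in the THERP method described later in this article); p is the nominal error probability of the dependent action, and the value used here is hypothetical.

# Sketch: conditional HEP of a subsequent action given failure of a preceding
# action, using the five dependence levels and their associated adjustments.
def conditional_hep(p, dependence):
    formulas = {
        "zero":     p,
        "low":      (1 + 19 * p) / 20,
        "medium":   (1 + 6 * p) / 7,
        "high":     (1 + p) / 2,
        "complete": 1.0,
    }
    return formulas[dependence]

nominal = 0.01  # hypothetical nominal HEP of the dependent action
for level in ("zero", "low", "medium", "high", "complete"):
    print(f"{level:8s} dependence: conditional HEP = {conditional_hep(nominal, level):.3f}")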

Recovery dependencies deserve particular attention. If multiple opportunities exist to detect and recover from an error, these recovery actions may share common causes of failure. An error in procedure interpretation that causes the initial error may also cause failure to recognize the error during recovery attempts. Human reliability analysis should explicitly model recovery and its dependencies with initial errors.

Technique for Human Error Rate Prediction

THERP Methodology Overview

The Technique for Human Error Rate Prediction (THERP), developed by Alan Swain and colleagues at Sandia National Laboratories, represents one of the most comprehensive and widely applied human reliability analysis methods. THERP provides a systematic framework for decomposing human tasks into elements, assigning error probabilities to each element based on task type and performance shaping factors, and combining element probabilities to estimate overall task reliability. The method is documented in NUREG/CR-1278, published in 1983 and commonly known as the THERP Handbook.

THERP is based on the premise that human tasks can be meaningfully decomposed into discrete actions whose error probabilities can be estimated from empirical data and expert judgment. The method provides extensive tables of human error probabilities for different task types, derived from nuclear power plant operational experience, simulator studies, and related industries. These probabilities serve as starting points that are adjusted for specific performance shaping factors.

The THERP methodology proceeds through defined steps: defining the analysis scope and objectives, performing task analysis to identify human actions, constructing event trees representing action sequences, assigning error probabilities to each event tree branch, accounting for dependencies between actions, and calculating overall task success and failure probabilities. This systematic approach ensures comprehensive coverage and supports documentation and review.

While THERP was developed for nuclear power applications, its principles apply broadly to other industries and systems. Electronics engineers can adapt THERP for analyzing human reliability in manufacturing operations, equipment maintenance, system operation, and emergency response. The method's explicit documentation and extensive database make it relatively accessible to analysts new to human reliability analysis.

Task Analysis in THERP

Task analysis is the foundation of THERP application, providing the detailed understanding of human tasks required for meaningful error probability assignment. Task analysis identifies the actions humans perform, the information they use, the decisions they make, and the sequence and timing of activities. Without thorough task analysis, human reliability analysis cannot identify all significant error opportunities or correctly characterize performance conditions.

Hierarchical task analysis decomposes tasks from high-level goals through intermediate operations to individual actions. At the highest level, a task might be defined as returning a system to normal operation after an alarm. This goal is decomposed into sub-goals such as diagnosing the cause, taking corrective action, and verifying successful correction. Each sub-goal is further decomposed until reaching the level of individual actions for which error probabilities can be assigned.

For each task element, the analyst identifies the human action type, the cues that trigger the action, the feedback that confirms correct performance, the potential errors, and the consequences of errors. Action types in THERP include procedural actions, diagnostic actions, control actions, and checking actions. Different action types have different base error probabilities reflecting their cognitive and physical requirements.

Task analysis should identify both prescribed tasks documented in procedures and actual tasks as performed by operators. Differences between prescribed and actual practice may indicate procedural inadequacies, training gaps, or work-arounds that have evolved in response to operational realities. Human reliability analysis based only on prescribed tasks may miss significant error opportunities present in actual practice.

Event Tree Modeling

THERP uses event trees to model sequences of human actions and their potential outcomes. Event trees branch at each action point, with branches representing successful action, unsuccessful action, and potentially different error modes. The tree structure makes explicit the relationships between actions and enables systematic probability calculation.

Event tree construction begins with the initiating event or task trigger and proceeds through the sequence of required actions. At each branch point, the analyst identifies possible outcomes and their probabilities. Success branches lead toward successful task completion; failure branches may lead to task failure or to recovery opportunities. The tree continues until all branches reach defined end states representing task success or various failure modes.

Recovery branches represent opportunities to detect and correct errors before they lead to final consequences. THERP explicitly models recovery through checking steps, redundant actions, and error annunciation. Recovery probabilities depend on the detectability of the error, the time available for recovery, and the independence of recovery actions from initial errors.

Event tree analysis quantifies overall task reliability by combining branch probabilities. Success probability is calculated by summing the probabilities of all paths leading to successful end states. Failure probability is the complement of success probability or can be calculated by summing paths leading to failure end states. The tree structure identifies which actions are most critical to overall reliability, guiding improvement efforts.
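
The arithmetic for a minimal two-branch example is sketched below, assuming one initial action followed by a single independent checking opportunity; the probabilities are illustrative only.

# Sketch: quantifying a minimal HRA event tree with one action and one
# recovery (checking) opportunity, assumed independent for simplicity.
p_initial_error = 0.01   # action performed incorrectly
p_recovery_fail = 0.1    # check fails to detect the error

p_task_failure = p_initial_error * p_recovery_fail
p_task_success = (1 - p_initial_error) + p_initial_error * (1 - p_recovery_fail)

print(f"P(task failure) = {p_task_failure:.4f}")
print(f"P(task success) = {p_task_success:.4f}")  # the two results sum to 1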

THERP Human Error Probability Tables

The THERP Handbook provides extensive tables of human error probabilities for different task types. These tables distinguish between action types, complexity levels, stress conditions, and other factors. Analysts select base error probabilities from these tables and adjust for situation-specific performance shaping factors not captured in table selection.

Execution error probabilities in THERP cover routine procedural actions such as operating controls, entering data, and performing measurements. Base probabilities range from about 0.001 for simple, well-practiced actions under good conditions to 0.05 or higher for complex actions under stress. Tables distinguish between different control types, different feedback conditions, and different procedural support.

Diagnosis error probabilities address cognitive tasks such as situation assessment, fault identification, and decision-making. These probabilities are generally higher and more uncertain than execution error probabilities, reflecting the greater cognitive complexity involved. THERP provides time-reliability correlations showing how diagnosis accuracy improves as more time is available for analysis.

Checking error probabilities address the effectiveness of verification activities. Self-checking catches some errors but is limited by cognitive biases that tend to confirm expected results. Independent checking by a second person is more effective but still imperfect. THERP tables quantify checking effectiveness under different conditions, enabling analysts to model the value of verification procedures.

Performance Shaping Factor Adjustments

THERP adjusts base error probabilities using multipliers for performance shaping factors. These multipliers increase or decrease error probability based on conditions that differ from the nominal conditions assumed in base probability tables. Major performance shaping factors in THERP include stress level, experience and training, quality of procedures, human-machine interface quality, and environmental conditions.
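
A minimal sketch of this multiplicative adjustment is shown below; the base probability and multiplier values are hypothetical, and the adjusted value is capped at 1.0 because it represents a probability.

# Sketch: adjusting a base HEP with performance shaping factor multipliers.
# All numbers are hypothetical; actual values come from THERP tables and analyst judgment.
def adjusted_hep(base_hep, psf_multipliers):
    hep = base_hep
    for multiplier in psf_multipliers.values():
        hep *= multiplier
    return min(hep, 1.0)

base = 0.003  # hypothetical base HEP for a routine control action
psf_multipliers = {
    "high stress": 5.0,        # emergency conditions
    "limited experience": 2.0,
    "good procedures": 0.5,    # clear, validated written guidance
}
print(f"Adjusted HEP: {adjusted_hep(base, psf_multipliers):.4f}")  # 0.003 * 5 * 2 * 0.5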

Stress effects on human performance follow an inverted-U relationship in THERP. Very low stress may be associated with complacency and reduced attention. Moderate stress optimizes performance by maintaining alertness without overwhelming cognitive capacity. High stress degrades performance by narrowing attention, increasing reliance on habitual responses, and reducing cognitive flexibility. THERP provides stress multipliers that increase error probability under high-stress emergency conditions.

Experience and training multipliers recognize that skilled operators make fewer errors than novices on familiar tasks. However, expertise does not protect against all error types; experienced operators may be more susceptible to certain mistakes based on expectations from past experience. THERP multipliers address skill-based differences while acknowledging that different error types respond differently to experience.

Interface and procedure quality multipliers reflect the observation that well-designed interfaces and clear procedures support human performance while poor designs increase error likelihood. THERP provides guidance for evaluating interface quality based on human factors principles and adjusting error probabilities accordingly. These multipliers quantify the reliability value of good human factors engineering.

Cognitive Reliability and Error Analysis Method

CREAM Methodology Overview

The Cognitive Reliability and Error Analysis Method (CREAM) represents a second-generation human reliability analysis approach that addresses some limitations of earlier methods like THERP. Developed by Erik Hollnagel, CREAM provides both retrospective analysis capabilities for understanding past events and prospective analysis capabilities for predicting future performance. The method emphasizes the context-dependent nature of human performance and provides systematic treatment of cognitive functions.

CREAM is grounded in a model of cognition that identifies four cognitive functions: observation, interpretation, planning, and execution. Errors can occur in any of these functions, and the method provides systematic identification of potential failure modes for each. This cognitive foundation enables CREAM to address a broader range of human performance issues than methods focused primarily on procedural task execution.

A distinguishing feature of CREAM is its treatment of context through Common Performance Conditions. These conditions capture aspects of the work environment, organization, and task that affect human performance across all cognitive functions. CREAM uses these conditions to determine the overall control mode, ranging from strategic to scrambled, which in turn determines the range of cognitive failure probabilities.

CREAM exists in multiple versions with different levels of detail. The basic version provides screening-level analysis with relatively simple assessment procedures. The extended version provides more detailed analysis with explicit quantification of cognitive failure probabilities. Analysts select the appropriate version based on the purpose of the analysis and available resources.

Cognitive Function Analysis

CREAM's four cognitive functions provide a framework for systematically identifying potential human performance problems. Observation refers to acquiring information from the environment through perception and attention. Interpretation involves understanding the meaning of observed information in context. Planning addresses the development of goals and strategies for action. Execution covers the implementation of planned actions through physical and cognitive activities.

Observation failures include failing to observe relevant information, observing incorrect information, and delayed observation. These failures may result from attention limitations, perceptual errors, or environmental factors that obscure relevant information. In electronic systems, observation failures often relate to display design, alarm presentation, and the salience of critical information.

Interpretation failures include wrong identification of system state, faulty diagnosis, and delayed interpretation. These failures reflect limitations in human pattern recognition, the influence of expectations on interpretation, and difficulties reasoning about complex dynamic systems. Interpretation failures are particularly significant in electronic systems with many possible states and failure modes.

Planning failures include inadequate planning, priority errors, and wrong timing. These failures may result from incomplete situation understanding, conflicting goals, or failure to anticipate consequences of actions. Planning failures often underlie mistakes where operators take actions that make sense given their flawed understanding but are inappropriate for the actual situation.

Execution failures include actions performed incorrectly, actions performed on the wrong object, actions out of sequence, and actions omitted. These failures correspond to slips and lapses in other taxonomies. Execution failures are often addressed through interface design, procedure quality, and error-proofing that makes correct actions easier and incorrect actions harder.

Common Performance Conditions

CREAM assesses nine Common Performance Conditions that characterize the context within which human performance occurs. These conditions collectively determine the control mode, which represents the overall quality of human-system interaction. Understanding these conditions enables analysts to assess how context affects performance across all cognitive functions.

Adequacy of organization addresses whether organizational structures support safe and effective performance. This includes factors such as management commitment to safety, resource allocation, communication systems, and organizational learning. Organizations with inadequate safety management create contexts where errors are more likely.

Working conditions encompass the physical environment including workspace design, lighting, noise, temperature, and access to needed tools and information. Good working conditions support human performance; poor conditions increase error likelihood across all task types.

Adequacy of human-machine interface and operational support assesses the quality of displays, controls, procedures, and other artifacts that mediate human-system interaction. Well-designed interfaces and procedures reduce error probability; poor designs increase it. This condition is particularly important for electronic systems with complex interfaces.

Availability of procedures and plans addresses whether documented guidance exists and is accessible when needed. Available procedures reduce reliance on memory and reduce variability in performance. Unavailable or inaccessible procedures force operators to rely on memory or improvisation.

Number of simultaneous goals reflects the workload imposed by concurrent tasks. Multiple simultaneous goals divide attention, increase cognitive load, and may create conflicts requiring prioritization. High numbers of simultaneous goals increase error probability.

Available time represents the relationship between time required and time available for task completion. Adequate time allows careful, deliberate performance and provides opportunity for error checking and recovery. Inadequate time forces shortcuts and reduces error detection.

Time of day reflects circadian effects on human performance. Night shifts and early morning hours are associated with reduced alertness and increased error probability. Performance variations across the day should be considered when assessing tasks performed at different times.

Adequacy of training and experience addresses whether operators have the knowledge and skills required for assigned tasks. Adequate training and experience support competent performance; inadequate preparation increases error likelihood, particularly for non-routine situations.

Crew collaboration quality assesses team performance factors including communication, coordination, and mutual monitoring. Effective teams catch and correct individual errors; dysfunctional teams may amplify errors or create new ones through miscommunication.

Control Mode Determination

CREAM defines four control modes representing qualitatively different levels of human-system interaction quality. The control mode determines the overall context within which specific error probabilities are assessed. Control modes range from strategic, representing thoughtful, proactive control, through tactical and opportunistic, to scrambled, representing reactive, error-prone performance.

Strategic control mode represents optimal human-system interaction where operators have adequate time and resources to plan ahead, monitor system evolution, and respond thoughtfully to events. Error probability is lowest in strategic mode. This mode is associated with good performance conditions across all nine Common Performance Conditions.

Tactical control mode represents competent but less proactive performance where operators follow established procedures and practices effectively. Performance is good but operators may not anticipate problems as effectively as in strategic mode. Error probability is somewhat elevated but remains acceptable for routine operations.

Opportunistic control mode represents performance driven by immediate circumstances rather than systematic planning. Operators respond to salient features of the situation but may miss less obvious aspects. Error probability is significantly elevated, and performance may be inconsistent. This mode is associated with degraded performance conditions.

Scrambled control mode represents performance dominated by immediate pressures where operators struggle to maintain any systematic approach. Actions are reactive and poorly coordinated. Error probability is high across all cognitive functions. This mode reflects severely degraded performance conditions, often associated with emergency situations or accumulated problems.

CREAM Quantification Approaches

CREAM provides quantification approaches at different levels of detail. The basic method determines the control mode from Common Performance Conditions and assigns a range of failure probabilities consistent with that mode. This screening approach is suitable for initial analysis to identify contexts and tasks warranting more detailed examination.
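
The sketch below illustrates the shape of this screening calculation. The failure-probability intervals per control mode follow the ranges commonly quoted for basic CREAM, but the mapping from the counts of improved and reduced conditions to a control mode is a simplified placeholder; the method itself determines the mode from a published diagram, so treat the thresholds here as assumptions.

# Sketch: screening-level (basic) CREAM. Each of the nine Common Performance
# Conditions is rated "improved", "not significant", or "reduced" with respect
# to its expected effect on performance reliability.
CONTROL_MODE_INTERVALS = {            # approximate failure-probability intervals
    "strategic":     (0.5e-5, 1e-2),
    "tactical":      (1e-3, 1e-1),
    "opportunistic": (1e-2, 0.5),
    "scrambled":     (1e-1, 1.0),
}

def control_mode(cpc_ratings):
    reduced = sum(1 for r in cpc_ratings.values() if r == "reduced")
    improved = sum(1 for r in cpc_ratings.values() if r == "improved")
    # Simplified, assumed thresholds for illustration only:
    if reduced >= 6:
        return "scrambled"
    if reduced >= 3:
        return "opportunistic"
    if improved > reduced:
        return "strategic"
    return "tactical"

ratings = {  # hypothetical assessment of the nine CPCs
    "adequacy of organization": "not significant",
    "working conditions": "reduced",
    "adequacy of MMI and operational support": "reduced",
    "availability of procedures and plans": "improved",
    "number of simultaneous goals": "reduced",
    "available time": "not significant",
    "time of day": "not significant",
    "adequacy of training and experience": "improved",
    "crew collaboration quality": "not significant",
}
mode = control_mode(ratings)
print(f"Control mode: {mode}, failure probability interval: {CONTROL_MODE_INTERVALS[mode]}")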

The extended CREAM method provides more precise quantification by explicitly assessing cognitive failure probabilities for specific cognitive functions and failure modes. The analyst identifies which cognitive functions are involved in the task, assesses performance conditions specific to each function, and derives failure probabilities using method-specific tables and algorithms.

Extended CREAM defines cognitive failure modes for each cognitive function and provides base probabilities for each failure mode. For observation, failure modes include wrong identification and observation not made. For interpretation, failure modes include faulty diagnosis and decision error. Performance condition weights adjust these base probabilities for the specific context.

Uncertainty treatment in CREAM typically involves interval estimation rather than point estimates. The method recognizes that human performance variability and analysis uncertainty make precise point estimates misleading. Reporting probability intervals communicates the inherent uncertainty in human reliability estimates and supports appropriate use in decision-making.

Systematic Human Action Reliability Procedure

SHARP Framework

The Systematic Human Action Reliability Procedure (SHARP) provides a structured framework for conducting human reliability analysis that can incorporate various quantification methods including THERP, CREAM, and others. SHARP defines the overall analysis process rather than specifying particular quantification techniques, making it adaptable to different applications and organizational preferences.

SHARP was developed by the Electric Power Research Institute to address the observation that human reliability analysis quality depends as much on the overall analysis process as on the specific quantification method used. By providing a systematic procedure for defining scope, identifying human actions, characterizing performance conditions, quantifying probabilities, and documenting results, SHARP helps ensure comprehensive and consistent analysis.

The SHARP procedure comprises seven steps: definition, screening, identification, representation, screening for human error dependencies, quantification, and documentation. Each step has defined objectives, inputs, activities, and outputs. The structured approach ensures that all necessary aspects of human reliability analysis are addressed while providing flexibility in how specific steps are executed.

SHARP emphasizes integration of human reliability analysis with probabilistic risk assessment. Human actions identified through SHARP analysis feed into fault trees and event trees that model overall system risk. This integration ensures that human contributions to risk are quantified consistently with hardware contributions and that risk reduction decisions consider both human and hardware improvements.

SHARP Analysis Steps

The definition step establishes the scope and objectives of the human reliability analysis. This includes identifying the system or process to be analyzed, the scenarios of concern, the human actions within scope, and the level of detail required. Clear definition prevents scope creep and ensures that analysis effort is directed toward decision-relevant results.

Initial screening identifies human actions that are candidates for detailed analysis. Not all human actions significantly affect system risk; screening focuses subsequent effort on actions with potential to contribute meaningfully to failure probability. Screening criteria typically consider the importance of the action to safety functions and the potential for error.

Identification develops detailed understanding of the human actions remaining after screening. This step involves task analysis to understand what operators do, information analysis to understand what information they use, and performance condition analysis to understand the context affecting performance. Identification provides the foundation for meaningful quantification.

Representation structures the identified human actions for quantification. This typically involves constructing event trees or other models that show the relationships between actions, errors, and consequences. Representation makes explicit the logic connecting human reliability to system outcomes and enables systematic probability calculation.

Screening for human error dependencies identifies relationships between different human errors that affect how probabilities combine. Dependencies may arise from common causes affecting multiple actions, from sequential relationships where one action affects conditions for subsequent actions, or from shared performance shaping factors. Dependency identification is essential for accurate combined probability assessment.

Quantification assigns numerical probabilities to human errors using selected methods. SHARP does not prescribe particular quantification techniques; analysts may use THERP, CREAM, expert judgment, or other appropriate methods. The quantification step also addresses uncertainty, deriving probability distributions rather than point estimates where appropriate.

Documentation creates the record of analysis scope, methods, assumptions, data sources, and results. Good documentation enables review, supports updates when conditions change, and provides the basis for incorporating results into risk-informed decisions. Documentation requirements may be specified by regulatory authorities or organizational procedures.

SHARP Integration with PRA

SHARP analysis integrates human reliability into probabilistic risk assessment through explicit modeling of human actions in fault trees and event trees. Human errors appear as basic events in fault trees representing system failures. Human actions also appear in event tree branches representing operator responses to initiating events. This integration enables quantification of human contributions to overall risk.

Pre-initiator human actions are errors that occur before an initiating event and that may disable safety functions or create latent conditions. Examples include maintenance errors that leave equipment misconfigured or testing errors that leave safety systems unavailable. Pre-initiator actions typically appear in system fault trees as contributors to equipment unavailability.

Post-initiator human actions are responses to initiating events that may either successfully mitigate the event or fail to do so. Examples include emergency operating procedure execution, manual backup system actuation, and innovative recovery actions. Post-initiator actions appear in event trees as branch points where operator success or failure determines the subsequent accident progression.

Recovery actions represent opportunities to detect and correct errors or equipment failures before they lead to unacceptable consequences. Recovery may involve recognizing symptoms that indicate problems, diagnosing the cause, and taking corrective action. Human reliability analysis should explicitly model recovery opportunities and their dependencies with initial failures.

Task Analysis Methods

Hierarchical Task Analysis

Hierarchical task analysis is a fundamental method for understanding human tasks in sufficient detail to support human reliability analysis. The method decomposes tasks from high-level goals through intermediate operations to specific actions. This hierarchical structure reveals the relationships between actions and enables systematic identification of error opportunities at each level.

The analysis begins by identifying the overall goal of the task being analyzed. This goal is then decomposed into sub-goals that must be achieved to accomplish the overall goal. Sub-goals are further decomposed into operations, and operations into specific actions. Decomposition continues until reaching a level of detail appropriate for the analysis purpose.

Plans specify the conditions under which sub-goals or operations are executed and the sequencing between them. Plans may be simple sequences where operations are executed in fixed order, or complex decision structures with branches depending on conditions encountered. Plans capture the procedural knowledge that guides task execution.

Hierarchical task analysis representations typically use numbering schemes to show the hierarchical structure. The overall goal is numbered 0. Sub-goals are numbered 1, 2, 3, and so on. Operations under sub-goal 1 are numbered 1.1, 1.2, 1.3, and so on. This numbering enables clear reference to specific task elements and their relationships.
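
The small Python sketch below shows one way to hold such a hierarchy in code and print the numbering described above; the alarm-response task content is hypothetical.

# Sketch: a hierarchical task analysis as nested (goal, subtasks) tuples,
# printed with the conventional numbering (0, 1, 1.1, 1.2, 2, ...).
hta = ("Return system to normal operation after alarm", [
    ("Diagnose cause of alarm", [
        ("Read alarm message", []),
        ("Check related indications", []),
    ]),
    ("Take corrective action", [
        ("Select procedure section", []),
        ("Execute procedure steps", []),
    ]),
    ("Verify successful correction", []),
])

def print_hta(node, number="0"):
    goal, subtasks = node
    print(f"{number}  {goal}")
    for i, subtask in enumerate(subtasks, start=1):
        child_number = str(i) if number == "0" else f"{number}.{i}"
        print_hta(subtask, child_number)

print_hta(hta)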

Cognitive Task Analysis

Cognitive task analysis extends behavioral task analysis by examining the cognitive demands of tasks: the knowledge, decision-making, and problem-solving required for successful performance. This analysis identifies cognitive functions such as situation assessment, diagnosis, prediction, and planning that are not apparent from observing overt behavior. Cognitive task analysis is essential for understanding error opportunities in complex tasks.

Several methods support cognitive task analysis. Critical decision method interviews explore expert decision-making by eliciting accounts of challenging situations and probing the cues, goals, and strategies involved. Think-aloud protocols capture real-time cognitive processes as operators perform tasks. Cognitive walkthrough systematically evaluates interface designs against cognitive requirements.

Cognitive task analysis reveals the mental models that operators use to understand system behavior. Mental models are internal representations that enable operators to explain system states, predict future states, and determine appropriate actions. Errors often result from flawed mental models that lead to incorrect expectations or inappropriate actions. Understanding required mental models informs training design and interface design.

Workload assessment is often part of cognitive task analysis. Workload refers to the mental demands placed on operators relative to their cognitive capacity. High workload reduces available resources for error detection and recovery. Workload assessment methods include subjective ratings, secondary task performance, and physiological measures. Results inform task design to keep workload within acceptable bounds.

Tabular Task Analysis

Tabular task analysis presents task information in structured tables that support systematic error identification and probability assessment. Tables typically include columns for task step, action description, cues, feedback, potential errors, error consequences, and error probability. This format ensures consistent treatment of all task elements and facilitates review.
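
For analysts who keep these tables in software rather than spreadsheets, a minimal sketch of one possible row structure is shown below; the field names mirror the columns described above and the example row is hypothetical.

# Sketch: a possible record structure for one tabular task analysis row.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TaskStep:
    step: str
    action: str
    cue: str
    feedback: str
    potential_errors: List[str] = field(default_factory=list)
    consequences: str = ""
    hep: Optional[float] = None  # assigned during quantification

row = TaskStep(
    step="3.2",
    action="Set supply output to 5.0 V",
    cue="Procedure step 3.2; supply enabled",
    feedback="Front-panel readout shows 5.0 V",
    potential_errors=["wrong value entered", "wrong channel selected"],
    consequences="Overvoltage stress on the unit under test",
)
print(row)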

For each task step, the analyst identifies the cue that triggers the action and the feedback that confirms correct performance. Cues may be procedural instructions, system indications, or temporal triggers. Feedback may include display changes, control position, or sensory confirmation. Understanding cues and feedback reveals potential observation and execution errors.

Potential errors for each step are identified systematically by considering different error types. What could go wrong with observing the cue? What could go wrong with interpreting the cue? What execution errors could occur? What could prevent recognition that the action was incorrect? This systematic approach ensures comprehensive error identification.

Error consequences trace the impact of each potential error through the system to understand severity. Some errors may have no significant consequence; others may cause system damage or safety hazards. Understanding consequences informs both error probability assessment and prioritization of error reduction measures.

Link Analysis and Timeline Analysis

Link analysis examines the relationships between operators, displays, controls, and other task elements to identify potential interaction problems. Links are represented graphically showing which elements operators must access to complete tasks. High link density indicates potentially problematic areas with heavy interaction requirements. Link analysis informs workspace design and display arrangement.

Timeline analysis examines the temporal aspects of tasks, plotting task elements against time to identify potential timing problems. Timeline analysis reveals whether available time is adequate for required actions, whether task elements conflict with each other, and whether there are periods of excessive workload. This analysis is particularly important for time-critical scenarios.

Combined link and timeline analysis reveals dynamic workload patterns as operators move through tasks. Some task phases may involve intensive interaction with many elements in limited time; others may involve waiting with little activity. Understanding these patterns enables design of tasks, procedures, and automation that smooths workload variation.

Critical path analysis identifies task elements that constrain overall completion time. Delays in critical path elements directly delay task completion; delays in non-critical elements may be absorbed without affecting completion. This analysis helps prioritize improvements and assess the impact of potential errors on overall task success.
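
A minimal sketch of the critical path calculation for a small, hypothetical maintenance task network is shown below; the task names, durations, and precedence links are assumptions for illustration.

# Sketch: earliest finish times and the critical path through a small,
# hypothetical task network (durations in seconds).
from functools import lru_cache

durations = {"diagnose": 60, "fetch_procedure": 30, "isolate_power": 20,
             "replace_module": 120, "verify": 40}
predecessors = {"diagnose": [], "fetch_procedure": [], "isolate_power": ["diagnose"],
                "replace_module": ["fetch_procedure", "isolate_power"],
                "verify": ["replace_module"]}

@lru_cache(maxsize=None)
def earliest_finish(task):
    start = max((earliest_finish(p) for p in predecessors[task]), default=0)
    return start + durations[task]

def critical_path():
    path, task = [], max(durations, key=earliest_finish)
    while task is not None:
        path.append(task)
        preds = predecessors[task]
        task = max(preds, key=earliest_finish) if preds else None
    return list(reversed(path))

print(f"Minimum completion time: {earliest_finish(max(durations, key=earliest_finish))} s")
print(f"Critical path: {' -> '.join(critical_path())}")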

Workload and Situational Awareness

Workload Assessment Methods

Workload assessment quantifies the demands placed on operators relative to their capacity, providing input to human reliability analysis and task design. High workload degrades performance by leaving insufficient capacity for error detection and recovery. Assessment methods include subjective measures, performance measures, and physiological measures, each capturing different aspects of workload.

Subjective workload measures ask operators to rate their experienced workload. The NASA Task Load Index (NASA-TLX) is widely used, assessing mental demand, physical demand, temporal demand, performance, effort, and frustration on separate scales that are combined into an overall workload score. Subjective measures capture the operator's experience but may be influenced by factors other than actual workload.
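
A minimal sketch of the weighted scoring is shown below, assuming the standard procedure in which weights come from the fifteen pairwise comparisons and therefore sum to 15; the ratings and weights are hypothetical.

# Sketch: overall NASA-TLX workload score from subscale ratings (0-100) and
# pairwise-comparison weights (tallies summing to 15). Values are hypothetical.
ratings = {"mental demand": 70, "physical demand": 20, "temporal demand": 80,
           "performance": 40, "effort": 65, "frustration": 55}
weights = {"mental demand": 4, "physical demand": 1, "temporal demand": 5,
           "performance": 2, "effort": 2, "frustration": 1}

assert sum(weights.values()) == 15  # one tally per pairwise comparison

overall = sum(ratings[s] * weights[s] for s in ratings) / sum(weights.values())
print(f"Weighted NASA-TLX score: {overall:.1f}")  # scale from 0 (low) to 100 (high)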

Performance-based workload measures infer workload from degradation in primary or secondary task performance. Degradation in primary task performance indicates workload approaching or exceeding capacity. Secondary task methods add a concurrent task whose performance serves as a workload indicator; degraded secondary task performance indicates that the primary task is consuming available capacity.

Physiological workload measures use biological indicators such as heart rate variability, pupil diameter, and brain activity. These measures can provide continuous, non-intrusive workload assessment during task performance. However, physiological measures may be influenced by factors other than workload and require specialized equipment for collection and interpretation.

Mental Workload Modeling

Mental workload models predict workload from task characteristics, enabling assessment of proposed designs before implementation. Multiple resource theory proposes that cognitive resources are structured into channels for different information types and processing stages. Tasks that demand the same resources compete and create interference; tasks that demand different resources can be performed concurrently with less interference.

The Improved Performance Research Integration Tool (IMPRINT) and similar models quantify expected workload based on task characteristics and multiple resource theory. Models decompose tasks into components, assign resource demands to each component, and compute overall workload considering resource conflicts. Such models enable comparison of design alternatives based on predicted workload impact.

Attention allocation models address how operators distribute limited attention across multiple concurrent demands. Strategic models propose that operators allocate attention based on perceived task importance and expected payoff. These models help predict which aspects of complex situations operators will attend to and which they may neglect.

Workload modeling limitations should be acknowledged. Models simplify complex cognitive processes and may not capture all relevant factors. Model predictions should be validated against empirical data where possible. Despite limitations, models provide useful input for design decisions, particularly in early design phases when empirical testing is not yet feasible.

Situational Awareness Assessment

Situational awareness refers to the operator's understanding of the current system state, its causes, and its likely evolution. Accurate situational awareness is prerequisite to effective decision-making and action. Loss of situational awareness is implicated in many accidents where operators took actions inappropriate for actual conditions. Situational awareness assessment evaluates whether operators have adequate understanding to perform their tasks.

Endsley's three-level model distinguishes perception of elements in the environment (Level 1), comprehension of the current situation (Level 2), and projection of future status (Level 3). Situational awareness failures can occur at any level. Operators may fail to perceive relevant information, may perceive information but misinterpret it, or may understand the current situation but fail to anticipate its evolution.

The Situation Awareness Global Assessment Technique (SAGAT) assesses situational awareness by freezing simulations at random points and querying operators about their understanding of system state. Queries are developed based on situational awareness requirements derived from cognitive task analysis. Responses are scored against ground truth to assess situational awareness accuracy.

Situation awareness requirements analysis identifies the information operators need at each situational awareness level for different task scenarios. This analysis informs display design by identifying what information must be presented to support required awareness. It also reveals situations where inherent limitations may make accurate situational awareness difficult to achieve.

Attention and Vigilance

Attention limitations significantly affect human reliability in monitoring and surveillance tasks. Sustained attention, or vigilance, refers to the ability to maintain attention on task-relevant stimuli over extended periods. Vigilance decrements, where detection probability decreases over time, are well-documented and affect many monitoring tasks in electronic systems.

Vigilance decrement is influenced by signal rate, signal salience, task complexity, and work schedule. Rare signals are detected less reliably than frequent signals. Low-salience signals embedded in complex displays are more likely to be missed. Complex tasks that require interpretation beyond simple detection show greater vigilance decrement. Night shifts and extended work periods degrade vigilance.

Attention capture by salient stimuli can either help or hinder performance. Alarms and alerts are designed to capture attention to important events. However, attention capture can also distract from ongoing tasks, causing operators to miss other important information. Display design must balance the need for attention capture against the cost of distraction.

Strategies for addressing attention limitations include automation of monitoring tasks, display design that makes critical information salient, work scheduling that limits exposure to vigilance-demanding tasks, and task rotation that provides variation. Human reliability analysis should consider attention limitations when assessing detection-dependent tasks.

Crew Resource Management and Team Performance

Team Performance Factors

Many electronic system operations involve teams rather than individuals, making team performance a significant factor in human reliability. Teams can potentially achieve higher reliability than individuals through mutual monitoring, error correction, and complementary expertise. However, teams can also introduce new failure modes related to communication, coordination, and team dynamics.

Effective team performance requires shared mental models where team members have compatible understanding of the situation, the task, and each other's roles. Shared mental models enable implicit coordination where team members anticipate each other's needs without explicit communication. Teams lacking shared mental models may work at cross purposes or fail to provide needed support.

Communication quality directly affects team reliability. Clear, timely communication ensures that team members have the information they need. Communication failures include omitted information, garbled transmission, and misinterpretation. Structured communication protocols such as read-backs and three-way communication reduce communication errors.

Team leadership influences both task performance and team climate. Effective leaders coordinate team activities, manage workload, and maintain situational awareness. They also create environments where team members feel comfortable speaking up about concerns. Authoritarian leadership styles may suppress the error correction that enables team reliability.

Crew Resource Management Principles

Crew Resource Management (CRM) originated in aviation to address accidents attributed to failures in team coordination rather than technical proficiency. CRM training develops skills in communication, leadership, situation awareness, decision-making, and workload management. These principles have been adapted for healthcare, nuclear power, and other domains where team performance is critical.

Effective CRM emphasizes assertiveness balanced with respect for expertise and authority. Team members must be willing to speak up when they observe problems, even to superiors. At the same time, concerns must be communicated constructively. Training develops skills in expressing concerns effectively and in receiving concerns non-defensively.

Briefings and debriefings are CRM practices that establish shared understanding and enable learning. Pre-task briefings establish shared mental models, identify potential problems, and clarify roles. Post-task debriefings review performance, identify lessons learned, and reinforce positive behaviors. Regular briefings and debriefings build team cohesion and support continuous improvement.

Workload management in CRM involves monitoring team workload and redistributing tasks when any member approaches overload. This requires awareness of others' workload and willingness to ask for and offer assistance. Effective workload management prevents the performance degradation that accompanies individual overload.

Team Performance Analysis

Team performance analysis examines how teams function in operational contexts to identify strengths and improvement opportunities. Analysis methods include observation of actual or simulated operations, review of communication records, and structured interviews with team members. Results inform both team design and training development.

Behavioral markers provide observable indicators of effective team performance. Communication markers include clarity, completeness, and appropriate timing. Coordination markers include task prioritization, workload distribution, and mutual support. Leadership markers include briefing quality, decision-making effectiveness, and team climate maintenance. Behavioral markers enable systematic observation and assessment.

Communication analysis examines the content, frequency, and patterns of team communications. Content analysis identifies whether necessary information is communicated. Frequency analysis reveals whether communication volume is appropriate for task demands. Pattern analysis examines communication structure to identify isolation of team members or communication bottlenecks.

Team reliability modeling extends individual human reliability analysis to team contexts. This modeling addresses how team interactions affect error probability, considering both the error reduction potential of mutual monitoring and the error creation potential of miscommunication. Dependencies between team member errors are particularly important to model accurately.
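
One widely used way to represent such dependencies is the THERP dependence adjustment, which increases the conditional probability that a second person fails given that the first already has. The sketch below applies the standard THERP conditional-probability formulas to an assumed individual error probability; the baseline value is chosen only to show the effect of each dependence level.

```python
# Sketch of THERP-style dependence adjustment for a second checker's error,
# given that the first team member has already erred. Baseline value is assumed.

def conditional_error_probability(p: float, dependence: str) -> float:
    """Conditional probability that the second person also fails."""
    formulas = {
        "zero":     lambda p: p,                   # fully independent
        "low":      lambda p: (1 + 19 * p) / 20,
        "moderate": lambda p: (1 + 6 * p) / 7,
        "high":     lambda p: (1 + p) / 2,
        "complete": lambda p: 1.0,                 # second failure is certain
    }
    return formulas[dependence](p)

p_individual = 0.01  # assumed nominal error probability for either person

for level in ("zero", "low", "moderate", "high", "complete"):
    p_cond = conditional_error_probability(p_individual, level)
    p_both = p_individual * p_cond  # probability both the doer and the checker fail
    print(f"{level:9s} dependence: P(checker fails | doer failed) = {p_cond:.3f}, "
          f"P(both fail) = {p_both:.5f}")
```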

Communication Protocols

Structured communication protocols reduce errors in safety-critical communications. These protocols specify how information should be exchanged to ensure accuracy and completeness. Effective protocols address both routine communications and critical communications requiring high reliability.

Read-back and hear-back protocols require that the receiver repeat critical communications back to the sender, who verifies accuracy. This closed-loop communication catches many transmission and interpretation errors. Read-back is standard practice in aviation and increasingly in healthcare for critical communications such as medication orders.
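
The closed loop can be sketched in code as a simple verify-and-repeat cycle in which the sender retransmits until the read-back matches; the normalization rule, retry limit, and example message below are illustrative assumptions rather than features of any standard protocol.

```python
# Illustrative sketch of a read-back / hear-back loop. The sender repeats the
# message until the receiver's read-back matches, up to an assumed retry limit.

def normalize(message: str) -> str:
    # Assumed comparison rule: ignore case and extra whitespace.
    return " ".join(message.lower().split())

def closed_loop_send(message: str, channel, max_attempts: int = 3) -> bool:
    for attempt in range(1, max_attempts + 1):
        read_back = channel(message)            # receiver repeats what was heard
        if normalize(read_back) == normalize(message):
            return True                         # sender verifies: loop closed
        print(f"Attempt {attempt}: read-back mismatch, repeating message")
    return False                                # escalate per local protocol

# Hypothetical noisy channel that garbles the first transmission.
attempts = iter(["set bus A to 28 volts?", "Set bus A to 28 volts"])
ok = closed_loop_send("Set bus A to 28 volts", lambda msg: next(attempts))
print("communication verified" if ok else "communication not verified")
```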

Standard terminology reduces ambiguity by ensuring that terms have consistent meanings across team members and situations. Domain-specific terminology should be clearly defined, and common terms with multiple meanings should be avoided or clarified. Phonetic alphabets and standardized number formats reduce errors in spoken communications.

Escalation protocols define how concerns should be raised when initial communications do not receive adequate response. These protocols enable team members to persist with safety concerns through defined channels. Effective escalation protocols balance respect for expertise and authority with the need to ensure that concerns are heard and addressed.

Decision-Making Under Stress

Naturalistic Decision-Making

Naturalistic decision-making research examines how experts make decisions in real-world contexts characterized by time pressure, uncertainty, dynamic conditions, and high stakes. This research reveals that expert decision-making often differs substantially from the analytical models assumed in classical decision theory. Understanding naturalistic decision-making informs both human reliability analysis and the design of decision support systems.

Recognition-primed decision-making describes how experienced operators use pattern recognition to quickly identify situations and generate appropriate responses without extensive option analysis. Experts recognize situations as instances of familiar patterns and know immediately what actions are appropriate. This rapid recognition enables fast response but may lead to errors when situations are misrecognized.

Serial option evaluation characterizes much naturalistic decision-making. Rather than generating and comparing multiple options, decision-makers often evaluate options one at a time, accepting the first satisfactory option found. This strategy is efficient when good options exist and are found early, but may lead to suboptimal choices when better options would be found through more extensive search.

Mental simulation enables decision-makers to evaluate options by imagining how they would play out. Experienced operators can quickly run through scenarios in their minds to assess whether proposed actions will achieve desired outcomes. Mental simulation depends on accurate mental models; flawed models produce misleading simulations.

Stress Effects on Decision-Making

Stress affects decision-making through multiple mechanisms. Acute stress narrows attention, focusing cognitive resources on the most salient aspects of the situation while neglecting peripheral information. This narrowing may be adaptive when central information is most important but leads to errors when peripheral information is also relevant.

Stress increases reliance on habitual responses and reduces cognitive flexibility. Under stress, operators tend to apply familiar patterns and procedures rather than engaging in effortful analysis. This tendency is efficient when familiar patterns apply but leads to errors when situations require novel responses.

Time pressure, a common stressor, forces decision-makers to act before they would prefer. Time pressure truncates information search, reduces option generation and evaluation, and increases reliance on heuristics. While heuristics often produce acceptable decisions quickly, they are also sources of systematic bias.

Cognitive appraisal mediates stress effects. Situations perceived as threatening and uncontrollable produce more severe stress effects than those perceived as challenging but manageable. Training and experience that build competence and confidence reduce the stress response to objectively similar situations.

Decision Biases and Heuristics

Cognitive biases systematically influence decisions in predictable ways that may depart from optimal choices. Confirmation bias leads decision-makers to seek and interpret information in ways that support existing beliefs while neglecting contradictory evidence. Anchoring bias causes excessive influence of initial information on subsequent judgments. Availability bias leads to overweighting of easily recalled examples.

Heuristics are mental shortcuts that enable rapid judgment with limited information. The recognition heuristic judges options based on whether they are recognized. The representativeness heuristic judges probability based on similarity to prototypes. Such heuristics often produce reasonable judgments efficiently but can lead to systematic errors in predictable circumstances.

Framing effects demonstrate that logically equivalent choices can lead to different decisions depending on how they are presented. Choices framed in terms of gains tend to produce risk-averse decisions; choices framed in terms of losses tend to produce risk-seeking decisions. Framing influences decisions even when decision-makers are aware of the effect.

Debiasing strategies attempt to reduce the influence of biases on important decisions. Strategies include seeking disconfirming evidence, considering alternative hypotheses, using structured decision aids, and obtaining independent assessments. While complete elimination of bias is unrealistic, awareness and deliberate countermeasures can reduce bias effects.

Decision Support Design

Decision support systems can enhance human decision-making by compensating for human limitations while leveraging human strengths. Effective decision support provides relevant information, highlights important considerations, and presents options without replacing human judgment. Poor decision support may increase workload, distract from important information, or create automation complacency.

Information display design affects decision quality by determining what information is available and how salient it is. Displays should present information needed for decisions in formats aligned with decision requirements. Information overload degrades decisions by consuming attention and creating search demands. Display design should prioritize decision-relevant information.

Alarm and alert systems support decisions by directing attention to conditions requiring response. Effective alarm systems discriminate true conditions requiring attention from false alarms and nuisance alarms. Excessive false alarms lead to alarm fatigue and reduced response to genuine alarms. Alarm prioritization helps operators identify the most important conditions.

Automation can support decisions by performing analyses that exceed human capacity and presenting results for human interpretation. However, automation must be designed to support rather than replace human understanding. Opaque automation that provides recommendations without explanation does not support the learning and adaptation that enable human resilience in novel situations.

Error Recovery and Defense

Error Recovery Mechanisms

Error recovery refers to detection and correction of errors before they cause harm. Human reliability analysis should model recovery as well as initial errors because recovery significantly affects the probability that errors lead to consequences. Recovery success depends on error detectability, time available for recovery, and the availability of recovery resources.
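
In quantitative terms, the probability of an unrecovered error can be approximated as the initial error probability multiplied by the probability that every recovery opportunity fails, assuming the opportunities are independent. The values in the sketch below are assumed numbers used only to show the arithmetic.

```python
# Sketch: unrecovered error probability given independent recovery opportunities.
# All probabilities are assumed values for illustration.

p_error = 0.01                 # probability the initial error is made
p_recovery = [0.8, 0.5]        # probability each opportunity catches the error
                               #   (e.g. self-check, then independent verification)

p_all_recoveries_fail = 1.0
for p in p_recovery:
    p_all_recoveries_fail *= (1.0 - p)

p_unrecovered = p_error * p_all_recoveries_fail
print(f"P(error not recovered) = {p_unrecovered:.5f}")   # 0.01 * 0.2 * 0.5 = 0.001
```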

Self-recovery occurs when the person who made an error detects and corrects it. Self-recovery may occur immediately through skill-based error correction, or may be delayed when the error is detected through unexpected consequences. Self-recovery is limited by cognitive biases that tend to confirm expectations and by attention limitations that prevent detection of subtle errors.

Team recovery occurs when other team members detect and correct errors. Effective teams actively monitor each other's actions and speak up when errors are observed. Team recovery extends the error detection capability beyond any individual but depends on team climate that encourages error reporting and correction.

System-mediated recovery occurs when technical systems detect errors and either correct them automatically or alert operators. Examples include input validation that prevents incorrect data entry, interlocks that prevent incorrect sequences, and alarms that indicate deviation from expected conditions. System-mediated recovery is particularly valuable for errors that humans would have difficulty detecting.
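
A minimal sketch of one such mechanism, an input-validation check that rejects a setpoint outside an allowed range before it is applied, is shown below; the parameter name and limits are hypothetical.

```python
# Illustrative input-validation barrier: reject an out-of-range setpoint entry
# before it is applied. Parameter name and limits are hypothetical.

SUPPLY_VOLTAGE_LIMITS = (11.5, 12.5)   # assumed acceptable range, in volts

def validated_setpoint(raw_entry: str) -> float:
    value = float(raw_entry)           # raises ValueError on malformed input
    low, high = SUPPLY_VOLTAGE_LIMITS
    if not (low <= value <= high):
        raise ValueError(
            f"setpoint {value} V outside allowed range {low}-{high} V; "
            "entry rejected, please re-enter"
        )
    return value

try:
    setpoint = validated_setpoint("21.0")   # likely a transposition of 12.0
except ValueError as err:
    print(f"System-mediated recovery: {err}")
```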

Defense-in-Depth Strategies

Defense-in-depth creates multiple independent barriers against errors, each capable of preventing harm even if other barriers fail. This strategy recognizes that any single barrier may fail and therefore provides redundant protection. Human reliability analysis should assess the reliability of each barrier and their independence.

Barriers may be physical, such as interlocks that prevent incorrect actions; procedural, such as verification steps in procedures; or administrative, such as authorization requirements. Effective defense-in-depth uses diverse barrier types because different types have different failure modes. Barriers with common failure modes provide less effective redundancy.

Barrier analysis identifies the barriers that protect against each identified error and assesses the reliability of each barrier. This analysis reveals where barrier coverage is inadequate and where barriers may share common modes of failure. Results inform decisions about where additional barriers are needed.

The Swiss cheese model visualizes how errors can propagate through multiple barrier layers when holes in different layers align. Each layer has holes representing barrier failures. Most of the time, holes do not align and errors are caught. Accidents occur when holes happen to align, allowing error propagation through all layers. This model highlights the importance of maintaining multiple independent barriers.
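
The value of barrier independence can be made concrete with a small calculation: if barriers fail independently, the probability that all of them fail is the product of their individual failure probabilities, whereas a shared failure mode adds a common-cause term that can dominate the result. The barrier failure probabilities and the beta-factor-style common-cause fraction in the sketch below are assumed values for illustration.

```python
# Sketch: probability that all barriers fail, with and without a common-cause
# contribution. Barrier failure probabilities and the common-cause fraction
# are assumed values for illustration only.
from math import prod

barrier_failure = {
    "procedural verification step": 0.1,
    "independent peer check":       0.1,
    "hardware interlock":           0.01,
}

# Fully independent barriers: multiply individual failure probabilities.
p_independent = prod(barrier_failure.values())

# Simple beta-factor-style adjustment: assume 5% of failures arise from a
# shared mode that defeats every barrier at once.
beta = 0.05
p_common_cause = beta * max(barrier_failure.values())
p_with_common_cause = p_independent + p_common_cause

print(f"All barriers fail (independent):       {p_independent:.2e}")
print(f"All barriers fail (with common cause): {p_with_common_cause:.2e}")
```

Even a small shared failure fraction can overwhelm the benefit of nominally redundant barriers, which is why barrier analysis emphasizes diversity and independence.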

Error-Proofing Design

Error-proofing, also known as poka-yoke, designs systems so that errors are prevented or immediately detected. Error-proofing is preferable to relying on human vigilance because it provides consistent protection regardless of human variability. The most robust error-proofing makes errors physically impossible; less robust approaches make errors immediately obvious.

Forcing functions are design features that make the intended sequence of actions the only possible sequence. Interlocks that prevent the next step until the previous step is complete are forcing functions. Connectors that only fit in the correct orientation prevent reversal errors. Forcing functions eliminate entire classes of errors but may not be feasible for all applications.
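
In software-controlled equipment, a forcing function can be approximated by an interlock that refuses to execute a step until its prerequisite has completed. The sketch below is an illustrative state check with hypothetical step names, not a description of any particular product's interlock logic.

```python
# Illustrative software interlock: each step can only run after its
# prerequisite step has completed, enforcing the intended sequence.

class SequenceInterlock:
    def __init__(self, steps):
        self.steps = list(steps)       # required order
        self.completed = set()

    def perform(self, step: str) -> None:
        index = self.steps.index(step)
        if index > 0 and self.steps[index - 1] not in self.completed:
            raise RuntimeError(
                f"interlock: '{step}' blocked until "
                f"'{self.steps[index - 1]}' is complete"
            )
        print(f"performing: {step}")
        self.completed.add(step)

interlock = SequenceInterlock(["de-energize", "verify zero voltage", "connect probe"])
interlock.perform("de-energize")
try:
    interlock.perform("connect probe")     # skipping verification is refused
except RuntimeError as err:
    print(err)
interlock.perform("verify zero voltage")
interlock.perform("connect probe")
```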

Affordances and constraints guide behavior toward correct actions without completely preventing errors. Controls that look like they should be operated in a certain way provide affordances for correct operation. Physical constraints that make incorrect actions difficult, though not impossible, reduce error likelihood. These approaches are applicable where forcing functions are not feasible.

Error detection and feedback designs make errors immediately apparent so they can be corrected. Displays that clearly indicate current state reveal when settings are incorrect. Feedback that confirms action completion reveals when actions have been omitted. Alarms that indicate out-of-range conditions prompt investigation and correction. Detection is valuable when prevention is not feasible.

Learning from Errors

Organizational learning from errors improves system reliability by addressing conditions that made errors likely. Learning requires that errors be reported, analyzed to understand contributing factors, and addressed through systemic changes. Organizations that punish error reporters suppress the information needed for learning.

Non-punitive reporting systems encourage error reporting by focusing on system improvement rather than individual blame. Such systems acknowledge that errors are inevitable and that reporting provides valuable information. Non-punitive reporting does not mean ignoring willful violations or repeated negligence; it means distinguishing inadvertent errors from culpable behaviors.

Root cause analysis investigates errors to identify contributing factors at multiple levels. Immediate causes are the actions or conditions that directly led to the error. Contributing causes are the factors that created conditions for error. Root causes are the fundamental organizational or systemic factors that allowed contributing factors to exist. Effective learning addresses root causes, not just immediate causes.

Corrective action development translates analysis findings into changes that reduce future error probability. Corrective actions may address equipment design, procedure quality, training content, organizational factors, or other contributors identified through analysis. Corrective actions should be tracked to ensure implementation and validated to confirm effectiveness.

Behavioral Markers and Assessment

Non-Technical Skills Assessment

Non-technical skills are the cognitive, social, and personal resource skills that complement technical knowledge and enable safe, effective performance. Assessment of non-technical skills provides insights into human reliability that supplement traditional technical competence assessment. Non-technical skills assessment is used for selection, training, and performance evaluation.

Key non-technical skill categories include situation awareness, decision-making, communication, teamwork, leadership, and workload management. Each category encompasses multiple specific skills that can be observed and assessed. For example, communication includes clarity of expression, active listening, and appropriate assertiveness.

Behavioral rating systems provide structured approaches for observing and assessing non-technical skills. Systems such as NOTECHS for aviation and ANTS for anesthesia define observable behaviors that indicate skill levels. Trained observers rate behaviors against defined standards, providing quantifiable assessment of non-technical performance.

Assessment reliability depends on observer training, rating scale quality, and assessment conditions. Observer training develops shared understanding of behavioral indicators and consistent application of rating scales. Well-designed rating scales provide clear distinctions between performance levels. Assessment conditions should provide adequate opportunity to observe relevant behaviors.

Behavioral Marker Systems

Behavioral marker systems define observable behaviors that indicate effective or ineffective non-technical skill application. Markers are developed through analysis of expert performance, incident investigation, and empirical research. Systems are domain-specific to ensure markers are relevant to the tasks and contexts of the target domain.

Marker development begins with identifying the non-technical skills important for the domain. For each skill, markers are developed that describe observable behaviors at different performance levels. Markers should be specific enough to enable reliable observation while being applicable across the range of situations encountered in the domain.

Exemplar behaviors provide concrete illustrations of performance at each level. Exemplars help observers understand the distinctions between performance levels and calibrate their ratings. Different exemplars may be provided for different task types or scenarios within the domain.

Rating scales typically use four or five levels ranging from poor to excellent performance. Each level has descriptors that characterize performance at that level. Odd-numbered scales include a middle neutral point; even-numbered scales force discrimination between above and below average. Scale selection considers the purpose of assessment and desired discrimination.

Training Assessment Applications

Behavioral markers support training assessment by providing structured criteria for evaluating performance during training exercises. Trainees receive specific feedback about observed behaviors rather than vague impressions of overall performance. This specificity enables targeted improvement.

Formative assessment during training uses behavioral markers to guide development. Instructors observe trainee performance, identify behaviors indicating skill deficiencies, and provide targeted instruction. Trainees learn what effective performance looks like and receive feedback about their own performance against these standards.

Summative assessment at training completion uses behavioral markers to determine whether trainees have achieved required competence. Assessment results determine readiness for operational duties. Consistent application of behavioral markers across trainees ensures fair, defensible assessment decisions.

Recurrent training assessment uses behavioral markers to evaluate ongoing competence. Regular assessment identifies skill degradation before it leads to operational problems. Results guide individualized training to address identified deficiencies while avoiding redundant training in areas of demonstrated competence.

Operational Performance Monitoring

Behavioral markers enable systematic observation of operational performance to identify trends, best practices, and improvement opportunities. Observational programs provide data about how personnel actually perform their duties, complementing other safety data sources such as incident reports.

Line Operations Safety Audits in aviation provide a model for operational observation. Trained observers ride along on normal operations and record observations using structured forms. Data are aggregated to identify systemic issues while maintaining confidentiality of individual observations. Results inform training, procedures, and system design.

Peer observation programs use operational personnel as observers. Peer observers may be more readily accepted than external observers and have operational credibility. However, peer observation requires careful management to maintain observer objectivity and prevent concerns about surveillance. Non-punitive use of observations is essential.

Observation data analysis identifies patterns across multiple observations. Individual observations provide limited information due to natural variability. Aggregated data reveal systematic issues that occur across multiple operators, situations, or time periods. Trend analysis tracks whether performance is improving or degrading over time.
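
A minimal sketch of such aggregation is shown below: hypothetical structured observations are grouped by behavioral marker category and observation period so that mean ratings and a simple period-over-period comparison can be read off. The marker names, rating scale, and period labels are assumptions.

```python
# Sketch: aggregate hypothetical structured observations by marker category and
# compare two periods to look for a trend. All data below are illustrative.
from collections import defaultdict

# Each record: (period, marker category, rating on an assumed 1-4 scale)
observations = [
    ("Q1", "communication", 2), ("Q1", "communication", 3), ("Q1", "workload", 2),
    ("Q2", "communication", 3), ("Q2", "communication", 4), ("Q2", "workload", 3),
]

totals = defaultdict(list)
for period, marker, rating in observations:
    totals[(period, marker)].append(rating)

for (period, marker), ratings in sorted(totals.items()):
    mean = sum(ratings) / len(ratings)
    print(f"{period} {marker:13s}: mean rating {mean:.2f} "
          f"over {len(ratings)} observations")
```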

Human Factors Engineering Integration

Human-Centered Design Process

Human-centered design integrates human factors considerations throughout the system development lifecycle. Rather than addressing human factors as an afterthought, human-centered design considers human capabilities and limitations from initial concept through detailed design, implementation, testing, and operation. This integration prevents problems that are expensive or impossible to fix later.

Early involvement of human factors in concept development ensures that fundamental architecture decisions account for human capabilities. Allocation of function between humans and automation should consider which functions humans perform well and which are better automated. Workstation concepts should support task requirements. Early decisions constrain later options, making human factors input at this stage particularly valuable.

Design iteration with human factors evaluation refines concepts through progressive analysis and testing. Human factors methods are applied at each design phase to identify potential problems and evaluate design alternatives. User involvement through focus groups, surveys, and usability testing brings actual user perspectives into the design process.

Human factors verification ensures that final designs meet human factors requirements. Verification activities include analytical evaluation against human factors guidelines, usability testing with representative users, and human reliability analysis of critical tasks. Verification findings may identify needed design changes before production.

Function Allocation

Function allocation determines which functions humans will perform and which will be automated. This allocation fundamentally shapes human-system interaction and significantly affects human reliability. Poor allocation either burdens humans with functions they perform unreliably or deprives them of engagement and situational awareness.

Humans excel at pattern recognition, flexible response to novel situations, judgment under ambiguity, and creative problem-solving. Functions that leverage these strengths are good candidates for human allocation. Humans perform poorly on sustained vigilance, precise repetitive actions, rapid response to predictable situations, and complex calculations. Functions with these characteristics are candidates for automation.

Automation that supports human performance is preferable to automation that replaces humans entirely. Supporting automation provides information and recommendations while humans retain decision authority. Replacing humans entirely creates risks of automation complacency, skill degradation, and inability to cope with automation failures. The appropriate level of automation depends on task characteristics and system criticality.

Allocation decisions should be documented with supporting rationale. Documentation enables review and supports future design decisions. Rationale should address why each function is allocated as it is, considering human capabilities, automation capabilities, and system requirements.

Interface Design for Reliability

Interface design directly affects human error probability by determining how easily operators can perceive information, understand system state, and execute correct actions. Well-designed interfaces support human performance; poor interfaces create opportunities for error. Interface design should be guided by human factors principles and validated through analysis and testing.

Display design principles include supporting pattern recognition through consistent formats, reducing memory demands by displaying needed information, managing attention through appropriate use of highlighting and alerts, and supporting situation awareness by presenting information in context. Displays should present information at appropriate levels of abstraction for different tasks.

Control design principles include providing clear feedback about control action effects, using control characteristics that naturally suggest correct operation, preventing inadvertent activation of critical controls, and supporting error recovery by allowing actions to be undone. Controls should be arranged to support natural task sequences and minimize physical demands.

Alarm design principles include limiting alarm volume to manageable levels, distinguishing between alarm priorities, presenting alarms in formats that support diagnosis, and enabling easy acknowledgment and silencing. Alarm systems should be designed as systems, not collections of individual alarms, with consideration of alarm patterns during abnormal conditions.

Procedure Design

Procedure quality significantly affects human reliability for proceduralized tasks. Well-designed procedures support correct performance by providing clear instructions, appropriate detail, and logical organization. Poor procedures increase error likelihood through ambiguity, excessive complexity, or mismatch with actual task requirements.

Procedure content should include all information needed to perform the task correctly. Instructions should be specific enough to be performed as written while not being so detailed that operators cannot exercise appropriate judgment. Procedures should be verified against actual equipment and conditions rather than designed in isolation.

Procedure format affects usability. Clear organization with logical grouping of related steps supports navigation. Action steps should be clearly distinguished from information or cautions. Step numbering supports place-keeping during execution. Format conventions should be consistent across procedures.

Procedure validation confirms that procedures support correct performance. Walkthrough validation has operators review procedures step-by-step, identifying ambiguities and problems. Simulation validation has operators execute procedures while observers identify difficulties. Validation findings should prompt procedure revision before operational use.

Application to Electronic Systems

Manufacturing Operations

Human reliability analysis applies to manufacturing operations where human actions affect product quality and safety. Electronic manufacturing involves numerous human tasks including component handling, assembly operations, inspection, testing, and process control. Each task presents error opportunities that may affect product reliability or manufacturing efficiency.

Assembly errors include component placement errors, soldering errors, and workmanship defects. Human reliability analysis identifies tasks with significant error potential and evaluates the adequacy of prevention and detection controls. Results inform decisions about automation, fixture design, procedure improvement, and inspection strategy.

Inspection errors include both false alarms, where good product is rejected, and misses, where defective product is accepted. Human reliability analysis of inspection considers factors affecting inspector performance including lighting, workload, fatigue, and defect characteristics. Results inform inspection process design and inspector support.
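
Inspection performance of this kind is often summarized with signal detection measures, which separate the inspector's ability to discriminate defective from good product (sensitivity, d') from the decision criterion (bias toward accepting or rejecting). The counts in the sketch below are hypothetical.

```python
# Sketch: signal detection measures for a hypothetical inspection task.
# Counts are assumed values for illustration.
from statistics import NormalDist

hits, misses = 45, 5                 # defective units: correctly rejected vs accepted
false_alarms, correct_ok = 8, 92     # good units: wrongly rejected vs accepted

hit_rate = hits / (hits + misses)
fa_rate = false_alarms / (false_alarms + correct_ok)

z = NormalDist().inv_cdf
d_prime = z(hit_rate) - z(fa_rate)            # sensitivity: defect discriminability
criterion = -(z(hit_rate) + z(fa_rate)) / 2   # bias toward accepting or rejecting

print(f"hit rate = {hit_rate:.2f}, false alarm rate = {fa_rate:.2f}")
print(f"d' = {d_prime:.2f}, criterion c = {criterion:.2f}")
```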

Process control errors include incorrect parameter settings, failure to respond to process deviations, and incorrect recovery actions. Human reliability analysis of process control identifies critical control tasks and evaluates the human-machine interface, procedures, and training supporting these tasks. Results inform control system design and operator qualification.

Equipment Maintenance

Maintenance activities present significant human reliability challenges because they occur infrequently, involve diverse tasks, and may be performed under time pressure or adverse conditions. Maintenance errors can introduce latent failures that remain undetected until challenged by other events. Human reliability analysis of maintenance identifies critical tasks and supports maintenance program improvement.

Restoration errors occur when equipment is not correctly returned to service after maintenance. Common examples include components left disconnected, fasteners not properly torqued, safety guards not replaced, and test equipment left in circuit. Human reliability analysis identifies restoration error risks and evaluates controls such as return-to-service verification.

Calibration errors affect measurement accuracy and may not be apparent until measurement results are significantly wrong. Human reliability analysis of calibration considers procedure quality, technician training, equipment condition, and verification practices. Results inform calibration procedure design and technician qualification.

Troubleshooting errors include incorrect diagnosis, which leads to unnecessary component replacement, and failure to identify the actual fault, which leaves equipment unreliable. Human reliability analysis of troubleshooting considers information available to technicians, diagnostic aids, and time pressure. Results inform diagnostic support system design and technician training.

System Operation

System operation in control rooms, operations centers, and field installations presents human reliability challenges related to monitoring, response to abnormal conditions, and routine operations. Human reliability analysis of operations supports interface design, procedure development, and training program design.

Monitoring tasks require sustained attention to detect abnormal conditions. Human reliability analysis considers alarm system design, display organization, and workload patterns. Results inform alarm philosophy development and display design to support effective monitoring without overwhelming operators.

Emergency response tasks occur infrequently but have high consequences for error. Human reliability analysis of emergency response considers procedure quality, training effectiveness, and environmental conditions during emergencies. Results inform emergency procedure development and exercise program design.

Shift handover represents a period of elevated risk where information must transfer between operators. Human reliability analysis of handover considers information transfer methods, verification practices, and time available for handover. Results inform handover procedure development and scheduling practices.

Software Development

While software does not have human operators in the traditional sense, software development involves intensive human cognitive work where errors lead to software defects. Human reliability analysis concepts apply to software development by identifying tasks with error potential, evaluating factors affecting developer performance, and designing processes that prevent and detect errors.

Requirements specification errors introduce defects that propagate through design and implementation. Human factors in requirements include stakeholder communication, requirement clarity, and analyst expertise. Structured requirements methods, reviews, and validation address these factors.

Design errors may create software architecture that is difficult to implement correctly or that behaves incorrectly under certain conditions. Human factors in design include design method quality, designer expertise, and design review effectiveness. Structured design methods, design reviews, and prototyping address design errors.

Coding errors include logic errors, boundary condition errors, and interface errors. Human factors in coding include programming language characteristics, development environment quality, and programmer expertise. Code reviews, static analysis, and testing detect coding errors. Programming language design and development environment design can reduce error introduction.

Conclusion

Human reliability analysis provides the methods and frameworks needed to systematically address human contributions to system performance and risk. While hardware reliability has been the traditional focus of reliability engineering, experience across industries demonstrates that human factors are involved in a substantial proportion of system failures. Effective reliability programs must address human performance with the same rigor applied to hardware and software components.

The methods covered in this article, from foundational error classification through quantification techniques and team performance analysis, provide a comprehensive toolkit for human reliability analysis. THERP offers systematic task decomposition and probability quantification grounded in extensive operational data. CREAM provides cognitive modeling that addresses the context-dependent nature of human performance. SHARP integrates human reliability analysis into probabilistic risk assessment. Together with task analysis, workload assessment, and decision-making research, these methods enable thorough analysis of human contributions to system reliability.

For electronics engineers, human reliability analysis informs design decisions about human-machine interfaces, automation, procedures, and training. Systems designed with attention to human factors achieve better performance by enabling humans to contribute their unique strengths while compensating for predictable human limitations. Error-proofing, defense-in-depth, and organizational learning create resilient systems that detect and recover from the errors that inevitably occur.

The ultimate goal of human reliability analysis is not to eliminate human involvement in systems but to optimize the human contribution. Humans bring flexibility, judgment, and creative problem-solving that no automated system can match. By understanding human capabilities and limitations, and by designing systems that work with rather than against human cognitive architecture, engineers create electronic systems that are both more reliable and more effective in achieving their intended purposes.