Glossary of Terms/Topics
Glossary of Statistical Terms
Term | Description |
---|---|
Alpha | In hypothesis testing, alpha (α) is the significance level, representing the probability of rejecting a true null hypothesis. Commonly set at 0.05, it is the threshold against which the p-value is compared: when the p-value falls below alpha, the evidence is considered strong enough to reject the null hypothesis in favor of the alternative hypothesis. |
Alternative Hypothesis | The alternative hypothesis is a statement in statistical hypothesis testing that proposes a difference or effect, challenging the null hypothesis. It represents what researchers aim to support with evidence from their data, indicating a change, relationship, or impact in the population. |
Analysis of Covariance (ANCOVA) | ANCOVA is a statistical method that combines aspects of analysis of variance (ANOVA) and regression. It assesses whether population means of a dependent variable (DV) are equal across levels of a categorical independent variable (IV), while statistically controlling for the effects of other continuous variables known as covariates. |
Analysis of Variance (ANOVA) | ANOVA is a statistical technique used to compare means among three or more groups; as the name implies, variances are used to make this comparison. Specifically, the within-group variance (obtained by averaging the variance from each group) is compared to the between-group variance. If the two variances are similar, the group means are not likely to be different. If the two variances are quite different, it is likely that one or more of the means differ from the others. |
Association | Association between two variables means that they occur together more frequently than would be expected by chance, but this does not necessarily imply a cause-and-effect relationship. Associations can be identified through observational studies, but further research is often required to determine whether a causal relationship exists. |
Attack Rate | Attack rate is a measure used in epidemiology to describe the proportion of individuals who develop a certain disease or health outcome during a specified period among a population at risk. It is commonly used in the investigation of outbreaks, providing a quick measure of the risk of disease in a population. For example, during a flu outbreak, the attack rate would indicate the percentage of people who became ill after exposure. |
Attributable fraction in the exposed cases | A measure in epidemiology that quantifies the proportion of cases of a specific outcome among exposed individuals that can be attributed to the exposure. It helps assess the contribution of the exposure to the occurrence of the outcome in the exposed group. |
Attributable fraction in the population | A population-level measure indicating the proportion of a specific outcome that can be attributed to a particular risk factor. It considers both exposed and unexposed individuals, providing an estimate of the overall impact of the risk factor on the occurrence of the outcome in the entire population. |
Attrition (Dropout) Bias | This bias arises in longitudinal studies when participants drop out or are lost to follow-up at different rates between the compared groups. If the reasons for dropout are related to the exposure or outcome under study, this can lead to biased results. |
Background Risk | Background risk refers to the baseline level of risk for a health outcome in a population without exposure to the specific risk factor being studied. It's the rate at which a condition occurs in the general population, serving as a reference point for comparing the risk in groups with additional risk factors. |
Bayes' Theorem | Bayes' Theorem is a fundamental concept in probability theory that provides a way to update or revise probability estimates based on new evidence. It relates conditional probabilities and is used in Bayesian statistics to update the probability of a hypothesis or event given prior knowledge and new data. It plays a crucial role in making predictions and decisions in medical research, especially when dealing with uncertain information or incorporating prior beliefs. |
Bias | Bias in statistics refers to a systematic error or deviation from the truth in data collection, analysis, interpretation, or reporting. It can lead to inaccurate conclusions and is crucial to identify and minimize in research to ensure the validity and reliability of study results. |
Binary Outcome | A binary outcome refers to a situation where the outcome of interest has only two possible categories or states, such as "yes" or "no," "success" or "failure," or "alive" or "dead." These outcomes are often analyzed using techniques like logistic regression to understand the factors influencing the likelihood of each category. |
Binomial Distribution | The binomial distribution is a probability distribution that models the number of successes in a fixed number of independent and identical trials, where each trial has only two possible outcomes (success or failure). It is frequently used to analyze binary data and is characterized by parameters such as the number of trials and the probability of success in each trial. |
Biologic Plausibility | Biologic plausibility refers to the extent to which a causal relationship between a particular exposure and outcome can be explained by known biological mechanisms. It strengthens the case for causation when the association observed in epidemiological studies can be linked to a logical and scientifically established biological pathway. |
Biomarkers | Biomarkers are biological molecules found in blood, other body fluids, or tissues that are a sign of a normal or abnormal process, or of a condition or disease. They can be used for early detection, diagnosis, or monitoring of disease progression or response to treatment. |
Biostatistics | The application of statistical principles to medical, public health, and biological sciences. |
Bland-Altman Plot | A Bland-Altman Plot is a graphical tool used in assessing agreement between two measurement methods. It displays the difference between two measurements on the y-axis against their average on the x-axis, providing insights into the agreement, potential bias, and any systematic errors between the methods. |
Blinded Study | A study in which study subjects do not know to which treatment arm they have been randomized. |
Canonical Correlation | Canonical correlation analysis assesses the linear relationship between two sets of variables. It identifies linear combinations of variables from each set, known as canonical variates, to maximize the correlation between these sets. |
Case Report | An observational study that gives a detailed report of a subject's specific features. This is usually an inexpensive study but lacks a comparison group and has no specific research question. Statistical analyses cannot generally be applied to case reports since they consist of a single subject. These usually generate hypotheses to be formally tested in a future study. |
Case Series | An observational study that gives a detailed report of a small number of subjects along with their specific features. As with a case report, this is usually an inexpensive study but lacks a comparison group and has no specific research question. Statistical analyses cannot generally be applied to case series since they consist of a small number of subjects. These usually generate hypotheses to be formally tested in a future study. |
Case-Control Study Design | An observational study that usually begins by identifying subjects with the disease of interest (cases) and then identifies disease-free controls that are as similar to the cases as possible (e.g., in age, weight, and biological sex). These studies are especially useful when investigating rare diseases. The prevalence of the disease cannot be estimated with these studies. Rather, the odds ratio, which compares the odds of exposure between cases and controls, is the statistical measure of focus. |
Case-fatality Rate | The case-fatality rate is the proportion of individuals diagnosed with a particular disease who die from that disease within a specified period. It provides a measure of the lethality of the disease and is crucial for understanding the severity of outbreaks or health conditions. |
Causal Relationship | A causal relationship exists when a change in one variable directly causes a change in another. In epidemiology, establishing causation requires evidence beyond simple association, typically through controlled experiments, cohort studies, and the fulfillment of specific criteria (like temporality, dose-response relationship, and biological plausibility). |
Causation | Causation implies that exposure to a certain factor (like a virus, behavior, or environmental toxin) directly results in an outcome (such as a disease or health condition). Demonstrating causation requires strong evidence that the factor is not only associated with the outcome but is responsible for causing it. |
Censoring | Censoring in survival analysis occurs when a subject's observation ends before the event of interest is observed, so the exact event time is unknown. It is common in longitudinal studies where some individuals do not experience the event of interest within the study period or are lost to follow-up; their data are considered censored. |
Central Limit Theorem | The Central Limit Theorem states that, regardless of the shape of the population distribution (provided it has a finite variance), the sampling distribution of the sample mean will be approximately normally distributed if the sample size is sufficiently large. This theorem is fundamental for making statistical inferences about population parameters. |
Chi-square Distribution | The chi-square distribution is a probability distribution that describes the distribution of the sum of squared standard normal deviates. It is widely used in hypothesis testing and confidence interval construction for the variance of a normal distribution and in the analysis of categorical data. It is a special case of the Gamma distribution. |
Chi-square Test | The chi-square test is a statistical test used to determine whether there is a significant association between categorical variables. It compares the observed frequencies of categories with the expected frequencies under a null hypothesis, helping researchers assess the independence or dependence of variables. |
Cluster Sample | A cluster sample is a sampling method where the population is divided into clusters, and entire clusters are randomly selected for inclusion in the study. This approach is practical when it is logistically challenging or expensive to sample individual units, making it more efficient to study groups or clusters. |
Coefficient of Determination | The coefficient of determination, often denoted as R-squared, quantifies the proportion of the variability in the dependent variable that can be explained by the independent variable(s) in a regression model. It ranges from 0 to 1, with higher values indicating a better fit of the model to the data. |
Coefficient of Variation | The coefficient of variation (CV) is a measure of relative variability and is calculated as the ratio of the standard deviation to the mean, expressed as a percentage. It provides a standardized way to compare the dispersion of different datasets, particularly when comparing variables with different units of measurement. |
Cohort Study Design | An observational study involving a group (cohort) of disease-free subjects who are followed prospectively in time to see if or when they are exposed to a risk factor and whether or not they develop the disease. These studies can assess the temporal association between the exposure and disease and can investigate several diseases at the same time. They can, however, be expensive and time consuming, require large numbers of subjects (especially for rare diseases), and are subject to confounding. These studies can measure the prevalence and incidence of a disease. |
Conditional Probability | Conditional probability is the probability of one event occurring given that another event (or a set of events) has already occurred. It is calculated by dividing the joint probability of the two events by the probability of the conditioning event, providing insight into the likelihood of an outcome in specific circumstances. |
Confidence Interval | A confidence interval is a range of values constructed around a point estimate, such as a sample mean or proportion, to provide a plausible range within which the true population parameter is likely to fall with a certain level of confidence. Common confidence levels include 95% or 99%, reflecting the degree of certainty in the interval. |
Confirmation Bias | Confirmation bias occurs when researchers or study participants subconsciously focus on information or data that confirm their preexisting beliefs or hypotheses, while disregarding evidence that contradicts them. This can lead to a biased interpretation of the study results. |
Confounding Bias | Confounding bias occurs when an outside variable affects both the independent and dependent variables, creating a spurious or distorted apparent relationship between them. Proper statistical methods and study design must be employed to adjust for confounding variables in order to understand the true relationship between the variables of interest. |
Confounding Variable | A confounding variable is an external influence that can affect the apparent association between the independent variable (exposure) and the dependent variable (outcome) in a study. It is a factor that is associated with both the exposure and the outcome but is not an intermediate step in the causal pathway. Properly identifying and controlling for confounding variables is crucial for accurately assessing the true relationship between exposure and outcome. For example, a researcher may want to explore the effect of exercise (independent variable) on cardiovascular disease (CVD, dependent variable). A subject's smoking status is a confounding variable here since it is likely related to both their amount (and intensity) of exercise as well as their likelihood of developing cardiovascular disease. It will be difficult for a researcher to make conclusive statements about the effect of exercise on CVD without accounting for smoking status. |
Continuous Outcome | A continuous outcome refers to a variable that can take any value within a range, often including decimals or fractions. Examples include measurements like blood pressure, weight, or serum cholesterol levels. Analyzing continuous outcomes typically involves techniques such as t-tests, ANOVA, or linear regression to explore relationships between variables. |
Convenience Sample | A convenience sample is a non-probabilistic sampling method where individuals are selected based on their ease of availability or accessibility. While convenient for researchers, this method may introduce bias as it does not ensure a representative sample from the entire population. |
Counting Techniques | Counting techniques, often referred to as combinatorics, involve methods for systematically counting the number of possible outcomes in various scenarios. This includes permutations and combinations, essential for solving problems related to arranging and selecting items, respectively, in probability and statistics. |
Cross-sectional Study Design | An observational study conducted at a single point in time. These are often inexpensive, easy to implement, ethical, and can measure disease and/or exposure prevalence. Drawbacks to these studies, however, are that they do not capture temporal information (such as whether the exposure preceded the outcome) and that they can be subject to non-response bias. |
Database | A database is a structured collection of data, organized for efficient retrieval and management. In epidemiology and biostatistics, databases are often used to store and analyze large datasets containing information about populations, diseases, and other relevant variables. |
Deep Learning Model | Deep learning models, such as neural networks, involve complex architectures with multiple layers to learn intricate patterns from data, often used in image and speech recognition. |
Degrees of Freedom | Degrees of freedom represent the number of independent values or quantities in a statistical calculation that are free to vary. Degrees of freedom are essential in various statistical tests, such as t-tests, ANOVA, and regression analysis. Understanding degrees of freedom is crucial for determining critical values, assessing the variability in data, and making valid statistical inferences. |
Dependent Variable | The dependent variable is the outcome or response variable in a statistical study. It is the variable that researchers observe and measure to assess the impact of one or more independent variables, providing insights into the relationships and patterns within the data. |
Descriptive Statistics | Descriptive statistics involve methods for summarizing and presenting data in a meaningful way. This includes measures such as mean, median, mode, range, and standard deviation, which help describe the central tendency and variability of a dataset. |
Discriminant Function | In discriminant analysis, the discriminant function is a mathematical function used to distinguish between two or more groups based on observed variables. It optimally combines the variables to maximize the differences between groups. |
Double-Blinded Study | In a double-blinded study design, both the participants and the researchers are unaware of who is receiving the treatment and who is in the control group. This design helps minimize bias and ensures that the assessment of outcomes is objective and unbiased. |
Ecological Study Design | An ecological study design examines the relationship between variables at the population level rather than the individual level. It explores associations between exposures and outcomes based on group-level data, offering insights into population health patterns. |
Effect Modification | Effect modification occurs when the relationship between two variables is influenced by the presence of a third variable. It is essential in epidemiological research, as it helps identify situations where the effect of an exposure on an outcome varies depending on the level of another factor. |
Efficacy | Efficacy refers to the ability of an intervention or treatment to produce the desired effect under ideal or controlled conditions. It is often assessed in clinical trials and research studies to determine the effectiveness of a specific approach in achieving its intended outcome. |
Equivalence Trial | An equivalence trial is designed to determine whether two treatments are equivalent or equally effective within a predetermined margin of clinical significance. These trials aim to establish whether a new treatment is as good as an existing standard, without necessarily demonstrating superiority. Equivalence is typically assessed by comparing confidence intervals or conducting hypothesis tests to evaluate whether the treatments fall within the predefined equivalence margin. |
Event | A collection of outcomes which are of interest to the investigator. |
Exclusion Bias | Exclusion bias occurs when certain individuals or groups are systematically excluded from participation in a study, leading to results that may not be generalizable to the entire population. This can happen due to strict eligibility criteria, leading to a study population that does not accurately reflect the broader population of interest. |
Experiment | Any system or process that, when carried out, can result in more than one possible outcome. |
Experimental Study Design | A design in which study subjects are exposed to a treatment or procedure and then followed for the presence or absence of an outcome of interest. Common experimental study designs are randomized controlled trials. These trials typically seek to establish a cause-and-effect relationship between the treatment and outcome. |
Exposure | Exposure refers to the presence or experience of a factor, condition, or treatment that individuals or groups may encounter. In epidemiology, exposure often relates to a potential risk factor, and understanding its association with health outcomes is crucial in investigating disease causation. |
Factor Analysis | Factor analysis is a statistical method used to identify underlying patterns or latent factors that explain the observed correlations among variables. It is often employed in survey research and psychometrics to understand the structure of relationships among a set of observed variables. |
Finite Population | A finite population refers to a set of elements that can be enumerated and counted. In contrast to infinite populations, where the size is conceptually limitless, finite populations have a known and countable number of elements. |
Fisher's Exact Test | Fisher's exact test is a statistical test used to determine whether there are significant associations between two categorical variables. It is particularly useful when sample sizes are small, and the chi-square test may not be appropriate. |
Fixed Effect | In the context of mixed effects models, fixed effects represent constant parameters that apply to the entire population under study. |
Hawthorne Effect | The Hawthorne Effect refers to the phenomenon where study participants alter their behavior because they are aware they are being observed or part of a study. This can impact the outcomes of research, as the changes in behavior may not be a result of the intervention being tested but rather the awareness of being observed. It highlights the importance of considering how the awareness of study participation might influence the results. |
Hazard Function | The hazard function in survival analysis describes the instantaneous rate of occurrence of an event, such as disease or failure, at a specific point in time. It is fundamental for understanding the risk of events over time in longitudinal studies. |
Heteroscedasticity | Heteroscedasticity refers to the unequal variance of errors in a regression model. Detecting and addressing heteroscedasticity is important in ensuring the reliability of statistical inferences and predictions based on regression analyses. |
High Dimensional Data | High dimensional data refers to datasets with a large number of variables or features compared to the number of observations. Techniques like regularization are often employed to handle the challenges posed by high dimensional datasets. |
Homogeneity | Homogeneity in statistics indicates the similarity or equality of characteristics within a group or between groups. In research, homogeneity is often assessed to ensure that study samples are comparable and that statistical assumptions are met. |
Hypothesis Testing | Hypothesis testing is a statistical method used to assess the validity of a claim or hypothesis about a population parameter. It involves comparing sample data to a null hypothesis to determine whether any observed effects are statistically significant or likely due to chance. |
Incidence Rate | The incidence rate is a measure in epidemiology that quantifies the occurrence of new cases of a specific disease or health event in a population over a defined time period. It is often expressed as the number of new cases per unit of population at risk. |
Independent Events | Independent events in probability theory are events whose occurrence or non-occurrence does not affect each other. The probability of one event happening does not change based on the occurrence or non-occurrence of another event. Note: Two events that each have nonzero probability cannot be both independent and mutually exclusive. |
Independent Variable | An independent variable is a variable that is manipulated or selected by the researcher in an experiment or observational study. It is the variable thought to have an effect on the dependent variable and is used to investigate relationships and causation. |
Indication Bias | Indication bias refers to a systematic error introduced in observational studies or analyses due to the presence of a specific indication or reason for receiving a particular treatment or intervention. This bias arises when individuals with different baseline characteristics or health conditions are more or less likely to be exposed to a treatment, leading to distorted associations between exposure and outcomes. |
Intent-to-Treat Analysis | Intent-to-treat analysis is an approach in clinical trials where participants are analyzed based on their assigned treatment group, regardless of whether they completed or fully adhered to the treatment. This method helps maintain the randomization principle and provides a more realistic assessment of treatment effectiveness. |
Interquartile Range | The interquartile range (IQR) is a measure of statistical dispersion that represents the range within which the middle 50% of the data values lie. It is calculated as the difference between the third quartile (Q3) and the first quartile (Q1) and is resistant to outliers. |
Interval Regression | Interval regression is a statistical method used when the dependent variable is censored or is known only to lie within an interval rather than observed exactly. It accounts for both exactly observed and interval-censored data and is commonly employed in survival analysis. |
Lead-time Bias | Lead-time bias occurs in studies that screen for diseases (like cancer), where early detection is mistaken for increased survival time without considering that the diagnosis was simply made earlier. This bias can give the false impression that early treatment prolongs life, even if the overall course of the disease remains unchanged. |
Length-time Bias | A bias in screening studies where slower-progressing diseases are more likely to be detected than faster-progressing ones, which can make a screening test appear more beneficial than it truly is by preferentially detecting less aggressive diseases. |
Likert Scale | A Likert scale is a commonly used rating scale in survey research that allows respondents to express their agreement or disagreement with a statement. It typically consists of a series of ordered response categories, such as "strongly agree," "agree," "neutral," "disagree," and "strongly disagree." |
Logistic Regression | Logistic regression is a statistical method used for modeling the probability of a binary outcome. It is particularly useful when the dependent variable is categorical and represents two possible outcomes (e.g., presence or absence of a condition) and may be influenced by one or more independent variables. |
Machine Learning Model | Machine learning models use algorithms to learn patterns from data and make predictions or decisions without being explicitly programmed. Examples include decision trees, support vector machines, and random forests. |
MANOVA (Multivariate Analysis of Variance) | MANOVA is an extension of analysis of variance (ANOVA) that allows the simultaneous analysis of multiple dependent variables. It is useful when there are multiple outcome variables, providing insights into group differences across these variables. |
McNemar Test | The McNemar test is a statistical test used to assess the significance of changes or differences in proportions between two related groups, often used in before-and-after studies or paired data. It is particularly suited for dichotomous categorical data. |
Measurement (Information) Bias | Measurement bias results from systematic differences in the way data on exposure or outcome is collected or classified in a study, potentially leading to incorrect conclusions about the relationship between exposure and outcome. It can arise from inaccurate measurement, recall issues, or misinterpretation of data. |
Measurement Error | Measurement error refers to the difference between a measured value and the true value of a variable. It can arise from various sources, such as instrumentation limitations or human error, and understanding and minimizing measurement error is crucial for accurate statistical analyses. |
Meta-Analysis | Meta-analysis is a statistical method that combines and analyzes data from multiple independent studies on a specific topic to obtain a more comprehensive and robust estimate of the overall effect. It helps synthesize evidence and draw more generalizable conclusions. |
Misclassification Bias | Misclassification bias happens when individuals or data points are incorrectly categorized with respect to exposure or outcome status, leading to an inaccurate estimation of the association between exposure and outcome. It can be differential (related to the outcome) or non-differential (unrelated to the outcome). |
Mixed Effects Model | A mixed effects model accommodates both fixed and random effects in the analysis. Fixed effects represent population parameters, while random effects account for variability within a sample, often arising from repeated measurements or nested structures. |
Modified Intent-to-Treat Analysis | Modified intent-to-treat analysis is a variation of the intent-to-treat analysis that includes participants who received at least one dose of the assigned treatment. This modification is often used to address practical considerations and better reflect real-world scenarios. |
Multinomial Logistic Regression | Multinomial logistic regression extends binary logistic regression to handle more than two categories in the dependent variable. It models the probability of each category relative to a reference category. |
Multiple Imputation | Multiple imputation is a statistical technique used to address missing data by generating multiple plausible values for each missing observation. It helps preserve the uncertainty associated with missing data and produces more accurate and reliable results in statistical analyses. |
Multiple Linear Regression | Multiple linear regression is an extension of simple linear regression that involves modeling the relationship between a dependent variable and two or more independent variables. It allows for the assessment of the combined effect of multiple predictors on the outcome variable. |
Multiplicity | Multiplicity refers to the issue of conducting multiple statistical tests within a study, increasing the likelihood of obtaining false-positive results. To address this, adjustments like Bonferroni correction are applied to control the overall Type I error rate and maintain the study's integrity. |
Mutually Exclusive Events | Events that cannot occur simultaneously. That is, if one event has occurred, the others cannot occur. Note: Mutually exclusive events that each have nonzero probability cannot be independent events. |
Negative Binomial Regression | The negative binomial distribution is used to model count data with overdispersion, where the variance is greater than the mean. Negative binomial regression extends this concept to regression modeling for count data. |
Negative Binomial Zero-Inflated Regression | This model combines negative binomial regression with a component to account for excess zeros, making it suitable for count data with a high proportion of zero values. |
Negative Binomial Zero-Truncated Regression | Like the zero-inflated model, this regression model is a variant of negative binomial regression for count data, but it is designed for outcomes that cannot take the value zero, so it models only the distribution of positive counts. |
Nominal Data | Nominal data represent categories or labels without a specific order or hierarchy. In research, these categories are used to classify variables into distinct groups. For example, gender (male, female) or blood type (A, B, AB, O) are nominal variables. |
Non-inferiority Trial | A non-inferiority trial is conducted to demonstrate that a new treatment is not clinically worse than an existing standard treatment by more than a predetermined margin. Instead of aiming to show superiority, these trials seek to establish that the new treatment is at least as effective as the standard. Non-inferiority margins are defined based on clinical judgment and prior research, and statistical analyses involve hypothesis testing or confidence interval estimation to assess whether the new treatment meets the non-inferiority criterion. |
Nonparametric Statistics | Nonparametric statistics are methods that do not rely on specific assumptions about the underlying distribution of data. These techniques are particularly useful when data may not meet the requirements of parametric methods. Examples include the Wilcoxon rank-sum test (also known as the Mann-Whitney U test) and the Wilcoxon signed-rank test. |
Nonresponse Bias | Nonresponse bias occurs when individuals who do not participate in a study differ systematically from those who do, potentially leading to misleading conclusions. It is crucial to address nonresponse bias to ensure that study findings accurately reflect the characteristics of the target population. |
Normal Distribution | The normal distribution, often called the bell curve, is a symmetrical probability distribution characterized by a specific mean and standard deviation. Many biological and natural phenomena, such as human height or test scores, tend to follow a normal distribution. Understanding this distribution is fundamental in statistical analysis. |
Null Hypothesis | The null hypothesis is a fundamental concept in hypothesis testing. It states that there is no difference, effect, or relationship between variables in a study. Researchers compare the observed data to what would be expected under the null hypothesis to determine statistical significance. |
Number Needed to Treat (NNT) | The Number Needed to Treat (NNT) is a measure used in clinical trials to convey the effectiveness of a medical intervention. It represents the number of patients who need to be treated to prevent one additional adverse outcome. A lower NNT indicates a more effective intervention. |
Observational Study Design | Observational study designs involve the collection and analysis of data without intervention or manipulation by the researcher. These studies observe and record naturally occurring variables and their relationships. Common observational study designs are case reports/series, cross-sectional studies, case-control studies, and cohort studies. Conclusions of observational studies are limited to descriptions and association. |
Observer Bias | Observer bias occurs when the researcher's expectations, beliefs, or preferences unconsciously influence the data collection or interpretation process. This can lead to a skewed presentation of the study results, as the researcher might selectively notice or emphasize outcomes that align with their hypotheses. |
Odds Ratio | The odds ratio is a statistical measure commonly used in biostatistics to express the relationship between two groups in terms of their odds of a specific outcome. It is particularly useful in studies comparing the likelihood of an event happening in one group relative to another, such as the odds of developing a disease in individuals with and without a particular risk factor. An odds ratio of 1 suggests no association, while values greater or less than 1 indicate increased or decreased odds, respectively. |
One-tailed Hypothesis Test | A one-tailed hypothesis test is a method for evaluating a directional research hypothesis. Before conducting the study, researchers predict the direction of the expected effect, such as an increase or decrease in a particular outcome due to an intervention. The one-tailed test then assesses statistical significance in only that predicted direction, enhancing sensitivity to detecting effects in the specified hypothesis. This approach is valuable when researchers have a clear expectation about the direction of the relationship they are investigating. |
Ordinal Logistic Regression | Ordinal logistic regression models the relationship between an ordinal dependent variable and one or more independent variables. It is suitable for outcomes with ordered categories. |
Ordinal Outcome | An ordinal outcome is a categorical variable with ordered levels or ranks, but the intervals between the ranks may not be uniform. Examples include disease severity categories (mild, moderate, severe), allowing for the ranking of outcomes without precise measurement of the intervals between them. |
Outcome | The result of one completed trial of an experiment. |
Outcome Variable | An outcome variable is the variable that researchers are interested in studying or predicting. Also known as the dependent variable, it represents the response or outcome being measured in an experiment or observational study. For instance, in a clinical trial, the outcome variable might be a patient's improvement level after receiving a specific treatment. |
Overdiagnosis Bias | Overdiagnosis bias occurs when screening identifies diseases that would not have caused symptoms or death. It leads to the overestimation of the benefit of screening programs by diagnosing conditions that would remain asymptomatic, potentially leading to unnecessary treatment. |
p-value | The probability of observing the test statistic obtained from the data, or anything more extreme, if the null hypothesis is in fact true. This is usually compared to a significance level, often 0.05. If it is smaller than the significance level of the test, this is taken as substantial evidence against the null hypothesis. Logically, a small p-value means what we observed in the data was not likely to have occurred if the null hypothesis is true. |
Paired t-Test | The paired t-test is a statistical method used to compare the means of two related groups. In a biostatistics context, it is often applied when measurements are taken on the same subjects before and after an intervention, like assessing the effectiveness of a drug. The paired t-test helps determine if there is a statistically significant difference between the two sets of measurements. |
Parameter | A parameter is a numerical characteristic that describes a feature of a population. In biostatistics, parameters are used to quantify aspects such as the average, variability, or proportion within a population. For example, the population mean or standard deviation are parameters that provide important insights into the central tendency and dispersion of a set of data. |
Pearson Correlation Coefficient | The Pearson correlation coefficient, often denoted as "r," is a statistical measure that quantifies the strength and direction of a linear relationship between two continuous variables. In a biostatistics context, it might be used to explore the association between variables like body weight and blood pressure. A value of +1 indicates a perfect positive linear relationship, -1 indicates a perfect negative relationship, and 0 suggests no linear relationship. |
Per Protocol Analysis | Per protocol analysis is a method used in clinical trials to assess treatment effectiveness by analyzing only the data from participants who strictly adhered to the study protocol. This approach helps evaluate the intervention's impact under optimal conditions, providing insights into the treatment's efficacy when implemented as intended. |
Person-time Rate | In epidemiology, the person-time rate, also known as the incidence rate, is a measure of disease occurrence that accounts for varying lengths of time individuals are at risk. It is calculated by dividing the number of new cases by the total person-time at risk. This rate is valuable for studying diseases with varying exposure times, such as infectious diseases. |
Phase 0 Trials | Phase 0 trials are the first stage of clinical research for new drugs or treatments, conducted with a very small number of participants. These trials aim to gather preliminary data on how a drug is processed in the body (pharmacokinetics) and how it affects the body (pharmacodynamics), but not to test efficacy or safety. Phase 0 trials help researchers decide if a drug should move on to further testing. |
Phase I Trials | Phase I trials are the initial phase of testing a new drug or treatment in humans after preliminary safety has been established in preclinical studies. These trials involve a small group of participants and aim to determine the safest dose range, identify side effects, and learn how the drug is metabolized and excreted. The primary focus is on safety rather than efficacy. |
Phase II Trials | Phase II trials are conducted to assess the efficacy of a drug or treatment for a particular condition or disease, after its safety has been established in Phase I trials. These trials involve a larger group of participants and aim to obtain preliminary data on whether the drug works in people who have a certain disease or condition. They also continue to evaluate safety and side effects. |
Phase III Trials | Phase III trials are conducted to confirm the effectiveness of a drug or treatment, monitor side effects, compare it to commonly used treatments, and collect information that will allow the drug or treatment to be used safely. These trials involve large groups of participants and are often the final step before a drug is approved for public use by regulatory bodies. The results from Phase III trials can lead to the drug or treatment being made available to the public. |
Phase IV Trials | Phase IV trials, also known as post-marketing surveillance trials, are conducted after a drug or treatment has been approved for public use. These trials continue to monitor the drug's effectiveness and long-term safety in a larger, more diverse population. Phase IV trials can uncover rare or long-term adverse effects and can lead to changes in how the drug is used or to further restrictions. |
Pilot Study | A pilot study is a small-scale research project conducted before a larger study to assess feasibility, identify potential issues, and refine the study design. In biostatistics, pilot studies help researchers refine data collection methods, estimate variability, and determine sample sizes for the main study. |
Placebo Effect | The Placebo Effect is a psychological response where individuals experience a perceived improvement in their condition after receiving a treatment that has no therapeutic effect. This effect is attributed to the individual's expectations of improvement rather than the treatment itself. It underscores the importance of using placebo controls in clinical trials to differentiate between the actual efficacy of a treatment and the effects of participants' expectations. |
Poisson Distribution | The Poisson distribution is a probability distribution used to model the number of events that occur within a fixed interval of time or space. In biostatistics, it might be applied to study rare events, such as the number of hospital admissions for a specific condition in a given day. The Poisson distribution is characterized by a single parameter, the average rate of event occurrence. |
Poisson Regression | Poisson regression models count data with the assumption that the mean and variance are equal. It is commonly used when analyzing rare events. |
Poisson Zero-Inflated Regression | This regression model combines Poisson regression with a zero-inflated component to address excess zeros in count data. |
Poisson Zero-Truncated Regression | Unlike the zero-inflated model, this regression model applies to count data in which zero values cannot be observed, so it models only the distribution of positive counts. |
Population | In statistics, a population refers to the entire group of individuals, items, or units that share a common characteristic and are of interest to the researcher. In biostatistics, a population might be all individuals with a specific medical condition or a particular group exposed to a certain risk factor. |
Power | In statistical analysis, power is the probability that a study will correctly detect a true effect or difference if it exists. It is influenced by factors such as sample size, effect size, and significance level. This probability is the complement of the probability of type II error and denoted as 1-beta. In a biostatistics course, understanding power is crucial for designing studies that can reliably detect meaningful relationships or treatment effects. |
Predictor Variable | A predictor variable, also known as an independent variable, is a variable used in statistical models to predict or explain the variation in the outcome variable. In biostatistics, for instance, if researchers are investigating the factors influencing blood pressure, predictor variables might include age, diet, and exercise habits. |
Prevalence | Prevalence is a measure in epidemiology that represents the proportion of a population with a specific condition at a particular point in time. In a biostatistics context, prevalence might be used to describe the percentage of individuals in a community with a particular disease or health characteristic. |
Principal Components Analysis | Principal Components Analysis (PCA) is a statistical technique used to transform and simplify data by identifying patterns and reducing dimensionality. PCA could be applied to a dataset with numerous correlated variables, aiming to identify the principal components that explain the majority of the variance in the data. |
Probability | Probability is a fundamental concept in statistics that quantifies the likelihood of an event occurring. In biostatistics, probability is used to express the chance of specific outcomes, such as the probability of a patient recovering from a particular treatment. |
Probit Regression | Probit regression models the relationship between binary outcomes and independent variables using the cumulative distribution function of the standard normal distribution. |
Publication Bias | Publication bias arises when studies with positive or significant findings are more likely to be published than studies with negative or inconclusive results. This bias can skew the literature, making interventions seem more effective than they are, as studies showing no effect or harmful effects are less visible. |
Qualitative Data | Qualitative data consist of non-numeric information that describes the characteristics or qualities of an object or phenomenon. These data might include categorical variables like blood type or treatment response categories. |
Quantile Regression | Quantile regression models the relationship between variables across different quantiles of the distribution, providing a more comprehensive understanding of how the predictors influence different parts of the response variable distribution. |
Quantitative Data | Quantitative data, in contrast to qualitative data, are numerical measurements that represent quantities or amounts. These data are frequently used to describe variables like height, weight, or blood pressure. |
Random Effect | Random effects account for variability within a sample that is not of primary interest, such as individual differences in a repeated-measures study. |
Random Error | Random error is the variability in data that occurs by chance and is unpredictable. Random error can affect the precision of measurements or observations. By averaging multiple measurements or increasing sample sizes, researchers can mitigate the impact of random error. |
Random Variable | A random variable is a variable whose values are determined by chance. This could represent the number of adverse events in a clinical trial or the time until a patient experiences a specific outcome. Probability distributions are often used to model random variables in statistical analyses. |
Randomized Controlled Trial | An experimental study where subjects are randomized to receive one of two (or more) treatments. Subjects may be randomized to a placebo or treatment group, for example. If it would be unethical to give a placebo, subjects are randomized to a "standard of care" group instead. These studies are the gold standard for determining the effect of the treatment(s) in that they minimize bias and confounding. They are often expensive, require monitoring to be sure subjects adhere to the assigned treatment, require careful ethical considerations, and may have limited generalizability if the real-world clinic setting differs from the experimental setting. |
Range | In statistics, the range is a measure that represents the difference between the maximum and minimum values in a dataset. For example, the range might be used to describe the variability in patient ages or blood pressure readings within a study sample. |
Recall Bias | Recall bias is a type of information bias that occurs when participants in a study do not accurately remember past events or exposures, leading to misclassification. It is more common in retrospective studies, especially when relying on participants’ memory for exposure data. |
Receiver Operating Characteristic Curve (ROC curve) | The ROC curve is a graphical representation used in assessing the performance of a binary classification model. It illustrates the trade-off between sensitivity and specificity at various thresholds, helping to identify the optimal threshold for a given model. |
Referral Bias | Referral bias happens when the study population is selected from a subset of the general population that has been referred for treatment or diagnosis, which may not be representative of the population at large. This can lead to skewed results that do not accurately reflect the relationship between exposure and outcome in the general population. |
Rejection Region | The rejection region is a critical concept in hypothesis testing. It refers to the range of values in the statistical test that leads to the rejection of the null hypothesis. Researchers compare the test statistic to critical values in the rejection region to determine whether there is enough evidence to reject the null hypothesis in favor of the alternative hypothesis. |
Reporting Bias | Reporting bias refers to selective revealing or suppression of information by study participants or researchers. For example, participants may under-report undesirable behaviors or outcomes, leading to inaccurate data. As another example, studies with positive results are more likely to be published (publication bias), and researchers may only report outcomes that are statistically significant (outcome reporting bias), ignoring other relevant data. |
Risk Difference | Risk difference is a measure in epidemiology that quantifies the absolute difference in the probability of an event occurring between two groups. It might be used, for example, to express the absolute difference in the risk of a side effect between patients receiving different treatments in a clinical trial. |
Risk Ratio | Risk ratio, also known as relative risk, is a measure in epidemiology that compares the probability of an event occurring in one group to the probability in another group. The risk ratio is often used to assess the association between an exposure and an outcome, such as comparing the risk of developing a disease in individuals with and without a specific risk factor. |
Robust Regression | Robust regression methods aim to provide reliable parameter estimates in the presence of outliers or violations of normality assumptions. |
Sample | A sample is a subset of individuals, items, or data points selected from a larger population for study. Researchers often use samples to make inferences about the characteristics of the entire population. The selection and analysis of a representative sample are crucial for drawing valid conclusions in research studies. |
Sample Space | A collection of all the possible outcomes from an experiment. |
Sampling Bias | A specific type of selection bias, sampling bias occurs when the sample chosen for the study does not accurately reflect the population from which it was drawn. This can lead to inaccurate conclusions because the participants might have characteristics that are significantly different from the general population. |
Sampling Distribution | The sampling distribution represents the distribution of a statistic (such as the mean or proportion) calculated from multiple samples of the same size taken from a population. Understanding the sampling distribution is crucial for making inferences about population parameters based on sample statistics. |
Sampling Frame | A sampling frame is the list or framework from which a sample is drawn. This might be a roster of eligible participants for a clinical trial or a database of patients with a specific medical condition. |
Screening Test Metrics | Sensitivity: The proportion of true positives among all actual positives, indicating a test's ability to correctly identify individuals with the condition. It is a conditional probability that the test is positive given the subject has the condition. It is often expressed as a percentage. Specificity: The proportion of true negatives among all actual negatives, showing a test's accuracy in correctly identifying individuals without the condition. It is a conditional probability that the test is negative given the subject does not have the condition. It is often expressed as a percentage. Positive Predictive Value (PPV): The probability that a positive test result is true, providing insights into the likelihood of having the condition given a positive result. It is a conditional probability that the subject has the condition given a positive test result. It is often expressed as a percentage. Negative Predictive Value (NPV): The probability that a negative test result is true, indicating the likelihood of not having the condition given a negative result. It is a conditional probability that the subject does NOT have the condition given a negative test result. It is often expressed as a percentage. |
Selection Bias | This occurs when the participants included in the study are not representative of the target population, leading to skewed results. It can happen during the selection process for study participants, where certain groups may be over- or under-represented. This bias affects the generalizability of the study findings to the broader population. |
Simple Random Sample | A simple random sample is a subset of a population in which every member has an equal chance of being selected. Simple random sampling ensures that each individual in the population has an equal probability of inclusion, enhancing the generalizability of study findings to the entire population. |
Snowball Sample | A snowball sample is a non-probabilistic sampling method where existing participants recruit new participants. This approach may be useful when studying rare or hard-to-reach populations, such as individuals with specific medical conditions who are connected through support groups. |
Social Desirability Bias | This bias occurs when respondents in a study provide answers they believe are more socially acceptable or favorable, rather than being truthful. This can lead to inaccurate data, especially in self-reported measures related to sensitive or controversial topics. |
Spearman Rank Correlation Coefficient | The Spearman rank correlation coefficient, denoted as "ρ," is a non-parametric measure of the strength and direction of the monotonic relationship between two variables. It could be used to assess the correlation between the rank-ordered values of two variables when the assumption of linearity is not met. |
Standard Deviation | Standard deviation is a measure of the amount of variability or dispersion in a set of values. It is used to quantify the spread of data points around the mean. A higher standard deviation indicates greater variability. |
Standard Error | Standard error is an estimate of the variability of a sample statistic. It is crucial for constructing confidence intervals and conducting hypothesis tests. It provides an understanding of how much the sample statistic is expected to vary from the true population parameter. |
Standard Normal Distribution | The standard normal distribution is a specific normal distribution with a mean of 0 and a standard deviation of 1. This distribution is often used for hypothesis testing and constructing confidence intervals. Z-scores, representing the number of standard deviations a data point is from the mean, are commonly employed in standard normal distribution calculations. |
Standardized Mortality Ratio (SMR) | The SMR is the ratio of the observed number of deaths in a specific population to the number of deaths expected if that population experienced the mortality rates of a standard population. It helps assess whether a particular population has higher or lower mortality than the standard. |
Statistic | A statistic is a numerical measure calculated from a sample and used to estimate a corresponding parameter in the population. Statistics such as the mean, median, and correlation coefficient are commonly employed to summarize and analyze data. |
Statistical Inference | Statistical inference involves making predictions or generalizations about a population based on a sample of data. In biostatistics, statistical inference is fundamental for drawing conclusions about the effectiveness of treatments, the presence of associations, or the prevalence of diseases within a population. |
Statistical Significance vs. Clinical Relevance | Statistical significance indicates whether an observed effect is unlikely to be due to chance alone, while clinical relevance assesses whether the observed effect is meaningful in a practical or clinical context. Statistical significance can be achieved with large sample sizes in settings that have little clinical relevance. For example, suppose the prevalence of a condition was 12.1% in 2005 and 12.2% in 2023: a large sample size may lead to statistical significance, but if an increase of 0.1 percentage points in prevalence does not warrant changes to clinical practice, its clinical relevance is low. It's essential to consider both aspects in the interpretation of study findings. |
Stratified Random Sample | A stratified random sample involves dividing a population into subgroups or strata and then randomly selecting samples from each stratum. This method is useful when there are distinct subgroups with different characteristics that need representation in the sample. |
Student's t-Test | Student's t-test is a statistical method used to compare the means of two independent groups. This test might be applied, for example, to assess whether there is a significant difference in the mean blood pressure between patients receiving different treatments in a clinical trial. |
Superiority Trial | A superiority trial in clinical research aims to demonstrate that a new treatment or intervention is superior to an existing standard of care or placebo. These trials typically use hypothesis testing to compare outcomes between groups, with the null hypothesis stating that there is no difference between the treatments and the alternative hypothesis asserting that the new treatment is superior. |
Surveillance Bias | Surveillance bias, also known as detection bias, occurs when there is a systematic difference in how outcomes are diagnosed or reported among groups being studied, often due to increased monitoring or scrutiny of a particular group. |
Survivorship Bias | This bias happens when conclusions are drawn from a subset of data or participants that "survived" a particular process, ignoring those that did not make it through the process. It can lead to overly optimistic results because the analysis does not account for all outcomes, including failures or dropouts. |
Systematic Random Sample | A systematic sample involves selecting every kth individual from a list after an initial random start. This method can be efficient when studying large populations, providing a systematic way to obtain a representative sample. |
Time-to-event Outcome | In epidemiology, time-to-event outcomes involve measuring the time until a specific event occurs, such as disease onset or death. Survival analysis techniques are commonly used to analyze and interpret these outcomes. |
Tobit Regression | Tobit regression is used when the dependent variable is censored, observed only within a certain range. It accounts for both observed and censored data. |
Truncated Regression | Truncated regression is applied when the dependent variable is only observed for a subset of the population, typically due to a threshold or limit. |
Two-tailed Hypothesis Test | A two-tailed hypothesis test assesses whether the observed data is significantly different from what is expected in either direction. This approach is used when researchers are interested in detecting any significant effect, regardless of its direction. The critical region is divided between both tails of the distribution. |
Type I Error | Rejecting the null hypothesis when it is true. The probability of this error is the significance level of the test, alpha. |
Type II Error | Failing to reject the null hypothesis when it is indeed false. The probability of this error depends on the true value of the parameter (as well as the sample size and significance level) and is called beta. |
Wilcoxon-Mann-Whitney Test | The Wilcoxon-Mann-Whitney test, also known as the Mann-Whitney U test, is a non-parametric statistical method used to assess whether there is a significant difference between two independent groups. This test might be applied, for example, when comparing the distribution of a continuous variable, such as the response to a treatment, between two groups. It does not assume normal distribution and is useful when the data violates parametric assumptions. |
Wish Bias | Wish bias is a form of bias where the expectations or wishes of participants or researchers influence the outcomes of a study, potentially affecting the accuracy of the results. It can manifest through placebo effects or through biased reporting of outcomes. |
z-score | A z-score, or standard score, is a measure of how many standard deviations a particular data point lies from the mean of its distribution. z-scores are often used to assess the relative position of individual data points within a dataset. Understanding z-scores is crucial for identifying outliers, interpreting the significance of individual observations, and making comparisons across different datasets. |
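
Several entries in this glossary (Screening Test Metrics, Risk Difference, Risk Ratio, Odds Ratio, and Number Needed to Treat) are computed from the counts in a 2x2 table. The Python sketch below shows one way these quantities might be calculated. The function names and all counts are hypothetical and chosen only for illustration, and the NNT formula used here (the reciprocal of the absolute risk difference) is the standard definition rather than something stated in the glossary entry.

```python
def screening_metrics(tp, fp, fn, tn):
    """Screening test metrics from 2x2 counts (hypothetical helper).

    tp: test positive, condition present   fp: test positive, condition absent
    fn: test negative, condition present   tn: test negative, condition absent
    Assumes every denominator below is nonzero.
    """
    return {
        "sensitivity": tp / (tp + fn),  # P(test + | condition present)
        "specificity": tn / (tn + fp),  # P(test - | condition absent)
        "ppv": tp / (tp + fp),          # P(condition present | test +)
        "npv": tn / (tn + fn),          # P(condition absent | test -)
    }


def risk_measures(events_a, total_a, events_b, total_b):
    """Compare group A with reference group B (hypothetical helper).

    Assumes both groups have some events and some non-events,
    so that no ratio divides by zero.
    """
    risk_a = events_a / total_a
    risk_b = events_b / total_b
    odds_a = events_a / (total_a - events_a)
    odds_b = events_b / (total_b - events_b)
    risk_difference = risk_a - risk_b
    return {
        "risk_difference": risk_difference,
        "risk_ratio": risk_a / risk_b,
        "odds_ratio": odds_a / odds_b,
        # NNT as the reciprocal of the absolute risk difference;
        # undefined (infinite) when the two risks are equal.
        "nnt": 1 / abs(risk_difference) if risk_difference != 0 else float("inf"),
    }


if __name__ == "__main__":
    # Made-up screening counts: 100 people with the condition, 900 without.
    for name, value in screening_metrics(tp=90, fp=50, fn=10, tn=850).items():
        print(f"{name}: {value:.3f}")

    # Made-up trial counts: 10 of 100 treated and 20 of 100 controls
    # experience the adverse outcome.
    for name, value in risk_measures(10, 100, 20, 100).items():
        print(f"{name}: {value:.3f}")
```

In the made-up screening data, sensitivity and specificity are both high while PPV is noticeably lower, because PPV and NPV also depend on how common the condition is in the group tested.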