State-dependent speciation and extinction (SSE) models are powerful tools for testing hypotheses about whether specific traits drive diversification. However, recent research reveals that false positives, where a trait is incorrectly inferred to influence diversification, are a significant risk, primarily driven by incomplete phylogenetic trees and mis-specified sampling fractions. This article provides a comprehensive framework for researchers and drug development professionals to understand, detect, and avoid these pitfalls. We cover the foundational principles of SSE models, methodological best practices, targeted troubleshooting for common data issues, and validation techniques to ensure robust, reliable results in evolutionary and biomedical studies.
1. What is trait-dependent diversification and why is it important in evolutionary biology? Trait-dependent diversification examines how specific biological characteristics of lineages influence their rates of speciation and extinction. Understanding these patterns helps explain why some groups evolve into many species while others remain species-poor. In biomedical contexts, these principles can inform how disease mechanisms diversify and evolve across populations.
2. Why do trait-dependent diversification models sometimes produce false positives? Early models like MuSSE (Multiple State Speciation and Extinction) often detect spurious correlations between traits and diversification rates because they cannot separate the effect of your observed trait from other unmeasured (hidden) traits that may be the true drivers of diversification rate variation [1] [2]. This occurs when your observed trait is correlated with these hidden factors.
3. How can I avoid false positives when testing for trait-dependent diversification? Utilize newer modeling approaches that specifically account for unmeasured variables. The HiSSE (Hidden State Speciation and Extinction) and SecSSE (Several Examined and Concealed States-Dependent Speciation and Extinction) frameworks incorporate hidden states, allowing you to test whether diversification is better explained by your observed trait or by unmeasured factors [1] [2].
4. What is the difference between MuSSE, HiSSE, and SecSSE models?
Table 1: Comparison of Trait-Dependent Diversification Models
| Model | Key Features | Limitations | Best Use Cases |
|---|---|---|---|
| MuSSE | Tests dependence on a single observed trait with multiple states | High Type I error rate (false positives); cannot account for hidden traits | Preliminary screening (with caution) [1] |
| HiSSE | Accounts for hidden states affecting diversification; reduces false positives | Limited to binary observed traits [2] | Testing binary traits with suspected hidden drivers [1] [2] |
| SecSSE | Allows multiple examined AND concealed traits; handles traits with simultaneous states | More computationally intensive | Complex traits; testing multiple observed traits simultaneously [1] |
5. Can these evolutionary models be applied to biomedical research? Yes. While originally developed for macroevolutionary studies, these frameworks can analyze how disease-related traits or genetic variations diversify across populations. For instance, studies using high-diversity mouse populations (like Collaborative Cross or Diversity Outbred mice) effectively harness genetic variation to identify complex disease mechanisms, connecting evolutionary diversification principles to biomedical trait discovery [3].
6. How does the phylogenetic signal in traits affect my analysis? The mode of trait evolution significantly impacts your results. Traits with weak phylogenetic signal (evolving recently in diversification events) may produce different diversification patterns compared to highly conserved traits (with strong phylogenetic signal) [4]. Accounting for this in your model selection is crucial for accurate inference.
7. What experimental designs help validate trait-dependent diversification findings? Combine phylogenetic comparative methods with systems genetics approaches. Using genetically diverse reference populations (like Collaborative Cross mice) provides known genetic variation in a controlled framework, allowing you to test whether traits of interest genuinely affect diversification processes or are confounded by other factors [3].
Symptoms: Your MuSSE analysis indicates a strong relationship between your focal trait and diversification, but you suspect this might be driven by unmeasured factors.
Solution: Implement hidden state models to account for unmeasured variables.
Step-by-Step Protocol:
For complex traits with multiple states, use SecSSE [1]:
Validate with simulations:
Table 2: Key Research Reagent Solutions for Diversification Studies
| Reagent/Resource | Function | Application Example |
|---|---|---|
| SecSSE R Package | Implements several examined and concealed states-dependent speciation and extinction models | Testing multiple observed traits while accounting for hidden drivers [1] |
| HiSSE Model | Hidden State Speciation and Extinction framework | Testing binary traits with reduced false positives [2] |
| Collaborative Cross (CC) Mice | Genetically diverse recombinant inbred mouse panel | Studying how genetic variation influences trait diversification in disease models [3] |
| Diversity Outbred (DO) Mice | Outbred mouse population with high genetic diversity | High-precision mapping of traits and their evolutionary dynamics [3] |
| TreeSim R Package | Simulates phylogenetic trees under various diversification scenarios | Testing method performance and validating models [5] |
Symptoms: Your analysis detects apparent diversification rate shifts that might actually correspond to mass extinction events in the fossil record.
Solution: Implement models that simultaneously account for both lineage-specific shifts and mass extinction events.
Step-by-Step Protocol:
Simulate realistic scenarios:
Use the `sim.rateshift.taxa` function.
Check for temporal clustering of shifts:
Symptoms: Your trait of interest doesn't fit neatly into single-state categories (e.g., generalist species, polymorphic traits).
Solution: Utilize SecSSE's capacity for simultaneous states.
Step-by-Step Protocol:
Specify the SecSSE model:
Implement the correct likelihood calculation:
Table 3: Key Software Packages for Trait-Dependent Diversification Analysis
| Software/Package | Primary Function | Implementation | Key Reference |
|---|---|---|---|
| SecSSE | Several examined and concealed states-dependent speciation and extinction | R package | [1] |
| HiSSE | Hidden State Speciation and Extinction | R package | [2] |
| MuSSE | Multiple State Speciation and Extinction | R package (diversitree) | [1] |
| TreeSim | Simulates phylogenetic trees under complex scenarios | R package | [5] |
| Medusa | Detects shifts in diversification rates | R package | [5] |
| TreePar | Identifies changes in speciation/extinction through time | R package | [5] |
What are State-Dependent Speciation and Extinction (SSE) models, and what is their primary purpose? SSE models are a class of macroevolutionary models that link the diversification rates of lineages (speciation and extinction) to the state of a specific biological trait [6]. The primary purpose of these models is to test hypotheses about whether certain character states (e.g., having a specific morphological feature or ecological niche) are associated with higher or lower rates of speciation and extinction [6] [7].
What is the "False Positive" problem in SSE models, and why is it a significant concern? The false positive problem refers to the tendency of some SSE models to incorrectly detect a correlation between a trait and diversification rates when the trait is actually neutral—that is, when it does not influence speciation or extinction [8] [9]. This spurious correlation can occur when there is an unmeasured (hidden) trait that truly affects diversification rates, and its evolution is coincidentally correlated with your observed trait [7]. This is a major concern because it can lead to incorrect biological conclusions about the drivers of diversity [8].
How do more complex SSE models like HiSSE and SecSSE help mitigate false positives? Later-generation SSE models incorporate "concealed" or "hidden" states to account for the influence of unobserved traits [7] [9].
By comparing the fit of models that include trait-dependent diversification (ETD models) against models where only hidden states affect diversification (CTD or CID models), researchers can more rigorously test for genuine trait effects [9].
My analysis has low statistical power. What factors could be responsible? Low power in SSE analyses can stem from several sources, many of which are related to data quality and model specification [9]:
How does the completeness and quality of the phylogenetic tree impact my results? Phylogenetic tree completeness and accuracy are critical for reliable SSE inference [9].
Can I include fossil data in an SSE analysis, and what are the benefits? Yes, it is possible and often beneficial to incorporate fossil data. Methods exist to combine SSE models with the fossilized birth-death (FBD) process [8].
What is the recommended model comparison framework to avoid false conclusions? A robust model comparison framework is essential for reliable inference. You should compare a suite of models to determine the best-supported hypothesis [9]:
| Model Type | Description | Purpose in Model Comparison |
|---|---|---|
| Examined Trait Dependent (ETD) | Diversification depends on the observed trait. | The focal hypothesis to be tested. |
| Concealed Trait Dependent (CTD/CID) | Diversification depends on a hidden trait, not the observed one. | Controls for spurious correlations and false positives. Critical for model comparison. |
| Constant Rate (CR) | Diversification is constant across the tree and independent of any trait. | A simple null model. |
Your analysis should compare the fit of the ETD model against both the CTD and CR models. Strong support for an ETD model over a CTD model of equal complexity provides much more compelling evidence for a genuine effect of your observed trait [9].
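
The comparison described above can be sketched with AIC and Akaike weights. This is a minimal illustration, not output from any SSE package: the log-likelihoods and parameter counts below are hypothetical placeholders that you would replace with values from your own ETD, CTD, and CR fits.

```python
import math

def aic(log_lik: float, n_params: int) -> float:
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_lik

def akaike_weights(aic_by_model: dict) -> dict:
    """Convert AIC scores into relative model weights that sum to 1."""
    best = min(aic_by_model.values())
    rel = {m: math.exp(-0.5 * (a - best)) for m, a in aic_by_model.items()}
    total = sum(rel.values())
    return {m: r / total for m, r in rel.items()}

# Hypothetical fits (assumed values): model -> (log-likelihood, free parameters).
# ETD and CTD are given the same number of parameters so the comparison is fair.
fits = {"ETD": (-512.3, 10), "CTD": (-511.9, 10), "CR": (-530.4, 4)}

aics = {model: aic(ll, k) for model, (ll, k) in fits.items()}
weights = akaike_weights(aics)
best_model = min(aics, key=aics.get)

for model in sorted(aics, key=aics.get):
    print(f"{model}: AIC = {aics[model]:.1f}, weight = {weights[model]:.3f}")
print("Best-supported model:", best_model)
```

With these placeholder numbers the CTD model is preferred, which is the situation in which a trait effect suggested by the ETD fit should not be trusted.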
What are some key reagents and computational tools for SSE analysis? The following table lists essential "research reagents" – in this case, software and data – required for conducting SSE analyses.
| Tool / Data Type | Function / Explanation |
|---|---|
| R Statistical Environment | The primary platform for many SSE packages. |
| SecSSE R Package | Implements models for multiple examined and concealed states, helping to reduce false positives [7]. |
| HiSSE & MiSSE Models | Used to model hidden states and trait-independent diversification heterogeneity [9]. |
| RevBayes Software | A Bayesian framework for phylogenetic inference that can implement various SSE models [6] [8]. |
| Phylogenetic Tree with Branch Lengths | The fundamental input data representing evolutionary relationships and time. |
| Trait Data | The observed character states (e.g., binary or multi-state) for the tips of the tree. |
| Sampling Fraction Estimate | The proportion of species in the entire clade included in your tree, per trait state [9]. |
What is the general workflow for a robust SSE analysis? The following diagram outlines a recommended workflow designed to minimize false positives.
My analysis strongly supports trait-dependent diversification, but I am concerned it might be a false positive. What should I do?
I have a trait with more than two states. What model should I use, and what pitfalls should I avoid? For multi-state traits, you can use the MuSSE (Multiple State Speciation and Extinction) model. However, MuSSE is also known to be prone to false positives [7]. The recommended approach is to use the SecSSE model, which is specifically designed for multiple examined states while also accounting for the potential influence of a concealed trait, thereby providing a more robust test [7].
In the field of macroevolution, accurately identifying the traits that drive species diversification is crucial. However, a significant challenge persists: statistical methods designed to detect these relationships can produce false positives, leading to incorrect conclusions about evolutionary drivers. Understanding the sources of these errors is the first step toward more robust scientific discoveries.
This guide addresses the primary causes of false positives in trait-dependent diversification studies, providing troubleshooting and best practices to enhance the reliability of your research.
A false positive occurs when a statistical model incorrectly infers that a biological trait has a significant effect on speciation or extinction rates, when in reality no such relationship exists.
The MuSSE (Multiple State-dependent Speciation and Extinction) model is prone to false positives because it cannot separate the true effect of an observed trait from the influence of other, unobserved (hidden) traits that also affect diversification rates [1] [7]. The model assumes the trait in question is the sole driver, an assumption often violated in biological systems.
The SecSSE (Several Examined and Concealed States-dependent Speciation and Extinction) model incorporates the possibility of hidden traits [1] [7]. By accounting for these unmeasured variables, SecSSE significantly reduces false positives without sacrificing the statistical power to detect true trait-dependent diversification. When applied to previous studies that used MuSSE, SecSSE showed that the original conclusions were premature in five out of seven cases [1].
Lower sampling fractions (i.e., less complete phylogenetic trees) reduce the accuracy of both model selection and parameter estimation (speciation, extinction, and transition rates) [10]. The table below summarizes the impact of sampling on false positive rates.
Table 1: Impact of Sampling Fraction on Model Accuracy
| Sampling Fraction (Tree Completeness) | Effect on False Positive Rate | Effect on Parameter Accuracy |
|---|---|---|
| Low (≤ 60%) | Increased | Reduced accuracy |
| Low (≤ 60%) with Taxonomic Bias | Further Increased | Less accurate |
| High | Lower | Improved accuracy |
Furthermore, how you account for sampling in your model matters. Mis-specifying the sampling fraction (providing an incorrect estimate) severely affects parameter accuracy [10]:
Traditional models often assume simple, linear relationships between a single trait and diversification rates. Biological reality is more complex. False inferences can arise from:
Advanced models like the Birth-Death Neural Network (BDNN) are being developed to capture these complex, nonlinear, and interacting effects, providing a more realistic and less error-prone framework [11].
Table 2: Common Issues and Recommended Solutions
| Problem | Symptoms | Solution & Best Practices |
|---|---|---|
| Unaccounted Hidden Traits | Significant trait effect in MuSSE, but not in HiSSE or SecSSE. | Use models that incorporate hidden states, such as HiSSE or SecSSE [1] [7]. |
| Low or Biased Sampling | Inaccurate parameter estimates; high false positive rates, especially in biased samples. | Strive for higher and more balanced taxonomic sampling. If sampling is incomplete (e.g., 60% or less), avoid heavy sampling biases across sub-clades [10]. |
| Mis-specified Sampling Fraction | Speciation/Extinction rates are consistently over- or under-estimated. | Accurately estimate your sampling fraction. If uncertain, a cautious under-estimation is preferable to over-estimation to avoid inflating false positives [10]. |
| Oversimplified Model of Trait Effects | Model fails to capture known biological complexity; poor fit. | Consider flexible models like BDNN for fossil data that can infer complex, nonlinear effects and interactions among multiple traits and environmental factors [11]. |
The following diagram illustrates a robust workflow for a trait-dependent diversification analysis, incorporating key checks to minimize false positives.
Table 3: Essential Tools for Trait-Dependent Diversification Analysis
| Tool / Reagent | Function | Key Consideration |
|---|---|---|
| SecSSE R Package | Infers state-dependent diversification for multiple observed traits while accounting for hidden traits [1] [7]. | Allows a trait to be in multiple states simultaneously (e.g., for generalist species). Correctly implements the likelihood calculation. |
| HiSSE Model | Provides a framework for detecting trait-dependent diversification while accounting for hidden states [1]. | Limited to binary traits. Serves as a foundational method that inspired SecSSE. |
| BDNN (PyRate) | A Bayesian birth-death model using neural networks to infer complex, nonlinear effects on diversification from fossil data [11]. | Particularly powerful for analyzing fossil data and integrating multiple continuous/categorical traits and paleoenvironmental variables. |
| Robust Phylogeny | A time-calibrated phylogenetic tree of the study group. | Tree completeness and balanced sampling are critical. Incomplete or biased trees are a major source of error [10]. |
| Annotated Trait Data | Data on morphological, ecological, or behavioral traits for the species in the phylogeny. | Should be as complete and accurate as possible. Consider states for generalist species or uncertainty [1]. |
Issue: Inaccurate model selection and parameter estimation in State-dependent Speciation and Extinction (SSE) models due to incomplete phylogenetic trees.
Explanation: Phylogenetic tree completeness refers to the percentage of extant species included in your phylogenetic tree compared to the known diversity of the clade. Lower sampling fractions reduce analytical power and increase error rates. When tree completeness falls below 60% and sampling is taxonomically biased (uneven across sub-clades), the risk of false positives increases significantly. Parameter estimates for speciation, extinction, and transition rates become less accurate with decreasing sampling fractions [12] [13].
Solution:
Issue: Systematic biases in parameter estimates resulting from incorrect specification of the sampling fraction (the proportion of species included in the analysis relative to the total clade diversity).
Explanation: The sampling fraction is a critical parameter in SSE models that accounts for incomplete taxon sampling. Mis-specification occurs when researchers over-estimate or under-estimate this value. When the specified sampling fraction is lower than the true value, parameter estimates tend to be over-estimated. Conversely, when the specified sampling fraction is higher than the true value, parameters are under-estimated. False positive rates increase when sampling fractions are over-estimated [12] [13].
Solution:
Table 1: Impact of Sampling Fraction on SSE Model Accuracy
| Sampling Fraction (Completeness) | Model Selection Accuracy | Parameter Estimate Accuracy | False Positive Rate |
|---|---|---|---|
| High (>80%) | High | High | Low |
| Moderate (60-80%) | Moderate | Moderate | Low to Moderate |
| Low (<60%) | Reduced | Reduced | Increased |
| Low (<60%) with Taxonomic Bias | Severely Reduced | Severely Reduced | High |
Table 2: Effects of Sampling Fraction Mis-specification
| Type of Mis-specification | Effect on Parameter Estimates | Effect on False Positives |
|---|---|---|
| Over-estimated Sampling Fraction | Parameters under-estimated | Increased |
| Under-estimated Sampling Fraction | Parameters over-estimated | Not significantly increased |
Purpose: To evaluate how phylogenetic tree completeness and sampling fraction specification affect trait-dependent diversification inferences.
Methodology:
Simulate phylogenies with `TreeSim` or `diversitree` [14].
Key Considerations:
Table 3: Essential Research Reagents & Computational Tools
| Tool/Resource | Function/Purpose | Application Context |
|---|---|---|
| `diversitree` R package | Fits SSE models to phylogenetic data | Model fitting and comparison [14] |
| `BEAST 2` | Bayesian evolutionary analysis | Phylogenetic tree estimation and dating [14] |
| `TreeSim` | Simulates phylogenetic trees | Method validation and power analysis [14] |
| `ape` R package | Phylogenetic analysis and manipulation | Data preparation and tree handling [14] |
| AIC model selection | Compares fit of alternative models | Model selection and weighting [14] |
Answer: While there's no absolute threshold, studies show that sampling fractions below 60% substantially increase error rates, particularly when sampling is taxonomically biased. Aim for the highest possible completeness, ideally exceeding 80% for reliable inferences. When completeness is between 60-80%, results should be interpreted with caution and include sensitivity analyses [12] [13].
Answer: The research suggests it's safer to cautiously under-estimate sampling efforts. Over-estimating sampling fractions increases false positive rates, while under-estimating primarily affects parameter magnitudes without significantly increasing false inferences. However, the optimal approach is to invest time in accurately determining the sampling fraction through thorough literature review and taxonomic verification [12] [13].
Answer: Taxonomic bias occurs when some sub-clades are heavily under-sampled while others are well-sampled, creating an unrepresentative tree. This is particularly problematic because it can create spurious correlations between traits and diversification rates when the sampling imbalance coincides with trait distribution. Random sampling with low completeness is less likely to produce such systematic biases [12].
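
The difference between random and taxonomically biased sampling can be made concrete with a small sketch. It assumes a hypothetical clade of three equally sized sub-clades and a deliberately extreme biased scheme (sub-clades are filled one after another); neither reflects the simulation design of the cited studies.

```python
import random

def random_sample(tips_by_clade: dict, completeness: float, rng: random.Random) -> set:
    """Keep a uniformly random subset of all tips."""
    all_tips = [tip for tips in tips_by_clade.values() for tip in tips]
    k = round(completeness * len(all_tips))
    return set(rng.sample(all_tips, k))

def biased_sample(tips_by_clade: dict, completeness: float, rng: random.Random) -> set:
    """Fill the quota sub-clade by sub-clade, so early sub-clades are
    fully sampled and later ones may be missed entirely."""
    total = sum(len(tips) for tips in tips_by_clade.values())
    k = round(completeness * total)
    kept = set()
    for clade in sorted(tips_by_clade):
        tips = tips_by_clade[clade]
        take = min(len(tips), k - len(kept))
        kept.update(rng.sample(tips, take))
        if len(kept) == k:
            break
    return kept

def coverage(tips_by_clade: dict, kept: set) -> dict:
    """Fraction of each sub-clade's tips that were retained."""
    return {c: sum(t in kept for t in tips) / len(tips)
            for c, tips in tips_by_clade.items()}

# Hypothetical clade: three sub-clades with 20 species each.
tips_by_clade = {c: [f"{c}_{i}" for i in range(20)] for c in ("A", "B", "C")}
rng = random.Random(1)

random_kept = random_sample(tips_by_clade, 0.5, rng)
biased_kept = biased_sample(tips_by_clade, 0.5, rng)

print("random:", coverage(tips_by_clade, random_kept))
print("biased:", coverage(tips_by_clade, biased_kept))
```

Both trees have the same overall completeness (50%), but only the biased one leaves an entire sub-clade unsampled, which is the pattern that can align with trait distribution and mislead an SSE model.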
The Binary State Speciation and Extinction (BiSSE) model was developed to address two key problems identified in comparative methods [6]. First, inferences about character state transitions based on simple transition models (like Pagel's 1999 model) can be misled if the character affects speciation or extinction rates. Second, inferences about whether a character affects lineage diversification based on sister clade comparisons can be invalid if transition rates between character states are asymmetric [6]. Essentially, BiSSE provides a framework to jointly model trait evolution and diversification, testing whether specific character states are associated with differential diversification rates.
HiSSE (Hidden State Speciation and Extinction) and SecSSE (Several Examined and Concealed States-Dependent Speciation and Extinction) were developed to address a critical flaw in BiSSE and its multistate extension MuSSE: their high rate of false positives [1] [15]. These models can incorrectly infer that a trait affects diversification when the pattern is actually driven by some unobserved (hidden) trait [16]. HiSSE accounts for this by incorporating a hidden state that affects diversification rates, while SecSSE extends this framework to multiple examined and concealed states, providing a more robust testing framework [1] [15].
Rather than optimizing speciation (λ) and extinction (μ) rates separately, HiSSE uses a transformed parameter space [17]. For each state i, it defines net turnover, τᵢ = λᵢ + μᵢ, and extinction fraction, εᵢ = μᵢ / λᵢ.
This reparameterization alleviates problems with overfitting when λᵢ and μᵢ are highly correlated but both contribute to explaining diversity patterns [17].
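
A short sketch of the conversion may help when interpreting output. It assumes the usual definitions, turnover τ = λ + μ and extinction fraction ε = μ / λ, which invert to λ = τ / (1 + ε) and μ = τ·ε / (1 + ε). The function names are illustrative and are not part of the `hisse` package.

```python
def to_turnover_eps(lam: float, mu: float) -> tuple:
    """Map speciation (lam) and extinction (mu) rates to
    (turnover, extinction fraction)."""
    if lam <= 0:
        raise ValueError("speciation rate must be positive")
    return lam + mu, mu / lam

def to_lambda_mu(turnover: float, eps: float) -> tuple:
    """Invert the mapping: recover (lam, mu) from turnover and
    extinction fraction."""
    lam = turnover / (1.0 + eps)
    mu = turnover * eps / (1.0 + eps)
    return lam, mu

# Hypothetical rates for one state, in events per lineage per unit time.
turnover, eps = to_turnover_eps(lam=0.3, mu=0.1)
lam_back, mu_back = to_lambda_mu(turnover, eps)

print(f"turnover = {turnover:.3f}, extinction fraction = {eps:.3f}")
print(f"recovered lambda = {lam_back:.3f}, mu = {mu_back:.3f}")
```

Converting back to λ and μ is useful when comparing estimates with studies that report speciation and extinction rates directly.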
SecSSE combines features of both MuSSE and HiSSE while adding new functionality [1] [15]. It can simultaneously infer state-dependent diversification across two or more examined (observed) traits while accounting for possible concealed (hidden) traits. Additionally, it allows for:
Recent research shows that both model selection and parameter estimate accuracy are reduced at lower sampling fractions (tree completeness) [10]. When sampling is taxonomically biased and tree completeness is ≤60%, false positive rates increase significantly compared to random sampling. Mis-specifying the sampling fraction severely affects parameter accuracy: parameters are over-estimated when the sampling fraction is under-specified, and under-estimated when over-specified [10]. The study recommends cautiously under-estimating sampling efforts when uncertain, as false positives increase more when sampling fractions are over-estimated.
Problem: Your BiSSE/MuSSE analysis detects significant trait-dependent diversification, but you're concerned it might be a false positive driven by an unobserved trait.
Solution:
Workflow:
Problem: Your phylogeny has limited taxon sampling (completeness) or biased sampling across clades, potentially affecting SSE model accuracy.
Solution:
Recommended Sampling Practices:
Table 1: Impact of Sampling Fraction Mis-specification on Parameter Estimates
| Scenario | Effect on Parameter Estimates | Effect on False Positives |
|---|---|---|
| Sampling fraction specified lower than true value | Parameters are over-estimated | Minimal increase |
| Sampling fraction specified higher than true value | Parameters are under-estimated | Substantial increase |
| Random sampling with completeness ≤60% | Reduced accuracy | Moderate increase |
| Taxonomically biased sampling with completeness ≤60% | Severely reduced accuracy | High increase |
Problem: Setting up appropriate transition rate models in HiSSE with proper parameter constraints.
Solution:
1. Use the `TransMatMaker.old()` function to create the basic structure [17].
2. Remove unwanted transitions with `ParDrop()` [17].
3. Use `ParEqual()` to link parameters when appropriate to reduce model complexity.
Example Transition Matrix Setup:
Common Transition Matrix Configurations:
Table 2: Common HiSSE Transition Rate Model Specifications
| Model Type | Dual Transitions Allowed? | Typical Number of Parameters | Use Case |
|---|---|---|---|
| Full HiSSE model | Yes | 12 | Most complex model |
| No dual transitions | No | 8 | Biologically more realistic |
| Equal transition rates | No | 1 | Reduced complexity |
| BiSSE equivalent | Not applicable | 2 | Simple trait-dependent diversification |
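
The parameter counts for the first two rows of the table can be checked by enumerating transitions among the four combined states. The sketch below assumes the state labels 0A, 1A, 0B, 1B (observed state 0/1, hidden state A/B) and treats a "dual" transition as one that changes both the observed and the hidden state in a single step. It only counts entries and does not call `hisse`.

```python
from itertools import permutations

# Combined states: observed state (0 or 1) plus hidden state (A or B).
STATES = ["0A", "1A", "0B", "1B"]

def is_dual(src: str, dst: str) -> bool:
    """True if both the observed and the hidden state change at once."""
    return src[0] != dst[0] and src[1] != dst[1]

def transition_pairs(allow_dual: bool) -> list:
    """All ordered (source, destination) pairs that get a rate parameter."""
    return [(s, d) for s, d in permutations(STATES, 2)
            if allow_dual or not is_dual(s, d)]

full = transition_pairs(allow_dual=True)
no_dual = transition_pairs(allow_dual=False)

print("full model:", len(full), "transition rates")
print("no dual transitions:", len(no_dual), "transition rates")
print("dropped:", sorted(set(full) - set(no_dual)))
```

Constraining all remaining rates to be equal collapses these entries to a single free parameter, which corresponds to the "Equal transition rates" row.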
Problem: You need to test the influence of multiple traits on diversification while accounting for potential hidden states.
Solution:
Install SecSSE from CRAN (`install.packages("secsse")`) or GitHub [15].
Table 3: Key Software Packages for SSE Analyses
| Tool / Package | Primary Function | Key Features | Reference |
|---|---|---|---|
| RevBayes | General phylogenetic inference | Implements BiSSE, MuSSE, and other SSE models; flexible model specification | [6] |
| hisse | Hidden-state SSE models | Accounts for hidden traits; reparameterized turnover and extinction fraction | [17] |
| diversitree | Multiple SSE frameworks | Implements BiSSE, MuSSE, and related SSE models; useful for simulation studies | [17] |
| SecSSE | Multiple examined and concealed traits | Combines MuSSE and HiSSE features; reduces false positives | [1] [15] |
What are SSE models in evolutionary biology? SSE (State-dependent Speciation and Extinction) models are a phylogenetic comparative framework used to determine if a specific biological trait influences diversification rates (speciation and extinction). They test whether different states of a trait (e.g., presence or absence of a morphological feature) are associated with different rates of species formation and extinction over evolutionary time [18] [9].
My analysis found a significant trait-diversification association. Could this be a false positive? Yes. False positives, where a trait appears to be associated with diversification but is not, are a significant risk. Key factors that increase this risk include [9]:
How does phylogenetic tree completeness affect my SSE model results? The completeness of your phylogenetic tree, expressed as the sampling fraction, is critical. Lower sampling fractions reduce the accuracy of both model selection and parameter estimation (such as speciation and transition rates) [9]. The table below summarizes how sampling fraction and sampling strategy affect false positive rates and parameter accuracy.
| Sampling Fraction (Tree Completeness) | Impact on False Positive Rate |
|---|---|
| ≤ 60% | Increased rate of false positives, especially when sampling is taxonomically biased [9]. |
| Random Sampling | More accurate parameter estimation compared to biased sampling at the same completeness [9]. |
| Taxonomically Biased Sampling | Less accurate parameter estimates and higher false positives at low completeness [9]. |
I am unsure of the exact sampling fraction for my clade. What should I do? When the total number of species in a clade is unknown, mis-specifying the sampling fraction severely affects results. It is better to cautiously under-estimate your sampling efforts. Over-estimating the sampling fraction increases false positives. If possible, using a Bayesian framework with a prior on the sampling fraction can help account for this uncertainty [9].
What are the different types of models within the SSE framework? It is crucial to compare different models to guard against false inferences [9]:
Problem: Inconsistent results when using different phylogenetic trees. Solution:
Problem: The model selects a trait-dependent model, but I suspect an unmeasured trait is the real driver. Solution:
Problem: Parameter estimates (speciation, extinction, transition rates) seem biologically unrealistic. Solution:
| Sampling Fraction Specification | Impact on Parameter Estimates |
|---|---|
| Specified lower than true value | Parameter values are over-estimated [9]. |
| Specified higher than true value | Parameter values are under-estimated [9]. |
This protocol outlines key steps for a reliable trait-dependent diversification analysis.
1. Phylogeny and Trait Data Preparation
2. Model Selection and Comparison
Use `HiSSE` or `SecSSE` [9].
3. Sensitivity Analysis
The following table lists the essential "research reagents" — the primary data and software components — needed for a robust SSE analysis.
| Item | Function in SSE Analysis |
|---|---|
| Time-Calibrated Phylogeny | The evolutionary scaffold for estimating diversification rates. Branch lengths represent time [9]. |
| Trait Dataset | The examined trait data (categorical or continuous) for the species in the phylogeny, used to test for association with diversification [18] [9]. |
| Sampling Fraction Estimate | A crucial correction factor that accounts for missing species in the phylogeny, specified per trait state to avoid biased parameter estimates [9]. |
| SSE Software Package (e.g., HiSSE, SecSSE) | The computational engine that fits the state-dependent speciation and extinction models to your data and performs statistical comparisons [9]. |
| Concealed Trait Models (CTD) | A control model that accounts for the influence of unmeasured traits, essential for reducing false positives in the analysis [9]. |
The following diagram illustrates the logical workflow for a robust SSE model selection process, highlighting key decision points to avoid false positives.
SSE Model Selection Workflow
| Problem Area | Common Symptoms | Likely Causes | Recommended Solutions |
|---|---|---|---|
| Low Sampling Fraction | Inaccurate parameter estimates (speciation, extinction, transition rates); reduced model selection accuracy [12]. | The phylogenetic tree represents a small percentage (e.g., ≤60%) of the total known clade diversity [12]. | Increase taxon sampling where possible. If not, use a uniform prior on the sampling fraction (rho) to propagate uncertainty in your analysis [19]. |
| Mis-specified Sampling Fraction | Parameter values are systematically over- or under-estimated [12]. | The `rho` value used in the model is incorrect (e.g., based on outdated taxonomy or an improper calculation) [12] [19]. | Carefully re-calculate the sampling fraction using current taxonomic databases. Conduct sensitivity analyses using a range of plausible `rho` values [19]. |
| Taxonomically Biased Sampling | High rates of false positives for trait-dependent diversification; inaccurate parameter estimates, even with moderate (e.g., 60%) sampling [12]. | Some sub-clades are heavily over-sampled while others are under-sampled, violating the assumption of random sampling in many models [12] [20]. | Re-sample to balance taxonomic coverage or use models that can incorporate clade-specific sampling fractions. Explicitly report and justify the sampling methodology [20]. |
| Uncertain Clade Size | Inability to calculate a precise sampling fraction; circular dependency between diversification rate (r) and clade size (m) [19]. | The total number of species in the clade is poorly characterized or unknown [19]. | Use a uniform prior on the sampling fraction based on the plausible range of clade sizes. Report results with this uncertainty explicitly stated [19]. |
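The "Uncertain Clade Size" recommendation can be illustrated with a minimal Python sketch. It converts a plausible range of clade sizes into bounds on rho and draws values from a uniform prior over that range. The function names and the example counts (120 sampled species, 150 to 200 species in total) are hypothetical and chosen only for illustration. Note that a uniform prior on rho is not the same as a uniform prior on the clade size m.

```python
import random

def rho_bounds(n_sampled, m_min, m_max):
    """Return (rho_low, rho_high) when the true clade size lies in [m_min, m_max]."""
    if not (0 < n_sampled <= m_min <= m_max):
        raise ValueError("need 0 < n_sampled <= m_min <= m_max")
    # The largest plausible clade gives the smallest sampling fraction.
    return n_sampled / m_max, n_sampled / m_min

def draw_rho(n_sampled, m_min, m_max, n_draws, seed=1):
    """Draw sampling fractions from a uniform prior over the plausible range."""
    low, high = rho_bounds(n_sampled, m_min, m_max)
    rng = random.Random(seed)
    return [rng.uniform(low, high) for _ in range(n_draws)]

# Hypothetical clade: 120 species in the tree, 150 to 200 species thought to exist.
low, high = rho_bounds(120, 150, 200)
draws = draw_rho(120, 150, 200, n_draws=1000)
print(f"rho ranges from {low:.2f} to {high:.2f}")
```

Each drawn value could then be passed to the model-fitting step, so that the spread of results reflects the uncertainty in clade size.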
Q1: What is the sampling fraction, and why is it critical for diversification models?
The sampling fraction (rho) is the proportion of species in a clade that are included in your phylogenetic tree (n) relative to the total number of known species in that clade (m), so rho = n / m [21]. It is critical because state-dependent speciation and extinction (SSE) models use this information to correctly estimate the number of missing speciation events. Mis-specifying this fraction severely affects the accuracy of parameter estimates [12]. If you assume perfect sampling (rho = 1) when it is incomplete, you will systematically under-estimate speciation and extinction rates [12] [21].
Q2: I am working on a poorly known clade where the total number of species is uncertain. How can I proceed?
This is a common challenge. Since you cannot calculate a single, precise rho, it is recommended to propagate the uncertainty through your analysis [19].
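- Define a plausible range for the total clade size (m) based on expert knowledge and taxonomic resources.
- Calculate the corresponding range for the sampling fraction (rho).
- Run the analysis across this range of rho values, for instance, by using a uniform prior in a Bayesian framework [19].

Q3: My taxon sampling is imbalanced across different sub-clades. What are the risks?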
This is a major source of risk for false positives [12]. Most models assume missing taxa are randomly distributed across the tree. When sampling is imbalanced (e.g., you have sequenced 90% of species in one sub-clade but only 10% in another), this assumption is violated. Simulations show that with ≤60% tree completeness, taxonomically biased sampling leads to higher rates of false positives and less accurate parameter estimates compared to random sampling [12]. You should aim for balanced sampling or use models that can account for this bias.
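One way to check for this kind of imbalance is to compute a sampling fraction per sub-clade and compare each against the overall fraction. The sketch below does this in plain Python. The clade names, species counts, and the 0.2 tolerance are hypothetical assumptions, not values from the cited studies.

```python
def clade_sampling_fractions(sampled, known):
    """Map each sub-clade to its sampling fraction n / m."""
    fractions = {}
    for clade, m in known.items():
        n = sampled.get(clade, 0)
        if m <= 0 or n > m:
            raise ValueError(f"inconsistent counts for {clade}: n={n}, m={m}")
        fractions[clade] = n / m
    return fractions

def overall_fraction(sampled, known):
    """Sampling fraction of the whole tree."""
    return sum(sampled.get(c, 0) for c in known) / sum(known.values())

def flag_imbalance(fractions, overall, tolerance=0.2):
    """Return clades whose fraction differs from the overall one by more than tolerance."""
    return sorted(c for c, f in fractions.items() if abs(f - overall) > tolerance)

# Hypothetical counts: known species per sub-clade and species present in the tree.
known = {"clade_A": 50, "clade_B": 50, "clade_C": 100}
sampled = {"clade_A": 45, "clade_B": 5, "clade_C": 70}

fractions = clade_sampling_fractions(sampled, known)
overall = overall_fraction(sampled, known)
flagged = flag_imbalance(fractions, overall)
print(fractions, round(overall, 2), flagged)
```

Here the overall fraction of 0.6 hides sub-clades sampled at 0.9 and 0.1, which is the situation the answer above warns about.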
Q4: Is it better to over-estimate or under-estimate the sampling fraction in my model?
It is generally better to cautiously under-estimate your sampling efforts [12]. Research has shown that false positives increase when the sampling fraction is over-estimated (specified as higher than its true value). When in doubt, using a slightly conservative (lower) estimate of rho can be a safer practice [12].
Q5: How does an incorrect sampling fraction lead to a false positive in trait-dependent diversification?
A false positive in this context occurs when your model incorrectly infers a significant association between a trait and diversification rates. Incomplete or biased sampling can create branching patterns in a phylogeny that mimic the signal of trait-dependent diversification [12] [20]. For example, if a particular trait state is more common in a well-sampled, species-rich subclade, the model may interpret the high diversification of that subclade as being caused by the trait, when the pattern is actually an artifact of uneven sampling [12].
1. Define Your Initial Patient Population (The Total Clade)
- Determine the total number of species in the clade (m) under investigation using authoritative taxonomic databases and recent systematic revisions.
2. Calculate the Exact Sampling Fraction
- Count the number of species included in your phylogenetic tree (n). Calculate the sampling fraction as rho = n / m [21].
- Example: rho = 201/216 = 0.93 [21].
3. Account for Uncertainty (If Necessary)
- If m is uncertain, define a plausible range (e.g., m_min to m_max). Calculate the corresponding range for rho [19].
- In software such as RevBayes, you can specify a uniform prior on rho (e.g., rho ~ Uniform(0.7, 0.95)) [19].
4. Incorporate the Fraction into Model Fitting
- Supply the value through the rho argument (or equivalent) in your diversification model software.
5. Conduct Sensitivity Analysis
- Repeat the analysis across the plausible rho range.
6. Document and Report
- Report the total clade size (m), the sample size (n), the calculated sampling fraction (rho), and the source of taxonomic information in your manuscript.

The following diagram illustrates the decision-making process and methodological flow for handling the sampling fraction in diversification analyses.
| Item / Resource | Function in Analysis | Implementation Notes |
|---|---|---|
| Taxonomic Databases (e.g., IUCN, GBIF, specialist databases) | Provides the best available estimate for the total clade size (m), the denominator for the sampling fraction [20]. | Cross-reference multiple sources to account for synonymy and recent discoveries. |
| phytools R Package | Fits birth-death and Yule models of diversification, allowing for the direct specification of the sampling fraction via the rho argument [21]. | The fit.bd and fit.yule functions are directly applicable. |
| RevBayes / BAMM | Bayesian software platforms for estimating diversification rates. They can model incomplete sampling and are less sensitive to moderate missing taxa, respectively [19] [20]. | Allows for the most flexible modeling of uncertainty via priors on the sampling fraction [19]. |
| Sensitivity Analysis Script | A custom script (e.g., in R or Python) to re-run analyses across a range of sampling fractions. | Essential for quantifying the robustness of your findings to sampling uncertainty [12] [20]. |
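The "Sensitivity Analysis Script" row can be sketched in Python as follows. To keep the example self-contained, it does not fit an SSE model. Instead it swaps in a much simpler quantity, the method-of-moments net diversification rate for a crown group with zero extinction, r = (ln(N) - ln 2) / t, where N is the clade size implied by n / rho. The crown age of 25 time units is a hypothetical value, and the function names are illustrative. In a real analysis the inner call would be replaced by your own model-fitting routine.

```python
import math

def crown_net_diversification(n_total, crown_age):
    """Method-of-moments net diversification rate for a crown group, assuming no extinction."""
    if n_total < 2 or crown_age <= 0:
        raise ValueError("need at least 2 species and a positive crown age")
    return (math.log(n_total) - math.log(2)) / crown_age

def sensitivity_table(n_sampled, crown_age, rho_values):
    """Re-compute the rate for each assumed sampling fraction."""
    rows = []
    for rho in rho_values:
        if not 0 < rho <= 1:
            raise ValueError(f"rho must be in (0, 1], got {rho}")
        implied_total = n_sampled / rho  # clade size implied by this rho
        rate = crown_net_diversification(implied_total, crown_age)
        rows.append((rho, implied_total, rate))
    return rows

# n = 201 follows the worked example in the text; the crown age is hypothetical.
rho_grid = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
table = sensitivity_table(n_sampled=201, crown_age=25.0, rho_values=rho_grid)
for rho, total, rate in table:
    print(f"rho={rho:.2f}  implied species={total:6.1f}  r={rate:.4f}")
```

The estimated rate falls as the assumed rho rises, because a more complete tree implies fewer missing species. Reporting the full table, rather than a single value, shows how strongly a conclusion depends on the assumed sampling fraction.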
In phylogenetic comparative studies, particularly those focused on detecting trait-dependent diversification, incomplete taxon sampling is not merely an inconvenience—it is a potential source of significant bias. Real-world missing data are rarely random; they are often phylogenetically clumped or correlated with the trait of interest, which can, in turn, lead to false inferences about evolutionary processes [23]. This guide provides troubleshooting strategies to help researchers identify and mitigate these risks.
Q1: Why is the assumption of randomly missing data problematic in trait-dependent diversification studies? Assuming missing taxa are random is often biologically unrealistic. Taxa are more likely to be missing because of ecological traits that carry phylogenetic signal, such as rarity, small geographical range, or specific habitat preferences. This creates "phylogenetically clumped" missing data. If the trait you are studying (e.g., range size) is itself correlated with the probability of being sampled, your dataset can become systematically biased, potentially leading to false positives in trait-dependent diversification tests [23].
Q2: What is the key difference between 'clumped' and 'correlated' missing taxa?
Q3: How can incomplete distance matrices be handled during tree inference? When dealing with incomplete distance matrices (e.g., from sequence data with gaps), a weighted least-squares approach can be used for phylogenetic inference. This method combines the four-point condition and the ultrametric inequality to handle missing entries, and has been shown to outperform other methods like the standard Ultrametric or Additive procedures when data is incomplete [24].
Q4: What tools can help visualize phylogenetic trees with associated data? Several tools are designed for visualizing and annotating phylogenetic trees, which is crucial for exploring data completeness and patterns:
Problem: Your analysis of trait-dependent diversification may be biased because your missing taxa are not random.
Solution Steps:
Problem: You have an incomplete distance matrix due to missing sequence data or gaps, and you need to infer a reliable phylogeny.
Solution Steps:
Objective: To evaluate how non-random missing taxa might bias the outcomes of a trait-dependent diversification analysis.
Methodology:
The workflow for this assessment can be summarized as follows:
Objective: To test for trait-dependent diversification on multiple traits while accounting for hidden states, thereby reducing Type I errors.
Methodology:
| Tool/Package Name | Primary Function | Key Application in Addressing Missing Data |
|---|---|---|
| SecSSE [1] | State-dependent speciation & extinction model | Models diversification dependent on multiple observed traits while accounting for hidden traits, reducing false positives. |
| ggtree [25] | Phylogenetic tree visualization & annotation | Visually explores patterns of missing taxa and integrates associated data (e.g., traits) for diagnostic purposes. |
| PhyloScape [26] | Interactive web-based tree visualization | Allows scalable visualization and annotation for large trees, helping to identify clades with poor sampling. |
| T-Rex [24] | Phylogeny inference from distance matrices | Implements a weighted least-squares approach to infer trees from incomplete distance matrices. |
The following table summarizes key findings from simulation studies on the impacts of missing taxa, providing a benchmark for your own analyses [23].
| Scenario of Missing Taxa | Impact on Model Selection | Impact on Parameter Estimation |
|---|---|---|
| Random (rMT) | Minimal to no bias, even with sparse sampling (e.g., 50% missing). | Minimal to no bias. |
| Phylogenetically Clumped (cluMT) | Generally robust performance. | Generally robust performance. |
| Correlated with Trait (corMT) | Generally robust performance. | Notable bias can be introduced under very high proportions (e.g., 90%) of missing taxa. |
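The three scenarios in the table can be mimicked on a toy dataset to see how they differ before running any model. The Python sketch below is an assumption-laden illustration: clade membership stands in for phylogenetic position, traits are assigned at random, and the function names and the weight of 4 for trait state 1 are hypothetical. It is not the simulation design used in [23].

```python
import random

def make_species(n_clades=5, per_clade=20, seed=0):
    """Build a toy species list; clade membership stands in for phylogenetic position."""
    rng = random.Random(seed)
    return [
        {"name": f"sp_{c}_{i}", "clade": c, "trait": rng.randint(0, 1)}
        for c in range(n_clades)
        for i in range(per_clade)
    ]

def n_to_drop(species, frac_missing):
    return round(len(species) * frac_missing)

def drop_random(species, frac_missing, rng):
    """rMT: every species is equally likely to be missing."""
    return rng.sample(species, len(species) - n_to_drop(species, frac_missing))

def drop_clumped(species, frac_missing, rng):
    """cluMT: remove clades one after another, in random order, until the target is reached."""
    target = n_to_drop(species, frac_missing)
    clades = sorted({s["clade"] for s in species})
    rng.shuffle(clades)
    dropped = set()
    for clade in clades:
        room = target - len(dropped)
        if room <= 0:
            break
        members = [s["name"] for s in species if s["clade"] == clade]
        dropped.update(members[:room])
    return [s for s in species if s["name"] not in dropped]

def drop_correlated(species, frac_missing, rng, weight_state1=4.0):
    """corMT: species in trait state 1 are weight_state1 times more likely to be missing."""
    target = n_to_drop(species, frac_missing)
    pool = list(species)
    dropped = set()
    while len(dropped) < target:
        weights = [weight_state1 if s["trait"] == 1 else 1.0 for s in pool]
        pick = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        dropped.add(pool.pop(pick)["name"])
    return [s for s in species if s["name"] not in dropped]

rng = random.Random(1)
species = make_species()
kept_random = drop_random(species, 0.5, rng)
kept_clumped = drop_clumped(species, 0.5, rng)
kept_correlated = drop_correlated(species, 0.5, rng)

for label, kept in [("random", kept_random), ("clumped", kept_clumped),
                    ("correlated", kept_correlated)]:
    clades = len({s["clade"] for s in kept})
    share = sum(s["trait"] for s in kept) / len(kept)
    print(f"{label:10s} kept={len(kept)} clades={clades} trait1_share={share:.2f}")
```

Comparing the number of surviving clades and the share of trait state 1 across the three outputs gives a quick diagnostic of which kind of missingness a real dataset most resembles.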
What is the core issue with missing trait data in diversification studies? Missing trait data can lead to two major problems: 1) it can reduce the statistical power of your models, and 2) more critically, it can introduce bias into parameter estimates, leading to incorrect conclusions about the relationship between a trait and diversification rates [27]. In State-dependent Speciation and Extinction (SSE) models, this can result in false positives, where a neutral trait is incorrectly identified as being linked to speciation or extinction [7] [8].
My MuSSE analysis shows a significant trait-diversification relationship. Should I trust it? You should be very cautious. Studies have shown that standard MuSSE models can erroneously detect correlations between neutral traits and diversification rates if a true, but unobserved (hidden), trait is the actual driver [7] [8]. It is recommended to use models that account for hidden states, such as SecSSE (Several Examined and Concealed States-Dependent Speciation and Extinction) or HiSSE, which are specifically designed to avoid these false positives [7]. In fact, a re-evaluation of seven MuSSE studies found that the original conclusions were premature in five cases once more robust models were applied [7].
Can I simply remove species with missing trait data from my analysis? While common, simply deleting species with missing data (list-wise deletion) is generally not recommended [27] [28]. This approach can discard valuable information and, if the data is not missing randomly, can skew the inferred phylogenetic relationships and trait dynamics, potentially distorting your results [27] [29].
Are there reliable methods to estimate missing trait values? Yes, several robust imputation methods are available. A key strategy is to use phylogenetic information, as closely related species often share similar traits due to shared evolutionary history—a property known as phylogenetic signal [28]. The missForest algorithm, especially when enhanced with phylogenetic eigenvectors, has been shown to accurately impute missing continuous trait values, with performance depending on the strength of the phylogenetic signal and correlation among traits [28].
Symptoms: Your SSE model (e.g., BiSSE, MuSSE) indicates a strong relationship between a trait and diversification, but you suspect it might be spurious, perhaps driven by an unmeasured factor.
Solutions:
Symptoms: Your functional trait dataset has gaps, and you are unsure how to proceed without biasing your analysis of community assembly or ecosystem functioning.
Solutions:
- Run the missForest function in R, including the phylogenetic eigenvectors as predictor variables alongside the other observed traits.

Table 1: Comparison of Methods for Handling Missing Trait Data
| Method | Key Principle | Best For | Key Advantage | Key Limitation |
|---|---|---|---|---|
| List-wise Deletion | Removes any species with missing data | Nearly complete datasets with minimal, random missingness | Simple and fast | Can introduce severe bias and reduce statistical power [27] |
| Mean/Median Imputation | Fills gaps with the average value of the trait | Preliminary exploration | Very simple to implement | Ignores phylogenetic structure and uncertainty; can distort distributions [27] |
| Phylogenetic Imputation (missForest+PVR) | Predicts missing values using trait correlation and phylogenetic signal | Medium to large datasets with correlated and/or conserved traits | High accuracy when traits are phylogenetically conserved; handles complex relationships [28] | Requires a phylogeny; performance lower for traits with weak phylogenetic signal |
| Multiple Imputation (MICE, etc.) | Creates multiple plausible datasets by modeling relationships among all variables | Complex datasets with multiple variable types | Accounts for uncertainty in the imputation process [27] | Can be computationally intensive; requires careful model specification |
Table 2: Performance of Phylogenetic Imputation under Different Conditions [28]
| Level of Phylogenetic Signal in Traits | Level of Correlation Among Traits | Expected Imputation Accuracy (Inverse of Error) |
|---|---|---|
| High | High | Highest |
| High | Low | High |
| Low | High | Medium |
| Low | Low | Lowest |
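The reason phylogenetic signal helps imputation can be shown with a deliberately simple stand-in. The Python sketch below is not missForest and does not use phylogenetic eigenvectors. It fills each missing value with an inverse-distance-weighted mean of the observed values, so close relatives count most. The species names, distances, trait values, and the `power` parameter are all hypothetical.

```python
def impute_by_relatives(traits, distances, power=2.0):
    """Fill None values with an inverse-distance-weighted mean of observed relatives."""
    observed = {sp: v for sp, v in traits.items() if v is not None}
    if not observed:
        raise ValueError("at least one observed value is required")
    imputed = dict(traits)
    for sp, value in traits.items():
        if value is not None:
            continue
        num = den = 0.0
        for other, obs in observed.items():
            weight = 1.0 / distances[sp][other] ** power  # closer relatives weigh more
            num += weight * obs
            den += weight
        imputed[sp] = num / den
    return imputed

# Hypothetical patristic distances: A and B are sisters, C and D are sisters.
distances = {
    "A": {"B": 2.0, "C": 10.0, "D": 10.0},
    "B": {"A": 2.0, "C": 10.0, "D": 10.0},
    "C": {"A": 10.0, "B": 10.0, "D": 2.0},
    "D": {"A": 10.0, "B": 10.0, "C": 2.0},
}
traits = {"A": 10.0, "B": None, "C": 50.0, "D": 52.0}

imputed = impute_by_relatives(traits, distances)
plain_mean = (10.0 + 50.0 + 52.0) / 3
print(f"phylogenetic estimate for B: {imputed['B']:.2f}; plain mean: {plain_mean:.2f}")
```

When the trait is conserved, the estimate for B stays close to its sister species A, whereas plain mean imputation pulls it toward the distantly related C and D. When phylogenetic signal is weak, that advantage disappears, which matches the ordering in the table above.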
Objective: To accurately estimate missing continuous trait values in an ecological or evolutionary dataset using the missForest algorithm combined with phylogenetic information [28].
Materials/Reagents:
- A trait dataset with missing values coded as NA.
- R packages: missForest, ape, PVR.

Methodology:
- Use the PVR package to decompose the phylogenetic distance matrix derived from your tree.
- Run the missForest function using the combined predictor matrix.
- Extract the imputed trait values from the missForest output.

Objective: To test for trait-dependent diversification while accounting for the effect of unobserved (hidden) traits, thereby reducing false positives [7].
Materials/Reagents:
- R with the SecSSE package installed.

Methodology:
- Define the speciation (lambda), extinction (mu), and transition rates (q) for the examined traits.
- Use the secsse_loglik function to calculate the likelihood of your data (tree + traits) under the specified model. SecSSE uses an improved likelihood calculation conditioned on non-extinction, correcting an error present in earlier SSE models [7].
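Once log-likelihoods are available for competing models, one common way to compare them is the Akaike Information Criterion, AIC = 2k - 2 ln L, where k is the number of free parameters. The Python sketch below shows the arithmetic only. The model labels, log-likelihoods, and parameter counts are made-up placeholders, not output from SecSSE or results from any study.

```python
import math

def aic(log_lik, n_params):
    """Akaike Information Criterion: lower values indicate better-supported models."""
    return 2 * n_params - 2 * log_lik

def akaike_weights(aic_values):
    """Convert AIC scores into relative support that sums to 1."""
    best = min(aic_values.values())
    rel = {name: math.exp(-0.5 * (score - best)) for name, score in aic_values.items()}
    total = sum(rel.values())
    return {name: r / total for name, r in rel.items()}

# Placeholder values for illustration only: (log-likelihood, number of free parameters).
models = {
    "constant_rates": (-310.2, 3),
    "examined_trait_dependent": (-305.9, 5),
    "concealed_trait_dependent": (-303.1, 5),
}

scores = {name: aic(ll, k) for name, (ll, k) in models.items()}
weights = akaike_weights(scores)
best_model = min(scores, key=scores.get)

for name in sorted(scores, key=scores.get):
    print(f"{name:28s} AIC={scores[name]:.1f} weight={weights[name]:.3f}")
```

With these placeholder numbers the concealed-trait model scores best, which would mean that the examined trait is not needed to explain the rate variation. A result like that is the signal to be cautious about a trait-dependent interpretation.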
Table 3: Key Research Reagent Solutions for Handling Missing Data
| Item | Function in Research | Example Use-Case |
|---|---|---|
| SecSSE R Package | A state-dependent diversification model that infers the effect of multiple observed traits while accounting for hidden states to reduce false positives. | Testing if flowering time and seed size jointly affect plant diversification, while controlling for an unmeasured factor like pollinator shift [7]. |
| missForest R Package | A non-parametric imputation method using Random Forests to predict missing values; can be combined with phylogenetic data. | Estimating missing body mass values for a set of bird species using their phylogenetic relationships and other known traits like beak depth [28]. |
| Phylogenetic Eigenvectors (PVRs) | Numerical vectors derived from a phylogenetic distance matrix that quantify the phylogenetic relatedness among species for use in statistical models. | Used as predictors in the missForest algorithm to ensure that imputed trait values respect the evolutionary relationships among species [28]. |
| PyRate with BDNN | A Bayesian framework for analyzing fossil data that uses a Birth-Death Neural Network (BDNN) to model complex, non-linear effects of traits and environment on diversification. | Analyzing the proboscidean fossil record to disentangle the interacting effects of body size, diet, and paleoclimate on speciation and extinction [11]. |
| RevBayes Software | A Bayesian phylogenetic inference platform that can implement state-dependent speciation-extinction models combined with the fossilized birth-death process. | Estimating speciation and extinction rates for a clade using a tree that includes both extant species and fossil occurrences to improve parameter accuracy [8]. |
Data curation is a disciplined practice that ensures your data can be discovered, accessed, understood, and used now and into the future. For phylogenetic trait analysis, this process is essential because it anticipates inevitable changes in technology and research methods, ensuring that your data remains usable and understandable, even years later. Without proper curation, there are no guarantees that anyone, including you, will be able to use or even understand the data, regardless of where they are housed [30].
The Data Curation Network provides a standardized workflow for this process [31]:
Data curation is performed at different levels of depth and involvement [32]:
| Level | Name | Description | Common Use in Repositories |
|---|---|---|---|
| Level 0 | No Curation | Data deposited as submitted | Varies |
| Level 1 | Record Level | Brief check of metadata to enhance FAIR-ness | Very common |
| Level 2 | File Level | Review file arrangement and suggest file type transformations | Less common |
| Level 3 | Document Level | Review documentation and add or request missing information | Very common |
| Level 4 | Data Level | Open data files and examine for accuracy and interoperability | Less common |
Most repositories perform a mixture of Levels 1-4, with Levels 1 and 3 being most common. For phylogenetic trait data aiming to avoid false positives, targeting at least Level 3 (Document Level) is recommended to ensure sufficient documentation for others to understand and reuse your data properly [32].
The CURATE(D) model provides a systematic approach to data curation that directly addresses issues leading to false positives in phylogenetic analyses [32]. The process includes specific quality control checks:
| Failure Type | Impact on Model Fitting | Prevention Strategy |
|---|---|---|
| Insufficient documentation | Models cannot be properly replicated or understood | Implement Level 3 document-level curation [32] |
| Poor file organization | Interrelationships between trees and trait data are misunderstood | Apply file-level curation and clear naming conventions [32] |
| Inadequate metadata | Data lacks context for proper interpretation | Augment metadata using disciplinary standards [32] |
| Ignoring data quality issues | Missing data or errors propagate through analysis | Conduct thorough quality assurance checks [32] |
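A basic quality assurance check of the kind listed in the table is to confirm that the tip labels of the tree and the rows of the trait table refer to the same species. The Python sketch below is one possible form of that check. The function name, the report keys, and the example labels are hypothetical, and missing trait values are assumed to be stored as `None`.

```python
from collections import Counter

def check_tips_against_traits(tip_labels, trait_table):
    """Report mismatches between tree tip labels and a {species: trait} table."""
    counts = Counter(tip_labels)
    tips = set(counts)
    species = set(trait_table)
    return {
        "duplicate_tips": sorted(t for t, c in counts.items() if c > 1),
        "tips_without_traits": sorted(tips - species),
        "traits_without_tips": sorted(species - tips),
        "missing_trait_values": sorted(
            sp for sp in tips & species if trait_table[sp] is None
        ),
    }

# Hypothetical inputs containing one example of each problem.
tip_labels = ["sp_a", "sp_b", "sp_c", "sp_c", "sp_d"]
trait_table = {"sp_a": 1, "sp_b": None, "sp_c": 0, "sp_e": 1}

report = check_tips_against_traits(tip_labels, trait_table)
for problem, names in report.items():
    print(f"{problem}: {names}")
```

Running a check like this before model fitting, and archiving its output alongside the data, documents that the tree and trait files were consistent at the time of analysis.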
Model fitting is the process of adjusting model parameters to best match your observed data [33]. In trait-dependent diversification studies, proper model fitting is crucial because poorly fit models can produce incorrect insights and should not be used for making scientific decisions [33].
A well-fit model follows the overall patterns in your data without matching every data point exactly. This balance is essential for avoiding both underfitting and overfitting [33]:
The model fitting process follows three key steps [34]:
For diversification analysis, this typically involves [7]:
SecSSE (Several Examined and Concealed States-Dependent Speciation and Extinction) combines features of HiSSE and MuSSE to simultaneously infer state-dependent diversification across two or more examined traits while accounting for possible concealed traits [7]. This approach directly addresses the false positive problem identified in MuSSE analyses.
Key improvements in SecSSE include [7]:
| Problem | Possible Causes | Solutions |
|---|---|---|
| Failure to converge | Poor starting parameters, model misspecification | Try multiple starting points, simplify model structure |
| Unrealistic parameter estimates | Identifiability issues, data insufficient | Check parameter identifiability, increase data quality |
| Poor model performance | Underfitting or overfitting | Adjust model complexity, cross-validation |
| Computational bottlenecks | Large trees, complex models | Optimize code, use approximate methods |
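The advice to try multiple starting points can be illustrated on a toy problem. The Python sketch below minimises a one-dimensional function with two optima using a very simple downhill search started from several random points, then keeps the best result. The objective function is an invented stand-in for a negative log-likelihood surface, and the optimiser is deliberately crude. Real SSE fits would rely on the optimiser supplied by the modelling package.

```python
import random

def toy_neg_log_lik(x):
    """Invented surface with a deeper optimum near x = -1 and a shallower one near x = 1."""
    return (x * x - 1.0) ** 2 + 0.3 * x

def local_search(objective, start, step=1.0, tol=1e-6):
    """Step downhill along one dimension, halving the step whenever no neighbour improves."""
    x, fx = start, objective(start)
    while step > tol:
        moved = False
        for candidate in (x - step, x + step):
            fc = objective(candidate)
            if fc < fx:
                x, fx, moved = candidate, fc, True
                break
        if not moved:
            step /= 2.0
    return x, fx

def multi_start(objective, bounds, n_starts=20, seed=42):
    """Run the local search from several random starting points and keep the best result."""
    rng = random.Random(seed)
    results = [local_search(objective, rng.uniform(*bounds)) for _ in range(n_starts)]
    best = min(results, key=lambda r: r[1])
    return best, results

best, results = multi_start(toy_neg_log_lik, bounds=(-2.0, 2.0))
best_x, best_value = best
distinct_optima = sorted({round(x, 2) for x, _ in results})
print(f"best x={best_x:.3f}, value={best_value:.3f}, optima reached={distinct_optima}")
```

If different starting points end at different optima, a single run cannot be trusted to have found the best-supported parameter values, and the run with the lowest objective value should be reported.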
| Reagent/Tool | Function | Application in Trait Analysis |
|---|---|---|
| SecSSE R Package | Several examined and concealed states-dependent speciation and extinction | Detecting trait-dependent diversification while accounting for hidden traits [7] |
| Phylogenetic Trees | Evolutionary relationships among taxa | Foundation for diversification rate analysis [7] |
| Trait Data Matrix | Character states for examined traits | Input for state-dependent diversification models [7] |
| CURATE(D) Checklist | Systematic data curation framework | Ensuring data quality before analysis [32] [31] |
| FAIR Principles | Findable, Accessible, Interoperable, Reusable data guidelines | Enhancing data reproducibility and reuse [30] |
Data Curation to Model Fitting Workflow
Model Fitting Optimization Process
Problem Description Researchers are detecting a statistically significant signal of trait-dependent diversification in their SSE (State-dependent Speciation and Extinction) model analysis, but suspect it might be a false positive driven by issues with their phylogenetic tree.
Diagnostic Steps
Solutions
Problem Description Estimated parameters for speciation, extinction, and transition rates from SSE models seem biologically implausible or shift dramatically with small changes to the model.
Diagnostic Steps
Solutions
Q1: What is "tree completeness" and why is it critical for SSE models? Tree completeness, or sampling fraction, is the percentage of known taxa in a clade included in your phylogenetic tree. It is critical because SSE models use this information to estimate true diversification rates. Low or mis-specified sampling fractions can lead to both inaccurate parameter estimates and false inferences of trait-dependent diversification, making it appear that a trait influences speciation or extinction when it does not [9].
Q2: At what threshold of tree completeness does false positive risk become a major concern? False positive risk is significantly elevated when tree completeness is 60% or lower, especially if the missing species are not random but clustered in specific sub-clades (taxonomic bias). At this low completeness, the rate of false positives increases and parameter estimates become substantially less accurate [9].
Q3: How does biased sampling differ from simple low sampling, and why is it worse? Low sampling means many species are missing randomly from the entire tree. Biased sampling means species are missing non-randomly, often from specific sub-clades (e.g., tropical species are under-sampled compared to temperate ones). Biased sampling is worse because it can create spurious correlations that mimic a true trait-dependent diversification signal, making false positives more likely than with random sampling at the same overall completeness level [9].
Q4: My phylogenetic tree is incomplete and cannot be improved. How should I proceed with my analysis? You should:
Q5: What is the single most important practice to reduce false positives from low tree completeness? The most important practice is to correctly specify your sampling fraction and to use SSE models that incorporate concealed states (CTD models). These models explicitly test whether the diversification pattern is better explained by your observed trait or by some unmeasured, hidden trait, thereby dramatically reducing false positives [1].
The following tables consolidate key quantitative findings from simulation studies on how tree completeness affects SSE model outcomes.
Table 1: Impact of Sampling Fraction and Bias on False Positive Rates
| Sampling Fraction (Completeness) | Sampling Regime | Key Impact on Model Selection & False Positives |
|---|---|---|
| ≤ 60% | Random | Accuracy of model selection and parameter estimates is significantly reduced [9]. |
| ≤ 60% | Taxonomic Bias | Rates of false positives increase markedly; parameter estimates are less accurate compared to random sampling [9]. |
| > 60% | Random | Lower risk of false positives; more reliable parameter estimation [9]. |
Table 2: Consequences of Mis-specifying the Sampling Fraction
| Type of Mis-specification | Impact on Parameter Estimation |
|---|---|
| Specified fraction < True fraction | Parameter values are over-estimated [9]. |
| Specified fraction > True fraction | Parameter values are under-estimated; also leads to an increase in false positives [9]. |
This methodology is used to generate synthetic datasets to test the performance of SSE models under controlled conditions, including known levels of tree incompleteness and sampling bias [9] [5].
Workflow Diagram Title: Simulation and Validation Workflow for SSE Models
Materials and Reagents
- TreeSim [5] or TESS [5] for simulating phylogenetic trees; SecSSE [1] or similar for fitting SSE models.

Step-by-Step Procedure
Workflow Diagram Title: Sensitivity Analysis for Sampling Fraction
Table 3: Key Software Solutions for SSE Analysis
| Tool Name | Primary Function | Key Feature for Addressing False Positives |
|---|---|---|
| SecSSE [1] | Several Examined and Concealed States-dependent Speciation and Extinction. | Simultaneously infers diversification dependence on multiple observed traits while accounting for the role of a possible concealed trait, directly reducing Type I error. |
| HiSSE [9] | Hidden-State-dependent Speciation and Extinction. | Includes concealed/hidden states in the model, which allows it to separate the effect of an observed trait from that of an unmeasured trait. |
| TreeSim [5] | Phylogenetic Tree Simulation. | Simulates trees under complex scenarios (e.g., with mass extinctions, rate shifts) for method validation and power analysis. |
| Medusa [5] | Modeling Evolutionary Diversification Using Stepwise AIC. | Detects lineage-specific shifts in diversification rates on a phylogenetic tree. |
1. What are the main types of sampling bias in biodiversity data? Sampling biases are primarily categorized into geographic and taxonomic biases. Geographic bias occurs when data are unevenly distributed across a landscape, often clustered near roads, cities, and other accessible areas, leaving remote regions undersampled [35]. Taxonomic bias describes the disproportionate focus on certain charismatic or well-known species groups (e.g., birds and mammals) while neglecting others (e.g., insects and arachnids) [36]. A third, critical type is environmental bias, where the collected data do not represent the full range of environmental conditions in the study area. It's important to note that correcting for geographic bias does not automatically correct for environmental bias [35].
2. Why is sampling bias a critical problem for trait-dependent diversification research? In trait-dependent diversification research, the goal is to determine if a specific trait influences speciation and extinction rates. Sampling bias can create false positives by causing a spurious correlation between a trait and diversification rates. For instance, if a lineage is both more dispersive and better studied, it might appear to have higher diversification rates simply because its species are more completely documented. Methods like MuSSE can produce false positives if they do not account for hidden traits or sampling biases [7]. Newer methods like SecSSE are designed to account for this by including both examined (observed) and concealed (hidden) traits in the model [7].
3. How can I quantify the geographic and environmental bias in my dataset? A robust method involves analyzing the relationship between sampling probability and accessibility factors (for geographic bias) and climatic variables (for environmental bias). You can fit a model, for example in a Bayesian framework, where sampling rate is a function of distance from roads, cities, etc. [35]. For environmental bias, calculate the multivariate climatic distance between your species occurrence points and the entire study area. A Local Indicator of Multivariate Spatial Association (LISA) can then be used to identify areas where geographic and environmental biases are misaligned [35].
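As a rough illustration of the environmental-bias part of this answer, the Python sketch below measures how far the average climate of the sampled sites lies from the average climate of the whole study area, in units of the study area's standard deviation for each variable. This is one simple reading of "multivariate climatic distance" and is an assumption, not the method of [35]. The climate values (temperature in °C, precipitation in mm) are invented.

```python
import math
from statistics import mean, pstdev

def climatic_distance(sampled, background):
    """Euclidean distance between sampled and background climate centroids, in background SD units."""
    n_vars = len(background[0])
    total = 0.0
    for j in range(n_vars):
        bg_values = [row[j] for row in background]
        sd = pstdev(bg_values)
        if sd == 0:
            continue  # a constant variable carries no information about bias
        sampled_mean = mean([row[j] for row in sampled])
        z = (sampled_mean - mean(bg_values)) / sd
        total += z * z
    return math.sqrt(total)

# Invented climate for five grid cells: (mean temperature, annual precipitation).
background = [(5, 400), (10, 800), (15, 1200), (20, 1600), (25, 2000)]
# Occurrence records come only from the two warmest, wettest cells.
sampled = [(20, 1600), (25, 2000)]

distance = climatic_distance(sampled, background)
print(f"standardised climatic distance: {distance:.2f}")
```

A value near zero suggests the sampled sites cover conditions similar to the study area as a whole, while larger values indicate that part of the environmental range is under-represented.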
4. What is "co-location" and how can it help mitigate bias? Co-location is the strategic use of a single in-situ research facility by multiple Research Infrastructures (RIs). This is a cost-efficient way to densify a research network and improve its geographic and environmental representativeness. For example, adding 50 candidate facilities from other RIs to the eLTER RI in Europe reduced sampling bias for future climate scenarios by up to 40%. However, co-location alone cannot completely overcome bias, especially in severely underrepresented regions like the Iberian Peninsula [37] [38].
5. How can I use causal theory to correct for sampling bias? You can frame data gaps as a missing data problem. The key is to identify and condition on variables that render the sampling process independent of your variable of interest (e.g., species abundance). Construct a causal diagram that includes all known factors affecting both sampling probability and your study variable. By conditioning on the correct variables (e.g., protected area status), you can break the spurious correlation and eliminate the bias [39] [40].
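The idea of conditioning on a variable that drives sampling can be shown with a small worked example. The Python sketch below compares a naive mean with a stratified mean that weights each stratum by its share of the whole study area. The abundance values, the protected-area shares, and the function names are hypothetical, and the approach assumes that sampling is effectively random within each stratum.

```python
def naive_mean(records):
    """Average over all records, ignoring how they were sampled."""
    return sum(r["abundance"] for r in records) / len(records)

def stratified_mean(records, stratum_key, population_share):
    """Weight each stratum's sample mean by its share of the whole study area."""
    total = 0.0
    for stratum, share in population_share.items():
        values = [r["abundance"] for r in records if r[stratum_key] == stratum]
        if not values:
            raise ValueError(f"no records for stratum {stratum!r}")
        total += share * (sum(values) / len(values))
    return total

# Hypothetical survey: protected sites are heavily over-sampled and hold more individuals.
records = (
    [{"protected": True, "abundance": 10.0} for _ in range(8)]
    + [{"protected": False, "abundance": 2.0} for _ in range(2)]
)
# Assumed true composition of the study area: 20% protected, 80% unprotected.
population_share = {True: 0.2, False: 0.8}

naive = naive_mean(records)
adjusted = stratified_mean(records, "protected", population_share)
print(f"naive mean={naive:.1f}, stratified mean={adjusted:.1f}")
```

The naive estimate is inflated because most records come from protected sites. Conditioning on protected-area status removes that distortion, provided the stratum shares are known and no other unmeasured factor links sampling to abundance.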
Problem: My species distribution model is overfitted due to spatially clustered occurrence records.
Problem: I suspect my diversification analysis is yielding a false positive for a trait.
Problem: Despite my efforts, significant geographic and taxonomic bias remains in my dataset.
Problem: I need to analyze temporal trends, but my data has many gaps in time and space.
Table 1: Taxonomic Bias in GBIF Data (Adapted from Scientific Reports, 2017) [36]
| Class | Total Occurrences | Percent of GBIF Data | Median Records per Species | Representation Status |
|---|---|---|---|---|
| Aves (Birds) | 345 million | 53% | 371 | Over-represented |
| Insecta (Insects) | Not Provided | Not Provided | <7 | Severely Under-represented |
| Arachnida (Spiders, Mites) | 2.17 million | ~0.3% | 3 | Severely Under-represented |
| Magnoliopsida (Flowering Plants) | Not Provided | Not Provided | Not Provided | Over-represented |
| Amphibia (Amphibians) | Not Provided | Not Provided | >20 | Over-represented |
Table 2: Effectiveness of Co-Location in Mitigating Geographic Bias [37] [38]
| Scenario | Number of Candidate Facilities Needed to Reduce Bias | Remaining Bias After Adding 50 Sites | Regions with Best Improvement |
|---|---|---|---|
| Current Conditions | 25 | 80% | Eastern Europe, Fennoscandia |
| Future Climate (RCP4.5) | 10 | 60% | Eastern Europe, Fennoscandia |
| Future Climate (RCP8.5) | 5 | Not Specified | Eastern Europe, Fennoscandia |
Protocol 1: Assessing and Correcting Geographic and Environmental Bias
Protocol 2: Testing for Trait-Dependent Diversification with SecSSE
Table 3: Essential Tools for Bias-Aware Macroecological Research
| Tool / Resource | Type | Function in Research |
|---|---|---|
| GBIF (Global Biodiversity Information Facility) | Data Repository | Provides massive, open-access species occurrence data; also used to quantify and study taxonomic and geographic biases themselves [36]. |
| SecSSE R Package | Software / Statistical Tool | Models trait-dependent diversification while accounting for hidden traits, reducing false positives [7]. |
| Causal Diagram | Conceptual Framework | A graphical model used to identify variables that, when conditioned on, can eliminate sampling bias by breaking correlations between sampling probability and the study variable [39]. |
| RCP (Representative Concentration Pathway) Scenarios | Climate Data | Projections of future climate used to assess the representativeness of research infrastructures under future conditions and guide strategic network planning [37]. |
| DoPI (Database of Pollinator Interactions) | Thematic Database | Example of a curated database aiming to consolidate interaction data, helping to overcome information gaps and biases for specific ecological groups [41]. |
1. What is the sampling fraction and why is it critical for SSE models? The sampling fraction represents the proportion of species included in your phylogenetic tree relative to the total number of species in the clade. In State-dependent Speciation and Extinction (SSE) models, it is a critical parameter for accurately estimating speciation, extinction, and trait transition rates. Mis-specifying this parameter can lead to severely biased results, including false positives for trait-dependent diversification [12] [14].
2. How does mis-specifying the sampling fraction lead to false positives? Mis-specification creates a mismatch between the model and the true evolutionary process. When the sampling fraction is incorrect, the model may incorrectly attribute patterns of diversity to an observed trait, when in fact the pattern was caused by the uneven sampling of the tree or other unobserved (hidden) traits [7] [12]. This is a serious form of model misspecification, where the assumed data-generating process does not match reality [42].
3. Is it better to over-estimate or under-estimate the sampling fraction? Simulation studies suggest it is better to cautiously under-estimate the sampling fraction. Over-estimating it (i.e., assuming your tree is more complete than it really is) leads to a significant increase in false positive rates. Under-estimating it may cause parameter values to be over-estimated, but it is considered the less risky approach [12] [14].
4. Beyond the overall fraction, what other sampling issues should I worry about? Taxonomically biased sampling, where some sub-clades are heavily under-sampled while others are well-sampled, is a major concern. This type of uneven sampling is particularly dangerous when overall tree completeness is 60% or less, as it can dramatically increase false positive rates and reduce parameter accuracy compared to random sampling [12].
5. Can including fossil data solve these problems? Incorporating fossil data can improve the accuracy of extinction-rate estimates, which are traditionally difficult to estimate from extant-only trees. However, even with fossils, SSE models can still incorrectly identify correlations between diversification rates and neutral traits if the true driver of diversification is not observed. Therefore, fossils are a valuable tool but not a complete solution to the model misspecification problem [8].
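As a minimal sketch of the definition given in the first question above, the overall and per-state sampling fractions can be computed from the tips in the tree and an estimate of total richness. The function name, species names, and richness figures below are hypothetical and only illustrate the arithmetic.

```python
from collections import Counter

def sampling_fractions(tip_states, total_by_state):
    """Return (overall, per_state) sampling fractions.

    tip_states: mapping of sampled species name -> trait state.
    total_by_state: mapping of trait state -> estimated total number of
        species in the clade with that state (sampled or not).
    """
    sampled_by_state = Counter(tip_states.values())
    per_state = {}
    for state, total in total_by_state.items():
        sampled = sampled_by_state.get(state, 0)
        if sampled > total:
            raise ValueError(f"state {state!r}: more tips than known species")
        per_state[state] = sampled / total
    overall = len(tip_states) / sum(total_by_state.values())
    return overall, per_state

# Hypothetical clade: 6 sampled tips out of 20 known species.
tips = {"sp1": 0, "sp2": 0, "sp3": 0, "sp4": 0, "sp5": 1, "sp6": 1}
totals = {0: 10, 1: 10}
overall, per_state = sampling_fractions(tips, totals)
print(overall)    # 0.3
print(per_state)  # {0: 0.4, 1: 0.2}
```

In this toy case the overall fraction hides the fact that state 1 is sampled half as well as state 0, which is the kind of imbalance the questions above warn about.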
| Problem Symptom | Potential Cause | Diagnostic Checks | Corrective Actions |
|---|---|---|---|
| High false positive rate for trait-diversification correlation. | Sampling fraction is over-estimated; taxonomic sampling is biased [12] [14]. | Check for clade-specific sampling heterogeneity. Compare model fits between HiSSE/SecSSE and MuSSE [7]. | Re-calculate clade-specific sampling fractions. Use a model that accounts for hidden states [7]. |
| Parameter estimates are consistently too high (e.g., speciation, extinction). | Sampling fraction is under-estimated [12] [14]. | Compare analysis results using a range of plausible sampling fractions. | Use the most accurate, evidence-based sampling fraction, even if it is a cautious under-estimate [12]. |
| Poor model fit and unreliable parameter estimates. | Overall low tree completeness (low sampling fraction) [12]. | Assess the completeness of your phylogenetic tree. | Consider analyses that are robust to incomplete sampling. Be transparent about uncertainty. |
| Spurious trait association is detected. | The true trait driving diversification is unobserved (a concealed trait) [7] [8]. | Test multiple trait hypotheses. Use models like SecSSE that include a hidden state [7]. | Incorporate fossil data to improve extinction estimates, but remain cautious in interpretation [8]. |
Protocol 1: Testing Robustness to Sampling Fraction Mis-specification
This protocol uses sensitivity analysis to evaluate how uncertainty in the sampling fraction affects your study's conclusions.
| Sampling Fraction Scenario | Speciation Rate (λ) | Extinction Rate (μ) | Trait-Dependent Diversification Support |
|---|---|---|---|
| Baseline (Best Estimate) | 0.3 | 0.1 | Strong |
| Cautious Under-Estimate | 0.35 | 0.12 | Strong |
| Over-Estimate | 0.25 | 0.08 | Weak |
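One way to set up the scenarios in the table above is to derive each sampling fraction from a plausible range of total clade richness. The sketch below assumes you have a low, best, and high richness estimate; the function name and the numbers are hypothetical, and each resulting fraction would be passed to a separate model run.

```python
def sampling_fraction_scenarios(n_sampled, richness_low, richness_best, richness_high):
    """Map scenario name -> sampling fraction for a sensitivity analysis."""
    if not (n_sampled <= richness_low <= richness_best <= richness_high):
        raise ValueError("expected n_sampled <= low <= best <= high")
    return {
        "baseline": n_sampled / richness_best,
        # A larger assumed clade gives a smaller fraction (cautious under-estimate).
        "cautious_under_estimate": n_sampled / richness_high,
        # A smaller assumed clade gives a larger fraction (over-estimate).
        "over_estimate": n_sampled / richness_low,
    }

# Hypothetical tree with 120 tips and an uncertain clade size of 150 to 300 species.
scenarios = sampling_fraction_scenarios(120, 150, 200, 300)
for name, fraction in scenarios.items():
    print(f"{name}: {fraction:.2f}")
```

If the conclusion about trait dependence changes between the baseline and the cautious under-estimate, that is a signal to report the result as sensitive to clade-size uncertainty.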
Protocol 2: Designing a Simulation to Validate Your Workflow
This methodology, based on Mynard et al. [12] [14], uses simulated data to verify your analytical pipeline.
Use an R package such as secsse or TreeSim to generate phylogenetic trees under a known model. This can include:
The following diagram outlines a systematic workflow to minimize errors related to sampling fraction in SSE analyses.
| Tool / Reagent | Function in Analysis | Technical Notes |
|---|---|---|
| SecSSE R Package [7] | Simultaneously infers state-dependent diversification across multiple observed traits while accounting for the role of a possible hidden trait. | Allows for traits to be in more than one state simultaneously (e.g., for generalist species). Correctly implements the likelihood conditional on non-extinction. |
| HiSSE Model [8] | Infers diversification rates that depend on hidden states, helping to avoid false positives driven by unobserved traits. | A foundational model for addressing the limitations of MuSSE. Typically limited to binary observed traits. |
| TensorPhylo Plugin (RevBayes) [8] | A Bayesian framework that integrates SSE models with the fossilized birth-death process, allowing for the inclusion of fossil data. | Improves the accuracy of extinction-rate estimates. Requires more sophisticated statistical setup but is highly powerful. |
| Clade-Specific Sampling Fractions [12] | A set of proportions reflecting the completeness of sampling for individual sub-clades within a larger phylogeny. | Critical for avoiding false positives when sampling is taxonomically biased. Should be used instead of a single overall fraction whenever possible. |
| FiSSE Method [8] | A non-parametric approach that tests for a correlation between a trait and diversification rates before applying complex SSE models. | Serves as a useful preliminary check to frame subsequent hypothesis testing. |
FAQ 1: What is "tree completeness" and why is it critical for phylogenetic analysis? Tree completeness refers to the proportion of known or extant species included in a phylogenetic tree relative to the total number in the clade of interest. Reaching a critical threshold of completeness (often around 60% or higher) is vital because incomplete trees can lead to biased parameter estimates in downstream analyses. Under-sampled phylogenies can misrepresent evolutionary relationships and processes, significantly impacting the accuracy of trait-dependent diversification analyses [43] [8].
FAQ 2: How can incomplete trees lead to false positives in trait-dependent diversification studies? Incomplete trees can create spurious correlations between neutral traits and diversification rates. When the true source of diversification rate variation is not observed (a "hidden state"), models may incorrectly identify an observed but neutral trait as the driver. This is a known issue with State-dependent Speciation and Extinction (SSE) models like BiSSE and MuSSE. The problem is exacerbated when phylogenetic trees are incomplete, as the model has less information to correctly infer the underlying evolutionary process [7] [8].
FAQ 3: What methods can help reduce false positives when my tree is not fully complete? To mitigate false positives, use models that account for unobserved traits. The SecSSE (Several Examined and Concealed States-Dependent Speciation and Extinction) model is specifically designed to infer state-dependent diversification across multiple observed traits while accounting for the role of a possible concealed (hidden) trait. Additionally, incorporating fossil data where possible using the Fossilized Birth-Death (FBD) process can significantly improve extinction rate estimates, which are often poorly estimated from extant-only trees [7] [8].
FAQ 4: Are there specific thresholds for tree completeness I should aim for? While a universal threshold is difficult to define, simulation studies suggest that accuracy improves significantly with higher sampling. The "critical 60% threshold" is a practical benchmark. Below this level of completeness, the error in parameter estimation (like extinction rates) can become substantial. However, the exact required completeness can vary based on tree size, the strength of the trait-diversification relationship, and the complexity of the model used [43].
Symptoms:
Diagnostic Steps:
Solutions:
Symptoms:
Diagnostic Steps:
Solutions:
This protocol allows researchers to quantify how tree incompleteness affects their specific analyses.
TreeSim in R or within RevBayes.
A methodological workflow for minimizing false positives.
The table below summarizes key quantitative findings from simulation studies on the effects of data quality and correction methods.
Table 1: Impact of Data Quality and Correction Methods on Phylogenetic Inference
| Condition / Method | Key Metric | Performance / Finding | Reference |
|---|---|---|---|
| Low Sequence Informativeness (θ=0.001, 200 sites) | Average Gene Tree Estimation Error (GTEE) | 0.794 (High error) | [43] |
| High Sequence Informativeness (θ=0.01, 2000 sites) | Average Gene Tree Estimation Error (GTEE) | 0.135 (Low error) | [43] |
| TRACTION Error Correction (High ILS) | % of cases where corrected tree is closer to true gene tree | As low as 0.485% | [43] |
| TreeFix Error Correction (Low informativeness) | % of cases where corrected tree is closer to true gene tree | As low as 5.34% | [43] |
| SSE Models with Extant-Only Data | Accuracy of extinction rate estimates | Low power and accuracy; high false-positive risk from hidden states | [8] |
| SSE Models with Fossil Data (FBD) | Accuracy of extinction rate estimates | Improved accuracy, but does not fully eliminate false positives from neutral traits | [8] |
Table 2: Research Reagent Solutions for Phylogenetic Analysis
| Reagent / Software | Primary Function | Application in Trait-Diversification Research |
|---|---|---|
| SecSSE (R package) | State-dependent diversification analysis | Infers dependence on multiple observed traits while accounting for hidden traits to reduce false positives [7]. |
| RevBayes + TensorPhylo | Bayesian phylogenetic inference | Implements FBD-based SSE models to integrate fossil data, improving extinction rate estimates [8]. |
| FastTree | Phylogeny inference for large alignments | Computes approximately-maximum-likelihood trees quickly; useful for building large gene trees as input for species tree estimation [44]. |
| HiSSE | State-dependent diversification analysis | Models the influence of a hidden state on diversification rates, a key strategy for mitigating false positives [7]. |
| TreeFix | Gene tree error correction | Adjusts gene trees to be more consistent with a species tree and sequence alignment; use with caution as it can increase error under high ILS [43]. |
Robust Trait-Diversification Analysis Workflow
False Positive Causation Path
1. Why is accounting for sampling fraction uncertainty critical in trait-dependent diversification studies?
Inaccurate sampling fractions can lead to false positives when testing for trait-dependent diversification [1]. Methods like MuSSE are known to produce false positives because they cannot separate the true effect of a trait from underlying, unobserved (hidden) diversification rate variation [1]. By incorporating uncertainty about the sampling process through Bayesian priors, you explicitly model this source of error, leading to more robust parameter estimates and reducing the risk of spurious conclusions.
2. What is the fundamental difference between a global sampling fraction and clade-specific sampling probabilities?
A global sampling fraction assumes that species across your entire phylogeny are missing at random. You use a single value (e.g., globalSamplingFraction = 0.73) to indicate that 73% of all species in the clade are included in your tree [45]. In contrast, clade-specific sampling probabilities are used when sampling is non-random and varies significantly between different subclades. This approach allows you to assign a unique sampling fraction to each major group in your phylogeny (e.g., 0.2 for one genus and 0.8 for another) within the same analysis [45].
3. My phylogeny is highly incomplete (e.g., <10% of species sampled). What is the best practice?
For phylogenies that are extremely incomplete, the analytical correction for incomplete sampling may be insufficient. The BAMM project strongly recommends using a stochastic polytomy resolver, such as the PASTIS method and its associated R package, to place missing species into the tree. Even with the inherent uncertainty this introduces, it often yields better results than the standard analytical correction for highly incomplete phylogenies [45].
4. How do I set appropriate priors for rate parameters to minimize bias?
Specifying inappropriate priors can significantly influence your results. The scale of your phylogenetic tree (its branch lengths) should inform your prior choices. It is recommended to use helper functions like setBAMMpriors from the BAMMtools R package. This function automatically sets priors for parameters like lambdaInit and muInit based on the properties of your tree, making the analysis less sensitive to the absolute scale of your data and helping to prevent the detection of spurious rate shifts [45].
5. When should I consider using the SecSSE model over other trait-dependent models?
You should consider using SecSSE (Several Examined and Concealed States-Dependent Speciation and Extinction) when your analysis involves:
Symptoms: Diversification rate estimates seem biologically implausible or vary wildly between subclades without a clear reason. Shifts in diversification are detected in clades with poor sampling.
Solution: Implement clade-specific sampling probabilities in your Bayesian analysis.
Create a Sampling Data File: This file informs the software of the varying sampling effort across your tree.
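The first line of the file gives the sampling probability for backbone lineages, for example 1.0 [45].
Each subsequent line lists three fields: speciesName, cladeName, and samplingFraction [45].
Example Sampling Data File Structure: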
In this example, the two species from "Genus_fu" are from a clade where only 20% of species were sampled, while the two from "Genus_bar" are from a clade with 80% sampling. [45]
Configure Software: In your analysis control file (e.g., for BAMM), set the parameter useGlobalSamplingProbability = 0 and provide the path to your sampling data file using sampleProbsFilename [45].
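The sampling data file described above can be generated programmatically. The sketch below assumes a tab-separated layout with the backbone probability on the first line; the species names and the helper function are hypothetical, and only the clade names and fractions follow the example in the text.

```python
def format_sampling_file(backbone_fraction, rows):
    """Build the text of a clade-specific sampling file.

    rows: iterable of (species_name, clade_name, sampling_fraction).
    Assumed layout: backbone fraction first, then one tab-separated row per tip.
    """
    lines = [str(backbone_fraction)]
    for species, clade, fraction in rows:
        if not 0.0 < fraction <= 1.0:
            raise ValueError(f"{species}: sampling fraction must be in (0, 1]")
        lines.append(f"{species}\t{clade}\t{fraction}")
    return "\n".join(lines) + "\n"

# Hypothetical tips; clade names and fractions mirror the example in the text.
rows = [
    ("Genus_fu_sp1", "Genus_fu", 0.2),
    ("Genus_fu_sp2", "Genus_fu", 0.2),
    ("Genus_bar_sp1", "Genus_bar", 0.8),
    ("Genus_bar_sp2", "Genus_bar", 0.8),
]
print(format_sampling_file(1.0, rows), end="")
```

Every tip in the tree needs a row, so it is worth checking the row count against the number of tips before starting the analysis.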
Symptoms: You have unsampled lineages that do not belong to any of the defined subclades in your sampling file, making it difficult to set the backbone sampling probability.
Solution: Use a total-clade-based estimate for the backbone fraction.
6 / 22 = 0.27 [45].
Symptoms: Your analysis strongly suggests a trait influences diversification, but you suspect the signal might be driven by an unmeasured confounding variable.
Solution: Use a method that accounts for hidden states.
Symptoms: Analyses on the same data but with rescaled branch lengths (e.g., from units of million years to time units) produce different estimates of rate heterogeneity.
Solution: Use tools to set scale-aware priors automatically.
Run the setBAMMpriors function from the BAMMtools R package on your phylogeny.
Use the resulting scale-appropriate priors for the speciation (lambdaInit) and extinction (muInit) parameters. This ensures that the ratio of estimated rates remains consistent even if the tree's scale is changed [45].
Table 1: Essential computational tools and their functions for analyzing diversification with sampling uncertainty.
| Tool/Package Name | Primary Function | Key Application in Context |
|---|---|---|
| SecSSE (R package) | Several Examined and Concealed States-Dependent Speciation and Extinction | Simultaneously tests dependence of diversification on multiple observed traits while accounting for hidden states to reduce false positives [1] [7]. |
| BAMM / BAMMtools | Bayesian Analysis of Macroevolutionary Mixtures | Estimates complex models of speciation, extinction, and rate shifts through time and among lineages, with robust options for incorporating sampling fractions [45]. |
| PASTIS (R package) | Stochastic Polytomy Resolver | Places missing taxa into a phylogeny via stochastic resolution, recommended for highly incomplete datasets (<10% sampling) [45]. |
| TreeSim (R package) | Simulating Phylogenetic Trees | Generates trees under various diversification scenarios (e.g., with mass extinctions and rate shifts) for method testing and validation [5]. |
| Medusa | Modeling Evolutionary Diversification Using Stepwise AIC | A maximum likelihood framework for detecting shifts in diversification rates across a phylogeny [5]. |
This protocol outlines the steps to run a robust trait-dependent diversification analysis using SecSSE.
Step 1: Data Preparation
Step 2: Model Specification
Step 3: Prior Selection
Step 4: Model Execution & Comparison
Step 5: Interpretation
The diagram below visualizes the integrated workflow for conducting a robust diversification analysis that accounts for sampling fraction and hidden states.
Q1: What is the core purpose of comparing ETD, CTD, and CR models? The comparison aims to reliably determine if a specific observed trait has influenced species diversification rates. This framework tests the hypothesis of trait-dependent diversification against two alternative explanations: that diversification is driven by some unobserved, "concealed" trait (CTD) or that it has occurred at a constant rate, independent of any trait (CR) [9] [1].
Q2: Why might my analysis falsely identify a trait as driving diversification? A common cause of false positives is taxonomically biased sampling, where some sub-clades in your phylogenetic tree are heavily under-sampled. This risk is particularly high when overall tree completeness is 60% or lower [9]. Earlier models like MuSSE were also known for high false positive rates, a problem that newer models like HiSSE and SecSSE were designed to reduce by accounting for hidden traits [1] [7].
Q3: How does an incomplete phylogenetic tree impact my results? Lower sampling fractions (i.e., lower tree completeness) reduce the accuracy of both model selection and parameter estimation (like speciation and extinction rates) [9]. The table below summarizes the key impacts of sampling fraction and bias based on simulation studies [9].
| Issue | Impact on Model Selection | Impact on Parameter Estimation |
|---|---|---|
| Low Sampling Fraction (e.g., ≤ 60%) | Reduced accuracy in selecting the correct model (ETD, CTD, CR). | Speciation, extinction, and transition rates are estimated less accurately. |
| Taxonomically Biased Sampling (e.g., under-sampling tropical species) | Increased rate of false positives for trait-dependent diversification. | Parameter estimates are less accurate compared to random sampling at the same completeness level. |
| Mis-specified Sampling Fraction (Using wrong clade size) | False positives increase if the sampling fraction is over-estimated. | Parameters are over-estimated if sampling is under-specified; parameters are under-estimated if sampling is over-specified. |
Q4: I have multiple traits of interest. Which model should I use? For a single trait, you can use HiSSE. However, if you need to analyze two or more observed traits simultaneously while also accounting for the possible effect of a hidden trait, you should use the SecSSE (Several Examined and Concealed States-dependent Speciation and Extinction) model [1] [7].
Q5: What is the best practice for specifying the sampling fraction when the true clade size is uncertain? If the total number of species in the clade is unknown, it is better to cautiously under-estimate the sampling fraction. Over-estimating the sampling fraction (specifying it as higher than it truly is) leads to a greater increase in false positives [9]. Using a Bayesian framework with a prior distribution on the sampling fraction can also help account for this uncertainty [9].
Potential Causes and Solutions:
Cause: Low or Biased Phylogenetic Sampling
Cause: Mis-specification of the Sampling Fraction
Cause: Using a Model Prone to False Positives
Potential Causes and Solutions:
Cause: Low Overall Sampling Fraction
Cause: Incorrect Sampling Fraction Parameter
Check that the sampling.f argument in your SSE function (e.g., in hisse or SecSSE) correctly reflects the proportion of species included in the tree for each trait state (the argument name differs between packages, so check each package's documentation). As shown in the table below, mis-specification directly and predictably biases parameter estimates [9].
| Mis-specification Scenario | Effect on Parameter Estimates |
|---|---|
| Sampling fraction specified lower than true value. | Parameter values are over-estimated. |
| Sampling fraction specified higher than true value. | Parameter values are under-estimated. |
Experimental Protocol: Model Comparison Workflow
The following workflow outlines a robust methodology for comparing ETD, CTD, and CR models, incorporating best practices to minimize false inferences [9] [1].
The following table details key software and methodological "reagents" for implementing the ETD/CTD/CR model comparison framework [1] [7].
| Item Name | Type | Function / Application |
|---|---|---|
| HiSSE | R Package | Models trait-dependent diversification for a single binary trait while accounting for hidden states via the CTD model, reducing false positives. |
| SecSSE | R Package | Extends the framework to multiple examined traits and states, allowing simultaneous analysis while accounting for a concealed trait. |
| Sampling Fraction (sampling.f) | Model Parameter | A critical correction factor that accounts for incomplete taxon sampling in the phylogenetic tree, specified per trait state. |
| Concealed Trait Model (CTD/CID) | Model Structure | A null model that tests whether diversification is better explained by a hidden trait rather than the observed trait of interest. |
| Akaike Information Criterion (AIC) | Statistical Metric | Used for model selection, penalizing model complexity to identify the best-fitting model among ETD, CTD, and CR. |
| Constant Rates Model (CR) | Model Structure | The simplest null model assuming no change in diversification rates across the tree, used as a baseline for comparison. |
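The AIC-based comparison listed in the table can be sketched as follows. The log-likelihoods and parameter counts are hypothetical placeholders for values you would obtain from fitted ETD, CTD, and CR models; Akaike weights are added here as one common way to express relative support, not as a step prescribed by the source.

```python
import math

def aic(log_likelihood, n_params):
    """AIC = 2k - 2 ln L; lower values indicate a better fit/complexity trade-off."""
    return 2 * n_params - 2 * log_likelihood

def akaike_weights(aic_by_model):
    """Convert AIC scores into relative model weights that sum to 1."""
    best = min(aic_by_model.values())
    rel = {name: math.exp(-0.5 * (score - best)) for name, score in aic_by_model.items()}
    total = sum(rel.values())
    return {name: value / total for name, value in rel.items()}

# Hypothetical fits: (log-likelihood, number of free parameters).
fits = {"ETD": (-250.0, 8), "CTD": (-248.5, 8), "CR": (-260.0, 4)}
scores = {name: aic(ll, k) for name, (ll, k) in fits.items()}
weights = akaike_weights(scores)
for name in sorted(scores, key=scores.get):
    print(f"{name}: AIC={scores[name]:.1f}, weight={weights[name]:.3f}")
```

In this made-up example the CTD model has the lowest AIC, so the apparent effect of the examined trait would be attributed to a concealed trait instead.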
Q: My SSE analysis strongly supports trait-dependent diversification, but I suspect it might be a false positive. What could be wrong?
A: False positives in State-dependent Speciation and Extinction (SSE) models frequently arise from inadequate phylogenetic sampling and mis-specified sampling fractions. When your phylogenetic tree contains ≤60% of known species and sampling is taxonomically biased (e.g., uneven across sub-clades), the risk of incorrectly inferring trait-dependent diversification increases substantially [9].
Q: How reliable is "background knowledge" from previous studies for informing my model selection?
A: Background knowledge derived from preceding studies often proves unreliable. Simulation studies demonstrate that variables identified as "known predictors" from previous research are often false positives, especially when those studies used inappropriate selection methods like univariable preselection [46].
Q: What are the most critical factors affecting accuracy in SSE model selection?
A: Key factors include phylogenetic tree completeness, accuracy of sampling fraction specification, and whether sampling is random or taxonomically biased. Mis-specifying the sampling fraction severely affects parameter estimation accuracy [9].
Simulation-Based Validation Protocol
This methodology evaluates whether SSE model support reflects true biological patterns or statistical artifacts [9]:
Data Generation: Simulate phylogenetic trees and trait data under three scenarios: Examined Trait Dependent (ETD), Concealed Trait Dependent (CTD), and Constant Rate (CR) models.
Sampling Manipulation:
Model Fitting: Apply SSE models to each simulated dataset using appropriate concealed/hidden trait models.
Performance Assessment:
Table 1: Impact of Sampling Fraction on SSE Model Accuracy
| Sampling Fraction | False Positive Rate (Random Sampling) | False Positive Rate (Biased Sampling) | Parameter Estimate Accuracy |
|---|---|---|---|
| 20% | High (>40%) | Very High (>60%) | Poor (<50%) |
| 40% | Moderate (20-40%) | High (40-60%) | Fair (50-70%) |
| 60% | Low (10-20%) | Moderate (20-40%) | Good (70-85%) |
| 80%+ | Very Low (<10%) | Low (10-20%) | Excellent (>85%) |
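The sampling-manipulation step of the validation protocol above can be sketched as a function that prunes a simulated tip set to a target completeness, either at random or with one clade under-sampled. The function and its parameters are hypothetical; the biased branch uses weighted sampling without replacement (Efraimidis-Spirakis keys), which is one possible way to impose taxonomic bias.

```python
import random

def subsample_tips(tip_clades, completeness, biased_clade=None, bias_weight=0.2, seed=0):
    """Return the sorted tips kept after pruning to `completeness`.

    tip_clades: mapping of tip name -> clade label.
    biased_clade: if given, tips in this clade are retained with relative
        weight `bias_weight` (values below 1 under-sample the clade).
    """
    rng = random.Random(seed)
    tips = sorted(tip_clades)
    n_keep = round(completeness * len(tips))
    if biased_clade is None:
        return sorted(rng.sample(tips, n_keep))
    # Weighted sampling without replacement: key = u ** (1 / weight), keep largest keys.
    keyed = []
    for tip in tips:
        weight = bias_weight if tip_clades[tip] == biased_clade else 1.0
        keyed.append((rng.random() ** (1.0 / weight), tip))
    keyed.sort(reverse=True)
    return sorted(tip for _, tip in keyed[:n_keep])

# Hypothetical simulated tree with two clades of 10 tips each.
tips = {f"t{i:02d}": ("A" if i < 10 else "B") for i in range(20)}
random_kept = subsample_tips(tips, 0.6)
biased_kept = subsample_tips(tips, 0.6, biased_clade="A")
print(len(random_kept), len(biased_kept))
```

Both calls keep the same number of tips, so any difference in false positive rate between them can be attributed to the sampling pattern and not to completeness.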
Best Practices for Empirical Studies
Sampling Documentation: Explicitly report sampling fractions for each trait state and describe any taxonomic or geographic sampling biases [9].
Model Comparison Framework: Always compare Examined Trait Dependent (ETD) models against appropriate null models including Constant Rate (CR) and Concealed Trait Dependent (CTD) models [9].
Sensitivity Analyses: Conduct comprehensive sensitivity tests for sampling fraction specification, as over-estimation increases false positives while under-estimation provides more conservative results [9].
Table 2: Essential Computational Tools for Trait-Dependent Diversification Analysis
| Tool Name | Functionality | Application Context |
|---|---|---|
| HiSSE | Hidden-State-Dependent Speciation and Extinction | Accounting for unmeasured traits |
| GeoHiSSE | Biogeographical trait-dependent diversification | Spatial analyses of diversification |
| MuHiSSE | Multi-trait state diversification analysis | Complex trait interactions |
| SecSSE | Several Examined and Concealed States | Partial trait state data accommodation |
| BiSSE | Binary State Speciation and Extinction | Basic binary trait analyses |
SSE Analysis Decision Framework
Q: What sampling fraction is sufficient to minimize false positives in SSE analyses?
A: Based on simulation studies, sampling fractions ≥80% provide the most reliable results, while fractions ≤60% significantly increase false positive risks, especially with taxonomically biased sampling [9]. When sampling is imbalanced across sub-clades and tree completeness is ≤60%, false positive rates increase substantially compared to random sampling scenarios.
Q: How does mis-specification of sampling fraction affect parameter estimates?
A: Mis-specifying sampling fractions systematically biases parameter estimates:
Table 3: Sampling Fraction Mis-specification Effects
| Mis-specification Type | Effect on Speciation Rates | Effect on Transition Rates | False Positive Risk |
|---|---|---|---|
| Over-estimated (80% vs true 60%) | Under-estimated | Under-estimated | Increased |
| Under-estimated (60% vs true 80%) | Over-estimated | Over-estimated | Decreased |
Q: What are the key differences between ETD, CTD, and CR models in SSE analyses?
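A: These models represent different hypotheses about diversification drivers [9]:
ETD (Examined Trait Dependent): diversification rates depend on the observed trait of interest.
CTD (Concealed Trait Dependent): diversification rates depend on an unobserved (concealed) trait.
CR (Constant Rate): diversification rates are constant and independent of any trait.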
The CTD model is particularly important as it controls for the fact that diversification rates might vary with some unmeasured trait rather than your focal trait, thereby reducing false inferences of trait-dependent diversification.
Q: How can I determine if my model selection results are robust?
A: Implement these validation steps:
1. What is sensitivity analysis in the context of evolutionary biology research? Sensitivity Analysis is the study of how uncertainty in the output of a model can be apportioned to different sources of uncertainty in the model input [47] [48]. In phylogenetic studies of trait-dependent diversification, it examines how changes in model assumptions, parameters, or input data affect the detection of speciation and extinction rates, helping to validate findings and identify false positives [1].
2. Why is global sensitivity analysis preferred over local methods for complex diversification models? Local sensitivity analysis varies parameters around specific reference values and can be heavily biased for nonlinear models or where factors interact, as it underestimates their importance and only partially explores the parametric space [48]. Global sensitivity analysis varies uncertain factors within the entire feasible space, revealing the global effects of each parameter on the model output, including any interactive effects, and is therefore preferred for non-linear models common in diversification research [47] [48].
3. What is the difference between sensitivity analysis and scenario analysis? Sensitivity analysis typically changes one or two variables at a time to isolate their individual impact on the outcome. In contrast, scenario analysis changes multiple variables simultaneously to create coherent, realistic scenarios, such as modeling a "recession" scenario that affects several assumptions at once [49]. They are often used together, with sensitivity analysis identifying critical variables for inclusion in broader scenario analysis [50] [49].
4. My sensitivity analysis results show the same value in every cell of the data table. What is wrong? This common issue in tools like Excel can stem from several causes [51]:
Calculation Options is set to Manual instead of Automatic.
5. How can sensitivity analysis help reduce false positives in trait-dependent diversification studies? Methods like MuSSE (multiple-states dependent speciation and extinction) are known to yield false positives because they cannot separate differential diversification rates from dependence on the observed traits [1]. Techniques like HiSSE and SecSSE (several examined and concealed states-dependent speciation and extinction) address this by incorporating a hidden state that affects diversification, providing a more robust framework to confirm whether an observed trait genuinely influences diversification rates [1].
Problem
Simulation studies indicate that under complex diversification scenarios involving both lineage-specific rate shifts and mass extinction events, phylogenetic methods have better performance detecting lineage shifts than mass extinctions [5]. There is a tendency to over-predict rate-shift events as scenario complexity increases, while mass extinction events remain under-detected [5].
Solution
Problem
When performing a "What-If" analysis using Data Tables in Excel, the resulting matrix populates with the same value in every cell, failing to show how the output varies with different inputs [51].
Resolution Steps
1. Check the calculation settings: go to Formulas > Calculation Options and ensure it is set to Automatic. If it was set to Manual, switching to Automatic will recalculate the table [51] [52].
2. Confirm that the top-left cell of the data table contains a formula linked to the model output (e.g., =B20, where B20 contains your model's result, such as EPS or Net Present Value) [51] [52].
3. Press F9 to force a recalculation of the entire worksheet [52].
Problem
The results of a global sensitivity analysis are unreliable or do not adequately represent the model's behavior across the entire parameter space.
Resolution Steps
The table below summarizes key methods used in sensitivity analysis, which can be applied to models in evolutionary biology and drug development.
Table 1: Key Sensitivity Analysis Methods and Applications
| Method | Core Principle | Best Use Cases | Common Visualizations |
|---|---|---|---|
| Local Sensitivity Analysis [47] [48] | Varies one parameter at a time (OAT) around a base value, often using derivatives. | Understanding the immediate impact of minor variations in assumptions near expected values. | Spider/Radar Charts, Bar Charts showing percentage change [49]. |
| Global Sensitivity Analysis [47] [48] | Varies multiple parameters simultaneously across their entire range of uncertainty. | Exploring model behavior under extreme conditions and understanding interactive effects between parameters. | Scatter Plot Matrices, Sensitivity Indices Charts (e.g., Sobol indices), Heatmaps [47] [49]. |
| One-Way Analysis [49] [53] | Changes a single input variable across a range while holding all others constant. | Identifying which individual variables have the most significant influence on the output. | Tornado Diagrams, Line Plots [49]. |
| Two-Way Analysis [52] [49] | Examines how simultaneous changes in two specific variables affect the outcome. | Uncovering interactions between two key variables that might not be apparent from one-way analysis. | Heatmaps, Contour Plots, 3D Surface Plots [49]. |
| Probabilistic (Monte Carlo) Analysis [47] [49] | Uses probability distributions for inputs and runs thousands of iterations to create a probability distribution for the output. | Quantifying risk and uncertainty, providing a full probability distribution of outcomes rather than a single value. | Histograms/Probability Density Functions, Cumulative Distribution Functions (CDFs), Box Plots [49]. |
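To make the contrast between local (one-at-a-time) and probabilistic (Monte Carlo) analysis in the table concrete, the sketch below applies both to a toy model: the expected number of lineages under a constant-rate birth-death process, exp((λ − μ)t). The model choice, parameter ranges, and function names are illustrative assumptions, not part of any cited workflow.

```python
import math
import random

def expected_richness(lam, mu, t):
    """Toy model: expected lineages from one ancestor under constant-rate birth-death."""
    return math.exp((lam - mu) * t)

def one_at_a_time(model, base, rel_step=0.1):
    """Local analysis: relative output change when each input alone rises by rel_step."""
    y0 = model(**base)
    effects = {}
    for name, value in base.items():
        perturbed = dict(base, **{name: value * (1 + rel_step)})
        effects[name] = (model(**perturbed) - y0) / y0
    return effects

def monte_carlo(model, ranges, n=2000, seed=1):
    """Global analysis: sample all inputs jointly and return (inputs, output) pairs."""
    rng = random.Random(seed)
    runs = []
    for _ in range(n):
        x = {name: rng.uniform(lo, hi) for name, (lo, hi) in ranges.items()}
        runs.append((x, model(**x)))
    return runs

base = {"lam": 0.3, "mu": 0.1, "t": 10.0}
effects = one_at_a_time(expected_richness, base)
runs = monte_carlo(expected_richness, {"lam": (0.1, 0.5), "mu": (0.0, 0.2), "t": (5.0, 15.0)})
outputs = sorted(y for _, y in runs)
print({name: round(e, 3) for name, e in effects.items()})
print("median output:", round(outputs[len(outputs) // 2], 2))
```

The one-at-a-time effects describe behavior only near the base values, whereas the Monte Carlo runs cover the whole stated range and can be passed on to correlation or variance-based indices.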
This protocol is adapted from general global sensitivity analysis workflows for use with complex evolutionary models like those detecting trait-dependent diversification [47] [48].
1. Define Model and Objective
2. Set Up the Experimental Design
3. Execute Model and Compute Output
4. Analyze the Relations
5. Interpret Results
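The five steps above can be sketched in a few lines for a variance-based analysis. This is an illustrative example only: the additive `toy_model`, the assumption of independent inputs uniform on [0, 1], and the sample size are placeholders, and the estimator is a commonly used Saltelli-style Monte Carlo estimator of first-order Sobol indices rather than a method prescribed by the cited sources.

```python
import random

def toy_model(x):
    """Hypothetical additive model; the third input has no effect by construction."""
    return 4.0 * x[0] + 2.0 * x[1] + 0.0 * x[2]

def first_order_sobol(model, n_params, n_samples, seed=0):
    """Estimate first-order Sobol indices from two independent sample matrices.

    Inputs are assumed independent and uniform on [0, 1].
    """
    rng = random.Random(seed)
    a = [[rng.random() for _ in range(n_params)] for _ in range(n_samples)]
    b = [[rng.random() for _ in range(n_params)] for _ in range(n_samples)]
    f_a = [model(row) for row in a]
    f_b = [model(row) for row in b]

    # Output mean and variance, pooled over both sample matrices.
    pooled = f_a + f_b
    mean = sum(pooled) / len(pooled)
    variance = sum((y - mean) ** 2 for y in pooled) / (len(pooled) - 1)

    indices = []
    for i in range(n_params):
        # Matrix A with column i taken from B isolates the effect of input i.
        f_ab = [model(ra[:i] + [rb[i]] + ra[i + 1:]) for ra, rb in zip(a, b)]
        numerator = sum(
            (fb - mean) * (fab - fa) for fb, fab, fa in zip(f_b, f_ab, f_a)
        ) / n_samples
        indices.append(numerator / variance)
    return indices

sobol = first_order_sobol(toy_model, n_params=3, n_samples=20000)
for i, s in enumerate(sobol, start=1):
    print(f"S{i} = {s:.3f}")
```

For this additive toy model the analytic first-order indices are 0.8, 0.2, and 0, so the estimates can be checked against known values before the same procedure is applied to a model whose sensitivities are unknown.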
Global Sensitivity Analysis Workflow
This protocol provides a detailed methodology for creating a two-way data table, a common tool for local sensitivity analysis, and addresses the common "same value" error [51] [52].
1. Initial Set-Up and Structure
a. Identify the two input cells you want to vary (e.g., E33 for Revenue Growth, E35 for EBIT Margin) [52].
b. In an empty cell (e.g., D208), create a formula that calculates your desired output and directly links to the two input cells. This cell will be the top-left corner of your data table [52].
2. Construct the Data Table Matrix
a. In the column below the formula cell (D209, D210, etc.), list the different values you want to test for your first variable (e.g., varying growth rates: 10%, 13%, 14%, etc.) [51] [52].
b. In the row to the right of the formula cell (E208, F208, etc.), list the different values for your second variable (e.g., varying margin percentages: 0%, 1%, 2%, etc.) [51] [52].
3. Execute the Data Table Function
a. Select the entire table range (e.g., D208:I214) [52].
b. On the Data tab, click What-If Analysis, and select Data Table. Alternatively, use the keyboard shortcut Alt-D-T (or Alt-A-W-T in newer Excel versions) [52].
c. In the Row input cell box, enter the cell reference for the variable you listed in the row (e.g., E35).
d. In the Column input cell box, enter the cell reference for the variable you listed in the column (e.g., E33) [52].
e. Click OK.
4. Sanity Check and Troubleshooting
a. Verify that Formulas > Calculation Options is set to Automatic.
b. Confirm the top-left cell of your table range contains a formula, not a hard-coded value.
c. Press F9 to force a manual recalculation.
d. Check the status bar for circular reference warnings.
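The same two-way table can be reproduced outside Excel as a cross-check. The sketch below is a minimal illustration: the `ebit` formula and the `BASE_REVENUE` value are hypothetical stand-ins for whatever output cell your model uses, while the growth and margin values mirror the examples in the protocol above.

```python
def ebit(revenue, growth, margin):
    """Hypothetical output formula: next-year EBIT from growth and margin."""
    return revenue * (1.0 + growth) * margin

BASE_REVENUE = 1000.0               # assumed placeholder value
growth_rates = [0.10, 0.13, 0.14]   # column variable (one value per table row)
margins = [0.00, 0.01, 0.02]        # row variable (one value per table column)

# Each cell re-evaluates the formula with one (growth, margin) pair.
table = [[ebit(BASE_REVENUE, g, m) for m in margins] for g in growth_rates]

def all_cells_identical(grid):
    """Flag the 'same value' symptom: every cell equals the first one."""
    first = grid[0][0]
    return all(cell == first for row in grid for cell in row)

print("growth \\ margin " + "".join(f"{m:>8.0%}" for m in margins))
for g, row in zip(growth_rates, table):
    print(f"{g:>15.0%} " + "".join(f"{cell:>8.2f}" for cell in row))

if all_cells_identical(table):
    print("Warning: every cell is identical; the inputs are not reaching the formula.")
```

If the spreadsheet shows one repeated value while an independent calculation like this one varies across cells, the problem lies in the workbook settings or cell links rather than in the model itself.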
Data Table Troubleshooting Logic
Table 2: Essential Computational Tools and Packages for Sensitivity Analysis
| Tool / Software Package | Function / Application | Relevant Context |
|---|---|---|
| R Statistical Environment | A free software environment for statistical computing and graphics, used as the primary platform for many specialized phylogenetic and sensitivity analysis packages. | Core platform for running analyses [1] [5]. |
| SecSSE (R package) | Several Examined and Concealed States-Dependent Speciation and Extinction. Used to infer state-dependent diversification across multiple observed traits while accounting for the possible role of a hidden trait. | Directly addresses false positives in trait-dependent diversification research [1]. |
| TreeSim (R package) | Simulates phylogenetic trees under defined speciation and extinction rates, including models with mass extinction events and rate shifts. | Generating simulated data to test model performance and power [5]. |
| Sobol Sensitivity Indices | A variance-based global sensitivity analysis method that quantifies the contribution of each input parameter to the output variance, including interaction effects. | Quantifying parameter influence in complex, non-linear models [48]. |
| Monte Carlo Simulation Engine | A computational algorithm that relies on repeated random sampling to obtain numerical results. Can be implemented in R, Python, or specialized software. | Probabilistic sensitivity analysis and uncertainty quantification [47] [49]. |
| Microsoft Excel Data Tables | A built-in "What-If Analysis" tool for performing local, one-way or two-way sensitivity analysis on financial or mathematical models. | Quick, accessible local sensitivity testing and presentation of results [50] [52]. |
Sampling bias can lead to severe false positives in trait-dependent diversification analysis. When you use a model like MuSSE (Multiple-State Speciation and Extinction) without accounting for unobserved, or "hidden," traits, you may incorrectly conclude that an observed trait affects diversification rates. The bias occurs because the model cannot distinguish whether the differential diversification is truly caused by your trait of interest or by another, unmeasured factor that is correlated with it [7].
This was confirmed by applying a more robust method, SecSSE (Several Examined and Concealed States-Dependent Speciation and Extinction), to previous studies that used MuSSE. The conclusion was that in 5 out of 7 cases, the original findings based on MuSSE were premature and likely false positives. SecSSE avoids this pitfall by explicitly modeling the potential influence of a concealed trait [7].
A classic example comes from research on protein-protein interaction networks (PINs). These networks are often incomplete and subject to "bait selection bias" [54]. This means that researchers frequently focus their experiments on a small subset of proteins that are already well known.
The primary error is using a model that relies solely on observed traits without testing for the influence of hidden traits. The MuSSE model is particularly prone to this. The false positive arises because the model attributes all variation in diversification rates to the trait you have data for, even if that variation is actually caused by a factor you did not measure [7].
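One way to make this check concrete is to fit a model in which rates depend on the observed trait alongside a model in which they depend only on a hidden trait, and then compare the fits. The sketch below assumes the comparison is done with AIC, which is a common choice but not the only one. The labels CR, ETD, and CTD follow the constant-rate, examined-trait-dependent, and concealed-trait-dependent naming used with SecSSE, and the log-likelihoods and parameter counts are placeholder numbers, not results from any analysis.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike information criterion: lower values indicate better support."""
    return 2.0 * n_params - 2.0 * log_likelihood

def akaike_weights(aic_values):
    """Convert AIC scores into relative model weights that sum to one."""
    best = min(aic_values.values())
    rel = {name: math.exp(-0.5 * (value - best)) for name, value in aic_values.items()}
    total = sum(rel.values())
    return {name: r / total for name, r in rel.items()}

# Placeholder fits (log-likelihood, number of free parameters); NOT real results.
fits = {
    "CR (constant rates)": (-250.0, 4),
    "ETD (examined trait)": (-244.0, 6),
    "CTD (concealed trait)": (-243.5, 6),
}

scores = {name: aic(lnl, k) for name, (lnl, k) in fits.items()}
weights = akaike_weights(scores)
for name in sorted(scores, key=scores.get):
    print(f"{name:<24} AIC = {scores[name]:.1f}  weight = {weights[name]:.2f}")

best_model = min(scores, key=scores.get)
```

In this made-up case the examined-trait model clearly beats constant rates but does not beat the concealed-trait model, which is the pattern that a comparison against constant rates alone would misread as trait-dependent diversification.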
The following table summarizes the primary causes of sampling bias and strategies to avoid them, applicable across biological research fields [55] [56] [57].
| Bias Type | Cause | Mitigation Strategy |
|---|---|---|
| Observer Bias | Researcher's expectations influence observations or data interpretation [55] [58]. | Use blinding methods; ensure inter-rater reliability; use automated data collection where possible [55] [58]. |
| Self-Selection Bias | Participants/units with specific characteristics are more likely to be included [55] [56]. | Use random or stratified sampling instead of convenience or volunteer-based sampling [55] [57]. |
| Undercoverage Bias | A subgroup of the population is systematically excluded from the sampling frame [59] [56]. | Use an up-to-date and comprehensive sampling frame; employ oversampling for underrepresented groups [59] [56]. |
| Non-Response Bias | Individuals who do not respond systematically differ from those who do [56] [57]. | Follow up with non-responders; simplify study protocols to improve accessibility and completion rates [56] [57]. |
| Ascertainment Bias | The sample is collected from a source that does not represent the target population (e.g., only using clinical records) [60] [56]. | Clearly define the target population and ensure the data source matches it as much as possible [56]. |
The diagram below outlines a robust workflow to avoid false positives when testing for trait-dependent diversification, incorporating the key methodological upgrade from MuSSE to HiSSE/SecSSE.
The table below lists key solutions and resources for researchers conducting phylogenetic analyses of trait-dependent diversification.
| Tool/Reagent | Function/Description |
|---|---|
| SecSSE (R package) | Several Examined and Concealed States-Dependent Speciation and Extinction. A primary tool for testing the dependence of diversification on multiple observed traits while accounting for hidden traits to avoid false positives [7]. |
| HiSSE (R package) | Hidden-State Speciation and Extinction. The predecessor to SecSSE, designed to model the effect of a hidden trait on diversification rates. Best for binary traits [7]. |
| Phylogenetic Tree | The essential input data representing the evolutionary relationships of the species group under study. |
| Trait Dataset | The compiled data for the observed morphological, ecological, or molecular traits hypothesized to influence diversification rates. |
| MuSSE (diversitree R package) | Multiple-State Speciation and Extinction. Can be used with caution for initial exploration, but should not be used for final inference due to its high false positive rate [7]. |
Q1: What are the most common symptoms of an unreliable parameter estimation result? You may be dealing with an unreliable result if you observe several of the following issues:
Q2: My model fails to converge. Where should I start troubleshooting? Model instability often stems from a mismatch between model complexity and the information content of your data [61]. Follow this workflow:
Q3: How can I design a benchmark to avoid false positives in trait-dependent diversification studies? Traditional methods like MuSSE (Multiple-State Speciation and Extinction) are known to produce false positives because they cannot separate the effect of an observed trait from the effect of a hidden trait on diversification rates [7] [1]. To avoid this:
This guide addresses the "unstable model" diagnosis, a common problem in pharmacometrics and systems biology.
Required Expertise: Intermediate to Advanced.
Background: Model instability manifests as failed runs, inconsistent parameter estimates, or biologically unreasonable results. The root cause is often a combination of data quality and an imbalance between model complexity and data information content [61].
Protocol: A Heuristic Workflow for Stable Parameter Estimation
Problem Identification & Reproduction
Data Quality and Information Content Check
Model Complexity Reduction
Table: Model Selection Trade-Off Based on Data Information Content
| Model Choice | Data Information Requirement | Risk of Instability | Extrapolation Potential |
|---|---|---|---|
| Linear PK model | Low | Low | Low |
| Time-varying model | Medium-Low | Low-Medium | Low-Medium |
| Equilibrium binding | Medium-High | Medium | Medium-High |
| Full kinetic TMDD | High | High | High |
This guide provides a methodology for fairly comparing optimization methods, relevant for systems biology models with dozens to hundreds of parameters.
Required Expertise: Advanced.
Background: Contradictory recommendations exist on the best optimization strategy. A fair comparison requires a collaborative effort and multiple performance metrics to evaluate the trade-off between computational efficiency and robustness [64] [66].
Experimental Protocol for Benchmarking
Select a Representative Benchmark Suite:
Choose the Methods for Comparison:
Define Performance Metrics:
Table: Key Performance Metrics for Benchmarking Optimization Methods
| Metric | What It Measures | Interpretation |
|---|---|---|
| Success Rate | The proportion of runs that converge to an acceptable optimum. | Measures robustness. |
| Convergence Speed | The average number of function evaluations or wall time to reach the solution. | Measures computational efficiency. |
| Solution Quality | The best objective function value achieved. | Measures accuracy. |
| Sensitivity to Initial Guesses | The variance in solution quality from different starting points. | Measures reliability. |
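The four metrics in the table can be computed from a list of multi-start runs. The sketch below is one possible implementation under stated assumptions: the `Run` record, the convergence `tolerance`, and the example runs are all hypothetical, and convergence speed is averaged over converged runs only, which is a design choice rather than a standard.

```python
from dataclasses import dataclass
from statistics import mean, pvariance

@dataclass
class Run:
    """One optimizer run from a single starting point (hypothetical record)."""
    objective: float   # final objective function value
    evaluations: int   # function evaluations used

def summarize(runs, best_known, tolerance=1e-3):
    """Compute the four benchmark metrics for one optimization method."""
    converged = [r for r in runs if r.objective - best_known <= tolerance]
    return {
        # Robustness: share of runs ending close to the best known optimum.
        "success_rate": len(converged) / len(runs),
        # Efficiency: average cost of the runs that did converge.
        "mean_evaluations": mean(r.evaluations for r in converged) if converged else None,
        # Accuracy: best objective value found by any run.
        "best_objective": min(r.objective for r in runs),
        # Reliability: spread of final objectives across starting points.
        "objective_variance": pvariance([r.objective for r in runs]),
    }

# Placeholder runs for illustration only; not measured results.
runs = [Run(1.0000, 400), Run(1.0004, 600), Run(3.2000, 250), Run(1.0000, 500)]
metrics = summarize(runs, best_known=1.0)
for name, value in metrics.items():
    print(f"{name}: {value}")
```

Reporting all four values together matters because a method can look fast only by stopping early in a poor local optimum, as the third placeholder run does here.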
This diagram outlines a robust, multi-stage workflow for parameter estimation, emphasizing steps that improve reliability.
This diagram illustrates the conceptual structure of the SecSSE model, which helps detect false positives in trait-dependent diversification studies.
This table details key computational and methodological "reagents" essential for robust parameter estimation and benchmarking.
Table: Essential Research Reagents for Reliable Parameter Estimation
| Item Name | Function / Purpose | Field of Application |
|---|---|---|
| SecSSE R Package | Infers state-dependent diversification across multiple observed traits while accounting for hidden traits to control false positives. | Evolutionary Biology, Phylogenetics [7] [1] |
| Clustered Bootstrapping | A statistical method to estimate accuracy and confidence intervals, accounting for dependencies in data (e.g., multiple perturbations of the same benchmark question). | AI Evaluation, Psychometrics [62] |
| Hybrid Metaheuristics | Optimization methods that combine a global search strategy (e.g., scatter search) with a local, gradient-based method for efficient and robust parameter estimation. | Systems Biology, Kinetic Modeling [64] |
| Adjoint Sensitivities | An efficient method for calculating gradients (derivatives) of an objective function with respect to parameters, crucial for gradient-based optimization of large models. | Systems Biology, PBPK/QSP Modeling [64] [66] |
| Item Response Theory (IRT) | A latent trait framework that models the probability of a correct response as a function of underlying ability and item difficulty, enabling more robust ability estimation. | AI Evaluation, Psychometrics [62] |
Accurately detecting trait-dependent diversification requires meticulous attention to phylogenetic data quality and analytical parameters. The key to minimizing false positives lies in understanding that low phylogenetic tree completeness, particularly below 60%, and taxonomically biased sampling severely compromise SSE model accuracy. Crucially, mis-specifying the sampling fraction, especially over-estimating it, directly inflates false positive rates. Researchers should adopt a conservative approach, ideally using Bayesian methods to account for sampling uncertainty, and always include concealed trait-dependent (CTD) models in their comparisons. Future directions should focus on integrating genomic data to build more complete phylogenies and developing more robust models that explicitly account for common empirical data imperfections, thereby strengthening the biological inferences drawn from these powerful comparative methods.