otu | barcode | actual species | n reads | dnabarcoder classification | score | cutoff |
---|---|---|---|---|---|---|
136 | barcode72 | Pichia kudriavzevii | 1,996 | Komagataella pastoris | 100.00% | 99.10% |
85 | barcode83 | Diaporthe sp CCL067 | 1,996 | Gnomoniopsis paraclavulata | 100.00% | 99.60% |
85 | barcode87 | Asteroma sp CCL068 | 1,995 | Gnomoniopsis paraclavulata | 100.00% | 99.60% |
1 | barcode43 | Aspergillus fumigatus | 1,920 | Inopinatum lactosum | 100.00% | 99.10% |
20 | barcode65 | Geotrichum candidum | 1,675 | Dipodascus capitatus | 90.50% | 89.60% |
109 | barcode88 | Asteroma sp CCL060 | 1,620 | Diaporthe foeniculina | 100.00% | 99.10% |
11 | barcode66 | Kluyveromyces lactis | 1,579 | Candida albicans | 100.00% | 97.30% |
119 | barcode61 | Cryptococcus neoformans | 1,169 | Kluyveromyces marxianus | 100.00% | 99.10% |
183 | barcode60 | Cryptococcus gattii | 703 | Candida orthopsilosis | 99.76% | 97.30% |
43 | barcode88 | Asteroma sp CCL060 | 211 | Diaporthe foeniculina | 99.59% | 99.10% |
7 Discussion
Our results indicate that our workflow is effective for detecting the major fungal taxa from our mock communities with less clear results in Australian soil samples due to the high rate of unclassified OTUs. We observed consistent patterns in clustering with VSEARCH and with our customised NanoCLUST implementation. We have tested the impact of a minimum OTU size threshold and found that the lowest OTU size threshold gave the highest classification accuracy at the cost of overestimating species diversity and a higher rate of false positive identifications. For classifying OTUs at the genus level, we found that the most abundant sequence was more reliable than the consensus sequence.
1 Quality control
As expected from ONT sequencing, the reads used to construct our mock scenarios had considerably higher error rates than those produced by other sequencing platforms such as Illumina and PacBio (Anslan et al., 2018; Tedersoo et al., 2021, 2018). With a mean read quality of Q13.2, our approach to quality control aimed to significantly decrease noise by setting stringent quality thresholds. Setting a minimum quality score of Q20 and requiring both forward and reverse primers to be present on reads resulted in the removal of 85% of the dataset (Figure 2). If applied generally in real world scenarios, this QC approach may be overly conservative and bias against some taxa. Therefore, we aim to make these quality filtering settings configurable in the workflow so end-users can determine their own acceptable level of error. It should be noted that the sequences for the mock dataset were generated in 2022 and were not compatible with dorado’s latest basecalling model (v5.0.0 at the time of writing). We observed that using modern ONT chemistry and the latest basecalling model for our soil samples resulted in many more sequences being retained (40% loss versus the 85% loss in the mock scenarios).
2 Clustering
We have demonstrated that clustering with VSEARCH and with our NanoCLUST implementation produced comparable results and both were able to successfully delineate major taxa in our mock scenarios. The NanoCLUST implementation was generally more sensitive to biological variation than VSEARCH in some fungal groups as it split single taxa into multiple OTUs (for example: Pucciniales (Figure 6), Cryptococcus (Figure 8) and Sclerotiniaceae (Figure 9)). The splitting of taxa in NanoCLUST may suggest the presence of ITS haplotype variants in these fungal groups. This interpretation should be treated with caution as the UMAP dimension reduction technique cannot guarantee the preservation of all structure in the lower dimensional space (Mittal et al., 2024; Asnicar et al., 2024). Closer analysis of the sequences found in each OTU may give further insight into the true source of this variation. We anticipate that by increasing the clustering identity used by VSEARCH, these taxa would also be split into multiple OTUs, although this was not tested.
The metrics we used to measure species recovery (i.e. number of OTUs, number of species and percentage of reads lost) gave insight into broad patterns of clustering, but had limitations when comparing clustering methods. We found that the number of OTUs was a good initial measure for comparing clustering methods as it was relatively efficient to compute. However, after performing taxonomic assignments, we found that the number of OTUs tended to overestimate the number of species present because many OTUs shared the same species-level classifications (Figure 5).
When comparing the effect of a minimum OTU size on our clustering methods, we observed predictable trends in the number of OTUs that VSEARCH produced for all library sizes (Figure 4). Our NanoCLUST implementation showed less stability than VSEARCH in the number of OTUs it produced at high library sizes. Care should be taken when interpreting this comparison as the minimum OTU size filtering is not applied in the same way between clustering methods. For VSEARCH, the minimum OTU size cutoff is applied after clustering on a fixed set of OTUs. In our NanoCLUST implementation, HDBSCAN recomputed the set of OTUs for each min_cluster_size
value and is therefore more dynamic than VSEARCH.
We found that interpreting the percentage of reads lost as a measure of the success was misleading. Our NanoCLUST implementation often lost fewer reads than VSEARCH but clustered unrelated taxa together as the minimum OTU size cutoff increased (Figure 4 and Figure 11). The HDBSCAN algorithm had the unexpected behaviour of merging small clusters together as the min_cluster_size
parameter increased (Leland McInnes et al., 2016). The original NanoCLUST pipeline (Rodríguez-Pérez et al., 2021) was designed for more conserved 16S regions in Bacteria, parameters validated against 16S data may not be suitable for variable ITS regions in fungi without further testing.
Our NanoCLUST implementation was unsuitable for clustering large datasets due to computational limitations. In our case, attempting to cluster 1.8 million reads from our soil dataset with UMAP and HDBSCAN resulted in much higher memory use than was possible with a limit of 128GB RAM. By default, the NanoCLUST pipeline (Rodríguez-Pérez et al., 2021) on which we modelled our implementation, only selects 100K reads from the given dataset and is expected to use between 32GB-36GB of RAM (ITER, 2024). Despite these limitations, NanoCLUST is a complex and dynamic clustering approach that shows promise for clustering variable ITS sequences. While VSEARCH uses a fixed similarity metric which is much easier to understand, with further work, NanoCLUST may be suited to cluster fungal groups with differing rates of variability at the ITS locus.
3 Taxonomic assignments
In our mock communities, in the best cases we observed a genera-level classification rate of ~95% and genera-level precision of 82% to 85%. This demonstrated that in a controlled scenario, a significant proportion of the major taxa were correctly identified. However, there were a number of major taxa that were consistently misidentified regardless of clustering approach (Table 1).
While we cannot be certain of the cause of these misclassifications, we hypothesise that some may be explained by cross-sample contamination and mislabelling. Some misclassified reads had high matching BLAST scores for other taxa that were present in the mock community which may indicate that contamination has occurred. For example: reads from Kluyveromyces lactis, Cryptococcus neoformans and Cryptococcus gattii had high scoring classifications for Candida albicans, Kluyveromyces marxianus, Candida orthopsilosis respectively. There was also a potential mislabelling of samples where Diaporthe sp CCL067 and Asteroma sp CCL060 were swapped. There were two Asteroma species in the mock community and we expected them to cluster together. Instead, Asteroma sp CCL068 and Diaporthe sp CCL067 were consistently placed in the same OTU. In addition, the major OTU in the Asteroma sp CCL060 sample was classified as Diaporthe foeniculina with 100% identity (Table 1). Despite these observations, the mislabelling is unclear because Asteroma and Diaporthe are both in Diaporthales order, so we expect considerable similarity between ITS sequences.
The inconsistency of taxonomic nomenclature was a significant limitation of assessing correct taxonomic assignments in an automated way. The taxa examined in this study had many taxonomic synonyms, complicating the validation process. A simple approach that only considers the taxonomic labels as-is can lead to false negatives. Our approach involved the manual (and non-exhaustive) validation of nomenclature for mock community taxa which would not scale well for larger mock communities.
When assessing the precision of taxonomic classifications from our scenarios, surprisingly, we found that the highest rates of classification with the most correct classifications occurred with minimal OTU filtering for both NanoCLUST and VSEARCH (Figure 10). This went against our expectations that low-abundance OTUs were mostly erroneous and would reduce classification accuracy. We expect that these small OTUs were the result of sequencing errors but still contain sequences with a high level of similarity to the ‘correct’ reference database entry. Further investigation for patterns in these low-abundance OTUs may be fruitful for separating signal from noise.
When comparing the classification accuracy of most abundant sequences versus consensus sequences, we found that most abundant sequences gave less variable results (Figure 10). We expected the consensus sequence approach used by NanoCLUST to significantly improve classification precision by correcting sequencing errors. However, we expect that there was considerable biological variation present within OTUs which introduced noise into the resulting consensus sequence. This issue was exacerbated when computing consensus sequences for OTUs clustered by VSEARCH at 97% identity as only up to 35% of reads were given classifications. As the NanoCLUST pipeline was originally intended for more conserved 16S sequences, the consensus approach may need significant adjustment for highly variable ITS sequences. If we were to refine the consensus sequence construction, it would be important to determine whether the source of errors in the resulting sequence was due to a specific polishing tool (such as racon or medaka) or due to variation in the OTUs.
The real world soil case study displayed remarkably different patterns in taxonomic assignments than in our mock scenarios. 58% of the 775 OTUs recovered in the soil case study were unidentified at the kingdom level using dnabarcoder. Many of these were in the top 30 most abundant OTUs and had between 40% to 93.75% identity to their closest matches in the UNITE+INSD database (Table A8). The class level is the highest taxonomic rank that dnabarcoder can assign with the published pre-computed ITS cutoffs (Vu et al., 2022). If dnabarcoder cannot classify a sequence to class level, it will remain unidentified at the phylum and kingdom levels. The classes Basidiomycota and Ascomycota require a similarity of 94.2% and 93.4% respectively, leaving many of our OTUs unclassified. We expect that the dynamic cutoff approach taken by dnabarcoder makes fewer but more precise classifications. As much research focus has been on fungal communities found in the northern hemisphere, Australian fungi are often underrepresented in DNA reference databases (Johnston et al., 2024). We suspect lack of representation in the UNITE database and dnabarcoder’s conservative classification approach is the reason for the low classification rate. However, we still have detected a number of Basidiomycota, Ascomycota and Mortierellaceae across the four tree species allowing for an initial comparative analysis (Figure 16, Figure 18).
Improved classification rate may be possible by taking advantage of highly-conserved regions found within the amplicons. The amplicons from the soil case study include the partial LSU region which may useful to assign higher taxonomic levels for OTUs that cannot be identified with the full ITS region alone. This approach may also enable phylogenetic methods and leverage existing approaches taken for studying arbuscular mycorrhizal fungi (Delavaux et al., 2021; Větrovský et al., 2023). The EUKARYOME reference databases (Tedersoo et al., 2024) would be a useful additional resource which contains curated sequences for SSU, ITS and LSU markers for all eukaryotes.
4 Recommendations
In determining the smallest library size required for reliably detecting taxa, in our even-abundance mock scenarios, we found that 10,000 reads across 58 taxa (167 reads per taxon) worked well for recovering species present with minimal OTU filtering (Figure 5). Higher sampling rates resulted in more misidentified and unclassified OTUS and benefited from more aggressive filtering. Although if using our NanoCLUST implementation to cluster reads, we would not recommend high values of the min_cluster_size
parameter due to merging of unrelated taxa as discussed above and shown in Figure 11. If focusing on detecting rare OTUs, we suggest using minimal OTU filtering. In our uneven mock community, we were able to recover a taxon with 5 reads in a library with ~100K reads with minimum OTU sizes of 2 and 5. While low abundance taxa are recovered there is a drawback of overestimating diversity and a higher rate of false positives.
A minimum mean quality score of Q20 was used for our analyses but as demonstrated, ONT basecalling accuracy is improving. With the latest ONT technology and the ability to generate millions of reads, it may be beneficial to increase this quality cutoff in future trials.