We wanted to understand the factors influencing this performance variation and to identify which conditions are particularly suited for bioactivity prediction. Overall, the Cell Painting fluorescence-based approach performed best both in terms of correctly predicting bioactivity (measured by ROC-AUC) and in terms of increased chemical diversity (measured by Tanimoto similarity). A Nemenyi post-hoc test showed that the chemical diversity of structure-based predictions was significantly lower than that of image-based predictions (Fig. 2b). Both fluorescence-based approaches (Whole-Image Fluorescence and Cell-Features) outperformed the brightfield model, which reached 0.704 ± 0.107 ROC-AUC.
- High-throughput screening (HTS) provides starting chemical matter in the adventure of developing a new drug.
- Notably, kinase targets and cell-based assays exhibited strong performance, and additional trends can be recognized albeit at small sample size (Fig. 3).
- Our compound libraries are available as single compounds but also in pools, optimised for affinity selection mass spectrometry screening (ASMS).
- Our results extend this by demonstrating the capacity to learn from small sets of readily available, but relatively noisy, unrefined single-point activity readouts.
Data availability
The authors would like to thank Guy Williams, Diane Smith, Hannah Semple, and Elizabeth Mouchet for their invaluable help in producing the Cell Painting dataset used in this publication. The publicly available Cell Painting data used in this study are available from the JUMP consortium dataset13, CPG0016, available from the Cell Painting Gallery on the Registry of Open Data on AWS. The raw HTS datasets generated and analysed in this study are protected and are not available because they are AstraZeneca proprietary information.
- It was then trained to predict bioactivity readouts for each of the 140 assays.
- The top ranked 5% of compounds were randomly sampled for each of the four follow-up assays, with varying numbers of compounds selected for each of the assays.
Each of the 20 top-ranked compounds was compared to all the known actives, and the known active with the highest Tanimoto similarity was assigned as its most similar compound. Tanimoto similarity42 between the ECFP4 fingerprints of compounds was used to determine how structurally similar each compound pair was. The sample size used was not determined based on any statistical method; all available data were included.
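The nearest-known-active comparison described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes each ECFP4 fingerprint has already been computed and is represented as the set of its "on" bit indices, and the compound names and bit sets are hypothetical toy values.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto similarity between two fingerprints given as sets of 'on' bits."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def closest_known_active(query: frozenset, known_actives: dict) -> tuple:
    """Return (name, similarity) of the known active most similar to the query."""
    name = max(known_actives, key=lambda k: tanimoto(query, known_actives[k]))
    return name, tanimoto(query, known_actives[name])


# Hypothetical toy fingerprints (bit indices), not real ECFP4 output.
known_actives = {
    "active_A": frozenset({1, 4, 9, 16}),
    "active_B": frozenset({2, 4, 8, 16, 32}),
}
top_ranked = frozenset({4, 8, 16, 64})

name, sim = closest_known_active(top_ranked, known_actives)
print(name, round(sim, 3))
```

A lower maximum similarity means the top-ranked compound is structurally further from anything already known to be active, which is how the text reads lower Tanimoto similarity as greater chemical diversity.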
We found the ROC-AUC values in the follow-up assays to be consistent with the values from the primary assays. Among molecular target subtypes, kinase targets appeared to benefit the most from our Cell Painting-based approach, performing significantly better than other molecular target subtypes (Fig. 3d). The general trend we found was that the predictive performance was consistently good for different assay types. Therefore, we conducted a detailed analysis, breaking down the results to examine how various assay characteristics contribute to performance (Fig. 3).
Other Literature Sources
Evotec’s library is refreshed regularly to include novel scaffolds, eliminate inactive or problematic compounds and check for purity and solubility issues. Success in hit identification is dependent on the quality of the compound library. Stringent review processes ensure the selection and prioritization of the most promising compounds with the highest likelihood of success.
(A) Box plot of each modality's average ROC-AUC, computed for each assay. To this end, we extracted hand-crafted image features, hereafter referred to as Cell-Features, using the Columbus image-analysis software. Brightfield imaging has some advantages over imaging Cell Painting-stained cells: it can be performed on live cells, does not require staining, and can be performed on simpler microscopes.
Hit Identification
However, it may still be an attractive alternative because the slight drop in predictive performance can be justified by other benefits compared to the Cell Painting assay. The Cell-Features approach, using extracted image features, resulted in a slightly lower performance than the Whole-Image Fluorescence approach. (B) Box plot of the average Tanimoto similarity of the top 20 ranked compounds to the closest known active in the respective training set, for each modality. Each dot represents the average Tanimoto similarity score per assay over all cross-validation splits.
The serine kinase assay, which had the highest predictive performance (ROC-AUC 0.91), showed a 14-fold enrichment in the follow-up, suggesting that this assay could focus on a small, highly targeted set of compounds. (A) The predictive performance of the image-based Fluorescence model compared with the structure-based model, grouped by Test Material Type. For each bioactivity prediction approach, we compared the structural diversity of the 20 top-ranked compounds to the known actives in the training set. This subset encompassed 29 assays comprising 10,660 unique compounds (see Materials and Methods 1.3, JUMP consortium and ChEMBL datasets, for details).
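For readers unfamiliar with fold-enrichment figures such as the one quoted for the serine kinase assay, a minimal sketch follows. It assumes the common definition (hit rate among model-selected compounds divided by the baseline hit rate); the text does not spell out the exact formula used, and the counts below are hypothetical, chosen only so that the arithmetic gives a 14-fold result.

```python
def enrichment_factor(hits_selected: int, n_selected: int,
                      hits_baseline: int, n_baseline: int) -> float:
    """Hit rate in the selected set divided by the baseline hit rate."""
    if n_selected <= 0 or n_baseline <= 0:
        raise ValueError("set sizes must be positive")
    baseline_rate = hits_baseline / n_baseline
    if baseline_rate == 0:
        raise ValueError("baseline hit rate is zero; enrichment is undefined")
    return (hits_selected / n_selected) / baseline_rate


# Hypothetical counts: 7 actives among 50 selected compounds (14% hit rate)
# versus 10 actives among 1000 baseline compounds (1% hit rate).
ef = enrichment_factor(hits_selected=7, n_selected=50,
                       hits_baseline=10, n_baseline=1000)
print(f"{ef:.1f}x enrichment")
```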
Evaluation of DNA encoded library and machine learning model combinations for hit discovery
Finally, we confirm the validity of our predictions through a series of in vitro follow-up experiments which demonstrate that the bioactivity predictions of our models are reliable and consistent. Recently, alternatives to structure representations have been explored2,3,4,5 for prediction of bioactivity6 or toxicology7. Whereas false positives can be identified and removed by further probing with follow-up assays, false negatives can be problematic as they can filter out potentially interesting compounds.
Our analysis revealed that compounds predicted from images showed lower structural similarity, i.e., greater chemical diversity, than those predicted by structure-based approaches. Although the brightfield image-based approach was outperformed by the fluorescence-based approach, it was still able to predict 49% of the assays with a ROC-AUC above 0.7 and even 5% above 0.9. Initially, we assessed our framework’s performance on a dataset established by Hofmarcher and colleagues8, demonstrating end-to-end learning with convolutional neural networks (CNNs) for biological assay prediction from Cell Painting images. This dataset included 209 assays comprising 10,574 compounds8,12, where binary activity data were derived from dose-response curves (IC50/EC50) of each compound in a given assay. Our results demonstrate the capability of models trained on phenotypic data combined with a few hundred single-concentration data points to predict compound activity reliably and efficiently across diverse targets in a realistic drug screening scenario. As only a few hundred activity data points are needed to train the predictive model for a particular target and assay, assays of higher complexity and biological relevance could potentially be used.
Expansive and High-Quality Compound Collection for HTS
Because the initial screening assays are often very simple representations of the target biology, they run the risk of producing false positive and negative results. Because of this, hit finding is generally done with simple assays such as biochemical assays to enrich the compound set before more resource-intense assays can be used further down the cascade. Accurate bioactivity prediction using morphological profiles could streamline the process, enabling smaller, more focused compound screens. Another important aspect of cell-based HT assays is the response of the organism of interest through the primary screen. On the other hand, cell-based assays discussed include viability, reporter gene, second messenger, and high-throughput microscopy assays.
Biochemical assays discussed include fluorescence polarization and anisotropy, FRET, TR-FRET, and fluorescence lifetime analysis. Tapping into large internal compound libraries, a vast wealth of knowledge and a wide range of technology platforms can reduce time and cost and increase the likelihood of success without a major upfront investment in infrastructure and personnel. HTS could also serve as an engine to generate high quality big data sets to build AI/ML models. Data generated can be used to guide future compound selection, hit expansion and early Structure-Activity Relationships (SARs). Evotec offers an extensive range of different technologies and platforms for hit confirmation and validation.
Fluorescence-Based Assays
Applying a post-hoc Nemenyi test, we find that the performance differences are significant between all modalities except for brightfield and structure. This approach reached an average ROC-AUC of 0.744 ± 0.108 compared to the Cell-Features-based model at 0.726 ± 0.115. These image-based modalities were then compared against a standard structure-based approach using Extended Connectivity Fingerprints17 (labeled Structure). As described above, we observed encouraging results using a multiplexed fluorescence Cell Painting screen to capture phenotypic profiles of a library of compounds. The average performance for these 29 assays was 0.660 ± 0.094 ROC-AUC. Notably, our results align closely with the performance reported for the supervised ResNet model by Hofmarcher et al. (0.731 ± 0.19 ROC-AUC)8 and the linear probing contrastive learning model (CLOOME) recently reported on the same dataset (0.714 ± 0.20 ROC-AUC)12.
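Summary figures of the form "mean ± spread ROC-AUC" quoted above aggregate one ROC-AUC per assay. The sketch below shows one way such a summary could be computed. It is an illustration under assumptions: the labels and scores are hypothetical toy data, ROC-AUC is computed by pairwise comparison (ties count half), and the sample standard deviation is used, although the text does not say which spread statistic was reported.

```python
from statistics import mean, stdev


def roc_auc(labels, scores):
    """Probability that a random active outscores a random inactive (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one active and one inactive compound")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical per-assay data: (binary activity labels, model scores).
assays = {
    "assay_1": ([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]),
    "assay_2": ([1, 1, 0, 0], [0.6, 0.55, 0.5, 0.2]),
}

aucs = {name: roc_auc(y, s) for name, (y, s) in assays.items()}
summary_mean = mean(aucs.values())
summary_sd = stdev(aucs.values())  # sample standard deviation (assumption)
print(f"{summary_mean:.3f} ± {summary_sd:.3f} ROC-AUC over {len(aucs)} assays")
```

The pairwise form is quadratic in the number of compounds per assay, which is acceptable for a few hundred data points; a rank-based implementation would be preferable for larger sets.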
HTS Platforms and Related Technologies
These types of approaches promise to efficiently enrich likely hit compounds into focused compound sets. One strategy to accelerate hit finding is to use computational methods to prioritize and select compounds deemed more likely to be active. Thus, there is interest in using assays that are as biologically relevant as possible early in the screening cascade. This approach has the potential to reduce the size of screening campaigns, saving time and resources, and enabling primary screening with more complex assays. Identifying active compounds for a target is a time- and resource-intensive task in early drug discovery. What is a cell-based HT screening approach?
Compounds that were structurally similar, based on ECFP4 clustering, were assigned to the same fold to measure the ability of the model to identify actives in unknown regions of the compound space. We selected a structurally diverse set of 8,300 compounds to be representative of a larger HTS library. Phenotypic profiles are derived from cells, tissues, or even whole organisms, and contain information on the characteristics or behaviors of these complex biological systems in response to perturbations with small molecule compounds or other drug modalities.
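The cluster-aware fold assignment described above can be sketched as follows. This is a minimal illustration, not the authors' procedure: it assumes cluster labels have already been obtained from some fingerprint-based clustering (the clustering algorithm is not specified in the text), and it uses a simple greedy rule, largest cluster first into the currently smallest fold, to keep fold sizes roughly balanced. Compound and cluster names are hypothetical.

```python
from collections import defaultdict


def assign_cluster_folds(cluster_of: dict, n_folds: int) -> dict:
    """Map each compound to a fold so that no cluster is split across folds."""
    clusters = defaultdict(list)
    for compound, cluster_id in cluster_of.items():
        clusters[cluster_id].append(compound)

    fold_sizes = [0] * n_folds
    fold_of = {}
    # Place the largest clusters first; break ties by cluster id for determinism.
    ordered = sorted(clusters.items(), key=lambda kv: (-len(kv[1]), str(kv[0])))
    for _, members in ordered:
        target = fold_sizes.index(min(fold_sizes))  # currently smallest fold
        for compound in members:
            fold_of[compound] = target
        fold_sizes[target] += len(members)
    return fold_of


# Hypothetical compounds with precomputed cluster labels.
cluster_of = {"c1": "A", "c2": "A", "c3": "A", "c4": "B", "c5": "B", "c6": "C"}
fold_of = assign_cluster_folds(cluster_of, n_folds=2)
print(fold_of)
```

Because whole clusters move together, every held-out fold contains only chemotypes the model has not seen during training, which is what makes the evaluation a test of generalization to unknown regions of compound space.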