What is the Cross-Beta DB?
General information:
The Cross-Beta DB is a database dedicated to compiling naturally occurring cross-beta-forming amyloids. All data included are experimentally validated for cross-beta structure formation. The database primarily serves as a resource for training and benchmarking new amyloid prediction models (see Cross-Beta predictor). The database also includes experimental conditions and additional information. All entries can be downloaded individually or in groups. The benchmark set and other database versions are available for download in the "Download" section.
A full description of all the variables in the database, which are included when you download one or more entries, is available here.
For detailed instructions on using the database interface and its features, refer to the "Tutorial" section.
Abstract:
The importance of protein amyloidogenesis, associated with various diseases and functional roles, has driven the creation of computational predictors of amyloidogenicity. The accuracy of these predictors, particularly those utilizing artificial intelligence technologies, heavily depends on the quality of the data. We built Cross-Beta DB, a database containing high-quality data on known cross-β amyloids formed under natural conditions. We used it to train and benchmark several machine-learning (ML) algorithms to predict the amyloid-forming potential of proteins. We developed the Cross-Beta predictor using an Extra Trees ML algorithm; it outperforms existing amyloid predictors, achieving the highest F1 score (0.852) and accuracy (0.844). The development of Cross-Beta DB and the new ML-based Cross-Beta predictor may enable the creation of personalized risk profiles for neurodegenerative diseases and other amyloidoses, especially as genome sequencing becomes more affordable.
Citing us:
Gonay V, Dunne MP, Caceres-Delpiano J, Kajava AV. Developing machine-learning-based amyloidogenicity predictors with Cross-Beta DB. Alzheimer's Dement. 2025; 1-7. https://doi.org/10.1002/alz.14510
Partners:
Cross-Beta DB statistics:
Graphical representation of the entry source diversity. The large majority of entries come from the PDB, as it is the most complete source of protein structures.
Graphical representation of the DB protein diversity ("Other proteins" groups proteins represented by fewer than 3 entries). Most of our entries share a similar, but not identical, sequence. For a protein such as α-synuclein, for example, different overlapping regions or mutated variants have been shown to form amyloids.
Graphical representation of the DB species diversity ("Other species" groups species represented by fewer than 2 entries). As humans are the most studied species, the majority of our entries come from human proteins.
Graphical representation of the DB protein pathogenicity diversity. Some of our entries have been observed to show either functional or pathogenic behaviour depending on the environment. This minority of entries is shown as "Pathogenic/Functional" in this graphic.
Graphical representation of the DB protein aggregate observation method diversity ("Other methods" groups methods represented by fewer than 4 entries).
Graphical representation of the DB sequence length distribution.
Graphical representation of the DB mean amino acid composition.
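For readers who want to recompute the last two statistics from downloaded entries, the following is a minimal Python sketch. It is not the code used to produce the graphics on this page. The function names, the 10-residue bin size, and the choice to average per-sequence fractions (rather than pooling all residues) are assumptions made for illustration, and the input sequences are toy strings, not database entries.

```python
from collections import Counter

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(sequence):
    """Return the fraction of each standard amino acid in one sequence.

    Non-standard characters are ignored when computing the fractions.
    """
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def mean_composition(sequences):
    """Average the per-sequence fractions over all sequences (assumed definition)."""
    comps = [composition(seq) for seq in sequences]
    return {aa: sum(c[aa] for c in comps) / len(comps) for aa in AMINO_ACIDS}


def length_histogram(sequences, bin_size=10):
    """Count sequences per length bin; keys are the lower bound of each bin."""
    hist = Counter((len(seq) // bin_size) * bin_size for seq in sequences)
    return dict(sorted(hist.items()))


if __name__ == "__main__":
    # Toy sequences for illustration only.
    toy = ["AAAA", "AAGG", "A" * 12]
    mean = mean_composition(toy)
    print({aa: round(frac, 3) for aa, frac in mean.items() if frac > 0})
    print(length_histogram(toy))
```

Averaging per-sequence fractions gives every entry equal weight regardless of its length; pooling all residues before computing fractions would instead weight long sequences more heavily, so the two definitions can give different values.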