What is the Cross-Beta DB?
General information:
The Cross-Beta DB is a database that gathers naturally occurring, cross-beta-forming amyloids in one place. Every entry has experimental evidence of cross-beta structure formation that meets the criteria described later. The main purpose of the database is to provide data for training and benchmarking new amyloid prediction models (see Cross-Beta predictor), but it also includes experimental conditions and other information for general use. All entries can be downloaded individually or as a group. The benchmark set and the other database versions used can be downloaded in the "Download" section.
The full description of all the variables present in the database, which are accessible by downloading one or several entries, is available here.
For more information about how to use the database interface and its features, check the "Tutorial" section.
Abstract:
Due to a shift in environmental conditions or other factors, certain soluble proteins undergo aggregation, resulting in the formation of clumps of amyloid fibrils. Understanding this phenomenon is of paramount importance because of its association with various diseases, including Alzheimer's disease, as well as the increasingly abundant data on its functional roles. Numerous studies have demonstrated that the propensity to form amyloids is encoded in the amino acid sequence, and this finding paved the way for the development of several computational predictors of amyloidogenicity. The ultimate objective of computational methods is to accurately predict the formation of disease-related and functionally relevant amyloids that occur in vivo. These amyloid fibrils are known to form a very specific “cross-beta structure” from protein regions longer than about 15 residues. Remarkably, despite the significance of naturally occurring amyloids, there has been a lack of datasets specifically dedicated to them. Hence, we built Cross-Beta DB, a database composed of cross-beta amyloids formed under natural conditions. This database is expected to be indispensable for benchmarking amyloid predictors. Moreover, as machine learning is demonstrating its high potential in various fields, we used Cross-Beta DB to train several such algorithms. Their benchmark revealed that Cross-Beta RF Predictor, developed on the basis of the random forest algorithm, demonstrates the best performance. The benchmark results also demonstrate superior performance of Cross-Beta RF Predictor over the other existing methods, fostering high expectations for an improved prediction of naturally occurring amyloids.
Citing us:
Valentin Gonay, Michael P. Dunne, Javier Caceres-Delpiano, & Andrey V. Kajava. (2024). Developing machine-learning-based amyloid predictors with Cross-Beta DB. bioRxiv, 2024.02.12.579644. https://doi.org/10.1101/2024.02.12.579644
Partners:
Cross-Beta DB statistics:
Graphical representation of the entry source diversity. The large majority of entries come from the PDB, as it is the most complete source of protein structures.
Graphical representation of the DB protein diversity ("Other proteins" groups proteins represented by fewer than 3 entries). Most of our entries share a similar, but not identical, sequence. For example, a protein such as alpha-synuclein is represented by several entries that cover different overlapping regions or carry mutations and that form amyloids.
Graphical representation of the DB species diversity ("Other species" groups species represented by fewer than 2 entries). As humans are the most studied species, the majority of our entries come from human proteins.
Graphical representation of the DB protein pathogenicity diversity. Some of our entries have been observed to show either functional or pathogenic behaviour depending on the environment. This minority of entries is shown as "Pathogenic/Functional" in this graphic.
Graphical representation of the diversity of methods used to observe protein aggregates in the DB ("Other methods" groups methods represented by fewer than 4 entries).
Graphical representation of the DB sequence length distribution.
Graphical representation of the DB mean amino acid composition.
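As an illustration of what a mean amino acid composition can mean, here is a minimal Python sketch. It assumes that the composition is computed per sequence and then averaged over sequences; the page does not state whether the database uses this definition or a pooled count over all residues. The function names and the toy sequences are hypothetical and are not part of Cross-Beta DB.

```python
from collections import Counter

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(sequence):
    """Return the fraction of each standard amino acid in one sequence."""
    counts = Counter(sequence.upper())
    # Only standard residues count towards the total.
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def mean_composition(sequences):
    """Average the per-sequence compositions (assumed definition)."""
    comps = [composition(seq) for seq in sequences]
    if not comps:
        raise ValueError("no sequences given")
    return {aa: sum(c[aa] for c in comps) / len(comps) for aa in AMINO_ACIDS}


# Toy, made-up sequences used only to show the call.
toy_sequences = ["AAGG", "AV"]
toy_mean = mean_composition(toy_sequences)
```

With this definition every sequence carries the same weight regardless of its length; pooling all residues before counting would instead weight long sequences more heavily.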