BiSMM | Bioinformatique Structurale et Modélisation Moléculaire

Meta-Repeat-Finder - Documentation

1. Input

1.1 Format
1.2 Options

2. Output

2.1 Raw results
2.2 Trimmed results
2.3 Results

1. Input

1.1 Format

Meta-Repeat-Finder accepts a sequence in FASTA format, or a bare sequence without a header line, as input.
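As an illustration of the two accepted input forms, the following minimal sketch normalizes either a FASTA record or a bare sequence into a header and a sequence string. The function name `read_sequence` and the single-record restriction are assumptions made for this example; they are not part of Meta-Repeat-Finder.

```python
from typing import Tuple


def read_sequence(text: str) -> Tuple[str, str]:
    """Return (header, sequence) from FASTA text or from a bare sequence.

    Hypothetical helper: the header is empty when the input has no '>' line.
    """
    # Drop empty lines and surrounding whitespace.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    header = ""
    if lines and lines[0].startswith(">"):
        header = lines[0][1:].strip()
        lines = lines[1:]
    # Assumption: exactly one sequence per submission.
    if any(ln.startswith(">") for ln in lines):
        raise ValueError("expected a single sequence")
    return header, "".join(lines)
```

Both `read_sequence(">my_protein\nACDE\nFGH")` and `read_sequence("ACDEFGH")` yield the same sequence string; only the header differs.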

1.2 Options

Tandem repeats (TRs) are detected by the selected finders using their default parameters. If necessary, the default parameters of T-REKS and TRUST can be changed. The post-processing option determines how detected TRs are validated: either by the Tally-2.0 score alone (>= 0.5), or by both the Tally-2.0 score (>= 0.5) and p-value-phylo (< 0.2).
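The two validation modes can be sketched as a single predicate. This is a minimal illustration of the thresholds stated above, not the tool's code; the function name `is_valid_tr` and the `strict` flag are hypothetical.

```python
from typing import Optional

TALLY_THRESHOLD = 0.5          # a TR is valid when its Tally score is >= 0.5
PVALUE_PHYLO_THRESHOLD = 0.2   # stricter mode also requires p-value-phylo < 0.2


def is_valid_tr(tally_score: float,
                p_value_phylo: Optional[float] = None,
                strict: bool = False) -> bool:
    """Apply the Tally-only check, or Tally plus p-value-phylo when strict."""
    if tally_score < TALLY_THRESHOLD:
        return False
    if strict:
        # Assumption: a missing p-value-phylo fails the stricter check.
        return (p_value_phylo is not None
                and p_value_phylo < PVALUE_PHYLO_THRESHOLD)
    return True
```

Note that the Tally threshold is inclusive while the p-value-phylo threshold is exclusive, as in the text.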

2. Output

Meta-Repeat-Finder returns multiple sequence alignments (MSAs) of the detected TRs. These MSAs are used to compute scores such as Psim, Parsimony, and p-value-phylo. The Tally machine learning model takes these scores as features to compute the Tally-2.3 score. Results are available in three forms (Raw results, Trimmed results, and Results), each stored in CSV files.

2.1 Raw results

The raw results contain all detected TRs before any post-processing. They may include false-positive TRs, particularly those with a Tally score below 0.5.

2.2 Trimmed results

Trimming targets only those MSAs that have a Tally-2.3 score < 0.5. The top repeat of the MSA is removed and the Tally-2.3 score is recalculated. The same procedure is then applied to the repeat at the bottom of the MSA. Each modified MSA whose recalculated score is >= 0.5 is stored, and the one with the higher score is chosen for the output. If both scores remain below 0.5, the MSA is removed from the output. In the local version of Meta-Repeat-Finder, trimming removes repeats iteratively until the Tally-2.3 score becomes >= 0.5, and the number of repeats that may be removed from the MSA can be adjusted. The trimmed results therefore contain only valid tandem repeats, with a Tally score greater than or equal to 0.5. If the more rigorous selection is activated, the selected TRs must also have a p-value-phylo score less than 0.2.
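The single-step trimming described above can be sketched as follows. This is an illustration under stated assumptions, not the tool's implementation: the MSA is modelled as a list of aligned repeat strings, `score_fn` is a hypothetical stand-in for the Tally-2.3 scoring model, and the requirement that a trimmed MSA keep at least two repeats is an assumption.

```python
from typing import Callable, List, Optional, Tuple


def trim_once(msa: List[str],
              score_fn: Callable[[List[str]], float],
              threshold: float = 0.5) -> Optional[Tuple[List[str], float]]:
    """Try removing the top repeat, then the bottom repeat, of a low-scoring MSA.

    Returns (trimmed_msa, score) for the better passing candidate,
    or None when neither candidate reaches the threshold.
    """
    candidates = []
    for trimmed in (msa[1:], msa[:-1]):  # without top repeat, without bottom repeat
        if len(trimmed) < 2:
            continue  # assumption: a tandem repeat needs at least two units
        score = score_fn(trimmed)
        if score >= threshold:
            candidates.append((trimmed, score))
    if not candidates:
        return None  # the MSA is dropped from the trimmed results
    # Keep the candidate with the higher recalculated score.
    return max(candidates, key=lambda c: c[1])
```

The iterative behaviour of the local version could be obtained by calling such a step repeatedly up to a configurable number of removals, but the details of that loop are not specified here.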

2.3 Results

This option represents the final result, which contains only valid TRs that do not overlap one another, apart from the exception described below. Among overlapping TRs, MRF prioritizes those with shorter repeat units. If the repeat units of different TRs have the same length, preference is given to the longer TR region (the one with more repeats). The repeat unit length of a TR is calculated as the average length of the repeat units in its MSA, with any gaps excluded. If the average length is fractional, it is rounded to the nearest whole number.
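The priority rule for overlapping TRs can be sketched as a comparison on two keys. The sketch below is an illustration only: the dictionary fields `msa`, `start`, and `end`, the gap character `-`, and the round-half-up tie handling are assumptions, since the text does not specify them.

```python
import math
from typing import Dict, List


def average_unit_length(msa: List[str], gap: str = "-") -> int:
    """Average ungapped length of the repeat units, rounded to a whole number."""
    lengths = [len(row.replace(gap, "")) for row in msa]
    mean = sum(lengths) / len(lengths)
    # Assumption: exact .5 values round up (Python's round() would round to even).
    return math.floor(mean + 0.5)


def region_length(tr: Dict) -> int:
    """Length of the TR region, assuming inclusive start and end coordinates."""
    return tr["end"] - tr["start"] + 1


def pick_preferred(tr_a: Dict, tr_b: Dict) -> Dict:
    """Prefer the shorter repeat unit; on a tie, prefer the longer TR region."""
    return min((tr_a, tr_b),
               key=lambda tr: (average_unit_length(tr["msa"]), -region_length(tr)))
```

For example, a TR with units `["AC", "AC", "AC"]` is preferred over an overlapping TR with units `["AC-D", "ACGD"]`, because its average unit length (2) is shorter than the other's (3.5, rounded to 4).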

As an exception, two overlapping TRs are both kept when conditions 1 and 2 are both true.

Condition 1:

\( \frac{\text{Average length of repeat units from MSA 1}}{\text{Average length of repeat units from MSA 2}} > 3 \)

\( \text{where} \;\; \text{(Average length of repeat units from MSA 1)} \geq \text{(Average length of repeat units from MSA 2)} \)

AND condition 2:

\( \frac{\text{Length of overlapped region}}{\text{MSA 1 length}} \leq 0.10 \)

\( \text{OR} \)

\( \frac{\text{Length of overlapped region}}{\text{MSA 2 length}} \leq 0.10 \)
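The two conditions can be combined into one check, sketched below. The function and parameter names are hypothetical, and reading "MSA length" as the length of the sequence region covered by that MSA is an assumption.

```python
def keep_both(avg_unit_1: float, avg_unit_2: float,
              overlap_len: int, msa1_len: int, msa2_len: int) -> bool:
    """Return True when two overlapping TRs should both be kept."""
    # Condition 1: the longer repeat unit is more than 3 times the shorter one.
    longer = max(avg_unit_1, avg_unit_2)
    shorter = min(avg_unit_1, avg_unit_2)
    cond1 = longer / shorter > 3
    # Condition 2: the overlap covers at most 10% of at least one of the TRs.
    cond2 = (overlap_len / msa1_len <= 0.10
             or overlap_len / msa2_len <= 0.10)
    return cond1 and cond2
```

With average unit lengths of 12 and 3 (ratio 4), an overlap of 5 residues, and TR lengths of 60 and 30, both conditions hold (5/60 is about 0.083), so both TRs are kept. With unit lengths of 9 and 3 the ratio is exactly 3, which fails the strict inequality.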