bug #9527: Consensus Sequence Details: Consensus Sequence 5' -> 3' limit characters to [aAcCgGTtUu\s]

open

Added by Andreas Kohlbecker over 3 years ago. Updated over 3 years ago.

Status: New
Priority: New
Assignee: Katja Luther
Category: taxeditor
Target version: Unassigned CDM tickets
Start date:
Due date:
% Done:
Estimated time:
Severity: normal
Found in Version:
Tags:

Description

For data entered or modified in the "Consensus Sequence 5' -> 3'" text area, the characters that can be typed or pasted must be limited to those used as codes for the nucleosides of DNA. It might also be a good idea to allow uracil, which replaces thymine in RNA.

Whitespace must also be allowed.

Depending on how the consensus sequence is used, the consensus sequence calculates the most frequently appearing nucleotide for every position or it shows which residues are conserved and which residues are variable. Consider the following example DNA sequence: A[CT]N{A}YR. In this notation, A means that an A is always found in that position; [CT] stands for either C or T; N stands for any base; and {A} means any base except A. Y represents any pyrimidine, and R indicates any purine. (see https://en.wikipedia.org/wiki/Consensus_sequence)

Therefore we also need to allow the different kinds of brackets, Y and R. There may be other characters used in consensus sequences.
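
To make the notation concrete, here is a minimal sketch in Python (not TaxEditor code) that translates a consensus string such as A[CT]N{A}YR into an ordinary regular expression over the four DNA bases. The function name `consensus_to_regex` and the symbol table are assumptions made for this illustration; only the symbols mentioned above are covered.

```python
import re

# DNA alphabet only; this is an assumption of the sketch (RNA would use U for T).
BASES = "ACGT"

# Single-letter symbols from the description above (hypothetical lookup table).
SYMBOLS = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "N": "ACGT",  # any base
    "Y": "CT",    # any pyrimidine
    "R": "AG",    # any purine
}

def consensus_to_regex(consensus: str) -> str:
    """Translate consensus notation into a regex with one class per position."""
    parts = []
    i = 0
    while i < len(consensus):
        ch = consensus[i]
        if ch.isspace():
            i += 1  # whitespace carries no meaning and is skipped
            continue
        if ch in "[{":
            close = "]" if ch == "[" else "}"
            end = consensus.find(close, i)
            if end == -1:
                raise ValueError(f"unbalanced {ch!r} at position {i}")
            group = consensus[i + 1:end].upper()
            if not group or any(b not in BASES for b in group):
                raise ValueError(f"invalid group {group!r} at position {i}")
            if ch == "[":
                allowed = group  # [CT]: either C or T
            else:
                # {A}: any base except the listed ones
                allowed = "".join(b for b in BASES if b not in group)
            i = end + 1
        else:
            allowed = SYMBOLS.get(ch.upper())
            if allowed is None:
                raise ValueError(f"unknown symbol {ch!r} at position {i}")
            i += 1
        if not allowed:
            raise ValueError("a position must allow at least one base")
        parts.append("[" + allowed + "]")
    return "".join(parts)
```

With this sketch, `A[CT]N{A}YR` becomes `[A][CT][ACGT][CGT][CT][AG]`, which matches `ACGTCA` but not `ACGACA` (the fourth position must not be A).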

Regex for validation of DNA and RNA sequences:

^[aAcCgGTtUuRrNnYy\s\{\}\[\]]*$
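
As an illustration of how such a character class could be applied to a whole input string, the following Python sketch checks every character and reports the offending ones. The names `is_valid_sequence` and `invalid_characters` are hypothetical, and Python's `re` module stands in for whatever regex engine the TaxEditor would actually use.

```python
import re

# Character class from the ticket. The "*" quantifier together with
# fullmatch() makes the check cover every character of the input.
ALLOWED = re.compile(r"[aAcCgGtTuUrRnNyY\s{}\[\]]*")

def is_valid_sequence(text: str) -> bool:
    """Return True if the text contains only allowed characters."""
    # Note: an empty string is accepted, i.e. an empty field counts as valid.
    return ALLOWED.fullmatch(text) is not None

def invalid_characters(text: str) -> list:
    """Return the distinct disallowed characters, sorted, for an error message."""
    return sorted({ch for ch in text if ALLOWED.fullmatch(ch) is None})
```

For example, `is_valid_sequence("A[CT]N{A}YR")` is true, while `invalid_characters("ACGT-X1")` yields `['-', '1', 'X']`.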

Updated by Andreas Müller over 3 years ago

Is there a reason for this requirement, or is it only because we expect sequences to look like this?
And why is it a TaxEditor ticket? Is it only a requirement for entering data in the TaxEditor, or is it meant as a general constraint?

The reason I ask is that such constraints make dirty or strangely formatted data difficult to enter (e.g. during automated imports). Therefore it is always a trade-off between correctness and usability.
So the question is whether there is a reason why we need this correctness (e.g. because we have a viewer for sequences that requires it). Otherwise I would suggest making it a soft validation rule (giving a hint that the data is not correct, but not forbidding it).
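
A soft validation rule of the kind suggested here could look roughly like the following Python sketch: the value is always accepted, and the caller only receives a list of hints to display. The function name `soft_validate` and the default character set (taken from the ticket title) are assumptions for illustration, not existing TaxEditor or CDM code.

```python
def soft_validate(text: str, allowed: str = "aAcCgGTtUu") -> list:
    """Return warnings for unexpected characters without rejecting the input.

    Whitespace is always allowed. An empty list means nothing to report.
    The caller is expected to store the text in either case.
    """
    warnings = []
    for pos, ch in enumerate(text):
        if ch.isspace() or ch in allowed:
            continue
        warnings.append(f"unexpected character {ch!r} at position {pos}")
    return warnings
```

An imported value such as `"AC-GT X"` would still be saved, but would produce two hints (for `'-'` at position 2 and `'X'` at position 6) that the editor could show next to the field.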

Updated by Andreas Kohlbecker over 3 years ago

Description updated (diff)

Andreas Müller wrote:

Is there a reason for this requirement, or is it only because we expect sequences to look like this?

It prevents incorrect data from being entered in the editor.

And why is it a TaxEditor ticket? Is it only a requirement for entering data in the TaxEditor, or is it meant as a general constraint?
The reason I ask is that such constraints make dirty or strangely formatted data difficult to enter (e.g. during automated imports). Therefore it is always a trade-off between correctness and usability.

To avoid problems during imports etc., this ticket is specifically dedicated to the taxeditor. I adapted the ticket description a bit to make that clearer.

Updated by Katja Luther over 3 years ago

Description updated (diff)

Updated by Katja Luther over 3 years ago

There was already a related ticket; I am adding it here for the information in its discussion: #5057

Updated by Andreas Müller over 3 years ago

The given regex is no longer valid in light of our extended findings on the use of brackets and Y (and maybe other characters).

Updated by Andreas Müller over 3 years ago

Given the uncertainties about additional notation such as brackets, I suggest using only soft validation (warning but not forbidding). Ultimately, though, the users who are familiar with the possible formats (including dirty data) should decide.

Updated by Andreas Kohlbecker over 3 years ago

Description updated (diff)

The regex is now complete according to the notation found at https://en.wikipedia.org/wiki/Consensus_sequence
