Domain prediction - Why and How?

What are protein domains?

Since the first protein structures were solved, it has been apparent that the polypeptide chain often folds into one or more distinct regions of structure. Such substructures, or domains, are considered the basic units of folding, function and evolution, and often have similar chain topologies (Holm & Sander, 1994). Protein domains are often considered independent or, at the least, semi-independent units, able to fold and in some cases retain function if separated from the parent chain. The independent, modular nature of many domains means that they can often be found in proteins with the same domain content but in different orders, or in different proteins in combination with entirely different domain structures. The concept of the protein domain is just as valid at the sequence level as at the structural level. This is shown by the fact that aligning sequences that contain similar domains in different orders can produce poor and possibly misleading alignments. However, aligning the shared domains after extracting them from the parent sequences may reveal a high level of sequence similarity, demonstrating an evolutionary link between the domain sequences.
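The effect of domain order on alignment can be illustrated with a small sketch. The example below is not part of any DomPred method: the two "domain" sequences are invented toy strings, the function name is hypothetical, and the scoring scheme (match +1, mismatch -1, gap -1) is an arbitrary assumption chosen only to keep the arithmetic simple. It computes a basic Needleman-Wunsch global alignment score for two chains that contain the same two domains in opposite orders, and then for the extracted domains on their own.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score (score only, no traceback).

    The scoring values are illustrative assumptions, not a real
    substitution matrix such as BLOSUM62.
    """
    # prev[j] holds the best score for aligning a[:i-1] with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


# Hypothetical toy "domains" (invented strings, not real protein domains)
DOMAIN_A = "MKTAYIAKQR"
DOMAIN_B = "GSHWLPEVDN"

# Two chains with the same domain content but in opposite orders
protein_1 = DOMAIN_A + DOMAIN_B
protein_2 = DOMAIN_B + DOMAIN_A

whole_chain_score = global_alignment_score(protein_1, protein_2)
domain_a_score = global_alignment_score(DOMAIN_A, DOMAIN_A)
domain_b_score = global_alignment_score(DOMAIN_B, DOMAIN_B)

print("whole chains:", whole_chain_score)
print("domain A only:", domain_a_score)
print("domain B only:", domain_b_score)
```

In this toy case each extracted domain aligns perfectly with its counterpart, whereas the whole-chain global alignment cannot match both domains at once and so scores lower than the two domain-level alignments combined. Real domains would of course be diverged rather than identical, and a realistic comparison would use a proper substitution matrix and gap penalties.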

Why search for protein domains?

The identification of domains within a protein sequence is an important precursor for a range of methods.
Protein structure determination methods such as X-ray crystallography and NMR have size limitations that restrict their use; they are often employed more successfully when solving smaller domain units rather than whole chains.
As discussed above, multiple sequence alignment at the domain level can result in the detection of homologous sequences that are more difficult to detect using a complete chain sequence.
It is also well known that fold recognition methods perform more reliably if a putative multi-domain target is considered in terms of its constituent domains rather than as a whole chain (Jones & Hadley, 2000).
All of the methods mentioned above are used to gain insight into the structure and, ultimately, the function of a given protein chain. Such results are often best achieved at the domain level.

How can they be found?

The delineation of protein domains within a polypeptide chain can be achieved in several ways. Methods applied by classification databases such as the Dali Domain Dictionary (DDD; Dietmann & Holm, 2001), CATH (Orengo et al., 1997) and SCOP (Murzin et al., 1995) use structural data to locate and assign domains. However, complete automation of domain assignment, even from structural data, is not a trivial problem (Jones et al., 1998), and it obviously requires a solved protein structure. Identification of domains at the sequence level most often relies on the detection of global-local sequence alignments between a given target sequence and domain sequences found in databases such as Pfam (Bateman et al., 2000). However, difficulties in elucidating the domain content of a sequence at the sequence homology level arise when searching the target sequence against sequence databases yields no significant matches. In such situations, an ab initio approach to domain assignment from sequence is required.

Methods employed by the DomPred server

Marsden, R.L., McGuffin, L.J. & Jones, D.T. (2002) Rapid protein domain assignment from amino acid sequence using predicted secondary structure. Protein Science, 11, 2814-2824.