Speeding disease gene discovery by sequence based candidate prioritization

Euan A Adie, Richard R Adams, Kathryn L Evans, David J Porteous, Ben S Pickard

Research output: Contribution to journal (Article)

193 Citations (Scopus)

Abstract

Background: Regions of interest identified through genetic linkage studies regularly exceed 30 centimorgans in size and can contain hundreds of genes. Traditionally this number is reduced by matching functional annotation to knowledge of the disease or phenotype in question. However, here we show that disease genes share patterns of sequence-based features that can provide a good basis for automatic prioritization of candidates by machine learning.
Results: We examined a variety of sequence-based features and found that for many of them there are significant differences between the sets of genes known to be involved in human hereditary disease and those not known to be involved in disease. We have created an automatic classifier called PROSPECTR based on those features using the alternating decision tree algorithm which ranks genes in the order of likelihood of involvement in disease. On average, PROSPECTR enriches lists for disease genes two-fold 77% of the time, five-fold 37% of the time and twenty-fold 11% of the time.
Conclusion: PROSPECTR is a simple and effective way to identify genes involved in Mendelian and oligogenic disorders. It performs markedly better than the single existing sequence-based classifier on novel data. PROSPECTR could save investigators looking at large regions of interest time and effort by prioritizing positional candidate genes for mutation detection and case-control association studies.
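
The enrichment figures in the abstract can be made concrete with a small sketch. It assumes (this reading is not spelled out in the abstract) that an n-fold enrichment means the known disease gene is ranked within the top 1/n of the candidate list, so two-fold corresponds to the top half and twenty-fold to the top 5%. The function names and the trial data are hypothetical and are not taken from PROSPECTR itself.

```python
def enrichment_fold(rank: int, list_size: int) -> float:
    """Fold enrichment for a disease gene at 1-based `rank` in a ranked list.

    Assumption: a gene ranked in the top 1/n of the list counts as n-fold
    enrichment, so the fold value is list_size / rank.
    """
    if not 1 <= rank <= list_size:
        raise ValueError("rank must be between 1 and list_size")
    return list_size / rank


def fraction_enriched(trials, fold: float) -> float:
    """Fraction of trials in which the list was enriched at least `fold` times.

    `trials` is a sequence of (rank, list_size) pairs, one per ranked list.
    """
    hits = sum(1 for rank, size in trials if enrichment_fold(rank, size) >= fold)
    return hits / len(trials)


# Hypothetical example data: (rank of the known disease gene, genes in the region).
trials = [(3, 100), (40, 200), (150, 200), (10, 50), (1, 120)]

for fold in (2, 5, 20):
    share = fraction_enriched(trials, fold)
    print(f"{fold}-fold enrichment reached in {share:.0%} of lists")
```

Under this reading, a statement such as "twenty-fold 11% of the time" would mean that in 11% of test regions the disease gene fell within the top 5% of the ranked candidates.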
Language: English
Article number: 55
Number of pages: 13
Journal: BMC Bioinformatics
Volume: 6
DOI: 10.1186/1471-2105-6-55
Publication status: Published - 14 Mar 2005

Fingerprint

Prioritization
Genetic Association Studies
Genes
Fold
Region of Interest
Classifiers
Decision Trees
Inborn Genetic Diseases
Genetic Linkage
Gene Order
Case-control
Tree Algorithms
Case-Control Studies
Phenotype
Linkage
Annotation

Keywords

  • alternating decision trees
  • automatic classifiers
  • candidate genes
  • functional annotation
  • genetic linkage
  • mutation detection
  • prioritization
  • regions of interest

Cite this

Adie, Euan A ; Adams, Richard R ; Evans, Kathryn L ; Porteous, David J ; Pickard, Ben S. / Speeding disease gene discovery by sequence based candidate prioritization. In: BMC Bioinformatics. 2005 ; Vol. 6.
@article{c14a0ccfb6154d0ea25f503d3fd9a573,
title = "Speeding disease gene discovery by sequence based candidate prioritization",
abstract = "Background: Regions of interest identified through genetic linkage studies regularly exceed 30 centimorgans in size and can contain hundreds of genes. Traditionally this number is reduced by matching functional annotation to knowledge of the disease or phenotype in question. However, here we show that disease genes share patterns of sequence-based features that can provide a good basis for automatic prioritization of candidates by machine learning. Results: We examined a variety of sequence-based features and found that for many of them there are significant differences between the sets of genes known to be involved in human hereditary disease and those not known to be involved in disease. We have created an automatic classifier called PROSPECTR based on those features using the alternating decision tree algorithm which ranks genes in the order of likelihood of involvement in disease. On average, PROSPECTR enriches lists for disease genes two-fold 77{\%} of the time, five-fold 37{\%} of the time and twenty-fold 11{\%} of the time. Conclusion: PROSPECTR is a simple and effective way to identify genes involved in Mendelian and oligogenic disorders. It performs markedly better than the single existing sequence-based classifier on novel data. PROSPECTR could save investigators looking at large regions of interest time and effort by prioritizing positional candidate genes for mutation detection and case-control association studies.",
keywords = "alternating decision trees, automatic classifiers, candidate genes, functional annotation, genetic linkage, mutation detection, prioritization, regions of interest",
author = "Adie, {Euan A} and Adams, {Richard R} and Evans, {Kathryn L} and Porteous, {David J} and Pickard, {Ben S}",
year = "2005",
month = "3",
day = "14",
doi = "10.1186/1471-2105-6-55",
language = "English",
volume = "6",
journal = "BMC Bioinformatics",
issn = "1471-2105",

}

Speeding disease gene discovery by sequence based candidate prioritization. / Adie, Euan A; Adams, Richard R; Evans, Kathryn L; Porteous, David J; Pickard, Ben S.

In: BMC Bioinformatics, Vol. 6, 55, 14.03.2005.


TY - JOUR
T1 - Speeding disease gene discovery by sequence based candidate prioritization
AU - Adie, Euan A
AU - Adams, Richard R
AU - Evans, Kathryn L
AU - Porteous, David J
AU - Pickard, Ben S
PY - 2005/3/14
Y1 - 2005/3/14
N2 - Background: Regions of interest identified through genetic linkage studies regularly exceed 30 centimorgans in size and can contain hundreds of genes. Traditionally this number is reduced by matching functional annotation to knowledge of the disease or phenotype in question. However, here we show that disease genes share patterns of sequence-based features that can provide a good basis for automatic prioritization of candidates by machine learning. Results: We examined a variety of sequence-based features and found that for many of them there are significant differences between the sets of genes known to be involved in human hereditary disease and those not known to be involved in disease. We have created an automatic classifier called PROSPECTR based on those features using the alternating decision tree algorithm which ranks genes in the order of likelihood of involvement in disease. On average, PROSPECTR enriches lists for disease genes two-fold 77% of the time, five-fold 37% of the time and twenty-fold 11% of the time. Conclusion: PROSPECTR is a simple and effective way to identify genes involved in Mendelian and oligogenic disorders. It performs markedly better than the single existing sequence-based classifier on novel data. PROSPECTR could save investigators looking at large regions of interest time and effort by prioritizing positional candidate genes for mutation detection and case-control association studies.


KW - alternating decision trees
KW - automatic classifiers
KW - candidate genes
KW - functional annotation
KW - genetic linkage
KW - mutation detection
KW - prioritization
KW - regions of interest
UR - http://www.biomedcentral.com/bmcbioinformatics/
U2 - 10.1186/1471-2105-6-55
DO - 10.1186/1471-2105-6-55
M3 - Article
VL - 6
JO - BMC Bioinformatics
T2 - BMC Bioinformatics
JF - BMC Bioinformatics
SN - 1471-2105
M1 - 55
ER -