TY - CHAP
T1 - On the statistics of identifying candidate pathogen effectors
AU - Pritchard, Leighton
AU - Broadhurst, David
PY - 2014/3/19
Y1 - 2014/3/19
N2 - High-throughput sequencing is an increasingly accessible tool for cataloging gene complements of plant pathogens and their hosts. It has had great impact in plant pathology, enabling rapid acquisition of data for a wide range of pathogens and hosts, leading to the selection of novel candidate effector proteins, and/or associated host targets (Bart et al., Proc Natl Acad Sci U S A doi:10.1073/pnas.1208003109, 2012; Agbor and McCormick, Cell Microbiol 13:1858–1869, 2011; Fabro et al., PLoS Pathog 7:e1002348, 2011; Kim et al., Mol Plant Pathol 2:715–730, 2011; Kimbrel et al., Mol Plant Pathol 12:580–594, 2011; O’Brien et al., Curr Opin Microbiol 14:24–30, 2011; Vleeshouwers et al., Annu Rev Phytopathol 49:507–531, 2011; Sarris et al., Mol Plant Pathol 11:795–804, 2010; Boch and Bonas, Annu Rev Phytopathol 48:419–436, 2010; McDermott et al., Infect Immun 79:23–32, 2011). Identification of candidate effectors from genome data is not different from classification in any other high-content or high-throughput experiment. The primary aim is to discover a set of qualitative or quantitative sequence characteristics that discriminate, with a defined level of certainty, between proteins that have previously been identified as being either “effector” (positive) or “not effector” (negative). Combination of these characteristics in a mathematical model, or classifier, enables prediction of whether a protein is or is not an effector, with a defined level of certainty. High-throughput screening of the gene complement is then performed to identify candidate effectors; this may seem straightforward, but it is unfortunately very easy to identify seemingly persuasive candidate effectors that are, in fact, entirely spurious.
The main sources of danger in this area of statistical modeling are not entirely independent of each other, and include: inappropriate choice of classifier model; poor selection of reference sequences (known positive and negative examples); poor definition of classes (what is, and what is not, an effector); inadequate training sample size; poor model validation; and lack of adequate model performance metrics (Xia et al., Metabolomics doi:10.1007/s11306-012-0482-9, 2012). Many studies fail to take these issues into account, and thereby fail to discover anything of true significance or, worse, report spurious findings that are impossible to validate. Here we summarize the impact of these issues and present strategies to assist in improving design and evaluation of effector classifiers, enabling robust scientific conclusions to be drawn from the available data.
AB - High-throughput sequencing is an increasingly accessible tool for cataloging gene complements of plant pathogens and their hosts. It has had great impact in plant pathology, enabling rapid acquisition of data for a wide range of pathogens and hosts, leading to the selection of novel candidate effector proteins, and/or associated host targets (Bart et al., Proc Natl Acad Sci U S A doi:10.1073/pnas.1208003109, 2012; Agbor and McCormick, Cell Microbiol 13:1858–1869, 2011; Fabro et al., PLoS Pathog 7:e1002348, 2011; Kim et al., Mol Plant Pathol 2:715–730, 2011; Kimbrel et al., Mol Plant Pathol 12:580–594, 2011; O’Brien et al., Curr Opin Microbiol 14:24–30, 2011; Vleeshouwers et al., Annu Rev Phytopathol 49:507–531, 2011; Sarris et al., Mol Plant Pathol 11:795–804, 2010; Boch and Bonas, Annu Rev Phytopathol 48:419–436, 2010; McDermott et al., Infect Immun 79:23–32, 2011). Identification of candidate effectors from genome data is not different from classification in any other high-content or high-throughput experiment. The primary aim is to discover a set of qualitative or quantitative sequence characteristics that discriminate, with a defined level of certainty, between proteins that have previously been identified as being either “effector” (positive) or “not effector” (negative). Combination of these characteristics in a mathematical model, or classifier, enables prediction of whether a protein is or is not an effector, with a defined level of certainty. High-throughput screening of the gene complement is then performed to identify candidate effectors; this may seem straightforward, but it is unfortunately very easy to identify seemingly persuasive candidate effectors that are, in fact, entirely spurious.
The main sources of danger in this area of statistical modeling are not entirely independent of each other, and include: inappropriate choice of classifier model; poor selection of reference sequences (known positive and negative examples); poor definition of classes (what is, and what is not, an effector); inadequate training sample size; poor model validation; and lack of adequate model performance metrics (Xia et al., Metabolomics doi:10.1007/s11306-012-0482-9, 2012). Many studies fail to take these issues into account, and thereby fail to discover anything of true significance or, worse, report spurious findings that are impossible to validate. Here we summarize the impact of these issues and present strategies to assist in improving design and evaluation of effector classifiers, enabling robust scientific conclusions to be drawn from the available data.
KW - bioinformatics
KW - classification
KW - effectors
KW - genomics
KW - high-throughput screening
KW - sequence analysis
KW - statistical modeling
UR - http://www.scopus.com/inward/record.url?scp=84908481758&partnerID=8YFLogxK
U2 - 10.1007/978-1-62703-986-4_4
DO - 10.1007/978-1-62703-986-4_4
M3 - Chapter
C2 - 24643551
AN - SCOPUS:84908481758
SN - 9781627039864
SN - 9781627039857
T3 - Methods in Molecular Biology
SP - 53
EP - 64
BT - Plant-Pathogen Interactions
A2 - Birch, Paul
A2 - Jones, John T.
A2 - Bos, Jorunn I.B.
PB - Springer
CY - New York
ER -