On the statistics of identifying candidate pathogen effectors

Leighton Pritchard, David Broadhurst

Research output: Chapter in Book/Report/Conference proceeding › Chapter

2 Citations (Scopus)


High-throughput sequencing is an increasingly accessible tool for cataloging gene complements of plant pathogens and their hosts. It has had great impact in plant pathology, enabling rapid acquisition of data for a wide range of pathogens and hosts, leading to the selection of novel candidate effector proteins, and/or associated host targets (Bart et al., Proc Nat Acad Sci U S A doi:10.1073/pnas.1208003109, 2012; Agbor and McCormick, Cell Microbiol 13:1858–1869, 2011; Fabro et al., PLoS Pathog 7:e1002348, 2011; Kim et al., Mol Plant Pathol 2:715–730, 2011; Kimbrel et al., Mol Plant Pathol 12:580–594, 2011; O’Brien et al., Curr Opin Microbiol 14:24–30, 2011; Vleeshouwers et al., Annu Rev Phytopathol 49:507–531, 2011; Sarris et al., Mol Plant Pathol 11:795–804, 2010; Boch and Bonas, Annu Rev Phytopathol 48:419–436, 2010; McDermott et al., Infect Immun 79:23–32, 2011). Identification of candidate effectors from genome data is no different from classification in any other high-content or high-throughput experiment. The primary aim is to discover a set of qualitative or quantitative sequence characteristics that discriminate, with a defined level of certainty, between proteins that have previously been identified as being either “effector” (positive) or “not effector” (negative). Combining these characteristics in a mathematical model, or classifier, enables prediction of whether a protein is or is not an effector, with a defined level of certainty. High-throughput screening of the gene complement is then performed to identify candidate effectors; this may seem straightforward, but it is unfortunately very easy to identify seemingly persuasive candidate effectors that are, in fact, entirely spurious.
The main sources of danger in this area of statistical modeling are not entirely independent of each other, and include: inappropriate choice of classifier model; poor selection of reference sequences (known positive and negative examples); poor definition of classes (what is, and what is not, an effector); inadequate training sample size; poor model validation; and lack of adequate model performance metrics (Xia et al., Metabolomics doi:10.1007/s11306-012-0482-9, 2012). Many studies fail to take these issues into account, and thereby fail to discover anything of true significance or, worse, report spurious findings that are impossible to validate. Here we summarize the impact of these issues and present strategies to assist in improving design and evaluation of effector classifiers, enabling robust scientific conclusions to be drawn from the available data.
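To make the point about performance metrics concrete, the following minimal Python sketch shows how a classifier with apparently strong sensitivity and specificity can still return mostly spurious candidates when true effectors are rare in the gene complement. It is an illustration only and is not taken from the chapter: the proteome size, effector prevalence, sensitivity, and specificity are all assumed values, and the names `Confusion` and `expected_confusion` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Counts from comparing classifier calls against known labels."""
    tp: int  # effectors called effector
    fp: int  # non-effectors called effector (spurious candidates)
    tn: int  # non-effectors called non-effector
    fn: int  # effectors called non-effector (missed)

    @property
    def sensitivity(self) -> float:
        # Fraction of true effectors that the classifier recovers.
        return self.tp / (self.tp + self.fn)

    @property
    def specificity(self) -> float:
        # Fraction of non-effectors that the classifier correctly rejects.
        return self.tn / (self.tn + self.fp)

    @property
    def precision(self) -> float:
        # Fraction of reported candidates that are true effectors.
        return self.tp / (self.tp + self.fp)


def expected_confusion(n_proteins: int, prevalence: float,
                       sensitivity: float, specificity: float) -> Confusion:
    """Expected counts when screening a proteome with a given classifier."""
    n_pos = round(n_proteins * prevalence)
    n_neg = n_proteins - n_pos
    tp = round(n_pos * sensitivity)
    tn = round(n_neg * specificity)
    return Confusion(tp=tp, fp=n_neg - tn, tn=tn, fn=n_pos - tp)


# Assumed, illustrative values: 10,000 proteins, 1% of them true effectors,
# and a classifier with 90% sensitivity and 95% specificity.
screen = expected_confusion(10_000, prevalence=0.01,
                            sensitivity=0.90, specificity=0.95)

print(f"candidates reported: {screen.tp + screen.fp}")
print(f"true effectors among them: {screen.tp}")
print(f"precision: {screen.precision:.3f}")
```

Under these assumed numbers, 90 true effectors sit among 585 reported candidates, so only about 15% of the candidate list is real. Reporting sensitivity and specificity alone would hide this; precision (or an equivalent metric that accounts for class imbalance) makes it visible.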

Original language: English
Title of host publication: Plant-Pathogen Interactions
Subtitle of host publication: Methods and Protocols
Editors: Paul Birch, John T. Jones, Jorunn I.B. Bos
Place of Publication: New York
Number of pages: 12
ISBN (Print): 9781627039864, 9781627039857
Publication status: Published - 19 Mar 2014

Publication series

Name: Methods in Molecular Biology
Publisher: Springer Verlag
ISSN (Print): 1064-3745


  • bioinformatics
  • classification
  • effectors
  • genomics
  • high-throughput screening
  • sequence analysis
  • statistical modeling
