Technical Approach



Phenotypic variation in mice can be induced with N-ethyl-N-nitrosourea (ENU), which creates single base pair substitutions in germline DNA. The methods employed in the Center for the Genetics of Host Defense and described on this page (Figure 1) identify ENU-induced mutations that cause phenotypes “instantly,” that is, concurrently with phenotypic screening (Wang et al. Proc.Natl.Acad.Sci.U.S.A. 112, E440-9). This approach differs from conventional forward genetic methods in that it permits (1) unbiased declaration of mappable phenotypes, including those that are incompletely penetrant, (2) automated identification of causative mutations concurrently with phenotypic screening, without the need to outcross mutant mice to another strain and backcross them, and (3) exclusion of genes not involved in phenotypes of interest.

Figure 1. Overview of the forward genetic approach in the Center for the Genetics of Host Defense. (A) Male C57BL/6J mice (G0) mutagenized with ENU are bred to produce at least 20 third generation (G3) mice carrying heterozygous and homozygous mutations. (B) Exome sequencing of the G1 male founder of each pedigree is performed. (C) The mutated loci identified in sequencing are then genotyped in the G2 and G3 mice. This is accomplished by high-throughput amplification of mutated loci using Ion AmpliSeq panels, followed by Ion PGM sequencing of the amplification products. (D) The genotyped G3 mice are subjected to a predetermined set of phenotypic screens. (E) Genotype data and quantitative phenotype data are used for mapping by Linkage Analyzer. Calculated P values for non-linkage and scatter plots of phenotypic data for every mutant allele are displayed by Linkage Explorer. (F) Causative mutations are confirmed by observation of the mutant phenotype in mice with a second mutant allele, which may be generated by CRISPR/Cas9 targeting.

Advancing the forward genetic approach

Only recently has instant mutation identification by automated mapping become a reality, made possible by cumulative methodological and technological advances achieved during the past ~20 years (Timeline). These advances, including several made in the Center for the Genetics of Host Defense, are described briefly.

Prior to the advent of molecular cloning and DNA sequencing techniques, the causes of heritable phenotypes could only be established using biochemical assays.  This limitation greatly restricted the number of traits that could be analyzed, but in the early 1980s researchers developed and applied a genetic approach to define the molecular basis of heritable traits (Botstein et al. Am.J.Hum.Genet. 32, 314-331; Davies et al. Nucleic Acids Res. 11, 2303-2312; Gusella et al. Nature. 306, 234-238; Royer-Pokora et al. Nature. 322, 32-38).  Now termed “positional cloning,” the method involved four basic steps: (1) high-resolution genetic mapping to establish a critical region within which the genetic cause of the phenotype was proven to reside, (2) physical mapping of the critical region, which was performed by cloning all of the critical region as large, overlapping pieces of DNA, (3) gene identification within the critical region by exon trapping, cDNA selection, etc., and (4) mutation identification, by sequencing all candidate genes and detecting a mutation invariably associated with the trait in affected individuals.  The development of the polymerase chain reaction (PCR) (Saiki et al. Science. 230, 1350-1354), databases of expressed sequence tags (ESTs) (Marra et al. Nat.Genet. 21, 191-194), and fluorescence-based capillary sequencing (Smith et al. Nature. 321, 674-679) facilitated steps (2), (3), and (4), respectively, yet the entire process typically required 5 to 8 years.

The speed of positional cloning improved as a result of two breakthroughs.  First, the annotated C57BL/6J mouse genome sequence was published as a draft in 2002 (Mouse Genome Sequencing Consortium et al. Nature. 420, 520-562), eliminating the second and third steps in positional cloning.  Second, in 2010 it became possible to sequence whole mammalian genomes, and eventually whole exomes (Metzker. Nat.Rev.Genet. 11, 31-46), making it possible to see all the candidate mutations that might be responsible for a phenotype in a given pedigree. In one of our initial uses of whole genome sequencing, we identified a panel of 127 single nucleotide polymorphisms distinguishing the C57BL/6J and C57BL/10J mouse strains (Xia et al. Genetics. 186, 1139-1146). This facilitated the use of C57BL/10J as a mapping strain for phenotypes induced in the C57BL/6J strain, which was desirable since the strains are closely related and phenotypes are therefore less likely to be altered by modifier loci present in more distantly related strains. Even with these advances, the genetic mapping step remained necessary and lengthy; thus, methods to speed mapping were developed.

Our contributions to increasing the speed of mapping began with the use of bulk segregation analysis (BSA) for quick, low-resolution mapping of mouse phenotypes (Arnold et al. Genetics. 187, 633-641). BSA measures mutant vs. mapping strain allele frequency at strain-specific markers across the genome in pools of DNA from phenotypically affected and nonaffected F2 offspring (from mutants outcrossed to a mapping strain). For each marker, enrichment of the mutant strain allele in the affected DNA pool and depletion in the nonaffected DNA pool are used to establish linkage. With only about 20 meioses, BSA can localize a mutation to a sub-chromosomal region, within which there may be only one mutation identified by whole genome or exome sequencing.
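The allele frequency comparison at the heart of BSA can be sketched in a few lines. This is a minimal illustration, not the published pipeline: the function name, the marker names, and the read counts are all made up, and the score used here (a simple difference in mutant strain allele frequency between the two pools) is an assumed stand-in for whatever statistic a real analysis would apply.

```python
def bsa_enrichment(affected_counts, unaffected_counts):
    """Per-marker difference in mutant strain allele frequency between pools.

    Each argument maps a marker name to a tuple of
    (mutant_strain_allele_count, mapping_strain_allele_count).
    """
    scores = {}
    for marker, (mut_a, map_a) in affected_counts.items():
        mut_u, map_u = unaffected_counts[marker]
        freq_affected = mut_a / (mut_a + map_a)
        freq_unaffected = mut_u / (mut_u + map_u)
        # Positive score: mutant allele enriched in affected, depleted in unaffected
        scores[marker] = freq_affected - freq_unaffected
    return scores


# Hypothetical counts for three strain-specific markers (illustrative only)
affected = {"chr1_m1": (52, 48), "chr7_m1": (95, 5), "chr12_m1": (47, 53)}
unaffected = {"chr1_m1": (49, 51), "chr7_m1": (34, 66), "chr12_m1": (51, 49)}

scores = bsa_enrichment(affected, unaffected)
best_marker = max(scores, key=scores.get)
print(best_marker, round(scores[best_marker], 2))
```

In this toy dataset only the chromosome 7 marker shows the pattern expected of a linked recessive mutation: the mutant strain allele is nearly fixed in the affected pool and present at roughly one third in the nonaffected pool.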

Causative mutation identification directly from exome sequencing data without the need for a separate genetic mapping step was first reported in 2012 (Andrews et al. Open Biol. 2, 120061; Sun et al. G3 (Bethesda). 2, 143-150), and paved the way for development of the automated mapping process now used in our laboratory for instant mutation identification (Wang et al. Proc.Natl.Acad.Sci.U.S.A. 112, E440-9).  With this technology, it is possible to rationally calculate genome saturation for specific screens, to detect associations between mutations and lethality, and to conduct screens for complex phenotypes or the suppression of disease (Wang et al. Nat.Commun. 9, 441).


Breeding ENU-mutagenized mice for phenotypic screening

G3 mice carrying homozygous and heterozygous mutations induced by ENU in germ cells of G0 male C57BL/6J mice are generated using one of two breeding schemes (Figure 2). Mutagenized G0 males are bred either to C57BL/6J females or to G0’ females carrying ENU-induced mutations inherited from their fathers. The resulting G1 males are crossed to C57BL/6J females to produce G2 mice. G1 males are then bred to G2 females over about 12 weeks to produce ~50 G3 offspring per G1 x G2 pedigree.

Identification of causative mutations concurrent with phenotypic screening requires the determination of genotype at all mutation sites in every G3 mouse prior to phenotypic assessment.  This is accomplished by exome sequencing of the G1 male progenitor of each pedigree to identify all coding and splice site mutations that could possibly be present in the G3 mice. The G3 mice are then genotyped at each mutation site before phenotypic screening; mutations are also validated by genotyping the G1 and G2 mice.  REF (homozygous for C57BL/6J reference allele), HET (heterozygous for reference allele and variant allele), or VAR (homozygous for variant allele) genotypes are registered for each mutation site in each mouse, and data are stored in the Mutagenetix database for analysis with phenotypic data.
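The REF, HET, and VAR proportions expected among G3 mice follow from Mendelian segregation, and a short calculation makes them explicit. The sketch below rests on assumptions not stated above: the mutation is autosomal, the G1 male is heterozygous for it, the G2 females are his daughters by a C57BL/6J female, and genotype has no effect on viability. The function name is hypothetical.

```python
from fractions import Fraction


def offspring_distribution(parent_a, parent_b):
    """Genotype distribution among offspring of two independently chosen parents.

    Each parent is a dict mapping REF/HET/VAR to its probability.
    """
    # Probability that a parent of each genotype transmits the variant allele
    transmit = {"REF": Fraction(0), "HET": Fraction(1, 2), "VAR": Fraction(1)}
    p_a = sum(prob * transmit[g] for g, prob in parent_a.items())
    p_b = sum(prob * transmit[g] for g, prob in parent_b.items())
    return {
        "REF": (1 - p_a) * (1 - p_b),
        "HET": p_a * (1 - p_b) + (1 - p_a) * p_b,
        "VAR": p_a * p_b,
    }


g1_male = {"HET": Fraction(1)}           # assumed heterozygous founder
wild_type_female = {"REF": Fraction(1)}  # C57BL/6J female

g2 = offspring_distribution(g1_male, wild_type_female)
g3 = offspring_distribution(g1_male, g2)
print(g3)
```

Under these assumptions one G3 mouse in eight is expected to be VAR for any given G1 mutation. A marked shortfall of VAR mice relative to this expectation is the kind of signal a test for preweaning lethality would look for.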

Once ~50 genotyped G3 mice from a single pedigree are of age for screening, they are tested in phenotypic screens (see Research Areas). If possible, all mice from a pedigree are screened in the same experiment on the same day to minimize phenotypic differences due to experimental variability.

Figure 2. Breeding schemes for generation of G3 mice. The schemes differ in whether the female mouse mated to the G0 male carries (A) or does not carry (B) ENU-induced mutations from her father. If she does (G0 x G0’ mating), G3 descendants carry a greater number of mutations, including X-linked mutations, than G3 descendants of G0 x C57BL/6J matings, which carry no X-linked mutations. Asterisks indicate mutations originating from the G0 male (red) or G0’ female (blue), with larger asterisks representing initial germline transmission.

Automated mutation identification

Identification of causative mutations depends on purpose-built software that performs automated linkage analyses (Linkage Analyzer), and a sophisticated display platform that permits searching and presentation of the resulting data (Linkage Explorer). 

Linkage Analyzer

Analyses of genotype and phenotype data are automatically performed using Linkage Analyzer, a software program designed and written in our laboratory to test the probability of single locus linkage to phenotypes using recessive, semidominant (additive), and dominant transmission models, and to assess the probability of preweaning lethal effects due to single locus mutations.  Linkage Analyzer detects phenovariance when it is statistically linked to genotype as determined by a linear regression model.  For each mutation, the null hypothesis of nonlinkage is tested assuming a normal or a binomial distribution of phenotype scores for quantitative and qualitative phenotypes, respectively. The P value of association between genotype and phenotype is calculated using a likelihood ratio test from a generalized linear model or generalized linear mixed effect model.
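The three transmission models differ only in how genotype is coded before regression, which a small example can illustrate. The sketch below is not Linkage Analyzer itself. It covers only the simplest case described above (a quantitative phenotype assumed to be normally distributed, no random effects) and uses the asymptotic chi-square distribution with one degree of freedom for the likelihood ratio statistic. The function name, the genotype coding table, and the phenotype values are illustrative assumptions.

```python
import math

# Assumed numeric coding of genotype under each transmission model
MODEL_CODING = {
    "recessive": {"REF": 0.0, "HET": 0.0, "VAR": 1.0},
    "additive": {"REF": 0.0, "HET": 1.0, "VAR": 2.0},
    "dominant": {"REF": 0.0, "HET": 1.0, "VAR": 1.0},
}


def linkage_p_value(genotypes, phenotypes, model):
    """Likelihood ratio test of phenotype ~ genotype against an intercept-only null."""
    x = [MODEL_CODING[model][g] for g in genotypes]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(phenotypes) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    rss_null = sum((yi - mean_y) ** 2 for yi in phenotypes)
    if sxx == 0 or rss_null == 0:
        return 1.0  # no variation in coded genotype or phenotype; nothing to test
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, phenotypes))
    rss_alt = rss_null - sxy ** 2 / sxx  # residual sum of squares of the fitted line
    if rss_alt <= 0:
        return 0.0  # perfect fit
    lr_stat = n * math.log(rss_null / rss_alt)
    # Survival function of the chi-square distribution with 1 degree of freedom
    return math.erfc(math.sqrt(lr_stat / 2.0))


# Made-up pedigree: only VAR mice show a reduced phenotype score
genotypes = ["REF"] * 4 + ["HET"] * 6 + ["VAR"] * 3
phenotypes = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 9.7, 10.1, 9.9, 10.0, 6.1, 5.8, 6.3]

results = {m: linkage_p_value(genotypes, phenotypes, m) for m in MODEL_CODING}
for model, p in results.items():
    print(model, p)
```

Because only the VAR mice deviate in this toy dataset, the recessive coding gives the smallest P value of the three, which is how the comparison across models points to a mode of inheritance.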

Linkage Analyzer operates at a scalable speed that depends on the capabilities of the cluster on which it is run. As presently configured, it processes data faster than we can produce mutations and screening data, and it delivers linkage assessments in real time. When phenotypic data are uploaded, the genetic cause of any phenovariance in the dataset is usually known within a few minutes. The production and phenotypic analysis of G3 mutant mice are thus the rate-limiting steps in the forward genetic approach used in our laboratory.

Linkage Explorer

For each variant phenotype identified by phenotypic screening, Linkage Analyzer performs automated computation of P values of association between genotype and phenotype for every mutation in the pedigree using all three transmission models. These data are accessible through the Linkage Explorer application.

Single pedigree linkage analysis

Linkage Explorer may be used to search for phenotypes (among those screened in the Center for the Genetics of Host Defense) linked to a gene of interest, or conversely, for mutated genes linked to a phenotype of interest (Video 1). Several parameters may be specified to target analyses to specific genes, phenotypes, pedigrees, mutation types or effects, or to limit the results to genotype-phenotype associations in which a specified number of linkage peaks was found in the Manhattan plot (Table 1). Other parameters set the stringency of criteria for linkage (Table 1). Three settings dramatically alter the sensitivity and specificity of automated mapping assignments: the number of mice with VAR genotype tested, the P value cutoff, and the requirement for both raw and normalized datasets to reveal linkage in a given screen. Tightening these criteria raises specificity at the cost of sensitivity, and relaxing them does the reverse.

Table 1. Parameters that may be specified in Linkage Explorer

Video 1. Linkage Explorer example search for phenotypes linked to mutations of Ptprc. The gene name is entered and the search is submitted using default search parameters (single linkage analysis; no specification of screen or phenotype; no specification of mutation type or effect; G3 ≥ 20, VAR ≥ 2, REF ≥ 2; P ≤ 0.05 with Bonferroni correction; and results displayed only for implicated genes). Search results are displayed in a table (see Figure 3) and clicking any P value opens the Manhattan plot for the corresponding phenotype. Right-clicking any data point opens a scatter plot showing the phenotypic performance of mice REF, HET, or VAR for the selected mutation.
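The default search parameters listed in the Video 1 caption amount to a simple filter, sketched below. The function and argument names are hypothetical, and the sketch assumes that the Bonferroni correction divides the significance level by the number of mutations tested in the pedigree; the exact denominator used by Linkage Explorer is not specified here.

```python
import math


def passes_default_filters(n_g3, n_var, n_ref, p_value, n_mutations_tested,
                           alpha=0.05, min_g3=20, min_var=2, min_ref=2):
    """Apply pedigree size, genotype count, and Bonferroni-corrected P value cutoffs."""
    bonferroni_cutoff = alpha / n_mutations_tested  # assumed correction denominator
    return (n_g3 >= min_g3
            and n_var >= min_var
            and n_ref >= min_ref
            and p_value <= bonferroni_cutoff)


def linkage_score(p_value):
    """Score plotted on the Manhattan plot: -log10 of the P value."""
    return -math.log10(p_value)


# Made-up example: 25 G3 mice, 3 VAR, 5 REF, 60 mutations tested in the pedigree
print(passes_default_filters(25, 3, 5, 0.0005, 60))  # cutoff is 0.05 / 60
print(linkage_score(0.001))
```

Raising `min_var` or lowering `alpha` in this sketch corresponds to a more stringent search that returns fewer, more reliable associations.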

For each search, Linkage Explorer displays in a results table the P value of association calculated by Linkage Analyzer under the three transmission models, with a clickable link to the corresponding Manhattan plot for each inheritance mode, from which raw or normalized phenotypic data for mice of REF, HET, or VAR genotypes can be accessed in a table or scatter plot (Figure 3). Linkage Explorer also displays the mutation coordinate, mutation type, phenotypic screen, numbers of mice with REF, HET, or VAR genotypes, and precalculated information about each implicated gene and mutation, including the predicted effect of the mutation as determined by PolyPhen-2 (Adzhubei et al. Nat.Methods. 7, 248-249) or by a splice site prediction program. The “candidate status” is determined by the Candidate Explorer program (see below) and indicates one of four potential ratings of the likelihood that a mutation would be validated as causative (excellent, good, potential, and not good).

Figure 3. Example of Linkage Explorer results table from a search for phenotypes linked to mutations of Ptprc. (A) A nonsense mutation of Ptprc (see Video 1) at position 13806401 bp on Chr. 1 was associated with several phenotypes; for example, it scored in the “FACS CD4+ T cells” screen in both additive and recessive models of inheritance. P values for additive, recessive, and dominant transmission models are given for each mutation with at least one statistically significant P value of genotype-phenotype association; significant P values are shown in red. (B) Clicking the P value opens the Manhattan plot showing the linkage scores (-log10[P value]) of all mutations in the pedigree for the phenotype in question (here CD4+ T cell frequency). In the Manhattan plot, mousing over a data point shows the gene name, mutation coordinate, distance to next closest mutation, and P value. (C) Right-clicking the data point opens a scatter plot of phenotypic data (raw and normalized) graphed for each genotype and for wild type non-mutagenized C57BL/6J mice; here, VAR mice had reduced frequencies of circulating CD4+ T cells. Left-clicking the data point opens the Incidental Mutation page. Raw data are accessible from the “data set” link in the scatter plot.

Combining pedigrees for linkage analysis

Approximately 3.1% of ENU-induced mutations in our colony are shared between two or more pedigrees, inherited from a common ancestral G0 male. To date, multiple alleles have been identified for approximately 87% of genes with validated mutations. Because the genotypes at all mutation sites in all G3 mice are known, combining pedigrees with identical or non-identical allelic mutations to make “superpedigrees” is possible. This increases the power to detect linkage, especially for weak or low penetrance phenotypes, and can help resolve a causative mutation where causative and non-causative mutations are closely linked. Relative to single pedigree analysis, combining pedigrees in this manner can greatly increase the strength of a genotype-phenotype association, or eliminate it from consideration. As data accumulate from many pedigrees over time, the power to implicate or exonerate genes from participation in defined biological processes increases.

Superpedigrees are automatically generated and analyzed by Linkage Analyzer whenever allelic mutations and phenotypic data are added to the database. “Gene-based” superpedigrees consist of pedigrees containing non-identical mutations of the same gene, whereas “position-based” superpedigrees consist of pedigrees containing identical mutations of the same gene. Superpedigree linkage data are accessed via Linkage Explorer, and searches can be restricted by specifying the gene(s), phenotype(s), pedigree(s), P value cutoff, number of VAR mice tested, and/or application of the “raw+normalized” restriction (Video 2). The results table output by Linkage Explorer is similar to the results table for single pedigree linkage data, except that the number of alleles and number of pedigrees in the superpedigree are given in place of a single mutation coordinate. The minimum PolyPhen-2 score among the set of alleles in the superpedigree is also provided. P values are linked to the Manhattan plot for each transmission mode, from which raw or normalized phenotypic data can be accessed in a table or scatter plot (Figure 4). Phenotypic data in scatter plots are color-coded by pedigree so the contribution of each pedigree to a particular gene-phenotype association can be easily observed.
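The distinction between gene-based and position-based superpedigrees can be made concrete with a small grouping routine. This is a sketch under stated assumptions rather than the Linkage Analyzer implementation: the record layout, function name, gene names, pedigree names, and coordinates are invented, and the rule that a superpedigree must span at least two pedigrees is an assumption.

```python
from collections import defaultdict


def build_superpedigrees(mutations):
    """Group pedigrees into gene-based and position-based superpedigrees.

    mutations: iterable of (pedigree, gene, chromosome, position) records.
    """
    sites_by_gene = defaultdict(lambda: defaultdict(set))
    for pedigree, gene, chrom, pos in mutations:
        sites_by_gene[gene][(chrom, pos)].add(pedigree)

    gene_based, position_based = {}, {}
    for gene, sites in sites_by_gene.items():
        # Position-based: the identical mutation found in two or more pedigrees
        for site, pedigrees in sites.items():
            if len(pedigrees) >= 2:
                position_based[(gene, site)] = sorted(pedigrees)
        # Gene-based: non-identical mutations of one gene across pedigrees
        all_pedigrees = set().union(*sites.values())
        if len(sites) >= 2 and len(all_pedigrees) >= 2:
            gene_based[gene] = sorted(all_pedigrees)
    return gene_based, position_based


# Hypothetical records (names and coordinates are placeholders)
records = [
    ("R0001", "GeneA", "chr1", 1000),
    ("R0002", "GeneA", "chr1", 2500),
    ("R0003", "GeneA", "chr1", 4000),
    ("R0004", "GeneB", "chr5", 700),
    ("R0005", "GeneB", "chr5", 700),
    ("R0006", "GeneC", "chr9", 300),
]
gene_based, position_based = build_superpedigrees(records)
print(gene_based)
print(position_based)
```

Here the three different GeneA alleles form a gene-based superpedigree, the shared GeneB mutation forms a position-based superpedigree, and GeneC, seen once, forms neither.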

Video 2. Linkage Explorer example search for mutations implicated in DSS-induced colitis by gene-based superpedigree linkage analysis. The screen name (DSS Day 10) is entered, and a P value cutoff of 0.001 calculated using both raw and normalized data is applied to increase the stringency of the search. In addition, the results display is limited to mutations found in Manhattan plots with only one peak above the Bonferroni correction line, with this peak at least one log greater than the next highest peak. The search is submitted with no further restrictions (the default). Search results are displayed in a table (see Figure 4) and clicking any P value opens the Manhattan plot for the corresponding phenotype. Right-clicking any data point opens a scatter plot showing the phenotypic performance of mice REF, HET, or VAR for mutations of the selected gene included in the superpedigree. Data points can be sized to reflect the damage probability of the mutations (assessed by PolyPhen-2), and the Y-axis range can be reduced to more easily view the distribution of the data. Individual pedigree Manhattan plots are accessible from links above the superpedigree Manhattan plot.
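The peak restriction used in the Video 2 search (exactly one peak above the Bonferroni correction line, at least one log higher than the next peak) can be expressed as a short check. The sketch simplifies in two ways that should be read as assumptions: every mutation is treated as its own peak, whereas closely linked mutations may in practice form a single peak, and the Bonferroni line is placed at 0.05 divided by the number of mutations in the plot. The function name is hypothetical.

```python
import math


def has_single_clear_peak(p_values, alpha=0.05, min_gap_logs=1.0):
    """True if exactly one linkage score exceeds the Bonferroni line and it
    leads the next highest score by at least min_gap_logs."""
    scores = sorted((-math.log10(p) for p in p_values), reverse=True)
    bonferroni_line = -math.log10(alpha / len(p_values))
    n_above = sum(1 for s in scores if s > bonferroni_line)
    if n_above != 1:
        return False
    if len(scores) == 1:
        return True
    return scores[0] - scores[1] >= min_gap_logs


# Made-up P values for four mutations in one Manhattan plot
print(has_single_clear_peak([1e-6, 0.05, 0.3, 0.5]))    # one dominant peak
print(has_single_clear_peak([1e-6, 1e-5, 0.3, 0.5]))    # two peaks above the line
print(has_single_clear_peak([0.005, 0.02, 0.3, 0.5]))   # top peak leads by under one log
```

A restriction of this kind favors associations in which one mutation clearly stands out, at the cost of discarding pedigrees where two linked mutations score similarly.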

Figure 4. Example of Linkage Explorer results table for gene-based superpedigree linkage analysis. (A) A search (Video 2) for mutations implicated in DSS-induced colitis by superpedigree linkage analysis yielded Slc5a2. Three pedigrees and three alleles (null or missense) were contained in the gene-based superpedigree, and a total of 14 VAR, 50 HET, and 41 REF mice were analyzed. P values for additive, recessive, and dominant transmission models are given, with significant P values shown in red. (B) Clicking the P value opens the Manhattan plot showing the linkage scores (-log10[P value]) of all mutations in the superpedigree for the phenotype in question. In the Manhattan plot, mousing over a data point shows the gene name, P value, and coordinates of the individual mutations in the superpedigree (here, three mutations). (C) Right-clicking the data point opens a scatter plot of phenotypic data (raw and normalized) graphed for each genotype and for wild type non-mutagenized C57BL/6J mice; data points are color coded by pedigree. Left-clicking the data point opens a table listing the individual mutations in the superpedigree (not shown). (D) Individual pedigree Manhattan plots are accessible from the superpedigree Manhattan plot. Here, weak linkage is observed only in pedigree R4836, but combining the three pedigrees greatly increased the gene-phenotype association. Raw data are accessible from the “data set” links in the scatter plot.

Confirmation of causative mutations

Confirmation of causative mutations depends on duplication of the mutant phenotype by a second allele, which may be generated by CRISPR/Cas9 gene targeting.

Candidate Explorer

The Candidate Explorer program aids in the identification of borderline candidate mutations, i.e., those which may show weak phenotypic effects or relatively wider variance of the measured phenotype.  Based on previous experience with CRISPR validation, Candidate Explorer rates new mutations as “excellent,” “good,” “potential,” or “not good” candidates for CRISPR validation.  The ratings are displayed on the Phenotypic Mutations list, on individual phenotypic mutation records, and on Linkage Explorer search results.  Ratings are based on a variety of criteria such as pedigree size, number of homozygous G3 mice in the pedigree, phenotypic screen, predicted deleteriousness of the mutation (by PolyPhen-2), variance of the measured phenotypic data, P value, and many others, each of which may be differentially weighted by the program.  Candidate Explorer is trained using the Random Forest machine learning algorithm on an ongoing basis as new mutation and phenotypic data are acquired.  Two types of training sets have been tested:  mutations verified to cause phenotype by analysis of mice with CRISPR-targeted alleles (“CRISPR-verified”); or CRISPR-verified mutations plus mutations presumed to cause phenotype based on published reports of gene function, a strong or distinctive phenotype, and the predicted effect of the mutation as surveyed by a human researcher (“literature-verified”).  The performance of Candidate Explorer is similar whether trained using CRISPR-verified or CRISPR+literature-verified mutations, and all mutations are automatically assessed using Candidate Explorer trained on CRISPR-verified mutations.  The precision, accuracy, and recall of the program were assessed for candidate causative mutations with P ≤ 0.002 (for both raw and normalized phenotype data), from pedigrees with at least 20 G3 mice and REF ≥ 4 and VAR ≥ 3 (Table 2).  
Of 151 alleles generated by CRISPR-mediated gene targeting and tested in 344 phenotypic assays, 90 alleles were confirmed and 61 alleles were excluded from causation.  For mutations rated “good” or better, Candidate Explorer demonstrated 96.25% precision, 89.40% accuracy, and 85.56% recall.
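For readers who want to see how the three performance measures relate, the sketch below searches for confusion matrix counts that reproduce the reported percentages. It assumes, without confirmation from the text, that the measures were computed per allele over the 151 CRISPR-tested alleles, with a rating of “good” or better counted as a positive prediction and CRISPR confirmation as the true outcome. The underlying counts are inferred here, not reported, and the function name is hypothetical.

```python
def consistent_confusion_matrices(n_confirmed, n_excluded,
                                  precision_pct, accuracy_pct, recall_pct,
                                  tolerance=0.005):
    """Find (TP, FP, FN, TN) counts matching percentages rounded to two decimals."""
    total = n_confirmed + n_excluded
    matches = []
    for tp in range(n_confirmed + 1):
        fn = n_confirmed - tp
        for fp in range(n_excluded + 1):
            tn = n_excluded - fp
            if tp + fp == 0:
                continue  # precision undefined when nothing is predicted positive
            precision = 100.0 * tp / (tp + fp)
            recall = 100.0 * tp / n_confirmed
            accuracy = 100.0 * (tp + tn) / total
            if (abs(precision - precision_pct) <= tolerance
                    and abs(recall - recall_pct) <= tolerance
                    and abs(accuracy - accuracy_pct) <= tolerance):
                matches.append((tp, fp, fn, tn))
    return matches


matches = consistent_confusion_matrices(90, 61, 96.25, 89.40, 85.56)
print(matches)
```

Under these assumptions a single combination fits: 77 confirmed alleles rated “good” or better, 3 excluded alleles rated “good” or better, 13 confirmed alleles rated lower, and 58 excluded alleles rated lower. That gives precision 77/80, recall 77/90, and accuracy 135/151.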


Table 2. Performance measures for Candidate Explorer
Retrieved 2/6/2018