Introduction

To make progress in breeding, populations should have a favorable mean and high genetic variance (Bernardo 2010). These two parameters can be combined into a single measure called the usefulness criterion (Schnell and Utz 1975), visualized in Figure 1.

Figure 1. Visualization of the mean, genetic variance, and superior progeny mean of a single population.

Ideally, breeders would identify the set of parent combinations that, when realized in a cross, would give rise to populations meeting these requirements. PopVar is a package that uses phenotypic and genomewide marker data on a set of candidate parents to predict the mean, genetic variance, and superior progeny mean in bi-parental or multi-parental populations. Thre package also contains functionality for performing cross-validation to determine the suitability of different statistical models. More details are available in Mohammadi, Tiede, and Smith (2015). A dataset think_barley is included for reference and examples.

Installation

You can install the released version of PopVar from CRAN with:

install.packages("PopVar")

And the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("UMN-BarleyOatSilphium/PopVar")

Functions

Below is a description of the functions provided in PopVar:

Function	Description
`pop.predict`	Uses simulations to make predictions in recombinant inbred line populations; can internally perform cross-validation for model selections; can be quite slow.
`pop.predict2`	Uses deterministic equations to make predictions in populations of complete or partial selfing and with or without the induction of doubled haploids; is much faster than `pop.predict`; does not perform cross-validation or model selection internally.
`pop_predict2`	Has the same functionality as `pop.predict2`, but accepts genomewide marker data in a simpler matrix format.
`x.val`	Performs cross-validation to estimate model performance.
`mppop.predict`	Uses deterministic equations to make predictions in 2- or 4-way populations of complete or partial selfing and with or without the induction of doubled haploids; does not perform cross-validation or model selection internally.
`mpop_predict2`	Has the same functionality as `mppop.predict`, but accepts genomewide marker data in a simpler matrix format.

Examples

Below are some example uses of the functions in PopVar:

# Load the package
library(PopVar)

# Load the example data
data("think_barley", package = "PopVar")

Predictions using simulated populations

The code below simulates a single population of 1000 individuals for each of 150 crosses. For the sake of speed, the marker effects are predicted using RR-BLUP and no cross-validation is performed.

out <- pop.predict(G.in = G.in_ex, y.in = y.in_ex, map.in = map.in_ex,
                   crossing.table = cross.tab_ex,
                   nInd = 1000, nSim = 1, 
                   nCV.iter = 1, models = "rrBLUP")
#> Warning in read.cross.csv(dir, file, na.strings, genotypes, estimate.map, : There is no genotype data!
#> Warning in summary.cross(cross): Some markers at the same position on chr
#> 1,2,3,4,5,6,7; use jittermap().
#> Warning in summary.cross(cross): Strange genotype pattern on chr 7.

The function returns a list, one element of which is called predictions. This element is itself a list of matrices containing the predictions for each trait. They can be combined as such:

predictions1 <- lapply(X = out$predictions, FUN = function(x) {
  x1 <- as.data.frame(apply(X = x, MARGIN = 2, FUN = unlist), stringsAsFactors = FALSE)
  cbind(x1[,c("Par1", "Par2")], sapply(X = x1[,-1:-2], as.numeric)) 
})

# Display the first few lines of the predictions for grain yield
knitr::kable(head(predictions1$Yield_param.df))

Par1	Par2	midPar.Pheno	midPar.GEBV	pred.mu	pred.mu_sd	pred.varG	pred.varG_sd	mu.sp_low	mu.sp_high	low.resp_FHB	low.resp_DON	low.resp_Height	high.resp_FHB	high.resp_DON	high.resp_Height	cor_w/_FHB	cor_w/_DON	cor_w/_Height
FEG66-08	MN97-125	99.200	100.65607	100.61149	NA	8.011302	NA	95.48177	105.3897	23.25261	23.81642	78.66704	25.18887	25.24296	75.70263	0.3137508	0.2811108	-0.3541447
MN99-71	FEG90-31	NaN	99.17635	99.12289	NA	5.576896	NA	94.91185	103.1419	21.48615	23.49776	78.04245	24.72784	24.81527	75.70203	0.4713201	0.1449542	-0.1898267
MN96-141	FEG183-52	NaN	101.28761	101.32443	NA	5.776544	NA	97.13504	105.3802	23.65552	23.71714	79.74646	27.50000	28.69740	75.59053	0.7472909	0.7654580	-0.5681989
MN99-02	FEG183-52	NaN	100.78451	100.87503	NA	9.855819	NA	95.32431	106.2812	25.66253	23.73906	75.19602	26.29520	26.78093	73.61646	0.1199405	0.4524297	-0.1786077
FEG99-10	FEG148-56	NaN	98.68505	98.70340	NA	4.093935	NA	95.27910	102.2099	19.55343	20.17620	85.78418	22.91372	24.18288	80.15950	0.5402183	0.6007480	-0.5570587
MN99-62	MN01-46	105.775	102.29724	102.22422	NA	5.499448	NA	98.43974	106.1892	26.06697	28.34679	73.17671	26.41473	28.45262	73.96141	0.1645986	-0.0970906	0.3382690

Predictions using deterministic equations

Generating predictions via simulated populations can become computationally burdensome when many thousands or hundreds of thousands of crosses are possible. Fortunately, deterministic equations are available to generate equivalent predictions in a fraction of the time. These equations are provided in the pop.predict2 and pop_predict2 functions.

The pop.predict2 function takes arguments in the same format as pop.predict. We have eliminated the arguments for marker filtering and imputation and cross-validation, as the pop.predict2 function does not support these steps. (You may continue to conduct cross-validation using the x.val function.) Therefore, the genotype data input for pop.predict2 must not contain any missing data. Further, these predictions assume fully inbred parents, so marker genotypes must only be coded as -1 or 1. The data G.in_ex_imputed contains genotype data that is formatted properly.

Below is an example of using the pop.predict2 function:

out2 <- pop.predict2(G.in = G.in_ex_imputed, y.in = y.in_ex, map.in = map.in_ex,
                     crossing.table = cross.tab_ex, models = "rrBLUP")

knitr::kable(head(subset(out2, trait == "Yield")))

	parent1	parent2	trait	pred_mu	pred_varG	pred_musp_low	pred_musp_high	cor_w_FHB	cor_w_DON	cor_w_Yield	cor_w_Height	pred_cor_musp_low_FHB	pred_cor_musp_low_DON	pred_cor_musp_low_Yield	pred_cor_musp_low_Height	pred_cor_musp_high_FHB	pred_cor_musp_high_DON	pred_cor_musp_high_Yield	pred_cor_musp_high_Height
3	FEG66-08	MN97-125	Yield	100.41166	7.768362	95.52021	105.3031	0.3412831	0.3303378	NA	-0.4004154	22.27035	23.76433	NA	78.84012	24.96750	25.91641	NA	75.11838
7	MN99-71	FEG90-31	Yield	98.98546	5.298105	94.94591	103.0250	0.4959745	0.2776666	NA	-0.1612311	21.86563	23.41697	NA	77.04669	24.85938	25.18159	NA	75.75153
11	MN96-141	FEG183-52	Yield	101.07137	5.571655	96.92884	105.2139	0.6132612	0.6961326	NA	-0.4603399	24.36629	23.67733	NA	79.59345	27.88028	28.40806	NA	75.16009
15	MN99-02	FEG183-52	Yield	100.82967	9.499737	95.42053	106.2388	0.0249962	0.4211875	NA	-0.1929275	26.15174	23.73807	NA	74.76672	26.29180	26.60344	NA	72.86270
19	FEG99-10	FEG148-56	Yield	98.01593	4.121803	94.45292	101.5789	0.5959335	0.6401323	NA	-0.5080571	19.67552	19.58928	NA	85.81854	23.38585	24.23917	NA	80.69997
23	MN99-62	MN01-46	Yield	102.51094	4.673197	98.71709	106.3048	0.0755599	-0.0827658	NA	0.4010907	26.07342	28.37124	NA	72.67125	26.20580	28.16102	NA	74.51340

Note that the output of pop.predict2 is no longer a list, but a data frame containing the combined predictions for all traits.

The formatting requirements of G.in for pop.predict and pop.predict2 are admittedly confusing. Marker genotype data is commonly stored as a n x p matrix, where n is the number of entries and p the number of markers. The function pop_predict2 accommodates this general marker data storage. Here is an example:

out3 <- pop_predict2(M = G.in_ex_mat, y.in = y.in_ex, map.in = map.in_ex,
                     crossing.table = cross.tab_ex, models = "rrBLUP")

knitr::kable(head(subset(out2, trait == "Yield")))

	parent1	parent2	trait	pred_mu	pred_varG	pred_musp_low	pred_musp_high	cor_w_FHB	cor_w_DON	cor_w_Yield	cor_w_Height	pred_cor_musp_low_FHB	pred_cor_musp_low_DON	pred_cor_musp_low_Yield	pred_cor_musp_low_Height	pred_cor_musp_high_FHB	pred_cor_musp_high_DON	pred_cor_musp_high_Yield	pred_cor_musp_high_Height
3	FEG66-08	MN97-125	Yield	100.41166	7.768362	95.52021	105.3031	0.3412831	0.3303378	NA	-0.4004154	22.27035	23.76433	NA	78.84012	24.96750	25.91641	NA	75.11838
7	MN99-71	FEG90-31	Yield	98.98546	5.298105	94.94591	103.0250	0.4959745	0.2776666	NA	-0.1612311	21.86563	23.41697	NA	77.04669	24.85938	25.18159	NA	75.75153
11	MN96-141	FEG183-52	Yield	101.07137	5.571655	96.92884	105.2139	0.6132612	0.6961326	NA	-0.4603399	24.36629	23.67733	NA	79.59345	27.88028	28.40806	NA	75.16009
15	MN99-02	FEG183-52	Yield	100.82967	9.499737	95.42053	106.2388	0.0249962	0.4211875	NA	-0.1929275	26.15174	23.73807	NA	74.76672	26.29180	26.60344	NA	72.86270
19	FEG99-10	FEG148-56	Yield	98.01593	4.121803	94.45292	101.5789	0.5959335	0.6401323	NA	-0.5080571	19.67552	19.58928	NA	85.81854	23.38585	24.23917	NA	80.69997
23	MN99-62	MN01-46	Yield	102.51094	4.673197	98.71709	106.3048	0.0755599	-0.0827658	NA	0.4010907	26.07342	28.37124	NA	72.67125	26.20580	28.16102	NA	74.51340

Benchmarking and comparisons

The code below compares the functions pop.predict and pop.predict2 with respect to computation time and results:

time1 <- system.time({
  capture.output(pop.predict.out <- pop.predict(
    G.in = G.in_ex_imputed, y.in = y.in_ex, map.in = map.in_ex, crossing.table = cross.tab_ex,
    nInd = 1000, nSim = 1, nCV.iter = 1, models = "rrBLUP"))
})
#> Warning in read.cross.csv(dir, file, na.strings, genotypes, estimate.map, : There is no genotype data!
#> Warning in summary.cross(cross): Some markers at the same position on chr
#> 1,2,3,4,5,6,7; use jittermap().
#> Warning in summary.cross(cross): Strange genotype pattern on chr 7.

time2 <- system.time({pop.predict2.out <- pop.predict2(
  G.in = G.in_ex_imputed, y.in = y.in_ex, map.in = map.in_ex,
  crossing.table = cross.tab_ex,model = "rrBLUP")})

# Print the time (seconds) required for each function.
c(pop.predict = time1[[3]], pop.predict2 = time2[[3]])
#>  pop.predict pop.predict2 
#>       24.919        0.459

Plot results

predictions1 <- lapply(X = pop.predict.out$predictions, FUN = function(x) {
  x1 <- as.data.frame(apply(X = x, MARGIN = 2, FUN = unlist), stringsAsFactors = FALSE)
  cbind(x1[,c("Par1", "Par2")], sapply(X = x1[,-1:-2], as.numeric))
})

pop.predict.out1 <- predictions1$Yield_param.df[,c("Par1", "Par2", "pred.varG")]
pop.predict2.out1 <- subset(pop.predict2.out, trait == "Yield", c(parent1, parent2, pred_varG))

toplot <- merge(pop.predict.out1, pop.predict2.out1, by.x = c("Par1", "Par2"),
                by.y = c("parent1", "parent2"))

plot(pred.varG ~ pred_varG, toplot,
     xlab = "pop.predict2", ylab = "pop.predict",
     main = "Comparsion of predicted genetic variance")

Multi-parent populations

PopVar also includes functions for predicting the mean, genetic variance, and superior progeny mean of multi-parent populations. The mppop.predict function takes the same inputs as pop.predict or pop.predict2, and the mppop_predict2 function takes the same inputs as pop_predict2. Both require the additional argument n.parents, which will determine whether the populations are formed by 2- or 4-way matings (support for 8-way populations is forthcoming.)

Below is an example of using the mppop.predict function:

# Generate predictions for all possible 4-way crosses of 10 sample parents
sample_parents <- sample(unique(unlist(cross.tab_ex)), 10)

mp_out <- mppop.predict(G.in = G.in_ex_imputed, y.in = y.in_ex, map.in = map.in_ex,
                        parents = sample_parents, n.parents = 4, models = "rrBLUP")

knitr::kable(head(subset(mp_out, trait == "Yield")))

	parent1	parent2	parent3	parent4	trait	pred_mu	pred_varG	pred_musp_low	pred_musp_high	FHB	DON	Yield	Height	pred_cor_musp_low_FHB	pred_cor_musp_low_DON	pred_cor_musp_low_Yield	pred_cor_musp_low_Height	pred_cor_musp_high_FHB	pred_cor_musp_high_DON	pred_cor_musp_high_Yield	pred_cor_musp_high_Height
3	FEG154-47	FEG160-03	FEG18-20	FEG184-24	Yield	100.14668	6.298749	95.74214	104.5512	0.3986592	0.4607936	NA	-0.1218653	17.83454	18.50832	NA	83.02656	20.27529	21.12072	NA	81.95248
7	FEG154-47	FEG160-03	FEG18-20	FEG90-22	Yield	99.78611	6.009182	95.48400	104.0882	0.2943902	0.5076860	NA	-0.1175733	18.39459	19.70797	NA	81.21060	20.14798	22.30845	NA	80.23725
11	FEG154-47	FEG160-03	FEG18-20	M112	Yield	100.49709	6.704024	95.95306	105.0411	0.3743612	0.4128489	NA	-0.2126614	18.66195	20.91389	NA	80.49162	21.55229	24.01300	NA	78.15792
15	FEG154-47	FEG160-03	FEG18-20	MN01-43	Yield	101.37372	6.667078	96.84224	105.9052	0.4568892	0.5722436	NA	-0.3261032	18.65720	20.48511	NA	81.72813	22.25172	24.76814	NA	78.42352
19	FEG154-47	FEG160-03	FEG18-20	MN04-11	Yield	102.11854	8.538906	96.99023	107.2468	0.5259837	0.6042110	NA	-0.4288760	18.10655	20.05849	NA	81.78266	22.51856	24.78705	NA	77.03538
23	FEG154-47	FEG160-03	FEG18-20	MN96-155	Yield	102.21162	11.650920	96.22126	108.2020	0.6115019	0.6399380	NA	-0.4850788	18.49504	20.61557	NA	82.10847	23.74206	25.30627	NA	76.96529

Below is an example of using the mppop_predict2 function:

# Generate predictions for all possible 4-way crosses of 10 sample parents
mp_out2 <- mppop_predict2(M = G.in_ex_mat, y.in = y.in_ex, map.in = map.in_ex,
                          parents = sample_parents, n.parents = 4, models = "rrBLUP")

knitr::kable(head(subset(mp_out2, trait == "Yield")))

	parent1	parent2	parent3	parent4	trait	pred_mu	pred_varG	pred_musp_low	pred_musp_high	FHB	DON	Yield	Height	pred_cor_musp_low_FHB	pred_cor_musp_low_DON	pred_cor_musp_low_Yield	pred_cor_musp_low_Height	pred_cor_musp_high_FHB	pred_cor_musp_high_DON	pred_cor_musp_high_Yield	pred_cor_musp_high_Height
3	FEG154-47	FEG160-03	FEG18-20	FEG184-24	Yield	100.14668	6.298749	95.74214	104.5512	0.3986592	0.4607936	NA	-0.1218653	17.83454	18.50832	NA	83.02656	20.27529	21.12072	NA	81.95248
7	FEG154-47	FEG160-03	FEG18-20	FEG90-22	Yield	99.78611	6.009182	95.48400	104.0882	0.2943902	0.5076860	NA	-0.1175733	18.39459	19.70797	NA	81.21060	20.14798	22.30845	NA	80.23725
11	FEG154-47	FEG160-03	FEG18-20	M112	Yield	100.49709	6.704024	95.95306	105.0411	0.3743612	0.4128489	NA	-0.2126614	18.66195	20.91389	NA	80.49162	21.55229	24.01300	NA	78.15792
15	FEG154-47	FEG160-03	FEG18-20	MN01-43	Yield	101.37372	6.667078	96.84224	105.9052	0.4568892	0.5722436	NA	-0.3261032	18.65720	20.48511	NA	81.72813	22.25172	24.76814	NA	78.42352
19	FEG154-47	FEG160-03	FEG18-20	MN04-11	Yield	102.11854	8.538906	96.99023	107.2468	0.5259837	0.6042110	NA	-0.4288760	18.10655	20.05849	NA	81.78266	22.51856	24.78705	NA	77.03538
23	FEG154-47	FEG160-03	FEG18-20	MN96-155	Yield	102.21162	11.650920	96.22126	108.2020	0.6115019	0.6399380	NA	-0.4850788	18.49504	20.61557	NA	82.10847	23.74206	25.30627	NA	76.96529

References

Bernardo, Rex. 2010. Breeding for Quantitative Traits in Plants. 2nd ed. Woodbury, Minnesota: Stemma Press.

Mohammadi, Mohsen, Tyler Tiede, and Kevin P. Smith. 2015. “PopVar: A Genome-Wide Procedure for Predicting Genetic Variance and Correlated Response in Biparental Breeding Populations.” Crop Science 55 (5): 2068–77. https://doi.org/10.2135/cropsci2015.01.0030.

Schnell, F. W., and H. F. Utz. 1975. “F1-leistung und elternwahl euphyder züchtung von selbstbefruchtern.” In Bericht über Die Arbeitstagung Der Vereinigung Österreichischer Pflanzenzüchter, 243–48. Gumpenstein, Austria: BAL Gumpenstein.

Using PopVar