Last updated: 2025-01-17
Checks: 6 passed, 1 warning
Knit directory: Genomic-Selection-for-Drought-Tolerance-Using-Genome-Wide-SNPs-in-Casava/
This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
The R Markdown is untracked by Git. To know which version of the R Markdown file created these results, you’ll want to first commit it to the Git repo. If you’re still working on the analysis, you can ignore this warning. When you’re finished, you can run wflow_publish to commit the R Markdown file and build the HTML.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.
The command set.seed(20221020) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version 00c2e25. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: analysis/figure/GWS.Rmd/
Ignored: data/Articles/
Ignored: data/Artigo/
Ignored: data/allchrAR08.txt
Ignored: data/data.rar
Ignored: data/geno.rds
Ignored: data/pheno.rds
Ignored: output/BLUPS_density_med_row_col.png
Ignored: output/Density_residual_row_col.tiff
Ignored: output/Residuals_vs_fitted_row_col.tiff
Ignored: output/result_sommer_row_col_random.RDS
Ignored: output/varcomp_row_col.tiff
Untracked files:
Untracked: analysis/GWS_BayesA.Rmd
Untracked: analysis/GWS_BayesB.Rmd
Untracked: analysis/GWS_RF.Rmd
Untracked: analysis/clone_selection.Rmd
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
There are no past versions. Publish this analysis with wflow_publish() to start tracking its development.
To perform the analyses, we will need the following packages:
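The original loading chunk is not shown on this page, but the attached packages reported in the session information at the end suggest a chunk along these lines (list inferred from sessionInfo, not from the original code):

```r
# Packages used in this analysis (inferred from the attached packages
# reported in the session information at the end of this page)
library(tidyverse)   # data manipulation and plotting
library(kableExtra)  # formatted tables
library(cvTools)     # cross-validation fold assignment
library(BGLR)        # Bayesian whole-genome regression models
library(furrr)       # parallel purrr-style mapping
library(future)      # parallel processing backend
```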
We can perform the Ridge Regression in the Bayesian framework. For this purpose, we will use the BGLR package [@BGLR]. In this package, we will need an element called “ETA”. In the ETA, we will set the linear predictors of the model and the priors.
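As a minimal sketch of how the linear predictor is declared, using a small random matrix M as a stand-in for the real marker data (toy example, not the project’s genotypes):

```r
# Minimal sketch of a BGLR linear predictor ("ETA") for BayesA.
# M stands in for the real marker matrix (individuals x SNPs).
set.seed(1)
M <- matrix(sample(0:2, 10 * 50, replace = TRUE), nrow = 10, ncol = 50)
# Each list element of ETA defines one model term and its prior
ETA <- list(list(X = M, model = "BayesA"))
# The model would then be fitted with, e.g.:
# fit <- BGLR(y = y, ETA = ETA, nIter = 20000, burnIn = 5000, verbose = FALSE)
```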
To assess prediction accuracy, we perform a cross-validation (CV) scheme. For this purpose, we divide the data into a training set and a validation set.
First, we separate the data into k folds. Then, we assign NA to one fold and try to predict the observations in that fold from the remaining folds. When choosing the number of folds, one should aim for a balanced number of observations per fold.
In addition, this process should be repeated for further validation. The steps below walk through the CV on the data we are analysing.
Since we defined 5 folds, our data will be divided into 5 parts with 83 observations each.
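A base-R illustration of this masking idea, assuming 415 observations (5 folds × 83) and a hypothetical phenotype vector y (toy data, not the project’s phenotypes):

```r
# Hedged base-R illustration of the fold masking step:
# 415 observations split into 5 folds of 83 each.
set.seed(100)
y <- rnorm(415)
fold_id <- sample(rep(1:5, length.out = length(y)))
y_train <- y
y_train[fold_id == 1] <- NA  # fold 1 becomes the validation set
table(fold_id)               # each fold holds 83 observations
sum(is.na(y_train))          # 83 masked values to be predicted
```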
We sort both datasets by genotype name (increasing order, numeric or alphabetical) so that the phenotypic and genomic records match.
pheno <- pheno[order(pheno$ID_Clone, decreasing = F),]
geno <- geno[order(row.names(geno)),]
all(rownames(geno) == pheno$ID_Clone)
[1] TRUE
This will be useful to assign each observation for a fold, which will be the next step.
In this step, we will assign each observation to a fold. Bear in mind that for each repetition, the folds will comprise different observations. The purpose of the repetitions is to ensure the randomness of the assignment step.
In this step, we will use the cvTools package [@cvTools]
For better performance, we included this step in the loop of the next step.
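A standalone sketch of the fold assignment with cvTools::cvFolds, shown here outside the loop for clarity (the numbers match the 5 folds of 83 observations described above):

```r
library(cvTools)

set.seed(100)
# Randomly assign 415 observations to 5 folds
folds <- cvFolds(415, type = "random", K = 5)
# Split the shuffled indices by fold membership
fold_list <- split(folds$subsets, f = folds$which)
lengths(fold_list)  # five folds of 83 indices each
```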
The next step is the CV itself. Here, we define the linear predictor and two lists that will be used in the loop. The first, results_cv_BayesA, contains the results for each fold assignment built in the previous step; the second, GEBVS_BayesA, contains the GEBVs.
Then, we will construct the loop. Each iteration will assign NA to a different fold, and we will use the other folds to predict the missing values. Note that the folds vary for each repetition and trait.
Set up parallel processing, then define and run the models to obtain the GEBVs. ATTENTION: this process is time-consuming.
# Plan multisession processing for parallelization
plan(multisession)
# Initialize a list to store the results
results_list <-
future_map(traits, .options = furrr_options(seed = 100), function(j) {
# Data pre-processing
pheno2 <- pheno %>%
select(ID_Clone, all_of(j)) %>%
na.omit() %>%
ungroup() %>%
filter(ID_Clone %in% rownames(geno)) %>%
mutate(ID = factor(1:nrow(.))) %>%
droplevels()
# Keep only the rows of the marker matrix matching pheno2$ID_Clone
ETA <- list(list(X = geno[rownames(geno) %in% pheno2$ID_Clone, ], model = "BayesA"))
# Seed for reproducibility
set.seed(100)
# Lists of folds
fold.list <- lapply(1:nrept, function(a) {
folds <- cvFolds(nlevels(pheno2$ID),
type = "random",
K = nfolds)
split(folds$subsets, f = folds$which)
})
# Use future_map_dfr to run the cross-validation folds in parallel
future_map_dfr(1:length(fold.list), function(z) {
for (i in 1:nfolds) {
# Training set
train_data <- pheno2
# Validation set: mask the current fold with NA
train_data[train_data$ID %in% fold.list[[z]][[i]], j] <- NA
# Fitting model
CV_M <-
BGLR(
y = train_data[[j]],
ETA = ETA,
nIter = 20000,
burnIn = 5000,
thin = 5,
verbose = F,
saveAt = "output/BayesA_"
)
# GEBV
Pred <-
data.frame(
Yhat = CV_M$yHat,
ID = train_data$ID,
ID_Clone = train_data$ID_Clone
) %>%
filter(ID %in% fold.list[[z]][[i]])
result1 <-
tibble(
GEBV = Pred$Yhat,
ID = Pred$ID,
ID_Clone = Pred$ID_Clone,
Trait = j,
rep = z,
fold = i
)
# Prediction results
if (i == 1) {
result <- result1
} else {
result <- rbind(result, result1)
}
}
# Sorting the results by ID
result <- result[order(result$ID, decreasing = F),]
# Cross-validation results
result_cv <- tibble(
Trait = j,
rep = z,
Log = all(result$ID == pheno2$ID),
Ac = round(cor(result$GEBV, pheno2[[j]], use = "na.or.complete"), 3),
MSPE = round(mean((
result$GEBV - pheno2[[j]]
) ^ 2, na.rm = TRUE), 3),
result = list(result)
)
})
})
# Combine all cross-validation results into a single structure
results_cv_BayesA <- do.call(rbind, results_list)
# Extract the GEBVs (Genomic Estimated Breeding Values) into a separate structure
GEBVS_BayesA <- do.call(rbind, results_cv_BayesA$result)
The object “GEBVS_BayesA” contains the GEBVs for each trait and clone. The table below shows the first six clones and their GEBVs for each trait.
GEBVS_BayesA |>
group_by(Trait, ID_Clone) |>
summarise(GEBV = mean(GEBV)) %>%
pivot_wider(names_from = Trait, values_from = GEBV) %>%
head() %>%
kbl(escape = F, align = 'c') |>
kable_classic("hover", full_width = F, position = "center", fixed_thead = T)
`summarise()` has grouped output by 'Trait'. You can override using the
`.groups` argument.
ID_Clone | DMC | FRY | HI | N_Roots | Nstem.Plant | Plant.Height | Root.Di | Root.Le | ShY | StC | StY | Stem.D |
---|---|---|---|---|---|---|---|---|---|---|---|---|
BGM_2042 | 0.4436600 | 1.1092686 | 1.7114440 | 0.3746705 | 0.1650986 | 0.0203053 | 0.4144456 | 0.9995610 | 1.7185341 | 0.4548519 | 0.2637665 | 0.0594662 |
BGM0368.250437186 | 0.7657258 | -0.5130257 | 0.5961233 | 0.0720482 | -0.0200861 | -0.0162096 | -0.5744150 | -0.5710136 | -3.0210501 | 0.7791993 | -0.0480404 | -0.0441026 |
BGM0424.250437187 | 0.0748989 | -0.2246431 | -0.0147139 | -0.2097110 | -0.0564301 | 0.0006734 | -0.4975889 | -0.4774000 | -1.5876806 | 0.0959514 | -0.0246192 | -0.0180559 |
BGM0611.250437190 | -1.0895086 | -0.2184819 | -0.5204956 | -0.0093541 | 0.0685501 | -0.0531977 | -0.2418798 | 0.0749479 | -0.7201068 | -1.0842358 | -0.0814045 | -0.0784224 |
BGM0652.250437861 | -0.1421398 | -0.1197146 | -0.4676251 | -0.1147646 | 0.0409002 | -0.0310311 | -0.4865752 | -0.1653318 | -0.1806367 | -0.1447774 | -0.0513303 | -0.0252523 |
BGM0699.250437199 | -0.1925456 | -0.4011327 | -0.9788114 | -0.3853206 | -0.0551043 | -0.0114720 | -0.3858275 | -0.1457257 | -0.6257111 | -0.2075533 | -0.0996586 | 0.0317091 |
Accuracy and MSPE for each trait, averaged over repetitions:
results_cv_BayesA[-6] |>
group_by(Trait) %>%
summarise_all(mean) %>%
arrange(desc(Ac), MSPE) %>%
kbl(escape = F, align = 'c') |>
kable_classic("hover", full_width = F, position = "center", fixed_thead = T)
Trait | rep | Log | Ac | MSPE |
---|---|---|---|---|
Root.Le | 3 | 1 | 0.3694 | 1.8068 |
Stem.D | 3 | 1 | 0.3458 | 0.0078 |
FRY | 3 | 1 | 0.3222 | 1.1586 |
ShY | 3 | 1 | 0.3176 | 12.5958 |
Root.Di | 3 | 1 | 0.3160 | 3.3288 |
StY | 3 | 1 | 0.3044 | 0.0708 |
Plant.Height | 3 | 1 | 0.2958 | 0.0074 |
Nstem.Plant | 3 | 1 | 0.2708 | 0.0276 |
N_Roots | 3 | 1 | 0.2528 | 0.8608 |
HI | 3 | 1 | 0.2426 | 11.9586 |
StC | 3 | 1 | 0.2180 | 5.8666 |
DMC | 3 | 1 | 0.2164 | 5.8982 |
The object “result_cv” is divided by repetition. In the “result_cv” objects for each repetition, “rep” is the number of the repetition, “Log” is a diagnostic indicating whether the order of the predicted breeding values matches the order of the adjusted means, “Ac” is the prediction accuracy (correlation between the GEBVs and adjusted means), and “MSPE” is the mean square prediction error (the lower, the better).
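These two statistics reduce to a correlation and a mean squared difference; a toy illustration (made-up numbers, not the project’s results):

```r
# Toy illustration of the two CV statistics (values are made up):
gebv  <- c(1.2, 0.4, -0.3, 0.9)   # cross-validated predictions
y_adj <- c(1.0, 0.6, -0.1, 1.1)   # adjusted means, in matching order
Ac   <- round(cor(gebv, y_adj, use = "na.or.complete"), 3)  # accuracy
MSPE <- round(mean((gebv - y_adj)^2, na.rm = TRUE), 3)      # prediction error
```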
Let’s now move on to fitting each model. Since this can be time-consuming, the processing is parallelized and each model has its own script:
R version 4.3.3 (2024-02-29 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)
Matrix products: default
locale:
[1] LC_COLLATE=Portuguese_Brazil.utf8 LC_CTYPE=Portuguese_Brazil.utf8
[3] LC_MONETARY=Portuguese_Brazil.utf8 LC_NUMERIC=C
[5] LC_TIME=Portuguese_Brazil.utf8
time zone: America/Sao_Paulo
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] BGLR_1.1.3 kableExtra_1.4.0 cvTools_0.3.3
[4] robustbase_0.99-4-1 lattice_0.22-6 lubridate_1.9.4
[7] forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4
[10] purrr_1.0.2 readr_2.1.5 tidyr_1.3.1
[13] tibble_3.2.1 ggplot2_3.5.1 tidyverse_2.0.0
[16] furrr_0.3.1 future_1.34.0
loaded via a namespace (and not attached):
[1] gtable_0.3.6 xfun_0.50 bslib_0.8.0 tzdb_0.4.0
[5] vctrs_0.6.5 tools_4.3.3 generics_0.1.3 parallel_4.3.3
[9] DEoptimR_1.1-3-1 pkgconfig_2.0.3 truncnorm_1.0-9 lifecycle_1.0.4
[13] compiler_4.3.3 git2r_0.35.0 munsell_0.5.1 codetools_0.2-20
[17] httpuv_1.6.15 htmltools_0.5.8.1 sass_0.4.9 yaml_2.3.10
[21] later_1.4.1 pillar_1.10.1 jquerylib_0.1.4 MASS_7.3-60.0.1
[25] cachem_1.1.0 parallelly_1.41.0 tidyselect_1.2.1 digest_0.6.37
[29] stringi_1.8.4 listenv_0.9.1 rprojroot_2.0.4 fastmap_1.2.0
[33] grid_4.3.3 colorspace_2.1-1 cli_3.6.3 magrittr_2.0.3
[37] withr_3.0.2 scales_1.3.0 promises_1.3.2 timechange_0.3.0
[41] rmarkdown_2.29 globals_0.16.3 workflowr_1.7.1 hms_1.1.3
[45] evaluate_1.0.3 knitr_1.49 viridisLite_0.4.2 rlang_1.1.4
[49] Rcpp_1.0.14 glue_1.8.0 xml2_1.3.6 svglite_2.1.3
[53] rstudioapi_0.17.1 jsonlite_1.8.9 R6_2.5.1 systemfonts_1.1.0
[57] fs_1.6.5
Weverton Gomes da Costa, Postdoctoral Researcher, Embrapa Mandioca e Fruticultura, wevertonufv@gmail.com↩︎