Last updated: 2025-01-17
Checks: 6 passed, 1 warning
Knit directory: Genomic-Selection-for-Drought-Tolerance-Using-Genome-Wide-SNPs-in-Casava/
This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
The R Markdown is untracked by Git. To know which version of the R Markdown file created these results, you’ll want to first commit it to the Git repo. If you’re still working on the analysis, you can ignore this warning. When you’re finished, you can run wflow_publish to commit the R Markdown file and build the HTML.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.
The command set.seed(20221020) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version 00c2e25. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: analysis/figure/GWS.Rmd/
Ignored: data/Articles/
Ignored: data/Artigo/
Ignored: data/allchrAR08.txt
Ignored: data/data.rar
Ignored: data/geno.rds
Ignored: data/pheno.rds
Ignored: output/BLUPS_density_med_row_col.png
Ignored: output/Density_residual_row_col.tiff
Ignored: output/Residuals_vs_fitted_row_col.tiff
Ignored: output/result_sommer_row_col_random.RDS
Ignored: output/varcomp_row_col.tiff
Untracked files:
Untracked: analysis/GWS_BayesA.Rmd
Untracked: analysis/GWS_BayesB.Rmd
Untracked: analysis/GWS_RF.Rmd
Untracked: analysis/clone_selection.Rmd
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
There are no past versions. Publish this analysis with wflow_publish() to start tracking its development.
To perform the analyses, we will need the following packages:
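The original loading chunk is not shown on this page, but the attached packages reported in the session information at the end suggest a chunk along these lines (list inferred from sessionInfo, not from the original code):

```r
# Packages used in this analysis (inferred from the attached packages
# reported in the session information at the end of this page)
library(tidyverse)   # data manipulation and plotting
library(kableExtra)  # formatted tables
library(cvTools)     # cross-validation fold assignment
library(BGLR)        # Bayesian whole-genome regression models
library(furrr)       # parallel purrr-style mapping
library(future)      # parallel processing backend
```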
We can perform the Ridge Regression in the Bayesian framework. For this purpose, we will use the BGLR package [@BGLR]. In this package, we will need an element called “ETA”. In the ETA, we will set the linear predictors of the model and the priors.
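As a minimal sketch of how the linear predictor is declared, using a small random matrix M as a stand-in for the real marker data (toy example, not the project’s genotypes):

```r
# Minimal sketch of a BGLR linear predictor ("ETA") for BayesA.
# M stands in for the real marker matrix (individuals x SNPs).
set.seed(1)
M <- matrix(sample(0:2, 10 * 50, replace = TRUE), nrow = 10, ncol = 50)
# Each list element of ETA defines one model term and its prior
ETA <- list(list(X = M, model = "BayesA"))
# The model would then be fitted with, e.g.:
# fit <- BGLR(y = y, ETA = ETA, nIter = 20000, burnIn = 5000, verbose = FALSE)
```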
To assess prediction accuracy, we perform a cross-validation (CV) scheme. For this purpose, we divide the data into a training set and a validation set.
First, we separate the data into k folds. Then, we assign NA to one fold and try to predict the observations in that fold from the remaining folds. When choosing the number of folds, one should aim for a balanced number of observations per fold.
In addition, this process should be repeated for further validation. The steps below walk through the CV on the data we are analysing.
Since we defined 5 folds, our data will be divided into 5 parts with 83 observations each.
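A base-R illustration of this masking idea, assuming 415 observations (5 folds × 83) and a hypothetical phenotype vector y (toy data, not the project’s phenotypes):

```r
# Hedged base-R illustration of the fold masking step:
# 415 observations split into 5 folds of 83 each.
set.seed(100)
y <- rnorm(415)
fold_id <- sample(rep(1:5, length.out = length(y)))
y_train <- y
y_train[fold_id == 1] <- NA  # fold 1 becomes the validation set
table(fold_id)               # each fold holds 83 observations
sum(is.na(y_train))          # 83 masked values to be predicted
```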
We sort both datasets by genotype name (increasing order, numeric or alphabetical) so that the phenotypic and genomic records match.
pheno <- pheno[order(pheno$ID_Clone, decreasing = F),]
geno <- geno[order(row.names(geno)),]
all(rownames(geno) == pheno$ID_Clone)
[1] TRUE
This will be useful to assign each observation for a fold, which will be the next step.
In this step, we will assign each observation to a fold. Bear in mind that for each repetition, the folds will comprise different observations. The purpose of the repetitions is to ensure the randomness of the assignment step.
In this step, we will use the cvTools package [@cvTools]
For better performance, we included this step in the loop of the next step.
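A standalone sketch of the fold assignment with cvTools::cvFolds, shown here outside the loop for clarity (the numbers match the 5 folds of 83 observations described above):

```r
library(cvTools)

set.seed(100)
# Randomly assign 415 observations to 5 folds
folds <- cvFolds(415, type = "random", K = 5)
# Split the shuffled indices by fold membership
fold_list <- split(folds$subsets, f = folds$which)
lengths(fold_list)  # five folds of 83 indices each
```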
The next step is the CV itself. Here, we define the linear predictor and two lists that will be used in the loop. The first, results_cv_BayesA, contains the results for each fold assignment built in the previous step; the second, GEBVS_BayesA, contains the GEBVs.
Then, we will construct the loop. Each iteration will assign NA to a different fold, and we will use the other folds to predict the missing values. Note that the folds vary for each repetition and trait.
Set up parallel processing, then define and run the models to obtain the GEBVs. ATTENTION: this process is time-consuming.
# Plan multisession processing for parallelization
plan(multisession)
# Initialize a list to store the results
results_list <-
future_map(traits, .options = furrr_options(seed = 100), function(j) {
# Data pre-processing
pheno2 <- pheno %>%
select(ID_Clone, all_of(j)) %>%
na.omit() %>%
ungroup() %>%
filter(ID_Clone %in% rownames(geno)) %>%
mutate(ID = factor(1:nrow(.))) %>%
droplevels()
# Keep only the rows of the marker matrix matching pheno2$ID_Clone
ETA <- list(list(X = geno[rownames(geno) %in% pheno2$ID_Clone, ], model = "BayesA"))
# Seed for reproducibility
set.seed(100)
# Lists of folds
fold.list <- lapply(1:nrept, function(a) {
folds <- cvFolds(nlevels(pheno2$ID),
type = "random",
K = nfolds)
split(folds$subsets, f = folds$which)
})
# Use future_map_dfr to run the cross-validation folds in parallel
future_map_dfr(1:length(fold.list), function(z) {
for (i in 1:nfolds) {
# Training set
train_data <- pheno2
# Validation set: mask the current fold with NA
train_data[train_data$ID %in% fold.list[[z]][[i]], j] <- NA
# Fitting model
CV_M <-
BGLR(
y = train_data[[j]],
ETA = ETA,
nIter = 20000,
burnIn = 5000,
thin = 5,
verbose = F,
saveAt = "output/BayesA_"
)
# GEBV
Pred <-
data.frame(
Yhat = CV_M$yHat,
ID = train_data$ID,
ID_Clone = train_data$ID_Clone
) %>%
filter(ID %in% fold.list[[z]][[i]])
result1 <-
tibble(
GEBV = Pred$Yhat,
ID = Pred$ID,
ID_Clone = Pred$ID_Clone,
Trait = j,
rep = z,
fold = i
)
# Prediction results
if (i == 1) {
result <- result1
} else {
result <- rbind(result, result1)
}
}
# Sorting the results by ID
result <- result[order(result$ID, decreasing = F),]
# Cross-validation results
result_cv <- tibble(
Trait = j,
rep = z,
Log = all(result$ID == pheno2$ID),
Ac = round(cor(result$GEBV, pheno2[[j]], use = "na.or.complete"), 3),
MSPE = round(mean((
result$GEBV - pheno2[[j]]
) ^ 2, na.rm = TRUE), 3),
result = list(result)
)
})
})
# Combine all cross-validation results into a single structure
results_cv_BayesA <- do.call(rbind, results_list)
# Extract the GEBVs (Genomic Estimated Breeding Values) into a separate structure
GEBVS_BayesA <- do.call(rbind, results_cv_BayesA$result)
The object “GEBVS_BayesA” contains the GEBVs for each trait and clone. The table below shows the first six clones and their GEBVs for each trait.
GEBVS_BayesA |>
group_by(Trait, ID_Clone) |>
summarise(GEBV = mean(GEBV)) %>%
pivot_wider(names_from = Trait, values_from = GEBV) %>%
head() %>%
kbl(escape = F, align = 'c') |>
kable_classic("hover", full_width = F, position = "center", fixed_thead = T)
`summarise()` has grouped output by 'Trait'. You can override using the
`.groups` argument.
ID_Clone | DMC | FRY | HI | N_Roots | Nstem.Plant | Plant.Height | Root.Di | Root.Le | ShY | StC | StY | Stem.D |
---|---|---|---|---|---|---|---|---|---|---|---|---|
BGM_2042 | 0.4436600 | 1.1092686 | 1.7114440 | 0.3746705 | 0.1650986 | 0.0203053 | 0.4144456 | 0.9995610 | 1.7185341 | 0.4548519 | 0.2637665 | 0.0594662 |
BGM0368.250437186 | 0.7657258 | -0.5130257 | 0.5961233 | 0.0720482 | -0.0200861 | -0.0162096 | -0.5744150 | -0.5710136 | -3.0210501 | 0.7791993 | -0.0480404 | -0.0441026 |
BGM0424.250437187 | 0.0748989 | -0.2246431 | -0.0147139 | -0.2097110 | -0.0564301 | 0.0006734 | -0.4975889 | -0.4774000 | -1.5876806 | 0.0959514 | -0.0246192 | -0.0180559 |
BGM0611.250437190 | -1.0895086 | -0.2184819 | -0.5204956 | -0.0093541 | 0.0685501 | -0.0531977 | -0.2418798 | 0.0749479 | -0.7201068 | -1.0842358 | -0.0814045 | -0.0784224 |
BGM0652.250437861 | -0.1421398 | -0.1197146 | -0.4676251 | -0.1147646 | 0.0409002 | -0.0310311 | -0.4865752 | -0.1653318 | -0.1806367 | -0.1447774 | -0.0513303 | -0.0252523 |
BGM0699.250437199 | -0.1925456 | -0.4011327 | -0.9788114 | -0.3853206 | -0.0551043 | -0.0114720 | -0.3858275 | -0.1457257 | -0.6257111 | -0.2075533 | -0.0996586 | 0.0317091 |
Accuracy and MSPE for each trait, averaged over repetitions:
results_cv_BayesA[-6] |>
group_by(Trait) %>%
summarise_all(mean) %>%
arrange(desc(Ac), MSPE) %>%
kbl(escape = F, align = 'c') |>
kable_classic("hover", full_width = F, position = "center", fixed_thead = T)
Trait | rep | Log | Ac | MSPE |
---|---|---|---|---|
Root.Le | 3 | 1 | 0.3694 | 1.8068 |
Stem.D | 3 | 1 | 0.3458 | 0.0078 |
FRY | 3 | 1 | 0.3222 | 1.1586 |
ShY | 3 | 1 | 0.3176 | 12.5958 |
Root.Di | 3 | 1 | 0.3160 | 3.3288 |
StY | 3 | 1 | 0.3044 | 0.0708 |
Plant.Height | 3 | 1 | 0.2958 | 0.0074 |
Nstem.Plant | 3 | 1 | 0.2708 | 0.0276 |
N_Roots | 3 | 1 | 0.2528 | 0.8608 |
HI | 3 | 1 | 0.2426 | 11.9586 |
StC | 3 | 1 | 0.2180 | 5.8666 |
DMC | 3 | 1 | 0.2164 | 5.8982 |
The object “result_cv” is divided by repetition. In the “result_cv” objects for each repetition, “rep” is the number of the repetition, “Log” is a diagnostic indicating whether the order of the predicted breeding values matches the order of the adjusted means, “Ac” is the prediction accuracy (correlation between the GEBVs and adjusted means), and “MSPE” is the mean square prediction error (the lower, the better).
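These two statistics reduce to a correlation and a mean squared difference; a toy illustration (made-up numbers, not the project’s results):

```r
# Toy illustration of the two CV statistics (values are made up):
gebv  <- c(1.2, 0.4, -0.3, 0.9)   # cross-validated predictions
y_adj <- c(1.0, 0.6, -0.1, 1.1)   # adjusted means, in matching order
Ac   <- round(cor(gebv, y_adj, use = "na.or.complete"), 3)  # accuracy
MSPE <- round(mean((gebv - y_adj)^2, na.rm = TRUE), 3)      # prediction error
```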
Let’s now move on to fitting each model. Since this can be time-consuming, the processing is parallelized and each model has its own script:
R version 4.3.3 (2024-02-29 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)
Matrix products: default
locale:
[1] LC_COLLATE=Portuguese_Brazil.utf8 LC_CTYPE=Portuguese_Brazil.utf8
[3] LC_MONETARY=Portuguese_Brazil.utf8 LC_NUMERIC=C
[5] LC_TIME=Portuguese_Brazil.utf8
time zone: America/Sao_Paulo
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] BGLR_1.1.3 kableExtra_1.4.0 cvTools_0.3.3
[4] robustbase_0.99-4-1 lattice_0.22-6 lubridate_1.9.4
[7] forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4
[10] purrr_1.0.2 readr_2.1.5 tidyr_1.3.1
[13] tibble_3.2.1 ggplot2_3.5.1 tidyverse_2.0.0
[16] furrr_0.3.1 future_1.34.0
loaded via a namespace (and not attached):
[1] gtable_0.3.6 xfun_0.50 bslib_0.8.0 tzdb_0.4.0
[5] vctrs_0.6.5 tools_4.3.3 generics_0.1.3 parallel_4.3.3
[9] DEoptimR_1.1-3-1 pkgconfig_2.0.3 truncnorm_1.0-9 lifecycle_1.0.4
[13] compiler_4.3.3 git2r_0.35.0 munsell_0.5.1 codetools_0.2-20
[17] httpuv_1.6.15 htmltools_0.5.8.1 sass_0.4.9 yaml_2.3.10
[21] later_1.4.1 pillar_1.10.1 jquerylib_0.1.4 MASS_7.3-60.0.1
[25] cachem_1.1.0 parallelly_1.41.0 tidyselect_1.2.1 digest_0.6.37
[29] stringi_1.8.4 listenv_0.9.1 rprojroot_2.0.4 fastmap_1.2.0
[33] grid_4.3.3 colorspace_2.1-1 cli_3.6.3 magrittr_2.0.3
[37] withr_3.0.2 scales_1.3.0 promises_1.3.2 timechange_0.3.0
[41] rmarkdown_2.29 globals_0.16.3 workflowr_1.7.1 hms_1.1.3
[45] evaluate_1.0.3 knitr_1.49 viridisLite_0.4.2 rlang_1.1.4
[49] Rcpp_1.0.14 glue_1.8.0 xml2_1.3.6 svglite_2.1.3
[53] rstudioapi_0.17.1 jsonlite_1.8.9 R6_2.5.1 systemfonts_1.1.0
[57] fs_1.6.5
Weverton Gomes da Costa, Postdoctoral Researcher, Embrapa Mandioca e Fruticultura, wevertonufv@gmail.com↩︎