Last updated: 2025-01-17

Checks: 6 passed, 1 warning

Knit directory: Genomic-Selection-for-Drought-Tolerance-Using-Genome-Wide-SNPs-in-Casava/

This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


The R Markdown is untracked by Git. To know which version of the R Markdown file created these results, you’ll want to first commit it to the Git repo. If you’re still working on the analysis, you can ignore this warning. When you’re finished, you can run wflow_publish to commit the R Markdown file and build the HTML.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

The command set.seed(20221020) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 00c2e25. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    analysis/figure/GWS.Rmd/
    Ignored:    data/Articles/
    Ignored:    data/Artigo/
    Ignored:    data/allchrAR08.txt
    Ignored:    data/data.rar
    Ignored:    data/geno.rds
    Ignored:    data/pheno.rds
    Ignored:    output/BLUPS_density_med_row_col.png
    Ignored:    output/Density_residual_row_col.tiff
    Ignored:    output/Residuals_vs_fitted_row_col.tiff
    Ignored:    output/result_sommer_row_col_random.RDS
    Ignored:    output/varcomp_row_col.tiff

Untracked files:
    Untracked:  analysis/GWS_BayesA.Rmd
    Untracked:  analysis/GWS_BayesB.Rmd
    Untracked:  analysis/GWS_RF.Rmd
    Untracked:  analysis/clone_selection.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


There are no past versions. Publish this analysis with wflow_publish() to start tracking its development.


Configurations and packages

To perform the analyses, we will need the following packages:

library(furrr)
library(tidyverse)
library(cvTools)
library(kableExtra)
library(randomForest)

Data

geno<- readRDS("data/geno.rds")
pheno<-readRDS("data/pheno.rds")
traits <- colnames(pheno)[-1]

Random Forest

Decision-tree-based models can also be used for genomic selection. Here, we use Random Forest from the randomForest package [@randomForest].
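Before building the full cross-validation, a minimal sketch of the basic fit may help. It is illustrative only: it assumes the chosen trait has no missing values and that the rows of geno are aligned with pheno, both of which are handled properly in the CV loop below.

# Minimal illustrative fit: regression forest of one trait on all SNPs
y <- pheno[[traits[1]]]                 # phenotype vector for the first trait
rf_fit <- randomForest(x = geno,        # SNP matrix (clones x markers)
                       y = y,           # continuous response -> regression
                       ntree = 500)     # default number of trees

# Predicted genomic values for the same clones
head(predict(rf_fit, geno))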

Cross-validation

To assess how accurate the predictions are, we perform a cross-validation (CV) scheme. For this purpose, we divide the data into a training set and a validation set.

First, we split the data into k folds. Then, we hold out one fold, treating its phenotypes as missing, and predict its observations with a model trained on the remaining folds. When choosing the number of folds, aim for a balanced number of observations per fold.

In addition, this process is repeated several times for further validation. The step-by-step guide below walks through the CV for the data we are analysing.

1. Determine the number of folds and repetitions

nfolds = 5
nrept = 5

Since we defined 5 folds, our data will be divided into 5 parts with 83 observations each.
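As a quick sanity check (a minimal sketch; the 83 observations per fold follow from the 415 clones implied above):

# Approximate number of observations per fold (415 clones / 5 folds = 83)
nrow(pheno) / nfolds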

2. Match the order of the data and the rows of the SNP matrix

Both objects are sorted in ascending order (numeric or alphabetical) by genotype name, so that the rows of the SNP matrix match the phenotype records.

pheno <- pheno[order(pheno$ID_Clone, decreasing = F),]
geno <- geno[order(row.names(geno)),]
all(rownames(geno) == pheno$ID_Clone)
[1] TRUE

3. Add a column with a unique number for each observation

This will be used to assign each observation to a fold, which is the next step.

pheno$ID <- factor(1:nrow(pheno))

4. Fold assignment

In this step, we assign each observation to a fold. Bear in mind that, for each repetition, the folds comprise different observations; the purpose of the repetitions is to account for the randomness of the assignment step.

In this step, we use the cvTools package [@cvTools].

For better performance, this step is included inside the loop of the next step; the sketch below illustrates the fold object it relies on.
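For illustration only, a single call to cvFolds() for one repetition produces the fold structure used inside the loop (the demo object names are not part of the pipeline):

set.seed(100)
folds_demo <- cvFolds(nrow(pheno), type = "random", K = nfolds)

# `which` holds the fold label of each permuted observation and `subsets`
# the corresponding row indices; splitting yields one index vector per fold
fold_index <- split(folds_demo$subsets, f = folds_demo$which)
lengths(fold_index)   # roughly equal fold sizes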

Run model

The next step is the CV itself. Here, we define the objects that will store the loop results. The first, called results_cv_RF, contains the results for each fold assignment built in the previous step. The second, GEBVS_RF, contains the GEBVs.

Then, we construct the loop. In each iteration, the model is trained on all but one fold and used to predict the held-out validation fold. Note that the folds vary across iterations and traits.

We set up parallel processing, then define and run the models to obtain the GEBVs. ATTENTION: this process is time-consuming.
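If needed, the number of workers can be limited, or parallelization turned off while debugging (a sketch using standard future options; not part of the original pipeline):

# Limit the number of parallel workers (adjust to your machine)
plan(multisession, workers = 4)

# For debugging, fall back to sequential execution
# plan(sequential)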

# Plan processing across multiple R sessions for parallelization
plan(multisession)

# Build a list to store the results for each trait
results_list <-
  future_map(traits, .options = furrr_options(seed = 100), function(j)  {
    # Data pre-processing
    pheno2 <- pheno %>%
      select(ID_Clone, all_of(j)) %>%
      na.omit() %>%
      ungroup() %>%
      filter(ID_Clone %in% rownames(geno)) %>%
      mutate(ID = factor(1:nrow(.))) %>%
      droplevels()
    
    # Filter the rows of the SNP matrix based on pheno2$ID_Clone
    geno1 <- geno[rownames(geno) %in% pheno2$ID_Clone, ]
    
    # Seed for reproducibility
    set.seed(100)
    
    # Lists of fold assignments (one per repetition)
    fold.list <- lapply(1:nrept, function(a) {
      folds <- cvFolds(nlevels(pheno2$ID),
                       type = "random",
                       K = nfolds)
      
      split(folds$subsets, f = folds$which)
    })
    
    # Use future_map_dfr to run the repetitions in parallel
    future_map_dfr(1:length(fold.list), function(z) {
      for (i in 1:nfolds) {
        # Training phenotypes and genotypes
        pheno_train_data <- pheno2[-fold.list[[z]][[i]], ]
        geno_train_data <- geno1[-fold.list[[z]][[i]],]
        
        # Validation phenotypes and genotypes
        pheno_test_data <- pheno2[fold.list[[z]][[i]],]
        geno_test_data <- geno1[fold.list[[z]][[i]],]
        
        # Fitting model
        RF <-
          randomForest(x = geno_train_data,
                       y = pheno_train_data[[j]])
        
        # GEBV
        Pred <-
          data.frame(
            Yhat = predict(RF, geno_test_data),
            ID = pheno_test_data$ID,
            ID_Clone = pheno_test_data$ID_Clone
          ) %>%
          filter(ID %in% fold.list[[z]][[i]])
        
        result1 <-
          tibble(
            GEBV = Pred$Yhat,
            ID = Pred$ID,
            ID_Clone = Pred$ID_Clone,
            Trait = j,
            rep = z,
            fold = i
          )
        
        # Accumulate the prediction results across folds
        if (i == 1) {
          result <- result1
        } else {
          result <- rbind(result, result1)
        }
      }
      
      # Order the results by ID
      result <- result[order(result$ID, decreasing = F),]
      
      # Cross-validation results for this repetition
      result_cv <- tibble(
        Trait = j,
        rep = z,
        Log = all(result$ID == pheno2$ID),
        Ac = round(cor(result$GEBV, pheno2[[j]], use = "na.or.complete"), 3),
        MSPE = round(mean((
          result$GEBV - pheno2[[j]]
        ) ^ 2, na.rm = TRUE), 3),
        result = list(result)
      )
    })
  })

# Combine all cross-validation results into a single structure
results_cv_RF <- do.call(rbind, results_list)

# Extract the GEBVs (Genomic Estimated Breeding Values) into a separate structure
GEBVS_RF <- do.call(rbind, results_cv_RF$result)

Results

The object “GEBVS_RF” contains the GEBVs for each trait and clone. The table below shows the first six clones and their mean GEBVs for each trait.

GEBVS_RF |>
  group_by(Trait, ID_Clone) |>
  summarise(GEBV = mean(GEBV)) %>%
  pivot_wider(names_from = Trait, values_from = GEBV) %>%
  head() %>%
  kbl(escape = F, align = 'c') |>
  kable_classic("hover", full_width = F, position = "center", fixed_thead = T)
`summarise()` has grouped output by 'Trait'. You can override using the
`.groups` argument.
ID_Clone DMC FRY HI N_Roots Nstem.Plant Plant.Height Root.Di Root.Le ShY StC StY Stem.D
BGM_2042 0.1451931 1.4496079 3.1015353 0.6449121 0.3191106 -0.0012459 0.8382182 0.7391006 2.1877567 0.0782769 0.3742687 0.0467300
BGM0368.250437186 0.9345959 -0.7285073 0.8285951 0.0010088 -0.0128948 -0.0503279 -0.6692147 -0.9897567 -4.1589767 0.9374582 -0.0754125 -0.0574171
BGM0424.250437187 -0.0123616 -0.1009185 -0.3284599 -0.3140162 -0.0226895 0.0062196 -0.5188334 -0.3481312 -1.0347330 -0.0575397 0.0022422 -0.0186176
BGM0611.250437190 -1.3296228 -0.1204026 0.3692578 0.2081517 0.0822986 -0.0725367 0.2534240 0.3220756 -0.3743010 -1.3447060 -0.0867330 -0.0883289
BGM0652.250437861 -0.0075193 -0.0939703 -0.4704020 -0.0702045 0.0568957 -0.0213872 -0.2044276 -0.1057579 0.3375576 -0.0769899 -0.0397454 -0.0145388
BGM0699.250437199 -0.5693895 -0.3686601 -0.2958907 -0.3346485 -0.0435400 -0.0218531 -0.3045638 -0.1673995 -0.9447571 -0.6048763 -0.0780387 0.0084988

Mean accuracy (Ac) and MSPE for each trait, averaged across repetitions:

results_cv_RF[-6] |>
  group_by(Trait) %>% 
  summarise_all(mean) %>% 
  arrange(desc(Ac), MSPE) %>% 
  kbl(escape = F, align = 'c') |>
  kable_classic("hover", full_width = F, position = "center", fixed_thead = T)
Trait rep Log Ac MSPE
Root.Le 3 1 0.3446 1.8728
Stem.D 3 1 0.3412 0.0080
ShY 3 1 0.3182 12.7628
FRY 3 1 0.3140 1.1738
Plant.Height 3 1 0.2886 0.0078
Root.Di 3 1 0.2824 3.4866
StY 3 1 0.2818 0.0724
N_Roots 3 1 0.2578 0.8768
HI 3 1 0.2554 12.1224
Nstem.Plant 3 1 0.2082 0.0298
StC 3 1 0.1556 6.2552
DMC 3 1 0.1548 6.2874

The object “results_cv_RF” contains one row per trait and repetition: “rep” is the repetition number, “Log” is a diagnostic indicating whether the order of the predicted values matches the order of the adjusted means, “Ac” is the prediction accuracy (the correlation between the GEBVs and the adjusted means), and “MSPE” is the mean squared prediction error (the lower, the better).

We will now obtain the joint results of all methods and compare them. Additionally, we will adopt a selection model to optimize the selection of cassava clones for EMBRAPA’s cassava breeding program.

Our aim is to select clones with high GEBVs and high GETGVs, identifying clones with strong intrinsic potential as well as those that can be crossbred to produce high-potential hybrid clones.

This step will be detailed in the next phase of the project.
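As a preview, a minimal sketch (not the final selection model) ranking clones by their mean GEBV for a single trait, using fresh root yield (FRY) purely as an example:

# Top clones by mean GEBV for one trait (FRY is used only as an example)
GEBVS_RF |>
  filter(Trait == "FRY") |>
  group_by(ID_Clone) |>
  summarise(GEBV = mean(GEBV)) |>
  arrange(desc(GEBV)) |>
  head(10)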


sessionInfo()
R version 4.3.3 (2024-02-29 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default


locale:
[1] LC_COLLATE=Portuguese_Brazil.utf8  LC_CTYPE=Portuguese_Brazil.utf8   
[3] LC_MONETARY=Portuguese_Brazil.utf8 LC_NUMERIC=C                      
[5] LC_TIME=Portuguese_Brazil.utf8    

time zone: America/Sao_Paulo
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] randomForest_4.7-1.2 kableExtra_1.4.0     cvTools_0.3.3       
 [4] robustbase_0.99-4-1  lattice_0.22-6       lubridate_1.9.4     
 [7] forcats_1.0.0        stringr_1.5.1        dplyr_1.1.4         
[10] purrr_1.0.2          readr_2.1.5          tidyr_1.3.1         
[13] tibble_3.2.1         ggplot2_3.5.1        tidyverse_2.0.0     
[16] furrr_0.3.1          future_1.34.0       

loaded via a namespace (and not attached):
 [1] gtable_0.3.6      xfun_0.50         bslib_0.8.0       tzdb_0.4.0       
 [5] vctrs_0.6.5       tools_4.3.3       generics_0.1.3    parallel_4.3.3   
 [9] DEoptimR_1.1-3-1  pkgconfig_2.0.3   lifecycle_1.0.4   compiler_4.3.3   
[13] git2r_0.35.0      munsell_0.5.1     codetools_0.2-20  httpuv_1.6.15    
[17] htmltools_0.5.8.1 sass_0.4.9        yaml_2.3.10       later_1.4.1      
[21] pillar_1.10.1     jquerylib_0.1.4   cachem_1.1.0      parallelly_1.41.0
[25] tidyselect_1.2.1  digest_0.6.37     stringi_1.8.4     listenv_0.9.1    
[29] rprojroot_2.0.4   fastmap_1.2.0     grid_4.3.3        colorspace_2.1-1 
[33] cli_3.6.3         magrittr_2.0.3    withr_3.0.2       scales_1.3.0     
[37] promises_1.3.2    timechange_0.3.0  rmarkdown_2.29    globals_0.16.3   
[41] workflowr_1.7.1   hms_1.1.3         evaluate_1.0.3    knitr_1.49       
[45] viridisLite_0.4.2 rlang_1.1.4       Rcpp_1.0.14       glue_1.8.0       
[49] xml2_1.3.6        svglite_2.1.3     rstudioapi_0.17.1 jsonlite_1.8.9   
[53] R6_2.5.1          systemfonts_1.1.0 fs_1.6.5         

  1. Weverton Gomes da Costa, Postdoctoral Researcher, Embrapa Mandioca e Fruticultura