Enrichement databases • BGSmartR

library(BGSmartR)

This vignette is to show how to create enrichment data frames in R, used to enrich plant information within collections. Within BGSmartR we have pre-written methods to create data frames from World Checklist of Vascular Plants and IUCN’s red list and functions to prepare any data set ready for taxonomic matching.

Note that creating enrichment data frames can be a rather slow process. So once you’ve created the data it’s best to save it and load it when required. Then maybe once a year re-create the enrichment data frames in case there have been any changes (such as new threatened plants, changes in accepted names, etc).

Preparing a data set for taxonomic matching

In this section we describe how to prepare a data ready for matching to your collection via taxonomic matching. Initially you need some data that you want to use to enrich your collection with. This data is commonly contained with .csv files. So first you need to use your favourite method for loading the data into R as a data frame (not a tibble).

As an example we load a list of trees obtained from BGCI’s Global Tree Search.

# This link may become invalid over time. See https://tools.bgci.org/global_tree_search.php for newer link if needed.
BGCI_trees <- read.csv(url('https://tools.bgci.org/global_tree_search_trees_1_7.csv'))[,1:2]
names(BGCI_trees) = c('taxon_names', 'taxon_author')
BGCI_trees$is_tree = rep(T, nrow(BGCI_trees)) # Add column stating each record is a tree.

Note that for taxonomic name matching the data requires taxonomic names (preferably taxonomic authors too!). In the above example we rename the columns such that the taxonomic names and authority are contained in taxon_names and taxon_authors, respectively.

After loading the data into R we need to prepare it for the matching algorithm. We do this using prepare_enrich_database().

BGCI_trees = BGSmartR::prepare_enrich_database(BGCI_trees,
                                  enrich_taxon_name_column = 'taxon_names',
                                  enrich_taxon_authors_column = 'taxon_author',
                                  console_message = TRUE)

This adds additional columns used in the matching algorithm such as sanitise_name and sanitise_author which ‘cleans’ the taxonomic name and authority ready to be matched to the collection. For further details see the function’s documentation.

Now the data is prepared we could perform taxonomic matching via match_collection_to_enrich_database(). Here, we go directly to enriching the collection (more likely what you would do) via enrich_collection_from_enrich_database().

### Load in a collection
load('data/value_of_items_collection_example.rda')
collection = collection[,1:2] 

### Enrich the collection with trees
enriched_collection = BGSmartR::enrich_collection_from_enrich_database(
  collection,
  enrich_database = BGCI_trees,
  taxon_name_column = 'TaxonName',
  taxon_name_full_column = 'TaxonNameFull',
  enrich_taxon_name_column = 'sanitise_name',
  enrich_taxon_authors_column = 'sanitise_author',
  columns_to_enrich = 'is_tree'
  )

We can see that for the function to work we need to input the collection together with the column names in collection that correspond to the taxonomic names (taxon_name_column) and the taxonomic name and author (taxon_name_full_column). We also need the enriched data (enrich_database) together with the column names within it for the taxonomic name (enrich_taxon_name_column) and authors (enrich_taxon_authors_column). Finally we need to choose the columns in the enriched data to add to the collection (columns_to_enrich).

This is a simple example of enriching a collection with tree information. See enrich_collection_from_enrich_database()’s documentation for further details, such as selecting parts of the matching algorithm you want to use. Moreover, see Case Study: Trees in a collection to see how we can improve enrichment by utilising multiple enrichment data sources (i.e. matching trees to WCVP to find further names of trees) and specific data preparation (only using Genus and Species rather then full taxonomic name).

Creating WCVP data frame ready for matching

WCVP is an excellent resource for retrieving geographic distribution information or taxonomic standardisation of plant records. Therefore, we have written a specific method for loading and preparing this data for BGSmartR matching algorithms.

WCVP can be downloaded directly from Plants of the World Online via the data tab, or you can use the rWCVPData package. WCVP is split into two parts wcvp_names and wcvp_distribution we shall deal with these two parts separately.

To load wcvp_names we use import_wcvp_names().

wcvp = import_wcvp_names(use_rWCVPdata = TRUE)

This function will load the data to R and prepare the data ready to be matched using BGSmartR. It will also:

Update the author of autonyms to the base Genus Species. Within wcvp_names autonyms often do not contain taxonomic authors needed for taxonomic matching so we add this to these records.
Sometimes the downloaded data contains records that are synonyms without providing the corresponding accepted taxonomic name, where POWO does link to the accepted name. Therefore, we check for these issues and update the accepted plant when required. The accepted name is found using the function get_accepted_plant().
Prior to matching we remove items from a collection that are cultivars (contains certain patterns in the taxonomic name) as these are not found in WCVP. However, we check for exceptions of these patterns within WCVP that we need to match to. For example WCVP contains Conioselinum anthriscoides ‘chuanxiong’ which contains ' ' which is associated with cultivars.
Sometimes in the taxonomic name there are issues with infraspecific markers where we find [**] instead of forma and [*] instead of variety (Version 10 of WCVP).

These issues may be fixed in newer versions of WCVP but we leave the methods in tact to perform checks.

The result of import_wcvp_names is a list containing:

$wcvp_names a data frame containing the prepared wcvp_names with the fixes outlined above.
$exceptions a data frame containing a subsection of the prepared wcvp_names containing the exceptions to cultivars and indeterminates.
$changes a data frame detailing the changes made to wcvp_names.

To add the distribution information we use add_wcvp_distributions().

wcvp = add_wcvp_distributions(use_rWCVPdata = TRUE, wcvp = wcvp)

This adjoins a new item to the wcvp list called $geography, which contains the plant name identifiers that match to wcvp$wcvp_names together with the level 3 TDWG geographical codes for each plant. The codes are contained in 7 columns with name of the form Dist_XYZ_area_code_l3 where:

XYZ = 000: Native, not extinct and not doubtful locations.
XYZ = 010: Native, extinct and not doubtful locations.
XYZ = 001: Native, not extinct and doubtful locations.
XYZ = 011: Native, extinct and doubtful locations.
XYZ = 100: Introduced, not extinct and not doubtful locations.
XYZ = 110: Introduced, extinct and not doubtful locations.
XYZ = 101: Introduced, not extinct and doubtful locations.

We also add the column Dist_000_labels which uses the native, not extinct and not doubtful locations to create simple geographic labels (using generate_labels()).

To enrich a collection with information from WCVP we use enrich_collection() which allows enriching from WCVP, ICUN redlist and BGCI’s plant, as well as extracting information from the collection.

To enrich only wcvp information we can run

enriched_collection = enrich_collection(collection,
                          wcvp = wcvp,
                          iucnRedlist = NA,
                          BGCI = NA,
                          taxon_name_column = 'TaxonName',
                          taxon_name_full_column = 'TaxonNameFull',
                          taxon_author_column = NA)

This has similar inputs to enrich_collection_from_enrich_database() however we do not need to specify enrichment specific columns. For further information see the function’s documentation. The columns you want to enrich from WCVP can be altered using the input wcvp_wanted_info.

It may be advantageous to enrich the WCVP database with matches to IUCN’s redlist directly to get the threatened status of records in WCVP. This may be helpful to determine if updating to the accepted name will affect the threatened status of taxa.

We show how to do this below where we only try to match accepted (or those without an accepted name) records within WCVP:

### Assume we have loaded into the environment:
### wcvp database "wcvp"
### IUCN red list database "IUCN_redlist"

# 1) Reduce to wcvp_names (we don't need geography information)
wcvp_names = wcvp$wcvp_names

# 2) Find the records that are accepted names or do not have an accepted name.
wanted_index = c(which(wcvp_names$plant_name_id %in% unique(wcvp_names$accepted_plant_name_id)),
                 which(is.na(wcvp_names$accepted_plant_name_id)))

# 3) Create a 'collection' from wcvp_names of the taxonomic names and authors we want to match to IUCN redlist.
wcvp_care = data.frame(plant_name_id = wcvp_names$plant_name_id[wanted_index],
                       taxon_name = wcvp_names$taxon_name[wanted_index],
                       taxon_author = wcvp_names$taxon_authors[wanted_index])

# 4) Match to red list.
match_details = BGSmartR::match_collection_to_iucnRedlist(collection = wcvp_care,
                                                          iucnRedlist = IUCN_redlist,
                                          taxon_name_column = 'taxon_name',
                                          taxon_author_column = 'taxon_author')

# 5) Create Red_info containing the IUCN red list information for the matches to wcvp_care.
Red_info = data.frame(matrix(NA, nrow = nrow(wcvp_care), ncol = ncol(IUCN_redlist)))
have_match = which(match_details$match>0)
Red_info[have_match,] = IUCN_redlist[match_details$match[have_match],]
names(Red_info) = names(IUCN_redlist)

# 6) Create a data frame of the matched red list information together with matching information.
wcvp_red = data.frame(wcvp_care,  # The plant name id, taxonomic name and author from wcvp.
                      details = match_details$details,
                      details_short = match_details$details_short,
                      match_taxon_name = match_details$match_taxon_name,
                      original_authors = match_details$original_authors,
                      match_authors = match_details$match_authors,
                      author_check = match_details$author_check,
                      Red_info  
                      )

# 7) Remove those when a match isn't found (i.e red list category = NA)
wcvp_red = wcvp_red[!is.na(wcvp_red$category),]

# 8) Add to wcvp as the item `redList`.
wcvp$redList = wcvp_red

# Finished ready for matching.

enrich_collection() is in-built to check whether wcvp contains $redList and to enrich information from it, if provided.

Creating IUCN redlist ready for matching

IUCN redlist is the resource for retrieving the most up-to-date threatened status of items in your collection.

To obtain the full IUCN red list database we use the rredlist package. To use this package you will need to obtain a key/token to be able to use the API (see here).

To load and prepare the data we use

# 1) Enter your redlist key.
key = "//KEY\\"

# 2) Use `rl_sp` function to get all records from red list
# (code taken from the documentation)
out <- rredlist::rl_sp(all = TRUE, key=key)
length(out)
vapply(out, "[[", 1, "count")
all_df <- do.call(rbind, lapply(out, "[[", "result"))

# 3) Restrict to only plants.
IUCN_redlist = all_df[all_df$kingdom_name == "PLANTAE",]

# 4) Prepare the data.
IUCN_redlist = BGSmartR::prepare_enrich_database(enrich_database = IUCN_redlist,
                                  enrich_taxon_name_column = 'scientific_name',
                                  enrich_taxon_authors_column = 'taxonomic_authority',
                                  do_add_id = FALSE,
                                  console_message = TRUE)

The method outlined above will provide most fields found from IUCN red list’s website, such as the category, main_common_name and population. However, it does not include information such as the published_year, assessment_date, criteria, etc.

If this additional information is wanted loop over all the taxonomic names found above with the function rredlist::rl_search() which will retrieve extra information. Note that this approach is very slow (If there is a better way to do this please let us know).

To enrich to only IUCN red list information we can run

enriched_collection = enrich_collection(collection,
                          wcvp = NA,
                          iucnRedlist = IUCN_redlist,
                          BGCI = NA,
                          taxon_name_column = 'TaxonName',
                          taxon_name_full_column = 'TaxonNameFull',
                          taxon_author_column = NA)

This is similar to enriching with WCVP only.

Note that you can enrich to both WCVP and IUCN red list at once rather then separately by having inputs for both wcvp and iucnRedlist.