Matching functions — match_single • BGSmartR

Functions used to match taxonomic names from a collection to external databases (POWO's WCVP, IUCN Red List).

Usage

match_single(
  taxon_names,
  enrich_database,
  enrich_database_search_index,
  enrich_taxon_name_column = "taxon_name",
  enrich_display_in_message_column = "ID",
  match_column = NA,
  ...
)

match_multiple(
  taxon_names,
  taxon_authors,
  enrich_database,
  enrich_database_search_index,
  enrich_taxon_name_column = "taxon_name",
  enrich_display_in_message_column = "ID",
  enrich_plant_identifier_column = "ID",
  match_column = NA,
  ...,
  show_progress = TRUE
)

match_all_issue(
  taxon_names,
  taxon_authors = rep(NA, length(taxon_names)),
  enrich_database,
  matching_authors = BGSmartR::match_authors,
  matching_criterion = BGSmartR::additional_wcvp_matching,
  do_add_split = TRUE,
  do_fix_hybrid = TRUE,
  do_rm_autonym = TRUE,
  enrich_taxon_name_column = "taxon_name",
  enrich_taxon_authors_column = "taxon_authors_simp",
  enrich_plant_identifier_column = "ID",
  enrich_display_in_message_column = "ID",
  ...
)

match_typos(
  taxon_names,
  taxon_authors,
  enrich_database,
  enrich_taxon_name_column = "taxon_name",
  single_indices = NA,
  mult_indices = NA,
  typo_method = "Data frame only",
  do_match_multiple = TRUE,
  ...
)

no_match_cultivar_indet(taxon_names)

get_match_from_multiple(
  taxon_name_and_author,
  enrich_database_mult,
  matching_authors = BGSmartR::match_authors,
  matching_criterion = BGSmartR::no_additional_matching,
  enrich_plant_identifier_column = "plant_name_id",
  enrich_taxon_name_column = "taxon_name",
  enrich_taxon_authors_column = "taxon_authors_simp",
  enrich_taxon_author_words_column = "author_parts",
  ...
)

check_taxon_typo(
  taxon_name,
  enrich_database = NA,
  enrich_taxon_name_column = "taxon_name",
  typo_df = BGSmartR::typo_list,
  typo_method = "Data frame only",
  ...
)

shorten_message(messages)

try_rm_autonym(
  taxon_names,
  enrich_database_taxon_names,
  console_message = TRUE,
  ...
)

try_fix_infraspecific_level(
  taxon_names,
  enrich_database_taxon_names,
  try_hybrid = TRUE,
  console_message = TRUE,
  ...
)

try_fix_hybrid(
  taxon_names,
  enrich_database_taxon_names,
  try_hybrid = TRUE,
  console_message = TRUE,
  ...
)

Arguments

taxon_names: Vector of taxonomic names.
enrich_database: A data frame of enriching information we want to match taxon_names to.
enrich_database_search_index: A vector of indices of enrich_database that are desired to be matched to.
enrich_taxon_name_column: The name of the column in enrich_database that corresponds to taxonomic names. Default value is taxon_name.
enrich_display_in_message_column: The name of the column in enrich_database that contains values to show in the matching messages. Default value is ID.
match_column: Either NA or the name of a column in enrich_database. The default value is NA, which means the values of the match are the indices of the matched records in enrich_database. To return the values of a single column of enrich_database instead, provide the name of that column.
...: Arguments (i.e., attributes) used in the matching algorithm (passed along to nested functions). Examples include enrich_taxon_authors_column, enrich_display_in_message_column and enrich_plant_identifier_column.
taxon_authors: A vector of full taxon names (corresponding to taxon_names)
enrich_plant_identifier_column: The name of the column in enrich_database that corresponds to the record identifier. Default value is ID in match_multiple() and match_all_issue(), and plant_name_id in get_match_from_multiple().
show_progress: Flag (TRUE/FALSE) for whether we show a progress bar.
matching_authors: The function used to find the best match using the author of taxonomic names. By default the function BGSmartR::match_authors() is used.
matching_criterion: The function used to find the best match when we have 'non-unique' taxonomic names. The default is BGSmartR::additional_wcvp_matching() in match_all_issue() and BGSmartR::no_additional_matching() in get_match_from_multiple().
do_add_split: Flag (TRUE/FALSE) for whether we search for missing f./var./subsp.
do_fix_hybrid: Flag (TRUE/FALSE) for whether we search for hybrid issues.
do_rm_autonym: Flag (TRUE/FALSE) for whether we try removing autonyms.
enrich_taxon_authors_column: The name of the column in enrich_database that corresponds to the authors of taxonomic names. Default value is taxon_authors_simp.
single_indices: A vector of indices of enrich_database that correspond to the records that have 'unique' taxonomic names.
mult_indices: A vector of indices of enrich_database that correspond to the records that have 'non-unique' taxonomic names.
typo_method: One of 'All', 'Data frame only' or 'Data frame + common', detailing the level of typo finding required. Default value is 'Data frame only'.
do_match_multiple: Flag (TRUE/FALSE) for whether we attempt matching those found to have multiple taxonomic names in the enrich database.
taxon_name_and_author: The pair of taxonomic name and combined taxonomic name and author.
enrich_database_mult: enrich_database restricted to the rows that correspond to 'non-unique' taxonomic names.
enrich_taxon_author_words_column: The name of the column in enrich_database that corresponds to the words contained in the authors of taxonomic names. Default value is author_parts.
taxon_name: A single taxonomic name.
typo_df: A data frame where the first column is a taxonomic name with a typo and the second column is the corrected taxonomic name. By default BGSmartR::typo_list is used.
messages: messages detailing how a match is obtained.
enrich_database_taxon_names: The taxon names taken from enrich_database.
console_message: Flag (TRUE/FALSE) detailing whether to show messages in the console.
try_hybrid: Flag (TRUE/FALSE) for whether hybrid fixes are attempted.
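
To illustrate the two kinds of result controlled by match_column, here is a minimal base R sketch. It does not call BGSmartR; the toy data frame and the use of match() are assumptions made purely for illustration, not the package's implementation.

```r
# Hypothetical enrich database (toy data, not taken from WCVP).
enrich_database <- data.frame(
  ID = c("a1", "a2", "a3"),
  taxon_name = c("Quercus robur", "Acer rubrum", "Fagus sylvatica"),
  stringsAsFactors = FALSE
)
taxon_names <- c("Acer rubrum", "Rosa canina", "Quercus robur")

# Analogue of match_column = NA: the result is the row index of each match.
idx <- match(taxon_names, enrich_database$taxon_name)

# Analogue of match_column = "ID": the result is the value of that column.
ids <- enrich_database[["ID"]][idx]

print(idx)
print(ids)
```

Unmatched names stay NA in both forms, so the two results can be converted into one another as long as the index vector is kept.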

Details

Below we outline the uses of each function. For further details and examples on matching functions please see the Matching.Rmd vignette.

Each of the matching functions generally returns the index of the matching record in enrich_database and a message detailing how the match was obtained. These functions can be used as building blocks to build a custom taxonomic name matching algorithm.

match_single() matches taxon_names to enrich_database, taking only the first match. enrich_database_search_index should be used to restrict the enrich database to only 'unique' taxonomic names (i.e., taxonomic names that correspond to a single record in the enrich database). For 'non-unique' taxonomic names match_multiple() should be used.
match_multiple() matches taxon_names to enrich_database for entries in the enrich database that have 'non-unique' taxonomic names. For 'unique' taxonomic names match_single() should be used. For 'non-unique' taxonomic names we first use taxonomic author matching to decide which record to use. This matching is performed for each taxonomic name and author using the function get_match_from_multiple(). get_match_from_multiple() further depends on a matching criterion function, which can be set using the input matching_criterion (passed via ...). According to the usage above, its default in get_match_from_multiple() is no_additional_matching(); additional_wcvp_matching(), the default in match_all_issue(), uses accepted_plant_name_id and taxon_status to choose the best match (in WCVP).
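
The division of labour between 'unique' and 'non-unique' names can be sketched in base R as follows. This is an illustration under stated assumptions, not the BGSmartR implementation: the toy data are hypothetical, and authors are compared by exact string equality, which is far cruder than what match_authors() is described as doing.

```r
# Hypothetical enrich database (toy data, not taken from WCVP).
enrich_database <- data.frame(
  ID = c("a1", "a2", "a3", "a4"),
  taxon_name = c("Quercus robur", "Acer rubrum", "Acer rubrum", "Fagus sylvatica"),
  taxon_authors_simp = c("L.", "L.", "Marshall", "L."),
  stringsAsFactors = FALSE
)

# A name is 'non-unique' if it occurs in more than one record.
nm <- enrich_database$taxon_name
is_mult <- duplicated(nm) | duplicated(nm, fromLast = TRUE)
single_indices <- which(!is_mult)
mult_indices <- which(is_mult)

taxon_names <- c("Fagus sylvatica", "Acer rubrum", "Rosa canina")
taxon_authors <- c("L.", "Marshall", "L.")

# 'Unique' names: take the first match within the restricted search index.
single_match <- single_indices[match(taxon_names, nm[single_indices])]

# 'Non-unique' names: use the author to pick one record (exact equality here,
# an assumption made to keep the sketch short).
multiple_match <- vapply(seq_along(taxon_names), function(i) {
  candidates <- mult_indices[nm[mult_indices] == taxon_names[i]]
  hit <- candidates[enrich_database$taxon_authors_simp[candidates] == taxon_authors[i]]
  if (length(hit) == 1) hit else NA_integer_
}, integer(1))

print(single_match)
print(multiple_match)
```

In this sketch a record is only returned when exactly one candidate agrees on the author; a matching criterion function is what would resolve the remaining ties.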
match_all_issue() attempts to fix hybridisation, change infraspecific levels or remove autonyms to find matches to an enriched database. This function depends on the functions:
- try_rm_autonym() attempts to find taxonomic names in enrich_database by removing autonyms.
- try_fix_infraspecific_level() attempts to find taxonomic names in enrich_database by adding/changing/removing infraspecific levels (var., f., etc).
- try_fix_hybrid() attempts to find taxonomic names in enrich_database by adding/changing/removing hybrid markers (+ or x).
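
The three kinds of fix listed above can be sketched with regular expressions in base R. The helper names (rm_autonym, swap_rank, add_hybrid, first_match) and the toy names are hypothetical, and the patterns are deliberately simplified; the package functions may handle many more cases.

```r
# Hypothetical names standing in for enrich_database_taxon_names.
enrich_database_taxon_names <- c(
  "Acer rubrum", "Pinus nigra subsp. laricio", "Mentha x piperita"
)
ranks <- c("subsp.", "var.", "f.")

# Remove an autonym: the infraspecific epithet repeats the species epithet.
rm_autonym <- function(name) {
  sub("^(\\S+) (\\S+) (subsp\\.|var\\.|f\\.) \\2$", "\\1 \\2", name, perl = TRUE)
}

# Change the infraspecific level: try each rank marker in turn.
swap_rank <- function(name) {
  vapply(ranks, function(r) {
    sub(" (subsp\\.|var\\.|f\\.) ", paste0(" ", r, " "), name)
  }, character(1), USE.NAMES = FALSE)
}

# Add a hybrid marker after the genus (ASCII "x" is used in this toy data).
add_hybrid <- function(name) sub("^(\\S+) ", "\\1 x ", name, perl = TRUE)

# Return the index of the first candidate found in the enrich database names.
first_match <- function(name) {
  candidates <- unique(c(rm_autonym(name), swap_rank(name), add_hybrid(name)))
  hits <- match(candidates, enrich_database_taxon_names)
  hits <- hits[!is.na(hits)]
  if (length(hits) > 0) hits[1] else NA_integer_
}

print(first_match("Acer rubrum var. rubrum"))
print(first_match("Pinus nigra var. laricio"))
print(first_match("Mentha piperita"))
```

Generating candidate spellings first and then looking them up keeps each fix independent, which mirrors how the three try_* functions are presented as separate building blocks.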
match_typos() attempts to find matches by searching for typos in the taxonomic name. This depends on the function:
- check_taxon_typo() checks a single taxonomic name for typos found either in a typo list or in the enrich database.
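
A typo check of this kind can be sketched in base R as a lookup in a two-column data frame (typo in the first column, correction in the second, as described for typo_df), with an optional fallback to the enrich database names. The fallback shown here, a single-edit search using utils::adist(), is an assumption for illustration and is not claimed to be what any typo_method setting does; fix_typo and use_database are hypothetical names.

```r
# Hypothetical typo list: first column is the typo, second the correction.
typo_df <- data.frame(
  typo = c("Quercus robor", "Fagus silvatica"),
  fixed = c("Quercus robur", "Fagus sylvatica"),
  stringsAsFactors = FALSE
)
enrich_names <- c("Quercus robur", "Fagus sylvatica", "Acer rubrum")

fix_typo <- function(taxon_name, use_database = FALSE) {
  # Step 1: look the name up in the typo list.
  pos <- match(taxon_name, typo_df[[1]])
  if (!is.na(pos)) return(typo_df[[2]][pos])

  # Step 2 (optional, assumed): accept a database name that is exactly one
  # edit away, but only if it is the only such name.
  if (use_database) {
    d <- utils::adist(taxon_name, enrich_names)[1, ]
    close <- which(d == 1)
    if (length(close) == 1) return(enrich_names[close])
  }
  NA_character_
}

print(fix_typo("Fagus silvatica"))
print(fix_typo("Acer rubrun", use_database = TRUE))
print(fix_typo("Acer rubrun"))
```

Restricting the search to the typo list is cheap and predictable, while searching the whole enrich database finds more candidates at a higher cost and with more risk of a wrong correction.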
no_match_cultivar_indet() searches for cultivars and indeterminates and sets their match to -1 indicating no match.
shorten_message() compresses matching messages (details of how a match is found) into an easy-to-read format.
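
As an illustration of flagging cultivars and indeterminates with -1, the base R sketch below uses two simple patterns. The patterns are assumptions chosen for this example (a quoted cultivar epithet or "cv.", and a trailing "sp." or the word "indet"); the rules used by no_match_cultivar_indet() may differ.

```r
taxon_names <- c("Malus domestica 'Gala'", "Quercus sp.", "Acer rubrum", "Indet")

# Assumed heuristics: quoted epithet or "cv." marks a cultivar.
is_cultivar <- grepl("'[^']+'|\\bcv\\.", taxon_names, perl = TRUE)

# Assumed heuristics: trailing "sp." or the word "indet" marks an indeterminate.
is_indet <- grepl("\\bsp\\.$|\\bindet\\b", taxon_names,
                  ignore.case = TRUE, perl = TRUE)

# -1 marks 'no match'; NA means the name is still to be matched.
match_result <- ifelse(is_cultivar | is_indet, -1, NA)

print(match_result)
```

Setting these names to -1 early means later, more expensive matching steps can skip them.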