Writing your own queries for the local ECOTOX database is often much
more effective than applying the one-size-fits-all function
search_ecotox()
. But writing custom queries requires your
to understand the structure of the ECOTOX database.
The database’s relational structure
The diagram below shows which tables are available in the ECOTOX database and how they are related. The arrows are labelled with the key field that connects the two tables. The structure may seem a bit intimidating but once you realise which tables are most relevant, you can screen the diagram with more focus. First of all, tables containing actual data are coloured light yellow. All other tables (coloured light blue) contain lookup information. In general most users will be interested in the ‘tests’ table (describing test setup) and the ‘results’ table (describing the test results).
You can also list the tables in the database using
DBI::dbListTables()
. To list table fields, you can use
list_ecotox_fields("all", FALSE)
.
When constructing a query, you should ask yourself what information do I wish to retrieve, and how do I restrict/filter the data to the scope of your research? For that purpose, first focus on the yellow coloured tables, then join information from the lookup tables if required for your purpose.
Setting up queries: some examples
For setting up custom queries, I recommend to use tidyverse packages like dplyr. This is why in the
following examples the dplyr
package is used to construct
queries. The examples intend to show how the diagram above can be used
to collect information from the database.
All test results for tests carried out with Insecta
In this case you want to restrict your data to species that are in the class of Insecta.
library(dplyr)
if (check_ecotox_availability()) {
con <- dbConnectEcotox()
insects <-
tbl(con, "species") |>
filter(class == "Insecta") |>
## Specify which fields you want from the table.
## Make sure to include the key that links with
## other tables ('species_number')
select(species_number, common_name, latin_name)
}
As you can see in the diagram of the database schema, the table ‘species’ is not directly linked to results. Instead, it is linked to the table ‘tests’ which in turn is linked to ‘results’. This means you have to join the data to these tables in this order, using the appropriate keys.
if (check_ecotox_availability()) {
results <-
insects |>
## First join the 'tests' table
left_join(
tbl(con, "tests") |> select(test_id, species_number),
by = "species_number"
) |>
## Then join the 'results' table
left_join(
tbl(con, "results"),
by = "test_id"
) |>
## dplyr performs queries lazily, call
## 'collect' to actually get the results
collect()
close(con)
}
A random substance and random species
Although not a very realistic case, this example shows how you can pick a chemical and a species and see if there are any test results available for their combination. Note that the chance that any random chemical is tested with any random species is very small. Therefore, in most cases the example will yield zero test results.
The point of this example is that you can elegantly use the different
join types (like left, right, inner and anti; see
?dplyr::join
) to efficiently join tables from the database.
Note that search_ecotox()
creates a query based on the
provided search terms and output fields, and ‘blindly’ uses many left
joins to get the information from the database. For complex queries,
this is not a very efficient method to obtain data from the
database.
if (check_ecotox_availability()) {
con <- dbConnectEcotox()
## pick a random species
species <-
tbl(con, "species") |>
mutate(random = runif(n())) |>
slice_min(n = 1, order_by = random) |>
select(!random)
## pick a random chemical
chem <-
tbl(con, "chemicals") |>
rename(test_cas = "cas_number") |>
mutate(random = runif(n())) |>
slice_min(n = 1, order_by = random) |>
select(!random)
## join the randomly picked species and test chemical
## with the 'tests' and 'results' table
results <-
tbl(con, "tests") |>
select(test_id, test_cas, species_number) |>
right_join(species,
by = "species_number") |>
right_join(chem,
by = "test_cas") |>
inner_join(tbl(con, "results") |>
select(1:10),
by = "test_id") |>
collect()
close(con)
}
Dose information
Not all information has one-to-one or many-to-one relationship. In the case of dose, there is a many-to-many relationship with tests and therewith test results. This means if the dose information from one test is joined with test results, like in the example below.
if (check_ecotox_availability()) {
con <- dbConnectEcotox()
dose_info <-
tbl(con, "doses") |>
## Let's select the dose information for the first test in the database
filter(test_id == 1) |>
left_join(tbl(con, "dose_response_details"), by = "dose_id") |>
left_join(tbl(con, "dose_responses"), by = c("dose_resp_id", "test_id")) |>
left_join(tbl(con, "dose_response_links"), by = "dose_resp_id") |>
left_join(tbl(con, "results"), by = c("result_id", "test_id")) |>
collect()
close(con)
}
Indeed, the query below yields multiple doses for a single test
result. Depending on how you wish to further process the data, you may
want to: aggregate; nest; or pivot the dose information to single test
results. When using search_ecotox()
, the dose information
is automatically nested:
## In this example dose information is automatically nested
if (check_ecotox_availability()) {
doses <-
search_ecotox(
list(
result_id = list(
terms = 1182449,
method = "exact")),
output_fields = c("results.result_id", "doses.dose_number"))
}
Automatically generating queries
With a call to search_query_ecotox()
you can retrieve
the SQL query underpinning a search using search_ecotox()
.
As pointed out earlier, this method is very generic and may not provide
the most elegant query. But it can be used for inspiration and
understanding the database.
Overview of tables and fields
The table below lists all tables in the database and its fields. Primary and foreign keys are in bold face. Use the diagram to see how the tables are linked with these keys. The table below only lists the available fields in each table. Use the EPA ECOTOX documentation provided with the downloaded data for detailed information on the fields.
table | fields |
---|---|
chemical_carriers
|
carrier_id |
test_id | |
cas_number | |
chem_name | |
purpose | |
grade | |
grade_comments | |
formulation | |
formulation_comments | |
radiolabel | |
radiolabel_comments | |
purity_mean_op | |
purity_mean | |
purity_min_op | |
purity_min | |
purity_max_op | |
purity_max | |
purity_comments | |
characteristics | |
tests
|
test_id |
reference_number | |
test_cas | |
test_grade | |
test_grade_comments | |
test_formulation | |
test_formulation_comments | |
test_radiolabel | |
test_radiolabel_comments | |
test_purity_mean_op | |
test_purity_mean | |
test_purity_min_op | |
test_purity_min | |
test_purity_max_op | |
test_purity_max | |
test_purity_comments | |
test_characteristics | |
species_number | |
organism_habitat | |
organism_source | |
organism_source_comments | |
organism_lifestage | |
organism_lifestage_comments | |
organism_age_mean_op | |
organism_age_mean | |
organism_age_min_op | |
organism_age_min | |
organism_age_max_op | |
organism_age_max | |
organism_age_unit | |
organism_init_wt_mean_op | |
organism_init_wt_mean | |
organism_init_wt_min_op | |
organism_init_wt_min | |
organism_init_wt_max_op | |
organism_init_wt_max | |
organism_init_wt_unit | |
organism_characteristics | |
organism_gender | |
organism_length_mean_op | |
organism_length_mean | |
organism_length_min_op | |
organism_length_min | |
organism_length_max_op | |
organism_length_max | |
organism_length_unit | |
organism_length_type | |
organism_strain | |
experimental_design | |
study_duration_mean_op | |
study_duration_mean | |
study_duration_min_op | |
study_duration_min | |
study_duration_max_op | |
study_duration_max | |
study_duration_unit | |
study_duration_comments | |
exposure_duration_mean_op | |
exposure_duration_mean | |
exposure_duration_min_op | |
exposure_duration_min | |
exposure_duration_max_op | |
exposure_duration_max | |
exposure_duration_unit | |
exposure_duration_comments | |
study_type | |
study_type_comments | |
test_type | |
test_type_comments | |
test_location | |
test_location_comments | |
test_method | |
test_method_comments | |
exposure_type | |
exposure_type_comments | |
control_type | |
control_type_comments | |
media_type | |
media_type_comments | |
num_doses_mean_op | |
num_doses_mean | |
num_doses_min_op | |
num_doses_min | |
num_doses_max_op | |
num_doses_max | |
num_doses_comments | |
other_effect_comments | |
application_freq_mean_op | |
application_freq_mean | |
application_freq_min_op | |
application_freq_min | |
application_freq_max_op | |
application_freq_max | |
application_freq_unit | |
application_freq_comments | |
application_type | |
application_type_comments | |
application_rate | |
application_rate_unit | |
application_date | |
application_date_comments | |
application_season | |
application_season_comments | |
subhabitat | |
subhabitat_description | |
substrate | |
substrate_description | |
water_depth_mean_op | |
water_depth_mean | |
water_depth_min_op | |
water_depth_min | |
water_depth_max_op | |
water_depth_max | |
water_depth_unit | |
water_depth_comments | |
geographic_code | |
geographic_location | |
latitude | |
longitude | |
halflife_mean_op | |
halflife_mean | |
halflife_min_op | |
halflife_min | |
halflife_max_op | |
halflife_max | |
halflife_unit | |
halflife_comments | |
additional_comments | |
created_date | |
modified_date | |
published_date | |
dose_response_details
|
dose_resp_detail_id |
dose_resp_id | |
dose_id | |
response_mean_op | |
response_mean | |
response_min_op | |
response_min | |
response_max_op | |
response_max | |
statistical_method | |
statistical_method_comments | |
response_stat_value | |
response_stat_value_comments | |
sample_size_mean_op | |
sample_size_mean | |
sample_size_min_op | |
sample_size_min | |
sample_size_max_op | |
sample_size_max | |
significance_code | |
significance_type | |
significance_level_mean_op | |
significance_level_mean | |
significance_level_min_op | |
significance_level_min | |
significance_level_max_op | |
significance_level_max | |
significance_comments | |
chem_analysis_method | |
chem_analysis_method_comments | |
dry_wet | |
dry_wet_pct_mean_op | |
dry_wet_pct_mean | |
dry_wet_pct_min_op | |
dry_wet_pct_min | |
dry_wet_pct_max_op | |
dry_wet_pct_max | |
dry_wet_pct_comments | |
lipid_pct_mean_op | |
lipid_pct_mean | |
lipid_pct_min_op | |
lipid_pct_min | |
lipid_pct_max_op | |
lipid_pct_max | |
lipid_pct_comments | |
dose_resp_detail_comment | |
dose_response_links
|
dose_resp_link_id |
result_id | |
dose_resp_id | |
dose_responses
|
|
test_id | |
sample_size_unit | |
sample_size_comments | |
effect_code | |
measurement_code | |
measurement_comments | |
response_site | |
response_site_comments | |
obs_duration_mean_op | |
obs_duration_mean | |
obs_duration_min_op | |
obs_duration_min | |
obs_duration_max_op | |
obs_duration_max | |
obs_duration_unit | |
obs_duration_comments | |
response_unit | |
response_comments | |
additional_comments | |
doses
|
dose_id |
test_id | |
dose_number | |
dose_conc_unit | |
dose_comments | |
control_type | |
number_organisms_mean_op | |
number_organisms_mean | |
number_organisms_min_op | |
number_organisms_min | |
number_organisms_max_op | |
number_organisms_max | |
number_replicates | |
dose1_conc_type | |
dose1_mean_op | |
dose1_mean | |
dose1_min_op | |
dose1_min | |
dose1_max_op | |
dose1_max | |
dose1_stat_method | |
dose1_stat_value | |
dose2_conc_type | |
dose2_mean_op | |
dose2_mean | |
dose2_min_op | |
dose2_min | |
dose2_max_op | |
dose2_max | |
dose2_stat_method | |
dose2_stat_value | |
dose3_conc_type | |
dose3_mean_op | |
dose3_mean | |
dose3_min_op | |
dose3_min | |
dose3_max_op | |
dose3_max | |
dose3_stat_method | |
dose3_stat_value | |
media_characteristics
|
result_id |
soil_type | |
media_texture_sand_mean_op | |
media_texture_sand_mean | |
media_texture_sand_min_op | |
media_texture_sand_min | |
media_texture_sand_max_op | |
media_texture_sand_max | |
media_texture_silt_mean_op | |
media_texture_silt_mean | |
media_texture_silt_min_op | |
media_texture_silt_min | |
media_texture_silt_max_op | |
media_texture_silt_max | |
media_texture_clay_mean_op | |
media_texture_clay_mean | |
media_texture_clay_min_op | |
media_texture_clay_min | |
media_texture_clay_max_op | |
media_texture_clay_max | |
media_texture_comments | |
media_conc_measured | |
media_conc_dry_wet | |
media_conc_comments | |
media_moisture_pct_mean_op | |
media_moisture_pct_mean | |
media_moisture_pct_min_op | |
media_moisture_pct_min | |
media_moisture_pct_max_op | |
media_moisture_pct_max | |
media_moisture_comments | |
media_cec_mean_op | |
media_cec_mean | |
media_cec_min_op | |
media_cec_min | |
media_cec_max_op | |
media_cec_max | |
media_cec_unit | |
media_cec_comments | |
media_ph_mean_op | |
media_ph_mean | |
media_ph_min_op | |
media_ph_min | |
media_ph_max_op | |
media_ph_max | |
media_ph_comments | |
media_org_matter_mean_op | |
media_org_matter_mean | |
media_org_matter_min_op | |
media_org_matter_min | |
media_org_matter_max_op | |
media_org_matter_max | |
media_org_matter_unit | |
media_org_matter_type | |
media_org_matter_comments | |
media_temperature_mean_op | |
media_temperature_mean | |
media_temperature_min_op | |
media_temperature_min | |
media_temperature_max_op | |
media_temperature_max | |
media_temperature_unit | |
media_temperature_comments | |
media_alkalinity_mean_op | |
media_alkalinity_mean | |
media_alkalinity_min_op | |
media_alkalinity_min | |
media_alkalinity_max_op | |
media_alkalinity_max | |
media_alkalinity_unit | |
media_alkalinity_comments | |
media_hardness_mean_op | |
media_hardness_mean | |
media_hardness_min_op | |
media_hardness_min | |
media_hardness_max_op | |
media_hardness_max | |
media_hardness_unit | |
media_hardness_comments | |
dissolved_oxygen_mean_op | |
dissolved_oxygen_mean | |
dissolved_oxygen_min_op | |
dissolved_oxygen_min | |
dissolved_oxygen_max_op | |
dissolved_oxygen_max | |
dissolved_oxygen_unit | |
dissolved_oxygen_comments | |
media_salinity_mean_op | |
media_salinity_mean | |
media_salinity_min_op | |
media_salinity_min | |
media_salinity_max_op | |
media_salinity_max | |
media_salinity_unit | |
media_salinity_comments | |
media_conductivity_mean_op | |
media_conductivity_mean | |
media_conductivity_min_op | |
media_conductivity_min | |
media_conductivity_max_op | |
media_conductivity_max | |
media_conductivity_unit | |
media_conductivity_comments | |
media_org_carbon_mean_op | |
media_org_carbon_mean | |
media_org_carbon_min_op | |
media_org_carbon_min | |
media_org_carbon_max_op | |
media_org_carbon_max | |
media_org_carbon_unit | |
media_org_carbon_comments | |
media_humic_acid_mean_op | |
media_humic_acid_mean | |
media_humic_acid_min_op | |
media_humic_acid_min | |
media_humic_acid_max_op | |
media_humic_acid_max | |
media_humic_acid_unit | |
media_humic_acid_comments | |
media_calcium_mean_op | |
media_calcium_mean | |
media_calcium_min_op | |
media_calcium_min | |
media_calcium_max_op | |
media_calcium_max | |
media_calcium_unit | |
media_calcium_comments | |
media_magnesium_mean_op | |
media_magnesium_mean | |
media_magnesium_min_op | |
media_magnesium_min | |
media_magnesium_max_op | |
media_magnesium_max | |
media_magnesium_unit | |
media_magnesium_comments | |
media_sodium_mean_op | |
media_sodium_mean | |
media_sodium_min_op | |
media_sodium_min | |
media_sodium_max_op | |
media_sodium_max | |
media_sodium_unit | |
media_sodium_comments | |
media_potassium_mean_op | |
media_potassium_mean | |
media_potassium_min_op | |
media_potassium_min | |
media_potassium_max_op | |
media_potassium_max | |
media_potassium_unit | |
media_potassium_comments | |
media_sulfate_mean_op | |
media_sulfate_mean | |
media_sulfate_min_op | |
media_sulfate_min | |
media_sulfate_max_op | |
media_sulfate_max | |
media_sulfate_unit | |
media_sulfate_comments | |
media_chlorine_mean_op | |
media_chlorine_mean | |
media_chlorine_min_op | |
media_chlorine_min | |
media_chlorine_max_op | |
media_chlorine_max | |
media_chlorine_unit | |
media_chlorine_comments | |
media_diss_carbon_mean_op | |
media_diss_carbon_mean | |
media_diss_carbon_min_op | |
media_diss_carbon_min | |
media_diss_carbon_max_op | |
media_diss_carbon_max | |
media_diss_carbon_unit | |
media_diss_carbon_comments | |
media_sulfur_mean_op | |
media_sulfur_mean | |
media_sulfur_min_op | |
media_sulfur_min | |
media_sulfur_max_op | |
media_sulfur_max | |
media_sulfur_unit | |
media_sulfur_comments | |
references
|
reference_number |
reference_db | |
reference_type | |
author | |
title | |
source | |
publication_year | |
doi | |
results
|
result_id |
test_id | |
sample_size_mean_op | |
sample_size_mean | |
sample_size_min_op | |
sample_size_min | |
sample_size_max_op | |
sample_size_max | |
sample_size_unit | |
sample_size_comments | |
obs_duration_mean_op | |
obs_duration_mean | |
obs_duration_min_op | |
obs_duration_min | |
obs_duration_max_op | |
obs_duration_max | |
obs_duration_unit | |
obs_duration_comments | |
endpoint | |
endpoint_comments | |
trend | |
effect | |
effect_comments | |
measurement | |
measurement_comments | |
response_site | |
response_site_comments | |
effect_pct_mean_op | |
effect_pct_mean | |
effect_pct_min_op | |
effect_pct_min | |
effect_pct_max_op | |
effect_pct_max | |
effect_pct_comments | |
conc1_type | |
ion1 | |
conc1_mean_op | |
conc1_mean | |
conc1_min_op | |
conc1_min | |
conc1_max_op | |
conc1_max | |
conc1_unit | |
conc1_comments | |
conc2_type | |
ion2 | |
conc2_mean_op | |
conc2_mean | |
conc2_min_op | |
conc2_min | |
conc2_max_op | |
conc2_max | |
conc2_unit | |
conc2_comments | |
conc3_type | |
ion3 | |
conc3_mean_op | |
conc3_mean | |
conc3_min_op | |
conc3_min | |
conc3_max_op | |
conc3_max | |
conc3_unit | |
conc3_comments | |
bcf1_mean_op | |
bcf1_mean | |
bcf1_min_op | |
bcf1_min | |
bcf1_max_op | |
bcf1_max | |
bcf1_unit | |
bcf1_comments | |
bcf2_mean_op | |
bcf2_mean | |
bcf2_min_op | |
bcf2_min | |
bcf2_max_op | |
bcf2_max | |
bcf2_unit | |
bcf2_comments | |
bcf3_mean_op | |
bcf3_mean | |
bcf3_min_op | |
bcf3_min | |
bcf3_max_op | |
bcf3_max | |
bcf3_unit | |
bcf3_comments | |
significance_code | |
significance_type | |
significance_level_mean_op | |
significance_level_mean | |
significance_level_min_op | |
significance_level_min | |
significance_level_max_op | |
significance_level_max | |
significance_comments | |
chem_analysis_method | |
chem_analysis_method_comments | |
endpoint_assigned | |
organism_final_wt_mean_op | |
organism_final_wt_mean | |
organism_final_wt_min_op | |
organism_final_wt_min | |
organism_final_wt_max_op | |
organism_final_wt_max | |
organism_final_wt_unit | |
organism_final_wt_comments | |
intake_rate_mean_op | |
intake_rate_mean | |
intake_rate_min_op | |
intake_rate_min | |
intake_rate_max_op | |
intake_rate_max | |
intake_rate_unit | |
intake_rate_comments | |
lipid_pct_mean_op | |
lipid_pct_mean | |
lipid_pct_min_op | |
lipid_pct_min | |
lipid_pct_max_op | |
lipid_pct_max | |
lipid_pct_comments | |
dry_wet | |
dry_wet_pct_mean_op | |
dry_wet_pct_mean | |
dry_wet_pct_min_op | |
dry_wet_pct_min | |
dry_wet_pct_max_op | |
dry_wet_pct_max | |
dry_wet_pct_comments | |
steady_state | |
additional_comments | |
companion_tag | |
old_terretox_result_number | |
created_date | |
modified_date | |
species
|
species_number |
common_name | |
latin_name | |
kingdom | |
phylum_division | |
subphylum_div | |
superclass | |
class | |
tax_order | |
family | |
genus | |
species | |
subspecies | |
variety | |
ecotox_group | |
ncbi_taxid | |
species_synonyms
|
species_number |
latin_name | |
chemicals
|
cas_number |
chemical_name | |
ecotox_group | |
dtxsid | |
application_frequency_codes
|
code |
description | |
application_type_codes
|
code |
description | |
chemical_analysis_codes
|
code |
description | |
chemical_formulation_codes
|
code |
description | |
chemical_grade_codes
|
code |
description | |
concentration_type_codes
|
code |
description | |
concentration_unit_codes
|
code |
description | |
control_type_codes
|
code |
description | |
dose_stat_method_codes
|
code |
description | |
duration_unit_codes
|
code |
description | |
effect_codes
|
code |
description | |
endpoint_assigned_codes
|
code |
description | |
endpoint_codes
|
code |
description | |
exposure_type_codes
|
code |
description | |
field_study_type_codes
|
code |
description | |
gender_codes
|
code |
description | |
geographic_codes
|
code |
description | |
habitat_codes
|
code |
description | |
ion_codes
|
code |
description | |
lifestage_codes
|
code |
description | |
measurement_codes
|
code |
description | |
media_char_unit_codes
|
code |
description | |
media_type_codes
|
code |
description | |
organic_matter_type_codes
|
code |
description | |
organism_source_codes
|
code |
description | |
radio_label_codes
|
code |
description | |
response_site_codes
|
code |
description | |
sample_size_unit_codes
|
code |
description | |
season_codes
|
code |
description | |
statistical_significance_codes
|
code |
description | |
substrate_codes
|
code |
description | |
test_location_codes
|
code |
description | |
test_method_codes
|
code |
description | |
test_type_codes
|
code |
description | |
trend_codes
|
code |
description | |
weight_unit_codes
|
code |
description | |
length_unit_codes
|
code |
description | |
length_type_codes
|
code |
description |