Data associated with MGRs
How do scientists get data associated with samples?
There are different types of data. One type is associated with the collection of the sample. Scientists record various data at that point, for example the GPS coordinates of where the sample was collected, temperature, salinity, etc. A second type comes from metagenomics, a very powerful way to characterize marine genetic resources (MGRs), particularly marine microbes, by sequencing their DNA en masse. Metagenomics can help determine which organisms are there, what they are doing, with whom, and why. The data generated from DNA extracted from samples are sometimes called eDNA, or environmental DNA.
As discussed in the Physical Materials section, analysis of microorganisms (with a sample that could contain millions of organisms) takes a significant amount of work.
Typical steps are:
1. Extract DNA from the samples
2. Feed the DNA into a sequencer
3. Run quality control on the sequences generated
4. Perform computational analysis, aiming first to understand which sequences came from which organisms and what enzymes/functions they encode
5. Derive new knowledge
6. Publish and send the sequences to DNA sequence repositories (databases of the International Nucleotide Sequence Database Collaboration) with links to metadata in other public databases
The computational analysis in step 4 is not trivial because each sample typically has thousands of organisms (many of which are unknown) and each organism contains thousands of genes (the majority of which are unknown). It’s a bit like trying to build a jigsaw puzzle containing trillions of pieces without knowing the picture!
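As a rough illustration of the quality control and computational analysis steps described above, the following is a minimal Python sketch, not a real metagenomics pipeline. The reference motifs, organism names, reads, and the minimum-length threshold are all hypothetical values invented for this example; real analyses compare sequences against large reference databases with specialized tools. The sketch filters out poor reads, then labels each remaining read with a known organism or with "unknown".

```python
from collections import Counter

# Hypothetical reference motifs: short DNA strings assumed (for this toy
# example only) to identify a known organism.
REFERENCE_MOTIFS = {
    "organism_A": "ATGGCC",
    "organism_B": "TTGACA",
}


def passes_quality_control(read, min_length=8):
    """Keep reads that are long enough and contain only the bases A, C, G, T."""
    return len(read) >= min_length and set(read) <= set("ACGT")


def assign_organism(read):
    """Return the first known organism whose motif occurs in the read, else 'unknown'."""
    for organism, motif in REFERENCE_MOTIFS.items():
        if motif in read:
            return organism
    return "unknown"


def summarize(reads):
    """Count how many quality-checked reads are assigned to each label."""
    clean_reads = [read for read in reads if passes_quality_control(read)]
    return Counter(assign_organism(read) for read in clean_reads)


# Invented example reads standing in for sequencer output.
reads = [
    "ATGGCCTTAGGA",  # contains the organism_A motif
    "CCTTGACAGGTT",  # contains the organism_B motif
    "GGGGCCCCAAAA",  # matches no known motif
    "ATGNNNCCTTAG",  # rejected: contains ambiguous base N
    "ACGT",          # rejected: too short
]

summary = summarize(reads)
print(dict(summary))
```

Even in this toy case, one of the three usable reads ends up labeled "unknown". In real samples that share is typically much larger, which is part of why the analysis is not trivial.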
What can the computer analysis tell scientists?
Computer analysis can yield data at different levels: genes, genomes of individual organisms, communities of individual organisms that live together in a sample of seawater, and regions of ocean that collectively compose the global ocean.
Let’s use the UN Convention on the Law of the Sea (UNCLOS) (it has 320 articles!) as an example to draw an analogy:
Let’s say that DNA is described as x number of A's, y number of B's, z number of C's etc.
Scientists can ask the computer to:
define a complete set of words from the data, then
define the sentences, then
define paragraphs, then
define articles of UNCLOS.
In this case,
1. words would be genes
2. sentences would be genomes of individual organisms
3. paragraphs would be communities of individual organisms that live together in a sample of seawater, and
4. articles of UNCLOS would be regions of ocean that collectively compose the global ocean.
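The four levels in this analogy can also be pictured as a nested data structure. The sketch below is only an illustration in Python: the class names, gene names, organism names, sample IDs, and region names are hypothetical and do not come from any real dataset or database schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Genome:
    """A 'sentence': the genes (the 'words') of one organism."""
    organism: str
    genes: List[str]


@dataclass
class Community:
    """A 'paragraph': the organisms found together in one seawater sample."""
    sample_id: str
    genomes: List[Genome]


@dataclass
class Region:
    """An 'article': the communities sampled within one region of ocean."""
    name: str
    communities: List[Community]


def count_levels(regions):
    """Count the items at each level of the hierarchy."""
    communities = [c for region in regions for c in region.communities]
    genomes = [g for community in communities for g in community.genomes]
    genes = [gene for genome in genomes for gene in genome.genes]
    return {
        "regions": len(regions),
        "communities": len(communities),
        "genomes": len(genomes),
        "genes": len(genes),
    }


# Invented example data.
global_ocean = [
    Region("region_1", [
        Community("sample_001", [
            Genome("microbe_X", ["gene_1", "gene_2"]),
            Genome("microbe_Y", ["gene_3"]),
        ]),
    ]),
    Region("region_2", [
        Community("sample_002", [Genome("microbe_Z", ["gene_4", "gene_5"])]),
    ]),
]

levels = count_levels(global_ocean)
print(levels)
```

A structure like this only organizes the data by level. Interpreting what the genes, genomes, and communities mean is still left to scientists.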
It is important to note that while a computer may be able to analyze data at various levels (from genes to regions of the ocean), scientists still need to derive new knowledge from the results to make a meaningful contribution to science. For further discussion on the utilization of data, see traceability/transparency of data in the Phases and Stakeholders section of this website.