Research Ideas and Outcomes : Forum Paper
The scope and scale of the life sciences (‘Nature’s envelope’)
David J Patterson
‡ University of Sydney, Sydney, Australia
Open Access

Abstract

The extension of biology with a more data-centric component offers new opportunities for discovery. To enable investigations that rely on third-party data, the infrastructure that retains data and allows their re-use should, arguably, enable transactions that relate to any and all biological processes. The assembly of such a service-oriented and enabling infrastructure is challenging. Part of the challenge is to factor in the scope and scale of biological processes. From this foundation can emerge an estimate of the number of discipline-specific centres which will gather data in their given area of interest and prepare them for a path that will lead to trusted, persistent data repositories which will make fit-for-purpose data available for re-use. A simple model is presented for the scope and scale of life sciences. It can accommodate all known processes conducted by or caused by any and all organisms. It is depicted on a grid, the axes of which are (x) the durations of the processes and (y) the sizes of participants involved. Both axes are presented in log10 scales, and the grid is divided into decadal blocks with tenfold increments of time and size. Processes range in duration from 10⁻¹⁷ seconds to 3.5 billion years or more, and the sizes of participants range from 10⁻¹⁵ to 1.3 × 10⁷ metres. Examples are given to illustrate the diversity of biological processes and their often inexact character. About half of the blocks within the grid do not contain known processes. The blocks that include biological processes amount to ‘Nature’s envelope’, a valuable rhetorical device onto which subdisciplines and existing initiatives may be mapped, and from which can be derived some key requirements for a comprehensive data infrastructure.

Keywords

Nature's envelope, scope of life sciences, scope of biological sciences, cyberinfrastructure, macroscope

Background

The growth of a data-rich and data-centric aspect of biology brings the prospect of new opportunities for discovery, both generally (National Research Council of the National Academies 2009, National Science Foundation Cyberinfrastructure Council 2007, National Science Foundation Office of Advanced Cyberinfrastructure 2020, OECD Megascience Forum Working Group on Biological Informatics 1999, Tansley and Tolle 2009) and in respect of particular disciplines (e.g. Hobern et al. 2019, Jones et al. 2006, Parr et al. 2012). Data-mining adds to the processes of deduction, induction, guesswork, reductionism, and experimentation; it may reveal new patterns, better describe known patterns, or direct attention to informative outliers. With associated improvements in computing power, it enables analyses that require so much data that they were previously impractical. Access to large quantities of data may reveal patterns that were not discernible before, and may stabilize or invalidate less certain insights. With appropriate interoperability, previously isolated disciplines can be interconnected to explore processes that extend across multiple scales. In many regards, the potential of a framework, toolkit, and personnel trained to take advantage of this new growth in biology closely corresponds with Joel de Rosnay’s vision for a ‘macroscope’, a device intended to analyse phenomena previously deemed to be too complex to allow any real progress (de Rosnay 1975).

The potential of data-centric developments will be realized best if scientists can call on an appropriate (cyber) infrastructure that makes data freely available in a ready-to-use form. Examples of existing environments include GenBank and the other members of the International Nucleotide Sequence Database Collaboration (Brunak et al. 2002, Federhen 2012, Karsch-Mizrachi et al. 2012) for molecular biology, and the Global Biodiversity Information Facility (GBIF) and the Ocean Biodiversity Information System (OBIS) for occurrence data (Heberling et al. 2021, Vanden Berghe et al. 2013). Such data aggregators can capture and standardize data, promote training and standards for new skill-sets (Palmer et al. 2007), foster a shift in conventions towards data-sharing and re-use of data, and set priorities (Hardisty 2013, Thessen and Patterson 2011). Once in place, discipline-based aggregation centres lead to new tools and environments that have agendas beyond the initial intent.

The assembly of such environments for all of biology is a colossal challenge. It will be very costly and will depend on a new political commitment to fund the construction and persistence of the service infrastructure. Some argue that urgent science problems should be the driver for this new infrastructure (Sterner et al. 2020). This decentralized approach seems inevitable. It would favour particular agendas, be agile and responsive to needs; but, as part of the competitive research enterprise, it will add to the fragmented character of biology. de Rosnay’s macroscope perspective reminds us that a well-designed global cyberinfrastructure should enable progress not only with pressing agendas but also with less proximate concerns.

The position taken here is that the research environment is ill-suited to the assembly and maintenance of a persistent service cyberinfrastructure. Most research is based on short-term projects, such that continued funding, and hence the continuing availability of the infrastructure, is not certain. Most current discipline-based repositories serve a particular research agenda, but lack the resources to ensure access to all data in perpetuity, to provide quality control processes, or to prepare content for transfer to trusted data repositories. Without a commitment to capture all content, some legacy and at-risk information will not be made digital, or will not be made available in forms that allow for easy analysis. Those data will simply be lost from inclusion in current and future scientific efforts. That is, we need to consider the needs of a comprehensive infrastructure without being constrained by what best serves trend-setters in current research.

The requirements for an ideal general-purpose (enabling) infrastructure are reasonably predictable. Using the term ‘agent’ to refer to individuals, institutions, or programmes, it is expected that one or more agents will take responsibility for the discovery and aggregation of all data within each of all domains of research. The most inclusive stance should be taken as to what constitutes a domain of research. Sources should include the output of any project, individual, team, or programme; data collected by funding sources, institutions, publications, publishers, and databases; computed data; and so on. Output from sources will be discovered and copied (gathered/aggregated) by agents into one or more data centres representing their defined domains of research. It is expected there will be more than 10,000 discipline-focussed data aggregators. As information may or may not have been ‘born digital’, devices will be needed to ensure that legacy data are made digital. Once acquired, data will need to be normalized, have key provenance and discipline-specific metadata added, and then be made available through reliable and trustworthy pathways for harvesting by trusted data repositories which meet CoreTrustSeal standards and which guarantee access to the data in perpetuity (Corrado 2019, Dillo and De Leeuw 2018, Downs 2021). Compliance with FAIR principles (Wilkinson 2016) or more demanding standards is expected.

Some of the challenges that an infrastructure will face are already evident from research in biology that relies heavily on the re-use of data. A good example is the re-use of molecular data in investigations of phylogenetic relationships (e.g. Hinchliff et al. 2015). Such studies reveal uncorrected misidentifications of material (e.g. Leray et al. 2019, Pentinsaari et al. 2020) or other errors in the data (Bidartondo 2008). A second challenge is the integration of information from different sources. This problem arises in broad cross-discipline areas (Jones et al. 2006, Miled et al. 2004, Nishant et al. 2011), within subdomains (Hall et al. 2013), for taxonomies (Franz and Sterner 2018, Garnett et al. 2020), or even in the very narrow domain of occurrence data (Belbin et al. 2013, Mesibov 2013). Immediate problems may misdirect attention from the absence of a clear plan, protocols and funding that are needed to guide all data along the pathway from source to trusted repositories of fit-for-purpose data.

Along the pathway from source to repository, at least one agent will need to take responsibility for polishing services that will correct errors, keep metadata up to date, update software-dependent data, correct flaws in aggregation processes, and so on (Belbin et al. 2013, Chapman 2005, Franz and Sterner 2018, Mesibov 2013). Without this, there can be no guarantee that data will be fit for purpose. Such polishing services include those needed for scientific names because errors and idiosyncrasies with names are common in data sources (Patterson et al. 2016), despite the very high significance of names as metadata. Names and associated taxonomic concepts (identities) change with new research in nomenclature, systematics, and phylogeny, such that the prior use of names may need to be updated. That is, it must be assumed that any name strings associated with or acting as a data object may need to be corrected or replaced on one or more occasions. With appropriate investment, name polishing can be provided along with on-line reconciliation and resolution services (Mozzherin et al. 2017, Patterson et al. 2010). Older occurrence data may need polishing to maintain currency with geopolitical developments or, as Biogeomancer does, to convert place names to georeferences (Guralnick et al. 2006). The pathway should include annotation services such as Filtered Push (Wang et al. 2009) which allow users and curators to add comments or corrections and hence improve the quality of data.

A service-oriented infrastructure must include, and be built atop, a layer of discipline-based aggregators. Absent from discussions about a general cyberinfrastructure is an assertion of the full extent of the life sciences. Such an assertion is needed to guide planning efforts with estimates of the number of data sources, the amount and character of primary data, and the requirements for discipline-specific data aggregation and management centers which will deliver fit-for-purpose data to persistent repositories with curatorial practices that meet the highest standards (Dillo and De Leeuw 2018, Downs 2021). Without recognition of the scope and scale of the discipline, the costs of building an infrastructure will not be known, the political will for new funding models will be absent, and the comprehensive enabling cyberinfrastructure that some seek will not emerge.

Nature's Envelope

The intent here is to promote the dialogue as to the scope and scale of biology that is needed to plan a data infrastructure that can serve all aspects of the biological sciences. All known life is a single array of processes which are interconnected from the sub-molecular level to the global. Each process can be represented by the size(s) of the participant(s), and its duration. Arguably, process-based metrics can be applied to any facet of biology, unlike metrics based on ‘objects’ – such as the number of species or other measures of biodiversity, the number of data objects, or the number of agents (Thessen and Patterson 2011). The emphasis on process is useful as processes are the targets for most discovery efforts.

The graphic framework that was used for this exercise was a grid with log10 axes for the duration of processes in seconds and the size(s) of participants in metres. The choice of using a log scale is one of convenience only but is consistent with other efforts to represent information that extends over broad scales (Morrison and Morrison 1994, 't Hooft and Vandoren 2014). The approach has been applied in more limited extents to biology (e.g. Buonomano 2007, McGeogh 1998).

The result (Fig. 1) was a grid that extends across about 35 orders of magnitude of time, and about 21 orders of magnitude for size. Instances of processes were taken from all levels of organization and were plotted onto this framework, selecting those decadal blocks (defined by their lower left corners) in which processes occurred. Examples of biological processes follow. Biological processes occur in about half of the available blocks. A line was drawn around the examples to give the green area in Fig. 1. The periphery was blurred to reflect the inexact metrics of processes. The green area is ‘Nature’s Envelope’.

Figure 1.  

The envelope that contains all biological processes. The axes are (horizontal) the duration of processes, and (vertical) the size of the participants. Metrics are represented in log10 scales. The green area is where biological processes occur, and its periphery is ‘Nature’s Envelope’.

As biology merges with chemistry, physics, geology and other sciences, it is helpful to indicate what was included in this exercise to establish the outer bounds of the life sciences. Inclusion is limited to processes conducted by, in, or among living organisms, and the consequences of those processes. The result is not theoretical, but is a summary of processes that embrace subatomic events, molecular and biochemical events, and cellular, tissue, organismic, ecological, evolutionary, and global events. Most are obviously active processes, examples being the acquisition and translocation of ions, transformational changes in motility proteins such as myosin or kinesin, the flight patterns of peregrine falcons, or the expansion of ground cover by colonial plants. Some verge on being considered passive, such as the passage of photons through chlorophyll molecules, but this is included as there is an active component that intercepts and retains energy. Also included is the expansion of the oxygen-containing atmosphere, as it is driven by photosynthetic processes. The expansion of the distribution of invasive species is included, but the movement of the virulent B.1.1.7 COVID strain (also known as VUI-202012/01), recorded in Britain in October 2020 and located in the US and Australia in December of the same year, is not (because of the involvement of air travel). The course of Voyager spacecraft, the ages of inert fossils, and the fossilization process are not included. Clearly, inclusion of processes as ‘life sciences’ is open to debate and may need to be reset with future versions of Nature’s Envelope.

The emphasis on processes involves an unfamiliar inexactness in information. Processes are transient by nature, are influenced by other processes, internal and external environments, recent histories, age, the number and diversity of directly or indirectly connected participants, whether information is obtained in vivo, in vitro, by inference or calculation, and so on. As an example of the imprecision involved, the time it takes for mRNA to move from a nucleus to the outer margins of a cell depends on the number of nuclear pores, the size of the cell, whether cyclosis is expressed and how, on temperature, whether the mRNA is remodelled into a ribonucleo-protein or not, involvement of molecular motors, the alleles available within the observed population, the species, and the type of cell. Consequently, the speed of movement varies by at least two orders of magnitude (Rodriguez et al. 2007). Rather than represent processes by exact numbers such as a mean value, the construction of the first draft of the ‘Nature’s Envelope’ graphic favoured minimal and maximal estimates of range.

The extremes of the envelope that includes all life processes were set by identifying the processes with shortest and longest durations, and those with the smallest and largest participants. The briefest process is held to be the interception of a photon by a photopigment molecule during which energy is transferred from the photon to the photopigment. A photon of light travels at 300,000 km (3 × 10⁸ metres) per second. A chlorophyll molecule measures about 2-3 nm or 3 × 10⁻⁹ metres. A simple calculation establishes that the amount of time that a chlorophyll molecule is exposed to and must take advantage of the energy of a photon is 10⁻¹⁷ seconds. As for the size of the participating photon, the treatment of photons as objects with size is questionable, but there is a consensus that a size of 1 × 10⁻¹⁵ m is appropriate (Pohl et al. 2016). The process used for the other extreme is that of evolution, for which we use as a start point the oldest recorded fossils of bacterial stromatolites or other microbial activities which date back to about 3.4 to 4.2 billion years ago (Dodd et al. 2017). The evolutionary process has therefore endured for about 10¹⁷ seconds. The participant in the evolutionary process is Earth. The size is taken as the solid mass plus 100 km depth of oxygen-influenced atmosphere, that is, about 13 × 10⁶ m.
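The arithmetic behind these extremes can be checked with a short calculation. The sketch below is illustrative only. The year length of 365.25 days, the 3.5 billion year figure (taken from the abstract), and a mean diameter of about 12,742 km for the solid Earth are assumptions that are not stated in this section, and the atmosphere is assumed to add 100 km on each side of the planet.

```python
import math

# Values quoted in the text.
SPEED_OF_LIGHT_M_PER_S = 3e8   # 300,000 km per second
CHLOROPHYLL_SIZE_M = 3e-9      # about 3 nm

# Assumptions made for this sketch (not given in this section).
SECONDS_PER_YEAR = 365.25 * 24 * 3600
AGE_OF_LIFE_YEARS = 3.5e9      # figure used in the abstract
EARTH_DIAMETER_M = 1.2742e7    # mean diameter of the solid Earth
ATMOSPHERE_DEPTH_M = 1e5       # 100 km, counted on both sides

# Shortest duration: time for light to cross one chlorophyll molecule.
photon_transit_s = CHLOROPHYLL_SIZE_M / SPEED_OF_LIGHT_M_PER_S

# Longest duration: the evolutionary process, converted to seconds.
evolution_s = AGE_OF_LIFE_YEARS * SECONDS_PER_YEAR

# Largest participant: Earth plus its oxygen-influenced atmosphere.
earth_size_m = EARTH_DIAMETER_M + 2 * ATMOSPHERE_DEPTH_M

print(f"photon transit: about 10^{round(math.log10(photon_transit_s))} s")
print(f"evolution:      about 10^{round(math.log10(evolution_s))} s")
print(f"Earth:          about {earth_size_m / 1e6:.0f} x 10^6 m")
```

Under these assumptions the results agree with the orders of magnitude given above.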

To populate the envelope and establish its shape, sample biological processes were mapped into decadal blocks within the grid. As an example, the process of a (dead) whale exploding from pressure of gasses accumulating in its intestines endures for about 1 to 10 seconds and involves an object about 10 metres long. As with all other processes, the explosive event is not isolated. It is interconnected with the metabolism and growth of individual bacteria, populations of genetically similar organisms and of taxonomically diverse communities all of which contribute to the production of the gases. The eruption is also connected to responses by members of the microbial food web and other scavengers that benefit from the resulting supply of dissolved and particulate food materials.
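The assignment of a process to a decadal block can be sketched as taking the floor of the log10 of each metric, which gives the exponents of the block's lower left corner. The function name `decadal_block` and the specific values used for the whale (3 seconds, 12 metres) are hypothetical, chosen only to fall within the ranges quoted above; this is not necessarily how the original figure was compiled.

```python
import math

def decadal_block(duration_s: float, size_m: float) -> tuple[int, int]:
    """Return the log10 exponents of the lower left corner of the block
    that contains a process of the given duration and participant size."""
    if duration_s <= 0 or size_m <= 0:
        raise ValueError("duration and size must be positive")
    return math.floor(math.log10(duration_s)), math.floor(math.log10(size_m))

# Illustrative values within the ranges quoted for the exploding whale.
whale_block = decadal_block(duration_s=3.0, size_m=12.0)
print(whale_block)  # (0, 1): the block spanning 1-10 seconds and 10-100 metres
```

Values that fall exactly on a power of ten are placed in the upper block by this convention; a process whose metrics span a range would occupy every block that the range touches.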

Three classes of further examples illustrate the process by which the envelope was populated, and reveal more of the problems that were encountered.

Life history data are included for all classes of organisms, from sub-micron viruses to honey fungi and tree clones extending over multiple kilometres. Examples with short and long life-spans were favoured. The data ranged from the generation times of identified bacteria, measured in minutes, to the ages of various species of trees known to be many thousands of years old. Examples were mapped onto the decadal blocks defined by the sizes of individuals of the relevant species. Times of early demise and fossilization processes were not included. Data on life-spans were extended to classes of cells. The doubling times of many protists (single cells) are known and some were included. The life-spans of human red blood cells populate two decadal blocks. Both are defined by the size 1-10 µm (red blood cells are 7-8 µm in diameter), but given that red blood cells can survive for 70-140 days, two blocks (defined by 10⁶ and 10⁷ seconds) were selected (Franco 2012). More blocks may be populated when a greater diversity of cells and organisms is included. The life-span concept was extended further to molecules. The life-spans of mRNA molecules of some organisms have been measured and, although expressed as half-time decay rates (Baudrimont et al. 2017), can be included.
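The red blood cell example shows how a range of values can populate more than one block. A minimal sketch follows, assuming that a range occupies every decadal block between the block of its minimum and the block of its maximum; the helper name `blocks_spanned` is hypothetical.

```python
import math

SECONDS_PER_DAY = 24 * 3600

def blocks_spanned(low: float, high: float) -> list[int]:
    """List the log10 exponents of every decadal block touched by the
    range from low to high (both positive, in the same unit)."""
    if not 0 < low <= high:
        raise ValueError("expected 0 < low <= high")
    first = math.floor(math.log10(low))
    last = math.floor(math.log10(high))
    return list(range(first, last + 1))

# Life-span of human red blood cells: 70-140 days, converted to seconds.
rbc_duration_blocks = blocks_spanned(70 * SECONDS_PER_DAY, 140 * SECONDS_PER_DAY)

# Diameter of human red blood cells: 7-8 micrometres, in metres.
rbc_size_blocks = blocks_spanned(7e-6, 8e-6)

print(rbc_duration_blocks)  # [6, 7]: the blocks starting at 10^6 and 10^7 seconds
print(rbc_size_blocks)      # [-6]: the single block spanning 1-10 micrometres
```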

A second class of examples relates to movements. Included are the increases in dimensions of organisms from nascent form to adult. An entry for growth of stromatolites is based on estimates of a few millimetres expansion per year. For some, data are entered for a species (Arctic terns migrate more than 10,000 km in 3 months), while others, which are preferred, refer to particular individuals. Joe, a tumbler pigeon, departed Oregon (USA) on October 29th, 2020, and arrived 17,000 km distant in Melbourne (Australia) on December 26th, 2020. Some activities are represented by more than one entry. Murmurations of starlings are included both in decadal blocks defined by the size of individual organisms, and in blocks for the whole flock. Cyclical movements include the molecular motor kinesin that steps 10 nm or so in 100 microseconds, as well as movements of organisms from bacteria to large trees in response to tidal, diurnal, lunar, seasonal, or annual cycles. Emergence events of Magicicada that are separated by many years are included. Movements in response to environmental factors, optimising location or orientation relative to directional factors (sunlight) or to gradients (such as the responses of microbial and meiobenthic communities to redox gradients), are included. Range extensions are included. Cane toads, introduced to North Queensland (Australia) in 1935 with the intent of controlling pests in sugar cane crops, have since expanded their range by over 1000 km. Entries for plants include the estimated 14,000 to 80,000 year period over which the Pando clone of aspen trees has extended about 5 km (DeWoody et al. 2008).

The last suite is something of a miscellany. The envelope includes transactions, such as steps in metabolic pathways, the exchange of neurotransmitters between cells, and communications internally in multicellular organisms involving hormones, or externally involving pheromones. Microbial biogeochemical activities which are associated with transformations and precipitations of organic and inorganic deposits, including fool’s gold (Thiel et al. 2019) or real gold (Reith et al. 2007), proved difficult to categorize. Some aspects of adaptation and evolution are included. The length of time involved in the acquisition of new behavioural traits, such as the ability of crows to use vehicles to break nuts, has been asserted (Nihei and Higuchi 2001). It is included using decadal blocks defined by the size of individual crows. Our recent experience with COVID also provides data on the emergence and spread of new genotypes, something that can be added to more conventional evolutionary trees with their asserted timelines.

Concluding comments

‘Nature’s envelope’ (v. 1) is not intended as an analytical tool, but as a rhetorical device. Such devices have had a significant impact on the development of our discipline. Examples include the depiction of evolutionary relationships, the concept of evolution, so-called ‘laws’ like Gause’s Law of competitive exclusion or Bergmann’s Rule that within a clade those species that live in colder climes are larger, and various models from molecular to ecological that seek to represent reality. Although such devices may lack numeracy and exactness, they can be treated as testable hypotheses, and can grow into or spawn more exact assertions.

As a rhetorical device, Nature’s Envelope aims to provide context for a variety of conversations. Initially, it was motivated by the challenges of building a unifying informatics framework that might aid the study of any aspect of biology. It is not intended to be part of the data infrastructure. Indeed, its reliance on information about processes may make it incompatible with the object-based catalogues which lack information on time-lines but which are the most usual form of data repositories. None-the-less, ‘Nature’s Envelope’ can help to determine the number of discipline-based data aggregation centres that will be needed to discover, standardize and move data from primary producers into an environment where they may be freely used in computational analyses. At this time, there is neither the political will nor the resources to craft, build, staff and maintain a service-oriented array of data services. For the time being at least, most developments that will form part of the infrastructure will be driven by particular research technologies and agendas (Sterner et al. 2020).

The Envelope can be made more informative by the addition of layers. Fig. 2 includes a window to show the processes that can be directly observed by a generalized individual unaided by special equipment. It allows for the formation of visual images after less than 0.1 second of exposure to a subject, and the capacity to discriminate items less than 0.1 mm in size. The upper right corner of the window is based on examples of 19th century naturalists such as Joseph Banks or Alfred Russel Wallace, whose decades of observations around the world led to insights on global distributions of plants and animals (the Wallace Line being a case in point).

Figure 2.  

‘Nature’s Envelope’ with an added layer that shows the extent of living processes which may be perceived by a human individual.

Other layers may be developed to show which areas of biology benefit from particular technologies – such as how the individual experience window can be expanded by access to microscopes. Layers may inform us about the relevance of technologies or reveal which processes are measurable and which processes must be inferred or computed. Layering can show one or more domains where communities with particular taxonomic or other skills can add value. Layering exercises that identify subdisciplines and the targets of special interest groups, will help to clarify opportunities and requirements for data interoperability. In turn, this helps to set requirements for data and metadata standards.

While the current iteration of ‘Nature’s Envelope’ is data-based, it is inexact and incomplete. It is a preliminary assertion that, if helpful to discussions, would be improved by being fleshed out by community involvement. It would be helpful to expand and enrich this framework. More examples will help affirm the shape of the envelope. In some cases, it will be possible to import data from environments that deal with processes, such as migrations (Megamove, Movebank or the Bird Migration Explorer), cyclic processes such as seasonal emergences, life cycles, or growth. In some cases this information can be computed from object-related environments that include time-stamps as metadata. There are other definitions of ‘life’ which might admit more or fewer processes. Should, for example, Nature’s Envelope include technology-assisted activities or exobiological assertions? Finally, there are benefits if we identify sources of arbitrariness and reduce that feature. Progress would best be made through an open collaborative community (a template is available as Suppl. material 1 to aid initial efforts in this process).

Acknowledgements

I thank Carl Seaquist, James Patterson, Julian Partridge, and Rebecca Lynn for their comments.

References

Supplementary material

Suppl. material 1: Patterson Nature's envelope (template) 
Authors:  David J Patterson
Data type:  PowerPoint
Brief description: A PowerPoint file with an image of Nature's Envelope as submitted to RIO, with an additional editable layer as a window
