Background & Summary

Benthic foraminifera are unicellular organisms characterised by a calcium carbonate shell. They are responsible for about 20% of global carbonate production1. In carbonate-dominated environments, such as coral reefs, foraminifera are important contributors to sediment production2,3. The combined effects of global climate change impair carbonate production in these environments4,5,6; this production is critical for low-lying tropical islands to withstand sea-level rise7,8. Foraminifera are also sensitive sentinels of environmental change9 and therefore serve as proxies for inferring environmental conditions on both long and short temporal scales. For example, the species composition of a foraminiferal assemblage reflects environmental change10,11, so analysing how this composition varies, either by repeated sampling over time12 or by examining a sediment core13, reveals temporal trends in habitat quality. Monitoring foraminifera is thus imperative to improving our understanding of how these sensitive systems respond to climate change.

Currently, collected sediment cores are washed and sieved to separate the foraminifera from the surrounding material. This is followed by manual partitioning, identification and quantification of the foraminifera (see Fig. 1). This is a time-consuming process that requires expert knowledge, and automating (parts of) this procedure has therefore been an active area of research. Such methods involve photographing the washed and sieved foraminifera and using machine learning to classify them, either by defining hand-crafted features14 or by learning the relevant features from whole images of individual foraminifera15,16,17. Several efforts have incorporated 3D features by photographing specimens under different lighting conditions14, at different focal planes18 or both17. Despite these advances in automation, preparation of the sediment core sample (e.g. washing, sieving and sorting) remains a bottleneck. Moreover, foraminifera embedded in hardened, consolidated sediment are difficult to obtain and classify in this manner16,18.

Fig. 1
figure 1

The conventional procedure of quantifying foraminifera. A core sample is subdivided, soaked in a (chemical) solution and rinsed, followed by (wet) sieving, separation, and classification and quantification of the species. The latter two steps are performed under a microscope.

We envision an alternative workflow, in which 3D X-ray computed tomography (CT) is used to further automate the procedure. While CT scanning has been widely employed to analyse individual foraminifera19,20,21,22,23,24,25, our workflow is based on scanning the whole core before washing and sieving. The resulting 3D image can then be processed digitally to separate the individual foraminifera contained within the core sample and classify them using advanced 2D and 3D machine learning techniques. A schematic depiction of the envisioned procedure is shown in Fig. 2.

Fig. 2
figure 2

The proposed procedure of quantifying foraminifera. A core sample is subdivided and CT scanned. An algorithm then quantifies the 3D CT reconstruction.

The methods needed to segment and classify individual foraminifera from such 3D scans do not yet exist. Developing them requires high-quality labelled training data. While some datasets of CT scans of (individual) foraminifera species exist19,20,26, they lack the magnitude, diversity and consistency needed to develop and train methods for separating and classifying CT scans of whole core samples. One recent, notable CT scan dataset25 improves on (planktic) species diversity and dataset magnitude but remains too small for training machine learning algorithms. We aim to address these shortcomings with ForametCeTera: a dataset of 436 3D CT scans comprising 288 individual benthic foraminifera and 148 bits of non-foraminiferan material. ForametCeTera’s data can be used as building blocks to generate synthetic digital core samples, which in turn can be used to develop methods for segmenting and classifying CT scans of core samples. We also demonstrate a high-throughput, specimen-agnostic scanning procedure suitable for this task, which can be used to rapidly extend the dataset’s breadth and depth with additional microfossils. This is the first foraminifera dataset of this scope to be collected, and we believe it will be helpful for method development and validation.

The remainder of the paper is organised as follows. A detailed description of how the dataset was acquired is given in the Methods section. This is followed by a detailed description of the dataset itself, how it was validated, and how it can be used.

Methods

To create the dataset, we carefully selected samples of individual foraminifera and bits of non-foraminiferan material; we refer to both as specimens. The specimens were sourced from core samples acquired by the Naturalis Biodiversity Center and split into foraminifera species and a residual group of non-foraminiferan material. Each collection of specimens was placed in a tube with a filling medium. Each tube was then CT scanned (creating a group-scan) with consistent scanning parameters, deemed most suitable after performing trial scans of all specimens. A summary of this process is shown in Fig. 3. ForametCeTera contains 11 group-scans, which were segmented into a total of 436 individual specimens: 288 foraminifera and 148 bits of non-foraminiferan material. This process took place over a period of 2 months. The dataset comprises the raw CT projections of the group-scans, reconstructed 2D cross-sectional data of the group-scans, reconstructed 3D group-scans, segmented 3D specimens and all scripts used throughout this process.

Fig. 3
figure 3

The specimen scanning procedure. From left to right, a sample of different specimens is split into groups by type and put in tubes with a filling medium, each tube is scanned following the same protocol, each group-scan is segmented producing density volumes of individual specimens.

Sample selection

The samples from which the specimens were retrieved were collected in Makassar, Indonesia and Espiritu Santo, Vanuatu. A detailed overview of their metadata is shown in Table 1. These specimens had previously been used in academic and educational activities and had therefore been cleaned. However, several specimens still had clumps of sand sticking to them. Taking care to keep sandy samples separate, all specimens were split and grouped by species and region. Non-foraminiferan material was separated as well. The resulting collection of specimens is described in Table 2, and several microscope images of these specimens are shown in Fig. 4.

Table 1 Metadata of the samples from which the specimens were retrieved.
Table 2 An overview of the scanned specimens.
Fig. 4
figure 4

Microscope images of the selected samples. (a) Alveolinella quoyi. (b) Operculina ammonoides. (c) Baculogypsinoides sphaerulata. (d) Marginopora santoensis. (e) γ, an example of non-foraminiferan material.

Sample preparation

In our testing, an adequate CT scan took at least half an hour, so scanning each specimen individually would be prohibitively time-consuming. We therefore favoured scanning several specimens at once, creating the previously mentioned group-scans, which can then be segmented to obtain the individual specimens. Successful segmentation requires the specimens to be adequately spaced. To achieve this, a filling medium was needed that fulfilled three requirements: 1) minimal density overlap with the specimens, 2) relative homogeneity and 3) given the value and utility of foraminifera, separability from the specimens when dismantling the samples. The candidate filling media were sugar and coffee creamer, as both dissolve in water (requirement 3) and are readily available. Trial scans with both media showed that sugar was less homogeneous and overlapped in density with the specimens; coffee creamer was therefore selected as the filling medium.

For each group-scan, the selected specimens and filling medium were put in a plastic tube (11.6 mm inner diameter, 50.7 mm inner height, 1.5 mm wall thickness) with a screw cap. Specimens at the bottom of the tube risk being poorly captured, as they sit near the thicker tube bottom and the mounting equipment. A 2 cm cylindrical piece of Styrofoam was therefore first placed in the tube, raising its contents from the bottom. Subsequently, a teaspoon of filling medium was dispensed onto a piece of paper with a crease through the middle. The specimens of choice were then deposited onto the filling medium and mixed with it, and the mixture was poured into the tube along the aforementioned crease. If X-ray inspection revealed that the specimens were poorly distributed throughout the filling medium, simply shaking the tube effectively redistributed them.

Scanner setup and parameters

To perform the CT scans, a Neoscan N80 FP micro-CT scanner was used, located at the Naturalis Biodiversity Center. The scanner contains a microfocus X-ray source (limited to 110 kV, 16 W), an active pixel CMOS flat-panel X-ray detector (7 Mp) and, in between these, a stage where the sample is mounted, capable of axial rotation and 3D translation27.

The CT scanner is operated using Neoscan’s accompanying software: Neoscan80, version 3.0.2. All scans were performed at a resolution of 15 μm, using a 0.5 mm aluminium (Al) filter. Trial scans revealed that a different filter might enhance image contrast for some specimens, as X-rays from certain angles were overly attenuated by the combination of filter and specimen. However, as one of the downstream aims of ForametCeTera is to generate synthetic samples, scanning with different parameters would introduce per-specimen biases; this particular filter proved suitable for most specimens.

Prior to each scan, the CT scanner flat-field was automatically calibrated using the Neoscan80 software. For each scan, the object made a full 360° rotation with images captured at 0.2° increments resulting in 1801 projections, including a final overlapping projection. The exposure time was 94 ms and 4 averaging frames were used. The X-ray source was set to 67 kV and 200 μA. Projections were captured at the highest possible resolution of 2400 × 2752 pixels. In case the region of interest exceeded the vertical field of view of the scanner, multiple scans were ‘stitched’ together using the oversize scanning feature of Neoscan80.

Data acquisition

The projection data resulting from the group-scans were reconstructed into 2D cross-sectional images using the Neoscan80 software. Only intensity values between 0.09 and 0.9 were retained; these values are of a unitless quantity proportional to material density. The 2D cross sections were exported as a stack of lossless-compressed 8-bit .png files. These stacks were converted into lossless 3D group-scan data by means of a Python script (stack.py, see Code availability), producing .nrrd files, a file format for n-dimensional raster data28. In the 3D data, the foraminifera were separated from the filling medium based on a voxel intensity threshold, found empirically and tuned to preserve foraminiferan material whilst completely removing the filling medium. After checking the resulting disconnected components for segmentation faults, the individual specimens were exported as .nrrd files. This segmentation procedure is implemented in another Python script (segment.ipynb, see Code availability).
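As a concrete illustration of this threshold-and-label step, the sketch below applies an intensity threshold to a synthetic volume and extracts connected components with scipy.ndimage. The volume and threshold value here are illustrative stand-ins, not those used for ForametCeTera (the actual procedure is in segment.ipynb).

```python
import numpy as np
from scipy import ndimage

# Synthetic stand-in for a reconstructed 8-bit group-scan volume: two dense
# "specimens" embedded in a low-intensity filling medium.
rng = np.random.default_rng(0)
volume = rng.integers(0, 60, size=(40, 40, 40)).astype(np.uint8)  # medium
volume[5:12, 5:12, 5:12] = 200     # specimen 1
volume[25:33, 25:33, 25:33] = 220  # specimen 2

THRESHOLD = 120  # assumed value; in practice tuned empirically per dataset
mask = volume > THRESHOLD

# Label disconnected components; each label is one candidate specimen.
labels, n_specimens = ndimage.label(mask)
print(n_specimens)  # 2

# Extract each specimen as a tight sub-volume, roughly what an export step
# would write out as an individual .nrrd file.
specimens = [volume[sl] for sl in ndimage.find_objects(labels)]
```

Checking the labelled components before export, as done here implicitly via `find_objects`, is where touching specimens that were merged into one component would be caught.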

Further expansion

ForametCeTera is a diverse 3D dataset, yet further expansion would enhance the robustness and generalisability of (machine learning-based) classifiers. Diversity can be increased by scanning specimens from different species, regions and depths. Additionally, scans of unprocessed samples would enable testing trained classifiers on minimally pre-processed, real-world data, as envisioned in the proposed procedure (see Fig. 2).

Data Records

ForametCeTera is publicly available at Zenodo29 as a .zip file of about 4.4 GB. This file contains the 2D reconstructed group-scan cross-sections, the 3D group-scan data and the individual, segmented 3D specimens. The raw projection images of the group-scans are available upon request, as this data is relatively large (332 GB uncompressed). An overview of the data and its metadata is given in Tables 1 and 2, and the dataset structure is shown in Fig. 5. The Group_scans folder contains the 3D reconstructions of the group-scans; the Specimens folder contains the segmented 3D specimens, i.e. the segmented group-scans; and the Reconstructions folder contains the reconstructed 2D cross-sectional data along with a file, output by Neoscan80, listing the scanning parameters of each scan. Specimen indices in the Specimens folder may jump because touching specimens that were erroneously segmented as one were not exported; for the sake of reproducibility, the indices were not corrected for this. <ID a> placeholders refer to the IDs shown in the Dataset ID column of Table 2.
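Because of these non-contiguous indices, downstream code should enumerate the specimen files that actually exist rather than count upwards. A minimal sketch, using the folder names given above but hypothetical file names (the real naming follows Fig. 5 and Table 2):

```python
import tempfile
from pathlib import Path

# Build a miniature stand-in for the dataset layout; folder names are from
# the text, file names are hypothetical.
root = Path(tempfile.mkdtemp()) / "ForametCeTera"
for folder in ("Group_scans", "Specimens", "Reconstructions"):
    (root / folder).mkdir(parents=True)
(root / "Specimens" / "A_0001.nrrd").touch()
(root / "Specimens" / "A_0003.nrrd").touch()  # index 2 skipped: merged segmentation

# Enumerate existing files instead of assuming contiguous indices.
specimen_files = sorted((root / "Specimens").glob("*.nrrd"))
print(len(specimen_files))  # 2
```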

Fig. 5
figure 5

The dataset structure of ForametCeTera.

Technical Validation

The Neoscan CT scanner undergoes regular maintenance and calibration, and prior to each scan the flat-field reference is updated by the Neoscan software. The reconstructed 3D data were examined for anomalies in the captured intensity values, and the segmented specimens were checked for both intensity anomalies and segmentation faults.

Usage Notes

The 3D data can be viewed using, for example, the open-source program 3D Slicer (https://www.slicer.org/). For analysis, we used Python in conjunction with several packages: pynrrd to load the data, and numpy, scipy.ndimage and scikit-image for further analysis. For machine learning endeavours, the TorchIO or rising (exclusive to PyTorch) libraries may be used.
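As a small example of such an analysis, the sketch below measures a specimen with skimage.measure.regionprops and converts its voxel count to a physical volume using the 15 μm scan resolution reported in the Methods. The specimen volume here is synthetic; in practice one would first load a file with pynrrd (`data, header = nrrd.read(...)`).

```python
import numpy as np
from skimage import measure

# Synthetic binary specimen volume standing in for a loaded .nrrd file.
specimen = np.zeros((30, 30, 30), dtype=np.uint8)
specimen[8:20, 8:20, 8:20] = 1  # a 12x12x12-voxel "specimen"

labels = measure.label(specimen > 0)
props = measure.regionprops(labels)[0]

VOXEL_SIZE_UM = 15.0  # scan resolution from the Methods section
voxel_volume_um3 = VOXEL_SIZE_UM ** 3

n_voxels = props.area                      # number of voxels (12**3 = 1728)
physical_volume = n_voxels * voxel_volume_um3  # approximate volume in cubic micrometres
```

Other regionprops attributes (centroid, bounding box, equivalent diameter) work on 3D labels in the same way and could feed simple morphometric comparisons between species.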