-
Investigating Collaborative Data Practices: a Case Study on Artificial Intelligence for Healthcare Research
Authors:
Rafael Henkin,
Elizabeth Remfry,
Duncan J. Reynolds,
Megan Clinch,
Michael R. Barnes
Abstract:
Developing artificial intelligence (AI) tools for healthcare is a collaborative effort, bringing data scientists, clinicians, patients and other disciplines together. In this paper, we explore the collaborative data practices of research consortia tasked with applying AI tools to understand and manage multiple long-term conditions in the UK. Through an inductive thematic analysis of 13 semi-struct…
▽ More
Developing artificial intelligence (AI) tools for healthcare is a collaborative effort, bringing data scientists, clinicians, patients and other disciplines together. In this paper, we explore the collaborative data practices of research consortia tasked with applying AI tools to understand and manage multiple long-term conditions in the UK. Through an inductive thematic analysis of 13 semi-structured interviews with participants of these consortia, we aimed to understand how collaboration happens based on the tools used, communication processes and settings, as well as the conditions and obstacles for collaborative work. Our findings reveal the adaptation of tools that are used for sharing knowledge and the tailoring of information based on the audience, particularly those from a clinical or patient perspective. Limitations on the ability to do this were also found to be imposed by the use of electronic healthcare records and access to datasets. We identified meetings as the key setting for facilitating exchanges between disciplines and allowing for the blending and creation of knowledge. Finally, we bring to light the conditions needed to facilitate collaboration and discuss how some of the challenges may be navigated in future work.
△ Less
Submitted 16 January, 2024; v1 submitted 30 November, 2023;
originally announced November 2023.
-
Temporal Network Analysis of Email Communication Patterns in a Long Standing Hierarchy
Authors:
Matthew Russell Barnes,
Mladen Karan,
Stephen McQuistin,
Colin Perkins,
Gareth Tyson,
Matthew Purver,
Ignacio Castro,
Richard G. Clegg
Abstract:
An important concept in organisational behaviour is how hierarchy affects the voice of individuals, whereby members of a given organisation exhibit differing power relations based on their hierarchical position. Although there have been prior studies of the relationship between hierarchy and voice, they tend to focus on more qualitative small-scale methods and do not account for structural aspects…
▽ More
An important concept in organisational behaviour is how hierarchy affects the voice of individuals, whereby members of a given organisation exhibit differing power relations based on their hierarchical position. Although there have been prior studies of the relationship between hierarchy and voice, they tend to focus on more qualitative small-scale methods and do not account for structural aspects of the organisation. This paper develops large-scale computational techniques utilising temporal network analysis to measure the effect that organisational hierarchy has on communication patterns within an organisation, focusing on the structure of pairwise interactions between individuals. We focus on one major organisation as a case study - the Internet Engineering Task Force (IETF) - a major technical standards development organisation for the Internet. A particularly useful feature of the IETF is a transparent hierarchy, where participants take on explicit roles (e.g. Area Directors, Working Group Chairs). Its processes are also open, so we have visibility into the communication of people at different hierarchy levels over a long time period. We utilise a temporal network dataset of 989,911 email interactions among 23,741 participants to study how hierarchy impacts communication patterns. We show that the middle levels of the IETF are growing in terms of their dominance in communications. Higher levels consistently experience a higher proportion of incoming communication than lower levels, with higher levels initiating more communications too. We find that communication tends to flow "up" the hierarchy more than "down". Finally, we find that communication with higher-levels is associated with future communication more than for lower-levels, which we interpret as "facilitation". We conclude by discussing the implications this has on patterns within the wider IETF and for other organisations.
△ Less
Submitted 22 November, 2023;
originally announced November 2023.
-
Automated segmentation of rheumatoid arthritis immunohistochemistry stained synovial tissue
Authors:
Amaya Gallagher-Syed,
Abbas Khan,
Felice Rivellese,
Costantino Pitzalis,
Myles J. Lewis,
Gregory Slabaugh,
Michael R. Barnes
Abstract:
Rheumatoid Arthritis (RA) is a chronic, autoimmune disease which primarily affects the joint's synovial tissue. It is a highly heterogeneous disease, with wide cellular and molecular variability observed in synovial tissues. Over the last two decades, the methods available for their study have advanced considerably. In particular, Immunohistochemistry stains are well suited to highlighting the fun…
▽ More
Rheumatoid Arthritis (RA) is a chronic, autoimmune disease which primarily affects the joint's synovial tissue. It is a highly heterogeneous disease, with wide cellular and molecular variability observed in synovial tissues. Over the last two decades, the methods available for their study have advanced considerably. In particular, Immunohistochemistry stains are well suited to highlighting the functional organisation of samples. Yet, analysis of IHC-stained synovial tissue samples is still overwhelmingly done manually and semi-quantitatively by expert pathologists. This is because in addition to the fragmented nature of IHC stained synovial tissue, there exist wide variations in intensity and colour, strong clinical centre batch effect, as well as the presence of many undesirable artefacts present in gigapixel Whole Slide Images (WSIs), such as water droplets, pen annotation, folded tissue, blurriness, etc. There is therefore a strong need for a robust, repeatable automated tissue segmentation algorithm which can cope with this variability and provide support to imaging pipelines. We train a UNET on a hand-curated, heterogeneous real-world multi-centre clinical dataset R4RA, which contains multiple types of IHC staining. The model obtains a DICE score of 0.865 and successfully segments different types of IHC staining, as well as dealing with variance in colours, intensity and common WSIs artefacts from the different clinical centres. It can be used as the first step in an automated image analysis pipeline for synovial tissue samples stained with IHC, increasing speed, reproducibility and robustness.
△ Less
Submitted 13 September, 2023;
originally announced September 2023.
-
Raphtory: The temporal graph engine for Rust and Python
Authors:
Ben Steer,
Naomi Arnold,
Cheick Tidiane Ba,
Renaud Lambiotte,
Haaroon Yousaf,
Lucas Jeub,
Fabian Murariu,
Shivam Kapoor,
Pedro Rico,
Rachel Chan,
Louis Chan,
James Alford,
Richard G. Clegg,
Felix Cuadrado,
Matthew Russell Barnes,
Peijie Zhong,
John N. Pougué Biyong,
Alhamza Alnaimi
Abstract:
Raphtory is a platform for building and analysing temporal networks. The library includes methods for creating networks from a variety of data sources; algorithms to explore their structure and evolution; and an extensible GraphQL server for deployment of applications built on top. Raphtory's core engine is built in Rust, for efficiency, with Python interfaces, for ease of use. Raphtory is develop…
▽ More
Raphtory is a platform for building and analysing temporal networks. The library includes methods for creating networks from a variety of data sources; algorithms to explore their structure and evolution; and an extensible GraphQL server for deployment of applications built on top. Raphtory's core engine is built in Rust, for efficiency, with Python interfaces, for ease of use. Raphtory is developed by network scientists, with a background in Physics, Applied Mathematics, Engineering and Computer Science, for use across academia and industry.
△ Less
Submitted 3 January, 2024; v1 submitted 28 June, 2023;
originally announced June 2023.
-
A Machine Learning approach for correcting radial velocities using physical observables
Authors:
M. Perger,
G. Anglada-Escudé,
D. Baroch,
M. Lafarga,
I. Ribas,
J. C. Morales,
E. Herrero,
P. J. Amado,
J. R. Barnes,
J. A. Caballero,
S. V. Jeffers,
A. Quirrenbach,
A. Reiners
Abstract:
Precision radial velocity (RV) measurements continue to be a key tool to detect and characterise extrasolar planets. While instrumental precision keeps improving, stellar activity remains a barrier to obtain reliable measurements below 1-2 m/s accuracy. Using simulations and real data, we investigate the capabilities of a Deep Neural Network approach to produce activity free Doppler measurements o…
▽ More
Precision radial velocity (RV) measurements continue to be a key tool to detect and characterise extrasolar planets. While instrumental precision keeps improving, stellar activity remains a barrier to obtain reliable measurements below 1-2 m/s accuracy. Using simulations and real data, we investigate the capabilities of a Deep Neural Network approach to produce activity free Doppler measurements of stars. As case studies we use observations of two known stars (Eps Eridani and AUMicroscopii), both with clear signals of activity induced RV variability. Synthetic data using the starsim code are generated for the observables (inputs) and the resulting RV signal (labels), and used to train a Deep Neural Network algorithm. We identify an architecture consisting of convolutional and fully connected layers that is adequate to the task. The indices investigated are mean line-profile parameters (width, bisector, contrast) and multi-band photometry. We demonstrate that the RV-independent approach can drastically reduce spurious Doppler variability from known physical effects such as spots, rotation and convective blueshift. We identify the combinations of activity indices with most predictive power. When applied to real observations, we observe a good match of the correction with the observed variability, but we also find that the noise reduction is not as good as in the simulations, probably due to the lack of detail in the simulated physics. We demonstrate that a model-driven machine learning approach is sufficient to clean Doppler signals from activity induced variability for well known physical effects. There are dozens of known activity related observables whose inversion power remains unexplored indicating that the use of additional indicators, more complete models, and more observations with optimised sampling strategies can lead to significant improvements in our detrending capabilities.
△ Less
Submitted 30 January, 2023;
originally announced January 2023.
-
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
Authors:
Aarohi Srivastava,
Abhinav Rastogi,
Abhishek Rao,
Abu Awal Md Shoeb,
Abubakar Abid,
Adam Fisch,
Adam R. Brown,
Adam Santoro,
Aditya Gupta,
Adrià Garriga-Alonso,
Agnieszka Kluska,
Aitor Lewkowycz,
Akshat Agarwal,
Alethea Power,
Alex Ray,
Alex Warstadt,
Alexander W. Kocurek,
Ali Safaya,
Ali Tazarv,
Alice Xiang,
Alicia Parrish,
Allen Nie,
Aman Hussain,
Amanda Askell,
Amanda Dsouza
, et al. (426 additional authors not shown)
Abstract:
Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-futur…
▽ More
Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 450 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.
△ Less
Submitted 12 June, 2023; v1 submitted 9 June, 2022;
originally announced June 2022.
-
Measuring Equality and Hierarchical Mobility on Abstract Complex Networks
Authors:
Matthew Russell Barnes,
Vincenzo Nicosia,
Richard G. Clegg
Abstract:
The centrality of a node within a network, however it is measured, is a vital proxy for the importance or influence of that node, and the differences in node centrality generate hierarchies and inequalities. If the network is evolving in time, the influence of each node changes in time as well, and the corresponding hierarchies are modified accordingly. However, there is still a lack of systematic…
▽ More
The centrality of a node within a network, however it is measured, is a vital proxy for the importance or influence of that node, and the differences in node centrality generate hierarchies and inequalities. If the network is evolving in time, the influence of each node changes in time as well, and the corresponding hierarchies are modified accordingly. However, there is still a lack of systematic study into the ways in which the centrality of a node evolves when a graph changes. In this paper we introduce a taxonomy of metrics of equality and hierarchical mobility in networks that evolve in time. We propose an indicator of equality based on the classical Gini Coefficient from economics, and we quantify the hierarchical mobility of nodes, that is, how and to what extent the centrality of a node and its neighbourhood change over time. These measures are applied to a corpus of thirty time evolving network data sets from different domains. We show that the proposed taxonomy measures can discriminate between networks from different fields. We also investigate correlations between different taxonomy measures, and demonstrate that some of them have consistently strong correlations (or anti-correlations) across the entire corpus. The mobility and equality measures developed here constitute a useful toolbox for investigating the nature of network evolution, and also for discriminating between different artificial models hypothesised to explain that evolution.
△ Less
Submitted 27 May, 2022;
originally announced May 2022.
-
Routing by matching on convex pieces of grid graphs
Authors:
H. Alpert,
R. Barnes,
S. Bell,
A. Mauro,
N. Nevo,
N. Tucker,
H. Yang
Abstract:
The routing number is a graph invariant introduced by Alon, Chung, and Graham in 1994, and it has been studied for trees and other classes of graphs such as hypercubes. It gives the minimum number of routing steps needed to sort a set of distinct tokens, placed one on each vertex, where each routing step swaps a set of disjoint pairs of adjacent tokens. Our main theorem generalizes the known estim…
▽ More
The routing number is a graph invariant introduced by Alon, Chung, and Graham in 1994, and it has been studied for trees and other classes of graphs such as hypercubes. It gives the minimum number of routing steps needed to sort a set of distinct tokens, placed one on each vertex, where each routing step swaps a set of disjoint pairs of adjacent tokens. Our main theorem generalizes the known estimate that a rectangular grid graph R with width w(R) and height h(R) has routing number rt(R) in O(w(R)+h(R)). We show that for the subgraph P of the infinite square lattice enclosed by any convex polygon, its routing number rt(P) is in O(w(P)+h(P)).
△ Less
Submitted 20 June, 2021;
originally announced June 2021.
-
Controlled abstention neural networks for identifying skillful predictions for classification problems
Authors:
Elizabeth A. Barnes,
Randal J. Barnes
Abstract:
The earth system is exceedingly complex and often chaotic in nature, making prediction incredibly challenging: we cannot expect to make perfect predictions all of the time. Instead, we look for specific states of the system that lead to more predictable behavior than others, often termed "forecasts of opportunity." When these opportunities are not present, scientists need prediction systems that a…
▽ More
The earth system is exceedingly complex and often chaotic in nature, making prediction incredibly challenging: we cannot expect to make perfect predictions all of the time. Instead, we look for specific states of the system that lead to more predictable behavior than others, often termed "forecasts of opportunity." When these opportunities are not present, scientists need prediction systems that are capable of saying "I don't know." We introduce a novel loss function, termed the "NotWrong loss", that allows neural networks to identify forecasts of opportunity for classification problems. The NotWrong loss introduces an abstention class that allows the network to identify the more confident samples and abstain (say "I don't know") on the less confident samples. The abstention loss is designed to abstain on a user-defined fraction of the samples via a PID controller. Unlike many machine learning methods used to reject samples post-training, the NotWrong loss is applied during training to preferentially learn from the more confident samples. We show that the NotWrong loss outperforms other existing loss functions for multiple climate use cases. The implementation of the proposed loss function is straightforward in most network architectures designed for classification as it only requires the addition of an abstention class to the output layer and modification of the loss function.
△ Less
Submitted 16 April, 2021;
originally announced April 2021.
-
Controlled abstention neural networks for identifying skillful predictions for regression problems
Authors:
Elizabeth A. Barnes,
Randal J. Barnes
Abstract:
The earth system is exceedingly complex and often chaotic in nature, making prediction incredibly challenging: we cannot expect to make perfect predictions all of the time. Instead, we look for specific states of the system that lead to more predictable behavior than others, often termed "forecasts of opportunity". When these opportunities are not present, scientists need prediction systems that a…
▽ More
The earth system is exceedingly complex and often chaotic in nature, making prediction incredibly challenging: we cannot expect to make perfect predictions all of the time. Instead, we look for specific states of the system that lead to more predictable behavior than others, often termed "forecasts of opportunity". When these opportunities are not present, scientists need prediction systems that are capable of saying "I don't know." We introduce a novel loss function, termed "abstention loss", that allows neural networks to identify forecasts of opportunity for regression problems. The abstention loss works by incorporating uncertainty in the network's prediction to identify the more confident samples and abstain (say "I don't know") on the less confident samples. The abstention loss is designed to determine the optimal abstention fraction, or abstain on a user-defined fraction via a PID controller. Unlike many methods for attaching uncertainty to neural network predictions post-training, the abstention loss is applied during training to preferentially learn from the more confident samples. The abstention loss is built upon a standard computer science method. While the standard approach is itself a simple yet powerful tool for incorporating uncertainty in regression problems, we demonstrate that the abstention loss outperforms this more standard method for the synthetic climate use cases explored here. The implementation of proposed loss function is straightforward in most network architectures designed for regression, as it only requires modification of the output layer and loss function.
△ Less
Submitted 16 April, 2021;
originally announced April 2021.
-
DBATES: DataBase of Audio features, Text, and visual Expressions in competitive debate Speeches
Authors:
Taylan K. Sen,
Gazi Naven,
Luke Gerstner,
Daryl Bagley,
Raiyan Abdul Baten,
Wasifur Rahman,
Kamrul Hasan,
Kurtis G. Haut,
Abdullah Mamun,
Samiha Samrose,
Anne Solbu,
R. Eric Barnes,
Mark G. Frank,
Ehsan Hoque
Abstract:
In this work, we present a database of multimodal communication features extracted from debate speeches in the 2019 North American Universities Debate Championships (NAUDC). Feature sets were extracted from the visual (facial expression, gaze, and head pose), audio (PRAAT), and textual (word sentiment and linguistic category) modalities of raw video recordings of competitive collegiate debaters (N…
▽ More
In this work, we present a database of multimodal communication features extracted from debate speeches in the 2019 North American Universities Debate Championships (NAUDC). Feature sets were extracted from the visual (facial expression, gaze, and head pose), audio (PRAAT), and textual (word sentiment and linguistic category) modalities of raw video recordings of competitive collegiate debaters (N=717 6-minute recordings from 140 unique debaters). Each speech has an associated competition debate score (range: 67-96) from expert judges as well as competitor demographic and per-round reflection surveys. We observe the fully multimodal model performs best in comparison to models trained on various compositions of modalities. We also find that the weights of some features (such as the expression of joy and the use of the word we) change in direction between the aforementioned models. We use these results to highlight the value of a multimodal dataset for studying competitive, collegiate debate.
△ Less
Submitted 25 March, 2021;
originally announced March 2021.
-
InCorr: Interactive Data-Driven Correlation Panels for Digital Outcrop Analysis
Authors:
Thomas Ortner,
Andreas Walch,
Rebecca Nowak,
Robert Barnes,
Thomas Höllt,
Eduard Gröller
Abstract:
Geological analysis of 3D Digital Outcrop Models (DOMs) for reconstruction of ancient habitable environments is a key aspect of the upcoming ESA ExoMars 2022 Rosalind Franklin Rover and the NASA 2020 Rover Perseverance missions in seeking signs of past life on Mars. Geologists measure and interpret 3D DOMs, create sedimentary logs and combine them in `correlation panels' to map the extents of key…
▽ More
Geological analysis of 3D Digital Outcrop Models (DOMs) for reconstruction of ancient habitable environments is a key aspect of the upcoming ESA ExoMars 2022 Rosalind Franklin Rover and the NASA 2020 Rover Perseverance missions in seeking signs of past life on Mars. Geologists measure and interpret 3D DOMs, create sedimentary logs and combine them in `correlation panels' to map the extents of key geological horizons, and build a stratigraphic model to understand their position in the ancient landscape. Currently, the creation of correlation panels is completely manual and therefore time-consuming, and inflexible. With InCorr we present a visualization solution that encompasses a 3D logging tool and an interactive data-driven correlation panel that evolves with the stratigraphic analysis. For the creation of InCorr we closely cooperated with leading planetary geologists in the form of a design study. We verify our results by recreating an existing correlation analysis with InCorr and validate our correlation panel against a manually created illustration. Further, we conducted a user-study with a wider circle of geologists. Our evaluation shows that InCorr efficiently supports the domain experts in tackling their research questions and that it has the potential to significantly impact how geologists work with digital outcrop representations in general.
△ Less
Submitted 8 November, 2020; v1 submitted 22 July, 2020;
originally announced July 2020.
-
BusTr: Predicting Bus Travel Times from Real-Time Traffic
Authors:
Richard Barnes,
Senaka Buthpitiya,
James Cook,
Alex Fabrikant,
Andrew Tomkins,
Fangzhou Xu
Abstract:
We present BusTr, a machine-learned model for translating road traffic forecasts into predictions of bus delays, used by Google Maps to serve the majority of the world's public transit systems where no official real-time bus tracking is provided. We demonstrate that our neural sequence model improves over DeepTTE, the state-of-the-art baseline, both in performance (-30% MAPE) and training stabilit…
▽ More
We present BusTr, a machine-learned model for translating road traffic forecasts into predictions of bus delays, used by Google Maps to serve the majority of the world's public transit systems where no official real-time bus tracking is provided. We demonstrate that our neural sequence model improves over DeepTTE, the state-of-the-art baseline, both in performance (-30% MAPE) and training stability. We also demonstrate significant generalization gains over simpler models, evaluated on longitudinal data to cope with a constantly evolving world.
△ Less
Submitted 2 July, 2020;
originally announced July 2020.
-
Factors in the Portability of Tokenized Assets on Distributed Ledgers
Authors:
Richard Barnes
Abstract:
The tokenization of assets deployed to distributed ledger technology is increasingly cited to revolutionize financial services by allowing traditionally illiquid assets to be bought and sold on primary and secondary markets increasing asset liquidity, transparency and reducing transaction completion time. To realize these benefits it is important the token is transferrable, that is, portable from…
▽ More
The tokenization of assets deployed to distributed ledger technology is increasingly cited to revolutionize financial services by allowing traditionally illiquid assets to be bought and sold on primary and secondary markets increasing asset liquidity, transparency and reducing transaction completion time. To realize these benefits it is important the token is transferrable, that is, portable from one distributed ledger to another. In this paper we survey current interoperability architectures and smart contract languages, identifying factors affecting the portability of tokenized assets. We propose a portability maturity model that can be used to help assess the current state of technology and supporting market infrastructure.
△ Less
Submitted 15 May, 2020;
originally announced May 2020.
-
Accelerating a fluvial incision and landscape evolution model with parallelism
Authors:
Richard Barnes
Abstract:
Solving inverse problems and achieving statistical rigour in landscape evolution models requires running many model realizations. Parallel computation is necessary to achieve this in a reasonable time. However, no previous algorithm is well-suited to leveraging modern parallelism. Here, I describe an algorithm that can utilize the parallel potential of GPUs, many-core processors, and SIMD instruct…
▽ More
Solving inverse problems and achieving statistical rigour in landscape evolution models requires running many model realizations. Parallel computation is necessary to achieve this in a reasonable time. However, no previous algorithm is well-suited to leveraging modern parallelism. Here, I describe an algorithm that can utilize the parallel potential of GPUs, many-core processors, and SIMD instructions, in addition to working well in serial. The new algorithm runs 43x faster (70s vs. 3,000s on a 10,000x10,000 input) than the previous state of the art and exhibits sublinear scaling with input size. I also identify methods for using multidirectional flow routing and quickly eliminating landscape depressions and local minima. Tips for parallelization and a step-by-step guide to achieving it are given to help others achieve good performance with their own code. Complete, well-commented, easily adaptable source code for all versions of the algorithm is available as a supplement and on Github.
△ Less
Submitted 8 March, 2018;
originally announced March 2018.
-
Gerrymandering and Compactness: Implementation Flexibility and Abuse
Authors:
Richard Barnes,
Justin Solomon
Abstract:
Political districts may be drawn to favor one group or political party over another, or gerrymandered. A number of measurements have been suggested as ways to detect and prevent such behavior. These measures give concrete axes along which districts and districting plans can be compared. However, measurement values are affected by both noise and the compounding effects of seemingly innocuous implem…
▽ More
Political districts may be drawn to favor one group or political party over another, or gerrymandered. A number of measurements have been suggested as ways to detect and prevent such behavior. These measures give concrete axes along which districts and districting plans can be compared. However, measurement values are affected by both noise and the compounding effects of seemingly innocuous implementation decisions. Such issues will arise for any measure. As a case study demonstrating the effect, we show that commonly-used measures of geometric compactness for district boundaries are affected by several factors irrelevant to fairness or compliance with civil rights law. We further show that an adversary could manipulate measurements to affect the assessment of a given plan. This instability complicates using these measurements as legislative or judicial standards to counteract unfair redistricting practices. This paper accompanies the release of packages in C++, Python, and R that correctly, efficiently, and reproducibly calculate a variety of compactness scores.
△ Less
Submitted 16 December, 2020; v1 submitted 7 March, 2018;
originally announced March 2018.
-
Parallel Non-divergent Flow Accumulation For Trillion Cell Digital Elevation Models On Desktops Or Clusters
Authors:
Richard Barnes
Abstract:
Continent-scale datasets challenge hydrological algorithms for processing digital elevation models. Flow accumulation is an important input for many such algorithms; here, I parallelize its calculation. The new algorithm works on one or many cores, or multiple machines, and can take advantage of large memories or cope with small ones. Unlike previous parallel algorithms, the new algorithm guarante…
▽ More
Continent-scale datasets challenge hydrological algorithms for processing digital elevation models. Flow accumulation is an important input for many such algorithms; here, I parallelize its calculation. The new algorithm works on one or many cores, or multiple machines, and can take advantage of large memories or cope with small ones. Unlike previous parallel algorithms, the new algorithm guarantees a fixed number of memory access and communication events per raster cell. In testing, the new algorithm ran faster and used fewer resources than previous algorithms, exhibiting ~30% strong and weak scaling efficiencies up to 48 cores and linear scaling across datasets ranging over three orders of magnitude. The largest dataset tested had two trillion (2*10^12) cells. With 48 cores, processing required 24 minutes wall-time (14.5 compute-hours). This test is three orders of magnitude larger than any previously performed in the literature. Complete, well-commented source code and correctness tests are available on Github.
△ Less
Submitted 3 February, 2017; v1 submitted 15 August, 2016;
originally announced August 2016.
-
Parallel Priority-Flood Depression Filling For Trillion Cell Digital Elevation Models On Desktops Or Clusters
Authors:
Richard Barnes
Abstract:
Algorithms for extracting hydrologic features and properties from digital elevation models (DEMs) are challenged by large datasets, which often cannot fit within a computer's RAM. Depression filling is an important preconditioning step to many of these algorithms. Here, I present a new, linearly-scaling algorithm which parallelizes the Priority-Flood depression-filling algorithm by subdividing a D…
▽ More
Algorithms for extracting hydrologic features and properties from digital elevation models (DEMs) are challenged by large datasets, which often cannot fit within a computer's RAM. Depression filling is an important preconditioning step to many of these algorithms. Here, I present a new, linearly-scaling algorithm which parallelizes the Priority-Flood depression-filling algorithm by subdividing a DEM into tiles. Using a single-producer, multi-consumer design, the new algorithm works equally well on one core, multiple cores, or multiple machines and can take advantage of large memories or cope with small ones. Unlike previous algorithms, the new algorithm guarantees a fixed number of memory access and communication events per subdivision of the DEM. In comparison testing, this results in the new algorithm running generally faster while using fewer resources than previous algorithms. For moderately sized tiles, the algorithm exhibits ~60% strong and weak scaling efficiencies up to 48 cores, and linear time scaling across datasets ranging over three orders of magnitude. The largest dataset on which I run the algorithm has 2 trillion (2*10^12) cells. With 48 cores, processing required 4.8 hours wall-time (9.3 compute-days). This test is three orders of magnitude larger than any previously performed in the literature. Complete, well-commented source code and correctness tests are available for download from a repository.
△ Less
Submitted 15 August, 2016; v1 submitted 20 June, 2016;
originally announced June 2016.
-
Distributed Parallel D8 Up-Slope Area Calculation in Digital Elevation Models
Authors:
Richard Barnes,
Clarence Lehman,
David Mulla
Abstract:
This paper presents a parallel algorithm for calculating the eight-directional (D8) up-slope contributing area in digital elevation models (DEMs). In contrast with previous algorithms, which have potentially unbounded inter-node communications, the algorithm presented here realizes strict bounds on the number of inter-node communications. Those bounds in turn allow D8 attributes to be processed fo…
▽ More
This paper presents a parallel algorithm for calculating the eight-directional (D8) up-slope contributing area in digital elevation models (DEMs). In contrast with previous algorithms, which have potentially unbounded inter-node communications, the algorithm presented here realizes strict bounds on the number of inter-node communications. Those bounds in turn allow D8 attributes to be processed for arbitrarily large DEMs on hardware ranging from average desktops to supercomputers. The algorithm can use the OpenMP and MPI parallel computing models, either in combination or separately. It partitions the DEM between slave nodes, calculates an internal up-slope area by replacing information from other slaves with variables representing unknown quantities, passes the results on to a master node which combines all the slaves' data, and passes information back to each slave, which then computes its final result. In this way each slave's DEM partition is treated as a simple unit in the DEM as a whole and only two communications take place per node.
△ Less
Submitted 18 May, 2016;
originally announced May 2016.
-
Priority-Flood: An Optimal Depression-Filling and Watershed-Labeling Algorithm for Digital Elevation Models
Authors:
Richard Barnes,
Clarence Lehman,
David Mulla
Abstract:
Depressions (or pits) are low areas within a digital elevation model that are surrounded by higher terrain, with no outlet to lower areas. Filling them so they are level, as fluid would fill them if the terrain were impermeable, is often necessary in preprocessing DEMs. The depression-filling algorithm presented here---called Priority-Flood---unifies and improves on the work of a number of previou…
▽ More
Depressions (or pits) are low areas within a digital elevation model that are surrounded by higher terrain, with no outlet to lower areas. Filling them so they are level, as fluid would fill them if the terrain were impermeable, is often necessary in preprocessing DEMs. The depression-filling algorithm presented here---called Priority-Flood---unifies and improves on the work of a number of previous authors who have published similar algorithms. The algorithm operates by flooding DEMs inwards from their edges using a priority queue to determine the next cell to be flooded. The resultant DEM has no depressions or digital dams: every cell is guaranteed to drain. The algorithm is optimal for both integer and floating-point data, working in O(n) and O(n lg n) time, respectively. It is shown that by using a plain queue to fill depressions once they have been found, an O(m lg m) time-complexity can be achieved, where m does not exceed the number of cells n. This is the lowest time complexity of any known floating-point depression-filling algorithm. In testing, this improved variation of the algorithm performed up to 37% faster than the original. Additionally, a parallel version of an older, but widely-used depression-filling algorithm required six parallel processors to achieve a run-time on par with what the newer algorithm's improved variation took on a single processor. The Priority-Flood Algorithm is simple to understand and implement: the included pseudocode is only 20 lines and the included C++ reference implementation is under a hundred lines. The algorithm can work on irregular meshes as well as 4-, 6-, 8-, and n-connected grids. It can also be adapted to label watersheds and determine flow directions through either incremental elevation changes or depression carving. In the case of incremental elevation changes, the algorithm includes safety checks not present in prior works.
△ Less
Submitted 13 November, 2015;
originally announced November 2015.
-
An Efficient Assignment of Drainage Direction Over Flat Surfaces in Raster Digital Elevation Models
Authors:
Richard Barnes,
Clarence Lehman,
David Mulla
Abstract:
In processing raster digital elevation models (DEMs) it is often necessary to assign drainage directions over flats---that is, over regions with no local elevation gradient. This paper presents an approach to drainage direction assignment which is not restricted by a flat's shape, number of outlets, or surrounding topography. Flow is modeled by superimposing a gradient away from higher terrain wit…
▽ More
In processing raster digital elevation models (DEMs) it is often necessary to assign drainage directions over flats---that is, over regions with no local elevation gradient. This paper presents an approach to drainage direction assignment which is not restricted by a flat's shape, number of outlets, or surrounding topography. Flow is modeled by superimposing a gradient away from higher terrain with a gradient towards lower terrain resulting in a drainage field exhibiting flow convergence, an improvement over methods which produce regions of parallel flow. This approach builds on previous work by Garbrecht and Martz (1997), but presents several important improvements. The improved algorithm guarantees that flats are only resolved if they have outlets. The algorithm does not require iterative application; a single pass is sufficient to resolve all flats. The algorithm presents a clear strategy for identifying flats and their boundaries. The algorithm is not susceptible to loss of floating-point precision. Furthermore, the algorithm is efficient, operating in O( N ) time whereas the older algorithm operates in O( N^(3/2) ) time. In testing, the improved algorithm ran 6.5 times faster than the old for a 100 x 100 cell flat and 69 times faster for a 700 x 700 cell flat. In tests on actual DEMs, the improved algorithm finished its processing 38--110 times sooner while running on a single processor than a parallel implementation of the old algorithm did while running on 16 processors. The improved algorithm is an optimal, accurate, easy-to-implement drop-in replacement for the original. Pseudocode is provided in the paper and working source code is provided in the Supplemental Materials.
△ Less
Submitted 13 November, 2015;
originally announced November 2015.