nimCSO



nim Composition Space Optimization (nimCSO) is a high-performance tool implementing several methods for selecting components (data dimensions) in compositional datasets. Given a constraint on the number of components to be selected, these methods optimize data availability and density for applications such as machine learning (ML), so that models can be designed in a way that balances their accuracy and domain of applicability. Making this choice is a combinatorially hard problem when data is composed of a large number of independent components, because the components present in each datapoint are interdependent. Thus, the efficiency of the search becomes critical for any application where interactions between components are of interest in a modeling effort, ranging:

  • from market economics,
  • through medicine where drug interactions can have a significant impact on the treatment,
  • to materials science, where the composition and processing history are critical to resulting properties.

We are particularly interested in the latter case of materials science, where we use nimCSO to optimize ML deployment over our datasets on Compositionally Complex Materials (CCMs), which are the largest ever collected (from almost 550 publications) and span up to 60 dimensions. They are developed within the ULTERA Project (ultera.org), carried out under the US DOE ARPA-E ULTIMATE program, which aims to develop a new generation of ultra-high temperature materials for aerospace applications through generative machine learning models (10.20517/jmi.2021.05) driving thermodynamic modeling and experimentation (10.2139/ssrn.4689687).

At its core, nimCSO leverages the metaprogramming ability of the Nim language to optimize itself at compile time, both in terms of speed and memory handling, to the specific problem statement and dataset at hand, based on a human-readable configuration file. As demonstrated later in the benchmarks, nimCSO reaches the physical limits of the hardware (L1 cache latency) and can outperform an efficient native Python implementation over 400 times in terms of speed and 50 times in terms of memory usage (not counting the interpreter), while also outperforming a NumPy implementation 35 and 17 times, respectively, when checking a candidate solution.

Main nimCSO figure

nimCSO is designed to be both (1) a user-ready tool (see figure above), implementing:

  • Efficient brute force approaches (for handling up to 25 dimensions)
  • Custom search algorithm (for up to 40 dimensions)
  • Genetic algorithm (for any dimensionality)

and (2) a scaffold for building even more elaborate methods in the future, including heuristics going beyond data availability. All configuration is done with a simple human-readable YAML config file and plain text data files, making it easy to modify the search method and its parameters with no knowledge of programming and only basic command line skills. A single command is used to recompile (nim c -f) and run (-r) a problem (-d:configPath=config.yaml) with nimCSO (src/nimcso) using one of several methods. Advanced users can also quickly customize the provided methods with brief scripts using nimCSO as a data-centric library.

Usage

Quick Start

To use nimCSO, you don't even need to install anything, as long as you have a (free) GitHub account, since we prepared a pre-configured Codespace for you! Simply click on the link below and it will create a cloud development environment with all dependencies installed through the Conda and Nimble package managers. You can then run nimCSO through the terminal or play with a Jupyter notebook we prepared.

Main nimCSO figure

General

config.yaml

The config.yaml file is the critical component, which defines several required parameters listed below. You can either change the values in the provided config.yaml or create a custom one, like config_rhea.yaml, and point to it at compile time with the -d:configPath=config_rhea.yaml flag. Inside, you will need to define the following parameters:

  • taskName - A string with the name of the task. It does not affect the results in any way, except for being printed during runtime for easier identification.
  • taskDescription - A string with the description of the task. It does not affect the results in any way, except for being printed during runtime for easier identification.
  • datasetPath - A string with the path (relative to CWD) to the dataset file. Please see Dataset files below for details on its content.
  • elementOrder - A list of strings with the names of the elements in the dataset. The order does not affect the results in any way, except for the order in which the elements will be printed in the resulting solutions. It does determine the order in which they are stored internally though, so if you are an advanced user and, e.g., write a custom heuristic, you may want to take advantage of this to, e.g., assign a list of weights to the elements.
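
As an illustration only, the four parameters above could be assembled into a config file programmatically. The sketch below is in Python (it is not part of nimCSO), assumes the parameters are plain top-level YAML keys, and uses a hypothetical helper named render_config; the task name, description, and dataset path are placeholders.

```python
def render_config(task_name, task_description, dataset_path, element_order):
    """Render the four required parameters as YAML text (assumed top-level keys)."""
    # Element names are quoted so that names such as N or Y cannot be
    # misread as booleans by YAML 1.1 parsers.
    elements = ", ".join(f'"{el}"' for el in element_order)
    lines = [
        f'taskName: "{task_name}"',
        f'taskDescription: "{task_description}"',
        f'datasetPath: "{dataset_path}"',
        f"elementOrder: [{elements}]",
    ]
    return "\n".join(lines) + "\n"

config_text = render_config(
    "RHEA",                           # placeholder task name
    "Five refractory elements",       # placeholder description
    "dataList.txt",                   # path relative to the CWD
    ["Mo", "Nb", "Ta", "V", "W"],
)
print(config_text)
```

The resulting text could be saved under any file name and passed at compile time through the -d:configPath flag described above.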

Dataset files

We wanted to make creating the input dataset as simple and as cross-platform as possible, so the dataset file should be a plain text file containing one set of elements (in any order) per line, separated by commas. You can use the .txt or .csv file extensions interchangeably, with no effect on nimCSO behavior, but note that editing CSV files with Excel in some countries (e.g., Italy) may cause issues. The dataset should not contain any header. The dataset can contain any elements, as those not present in elementOrder will be ignored at the parsing stage. It will generally look like:

Al,Cr,Hf,Mo,Ni,Re,Ru,Si,Ta,Ti,W
Al,Co,Cr,Cu,Fe,Ni,Ti
Al,B,C,Co,Cr,Hf,Mo,Ni,Ta,Ti,W,Zr
Mo,Nb,Si,Ta,W
Co,Fe,Mo,Ni,V
Hf,Nb,Ta,Ti,Zr
Mo,Nb,Ta,V,W
Al,Co,Cr,Fe,Ni,Si,Ti
Al,Co,Cr,Cu,Fe,Ni

You are also welcome to align the elements in columns, like below,

Al, B, Co, Cr
    B,     Cr, Fe, Ni
Al,    Co,     Fe, Ni

but having empty fields is not allowed, so Al, ,Co,Cr, , ,W would not be parsed correctly.
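
To make the format concrete, here is a minimal Python sketch of how such lines can be read. It is not the nimCSO parser; parse_line is a hypothetical helper, and dropping whole rows that contain an element outside elementOrder is an assumption based on how dataN is described later in this document.

```python
def parse_line(line):
    """Split one dataset line into a set of element names, ignoring padding."""
    return {name.strip() for name in line.split(",")}

element_order = ["Al", "B", "Co", "Cr", "Fe", "Ni"]  # stand-in for elementOrder

raw_lines = [
    "Al, B, Co, Cr",
    "    B,     Cr, Fe, Ni",
    "Al,    Co,     Fe, Ni",
    "Al, Co, Ub",  # contains an element outside element_order
]

rows = [parse_line(line) for line in raw_lines]
# Assumption: a row is kept only if all of its elements are in element_order.
kept = [row for row in rows if row <= set(element_order)]
print(len(kept), "of", len(rows), "rows kept")
```

Column alignment makes no difference here because surrounding whitespace is stripped from every name before comparison.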

The dataset provided by default with nimCSO comes from a snapshot of the ULTERA Database and lists elements in "aggregated" alloys, which means every entry corresponds to a unique HEA composition-processing-structure triplet (which may have several attached properties). The dataset access is currently limited, but once it is public, you will be able to obtain it (and newer versions) with Python code like this using the pymongo library:

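from pymongo import MongoClient

# Placeholder connection string; substitute the one issued with your ULTERA access.
client = MongoClient("<your-connection-string>")
collection = client['ULTERA']['AGGREGATED_Jul2023']
# Collect the element list of every entry with 3+ components from the selected sources.
elementList = [e['material']['elements'] for e in collection.find({
    'material.nComponents': {'$gte': 3},
    'metaSet.source': {'$in': ['LIT', 'VAL']},
    'properties.source': {'$in': ['EXP', 'DFT']}
    })]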

Notes:

Elemental Solutions

Throughout this codebase and documentation, you will see ElSolution, which is short for "Elemental Solution" and represents a solution to the problem of selecting elements to remove from the dataset. Using the word "elements" is not technically precise, as the solution space is built around components, which do not have to be elemental entities in your problem, and you can model compositions of any kind. However, in the code we consistently refer to "elements" because (1) it is the most common use case and (2) elSol obviously sounds much better than comSol.

Installation

If you want to use nimCSO on your machine (local or remote), the best course of action is likely to install dependencies and clone the software so that you can get a ready-to-use setup you can also customize. You can do it fairly easily in just a couple of minutes.

Nim (compiler)

First, you need to install the Nim language compiler, which on most Unix (Linux/macOS) systems is very straightforward.

  • On macOS, assuming you have Homebrew installed, simply:

    brew install nim

  • Using the conda, miniconda, mamba, or micromamba cross-platform package manager:

    conda install -c conda-forge nim

  • On most Linux distributions, you should also be able to use your built-in package manager like pacman, apt, yum, or rpm; however, the default channel/repository, especially on enterprise systems, may have an unsupported version (nim<2.0). While we do test nimCSO with 1.6 versions too, your experience may be degraded, so you may want to update it or go with another option.
  • You can, of course, also build it yourself from the nim source code! It is relatively straightforward and fast compared to many other languages.

On Windows, you may consider using WSL, i.e., Windows Subsystem for Linux, which is strongly recommended, interplays well with VS Code, and will let you act as if you were on Linux. If you need to use Windows directly, you can follow these installation instructions.

nimCSO

Then, you can use the bundled Nimble tool (package manager for Nim, similar to Rust's cargo or Python's pip) to install two top-level nim dependencies:

  • arraymancer, which is a powerful N-dimensional array library, and
  • yaml which parses the configuration files.

It's a single command:

nimble install --depsOnly

or, explicitly:

nimble install -y arraymancer yaml

Finally, you can clone the repository and compile the library with:

git clone https://github.com/amkrajewski/nimcso
cd nimcso
nim c -r -f -d:release src/nimcso
which will compile the library and print out a concise help message with the available CLI options.

And now, you are ready to use nimCSO :)

Install Notes

  • In general, nimble could give you a very similar experience to pip and allow you to install nimCSO from the nimble index (PyPI equivalent), without manual cloning and compilation. The reason the instructions above go through such additional gymnastics, and place the binary in the local src rather than some more general location, is that nimCSO is meant to be compiled with local YAML config files on a per-task basis, taking advantage of optimizations enabled by knowing task specifics. Thus, having it installed would, I think, confuse users and perhaps leave unnecessary files behind after task completion.
  • If you must use nim<2.0 for any reason, you may want to manually install package versions known to work with nim=1.6.x using nimble install -y yaml@1.1.0 arraymancer@0.7.32.
  • You can use nimble list -i to verify that Nim packages were installed correctly.
  • Please note that the nim used by nimCSO is the Nim programming language, not a Python package. While conda will work perfectly on Unix systems, you cannot install it with pip install nim, as the nim Python package on PyPI is an entirely different thing (an obscure interface to the Network Interface Monitor).

Contributing

What to Contribute

  • We explicitly welcome unsolicited user feedback and feature requests submitted through GitHub Issues.
  • If you wish to contribute to the development of nimCSO you are very welcome to do so by forking the repository and creating a pull request. As of Summer 2024, we are actively developing the code and using it in two separate research projects, so we should get back to you within a week or two.
  • We are particularly interested in:
    • Performance improvements, even if marginal.
    • Additional I/O file format handling, like HDF5, especially on the input data side.
    • Additional genetic algorithms, ideally, outperforming the current ones.
    • Additional test databases and configurations. We would love to see some high entropy ceramics, glasses, and metallic glasses on the materials science side, complex microbial communities on the biology side, and polypharmacy-related data on the pharmaceutical side.
  • We are also open to helping you run our code in non-profit or academic research cases! Please do not hesitate to contact us through GitHub Issues or by email (ak@psu.edu).

Rules for Contributing

  • We do not enforce any strict style convention for nimCSO contributions, as long as the code maintains high readability and overall quality.
  • If you are unsure what style to use, consult the nim compiler style convention and try to stick to it. In general, style conventions in the nim language are a very tricky subject compared to most languages. It is explicitly designed to not have code style conventions, even on the basic level like naming, and makes programmers read code closer to how it will be parsed into AST. The result is that collaborative projects use camelCase in one file to define a function and then snake_case in another one to call it. Surprisingly to some, nim programmers tend to cherish that, and even the largest projects like Arraymancer allow it.

Benchmarking

The key performance advantage of nimCSO comes from how it handles checking how many datapoints would have to be removed if a given set of elements were removed. You can quickly compare the performance (speed and memory usage) of nimCSO to other approaches based on (a) native Python sets and (b) a well-optimized NumPy implementation.
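
For orientation, the check being benchmarked can be written in a few lines with native Python sets. This is a minimal sketch of the idea on a tiny made-up dataset, not the code shipped in benchmarks/nativePython.py; the function name prevented_data is borrowed from the API below purely for readability.

```python
def prevented_data(removed, dataset):
    """Count datapoints that contain at least one of the removed elements."""
    return sum(1 for row in dataset if not removed.isdisjoint(row))

dataset = [
    {"Al", "Co", "Cr", "Fe", "Ni"},
    {"Mo", "Nb", "Ta", "V", "W"},
    {"Hf", "Nb", "Ta", "Ti", "Zr"},
    {"Co", "Fe", "Mo", "Ni", "V"},
]

# Removing Fe and Cr makes the first and the last datapoint unusable.
print(prevented_data({"Fe", "Cr"}, dataset))
```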

In the benchmarks directory, you will find 3 scripts, which will automatically ingest the example dataset we ship (dataList.txt) with 2,150 data points and try to remove all entries containing elements from a fixed set of 6 ("Fe", "Cr", "Ni", "Co", "Al", "Ti"). This is repeated thousands of times to get a good average, but should not take more than several seconds on a modern machine.

  • nimcso.nim - The nimCSO implementation based around BitArrays. From the root of the project, you can run it with a simple:

    nim c -r -f -d:release --threads:off benchmarks/nimcso

  • nativePython.py - A Python implementation using its native sets. If you have python (3.11 is recommended) installed, you can run it with:

    python benchmarks/nativePython.py

  • pythonNumPy.py - A Python implementation using NumPy. If you have python (3.11 is recommended) and numpy (v1.25+ is recommended) installed, you can run it with:

    python benchmarks/pythonNumPy.py

You should see results roughly aligning with the following (run on MacBook M2 Pro):

| Method | Time per Dataset | Time per Entry | Relative Speed | Representation Size | Relative Size |
|---|---|---|---|---|---|
| Native Python (3.11.8) | 107.5 µs | 50.0 ns | x1 | 871.5 kB | x1 |
| NumPy (1.26.4) | 36.4 µs | 16.9 ns | x3.0 | 79.7 kB | x10.9 |
| nimCSO (0.6.0) BitArray | 6.9 µs | 3.2 ns | x15.6 | 50.4 kB | x17.3 |
| nimCSO (0.6.0) uint64 | 0.98 µs | 0.456 ns | x110 | 16.8 kB | x52 |

Tests

nimCSO uses unittest to test (1) all results-producing functions for accuracy against known references, and (2) all of the available functions in the package for runtime errors. All tests are within the tests directory and can be run in one go with the tests/runAll script using the following command:

nim c -f -r -d:release -d:configPath=tests/config.yaml tests/runAll

which, as one can see, uses the tests/config.yaml file to configure the tests for a smaller set of elements (to reduce runtime) and a custom data file tests/testDataList.txt, which includes some elements like unobtanium (Ub) to verify filtering works as expected.

Types

ElSolution = ref object
  elBA*: BitArray
  prevented*: int
The ElSolution object, or Elemental Solution, represents a singular solution to the problem of selecting elements to remove from the dataset. It is a reference object with two core fields, but is meant to be extended beyond that for advanced use cases (e.g., utilizing multi-property heuristics for the search algorithms). These two fields are:
  • elBA: A BitArray, configured at compile time, which holds the elements removed from the dataset in this solution. It can hold any number of elements. Its size is 64-bit aligned, so any number of elements below 65 will not increase the memory usage.
  • prevented: An int field, which holds the number of datapoints prevented from being considered due to the removal of the elements encoded in the elBA. It is calculated when the solution is created and can be recalculated with the setPrevented procedure.

Consts

configPath {.strdefine.}: string = "config.yaml"
Compile-time-assigned constant pointing to the specific config.yaml file used to compile the current nimCSO binary. It is exported to allow users to easily assert in scripts that they are using the correct config file.
dataN = 2150
Compile-time-calculated constant based on your specific config/data files. Its value is config- and dataset-dependent and corresponds to the number of datapoints ingested from the dataset after filtering out datapoints not contributing to the solution space (because of elements present in them).
elementN = 37
Compile-time-calculated constant based on your specific config/data files. Allows us to optimize the data structures and relevant methods for the specific problem at compile time.
elementOrder = ["Fe", "Cr", "Ni", "Co", "Al", "Ti", "Nb", "Cu", "Mo", "Ta",
                "Zr", "V", "Hf", "W", "Mn", "Si", "Re", "B", "Ru", "C", "Sn",
                "Mg", "Zn", "Li", "O", "Y", "Pd", "N", "Ca", "Ir", "Sc", "Ge",
                "Be", "Ag", "Nd", "S", "Ga"]
Compile-time-calculated constant based on your specific config/data files. Does not affect which elements are present in the results, but determines the order in which they are handled internally and printed in the results.

Procs

proc `$`(elSol: ElSolution): string {....raises: [], tags: [], forbids: [].}
Casts the solution into a string with human-readable list of elements present in it (in the order based on the config) pointing with -> to the number of prevented datapoints.

Example:

let elSol = newElSolution(@["Cr", "Fe", "Ni"], getPresenceBitArrays())
assert $elSol=="FeCrNi->1484"
func `<`(a, b: ElSolution): bool {....raises: [], tags: [], forbids: [].}
Compares two ElSolutions based on the number of prevented datapoints. Used for sorting and comparison in the search algorithms.
proc `==`(a, b: ElSolution): bool {....raises: [], tags: [], forbids: [].}
Checks equality of two ElSolutions based on the equality of their BitArrays. Please note this is not the same as > and < operators, which are based on the number of prevented datapoints.
func `>`(a, b: ElSolution): bool {....raises: [], tags: [], forbids: [].}
Compares two ElSolutions based on the number of prevented datapoints. Used for sorting and comparison in the search algorithms.
proc algorithmSearch(verbose: bool = true): seq[ElSolution] {.
    ...raises: [IOError, ValueError], tags: [WriteIOEffect, TimeEffect],
    forbids: [].}
(Key Routine) This custom algorithm iteratively expands and evaluates candidates from a priority queue (binary heap), while leveraging the fact that the number of data points lost when removing elements A and B from the dataset has to be at least as large as when removing either A or B alone to delay exploration of candidates until they can contribute to the solution. Furthermore, to (1) avoid revisiting the same candidate without keeping track of visited states and (2) further inhibit the exploration of unlikely candidates, the algorithm assumes that while searching for a given order of solution, elements present in already expanded solutions will not improve those not yet expanded. This effectively prunes candidate branches requiring two or more levels of backtracking. This method has generated the same results as combinatoric brute forcing in our tests, as demonstrated in the tests/algorithmSearch script, except for occasional differences in the last explored solution. By default, the BitArray representation is used, but the bool array representation can be used by setting the presenceArrays parameter to getPresenceBoolArrays().
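
The monotonicity argument above can be illustrated with a much simplified best-first search. The sketch below is in Python with integer bitmasks, keeps a visited set (which the routine described above is designed to avoid), and omits its pruning assumption, so it should be read as an illustration of the priority-queue idea rather than the nimCSO implementation. The names prevented and best_first are hypothetical.

```python
import heapq

def prevented(mask, rows):
    """Rows sharing at least one set bit with mask would be lost."""
    return sum(1 for row in rows if row & mask)

def best_first(rows, element_n, max_order):
    """Return {order: (prevented, mask)} with a least-preventing mask per order."""
    best = {}
    visited = {0}
    heap = [(0, 0)]  # (prevented count, mask); removing nothing loses nothing
    while heap and len(best) < max_order:
        cost, mask = heapq.heappop(heap)
        order = bin(mask).count("1")
        if order > 0 and order not in best:
            # Adding elements never lowers the cost, so no candidate still
            # queued (or descending from one) can beat this one for its order.
            best[order] = (cost, mask)
        if order == max_order:
            continue  # no need to expand beyond the requested order
        for bit in range(element_n):
            child = mask | (1 << bit)
            if child not in visited:
                visited.add(child)
                heapq.heappush(heap, (prevented(child, rows), child))
    return best

rows = [0b00111, 0b01111, 0b10011, 0b00011, 0b11000, 0b00100]
for order, (cost, mask) in sorted(best_first(rows, 5, 3).items()):
    print(order, format(mask, "05b"), cost)
```

Because the cheapest candidate is always expanded first, expensive branches stay in the queue untouched unless every cheaper alternative has been exhausted.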
proc bruteForce(verbose: bool = true): seq[ElSolution] {.
    ...raises: [IOError, ValueError], tags: [WriteIOEffect, TimeEffect],
    forbids: [].}

(Key Routine) A high performance (35 times faster than native Python and 4 times faster than NumPy) and easily extensible (leveraging the ElSolution type) brute force algorithm for finding the optimal solution for the problem of which N elements to remove from the dataset to lose the least data. It enumerates all entries in the power set of considered elements by representing them as integers from 0 to 2^elementN - 1 and using them to initialize BitArrays. It then iteratively evaluates them, keeping track of the best solution for each order (number of elements present in the solution), which allows for a minimal memory footprint as only several solutions are kept in memory at a time. The results are printed to the console. It is implemented for up to 64 elements, as it is not feasible for more than around 30 elements, but it could be extended by simply enumerating solutions as two or more integers and using them to initialize BitArrays.
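
The enumeration described above can be sketched in Python using plain integers as bitmasks. This is only an illustration on a tiny made-up dataset, with hypothetical names (prevented, brute_force) and an assumed convention that bit i corresponds to the i-th entry of the element list; it is not the nimCSO code.

```python
def prevented(mask, rows):
    """Rows sharing at least one set bit with mask would be lost."""
    return sum(1 for row in rows if row & mask)

def brute_force(rows, element_n):
    """Evaluate every non-empty subset; keep the best (prevented, mask) per order."""
    best = {}
    for mask in range(1, 1 << element_n):
        order = bin(mask).count("1")  # number of removed elements
        cost = prevented(mask, rows)
        if order not in best or cost < best[order][0]:
            best[order] = (cost, mask)
    return best

element_order = ["Fe", "Cr", "Ni", "Co"]  # bit i corresponds to element_order[i]
rows = [0b0111, 0b1111, 0b0011, 0b1000, 0b0100]
for order, (cost, mask) in sorted(brute_force(rows, len(element_order)).items()):
    names = [el for i, el in enumerate(element_order) if mask >> i & 1]
    print(order, "-".join(names), cost)
```

Only one entry per order is retained at any time, which is what keeps the memory footprint small regardless of how many subsets are visited.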

Every time you start this algorithm, it will benchmark your system to estimate how long it will take, so that you can avoid waiting forever if you accidentally run it on too large a problem or too slow a system, as shown in the figure below, where both problems apply.
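
The arithmetic behind such an estimate is simple enough to sketch. The Python snippet below is an assumption about the general approach (time a few checks, then multiply by the number of subsets), not a description of how nimCSO computes its ETA; both function names are hypothetical, and the 0.98 µs figure is the uint64 time per dataset check from the Benchmarking section.

```python
import time

def seconds_per_check(check, samples=1000):
    """Average wall-clock time of one check(mask) call over `samples` calls."""
    start = time.perf_counter()
    for mask in range(1, samples + 1):
        check(mask)
    return (time.perf_counter() - start) / samples

def brute_force_eta(per_check, element_n):
    """Extrapolate to all 2**element_n - 1 non-empty subsets of the elements."""
    return per_check * (2 ** element_n - 1)

# Extrapolating from 0.98 microseconds per check, ignoring all other overhead:
for n in (25, 30, 40):
    print(f"{n} elements: about {brute_force_eta(0.98e-6, n):,.0f} s")
```

Under these assumptions, 25 elements take about half a minute while 40 elements take on the order of twelve days, since every added element doubles the number of subsets.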

bruteForceETAExample
proc bruteForceInt(verbose: bool = true): seq[ElSolution] {.
    ...raises: [IOError, ValueError], tags: [WriteIOEffect, TimeEffect],
    forbids: [].}
(Key Routine) A really high performance (400 times faster than native Python and 50 times faster than NumPy) brute force algorithm for finding the optimal solution for the problem of which N elements to remove from the dataset to lose the least data. Unlike the standard bruteForce algorithm, it does not use the ElSolution type, cannot be easily extended to other use cases, and cannot be used for more than 64 elements without sacrificing performance, at which point bruteForce should be a much better choice.
proc crossover(elSol1: var ElSolution; elSol2: var ElSolution;
               presenceArrays: seq[BitArray] | seq[seq[bool]]): void
The implementation of the crossover here is more elaborate than the typical swapping you may see elsewhere, due to the constraint of conserving the number of set bits in the output solutions, so that they retain the same order. The procedure takes two var ElSolutions and:
  1. Finds positions of non-overlapping bits, while not modifying the overlapping ones.
  2. Randomizes the order of non-overlapping positions set.
  3. Sets the bits in the output solutions by picking from the randomized set of non-overlapping positions.
nimCSO Crossover

It is primarily used in the geneticSearch algorithm.
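
The three steps listed above can be sketched in Python on integer bitmasks. This is an illustrative reading of the description, not the nimCSO procedure: it returns two new masks instead of modifying its arguments in place, does not recalculate prevented datapoints, and the optional rng parameter is an addition made here for reproducibility.

```python
import random

def crossover(a, b, rng=random):
    """Exchange non-overlapping bits of masks a and b, preserving each popcount."""
    width = max(a, b).bit_length()
    shared = a & b                      # overlapping bits are left untouched
    only_a = [i for i in range(width) if (a & ~b) >> i & 1]
    only_b = [i for i in range(width) if (b & ~a) >> i & 1]
    pool = only_a + only_b              # step 1: positions set in exactly one parent
    rng.shuffle(pool)                   # step 2: randomize their order
    child_a, child_b = shared, shared
    for pos in pool[:len(only_a)]:      # step 3: deal the positions back out
        child_a |= 1 << pos
    for pos in pool[len(only_a):]:
        child_b |= 1 << pos
    return child_a, child_b

a, b = 0b0011101, 0b1110001
child_a, child_b = crossover(a, b, random.Random(0))
print(format(child_a, "07b"), format(child_b, "07b"))
```

Each child receives exactly as many non-overlapping positions as its parent had, so both keep their original order.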

proc geneticSearch(verbose: bool = true; initialSolutionsN: Natural = 100;
                   searchWidth: Natural = 100; maxIterations: Natural = 1000;
                   minIterations: Natural = 10; mutationsN: Natural = 1): seq[
    ElSolution] {....raises: [IOError, ValueError],
                  tags: [WriteIOEffect, TimeEffect], forbids: [].}
(Key Routine) This custom genetic algorithm utilizes custom mutate and crossover procedures, which preserve the number of elements present (bits set) in their output solutions, to iteratively improve a set of solutions. It is primarily aimed at (1) problems with more than 40 elements, where neither bruteForce nor algorithmSearch is feasible, and (2) cases where a decent solution is needed quickly. Its implementation allows for arbitrary dimensionality of the problem, and its time complexity will scale linearly with it. You may control a set of parameters to adjust the algorithm to your needs, including the number of initial randomly generated solutions initialSolutionsN, the number of solutions to carry over to the next iteration searchWidth, the maximum number of iterations maxIterations, and the minimum number of iterations minIterations the solution has to fail to improve to be considered converged.
func getNextNodes(elSol: ElSolution; exclusions: BitArray;
                  presenceBitArrays: seq[BitArray] | seq[seq[bool]]): seq[
    ElSolution]
Takes the current ElSolution and compares it with the exclusions BitArray to determine all possible next steps (removing an additional element from the dataset) that do not overlap with the exclusions. Used primarily in the algorithmSearch routine to explore the solution space without visiting the same solution twice. Performance can be tested with the expBenchmark routine.
func getPresenceBitArrays(): seq[BitArray] {....raises: [], tags: [], forbids: [].}
Returns a sequence of BitArrays encoding the presence of elements in each row in the dataset within bits of integers stored in each BitArray. It operates based on compile-time constants.
func getPresenceBoolArrays(): seq[seq[bool]] {....raises: [], tags: [], forbids: [].}
Returns a sequence of sequences of bools encoding the presence of elements in each row in the dataset. It is faster than getPresenceBitArrays but uses more memory; however, only one instance is stored in memory at a time, so it is a better choice for databases with fewer than many millions of rows. It operates based on compile-time constants.
proc getPresenceIntArray(): array[dataN, uint64] {....raises: [], tags: [],
    forbids: [].}
Returns a compile-time-determined-length array of unsigned integers encoding the presence of elements in each row in the dataset, which is as fast and compact as you can get on a 64-bit architecture. It is by far the most limiting representation implemented in nimCSO, which will not work for datasets with more than 64 elements, but it is blazingly fast to access and process, since we can leverage the hardware's native bit operations, and uses a couple times less memory than the BitArray representation thanks to not having intermediate pointers, which for under 64 elements are the same size as the payload itself.
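
The integer encoding can be pictured with a short Python sketch. The bit layout used here (bit i marks the i-th entry of the element list, starting from the least significant bit) is an assumption made for illustration, and encode_row is a hypothetical helper rather than a nimCSO function.

```python
element_order = ["Fe", "Cr", "Ni", "Co", "Al", "Ti"]
bit_of = {el: i for i, el in enumerate(element_order)}

def encode_row(elements):
    """Pack a row's elements into one integer; bit i marks element_order[i]."""
    mask = 0
    for el in elements:
        mask |= 1 << bit_of[el]
    return mask

presence = [encode_row(row) for row in (
    ["Al", "Co", "Cr", "Fe", "Ni"],
    ["Al", "Co", "Cr", "Fe", "Ni", "Ti"],
    ["Cr", "Ti"],
)]
print([format(mask, "06b") for mask in presence])
```

With up to 64 elements, each row fits in a single machine word, which is why no pointers or per-row allocations are needed.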
proc getPresenceTensor(): Tensor[int8] {....raises: [], tags: [], forbids: [].}
(Legacy function retained for easy Arraymancer integration for library users) Returns an Arraymancer Tensor[int8] denoting presence of elements in the dataset (1 if present, 0 if not), which can then be used to calculate the quantity of data prevented by removal of a given set of elements. It operates based on compile-time constants.
func hash(elSol: ElSolution): Hash {....raises: [], tags: [], forbids: [].}
Hashes the solution based on the hash of its BitArray only and not the number of prevented datapoints. Hashing is used for storage in HashSets and OrderedSets. The omission of the number of prevented datapoints is intentional, as it allows checking for the presence of hypothetical solutions among the initialized (calculated) solutions without the need to calculate presence.
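
The split between identity (which elements are removed) and ordering (how many datapoints are prevented) can be mirrored in a few lines of Python. The Solution class below is a hypothetical analogue written for illustration, not a binding to the ElSolution type.

```python
from dataclasses import dataclass

@dataclass(eq=False)
class Solution:
    bits: int           # which elements are removed
    prevented: int = 0  # datapoints lost; may not have been computed yet

    def __eq__(self, other):  # identity: the element set only
        return self.bits == other.bits

    def __hash__(self):       # hash agrees with equality, ignoring prevented
        return hash(self.bits)

    def __lt__(self, other):  # ordering: by the number of prevented datapoints
        return self.prevented < other.prevented

seen = {Solution(0b0111, 1484)}
# Membership can be tested for a candidate whose prevented count is unknown.
print(Solution(0b0111) in seen)
```

Keeping the hash independent of the prevented count lets a search skip already evaluated candidates before paying for their evaluation.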
proc leastPreventing(verbose: bool = true): seq[ElSolution] {.
    ...raises: [IOError, ValueError], tags: [TimeEffect, WriteIOEffect],
    forbids: [].}
Runs a search for single-element solutions preventing the least data, i.e. the least common elements based on the filtered dataset. Returns a sequence of ElSolutions which can be used on its own (by setting verbose to see it or by using saveResults) or as a starting point for an exploration technique.
proc mostCommon(verbose: bool = true): seq[ElSolution] {.
    ...raises: [IOError, ValueError], tags: [TimeEffect, WriteIOEffect],
    forbids: [].}
Convenience wrapper for the leastPreventing routine, which returns its results in reversed order. It was added for the sake of clarity.
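
Since a single-element solution prevents exactly the datapoints containing that element, both routines amount to counting and sorting. The Python sketch below illustrates this on four lines taken from the example dataset earlier in this document; breaking ties alphabetically is an assumption made here, not a documented nimCSO behavior.

```python
from collections import Counter

dataset = [
    {"Mo", "Nb", "Si", "Ta", "W"},
    {"Co", "Fe", "Mo", "Ni", "V"},
    {"Hf", "Nb", "Ta", "Ti", "Zr"},
    {"Mo", "Nb", "Ta", "V", "W"},
]

# Number of datapoints containing each element = datapoints prevented by removing it.
counts = Counter(el for row in dataset for el in row)
least_preventing = sorted(counts.items(), key=lambda kv: (kv[1], kv[0]))
most_common = least_preventing[::-1]  # the same results in reversed order

print(least_preventing[:3])
print(most_common[:3])
```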
proc mutate(elSol: var ElSolution;
            presenceArrays: seq[BitArray] | seq[seq[bool]]): void
Mutates the var ElSolution by taking its BitArray and swapping at random two of its bits from the range encoding presence of elements in the dataset. It then recalculates the number of prevented datapoints based on the presence of elements in the dataset encoded in either a sequence of BitArrays or a sequence of sequences of bools.

nimCSO Mutation

As depicted in the diagram, the mutation procedure is fully random, so (1) a bit can swap with itself, (2) bits can swap causing a flip, or (3) bits can swap with no effect.
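
The three outcomes can be seen in a small Python sketch operating on an integer bitmask. It is an illustration of the swap described above under the assumption that both positions are drawn independently and uniformly; it does not recalculate prevented datapoints, and the rng parameter is an addition made here for reproducibility.

```python
import random

def mutate(mask, element_n, rng=random):
    """Swap the values of two randomly chosen bit positions in mask."""
    i, j = rng.randrange(element_n), rng.randrange(element_n)
    bit_i, bit_j = mask >> i & 1, mask >> j & 1
    if bit_i != bit_j:
        # Case (2): a set and an unset bit trade places, flipping both.
        mask ^= (1 << i) | (1 << j)
    # Cases (1) and (3): same position or equal values, so nothing changes.
    return mask

rng = random.Random(42)
mask = 0b00111
for _ in range(5):
    mask = mutate(mask, 5, rng)
    print(format(mask, "05b"))
```

Whatever positions are drawn, the number of set bits stays the same, so a mutated solution keeps its order.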

proc newElSolution(elBA: BitArray; pBA: seq[BitArray] | seq[seq[bool]]): ElSolution
Creates a new ElSolution object based on a BitArray encoding the presence of elements. It uses a sequence of BitArrays or a sequence of sequences of bools to calculate the number of prevented datapoints to set the prevented field.
proc newElSolution(elementSet: seq[string]; pBA: seq[BitArray] | seq[seq[bool]]): ElSolution
Creates a new ElSolution object from a sequence of element name strings, which it encodes into a BitArray based on the elementOrder defined in the config and passes to the other newElSolution procedure it overloads. In the process, it checks if the element set is a subset of the element order defined in the config.
proc newElSolutionRandomN(order: int; pBA: seq[BitArray] | seq[seq[bool]]): ElSolution
Creates a new ElSolution object with a random set of order number of elements present in it by randomly setting bits in an initially unset BitArray until it reaches the desired number of bits set. It uses a sequence of BitArrays or a sequence of sequences of bools to calculate the number of prevented datapoints to set the prevented field. Primarily used in the geneticSearch algorithm, but could be used in other contexts as well.
func presentInData(elList: BitArray; pBAs: seq[BitArray] | seq[seq[bool]]): int
A philosophical opposite of preventedData procedures. It returns the number of datapoints which have all of the elements encoded by the elList BitArray present in them, based on either a sequence of BitArrays or a sequence of sequences of bools encoding presence in the dataset. For single element, it could be obtained by subtracting the prevented data from the total data count, but it gets more complicated for multiple elements.
func preventedData(elList: BitArray; presenceBitArrays: seq[BitArray]): int {.
    ...raises: [], tags: [], forbids: [].}
Returns the number of datapoints prevented by removal of the elements encoded in the elList BitArray by comparing it to the sequence of BitArrays encoding presence in the dataset.
func preventedData(elList: BitArray; presenceBoolArrays: seq[seq[bool]]): int {.
    ...raises: [], tags: [], forbids: [].}
Returns the number of datapoints prevented by removal of the elements encoded in the elList BitArray by comparing it to the sequence of sequences of bools encoding presence in the dataset.
proc preventedData(elList: Tensor[int8]; presenceTensor: Tensor[int8]): int {.
    ...raises: [ValueError], tags: [], forbids: [].}
Returns the number of datapoints prevented by removal of the elements encoded in the elList 1D Tensor[int8] by comparing it to the 2D Tensor[int8] encoding presence in the dataset.
func preventedData(elList: uint64; presenceIntArray: array[dataN, uint64]): int {.
    ...raises: [], tags: [], forbids: [].}
Returns the number of datapoints prevented by removal of the elements encoded in the elList unsigned integer (uint64) by comparing it to the compile-time-determined-length array of unsigned integers encoding the presence of elements in each row in the dataset. It leverages the hardware's native bit operations whenever possible, and is blazingly fast in cases where it can be used.
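
With rows stored as integers, the check reduces to one bitwise AND per row. The Python sketch below shows the idea on made-up masks; Python integers are arbitrary-precision objects rather than native 64-bit words, so it conveys the logic but none of the speed.

```python
def prevented_data(removed_mask, presence_ints):
    """A row is lost when it shares at least one set bit with removed_mask."""
    return sum(1 for row in presence_ints if (row & removed_mask) != 0)

# Each integer encodes one row; each bit marks the presence of one element.
presence_ints = [0b011111, 0b111111, 0b100010, 0b110000]

print(prevented_data(0b000011, presence_ints))  # rows containing bit 0 or bit 1
```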
proc randomize(elSol: var ElSolution): void {....raises: [], tags: [], forbids: [].}
Randomizes the BitArray of the var ElSolution by setting each bit to a random value, used primarily for benchmarking purposes.
proc saveFilteredDataset(path: string = "filteredDataset.csv"): void {.
    ...raises: [IOError], tags: [WriteIOEffect], forbids: [].}
Saves the filtered dataset (containing only the datapoints contributing to the solution space as defined by the elementOrder set) into a file at the path.
proc saveResults(results: seq[ElSolution]; path: string = "results.csv";
                 separator: string = "-"): void {....raises: [IOError],
    tags: [WriteIOEffect], forbids: [].}
Saves results from any routine (stored in a sequence of ElSolutions) to a CSV file at the path, with columns "Removed Elements", "Allowed Elements", "Prevented", and "Allowed". The separator is used to separate the element names, and it is set to a dash (-) by default, resulting in an easily readable string (e.g., Cr-Fe-Ni-Mo).
func setPrevented(elSol: var ElSolution;
                  presenceArrays: seq[BitArray] | seq[seq[bool]]): void
Calculates and sets the prevented field of the var ElSolution based on the presence of elements in the dataset encoded in either a sequence of BitArrays or a sequence of sequences of bools.
proc singleSolution(args: seq[string]; verbose: bool = true): seq[ElSolution] {.
    ...raises: [IOError, ValueError], tags: [WriteIOEffect], forbids: [].}
Parses the arguments to find all -ss or --singleSolution words, splits the arguments to single tasks, and runs them through the newElSolution procedure. It then tests and returns all the solutions. It is primarily used for testing purposes if you want to manually test a set of solutions.