Tool Setup and Run Command Reference

Overview

The public rerun begins from stored prediction contracts. Those files are produced upstream from tool-specific outputs and then evaluated in a common benchmark schema. This page records the tool roster, the configured commands, and the location of the upstream setup notes.

Manifest and registry

Code
from pathlib import Path

import pandas as pd
import yaml

manifest_path = Path('../config/tool_output_manifest.example.yml')
registry_path = Path('../config/prediction_tools.yaml')

manifest = yaml.safe_load(manifest_path.read_text(encoding='utf-8'))
registry = yaml.safe_load(registry_path.read_text(encoding='utf-8'))

manifest_df = pd.DataFrame(manifest['tool_outputs'])
registry_df = pd.DataFrame.from_dict(registry['tools'], orient='index').reset_index(names='tool_slug')


def summarize_command(value):
    """Render a configured command (string or list of parts) as one display string."""
    if value in (None, '', []):
        return 'not used in the public rerun'
    if isinstance(value, list):
        return ' '.join(str(part) for part in value)
    return str(value)


def classify_role(value):
    """Map a registry tool_role key to its display label; unknown keys pass through."""
    mapping = {
        'search_engine': 'native genome search',
        'pair_scorer': 'pair scorer',
        'web_service_search': 'web service search',
    }
    return mapping.get(str(value), str(value))


# Left merge keeps every manifest row; registry fields are NaN where a tool or key is absent.
tool_table = manifest_df.merge(registry_df, on='tool_slug', how='left')
tool_table['benchmark_role'] = tool_table['tool_role'].map(classify_role)
tool_table['local_command_summary'] = tool_table['local_command'].map(summarize_command)
tool_table['docker_command_summary'] = tool_table['docker_command'].map(summarize_command)

tool_table[
    [
        'tool',
        'tool_slug',
        'mode',
        'benchmark_role',
        'relative_path',
        'local_command_summary',
        'docker_command_summary',
    ]
]
Output
  | tool               | tool_slug          | mode          | benchmark_role       | relative_path                                     | local_command_summary | docker_command_summary
0 | Cas-OFFinder       | cas_offinder       | native_search | native genome search | data/zenodo/standard_tool_predictions/predicti... | cas-offinder          | snugel/cas-offinder:latest cas-offinder
1 | CRISPRitz_mismatch | crispritz_mismatch | native_search | native genome search | data/zenodo/standard_tool_predictions/predicti... | crispritz.py          | pinellolab/crispritz:latest crispritz.py
2 | CRISPRitz_cfd      | crispritz_cfd      | native_search | native genome search | data/zenodo/standard_tool_predictions/predicti... | crispritz.py          | pinellolab/crispritz:latest crispritz.py
3 | FlashFry           | flashfry           | native_search | native genome search | data/zenodo/standard_tool_predictions/predicti... | flashfry              | eclipse-temurin:8-jre java -Xmx8g -jar FlashFr...
4 | GuideScan2         | guidescan2         | native_search | native genome search | data/zenodo/standard_tool_predictions/predicti... | guidescan             | nan
5 | CRISPROFF          | crisproff          | pair_scorer   | pair scorer          | data/zenodo/standard_tool_predictions/predicti... | run_crisproff.py      | nan
6 | CCTop              | cctop              | native_search | web service search   | data/zenodo/standard_tool_predictions/predicti... | cctop_submit.py       | nan
7 | CRISPOR            | crispor            | native_search | native genome search | data/zenodo/standard_tool_predictions/predicti... | nan                   | maximilianh/crispor:latest
8 | MOFF               | moff               | pair_scorer   | pair scorer          | data/zenodo/standard_tool_predictions/predicti... | MOFF score            | nan
9 | CRISOT             | crisot             | pair_scorer   | pair scorer          | data/zenodo/standard_tool_predictions/predicti... | CRISOT.py scores      | nan

The manifest defines the public contract files required by the rerun. The tool registry defines the upstream commands and runtime assumptions used to create those contract files.
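
Because the table above is built with a left merge on tool_slug, a manifest row whose slug is absent from the registry would still be listed, only with empty registry columns. The following is a minimal sketch of a consistency check, assuming the two parsed files have the shapes implied by the loading code (a list of rows under tool_outputs and a slug-keyed mapping under tools). The function name and the sample entries are illustrative and are not copied from the repository.

```python
def slugs_missing_from_registry(manifest, registry):
    """Return manifest tool_slug values that have no entry in the registry."""
    known = set(registry.get('tools', {}))
    return [
        row['tool_slug']
        for row in manifest.get('tool_outputs', [])
        if row['tool_slug'] not in known
    ]


# Hypothetical, trimmed-down stand-ins for the two parsed YAML files.
example_manifest = {
    'tool_outputs': [
        {'tool': 'Cas-OFFinder', 'tool_slug': 'cas_offinder', 'mode': 'native_search'},
        {'tool': 'MOFF', 'tool_slug': 'moff', 'mode': 'pair_scorer'},
    ]
}
example_registry = {
    'tools': {
        'cas_offinder': {'tool_role': 'search_engine', 'pam': 'NGG', 'max_mismatches': 6},
    }
}

missing = slugs_missing_from_registry(example_manifest, example_registry)
print(missing)
```

An empty result would indicate that every manifest row can be matched to a registry entry before the merge is trusted.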

Contract files

Each row in the manifest corresponds to one public contract file.

  • tool is the manuscript-facing tool name.
  • tool_slug is the configuration key.
  • mode distinguishes native search tools (native_search) from pair scorers (pair_scorer).
  • relative_path is the expected location of the standardized contract file.

The contract layer is the public handoff between upstream tool execution and the benchmark rerun.
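
One way to confirm that the handoff is complete before starting a rerun is to check that every relative_path in the manifest resolves to a file. The sketch below assumes the paths are relative to a root directory supplied by the caller. The function name and the demo file names are hypothetical.

```python
from pathlib import Path
import tempfile


def missing_contract_files(tool_outputs, root):
    """Return (tool_slug, relative_path) pairs whose contract file is absent."""
    root = Path(root)
    return [
        (row['tool_slug'], row['relative_path'])
        for row in tool_outputs
        if not (root / row['relative_path']).is_file()
    ]


# Demo against a temporary directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    present = Path(tmp) / 'contracts' / 'tool_a.tsv'
    present.parent.mkdir(parents=True)
    present.write_text('placeholder\n', encoding='utf-8')

    rows = [
        {'tool_slug': 'tool_a', 'relative_path': 'contracts/tool_a.tsv'},
        {'tool_slug': 'tool_b', 'relative_path': 'contracts/tool_b.tsv'},
    ]
    absent = missing_contract_files(rows, tmp)

print(absent)
```

In the notebook itself, the rows would come from manifest['tool_outputs'] and the root would be whichever directory the relative paths are defined against.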

Per-tool summary

Code
provenance_view = tool_table[
    [
        'tool',
        'mode',
        'benchmark_role',
        'pam',
        'max_mismatches',
        'local_command_summary',
        'docker_command_summary',
    ]
].rename(
    columns={
        'tool': 'Tool',
        'mode': 'Public rerun mode',
        'benchmark_role': 'Benchmark role',
        'pam': 'Configured PAM',
        'max_mismatches': 'Configured mismatch cap',
        'local_command_summary': 'Configured local command',
        'docker_command_summary': 'Configured Docker command',
    }
)
provenance_view
Output
  | Tool               | Public rerun mode | Benchmark role       | Configured PAM | Configured mismatch cap | Configured local command | Configured Docker command
0 | Cas-OFFinder       | native_search     | native genome search | NGG            | 6.0                     | cas-offinder             | snugel/cas-offinder:latest cas-offinder
1 | CRISPRitz_mismatch | native_search     | native genome search | NGG            | 6.0                     | crispritz.py             | pinellolab/crispritz:latest crispritz.py
2 | CRISPRitz_cfd      | native_search     | native genome search | NGG            | 6.0                     | crispritz.py             | pinellolab/crispritz:latest crispritz.py
3 | FlashFry           | native_search     | native genome search | NGG            | 6.0                     | flashfry                 | eclipse-temurin:8-jre java -Xmx8g -jar FlashFr...
4 | GuideScan2         | native_search     | native genome search | NGG            | 6.0                     | guidescan                | nan
5 | CRISPROFF          | pair_scorer       | pair scorer          | NaN            | NaN                     | run_crisproff.py         | nan
6 | CCTop              | native_search     | web service search   | NGG            | 5.0                     | cctop_submit.py          | nan
7 | CRISPOR            | native_search     | native genome search | NGG            | 6.0                     | nan                      | maximilianh/crispor:latest
8 | MOFF               | pair_scorer       | pair scorer          | NaN            | NaN                     | MOFF score               | nan
9 | CRISOT             | pair_scorer       | pair scorer          | NaN            | NaN                     | CRISOT.py scores         | nan

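The two command columns summarize how each tool was invoked upstream. They are not full shell transcripts, but they identify the program, wrapper, or container used to generate the normalized contract file. A value of nan in a command column means the registry holds no command of that kind for the tool. The PAM and mismatch cap are NaN for the pair scorers, which have neither setting in this table.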
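
The nan entries in the recorded output arise because a missing registry value reaches summarize_command as a float NaN, which its check against (None, '', []) does not match, so the 'not used in the public rerun' label is never applied. Below is a hedged sketch of a variant that treats NaN as absent. The name summarize_command_nan_safe is hypothetical and is not part of the notebook, and the list passed in the demo is a made-up value.

```python
import math


def summarize_command_nan_safe(value):
    """Like summarize_command, but also treats float NaN as 'no command'."""
    is_nan = isinstance(value, float) and math.isnan(value)
    if value is None or is_nan or value == '' or value == []:
        return 'not used in the public rerun'
    if isinstance(value, list):
        # Join list-style commands into a single display string.
        return ' '.join(str(part) for part in value)
    return str(value)


nan_label = summarize_command_nan_safe(float('nan'))
list_label = summarize_command_nan_safe(['image:tag', 'tool'])  # made-up parts
print(nan_label)
print(list_label)
```

Swapping this variant into the two .map(...) calls would change the displayed tables, so the recorded output would need to be regenerated alongside it.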

Upstream provenance

The upstream execution and normalization logic is documented in:

  • config/prediction_tools.yaml
  • config/tool_output_manifest.example.yml
  • data/zenodo/README.md

prediction_tools.yaml defines the configured commands, mismatch limits, PAM settings, and runtime options summarized for the public data release. The manifest records the standardized contract files consumed by the public benchmark runner. The Zenodo data notes list the larger deposited files that are not tracked in GitHub.