Tool Setup and Run Command Reference

Overview

The public rerun starts from stored prediction contracts. These files are produced upstream from tool-specific outputs and then evaluated under a common benchmark schema. This page records the tool roster, the configured commands, and the location of the upstream setup notes.

Manifest and registry

Code

from pathlib import Path

import pandas as pd
import yaml

manifest_path = Path('../config/tool_output_manifest.example.yml')
registry_path = Path('../config/prediction_tools.yaml')

manifest = yaml.safe_load(manifest_path.read_text(encoding='utf-8'))
registry = yaml.safe_load(registry_path.read_text(encoding='utf-8'))

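# One row per public contract file listed in the manifest.
manifest_df = pd.DataFrame(manifest['tool_outputs'])
# One row per registry entry; the registry key becomes the tool_slug column.
# reset_index(names=...) requires pandas >= 1.5.
registry_df = pd.DataFrame.from_dict(registry['tools'], orient='index').reset_index(names='tool_slug')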


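def summarize_command(value):
    """Render a configured command as a single display string."""
    # Empty registry values mean the tool has no command of this kind.
    if value in (None, '', []):
        return 'not used in the public rerun'
    # Commands stored as argument lists are joined with spaces.
    if isinstance(value, list):
        return ' '.join(str(part) for part in value)
    return str(value)


def classify_role(value):
    """Map a tool_role key to its display label; unknown keys pass through."""
    mapping = {
        'search_engine': 'native genome search',
        'pair_scorer': 'pair scorer',
        'web_service_search': 'web service search',
    }
    return mapping.get(str(value), str(value))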


tool_table = manifest_df.merge(registry_df, on='tool_slug', how='left')
tool_table['benchmark_role'] = tool_table['tool_role'].map(classify_role)
tool_table['local_command_summary'] = tool_table['local_command'].map(summarize_command)
tool_table['docker_command_summary'] = tool_table['docker_command'].map(summarize_command)

tool_table[
    [
        'tool',
        'tool_slug',
        'mode',
        'benchmark_role',
        'relative_path',
        'local_command_summary',
        'docker_command_summary',
    ]
]

	tool	tool_slug	mode	benchmark_role	relative_path	local_command_summary	docker_command_summary
0	Cas-OFFinder	cas_offinder	native_search	native genome search	data/zenodo/standard_tool_predictions/predicti...	cas-offinder	snugel/cas-offinder:latest cas-offinder
1	CRISPRitz_mismatch	crispritz_mismatch	native_search	native genome search	data/zenodo/standard_tool_predictions/predicti...	crispritz.py	pinellolab/crispritz:latest crispritz.py
2	CRISPRitz_cfd	crispritz_cfd	native_search	native genome search	data/zenodo/standard_tool_predictions/predicti...	crispritz.py	pinellolab/crispritz:latest crispritz.py
3	FlashFry	flashfry	native_search	native genome search	data/zenodo/standard_tool_predictions/predicti...	flashfry	eclipse-temurin:8-jre java -Xmx8g -jar FlashFr...
4	GuideScan2	guidescan2	native_search	native genome search	data/zenodo/standard_tool_predictions/predicti...	guidescan	nan
5	CRISPROFF	crisproff	pair_scorer	pair scorer	data/zenodo/standard_tool_predictions/predicti...	run_crisproff.py	nan
6	CCTop	cctop	native_search	web service search	data/zenodo/standard_tool_predictions/predicti...	cctop_submit.py	nan
7	CRISPOR	crispor	native_search	native genome search	data/zenodo/standard_tool_predictions/predicti...	nan	maximilianh/crispor:latest
8	MOFF	moff	pair_scorer	pair scorer	data/zenodo/standard_tool_predictions/predicti...	MOFF score	nan
9	CRISOT	crisot	pair_scorer	pair scorer	data/zenodo/standard_tool_predictions/predicti...	CRISOT.py scores	nan

The manifest defines the public contract files required by the rerun. The tool registry defines the upstream commands and runtime assumptions used to create those contract files.
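
Because the two files are joined on tool_slug, a manifest row without a registry entry would pass through the left merge with empty registry columns. The sketch below shows one way to detect that case. It uses small hypothetical stand-ins for the parsed YAML (the slug example_unregistered and the exact keys per entry are assumptions, not taken from the real files).

```python
import pandas as pd

# Hypothetical stand-ins for the two parsed YAML files; the real files
# may carry more keys than shown here.
manifest = {
    'tool_outputs': [
        {'tool': 'Cas-OFFinder', 'tool_slug': 'cas_offinder', 'mode': 'native_search'},
        {'tool': 'MOFF', 'tool_slug': 'moff', 'mode': 'pair_scorer'},
        {'tool': 'Example', 'tool_slug': 'example_unregistered', 'mode': 'native_search'},
    ]
}
registry = {
    'tools': {
        'cas_offinder': {'pam': 'NGG', 'max_mismatches': 6},
        'moff': {'local_command': 'MOFF score'},
    }
}

manifest_df = pd.DataFrame(manifest['tool_outputs'])
registry_df = (
    pd.DataFrame.from_dict(registry['tools'], orient='index')
    .rename_axis('tool_slug')
    .reset_index()
)

# indicator=True adds a '_merge' column reporting whether each manifest
# row found a matching registry entry.
checked = manifest_df.merge(registry_df, on='tool_slug', how='left', indicator=True)
unregistered = checked.loc[checked['_merge'] == 'left_only', 'tool_slug'].tolist()
print(unregistered)
```

An empty list would indicate that every contract file in the manifest has a corresponding registry entry.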

Contract files

Each row in the manifest corresponds to one public contract file.

tool is the manuscript-facing tool name.
tool_slug is the configuration key shared by the manifest and the tool registry.
mode distinguishes native search tools (native_search) from pair scorers (pair_scorer).
relative_path is the expected location of the standardized contract file.

The contract layer is the public handoff between upstream tool execution and the benchmark rerun.
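
Since the rerun depends on every contract file being present, a quick existence check before running the benchmark can be useful. The following is a minimal sketch, assuming relative_path is resolved against the repository root. The helper name missing_contract_files and the demonstration paths are hypothetical.

```python
import tempfile
from pathlib import Path


def missing_contract_files(tool_outputs, repo_root):
    """Return the tool_slug of every manifest row whose contract file is absent."""
    root = Path(repo_root)
    return [
        row['tool_slug']
        for row in tool_outputs
        if not (root / row['relative_path']).is_file()
    ]


# Demonstration with hypothetical paths inside a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    present = Path(tmp) / 'predictions' / 'tool_a.tsv'
    present.parent.mkdir(parents=True)
    present.write_text('placeholder\n', encoding='utf-8')
    rows = [
        {'tool_slug': 'tool_a', 'relative_path': 'predictions/tool_a.tsv'},
        {'tool_slug': 'tool_b', 'relative_path': 'predictions/tool_b.tsv'},
    ]
    missing = missing_contract_files(rows, tmp)

print(missing)
```

With the real manifest, the first argument would be manifest['tool_outputs'] and the second the directory against which the relative paths are meant to resolve.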

Per-tool summary

Code

provenance_view = tool_table[
    [
        'tool',
        'mode',
        'benchmark_role',
        'pam',
        'max_mismatches',
        'local_command_summary',
        'docker_command_summary',
    ]
].rename(
    columns={
        'tool': 'Tool',
        'mode': 'Public rerun mode',
        'benchmark_role': 'Benchmark role',
        'pam': 'Configured PAM',
        'max_mismatches': 'Configured mismatch cap',
        'local_command_summary': 'Configured local command',
        'docker_command_summary': 'Configured Docker command',
    }
)
provenance_view

	Tool	Public rerun mode	Benchmark role	Configured PAM	Configured mismatch cap	Configured local command	Configured Docker command
0	Cas-OFFinder	native_search	native genome search	NGG	6.0	cas-offinder	snugel/cas-offinder:latest cas-offinder
1	CRISPRitz_mismatch	native_search	native genome search	NGG	6.0	crispritz.py	pinellolab/crispritz:latest crispritz.py
2	CRISPRitz_cfd	native_search	native genome search	NGG	6.0	crispritz.py	pinellolab/crispritz:latest crispritz.py
3	FlashFry	native_search	native genome search	NGG	6.0	flashfry	eclipse-temurin:8-jre java -Xmx8g -jar FlashFr...
4	GuideScan2	native_search	native genome search	NGG	6.0	guidescan	nan
5	CRISPROFF	pair_scorer	pair scorer	NaN	NaN	run_crisproff.py	nan
6	CCTop	native_search	web service search	NGG	5.0	cctop_submit.py	nan
7	CRISPOR	native_search	native genome search	NGG	6.0	nan	maximilianh/crispor:latest
8	MOFF	pair_scorer	pair scorer	NaN	NaN	MOFF score	nan
9	CRISOT	pair_scorer	pair scorer	NaN	NaN	CRISOT.py scores	nan

The two command columns summarize how each tool was invoked upstream. They are not full shell transcripts, but they identify the program, wrapper, or container used to generate the normalized contract file. An entry of nan means that no command of that kind is configured for the tool.
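
The nan entries most likely arise because a registry entry that lacks a command key is filled with a float NaN when the table is built, and the membership test in summarize_command does not match NaN. The variant below is a sketch of how such values could be reported with the same placeholder text as other empty values. The name summarize_command_nan_aware is hypothetical, and the list form of the Docker command is an assumption about how the registry stores it.

```python
import math


def summarize_command_nan_aware(value):
    """Like summarize_command, but also treats a float NaN as 'no command'."""
    # Missing keys surface as float NaN once the registry is a DataFrame.
    if isinstance(value, float) and math.isnan(value):
        return 'not used in the public rerun'
    if value is None or value == '' or value == []:
        return 'not used in the public rerun'
    # Commands stored as argument lists are joined with spaces.
    if isinstance(value, (list, tuple)):
        return ' '.join(str(part) for part in value)
    return str(value)


missing_label = summarize_command_nan_aware(float('nan'))
joined = summarize_command_nan_aware(['snugel/cas-offinder:latest', 'cas-offinder'])
plain = summarize_command_nan_aware('cas-offinder')
print(missing_label)
print(joined)
print(plain)
```

Applying this variant with Series.map in place of summarize_command would replace the nan strings in both command columns with the placeholder text.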

Upstream provenance

The upstream execution and normalization logic is documented in:

config/prediction_tools.yaml
config/tool_output_manifest.example.yml
data/zenodo/README.md

prediction_tools.yaml defines the configured commands, mismatch limits, PAM settings, and runtime options summarized for the public data release. The manifest (tool_output_manifest.example.yml) records the standardized contract files consumed by the public benchmark runner. The Zenodo data notes (data/zenodo/README.md) list the larger deposited files that are not tracked in the GitHub repository.