AIMAC

AIMAC is a tool for evaluating the accessibility of web pages generated by various LLMs via the OpenRouter API.

  • Generates HTML using various LLMs via the OpenRouter API.
  • Captures screenshots of the rendered pages using Playwright (parallel by default).
  • Runs accessibility checks using Axe-core.
  • Scores and compares models on Serious and Critical axe-core violations, dampened so that no single rule dominates the score. See ARCHITECTURE.md for the scoring methodology.
  • Provides detailed reports on accessibility violations and compliance.
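
To make the idea of dampening concrete, here is a minimal sketch. It is not AIMAC's actual formula (that is documented in ARCHITECTURE.md): the logarithmic curve, the severity weights, and the function name are all assumptions chosen for illustration. The only parts taken from this README are that Serious and Critical violations count and that repeated hits of one rule should not dominate.

```python
import math
from collections import Counter

# Hypothetical severity weights; the real values live in ARCHITECTURE.md.
SEVERITY_WEIGHT = {"critical": 2.0, "serious": 1.0}


def dampened_score(violations):
    """Score a page from (rule_id, impact) pairs; lower is better.

    Each rule contributes weight * log2(1 + count), so the 50th repeat of
    one rule adds far less than the first occurrence did.
    """
    counts = Counter(
        (rule_id, impact)
        for rule_id, impact in violations
        if impact in SEVERITY_WEIGHT  # ignore minor/moderate findings
    )
    return sum(
        SEVERITY_WEIGHT[impact] * math.log2(1 + count)
        for (_rule_id, impact), count in counts.items()
    )
```

With this assumed curve, a page with 50 contrast failures scores about 5.7 rather than 50, so a second, different rule still moves the score noticeably.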

Live Results

View the latest benchmark results at aimac.ai.

Quick Start

Requires Python 3.12+ and uv.

# Clone the repository
git clone https://github.com/GAAD-Foundation/AIMAC.git
cd AIMAC

# Install dependencies (including dev tools for testing)
uv sync --all-extras

# Install Playwright browser (required for screenshots)
uv run playwright install chromium

# Verify installation
uv run aimac --help

Environment Setup

Create a .env file with your OpenRouter API key:

cp .env.example .env
# Edit .env and add your OPENROUTER_API_KEY

Usage

Run commands via uv run:

Initialize the Database

uv run aimac init
uv run aimac i                    # Short alias

Creates the database by applying SQL from data/schema/*.sql. The database location is configured via AIMAC_DATABASE_PATH (defaults to ./data/aimac.db).
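
For readers curious what "applying SQL from data/schema/*.sql" amounts to, a rough sketch follows. It assumes the database is SQLite (suggested by the .db default path, but not stated here) and that schema files are applied in filename order; the function name init_database is hypothetical and is not AIMAC's API.

```python
import os
import sqlite3
from pathlib import Path


def init_database(schema_dir="data/schema", db_path=None):
    """Apply every *.sql file in schema_dir, in filename order, to one SQLite file."""
    # Resolve the location as described above: env var first, then the default.
    db_path = Path(db_path or os.environ.get("AIMAC_DATABASE_PATH", "./data/aimac.db"))
    db_path.parent.mkdir(parents=True, exist_ok=True)

    conn = sqlite3.connect(db_path)
    try:
        # Assumption: filename order is the intended application order.
        for sql_file in sorted(Path(schema_dir).glob("*.sql")):
            conn.executescript(sql_file.read_text(encoding="utf-8"))
        conn.commit()
    finally:
        conn.close()
    return db_path
```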

Collect Models

uv run aimac collect
uv run aimac c                    # Short alias

# Force a fresh run (bypass all caches)
uv run aimac collect --refresh

# Test specific models only
uv run aimac collect --models anthropic/claude-sonnet-4,openai/gpt-4o

Results are cached automatically; pass --refresh to bypass every cache. See ARCHITECTURE.md for caching details.

The collect command fetches the top programming models from the OpenRouter API, then:

  1. Saves a snapshot to data/snapshots/models/YYYY-MM-DD_HH-MM.json for inspection
  2. Upserts models to the database (preserves any manual overrides)
  3. Executes pending requests asynchronously with retry logic
  4. Writes artifacts (HTML, JSON) to the output/ directory
  5. Creates a leaderboard and per-model summaries with rankings based on median accessibility score (lower is better). Ties are broken by: (1) mean score, (2) total violations, (3) cost. Models with identical values across all metrics receive the same rank.
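
The ranking rule in step 5 can be sketched as a sort on a four-part key. This is an illustration only: the ModelResult fields and function names are hypothetical, and the choice to skip ranks after a tie (1, 2, 2, 4) is an assumption, since the README only says that fully tied models share a rank.

```python
from dataclasses import dataclass
from statistics import mean, median


@dataclass
class ModelResult:
    model_id: str
    scores: list[float]      # one accessibility score per generated page
    total_violations: int
    cost: float


def sort_key(r):
    # Lower is better for every field, in the tie-break order listed above.
    return (median(r.scores), mean(r.scores), r.total_violations, r.cost)


def rank_models(results):
    """Return (rank, model_id) pairs; models tied on every metric share a rank."""
    ranked = []
    prev_key, prev_rank = None, 0
    for position, r in enumerate(sorted(results, key=sort_key), start=1):
        key = sort_key(r)
        # Reuse the previous rank only when all four metrics are identical.
        rank = prev_rank if key == prev_key else position
        ranked.append((rank, r.model_id))
        prev_key, prev_rank = key, rank
    return ranked
```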

Note: Requires OPENROUTER_API_KEY to be set in your .env file.
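
Step 3 above mentions asynchronous execution with retry logic. The sketch below shows one common way to do that with asyncio: exponential backoff with jitter behind a concurrency limit. The function names, attempt count, delays, and concurrency value are all assumptions for illustration, not AIMAC's actual settings.

```python
import asyncio
import random


async def with_retries(make_request, attempts=3, base_delay=0.5):
    """Await make_request(), retrying with exponential backoff and jitter."""
    for attempt in range(1, attempts + 1):
        try:
            return await make_request()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # base_delay, 2x, 4x, ... plus up to 10% jitter to avoid synchronized retries
            delay = base_delay * 2 ** (attempt - 1)
            await asyncio.sleep(delay * (1 + random.random() * 0.1))


async def run_all(request_factories, concurrency=4, base_delay=0.5):
    """Run every request with retries, at most `concurrency` at a time."""
    gate = asyncio.Semaphore(concurrency)

    async def guarded(factory):
        async with gate:
            return await with_retries(factory, base_delay=base_delay)

    # gather preserves input order, so results line up with request_factories.
    return await asyncio.gather(*(guarded(f) for f in request_factories))
```

Each entry in request_factories is a zero-argument coroutine function, so a failed call can be re-created from scratch on the next attempt.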

View Reports on the Command Line

After running aimac collect, view the results interactively:

# View leaderboard (all models ranked)
uv run aimac report
uv run aimac r                    # Short alias

# View specific model details
uv run aimac r model 1            # Top-ranked model
uv run aimac r model claude       # Fuzzy search by name
uv run aimac r model anthropic/claude-3.5-sonnet  # Exact model ID

# Compare models within a category
uv run aimac r category shopping  # Fuzzy search
uv run aimac r category 5         # By category number

Report Features

Accessibility-First Design

Reports are designed to work well with assistive technology:

  • TSV default - Tab-separated output works with screen readers, grep, cut, and Unix pipes
  • No ANSI colors - Plain text ensures compatibility with all terminals and assistive technology
  • No Unicode decorations - Avoids characters that may not render or announce correctly

Output Formats

Default output is TSV (tab-separated values) - optimized for screen readers and Unix tools:

# Default TSV output
uv run aimac r

# Pretty (padded columns for terminals)
uv run aimac r --format pretty

# JSON (for scripts, web)
uv run aimac r --format json
# Note: Leaderboard JSON includes reliability fields `stddev` (Consistency) and `p90` per model.

Set default format via environment variable:

# In .env file (set one of: tsv, pretty, json)
AIMAC_REPORT_FORMAT=pretty

# Or export in shell
export AIMAC_REPORT_FORMAT=json
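
A plausible way to think about how the flag and the environment variable interact is sketched below. The precedence shown (--format flag, then AIMAC_REPORT_FORMAT, then tsv) and the case-insensitive matching are assumptions, and resolve_format is a hypothetical helper rather than part of AIMAC.

```python
import os

VALID_FORMATS = ("tsv", "pretty", "json")


def resolve_format(cli_value=None, env=None):
    """Pick the report format: --format flag, then AIMAC_REPORT_FORMAT, then tsv."""
    env = os.environ if env is None else env
    choice = cli_value or env.get("AIMAC_REPORT_FORMAT") or "tsv"
    choice = choice.strip().lower()
    if choice not in VALID_FORMATS:
        raise ValueError(f"unknown report format {choice!r}; expected one of {VALID_FORMATS}")
    return choice
```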

Verbose Mode

Add the -v flag to see additional columns (company names, reliability metrics):

# Leaderboard with total violations, Consistency (StdDev), and P90
uv run aimac r -v

# Model view with Critical/Serious breakdown
uv run aimac r model 1 -v

# Category view with Critical/Serious breakdown
uv run aimac r category shopping -v

Unix Composability

TSV output works seamlessly with Unix tools:

# Sort by cost (4th column)
uv run aimac r | sort -t$'\t' -k4,4n

# Top 5 models
uv run aimac r | head -n 6  # +1 for header

# Filter by company (requires -v for Company column)
uv run aimac r -v | grep "Anthropic"

# Save to file
uv run aimac r --format json > leaderboard.json
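
The same TSV output is easy to consume from Python when a pipeline outgrows sort and grep. The sketch below parses it with the standard csv module; the sample rows and the column names (Rank, Model, Median, Cost) are made up for illustration, so check the header that aimac r actually prints.

```python
import csv
import io


def read_leaderboard(tsv_text):
    """Parse leaderboard TSV into a list of dicts keyed by the header row."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return list(reader)


# Hypothetical sample; real column names come from the header `aimac r` prints.
SAMPLE = (
    "Rank\tModel\tMedian\tCost\n"
    "1\tvendor/model-a\t12.5\t0.40\n"
    "2\tvendor/model-b\t18.0\t0.10\n"
)

rows = read_leaderboard(SAMPLE)
cheapest = min(rows, key=lambda row: float(row["Cost"]))
print(cheapest["Model"])  # prints "vendor/model-b"
```

To run it on live data, replace SAMPLE with sys.stdin.read() and pipe the report into the script.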

Navigation Hints

Each report includes contextual "Run next:" suggestions to help you explore:

# Leaderboard shows:
Run next: "aimac report model 1"

# Model view shows:
Run next: "aimac report model 2"

# Category view shows:
Run next: "aimac report model <top-performer-id>"

Examples

Basic workflow:

# 1. View leaderboard
uv run aimac r

# 2. Examine top model
uv run aimac r model 1

# 3. Compare models in a specific category
uv run aimac r category shopping

# 4. Get detailed severity breakdown
uv run aimac r category shopping -v

Export for sharing:

# Create the target directory first (the redirects below fail if it is missing)
mkdir -p reports

# Generate a pretty report for sharing in terminals
uv run aimac r --format pretty > reports/leaderboard.txt

# Export all data as JSON
uv run aimac r --format json > reports/leaderboard.json
uv run aimac r model 1 --format json > reports/top_model.json

Scripting with JSON:

# Extract top model ID
TOP_MODEL=$(uv run aimac r --format json | jq -r '.rows[0].model_id')

# Count models with score < 20
uv run aimac r --format json | jq '[.rows[] | select(.median < 20)] | length'
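
If jq is not available, the two queries above translate directly to Python. The sketch relies only on the fields the jq examples already use (rows, model_id, median); the sample payload and the summarize helper are hypothetical.

```python
import json


def summarize(leaderboard_json, threshold=20):
    """Return (top model id, number of models with median below threshold)."""
    rows = json.loads(leaderboard_json)["rows"]
    top_model = rows[0]["model_id"] if rows else None
    below = sum(1 for row in rows if row["median"] < threshold)
    return top_model, below


# Hypothetical payload shaped like the fields the jq examples read.
SAMPLE = json.dumps({
    "rows": [
        {"model_id": "vendor/model-a", "median": 12.5},
        {"model_id": "vendor/model-b", "median": 18.0},
        {"model_id": "vendor/model-c", "median": 31.0},
    ]
})

print(summarize(SAMPLE))  # prints ('vendor/model-a', 2)
```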

Help Commands

# Main help (shows all commands)
uv run aimac -h

# CLI reporting flags and options
uv run aimac report -h

# CLI reporting examples and workflows
uv run aimac r help

Screenshots

Screenshots run in parallel by default using Playwright, with the worker count auto-tuned to your CPU. Progress is printed as "Screenshots progress: N/Total" during execution.

Each screenshot is written to output/ next to the HTML file it was rendered from:

  • {request_id}.html
  • {request_id}.png

See ARCHITECTURE.md for worker formulas, memory model, and timeout configuration.
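
As a rough picture of what parallel screenshotting involves, the sketch below renders HTML files to PNGs with Playwright's sync API from a thread pool and prints progress in the format described above. It is a simplified illustration, not AIMAC's implementation: the worker rule in default_workers is a guess (the real formula is in ARCHITECTURE.md), the function names are hypothetical, and launching one browser per page is slower than reusing browsers.

```python
import os
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path


def render_png(html_path, png_path):
    """Render one HTML file to a full-page PNG with headless Chromium."""
    # Imported lazily so the rest of the sketch works without Playwright installed.
    from playwright.sync_api import sync_playwright

    # Each call starts its own Playwright instance because the sync API is not thread-safe.
    with sync_playwright() as p:
        browser = p.chromium.launch()
        try:
            page = browser.new_page()
            page.goto(Path(html_path).resolve().as_uri())
            page.screenshot(path=str(png_path), full_page=True)
        finally:
            browser.close()


def default_workers():
    # Hypothetical rule of thumb; AIMAC's actual formula is in ARCHITECTURE.md.
    return max(1, (os.cpu_count() or 2) - 1)


def screenshot_all(html_paths, render=render_png, workers=None):
    """Render every page in parallel and print N/Total progress."""
    html_paths = [Path(p) for p in html_paths]
    total = len(html_paths)
    done = []
    with ThreadPoolExecutor(max_workers=workers or default_workers()) as pool:
        futures = {pool.submit(render, p, p.with_suffix(".png")): p for p in html_paths}
        for n, future in enumerate(as_completed(futures), start=1):
            future.result()  # re-raise any rendering error
            done.append(futures[future].with_suffix(".png"))
            print(f"Screenshots progress: {n}/{total}")
    return done
```

The render parameter exists so the scheduling logic can be exercised with a stand-in function in tests, without starting a browser.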

Testing

uv run -m pytest -q

Tests are hermetic (no real API calls). See ARCHITECTURE.md for test organization and isolation details.

Troubleshooting

playwright install chromium command not found

Playwright is installed into the project environment rather than onto your PATH, so run it via uv: uv run playwright install chromium

Citation

If you use AIMAC in your research, see CITATION.md for citation formats.

License

MIT License - see LICENSE for details.