CentralGauge AL Code Benchmark for LLMs

An open-source benchmark for evaluating LLM performance on AL code generation for Microsoft Dynamics 365 Business Central. It comprises 56 tasks across three difficulty tiers, each verified by real compilation and test execution.


Stack: TypeScript · Deno
Status: Active
Released: 2025
License: MIT

Overview

CentralGauge measures how well different LLMs can generate AL code for Business Central. Built with TypeScript on Deno, it runs 56 tasks across three difficulty tiers (Easy, Medium, Hard) against Docker-containerized BC environments.

Generated code is compiled in a real BC container and tested with actual test codeunits, with no syntactic approximations. Scoring is out of 100 points: 50 for compilation, 30 for passing tests, and 20 for code patterns (10 for required patterns, 10 for forbidden patterns). The pass threshold is 70 points. Models get a second attempt to fix compilation errors, at a 10-point penalty.
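The scoring rules above can be sketched in a few lines of TypeScript. This is an illustration only, not CentralGauge's actual implementation: the `AttemptResult` shape, the field names, and `scoreAttempt` are hypothetical. It also assumes that test points are all-or-nothing, that the forbidden-pattern points are earned by avoiding those patterns, that the retry penalty is subtracted from the total, and that a score of exactly 70 counts as a pass.

```typescript
// Hypothetical outcome of one generation attempt; field names are illustrative.
interface AttemptResult {
  compiled: boolean;                // code compiled in the BC container
  testsPassed: boolean;             // test codeunits passed
  requiredPatternsFound: boolean;   // all required code patterns present
  forbiddenPatternsAbsent: boolean; // no forbidden code patterns present
  attempt: 1 | 2;                   // 2 = retry after compilation errors
}

const PASS_THRESHOLD = 70;
const RETRY_PENALTY = 10;

function scoreAttempt(r: AttemptResult): { score: number; passed: boolean } {
  let score = 0;
  if (r.compiled) score += 50;
  // Tests can only run against code that compiled.
  if (r.compiled && r.testsPassed) score += 30;
  if (r.requiredPatternsFound) score += 10;
  if (r.forbiddenPatternsAbsent) score += 10;
  // Assumption: the second attempt costs a flat 10 points.
  if (r.attempt === 2) score -= RETRY_PENALTY;
  score = Math.max(0, score);
  return { score, passed: score >= PASS_THRESHOLD };
}

// A clean first attempt versus a clean second attempt.
const first = scoreAttempt({
  compiled: true,
  testsPassed: true,
  requiredPatternsFound: true,
  forbiddenPatternsAbsent: true,
  attempt: 1,
});
const retry = scoreAttempt({
  compiled: true,
  testsPassed: true,
  requiredPatternsFound: true,
  forbiddenPatternsAbsent: true,
  attempt: 2,
});
console.log(first.score, retry.score);
```

Under these assumptions, a model that only succeeds on the retry can still pass, but with a lower ceiling than one that succeeds on the first attempt.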

CentralGauge supports OpenAI, Anthropic, Google Gemini, Azure OpenAI, OpenRouter (200+ models), and local Ollama instances. Each run tracks token usage and cost, and results are stored in SQLite for historical comparison. Reports are available in HTML and JSON.
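As a rough illustration of per-run token and cost tracking, the sketch below sums token usage across the calls in a run and converts it to a cost. The `TokenUsage` and `Pricing` types, the function names, and the prices are all hypothetical and are not taken from CentralGauge or from any provider's price list.

```typescript
// Hypothetical token counts reported for one LLM call.
interface TokenUsage {
  promptTokens: number;
  completionTokens: number;
}

// Hypothetical pricing in USD per one million tokens.
interface Pricing {
  inputPerMillion: number;
  outputPerMillion: number;
}

// Cost of a single call under the given pricing.
function estimateCost(usage: TokenUsage, pricing: Pricing): number {
  return (
    usage.promptTokens * pricing.inputPerMillion +
    usage.completionTokens * pricing.outputPerMillion
  ) / 1_000_000;
}

// Aggregate usage and cost over every call made during one benchmark run.
function summarizeRun(calls: TokenUsage[], pricing: Pricing) {
  const total: TokenUsage = { promptTokens: 0, completionTokens: 0 };
  for (const call of calls) {
    total.promptTokens += call.promptTokens;
    total.completionTokens += call.completionTokens;
  }
  return { ...total, costUsd: estimateCost(total, pricing) };
}

// Example: a first attempt plus one retry, with made-up prices.
const examplePricing: Pricing = { inputPerMillion: 2, outputPerMillion: 8 };
const summary = summarizeRun(
  [
    { promptTokens: 1500, completionTokens: 500 },
    { promptTokens: 2500, completionTokens: 500 },
  ],
  examplePricing,
);
console.log(summary);
```

Keeping the aggregation separate from the pricing makes it straightforward to re-price stored runs if a provider's rates change.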

Highlights

56 tasks across Easy / Medium / Hard tiers

Real BC container compile + test execution

OpenAI, Anthropic, Gemini, Azure, OpenRouter, Ollama

SQLite history + HTML / JSON reports

70-point pass threshold, retry with 10-pt penalty

Per-run token usage + cost tracking
