Two compilers, one shape
Compiler-driven code quality (Jun 2023 — Present)
Two parser-driven Lex/Yacc tools that turned legacy-code problems into checked migration and build-time feedback: a C/C++ prototype generator and a WebFOCUS auto-converter for moving reporting logic into Python.
- 80%: defects caught at build
- 48 hrs → 40 min: per-script convert time
- 500+: scripts migrated
- 2 langs: C/C++ + WebFOCUS
The problem
Two of the most expensive engineering problems on a long-lived codebase don’t have the same shape on the surface, but they have the same shape underneath.
The first is a quality problem. Heuristic linters miss context-specific
bugs. A C codebase that’s been edited by hundreds of engineers for fifteen
years carries patterns that are correct in some places and wrong in others —
the same function name re-defined with a slightly different signature in two
files; a function called with arguments that would work if the headers had
been updated when the prototype was; a struct field whose type was migrated
from int to int64_t everywhere except in one helper. A linter looking at
one file at a time will miss most of this. The information is genuinely
out-of-band.
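A concrete sketch of that last pattern (file and function names are hypothetical, echoing the generated-prototype example later on this page):

    /* util.h (sketch): prototype never updated after the int -> int64_t migration */
    int customer_balance(const char *customer_id, int *out);

    /* billing.c (sketch): the definition was migrated years ago */
    #include <stdint.h>
    int customer_balance(const char *customer_id, int64_t *out)
    {
        (void)customer_id;
        *out = 0;   /* real lookup elided */
        return 0;
    }

    /* report.c (sketch): compiled against the stale header, it passes an int*.
       It compiles cleanly, links cleanly (the C linker checks names, not types),
       and silently corrupts balances at runtime. */

Each file is internally consistent, so a file-at-a-time linter flags nothing; only a tool that sees the definition and the caller together can object.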
The second is a migration problem. Hand-converting a legacy DSL doesn’t scale. At Amdocs, an in-house Python-on-Pandas reporting library called XFocus replaces TIBCO WebFOCUS for telecom business reporting. The catch: the carrier already has hundreds of WebFOCUS scripts written over the years. A careful hand-conversion takes around 48 hours per script. Multiply by 500+ scripts and you have a project with clear strategic value that still risks never finishing.
Different surface, same underneath. Both problems are about discovering machine-checkable structure in code that wasn’t designed to be machine-checked. The right tool for that is a parser, not a regex.
What I built
Two Lex/Yacc compilers, built a year apart, applying the same idea to two different languages.
- C/C++ Prototype Generator (Jun-Sep 2023). A build-time compiler that scans every C source in the repo, parses out function definitions, generates prototypes for all of them, and integrates them via preprocessor #if guards across 500+ C files. The compiler is wrapped in a build-time pass; the prototypes become authoritative at compile time. Result: bad merges, incorrect function usage, and type mismatches stop being deploy-time discoveries and start being build-time errors.
- WebFOCUS auto-converter (Jul 2024 — Present, part of XFocus). A Lex/Yacc compiler that parses legacy WebFOCUS scripts and emits the corresponding Python / Pandas / NumPy code in the new in-house library.
Two applications of one shape: when the language surface is yours to control, parser-driven tools beat heuristic ones every time.
Why Lex/Yacc
The choice of Lex/Yacc surprises some people in 2026. It shouldn’t.
Lex/Yacc has been the durable choice for ad-hoc compilers since the 1970s for very practical reasons:
- Generated, not handwritten. A grammar in a .y file plus token rules in a .l file is dramatically smaller than a recursive-descent parser would be. The grammar is the spec.
- Zero runtime dependencies. The generated C is one file you ship. Trivial to embed in a build pipeline (see the sketch after this list).
- Fast enough that no one ever notices. A WebFOCUS script of a few thousand lines parses in milliseconds. A whole repo of C parses faster than the linker links.
- Boring. Lex and Yacc don’t get rewritten every two years. They are a stable surface to depend on.
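What "trivial to embed" means in practice is a two-command generate step; a sketch (file names are hypothetical, exact flags vary by build system):

    # generate the parser and lexer, then compile both into one binary
    yacc -d proto_gen.y       # emits y.tab.c and y.tab.h
    lex proto_gen.l           # emits lex.yy.c (includes y.tab.h for token ids)
    cc -O2 -o proto_gen y.tab.c lex.yy.c   # add -ll if your lex needs its support library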
Newer parser frameworks (ANTLR, tree-sitter) are excellent for different jobs — specifically, tools that need rich error recovery or live editing support. Neither the prototype generator nor the converter needed that. They needed a grammar, a deterministic action set, and a CI-runnable binary. Lex/Yacc fits the slot.
Story 1: the C/C++ prototype generator
The setup: a long-lived C/C++ codebase with hundreds of files. Function declarations were sometimes correct, sometimes drifted out of date. Nobody had time to audit them. The compiler saw whatever the headers said; whatever the headers said wasn’t always true.
The compiler I wrote does three things:
- Lexes every C source file — strips comments and string literals, tokenizes into the canonical C token set (a sketch of the rules follows this list).
- Parses function definitions — the only construct it cares about; it walks past everything else. The grammar is a few dozen rules, mostly about handling the legitimately hairy C declarator syntax (int (*foo(int x))[3] and friends).
- Emits a prototype block for each file, gated behind a preprocessor guard so it doesn’t conflict with hand-written prototypes in headers.
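The token side, trimmed to a sketch (illustrative rules, not the production .l file; keywords are resolved through a lookup table in the real tool):

    %{
    #include "y.tab.h"   /* token ids from yacc -d */
    %}

    %%
    "/*"([^*]|"*"+[^*/])*"*"+"/"    ;   /* strip comments */
    \"(\\.|[^"\\])*\"               ;   /* strip string literals */
    [A-Za-z_][A-Za-z0-9_]*          { return IDENTIFIER; }
    [0-9]+                          { return CONSTANT; }
    [ \t\r\n]+                      ;   /* skip whitespace */
    .                               { return yytext[0]; }
    %%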
A simplified slice of the Yacc grammar:
    function_definition
        : declaration_specifiers declarator declaration_list_opt compound_statement
            { emit_prototype($1, $2); }
        ;

    declarator
        : direct_declarator
        | pointer direct_declarator
        ;

    direct_declarator
        : IDENTIFIER
        | direct_declarator '(' parameter_type_list ')'
        | direct_declarator '(' identifier_list_opt ')'
        ;

And the emitted prototype block (auto-generated, written into __protos.h):
    /* AUTO-GENERATED - do not edit */
    #if defined(BUILD_WITH_PROTOTYPE_CHECK)
    extern int customer_balance(const char *customer_id, int64_t *out);
    extern void provision_line(const provision_req *req, provision_resp *resp);
    extern int rate_call_record(const cdr_t *cdr, money_t *out);
    /* ...several hundred more... */
    #endif

The build flips on BUILD_WITH_PROTOTYPE_CHECK and runs every translation unit against the generated prototypes. Anywhere the actual call doesn’t match the actual definition, the compiler stops the build and points at the line.
Plus around ten shell scripts that orchestrate the pass, integrate with the build system, and rewrite ~500 C files to include the right guard. The shell-glue piece is unglamorous, but it’s where most of the integration cost lived.
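The per-file change those scripts make is deliberately small; roughly this, near the top of each C file (a sketch):

    #ifdef BUILD_WITH_PROTOTYPE_CHECK
    #include "__protos.h"   /* generated prototypes, authoritative at compile time */
    #endif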
Story 2: the WebFOCUS auto-converter
A year later, a different problem with the same shape.
WebFOCUS — TIBCO’s reporting language (acquired from Information Builders in January 2021, now under Cloud Software Group) — has been generating telecom business reports for some carriers for over a decade. The team had built XFocus, an in-house Pandas / NumPy library that does the same job in Python with no vendor-specific runtime dependency. The blocker was migrating the existing 500+ scripts without sinking years of engineering time into hand conversion.
The numbers everyone agreed on: hand-converting a script took roughly 48 hours of careful work — read the WebFOCUS code, understand what it produces, write the Pandas equivalent, validate the output matches byte-for-byte, edge-case it. Skilled engineering time, slow.
The compiler I wrote does what compilers do: read the source language, emit the target language. The grammar covers the WebFOCUS verbs the team’s scripts actually use (a meaningful subset of the full language, but enough). Each verb has a deterministic Pandas mapping; the converter walks the parse tree and emits.
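A sketch of one such mapping inside the grammar (hypothetical action code; emit_where and snake_case are illustrative helpers, not the production names):

    where_clause
        : WHERE IDENTIFIER EQ STRING
            {
              /* WHERE BILL_STATUS EQ 'PAID'  ->  .where(df["bill_status"] == "PAID") */
              emit_where(snake_case($2), $4);
            }
        ;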
A representative WebFOCUS fragment:
    TABLE FILE BILLING
    PRINT CUSTOMER_ID AS 'Customer' SUM.AMOUNT_CHARGED AS 'Total Charged'
    BY MONTH
    WHERE BILL_STATUS EQ 'PAID'
    ON TABLE PCHOLD FORMAT XLSX
    END

The corresponding XFocus output:
    df = (
        billing
        .where(billing["bill_status"] == "PAID")
        .group_by("month")
        .agg(
            customer=("customer_id", "first"),
            total_charged=("amount_charged", "sum"),
        )
        .rename(columns={"customer": "Customer", "total_charged": "Total Charged"})
    )
    df.to_excel("output.xlsx")

The grammar handles the cases real scripts actually contain. Edge cases that don’t parse get reported with a line number and a one-line summary; an engineer can hand-fix those (a few percent of scripts) without re-doing the bulk.
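The shape of that report, for a script that falls outside the subset (illustrative; the file name and wording are hypothetical):

    monthly_billing.fex:142: unsupported verb 'MATCH FILE' - flagged for manual conversion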
What both have in common
Two stories, one shape:
- The language surface was stable enough to write a grammar against.
- The grammar didn’t have to be the full language. Both compilers handle the subset that the codebase actually uses. Anything outside the subset gets reported as an error, which is a valid response — those cases need a human eye anyway.
- The output was deterministic. Same input, same output, every time. That’s what makes the tool trustworthy.
- The integration glue was where most of the time went. Writing the parser is the fun part; threading it into a build system, getting it to handle every file in a repo, dealing with edge cases — that’s the actual project.
Lessons
When you control the language surface, build a parser before you build a linter. Heuristic tools are a lower bound on what’s catchable; parsers are the upper bound. The cost difference is smaller than it looks once the grammar is written.
The grammar doesn’t have to be complete to be useful. Cover the subset that actually appears in the corpus. The cases that don’t parse are exactly the cases that need human review anyway. Aiming for a complete grammar is how parser projects die.
Lex/Yacc is unfashionable and excellent. Pick boring tools for tasks that benefit from boring tools. Both compilers are still in production and have needed essentially zero maintenance since they shipped.
Code generation is refactoring at scale. Once you have a parser plus an emitter, every future migration of the same shape becomes a small grammar change instead of a fresh project. Both tools are still being extended; neither has needed a rewrite. The same shape powers the Tuxedo → gRPC transpiler — different language, same architecture.