Two compilers, one shape
Compiler-driven code quality (Jun 2023 — Present)
Two parser-driven Lex/Yacc tools that turned legacy-code problems into checked migration and build-time feedback: a C/C++ prototype generator and a WebFOCUS auto-converter for moving reporting logic into Python.
- 80%: defects caught at build
- 48 hrs → 40 min: per-script convert time
- 500+: scripts migrated
- 2 langs: C/C++ + WebFOCUS
The problem
Two of the most expensive engineering problems on a long-lived codebase don’t have the same shape on the surface, but they have the same shape underneath.
The first is a quality problem. Heuristic linters miss context-specific
bugs. A C codebase that’s been edited by hundreds of engineers for fifteen
years carries patterns that are correct in some places and wrong in others —
the same function name re-defined with a slightly different signature in two
files; a function called with arguments that would work if the headers had
been updated when the prototype was; a struct field whose type was migrated
from int to int64_t everywhere except in one helper. A linter looking at
one file at a time will miss most of this. The information is genuinely
out-of-band.
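A concrete sketch of that last pattern (file and function names are hypothetical, echoing the generated-prototype example later on this page):

    /* util.h (sketch): prototype never updated after the int -> int64_t migration */
    int customer_balance(const char *customer_id, int *out);

    /* billing.c (sketch): the definition was migrated years ago */
    #include <stdint.h>
    int customer_balance(const char *customer_id, int64_t *out)
    {
        (void)customer_id;
        *out = 0;   /* real lookup elided */
        return 0;
    }

    /* report.c (sketch): compiled against the stale header, it passes an int*.
       It compiles cleanly, links cleanly (the C linker checks names, not types),
       and silently corrupts balances at runtime. */

Each file is internally consistent, so a file-at-a-time linter flags nothing; only a tool that sees the definition and the caller together can object.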
The second is a migration problem. Hand-converting a legacy DSL doesn’t scale. At Amdocs, an in-house Python-on-Pandas reporting library called XFocus replaces TIBCO WebFOCUS for telecom business reporting. The catch: the carrier already has hundreds of WebFOCUS scripts written over the years. A careful hand-conversion takes around 48 hours per script. Multiply by 500+ scripts and you have a project with clear strategic value that still risks never finishing.
Different surface, same underneath. Both problems are about discovering machine-checkable structure in code that wasn’t designed to be machine-checked. The right tool for that is a parser, not a regex.
What I built
Two Lex/Yacc compilers, built a year apart, applying the same idea to two different languages.
- C/C++ Prototype Generator (Jun-Sep 2023). A build-time compiler that scans every C source in the repo, parses out function definitions, generates prototypes for all of them, and integrates them via preprocessor #if guards across 500+ C files. The compiler is wrapped in a build-time pass; the prototypes become authoritative at compile time. Result: bad merges, incorrect function usage, and type mismatches stop being deploy-time discoveries and start being build-time errors.
- WebFOCUS auto-converter (Jul 2024 — Present, part of XFocus). A Lex/Yacc compiler that parses legacy WebFOCUS scripts and emits the corresponding Python / Pandas / NumPy code in the new in-house library.
Two applications of one shape: when the language surface is yours to control, parser-driven tools beat heuristic ones every time.
Why Lex/Yacc
The choice of Lex/Yacc surprises some people in 2026. It shouldn’t.
Lex/Yacc has been the durable choice for ad-hoc compilers since the 1970s for very practical reasons:
- Generated, not handwritten. A grammar in a .y file plus token rules in a .l file is dramatically smaller than a recursive-descent parser would be. The grammar is the spec.
- Zero runtime dependencies. The generated C is one file you ship. Trivial to embed in a build pipeline (see the sketch after this list).
- Fast enough that no one ever notices. A WebFOCUS script of a few thousand lines parses in milliseconds. A whole repo of C parses faster than the linker links.
- Boring. Lex and Yacc don’t get rewritten every two years. They are a stable surface to depend on.
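What "trivial to embed" means in practice is a two-command generate step; a sketch (file names are hypothetical, exact flags vary by build system):

    # generate the parser and lexer, then compile both into one binary
    yacc -d proto_gen.y       # emits y.tab.c and y.tab.h
    lex proto_gen.l           # emits lex.yy.c (includes y.tab.h for token ids)
    cc -O2 -o proto_gen y.tab.c lex.yy.c   # add -ll if your lex needs its support library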
Newer parser frameworks (ANTLR, tree-sitter) are excellent for different jobs — specifically, tools that need rich error recovery or live editing support. Neither the prototype generator nor the converter needed that. They needed a grammar, a deterministic action set, and a CI-runnable binary. Lex/Yacc fits the slot.
Story 1: the C/C++ prototype generator
The setup: a long-lived C/C++ codebase with hundreds of files. Function declarations were sometimes correct, sometimes drifted out of date. Nobody had time to audit them. The compiler saw whatever the headers said; whatever the headers said wasn’t always true.
The compiler I wrote does three things:
- Lexes every C source file — strips comments and string literals, tokenizes into the canonical C token set (a sketch of the rules follows this list).
- Parses function definitions — the only construct it cares about; it walks past everything else. The grammar is a few dozen rules, mostly about handling the legitimately hairy C declarator syntax (int (*foo(int x))[3] and friends).
- Emits a prototype block for each file, gated behind a preprocessor guard so it doesn’t conflict with hand-written prototypes in headers.
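The token side, trimmed to a sketch (illustrative rules, not the production .l file; keywords are resolved through a lookup table in the real tool):

    %{
    #include "y.tab.h"   /* token ids from yacc -d */
    %}

    %%
    "/*"([^*]|"*"+[^*/])*"*"+"/"    ;   /* strip comments */
    \"(\\.|[^"\\])*\"               ;   /* strip string literals */
    [A-Za-z_][A-Za-z0-9_]*          { return IDENTIFIER; }
    [0-9]+                          { return CONSTANT; }
    [ \t\r\n]+                      ;   /* skip whitespace */
    .                               { return yytext[0]; }
    %%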
A simplified slice of the Yacc grammar:
    function_definition
        : declaration_specifiers declarator declaration_list_opt compound_statement
            { emit_prototype($1, $2); }
        ;

    declarator
        : direct_declarator
        | pointer direct_declarator
        ;

    direct_declarator
        : IDENTIFIER
        | direct_declarator '(' parameter_type_list ')'
        | direct_declarator '(' identifier_list_opt ')'
        ;

And the emitted prototype block (auto-generated, written into __protos.h):
    /* AUTO-GENERATED - do not edit */
    #if defined(BUILD_WITH_PROTOTYPE_CHECK)
    extern int customer_balance(const char *customer_id, int64_t *out);
    extern void provision_line(const provision_req *req, provision_resp *resp);
    extern int rate_call_record(const cdr_t *cdr, money_t *out);
    /* ...several hundred more... */
    #endif

The build flips on BUILD_WITH_PROTOTYPE_CHECK and runs every translation unit against the generated prototypes. Anywhere the actual call doesn’t match the actual definition, the compiler stops the build and points at the line.
Plus around ten shell scripts that orchestrate the pass, integrate with the build system, and rewrite ~500 C files to include the right guard. The shell-glue piece is unglamorous, but it’s where most of the integration cost lived.
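The per-file change those scripts make is deliberately small; roughly this, near the top of each C file (a sketch):

    #ifdef BUILD_WITH_PROTOTYPE_CHECK
    #include "__protos.h"   /* generated prototypes, authoritative at compile time */
    #endif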
Story 2: the WebFOCUS auto-converter
A year later, a different problem with the same shape.
WebFOCUS — TIBCO’s reporting language (acquired from Information Builders in January 2021, now under Cloud Software Group) — has been generating telecom business reports for some carriers for over a decade. The team had built XFocus, an in-house Pandas / NumPy library that does the same job in Python with no vendor-specific runtime dependency. The blocker was migrating the existing 500+ scripts without sinking years of engineering time into hand conversion.
The numbers everyone agreed on: hand-converting a script took roughly 48 hours of careful work — read the WebFOCUS code, understand what it produces, write the Pandas equivalent, validate the output matches byte-for-byte, edge-case it. Skilled engineering time, slow.
The compiler I wrote does what compilers do: read the source language, emit the target language. The grammar covers the WebFOCUS verbs the team’s scripts actually use (a meaningful subset of the full language, but enough). Each verb has a deterministic Pandas mapping; the converter walks the parse tree and emits.
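A sketch of one such mapping inside the grammar (hypothetical action code; emit_where and snake_case are illustrative helpers, not the production names):

    where_clause
        : WHERE IDENTIFIER EQ STRING
            {
              /* WHERE BILL_STATUS EQ 'PAID'  ->  .where(df["bill_status"] == "PAID") */
              emit_where(snake_case($2), $4);
            }
        ;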
A representative WebFOCUS fragment:
    TABLE FILE BILLING
    PRINT CUSTOMER_ID AS 'Customer' SUM.AMOUNT_CHARGED AS 'Total Charged'
    BY MONTH
    WHERE BILL_STATUS EQ 'PAID'
    ON TABLE PCHOLD FORMAT XLSX
    END

The corresponding XFocus output:
    df = (
        billing
        .where(billing["bill_status"] == "PAID")
        .group_by("month")
        .agg(
            customer=("customer_id", "first"),
            total_charged=("amount_charged", "sum"),
        )
        .rename(columns={"customer": "Customer", "total_charged": "Total Charged"})
    )
    df.to_excel("output.xlsx")

The grammar handles the cases real scripts actually contain. Edge cases that don’t parse get reported with a line number and a one-line summary; an engineer can hand-fix those (a few percent of scripts) without re-doing the bulk.
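The shape of that report, for a script that falls outside the subset (illustrative; the file name and wording are hypothetical):

    monthly_billing.fex:142: unsupported verb 'MATCH FILE' - flagged for manual conversion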
What both have in common
Two stories, one shape:
- The language surface was stable enough to write a grammar against.
- The grammar didn’t have to be the full language. Both compilers handle the subset that the codebase actually uses. Anything outside the subset gets reported as an error, which is a valid response — those cases need a human eye anyway.
- The output was deterministic. Same input, same output, every time. That’s what makes the tool trustworthy.
- The integration glue was where most of the time went. Writing the parser is the fun part; threading it into a build system, getting it to handle every file in a repo, dealing with edge cases — that’s the actual project.
Lessons
When you control the language surface, build a parser before you build a linter. Heuristic tools are a lower bound on what’s catchable; parsers are the upper bound. The cost difference is smaller than it looks once the grammar is written.
The grammar doesn’t have to be complete to be useful. Cover the subset that actually appears in the corpus. The cases that don’t parse are exactly the cases that need human review anyway. Aiming for a complete grammar is how parser projects die.
Lex/Yacc is unfashionable and excellent. Pick boring tools for tasks that benefit from boring tools. Both compilers are still in production and have needed essentially zero maintenance since they shipped.
Code generation is refactoring at scale. Once you have a parser plus an emitter, every future migration of the same shape becomes a small grammar change instead of a fresh project. Both tools are still being extended; neither has needed a rewrite. The same shape powers the Tuxedo → gRPC transpiler — different language, same architecture.