What changed

This paper formalizes the task of generating code for domain-specific languages (DSLs) from natural language descriptions, termed Text2DSL. This is presented as a distinct problem class, separate from more general tasks like Text-to-SQL or general-purpose code generation. The authors introduce the PolkitBench dataset, which contains 4,204 verified pairs of natural language descriptions and corresponding Polkit rules. Each pair was validated using a three-level Abstract Syntax Tree (AST)-based pipeline.
Experiments were conducted using two Mixture-of-Experts (MoE) models: GigaChat-10B-A1.8B (with 1.8 billion active parameters) and Nemotron-3-Nano-30B-A3B (with 3 billion active parameters). These experiments focused on the impact of structured context provided within prompts. The structured context included elements like Backus-Naur Form (BNF) grammar, API specifications, and a vocabulary of permitted identifiers.
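As a rough illustration of what "structured context" could look like in practice, the sketch below assembles a prompt from the three elements named above: a BNF grammar, an API specification, and a vocabulary of permitted identifiers. The `StructuredContext` class, the `build_prompt` function, the section headers, and the placeholder grammar and API text are all assumptions made for this example; the paper's actual prompt template is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class StructuredContext:
    """Hypothetical container for the three context elements named in the text."""
    grammar_bnf: str
    api_spec: str
    vocabulary: list[str]


def build_prompt(description: str, ctx: StructuredContext) -> str:
    """Assemble one prompt string: specifications first, the task last."""
    vocab_lines = "\n".join(f"- {name}" for name in ctx.vocabulary)
    sections = [
        "You write Polkit rules. Use only the grammar, API, and identifiers below.",
        "## Grammar (BNF)\n" + ctx.grammar_bnf.strip(),
        "## API\n" + ctx.api_spec.strip(),
        "## Permitted identifiers\n" + vocab_lines,
        "## Task\n" + description.strip(),
    ]
    return "\n\n".join(sections)


# Placeholder content for illustration; not taken from the paper or PolkitBench.
example_ctx = StructuredContext(
    grammar_bnf='<rule> ::= "polkit.addRule(" <function> ");"',
    api_spec="subject.isInGroup(name) -> bool\npolkit.Result.YES | polkit.Result.NO",
    vocabulary=["org.freedesktop.udisks2.filesystem-mount", "wheel"],
)

prompt = build_prompt(
    "Allow members of the wheel group to mount filesystems without a password.",
    example_ctx,
)
print(prompt)
```

Placing the specifications ahead of the task is a design choice of this sketch, not something the source prescribes. The point is only that each context element is an ordinary block of text that can be added or removed independently, which makes it possible to compare prompts with and without it.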
The results indicate a substantial improvement in code generation quality when structured context is supplied. For both models tested, the syntactic validity of the generated code rose to between 98.6% and 99.4%. Structural validity improved by 9.7 to 35.5 percentage points, and the CodeBLEU score, a metric for evaluating code generation quality, improved by 60% to 95%. Because these gains were consistent across two models of different scales and origins, the authors' results suggest that injecting formal target-language specifications into the prompt context is a robust way to achieve high-quality DSL code generation without model fine-tuning.
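The figures above use two different units: structural validity is reported in percentage points (an absolute difference), while the CodeBLEU change is given in percent, which reads as a relative gain. The short sketch below shows how the two differ. The baseline and post-context values are hypothetical, chosen only so that the absolute gain matches the 35.5-point upper bound quoted above; the source does not state the underlying baselines.

```python
def percentage_point_gain(before_pct: float, after_pct: float) -> float:
    """Absolute difference between two rates that are already in percent."""
    return after_pct - before_pct


def relative_gain_pct(before: float, after: float) -> float:
    """Relative change, expressed in percent of the starting value."""
    return (after - before) / before * 100.0


# Hypothetical numbers for illustration only; not reported in the source.
baseline_validity = 60.0   # percent of outputs that are structurally valid
context_validity = 95.5    # same rate with structured context in the prompt

pp = percentage_point_gain(baseline_validity, context_validity)
rel = relative_gain_pct(baseline_validity, context_validity)

print(f"absolute gain: {pp:.1f} percentage points")
print(f"relative gain: {rel:.1f}%")
```

The same change looks much larger when expressed as a relative gain, so the two kinds of figures should not be compared directly.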
Why it matters for builders

Developers who use domain-specific languages (DSLs) for tasks such as defining operating system security policies often write rules by hand, which is complex and error-prone. This research offers a promising way to automate that process. By establishing Text2DSL as a formal problem and demonstrating the effectiveness of structured prompts, it gives builders a way to use LLMs more effectively to generate correct and valid DSL code.
The findings suggest that even without specialized fine-tuning, providing LLMs with relevant contextual information like grammar rules and API definitions can dramatically boost the accuracy and structural integrity of the generated code. This lowers the barrier to entry for working with DSLs and can streamline development workflows, reducing debugging time and improving the reliability of system configurations.
Practical impact

The practical impact of this research lies in its potential to democratize the use of DSLs. In operating system security, for instance, where Polkit rules are essential, generating those rules from simple natural language descriptions, guided by structured context, can significantly speed up policy implementation and maintenance. Developers could draft security policies in plain English, and an LLM supplied with the correct DSL specifications could translate them into functional Polkit rules.
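To make the target format concrete, the sketch below pairs a hand-written Polkit rule (Polkit rules are written in JavaScript) with a lightweight check that every quoted string in the rule belongs to a permitted vocabulary. The rule, the `PERMITTED` set, and the `unknown_identifiers` helper are illustrative assumptions and are not taken from PolkitBench. The regex-based check is a deliberate simplification and should not be read as the paper's AST-based validation pipeline.

```python
import re

# Hand-written example rule: let members of "wheel" mount filesystems.
RULE = '''
polkit.addRule(function(action, subject) {
    if (action.id == "org.freedesktop.udisks2.filesystem-mount" &&
        subject.isInGroup("wheel")) {
        return polkit.Result.YES;
    }
});
'''

# Hypothetical vocabulary of identifiers the generator is allowed to use.
PERMITTED = {"org.freedesktop.udisks2.filesystem-mount", "wheel"}


def unknown_identifiers(rule_text: str, permitted: set[str]) -> list[str]:
    """Return double-quoted string literals in the rule that are not permitted."""
    literals = re.findall(r'"([^"\\]*)"', rule_text)
    return sorted(set(literals) - permitted)


print(unknown_identifiers(RULE, PERMITTED))                           # []
print(unknown_identifiers(RULE.replace("wheel", "whel"), PERMITTED))  # ['whel']
```

A check like this only catches misspelled or invented action IDs and group names. It says nothing about whether the rule parses or whether it grants the intended permissions, which is why a real pipeline would need syntax-aware validation and human review of the resulting policy.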
This approach could also extend to other DSL-heavy domains, such as configuration management, network programming, or specialized data processing frameworks. The key takeaway for builders is the importance of prompt engineering, specifically the inclusion of formal language specifications, as a powerful tool for enhancing LLM performance in code generation tasks. This method offers a cost-effective way to improve code quality, as it bypasses the need for extensive and resource-intensive model fine-tuning.
Caveats and source limits

The research is based on experiments with two specific MoE models (GigaChat-10B-A1.8B and Nemotron-3-Nano-30B-A3B) and a single DSL (Polkit). While the results are consistent across these two models, the source does not detail whether they generalize to other LLM architectures, other DSLs, or more complex generation tasks. The paper focuses on prompt engineering with structured context as a way to improve generation quality, and does not explore alternatives such as fine-tuning or retrieval-augmented generation in depth for the Text2DSL problem.
The PolkitBench dataset comprises 4,204 verified pairs, which provides a solid foundation for evaluation, but the diversity and complexity of real-world DSL usage may extend beyond the dataset's scope. The reported metrics (syntactic validity, structural validity, CodeBLEU) measure the form of the generated code. In a production environment, the ultimate measure of success would be that rules execute correctly and enforce the intended security policy. The validation pipeline addresses this only implicitly, and the excerpt does not demonstrate it exhaustively.
Featured on AI Radar: Text2DSL: LLM-Based Code Generation for Domain-Specific Languages