Build Your Own Programming Language - Second Edition

By: Clinton L. Jeffery

Overview of this book

There are many reasons to build a programming language: out of necessity, as a learning exercise, or just for fun. Whatever your reasons, this book gives you the tools to succeed. You’ll build the frontend of a compiler for your language and generate a lexical analyzer and parser using Lex and YACC tools. Then you’ll explore a series of syntax tree traversals before looking at code generation for a bytecode virtual machine or native code. In this edition, a new chapter helps you understand the differences between preprocessors and transpilers. Code examples have been modernized, expanded, and rigorously tested, and all content has been thoroughly refreshed. You’ll learn to implement code generation techniques using practical examples, including the Unicon Preprocessor and transpiling Jzero code to Unicon. You’ll then move on to domain-specific language features and learn to create them as built-in operators and functions. You’ll also cover garbage collection. Dr. Jeffery’s experiences building the Unicon language are used to add context to the concepts, and relevant examples are provided in both Unicon and Java so that you can follow along in your language of choice. By the end of this book, you’ll be able to build and deploy your own domain-specific language.

Establishing the requirements for your language

Once you are sure you need a new programming language for what you are doing, take a few minutes to establish the requirements. This exercise is open-ended: you are defining what success for your project will look like. Wise language inventors do not create a whole new syntax from scratch. Instead, they define it as a set of modifications to a popular existing language.

Many great programming languages (Lisp, Forth, Smalltalk, and many others) had their success significantly limited by the degree to which their syntax was unnecessarily different from mainstream languages. Still, your language requirements include what it will look like, and that includes syntax.

More importantly, you must define a set of control structures or semantics where your programming language needs to go beyond existing language(s). This will sometimes include special support for an application domain that is not well served by existing languages and their libraries. Such domain-specific languages (DSLs) are common enough that whole books are focused on that topic. Our goal for this book will be to focus on the nuts and bolts of building the compiler and runtime system for such a language, independent of whatever domain you may be working in.

In a normal software engineering process, requirements analysis would start with brainstorming lists of functional and non-functional requirements. Functional requirements for a programming language involve the specifics of how the end user developer will interact with it. You might not anticipate all the command-line options for your language up front, but you probably know whether interactivity is required, or whether a separate compile step is OK. The discussion of interpreters and compilers in the previous section, and this book’s presentation of a compiler, might seem to make that choice for you. It does not: Python is an example of a language that provides a fully interactive interface, even though the source code you type into Python gets compiled into bytecode and executed by a bytecode machine, rather than being interpreted directly.
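To make the Python point concrete, here is a minimal sketch of an interactive loop that still compiles before it executes. Everything in it is an assumption made for illustration: the class name TinyRepl, the three opcodes, and the input syntax (space-separated integers joined by + and -) are invented here and are not Python’s bytecode, Jzero’s bytecode, or anything defined in this book.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Scanner;

// Hypothetical sketch: each input line is compiled to a tiny bytecode and
// then executed by a stack machine, so the interface is interactive even
// though the source text is never interpreted directly.
class TinyRepl {
    // Opcodes of an invented stack machine (assumption, for illustration only).
    static final int PUSH = 0, ADD = 1, SUB = 2;

    // Compile a line such as "2 + 3 - 1" into a flat list of instructions.
    static List<Integer> compile(String line) {
        String[] tok = line.trim().split("\\s+");
        List<Integer> code = new ArrayList<>();
        code.add(PUSH);
        code.add(Integer.parseInt(tok[0]));
        for (int i = 1; i + 1 < tok.length; i += 2) {
            code.add(PUSH);
            code.add(Integer.parseInt(tok[i + 1]));
            if (tok[i].equals("+")) {
                code.add(ADD);
            } else if (tok[i].equals("-")) {
                code.add(SUB);
            } else {
                throw new IllegalArgumentException("unknown operator: " + tok[i]);
            }
        }
        return code;
    }

    // Execute the compiled instructions on an operand stack.
    static int run(List<Integer> code) {
        Deque<Integer> stack = new ArrayDeque<>();
        int pc = 0;
        while (pc < code.size()) {
            int op = code.get(pc++);
            if (op == PUSH) {
                stack.push(code.get(pc++));   // operand follows the opcode
            } else {
                int right = stack.pop();
                int left = stack.pop();
                stack.push(op == ADD ? left + right : left - right);
            }
        }
        return stack.pop();
    }

    // Read-compile-execute loop: one result is printed per input line.
    public static void main(String[] args) {
        Scanner in = new Scanner(System.in);
        while (in.hasNextLine()) {
            String line = in.nextLine();
            if (line.trim().isEmpty()) {
                continue;
            }
            System.out.println(run(compile(line)));
        }
    }
}
```

From the user’s side this feels like an interpreter: type a line, see a result. Internally, the compile step and the execute step are separate, which is the distinction the requirement about interactivity has to account for.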

Non-functional requirements are properties that your programming language must achieve that are not directly tied to the end user developer’s interactions. They include things such as what operating system(s) your language must run on, how fast execution must be, or how little space the programs written in your language must run within.

The non-functional requirement regarding how fast execution must be usually determines whether you can target a software (bytecode) machine or need to target native code. Native code is not just faster; it is also considerably more difficult to generate, and it might make your language considerably less flexible in terms of runtime system features. You might choose to target bytecode first, and then work on a native code generator afterward.

The first language I learned to program on was a BASIC interpreter in which the programs had to run within 4 KB of RAM. BASIC at the time had a low memory footprint requirement. But even in modern times, it is not uncommon to find yourself on a platform where Java won’t run by default! For example, on virtual machines with configured memory limits for user processes, you may have to learn some awkward command-line options to compile or run even simple Java programs.

In addition to identifying functional and non-functional requirements, many requirements analysis approaches also define a set of use cases and ask the developer to write descriptions for them. Inventing a programming language is different from your average software engineering project, but before you are finished, you may want to perform such a use case analysis. A use case is a task that someone performs using a software application. When the software application is a programming language, if you are not careful, the use cases may be too general to be useful, such as “write my application” and “run my program.” While those two might not be very useful, you might want to think about whether your programming language implementation must support program development, debugging, separate compilation and linking, integration with external languages and libraries, and so forth. Most of those topics are beyond the scope of this book, but we will consider some of them.

Since this book presents the implementation of a language called Jzero, here are some requirements for Jzero. Some of these requirements may appear arbitrary. You could certainly add your own requirements and produce your own Java dialect, but this list describes what we are aiming for in this book. If it is not clear to you where one of the following requirements came from, it came either from our source inspiration language (plzero) or from previous experience teaching compiler construction:

Jzero should be a strict subset of Java. All legal Jzero programs should be legal Java programs. This requirement allows us to check the behavior of our test programs when we are debugging our language implementation.
Jzero should provide enough features to allow interesting computations. This includes if statements, while loops, and multiple functions, along with parameters.
Jzero should support a few data types, including Booleans, integers, arrays, and the String type. However, it needs to support only a subset of their functionality (as you’ll see later). These types are enough to allow input and output of interesting values into a computation.
Jzero should emit decent error messages, showing the filename and line number, including messages for attempts to use Java features not in Jzero. We will need reasonable error messages to debug the implementation.
Jzero should run fast enough to be practical. This requirement is vague, but it implies that we won’t be doing a pure interpreter. Pure interpreters that execute source code directly without any internal code generation step are a very retro thing, evocative of the 1960s and 1970s. They tend to execute unacceptably slowly by modern standards. On the other hand, you might very well decide that your language should provide the highly interactive look and feel of a pure interpreter, like Python does. Anyhow, that is not in Jzero’s requirements.
Jzero should be as simple as possible so that I can explain it. Sadly, this rules out writing a full description of a native code generator or even an implementation that targets JVM bytecode; we will provide our own simple bytecode machine.
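As an illustration of the first three requirements, the following sketch is a legal Java program that restricts itself to the features named in the list: if statements, a while loop, multiple functions with parameters, and the Boolean, integer, array, and String types. The class and method names are hypothetical, and the exact Jzero subset is not defined in this section, so treat details such as the use of System.out.println as assumptions rather than as part of the language definition.

```java
// Hypothetical test program that uses only the features named in the
// requirements list. Because it is ordinary Java, its behavior under a
// standard Java compiler can serve as a reference when debugging a
// Jzero implementation.
class JzeroSketch {
    // Sum the first n elements of an array using a while loop.
    public static int sum(int[] values, int n) {
        int total = 0;
        int i = 0;
        while (i < n) {
            total = total + values[i];
            i = i + 1;
        }
        return total;
    }

    // Use a Boolean and an if statement to choose a String result.
    public static String describe(int total, int limit) {
        boolean big = total > limit;
        if (big) {
            return "large";
        }
        return "small";
    }

    public static void main(String[] argv) {
        int[] values = new int[3];
        values[0] = 4;
        values[1] = 5;
        values[2] = 6;
        int total = sum(values, 3);
        // Output via System.out.println is assumed to be available.
        System.out.println(describe(total, 10));
    }
}
```

Compiled and run with a standard Java toolchain, this program prints large, since 4 + 5 + 6 = 15 exceeds the limit of 10. That known-good result is what the strict-subset requirement buys: any disagreement between Java and a Jzero implementation on a program like this points to a bug in the implementation.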

Perhaps more requirements will emerge as we go along, but this is a start. Since we are constrained for time and space, perhaps this requirements list is more important for what it does not say, rather than for what it does say. By way of comparison, here are some of the requirements that led to the creation of the Unicon programming language.
