Lexical Analysis Utilities

A lightweight Java lexical analysis library that lets you build hand-written tokenizers and parsers without code generation. It sits between a basic StringTokenizer and a full parser generator like JFlex or ANTLR: fast and flexible enough for real grammars, simple enough to set up in a few classes.

Requires Java 17+.

Features

  • No code generation -- define your grammar in plain Java by subclassing a handful of base classes.
  • Customizable character classification -- override TokenClassifier to control what constitutes identifiers, numbers, strings, symbols, and comments for your language.
  • Lookahead and validation -- peek(), expect(), and predicate-based matching give you a clean recursive-descent parsing experience.
  • Accurate error reporting -- LexerException carries line number, column, and source context so error messages point directly at the problem:
        Error on line 3 row 5-7: Unexpected token foo
        bar baz
        ----^^^
  • Comment collection -- skipped tokens (whitespace, comments) are collected separately so you can attach documentation comments to AST nodes.
  • Streaming and pre-loaded modes -- works with both in-memory strings and large input streams via LineBufferedReader.
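
As a quick illustration of the two modes, here is a minimal sketch reusing the MyClassifier and MyTokenFactory classes built in the Usage section below (the constructor arguments mirror the wiring example there):

// Pre-loaded: tokenize an in-memory string through a plain StringReader.
var fromString = new TokenizerBase<>(new MyClassifier(), new MyTokenFactory(),
                                     new StringReader("name = 'value';"), 4096, true);

// Streaming: tokenize a large input stream without loading it all up front.
var fromStream = new TokenizerBase<>(new MyClassifier(), new MyTokenFactory(),
                                     new InputStreamReader(inputStream), 4096, true);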

Getting Started

Add to Maven:

<dependency>
    <groupId>net.morimekta.utils</groupId>
    <artifactId>lexer</artifactId>
    <version>5.0.0</version>
</dependency>

Or Gradle:

implementation 'net.morimekta.utils:lexer:5.0.0'

Core Concepts

Building a lexer involves five pieces that plug together:

  • TokenType -- Marker interface for your token type enum (e.g. IDENTIFIER, STRING, NUMBER, SYMBOL).
  • TokenClassifier -- Defines the character-level rules: what starts an identifier, which characters are string delimiters, how comments begin, etc.
  • TokenFactory -- Creates Token instances and assigns them the correct type.
  • TokenizerBase -- Drives the actual character-by-character tokenization using your classifier and factory.
  • Lexer -- High-level API on top of a tokenizer, providing peek(), expect(), expectSymbol(), and iteration.

Usage

1. Define your token types

public enum MyTokenType implements TokenType {
    IDENTIFIER,
    STRING,
    NUMBER,
    SYMBOL,
    COMMENT
}

2. Customize character classification

Override only the rules that differ from the defaults. For example, if your language treats - as part of identifiers, accepts both single- and double-quoted strings, and starts numbers with a plain digit:

public class MyClassifier extends TokenClassifier {
    @Override
    public boolean identifierSeparator(int ch) {
        return ch == '-';
    }

    @Override
    public boolean startString(int ch) {
        return ch == '\'' || ch == '"';
    }

    @Override
    public boolean startNumber(int ch) {
        return ch >= '0' && ch <= '9';
    }
}

3. Implement a token factory

public class MyTokenFactory implements TokenFactory<MyTokenType, Token<MyTokenType>> {
    @Override
    public Token<MyTokenType> identifierToken(char[] buf, int off, int len, int lineNo, int linePos) {
        return new Token<>(buf, off, len, MyTokenType.IDENTIFIER, lineNo, linePos);
    }

    @Override
    public Token<MyTokenType> stringToken(char[] buf, int off, int len, int lineNo, int linePos) {
        return new Token<>(buf, off, len, MyTokenType.STRING, lineNo, linePos);
    }

    @Override
    public Token<MyTokenType> numberToken(char[] buf, int off, int len, int lineNo, int linePos) {
        return new Token<>(buf, off, len, MyTokenType.NUMBER, lineNo, linePos);
    }

    @Override
    public Token<MyTokenType> symbolToken(char[] buf, int off, int len, int lineNo, int linePos) {
        return new Token<>(buf, off, len, MyTokenType.SYMBOL, lineNo, linePos);
    }

    @Override
    public Token<MyTokenType> genericToken(char[] buf, int off, int len, MyTokenType type, int lineNo, int linePos) {
        return new Token<>(buf, off, len, type, lineNo, linePos);
    }

    @Override
    public MyTokenType lineCommentTokenType() {
        return MyTokenType.COMMENT;
    }
}

4. Wire it together and parse

var tokenizer = new TokenizerBase<>(new MyClassifier(), new MyTokenFactory(), reader, 4096, true);
var lexer     = new Lexer<>(tokenizer);

while (lexer.hasNext()) {
    var token = lexer.next();
    System.out.printf("[%s] %s (line %d, col %d)%n",
                      token.type(), token, token.lineNo(), token.linePos());
}

Or with the expect / peek pattern for recursive-descent parsing:

var name = lexer.expect("identifier", MyTokenType.IDENTIFIER);
lexer.expectSymbol("assignment", '=');
var value = lexer.expect("value");
lexer.expectSymbol("terminator", ';');

Real-World Example: ASN.1 IDL Parser

The utils-asn1 project uses utils-lexer to parse ASN.1 Interface Definition Language files. It demonstrates all the extension points working together on a non-trivial grammar.

Token types for ASN.1

public enum Asn1TokenType implements TokenType {
    IDENTIFIER,
    STRING,
    NUMBER,
    SYMBOL,
    COMMENT_MULTI_LINE,  // /* ... */
    COMMENT_INLINE,      // -- comment --
    COMMENT,             // -- comment\n
    GROUP,
}

Character classifier

ASN.1 identifiers can contain hyphens (Type-Name, value-name), use '...' strings with suffixes ('0110'B for bit strings), and have -- line comments:

public class Asn1TokenClassifier extends TokenClassifier {
    @Override
    public boolean startString(int ch) {
        return ch == '\'' || ch == '"';
    }

    @Override
    public boolean allowAfterString(int ch) {
        return ch == 'B' || ch == 'H';  // bit string / hex octet string
    }

    @Override
    public boolean identifierSeparator(int ch) {
        return ch == '-';
    }

    @Override
    public boolean endIdentifier(int sep, int after) {
        return sep == '-' && after == '-';  // '--' starts a comment, not an identifier
    }

    @Override
    public boolean allowBeforeIdentifier(int ch) {
        return ch == '&' || ch == '@';  // field references and component relations
    }
}

Tokenizer with multi-character symbols

ASN.1 has multi-character symbols like ::=, .., and [[. The tokenizer overrides nextSymbol() to handle these:

public class Asn1Tokenizer extends TokenizerBase<Asn1TokenType, Asn1Token> {
    private final Path file;

    public Asn1Tokenizer(Path file, Reader reader) {
        super(new Asn1TokenClassifier(), new Asn1TokenFactory(), reader, 4096, true);
        this.file = file;
    }

    @Override
    public boolean skipTokenOnParseNext(Asn1Token token) {
        return token.type() == Asn1TokenType.COMMENT ||
               token.type() == Asn1TokenType.COMMENT_INLINE ||
               token.type() == Asn1TokenType.COMMENT_MULTI_LINE;
    }

    @Override
    protected Asn1Token nextSymbol() throws IOException {
        if (lastChar == '-') {
            // might be a '--' comment, or a negative number
            // ...
        } else if (lastChar == ':') {
            // might be '::=' assignment
            // ...
        }
        return super.nextSymbol();
    }
}

Lexer subclass

The ASN.1 lexer extends Lexer to expose skipped comments for documentation attachment and to produce domain-specific parse exceptions:

public class Asn1Lexer extends Lexer<Asn1TokenType, Asn1Token> {
    private final Path file;
    private final Tokenizer<Asn1TokenType, Asn1Token> tokenizer;

    public Asn1Lexer(Path file, Tokenizer<Asn1TokenType, Asn1Token> tokenizer) {
        super(tokenizer);
        this.file = file;
        this.tokenizer = tokenizer;
    }

    public List<Asn1Token> skippedComments() {
        return tokenizer.getSkippedTokens();
    }

    public void clearSkippedComments() {
        tokenizer.clearSkippedTokens();
    }
}

Parsing with the lexer

With all the pieces in place, parsing an ASN.1 module reads naturally as a recursive-descent parser:

try (var reader = new InputStreamReader(inputStream)) {
    var lexer = new Asn1Lexer(file, new Asn1Tokenizer(file, reader));

    // Parse module header
    var moduleName = lexer.expect("Module name", Asn1Token::isTypeName);
    lexer.expect("DEFINITIONS", t -> t.is("DEFINITIONS"));
    lexer.expect("::=", t -> t.is(SYMBOL, "::="));
    lexer.expect("BEGIN", t -> t.is(IDENTIFIER, "BEGIN"));

    // Collect documentation comments
    lexer.peek();
    List<Asn1Token> docs = new ArrayList<>(lexer.skippedComments());
    lexer.clearSkippedComments();

    // Parse exports
    if (lexer.peek().is("EXPORTS")) {
        lexer.next();
        do {
            exports.add(lexer.expect("Exported ID", IDENTIFIER));
        } while (lexer.expectSymbol("separator", ';', ',').isSymbol(','));
    }

    // Parse assignments until END
    while (!lexer.peek("").is(IDENTIFIER, "END")) {
        var declName = lexer.expect("Name", IDENTIFIER);
        // ... parse type or value assignment
    }
}
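
When any of these expect calls fails, the resulting LexerException carries the position and source-line context described under Features; a sketch of surfacing it (parseModule is a hypothetical wrapper around the block above):

try {
    parseModule(lexer);
} catch (LexerException e) {
    // Per the Features section, the message already points at the offending
    // token with line, column and the source line, so it can be shown as-is.
    System.err.println(e.getMessage());
}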

API Reference

Token<Type>

Each token carries its content (as a CharSlice), its type, and its position in the source:

  • type() -- The token's type.
  • lineNo() -- Line number (1-based).
  • linePos() -- Column position (1-based).
  • line() -- The full source line containing this token.
  • is(String name) -- Check the token text.
  • is(Type type) -- Check the token type.
  • is(Type type, String name) -- Check both type and text.
  • isSymbol(char c) -- Check whether the token is the given single-character symbol.
  • decodeString(boolean strict) -- Decode string content, handling escape sequences.
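
These checks combine naturally when dispatching on a freshly read token; a small illustrative sketch:

var token = lexer.expect("next statement");
if (token.is(MyTokenType.STRING)) {
    // Strict decoding rejects malformed escape sequences.
    var text = token.decodeString(true);
} else if (token.isSymbol('{')) {
    // start of a block
} else if (token.is(MyTokenType.IDENTIFIER, "import")) {
    // keyword check: type and exact text in one call
}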

Lexer<TT, T>

  • next() -- Consume and return the next token, or null at EOF.
  • hasNext() -- Check if more tokens remain.
  • peek() -- Look ahead without consuming.
  • peek(String what) -- Look ahead, throw on EOF.
  • expect(String what) -- Consume the next token, throw on EOF.
  • expect(String what, TT type) -- Consume and validate the token type.
  • expect(String what, Predicate<T>) -- Consume and validate with a predicate.
  • expectSymbol(String what, char...) -- Consume and validate as one of the given symbols.
  • readUntil(String term, TT type, boolean allowEof) -- Read raw content until a terminator string.
  • failure(T token, String msg, Object...) -- Create a LexerException pointing at a token.
  • iterator() -- Iterate over tokens (wraps checked exceptions in unchecked variants).
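
failure(...) is useful when a token parses fine but fails a semantic check; a hedged sketch (assuming printf-style formatting for the message arguments):

var count = lexer.expect("repeat count", MyTokenType.NUMBER);
int n = Integer.parseInt(count.toString());
if (n < 1) {
    // failure(...) builds (but does not throw) a LexerException pointing at the token.
    throw lexer.failure(count, "repeat count must be positive, not %d", n);
}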

License

Licensed under the Apache License, Version 2.0.

See morimekta.net/utils for release procedures.