Lexical Analysis Utilities

A lightweight Java lexical analysis library that lets you build hand-written tokenizers and parsers without code generation. It sits between a basic StringTokenizer and a full parser generator like JFlex or ANTLR: fast and flexible enough for real grammars, simple enough to set up in a few classes.

Requires Java 17+.

Features

  • No code generation -- define your grammar in plain Java by subclassing a handful of base classes.
  • Customizable character classification -- override TokenClassifier to control what constitutes identifiers, numbers, strings, symbols, and comments for your language.
  • Lookahead and validation -- peek(), expect(), and predicate-based matching give you a clean recursive-descent parsing experience.
  • Accurate error reporting -- LexerException carries line number, column, and source context so error messages point directly at the problem:
        Error on line 3 row 5-7: Unexpected token foo
        bar baz
        ----^^^
  • Comment collection -- skipped tokens (whitespace, comments) are collected separately so you can attach documentation comments to AST nodes.
  • Streaming and pre-loaded modes -- works with both in-memory strings and large input streams via LineBufferedReader.
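
As a quick illustration of the two modes, here is a minimal sketch reusing the MyClassifier and MyTokenFactory classes built in the Usage section below (the constructor arguments mirror the wiring example there):

// Pre-loaded: tokenize an in-memory string through a plain StringReader.
var fromString = new TokenizerBase<>(new MyClassifier(), new MyTokenFactory(),
                                     new StringReader("name = 'value';"), 4096, true);

// Streaming: tokenize a large input stream without loading it all up front.
var fromStream = new TokenizerBase<>(new MyClassifier(), new MyTokenFactory(),
                                     new InputStreamReader(inputStream), 4096, true);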

Getting Started

Add to Maven:

<dependency>
    <groupId>net.morimekta.utils</groupId>
    <artifactId>lexer</artifactId>
    <version>5.0.0</version>
</dependency>

Or Gradle:

implementation 'net.morimekta.utils:lexer:5.0.0'

Core Concepts

Building a lexer involves five pieces that plug together:

  • TokenType -- Marker interface for your token type enum (e.g. IDENTIFIER, STRING, NUMBER, SYMBOL).
  • TokenClassifier -- Defines the character-level rules: what starts an identifier, which characters are string delimiters, how comments begin, etc.
  • TokenFactory -- Creates Token instances and assigns them the correct type.
  • TokenizerBase -- Drives the actual character-by-character tokenization using your classifier and factory.
  • Lexer -- High-level API on top of a tokenizer, providing peek(), expect(), expectSymbol(), and iteration.

Usage

1. Define your token types

public enum MyTokenType implements TokenType {
    IDENTIFIER,
    STRING,
    NUMBER,
    SYMBOL,
    COMMENT
}

2. Customize character classification

Override only the rules that differ from the defaults. For example, if your language treats - as part of identifiers, accepts both single- and double-quoted strings, and starts numbers with a plain digit:

public class MyClassifier extends TokenClassifier {
    @Override
    public boolean identifierSeparator(int ch) {
        return ch == '-';
    }

    @Override
    public boolean startString(int ch) {
        return ch == '\'' || ch == '"';
    }

    @Override
    public boolean startNumber(int ch) {
        return ch >= '0' && ch <= '9';
    }
}

3. Implement a token factory

public class MyTokenFactory implements TokenFactory<MyTokenType, Token<MyTokenType>> {
    @Override
    public Token<MyTokenType> identifierToken(char[] buf, int off, int len, int lineNo, int linePos) {
        return new Token<>(buf, off, len, MyTokenType.IDENTIFIER, lineNo, linePos);
    }

    @Override
    public Token<MyTokenType> stringToken(char[] buf, int off, int len, int lineNo, int linePos) {
        return new Token<>(buf, off, len, MyTokenType.STRING, lineNo, linePos);
    }

    @Override
    public Token<MyTokenType> numberToken(char[] buf, int off, int len, int lineNo, int linePos) {
        return new Token<>(buf, off, len, MyTokenType.NUMBER, lineNo, linePos);
    }

    @Override
    public Token<MyTokenType> symbolToken(char[] buf, int off, int len, int lineNo, int linePos) {
        return new Token<>(buf, off, len, MyTokenType.SYMBOL, lineNo, linePos);
    }

    @Override
    public Token<MyTokenType> genericToken(char[] buf, int off, int len, MyTokenType type, int lineNo, int linePos) {
        return new Token<>(buf, off, len, type, lineNo, linePos);
    }

    @Override
    public MyTokenType lineCommentTokenType() {
        return MyTokenType.COMMENT;
    }
}

4. Wire it together and parse

var tokenizer = new TokenizerBase<>(new MyClassifier(), new MyTokenFactory(), reader, 4096, true);
var lexer     = new Lexer<>(tokenizer);

while (lexer.hasNext()) {
    var token = lexer.next();
    System.out.printf("[%s] %s (line %d, col %d)%n",
                      token.type(), token, token.lineNo(), token.linePos());
}

Or with the expect / peek pattern for recursive-descent parsing:

var name = lexer.expect("identifier", MyTokenType.IDENTIFIER);
lexer.expectSymbol("assignment", '=');
var value = lexer.expect("value");
lexer.expectSymbol("terminator", ';');

Real-World Example: ASN.1 IDL Parser

The utils-asn1 project uses utils-lexer to parse ASN.1 Interface Definition Language files. It demonstrates all the extension points working together on a non-trivial grammar.

Token types for ASN.1

public enum Asn1TokenType implements TokenType {
    IDENTIFIER,
    STRING,
    NUMBER,
    SYMBOL,
    COMMENT_MULTI_LINE,  // /* ... */
    COMMENT_INLINE,      // -- comment --
    COMMENT,             // -- comment\n
    GROUP,
}

Character classifier

ASN.1 identifiers can contain hyphens (Type-Name, value-name), use '...' strings with suffixes ('0110'B for bit strings), and have -- line comments:

public class Asn1TokenClassifier extends TokenClassifier {
    @Override
    public boolean startString(int ch) {
        return ch == '\'' || ch == '"';
    }

    @Override
    public boolean allowAfterString(int ch) {
        return ch == 'B' || ch == 'H';  // bit string / hex octet string
    }

    @Override
    public boolean identifierSeparator(int ch) {
        return ch == '-';
    }

    @Override
    public boolean endIdentifier(int sep, int after) {
        return sep == '-' && after == '-';  // '--' starts a comment, not an identifier
    }

    @Override
    public boolean allowBeforeIdentifier(int ch) {
        return ch == '&' || ch == '@';  // field references and component relations
    }
}

Tokenizer with multi-character symbols

ASN.1 has multi-character symbols like ::=, .., and [[. The tokenizer overrides nextSymbol() to handle these:

public class Asn1Tokenizer extends TokenizerBase<Asn1TokenType, Asn1Token> {
    private final Path file;

    public Asn1Tokenizer(Path file, Reader reader) {
        super(new Asn1TokenClassifier(), new Asn1TokenFactory(), reader, 4096, true);
        this.file = file;
    }

    @Override
    public boolean skipTokenOnParseNext(Asn1Token token) {
        return token.type() == Asn1TokenType.COMMENT ||
               token.type() == Asn1TokenType.COMMENT_INLINE ||
               token.type() == Asn1TokenType.COMMENT_MULTI_LINE;
    }

    @Override
    protected Asn1Token nextSymbol() throws IOException {
        if (lastChar == '-') {
            // might be a '--' comment, or a negative number
            // ...
        } else if (lastChar == ':') {
            // might be '::=' assignment
            // ...
        }
        return super.nextSymbol();
    }
}

Lexer subclass

The ASN.1 lexer extends Lexer to expose skipped comments for documentation attachment and to produce domain-specific parse exceptions:

public class Asn1Lexer extends Lexer<Asn1TokenType, Asn1Token> {
    private final Path file;
    private final Tokenizer<Asn1TokenType, Asn1Token> tokenizer;

    public Asn1Lexer(Path file, Tokenizer<Asn1TokenType, Asn1Token> tokenizer) {
        super(tokenizer);
        this.file = file;
        this.tokenizer = tokenizer;
    }

    public List<Asn1Token> skippedComments() {
        return tokenizer.getSkippedTokens();
    }

    public void clearSkippedComments() {
        tokenizer.clearSkippedTokens();
    }
}

Parsing with the lexer

With all the pieces in place, parsing an ASN.1 module reads naturally as a recursive-descent parser:

try (var reader = new InputStreamReader(inputStream)) {
    var lexer = new Asn1Lexer(file, new Asn1Tokenizer(file, reader));

    // Parse module header
    var moduleName = lexer.expect("Module name", Asn1Token::isTypeName);
    lexer.expect("DEFINITIONS", t -> t.is("DEFINITIONS"));
    lexer.expect("::=", t -> t.is(SYMBOL, "::="));
    lexer.expect("BEGIN", t -> t.is(IDENTIFIER, "BEGIN"));

    // Collect documentation comments
    lexer.peek();
    List<Asn1Token> docs = new ArrayList<>(lexer.skippedComments());
    lexer.clearSkippedComments();

    // Parse exports
    if (lexer.peek().is("EXPORTS")) {
        lexer.next();
        do {
            exports.add(lexer.expect("Exported ID", IDENTIFIER));
        } while (lexer.expectSymbol("separator", ';', ',').isSymbol(','));
    }

    // Parse assignments until END
    while (!lexer.peek("").is(IDENTIFIER, "END")) {
        var declName = lexer.expect("Name", IDENTIFIER);
        // ... parse type or value assignment
    }
}
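
When any of these expect calls fails, the resulting LexerException carries the position and source-line context described under Features; a sketch of surfacing it (parseModule is a hypothetical wrapper around the block above):

try {
    parseModule(lexer);
} catch (LexerException e) {
    // Per the Features section, the message already points at the offending
    // token with line, column and the source line, so it can be shown as-is.
    System.err.println(e.getMessage());
}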

API Reference

Token<Type>

Each token carries its content (as a CharSlice), its type, and its position in the source:

  • type() -- The token's type.
  • lineNo() -- Line number (1-based).
  • linePos() -- Column position (1-based).
  • line() -- The full source line containing this token.
  • is(String name) -- Check the token text.
  • is(Type type) -- Check the token type.
  • is(Type type, String name) -- Check both type and text.
  • isSymbol(char c) -- Check whether the token is the given single-character symbol.
  • decodeString(boolean strict) -- Decode string content, handling escape sequences.
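
These checks combine naturally when dispatching on a freshly read token; a small illustrative sketch:

var token = lexer.expect("next statement");
if (token.is(MyTokenType.STRING)) {
    // Strict decoding rejects malformed escape sequences.
    var text = token.decodeString(true);
} else if (token.isSymbol('{')) {
    // start of a block
} else if (token.is(MyTokenType.IDENTIFIER, "import")) {
    // keyword check: type and exact text in one call
}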

Lexer<TT, T>

  • next() -- Consume and return the next token, or null at EOF.
  • hasNext() -- Check if more tokens remain.
  • peek() -- Look ahead without consuming.
  • peek(String what) -- Look ahead, throw on EOF.
  • expect(String what) -- Consume the next token, throw on EOF.
  • expect(String what, TT type) -- Consume and validate the token type.
  • expect(String what, Predicate<T>) -- Consume and validate with a predicate.
  • expectSymbol(String what, char...) -- Consume and validate as one of the given symbols.
  • readUntil(String term, TT type, boolean allowEof) -- Read raw content until a terminator string.
  • failure(T token, String msg, Object...) -- Create a LexerException pointing at a token.
  • iterator() -- Iterate over tokens (wraps checked exceptions in unchecked variants).
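
failure(...) is useful when a token parses fine but fails a semantic check; a hedged sketch (assuming printf-style formatting for the message arguments):

var count = lexer.expect("repeat count", MyTokenType.NUMBER);
int n = Integer.parseInt(count.toString());
if (n < 1) {
    // failure(...) builds (but does not throw) a LexerException pointing at the token.
    throw lexer.failure(count, "repeat count must be positive, not %d", n);
}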

License

Licensed under the Apache License, Version 2.0.

See morimekta.net/utils for release procedures.