# Lexical Analysis Utilities

A lightweight Java lexical analysis library that lets you build hand-written
tokenizers and parsers without code generation. It sits between a basic
`StringTokenizer` and a full parser generator like JFlex or ANTLR: fast and
flexible enough for real grammars, simple enough to set up in a few classes.
Requires Java 17+.
## Features

- No code generation -- define your grammar in plain Java by subclassing a handful of base classes.
- Customizable character classification -- override `TokenClassifier` to control what constitutes identifiers, numbers, strings, symbols, and comments for your language.
- Lookahead and validation -- `peek()`, `expect()`, and predicate-based matching give you a clean recursive-descent parsing experience.
- Accurate error reporting -- `LexerException` carries line number, column, and source context so error messages point directly at the problem:

  ```
  Error on line 3 row 5-7: Unexpected token
  foo bar baz
  ----^^^
  ```

- Comment collection -- skipped tokens (whitespace, comments) are collected separately so you can attach documentation comments to AST nodes.
- Streaming and pre-loaded modes -- works with both in-memory strings and large input streams via `LineBufferedReader`.
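Caret-style error messages of this kind are plain string building over the token's line and column. The sketch below is not the library's `LexerException` implementation; it only illustrates how position info turns into a pointer message:

```java
// Sketch: formatting a caret-style error message from a source line,
// a 1-based column, and a token length. Illustration only, not the
// library's LexerException.
public class CaretError {
    static String format(String sourceLine, int lineNo, int col, int len, String message) {
        StringBuilder sb = new StringBuilder();
        sb.append("Error on line ").append(lineNo)
          .append(" row ").append(col).append('-').append(col + len - 1)
          .append(": ").append(message).append('\n');
        sb.append(sourceLine).append('\n');
        // Dashes up to the token's column, then one caret per character.
        sb.append("-".repeat(col - 1)).append("^".repeat(len));
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(format("foo bar baz", 3, 5, 3, "Unexpected token"));
    }
}
```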
## Getting Started

Add to Maven:

```xml
<dependency>
    <groupId>net.morimekta.utils</groupId>
    <artifactId>lexer</artifactId>
    <version>5.0.0</version>
</dependency>
```

Or Gradle:

```groovy
implementation 'net.morimekta.utils:lexer:5.0.0'
```
## Core Concepts

Building a lexer involves five classes that plug together:

| Class | Purpose |
|---|---|
| `TokenType` | Marker interface for your token type enum (e.g. `IDENTIFIER`, `STRING`, `NUMBER`, `SYMBOL`). |
| `TokenClassifier` | Defines the character-level rules: what starts an identifier, which characters are string delimiters, how comments begin, etc. |
| `TokenFactory` | Creates `Token` instances and assigns them the correct type. |
| `TokenizerBase` | Drives the actual character-by-character tokenization using your classifier and factory. |
| `Lexer` | High-level API on top of a tokenizer, providing `peek()`, `expect()`, `expectSymbol()`, and iteration. |
## Usage

### 1. Define your token types

```java
public enum MyTokenType implements TokenType {
    IDENTIFIER,
    STRING,
    NUMBER,
    SYMBOL,
    COMMENT
}
```
### 2. Customize character classification

Override only the rules that differ from the defaults. For example, if your
language uses `--` for line comments and `-` as part of identifiers:

```java
public class MyClassifier extends TokenClassifier {
    @Override
    public boolean identifierSeparator(int ch) {
        return ch == '-';
    }

    @Override
    public boolean startString(int ch) {
        return ch == '\'' || ch == '"';
    }

    @Override
    public boolean startNumber(int ch) {
        return ch >= '0' && ch <= '9';
    }
}
```
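The effect of the separator rule is easiest to see in a standalone scan loop. The sketch below is not the library's `TokenizerBase`; it only shows how treating `-` as an identifier separator lets it join identifier parts while `--` still ends the identifier (so a comment can start):

```java
// Standalone sketch of identifier scanning with a separator character.
// Not the library's tokenizer; for illustration only.
public class HyphenScan {
    static boolean identifierChar(char c) { return Character.isLetterOrDigit(c); }
    static boolean identifierSeparator(char c) { return c == '-'; }

    /** Returns the identifier starting at 'off', honoring the separator rule. */
    static String scanIdentifier(String in, int off) {
        int i = off;
        while (i < in.length()) {
            char c = in.charAt(i);
            if (identifierChar(c)) { i++; continue; }
            // A separator only stays inside the identifier when followed by
            // a regular identifier character, so "--" ends the identifier.
            if (identifierSeparator(c) && i + 1 < in.length() && identifierChar(in.charAt(i + 1))) {
                i += 2;
                continue;
            }
            break;
        }
        return in.substring(off, i);
    }
}
```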
### 3. Implement a token factory

```java
public class MyTokenFactory implements TokenFactory<MyTokenType, Token<MyTokenType>> {
    @Override
    public Token<MyTokenType> identifierToken(char[] buf, int off, int len, int lineNo, int linePos) {
        return new Token<>(buf, off, len, MyTokenType.IDENTIFIER, lineNo, linePos);
    }

    @Override
    public Token<MyTokenType> stringToken(char[] buf, int off, int len, int lineNo, int linePos) {
        return new Token<>(buf, off, len, MyTokenType.STRING, lineNo, linePos);
    }

    @Override
    public Token<MyTokenType> numberToken(char[] buf, int off, int len, int lineNo, int linePos) {
        return new Token<>(buf, off, len, MyTokenType.NUMBER, lineNo, linePos);
    }

    @Override
    public Token<MyTokenType> symbolToken(char[] buf, int off, int len, int lineNo, int linePos) {
        return new Token<>(buf, off, len, MyTokenType.SYMBOL, lineNo, linePos);
    }

    @Override
    public Token<MyTokenType> genericToken(char[] buf, int off, int len, MyTokenType type, int lineNo, int linePos) {
        return new Token<>(buf, off, len, type, lineNo, linePos);
    }

    @Override
    public MyTokenType lineCommentTokenType() {
        return MyTokenType.COMMENT;
    }
}
```
### 4. Wire it together and parse

```java
var tokenizer = new TokenizerBase<>(new MyClassifier(), new MyTokenFactory(), reader, 4096, true);
var lexer = new Lexer<>(tokenizer);

while (lexer.hasNext()) {
    var token = lexer.next();
    System.out.printf("[%s] %s (line %d, col %d)%n",
                      token.type(), token, token.lineNo(), token.linePos());
}
```
Or with the `expect` / `peek` pattern for recursive-descent parsing:

```java
var name = lexer.expect("identifier", MyTokenType.IDENTIFIER);
lexer.expectSymbol("assignment", '=');
var value = lexer.expect("value");
lexer.expectSymbol("terminator", ';');
```
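The control flow behind this pattern is easy to see in isolation. Below is a minimal sketch over plain string tokens, not the library's `Lexer` (which adds token types, positions, and `LexerException`), just the same consume-and-validate shape:

```java
import java.util.List;

// Minimal sketch of the peek/expect pattern over a plain token list.
// Illustration of the control flow only, not the library's Lexer API.
public class MiniLexer {
    private final List<String> tokens;
    private int pos = 0;

    public MiniLexer(List<String> tokens) { this.tokens = tokens; }

    /** Look at the next token without consuming it; null at EOF. */
    String peek() { return pos < tokens.size() ? tokens.get(pos) : null; }

    /** Consume the next token, failing with a descriptive message at EOF. */
    String expect(String what) {
        if (pos >= tokens.size()) throw new IllegalStateException("Expected " + what + ", got EOF");
        return tokens.get(pos++);
    }

    /** Consume the next token and require it to equal the given symbol. */
    String expectSymbol(String what, String symbol) {
        String t = expect(what);
        if (!t.equals(symbol)) {
            throw new IllegalStateException("Expected " + what + " '" + symbol + "', got '" + t + "'");
        }
        return t;
    }

    public static void main(String[] args) {
        var lexer = new MiniLexer(List.of("answer", "=", "42", ";"));
        String name = lexer.expect("identifier");
        lexer.expectSymbol("assignment", "=");
        String value = lexer.expect("value");
        lexer.expectSymbol("terminator", ";");
        System.out.println(name + " -> " + value);  // answer -> 42
    }
}
```

The "what" argument exists purely to make failure messages read well, which is why every `expect` call in the examples names the thing being parsed.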
## Real-World Example: ASN.1 IDL Parser

The `utils-asn1` project uses `utils-lexer` to parse ASN.1 Interface
Definition Language files. It demonstrates all the extension points working
together on a non-trivial grammar.
### Token types for ASN.1

```java
public enum Asn1TokenType implements TokenType {
    IDENTIFIER,
    STRING,
    NUMBER,
    SYMBOL,
    COMMENT_MULTI_LINE, // /* ... */
    COMMENT_INLINE,     // -- comment --
    COMMENT,            // -- comment\n
    GROUP,
}
```
### Character classifier

ASN.1 identifiers can contain hyphens (`Type-Name`, `value-name`), use `'...'`
strings with suffixes (`'0110'B` for bit strings), and have `--` line comments:

```java
public class Asn1TokenClassifier extends TokenClassifier {
    @Override
    public boolean startString(int ch) {
        return ch == '\'' || ch == '"';
    }

    @Override
    public boolean allowAfterString(int ch) {
        return ch == 'B' || ch == 'H'; // bit string / hex octet string
    }

    @Override
    public boolean identifierSeparator(int ch) {
        return ch == '-';
    }

    @Override
    public boolean endIdentifier(int sep, int after) {
        return sep == '-' && after == '-'; // '--' starts a comment, not an identifier
    }

    @Override
    public boolean allowBeforeIdentifier(int ch) {
        return ch == '&' || ch == '@'; // field references and component relations
    }
}
```
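How a rule like `allowAfterString()` plays out during scanning can be sketched standalone. The semantics here are assumed for illustration (a quoted literal may keep a single trailing suffix character); this is not the library's tokenizer code:

```java
// Sketch of scanning a '...'-quoted literal that may carry a one-character
// suffix, as in ASN.1 bit strings '0110'B and hex strings 'DEAD'H.
// Assumed semantics for illustration; not the library's implementation.
public class SuffixedString {
    static boolean allowAfterString(char c) { return c == 'B' || c == 'H'; }

    /** Scans the quoted literal starting at 'off', including any suffix. */
    static String scanString(String in, int off) {
        int i = off + 1;                                   // skip opening quote
        while (i < in.length() && in.charAt(i) != '\'') i++;
        i++;                                               // include closing quote
        if (i < in.length() && allowAfterString(in.charAt(i))) i++;
        return in.substring(off, i);
    }
}
```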
### Tokenizer with multi-character symbols

ASN.1 has multi-character symbols like `::=`, `..`, and `[[`. The tokenizer
overrides `nextSymbol()` to handle these:

```java
public class Asn1Tokenizer extends TokenizerBase<Asn1TokenType, Asn1Token> {
    private final Path file;

    public Asn1Tokenizer(Path file, Reader reader) {
        super(new Asn1TokenClassifier(), new Asn1TokenFactory(), reader, 4096, true);
        this.file = file;
    }

    @Override
    public boolean skipTokenOnParseNext(Asn1Token token) {
        return token.type() == Asn1TokenType.COMMENT ||
               token.type() == Asn1TokenType.COMMENT_INLINE ||
               token.type() == Asn1TokenType.COMMENT_MULTI_LINE;
    }

    @Override
    protected Asn1Token nextSymbol() throws IOException {
        if (lastChar == '-') {
            // might be a '--' comment, or a negative number
            // ...
        } else if (lastChar == ':') {
            // might be '::=' assignment
            // ...
        }
        return super.nextSymbol();
    }
}
```
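The usual technique behind such an override is longest-match ("maximal munch") symbol recognition: try the known multi-character symbols before falling back to a single character. A standalone sketch of the idea, not the library's `nextSymbol()` code:

```java
import java.util.List;

// Sketch of longest-match symbol recognition. Illustration only,
// not the library's implementation.
public class SymbolScan {
    // Multi-character symbols listed longest-first, so "::=" wins over ":".
    static final List<String> SYMBOLS = List.of("::=", "..", "[[", "]]");

    static String scanSymbol(String in, int off) {
        for (String sym : SYMBOLS) {
            if (in.startsWith(sym, off)) return sym;
        }
        return in.substring(off, off + 1);   // fall back to a single character
    }
}
```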
### Lexer subclass

The ASN.1 lexer extends `Lexer` to expose skipped comments for documentation
attachment and to produce domain-specific parse exceptions:

```java
public class Asn1Lexer extends Lexer<Asn1TokenType, Asn1Token> {
    private final Path file;
    private final Tokenizer<Asn1TokenType, Asn1Token> tokenizer;

    public Asn1Lexer(Path file, Tokenizer<Asn1TokenType, Asn1Token> tokenizer) {
        super(tokenizer);
        this.file = file;
        this.tokenizer = tokenizer;
    }

    public List<Asn1Token> skippedComments() {
        return tokenizer.getSkippedTokens();
    }

    public void clearSkippedComments() {
        tokenizer.clearSkippedTokens();
    }
}
```
### Parsing with the lexer

With all the pieces in place, parsing an ASN.1 module reads naturally as a
recursive-descent parser:

```java
try (var reader = new InputStreamReader(inputStream)) {
    var lexer = new Asn1Lexer(file, new Asn1Tokenizer(file, reader));

    // Parse module header
    var moduleName = lexer.expect("Module name", Asn1Token::isTypeName);
    lexer.expect("DEFINITIONS", t -> t.is("DEFINITIONS"));
    lexer.expect("::=", t -> t.is(SYMBOL, "::="));
    lexer.expect("BEGIN", t -> t.is(IDENTIFIER, "BEGIN"));

    // Collect documentation comments
    lexer.peek();
    List<Asn1Token> docs = new ArrayList<>(lexer.skippedComments());
    lexer.clearSkippedComments();

    // Parse exports
    if (lexer.peek().is("EXPORTS")) {
        lexer.next();
        do {
            exports.add(lexer.expect("Exported ID", IDENTIFIER));
        } while (lexer.expectSymbol("separator", ';', ',').isSymbol(','));
    }

    // Parse assignments until END
    while (!lexer.peek("").is(IDENTIFIER, "END")) {
        var declName = lexer.expect("Name", IDENTIFIER);
        // ... parse type or value assignment
    }
}
```
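The exports loop above is the classic separated-list pattern: read an element, then either a `,` to continue or a `;` to stop. A standalone sketch with plain string tokens instead of the library's `Token` type:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the separated-list pattern: identifiers separated by ','
// and terminated by ';'. Illustration only, using plain strings.
public class SeparatedList {
    static List<String> parseExports(List<String> tokens) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (true) {
            out.add(tokens.get(pos++));        // like expect("Exported ID", ...)
            String sep = tokens.get(pos++);    // like expectSymbol("separator", ';', ',')
            if (sep.equals(";")) return out;
            if (!sep.equals(",")) {
                throw new IllegalStateException("Expected ',' or ';', got " + sep);
            }
        }
    }
}
```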
## API Reference

### `Token<Type>`

Each token carries its content (as a `CharSlice`), its type, and its position
in the source:

| Method | Description |
|---|---|
| `type()` | The token's type. |
| `lineNo()` | Line number (1-based). |
| `linePos()` | Column position (1-based). |
| `line()` | The full source line containing this token. |
| `is(String name)` | Check token text. |
| `is(Type type)` | Check token type. |
| `is(Type type, String name)` | Check both. |
| `isSymbol(char c)` | Check if the token is the given single-character symbol. |
| `decodeString(boolean strict)` | Decode string content, handling escape sequences. |
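The general shape of string decoding is stripping the quotes and resolving backslash escapes. The escape set below is assumed for illustration; consult the library's `Token.decodeString` documentation for the exact behavior (including what `strict` rejects):

```java
// Sketch of quoted-string decoding with common backslash escapes.
// Assumed behavior for illustration; not the library's decodeString.
public class Unescape {
    static String decode(String quoted) {
        String body = quoted.substring(1, quoted.length() - 1);  // strip quotes
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < body.length(); i++) {
            char c = body.charAt(i);
            if (c == '\\' && i + 1 < body.length()) {
                char e = body.charAt(++i);
                switch (e) {
                    case 'n' -> out.append('\n');
                    case 't' -> out.append('\t');
                    case '"' -> out.append('"');
                    case '\\' -> out.append('\\');
                    default -> out.append(e);   // unknown escape: keep the char
                }
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```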
### `Lexer<TT, T>`

| Method | Description |
|---|---|
| `next()` | Consume and return the next token, or null at EOF. |
| `hasNext()` | Check if more tokens remain. |
| `peek()` | Look ahead without consuming. |
| `peek(String what)` | Look ahead, throw on EOF. |
| `expect(String what)` | Consume next token, throw on EOF. |
| `expect(String what, TT type)` | Consume and validate type. |
| `expect(String what, Predicate<T>)` | Consume and validate with predicate. |
| `expectSymbol(String what, char...)` | Consume and validate as one of the given symbols. |
| `readUntil(String term, TT type, boolean allowEof)` | Read raw content until a terminator string. |
| `failure(T token, String msg, Object...)` | Create a `LexerException` pointing at a token. |
| `iterator()` | Iterate tokens (wraps checked exceptions in unchecked variants). |
## License

Licensed under the Apache License, Version 2.0.

See morimekta.net/utils for release procedures.