Class TrainingDataDumper
java.lang.Object
org.ek9lang.assist.TrainingDataDumper
Outputs Q&A pairs in JSONL format for LLM fine-tuning.
Supports instruction/response format, basic chat format, and mlx_lm chat format
with system prompt and cross-reference resolution.
Optionally includes Qwen3 thinking traces (<think>...</think>) when
the Q&A has a thinking field and thinking mode is enabled.
Used by the -Q, -Qc, -Qct, -Qt, -Qtt CLI flags.
Field Summary
EK9_SYSTEM_PROMPT
Constructor Summary
TrainingDataDumper()
Method Summary
Modifier and Type | Method | Description
(package private) String | |
(package private) String | buildAssistantResponse(QuestionAndAnswer qa, boolean includeThinking) | Build the assistant response, optionally prefixed with Qwen3 thinking traces.
void | dumpChatJsonl(QuestionRegistry registry, PrintStream out) | Dump all Q&A pairs as chat-format JSONL (for models expecting conversation).
void | dumpJsonl(QuestionRegistry registry, PrintStream out) | Dump all Q&A pairs as instruction/response JSONL to the given output stream.
void | dumpMlxChatJsonl(QuestionRegistry registry, PrintStream out) | Dump all Q&A pairs as mlx_lm chat-format JSONL with system prompt and cross-reference resolution.
void | dumpMlxChatJsonl(QuestionRegistry registry, PrintStream out, boolean includeThinking) | Dump mlx_lm chat-format JSONL with optional Qwen3 thinking traces.
(package private) static String | escapeJson(String value) |
(package private) String | resolveReferences(String text, QuestionRegistry registry) | Resolve cross-references in text by replacing "See Q{id}" with the referenced Q&A's summary.
int | writeTrainingData(QuestionRegistry registry, Path outputDir) | Write mlx_lm training data directory structure with stratified train/valid split.
int | writeTrainingData(QuestionRegistry registry, Path outputDir, boolean includeThinking) | Write mlx_lm training data with optional Qwen3 thinking traces.
Field Details
EK9_SYSTEM_PROMPT
Constructor Details
TrainingDataDumper
public TrainingDataDumper()
Method Details
dumpJsonl
public void dumpJsonl(QuestionRegistry registry, PrintStream out)
Dump all Q&A pairs as instruction/response JSONL to the given output stream. Each canonical question produces one entry, plus one entry per alternate phrasing.
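The one-entry-per-phrasing rule can be sketched as follows. This is a minimal illustration, not the actual implementation: the class name, the flattened parameters (standing in for QuestionRegistry and QuestionAndAnswer, whose APIs are not shown on this page), and the JSON key names "instruction" and "response" are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; names and JSON keys are assumptions, not the real API.
final class InstructionJsonlSketch {

  // Emits one line for the canonical question, then one line per alternate phrasing.
  // Every line carries the same answer.
  static List<String> toJsonlLines(String canonical, List<String> alternates, String answer) {
    List<String> lines = new ArrayList<>();
    lines.add(entry(canonical, answer));
    for (String alternate : alternates) {
      lines.add(entry(alternate, answer));
    }
    return lines;
  }

  // Builds a single JSON object on one line (key names are assumed).
  static String entry(String question, String answer) {
    return "{\"instruction\": \"" + escape(question)
        + "\", \"response\": \"" + escape(answer) + "\"}";
  }

  // Minimal escaping for the sketch: backslash, double quote and newline only.
  static String escape(String s) {
    return s.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n");
  }
}
```

A question with two alternate phrasings therefore yields three lines that share one answer, which is the duplication the mlx_lm variant avoids.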
dumpChatJsonl
public void dumpChatJsonl(QuestionRegistry registry, PrintStream out)
Dump all Q&A pairs as chat-format JSONL (for models expecting conversation).
dumpMlxChatJsonl
public void dumpMlxChatJsonl(QuestionRegistry registry, PrintStream out)
Dump all Q&A pairs as mlx_lm chat-format JSONL with system prompt and cross-reference resolution. Designed for Qwen3 LoRA fine-tuning with --mask-prompt. Only canonical questions are emitted; alternate phrasings are excluded to avoid duplicate responses that degrade training quality. Alternates remain in the source files for BM25F search via the -q CLI flag. Cross-references (See Q###) are resolved to inline summaries at generation time.
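For orientation, mlx_lm's chat data format stores each example as one JSON object with a "messages" array of role/content pairs. The sketch below shows how a single line with a system prompt might be assembled. The class and method names are hypothetical, and the system prompt is passed in as a parameter because the value of EK9_SYSTEM_PROMPT is not shown on this page.

```java
// Hypothetical sketch of one mlx_lm chat-format line; not the actual implementation.
final class MlxChatLineSketch {

  // Builds one JSONL line: a system message, the user question, and the assistant answer.
  static String chatLine(String systemPrompt, String question, String answer) {
    return "{\"messages\": ["
        + message("system", systemPrompt) + ", "
        + message("user", question) + ", "
        + message("assistant", answer) + "]}";
  }

  // Renders a single {"role": ..., "content": ...} object.
  static String message(String role, String content) {
    return "{\"role\": \"" + role + "\", \"content\": \"" + escape(content) + "\"}";
  }

  // Minimal escaping for the sketch: backslash, double quote and newline only.
  static String escape(String s) {
    return s.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n");
  }
}
```

With --mask-prompt, mlx_lm computes the loss only on the final assistant message, so the system and user messages act purely as context.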
dumpMlxChatJsonl
public void dumpMlxChatJsonl(QuestionRegistry registry, PrintStream out, boolean includeThinking)
Dump mlx_lm chat-format JSONL with optional Qwen3 thinking traces. When includeThinking is true and a Q&A has a thinking field, the assistant response is prefixed with <think>...</think> reasoning before the answer.
writeTrainingData
public int writeTrainingData(QuestionRegistry registry, Path outputDir) throws IOException
Write mlx_lm training data directory structure with stratified train/valid split. Creates outputDir/train.jsonl and outputDir/valid.jsonl with a 90/10 split stratified by category. Only canonical questions are emitted; alternate phrasings are excluded to ensure each training entry has a unique, focused response.
Parameters:
registry - the Q&A registry
outputDir - the directory to create (e.g., ./ek9_training_data)
Returns:
the total number of JSONL entries written (train + valid)
Throws:
IOException
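A 90/10 split stratified by category can be sketched as below. This is only an illustration of the idea: the class name, the map-of-lists input, and the rule that every tenth entry of a category goes to the validation set are assumptions, and the actual selection or rounding strategy may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a stratified train/valid split; not the actual implementation.
final class StratifiedSplitSketch {

  // Holds the entries destined for train.jsonl and valid.jsonl.
  static final class Split {
    final List<String> train = new ArrayList<>();
    final List<String> valid = new ArrayList<>();
  }

  // Splits each category on its own so both files keep roughly the same category mix.
  // Assumed rule: within a category, every 10th entry goes to valid, the rest to train.
  // A category with fewer than 10 entries contributes nothing to valid under this rule.
  static Split split(Map<String, List<String>> entriesByCategory) {
    Split result = new Split();
    for (List<String> entries : entriesByCategory.values()) {
      for (int i = 0; i < entries.size(); i++) {
        if (i % 10 == 9) {
          result.valid.add(entries.get(i));
        } else {
          result.train.add(entries.get(i));
        }
      }
    }
    return result;
  }
}
```

Splitting per category rather than over the whole list keeps a small category from landing entirely in one file by chance.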
writeTrainingData
public int writeTrainingData(QuestionRegistry registry, Path outputDir, boolean includeThinking) throws IOException
Write mlx_lm training data with optional Qwen3 thinking traces. When includeThinking is true and a Q&A has a thinking field, the assistant response is prefixed with <think>...</think> reasoning before the answer.
Throws:
IOException
resolveReferences
String resolveReferences(String text, QuestionRegistry registry)
Resolve cross-references in text by replacing "See Q{id}" with the referenced Q&A's summary. If a referenced Q&A cannot be found, logs a warning and leaves the original text unchanged.
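The replacement step might look roughly like the sketch below. Several things here are assumptions: ids are taken to be numeric, a plain Map stands in for the lookup on QuestionRegistry, the warning goes to System.err rather than a real logger, and an unresolved reference is left in place while other references in the same text are still resolved.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of cross-reference resolution; not the actual implementation.
final class ReferenceResolverSketch {

  // Assumes ids are numeric, as in "See Q123".
  private static final Pattern REFERENCE = Pattern.compile("See Q(\\d+)");

  // summariesById is a stand-in for looking up a Q&A summary in the registry.
  static String resolve(String text, Map<String, String> summariesById) {
    Matcher matcher = REFERENCE.matcher(text);
    StringBuilder resolved = new StringBuilder();
    while (matcher.find()) {
      String summary = summariesById.get(matcher.group(1));
      if (summary == null) {
        System.err.println("Warning: unresolved reference Q" + matcher.group(1));
        summary = matcher.group(); // keep the original "See Q{id}" text
      }
      // quoteReplacement stops '$' and '\' in the summary being read as group references.
      matcher.appendReplacement(resolved, Matcher.quoteReplacement(summary));
    }
    matcher.appendTail(resolved);
    return resolved.toString();
  }
}
```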
buildAssistantResponse
buildAssistantResponse
String buildAssistantResponse(QuestionAndAnswer qa, boolean includeThinking)
Build the assistant response, optionally prefixed with Qwen3 thinking traces. When includeThinking is true and the Q&A has a non-empty thinking field, the response is prefixed with <think>\n{thinking}\n</think>\n\n followed by the normal answer content.
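The prefixing rule described above can be sketched in a few lines. The class name and the flattened parameters (answer and thinking passed as plain strings instead of a QuestionAndAnswer) are assumptions for illustration, as is treating a null thinking value the same as an empty one.

```java
// Hypothetical sketch of the thinking-trace prefix; not the actual implementation.
final class ThinkingPrefixSketch {

  // Returns the answer, prefixed with a <think> block only when thinking mode is on
  // and a non-empty thinking trace is available.
  static String buildAssistantResponse(String answer, String thinking, boolean includeThinking) {
    if (includeThinking && thinking != null && !thinking.isEmpty()) {
      return "<think>\n" + thinking + "\n</think>\n\n" + answer;
    }
    return answer;
  }
}
```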
escapeJson
static String escapeJson(String value)
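This page carries no description for escapeJson. As a rough guide to what a helper with this signature typically does, the sketch below escapes a string for use inside a JSON string literal, following the JSON rules for quotes, backslashes, and control characters. It is an assumption about the behaviour, not the actual implementation.

```java
// Hypothetical sketch of JSON string escaping; the real escapeJson may differ.
final class JsonEscapeSketch {

  // Escapes characters that may not appear raw inside a JSON string literal.
  static String escapeJson(String value) {
    StringBuilder escaped = new StringBuilder(value.length() + 16);
    for (int i = 0; i < value.length(); i++) {
      char c = value.charAt(i);
      switch (c) {
        case '"':
          escaped.append("\\\"");
          break;
        case '\\':
          escaped.append("\\\\");
          break;
        case '\n':
          escaped.append("\\n");
          break;
        case '\r':
          escaped.append("\\r");
          break;
        case '\t':
          escaped.append("\\t");
          break;
        default:
          if (c < 0x20) {
            // Remaining control characters use the \\uXXXX form.
            escaped.append(String.format("\\u%04x", (int) c));
          } else {
            escaped.append(c);
          }
      }
    }
    return escaped.toString();
  }
}
```

Escaping newlines matters for JSONL in particular, since each record must stay on a single physical line.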