org.ek9lang.assist.TrainingDataDumper

public class TrainingDataDumper extends Object

Outputs Q&A pairs in JSONL format for LLM fine-tuning. Supports three formats: instruction/response, basic chat, and mlx_lm chat with system prompt and cross-reference resolution. Optionally includes Qwen3 thinking traces (<think>...</think>) when the Q&A has a thinking field and thinking mode is enabled. Used by the -Q, -Qc, -Qct, -Qt, -Qtt CLI flags.

Field Summary

Fields

Modifier and Type

Field

Description

(package private) static final String

EK9_SYSTEM_PROMPT
Constructor Summary

Constructors

Constructor

Description

TrainingDataDumper()
Method Summary

Modifier and Type

Method

Description

(package private) String

buildAssistantResponse(QuestionAndAnswer qa)

(package private) String

buildAssistantResponse(QuestionAndAnswer qa, boolean includeThinking)

Build the assistant response, optionally prefixed with Qwen3 thinking traces.

void

dumpChatJsonl(QuestionRegistry registry, PrintStream out)

Dump all Q&A pairs as chat-format JSONL (for models expecting conversation).

void

dumpJsonl(QuestionRegistry registry, PrintStream out)

Dump all Q&A pairs as instruction/response JSONL to the given output stream.

void

dumpMlxChatJsonl(QuestionRegistry registry, PrintStream out)

Dump all Q&A pairs as mlx_lm chat-format JSONL with system prompt and cross-reference resolution.

void

dumpMlxChatJsonl(QuestionRegistry registry, PrintStream out, boolean includeThinking)

Dump mlx_lm chat-format JSONL with optional Qwen3 thinking traces.

(package private) static String

escapeJson(String value)

(package private) String

resolveReferences(String text, QuestionRegistry registry)

Resolve cross-references in text by replacing "See Q{id}" with the referenced Q&A's summary.

int

writeTrainingData(QuestionRegistry registry, Path outputDir)

Write mlx_lm training data directory structure with stratified train/valid split.

int

writeTrainingData(QuestionRegistry registry, Path outputDir, boolean includeThinking)

Write mlx_lm training data with optional Qwen3 thinking traces.

Methods inherited from class Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Details
- EK9_SYSTEM_PROMPT
  static final String EK9_SYSTEM_PROMPT
  
  See Also:
  
  Constant Field Values
Constructor Details
- TrainingDataDumper
  
  public TrainingDataDumper()
Method Details
- dumpJsonl
  
  public void dumpJsonl(QuestionRegistry registry, PrintStream out)
  
  Dump all Q&A pairs as instruction/response JSONL to the given output stream. Each canonical question produces one entry, plus one entry per alternate phrasing.
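
The entry count described above (one per canonical question plus one per alternate phrasing) can be illustrated with a minimal sketch. The `Qa` and `Pair` records are hypothetical stand-ins, not the real `QuestionAndAnswer` API, and it is an assumption that every alternate phrasing shares the canonical answer; JSON serialisation is left out.

```java
import java.util.ArrayList;
import java.util.List;

public final class InstructionEntriesSketch {

  // Hypothetical stand-in for a Q&A with alternate phrasings (not the real class).
  record Qa(String question, List<String> alternates, String answer) { }

  // One instruction/response pair, before it is serialised to a JSONL line.
  record Pair(String instruction, String response) { }

  // Expands one Q&A into 1 + alternates.size() pairs sharing the same answer.
  static List<Pair> expand(Qa qa) {
    List<Pair> pairs = new ArrayList<>();
    pairs.add(new Pair(qa.question(), qa.answer()));
    for (String alternate : qa.alternates()) {
      pairs.add(new Pair(alternate, qa.answer()));
    }
    return pairs;
  }

  public static void main(String[] args) {
    Qa qa = new Qa("canonical question", List.of("alternate phrasing"), "answer");
    expand(qa).forEach(System.out::println);
  }
}
```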
- dumpChatJsonl
  
  public void dumpChatJsonl(QuestionRegistry registry, PrintStream out)
  
  Dump all Q&A pairs as chat-format JSONL (for models expecting conversation).
- dumpMlxChatJsonl
  
  public void dumpMlxChatJsonl(QuestionRegistry registry, PrintStream out)
  
  Dump all Q&A pairs as mlx_lm chat-format JSONL with system prompt and cross-reference resolution. Designed for Qwen3 LoRA fine-tuning with --mask-prompt. Only emits canonical questions — alternate phrasings are excluded to avoid duplicate responses that degrade training quality. Alternates remain in the source files for BM25F search via the -q CLI flag. Cross-references (See Q###) are resolved to inline summaries at generation time.
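
As a rough illustration of what one emitted line might look like, the sketch below builds a single `messages` record with system, user, and assistant roles, which is the general shape mlx_lm accepts for chat data. The class name, the `toLine` helper, and the local `escapeJson` are assumptions written for this example; they are not the class's actual implementation, and the real value of `EK9_SYSTEM_PROMPT` is not shown here.

```java
public final class MlxChatLineSketch {

  // Minimal JSON string escaping; an assumption about what escapeJson needs to cover.
  static String escapeJson(String value) {
    StringBuilder sb = new StringBuilder(value.length() + 16);
    for (int i = 0; i < value.length(); i++) {
      char c = value.charAt(i);
      switch (c) {
        case '"' -> sb.append("\\\"");
        case '\\' -> sb.append("\\\\");
        case '\n' -> sb.append("\\n");
        case '\r' -> sb.append("\\r");
        case '\t' -> sb.append("\\t");
        default -> {
          if (c < 0x20) {
            // Remaining control characters must be written as \\uXXXX in JSON.
            sb.append(String.format("\\u%04x", (int) c));
          } else {
            sb.append(c);
          }
        }
      }
    }
    return sb.toString();
  }

  // Builds one JSONL line: a single JSON object with no raw newlines.
  static String toLine(String systemPrompt, String question, String answer) {
    return "{\"messages\":["
        + "{\"role\":\"system\",\"content\":\"" + escapeJson(systemPrompt) + "\"},"
        + "{\"role\":\"user\",\"content\":\"" + escapeJson(question) + "\"},"
        + "{\"role\":\"assistant\",\"content\":\"" + escapeJson(answer) + "\"}"
        + "]}";
  }

  public static void main(String[] args) {
    System.out.println(toLine("placeholder system prompt", "question", "line one\nline two"));
  }
}
```

Escaping newlines matters here because JSONL requires exactly one JSON object per physical line.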
- dumpMlxChatJsonl
  
  public void dumpMlxChatJsonl(QuestionRegistry registry, PrintStream out, boolean includeThinking)
  
  Dump mlx_lm chat-format JSONL with optional Qwen3 thinking traces. When includeThinking is true and a Q&A has a thinking field, the assistant response is prefixed with <think>...</think> reasoning before the answer.
- writeTrainingData
  
  public int writeTrainingData(QuestionRegistry registry, Path outputDir) throws IOException
  
  Write mlx_lm training data directory structure with stratified train/valid split. Creates outputDir/train.jsonl and outputDir/valid.jsonl with a 90/10 split stratified by category. Only canonical questions are emitted — alternate phrasings are excluded to ensure each training entry has a unique, focused response.
  
  Parameters:
  
  registry - the Q&A registry
  
  outputDir - the directory to create (e.g., ./ek9_training_data)
  
  Returns:
  
  the total number of JSONL entries written (train + valid)
  
  Throws:
  
  IOException
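
One way to picture a 90/10 split stratified by category is sketched below. How the real method chooses validation entries is not documented here, so the every-tenth-entry rule, the `Entry` and `Split` records, and the `write` helper are all assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public final class StratifiedSplitSketch {

  // Hypothetical pairing of a category with an already serialised JSONL line.
  record Entry(String category, String jsonLine) { }

  record Split(List<String> train, List<String> valid) { }

  // Groups entries by category, then sends every tenth entry of each group to valid.
  static Split split(List<Entry> entries) {
    Map<String, List<Entry>> byCategory = new LinkedHashMap<>();
    for (Entry entry : entries) {
      byCategory.computeIfAbsent(entry.category(), k -> new ArrayList<>()).add(entry);
    }
    List<String> train = new ArrayList<>();
    List<String> valid = new ArrayList<>();
    for (List<Entry> group : byCategory.values()) {
      for (int i = 0; i < group.size(); i++) {
        (i % 10 == 9 ? valid : train).add(group.get(i).jsonLine());
      }
    }
    return new Split(train, valid);
  }

  // Writes train.jsonl and valid.jsonl under outputDir and returns the total entry count.
  static int write(List<Entry> entries, Path outputDir) throws IOException {
    Split split = split(entries);
    Files.createDirectories(outputDir);
    Files.write(outputDir.resolve("train.jsonl"), split.train());
    Files.write(outputDir.resolve("valid.jsonl"), split.valid());
    return split.train().size() + split.valid().size();
  }

  public static void main(String[] args) throws IOException {
    List<Entry> entries = new ArrayList<>();
    for (int i = 0; i < 20; i++) {
      entries.add(new Entry("example-category", "{\"n\":" + i + "}"));
    }
    System.out.println(write(entries, Files.createTempDirectory("split_sketch")));
  }
}
```

With this particular rule, a category holding fewer than ten entries contributes nothing to the validation file; a real implementation may handle small categories differently.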
- writeTrainingData
  
  public int writeTrainingData(QuestionRegistry registry, Path outputDir, boolean includeThinking) throws IOException
  
  Write mlx_lm training data with optional Qwen3 thinking traces. When includeThinking is true and a Q&A has a thinking field, the assistant response is prefixed with <think>...</think> reasoning before the answer.
  
  Throws:
  
  IOException
- resolveReferences
  
  String resolveReferences(String text, QuestionRegistry registry)
  
  Resolve cross-references in text by replacing "See Q{id}" with the referenced Q&A's summary. If a referenced Q&A cannot be found, logs a warning and leaves the original text unchanged.
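
The replacement described above can be sketched with a regular expression. This is only an approximation: a plain `Map` stands in for `QuestionRegistry`, the ids are assumed to be numeric (following the "See Q###" form), and an unresolved reference is assumed to be left in place individually while other references in the same text are still resolved.

```java
import java.util.Map;
import java.util.logging.Logger;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class ResolveReferencesSketch {

  private static final Logger LOG = Logger.getLogger(ResolveReferencesSketch.class.getName());

  // Assumed reference syntax: the literal text "See Q" followed by digits.
  private static final Pattern REFERENCE = Pattern.compile("See Q(\\d+)");

  // summariesById is a hypothetical stand-in for a registry lookup.
  static String resolveReferences(String text, Map<String, String> summariesById) {
    return REFERENCE.matcher(text).replaceAll(match -> {
      String summary = summariesById.get(match.group(1));
      if (summary == null) {
        LOG.warning("Unresolved cross-reference: " + match.group());
        // Keep the original reference text when the lookup fails.
        return Matcher.quoteReplacement(match.group());
      }
      // quoteReplacement stops '$' and '\' in a summary being read as group syntax.
      return Matcher.quoteReplacement(summary);
    });
  }

  public static void main(String[] args) {
    Map<String, String> summaries = Map.of("12", "summary of Q12");
    System.out.println(resolveReferences("Intro. See Q12. See Q99.", summaries));
  }
}
```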
- buildAssistantResponse
  
  String buildAssistantResponse(QuestionAndAnswer qa)
- buildAssistantResponse
  
  String buildAssistantResponse(QuestionAndAnswer qa, boolean includeThinking)
  
  Build the assistant response, optionally prefixed with Qwen3 thinking traces. When includeThinking is true and the Q&A has a non-empty thinking field, the response is prefixed with <think>\n{thinking}\n</think>\n\n followed by the normal answer content.
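
The prefixing rule above is small enough to show directly. In this sketch the `Qa` record is a hypothetical stand-in for `QuestionAndAnswer`, whose real accessors are not shown on this page, and "non-empty" is assumed to mean non-null with length greater than zero.

```java
public final class AssistantResponseSketch {

  // Hypothetical stand-in for QuestionAndAnswer (not the real class).
  record Qa(String answer, String thinking) { }

  // Prefixes the answer with a <think> block only when requested and available.
  static String buildAssistantResponse(Qa qa, boolean includeThinking) {
    String thinking = qa.thinking();
    if (includeThinking && thinking != null && !thinking.isEmpty()) {
      return "<think>\n" + thinking + "\n</think>\n\n" + qa.answer();
    }
    return qa.answer();
  }

  public static void main(String[] args) {
    Qa qa = new Qa("the answer", "the reasoning");
    System.out.println(buildAssistantResponse(qa, true));
    System.out.println(buildAssistantResponse(qa, false));
  }
}
```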
- escapeJson
  
  static String escapeJson(String value)
