Class TrainingDataDumper

java.lang.Object
org.ek9lang.assist.TrainingDataDumper

public class TrainingDataDumper extends Object
Outputs Q&A pairs in JSONL format for LLM fine-tuning. Supports three output layouts: instruction/response format, basic chat format, and mlx_lm chat format with a system prompt and cross-reference resolution. Optionally includes Qwen3 thinking traces (<think>...</think>) when the Q&A has a thinking field and thinking mode is enabled. Used by the -Q, -Qc, -Qct, -Qt, and -Qtt CLI flags.
  • Constructor Details

    • TrainingDataDumper

      public TrainingDataDumper()
  • Method Details

    • dumpJsonl

      public void dumpJsonl(QuestionRegistry registry, PrintStream out)
      Dump all Q&A pairs as instruction/response JSONL to the given output stream. Each canonical question produces one entry, plus one entry per alternate phrasing.
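      The expansion rule described above (one entry for the canonical question, plus one per alternate phrasing) can be sketched as follows. This is only an illustration: the Qa record and the expand method are hypothetical stand-ins, not the real QuestionAndAnswer or QuestionRegistry types, and JSONL serialization is left out.

```java
import java.util.ArrayList;
import java.util.List;

final class InstructionEntriesSketch {
    /** Hypothetical stand-in for a Q&A with alternate phrasings (not the real class). */
    record Qa(String question, List<String> alternates, String answer) {}

    /** Returns {instruction, response} pairs: the canonical question first, then each alternate. */
    static List<String[]> expand(Qa qa) {
        List<String[]> pairs = new ArrayList<>();
        pairs.add(new String[] {qa.question(), qa.answer()});
        for (String alternate : qa.alternates()) {
            // Every alternate phrasing reuses the same answer text.
            pairs.add(new String[] {alternate, qa.answer()});
        }
        return pairs;
    }

    public static void main(String[] args) {
        Qa qa = new Qa("canonical question", List.of("alternate 1", "alternate 2"), "the answer");
        for (String[] pair : expand(qa)) {
            System.out.println(pair[0] + " -> " + pair[1]);
        }
    }
}
```

      Because every alternate repeats the same response, a Q&A with two alternates yields three entries in this layout.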
    • dumpChatJsonl

      public void dumpChatJsonl(QuestionRegistry registry, PrintStream out)
      Dump all Q&A pairs as chat-format JSONL (for models that expect conversational input).
    • dumpMlxChatJsonl

      public void dumpMlxChatJsonl(QuestionRegistry registry, PrintStream out)
      Dump all Q&A pairs as mlx_lm chat-format JSONL with system prompt and cross-reference resolution. Designed for Qwen3 LoRA fine-tuning with --mask-prompt. Only canonical questions are emitted; alternate phrasings are excluded to avoid duplicate responses that degrade training quality. Alternates remain in the source files for BM25F search via the -q CLI flag. Cross-references (See Q###) are resolved to inline summaries at generation time.
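      As a rough illustration of what one line of mlx_lm chat-format JSONL looks like, the sketch below builds a single "messages" object with system, user, and assistant roles. The class, the helper names, and the escaping rules are assumptions made for this example; they are not taken from TrainingDataDumper, whose own escapeJson may behave differently.

```java
final class MlxChatLineSketch {
    /** Minimal JSON string escaper (an assumption; not the library's escapeJson). */
    static String escapeJson(String value) {
        StringBuilder sb = new StringBuilder(value.length() + 16);
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '"' -> sb.append("\\\"");
                case '\\' -> sb.append("\\\\");
                case '\n' -> sb.append("\\n");
                case '\r' -> sb.append("\\r");
                case '\t' -> sb.append("\\t");
                default -> {
                    if (c < 0x20) {
                        // Remaining control characters use the \\uXXXX form.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
                }
            }
        }
        return sb.toString();
    }

    /** One chat message object with a role and escaped content. */
    static String message(String role, String content) {
        return "{\"role\": \"" + role + "\", \"content\": \"" + escapeJson(content) + "\"}";
    }

    /** One JSONL line: a single object holding the ordered message list. */
    static String chatLine(String system, String question, String answer) {
        return "{\"messages\": ["
            + message("system", system) + ", "
            + message("user", question) + ", "
            + message("assistant", answer) + "]}";
    }

    public static void main(String[] args) {
        System.out.println(chatLine("system prompt", "canonical question", "line 1\nline 2"));
    }
}
```

      Each Q&A becomes exactly one such line, so newlines inside an answer must be escaped rather than written literally.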
    • dumpMlxChatJsonl

      public void dumpMlxChatJsonl(QuestionRegistry registry, PrintStream out, boolean includeThinking)
      Dump mlx_lm chat-format JSONL with optional Qwen3 thinking traces. When includeThinking is true and a Q&A has a thinking field, the assistant response is prefixed with <think>...</think> reasoning before the answer.
    • writeTrainingData

      public int writeTrainingData(QuestionRegistry registry, Path outputDir) throws IOException
      Write mlx_lm training data directory structure with stratified train/valid split. Creates outputDir/train.jsonl and outputDir/valid.jsonl with a 90/10 split stratified by category. Only canonical questions are emitted; alternate phrasings are excluded to ensure each training entry has a unique, focused response.
      Parameters:
      registry - the Q&A registry
      outputDir - the directory to create (e.g., ./ek9_training_data)
      Returns:
      the total number of JSONL entries written (train + valid)
      Throws:
      IOException
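      A split stratified by category can be sketched as below: entries are grouped by category, and roughly 10% of each group goes to the validation set. The Entry and Split records, the fixed seed, and the rounding rule are all assumptions for illustration; the documentation above does not say how writeTrainingData shuffles or rounds.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

final class StratifiedSplitSketch {
    /** Hypothetical entry: a category label plus an already-serialized JSONL line. */
    record Entry(String category, String jsonLine) {}

    record Split(List<Entry> train, List<Entry> valid) {}

    static Split split(List<Entry> entries, double validFraction, long seed) {
        // Group by category; TreeMap keeps the iteration order deterministic.
        Map<String, List<Entry>> byCategory = new TreeMap<>();
        for (Entry e : entries) {
            byCategory.computeIfAbsent(e.category(), k -> new ArrayList<>()).add(e);
        }
        Random rng = new Random(seed);
        List<Entry> train = new ArrayList<>();
        List<Entry> valid = new ArrayList<>();
        for (List<Entry> group : byCategory.values()) {
            Collections.shuffle(group, rng);
            // Assumed rounding rule: nearest integer, so very small categories may contribute none.
            int validCount = (int) Math.round(group.size() * validFraction);
            valid.addAll(group.subList(0, validCount));
            train.addAll(group.subList(validCount, group.size()));
        }
        return new Split(train, valid);
    }
}
```

      Splitting within each category, rather than over the whole list, keeps the category proportions of train.jsonl and valid.jsonl close to those of the full data set.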
    • writeTrainingData

      public int writeTrainingData(QuestionRegistry registry, Path outputDir, boolean includeThinking) throws IOException
      Write mlx_lm training data with optional Qwen3 thinking traces. When includeThinking is true and a Q&A has a thinking field, the assistant response is prefixed with <think>...</think> reasoning before the answer.
      Throws:
      IOException
    • resolveReferences

      String resolveReferences(String text, QuestionRegistry registry)
      Resolve cross-references in text by replacing "See Q{id}" with the referenced Q&A's summary. If a referenced Q&A cannot be found, logs a warning and leaves the original text unchanged.
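      The replacement described above might look roughly like the following. The sketch assumes numeric ids, uses a plain Map of id to summary in place of QuestionRegistry, prints the warning to standard error, and keeps an unresolved "See Q{id}" exactly as written; none of these details are confirmed by the documentation.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ReferenceResolverSketch {
    // Assumption: ids are decimal digits, as the "See Q###" notation suggests.
    private static final Pattern SEE_REF = Pattern.compile("See Q(\\d+)");

    /** summariesById is a hypothetical stand-in for a registry lookup. */
    static String resolveReferences(String text, Map<String, String> summariesById) {
        Matcher matcher = SEE_REF.matcher(text);
        StringBuilder out = new StringBuilder();
        while (matcher.find()) {
            String summary = summariesById.get(matcher.group(1));
            if (summary == null) {
                System.err.println("warning: unresolved reference Q" + matcher.group(1));
                summary = matcher.group(); // keep the original "See Q{id}" text
            }
            // quoteReplacement guards against '$' or '\' inside the summary.
            matcher.appendReplacement(out, Matcher.quoteReplacement(summary));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> summaries = Map.of("12", "summary of Q12");
        System.out.println(resolveReferences("Intro. See Q12. Also See Q99.", summaries));
    }
}
```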
    • buildAssistantResponse

      String buildAssistantResponse(QuestionAndAnswer qa)
    • buildAssistantResponse

      String buildAssistantResponse(QuestionAndAnswer qa, boolean includeThinking)
      Build the assistant response, optionally prefixed with Qwen3 thinking traces. When includeThinking is true and the Q&A has a non-empty thinking field, the response is prefixed with <think>\n{thinking}\n</think>\n\n followed by the normal answer content.
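      The prefixing rule can be sketched as below. The Qa record is a hypothetical stand-in for QuestionAndAnswer, whose accessors are not shown in this documentation; only the <think> layout is taken from the description above.

```java
final class ThinkingPrefixSketch {
    /** Hypothetical stand-in for QuestionAndAnswer (not the real class). */
    record Qa(String answer, String thinking) {}

    static String buildAssistantResponse(Qa qa, boolean includeThinking) {
        String thinking = qa.thinking();
        if (includeThinking && thinking != null && !thinking.isEmpty()) {
            // Layout from the description: <think>\n{thinking}\n</think>\n\n then the answer.
            return "<think>\n" + thinking + "\n</think>\n\n" + qa.answer();
        }
        // Thinking disabled, or no thinking text: the answer alone.
        return qa.answer();
    }

    public static void main(String[] args) {
        Qa qa = new Qa("the answer", "the reasoning");
        System.out.println(buildAssistantResponse(qa, true));
    }
}
```

      With includeThinking set to false, or with an empty thinking field, the sketch returns the answer unchanged, so the same Q&A source can feed both thinking and non-thinking data sets.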
    • escapeJson

      static String escapeJson(String value)