- Initialize KV caches before moving model to device
- Disable flex_attention decoding to avoid torch.compile hang
- Remove unused compile step (controlled by cuda_compile setting)
flex_attention's create_block_mask triggers a torch.compile pass,
which can hang the system when called during model preload; the
preload ordering is sketched below.
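A minimal sketch of that ordering, assuming hypothetical helper and flag names
(the real handler may spell these differently):

```python
import torch

def preload(model, device: torch.device):
    # Build KV caches while the model is still on its original device so that
    # flex_attention's create_block_mask (and its torch.compile pass) never
    # runs mid-transfer.
    if hasattr(model, "_setup_caches"):            # hypothetical cache-init helper
        model._setup_caches()
    # Assumed config flag: fall back to the eager decode path instead of flex_attention.
    if hasattr(model.config, "use_flex_decoding"):
        model.config.use_flex_decoding = False
    model.to(device)                               # device move happens last
    return model
```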
Make external VQA handlers (moondream3, joytag, joycaption, deepseek)
compatible with VQA load/unload mechanism for consistent model lifecycle.
- Add vqa_detection.py with shared detection helpers
- Add load and unload functions to all external handlers
- Replace device_map="auto" with sd_models.move_model in joycaption
- Update dispatcher and moondream handlers to use shared helpers
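A rough sketch of the load/unload contract the external handlers now follow;
the repo id, dtype, and helper names here are assumptions, with model.to()
standing in for sd_models.move_model:

```python
import torch
import transformers

model = None
processor = None

def load(repo_id: str, device: torch.device, dtype: torch.dtype = torch.float16):
    # Load once, keep module-level references, and move the model explicitly
    # instead of relying on device_map="auto".
    global model, processor
    if model is None:
        processor = transformers.AutoProcessor.from_pretrained(repo_id)
        model = transformers.AutoModelForCausalLM.from_pretrained(repo_id, torch_dtype=dtype)
        model.to(device)
    return model, processor

def unload():
    # Drop references and let torch reclaim the memory.
    global model, processor
    model = None
    processor = None
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
```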
Refactor VQA module from module-level globals to a VQA class singleton
pattern with self-contained per-model loading methods.
Changes:
- Add VQA class with model/processor state and detection data storage
- Extract load methods for clean model pre-loading via UI
- Change interrogate() to return a string only; store detection data on the instance
- Add vqa_draw.py for bounding box/point annotation utilities
  (stub; further transfer of drawing functions to follow)
- Update moondream3.py to store detection data on VQA singleton
- Update endpoints.py and ui_caption.py for new return type
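The refactor roughly amounts to the shape below; attribute and method names
are illustrative rather than the exact repo API:

```python
class VQA:
    def __init__(self):
        self.model = None
        self.processor = None
        self.detection = None   # boxes/points from the last interrogate call

    def load_moondream3(self):
        # One self-contained loader per model family, callable from the UI.
        raise NotImplementedError

    def _run(self, image, question):
        # Model-specific generation lives in the handlers; stubbed here.
        raise NotImplementedError

    def interrogate(self, image, question: str) -> str:
        answer, self.detection = self._run(image, question)
        return answer           # string only; UI and endpoints read vqa.detection separately

vqa = VQA()                     # module-level singleton
```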
Replace the silent fallback to "Describe the image" with an explicit error
when the user selects "Use Prompt" but leaves the prompt field empty.
This follows the same pattern as the missing-image validation; the check
itself is sketched below.
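Something along these lines, with the exact message and exception type assumed:

```python
if task == "Use Prompt" and not (prompt or "").strip():
    raise ValueError("VQA: prompt is empty")  # no silent default prompt
```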
- Add interrogate_vlm_thinking_mode setting to save checkbox state
- Update ui_caption to restore Thinking Mode preference on load
- Add blank line before 'Answer:' label for visual separation
- Remove '\n\n' replacement in clean() that stripped blank lines
- Fix Qwen reasoning detection when <think> tag is in prompt, not response
- Add reasoning icon to Moondream 2 and 3 model names
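A hedged sketch of the reasoning split after these fixes: the tag is detected
in the model response rather than in the prompt, which may itself contain <think>:

```python
def split_reasoning(response: str):
    # Returns (reasoning, answer); reasoning is None when no <think> block was emitted.
    if "</think>" in response:
        reasoning, _, answer = response.partition("</think>")
        return reasoning.replace("<think>", "").strip(), answer.strip()
    return None, response.strip()
```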
- Add thinking_mode/reasoning parameter to enable reasoning mode
- Add Detect Gaze task with placeholder hint
- Parse point/detect results to return annotation data for visualization
- Handle keep_thinking setting: format as "Reasoning:\n...\nAnswer:\n..." or discard
- Add comprehensive debug logging throughout handler
- Add load_model() function to pre-load VLM into memory
- Add unload_model() function to free VLM from memory
- Add Load/Unload buttons to Caption tab UI
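Using a splitter like the one sketched earlier, the keep_thinking handling
comes down to the following (only the output layout is taken from the bullets
above; the function boundary is an assumption):

```python
def format_answer(raw_output: str, keep_thinking: bool) -> str:
    reasoning, answer = split_reasoning(raw_output)
    if reasoning and keep_thinking:
        # Layout per the bullets above, including the blank line before 'Answer:'.
        return f"Reasoning:\n{reasoning}\n\nAnswer:\n{answer}"
    return answer  # discard the reasoning block
```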
Comprehensive overhaul of the VQA interrogation system including:
- Prefill text support for guiding VLM responses
- Thinking mode support with tag cleanup/retention
- Dynamic prompt/task selection based on model type
- Bounding box visualization for detection results
- Debug infrastructure (SD_VQA_DEBUG env var)
- New model support: MiMo-VL, Nidum Gemma, Allura Gemma
- Model-specific prompt lists (Florence, Moondream)
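For the debug infrastructure, the sketch below uses a plain env-gated logger;
the project's own logger object and call style may differ:

```python
import logging
import os

log = logging.getLogger("vqa")
debug_enabled = os.environ.get("SD_VQA_DEBUG") is not None

def debug(msg: str):
    if debug_enabled:
        log.debug(msg)

debug("VQA interrogate: prefill and thinking mode applied")
```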
Add support for Moondream 3 Preview VLM with:
- Text query, caption, point, and detect capabilities
- Bounding box visualization for object detection
- Max pixels setting for resolution control
- Device offloading support
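Task dispatch in the handler likely follows the published Moondream remote-code
API (query/caption/point/detect); the signatures below match the Moondream 2
documentation and are assumed to carry over to the 3 Preview checkpoint:

```python
def run_task(model, image, task: str, prompt: str = ""):
    if task == "caption":
        return model.caption(image, length="normal")["caption"]
    if task == "query":
        return model.query(image, prompt)["answer"]
    if task == "detect":
        return model.detect(image, prompt)["objects"]  # bounding boxes for visualization
    if task == "point":
        return model.point(image, prompt)["points"]    # x/y points for visualization
    raise ValueError(f"unknown task: {task}")
```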
Improvements to the OpenCLIP interrogation:
- Sort all ranking dicts by similarity score (descending)
- Add format_category() helper for text formatting
- Add formatted text output for CLIP labels textbox
- Return additional text update in analyze_image()
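A minimal version of the sorting and formatting helpers (format_category is
named above; its signature here is an assumption):

```python
def sort_ranks(ranks: dict[str, float]) -> dict[str, float]:
    # Highest similarity first.
    return dict(sorted(ranks.items(), key=lambda kv: kv[1], reverse=True))

def format_category(name: str, ranks: dict[str, float], top: int = 5) -> str:
    ordered = list(sort_ranks(ranks).items())[:top]
    return f"{name}: " + ", ".join(f"{label} ({score:.2f})" for label, score in ordered)
```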
Two fixes for the JoyCaption handler:
- Only offload model if shared.opts.interrogate_offload is True
- Add max_pixels=1024*1024 to AutoProcessor for consistent image handling
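Both fixes are small; an illustrative form, with the max_pixels kwarg taken
from the bullet above and the offload guard reduced to a plain boolean:

```python
import torch
import transformers

def load_processor(repo_id: str):
    # Cap preprocessing resolution for consistent image handling.
    return transformers.AutoProcessor.from_pretrained(repo_id, max_pixels=1024 * 1024)

def maybe_offload(model, offload_enabled: bool):
    # Only move the model off the GPU when interrogate_offload is enabled.
    if offload_enabled:
        model.to(torch.device("cpu"))
    return model
```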