- Initialize KV caches before moving model to device
- Disable flex_attention decoding to avoid torch.compile hang
- Remove unused compile step (controlled by cuda_compile setting)
flex_attention's create_block_mask triggers torch compilation,
which can hang the system when called during model preload.
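A minimal sketch of the safer load order, with hypothetical attribute names
(`_setup_caches`, `use_flex_attention`); the real names depend on the model
implementation:

```python
import torch

def preload(model, device: torch.device):
    # Hypothetical hook: set up KV caches while the model is still on CPU,
    # before any weights move to the target device.
    if hasattr(model, "_setup_caches"):
        model._setup_caches(max_batch_size=1, max_seq_len=2048)

    # Skip the flex_attention decode path: create_block_mask invokes
    # torch.compile internally, which can hang when run during preload.
    cfg = getattr(model, "config", None)
    if cfg is not None and hasattr(cfg, "use_flex_attention"):
        cfg.use_flex_attention = False

    model.to(device)
    return model
```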
Make external VQA handlers (moondream3, joytag, joycaption, deepseek)
compatible with the VQA load/unload mechanism for a consistent model
lifecycle; see the handler sketch after this list.
- Add vqa_detection.py with shared detection helpers
- Add load and unload functions to all external handlers
- Replace device_map="auto" with sd_models.move_model in joycaption
- Update dispatcher and moondream handlers to use shared helpers
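A minimal sketch of the shared load/unload contract for one external handler;
the `modules.devices`/`modules.sd_models` paths follow SD.Next conventions,
and the repo id is left to the handler config:

```python
from transformers import AutoModelForCausalLM, AutoProcessor
from modules import devices, sd_models  # SD.Next helper modules (assumed paths)

model = None
processor = None

def load(repo: str):
    """Load the handler model once; repo id comes from the handler config."""
    global model, processor
    if model is None:
        processor = AutoProcessor.from_pretrained(repo)
        # Load without device_map="auto", then move explicitly so the shared
        # offload logic stays in control of device placement.
        model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype=devices.dtype)
        sd_models.move_model(model, devices.device)
    return model, processor

def unload():
    """Release the model and return VRAM to the pool."""
    global model, processor
    if model is not None:
        sd_models.move_model(model, devices.cpu)
    model = None
    processor = None
    devices.torch_gc()
```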
Refactor the VQA module from module-level globals to a VQA class singleton
with self-contained per-model load methods; a sketch of the resulting class
follows the change list.
Changes:
- Add VQA class with model/processor state and detection data storage
- Extract load methods for clean model pre-loading via UI
- Change interrogate() to return a string only; store detection data on the instance
- Add vqa_draw.py for bounding box/point annotation utilities
  (stub; further migration of drawing functions to follow)
- Update moondream3.py to store detection data on VQA singleton
- Update endpoints.py and ui_caption.py for new return type
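A sketch of the described singleton shape; the dispatch and per-model method
names beyond interrogate() are assumptions:

```python
from typing import Any, Optional

class VQA:
    """Holds model/processor state plus detection data from the last query."""
    def __init__(self):
        self.model: Any = None
        self.processor: Any = None
        self.loaded: Optional[str] = None  # name of the currently loaded backend
        self.detection: list = []          # boxes/points from the last detect call

    def unload(self):
        self.model = None
        self.processor = None
        self.loaded = None

    def load(self, name: str):
        # dispatch to a per-model load method, e.g. load_moondream(),
        # so the UI can pre-load a backend without running a query
        if self.loaded == name:
            return
        self.unload()
        getattr(self, f'load_{name}')()
        self.loaded = name

    def interrogate(self, question: str, image) -> str:
        # returns a plain string; boxes/points from detect-style queries are
        # stored on self.detection for endpoints.py / ui_caption.py to read
        self.detection = []
        if self.model is None:
            return ''
        # backend-specific inference would run here
        return ''

vqa = VQA()  # module-level singleton shared by endpoints and UI
```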
Add support for the Moondream 3 Preview VLM (usage sketch after the list) with:
- Text query, caption, point, and detect capabilities
- Bounding box visualization for object detection
- Max pixels setting for resolution control
- Device offloading support
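A hedged usage sketch, assuming the Moondream 3 preview checkpoint keeps the
query/caption/detect/point remote-code API of earlier Moondream releases; the
repo id is inferred from the handler name:

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "moondream/moondream3-preview",  # assumed repo id
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
model.to("cuda")

image = Image.open("input.png")
print(model.query(image, "What is in this image?")["answer"])  # text query
print(model.caption(image, length="short")["caption"])         # caption
boxes = model.detect(image, "face")["objects"]    # normalized bounding boxes
points = model.point(image, "person")["points"]   # normalized point centers
```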