

Installation:

    conda create -n prove python=3.10
    conda activate prove
    pip3 install -r requirements.txt

Usage:

    python evaluate.py --vlm <vlm_name> --response_json <response_json_path> --scores_path <output_json_path>

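To score several models in one go, the command above can be wrapped in a small driver. The following is a minimal sketch, not part of the repository: the helper names `build_eval_command` and `evaluate_all`, the model name `model_a`, and all file paths are hypothetical. Only the three flags (`--vlm`, `--response_json`, `--scores_path`) come from the usage line, and the accepted values for `--vlm` depend on `evaluate.py` itself.

```python
import subprocess
import sys
from pathlib import Path


def build_eval_command(vlm_name, response_json, scores_path):
    """Assemble the evaluate.py command line from the three documented flags."""
    return [
        sys.executable, "evaluate.py",
        "--vlm", vlm_name,
        "--response_json", str(response_json),
        "--scores_path", str(scores_path),
    ]


def evaluate_all(runs, scores_dir="scores", dry_run=True):
    """Build (and optionally run) one evaluation command per model.

    `runs` maps a model name to its response JSON path. Both the names and
    the paths are placeholders chosen for illustration.
    """
    commands = []
    for vlm_name, response_json in runs.items():
        scores_path = Path(scores_dir) / f"{vlm_name}.json"
        cmd = build_eval_command(vlm_name, response_json, scores_path)
        commands.append(cmd)
        if not dry_run:
            # check=True raises CalledProcessError if evaluate.py exits non-zero.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # Dry run: print the commands that would be executed, without running them.
    for cmd in evaluate_all({"model_a": "responses/model_a.json"}):
        print(" ".join(cmd))
```

With `dry_run=True` (the default here) nothing is executed, which makes it easy to inspect the commands first. Pass `dry_run=False` from the repository root, inside the `prove` environment, to actually run them. The sketch assumes the output directory already exists.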
Model | hscore | tscore | average |
---|---|---|---|
Qwen2 (2b) | 69.36 | 80.64 | 75.0 |
Intern-VL2 (2b) | 73.96 | 79.51 | 76.74 |
Phi-3.5-vision (4B) | 73.35 | 82.27 | 77.81 |
LLaVA-1.5 (7B) | 72.67 | 82.58 | 77.62 |
llava-next (7b) | 74.28 | 80.03 | 77.15 |
Intern-VL2 (8b) | 74.55 | 80.56 | 77.56 |
pixtral (12b) | 73.34 | 82.43 | 77.88 |
llava-1.5 (13b) | 72.46 | 82.4 | 77.43 |
Intern-VL2 (26b) | 74.63 | 79.23 | 76.93 |
claude3.5-sonnet | 71.06 | 77.31 | 74.19 |
gpt-4o-mini | 73.18 | 79.24 | 76.21 |
gemini-1.5-flash | 72.73 | 81.74 | 77.23 |
gpt-4o | 76.53 | 80.92 | 78.72 |
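
The `average` column appears to be the arithmetic mean of `hscore` and `tscore`. This is an observation from the numbers in the table, not something the text states. The sketch below copies the rows above, checks that observation to within rounding, and picks the top model per column. The helper names `mean_score`, `max_deviation`, and `top_model` are illustrative only.

```python
# Rows copied from the table above: (model, hscore, tscore, reported average).
RESULTS = [
    ("Qwen2 (2b)", 69.36, 80.64, 75.0),
    ("Intern-VL2 (2b)", 73.96, 79.51, 76.74),
    ("Phi-3.5-vision (4B)", 73.35, 82.27, 77.81),
    ("LLaVA-1.5 (7B)", 72.67, 82.58, 77.62),
    ("llava-next (7b)", 74.28, 80.03, 77.15),
    ("Intern-VL2 (8b)", 74.55, 80.56, 77.56),
    ("pixtral (12b)", 73.34, 82.43, 77.88),
    ("llava-1.5 (13b)", 72.46, 82.4, 77.43),
    ("Intern-VL2 (26b)", 74.63, 79.23, 76.93),
    ("claude3.5-sonnet", 71.06, 77.31, 74.19),
    ("gpt-4o-mini", 73.18, 79.24, 76.21),
    ("gemini-1.5-flash", 72.73, 81.74, 77.23),
    ("gpt-4o", 76.53, 80.92, 78.72),
]

COLUMNS = {"hscore": 1, "tscore": 2, "average": 3}


def mean_score(hscore, tscore):
    """Arithmetic mean of the two scores (assumed definition of `average`)."""
    return (hscore + tscore) / 2


def max_deviation(results):
    """Largest gap between the recomputed mean and the reported average."""
    return max(abs(mean_score(h, t) - avg) for _, h, t, avg in results)


def top_model(results, column):
    """Name of the model with the highest value in the given column."""
    idx = COLUMNS[column]
    return max(results, key=lambda row: row[idx])[0]


if __name__ == "__main__":
    print(f"max deviation from mean: {max_deviation(RESULTS):.4f}")
    for column in COLUMNS:
        print(f"best {column}: {top_model(RESULTS, column)}")
```

On these rows, every reported average agrees with the recomputed mean to within two-decimal rounding. gpt-4o leads on `hscore` and `average`, while LLaVA-1.5 (7B) has the highest `tscore`.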

    @misc{prabhu2024prove,
      title={Trust but Verify: Programmatic VLM Evaluation in the Wild},
      author={Viraj Prabhu and Senthil Purushwalkam and An Yan and Caiming Xiong and Ran Xu},
      year={2024},
      eprint={2410.13121},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2410.13121},
    }