Goto

Collaborating Authors

 evaluation result


Visual Programming for Text to Image Generation and Evaluation

Neural Information Processing Systems

As large language models have demonstrated impressive performance in many domains, recent works have adopted language models (LMs) as controllers of visual modules for vision-and-language tasks. While existing work focuses on equipping LMs with visual understanding, we propose two novel interpretable/explainable visual programming frameworks for text-to-image (T2I) generation and evaluation. First, we introduce VPGEN, an interpretable step-by-step T2I generation framework that decomposes T2I generation into three steps: object/count generation, layout generation, and image generation. We employ an LM to handle the first two steps (object/count generation and layout generation), by finetuning it on textlayout pairs. Our step-by-step T2I generation framework provides stronger spatial control than end-to-end models, the dominant approach for this task.


Discussion of Evaluation Methodologies

Neural Information Processing Systems

In previous research, there are plenty of arguments about textual backdoor evaluation, including diverse metrics and experiment settings. These valuable discussions motivate us to construct a rigorous benchmark and we highly appreciate their efforts. In this section, we briefly summarize existing opinions and provide a more detailed discussion on this topic. Table 9 summarizes the attackers OpenBackdoorimplements. Effectiveness Besides the mainstream ASR (also called LFR [20]) and CACC metrics, there are also other effectiveness metrics. Shen et al. [46] proposed to count the number of inserted triggers that can successfully flip the label. However, although inserting more triggers could benefit attack strength, the triggers also corrupt the sentences gradually, so it is also possible that the poisoned samples become "adversarial", and we can hardly distinguish. Shen et al. [45] also mentioned this issue, and they advised calculating the ASR difference between a poisoned model and a clean model as an effectiveness metric.








A Training and

Neural Information Processing Systems

All models were trained on single GPUs, except for SchNet when trained on OC20-2M, which required 3 GPUs. Tables 9-12 present the extended results on OC20 across the 4 separate S2EF validation sets. Table 9: Evaluation results on the OC20 S2EF in-distribution validation set. In Table 13, we present the performance and inference throughput of the baseline models on COLL. Table 13: Evaluation of the performance of the four baseline models on the COLL dataset.Inference COLL test set Throughput Samples / Energy MAE Force MAE Force cos EFwT Model GPU sec.