The 2nd ChatGPT4PCG Competition

Character-like Level Generation for Science Birds

Evaluation

Each submitted prompt will be evaluated over 10 trials for each of the 26 uppercase letters of the English alphabet (A-Z). The levels generated for each character will be evaluated based on their similarity, stability, and diversity, and scored using the criteria outlined in the scoring policy given below. The entire evaluation process will be conducted using automated scripts and programs.

Please note that the evaluation process will be conducted twice, at midterm and final stages. The number of trials and characters in the evaluation set may be adjusted based on the number of teams.

Evaluation Set

All 26 alphabetical uppercase characters.

A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

Scoring Policy

In trial $i$ of target character $j$ for program $k$:

  1. Stability: The stability of a level is evaluated using our Science Birds Evaluator. Here, $total\_blocks_{ijk}$ is defined as the number of blocks at the initialization step of loading the level into the evaluator. Then the program will calculate the value of $moving\_blocks_{ijk}$, which is defined as the number of moving blocks during the first 10 seconds after level initialization. Each level will receive a $sta_{ijk}$ score according to the following equation. The score has a continuous value in $[0, 1]$.

     $$sta_{ijk} = \frac{total\_blocks_{ijk} - moving\_blocks_{ijk}}{total\_blocks_{ijk}}$$

  2. Similarity: The similarity score reflects the softmax probability, $\sigma(\textbf{z}_{ijk})_j$, of the model called vit-base-uppercase-english-characters, which is used to infer target character $j$ in trial $i$ from the image of the generated level after the first 10 seconds of initialization. Each level will receive a continuous value in $[0, 1]$. This score, $sim_{ijk}$, is given as

     $$sim_{ijk} = \sigma(\textbf{z}_{ijk})_j$$

  3. Diversity: The diversity score represents how diverse the generated levels are under the same target character $j$ for program $k$. The score, $div_{jk}$, is calculated by computing the cosine distance, $D_c$, of unordered pairs in the set $A_{jk}$ containing pairs of output vectors from the softmax probability, $\textbf{v}_{ijk} = \sigma(\textbf{z}_{ijk})$. $\Xi_{jk}$ denotes a set containing all such vectors across trials of the same target character $j$ from the same program $k$.

     $$div_{jk} = \frac{\sum_a^{|A_{jk}|} D_c(\textbf{v}_a^1, \textbf{v}_a^2)}{0.5T(T+1)-T}\text{,}$$

     where

     $$A_{jk} = \{(\textbf{v}_a^1, \textbf{v}_a^2) \mid (\textbf{v}_a^1, \textbf{v}_a^2) \in \Xi_{jk} \bowtie \Xi_{jk} \land \textbf{v}_a^1 \neq \textbf{v}_a^2 \}$$

     and $T$ represents the number of trials per target character.
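As a minimal sketch of the three per-level scores above, the following Python snippet computes the stability, similarity, and diversity values from already-collected numbers. It is not the official evaluation script: the function names, the error raised for an empty level, and the value returned when fewer than two trials exist are assumptions made for illustration.

```python
import math
from itertools import combinations


def stability(total_blocks: int, moving_blocks: int) -> float:
    """Fraction of blocks that did not move during the observation window."""
    if total_blocks <= 0:
        # Assumption: the source does not say how an empty level is scored.
        raise ValueError("total_blocks must be positive")
    return (total_blocks - moving_blocks) / total_blocks


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over the classifier's raw outputs."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def similarity(logits: list[float], target_index: int) -> float:
    """Softmax probability assigned to the target character."""
    return softmax(logits)[target_index]


def cosine_distance(u: list[float], v: list[float]) -> float:
    """One minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def diversity(vectors: list[list[float]]) -> float:
    """Sum of pairwise cosine distances divided by 0.5*T*(T+1) - T."""
    trials = len(vectors)
    if trials < 2:
        # Assumption: with fewer than two trials there are no pairs to compare.
        return 0.0
    # Pairs of identical vectors have distance 0, so including them
    # leaves the numerator unchanged.
    total = sum(cosine_distance(u, v) for u, v in combinations(vectors, 2))
    return total / (0.5 * trials * (trials + 1) - trials)
```

The denominator equals the number of unordered trial pairs, $T(T-1)/2$, so the diversity score is the mean pairwise cosine distance when all softmax vectors are distinct.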

Given that $P$ and $C$ represent the number of programs in the competition and the number of characters, respectively, to give a higher weight to a more difficult target character, $weight_{j}$ for target character $j$ is defined as follows:

$$weight_{j} = w\_sta_{j} \times w\_sim_{j} \times w\_div_{j}\text{,}$$

where

$$w\_sta_{j} = \max\left(1 - \frac{\sum_{k=1}^{P} \sum_{i=1}^{T} sta_{ijk}}{PT}, \frac{1}{C}\right)\text{,}$$

$$w\_sim_{j} = \max\left(1 - \frac{\sum_{k=1}^{P} \sum_{i=1}^{T} sim_{ijk}}{PT}, \frac{1}{C}\right)\text{,}$$

and

$$w\_div_{j} = \max\left(1 - \frac{\sum_{k=1}^{P} div_{jk}}{P}, \frac{1}{C}\right)$$
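The weight of a character grows as the average stability, similarity, and diversity over all programs shrink, with each factor floored at $1/C$. A minimal sketch follows, assuming the per-trial scores for one target character are held in nested lists indexed by program and then trial; the function names and data layout are illustrative and not part of the official scripts.

```python
def difficulty_term(mean_score: float, num_characters: int) -> float:
    """One factor of the weight: max(1 - mean score, 1 / C)."""
    return max(1.0 - mean_score, 1.0 / num_characters)


def character_weight(
    sta: list[list[float]],
    sim: list[list[float]],
    div: list[float],
    num_characters: int = 26,
) -> float:
    """Weight for one target character.

    sta[k][i] and sim[k][i] hold the scores of program k in trial i;
    div[k] holds the diversity score of program k (assumed layout).
    """
    num_programs = len(sta)
    num_trials = len(sta[0])
    mean_sta = sum(sum(trials) for trials in sta) / (num_programs * num_trials)
    mean_sim = sum(sum(trials) for trials in sim) / (num_programs * num_trials)
    mean_div = sum(div) / num_programs
    return (
        difficulty_term(mean_sta, num_characters)
        * difficulty_term(mean_sim, num_characters)
        * difficulty_term(mean_div, num_characters)
    )
```

For example, if every program produces perfectly stable levels for a character, the stability factor falls to the floor of $1/C$ rather than to zero, so that character still contributes to the final score.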

Next, the weighted $trial_{ijk}$ score is defined as follows:

$$trial_{ijk} = weight_{j} \times sta_{ijk} \times sim_{ijk}$$

The average score for target character $j$ of program $k$, $char_{jk}$, is defined as follows:

$$char_{jk} = \frac{div_{jk}\sum_{i=1}^{T} trial_{ijk}}{T}$$

The $prompt_{k}$ score is defined as follows:

$$prompt_{k} = \frac{\sum_{j=1}^{C} char_{jk}}{C}$$
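The three definitions above compose directly: trial scores are averaged per character, scaled by that character's diversity, and then averaged over characters. The sketch below assumes the character weight and the per-trial stability and similarity scores are already available; all function names are hypothetical.

```python
def trial_score(weight: float, sta: float, sim: float) -> float:
    """Weighted score of a single trial."""
    return weight * sta * sim


def char_score(
    weight: float, div: float, sta_trials: list[float], sim_trials: list[float]
) -> float:
    """Diversity-scaled mean of the trial scores for one character."""
    num_trials = len(sta_trials)
    total = sum(
        trial_score(weight, sta, sim) for sta, sim in zip(sta_trials, sim_trials)
    )
    return div * total / num_trials


def prompt_score(char_scores: list[float]) -> float:
    """Mean of the character scores of one program."""
    return sum(char_scores) / len(char_scores)
```

Because the diversity score multiplies the whole sum, a program that returns the same level in every trial scores zero for that character regardless of its stability and similarity.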

Next, the normalized $prompt_{k}$ score, $norm\_prompt_{k}$, is defined as follows:

$$norm\_prompt_{k} = 100\ \frac{prompt_{k}}{competition}\text{,}$$

where

$$competition = \sum_{k=1}^{P} prompt_{k}$$

Finally, $norm\_prompt_{k}$ will be used for ranking.
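Normalization rescales the prompt scores so that they sum to 100 across all programs. A minimal sketch, assuming the scores are keyed by a team identifier; the function name and the all-zero fallback are assumptions, since the source does not state what happens when every program scores zero.

```python
def normalize_prompt_scores(prompt_scores: dict[str, float]) -> dict[str, float]:
    """Express each prompt score as a percentage of the competition total."""
    competition = sum(prompt_scores.values())
    if competition == 0:
        # Assumption: avoid division by zero when no program scores anything.
        return {team: 0.0 for team in prompt_scores}
    return {
        team: 100.0 * score / competition for team, score in prompt_scores.items()
    }
```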

Ranking Policy

The team that has the highest $norm\_prompt_{k}$ will be declared the winner. If there are multiple teams with the same highest score, the one with the shortest prompt will be chosen as the winner. However, if multiple teams still have the same score and the shortest prompt, they will be considered co-winners.
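The tie-breaking rule can be sketched as a two-stage filter. The snippet below assumes that prompt length is measured in characters and that scores are compared for exact equality; the source specifies neither, and the function name and input layout are hypothetical.

```python
def pick_winners(entries: list[tuple[str, float, str]]) -> list[str]:
    """Return the winning team(s).

    Each entry is (team, normalized score, prompt text).
    """
    best = max(score for _, score, _ in entries)
    # Stage 1: keep only the teams that share the highest score.
    top = [(team, len(prompt)) for team, score, prompt in entries if score == best]
    # Stage 2: among those, keep the team(s) with the shortest prompt.
    # Assumption: length is counted in characters.
    shortest = min(length for _, length in top)
    return [team for team, length in top if length == shortest]
```

A list with more than one team means those teams are co-winners.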

Evaluation Tools

  1. ChatGPT API via openai Python package using gpt-3.5-turbo-0125 model.
  2. Science Birds Evaluator with features to
    1. Automatically assess the stability
    2. Automatically export images using black-textured blocks on a white background for similarity testing
  3. vit-base-uppercase-english-characters
  4. Automation scripts available on our Resources page.

Evaluation Environment for Response Generation Stage

Software

Hardware

  • CPU: Intel(R) Xeon(R) Gold CPU 6430x128
  • RAM: 250 GB
  • GPU: NVIDIA L40S 48GB

Evaluation Process

  1. We manually inspect each submitted program for a potential violation of the rules.
  2. Qualified programs will be run to gather responses from ChatGPT.
  3. The code extraction script will load each response and produce a new file containing only a series of drop_block() commands.
  4. The text-to-xml conversion script will load each code file and convert it into a Science Birds level description XML file.
  5. Next, Science Birds Evaluator will individually load all levels to assess their stability and capture their images. The results of stability will be recorded, and for each level an image of the structure with black-textured blocks on a white background will be produced by the program.
  6. The similarity checking script will load each image and pass it through an open-source model called vit-base-uppercase-english-characters. It will then record the similarity result.
  7. The diversity checking script will assess the diversity of the levels by averaging the cosine distance over all unordered pairs of distinct softmax output vectors across trials of the same target character, as described in the Scoring Policy.
  8. Finally, the scoring and ranking script will load all stability, similarity, and diversity results and produce the final rank and score result for all teams according to the scoring policy.
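To illustrate step 3 of the process above, the following is a minimal sketch of how `drop_block()` commands might be pulled out of a free-form response. It is not the official code extraction script: the regular expression, the function name, and the example arguments are assumptions, and the pattern does not handle nested parentheses.

```python
import re

# Matches a drop_block(...) call whose argument list has no nested parentheses.
DROP_BLOCK_CALL = re.compile(r"drop_block\([^()]*\)")


def extract_drop_block_calls(response: str) -> list[str]:
    """Return every drop_block(...) call found in the response, in order."""
    return DROP_BLOCK_CALL.findall(response)


if __name__ == "__main__":
    # Illustrative response text; the argument values are made up.
    sample = "Sure!\ndrop_block('b11', 3)\nSome explanation.\ndrop_block('b31', 4)\n"
    print("\n".join(extract_drop_block_calls(sample)))
```

Keeping only the matched calls discards any surrounding explanation in the response, which matches the stated goal of producing a file that contains only a series of `drop_block()` commands.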