The 2nd ChatGPT4PCG Competition
Character-like Level Generation for Science Birds
Evaluation
The submitted prompts will undergo an evaluation process that subjects them to 10 trials for each of the 26 uppercase letters of the English alphabet (A-Z). The levels generated for each character will be evaluated for their similarity, stability, and diversity, and scored using the criteria outlined in the scoring policy given below. The entire evaluation process will be conducted using automated scripts and programs.
Please note that the evaluation process will be conducted twice, at the midterm and final stages. The number of trials and characters in the evaluation set may be adjusted based on the number of participating teams.
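To make the overall flow concrete, the following is a minimal sketch of the evaluation loop described above; the four callables are hypothetical placeholders standing in for the official automated scripts (level generation via ChatGPT, the Science Birds Evaluator, and the character classifier).

```python
import string

TRIALS_PER_CHARACTER = 10            # 10 trials per target character
CHARACTERS = string.ascii_uppercase  # the 26 uppercase letters A-Z


def evaluate_program(generate_level, score_stability, score_similarity, score_diversity):
    """Run every trial for one submitted prompt and collect its raw scores."""
    results = {}
    for character in CHARACTERS:
        levels = [generate_level(character, trial) for trial in range(TRIALS_PER_CHARACTER)]
        results[character] = {
            "stability": [score_stability(level) for level in levels],
            "similarity": [score_similarity(level, character) for level in levels],
            # Diversity is a single score computed across all trials of this character.
            "diversity": score_diversity(levels),
        }
    return results
```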
Evaluation Set
All 26 alphabetical uppercase characters.
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
Scoring Policy
In trial $t$ of target character $c$ for program $p$:
- Stability: The stability of a level is evaluated using our Science Birds Evaluator. Here, $b^{\mathrm{init}}_{c,p,t}$ is defined as the number of blocks at the initialization step of loading the level into the evaluator. The program then calculates $b^{\mathrm{moving}}_{c,p,t}$, defined as the number of blocks that move during the first 10 seconds after level initialization. Each level receives a stability score $st_{c,p,t}$ according to the following equation; the score is a continuous value in $[0, 1]$:
$$st_{c,p,t} = \frac{b^{\mathrm{init}}_{c,p,t} - b^{\mathrm{moving}}_{c,p,t}}{b^{\mathrm{init}}_{c,p,t}}$$
- Similarity: The similarity score reflects the softmax probability, $p(c \mid x_{c,p,t})$, produced by the model called vit-base-uppercase-english-characters, which is used to infer the target character $c$ in trial $t$ from the image $x_{c,p,t}$ of the generated level after the first 10 seconds of initialization. Each level receives a continuous value in $[0, 1]$. This score, $sim_{c,p,t}$, is given as
$$sim_{c,p,t} = p(c \mid x_{c,p,t})$$
- Diversity: The diversity score represents how diverse the generated levels are under the same target character $c$ for program $p$. The score, $div_{c,p}$, is calculated by computing the cosine distance, $d_{\cos}$, over the unordered pairs in the set $D_{c,p}$, which contains pairs of softmax output vectors; $V_{c,p}$ denotes the set containing all such vectors across trials of the same target character from the same program $p$ (see the sketch after this list):
$$div_{c,p} = \frac{1}{|D_{c,p}|} \sum_{\{\mathbf{u},\, \mathbf{v}\} \in D_{c,p}} d_{\cos}(\mathbf{u}, \mathbf{v})$$
where
$$D_{c,p} = \bigl\{\{\mathbf{u}, \mathbf{v}\} : \mathbf{u}, \mathbf{v} \in V_{c,p},\ \mathbf{u} \neq \mathbf{v}\bigr\}, \qquad |D_{c,p}| = \binom{T}{2},$$
and $T$ represents the number of trials per target character.
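The following minimal Python sketch computes the three per-level quantities defined above, assuming the block counts and the 26-way softmax vectors are already available; the function names are illustrative and not part of the official scripts.

```python
from itertools import combinations

import numpy as np


def stability(blocks_at_init: int, moving_blocks: int) -> float:
    """st_{c,p,t}: fraction of blocks that stay still during the first 10 seconds."""
    return (blocks_at_init - moving_blocks) / blocks_at_init


def similarity(softmax_vector: np.ndarray, target_index: int) -> float:
    """sim_{c,p,t}: softmax probability assigned to the target character."""
    return float(softmax_vector[target_index])


def diversity(softmax_vectors: list[np.ndarray]) -> float:
    """div_{c,p}: mean cosine distance over all unordered pairs of trial vectors."""
    distances = []
    for u, v in combinations(softmax_vectors, 2):  # |D_{c,p}| = T * (T - 1) / 2 pairs
        cosine_similarity = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
        distances.append(1.0 - cosine_similarity)  # d_cos = 1 - cosine similarity
    return float(np.mean(distances))
```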
Given that $P$ and $C$ represent the number of programs in the competition and the number of characters, respectively, each target character $c$ is assigned a weight $w_c$. To give a higher weight to a more difficult target character, $w_c$ increases as the overall performance achieved on character $c$ by all $P$ programs across their $T$ trials decreases.
Next, the weighted score of a trial, $ws_{c,p,t}$, is defined by combining that trial's stability, similarity, and diversity scores and scaling the result by $w_c$.
The average score for target character $c$ of program $p$, $\overline{ws}_{c,p}$, is defined as the mean of the weighted scores over the $T$ trials of that character.
The score of program $p$ is defined by aggregating $\overline{ws}_{c,p}$ over all $C$ target characters.
Next, the normalized score, $norm_p$, is defined by scaling the score of program $p$ relative to the scores of all $P$ programs.
Finally, $norm_p$ will be used for ranking.
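As an illustration only, the sketch below instantiates one possible aggregation that is consistent with the description above: the per-trial scores are combined multiplicatively, scaled by $w_c$, averaged over the $T$ trials, aggregated over the $C$ characters, and normalized against the best program. The multiplicative combination and the max-based normalization are assumptions of this sketch, not the official formulas.

```python
import numpy as np


def aggregate_scores(st, sim, div, weights):
    """Illustrative aggregation only; the official formulas are defined by the organizers.

    st, sim : arrays of shape (P, C, T) holding per-trial stability/similarity scores.
    div     : array of shape (P, C) holding per-character diversity scores.
    weights : array of shape (C,) holding the character weights w_c.
    Returns an array of shape (P,) with normalized program scores.
    """
    # Assumed weighted per-trial score: w_c * st * sim * div.
    weighted = weights[None, :, None] * st * sim * div[:, :, None]
    avg_per_character = weighted.mean(axis=2)      # average over the T trials
    program_score = avg_per_character.sum(axis=1)  # aggregate over the C characters
    return program_score / program_score.max()     # normalize against the best program
```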
Ranking Policy
The team whose program achieves the highest normalized score, $norm_p$, will be declared the winner. If multiple teams share the same highest score, the one with the shortest prompt will be chosen as the winner. However, if multiple teams still have both the same score and the same shortest prompt length, they will be considered co-winners.
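A minimal sketch of this tie-breaking rule, assuming each team is represented by a dictionary with illustrative name, norm_score, and prompt fields:

```python
def rank_winners(teams):
    """Return the list of co-winners under the ranking policy.

    teams: list of dicts with 'name', 'norm_score', and 'prompt' keys
    (an illustrative schema, not the official data format).
    """
    best_score = max(team["norm_score"] for team in teams)
    top = [team for team in teams if team["norm_score"] == best_score]
    shortest = min(len(team["prompt"]) for team in top)
    # Teams tied on both score and prompt length are co-winners.
    return [team for team in top if len(team["prompt"]) == shortest]
```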
Evaluation Tools
- ChatGPT API via the openai Python package using the gpt-3.5-turbo-0125 model (see the sketch after this list).
- Science Birds Evaluator with features to
  - Automatically assess the stability
  - Automatically export images using black-textured blocks on a white background for similarity testing
- vit-base-uppercase-english-characters
- Automation scripts available on our Resources page.
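The sketches below show, under stated assumptions, how the first and third tools can be invoked from Python: a chat-completion request through the openai package pinned to the gpt-3.5-turbo-0125 model, and an image-classification call through the Hugging Face transformers pipeline. The prompt wrapping and the full Hugging Face model id used by the official scripts are not specified here and are assumptions.

```python
from openai import OpenAI
from transformers import pipeline

# ChatGPT API via the openai Python package, pinned to the competition model.
client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-3.5-turbo-0125",
    messages=[{"role": "user", "content": "<submitted prompt with the target character filled in>"}],
)
chatgpt_output = response.choices[0].message.content

# Similarity model; the full Hugging Face hub id of the classifier is assumed here.
classifier = pipeline("image-classification", model="vit-base-uppercase-english-characters")
predictions = classifier("level_capture.png", top_k=26)  # one probability per uppercase letter
```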
Evaluation Environment for Response Generation Stage
Software
- OS: Ubuntu 22.04.3 LTS
- Python: 3.11.xx
- Node.js: 20.xx.xx LTS
- Unity: 2019.4.40f1
- Science Birds Evaluator
- Our automation scripts
Hardware
- CPU: Intel(R) Xeon(R) Gold 6430 x 128
- RAM: 250 GB
- GPU: NVIDIA L40S 48GB
Evaluation Process
- We manually inspect each submitted program for potential violations of the rules.
- Qualified programs will be run to gather responses from ChatGPT.
- The code extraction script will load each response and produce a new file containing only a series of drop_block() commands (see the sketch after this list).
- The text-to-XML conversion script will load each code file and convert it into a Science Birds level description XML file.
- Next, the Science Birds Evaluator will load each level individually to assess its stability and capture its image. The stability results will be recorded, and for each level the evaluator will produce an image of the structure with black-textured blocks on a white background.
- The similarity checking script will load each image and pass it through an open-source model called vit-base-uppercase-english-characters, then record the similarity result.
- The diversity checking script will assess the diversity of the levels by averaging the cosine distances over all unordered pairs of softmax output vectors across trials of the same target character, as described in the Scoring Policy above.
- Finally, the scoring and ranking script will load all stability, similarity, and diversity results and produce the final scores and rankings for all teams according to the scoring policy.
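As an illustration of the code extraction step, the sketch below pulls drop_block() calls out of a raw ChatGPT response with a regular expression; the exact argument format accepted by the official script is an assumption of this sketch.

```python
import re

# Matches calls such as drop_block('b31', 10); the argument format is assumed.
DROP_BLOCK_PATTERN = re.compile(r"drop_block\(\s*['\"]\w+['\"]\s*,\s*\d+\s*\)")


def extract_drop_block_commands(response_text: str) -> str:
    """Keep only the drop_block() commands from a raw ChatGPT response."""
    return "\n".join(DROP_BLOCK_PATTERN.findall(response_text))


# Example: a response mixing prose and code is reduced to commands only.
raw_response = "Sure! Here is the structure:\ndrop_block('b31', 9)\ndrop_block('b11', 10)\nDone!"
print(extract_drop_block_commands(raw_response))
```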