# Character-like Level Generation for Science Birds

## Evaluation

Each submitted prompt will be evaluated over 10 trials for each of the 26 uppercase letters of the English alphabet (A-Z). The levels generated for each character will be assessed for similarity and stability and scored according to the scoring policy given below. The entire evaluation process will be conducted using automated scripts and programs.

Please note that the evaluation process will be conducted twice, at midterm and final stages. The number of trials and characters in the evaluation set may be adjusted based on the number of teams.

### Evaluation Set

All 26 alphabetical uppercase characters.

A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

### Scoring Policy

In trial $i$ of target character $j$ for prompt $k$:

1. Stability: The stability of a level is evaluated using our Science Birds Evaluator. Here, $total\_blocks_{ijk}$ is defined as the number of blocks at the initialization step of loading the level into the evaluator, and $moving\_blocks_{ijk}$ is defined as the number of blocks that move during the first 10 seconds after level initialization. Each level receives a stability score $st_{ijk}$, a continuous value in $[0, 1]$, according to the following equation:

   $st_{ijk} = \frac{total\_blocks_{ijk} - moving\_blocks_{ijk}}{total\_blocks_{ijk}}$

2. Similarity: The similarity score is the softmax probability $\sigma (z_{ijk})$ produced by the vit-base-letter model when it is used to infer target character $j$ in trial $i$ from an image of the generated level captured 10 seconds after initialization. Each level receives a similarity score $si_{ijk}$, a continuous value in $[0, 1]$, given as

   $si_{ijk} = \sigma (z_{ijk})$
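As a concrete illustration, the two per-trial scores can be sketched in Python (the function names and toy inputs are illustrative only; the official evaluator and vit-base-letter compute these values internally):

```python
import math

def stability_score(total_blocks: int, moving_blocks: int) -> float:
    """st_ijk: fraction of blocks that stayed stationary in the first 10 s."""
    return (total_blocks - moving_blocks) / total_blocks

def similarity_score(logits: list[float], target_index: int) -> float:
    """si_ijk: softmax probability assigned to the target character's class."""
    shifted = [z - max(logits) for z in logits]  # shift for numerical stability
    exps = [math.exp(z) for z in shifted]
    return exps[target_index] / sum(exps)

# A level with 10 blocks, 2 of which moved, scores st = 0.8.
st = stability_score(10, 2)
```

Both scores fall in $[0, 1]$ by construction: a fully stable level scores 1 on stability, and a classifier that is certain of the target letter scores 1 on similarity.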

Let $P$, $T$, and $C$ denote the number of prompts in the competition, the number of trials per target character, and the number of characters, respectively. To give a higher weight to more difficult target characters, $weight_{j}$ for target character $j$ is defined as follows:

$weight_{j} = w\_st_{j} \times w\_si_{j}\text{,}$

where

$w\_st_{j} = \max\left(1 - \frac{\sum_{k=1}^{P} \sum_{i=1}^{T} st_{ijk}}{PT}, \frac{1}{C}\right)$

and

$w\_si_{j} = \max\left(1 - \frac{\sum_{k=1}^{P} \sum_{i=1}^{T} si_{ijk}}{PT}, \frac{1}{C}\right)$
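A minimal sketch of this weighting in Python (the function name and inputs are illustrative; the scores would come from the evaluator described above):

```python
def character_weight(st_scores: list[float], si_scores: list[float],
                     num_chars: int) -> float:
    """weight_j for one target character.

    st_scores / si_scores: flat lists of st_ijk / si_ijk over all P prompts
    and T trials for character j, so len(st_scores) == P * T.
    """
    n = len(st_scores)  # P * T
    w_st = max(1 - sum(st_scores) / n, 1 / num_chars)
    w_si = max(1 - sum(si_scores) / n, 1 / num_chars)
    return w_st * w_si
```

Note the effect of the $\frac{1}{C}$ floor: a character that every prompt handles perfectly (average scores of 1) still carries a small nonzero weight of $\frac{1}{C^2}$, rather than being zeroed out of the total.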

Next, the weighted $trial_{ijk}$ score is defined as follows:

$trial_{ijk} = weight_{j} \times st_{ijk} \times si_{ijk}$

The average score for target character $j$ of prompt $k$, $char_{jk}$, is defined as follows:

$char_{jk} = \frac{\sum_{i=1}^{T} trial_{ijk}}{T}$

The $prompt_{k}$ score is defined as follows:

$prompt_{k} = \frac{\sum_{j=1}^{C} char_{jk}}{C}$

Next, the normalized $prompt_{k}$ score, $norm\_prompt_{k}$, is defined as follows:

$norm\_prompt_{k} = 100\ \frac{prompt_{k}}{competition}\text{,}$

where

$competition = \sum_{k=1}^{P} prompt_{k}$

Finally, $norm\_prompt_{k}$ will be used for ranking.
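The aggregation from trial scores up to the normalized prompt score can be sketched as follows (names are illustrative; the inputs would be the weighted per-trial scores defined above):

```python
def prompt_score(trials: list[list[float]], weights: list[float]) -> float:
    """prompt_k: average over characters of the per-character trial averages.

    trials[j][i] = st_ijk * si_ijk for this prompt; weights[j] = weight_j.
    """
    char_scores = [weights[j] * sum(ts) / len(ts) for j, ts in enumerate(trials)]
    return sum(char_scores) / len(char_scores)

def normalized_scores(prompt_scores: list[float]) -> list[float]:
    """norm_prompt_k: each prompt's share of the competition total, times 100."""
    competition = sum(prompt_scores)
    return [100 * p / competition for p in prompt_scores]
```

By construction, the normalized scores of all prompts sum to 100, so each $norm\_prompt_{k}$ can be read as a percentage share of the competition total.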

### Ranking Policy

The team that has the highest $norm\_prompt_{k}$ will be declared the winner. If there are multiple teams with the same highest score, the one with the shortest prompt will be chosen as the winner. However, if multiple teams still have the same score and the shortest prompt, they will be considered co-winners.
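The ranking rule above, including both tie-breakers, can be sketched in Python (the tuple layout is an assumption for illustration):

```python
def rank_teams(teams: list[tuple[str, float, str]]):
    """teams: list of (name, norm_prompt_score, prompt_text).

    Higher score ranks first; ties are broken by shorter prompt length;
    teams still tied on both are co-winners.
    """
    ordered = sorted(teams, key=lambda t: (-t[1], len(t[2])))
    best_score, best_len = ordered[0][1], len(ordered[0][2])
    winners = [t[0] for t in ordered
               if t[1] == best_score and len(t[2]) == best_len]
    return ordered, winners
```

Sorting on the pair `(-score, prompt_length)` applies the score first and the prompt-length tie-breaker second; any teams sharing the top pair are returned together as co-winners.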

## Evaluation Tools

1. ChatGPT API via the openai Node.js package, using the latest gpt-3.5-turbo model.
2. Science Birds Evaluator with features to
   1. automatically assess the stability of a level, and
   2. automatically export images using black-textured blocks on a white background for similarity testing.
3. vit-base-letter
4. Automation scripts, available on our Resources page.

## Evaluation Environment

1. Software:
2. Hardware:
   - CPU: Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz
   - RAM: 16 GB

## Evaluation Process

1. The qualification checking script will be run first to check for rule violations, including:
   1. whether the prompt contains disallowed characters,
   2. whether the prompt length exceeds the maximum number of words, and
   3. whether the prompt is missing the required `<OBJECT>` placeholder.
2. The response gathering script will load each qualified team's prompt and, for each character, repeat two steps: (1) replace `<OBJECT>` with the target character and (2) contact the ChatGPT API for the specified number of trials.
   1. Each trial always starts from scratch and contains no previous conversation.
3. The code extraction script will load each response and produce a new file containing only a series of `ab_drop()` commands.
4. The text-to-xml conversion script will load each code file and convert it into a Science Birds level description XML file.
5. Next, the Science Birds Evaluator will individually load all levels to assess their stability and capture their images. The stability results will be recorded, and for each level the program will produce an image of the structure with black-textured blocks on a white background.
6. The similarity checking script will load each image, pass it through an open-source model called vit-base-letter, and record the similarity result.
7. Finally, the scoring and ranking script will load all stability and similarity results and produce the final rank and score for all teams according to the scoring policy.
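The three qualification checks in step 1 can be sketched as follows. Note that the word limit and the allowed-character set here are placeholders, not the official values; consult the competition rules for the actual constraints:

```python
import re

MAX_WORDS = 150  # assumed limit; see the competition rules for the real value
ALLOWED = re.compile(r"[A-Za-z0-9 ,.<>()'\"\n-]*")  # assumed allowed-character set

def qualifies(prompt: str) -> bool:
    """Mirrors the three qualification checks: characters, length, placeholder."""
    if not ALLOWED.fullmatch(prompt):
        return False               # contains disallowed characters
    if len(prompt.split()) > MAX_WORDS:
        return False               # exceeds the maximum number of words
    if "<OBJECT>" not in prompt:
        return False               # missing the <OBJECT> placeholder
    return True
```

A prompt failing any one of the three checks is disqualified before any API calls are made, which keeps the subsequent (more expensive) response-gathering step limited to valid entries.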