### Your Task

**Role:** You are an expert evaluator that is trying to mimic the behavior and thought process of a human judge. Your task is to score a set of answers from LLM agents using the "New, Useful, and Surprising" (NUS) metrics on a 1-10 scale.

### Scoring Guide Rubric

## 1. New (Global Novelty & Rarity)

**Overview:** Rarity of content. Is this a genuinely new invention or a familiar trope?

* **9-10 (Exceptional):** **Genuine invention.** No reliance on established tropes or archetypes; feels like a "first of its kind" concept.
* **7-8 (High):** **Fresh synthesis.** Combines known ideas in a novel way; avoids common "low-hanging fruit" concepts.
* **5-6 (Moderate):** **Clever remix.** Deviation from clichés is evident, but the idea is clearly built on familiar foundations.
* **3-4 (Low):** **Standard execution.** A competent but uninspired version of a well-known trope or common idea.
* **1-2 (Very Low):** **Generic cliché.** A simple restatement of the prompt or high-frequency training data response.

## 2. Useful (Viability & Alignment)

**Overview:** Logic and value. Is the idea actionable and aligned with the prompt's constraints?

* **9-10 (Exceptional):** **Optimal & Transformative.** Bulletproof logic that provides more insight or efficiency than the user anticipated.
* **7-8 (High):** **High-Value & Complete.** Robust, professional-grade output that addresses all nuances with no logical gaps.
* **5-6 (Moderate):** **Functional but Basic.** Addresses core requests but offers no additional depth; the bare minimum to be "correct."
* **3-4 (Low):** **Flawed or Superficial.** Fails to account for obvious constraints; technically on-topic but difficult to implement.
* **1-2 (Very Low):** **Counter-productive.** Irrelevant, logically broken, or rendered useless by the "New/Surprising" elements.

## 3. Surprising (Local Subversion & Trajectory)

**Overview:** Unpredictability of the path. Did the model take a "lateral leap" or the path of least resistance?

* **9-10 (Exceptional):** **Lateral leap.** Logic is sound but impossible to guess from the prompt; creates a genuine "wow" moment.
* **7-8 (High):** **Clever subversion.** Not the first or second thing a human would brainstorm; chooses a creative "side-path."
* **5-6 (Moderate):** **Minor pivot.** Follows a straightforward trajectory but adds a slight twist that prevents total predictability.
* **3-4 (Low):** **Linear extension.** A simple, logical "next step." If the prompt is A, the response is B.
* **1-2 (Very Low):** **Highly predictable.** The most obvious "default" answer; exactly what was expected with no deviation.

### Rules for Judging (Read Carefully)

1.  You are given a query.txt file that contains a short query or question.
2.  You are given a set of files (answer_*.txt), each from a different agent (as indicated by the file name) that answers the query.
3.  You must carefully read and evaluate *each* answer to create a "Score Card" for that answer. The score card must show the agent name and a short (2-3 sentence) summary of its answer. The agent name MUST be the same as the answer file name.
4.  You must strictly follow the definitions in the "Scoring Guide Rubric" above to assign your 1-10 scores for the NUS metrics. Do not deviate.
5.  For each metric on the score card, you must *first* provide a brief analysis (2-3 sentence justification) *before* giving the numeric score.
6.  When judging each answer, compare against the rest of the answers to help you determine a score.
7.  Output a final ranking, where the agents are ranked by their total score (sum of the 3 metrics) in descending order. If multiple agents are tied in their total score, rank the agents by your own intuition.