Task Description
The objective of GenSIE is to extract structured knowledge from a text fragment according to a specific, arbitrary schema provided in the input.
Read the full task description, including the scoring metrics and detailed constraints.
The Challenge
For each instance, the system receives:
- Context: A text fragment in Spanish (Wikipedia, news, scientific abstracts).
- Instruction: A natural language description of the goal.
- Target Schema: A JSON Schema definition (OpenAPI 3.0 subset).
The system must generate a valid JSON object that:
- Strictly adheres to the Target Schema.
- Contains information faithfully extracted from the Context.
- Returns `null` for information not present (Hallucination Check).
Example Instance
Input Context
"El ensayo clínico aleatorizado de Fase 3 de la vacuna mRNA-1273 (Moderna) evaluó a 30,420 participantes. Los resultados primarios mostraron una eficacia del 94.1%..."
Input Instruction
"Extrae el nombre de la medicación y el resultado del ensayo, siendo positivo si es mayor de 90% de efectividad, negativo si es menor del 70%, inconcluso en cualquier otro caso."
Input Target Schema
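A minimal schema sketch consistent with the instruction above (the field name `medication_name` and the enum labels other than `POSITIVE` are illustrative assumptions):

```json
{
  "type": "object",
  "properties": {
    "medication_name": {
      "type": "string",
      "nullable": true
    },
    "clinical_outcome": {
      "type": "string",
      "enum": ["POSITIVE", "NEGATIVE", "INCONCLUSIVE"],
      "nullable": true
    }
  }
}
```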
Note that `clinical_outcome` requires semantic reasoning: the system must map "94.1%" to the enum value `POSITIVE`, as specified by the instruction. These semantic reasoning subproblems can range widely in complexity.
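Under the schema sketched above, a conforming output for this instance would look like:

```json
{
  "medication_name": "mRNA-1273",
  "clinical_outcome": "POSITIVE"
}
```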
The Zero-Shot Constraint
This is a Zero-Shot Schema task.
- Development Phase: You will see schemas like `Biography`, `Recipe`, and others.
- Test Phase: You will encounter entirely new schemas (e.g., `LegalContract`, `ProductSpec`, and others).
Your system must generalize to the structure, not memorize the entity types.
Evaluation Metrics
Evaluating generative structured output requires a rigorous approach that assesses both structural validity and semantic accuracy. We employ a custom metric called Flattened Schema Scoring.
Flattened Schema Scoring
Since JSON objects can be deeply nested, we first "flatten" both the Gold Standard and the System Output into a set of key-value pairs using dot notation.
Let \(J\) be a JSON object. The flattening function \(\Phi\) maps it to the set of (path, value) pairs of its primitive leaves, where each path is written in dot notation:
\[ \Phi(J) = \{ (p, v) \;|\; v \text{ is a primitive value in } J \text{ reachable via the dot-notation path } p \} \]
Flattening Example
Nested JSON:
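An illustrative nested object (the `trial` sub-object is invented purely for this example):

```json
{
  "medication_name": "mRNA-1273",
  "trial": {
    "phase": 3,
    "efficacy_percent": 94.1
  }
}
```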
Flattened Representation:
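The same object after applying \(\Phi\), with dot-notation keys:

```json
{
  "medication_name": "mRNA-1273",
  "trial.phase": 3,
  "trial.efficacy_percent": 94.1
}
```

A minimal Python sketch of such a flattening function (arrays are omitted for brevity; this is not the organizers' implementation):

```python
def flatten(obj: dict, prefix: str = "") -> dict:
    """Flatten a nested JSON-like dict into dot-notation key/value pairs."""
    pairs = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            pairs.update(flatten(value, path))  # recurse into nested objects
        else:
            pairs[path] = value                 # primitive leaf: record the pair
    return pairs
```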
Value Scoring Logic
Once aligned by keys, we compare the values. The scoring method depends strictly on the data type defined in the schema.
Case A: Rigid Types
For fields where precision is binary (Numbers, Dates, Booleans, and Enum Strings), we require an Exact Match.
- Example: If the Gold Standard contains `"clinical_outcome": "POSITIVE"` and the system outputs `"positive"` (lowercase) or `"SUCCESS"`, the score is \(0\).
Case B: Free Text
For fields describing free-form content (descriptions, summaries), we use a hybrid metric that combines semantic and lexical similarity:
\[ \text{sim}(v_{\text{gold}}, v_{\text{sys}}) = \alpha \cdot \text{sim}_{\text{sem}}(v_{\text{gold}}, v_{\text{sys}}) + (1 - \alpha) \cdot \text{sim}_{\text{lex}}(v_{\text{gold}}, v_{\text{sys}}) \]
- Semantic (\(\text{sim}_{\text{sem}}\)): Cosine similarity of embeddings from a multilingual sentence transformer (e.g., `paraphrase-multilingual-mpnet-base-v2`).
- Lexical (\(\text{sim}_{\text{lex}}\)): A normalized token-overlap score.
- \(\alpha\): Weighting parameter (default \(\approx 0.7\)).
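A minimal sketch of how such a hybrid score could be computed, assuming the `sentence-transformers` library and a simple Jaccard token overlap as the lexical component (the organizers' exact tokenization and normalization may differ):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")

def lexical_overlap(gold: str, pred: str) -> float:
    """Normalized token overlap (Jaccard similarity) as a simple lexical score."""
    g, p = set(gold.lower().split()), set(pred.lower().split())
    return len(g & p) / len(g | p) if g | p else 0.0

def hybrid_score(gold: str, pred: str, alpha: float = 0.7) -> float:
    """Alpha-weighted mix of embedding cosine similarity and token overlap."""
    emb = model.encode([gold, pred])
    semantic = float(util.cos_sim(emb[0], emb[1]))
    return alpha * semantic + (1 - alpha) * lexical_overlap(gold, pred)
```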
Aggregated Metrics
We calculate the total True Positive Score (TPS) by summing the similarity scores over all keys shared between the flattened Gold Standard and the flattened System Output:
\[ \text{TPS} = \sum_{k \in K_{\text{gold}} \cap K_{\text{sys}}} \text{sim}\big(v_k^{\text{gold}}, v_k^{\text{sys}}\big) \]
where \(K_{\text{gold}}\) and \(K_{\text{sys}}\) are the key sets of \(\Phi(J_{\text{gold}})\) and \(\Phi(J_{\text{sys}})\). Final rankings are based on Micro-F1:
\[ P = \frac{\text{TPS}}{|K_{\text{sys}}|}, \qquad R = \frac{\text{TPS}}{|K_{\text{gold}}|}, \qquad F_1 = \frac{2PR}{P + R} \]
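Putting the pieces together, a sketch of the aggregation over flattened objects, assuming precision and recall are taken over the system and gold key sets as above:

```python
def micro_f1(gold_flat: dict, sys_flat: dict, score_fn) -> float:
    """Micro-F1 over flattened key/value pairs; score_fn returns a similarity in [0, 1]."""
    if not gold_flat or not sys_flat:
        return 0.0
    shared = gold_flat.keys() & sys_flat.keys()
    tps = sum(score_fn(gold_flat[k], sys_flat[k]) for k in shared)
    precision = tps / len(sys_flat)
    recall = tps / len(gold_flat)
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```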
Secondary Leaderboard: Efficiency
In addition to performance, we maintain an Efficiency Leaderboard to reward sustainable AI.
We calculate the Performance-to-Cost Ratio, sketched below:
- Total Token Consumption: The sum of all input and output tokens generated across all API calls required to solve a single instance (including reasoning steps, retries, and self-corrections).
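A natural formulation, assuming the score is simply divided by the token count (the organizers may apply additional normalization or averaging across instances):
\[ \text{PCR} = \frac{\text{Micro-F1}}{\text{Total Token Consumption}} \]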
Rules & Submission
Submission Format
Participants must submit a private GitHub repository containing:
- Source Code: MIT or Apache 2.0 licensed.
- Dockerfile: A self-sufficient container for inference.
- README: Instructions for running the pipeline.
Network Restrictions
The evaluation environment is isolated (no internet). Your container must include all necessary code and dependencies. The only allowed network call is to the Organizer's Hosted Inference API (OpenAI-compatible).
The "No-Training" Policy
You cannot fine-tune the Language Model weights.
- Allowed: Prompt Engineering, RAG, Grammar-Constrained Decoding, Synthetic Data generation for few-shotting.
- Prohibited: LoRA, QLoRA, Full Fine-tuning.
Inference Environment
- Hardware: NVIDIA A100 GPUs.
- Model Access: Your code connects to a local endpoint (e.g., `http://inference-server:8000/v1`) serving the official models (Llama 3, Salamandra, Qwen).
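A minimal sketch of calling the hosted endpoint with the standard `openai` Python client; the model identifier, API key, and prompt wording are illustrative assumptions, not official values:

```python
from openai import OpenAI

# Only the organizer-hosted, OpenAI-compatible endpoint is reachable
# from inside the isolated evaluation environment.
client = OpenAI(base_url="http://inference-server:8000/v1", api_key="EMPTY")

context = "El ensayo clínico aleatorizado de Fase 3 de la vacuna mRNA-1273 ..."
instruction = "Extrae el nombre de la medicación y el resultado del ensayo ..."
schema_json = '{"type": "object", "properties": {"medication_name": {"type": "string", "nullable": true}}}'

response = client.chat.completions.create(
    model="llama-3-8b-instruct",  # illustrative; use the official model name exposed by the server
    temperature=0.0,
    messages=[
        {
            "role": "system",
            "content": "Return only a JSON object that conforms to the given schema. "
                       "Use null for any information not present in the context.",
        },
        {
            "role": "user",
            "content": f"Context: {context}\nInstruction: {instruction}\nTarget Schema: {schema_json}",
        },
    ],
)
print(response.choices[0].message.content)
```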