GenSIE Starter Kit
The GenSIE Starter Kit is designed to get you up and running with the task in minutes. It includes a baseline agent, a local evaluation server, and an initial set of examples to guide your development.
1. The Starter Data
Inside the repository at data/starter/, you will find 40 JSON instances covering multiple domains (Legal, Medical, Cultural, and Technical).
Data Curation
These examples have been generated using Gemini 3 Pro and manually curated for grounding and correctness. However, they have not yet passed the official "Gold Curation" protocol. Expect this set to be updated when the full development set is released. The format, however, is final and will not change.
2. Reference Implementation & Baselines
The kit provides a high-quality Python baseline in src/gensie/baseline.py that is fully compliant with the competition's hardware and connectivity constraints.
- `BasicAgent`: A reference agent that uses the OpenAI client. It is model-agnostic, meaning it respects the `model` parameter passed by the evaluator instead of hardcoding a specific model.
- `OfficialParticipant`: The entry point for your submission. It acts as a factory that can host up to three distinct pipelines.
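The model-agnostic contract can be sketched as follows. This is a minimal illustration, not the kit's actual code: `complete` and `EchoClient` are hypothetical names, and any OpenAI-compatible client object can be substituted for the stub.

```python
from types import SimpleNamespace

def complete(client, model: str, prompt: str) -> str:
    """Send one chat request. `model` is whatever the evaluator passes in;
    a compliant agent never hardcodes it. `client` is any object exposing
    the OpenAI-style chat.completions.create interface."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

class EchoClient:
    """Offline stand-in for an OpenAI-compatible client (illustration only)."""
    def __init__(self):
        self.chat = SimpleNamespace(
            completions=SimpleNamespace(create=self._create)
        )

    def _create(self, model, messages):
        reply = SimpleNamespace(
            message=SimpleNamespace(content=f"[{model}] {messages[-1]['content']}")
        )
        return SimpleNamespace(choices=[reply])

print(complete(EchoClient(), "gpt-4o-mini", "hi"))  # [gpt-4o-mini] hi
```

Because the evaluator supplies the model name at request time, the same agent code runs unchanged against any of the compute tiers below.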
Public Baselines
During the evaluation period, the organizers will provide three open-source baselines executed via standard zero-shot prompting with grammar-constrained decoding. These target different compute tiers:
- Tiny: Llama 3.2 3B Instruct
- Small: Salamandra 7B Instruct (a native Spanish-language model)
- Medium: Qwen 3 14B
Initial Benchmarks
To provide a reference point for your experiments, we have performed internal testing on the 40 starter instances using the provided zero-shot baseline:
| Model | Micro-F1 |
|---|---|
| llama-3.2-3b-instruct | 0.2950 |
| llama-3.1-8b-instruct | 0.3109 |
| gemini-3-flash-preview | 0.4334 |
| llama-4-maverick | 0.4549 |
While your mileage may vary depending on prompting and system configuration, these are reasonable "lower baselines" to aim for. We will report official Inter-Annotator Agreement (IAA) and human performance metrics upon the release of the final test set.
3. Environment Setup
We recommend using uv for dependency management. The environment is optimized for CPU-only execution, avoiding heavy ML frameworks.
1. Install dependencies: Use `uv` to install the project into a local environment.
2. Configure environment: The evaluator will provide `OPENAI_API_KEY` and `OPENAI_BASE_URL` at runtime. For local development, create a `.env` file in the root of the `gensie` folder.
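The kit likely loads the `.env` file via a helper such as python-dotenv; the sketch below shows the idea with a hand-rolled parser and placeholder values (the key, URL, and `load_env` helper are illustrative, not real credentials or kit APIs).

```python
import os

def load_env(text: str) -> dict:
    """Minimal .env parser (illustration; the kit may use python-dotenv)."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip()
    return values

# Example .env contents for a local OpenAI-compatible backend
# (placeholder values, not real credentials):
env = load_env("""
OPENAI_API_KEY=sk-local-dev
OPENAI_BASE_URL=http://localhost:1234/v1
""")

# Runtime values provided by the evaluator always take precedence.
os.environ.setdefault("OPENAI_API_KEY", env["OPENAI_API_KEY"])
os.environ.setdefault("OPENAI_BASE_URL", env["OPENAI_BASE_URL"])
```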
4. Local Development Workflow
Serving your Agent
To make your agent available for evaluation, start the FastAPI server. This launches your agent at `http://localhost:8000`.
Running Local Benchmarks
You can test your performance against the starter data. The eval command will automatically pass the --model flag to your agent, simulating the evaluator's behavior. You can also generate a JSON report for later analysis:
```bash
gensie eval \
  --data data/starter/ \
  --url http://localhost:8000 \
  --pipeline baseline \
  --model gpt-4o-mini \
  --output report.json
```
Error Penalization
If your agent fails to process a task (e.g., returns a 500 error or crashes), the evaluation engine will penalize the result with a 0 score for that instance. The task remains in the denominator for Recall calculation, effectively lowering your overall F1 score.
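The arithmetic behind this penalty can be made concrete. In the sketch below (illustrative only; the official scorer's internals may differ), each instance contributes true-positive, false-positive, and false-negative counts, and a crashed instance contributes all of its gold spans as false negatives:

```python
def micro_f1(counts):
    """Micro-averaged F1 over per-instance (tp, fp, fn) tuples."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

ok = [(3, 1, 1)]     # a successfully processed instance
failed = (0, 0, 4)   # a crashed instance: all 4 gold spans become false negatives

print(micro_f1(ok))             # 0.75
print(micro_f1(ok + [failed]))  # 0.5
```

The failed instance does not change precision, but its gold spans stay in the recall denominator, which is exactly why robustness (catching exceptions, returning a valid empty answer rather than a 500) pays off.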
First Run & Embeddings
On its first run, gensie eval will automatically download the lightweight BAAI/bge-small-en-v1.5 embedding model via fastembed. This is used for fast semantic similarity calculations on your CPU. The official leaderboard evaluation will use a more robust, heavier model (e.g., paraphrase-multilingual-mpnet-base-v2).
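Embedding models like bge-small return dense vectors, and similarity between two texts is then typically measured with cosine similarity. The sketch below shows the metric on toy vectors; this is an assumption about the scoring, and the kit's actual comparison function may differ.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy 3-dimensional "embeddings" (real bge-small vectors have 384 dimensions).
print(cosine([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]))  # 1.0 (identical direction)
print(cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # 0.0 (orthogonal)
```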
5. How to Hack
To start building your own solution:
1. Configure inference: Before launching your container, ensure your `.env` file points to a valid backend. You can host a local model using LM Studio or connect to an external provider by setting `OPENAI_BASE_URL` and `OPENAI_API_KEY`.
2. Launch: Start the environment.
3. Implement: Create a new agent class in `src/gensie/` that inherits from `GenSIEAgent`.
4. Register: Add your agent to the `OfficialParticipant` class inside `baseline.py`.
5. Benchmark: Run `gensie eval` to see your progress.
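The implement-and-register steps above can be sketched as follows. The real `GenSIEAgent` interface lives in the kit and may differ; the method name, signature, and `PIPELINES` mapping here are hypothetical stand-ins:

```python
class GenSIEAgent:
    """Stand-in for the base class in src/gensie/ (real interface may differ)."""
    def run(self, task: dict, model: str) -> dict:
        raise NotImplementedError

class MyAgent(GenSIEAgent):
    """A custom pipeline; replace the body with real extraction logic."""
    def run(self, task: dict, model: str) -> dict:
        # Always honor the `model` argument passed by the evaluator.
        return {"model_used": model, "entities": []}

# Registration sketch: OfficialParticipant maps pipeline names to agents.
PIPELINES = {"baseline": MyAgent()}  # it can host up to three distinct pipelines
```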
Fast Iteration with Docker
The provided Dockerfile installs the package in editable mode. You can use Docker Compose to build and run your agent with local volume mounting and environment variables pre-configured.
This will:
- Build the agent image.
- Mount your local code so changes are reflected instantly (via FastAPI reload).
- Load your API keys from the `.env` file.