MCU mc-multimodal-agent — Minecraft Multimodal Agent

Most Minecraft bots cheat: they read raw world state, query block coordinates directly, and move with privileged APIs no human player has. MCU mc-multimodal-agent takes the harder path — it plays Minecraft the way a person does, by looking at the screen and acting through screen coordinates, driven by a multimodal LLM loop on top of Mineflayer.

This repository is the formal AgentBeats / Amber submission wrapper for the core mc-multimodal-agent project. It packages the Node agent into a reproducible Docker image, exposes an A2A (Agent-to-Agent) JSON-RPC endpoint on port 9009, and ships a CI workflow that builds, tests, and publishes the image to GitHub Container Registry on every push to main.

The agent is built around an OpenClaw-style turn loop: each step composes an active prompt from persistent memory, sends a first-person frame and tool catalog to the model, executes the chosen tool, records the outcome, and compacts long context into LevelDB-backed layered memory notes — so it can pursue goals that span hours of in-game time without losing the thread.

  • 1st-person visual input
  • A2A JSON-RPC protocol
  • Service port: 9009
  • LevelDB layered memory

How does it work?

Every turn, the agent receives a rendered first-person Minecraft frame and the current memory snapshot. The model returns a single structured action — either a tool_call, an ordered tool_calls batch, or a final answer. The host process executes the tool against Mineflayer, captures the outcome, and feeds the next frame back into the loop.
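
For illustration, a single turn's structured action might look like the line below. The tool name and argument fields here are hypothetical, chosen only to show the shape of the output, not the agent's exact schema:

# Hypothetical single-turn action (illustrative field names, not the real schema)
echo '{"type": "tool_call", "tool": "dig_block", "args": {"screenX": 640, "screenY": 352}}' | jq .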

Capabilities

  • ✓ Multimodal LLM loop with image input + function tools
  • ✓ Visual-first control via screen coordinates
  • ✓ Mineflayer movement, digging, placing, crafting, chat
  • ✓ Pathfinding and inventory management
  • ✓ Blueprint loading and bottom-up block placement
  • ✓ Nearby-player imitation traces

Memory & Skills

  • ✓ Persistent goal trees for long autonomous tasks
  • ✓ LevelDB-backed layered memory notes
  • ✓ Indexed recall + recent transcript context
  • ✓ Pre-compaction durable flushes
  • ✓ Skill snapshots learned via record_skill
  • ✓ soul.md persona loaded every turn

The agent supports both the OpenAI Responses API and Chat Completions, including any OpenAI-compatible base URL. Structured outputs constrain each turn to a strict JSON schema, and transient transport failures (429, 5xx, socket resets, timeouts) are retried with exponential backoff before the task gracefully checkpoints.
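
The backoff pattern is roughly the shell sketch below, assuming a Chat Completions endpoint; the real agent implements this in Node with finer-grained error classification, but the control flow is similar:

# Minimal sketch of exponential backoff on transient failures (429, 5xx, timeouts).
# REQUEST_BODY stands in for the composed turn request; this is not the agent's code.
attempt=0
until curl -sf --max-time 60 "$OPENAI_BASE_URL/chat/completions" \
    -H "Authorization: Bearer $API_KEY" \
    -H "Content-Type: application/json" \
    -d "$REQUEST_BODY" -o response.json
do
  attempt=$((attempt + 1))
  if [ "$attempt" -ge 5 ]; then
    echo "transport still failing; checkpointing task" >&2
    break
  fi
  sleep $((2 ** attempt))   # back off 2s, 4s, 8s, 16s
done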

What's in this submission wrapper

  • Amber manifest — amber-manifest.json5 registers the agent with AgentBeats and exposes one A2A endpoint on port 9009
  • Dockerfile — Builds the Node AgentBeats A2A service from the mc-multimodal-agent submodule into a deployable image
  • A2A conformance tests — pytest suite that hits a running container and validates the agent-card and message endpoints
  • CI workflow — .github/workflows/test-and-publish.yml builds, runs the conformance suite, and publishes to GHCR on every push to main
  • Core agent submodule — mc-multimodal-agent/ is the upstream agent: Mineflayer bot, multimodal turn loop, memory, tool catalog, blueprints, and skills
  • Published image — ghcr.io/madgaa-lab/mcu-mc-multimodal-agent:latest is pullable by the AgentBeats evaluator without local build steps

Quick Start

# Clone the wrapper and the core agent submodule
git clone https://github.com/MadGAA-Lab/MCU-mc-multimodal-agent.git
cd MCU-mc-multimodal-agent
git submodule update --init --recursive mc-multimodal-agent

# Build the submission image
docker build -t mcu-mc-multimodal-agent .

# Run with official OpenAI
docker run --rm -p 9009:9009 \
  -e API_KEY="$OPENAI_API_KEY" \
  -e OPENAI_API_KEY="$OPENAI_API_KEY" \
  -e OPENAI_BASE_URL="https://api.openai.com/v1" \
  -e OPENAI_MODEL="gpt-5.4" \
  mcu-mc-multimodal-agent

# Health check (works without an API key)
curl http://127.0.0.1:9009/.well-known/agent-card.json

# Or pull the prebuilt image directly from GHCR
docker pull ghcr.io/madgaa-lab/mcu-mc-multimodal-agent:latest

# Run A2A conformance tests against a running container
uv sync --extra test
uv run pytest -v tests --agent-url http://127.0.0.1:9009
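
For reference, a raw A2A request against the running container looks roughly like this. This is a sketch following the public A2A JSON-RPC spec (method message/send); the exact part and field names depend on the protocol version the wrapper implements:

# Send one task over A2A JSON-RPC (shape per the public A2A spec; may vary by version)
curl -s http://127.0.0.1:9009/ \
  -H "Content-Type: application/json" \
  -d '{
    "jsonrpc": "2.0",
    "id": 1,
    "method": "message/send",
    "params": {
      "message": {
        "role": "user",
        "parts": [{"kind": "text", "text": "Collect 10 oak logs."}],
        "messageId": "demo-1"
      }
    }
  }'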

Environment Variables

The container reads its model configuration from environment variables — no provider is hard-coded. Any OpenAI-compatible endpoint works as long as it implements Chat Completions with structured outputs; an example run against an alternative provider is sketched after the list below.

  • API_KEY — Model API key (required); also mirrored as OPENAI_API_KEY
  • OPENAI_BASE_URL — Provider base URL (default: https://api.openai.com/v1)
  • OPENAI_MODEL — Model name (default: gpt-5.4; use gpt-5.5 for hard visual planning)
  • AGENTBEATS_MODEL_EVERY_N_STEPS — How often the model is consulted between cheap deterministic steps (default: 4)
  • AGENTBEATS_DEFAULT_HOLD_STEPS — Default hold/wait steps between actions (default: 3)
  • AGENTBEATS_MAX_HOLD_STEPS — Upper bound on hold steps before forcing a model turn (default: 12)
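
As noted above, any OpenAI-compatible provider can be substituted. A hypothetical run against a self-hosted server might look like this; the base URL, model name, and key below are placeholders, not tested endpoints:

# Point the agent at a self-hosted OpenAI-compatible endpoint (all values are placeholders)
docker run --rm -p 9009:9009 \
  -e API_KEY="$LOCAL_KEY" \
  -e OPENAI_BASE_URL="http://host.docker.internal:8000/v1" \
  -e OPENAI_MODEL="local-model" \
  -e AGENTBEATS_MODEL_EVERY_N_STEPS=2 \
  mcu-mc-multimodal-agent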

Citation

If you use the MCU mc-multimodal-agent submission in your research or evaluation, please cite:

@software{mcu_mc_multimodal_agent,
  title  = {MCU mc-multimodal-agent: AgentBeats Submission Wrapper for a Multimodal Minecraft Agent},
  author = {MadGAA-Lab},
  year   = {2026},
  url    = {https://github.com/MadGAA-Lab/MCU-mc-multimodal-agent},
  note   = {Formal AgentBeats / Amber submission wrapping win10ogod/mc-multimodal-agent}
}