Querying the Hugging Face Hub with a Tiny LLM and FastMCP
Running a 500-million-parameter model as an intelligent agent sounds ambitious. Pair it with structured context from an MCP server and it becomes surprisingly capable: the model handles reasoning, the server handles data access, and neither needs to know how the other works internally. This tutorial builds on the FastMCP Hugging Face Hub server — four MCP resources exposing Hub model and dataset metadata — and shows how to wire Qwen2.5-0.5B-Instruct to it so the model can answer live questions about the catalogue without any parametric knowledge of the Hub.
Prerequisites
pip install fastmcp huggingface_hub transformers torch nest_asyncio
The server from the previous article must be available as hf_server.py. If not, run the companion notebook to regenerate it — the server is a single file written by one cell.
Step 1: Load the Model
Qwen2.5-0.5B-Instruct is instruction-tuned, ships with a chat template, and fits comfortably in CPU memory. It downloads roughly 1 GB on first use and is cached by huggingface_hub.
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
MODEL_ID = "Qwen/Qwen2.5-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
MODEL_ID,
torch_dtype=torch.float32, # use torch.bfloat16 if you have a GPU
)
model.eval()
model.eval() disables dropout so inference is deterministic. The choice of 0.5B is intentional: when context is provided explicitly, parametric knowledge matters less, and a smaller model answers faster with far less memory.
Step 2: Read a Resource from the MCP Server
PythonStdioTransport spawns hf_server.py as a subprocess with an explicit log_file, which avoids the fileno UnsupportedOperation error that occurs when Jupyter’s ipykernel OutStream is passed as stderr to subprocess.Popen. nest_asyncio lets asyncio.run() work inside Jupyter’s existing event loop. read_resource returns a list of content objects; .text gives the raw string payload.
import asyncio
import sys
import nest_asyncio
from fastmcp import Client
from fastmcp.client.transports import PythonStdioTransport
nest_asyncio.apply()
async def fetch_resource(uri: str) -> str:
transport = PythonStdioTransport(
script_path="hf_server.py",
python_cmd=sys.executable,
log_file=open("mcp_server.log", "w"), # real file → has fileno()
)
async with Client(transport) as client:
result = await client.read_resource(uri)
return result[0].text
catalogue = asyncio.run(fetch_resource("hf://models"))
print(catalogue[:300])
The catalogue is a compact JSON array — id, likes, downloads per model. That structure is deliberate: small models parse clean JSON reliably; verbose free-text descriptions add tokens without adding signal.
Step 3: Inject Context and Generate an Answer
The pattern is context injection: the MCP resource goes into the system message, the user’s question goes in the user turn. The model never calls the MCP server itself — the orchestration layer fetches the data and hands it to the model as plain text.
def ask(question: str, context: str) -> str:
messages = [
{
"role": "system",
"content": (
"You are a helpful assistant with access to a Hugging Face Hub catalogue.\n\n"
f"Catalogue (JSON):\n{context}"
),
},
{"role": "user", "content": question},
]
# apply_chat_template formats messages with Qwen2.5's special tokens
prompt = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
output_ids = model.generate(
**inputs,
max_new_tokens=150,
do_sample=False, # greedy for reproducibility
pad_token_id=tokenizer.eos_token_id,
)
# slice off the echoed prompt tokens before decoding
new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
return tokenizer.decode(new_tokens, skip_special_tokens=True)
apply_chat_template inserts the <|im_start|> / <|im_end|> tokens that Qwen2.5 expects. Without them the model produces incoherent output. Slicing output_ids from inputs["input_ids"].shape[1] strips the prompt echo so only the generated answer is decoded.
Step 4: Full Pipeline
The demo fetches two different resources and asks a question about each. Because fetch_resource is async, the natural approach is a single async def main that awaits both reads before calling the synchronous ask function.
async def main():
# 1. Top models catalogue → find the most downloaded
catalogue = await fetch_resource("hf://models")
answer = ask(
"Which model in the catalogue has the most downloads? Give the model ID and the download count.",
catalogue,
)
print("Most downloaded model:")
print(answer)
# 2. Specific model detail → extract task and tags
detail = await fetch_resource("hf://models/google/gemma-2-2b")
answer2 = ask(
"What task is this model designed for? List three of its tags.",
detail,
)
print("\nGemma-2-2b detail:")
print(answer2)
asyncio.run(main())
Figure: The orchestrator fetches data from the MCP server and injects it into the LLM’s context; the model never calls the server directly.
With the catalogue in the system prompt, Qwen2.5-0.5B reliably identifies the most-downloaded model and extracts structured fields from the detail JSON — both tasks require only careful reading, not world knowledge.
Conclusion
Context injection cleanly separates data access from reasoning. The FastMCP server owns the interface to the Hugging Face Hub; the LLM owns interpretation. Replacing the server with a different data source — an internal model registry, a W&B experiment tracker, a Weights & Biases artifact store — requires no changes to the inference code. Swapping the model requires no changes to the server.
Three natural next steps:
- Dynamic tool use: expose
api.list_models(search=query)as an@mcp.tool()so the model can issue targeted searches rather than receiving a static catalogue. - Larger models: the same orchestration code works with any Hugging Face model; a 7B instruction-tuned model produces richer, more reliable answers with identical context.
- Streaming: pass a
TextStreamertomodel.generateto stream tokens as they are produced, which improves perceived latency for interactive applications.
For the server implementation, see FastMCP Server with Hugging Face Hub Resources.
The companion notebook with all the code from this article is available on GitHub.