Text generation
Crustoff implements the OpenAI chat, completions and embeddings endpoints. Anything written for
the OpenAI API works by changing only the base_url and api_key.
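As a minimal sketch of what changing only the base_url and api_key can look like with the official OpenAI Python SDK: the URL below and the environment variable names `CRUSTOFF_BASE_URL` and `CRUSTOFF_API_KEY` are hypothetical placeholders, not values taken from this documentation, so substitute the ones from your own account.

```python
import os

# Hypothetical placeholder; replace with the base URL for your Crustoff account.
DEFAULT_BASE_URL = "https://api.crustoff.example/v1"


def client_settings(env=None):
    """Collect the two settings that differ from a stock OpenAI setup."""
    env = os.environ if env is None else env
    return {
        # Both variable names are assumptions made for this sketch.
        "base_url": env.get("CRUSTOFF_BASE_URL", DEFAULT_BASE_URL),
        "api_key": env.get("CRUSTOFF_API_KEY", ""),
    }


def make_client(env=None):
    """Build an OpenAI SDK client that points at Crustoff."""
    # Imported lazily so client_settings() works without the SDK installed.
    from openai import OpenAI

    return OpenAI(**client_settings(env))
```

Calling `client = make_client()` would then give the `client` object that the examples on this page use.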
Chat completions
POST /v1/chat/completions
# client is an OpenAI SDK client configured with your Crustoff base_url and api_key.
resp = client.chat.completions.create(
    model="qwen2.5-7b-instruct",
    messages=[
        {"role": "system", "content": "You are concise."},
        {"role": "user", "content": "Explain backpropagation in two sentences."},
    ],
    temperature=0.7,
    max_tokens=256,
)
print(resp.choices[0].message.content)
Common parameters
| Field | Type | Notes |
|---|---|---|
| model | string | A text model id (see Models) |
| messages | array | Standard OpenAI chat messages |
| max_tokens | int | Output cap. Clamped to the model’s maximum |
| temperature, top_p | float | Sampling controls |
| stream | bool | Stream tokens as server-sent events |
Unrecognised parameters (tools, response_format, logprobs, …) are forwarded to the model
untouched, so you can use them as you would with OpenAI.
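To illustrate the pass-through behaviour, here is a rough sketch of a request body for POST /v1/chat/completions that mixes the common fields with one forwarded field. The helper `build_chat_body` is hypothetical and exists only for this example. `response_format={"type": "json_object"}` is the OpenAI JSON-mode value, and whether a given model honours it is an assumption, not something this page states.

```python
import json


def build_chat_body(model, messages, **extra):
    """Return the JSON body for a chat completion request.

    Extra keyword arguments (tools, response_format, logprobs and so on)
    are placed in the body unchanged, mirroring how they are forwarded.
    """
    body = {"model": model, "messages": messages}
    body.update(extra)
    return json.dumps(body)


body = build_chat_body(
    "qwen2.5-7b-instruct",
    [{"role": "user", "content": "Reply with a JSON object that has a 'colour' key."}],
    max_tokens=128,
    response_format={"type": "json_object"},  # forwarded field
)
print(body)
```

With the SDK, the same fields would be passed as keyword arguments to `client.chat.completions.create(...)`.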
Streaming
Set stream: true to receive tokens as they’re generated:
stream = client.chat.completions.create(
    model="qwen2.5-7b-instruct",
    messages=[{"role": "user", "content": "Count to five."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
Streaming responses are billed on the exact tokens delivered. If a client disconnects mid-stream, you’re only charged for what was sent.
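For clients that read the event stream directly instead of going through the SDK, the sketch below shows one way to pull text out of server-sent events. It assumes Crustoff follows the OpenAI wire format (`data: {json}` lines terminated by `data: [DONE]`), which this page implies but does not spell out. The function name `iter_content_deltas` and the sample lines are made up for illustration.

```python
import json


def iter_content_deltas(lines):
    """Yield text fragments from an iterable of SSE lines (as str)."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # assumed end-of-stream sentinel, as in the OpenAI format
        chunk = json.loads(data)
        choices = chunk.get("choices") or []
        if not choices:
            continue
        delta = choices[0].get("delta", {}).get("content")
        if delta:
            yield delta


# Hand-written sample events, not captured from a real response.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "One, "}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "two."}}]}',
    "",
    "data: [DONE]",
]
print("".join(iter_content_deltas(sample)))  # One, two.
```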
Embeddings
POST /v1/embeddings
Embeddings are billed on input tokens only, since there is no completion.
emb = client.embeddings.create(
    model="qwen2.5-7b-instruct",
    input="The quick brown fox.",
)
print(len(emb.data[0].embedding))
Billing
Text is billed per token: a per-million input rate plus a per-million output rate, charged in micro-dollar credits. See Pricing & billing.
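The per-token arithmetic can be sketched as follows. The rates are hypothetical (they are not Crustoff prices), and rounding fractional micro-dollars up is an assumption made for this example. The actual rates and rounding rule are covered under Pricing & billing.

```python
TOKENS_PER_MILLION = 1_000_000


def cost_micro_dollars(input_tokens, output_tokens, input_rate, output_rate):
    """Estimate a request's cost in micro-dollar credits.

    Rates are expressed in micro-dollars per million tokens.
    """
    total = input_tokens * input_rate + output_tokens * output_rate
    # Ceiling division; the real rounding rule is an assumption here.
    return -(-total // TOKENS_PER_MILLION)


# Hypothetical rates: $0.20 per million input tokens, $0.60 per million output tokens.
INPUT_RATE = 200_000
OUTPUT_RATE = 600_000

# Chat request with 1,200 prompt tokens and 256 completion tokens.
print(cost_micro_dollars(1200, 256, INPUT_RATE, OUTPUT_RATE))  # 394

# Embedding request: input tokens only, so output_tokens is 0.
print(cost_micro_dollars(5, 0, INPUT_RATE, OUTPUT_RATE))  # 1
```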