Text generation

Crustoff implements the OpenAI chat, completions and embeddings endpoints. Anything written for the OpenAI API works by changing only the base_url and api_key.

Chat completions

POST /v1/chat/completions

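from openai import OpenAI

# Placeholder values: substitute your own Crustoff base URL and API key.
client = OpenAI(base_url="https://<your-crustoff-host>/v1", api_key="<your-api-key>")

resp = client.chat.completions.create(
    model="qwen2.5-7b-instruct",
    messages=[
        {"role": "system", "content": "You are concise."},
        {"role": "user", "content": "Explain backpropagation in two sentences."},
    ],
    temperature=0.7,  # sampling temperature
    max_tokens=256,   # output cap, clamped to the model's maximum
)
# The generated text is in the first choice's message.
print(resp.choices[0].message.content)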
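
The SDK call above comes down to a single JSON POST. The following is a minimal sketch of that request built with the Python standard library only. The host in BASE_URL and the key in API_KEY are placeholders, not real values, and build_chat_request is a hypothetical helper name. The Bearer authorization header is the scheme the OpenAI SDK uses, assumed here to apply unchanged.

```python
import json
import urllib.request

# Placeholders (assumptions): replace with your real base URL and key.
BASE_URL = "https://crustoff.example/v1"
API_KEY = "sk-placeholder"


def build_chat_request(base_url: str, api_key: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request to /v1/chat/completions."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_chat_request(
    BASE_URL,
    API_KEY,
    {
        "model": "qwen2.5-7b-instruct",
        "messages": [{"role": "user", "content": "Explain backpropagation in two sentences."}],
        "temperature": 0.7,
        "max_tokens": 256,
    },
)
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req), then json.load() the response.
```

Building the request separately from sending it makes the body and headers easy to inspect before any tokens are billed.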

Common parameters

Field                Type    Notes
model                string  A text model id (see Models)
messages             array   Standard OpenAI chat messages
max_tokens           int     Output cap. Clamped to the model’s maximum
temperature, top_p   float   Sampling controls
stream               bool    Stream tokens as server-sent events

Unrecognised parameters (tools, response_format, logprobs, …) are forwarded to the model untouched, so you can use them as you would with OpenAI.
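
As a rough illustration of what forwarding means for the request body, the sketch below merges extra fields into the same JSON object as the common parameters. build_chat_body is a hypothetical helper, not part of any SDK, and whether a given model honours response_format is an assumption that depends on the model.

```python
import json


def build_chat_body(model: str, messages: list, **extra) -> dict:
    """Return a chat request body; extra keyword fields ride along unchanged."""
    body = {"model": model, "messages": messages}
    body.update(extra)  # e.g. tools, response_format, logprobs
    return body


body = build_chat_body(
    "qwen2.5-7b-instruct",
    [{"role": "user", "content": "Reply with a JSON object."}],
    max_tokens=128,
    response_format={"type": "json_object"},  # forwarded as-is (assumed supported by the model)
)
print(json.dumps(body, indent=2))
```

With the OpenAI Python SDK, the same fields can be passed directly as keyword arguments to client.chat.completions.create.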

Streaming

Set stream: true to receive tokens as they’re generated:

stream = client.chat.completions.create(
    model="qwen2.5-7b-instruct",
    messages=[{"role": "user", "content": "Count to five."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)

Streaming responses are billed on the exact tokens delivered. If a client disconnects mid-stream, you’re only charged for what was sent.
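
On the wire, an OpenAI-style stream arrives as server-sent events: one "data:" line per chunk, ending with "data: [DONE]". The sketch below parses such lines without the SDK. It assumes Crustoff follows that OpenAI framing exactly; iter_deltas is a hypothetical helper and the sample lines are made up for illustration.

```python
import json
from typing import Iterable, Iterator


def iter_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield content fragments from OpenAI-style SSE lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and any non-data lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        choices = chunk.get("choices") or []
        if not choices:
            continue
        delta = choices[0].get("delta", {}).get("content")
        if delta:  # the first chunk usually carries only the role
            yield delta


# Illustrative sample, not captured from a real response.
sample = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    "",
    'data: {"choices":[{"delta":{"content":"1, 2"}}]}',
    "",
    'data: {"choices":[{"delta":{"content":", 3"}}]}',
    "",
    "data: [DONE]",
]
print("".join(iter_deltas(sample)))
```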

Embeddings

POST /v1/embeddings

Embeddings are billed on input tokens only, since there is no completion.

emb = client.embeddings.create(
    model="qwen2.5-7b-instruct",
    input="The quick brown fox.",
)
print(len(emb.data[0].embedding))

Billing

Text is billed per token: a per-million input rate plus a per-million output rate, charged in micro-dollar credits. See Pricing & billing.
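
The arithmetic can be sketched as follows. The rates are invented for illustration and are not Crustoff prices, and rounding up to a whole micro-dollar is an assumption; see Pricing & billing for the real figures. One dollar is 1,000,000 micro-dollars, so rates are written here as integer micro-dollars per million tokens.

```python
MICRO_PER_USD = 1_000_000
TOKENS_PER_UNIT = 1_000_000  # rates are quoted per million tokens

# Hypothetical rates (assumptions): $0.20 per million input, $0.60 per million output.
INPUT_RATE = 200_000   # micro-dollars per million input tokens
OUTPUT_RATE = 600_000  # micro-dollars per million output tokens


def cost_micro_dollars(input_tokens: int, output_tokens: int = 0) -> int:
    """Return the charge in micro-dollars, rounded up (rounding is an assumption)."""
    total = input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE
    return -(-total // TOKENS_PER_UNIT)  # integer ceiling division


chat_cost = cost_micro_dollars(input_tokens=1200, output_tokens=256)
embedding_cost = cost_micro_dollars(input_tokens=1200)  # embeddings: input only

print(chat_cost, "micro-dollars =", chat_cost / MICRO_PER_USD, "USD")
print(embedding_cost, "micro-dollars =", embedding_cost / MICRO_PER_USD, "USD")
```

Keeping the calculation in integers avoids floating-point drift when many small charges are summed.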