We’ll start by implementing the non-streaming part. Let’s begin by modeling our request:

from typing import List, Optional

from pydantic import BaseModel

class ChatMessage(BaseModel):
    role: str
    content: str

class ChatCompletionRequest(BaseModel):
    model: str = "mock-gpt-model"
    messages: List[ChatMessage]
    max_tokens: Optional[int] = 512
    temperature: Optional[float] = 0.1
    stream: Optional[bool] = False

The Pydantic model represents the request from the client, aiming to replicate the API reference. For the sake of brevity, this model does not implement the entire spec, only the bare bones needed for it to work. If you’re missing a parameter that is part of the API spec (like top_p), you can simply add it to the model.

The ChatCompletionRequest models the parameters OpenAI uses in its requests. The chat API spec requires a list of ChatMessage objects (like a chat history; the client is usually in charge of keeping it and feeding it back in with every request). Each chat message has a role attribute (usually system, assistant, or user) and a content attribute containing the actual message text.
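To make the "client keeps the history" point concrete, here is a minimal sketch of how a client might build up the messages list between turns. The add_message helper and the sample message texts are hypothetical, introduced only for illustration; only the role and content keys and the model and messages fields come from the models above.

```python
from typing import Dict, List

Message = Dict[str, str]


def add_message(history: List[Message], role: str, content: str) -> List[Message]:
    """Return a new history list with one more message appended (hypothetical helper)."""
    return history + [{"role": role, "content": content}]


# The client owns the history and resends all of it with every request.
history: List[Message] = [{"role": "system", "content": "You are a helpful assistant."}]
history = add_message(history, "user", "Say this is a test")

# After the server replies, the client stores the reply before the next turn.
history = add_message(history, "assistant", "This is a test")
history = add_message(history, "user", "Thanks!")

# This dict mirrors the model and messages fields of ChatCompletionRequest.
request_body = {"model": "mock-gpt-model", "messages": history}
```

The server stays stateless under this scheme: everything it needs to answer arrives in the request body.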

Next, we’ll write our FastAPI chat completions endpoint:

import time

from fastapi import FastAPI

app = FastAPI(title="OpenAI-compatible API")

@app.post("/chat/completions")
async def chat_completions(request: ChatCompletionRequest):
    # echo the last message if the client sent any
    if request.messages:
        resp_content = "As a mock AI Assistant, I can only echo your last message: " + request.messages[-1].content
    else:
        resp_content = "As a mock AI Assistant, I can only echo your last message, but there were no messages!"

    return {
        "id": "1337",
        "object": "chat.completion",
        "created": int(time.time()),  # Unix timestamp in whole seconds
        "model": request.model,
        "choices": [{
            "message": ChatMessage(role="assistant", content=resp_content)
        }]
    }

That simple.
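If a client expects more of the fields listed in the API reference, the response dict can be filled out a little further. The following is only a sketch using the standard library: build_chat_completion is a hypothetical helper, and the id scheme is made up for illustration. The index and finish_reason fields are part of the chat completion response in the API reference.

```python
import time
import uuid
from typing import Any, Dict


def build_chat_completion(model: str, content: str) -> Dict[str, Any]:
    """Build a response dict with a few more fields from the API reference."""
    return {
        "id": f"chatcmpl-{uuid.uuid4().hex}",  # hypothetical id scheme
        "object": "chat.completion",
        "created": int(time.time()),  # Unix timestamp in whole seconds
        "model": model,
        "choices": [
            {
                "index": 0,
                "message": {"role": "assistant", "content": content},
                "finish_reason": "stop",  # the mock always finishes normally
            }
        ],
    }
```

The endpoint could return build_chat_completion(request.model, resp_content) in place of the inline dict.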

Testing our implementation

Assuming both code blocks are in a file called main.py, we’ll install the Python libraries we need in our environment of choice (it’s always best to create a new one) with pip install fastapi uvicorn openai, then launch the server from a terminal:

uvicorn main:app

Using another terminal (or by launching the server in the background), we will open a Python console and copy-paste the following code, taken straight from OpenAI’s Python Client Reference:

from openai import OpenAI

# init client and connect to localhost server
client = OpenAI(
    api_key="fake-api-key",
    base_url="http://localhost:8000"  # assumed: uvicorn's default address; change the port if needed
)

# call API
chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "Say this is a test",
        }
    ],
    model="gpt-1337-turbo-pro-max",
)

# print the top "choice"
print(chat_completion.choices[0].message.content)

If you’ve done everything correctly, the server’s response should be printed. It’s also worth inspecting the chat_completion object to confirm that all relevant attributes match what our server sent. You should see something like this:

[Image: the inspected chat_completion object. Code by the author, formatted using Carbon.]
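To look at the raw JSON without the openai client, the same request can be sent with the standard library alone. This is a minimal sketch, assuming the server is listening at http://localhost:8000 (uvicorn's default address); build_request and send are hypothetical helper names, not part of any library.

```python
import json
import urllib.request
from typing import Any, Dict


def build_request(base_url: str, user_text: str) -> urllib.request.Request:
    """Prepare a POST to the mock /chat/completions endpoint."""
    payload = {
        "model": "mock-gpt-model",
        "messages": [{"role": "user", "content": user_text}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(req: urllib.request.Request) -> Dict[str, Any]:
    """Send the request and decode the JSON body (needs the server running)."""
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Assumed address: uvicorn's default host and port.
req = build_request("http://localhost:8000", "Say this is a test")
# With the server up, send(req)["choices"][0]["message"]["content"] holds the echo.
```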

As LLM generation tends to be slow (computationally expensive), it’s worth streaming your generated content back to the client, so that the user can see the response as it’s being generated, without having to wait for it to finish. If you recall, we gave ChatCompletionRequest a boolean stream property — this lets the client request that the data be streamed back to it, rather than sent at once.

This makes things just a bit more complex. We will create a generator function to wrap our mock response (in a real-world scenario, we would want a generator hooked up to our LLM generation):

import asyncio
import json
import time

async def _resp_async_generator(text_resp: str):
    # let's pretend every word is a token and return it over time
    tokens = text_resp.split(" ")

    for i, token in enumerate(tokens):
        chunk = {
            "id": i,
            "object": "chat.completion.chunk",
            "created": int(time.time()),
            "model": "blah",
            "choices": [{"delta": {"content": token + " "}}],
        }
        # each chunk goes out as a server-sent event: "data: <json>" plus a blank line
        yield f"data: {json.dumps(chunk)}\n\n"
        await asyncio.sleep(1)
    # tell the client the stream is finished
    yield "data: [DONE]\n\n"
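To make the wire format concrete, here is a rough sketch of what a client does with the text this generator emits: read each "data: ..." event, stop at [DONE], and stitch the delta contents back together. The iter_delta_contents function and the sample body are hypothetical, written only to mirror the event shape used above; the real openai client does its own parsing.

```python
import json
from typing import Iterable, Iterator


def iter_delta_contents(lines: Iterable[str]) -> Iterator[str]:
    """Yield the delta content of each "data: ..." event until [DONE]."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data: "):
            continue  # skip the blank separator lines between events
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # end-of-stream marker, not JSON
        chunk = json.loads(payload)
        yield chunk["choices"][0]["delta"].get("content", "")


# A sample body in the shape the generator above produces.
sample_body = (
    'data: {"choices": [{"delta": {"content": "Hello "}}]}\n\n'
    'data: {"choices": [{"delta": {"content": "world "}}]}\n\n'
    "data: [DONE]\n\n"
)

text = "".join(iter_delta_contents(sample_body.splitlines()))
print(text)
```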

And now, we’ll modify our original endpoint to return a StreamingResponse when stream==True:

import time

from fastapi import FastAPI
from starlette.responses import StreamingResponse

app = FastAPI(title="OpenAI-compatible API")

@app.post("/chat/completions")
async def chat_completions(request: ChatCompletionRequest):
    if request.messages:
        resp_content = "As a mock AI Assistant, I can only echo your last message: " + request.messages[-1].content
    else:
        resp_content = "As a mock AI Assistant, I can only echo your last message, but there wasn't one!"

    if request.stream:
        # the generator emits server-sent events, so use the SSE media type
        return StreamingResponse(_resp_async_generator(resp_content), media_type="text/event-stream")

    return {
        "id": "1337",
        "object": "chat.completion",
        "created": int(time.time()),  # Unix timestamp in whole seconds
        "model": request.model,
        "choices": [{
            "message": ChatMessage(role="assistant", content=resp_content)
        }]
    }

Testing the streaming implementation

After restarting the uvicorn server, we’ll open up a Python console and put in this code (again, taken from OpenAI’s library docs):

from openai import OpenAI

# init client and connect to localhost server
client = OpenAI(
    api_key="fake-api-key",
    base_url="http://localhost:8000"  # assumed: uvicorn's default address; change the port if needed
)

stream = client.chat.completions.create(
    model="mock-gpt-model",
    messages=[{"role": "user", "content": "Say this is a test"}],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "")

You should see each word in the server’s response being slowly printed, mimicking token generation. We can inspect the last chunk object to see something like this:

[Image: the last chunk object. Code by the author, formatted using Carbon.]

Putting it all together

Finally, in the gist below, you can see the entire code for the server.
