Jun 16, 2024 | By Duncan Gichimu | Chatbot | Python

Building a Chatbot - Python gRPC Service

In this guide, we'll create a gRPC service in Python that uses LlamaIndex to make requests to OpenAI. We'll then dockerize the service and deploy it to an AWS EC2 instance.

Step 1: Setup Python Environment

First, set up a Python environment. Using JetBrains' PyCharm IDE simplifies this process. Create a new project and install the required packages:

shell
python -m pip install grpcio grpcio-tools llama-index python-dotenv
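
If you're not using PyCharm, a plain virtual environment works just as well. A minimal sketch using the standard venv module (the .venv directory name is only an example):

shell
python -m venv .venv
source .venv/bin/activate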

Step 2: Define gRPC Service

Create a protos directory in the root of your project and add a file named chat.proto:

plaintext
syntax = "proto3";

service Chat {
    rpc answer_question(ChatRequest) returns (stream ChatResponse);
}

message ChatMessage {
    string role = 1;  // Role will be either USER or SYSTEM
    string content = 2;
}

message ChatRequest {
    repeated ChatMessage prompts = 1;
}

message ChatResponse {
    string response = 1;
}

Generate the gRPC code from the proto file:

shell
python -m grpc_tools.protoc -I./protos --python_out=. --pyi_out=. --grpc_python_out=. ./protos/chat.proto

This will generate:

  • chat_pb2.py
  • chat_pb2.pyi
  • chat_pb2_grpc.py
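
Here, chat_pb2.py contains the ChatMessage, ChatRequest, and ChatResponse message classes, chat_pb2.pyi adds type stubs for them, and chat_pb2_grpc.py provides the ChatServicer base class the server will implement, along with a ChatStub class for clients.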

Step 3: Create gRPC Service

Create a .env file in the root directory. If you haven't already, initialize a git repository and add .env to your .gitignore:

plaintext
OPENAI_API_KEY=<Your OPENAI_API_KEY>
MODEL=gpt-3.5-turbo-0125
EMBEDDING_MODEL=text-embedding-3-small
TOP_K=3
SYSTEM_PROMPT=<Your SYSTEM_PROMPT>

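The settings module below also reads optional LLM_TEMPERATURE, LLM_MAX_TOKENS, EMBEDDING_DIM, CHUNK_SIZE, and CHUNK_OVERLAP variables; each falls back to a sensible default when it isn't set.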

Configure Chat Engine settings:

python
# settings.py
import os
from typing import Dict
from llama_index.core.settings import Settings
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

def llm_config_from_env() -> Dict:
    from llama_index.core.constants import DEFAULT_TEMPERATURE

    # MODEL comes from .env; LLM_TEMPERATURE and LLM_MAX_TOKENS are optional overrides.
    model = os.getenv("MODEL")
    temperature = os.getenv("LLM_TEMPERATURE", DEFAULT_TEMPERATURE)
    max_tokens = os.getenv("LLM_MAX_TOKENS")

    config = {
        "model": model,
        "temperature": float(temperature),
        "max_tokens": int(max_tokens) if max_tokens is not None else None,
    }
    return config

def embedding_config_from_env() -> Dict:
    # EMBEDDING_MODEL comes from .env; EMBEDDING_DIM is optional.
    model = os.getenv("EMBEDDING_MODEL")
    dimension = os.getenv("EMBEDDING_DIM")

    config = {
        "model": model,
        # OpenAIEmbedding expects the "dimensions" keyword (as in the OpenAI embeddings API).
        "dimensions": int(dimension) if dimension is not None else None,
    }
    return config

def init_settings():
    # Populate the global LlamaIndex Settings used throughout the service.
    llm_configs = llm_config_from_env()
    embedding_configs = embedding_config_from_env()

    Settings.llm = OpenAI(**llm_configs)
    Settings.embed_model = OpenAIEmbedding(**embedding_configs)
    Settings.chunk_size = int(os.getenv("CHUNK_SIZE", "1024"))
    Settings.chunk_overlap = int(os.getenv("CHUNK_OVERLAP", "20"))

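The server script in the next step registers a ChatServicer imported from chat_servicer.py, which isn't shown in this section. The following is a minimal sketch of what that servicer might look like; the use of SimpleChatEngine is an assumption, and a retrieval-backed chat engine could be used instead.

python
# chat_servicer.py -- illustrative sketch, not necessarily the original implementation
import os

import grpc
from llama_index.core.chat_engine import SimpleChatEngine
from llama_index.core.llms import ChatMessage, MessageRole

import chat_pb2
import chat_pb2_grpc


class ChatServicer(chat_pb2_grpc.ChatServicer):
    """Implements the answer_question server-streaming RPC from chat.proto."""

    async def answer_question(self, request, context):
        if not request.prompts:
            await context.abort(grpc.StatusCode.INVALID_ARGUMENT, "prompts must not be empty")

        # Map the protobuf ChatMessages (role USER or SYSTEM) onto LlamaIndex chat messages.
        history = [
            ChatMessage(role=MessageRole(m.role.lower()), content=m.content)
            for m in request.prompts[:-1]
        ]
        latest = request.prompts[-1].content

        # SimpleChatEngine talks directly to the LLM configured by init_settings();
        # an index-backed chat engine could be swapped in here for retrieval.
        engine = SimpleChatEngine.from_defaults(system_prompt=os.getenv("SYSTEM_PROMPT"))

        # Stream tokens back to the client as they are generated.
        response = await engine.astream_chat(latest, chat_history=history)
        async for token in response.async_response_gen():
            yield chat_pb2.ChatResponse(response=token)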

Create the server script:

python
# server.py
import asyncio
import logging
import grpc
from dotenv import load_dotenv
import chat_pb2_grpc
from chat_servicer import ChatServicer
from settings import init_settings

# Load environment variables and configure LlamaIndex before serving requests.
load_dotenv()
init_settings()

async def serve() -> None:
    server = grpc.aio.server()
    chat_pb2_grpc.add_ChatServicer_to_server(ChatServicer(), server)
    server.add_insecure_port("[::]:15006")  # Same port the Dockerfile exposes below.
    await server.start()
    logging.info("gRPC server listening on port 15006")
    await server.wait_for_termination()

if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    asyncio.run(serve())

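To try the service locally, run server.py and call it with a small test client. The sketch below uses the synchronous ChatStub generated earlier; the client.py file name and the prompt are just examples.

python
# client.py -- hypothetical test client
import grpc

import chat_pb2
import chat_pb2_grpc


def main() -> None:
    with grpc.insecure_channel("localhost:15006") as channel:
        stub = chat_pb2_grpc.ChatStub(channel)
        request = chat_pb2.ChatRequest(
            prompts=[chat_pb2.ChatMessage(role="USER", content="Hello, who are you?")]
        )
        # answer_question is a server-streaming RPC, so iterate over the responses.
        for chunk in stub.answer_question(request):
            print(chunk.response, end="", flush=True)
        print()


if __name__ == "__main__":
    main()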

Step 4: Dockerize the Application

Generate a requirements.txt file:

shell
pip freeze > requirements.txt

Create a Dockerfile:

plaintext
FROM python:3.10

WORKDIR /app

COPY requirements.txt .

RUN pip install --no-cache-dir -r requirements.txt

COPY . .

EXPOSE 15006

CMD ["python", "server.py"]
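
With the Dockerfile in place, you can build and run the image locally before deploying to EC2. The image name below is only an example; passing the .env file gives the container the same configuration used during development:

shell
docker build -t chatbot-grpc .
docker run --env-file .env -p 15006:15006 chatbot-grpc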