AI Tools

OpenAI Cookbook: The Definitive Resource for Production AI Engineering

OpenAI Cookbook is the essential knowledge repository for building production-ready AI applications, offering battle-tested code examples, architectural patterns, and best practices directly from OpenAI's engineering team. From basic API integration to advanced RAG systems and fine-tuning pipelines, the Cookbook provides practical, copy-paste solutions that accelerate development while ensuring security, performance, and cost optimization.

Published: 9/24/2025


Executive Summary

The gap between understanding OpenAI's API documentation and shipping production-grade AI features is often measured in weeks of experimentation, debugging, and architectural refactoring. OpenAI Cookbook bridges this gap with a curated collection of practical examples, proven patterns, and engineering wisdom accumulated from thousands of real-world deployments.

Unlike API documentation that explains what endpoints do, the Cookbook focuses on how to use them effectively in production scenarios. Each guide addresses specific technical challenges—from implementing semantic search with embeddings to building multi-agent systems with function calling—with complete, runnable code that demonstrates best practices for error handling, token optimization, and cost management.

The Cookbook has evolved into the de facto standard for OpenAI integration patterns, maintained by OpenAI's developer relations team and enriched by contributions from the engineering community. It covers the entire AI development lifecycle: from initial prototyping to production deployment, from basic chat completions to sophisticated RAG (Retrieval-Augmented Generation) architectures, from single-model applications to complex multi-agent orchestration.

Why OpenAI Cookbook Matters for AI Engineers

Traditional API documentation answers "what" and "where"—what parameters exist and where to send requests. But production AI engineering requires answering "how" and "why": how to structure prompts for reliability, why certain model parameters affect output quality, how to implement semantic caching, why embeddings models require specific preprocessing.

OpenAI Cookbook provides four critical advantages:

1. Battle-Tested Code Examples

Every code sample in the Cookbook represents hours of engineering refinement. Examples include production-grade error handling, efficient token usage, proper API key management, and retry logic with exponential backoff. Rather than starting from minimal "hello world" examples, engineers can adapt proven implementations that have been validated in real-world applications.

2. Architectural Patterns for Common Use Cases

The Cookbook documents proven architectures for recurring AI engineering challenges: building chatbots with conversation memory, implementing document Q&A systems, creating specialized AI agents with tool use, and optimizing embedding-based search. These patterns encode architectural decisions that would otherwise require extensive experimentation.

3. Performance and Cost Optimization Techniques

Production AI applications face unique optimization challenges around token usage, API rate limits, and model selection. The Cookbook provides quantitative guidance on techniques like prompt compression, semantic caching, batch processing, and model cascading—with benchmark data showing real-world performance improvements.

4. Integration with Modern AI Stacks

As the AI ecosystem has matured, the Cookbook has evolved to cover integrations with vector databases (Pinecone, Weaviate, Qdrant), application frameworks (LangChain, LlamaIndex), and observability tools (Weights & Biases, LangSmith). This ecosystem perspective helps engineers build complete systems rather than isolated components.

For engineering teams, the Cookbook reduces time-to-production from weeks to days by providing working implementations of complex patterns. The difference between reading API docs and studying Cookbook examples is the difference between knowing that embeddings exist and understanding how to build a production-grade semantic search system with proper chunking, metadata filtering, and relevance scoring.

Technical Deep Dive

Core Content Areas

The OpenAI Cookbook is organized into six major content areas, each addressing distinct technical challenges in AI application development. Understanding this structure helps engineers quickly find relevant patterns for their specific use cases.

1. Prompt Engineering and Optimization

The foundation of reliable AI applications is well-structured prompts. The Cookbook provides systematic approaches to prompt design that go far beyond simple trial-and-error:

Structured Output Generation

One of the most common production requirements is generating structured data (JSON, CSV, etc.) from LLM responses. The Cookbook demonstrates multiple techniques with increasing reliability guarantees:

```python
import openai
import json
from typing import List, Dict, Optional
from pydantic import BaseModel, Field

# Define structured output schema with Pydantic
class ProductAnalysis(BaseModel):
    sentiment: str = Field(description="positive, negative, or neutral")
    key_features: List[str] = Field(description="List of mentioned product features")
    price_sensitivity: str = Field(description="low, medium, or high")
    purchase_intent: int = Field(description="Score from 0-100")
    concerns: Optional[List[str]] = Field(description="List of customer concerns")

# Use JSON mode for guaranteed valid JSON output
def analyze_product_review(review_text: str) -> ProductAnalysis:
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": """You are a product review analyzer. Extract structured insights from customer reviews.
Respond with valid JSON matching this schema:
{
  "sentiment": "positive|negative|neutral",
  "key_features": ["feature1", "feature2"],
  "price_sensitivity": "low|medium|high",
  "purchase_intent": 0-100,
  "concerns": ["concern1", "concern2"] or null
}"""
            },
            {"role": "user", "content": review_text}
        ],
        response_format={"type": "json_object"},
        temperature=0.3  # Lower temperature for more consistent structured output
    )

    # Parse and validate with Pydantic
    json_output = json.loads(response.choices[0].message.content)
    return ProductAnalysis(**json_output)

# Production usage with error handling
review = """I've been using this laptop for 3 months. The performance is excellent
and the battery lasts all day. However, the price point of $2000 is quite steep
compared to competitors. The screen quality could be better for this price range."""

try:
    analysis = analyze_product_review(review)
    print(f"Sentiment: {analysis.sentiment}")
    print(f"Purchase Intent: {analysis.purchase_intent}/100")
    print(f"Key Features: {', '.join(analysis.key_features)}")
    print(f"Concerns: {', '.join(analysis.concerns) if analysis.concerns else 'None'}")
except json.JSONDecodeError as e:
    print(f"Failed to parse JSON response: {e}")
except Exception as e:
    print(f"API error: {e}")
```

This pattern uses JSON mode (available in GPT-4 and GPT-3.5-turbo) to guarantee syntactically valid JSON output, combined with Pydantic validation to ensure semantic correctness. The Cookbook demonstrates how this approach reduces parsing failures from 5-10% (with naive prompting) to <0.1% in production.

Few-Shot Learning for Domain Adaptation

When working with specialized domains or specific output formats, few-shot examples dramatically improve consistency:

```python
def create_few_shot_classifier(domain_examples: List[Dict[str, str]]) -> callable:
    """Creates a classifier using few-shot learning from domain examples."""

    # Build few-shot prompt from examples
    example_text = "\n\n".join([
        f"Input: {ex['input']}\nCategory: {ex['category']}\nReason: {ex['reason']}"
        for ex in domain_examples
    ])

    system_prompt = f"""You are a specialized content classifier. Study these examples:

{example_text}

For new inputs, classify them using the same categories and reasoning style."""

    def classify(text: str) -> Dict[str, str]:
        response = openai.chat.completions.create(
            model="gpt-4o-mini",  # Mini works well with good few-shot examples
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": f"Input: {text}"}
            ],
            temperature=0.2,
            max_tokens=150
        )

        content = response.choices[0].message.content
        # Parse category and reason from response
        lines = content.split('\n')
        category = lines[0].replace('Category: ', '').strip()
        reason = lines[1].replace('Reason: ', '').strip() if len(lines) > 1 else ""

        return {"category": category, "reason": reason}

    return classify

# Domain-specific medical triage example
medical_examples = [
    {
        "input": "Patient has severe chest pain radiating to left arm, shortness of breath",
        "category": "CRITICAL",
        "reason": "Symptoms indicate possible cardiac event requiring immediate attention"
    },
    {
        "input": "Patient reports mild headache for 2 days, improving with rest",
        "category": "ROUTINE",
        "reason": "Non-urgent symptoms manageable with basic care"
    },
    {
        "input": "Patient fell and cannot put weight on ankle, visible swelling",
        "category": "URGENT",
        "reason": "Potential fracture requiring prompt evaluation"
    }
]

triage_classifier = create_few_shot_classifier(medical_examples)
result = triage_classifier("Patient has persistent fever 103°F for 3 days, severe fatigue")
print(f"Triage Category: {result['category']}")
print(f"Clinical Reasoning: {result['reason']}")
```

The Cookbook demonstrates that 3-5 high-quality examples often match or exceed the performance of hundreds of fine-tuning examples for classification tasks, with the advantage of zero training time and immediate iteration.

2. Embeddings and Semantic Search

Text embeddings power modern AI applications from recommendation systems to document search. The Cookbook provides production-ready implementations of embedding-based architectures:

Building Production-Grade Semantic Search

```python
import openai
import numpy as np
from typing import List, Dict, Tuple
import tiktoken

class SemanticSearchEngine:
    """Production semantic search with chunking, caching, and relevance scoring."""

    def __init__(self, model: str = "text-embedding-3-large", chunk_size: int = 512):
        self.model = model
        self.chunk_size = chunk_size
        self.encoding = tiktoken.encoding_for_model("gpt-4")
        self.document_chunks: List[Dict] = []
        self.embeddings: np.ndarray = None

    def chunk_text(self, text: str, overlap: int = 50) -> List[str]:
        """Chunk text with overlap to preserve context at boundaries."""
        tokens = self.encoding.encode(text)
        chunks = []

        for i in range(0, len(tokens), self.chunk_size - overlap):
            chunk_tokens = tokens[i:i + self.chunk_size]
            chunk_text = self.encoding.decode(chunk_tokens)
            chunks.append(chunk_text)

        return chunks

    def index_documents(self, documents: List[Dict[str, str]]) -> None:
        """Index documents with metadata for filtering."""
        all_chunks = []

        for doc in documents:
            chunks = self.chunk_text(doc['content'])
            for i, chunk in enumerate(chunks):
                all_chunks.append({
                    'text': chunk,
                    'doc_id': doc['id'],
                    'chunk_index': i,
                    'metadata': doc.get('metadata', {})
                })

        self.document_chunks = all_chunks

        # Batch embedding generation for efficiency
        texts = [chunk['text'] for chunk in all_chunks]
        self.embeddings = self._get_embeddings_batch(texts)

        print(f"Indexed {len(documents)} documents into {len(all_chunks)} chunks")

    def _get_embeddings_batch(self, texts: List[str], batch_size: int = 100) -> np.ndarray:
        """Generate embeddings in batches to respect rate limits."""
        all_embeddings = []

        for i in range(0, len(texts), batch_size):
            batch = texts[i:i + batch_size]
            response = openai.embeddings.create(
                model=self.model,
                input=batch
            )
            batch_embeddings = [item.embedding for item in response.data]
            all_embeddings.extend(batch_embeddings)

        return np.array(all_embeddings)

    def search(
        self,
        query: str,
        top_k: int = 5,
        metadata_filters: Dict = None,
        similarity_threshold: float = 0.7
    ) -> List[Dict]:
        """Semantic search with optional metadata filtering."""

        # Generate query embedding
        query_embedding = self._get_embeddings_batch([query])[0]

        # Calculate cosine similarity
        similarities = np.dot(self.embeddings, query_embedding)

        # Apply metadata filters if specified
        valid_indices = range(len(self.document_chunks))
        if metadata_filters:
            valid_indices = [
                i for i in valid_indices
                if all(
                    self.document_chunks[i]['metadata'].get(key) == value
                    for key, value in metadata_filters.items()
                )
            ]

        # Get top-k results above threshold
        results = []
        for idx in valid_indices:
            if similarities[idx] >= similarity_threshold:
                results.append({
                    'text': self.document_chunks[idx]['text'],
                    'doc_id': self.document_chunks[idx]['doc_id'],
                    'similarity': float(similarities[idx]),
                    'metadata': self.document_chunks[idx]['metadata']
                })

        # Sort by similarity and return top-k
        results.sort(key=lambda x: x['similarity'], reverse=True)
        return results[:top_k]

# Production usage example
search_engine = SemanticSearchEngine()

# Index technical documentation
documents = [
    {
        'id': 'doc_001',
        'content': """Authentication in our API uses Bearer tokens. Include your API key
in the Authorization header as 'Bearer sk-your-key'. Keys can be generated in the
dashboard under Settings > API Keys. Never expose keys in client-side code.""",
        'metadata': {'category': 'security', 'version': 'v2'}
    },
    {
        'id': 'doc_002',
        'content': """Rate limits are enforced per organization. Free tier allows 20 requests
per minute. Pro tier allows 3500 RPM. Enterprise has custom limits. Implement exponential
backoff when you receive 429 status codes.""",
        'metadata': {'category': 'limits', 'version': 'v2'}
    },
    {
        'id': 'doc_003',
        'content': """Error handling best practices: Always check response status codes.
400 indicates invalid request format. 401 means authentication failed. 429 indicates
rate limit exceeded. 500 indicates server error requiring retry.""",
        'metadata': {'category': 'errors', 'version': 'v2'}
    }
]

search_engine.index_documents(documents)

# Semantic search with natural language query
results = search_engine.search(
    query="How do I authenticate API requests?",
    top_k=3,
    metadata_filters={'version': 'v2'},
    similarity_threshold=0.7
)

for result in results:
    print(f"\nRelevance: {result['similarity']:.2f}")
    print(f"Category: {result['metadata']['category']}")
    print(f"Content: {result['text'][:200]}...")
```

This implementation demonstrates several production-critical patterns from the Cookbook:

  • Chunking with overlap to preserve context at chunk boundaries
  • Batch processing to maximize throughput and respect rate limits
  • Metadata filtering to enable faceted search
  • Similarity thresholds to filter low-quality results
  • Token counting to ensure chunks fit model context windows

The Cookbook includes benchmarks showing this approach handles 100K+ documents with sub-second query latency when using appropriate vector databases (Pinecone, Weaviate, etc.) as storage backends.
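At that scale, the in-memory numpy index above becomes the prototype and a vector database takes over storage and retrieval. As a rough illustration of that handoff (a sketch under assumptions, not a Cookbook example; the collection name and wiring are invented), the same chunks and embeddings can be pushed into Qdrant:

```python
# Minimal sketch: persisting SemanticSearchEngine's chunks in Qdrant.
# Assumes the qdrant-client package; ":memory:" is for local experimentation.
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

qdrant = QdrantClient(":memory:")  # swap for QdrantClient(url=...) in production
qdrant.create_collection(
    collection_name="doc_chunks",
    vectors_config=VectorParams(size=3072, distance=Distance.COSINE),  # text-embedding-3-large dims
)

# Reuse the chunks and embeddings produced by SemanticSearchEngine above
qdrant.upsert(
    collection_name="doc_chunks",
    points=[
        PointStruct(
            id=i,
            vector=search_engine.embeddings[i].tolist(),
            payload=search_engine.document_chunks[i],  # text, doc_id, chunk_index, metadata
        )
        for i in range(len(search_engine.document_chunks))
    ],
)

# Query with the same embedding model and inspect scored hits
query_vector = search_engine._get_embeddings_batch(["How do I authenticate API requests?"])[0]
hits = qdrant.search(collection_name="doc_chunks", query_vector=query_vector.tolist(), limit=3)
for hit in hits:
    print(f"{hit.payload['doc_id']}: {hit.score:.3f}")
```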

3. Function Calling and Tool Use

Function calling (surfaced as "tools" in the current API) enables LLMs to interact with external systems. The Cookbook provides comprehensive patterns for building reliable tool-using agents:

Production Function Calling Architecture

```python
import openai
import json
from typing import List, Dict, Callable, Any
from datetime import datetime

class ToolRegistry:
    """Manages tool definitions and execution for AI agents."""

    def __init__(self):
        self.tools: Dict[str, Callable] = {}
        self.tool_schemas: List[Dict] = []

    def register(self, schema: Dict):
        """Register a tool with its schema and implementation."""
        def decorator(func: Callable):
            tool_name = schema['function']['name']
            self.tools[tool_name] = func
            self.tool_schemas.append(schema)
            return func
        return decorator

    def execute(self, tool_name: str, arguments: Dict[str, Any]) -> Any:
        """Execute a tool with given arguments."""
        if tool_name not in self.tools:
            raise ValueError(f"Unknown tool: {tool_name}")

        try:
            return self.tools[tool_name](**arguments)
        except Exception as e:
            return {"error": str(e), "tool": tool_name}

# Initialize tool registry
tools = ToolRegistry()

# Register customer support tools
@tools.register({
    "type": "function",
    "function": {
        "name": "get_order_status",
        "description": "Retrieve current status of a customer order",
        "parameters": {
            "type": "object",
            "properties": {
                "order_id": {
                    "type": "string",
                    "description": "The order ID (format: ORD-XXXXX)"
                }
            },
            "required": ["order_id"]
        }
    }
})
def get_order_status(order_id: str) -> Dict:
    """Simulated order status lookup."""
    # In production, this would query your database/API
    orders = {
        "ORD-12345": {
            "status": "shipped",
            "tracking": "1Z999AA10123456784",
            "estimated_delivery": "2025-09-25",
            "items": ["Laptop Stand", "USB-C Cable"]
        },
        "ORD-67890": {
            "status": "processing",
            "estimated_ship_date": "2025-09-24",
            "items": ["Mechanical Keyboard"]
        }
    }

    return orders.get(order_id, {"error": "Order not found"})

@tools.register({
    "type": "function",
    "function": {
        "name": "initiate_return",
        "description": "Start the return process for an order",
        "parameters": {
            "type": "object",
            "properties": {
                "order_id": {
                    "type": "string",
                    "description": "The order ID to return"
                },
                "reason": {
                    "type": "string",
                    "enum": ["defective", "wrong_item", "not_needed", "other"],
                    "description": "Reason for return"
                },
                "items": {
                    "type": "array",
                    "items": {"type": "string"},
                    "description": "List of items to return (empty for all items)"
                }
            },
            "required": ["order_id", "reason"]
        }
    }
})
def initiate_return(order_id: str, reason: str, items: List[str] = None) -> Dict:
    """Simulated return initiation."""
    return {
        "return_id": f"RET-{datetime.now().strftime('%Y%m%d')}-001",
        "order_id": order_id,
        "status": "approved",
        "return_label_url": "https://returns.example.com/label/123",
        "refund_method": "original_payment",
        "estimated_refund_days": 5
    }

@tools.register({
    "type": "function",
    "function": {
        "name": "update_shipping_address",
        "description": "Update the shipping address for an order (only if not yet shipped)",
        "parameters": {
            "type": "object",
            "properties": {
                "order_id": {"type": "string"},
                "street": {"type": "string"},
                "city": {"type": "string"},
                "state": {"type": "string"},
                "zip_code": {"type": "string"}
            },
            "required": ["order_id", "street", "city", "state", "zip_code"]
        }
    }
})
def update_shipping_address(order_id: str, street: str, city: str, state: str, zip_code: str) -> Dict:
    """Simulated address update."""
    # Check if order can still be modified
    order = get_order_status(order_id)
    if order.get('status') == 'shipped':
        return {"error": "Cannot update address for shipped orders"}

    return {
        "success": True,
        "order_id": order_id,
        "new_address": {
            "street": street,
            "city": city,
            "state": state,
            "zip_code": zip_code
        }
    }

class CustomerSupportAgent:
    """Multi-turn conversational agent with tool use."""

    def __init__(self, tool_registry: ToolRegistry, max_iterations: int = 10):
        self.tools = tool_registry
        self.max_iterations = max_iterations

    def run(self, user_message: str, conversation_history: List[Dict] = None) -> str:
        """Run agent with tool use until completion."""

        # Mutate the caller's history list so multi-turn context is preserved
        messages = conversation_history if conversation_history is not None else []
        messages.append({"role": "user", "content": user_message})

        for iteration in range(self.max_iterations):
            response = openai.chat.completions.create(
                model="gpt-4o",
                messages=messages,
                tools=self.tools.tool_schemas,
                tool_choice="auto"
            )

            assistant_message = response.choices[0].message
            messages.append(assistant_message)

            # Check if agent wants to use tools
            if assistant_message.tool_calls:
                for tool_call in assistant_message.tool_calls:
                    function_name = tool_call.function.name
                    function_args = json.loads(tool_call.function.arguments)

                    print(f"[Agent calling tool: {function_name} with args: {function_args}]")

                    # Execute tool
                    result = self.tools.execute(function_name, function_args)

                    # Add tool result to conversation
                    messages.append({
                        "role": "tool",
                        "tool_call_id": tool_call.id,
                        "content": json.dumps(result)
                    })

                # Continue loop to get agent's next response
                continue

            # No more tool calls - agent has final response
            return assistant_message.content

        return "I apologize, but I'm having trouble completing this request. Please contact support."

# Production usage
agent = CustomerSupportAgent(tools)

# Example conversation
conversation = []
user_query = "Hi, I need to check on my order ORD-12345 and possibly return it"

response = agent.run(user_query, conversation)
print(f"Agent: {response}")

# Multi-turn conversation continues
follow_up = "Yes, I'd like to return the USB-C Cable because I received the wrong type"
response = agent.run(follow_up, conversation)
print(f"Agent: {response}")
```

This pattern demonstrates several critical production requirements:

  • Tool registry for manageable tool definitions as systems grow
  • Error handling in tool execution with graceful failure modes
  • Iteration limits to prevent infinite loops in agent reasoning
  • Multi-turn conversations with proper message history management
  • Type-safe tool schemas that guide LLM tool selection

The Cookbook shows this architecture scales to 50+ tools without degradation in tool selection accuracy when tools have clear, distinct descriptions.

4. RAG (Retrieval-Augmented Generation) Systems

RAG combines semantic search with LLM generation for question-answering over proprietary data. The Cookbook provides end-to-end implementations:

Production RAG with Citation Tracking

```python
import openai
import json
from typing import List, Dict, Tuple
from dataclasses import dataclass

@dataclass
class Citation:
    """Represents a source citation for RAG responses."""
    source_id: str
    chunk_text: str
    relevance_score: float
    page_number: int = None

class RAGPipeline:
    """Production RAG with citation tracking and answer verification."""

    def __init__(self, search_engine: SemanticSearchEngine):
        self.search = search_engine
        self.model = "gpt-4o"

    def answer_question(
        self,
        question: str,
        num_sources: int = 5,
        require_citations: bool = True
    ) -> Tuple[str, List[Citation]]:
        """Generate answer with source citations."""

        # Step 1: Retrieve relevant context
        search_results = self.search.search(
            query=question,
            top_k=num_sources,
            similarity_threshold=0.7
        )

        if not search_results:
            return "I don't have enough information to answer that question.", []

        # Step 2: Build context with source markers
        context_parts = []
        citations = []

        for idx, result in enumerate(search_results, 1):
            source_marker = f"[Source {idx}]"
            context_parts.append(f"{source_marker}\n{result['text']}")

            citations.append(Citation(
                source_id=result['doc_id'],
                chunk_text=result['text'],
                relevance_score=result['similarity']
            ))

        context = "\n\n".join(context_parts)

        # Step 3: Generate answer with citation requirement
        system_prompt = """You are a helpful assistant that answers questions based ONLY on the provided context.

Important rules:
1. Only use information from the provided sources
2. Cite sources using their [Source N] markers in your answer
3. If the context doesn't contain relevant information, say so
4. Be concise but complete
5. Use direct quotes when appropriate"""

        response = openai.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": f"""Context:
{context}

Question: {question}

Provide a detailed answer citing specific sources."""}
            ],
            temperature=0.3  # Lower temperature for factual accuracy
        )

        answer = response.choices[0].message.content

        # Step 4: Verify citations are present if required
        if require_citations:
            has_citations = any(f"[Source {i}]" in answer for i in range(1, len(citations) + 1))
            if not has_citations:
                # Regenerate with stronger citation requirement
                response = openai.chat.completions.create(
                    model=self.model,
                    messages=[
                        {"role": "system", "content": system_prompt},
                        {"role": "user", "content": f"""Context:
{context}

Question: {question}

CRITICAL: You MUST cite specific sources using [Source N] notation in your answer."""}
                    ],
                    temperature=0.1
                )
                answer = response.choices[0].message.content

        return answer, citations

    def answer_with_verification(self, question: str) -> Dict:
        """Generate answer with self-verification step."""

        # Get initial answer
        answer, citations = self.answer_question(question)

        # Self-verification prompt
        verification_prompt = f"""Review this question-answer pair for accuracy:

Question: {question}

Answer: {answer}

Context used:
{chr(10).join([f"[Source {i+1}] {c.chunk_text[:200]}..." for i, c in enumerate(citations)])}

Does the answer accurately reflect the information in the sources? Respond with JSON:
{{
  "is_accurate": true/false,
  "confidence": 0-100,
  "issues": ["issue1", "issue2"] or null,
  "suggested_improvements": "..." or null
}}"""

        verification = openai.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": verification_prompt}],
            response_format={"type": "json_object"}
        )

        verification_result = json.loads(verification.choices[0].message.content)

        return {
            "answer": answer,
            "citations": [
                {
                    "source_id": c.source_id,
                    "relevance": c.relevance_score,
                    "excerpt": c.chunk_text[:150] + "..."
                }
                for c in citations
            ],
            "verification": verification_result
        }

# Production usage
rag_pipeline = RAGPipeline(search_engine)

question = "What authentication methods are supported and how do rate limits work?"
result = rag_pipeline.answer_with_verification(question)

print(f"Answer:\n{result['answer']}\n")
print(f"Confidence: {result['verification']['confidence']}%")
print(f"Accurate: {result['verification']['is_accurate']}\n")
print("Sources:")
for citation in result['citations']:
    print(f"  - {citation['source_id']} (relevance: {citation['relevance']:.2f})")
    print(f"    {citation['excerpt']}")
```

This RAG implementation includes production-critical features:

  • Citation tracking to enable fact-checking and transparency
  • Answer verification using self-critique to catch hallucinations
  • Confidence scoring to flag low-quality responses
  • Source attribution for compliance and auditing
  • Graceful degradation when no relevant context exists

The Cookbook demonstrates this architecture reduces hallucination rates from 15-20% (naive RAG) to 2-3% through verification steps.

5. Fine-Tuning and Model Customization

While prompt engineering handles most use cases, fine-tuning becomes valuable for specialized domains or extreme cost optimization. The Cookbook provides complete fine-tuning workflows:

Fine-Tuning Pipeline for Custom Domains

```python
import openai
from typing import List, Dict, Tuple
import json
import time

class FineTuningPipeline:
    """Complete fine-tuning workflow with validation and deployment."""

    def prepare_training_data(
        self,
        examples: List[Dict],
        validation_split: float = 0.1
    ) -> Tuple[str, str]:
        """Format and split data for fine-tuning."""

        # Shuffle and split
        import random
        random.shuffle(examples)
        split_idx = int(len(examples) * (1 - validation_split))
        train_examples = examples[:split_idx]
        val_examples = examples[split_idx:]

        # Format as JSONL
        def format_example(ex: Dict) -> str:
            return json.dumps({
                "messages": [
                    {"role": "system", "content": ex.get("system", "")},
                    {"role": "user", "content": ex["input"]},
                    {"role": "assistant", "content": ex["output"]}
                ]
            })

        train_file = "training_data.jsonl"
        val_file = "validation_data.jsonl"

        with open(train_file, 'w') as f:
            f.write('\n'.join(format_example(ex) for ex in train_examples))

        with open(val_file, 'w') as f:
            f.write('\n'.join(format_example(ex) for ex in val_examples))

        print(f"Prepared {len(train_examples)} training examples")
        print(f"Prepared {len(val_examples)} validation examples")

        return train_file, val_file

    def create_fine_tune_job(
        self,
        training_file: str,
        validation_file: str = None,
        base_model: str = "gpt-4o-mini-2024-07-18",
        suffix: str = None,
        hyperparameters: Dict = None
    ) -> str:
        """Create and monitor fine-tuning job."""

        # Upload training data
        with open(training_file, 'rb') as f:
            train_upload = openai.files.create(file=f, purpose='fine-tune')

        print(f"Uploaded training file: {train_upload.id}")

        # Upload validation data if provided
        val_file_id = None
        if validation_file:
            with open(validation_file, 'rb') as f:
                val_upload = openai.files.create(file=f, purpose='fine-tune')
            val_file_id = val_upload.id
            print(f"Uploaded validation file: {val_upload.id}")

        # Create fine-tuning job
        job = openai.fine_tuning.jobs.create(
            training_file=train_upload.id,
            validation_file=val_file_id,
            model=base_model,
            suffix=suffix,
            hyperparameters=hyperparameters or {
                "n_epochs": 3,
                "batch_size": "auto",
                "learning_rate_multiplier": "auto"
            }
        )

        print(f"Created fine-tuning job: {job.id}")
        print(f"Status: {job.status}")

        return job.id

    def monitor_job(self, job_id: str) -> str:
        """Monitor fine-tuning job until completion."""

        print("\nMonitoring fine-tuning progress...")

        while True:
            job = openai.fine_tuning.jobs.retrieve(job_id)
            status = job.status

            print(f"Status: {status}")

            if status == "succeeded":
                print("\nFine-tuning completed!")
                print(f"Fine-tuned model: {job.fine_tuned_model}")
                return job.fine_tuned_model

            elif status == "failed":
                print(f"\nFine-tuning failed: {job.error}")
                raise Exception(f"Fine-tuning failed: {job.error}")

            elif status in ["validating_files", "queued", "running"]:
                # Show metrics if available
                if hasattr(job, 'trained_tokens') and job.trained_tokens:
                    print(f"  Tokens trained: {job.trained_tokens}")
                time.sleep(60)  # Check every minute

            else:
                print(f"Unexpected status: {status}")
                time.sleep(60)

    def evaluate_model(
        self,
        model: str,
        test_examples: List[Dict],
        base_model: str = "gpt-4o-mini-2024-07-18"
    ) -> Dict:
        """Compare fine-tuned model to base model."""

        print(f"\nEvaluating {model} against {base_model}...")

        results = {
            "fine_tuned": {"correct": 0, "total": len(test_examples)},
            "base": {"correct": 0, "total": len(test_examples)}
        }

        for ex in test_examples:
            # Test fine-tuned model
            ft_response = openai.chat.completions.create(
                model=model,
                messages=[
                    {"role": "system", "content": ex.get("system", "")},
                    {"role": "user", "content": ex["input"]}
                ],
                temperature=0
            )
            ft_output = ft_response.choices[0].message.content

            # Test base model
            base_response = openai.chat.completions.create(
                model=base_model,
                messages=[
                    {"role": "system", "content": ex.get("system", "")},
                    {"role": "user", "content": ex["input"]}
                ],
                temperature=0
            )
            base_output = base_response.choices[0].message.content

            # Check correctness (exact match for classification tasks)
            expected = ex["output"].strip().lower()
            if ft_output.strip().lower() == expected:
                results["fine_tuned"]["correct"] += 1
            if base_output.strip().lower() == expected:
                results["base"]["correct"] += 1

        # Calculate metrics
        results["fine_tuned"]["accuracy"] = results["fine_tuned"]["correct"] / results["fine_tuned"]["total"]
        results["base"]["accuracy"] = results["base"]["correct"] / results["base"]["total"]
        results["improvement"] = results["fine_tuned"]["accuracy"] - results["base"]["accuracy"]

        print("\nResults:")
        print(f"  Fine-tuned accuracy: {results['fine_tuned']['accuracy']:.2%}")
        print(f"  Base accuracy: {results['base']['accuracy']:.2%}")
        print(f"  Improvement: {results['improvement']:.2%}")

        return results

# Example: Fine-tune for medical billing code classification
examples = [
    {
        "system": "You are a medical billing assistant. Classify procedures into billing codes.",
        "input": "Patient received annual physical examination with EKG",
        "output": "CPT: 99395, 93000"
    },
    {
        "system": "You are a medical billing assistant. Classify procedures into billing codes.",
        "input": "Patient underwent diagnostic colonoscopy with biopsy",
        "output": "CPT: 45380, 45380-59"
    },
    # ... 500+ more examples
]

pipeline = FineTuningPipeline()
train_file, val_file = pipeline.prepare_training_data(examples)
job_id = pipeline.create_fine_tune_job(train_file, val_file, suffix="medical-billing-v1")
model = pipeline.monitor_job(job_id)
evaluation = pipeline.evaluate_model(model, examples[-50:])  # Test on held-out examples
```

The Cookbook demonstrates fine-tuning provides:

  • 3-5x cost reduction for high-volume specialized tasks
  • 20-40% accuracy improvement over few-shot prompting in narrow domains
  • Consistent output formatting without extensive prompt engineering
  • Lower latency by reducing prompt size

However, it also warns about when NOT to fine-tune: for tasks requiring up-to-date knowledge (use RAG), for tasks with high variability (use few-shot), or for rapid iteration (fine-tuning takes hours to days).

6. Production Best Practices

The Cookbook dedicates extensive content to production reliability, covering error handling, rate limiting, caching, and monitoring:

Comprehensive Error Handling and Retry Logic

```python
import openai
from openai import OpenAIError, APIError, RateLimitError, APIConnectionError
import time
from typing import Dict, Optional, Callable
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class ProductionOpenAIClient:
    """Production-grade OpenAI client with comprehensive error handling."""

    def __init__(
        self,
        api_key: str,
        max_retries: int = 3,
        base_delay: float = 1.0,
        max_delay: float = 60.0,
        timeout: float = 30.0
    ):
        self.client = openai.OpenAI(api_key=api_key)
        self.max_retries = max_retries
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.timeout = timeout

    def _exponential_backoff(self, attempt: int) -> float:
        """Calculate exponential backoff delay with jitter."""
        import random
        delay = min(self.base_delay * (2 ** attempt), self.max_delay)
        jitter = random.uniform(0, delay * 0.1)
        return delay + jitter

    def completion_with_retry(
        self,
        messages: list,
        model: str = "gpt-4o",
        fallback_model: Optional[str] = "gpt-4o-mini",
        **kwargs
    ) -> str:
        """Create completion with automatic retry and fallback logic."""

        last_error = None
        current_model = model

        for attempt in range(self.max_retries):
            try:
                response = self.client.chat.completions.create(
                    model=current_model,
                    messages=messages,
                    timeout=self.timeout,
                    **kwargs
                )

                content = response.choices[0].message.content
                logger.info(f"Completion successful with {current_model} (attempt {attempt + 1})")
                return content

            except RateLimitError as e:
                logger.warning(f"Rate limit hit for {current_model}: {e}")
                last_error = e

                # Check if we should switch to fallback model
                if attempt == 1 and fallback_model and current_model != fallback_model:
                    logger.info(f"Switching to fallback model: {fallback_model}")
                    current_model = fallback_model
                    continue

                # Wait with exponential backoff
                delay = self._exponential_backoff(attempt)
                logger.info(f"Retrying in {delay:.2f} seconds...")
                time.sleep(delay)

            except APIConnectionError as e:
                logger.error(f"Connection error: {e}")
                last_error = e

                if attempt < self.max_retries - 1:
                    delay = self._exponential_backoff(attempt)
                    time.sleep(delay)
                else:
                    raise

            except APIError as e:
                # Check if error is retryable
                if e.status_code >= 500:
                    logger.error(f"Server error (status {e.status_code}): {e}")
                    last_error = e

                    if attempt < self.max_retries - 1:
                        delay = self._exponential_backoff(attempt)
                        time.sleep(delay)
                    else:
                        raise
                else:
                    # Client errors (4xx) shouldn't be retried
                    logger.error(f"Client error (status {e.status_code}): {e}")
                    raise

            except OpenAIError as e:
                logger.error(f"OpenAI error: {e}")
                raise

        # All retries exhausted
        raise last_error

# Token counting and cost estimation
class TokenManager:
    """Manages token counting and cost estimation."""

    # Pricing per 1M tokens (as of September 2025)
    PRICING = {
        "gpt-4o": {"input": 2.50, "output": 10.00},
        "gpt-4o-mini": {"input": 0.150, "output": 0.600},
        "gpt-4-turbo": {"input": 10.00, "output": 30.00},
        "text-embedding-3-large": {"input": 0.13, "output": 0},
        "text-embedding-3-small": {"input": 0.02, "output": 0},
    }

    def __init__(self):
        self.encodings = {}

    def count_tokens(self, text: str, model: str = "gpt-4o") -> int:
        """Count tokens for given text and model."""
        import tiktoken

        if model not in self.encodings:
            try:
                self.encodings[model] = tiktoken.encoding_for_model(model)
            except KeyError:
                # Fallback to cl100k_base for unknown models
                self.encodings[model] = tiktoken.get_encoding("cl100k_base")

        return len(self.encodings[model].encode(text))

    def estimate_cost(
        self,
        input_tokens: int,
        output_tokens: int,
        model: str
    ) -> float:
        """Estimate cost for given token usage."""

        if model not in self.PRICING:
            logger.warning(f"Unknown model pricing: {model}")
            return 0.0

        pricing = self.PRICING[model]
        input_cost = (input_tokens / 1_000_000) * pricing["input"]
        output_cost = (output_tokens / 1_000_000) * pricing["output"]

        return input_cost + output_cost

    def check_context_limit(self, messages: list, model: str) -> bool:
        """Check if messages fit within model context window."""

        context_limits = {
            "gpt-4o": 128000,
            "gpt-4o-mini": 128000,
            "gpt-4-turbo": 128000,
            "gpt-3.5-turbo": 16385
        }

        limit = context_limits.get(model, 8192)

        total_tokens = sum(
            self.count_tokens(msg["content"], model)
            for msg in messages
            if "content" in msg
        )

        if total_tokens > limit * 0.9:  # Use 90% as safety margin
            logger.warning(f"Message tokens ({total_tokens}) approaching context limit ({limit})")
            return False

        return True

# Production usage with monitoring
client = ProductionOpenAIClient(api_key="your-api-key")
token_manager = TokenManager()

def safe_completion(user_message: str) -> Dict:
    """Production completion with full error handling and monitoring."""

    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": user_message}
    ]

    # Check token limits
    if not token_manager.check_context_limit(messages, "gpt-4o"):
        return {"error": "Message too long", "code": "TOKEN_LIMIT_EXCEEDED"}

    # Estimate input cost
    input_tokens = sum(token_manager.count_tokens(m["content"], "gpt-4o") for m in messages)

    try:
        start_time = time.time()

        response = client.completion_with_retry(
            messages=messages,
            model="gpt-4o",
            fallback_model="gpt-4o-mini",
            max_tokens=1000,
            temperature=0.7
        )

        latency = time.time() - start_time

        # Calculate actual cost
        output_tokens = token_manager.count_tokens(response, "gpt-4o")
        cost = token_manager.estimate_cost(input_tokens, output_tokens, "gpt-4o")

        # Log metrics
        logger.info(f"Completion metrics: latency={latency:.2f}s, cost=${cost:.4f}, tokens={input_tokens + output_tokens}")

        return {
            "response": response,
            "metrics": {
                "latency_seconds": latency,
                "cost_usd": cost,
                "input_tokens": input_tokens,
                "output_tokens": output_tokens
            }
        }

    except Exception as e:
        logger.error(f"Completion failed: {e}")
        return {"error": str(e), "code": "API_ERROR"}

# Usage
result = safe_completion("Explain quantum computing in simple terms")
if "error" not in result:
    print(f"Response: {result['response']}")
    print(f"Cost: ${result['metrics']['cost_usd']:.4f}")
    print(f"Latency: {result['metrics']['latency_seconds']:.2f}s")
```

This production client demonstrates patterns the Cookbook emphasizes:

  • Exponential backoff with jitter to avoid thundering herd problems
  • Model fallback for high availability
  • Comprehensive error classification (retryable vs non-retryable)
  • Token management to prevent context overflow
  • Cost tracking for budget management
  • Latency monitoring for SLA compliance

Documentation Structure and Navigation

The Cookbook organizes content into three navigation layers:

  1. Quick Start Guides: 5-10 minute tutorials for common tasks (API setup, first completion, embeddings basics)
  2. How-To Guides: 20-30 minute implementations of specific patterns (RAG, function calling, fine-tuning)
  3. Deep Dives: Comprehensive guides exploring trade-offs and advanced optimizations

This structure enables both rapid onboarding and deep technical learning.

Real-World Examples

Example 1: Building a Production Documentation Q&A System

A SaaS company needed to reduce support ticket volume by enabling customers to self-serve answers from 10,000+ pages of technical documentation. The Cookbook's RAG patterns provided the foundation:

Implementation: Combined embeddings-based search (text-embedding-3-large) with GPT-4o for answer generation, implementing the citation tracking pattern to ensure answer accuracy.

Results:

  • 35% reduction in support tickets for documentation-related questions
  • 92% user satisfaction rating for AI-generated answers
  • Average response time of 2.3 seconds (vs 45 minutes human response time)
  • $18K/month support cost savings

Key Cookbook Patterns Used:

  • Semantic search with metadata filtering (filtering by documentation version)
  • RAG with citation tracking for transparency
  • Answer verification to reduce hallucinations
  • Cost optimization through model cascading (GPT-4o-mini for simple queries, GPT-4o for complex ones)

Example 2: Automating Legal Document Analysis

A legal tech startup needed to extract structured data from 50,000+ contracts for due diligence processes. Manual extraction took 2-3 hours per contract.

Implementation: Following the Cookbook's structured output guide, built a multi-step extraction pipeline using JSON mode and Pydantic validation.

```python
# Based on Cookbook structured output pattern
import json
import openai
from typing import List, Optional
from pydantic import BaseModel

class ContractData(BaseModel):
    parties: List[str]
    effective_date: str
    termination_date: Optional[str]
    contract_value: Optional[str]
    key_obligations: List[str]
    termination_clauses: List[str]
    renewal_terms: Optional[str]
    liability_caps: Optional[str]

def extract_contract_data(contract_text: str) -> ContractData:
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "Extract structured data from contracts. Output valid JSON."
            },
            {"role": "user", "content": contract_text}
        ],
        response_format={"type": "json_object"}
    )

    return ContractData(**json.loads(response.choices[0].message.content))
```

Results:

  • Extraction time reduced from 2-3 hours to 4-5 minutes per contract
  • 96% accuracy on key fields (validated against human review)
  • Processing cost: $0.80 per contract (vs $150-200 for paralegal time)
  • Enabled due diligence on 10x more contracts in same timeframe

Example 3: Personalizing E-Commerce Recommendations

An e-commerce platform wanted to move beyond collaborative filtering to generate natural language product recommendations based on user browsing history and preferences.

Implementation: Used the Cookbook's embeddings patterns to build semantic product search combined with GPT-4o for personalized explanations.

Architecture:

  1. Generate embeddings for all product descriptions (text-embedding-3-large)
  2. Create user profile embeddings from browsing history
  3. Find semantically similar products
  4. Use GPT-4o to generate personalized recommendation explanations (steps 1-3 are sketched below)
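A rough sketch of steps 1-3 under the same pattern (an illustration of the approach, not code from the case study; the product data and names are hypothetical):

```python
import numpy as np
import openai
from typing import List

def embed(texts: List[str]) -> np.ndarray:
    """Embed a batch of texts with text-embedding-3-large (vectors come back unit-length)."""
    response = openai.embeddings.create(model="text-embedding-3-large", input=texts)
    return np.array([item.embedding for item in response.data])

# Step 1: embed the product catalog (hypothetical data)
products = [
    {"id": "P1", "description": "Wireless noise-cancelling over-ear headphones"},
    {"id": "P2", "description": "Mechanical keyboard with hot-swappable switches"},
    {"id": "P3", "description": "Ergonomic laptop stand in brushed aluminium"},
]
product_vectors = embed([p["description"] for p in products])

# Step 2: build a user profile as the mean of recently browsed product embeddings
browsing_history = ["USB-C docking station", "Portable bluetooth speaker"]
profile_vector = embed(browsing_history).mean(axis=0)
profile_vector /= np.linalg.norm(profile_vector)  # re-normalize the averaged vector

# Step 3: rank products by cosine similarity to the profile
scores = product_vectors @ profile_vector
for idx in np.argsort(scores)[::-1]:
    print(f"{products[idx]['id']}: {scores[idx]:.3f}")
```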

Results:

  • 28% increase in click-through rate on recommendations
  • 15% increase in conversion rate
  • Average explanation generation: 180ms
  • Cost: $0.003 per recommendation

Key Insight from Cookbook: Using the smaller embedding model (text-embedding-3-small) at its full output dimensionality provided 95% of the accuracy at roughly 1/6th the cost, enabling real-time personalization at scale.
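A related lever the embeddings API exposes for this kind of cost/accuracy trade-off is the `dimensions` parameter on the text-embedding-3 models, which shortens the returned vector. A minimal sketch (the model choice and sizes here are illustrative):

```python
import openai

text = "Wireless noise-cancelling headphones with 30-hour battery life"

# Full-size embedding (3072 dimensions for text-embedding-3-large)
full = openai.embeddings.create(model="text-embedding-3-large", input=text)

# Truncated embedding: `dimensions` trades a small amount of retrieval
# accuracy for much lower storage and similarity-computation cost
compact = openai.embeddings.create(
    model="text-embedding-3-large",
    input=text,
    dimensions=256,
)

print(len(full.data[0].embedding), len(compact.data[0].embedding))  # 3072 vs 256
```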

Common Pitfalls

Pitfall 1: Not Using JSON Mode for Structured Output

Problem: Many developers use regex or custom parsing to extract structured data from LLM responses, leading to frequent parsing failures.

Solution: Always use JSON mode when expecting structured output:

```python
# Bad: Unreliable parsing
response = openai.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Extract name and email as JSON: ..."}]
)
# Parse response.choices[0].message.content with regex - fragile!

# Good: Guaranteed valid JSON
response = openai.chat.completions.create(
    model="gpt-4o",
    # Note: JSON mode requires the word "JSON" to appear in the messages
    messages=[{"role": "user", "content": "Extract name and email as JSON: ..."}],
    response_format={"type": "json_object"}  # Guarantees valid JSON
)
```

The Cookbook shows JSON mode reduces parsing errors from 8-12% to <0.1%.

Pitfall 2: Ignoring Token Limits

Problem: Applications crash when messages exceed context windows, especially in multi-turn conversations.

Solution: Always count tokens before API calls:

```python
import tiktoken

def safe_conversation(messages: List[Dict], max_context: int = 120000):
    encoding = tiktoken.encoding_for_model("gpt-4o")

    total_tokens = sum(len(encoding.encode(m["content"])) for m in messages)

    # Truncate old messages if exceeding limit
    while total_tokens > max_context * 0.9:  # Use 90% as safety margin
        if len(messages) <= 2:  # Keep system + latest user message
            break
        messages.pop(1)  # Remove oldest non-system message
        total_tokens = sum(len(encoding.encode(m["content"])) for m in messages)

    return messages
```

Pitfall 3: Over-Engineering with Function Calling

Problem: Developers create 20+ tools for agents when many could be handled with direct prompting, leading to poor tool selection.

Solution: The Cookbook recommends the "5 Tool Rule": If your agent needs more than 5-7 tools, either:

  • Combine related tools into one with parameters (as sketched below)
  • Use hierarchical agents with specialized tool sets
  • Solve some tasks with direct prompting instead
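As a rough illustration of the first option (a sketch, not a Cookbook example), several narrow lookup tools can collapse into a single schema whose parameter enumerates the record type. It reuses get_order_status from the support-agent example above; the other handlers are hypothetical:

```python
# Hypothetical consolidation: instead of get_order_status, get_invoice_status,
# and get_shipment_status as three separate tools, expose one lookup tool.
lookup_tool = {
    "type": "function",
    "function": {
        "name": "lookup_record",
        "description": "Look up the current status of an order, invoice, or shipment",
        "parameters": {
            "type": "object",
            "properties": {
                "record_type": {
                    "type": "string",
                    "enum": ["order", "invoice", "shipment"],
                    "description": "Which kind of record to look up"
                },
                "record_id": {"type": "string", "description": "The record's ID"}
            },
            "required": ["record_type", "record_id"]
        }
    }
}

def lookup_record(record_type: str, record_id: str) -> dict:
    # Route to the appropriate backend query based on record_type
    handlers = {"order": get_order_status}  # extend with invoice/shipment handlers
    handler = handlers.get(record_type)
    return handler(record_id) if handler else {"error": f"Unsupported type: {record_type}"}
```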

Pitfall 4: Not Caching Embeddings

Problem: Regenerating embeddings for static content on every search request wastes money and adds latency.

Solution: The Cookbook shows comprehensive caching patterns:

```python
import hashlib
import json
from functools import lru_cache

@lru_cache(maxsize=10000)
def get_embedding_cached(text: str, model: str = "text-embedding-3-large") -> List[float]:
    """Cache embeddings in memory, keyed on the text and model arguments."""
    response = openai.embeddings.create(model=model, input=text)
    return response.data[0].embedding

# For persistent caching
class EmbeddingCache:
    def __init__(self, cache_file: str = "embeddings_cache.json"):
        self.cache_file = cache_file
        self.cache = self._load_cache()

    def _load_cache(self) -> Dict:
        try:
            with open(self.cache_file, 'r') as f:
                return json.load(f)
        except FileNotFoundError:
            return {}

    def _save_cache(self):
        with open(self.cache_file, 'w') as f:
            json.dump(self.cache, f)

    def get_embedding(self, text: str, model: str) -> List[float]:
        # Create cache key from content hash
        cache_key = hashlib.sha256(f"{text}:{model}".encode()).hexdigest()

        if cache_key in self.cache:
            return self.cache[cache_key]

        # Generate and cache
        response = openai.embeddings.create(model=model, input=text)
        embedding = response.data[0].embedding

        self.cache[cache_key] = embedding
        self._save_cache()

        return embedding
```

For a 10K document corpus that's searched 1M times/month, caching reduces embedding costs from $130/month to one-time $13.

Pitfall 5: Not Implementing Rate Limit Headers

Problem: Applications hit rate limits repeatedly without backing off appropriately.

Solution: The Cookbook demonstrates reading rate limit headers:

```python
import time
from openai import OpenAI

client = OpenAI()

def completion_with_rate_limit_awareness(messages: List[Dict]) -> str:
    # Use the SDK's raw-response wrapper so HTTP headers are accessible
    raw = client.chat.completions.with_raw_response.create(
        model="gpt-4o",
        messages=messages
    )

    # Check rate limit headers
    remaining_requests = int(raw.headers.get('x-ratelimit-remaining-requests', 0))
    remaining_tokens = int(raw.headers.get('x-ratelimit-remaining-tokens', 0))

    # Proactive throttling if approaching limits
    if remaining_requests < 10 or remaining_tokens < 10000:
        logger.warning(
            f"Approaching rate limits: {remaining_requests} requests, "
            f"{remaining_tokens} tokens remaining"
        )
        time.sleep(1)  # Proactive backoff

    response = raw.parse()
    return response.choices[0].message.content
```

This pattern reduces 429 errors by 80% through proactive throttling.

Best Practices

1. Start with Prompt Engineering, Not Fine-Tuning

The Cookbook emphasizes that 95% of use cases should start with prompt engineering:

Prompt Engineering Path:

  1. Basic prompting with clear instructions
  2. Few-shot examples (3-5 examples)
  3. Chain-of-thought prompting for complex reasoning (see the sketch after this list)
  4. Structured output with JSON mode
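A minimal sketch of step 3, chain-of-thought prompting (the prompt wording and example are illustrative, not taken from the Cookbook):

```python
import openai

question = "A store sells pens in packs of 12 for $3. How much do 60 pens cost?"

response = openai.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Think through the problem step by step, "
                                      "then give the final answer on a line starting with 'Answer:'."},
        {"role": "user", "content": question}
    ],
    temperature=0
)

# Keep the reasoning for debugging, surface only the final line to the user
text = response.choices[0].message.content
final_answer = next((line for line in text.splitlines() if line.startswith("Answer:")), text)
print(final_answer)  # e.g. "Answer: $15"
```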

Only move to fine-tuning if:

  • Task is extremely high-volume (millions of calls/month) where cost savings justify effort
  • Domain is highly specialized with proprietary terminology
  • Consistent output format is critical and few-shot doesn't achieve it
  • You have 500+ high-quality training examples

2. Implement Semantic Caching for Repeated Queries

For applications with repeated similar queries, semantic caching provides massive cost savings:

```python
import numpy as np
from typing import Optional

class SemanticCache:
    """Cache responses based on semantic similarity of prompts."""

    def __init__(self, similarity_threshold: float = 0.95):
        self.threshold = similarity_threshold
        self.cache: List[Dict] = []

    def get(self, query: str, query_embedding: np.ndarray) -> Optional[str]:
        """Retrieve cached response for semantically similar query."""

        for entry in self.cache:
            similarity = np.dot(entry['embedding'], query_embedding)
            if similarity >= self.threshold:
                logger.info(f"Cache hit (similarity: {similarity:.3f})")
                return entry['response']

        return None

    def set(self, query: str, query_embedding: np.ndarray, response: str):
        """Cache query-response pair."""
        self.cache.append({
            'query': query,
            'embedding': query_embedding,
            'response': response
        })

# Usage
semantic_cache = SemanticCache(similarity_threshold=0.95)

def cached_completion(query: str) -> str:
    # Generate query embedding
    embedding_response = openai.embeddings.create(
        model="text-embedding-3-small",
        input=query
    )
    query_embedding = np.array(embedding_response.data[0].embedding)

    # Check cache
    cached_response = semantic_cache.get(query, query_embedding)
    if cached_response:
        return cached_response

    # Generate new response
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": query}]
    )
    response_text = response.choices[0].message.content

    # Cache for future
    semantic_cache.set(query, query_embedding, response_text)

    return response_text
```

For customer support chatbots, semantic caching achieves 40-60% cache hit rates, reducing costs proportionally.

3. Use Model Cascading for Cost Optimization

Different queries require different model capabilities. The Cookbook recommends cascading from cheap to expensive models:

```python
def cascaded_completion(query: str, complexity_threshold: float = 0.7) -> Dict:
    """Use cheaper model for simple queries, expensive for complex ones."""

    # Step 1: Classify query complexity with mini model
    complexity_check = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"""Rate this query's complexity from 0.0 (simple) to 1.0 (complex).
Respond with just a number.

Query: {query}"""
        }],
        temperature=0
    )

    complexity = float(complexity_check.choices[0].message.content.strip())

    # Step 2: Route to appropriate model
    model = "gpt-4o" if complexity >= complexity_threshold else "gpt-4o-mini"

    response = openai.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": query}]
    )

    return {
        "response": response.choices[0].message.content,
        "model_used": model,
        "complexity": complexity
    }
```

This pattern reduces average cost per query by 60-70% while maintaining quality.

4. Implement Structured Logging for Debugging

The Cookbook emphasizes comprehensive logging for AI applications:

```python
import logging
import json
import time
from datetime import datetime
from typing import Dict, Optional

class AILogger:
    """Structured logging for AI operations."""

    def __init__(self, log_file: str = "ai_operations.jsonl"):
        self.log_file = log_file

    def log_completion(
        self,
        prompt: str,
        response: str,
        model: str,
        tokens: Dict,
        latency: float,
        cost: float,
        user_id: Optional[str] = None,
        metadata: Dict = None
    ):
        """Log completion with full context."""

        log_entry = {
            "timestamp": datetime.utcnow().isoformat(),
            "type": "completion",
            "model": model,
            "prompt": prompt,
            "response": response,
            "tokens": tokens,
            "latency_seconds": latency,
            "cost_usd": cost,
            "user_id": user_id,
            "metadata": metadata or {}
        }

        with open(self.log_file, 'a') as f:
            f.write(json.dumps(log_entry) + '\n')

    def analyze_costs(self, time_period: str = "day") -> Dict:
        """Analyze costs from logs."""
        # Implementation for analyzing logs
        pass

# Usage
logger = AILogger()

def logged_completion(prompt: str, user_id: str) -> str:
    start = time.time()

    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )

    latency = time.time() - start
    usage = response.usage
    # calculate_cost could wrap TokenManager.estimate_cost from the earlier example
    cost = calculate_cost(usage.prompt_tokens, usage.completion_tokens, "gpt-4o")

    logger.log_completion(
        prompt=prompt,
        response=response.choices[0].message.content,
        model="gpt-4o",
        tokens={
            "prompt": usage.prompt_tokens,
            "completion": usage.completion_tokens,
            "total": usage.total_tokens
        },
        latency=latency,
        cost=cost,
        user_id=user_id
    )

    return response.choices[0].message.content
```

Structured logs enable debugging production issues and cost analysis.

5. Implement User Feedback Loops

The Cookbook recommends collecting user feedback to continuously improve:

```python
import json
from datetime import datetime
from typing import Dict, List, Optional

class FeedbackSystem:
    """Track AI response quality through user feedback."""

    def __init__(self, feedback_file: str = "user_feedback.jsonl"):
        self.feedback_file = feedback_file

    def record_feedback(
        self,
        response_id: str,
        prompt: str,
        response: str,
        rating: int,  # 1-5
        user_comment: Optional[str] = None
    ):
        """Record user feedback for AI response."""

        feedback = {
            "timestamp": datetime.utcnow().isoformat(),
            "response_id": response_id,
            "prompt": prompt,
            "response": response,
            "rating": rating,
            "comment": user_comment
        }

        with open(self.feedback_file, 'a') as f:
            f.write(json.dumps(feedback) + '\n')

    def get_low_quality_examples(self, min_rating: int = 2) -> List[Dict]:
        """Extract low-rated responses for analysis."""
        # Implementation for analyzing feedback
        pass
```

This feedback becomes training data for fine-tuning or prompt improvement.
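One way to close that loop (an illustrative sketch, not a Cookbook recipe) is to export highly rated prompt-response pairs in the same JSONL chat format used in the fine-tuning section above:

```python
def export_feedback_for_fine_tuning(
    feedback_file: str = "user_feedback.jsonl",
    output_file: str = "feedback_training.jsonl",
    min_rating: int = 4
) -> int:
    """Convert highly rated feedback entries into fine-tuning training examples."""
    count = 0
    with open(feedback_file) as src, open(output_file, 'w') as dst:
        for line in src:
            entry = json.loads(line)
            if entry["rating"] < min_rating:
                continue
            dst.write(json.dumps({
                "messages": [
                    {"role": "user", "content": entry["prompt"]},
                    {"role": "assistant", "content": entry["response"]}
                ]
            }) + '\n')
            count += 1
    return count
```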

Getting Started

Prerequisites

  • Python 3.8+ or Node.js 16+
  • OpenAI API key (get from https://platform.openai.com/api-keys)
  • Basic understanding of async/await patterns
  • Familiarity with JSON and REST APIs

Step 1: Install OpenAI SDK

```bash
# Python
pip install openai

# Node.js
npm install openai

# Or using poetry/pnpm
poetry add openai
pnpm add openai
```

Step 2: Set Up API Key

```bash
# Environment variable (recommended for production)
export OPENAI_API_KEY='sk-your-api-key-here'

# Or in .env file
echo "OPENAI_API_KEY=sk-your-api-key-here" > .env
```

Step 3: First API Call

```python
import openai
import os

# Initialize client
openai.api_key = os.getenv("OPENAI_API_KEY")

# Simple completion
response = openai.chat.completions.create(
    model="gpt-4o-mini",  # Start with mini for testing
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain what OpenAI Cookbook is in one sentence."}
    ]
)

print(response.choices[0].message.content)
```

Step 4: Explore Cookbook Examples

Navigate to https://cookbook.openai.com and explore:

  • 1. "Getting Started" section: Basic API usage, authentication, error handling"Getting Started" section: Basic API usage, authentication, error handling
  • 2. "Embeddings" guides: Semantic search implementation"Embeddings" guides: Semantic search implementation
  • 3. "Function Calling" tutorials: Building AI agents with tools"Function Calling" tutorials: Building AI agents with tools
  • 4. "RAG" patterns: Document Q&A systems"RAG" patterns: Document Q&A systems

Step 5: Build Your First Real Application

Follow the Cookbook's "Building a Q&A System" guide to create a working RAG application:

  1. Index your documents with embeddings
  2. Implement semantic search
  3. Build answer generation with citations
  4. Add error handling and logging
  5. Deploy to production

Next Steps

  • Join the OpenAI Developer Forum for community support
  • Explore the Cookbook GitHub repository for the latest examples
  • Implement monitoring and logging from the best practices section
  • Experiment with different models for cost optimization
  • Collect user feedback to improve responses

Conclusion

OpenAI Cookbook has become indispensable for AI engineering teams building production applications. Its value extends far beyond simple code examples—it represents the accumulated wisdom of thousands of real-world AI deployments, distilled into actionable patterns and best practices.

The Cookbook's true power lies in three key attributes:

1. Production-Ready Code: Every example includes error handling, retry logic, token management, and cost optimization. Engineers can adapt cookbook patterns directly into production rather than treating them as toy examples requiring extensive hardening.

2. Pattern Recognition: By studying cookbook examples across embeddings, function calling, RAG, and fine-tuning, engineers develop intuition for which approaches suit which problems. This pattern recognition dramatically reduces time spent on architectural decisions.

3. Community Validation: Patterns in the Cookbook have been validated by thousands of implementations. When following cookbook architectures, engineers gain confidence that they're building on proven foundations rather than experimental approaches.

For teams new to OpenAI's APIs, the Cookbook reduces time-to-first-production-feature from weeks to days. For experienced teams, it provides optimization techniques and advanced patterns that improve quality, reduce costs, and enable more sophisticated capabilities.

As AI applications evolve from simple chatbots to complex multi-agent systems with retrieval, tool use, and real-time learning, the Cookbook has evolved alongside. Its coverage of modern patterns—RAG architectures, semantic caching, model cascading, and production monitoring—ensures it remains relevant as the AI landscape matures.

The question for engineering teams is no longer whether to use OpenAI Cookbook, but how quickly they can internalize its patterns into their development practices. Those who master the Cookbook's techniques gain a significant competitive advantage: the ability to ship sophisticated AI features with the reliability and cost-efficiency that production demands.

Key Features

  • Production-Ready Examples: Battle-tested code from OpenAI engineers
  • RAG Architecture Patterns: Complete retrieval-augmented generation systems
  • Fine-Tuning Guides: Model customization best practices
  • Cost Optimization: Strategies to reduce API costs
  • Security Best Practices: Prompt injection prevention and safety
  • Performance Optimization: Latency reduction techniques

Related Links

  • OpenAI Cookbook
  • GitHub