Data Science Integration Guide
==============================

**What is ostruct?** A schema-first CLI that renders Jinja2 templates locally in a sandbox, then sends the resulting prompt and JSON schema to OpenAI's Structured Outputs endpoint, which returns JSON that conforms to the schema. This suits data science workflows that need reliable, structured analysis output.

Learn how to leverage ostruct for data science workflows, including Jupyter/Colab integration, multi-source analysis, and visualization generation. This guide covers everything from basic data extraction to complex research synthesis workflows.

.. note::
   This guide focuses on data science use cases. For general template usage, see the :doc:`template_guide`. For tool integration basics, see :doc:`tool_integration`.

.. tip::
   **Quick Start**: Jump to :ref:`jupyter-integration` if you want to start using ostruct in Jupyter notebooks immediately.

Overview
========

ostruct excels at transforming unstructured data into structured insights, making it perfect for data science workflows where you need to:

- Extract structured data from diverse sources (CSV, PDFs, web pages, APIs)
- Combine quantitative analysis with qualitative research
- Generate consistent, validated output schemas for downstream processing
- Integrate AI-powered analysis into existing data pipelines

Key Benefits for Data Science
-----------------------------

**Schema-First Reliability**
  Every output matches your defined JSON schema, eliminating parsing errors and ensuring consistent data structures for analysis.

**Multi-Tool Orchestration**
  Combine Code Interpreter (Python execution), File Search (document analysis), and Web Search (current data) in a single workflow. Note: ``--tool-choice auto`` (default) lets the model decide when to use tools; use ``--tool-choice required`` to force tool usage.

**Notebook Integration**
  Runs from Jupyter, Colab, and other notebook environments through shell commands or ``subprocess``; the sections below cover token management and output handling.

**Crucial Limitations**
  - **Binary files cannot be accessed in templates** - they must be routed to Code Interpreter (``ci:``) or user-data (``ud:``)
  - **File size limits** apply based on ``OSTRUCT_TEMPLATE_FILE_LIMIT`` environment variable
  - **Internet access** in Code Interpreter may be limited depending on OpenAI's current restrictions

**Reproducible Workflows**
  Template-based approach ensures consistent analysis across different datasets and team members.
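
Because each response is expected to match the schema (see **Schema-First Reliability** above), downstream code can load it and fail fast when a top-level field is absent. The following is a minimal sketch using only the standard library. ``missing_required_keys`` is a hypothetical helper, not part of ostruct, and it checks only the top-level ``required`` list rather than performing full JSON Schema validation:

.. code-block:: python

   import json


   def missing_required_keys(result: dict, schema: dict) -> list:
       """Return top-level keys listed in the schema's "required" array but absent from result."""
       return [key for key in schema.get("required", []) if key not in result]


   # Illustrative schema and response (not real ostruct output)
   schema = {
       "type": "object",
       "properties": {"insights": {"type": "array", "items": {"type": "string"}}},
       "required": ["insights"],
   }
   result = json.loads('{"insights": ["Sales peaked in February"]}')

   missing = missing_required_keys(result, schema)
   if missing:
       raise ValueError(f"Response is missing required keys: {missing}")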

.. _jupyter-integration:

Jupyter/Colab Integration
=========================

Setting Up ostruct in Notebooks
--------------------------------

**Installation in Jupyter/Colab:**

.. code-block:: bash

   # Install ostruct in notebook environment
   pip install ostruct-cli

   # For enhanced file type detection (recommended for data science)
   # (quoted so that shells such as zsh do not expand the brackets)
   pip install "ostruct-cli[enhanced-detection]"

   # Verify installation
   ostruct --version

.. code-block:: python

   # Set up OpenAI API key in Python
   import os
   os.environ['OPENAI_API_KEY'] = 'your-api-key-here'

**30-Second Working Example:**

.. code-block:: bash

   # Create simple template
   echo "Analyze this data: {{ data.content }}" > analyze.j2

   # Create schema
   echo '{"type":"object","properties":{"insights":{"type":"array","items":{"type":"string"}}}}' > schema.json

   # Create sample data
   echo "Sales: Jan=100, Feb=150, Mar=120" > data.txt

   # Run analysis
   ostruct run analyze.j2 schema.json --file prompt:data data.txt --model gpt-4o-mini

**Expected Output:**

.. code-block:: json

   {
     "insights": [
       "Sales peaked in February with 150 units",
       "March saw a 20% decline from February",
       "Overall trend shows growth from Jan to Feb, then decline"
     ]
   }
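
The exact wording of the insights will vary between runs, but any percentages can be checked by hand against the sample data: February to March is (120 - 150) / 150 = -20%. A small sketch of that arithmetic, assuming the ``Name=value`` layout of ``data.txt`` above; ``month_over_month`` is a hypothetical helper, not part of ostruct:

.. code-block:: python

   def month_over_month(line: str) -> dict:
       """Parse 'Sales: Jan=100, Feb=150, Mar=120' and return percent change per month."""
       _, _, values = line.partition(":")
       pairs = [item.strip().split("=") for item in values.split(",")]
       months = [(name, float(amount)) for name, amount in pairs]
       changes = {}
       # Compare each month with the one before it
       for (_, prev), (name, cur) in zip(months, months[1:]):
           changes[name] = round((cur - prev) / prev * 100, 1)
       return changes


   print(month_over_month("Sales: Jan=100, Feb=150, Mar=120"))
   # {'Feb': 50.0, 'Mar': -20.0}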

**Basic Notebook Workflow:**

.. code-block:: python

   # Create a simple data extraction template
   # Start the string with the front matter marker so no blank line precedes it
   template_content = '''---
   system_prompt: You are an expert data analyst. Extract key metrics and insights.
   ---
   Analyze this dataset and extract the key findings:

   {{ data.content }}

   Focus on:
   1. Summary statistics
   2. Notable patterns or trends
   3. Data quality issues
   4. Recommendations for further analysis
   '''

   # Write template to file
   with open('data_analysis.j2', 'w') as f:
       f.write(template_content)

   # Define output schema
   schema = {
       "type": "object",
       "properties": {
           "summary_stats": {
               "type": "object",
               "description": "Key summary statistics"
           },
           "patterns": {
               "type": "array",
               "items": {"type": "string"},
               "description": "Notable patterns or trends found"
           },
           "data_quality": {
               "type": "array",
               "items": {"type": "string"},
               "description": "Data quality issues identified"
           },
           "recommendations": {
               "type": "array",
               "items": {"type": "string"},
               "description": "Recommendations for further analysis"
           }
       },
       "required": ["summary_stats", "patterns", "data_quality", "recommendations"]
   }

   import json
   with open('analysis_schema.json', 'w') as f:
       json.dump(schema, f, indent=2)
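
OpenAI's Structured Outputs in strict mode expects every object to list all of its properties in ``required`` and to set ``additionalProperties`` to ``false``. Whether ostruct adjusts schemas automatically is not covered in this guide, so the following is only a sketch of a pre-flight check. ``lint_strict_schema`` is a hypothetical helper, not an ostruct API, and it inspects only ``object`` and ``array`` nodes:

.. code-block:: python

   def lint_strict_schema(schema: dict, path: str = "$") -> list:
       """Collect warnings for object schemas that may be rejected in strict mode."""
       warnings = []
       if schema.get("type") == "object":
           props = schema.get("properties", {})
           required = set(schema.get("required", []))
           missing = sorted(set(props) - required)
           if missing:
               warnings.append(f"{path}: properties not in 'required': {missing}")
           if schema.get("additionalProperties") is not False:
               warnings.append(f"{path}: 'additionalProperties' is not false")
           # Recurse into nested property schemas
           for name, sub in props.items():
               warnings.extend(lint_strict_schema(sub, f"{path}.{name}"))
       elif schema.get("type") == "array" and isinstance(schema.get("items"), dict):
           warnings.extend(lint_strict_schema(schema["items"], f"{path}[]"))
       return warnings


   # Illustrative schema with two deliberate gaps
   example = {
       "type": "object",
       "properties": {
           "summary_stats": {"type": "object"},
           "patterns": {"type": "array", "items": {"type": "string"}},
       },
       "required": ["patterns"],
   }
   for warning in lint_strict_schema(example):
       print(warning)

An empty list means the sketch found nothing to flag; it does not guarantee that the API will accept the schema.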

**Running Analysis in Notebooks:**

.. code-block:: python

   # Run ostruct analysis
   import subprocess
   import json

   # Execute ostruct command
   result = subprocess.run([
       'ostruct', 'run', 'data_analysis.j2', 'analysis_schema.json',
       '--file', 'ci:data', 'your_dataset.csv',
       '--model', 'gpt-4o',
       '--output-file', 'analysis_results.json'
   ], capture_output=True, text=True)

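   # Stop early if the CLI failed, so a stale or missing results file is not read
   if result.returncode != 0:
       raise RuntimeError(f"ostruct failed: {result.stderr}")

   # Load and display results
   with open('analysis_results.json', 'r') as f:
       analysis = json.load(f)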

   print("Analysis Results:")
   print(f"Patterns found: {len(analysis['patterns'])}")
   for pattern in analysis['patterns']:
       print(f"  • {pattern}")

Interactive Jupyter Notebook Example
=====================================

Experience ostruct data science workflows interactively with our comprehensive Jupyter notebook:

.. raw:: html

   <div style="text-align: center; margin: 20px 0;">
   <a href="https://colab.research.google.com/github/yaniv-golan/ostruct/blob/main/examples/data-science/notebooks/ostruct_data_analysis.ipynb" target="_blank">
       <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" style="margin: 10px;">
   </a>
   </div>

**What's included in the notebook:**

- **6 Complete Examples**: From basic analysis to advanced multi-tool workflows
- **Working Code**: All examples include working templates, schemas, and data
- **Financial Analysis**: Quarterly financial analysis with market context
- **Business Intelligence**: Competitive analysis and strategic recommendations
- **Interactive Workflows**: Dynamic analysis based on custom questions
- **Batch Processing**: Production-ready patterns for multiple datasets
- **Best Practices**: Performance optimization, cost management, security

**Local Usage:**

.. code-block:: bash

   # Clone and run locally
   git clone https://github.com/yaniv-golan/ostruct.git
   cd ostruct/examples/data-science/notebooks
   jupyter notebook ostruct_data_analysis.ipynb

The notebook demonstrates all the workflows described in this guide with working code you can run immediately.

Advanced Notebook Integration
-----------------------------

**Jupyter Magic Commands for ostruct:**

.. code-block:: python

   # Create reusable magic command for ostruct
   from IPython.core.magic import line_magic, Magics, magics_class
   from IPython.core.magic_arguments import argument, magic_arguments, parse_argstring
   import subprocess
   import json

   @magics_class
   class OstructMagics(Magics):

       @line_magic
       @magic_arguments()
       @argument('template', help='Template file path')
       @argument('schema', help='Schema file path')
       @argument('--file', dest='data_file', required=True, help='Data file to analyze')
       @argument('--model', default='gpt-4o-mini', help='Model to use')
       def ostruct(self, line):
           """Run ostruct analysis from Jupyter cell"""
           args = parse_argstring(self.ostruct, line)

           result = subprocess.run([
               'ostruct', 'run', args.template, args.schema,
               '--file', 'ci:data', args.data_file,
               '--model', args.model,
               '--output-file', 'results.json'
           ], capture_output=True, text=True)

           if result.returncode == 0:
               with open('results.json', 'r') as f:
                   return json.load(f)
           else:
               print(f"Error: {result.stderr}")
               return None

   # Register the magic
   get_ipython().register_magics(OstructMagics)

   # Usage: %ostruct analysis.j2 schema.json --file data.csv --model gpt-4o

**DataFrame Integration Patterns:**

.. code-block:: python

   import json
   import os
   import subprocess
   import tempfile

   import pandas as pd

   class DataFrameAnalyzer:
       """Enhanced DataFrame analysis with ostruct integration"""

       def __init__(self, df):
           self.df = df
           self.temp_files = []

       def create_context_template(self, analysis_focus="general"):
           """Generate template with DataFrame context"""
           # Start the string with the front matter marker so no blank line precedes it
           template = f'''---
   system_prompt: |
     You are analyzing a dataset with {len(self.df)} rows and {len(self.df.columns)} columns.
     Focus on {analysis_focus} analysis patterns.
   ---

   Dataset Overview:
   - Shape: {self.df.shape[0]} rows × {self.df.shape[1]} columns
   - Columns: {", ".join(self.df.columns.tolist())}
   - Data types: {dict(self.df.dtypes.astype(str))}

   Sample data:
   {{{{ data.content }}}}

   Analysis Requirements:
   1. Identify key patterns and trends
   2. Assess data quality and completeness
   3. Suggest follow-up analysis steps
   4. Highlight any anomalies or outliers
   '''
           return template

       def analyze(self, focus="general", sample_size=1000):
           """Run ostruct analysis on DataFrame"""

           # Sample large datasets
           sample_df = self.df.sample(min(sample_size, len(self.df)))

           # Create temporary files
           with tempfile.NamedTemporaryFile(mode='w', suffix='.csv', delete=False) as f:
               csv_file = f.name
               sample_df.to_csv(f, index=False)
               self.temp_files.append(csv_file)

           with tempfile.NamedTemporaryFile(mode='w', suffix='.j2', delete=False) as f:
               template_file = f.name
               f.write(self.create_context_template(focus))
               self.temp_files.append(template_file)

           # Define schema
           schema = {
               "type": "object",
               "properties": {
                   "summary": {"type": "string", "description": "Overall dataset summary"},
                   "patterns": {
                       "type": "array",
                       "items": {"type": "string"},
                       "description": "Key patterns identified"
                   },
                   "quality_issues": {
                       "type": "array",
                       "items": {"type": "string"},
                       "description": "Data quality concerns"
                   },
                   "recommendations": {
                       "type": "array",
                       "items": {"type": "string"},
                       "description": "Analysis recommendations"
                   }
               }
           }

           with tempfile.NamedTemporaryFile(mode='w', suffix='.json', delete=False) as f:
               schema_file = f.name
               json.dump(schema, f, indent=2)
               self.temp_files.append(schema_file)

           # Run analysis
           result = subprocess.run([
               'ostruct', 'run', template_file, schema_file,
               '--file', 'ci:data', csv_file,
               '--model', 'gpt-4o-mini'
           ], capture_output=True, text=True)

           if result.returncode == 0:
               return json.loads(result.stdout)
           else:
               print(f"Analysis failed: {result.stderr}")
               return None

       def cleanup(self):
           """Clean up temporary files"""
           for file_path in self.temp_files:
               try:
                   os.unlink(file_path)
               except FileNotFoundError:
                   pass
           self.temp_files = []

       def __del__(self):
           self.cleanup()

   # Usage example
   df = pd.read_csv('sales_data.csv')
   analyzer = DataFrameAnalyzer(df)
   insights = analyzer.analyze(focus="sales trends")
   if insights is not None:
       print(f"Found {len(insights['patterns'])} patterns")
   analyzer.cleanup()  # remove the temporary CSV, template, and schema files

**Token Management for Large Datasets:**

.. code-block:: python

   def smart_sample_for_analysis(df, max_tokens=8000, chars_per_token=4):
       """
       Intelligently sample DataFrame to fit within token limits
       """
       # Estimate current size
       csv_str = df.to_csv(index=False)
       estimated_tokens = len(csv_str) // chars_per_token

       if estimated_tokens <= max_tokens:
           return df

       # Calculate sample size needed
       sample_ratio = max_tokens / estimated_tokens
       sample_size = max(1, int(len(df) * sample_ratio * 0.8))  # keep a 20% safety margin

       print(f"Dataset too large ({estimated_tokens} tokens). Sampling {sample_size} rows.")

       # Stratified sampling if categorical columns exist
       categorical_cols = df.select_dtypes(include=['object']).columns
       if len(categorical_cols) > 0:
           strata = categorical_cols[0]
           # At least one row per group, so a small budget does not empty the sample
           per_group = max(1, sample_size // df[strata].nunique())
           return df.groupby(strata, group_keys=False).apply(
               lambda x: x.sample(min(len(x), per_group))
           ).reset_index(drop=True)
       else:
           return df.sample(sample_size)

   # Usage
   large_df = pd.read_csv('large_dataset.csv')
   manageable_df = smart_sample_for_analysis(large_df)
   analyzer = DataFrameAnalyzer(manageable_df)
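
The sampling arithmetic above can be checked without pandas. This sketch assumes the same rough heuristic of four characters per token and the same 0.8 multiplier; ``estimate_sample_rows`` is a hypothetical helper introduced only for illustration:

.. code-block:: python

   def estimate_sample_rows(n_rows, n_chars, max_tokens=8000, chars_per_token=4, safety=0.8):
       """Return how many rows to keep so the CSV text fits the token budget."""
       estimated_tokens = n_chars // chars_per_token
       if estimated_tokens <= max_tokens:
           return n_rows  # already within budget
       ratio = max_tokens / estimated_tokens
       return max(1, int(n_rows * ratio * safety))


   # 1,000 rows serialised to 64,000 characters is about 16,000 tokens,
   # twice the budget, so half the rows fit; the 0.8 multiplier leaves 400.
   print(estimate_sample_rows(n_rows=1000, n_chars=64_000))  # 400

The character-based estimate is coarse; real token counts depend on the model's tokenizer and the data itself.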

**Environment Variable Management:**

.. code-block:: python

   # Secure API key management for notebooks
   import os
   from getpass import getpass

   def setup_ostruct_environment():
       """Setup ostruct environment variables securely"""

       if 'OPENAI_API_KEY' not in os.environ:
           print("OpenAI API key not found in environment.")
           api_key = getpass("Enter your OpenAI API key: ")
           os.environ['OPENAI_API_KEY'] = api_key
           print("✓ API key set for this session")

       # Set notebook-friendly defaults
       os.environ['OSTRUCT_CACHE_UPLOADS'] = 'true'
       os.environ['OSTRUCT_TEMPLATE_FILE_LIMIT'] = '10MB'

       print("✓ ostruct environment configured")

   # Run at start of notebook
   setup_ostruct_environment()

**Visualization Integration:**

.. code-block:: python

   def generate_analysis_visualizations(df, analysis_results):
       """
       Generate visualizations based on ostruct analysis recommendations
       """
       import matplotlib.pyplot as plt
       import seaborn as sns

       # Extract visualization suggestions from analysis
       if 'recommendations' in analysis_results:
           viz_suggestions = [
               rec for rec in analysis_results['recommendations']
               if any(word in rec.lower() for word in ['plot', 'chart', 'graph', 'visualiz'])
           ]

           for suggestion in viz_suggestions:
               print(f"Visualization suggestion: {suggestion}")

       # Auto-generate basic plots for numeric columns
       numeric_cols = df.select_dtypes(include=['number']).columns

       if len(numeric_cols) > 0:
           fig, axes = plt.subplots(2, 2, figsize=(12, 10))

           # Distribution plot
           df[numeric_cols[0]].hist(ax=axes[0,0])
           axes[0,0].set_title(f'Distribution of {numeric_cols[0]}')

           # Correlation heatmap if multiple numeric columns
           if len(numeric_cols) > 1:
               corr_matrix = df[numeric_cols].corr()
               sns.heatmap(corr_matrix, annot=True, ax=axes[0,1])
               axes[0,1].set_title('Correlation Matrix')

           # Box plot for outlier detection
           df.boxplot(column=numeric_cols[0], ax=axes[1,0])
           axes[1,0].set_title(f'Outliers in {numeric_cols[0]}')

           # Trend over index (if meaningful)
           df[numeric_cols[0]].plot(ax=axes[1,1])
           axes[1,1].set_title(f'Trend of {numeric_cols[0]}')

           plt.tight_layout()
           plt.show()

   # Usage after analysis (the function displays plots and returns nothing)
   generate_analysis_visualizations(df, insights)

Multi-Tool Data Science Workflows
==================================

Combining Code Interpreter, File Search, and Web Search
-------------------------------------------------------

**Market Research + Data Analysis Example:**

.. code-block:: bash

   # Comprehensive business intelligence workflow
   ostruct run market_analysis.j2 business_intel_schema.json \
     --file ci:sales_data quarterly_sales.csv \
     --file fs:market_reports industry_report.pdf \
     --enable-tool web-search \
     --model gpt-4o

**Template Example (market_analysis.j2):**

.. code-block:: jinja

   ---
   system_prompt: |
     You are a senior business analyst. Combine quantitative sales data with
     market research and current industry trends to provide comprehensive insights.
   ---

   # Business Intelligence Analysis

   ## Sales Data Analysis
   {% if code_interpreter_enabled %}
   Analyze the sales data for trends, seasonality, and performance metrics:
   {{ sales_data.content }}

   Generate visualizations showing:
   - Monthly sales trends
   - Product category performance
   - Regional sales distribution
   {% endif %}

   ## Market Context
   {% if file_search_enabled %}
   Research market conditions and competitive landscape from the uploaded report
   ({{ market_reports.name }}), which is available through File Search.

   Extract insights about:
   - Market size and growth
   - Competitive positioning
   - Industry trends
   {% endif %}

   ## Current Market Intelligence
   {% if web_search_enabled %}
   Research current market conditions, recent news, and industry developments
   relevant to our business sector.
   {% endif %}

   ## Synthesis
   Combine all data sources to provide:
   1. Performance assessment against market conditions
   2. Opportunities and threats analysis
   3. Strategic recommendations
   4. Key metrics to monitor

**Output Schema for Business Intelligence:**

.. code-block:: json

   {
     "type": "object",
     "properties": {
       "sales_analysis": {
         "type": "object",
         "properties": {
           "trends": {"type": "array", "items": {"type": "string"}},
           "key_metrics": {"type": "object"},
           "performance_summary": {"type": "string"}
         }
       },
       "market_context": {
         "type": "object",
         "properties": {
           "market_size": {"type": "string"},
           "growth_rate": {"type": "string"},
           "competitive_position": {"type": "string"}
         }
       },
       "current_intelligence": {
         "type": "array",
         "items": {"type": "string"},
         "description": "Recent market developments"
       },
       "strategic_recommendations": {
         "type": "array",
         "items": {
           "type": "object",
           "properties": {
             "recommendation": {"type": "string"},
             "priority": {"type": "string", "enum": ["high", "medium", "low"]},
             "rationale": {"type": "string"}
           }
         }
       }
     }
   }
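
Because ``priority`` is constrained to an enum, downstream code can order recommendations without parsing free text. A minimal sketch, assuming a response shaped like the schema above; ``sort_recommendations`` and the sample data are hypothetical:

.. code-block:: python

   PRIORITY_ORDER = {"high": 0, "medium": 1, "low": 2}


   def sort_recommendations(result: dict) -> list:
       """Return strategic recommendations ordered high, then medium, then low."""
       recs = result.get("strategic_recommendations", [])
       # Unknown or missing priorities sort last
       return sorted(
           recs,
           key=lambda r: PRIORITY_ORDER.get(r.get("priority"), len(PRIORITY_ORDER)),
       )


   # Made-up response fragment for illustration only
   sample = {
       "strategic_recommendations": [
           {"recommendation": "Review pricing", "priority": "low", "rationale": "..."},
           {"recommendation": "Expand region A", "priority": "high", "rationale": "..."},
           {"recommendation": "Audit churn", "priority": "medium", "rationale": "..."},
       ]
   }
   for rec in sort_recommendations(sample):
       print(f"[{rec['priority']}] {rec['recommendation']}")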

Research Synthesis Workflows
-----------------------------

**Academic Research Analysis:**

.. code-block:: bash

   # Combine literature review with data analysis
   ostruct run research_synthesis.j2 research_schema.json \
     --file fs:papers "*.pdf" --recursive \
     --file ci:dataset research_data.csv \
     --enable-tool web-search \
     --model gpt-4o

This workflow:

1. **Searches papers** using File Search for literature context
2. **Analyzes data** using Code Interpreter for statistical insights
3. **Updates with current research** using Web Search
4. **Synthesizes findings** into structured research output
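
When such a workflow is launched from a notebook, building the argument list in one place keeps the file routing explicit. The sketch below uses only flags that appear in this guide (``--file``, ``--enable-tool web-search``, ``--model``); ``build_ostruct_command`` is a hypothetical helper, and the glob-based ``fs:papers`` attachment is left out for brevity:

.. code-block:: python

   import shlex


   def build_ostruct_command(template, schema, files, model, web_search=False):
       """Build an argv list for ``ostruct run`` from (route:alias, path) pairs."""
       cmd = ["ostruct", "run", template, schema]
       for target, path in files:
           cmd += ["--file", target, path]
       if web_search:
           cmd += ["--enable-tool", "web-search"]
       cmd += ["--model", model]
       return cmd


   cmd = build_ostruct_command(
       "research_synthesis.j2",
       "research_schema.json",
       files=[("ci:dataset", "research_data.csv")],
       model="gpt-4o",
       web_search=True,
   )
   print(shlex.join(cmd))  # shell-safe rendering for logging

The resulting list can be passed directly to ``subprocess.run(cmd, capture_output=True, text=True)``, as in the notebook examples earlier in this guide.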

Comprehensive Multi-Tool Workflow Patterns
===========================================

CSV Analysis with Code Interpreter
-----------------------------------

**Pattern 1: Enhanced Data Analysis with Visualization**

.. code-block:: bash

   # Deep CSV analysis with automated visualization generation
   ostruct run csv_deep_analysis.j2 analysis_schema.json \
     --file ci:dataset sales_data.csv \
     --file ci:reference benchmark_data.csv \
     --model gpt-4o

**Template (csv_deep_analysis.j2):**

.. code-block:: jinja

   ---
   system_prompt: |
     You are a senior data analyst. Perform comprehensive analysis including
     statistical testing, visualization generation, and business insights.
   ---

   # Comprehensive CSV Data Analysis

   ## Dataset Overview
   Primary dataset: {{ dataset.name }} ({{ dataset.size }} bytes)
   Reference dataset: {{ reference.name }} ({{ reference.size }} bytes)

   ## Analysis Tasks

   ### 1. Statistical Analysis
   Load and analyze the primary dataset:
   ```python
   import pandas as pd
   import numpy as np
   import matplotlib.pyplot as plt
   import seaborn as sns
   from scipy import stats

   # Load data
   df = pd.read_csv('{{ dataset.name }}')
   ref_df = pd.read_csv('{{ reference.name }}')

   # Generate comprehensive statistics
   print("=== DATASET SUMMARY ===")
   print(df.describe())
   print(f"Dataset shape: {df.shape}")
   print(f"Missing values: {df.isnull().sum().sum()}")
   ```

   ### 2. Visualization Generation
   Create insightful visualizations:
   ```python
   # Set up the plotting environment
   plt.style.use('seaborn-v0_8')
   fig, axes = plt.subplots(2, 2, figsize=(15, 12))

   # 1. Distribution analysis
   numeric_cols = df.select_dtypes(include=[np.number]).columns
   if len(numeric_cols) > 0:
       df[numeric_cols[0]].hist(bins=30, ax=axes[0,0])
       axes[0,0].set_title(f'Distribution of {numeric_cols[0]}')

   # 2. Correlation heatmap
   if len(numeric_cols) > 1:
       corr_matrix = df[numeric_cols].corr()
       sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', ax=axes[0,1])
       axes[0,1].set_title('Correlation Matrix')

   # 3. Time series or trend analysis
   if 'date' in df.columns or 'timestamp' in df.columns:
       # Time series visualization logic
       pass
   else:
       # Box plot for outlier detection
       if len(numeric_cols) > 0:
           df.boxplot(column=numeric_cols[0], ax=axes[1,0])
           axes[1,0].set_title(f'Outlier Analysis: {numeric_cols[0]}')

   # 4. Comparative analysis with reference data
   # Compare key metrics between datasets (only if both share the column)
   if len(numeric_cols) > 0 and numeric_cols[0] in ref_df.columns:
       axes[1,1].bar(['Primary', 'Reference'],
                     [df[numeric_cols[0]].mean(), ref_df[numeric_cols[0]].mean()])
       axes[1,1].set_title('Comparative Analysis')

   plt.tight_layout()
   plt.savefig('comprehensive_analysis.png', dpi=300, bbox_inches='tight')
   plt.show()
   ```

   ### 3. Statistical Testing
   Perform significance tests:
   ```python
   # Compare primary vs reference dataset
   if len(numeric_cols) > 0 and numeric_cols[0] in ref_df.columns:
       primary_values = df[numeric_cols[0]].dropna()
       reference_values = ref_df[numeric_cols[0]].dropna()

       # T-test for mean differences
       t_stat, p_value = stats.ttest_ind(primary_values, reference_values)
       print(f"T-test results: t={t_stat:.4f}, p={p_value:.4f}")

       # Effect size (Cohen's d)
       pooled_std = np.sqrt(((len(primary_values)-1)*primary_values.var() +
                            (len(reference_values)-1)*reference_values.var()) /
                           (len(primary_values)+len(reference_values)-2))
       cohens_d = (primary_values.mean() - reference_values.mean()) / pooled_std
       print(f"Effect size (Cohen's d): {cohens_d:.4f}")
   ```

   Provide business insights and recommendations based on the analysis.

**Output Schema:**

.. code-block:: json

   {
     "type": "object",
     "properties": {
       "dataset_summary": {
         "type": "object",
         "properties": {
           "rows": {"type": "number"},
           "columns": {"type": "number"},
           "missing_values": {"type": "number"},
           "data_types": {"type": "object"}
         }
       },
       "statistical_analysis": {
         "type": "object",
         "properties": {
           "descriptive_stats": {"type": "object"},
           "correlations": {"type": "array", "items": {"type": "object"}},
           "significance_tests": {"type": "array", "items": {"type": "object"}}
         }
       },
       "visualizations": {
         "type": "array",
         "items": {
           "type": "object",
           "properties": {
             "filename": {"type": "string"},
             "type": {"type": "string"},
             "description": {"type": "string"},
             "insights": {"type": "array", "items": {"type": "string"}}
           }
         }
       },
       "business_insights": {
         "type": "array",
         "items": {"type": "string"}
       },
       "recommendations": {
         "type": "array",
         "items": {
           "type": "object",
           "properties": {
             "recommendation": {"type": "string"},
             "priority": {"type": "string"},
             "rationale": {"type": "string"}
           }
         }
       }
     }
   }
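
The effect-size step in the ``csv_deep_analysis.j2`` template computes Cohen's d with a pooled standard deviation. To sanity-check values reported under ``significance_tests``, the same formula can be reproduced locally with the standard library. This is a sketch on toy numbers, and ``cohens_d`` is a hypothetical helper name:

.. code-block:: python

   import math
   import statistics


   def cohens_d(sample_a, sample_b):
       """Cohen's d using the pooled sample standard deviation."""
       n_a, n_b = len(sample_a), len(sample_b)
       var_a = statistics.variance(sample_a)  # sample variance (n - 1 denominator)
       var_b = statistics.variance(sample_b)
       pooled_std = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
       return (statistics.mean(sample_a) - statistics.mean(sample_b)) / pooled_std


   # Toy numbers: both samples have variance 4, with means 4 and 3,
   # so the pooled standard deviation is 2 and d = (4 - 3) / 2.
   print(cohens_d([2, 4, 6], [1, 3, 5]))  # 0.5

``statistics.variance`` uses the same ``n - 1`` denominator as the pandas ``.var()`` default used in the template.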

Web Research + Data Analysis Combinations
------------------------------------------

**Pattern 2: Market Intelligence with Data Validation**

.. code-block:: bash

   # Combine internal sales data with current market intelligence
   ostruct run market_intelligence.j2 market_schema.json \
     --file ci:sales internal_sales.csv \
     --file ci:competitor competitor_analysis.csv \
     --enable-tool web-search \
     --ws-context-size high \
     --model gpt-4o

**Template (market_intelligence.j2):**

.. code-block:: jinja

   ---
   system_prompt: |
     You are a market intelligence analyst. Use web search to gather current
     market data and validate it against internal analysis.
   ---

   # Market Intelligence Analysis

   ## Internal Data Analysis

   ### Sales Performance Analysis
   ```python
   import pandas as pd
   import matplotlib.pyplot as plt

   # Load internal data
   sales_df = pd.read_csv('{{ sales.name }}')
   competitor_df = pd.read_csv('{{ competitor.name }}')

   # Analyze sales trends
   print("=== INTERNAL SALES ANALYSIS ===")
   monthly_sales = sales_df.groupby('month')['revenue'].sum()
   print("Monthly revenue trends:")
   print(monthly_sales.describe())

   # Competitive position analysis
   print("\n=== COMPETITIVE ANALYSIS ===")
   print("Market share analysis:")
   print(competitor_df['market_share'].describe())

   # Generate comparison chart
   fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

   monthly_sales.plot(kind='line', ax=ax1, title='Internal Sales Trend')
   competitor_df['market_share'].plot(kind='bar', ax=ax2, title='Market Share Distribution')

   plt.tight_layout()
   plt.savefig('internal_analysis.png')
   plt.show()
   ```

   ## Current Market Research

   Research current market conditions and trends:
   - Industry growth rates and forecasts
   - Recent competitor announcements and strategy changes
   - Regulatory changes affecting the market
   - Consumer behavior shifts and emerging trends
   - Technology disruptions in the sector

   Focus your search on:
   1. Market size and growth projections for {{ sales.content | extract_industry }}
   2. Recent competitor activities and market positioning
   3. Consumer preference shifts in the last 6 months
   4. Regulatory or economic factors affecting demand

   ## Data Validation and Synthesis

   ```python
   # Cross-validate web research findings with internal data
   print("=== VALIDATION ANALYSIS ===")

   # Check if internal trends align with market research
   internal_growth = (monthly_sales.iloc[-1] - monthly_sales.iloc[0]) / monthly_sales.iloc[0] * 100
   print(f"Internal growth rate: {internal_growth:.2f}%")

   # This will be compared with web research findings
   print("Compare this with market research growth rates above")

   # Identify discrepancies and opportunities
   avg_competitor_share = competitor_df['market_share'].mean()
   our_estimated_share = 100 / len(competitor_df)  # Assuming equal distribution

   print(f"Average competitor market share: {avg_competitor_share:.2f}%")
   print(f"Our estimated position: {our_estimated_share:.2f}%")
   ```

   Synthesize findings from internal data and web research to provide:
   1. Market opportunity assessment
   2. Competitive positioning recommendations
   3. Strategic actions based on combined insights
   4. Risk factors identified from external research
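
The validation step in the template compares a first-to-last growth rate with figures found through web search. The arithmetic is simple enough to verify by hand; a minimal sketch with illustrative values, where ``growth_rate_percent`` is a hypothetical helper that also guards against a zero baseline:

.. code-block:: python

   def growth_rate_percent(series):
       """Percent change from the first to the last value of a sequence."""
       first, last = series[0], series[-1]
       if first == 0:
           raise ValueError("Growth rate is undefined when the first value is zero")
       return (last - first) / first * 100


   monthly_revenue = [100, 150, 120]  # illustrative values, not real data
   print(f"{growth_rate_percent(monthly_revenue):.2f}%")  # 20.00%

This measure ignores everything between the two endpoints, so a single unusual first or last month can dominate the result.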

File Search + Code Interpreter for Research Synthesis
-----------------------------------------------------

**Pattern 3: Academic Literature + Data Analysis**

.. code-block:: bash

   # Combine literature review with experimental data analysis
   ostruct run research_synthesis.j2 research_schema.json \
     --file fs:literature "research_papers/*.pdf" --recursive \
     --file ci:data experimental_results.csv \
     --file ci:reference baseline_data.csv \
     --enable-tool web-search \
     --model gpt-4o

**Template (research_synthesis.j2):**

.. code-block:: jinja

   ---
   system_prompt: |
     You are a research scientist. Synthesize literature findings with
     experimental data analysis to draw comprehensive conclusions.
   ---

   # Research Synthesis Analysis

   ## Literature Review Summary

   {% if file_search_enabled %}
   Based on the research papers in your knowledge base:

   {{ literature }}

   Extract and summarize:
   1. **Methodology consensus**: What experimental approaches are most validated?
   2. **Key findings**: What are the established relationships and effects?
   3. **Gaps identified**: What questions remain unanswered?
   4. **Methodological considerations**: What are the best practices?
   {% else %}
   Note: File Search unavailable. Proceeding with data analysis and web research.
   {% endif %}

   ## Experimental Data Analysis

   ### Statistical Analysis of Results
   ```python
   import pandas as pd
   import numpy as np
   import matplotlib.pyplot as plt
   import seaborn as sns
   from scipy import stats
   import statsmodels.api as sm

   # Load experimental data
   results_df = pd.read_csv('{{ data.name }}')
   baseline_df = pd.read_csv('{{ reference.name }}')

   print("=== EXPERIMENTAL DATA ANALYSIS ===")
   print(f"Experimental data shape: {results_df.shape}")
   print(f"Baseline data shape: {baseline_df.shape}")

   # Descriptive statistics
   print("\nExperimental Results Summary:")
   print(results_df.describe())

   print("\nBaseline Summary:")
   print(baseline_df.describe())

   # Statistical comparisons
   numeric_cols = results_df.select_dtypes(include=[np.number]).columns

   for col in numeric_cols:
       if col in baseline_df.columns:
           exp_values = results_df[col].dropna()
           baseline_values = baseline_df[col].dropna()

           # T-test
           t_stat, p_val = stats.ttest_ind(exp_values, baseline_values)

           # Effect size
           pooled_std = np.sqrt(((len(exp_values)-1)*exp_values.var() +
                                (len(baseline_values)-1)*baseline_values.var()) /
                               (len(exp_values)+len(baseline_values)-2))
           effect_size = (exp_values.mean() - baseline_values.mean()) / pooled_std

           print(f"\n{col} Analysis:")
           print(f"  Experimental mean: {exp_values.mean():.4f}")
           print(f"  Baseline mean: {baseline_values.mean():.4f}")
           print(f"  T-test p-value: {p_val:.4f}")
           print(f"  Effect size: {effect_size:.4f}")
   ```

   ### Visualization of Key Findings
   ```python
   # Create comprehensive visualization
   fig, axes = plt.subplots(2, 2, figsize=(15, 12))

   # 1. Comparison of means
   if len(numeric_cols) > 0:
       col = numeric_cols[0]
       means = [results_df[col].mean(), baseline_df[col].mean()]
       stds = [results_df[col].std(), baseline_df[col].std()]

       labels = ['Experimental', 'Baseline']
       x_pos = np.arange(len(labels))
       axes[0,0].bar(x_pos, means, yerr=stds, capsize=5)
       axes[0,0].set_xticks(x_pos)
       axes[0,0].set_xticklabels(labels)
       axes[0,0].set_title(f'Comparison of {col}')

   # 2. Distribution comparison
   if len(numeric_cols) > 0:
       axes[0,1].hist(results_df[numeric_cols[0]].dropna(), alpha=0.7, label='Experimental')
       axes[0,1].hist(baseline_df[numeric_cols[0]].dropna(), alpha=0.7, label='Baseline')
       axes[0,1].legend()
       axes[0,1].set_title('Distribution Comparison')

   # 3. Correlation analysis
   if len(numeric_cols) > 1:
       corr_matrix = results_df[numeric_cols].corr()
       sns.heatmap(corr_matrix, annot=True, ax=axes[1,0])
       axes[1,0].set_title('Experimental Data Correlations')

   # 4. Trend analysis or scatter plot
   if len(numeric_cols) > 1:
       axes[1,1].scatter(results_df[numeric_cols[0]], results_df[numeric_cols[1]])
       axes[1,1].set_xlabel(numeric_cols[0])
       axes[1,1].set_ylabel(numeric_cols[1])
       axes[1,1].set_title('Relationship Analysis')

   plt.tight_layout()
   plt.savefig('research_analysis.png', dpi=300)
   plt.show()
   ```

   ## Current Research Context

   Search for recent publications and developments related to:
   - Latest methodological advances in this research area
   - Recent findings that support or contradict our results
   - Emerging theoretical frameworks
   - Clinical or practical applications of similar research

   ## Synthesis and Conclusions

   ```python
   print("=== RESEARCH SYNTHESIS ===")

   # Summary statistics for reporting
   if len(numeric_cols) > 0:
       primary_metric = numeric_cols[0]
       exp_mean = results_df[primary_metric].mean()
       baseline_mean = baseline_df[primary_metric].mean()
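       # Relative improvement over baseline (undefined when the baseline mean is zero)
       improvement = ((exp_mean - baseline_mean) / baseline_mean) * 100 if baseline_mean != 0 else float('nan')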

       print(f"Primary outcome ({primary_metric}):")
       print(f"  Experimental: {exp_mean:.4f}")
       print(f"  Baseline: {baseline_mean:.4f}")
       print(f"  Improvement: {improvement:.2f}%")

   print("\nReady for synthesis with literature and current research...")
   ```

   Provide comprehensive synthesis addressing:
   1. How experimental results align with literature findings
   2. Novel contributions of this research
   3. Limitations and considerations based on methodological review
   4. Future research directions
   5. Practical implications and applications
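
The effect-size step in the template above is Cohen's d with a pooled standard deviation. The following is a minimal standalone sketch of the same arithmetic using only the standard library, which can help sanity-check values reported by Code Interpreter. The function name ``cohens_d`` and the sample values are illustrative assumptions, not part of ostruct or the template.

.. code-block:: python

   from math import sqrt
   from statistics import mean, variance


   def cohens_d(experimental, baseline):
       """Cohen's d using the pooled sample standard deviation.

       Hypothetical helper mirroring the template's effect-size arithmetic.
       Both inputs need at least two values so the sample variance is defined.
       """
       n1, n2 = len(experimental), len(baseline)
       if n1 < 2 or n2 < 2:
           raise ValueError("each group needs at least two observations")
       # Pooled variance weights each group's sample variance by its degrees of freedom.
       pooled_var = ((n1 - 1) * variance(experimental) +
                     (n2 - 1) * variance(baseline)) / (n1 + n2 - 2)
       if pooled_var == 0:
           raise ValueError("pooled variance is zero; effect size is undefined")
       return (mean(experimental) - mean(baseline)) / sqrt(pooled_var)


   # Illustrative values only
   d = cohens_d([2.0, 4.0, 6.0], [1.0, 3.0, 5.0])
   print(f"Cohen's d: {d:.2f}")

``statistics.variance`` computes the sample variance (n - 1 denominator), which matches the default of the pandas ``Series.var()`` call used in the template.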

Visualization Generation Patterns
---------------------------------

**Pattern 4: Automated Chart Generation with Business Context**

.. code-block:: bash

   # Generate contextual visualizations with business insights
   ostruct run viz_generation.j2 visualization_schema.json \
     --file ci:data business_metrics.csv \
     --file ci:benchmark industry_benchmarks.csv \
     --enable-tool web-search \
     --model gpt-4o

**Template (viz_generation.j2):**

.. code-block:: jinja

   ---
   system_prompt: |
     You are a data visualization expert and business analyst. Create insightful
     visualizations that tell a compelling business story.
   ---

   # Business Data Visualization Generation

   ## Data Exploration and Preparation

   ```python
   import pandas as pd
   import matplotlib.pyplot as plt
   import seaborn as sns
   import plotly.graph_objects as go
   import plotly.express as px
   from plotly.subplots import make_subplots
   import numpy as np
   from datetime import datetime

   # Load data
   business_df = pd.read_csv('{{ data.name }}')
   benchmark_df = pd.read_csv('{{ benchmark.name }}')

   print("=== DATA OVERVIEW ===")
   print(f"Business data shape: {business_df.shape}")
   print(f"Benchmark data shape: {benchmark_df.shape}")
   print("\nBusiness data columns:", business_df.columns.tolist())
   print("Benchmark data columns:", benchmark_df.columns.tolist())

   # Data quality check
   print("\nMissing values:")
   print("Business data:", business_df.isnull().sum().sum())
   print("Benchmark data:", benchmark_df.isnull().sum().sum())
   ```

   ## Visualization 1: Performance Dashboard

   ```python
   # Create a comprehensive dashboard
   fig = make_subplots(
       rows=2, cols=2,
       subplot_titles=('Revenue Trend', 'Performance vs Benchmark',
                      'Category Breakdown', 'Growth Analysis'),
       specs=[[{"secondary_y": True}, {"type": "bar"}],
              [{"type": "pie"}, {"type": "scatter"}]]
   )

   # 1. Revenue trend with growth rate
   if 'date' in business_df.columns and 'revenue' in business_df.columns:
       business_df['date'] = pd.to_datetime(business_df['date'])
       monthly_revenue = business_df.groupby('date')['revenue'].sum().reset_index()

       fig.add_trace(
           go.Scatter(x=monthly_revenue['date'], y=monthly_revenue['revenue'],
                     mode='lines+markers', name='Revenue'),
           row=1, col=1
       )

       # Add growth rate on secondary y-axis
       monthly_revenue['growth_rate'] = monthly_revenue['revenue'].pct_change() * 100
       fig.add_trace(
           go.Scatter(x=monthly_revenue['date'], y=monthly_revenue['growth_rate'],
                     mode='lines', name='Growth Rate %'),
           row=1, col=1, secondary_y=True
       )

   # 2. Performance vs Benchmark
   if all(col in df.columns for df in (business_df, benchmark_df) for col in ('metric', 'value')):
       metrics = business_df['metric'].unique()[:5]  # First 5 distinct metrics
       business_values = [business_df[business_df['metric']==m]['value'].mean() for m in metrics]
       benchmark_values = [benchmark_df[benchmark_df['metric']==m]['value'].mean() for m in metrics]

       fig.add_trace(
           go.Bar(x=metrics, y=business_values, name='Our Performance'),
           row=1, col=2
       )
       fig.add_trace(
           go.Bar(x=metrics, y=benchmark_values, name='Industry Benchmark'),
           row=1, col=2
       )

   # 3. Category breakdown
   if 'category' in business_df.columns and 'value' in business_df.columns:
       category_totals = business_df.groupby('category')['value'].sum()
       fig.add_trace(
           go.Pie(labels=category_totals.index, values=category_totals.values,
                 name="Category Distribution"),
           row=2, col=1
       )

   # 4. Growth analysis scatter
   if 'investment' in business_df.columns and 'return' in business_df.columns:
       fig.add_trace(
           go.Scatter(x=business_df['investment'], y=business_df['return'],
                     mode='markers', name='ROI Analysis',
                     text=business_df.get('category', ''),
                     textposition="top center"),
           row=2, col=2
       )

   fig.update_layout(height=800, showlegend=True,
                     title_text="Business Performance Dashboard")
   fig.write_html("business_dashboard.html")
   fig.show()
   ```

   ## Visualization 2: Competitive Analysis Charts

   ```python
   # Advanced competitive positioning
   plt.style.use('seaborn-v0_8')
   fig, axes = plt.subplots(2, 2, figsize=(16, 12))

   # Market positioning bubble chart
   if all(col in business_df.columns for col in ['market_share', 'growth_rate', 'revenue']):
       scatter = axes[0,0].scatter(business_df['market_share'],
                                  business_df['growth_rate'],
                                  s=business_df['revenue']/1000,  # Bubble size
                                  alpha=0.6, c=range(len(business_df)),
                                  cmap='viridis')
       axes[0,0].set_xlabel('Market Share (%)')
       axes[0,0].set_ylabel('Growth Rate (%)')
       axes[0,0].set_title('Market Positioning (Bubble size = Revenue)')

       # Add competitor benchmarks if available
       if all(col in benchmark_df.columns for col in ['market_share', 'growth_rate']):
           axes[0,0].scatter(benchmark_df['market_share'],
                            benchmark_df['growth_rate'],
                            marker='x', s=100, c='red', label='Competitors')
           axes[0,0].legend()

   # Performance radar chart simulation
   categories = ['Revenue', 'Market Share', 'Customer Satisfaction', 'Innovation', 'Efficiency']
   if len([col for col in categories if col.lower().replace(' ', '_') in business_df.columns]) >= 3:
       # Create radar chart data
       our_scores = []
       benchmark_scores = []

       for category in categories:
           col_name = category.lower().replace(' ', '_')
           if col_name in business_df.columns:
               our_scores.append(business_df[col_name].mean())
               benchmark_scores.append(benchmark_df[col_name].mean() if col_name in benchmark_df.columns else our_scores[-1] * 0.9)
           else:
               our_scores.append(0)
               benchmark_scores.append(0)

       # Polar plot simulation using regular plot
       angles = np.linspace(0, 2*np.pi, len(categories), endpoint=False).tolist()
       our_scores += our_scores[:1]  # Complete the circle
       benchmark_scores += benchmark_scores[:1]
       angles += angles[:1]

       axes[0,1].plot(angles, our_scores, 'o-', linewidth=2, label='Our Performance')
       axes[0,1].fill(angles, our_scores, alpha=0.25)
       axes[0,1].plot(angles, benchmark_scores, 'o-', linewidth=2, label='Industry Average')
       axes[0,1].fill(angles, benchmark_scores, alpha=0.25)
       axes[0,1].set_title('Performance Radar')
       axes[0,1].legend()

   # Trend comparison (requires 'date' and 'revenue'; 'customers' is optional)
   if 'date' in business_df.columns and 'revenue' in business_df.columns:
       business_df['date'] = pd.to_datetime(business_df['date'])
       has_customers = 'customers' in business_df.columns
       agg_cols = ['revenue', 'customers'] if has_customers else ['revenue']
       monthly_data = business_df.groupby('date')[agg_cols].sum().reset_index()

       lines = axes[1,0].plot(monthly_data['date'], monthly_data['revenue'],
                             'b-', label='Revenue')
       axes[1,0].set_xlabel('Date')
       axes[1,0].set_ylabel('Revenue', color='b')

       # Plot customers on a secondary axis only when the column exists
       if has_customers:
           ax2_twin = axes[1,0].twinx()
           lines += ax2_twin.plot(monthly_data['date'], monthly_data['customers'],
                                 'r--', label='Customers')
           ax2_twin.set_ylabel('Customers', color='r')

       axes[1,0].set_title('Revenue and Customer Trends')

       # Combine legends from both axes
       labels = [l.get_label() for l in lines]
       axes[1,0].legend(lines, labels, loc='upper left')

   # ROI and efficiency analysis
   if 'investment' in business_df.columns and 'return' in business_df.columns:
       business_df['roi'] = (business_df['return'] - business_df['investment']) / business_df['investment'] * 100

       # Box plot of ROI by category
       if 'category' in business_df.columns:
           categories = business_df['category'].unique()
           roi_by_category = [business_df[business_df['category']==cat]['roi'].values for cat in categories]
           axes[1,1].boxplot(roi_by_category, labels=categories)
           axes[1,1].set_title('ROI Distribution by Category')
           axes[1,1].set_ylabel('ROI (%)')
           plt.setp(axes[1,1].get_xticklabels(), rotation=45)

   plt.tight_layout()
   plt.savefig('competitive_analysis.png', dpi=300, bbox_inches='tight')
   plt.show()
   ```

   ## Current Market Context Research

   Research current market trends and industry benchmarks:
   - Industry performance metrics and KPIs
   - Recent market shifts and opportunities
   - Competitive landscape changes
   - Economic factors affecting performance

   ## Visualization Insights Summary

   ```python
   print("=== VISUALIZATION INSIGHTS ===")

   # Generate summary statistics for each visualization
   print("Dashboard Summary:")
   if 'revenue' in business_df.columns:
       total_revenue = business_df['revenue'].sum()
       avg_monthly_revenue = business_df.groupby('date')['revenue'].sum().mean() if 'date' in business_df.columns else business_df['revenue'].mean()
       print(f"  Total Revenue: ${total_revenue:,.2f}")
       print(f"  Average Monthly Revenue: ${avg_monthly_revenue:,.2f}")

   print("\nPerformance vs Benchmark:")
   if 'metric' in business_df.columns and 'value' in business_df.columns:
       our_avg = business_df['value'].mean()
       benchmark_avg = benchmark_df['value'].mean() if 'value' in benchmark_df.columns else 0
       performance_ratio = our_avg / benchmark_avg if benchmark_avg > 0 else 1
       print(f"  Our Average Performance: {our_avg:.2f}")
       print(f"  Industry Average: {benchmark_avg:.2f}")
       print(f"  Performance Ratio: {performance_ratio:.2f}x")

   print("\nVisualization files generated:")
   print("  - business_dashboard.html (Interactive dashboard)")
   print("  - competitive_analysis.png (Static analysis charts)")
   ```
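
Two derived metrics in the template above, the period-over-period growth rate and ROI, both divide by a value that may be zero in real data. The sketch below reproduces the arithmetic with plain Python so the edge cases are explicit. The function names and the choice to return ``None`` for undefined values are assumptions made for illustration (pandas itself yields ``inf`` or ``NaN`` in these cases).

.. code-block:: python

   def growth_rates(values):
       """Period-over-period percentage change.

       The first entry has no predecessor, so its rate is None. A zero
       predecessor also yields None instead of dividing by zero.
       """
       if not values:
           return []
       rates = [None]
       for prev, curr in zip(values, values[1:]):
           rates.append(None if prev == 0 else (curr - prev) / prev * 100)
       return rates


   def roi_percent(investment, returned):
       """Return on investment as a percentage, or None for a zero investment."""
       if investment == 0:
           return None
       return (returned - investment) / investment * 100


   # Illustrative values only
   print(growth_rates([100.0, 110.0, 99.0]))
   print(roi_percent(200.0, 250.0))

Filtering out the ``None`` entries before plotting keeps undefined values from distorting axis ranges.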

Data Science Schema Templates
=============================

This section provides ready-to-use JSON schema templates for common data science workflows. Each schema keeps outputs consistent and validated for one analysis type and can be customized for specific use cases.
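
As one way to customize a template, the sketch below adds a required field to a nested section of a schema without mutating the original. The helper ``add_required_field``, the ``business_unit`` field, and the trimmed-down ``base_schema`` are hypothetical and exist only for illustration; in practice the schema would be loaded from one of the JSON files in this section.

.. code-block:: python

   import copy
   import json

   # Trimmed-down stand-in for a schema template (illustrative only)
   base_schema = {
       "type": "object",
       "properties": {
           "analysis_metadata": {
               "type": "object",
               "properties": {
                   "dataset_name": {"type": "string"},
                   "analysis_type": {"type": "string",
                                     "enum": ["exploratory", "descriptive"]},
               },
               "required": ["dataset_name", "analysis_type"],
           }
       },
       "required": ["analysis_metadata"],
   }


   def add_required_field(schema, section, name, definition):
       """Return a copy of schema with a new required property in one section."""
       customized = copy.deepcopy(schema)  # leave the template untouched
       target = customized["properties"][section]
       target["properties"][name] = definition
       required = target.setdefault("required", [])
       if name not in required:
           required.append(name)
       return customized


   custom_schema = add_required_field(
       base_schema,
       "analysis_metadata",
       "business_unit",  # hypothetical field name
       {"type": "string", "description": "Team that owns the dataset"},
   )
   print(json.dumps(custom_schema["properties"]["analysis_metadata"], indent=2))

Working on a deep copy means several variants can be derived from the same template in one script.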

**Schema Template Library**

1. Comprehensive Data Analysis Schema
-------------------------------------

**Use Case**: Complete dataset analysis with statistics, patterns, and recommendations

.. code-block:: json

   {
     "type": "object",
     "properties": {
       "analysis_metadata": {
         "type": "object",
         "properties": {
           "dataset_name": {"type": "string", "description": "Name of the analyzed dataset"},
           "analysis_date": {"type": "string", "format": "date-time"},
           "analyst": {"type": "string", "description": "Name or ID of the analyst"},
           "analysis_type": {"type": "string", "enum": ["exploratory", "confirmatory", "descriptive", "predictive"]},
           "model_used": {"type": "string", "description": "OpenAI model used for analysis"}
         },
         "required": ["dataset_name", "analysis_date", "analysis_type"]
       },
       "dataset_summary": {
         "type": "object",
         "properties": {
           "rows": {"type": "integer", "minimum": 0},
           "columns": {"type": "integer", "minimum": 0},
           "missing_values": {"type": "integer", "minimum": 0},
           "data_types": {
             "type": "object",
             "additionalProperties": {"type": "string"}
           },
           "memory_usage": {"type": "string", "description": "Memory usage in MB/GB"},
           "date_range": {
             "type": "object",
             "properties": {
               "start_date": {"type": "string", "format": "date"},
               "end_date": {"type": "string", "format": "date"}
             }
           }
         },
         "required": ["rows", "columns"]
       },
       "statistical_analysis": {
         "type": "object",
         "properties": {
           "descriptive_stats": {
             "type": "object",
             "patternProperties": {
               "^[a-zA-Z_][a-zA-Z0-9_]*$": {
                 "type": "object",
                 "properties": {
                   "count": {"type": "number"},
                   "mean": {"type": "number"},
                   "median": {"type": "number"},
                   "std": {"type": "number"},
                   "min": {"type": "number"},
                   "max": {"type": "number"},
                   "q25": {"type": "number"},
                   "q75": {"type": "number"},
                   "skewness": {"type": "number"},
                   "kurtosis": {"type": "number"}
                 },
                 "required": ["count", "mean", "std"]
               }
             }
           },
           "correlations": {
             "type": "array",
             "items": {
               "type": "object",
               "properties": {
                 "variable_1": {"type": "string"},
                 "variable_2": {"type": "string"},
                 "correlation_coefficient": {"type": "number", "minimum": -1, "maximum": 1},
                 "p_value": {"type": "number", "minimum": 0, "maximum": 1},
                 "significance_level": {"type": "string", "enum": ["***", "**", "*", "ns"]},
                 "interpretation": {"type": "string"}
               },
               "required": ["variable_1", "variable_2", "correlation_coefficient"]
             }
           },
           "hypothesis_tests": {
             "type": "array",
             "items": {
               "type": "object",
               "properties": {
                 "test_name": {"type": "string"},
                 "variables": {"type": "array", "items": {"type": "string"}},
                 "statistic": {"type": "number"},
                 "p_value": {"type": "number", "minimum": 0, "maximum": 1},
                 "degrees_of_freedom": {"type": "integer", "minimum": 0},
                 "effect_size": {"type": "number"},
                 "confidence_interval": {
                   "type": "object",
                   "properties": {
                     "lower": {"type": "number"},
                     "upper": {"type": "number"},
                     "confidence_level": {"type": "number", "default": 0.95}
                   }
                 },
                 "conclusion": {"type": "string"},
                 "interpretation": {"type": "string"}
               },
               "required": ["test_name", "p_value", "conclusion"]
             }
           }
         }
       },
       "patterns_and_insights": {
         "type": "object",
         "properties": {
           "key_patterns": {
             "type": "array",
             "items": {
               "type": "object",
               "properties": {
                 "pattern": {"type": "string"},
                 "variables_involved": {"type": "array", "items": {"type": "string"}},
                 "strength": {"type": "string", "enum": ["weak", "moderate", "strong"]},
                 "confidence": {"type": "string", "enum": ["low", "medium", "high"]},
                 "business_impact": {"type": "string"}
               }
             }
           },
           "anomalies": {
             "type": "array",
             "items": {
               "type": "object",
               "properties": {
                 "type": {"type": "string", "enum": ["outlier", "missing_data", "inconsistency", "trend_break"]},
                 "description": {"type": "string"},
                 "affected_variables": {"type": "array", "items": {"type": "string"}},
                 "severity": {"type": "string", "enum": ["low", "medium", "high", "critical"]},
                 "recommended_action": {"type": "string"}
               }
             }
           },
           "trends": {
             "type": "array",
             "items": {
               "type": "object",
               "properties": {
                 "variable": {"type": "string"},
                 "trend_type": {"type": "string", "enum": ["increasing", "decreasing", "seasonal", "cyclical", "stable"]},
                 "strength": {"type": "number", "minimum": 0, "maximum": 1},
                 "time_period": {"type": "string"},
                 "forecast": {"type": "string"}
               }
             }
           }
         }
       },
       "recommendations": {
         "type": "array",
         "items": {
           "type": "object",
           "properties": {
             "category": {"type": "string", "enum": ["data_quality", "analysis", "business_action", "further_investigation"]},
             "recommendation": {"type": "string"},
             "priority": {"type": "string", "enum": ["low", "medium", "high", "critical"]},
             "rationale": {"type": "string"},
             "expected_impact": {"type": "string"},
             "effort_required": {"type": "string", "enum": ["minimal", "moderate", "significant"]},
             "timeline": {"type": "string"}
           },
           "required": ["recommendation", "priority", "rationale"]
         }
       }
     },
     "required": ["analysis_metadata", "dataset_summary", "statistical_analysis", "patterns_and_insights", "recommendations"]
   }
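
Because every result follows the schema above, downstream code can rely on its field names. The sketch below assumes the output has already been loaded into a dict and pulls out the statistically significant correlations, strongest first. The helper name ``significant_correlations``, the 0.05 threshold, and the sample values are illustrative assumptions. Note that ``p_value`` is optional in the schema, so entries without it are skipped.

.. code-block:: python

   def significant_correlations(result, alpha=0.05):
       """Correlations with p_value below alpha, sorted by absolute strength."""
       correlations = result.get("statistical_analysis", {}).get("correlations", [])
       significant = [
           c for c in correlations
           if c.get("p_value") is not None and c["p_value"] < alpha
       ]
       return sorted(significant,
                     key=lambda c: abs(c["correlation_coefficient"]),
                     reverse=True)


   # Hand-written sample shaped like the schema (not real output)
   sample_result = {
       "statistical_analysis": {
           "correlations": [
               {"variable_1": "price", "variable_2": "demand",
                "correlation_coefficient": -0.72, "p_value": 0.001},
               {"variable_1": "price", "variable_2": "rating",
                "correlation_coefficient": 0.10, "p_value": 0.40},
               {"variable_1": "ads", "variable_2": "demand",
                "correlation_coefficient": 0.85, "p_value": 0.0005},
               {"variable_1": "ads", "variable_2": "rating",
                "correlation_coefficient": 0.30},
           ]
       }
   }

   for c in significant_correlations(sample_result):
       print(c["variable_1"], c["variable_2"], c["correlation_coefficient"])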

2. Advanced Visualization Schema
--------------------------------

**Use Case**: Comprehensive visualization specifications with business context

.. code-block:: json

   {
     "type": "object",
     "properties": {
       "visualization_suite": {
         "type": "object",
         "properties": {
           "title": {"type": "string"},
           "description": {"type": "string"},
           "created_date": {"type": "string", "format": "date-time"},
           "data_source": {"type": "string"},
           "target_audience": {"type": "string", "enum": ["technical", "business", "executive", "general"]}
         }
       },
       "visualizations": {
         "type": "array",
         "items": {
           "type": "object",
           "properties": {
             "id": {"type": "string"},
             "type": {
               "type": "string",
               "enum": ["line", "bar", "scatter", "histogram", "box", "heatmap", "pie", "area", "radar", "bubble", "treemap", "waterfall", "funnel", "gauge", "sankey"]
             },
             "title": {"type": "string"},
             "subtitle": {"type": "string"},
             "data_specification": {
               "type": "object",
               "properties": {
                 "x_axis": {
                   "type": "object",
                   "properties": {
                     "variable": {"type": "string"},
                     "label": {"type": "string"},
                     "data_type": {"type": "string", "enum": ["categorical", "numerical", "datetime"]},
                     "format": {"type": "string"}
                   }
                 },
                 "y_axis": {
                   "type": "object",
                   "properties": {
                     "variable": {"type": "string"},
                     "label": {"type": "string"},
                     "data_type": {"type": "string", "enum": ["categorical", "numerical", "datetime"]},
                     "format": {"type": "string"}
                   }
                 },
                 "color_by": {"type": "string"},
                 "size_by": {"type": "string"},
                 "filters": {
                   "type": "array",
                   "items": {
                     "type": "object",
                     "properties": {
                       "variable": {"type": "string"},
                       "condition": {"type": "string"},
                       "value": {"type": ["string", "number", "array"]}
                     }
                   }
                 },
                 "aggregation": {
                   "type": "object",
                   "properties": {
                     "method": {"type": "string", "enum": ["sum", "mean", "median", "count", "min", "max", "std"]},
                     "group_by": {"type": "array", "items": {"type": "string"}}
                   }
                 }
               }
             },
             "styling": {
               "type": "object",
               "properties": {
                 "color_palette": {"type": "string"},
                 "theme": {"type": "string", "enum": ["light", "dark", "corporate", "minimal"]},
                 "width": {"type": "integer"},
                 "height": {"type": "integer"},
                 "interactive": {"type": "boolean"},
                 "annotations": {
                   "type": "array",
                   "items": {
                     "type": "object",
                     "properties": {
                       "type": {"type": "string", "enum": ["text", "arrow", "line", "rectangle"]},
                       "text": {"type": "string"},
                       "position": {"type": "object"},
                       "style": {"type": "object"}
                     }
                   }
                 }
               }
             },
             "insights": {
               "type": "array",
               "items": {
                 "type": "object",
                 "properties": {
                   "insight": {"type": "string"},
                   "type": {"type": "string", "enum": ["trend", "anomaly", "comparison", "distribution", "correlation"]},
                   "confidence": {"type": "string", "enum": ["low", "medium", "high"]},
                   "business_relevance": {"type": "string"}
                 }
               }
             },
             "implementation": {
               "type": "object",
               "properties": {
                 "python_code": {"type": "string"},
                 "libraries_required": {"type": "array", "items": {"type": "string"}},
                 "file_outputs": {"type": "array", "items": {"type": "string"}},
                 "estimated_runtime": {"type": "string"}
               }
             }
           },
           "required": ["type", "title", "data_specification"]
         }
       },
       "dashboard_layout": {
         "type": "object",
         "properties": {
           "grid_layout": {
             "type": "array",
             "items": {
               "type": "object",
               "properties": {
                 "visualization_id": {"type": "string"},
                 "row": {"type": "integer"},
                 "column": {"type": "integer"},
                 "width": {"type": "integer"},
                 "height": {"type": "integer"}
               }
             }
           },
           "narrative_flow": {
             "type": "array",
             "items": {
               "type": "object",
               "properties": {
                 "section": {"type": "string"},
                 "description": {"type": "string"},
                 "visualizations": {"type": "array", "items": {"type": "string"}},
                 "key_message": {"type": "string"}
               }
             }
           }
         }
       }
     },
     "required": ["visualization_suite", "visualizations"]
   }
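
A JSON schema cannot express cross-references, so a result can match the schema above while ``dashboard_layout.grid_layout`` points at a ``visualization_id`` that no visualization declares. The sketch below is one possible post-processing check, assuming the result is already a dict. The function name ``check_visualization_output`` and the sample data are hypothetical.

.. code-block:: python

   REQUIRED_VIZ_KEYS = ("type", "title", "data_specification")


   def check_visualization_output(result):
       """Return a list of human-readable problems found in a result dict."""
       problems = []
       known_ids = set()
       for index, viz in enumerate(result.get("visualizations", [])):
           # These keys are listed under "required" in the schema
           for key in REQUIRED_VIZ_KEYS:
               if key not in viz:
                   problems.append(f"visualizations[{index}] is missing '{key}'")
           if "id" in viz:
               known_ids.add(viz["id"])

       # Every grid cell should reference a declared visualization id
       grid = result.get("dashboard_layout", {}).get("grid_layout", [])
       for cell in grid:
           ref = cell.get("visualization_id")
           if ref not in known_ids:
               problems.append(f"grid_layout references unknown visualization '{ref}'")
       return problems


   # Hand-written sample with deliberate problems (not real output)
   sample_result = {
       "visualizations": [
           {"id": "rev", "type": "line", "title": "Revenue", "data_specification": {}},
           {"id": "share", "type": "pie"},
       ],
       "dashboard_layout": {
           "grid_layout": [
               {"visualization_id": "rev", "row": 1, "column": 1},
               {"visualization_id": "roi", "row": 1, "column": 2},
           ]
       },
   }

   for problem in check_visualization_output(sample_result):
       print(problem)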

3. Research Synthesis Schema
----------------------------

**Use Case**: Academic and scientific research synthesis with literature integration

.. code-block:: json

   {
     "type": "object",
     "properties": {
       "research_synthesis": {
         "type": "object",
         "properties": {
           "study_title": {"type": "string"},
           "research_question": {"type": "string"},
           "methodology": {"type": "string"},
           "synthesis_date": {"type": "string", "format": "date-time"},
           "researcher": {"type": "string"}
         },
         "required": ["study_title", "research_question"]
       },
       "literature_review": {
         "type": "object",
         "properties": {
           "sources_analyzed": {
             "type": "array",
             "items": {
               "type": "object",
               "properties": {
                 "title": {"type": "string"},
                 "authors": {"type": "array", "items": {"type": "string"}},
                 "year": {"type": "integer"},
                 "journal": {"type": "string"},
                 "doi": {"type": "string"},
                 "relevance_score": {"type": "number", "minimum": 1, "maximum": 5},
                 "quality_score": {"type": "number", "minimum": 1, "maximum": 5},
                 "key_findings": {"type": "array", "items": {"type": "string"}}
               }
             }
           },
           "methodological_consensus": {
             "type": "object",
             "properties": {
               "common_approaches": {"type": "array", "items": {"type": "string"}},
               "validated_methods": {"type": "array", "items": {"type": "string"}},
               "methodological_gaps": {"type": "array", "items": {"type": "string"}},
               "best_practices": {"type": "array", "items": {"type": "string"}}
             }
           },
           "theoretical_framework": {
             "type": "object",
             "properties": {
               "established_theories": {"type": "array", "items": {"type": "string"}},
               "emerging_concepts": {"type": "array", "items": {"type": "string"}},
               "contradicting_findings": {"type": "array", "items": {"type": "string"}},
               "research_gaps": {"type": "array", "items": {"type": "string"}}
             }
           }
         }
       },
       "experimental_analysis": {
         "type": "object",
         "properties": {
           "study_design": {
             "type": "object",
             "properties": {
               "type": {"type": "string", "enum": ["experimental", "observational", "cross-sectional", "longitudinal", "case-control", "cohort"]},
               "sample_size": {"type": "integer"},
               "groups": {"type": "array", "items": {"type": "string"}},
               "variables": {
                 "type": "object",
                 "properties": {
                   "independent": {"type": "array", "items": {"type": "string"}},
                   "dependent": {"type": "array", "items": {"type": "string"}},
                   "confounding": {"type": "array", "items": {"type": "string"}}
                 }
               }
             }
           },
           "statistical_results": {
             "type": "object",
             "properties": {
               "primary_outcomes": {
                 "type": "array",
                 "items": {
                   "type": "object",
                   "properties": {
                     "outcome": {"type": "string"},
                     "measurement": {"type": "string"},
                     "result": {"type": "number"},
                     "confidence_interval": {
                       "type": "object",
                       "properties": {
                         "lower": {"type": "number"},
                         "upper": {"type": "number"},
                         "level": {"type": "number", "default": 0.95}
                       }
                     },
                     "p_value": {"type": "number"},
                     "effect_size": {"type": "number"},
                     "clinical_significance": {"type": "string"}
                   }
                 }
               },
               "secondary_outcomes": {
                 "type": "array",
                 "items": {
                   "type": "object",
                   "properties": {
                     "outcome": {"type": "string"},
                     "result": {"type": "number"},
                     "significance": {"type": "string"}
                   }
                 }
               },
               "subgroup_analyses": {
                 "type": "array",
                 "items": {
                   "type": "object",
                   "properties": {
                     "subgroup": {"type": "string"},
                     "results": {"type": "object"},
                     "interaction_p_value": {"type": "number"}
                   }
                 }
               }
             }
           }
         }
       },
       "synthesis_conclusions": {
         "type": "object",
         "properties": {
           "key_findings": {
             "type": "array",
             "items": {
               "type": "object",
               "properties": {
                 "finding": {"type": "string"},
                 "evidence_strength": {"type": "string", "enum": ["weak", "moderate", "strong", "very_strong"]},
                 "consistency_across_studies": {"type": "string", "enum": ["inconsistent", "somewhat_consistent", "consistent"]},
                 "literature_support": {"type": "string"},
                 "novel_contribution": {"type": "boolean"}
               }
             }
           },
           "limitations": {
             "type": "array",
             "items": {
               "type": "object",
               "properties": {
                 "limitation": {"type": "string"},
                 "impact": {"type": "string", "enum": ["minor", "moderate", "major"]},
                 "mitigation": {"type": "string"}
               }
             }
           },
           "implications": {
             "type": "object",
             "properties": {
               "theoretical": {"type": "array", "items": {"type": "string"}},
               "practical": {"type": "array", "items": {"type": "string"}},
               "clinical": {"type": "array", "items": {"type": "string"}},
               "policy": {"type": "array", "items": {"type": "string"}}
             }
           },
           "future_research": {
             "type": "array",
             "items": {
               "type": "object",
               "properties": {
                 "direction": {"type": "string"},
                 "priority": {"type": "string", "enum": ["low", "medium", "high"]},
                 "methodology": {"type": "string"},
                 "expected_impact": {"type": "string"}
               }
             }
           }
         }
       }
     },
     "required": ["research_synthesis", "synthesis_conclusions"]
   }

### 4. Business Intelligence Schema

**Use Case**: Market analysis and business intelligence reporting

.. code-block:: json

   {
     "type": "object",
     "properties": {
       "business_intelligence_report": {
         "type": "object",
         "properties": {
           "report_title": {"type": "string"},
           "analysis_period": {
             "type": "object",
             "properties": {
               "start_date": {"type": "string", "format": "date"},
               "end_date": {"type": "string", "format": "date"}
             }
           },
           "analyst": {"type": "string"},
           "report_date": {"type": "string", "format": "date-time"},
           "executive_summary": {"type": "string"},
           "key_metrics": {
             "type": "object",
             "patternProperties": {
               "^[a-zA-Z_][a-zA-Z0-9_]*$": {
                 "type": "object",
                 "properties": {
                   "value": {"type": "number"},
                   "unit": {"type": "string"},
                   "change_from_previous": {"type": "number"},
                   "trend": {"type": "string", "enum": ["increasing", "decreasing", "stable"]},
                   "target": {"type": "number"},
                   "performance_vs_target": {"type": "string"}
                 }
               }
             }
           }
         },
         "required": ["report_title", "analysis_period", "executive_summary"]
       },
       "market_analysis": {
         "type": "object",
         "properties": {
           "market_size": {
             "type": "object",
             "properties": {
               "total_addressable_market": {"type": "number"},
               "serviceable_addressable_market": {"type": "number"},
               "serviceable_obtainable_market": {"type": "number"},
               "currency": {"type": "string", "default": "USD"},
               "growth_rate": {"type": "number"},
               "forecast_period": {"type": "string"}
             }
           },
           "competitive_landscape": {
             "type": "object",
             "properties": {
               "market_leaders": {
                 "type": "array",
                 "items": {
                   "type": "object",
                   "properties": {
                     "company": {"type": "string"},
                     "market_share": {"type": "number", "minimum": 0, "maximum": 100},
                     "strengths": {"type": "array", "items": {"type": "string"}},
                     "weaknesses": {"type": "array", "items": {"type": "string"}},
                     "recent_developments": {"type": "array", "items": {"type": "string"}}
                   }
                 }
               },
               "our_position": {
                 "type": "object",
                 "properties": {
                   "market_share": {"type": "number"},
                   "rank": {"type": "integer"},
                   "competitive_advantages": {"type": "array", "items": {"type": "string"}},
                   "areas_for_improvement": {"type": "array", "items": {"type": "string"}}
                 }
               }
             }
           },
           "market_trends": {
             "type": "array",
             "items": {
               "type": "object",
               "properties": {
                 "trend": {"type": "string"},
                 "impact": {"type": "string", "enum": ["positive", "negative", "neutral"]},
                 "magnitude": {"type": "string", "enum": ["low", "medium", "high"]},
                 "timeline": {"type": "string"},
                 "implications": {"type": "string"}
               }
             }
           },
           "customer_insights": {
             "type": "object",
             "properties": {
               "segments": {
                 "type": "array",
                 "items": {
                   "type": "object",
                   "properties": {
                     "segment_name": {"type": "string"},
                     "size": {"type": "number"},
                     "growth_rate": {"type": "number"},
                     "key_characteristics": {"type": "array", "items": {"type": "string"}},
                     "pain_points": {"type": "array", "items": {"type": "string"}},
                     "preferences": {"type": "array", "items": {"type": "string"}}
                   }
                 }
               },
               "behavior_changes": {
                 "type": "array",
                 "items": {
                   "type": "object",
                   "properties": {
                     "change": {"type": "string"},
                     "drivers": {"type": "array", "items": {"type": "string"}},
                     "business_impact": {"type": "string"}
                   }
                 }
               }
             }
           }
         }
       },
       "performance_analysis": {
         "type": "object",
         "properties": {
           "financial_performance": {
             "type": "object",
             "properties": {
               "revenue": {
                 "type": "object",
                 "properties": {
                   "current_period": {"type": "number"},
                   "previous_period": {"type": "number"},
                   "year_over_year_growth": {"type": "number"},
                   "by_segment": {"type": "object"},
                   "by_geography": {"type": "object"}
                 }
               },
               "profitability": {
                 "type": "object",
                 "properties": {
                   "gross_margin": {"type": "number"},
                   "operating_margin": {"type": "number"},
                   "net_margin": {"type": "number"},
                   "margin_trends": {"type": "array", "items": {"type": "string"}}
                 }
               },
               "key_ratios": {
                 "type": "object",
                 "properties": {
                   "current_ratio": {"type": "number"},
                   "debt_to_equity": {"type": "number"},
                   "return_on_assets": {"type": "number"},
                   "return_on_equity": {"type": "number"}
                 }
               }
             }
           },
           "operational_performance": {
             "type": "object",
             "properties": {
               "efficiency_metrics": {"type": "object"},
               "quality_metrics": {"type": "object"},
               "customer_satisfaction": {"type": "object"},
               "employee_metrics": {"type": "object"}
             }
           }
         }
       },
       "strategic_recommendations": {
         "type": "array",
         "items": {
           "type": "object",
           "properties": {
             "category": {"type": "string", "enum": ["growth", "efficiency", "competitive", "innovation", "risk_mitigation"]},
             "recommendation": {"type": "string"},
             "rationale": {"type": "string"},
             "expected_impact": {"type": "string"},
             "implementation_timeline": {"type": "string"},
             "resources_required": {"type": "string"},
             "success_metrics": {"type": "array", "items": {"type": "string"}},
             "risks": {"type": "array", "items": {"type": "string"}},
             "priority": {"type": "string", "enum": ["low", "medium", "high", "critical"]}
           },
           "required": ["recommendation", "rationale", "priority"]
         }
       },
       "risk_assessment": {
         "type": "object",
         "properties": {
           "identified_risks": {
             "type": "array",
             "items": {
               "type": "object",
               "properties": {
                 "risk": {"type": "string"},
                 "category": {"type": "string", "enum": ["market", "operational", "financial", "regulatory", "technological", "competitive"]},
                 "probability": {"type": "string", "enum": ["low", "medium", "high"]},
                 "impact": {"type": "string", "enum": ["low", "medium", "high"]},
                 "mitigation_strategies": {"type": "array", "items": {"type": "string"}},
                 "monitoring_indicators": {"type": "array", "items": {"type": "string"}}
               }
             }
           },
           "risk_matrix": {
             "type": "object",
             "properties": {
               "high_probability_high_impact": {"type": "array", "items": {"type": "string"}},
               "high_probability_low_impact": {"type": "array", "items": {"type": "string"}},
               "low_probability_high_impact": {"type": "array", "items": {"type": "string"}},
               "low_probability_low_impact": {"type": "array", "items": {"type": "string"}}
             }
           }
         }
       }
     },
     "required": ["business_intelligence_report", "strategic_recommendations"]
   }

### 5. Quick Analysis Schema

**Use Case**: Rapid analysis with essential insights (lightweight version)

.. code-block:: json

   {
     "type": "object",
     "properties": {
       "quick_analysis": {
         "type": "object",
         "properties": {
           "dataset": {"type": "string"},
           "analysis_date": {"type": "string", "format": "date-time"},
           "key_metrics": {"type": "object"},
           "top_insights": {
             "type": "array",
             "maxItems": 5,
             "items": {"type": "string"}
           },
           "red_flags": {
             "type": "array",
             "maxItems": 3,
             "items": {"type": "string"}
           },
           "immediate_actions": {
             "type": "array",
             "maxItems": 3,
             "items": {"type": "string"}
           }
         },
         "required": ["dataset", "top_insights"]
       }
     }
   }
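
The ``maxItems`` limits in this schema can also be checked after the fact in downstream code. The following is a minimal sketch, assuming the result has already been loaded into a dictionary; ``over_limit`` is a hypothetical helper, and the ``output`` value is placeholder data rather than real model output.

.. code-block:: python

   def over_limit(schema, output, section="quick_analysis"):
       """Return {field: (actual, allowed)} for arrays longer than maxItems."""
       spec = schema["properties"][section]["properties"]
       data = output.get(section, {})
       problems = {}
       for name, field in spec.items():
           limit = field.get("maxItems")
           values = data.get(name)
           if limit is not None and isinstance(values, list) and len(values) > limit:
               problems[name] = (len(values), limit)
       return problems


   # Trimmed copy of the schema above, kept short for the example
   schema = {
       "type": "object",
       "properties": {
           "quick_analysis": {
               "type": "object",
               "properties": {
                   "dataset": {"type": "string"},
                   "top_insights": {"type": "array", "maxItems": 5, "items": {"type": "string"}},
                   "red_flags": {"type": "array", "maxItems": 3, "items": {"type": "string"}},
               },
           }
       },
   }

   # Placeholder output for illustration only
   output = {
       "quick_analysis": {
           "dataset": "example.csv",
           "top_insights": ["a", "b"],
           "red_flags": ["w", "x", "y", "z"],
       }
   }

   print(over_limit(schema, output))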

## Schema Usage Guidelines

### Customization Tips

1. **Remove unnecessary fields** for simpler analyses
2. **Add domain-specific properties** (e.g., medical, financial, engineering fields)
3. **Adjust enum values** to match your business terminology
4. **Modify validation rules** (min/max values, required fields) based on your data
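
Tips 1 and 3 can be applied programmatically rather than by hand-editing JSON. The following is a minimal sketch, assuming the schema is a plain dictionary loaded from JSON and that only top-level properties are changed; ``customize_schema`` and its parameters are hypothetical names, not part of ostruct.

.. code-block:: python

   import copy


   def customize_schema(schema, drop_fields=(), enum_overrides=None):
       """Return a copy of an object schema with fields removed and enums replaced.

       Hypothetical helper: operates only on the top-level ``properties``.
       """
       result = copy.deepcopy(schema)
       properties = result.get("properties", {})

       # Tip 1: remove fields that the analysis does not need
       for name in drop_fields:
           properties.pop(name, None)
       if "required" in result:
           result["required"] = [n for n in result["required"] if n not in drop_fields]

       # Tip 3: swap enum values for your own terminology
       for name, values in (enum_overrides or {}).items():
           if name in properties:
               properties[name]["enum"] = list(values)

       return result


   base = {
       "type": "object",
       "properties": {
           "trend": {"type": "string", "enum": ["increasing", "decreasing", "stable"]},
           "target": {"type": "number"},
           "value": {"type": "number"},
       },
       "required": ["value", "target"],
   }

   slim = customize_schema(
       base,
       drop_fields=["target"],
       enum_overrides={"trend": ["up", "down", "flat"]},
   )
   print(sorted(slim["properties"]))
   print(slim["required"])

Working on a deep copy keeps the original template schema intact, so several variants can be derived from one base file.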

### Schema Selection Guide

- **Comprehensive Data Analysis**: Full statistical analysis with business context
- **Advanced Visualization**: Complex dashboards and chart specifications
- **Research Synthesis**: Academic or scientific research projects
- **Business Intelligence**: Market analysis and strategic planning
- **Quick Analysis**: Rapid insights for daily operations

### Best Practices

1. **Always include metadata** (analyst, date, data source) for traceability
2. **Use consistent field naming** across your organization's schemas
3. **Include confidence levels** for insights and recommendations
4. **Provide clear descriptions** in schema properties for AI model understanding
5. **Validate outputs** against schemas to ensure consistency
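
Practice 4 is easy to audit automatically. The following is a minimal sketch that lists properties lacking a ``description`` keyword, assuming a schema built from nested ``object`` and ``array`` types like the ones in this guide; ``missing_descriptions`` is a hypothetical helper name.

.. code-block:: python

   def missing_descriptions(schema, path=""):
       """Yield dotted paths of properties that have no ``description`` keyword."""
       for name, sub in schema.get("properties", {}).items():
           here = f"{path}.{name}" if path else name
           if "description" not in sub:
               yield here
           # Recurse into nested objects and into array item schemas
           if sub.get("type") == "object":
               yield from missing_descriptions(sub, here)
           elif sub.get("type") == "array" and isinstance(sub.get("items"), dict):
               yield from missing_descriptions(sub["items"], here + "[]")


   schema = {
       "type": "object",
       "properties": {
           "executive_summary": {
               "type": "string",
               "description": "2-3 sentence summary of financial health",
           },
           "risk_factors": {
               "type": "array",
               "items": {
                   "type": "object",
                   "properties": {"severity": {"type": "string"}},
               },
           },
       },
   }

   print(list(missing_descriptions(schema)))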

Practical Examples and Use Cases
=================================

Financial Data Analysis
-----------------------

**Scenario**: Analyze quarterly financial data with market context

.. code-block:: bash

   ostruct run financial_analysis.j2 financial_schema.json \
     --file ci:financials quarterly_report.csv \
     --file fs:industry industry_benchmarks.pdf \
     --enable-tool web-search \
     --model gpt-4o

**Key Features**:

- Automated ratio calculations via Code Interpreter
- Benchmark comparisons via File Search
- Current market conditions via Web Search
- Structured output for further processing

Scientific Research Synthesis
-----------------------------

**Scenario**: Combine experimental data with literature review

.. code-block:: bash

   ostruct run research_synthesis.j2 research_schema.json \
     --file ci:results experimental_data.csv \
     --dir fs:literature "papers/" \
     --enable-tool web-search \
     --model gpt-4o

**Workflow**:

1. Statistical analysis of experimental results
2. Literature context from paper database
3. Current research trends from web search
4. Synthesized conclusions with citations

Market Research Automation
---------------------------

**Scenario**: Automated market intelligence reports

.. code-block:: bash

   ostruct run market_intel.j2 market_schema.json \
     --file ci:sales_data current_sales.csv \
     --file fs:reports competitor_analysis.pdf \
     --enable-tool web-search \
     --ws-context-size comprehensive \
     --model gpt-4o

**Output**: Structured market intelligence report with quantitative metrics, competitive analysis, and current market trends.

Token Management for Large Datasets
====================================

Best Practices
--------------

**Chunking Large Files:**

.. code-block:: python

   # Split large datasets for processing
   import subprocess

   import pandas as pd

   # Read large dataset
   df = pd.read_csv('large_dataset.csv')

   # Process in chunks
   chunk_size = 1000
   for i in range(0, len(df), chunk_size):
       chunk = df[i:i+chunk_size]
       chunk.to_csv(f'chunk_{i//chunk_size}.csv', index=False)

       # Process each chunk
       subprocess.run([
           'ostruct', 'run', 'analysis.j2', 'schema.json',
           '--file', 'ci:data', f'chunk_{i//chunk_size}.csv',
           '--output-file', f'results_{i//chunk_size}.json'
       ])
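
Each chunk produces its own result file, so the per-chunk outputs usually need to be combined afterwards. The following is a minimal sketch, assuming the ``results_<n>.json`` naming used above and one JSON object per file; ``merge_chunk_results`` is a hypothetical helper, and how the objects should be aggregated depends on your schema.

.. code-block:: python

   import json
   import re
   from pathlib import Path


   def merge_chunk_results(directory="."):
       """Load results_<n>.json files in numeric chunk order into one list."""
       pattern = re.compile(r"results_(\d+)\.json$")
       numbered = []
       for path in Path(directory).glob("results_*.json"):
           match = pattern.search(path.name)
           if match:
               numbered.append((int(match.group(1)), path))

       merged = []
       # Sort numerically so chunk 10 follows chunk 9 rather than chunk 1
       for index, path in sorted(numbered):
           with open(path, "r", encoding="utf-8") as f:
               merged.append({"chunk": index, "result": json.load(f)})
       return merged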

**Dry Run for Token Estimation:**

.. code-block:: bash

   # Preview prompt and token counts without API cost
   ostruct run analysis.j2 schema.json \
     --file ci:data large_dataset.csv \
     --dry-run

   # This shows the full expanded prompt and token count
   # Use this to optimize before making expensive API calls

   # Use token-efficient models for large datasets
   ostruct run analysis.j2 schema.json \
     --file ci:data large_dataset.csv \
     --model gpt-4o-mini  # More cost-effective for large inputs

Error Handling and Troubleshooting
===================================

Known Issues
------------

**File Search Empty Results (Current Bug):**

File Search may return empty results despite successful vector store creation. This is a known upstream OpenAI API issue affecting all models.

**Workarounds:**

- **Fallback to Code Interpreter:** Route documents to ``ci:`` for programmatic parsing
- **Direct prompt inclusion:** Use ``prompt:`` routing for smaller documents that fit in context
- **Hybrid approach:** Combine manual document parsing with web search for current information

.. code-block:: bash

   # If File Search fails, try Code Interpreter parsing
   ostruct run analysis.j2 schema.json \
     --file ci:docs research_paper.pdf \
     --enable-tool web-search \
     --model gpt-4o

Common Issues
-------------

**Binary File Access Errors:**

.. code-block:: jinja

   {# Handle mixed file types gracefully #}
   {% for file in dataset %}
     {% if file.extension in ['csv', 'txt', 'json'] %}
       {{ file.content }}
     {% else %}
       File: {{ file.name }} ({{ file.size }} bytes, binary - use Code Interpreter for analysis)
     {% endif %}
   {% endfor %}

**Token Limit Errors:**

.. code-block:: bash

   # Use summary approach for large files
   ostruct run summarize_first.j2 summary_schema.json \
     --file ci:data large_file.csv \
     --max-output-tokens 4000

**Schema Validation Failures:**

.. code-block:: python

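   # Validate the schema itself, then test a sample output against it
   import json

   import jsonschema

   with open('schema.json', 'r') as f:
       schema = json.load(f)

   # Check that the schema is well-formed JSON Schema
   try:
       jsonschema.Draft7Validator.check_schema(schema)
       print("Schema is valid")
   except jsonschema.SchemaError as e:
       print(f"Schema error: {e.message}")

   # Test with sample data
   sample_output = {"test": "data"}
   try:
       jsonschema.validate(sample_output, schema)
       print("Sample output matches the schema")
   except jsonschema.ValidationError as e:
       print(f"Sample output does not match the schema: {e.message}")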

Performance Optimization
========================

Efficient Workflows
-------------------

**Parallel Processing:**

.. code-block:: python

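   import concurrent.futures
   import subprocess
   from pathlib import Path

   def process_file(filename):
       # Name the output after the input without its extension
       # (data1.csv -> results_data1.json)
       output_file = f'results_{Path(filename).stem}.json'
       return subprocess.run([
           'ostruct', 'run', 'analysis.j2', 'schema.json',
           '--file', 'ci:data', filename,
           '--output-file', output_file
       ], capture_output=True)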

   # Process multiple files in parallel
   files = ['data1.csv', 'data2.csv', 'data3.csv']
   with concurrent.futures.ThreadPoolExecutor() as executor:
       results = list(executor.map(process_file, files))
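
``subprocess.run`` does not raise on a non-zero exit code unless ``check=True`` is passed, so the ``results`` list can contain failed runs. The following is a minimal sketch for separating them; ``split_results`` is a hypothetical helper name, and the demo values stand in for real process results.

.. code-block:: python

   import subprocess


   def split_results(files, results):
       """Pair each input file with its CompletedProcess and split by exit code."""
       succeeded, failed = [], []
       for filename, proc in zip(files, results):
           if proc.returncode == 0:
               succeeded.append(filename)
           else:
               # stderr is bytes because capture_output=True was used without text=True
               message = (proc.stderr or b"").decode("utf-8", errors="replace").strip()
               failed.append((filename, proc.returncode, message))
       return succeeded, failed


   # Stand-in results for illustration; in practice pass the list from executor.map
   demo_files = ["data1.csv", "data2.csv"]
   demo_results = [
       subprocess.CompletedProcess(args=["ostruct"], returncode=0, stdout=b"", stderr=b""),
       subprocess.CompletedProcess(args=["ostruct"], returncode=1, stdout=b"", stderr=b"boom\n"),
   ]
   ok, bad = split_results(demo_files, demo_results)
   print(ok)
   print(bad)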

**Model Selection for Different Tasks:**

.. code-block:: bash

   # Use appropriate models for different complexity levels

   # Simple extraction - use efficient model
   ostruct run extract.j2 schema.json --model gpt-4o-mini

   # Complex analysis - use powerful model
   ostruct run complex_analysis.j2 schema.json --model gpt-4o

   # Reasoning tasks - use reasoning model
   ostruct run reasoning.j2 schema.json --model o1-preview
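
When ostruct is driven from Python, the same choice can be expressed as a lookup table. The following is a sketch using the model names and the ``--model`` flag from the commands above; the task labels and ``build_command`` are hypothetical.

.. code-block:: python

   MODEL_BY_TASK = {
       "extraction": "gpt-4o-mini",  # simple extraction
       "analysis": "gpt-4o",         # complex analysis
       "reasoning": "o1-preview",    # reasoning tasks
   }


   def build_command(template, schema, task):
       """Build an ostruct argument list with the model chosen for the task."""
       try:
           model = MODEL_BY_TASK[task]
       except KeyError:
           raise ValueError(f"Unknown task type: {task!r}") from None
       return ["ostruct", "run", template, schema, "--model", model]


   print(build_command("extract.j2", "schema.json", "extraction"))

The returned list can be passed directly to ``subprocess.run``.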

Practical Examples and Use Cases
===================================

This section walks through complete workflows for data science applications. Each one shows the command, the template, and the schema together.

Financial Data Analysis Workflow
---------------------------------

**Scenario:** Automated quarterly financial analysis combining market data, company reports, and regulatory filings.

**Complete Workflow:**

**Step 1: Market Data Collection and Analysis**

.. code-block:: bash

   # Template: financial_analysis.j2
   ostruct run financial_analysis.j2 financial_schema.json \
     --file ci:quarterly_data quarterly_data.xlsx \
     --enable-tool web-search \
     --web-query "{{company_name}} Q3 2024 earnings market reaction analysis" \
     --model gpt-4o

**Template (financial_analysis.j2):**

.. code-block:: jinja

   ## Financial Analysis for {{company_name}} - {{quarter}}

   ### Market Data Analysis
   Analyze the following financial data and provide comprehensive insights:

   **Raw Data:**
   {{ quarterly_data.content }}

   **Market Context (from web search):**
   {% if web_search_results %}
   {{ web_search_results }}
   {% endif %}

   ### Analysis Requirements:
   1. **Performance Metrics**: Calculate key ratios (ROE, EBITDA margin, debt-to-equity)
   2. **Trend Analysis**: Compare with previous 4 quarters
   3. **Market Position**: Benchmark against industry peers
   4. **Risk Assessment**: Identify potential financial risks
   5. **Growth Projection**: Forecast next quarter based on current trends

   ### Regulatory Compliance Check:
   Review all metrics against SEC disclosure requirements and flag any concerning trends.

**Schema (financial_schema.json):**

.. code-block:: json

   {
     "type": "object",
     "properties": {
       "executive_summary": {
         "type": "string",
         "description": "2-3 sentence summary of financial health"
       },
       "key_metrics": {
         "type": "object",
         "properties": {
           "revenue": {"type": "number"},
           "net_income": {"type": "number"},
           "roe": {"type": "number"},
           "ebitda_margin": {"type": "number"},
           "debt_to_equity": {"type": "number"}
         },
         "required": ["revenue", "net_income", "roe"]
       },
       "trend_analysis": {
         "type": "object",
         "properties": {
           "revenue_growth": {"type": "number"},
           "profit_margin_trend": {"type": "string"},
           "quarter_over_quarter_change": {"type": "number"}
         }
       },
       "risk_factors": {
         "type": "array",
         "items": {
           "type": "object",
           "properties": {
             "risk_type": {"type": "string"},
             "severity": {"type": "string", "enum": ["low", "medium", "high", "critical"]},
             "description": {"type": "string"},
             "mitigation_suggestions": {"type": "string"}
           },
           "required": ["risk_type", "severity", "description"]
         }
       },
       "growth_forecast": {
         "type": "object",
         "properties": {
           "next_quarter_revenue_estimate": {"type": "number"},
           "confidence_level": {"type": "string", "enum": ["low", "medium", "high"]},
           "key_assumptions": {"type": "array", "items": {"type": "string"}}
         }
       },
       "compliance_flags": {
         "type": "array",
         "items": {
           "type": "object",
           "properties": {
             "regulation": {"type": "string"},
             "status": {"type": "string", "enum": ["compliant", "attention_required", "violation"]},
             "details": {"type": "string"}
           }
         }
       }
     },
     "required": ["executive_summary", "key_metrics", "risk_factors"]
   }
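
The ratio fields in ``key_metrics`` can be cross-checked against figures you compute yourself. The following is a minimal sketch, assuming common textbook definitions: ROE as net income over shareholders' equity, EBITDA margin as EBITDA over revenue, and debt-to-equity as total debt over shareholders' equity. The input figures are made-up illustration values, and your reporting conventions may define these ratios differently.

.. code-block:: python

   def financial_ratios(revenue, net_income, ebitda, total_debt, equity):
       """Compute the ratios named in the schema from raw statement figures."""
       if revenue == 0 or equity == 0:
           raise ValueError("revenue and equity must be non-zero")
       return {
           "roe": net_income / equity,
           "ebitda_margin": ebitda / revenue,
           "debt_to_equity": total_debt / equity,
       }


   def differs(reported, computed, tolerance=0.01):
       """Return names of metrics whose reported value is off by more than tolerance."""
       return [
           name
           for name, value in computed.items()
           if name in reported and abs(reported[name] - value) > tolerance
       ]


   # Illustrative figures only (same currency unit throughout)
   computed = financial_ratios(
       revenue=1000.0, net_income=120.0, ebitda=250.0, total_debt=400.0, equity=800.0
   )
   # Stand-in for the key_metrics object returned by the model
   reported = {"roe": 0.15, "ebitda_margin": 0.25, "debt_to_equity": 0.8}
   print(computed)
   print(differs(reported, computed))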

**Step 2: Report Generation and Visualization**

.. code-block:: bash

   # Generate presentation-ready report
   ostruct run financial_report.j2 report_schema.json \
     --file prompt:previous_analysis results.json \
     --enable-tool code-interpreter \
     --model gpt-4o

Scientific Research Synthesis
-----------------------------

**Scenario:** Automated literature review combining research papers, recent publications, and domain expert opinions.

**Complete Workflow:**

**Step 1: Multi-Source Research Collection**

.. code-block:: bash

   # Combine local papers with latest research
   ostruct run research_synthesis.j2 research_schema.json \
     --dir fs:research_papers research_papers/ \
     --enable-tool web-search \
     --web-query "{{research_topic}} 2024 latest findings systematic review" \
     --model o1-preview

**Template (research_synthesis.j2):**

.. code-block:: jinja

   ## Comprehensive Research Synthesis: {{research_topic}}

   ### Local Research Papers Analysis
   {% for paper in research_papers %}
   **Paper:** {{ paper.name }}
   **Content:** {{ paper.content if paper.size < 50000 else "Large paper - analyze key sections" }}
   {% endfor %}

   ### Latest Web Research
   {% if web_search_results %}
   **Recent Findings:**
   {{ web_search_results }}
   {% endif %}

   ### Synthesis Requirements:
   1. **Literature Gap Analysis**: Identify research gaps and contradictions
   2. **Methodology Comparison**: Compare approaches across studies
   3. **Evidence Quality Assessment**: Rate evidence strength using GRADE criteria
   4. **Emerging Trends**: Identify novel approaches and future directions
   5. **Practical Applications**: Translate findings to actionable insights

   ### Meta-Analysis Elements:
   - Sample sizes and statistical power across studies
   - Effect sizes and confidence intervals
   - Heterogeneity assessment
   - Publication bias evaluation

**Schema (research_schema.json):**

.. code-block:: json

   {
     "type": "object",
     "properties": {
       "research_summary": {
         "type": "string",
         "description": "Comprehensive 3-4 sentence summary of current state"
       },
       "key_findings": {
         "type": "array",
         "items": {
           "type": "object",
           "properties": {
             "finding": {"type": "string"},
             "evidence_level": {"type": "string", "enum": ["strong", "moderate", "weak", "insufficient"]},
             "supporting_studies": {"type": "array", "items": {"type": "string"}},
             "contradictory_evidence": {"type": "array", "items": {"type": "string"}}
           },
           "required": ["finding", "evidence_level"]
         }
       },
       "methodology_analysis": {
         "type": "object",
         "properties": {
           "dominant_approaches": {"type": "array", "items": {"type": "string"}},
           "emerging_methods": {"type": "array", "items": {"type": "string"}},
           "methodological_limitations": {"type": "array", "items": {"type": "string"}}
         }
       },
       "research_gaps": {
         "type": "array",
         "items": {
           "type": "object",
           "properties": {
             "gap_description": {"type": "string"},
             "importance": {"type": "string", "enum": ["critical", "important", "moderate", "minor"]},
             "suggested_approaches": {"type": "array", "items": {"type": "string"}}
           },
           "required": ["gap_description", "importance"]
         }
       },
       "practical_implications": {
         "type": "array",
         "items": {
           "type": "object",
           "properties": {
             "domain": {"type": "string"},
             "implication": {"type": "string"},
             "confidence_level": {"type": "string", "enum": ["high", "medium", "low"]},
             "implementation_complexity": {"type": "string", "enum": ["simple", "moderate", "complex"]}
           }
         }
       },
       "future_directions": {
         "type": "array",
         "items": {
           "type": "object",
           "properties": {
             "direction": {"type": "string"},
             "priority": {"type": "string", "enum": ["high", "medium", "low"]},
             "feasibility": {"type": "string", "enum": ["high", "medium", "low"]},
             "potential_impact": {"type": "string", "enum": ["transformative", "significant", "incremental"]}
           }
         }
       }
     },
     "required": ["research_summary", "key_findings", "research_gaps"]
   }
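
Because the enums in this schema are ordered categories, downstream code can rank results directly. The following is a sketch that sorts ``research_gaps`` by importance and keeps only findings with strong or moderate evidence; the function names are hypothetical, and the ``output`` dictionary is placeholder data for illustration, not real model output.

.. code-block:: python

   IMPORTANCE_ORDER = ["critical", "important", "moderate", "minor"]
   STRONG_EVIDENCE = {"strong", "moderate"}


   def rank_gaps(output):
       """Sort research gaps from most to least important."""
       return sorted(
           output.get("research_gaps", []),
           key=lambda gap: IMPORTANCE_ORDER.index(gap["importance"]),
       )


   def supported_findings(output):
       """Keep findings whose evidence level is strong or moderate."""
       return [
           item["finding"]
           for item in output.get("key_findings", [])
           if item["evidence_level"] in STRONG_EVIDENCE
       ]


   # Placeholder output for illustration only
   output = {
       "research_summary": "placeholder summary",
       "key_findings": [
           {"finding": "finding A", "evidence_level": "weak"},
           {"finding": "finding B", "evidence_level": "strong"},
       ],
       "research_gaps": [
           {"gap_description": "gap 1", "importance": "minor"},
           {"gap_description": "gap 2", "importance": "critical"},
       ],
   }

   print([gap["gap_description"] for gap in rank_gaps(output)])
   print(supported_findings(output))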

Business Intelligence Report Generation
---------------------------------------

**Scenario:** Automated competitive analysis combining internal sales data, market reports, and real-time competitor monitoring.

**Complete Workflow:**

**Step 1: Multi-Source Business Intelligence**

.. code-block:: bash

   # Comprehensive competitive analysis
   ostruct run bi_analysis.j2 bi_schema.json \
     --file ci:sales_data sales_data.xlsx \
     --dir fs:market_reports market_reports/ \
     --enable-tool web-search \
     --web-query "{{competitor_name}} Q4 2024 market share pricing strategy" \
     --model gpt-4o

**Template (bi_analysis.j2):**

.. code-block:: jinja

   ## Business Intelligence Report - {{analysis_period}}

   ### Internal Performance Analysis
   **Sales Data:**
   {{ sales_data.content }}

   ### Market Context Analysis
   **Market Reports:**
   {% for report in market_reports %}
   **Report:** {{ report.name }}
   {{ report.content if report.size < 30000 else "Large report - focus on executive summary and key metrics" }}
   {% endfor %}

   ### Competitive Intelligence
   {% if web_search_results %}
   **Latest Competitor Activity:**
   {{ web_search_results }}
   {% endif %}

   ### Analysis Requirements:
   1. **Market Position**: Our position vs competitors across key metrics
   2. **Growth Opportunities**: Untapped segments and expansion possibilities
   3. **Competitive Threats**: Emerging competitors and market disruptions
   4. **Pricing Analysis**: Price positioning and optimization opportunities
   5. **Strategic Recommendations**: Actionable next steps with ROI projections

   ### Executive Briefing Elements:
   - Top 3 strategic priorities
   - Revenue impact projections
   - Resource requirements
   - Timeline for implementation

**Schema (bi_schema.json):**

.. code-block:: json

   {
     "type": "object",
     "properties": {
       "executive_summary": {
         "type": "string",
         "description": "CEO-ready 2-3 sentence summary of strategic position"
       },
       "market_position": {
         "type": "object",
         "properties": {
           "market_share": {"type": "number"},
           "competitive_ranking": {"type": "integer"},
           "differentiation_strengths": {"type": "array", "items": {"type": "string"}},
           "competitive_gaps": {"type": "array", "items": {"type": "string"}}
         }
       },
       "growth_opportunities": {
         "type": "array",
         "items": {
           "type": "object",
           "properties": {
             "opportunity": {"type": "string"},
             "market_size": {"type": "number"},
             "revenue_potential": {"type": "number"},
             "time_to_market": {"type": "string"},
             "investment_required": {"type": "number"},
             "risk_level": {"type": "string", "enum": ["low", "medium", "high"]}
           },
           "required": ["opportunity", "revenue_potential", "risk_level"]
         }
       },
       "competitive_threats": {
         "type": "array",
         "items": {
           "type": "object",
           "properties": {
             "threat_source": {"type": "string"},
             "threat_type": {"type": "string"},
             "severity": {"type": "string", "enum": ["low", "medium", "high", "critical"]},
             "timeline": {"type": "string"},
             "mitigation_strategies": {"type": "array", "items": {"type": "string"}}
           }
         }
       },
       "strategic_recommendations": {
         "type": "array",
         "items": {
           "type": "object",
           "properties": {
             "recommendation": {"type": "string"},
             "priority": {"type": "string", "enum": ["critical", "high", "medium", "low"]},
             "expected_roi": {"type": "number"},
             "implementation_timeline": {"type": "string"},
             "resource_requirements": {"type": "array", "items": {"type": "string"}},
             "success_metrics": {"type": "array", "items": {"type": "string"}}
           },
           "required": ["recommendation", "priority", "expected_roi"]
         }
       },
       "pricing_analysis": {
         "type": "object",
         "properties": {
           "current_positioning": {"type": "string"},
           "competitor_comparison": {"type": "array", "items": {"type": "object"}},
           "optimization_opportunities": {"type": "array", "items": {"type": "string"}},
           "revenue_impact_estimate": {"type": "number"}
         }
       }
     },
     "required": ["executive_summary", "market_position", "growth_opportunities", "strategic_recommendations"]
   }
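
The numeric fields in ``growth_opportunities`` lend themselves to simple screening once the report is loaded. The following is a minimal sketch that ranks opportunities by revenue potential per unit of investment and filters by risk level. ``rank_opportunities`` is a hypothetical helper, the ratio is only one possible ranking criterion, and the sample values are placeholders.

.. code-block:: python

   def rank_opportunities(output, max_risk="medium"):
       """Rank growth opportunities by revenue potential per unit of investment."""
       risk_order = ["low", "medium", "high"]
       allowed = set(risk_order[: risk_order.index(max_risk) + 1])

       ranked = []
       for item in output.get("growth_opportunities", []):
           investment = item.get("investment_required")
           # investment_required is optional in the schema; skip items without it
           if item["risk_level"] not in allowed or not investment:
               continue
           ranked.append((item["revenue_potential"] / investment, item["opportunity"]))
       return [name for _, name in sorted(ranked, reverse=True)]


   # Placeholder output for illustration only
   output = {
       "growth_opportunities": [
           {"opportunity": "segment_b", "revenue_potential": 900, "investment_required": 300, "risk_level": "medium"},
           {"opportunity": "segment_a", "revenue_potential": 500, "investment_required": 100, "risk_level": "low"},
           {"opportunity": "segment_c", "revenue_potential": 2000, "investment_required": 100, "risk_level": "high"},
           {"opportunity": "segment_d", "revenue_potential": 400, "risk_level": "low"},
       ]
   }

   print(rank_opportunities(output))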

Market Research Automation
---------------------------

**Scenario:** Automated market entry analysis combining demographic data, competitor landscape, and regulatory environment.

**Complete Workflow:**

**Step 1: Comprehensive Market Entry Analysis**

.. code-block:: bash

   # Complete market entry assessment
   ostruct run market_entry.j2 market_schema.json \
     --file ci:demographic_data demographic_data.csv \
     --dir fs:regulatory_requirements regulatory_requirements/ \
     --enable-tool web-search \
     --web-query "{{target_market}} {{industry}} market entry barriers regulatory requirements 2024" \
     --model gpt-4o

**Template (market_entry.j2):**

.. code-block:: jinja

   ## Market Entry Analysis: {{target_market}} - {{industry}}

   ### Demographic and Market Data
   **Market Demographics:**
   {{ data.content }}

   ### Regulatory Environment
   **Regulatory Requirements:**
   {% for doc in docs %}
   **Document:** {{ doc.name }}
   {{ doc.content if doc.size < 40000 else "Large regulatory document - focus on key compliance requirements" }}
   {% endfor %}

   ### Competitive Landscape Research
   {% if web_search_results %}
   **Current Market Intelligence:**
   {{ web_search_results }}
   {% endif %}

   ### Analysis Framework:
   1. **Market Attractiveness**: Size, growth, profitability assessment
   2. **Competitive Intensity**: Porter's Five Forces analysis
   3. **Entry Barriers**: Regulatory, financial, operational obstacles
   4. **Go-to-Market Strategy**: Channel analysis and market penetration approach
   5. **Financial Projections**: Revenue forecasts and investment requirements
   6. **Risk Assessment**: Market, operational, and regulatory risks

   ### Decision Framework:
   Provide a clear GO/NO-GO recommendation with supporting rationale and alternative strategies.

**Schema (market_schema.json):**

.. code-block:: json

   {
     "type": "object",
     "properties": {
       "market_attractiveness": {
         "type": "object",
         "properties": {
           "market_size_usd": {"type": "number"},
           "growth_rate": {"type": "number"},
           "profit_margin_potential": {"type": "number"},
           "market_maturity": {"type": "string", "enum": ["emerging", "growth", "mature", "declining"]},
           "attractiveness_score": {"type": "integer", "minimum": 1, "maximum": 10}
         },
         "required": ["market_size_usd", "growth_rate", "attractiveness_score"]
       },
       "competitive_analysis": {
         "type": "object",
         "properties": {
           "market_concentration": {"type": "string", "enum": ["fragmented", "moderate", "concentrated", "monopolistic"]},
           "key_competitors": {
             "type": "array",
             "items": {
               "type": "object",
               "properties": {
                 "company": {"type": "string"},
                 "market_share": {"type": "number"},
                 "competitive_advantages": {"type": "array", "items": {"type": "string"}},
                 "vulnerabilities": {"type": "array", "items": {"type": "string"}}
               }
             }
           },
           "competitive_intensity": {"type": "integer", "minimum": 1, "maximum": 5}
         }
       },
       "entry_barriers": {
         "type": "array",
         "items": {
           "type": "object",
           "properties": {
             "barrier_type": {"type": "string"},
             "severity": {"type": "string", "enum": ["low", "medium", "high", "prohibitive"]},
             "description": {"type": "string"},
             "mitigation_strategies": {"type": "array", "items": {"type": "string"}},
             "estimated_cost": {"type": "number"}
           },
           "required": ["barrier_type", "severity", "description"]
         }
       },
       "go_to_market_strategy": {
         "type": "object",
         "properties": {
           "recommended_channels": {"type": "array", "items": {"type": "string"}},
           "market_penetration_approach": {"type": "string"},
           "customer_acquisition_strategy": {"type": "string"},
           "pricing_strategy": {"type": "string"},
           "marketing_budget_estimate": {"type": "number"}
         }
       },
       "financial_projections": {
         "type": "object",
         "properties": {
           "year_1_revenue": {"type": "number"},
           "year_3_revenue": {"type": "number"},
           "break_even_timeline": {"type": "string"},
           "initial_investment_required": {"type": "number"},
           "roi_projection": {"type": "number"}
         },
         "required": ["year_1_revenue", "initial_investment_required"]
       },
       "risk_assessment": {
         "type": "array",
         "items": {
           "type": "object",
           "properties": {
             "risk_category": {"type": "string"},
             "risk_description": {"type": "string"},
             "likelihood": {"type": "string", "enum": ["low", "medium", "high"]},
             "impact": {"type": "string", "enum": ["low", "medium", "high", "critical"]},
             "mitigation_plan": {"type": "string"}
           }
         }
       },
       "recommendation": {
         "type": "object",
         "properties": {
           "decision": {"type": "string", "enum": ["go", "no_go", "conditional_go", "delayed_entry"]},
           "confidence_level": {"type": "string", "enum": ["low", "medium", "high"]},
           "key_rationale": {"type": "array", "items": {"type": "string"}},
           "next_steps": {"type": "array", "items": {"type": "string"}},
           "alternative_strategies": {"type": "array", "items": {"type": "string"}}
         },
         "required": ["decision", "confidence_level", "key_rationale"]
       }
     },
     "required": ["market_attractiveness", "competitive_analysis", "entry_barriers", "recommendation"]
   }
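
Because ``likelihood`` and ``impact`` are enums, downstream code can map them to ordinal scores without any text parsing. The following is a minimal sketch, not part of ostruct itself: the numeric weights, the ``rank_risks`` helper, and the sample records are illustrative assumptions. It also assumes every record carries both fields, which the schema above does not mark as required.

.. code-block:: python

   # Ordinal weights for the schema's enum values (the weights are an assumption)
   LIKELIHOOD = {"low": 1, "medium": 2, "high": 3}
   IMPACT = {"low": 1, "medium": 2, "high": 3, "critical": 4}


   def rank_risks(risk_assessment):
       """Return risks sorted by likelihood x impact, highest score first."""
       scored = []
       for risk in risk_assessment:
           score = LIKELIHOOD[risk["likelihood"]] * IMPACT[risk["impact"]]
           scored.append({**risk, "score": score})
       return sorted(scored, key=lambda r: r["score"], reverse=True)


   # Hypothetical records shaped like the "risk_assessment" array in market_schema.json
   sample_risks = [
       {"risk_category": "regulatory", "likelihood": "medium", "impact": "critical"},
       {"risk_category": "operational", "likelihood": "high", "impact": "low"},
       {"risk_category": "market", "likelihood": "high", "impact": "high"},
   ]

   for risk in rank_risks(sample_risks):
       print(f'{risk["risk_category"]}: {risk["score"]}')

Keeping the scoring outside the prompt means the weights can be tuned later without re-running the analysis.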

**Step 2: Action Plan Generation**

.. code-block:: bash

   # Generate implementation roadmap
   ostruct run implementation_plan.j2 plan_schema.json \
     --file prompt:analysis market_entry_results.json \
     --model gpt-4o
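
The two steps can also be chained from Python. This is a minimal sketch that uses only ``ostruct run`` options already shown in this guide (``--file``, ``--model``, ``--output-file``). The ``build_command`` and ``run_pipeline`` helpers and the ``implementation_plan.json`` output name are hypothetical, and Step 1's web-search options are omitted for brevity.

.. code-block:: python

   import shlex
   import subprocess


   def build_command(template, schema, files, output_file, model="gpt-4o"):
       """Assemble an `ostruct run` argument list from (target:alias, path) pairs."""
       cmd = ["ostruct", "run", template, schema]
       for target_alias, path in files:
           cmd += ["--file", target_alias, path]
       cmd += ["--model", model, "--output-file", output_file]
       return cmd


   def run_pipeline(stages):
       """Run stages in order; check=True stops the chain at the first failure."""
       for cmd in stages:
           subprocess.run(cmd, check=True)


   stages = [
       # Step 1: market entry analysis, saved for the next stage
       build_command(
           "market_entry.j2", "market_schema.json",
           [("ci:data", "demographic_data.csv"), ("fs:docs", "regulatory_requirements/")],
           "market_entry_results.json",
       ),
       # Step 2: implementation roadmap built from the Step 1 output
       build_command(
           "implementation_plan.j2", "plan_schema.json",
           [("prompt:analysis", "market_entry_results.json")],
           "implementation_plan.json",
       ),
   ]

   # Preview the commands; call run_pipeline(stages) to execute them
   # (requires ostruct on PATH and OpenAI API credentials)
   for cmd in stages:
       print(shlex.join(cmd))

Stopping at the first failed stage keeps Step 2 from running against a missing or stale results file.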

Best Practices for Complex Workflows
-------------------------------------

**Template Design Principles:**

1. **Structured Instructions**: Provide clear, numbered requirements
2. **Context Awareness**: Handle missing data gracefully
3. **Progressive Disclosure**: Start broad, then drill into specifics
4. **Error Resilience**: Include fallback strategies for data issues

**Schema Design Principles:**

1. **Business-Ready Output**: Structure matches decision-making needs
2. **Validation Built-In**: Use enums and constraints for data quality
3. **Extensible Design**: Allow for future requirement additions
4. **Confidence Indicators**: Include certainty levels for AI outputs

**Workflow Orchestration:**

1. **Multi-Stage Processing**: Break complex analysis into digestible stages
2. **Tool Selection**: Match tool capabilities to data types and complexity
3. **Quality Gates**: Validate intermediate outputs before final processing
4. **Documentation**: Maintain audit trail of analysis steps
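
A quality gate can be as simple as checking an intermediate result against the parts of the schema the next stage depends on. The sketch below is a deliberately small stand-in for a full JSON Schema validator: it only checks top-level ``required`` keys and ``enum`` values one level down. The ``quality_gate`` helper, the trimmed schema, and the sample result are illustrative assumptions.

.. code-block:: python

   def quality_gate(result, schema):
       """Return a list of problems found in `result`; an empty list means pass."""
       problems = []
       # Every top-level required key must be present
       for key in schema.get("required", []):
           if key not in result:
               problems.append(f"missing required key: {key}")
       # Enum-constrained fields of nested objects must hold an allowed value
       for key, spec in schema.get("properties", {}).items():
           value = result.get(key)
           if not isinstance(value, dict):
               continue
           for field, field_spec in spec.get("properties", {}).items():
               allowed = field_spec.get("enum")
               if allowed and field in value and value[field] not in allowed:
                   problems.append(f"{key}.{field}: {value[field]!r} not in {allowed}")
       return problems


   # Trimmed-down version of the "recommendation" part of market_schema.json
   schema = {
       "type": "object",
       "properties": {
           "recommendation": {
               "type": "object",
               "properties": {
                   "decision": {"type": "string", "enum": ["go", "no_go", "conditional_go", "delayed_entry"]},
                   "confidence_level": {"type": "string", "enum": ["low", "medium", "high"]},
               },
           }
       },
       "required": ["market_attractiveness", "recommendation"],
   }

   # Hypothetical intermediate output with two defects
   result = {"recommendation": {"decision": "maybe", "confidence_level": "high"}}

   problems = quality_gate(result, schema)
   for problem in problems:
       print(problem)

Returning a list of problems rather than raising on the first one gives the audit trail a complete record of why a stage was rejected.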

Integration with Data Science Tools
====================================

Pandas Integration
------------------

.. code-block:: python

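   import json
   import subprocess
   import tempfile
   from pathlib import Path

   import pandas as pd

   # Process DataFrame with ostruct
   def analyze_dataframe(df, analysis_template, schema_file):
       # Work in a temporary directory under the current directory so
       # repeated runs don't overwrite each other's files
       with tempfile.TemporaryDirectory(dir='.') as tmp_dir:
           data_file = Path(tmp_dir) / 'data.csv'
           results_file = Path(tmp_dir) / 'results.json'
           df.to_csv(data_file, index=False)

           # Run ostruct analysis
           result = subprocess.run([
               'ostruct', 'run', analysis_template, schema_file,
               '--file', 'ci:data', str(data_file),
               '--output-file', str(results_file)
           ], capture_output=True, text=True)

           # Fail loudly if the CLI reported an error
           if result.returncode != 0:
               raise RuntimeError(f'ostruct failed: {result.stderr.strip()}')

           # Load results before the temporary directory is cleaned up
           with open(results_file, 'r') as f:
               return json.load(f)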

   # Example usage
   df = pd.read_csv('sales_data.csv')
   insights = analyze_dataframe(df, 'sales_analysis.j2', 'sales_schema.json')
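
Once a result is loaded, list-valued fields convert directly to DataFrames. This is a minimal sketch, assuming the analysis returned an array shaped like ``strategic_recommendations`` in the schema earlier in this guide; the sample records are invented for illustration.

.. code-block:: python

   import pandas as pd

   # Hypothetical records shaped like the "strategic_recommendations" array
   recommendations = [
       {"recommendation": "Expand reseller channel", "priority": "medium", "expected_roi": 1.8},
       {"recommendation": "Fix checkout drop-off", "priority": "critical", "expected_roi": 3.2},
       {"recommendation": "Localize pricing page", "priority": "high", "expected_roi": 2.1},
   ]

   recs = pd.DataFrame(recommendations)

   # An ordered categorical makes sorting follow the schema's enum order
   priority_order = ["critical", "high", "medium", "low"]
   recs["priority"] = pd.Categorical(recs["priority"], categories=priority_order, ordered=True)

   # Most urgent first; ties broken by highest expected ROI
   recs = recs.sort_values(["priority", "expected_roi"], ascending=[True, False]).reset_index(drop=True)
   print(recs[["priority", "recommendation", "expected_roi"]])

Because the schema restricts ``priority`` to a fixed enum, the category list can be copied straight from the schema instead of being inferred from the data.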

Matplotlib/Seaborn Integration
------------------------------

.. code-block:: python

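   from pathlib import Path

   import matplotlib.pyplot as plt

   # Template asking ostruct for visualization specifications.
   # The front matter starts on the first line (no leading blank line).
   viz_template = '''---
   system_prompt: You are a data visualization expert. Generate matplotlib/seaborn code specifications.
   ---
   Create visualization specifications for this dataset:
   {{ data.content }}

   Generate specifications for the most insightful charts to show patterns, distributions, and relationships.
   '''

   # Save the template so the CLI can read it
   Path('viz_template.j2').write_text(viz_template)

   # Use ostruct to generate viz specs (reuses analyze_dataframe and df from the Pandas example)
   viz_specs = analyze_dataframe(df, 'viz_template.j2', 'viz_schema.json')

   # Execute generated visualization code.
   # exec() runs arbitrary model-generated code: review it first and
   # only run it in a trusted, isolated environment.
   for viz in viz_specs['visualizations']:
       exec(viz['matplotlib_code'], {'df': df, 'plt': plt})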

Next Steps
==========

**Getting Started:**

1. Set up ostruct in your notebook environment
2. Try the basic data extraction example
3. Experiment with multi-tool workflows
4. Adapt schemas for your specific use cases

**Advanced Usage:**

- Explore the :doc:`template_guide` for complex template patterns
- See :doc:`tool_integration` for multi-tool coordination
- Check :doc:`cli_reference` for all available options

See Also
========

- :doc:`template_guide` - Comprehensive template creation guide
- :doc:`tool_integration` - Multi-tool integration patterns
- :doc:`cli_reference` - Complete command-line reference
- :doc:`quickstart` - General getting started guide