Data Science Integration Guide

What is ostruct? A schema-first CLI that renders Jinja2 templates locally in a sandbox, then sends the resulting prompt + JSON schema to OpenAI’s Structured Outputs endpoint for guaranteed valid JSON responses. Perfect for data science workflows requiring reliable, structured analysis outputs.

Learn how to leverage ostruct for data science workflows, including Jupyter/Colab integration, multi-source analysis, and visualization generation. This guide covers everything from basic data extraction to complex research synthesis workflows.

Note

This guide focuses on data science use cases. For general template usage, see the Template Guide. For tool integration basics, see Multi-Tool Integration.

Tip

Quick Start: Jump to Jupyter/Colab Integration if you want to start using ostruct in Jupyter notebooks immediately.

Overview

ostruct excels at transforming unstructured data into structured insights, making it perfect for data science workflows where you need to:

  • Extract structured data from diverse sources (CSV, PDFs, web pages, APIs)

  • Combine quantitative analysis with qualitative research

  • Generate consistent, validated output schemas for downstream processing

  • Integrate AI-powered analysis into existing data pipelines

Key Benefits for Data Science

Schema-First Reliability

Every output matches your defined JSON schema, eliminating parsing errors and ensuring consistent data structures for analysis.
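As a minimal illustration of what schema conformance means for downstream code, the sketch below checks a parsed result against the required keys and top-level types of a small schema, using only the standard library. The helper name `missing_or_mistyped` and the sample schema are hypothetical and not part of ostruct; a full validator such as the `jsonschema` package covers far more of the JSON Schema specification.

```python
import json

# Hypothetical schema, shaped like the ones used later in this guide.
schema = {
    "type": "object",
    "properties": {
        "insights": {"type": "array", "items": {"type": "string"}},
        "summary": {"type": "string"},
    },
    "required": ["insights", "summary"],
}

# Maps JSON Schema type names to the Python types json.loads produces.
JSON_TYPES = {"object": dict, "array": list, "string": str,
              "number": (int, float), "boolean": bool}


def missing_or_mistyped(result, schema):
    """Return a list of problems; an empty list means the basic checks pass."""
    problems = []
    for key in schema.get("required", []):
        if key not in result:
            problems.append(f"missing required key: {key}")
    for key, spec in schema.get("properties", {}).items():
        expected = JSON_TYPES.get(spec.get("type"))
        if key in result and expected and not isinstance(result[key], expected):
            problems.append(f"wrong type for {key}: expected {spec['type']}")
    return problems


good = json.loads('{"insights": ["Sales peaked in February"], "summary": "ok"}')
bad = json.loads('{"insights": "not a list"}')
print(missing_or_mistyped(good, schema))
print(missing_or_mistyped(bad, schema))
```

A check like this is only a safety net for hand-edited or cached files; it inspects top-level keys and does not descend into nested objects or array items.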

Multi-Tool Orchestration

Combine Code Interpreter (Python execution), File Search (document analysis), and Web Search (current data) in a single workflow. Note: --tool-choice auto (default) lets the model decide when to use tools; use --tool-choice required to force tool usage.
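When ostruct is driven from Python, it can help to assemble the command in one place so the tool-choice setting is explicit. The following is a sketch only: `build_ostruct_command` is a hypothetical helper, and it uses just the flags that appear in this guide (`--file`, `--enable-tool`, `--tool-choice`, `--model`, `--output-file`).

```python
import shlex


def build_ostruct_command(template, schema, files, model="gpt-4o",
                          tool_choice=None, enable_tools=(), output_file=None):
    """Assemble an argv list for `ostruct run` (hypothetical helper).

    `files` is a list of (target, path) pairs such as ("ci:data", "sales.csv").
    """
    # Only the two values named in this guide are accepted here.
    if tool_choice not in (None, "auto", "required"):
        raise ValueError(f"unsupported tool choice: {tool_choice}")
    cmd = ["ostruct", "run", template, schema]
    for target, path in files:
        cmd += ["--file", target, path]
    for tool in enable_tools:
        cmd += ["--enable-tool", tool]
    if tool_choice is not None:
        cmd += ["--tool-choice", tool_choice]
    cmd += ["--model", model]
    if output_file:
        cmd += ["--output-file", output_file]
    return cmd


cmd = build_ostruct_command(
    "analysis.j2", "schema.json",
    files=[("ci:data", "sales.csv")],
    tool_choice="required",
    output_file="results.json",
)
print(shlex.join(cmd))  # shell-quoted form, handy for logging
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting problems with file names that contain spaces.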

Notebook Integration

Works seamlessly in Jupyter, Colab, and other notebook environments with proper token management and output formatting.

Crucial Limitations
  • Binary files cannot be accessed in templates - they must be routed to Code Interpreter (ci:) or user-data (ud:)

  • File size limits apply based on OSTRUCT_TEMPLATE_FILE_LIMIT environment variable

  • Internet access in Code Interpreter may be limited depending on OpenAI’s current restrictions
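One way to respect the first limitation in a pipeline is to pick the routing target before calling ostruct. The sketch below guesses whether a file is binary by looking for a NUL byte in its first few kilobytes. Both the helper name `choose_target` and the heuristic are assumptions for illustration, not ostruct behavior; the `prompt:` and `ci:` targets are the ones used elsewhere in this guide.

```python
import os
import tempfile


def choose_target(path, alias, sniff_bytes=8192):
    """Guess a --file target: 'ci:<alias>' for likely-binary files, else 'prompt:<alias>'.

    Heuristic (an assumption): a NUL byte in the first few kilobytes
    almost never occurs in ordinary UTF-8 or ASCII text files.
    """
    with open(path, "rb") as f:
        chunk = f.read(sniff_bytes)
    prefix = "ci" if b"\x00" in chunk else "prompt"
    return f"{prefix}:{alias}"


# Demonstrate on two throwaway files
with tempfile.TemporaryDirectory() as tmp:
    text_path = os.path.join(tmp, "notes.txt")
    binary_path = os.path.join(tmp, "blob.bin")
    with open(text_path, "w") as f:
        f.write("Sales: Jan=100, Feb=150, Mar=120\n")
    with open(binary_path, "wb") as f:
        f.write(b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR")
    text_target = choose_target(text_path, "notes")
    binary_target = choose_target(binary_path, "blob")

print(text_target, binary_target)
```

Text saved as UTF-16 also contains NUL bytes and would be sent to Code Interpreter by this check, which is a safe default.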

Reproducible Workflows

Template-based approach ensures consistent analysis across different datasets and team members.

Jupyter/Colab Integration

Setting Up ostruct in Notebooks

Installation in Jupyter/Colab:

# Install ostruct (in a notebook cell, prefix shell commands with "!" or use %pip)
pip install ostruct-cli

# For enhanced file type detection (recommended for data science);
# the quotes keep shells such as zsh from expanding the brackets
pip install "ostruct-cli[enhanced-detection]"

# Verify installation
ostruct --version

# Set up OpenAI API key in Python
import os
os.environ['OPENAI_API_KEY'] = 'your-api-key-here'  # placeholder; never commit a real key

30-Second Working Example:

# Create simple template
echo "Analyze this data: {{ data.content }}" > analyze.j2

# Create schema
echo '{"type":"object","properties":{"insights":{"type":"array","items":{"type":"string"}}}}' > schema.json

# Create sample data
echo "Sales: Jan=100, Feb=150, Mar=120" > data.txt

# Run analysis
ostruct run analyze.j2 schema.json --file prompt:data data.txt --model gpt-4o-mini

Example output (the structure is fixed by the schema; the wording will vary between runs):

{
  "insights": [
    "Sales peaked in February with 150 units",
    "March saw a 20% decline from February",
    "Overall trend shows growth from Jan to Feb, then decline"
  ]
}
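Model-written insights can contain arithmetic, so it is worth recomputing any figure you plan to rely on. A minimal check of the month-over-month changes for the sample data above (the variable names are illustrative):

```python
sales = {"Jan": 100, "Feb": 150, "Mar": 120}

# Pair each month with the one before it and compute the percentage change
months = list(sales)
changes = {}
for prev, cur in zip(months, months[1:]):
    changes[cur] = (sales[cur] - sales[prev]) / sales[prev] * 100

for month, pct in changes.items():
    print(f"{month}: {pct:+.1f}%")
# Feb: +50.0%
# Mar: -20.0%
```

The 20% decline quoted in the example output matches the recomputed value for March.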

Basic Notebook Workflow:

# Create a simple data extraction template
template_content = '''
---
system_prompt: You are an expert data analyst. Extract key metrics and insights.
---
Analyze this dataset and extract the key findings:

{{ data.content }}

Focus on:
1. Summary statistics
2. Notable patterns or trends
3. Data quality issues
4. Recommendations for further analysis
'''

# Write template to file (strip the leading newline so the front matter starts on line 1)
with open('data_analysis.j2', 'w') as f:
    f.write(template_content.lstrip())

# Define output schema
schema = {
    "type": "object",
    "properties": {
        "summary_stats": {
            "type": "object",
            "description": "Key summary statistics"
        },
        "patterns": {
            "type": "array",
            "items": {"type": "string"},
            "description": "Notable patterns or trends found"
        },
        "data_quality": {
            "type": "array",
            "items": {"type": "string"},
            "description": "Data quality issues identified"
        },
        "recommendations": {
            "type": "array",
            "items": {"type": "string"},
            "description": "Recommendations for further analysis"
        }
    },
    "required": ["summary_stats", "patterns", "data_quality", "recommendations"]
}

import json
with open('analysis_schema.json', 'w') as f:
    json.dump(schema, f, indent=2)

Running Analysis in Notebooks:

# Run ostruct analysis
import subprocess
import json

# Execute ostruct command
result = subprocess.run([
    'ostruct', 'run', 'data_analysis.j2', 'analysis_schema.json',
    '--file', 'ci:data', 'your_dataset.csv',
    '--model', 'gpt-4o',
    '--output-file', 'analysis_results.json'
], capture_output=True, text=True)

# Fail loudly instead of reading a stale or missing results file
if result.returncode != 0:
    raise RuntimeError(f"ostruct failed: {result.stderr}")

# Load and display results
with open('analysis_results.json', 'r') as f:
    analysis = json.load(f)

print("Analysis Results:")
print(f"Patterns found: {len(analysis['patterns'])}")
for pattern in analysis['patterns']:
    print(f"  • {pattern}")

Interactive Jupyter Notebook Example

Explore ostruct data science workflows interactively in the accompanying Jupyter notebook.

What’s included in the notebook:

  • 6 Complete Examples: From basic analysis to advanced multi-tool workflows

  • Working Code: All examples include working templates, schemas, and data

  • Financial Analysis: Quarterly financial analysis with market context

  • Business Intelligence: Competitive analysis and strategic recommendations

  • Interactive Workflows: Dynamic analysis based on custom questions

  • Batch Processing: Production-ready patterns for multiple datasets

  • Best Practices: Performance optimization, cost management, security

Local Usage:

# Clone and run locally
git clone https://github.com/yaniv-golan/ostruct.git
cd ostruct/examples/data-science/notebooks
jupyter notebook ostruct_data_analysis.ipynb

The notebook demonstrates all the workflows described in this guide with working code you can run immediately.


Advanced Notebook Integration

Jupyter Magic Commands for ostruct:

# Create reusable magic command for ostruct
from IPython.core.magic import line_magic, Magics, magics_class
from IPython.core.magic_arguments import argument, magic_arguments, parse_argstring
import subprocess
import json

@magics_class
class OstructMagics(Magics):

    @line_magic
    @magic_arguments()
    @argument('template', help='Template file path')
    @argument('schema', help='Schema file path')
    @argument('--file', dest='data_file', required=True, help='Data file to analyze')
    @argument('--model', default='gpt-4o-mini', help='Model to use')
    def ostruct(self, line):
        """Run ostruct analysis from Jupyter cell"""
        args = parse_argstring(self.ostruct, line)

        result = subprocess.run([
            'ostruct', 'run', args.template, args.schema,
            '--file', 'ci:data', args.data_file,
            '--model', args.model,
            '--output-file', 'results.json'
        ], capture_output=True, text=True)

        if result.returncode == 0:
            with open('results.json', 'r') as f:
                return json.load(f)
        else:
            print(f"Error: {result.stderr}")
            return None

# Register the magic
get_ipython().register_magic_functions(OstructMagics)

# Usage: %ostruct analysis.j2 schema.json --file data.csv --model gpt-4o

DataFrame Integration Patterns:

import json
import os
import subprocess
import tempfile

import pandas as pd

class DataFrameAnalyzer:
    """Enhanced DataFrame analysis with ostruct integration"""

    def __init__(self, df):
        self.df = df
        self.temp_files = []

    def create_context_template(self, analysis_focus="general"):
        """Generate template with DataFrame context"""
        template = f'''
---
system_prompt: |
  You are analyzing a dataset with {len(self.df)} rows and {len(self.df.columns)} columns.
  Focus on {analysis_focus} analysis patterns.
---

Dataset Overview:
- Shape: {self.df.shape[0]} rows × {self.df.shape[1]} columns
- Columns: {", ".join(self.df.columns.tolist())}
- Data types: {dict(self.df.dtypes.astype(str))}

Sample data:
{{{{ data.content }}}}

Analysis Requirements:
1. Identify key patterns and trends
2. Assess data quality and completeness
3. Suggest follow-up analysis steps
4. Highlight any anomalies or outliers
'''
        # Strip the leading newline so the front matter starts on line 1
        return template.lstrip()

    def analyze(self, focus="general", sample_size=1000):
        """Run ostruct analysis on DataFrame"""

        # Sample large datasets
        sample_df = self.df.sample(min(sample_size, len(self.df)))

        # Create temporary files
        with tempfile.NamedTemporaryFile(mode='w', suffix='.csv', delete=False) as f:
            csv_file = f.name
            sample_df.to_csv(f, index=False)
            self.temp_files.append(csv_file)

        with tempfile.NamedTemporaryFile(mode='w', suffix='.j2', delete=False) as f:
            template_file = f.name
            f.write(self.create_context_template(focus))
            self.temp_files.append(template_file)

        # Define schema
        schema = {
            "type": "object",
            "properties": {
                "summary": {"type": "string", "description": "Overall dataset summary"},
                "patterns": {
                    "type": "array",
                    "items": {"type": "string"},
                    "description": "Key patterns identified"
                },
                "quality_issues": {
                    "type": "array",
                    "items": {"type": "string"},
                    "description": "Data quality concerns"
                },
                "recommendations": {
                    "type": "array",
                    "items": {"type": "string"},
                    "description": "Analysis recommendations"
                }
            }
        }

        with tempfile.NamedTemporaryFile(mode='w', suffix='.json', delete=False) as f:
            schema_file = f.name
            json.dump(schema, f, indent=2)
            self.temp_files.append(schema_file)

        # Run analysis
        result = subprocess.run([
            'ostruct', 'run', template_file, schema_file,
            '--file', 'ci:data', csv_file,
            '--model', 'gpt-4o-mini'
        ], capture_output=True, text=True)

        if result.returncode == 0:
            return json.loads(result.stdout)
        else:
            print(f"Analysis failed: {result.stderr}")
            return None

    def cleanup(self):
        """Clean up temporary files"""
        for file_path in self.temp_files:
            try:
                os.unlink(file_path)
            except FileNotFoundError:
                pass
        self.temp_files = []

    def __del__(self):
        self.cleanup()

# Usage example
df = pd.read_csv('sales_data.csv')
analyzer = DataFrameAnalyzer(df)
insights = analyzer.analyze(focus="sales trends")
if insights is not None:  # analyze() returns None when the CLI call fails
    print(f"Found {len(insights.get('patterns', []))} patterns")

Token Management for Large Datasets:

def smart_sample_for_analysis(df, max_tokens=8000, chars_per_token=4):
    """
    Intelligently sample DataFrame to fit within token limits
    """
    # Estimate current size
    csv_str = df.to_csv(index=False)
    estimated_tokens = len(csv_str) // chars_per_token

    if estimated_tokens <= max_tokens:
        return df

    # Calculate sample size needed
    sample_ratio = max_tokens / estimated_tokens
    sample_size = int(len(df) * sample_ratio * 0.8)  # keep a 20% safety margin

    print(f"Dataset too large ({estimated_tokens} tokens). Sampling {sample_size} rows.")

    # Stratified sampling if categorical columns exist
    categorical_cols = df.select_dtypes(include=['object']).columns
    if len(categorical_cols) > 0:
        strata = categorical_cols[0]
        # At least one row per group, so small budgets do not produce an empty sample
        per_group = max(1, sample_size // df[strata].nunique())
        sampled = [
            group.sample(min(len(group), per_group))
            for _, group in df.groupby(strata)
        ]
        return pd.concat(sampled).reset_index(drop=True)
    else:
        return df.sample(sample_size)

# Usage
large_df = pd.read_csv('large_dataset.csv')
manageable_df = smart_sample_for_analysis(large_df)
analyzer = DataFrameAnalyzer(manageable_df)
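The sizing arithmetic inside `smart_sample_for_analysis` can be tried without pandas. The sketch below reproduces only that calculation; `estimate_sample_rows` is a hypothetical helper, and the four-characters-per-token figure is a rule of thumb rather than a real tokenizer count.

```python
def estimate_sample_rows(n_rows, n_chars, max_tokens=8000,
                         chars_per_token=4, safety=0.8):
    """Rows to keep so the CSV text fits the budget (rough character-based estimate)."""
    estimated_tokens = n_chars // chars_per_token
    if estimated_tokens <= max_tokens:
        return n_rows  # already within budget, keep every row
    ratio = max_tokens / estimated_tokens
    return int(n_rows * ratio * safety)


small = estimate_sample_rows(n_rows=500, n_chars=20_000)    # 5,000 estimated tokens
large = estimate_sample_rows(n_rows=5_000, n_chars=64_000)  # 16,000 estimated tokens
print(small, large)
```

In the second call the estimate is twice the budget, so the ratio is 0.5 and the safety factor trims the result further, to roughly 2,000 of the 5,000 rows.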

Environment Variable Management:

# Secure API key management for notebooks
import os
from getpass import getpass

def setup_ostruct_environment():
    """Setup ostruct environment variables securely"""

    if 'OPENAI_API_KEY' not in os.environ:
        print("OpenAI API key not found in environment.")
        api_key = getpass("Enter your OpenAI API key: ")
        os.environ['OPENAI_API_KEY'] = api_key
        print("✓ API key set for this session")

    # Set notebook-friendly defaults
    os.environ['OSTRUCT_CACHE_UPLOADS'] = 'true'
    os.environ['OSTRUCT_TEMPLATE_FILE_LIMIT'] = '10MB'

    print("✓ ostruct environment configured")

# Run at start of notebook
setup_ostruct_environment()

Visualization Integration:

def generate_analysis_visualizations(df, analysis_results):
    """
    Generate visualizations based on ostruct analysis recommendations
    """
    import matplotlib.pyplot as plt
    import seaborn as sns

    # Extract visualization suggestions from analysis
    if 'recommendations' in analysis_results:
        viz_suggestions = [
            rec for rec in analysis_results['recommendations']
            if any(word in rec.lower() for word in ['plot', 'chart', 'graph', 'visualiz'])
        ]

        for suggestion in viz_suggestions:
            print(f"Visualization suggestion: {suggestion}")

    # Auto-generate basic plots for numeric columns
    numeric_cols = df.select_dtypes(include=['number']).columns

    if len(numeric_cols) > 0:
        fig, axes = plt.subplots(2, 2, figsize=(12, 10))

        # Distribution plot
        df[numeric_cols[0]].hist(ax=axes[0,0])
        axes[0,0].set_title(f'Distribution of {numeric_cols[0]}')

        # Correlation heatmap if multiple numeric columns
        if len(numeric_cols) > 1:
            corr_matrix = df[numeric_cols].corr()
            sns.heatmap(corr_matrix, annot=True, ax=axes[0,1])
            axes[0,1].set_title('Correlation Matrix')

        # Box plot for outlier detection
        df.boxplot(column=numeric_cols[0], ax=axes[1,0])
        axes[1,0].set_title(f'Outliers in {numeric_cols[0]}')

        # Trend over index (if meaningful)
        df[numeric_cols[0]].plot(ax=axes[1,1])
        axes[1,1].set_title(f'Trend of {numeric_cols[0]}')

        plt.tight_layout()
        plt.show()

# Usage after analysis (the function draws the plots and returns nothing)
generate_analysis_visualizations(df, insights)

Multi-Tool Data Science Workflows

Research Synthesis Workflows

Academic Research Analysis:

# Combine literature review with data analysis
ostruct run research_synthesis.j2 research_schema.json \
  --file fs:papers "*.pdf" --recursive \
  --file ci:dataset research_data.csv \
  --enable-tool web-search \
  --model gpt-4o

This workflow:

  1. Searches papers using File Search for literature context

  2. Analyzes data using Code Interpreter for statistical insights

  3. Updates with current research using Web Search

  4. Synthesizes findings into structured research output

Comprehensive Multi-Tool Workflow Patterns

CSV Analysis with Code Interpreter

Pattern 1: Enhanced Data Analysis with Visualization

# Deep CSV analysis with automated visualization generation
ostruct run csv_deep_analysis.j2 analysis_schema.json \
  --file ci:dataset sales_data.csv \
  --file ci:reference benchmark_data.csv \
  --model gpt-4o

Template (csv_deep_analysis.j2):

---
system_prompt: |
  You are a senior data analyst. Perform comprehensive analysis including
  statistical testing, visualization generation, and business insights.
---

# Comprehensive CSV Data Analysis

## Dataset Overview
Primary dataset: {{ dataset.name }} ({{ dataset.size }} bytes)
Reference dataset: {{ reference.name }} ({{ reference.size }} bytes)

## Analysis Tasks

### 1. Statistical Analysis
Load and analyze the primary dataset:
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Load data
df = pd.read_csv('{{ dataset.name }}')
ref_df = pd.read_csv('{{ reference.name }}')

# Generate comprehensive statistics
print("=== DATASET SUMMARY ===")
print(df.describe())
print(f"Dataset shape: {df.shape}")
print(f"Missing values: {df.isnull().sum().sum()}")
```

### 2. Visualization Generation
Create insightful visualizations:
```python
# Set up the plotting environment
plt.style.use('seaborn-v0_8')
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# 1. Distribution analysis
numeric_cols = df.select_dtypes(include=[np.number]).columns
if len(numeric_cols) > 0:
    df[numeric_cols[0]].hist(bins=30, ax=axes[0,0])
    axes[0,0].set_title(f'Distribution of {numeric_cols[0]}')

# 2. Correlation heatmap
if len(numeric_cols) > 1:
    corr_matrix = df[numeric_cols].corr()
    sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', ax=axes[0,1])
    axes[0,1].set_title('Correlation Matrix')

# 3. Time series or trend analysis
if 'date' in df.columns or 'timestamp' in df.columns:
    # Time series visualization logic
    pass
else:
    # Box plot for outlier detection
    if len(numeric_cols) > 0:
        df.boxplot(column=numeric_cols[0], ax=axes[1,0])
        axes[1,0].set_title(f'Outlier Analysis: {numeric_cols[0]}')

# 4. Comparative analysis with reference data
# Compare key metrics between datasets (only when the column exists in both)
if len(numeric_cols) > 0 and numeric_cols[0] in ref_df.columns:
    axes[1,1].bar(['Primary', 'Reference'],
                  [df[numeric_cols[0]].mean(), ref_df[numeric_cols[0]].mean()])
    axes[1,1].set_title('Comparative Analysis')

plt.tight_layout()
plt.savefig('comprehensive_analysis.png', dpi=300, bbox_inches='tight')
plt.show()
```

### 3. Statistical Testing
Perform significance tests:
```python
# Compare primary vs reference dataset
if len(numeric_cols) > 0 and numeric_cols[0] in ref_df.columns:
    primary_values = df[numeric_cols[0]].dropna()
    reference_values = ref_df[numeric_cols[0]].dropna()

    # T-test for mean differences
    t_stat, p_value = stats.ttest_ind(primary_values, reference_values)
    print(f"T-test results: t={t_stat:.4f}, p={p_value:.4f}")

    # Effect size (Cohen's d)
    pooled_std = np.sqrt(((len(primary_values)-1)*primary_values.var() +
                         (len(reference_values)-1)*reference_values.var()) /
                        (len(primary_values)+len(reference_values)-2))
    cohens_d = (primary_values.mean() - reference_values.mean()) / pooled_std
    print(f"Effect size (Cohen's d): {cohens_d:.4f}")
```

Provide business insights and recommendations based on the analysis.

Output Schema:

{
  "type": "object",
  "properties": {
    "dataset_summary": {
      "type": "object",
      "properties": {
        "rows": {"type": "number"},
        "columns": {"type": "number"},
        "missing_values": {"type": "number"},
        "data_types": {"type": "object"}
      }
    },
    "statistical_analysis": {
      "type": "object",
      "properties": {
        "descriptive_stats": {"type": "object"},
        "correlations": {"type": "array", "items": {"type": "object"}},
        "significance_tests": {"type": "array", "items": {"type": "object"}}
      }
    },
    "visualizations": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "filename": {"type": "string"},
          "type": {"type": "string"},
          "description": {"type": "string"},
          "insights": {"type": "array", "items": {"type": "string"}}
        }
      }
    },
    "business_insights": {
      "type": "array",
      "items": {"type": "string"}
    },
    "recommendations": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "recommendation": {"type": "string"},
          "priority": {"type": "string"},
          "rationale": {"type": "string"}
        }
      }
    }
  }
}
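The effect-size step in the template above uses the pooled sample standard deviation. To sanity-check numbers that come back from Code Interpreter, the same formula can be run locally with the standard library. The data values below are made up for illustration, and `cohens_d` is a helper name introduced here.

```python
import math
import statistics


def cohens_d(a, b):
    """Cohen's d using the pooled sample standard deviation (ddof=1, as in pandas .var())."""
    n_a, n_b = len(a), len(b)
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    pooled_std = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (statistics.mean(a) - statistics.mean(b)) / pooled_std


# Illustrative values only: both groups have sample variance 10, means 16 and 14
primary = [12.0, 14.0, 16.0, 18.0, 20.0]
reference = [10.0, 12.0, 14.0, 16.0, 18.0]
print(f"Cohen's d: {cohens_d(primary, reference):.4f}")
```

With equal variances of 10 in both groups, the pooled standard deviation is the square root of 10, so the effect size is 2 divided by that value.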

Web Research + Data Analysis Combinations

Pattern 2: Market Intelligence with Data Validation

# Combine internal sales data with current market intelligence
ostruct run market_intelligence.j2 market_schema.json \
  --file ci:sales internal_sales.csv \
  --file ci:competitor competitor_analysis.csv \
  --enable-tool web-search \
  --ws-context-size comprehensive \
  --model gpt-4o

Template (market_intelligence.j2):

---
system_prompt: |
  You are a market intelligence analyst. Use web search to gather current
  market data and validate it against internal analysis.
---

# Market Intelligence Analysis

## Internal Data Analysis

### Sales Performance Analysis
```python
import pandas as pd
import matplotlib.pyplot as plt

# Load internal data
sales_df = pd.read_csv('{{ sales.name }}')
competitor_df = pd.read_csv('{{ competitor.name }}')

# Analyze sales trends
print("=== INTERNAL SALES ANALYSIS ===")
monthly_sales = sales_df.groupby('month')['revenue'].sum()
print("Monthly revenue trends:")
print(monthly_sales.describe())

# Competitive position analysis
print("\n=== COMPETITIVE ANALYSIS ===")
print("Market share analysis:")
print(competitor_df['market_share'].describe())

# Generate comparison chart
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

monthly_sales.plot(kind='line', ax=ax1, title='Internal Sales Trend')
competitor_df['market_share'].plot(kind='bar', ax=ax2, title='Market Share Distribution')

plt.tight_layout()
plt.savefig('internal_analysis.png')
plt.show()
```

## Current Market Research

Research current market conditions and trends:
- Industry growth rates and forecasts
- Recent competitor announcements and strategy changes
- Regulatory changes affecting the market
- Consumer behavior shifts and emerging trends
- Technology disruptions in the sector

Focus your search on:
1. Market size and growth projections for {{ sales.content | extract_industry }}
2. Recent competitor activities and market positioning
3. Consumer preference shifts in the last 6 months
4. Regulatory or economic factors affecting demand

## Data Validation and Synthesis

```python
# Cross-validate web research findings with internal data
print("=== VALIDATION ANALYSIS ===")

# Check if internal trends align with market research
internal_growth = (monthly_sales.iloc[-1] - monthly_sales.iloc[0]) / monthly_sales.iloc[0] * 100
print(f"Internal growth rate: {internal_growth:.2f}%")

# This will be compared with web research findings
print("Compare this with market research growth rates above")

# Identify discrepancies and opportunities
avg_competitor_share = competitor_df['market_share'].mean()
our_estimated_share = 100 / len(competitor_df)  # Assuming equal distribution

print(f"Average competitor market share: {avg_competitor_share:.2f}%")
print(f"Our estimated position: {our_estimated_share:.2f}%")
```

Synthesize findings from internal data and web research to provide:
1. Market opportunity assessment
2. Competitive positioning recommendations
3. Strategic actions based on combined insights
4. Risk factors identified from external research

File Search + Code Interpreter for Research Synthesis

Pattern 3: Academic Literature + Data Analysis

# Combine literature review with experimental data analysis
ostruct run research_synthesis.j2 research_schema.json \
  --file fs:literature "research_papers/*.pdf" --recursive \
  --file ci:data experimental_results.csv \
  --file ci:reference baseline_data.csv \
  --enable-tool web-search \
  --model gpt-4o

Template (research_synthesis.j2):

---
system_prompt: |
  You are a research scientist. Synthesize literature findings with
  experimental data analysis to draw comprehensive conclusions.
---

# Research Synthesis Analysis

## Literature Review Summary

{% if file_search_enabled %}
Based on the research papers in your knowledge base:

{{ literature }}

Extract and summarize:
1. **Methodology consensus**: What experimental approaches are most validated?
2. **Key findings**: What are the established relationships and effects?
3. **Gaps identified**: What questions remain unanswered?
4. **Methodological considerations**: What are the best practices?
{% else %}
Note: File Search unavailable. Proceeding with data analysis and web research.
{% endif %}

## Experimental Data Analysis

### Statistical Analysis of Results
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import statsmodels.api as sm

# Load experimental data
results_df = pd.read_csv('{{ data.name }}')
baseline_df = pd.read_csv('{{ reference.name }}')

print("=== EXPERIMENTAL DATA ANALYSIS ===")
print(f"Experimental data shape: {results_df.shape}")
print(f"Baseline data shape: {baseline_df.shape}")

# Descriptive statistics
print("\nExperimental Results Summary:")
print(results_df.describe())

print("\nBaseline Summary:")
print(baseline_df.describe())

# Statistical comparisons
numeric_cols = results_df.select_dtypes(include=[np.number]).columns

for col in numeric_cols:
    if col in baseline_df.columns:
        exp_values = results_df[col].dropna()
        baseline_values = baseline_df[col].dropna()

        # T-test
        t_stat, p_val = stats.ttest_ind(exp_values, baseline_values)

        # Effect size
        pooled_std = np.sqrt(((len(exp_values)-1)*exp_values.var() +
                             (len(baseline_values)-1)*baseline_values.var()) /
                            (len(exp_values)+len(baseline_values)-2))
        effect_size = (exp_values.mean() - baseline_values.mean()) / pooled_std

        print(f"\n{col} Analysis:")
        print(f"  Experimental mean: {exp_values.mean():.4f}")
        print(f"  Baseline mean: {baseline_values.mean():.4f}")
        print(f"  T-test p-value: {p_val:.4f}")
        print(f"  Effect size: {effect_size:.4f}")
```

### Visualization of Key Findings
```python
# Create comprehensive visualization
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# 1. Comparison of means
if len(numeric_cols) > 0:
    col = numeric_cols[0]
    means = [results_df[col].mean(), baseline_df[col].mean()]
    stds = [results_df[col].std(), baseline_df[col].std()]

    x_pos = np.arange(len(['Experimental', 'Baseline']))
    axes[0,0].bar(x_pos, means, yerr=stds, capsize=5)
    axes[0,0].set_xticks(x_pos)
    axes[0,0].set_xticklabels(['Experimental', 'Baseline'])
    axes[0,0].set_title(f'Comparison of {col}')

# 2. Distribution comparison
if len(numeric_cols) > 0:
    axes[0,1].hist(results_df[numeric_cols[0]].dropna(), alpha=0.7, label='Experimental')
    axes[0,1].hist(baseline_df[numeric_cols[0]].dropna(), alpha=0.7, label='Baseline')
    axes[0,1].legend()
    axes[0,1].set_title('Distribution Comparison')

# 3. Correlation analysis
if len(numeric_cols) > 1:
    corr_matrix = results_df[numeric_cols].corr()
    sns.heatmap(corr_matrix, annot=True, ax=axes[1,0])
    axes[1,0].set_title('Experimental Data Correlations')

# 4. Trend analysis or scatter plot
if len(numeric_cols) > 1:
    axes[1,1].scatter(results_df[numeric_cols[0]], results_df[numeric_cols[1]])
    axes[1,1].set_xlabel(numeric_cols[0])
    axes[1,1].set_ylabel(numeric_cols[1])
    axes[1,1].set_title('Relationship Analysis')

plt.tight_layout()
plt.savefig('research_analysis.png', dpi=300)
plt.show()
```

## Current Research Context

Search for recent publications and developments related to:
- Latest methodological advances in this research area
- Recent findings that support or contradict our results
- Emerging theoretical frameworks
- Clinical or practical applications of similar research

## Synthesis and Conclusions

```python
print("=== RESEARCH SYNTHESIS ===")

# Summary statistics for reporting
if len(numeric_cols) > 0:
    primary_metric = numeric_cols[0]
    exp_mean = results_df[primary_metric].mean()
    baseline_mean = baseline_df[primary_metric].mean()
    improvement = ((exp_mean - baseline_mean) / baseline_mean) * 100

    print(f"Primary outcome ({primary_metric}):")
    print(f"  Experimental: {exp_mean:.4f}")
    print(f"  Baseline: {baseline_mean:.4f}")
    print(f"  Improvement: {improvement:.2f}%")

print("\nReady for synthesis with literature and current research...")
```

Provide comprehensive synthesis addressing:
1. How experimental results align with literature findings
2. Novel contributions of this research
3. Limitations and considerations based on methodological review
4. Future research directions
5. Practical implications and applications

Visualization Generation Patterns

Pattern 4: Automated Chart Generation with Business Context

# Generate contextual visualizations with business insights
ostruct run viz_generation.j2 visualization_schema.json \
  --file ci:data business_metrics.csv \
  --file ci:benchmark industry_benchmarks.csv \
  --enable-tool web-search \
  --model gpt-4o

Template (viz_generation.j2):

---
system_prompt: |
  You are a data visualization expert and business analyst. Create insightful
  visualizations that tell a compelling business story.
---

# Business Data Visualization Generation

## Data Exploration and Preparation

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
import numpy as np
from datetime import datetime

# Load data
business_df = pd.read_csv('{{ data.name }}')
benchmark_df = pd.read_csv('{{ benchmark.name }}')

print("=== DATA OVERVIEW ===")
print(f"Business data shape: {business_df.shape}")
print(f"Benchmark data shape: {benchmark_df.shape}")
print("\nBusiness data columns:", business_df.columns.tolist())
print("Benchmark data columns:", benchmark_df.columns.tolist())

# Data quality check
print("\nMissing values:")
print("Business data:", business_df.isnull().sum().sum())
print("Benchmark data:", benchmark_df.isnull().sum().sum())
```

## Visualization 1: Performance Dashboard

```python
# Create a comprehensive dashboard
fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=('Revenue Trend', 'Performance vs Benchmark',
                   'Category Breakdown', 'Growth Analysis'),
    specs=[[{"secondary_y": True}, {"type": "bar"}],
           [{"type": "pie"}, {"type": "scatter"}]]
)

# 1. Revenue trend with growth rate
if 'date' in business_df.columns and 'revenue' in business_df.columns:
    business_df['date'] = pd.to_datetime(business_df['date'])
    monthly_revenue = business_df.groupby('date')['revenue'].sum().reset_index()

    fig.add_trace(
        go.Scatter(x=monthly_revenue['date'], y=monthly_revenue['revenue'],
                  mode='lines+markers', name='Revenue'),
        row=1, col=1
    )

    # Add growth rate on secondary y-axis
    monthly_revenue['growth_rate'] = monthly_revenue['revenue'].pct_change() * 100
    fig.add_trace(
        go.Scatter(x=monthly_revenue['date'], y=monthly_revenue['growth_rate'],
                  mode='lines', name='Growth Rate %', yaxis='y2'),
        row=1, col=1, secondary_y=True
    )

# 2. Performance vs Benchmark
if 'metric' in business_df.columns and 'value' in business_df.columns:
    metrics = business_df['metric'].unique()[:5]  # Top 5 metrics
    business_values = [business_df[business_df['metric']==m]['value'].mean() for m in metrics]
    benchmark_values = [benchmark_df[benchmark_df['metric']==m]['value'].mean() for m in metrics]

    fig.add_trace(
        go.Bar(x=metrics, y=business_values, name='Our Performance'),
        row=1, col=2
    )
    fig.add_trace(
        go.Bar(x=metrics, y=benchmark_values, name='Industry Benchmark'),
        row=1, col=2
    )

# 3. Category breakdown
if 'category' in business_df.columns and 'value' in business_df.columns:
    category_totals = business_df.groupby('category')['value'].sum()
    fig.add_trace(
        go.Pie(labels=category_totals.index, values=category_totals.values,
              name="Category Distribution"),
        row=2, col=1
    )

# 4. Growth analysis scatter
if 'investment' in business_df.columns and 'return' in business_df.columns:
    fig.add_trace(
        go.Scatter(x=business_df['investment'], y=business_df['return'],
                  mode='markers', name='ROI Analysis',
                  text=business_df.get('category', ''),
                  textposition="top center"),
        row=2, col=2
    )

fig.update_layout(height=800, showlegend=True,
                  title_text="Business Performance Dashboard")
fig.write_html("business_dashboard.html")
fig.show()
```

## Visualization 2: Competitive Analysis Charts

```python
# Advanced competitive positioning
plt.style.use('seaborn-v0_8')
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# Market positioning bubble chart
if all(col in business_df.columns for col in ['market_share', 'growth_rate', 'revenue']):
    scatter = axes[0,0].scatter(business_df['market_share'],
                               business_df['growth_rate'],
                               s=business_df['revenue']/1000,  # Bubble size
                               alpha=0.6, c=range(len(business_df)),
                               cmap='viridis')
    axes[0,0].set_xlabel('Market Share (%)')
    axes[0,0].set_ylabel('Growth Rate (%)')
    axes[0,0].set_title('Market Positioning (Bubble size = Revenue)')

    # Add competitor benchmarks if available
    if all(col in benchmark_df.columns for col in ['market_share', 'growth_rate']):
        axes[0,0].scatter(benchmark_df['market_share'],
                         benchmark_df['growth_rate'],
                         marker='x', s=100, c='red', label='Competitors')
        axes[0,0].legend()

# Performance radar chart simulation
categories = ['Revenue', 'Market Share', 'Customer Satisfaction', 'Innovation', 'Efficiency']
if len([col for col in categories if col.lower().replace(' ', '_') in business_df.columns]) >= 3:
    # Create radar chart data
    our_scores = []
    benchmark_scores = []

    for category in categories:
        col_name = category.lower().replace(' ', '_')
        if col_name in business_df.columns:
            our_scores.append(business_df[col_name].mean())
            if col_name in benchmark_df.columns:
                benchmark_scores.append(benchmark_df[col_name].mean())
            else:
                # Placeholder only: with no benchmark column, assume 90% of our score
                benchmark_scores.append(our_scores[-1] * 0.9)
        else:
            our_scores.append(0)
            benchmark_scores.append(0)

    # Polar plot simulation using regular plot
    angles = np.linspace(0, 2*np.pi, len(categories), endpoint=False).tolist()
    our_scores += our_scores[:1]  # Complete the circle
    benchmark_scores += benchmark_scores[:1]
    angles += angles[:1]

    axes[0,1].plot(angles, our_scores, 'o-', linewidth=2, label='Our Performance')
    axes[0,1].fill(angles, our_scores, alpha=0.25)
    axes[0,1].plot(angles, benchmark_scores, 'o-', linewidth=2, label='Industry Average')
    axes[0,1].fill(angles, benchmark_scores, alpha=0.25)
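    # Label each spoke position with its category (the last angle repeats the first)
    axes[0,1].set_xticks(angles[:-1])
    axes[0,1].set_xticklabels(categories, rotation=45, ha='right')
    axes[0,1].set_title('Performance Radar')
    axes[0,1].legend()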

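# Trend comparison
if 'date' in business_df.columns and 'revenue' in business_df.columns:
    business_df['date'] = pd.to_datetime(business_df['date'])
    by_date = business_df.groupby('date')
    monthly_data = by_date['revenue'].sum().reset_index()
    # Sum customers when the column exists; otherwise fall back to a row count per date
    if 'customers' in business_df.columns:
        monthly_data['customers'] = by_date['customers'].sum().to_numpy()
    else:
        monthly_data['customers'] = by_date.size().to_numpy()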

    ax2_twin = axes[1,0].twinx()
    line1 = axes[1,0].plot(monthly_data['date'], monthly_data['revenue'],
                          'b-', label='Revenue')
    line2 = ax2_twin.plot(monthly_data['date'], monthly_data['customers'],
                         'r--', label='Customers')

    axes[1,0].set_xlabel('Date')
    axes[1,0].set_ylabel('Revenue', color='b')
    ax2_twin.set_ylabel('Customers', color='r')
    axes[1,0].set_title('Revenue and Customer Trends')

    # Combine legends
    lines = line1 + line2
    labels = [l.get_label() for l in lines]
    axes[1,0].legend(lines, labels, loc='upper left')

# ROI and efficiency analysis
if 'investment' in business_df.columns and 'return' in business_df.columns:
    # Treat zero-investment rows as missing ROI instead of +/-inf
    roi = (business_df['return'] - business_df['investment']) / business_df['investment'] * 100
    business_df['roi'] = roi.replace([np.inf, -np.inf], np.nan)

    # Box plot of ROI by category
    if 'category' in business_df.columns:
        roi_categories = business_df['category'].dropna().unique()
        roi_by_category = [business_df.loc[business_df['category'] == cat, 'roi'].dropna().values
                           for cat in roi_categories]
        axes[1,1].boxplot(roi_by_category)
        # Label ticks explicitly; boxplot's `labels` keyword is deprecated since Matplotlib 3.9
        axes[1,1].set_xticklabels(roi_categories, rotation=45)
        axes[1,1].set_title('ROI Distribution by Category')
        axes[1,1].set_ylabel('ROI (%)')

plt.tight_layout()
plt.savefig('competitive_analysis.png', dpi=300, bbox_inches='tight')
plt.show()
```

## Current Market Context Research

Research current market trends and industry benchmarks:
- Industry performance metrics and KPIs
- Recent market shifts and opportunities
- Competitive landscape changes
- Economic factors affecting performance

## Visualization Insights Summary

```python
print("=== VISUALIZATION INSIGHTS ===")

# Generate summary statistics for each visualization
print("Dashboard Summary:")
if 'revenue' in business_df.columns:
    total_revenue = business_df['revenue'].sum()
    avg_monthly_revenue = business_df.groupby('date')['revenue'].sum().mean() if 'date' in business_df.columns else business_df['revenue'].mean()
    print(f"  Total Revenue: ${total_revenue:,.2f}")
    print(f"  Average Monthly Revenue: ${avg_monthly_revenue:,.2f}")

print("\nPerformance vs Benchmark:")
if 'metric' in business_df.columns and 'value' in business_df.columns:
    our_avg = business_df['value'].mean()
    print(f"  Our Average Performance: {our_avg:.2f}")
    # Only report a ratio when a positive benchmark average exists
    benchmark_avg = benchmark_df['value'].mean() if 'value' in benchmark_df.columns else None
    if benchmark_avg is not None and benchmark_avg > 0:
        print(f"  Industry Average: {benchmark_avg:.2f}")
        print(f"  Performance Ratio: {our_avg / benchmark_avg:.2f}x")
    else:
        print("  Industry Average: not available")

print("\nVisualization files generated:")
print("  - business_dashboard.html (Interactive dashboard)")
print("  - competitive_analysis.png (Static analysis charts)")
```

Data Science Schema Templates

Ready-to-use JSON schema templates for common data science workflows. These schemas ensure consistent, validated outputs across different analysis types and can be easily customized for specific use cases.

## Schema Template Library

### 1. Comprehensive Data Analysis Schema

Use Case: Complete dataset analysis with statistics, patterns, and recommendations

{
  "type": "object",
  "properties": {
    "analysis_metadata": {
      "type": "object",
      "properties": {
        "dataset_name": {"type": "string", "description": "Name of the analyzed dataset"},
        "analysis_date": {"type": "string", "format": "date-time"},
        "analyst": {"type": "string", "description": "Name or ID of the analyst"},
        "analysis_type": {"type": "string", "enum": ["exploratory", "confirmatory", "descriptive", "predictive"]},
        "model_used": {"type": "string", "description": "OpenAI model used for analysis"}
      },
      "required": ["dataset_name", "analysis_date", "analysis_type"]
    },
    "dataset_summary": {
      "type": "object",
      "properties": {
        "rows": {"type": "integer", "minimum": 0},
        "columns": {"type": "integer", "minimum": 0},
        "missing_values": {"type": "integer", "minimum": 0},
        "data_types": {
          "type": "object",
          "additionalProperties": {"type": "string"}
        },
        "memory_usage": {"type": "string", "description": "Memory usage in MB/GB"},
        "date_range": {
          "type": "object",
          "properties": {
            "start_date": {"type": "string", "format": "date"},
            "end_date": {"type": "string", "format": "date"}
          }
        }
      },
      "required": ["rows", "columns"]
    },
    "statistical_analysis": {
      "type": "object",
      "properties": {
        "descriptive_stats": {
          "type": "object",
          "patternProperties": {
            "^[a-zA-Z_][a-zA-Z0-9_]*$": {
              "type": "object",
              "properties": {
                "count": {"type": "number"},
                "mean": {"type": "number"},
                "median": {"type": "number"},
                "std": {"type": "number"},
                "min": {"type": "number"},
                "max": {"type": "number"},
                "q25": {"type": "number"},
                "q75": {"type": "number"},
                "skewness": {"type": "number"},
                "kurtosis": {"type": "number"}
              },
              "required": ["count", "mean", "std"]
            }
          }
        },
        "correlations": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "variable_1": {"type": "string"},
              "variable_2": {"type": "string"},
              "correlation_coefficient": {"type": "number", "minimum": -1, "maximum": 1},
              "p_value": {"type": "number", "minimum": 0, "maximum": 1},
              "significance_level": {"type": "string", "enum": ["***", "**", "*", "ns"]},
              "interpretation": {"type": "string"}
            },
            "required": ["variable_1", "variable_2", "correlation_coefficient"]
          }
        },
        "hypothesis_tests": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "test_name": {"type": "string"},
              "variables": {"type": "array", "items": {"type": "string"}},
              "statistic": {"type": "number"},
              "p_value": {"type": "number", "minimum": 0, "maximum": 1},
              "degrees_of_freedom": {"type": "integer", "minimum": 0},
              "effect_size": {"type": "number"},
              "confidence_interval": {
                "type": "object",
                "properties": {
                  "lower": {"type": "number"},
                  "upper": {"type": "number"},
                  "confidence_level": {"type": "number", "default": 0.95}
                }
              },
              "conclusion": {"type": "string"},
              "interpretation": {"type": "string"}
            },
            "required": ["test_name", "p_value", "conclusion"]
          }
        }
      }
    },
    "patterns_and_insights": {
      "type": "object",
      "properties": {
        "key_patterns": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "pattern": {"type": "string"},
              "variables_involved": {"type": "array", "items": {"type": "string"}},
              "strength": {"type": "string", "enum": ["weak", "moderate", "strong"]},
              "confidence": {"type": "string", "enum": ["low", "medium", "high"]},
              "business_impact": {"type": "string"}
            }
          }
        },
        "anomalies": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "type": {"type": "string", "enum": ["outlier", "missing_data", "inconsistency", "trend_break"]},
              "description": {"type": "string"},
              "affected_variables": {"type": "array", "items": {"type": "string"}},
              "severity": {"type": "string", "enum": ["low", "medium", "high", "critical"]},
              "recommended_action": {"type": "string"}
            }
          }
        },
        "trends": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "variable": {"type": "string"},
              "trend_type": {"type": "string", "enum": ["increasing", "decreasing", "seasonal", "cyclical", "stable"]},
              "strength": {"type": "number", "minimum": 0, "maximum": 1},
              "time_period": {"type": "string"},
              "forecast": {"type": "string"}
            }
          }
        }
      }
    },
    "recommendations": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "category": {"type": "string", "enum": ["data_quality", "analysis", "business_action", "further_investigation"]},
          "recommendation": {"type": "string"},
          "priority": {"type": "string", "enum": ["low", "medium", "high", "critical"]},
          "rationale": {"type": "string"},
          "expected_impact": {"type": "string"},
          "effort_required": {"type": "string", "enum": ["minimal", "moderate", "significant"]},
          "timeline": {"type": "string"}
        },
        "required": ["recommendation", "priority", "rationale"]
      }
    }
  },
  "required": ["analysis_metadata", "dataset_summary", "statistical_analysis", "patterns_and_insights", "recommendations"]
}
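
A response shaped like this schema can be consumed directly in Python. The following is a minimal sketch, assuming the response has already been parsed into a dict; the helper name `significant_correlations`, its thresholds, and the sample values are illustrative assumptions, not part of ostruct or real analysis results.

```python
import json

# Illustrative fragment shaped like the schema above; the values are invented
# for this example and do not come from a real analysis.
sample_output = json.loads("""
{
  "statistical_analysis": {
    "correlations": [
      {"variable_1": "price", "variable_2": "demand",
       "correlation_coefficient": -0.72, "p_value": 0.001},
      {"variable_1": "price", "variable_2": "region_code",
       "correlation_coefficient": 0.05, "p_value": 0.64},
      {"variable_1": "ad_spend", "variable_2": "revenue",
       "correlation_coefficient": 0.58}
    ]
  }
}
""")


def significant_correlations(output, alpha=0.05, min_abs_r=0.3):
    """Return (var_1, var_2, r, p) tuples with p < alpha and |r| >= min_abs_r."""
    rows = output.get("statistical_analysis", {}).get("correlations", [])
    kept = []
    for row in rows:
        p_value = row.get("p_value")  # optional in the schema
        r = row["correlation_coefficient"]  # required in the schema
        # Entries without a p-value are skipped rather than assumed significant
        if p_value is not None and p_value < alpha and abs(r) >= min_abs_r:
            kept.append((row["variable_1"], row["variable_2"], r, p_value))
    # Strongest relationships first
    return sorted(kept, key=lambda item: abs(item[2]), reverse=True)


strong = significant_correlations(sample_output)
for var_1, var_2, r, p_value in strong:
    print(f"{var_1} vs {var_2}: r={r:+.2f}, p={p_value:.3f}")
```

Because `p_value` is optional in the schema, the sketch ignores correlations that omit it; add it to `required` if downstream code depends on it.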

### 2. Advanced Visualization Schema

Use Case: Comprehensive visualization specifications with business context

{
  "type": "object",
  "properties": {
    "visualization_suite": {
      "type": "object",
      "properties": {
        "title": {"type": "string"},
        "description": {"type": "string"},
        "created_date": {"type": "string", "format": "date-time"},
        "data_source": {"type": "string"},
        "target_audience": {"type": "string", "enum": ["technical", "business", "executive", "general"]}
      }
    },
    "visualizations": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "id": {"type": "string"},
          "type": {
            "type": "string",
            "enum": ["line", "bar", "scatter", "histogram", "box", "heatmap", "pie", "area", "radar", "bubble", "treemap", "waterfall", "funnel", "gauge", "sankey"]
          },
          "title": {"type": "string"},
          "subtitle": {"type": "string"},
          "data_specification": {
            "type": "object",
            "properties": {
              "x_axis": {
                "type": "object",
                "properties": {
                  "variable": {"type": "string"},
                  "label": {"type": "string"},
                  "data_type": {"type": "string", "enum": ["categorical", "numerical", "datetime"]},
                  "format": {"type": "string"}
                }
              },
              "y_axis": {
                "type": "object",
                "properties": {
                  "variable": {"type": "string"},
                  "label": {"type": "string"},
                  "data_type": {"type": "string", "enum": ["categorical", "numerical", "datetime"]},
                  "format": {"type": "string"}
                }
              },
              "color_by": {"type": "string"},
              "size_by": {"type": "string"},
              "filters": {
                "type": "array",
                "items": {
                  "type": "object",
                  "properties": {
                    "variable": {"type": "string"},
                    "condition": {"type": "string"},
                    "value": {"type": ["string", "number", "array"]}
                  }
                }
              },
              "aggregation": {
                "type": "object",
                "properties": {
                  "method": {"type": "string", "enum": ["sum", "mean", "median", "count", "min", "max", "std"]},
                  "group_by": {"type": "array", "items": {"type": "string"}}
                }
              }
            }
          },
          "styling": {
            "type": "object",
            "properties": {
              "color_palette": {"type": "string"},
              "theme": {"type": "string", "enum": ["light", "dark", "corporate", "minimal"]},
              "width": {"type": "integer"},
              "height": {"type": "integer"},
              "interactive": {"type": "boolean"},
              "annotations": {
                "type": "array",
                "items": {
                  "type": "object",
                  "properties": {
                    "type": {"type": "string", "enum": ["text", "arrow", "line", "rectangle"]},
                    "text": {"type": "string"},
                    "position": {"type": "object"},
                    "style": {"type": "object"}
                  }
                }
              }
            }
          },
          "insights": {
            "type": "array",
            "items": {
              "type": "object",
              "properties": {
                "insight": {"type": "string"},
                "type": {"type": "string", "enum": ["trend", "anomaly", "comparison", "distribution", "correlation"]},
                "confidence": {"type": "string", "enum": ["low", "medium", "high"]},
                "business_relevance": {"type": "string"}
              }
            }
          },
          "implementation": {
            "type": "object",
            "properties": {
              "python_code": {"type": "string"},
              "libraries_required": {"type": "array", "items": {"type": "string"}},
              "file_outputs": {"type": "array", "items": {"type": "string"}},
              "estimated_runtime": {"type": "string"}
            }
          }
        },
        "required": ["type", "title", "data_specification"]
      }
    },
    "dashboard_layout": {
      "type": "object",
      "properties": {
        "grid_layout": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "visualization_id": {"type": "string"},
              "row": {"type": "integer"},
              "column": {"type": "integer"},
              "width": {"type": "integer"},
              "height": {"type": "integer"}
            }
          }
        },
        "narrative_flow": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "section": {"type": "string"},
              "description": {"type": "string"},
              "visualizations": {"type": "array", "items": {"type": "string"}},
              "key_message": {"type": "string"}
            }
          }
        }
      }
    }
  },
  "required": ["visualization_suite", "visualizations"]
}
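
JSON Schema cannot express that every `visualization_id` in `dashboard_layout` must match an `id` in `visualizations`, so a small post-check can help. This is a sketch under that assumption; `find_layout_problems` and the sample spec are hypothetical and only illustrate the idea.

```python
def find_layout_problems(spec):
    """Return descriptions of dashboard_layout entries that reference unknown IDs."""
    # "id" is optional in the schema, so only collect the IDs that are present.
    known_ids = {vis["id"] for vis in spec.get("visualizations", []) if "id" in vis}
    layout = spec.get("dashboard_layout", {})
    problems = []

    for cell in layout.get("grid_layout", []):
        vis_id = cell.get("visualization_id")
        if vis_id not in known_ids:
            problems.append(f"grid_layout: unknown visualization_id {vis_id!r}")

    for section in layout.get("narrative_flow", []):
        for vis_id in section.get("visualizations", []):
            if vis_id not in known_ids:
                problems.append(
                    f"narrative_flow[{section.get('section')!r}]: "
                    f"unknown visualization {vis_id!r}"
                )
    return problems


# Invented example: the grid references "roi_box", which is never defined.
sample_spec = {
    "visualization_suite": {"title": "Example suite"},
    "visualizations": [
        {"id": "rev_trend", "type": "line", "title": "Revenue Trend", "data_specification": {}},
        {"id": "cat_mix", "type": "pie", "title": "Category Mix", "data_specification": {}},
    ],
    "dashboard_layout": {
        "grid_layout": [
            {"visualization_id": "rev_trend", "row": 1, "column": 1},
            {"visualization_id": "roi_box", "row": 1, "column": 2},
        ],
        "narrative_flow": [
            {"section": "Overview", "visualizations": ["rev_trend", "cat_mix"]},
        ],
    },
}

problems = find_layout_problems(sample_spec)
print(problems)
```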

### 3. Research Synthesis Schema

Use Case: Academic and scientific research synthesis with literature integration

{
  "type": "object",
  "properties": {
    "research_synthesis": {
      "type": "object",
      "properties": {
        "study_title": {"type": "string"},
        "research_question": {"type": "string"},
        "methodology": {"type": "string"},
        "synthesis_date": {"type": "string", "format": "date-time"},
        "researcher": {"type": "string"}
      },
      "required": ["study_title", "research_question"]
    },
    "literature_review": {
      "type": "object",
      "properties": {
        "sources_analyzed": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "title": {"type": "string"},
              "authors": {"type": "array", "items": {"type": "string"}},
              "year": {"type": "integer"},
              "journal": {"type": "string"},
              "doi": {"type": "string"},
              "relevance_score": {"type": "number", "minimum": 1, "maximum": 5},
              "quality_score": {"type": "number", "minimum": 1, "maximum": 5},
              "key_findings": {"type": "array", "items": {"type": "string"}}
            }
          }
        },
        "methodological_consensus": {
          "type": "object",
          "properties": {
            "common_approaches": {"type": "array", "items": {"type": "string"}},
            "validated_methods": {"type": "array", "items": {"type": "string"}},
            "methodological_gaps": {"type": "array", "items": {"type": "string"}},
            "best_practices": {"type": "array", "items": {"type": "string"}}
          }
        },
        "theoretical_framework": {
          "type": "object",
          "properties": {
            "established_theories": {"type": "array", "items": {"type": "string"}},
            "emerging_concepts": {"type": "array", "items": {"type": "string"}},
            "contradicting_findings": {"type": "array", "items": {"type": "string"}},
            "research_gaps": {"type": "array", "items": {"type": "string"}}
          }
        }
      }
    },
    "experimental_analysis": {
      "type": "object",
      "properties": {
        "study_design": {
          "type": "object",
          "properties": {
            "type": {"type": "string", "enum": ["experimental", "observational", "cross-sectional", "longitudinal", "case-control", "cohort"]},
            "sample_size": {"type": "integer"},
            "groups": {"type": "array", "items": {"type": "string"}},
            "variables": {
              "type": "object",
              "properties": {
                "independent": {"type": "array", "items": {"type": "string"}},
                "dependent": {"type": "array", "items": {"type": "string"}},
                "confounding": {"type": "array", "items": {"type": "string"}}
              }
            }
          }
        },
        "statistical_results": {
          "type": "object",
          "properties": {
            "primary_outcomes": {
              "type": "array",
              "items": {
                "type": "object",
                "properties": {
                  "outcome": {"type": "string"},
                  "measurement": {"type": "string"},
                  "result": {"type": "number"},
                  "confidence_interval": {
                    "type": "object",
                    "properties": {
                      "lower": {"type": "number"},
                      "upper": {"type": "number"},
                      "level": {"type": "number", "default": 0.95}
                    }
                  },
                  "p_value": {"type": "number"},
                  "effect_size": {"type": "number"},
                  "clinical_significance": {"type": "string"}
                }
              }
            },
            "secondary_outcomes": {
              "type": "array",
              "items": {
                "type": "object",
                "properties": {
                  "outcome": {"type": "string"},
                  "result": {"type": "number"},
                  "significance": {"type": "string"}
                }
              }
            },
            "subgroup_analyses": {
              "type": "array",
              "items": {
                "type": "object",
                "properties": {
                  "subgroup": {"type": "string"},
                  "results": {"type": "object"},
                  "interaction_p_value": {"type": "number"}
                }
              }
            }
          }
        }
      }
    },
    "synthesis_conclusions": {
      "type": "object",
      "properties": {
        "key_findings": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "finding": {"type": "string"},
              "evidence_strength": {"type": "string", "enum": ["weak", "moderate", "strong", "very_strong"]},
              "consistency_across_studies": {"type": "string", "enum": ["inconsistent", "somewhat_consistent", "consistent"]},
              "literature_support": {"type": "string"},
              "novel_contribution": {"type": "boolean"}
            }
          }
        },
        "limitations": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "limitation": {"type": "string"},
              "impact": {"type": "string", "enum": ["minor", "moderate", "major"]},
              "mitigation": {"type": "string"}
            }
          }
        },
        "implications": {
          "type": "object",
          "properties": {
            "theoretical": {"type": "array", "items": {"type": "string"}},
            "practical": {"type": "array", "items": {"type": "string"}},
            "clinical": {"type": "array", "items": {"type": "string"}},
            "policy": {"type": "array", "items": {"type": "string"}}
          }
        },
        "future_research": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "direction": {"type": "string"},
              "priority": {"type": "string", "enum": ["low", "medium", "high"]},
              "methodology": {"type": "string"},
              "expected_impact": {"type": "string"}
            }
          }
        }
      }
    }
  },
  "required": ["research_synthesis", "synthesis_conclusions"]
}
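
Some relationships between fields, such as a point estimate usually lying inside its confidence interval, are outside what the schema validates. The sketch below shows one way to flag such cases after parsing; `check_primary_outcomes` and the sample numbers are hypothetical and chosen only for illustration.

```python
def check_primary_outcomes(synthesis):
    """Flag primary outcomes whose confidence interval looks inconsistent."""
    results = synthesis.get("experimental_analysis", {}).get("statistical_results", {})
    issues = []
    for outcome in results.get("primary_outcomes", []):
        name = outcome.get("outcome", "<unnamed>")
        interval = outcome.get("confidence_interval", {})
        lower, upper = interval.get("lower"), interval.get("upper")
        if lower is None or upper is None:
            continue  # nothing to compare against
        if lower > upper:
            issues.append(f"{name}: interval bounds are reversed")
        elif "result" in outcome and not (lower <= outcome["result"] <= upper):
            issues.append(f"{name}: result lies outside its confidence interval")
    return issues


# Invented values for illustration only.
sample_synthesis = {
    "experimental_analysis": {
        "statistical_results": {
            "primary_outcomes": [
                {"outcome": "response_rate", "result": 0.4,
                 "confidence_interval": {"lower": 0.1, "upper": 0.7}},
                {"outcome": "hazard_ratio", "result": 1.9,
                 "confidence_interval": {"lower": 0.2, "upper": 1.1}},
            ]
        }
    }
}

issues = check_primary_outcomes(sample_synthesis)
print(issues)
```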

### 4. Business Intelligence Schema

Use Case: Market analysis and business intelligence reporting

{
  "type": "object",
  "properties": {
    "business_intelligence_report": {
      "type": "object",
      "properties": {
        "report_title": {"type": "string"},
        "analysis_period": {
          "type": "object",
          "properties": {
            "start_date": {"type": "string", "format": "date"},
            "end_date": {"type": "string", "format": "date"}
          }
        },
        "analyst": {"type": "string"},
        "report_date": {"type": "string", "format": "date-time"},
        "executive_summary": {"type": "string"},
        "key_metrics": {
          "type": "object",
          "patternProperties": {
            "^[a-zA-Z_][a-zA-Z0-9_]*$": {
              "type": "object",
              "properties": {
                "value": {"type": "number"},
                "unit": {"type": "string"},
                "change_from_previous": {"type": "number"},
                "trend": {"type": "string", "enum": ["increasing", "decreasing", "stable"]},
                "target": {"type": "number"},
                "performance_vs_target": {"type": "string"}
              }
            }
          }
        }
      },
      "required": ["report_title", "analysis_period", "executive_summary"]
    },
    "market_analysis": {
      "type": "object",
      "properties": {
        "market_size": {
          "type": "object",
          "properties": {
            "total_addressable_market": {"type": "number"},
            "serviceable_addressable_market": {"type": "number"},
            "serviceable_obtainable_market": {"type": "number"},
            "currency": {"type": "string", "default": "USD"},
            "growth_rate": {"type": "number"},
            "forecast_period": {"type": "string"}
          }
        },
        "competitive_landscape": {
          "type": "object",
          "properties": {
            "market_leaders": {
              "type": "array",
              "items": {
                "type": "object",
                "properties": {
                  "company": {"type": "string"},
                  "market_share": {"type": "number", "minimum": 0, "maximum": 100},
                  "strengths": {"type": "array", "items": {"type": "string"}},
                  "weaknesses": {"type": "array", "items": {"type": "string"}},
                  "recent_developments": {"type": "array", "items": {"type": "string"}}
                }
              }
            },
            "our_position": {
              "type": "object",
              "properties": {
                "market_share": {"type": "number"},
                "rank": {"type": "integer"},
                "competitive_advantages": {"type": "array", "items": {"type": "string"}},
                "areas_for_improvement": {"type": "array", "items": {"type": "string"}}
              }
            }
          }
        },
        "market_trends": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "trend": {"type": "string"},
              "impact": {"type": "string", "enum": ["positive", "negative", "neutral"]},
              "magnitude": {"type": "string", "enum": ["low", "medium", "high"]},
              "timeline": {"type": "string"},
              "implications": {"type": "string"}
            }
          }
        },
        "customer_insights": {
          "type": "object",
          "properties": {
            "segments": {
              "type": "array",
              "items": {
                "type": "object",
                "properties": {
                  "segment_name": {"type": "string"},
                  "size": {"type": "number"},
                  "growth_rate": {"type": "number"},
                  "key_characteristics": {"type": "array", "items": {"type": "string"}},
                  "pain_points": {"type": "array", "items": {"type": "string"}},
                  "preferences": {"type": "array", "items": {"type": "string"}}
                }
              }
            },
            "behavior_changes": {
              "type": "array",
              "items": {
                "type": "object",
                "properties": {
                  "change": {"type": "string"},
                  "drivers": {"type": "array", "items": {"type": "string"}},
                  "business_impact": {"type": "string"}
                }
              }
            }
          }
        }
      }
    },
    "performance_analysis": {
      "type": "object",
      "properties": {
        "financial_performance": {
          "type": "object",
          "properties": {
            "revenue": {
              "type": "object",
              "properties": {
                "current_period": {"type": "number"},
                "previous_period": {"type": "number"},
                "year_over_year_growth": {"type": "number"},
                "by_segment": {"type": "object"},
                "by_geography": {"type": "object"}
              }
            },
            "profitability": {
              "type": "object",
              "properties": {
                "gross_margin": {"type": "number"},
                "operating_margin": {"type": "number"},
                "net_margin": {"type": "number"},
                "margin_trends": {"type": "array", "items": {"type": "string"}}
              }
            },
            "key_ratios": {
              "type": "object",
              "properties": {
                "current_ratio": {"type": "number"},
                "debt_to_equity": {"type": "number"},
                "return_on_assets": {"type": "number"},
                "return_on_equity": {"type": "number"}
              }
            }
          }
        },
        "operational_performance": {
          "type": "object",
          "properties": {
            "efficiency_metrics": {"type": "object"},
            "quality_metrics": {"type": "object"},
            "customer_satisfaction": {"type": "object"},
            "employee_metrics": {"type": "object"}
          }
        }
      }
    },
    "strategic_recommendations": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "category": {"type": "string", "enum": ["growth", "efficiency", "competitive", "innovation", "risk_mitigation"]},
          "recommendation": {"type": "string"},
          "rationale": {"type": "string"},
          "expected_impact": {"type": "string"},
          "implementation_timeline": {"type": "string"},
          "resources_required": {"type": "string"},
          "success_metrics": {"type": "array", "items": {"type": "string"}},
          "risks": {"type": "array", "items": {"type": "string"}},
          "priority": {"type": "string", "enum": ["low", "medium", "high", "critical"]}
        },
        "required": ["recommendation", "rationale", "priority"]
      }
    },
    "risk_assessment": {
      "type": "object",
      "properties": {
        "identified_risks": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "risk": {"type": "string"},
              "category": {"type": "string", "enum": ["market", "operational", "financial", "regulatory", "technological", "competitive"]},
              "probability": {"type": "string", "enum": ["low", "medium", "high"]},
              "impact": {"type": "string", "enum": ["low", "medium", "high"]},
              "mitigation_strategies": {"type": "array", "items": {"type": "string"}},
              "monitoring_indicators": {"type": "array", "items": {"type": "string"}}
            }
          }
        },
        "risk_matrix": {
          "type": "object",
          "properties": {
            "high_probability_high_impact": {"type": "array", "items": {"type": "string"}},
            "high_probability_low_impact": {"type": "array", "items": {"type": "string"}},
            "low_probability_high_impact": {"type": "array", "items": {"type": "string"}},
            "low_probability_low_impact": {"type": "array", "items": {"type": "string"}}
          }
        }
      }
    }
  },
  "required": ["business_intelligence_report", "strategic_recommendations"]
}

### 5. Quick Analysis Schema

Use Case: Rapid analysis with essential insights (lightweight version)

{
  "type": "object",
  "properties": {
    "quick_analysis": {
      "type": "object",
      "properties": {
        "dataset": {"type": "string"},
        "analysis_date": {"type": "string", "format": "date-time"},
        "key_metrics": {"type": "object"},
        "top_insights": {
          "type": "array",
          "maxItems": 5,
          "items": {"type": "string"}
        },
        "red_flags": {
          "type": "array",
          "maxItems": 3,
          "items": {"type": "string"}
        },
        "immediate_actions": {
          "type": "array",
          "maxItems": 3,
          "items": {"type": "string"}
        }
      },
      "required": ["dataset", "top_insights"]
    }
  }
}
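
To see what this lightweight schema actually constrains, a candidate result can be checked by hand. The sketch below uses only the standard library and covers just the `required` and `maxItems` rules shown above; it is not a full JSON Schema validator, and the function name and sample result are invented for illustration.

```python
# Limits copied from the Quick Analysis schema above
MAX_ITEMS = {"top_insights": 5, "red_flags": 3, "immediate_actions": 3}
REQUIRED = ["dataset", "top_insights"]

def check_quick_analysis(result):
    """Return a list of problems found in a quick_analysis result (empty if none)."""
    body = result.get("quick_analysis", {})
    problems = [f"missing required field: {name}" for name in REQUIRED if name not in body]
    for name, limit in MAX_ITEMS.items():
        items = body.get(name, [])
        if len(items) > limit:
            problems.append(f"{name} has {len(items)} items (max {limit})")
    return problems

# Hypothetical result, for illustration only
sample = {
    "quick_analysis": {
        "dataset": "sales_q3.csv",
        "top_insights": ["Revenue up in EMEA"],
        "red_flags": ["a", "b", "c", "d"],
    }
}
print(check_quick_analysis(sample))
```

An empty list means the result respects the limits; anything else is a readable description of what went wrong.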

## Schema Usage Guidelines

### Customization Tips

  1. Remove unnecessary fields for simpler analyses

  2. Add domain-specific properties (e.g., medical, financial, engineering fields)

  3. Adjust enum values to match your business terminology

  4. Modify validation rules (min/max values, required fields) based on your data
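
The first three tips can also be applied programmatically when several schema variants are derived from one base. The following is a minimal sketch, assuming the schema is a flat object schema loaded as a Python dict; `customize_schema` and the sample field names (`icd10_code`, the `P1`/`P2`/`P3` labels) are hypothetical.

```python
import copy

def customize_schema(schema, drop=(), add=None, enums=None):
    """Return a modified copy of an object schema; the input is left untouched."""
    result = copy.deepcopy(schema)
    props = result.setdefault("properties", {})
    for name in drop:                        # tip 1: remove unnecessary fields
        props.pop(name, None)
        if name in result.get("required", []):
            result["required"].remove(name)
    props.update(add or {})                  # tip 2: add domain-specific properties
    for name, values in (enums or {}).items():
        props[name]["enum"] = list(values)   # tip 3: adjust enum values
    return result

base = {
    "type": "object",
    "properties": {
        "priority": {"type": "string", "enum": ["low", "medium", "high", "critical"]},
        "risks": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["priority", "risks"],
}
custom = customize_schema(
    base,
    drop=["risks"],
    add={"icd10_code": {"type": "string"}},   # hypothetical medical field
    enums={"priority": ["P1", "P2", "P3"]},   # hypothetical house terminology
)
```

Dropped fields are also removed from `required`, so the customized schema never demands a property it no longer defines.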

### Schema Selection Guide

  • Comprehensive Data Analysis: Full statistical analysis with business context

  • Advanced Visualization: Complex dashboards and chart specifications

  • Research Synthesis: Academic or scientific research projects

  • Business Intelligence: Market analysis and strategic planning

  • Quick Analysis: Rapid insights for daily operations

### Best Practices

  1. Always include metadata (analyst, date, data source) for traceability

  2. Use consistent field naming across your organization’s schemas

  3. Include confidence levels for insights and recommendations

  4. Provide clear descriptions in schema properties for AI model understanding

  5. Validate outputs against schemas to ensure consistency
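
Practice 4 is easy to audit mechanically. As a rough sketch (the helper name and the tiny schema are illustrative, not part of ostruct), the function below walks a schema dict and lists every property that has no `description`:

```python
def missing_descriptions(schema, path=""):
    """Yield dotted paths of properties that have no 'description' key."""
    for name, prop in schema.get("properties", {}).items():
        here = f"{path}.{name}" if path else name
        if "description" not in prop:
            yield here
        # Recurse into nested objects and into array item schemas
        yield from missing_descriptions(prop, here)
        if isinstance(prop.get("items"), dict):
            yield from missing_descriptions(prop["items"], here + "[]")

schema = {
    "type": "object",
    "properties": {
        "executive_summary": {"type": "string", "description": "2-3 sentence summary"},
        "risk_factors": {
            "type": "array",
            "items": {"type": "object", "properties": {"severity": {"type": "string"}}},
        },
    },
}
print(list(missing_descriptions(schema)))
```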

Practical Examples and Use Cases

Financial Data Analysis

Scenario: Analyze quarterly financial data with market context

ostruct run financial_analysis.j2 financial_schema.json \
  --file ci:financials quarterly_report.csv \
  --file fs:industry industry_benchmarks.pdf \
  --enable-tool web-search \
  --model gpt-4o

Key Features:

  • Automated ratio calculations via Code Interpreter

  • Benchmark comparisons via File Search

  • Current market conditions via Web Search

  • Structured output for further processing

Scientific Research Synthesis

Scenario: Combine experimental data with literature review

ostruct run research_synthesis.j2 research_schema.json \
  --file ci:results experimental_data.csv \
  --dir fs:literature "papers/" \
  --enable-tool web-search \
  --model gpt-4o

Workflow:

  1. Statistical analysis of experimental results

  2. Literature context from paper database

  3. Current research trends from web search

  4. Synthesized conclusions with citations

Market Research Automation

Scenario: Automated market intelligence reports

ostruct run market_intel.j2 market_schema.json \
  --file ci:sales_data current_sales.csv \
  --file fs:reports competitor_analysis.pdf \
  --enable-tool web-search \
  --ws-context-size comprehensive \
  --model gpt-4o

Output: Structured market intelligence report with quantitative metrics, competitive analysis, and current market trends.

Token Management for Large Datasets

Best Practices

Chunking Large Files:

# Split large datasets for processing
import subprocess

import pandas as pd

# Read large dataset
df = pd.read_csv('large_dataset.csv')

# Process in chunks of 1,000 rows
chunk_size = 1000
for i in range(0, len(df), chunk_size):
    chunk_id = i // chunk_size
    chunk = df.iloc[i:i + chunk_size]
    chunk.to_csv(f'chunk_{chunk_id}.csv', index=False)

    # Process each chunk; check=True raises CalledProcessError if ostruct fails
    subprocess.run([
        'ostruct', 'run', 'analysis.j2', 'schema.json',
        '--file', 'ci:data', f'chunk_{chunk_id}.csv',
        '--output-file', f'results_{chunk_id}.json'
    ], check=True)
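
Once every chunk has been processed, the per-chunk outputs still need to be collected. The sketch below gathers the `results_N.json` files in numeric order using only the standard library. How the chunk results are then combined depends on your schema, so the function (a hypothetical helper) simply returns them as a list.

```python
import glob
import json
import re

# Matches the per-chunk output names written by the loop above, e.g. results_12.json
CHUNK_RE = re.compile(r"results_(\d+)\.json$")

def load_chunk_results(pattern="results_*.json"):
    """Load per-chunk outputs in numeric chunk order.

    Returns a list of (chunk_number, parsed_json) tuples. Files that match the
    glob but not the numbered naming scheme are ignored.
    """
    numbered = []
    for path in glob.glob(pattern):
        match = CHUNK_RE.search(path)
        if match:
            numbered.append((int(match.group(1)), path))

    results = []
    for number, path in sorted(numbered):
        with open(path) as f:
            results.append((number, json.load(f)))
    return results
```

Sorting on the parsed integer avoids the lexical ordering problem where `results_10.json` would come before `results_2.json`.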

Dry Run for Token Estimation:

# Preview prompt and token counts without API cost
ostruct run analysis.j2 schema.json \
  --file ci:data large_dataset.csv \
  --dry-run

# This shows the full expanded prompt and token count
# Use this to optimize before making expensive API calls

# Use token-efficient models for large datasets
ostruct run analysis.j2 schema.json \
  --file ci:data large_dataset.csv \
  --model gpt-4o-mini  # More cost-effective for large inputs

Error Handling and Troubleshooting

Known Issues

File Search Empty Results (Current Bug):

File Search may return empty results despite successful vector store creation. This is a known upstream OpenAI API issue affecting all models.

Workarounds:

  • Fallback to Code Interpreter: Route documents to ci: for programmatic parsing

  • Direct prompt inclusion: Use prompt: routing for smaller documents that fit in context

  • Hybrid approach: Combine manual document parsing with web search for current information

# If File Search fails, try Code Interpreter parsing
ostruct run analysis.j2 schema.json \
  --file ci:docs research_paper.pdf \
  --enable-tool web-search \
  --model gpt-4o

Common Issues

Binary File Access Errors:

{# Handle mixed file types gracefully #}
{% for file in dataset %}
  {% if file.extension in ['csv', 'txt', 'json'] %}
    {{ file.content }}
  {% else %}
    File: {{ file.name }} ({{ file.size }} bytes, binary - use Code Interpreter for analysis)
  {% endif %}
{% endfor %}

Token Limit Errors:

# Use summary approach for large files
ostruct run summarize_first.j2 summary_schema.json \
  --file ci:data large_file.csv \
  --max-output-tokens 4000

Schema Validation Failures:

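# Validate the schema itself, then test a sample output against it
import json

import jsonschema

with open('schema.json', 'r') as f:
    schema = json.load(f)

# Check that the schema is well-formed JSON Schema
try:
    jsonschema.Draft7Validator.check_schema(schema)
    print("Schema is valid")
except jsonschema.SchemaError as e:
    print(f"Schema error: {e.message}")

# Test with sample data
sample_output = {"test": "data"}
try:
    jsonschema.validate(sample_output, schema)
    print("Sample output matches schema")
except jsonschema.ValidationError as e:
    print(f"Validation error: {e.message}")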

Performance Optimization

Efficient Workflows

Parallel Processing:

import concurrent.futures
import subprocess
from pathlib import Path

def process_file(filename):
    # Name the output after the input stem, e.g. data1.csv -> results_data1.json
    output_file = f'results_{Path(filename).stem}.json'
    return subprocess.run([
        'ostruct', 'run', 'analysis.j2', 'schema.json',
        '--file', 'ci:data', filename,
        '--output-file', output_file
    ], capture_output=True)

# Process multiple files in parallel
files = ['data1.csv', 'data2.csv', 'data3.csv']
with concurrent.futures.ThreadPoolExecutor() as executor:
    results = list(executor.map(process_file, files))
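
Because `subprocess.run` does not raise on a non-zero exit code by default, a failed run can go unnoticed in the `results` list. A minimal sketch of a follow-up check, assuming `results` holds the `CompletedProcess` objects in input order; `split_runs` and the stand-in data are illustrative only:

```python
import subprocess

def split_runs(files, results):
    """Separate successful runs from failed ones.

    `results` must be in the same order as `files`, which is what
    executor.map() guarantees.
    """
    succeeded, failed = [], {}
    for filename, proc in zip(files, results):
        if proc.returncode == 0:
            succeeded.append(filename)
            continue
        stderr = proc.stderr or b""
        if isinstance(stderr, bytes):  # bytes unless the run used text=True
            stderr = stderr.decode(errors="replace")
        failed[filename] = stderr.strip()
    return succeeded, failed

# Stand-in results so the sketch runs without calling ostruct
demo_files = ["data1.csv", "data2.csv"]
demo_results = [
    subprocess.CompletedProcess(args=[], returncode=0, stdout=b"", stderr=b""),
    subprocess.CompletedProcess(args=[], returncode=1, stdout=b"", stderr=b"rate limit\n"),
]
ok, errors = split_runs(demo_files, demo_results)
```

The `failed` mapping can then drive a retry pass or a report of which inputs need attention.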

Model Selection for Different Tasks:

# Use appropriate models for different complexity levels

# Simple extraction - use efficient model
ostruct run extract.j2 schema.json --model gpt-4o-mini

# Complex analysis - use powerful model
ostruct run complex_analysis.j2 schema.json --model gpt-4o

# Reasoning tasks - use reasoning model
ostruct run reasoning.j2 schema.json --model o1-preview
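
When ostruct is driven from Python, the same model choice can be kept in one place. The sketch below builds the argument list for `subprocess.run`. The task labels are hypothetical; the model names are the ones used in the commands above.

```python
# Task labels are illustrative; model names come from the commands above
MODEL_BY_TASK = {
    "extraction": "gpt-4o-mini",
    "analysis": "gpt-4o",
    "reasoning": "o1-preview",
}

def ostruct_command(template, schema, task, extra_args=()):
    """Build the argument list for `ostruct run`, picking the model by task type."""
    if task not in MODEL_BY_TASK:
        raise ValueError(f"unknown task type: {task!r}")
    return ["ostruct", "run", template, schema,
            "--model", MODEL_BY_TASK[task], *extra_args]

cmd = ostruct_command("extract.j2", "schema.json", "extraction")
print(" ".join(cmd))
```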

Practical Examples and Use Cases

This section provides complete, real-world workflows demonstrating ostruct’s power for data science applications.

Financial Data Analysis Workflow

Scenario: Automated quarterly financial analysis combining market data, company reports, and regulatory filings.

Complete Workflow:

Step 1: Market Data Collection and Analysis

# Template: financial_analysis.j2
ostruct run financial_analysis.j2 financial_schema.json \
  --file ci:quarterly_data quarterly_data.xlsx \
  --enable-tool web-search \
  --web-query "{{company_name}} Q3 2024 earnings market reaction analysis" \
  --model gpt-4o

Template (financial_analysis.j2):

## Financial Analysis for {{company_name}} - {{quarter}}

### Market Data Analysis
Analyze the following financial data and provide comprehensive insights:

**Raw Data:**
{{ quarterly_data.name }} (Excel workbook; binary, so load and analyze it with Code Interpreter)

**Market Context (from web search):**
{% if web_search_results %}
{{ web_search_results }}
{% endif %}

### Analysis Requirements:
1. **Performance Metrics**: Calculate key ratios (ROE, EBITDA margin, debt-to-equity)
2. **Trend Analysis**: Compare with previous 4 quarters
3. **Market Position**: Benchmark against industry peers
4. **Risk Assessment**: Identify potential financial risks
5. **Growth Projection**: Forecast next quarter based on current trends

### Regulatory Compliance Check:
Review all metrics against SEC disclosure requirements and flag any concerning trends.

Schema (financial_schema.json):

{
  "type": "object",
  "properties": {
    "executive_summary": {
      "type": "string",
      "description": "2-3 sentence summary of financial health"
    },
    "key_metrics": {
      "type": "object",
      "properties": {
        "revenue": {"type": "number"},
        "net_income": {"type": "number"},
        "roe": {"type": "number"},
        "ebitda_margin": {"type": "number"},
        "debt_to_equity": {"type": "number"}
      },
      "required": ["revenue", "net_income", "roe"]
    },
    "trend_analysis": {
      "type": "object",
      "properties": {
        "revenue_growth": {"type": "number"},
        "profit_margin_trend": {"type": "string"},
        "quarter_over_quarter_change": {"type": "number"}
      }
    },
    "risk_factors": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "risk_type": {"type": "string"},
          "severity": {"type": "string", "enum": ["low", "medium", "high", "critical"]},
          "description": {"type": "string"},
          "mitigation_suggestions": {"type": "string"}
        },
        "required": ["risk_type", "severity", "description"]
      }
    },
    "growth_forecast": {
      "type": "object",
      "properties": {
        "next_quarter_revenue_estimate": {"type": "number"},
        "confidence_level": {"type": "string", "enum": ["low", "medium", "high"]},
        "key_assumptions": {"type": "array", "items": {"type": "string"}}
      }
    },
    "compliance_flags": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "regulation": {"type": "string"},
          "status": {"type": "string", "enum": ["compliant", "attention_required", "violation"]},
          "details": {"type": "string"}
        }
      }
    }
  },
  "required": ["executive_summary", "key_metrics", "risk_factors"]
}
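
Since `key_metrics` reports ratios computed by the model, it can be worth recomputing them locally as a sanity check. The sketch below uses the standard definitions (ROE = net income / shareholders' equity, EBITDA margin = EBITDA / revenue, debt-to-equity = total debt / shareholders' equity). It assumes ratios are expressed as fractions rather than percentages, which the schema does not specify, and all figures are invented.

```python
import math

def key_ratios(net_income, shareholders_equity, ebitda, revenue, total_debt):
    """Compute the three ratios as plain fractions (0.15 means 15%)."""
    return {
        "roe": net_income / shareholders_equity,
        "ebitda_margin": ebitda / revenue,
        "debt_to_equity": total_debt / shareholders_equity,
    }

def mismatches(reported, recomputed, rel_tol=0.01):
    """Names of ratios whose reported value differs from the recomputed one."""
    return [
        name for name, value in recomputed.items()
        if name in reported and not math.isclose(reported[name], value, rel_tol=rel_tol)
    ]

# Hypothetical figures (in millions), for illustration only
recomputed = key_ratios(net_income=12.0, shareholders_equity=80.0,
                        ebitda=30.0, revenue=120.0, total_debt=40.0)
reported = {"revenue": 120.0, "net_income": 12.0, "roe": 0.15, "ebitda_margin": 0.30}
print(mismatches(reported, recomputed))
```

Ratios missing from the reported metrics are skipped, because only `revenue`, `net_income`, and `roe` are required by the schema.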

Step 2: Report Generation and Visualization

# Generate presentation-ready report
ostruct run financial_report.j2 report_schema.json \
  --file prompt:previous_analysis results.json \
  --enable-tool code-interpreter \
  --model gpt-4o

Scientific Research Synthesis

Scenario: Automated literature review combining research papers, recent publications, and domain expert opinions.

Complete Workflow:

Step 1: Multi-Source Research Collection

# Combine local papers with latest research
ostruct run research_synthesis.j2 research_schema.json \
  --dir fs:research_papers research_papers/ \
  --enable-tool web-search \
  --web-query "{{research_topic}} 2024 latest findings systematic review" \
  --model o1-preview

Template (research_synthesis.j2):

## Comprehensive Research Synthesis: {{research_topic}}

### Local Research Papers Analysis
{% for paper in research_papers %}
**Paper:** {{ paper.name }}
**Content:** {{ paper.content if paper.size < 50000 else "Large paper - analyze key sections" }}
{% endfor %}

### Latest Web Research
{% if web_search_results %}
**Recent Findings:**
{{ web_search_results }}
{% endif %}

### Synthesis Requirements:
1. **Literature Gap Analysis**: Identify research gaps and contradictions
2. **Methodology Comparison**: Compare approaches across studies
3. **Evidence Quality Assessment**: Rate evidence strength using GRADE criteria
4. **Emerging Trends**: Identify novel approaches and future directions
5. **Practical Applications**: Translate findings to actionable insights

### Meta-Analysis Elements:
- Sample sizes and statistical power across studies
- Effect sizes and confidence intervals
- Heterogeneity assessment
- Publication bias evaluation

Schema (research_schema.json):

{
  "type": "object",
  "properties": {
    "research_summary": {
      "type": "string",
      "description": "Comprehensive 3-4 sentence summary of current state"
    },
    "key_findings": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "finding": {"type": "string"},
          "evidence_level": {"type": "string", "enum": ["strong", "moderate", "weak", "insufficient"]},
          "supporting_studies": {"type": "array", "items": {"type": "string"}},
          "contradictory_evidence": {"type": "array", "items": {"type": "string"}}
        },
        "required": ["finding", "evidence_level"]
      }
    },
    "methodology_analysis": {
      "type": "object",
      "properties": {
        "dominant_approaches": {"type": "array", "items": {"type": "string"}},
        "emerging_methods": {"type": "array", "items": {"type": "string"}},
        "methodological_limitations": {"type": "array", "items": {"type": "string"}}
      }
    },
    "research_gaps": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "gap_description": {"type": "string"},
          "importance": {"type": "string", "enum": ["critical", "important", "moderate", "minor"]},
          "suggested_approaches": {"type": "array", "items": {"type": "string"}}
        },
        "required": ["gap_description", "importance"]
      }
    },
    "practical_implications": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "domain": {"type": "string"},
          "implication": {"type": "string"},
          "confidence_level": {"type": "string", "enum": ["high", "medium", "low"]},
          "implementation_complexity": {"type": "string", "enum": ["simple", "moderate", "complex"]}
        }
      }
    },
    "future_directions": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "direction": {"type": "string"},
          "priority": {"type": "string", "enum": ["high", "medium", "low"]},
          "feasibility": {"type": "string", "enum": ["high", "medium", "low"]},
          "potential_impact": {"type": "string", "enum": ["transformative", "significant", "incremental"]}
        }
      }
    }
  },
  "required": ["research_summary", "key_findings", "research_gaps"]
}
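
Because `evidence_level` and `importance` are enums, downstream code can rank on them without parsing free text. A minimal sketch, with helper names and the sample result invented for illustration:

```python
# Orderings follow the enum definitions in the schema above (strongest first)
EVIDENCE_ORDER = ["strong", "moderate", "weak", "insufficient"]
IMPORTANCE_ORDER = ["critical", "important", "moderate", "minor"]

def findings_at_least(result, minimum="moderate"):
    """Return finding texts whose evidence level is `minimum` or stronger."""
    cutoff = EVIDENCE_ORDER.index(minimum)
    return [
        f["finding"] for f in result["key_findings"]
        if EVIDENCE_ORDER.index(f["evidence_level"]) <= cutoff
    ]

def gaps_by_importance(result):
    """Return research gaps sorted from most to least important."""
    return sorted(result["research_gaps"],
                  key=lambda g: IMPORTANCE_ORDER.index(g["importance"]))

# Hypothetical result, for illustration only
result = {
    "research_summary": "...",
    "key_findings": [
        {"finding": "A", "evidence_level": "weak"},
        {"finding": "B", "evidence_level": "strong"},
        {"finding": "C", "evidence_level": "moderate"},
    ],
    "research_gaps": [
        {"gap_description": "G1", "importance": "minor"},
        {"gap_description": "G2", "importance": "critical"},
    ],
}
print(findings_at_least(result))
print([g["gap_description"] for g in gaps_by_importance(result)])
```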

Business Intelligence Report Generation

Scenario: Automated competitive analysis combining internal sales data, market reports, and real-time competitor monitoring.

Complete Workflow:

Step 1: Multi-Source Business Intelligence

# Comprehensive competitive analysis
ostruct run bi_analysis.j2 bi_schema.json \
  --file ci:sales_data sales_data.xlsx \
  --dir fs:market_reports market_reports/ \
  --enable-tool web-search \
  --web-query "{{competitor_name}} Q4 2024 market share pricing strategy" \
  --model gpt-4o

Template (bi_analysis.j2):

## Business Intelligence Report - {{analysis_period}}

### Internal Performance Analysis
**Sales Data:**
{{ sales_data.name }} (Excel workbook; binary, so load and analyze it with Code Interpreter)

### Market Context Analysis
**Market Reports:**
{% for report in market_reports %}
**Report:** {{ report.name }}
{{ report.content if report.size < 30000 else "Large report - focus on executive summary and key metrics" }}
{% endfor %}

### Competitive Intelligence
{% if web_search_results %}
**Latest Competitor Activity:**
{{ web_search_results }}
{% endif %}

### Analysis Requirements:
1. **Market Position**: Our position vs competitors across key metrics
2. **Growth Opportunities**: Untapped segments and expansion possibilities
3. **Competitive Threats**: Emerging competitors and market disruptions
4. **Pricing Analysis**: Price positioning and optimization opportunities
5. **Strategic Recommendations**: Actionable next steps with ROI projections

### Executive Briefing Elements:
- Top 3 strategic priorities
- Revenue impact projections
- Resource requirements
- Timeline for implementation

Schema (bi_schema.json):

{
  "type": "object",
  "properties": {
    "executive_summary": {
      "type": "string",
      "description": "CEO-ready 2-3 sentence summary of strategic position"
    },
    "market_position": {
      "type": "object",
      "properties": {
        "market_share": {"type": "number"},
        "competitive_ranking": {"type": "integer"},
        "differentiation_strengths": {"type": "array", "items": {"type": "string"}},
        "competitive_gaps": {"type": "array", "items": {"type": "string"}}
      }
    },
    "growth_opportunities": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "opportunity": {"type": "string"},
          "market_size": {"type": "number"},
          "revenue_potential": {"type": "number"},
          "time_to_market": {"type": "string"},
          "investment_required": {"type": "number"},
          "risk_level": {"type": "string", "enum": ["low", "medium", "high"]}
        },
        "required": ["opportunity", "revenue_potential", "risk_level"]
      }
    },
    "competitive_threats": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "threat_source": {"type": "string"},
          "threat_type": {"type": "string"},
          "severity": {"type": "string", "enum": ["low", "medium", "high", "critical"]},
          "timeline": {"type": "string"},
          "mitigation_strategies": {"type": "array", "items": {"type": "string"}}
        }
      }
    },
    "strategic_recommendations": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "recommendation": {"type": "string"},
          "priority": {"type": "string", "enum": ["critical", "high", "medium", "low"]},
          "expected_roi": {"type": "number"},
          "implementation_timeline": {"type": "string"},
          "resource_requirements": {"type": "array", "items": {"type": "string"}},
          "success_metrics": {"type": "array", "items": {"type": "string"}}
        },
        "required": ["recommendation", "priority", "expected_roi"]
      }
    },
    "pricing_analysis": {
      "type": "object",
      "properties": {
        "current_positioning": {"type": "string"},
        "competitor_comparison": {"type": "array", "items": {"type": "object"}},
        "optimization_opportunities": {"type": "array", "items": {"type": "string"}},
        "revenue_impact_estimate": {"type": "number"}
      }
    }
  },
  "required": ["executive_summary", "market_position", "growth_opportunities", "strategic_recommendations"]
}
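
Each recommendation is required to carry both `priority` and `expected_roi`, so a report can be turned into an ordered action list directly. A sketch of one possible ordering (priority first, then higher ROI first); the function name and sample data are assumptions:

```python
# Rank follows the priority enum in the schema above
PRIORITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3}

def rank_recommendations(report):
    """Sort recommendations by priority, breaking ties by descending expected ROI."""
    return sorted(
        report["strategic_recommendations"],
        key=lambda r: (PRIORITY_RANK[r["priority"]], -r["expected_roi"]),
    )

# Hypothetical report fragment, for illustration only
report = {
    "strategic_recommendations": [
        {"recommendation": "Expand reseller program", "priority": "high", "expected_roi": 1.4},
        {"recommendation": "Fix churn in SMB tier", "priority": "critical", "expected_roi": 0.9},
        {"recommendation": "Bundle pricing pilot", "priority": "high", "expected_roi": 2.1},
    ]
}
for rec in rank_recommendations(report):
    print(rec["priority"], rec["expected_roi"], rec["recommendation"])
```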

Market Research Automation

Scenario: Automated market entry analysis combining demographic data, competitor landscape, and regulatory environment.

Complete Workflow:

Step 1: Comprehensive Market Entry Analysis

# Complete market entry assessment
ostruct run market_entry.j2 market_schema.json \
  --file ci:demographic_data demographic_data.csv \
  --dir fs:regulatory_requirements regulatory_requirements/ \
  --enable-tool web-search \
  --web-query "{{target_market}} {{industry}} market entry barriers regulatory requirements 2024" \
  --model gpt-4o

Template (market_entry.j2):

## Market Entry Analysis: {{target_market}} - {{industry}}

### Demographic and Market Data
**Market Demographics:**
{{ demographic_data.content }}

### Regulatory Environment
**Regulatory Requirements:**
{% for doc in regulatory_requirements %}
**Document:** {{ doc.name }}
{{ doc.content if doc.size < 40000 else "Large regulatory document - focus on key compliance requirements" }}
{% endfor %}

### Competitive Landscape Research
{% if web_search_results %}
**Current Market Intelligence:**
{{ web_search_results }}
{% endif %}

### Analysis Framework:
1. **Market Attractiveness**: Size, growth, profitability assessment
2. **Competitive Intensity**: Porter's Five Forces analysis
3. **Entry Barriers**: Regulatory, financial, operational obstacles
4. **Go-to-Market Strategy**: Channel analysis and market penetration approach
5. **Financial Projections**: Revenue forecasts and investment requirements
6. **Risk Assessment**: Market, operational, and regulatory risks

### Decision Framework:
Provide clear GO/NO-GO recommendation with supporting rationale and alternative strategies.

Schema (market_schema.json):

{
  "type": "object",
  "properties": {
    "market_attractiveness": {
      "type": "object",
      "properties": {
        "market_size_usd": {"type": "number"},
        "growth_rate": {"type": "number"},
        "profit_margin_potential": {"type": "number"},
        "market_maturity": {"type": "string", "enum": ["emerging", "growth", "mature", "declining"]},
        "attractiveness_score": {"type": "integer", "minimum": 1, "maximum": 10}
      },
      "required": ["market_size_usd", "growth_rate", "attractiveness_score"]
    },
    "competitive_analysis": {
      "type": "object",
      "properties": {
        "market_concentration": {"type": "string", "enum": ["fragmented", "moderate", "concentrated", "monopolistic"]},
        "key_competitors": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "company": {"type": "string"},
              "market_share": {"type": "number"},
              "competitive_advantages": {"type": "array", "items": {"type": "string"}},
              "vulnerabilities": {"type": "array", "items": {"type": "string"}}
            }
          }
        },
        "competitive_intensity": {"type": "integer", "minimum": 1, "maximum": 5}
      }
    },
    "entry_barriers": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "barrier_type": {"type": "string"},
          "severity": {"type": "string", "enum": ["low", "medium", "high", "prohibitive"]},
          "description": {"type": "string"},
          "mitigation_strategies": {"type": "array", "items": {"type": "string"}},
          "estimated_cost": {"type": "number"}
        },
        "required": ["barrier_type", "severity", "description"]
      }
    },
    "go_to_market_strategy": {
      "type": "object",
      "properties": {
        "recommended_channels": {"type": "array", "items": {"type": "string"}},
        "market_penetration_approach": {"type": "string"},
        "customer_acquisition_strategy": {"type": "string"},
        "pricing_strategy": {"type": "string"},
        "marketing_budget_estimate": {"type": "number"}
      }
    },
    "financial_projections": {
      "type": "object",
      "properties": {
        "year_1_revenue": {"type": "number"},
        "year_3_revenue": {"type": "number"},
        "break_even_timeline": {"type": "string"},
        "initial_investment_required": {"type": "number"},
        "roi_projection": {"type": "number"}
      },
      "required": ["year_1_revenue", "initial_investment_required"]
    },
    "risk_assessment": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "risk_category": {"type": "string"},
          "risk_description": {"type": "string"},
          "likelihood": {"type": "string", "enum": ["low", "medium", "high"]},
          "impact": {"type": "string", "enum": ["low", "medium", "high", "critical"]},
          "mitigation_plan": {"type": "string"}
        }
      }
    },
    "recommendation": {
      "type": "object",
      "properties": {
        "decision": {"type": "string", "enum": ["go", "no_go", "conditional_go", "delayed_entry"]},
        "confidence_level": {"type": "string", "enum": ["low", "medium", "high"]},
        "key_rationale": {"type": "array", "items": {"type": "string"}},
        "next_steps": {"type": "array", "items": {"type": "string"}},
        "alternative_strategies": {"type": "array", "items": {"type": "string"}}
      },
      "required": ["decision", "confidence_level", "key_rationale"]
    }
  },
  "required": ["market_attractiveness", "competitive_analysis", "entry_barriers", "recommendation"]
}
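
The `recommendation` object is designed to be machine-readable, so a script can decide whether a follow-up stage is worth running. The policy below (proceed on `go` or `conditional_go` with at least medium confidence) is an assumption made for illustration, not something ostruct prescribes.

```python
PROCEED_DECISIONS = {"go", "conditional_go"}   # assumed policy
CONFIDENCE_LEVELS = ["low", "medium", "high"]  # order matches the schema enum

def should_plan_implementation(analysis, min_confidence="medium"):
    """Return True if the analysis recommends entry with sufficient confidence."""
    rec = analysis["recommendation"]
    confident = (CONFIDENCE_LEVELS.index(rec["confidence_level"])
                 >= CONFIDENCE_LEVELS.index(min_confidence))
    return rec["decision"] in PROCEED_DECISIONS and confident

# Hypothetical analysis fragment, for illustration only
analysis = {
    "recommendation": {
        "decision": "conditional_go",
        "confidence_level": "medium",
        "key_rationale": ["Growing market", "Moderate regulatory burden"],
    }
}
print(should_plan_implementation(analysis))
```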

Step 2: Action Plan Generation

# Generate implementation roadmap
ostruct run implementation_plan.j2 plan_schema.json \
  --file prompt:analysis market_entry_results.json \
  --model gpt-4o

Best Practices for Complex Workflows

Template Design Principles:

  1. Structured Instructions: Provide clear, numbered requirements

  2. Context Awareness: Handle missing data gracefully

  3. Progressive Disclosure: Start broad, then drill into specifics

  4. Error Resilience: Include fallback strategies for data issues

Schema Design Principles:

  1. Business-Ready Output: Structure matches decision-making needs

  2. Validation Built-In: Use enums and constraints for data quality

  3. Extensible Design: Allow for future requirement additions

  4. Confidence Indicators: Include certainty levels for AI outputs

Workflow Orchestration:

  1. Multi-Stage Processing: Break complex analysis into digestible stages

  2. Tool Selection: Match tool capabilities to data types and complexity

  3. Quality Gates: Validate intermediate outputs before final processing

  4. Documentation: Maintain audit trail of analysis steps
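
A quality gate between stages can be as simple as confirming that every required field is present before the output is passed on. The following standard-library sketch walks a schema and a result side by side; it checks `required` only and is not a substitute for full JSON Schema validation. The helper name and sample data are hypothetical.

```python
def missing_required(data, schema, path=""):
    """Return dotted paths of required fields that are absent from data."""
    missing = []
    if schema.get("type") == "object" and isinstance(data, dict):
        for name in schema.get("required", []):
            if name not in data:
                missing.append(f"{path}{name}")
        for name, sub in schema.get("properties", {}).items():
            if name in data:
                missing += missing_required(data[name], sub, f"{path}{name}.")
    elif schema.get("type") == "array" and isinstance(data, list):
        for i, item in enumerate(data):
            missing += missing_required(item, schema.get("items", {}), f"{path}{i}.")
    return missing

# Hypothetical schema fragment and stage output, for illustration only
schema = {
    "type": "object",
    "properties": {
        "executive_summary": {"type": "string"},
        "risk_factors": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {"risk_type": {"type": "string"}, "severity": {"type": "string"}},
                "required": ["risk_type", "severity"],
            },
        },
    },
    "required": ["executive_summary", "risk_factors"],
}
stage_output = {"risk_factors": [{"risk_type": "liquidity"}]}
print(missing_required(stage_output, schema))
```

If the returned list is non-empty, the pipeline can stop or re-run the stage instead of feeding incomplete data forward.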

Integration with Data Science Tools

Pandas Integration

import json
import subprocess

import pandas as pd

# Process DataFrame with ostruct
def analyze_dataframe(df, analysis_template, schema_file):
    # Save DataFrame temporarily
    temp_file = 'temp_data.csv'
    df.to_csv(temp_file, index=False)

    # Run ostruct analysis
    result = subprocess.run([
        'ostruct', 'run', analysis_template, schema_file,
        '--file', 'ci:data', temp_file,
        '--output-file', 'temp_results.json'
    ], capture_output=True, text=True)

    # Fail loudly instead of reading a stale or missing results file
    if result.returncode != 0:
        raise RuntimeError(f'ostruct failed: {result.stderr}')

    # Load results
    with open('temp_results.json', 'r') as f:
        return json.load(f)

# Example usage
df = pd.read_csv('sales_data.csv')
insights = analyze_dataframe(df, 'sales_analysis.j2', 'sales_schema.json')

Matplotlib/Seaborn Integration

# Generate visualization specifications with ostruct
viz_template = '''---
system_prompt: You are a data visualization expert. Generate matplotlib/seaborn code specifications.
---
Create visualization specifications for this dataset:
{{ data.content }}

Generate specifications for the most insightful charts to show patterns, distributions, and relationships.
'''

# Write the template to disk so ostruct can read it
with open('viz_template.j2', 'w') as f:
    f.write(viz_template)

# Use ostruct to generate viz specs, then create plots
viz_specs = analyze_dataframe(df, 'viz_template.j2', 'viz_schema.json')

# Execute generated visualization code.
# exec() runs model-generated code with full privileges: review it first.
for viz in viz_specs['visualizations']:
    exec(viz['matplotlib_code'])
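
Running model-generated code through `exec` deserves a screening step. The sketch below parses each snippet with the standard `ast` module and reports imports outside an allowlist. The allowlist is an assumption to adjust for your environment, and this is a coarse filter rather than a sandbox: code can still reach `open`, `__import__`, and similar built-ins, so manual review remains advisable.

```python
import ast

# Assumed allowlist; adjust to the plotting stack you actually use
ALLOWED_IMPORTS = {"matplotlib", "seaborn", "numpy", "pandas"}

def disallowed_imports(code):
    """Return top-level module names imported by `code` that are not allowlisted.

    Raises SyntaxError if the generated code does not parse.
    """
    tree = ast.parse(code)
    found = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            found.add(node.module.split(".")[0])
    return sorted(found - ALLOWED_IMPORTS)

snippet = "import matplotlib.pyplot as plt\nimport os\nfrom subprocess import run\n"
print(disallowed_imports(snippet))
```

A snippet would only be passed to `exec` when the returned list is empty.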

Next Steps

Getting Started:

  1. Set up ostruct in your notebook environment

  2. Try the basic data extraction example

  3. Experiment with multi-tool workflows

  4. Adapt schemas for your specific use cases

Advanced Usage:

See Also