# LLM Integration Guide
DataQL enables Large Language Models (LLMs) to efficiently query and transform data without loading entire files into context, dramatically reducing token consumption.
## The Problem
LLMs are powerful, but they have a fundamental limitation: context windows. When you need to analyze a data file, you face tough choices:
- **Send the entire file:** A 10MB CSV consumes ~100,000+ tokens. At $3 per million tokens, that’s $0.30+ per query. And many files simply won’t fit.
- **Sample the data:** You might miss important patterns. “Show me the first 100 rows” rarely answers real questions.
- **Pre-process externally:** Break the conversational flow. Write scripts. Lose the interactive nature of LLM assistance.
## The Solution
DataQL acts as a data access layer for LLMs. Instead of sending raw data, the LLM sends SQL queries and receives only the results it needs:
```
┌─────────────────┐                       ┌─────────────────┐
│      LLM        │   "SELECT region,     │     DataQL      │
│                 │    SUM(sales)         │                 │
│ "Analyze sales  │    FROM sales"        │  ┌───────────┐  │
│  by region"     │  ──────────────────►  │  │ 10GB CSV  │  │
│                 │                       │  └───────────┘  │
│                 │  ◄──────────────────  │                 │
│                 │   {North: $1.2M,      │  Query Result:  │
│                 │    South: $980K}      │  2 rows, 50     │
│                 │                       │  tokens         │
└─────────────────┘                       └─────────────────┘
```
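In CLI terms, the round trip above is a single command; only the two-row result ever enters the model’s context. A sketch, assuming the 10GB file is exposed as `sales.csv`:

```bash
# The LLM sends SQL; DataQL returns just the aggregated rows
dataql run -f sales.csv -q "SELECT region, SUM(sales) AS total FROM sales GROUP BY region"
```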
## Why This Matters
| Scenario | Without DataQL | With DataQL |
|---|---|---|
| Analyze 10MB CSV | ~100,000 tokens ($3+) | ~500 tokens ($0.01) |
| Find specific records | Load everything, filter in context | SQL WHERE clause |
| Aggregate statistics | Manual calculation by LLM | SQL SUM, AVG, COUNT |
| Join multiple files | Impossible in context | Single SQL JOIN |
| Query cloud storage | Download → Script → Parse | One command |
| Repeated analysis | Re-send file each time | Query as needed |
### Real Numbers
- 10MB CSV file: ~50,000 rows, ~100,000 tokens
- Simple query result: 10 rows, ~200 tokens
- Token reduction: 99.8%
- Cost reduction: From $0.30 to $0.0006 per query
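These figures are easy to sanity-check. A quick back-of-the-envelope calculation, using the same $3-per-million-token rate as above:

```bash
# Cost of ~100,000 tokens (whole file) vs. ~200 tokens (query result) at $3/M tokens
awk 'BEGIN {
  rate = 3 / 1000000                                      # dollars per token
  printf "full file: $%.4f\n", 100000 * rate              # $0.3000
  printf "query:     $%.4f\n", 200 * rate                 # $0.0006
  printf "reduction: %.1f%%\n", (1 - 200 / 100000) * 100  # 99.8%
}'
```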
## Quick Start
### 1. Install DataQL Skills (Recommended for Claude Code)
```bash
dataql skills install
```
This installs skills that teach Claude Code how to use DataQL automatically.
### 2. Use in Claude Code
Just describe what you need in natural language:
- “Query users.csv and show users older than 30”
- “Join orders.json with customers.csv by customer_id”
- “Get total revenue from sales.parquet grouped by region”
- “Count how many records are in this CSV file”
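Behind the scenes, the skills turn requests like these into DataQL commands. The first request above, for instance, might translate to something like this (assuming `users.csv` has an `age` column):

```bash
dataql run -f users.csv -q "SELECT * FROM users WHERE age > 30"
```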
### 3. Or Use MCP Server (For Any LLM)
```bash
dataql mcp serve
```
Configure your LLM client to connect to the MCP server (see MCP Setup Guide).
## Integration Methods
### Method 1: MCP Server (Best for Automation)
The Model Context Protocol (MCP) provides native integration with LLMs:
**Supported Clients:**
- Claude Code (Anthropic)
- OpenAI Codex
- Google Gemini CLI
- Any MCP-compatible client
**Configuration:** See MCP Setup Guide
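Before wiring up a client, you can smoke-test the server by piping a raw JSON-RPC request into it over stdio (the same check appears under Troubleshooting below):

```bash
# Should print the list of available DataQL tools as a JSON-RPC response
echo '{"jsonrpc":"2.0","method":"tools/list","id":1}' | dataql mcp serve
```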
### Method 2: Skills (Best for Interactive Use in Claude Code)
Skills are markdown files that teach Claude how to use DataQL:
```bash
# Install to global directory (all projects)
dataql skills install --global

# Install to current project only
dataql skills install --project

# List installed skills
dataql skills list
```
### Method 3: Direct CLI (Universal - Any LLM with Bash)
Any LLM with shell access can use DataQL directly:
```bash
# Query a file
dataql run -f data.csv -q "SELECT * FROM data WHERE status='active'"

# Get schema
dataql run -f data.csv -q ".schema data"

# Join multiple files
dataql run -f users.csv -f orders.json -q "SELECT u.name, o.total FROM users u JOIN orders o ON u.id = o.user_id"
```
## Token Efficiency Patterns
### Pattern 1: Schema-First Approach
Always understand the data structure before running complex queries:
```bash
# Step 1: Get schema (minimal tokens)
dataql run -f data.csv -q ".schema data"

# Step 2: Query specific columns
dataql run -f data.csv -q "SELECT name, email FROM data WHERE status='active'"
```
### Pattern 2: Aggregation Pipeline
Instead of returning all rows, use aggregations:
```bash
# Bad: Returns potentially thousands of rows
dataql run -f sales.csv -q "SELECT * FROM sales"

# Good: Returns summary statistics
dataql run -f sales.csv -q "SELECT category, COUNT(*) as count, SUM(amount) as total FROM sales GROUP BY category"
```
### Pattern 3: Incremental Refinement
Start broad and narrow down:
```bash
# Step 1: Understand the data
dataql run -f logs.json -q "SELECT DISTINCT level FROM logs"

# Step 2: Count by category
dataql run -f logs.json -q "SELECT level, COUNT(*) FROM logs GROUP BY level"

# Step 3: Get specific records
dataql run -f logs.json -q "SELECT * FROM logs WHERE level='error' LIMIT 20"
```
### Pattern 4: Sampling Large Datasets
Preview data before processing:
```bash
# Preview first 5 rows
dataql run -f large_dataset.parquet -q "SELECT * FROM large_dataset LIMIT 5"

# Get total count
dataql run -f large_dataset.parquet -q "SELECT COUNT(*) FROM large_dataset"
```
## MCP Tools Reference
When using the MCP server, these tools are available:
### `dataql_query`
Execute SQL queries on any data source.
**Parameters:**
- `source` (required): File path, URL, S3 URI, or database connection string
- `query` (required): SQL query to execute
- `delimiter` (optional): CSV delimiter character
**Example:**
```json
{
  "name": "dataql_query",
  "arguments": {
    "source": "users.csv",
    "query": "SELECT * FROM users WHERE age > 30 LIMIT 10"
  }
}
```
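For files that use a non-comma delimiter, the optional `delimiter` parameter rides along in the same call. A sketch, with a hypothetical semicolon-separated `data.csv`:

```json
{
  "name": "dataql_query",
  "arguments": {
    "source": "data.csv",
    "query": "SELECT * FROM data LIMIT 5",
    "delimiter": ";"
  }
}
```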
### `dataql_schema`
Get the structure of a data source.
**Parameters:**
- `source` (required): File path, URL, or database connection string
**Example:**
```json
{
  "name": "dataql_schema",
  "arguments": {
    "source": "sales.json"
  }
}
```
### `dataql_preview`
Preview first N rows of data.
**Parameters:**
- `source` (required): Data source
- `limit` (optional): Number of rows (default: 5, max: 100)
**Example:**
```json
{
  "name": "dataql_preview",
  "arguments": {
    "source": "data.parquet",
    "limit": 10
  }
}
```
### `dataql_aggregate`
Perform aggregation operations.
**Parameters:**
- `source` (required): Data source
- `column` (required): Column to aggregate
- `operation` (required): One of `count`, `sum`, `avg`, `min`, `max`
- `group_by` (optional): Column to group by
**Example:**
```json
{
  "name": "dataql_aggregate",
  "arguments": {
    "source": "sales.csv",
    "column": "amount",
    "operation": "sum",
    "group_by": "category"
  }
}
```
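This call mirrors the aggregation query from Pattern 2; over the direct CLI, the equivalent would presumably be:

```bash
dataql run -f sales.csv -q "SELECT category, SUM(amount) AS total FROM sales GROUP BY category"
```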
## Platform-Specific Guides
### Claude Code
- **Install Skills:**

  ```bash
  dataql skills install --global
  ```

- **Use Naturally:** Just ask Claude to analyze your data files. The skills teach Claude when and how to use DataQL.

- **Or Configure MCP:** Add to `~/.claude/settings.json`:

  ```json
  {
    "mcpServers": {
      "dataql": {
        "type": "stdio",
        "command": "dataql",
        "args": ["mcp", "serve"]
      }
    }
  }
  ```
### OpenAI Codex
Add to `~/.codex/config.toml`:
```toml
[mcp.servers.dataql]
transport = "stdio"
command = "dataql"
args = ["mcp", "serve"]
```
### Google Gemini CLI
Add to your Gemini CLI configuration:
```yaml
mcpServers:
  dataql:
    command: dataql
    args: ["mcp", "serve"]
```
### Other LLMs
For LLMs with shell/bash access, use direct CLI commands:
```bash
# Set up an alias for easier use
alias dq='dataql run'

# Query files
dq -f data.csv -q "SELECT * FROM data WHERE amount > 100"
```
## Real-World Examples
### Example 1: Analyze Sales Data
```bash
# 1. Understand the data
dataql run -f sales.csv -q ".schema sales"

# 2. Get overview
dataql run -f sales.csv -q "SELECT COUNT(*) as total_transactions, SUM(amount) as revenue FROM sales"

# 3. Top products
dataql run -f sales.csv -q "SELECT product, SUM(amount) as total FROM sales GROUP BY product ORDER BY total DESC LIMIT 10"

# 4. Sales by region
dataql run -f sales.csv -q "SELECT region, COUNT(*) as orders, AVG(amount) as avg_order FROM sales GROUP BY region"
```
### Example 2: Join Customer and Order Data
```bash
dataql run -f customers.csv -f orders.json -q "
SELECT
  c.name,
  c.email,
  COUNT(o.id) as order_count,
  SUM(o.total) as total_spent
FROM customers c
LEFT JOIN orders o ON c.id = o.customer_id
GROUP BY c.id, c.name, c.email
ORDER BY total_spent DESC
LIMIT 20"
```
### Example 3: Filter and Export
```bash
# Find active users and export
dataql run -f users.json -q "SELECT id, name, email FROM users WHERE status = 'active'" -e active_users.csv -t csv

# Export as JSONL for processing
dataql run -f data.parquet -q "SELECT * FROM data WHERE region = 'US'" -e us_data.jsonl -t jsonl
```
### Example 4: Query Remote Data
```bash
# Query from URL
dataql run -f "https://api.example.com/data.json" -q "SELECT * FROM data LIMIT 10"

# Query from S3
dataql run -f "s3://my-bucket/analytics/sales.parquet" -q "SELECT * FROM sales WHERE year = 2024"

# Query from database
dataql run -f "postgres://user:pass@localhost/db?table=users" -q "SELECT * FROM users WHERE active = true"
```
## Best Practices
- Always use `LIMIT` for exploratory queries
- Select specific columns instead of `SELECT *`
- Use aggregations when you need summaries
- Check the schema first before running complex queries
- Export large results to files instead of returning them into context
- Use JOINs to combine data from multiple sources
- Filter early with `WHERE` clauses to reduce data (a combined example follows)
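A single query can apply several of these practices at once. A sketch, assuming a `sales.csv` with `region`, `amount`, and `year` columns:

```bash
# Specific columns + early WHERE filter + aggregation + LIMIT
dataql run -f sales.csv -q "SELECT region, SUM(amount) AS total FROM sales WHERE year = 2024 GROUP BY region ORDER BY total DESC LIMIT 10"
```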
## Troubleshooting
### Skills Not Activating in Claude Code
- Check installation: `dataql skills list`
- Restart Claude Code
- Run with debug: `claude --debug`
### MCP Server Not Connecting
- Verify DataQL is installed: `dataql --version`
- Test MCP server: `echo '{"jsonrpc":"2.0","method":"tools/list","id":1}' | dataql mcp serve`
- Check client configuration
### Query Errors
- Check schema: `dataql run -f file.csv -q ".schema tablename"`
- Verify column names match exactly
- For CSV, check delimiter: `dataql run -f file.csv -d ";" -q "..."`
## See Also
- MCP Setup Guide - Detailed MCP configuration
- CLI Reference - Complete command reference
- Data Sources - Working with S3, URLs, databases
- Examples - More usage examples