Data Sources
DataQL supports multiple data sources, including local files, HTTP/HTTPS URLs, standard input, cloud object storage (Amazon S3, Google Cloud Storage, Azure Blob Storage), Amazon DynamoDB, and message queues (AWS SQS, Apache Kafka).
Local Files
Load data from local files by providing the file path:
# Absolute path
dataql run -f /home/user/data/sales.csv -q "SELECT * FROM sales"
# Relative path
dataql run -f ./data/users.json -q "SELECT * FROM users"
Supported Formats
| Format | Extensions | Description |
|---|---|---|
| CSV | .csv | Comma-separated values |
| JSON | .json | JSON arrays or single objects |
| JSONL | .jsonl, .ndjson | Newline-delimited JSON |
| XML | .xml | XML documents |
| YAML | .yaml, .yml | YAML documents |
| Parquet | .parquet | Apache Parquet columnar format |
| Excel | .xlsx, .xls | Microsoft Excel spreadsheets |
| Avro | .avro | Apache Avro format |
| ORC | .orc | Apache ORC format |
HTTP/HTTPS URLs
Query data directly from web URLs:
# CSV from URL
dataql run -f "https://example.com/data.csv" -q "SELECT * FROM data"
# JSON from API
dataql run -f "https://api.example.com/users.json" -q "SELECT name, email FROM users"
# GitHub raw files
dataql run -f "https://raw.githubusercontent.com/user/repo/main/data.csv" -q "SELECT * FROM data"
URL Examples
# Query public dataset
dataql run \
-f "https://raw.githubusercontent.com/datasets/population/main/data/population.csv" \
-q "SELECT Country_Name, Value FROM population WHERE Year = 2020 ORDER BY Value DESC LIMIT 10"
# Query JSON API
dataql run \
-f "https://jsonplaceholder.typicode.com/posts" \
-q "SELECT id, title FROM posts WHERE userId = 1"
Standard Input (stdin)
Read data from stdin using - as the file path:
# Pipe CSV data
cat data.csv | dataql run -f - -q "SELECT * FROM stdin"
# Pipe JSON data
echo '[{"name":"Alice","age":30},{"name":"Bob","age":25}]' | dataql run -f - -q "SELECT * FROM stdin"
# Combine with curl
curl -s "https://api.example.com/data.json" | dataql run -f - -q "SELECT * FROM stdin"
# Process command output
ps aux | dataql run -f - -d " " -q "SELECT * FROM stdin LIMIT 10"
stdin with Different Formats
The format is auto-detected from the content:
# JSON from stdin
echo '{"users":[{"name":"Alice"},{"name":"Bob"}]}' | dataql run -f - -q "SELECT * FROM stdin"
# CSV from stdin
echo -e "id,name\n1,Alice\n2,Bob" | dataql run -f - -q "SELECT * FROM stdin"
Amazon S3
Query data stored in Amazon S3 buckets.
Configuration
Set AWS credentials via environment variables:
export AWS_ACCESS_KEY_ID="your-access-key"
export AWS_SECRET_ACCESS_KEY="your-secret-key"
export AWS_REGION="us-east-1"
Or use AWS CLI configuration:
aws configure
Usage
# S3 URL format
dataql run -f "s3://bucket-name/path/to/file.csv" -q "SELECT * FROM file"
# Query S3 object
dataql run -f "s3://my-data-bucket/sales/2024/january.csv" -q "
SELECT product, SUM(amount) as total
FROM january
GROUP BY product
ORDER BY total DESC
"
# S3 with different formats
dataql run -f "s3://my-bucket/data.json" -q "SELECT * FROM data"
dataql run -f "s3://my-bucket/data.parquet" -q "SELECT * FROM data"
S3 Configuration Options
You can configure S3 access with additional options:
| Environment Variable | Description |
|---|---|
| AWS_ACCESS_KEY_ID | AWS access key ID |
| AWS_SECRET_ACCESS_KEY | AWS secret access key |
| AWS_SESSION_TOKEN | Session token (for temporary credentials) |
| AWS_REGION | AWS region (e.g., us-east-1) |
| AWS_ENDPOINT_URL | Custom endpoint (for S3-compatible storage) |
S3-Compatible Storage
DataQL works with S3-compatible storage services such as MinIO and DigitalOcean Spaces:
# MinIO
export AWS_ENDPOINT_URL="http://localhost:9000"
dataql run -f "s3://my-bucket/data.csv" -q "SELECT * FROM data"
# DigitalOcean Spaces
export AWS_ENDPOINT_URL="https://nyc3.digitaloceanspaces.com"
dataql run -f "s3://my-space/data.csv" -q "SELECT * FROM data"
Google Cloud Storage (GCS)
Query data stored in Google Cloud Storage buckets.
Configuration
Set up authentication using a service account:
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"
Or use Application Default Credentials:
gcloud auth application-default login
Usage
# GCS URL format
dataql run -f "gs://bucket-name/path/to/file.csv" -q "SELECT * FROM file"
# Query GCS object
dataql run -f "gs://my-data-bucket/analytics/events.jsonl" -q "
SELECT event_type, COUNT(*) as count
FROM events
GROUP BY event_type
"
# GCS with different formats
dataql run -f "gs://my-bucket/data.parquet" -q "SELECT * FROM data"
GCS Configuration
| Environment Variable | Description |
|---|---|
| GOOGLE_APPLICATION_CREDENTIALS | Path to service account JSON file |
| GOOGLE_CLOUD_PROJECT | GCP project ID (optional) |
Service Account Setup
- Create a service account in GCP Console
- Download the JSON key file
- Set the environment variable:
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"
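If the service account does not exist yet, it can be created from the command line; a minimal gcloud/gsutil sketch where the account name dataql-reader, project my-project, and bucket my-bucket are placeholders:
# Create the service account and download a JSON key
gcloud iam service-accounts create dataql-reader --project my-project
gcloud iam service-accounts keys create /path/to/service-account.json \
  --iam-account dataql-reader@my-project.iam.gserviceaccount.com
# Grant read access to the bucket you want to query
gsutil iam ch serviceAccount:dataql-reader@my-project.iam.gserviceaccount.com:roles/storage.objectViewer gs://my-bucket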
Azure Blob Storage
Query data stored in Azure Blob Storage containers.
Configuration
Set Azure credentials via environment variables:
export AZURE_STORAGE_ACCOUNT="your-storage-account"
export AZURE_STORAGE_KEY="your-storage-key"
Or use Azure CLI:
az login
Usage
# Azure Blob URL format
dataql run -f "az://container-name/path/to/file.csv" -q "SELECT * FROM file"
# Alternative format
dataql run -f "azure://container-name/path/to/file.csv" -q "SELECT * FROM file"
# Query Azure blob
dataql run -f "az://mycontainer/data/users.json" -q "
SELECT name, email
FROM users
WHERE active = true
"
Azure Configuration
| Environment Variable | Description |
|---|---|
| AZURE_STORAGE_ACCOUNT | Storage account name |
| AZURE_STORAGE_KEY | Storage account access key |
| AZURE_STORAGE_CONNECTION_STRING | Connection string (alternative) |
| AZURE_STORAGE_SAS_TOKEN | SAS token for limited access |
Using Connection String
export AZURE_STORAGE_CONNECTION_STRING="DefaultEndpointsProtocol=https;AccountName=myaccount;AccountKey=mykey;EndpointSuffix=core.windows.net"
dataql run -f "az://mycontainer/data.csv" -q "SELECT * FROM data"
Using SAS Token
export AZURE_STORAGE_ACCOUNT="myaccount"
export AZURE_STORAGE_SAS_TOKEN="?sv=2021-06-08&ss=b&srt=co&sp=rl..."
dataql run -f "az://mycontainer/data.csv" -q "SELECT * FROM data"
Amazon DynamoDB
Query data from DynamoDB tables.
Configuration
Set AWS credentials via environment variables:
export AWS_ACCESS_KEY_ID="your-access-key"
export AWS_SECRET_ACCESS_KEY="your-secret-key"
export AWS_REGION="us-east-1"
Or use AWS CLI configuration:
aws configure
URL Format
# Basic format: region/table-name
dynamodb://us-east-1/my-table
# With custom endpoint (LocalStack, local DynamoDB)
dynamodb://us-east-1/my-table?endpoint=http://localhost:8000
Usage Examples
# Query DynamoDB table
dataql run -f "dynamodb://us-east-1/users" \
-q "SELECT name, email FROM users WHERE age > 30"
# Export to CSV
dataql run -f "dynamodb://us-east-1/orders" \
-q "SELECT * FROM orders" \
-e output.csv -t csv
# With LocalStack endpoint
dataql run -f "dynamodb://us-east-1/test-table?endpoint=http://localhost:4566" \
-q "SELECT * FROM test_table"
# Custom table name
dataql run -f "dynamodb://us-east-1/my-data-table" -c my_table \
-q "SELECT * FROM my_table"
Environment Variables
| Variable | Description |
|---|---|
| AWS_ACCESS_KEY_ID | AWS access key ID |
| AWS_SECRET_ACCESS_KEY | AWS secret access key |
| AWS_REGION | AWS region (e.g., us-east-1) |
| AWS_ENDPOINT_URL | Custom endpoint URL (for LocalStack) |
| AWS_ENDPOINT_URL_DYNAMODB | DynamoDB-specific custom endpoint |
How It Works
- DataQL scans the DynamoDB table using the AWS SDK
- The schema is inferred from the first item in the table
- Data is loaded into an in-memory SQLite database
- You can then query the data using standard SQL syntax
Limitations
- DynamoDB-specific query operations (KeyConditions, FilterExpressions) are not supported
- All data is loaded into memory (use --lines to limit rows)
- Schema is inferred from the first item; tables with inconsistent schemas may have missing columns
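Because the whole table is scanned into memory, it helps to cap the number of items while exploring; a sketch using the --lines flag mentioned above (region and table name are placeholders):
dataql run -f "dynamodb://us-east-1/users" --lines 500 \
  -q "SELECT * FROM users LIMIT 10"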
Message Queues
Query messages from message queues without consuming/deleting them. Perfect for troubleshooting and debugging.
Supported Systems
| System | URL Prefix | Status |
|---|---|---|
| AWS SQS | sqs:// | ✅ Supported |
| Apache Kafka | kafka:// | ✅ Supported |
| RabbitMQ | rabbitmq:// | Coming soon |
| Apache Pulsar | pulsar:// | Coming soon |
| Google Pub/Sub | pubsub:// | Coming soon |
AWS SQS
Query SQS messages without deleting them (uses VisibilityTimeout=0).
Configuration
Set AWS credentials via environment variables:
export AWS_ACCESS_KEY_ID="your-access-key"
export AWS_SECRET_ACCESS_KEY="your-secret-key"
export AWS_REGION="us-east-1"
URL Formats
# Simple format: queue name + region
sqs://my-queue-name?region=us-east-1
# Full AWS URL format (region auto-detected)
sqs://https://sqs.us-east-1.amazonaws.com/123456789/my-queue
# With options
sqs://my-queue?region=us-east-1&max_messages=50&wait_time=5
URL Parameters
| Parameter | Description | Default |
|---|---|---|
| region | AWS region (required for simple format) | - |
| max_messages | Maximum messages to retrieve | 10 |
| wait_time | Long polling wait time in seconds (0-20) | 0 |
Usage Examples
# Preview messages from a queue
dataql run -f "sqs://my-events-queue?region=us-east-1" \
-q "SELECT * FROM my_events_queue LIMIT 10"
# Analyze message types
dataql run -f "sqs://events-queue?region=us-east-1&max_messages=100" \
-q "SELECT body_event_type, COUNT(*) as count FROM events_queue GROUP BY body_event_type"
# Filter error messages
dataql run -f "sqs://error-queue?region=us-east-1" \
-q "SELECT message_id, body_error_message FROM error_queue WHERE body_status = 'error'"
# Check messages with timestamps
dataql run -f "sqs://orders-queue?region=us-east-1" \
-q "SELECT message_id, timestamp, body_order_id FROM orders_queue ORDER BY timestamp DESC"
Generated Table Schema
Messages are imported into a table with these columns:
| Column | Type | Description |
|---|---|---|
| message_id | TEXT | SQS message ID |
| source | TEXT | Queue URL |
| timestamp | TEXT | When the message was sent |
| timestamp_unix | TEXT | Unix timestamp |
| receive_count | TEXT | Times the message was received |
| body | TEXT | Raw message body |
| body_* | TEXT | Flattened JSON fields (if body is JSON) |
| meta_* | TEXT | Message attributes |
For JSON message bodies, fields are automatically flattened with a body_ prefix:
// Message body
{"event": "order_created", "user": {"id": 123, "name": "Alice"}}
// Becomes columns
body_event = "order_created"
body_user_id = "123"
body_user_name = "Alice"
Best Practices
- Start with a low max_messages for exploration
- Use wait_time on empty queues to wait for messages
- Messages are NOT deleted - safe for troubleshooting
- Filter with SQL instead of retrieving all messages
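Putting these together, a peek sketch that combines a small max_messages, a short wait_time, and a SQL filter (queue and field names are placeholders):
dataql run -f "sqs://my-queue?region=us-east-1&max_messages=20&wait_time=5" \
  -q "SELECT message_id, body_status FROM my_queue WHERE body_status = 'error'"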
Apache Kafka
Query Kafka topic messages without committing offsets (peek mode).
Configuration
Set Kafka connection via environment variables:
export KAFKA_BROKERS="localhost:9092"
# For authenticated clusters
export KAFKA_SASL_USERNAME="your-username"
export KAFKA_SASL_PASSWORD="your-password"
export KAFKA_SASL_MECHANISM="PLAIN" # or SCRAM-SHA-256, SCRAM-SHA-512
URL Formats
# Simple format: broker/topic
kafka://localhost:9092/my-topic
# With consumer group
kafka://localhost:9092/my-topic?group=my-consumer-group
# With options
kafka://localhost:9092/my-topic?group=dataql-reader&max_messages=100
URL Parameters
| Parameter | Description | Default |
|---|---|---|
| group | Consumer group ID | dataql-reader |
| max_messages | Maximum messages to retrieve | 100 |
| offset | Start offset (newest, oldest) | newest |
Usage Examples
# Preview messages from a topic
dataql run -f "kafka://localhost:9092/events" \
-q "SELECT * FROM events LIMIT 10"
# Analyze message types
dataql run -f "kafka://localhost:9092/orders?max_messages=1000" \
-q "SELECT body_order_type, COUNT(*) as count FROM orders GROUP BY body_order_type"
# Filter error events
dataql run -f "kafka://localhost:9092/logs" \
-q "SELECT body_level, body_message FROM logs WHERE body_level = 'ERROR'"
# Export messages to file
dataql run -f "kafka://localhost:9092/events" \
-q "SELECT * FROM events" \
-e events_backup.jsonl -t jsonl
Generated Table Schema
Messages are imported into a table with these columns:
| Column | Type | Description |
|---|---|---|
| key | TEXT | Kafka message key |
| partition | TEXT | Partition number |
| offset | TEXT | Message offset |
| timestamp | TEXT | Message timestamp |
| body | TEXT | Raw message body |
| body_* | TEXT | Flattened JSON fields (if body is JSON) |
For JSON message bodies, fields are automatically flattened with a body_ prefix:
// Message body
{"event": "user_signup", "user": {"id": 123, "email": "alice@example.com"}}
// Becomes columns
body_event = "user_signup"
body_user_id = "123"
body_user_email = "alice@example.com"
Peek Mode (Non-Destructive)
Kafka reads in DataQL use peek mode:
- Messages are read but offsets are NOT committed
- Multiple reads return the same messages
- Safe for troubleshooting and debugging
- Does not affect other consumers
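Because offsets are never committed, repeating the same read should return the same rows; a sketch assuming a local events topic, reading from the oldest offset:
dataql run -f "kafka://localhost:9092/events?offset=oldest&max_messages=20" \
  -q "SELECT timestamp, body FROM events LIMIT 10"
# Running the same command again returns the same messages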
MCP Tool
When DataQL is used with LLMs via MCP, the dataql_mq_peek tool is available:
{
"name": "dataql_mq_peek",
"arguments": {
"source": "sqs://my-queue?region=us-east-1",
"max_messages": 20,
"query": "SELECT * FROM my_queue WHERE body_status = 'error'"
}
}
Also works with Kafka:
{
"name": "dataql_mq_peek",
"arguments": {
"source": "kafka://localhost:9092/events",
"max_messages": 50,
"query": "SELECT body_event_type, COUNT(*) as count FROM events GROUP BY body_event_type"
}
}
Multiple Sources
You can query data from multiple sources and join them:
# Join local file with S3 data
dataql run \
-f users.csv \
-f "s3://my-bucket/orders.csv" \
-q "SELECT u.name, o.amount FROM users u JOIN orders o ON u.id = o.user_id"
# Join data from different cloud providers
dataql run \
-f "s3://aws-bucket/products.csv" \
-f "gs://gcp-bucket/inventory.csv" \
-q "SELECT p.name, i.quantity FROM products p JOIN inventory i ON p.sku = i.sku"
Best Practices
Performance
- Use Parquet for large datasets: Parquet is columnar and compressed, making it faster for analytical queries.
dataql run -f "s3://bucket/large-data.parquet" -q "SELECT column1, column2 FROM large_data"
- Limit records for exploration:
dataql run -f "s3://bucket/huge-file.csv" -l 1000 -q "SELECT * FROM huge_file"
- Use JSONL for streaming: JSONL can be processed line by line, keeping memory usage low (see the sketch below).
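A minimal sketch of that pattern, assuming a hypothetical events.jsonl log file with a level field:
dataql run -f ./logs/events.jsonl -q "SELECT level, COUNT(*) as count FROM events GROUP BY level"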
Security
- Use IAM roles in AWS: Instead of access keys, use IAM roles when running on EC2/ECS.
- Use workload identity in GCP: Configure workload identity for GKE workloads.
- Use managed identities in Azure: Use managed identities for Azure VMs and AKS.
- Never commit credentials: Use environment variables or secret managers.
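For the last point, one option is to pull credentials from a secret manager at runtime instead of storing them in files; a sketch using AWS Secrets Manager with hypothetical secret names:
export AWS_ACCESS_KEY_ID="$(aws secretsmanager get-secret-value --secret-id dataql/access-key-id --query SecretString --output text)"
export AWS_SECRET_ACCESS_KEY="$(aws secretsmanager get-secret-value --secret-id dataql/secret-access-key --query SecretString --output text)"
dataql run -f "s3://my-bucket/data.csv" -q "SELECT * FROM data"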
Troubleshooting
S3 Access Denied:
# Check credentials
aws sts get-caller-identity
# Check bucket permissions
aws s3 ls s3://bucket-name/
GCS Permission Denied:
# Check authentication
gcloud auth list
# Check bucket access
gsutil ls gs://bucket-name/
Azure Storage Error:
# Check account
az storage account show --name myaccount
# List containers
az storage container list --account-name myaccount