v2.4 — Model streaming now live

Deploy ML models

The inference platform that handles optimization, scaling, and global distribution. Push your model, get an endpoint.

Free tier included. No credit card required.

cortex-cli v2.4.1

How it works

Three steps to production

01

Upload Model

Push any PyTorch, TensorFlow, or ONNX model via CLI, SDK, or drag-and-drop. We handle containerization.

02

Auto-Optimize

Automatic quantization, graph optimization, and hardware-specific compilation. Zero config required.

03

Deploy Globally

One command deploys to 40+ edge regions with auto-scaling, health checks, and zero-downtime rollouts.

Platform

Built for scale

Observability

Real-time monitoring

Track latency, throughput, error rates, and cost per inference across all deployed models. Set alerts, compare experiments, and drill into individual requests.

Requests / min: 12,847
p50: 32ms
p99: 58ms
errors: 0.02%
Edge

Global edge network

Automatic geo-routing to 40+ regions. Lowest latency for every request, everywhere.

us-east-1: 12ms
eu-west-1: 28ms
ap-south-1: 34ms
us-west-2: 67ms
SDK

Auto-generated SDKs

Type-safe clients generated from your model schema. Python, TypeScript, Go, Rust.

// Auto-generated SDK
const res = await cortex.infer({
  model: "my-model",
  input: "..."
})
Security

Enterprise-grade security

SOC 2 Type II certified. End-to-end encryption in transit and at rest, RBAC, audit logs, VPC peering, and air-gapped deployment options.

SOC 2
HIPAA
GDPR
E2E Encryption
RBAC
Audit Logs

Performance

32ms p50 latency.
40+ regions. Zero config.

Our inference engine is optimized at every layer, from custom CUDA kernels to smart request batching. Serve millions of requests with sub-50ms median latency across 40+ global regions.


Inference latency (lower is better)

Cortex: 32ms
Provider A: 110ms
Provider B: 88ms
Self-hosted: 195ms
Throughput: 12,847 req/s
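The "smart request batching" mentioned above is a standard throughput technique: buffer incoming requests and flush them to the accelerator together, either when the batch fills or when a small time budget expires. A minimal sketch of the general idea, with hypothetical names and thresholds, not Cortex's actual engine:

```python
# Minimal sketch of dynamic request batching: flush when the batch
# is full or the wait deadline passes, amortizing per-call overhead.
import time

class Batcher:
    def __init__(self, max_batch=8, max_wait_ms=5):
        self.max_batch = max_batch
        self.max_wait = max_wait_ms / 1000
        self.buffer = []
        self.deadline = None

    def submit(self, request):
        """Add a request; return a full batch if one is ready, else None."""
        if not self.buffer:
            self.deadline = time.monotonic() + self.max_wait
        self.buffer.append(request)
        if len(self.buffer) >= self.max_batch or time.monotonic() >= self.deadline:
            return self.flush()
        return None

    def flush(self):
        batch, self.buffer = self.buffer, []
        return batch  # in a real server, this batch runs on the GPU together

batcher = Batcher(max_batch=4)
results = [batcher.submit(i) for i in range(4)]
print(results[-1])  # the 4th submit fills the batch: [0, 1, 2, 3]
```

Tuning the batch size against the wait budget is the core trade-off: larger batches raise throughput, while tighter deadlines protect tail latency.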

Developer experience

Ship in minutes

The Python SDK that gets out of your way.

app.py
Python
import cortex

# Initialize the client
client = cortex.Client("ctx_sk_...")

# Deploy a model
deployment = client.deploy(
  model="./models/classifier-v3",
  gpu="a100",
  replicas=3,
  regions=["us-east", "eu-west", "ap-south"]
)

# Run inference
result = deployment.predict(
  input={"text": "Analyze this document..."},
  stream=True
)

for chunk in result:
  print(chunk.output, end="")

Trusted by 50K+ developers at

Lattice AI · NeuralPath · Inference Labs · ModelOps · Tensor Systems

Predictable pricing

Start free. Scale when you're ready.

Features         Starter ($0, forever)   Pro ($49/mo, most chosen)   Enterprise (Custom)
API calls / mo   1,000                   100K                        Unlimited
Models           Community               Custom + community          Everything
GPU access       Shared                  Dedicated                   Reserved fleet
Regions          1                       5                           40+
Support          Community               Priority                    Dedicated CSM
Fine-tuning      --                      Included                    Included
SLA              --                      99.9%                       99.99%
SOC 2 / HIPAA    --                      --                          Included
VPC peering      --                      --                          Included

Start deploying models today

Free tier. No credit card. Production-ready in minutes.