v2.4 — Model streaming now live

Deploy ML models

The inference platform that handles optimization, scaling, and global distribution. Push your model, get an endpoint.

Free tier included. No credit card required.

cortex-cli v2.4.1

How it works

Three steps to production

01

Upload Model

Push any PyTorch, TensorFlow, or ONNX model via CLI, SDK, or drag-and-drop. We handle containerization.

02

Auto-Optimize

Automatic quantization, graph optimization, and hardware-specific compilation. Zero config required.

03

Deploy Globally

One command deploys to 40+ edge regions with auto-scaling, health checks, and zero-downtime rollouts.

Platform

Built for scale

Observability

Real-time monitoring

Track latency, throughput, error rates, and cost per inference across all deployed models. Set alerts, compare experiments, and drill into individual requests.

Requests / min: 12,847
p50: 32ms
p99: 58ms
errors: 0.02%
Edge

Global edge network

Automatic geo-routing to 40+ regions. Lowest latency for every request, everywhere.

us-east-1: 12ms
eu-west-1: 28ms
ap-south-1: 34ms
us-west-2: 67ms
SDK

Auto-generated SDKs

Type-safe clients generated from your model schema. Python, TypeScript, Go, Rust.

// Auto-generated SDK
const res = await cortex.infer({
  model: "my-model",
  input: "..."
})
Security

Enterprise-grade security

SOC 2 Type II certified. End-to-end encryption in transit and at rest, RBAC, audit logs, VPC peering, and air-gapped deployment options.

SOC 2
HIPAA
GDPR
E2E Encryption
RBAC
Audit Logs

Performance

32ms p50 latency.
40+ regions. Zero config.

Our inference engine is optimized at every layer, from custom CUDA kernels to smart request batching. Serve millions of requests with sub-50ms median latency across 40+ global regions.


Inference latency (lower is better)

Cortex: 32ms
Provider A: 110ms
Provider B: 88ms
Self-hosted: 195ms
Throughput: 12,847 req/s
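The "smart request batching" mentioned above is a standard throughput technique: buffer incoming requests and flush them to the accelerator together, either when the batch fills or when a small time budget expires. A minimal sketch of the general idea, with hypothetical names and thresholds, not Cortex's actual engine:

```python
# Minimal sketch of dynamic request batching: flush when the batch
# is full or the wait deadline passes, amortizing per-call overhead.
import time

class Batcher:
    def __init__(self, max_batch=8, max_wait_ms=5):
        self.max_batch = max_batch
        self.max_wait = max_wait_ms / 1000
        self.buffer = []
        self.deadline = None

    def submit(self, request):
        """Add a request; return a full batch if one is ready, else None."""
        if not self.buffer:
            self.deadline = time.monotonic() + self.max_wait
        self.buffer.append(request)
        if len(self.buffer) >= self.max_batch or time.monotonic() >= self.deadline:
            return self.flush()
        return None

    def flush(self):
        batch, self.buffer = self.buffer, []
        return batch  # in a real server, this batch runs on the GPU together

batcher = Batcher(max_batch=4)
results = [batcher.submit(i) for i in range(4)]
print(results[-1])  # the 4th submit fills the batch: [0, 1, 2, 3]
```

Tuning the batch size against the wait budget is the core trade-off: larger batches raise throughput, while tighter deadlines protect tail latency.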

Developer experience

Ship in minutes

The Python SDK that gets out of your way.

app.py
Python
import cortex

# Initialize the client
client = cortex.Client("ctx_sk_...")

# Deploy a model
deployment = client.deploy(
  model="./models/classifier-v3",
  gpu="a100",
  replicas=3,
  regions=["us-east", "eu-west", "ap-south"]
)

# Run inference
result = deployment.predict(
  input={"text": "Analyze this document..."},
  stream=True
)

for chunk in result:
  print(chunk.output, end="")

Trusted by 50K+ developers at

Lattice AI · NeuralPath · Inference Labs · ModelOps · Tensor Systems

Predictable pricing

Start free. Scale when you're ready.

Features         Starter ($0, forever)   Pro ($49/mo, most chosen)   Enterprise (Custom)
API calls / mo   1,000                   100K                        Unlimited
Models           Community               Custom + community          Everything
GPU access       Shared                  Dedicated                   Reserved fleet
Regions          1                       5                           40+
Support          Community               Priority                    Dedicated CSM
Fine-tuning      --                      Included                    Included
SLA              --                      99.9%                       99.99%
SOC 2 / HIPAA    --                      --                          Included
VPC peering      --                      --                          Included

Start deploying models today

Free tier. No credit card. Production-ready in minutes.