agirouter.products

Build and run generative AI applications with accelerated performance, maximum accuracy, and lowest cost at production scale.

Start building now Docs

LLaMA-3 Chat

Mistral Instruct

Realistic Vision 3.0

Vicuna v1.5

Code Llama

RedPajama-INCITE

Platypus2 Instruct

Nous Hermes Llama-2

Mixtlal-8x22B

Chronos Hermes

Stable Diffusion XL 1.0

Stable Diffusion 2.1

Snowflake Arctic

Phind Code LLaMA v2

Openjourney v4

SOLAR v0

NSQL LLaMA-2

Qwen

Qwen 1.5

DBRX Instruct

Analog Diffusion

MythoMax-L2

Pythia-Chat-Base

Inference that’s fast, simple, and scales as you grow.

FAST

Run leading open-source models like Llama-3 on the fastest inference stack available, up to 4x faster than vLLM. Outperforms Amazon Bedrock, and Azure AI by over 2x.

COST-EFFICIENT

Agirouter Inference is 11x lower cost than GPT-4o when using Llama-3 70B. Our optimizations bring you the best performance at the lowest cost.

scalable

We obsess over system optimization and scaling so you don’t have to. As your application grows, capacity is automatically added to meet your API request volume.

Serverless Endpoints for leading open-source models

Access 100+ models through serverless endpoints – including Llama 3, RedPajama, Falcon and Stable Diffusion XL. Endpoints are OpenAI compatible.
Test models in Chat, Language, Image, and Code Playgrounds.
Access 8 leading embeddings models – including models that outperform OpenAI’s ada-002 and Cohere’s Embed-v3 in MTEB and LoCo Benchmarks.

Try now

<span style='color: #0f6fff;'>Serverless Endpoints</span> for leading open-source models

Dedicated Endpoints for any model

Choose any kind of model — open-source, fine-tuned, or even models you’ve trained.
Choose your hardware configuration. Select the number of instances to deploy and how many you’ll auto-scale to.
Tune for fast latency versus high throughput — simply by adjusting the max batch size.

<span style='color: #0f6fff;'>Dedicated Endpoints</span> for any model

Integrate Agirouter Inference Engine into your application

Integrate models into your production applications using the same easy-to-use inference API for either Serverless Endpoints or Dedicated Instances.
Leverage the Agirouter embeddings endpoint to build your own RAG applications.
Show streaming responses to your end users — almost instantly.

Read our docs

<span style='color: #0f6fff;'>Integrate</span> Agirouter Inference Engine into your application

Perfect for enterprises — performance, privacy, and scalability to meet your needs.

Performance

You get faster tokens per second, higher throughput and lower time to first token. And, all these efficiencies mean we can provide you compute at a lower cost.

SPEED RELATIVE TO VLLM

4x Faster

LLAMA-3 8B AT FULL PRECISION

400 TOKENS/SEC

COST RELATIVE TO GPT-4o

11x lower cost

The Agirouter Inference Engine sets us apart.

We built the blazing fast inference engine that we wanted to use. Now, we’re sharing it with you.

The Agirouter Inference Engine deploys the latest inference techniques:

FlashAttention 3 and Flash-Decoding

The Agirouter Inference Engine integrates and builds upon kernels from FlashAttention-3 along with proprietary kernels for other operators.

Advanced speculative decoding

The Agirouter Inference Engine integrates speculative decoding algorithms such as Medusa and SpecExec. It also comes with custom-built draft models that are more than 10x Chinchilla optimal, to achieve the fastest performance.

Quality-preserving quantization

Agirouter AI quantization achieves the highest accuracy and performance. Built on proprietary kernels including MHA and GEMMs that are optimized for LLM inference, tuned for both pre-fill and decoding phases.

Privacy

FAST

Run leading open-source models like Llama-3 on the fastest inference stack available, up to 4x faster than vLLM. Outperforms Amazon Bedrock, and Azure AI by over 2x.