OLCF Inference Service Documentation

Welcome to the documentation for the OLCF Inference Service. This service leverages the Secure Scientific Service Mesh (S3M) to provide access to powerful Large Language Models (LLMs) running on a highly optimized vLLM runtime, offering OpenAI-compatible API endpoints.

Requesting OLCF Inference Service Access

Please email OLCF Support at help@olcf.ornl.gov if you are interested in using the OLCF Inference Service.

Include the following information in your email:
  • Existing OLCF Project ID

  • Project PI

  • Your project’s use case - Please explain how your project will use the OLCF inference service in your workflow.

Authentication

To use the inference service, you must authenticate your requests using a Bearer token.

  1. Mint your token: Tokens must be minted via S3M on myOLCF. More information can be found in the S3M documentation.

  2. Set your environment variable: Once you have your token, we recommend exporting it securely in your terminal environment to prevent hardcoding it in your scripts.

export S3M_TOKEN="your_minted_token_here"

Endpoint URL

The primary base endpoint for chat completions is:

https://s3m.olcf.ornl.gov/olcf/open/v1/inference/chat/completions

Available Models

Note

This list is not exhaustive. For a complete list, see the List Models section below.

Currently, the service supports the following models:

Supported Models

Model                     Aliases                                       Features
------------------------  --------------------------------------------  -----------------------
gpt-oss-120b              gpt-oss-120b, gpt-oss                         Text, Reasoning
nemotron-nano-fp8         nemotron-nano-fp8, nemotron-nano              Text, Reasoning
apriel-1.6-15b-thinker    apriel-1.6-15b-thinker, apriel-15b-thinker    Text, Reasoning, Vision
nomic-embed-text-v2-moe   nomic-embed-text-v2-moe, nomic-embed-v2       Text Embedding

Multi-Modal Inputs

For a working multimodal example, see the Computer Vision section below.

For additional multimodal examples, see https://docs.vllm.ai/en/stable/features/multimodal_inputs/?h=multimodal#online-serving

OpenAI-style file uploads are not supported on the OLCF Inference Service, and the service will not accept image_url values that point to external web addresses. Images must be passed inline as base64-encoded data URLs.

Usage Examples

Note

When using cURL, wget, or other command-line programs, please follow the best practices described in the S3M documentation under Avoid Command-Line Arguments.

You can follow the examples below by first running:

echo "Authorization: Bearer ${S3M_TOKEN}" > .env
echo ".env" >> .gitignore

Because the service uses a vLLM backend, the request body is compatible with the standard OpenAI Chat Completions API format.

Note

In order to use the OpenAI Python library, you must first install it or activate an environment with it installed. You can install it via pip with pip install openai

Since the API is OpenAI-compatible, you can easily use the standard Python openai library. Simply override the base URL and pass your token.

Querying gpt-oss-120b

curl -N -s -X POST "https://s3m.olcf.ornl.gov/olcf/open/v1/inference/chat/completions" \
     -H @.env \
     -H "Content-Type: application/json" \
     -d '{
             "model": "gpt-oss-120b",
             "messages": [
                 {"role": "user", "content": "Your prompt here."},
                 {"role": "user", "content": "We can be 120b(s)."}
             ],
             "stream": false
         }'
from openai import OpenAI
import os

client = OpenAI(
    base_url="https://s3m.olcf.ornl.gov/olcf/open/v1/inference",
    api_key=os.environ.get("S3M_TOKEN")
)

response = client.chat.completions.create(
    model="gpt-oss-120b",
    messages=[
        {"role": "user", "content": "Your prompt here"},
        {"role": "user", "content": "We can be 120b(s)."}
    ],
    stream=False
)

print(response.choices[0].message.content)

Querying nemotron-nano-fp8

curl -N -s -X POST "https://s3m.olcf.ornl.gov/olcf/open/v1/inference/chat/completions" \
     -H @.env \
     -H "Content-Type: application/json" \
     -d '{
           "model": "nemotron-nano-fp8",
           "messages": [{"role": "user", "content": "Your prompt here."}],
           "stream": false
         }'
from openai import OpenAI
import os

client = OpenAI(
    base_url="https://s3m.olcf.ornl.gov/olcf/open/v1/inference",
    api_key=os.environ.get("S3M_TOKEN")
)

response = client.chat.completions.create(
    model="nemotron-nano-fp8",
    messages=[
        {"role": "user", "content": "Your prompt here"}
    ],
    stream=False
)

print(response.choices[0].message.content)
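The examples above set stream to false. The service also supports token streaming, where the response arrives incrementally as OpenAI-style server-sent events ("data: {json}" lines terminated by "data: [DONE]" — the standard behavior for vLLM backends). Below is a stdlib-only sketch that requires no extra packages and is guarded so no request is sent unless S3M_TOKEN is set:

```python
import json
import os
import urllib.request

URL = "https://s3m.olcf.ornl.gov/olcf/open/v1/inference/chat/completions"

payload = {
    "model": "gpt-oss-120b",
    "messages": [{"role": "user", "content": "Your prompt here"}],
    "stream": True,  # ask the server for server-sent events
}

req = urllib.request.Request(
    URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": f"Bearer {os.environ.get('S3M_TOKEN', '')}",
        "Content-Type": "application/json",
    },
)

# Guarded: only contact the service when a token is actually set.
if os.environ.get("S3M_TOKEN"):
    with urllib.request.urlopen(req) as resp:
        for raw in resp:
            line = raw.decode("utf-8").strip()
            # Each streamed chunk arrives as "data: {json}"; the stream
            # ends with the sentinel line "data: [DONE]".
            if not line.startswith("data: ") or line == "data: [DONE]":
                continue
            chunk = json.loads(line[len("data: "):])
            delta = chunk["choices"][0]["delta"].get("content")
            if delta:
                print(delta, end="", flush=True)
    print()
```

The openai SDK provides the same behavior: pass stream=True to client.chat.completions.create() and iterate over the returned stream of chunks.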

Computer Vision

Note

Computer vision is not supported by every model. Please see Available Models and List Models for details on which models support vision.

jq -n --arg content "$(base64 -w 0 < frontier.jpg)" '{
  "model": "apriel-1.6-15b-thinker",
  "messages": [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe the image."},
            {
                "type": "image_url",
                "image_url": {"url": ("data:image/jpeg;base64," + $content)}
            }
        ]
    }
  ]
}' | curl -N -s -X POST "https://s3m.olcf.ornl.gov/olcf/open/v1/inference/chat/completions" \
     -H @.env \
     -H "Content-Type: application/json" \
     -d @-
import base64
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://s3m.olcf.ornl.gov/olcf/open/v1/inference",
    api_key=os.environ.get("S3M_TOKEN")
)

# Function to encode the image
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


# Path to your image
image_path = "frontier.jpg"

# Getting the Base64 string
base64_image = encode_image(image_path)

response = client.chat.completions.create(
    model="apriel-1.6-15b-thinker",
    messages=[
        {
            "role": "user",
            "content": [
                { "type": "text", "text": "Describe the image." },
                {
                    "type": "image_url",
                    "image_url": { "url": f"data:image/jpeg;base64,{base64_image}" },
                },
            ],
        },
    ],
)

print(response.choices[0].message.content)
import base64
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://s3m.olcf.ornl.gov/olcf/open/v1/inference",
    api_key=os.environ.get("S3M_TOKEN")
)

# Function to encode the image
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


# Path to your image
image_path = "frontier.jpg"

# Getting the Base64 string
base64_image = encode_image(image_path)

response = client.responses.create(
    model="apriel-1.6-15b-thinker",
    input=[
        {
            "role": "user",
            "content": [
                { "type": "input_text", "text": "Describe the image." },
                {
                    "type": "input_image",
                    "image_url": f"data:image/jpeg;base64,{base64_image}",
                },
            ],
        },
    ],
)

print(response.output[0].content[0].text)

Simple Text Files

The file data type is not currently supported by any models on the OLCF Inference Service. Instead, send the file's full contents as plain text within the prompt.

jq -n --arg content "$(< <your simple text file>)" '{
    "model": "gpt-oss-120b",
    "messages": [
        {
            "role": "user",
            "content": ("Process this text.\n\n" + $content)
        }
    ]
}' | curl -N -s -X POST "https://s3m.olcf.ornl.gov/olcf/open/v1/inference/chat/completions" \
     -H @.env \
     -H "Content-Type: application/json" \
     -d @-
from openai import OpenAI
import os

client = OpenAI(
    base_url="https://s3m.olcf.ornl.gov/olcf/open/v1/inference",
    api_key=os.environ.get("S3M_TOKEN")
)


with open("<your simple text file>", "r") as f:
    data = f.read()

response = client.chat.completions.create(
    model="nemotron-nano-fp8",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Process this text."
                },
                {
                    "type": "text",
                    "text": f"{data}"
                },
            ],
        },
    ],
    stream=False
)

print(response.choices[0].message.content)

Core API Endpoints

Because the service uses a vLLM backend, the API routing and request bodies are compatible with standard OpenAI API formats.

The API base URL is: https://s3m.olcf.ornl.gov/olcf/open/v1/inference

Chat Completions

Endpoint: /chat/completions

Used for conversational interactions and instruction-following models.

curl -N -s -X POST "https://s3m.olcf.ornl.gov/olcf/open/v1/inference/chat/completions" \
     -H @.env \
     -H "Content-Type: application/json" \
     -d '{
           "model": "gpt-oss-120b",
           "messages": [{"role": "user", "content": "Explain quantum computing."}],
           "stream": false
         }'

Standard Completions

Endpoint: /completions

Used for traditional text continuation (base models rather than instruction-tuned chat models).

curl -N -s -X POST "https://s3m.olcf.ornl.gov/olcf/open/v1/inference/completions" \
     -H @.env \
     -H "Content-Type: application/json" \
     -d '{
           "model": "nemotron-nano-fp8",
           "prompt": "The future of high-performance computing is",
           "max_tokens": 50,
           "temperature": 0.7
         }'
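For a Python companion to the cURL call above, here is a stdlib-only sketch using urllib.request. The request is only sent when S3M_TOKEN is set, and the generated text is assumed to arrive under choices[0].text, per the standard OpenAI Completions response format:

```python
import json
import os
import urllib.request

URL = "https://s3m.olcf.ornl.gov/olcf/open/v1/inference/completions"

payload = {
    "model": "nemotron-nano-fp8",
    "prompt": "The future of high-performance computing is",
    "max_tokens": 50,
    "temperature": 0.7,
}

req = urllib.request.Request(
    URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": f"Bearer {os.environ.get('S3M_TOKEN', '')}",
        "Content-Type": "application/json",
    },
)

# Guarded: only contact the service when a token is actually set.
if os.environ.get("S3M_TOKEN"):
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
        # Completions responses place the continuation under choices[0].text
        print(body["choices"][0]["text"])
```

The openai SDK equivalent is client.completions.create() with the same model, prompt, max_tokens, and temperature arguments.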

List Models

Endpoint: /model/info

If you need deeper configuration specs, such as maximum context length, supported modalities (e.g., vision or audio), or maximum output tokens, LiteLLM exposes a custom /model/info endpoint.

Important

Because /model/info is a custom LiteLLM proxy route rather than a standard OpenAI route, you will use the standard Python requests library to fetch this data instead of the OpenAI SDK.

curl -s -X GET "https://s3m.olcf.ornl.gov/olcf/open/v1/inference/model/info" \
     -H @.env
import os
import requests

url = "https://s3m.olcf.ornl.gov/olcf/open/v1/inference/model/info"
headers = {
    "Authorization": f"Bearer {os.environ.get('S3M_TOKEN')}"
}

response = requests.get(url, headers=headers)

if response.status_code == 200:
    specs = response.json()
    for model in specs.get("data", []):
        name = model.get("model_name")
        info = model.get("model_info", {})
        print(f"Model: {name}")
        print(f"  - Max Input Tokens: {info.get('max_input_tokens', 'Unknown')}")
        print(f"  - Max Output Tokens: {info.get('max_tokens', 'Unknown')}\n")
else:
    print(f"Failed to fetch specs: {response.status_code}")

Expected JSON Response:

This will return a richer JSON payload containing the backend parameters and capabilities for each model on the server.

{
  "data": [
    {
      "model_name": "gpt-oss-120b",
      "litellm_params": {
        "model": "vllm/gpt-oss-120b"
      },
      "model_info": {
        "max_tokens": 8192,
        "max_input_tokens": 128000,
        "mode": "chat"
      }
    },
    {
      "model_name": "nemotron-nano-fp8",
      "litellm_params": {
        "model": "vllm/nemotron-nano-fp8"
      },
      "model_info": {
        "max_tokens": 4096,
        "max_input_tokens": 32768,
        "mode": "chat"
      }
    }
  ]
}

Embeddings

Endpoint: /embeddings

Generates vector embeddings for a given text.

Note

Embeddings require the use of an embedding-specific model. Check Available Models and List Models for embedding models.

curl -s -X POST "https://s3m.olcf.ornl.gov/olcf/open/v1/inference/embeddings" \
     -H @.env \
     -H "Content-Type: application/json" \
     -d '{
           "model": "nomic-embed-text-v2-moe",
           "input": "This is good news."
         }'
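The same request can be made from Python. Below is a stdlib-only sketch using urllib.request, guarded so the request is only sent when S3M_TOKEN is set; the response is assumed to follow the standard OpenAI embeddings format, with the vector under data[0].embedding:

```python
import json
import os
import urllib.request

URL = "https://s3m.olcf.ornl.gov/olcf/open/v1/inference/embeddings"

payload = {
    "model": "nomic-embed-text-v2-moe",
    "input": "This is good news.",
}

req = urllib.request.Request(
    URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": f"Bearer {os.environ.get('S3M_TOKEN', '')}",
        "Content-Type": "application/json",
    },
)

# Guarded: only contact the service when a token is actually set.
if os.environ.get("S3M_TOKEN"):
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
        vector = body["data"][0]["embedding"]  # OpenAI embeddings format
        print(f"Embedding dimension: {len(vector)}")
```

With the openai SDK, the equivalent is client.embeddings.create(model=..., input=...).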

Responses

Endpoint: /responses

A newer OpenAI API endpoint that offers performance benefits and additional features over Chat Completions.

Read more on OpenAI’s Docs: https://developers.openai.com/api/docs/guides/migrate-to-responses

curl -s -X POST "https://s3m.olcf.ornl.gov/olcf/open/v1/inference/responses" \
    -H @.env \
    -H "Content-Type: application/json" \
    -d '{
            "model": "gpt-oss-120b",
            "input": [
                {
                    "role": "user",
                    "content": [
                        { "type": "input_text", "text": "What is an AI agent?" },
                        { "type": "input_text", "text": "Which agentic framework should I use for gpt-oss-120b?" }
                    ]
                }
            ]
        }'
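The Responses endpoint can also be called from Python. Here is a stdlib-only sketch using urllib.request, guarded so the request is only sent when S3M_TOKEN is set; the reply is assumed to follow the Responses output shape used earlier in this document (output[0].content[0].text):

```python
import json
import os
import urllib.request

URL = "https://s3m.olcf.ornl.gov/olcf/open/v1/inference/responses"

payload = {
    "model": "gpt-oss-120b",
    "input": [
        {
            "role": "user",
            "content": [
                {"type": "input_text", "text": "What is an AI agent?"}
            ],
        }
    ],
}

req = urllib.request.Request(
    URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": f"Bearer {os.environ.get('S3M_TOKEN', '')}",
        "Content-Type": "application/json",
    },
)

# Guarded: only contact the service when a token is actually set.
if os.environ.get("S3M_TOKEN"):
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
        # Responses API nests generated text under output[0].content[0].text
        print(body["output"][0]["content"][0]["text"])
```

The openai SDK equivalent is client.responses.create(model=..., input=...), as shown in the Computer Vision section above.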

Additional Resources

You can refer to both vLLM’s API reference and OpenAI’s API reference documentation for additional examples and instructions.

vLLM: https://docs.vllm.ai/en/stable/serving/openai_compatible_server/

OpenAI Chat Completions: https://developers.openai.com/api/reference/chat-completions/overview