Forge CLI is an AI-powered code editor and optimization tool designed for NVIDIA GPU kernel development. It integrates smart profiling, automated benchmarking, and hardware-aware AI suggestions to dramatically speed up writing, testing, and optimizing CUDA, Triton, and PyTorch code for maximum performance.
Freemium
How to use Forge CLI?
Install Forge CLI and integrate it into your development workflow. Write your GPU kernel code (CUDA, Triton, PyTorch) in the editor. The AI provides inline performance metrics and suggestions as you type. Use the profiling terminal to analyze bottlenecks, benchmark across different GPU configurations automatically, and emulate hardware you don't physically own to test compatibility and performance.
Forge CLI's Core Features
Inline Profiling & Metrics: See kernel performance metrics like execution time, memory usage, and occupancy directly in your code editor as you type, eliminating the need for separate profiling runs.
Hardware-Aware AI: The AI understands specific GPU architectures (A100, H100, etc.) and provides optimization suggestions tailored to your exact hardware, including memory layout, block sizes, and precision.
GPU Emulator: Test your kernels on more than 50 GPU architectures (such as the H100 and A100) with emulation error under 2%, without needing to own the physical hardware, enabling broad compatibility testing.
Local LLM Support: Run AI models locally via Ollama, vLLM, or LM Studio so your proprietary code never leaves your machine, keeping it completely private and secure.
Automated Benchmarking & Sweeping: Automatically sweeps through different configurations (block sizes, thread counts, memory layouts) to find the fastest setup and track performance regressions over time.
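The sweep idea above can be sketched in plain Python. This is an illustrative stand-in, not Forge CLI's actual API: the kernel launcher and its cost model are hypothetical, but the structure (enumerate configurations, time each, keep the fastest) is the essence of any auto-tuning sweep:

```python
import itertools
import time

def run_kernel(block_size, num_warps):
    """Hypothetical stand-in for launching a GPU kernel: sleeps for a
    duration derived from a made-up cost model that is fastest at
    block_size=128, num_warps=4."""
    cost = abs(block_size - 128) + abs(num_warps - 4) * 10
    time.sleep(cost / 100_000.0)

def sweep(block_sizes, warp_counts):
    """Time every (block_size, num_warps) combination and return the
    fastest configuration, as an auto-tuning sweep would."""
    best = None
    for bs, nw in itertools.product(block_sizes, warp_counts):
        start = time.perf_counter()
        run_kernel(bs, nw)
        elapsed = time.perf_counter() - start
        if best is None or elapsed < best[0]:
            best = (elapsed, bs, nw)
    return best

best_time, best_bs, best_nw = sweep([64, 128, 256], [2, 4, 8])
print(f"fastest config: block_size={best_bs}, num_warps={best_nw}")
```

A real sweep would launch actual kernels and average several timed runs per configuration to smooth out noise; the loop structure stays the same.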
Natural Language Profiling: Describe what you want to profile in plain English, and Forge CLI generates the correct, complex Nsight Compute commands instantly, removing the need to memorize flags.
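To illustrate the kind of translation involved, the tiny helper below maps a few plain-English intents onto Nsight Compute (`ncu`) command lines. The intent vocabulary and the mapping are invented for this sketch, though the `ncu` flags shown (`--set`, `--section`, `--kernel-name`, `--launch-count`) are genuine Nsight Compute CLI options:

```python
def profile_command(intent, binary, kernel=None):
    """Map a plain-English profiling intent to an ncu command line.
    The keyword-matching scheme here is a hypothetical illustration."""
    # Real Nsight Compute section names, keyed by an invented intent keyword.
    sections = {
        "memory": "--set detailed --section MemoryWorkloadAnalysis",
        "occupancy": "--set detailed --section Occupancy",
        "full": "--set full",
    }
    parts = ["ncu"]
    for keyword, flags in sections.items():
        if keyword in intent.lower():
            parts.append(flags)
            break
    else:
        parts.append("--set basic")  # fall back to the basic metric set
    if kernel:
        parts.append(f"--kernel-name {kernel}")
    parts.append("--launch-count 1")  # profile only the first matching launch
    parts.append(binary)
    return " ".join(parts)

cmd = profile_command("show me memory bottlenecks", "./app", kernel="gemm")
print(cmd)
```

An AI-backed version would replace the keyword table with a model call, but the output is the same artifact: a complete `ncu` invocation the user would otherwise have to assemble by hand.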
Multi-GPU Comparison & Analysis: Compare kernel performance across multiple GPUs simultaneously (up to 6 in Pro, unlimited in Enterprise) to identify the best hardware for your workload and optimize for scale.
Forge CLI's Use Cases
GPU Kernel Developers: Accelerates the development and optimization of high-performance CUDA and Triton kernels by providing instant feedback and AI-driven suggestions, reducing debugging time from hours to minutes.
ML Engineers & Researchers: Optimizes PyTorch model training and inference loops for specific GPU hardware, automatically suggesting mixed precision, kernel fusion, and memory management improvements to speed up experiments.
Hardware Validation Engineers: Uses the GPU emulator to test software compatibility and performance across a wide range of NVIDIA architectures before hardware is available or purchased, de-risking deployment.
HPC Application Developers: Profiles and benchmarks complex scientific computing applications across multi-GPU setups, identifying bottlenecks and optimizing for datacenter-scale deployment efficiently.
AI Infrastructure Teams: Ensures custom AI models and frameworks are optimally tuned for their specific datacenter GPU fleet (like H100 clusters), maximizing utilization and ROI on expensive hardware.
Students & Educators: Provides an accessible, integrated environment to learn GPU programming with real-time feedback and explanations, lowering the barrier to entry for parallel computing concepts.
Forge CLI's Pricing
FREE
$0/mo
For solo developers. Includes single GPU development, unlimited profiling & benchmarking, CodeLens performance metrics, GPU virtualization, local LLM support, and 1 Forge credit per month.
PRO
$29/mo
For professional teams. Includes everything in Free plus GPU emulator access (50+ GPUs), multi-GPU comparison (6 max), natural language profiling, 10 Forge credits/month, unlimited autocomplete, GPU-optimized suggestions, and priority email support.
ENTERPRISE
Custom Pricing
For large organizations. Includes everything in Pro plus 100+ GPU clusters, datacenter optimization, on-premise deployment, custom silicon support, unlimited Forge credits, custom model fine-tuning, and dedicated 24/7 support with SLA.
PAY AS YOU GO
from $112.50
Agent credits for AI-powered kernel optimization. Credit refund if performance doesn't beat torch.compile(mode='max-autotune'). Includes access to datacenter GPUs (B200, H100, H200).