
Technical Expertise
RAG & Knowledge Systems
- PostgreSQL + pgvector hybrid search (keyword + semantic)
- Retrieve-then-rerank pipelines with citation grounding
- Multi-tenant vector stores with namespace scoping
- Firecrawl web crawling + auto-chunking
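As an illustration of the hybrid search item above, one common way to combine a keyword result list with a semantic (vector) result list is reciprocal rank fusion. The sketch below is an assumption about how such merging could look, not a description of the actual pipeline; the function name, the `k=60` constant, and the document ids are all hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into a single ranking.

    Each document earns 1 / (k + rank) from every list it appears in,
    so items ranked well by both keyword and semantic search rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical result lists, e.g. one from full-text search and one from pgvector.
keyword_hits = ["doc_a", "doc_b", "doc_c"]
semantic_hits = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
print(fused)
```

Rank-based fusion avoids having to normalize keyword scores and vector distances onto a shared scale, which is why it is often a reasonable first choice before adding a reranker.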
Voice & Real-Time AI
- Gemini Live API — full-duplex voice (<100ms latency)
- Whisper STT on local GPU (real-time transcription)
- Kokoro / edge-tts on-premises TTS
- WebSocket streaming with turn-taking & interruption handling
AI Agent Architecture
- Function calling with tool orchestration (OpenAI, Gemini, Hugging Face)
- MCP (Model Context Protocol) — 39 tools exposed to IDE agents
- Multi-agent delegation and persistent chat memory
- Monte Carlo simulation tools (eval7, equity calculation)
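To make the function-calling item above concrete, here is a minimal, provider-agnostic sketch of a tool registry and dispatcher. It assumes the model emits a JSON object of the form `{"name": ..., "arguments": {...}}`; the `tool` decorator, the `dispatch` helper, and the `add` tool are hypothetical names for illustration and are not taken from any SDK.

```python
import json

# Registry mapping tool names to plain Python callables (hypothetical design).
TOOLS = {}


def tool(fn):
    """Register a function under its own name so a model can call it."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def add(a: float, b: float) -> float:
    """Example tool: add two numbers."""
    return a + b


def dispatch(tool_call_json: str) -> str:
    """Execute one tool call and return a JSON string to send back to the model."""
    call = json.loads(tool_call_json)
    fn = TOOLS.get(call["name"])
    if fn is None:
        return json.dumps({"error": f"unknown tool: {call['name']}"})
    try:
        result = fn(**call.get("arguments", {}))
    except TypeError as exc:
        # Raised when the model supplies missing or unexpected argument names.
        return json.dumps({"error": str(exc)})
    return json.dumps({"result": result})


print(dispatch('{"name": "add", "arguments": {"a": 2, "b": 3}}'))
```

Returning errors as data instead of raising lets the calling loop hand the message back to the model, which can then retry with corrected arguments.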
Self-Hosted AI Infrastructure
I deploy and operate LLM inference stacks on bare metal, with no cloud vendor lock-in. This covers model serving, speech-to-text, text-to-speech, and multi-service orchestration on RTX-class GPUs.
Model Serving
- vLLM (continuous batching, tool calling)
- llama.cpp / llama-server
- GPTQ / AWQ quantization
- VRAM optimization & GPU resource management
Speech Pipeline
- Whisper STT on local GPU
- Kokoro TTS engine
- Real-time audio streaming
Orchestration
- Supervisor / systemd service management
- Nginx reverse proxy + SSL
- Multi-service coordination on bare metal
Core Stack
Python, TypeScript, FastAPI, Next.js, React, PostgreSQL, pgvector, Redis, Docker, vLLM, llama.cpp, Whisper, Kokoro, OpenAI SDK, Gemini API, HuggingFace, MCP SDK, Playwright, Firecrawl, CCXT, eval7, Fly.io, Supervisor, Nginx, Git