Technical Expertise

Technical Expertise


RAG & Knowledge Systems

  • PostgreSQL + pgvector hybrid search (keyword + semantic)
  • Retrieve-then-rerank pipelines with citation grounding
  • Multi-tenant vector stores with namespace scoping
  • Firecrawl web crawling + auto-chunking

Voice & Real-Time AI

  • Gemini Live API — full-duplex voice (<100ms latency)
  • Whisper STT on local GPU (real-time transcription)
  • Kokoro / edge-tts on-premises TTS
  • WebSocket streaming with turn-taking & interruption handling

AI Agent Architecture

  • Function-calling with tool orchestration (OpenAI, Gemini, HF)
  • MCP (Model Context Protocol) — 39 tools exposed to IDE agents
  • Multi-agent delegation and persistent chat memory
  • Monte Carlo simulation tools (eval7, equity calculation)
Forensic AI Studio·GTO Poker Coach

Platform Engineering

  • Multi-tenant isolation (RLS, namespace scoping)
  • Multi-provider LLM routing (Gemini, OpenAI, HuggingFace, local)
  • Google Workspace integration (Gmail, Calendar, Drive via OAuth2)
  • Real-time WebSocket APIs, SSE streaming

Self-Hosted AI Infrastructure

I deploy and operate LLM inference stacks on bare metal — no cloud vendor lock-in. This includes model serving, speech-to-text, text-to-speech, and multi-service orchestration on RTX-class GPUs.

Model Serving

  • vLLM (continuous batching, tool calling)
  • llama.cpp / llama-server
  • GPTQ / AWQ quantization
  • VRAM optimization & GPU resource mgmt

Speech Pipeline

  • Whisper STT on local GPU
  • Kokoro TTS engine
  • Real-time audio streaming

Orchestration

  • Supervisor / systemd service management
  • Nginx reverse proxy + SSL
  • Multi-service coordination on bare metal

Core Stack

PythonTypeScriptFastAPINext.jsReactPostgreSQLpgvectorRedisDockervLLMllama.cppWhisperKokoroOpenAI SDKGemini APIHuggingFaceMCP SDKPlaywrightFirecrawlCCXTeval7Fly.ioSupervisorNginxGit