AI Engineer
Sungjun Son
LLM Serving · Search Engine · Full Stack · DevOps
sonsj97@plateer.com · sonsj97@gmail.com
About
I'm an AI engineer who designs and operates AI agent platforms. I build the entire AI service stack across inference, search, automation, and infrastructure: vLLM/llama.cpp multi-GPU (CUDA·ROCm) LLM serving, LangChain/LangGraph Iterative RAG, Neo4j knowledge graphs, and MCP-based AI agents.
Today I lead an 8-person team at Plateer building the enterprise AI platform (XGEN), and as Tech & Consulting Part Leader I work directly with clients to design their AI adoption. I cut token cost by ~60% on a regional bank's GenAI platform, co-ran Intel Gaudi2/3 LLM inference PoCs, and advised 30+ enterprise clients on AI.
It all started with commerce search. I ran OpenSearch keyword search at 5,000+ TPS in production and rewrote it in Rust for 28ms responses, then moved through hybrid search and LLM query expansion (AI Search) to today's LLM and agent work. Lately I focus on the MCP ecosystem, open-sourcing LLM-agent infrastructure like graph-tool-call and gwanjong-mcp.
Expertise
Search Engine
90 posts
- OpenSearch k-NN / Hybrid Search
- Qdrant Vector DB
- Rust Axum Search API
- NestJS Hybrid Search
- RAG / Semantic Search
AI / ML
78 posts
- vLLM / llama.cpp GPU Serving
- MCP-based AI Agent Design
- Graph-based Tool Retrieval Engine
- LangChain / LangGraph RAG
- XGEN AI Agent Platform
Full Stack
63 posts
- Next.js / React UI
- Rust API Gateway
- Tauri Desktop App
- Python Async Services
- WebSocket / SSE Realtime
DevOps
44 posts
- K8s / K3s Cluster Operations
- ArgoCD GitOps Deployment
- Jenkins CI/CD Pipeline
- Docker Multi-stage Build
- Istio / Let's Encrypt
Experience
Part Leader · AI Lab → Tech & Consulting
- Designed & built the XGEN AI agent platform — vLLM/llama.cpp multi-GPU (CUDA·ROCm) LLM serving, LangChain/LangGraph Iterative RAG, Neo4j knowledge graph, MCP-based AI agents & workflow engine, and a new HWPX/DOCX/PPTX Document Adapter
- Core engineer on a regional bank's GenAI platform — full-stack AI chatbot (SSE streaming, multi-step agents) from requirements to delivery, ~60% token cost reduction, plus CI/CD infrastructure
- Co-ran Intel Gaudi2/3 LLM inference PoCs (DeepSeek·Llama·QwQ quantization/perf) and led XGEN demos·PoCs·consulting for 30+ enterprise clients
- Operated 7 microservices on k3s/Istio/ArgoCD GitOps with dev/stg/prd multi-env design, and built the XGEN GS-certification (TTA) test environment
- Led an 8-person AI Lab team — task allocation, code review, mentoring, and engineer hiring interviews
Platform Division Manager · Full Stack
- Revamped & operated the IRUP content clipping/analytics dashboard — 12 clients, real-time news/YouTube/report ingestion, auto dedup grouping, email/KakaoTalk alerts
- Solo full-stack build of an internal e-approval system (Next.js/Node.js/Firebase): approval flow, permissions, real-time sync; taken from planning to deployment in 3 months
Engineering · Data Collection System
- Built & ran a 10,000+ docs/day web crawling system — 20+ news-source crawlers; retry/exception handling raised success rate 85% → 98%
- HTML parsing, text normalization, similarity-based dedup, cron batch pipeline; async processing made collection 3x faster
Projects
XGEN 2.0 — AI Agent Platform
An enterprise AI agent platform built from 7 microservices (Model Serving, API Gateway, Core, Workflow, Retrieval, Documents, Frontend). A 4-tier Backend Adapter pattern auto-detects NVIDIA CUDA / AMD ROCm / Vulkan GPUs and dynamically switches vLLM and llama.cpp backends, serving up to 20 models concurrently on a single server. An Iterative RAG pipeline (query expansion → large top-100 retrieval → iterative LLM filtering → compression) improved search accuracy over a simple top-k baseline, and hybrid search (Dense + BM25 Sparse) was applied using Qdrant Prefetch + RRF (Reciprocal Rank Fusion).
- ~15x higher LLM inference throughput vs. a Hugging Face Transformers baseline (12.5 → 185.3 tokens/sec, using vLLM PagedAttention + Continuous Batching)
- 3x faster container startup (45s → 15s), 20% less memory — after removing Ray Serve and moving to a single FastAPI process
- 3.75x faster embedding (45s → 12s for a 10MB PDF) — Switch-Backend dual mode + batch size 512 → 2048
- ArgoCD GitOps pipeline cut deploy time 15min → 3min, 30s rollback, 90% fewer deploy errors, 99.9% availability
- Enterprise RBAC (5-level role hierarchy) + full API I/O audit logging + MCP tool-level permission control
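As a rough illustration of the RRF step in the hybrid search described above: in XGEN the fusion is done by Qdrant, so the snippet below is not the platform's code, only a minimal Python sketch of the formula. The constant k = 60 (a commonly used default), the function name `rrf_fuse`, and the document IDs are assumptions made for the example.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of document IDs with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the order is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results from a dense (embedding) query and a sparse (BM25) query.
dense = ["doc_a", "doc_b", "doc_c"]
sparse = ["doc_b", "doc_d", "doc_a"]
fused = rrf_fuse([dense, sparse])
print(fused)
```

Documents that rank well in both lists (here `doc_b` and `doc_a`) rise above documents that appear in only one, which is why RRF needs no score normalization between the dense and sparse retrievers.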
graph-tool-call — Graph Tool Retrieval Engine
A graph-based retrieval engine that lets an LLM precisely find the tool it needs among 1,000+ API tools. It parses OpenAPI specs to build a 3-tier weighted graph (Tag → Operation → Parameter), and achieves higher accuracy than Vector/BM25 via BFS propagation + IDF weighting. An MCP Proxy mode provides a gateway that collapses many MCP servers into just 2 meta-tools.
- On a 1,068-tool benchmark, 2x recall and 40% higher accuracy vs. vector search
- MCP Proxy gateway mode — N MCP servers collapsed into 2 meta-tools (1-hop direct calling)
- Workflow chain engine — auto-composes multi-step tool calls into a DAG
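The idea of BFS propagation with IDF weighting over a Tag → Operation → Parameter graph can be sketched as follows. This is not graph-tool-call's actual implementation or API. The graph, the token lists, the `decay` and `max_hops` parameters, and the function names are all hypothetical and chosen only to show the mechanism.

```python
import math
from collections import deque


def idf_weights(node_tokens):
    """IDF per token over all nodes: rarer tokens get larger weights."""
    n = len(node_tokens)
    df = {}
    for tokens in node_tokens.values():
        for t in set(tokens):
            df[t] = df.get(t, 0) + 1
    return {t: math.log(1 + n / c) for t, c in df.items()}


def retrieve(graph, node_tokens, query, decay=0.5, max_hops=2):
    """Score nodes by IDF-weighted token match, then spread scores via BFS."""
    idf = idf_weights(node_tokens)
    q = set(query.lower().split())
    seeds = {}
    for node, tokens in node_tokens.items():
        s = sum(idf[t] for t in set(tokens) & q)
        if s > 0:
            seeds[node] = s
    scores = dict(seeds)
    for seed, s in seeds.items():
        seen = {seed}
        queue = deque([(seed, 0)])
        while queue:
            node, hops = queue.popleft()
            if hops == max_hops:
                continue
            for nb in graph.get(node, ()):
                if nb not in seen:
                    seen.add(nb)
                    # Each hop away from the seed halves the contribution.
                    scores[nb] = scores.get(nb, 0.0) + s * decay ** (hops + 1)
                    queue.append((nb, hops + 1))
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical graph parsed from an OpenAPI spec (undirected adjacency lists).
graph = {
    "tag:users": ["op:getUser", "op:createUser"],
    "op:getUser": ["tag:users", "param:user_id"],
    "op:createUser": ["tag:users"],
    "param:user_id": ["op:getUser"],
    "tag:orders": ["op:listOrders"],
    "op:listOrders": ["tag:orders"],
}
node_tokens = {
    "tag:users": ["users"],
    "op:getUser": ["get", "user"],
    "op:createUser": ["create", "user"],
    "param:user_id": ["user", "id"],
    "tag:orders": ["orders"],
    "op:listOrders": ["list", "orders"],
}
results = retrieve(graph, node_tokens, "get user")
top_ops = [node for node, _ in results if node.startswith("op:")]
print(top_ops)
```

In this toy graph the rare token "get" outweighs the common token "user", and `op:createUser` still surfaces because it shares the `users` tag with the best match, while the unrelated `orders` branch receives no score at all.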
gwanjong-mcp — AI Social Agent
An AI social agent that automates 9 social platforms (Dev.to, Bluesky, Twitter, Reddit, Mastodon, HN, Stack Overflow, GitHub Discussions, Discourse) through an MCP pipeline. Platforms are abstracted with the devhub-social adapter pattern, and the mcp-pipeline stores/requires chain composes a 3-stage Scout → Draft → Strike pipeline.
- Scaled 4 → 9 platforms — adapter pattern minimizes per-platform code
- stores/requires chain auto-resolves dependencies across multi-step pipelines
- Campaign GTM + anti-spam system — rate limiter, content validation, per-platform policy compliance
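One way to picture how a stores/requires chain can resolve stage order automatically is a topological sort over the keys each stage produces and consumes. The sketch below uses Python's standard `graphlib`. The stage dictionary schema, the key names, and `resolve_order` are hypothetical and are not the mcp-pipeline API.

```python
from graphlib import TopologicalSorter


def resolve_order(stages):
    """Order stages so each one runs after the stages that store its inputs."""
    producers = {}
    for name, spec in stages.items():
        for key in spec.get("stores", ()):
            producers[key] = name
    sorter = TopologicalSorter()
    for name, spec in stages.items():
        deps = []
        for key in spec.get("requires", ()):
            if key not in producers:
                raise KeyError(f"stage {name!r} requires {key!r}, which no stage stores")
            deps.append(producers[key])
        # add(node, *predecessors): predecessors must run before the node.
        sorter.add(name, *deps)
    return list(sorter.static_order())


# Stages declared out of order on purpose; key names are made up.
stages = {
    "strike": {"requires": ["draft"], "stores": ["post_id"]},
    "draft": {"requires": ["opportunity"], "stores": ["draft"]},
    "scout": {"requires": [], "stores": ["opportunity"]},
}
order = resolve_order(stages)
print(order)
```

Declaring data dependencies instead of an explicit sequence means a new stage can be added without editing the others, and a missing producer fails early with a clear error.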
Synaptic Memory — Brain-inspired Knowledge Graph
A brain-inspired knowledge graph library + MCP server for LLM agents. With Spreading Activation (associative retrieval), Hebbian Learning (experiential learning), and 4-stage Memory Consolidation (automatic promotion/eviction across levels L0 to L3), agents automatically structure and retrieve past experience. It reached MRR 0.793 (finance/medical/legal) with FTS alone, and HotPotQA nDCG 0.636.
- 16 MCP tools — Auto-ontology (rules + LLM + embedding) construction
- 5-axis ranking (relevance × importance × recency × vitality × context)
- Zero-dep core — swappable SQLite/PostgreSQL/Qdrant/Neo4j backends
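For readers unfamiliar with the two mechanisms named above, here is a minimal sketch of spreading activation and a Hebbian-style edge update on a small weighted graph. It is a generic textbook-style illustration, not Synaptic Memory's code. The node names, the `decay`, `steps`, `threshold`, and `rate` parameters, and both function names are assumptions.

```python
def spread_activation(edges, seeds, decay=0.5, steps=2, threshold=0.05):
    """Propagate activation from seed nodes along weighted edges.

    edges: {node: {neighbor: weight}}, seeds: {node: initial activation}.
    Energy below `threshold` is dropped so activation stays local.
    """
    activation = dict(seeds)
    frontier = dict(seeds)
    for _ in range(steps):
        next_frontier = {}
        for node, energy in frontier.items():
            for nb, weight in edges.get(node, {}).items():
                passed = energy * weight * decay
                if passed >= threshold:
                    next_frontier[nb] = next_frontier.get(nb, 0.0) + passed
        for node, energy in next_frontier.items():
            activation[node] = activation.get(node, 0.0) + energy
        frontier = next_frontier
    return activation


def hebbian_update(edges, a, b, rate=0.1, max_weight=1.0):
    """Strengthen the a -> b edge after both nodes were used together."""
    weight = edges.setdefault(a, {}).get(b, 0.0)
    edges[a][b] = min(max_weight, weight + rate * (max_weight - weight))


# Hypothetical memory graph of past incidents.
edges = {
    "deploy_failed": {"rollback": 1.0, "disk_full": 0.4},
    "rollback": {"argocd": 0.8},
}
activation = spread_activation(edges, {"deploy_failed": 1.0})
hebbian_update(edges, "deploy_failed", "disk_full")
print(activation, edges["deploy_failed"]["disk_full"])
```

Querying "deploy_failed" also activates "argocd" two hops away, and each co-use of "deploy_failed" and "disk_full" nudges their edge weight toward 1.0, so frequently co-occurring memories become easier to recall.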
Rust Commerce Search Engine
A commerce search API server rewritten in Rust/Axum to overcome the performance limits of a NestJS search engine. It implements concurrent multi-index OpenSearch search, Redis caching, and unified search across multiple data sources (products/brands/categories). Achieved 1/5 the memory, 30% faster response, and 2x indexing throughput vs. NestJS.
- 28ms average response, 2,100 req/s — Tokio async runtime + Tower middleware
- 12MB idle memory (vs. 60MB on NestJS, 1/5) — leveraging zero-cost abstractions
- Jenkins → Docker → K8s automated deployment pipeline
AI Agent Browser Automation
An LLM-based browser automation agent with a 4-layer architecture (Orchestrator → Planner → Navigator → Extractor). It dynamically registers tools via MCP, and combines Playwright-based DOM parsing with CSS-selector confidence scoring to build automation that is robust to web structure changes. Built from prototype to production in 49 commits over 4 days.
- Human-in-the-Loop raised task completion from 30% → 95%
- 5.5x fewer MCP tool calls — DOM context pre-injected at the planning stage
- No-code automation: scenario recorder → JSON playbook → repeatable execution
NestJS Hybrid Search Engine
A commerce hybrid search engine grown over 14 months and 318 commits. It combines OpenSearch keyword search with Qdrant 384-dimensional vector semantic search via RRF, and improved search accuracy by 40% through LLM-based query expansion (synonyms/intent analysis) and a reranking pipeline. A Nori morphological analyzer detects Korean verbs to skip unnecessary GPT calls, cutting response time from 2-3s to 300ms.
- 40% higher search accuracy from semantic search (resolving keyword mismatch)
- Nori verb detection skips unnecessary GPT calls: 2-3s → 300ms response
- Multi-tenant index design — multiple mall search services on a single cluster
Tauri 2.0 AI Desktop App
A Tauri 2.0 cross-platform AI desktop app with 1/10 the binary size and 1/3 the memory of Electron. A Remote WebView architecture renders a remote server UI directly in the local app without a frontend build, and it implements mistral.rs-based local LLM inference, NAT traversal via a Bore tunnel, and automatic switching between 3 operating modes (local/remote/hybrid).
- Rust Sidecar pattern — Python services auto start/stop with the app
- Remote WebView removes the frontend build — shorter deploy time
- Custom-built mistral.rs local LLM inference + Bore tunnel NAT traversal
i-Scream Mall AI Search
A case study of building and operating an AI search system in production for an education-focused shopping mall (i-Scream Mall). Semantic search + LLM query expansion were applied on a NestJS search engine to improve product search accuracy. It reliably handles 5,000+ TPS peak traffic, and a later Rust rewrite further cut operating costs.
- Stable handling of 5,000+ TPS peak traffic — zero-downtime production operation
- Semantic search resolves keyword mismatch — improved search conversion
- NestJS → Rust rewrite cut memory to 1/5 and improved response 30%
Open Source
graph-tool-call
An LLM-agent tool engine that searches 1,000+ API tools on a graph. Supports an MCP Proxy gateway.
synaptic-memory
A brain-inspired knowledge graph — Spreading Activation, Hebbian Learning, Memory Consolidation. MCP server with 16 tools.
devhub-social
A unified async client for 9 developer-community platforms, including Dev.to, Bluesky, Twitter/X, and Reddit.
ku-portal-mcp
An MCP server for Korea University's KUPID portal + Canvas LMS — notices, timetable, library, assignments, grades.
Tech Stack
Languages & Frameworks
AI / ML
Infrastructure & CI/CD
Timeline
- Developed graph-tool-call open source (1,068-tool graph search, MCP Proxy, PyPI release)
- gwanjong-mcp AI social agent (9 platforms, MCP Pipeline)
- Developed Synaptic Memory open source (brain-inspired knowledge graph, MCP server, PyPI release)
- Integrated synaptic-memory + graph-tool-call into the Hive Corp autonomous AI ops platform
- Cash-flow prediction time-series ML ensemble system
- Built the XGEN 2.0 AI agent platform (K8s/ArgoCD infra, frontend, workflow)
- Developed the Knowledge Graph visualization system
- Developed a Tauri 2.0 cross-platform desktop app
- Built a Rust commerce search engine (28ms, 2,100 req/s)
- Built an AI agent browser automation system
- Operated Moon / i-Scream Mall commerce search services
- Built XGEN 1.0 infrastructure and GPU model serving
- Built a NestJS hybrid search engine (318 commits, 14 months)
- Designed and developed a semantic search API
- Built Qdrant vector-DB-based semantic search
- Developed the Aurora commerce search API (OpenSearch multi-index)
- Demand forecasting API / persona recommendation system