AI Engineer
Sungjun Son
LLM Serving · Search Engine · Full Stack · DevOps
sonsj97@plateer.com · sonsj97@gmail.com
About
I started with commerce search engines. I put OpenSearch keyword search into a live service handling 5,000+ TPS, then rewrote it in Rust to bring response time down to 28ms. To improve search quality, I introduced OpenSearch k-NN and Qdrant hybrid search along with LLM query expansion, expanding into AI Search.
Since then I have been leading development of an AI agent platform (XGEN 2.0). It spans vLLM/llama.cpp-based multi-GPU (CUDA, ROCm) LLM serving, LangChain/LangGraph-based Iterative RAG, a GraphDB knowledge graph, MCP-based AI agents, and a workflow engine. I build and operate the entire stack of an AI service, from search to inference, automation, and infrastructure, running 7 microservices on Kubernetes with ArgoCD GitOps.
Recently I have been focusing on the MCP (Model Context Protocol) ecosystem. I develop graph-tool-call, an open-source engine that searches 1,000+ API tools on a graph, and operate gwanjong-mcp, an agent that automates 9 social platforms through an MCP pipeline. From AI tool retrieval and social automation to a work knowledge base, I design and operate MCP systems in production.
Expertise
Search Engine
90 posts
- OpenSearch k-NN / Hybrid Search
- Qdrant Vector DB
- Rust Axum Search API
- NestJS Hybrid Search
- RAG / Semantic Search
AI / ML
73 posts
- vLLM / llama.cpp GPU Serving
- MCP-based AI Agent Design
- Graph-based Tool Retrieval Engine
- LangChain / LangGraph RAG
- XGEN AI Agent Platform
Full Stack
62 posts
- Next.js / React UI
- Rust API Gateway
- Tauri Desktop App
- Python Async Services
- WebSocket / SSE Realtime
DevOps
40 posts
- K8s / K3s Cluster Operations
- ArgoCD GitOps Deployment
- Jenkins CI/CD Pipeline
- Docker Multi-stage Build
- Istio / Let's Encrypt
Projects
XGEN 2.0 — AI Agent Platform
An enterprise AI agent platform built from 7 microservices (Model Serving, API Gateway, Core, Workflow, Retrieval, Documents, Frontend). A 4-tier Backend Adapter pattern auto-detects NVIDIA CUDA / AMD ROCm / Vulkan GPUs and dynamically switches vLLM and llama.cpp backends, serving up to 20 models concurrently on a single server. An Iterative RAG pipeline (query expansion → large top-100 retrieval → iterative LLM filtering → compression) improved search accuracy over a simple top-k baseline, and hybrid search (Dense + BM25 Sparse) was applied using Qdrant Prefetch + RRF (Reciprocal Rank Fusion).
- About 15x higher LLM inference throughput vs. Transformers (12.5 → 185.3 tokens/sec, vLLM PagedAttention + Continuous Batching)
- 3x faster container startup (45s → 15s), 20% less memory — after removing Ray Serve and moving to a single FastAPI process
- 3.75x faster embedding (45s → 12s for a 10MB PDF) — Switch-Backend dual mode + batch size 512 → 2048
- ArgoCD GitOps pipeline cut deploy time 15min → 3min, 30s rollback, 90% fewer deploy errors, 99.9% availability
- Enterprise RBAC (5-level role hierarchy) + full API I/O audit logging + MCP tool-level permission control
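The Dense + BM25 fusion step described above can be illustrated with a minimal Reciprocal Rank Fusion sketch. In the platform itself the fusion is performed by Qdrant; the function name `rrf_fuse`, the sample document IDs, and the constant `k=60` (a commonly used default) are assumptions for illustration only.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of document IDs with Reciprocal Rank Fusion.

    Each document receives 1 / (k + rank) from every list it appears in;
    the contributions are summed and documents are sorted by the total.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical result lists from a dense (vector) and a sparse (BM25) retriever.
dense_hits = ["p3", "p1", "p7"]
sparse_hits = ["p1", "p9", "p3"]
fused = rrf_fuse([dense_hits, sparse_hits])
print(fused)
```

Because RRF only uses ranks, the dense and sparse scores never need to be normalized onto a common scale, which is the usual reason for choosing it in hybrid search.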
graph-tool-call — Graph Tool Retrieval Engine
A graph-based retrieval engine that lets an LLM precisely find the tool it needs among 1,000+ API tools. It parses OpenAPI specs to build a 3-tier weighted graph (Tag → Operation → Parameter), and achieves higher accuracy than Vector/BM25 via BFS propagation + IDF weighting. An MCP Proxy mode provides a gateway that collapses many MCP servers into just 2 meta-tools.
- On a 1,068-tool benchmark, 2x recall and 40% higher accuracy vs. Vector
- MCP Proxy gateway mode — N MCP servers collapsed into 2 meta-tools (1-hop direct calling)
- Workflow chain engine — auto-composes multi-step tool calls into a DAG
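As a rough illustration of BFS propagation with IDF weighting over a Tag → Operation → Parameter graph, the sketch below seeds every node with the IDF mass of the query tokens it matches, then spreads that score to neighbors with a per-hop decay. It is not the graph-tool-call implementation: the node names, the uniform `decay`, the `max_hops` limit, and the IDF formula are all assumptions, and the real engine's edge weights are not modeled.

```python
import math
from collections import defaultdict, deque


def idf_weights(node_tokens):
    """IDF per token across all graph nodes, so rare tokens weigh more."""
    n = len(node_tokens)
    df = defaultdict(int)
    for tokens in node_tokens.values():
        for tok in set(tokens):
            df[tok] += 1
    return {tok: math.log(1 + n / count) for tok, count in df.items()}


def retrieve(query_tokens, node_tokens, links, decay=0.5, max_hops=2):
    """Rank operation nodes by IDF seed scores propagated over the graph."""
    adj = defaultdict(set)
    for a, b in links:  # treat links as undirected
        adj[a].add(b)
        adj[b].add(a)
    idf = idf_weights(node_tokens)
    query = set(query_tokens)
    scores = defaultdict(float)
    for start, tokens in node_tokens.items():
        seed = sum(idf[t] for t in query & set(tokens))
        if seed == 0:
            continue
        # Breadth-first propagation from this seed, decaying once per hop.
        seen, queue = {start}, deque([(start, 0)])
        while queue:
            node, hops = queue.popleft()
            scores[node] += seed * decay ** hops
            if hops < max_hops:
                for nb in adj[node] - seen:
                    seen.add(nb)
                    queue.append((nb, hops + 1))
    ops = [n for n in node_tokens if n.startswith("op:") and scores[n] > 0]
    return sorted(ops, key=lambda n: (-scores[n], n))


# Hypothetical graph parsed from an OpenAPI spec.
node_tokens = {
    "tag:orders": ["orders"],
    "tag:users": ["users"],
    "op:createOrder": ["create", "order"],
    "op:cancelOrder": ["cancel", "order"],
    "op:listUsers": ["list", "users"],
    "param:orderId": ["order", "id"],
}
links = [
    ("tag:orders", "op:createOrder"),
    ("tag:orders", "op:cancelOrder"),
    ("tag:users", "op:listUsers"),
    ("op:cancelOrder", "param:orderId"),
]
ranked_ops = retrieve(["cancel", "order"], node_tokens, links)
print(ranked_ops)
```

In this toy graph the related operation `op:createOrder` is surfaced through its shared tag even though it only matches the common token "order", which is the kind of neighborhood signal a pure vector or BM25 lookup does not use.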
gwanjong-mcp — AI Social Agent
An AI social agent that automates 9 social platforms (Dev.to, Bluesky, Twitter, Reddit, Mastodon, HN, Stack Overflow, GitHub Discussions, Discourse) through an MCP pipeline. Platforms are abstracted with the devhub-social adapter pattern, and the mcp-pipeline stores/requires chain composes a 3-stage Scout → Draft → Strike pipeline.
- Scaled 4 → 9 platforms — adapter pattern minimizes per-platform code
- stores/requires chain auto-resolves dependencies across multi-step pipelines
- Campaign GTM + anti-spam system — rate limiter, content validation, per-platform policy compliance
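One way a stores/requires chain could resolve stage order is a topological sort over "who stores what I require". The sketch below assumes each stage declares the keys it requires and the keys it stores; the key names (`opportunities`, `draft`, `post_url`) and the dict layout are hypothetical and not taken from mcp-pipeline.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def resolve_order(stages):
    """Return stage names in an order where every required key is stored first."""
    producer = {}
    for name, spec in stages.items():
        for key in spec["stores"]:
            producer[key] = name
    graph = {}
    for name, spec in stages.items():
        deps = set()
        for key in spec["requires"]:
            if key not in producer:
                raise KeyError(f"no stage stores {key!r}, required by {name!r}")
            deps.add(producer[key])
        graph[name] = deps  # node -> stages that must run before it
    return list(TopologicalSorter(graph).static_order())


# Stages declared out of order on purpose; keys are illustrative.
stages = {
    "strike": {"requires": ["draft"], "stores": ["post_url"]},
    "draft": {"requires": ["opportunities"], "stores": ["draft"]},
    "scout": {"requires": [], "stores": ["opportunities"]},
}
order = resolve_order(stages)
print(order)
```

Declaring data dependencies instead of an explicit sequence means a new stage can be added without editing the existing ones, and a missing producer fails fast before anything runs.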
Synaptic Memory — Brain-inspired Knowledge Graph
A brain-inspired knowledge graph library + MCP server for LLM agents. With Spreading Activation (associative retrieval), Hebbian Learning (experiential learning), and 4-stage Memory Consolidation (L0 to L3 auto promotion/eviction), agents automatically structure and retrieve past experience. With full-text search (FTS) alone it reached MRR 0.793 (finance/medical/legal) and nDCG 0.636 on HotPotQA.
- 16 MCP tools — Auto-ontology (rules + LLM + embedding) construction
- 5-axis ranking (relevance × importance × recency × vitality × context)
- Zero-dep core — swappable SQLite/PostgreSQL/Qdrant/Neo4j backends
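The 5-axis ranking listed above combines its axes multiplicatively. A minimal sketch of that idea follows, assuming each axis is already normalized to the range 0 to 1; the `Memory` dataclass, its field values, and the absence of per-axis weights are assumptions, not the library's actual data model.

```python
from dataclasses import dataclass


@dataclass
class Memory:
    id: str
    relevance: float
    importance: float
    recency: float
    vitality: float
    context: float


def rank_score(m: Memory) -> float:
    """Multiplicative score: a near-zero value on any axis suppresses the node."""
    return m.relevance * m.importance * m.recency * m.vitality * m.context


# Hypothetical candidates with axes normalized to [0, 1].
candidates = [
    Memory("a", relevance=0.9, importance=0.5, recency=0.8, vitality=1.0, context=1.0),
    Memory("b", relevance=0.6, importance=0.9, recency=0.9, vitality=0.9, context=1.0),
    Memory("c", relevance=0.95, importance=0.9, recency=0.05, vitality=1.0, context=1.0),
]
ranked = sorted(candidates, key=rank_score, reverse=True)
print([m.id for m in ranked])
```

Here the most relevant candidate, `c`, ranks last because it is stale, which shows how a product differs from a weighted sum where one strong axis can compensate for a weak one.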
Rust Commerce Search Engine
A commerce search API server rewritten in Rust/Axum to overcome the performance limits of a NestJS search engine. It implements concurrent multi-index OpenSearch search, Redis caching, and unified search across multiple data sources (products/brands/categories). Compared with the NestJS version, it uses one-fifth the memory, responds 30% faster, and doubles indexing throughput.
- 28ms average response, 2,100 req/s — Tokio async runtime + Tower middleware
- 12MB idle memory (vs. 60MB on NestJS, 1/5) — leveraging zero-cost abstractions
- Jenkins → Docker → K8s automated deployment pipeline
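The concurrent multi-index search pattern can be sketched as a simple fan-out. The production service is written in Rust on Tokio; this Python `asyncio` version only illustrates the shape of the pattern. The `search_index` coroutine is a stand-in that simulates latency instead of calling OpenSearch, and the result format is hypothetical.

```python
import asyncio


async def search_index(index: str, query: str) -> dict:
    """Stand-in for one search request; a real service would query the cluster here."""
    await asyncio.sleep(0.01)  # simulated network latency
    return {"index": index, "hits": [f"{index}:{query}"]}


async def unified_search(query: str, indices=("products", "brands", "categories")) -> dict:
    # Issue all index searches at once; total latency is roughly that of the
    # slowest request rather than the sum of all of them.
    results = await asyncio.gather(*(search_index(i, query) for i in indices))
    return {r["index"]: r["hits"] for r in results}


merged = asyncio.run(unified_search("shoes"))
print(merged)
```

`asyncio.gather` returns results in the order the coroutines were passed, so the merged response stays stable regardless of which index answers first.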
AI Agent Browser Automation
An LLM-based browser automation agent with a 4-layer architecture (Orchestrator → Planner → Navigator → Extractor). It dynamically registers tools via MCP, and combines Playwright-based DOM parsing with CSS-selector confidence scoring to build automation that is robust to web structure changes. Built from prototype to production in 49 commits over 4 days.
- Human-in-the-Loop raised task completion from 30% → 95%
- 5.5x fewer MCP tool calls — DOM context pre-injected at the planning stage
- No-code automation: scenario recorder → JSON playbook → repeatable execution
NestJS Hybrid Search Engine
A commerce hybrid search engine grown over 14 months and 318 commits. It combines OpenSearch keyword search with Qdrant 384-dimensional vector semantic search via RRF, and improved search accuracy by 40% through LLM-based query expansion (synonyms/intent analysis) and a reranking pipeline. A Nori morphological analyzer detects Korean verbs to skip unnecessary GPT calls, cutting response time from 2-3s to 300ms.
- 40% higher search accuracy from semantic search (resolving keyword mismatch)
- Nori verb detection optimizes GPT calls: 2-3s → 300ms response
- Multi-tenant index design — multiple mall search services on a single cluster
Tauri 2.0 AI Desktop App
A Tauri 2.0 cross-platform AI desktop app with 1/10 the binary size and 1/3 the memory of Electron. A Remote WebView architecture renders a remote server UI directly in the local app without a frontend build, and it implements mistral.rs-based local LLM inference, NAT traversal via a Bore tunnel, and automatic switching between 3 operating modes (local/remote/hybrid).
- Rust Sidecar pattern — Python services auto start/stop with the app
- Remote WebView removes the frontend build — shorter deploy time
- Custom-built mistral.rs local LLM inference + Bore tunnel NAT traversal
i-Scream Mall AI Search
A case study of building and operating an AI search system in production for an education-focused shopping mall (i-Scream Mall). Semantic search + LLM query expansion were applied on a NestJS search engine to improve product search accuracy. It reliably handles 5,000+ TPS peak traffic, and a later Rust rewrite further cut operating costs.
- Stable handling of 5,000+ TPS peak traffic — zero-downtime production operation
- Semantic search resolves keyword mismatch — improved search conversion
- NestJS → Rust rewrite cut memory to 1/5 and improved response 30%
Open Source
graph-tool-call
An LLM-agent tool engine that searches 1,000+ API tools on a graph. Supports an MCP Proxy gateway.
synaptic-memory
A brain-inspired knowledge graph — Spreading Activation, Hebbian Learning, Memory Consolidation. MCP server with 16 tools.
devhub-social
A unified async client for 9 developer community platforms, including Dev.to, Bluesky, Twitter/X, and Reddit.
ku-portal-mcp
An MCP server for Korea University's KUPID portal + Canvas LMS — notices, timetable, library, assignments, grades.
Tech Stack
Languages & Frameworks
AI / ML
Infrastructure & CI/CD
Timeline
- Developed graph-tool-call open source (1,068-tool graph search, MCP Proxy, PyPI release)
- gwanjong-mcp AI social agent (9 platforms, MCP Pipeline)
- Developed Synaptic Memory open source (brain-inspired knowledge graph, MCP server, PyPI release)
- Integrated synaptic-memory + graph-tool-call into the Hive Corp autonomous AI ops platform
- Cash-flow prediction time-series ML ensemble system
- Built the XGEN 2.0 AI agent platform (K8s/ArgoCD infra, frontend, workflow)
- Developed the Knowledge Graph visualization system
- Developed a Tauri 2.0 cross-platform desktop app
- Built a Rust commerce search engine (28ms, 2,100 req/s)
- Built an AI agent browser automation system
- Operated Moon / i-Scream Mall commerce search services
- Built XGEN 1.0 infrastructure and GPU model serving
- Built a NestJS hybrid search engine (318 commits, 14 months)
- Designed and developed a semantic search API
- Built Qdrant vector-DB-based semantic search
- Developed the Aurora commerce search API (OpenSearch multi-index)
- Demand forecasting API / persona recommendation system