AI Engineer

Sungjun Son

LLM Serving · Search Engine · Full Stack · DevOps

279+ Tech Articles
4 Domains
4+ Years

sonsj97@plateer.com · sonsj97@gmail.com

01.

About

I started with commerce search engines. I brought OpenSearch keyword search into a live service handling 5,000+ TPS, then rewrote it in Rust, bringing response time down to 28ms. To improve search quality, I added hybrid search with OpenSearch k-NN and Qdrant, plus LLM query expansion, which took my work into AI search.

Since then I have been leading the development of an AI agent platform (XGEN 2.0). The work spans vLLM/llama.cpp-based multi-GPU (CUDA, ROCm) LLM serving, LangChain/LangGraph-based Iterative RAG, a GraphDB knowledge graph, MCP-based AI agents, and a workflow engine. I build and operate the entire stack of an AI service, from search to inference, automation, and infrastructure, running 7 microservices on Kubernetes with ArgoCD GitOps.

Recently I have been focusing on the MCP (Model Context Protocol) ecosystem. I develop graph-tool-call, an open-source engine that searches 1,000+ API tools over a graph, and operate gwanjong-mcp, an agent that automates 9 social platforms through an MCP pipeline. I design and operate MCP in production, from AI tool retrieval and social automation to a work knowledge base.

Languages Rust Python TypeScript Go
Frameworks NestJS Next.js FastAPI Axum Tauri React
AI / ML vLLM llama.cpp Qdrant OpenSearch HuggingFace LangChain LangGraph MCP
Infra Kubernetes Docker ArgoCD Jenkins Redis

02.

Expertise

03.

Projects

Featured Project 40+ related posts

XGEN 2.0 — AI Agent Platform

Search AI/ML Full Stack DevOps

An enterprise AI agent platform built from 7 microservices (Model Serving, API Gateway, Core, Workflow, Retrieval, Documents, Frontend). A 4-tier Backend Adapter pattern auto-detects NVIDIA CUDA / AMD ROCm / Vulkan GPUs and dynamically switches vLLM and llama.cpp backends, serving up to 20 models concurrently on a single server. An Iterative RAG pipeline (query expansion → large top-100 retrieval → iterative LLM filtering → compression) improved search accuracy over a simple top-k baseline, and hybrid search (Dense + BM25 Sparse) was applied using Qdrant Prefetch + RRF (Reciprocal Rank Fusion).

  • 15x higher LLM inference throughput vs. Transformers (12.5 → 185.3 tokens/sec, vLLM PagedAttention + Continuous Batching)
  • 3x faster container startup (45s → 15s), 20% less memory — after removing Ray Serve and moving to a single FastAPI process
  • 3.75x faster embedding (45s → 12s for a 10MB PDF) — Switch-Backend dual mode + batch size 512 → 2048
  • ArgoCD GitOps pipeline cut deploy time 15min → 3min, 30s rollback, 90% fewer deploy errors, 99.9% availability
  • Enterprise RBAC (5-level role hierarchy) + full API I/O audit logging + MCP tool-level permission control
Python Rust TypeScript K8s / K3s vLLM llama.cpp Qdrant FastAPI Next.js ArgoCD
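The hybrid search step above fuses a dense ranking and a BM25 sparse ranking with Reciprocal Rank Fusion. In XGEN the fusion is done by Qdrant, so the following is only a minimal standalone sketch of the RRF formula itself, not Qdrant's API. The function name `rrf_fuse`, the sample document IDs, and the constant `k=60` (a commonly used default) are assumptions for illustration.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of document IDs with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed, commonly used default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


dense_hits = ["doc_a", "doc_b", "doc_c"]   # hypothetical dense (vector) results
sparse_hits = ["doc_b", "doc_d", "doc_a"]  # hypothetical BM25 sparse results

fused = rrf_fuse([dense_hits, sparse_hits])
print([doc_id for doc_id, _ in fused])
```

Because RRF uses only rank positions, the dense and sparse scores never need to be normalized onto a common scale, which is the usual reason to prefer it for combining vector and BM25 results.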
AI/ML 4 posts

graph-tool-call — Graph Tool Retrieval Engine

A graph-based retrieval engine that lets an LLM precisely find the tool it needs among 1,000+ API tools. It parses OpenAPI specs to build a 3-tier weighted graph (Tag → Operation → Parameter), and achieves higher accuracy than Vector/BM25 via BFS propagation + IDF weighting. An MCP Proxy mode provides a gateway that collapses many MCP servers into just 2 meta-tools.

  • On a 1,068-tool benchmark, 2x recall and 40% higher accuracy vs. Vector
  • MCP Proxy gateway mode — N MCP servers collapsed into 2 meta-tools (1-hop direct calling)
  • Workflow chain engine — auto-composes multi-step tool calls into a DAG
Python MCP OpenAPI Graph BFS PyPI
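The idea of BFS propagation with IDF weighting over a Tag → Operation → Parameter graph can be illustrated with a small sketch. This is not graph-tool-call's actual implementation: the tool list, the tokenizer, the decay factor, and the hop limit below are all assumptions chosen to keep the example short.

```python
import math
from collections import defaultdict, deque

# Hypothetical stand-in for tools parsed from OpenAPI specs.
TOOLS = [
    {"tag": "orders", "operation": "create_order", "params": ["customer_id", "items"]},
    {"tag": "orders", "operation": "cancel_order", "params": ["order_id"]},
    {"tag": "users", "operation": "get_user", "params": ["customer_id"]},
]


def tokens(name):
    """Split a snake_case name into a set of lowercase tokens."""
    return set(name.lower().split("_"))


def build_graph(tools):
    """Build an undirected Tag - Operation - Parameter adjacency map."""
    graph = defaultdict(set)
    for tool in tools:
        op = ("op", tool["operation"])
        tag = ("tag", tool["tag"])
        graph[tag].add(op)
        graph[op].add(tag)
        for name in tool["params"]:
            param = ("param", name)
            graph[op].add(param)
            graph[param].add(op)
    return graph


def idf_table(graph):
    """IDF per token over node names, so rare tokens weigh more."""
    doc_freq = defaultdict(int)
    for _, name in graph:
        for token in tokens(name):
            doc_freq[token] += 1
    return {t: math.log(1 + len(graph) / c) for t, c in doc_freq.items()}


def rank_operations(query, graph, decay=0.5, max_hops=2):
    """Seed nodes by IDF-weighted token overlap, then spread scores via BFS."""
    idf = idf_table(graph)
    query_tokens = set(query.lower().split())
    scores = defaultdict(float)
    for node in graph:
        seed = sum(idf[t] for t in tokens(node[1]) & query_tokens)
        if seed == 0:
            continue
        seen = {node}
        queue = deque([(node, 0)])
        while queue:
            current, hops = queue.popleft()
            scores[current] += seed * decay ** hops  # weaker with distance
            if hops < max_hops:
                for neighbor in graph[current]:
                    if neighbor not in seen:
                        seen.add(neighbor)
                        queue.append((neighbor, hops + 1))
    ops = [(name, s) for (kind, name), s in scores.items() if kind == "op"]
    return sorted(ops, key=lambda item: (-item[1], item[0]))


graph = build_graph(TOOLS)
ranked = rank_operations("cancel order", graph)
print([name for name, _ in ranked])
```

In this toy graph, `cancel_order` ranks first because it matches the rare token "cancel" directly, while `create_order` picks up score both from the shared token "order" and from propagation through the common `orders` tag.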
AI/ML Full Stack 2 posts

gwanjong-mcp — AI Social Agent

An AI social agent that automates 9 social platforms (Dev.to, Bluesky, Twitter, Reddit, Mastodon, HN, Stack Overflow, GitHub Discussions, Discourse) through an MCP pipeline. Platforms are abstracted with the devhub-social adapter pattern, and the mcp-pipeline stores/requires chain composes a 3-stage Scout → Draft → Strike pipeline.

  • Scaled 4 → 9 platforms — adapter pattern minimizes per-platform code
  • stores/requires chain auto-resolves dependencies across multi-step pipelines
  • Campaign GTM + anti-spam system — rate limiter, content validation, per-platform policy compliance
Python MCP TypeScript 9 Platforms Pipeline
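One way to picture how a stores/requires chain can resolve stage order automatically is a topological sort over the keys each stage produces and consumes. The sketch below is an assumption about the general approach, not the actual mcp-pipeline schema: the `STAGES` dictionary, its key names, and `resolve_order` are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical stage declarations, deliberately listed out of order.
STAGES = {
    "strike": {"requires": ["draft"], "stores": ["post_url"]},
    "scout": {"requires": [], "stores": ["opportunities"]},
    "draft": {"requires": ["opportunities"], "stores": ["draft"]},
}


def resolve_order(stages):
    """Order stages so every required key is stored by an earlier stage."""
    # Map each stored key to the stage that produces it.
    producers = {}
    for name, spec in stages.items():
        for key in spec["stores"]:
            producers[key] = name

    sorter = TopologicalSorter()
    for name, spec in stages.items():
        dependencies = []
        for key in spec["requires"]:
            if key not in producers:
                raise KeyError(f"no stage stores {key!r}, required by {name!r}")
            dependencies.append(producers[key])
        sorter.add(name, *dependencies)
    # static_order() raises graphlib.CycleError if the stages form a cycle.
    return list(sorter.static_order())


print(resolve_order(STAGES))
```

With the declarations above, the resolved order is Scout, then Draft, then Strike, regardless of the order in which the stages were declared.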
AI/ML PyPI

Synaptic Memory — Brain-inspired Knowledge Graph

A brain-inspired knowledge graph library + MCP server for LLM agents. With Spreading Activation (associative retrieval), Hebbian Learning (experiential learning), and 4-stage Memory Consolidation (automatic promotion/eviction across L0 to L3), agents automatically structure and retrieve past experience. It reached an MRR of 0.793 (finance/medical/legal) with FTS alone, and an nDCG of 0.636 on HotPotQA.

  • 16 MCP tools — Auto-ontology (rules + LLM + embedding) construction
  • 5-axis ranking (relevance × importance × recency × vitality × context)
  • Zero-dep core — swappable SQLite/PostgreSQL/Qdrant/Neo4j backends
Python MCP Knowledge Graph Hebbian PyPI
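Spreading activation and Hebbian learning can be sketched together in a few lines. This is a generic illustration of the two techniques, not Synaptic Memory's API: the class `AssociativeMemory`, its parameters (`decay`, `threshold`, `learning_rate`), and the sample nodes are assumptions.

```python
from collections import defaultdict


class AssociativeMemory:
    """Toy weighted graph with spreading activation and Hebbian updates."""

    def __init__(self, learning_rate=0.1):
        self.weights = defaultdict(dict)  # node -> {neighbor: weight in [0, 1]}
        self.learning_rate = learning_rate

    def link(self, a, b, weight=0.5):
        self.weights[a][b] = weight
        self.weights[b][a] = weight

    def activate(self, seeds, decay=0.5, threshold=0.05, max_steps=3):
        """Spread energy outward from seeds, attenuated by weight and decay."""
        activation = {seed: 1.0 for seed in seeds}
        frontier = dict(activation)
        for _ in range(max_steps):
            next_frontier = {}
            for node, energy in frontier.items():
                for neighbor, weight in self.weights.get(node, {}).items():
                    passed = energy * weight * decay
                    # Keep only the strongest activation seen for each node.
                    if passed >= threshold and passed > activation.get(neighbor, 0.0):
                        activation[neighbor] = passed
                        next_frontier[neighbor] = passed
            if not next_frontier:
                break
            frontier = next_frontier
        return activation

    def reinforce(self, co_activated):
        """Hebbian update: nodes used together get a stronger connection."""
        nodes = list(co_activated)
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                old = self.weights[a].get(b, 0.0)
                new = old + self.learning_rate * (1.0 - old)
                self.weights[a][b] = new
                self.weights[b][a] = new


memory = AssociativeMemory()
memory.link("deploy", "argocd", 0.8)
memory.link("argocd", "rollback", 0.6)

before = memory.activate(["deploy"])
memory.reinforce(["argocd", "rollback"])  # these two were used together
after = memory.activate(["deploy"])
print(before["rollback"], after["rollback"])
```

After the reinforcement step, querying from the same seed activates `rollback` more strongly, which is the sense in which repeated co-use makes related memories easier to retrieve.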
Search 12 posts

Rust Commerce Search Engine

A commerce search API server rewritten in Rust/Axum to overcome the performance limits of a NestJS search engine. It implements concurrent multi-index OpenSearch search, Redis caching, and unified search across multiple data sources (products/brands/categories). Compared with NestJS, it uses 1/5 the memory, responds 30% faster, and doubles indexing throughput.

  • 28ms average response, 2,100 req/s — Tokio async runtime + Tower middleware
  • 12MB idle memory (vs. 60MB on NestJS, 1/5) — leveraging zero-cost abstractions
  • Jenkins → Docker → K8s automated deployment pipeline
Rust Axum Tokio OpenSearch Redis Docker
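The concurrent multi-index search with a cache in front can be sketched as a fan-out over indices. The production service is written in Rust on Tokio, so the following is only a Python asyncio analogue of the pattern: the in-memory `FAKE_INDICES`, the dictionary `cache` standing in for Redis, and both function names are hypothetical.

```python
import asyncio

# Hypothetical in-memory stand-ins for OpenSearch indices and the Redis cache.
FAKE_INDICES = {
    "products": ["red sneakers", "red scarf"],
    "brands": ["Redwing"],
    "categories": ["shoes"],
}
cache = {}


async def search_index(index, query):
    """Pretend to query one index; the sleep simulates network latency."""
    await asyncio.sleep(0.01)
    return [doc for doc in FAKE_INDICES[index] if query.lower() in doc.lower()]


async def unified_search(query):
    """Return cached results if present, else query all indices concurrently."""
    if query in cache:
        return cache[query]
    names = list(FAKE_INDICES)
    # gather() runs the per-index searches concurrently and keeps their order.
    results = await asyncio.gather(*(search_index(name, query) for name in names))
    merged = dict(zip(names, results))
    cache[query] = merged
    return merged


result = asyncio.run(unified_search("red"))
print(result)
```

Since the per-index requests overlap, total latency is bounded by the slowest index rather than the sum of all three, and repeated queries are served from the cache without touching the indices.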
AI/ML 15 posts

AI Agent Browser Automation

An LLM-based browser automation agent with a 4-layer architecture (Orchestrator → Planner → Navigator → Extractor). It dynamically registers tools via MCP, and combines Playwright-based DOM parsing with CSS-selector confidence scoring to build automation that is robust to web structure changes. It went from prototype to production in 49 commits over 4 days.

  • Human-in-the-Loop raised task completion from 30% to 95%
  • 5.5x fewer MCP tool calls — DOM context pre-injected at the planning stage
  • No-code automation: scenario recorder → JSON playbook → repeatable execution
TypeScript Python Playwright MCP LLM Next.js

Search 10 posts

NestJS Hybrid Search Engine

A commerce hybrid search engine developed over 14 months and 318 commits. It combines OpenSearch keyword search with Qdrant 384-dimensional vector semantic search via RRF, and improved search accuracy by 40% through LLM-based query expansion (synonyms/intent analysis) and a reranking pipeline. A Nori morphological analyzer detects Korean verbs to skip unnecessary GPT calls, cutting response time from 2-3s to 300ms.

  • 40% higher search accuracy from semantic search (resolving keyword mismatch)
  • Nori verb detection optimizes GPT calls: 2-3s → 300ms response
  • Multi-tenant index design — multiple mall search services on a single cluster
NestJS OpenSearch Qdrant Nori Python FastEmbed

Full Stack 10 posts

Tauri 2.0 AI Desktop App

A Tauri 2.0 cross-platform AI desktop app with 1/10 the binary size and 1/3 the memory of Electron. A Remote WebView architecture renders a remote server UI directly in the local app without a frontend build, and it implements mistral.rs-based local LLM inference, NAT traversal via a Bore tunnel, and automatic switching between 3 operating modes (local/remote/hybrid).

  • Rust Sidecar pattern — Python services auto start/stop with the app
  • Remote WebView removes the frontend build — shorter deploy time
  • Custom-built mistral.rs local LLM inference + Bore tunnel NAT traversal
Tauri 2.0 Rust React TypeScript mistral.rs

Search AI/ML Case Study

i-Scream Mall AI Search

A case study of building and operating an AI search system in production for an education-focused shopping mall (i-Scream Mall). Semantic search + LLM query expansion were applied on a NestJS search engine to improve product search accuracy. It reliably handles 5,000+ TPS peak traffic, and a later Rust rewrite further cut operating costs.

  • Stable handling of 5,000+ TPS peak traffic — zero-downtime production operation
  • Semantic search resolves keyword mismatch — improved search conversion
  • NestJS → Rust rewrite cut memory to 1/5 and improved response 30%
NestJS Rust OpenSearch Nori LLM

04.

Open Source

05.

Tech Stack

Languages & Frameworks

Rust Python TypeScript Go Axum NestJS Next.js FastAPI Tauri React

AI / ML

vLLM llama.cpp Qdrant OpenSearch k-NN HuggingFace LangChain LangGraph MCP FAISS FastEmbed

Infrastructure & CI/CD

Kubernetes Docker K3s Redis Istio Jenkins ArgoCD Caddy GitHub Actions GitLab CI

06.

Timeline

07.

By the Numbers

279+ Tech Blog Posts
7 Microservices (XGEN)
5 Open Source Projects
1,068 Tool Benchmark (graph-tool-call)
9 Social Platforms Automated (gwanjong)
28ms Rust Search Engine Response
5,000+ TPS Peak Traffic Handled
15x LLM Inference Throughput Gain