AI Engineer

Sungjun Son

LLM Serving · Search Engine · Full Stack · DevOps

274+ Tech Articles
4 Domains
4+ Years

sonsj97@plateer.com · sonsj97@gmail.com

01.

About

I'm an AI engineer who designs and operates AI agent platforms. From vLLM/llama.cpp multi-GPU (CUDA·ROCm) LLM serving, LangChain/LangGraph Iterative RAG, and Neo4j knowledge graphs to MCP-based AI agents, I build the entire AI service stack across inference, search, automation, and infrastructure.

Today I lead an 8-person team at Plateer building the enterprise AI platform (XGEN), and as Tech & Consulting Part Leader I work directly with clients to design their AI adoption. I cut token cost ~60% on a regional bank's GenAI platform, co-ran Intel Gaudi2/3 LLM inference PoCs, and advised 30+ enterprise clients on AI.

It all started with commerce search. I ran OpenSearch keyword search at 5,000+ TPS in production and rewrote it in Rust for 28ms responses, then moved through hybrid search and LLM query expansion (AI Search) to today's LLM and agent work. Lately I focus on the MCP ecosystem, open-sourcing LLM-agent infrastructure like graph-tool-call and gwanjong-mcp.

Languages: Rust · Python · TypeScript · Go
Frameworks: NestJS · Next.js · FastAPI · Axum · Tauri · React
AI / ML: vLLM · llama.cpp · Qdrant · OpenSearch · HuggingFace · LangChain · LangGraph · MCP · Neo4j
Infra: Kubernetes · Docker · ArgoCD · Jenkins · Redis
02.

Expertise

03.

Experience

Part Leader · AI Lab → Tech & Consulting

Plateer · EC Solution Lab, AI Lab
  • Designed & built the XGEN AI agent platform — vLLM/llama.cpp multi-GPU (CUDA·ROCm) LLM serving, LangChain/LangGraph Iterative RAG, Neo4j knowledge graph, MCP-based AI agents & workflow engine, and a new HWPX/DOCX/PPTX Document Adapter
  • Core engineer on a regional bank's GenAI platform — full-stack AI chatbot (SSE streaming, multi-step agents) from requirements to delivery, ~60% token cost reduction, plus CI/CD infrastructure
  • Co-ran Intel Gaudi2/3 LLM inference PoCs (DeepSeek·Llama·QwQ quantization/perf) and led XGEN demos·PoCs·consulting for 30+ enterprise clients
  • Operated 7 microservices on k3s/Istio/ArgoCD GitOps with dev/stg/prd multi-env design, and built the XGEN GS-certification (TTA) test environment
  • Led an 8-person AI Lab team — task allocation, code review, mentoring, and engineer hiring interviews

Platform Division Manager · Full Stack

Seoul IR Network
  • Revamped & operated the IRUP content clipping/analytics dashboard — 12 clients, real-time news/YouTube/report ingestion, auto dedup grouping, email/KakaoTalk alerts
  • Solo full-stack build of an internal e-approval system (Next.js/Node.js/Firebase): approval flow, permissions, real-time sync; from planning to deployment in 3 months

Engineering · Data Collection System

HJ Brain
  • Built & ran a 10,000+ docs/day web crawling system — 20+ news-source crawlers; retry/exception handling raised success rate 85% → 98%
  • HTML parsing, text normalization, similarity-based dedup, cron batch pipeline; async processing made collection 3x faster
04.

Projects

Featured Project 40+ related posts

XGEN 2.0 — AI Agent Platform

Search AI/ML Full Stack DevOps

An enterprise AI agent platform built from 7 microservices (Model Serving, API Gateway, Core, Workflow, Retrieval, Documents, Frontend). A 4-tier Backend Adapter pattern auto-detects NVIDIA CUDA / AMD ROCm / Vulkan GPUs and dynamically switches vLLM and llama.cpp backends, serving up to 20 models concurrently on a single server. An Iterative RAG pipeline (query expansion → large top-100 retrieval → iterative LLM filtering → compression) improved search accuracy over a simple top-k baseline, and hybrid search (Dense + BM25 Sparse) was applied using Qdrant Prefetch + RRF (Reciprocal Rank Fusion).

  • 15x higher LLM inference throughput vs. Transformers (12.5 → 185.3 tokens/sec, vLLM PagedAttention + Continuous Batching)
  • 3x faster container startup (45s → 15s), 20% less memory — after removing Ray Serve and moving to a single FastAPI process
  • 3.75x faster embedding (45s → 12s for a 10MB PDF) — Switch-Backend dual mode + batch size 512 → 2048
  • ArgoCD GitOps pipeline cut deploy time 15min → 3min, 30s rollback, 90% fewer deploy errors, 99.9% availability
  • Enterprise RBAC (5-level role hierarchy) + full API I/O audit logging + MCP tool-level permission control
Python Rust TypeScript K8s / K3s vLLM llama.cpp Qdrant FastAPI Next.js ArgoCD
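
The RRF step named in the XGEN hybrid search description can be illustrated with a minimal sketch. In XGEN the fusion is described as happening through Qdrant Prefetch + RRF; the plain-Python version below only shows the scoring rule itself, score(d) = Σ 1 / (k + rank). The function name `rrf_fuse`, the sample document IDs, and the constant `k=60` are assumptions made for illustration, not values taken from the platform.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of document IDs with Reciprocal Rank Fusion.

    Each document gets 1 / (k + rank) from every list it appears in
    (rank starts at 1). k=60 is an assumed, commonly used constant.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results from a dense (vector) and a sparse (BM25) retriever.
dense = ["doc_a", "doc_b", "doc_c"]
sparse = ["doc_b", "doc_d", "doc_a"]
fused = rrf_fuse([dense, sparse])
```

Documents that rank well in both lists (here `doc_b` and `doc_a`) rise above documents that appear in only one, which is the property that makes RRF a convenient way to merge dense and sparse results without normalizing their raw scores.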
AI/ML 4 posts

graph-tool-call — Graph Tool Retrieval Engine

A graph-based retrieval engine that lets an LLM precisely find the tool it needs among 1,000+ API tools. It parses OpenAPI specs to build a 3-tier weighted graph (Tag → Operation → Parameter), and achieves higher accuracy than Vector/BM25 via BFS propagation + IDF weighting. An MCP Proxy mode provides a gateway that collapses many MCP servers into just 2 meta-tools.

  • On a 1,068-tool benchmark, 2x recall and 40% higher accuracy vs. Vector
  • MCP Proxy gateway mode — N MCP servers collapsed into 2 meta-tools (1-hop direct calling)
  • Workflow chain engine — auto-composes multi-step tool calls into a DAG
Python MCP OpenAPI Graph BFS PyPI
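
The idea of BFS propagation with IDF weighting over a Tag → Operation → Parameter graph can be sketched roughly as follows. This is not the graph-tool-call implementation: the toy graph, the node naming scheme, the `decay` and `max_hops` parameters, and the smoothed IDF formula are all assumptions chosen to keep the example small.

```python
import math
from collections import defaultdict, deque

# Hypothetical toy graph: tag -> operation -> parameter edges.
EDGES = [
    ("tag:orders", "op:listOrders"),
    ("tag:orders", "op:cancelOrder"),
    ("tag:users", "op:getUser"),
    ("op:listOrders", "param:status"),
    ("op:cancelOrder", "param:order_id"),
    ("op:getUser", "param:user_id"),
]
# Keyword terms attached to each node (assumed, hand-written).
TERMS = {
    "tag:orders": {"orders"},
    "tag:users": {"users"},
    "op:listOrders": {"list", "orders"},
    "op:cancelOrder": {"cancel", "order"},
    "op:getUser": {"get", "user"},
    "param:status": {"status"},
    "param:order_id": {"order", "id"},
    "param:user_id": {"user", "id"},
}


def idf(term):
    """Smoothed inverse document frequency over graph nodes."""
    df = sum(term in terms for terms in TERMS.values())
    return math.log((1 + len(TERMS)) / (1 + df)) + 1


def rank_operations(query, decay=0.5, max_hops=2):
    """Seed nodes by IDF-weighted term matches, then spread scores by BFS."""
    adj = defaultdict(set)
    for a, b in EDGES:
        adj[a].add(b)
        adj[b].add(a)
    q = set(query.lower().split())
    seeds = {n: sum(idf(t) for t in terms & q) for n, terms in TERMS.items()}
    scores = defaultdict(float)
    for seed, weight in seeds.items():
        if weight == 0:
            continue
        seen = {seed}
        queue = deque([(seed, 0)])
        while queue:
            node, hops = queue.popleft()
            scores[node] += weight * decay ** hops  # attenuate per hop
            if hops < max_hops:
                for nb in sorted(adj[node]):
                    if nb not in seen:
                        seen.add(nb)
                        queue.append((nb, hops + 1))
    ops = [n for n in scores if n.startswith("op:")]
    return sorted(ops, key=lambda n: (-scores[n], n))


ranked = rank_operations("cancel order")
```

In this sketch an operation is rewarded both for its own matching terms and for matches on its neighboring tag and parameter nodes, while rarer terms contribute more than common ones.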
AI/ML Full Stack 2 posts

gwanjong-mcp — AI Social Agent

An AI social agent that automates 9 social platforms (Dev.to, Bluesky, Twitter, Reddit, Mastodon, HN, Stack Overflow, GitHub Discussions, Discourse) through an MCP pipeline. Platforms are abstracted with the devhub-social adapter pattern, and the mcp-pipeline stores/requires chain composes a 3-stage Scout → Draft → Strike pipeline.

  • Scaled 4 → 9 platforms — adapter pattern minimizes per-platform code
  • stores/requires chain auto-resolves dependencies across multi-step pipelines
  • Campaign GTM + anti-spam system — rate limiter, content validation, per-platform policy compliance
Python MCP TypeScript 9 Platforms Pipeline
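
One way to picture how a stores/requires chain could auto-resolve step order is a topological sort over the keys each step produces and consumes. The sketch below is an assumption about the general mechanism, not the mcp-pipeline code: the step specs, the key names (`opportunities`, `draft`, `post_url`), and the `resolve_order` helper are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical step specs, deliberately listed out of order.
STEPS = {
    "strike": {"requires": {"draft"}, "stores": {"post_url"}},
    "scout": {"requires": set(), "stores": {"opportunities"}},
    "draft": {"requires": {"opportunities"}, "stores": {"draft"}},
}


def resolve_order(steps):
    """Order steps so every required key is stored by an earlier step."""
    # Map each stored key to the step that produces it.
    producers = {}
    for name, spec in steps.items():
        for key in spec["stores"]:
            producers[key] = name
    # Each step depends on the producers of the keys it requires.
    graph = {}
    for name, spec in steps.items():
        missing = spec["requires"] - set(producers)
        if missing:
            raise ValueError(f"{name} requires unknown keys: {sorted(missing)}")
        graph[name] = {producers[key] for key in spec["requires"]}
    return list(TopologicalSorter(graph).static_order())


order = resolve_order(STEPS)
```

Declaring data dependencies instead of hard-coding the call order means a new step only has to state what it needs and what it produces; a cycle or a missing producer is reported before anything runs.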
AI/ML PyPI

Synaptic Memory — Brain-inspired Knowledge Graph

A brain-inspired knowledge graph library + MCP server for LLM agents. With Spreading Activation (associative retrieval), Hebbian Learning (experiential learning), and 4-stage Memory Consolidation (L0 to L3 auto promotion/eviction), agents automatically structure and retrieve past experience. It reached an MRR of 0.793 (finance/medical/legal) with full-text search (FTS) alone, and an nDCG of 0.636 on HotPotQA.

  • 16 MCP tools — Auto-ontology (rules + LLM + embedding) construction
  • 5-axis ranking (relevance × importance × recency × vitality × context)
  • Zero-dep core — swappable SQLite/PostgreSQL/Qdrant/Neo4j backends
Python MCP Knowledge Graph Hebbian PyPI
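
Spreading activation and Hebbian-style reinforcement, as named in the description above, can be shown in a few lines. This is a generic textbook-style sketch under stated assumptions, not the Synaptic Memory API: the `MemoryGraph` class, its method names, and the `decay`, `steps`, and `rate` parameters are all hypothetical.

```python
from collections import defaultdict


class MemoryGraph:
    """Tiny undirected weighted graph for illustrating associative recall."""

    def __init__(self):
        self.weights = defaultdict(dict)

    def link(self, a, b, weight=0.5):
        self.weights[a][b] = weight
        self.weights[b][a] = weight

    def spread(self, seeds, decay=0.5, steps=2):
        """Push activation outward from seed nodes, attenuated at each step."""
        activation = defaultdict(float, {s: 1.0 for s in seeds})
        frontier = dict(activation)
        for _ in range(steps):
            nxt = defaultdict(float)
            for node, energy in frontier.items():
                for neighbor, w in self.weights[node].items():
                    nxt[neighbor] += energy * w * decay
            for node, energy in nxt.items():
                activation[node] += energy
            frontier = nxt
        return dict(activation)

    def reinforce(self, a, b, rate=0.1):
        """Hebbian-style update: strengthen a link used together, capped at 1."""
        new_weight = min(1.0, self.weights[a].get(b, 0.0) + rate)
        self.weights[a][b] = new_weight
        self.weights[b][a] = new_weight


graph = MemoryGraph()
graph.link("invoice", "payment", 0.8)
graph.link("payment", "refund", 0.4)
activation = graph.spread(["invoice"])
graph.reinforce("payment", "refund")
```

Nodes closer to the seed and connected by stronger links end up with more activation, and links that keep getting used together grow stronger over time, which is the intuition behind combining the two mechanisms.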
Search 12 posts

Rust Commerce Search Engine

A commerce search API server rewritten in Rust/Axum to overcome the performance limits of a NestJS search engine. It implements concurrent multi-index OpenSearch search, Redis caching, and unified search across multiple data sources (products/brands/categories). Compared with the NestJS version, it uses 1/5 the memory, responds 30% faster, and doubles indexing throughput.

  • 28ms average response, 2,100 req/s — Tokio async runtime + Tower middleware
  • 12MB idle memory (vs. 60MB on NestJS, 1/5) — leveraging zero-cost abstractions
  • Jenkins → Docker → K8s automated deployment pipeline
Rust Axum Tokio OpenSearch Redis Docker
AI/ML 15 posts

AI Agent Browser Automation

An LLM-based browser automation agent with a 4-layer architecture (Orchestrator → Planner → Navigator → Extractor). It dynamically registers tools via MCP, and combines Playwright-based DOM parsing with CSS-selector confidence scoring to build automation that is robust to web structure changes. Built from prototype to production in 49 commits over 4 days.

  • Human-in-the-Loop raised task completion 30% → 95%
  • 5.5x fewer MCP tool calls — DOM context pre-injected at the planning stage
  • No-code automation: scenario recorder → JSON playbook → repeatable execution
TypeScript Python Playwright MCP LLM Next.js
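
CSS-selector confidence scoring can take many forms; the sketch below shows one possible heuristic, where selectors anchored on stable attributes score higher than positional ones. The scoring weights, the `threshold`, and both function names are assumptions for illustration and are not taken from the project.

```python
import re


def selector_confidence(selector):
    """Heuristic stability score in [0, 1] for a CSS selector (assumed weights)."""
    score = 0.5
    if re.search(r"\[data-testid=", selector):
        score += 0.4  # test IDs rarely change with layout
    elif re.search(r"#[A-Za-z][\w-]*", selector):
        score += 0.3  # element IDs are usually stable
    if ":nth-child(" in selector:
        score -= 0.3  # positional selectors break when the DOM shifts
    # Penalize long descendant/child chains beyond two segments.
    depth = len(re.split(r"\s*>\s*|\s+", selector.strip()))
    score -= 0.05 * max(0, depth - 2)
    return max(0.0, min(1.0, score))


def pick_selector(candidates, threshold=0.6):
    """Return the most stable candidate, or None if none is trustworthy."""
    best = max(candidates, key=selector_confidence)
    return best if selector_confidence(best) >= threshold else None


candidates = [
    '[data-testid="submit"]',
    "div > ul > li:nth-child(3) > button",
    "#checkout button",
]
chosen = pick_selector(candidates)
```

When `pick_selector` returns `None`, a caller could stop and ask a person to confirm the target element instead of guessing, which is one way a low-confidence score might feed a human-in-the-loop step.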
Search 10 posts

NestJS Hybrid Search Engine

A commerce hybrid search engine grown over 14 months and 318 commits. It combines OpenSearch keyword search with Qdrant 384-dimensional vector semantic search via RRF, and improved search accuracy by 40% through LLM-based query expansion (synonyms/intent analysis) and a reranking pipeline. A Nori morphological analyzer detects Korean verbs to skip unnecessary GPT calls, cutting response time from 2-3s to 300ms.

  • 40% higher search accuracy from semantic search (resolving keyword mismatch)
  • Nori verb detection optimizes GPT calls: 2-3s → 300ms response
  • Multi-tenant index design — multiple mall search services on a single cluster
NestJS OpenSearch Qdrant Nori Python FastEmbed
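
The verb-detection gate can be sketched as a simple check over part-of-speech tags. The sketch assumes the tokens have already been analyzed (for example by Nori, whose tag set uses `VV` for verbs) and assumes the direction of the rule: queries containing a verb are treated as natural-language questions that benefit from LLM expansion, while noun-only queries go straight to keyword search. The function name and sample tokens are hypothetical.

```python
# POS tags treated as verbs; "VV" is the verb tag in Nori's tag set.
VERB_TAGS = {"VV"}


def needs_llm_expansion(tagged_tokens):
    """Return True when any token is tagged as a verb.

    tagged_tokens is a list of (surface_form, pos_tag) pairs produced by a
    morphological analyzer; the analysis itself is outside this sketch.
    """
    return any(tag in VERB_TAGS for _, tag in tagged_tokens)


# Hypothetical analyzer output for a keyword query and a sentence-like query.
keyword_query = [("운동화", "NNG")]
sentence_query = [
    ("아이", "NNG"),
    ("가", "JKS"),
    ("좋아하", "VV"),
    ("는", "ETM"),
    ("장난감", "NNG"),
]

skip_llm = not needs_llm_expansion(keyword_query)
call_llm = needs_llm_expansion(sentence_query)
```

A cheap local check like this lets the common short keyword queries avoid the LLM round trip entirely, which is consistent with the latency drop reported above.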
Full Stack 10 posts

Tauri 2.0 AI Desktop App

A Tauri 2.0 cross-platform AI desktop app with 1/10 the binary size and 1/3 the memory of Electron. A Remote WebView architecture renders a remote server UI directly in the local app without a frontend build, and it implements mistral.rs-based local LLM inference, NAT traversal via a Bore tunnel, and automatic switching between 3 operating modes (local/remote/hybrid).

  • Rust Sidecar pattern — Python services auto start/stop with the app
  • Remote WebView removes the frontend build — shorter deploy time
  • Custom-built mistral.rs local LLM inference + Bore tunnel NAT traversal
Tauri 2.0 Rust React TypeScript mistral.rs
Search AI/ML Case Study

i-Scream Mall AI Search

A case study of building and operating an AI search system in production for an education-focused shopping mall (i-Scream Mall). Semantic search + LLM query expansion were applied on a NestJS search engine to improve product search accuracy. It reliably handles 5,000+ TPS peak traffic, and a later Rust rewrite further cut operating costs.

  • Stable handling of 5,000+ TPS peak traffic — zero-downtime production operation
  • Semantic search resolves keyword mismatch — improved search conversion
  • NestJS → Rust rewrite cut memory to 1/5 and improved response 30%
NestJS Rust OpenSearch Nori LLM
05.

Open Source

06.

Tech Stack

Languages & Frameworks

Rust Python TypeScript Go Axum NestJS Next.js FastAPI Tauri React

AI / ML

vLLM llama.cpp Qdrant OpenSearch k-NN HuggingFace LangChain LangGraph MCP Neo4j PyTorch FAISS FastEmbed

Infrastructure & CI/CD

Kubernetes Docker K3s Redis Istio Jenkins ArgoCD Caddy GitHub Actions Helm MLflow AWS GitLab CI
07.

Timeline

08.

Education

Korea University, Grad School

M.S. in Artificial Intelligence · SW·AI Convergence

Hanyang University

B.S. in Urban Engineering · ML/DL trade-area analysis thesis (team lead)
Certifications: Information Processing Industrial Engineer · SQLD · Network Administrator Lv.2
Languages: Korean (native) · English (professional)
09.

By the Numbers

274+ Tech Blog Posts
60% Token Cost Cut (Bank GenAI)
11 Open Source Projects
1,068 Tool Benchmark (graph-tool-call)
8-Person Team Led (AI Lab)
28ms Rust Search Engine Response
30+ Enterprise Clients Advised
15x LLM Inference Throughput Gain