At IBM Client Engineering, I worked on applied AI systems designed to move from prototype to production-facing evaluation under real enterprise constraints. The work spanned multi-agent orchestration, Retrieval-Augmented Generation, fine-tuned language models, MLOps, and large-scale data infrastructure, with a consistent focus on measurable business impact rather than demo-only outcomes.
Built enterprise AI systems across retrieval, agents, fine-tuning, and MLOps, with measurable outcomes including 75 to 85% retrieval accuracy, 25 to 35% alignment gains, 97% faster large-scale processing, and ~90% drift-detection accuracy.
Scope of Work
The role sat at the intersection of solution architecture, applied research, and delivery engineering. I worked on systems that had to be technically sound, observable, and adaptable to different enterprise contexts rather than optimized for a single demo path.
Selected Outcomes
- Built enterprise RAG systems, including form-filling assistants and document workflows, reaching roughly 75 to 85% average retrieval accuracy across evaluated use cases.
- Reduced end-to-end document processing time by up to 80% through better retrieval pipelines and workflow design.
- Fine-tuned LLMs with real and synthetic enterprise data, improving response alignment by roughly 25 to 35% in expert evaluation.
- Delivered large-scale content standardization pipelines with parallel LLM execution and distributed PySpark processing, cutting runtime by 97% while reducing compliance violations.
- Engineered end-to-end MLOps workflows for monitoring, automated retraining, and lifecycle management, with drift detection reaching about 90% accuracy across multi-petabyte Hadoop-based environments.
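As an illustration of the drift-detection pattern in the MLOps work above, here is a minimal sketch using a two-sample Kolmogorov-Smirnov test from SciPy. The threshold, window sizes, and synthetic data are illustrative, not the production configuration.

```python
import numpy as np
from scipy.stats import ks_2samp

def detect_drift(reference, live, alpha=0.05):
    """Flag distribution drift between a reference window and live data
    using a two-sample KS test. The alpha threshold is illustrative."""
    statistic, p_value = ks_2samp(reference, live)
    return {"statistic": statistic, "p_value": p_value, "drift": p_value < alpha}

# Synthetic example: a mean shift in the live window simulates feature drift.
rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 1.0, 5000)
shifted = rng.normal(0.5, 1.0, 5000)
print(detect_drift(baseline, shifted)["drift"])  # drift flagged for the shifted sample
```

In practice this kind of check runs per feature on scheduled windows, with the drift flag feeding the automated retraining trigger rather than a print statement.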
Systems Built
- Designed multi-agent systems using LangGraph state machines, CrewAI, and custom tool-calling agents.
- Deployed reusable orchestration patterns as Model Context Protocol (MCP) servers using FastAPI and asynchronous processing.
- Built natural-language-to-SQL interfaces that enabled non-technical users to query complex relational databases more directly and accurately.
- Developed reusable benchmarking, evaluation, and observability workflows for LLM systems, including structured monitoring through LangSmith.
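The custom tool-calling pattern mentioned above reduces to a small dispatch loop: the model emits a structured call, and the agent routes it to a registered tool. The tool names and registry below are hypothetical stand-ins for illustration, not client code.

```python
import json
from typing import Callable, Dict

# Hypothetical tool registry; real agents register typed tools with schemas.
TOOLS: Dict[str, Callable[..., str]] = {
    "lookup_account": lambda account_id: f"account {account_id}: active",
    "summarize": lambda text: text[:40],
}

def run_tool_call(raw_call: str) -> str:
    """Dispatch a model-emitted tool call (JSON) to a registered tool."""
    call = json.loads(raw_call)
    name, args = call["tool"], call.get("arguments", {})
    if name not in TOOLS:
        return f"error: unknown tool '{name}'"
    return TOOLS[name](**args)

print(run_tool_call('{"tool": "lookup_account", "arguments": {"account_id": "42"}}'))
# account 42: active
```

Frameworks like LangGraph wrap this loop in a state machine so that tool results, retries, and handoffs between agents become explicit graph transitions rather than ad hoc control flow.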
Client Engagements
- Fiserv: data science, data engineering, LLM fine-tuning, and LLMOps.
- Visa: data engineering, MLOps, and data science workflows.
- Sony Group: RAG systems and AI software engineering.
- USAA: COBOL-to-Java conversion workflows.
- BNY Mellon: COBOL-to-Java conversion workflows.
- Truist Financial: data engineering.
- Globant: fine-tuned LLM systems for proprietary code explanation.
Applied LLM Research
Part of the role involved identifying repeatable failure modes in applied LLM systems and helping shape research directions that later informed watsonx product capabilities. The emphasis was on turning experimentation into reproducible internal frameworks rather than running isolated one-off studies.
- Built reproducible model-task alignment and benchmarking pipelines using tools such as LangChain, Hugging Face Datasets, and evaluation frameworks to measure task fit, generalization, and operational suitability.
- Designed prompting strategies for instruction-tuned and chat-based models to improve response quality across business-critical NLP tasks.
- Applied supervised fine-tuning with LoRA to adapt models to client-specific domains and constraints.
- Evaluated zero-shot, few-shot, and fine-tuned approaches across tasks using standardized metrics and human-in-the-loop review.
- Developed hybrid RAG architectures that combined retrieval systems with generative models for better contextual reasoning and grounded responses.
- Implemented evaluation workflows using metrics such as accuracy, F1, ROUGE, and BLEU alongside qualitative review loops.
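As one concrete example of the evaluation workflows above, a token-overlap F1 (SQuAD-style) can be computed in a few lines; the sample strings are illustrative.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a model prediction and a reference answer,
    as used in SQuAD-style QA evaluation."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    # Overlap counts each shared token up to its minimum frequency in either side.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(token_f1("the model detected drift", "model drift was detected"))  # 0.75
```

Aggregating this metric per task, alongside ROUGE/BLEU for generation and human review for alignment, is what made zero-shot, few-shot, and fine-tuned variants directly comparable.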
IBM Platform and Stack
IBM Products
- watsonx.ai
- watsonx.data
- watsonx.governance
- InstructLab
- Carbon Design System
- watsonx Code Assistant for Z
- watsonx Code Assistant for Ansible
- Ansible
Open Source and Engineering Stack
- PyTorch
- Transformers
- Scikit-learn
- Pandas
- Matplotlib
- SciPy
- Hadoop
- Hive
- PySpark
- Kerberos
- Docker
- Kubernetes
- Java
- COBOL (basic exposure)