AI Feature Development & Evaluation Services for Growth-Stage SaaS Companies
Consulting services offered as low-risk, fixed-price packages. If you don’t find a package that meets your needs, drop me a line at hey@jxsh.io to discuss a custom arrangement, or let's find a time to talk.
AI Feature Development (TypeScript)
Takes 8 weeks and starts at $22,500
I build custom AI features from scratch in TypeScript, with evaluation and quality assurance built in from day one, delivering production-ready code in 8 weeks. Development can happen as an isolated microservice or directly within your existing codebase, depending on the level of access you're comfortable providing.
What makes this unique: I use a proven MCP (Model Context Protocol) approach for feature validation - first building MCP adapters that let you test new functionality directly in Claude or ChatGPT. You get to try the actual feature, and iterate on its output, before we commit to a frontend and lock things in. This helps us smooth off rough edges early and deliver a better final product in the same amount of time.
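As a simplified illustration of what an MCP adapter looks like: it is essentially a thin tool definition wrapped around the feature's core function. The sketch below uses a hypothetical `summarizeTicket` feature invented for this example; in a real engagement the definition would be registered with the MCP TypeScript SDK so Claude or ChatGPT can call it directly.

```typescript
// Core feature logic, independent of any UI or protocol.
// summarizeTicket is a hypothetical example feature.
function summarizeTicket(input: { title: string; body: string }): string {
  const firstSentence = input.body.split(". ")[0];
  return `${input.title}: ${firstSentence}`;
}

// Minimal MCP-style tool definition: a name, a description, a JSON
// Schema for inputs, and a handler. An MCP server exposes this to the
// chat client, letting you exercise the feature before a frontend exists.
const summarizeTicketTool = {
  name: "summarize_ticket",
  description: "Summarize a support ticket into one line.",
  inputSchema: {
    type: "object",
    properties: {
      title: { type: "string" },
      body: { type: "string" },
    },
    required: ["title", "body"],
  },
  handler: (args: { title: string; body: string }) => ({
    content: [{ type: "text", text: summarizeTicket(args) }],
  }),
};
```

Because the handler wraps the same function that will ship to production, anything learned while testing the tool in chat feeds straight back into the implementation.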
Process
Weeks 1-2: Discovery & Architecture
- Requirements gathering and technical specification development
- System architecture design (microservice vs. integrated approach based on your access preferences)
- AI model selection and evaluation framework planning
- Initial grading criteria development defining quality standards for your specific use case
- MCP adapter development for early feature validation in Claude/ChatGPT
Weeks 3-6: Development & Testing
- TypeScript AI feature development with quality assurance built in from the start
- Continuous MCP-based testing and validation with real user feedback
- Comprehensive test suite creation (integration, unit, and end-to-end tests as appropriate)
- Iterative refinement based on MCP validation results to smooth off rough edges
- Performance monitoring and debugging systems implementation
- Transition from MCP validation to full production integration
Weeks 7-8: Production Integration & Handover
- Final evaluation report generation with performance metrics across all quality benchmarks
- Complete code documentation and deployment setup
- Team training on feature maintenance, testing, and ongoing evaluation
- 90-day support period for questions and adjustments
What's Delivered
- Source Code & Implementation
Complete, production-ready AI feature code that can be deployed as either a standalone microservice or integrated directly into your existing codebase. You get full ownership of all code, documentation, and deployment configurations, with architecture designed for scalability and maintainability.
- Comprehensive Test Suite
A complete testing framework covering your AI feature with integration tests, unit tests, and end-to-end tests as appropriate for your specific implementation. The test suite ensures reliability across different scenarios and provides confidence for future updates and deployments.
- AI Feature Evaluation Report
A detailed assessment of your new AI feature's performance, including the custom grading criteria we develop together throughout the project and final evaluation scores across all quality metrics. Since this is a new feature, all test data will be synthetically generated for baseline testing, but I will work closely with you to ensure the test scenarios are as representative of real-world edge cases as possible. This report serves as your baseline for ongoing performance monitoring and future improvements.
Evals for AI SaaS Features
Takes 3 weeks and starts at $6,000
I systematically diagnose and fix existing AI features that aren't performing as expected, delivering quantified improvements in just 3 weeks. Unlike generic monitoring tools, I create custom grading criteria specific to your domain and provide measurable before/after results that prove ROI. You get a complete quality framework, documented fixes, and the knowledge to maintain high AI performance long after the engagement ends.
What makes this unique: Most AI consulting focuses on building new features, but I specialize in rescuing underperforming AI systems with rapid, measurable improvements and custom quality standards tailored to your specific business domain.
Process
Week 1: Assessment & Instrumentation
- Set up monitoring infrastructure to capture AI responses at scale (using tools like Langfuse, Braintrust, or custom dashboards)
- Manual review of hundreds of AI outputs to identify all error modes (no assumptions - we discover problems through direct observation)
- Create comprehensive error taxonomy based on actual failures, not predicted ones
- Develop initial grading criteria document defining quality standards for your specific domain
Week 2: Analysis & Automated Detection
- Build code-based evaluators for deterministic error detection (regex, length checks, tool usage patterns)
- Create LLM-as-Judge evaluators for subjective quality assessment (tone, helpfulness, accuracy)
- Quantify prevalence of each error type across your full dataset
- Implement fixes for identified issues and optimize AI performance
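To make the split between deterministic and subjective checks concrete, here is a hedged sketch of the kind of code-based evaluators built in this phase. The specific checks (a length bound and a forbidden-phrase regex) are illustrative examples, not a fixed checklist; LLM-as-Judge evaluators cover the subjective criteria these cannot.

```typescript
interface EvalResult {
  name: string;
  passed: boolean;
  detail?: string;
}

// Deterministic, code-based evaluators: cheap, fast, and reproducible.
// Each takes a raw AI response and returns pass/fail plus context.
function lengthCheck(response: string, min: number, max: number): EvalResult {
  const passed = response.length >= min && response.length <= max;
  return { name: "length", passed, detail: `length=${response.length}` };
}

// Example regex check: flag responses that leak apology boilerplate.
const APOLOGY_PATTERN = /as an ai (language )?model/i;

function apologyCheck(response: string): EvalResult {
  const passed = !APOLOGY_PATTERN.test(response);
  return { name: "no-apology-boilerplate", passed };
}

// Run every evaluator over a batch of responses to quantify how
// prevalent each error type is across the full dataset.
function errorRates(responses: string[]): Record<string, number> {
  const evaluators = [(r: string) => lengthCheck(r, 20, 2000), apologyCheck];
  const failures: Record<string, number> = {};
  for (const response of responses) {
    for (const evaluate of evaluators) {
      const result = evaluate(response);
      if (!result.passed) {
        failures[result.name] = (failures[result.name] ?? 0) + 1;
      }
    }
  }
  // Convert raw failure counts into rates across the dataset.
  for (const key of Object.keys(failures)) {
    failures[key] = failures[key] / responses.length;
  }
  return failures;
}
```

An LLM-as-Judge evaluator returns the same `EvalResult` shape but calls a model with a scoring rubric instead of running a regex, which is why the two layers compose cleanly into one report.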
Week 3: Validation & Knowledge Transfer
- Validate improvements using proper train/dev/test data splits
- Measure before/after performance across all error categories
- Finalize grading criteria documentation with maintenance guidelines
- Train your team on ongoing evaluation and quality assessment processes
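The train/dev/test discipline above can be sketched simply: a deterministic, seeded split keeps the validation honest, because fixes are tuned against train/dev examples while the final before/after numbers are reported only on the held-out test set. The proportions and seed below are illustrative defaults, not prescribed values.

```typescript
interface Splits<T> {
  train: T[];
  dev: T[];
  test: T[];
}

// Deterministically split labeled examples so the test set stays
// untouched while fixes are iterated on against train/dev.
function splitDataset<T>(
  examples: T[],
  devFrac = 0.15,
  testFrac = 0.15,
  seed = 42
): Splits<T> {
  // Simple seeded PRNG (linear congruential generator) so the
  // split is reproducible run-to-run.
  let state = seed;
  const rand = () => {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296;
  };
  // Fisher-Yates shuffle on a copy; the input array is left intact.
  const shuffled = [...examples];
  for (let i = shuffled.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];
  }
  const testCount = Math.floor(shuffled.length * testFrac);
  const devCount = Math.floor(shuffled.length * devFrac);
  return {
    test: shuffled.slice(0, testCount),
    dev: shuffled.slice(testCount, testCount + devCount),
    train: shuffled.slice(testCount + devCount),
  };
}
```

Because the seed is fixed, re-running the evaluation weeks later reproduces exactly the same held-out set, so before/after comparisons measure the fixes rather than a reshuffled dataset.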
What's Delivered
- A Grading Criteria Document
A comprehensive, evolving document that defines what "good" AI output looks like for your specific use case. This includes scoring rubrics, quality thresholds, edge case handling rules, and examples of acceptable vs. unacceptable outputs. Unlike static documentation, this document is designed to be updated as your understanding of quality evolves, serving as the foundation for all future AI evaluation and improvement efforts.
- Baseline Performance Report
A quantified analysis of your AI system's current performance, documenting all identified error modes with specific metrics. This report includes failure rates, error categories, cost analysis, and impact assessment for each problem area. It serves as your "before" snapshot, establishing concrete benchmarks against which all improvements will be measured.
- Final Improvement Report
A comprehensive before/after comparison showing exactly what was fixed and by how much. This report quantifies the measurable improvements achieved across all error modes, including reduced failure rates, cost savings, and enhanced reliability metrics. It provides concrete evidence of ROI and serves as documentation for stakeholders on the tangible value delivered.
Strategic AI Consultation
Takes 2 hours and starts at $500
Unlike generic AI consulting that gives you theoretical advice, I focus on practical, implementable solutions tailored to your specific technical constraints and business goals. You get clarity on whether to build new features, fix existing ones, or optimize your current approach - with concrete next steps and timelines rather than vague recommendations.
What makes this unique: Most AI consultants either sell you their preferred solution or give abstract advice, but I provide unbiased, practical guidance based on your actual situation and constraints, helping you make confident decisions before committing significant development resources.
What's Delivered
- Discussion Recording
A comprehensive 2-hour video call where we dive deep into your AI product challenges, explore potential solutions, and align on the best path forward. This isn't just a presentation of findings - it's an interactive strategy session where you can ask questions, challenge assumptions, and work through implementation concerns in real-time. You'll get the recording to share with your team and reference later as you execute on the recommendations.
- Discussion Transcript
A complete transcript of our 2-hour strategy session, cleaned up and organized for easy reference. This ensures you have a written record of all recommendations, action items, and key insights discussed, making it easy to share with team members who weren't on the call and reference specific points as you implement the suggested changes.
- Follow-up Research & Resources
A curated collection of links, tools, frameworks, and references based on everything we discussed during our session. I spend 2 hours after our call researching and compiling the specific resources, documentation, and implementation guides that relate to your unique situation, giving you a head start on executing any of the strategies or solutions we explored together.
FAQs
Who's behind this service?
I'm Josh Pitzalis, working alongside my development partner Sasha Koss. Together we built Chirr App, one of the first Twitter thread schedulers back in 2017, and have over 15 years of combined development experience. Sasha is the author of date-fns, the second most popular JavaScript utility library in the world according to the 2024 State of JS report. We've been exploring AI infrastructure and evaluation systems through projects like Mind Control (a CMS for production AI prompts) and DaisyChain (a no-code prompt chaining tool). Our backgrounds combine deep software engineering expertise with an understanding of the LLM evaluation process and practical lessons from building two AI-infrastructure projects.
What's required from my team?
For the consultation, 2 hours of your time. You can share access to your AI systems and current performance data on the call if you want them reviewed, but there is no expectation to do so.
Do you work with early-stage startups?
For the development and evaluation services we offer, we tend to focus on growth-stage companies (Series A-C) with established AI products and meaningful API spending ($5K+ monthly). However, the consultation service is designed for any stage company that needs strategic clarity on their AI approach - whether you're pre-Series A exploring AI possibilities or a larger company deciding between different implementation paths.
What if my AI system is unique or complex?
Perfect - that's exactly what I specialize in. Generic monitoring tools and cookie-cutter solutions fail because every AI application has unique failure modes and requirements. I generally work with you to create custom evaluation frameworks and development approaches that understand your specific domain, whether that's legal document analysis, medical diagnosis, financial risk assessment, or something completely different.
How do you handle different tech stacks?
All development work is done in TypeScript, which integrates well with most modern web applications. For evaluation services, I work with your existing infrastructure regardless of language - the monitoring and evaluation systems are designed to work alongside your current setup.
What about data privacy and security?
I work within your existing security protocols and can sign NDAs as needed. For sensitive data, we can work with synthetic data or implement evaluation systems that don't require exposing actual customer data. All code and documentation developed during our engagement belongs entirely to you.
Can you work with regulated industries?
Yes, I have experience working with companies in legal, healthcare, and financial services. I understand the additional compliance requirements and can help design evaluation systems that meet regulatory standards while still providing actionable insights.
What happens if the project needs to be extended?
All services include built-in buffer time, but if scope changes significantly, we'll discuss timeline adjustments upfront. The goal is always to deliver complete, working solutions within the stated timeframe.
Do you provide ongoing support after the engagement?
All services include a 90-day support period for questions and adjustments. For longer-term support or additional features, we can discuss a follow-up engagement.
What about guarantees or refunds?
For the consultation service, if you don't find the session valuable you'll get a full refund - I only charge you if the consultation was genuinely useful to your situation. AI development is still an early-stage field where everyone does things differently; there's a strong chance our experience with similar problems can help, but if we can't add value to your specific situation, we don't charge you. For the development and evaluation services, all engagements include a satisfaction guarantee: if you don't see measurable improvement or deliverables that meet the agreed specifications, we'll work together to make it right or provide appropriate compensation.