How We Built an AI Question Generator for UTBK with Near-Zero Cost

This article explains how we designed an AI-powered question generator system for UTBK preparation with a strict focus on keeping operational cost near zero. The system relies on Google Gemini 2.5 Flash as the primary model, but the real cost efficiency comes from workflow design: controlled batching, throttling, deduplication, lifecycle-based content management, and drip publishing. Rather than optimizing for maximum AI usage, we optimize for sustainable daily generation within free-tier constraints while keeping the system aligned with the mission of providing accessible education through terasbelajarasik.web.id/bank-soal.

1. The Real Constraint: AI Is Not the Product, the Workflow Is

When building an AI question generator system, it is easy to assume that the main challenge is choosing the right model.

In reality, the model is not the problem.

The real constraint is:

How to design a system where AI becomes a controlled utility, not an uncontrolled cost driver.

Without proper architecture, AI systems tend to:

generate redundant content
repeat similar questions across runs
waste tokens on formatting and validation
scale cost unintentionally with usage
rely too heavily on AI for deterministic tasks

At scale, this makes the system unpredictable and financially unsustainable.

So the goal is not “maximum generation,” but:

stable, controlled, and predictable daily content generation with minimal AI dependency.

2. Model Choice: Why We Use Gemini 2.5 Flash

We use Gemini 2.5 Flash as the core generation model.

We did not choose it for being the “most powerful” model available, but because it fits three operational requirements:

1. Cost Efficiency

It is available on a free tier, so early-stage usage does not require attaching a billing account. This is critical for maintaining a near-zero cost system.

2. Latency and Throughput

It is fast enough for batch-based generation pipelines, which is more important than raw intelligence in this use case.

3. Sufficient Quality for Structured Content

For UTBK-style question generation, we do not require frontier-level reasoning. We require:

structured output
consistent formatting
controllable randomness
predictable difficulty scaling

Gemini 2.5 Flash is sufficient for this purpose.

More details on usage limits and behavior are available in the official documentation:
https://ai.google.dev/gemini-api/docs

3. High-Level System Architecture

The system is built as a multi-stage controlled pipeline, not a direct generation system.

[Figure: Architecture Diagram]

4. Core Design Principles

4.1 Small Batch, Always

We intentionally avoid large-scale generation.

Each execution:

generates a small batch of questions
rotates between subject types
ensures predictable API usage

This prevents:

request spikes
quota instability
uncontrolled token bursts
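
The rotation idea can be sketched in a few lines. This is only an illustration: the subject names, the batch size of 5, and the plan_batch helper are assumptions, not values taken from the production system.

```python
from dataclasses import dataclass
from typing import Sequence

# Hypothetical subject list and batch size, chosen only for illustration.
SUBJECTS = ("penalaran_umum", "literasi_bahasa", "penalaran_matematika")
BATCH_SIZE = 5


@dataclass(frozen=True)
class BatchPlan:
    subject: str
    count: int


def plan_batch(
    run_index: int,
    subjects: Sequence[str] = SUBJECTS,
    batch_size: int = BATCH_SIZE,
) -> BatchPlan:
    """Pick one subject per run, cycling through the list in order."""
    subject = subjects[run_index % len(subjects)]
    return BatchPlan(subject=subject, count=batch_size)
```

Because the subject depends only on the run counter, the number of API calls per run is fixed and known in advance.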

4.2 Controlled Throttling Instead of Maximum Throughput

Instead of trying to maximize request speed, we enforce:

spacing between requests
sequential execution flow
no parallel AI bursts

The goal is not speed, but stability.
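
One possible shape for this kind of throttling is a plain sequential loop with a minimum gap between calls. The sketch below assumes a hypothetical run_sequentially helper and a 10-second default interval; the real spacing is not stated in this article. The sleep and clock functions are injectable so the loop can be exercised without actually waiting.

```python
import time
from typing import Callable, Iterable, List, Optional, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def run_sequentially(
    jobs: Iterable[T],
    call: Callable[[T], R],
    min_interval_s: float = 10.0,  # hypothetical spacing
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> List[R]:
    """Run jobs one at a time, with at least min_interval_s between call starts."""
    results: List[R] = []
    last_start: Optional[float] = None
    for job in jobs:
        if last_start is not None:
            # Wait out whatever is left of the interval since the last call.
            wait = min_interval_s - (clock() - last_start)
            if wait > 0:
                sleep(wait)
        last_start = clock()
        results.append(call(job))
    return results
```

There is deliberately no thread pool or async fan-out here: one call finishes before the next one is even considered.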

4.3 Strict Output Structure (No Free-Form AI Output)

Every response must follow a strict JSON structure.

If the output is invalid:

it is rejected
optionally retried in the next cycle
never manually corrected using another AI call

This eliminates:

secondary AI validation calls
parsing overhead
regeneration loops caused by formatting issues
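
A deterministic validator along these lines is enough to enforce the rule without a second AI call. The field names, the five-option requirement, and the parse_question helper are assumptions made for this sketch; the actual schema is not shown in this article.

```python
import json
from typing import Optional

# Hypothetical schema: field names and types are assumptions for illustration.
REQUIRED_FIELDS = {
    "question": str,
    "options": list,
    "answer": str,
    "explanation": str,
}
EXPECTED_OPTION_COUNT = 5  # assumed A-E multiple choice


def parse_question(raw: str) -> Optional[dict]:
    """Return the parsed question, or None if the output must be rejected."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    for name, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(name), expected_type):
            return None
    options = data["options"]
    if len(options) != EXPECTED_OPTION_COUNT:
        return None
    if not all(isinstance(option, str) for option in options):
        return None
    # The answer key must be one of the listed options.
    if data["answer"] not in options:
        return None
    return data
```

A None result simply leaves the slot empty for a later cycle; nothing in this path calls the model again.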

4.4 Auto Tagging with Limited Capacity

Tagging is handled separately from generation.

However:

only a limited number of items are processed per run
tagging is distributed across time

This prevents tagging from becoming a hidden cost multiplier.
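
As a rough sketch, the cap can be a simple counter inside the tagging loop. The tag_some helper, the "tags" key, and the limit of 3 per run are hypothetical; the tagger argument stands in for whatever actually produces the tags.

```python
from typing import Callable, Dict, List

MAX_TAGGED_PER_RUN = 3  # hypothetical cap


def tag_some(
    questions: List[Dict],
    tagger: Callable[[Dict], List[str]],
    limit: int = MAX_TAGGED_PER_RUN,
) -> int:
    """Tag at most `limit` untagged questions in place; return how many were tagged."""
    tagged = 0
    for question in questions:
        if tagged >= limit:
            break  # leave the rest for a later run
        if "tags" in question:
            continue  # already tagged, never processed twice
        question["tags"] = tagger(question)
        tagged += 1
    return tagged
```

Anything left untagged is picked up by a later run, which is how the work gets spread across time.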

4.5 Deduplication as a Cost Protection Layer

Before anything is published:

all generated questions are compared against the existing dataset
the similarity threshold is intentionally strict

If a question is a duplicate:

it is marked as duplicate
it is excluded from the publishing pipeline
no regeneration is triggered immediately

This is one of the most important cost-saving mechanisms.
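
The comparison itself does not need an AI call. The sketch below uses Python's standard difflib.SequenceMatcher on normalized text; this is a stand-in technique, since the article does not say which similarity measure is used, and the 0.85 threshold is an assumed value.

```python
import re
from difflib import SequenceMatcher
from typing import Iterable

SIMILARITY_THRESHOLD = 0.85  # hypothetical value


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    without_punctuation = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", without_punctuation).strip()


def is_duplicate(
    candidate: str,
    existing: Iterable[str],
    threshold: float = SIMILARITY_THRESHOLD,
) -> bool:
    """True if the candidate is at least `threshold` similar to any existing question."""
    normalized_candidate = normalize(candidate)
    return any(
        SequenceMatcher(None, normalized_candidate, normalize(old)).ratio() >= threshold
        for old in existing
    )
```

A pairwise scan like this is fine for a small question bank. A larger dataset would probably need an index or hashing step in front of it.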

4.6 Difficulty Calibration from Real Usage

Instead of relying on AI to estimate difficulty repeatedly, we use:

real user correctness rate
response time distribution
aggregated performance signals

This reduces unnecessary re-generation cycles.
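
To illustrate how such signals might be combined, here is a sketch that derives a label from the correctness rate and bumps it one level when the median response time is slow. Every cutoff here (30 attempts, 70% and 40% correctness, 90 seconds) and the difficulty_label helper are assumptions, not the production values.

```python
from statistics import median
from typing import Optional, Sequence

MIN_ATTEMPTS = 30  # hypothetical: below this, the signal is too noisy


def difficulty_label(
    correct_flags: Sequence[bool],
    response_times_s: Sequence[float],
    slow_threshold_s: float = 90.0,  # hypothetical cutoff
) -> Optional[str]:
    """Label a question from real attempts, or return None if data is too thin."""
    if len(correct_flags) < MIN_ATTEMPTS:
        return None
    correct_rate = sum(correct_flags) / len(correct_flags)
    if correct_rate >= 0.7:
        label = "easy"
    elif correct_rate >= 0.4:
        label = "medium"
    else:
        label = "hard"
    # Slow answers suggest the question is harder than the hit rate alone shows.
    is_slow = bool(response_times_s) and median(response_times_s) > slow_threshold_s
    if is_slow and label != "hard":
        label = {"easy": "medium", "medium": "hard"}[label]
    return label
```

The label is recomputed from stored attempts, so recalibrating a question never costs a model call.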

4.7 Lifecycle-Based Content System

Every question moves through a controlled lifecycle:

draft → generated content
ready → validated and safe for use
published → visible to users
duplicate → excluded from system output

This ensures:

no premature publishing
no repeated AI processing
clear separation between generation and production
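
The lifecycle can be modeled as a small state machine. The four states come from the list above, but the allowed transitions in this sketch are an assumption about how they connect, and the transition helper is hypothetical.

```python
from enum import Enum


class Status(Enum):
    DRAFT = "draft"
    READY = "ready"
    PUBLISHED = "published"
    DUPLICATE = "duplicate"


# Assumed transition table: published and duplicate are treated as terminal.
ALLOWED_TRANSITIONS = {
    Status.DRAFT: {Status.READY, Status.DUPLICATE},
    Status.READY: {Status.PUBLISHED, Status.DUPLICATE},
    Status.PUBLISHED: set(),
    Status.DUPLICATE: set(),
}


def transition(current: Status, target: Status) -> Status:
    """Return the new status, or raise ValueError if the move is not allowed."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target
```

With a table like this, a draft cannot jump straight to published, which is what prevents premature publishing.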

4.8 Drip Publishing Strategy

Even after a question is ready:

it is not published immediately
it is released gradually

This stabilizes:

system load
user experience
downstream processing demand
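
A drip release can be as simple as picking the oldest ready questions up to a daily cap. In this sketch the cap of 10, the "status" and "created_at" keys, and the pick_for_today helper are all assumptions for illustration.

```python
from typing import Dict, List

DAILY_PUBLISH_LIMIT = 10  # hypothetical cap


def pick_for_today(
    questions: List[Dict],
    already_published_today: int,
    limit: int = DAILY_PUBLISH_LIMIT,
) -> List[Dict]:
    """Return the oldest ready questions that still fit in today's quota."""
    remaining = max(0, limit - already_published_today)
    ready = [q for q in questions if q["status"] == "ready"]
    # ISO-8601 date strings sort chronologically, so oldest come first.
    ready.sort(key=lambda q: q["created_at"])
    return ready[:remaining]
```

Running this once per day drains the ready queue at a steady rate, no matter how many questions were generated that day.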

5. Working Within Gemini Usage Boundaries

We deliberately design the system to stay well within the free-tier usage limits of Gemini 2.5 Flash.

Rather than focusing on exact numbers, we follow a simpler engineering approach:

We design for low-frequency, predictable, and evenly distributed requests, instead of trying to maximize throughput.

In practice, this means:

no burst generation
no parallel worker flooding
no continuous AI loops
strict batch-based execution

This aligns with how free tiers are intended to be used: for steady, non-abusive workloads.
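
One way to make "strict batch-based execution" enforceable is a self-imposed daily request budget that every AI call must pass through. The DailyBudget class below is a hypothetical sketch. Its max_requests value is meant to be a conservative number chosen by the operator, not an official Gemini quota.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class DailyBudget:
    """A self-imposed cap on AI requests per calendar day."""

    max_requests: int  # operator-chosen, deliberately below any provider limit
    day: date = field(default_factory=date.today)
    used: int = 0

    def try_spend(self, today: date) -> bool:
        """Reserve one request for `today`; return False once the budget is spent."""
        if today != self.day:
            # A new day started: reset the counter.
            self.day = today
            self.used = 0
        if self.used >= self.max_requests:
            return False
        self.used += 1
        return True
```

A pipeline run would call try_spend(date.today()) before each request and stop for the day on the first False, so no retry loop can exceed the budget.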

6. A Reality We Chose Not to Exploit

Technically, it is possible to scale request volume by:

creating multiple API accounts
distributing load across them
rotating keys to increase throughput

Yes, that approach would allow higher generation frequency.

But we deliberately do not take that route.

Because:

it violates the spirit of free-tier usage
it introduces system fragility
it creates operational complexity
and most importantly, it is unnecessary for our goal

We are not building a high-frequency AI content farm.

We are building a sustainable educational system.

So instead of optimizing for maximum extraction, we optimize for responsible usage.

7. Why We Don’t Use “More Powerful” Models Like GPT

We also evaluated GPT-based systems, but the differences in cost model are significant.

With GPT-based APIs:

usage is inherently token-billed
every request contributes directly to cost
scaling automatically increases expenses
there is no equivalent “free operational buffer”

This creates a fundamentally different system design constraint:

GPT systems require cost-first engineering, while Gemini allows workflow-first engineering.

For our use case—educational content generation for students with limited financial access—this distinction is critical.

8. Alignment With Our Mission

The entire system exists to support terasbelajarasik.web.id.

Our goal is simple:

help students prepare for UTBK
without requiring expensive paid courses
without introducing subscription barriers

That is why:

AI cost must remain near zero
infrastructure must be sustainable
system complexity must serve affordability, not scale-for-profit

This is not a high-end AI product.

It is an accessibility-driven educational tool.

9. Key Takeaway

The most important lesson from building this system is:

AI cost control is not achieved by limiting usage alone, but by designing workflows that naturally prevent unnecessary usage.

By combining:

small batch generation
strict lifecycle control
deduplication-first architecture
throttled execution
quota-aware design using Gemini 2.5 Flash

we ensure the system remains stable, predictable, and financially sustainable.

10. Implementation Checklist (Business-Focused)

Final Thoughts

Could we just scale this system by adding more AI accounts or moving everything to a paid GPT-based API? Technically, yes. It would be faster, simpler in some parts, and much more straightforward from an engineering perspective.

But that’s not the point of what we’re building.

There’s something more intentional about designing within constraints—especially when the goal is to build an education platform like terasbelajarasik.web.id. Every decision becomes more deliberate: how often we generate questions, how we structure batches, when we allow AI to be called, and when we explicitly decide not to use it at all.

Working with Gemini 2.5 Flash in a controlled, free-tier-first architecture forces a different kind of engineering mindset. You stop thinking in terms of “how do we scale usage” and start thinking in terms of “how do we prevent unnecessary usage from happening in the first place.”

Even things like deduplication, lifecycle states, and drip publishing stop being just technical details—they become cost-control mechanisms that define the entire system behavior.

Of course, we could push harder. We could parallelize requests, distribute workloads across multiple accounts, or try to extract more throughput from the system. But that would slowly shift the project away from its original intention: building something sustainable, not something maximally aggressive.

There’s a discipline in deliberately not doing the “obvious scaling tricks,” even when they are technically possible.

In the end, it feels less like building an AI system and more like designing a constrained production line where every AI call has to justify its existence.

And maybe that’s the real lesson here: the best system is not the one that uses AI the most—but the one that uses it only when everything else has already been done correctly.

If you have questions about any part of this setup, feel free to reach out!