
Scheduling Nightmares: How We Built Scalable Document Syncs for Our SaaS

June 8, 2025 · By Isabella Chen · Engineering

Scheduling Nightmares: How We Built Scalable Document Syncs for Our SaaS (Image credit: Pexels)

Hey everyone, Wei here, founder and chief architect behind Zenceipt (our SaaS that helps businesses automatically snag accounting docs like invoices and receipts straight from their email inboxes).

One of the core promises we make to our users is convenience – set your sync preferences, and we'll handle the rest. Simple, right? Well, building the scheduling part of that promise turned out to be a more interesting journey than anticipated.

Our users need to connect their Gmail (and eventually other providers) and tell us how often we should check for new documents – hourly, daily at a specific time, maybe just weekly. This means potentially thousands of users, each with their own custom schedule, all needing reliable execution. Get this wrong, and the core value proposition crumbles.

The Early Ideas (and Why They Didn't Fly)

Like many startups, our first instinct was to reach for the simplest path.

  1. setInterval in the Main App? Seriously considered for about 5 minutes. Running potentially thousands of intervals inside our main SvelteKit backend process? That's a recipe for disaster. It doesn't scale, a single app restart wipes all schedules, and it couples scheduling tightly with our main request/response cycle. A non-starter.
  2. System Cron? The old reliable. We could have a master cron job run frequently, query the database for users due for a sync, and kick off processes. But how do you manage individual user schedules dynamically? Updating a user's frequency from daily to hourly becomes complex. Managing failures and retries per user? Ensuring jobs don't overlap or overload the system? It felt brittle and hard to manage at scale via code.
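To make that concrete, here is roughly the machinery the master-cron approach forces you to build and maintain yourself. This is a hypothetical sketch (the `nextSyncAt`/`intervalMs` schema is ours for illustration, not our actual code), and it still leaves overlap protection, retries, and failure tracking entirely up to you:

```typescript
// Hypothetical sketch of the "master cron" polling loop we decided against.
// Assumes each user row tracks when their next sync is due.
interface UserSchedule {
  userId: string;
  nextSyncAt: number; // epoch ms of the next due sync
  intervalMs: number; // e.g. 3_600_000 for hourly
}

// Every tick: find the users whose sync is due...
function findDueUsers(schedules: UserSchedule[], now: number): UserSchedule[] {
  return schedules.filter((s) => s.nextSyncAt <= now);
}

// ...and after triggering each one, advance their next-run time.
function advanceSchedule(s: UserSchedule, now: number): UserSchedule {
  return { ...s, nextSyncAt: now + s.intervalMs };
}
```

Even this toy version hints at the edge cases: what if a tick is skipped, a sync outlives the interval, or the process dies between triggering and advancing?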

We needed something more robust, designed for dynamic, application-driven scheduling.

Trying Managed Cloud Schedulers

The next logical step was looking at managed services like AWS EventBridge Scheduler or Google Cloud Scheduler. The appeal is obvious: infrastructure managed by the cloud provider, high availability, pay-per-use. Sounds great!

However, we hit a couple of conceptual roadblocks for our specific use case:

  • The "Million Schedulers" Problem: Creating one distinct cloud scheduler job per user felt operationally complex. Imagine managing API calls to create, update, and delete potentially tens of thousands of these individual schedules. There are often service limits, and just tracking them becomes a task in itself.
  • The "Batched Trigger" Complexity: The alternative is fewer, more general cloud jobs (e.g., one runs every hour). This job would then trigger an endpoint in our app. That endpoint would need logic to query our database, find all users scheduled for that hour, and then trigger their syncs (maybe fanning out via another queue). While viable, this pushes scheduling logic back into our application and still requires careful handling of load and failures for that batch trigger.
  • Security & Endpoint Management: Exposing an endpoint for the cloud scheduler to hit requires careful security considerations (authentication, authorization) to prevent abuse.
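For illustration, the "batched trigger" endpoint boils down to something like this (a hypothetical sketch; the schema and names are ours, not a real API):

```typescript
// Hypothetical sketch of the batched-trigger variant: an hourly cloud job
// calls our endpoint with the current UTC hour, and we fan out the syncs.
interface UserSetting {
  userId: string;
  syncHourUtc: number | 'every'; // 'every' = user wants hourly syncs
}

function usersDueAtHour(settings: UserSetting[], hourUtc: number): string[] {
  return settings
    .filter((s) => s.syncHourUtc === 'every' || s.syncHourUtc === hourUtc)
    .map((s) => s.userId);
}
```

The selection itself is trivial; the hard parts are everything around it — fanning out hundreds of syncs without overloading anything, and retrying the ones that fail.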

While powerful, these felt like they either introduced significant management overhead at scale or shifted complexity around rather than solving it cleanly for dynamic, per-user schedules.

Enter the Task Queue: BullMQ + Redis

This led us down the path of distributed task queues with built-in scheduling capabilities. We looked at a few options, including Agenda (which often uses MongoDB), but ultimately landed on BullMQ, backed by Redis.

Here's why this combination clicked for us:

  1. Designed for Dynamic, Repeatable Jobs: BullMQ has first-class support for creating jobs that repeat on a cron schedule or at fixed intervals. Critically, these schedules are stored persistently in Redis.
  2. Decoupling: It cleanly separates the scheduling (managed by BullMQ/Redis) from the execution (handled by separate worker processes). Our main SvelteKit app only needs to tell BullMQ about schedule changes; it doesn't need to worry about when jobs run.
  3. Scalability: This is huge. Redis is fast and scales well. More importantly, we can scale our worker processes independently based on the actual sync workload, without impacting the main web application or the scheduling mechanism itself. Need more sync capacity? Spin up more workers.
  4. Robustness: Redis persistence means schedules survive app restarts or worker crashes. BullMQ has built-in mechanisms for job retries, tracking failures, and managing job lifecycles. This gives us much more confidence than simple setInterval or basic cron.
  5. Node.js Native: As a SvelteKit app, our backend is Node.js. BullMQ fits perfectly into this ecosystem.
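As a taste of point 4, retry behaviour is just per-job configuration rather than something we build ourselves. A sketch (the values here are illustrative, not our production settings — see the BullMQ docs for the full set of options):

```typescript
// Illustrative BullMQ job options: retry up to 3 times with exponential
// backoff (1 minute, then 2, then 4) before the job is marked as failed.
const syncJobRetryOptions = {
  attempts: 3,
  backoff: {
    type: 'exponential' as const,
    delay: 60_000, // base delay in ms
  },
};
```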

Our Scheduling Architecture

So, how does it look in practice?

  1. SvelteKit Backend API:
    • Has endpoints for users to manage their sync settings.
    • When a user saves a schedule (e.g., "sync daily at 2 AM UTC"), the API uses the BullMQ library to add or update a repeatable job associated with that userId.
    • The key is using a unique identifier per user for the repeatable job configuration, allowing updates/removals.
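For context, translating a user-facing choice like "daily at 2 AM UTC" into a cron pattern can be a tiny helper along these lines (hypothetical — the `SyncFrequency` shape is ours for illustration, not part of BullMQ):

```typescript
// Hypothetical helper mapping user-facing sync settings to cron patterns.
type SyncFrequency =
  | { kind: 'hourly' }
  | { kind: 'daily'; hourUtc: number }
  | { kind: 'weekly'; dayOfWeek: number; hourUtc: number }; // 0 = Sunday

function toCronPattern(freq: SyncFrequency): string {
  switch (freq.kind) {
    case 'hourly':
      return '0 * * * *'; // top of every hour
    case 'daily':
      return `0 ${freq.hourUtc} * * *`;
    case 'weekly':
      return `0 ${freq.hourUtc} * * ${freq.dayOfWeek}`;
  }
}
```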

```typescript
// Conceptual API endpoint logic (using BullMQ)
import { Queue } from 'bullmq';

// Connect to the same Redis instance used by the workers
const syncScheduleQueue = new Queue('user-sync-scheduler', { connection: redisConnectionOptions });

// Example: user sets a daily sync at 2:00 AM UTC (cronPattern = '0 2 * * *')
async function setUserSyncSchedule(userId: string, cronPattern: string) {
  const repeatableJobId = `user-sync:${userId}`;

  // Remove any existing schedule for this user first to ensure a clean update.
  // Repeatable jobs are removed by their internal repeat key, which we look
  // up via the predictable jobId we assign below.
  const repeatableJobs = await syncScheduleQueue.getRepeatableJobs();
  const existing = repeatableJobs.find((job) => job.id === repeatableJobId);
  if (existing) {
    await syncScheduleQueue.removeRepeatableByKey(existing.key);
  }

  // Add the new repeatable job
  await syncScheduleQueue.add(
    'trigger-user-sync', // Name of the job the worker will process
    { userId }, // Data payload for the worker
    {
      repeat: {
        pattern: cronPattern, // The user's chosen schedule
        tz: 'UTC', // Important: specify the timezone
      },
      jobId: repeatableJobId, // Use a predictable ID based on userId
      removeOnComplete: true, // Clean up job instances after success
      removeOnFail: 100, // Keep some failed job instances for debugging
    }
  );

  console.log(`Scheduled sync for user ${userId} with pattern ${cronPattern}`);
}

// Call this when the user saves their settings:
// await setUserSyncSchedule('user-123', '0 2 * * *'); // Daily at 2 AM UTC
// await setUserSyncSchedule('user-456', '0 * * * *'); // Hourly
```

  2. Redis: Acts as the brain, storing the job queue (user-sync-scheduler) and the configurations for all the repeatable jobs defined by the API.
  3. Dedicated Worker Service (Separate Node.js Process):
    • This is a completely separate application/process/container from our main SvelteKit web server.
    • It connects to the same Redis instance and listens only for jobs on the user-sync-scheduler queue.
    • When BullMQ determines a repeatable job is due (based on its schedule), it adds an instance of the trigger-user-sync job to the queue.
    • The worker picks up this job, extracts the userId, and then initiates the actual email fetching and document processing logic for that user.

```typescript
// Conceptual worker process logic (worker.ts)
import { Worker } from 'bullmq';
import { performUserEmailSync } from './sync-logic'; // Import the actual sync function

const worker = new Worker(
  'user-sync-scheduler', // Queue name MUST match the one used by the API
  async (job) => {
    const { userId } = job.data;
    console.log(`Processing sync trigger for user: ${userId}`);

    try {
      // **IMPORTANT**: This is where you call your actual business logic.
      // This function handles connecting to Gmail, fetching emails,
      // finding documents, extracting info, saving to the DB, etc.
      await performUserEmailSync(userId);

      console.log(`Successfully completed sync for user: ${userId}`);
    } catch (error) {
      console.error(`Sync failed for user ${userId}:`, error);
      // Error handling/reporting logic here.
      // BullMQ will handle retries based on job options if configured.
      throw error; // Re-throw so BullMQ marks the job as failed
    }
  },
  { connection: redisConnectionOptions } // Use the same Redis connection options
);

console.log('Sync worker started...');

worker.on('failed', (job, err) => {
  console.error(`Job ${job?.id} failed with error: ${err.message}`);
});
```

Project Structure: We keep the worker code in a separate directory within our monorepo (/workers/sync-worker) and deploy it as a distinct service (e.g., a separate Docker container or Heroku worker dyno) from our main SvelteKit web application.

Trade-offs and Final Thoughts

Is this approach perfect? No architecture is. It introduces Redis as an infrastructure dependency that needs managing (backups, monitoring, scaling). It also means deploying and monitoring a separate worker service alongside our main web app.

However, the benefits far outweigh the costs for this specific problem. We gain:

  • Clear Separation: Scheduling logic is isolated. Sync execution logic is isolated. The web app handles user interaction.
  • Scalability: We can tune the number of workers based purely on the sync workload.
  • Reliability: Redis + BullMQ gives us persistence and retry mechanisms essential for background tasks.

Building a SaaS often involves navigating these kinds of architectural trade-offs. For Zenceipt, ensuring that user-scheduled syncs run reliably and scalably is non-negotiable. While simpler options existed, the BullMQ/Redis approach provided the robustness and future-proofing we needed. Now, I sleep a little better knowing that our users' documents are being fetched reliably, right on schedule (mostly!).


Isabella Chen

Isabella is a copy writer who believes accounting shouldn't be intimidating. She draws on her experience helping small businesses to create content that demystifies bookkeeping and empowers entrepreneurs to manage their finances with confidence.
