I Scrape 99 Sources and Process 10,000+ Jobs Daily — Here's the Entire Data Pipeline
Published on BirJob.com · April 2026 · by Ismat Samedov
The Scraper That Ran for 464 Seconds and Returned Nothing
Last February, one of my scrapers — hrcbaku — ran for 7 minutes and 44 seconds inside a GitHub Actions container before finally dying with Errno 104: Connection reset by peer. It had been working perfectly for months. The site hadn't changed. My code hadn't changed. What changed was that GitHub Actions' IP range got flagged by Cloudflare, and my scraper spent nearly 8 minutes politely retrying a connection that would never succeed.
That's the reality of running a production data pipeline that scrapes the open web. Not the tutorial version where you write 20 lines of BeautifulSoup and call it a day. The version where you maintain 99 scrapers across banking portals, government career pages, recruitment platforms, and corporate websites — and something breaks every single week.
I built BirJob, Azerbaijan's largest job aggregation platform. It collects vacancies from 99 sources, deduplicates them, and serves them through a search interface used by thousands of job seekers. The infrastructure runs on ~$25/month. No Kafka. No Airflow. No data lake. Just Python, PostgreSQL, GitHub Actions, and a lot of hard-won lessons about what actually matters in data engineering.
This article is the technical deep-dive I wish existed when I started. Not theory — the actual architecture, the actual code patterns, the actual failure modes.
The Numbers Behind the Pipeline
Before we get into architecture, here's what this system handles daily:
| Metric | Value |
|---|---|
| Active job scrapers | 99 |
| Candidate scrapers | 12 |
| Jobs processed per run | 2,000–7,000 |
| Active jobs in database | 10,000+ |
| Scraped candidate profiles | 30,000+ |
| Database tables | 45+ |
| Database indexes | 50+ |
| Total scraper runtime | 3–4 minutes |
| Monthly infrastructure cost | ~$25 |
These numbers matter because they establish something important: you can build a real data pipeline at meaningful scale without enterprise tooling. Indeed started in 2004 processing 160,000 documents per day on Lucene running on a single server. They grew through four complete infrastructure rewrites before reaching billions. My pipeline is tiny compared to Indeed. But the problems — deduplication, staleness, source reliability, schema drift — are the same problems at any scale.
The data engineering market reflects this reality. Job postings for data engineering roles grew 22.89% year-over-year, with demand showing 50% YoY growth. Median U.S. salaries sit at $130,000–$135,000, now outpacing data science at equivalent experience levels. The plumbing has become more valuable than the models it feeds.
Architecture: The Boring Stack That Actually Works
Here's the full system, end to end:
```
GitHub Actions (cron: 08:00 UTC daily)
        ↓
Docker container with Python 3.11
        ↓
ScraperManager loads 99 scraper classes dynamically
        ↓
aiohttp sessions run concurrently (semaphore: 2)
        ↓
Each scraper returns a pandas DataFrame
        ↓
Python-level cleaning: normalize titles, remove blanks, deduplicate
        ↓
Source detection via URL pattern matching (200+ patterns)
        ↓
UPSERT to PostgreSQL on Neon (ON CONFLICT apply_link)
        ↓
Stale jobs marked inactive (soft delete)
        ↓
Stats sent to Telegram + GitHub Actions annotations
```
Every piece of this stack was chosen because it's boring. Boring is good. Boring means I'm not debugging my infrastructure at 2 AM — I'm debugging the actual data problems.
Why Not Airflow? Why Not Kafka?
I get this question a lot. The answer is simple: I process ~5,000 jobs per run. That's a CSV file's worth of data. Airflow is designed for orchestrating complex DAGs across teams. Kafka is designed for millions of events per second. Using either for my workload would be like renting a Boeing 747 to fly across town.
A cron-triggered Docker container on GitHub Actions gives me: free compute (2,000 minutes/month on the free tier), built-in secrets management, logs with GitHub's UI, manual re-trigger capability, and zero servers to maintain. The total DevOps burden is one YAML file.
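That YAML file looks roughly like this — a sketch, with the workflow name, schedule, and secret names as illustrations rather than the actual BirJob config:

```yaml
name: daily-scrape
on:
  schedule:
    - cron: "0 8 * * *"    # 08:00 UTC daily
  workflow_dispatch: {}     # manual re-trigger from the Actions UI
jobs:
  scrape:
    runs-on: ubuntu-latest
    container: python:3.11-slim
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r requirements.txt
      - run: python run_scrapers.py
        env:
          DATABASE_URL: ${{ secrets.DATABASE_URL }}
          TELEGRAM_TOKEN: ${{ secrets.TELEGRAM_TOKEN }}
```

The `schedule` trigger gives you cron, `workflow_dispatch` gives you the manual re-run button, and `secrets` gives you credential management — all for free.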
Arvid Kahl of Podscan — who processes millions of podcast rows daily as a solo founder — put it best: "As a solo founder, you need solutions that not only work well but also don't require constant attention." That's the design philosophy.
The Scraper Framework: Patterns That Survived 99 Sites
Every scraper extends a BaseScraper class. This isn't a framework I planned — it's one that evolved from fixing the same bugs across dozens of scrapers. Here's what the base class handles:
Async HTTP with Connection Pooling
All requests go through aiohttp.ClientSession, which maintains a connection pool with Keep-Alive. This matters more than people think. Benchmarks show aiohttp handling 35.7 requests/second versus 2.2 for synchronous requests — roughly a 16x difference. For 99 scrapers that each make 1–50 HTTP calls, the difference between sequential and concurrent execution is the difference between 45 minutes and 3 minutes.
But raw speed isn't why I use async. The real reason is graceful failure isolation. When one scraper hangs for 60 seconds waiting for a response, the others keep running. In synchronous code, that one slow scraper blocks everything behind it.
Retry Logic with Exponential Backoff
```python
# Simplified from base_scraper.py — this fragment lives inside an
# async fetch method, so `session` and `url` are already in scope.
import asyncio
import random

import aiohttp

max_retries = 3
for attempt in range(max_retries):
    try:
        async with session.get(
            url, timeout=aiohttp.ClientTimeout(total=60)
        ) as response:
            if response.status in (403, 429, 503):
                # Rate-limited or blocked: back off harder, with jitter
                delay = (3 ** attempt) + random.uniform(2, 5)
                await asyncio.sleep(delay)
                continue
            return await response.text()
    except (aiohttp.ClientError, asyncio.TimeoutError):
        delay = (2 ** attempt) + random.uniform(0, 2)
        await asyncio.sleep(delay)
return None  # all retries exhausted
```
Three retries. Exponential backoff. Random jitter to avoid thundering herds. The CI environment gets more aggressive delays (3^attempt instead of 2^attempt) because GitHub Actions IPs are more likely to be rate-limited.
I learned the hard way that retry logic needs to be per-request, not per-scraper. Early versions would retry the entire scrape operation on failure, which meant re-fetching pages that had already succeeded. Now each individual HTTP call handles its own retries independently.
User-Agent Rotation
The base class rotates through 10 different User-Agent strings — Chrome, Firefox, Safari, Edge across Windows, Mac, and Linux. This isn't about being sneaky. It's about matching what real traffic looks like. A site that sees 500 requests from identical User-Agents in 3 minutes will reasonably assume it's automated.
Anti-bot systems in 2026 use behavioral analysis and ML models. Basic IP rotation alone doesn't work anymore. But for most sites — especially regional job portals — User-Agent rotation combined with respectful delays is enough.
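The rotation itself is only a few lines. Here's a sketch with an illustrative subset of the 10 strings (the real list is longer and updated as browsers release new versions):

```python
import random

# Illustrative subset of the rotating User-Agent pool
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7; rv:121.0) "
    "Gecko/20100101 Firefox/121.0",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
]

def request_headers() -> dict:
    """Build fresh headers per request, so successive calls
    from the same session don't look identical."""
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept-Language": "az,en;q=0.8",
    }
```

Picking per-request rather than per-session matters: a single scraper making 50 calls with one frozen User-Agent is just as fingerprint-able as no rotation at all.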
The Error Handler Decorator
```python
@scraper_error_handler
async def scrape_azercell(self, session):
    # scraper logic here
    return df
```
Every scraper method is wrapped with @scraper_error_handler. It catches all exceptions and returns an empty DataFrame with error metadata instead of crashing. This is the single most important pattern in the entire codebase.
Why? Because when you run 99 scrapers in parallel and one throws an unhandled exception, you lose that source for the day. But when 99 scrapers run and 15 fail gracefully, you still collect jobs from the other 84. The pipeline degrades instead of crashes. That's the difference between a demo and a production system.
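The decorator itself is short. A minimal sketch of the pattern — the production version records richer metadata, and the `attrs` fields here are my illustration, not necessarily the actual field names:

```python
import functools

import pandas as pd

def scraper_error_handler(func):
    """Catch any exception from a scraper method and return an empty
    DataFrame tagged with error metadata, so one broken scraper
    can't take down the whole run."""
    @functools.wraps(func)
    async def wrapper(self, session, *args, **kwargs):
        try:
            return await func(self, session, *args, **kwargs)
        except Exception as exc:
            df = pd.DataFrame(columns=["title", "company", "apply_link"])
            df.attrs["error"] = f"{type(exc).__name__}: {exc}"
            df.attrs["scraper"] = func.__name__
            return df
    return wrapper
```

The caller always gets a DataFrame back — empty means "this source failed today," and the attached metadata feeds the failure classification downstream.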
The ScraperManager: Orchestration Without a Framework
The ScraperManager class does four things:
- Dynamic loading — Uses Python's `importlib` to auto-discover scraper classes. Drop a new file in `sources/`, and it gets picked up on the next run. No registration, no config file.
- Health checks — Before running a scraper, sends a HEAD request to check if the site is responsive. Skips dead sites to avoid wasting time on guaranteed failures.
- Concurrency control — A semaphore limits concurrent scrapers to 2 in CI. Why only 2? Because GitHub Actions runners have limited memory, and running 99 aiohttp sessions plus BeautifulSoup parsing simultaneously causes OOM kills.
- Failure classification — Every scraper run gets categorized: success, no_jobs, failed, timeout, network_error, rate_limited, blocked, or invalid_return. This classification drives the monitoring dashboard.
The disabled scrapers list is loaded from the database (scraper_config table), with a hardcoded fallback if the DB is unreachable. Currently 15 scrapers are disabled — Cloudflare-blocked sites, dead domains, sites that moved to SPAs without APIs. Each disabled scraper has a reason documented in the config. I check them monthly to see if anything's recoverable.
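Stripped of discovery and health checks, the run loop can be sketched like this — a simplified stand-in where each scraper is an async callable returning a list of job dicts (the real manager works with DataFrames and `importlib`-discovered classes, and covers more failure categories):

```python
import asyncio

async def run_all(scrapers: dict, max_concurrent: int = 2) -> dict:
    """Run scrapers concurrently under a semaphore and classify each
    outcome. `scrapers` maps name -> async callable returning jobs."""
    sem = asyncio.Semaphore(max_concurrent)  # only 2 at once in CI (memory)

    async def run_one(name, scraper):
        async with sem:
            try:
                jobs = await asyncio.wait_for(scraper(), timeout=90)
            except asyncio.TimeoutError:
                return name, "timeout", []
            except Exception:
                return name, "failed", []
            return name, ("success" if jobs else "no_jobs"), jobs

    results = await asyncio.gather(
        *(run_one(n, s) for n, s in scrapers.items())
    )
    return {name: (status, jobs) for name, status, jobs in results}
```

Note that `gather` never sees a raw exception — every failure is caught inside `run_one` and turned into a classification, which is exactly the degrade-don't-crash behavior described above.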
What Actually Breaks (and How Often)
| Failure Type | Frequency | Example |
|---|---|---|
| CSS selector changes | 2–3 per month | TABIB changed CSS module hashes every deploy |
| API endpoint changes | 1–2 per month | ProJobs moved from /v1/vacancies to an unknown endpoint |
| Cloudflare/IP blocking | Ongoing (15 sites) | GitHub Actions IPs blocked by Djinni, Boss.az |
| Site goes offline | 1 per quarter | guavalab.az went completely offline |
| REST → GraphQL migration | Rare but painful | Boss.az moved to Apollo GraphQL overnight |
| Rate limiting (429) | Weekly | Some sites limit to 3 requests before blocking |
The average maintenance load is about 3–5 hours per week fixing broken scrapers. That's the part nobody talks about in scraping tutorials. Building the scraper takes an hour. Maintaining it for a year takes 150 hours.
Deduplication: The Hardest Problem Nobody Warns You About
The same job posting can appear on 4 different job boards with slightly different titles, different URLs, and different formatting. "Senior Data Analyst" on one site is "Data Analyst (Senior)" on another. Both link to the same application form. If you don't deduplicate, your users see the same job four times and lose trust in your platform.
My deduplication runs at three levels:
Level 1: Within-Batch Dedup (Python)
```python
# Normalize then dedup within the current scrape batch
df['_norm_key'] = df.apply(
    lambda r: f"{r['company'].lower().strip()}::{normalize_title(r['title'])}",
    axis=1
)
df = df.drop_duplicates(subset=['_norm_key'], keep='first')
```
The normalize_title function strips parenthetical content, removes location suffixes ("- Bakı"), collapses whitespace, and lowercases everything. "Frontend Developer (React) - Baku" becomes "frontend developer". This catches about 60% of within-batch duplicates.
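A sketch of what `normalize_title` can look like under those rules — the real suffix list and edge cases are longer than shown here:

```python
import re

def normalize_title(title: str) -> str:
    """Lowercase, strip parentheticals and location suffixes,
    collapse whitespace. Illustrative subset of the real rules."""
    t = title.lower()
    t = re.sub(r"\([^)]*\)", " ", t)                 # strip "(React)" etc.
    t = re.sub(r"[-–—]\s*(baku|bakı)\b.*$", " ", t)  # drop location suffix
    return re.sub(r"\s+", " ", t).strip()            # collapse whitespace
```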
Level 2: Database Upsert on apply_link
```sql
INSERT INTO scraper.jobs_jobpost
  (title, company, apply_link, source, last_seen_at, is_active, dedup_hash)
VALUES ($1, $2, $3, $4, NOW(), TRUE, $5)
ON CONFLICT (apply_link) DO UPDATE SET
  title        = EXCLUDED.title,
  company      = EXCLUDED.company,
  source       = EXCLUDED.source,
  last_seen_at = NOW(),
  is_active    = TRUE
RETURNING source, (xmax = 0) AS is_new;
```
The apply_link (job application URL) is the unique stable identifier. If the same URL appears again, we update the record and refresh last_seen_at. The xmax = 0 trick is a PostgreSQL-specific way to detect whether the row was inserted (new) or updated (existing) — no extra query needed.
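On the Python side, those `RETURNING` rows can be tallied into new-versus-refreshed counts per source without any extra query — a sketch, assuming the driver hands back `(source, is_new)` tuples:

```python
from collections import Counter

def tally_upserts(rows):
    """Split upsert results into per-source counts of newly inserted
    vs. refreshed jobs, from (source, is_new) tuples."""
    new, updated = Counter(), Counter()
    for source, is_new in rows:
        (new if is_new else updated)[source] += 1
    return new, updated
```

These per-source counts are what feed the Telegram summary at the end of each run.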
Level 3: Hash-Based Soft Tracking
```python
import hashlib

dedup_hash = hashlib.md5(
    f"{company.lower().strip()}::{title.lower().strip()}".encode("utf-8")
).hexdigest()
```
This hash isn't enforced as a constraint — it's tracked for reporting. When the same job appears with different URLs (cross-posted to multiple boards), the hash helps identify potential duplicates for manual review. I chose not to enforce this as a unique constraint because false positives (different jobs at the same company with similar titles) would be worse than some duplicates slipping through.
Bright Data's guide to data matching recommends embedding-based similarity search for advanced deduplication. I've looked into it. For my scale (10K active jobs), the MD5 + URL approach catches 85%+ of duplicates without the complexity and cost of running embeddings. The remaining duplicates are cross-board postings with genuinely different URLs, and honestly, showing a job twice is better than accidentally hiding it.
The Database: PostgreSQL Does More Than You Think
I use Neon PostgreSQL with a multi-schema design: scraper for data collection, website for user-facing features. This separation matters because scrapers hammer the database with bulk upserts while users need fast reads. Different schemas let me reason about each workload independently.
Index Strategy
The jobs_jobpost table has 10 indexes. That sounds like a lot for a simple table with 7 columns. But each index serves a specific query pattern:
| Index | Query Pattern |
|---|---|
| `(is_active, created_at DESC)` | Homepage: newest active jobs |
| `(is_active, source)` | Filter by source + active |
| `(is_active, company)` | Company pages: active jobs by employer |
| `(dedup_hash)` | Dedup check during upsert |
| `(last_seen_at DESC)` | Freshness monitoring |
| `(apply_link)` UNIQUE | Upsert conflict target |
Composite indexes are the key optimization here. A query like WHERE is_active = TRUE ORDER BY created_at DESC hits a single index scan instead of filtering then sorting. Mattermost documented making a Postgres query 1,000x faster through exactly this kind of index optimization.
Do I need partitioning? No. PostgreSQL handles billions of rows with proper partitioning, but for <10 million rows, indexes are sufficient. Partitioning adds operational complexity (managing partitions, pruning, migration tooling) that isn't justified at my scale.
Soft Deletes and Job Lifecycle
Jobs are never deleted. When a job disappears from its source, it gets is_active = FALSE. When it reappears, it gets reactivated. This pattern gives me:
- Historical analytics — I can query how long jobs stay active, which sources have the highest turnover
- Recovery from scraper bugs — if a broken scraper marks everything inactive, fixing the scraper and re-running restores the data
- Freshness tracking — `last_seen_at` tells me exactly when each job was last confirmed active
The downside is query overhead. Every user-facing query needs WHERE is_active = TRUE. But the composite indexes make this effectively free.
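The stale-marking step from the pipeline diagram reduces to a single statement — a sketch, with the three-day grace window as my assumption (the actual threshold isn't stated here):

```sql
-- Soft-delete jobs not seen in the last 3 runs' worth of days
UPDATE scraper.jobs_jobpost
SET is_active = FALSE
WHERE is_active = TRUE
  AND last_seen_at < NOW() - INTERVAL '3 days';
```

A grace window longer than one run matters: a single failed scrape shouldn't deactivate an entire source's jobs.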
Beyond Scraping: The AI Layer
Here's where the pipeline gets interesting. The scraped data feeds into more than just a search interface.
Gemini-Powered Admin Analytics
The admin panel includes an AI Advisor built on Google's Gemini 2.5 Flash. It uses function calling (tool use) to query the database in natural language. An admin can ask "Which sources had the most failures this week?" and the AI:
- Discovers available tables via `list_tables()`
- Inspects schema via `describe_table()`
- Writes and validates a SQL query
- Executes it (read-only, max 200 rows)
- Interprets the results in business context
All queries are logged to an ai_advisor_log table with the user message, AI response, SQL queries executed, tool calls made, and execution duration. This creates an audit trail and helps me understand which questions admins actually ask — which informs what dashboards to build next.
The system prompt gives the AI deep business context about BirJob's data model. It knows that is_active = FALSE means a job was deactivated, not deleted. It knows that dedup_hash is MD5 of company+title. This domain knowledge makes the AI's SQL dramatically better than generic text-to-SQL.
Source Detection via Pattern Matching
When a job comes in with apply_link = "https://azercell.easyhire.me/jobs/12345", the system needs to label it as "Azercell". I have 200+ URL patterns that map domains and subdomains to human-readable source names. This is rule-based, not ML, and that's intentional.
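The pattern table reduces to a regex-over-hostname lookup. A sketch with an illustrative subset of the 200+ patterns — real entries and names may differ:

```python
import re
from urllib.parse import urlparse

# Illustrative subset of the 200+ domain patterns
SOURCE_PATTERNS = [
    (re.compile(r"(^|\.)azercell\.easyhire\.me$"), "Azercell"),
    (re.compile(r"(^|\.)boss\.az$"), "Boss.az"),
    (re.compile(r"(^|\.)djinni\.co$"), "Djinni"),
]

def detect_source(apply_link: str, default: str = "Other") -> str:
    """Map a job's apply URL to a human-readable source name
    by matching its hostname against known domain patterns."""
    host = urlparse(apply_link).netloc.lower().split(":")[0]
    for pattern, name in SOURCE_PATTERNS:
        if pattern.search(host):
            return name
    return default
```

Anchoring on the hostname (not the full URL) means path changes on a source's site never break the label.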
AI-native extraction is an emerging trend where LLMs interpret page structure instead of CSS selectors. I've experimented with it. For extraction, CSS selectors are 10–100x more cost-efficient than LLM calls at scale. An LLM call costs ~$0.001–0.01 per page. A regex match costs nothing. When you're processing thousands of pages daily, that difference adds up.
My rule: use AI where it provides genuine intelligence (analytics, matching, recommendations), not where a regex does the same job faster and cheaper.
The Full Stack: How Everything Connects
The scraping pipeline is just one piece. Here's the complete system:
| Component | Technology | Cost/Month |
|---|---|---|
| Frontend + API | Next.js 14 on Vercel | $20 (Pro) |
| Database | PostgreSQL on Neon | $5 |
| Scraper orchestration | GitHub Actions | Free |
| CV storage | AWS S3 | ~$0.50 |
| Email (alerts, campaigns) | Resend | Free tier |
| Payments | Epoint (Azerbaijan) | Per transaction |
| Error tracking | Sentry | Free tier |
| CDN + Security | Cloudflare | Free |
Total: approximately $25/month for a platform serving thousands of users with real-time data from 99 sources. That's not a toy project budget — that's a deliberate architectural choice to keep fixed costs near zero until revenue justifies scaling up.
The Notification Pipeline
Users can subscribe to job alerts via email or Telegram. The email system uses Resend with proper List-Unsubscribe headers (RFC 2369 compliance), HMAC-based unsubscribe tokens, and delivery tracking (sent, delivered, bounced, opened, clicked). The Telegram bot (@birjob_bot) uses webhooks — not polling — for real-time keyword-based job alerts.
Analytics events (20+ types: job views, searches, registrations, payments) go through a fire-and-forget in-memory queue. The queue processes events with 50ms delays to avoid flooding the database. Events include device detection, geolocation (extracted from Cloudflare headers), session tracking, and referrer parsing.
Is the in-memory queue a risk? Yes — events are lost if the process restarts. But for analytics, that's acceptable. Payment webhooks use direct database inserts with signature verification and idempotency guards. Different reliability requirements, different architectures.
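A minimal sketch of that queue pattern, assuming a `write_fn` coroutine that inserts one event — class and method names are illustrative, not the production code:

```python
import asyncio

class AnalyticsQueue:
    """Fire-and-forget in-memory event queue. Events are lost on
    restart — acceptable for analytics, not for payments."""

    def __init__(self, write_fn, delay: float = 0.05):
        self._queue = asyncio.Queue()
        self._write = write_fn   # e.g. an INSERT into the events table
        self._delay = delay      # 50 ms between writes to avoid flooding

    def track(self, event: dict) -> None:
        # Never blocks the request path
        self._queue.put_nowait(event)

    async def drain(self) -> int:
        """Flush queued events to storage, spaced by the delay."""
        written = 0
        while not self._queue.empty():
            event = self._queue.get_nowait()
            await self._write(event)
            written += 1
            await asyncio.sleep(self._delay)
        return written
```

The request handler only ever calls `track()`; a background task owns `drain()`, so the database sees a steady trickle instead of a burst per page view.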
A Decision Framework for Solo Data Engineers
After building and maintaining this system for over a year, here's the framework I use for every technical decision:
Step 1: What's Your Actual Scale?
| Daily Volume | Recommended Stack |
|---|---|
| < 10,000 records | Cron + Python + PostgreSQL |
| 10K–100K records | Add Redis caching, consider Celery |
| 100K–1M records | Add message queues, consider Elasticsearch for search |
| 1M+ records | Now Airflow/Dagster, Kafka, and a data warehouse make sense |
Most solo developers are in the first row. Most tutorials teach the last row. That mismatch wastes enormous amounts of time.
Step 2: Optimize for Maintenance, Not Performance
My scrapers are not the fastest possible. I could squeeze more performance with connection pooling optimizations, parallel parsing, or compiled regex. But performance isn't my bottleneck. Maintenance is. Every hour I spend making a scraper 20% faster is an hour I don't spend fixing the three other scrapers that broke this week.
Concrete choices driven by this principle:
- Base classes over frameworks — `BaseScraper` handles retry, encoding, and error handling, so each individual scraper is 30–80 lines of site-specific logic
- Dynamic class loading — Drop a file, it runs. No registration, no config changes.
- Failure classification — When something breaks, I immediately know if it's a network issue, a rate limit, or a site change. Different causes, different fixes.
- Disabled scrapers with reasons — Every disabled scraper has a documented reason. Monthly I review the list and re-test.
Step 3: Choose Your Deduplication Strategy Early
I've seen this mistake in other projects and made it myself: treating deduplication as a "later" problem. It's not. If your first 1,000 users see duplicate listings, they won't become your next 10,000 users.
Start with URL-based dedup (unique constraint on the source URL). Add normalized title+company hashing when you have multiple sources for the same employers. Only invest in embedding-based similarity matching if you have clear evidence that hash-based dedup misses a significant number of duplicates.
Step 4: Monitor Jobs-Per-Source, Not Just Total Jobs
Total job count is a vanity metric. If your total goes from 10,000 to 9,500, that could be normal daily fluctuation. But if one source drops from 200 to 0, that source is broken. Track per-source job counts on every run. A source that returns zero jobs after previously returning hundreds is always worth investigating — it's never because they genuinely posted nothing.
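The check itself is trivial to automate — a sketch, with the minimum-previous-count threshold as my assumption:

```python
def find_broken_sources(prev: dict, curr: dict, min_prev: int = 10) -> list:
    """Flag sources that previously returned a meaningful number of
    jobs but now return zero — almost always a broken scraper."""
    return sorted(
        source for source, count in prev.items()
        if count >= min_prev and curr.get(source, 0) == 0
    )
```

The `min_prev` floor keeps tiny sources (which legitimately fluctuate to zero) out of the alert channel.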
Lessons I Learned the Hard Way
1. Start with 20 Sources, Not 91
I built 91 scrapers before launch. Half broke within the first month. CSS selectors changed, APIs migrated, sites added Cloudflare. If I started over, I'd launch with the 20 most reliable sources and add others gradually. Each new scraper is a maintenance commitment, not a one-time build.
2. Skip Playwright Unless You Absolutely Can't
95% of "dynamic" sites have a JSON API hiding behind their JavaScript frontend. Check the Network tab in DevTools before reaching for a headless browser. Playwright is slow (5–30 seconds per page vs. 200ms for an API call), flaky in CI environments, and eats memory. I use it for exactly 2 sites now, down from 5, because I found the hidden APIs for the other 3.
3. GitHub Actions IPs Will Get Blocked
GitHub publishes their IP ranges. So do anti-bot services. Several sites (Djinni, Boss.az, HRCBaku) block GitHub Actions IPs specifically. There's no fix other than using a residential proxy or accepting the loss. I chose to disable those scrapers rather than add proxy infrastructure for a handful of sources.
4. The Database Is Your Source of Truth, Not Your Scrapers
Scrapers lie. They return partial data, wrong data, stale data. The database schema enforces what matters: unique constraints, NOT NULL checks, foreign keys. When a scraper returns garbage, the database rejects it. This is intentional. I'd rather lose a scraper run's data than corrupt my database.
5. Soft Deletes Save Lives
Early in development, I hard-deleted jobs that disappeared from their source. Then a scraper bug wiped 3,000 jobs. With hard deletes, they were gone. After switching to soft deletes (is_active = FALSE), a similar bug was fixable by re-running the scraper — everything reactivated automatically.
What I Actually Think
The data engineering industry has an overengineering problem. Conference talks and blog posts showcase Kafka + Airflow + Snowflake + dbt + Great Expectations stacks that cost $50,000/month to run. Those tools exist for a reason — at Netflix scale, you need them. But the vast majority of data work happens at a scale where PostgreSQL, Python, and cron are not just sufficient but optimal.
I've built a platform that scrapes 99 sources, deduplicates 10,000+ jobs, serves thousands of users, runs AI-powered analytics, processes payments, sends email and Telegram alerts — all for $25/month. Not because I'm cheap. Because every dollar spent on infrastructure is a dollar not spent on improving the product.
The "modern data stack" is marketed as the only professional choice. But Indeed started with Lucene on a single server. They grew into complexity as their scale demanded it. They didn't start with a distributed system because someone on Hacker News said they should.
If you're a solo developer or a small team, here's my honest advice: start with the simplest architecture that can handle 10x your current load. PostgreSQL, not Snowflake. Cron, not Airflow. aiohttp, not a managed scraping fleet. You can always add complexity. You can rarely remove it.
The best data pipeline is the one you can debug at 3 AM with your eyes half open. Make it boring. Make it work. Make it maintainable. The fancy stuff can wait until you have the revenue to justify it and the team to maintain it.
Sources
- Indeed Engineering — From 1 to 1 Billion: Evolution of Indeed's Document Serving System
- ElectroIQ — Data Engineering Statistics 2026
- USDSI — Is Data Engineering the Fastest-Growing Career in 2026?
- 365 Data Science — Data Engineer Job Outlook 2025
- DEV.to — requests vs httpx vs aiohttp: Benchmark Results
- The Bootstrapped Founder — Indie Hacking Databases at Scale
- Browserless — State of Web Scraping 2026
- TigerData — PostgreSQL Performance Tuning: Optimizing Database Indexes
- TigerData — Handling Billions of Rows in PostgreSQL
- Mattermost — Making a Postgres Query 1,000x Faster
- Bright Data — Guide to Data Matching and Deduplication
- Dagster — Data Pipeline Architecture: 5 Design Patterns
- Oxylabs — Asynchronous Web Scraping with Python aiohttp
- ScrapingAPI — Legal Battles That Changed Web Scraping
Ismat Samedov builds BirJob — Azerbaijan's job aggregator pulling from 99 sources daily. If you're building data pipelines or job platforms, he's probably made the mistake you're about to make.