I built a website that updates itself every day, with no human in the loop. It pulls YouTube videos and scrapes web articles, transcribes the audio, extracts mentions of AI tools using a language model, scores them, validates them, generates a daily summary, builds an HTML page, runs a QA agent to check its own work, fixes what it finds, and pushes the result to a git repo that auto-deploys to aitoolz.info.

No cron job babysitting. No manual review. Just agents talking to agents.

And honestly, building it taught me more about the current state of AI than anything I’ve read in the last year.

What is aitoolz.info?

It’s an auto-curated directory of AI tools. Every day, the system watches a set of YouTube channels (Matt Wolfe, AI Explained, Wes Roth, TheAIGRID, and others) and scrapes articles from TechCrunch, VentureBeat, Anthropic’s blog, OpenAI’s blog, and Hugging Face. It transcribes the videos with Whisper, feeds everything into Claude to extract tool mentions, deduplicates, verifies URLs, scores for relevancy, and generates a slick dark-mode HTML directory. At the top there’s always a fresh daily summary — a headline, key developments, notable tools, trends, and a “one to watch.”

As of today it’s tracking 192 tools across 11 categories, and the daily summary reads like a morning briefing you’d actually want.

The Pipeline

Here’s what a full run looks like:

Research → Score → Validate → Summarize → Generate → Test → Fix → Publish

Eight phases, each one a distinct module, most of them calling Claude at some point. Let me walk through the interesting parts.
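
For concreteness, here is a minimal sketch of how the phases chain together. The module layout and state shape are my own simplification, not the real code; the one behavior it mirrors from the post is that a failed QA run can stop things before Publish.

```python
from typing import Callable

# Illustrative orchestration sketch; each phase is assumed to be a module
# exposing a run(state) -> state function. Names here are placeholders.
Phase = Callable[[dict], dict]

PHASE_ORDER = ["research", "score", "validate", "summarize",
               "generate", "test", "fix", "publish"]

def run_pipeline(registry: dict[str, Phase]) -> dict:
    state: dict = {}
    for name in PHASE_ORDER:
        if name == "publish" and state.get("blocked"):
            break  # QA failures block the publish step (see Test & Fix below)
        state = registry[name](state)
    return state
```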

Research

The system pulls recent videos from a configured list of YouTube channels, downloads the audio, and transcribes locally with Whisper. In parallel, it scrapes RSS feeds and web articles. All of this lands in a content cache — a SQLite table of raw text with source attribution.
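
The content cache is described only loosely here, so this is one plausible shape for it. The column names are my guess; the idea is simply raw text plus source attribution.

```python
import sqlite3

def init_content_cache(path: str = "cache.db") -> sqlite3.Connection:
    """Create the content cache table if it doesn't exist (columns are illustrative)."""
    conn = sqlite3.connect(path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS content_cache (
            id          INTEGER PRIMARY KEY AUTOINCREMENT,
            source_type TEXT NOT NULL,                      -- 'youtube' or 'article'
            source_name TEXT NOT NULL,                      -- channel or publication
            url         TEXT,
            fetched_at  TEXT DEFAULT CURRENT_TIMESTAMP,
            raw_text    TEXT NOT NULL                       -- transcript or article body
        )
    """)
    conn.commit()
    return conn
```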

Then the LLM analyzer chews through the content in batches, extracting structured tool mentions: name, category, description, URL if mentioned, sentiment. This is where the magic and the mess both live. YouTube transcripts are noisy. People say “lambdod.com” when they mean “lambda.com.” They say “this tool is amazing” without ever naming it. The analyzer has to make sense of all of it.
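
A stripped-down version of that extraction step might look like this. The prompt wording and model id are placeholders (the post only says "Claude"), but the shape is the same: format a chunk of cached text into a prompt, call the API, parse the JSON, and treat parse failures as a normal outcome.

```python
import json
import anthropic

EXTRACTION_PROMPT = """Extract every AI tool mentioned in the text below.
Return a JSON array of objects with keys: name, category, description,
url (or null), sentiment ("positive", "neutral", or "negative").

Text:
{chunk}
"""

def extract_tools(chunk: str) -> list[dict]:
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    response = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model id
        max_tokens=2048,
        messages=[{"role": "user", "content": EXTRACTION_PROMPT.format(chunk=chunk)}],
    )
    try:
        return json.loads(response.content[0].text)
    except json.JSONDecodeError:
        return []  # any individual extraction is allowed to fail
```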

Scoring

Every tool gets a relevancy score based on four weighted signals: recency (how recently was it mentioned?), citation count (how many sources mentioned it?), verification confidence (did we find a real URL?), and sentiment. Tools that haven’t been mentioned in 60 days get deprecated. New tools get a badge that expires after a week.

This isn’t fancy. It’s a weighted sum. But it works surprisingly well as a sorting mechanism. The interesting tools rise to the top.
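
The actual weights aren't in this post, so the numbers below are made up, but this is the general shape of a weighted sum over those four signals.

```python
from datetime import datetime, timezone

# Illustrative weights; only the four signals and the 60-day window come from the post.
WEIGHTS = {"recency": 0.4, "citations": 0.3, "verification": 0.2, "sentiment": 0.1}

def relevancy_score(last_mentioned: datetime,  # expects a timezone-aware timestamp
                    citation_count: int,
                    verified: bool,
                    sentiment: float) -> float:
    days = (datetime.now(timezone.utc) - last_mentioned).days
    recency = max(0.0, 1.0 - days / 60)        # decays to zero at the deprecation window
    citations = min(citation_count / 10, 1.0)  # cap the citation signal at 10 sources
    verification = 1.0 if verified else 0.3
    positivity = max(0.0, sentiment)           # assume sentiment arrives in [-1, 1]
    return (WEIGHTS["recency"] * recency
            + WEIGHTS["citations"] * citations
            + WEIGHTS["verification"] * verification
            + WEIGHTS["sentiment"] * positivity)
```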

Validation

This one’s fun. The system exports all unverified and low-confidence tools as a JSON file — a work order for an external “validator agent.” The validator checks URLs, confirms tools are actually AI-related, flags duplicates, and returns a results file. On the next run, those results get imported and applied: URLs get corrected, non-AI tools get deprioritized, duplicates get merged.

This is an async agent loop. The validator runs on its own schedule, and the main pipeline just checks if there are results to import. No tight coupling.
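
In code, the hand-off is just two files and two functions, roughly like this. The file names, table, and columns are assumptions on my part.

```python
import json
import sqlite3

def export_work_order(conn: sqlite3.Connection,
                      path: str = "validation_work_order.json") -> None:
    """Dump unverified / low-confidence tools for the external validator agent."""
    rows = conn.execute(
        "SELECT id, name, url FROM tools WHERE verified = 0 OR confidence < 0.5"
    ).fetchall()
    with open(path, "w") as f:
        json.dump([{"id": r[0], "name": r[1], "url": r[2]} for r in rows], f, indent=2)

def import_results(conn: sqlite3.Connection,
                   path: str = "validation_results.json") -> None:
    """Apply validator results if they exist; do nothing if the validator hasn't run."""
    try:
        with open(path) as f:
            results = json.load(f)
    except FileNotFoundError:
        return  # no tight coupling: the main pipeline just moves on
    for r in results:
        conn.execute("UPDATE tools SET url = ?, verified = 1 WHERE id = ?",
                     (r["url"], r["id"]))
    conn.commit()
```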

The Daily Summary

At the end of each collection cycle, all of the day’s content gets fed into Claude with a prompt asking for a structured daily digest: headline, key developments, notable tools, emerging trends, and a “one to watch.” The output is JSON that gets persisted to the database and rendered as a prominent card at the top of the HTML page.

This is probably my favourite part. It turns a bunch of raw scraped text into something that actually reads like a curated newsletter. Today’s headline: “Anthropic Closes $30B Series G at $380B Valuation as AI Arms Race Escalates.”
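
The digest itself is a small JSON object whose fields mirror the list above. Something like this, with today's real headline and everything else shown as placeholders:

```python
daily_summary = {
    "headline": "Anthropic Closes $30B Series G at $380B Valuation as AI Arms Race Escalates",
    "key_developments": ["..."],
    "notable_tools": ["..."],
    "emerging_trends": ["..."],
    "one_to_watch": "...",
}
```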

Test & Fix

Here’s where it gets meta. Before publishing, the system runs a QA testing agent — another Claude call — that reviews the generated HTML for content quality issues. Does the description for Lambda contain transcript artifacts like “mentioned as a sponsor that the channel host personally uses”? Are there tools with None as their URL? Does the header claim 174 tools while the page renders only 65 cards?

The test agent flags everything with a severity level and a fixable boolean. If it finds fixable issues, a fixer agent runs — cleaning up bad descriptions by sending them back to Claude for rewriting, clearing null values, fixing stale “new” flags — and then the HTML is regenerated and tested again. Up to two attempts.

If it passes, it publishes. If it doesn’t, it blocks and you get a test report explaining why.
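
The loop itself is small. Here is a sketch with the two agents passed in as plain callables; the report shape (issues carrying a severity and a fixable flag) comes from the description above, the rest is illustrative.

```python
def test_and_fix(html: str, run_qa_agent, run_fixer, max_attempts: int = 2):
    """Test the HTML, auto-fix what's fixable, retest; block publish if it still fails."""
    report = {"issues": []}
    for attempt in range(1, max_attempts + 1):
        report = run_qa_agent(html)                      # each issue: severity + fixable flag
        if not report["issues"]:
            return html, report                          # clean pass: safe to publish
        fixable = [i for i in report["issues"] if i["fixable"]]
        if not fixable or attempt == max_attempts:
            break                                        # nothing auto-fixable, or out of attempts
        html = run_fixer(html, fixable)                  # fixer rewrites entries; HTML is regenerated
    return None, report                                  # block publish; the report explains why
```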

Publish

The publisher copies the generated index.html into a local git repo clone, commits with a message that includes the test score, and pushes to GitHub. DigitalOcean App Platform picks up the commit and deploys automatically. The whole thing from scrape to live site takes a few minutes.
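
The publish step is deliberately boring. It comes down to something along these lines, with the repo path and commit message format being my guesses:

```python
import subprocess
from pathlib import Path
from shutil import copyfile

def publish(generated_html: str, repo_dir: str, test_score: float) -> None:
    """Copy the built page into the repo clone, commit with the QA score, push."""
    copyfile(generated_html, Path(repo_dir) / "index.html")
    subprocess.run(["git", "add", "index.html"], cwd=repo_dir, check=True)
    subprocess.run(
        ["git", "commit", "-m", f"Daily update (QA score: {test_score:.0%})"],
        cwd=repo_dir, check=True,
    )
    subprocess.run(["git", "push"], cwd=repo_dir, check=True)
    # DigitalOcean App Platform picks up the push and redeploys on its own
```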

What I Actually Learned

Language Models Are Unreliable in Exactly the Ways You’d Expect

The LLM will confidently extract “HyperWrite” as a tool with the description “discussed in the video.” It’ll assign a URL from the wrong version of a product. It’ll categorize Waymo as an “AI Tool” instead of, you know, a car company that uses AI. Every single one of these failure modes is predictable, and you deal with them the same way you deal with any unreliable data source: validation layers, confidence scores, and the assumption that any individual extraction might be wrong.

The trick is that across hundreds of extractions, the aggregate is pretty good. You don’t need each one to be perfect. You need the system to be self-correcting.

Agents Aren’t Magic, They’re Just Functions With LLM Calls

I keep seeing “agent” used like it means something profound. In practice, every agent in this system is a Python class with a __call__ method that formats a prompt, calls Claude, and parses the JSON response. The “QA testing agent” is ~250 lines of code, most of which is the prompt template and some regex for pre-computing HTML stats.

What makes them feel agent-like is the pipeline orchestration — the fact that one agent’s output feeds into another agent’s input, and they can make decisions that affect downstream behavior (like the tester blocking a publish). But mechanically, it’s just functions calling an API.
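
To make that concrete, here is roughly what one of those classes boils down to. The model id and prompt handling are placeholders; the format-call-parse shape is the point.

```python
import json
import anthropic

class Agent:
    """The basic shape of every 'agent' in the system: format, call, parse."""

    def __init__(self, prompt_template: str, model: str = "claude-sonnet-4-5"):
        self.client = anthropic.Anthropic()
        self.prompt_template = prompt_template
        self.model = model  # placeholder model id

    def __call__(self, **inputs) -> dict:
        prompt = self.prompt_template.format(**inputs)
        response = self.client.messages.create(
            model=self.model,
            max_tokens=2048,
            messages=[{"role": "user", "content": prompt}],
        )
        return json.loads(response.content[0].text)
```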

Semantics Are the Hard Part

The hardest problems in this entire system aren’t the API calls or the HTML generation or the git publishing. They’re semantic. Is “Genie 3” a new tool or an update to “Genie 2”? Is “Claude Opus 4.6” a separate entry from “Claude”? Is an AI company the same thing as an AI tool? When a YouTuber says “this new thing from Google” and then pivots to talking about something else, where does one tool mention end and another begin?

These are fundamentally understanding problems, and LLMs are surprisingly okay at them — not perfect, but way better than any regex or keyword-matching approach would be. The system leans into this. It uses Claude for extraction, for summarization, for validation, for testing, and for fixing. Each of these is a different kind of understanding, applied at a different stage.

The Meta Gets Deep

There’s something genuinely weird about a system where Claude extracts tool mentions, then Claude summarizes the findings, then Claude tests the generated output, then Claude fixes the problems it found, then the fixed version gets tested by Claude again. It’s LLMs all the way down.

And the failure modes cascade in interesting ways. If the extractor misidentifies a company as a tool, the scorer gives it a middling score, the validator might catch it or might not, the generator renders it faithfully, and then the tester flags “some entries are companies/organizations rather than tools.” The fix for that lives in the validator, not the tester — the tester just reports.

Understanding which agent should own which responsibility is basically a distributed systems design problem. And like all distributed systems problems, getting the boundaries right is more important than getting any individual component perfect.

Why I Built It

Honestly? Because I’m drowning in AI news. Every morning there are 15 new tools, 3 new models, and someone on YouTube telling me the world just changed again. I wanted a single page that told me what actually matters today, auto-updated, with no human editorial bias (well, beyond the YouTube channels I chose to watch).

But also because I wanted to understand what it actually feels like to build a real application with LLM agents. Not a toy. Not a demo. Something that runs in production, handles messy real-world data, and needs to be reliable enough to push to a live website without human review.

The answer: it feels a lot like regular software engineering, except your core data processing layer is probabilistic and occasionally hallucinates. You deal with it the same way you deal with any other unreliable dependency — redundancy, validation, and graceful degradation.

The future happens slowly and then all at once. And apparently, it also updates itself daily at 2 AM.


aitoolz.info is live now. Built with Python, Claude, Whisper, SQLite, and a healthy distrust of any single LLM output.