Scraping 10k YouTube transcripts for LLM training data

Published
• 2 min read
Hi, I'm George! 👋🏻
✅ Who I am: Full-stack engineer and data extraction specialist. I build scrapers for platforms everyone says can't be scraped, and I do it reliably, at scale.
🤓 My journey: Started as a full-stack engineer, quickly realized the hardest (and most valuable) technical problems are in data extraction. Anti-bot systems, fingerprinting, session management, rate limiting: that's where I live now.
🚀 My work: Creator of the Unblockable ICO Drops Scraper, built to defeat one of the toughest anti-bot systems on the web. Also built scrapers for CoinMarketCap, Shopify stores, TikTok Shop, and LinkedIn. If it has data behind a wall, I've probably cracked it.
⚒️ My skills: Node.js, Crawlee, Playwright, Puppeteer, Advanced Anti-Bot Bypassing, Proxy Management, API Reverse-Engineering, Docker, NestJS, Python.
🚢 Shipped: PenBot (penbotservices.com), an AI platform delivering intelligence through prison email systems. Inmate Creations (inmatecreations.com), currently in production.
🫱🏻‍🫲🏽 Work with me: Open to data extraction challenges, automation pipelines, and infrastructure projects. Not open to ticket-taking.
📩 Contact: georgethedeveloper3046@gmail.com
🔗 Portfolio: george-kioko.pages.dev
🐙 GitHub: github.com/the-ai-entrepreneur-ai-hub

I was building a training set from AI podcast transcripts and hit the YouTube Data API's 10,000-unit daily quota almost immediately. Here is how I worked around it.

The problem

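YouTube Data API v3 has a default quota of 10,000 units per project per day. Transcript pulls through the API are expensive: the published quota costs are 50 units for captions.list and 200 units for captions.download, so one list-plus-download pull costs about 250 units and the daily quota covers roughly 40 transcripts. At 10 transcripts per minute you burn the whole quota in about four minutes.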
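How fast the quota runs out depends on how many units each transcript pull costs, so a quick check is easier with that cost as a parameter. This is a small sketch of the arithmetic only; the function names are made up for illustration.

```python
DAILY_QUOTA_UNITS = 10_000  # default YouTube Data API v3 quota per project


def transcripts_per_day(units_per_transcript: int, quota: int = DAILY_QUOTA_UNITS) -> int:
    """How many whole transcript pulls fit into one day's quota."""
    return quota // units_per_transcript


def minutes_until_exhausted(
    units_per_transcript: int,
    transcripts_per_minute: float,
    quota: int = DAILY_QUOTA_UNITS,
) -> float:
    """Minutes of continuous pulling before the daily quota is gone."""
    return quota / (units_per_transcript * transcripts_per_minute)
```

With the published costs of 50 units for captions.list plus 200 units for captions.download, `transcripts_per_day(250)` gives 40 and `minutes_until_exhausted(250, 10)` gives 4 minutes.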

Workarounds like creating multiple Google projects violate the Terms of Service. The honest alternative is using the public caption tracks that anyone can read in a browser.

The pipeline

Built an Apify actor that:

  1. Accepts a list of YouTube URLs
  2. Extracts the video ID
  3. Requests the caption track directly from YouTube's timed-text endpoint (public, no auth)
  4. Returns plain text + timestamped segments
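Step 2 above can be sketched with the standard library alone. This is not the actor's code, just a minimal illustration: the function name is hypothetical, the list of URL shapes is not exhaustive, and the 11-character ID pattern is a long-standing convention rather than a documented guarantee.

```python
import re
from typing import Optional
from urllib.parse import parse_qs, urlparse

# Assumption: video IDs are 11 characters from this alphabet.
_ID_RE = re.compile(r"^[A-Za-z0-9_-]{11}$")


def extract_video_id(url: str) -> Optional[str]:
    """Return the video ID from common YouTube URL shapes, or None."""
    parsed = urlparse(url.strip())
    host = (parsed.hostname or "").lower()
    for prefix in ("www.", "m."):
        if host.startswith(prefix):
            host = host[len(prefix):]

    candidate = None
    if host == "youtu.be":
        # Short links: https://youtu.be/<id>
        candidate = parsed.path.lstrip("/").split("/")[0]
    elif host in ("youtube.com", "music.youtube.com", "youtube-nocookie.com"):
        if parsed.path == "/watch":
            # Standard links: https://www.youtube.com/watch?v=<id>
            candidate = parse_qs(parsed.query).get("v", [None])[0]
        else:
            # Path-style links: /embed/<id>, /shorts/<id>, /live/<id>
            parts = parsed.path.strip("/").split("/")
            if len(parts) >= 2 and parts[0] in ("embed", "shorts", "live"):
                candidate = parts[1]

    if candidate and _ID_RE.match(candidate):
        return candidate
    return None
```

Returning None for anything unrecognized lets the pipeline skip bad input rows instead of sending malformed requests. URLs without a scheme (for example `youtube.com/watch?v=...`) are rejected by this sketch because `urlparse` finds no hostname in them.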

No API key needed. No quota to burn. $0.004 per transcript.

The run

Ran it on 10,000 URLs of AI-focused podcasts and conference talks. At concurrency 10, total time was ~20 minutes. Total cost was $40.
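For scale: 10,000 transcripts in roughly 20 minutes is about 8 transcripts per second, which at concurrency 10 implies a bit over one second per request on average. The concurrency setting itself is just a cap on in-flight requests. Below is a minimal sketch of that idea with asyncio; `run_bounded` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the real network call, which is not shown here.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Optional


async def run_bounded(
    video_ids: List[str],
    fetch: Callable[[str], Awaitable[str]],
    concurrency: int = 10,
) -> Dict[str, Optional[str]]:
    """Fetch every ID with at most `concurrency` requests in flight."""
    semaphore = asyncio.Semaphore(concurrency)
    results: Dict[str, Optional[str]] = {}

    async def worker(video_id: str) -> None:
        async with semaphore:
            try:
                results[video_id] = await fetch(video_id)
            except Exception:
                # A failed fetch (e.g. no captions) is recorded as None
                # so one bad video does not abort the whole batch.
                results[video_id] = None

    await asyncio.gather(*(worker(v) for v in video_ids))
    return results


async def fake_fetch(video_id: str) -> str:
    """Hypothetical stand-in for the real request; IDs ending in 'x' fail."""
    await asyncio.sleep(0.01)
    if video_id.endswith("x"):
        raise LookupError("no captions")
    return f"transcript for {video_id}"
```

A call such as `asyncio.run(run_bounded(ids, fake_fetch, concurrency=10))` returns a dict keyed by video ID, with None marking the videos that failed.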

Output was a single JSON file with title, channel, duration, and full transcript text. Ready to feed into a tokenizer for fine-tuning.
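To show what "ready to feed into a tokenizer" can mean in practice, here is a sketch that loads such a file and pulls out the non-empty transcript texts. The field names (`title`, `channel`, `duration_seconds`, `transcript`) are assumptions for illustration; the actor's actual output schema may differ.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class TranscriptRecord:
    # Field names are assumed for this sketch, not taken from the actor.
    title: str
    channel: str
    duration_seconds: int
    transcript: str


def load_records(raw_json: str) -> List[TranscriptRecord]:
    """Parse a JSON array of objects into typed records."""
    return [TranscriptRecord(**item) for item in json.loads(raw_json)]


def training_texts(records: List[TranscriptRecord]) -> List[str]:
    """Return transcript texts, skipping records with no usable text."""
    return [r.transcript.strip() for r in records if r.transcript.strip()]
```

Dropping empty transcripts at this stage keeps videos without captions out of the tokenizer input.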

What the numbers look like per 1,000 videos

| Metric | Value |
|---|---|
| Success rate | ~94% |
| Avg transcript length | 6,200 words |
| Total tokens (GPT tokenizer) | ~8.5M tokens |
| Cost per 1M tokens | ~$0.47 |
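The last row follows from the others. A quick check of the arithmetic, assuming all 1,000 attempted videos are billed at $0.004 and that the average length applies to the successful transcripts only (both are assumptions, not stated above):

```python
VIDEOS = 1_000
PRICE_PER_TRANSCRIPT_USD = 0.004   # from the post
SUCCESS_RATE = 0.94                # ~94%
AVG_WORDS = 6_200                  # per successful transcript (assumed)
TOTAL_TOKENS = 8_500_000           # ~8.5M tokens


def cost_per_million_tokens() -> float:
    """Total spend divided by tokens obtained, in USD per 1M tokens."""
    total_cost = VIDEOS * PRICE_PER_TRANSCRIPT_USD
    return total_cost / (TOTAL_TOKENS / 1_000_000)


def tokens_per_word() -> float:
    """Implied tokenizer ratio across the successful transcripts."""
    total_words = VIDEOS * SUCCESS_RATE * AVG_WORDS
    return TOTAL_TOKENS / total_words
```

This gives about $0.47 per 1M tokens and roughly 1.46 tokens per word.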

That is at least 100x cheaper than paying someone to transcribe audio files and, where creators uploaded human-written captions, probably better quality than running Whisper on low-bitrate source audio.

Caveats

  • About 6% of videos in this run had no captions at all (the creator disabled them, or the videos are old). That is inherent to YouTube, not a pipeline issue.
  • Auto-generated captions are noisier than human captions, but still usable for training.
  • Not suitable if you need word-level timestamps or speaker diarization; you only get segment-level timing.
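The segment-level timing in the last caveat can be made concrete. The sketch below assumes each segment is a dict with hypothetical `start` (seconds), `duration` (seconds), and `text` keys; the actor's real field names may differ.

```python
from typing import Dict, List, Optional, Union

Segment = Dict[str, Union[float, str]]


def segments_to_text(segments: List[Segment]) -> str:
    """Join segment texts into one plain-text transcript."""
    pieces = (str(s["text"]).strip() for s in segments)
    return " ".join(p for p in pieces if p)


def format_timestamp(seconds: float) -> str:
    """Render a second offset as H:MM:SS."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


def segment_at(segments: List[Segment], t: float) -> Optional[Segment]:
    """Return the segment whose time span contains t, or None.

    This is the finest lookup available: there is no way to tell
    which word inside the segment was spoken at time t.
    """
    for s in segments:
        start = float(s["start"])
        if start <= t < start + float(s["duration"]):
            return s
    return None
```

Anything finer than this, such as aligning individual words or attributing speech to speakers, needs the audio itself and a separate alignment or diarization step.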

Try it

Apify actor: apify.com/george.the.developer/youtube-transcript-scraper

If you are building a dataset at scale, cost is the only thing that matters. At $0.004 per transcript this actor costs $4 per 1,000 transcripts and $4,000 per 1M, so any commercial API that charges more per transcript hurts a lot more at that volume.
