Scraping 10k YouTube transcripts for LLM training data

Published April 23, 2026 • 2 min read

Hi, I'm George! 👋🏻
✅ Who I am: Full-stack engineer and data extraction specialist. I build scrapers for platforms everyone says can't be scraped, and I do it reliably, at scale.
🤓 My journey: Started as a full-stack engineer, quickly realized the hardest (and most valuable) technical problems are in data extraction. Anti-bot systems, fingerprinting, session management, rate limiting: that's where I live now.
🚀 My work: Creator of the Unblockable ICO Drops Scraper, built to defeat one of the toughest anti-bot systems on the web. Also built scrapers for CoinMarketCap, Shopify stores, TikTok Shop, and LinkedIn. If it has data behind a wall, I've probably cracked it.
⚒️ My skills: Node.js, Crawlee, Playwright, Puppeteer, Advanced Anti-Bot Bypassing, Proxy Management, API Reverse-Engineering, Docker, NestJS, Python.
🚢 Shipped: PenBot (penbotservices.com), an AI platform delivering intelligence through prison email systems. Inmate Creations (inmatecreations.com), currently in production.
🫱🏻‍🫲🏽 Work with me: Open to data extraction challenges, automation pipelines, and infrastructure projects. Not open to ticket-taking.
📩 Contact: georgethedeveloper3046@gmail.com
🔗 Portfolio: george-kioko.pages.dev
🐙 GitHub: github.com/the-ai-entrepreneur-ai-hub

I was building a training set from AI podcast transcripts and hit the YouTube Data API's 10,000-unit daily quota almost immediately. Here is how I worked around it.

The problem

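YouTube Data API v3 gives each project a default quota of 10,000 units per day. Listing a video's caption tracks (captions.list) costs 50 units and downloading one (captions.download) costs 200, so a transcript pull costs about 250 units, which caps you at roughly 40 transcripts per day. At 10 transcripts per minute you burn the whole quota in about four minutes.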
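
A quick sketch of the quota arithmetic. The function name is made up for illustration, and the unit cost per transcript is a parameter: the 250-unit figure assumes one captions.list call (50 units) plus one captions.download call (200 units), which is how I read Google's published quota cost table.

```python
def minutes_until_quota_exhausted(daily_quota: int,
                                  units_per_transcript: int,
                                  transcripts_per_minute: int) -> float:
    """Return how many minutes of transcript pulls a daily quota allows."""
    # Whole transcripts only: a partial pull is useless.
    transcripts_per_day = daily_quota // units_per_transcript
    return transcripts_per_day / transcripts_per_minute


# Assumed cost: captions.list (50) + captions.download (200) = 250 units.
print(minutes_until_quota_exhausted(10_000, 250, 10))  # 4.0
```

Plug in your own unit cost if your method differs; the point is that the default quota runs out long before a 10,000-video job finishes.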

Workarounds like creating multiple Google projects violate the Terms of Service. The honest alternative is using the public caption tracks that anyone can read in a browser.

The pipeline

Built an Apify actor that:

1. Accepts a list of YouTube URLs
2. Extracts the video ID from each URL
3. Requests the caption track directly from YouTube's timed-text endpoint (public, no auth)
4. Returns plain text + timestamped segments
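
The ID extraction and the text/segment output steps can be sketched in Python as below. This is not the actor's code: the function names are hypothetical, the URL shapes handled are only the common ones, and the XML layout (`<text start=".." dur="..">` elements with doubly escaped entities) is an assumption based on the legacy timed-text format. The network request itself is left out because the post does not give the endpoint parameters.

```python
from __future__ import annotations

import html
import xml.etree.ElementTree as ET
from urllib.parse import parse_qs, urlparse


def extract_video_id(url: str) -> str | None:
    """Pull the video ID out of common YouTube URL shapes, else None."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    for prefix in ("www.", "m."):
        if host.startswith(prefix):
            host = host[len(prefix):]
    if host == "youtu.be":
        return parsed.path.strip("/").split("/")[0] or None
    if host in {"youtube.com", "music.youtube.com"}:
        if parsed.path == "/watch":
            return parse_qs(parsed.query).get("v", [None])[0]
        parts = parsed.path.strip("/").split("/")
        if len(parts) >= 2 and parts[0] in {"embed", "shorts", "live"}:
            return parts[1]
    return None


def parse_caption_xml(xml_text: str) -> list[dict]:
    """Turn caption XML (assumed layout) into timestamped segments."""
    segments = []
    for node in ET.fromstring(xml_text).iter("text"):
        # Assumption: entities are escaped twice, so unescape once more.
        text = html.unescape(node.text or "").replace("\n", " ").strip()
        if text:
            segments.append({
                "start": float(node.get("start", 0)),
                "duration": float(node.get("dur", 0)),
                "text": text,
            })
    return segments


# Illustrative sample, not real caption data.
SAMPLE = (
    '<transcript>'
    '<text start="0.0" dur="1.5">it&amp;#39;s a test</text>'
    '<text start="1.5" dur="2.0">of captions</text>'
    '</transcript>'
)
segments = parse_caption_xml(SAMPLE)
plain_text = " ".join(s["text"] for s in segments)
print(extract_video_id("https://youtu.be/dQw4w9WgXcQ?t=10"))  # dQw4w9WgXcQ
print(plain_text)  # it's a test of captions
```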

No API key needed. No quota to burn. $0.004 per transcript.

The run

Ran it on 10,000 URLs of AI-focused podcasts and conference talks. At concurrency 10, total time was ~20 minutes. Total cost was $40.
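
For anyone reproducing the concurrency cap outside Apify, one common way to hold 10 requests in flight is an asyncio semaphore. This is a minimal sketch, not the actor's implementation: `fetch_transcript` is a hypothetical stand-in that sleeps instead of touching the network.

```python
import asyncio


async def fetch_transcript(video_id: str) -> str:
    """Hypothetical stand-in for the real caption request."""
    await asyncio.sleep(0.01)  # simulate network latency
    return f"transcript for {video_id}"


async def fetch_all(video_ids, concurrency=10):
    """Fetch every ID with at most `concurrency` requests in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded(video_id):
        async with semaphore:
            try:
                return video_id, await fetch_transcript(video_id)
            except Exception:
                # Record the failure (e.g. no captions) and keep going.
                return video_id, None

    # gather() returns results in the same order as the input IDs.
    return await asyncio.gather(*(bounded(v) for v in video_ids))


results = asyncio.run(fetch_all([f"vid{i}" for i in range(25)]))
print(len(results))  # 25
```

As a sanity check on the run above: 10,000 URLs in about 20 minutes is roughly 8 per second, so with 10 in flight each request averaged around 1.2 seconds.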

Output was a single JSON file with title, channel, duration, and full transcript text. Ready to feed into a tokenizer for fine-tuning.
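
A sketch of what one record in that file might look like, and how to get a rough size estimate before tokenizing. The key names and the duration unit are assumptions (the post lists the fields but not their exact names), and the whitespace word count is only a cheap proxy for real token counts.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class TranscriptRecord:
    # Field names are illustrative; the actor's real keys may differ.
    title: str
    channel: str
    duration_seconds: int
    transcript: str


records = [
    TranscriptRecord("Example talk", "Example channel", 1800, "hello world"),
]

# Serialize to one JSON array, then read it back as a consumer would.
payload = json.dumps([asdict(r) for r in records], ensure_ascii=False, indent=2)
loaded = json.loads(payload)

# Rough size check: whitespace-separated words across all transcripts.
total_words = sum(len(r["transcript"].split()) for r in loaded)
print(total_words)  # 2
```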

What the numbers look like per 1,000 transcripts

Metric	Value
Success rate	~94%
Avg transcript length	6,200 words
Total tokens (GPT tokenizer)	~8.5M tokens
Cost per 1M tokens	~$0.47
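
The cost row follows from the other figures. A short check, assuming billing is per attempted URL (1,000 of them) at the $0.004 rate quoted earlier, and that the average length is taken over the ~940 successful transcripts:

```python
PRICE_PER_TRANSCRIPT = 0.004   # USD, from the post
BATCH = 1_000                  # URLs attempted
SUCCESS_RATE = 0.94
AVG_WORDS = 6_200
TOTAL_TOKENS = 8.5e6

batch_cost = PRICE_PER_TRANSCRIPT * BATCH
cost_per_million_tokens = batch_cost / (TOTAL_TOKENS / 1e6)
# Implied tokens per word, a plausibility check on the token count.
tokens_per_word = TOTAL_TOKENS / (BATCH * SUCCESS_RATE * AVG_WORDS)

print(round(batch_cost, 2))               # 4.0
print(round(cost_per_million_tokens, 2))  # 0.47
print(round(tokens_per_word, 2))          # 1.46
```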

That is roughly 100x cheaper than paying someone to transcribe the audio, and probably better quality than Whisper on low-bitrate source audio.

Caveats

About 6% of the videos in this run had no captions at all (the creator disabled them, or the videos are old). That is inherent to YouTube, not a pipeline issue.
Auto-generated captions are noisier than human captions, but still usable for training.
Not suitable if you need word-level timestamps or speaker diarization; you only get segment-level timing.

Try it

Apify actor: apify.com/george.the.developer/youtube-transcript-scraper

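If you are building a dataset at scale, cost per transcript is what dominates. At $0.004 each, 1,000 transcripts cost $4, the 10,000-video run above cost $40, and 1M transcripts would cost about $4,000. Paying commercial-API prices on top of that hurts when a simple actor returns the same data.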

#apify #dataengineering #machine-learning #python
