Scraping 10k YouTube transcripts for LLM training data
I was building a training set from AI podcast transcripts and hit the YouTube Data API's 10,000-unit daily quota almost immediately. Here is how I worked around it.
The problem
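YouTube Data API v3 has a default quota of 10,000 units per day per project. Transcript pulls are expensive: in the official quota cost table, captions.list costs 50 units and captions.download costs 200, and the download call also requires OAuth authorization from someone who can edit the video. At roughly 250 units per transcript that is about 40 transcripts a day, so at 10 transcripts per minute the whole quota is gone in about four minutes.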
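The quota arithmetic is easy to check for whichever unit cost applies to your method. A minimal sketch: the function names and the sample unit costs in the loop are illustrative inputs, not values read from the API.

```python
def transcripts_per_day(units_per_transcript: int, daily_quota: int = 10_000) -> int:
    """How many transcript pulls fit into one day's quota."""
    return daily_quota // units_per_transcript


def minutes_until_exhausted(
    units_per_transcript: int, per_minute: int = 10, daily_quota: int = 10_000
) -> float:
    """Minutes of pulling at `per_minute` transcripts before the quota runs out."""
    return daily_quota / (units_per_transcript * per_minute)


# Illustrative unit costs; substitute the cost of the calls you actually make.
for units in (3, 50, 250):
    print(units, transcripts_per_day(units), round(minutes_until_exhausted(units), 1))
```

The higher the per-transcript unit cost, the faster a 10-per-minute crawl runs into the daily ceiling.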
Workarounds like creating multiple Google projects violate the Terms of Service. The honest alternative is using the public caption tracks that anyone can read in a browser.
The pipeline
Built an Apify actor that:
- Accepts a list of YouTube URLs
- Extracts the video ID
- Requests the caption track directly from YouTube's timed-text endpoint (public, no auth)
- Returns plain text + timestamped segments
No API key needed. No quota to burn. $0.004 per transcript.
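The ID-extraction step can be sketched with the standard library alone. This is not the actor's code: `extract_video_id` is a hypothetical helper, the list of URL shapes it handles is my assumption, and the 11-character ID pattern is a long-standing convention rather than a documented guarantee. The caption request itself is left out, since the post does not spell out the endpoint's parameters.

```python
import re
from typing import Optional
from urllib.parse import parse_qs, urlparse

# Assumption: video IDs are 11 characters from this alphabet.
_ID_PATTERN = re.compile(r"[A-Za-z0-9_-]{11}")
_YOUTUBE_HOSTS = {"youtube.com", "m.youtube.com", "music.youtube.com"}
_PATH_PREFIXES = {"shorts", "embed", "live"}


def extract_video_id(url: str) -> Optional[str]:
    """Return the video ID from a YouTube URL, or None if none is found.

    URLs without a scheme (e.g. "youtube.com/watch?v=...") are not handled.
    """
    parsed = urlparse(url.strip())
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]

    candidate = ""
    if host == "youtu.be":
        # Short links carry the ID as the first path segment.
        candidate = parsed.path.strip("/").split("/")[0]
    elif host in _YOUTUBE_HOSTS:
        if parsed.path == "/watch":
            candidate = parse_qs(parsed.query).get("v", [""])[0]
        else:
            parts = parsed.path.strip("/").split("/")
            if len(parts) >= 2 and parts[0] in _PATH_PREFIXES:
                candidate = parts[1]

    return candidate if _ID_PATTERN.fullmatch(candidate) else None
```

Returning None instead of raising lets a batch job log the bad URL and move on.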
The run
Ran it on 10,000 URLs of AI-focused podcasts and conference talks. At concurrency 10, total time was ~20 minutes. Total cost was $40.
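For scale, 10,000 URLs in ~20 minutes is roughly 8 transcripts per second overall. Outside Apify, a fixed concurrency limit like this could be sketched with an asyncio semaphore. Everything named here is hypothetical: `run_with_limit` is an illustrative helper and `fake_fetch` is a stand-in for the real caption request, not the actor's implementation.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional


async def run_with_limit(
    video_ids: List[str],
    fetch: Callable[[str], Awaitable[str]],
    concurrency: int = 10,
) -> List[Optional[str]]:
    """Fetch all IDs with at most `concurrency` requests in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def worker(video_id: str) -> Optional[str]:
        async with semaphore:
            try:
                return await fetch(video_id)
            except Exception:
                # A failed video (e.g. no captions) becomes None; the batch continues.
                return None

    # gather() returns results in the same order as the input IDs.
    return await asyncio.gather(*(worker(v) for v in video_ids))


async def fake_fetch(video_id: str) -> str:
    """Stand-in fetcher: IDs ending in 'x' simulate a video without captions."""
    await asyncio.sleep(0.01)
    if video_id.endswith("x"):
        raise RuntimeError("no captions")
    return f"transcript for {video_id}"


results = asyncio.run(run_with_limit(["a1", "b2", "c3x"], fake_fetch))
print(results)
```

Keeping failures as None in the result list makes the success rate a simple count afterwards.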
Output was a single JSON file with title, channel, duration, and full transcript text. Ready to feed into a tokenizer for fine-tuning.
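As a rough picture of one output record, here is a sketch that joins timestamped segments into plain text and serializes the result. The key names (`title`, `channel`, `duration`, `transcript`, `segments`) and the segment shape are assumptions for illustration; the actor's actual schema may differ.

```python
import json
from typing import Dict, List


def build_record(
    title: str, channel: str, duration_seconds: int, segments: List[Dict]
) -> Dict:
    """Combine metadata and segments into one JSON-serializable record."""
    # Join non-empty segment texts into a single plain-text transcript.
    text = " ".join(
        seg["text"].strip() for seg in segments if seg["text"].strip()
    )
    return {
        "title": title,
        "channel": channel,
        "duration": duration_seconds,
        "transcript": text,
        "segments": segments,
    }


# Made-up sample data, not from the real run.
segments = [
    {"start": 0.0, "duration": 2.5, "text": "Welcome back to the show."},
    {"start": 2.5, "duration": 3.0, "text": "Today we talk about fine-tuning."},
]
record = build_record("Example episode", "Example channel", 3600, segments)
print(json.dumps([record], indent=2))
```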
What the numbers look like per 1,000 videos
| Metric | Value |
| --- | --- |
| Success rate | ~94% |
| Avg transcript length | 6,200 words |
| Total tokens (GPT tokenizer) | ~8.5M tokens |
| Cost per 1M tokens | ~$0.47 |
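The last row follows from the others. A quick check, assuming every one of the 1,000 attempted videos is billed at $0.004 (consistent with $40 for 10,000 URLs) and that the average length applies to the successful ones only; the function and variable names are illustrative.

```python
def cost_per_million_tokens(
    videos: int, price_per_transcript: float, total_tokens: int
) -> float:
    """Total spend divided by token count, scaled to one million tokens."""
    total_cost = videos * price_per_transcript
    return total_cost / (total_tokens / 1_000_000)


cost = cost_per_million_tokens(1_000, 0.004, 8_500_000)
print(f"${cost:.2f} per 1M tokens")

# Implied tokens per word: ~94% of 1,000 videos at ~6,200 words each.
total_words = 0.94 * 1_000 * 6_200
tokens_per_word = 8_500_000 / total_words
print(f"{tokens_per_word:.2f} tokens per word")
```

The implied ratio of about 1.5 tokens per word is in a plausible range for spoken English with filler words and caption noise.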
That is roughly 100x cheaper than paying someone to transcribe the audio, and the text is probably better quality than what Whisper produces on low-bitrate source audio.
Caveats
- About 6% of videos in this run had no captions at all (the creator disabled them, or the videos are old). That is inherent to YouTube, not a pipeline issue.
- Auto-generated captions are noisier than human captions, but still usable for training.
- Not suitable if you need word-level timestamps or speaker diarization; you only get segment-level timing.
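For the first two caveats, a light filtering and cleaning pass before tokenizing might look like the sketch below. It is not part of the actor: the `transcript` key, the bracketed-tag pattern (auto-captions often include markers such as [Music]), and the `min_words` threshold are all assumptions to tune for your own data.

```python
import re
from typing import Dict, List

# Bracketed non-speech markers such as [Music] or [Applause].
_NON_SPEECH = re.compile(r"\[[^\[\]]{1,40}\]")
_WHITESPACE = re.compile(r"\s+")


def clean_caption_text(text: str) -> str:
    """Remove bracketed non-speech tags and collapse runs of whitespace."""
    text = _NON_SPEECH.sub(" ", text)
    return _WHITESPACE.sub(" ", text).strip()


def keep_usable(records: List[Dict], min_words: int = 50) -> List[Dict]:
    """Drop records with missing or very short transcripts; clean the rest."""
    kept = []
    for record in records:
        cleaned = clean_caption_text(record.get("transcript") or "")
        if len(cleaned.split()) >= min_words:
            kept.append({**record, "transcript": cleaned})
    return kept
```

This does nothing for misrecognized words; it only strips the most obvious non-speech noise and empty results.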
Try it
Apify actor: apify.com/george.the.developer/youtube-transcript-scraper
If you are building a dataset at scale, cost per transcript is the number that matters most. At $0.004 each, 1,000 transcripts cost $4 and 1M cost about $4,000, so any difference in per-transcript price gets multiplied by the size of the dataset.


