How we dropped LinkedIn scraper false positives from 62% to zero

Published • 3 min read
Hi, I'm George!

Who I am: Full-stack engineer and data extraction specialist. I build scrapers for platforms everyone says can't be scraped, and I do it reliably, at scale.

My journey: Started as a full-stack engineer, quickly realized the hardest (and most valuable) technical problems are in data extraction. Anti-bot systems, fingerprinting, session management, rate limiting: that's where I live now.

My work: Creator of the Unblockable ICO Drops Scraper, built to defeat one of the toughest anti-bot systems on the web. Also built scrapers for CoinMarketCap, Shopify stores, TikTok Shop, and LinkedIn. If it has data behind a wall, I've probably cracked it.

My skills: Node.js, Crawlee, Playwright, Puppeteer, Advanced Anti-Bot Bypassing, Proxy Management, API Reverse-Engineering, Docker, NestJS, Python.

Shipped: PenBot (penbotservices.com), an AI platform delivering intelligence through prison email systems. Inmate Creations (inmatecreations.com), currently in production.

Work with me: Open to data extraction challenges, automation pipelines, and infrastructure projects. Not open to ticket-taking.

Contact: georgethedeveloper3046@gmail.com
Portfolio: george-kioko.pages.dev
GitHub: github.com/the-ai-entrepreneur-ai-hub

I run a LinkedIn employees scraper on Apify that had a 62% false positive rate. Here is how we fixed it.

The false positive problem

The first version was a Google dork. Query site:linkedin.com/in/ "TargetCompany" and parse the SERP. Fast, cheap, and broken.
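To make Phase A concrete, here is a minimal sketch of building that query and pulling profile slugs out of SERP result URLs. It is written in Python for brevity (the actual actor is Node.js), and the function names build_dork, extract_slug and candidate_slugs are hypothetical. It assumes the SERP has already been parsed into a list of result URLs.

```python
from typing import Iterable, List, Optional
from urllib.parse import urlparse


def build_dork(company: str) -> str:
    """Build the query from the post: site:linkedin.com/in/ "Company"."""
    cleaned = company.replace('"', "").strip()  # avoid breaking the quoted phrase
    return f'site:linkedin.com/in/ "{cleaned}"'


def extract_slug(result_url: str) -> Optional[str]:
    """Return the slug of a linkedin.com/in/<slug> URL, or None for anything else."""
    parsed = urlparse(result_url)
    host = (parsed.hostname or "").lower()
    if host != "linkedin.com" and not host.endswith(".linkedin.com"):
        return None
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) >= 2 and parts[0] == "in":
        return parts[1].lower()  # assumption: slugs are compared case-insensitively
    return None


def candidate_slugs(result_urls: Iterable[str]) -> List[str]:
    """De-duplicate slugs while keeping SERP order."""
    seen = {}
    for url in result_urls:
        slug = extract_slug(url)
        if slug:
            seen.setdefault(slug, None)
    return list(seen)
```

Nothing here checks employment. It only produces candidates, which is exactly why this stage alone returned so many wrong profiles.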

Google indexes profiles that mention a company anywhere in the page. That includes:

  • People who used to work there three years ago
  • People quoted in the company's press releases
  • People whose recent post mentioned the company as a competitor

A paying user ran the actor on 414 companies and got 3,128 profiles back. Manual spot check: 62% were not current employees. Unusable for outreach, because "I saw your company is hiring" to someone who left two years ago kills sender credibility.

The fix: a second verification pass

LinkedIn's public profile HTML includes a JSON-LD Person schema with a worksFor array listing the current employers.

{
  "@type": "Person",
  "name": "Patrick Collison",
  "worksFor": [
    {
      "@type": "Organization",
      "name": "Stripe",
      "url": "https://www.linkedin.com/company/stripe/"
    }
  ]
}

The verification step is: for every SERP candidate, fetch their public profile, parse the JSON-LD, and only emit profiles whose worksFor[].url slug matches the target company. If the match fails, the profile is dropped.
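The check itself can be sketched in a few lines. This is an illustration in Python using only the standard library, not the actor's actual Node.js code. It assumes the JSON-LD sits in a script tag of type application/ld+json and may be wrapped in an @graph array; works_for_slugs and is_current_employee are hypothetical names.

```python
import json
from html.parser import HTMLParser
from urllib.parse import urlparse


class _JsonLdCollector(HTMLParser):
    """Collects the raw text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buf = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld, self._buf = True, []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buf))
            self._in_jsonld = False


def _iter_nodes(doc):
    """Yield every JSON-LD object, descending into lists and @graph wrappers."""
    if isinstance(doc, list):
        for item in doc:
            yield from _iter_nodes(item)
    elif isinstance(doc, dict):
        yield doc
        yield from _iter_nodes(doc.get("@graph", []))


def _company_slug(url):
    parts = [p for p in urlparse(url or "").path.split("/") if p]
    return parts[1].lower() if len(parts) >= 2 and parts[0] == "company" else None


def works_for_slugs(html: str) -> set:
    """Return the company slugs listed under worksFor on any Person node."""
    collector = _JsonLdCollector()
    collector.feed(html)
    slugs = set()
    for raw in collector.blocks:
        try:
            doc = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed block: skip it rather than guess
        for node in _iter_nodes(doc):
            if node.get("@type") != "Person":
                continue
            employers = node.get("worksFor") or []
            if isinstance(employers, dict):
                employers = [employers]
            for org in employers:
                slug = _company_slug(org.get("url")) if isinstance(org, dict) else None
                if slug:
                    slugs.add(slug)
    return slugs


def is_current_employee(html: str, target_slug: str) -> bool:
    return target_slug.lower() in works_for_slugs(html)
```

Matching on the URL slug rather than the name field avoids false matches between companies with similar names.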

Why not use a real browser?

We tried. Puppeteer with the stealth plugin takes ~3 seconds per page on a Contabo VPS, leaves fingerprint traces that Cloudflare and LinkedIn detect, and needs a Chromium install that adds ~450 MB to the Apify actor's Docker image.

Instead we run a small Go service on the VPS that uses github.com/bogdanfinn/tls-client, a Go HTTP client that mimics Chrome's TLS handshake, including the JA4 fingerprint and Chrome's handling of extension order, as well as its HTTP/2 settings frame. At the TLS and HTTP/2 level, the connection is meant to look to LinkedIn like one from real Chrome 124.

Architecture

Apify actor (Node.js)
  |
  +-- Phase A: Google SERP  ->  Apify GOOGLE_SERP proxy  ->  candidate slugs
  |
  +-- Phase B: verify each candidate
        |
        +-- POST /tls/fetch on VPS (Go service)
              |
              +-- Apify RESIDENTIAL proxy  ->  LinkedIn /in/{slug}
                    |
                    +-- parse JSON-LD  ->  worksFor match?
                          |
                          +-- yes -> emit (confidence: high)
                          +-- no  -> drop (counted as rejected)
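
The Phase B branch of that diagram reduces to a small loop. The sketch below is in Python for illustration only. fetch_profile stands in for the POST /tls/fetch call and is_match for the worksFor check; both are passed in as hypothetical callables. Treating a failed fetch as a rejection is an assumption of this sketch, not something the diagram specifies.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def run_phase_b(
    candidates: Iterable[str],
    fetch_profile: Callable[[str], Optional[str]],
    is_match: Callable[[str], bool],
) -> Tuple[List[dict], int]:
    """Verify each candidate slug; return (emitted records, rejected count)."""
    emitted: List[dict] = []
    rejected = 0
    for slug in candidates:
        html = fetch_profile(slug)  # placeholder for the call to the Go service
        if html is not None and is_match(html):
            emitted.append({"slug": slug, "confidence": "high"})
        else:
            rejected += 1  # dropped, but still counted
    return emitted, rejected
```

Keeping the rejected count makes the false positive rate of Phase A visible on every run.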

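The Go service exposes /tls/fetch with a session pool (one persistent cookie jar per session_id) and maintains a burn ledger: after any 999 response, the non-standard status code LinkedIn returns when it blocks a request, that session is cooled down for 30 minutes before it is used again.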
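
The burn ledger idea can be sketched independently of the Go service. The Python below is an illustration, not the service's code: BurnLedger and its methods are hypothetical names, cookie jars are left out, and the clock is injectable so the cooldown can be tested without waiting.

```python
import time
from typing import Callable, Iterable, Optional

COOLDOWN_SECONDS = 30 * 60  # 30 minutes, as described in the post
BLOCK_STATUS = 999          # LinkedIn's block response


class BurnLedger:
    """Tracks which sessions are cooling down after a block."""

    def __init__(self, session_ids: Iterable[str],
                 clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        # A session is usable once the clock reaches its "burned until" time.
        self._burned_until = {sid: float("-inf") for sid in session_ids}

    def record(self, session_id: str, status_code: int) -> None:
        """Call after every response; only a 999 starts a cooldown."""
        if status_code == BLOCK_STATUS:
            self._burned_until[session_id] = self._clock() + COOLDOWN_SECONDS

    def is_available(self, session_id: str) -> bool:
        return self._clock() >= self._burned_until[session_id]

    def pick(self) -> Optional[str]:
        """Return the first session that is not cooling down, or None."""
        for sid in self._burned_until:
            if self.is_available(sid):
                return sid
        return None
```

When pick returns None, every session is cooling down, and the caller has to wait or fail the request rather than push more traffic through a flagged session.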

Detection is holistic, not just TLS

A common mistake (we made it early on) is assuming TLS fingerprint parity is enough. It is not. Modern detection combines:

  • TLS fingerprint (JA3 / JA4): baseline
  • HTTP/2 settings frame + header order
  • User-Agent consistency with the browser version the TLS fingerprint implies
  • IP reputation (datacenter vs residential)
  • Request cadence (human-like spacing)
  • Cookie and session continuity

Chrome 124 JA4 parity only covers the first item on that list. We also send the full Sec-Ch-Ua, Sec-Fetch-* and Upgrade-Insecure-Requests header set, pin the User-Agent to the same Chrome version as the TLS profile, wait a random 800 to 2,500 ms between requests on the same session, and route through Apify residential proxies to cover IP reputation.
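
As a rough illustration of the header and cadence layers, the Python sketch below builds a header set pinned to one Chrome major version, checks that the User-Agent and Sec-Ch-Ua agree, and draws a delay in the 800 to 2,500 ms range. The header values are illustrative for a Windows desktop navigation request and should be copied from a real Chrome capture before use. The function names are hypothetical, and header order on the wire depends on the HTTP client, which this sketch does not control.

```python
import random
import re

CHROME_MAJOR = 124  # must match the TLS profile in use


def chrome_headers(major: int = CHROME_MAJOR) -> dict:
    """Headers for a top-level navigation, all pinned to one Chrome version."""
    return {
        # The brand list varies by Chrome version; this value is illustrative.
        "sec-ch-ua": f'"Chromium";v="{major}", "Google Chrome";v="{major}", "Not-A.Brand";v="99"',
        "sec-ch-ua-mobile": "?0",
        "sec-ch-ua-platform": '"Windows"',
        "upgrade-insecure-requests": "1",
        "user-agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            f"(KHTML, like Gecko) Chrome/{major}.0.0.0 Safari/537.36"
        ),
        "sec-fetch-site": "none",
        "sec-fetch-mode": "navigate",
        "sec-fetch-user": "?1",
        "sec-fetch-dest": "document",
    }


def versions_consistent(headers: dict, major: int) -> bool:
    """True only if User-Agent and Sec-Ch-Ua both claim the expected major version."""
    ua = re.search(r"Chrome/(\d+)\.", headers.get("user-agent", ""))
    brand = re.search(r'"Google Chrome";v="(\d+)"', headers.get("sec-ch-ua", ""))
    if not (ua and brand):
        return False
    return int(ua.group(1)) == int(brand.group(1)) == major


def jitter_seconds(rng=random) -> float:
    """Random pause between requests on one session: 0.8 to 2.5 seconds."""
    return rng.uniform(0.8, 2.5)
```

Running versions_consistent before each request is a cheap guard against upgrading the TLS profile and forgetting the headers, or the reverse.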

The counterintuitive insight: do NOT rotate TLS profiles to evade detection. A real Chrome install does not switch between different browser fingerprints from one request to the next. Rotating profiles makes you more visible, not less. Stick with one current profile and reproduce it perfectly.

Results

Smoke test on Stripe this morning: 5 requested, 5 verified current Stripe employees (Patrick Collison, John Collison, Juliet Simpson, JR Farr, Karl Durrance), 0 false positives. Duration about 3 minutes end to end.

Over 299 paying users hit this scraper last month. Failure rate is low single digits, all from LinkedIn 999 blocks that the burn ledger handles correctly.

Try it

Apify Store: https://apify.com/george.the.developer/linkedin-company-employees-scraper

Priced per verified profile. Input: a list of company URLs or names. Output: verified current employees only, no false positives.
