LinkedIn Scraper: Dropping False Positives from 62% to Zero

I run a LinkedIn employees scraper on Apify that had a 62% false positive problem. Here is how we fixed it.

The false positive problem

The first version was a Google dork. Query site:linkedin.com/in/ "TargetCompany" and parse the SERP. Fast, cheap, and broken.

Google indexes profiles that mention a company anywhere in the page. That includes:

People who used to work there three years ago
People quoted in the company's press releases
People whose recent post mentioned the company as a competitor

A paying user ran the actor on 414 companies and got 3,128 profiles back. Manual spot check: 62% were not current employees. Unusable for outreach, because "I saw your company is hiring" to someone who left two years ago kills sender credibility.
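For reference, the first-pass candidate step described above can be sketched roughly as follows. This is a minimal illustration in Python, not the actor's actual code (the actor itself is Node.js); the function names `build_dork_query` and `slug_from_profile_url` are hypothetical, and the SERP fetch itself is left out.

```python
from typing import Optional
from urllib.parse import urlparse


def build_dork_query(company: str) -> str:
    """Build the Google query from the post: site:linkedin.com/in/ "Company"."""
    return f'site:linkedin.com/in/ "{company}"'


def slug_from_profile_url(url: str) -> Optional[str]:
    """Return the /in/{slug} part of a LinkedIn profile URL, or None."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    # Accept linkedin.com and its subdomains (www., uk., ...), nothing else.
    if host != "linkedin.com" and not host.endswith(".linkedin.com"):
        return None
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) >= 2 and parts[0] == "in":
        return parts[1].lower()
    return None
```

Every slug this produces is only a candidate: the query matches any indexed profile page that mentions the company name, which is exactly where the false positives come from.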

The fix: a second verification pass

LinkedIn's public profile HTML embeds a JSON-LD Person schema (in a script tag of type application/ld+json) whose worksFor array lists the person's current employers:

{
  "@type": "Person",
  "name": "Patrick Collison",
  "worksFor": [
    {
      "@type": "Organization",
      "name": "Stripe",
      "url": "https://www.linkedin.com/company/stripe/"
    }
  ]
}

The verification step is: for every SERP candidate, fetch their public profile, parse the JSON-LD, and only emit profiles whose worksFor[].url slug matches the target company. If the match fails, the profile is dropped.
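A minimal sketch of that check, in Python for brevity. It assumes the JSON-LD sits in a `script` tag of type `application/ld+json` and that the Person node may be top-level, inside a list, or under `@graph`; those layout details and all function names here are assumptions for illustration, not the actor's real code.

```python
import json
import re
from typing import Iterator, Optional
from urllib.parse import urlparse

LD_JSON_RE = re.compile(
    r'<script[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
    re.IGNORECASE | re.DOTALL,
)


def iter_persons(html: str) -> Iterator[dict]:
    """Yield every JSON-LD node whose @type is Person."""
    for raw in LD_JSON_RE.findall(html):
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of failing the profile
        stack = [data]
        while stack:
            node = stack.pop()
            if isinstance(node, list):
                stack.extend(node)
            elif isinstance(node, dict):
                if node.get("@type") == "Person":
                    yield node
                stack.extend(node.get("@graph", []))


def company_slug(url: str) -> Optional[str]:
    """Extract {slug} from a .../company/{slug}/ URL."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) >= 2 and parts[0] == "company":
        return parts[1].lower()
    return None


def is_current_employee(html: str, target_slug: str) -> bool:
    """True only if some worksFor[].url slug equals the target company slug."""
    target = target_slug.lower()
    for person in iter_persons(html):
        employers = person.get("worksFor") or []
        if isinstance(employers, dict):  # tolerate a single object
            employers = [employers]
        for org in employers:
            if isinstance(org, dict) and company_slug(org.get("url") or "") == target:
                return True
    return False
```

Matching on the URL slug rather than the display name avoids false matches between companies that share a name.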

Why not use a real browser?

We tried. Puppeteer with stealth takes ~3 seconds per page on a Contabo VPS, leaves fingerprint traces that Cloudflare and LinkedIn detect, and the Chromium image adds ~450 MB to the Apify actor's Docker image.

Instead we run a small Go service on the VPS that uses github.com/bogdanfinn/tls-client, a Go HTTP client that reproduces Chrome's TLS handshake (including the JA4 fingerprint and the extension order) as well as its HTTP/2 settings frame. At the TLS and HTTP/2 layers, the connection looks to LinkedIn like real Chrome 124.

Architecture

Apify actor (Node.js)
  |
  +-- Phase A: Google SERP  ->  Apify GOOGLE_SERP proxy  ->  candidate slugs
  |
  +-- Phase B: verify each candidate
        |
        +-- POST /tls/fetch on VPS (Go service)
              |
              +-- Apify RESIDENTIAL proxy  ->  LinkedIn /in/{slug}
                    |
                    +-- parse JSON-LD  ->  worksFor match?
                          |
                          +-- yes -> emit (confidence: high)
                          +-- no  -> drop (counted as rejected)
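The Phase B loop in the diagram can be sketched like this. It is an illustration in Python under stated assumptions: `fetch_profile` stands in for the call to the Go service and returns `None` when the fetch fails (for example on a block), `is_match` stands in for the worksFor check, and the `RunStats` shape is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Optional


@dataclass
class RunStats:
    emitted: List[dict] = field(default_factory=list)
    rejected: int = 0  # fetched fine, but worksFor did not match
    failed: int = 0    # profile could not be fetched at all


def verify_candidates(
    slugs: Iterable[str],
    fetch_profile: Callable[[str], Optional[str]],
    is_match: Callable[[str], bool],
) -> RunStats:
    """Run Phase B over candidate slugs and bucket each outcome."""
    stats = RunStats()
    seen = set()
    for slug in slugs:
        if slug in seen:  # the SERP can return the same profile twice
            continue
        seen.add(slug)
        html = fetch_profile(slug)
        if html is None:
            stats.failed += 1
        elif is_match(html):
            stats.emitted.append({"slug": slug, "confidence": "high"})
        else:
            stats.rejected += 1
    return stats
```

Keeping failed fetches separate from rejected profiles matters: a blocked request says nothing about where the person works.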

The Go service exposes /tls/fetch with a session pool (one persistent cookie jar per session_id) and maintains a burn ledger that cools a session down for 30 minutes after any HTTP 999 response, the status LinkedIn returns when it blocks a request.
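The burn ledger idea can be sketched as below. The real service is written in Go; this Python version only illustrates the cooldown logic, and the class and method names are hypothetical. The 30-minute window and the 999 status come from the description above.

```python
import time
from typing import Callable, Dict, Iterable, Optional

COOLDOWN_SECONDS = 30 * 60  # 30 minutes, as described above
BLOCK_STATUS = 999          # LinkedIn's block response


class BurnLedger:
    """Tracks which sessions are cooling down after a block."""

    def __init__(
        self,
        cooldown: float = COOLDOWN_SECONDS,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._cooldown = cooldown
        self._clock = clock  # injectable so the logic can be tested
        self._burned_until: Dict[str, float] = {}

    def record(self, session_id: str, status: int) -> None:
        """Burn the session if the response was a 999 block."""
        if status == BLOCK_STATUS:
            self._burned_until[session_id] = self._clock() + self._cooldown

    def is_available(self, session_id: str) -> bool:
        until = self._burned_until.get(session_id)
        if until is None:
            return True
        if self._clock() >= until:
            del self._burned_until[session_id]  # cooldown expired
            return True
        return False

    def pick(self, session_ids: Iterable[str]) -> Optional[str]:
        """Return the first session that is not cooling down, if any."""
        for sid in session_ids:
            if self.is_available(sid):
                return sid
        return None
```

A monotonic clock is used so that wall-clock adjustments on the VPS cannot shorten or extend a cooldown.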

Detection is holistic, not just TLS

A common mistake (we made it early on) is assuming TLS fingerprint parity is enough. It is not. Modern detection combines:

TLS fingerprint (JA3 / JA4): baseline
HTTP/2 settings frame + header order
User-Agent consistency with the browser version the TLS profile claims
IP reputation (datacenter vs residential)
Request cadence (human-like spacing)
Cookie and session continuity

Chrome 124 JA4 parity only solves the first of these layers. We also send the full Sec-Ch-Ua, Sec-Fetch-*, and Upgrade-Insecure-Requests header set, lock the User-Agent to match the profile version, add 800 to 2500 ms of human-like jitter between requests on the same session, and route through Apify residential proxies for the IP reputation layer.
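As a rough illustration of the header and cadence layers, here is a Python sketch. The header names come from the list above; the concrete values (Windows platform, the brand string in Sec-Ch-Ua, the Sec-Fetch-* values for a top-level navigation) are assumptions and should be copied from a real Chrome 124 request rather than trusted from this example.

```python
import random

CHROME_MAJOR = 124  # must match the TLS profile in use


def chrome_headers(major: int = CHROME_MAJOR) -> dict:
    """Header set tied to one Chrome major version (values are illustrative)."""
    return {
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            f"(KHTML, like Gecko) Chrome/{major}.0.0.0 Safari/537.36"
        ),
        # Assumed brand list; verify against a real browser capture.
        "Sec-Ch-Ua": (
            f'"Chromium";v="{major}", "Google Chrome";v="{major}", '
            '"Not-A.Brand";v="99"'
        ),
        "Sec-Ch-Ua-Mobile": "?0",
        "Sec-Ch-Ua-Platform": '"Windows"',
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
        "Sec-Fetch-User": "?1",
        "Upgrade-Insecure-Requests": "1",
    }


def jitter_seconds(rng=random) -> float:
    """Delay between requests on one session: 800 to 2500 ms."""
    return rng.uniform(0.8, 2.5)
```

Deriving both the User-Agent and Sec-Ch-Ua from a single version constant keeps them from drifting apart when the TLS profile is upgraded.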

The counterintuitive insight: do NOT randomize TLS profiles to evade detection. Real Chrome does not randomize. Rotating fingerprints makes you more visible, not less. Stick with one current profile and reproduce it perfectly.

Results

Smoke test on Stripe this morning: 5 requested, 5 verified current Stripe employees (Patrick Collison, John Collison, Juliet Simpson, JR Farr, Karl Durrance), 0 false positives. Duration about 3 minutes end to end.

Over 299 paying users hit this scraper last month. Failure rate is low single digits, all from LinkedIn 999 blocks that the burn ledger handles correctly.

Try it

Apify Store: https://apify.com/george.the.developer/linkedin-company-employees-scraper

Priced per verified profile. Input: a list of company URLs or names. Output: verified current employees only, no false positives.
