
How Bot Detection Actually Works in Web Analytics (2026)

Article contents
  1. Half of your traffic is not human
  2. What GA4 actually does about bots
  3. Why every other analytics tool is a black box too
  4. The 7 layers a bot must survive
  5. 11 detection checks, explained
  6. Behavioral scoring: catching bots that pass every other check
  7. 56 AI bots, three intent categories
  8. The Bot Center: seeing what others hide
  9. What we are building next
  10. The math behind dirty data
  11. Start with clean data
  12. FAQ

Google Analytics filters bots using a single list of user-agent strings. If the bot says "I'm Googlebot," GA4 excludes it. If the bot says "I'm Chrome 124 on Windows 11," GA4 counts it as a real person. That is the entire system. One list. No behavioral analysis. No IP checks. No transparency. And it protects 28 million websites.

Key Takeaways
  • GA4's bot filter uses a single list (IAB/ABC) that only catches self-identifying bots. In controlled tests, it filtered zero bot sessions that used standard browser user-agents.
  • 51% of all web traffic in 2024 was automated, with bad bots alone at 37%. AI scraper traffic grew 300% year-over-year. Only 2.8% of websites are fully protected.
  • Clickport runs 11 detection checks across 7 filtering layers at ingestion time, from tracker-side hard stops to fingerprint velocity analysis. Blocked events never reach the database.
  • No other analytics tool shows what it blocks. Clickport's Bot Center displays blocked counts by detection method, top sources, AI bot breakdown by intent, and blocklist update status.
  • Behavioral scoring using Welford's online variance algorithm detects bots that pass all other checks by analyzing mouse velocity, scroll patterns, and timing distributions in O(1) memory.

Half of your traffic is not human

Automated traffic surpassed human traffic for the first time in 2024. The Imperva 2025 Bad Bot Report, based on data across their global network, found that bots now account for 51% of all web traffic. Bad bots alone represent 37%, up from 32% in 2023. That is the sixth consecutive year of growth.

This is not a niche problem. Imperva blocked 13 trillion bad bot requests in 2024. Akamai reported a 300% year-over-year surge in AI scraper traffic. Cloudflare's 2025 data shows non-AI bots generating 50% of HTML page requests, 7% more than human traffic.

Cloudflare CEO Matthew Prince put it bluntly at SXSW in March 2026: "We suspect that, in 2027, the amount of bot traffic online will exceed the amount of human traffic."

GLOBAL WEB TRAFFIC COMPOSITION (2024)
  Bad bots        37%
  Good bots       14%
  Human traffic   49%
Source: Imperva 2025 Bad Bot Report

AI is the accelerant. DoubleVerify found an 86% year-over-year spike in general invalid traffic in the second half of 2024, with AI scrapers like GPTBot, ClaudeBot, and AppleBot generating 16% of known-bot impressions. Barracuda's research documented one web application receiving 9.7 million AI scraper requests in 30 days. Another received over 500,000 in a single day.

If your analytics tool cannot tell the difference between a bot and a real visitor, every number in your dashboard is suspect. Your traffic is inflated, your engagement rates are diluted, your A/B tests are contaminated, and your marketing spend decisions are based on fiction.

What GA4 actually does about bots

Google Analytics 4 uses a single mechanism for bot detection: the IAB/ABC International Spiders and Bots List. This is a curated list of known bot user-agent strings and IP addresses, maintained by the Interactive Advertising Bureau and updated approximately monthly.

Google's own documentation says that "traffic from known bots and spiders is automatically excluded." The key word is "known." If a bot identifies itself as Googlebot or Bingbot in its User-Agent header, GA4 excludes it. If that same bot sets its user-agent to "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36," GA4 counts it as a real person from Windows.

That is the entire detection system. One list. No behavioral analysis. No datacenter IP blocking. No JavaScript environment checks. No fingerprint velocity analysis.
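To see how little that buys, list-based filtering can be sketched in a few lines (an illustration of the approach, not GA4's actual implementation; the patterns shown are examples, not the licensed IAB/ABC list):

```typescript
// Simplified sketch of list-based bot filtering. The real IAB/ABC
// list is licensed and far larger, but the mechanism is the same.
const KNOWN_BOT_PATTERNS: RegExp[] = [
  /Googlebot/i,
  /Bingbot/i,
  /AhrefsBot/i,
];

function isKnownBot(userAgent: string): boolean {
  return KNOWN_BOT_PATTERNS.some((p) => p.test(userAgent));
}

// A self-identifying crawler is caught:
isKnownBot("Mozilla/5.0 (compatible; Googlebot/2.1)"); // true

// The same crawler with a spoofed desktop UA sails through:
isKnownBot(
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 " +
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
); // false
```

The second call is the entire problem in one line: the filter trusts whatever the client claims to be.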

GA4 BOT DETECTION VS A REAL DETECTION SYSTEM

GA4 (1 check)
  ☑ IAB/ABC user-agent list
  ☐ Datacenter IP blocking
  ☐ JavaScript environment signals
  ☐ Behavioral scoring
  ☐ Spam referrer filtering
  ☐ Fingerprint velocity
  ☐ Rate limiting
  ☐ Blocked traffic dashboard

Clickport (11 checks, 7 layers)
  ☑ 100+ UA pattern matching
  ☑ 3,400+ datacenter IP ranges
  ☑ 4 JavaScript environment checks
  ☑ Behavioral scoring (Welford's)
  ☑ 2,342 spam referrer domains
  ☑ Fingerprint velocity analysis
  ☑ Per-IP rate limiting (60/min)
  ☑ Bot Center with full breakdown

GA4 detection methods from Google's official documentation.

The consequences are real. Plausible ran controlled tests sending bot traffic to both GA4 and their own tool. Bots using non-human user-agents: GA4 recorded 22 pageviews. Bots using normal Chrome user-agents: GA4 recorded 40 pageviews. Bots from datacenter IPs: GA4 recorded 17 pageviews. We ran our own test with 1,000 Puppeteer bot sessions. GA4 caught zero.

And here is what makes it permanent: GA4 cannot retroactively remove bot traffic from processed reports. Google's data deletion feature works on parameter values, not traffic characteristics. Once bot sessions are in your GA4 property, they stay. Your historical data is permanently inflated.

Why every other analytics tool is a black box too

The privacy-focused analytics tools handle bot detection better than GA4. But they all share one problem: they tell you nothing about what they block.

Plausible uses UA pattern matching, roughly 32,000 datacenter IP ranges, and referrer spam filtering. They have blocked approximately 2 billion bots across all subscribers since February 2023. But their dashboard shows no blocked counts, no detection method breakdown, no visibility into what was filtered.

Fathom explicitly states: "We're not going to detail what we've changed in our bot detection algorithm." They treat it as proprietary. Users see clean data and are expected to trust that it is actually clean.

Matomo has the most transparent approach of the established tools, with a TrackingSpamPrevention plugin that covers cloud provider IPs and headless browser detection. But these features are opt-in plugins, not enabled by default. And the third-party BotTracker plugin provides basic bot visit logging, not a comprehensive detection dashboard.

Umami, the popular open-source option, uses just one npm library (isbot) for user-agent matching. No IP-based detection. No behavioral analysis. No referrer spam filtering. When Umami detects a bot, it returns {"beep": "boop"} and moves on.

BOT DETECTION TRANSPARENCY ACROSS ANALYTICS TOOLS
Feature                  GA4  Plausible  Fathom  Matomo  Clickport
Shows blocked counts     No   No         No      Plugin  Yes
Breakdown by method      No   No         No      No      Yes
AI bot categorization    No   No         No      No      Yes
Manual session flagging  No   No         No      No      Yes
Blocklist update status  No   No         No      No      Yes
Based on published documentation and feature pages for each tool as of April 2026.

The analytics industry treats bot detection as a checkbox. "Bot filtering: enabled." That is the extent of it. No methodology disclosed, no blocked traffic visible, no detection rates published. You are supposed to trust a black box with the accuracy of every number in your dashboard.

If you have ever looked at your analytics and thought "these numbers don't feel right," this is probably why. Clickport takes a different approach: we show you exactly what we block and why.

The 7 layers a bot must survive

Every event that reaches Clickport passes through seven distinct filtering layers before it can appear in your dashboard. Each layer can reject the event. Rejected events never reach the database.

This is ingestion-time filtering, not query-time filtering. The difference matters. GA4 collects everything and filters when generating reports. Clickport blocks confirmed bots before writing a single row. Your database is clean from day one. No retroactive cleanup. No permanently inflated historical data.

But we also built a safety net for ambiguous cases. Hard-confirmed bots are blocked at ingestion. Suspicious sessions that make it through can be manually flagged in the Sessions panel, which excludes them from all dashboard queries without deleting the underlying data.

THE 7-LAYER DETECTION PIPELINE
Layer 0: Tracker hard-stops. navigator.webdriver, PhantomJS, Cypress, localhost, self-exclusion. Aborts before any event is sent.
Layer 1: Route-level gates. Authentication, demo mode discard, per-IP rate limiting (60 events/minute).
Layer 2: Privacy checks. Do Not Track (if site respects DNT) and per-site IP exclusion lists.
Layer 3: Bot detection (11 checks). Environment, network, and behavioral signals. Priority-ordered, short-circuits on first match.
Layer 4: Daily event cap. Per-site daily limit prevents abuse floods from overwhelming the database. Alerts at 80% and 100%.
Layer 5: Fingerprint velocity. Sliding-window botnet detector: the same browser fingerprint producing 100+ events in 10 minutes indicates a coordinated attack.
Layer 6: Normal processing. UA parsing, session resolution, database write, goal matching. Only real traffic reaches this point.
Every blocked event is counted in the Bot Center with detection method, source detail, and country. Blocked events never touch the analytics database.
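The fingerprint velocity layer (layer 5) is essentially a sliding-window counter keyed by browser fingerprint. A minimal sketch using the thresholds described above (100+ events in 10 minutes); the class and method names are illustrative, not Clickport's internal code:

```typescript
// Sliding-window counter: flags a fingerprint once it exceeds
// `limit` events within `windowMs`. Timestamps older than the
// window are pruned on each call, so memory stays bounded.
class VelocityDetector {
  private events = new Map<string, number[]>();
  constructor(private limit = 100, private windowMs = 10 * 60 * 1000) {}

  record(fingerprint: string, nowMs: number): boolean {
    const cutoff = nowMs - this.windowMs;
    const times = (this.events.get(fingerprint) ?? []).filter((t) => t > cutoff);
    times.push(nowMs);
    this.events.set(fingerprint, times);
    return times.length > this.limit; // true = looks like a coordinated attack
  }
}
```

The same fingerprint spread across many residential IPs still trips this counter, which is why it sits behind the per-IP rate limit rather than replacing it.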

The pipeline is designed around a principle from the bot detection research community: cost asymmetry. No single detection layer needs to be unbeatable; each one only needs to raise the cost and complexity of evasion. A bot can spoof a user-agent (trivial), patch navigator.webdriver (cheap), or route through residential proxies (moderate cost). Doing all of these simultaneously, consistently, at scale, while also matching real behavioral timing distributions, becomes exponentially more expensive.

11 detection checks, explained

Layer 3 in the pipeline runs 11 individual checks in priority order. The function short-circuits on the first match: once a bot is detected, no further checks run. Every blocked event records the detection method and specific detail (which bot name, which IP range, which behavioral signal).
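The priority-ordered, short-circuiting structure can be sketched like this (the check names, event fields, and result shape are illustrative, not Clickport's internal API):

```typescript
interface Detection { method: string; detail: string }

interface TrackedEvent {
  userAgent: string;
  webdriver: boolean;
  languageCount: number;
}

type Check = (event: TrackedEvent) => Detection | null;

// Checks run in priority order; the first match wins and
// later (more expensive) checks never execute.
const checks: Check[] = [
  (e) => (e.webdriver ? { method: "webdriver", detail: "navigator.webdriver=true" } : null),
  (e) => (e.languageCount === 0 ? { method: "languages", detail: "empty navigator.languages" } : null),
  (e) => (/HeadlessChrome/i.test(e.userAgent) ? { method: "ua_pattern", detail: "HeadlessChrome" } : null),
];

function detectBot(event: TrackedEvent): Detection | null {
  for (const check of checks) {
    const hit = check(event);
    if (hit) return hit; // short-circuit on first match
  }
  return null;
}
```

An event that would fail several checks is attributed to the highest-priority one, which keeps the Bot Center's per-method counts unambiguous.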

Environment signals (checks 1-4)

These use signals collected by the tracker's JavaScript running in the visitor's browser.

Check 1: Webdriver flag. The W3C WebDriver specification requires that browsers set navigator.webdriver = true when controlled by automation tools like Selenium, Playwright, or Puppeteer. Real browsers return false. The tracker checks this at initialization and sends the signal with every event.

Check 2: Language count. The tracker reads navigator.languages.length. Real browsers always report at least one language. A count of zero means the browser environment was not properly initialized, which only happens in automation.

Check 3: Software GPU. The tracker reads the WebGL renderer string via UNMASKED_RENDERER_WEBGL. Real browsers report an actual GPU name (like "ANGLE (NVIDIA GeForce RTX 3080)"). Headless Chrome and CI environments report software renderers: SwiftShader or llvmpipe. These are well-documented headless browser signatures.

Check 4: Instant execution. The tracker measures the time between script initialization and the first event send using performance.now(). A delta of zero milliseconds means the script was loaded and executed instantaneously, which only happens in automation frameworks that do not simulate realistic page loading.
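Once the tracker ships these signals to the server, checks 1 through 4 reduce to a handful of comparisons. A sketch with illustrative field names:

```typescript
interface EnvSignals {
  webdriver: boolean;      // navigator.webdriver
  languageCount: number;   // navigator.languages.length
  webglRenderer: string;   // UNMASKED_RENDERER_WEBGL string
  initToSendMs: number;    // performance.now() delta, init -> first send
}

// Returns the name of the first failing environment check, or null.
function failedEnvCheck(s: EnvSignals): string | null {
  if (s.webdriver) return "webdriver_flag";
  if (s.languageCount === 0) return "zero_languages";
  if (/swiftshader|llvmpipe/i.test(s.webglRenderer)) return "software_gpu";
  if (s.initToSendMs === 0) return "instant_execution";
  return null;
}
```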

Network signals (checks 5-8)

These use server-side data available from the HTTP request.

Check 5: Empty user-agent. Missing or empty User-Agent headers. Legitimate browsers always send one.

Check 6: User-agent pattern matching. A regex matching over 100 known bot patterns across 8 categories: AI bots (54 patterns), search engines (15), SEO tools (10), social crawlers (9), monitoring services (10), feed readers (5), HTTP libraries and headless browsers (11), and vulnerability scanners (6). This is equivalent to what GA4 does with the IAB list, except we maintain our own patterns and update them weekly rather than monthly.

Check 7: Datacenter IP blocking. The client IP is checked against 3,434 datacenter IP ranges from the ipcat project plus 5 hardcoded ranges (Apple, Tencent Cloud, Huawei Cloud, Google Crawlers). The lookup uses binary search over sorted ranges for O(log n) performance: 12 comparisons worst-case instead of 4,000.

But here is where it gets interesting. Naively blocking all datacenter IPs would also block legitimate users on corporate VPNs and commercial VPN services, because providers like NordVPN and ExpressVPN run their exit nodes on cloud infrastructure. A developer browsing through their company's AWS-hosted VPN is a real person, not a bot.

So we maintain a VPN whitelist: 10,700+ CIDR ranges from the X4BNet VPN list. If an IP matches both the datacenter blocklist and the VPN whitelist, it is treated as legitimate. This eliminates the false positive problem that makes naive IP blocking unreliable.
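The lookup itself is simple: convert the IPv4 address to a 32-bit integer and binary-search sorted inclusive ranges, with the VPN whitelist overriding the blocklist. A sketch with illustrative ranges (the real lists come from ipcat and X4BNet):

```typescript
// Convert dotted-quad IPv4 to an unsigned 32-bit integer.
function ipToInt(ip: string): number {
  return ip.split(".").reduce((acc, octet) => acc * 256 + Number(octet), 0);
}

type Range = [number, number]; // inclusive [start, end], sorted by start

// O(log n) membership test over sorted, non-overlapping ranges.
function inRanges(ip: number, ranges: Range[]): boolean {
  let lo = 0, hi = ranges.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const [start, end] = ranges[mid];
    if (ip < start) hi = mid - 1;
    else if (ip > end) lo = mid + 1;
    else return true;
  }
  return false;
}

// Illustrative ranges only; real lists hold thousands of entries.
const datacenter: Range[] = [[ipToInt("3.0.0.0"), ipToInt("3.255.255.255")]];
const vpnWhitelist: Range[] = [[ipToInt("3.120.0.0"), ipToInt("3.120.255.255")]];

function isDatacenterBot(ip: string): boolean {
  const n = ipToInt(ip);
  // VPN exits on cloud infrastructure carry real people: whitelist wins.
  return inRanges(n, datacenter) && !inRanges(n, vpnWhitelist);
}
```

With ~3,400 sorted ranges, the loop body runs at most 12 times per lookup, which is why this check is cheap enough to run on every event.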

Check 8: Spam referrer filtering. The referrer hostname is checked against 2,342 known spam domains from the Matomo referrer-spam-list. Referrer spam is a technique where bots send fake requests with spoofed Referer headers pointing to spam websites, hoping to generate backlinks or lure curious webmasters into visiting.

Behavioral signals (checks 9-11)

These catch bots that pass all environment and network checks.

Check 9: No viewport. A screen width of zero on a non-pageleave event. Real browsers always report a viewport size.

Check 10: ARM64 Linux. User-agents containing "Linux aarch64" or "Linux arm64". This is server and container infrastructure (AWS Graviton, Docker on ARM). Nobody browses the web from an ARM64 Linux server.

Check 11: Impossible interaction patterns. This check fires on pageleave events where: scroll depth is 90% or higher, engagement time is between 0 and 5 seconds, and there was zero mouse or keyboard interaction. Scrolling 90% of a page in under 5 seconds with no physical interaction is physically impossible for a human.
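Check 11 reduces to a pure predicate over three fields of the pageleave event. A sketch using the thresholds above (field names are illustrative):

```typescript
interface PageleaveEvent {
  scrollDepthPct: number;   // 0-100
  engagementSeconds: number;
  interactionCount: number; // mouse + keyboard events observed
}

// Deep scroll + near-zero dwell + zero physical input
// is not something a human hand can produce.
function impossibleInteraction(e: PageleaveEvent): boolean {
  return (
    e.scrollDepthPct >= 90 &&
    e.engagementSeconds >= 0 &&
    e.engagementSeconds <= 5 &&
    e.interactionCount === 0
  );
}
```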

How the checks work together
The checks run in priority order. Environment signals (checks 1-4) catch automation tools. Network signals (checks 5-8) catch scrapers, crawlers, and datacenter-hosted bots. Behavioral signals (checks 9-11) catch sophisticated bots that mimic real browsers but cannot fake realistic human behavior. A bot that spoofs its user-agent and uses residential proxies still needs to simulate realistic scroll-and-interaction patterns to survive all 11 checks.

All three blocklists (datacenter IPs, VPN ranges, spam referrers) update automatically on a weekly schedule. A status file records the count and last-update timestamp for each list. If a remote fetch fails, the system falls back to the stale cache rather than running without protection.

Behavioral scoring: catching bots that pass every other check

The 11 detection checks catch the majority of bot traffic. But what about bots running real Chromium instances on residential proxies with spoofed user-agents? They pass every environment check because they are running a real browser. They pass the datacenter IP check because they are on a residential IP. They pass the spam referrer check because they have no referrer.

This is where behavioral scoring comes in. The tracker collects mouse movement, scroll behavior, and timing data, then computes a behavior score from 0 to 100 using Welford's online variance algorithm.

The algorithm is elegant in its simplicity. It computes running mean and variance in a single pass with O(1) memory: no arrays to store, no history to maintain. Each new data point updates three values (count, mean, M2), and the coefficient of variation (CV) is derived from those three values at any time.
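Welford's update can be sketched in a few lines; the coefficient of variation falls out of the three running values at any point:

```typescript
// Welford's online algorithm: one pass, O(1) memory.
// Each sample updates (count, mean, M2); variance = M2 / (count - 1).
class OnlineStats {
  private count = 0;
  private mean = 0;
  private m2 = 0;

  push(x: number): void {
    this.count += 1;
    const delta = x - this.mean;
    this.mean += delta / this.count;
    this.m2 += delta * (x - this.mean);
  }

  // Coefficient of variation: stddev / mean.
  cv(): number {
    if (this.count < 2 || this.mean === 0) return 0;
    const variance = this.m2 / (this.count - 1);
    return Math.sqrt(variance) / this.mean;
  }
}
```

Feeding it constant velocities yields a CV near zero (the bot signature), while erratic human-like samples push the CV well past the 0.3 threshold, all without storing a single sample.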

What the score measures

Mouse velocity CV (35 points possible). The tracker samples mousemove events and computes the Euclidean distance and time between samples. After 5 or more samples, it calculates the coefficient of variation. Real humans have highly variable mouse velocity (CV > 0.3) because they pause, accelerate, overshoot, and correct. Bots tend to move at constant speed or in mechanical patterns (CV < 0.1).

Scroll velocity CV (25 points possible). Same principle applied to scroll events. After 3 or more samples, it measures the variance. Real humans scroll in irregular bursts with pauses for reading. Bots scroll at uniform speed.

Timing bucket distribution (25 points possible). After 10 or more mouse events, the tracker sorts inter-event intervals into three buckets: under 20ms, 20-100ms, and over 100ms. Real humans produce a spread across all three buckets (no single bucket exceeds 50%). Bots tend to cluster in one bucket because their timing is mechanically regular.

Presence bonus (15 points). Any mouse movement at all earns 15 points. Many bots never trigger a single mouse event.
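A sketch of how these four components could combine into the 0-100 score. The point weights and the mouse-velocity CV threshold come from the description above; the scroll CV threshold and exact cutoffs are assumptions:

```typescript
interface BehaviorSignals {
  mouseVelocityCV: number;       // from Welford over mousemove samples
  scrollVelocityCV: number;      // from Welford over scroll samples
  timingBuckets: [number, number, number]; // counts: <20ms, 20-100ms, >100ms
  sawMouseMovement: boolean;
}

function behaviorScore(s: BehaviorSignals): number {
  let score = 0;
  // Mouse velocity variance: humans are erratic (CV > 0.3).
  if (s.mouseVelocityCV > 0.3) score += 35;
  // Scroll variance: irregular bursts earn the points (threshold assumed).
  if (s.scrollVelocityCV > 0.3) score += 25;
  // Timing spread: no single bucket should dominate (>50%).
  const total = s.timingBuckets.reduce((a, b) => a + b, 0);
  if (total > 0 && Math.max(...s.timingBuckets) / total <= 0.5) score += 25;
  // Presence bonus: many bots never move the mouse at all.
  if (s.sawMouseMovement) score += 15;
  return score;
}
```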

BEHAVIOR SCORE DISTRIBUTION
70-100: Human visitor. High mouse velocity variance, irregular scroll patterns, spread timing distribution, natural pauses and corrections.
30-69: Ambiguous. Some interaction detected, partial mouse/scroll data, short session duration; may be a mobile or keyboard-only user.
0-29: Likely bot. No mouse movement at all, constant-speed scrolling, mechanical timing patterns, zero interaction events.
Score computed using Welford's online variance algorithm. O(1) memory per session.

The behavior score is currently collected and displayed in the Bot Center's traffic quality section. It is not yet used as an automatic blocking signal, because behavioral scoring has higher false-positive risk than the hard signals (a keyboard-only user with no mouse movement would score low despite being human). We are building toward a composite scoring system that weighs behavioral data alongside the other 11 checks rather than using it as a binary block.

56 AI bots, three intent categories

AI crawlers are the fastest-growing segment of bot traffic. Cloudflare observed that AI "user action" crawling surged over 15x year-over-year in 2025. But not all AI bots are the same, and treating them identically is a mistake.

Clickport tracks 56 AI bots across three intent categories: 15 live retrieval, 14 search indexing, and 24 model training (plus 3 deprecated legacy entries). The distinction matters because the appropriate response differs for each.

Live Retrieval (15 bots)

These bots fetch pages during real-time AI conversations. When someone asks ChatGPT to "look up the pricing on this website," ChatGPT-User visits your page. These visits represent actual human intent: someone is interested in your content right now.

Examples: ChatGPT-User (OpenAI), Claude-User (Anthropic), Perplexity-User, meta-externalfetcher (Meta), Google-Agent, kagi-fetcher.

Search Indexing (14 bots)

These crawl your site to build AI-powered search indexes. They are the AI equivalent of Googlebot. Being indexed by AI search engines means your content can appear in AI-generated answers.

Examples: OAI-SearchBot, Claude-SearchBot, PerplexityBot, Amazonbot, meta-webindexer, PhindBot, DuckAssistBot.

Model Training (24 bots)

These scrape your content at scale to train large language models. They provide zero value to your site. They do not send referral traffic. They do not represent user intent. They consume server resources and bandwidth.

Examples: GPTBot (OpenAI), ClaudeBot (Anthropic), Bytespider (ByteDance), Google-Extended, CCBot (Common Crawl), DeepSeekBot, GrokBot (xAI).

AI BOT CRAWL-TO-REFER RATIOS
  ClaudeBot        500,000:1
  OpenAI             3,700:1
  Perplexity           400:1
  Google Search       3-30:1
Crawl-to-refer ratio: pages crawled per referral visit sent. Source: Cloudflare 2025 Year in Review.

The crawl-to-refer ratio tells the story. Google Search crawls 3 to 30 pages for every visit it sends you. That is a fair trade. ClaudeBot crawls 500,000 pages for every visit. That is resource extraction, not a partnership.

Clickport's Bot Center breaks down AI bot traffic by intent category, so you can see exactly which AI companies are accessing your content, how often, and whether they are sending any traffic back. You can track AI search referral traffic separately in your sources panel to measure the actual value.

The Bot Center: seeing what others hide

Most analytics tools treat bot filtering as invisible plumbing. Something that happens behind the scenes. You never see it, you never question it, you never know if it is working.

We built the Bot Center to make bot detection visible.

It shows total blocked events by detection method (how many were caught by UA pattern matching vs. datacenter IP blocking vs. behavioral signals). It shows the top blocked sources (which specific bots and IP ranges are hitting your site most). It shows AI bot traffic broken down by intent category. It shows traffic quality metrics derived from the behavior scores. And it shows blocklist update status, so you know your protection is current.

Bot Protection: Active
  12,847 bots blocked in last 30 days (31% of traffic), plus 23 manually flagged
  AI Crawlers (4,291): Live Retrieval 1,204, Search Indexing 892, Model Training 2,195
  Other Blocked Bots: 8,556
Traffic Quality
  18% zero-engagement sessions (1,247 of 6,928): Desktop 14%, Mobile 26%, Tablet 11%
Blocklists (updated March 30, 2026)
  3,434 datacenter IPs, 10,756 VPN ranges (whitelisted), 2,342 spam domains, 128 bot UA patterns
Clickport Bot Center. Every blocked event is logged with detection method, source, and country.
Why transparency matters
When you can see exactly what is being blocked, you can verify the filter is working correctly. If you see a legitimate service in the blocked list, you know to whitelist it. If you see a spike in bot traffic from a specific IP range, you have an early warning. If you see zero blocked events, you know something is wrong. A black box gives you none of this.

The Sessions panel adds a second layer of control. You can drill into individual sessions, see their engagement metrics (duration, scroll depth, pages viewed, interaction count), and manually flag suspicious ones as bots. Flagged sessions are excluded from all dashboard queries but remain in the database for auditing.

This is the hybrid approach: automated ingestion-time blocking for high-confidence detections, manual session-level control for edge cases. Your analytics data is clean by default, and you have the tools to make it cleaner.

Rahul Gupta, Senior Principal Software Engineer at Barracuda, put it directly: "Their presence can distort website analytics leading to misleading insights and impaired decision-making." The question is whether your analytics tool lets you see that distortion or hides it behind a checkbox.

What we are building next

Bot detection is an arms race. Sophisticated bots get better, and detection systems must evolve. Here is what is coming to Clickport's detection pipeline.

Tier 1 (next releases)

Deferred pageview confirmation. A 2.5-second tracker beacon that requires any mouse, scroll, touch, or keyboard event before confirming the visit. Sessions without confirmation get flagged. This catches bots that fire a single pageview and disappear without any interaction.

Session timeout sweep. When a session expires from the cache, if it has exactly one event, no pageleave, and zero engagement, it gets flagged as suspicious. Many bots hit a single page and never return.

Client Hints header validation. Modern Chromium-based browsers send sec-ch-ua and sec-ch-ua-mobile headers automatically. A request claiming to be Chrome 124 via User-Agent but missing these headers is immediately suspicious. This catches bots using HTTP libraries with spoofed user-agents but no real browser engine.
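That consistency check is cheap to sketch (a simplified illustration; production validation must also account for browsers that legitimately omit Client Hints, such as Firefox and Safari):

```typescript
interface RequestHeaders {
  "user-agent"?: string;
  "sec-ch-ua"?: string;
}

// Chromium-based browsers from roughly version 89 onward send sec-ch-ua
// automatically. A UA claiming to be modern Chrome without it is likely
// an HTTP library with a spoofed User-Agent string.
// (Firefox/Safari don't send Client Hints, so only Chrome claims are checked.)
function missingExpectedClientHints(h: RequestHeaders): boolean {
  const ua = h["user-agent"] ?? "";
  const claimsModernChrome =
    /Chrome\/(\d+)/.test(ua) && Number(ua.match(/Chrome\/(\d+)/)![1]) >= 89;
  return claimsModernChrome && !h["sec-ch-ua"];
}
```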

requestAnimationFrame timing probe. A 5-frame rAF probe to detect headless Chrome via abnormal frame timing. Real browsers sync to the display's refresh rate (~16.67ms at 60Hz) with natural jitter. Headless Chrome either fires too fast or produces suspiciously regular intervals.

IP subnet clustering. Track unique IPs per /24 subnet per site. When 30 or more unique IPs from the same /24 subnet hit a site within 60 minutes, that is a residential proxy pool, not 30 individual visitors from the same neighborhood.
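Subnet clustering is a per-site counter keyed by the /24 prefix, with the same sliding-window pruning idea as fingerprint velocity. A sketch using the thresholds described (names are illustrative):

```typescript
// Tracks unique IPs per /24 subnet inside a rolling time window.
// 30+ distinct IPs from one /24 within an hour suggests a proxy pool.
class SubnetCluster {
  // subnet -> (ip -> last-seen timestamp in ms)
  private seen = new Map<string, Map<string, number>>();
  constructor(private limit = 30, private windowMs = 60 * 60 * 1000) {}

  record(ip: string, nowMs: number): boolean {
    const subnet = ip.split(".").slice(0, 3).join(".") + ".0/24";
    const ips = this.seen.get(subnet) ?? new Map<string, number>();
    ips.set(ip, nowMs);
    // Drop IPs not seen inside the window.
    for (const [addr, t] of ips) if (t <= nowMs - this.windowMs) ips.delete(addr);
    this.seen.set(subnet, ips);
    return ips.size >= this.limit;
  }
}
```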

Tier 2 (planned)

Composite bot scoring. Replace binary pass/fail for soft signals with a weighted point system. Hard signals (webdriver, datacenter IP) stay as immediate blocks. Soft signals (low behavior score, suspicious timing, unusual geographic patterns) contribute points. A session accumulates points across multiple weak signals until a threshold is reached. This catches bots that are "slightly off" in multiple dimensions rather than obviously wrong in one.

Geographic anomaly scoring. Detect language/timezone mismatches and geographic concentration combined with engagement signals. A user claiming to be in Germany with a Chinese language preference, zero scroll depth, and a 2-second session is more suspicious than any one of those signals alone.

Tier 3 (research)

TLS fingerprinting (JA4). A Python requests library, a Go net/http client, and a real Chrome browser all produce different TLS fingerprints, even when the User-Agent header is spoofed. TLS fingerprinting detects the gap between what a bot claims to be and what its network stack reveals. This requires changes at the reverse proxy level rather than in application code.

DETECTION ROADMAP
Tier 1 (next releases): Deferred pageview confirmation, session timeout sweep, Client Hints validation, rAF timing probe, IP subnet clustering.
Tier 2 (planned): Composite bot scoring, geographic anomaly scoring, canvas hash clustering.
Tier 3 (research): TLS fingerprinting (JA4), MaxMind Anonymous IP DB, IPQualityScore API lookups.

We are transparent about what we cannot catch today. Bots on residential proxies running fully instrumented headless Chrome with realistic behavioral simulation will evade most client-side analytics tools, including ours. No analytics-grade detection system catches everything. The goal is to catch the 95% that is catchable and give you the tools to identify the rest.

The math behind dirty data

Bot traffic does not just inflate your visitor count. It compounds through every metric in your dashboard.

Take a site with 10,000 real monthly visitors and a 3.5% conversion rate: 350 conversions per month. Now add 30% bot traffic on top, a mid-range assumption: Imperva puts bad bots at 37% of all traffic, CHEQ found 17.9% of observed traffic is fake, and DataDome reports that only 2.8% of websites are fully protected.

Your dashboard now shows 13,000 visitors and 350 conversions. Your conversion rate reads 2.7%, not 3.5%. You just lost 23% of your apparent conversion rate to phantom visitors. If you are running paid ads, you are now optimizing toward a 2.7% target when reality is already at 3.5%.
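The arithmetic is worth making concrete; a small sketch of the dilution effect:

```typescript
// Conversion-rate dilution: bots add visitors but never convert.
// `botShareOfReal` is the bot traffic added on top of real traffic (0.3 = 30%).
function dilutedConversionRate(
  realVisitors: number,
  conversions: number,
  botShareOfReal: number
): number {
  const measuredVisitors = realVisitors * (1 + botShareOfReal);
  return conversions / measuredVisitors;
}

const real = 350 / 10_000;                                // 3.5% true rate
const measured = dilutedConversionRate(10_000, 350, 0.3); // ~2.69% measured
const apparentLoss = (real - measured) / real;            // ~23% of the rate vanishes
```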

BOT IMPACT CALCULATOR (example: 50,000 reported visitors, 30% bot traffic)
What your dashboard shows: 50,000 visitors, 2.1% conversion rate
Reality (bots removed): 35,000 actual visitors, 3.0% actual conversion rate
Your conversion rate reads 30% lower than reality because of bot traffic.
Bot traffic estimates based on Imperva 2025 (37% bad bots) and CHEQ 2024 (17.9% fake traffic). Actual rates vary by industry.

The distortion gets worse in specific dimensions. CHEQ found that 22.1% of "direct" traffic is fake, the highest invalid rate of any channel. Bot sessions average 0.5 seconds of engagement versus 43 seconds for humans. They drag down your average engagement time, inflate your bounce rate, and dilute the signal in every metric you use to make decisions.

A/B testing is especially vulnerable. Peakhour documented that 40% of their customers' test traffic came from bots, silently skewing experiment results. If bots visit Variant A and Variant B at different rates, they create a sample ratio mismatch that can flip statistical significance and cause you to deploy a worse-performing variant.

The financial cost is staggering. Juniper Research projects that digital ad fraud will cost $172 billion annually by 2028. Forrester found that 21 cents of every media dollar is wasted due to poor data quality, translating to $16.5 million per year for enterprise companies.

Clean data is not a nice-to-have. It is the difference between optimizing for reality and optimizing for fiction.

Start with clean data

Every number in your analytics dashboard is downstream of one question: was this visitor real?

If your tool cannot answer that question with confidence, and cannot show you the evidence, then every decision you make from that data carries risk you cannot quantify.

Clickport runs 11 detection checks across 7 filtering layers. Blocked events never reach your database. The Bot Center shows you exactly what was caught, how it was caught, and what is trying to reach your site. You can drill into individual sessions and flag anything suspicious. And we are transparent about what we cannot catch yet, with a public roadmap of planned improvements.

We do all of this without cookies, without fingerprinting visitors, and without requiring a consent banner. Because bot detection should protect your data quality without compromising your visitors' privacy.

Start your free 30-day trial. See your real traffic in 60 seconds. No credit card required.

FAQ

How do I know if my analytics data includes bot traffic?

If you are using GA4 without additional bot protection, your data almost certainly includes bot traffic. The Imperva 2025 Bad Bot Report found that 51% of all web traffic is automated. Signs include: traffic spikes from datacenter locations (Ashburn, Virginia is a common one), sessions with zero engagement time, abnormally high bounce rates, and geographic anomalies that do not match your audience.

Does GA4 automatically filter bots?

GA4 automatically excludes traffic from bots that identify themselves via the IAB/ABC International Spiders and Bots List. This catches search engine crawlers like Googlebot and Bingbot. It does not catch bots that use standard browser user-agents, bots from datacenter IPs, or headless browsers like Puppeteer. In controlled tests, GA4 filtered zero bot sessions that used real browser user-agents.

Can bots execute JavaScript?

Yes. Modern headless browsers (Puppeteer, Playwright, Selenium) fully execute JavaScript, including analytics tracking scripts like GA4's gtag.js. The old assumption that "JavaScript-based tracking automatically filters bots" is no longer true. Clickport's tracker collects environment signals (WebGL renderer, language count, execution timing) that differ between real browsers and automated ones, catching headless browsers that execute JavaScript but cannot fully replicate a real browser environment.

What percentage of web traffic is bots in 2026?

Based on the most recent comprehensive studies: 51% of all web traffic is automated (Imperva 2025), with bad bots specifically at 37%. AI scraper traffic grew 300% year-over-year (Akamai 2025). The DataDome 2025 report found that only 2.8% of websites are fully protected from bot traffic, down from 8.4% in 2024. Cloudflare's CEO predicts bot traffic will exceed human traffic by 2027.

How does bot detection work without cookies or fingerprinting?

Clickport's bot detection operates on signals that do not identify individual users: IP range matching against known datacenter providers, user-agent pattern matching, JavaScript environment checks (WebGL renderer, execution timing), and aggregate behavioral analysis. The behavior score uses variance calculations on mouse and scroll velocity, not individual user tracking. IP addresses are used for detection at ingestion time but are never stored in the analytics database. No cookies are set. No cross-site tracking occurs.

Can I retroactively remove bot traffic from my analytics?

In GA4, effectively no. GA4's data deletion feature works on parameter values, not traffic characteristics. Once bot traffic is processed, it permanently affects historical reports. In Clickport, bot traffic is blocked at ingestion and never reaches the database, so there is nothing to remove. For sessions that make it through detection, you can manually flag them as bots from the Sessions panel, which excludes them from all dashboard queries.

What about bots on residential proxies?

Residential proxy networks (like Bright Data or Oxylabs) route bot traffic through real consumer IP addresses, bypassing datacenter IP detection entirely. This is the hardest category to detect. Clickport addresses this through behavioral scoring (residential proxy bots still often exhibit non-human behavior patterns), fingerprint velocity analysis (detecting clusters of identical browser fingerprints across many IPs), and the planned IP subnet clustering feature. No client-side analytics tool catches 100% of residential proxy bots, and we are transparent about this limitation.

How often are the blocklists updated?

Clickport's three external blocklists (datacenter IPs, VPN whitelist, spam referrers) update weekly via an automated script. Cloud providers like AWS, Google Cloud, and Azure allocate new IP ranges frequently, so weekly updates are necessary to maintain coverage. The blocklist update status (count and last-update timestamp for each list) is visible in the Bot Center.

David Karpik
Founder of Clickport Analytics
Building privacy-focused analytics for website owners who respect their visitors.
