Bot Detector API — Identify Search Engine and Malicious Bots by IP Address

What Is IP-Based Bot Detection?

Not all traffic hitting your server comes from human users. Search engine crawlers, monitoring services, uptime checkers, competitor scrapers, credential stuffers, and ad fraud bots all generate HTTP requests that look identical to real user traffic at the TCP level. Distinguishing automated from human traffic, and further distinguishing beneficial bots like Googlebot from malicious ones, is one of the foundational challenges of web infrastructure and analytics.

The Bot Detector API by GLOBUS.studio identifies whether a given IP address belongs to a known bot by combining three complementary signals: reverse DNS lookups (PTR records), Autonomous System Number (ASN) identification, and hosting provider detection. The underlying database is updated hourly and covers search engine crawlers (Google, Bing, Yandex), malicious bots, and parasitic scrapers. Results are returned as plain text or JSON, with a stated average latency of 5 ms.

API Endpoint and Parameters

GET https://api.globus.studio/v2/bot_ip?ip={address}&format={format}
  • ip — the IPv4 address to check
  • format — response format: json or plain

Full parameter reference and live testing are on the Bot Detector API documentation page.
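
As a minimal sketch of how the two query parameters fit together, the Python helper below builds the lookup URL for one address. The function name build_lookup_url and the client-side validation are assumptions added for illustration; only the endpoint, the ip parameter, and the json/plain format values come from the reference above.

```python
import ipaddress
from urllib.parse import urlencode

# Endpoint taken from the section above.
BASE_URL = "https://api.globus.studio/v2/bot_ip"


def build_lookup_url(ip: str, fmt: str = "json") -> str:
    """Build the lookup URL for one IPv4 address (hypothetical helper)."""
    # Reject anything that is not a valid IPv4 address before calling the API.
    address = ipaddress.IPv4Address(ip)
    if fmt not in ("json", "plain"):
        raise ValueError(f"unsupported format: {fmt!r}")
    return f"{BASE_URL}?{urlencode({'ip': str(address), 'format': fmt})}"
```

Validating locally avoids spending a request on input the endpoint is not documented to accept.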

Request and Response Examples

Known Bot Detected — JSON

GET /v2/bot_ip?ip=66.249.66.1&format=json

{
  "bot_detect": true,
  "bot_name": "google"
}

Known Bot Detected — Plain Text

GET /v2/bot_ip?ip=66.249.66.1&format=plain

google

No Bot Match — JSON

GET /v2/bot_ip?ip=8.8.8.8&format=json

{
  "bot_detect": false,
  "bot_name": "undefined"
}

No Bot Match — Plain Text

GET /v2/bot_ip?ip=8.8.8.8&format=plain

undefined
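
Both formats use the literal string undefined for "no match", so client code usually wants to normalize it. The sketch below parses either response body into one small record; the BotVerdict type and the function names are illustrative assumptions, not part of the API.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BotVerdict:
    """Normalized lookup result (hypothetical type, not defined by the API)."""
    is_bot: bool
    name: Optional[str]


def parse_json_body(body: str) -> BotVerdict:
    """Parse a format=json response such as {"bot_detect": true, "bot_name": "google"}."""
    data = json.loads(body)
    name = data.get("bot_name")
    if name == "undefined":
        name = None  # the API reports "no match" as the string "undefined"
    return BotVerdict(is_bot=bool(data.get("bot_detect")), name=name)


def parse_plain_body(body: str) -> BotVerdict:
    """Parse a format=plain response, which is just the bot name or "undefined"."""
    name = body.strip()
    if not name or name == "undefined":
        return BotVerdict(is_bot=False, name=None)
    return BotVerdict(is_bot=True, name=name)
```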

How the Detection Works

The API triangulates bot identity from three independent data sources rather than relying on a single signal — which is why it catches both well-documented crawlers and less obvious automated traffic:

  • Reverse DNS (PTR) lookup — legitimate search engine bots publish verifiable hostnames. Googlebot crawls from addresses that resolve to *.googlebot.com; Bingbot from *.search.msn.com. The API performs the PTR lookup and cross-references it against known crawler hostname patterns, following Google’s own recommended verification method.
  • ASN identification — each IP block on the internet is registered under an Autonomous System Number tied to an organization. ASNs associated with cloud hosting providers, datacenter operators, and known bot networks are flagged independently of hostname patterns.
  • Hosting provider detection — consumer ISPs and mobile carriers rarely host outbound crawlers. IPs resolving to datacenter and hosting provider netblocks — but lacking a legitimate crawler PTR record — are classified as suspicious automated traffic.
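
The reverse DNS signal can be illustrated with a forward-confirmed lookup, the procedure Google documents for verifying Googlebot: resolve the IP to a hostname, check the hostname suffix, then resolve the hostname back and confirm it maps to the same IP. This is a sketch of that general technique, not the API's internal implementation; the suffix table contains only the two patterns named above, and the function names are assumptions.

```python
import socket
from typing import Callable, List, Optional

# Hostname suffixes named in the text above; a real table would be longer.
CRAWLER_SUFFIXES = {
    "google": (".googlebot.com",),
    "bing": (".search.msn.com",),
}


def reverse_lookup(ip: str) -> Optional[str]:
    """Return the PTR hostname for an IP, or None if there is none."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def forward_lookup(hostname: str) -> List[str]:
    """Return the IPv4 addresses a hostname resolves to."""
    try:
        return socket.gethostbyname_ex(hostname)[2]
    except OSError:
        return []


def verify_crawler(
    ip: str,
    reverse: Callable[[str], Optional[str]] = reverse_lookup,
    forward: Callable[[str], List[str]] = forward_lookup,
) -> Optional[str]:
    """Return the crawler name if the IP passes forward-confirmed reverse DNS."""
    host = reverse(ip)
    if host is None:
        return None
    host = host.rstrip(".").lower()
    for name, suffixes in CRAWLER_SUFFIXES.items():
        # The suffix alone is not enough: anyone can set a PTR record, so the
        # hostname must also resolve back to the original IP.
        if host.endswith(suffixes) and ip in forward(host):
            return name
    return None
```

The resolvers are passed in as parameters so the logic can be exercised without live DNS.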

Common Use Cases

Analytics Traffic Filtering

Bot traffic contaminates session metrics, conversion rates, bounce rates, and funnel analysis. A crawl spike from a misbehaving scraper can inflate pageview counts by orders of magnitude and make A/B test results statistically meaningless. Filtering bot IPs at the analytics ingestion layer — before events are written to the data warehouse — keeps reporting clean without requiring retroactive cleanup. At 5ms latency, the bot check adds no perceptible overhead to a server-side analytics pipeline.
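
A rough sketch of filtering at the ingestion layer follows. It assumes each event is a dict with an "ip" key and that is_bot is any callable wrapping the lookup; both are illustrative choices, not part of the API.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Event = Dict[str, object]


def partition_events(
    events: Iterable[Event],
    is_bot: Callable[[str], bool],
) -> Tuple[List[Event], List[Event]]:
    """Split events into (human, bot) lists before they reach the warehouse."""
    human: List[Event] = []
    bots: List[Event] = []
    for event in events:
        # Route each event by the verdict for its source IP.
        (bots if is_bot(str(event["ip"])) else human).append(event)
    return human, bots
```

Keeping the bot events in a separate list, rather than discarding them, leaves room to report crawl volume on its own.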

Search Engine Crawler Identification for SEO Tools

SEO platforms and log analysis tools need to identify and segment Googlebot, Bingbot, and Yandexbot visits from regular traffic to analyze crawl budget consumption, detect crawl anomalies, and verify that newly published pages are being indexed on schedule. The API’s bot_name field returns the specific crawler name, making it straightforward to build per-crawler reports without maintaining a local database of crawler IP ranges that quickly becomes stale.

WordPress Security and Anti-Scraping Plugins

WordPress sites attract a disproportionate share of automated traffic: vulnerability scanners probing for outdated plugins, content scrapers harvesting posts for republication, and credential stuffers targeting wp-login.php. A security plugin can call the Bot Detector API via wp_remote_get() during the init hook, classify the incoming IP, and apply differentiated responses: allow legitimate search engine crawlers through unimpeded, rate-limit unknown bots, and hard-block known malicious ones. Because init fires after WordPress core and plugins have loaded but before the main query runs and any template is rendered, blocked requests skip the most expensive part of page generation. Combined with a transient cache keyed on IP, the overhead is negligible under sustained bot traffic.

Ad Fraud Prevention

Click fraud in pay-per-click advertising campaigns costs advertisers billions annually. Bots simulating ad clicks often originate from datacenters and compromised hosting infrastructure, which are exactly the IP ranges the Bot Detector API identifies via ASN and hosting detection. Validating click source IPs before recording conversions or billing events filters out a substantial share of fraudulent clicks without requiring behavioral analysis, which takes time to accumulate sufficient signal.

Rate Limiting and Resource Protection

Search engine crawlers that ignore Crawl-delay directives in robots.txt, or aggressive scrapers looping through paginated content, can consume enough server resources to degrade performance for real users. Detecting the bot at the middleware or reverse proxy level — and applying a separate, stricter rate limit bucket to bot IPs — protects origin capacity without blocking legitimate crawlers outright. Known good bots like Googlebot can be given a dedicated rate limit profile rather than being lumped in with malicious traffic.
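
One way to realize separate rate limit buckets is a token bucket per (traffic class, IP) pair, as sketched below. The class names, the numeric limits in PROFILES, and the injectable clock are all assumptions for illustration; tune the rates to your own capacity.

```python
import time
from typing import Callable, Dict, Optional, Tuple

SEARCH_CRAWLERS = {"google", "bing", "yandex"}

# (tokens per second, burst capacity) per traffic class; illustrative numbers.
PROFILES: Dict[str, Tuple[float, float]] = {
    "human": (10.0, 20.0),
    "search": (5.0, 10.0),
    "other_bot": (1.0, 2.0),
}


def traffic_class(bot_name: Optional[str]) -> str:
    """Map a bot_name (None means no match) to a rate limit profile."""
    if bot_name is None:
        return "human"
    return "search" if bot_name in SEARCH_CRAWLERS else "other_bot"


class TokenBucket:
    def __init__(self, rate: float, capacity: float, now: float) -> None:
        self.rate, self.capacity = rate, capacity
        self.tokens, self.updated = capacity, now

    def allow(self, now: float) -> bool:
        # Refill in proportion to elapsed time, capped at the burst capacity.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class ClassRateLimiter:
    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self.clock = clock
        self.buckets: Dict[Tuple[str, str], TokenBucket] = {}

    def allow(self, ip: str, bot_name: Optional[str]) -> bool:
        """Return True if this request fits within its class's bucket."""
        cls = traffic_class(bot_name)
        now = self.clock()
        bucket = self.buckets.get((cls, ip))
        if bucket is None:
            rate, capacity = PROFILES[cls]
            bucket = self.buckets[(cls, ip)] = TokenBucket(rate, capacity, now)
        return bucket.allow(now)
```

A production limiter would also evict idle buckets; that is omitted here to keep the sketch short.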

Paywall and Content Access Control

Media publishers operating metered paywalls, where readers get a fixed number of free articles before hitting a subscription prompt, need to ensure crawlers can index content for SEO without consuming metered article counts. Detecting search engine bot IPs and serving them the full article content, while serving humans the paywall prompt after the free limit, is a common way to implement Google's Flexible Sampling guidance (the successor to First Click Free) and keeps indexed content consistent with what subscribers read.

Server-Side Personalization and A/B Testing

A/B testing frameworks that assign variants server-side must exclude bot traffic from experiment populations. A bot consistently assigned to variant B inflates that variant’s session count, dilutes statistical significance, and can skew results if the bot exhibits any click or engagement behavior. Filtering bots out of experiment assignment at the variant selection step keeps test populations clean without requiring post-hoc filtering of result data.
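
Exclusion at the variant selection step might look like the sketch below. It assumes the caller has already resolved bot_name (None for no match) and uses a deterministic hash so a given user always lands in the same variant; the function name and the hashing scheme are illustrative, not a prescribed method.

```python
import hashlib
from typing import Optional, Sequence


def assign_variant(
    user_id: str,
    experiment: str,
    bot_name: Optional[str],
    variants: Sequence[str] = ("A", "B"),
) -> Optional[str]:
    """Return a variant for human traffic, or None to keep bots out of the test."""
    if bot_name is not None:
        return None  # bots never enter the experiment population
    # Hash experiment and user together so assignments are stable per user
    # and independent across experiments.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    return variants[int.from_bytes(digest[:8], "big") % len(variants)]
```

Requests that get None can be served the default experience and left out of exposure logging.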

Log Enrichment and Infrastructure Monitoring

Access logs from web servers, CDNs, and load balancers contain millions of entries that are meaningless for capacity planning if bot traffic is not separated from human traffic. Enriching log entries with bot detection data — either inline during request processing or as a batch post-processing step — enables infrastructure teams to accurately attribute bandwidth consumption, identify which bots are generating the most load, and make informed decisions about crawler access policies.
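
As a batch post-processing sketch, the function below totals response bytes per traffic source from access log lines in the common log format. The regular expression, the "human" label for unmatched IPs, and the lookup callable (any function returning a bot name or None) are assumptions for illustration.

```python
import re
from collections import Counter
from typing import Callable, Dict, Iterable, Optional

# Common log format: ip ident user [time] "request" status bytes
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} (?P<bytes>\d+|-)'
)


def bytes_by_source(
    lines: Iterable[str],
    lookup: Callable[[str], Optional[str]],
) -> Counter:
    """Sum response bytes per bot name, with unmatched IPs grouped as "human"."""
    totals: Counter = Counter()
    names: Dict[str, str] = {}
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that are not in the expected format
        ip = match["ip"]
        if ip not in names:
            # Look up each distinct IP once, however many lines it appears on.
            names[ip] = lookup(ip) or "human"
        size = 0 if match["bytes"] == "-" else int(match["bytes"])
        totals[names[ip]] += size
    return totals
```

Because each distinct IP is resolved once per batch, the number of API calls scales with unique addresses rather than with log volume.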

Integration Examples

cURL

curl "https://api.globus.studio/v2/bot_ip?ip=66.249.66.1&format=json"

JavaScript (Fetch API)

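// Top-level await requires an ES module; otherwise wrap this in an async function.
const res = await fetch('https://api.globus.studio/v2/bot_ip?ip=66.249.66.1&format=json');
if (!res.ok) throw new Error(`Lookup failed: HTTP ${res.status}`);
const data = await res.json();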

if (data.bot_detect) {
  console.log('Bot detected:', data.bot_name);
  // skip analytics event, apply rate limit, etc.
} else {
  console.log('Human traffic — proceed normally');
}

PHP

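$ip  = '66.249.66.1';
$url = 'https://api.globus.studio/v2/bot_ip?ip=' . rawurlencode($ip) . '&format=json';

// file_get_contents() returns false on a network error; json_decode() returns null on invalid JSON.
$body     = @file_get_contents($url);
$response = $body !== false ? json_decode($body, true) : null;

if (!is_array($response)) {
    echo 'Lookup failed';
} elseif (!empty($response['bot_detect'])) {
    echo 'Bot: ' . htmlspecialchars($response['bot_name']);
} else {
    echo 'Not a bot';
}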

Python

import requests

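response = requests.get(
    'https://api.globus.studio/v2/bot_ip',
    params={'ip': '66.249.66.1', 'format': 'json'},
    timeout=2,  # fail fast instead of hanging if the API is unreachable
)
response.raise_for_status()  # surface HTTP errors before parsing the body
data = response.json()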

if data['bot_detect']:
    print(f"Bot detected: {data['bot_name']}")
else:
    print('No bot match')

WordPress (PHP) — Init Hook with Transient Cache

add_action( 'init', function () {
    $ip        = $_SERVER['REMOTE_ADDR'] ?? '';
    $cache_key = 'bot_ip_' . md5( $ip );
    $cached    = get_transient( $cache_key );

    if ( false === $cached ) {
        $response = wp_remote_get(
            'https://api.globus.studio/v2/bot_ip?ip=' . rawurlencode( $ip ) . '&format=json',
            [ 'timeout' => 2 ]
        );
        // Fail open: on a network error, skip the check and do not cache the failure.
        if ( is_wp_error( $response ) ) {
            return;
        }
        $cached = json_decode( wp_remote_retrieve_body( $response ), true );
        if ( ! is_array( $cached ) ) {
            return;
        }
        set_transient( $cache_key, $cached, HOUR_IN_SECONDS );
    }

    if ( ! empty( $cached['bot_detect'] ) ) {
        $bot_name = $cached['bot_name'] ?? '';
        // Allow search engine crawlers; every other named bot is blocked here.
        // Add any monitoring bots you rely on to this list before deploying.
        if ( ! in_array( $bot_name, [ 'google', 'bing', 'yandex' ], true ) ) {
            wp_die( 'Access denied.', 403 );
        }
    }
} );

Node.js — Express Middleware

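// Uses the built-in fetch() in Node.js 18+; node-fetch v3 is ESM-only and cannot be loaded with require().
const express = require('express');
const app = express();

const SEARCH_CRAWLERS = new Set(['google', 'bing', 'yandex']);

async function botDetectMiddleware(req, res, next) {
  try {
    // req.ip can be IPv6-mapped (::ffff:1.2.3.4) and needs 'trust proxy' behind a load balancer.
    const response = await fetch(
      `https://api.globus.studio/v2/bot_ip?ip=${encodeURIComponent(req.ip)}&format=json`
    );
    const data = await response.json();

    req.isBot   = Boolean(data.bot_detect);
    req.botName = data.bot_name !== 'undefined' ? data.bot_name : null;
  } catch (err) {
    // Fail open: if the lookup fails, treat the request as normal traffic.
    req.isBot   = false;
    req.botName = null;
  }
  next();
}

app.use(botDetectMiddleware);

// Later in a route:
app.get('/article', (req, res) => {
  if (req.isBot && SEARCH_CRAWLERS.has(req.botName)) {
    // serve full content to search crawlers
  } else if (req.isBot) {
    // any other named bot is refused
    res.status(403).end();
  } else {
    // normal user flow
  }
});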

Distinguishing Good Bots from Bad Bots

Not all bots should be treated equally. The API’s bot_name field makes it possible to implement tiered bot policies rather than a single allow/block binary:

  • Search engine crawlers (google, bing, yandex) — should generally be allowed full access to indexable content and excluded from analytics, rate limits, and paywall enforcement.
  • Monitoring and uptime bots — typically benign; may be allowed or rate-limited depending on resource impact.
  • Malicious and parasitic bots — should be hard-blocked at the earliest possible point in the request lifecycle, before any application logic executes.
  • undefined — no bot match; treat as a human user and apply normal application logic.
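
The tiers above can be expressed as a small lookup, sketched here. The search crawler names come from the response examples; the monitoring names are hypothetical placeholders, since this page does not list the values the API returns for that tier. Blocking every unrecognized name by default is one possible choice, and a more cautious deployment might rate-limit instead.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    ALLOW = "allow"            # full access, skip analytics and paywall
    RATE_LIMIT = "rate_limit"  # benign but capped
    BLOCK = "block"            # reject as early as possible
    NORMAL = "normal"          # treat as a human user


SEARCH_CRAWLERS = {"google", "bing", "yandex"}
# Hypothetical labels: check the API documentation for the real values.
MONITORING_BOTS = {"uptime-monitor", "status-checker"}


def policy_for(bot_name: Optional[str]) -> Action:
    """Map a bot_name value to a tiered policy action."""
    if bot_name is None or bot_name == "undefined":
        return Action.NORMAL
    if bot_name in SEARCH_CRAWLERS:
        return Action.ALLOW
    if bot_name in MONITORING_BOTS:
        return Action.RATE_LIMIT
    return Action.BLOCK
```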

Database Freshness

Bot operator IP ranges shift frequently — cloud providers reallocate address blocks, botnets rotate exit IPs, and search engines expand crawler infrastructure as they scale indexing capacity. The Bot Detector API’s database is refreshed every hour, making it one of the most frequently updated IP intelligence sources available via a public API. This update cadence is particularly important for malicious bot detection, where stale data quickly becomes ineffective as operators deliberately cycle their infrastructure to evade static blocklists.

Performance

At 5ms average latency, bot detection adds no meaningful overhead to server-side request handling. The lookup combines in-memory database queries with lightweight PTR resolution, keeping response times consistent regardless of the IP being queried. For high-traffic sites, caching results by IP with a short TTL — as shown in the WordPress example above — reduces origin API calls to a fraction of total request volume while maintaining effective detection coverage.
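
Outside WordPress, the same caching idea can be sketched as a small in-process TTL cache. The class name, the injectable clock, and the fetch callable (any function that performs the actual lookup) are assumptions for illustration.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLCache:
    """Cache lookup results per IP for a fixed number of seconds."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries: Dict[str, Tuple[float, Optional[str]]] = {}

    def get_or_fetch(self, ip: str, fetch: Callable[[str], Optional[str]]) -> Optional[str]:
        """Return the cached verdict for ip, calling fetch only on a miss or expiry."""
        now = self.clock()
        hit = self._entries.get(ip)
        if hit is not None and hit[0] > now:
            return hit[1]
        value = fetch(ip)
        # Store the expiry time with the value; "no match" (None) is cached too.
        self._entries[ip] = (now + self.ttl, value)
        return value
```

A short TTL bounds how long a stale verdict can survive after the hourly database refresh.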

Explore all response fields and run live lookups on the Bot Detector API documentation page.