What is Dorking?
Google dorking has been around for a long time, but not many people know about it. It essentially takes advantage of search engines' indexing and query operators to locate specific text or URLs. Traditionally it has been used to identify vulnerabilities in web apps, but I think there are far more places where it can be applied.
How Does It Work?
If you ever took a media literacy class in grade school, they may have told you that when searching for something, you can wrap a phrase in quotation marks to ensure all results contain that exact string. This is one of many dorking techniques, yet it is generally the only widely known one.
Google (and any other search engine) has a set of search operators that let you pinpoint your results. Dorking exploits the fact that these search engines index every webpage they can find.
As a result, dorking can reveal a lot of information: as long as something is publicly accessible, you can find it. Something I feel I must note: this does not breach any laws or a search engine's Terms of Service.
Some of the Techniques
With dorking you can specify:
- File Type (filetype:)
- In URL (inurl:)
- In Title (intitle:)
- A Link (link:)
- A Site (site:)
Any of these can be required to appear in your search results; a few concrete examples are shown below. This means something that always felt a little random, like searching the web, can become fairly deterministic.
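To make those operators concrete, here are a few illustrative queries (the sites and terms are made up for the example):

```
filetype:pdf "annual report" 2023        # only PDF files
inurl:login site:example.com             # URLs containing "login", on one site
intitle:"index of" backups               # page titles containing the phrase
site:example.edu "machine learning"      # results restricted to one site
```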
How I Have Found It Useful
Say you are looking for a particular set of people. Maybe you want information on professionals who all likely share a common trait. You know you want LinkedIn profiles, so you know you want LinkedIn URLs.
A query like site:linkedin.com/in/ "{focus_area}" {location} will force every result to be a LinkedIn profile that mentions that focus area somewhere, along with a location similar to the one you specified.
Dorking truly takes advantage of the power of search engines. But how do we get it to work at a more production level? Who wants to sit there writing query after query to gather their initial dataset?
The Power of DDGS
This is where DDGS comes in. DDGS stands for DuckDuckGo Search, and it's a Python library that lets you run search queries programmatically. What started as a wrapper around DuckDuckGo's search has evolved into something more powerful: it is now a metasearch library that can pull results from multiple search engines, including Google, Bing, Brave, and others.
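As a minimal sketch of what that looks like (assuming the package is installed as ddgs; older releases shipped as duckduckgo_search, so your import may differ):

```python
# A minimal DDGS text search.
# Assumption: pip install ddgs; older versions used
# "from duckduckgo_search import DDGS" instead.
from ddgs import DDGS

with DDGS() as ddgs:
    # Each result is a dict with keys like "title", "href", and "body".
    for result in ddgs.text("google dorking", max_results=5):
        print(result["title"], result["href"])
```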
Why Does This Matter?
We're living through an interesting moment in the history of the internet. Large language models need a lot of data, and the companies building them have been scraping the web at unprecedented scale. Google and other search providers have noticed, and they're not too happy about this use (some may call it abuse) of their search engines.
As a result, search engines have gotten extremely aggressive about rate limiting and bot detection. If you've ever tried to automate Google searches, it only takes a few queries before you run into CAPTCHAs, blocks, or just mysteriously empty results. It's insanely frustrating, and it's only getting worse.
The reason DDGS started with DuckDuckGo is that it has historically been more permissive about programmatic access. DDGS takes advantage of this, giving you a way to run automated searches without immediately getting shut down. That said, it is far from a magic bullet; you can still get rate limited if you're too aggressive. But it is a much friendlier starting point than trying to scrape Google directly.
What Can DDGS Actually Do?
At its core, DDGS lets you run searches and get back structured results. But it's not just basic web search. The library supports:
Text Search — Your standard web search. You give it a query, it gives you back titles, URLs, and snippets. Simple.
Image Search — Find images matching your query. You can filter by size, type, color, and other attributes.
News Search — Get recent news articles. You can limit results to the past day, week, month, or year.
Video Search — Search for videos with resolution and duration filters.
Instant Answers — DuckDuckGo's instant answer feature, which pulls quick facts and definitions.
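The other search types follow the same pattern. A quick hedged sketch (parameter names like timelimit and size are taken from the library's documented API, but double-check the README for your version):

```python
from ddgs import DDGS

with DDGS() as ddgs:
    # News from roughly the past week; assumed timelimit values are
    # "d", "w", "m", "y" for day/week/month/year.
    news = ddgs.news("search engine scraping", timelimit="w", max_results=5)

    # Image search with an attribute filter.
    images = ddgs.images("server rack", size="Large", max_results=5)

    for article in news:
        print(article.get("date"), article.get("title"))
```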
The real power comes from combining these capabilities with dorking operators: now we can programmatically dork to get tons of URLs for candidates likely to fit our criteria.
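Here's what that looks like with the LinkedIn dork from earlier, with example values filled in (the focus area and location are placeholders):

```python
from ddgs import DDGS

# The earlier dork, with example values substituted for the placeholders.
query = 'site:linkedin.com/in/ "machine learning" Boston'

with DDGS() as ddgs:
    # Every "href" should be a LinkedIn profile URL that satisfies the dork.
    urls = [r["href"] for r in ddgs.text(query, max_results=50)]

print(f"Collected {len(urls)} candidate profile URLs")
```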
The Proxy Situation
At a certain point, any search engine will get annoyed if you're hammering its servers from a single IP address. Thankfully, the internet protocol is incredibly stupid (as it has to be; another post soon on this), and there is an answer: proxies.
DDGS has built-in proxy support, and it's surprisingly flexible. You can route your traffic through HTTP, HTTPS, or SOCKS5 proxies. If you're feeling privacy-conscious, there's even a shorthand for routing everything through Tor: just use "tb" as your proxy setting and it connects to Tor Browser's default proxy (you do you man, I'm just the messenger).
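In code, pointing DDGS at a proxy is a one-liner (the parameter is proxy in recent versions; the endpoint below is a placeholder):

```python
from ddgs import DDGS

# Route all requests through one proxy; swap in your own
# HTTP/HTTPS/SOCKS5 endpoint.
ddgs = DDGS(proxy="socks5://user:pass@proxy.example.com:1080")

# Shorthand for Tor Browser's default local SOCKS proxy.
tor_ddgs = DDGS(proxy="tb")
```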
For serious production use, you'll want rotating proxies. Instead of making all your requests from one IP address, you spread them across the globe. Each request (or each batch of requests) comes from a different IP, which makes it significantly more difficult for a search engine to recognize you as you and block you.
If you're not using a rotating proxy service that handles distribution automatically, you should initialize a new DDGS instance with a fresh proxy for each search, as sketched below. It's a bit more work, but it's the difference between a script that runs for five minutes and one that runs for five hours. Or, you can just shell out the coin for a rotating proxy service; it costs next to nothing and makes your life much easier than building out your own proxy infra. For example, I have been using Bright Data and it has worked quite well for me.
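A rough sketch of that fresh-instance pattern, with a hypothetical proxy pool (the endpoints are placeholders):

```python
import itertools

from ddgs import DDGS

# Hypothetical pool; replace with endpoints you actually control or rent.
PROXIES = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    "http://user:pass@proxy3.example.com:8080",
]
proxy_cycle = itertools.cycle(PROXIES)

def search(query, max_results=20):
    # A fresh DDGS instance with the next proxy for every search, so
    # consecutive queries leave from different IP addresses.
    with DDGS(proxy=next(proxy_cycle)) as ddgs:
        return ddgs.text(query, max_results=max_results)
```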
Residential proxies are the gold standard here. These are IP addresses associated with real internet service providers and real home connections. They're much harder to detect and block than datacenter IPs. Now, you can wade into some murky ethical waters here: some providers kind of sort of send requests through people's home routers without their consent. So, be a good person and use a provider that sources its IPs appropriately.
The Rate Limiting Reality
No matter what, you will eventually hit rate limits if you're doing anything at scale. DDGS has specific exceptions for this; there's a RatelimitException (shocker) that gets thrown when you've pushed too hard.
The smart approach is to build delays into your scripts, handle these exceptions gracefully, and back off when you need to. Exponential backoff + jitter is your friend. It's not glamorous, but it works.
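A minimal sketch of that backoff loop (the exception's import path is an assumption; older versions exposed it from duckduckgo_search.exceptions):

```python
import random
import time

from ddgs import DDGS
# Assumed import path; older versions:
# from duckduckgo_search.exceptions import RatelimitException
from ddgs.exceptions import RatelimitException

def search_with_backoff(query, max_results=20, max_retries=5):
    for attempt in range(max_retries):
        try:
            with DDGS() as ddgs:
                return ddgs.text(query, max_results=max_results)
        except RatelimitException:
            # Exponential backoff plus jitter: 1s, 2s, 4s, ... with a random
            # offset so parallel workers don't retry in lockstep.
            time.sleep(2 ** attempt + random.uniform(0, 1))
    raise RuntimeError(f"Still rate limited after {max_retries} retries")
```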
Why Not Just Use Google's API?
You might be wondering: why go through all this trouble? Doesn't Google have an official API for search?
They do. It's called the Custom Search JSON API. And it's... fine. But there are some stupid limitations. You get 100 free queries per day. After that, you're paying $5 per 1,000 queries. If you need to run thousands or tens of thousands of searches, that adds up fast. On top of this, learning which queries work for your use case is an iterative process. Trust me, queries add up quick. $5 per 1,000 queries is pretty ridiculously expensive, especially when you can do it for free, not counting the proxies.
More importantly, the Custom Search API doesn't give you access to the full Google index. You have to set up a "Custom Search Engine" that's limited to specific sites, or pay extra for web-wide search. It's clearly designed for embedding search into your own website, not for the broad data you and I are looking for.
DDGS sidesteps all of this. It's free, it's not restricted to specific sites, and it doesn't require API keys or account setup. The tradeoff is that you're working against the grain of what search engines want you to do.
Multiple Search Engines
One of the more recent developments in DDGS is support for multiple search backends. The library isn't just querying DuckDuckGo anymore; it can route searches through Google, Bing, Brave, and others.
This is useful for a few reasons. Different search engines have different indexes and different ranking algorithms, so if you're doing research, getting results from multiple sources gives you a more complete picture. It also helps with reliability: if one backend is having issues or rate limiting you aggressively, you can switch to another.
The library handles a lot of this automatically. There's an "auto" backend setting that tries to pick the best available option and falls back gracefully when things go wrong.
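In practice this is just a parameter (the backend names here are assumptions based on the engines the project advertises; check the README for the exact accepted values):

```python
from ddgs import DDGS

with DDGS() as ddgs:
    # Pin a specific engine...
    google_hits = ddgs.text("dorking operators", backend="google", max_results=10)
    # ...or let the library pick and fall back on failure.
    auto_hits = ddgs.text("dorking operators", backend="auto", max_results=10)
```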
Async
If you're building something that needs to run a lot of searches quickly, DDGS has an async interface. Instead of running searches one at a time and waiting for each to complete, you can fire off multiple searches concurrently and collect the results as they come in.
This is a significant performance improvement for the right use cases. Of course, be careful: rate limits still apply.
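The exact async interface has shifted across versions, so rather than pin it down, here is a version-agnostic sketch that fans out the synchronous client with asyncio:

```python
import asyncio

from ddgs import DDGS

def run_search(query):
    # The plain synchronous search from before.
    with DDGS() as ddgs:
        return ddgs.text(query, max_results=10)

async def gather_searches(queries):
    # Push each blocking search onto a worker thread and await them all
    # concurrently. Rate limits still apply, so keep the fan-out modest.
    return await asyncio.gather(
        *(asyncio.to_thread(run_search, q) for q in queries)
    )

results = asyncio.run(gather_searches([
    'site:linkedin.com/in/ "data engineer" Boston',
    'site:linkedin.com/in/ "data engineer" Chicago',
]))
```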
The Command Line Option
Not everything needs to be a full Python script. DDGS comes with a command line interface that's pretty capable. You can run searches, filter by region or time period, and even download files directly from the results. I typically use this when I am assessing relevant criteria to get my desired URL set.
It's great for quick one-off searches or for piping results into other command line tools. If you just need to grab some PDF files matching a certain query, you can do it in one line without writing any code.
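For instance, something like this (the flags are an assumption from memory; run ddgs --help for the real interface on your version):

```
# Grab results for a dork straight from the terminal.
# Flags are illustrative; check "ddgs --help".
ddgs text -q 'filetype:pdf "neural networks"' -m 10
```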
Wrapping Up
Of course, there is only so much you can do with a set of URLs and the brief description a search engine provides. The next step would be scraping those URLs to get the entire page of information. That can be a fairly intensive process; LinkedIn, in fact, has some of the best bot detection in the world. But, as I have continued to learn, where there is a will, there is a way.
For more details and updates, check out the DDGS GitHub repository.