The Ultimate Guide to the Invisible Web

Search engines are, in a sense, the heartbeat of the internet; “Googling” has become a part of everyday speech and is even recognized by Merriam-Webster as a grammatically correct verb. It’s a common misconception, however, that Googling a search term will reveal every site out there that addresses your search. Typical search engines like Google, Yahoo, or Bing actually access only a tiny fraction — estimated at 0.03% — of the internet. The sites that traditional searches yield are part of what’s known as the Surface Web, which is composed of indexed pages that a search engine’s web crawlers are programmed to retrieve.

"As much as 90 percent of the internet is only accessible through deep web websites."

So where’s the rest? The vast majority of the Internet lies in the Deep Web, sometimes referred to as the Invisible Web. The actual size of the Deep Web is impossible to measure, but many experts estimate it is about 500 times the size of the web as we know it.

So what is the Deep Web, exactly? Deep Web pages operate just like any other site online, but they are constructed so that their existence is invisible to crawlers. While recent news, such as the bust of the infamous drug trafficking site Silk Road and Edward Snowden’s NSA revelations, has spotlighted the Deep Web’s existence, it’s still largely misunderstood.

Search Engines and the Surface Web

Understanding how surface pages are indexed by search engines can help you understand what the Deep Web is all about. In the early days, computing power and storage space were at such a premium that search engines indexed a minimal number of pages, often storing only partial content. The methodology behind searching reflected users’ intentions; early Internet users generally sought research, so the first search engines indexed simple queries that students or other researchers were likely to make. Search results consisted of actual content that a search engine had stored.

Over time, advancing technology made it profitable for search engines to do a more thorough job of indexing site content. Today’s web crawlers, or spiders, use sophisticated algorithms to collect page data from hyperlinked pages. These robots maneuver their way through all linked data on the Internet, earning their spidery nickname. Every surface site is indexed by metadata that crawlers collect. This metadata, consisting of elements such as page title, page location (URL) and repeated keywords used in text, takes up much less space than actual page content. Instead of the cached content dump of old, today’s search engines speedily and efficiently direct users to websites that are relevant to their queries.
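Conceptually, the metadata collection described above can be sketched with Python’s built-in `html.parser`. The `MetadataCollector` class and the sample page below are invented for illustration — they show the kind of data (title, keywords, outgoing links) a crawler records, not any real engine’s code:

```python
from html.parser import HTMLParser

class MetadataCollector(HTMLParser):
    """Collects the metadata a crawler typically indexes:
    page title, meta keywords, and outgoing hyperlinks."""
    def __init__(self):
        super().__init__()
        self.title = ""
        self.keywords = []
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") == "keywords":
            self.keywords = [k.strip() for k in attrs.get("content", "").split(",")]
        elif tag == "a" and "href" in attrs:
            # Hyperlinks are what let the spider reach the next page.
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

# A hypothetical surface-web page.
page = """<html><head><title>Example Page</title>
<meta name="keywords" content="deep web, crawler"></head>
<body><a href="https://example.com/next">Next</a></body></html>"""

collector = MetadataCollector()
collector.feed(page)
print(collector.title)  # Example Page
print(collector.links)  # ['https://example.com/next']
```

Because the crawler only discovers new pages through the collected links, any page nothing links to simply never enters the index.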

To get a sense of how search engines have improved over time, Google’s interactive breakdown “How Search Works” details all the factors at play in every Google search. In a similar vein, Moz.com’s timeline of Google’s search engine algorithm will give you an idea of how nonstop the efforts have been to refine searches. How these efforts impact the Deep Web is not exactly clear. But it’s reasonable to assume that if major search engines keep improving, ordinary web users will be less likely to seek out arcane Deep Web searches.

How is the Deep Web Invisible to Search Engines?

Search engines like Google are extremely powerful and effective at distilling up-to-the-moment web content. What they lack, however, is the ability to index the vast amount of data that isn’t hyperlinked, and therefore isn’t immediately accessible to a web crawler. This may or may not be intentional; for example, content behind a paywall and a blog post that’s written but not yet published both technically reside in the Deep Web.

Some examples of other Deep Web content include:

  • Data that needs to be accessed by a search interface
  • Results of database queries
  • Subscription-only information and other password-protected data
  • Pages that are not linked to by any other page
  • Technically limited content, such as that requiring CAPTCHA technology
  • Text content that exists outside of conventional http:// or https:// protocols

While the scale and diversity of the Deep Web are staggering, its notoriety – and appeal – come from the fact that users are anonymous on the Deep Web, and so are their activities. Because of this, it’s been an important tool for governments; the U.S. Naval Research Laboratory first launched intelligence tools for Deep Web use in 2003.

Unfortunately, this anonymity has created a breeding ground for criminal elements who take advantage of the opportunity to hide illicit activities. Illegal pornography, drugs, weapons, and passports are just a few of the items available for purchase on the Deep Web. However, the existence of sites like these doesn’t mean that the Deep Web is inherently evil; anonymity has its value, and many users simply prefer to operate within an untraceable system on principle.

"Anonymity has its value, and many users simply prefer to operate within an untraceable system on principle."

Just as Deep Web content can’t be traced by web crawlers, it also can’t be accessed via conventional means. The same Naval research group that developed intelligence-gathering tools created The Onion Router Project, now known by its acronym TOR. Onion routing wraps Internet communications in layers of encryption, with each relay in the network removing one layer, similar to peeling back the layers of an onion. TOR users’ identities and network activities are concealed by this software. TOR, and other software like it, offers an anonymous connection to the Deep Web. It is, in effect, your Deep Web search engine.
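The onion metaphor can be illustrated with a toy sketch. The XOR “encryption” and the relay keys below are stand-ins invented for this example — real onion routing uses layered public-key cryptography — but the structure is the same: the sender applies one layer per relay, and each relay peels off exactly one:

```python
from itertools import cycle

def xor_layer(data: bytes, key: bytes) -> bytes:
    # Toy cipher only: XOR is its own inverse, so the same
    # function both adds and removes a layer.
    return bytes(b ^ k for b, k in zip(data, cycle(key)))

# Hypothetical three-relay circuit; each relay knows only its own key.
relay_keys = [b"entry-key", b"middle-key", b"exit-key"]

def wrap(message: bytes, keys) -> bytes:
    # The sender encrypts in reverse order: the exit relay's layer
    # goes on first, the entry relay's layer goes on last.
    for key in reversed(keys):
        message = xor_layer(message, key)
    return message

def route(message: bytes, keys) -> bytes:
    # Each relay in turn peels off one layer, like peeling an onion;
    # no single relay ever sees both the sender and the plaintext.
    for key in keys:
        message = xor_layer(message, key)
    return message

onion = wrap(b"hello deep web", relay_keys)
print(onion != b"hello deep web")       # True: unreadable in transit
print(route(onion, relay_keys))         # b'hello deep web'
```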

But in spite of its back-alley reputation, there are plenty of legitimate reasons to use TOR. For one, TOR lets users avoid “traffic analysis” and the monitoring tools used by commercial sites to determine web users’ location and the network they are connecting through. These businesses can then use this information to adjust pricing, or even what products and services they make available.

According to the Tor Project site, the program also allows people to, “[…] Set up a website where people publish material without worrying about censorship.” While this is by no means a clear good or bad thing, the tension between censorship and free speech is felt the world over. The Deep Web furthers that debate by demonstrating what people can and will do to overcome political and social censorship.

Reasons a Page is Invisible

When an ordinary search engine query comes back with no results, that doesn’t necessarily mean there is nothing to be found. An “invisible” page isn’t necessarily inaccessible; it’s simply not indexed by a search engine. There are several reasons why a page may be invisible. Keep in mind that some pages are only temporarily invisible, possibly slated to be indexed at a later date.

Too many parameters

Search engines have traditionally ignored web pages whose URLs contain a long string of parameters, equal signs, and question marks, on the chance that the pages duplicate what’s already in their database – or worse – that the spider will somehow go around in circles. This parameter-driven content is sometimes known as the “Shallow Web,” and a number of workarounds have been developed to help you access it.
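That skip-it heuristic can be sketched with the standard `urllib.parse` module. The `looks_like_crawler_trap` function and its three-parameter threshold are invented for illustration, not any engine’s actual rule:

```python
from urllib.parse import urlparse, parse_qs

def looks_like_crawler_trap(url: str, max_params: int = 3) -> bool:
    """Flag URLs whose query strings carry many parameters, which
    early crawlers skipped to avoid indexing duplicates or getting
    stuck in loops of machine-generated pages. The threshold is an
    assumption made for this sketch."""
    params = parse_qs(urlparse(url).query)
    return len(params) > max_params

print(looks_like_crawler_trap("https://example.com/page?id=7"))  # False
print(looks_like_crawler_trap(
    "https://example.com/search?q=x&sort=asc&page=2&sid=abc&ref=home"))  # True
```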

Form-controlled entry that's not password-protected

In this case, page content only gets displayed when a human applies a set of actions, mostly entering data into a form (specific query information, such as job criteria for a job search engine). This typically includes databases that generate pages on demand. Applicable content includes travel industry data (flight info, hotel availability), job listings, product databases, patents, publicly-accessible government information, dictionary definitions, laws, stock market data, phone books, and professional directories.
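A minimal sketch of this on-demand generation, using a made-up jobs “database”: the results page below exists only after a query, so a crawler following hyperlinks alone never encounters it.

```python
# A tiny stand-in "database" of job listings (invented for this example);
# real sites hold this content in backend databases.
JOBS = [
    {"title": "Librarian", "city": "Boston"},
    {"title": "Archivist", "city": "Chicago"},
    {"title": "Data Analyst", "city": "Boston"},
]

def render_results(city: str) -> str:
    """Generates a results page on demand, the way a job search
    engine does when a user submits its form. With no static link
    pointing at this output, crawlers never see it."""
    matches = [job["title"] for job in JOBS if job["city"] == city]
    if not matches:
        return "<p>No results found.</p>"
    return "<ul>" + "".join(f"<li>{t}</li>" for t in matches) + "</ul>"

print(render_results("Boston"))  # <ul><li>Librarian</li><li>Data Analyst</li></ul>
```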

Password-protected access, with or without a subscription

This includes VPN (virtual private networks) and any website where pages require a username and password. Access may or may not be by paid subscription. Applicable content includes academic and corporate databases, newspaper or journal content, and academic library subscriptions.

Timed access

On some sites, like major news sources such as The New York Times, free content becomes inaccessible after a certain number of pageviews. Search engines retain the URL, but the page generates a sign up form and the content is moved to a new URL that requires a password.
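A metered paywall like this can be sketched in a few lines. The three-view limit, `serve_article` function, and cookie-style reader ID are assumptions made for illustration:

```python
from collections import defaultdict

FREE_VIEWS = 3  # assumed limit; real sites choose their own

# Per-reader view tally, e.g. keyed by a cookie or account ID.
view_counts = defaultdict(int)

def serve_article(reader_id: str, article_html: str) -> str:
    """Returns the article until the reader hits the metered limit,
    then serves a sign-up form at the same URL instead. The indexed
    URL survives, but the free content behind it is gone."""
    view_counts[reader_id] += 1
    if view_counts[reader_id] > FREE_VIEWS:
        return "<form>Please subscribe to continue reading.</form>"
    return article_html

for _ in range(3):
    page = serve_article("reader-1", "<p>Story text</p>")
print(page)                                            # <p>Story text</p>
print(serve_article("reader-1", "<p>Story text</p>"))  # the sign-up form
```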

Robots exclusion

The robots.txt file, which usually lives in the main directory of a site, tells search robots which files and directories should not be indexed. Hence the name “robots exclusion file.” If this file is set up, it will block certain pages from being indexed, which will then be invisible to searchers. Blog platforms commonly offer this feature.
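Python’s standard `urllib.robotparser` can demonstrate the effect. The robots.txt content below is hypothetical; it blocks one directory from all crawlers, making those pages invisible to searchers even though they remain accessible:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt: every crawler ("*") is told
# to stay out of the /private/ directory.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Well-behaved spiders check this before fetching a page.
print(rp.can_fetch("*", "https://example.com/public/page.html"))   # True
print(rp.can_fetch("*", "https://example.com/private/page.html"))  # False
```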

Hidden pages

There is simply no sequence of hyperlink clicks that could take you to such a page. The pages are accessible, but only to people who know of their existence.

Myths about the invisible web

Drugs, pornography, and other illegal activities are the most talked about aspect of the Deep Web for a reason. Stories about people purchasing heroin online using Bitcoins, a form of electronic currency, or selling weapons internationally make big headlines.

What people don’t realize is that there’s a lot the invisible internet has to offer besides illegal activity. Stereotypes and boogeyman stories keep people away from the Deep Web when there are actually many wonderful reasons to pay it a visit. In countries such as China, where websites are blocked and internet privacy is hard to come by, there’s a growing community of users who use the deep internet to share information and speak freely. Browsers like TOR are still relatively unknown in China, but the number of people using the service is steadily growing. Citizens in Turkey and other politically tumultuous countries are using the deep internet to gather together, plan protests, and discuss local news outside the watchful eye of the government.

Why might the average American want to use the deep internet? Despite its reputation for illegal activity, the deep internet is simply anything not accessible by a simple Google search. As much as 90 percent of the internet is only accessible through deep web websites. Using TOR itself isn’t illegal, nor is visiting many deep web websites; the only illegal activity is what would be illegal out in the real world. On the deep web you can find rare and banned books, hard-to-find news, and even fan fiction. The deep web revives the idea of the internet as a wild west.

How to Access and Search for Invisible Content

If a site is inaccessible by conventional means, there are still ways to access the content, if not the actual pages. Aside from software like TOR, there are a number of entities that make it possible to view Deep Web content, like universities and research facilities.

For invisible content that cannot or should not be visible, there are still a number of ways to get access:

Membership

Join a professional or research association that provides access to records, research, and peer-reviewed journals.

VPN

Access a virtual private network via an employer.

Ask for permission

Request access; this could be as simple as a free registration.

Subscription services

Pay for a subscription to a periodical or other resource whose work you wish to support.

Find a suitable resource

Use an invisible Web directory, portal, or specialized search engine such as Google Book Search or Librarian’s Internet Index.

Using the Deep Web in Education

So where do you, as an educator, come in? The deep web can be used to find information that you couldn’t otherwise access through a simple Google search, and that can prove immeasurably useful to your students and colleagues.

"Beating stereotypes and showing the use of deep web searches is an exciting prospect for students -- they can see that the internet is so much larger than social media and the typical Google or Yahoo searches that they're used to using for school projects and essays."

What people don’t always understand is what exactly constitutes deep web information. Journals and books that can only be accessed through a university library website aren’t findable through Google, and neither are sites that have turned off the ability to be indexed by search engines. For students who need that firewalled content, the ability to search deep web websites becomes a useful tool for school and beyond.

Show students the use in finding hidden search engines, and what kind of information can be found through them. Beating stereotypes and showing the use of deep web searches is an exciting prospect for students — they can see that the internet is so much larger than social media and the typical Google or Yahoo searches that they’re used to using for school projects and essays. Your local library can be a source of tons of un-Googleable information, and through your library, you may be able to utilize sources such as JSTOR and JURN. For more about how to use deep web sources, check out the book Going Beyond Google: The Invisible Web in Learning and Teaching by Jane Devine and Francine Egger-Sider.

Invisible Web Search Tools

Here is a small sampling of invisible web search tools (directories, portals, engines) to help you find invisible content. To see more like these, please look at our Research Beyond Google article.

A List of Deep Web Search Engines

Purdue Owl’s Resources to Search the Invisible Web

Art

Musée du Louvre

Books Online

The Online Books Page

Economic and Job Data

FreeLunch.com

Finance and Investing

Bankrate.com

General Research

GPO’s Catalog of US Government Publications

Government Data

Copyright Records (LOCIS)

International

International Data Base (IDB)

Law and Politics

THOMAS (Library of Congress)

Library of Congress

Library of Congress

Medical and Health

PubMed

Transportation

FAA Flight Delay Information