Saturday, December 5, 2009

Protecting Yourself from Google Hackers


Introduction

The purpose of this book is to help you understand the tactics a Google hacker might employ so that you can properly protect yourself and your customers from this seemingly innocuous threat. The best way to do this, in our opinion, is to show you exactly what an attacker armed with a search engine like Google is capable of. There comes a point, though, at which we must discuss in no uncertain terms exactly how to prevent this type of information exposure or how to remedy an existing exposure. This chapter is all about protecting your site (or your customer's site) from this type of attack.

We'll look at this topic from several perspectives. First, it's important that you understand the value of strong policy with regard to posting data on the Internet. This is not a technical topic and could very easily put the techie in you fast asleep, but a sound security policy is absolutely necessary when it comes to properly securing any site. Second, we'll look at slightly more technical topics that describe how to secure your Web site from Google's (and other search engines') crawlers. We'll then look at some tools that can be used to check a Web site's Google exposure, and we'll spend some time talking about ways Google can help you shore up your defenses.

Where Are the Details?

There are too many types of servers and configurations to show how to lock them all down. A discussion of Web server security could easily span an entire book series. We'll look at server security at a high level here, focusing on strategies you can employ to protect yourself specifically from the Google hacker threat. For more details, please check the references in the "Links to Sites" section.

A Good, Solid Security Policy

The best hardware and software configuration money can buy can't protect your resources if you don't have an effective security policy. Before implementing any software assurances, take the time to review your customer's (or your own) security policy. A good security policy, properly enforced, outlines the assets you're trying to protect, how the protection mechanisms are installed, the acceptable level of operational risk, and what to do in the event of a compromise or disaster. Without a solid, enforced security policy, you're fighting a losing battle.

Web Server Safeguards

There are several ways to keep the prying eyes of a Web crawler from digging too deeply into your site. However, bear in mind that a Web server is best suited for storing data that is meant for public consumption. Despite all the best protections, information leaks happen. If you're really concerned about keeping your sensitive information private, keep it away from your public Web server. Move that data to an intranet or onto a specialized server that is dedicated to serving that information in a safe, responsible, policy-enforced manner.

Don't get in the habit of splitting a public Web server into distinct roles based on access levels. It's too easy for a user to copy data from one file to another, which could render some directory-based protection mechanisms useless. Likewise, consider the implications of a public Web server system compromise. In a well-thought-out, properly constructed environment, the compromise of a public Web server results only in the compromise of public information. Proper access restrictions would prevent the attacker from bouncing from the Web server to any other machine, making further infiltration of more sensitive information all the more difficult. If sensitive information were stored alongside public information on a public Web server, the compromise of that server could potentially compromise the more sensitive information as well.

We'll begin by taking a look at some fairly simple measures that can be taken to lock down a Web server from within. These are general principles; they're not meant to provide a complete solution but rather to highlight some of the common key areas of defense. We will not focus on any specific type of server but will look at suggestions that should be universal to any Web server. We will not delve into the specifics of protecting a Web application; rather, we'll explore the more common methods that have proven especially effective against Web crawlers.

Directory Listings and Missing Index Files

We've already seen the risks associated with directory listings. Although they may seem like minor information leaks, directory listings allow the Web user to see most (if not all) of the files in a directory, as well as any lower-level subdirectories. As opposed to the "guided" experience of surfing through a series of prepared pages, directory listings provide much more unfettered access. Depending on many factors, such as the permissions of the files and directories as well as the server's settings for allowed files, a casual Web browser could get access to files that should not be public.

Figure 11.1 demonstrates an example of a directory listing that reveals the location of an htaccess file. Normally, this file (which should be called .htaccess, not htaccess) serves to protect the directory contents from unauthorized viewing. However, a server misconfiguration allows this file to be seen in a directory listing and even read.

Directory listings should be disabled unless you intend to allow visitors to peruse files in an FTP-style fashion. On some servers, a directory listing will appear if an index file (as defined by your server configuration) is missing. These files, such as index.html, index.htm, or default.asp, should appear in each and every directory that should present a page to the user. On an Apache Web server, you can disable directory listings by placing a minus sign before the word Indexes in the httpd.conf file. The line might look something like this if directory listings (or "indexes," as Apache calls them) are disabled:

Options -Indexes FollowSymLinks MultiViews
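In context, this directive typically lives inside a Directory block in httpd.conf. The path shown here is only an illustrative docroot; adjust it to match your own configuration:

<Directory /var/www/html>
    # The leading minus sign disables directory listings for this tree
    Options -Indexes FollowSymLinks MultiViews
</Directory>

After changing httpd.conf, restart or reload Apache for the change to take effect.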

Blocking Crawlers with Robots.txt

The robots.txt file provides a list of instructions for automated Web crawlers, also called robots or bots. Standardized at www.robotstxt.org/wc/norobots.html, this file allows you to define, with a great deal of precision, which files and directories are off-limits to Web robots. The robots.txt file must be placed in the root of the Web server with permissions that allow the Web server to read the file. Lines in the file beginning with a # sign are considered comments and are ignored. Each line not beginning with a # should begin with either a User-agent or a Disallow statement, followed by a colon and an optional space. These lines are written to disallow certain crawlers from accessing certain directories or files. Each Web crawler should send a user-agent field, which lists the name or type of the crawler. The value of Google's user-agent field is Googlebot. To address a disallow to Google, the user-agent line should read:

User-agent: Googlebot

According to the original specification, the wildcard character * can be used in the user-agent field to indicate all crawlers. The disallow line describes what, exactly, the crawler should not look at. The original specifications for this file were fairly inflexible, stating that a disallow line could only address a full or partial URL. According to that original specification, the crawler would ignore any URL starting with the specified string. For example, a line like Disallow: /foo would instruct the crawler to ignore not only /foo but also /foo/index.html, whereas a line like Disallow: /foo/ would instruct the crawler to ignore /foo/index.html but not /foo, since the slash trailing foo must exist. A valid robots.txt file is shown here:

#abandon hope all ye who enter
User-Agent: *
Disallow: /

This file indicates that no crawler is allowed on any part of the site: the ultimate exclude for Web crawlers. The robots.txt file is read from top to bottom as ordered rules. There is no allow line in a robots.txt file. To include a particular crawler, disallow it access to nothing. This might seem like backward logic, but the following robots.txt file indicates that all crawlers are to be sent away except for the crawler named Palookaville:

#Bring on Palookaville
User-Agent: *
Disallow: /

User-Agent: Palookaville
Disallow:

Notice that there is no slash after Palookaville's disallow. (Norman Cook fans will be delighted to notice the absence of both slashes and dots from anywhere near Palookaville.) Saying that there's no disallow is like saying that user agent is allowed—sloppy and confusing, but that's the way it is.

Google allows for extensions to the robots.txt standard. A disallow pattern may include * to match any number of characters. In addition, a $ indicates the end of a name. For example, to prevent the Googlebot from crawling all your PDF documents, you can use the following robots.txt file:

#Away from my PDF files, Google!
User-Agent: Googlebot
Disallow: /*.PDF$

Once you've gotten a robots.txt file in place, you can check its validity with one of the many robots.txt validation tools available online.

Web Crawlers and Robots.txt

Hackers don't have to obey your robots.txt file. In fact, Web crawlers really don't have to, either, although most of the big-name Web crawlers will, if only for the "CYA" factor. One fairly common hacker trick is to view a site's robots.txt file first to get an idea of how files and directories are mapped on the server. In fact, as shown in Figure 11.2, a quick Google query can reveal lots of sites that have had their robots.txt files crawled. This, of course, is a misconfiguration, because the robots.txt file is meant to stay behind the scenes.

NOARCHIVE: The Cache "Killer"

The robots.txt file keeps Google away from certain areas of your site. However, there could be cases where you want Google to crawl a page, but you don't want Google to cache a copy of the page or present a "cached" link in its search results. This is accomplished with a META tag. To prevent all (cooperating) crawlers from archiving or caching a document, place the following META tag in the HEAD section of the document:
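A common way to express this, using the standard robots META syntax, is:

<META NAME="ROBOTS" CONTENT="NOARCHIVE">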

If you prefer to keep only Google from caching the document, use this META tag in the HEAD section of the document:
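Using the same META syntax, a tag addressed only to Google's crawler looks like this:

<META NAME="GOOGLEBOT" CONTENT="NOARCHIVE">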

Any cooperating crawler can be addressed in this way by inserting its name as the META NAME. Understand that this rule only addresses crawlers. Web visitors (and hackers) can still access these pages.

NOSNIPPET: Getting Rid of Snippets

A snippet is the text listed below the title of a document on the Google results page. Providing insight into the returned document, snippets are convenient when you're blowing through piles of results. However, in some cases, snippets should be removed. Consider the case of a subscription-based news service. Although this type of site would like to have the kind of exposure that Google can offer, it needs to protect its content (including snippets of content) from nonpaying subscribers. Such a site can accomplish this goal by combining the NOSNIPPET META tag with IP-based filters that allow Google's crawlers to browse content unmolested. To keep Google from displaying snippets, insert this code into the document:
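The standard form of this tag, addressed to Google's crawler, looks like this:

<META NAME="GOOGLEBOT" CONTENT="NOSNIPPET">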

An interesting side effect of the NOSNIPPET tag is that Google will not cache the document. NOSNIPPET removes both the snippet and the cached page.

Password-Protection Mechanisms

Google does not fill in user authentication forms. When presented with a typical password form, Google seems to simply back away from that page, keeping nothing but the page's URL in its database. Although it was once rumored that Google somehow magically bypasses security checks, those rumors have never been substantiated. These incidents are more likely an issue of timing.

If Google crawls a password-protected page either before the page is protected or while the password protection is down, Google will cache an image of the protected page. Clicking the original page will show the password dialog, but the cached page will not, providing the illusion that Google has bypassed that page's security. In other cases, a Google news search will provide a snippet of a news story from a subscription site (shown in Figure 11.3), but clicking the link to the story presents a registration screen, as shown in Figure 11.4. This also creates the illusion that Google can magically bypass pesky password dialogs and registration screens.

If you're really serious about keeping the general public (and crawlers like Google) away from your data, consider a password authentication mechanism. A basic password authentication mechanism, htaccess, exists for Apache. An htaccess file, combined with an htpasswd file, allows you to define a list of username/password combinations that can access specific directories. You'll find an Apache htaccess tutorial at http://httpd.apache.org/docs/howto/htaccess.html, or try a Google search for htaccess howto.
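As a rough sketch of how the pieces fit together (the file paths and username here are placeholders, not values from the tutorial), an .htaccess file protecting a directory might look like this:

# .htaccess - require a valid login for this directory
AuthType Basic
AuthName "Restricted Area"
AuthUserFile /usr/local/apache/passwd/.htpasswd
Require valid-user

The companion password file can be created with the htpasswd utility, for example: htpasswd -c /usr/local/apache/passwd/.htpasswd someuser. Keep the htpasswd file outside the Web server's docroot so it can't be downloaded directly.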

Software Default Settings and Programs

As we've seen throughout this book, even the most basic Google hacker can home in on default pages, phrases, page titles, programs, and documentation with very little effort. Keep this in mind and remove these items from any Web software you install. It's also good security practice to ensure that default accounts and passwords are removed, as well as any installation scripts or programs that were supplied with the software. Since the topic of Web server security is so vast, we'll take a look at some of the highlights you should consider for a few common servers.

The Microsoft IIS 5.0 Security Checklist (see the "Links to Sites" section at the end of this chapter) lists quite a few tasks that can help lock down an IIS 5.0 server in this manner:

■ Remove unused script extensions:

■ Web-based password change: .htr

■ Internet database connector: .idc

■ Server-side includes: .stm, .shtm and .shtml

■ Internet printing: .printer

■ Index server: .htw, .ida and .idq

The Apache 1.3 series comes with fewer default pages and directories, but keep an eye out for the following:

■ The /manual directory from the Web root contains the default documentation.

■ Several language files in the Web root beginning with index.html. These default language files can be removed if unused.

Patch That System

It certainly sounds like a cliche in today's security circles, but it can't be stressed enough: If you choose to do only one thing to secure any of your systems, it should be to keep up with and install all the latest software security patches. Misconfigurations make for a close second, but without a firm foundation, your server doesn't stand a chance.

Hacking Your Own Site

Hacking into your own site is a great way to get an idea of its potential security risks. Obviously, no single person can know everything there is to know about hacking, meaning that hacking your own site is no replacement for having a real penetration test performed by a professional. Even if you are a pen tester by trade, it never hurts to have another perspective on your security posture. In the realm of Google hacking, there are several automated tools and techniques you can use to give yourself another perspective on how Google sees your site. We'll start by looking at some manual methods, and we'll finish by discussing some automated alternatives.

Warning

As we'll see in this chapter, there are several ways a Google search can be automated. Google frowns on any method that does not use its supplied Application Programming Interface (API) along with a Google license key. Assume that any program that does not ask you for your license key is running in violation of Google's terms of service and could result in banishment from Google. Check these important links, www.google.com/terms_of_service.html and www.bmedia.org/archives/00000109.php, for more information. Be nice to Google and Google will be nice to you!

Site Yourself

We've talked about the site operator throughout the book, but remember that site allows you to narrow a search to a particular domain or server. If you're sullo, the author of the (most impressive) NIKTO tool and administrator of cirt.net, a query like site:cirt.net will list all Google's cached pages from the cirt.net server, as shown in Figure 11.5.

Figure 11.5 A Site Search is One Way to Test Your Google Exposure

You could certainly click each and every one of these links or simply browse through the list of results to determine whether those pages are indeed supposed to be public, but this exercise could be very time consuming, especially if the number of results is more than a few hundred. Obviously, you need to automate this process. Let's take a look at some automation tools.

Gooscan

Gooscan, written by Johnny Long, is a Linux-based tool that enables bulk Google searches. The tool was not written with the Google API and therefore violates Google's Terms of Service (TOS). It's a judgment call as to whether or not you want to knowingly violate Google's TOS to scan Google for information leaks originating from your site. If you decide to use a non-API-based tool, remember that Google can (though very rarely does) block certain IP ranges from using its search engine. Also keep in mind that this tool was designed for securing your site, not breaking into other people's sites. Play nice with the other children, and unless you're accustomed to living on the legal edge, use the Gooscan code as a learning tool and don't actually run it!

Gooscan is available from http://johnny.ihackstuff.com. Don't expect much in the way of a fancy interface or point-and-click functionality. This UNIX-based tool is command-line only and requires a smidge of technical knowledge to install and run. The benefit is that Gooscan is lean and mean and the best current alternative to the Windows-only tools.

Installing Gooscan

To install Gooscan, first download the tar file, decompressing it with the tar command. Gooscan comes with one C program, a README file, and a directory filled with data files, as shown in Figure 11.6.

Once the files have been extracted from the tar file, you must compile Gooscan with a compiler such as GCC. Mac users should first install the XCode package from the Apple Developers Connection Web site, http://connect.apple.com/. Windows users should consider a more "graphical" alternative such as Athena or SiteDigger, because Gooscan does not currently compile under environments like CYGWIN.
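Assuming a download named gooscan-current.tar.gz and a single source file named gooscan.c (both filenames are assumptions; use whatever the archive actually contains), the unpack-and-compile steps look roughly like this:

tar -xvzf gooscan-current.tar.gz      # unpack the archive
cd gooscan                            # change into the extracted directory
gcc -o gooscan gooscan.c              # compile the single C source file with GCC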

Gooscan's Options

Gooscan's usage can be listed by running the tool with no options (or a combination of bad options), as shown in Figure 11.7.

Gooscan's most commonly used options are outlined in the included README file. Let's take a look at how the various options work:

■ <-t target> (required argument) This is the Google appliance or server to scan. An IP address or host name can be used here. Caution: Entering www.google.com here violates Google's terms of service and is neither recommended nor condoned by the author.

■ <-q query | -i queryfile> (required argument) The query or query file to send. Gooscan can be used to send an individual query or a series of queries read from a file. The -q option takes one argument, which can be any valid Google query. For example, these are valid options:

-q googledorks

-q "microsoft sucks"

-q "intitle:index.of secret"

■ [-i inputfile] (optional argument) The -i option takes one argument: the name of a Gooscan data file. Using a data file allows you to perform multiple queries with Gooscan. See the following list for information about the included Gooscan data files.

■ [-o outputfile] (optional argument) Gooscan can create a nice HTML output file. This file includes links to the actual Google search results pages for each query.

■ [-P proxy:port] (optional argument) This is the address and port of an HTTP proxy server. Queries will be sent here and bounced off to the appliance indicated with the -t argument. The format can be similar to 10.1.1.150:80 or proxy.validcompany.com:8080.

■ [-v] (optional argument) Verbose mode. Every program needs a verbose mode, especially when the author sucks with a command-line debugger.

■ [-s site] (optional argument) This filters only results from a certain site, adding the site operator to each query Gooscan submits. This argument has absolutely no meaning when used against Google appliances, since Google appliances are already site filtered. For example, consider the following Google queries:

site:microsoft.com linux
site:apple.com microsoft
site:linux.org microsoft

With advance express permission from Google, you could run the following with Gooscan to achieve the same results:
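A sketch of what those Gooscan invocations might look like, based on the options described above (the target and site values are purely illustrative):

gooscan -t www.google.com -q "linux" -s microsoft.com
gooscan -t www.google.com -q "microsoft" -s apple.com
gooscan -t www.google.com -q "microsoft" -s linux.org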

Gooscan's Data Files

Used in multiple-query mode, Gooscan reads queries from a data file. The format of the data files is as follows:

search_type | search_string | count | description

search_type can be one of the following:

■ intitle Finds search_string in the title of the page. If requested on the command line, Gooscan will append the site query. Example:

intitle|error||

This will find the word error in the title of a page.

■ inurl Finds search_string in the URL of the page. If requested on the command line, Gooscan will append the site query. Example:

inurl|admin||

This will find the word admin in the URL of a page.

■ indexof Finds search_string in a directory listing. If requested on the command line, Gooscan will append the site query. Directory listings often will have the term index of in the title of the page. Gooscan will generate a Google query that looks something like this:

intitle:index.of search_string

Note

When using the site switch, Gooscan automatically performs a generic search for directory listings. That query looks like this: intitle:index.of site:site_name. If this generic query returns no results, Gooscan will skip any subsequent indexof searches. It is a logical conclusion to skip specific indexof searches if the most generic of indexof searches returns nothing. For example:

indexof|htaccess||

This search will find .htaccess files sitting in a directory listing on the server.

■ filetype Finds search_string as a filename, inserting the site query if requested on the command line. For example:

filetype|cgi cgi||

This search will find files that have an extension of .cgi.

■ raw This search_type allows the user to build custom queries. The query is passed to Google unmodified, adding a site query if requested in the command line. For example:

raw|filetype:xls email username password||

This example will find Excel spreadsheets with the words email, username, and password inside the document.

■ search_string The search_string field is fairly straightforward. Any string is allowed here except the newline (\n) and | characters. This string is URL-encoded before being sent to Google, so special characters are converted to their %xx equivalents; spaces, as an exception, are converted to the + character.

■ count This field records the approximate number of hits found when a similar query is run against all of Google. Site is not applied. This value is somewhat arbitrary in that it is based on the rounded numbers supplied by Google and that this number can vary widely based on when and how the search is performed. Still, this number can provide a valuable watermark for sorting data files and creating custom data files. For example, zero-count records could safely be eliminated before running a large search. (This field is currently not used by Gooscan.)

■ description This field describes the search type. Currently, only the filetype.gs data file populates this field. Keep reading for more information on the filetype.gs data file.

Several data files are included with Gooscan, each with a distinct purpose:

■ gdork.gs This file includes excerpts from the Google Hacking Database (GHDB) hosted at http://johnny.ihackstuff.com. The GHDB is the Internet's largest database of Google hacking queries, maintained by thousands of members who make up the Search Engine Hacking Forums, also hosted at http://johnny.ihackstuff.com. Updated many times a week, the GHDB currently sits at around 750 unique queries.

■ filetype.gs This huge file contains every known filetype in existence, according to www.filext.com. By selecting interesting lines from this file, you can quickly determine the types of files that exist on a server that might warrant further investigation. We suggest creating a subset of this file with a Linux command such as:

head -50 filetype.gs > short_filetype.gs

for use in the field. Do not run this file as is. It's too big. With over 8,000 queries, this search would certainly take quite a while and burn precious resources on the target server. Instead, rely on the numbers in the count field to tell you how many (approximate) sites contain these files in Google, selecting only those that are the most common or relevant to your site. The filetype.gs file lists the most commonly found extensions at the top.

■ inurl.gs This very large data file contains strings from the most popular CGI scanners, which excel at locating programs on Web servers. Sorted by the approximate number of Google hits, this file lists the most common strings at the top, with very esoteric CGI vulnerability strings listed near the bottom. This data file locates the strings in the URL of a page. This is another file that shouldn't be run in its entirety.

■ indexof.gs Nearly identical to the inurl.gs file, this data file finds the strings in a directory listing. Run portions of this file, not all of it!

Using Gooscan

Gooscan can be used in two distinct ways: single-query mode or multiple-query mode. Single-query mode is little better than using Google's Web search feature, with the exception that Gooscan will provide you with Google's number of results in a more portable format. As shown in Figure 11.8, a search for the term daemon9 returns 2440 results from all of Google. To narrow this search to a specific site, such as phrack.org, add the [-s] option. For example:

gooscan -q "daemon9" -t www.google.com -s phrack.org

Figure 11.8 Gooscan's Single-Query Mode

Notice that Gooscan presents a very lengthy disclaimer when you select www.google.com as the target server. This disclaimer is only presented when you submit a search that potentially violates Google's TOS. The output from a standard Gooscan run is fairly paltry, listing only the number of hits from the Google search. You can apply the [-o] option to create a nicer HTML output format. To run the daemon9 query with nicer output, run:

gooscan -q "daemon9" -t www.google.com -o daemon9.html

As shown in Figure 11.9, the HTML output lists the options that were applied to the Gooscan run, the date the scan was performed, a list of the queries, a link to the actual Google search, and the number of results.

The link in the HTML output points to Google. Clicking the link will per­form the Google search for you. Don't be too surprised if the numbers on Google's page differ from what is shown in the Gooscan output; Google's search results are sometimes only approximations.

Running Gooscan in multiple-query mode is a blatant violation of Google's TOS but shouldn't cause too much of a Google-stink if it's done judiciously. One way to keep Google on your good side is to respect the spirit of its TOS by sending small batches of queries and not pounding the server with huge data files. As shown in Figure 11.10, you can create a small data file using the head command. A command such as:

head -5 data_files/gdork.gs > data_files/little_gdork.gs

will create a four-query data file, since the gdork.gs file has a commented header line.

Figure 11.10 Running Small Data Files Could Keep Google from Frowning at You


The output from the multiple-query run of Gooscan is still paltry, so let's take a look at the HTML output shown in Figure 11.11.

Using Gooscan with the [-s] switch, we can narrow our results to one particular site, as shown in Figure 11.12.

Most site-narrowed Gooscan runs should come back pretty clean, as this run did. If you see hits that look suspicious, click the link to see exactly what Google saw. Figure 11.13 shows the Google search in its entirety.

In this case, we managed to locate the Google Hacking Database itself, which included a reference that matched our Google query. The other searches didn't return any results, because they were a tad more specific than the Calamaris query, which didn't search titles, URLs, filetypes, and the like.

In summary, Gooscan is a great tool for checking your Web site's exposure, but it should be used cautiously since it does not use the Google API. Break your scans into small batches, unless you (unwisely) like thumbing your nose at the Establishment.

Windows Tools and the .NET Framework

The Windows tools we'll look at all require the Microsoft .NET Framework, which can be located with a Google query of .NET framework download. The successful installation of the framework depends on a number of factors, but regardless of the version of Windows you're running, assume that you must be current on all the latest service packs and updates. If Windows Update is available on your version of Windows, run it. The Internet Explorer upgrade, available from the Microsoft Web site (Google query: Internet Explorer upgrade), is the most common required update for successful installation of the .NET Framework. Before downloading and installing Athena or SiteDigger, make sure you've got the .NET Framework properly installed.

Athena

Athena by Steve Lord (steve@buyukada.co.uk) is a Windows-based Google scanner that is not based on the Google API. As with Gooscan, use of this tool violates Google's TOS, and as a result Google can block your IP range from using its search engine. Athena is potentially less intrusive than Gooscan, since Athena only allows you to perform one search at a time, but Google's TOS is clear: no automated scanning is allowed. Just as we discussed with Gooscan, use any non-API tool judiciously. History suggests that if you're nice to Google, Google will be nice to you.

Athena can be downloaded from http://snakeoillabs.com/. The download consists of a single MSI file. Assuming you've installed the .NET Framework, the Athena installer is a simple wizard, much like most Windows-based software. Once installed and run, Athena presents the main screen, as shown in Figure 11.14.

As shown, this screen resembles a simple Web browser. The Refine Search text box allows you to enter or refine an existing query. The Search button is similar to Google's Search button and executes a search.

To perform basic searches with Athena, you need to load an XML file containing your desired search strings. Simply open the file from within Athena and all the searches will appear in the Select Query drop-down box. Select your query and click the Search button. Selecting buddylist.blt and clicking Search will deliver the Google results from that search, as shown in Figure 11.15.

As you can see, the results of the query contain undesired items. Fortunately, Athena allows you to refine your query using the Refine Search box. Using the previous query, entering inurl:"buddylist.blt" into the Refine Search box and clicking the Search button provides a much cleaner search (see Figure 11.16).

At this point, Athena might seem rather pointless. It functions just like a Web browser, submitting queries into Google and displaying the results. However, Athena's most powerful functionality lies in its XML-based configuration files.

Using Athena's Config Files

Two of these files are included with Athena: Athena.xml and digicams.xml. These files contain custom queries and descriptions of those queries. The digicams file contains sample queries for finding images; the Athena.xml file contains the queries found in the GHDB.

To load these files, click File | Open Config and select the XML file you'd like to use. Figure 11.17 shows Athena's main screen after you load athena.xml.

As mentioned, Athena uses the GHDB as a source for its searches, making it a very thorough scanning tool. The SiteDigger tool uses similar searches but has chosen not to officially support the GHDB. This means that SiteDigger has far fewer researchers submitting new searches, making for a potentially less thorough search database.

Constructing Athena Config Files

Athena's XML-based config files, which are compatible with Foundstone's SiteDigger, can be modified or even completely overhauled based on your needs. There are two main sections to the XML file: a searchEngine section and the signature section. The searchEngine section describes how a particular search engine's queries are constructed. A typical searchEngine section is shown in the following code example.



<searchEngine>
    <searchEngineName>Google (UK)</searchEngineName>
    <searchEnginePrefixUrl>http://www.google.co.uk/search?q=</searchEnginePrefixUrl>
    <searchEnginePostfixUrl>%26ie=UTF-8%26hl=en%26meta=</searchEnginePostfixUrl>
</searchEngine>


This section is responsible for describing how the various search engines handle search requests. The searchEngineName field is simply a text-based field that describes the name of the search engine. This name will appear in Athena's drop-down box, allowing you to select from among different search engines. The searchEnginePrefixUrl field represents the first part of the search URL that is sent to the search engine. It is assumed that the query part of the search will be filled in after this prefix. The searchEnginePostfixUrl field describes the part of the URL that will come after the prefix and the query. This usually describes various options such as output format (UTF-8). Note that Athena uses the searchEnginePostfixUrl section, and SiteDigger does not. This section could be reworked to search the U.S.-based Google engine with the following searchEngine section:



<searchEngine>
    <searchEngineName>Google (US)</searchEngineName>
    <searchEnginePrefixUrl>http://www.google.com/search?q=</searchEnginePrefixUrl>
    <searchEnginePostfixUrl>%26ie=UTF-8%26hl=en%26meta=</searchEnginePostfixUrl>
</searchEngine>


The signature section describes the individual searches that are to be performed. A typical signature section is shown in the following code example:



<signature>
    <signatureReferenceNumber>22</signatureReferenceNumber>
    <categoryref>T1</categoryref>
    <category>TECHNOLOGY PROFILE</category>
    <querystring>intitle:"Index of" secring.bak</querystring>
    <shortDescription>PGP Secret KeyRing Backup</shortDescription>
    <textualDescription>This query looked for a backup of the PGP secret key ring. With this keyring an attacker could decrypt messages encrypted by the user.</textualDescription>
    <cveNumber>1000</cveNumber>
    <cveLocation>http://johnny.ihackstuff.com</cveLocation>
</signature>


The signatureReferenceNumber is a unique number assigned to each signature. The categoryref is a unique number that describes the signature in the context of its category, which is described in full by category. The querystring is the Google query that is to be performed. It is made HTML-friendly and inserted between the searchEnginePrefixUrl and the searchEnginePostfixUrl in the URL sent to Google. The shortDescription and textualDescription fields are short and long descriptions of the search, respectively. The cveNumber and cveLocation refer to the Common Vulnerabilities and Exposures list at www.cve.mitre.org.

The header of the XML file should contain the XML declaration, specifying the UTF-8 encoding:

<?xml version="1.0" encoding="UTF-8"?>

followed by the opening root element, and the file should be closed out with the matching closing root tag as well.

Using this format, it's fairly simple to create a file of custom queries. The file must conform to the UTF-8 character set and be strictly XML compliant. This means that every tag must not only be matched with a closing tag but that tags are case sensitive: Microsoft's XML scanner will complain about an opening tag followed by a closing tag whose case differs. The less-than and greater-than symbols (< and >) can also cause problems when used improperly. If your data contains unescaped angle brackets, such as the Internet shorthand for "grin" (<g>), the MS XML scanner will complain.

Current Config Files

The maintainers of the GHDB make available current config files for use with Athena. This file can be downloaded from http://johnny.ihackstuff.com.

The Google API and License Keys

The only way Google will explicitly allow you to automate your queries is via the Google Application Programming Interface (API). We'll talk about programming in more detail later, but to get programs written with the Google API running, you'll need to obtain a license key, and to do that you must first create a Google account by visiting www.google.com/accounts/NewAccount. If you already have a Google account (obtained through Google Groups or the Gmail service, for example), you can log into that account through the Google accounts page, located at www.google.com/accounts. Once logged in, you can proceed to http://api.google.com/createkey to obtain your key. The license key is a sequence of characters that, when entered into any tool created with the Google API, allows you to perform 1,000 automated queries per day.

SiteDigger

SiteDigger is a tool very similar to Athena, but it is automated and uses the Google API. You must acquire a Google license key to use this program. SiteDigger was architected by Mark Curphey, and development credit goes to Kartik Trivedi, Eric Heitzman, Aaron Higbee, and Shanit Gupta. You can download SiteDigger from www.foundstone.com/resources/proddesc/sitedigger.htm. In addition to a license key, you will need to download and install the Microsoft .NET Framework, as we discussed earlier in this chapter. There is no installation for SiteDigger; simply unzip the files into a directory and go.

Once launched, SiteDigger presents the main screen, shown in Figure 11.18.

The main screen allows you to enter a domain (such as those used with the site operator) and your Google license key. The Search, Stop, and Clear buttons are self-explanatory. SiteDigger's menu bar is fairly useless. The only item worth using is Options, which allows you to update SiteDigger's signatures from Foundstone's Web site. The Signatures tab, shown in Figure 11.19, lists the queries that SiteDigger is capable of executing.

The signatures in SiteDigger's list should look familiar. They are very similar to the queries executed by Athena, since many of them came from the GHDB, as you can see when you compare the signature highlighted in Figure 11.19 to the much earlier signature from the GHDB, shown in Figure 11.20.

SiteDigger does not officially use the GHDB as its foundation, and its signature database is less than one-third the size of the GHDB, which is free to developers with attribution to the GHDB Web site. Without the addition of the signatures from the GHDB, SiteDigger suffers. Unfortunately, at the time of this writing, the current version of SiteDigger is incompatible with the GHDB. In addition, there are size constraints to the SiteDigger signature database. The developers obviously never imagined a signature database of more than 550 entries, meaning that even in its current state, the GHDB is larger than the maximum SiteDigger can handle. It is unfortunate that such an excellent tool has such obvious shortcomings.

The Export Results button on the main screen allows you to create a very nice HTML report listing the results of a scan, as shown in Figure 11.21.

The report lists the category, one result from the search, the summary of the search, and a longer description of the significance of the search. Notice that only one URL is returned. It is most unfortunate that SiteDigger only returns one URL, since this severely limits the tool's effectiveness during a penetration test. Even though you can narrow the search to a particular site or domain, weeding through false positives is part of the Google hacking experience and really can't be automated. Clicking the provided URL takes you not to the Google search page with the listed results (which would be preferred) but to the first page that matched the query. There's no easy way to get back to the Google search page from SiteDigger to check out other query results.

Despite SiteDigger's shortcomings, it is still worth using because its automation, much like Gooscan's, makes fairly quick work of large query lists.

Wikto

Wikto is another tool similar to both Athena and SiteDigger. Like SiteDigger, Wikto requires a Google license key to be entered before you can use the GoogleHacks portion of this tool. Wikto, developed by Roelof Temmingh of Sensepost (www.sensepost.com), does far more than merely query Google. However, this book focuses only on that aspect of the tool. Figure 11.22 shows the default GoogleHacks screen.

The Wikto download does not include a copy of the GHDB but is fully compatible, as evidenced by the Load GHDB button. Simply download the latest GHDB update from http://johnny.ihackstuff.com and import it using the Load GHDB button. Once it's loaded, you will see the first box populated with the GHDB entries, as shown in Figure 11.23.

Wikto works in two ways. Entering your domain into the Target box is the equivalent of appending site:yourdomain.com to each of the searches. Click the Start GH button and Wikto will work its way through the GHDB, one entry at a time (see Figure 11.24).

Wikto displays information about each query as it processes it, as shown in Figure 11.24. Information about the query (search string, reference ID, general description, and category) is displayed in the middle window, and returned results are displayed in the bottom window.

Wikto will also perform single queries without the site: operator. By highlighting your desired search string from the GHDB in the top window and clicking the Manual button, Wikto queries Google and returns all results found, as shown in Figure 11.25.

As you can see, the output differs only in the lower window, which displays all the results returned from the query. This is identical to going to Google.com and manually entering the search string, only Wikto is much more convenient.

The one downside to Wikto as of the time of this writing is its lack of a logging feature. Results must be manually cut and pasted if you want to save them. Despite this shortcoming, Wikto's compatibility with the GHDB and its extensive features currently make it one of the better tools available.

Getting Help from Google

So far we've looked at various ways of checking your site for potential information leaks, but what can you do if you detect such leaks? First and foremost, you should remove the offending content from your site. This may be a fairly involved process, but to do it right, you should always figure out the source of the leak to ensure that similar leaks don't happen in the future. Information leaks don't just happen; they are the result of some event that occurred. Figure out the event, resolve it, and you can begin to stem the problem at its source. Google makes a great Web page available that helps answer some of the most commonly asked questions from a Webmaster's perspective. The "Google Information for Webmasters" page, located at www.google.com/webmasters, lists all sorts of answers to commonly asked questions.

Solving the local problem is only half the battle. In some cases, Google has a cached copy of your information leak just waiting to be picked up by a Google hacker. There are two ways you can delete a cached version of a page. The first method involves the automatic URL removal system at http://services.google.com/urlconsole/controller. This page, shown in Figure 11.26, requires that you first verify your e-mail address. Although this appears to be a login for a Google account, Google accounts don't seem to provide you access. In most cases, you will have to reregister, even if you have a Google account. The exception seems to be Google Groups accounts, which appear to allow access to this page without a problem.

Once logged in, you will receive an e-mail verification link that, when clicked, will allow you access to the Remove URL options screen, shown in Figure 11.27. This screen provides links to various sets of instructions to help you remove pages from Google's index.

The first option allows you to point Google at a robots.txt page that exists on your site. Google will process that robots.txt file and, if it is valid, will begin removing the pages affected by that file. According to Google, these requests are usually processed within 24 hours. This option is especially handy if you have made changes to your robots.txt file and would like Google to retroactively update its database, removing any newly disallowed files.

The second option allows you to remove a page based on a META tag reference. You can use this option when you discover a page that you'd like to make available to Google but that you'd prefer not to have cached. Simply update the META tag for the document and submit the document to this removal page.
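For example, a page that should remain findable but not cached can carry the standard robots META tag described earlier in this chapter:

<META NAME="GOOGLEBOT" CONTENT="NOARCHIVE">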

The third option is the real "Oh, crap!" page. If you find a document that absolutely, positively was not supposed to be public, first remove the document, log into the removal system, and click Remove an Outdated Link. The resulting screen, shown in Figure 11.28, allows you several options for removing the offending data. If you're really terrified of the implications of the document, click the first removal option. This option should nail everything associated with the document. The second option removes the snippet that appears on the search results page as well as the cached version of the page. The third removal option only deletes the cached version of the page, leaving the snippet on the results page. All these options require that the original page be deleted first. According to Google, this option takes approximately three to five days to process.

The final removal option allows you to remove one of your posts from Google Groups. Unlike with the old USENET system, you can make your half-dazed 2:00 A.M. inflammatory comments to a newsgroup go away. To delete a USENET post, log in as the e-mail address from which you posted. Enter either the full Groups URL or the Message ID of the message you want to delete. This request usually takes 24 hours to process.

Summary

The subject of Web server security is too big for any one book. There are so many varied requirements combined with so many different types of Web server software, application software, and operating system software that no one book could do the topic justice. However, a few general principles can at least help you prevent the devastating effects a malicious Google hacker could inflict on a site you're charged with protecting.

First, understand how the Web server software operates in the event of an unexpected condition. Directory listings, missing index files, and specific error messages can all open up avenues for offensive information gathering. Robots.txt files, simple password authentication, and effective use of META tags can help steer Web crawlers away from specific areas of your site. Although Web data is generally considered public, remember that Google hackers might take interest in your site if it appears as a result of a generic hacking search. Default pages, directories, and programs can serve as an indicator that there is a low level of technical know-how behind a site. Servers with this type of default information serve as targets for hackers. Get a handle on what, exactly, a search engine needs to know about your site to draw visitors without attracting undue attention as a result of too much exposure. Use any of the available tools, such as Gooscan, Athena, Wikto, or SiteDigger, to help you search Google for your site's information leaks. If you locate a page that shouldn't be public, use Google's removal tools to flush the page from Google's database.

Solutions Fast Track

A Good, Solid Security Policy

■ An enforceable, solid security policy should serve as the foundation of any security effort.

■ Without a policy, your safeguards could be inefficient or unenforceable.

Web Server Safeguards

■ Directory listings, error messages, and misconfigurations can provide too much information.

■ Robots.txt files and specialized META tags can help direct search engine crawlers away from specific pages or directories.

■ Password mechanisms, even basic ones, keep crawlers away from protected content.

■ Default pages and settings indicate that a server is not well maintained and can make that server a target.

Hacking Your Own Site

■ Use the site operator to browse the servers you're charged with protecting. Keep an eye out for any pages that don't belong.

■ Use a tool like Gooscan or Athena to assess your exposure. These tools do not use the Google API, so be aware that any blatant abuse or excessive activity could get your IP range cut off from Google.

■ Use a tool like SiteDigger or Wikto, which uses the Google API and should free you from fear of getting shut down.

■ Use the Google Hacking Database to monitor the latest Google hacking queries. Use the GHDB exports with tools like Gooscan, Athena, or SiteDigger.

Getting Help from Google

■ Use Google's Webmaster page for information specifically geared toward Webmasters.

■ Use Google's URL removal tools to get sensitive data out of Google's databases.


Frequently Asked Questions

The following Frequently Asked Questions, answered by the authors of this book, are designed to both measure your understanding of the concepts presented in this chapter and to assist you with real-life implementation of these concepts. To have your questions about this chapter answered by the author, browse to www.syngress.com/solutions and click on the "Ask the Author" form. You will also gain access to thousands of other FAQs at ITFAQnet.com.

Q: What is the no-cache pragma? Will it keep my pages from caching on Google's servers?

A: The no-cache pragma is a META tag that can be entered into a document to instruct the browser not to load the page into the browser's cache. This does not affect Google's caching feature; it is strictly an instruction to a client's browser. See www.htmlgoodies.com/beyond/nocache.html for more information.
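The tag itself is typically written like this:

<META HTTP-EQUIV="Pragma" CONTENT="no-cache">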

Q: Can you provide any more details about securing IIS?

A: Microsoft makes available a very nice IIS Security Planning Tool. Try a Google search for IIS Security Planning Tool. Microsoft also makes available an IIS 5 security checklist; Google for IIS 5 services checklist. An excellent read pertaining to IIS 6 can be found with a query like "elements of IIS security". Also, frequent the IIS Security Center; try querying for IIS security center.

Q: Okay, enough about IIS. What about securing Apache servers?

A: Securityfocus.com has a great article, "Securing Apache: Step-by-Step," available from www.securityfocus.com/infocus/1694.

Q: Which is the best tool for checking my Google exposure?

A: That's a tough question, and the answer depends on your needs. The absolute most thorough way to check your Web site's exposure is to use the site operator. A query such as site:gulftech.org will show you all the pages on gulftech.org that Google knows about. By looking at each and every page, you'll absolutely know what Google has on you. Repeat this process once a week.

If this is too tedious, you'll need to consider an automation tool. A step above the site technique is Athena. Athena reads the full contents of the GHDB and allows you to step through each query, applying a site value to each search. This allows you to step through the comprehensive list of "bad searches" to see if your site is affected. Athena does not use the Google API but is not automated in the truest sense of the word. SiteDigger by Foundstone is automated, and a GHDB config file is available, giving you access to the latest hacking queries. SiteDigger has a nice reporting feature and uses the Google API, making it a friendlier alternative to the non-API tools. Gooscan is potentially the biggest Google automation offender when used improperly, since it is built on the GHDB and will crank through the entire GHDB in fairly short order. It does not use the Google API, and Google will most certainly notice you using it in its wide-open configuration. This type of usage is not recommended, since Google could make for a nasty enemy, but when Gooscan is used with discretion and respect for the spirit of Google's no-automation rule, it is a most thorough automated tool. As far as overall usefulness goes, we like Wikto. It allows for Google scanning functionality ("legal", via the API) and also incorporates a slew of host-scanning features backed by the Nikto database.
