Saturday, December 5, 2009

Automating Google Searches

Introduction

In a relatively short time, Google has become one of the largest collections of information in the world—certainly one of the largest freely available on the Internet. Corporate anomaly aside, and considering its founders and go-to-market strategy, it is nothing short of amazing that this Internet search powerhouse has become the de facto standard for searching the Internet for desired information. That said, Google's collected information has become more sought after than its proprietary Web-crawling algorithms, massive storage techniques, or the information retrieval system that seems to offer up the requested search information in mere nanoseconds.

Similar to nearly all other high-technology industries, the niche information security industry continues to assimilate advanced algorithms for the quick determination of more accurate information. Expert systems, artificial intelligence, dynamic database-driven applications, and profiling are four of the overarching initiatives currently driving security applications to the next level of automated computation.

Numerous mechanisms exist for collecting information from Google's online index of Web sites. Throughout this chapter, we discuss multiple methods for retrieving information from Google's database, including an overview of Google's API and manual Web page scraping. Manual Web page scraping is the technique of pulling desired information out of a returned Web page after a query is sent. These page-scraping techniques are quickly gaining in popularity and are currently utilized in a number of security, information-gathering, and other gimmick search engines. Although the underlying algorithm is nearly identical, the particular implementations of the search algorithm are quite different when written in different programming languages. Last but not least, we discuss how ethical automated scanning applications can be written that do not abuse the Google site by bombarding it with queries. This will be our equivalent to showing how page-scraping applications can be written from a "white-hat" perspective. A note of caution: This chapter is written for programmers. You'll need a background in various programming languages to get the most from this chapter. Simpler code examples are used throughout this book.

Warning

Google's stance on automation is that Google does not approve of automated scanning outside its provided Google API. Utilizing manual page-scraping techniques violates Google's terms of service; therefore, all the information in this book is provided for educational purposes. The code and libraries included in this chapter were developed as prototypes and are meant to serve as examples only! Please review Google's Standard Terms and Conditions for the company's current searching policy.

Understanding Google Search Criteria

As you have learned, Google provides access to an extremely large database of information ascertained from online applications and Web sites. As an end user, you have the ability to query this information in two general ways. The first is through the common search interface located on the main page at www.google.com. In general, this mechanism utilizes one or multiple words (or strings) and returns a list of the highest-rated sites containing these strings. The other, less common mechanism is the advanced search page that resides on the Google Web site in a somewhat hidden form. Here is a direct Web link to the advanced Google search page in English: www.google.com/advanced_search?hl=en.

Advanced Google querying not only aids in our cause of retrieving sensitive information from the Google database, it also helps educate users on the dangers of storing potentially sensitive information on distributed applications or Web applications. This chapter dives into these intricacies.

Note

Google searching parameters are covered in detail in Chapter 1. Please refer to Chapter 1 for more information on specific Google searching parameters.

Results from advanced and complex Google queries can be captured in one of two ways. The first and easiest is to grab the full query URL straight from a browser's address bar after the query is submitted to Google. Another method for obtaining the full query is to utilize a network traffic analyzer or sniffer.
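As an example of the first method, a simple query for the word exploits, restricted to English-language pages, appears in the address bar in a form similar to the following (the parameter values here are illustrative):

www.google.com/search?hl=en&lr=lang_en&q=exploits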

Our recommended sniffer is Ethereal (www.ethereal.com). The newer versions of Ethereal can convert HTTP to ASCII, minimizing the manual conversion necessary for humans to read the queries. An advanced Google query looking for exploits is shown in Figure 12.1.

Running an advanced query utilizing the previous Google-supplied form is not a difficult task when you are seeking information or contacts on a specific subject. Although the results of an advanced query, shown in Figure 12.2, are easy to read from a human perspective, they are quite different from a programmatic standpoint. The real issue of this seemingly simple task is magnified when you want to query Google 10,000 times and log the results for later correlation, analysis, or reporting. At that point, automating the transmission and reception of the Google queries is no longer an option—it's mandatory.

As an additional note, the latest version of Ethereal incorporates an extremely useful feature: cut and paste. You are now able to cut and paste raw packet or ASCII-converted information straight from the Ethereal analysis pane into computer memory for later use. Gaining access to packet data in older versions of Ethereal was a cumbersome task that included saving captured streams in .PCAP format, then manually converting the .PCAP data into straight text.

Analyzing the Business Requirements for Black Hat Auto-Googling

Although we won't attempt to justify the absolute need to automate Google querying and page scraping here, we will point out that it's illegal, unethical, and in some cases, such as securing your own Web site or a customer's Web site, unavoidably necessary.

Google sets limitations that restrict your true ability to monitor your Web applications with complete visibility. That said, we will demonstrate techniques that can be implemented to "more ethically" query Google automatically or to avoid the dreaded (and alleged) Google IP blacklist. (Supposedly, a "living" Google blacklist exists to log and limit Google service offenders, whether human or Web bot.)

The following is a list of self-governing Google pen-testing ethics:

■ Implement sleep timers in your applications so that you do not affect Google's response time on a global level. For instance, do not send 10,000 Google queries as fast as you can write them to the wire; sleep for 2 or 3 seconds between each transmission (see the sketch following this list).

■ Do not simply mirror aged Google results. It is better to link queries to real-time results than to create an aged database of results that needs constant updating.

■ Test or query with permission ascertained from the "target" site.

■ Query intelligently, thereby minimizing the number of queries sent to Google. If you have a blanket database of queries that you fire against all sites on Google, even though half are irrelevant, you're unnecessarily abusing the system. Why scan for Linux-based CGI vulnerabilities if the target applications or organization only implement Windows systems?
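The following minimal Perl sketch shows the sleep-timer idea from the first item in this list. The query list and three-second delay are illustrative assumptions; sendQuery() refers to the library routine developed later in this chapter.

#!/usr/bin/perl
# Hypothetical throttling loop: pause between transmissions so a batch
# of queries never hits Google at wire speed.
my @queries = ('/search?hl=en&q=dog', '/search?hl=en&q=cat');   # illustrative
foreach my $query (@queries) {
    # my $line = sendQuery($query);   # library routine from this chapter
    sleep(3);                         # 2-3 seconds between transmissions
}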

More information on Google lockouts can be found in the article located at www.bmedia.org/archives/00000109.php.

Google Terms and Conditions

The following are important links to Google's official terms and conditions as they pertain to this book and chapter:

■ Standard Searching Service Terms and Agreements

www.google.com/terms_of_service.html

■ Google API Service Terms and Agreements

www.google.com/apis/api_terms.html

Understanding the Google API

The Google API, or development kit, was created for programmers who want to interface with Google's online "googleplex" of data. The API is a fully supported set of API calls that can be accessed or leveraged in multiple languages. The most common language used to hook into the Google development API is Microsoft C# for .NET.

Unfortunately, you cannot simply read a document on the API set and begin to code. You must complete a few steps before you'll be able to utilize the Google API. As a quick note, do not bet on beating the system's limit of 1,000 queries per day. When you use the Google API, each query is accompanied by your Google API key. A Google-side database keeps track of each key's usage to ensure that, on any sliding 24-hour scale, a key is not used more than 1,000 times.

The following steps outline Googling as Google intended:

1. Download the development kit at www.google.com/apis/

2. Register to create a new Google API developer account:

■ www.google.com/accounts/NewAccount?continue=http://api.google.com/createkey&followup=http://api.google.com/createkey.

■ Be prepared to provide your e-mail address, which will end up being your username, and a secure password, as shown in Figure 12.3.

Note

You will be required to verify the supplied e-mail address before your account license will be created and sent to you.

After submission, you need to wait about 10 minutes for your Google API verification e-mail. This e-mail will be sent to your username/e-mail account. Simply click the supplied link and you will see a page similar to the one shown in Figure 12.4. Keep your Google license key (a lengthy string of uppercase and lowercase characters) handy. All tools written with the Google API will require it.

License Key Generated

We have generated a Google Web APIs license key and sent it to your email address.

Your license key provides you access to the Google Web APIs service and entitles you to 1,000 queries per day.

For more information, please visit our Getting Help page.

<< Return to Google Web APIs Home.

The last step before coding is to unzip the Google API download and start parsing through the example code and reading the documentation. If you are not familiar with Java or Microsoft C#, you might have serious issues creating a program that has the ability to access the Google API feature set. We recommend that you become familiar with one of those languages before you dive into the task of creating a program that implements the Google API. Also, keep the GoogleSearch.wsdl file from the API download handy. Most API applications require it.
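As a minimal sketch of what such a program looks like in Perl (the full dns-mine.pl example appears later in this chapter), the following assumes GoogleSearch.wsdl sits in the current directory and that a valid license key replaces the placeholder:

#!/usr/bin/perl
# Minimal Google API query via SOAP::Lite; key and query are placeholders.
use SOAP::Lite;

my $key     = "YOUR GOOGLE API KEY HERE";
my $service = SOAP::Lite->service('file:./GoogleSearch.wsdl');

# doGoogleSearch(key, q, start, maxResults, filter, restrict,
#                safeSearch, lr, ie, oe)
my $results = $service->doGoogleSearch($key, "dog", 0, 10,
                  "true", "", "true", "", "latin1", "latin1");
print "Estimated total results: ",
      $results->{estimatedTotalResultsCount}, "\n";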

Understanding a Google Search Request

The Google search parameters and formats differ slightly between the development API and the standard Web client search parameters. In this section we attempt to document the most commonly utilized, required, or requested search parameters that are transmitted through the development API. The parent Google API search parameters are listed in Table 12.1, with brief corresponding descriptions. Note that these match some of the URL parameters we covered in Chapter 1.

Table 12.1 Google API Search Parameters

Name         Description
filter       An extremely useful parameter designed to return only the most
             relevant link per major domain. For instance, if this parameter
             were set, you would not see more than one Web-based e-mail link
             for www.hotmail.com.
ie           This parameter is no longer supported.
key          This parameter is required when utilizing the Google Development
             API suite. It is utilized to authenticate to Google and track
             your queries.
lr           This parameter limits the results to a defined language, such as
             English, Chinese, or French.
maxResults   Sets the maximum results returned from a specific query. By
             default, the results are returned with 10 entries per page.
oe           This parameter is no longer supported.
q            This parameter is utilized to specify a specific query against
             Google.
restrict     This parameter limits the results to a potential subset of the
             entire results. For instance, a restriction could be set to
             return information only on the United Kingdom or pages written
             in German.
safeSearch   A Boolean parameter utilized to disallow "adult" content from
             being returned for a search request.
start        The index of the first desired result.

The Google API filter rule can help remove useless Google results. The description of the filter flag is included in Table 12.2. Expect additional Google flags to be added in 2005.

Table 12.2 Google API Filter Parameter

Flag     Description
filter   A Boolean parameter that utilizes two forms of response filtering.
         The first removes any similar results via a comparison algorithm
         (similar to diff); the second mechanism ensures that only one
         result comes from one parent domain.


Table 12.3 contains a comprehensive list of the language restrictions available for use within the Google Development API. These are extremely similar to the search request language parameters we discussed in Chapter 1.

Appendix C lists a directory of countries with their corresponding country restriction values that can be implemented or leveraged in the Google development API. These values are extremely useful in combination with language filters and can significantly narrow the results (filtering out, for example, pages written in Greek).

A major difference between the Web user interface and the Google API is the built-in topic restriction rules. For instance, if you wanted to filter results for Microsoft-related information only, you would execute your search from www.google.com/microsoft as opposed to setting the topic restriction flag equal to a value of microsoft. Table 12.4 contains a list of the Google topic restrictions and their corresponding values.

Table 12.4 Google API Topic Restrictions

Topic                        Value
FreeBSD                      bsd
Linux                        linux
Macintosh                    mac
Microsoft                    microsoft
United States government     unclesam

The full value of Google's API search capabilities is realized when you start to utilize API restriction parameter combinations. A set of operators exists to give you the ability to limit results utilizing Boolean and mathematical logic. The AND, OR, and NOT Boolean operators, described in Table 12.5, are fantastic for searching with language and country restrictions; the parentheses ( ) are ideal for encapsulating logic containing multiple operators or search terms.

Table 12.5 Google API Restriction Parameter Combinations

AND (operator: .)
    The AND operator is utilized to combine more than one restriction,
    thereby further limiting the results.
    Example: lang_es.countryMX limits results to responses from Mexican
    domains written in Spanish.

NOT (operator: -)
    The NOT operator is utilized to negate the value of a specified
    variable, or in Google's case, a search sequence.
    Example: -countryCU eliminates all sites generated in a request with a
    parent domain in Cuba.

OR (operator: |)
    The OR operator is utilized in a Boolean manner to state TRUE if one of
    two scenarios is TRUE.
    Example: countryCU|countryIQ allows only sites generated in a request
    with a parent domain in Cuba or Iraq.

Parentheses (operator: ( ))
    The parentheses should be used when you send multiple assignments to
    Google. Statements in parentheses are evaluated before statements
    outside parentheses.
    Example: -(lang_CU|lang_PL) eliminates any responses that were returned
    in Cuban or Polish.

Note

Google search parentheses are implemented only for the Google Development API; hence, they will not work within the regular search fields or with any other automated page-scraping techniques.
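To see where such a combination actually travels in code, here is a short sketch passing one of the Table 12.5 combinations through the restrict argument (the sixth parameter) of doGoogleSearch, reusing the $service handle and $key from the earlier SOAP::Lite sketch:

# Pass a restriction combination through the restrict argument.
my $restrict = "countryCU|countryIQ";    # Cuba OR Iraq, per Table 12.5
my $results  = $service->doGoogleSearch($key, "dog", 0, 10,
                   "true", $restrict, "true", "", "latin1", "latin1");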

Auto-Googling the Google Way

Utilizing the Google API to conduct automated Google searches is much easier from a development perspective than creating your own API set via manual response page scraping, since all the back-end code is already written for you. The included methods and properties put a vast list of variables at your development fingertips with the mere instantiation and use of a desired API object.

Google API Search Requests

The following is a list of the Google API parameters and properties exposed by the supplied methods. Each of these properties can be implemented to assist you in sending a Google API search request:


Reading Google API Results Responses

The following is a list of the Google API results that can be ascertained from the supplied methods. Each of these properties can be directly accessed once a Google search request has been successfully completed:

As we have discussed, the Google Development API comes with a slew of limitations. From a developer's perspective, some of these limitations are more apparent and devastating than others. For instance, the well-known 1,000-query-per-day limit restricts your ability to fully test your Google footprint; the maximum of 10 results per query likewise limits your ability to test or fingerprint the Internet for certain vulnerabilities. The full listing of Google API limitations as seen by Google Labs is displayed in Table 12.6.

Table 12.6 Google API Limitations

Component                                 Limitation
Search request length                     2,048 bytes
Maximum words utilized to form a query    10
Maximum sites (site:) in a query          1
Maximum results per query                 10
Maximum results                           1,000
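Because each request returns at most 10 results, walking a larger result set means stepping the start index in increments of 10, up to the 1,000-result ceiling. A sketch of that loop, again reusing the SOAP::Lite service handle and key from the earlier example:

# Page through results 10 at a time; both limits come from Table 12.6.
for (my $start = 0; $start < 1000; $start += 10) {
    my $results  = $service->doGoogleSearch($key, "dog", $start, 10,
                       "true", "", "true", "", "latin1", "latin1");
    my @elements = @{$results->{resultElements}};
    last unless @elements;               # ran out of results early
    print $_->{URL}, "\n" for @elements;
}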

Sample API Code

Before we dig into the API code, we must meet a few requirements that are common to most Perl-based Google querying scripts. These are the same requirements we covered in Chapter 4, but we'll list them again for convenience.

In order to use this tool, you must first obtain a Google API key from www.google.com/apis. Download the developer's kit, copying the GoogleSearch.wsdl file into the same directory as this script. Next, download and install the expat package from sourceforge.net/projects/expat. This installation will require a ./configure and a make, as is typical with most modern UNIX-based installers. This script also uses SOAP::Lite, which is easiest to install via CPAN. Simply run CPAN from your favorite flavor of UNIX and issue the following commands from the CPAN shell to install SOAP::Lite and various dependencies (some of which may not be absolutely necessary on your platform):

install LWP::UserAgent
install XML::Parser
install MIME::Parser
force install SOAP::Lite

This script was written by Roelof Temmingh of SensePost (www.sensepost.com). SensePost uses this tool as part of their footprinting process, which really accentuates the power of Google for reconnaissance purposes. For more information about their techniques, try Googling for sensepost tea or sensepost obvious. The first hit for each of these searches brings up two excellent papers that are a great read, filled with excellent information.

The script, called dns-mine.pl, is listed below:

#!/usr/bin/perl
#
# Google DNS name / sub domain miner
# SensePost Research 2003
# roelof@sensepost.com
#
# Assumes the GoogleSearch.wsdl file is in same directory
#

#Section 1
use SOAP::Lite;

if ($#ARGV<0){die "perl dns-mine.pl domainname\ne.g. perl dns-mine.pl cnn.com\n";}

my $company = $ARGV[0];

####### You want to edit these four lines: ##############
$key = "YOUR GOOGLE API KEY HERE";
@randomwords=("site","web","document","internet","link","about",$company);
my $service = SOAP::Lite->service('file:./GoogleSearch.wsdl');
my $numloops=3;    #number of pages - max 100
#########################################################

#Section 2
## Loop through all the words to overcome Google's 1000 hit limit
foreach $randomword (@randomwords){
    print "\nAdding word [$randomword]\n";

    #method 1
    my $query = "$randomword $company -www.$company";
    push @allsites,DoGoogle($key,$query,$company);

    #method 2
    my $query = "-www.$company $randomword site:$company";
    push @allsites,DoGoogle($key,$query,$company);
}

#Section 3
## Remove duplicates
@allsites=dedupe(@allsites);

print STDOUT "\n-----------\nDNS names:\n-----------\n";
foreach $site (@allsites){
    print STDOUT "$site\n";
}

#Section 4
## Check for subdomains
foreach $site (@allsites){
    my $splitter=".".$company;
    my ($frontpart,$backpart)=split(/$splitter/,$site);
    if ($frontpart =~ /\./){
        @subs=split(/\./,$frontpart);
        my $temp="";
        for (my $i=1; $i<=$#subs; $i++){
            $temp=$temp.(@subs[$i].".");
        }
        push @allsubs,$temp.$company;
    }
}

print STDOUT "\n------------\nSub domains:\n------------\n";
@allsubs=dedupe(@allsubs);
foreach $sub (@allsubs){
    print STDOUT "$sub\n";
}

#Section 5
############ subs ##########
sub dedupe{
    my (@keywords) = @_;
    my %hash = ();
    foreach (@keywords) {
        $_ =~ tr/[A-Z]/[a-z]/;
        chomp;
        if (length($_)>1){$hash{$_} = $_;}
    }
    return keys %hash;
}

#Section 6
sub parseURL{
    my ($site,$company)=@_;
    if (length($site)>0){
        if ($site =~ /:\/\/([\.\w]+)[\:\/]/){
            my $mined=$1;
            if ($mined =~ /$company/){
                return $mined;
            }
        }
    }
    return "";
}

#Section 7
sub DoGoogle{
    my ($GoogleKey,$GoogleQuery,$company)=@_;
    my @GoogleDomains="";
    for ($j=0; $j<$numloops; $j++){
        print STDERR "$j ";
        my $results = $service->doGoogleSearch($GoogleKey,$GoogleQuery,
            (10*$j),10,"true","","true","","latin1","latin1");
        my $re=(@{$results->{resultElements}});
        foreach my $results (@{$results->{resultElements}}){
            my $site=$results->{URL};
            my $dnsname=parseURL($site,$company);
            if (length($dnsname)>0){
                push @GoogleDomains,$dnsname;
            }
        }
        if ($re !=10){last;}
    }
    return @GoogleDomains;
}

Source Documentation

The Google_DNS_Mine Perl script utilizes the Google Development API through the Perl SOAP module. The script was created to identify and retrieve all of the subdomains and DNS names associated with a particular parent Web site. The links and strings retrieved would be extremely useful for anyone seeking to identify directories, CGI bins, or subdomains that could later be utilized or leveraged when penetration testing.

Section 1 is utilized to declare the variables and arrays for the script in addition to specifying the modules required. The second section of the script loops through the random-word engine, querying Google for multiple search terms. All sites and subdomains found within the response pages are then pushed onto an array (@allsites). The random words, company, and key variables were defined in Section 1.

The third section of the script was created for ease of use and educational purposes only. It serves two purposes. The first is to call the subfunction dedupe(), which removes duplicate sites from the array; the second is to print each unique site to STDOUT. The sites printed to STDOUT during this section are full strings that still contain the parent domain.

Section 4 splits the full strings retrieved from the Google responses down to the subdomains only. Once the subdomains are properly stripped and formatted, they are pushed onto the @allsubs array, then, in the same manner covered in Section 3, deduplicated and printed to STDOUT.

The fifth section contains the dedupe() function, which removes all of the duplicate entries. The passed array is copied into the @keywords array. Each keyword in the array is then converted to lowercase and the carriage return is removed. The entries are then stored as hash keys, and the unique keys are returned. The sixth section parses out the URL information from the returned Google strings. The parameters are split into a site variable and a company variable; the site string is first checked for length, and the company variable is later utilized to help slice the pertinent URL string before the "mined" string is returned.

The last section of this script contains the bulk of the Google API code required to execute the query on the remote system. The subfunction accepts the GoogleKey, GoogleQuery, and company variables. The my $results line executes the Google query utilizing the SOAP service and the corresponding method doGoogleSearch. The results are then parsed and pushed onto the @GoogleDomains array before being returned to the calling function.

When run, the tool launches multiple Google queries (built from the @randomwords list) that locate domain names and subdomains nested in Google result fields. These names and subdomains are output to the screen. For example, running the tool against Google.com produces the following output:


This tool provides excellent mapping data for a penetration test, and the results can be extended by increasing the $numloops variable.

Foundstone's SiteDigger

Kudos to the Foundstone consulting team for their slick Windows interface for assessing Web sites. Their tool "plays by the rules," since they require you to obtain a Google developer license key to power the scanning portion of the application. The upside to this method and to utilizing this tool is that you are doing no wrong (provided that you have permission to query-bang a site); the downside is that you are limited to 1,000 queries per day. As you can imagine, these 1,000 queries could go rather quickly if you were to scan more than one site or if you wanted to run multiple scans on an individual site. It is only a matter of time until the GoogleDork DB is larger than 1,000 queries. This tool can be downloaded from Foundstone's homepage at www.foundstone.com under the Resources link. Foundstone's SiteDigger Win32 interface is shown in Figure 12.5. Also consider the Wikto tool from SensePost (www.sensepost.com), which allows for Google searching and more specific Web server testing.

Understanding Google Attack Libraries

Google attack libraries refer to our (Google pen testers') code that has been created to aid in the development of, and education about, applications and tools that query the Google database, retrieve results, and scrape through those results. At the onset of this endeavor, we decided that we should first create a list of goals that we want our codebase to adhere to, as well as a list of challenges that we should acknowledge:

1. Execute queries against the Google database without using the Google Development API.

2. Retrieve specific results from the executed Google queries.

3. Parse and scrape through results to provide useful information to the calling program.

4. Utilize components in the particular implementations that use the inherent advantages of each language.

5. Code efficiently.

Pitfalls:

1. Inaccurate development could lead to poor results.

2. Avoid unstable response parsing that is too static to interpret atypical Google page responses.

3. Avoid lengthy or buggy socket code that utilizes too many socket connections or does not close them at the appropriate times.

4. Avoid poor query cannon development that will not handle complex or lengthy Google queries.

Pseudocoding

The concept of pseudocoding software or a tool before you start developing is something that is regularly taught in college courses as well as embraced in the commercial software development world. One popular form of this practice is creating a Unified Modeling Language (UML) diagram. UML is most commonly utilized in developing object-oriented software, but it can also be used to create even the smallest of tools. More common than UML, and its predecessor, is the ever-present graphical flowchart depicting the overarching processes and components that collectively make up an application.

One of our goals is to discuss different implementations for automating Google queries and the minute or large differences between the languages. Before we dive into the implementations, let's describe the overall process to achieve our Google Query Library goals in a software process flow diagram. See Figure 12.6.

The Google attack libraries are divided into five overarching categories that will commonly be included within all the different language implementations:

■ Socket initialization This is the first category, moving from left to right. Each of the different language implementations will create and establish a socket that will then be utilized to transfer and receive data from Google.

■ Send a Google request or query Following the arrows, this is the second milestone. Notice that submilestones not mentioned include ascertaining the query and formatting potential arguments within that query.

■ Retrieve the Google response generated from your query This response will contain several sets (or carriage-returned lines) of information; most important, it will include the total number of hits your query generated. Other bits of information that we are currently less interested in include Web sites and the full URLs for the responses.

■ Scrape or separate The fourth process will be to scrape or separate the useful, desired information from the less useful and commonly overwhelming amount of information that Google returns on the main pages in response to search requests. In this case, we will search for an "of about" string that precedes the total hits count for the page. It will act as a landmark for us, helping pinpoint the location of the total hits number.

■ Return the total number of hits Last but certainly not least, we will return the total number of hits that the query generated to the calling location within the script or program. This allows us to create flexible code that can be further extended at a later time or included within a larger pen-testing script or program.

Perl Implementation

The following Perl implementation has very little debug code and was created to depict how easy it is to automate custom querying of Google and page scraping of the returned Web pages. The code is divided into three main components. The first is a dump of the source, the second is the script's execution output, and the last is documentation of the script's logic and code implementation.

#!/usr/bin/perl -w
#Google Hacking in Perl
#Written by Foster

#Section 1
use IO::Socket;

#Section 2
$query  = '/search?hl=en&q=dog';
$server = 'www.google.com';
$port   = 80;

#Section 3
#############################
sub socketInit() {
    $socket = IO::Socket::INET->new(
        Proto    => 'tcp',
        PeerAddr => $server,
        PeerPort => $port,
        Timeout  => 10,
    );
    unless($socket) {
        die("Could not connect to $server:$port");
    }
    $socket->autoflush(1);
}

#Section 4
############################
sub sendQuery($) {
    my ($myquery) = @_;
    print $socket ("GET $myquery HTTP/1.0\n\n");
    while ($line = <$socket>) {
        if ($line =~ /Results.*of\sabout/) {
            return $line;
        }
    }
}

#Section 5
############################
sub getTotalHits($)
{
    my ($ourline) = @_;
    $hits  = "";
    $index = index($ourline, "of about");
    $str   = substr($ourline, $index, 30);
    @buf   = split(//, $str);
    for ($i = 0; $i < 30; $i++) {
        if ($buf[$i] =~ /[0-9]/) {
            $hits = $hits . $buf[$i];
        }
    }
    return $hits;
}

############################
#Section 6
socketInit();
$string    = sendQuery($query);
$totalhits = getTotalHits($string);

#Printing to STDOUT the Total Hits Retrieved from Google
print ($totalhits);

Output

When you execute the previous Perl script with the embedded Google attack libraries, you will receive the following standard output (STDOUT). The output represents the total number of Google pages that are returned for the submitted query:

%GABE%\ perl google_perl.pl
%GABE%\ 53400000

Source Documentation

The first section of this program, Section 1, contains the header information for the script: the path to the local Perl interpreter, along with the inclusion of the socket module.

Section 2 sets the three global variables that are required to test these Google attack libraries in a live example against Google.com. The first is the query that will be passed to the functions later down the line. If you need to automate these functions as part of a larger Google scanning application, this could be replaced with a looping mechanism that passes multiple queries to the Google attack library functions. The second variable stores Google's server address or domain name, and the third stores the corresponding port it resides on. We realize we could have hardcoded the port number to 80, but to make the code more flexible the variables are left dynamic.
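A sketch of that looping replacement, feeding a batch of illustrative queries through the same three library functions:

# Hypothetical driver reusing the library functions for several queries.
my @queries = ('/search?hl=en&q=dog', '/search?hl=en&q=cat');
foreach my $q (@queries) {
    socketInit();                 # fresh socket per request (HTTP/1.0)
    my $line = sendQuery($q);
    print "$q => ", getTotalHits($line), "\n";
    sleep(3);                     # per the self-governing ethics list
}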

The first function in our Perl example is socketInit. The initial part creates the socket structure with the corresponding protocol, server address, port, and socket timeout value. The TCP protocol was utilized, not HTTP; the HTTP request will be manually created and forced onto the wire. The unless block verifies the socket was established; if not, it exits the program with the die statement, printing an error message to the screen. The last line "autoflushes" the socket to prepare for data transmission.

The fourth section is the sendQuery function. This function requires one parameter: the query you want to run on Google. The parameter is stored in memory on the first line and saved to the local $myquery variable. The next line writes the HTTP request, containing the desired query, to the socket. The while loop is utilized to read in each line of the multiline Google response, one at a time. The encapsulated if statement is used to find the line that contains the total hit count by referencing an "of about" string that is always found on the Google results page. Once that line is identified, it is returned to the calling function.

Section 5 is the meat of the script, containing all the page-scraping code. It also takes in one parameter, stores it in memory, then saves it to the locally scoped variable $ourline. The global $hits variable is initialized and will later be used to store the total number of Google hits before it is returned. The index() line finds the numerical location of the string "of about", which sits right before the total hits on the response page of a Google query. The next line then utilizes the substr() function to grab 30 characters, starting at the index location. (The total hits number will be included as part of those 30 characters.) The looping construct underneath is then utilized to grab all digits from that string and store them in the $hits variable. Lastly, the $hits variable is returned to the calling location.

Section 6 comprises four main components. The first calls the socket initialization function. The second line is subdivided into two parts: the right side of the equal sign calls the sendQuery function with the desired query. In the case of a Google pen tester, this query could be a CGI scan, exploit search, or allinurl: vulnerability scan. Whatever the search, the response is saved in the $string variable. That $string variable is then passed to the getTotalHits function. The total number of hits is stored in the new $totalhits variable, then printed to standard out (STDOUT) via the last line of the program.

Python Implementation

The Python language proved extremely efficient in terms of the number of lines of code required. Not only was it easy to write due to the object-oriented nature of Python, but few actual lines of code were needed to obtain the results we were looking for. When you compare the Python code to the Perl code, you will undoubtedly notice a few key differences. For instance, in the Python code we strip out digits using a regular expression instead of parsing through a looping construct. The other major difference is that we have encapsulated our socket establishment code within try/except blocks. These blocks aid in exception handling and debugging if there is an error.

This was hands-down our favorite Google query library—two thumbs up for object-oriented scripting languages. Included in this example are our source, output, and source documentation.

Source

#Google Hacking in Python
#Written by Foster

#Section 1
import socket
import sys
import re      #Regular Expression Module

#Section 2
HOST = 'www.google.com'    # The remote host
PORT = 80                  # The same port as used by the server
s = None
query = "/search?hl=en&q=dog"

#Section 3
for res in socket.getaddrinfo(HOST, PORT, socket.AF_UNSPEC,
                              socket.SOCK_STREAM):
    af, socktype, proto, canonname, sa = res
    try:
        s = socket.socket(af, socktype, proto)
    except socket.error, msg:
        s = None
        continue
    try:
        s.connect(sa)
    except socket.error, msg:
        s.close()
        s = None
        continue
    break
if s is None:
    print 'could not open socket'
    sys.exit(1)

#Section 4
s.send("GET " + query + " HTTP/1.0\n\n")
myindex = 0
while myindex < 1:
    data = s.recv(8096)
    myindex = data.find("about")
s.close()

#Section 5
mysubstr = data[myindex : myindex + 30]
regexObj = re.compile('\d')
list = regexObj.findall(mysubstr)
totalHits = ''.join(list)
print totalHits

Output

The following output represents the corresponding total hits retrieved from Google:

53500000

Source Documentation

The first section of the Python script, Section 1, defines the modules that are required to run the script, using import statements to give the script access to particular objects and methods. Section 2 contains the four global variables we have become accustomed to declaring at the beginning of our examples: the socket object, host, port, and query variables.

The third section contains all our socket initialization code. It iterates over the address information for the host on line one. The two try/except blocks encapsulate the socket creation and connection code; if an except branch is executed, the socket is discarded and the next address is tried. If a socket could not be created at all, the debug message "could not open socket" is sent to STDOUT and the script exits.

Section 4 is utilized to both send the Google query and store the appropriate Google response. The first line of code writes the HTTP request to the socket.

The myindex variable is initially set to zero because it is utilized as our marker to determine when we have received the Google response line containing our total hits number. Since Google responses are sent in a series of text chunks, we must loop through each individually until the desired line is in the memory buffer. The while loop reads through the response, and once the "about" string is identified, data.find() sets the value of myindex to one or more, causing the loop to break. Lastly, the socket is closed.

The last section of this script is Section 5. The first line of code utilizes the index ascertained in Section 4 to grab a 30-character slice of the complete Google response; the total hits number is encapsulated within this 30-character string. The second line compiles a regular expression to identify all digits within a particular string. The findall method is then utilized to create a list of the digits within the slice. The list is then converted back to a string using the join method before being printed to STDOUT on the last line of the script.

Extending this script to scrape the sites included in Google's responses, or the specific URL hits contained in the response, is not terribly difficult; however, it does add another layer of complexity. We would only need to create a looping structure, then implement a regular expression engine to search out URL-like strings within the response page. Once they're retrieved, the option exists to print them to standard out or push them onto an array. Chapter 10 has more information on utilizing regular expressions within Google searches.
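Sketched against the chapter's Perl library (the same idea applies equally to this Python version), a hypothetical URL-scraping pass over a buffered response page might look like the following; the regular expression is a rough, illustrative pattern for URL-like strings, not an exhaustive one:

# Hypothetical extension: pull URL-like strings out of a response buffer.
sub getResultURLs {
    my ($page) = @_;    # full response page as one string
    my @urls;
    while ($page =~ m{(http://[\w.-]+(?:/[^\s"<>]*)?)}g) {
        push @urls, $1;
    }
    return @urls;       # print these or push them onto a results array
}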

C# Implementation (.NET)

C#, pronounced C sharp, is a much different beast when it comes to implementing Google attack libraries within applications or automated penetration testing tools. First, the entire language was created in an object-oriented manner for object-oriented programming (OOP) developers. As you will see in our code demonstration, the concept of an attack function utilized in the Perl example no longer exists. Instead, we have created a .NET C# object that contains the functionality for auto-querying Google, scraping the page results, then returning the number of total hits for any specified query. Since this example produces the same output as the Perl example, we have omitted that section and provide only the source along with its documentation.

GOOGLE_CSHARP.CS SOURCE

//Google Hacking in C#
//Written by the master BW

using System;
using System.Text;
using System.Text.RegularExpressions;
using System.Net;
using System.Net.Sockets;

namespace ConsoleApplication2
{
    class GoogleQuery
    {
        //Required Socket Variables
        private const string query = "/search?hl=en&q=dog";
        private const string server = "www.google.com";
        private const int port = 80;
        private Socket socket;

        //Method #1
        public void SocketInit()
        {
            socket = new Socket(AddressFamily.InterNetwork, SocketType.Stream,
                ProtocolType.Tcp);
            IPHostEntry ipHostInfo = Dns.Resolve(server);
            IPAddress ipAddress = ipHostInfo.AddressList[0];
            socket.Connect(new IPEndPoint(ipAddress, port));
        }

        //Method #2
        public void SendQuery()
        {
            socket.Send(ASCIIEncoding.ASCII.GetBytes(
                string.Format("GET {0} HTTP/1.0\n\n", query)));
        }

        //Method #3
        public string GetTotalHits()
        {
            // receive the total page
            byte[] buffer = null;
            byte[] chunk = new byte[4096];
            try
            {
                while (socket.Receive(chunk) > 0)
                {
                    byte[] tmp =
                        new byte[(buffer == null ? 0 : buffer.Length) + chunk.Length];
                    if (buffer != null)
                        buffer.CopyTo(tmp, 0);
                    chunk.CopyTo(tmp, buffer != null ? buffer.Length : 0);
                    buffer = tmp;
                }
            }
            catch
            {
                if (buffer == null)
                    throw new Exception("No data read from host");
            }

            // find the total hits
            string text = System.Text.ASCIIEncoding.ASCII.GetString(buffer);
            Regex regex = new Regex(@"of about <b>(?<count>[0-9,]+)</b>");
            Match m = regex.Match(text);
            if (m.Success == false)
                throw new Exception("Parse error");
            return m.Groups["count"].Value;
        }
    }

    /// <summary>
    /// Summary description for Class1.
    /// </summary>
    class AppClass
    {
        /// <summary>
        /// The main entry point for the application.
        /// </summary>
        [STAThread]
        static void Main(string[] args)
        {
            GoogleQuery gq = new GoogleQuery();
            gq.SocketInit();
            gq.SendQuery();
            Console.WriteLine("Total Hits {0}", gq.GetTotalHits());
        }
    }
}

Source Documentation

The code for the Google C# application is much different from that of the Perl script because it is object oriented, with the logic located in a single object as opposed to standalone functions. Initially, we create a new object that is responsible for the core of our functionality. This new object allows us to easily reuse our code in other projects or in applications that attempt to wrap or further automate the Google querying process. The name of the object we have created is GoogleQuery. GoogleQuery has three public methods that we're interested in: SocketInit, SendQuery, and GetTotalHits.

The GoogleQuery class has three private constant variables: string query, string server, and int port. These store the program's required values for instantiating and establishing the socket connection. GoogleQuery's SocketInit method creates a new TCP socket via the Socket object's constructor. Following the creation of the TCP socket, it looks up the IP address of google.com by means of the static, built-in C# method Dns.Resolve. Dns.Resolve returns an object of type IPHostEntry. The IP address of google.com can be extracted from this object by referencing the first index of the AddressList member of IPHostEntry (ipHostInfo.AddressList[0]). Next, the code creates an object of type IPEndPoint and passes two arguments to its constructor: the IP address gleaned from IPHostEntry and the port number to connect to. This IPEndPoint object is then passed as an argument to the socket object's Connect method. Should all this succeed, the socket is connected to google.com's port 80. If it fails, an exception will be thrown; however, due to the demonstrative nature of this example, error handling has been omitted from the program.

GoogleQuery's SendQuery method is rather simple: it merely writes an HTTP GET request string to the established Google socket. One thing to note is that Socket.Send expects a byte array rather than an ASCII string. For that reason, we need to convert the ASCII string to a byte array using the ASCIIEncoding.ASCII.GetBytes static method.

The last method of interest, Method 3, is GetTotalHits. The first portion of the code waits until all data is received from the socket and concatenates it into one buffer. This code uses the method Socket.Receive, which fills a byte array. The last segment of interesting code is the utilization of .NET regular expressions. First, we instantiate a Regex object and pass it one parameter—the pattern to search for. The pattern string consists of the literal phrase "of about" followed by a named group, count, whose pattern matches a number. By naming the components of a regular expression, it becomes easier to reference them after the pattern has been matched (m.Groups["count"].Value). Next, the buffer returned from Google is passed to the Regex object via the Match method. After that, if the pattern matches, a string is returned that contains the number of hits found for the query.

Where Credit Is Due

A special thank you goes out to Blake Watts (www.blakewatts.com) for his assistance with the C# code and knowledge. You continue to rock. Thanks, dude!

C Implementation

The following C implementation was provided by our friend l0om to be utilized as an educational tool in this book. As you will quickly come to see, the C implementation is somewhat different from the other language implementations described in this chapter. Not only is this implementation longer, it includes additional functionality that the other language kits have left out: command-line help documentation and the ability to receive command-line arguments and return a list of sites included within the response. Only the complete source and corresponding documentation have been incorporated into this section.

SOURCE

//Google Hacking in Good Old-Fashioned C
//Written by l0om
//Revised and Documented by Foster

/*
lgool V 0.2
written by l0om
WWW.EXCLUDED.ORG - l0om[a7]excluded[d07]org

idea based on johnny longs gooscan and google dorking itself. thanks john.

this is a part of a proof-of-concept project in automating attacks with
googles help.

greets to googlemasters:
murfie, klouw, ThePsyko, jimmyneutron, MILKMAN, Deadlink, crash_monkey,
zoro25, cybercide, wasabi

greets to geeks/freaks/nice_people like:
proxy, detach, takt, dna, maximilan, capt.boris, dr.dohmen, mattball
*/

#Section 1

#include

#include

#include

#include

#include

#include

#include

#Section 2

#define GOOGLE "www.google.com" //default google server to send query

#define PATTERN >" //show results

char *encode(char *str); // NULL on failure / the encoded query on success

int connect_me(char *dest, int port); // -1 on failure / connected socket on success

int grep_google(char *host, int port, int proxy, char *query, int mode, int start);

void help(char *usage); void header(void);

#Section 3

int main(int argc, char **argv) {

int i, port, valswap, max = 0, only_results = 0, site = 0, proxl = 0; // greets at proxy - this variable is dedicated to you ;D h4h4h4 char *host, *query = NULL;

if(argc == 1) { help(argv[0]); return(1);

} else for(i = 1; i < argc; if(argv[i][0] == '-')

switch(argv[i][1]) { case 'V:

header(); return(0); case 'r':

only_results = 1; break; case 'm':

max = atoi(argv[++i]); break; case 'p':

if( (host = strchr(argv[++i], ':')) == NULL) { fprintf(stderr, "illegal proxy syntax

[host:port]\n");

return(1);

}

port = atoi(strtok(host, ":")); host = strtok(argv[i], ":"); proxl = 1; // "gib frei ich will rein" break; case 'h':

help(argv[0]); return(0); } else query = argv[i];

if(query == NULL) {

fprintf(stderr, "no query!\n");

help(argv[0]); return(1);

}

if( (query = encode(query)) == NULL) {

fprintf(stderr, "string encoding faild!\n"); return(2);

}

if(!max) {

if(grep_google(host, port, proxl, query, only_results, site) > 0) return(0);

else return(1);

}

for(i = 0; i < max; )

if( (valswap = grep_google(host, port, proxl, query, only_results, site)) <= 0) return(1);

else if(valswap < 10) return(0);

else { i+=valswap; site+=10; }

return(0);

}

#Section 4
int grep_google(char *host, int port, int proxl, char *query, int mode, int site)
{
    unsigned int results = 0;
    int sockfd, nbytes, stdlen = 31, prxlen = 38+strlen(GOOGLE), buflen = 100;
    char *sendthis, *readbuf, *buffer, *ptr;

    if(proxl) {
        if( (sockfd = connect_me(host, port)) == -1) // connect to proxy
            return(-2);
        if( (sendthis = (char *)malloc(prxlen+strlen(query)+7)) == NULL) {
            perror("malloc");
            return(-1);
        } else sprintf(sendthis,
            "GET http://%s/search?q=%s&start=%d HTTP/1.0\n\n",GOOGLE,query,site);
    } else {
        if( (sockfd = connect_me(GOOGLE, 80)) == -1)
            return(-2);
        if( (sendthis = (char *)malloc(stdlen+strlen(query)+7)) == NULL) {
            perror("malloc");
            return(-1);
        } else sprintf(sendthis,
            "GET /search?q=%s&start=%d HTTP/1.0\n\n",query,site);
    }

    if( (readbuf = (char *)malloc(255)) == NULL) {
        perror("malloc");
        return(-1);
    } else memset(readbuf, 0x00, 255);  // start with a clean, terminated buffer
    if( (buffer = (char *)malloc(1)) == NULL) {
        perror("malloc");
        return(-1);
    } else buffer[0] = '\0';            // terminate so strcat() starts clean
    if(send(sockfd, sendthis, strlen(sendthis),0) <= 0)
        return(-2);

    while( (nbytes = read(sockfd, readbuf, 254)) > 0) {
        if( (buffer = (char *)realloc(buffer, buflen+=nbytes)) == NULL) {
            perror("realloc");
            return(-1);
        } else {
            strcat(buffer, readbuf);
            memset(readbuf, 0x00, 255);
        }
    }
    close(sockfd);

    ptr = buffer;
    while(buflen--)
        if(mode) {
            if(memcmp(ptr++, RESULTS, strlen(RESULTS)) == 0) {
                ptr += strlen(RESULTS)-1;
                while(memcmp(ptr, "for", 3) != 0) {
                    if(memcmp(ptr, "<b>", 3) == 0) ptr+=3;
                    else if(memcmp(ptr, "</b>", 4) == 0) ptr+=4;
                    else printf("%c",*ptr++);
                }
            } else continue;
            printf("\n");
            return(0);
        } else
            if(memcmp(ptr++, PATTERN, strlen(PATTERN)) == 0) {
                ptr += strlen(PATTERN)-1;
                results++;
                while(memcmp(ptr, ">", 1) && buflen--) printf("%c",*ptr++);
                printf("\n");
            }

    free(sendthis);
    free(readbuf);
    return(results);
}

#Section 5
char *encode(char *str)
{
    static char *query;
    char *ptr;
    int nlen, i;

    nlen = strlen(str)*3;
    if( (query = (char *)malloc(nlen+1)) == NULL) {
        perror("malloc");
        return(NULL);
    } else ptr = str;

    for(i = 0; i < nlen; i+=3)
        sprintf(&query[i], "%c%X", '%', *ptr++);
    query[nlen] = '\0';
    return(query);
}

#Section 6
int connect_me(char *dest, int port)
{
    int sockfd;
    struct sockaddr_in servaddr;
    struct hostent *he;

    if( (sockfd = socket(AF_INET, SOCK_STREAM, 0)) == -1) {
        perror("socket");
        return(-1);
    }
    if( (he = gethostbyname(dest)) == NULL) {
        fprintf(stderr, "cannot resolve hostname\n");
        return(-1);
    }

    servaddr.sin_addr = *((struct in_addr *) he->h_addr);
    servaddr.sin_port = htons(port);
    servaddr.sin_family = AF_INET;

    if(connect(sockfd, (struct sockaddr *)&servaddr, sizeof(struct sockaddr)) == -1) {
        perror("connect");
        return(-1);
    } else return(sockfd);
}

#Section 7
void help(char *usage)
{
    printf("%s help\n", usage);
    printf("%s <query> [options]\n", usage);
    puts("options:");
    puts("-h: this help menu");
    puts("-p: request google with a proxy. next argument must be the proxy");
    puts("    and the port in the following format \"host:port\"");
    puts("-m: next argument must be the count of results you want to see");
    puts("-V: prints versions info");
    puts("-r: prints only the results count and exit");
    puts("examples:");
    printf("%s \"filetype:pwd inurl:service.pwd\" -r     # show results\n", usage);
    printf("%s \"filetype:pwd inurl:service.pwd\" -m 30  # print about 30 results\n", usage);
}

#Section 8
void header(void)
{
    puts("\tlgool V 0.2");
    puts("written by l0om - WWW.EXCLUDED.ORG - l0om[a7]excluded[d07]org\n");
}

Source Documentation

The first section of this program (yes, it's a program, not a script) pulls in the libraries that must be included for successful compilation. The second section includes the global defines needed by the program and the function prototypes.

Section 3 is the main() function of the program, whereas the fourth section is dedicated to "grepping the Google site." Section 4 contains the meat of the program because the searching and proxying logic is included within that function.

Section 5 is somewhat different from our scripting query libraries or even the C# implementation. It is utilized to convert the desired search string into an HTTP-compliant, URL-encoded Google query string. Notice the conversion housed within the for loop. Once the string is properly formatted, it is returned.

The sixth section is one of our favorites because it is similar to the socket initialization functions within the other Google attack libraries. All the code to establish and connect a socket to Google is contained in connect_me(). The socket creation and connection attempts are encapsulated in if statements. (An alternative to if statements, in languages that support them, is try/catch blocks.) The seventh section of the program prints the help menu. Last but not least, Section 8 is a header that prints every time the program is executed.

Scanning the Web with Google Attack Libraries

We've covered the concept of automating Google query transmissions and retrieving data, but we have yet to prove that our libraries work in a real-world environment. The libraries were all created with dynamic usage in mind, thereby permitting our querying bots to reuse the Google query and scraping code with minimal inline modifications. The following tool leverages the attack signatures found in the NIKTO security database, which can be found at www.cirt.net.

CGI Vulnerability Scanning

The following is a CGI scanner that we created by quickly extending the Perl implementation code. Before we display and document our source, a snippet of the NIKTO database has been included. The NIKTO database is a flat text file whose fields are separated by commas (,). In this scenario, we are only concerned with the HTTP string that is meant to be sent to the target Web servers.

It is critical to note that the NIKTO text-based database is completely broken from a consistency perspective. That said, every "attack" is listed in the second column of the file, and by no coincidence that is the field that we are ripping with our Google CGI vulnerability scanning tool.

NIKTO Vulnerability Database Snippet

#VERSION,1.189
#LASTMOD,09.06.2004
# http://www.cirt.net
########################################################################
# Checks: ws type,root,method,file,result,information,data to send
########################################################################
## These are normal tests
"generic","/index.php?module=ew_filemanager&type=admin&func=manager&pathext=../../../etc","passwd","GET","EW FileManager for PostNuke allows arbitrary file retrieval. OSVDB-8193."
"generic","/index.php?module=ew_filemanager&type=admin&func=manager&pathext=../../../etc/&view=passwd","root:","GET","EW FileManager for PostNuke allows arbitrary file retrieval. OSVDB-8193."
"generic","/logs/str_err.log","200","GET","Bmedia error log, contains invalid login attempts which include the invalid usernames and passwords entered (could just be typos & be very close to the right entries)."
"abyss","/%5c%2e%2e%5c%2e%2e%5c%2e%2e%5c%2e%2e%5cwinnt%5cwin.ini","[fonts]","GET","Abyss allows directory traversal if %5c is in a URL. Upgrade to the latest version."
"abyss","/%5c%2e%2e%5c%2e%2e%5c%2e%2e%5c%2e%2e%5cwinnt%5cwin.ini","[windows]","GET","Abyss allows directory traversal if %5c is in a URL. Upgrade to the latest version."
"abyss","////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////","index of","GET","Abyss 1.03 reveals directory listing when 256 /'s are requested."
"abyss","/conspass.chl+","200","GET","Abyss allows hidden/protected files to be served if a + is added to the request."
"abyss","/consport.chl+","200","GET","Abyss allows hidden/protected files to be served if a + is added to the request."
"abyss","/general.chl+","200","GET","Abyss allows hidden/protected files to be served if a + is added to the request."
"abyss","/srvstatus.chl+","200","GET","Abyss allows hidden/protected files to be served if a + is added to the request."
"alchemyeye","@CGIDIRS../../../../../../../../../../WINNT/system32/ipconfig.exe","IP Configuration","GET","Alchemy Eye and Alchemy Network Monitor for Windows allow attackers to execute arbitrary commands."
"alchemyeye","@CGIDIRSNUL/../../../../../../../../../WINNT/system32/ipconfig.exe","IP Configuration","GET","Alchemy Eye and Alchemy Network Monitor for Windows allow attackers to execute arbitrary commands."
"alchemyeye","@CGIDIRSPRN/../../../../../../../../../WINNT/system32/ipconfig.exe","IP Configuration","GET","Alchemy Eye and Alchemy Network Monitor for Windows allow attackers to execute arbitrary commands."
"apache","/.DS_Store","Bud1","GET","Apache on Mac OSX will serve the .DS_Store file, which contains sensitive information. Configure Apache to ignore this file or upgrade to a newer version."
"apache","/.FBCIndex","Bud2","GET","This file on OSX contains the source of the files in the directory. http://www.securiteam.com/securitynews/5LP0O0005FS.html"
"apache","//","index of","GET","Apache on Red Hat Linux release 9 reveals the root directory listing by default if there is no index page."
"apache","//","not found for:","OPTIONS","By sending an OPTIONS request for /, the physical path to PHP can be revealed."

The following is our developed source code to scan a particular site using the signatures housed within CIRT's NIKTO database.

SOURCE
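
The full listing is not reproduced here; what follows is a minimal sketch of how such a scanner fits together from the pieces described above: rip the second field from each NIKTO record, wrap it in an allinurl: query, and page-scrape Google's response. The database file name, the helper name, and the "no results" marker string are assumptions for illustration, not the chapter's exact source.

#!/usr/bin/perl
# Minimal sketch of a NIKTO-driven Google CGI scanner (illustrative only).
# Assumes the NIKTO database is saved as scan_database.db in the current
# directory; the book's own listing differs (and emits the split warnings
# discussed in the Output section below, which this sketch guards against).
use strict;
use warnings;
use IO::Socket::INET;

my $site = shift or die "Usage: $0 target-domain\n";

open my $db, '<', 'scan_database.db' or die "Cannot open NIKTO database: $!";
while (my $line = <$db>) {
    chomp $line;
    next if $line =~ /^\s*#/;                # skip comments and headers
    my @fields = split /","/, $line;         # quoted, comma-separated fields
    next unless @fields >= 2;
    my $attack = $fields[1];                 # second field holds the attack
    my $query  = qq{allinurl:$attack site:$site};
    print "Possible hit: $attack\n" if google_has_results($query);
}
close $db;

# Send one query over a plain HTTP socket, as the chapter's Perl library
# does, then page-scrape the response for Google's "no results" marker.
sub google_has_results {
    my ($query) = @_;
    $query =~ s/([^A-Za-z0-9])/sprintf("%%%02X", ord($1))/ge;   # URL-encode
    my $sock = IO::Socket::INET->new(
        PeerAddr => 'www.google.com',
        PeerPort => 80,
        Proto    => 'tcp',
    ) or return 0;
    print $sock "GET /search?q=$query HTTP/1.0\r\n",
                "Host: www.google.com\r\n",
                "User-Agent: Mozilla/4.0 (compatible)\r\n\r\n";
    my $page = do { local $/; <$sock> };     # slurp the entire response
    close $sock;
    return 0 unless defined $page;
    return $page !~ /did not match any documents/;
}

Remember Google's terms of service: a real run should throttle these queries rather than fire one per database record.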

Output

First, you will notice warnings when you run this script. They appear because we split each NIKTO database record into separate variables but use only the second, $attack. There is no need for concern; the warnings are expected.

The script runs all the NIKTO vulnerability checks as a series of Google queries and prints output whenever a vulnerability is found in Google's cache. If no vulnerabilities are found, nothing is displayed beyond those warnings.

Summary

In any implementation, automating information-gathering techniques has become a necessary evil. We would never have the time required to manually collect, store, parse, and analyze data from sources as large as Google. Throughout this chapter, we have provided an overview of the Google Development API with its benefits and downfalls. We have also given you the code and knowledge to directly access the Google Web application database with our Google attack libraries, which contain query transmission and page-scraping functions. These libraries can be quickly extended to create additional tools, applications, or even Web-based CGI forms. Although beneficial, it is important to note that these libraries do not adhere to the Google terms of service and are meant for educational purposes only.
Solutions Fast Track

Understanding Google Search Criteria

In a relatively short amount of time, Google has become synonymous with Internet searching. Learning to search Google's online database with its advanced flags is the key to successful Web surfing.

Advanced searching permits users—and more specifically, automated programs—to filter and limit the results to a much narrower set of Web pages.

The Google Advanced Search page documents most of the detailed searching capabilities of Google's database, including country, language, and image searching.

Understanding the Google API

The Google API is designed for application developers looking to automate the collection of Google information in a sanctioned manner.

A complete manual on the Google development API can be found at www.google.com/apis/.

The Google API requires a Google API key that limits an automated engine to sending fewer than 1,000 queries per day (a minimal API query sketch follows this list).
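
For comparison with the page-scraping approach, here is a minimal sketch of a sanctioned query through the SOAP interface via Perl's SOAP::Lite. The key value is a placeholder, and the parameter order follows the doGoogleSearch documentation of the time; treat this as an illustration, not the chapter's exact code.

#!/usr/bin/perl
# Hedged sketch of a sanctioned Google API (SOAP) query via SOAP::Lite.
use strict;
use warnings;
use SOAP::Lite;

my $key    = 'INSERT-GOOGLE-API-KEY-HERE';   # placeholder; obtain from Google
my $google = SOAP::Lite->service('http://api.google.com/GoogleSearch.wsdl');

# doGoogleSearch(key, query, start, maxResults, filter, restrict,
#                safeSearch, lr, ie, oe)
my $result = $google->doGoogleSearch($key, 'allinurl:admin.cgi', 0, 10,
                                     'false', '', 'false', '', 'latin1',
                                     'latin1');
print "Estimated results: $result->{estimatedTotalResultsCount}\n";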

Understanding Google Attack Libraries

Google attack libraries are broken into three main components: socket initialization and establishment, Google query requesting, and retrieving a Google query response.

The Python language proved the most useful and efficient for creating automated Google query code. Its OOP style, easily accessible regular expression engine, and indexing methods made it easy to create, send, retrieve, and scrape Google information.

The C# (Microsoft .NET) library is the most extendable language implementation of our Google libraries because it can be merged into any program that's compatible with Microsoft's Visual Studio .NET.

Scanning the Web with Google Attack Libraries

Conducting Google vulnerability scans is one of the easiest tasks that's hit the information security industry in the past few years. The key to automating such a task is the looping constructs that wrap around the library implementations presented in this chapter.

You can implement looping constructs to automate searching and information retrieval for numerous purposes.

Nearly all vulnerability scans utilize the allinurl: advanced searching flag to search for strings stored within the Google cache.

Frequently Asked Questions

The following Frequently Asked Questions, answered by the authors of this book, are designed both to measure your understanding of the concepts presented in this chapter and to assist you with real-life implementation of these concepts. To have your questions about this chapter answered by the author, browse to www.syngress.com/solutions and click on the "Ask the Author" form. You will also gain access to thousands of other FAQs at ITFAQnet.com.

Q: Can you automate Google analysis in languages that do not contain socket-class functionality?

A: No. Unfortunately, the initial part of any Google-based data analysis is retrieving such data. The socket, or network, functionality is required to connect to Google's databases to send queries and receive responses. That said, it should be understood that an external program could pass Google data to another program for analysis.

Q: Does the Google API interfere with our page-scraping mechanisms?

A: No. The Google API was created to assist developers looking to access information ascertained from Google's search engine. Though Google does not condone automation outside the use of the API, page scraping is completely acceptable, as long as the page was retrieved using a browser. Scraping and API-based techniques can certainly coexist, depending on the requirements of your project.

Q: What language is best to use for Google page scraping?

A: It completely depends on the nature of the program you're creating. If you are looking to create an application that sends numerous Google queries and conducts some sort of algorithmic computation on the back end, you'd benefit from a faster language such as C/C++ or C# (C# being our new favorite). However, if you're looking for a quick alternative that integrates into Web scripts, Perl is the obvious choice for ease of development and time to integration. Java is the de facto cross-platform language of choice, but something prevents us from saying that VBA is a good choice for anything.

Q: Do any of the available freeware tools currently use these libraries?

A: Not in their entirety. However, some of the Perl code has been utilized to update GooScan. All the code provided in this book, on ApplicationDefense, and at Ihackstuff is freely available to use and distribute as long as proper attribution is provided.

Q: Is HTTP 1.0 versus HTTP 1.1 a major decision when considering which protocol to use to transmit the queries?

A: Yes. HTTP 1.1 is much more efficient for transmitting multiple sequences of packets to a Web server. In this case, however, the libraries do not take advantage of the HTTP 1.1 protocol, which makes the decision trivial.
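
As a hypothetical sketch (not part of the chapter's libraries), here is what taking advantage of HTTP 1.1 would look like: several queries share one persistent connection instead of opening a new socket per request.

# Hypothetical sketch: several queries over one persistent HTTP/1.1 socket.
# The chapter's libraries instead open a fresh HTTP/1.0 socket per query.
use strict;
use warnings;
use IO::Socket::INET;

my $sock = IO::Socket::INET->new(
    PeerAddr => 'www.google.com',
    PeerPort => 80,
    Proto    => 'tcp',
) or die "connect: $!";

for my $q ('allinurl%3Aadmin.cgi', 'allinurl%3Apasswd') {  # pre-encoded
    print $sock "GET /search?q=$q HTTP/1.1\r\n",
                "Host: www.google.com\r\n",
                "Connection: keep-alive\r\n\r\n";
    # A real client must read exactly one response here (honoring
    # Content-Length or chunked encoding) before reusing the socket.
}
close $sock;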

Q: Can any of this code be leveraged to proxy anonymous attacks through Google?

A: Outside of the socket code, nothing could be utilized to proxy attacks. A paper was released in 2001 on making Web attacks anonymous through open Web proxies. We encourage you to search for the paper via Google if you're seeking to gain experience.
