Saturday, December 5, 2009

Advanced Operators

Introduction

Beyond the basic searching techniques explored in the previous chapter, Google offers special terms known as advanced operators to help you perform more advanced queries. These operators, when used properly, can help you get to exactly the information you're looking for without spending too much time poring over page after page of search results. When advanced operators are not provided in a query, Google will locate your search terms in any area of the Web page, including the title, the text, the URL, or the like. We take a look at the following advanced operators in this chapter:

■ intitle, allintitle

■ inurl, allinurl

■ filetype

■ allintext

■ site

■ link

■ inanchor

■ daterange

■ cache

■ info

■ related

■ phonebook

■ rphonebook

■ bphonebook

■ author

■ group

■ msgid

■ insubject

■ stocks

■ define

Operator Syntax

An advanced operator is nothing more than a part of a query. You provide advanced operators to Google just as you would any other query. In contrast to the somewhat free-form style of standard Google queries, however, advanced operators have a fairly rigid syntax that must be followed. The basic syntax of a Google advanced operator is operator:search_term. When using advanced operators, keep in mind the following:

■ There is no space between the operator, the colon, and the search term. Violating this syntax can produce undesired results and will keep Google from understanding the advanced operator. In most cases, Google will treat a syntactically bad advanced operator as just another search term. For example, providing the advanced operator intitle without a following colon and search term will cause Google to return pages that contain the word intitle.

■ The search term follows the same syntax as the search terms we covered in the previous chapter. For example, you can provide as a search term a single word or a phrase surrounded by quotes. If you provide a phrase as the search term, make sure there are no spaces between the operator, the colon, and the first quote of the phrase.

■ Boolean operators and special characters (such as OR and +) can still be applied to advanced operator queries, but be sure not to place them in the way of the separating colon.

■ Advanced operators can be combined in a single query as long as you honor both the basic Google query syntax and the advanced operator syntax. Some advanced operators combine better than others, and some simply cannot be combined. We will take a look at these limitations later in this chapter.

■ The ALL operators (the operators beginning with the word ALL) are oddballs. They are generally used once per query and cannot be mixed with other operators.

Examples of valid queries that use advanced operators include these:

■ intitle:Google This query will return pages that have the word Google in their title.

■ intitle:"index of" This query will return pages that have the phrase index of in their title. Remember from the previous chapter that this query could also be given as intitle:index.of, since the period serves as any character. This technique also makes it easy to supply a phrase without having to type the spaces and the quotation marks around the phrase.

■ intitle:"index of" private This query will return pages that have the phrase index of in their title and also have the word private anywhere in the page, including in the URL, the title, the text, and so on. Notice that intitle only applies to the phrase index of and not the word private, since the first unquoted space follows the index of phrase. Google interprets that space as the end of your advanced operator search term and continues processing the rest of the query.

■ intitle:"index of" "backup files" This query will return pages that have the phrase index of in their title and the phrase backup files anywhere in the page, including the URL, the title, the text, and so on. Again, notice that intitle only applies to the phrase index of.
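The syntax rules above can be made concrete with a small tokenizer. This is only a sketch of the rules as described in this chapter, not Google's actual parser; the function name parse_query and the KNOWN_OPERATORS set are our own illustrative choices.

```python
import re

# Operators listed at the start of this chapter (illustrative set).
KNOWN_OPERATORS = {
    "intitle", "allintitle", "inurl", "allinurl", "filetype", "allintext",
    "site", "link", "inanchor", "daterange", "cache", "info", "related",
    "phonebook", "rphonebook", "bphonebook", "author", "group", "msgid",
    "insubject", "stocks", "define",
}

# A token is an optional "word:" prefix glued to a quoted phrase or a
# run of non-space characters. An unquoted space ends the token.
TOKEN = re.compile(r'(?:\w+:)?(?:"[^"]*"|\S+)')


def parse_query(query):
    """Split a query into (operator, term) pairs; operator is None for plain terms."""
    pairs = []
    for match in TOKEN.finditer(query):
        token = match.group(0)
        head, colon, tail = token.partition(":")
        if colon and tail and head.lower() in KNOWN_OPERATORS:
            operator, term = head, tail
        else:
            # No colon, nothing after the colon, or an unknown prefix:
            # the whole token is treated as just another search term.
            operator, term = None, token
        if len(term) >= 2 and term[0] == term[-1] == '"':
            term = term[1:-1]  # drop the surrounding quotes of a phrase
        pairs.append((operator, term))
    return pairs


print(parse_query('intitle:"index of" private'))
print(parse_query('intitle index'))
```

In this model, intitle:"index of" private yields one operator-scoped phrase and one ordinary term, while intitle index (no colon) yields two ordinary terms, which mirrors the behavior described in the bullets above.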



Troubleshooting Your Syntax

Before we jump head first into the advanced operators, let's talk about troubleshooting the inevitable syntax errors you'll run into when using these operators. Google is kind enough to tell you when you've made a mistake, as shown in Figure 2.1.

In this example, we tried to give Google an invalid option to the as_qdr variable in the URL. (The correct syntax would be as_qdr=m3, as we'll see in a moment.) Google's search result page listed right at the top that there was some sort of problem. These messages are often the key to unraveling errors in either your query string or your URL, so keep an eye on the top of the results page. We've found that it's easy to overlook this spot on the results page, since we normally scroll past it to get down to the results.

Sometimes, however, Google is less helpful, returning a blank results page with no error text, as shown in Figure 2.2.

Fortunately, this type of problem is easy to resolve once you understand what's going on. In this case, we didn't provide Google with a search query. We restricted our search to only PDF files (we'll look at filetype in more detail later in this chapter), but we failed to provide anything to search for. Subtracting results from zero results gets Google all confused, resulting in a blank page.

But That's What I Wanted!

Sometimes you actually want to get results for a search query you know is going to cause problems, such as filetype:pdf. It seems reasonable that this query would return every PDF file that Google has crawled, but it simply doesn't. In cases like this, you just need to be a bit creative. To get a list of every PDF file, try a query like filetype:pdf pdf. This query asks Google to return every PDF file that contains the word pdf. Remember, though, that Google automatically searches the URL for your search term, so every file ending in .PDF will have PDF in the URL.
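The workaround above is easy to wrap in a helper. The following is a minimal sketch, assuming the www.google.com/search?q= URL form used elsewhere in this chapter; the function names filetype_query and search_url are hypothetical.

```python
from urllib.parse import urlencode


def filetype_query(extension):
    """Pair filetype:EXT with the bare extension as a search term."""
    return f"filetype:{extension} {extension}"


def search_url(query):
    """Encode a query string into a search URL (base URL is an assumption)."""
    return "http://www.google.com/search?" + urlencode({"q": query})


print(filetype_query("pdf"))
print(search_url(filetype_query("pdf")))
```

Note that urlencode escapes the colon and the space, so the operator syntax survives the trip through the URL intact.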

Introducing Google's Advanced Operators

Google's advanced operators are very versatile, but keep in mind the rules listed earlier. In addition, you should remember that not all operators can be used everywhere. Some operators can only be used in performing a Web search, and others can only be used in a Groups search. Refer to Table 2.3, which lists these distinctions. If you have trouble remembering these rules, keep an eye on the results line near the top of the page. If Google picks up on your bad syntax, an error message will be displayed, letting you know what you did wrong. Sometimes, however, Google will not pick up on your bad form and will try to perform the search anyway. If this happens, keep an eye on the search results page, specifically the words Google shows in bold within the search results. These are the words Google interpreted as your search terms. If you see the word intitle in bold, for example, you've probably made a mistake using the intitle operator.

Intitle and Allintitle: Search Within the Title of a Page

From a technical standpoint, the title of a page can be described as the text that is found within the TITLE tags of an HTML document. The title is displayed at the top of most browsers when viewing a page, as shown in Figure 2.3. In the context of Google Groups, intitle will find the term in the title of the message post.
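To see what "the text within the TITLE tags" means in practice, here is a minimal sketch that pulls the title out of an HTML document using Python's standard html.parser module. The class and function names are our own, and the sample page is made up for illustration.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found between <title> and </title>."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TITLE> arrives as "title".
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        return "".join(self._parts).strip()


def page_title(html):
    """Return the page title, or an empty string if there is none."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title


sample = "<html><head><TITLE>Syngress Publishing</TITLE></head><body>Books</body></html>"
print(page_title(sample))
```

A page with no TITLE tags simply yields an empty string here, which is one reason an intitle search can miss a page whose visible heading looks like a title.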

As shown in Figure 2.3, the title of the Web page is "Syngress Publishing." It is important to realize that some Web browsers will insert text into the title of a Web page, under certain circumstances. For example, consider the page shown in Figure 2.1, shown again in Figure 2.4, this time before the page is actually finished loading.

This time, the title of the page is prepended with the word "Loading" and quotation marks, which were inserted by the Safari browser. When using intitle, be sure to consider what text is actually from the title and which text might have been inserted by the browser.

Title text is not limited, however, to the TITLE HTML tag. A Web page's document can be generated in any number of ways, and in some cases, a Web page might not even have a title at all. The thing to remember is that the title is the text that appears at the top of the Web page, and you can use intitle to locate text in that spot.

When using intitle, it's important that you pay special attention to the syntax of the search string, since only the word or phrase immediately following intitle: is considered the search phrase. Allintitle breaks this rule. Allintitle tells Google that every single word or phrase that follows is to be found in the title of the page. For example, we just looked at the intitle:"index of" "backup files" query as an example of an intitle search. In this query, the term "backup files" is found not in the title but rather in the text of the document, as shown in Figure 2.5.

Notice that "backup files" is not in the title of the first found document. If we were to modify this query to allintitle:"index of" "backup files" we would get a different response from Google, as shown in Figure 2.6.


Notice that both "index of" and "backup files" have been found in the title of the document and that we have reduced our search from 556 hits to 21 hits by providing a much more restrictive search, since more sites have the term "backup files" in the text than in the title of the document.

Google Highlighting

Google highlights search terms using multiple colors when you're viewing the cached version of a page and uses a bold typeface when displaying search terms on the search results pages. Don't let this confuse you if the term is highlighted in a way that's not consistent with your search syntax. Google highlights your search terms everywhere they appear in the search results. You can also use Google's cache as a sort of virtual highlighter. Experiment with modifying a Google cache URL. Locate your search terms in the URL, and add words around your search terms. If you do it correctly and those words are present, Google will highlight those new words on the page.

Be wary of using the allintitle operator. It tends to be clumsy when it's used with other advanced operators and tends to break the query entirely, causing it to return no results. It's better to go overboard and use a bunch of intitle operators in a query than to screw it up with allintitle's funky conventions.

Although this is not completely accurate, assume that allintitle cannot be used with other operators or search terms.
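Since a string of single operators tends to behave better than an ALL operator, it can help to generate that string mechanically. The helper below is a hypothetical sketch (the name expand_all_operator is ours) that rewrites the terms of an ALL operator as repeated single operators, quoting any multi-word phrase.

```python
def expand_all_operator(operator, terms):
    """Rewrite e.g. allintitle + terms as a chain of intitle: operators."""
    if not operator.startswith("all"):
        raise ValueError("expected an operator beginning with 'all'")
    single = operator[len("all"):]  # allintitle -> intitle
    parts = []
    for term in terms:
        if " " in term:
            term = f'"{term}"'  # phrases must be quoted, with no space after the colon
        parts.append(f"{single}:{term}")
    return " ".join(parts)


print(expand_all_operator("allintitle", ["index of", "backup files"]))
```

The resulting query can then be combined with other operators and plain search terms, which is exactly what the ALL form makes awkward.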

Allintext: Locate a String Within the Text of a Page

The allintext operator is perhaps the simplest operator to use since it performs the function that search engines are most known for: locating a string within the text of the page. Although this advanced operator might seem too generic to be of any real use, it is handy when you know that the text you're looking for should only be found in the text of the page. Using allintext can also serve as a type of shorthand for "find this string anywhere except in the title, the URL, and links."

Since this operator starts with the word all, every search term provided after the operator is considered part of the operator's search query.

For this reason, the allintext operator should not be mixed with other advanced operators.

Inurl and Allinurl: Finding Text in a URL

Having been exposed to the intitle operators, it might seem like a fairly simple task to start throwing around the inurl operator with reckless abandon. I won't discourage such flights of searching fancy, but first realize that a URL is a much more complicated beast than a simple page title, and the workings of the inurl operator can be equally complex.

First, let's talk about what a URL is. Short for Uniform Resource Locator, a URL is simply the address of a Web page. The beginning of a URL consists of a protocol, followed by ://, like the very common http:// or ftp://. Following the protocol is an address followed by a pathname, all separated by forward slashes (/). Following the pathname comes an optional filename. A common basic URL, like http://www.uriah.com/apple-qt/1984.html, can be seen as several different components. The protocol, http, indicates that we should expect a Web document from the server. The server is located at www.uriah.com, and the requested file, 1984.html, is found in the /apple-qt directory on the server. As we saw in the previous chapter, a Google search can also be conveyed as a URL, which can look something like www.google.com/search?q=ihackstuff.

We've discussed the protocol, server, directory, and file pieces of the URL, but that last part of our example URL, ?q=ihackstuff, requires a bit more examination. Explained simply, this is a list of parameters that are being passed into the "search" program or file. Without going into much more detail, simply understand that all this "stuff" is considered to be part of the URL, which Google can be instructed to search with the inurl and allinurl operators.
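The pieces of a URL described above can be separated with Python's standard urllib.parse module. This is a minimal sketch; the describe_url function and the dictionary keys are our own naming, and we add the http:// prefix to the Google example so that the parser can recognize the server portion.

```python
import posixpath
from urllib.parse import parse_qs, urlsplit


def describe_url(url):
    """Break a URL into the components discussed in this section."""
    parts = urlsplit(url)
    directory, filename = posixpath.split(parts.path)
    return {
        "protocol": parts.scheme,
        "server": parts.netloc,
        "directory": directory,
        "filename": filename,
        "parameters": parse_qs(parts.query),
    }


print(describe_url("http://www.uriah.com/apple-qt/1984.html"))
print(describe_url("http://www.google.com/search?q=ihackstuff"))
```

For the second URL, the "search" program shows up as the filename and q=ihackstuff lands in the parameter list.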

So far this doesn't seem much more complex than dealing with the intitle operator, but there are a few complications. First, Google can't effectively search the protocol portion of the URL (http://, for example). Second, there is a ton of special characters sprinkled around the URL, which Google also has trouble weeding through. Attempting to specifically include these special characters in a search could cause unexpected results and might limit your search in undesired ways. Third, and most important, other advanced operators (site and filetype, for example) can search more specific places inside the URL even better than inurl can. These factors make inurl much trickier to use effectively than an intitle search, which is very simple by comparison. Regardless, inurl is one of the most indispensable operators for advanced Google users; we'll see it used extensively throughout this book.

As with the intitle operator, inurl has a companion operator, known as allinurl. Consider the inurl search results page shown in Figure 2.7.

This search located the word admin in the URL of the document and the word backup anywhere in the document, returning more than 20,000 results. Replacing the inurl search with an allinurl search, we receive the results page shown in Figure 2.8.

This time, Google was instructed to find the words admin and backup only in the URL of the document, resulting in only 2,530 hits. Just like the allintitle search, allinurl tells Google that every single word or phrase that follows is to be found only in the URL of the page. And just like allintitle, allinurl does not play very well with other queries. If you need to find several words or phrases in a URL, it's better to supply several inurl queries than to succumb to the rather unfriendly allinurl conventions.
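The difference between the two searches can be modeled with a toy matcher. This is a deliberately simplified sketch: it uses plain substring tests, which is an assumption on our part and not how Google actually matches words, and the pages listed are hypothetical.

```python
def matches_inurl(url, page_text, url_word, other_word):
    """Model of 'inurl:url_word other_word': first word in the URL, second anywhere."""
    url_lower = url.lower()
    anywhere = url_lower + " " + page_text.lower()
    return url_word.lower() in url_lower and other_word.lower() in anywhere


def matches_allinurl(url, words):
    """Model of 'allinurl:word1 word2 ...': every word must be in the URL."""
    url_lower = url.lower()
    return all(word.lower() in url_lower for word in words)


# Hypothetical (URL, page text) pairs.
PAGES = [
    ("http://www.example.com/admin/backup/", "Index of /admin/backup"),
    ("http://www.example.com/admin/login.html", "Restore from backup"),
    ("http://www.example.com/news.html", "The admin posted a backup notice"),
]

inurl_hits = [url for url, text in PAGES if matches_inurl(url, text, "admin", "backup")]
allinurl_hits = [url for url, text in PAGES if matches_allinurl(url, ["admin", "backup"])]
print(len(inurl_hits), len(allinurl_hits))
```

Even in this tiny model, the allinurl form returns a subset of the inurl results, which is consistent with the drop in hits described above.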

Site: Narrow Search to Specific Sites

Although technically a part of a URL, the address (or domain name) of a server can best be searched for with the site operator. Site allows you to search only for pages that are hosted on a specific server or in a specific domain. Although fairly straightforward, proper use of the site operator can take a little bit of getting used to, since Google reads Web server names from right to left, as opposed to the human convention of reading site names from left to right. Consider a common Web server name, www.apple.com. To locate pages that are hosted on apple.com, a simple query of site:apple.com will suffice, as shown in Figure 2.9.

Notice that the first two results are from www.apple.com and store.apple.com. Both of these servers end in apple.com and are valid results of our query. It seems fairly logical to assume that a query for site:store.apple might help us locate Apple store pages, but, as shown in Figure 2.10, we only get one result, despite the fact that there are really tens of thousands of pages at http://store.apple.com.

Look very closely at the results of the query and you'll discover that the URL for the singular returned result looks a bit odd. Truth be told, this result is odd. There's no Web page at www.store.apple, because there's no such registered domain name on the Internet. Google (and the Internet at large) reads server names (really domain names) from right to left, not from left to right. For www.store.apple to exist, there must be an .apple domain name, which there isn't. Top-level domain names include com, net, etc. (see http://www.iana.org/gtld/gtld.htm) and must be registered and approved by the Internet Assigned Numbers Authority (IANA). This is the complicated way of saying that parameters to Google's site operator must end in a valid top-level domain name if you want predictable results. For example, queries for site:com, site:apple.com, and site:store.apple.com would all return results that would include links to the Apple store, but obviously the latter query would be the most specific.
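The right-to-left reading of server names can be sketched as a comparison of dot-separated labels, starting from the end of the name. This is our own illustration of the idea, not Google's implementation, and the function name site_matches is hypothetical.

```python
def site_matches(hostname, site_term):
    """True if hostname ends with the labels of site_term, compared right to left."""
    host_labels = hostname.lower().split(".")
    site_labels = site_term.lower().split(".")
    if len(site_labels) > len(host_labels):
        return False
    # Compare whole labels from the right, so "pineapple.com" does not
    # match "apple.com" even though the strings share a suffix.
    return host_labels[-len(site_labels):] == site_labels


for term in ("com", "apple.com", "store.apple.com", "store.apple"):
    print(term, site_matches("store.apple.com", term))
```

Under this model, site:store.apple fails to match store.apple.com because the rightmost label, apple, is compared against com first.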

Googleturds

So, what about that link that Google returned to www.store.apple? What is that thing? Johnny Long coined the term googleturd to describe what is most likely a typo that was crawled by Google. As a Webmaster, if you put up a Web page with a link to http://www.store.apple and your Web page was crawled by Google, there's a good chance that Google will hold onto this link even though it leads nowhere. These things can be useful, as we will see later on.

The site operator can be easily combined with other searches and operators, as we'll see later in this chapter.

Filetype: Search for Files of a Specific Type

Google searches more than just Web pages. Google can search many different types of files, including PDF (Adobe Portable Document Format) and Microsoft Office documents. The filetype operator can help you search for these types of files. More specifically, filetype searches for pages that end in a particular file extension. The file extension is the part of the URL following the last period of the filename but before the question mark that begins the parameter list. Although not always entirely accurate, the file extension can indicate what type of program opens the file, hence you can use Google's filetype operator to search for specific types of files by searching for a specific file extension. Table 2.1 shows the main file types that Google searches, according to www.google.com/help/faq_filetypes.html#what.

File Type                          File Extension

Adobe Portable Document Format     pdf

Adobe PostScript                   ps

Lotus 1-2-3                        wk1, wk2, wk3, wk4, wk5, wki, wks, wku

Many of the file extensions shown in Table 2.2 might be familiar to you; others might not. Filext (www.filext.com) is a great resource for getting detailed information about file extensions, what they are, and what programs the extensions are associated with.
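The definition of a file extension given earlier in this section (the part of the URL after the last period of the filename, but before the question mark that starts the parameter list) translates directly into code. The sketch below uses Python's standard library; the function name url_extension is ours, and the second URL is a made-up example.

```python
import posixpath
from urllib.parse import urlsplit


def url_extension(url):
    """Return the lowercase file extension of a URL, or '' if it has none."""
    path = urlsplit(url).path            # drops the ?parameter list
    filename = posixpath.basename(path)  # last component of the path
    _, extension = posixpath.splitext(filename)
    return extension[1:].lower()         # strip the leading period


print(url_extension("http://crypto.stanford.edu/DRM2002/darknet5.doc"))
print(url_extension("http://www.example.com/report.PDF?download=1"))
```

A period in a directory name is ignored here because only the filename is examined, which matches the definition above.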

Google converts every document it searches to either HTML or text for online viewing. You can see that Google has searched and converted a file by looking at the results page shown in Figure 2.11.

Figure 2.11 Converted File Types on a Search Page

[Screenshot: Google results for the query filetype:doc doc, showing about 5,120,000 results. The first result, "[DOC] The Darknet and the Future of Content Distribution" (crypto.stanford.edu/DRM2002/darknet5.doc), is labeled "File Format: Microsoft Word 2000 - View as HTML".]

Notice that the first result lists [DOC] before the title of the document and a file format of Microsoft Word 2000. This indicates that Google recognized the file as a Microsoft Word 2000 document. In addition, Google has provided a View as HTML link that when clicked will display an HTML approximation of the file, as shown in Figure 2.12.

When you click the link for a document that Google has converted, a header is displayed at the top of the page, indicating that you are viewing the HTML version of the page. A link to the original file is also provided. If you think this looks similar to the cached view of a page, you're right. This is the cached version of the original page, converted to HTML.

Although these are great features, Google isn't perfect. Keep these things in mind:

■ Google doesn't always provide a link to the converted version of a page.

■ Google doesn't always properly recognize the file type of even the most common file formats.

■ When Google crawls a page that ends in a particular file extension but that file is blank, Google will sometimes provide a valid file type and a link to the converted page. Even the HTML version of a blank Word document is still, well, blank.

This operator flakes out when ORed. As an example, the query filetype:xls xls returns 912,000 results. The query filetype:pdf pdf returns 10,900,000 results. The query (filetype:pdf | filetype:xls) returns 17,600,000 results, which is pretty close to the two individual search results combined. However, when you start adding to this precocious combination with things like (filetype:pdf | filetype:xls) (pdf | xls), Google flakes out with only 10,700,000 results. To make matters worse, all the returned files are PDF, and none are XLS files. We've found that Boolean logic applied to this operator is usually flaky, so beware when you start tinkering.

This operator can be mixed with other operators and search terms.

We simply can't state this enough: The real hackers play in the gray areas all the time. The filetype operator opens up another interesting playground for the true Google hacker. Consider the query (filetype:pdf | filetype:xls) -inurl:xls -inurl:pdf, a query that should return zero results, since all PDF and XLS files have PDF or XLS in the URL, right? Wrong. At the time of this writing, this query gives over 100 results, all of them interesting, to say the least. Pay close attention to the next character.

Link: Search for Links to a Page

The link operator allows you to search for pages that link to other pages. Instead of providing a search term, the link operator requires a URL or server name as an argument. Shown in its most basic form, link is used with a server name, as shown in Figure 2.13.

Figure 2.13 The Link Operator

[Screenshot: Google search for link:www.defcon.org]
