LANGUAGES: VB.NET | C#
ASP.NET VERSIONS: 2.x
Prepare to Be Searched
Get Your Site Noticed by the People Who Matter Most
If your Web site provides useful content, services, or products, there are people
out there who want to know about it. But how do you get the word out? You could
send out copious amounts of spam to get noticed, but that’s not likely to earn the
kind of reputation that most organizations crave. Other forms of marketing and advertising
are likely to bring more positive results, but just because you don’t have an advertising
budget doesn’t mean you’re out of luck. Read on to discover free ways to raise
your Web site’s profile and get found by the people you’re trying to reach.
Robots and Crawling Spiders
Sounds like an introduction to a sci-fi movie, doesn’t it? Actually, robots, crawlers,
and spiders are all names for custom software from search engines like Google, Yahoo,
and MSN Search that investigate what’s currently out on the Internet. If you have
a public Web site, chances are it has already been visited, scanned, and thoroughly
indexed by one of these ominous-sounding pieces of software. As intimidating as
they sound, spiders can be your best friend if you take the time to understand them
— they hold the key to every Web site’s search ranking. If your site sells discount
toothpicks, then your site needs to appear near the top of the list when users search
for discount toothpicks — and the spiders hold the power to make that happen.
Functionally speaking, spiders do little more than record key pieces of your Web
page’s HTML and follow the hyperlinks to see where they lead. Conceptually, it’s
not very difficult to design a basic spider yourself. The .NET WebRequest object
is all you really need to retrieve the HTML of a page so you can parse it and extract
the hyperlinks to recursively parse other related Web pages. While in the process,
you can store important pieces of text in a database for querying. Sites like Google
and Yahoo have become masters of this technique, and by understanding some details
about how they do it, you can use their global dominance to advance your own agenda.
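To make the idea concrete, here is a minimal C# sketch of the first step a basic spider might take: download a page with WebRequest, then pull out the hyperlinks to visit next. The LinkExtractor class and its method names are hypothetical, and the regular expression is a deliberate simplification; real-world HTML is messy enough that a serious spider would want a proper parser.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the .NET Framework.
public static class LinkExtractor
{
    // Matches href="..." or href='...' inside anchor tags (a simplification).
    private static readonly Regex HrefPattern = new Regex(
        "<a\\s[^>]*?href\\s*=\\s*[\"']([^\"']+)[\"']",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Retrieves the raw HTML of a page using WebRequest.
    public static string Download(Uri pageUri)
    {
        WebRequest request = WebRequest.Create(pageUri);
        using (WebResponse response = request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }

    // Returns the distinct absolute http/https URLs linked from the given HTML.
    public static List<string> ExtractLinks(string html, Uri pageUri)
    {
        List<string> links = new List<string>();
        foreach (Match match in HrefPattern.Matches(html))
        {
            Uri absolute;
            // Resolve relative links against the address of the page itself,
            // and skip non-Web schemes such as mailto: or javascript:.
            if (Uri.TryCreate(pageUri, match.Groups[1].Value, out absolute)
                && (absolute.Scheme == Uri.UriSchemeHttp
                    || absolute.Scheme == Uri.UriSchemeHttps)
                && !links.Contains(absolute.AbsoluteUri))
            {
                links.Add(absolute.AbsoluteUri);
            }
        }
        return links;
    }
}
```

Calling ExtractLinks on each downloaded page, and queuing any address that has not been visited yet, would give the recursive crawl described above.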
A primary technique that spiders employ is to examine the words used most often
in your Web pages. Therefore, the text content of your Web site is important for
determining the ranking of your site in relation to specific words and phrases.
It’s not very feasible (or advisable) to make major changes to the content of a
Web site just to increase search rankings. Instead, there are other techniques that
are likely to give better results. For example, another extremely important item
search engines examine is the title of a page. In a basic HTML page, the title would
be defined like this:
<html>
<head>
<title>THIS IS THE TITLE</title>
</head>
<body>
Hello World
</body>
</html>
When the page is viewed by a user, its title shows up in the title bar of the browser,
as shown in Figure 1. As far as search engines are concerned, it is best to have
the title consist of a good sentence or two filled with highly descriptive words
about the page and/or Web site. This will help search engines “understand” the primary
focus of the Web page, thereby increasing the site’s ranking when people search
for related topics.
Figure 1: A Web page’s title shows
up in the title bar of the user’s browser. It’s a key element that is examined by
most major search engines to determine the subject matter of a Web page.
In ASP.NET 2.0, you’re likely to have a master page, so the simplest way to specify
the title for each page will be more like this:
<%@ Page
TITLE="MY PAGE TITLE"
Language="VB"
MasterPageFile="~/MyMasterPage.master" %>
<asp:Content ID="Content1"
ContentPlaceHolderID="CPH1" Runat="Server">
Hello World
</asp:Content>
This technique is fine for a small Web site, but for larger sites you’re in for
a major maintenance chore if you ever decide to change the titles of all the pages
in your Web site. Luckily, ASP.NET 2.0 makes it easy to change a page’s title programmatically
from the page’s (or master page’s) code-behind file:
Page.Title = "Discount Toothpicks" 'VB 2005
Page.Title = "Discount Toothpicks"; //C# 2.0
Now all that’s needed is a way to programmatically set the page title from some
kind of a data source. Luckily, the SiteMapDataSource is perfect for this kind of
thing. For more information about site maps, I suggest you read
Automate Navigation Chores. Once a site map is set up, it only takes a tidbit
of code in the master page’s code-behind to set the page title to the associated
title specified in the site map:
'VB 2005
If SiteMap.CurrentNode IsNot Nothing Then
    Page.Title = SiteMap.CurrentNode.Title
End If

//C# 2.0
if (SiteMap.CurrentNode != null)
    Page.Title = SiteMap.CurrentNode.Title;
Descriptions, Keywords, and Meta Tags
Virtually all search engines make use of the page title, so giving every page a
descriptive title has a high payoff. However, there are other specific HTML
elements that some search engines also value highly in their rankings. For example,
Yahoo and MSN Search use the Description meta tag when present; Yahoo uses the Keywords
meta tag, as well. Here’s a syntactically correct example of these meta tags in
action:
<html>
<head>
<title>THIS IS THE PAGE TITLE</title>
<meta name="description" runat="server"
content="Discount Toothpicks" id="description" />
<meta name="keywords" runat="server"
content="toothpicks, discount, teeth, cheap" id="keywords" />
</head>
<body>
Get yer cheap toothpicks here
</body>
</html>
Technically, from an HTML perspective, the runat and id attributes are not required,
but including them lets you adjust the tags’ values via server-side code.
For example, you can use a SiteMap to set the Description meta tag in much the same way
that the page title was set in the previous example:
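'VB 2005
If SiteMap.CurrentNode IsNot Nothing Then
    Me.description.Content = SiteMap.CurrentNode.Description
    ' The indexer reads a custom site map attribute and already returns a String
    ' (Nothing if the attribute is missing), so no ToString() call is needed.
    Me.keywords.Content = SiteMap.CurrentNode("keyword")
End If

//C# 2.0
if (SiteMap.CurrentNode != null)
{
    this.description.Content = SiteMap.CurrentNode.Description;
    // The indexer returns a string (null if the attribute is missing).
    this.keywords.Content = SiteMap.CurrentNode["keyword"];
}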
While SiteMaps don’t officially support the keyword attribute, you can add it anyway
because extraneous attributes are permitted and can be accessed programmatically
using the syntax listed above.
Get a Buzz
Another extremely important factor that search engines consider when ranking a site
is how many other Web pages on the Internet link to that site. For a Web site to
be considered an authority on a particular topic, it needs a lot of related
Web sites pointing to it, and the effect is greatest when those sites rank
highly themselves (see Figure 2). Of course, the obvious question is how to get other
sites to link to yours. There is no single great answer to this, although it sure
helps if you’ve got a lot of advertising dollars to spend. Otherwise, you’re stuck
with gradually building a reputation and getting other sites to link to yours via
trading, begging, bartering, and hard work. Sometimes sharing content with other
Web sites is a good way to get them to notice you and (more importantly) provide
valuable hyperlinks back to your site.
Figure 2: The Google toolbar plug-in
(available for Internet Explorer and Firefox) gives a good indication of a particular
Web site’s ranking. This ranking is based primarily on how many other Web sites
link to the site.
Creating a buzz is a great way to launch a public site on the right foot. Get the
word out. Make sure all the sites that should know about your pages are aware your
site is online. Post in public forums frequently, and always include a hyperlink
to the site in your signature or elsewhere in the posting. Get friends and coworkers
to join in, too. If you’re proud of your site, make a big deal about it and see
who notices.
Through some investigation, you might find some link networks related to your industry.
Basically, when you join such a network you agree to provide links to other related
Web sites, and they agree to link to yours as well. Varying degrees of automation
are generally involved to ensure participation among members. If you go with this
approach, be sure to stay with link networks within your industry. Straying into
more general “link farm” networks will often have the opposite effect,
watering down the focus of your Web site in the eyes of search engines and potentially
making it more difficult to find.
When you feel your site is ready, most major search engines provide a way to submit
a site for indexing, which effectively queues the site for a visit from a spider. To submit
a site to a search engine, visit its main search page, find the “help” link, and
follow it to the submission page. It’s generally not necessary to submit a
site because the search engines’ spiders will eventually find it on their
own, although submitting can sometimes speed up the process. Don’t
worry if your site has already been indexed; spiders will visit again soon to investigate
content revisions.
What Not To Do
While all the previous tips provide valuable things that can be done to improve
a site’s search ranking, there are also some things that simply should not be done.
For example, most spiders are unable to analyze images, so you shouldn’t hide critical
search phrases inside an image unless they are duplicated in the image’s ALT attribute.
It’s also advisable not to try to trick search engines into increasing a site’s
ranking. People have come up with all kinds of devious ways to hide extra keywords
in HTML documents in an effort to boost profiles. Some people mistakenly think injecting
a wide variety of irrelevant words in a Web site will help it to be found by a wider
audience. My advice is to not get cute like this. The major search engines have
seen it all before. At best, these extra words will be ignored; at worst, your entire
site could end up being ignored.
Generally speaking, the more Web sites that link to your site the better. However,
there are a couple of exceptions. Web sites infamous for undesirable content such as
spam, warez, and other illegal activities might give your Web site a bad reputation
in the eyes of search engines if they consistently link to your site. In other words,
keep your nose clean so questionable sites will have little interest in linking
to your content.
Complex QueryStrings can also confuse spiders. For example, do these two URLs output
the same content?
http://www.SomeSite.com/ShowContent.aspx?ID=1
http://www.SomeSite.com/ShowContent.aspx?ID=2
The answer is, “it depends.” As a Web developer, you likely know that the ID QueryString
tacked onto the end of the URL could be mostly irrelevant, or it could completely
change the page that is displayed depending on how the developer decided to use
it. Spiders understandably tend to get confused by this kind of thing and don’t
know whether to index them as separate pages. As a result, some spiders completely
ignore such pages. Because complex QueryStrings confuse spiders, they should be
mostly avoided, especially for pages that are meant to be highly searchable. The
Context.RewritePath method can be quite useful for providing spider-friendly URLs
without having to heavily modify a preexisting architecture that relies on QueryStrings.
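As a rough illustration of that last point, the sketch below maps a spider-friendly address such as /Content/2.aspx onto the existing QueryString-based page. The FriendlyUrlMapper class and the /Content/ URL pattern are hypothetical choices made up for this example, and the code assumes the application runs at the root of the site. The mapping itself is plain string handling; the commented lines at the bottom show where Context.RewritePath might come into play.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of ASP.NET.
public static class FriendlyUrlMapper
{
    // Assumed friendly pattern: /Content/<number>.aspx
    private static readonly Regex FriendlyPattern = new Regex(
        "^/Content/([0-9]+)\\.aspx$", RegexOptions.IgnoreCase);

    // Returns the internal QueryString-based path for a friendly path,
    // or null when the request is not a friendly URL.
    public static string ToInternalPath(string requestPath)
    {
        Match match = FriendlyPattern.Match(requestPath);
        if (!match.Success)
            return null;
        return "/ShowContent.aspx?ID=" + match.Groups[1].Value;
    }
}

// In Global.asax, the mapping might be applied like this (requires System.Web):
//
// protected void Application_BeginRequest(object sender, EventArgs e)
// {
//     string target = FriendlyUrlMapper.ToInternalPath(Request.Path);
//     if (target != null)
//         Context.RewritePath(target);
// }
```

With something like this in place, spiders and users would see /Content/2.aspx, while ShowContent.aspx would still receive ID=2 in its QueryString and need no changes.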
Private Parts
Perhaps there are parts of a Web site that should not be searched. Maybe they contain
personal information or sensitive copyrighted content. The best solution is to use
some kind of authentication, such as Forms Authentication or Windows Authentication.
Because spiders don’t have user accounts, they won’t be able to access (or index)
the information contained within. However, if a full-blown authentication system
is overkill for your needs, there are some simple alternatives to keep specific
pages away from prying spider eyes.
One solution is the ROBOTS meta tag. To prevent a page’s content from being indexed,
add the following meta tag to its HTML:
<meta name="ROBOTS" content="NOINDEX" />
To prevent spiders from following hyperlinks contained within the page, add this
meta tag to the page’s HTML:
<meta name="ROBOTS" content="NOFOLLOW" />
While this solution can be useful for protecting a page or two, it can start to
become less manageable for larger numbers of pages. If entire directory trees need
to be protected, then creating a robots.txt file in the web root may be a better
solution because it centralizes the management of such details. To prevent the entire
Web site from being indexed, the robots.txt file should contain the following text:
User-agent: *
Disallow: /
This tells all (*) spiders to ignore pages starting at the root (/) of the Web site.
It’s easy to be more selective about which files to exclude, such as in the following
example that denies (only) Google permission to index content in the web root’s
subdirectory named “secure”, as well as the “/data/logs” subdirectory:
User-agent: Googlebot
Disallow: /secure
Disallow: /data/logs
It’s also possible to grant different levels of access to spiders from different
search engines, and other advanced tricks that are beyond the scope of this article.
For more information, see http://www.robotstxt.org/wc/faq.html.
Although there is currently no ratified standard that is guaranteed to ward off
all search engines, most voluntarily comply with the techniques mentioned here.
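To get a feel for how a well-behaved spider might interpret the robots.txt rules shown in this section, here is a simplified C# sketch. The RobotsRules class is hypothetical, and it makes several assumptions: each group has a single User-agent line, a Disallow value blocks any path that starts with it, and a group naming the spider takes precedence over the wildcard (*) group. Real spiders handle more cases than this.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper illustrating how robots.txt rules might be applied.
public static class RobotsRules
{
    // Returns true if robotsTxt disallows the given path for the given spider.
    public static bool IsDisallowed(string robotsTxt, string userAgent, string path)
    {
        List<string> specificRules = new List<string>();
        List<string> wildcardRules = new List<string>();
        List<string> currentRules = null;
        bool sawSpecificGroup = false;

        foreach (string rawLine in robotsTxt.Split('\n'))
        {
            string line = rawLine.Trim();
            int colon = line.IndexOf(':');
            // Skip comments and anything that is not a "field: value" line.
            if (line.StartsWith("#") || colon < 0)
                continue;

            string field = line.Substring(0, colon).Trim().ToLowerInvariant();
            string value = line.Substring(colon + 1).Trim();

            if (field == "user-agent")
            {
                if (value == "*")
                {
                    currentRules = wildcardRules;
                }
                else if (value.Length > 0 &&
                    userAgent.IndexOf(value, StringComparison.OrdinalIgnoreCase) >= 0)
                {
                    currentRules = specificRules;
                    sawSpecificGroup = true;
                }
                else
                {
                    // This group targets some other spider; ignore its rules.
                    currentRules = null;
                }
            }
            else if (field == "disallow" && currentRules != null && value.Length > 0)
            {
                currentRules.Add(value);
            }
        }

        // A group that names this spider wins over the wildcard group.
        List<string> rules = sawSpecificGroup ? specificRules : wildcardRules;
        foreach (string prefix in rules)
        {
            if (path.StartsWith(prefix, StringComparison.Ordinal))
                return true;
        }
        return false;
    }
}
```

Given the second robots.txt example above, this sketch would report /secure/account.aspx as off-limits to Googlebot but open to spiders that are not named in the file.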
Search Is King
Being easily found on the Internet is an important accomplishment for any public
organization. Being able to find information can be just as important. For more
details on how to retrieve and use search results programmatically, see
Search Box.
Obviously, the topic of searching and indexing the Web is far more complex than
anyone could hope to cover in an article or two; otherwise, companies like Google
and Yahoo wouldn’t be able to rake in such enormous amounts of money from their
expertise. Armed with the right knowledge, and building on the information you now
have, maybe you too can scoot up to the table and grab yourself a piece of the pie.
This article was originally published in
ASP.NET Pro Magazine.