
NATA1 Walk Through

Published 2004/10/21 10:31:00 · 1217 views

Category: WEBFORM

My mission: to implement NATA1 to spider and search my .Text site. Props to Paul [Sedgewick@Nata1.com] for sharing his source and walkthrough to get me started on this, and for responding to my pleas for help.

OK, so the NATA1 project is not quite as easy to implement as you were led to believe at http://www.nata1.com/. NATA2 does not appear to be "open" yet, and the version 1.1 available via the download link at the bottom of the page at http://www.nata1.com/download/default.aspx has a couple of bugs, but we can fix all of this in a jiffy...

So start by getting the code, and make sure you can build it in VS.NET.

Choose or create your SQL database, and run Nata1SqlScripts/tables.SQL to create the required tables... They will all co-exist nicely in your current site's database... Then run Nata1SqlScripts/Sprocs.SQL to create the stored procedures... If your database login is not an admin (sa) or a dbo, remember to grant your user EXECUTE permissions on the new stored procedures at this point.
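If your login is a limited user, the grant looks something like this (a sketch only: the procedure and login names here are made up; substitute the names actually created by Nata1SqlScripts/Sprocs.SQL and your own login):

```sql
-- Illustrative only: grant EXECUTE on each new NATA1 stored procedure
-- to the limited login your connection string uses. Repeat per procedure.
GRANT EXECUTE ON dbo.Nata1_SampleProc TO yourSiteLogin
```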

Next on to the web.config...

In the web application where you hope to use NATA1, you are going to need to add a bunch of settings.

For .Text you already have a configSections element, so add the following sections:

<configuration>
  <configSections>
    <sectionGroup name="Nata1">
      <section name="binPath" type="Nata1.Nata1SectionHandler, Nata1" />
      <section name="sites" type="Nata1.Nata1SectionHandler, Nata1" />
      <section name="log" type="Nata1.Nata1SectionHandler, Nata1" />
      <section name="database" type="Nata1.Nata1SectionHandler, Nata1" />
      <section name="preferedIndexTime" type="Nata1.Nata1SectionHandler, Nata1" />
      <section name="indexRequestTimeOut" type="Nata1.Nata1SectionHandler, Nata1" />
      <section name="indexing" type="Nata1.Nata1SectionHandler, Nata1" />
      <section name="indexService" type="Nata1.Nata1SectionHandler, Nata1" />
      <section name="google" type="Nata1.Nata1SectionHandler, Nata1" />
    </sectionGroup>
    <section name="exceptionManagement" type="Microsoft.ApplicationBlocks.ExceptionManagement.ExceptionManagerSectionHandler, Microsoft.ApplicationBlocks.ExceptionManagement" />
  </configSections>

Then, after the close of your system.web section (</system.web>) but before the close of your configuration section (</configuration>), you should add the following, editing the file path (my implementation does not use this), site URL, SQL connection string, and the Google key (if applicable). I have removed the quotes from the values you need to edit, so if you forget one your application will tell you your .config is invalid... you need to replace all of my square brackets with quoted strings.

<Nata1>
  <binPath>
    <add key="filePath" value=[FILE PATH HERE eg C:/siteName/searchEngine/] />
  </binPath>
  <sites>
    <add key="site" value=[BASE URL HERE eg http://blog.yoursite.com] />
    <add key="defaultPage" value="index.aspx" />
  </sites>
  <log>
    <add key="filePath" value="c:/eventLog.txt" />
  </log>
  <database>
    <add key="connectionString" value=[CONNECTION STRING HERE] />
  </database>
  <indexing>
    <add key="hour" value="4" />
    <add key="intervalType" value="daily" />
    <!--
    <add value="hourBased" key="interval" />
    <add value="2" key="intervalHours" />
    -->
  </indexing>
  <indexRequestTimeOut>
    <add key="seconds" value="5" />
  </indexRequestTimeOut>
  <indexService>
    <add key="provider" value="IndexServer" />
  </indexService>
  <google>
    <add key="licenseKey" value="[put your google license key here]" />
  </google>
</Nata1>

And finally you need to add the exception management section, filling in your email address and a valid path where ASP.NET has permission to write the error log file (errors are also logged to SQL, so you don't strictly need it):

<exceptionManagement mode="on">
  <publisher assembly="Nata1" type="Nata1.Engine.Exceptions.ExceptionPublisher" exclude="*" include="Nata1.Engine.Exceptions.DataStructureException, Nata1; Nata1.Engine.Exceptions.QueryException, Nata1; Nata1.Engine.Exceptions.UIException" operatorMail=[YOUR EMAIL ADDRESS] filename=[YOUR FILE PATH e.g. c:/inetpub/wwwroot/SearchEngine/Nata1ErrorLog.txt] />
</exceptionManagement>

Now things would be working, but for a few bugs in the spider... so let's fix the spider before we implement it on our site.

First we need to raise the timeout on the HttpWebRequest. Open Engine/Normalization/SitePageUtility.cs, and around line 342 in the GetPage function you will see a line that looks like this:

wr.Timeout=1;

Change it to 10000: the value is in milliseconds, so a timeout of 1 will cause every spidered page to time out.
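After the edit, the line in GetPage should read:

```csharp
wr.Timeout = 10000; // milliseconds: 10 seconds, instead of the shipped 1 ms
```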

We are going to need to change the default behavior of the spider. It appears to have been designed to follow only relative links, which kept it on the specific site... For .Text all of the links are absolute, so we need this utility to recognize and spider all links that are on the same domain as our base site. This behavior can be changed in the file Engine/Indexing/IndexUtility.cs. Update the buildSiteURLs function... actually, just replace it with the following (careful: the blog editor mangled the word wrap and the regex backslashes in this function):

private static void buildSiteURLs(string url)
{
    // This function works like getPageURLs:
    // 1. get the links on the page that don't already appear in the root ArrayList
    // 2. add the unique URLs to the root ArrayList
    // 3. for each unique URL, recursively call this function
    // so it eventually collects only "unique" page URLs.

    // Capture all the URLs on the page (matches quoted and unquoted href attributes).
    Regex rx = new Regex("href\\s*=\\s*(?:\"(?<href>[^\"]*)\"|(?<href>\\S+))",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    ArrayList pageResults = new ArrayList();
    MatchCollection mc;

    try
    {
        // Fetch the page; GetPage logs its own failures.
        string pageText = null;
        try
        {
            pageText = SitePageUtility.GetPage(url);
        }
        catch
        {
            // ok, we've logged this
        }
        if (pageText != null)
            _rawPages[url] = pageText;
        else
            return;

        mc = rx.Matches(pageText);

        foreach (Match m in mc)
        {
            Group g = m.Groups["href"];
            // Prepare the URL; no need to check for relative vs. absolute here.
            string linkTest = g.Value.ToUpper();
            if (linkTest.IndexOf("/") < 0) continue; // strip anchors and some javascript links
            if (linkTest.IndexOf(".JPG") > -1) continue;
            if (linkTest.IndexOf(".GIF") > -1) continue;
            if (linkTest.IndexOf(".PDF") > -1) continue;
            if (linkTest.IndexOf(".CSS") > -1) continue;

            string href = g.Value;

            if (_debug == true)
                LogUtility.LogEvent(EntryType.Info, DateTime.Now, "href : " + href);

            if (href.IndexOf("/") == 0)
                href = SiteUtility.GetSiteBaseUrl() + href.Substring(1); // only prepend the base if it is a relative link
            string uri = href.Split(char.Parse("#"))[0].ToString(); // strip any anchor fragments

            // TODO: ADD SUPPORT FOR DIFFERENT EXTENSIONS
            if (!_siteUrls.Contains(uri)
                && uri != SiteUtility.GetSiteBaseUrl()
                && uri != (SiteUtility.GetSiteBaseUrl() + SiteUtility.GetSiteDefaultPage())
                && uri.IndexOf(SiteUtility.GetSiteBaseUrl()) == 0)
            {
                _siteUrls.Add(uri);
                pageResults.Add(uri);
            }
            else if (_debug == true)
            {
                // log why the URL was not added
                if (_siteUrls.Contains(uri))
                    LogUtility.LogEvent(EntryType.Info, DateTime.Now, "The URL has already been added to the list");
                if (uri.IndexOf(SiteUtility.GetSiteBaseUrl()) != 0)
                    LogUtility.LogEvent(EntryType.Info, DateTime.Now, "The URL is off of the host domain");
                if (uri == SiteUtility.GetSiteBaseUrl() || uri == (SiteUtility.GetSiteBaseUrl() + SiteUtility.GetSiteDefaultPage()))
                    LogUtility.LogEvent(EntryType.Info, DateTime.Now, "The URL is for the default page, and is already indexed");
            }
        }
        LogUtility.LogEvent(EntryType.Info, DateTime.Now, "RegEx URL Matches for this page : " + mc.Count.ToString());
        LogUtility.LogEvent(EntryType.Info, DateTime.Now, "pageResults for this page : " + pageResults.Count.ToString());
    }
    catch (Exception ex)
    {
        LogUtility.LogError(EntryType.RecoverableError, DateTime.Now, "Error processing page " + url, ex);
        //throw new ApplicationException("error occurred building Site Urls", ex);
    }

    // Now make a recursive call for each unique link found on this page.
    foreach (string s in pageResults)
    {
        buildSiteURLs(s);
    }
}
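If you want to sanity-check the href pattern outside the spider, here is a minimal standalone sketch. The sample HTML, class name, and Main method are mine; only the regular expression mirrors the spider's.

```csharp
using System;
using System.Text.RegularExpressions;

// Quick check of the href-matching pattern used by buildSiteURLs.
// The sample markup below is made up for illustration.
class RegexCheck
{
    static void Main()
    {
        Regex rx = new Regex("href\\s*=\\s*(?:\"(?<href>[^\"]*)\"|(?<href>\\S+))",
            RegexOptions.IgnoreCase | RegexOptions.Compiled);

        string pageText =
            "<a href=\"http://blog.yoursite.com/archive/1.aspx\">one</a> "
            + "<a href=/relative/2.aspx >two</a> "
            + "<a href=\"#top\">anchor</a>";

        // Both the quoted and unquoted hrefs are captured; the "#top" link
        // matches too, but would later be skipped by the "/" filter in
        // buildSiteURLs since it contains no slash.
        foreach (Match m in rx.Matches(pageText))
            Console.WriteLine(m.Groups["href"].Value);
    }
}
```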

 

While we are here, there is a bug in the SearchResults Repeater designer. I only found it because I wanted to implement the whole thing as a class in my application directly and avoid all of the carefully crafted but undocumented server controls. I devised this workaround... There is a problem with the ItemTemplate object, but without it your repeater won't return results... the solution is to use the AlternatingItemTemplate... So in the file UI/Common/BaseRepeater.cs, find the line around 460 that reads:

// don't do anything if no ItemTemplate

if (_itemTemplate == null)

      return;

and change it to:

// don't do anything if no ItemTemplate

if (_itemTemplate == null && _alternatingItemTemplate == null)

      return;

else if(_itemTemplate == null)

      _itemTemplate = _alternatingItemTemplate;

Now we are ready to build NATA1. Add a reference to the DLL in your web site project, and add the controls to your toolbox in VS.NET (warning: make a new tab, as it will add lots of new controls). Following the instructions from the author's original article:

Step 3. Add Nata1.dll to your toolbox.  Right click your toolbox.  Choose “add/remove items” , click browse, and find Nata1.dll.  Nata1 controls are now added to your toolbox.

There are dozens of controls. Some are container controls, like ResultsRepeater, and others are for individual items; all the ones with a smiley icon are placed in the Item or AlternatingItem template, like HitUrl, HitWords, etc. You can get creative with your toolbox icons; I've included some neat ones like Homestar Runner icons. Controls like QueryTime sit in the header template. Some controls are specific to a search provider: e.g. Google has many controls, like spelling suggestions, but Index Server only has a couple, so be careful to make sure the provider supports the controls you use.

After adding the reference, add the following line to the Application_Start function in your global.asax:

protected void Application_Start(Object sender, EventArgs e)
{
    Nata1.Controller.Start();
}

Now when your site starts, it will check the SQL logs to see when it was last spidered, and it should begin crawling your site. You should see activity in the log, and after a while you should have a decent collection of site pages in your database. Now we just need to build a search page... (some of this content is from the original FAQ). Just make a page called search.aspx for now and follow the author's original step 4:

Step 4: We'll need a form to get from a search box to the search results page.  Go ahead and drag and drop “SearchForm“ (control with the ducky) onto any ascx or aspx page in your site. 

To use an image, set SearchButtonText to an image URL (I know, not the most elegant) or enter text, and make sure to set ButtonType as well as SearchPageUrl. As you can see, there is a bug in the designer: the image isn't updating.
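If you prefer to type the markup rather than use the property grid, the result looks roughly like this (a sketch: the nata1searchui tag prefix matches what the designer registered on my page, and the attribute values are placeholders for your own; set ButtonType in the property grid as well):

```aspx
<nata1searchui:SearchForm id="SearchForm1" runat="server"
    SearchButtonText="Search"
    SearchPageUrl="SearchResults.aspx" />
```

When you drag the control on, the designer also emits the matching <%@ Register %> directive at the top of the page.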

 

And then for step five, make a page called SearchResults.aspx; you will need to link it to the control you added in step 4. Follow the author's instructions:

Step 5: We'll build the search results page. Drag and drop “Search Results Repeater” (the one with the fairy icon) onto an ascx or aspx page.

The two most important properties: for “Query Provider” you want to select Nata1; the other, SearchQueryTemplateMode, should be set to Simple.

Then finally we follow the author's Step 6.

 

 

Step 6: Right click the template, choose the template you want to edit, and start dragging and dropping controls.

Here I dragged the SearchQuery and TotalHits controls onto the Header template, and put an ad banner there too; you can rotate it based on keyword if you want.

There are several other templates you'll need to set, like NoResults, etc. There's also a template for a search form, where you can specify which search form controls to place; perhaps you want an advanced search form at the top.

The key is in that last part... Earlier I noted that the Repeater relies on the ItemTemplate row, but if you add that to the control, it breaks... so we are going to want to add some basic output to the AlternatingItem row. Go to your HTML view and add:

 

<AlternatingItemTemplate>

<P>

      <nata1searchui:HitTitle id="HitTitle1" runat="server"></nata1searchui:HitTitle>

      <nata1searchui:HitRankingIcon id="HitRankingIcon1" runat="server"></nata1searchui:HitRankingIcon></P>

<P>

      <nata1searchui:HitPageWords id="HitPageWords1" runat="server"></nata1searchui:HitPageWords></P>

<P>

      <nata1searchui:HitCategories id="HitCategories1" runat="server"></nata1searchui:HitCategories></P>

</AlternatingItemTemplate>

 

As well as an error handler, which cannot be set from the UI…

 

 <ErrorTemplate>

An error has occurred and our support staff has been notified.

</ErrorTemplate>

 

And with this, you should be able to search through the site pages in your database... Not too pretty, but really cool, because if you made it this far you are hard core, and understand the power of what just happened on your site... You might want to explore the controls for the admin area... and add some noise words to your database too. Happy searching...
