Ch. 5: Web Searching

Google, Bing, Yahoo (uses Bing), Gigablast, DuckDuckGo, etc.
How does a search engine work?
1. The Index.
  1. A collection of words, each with a list of URLs they are found on.
  2. Example: Search Database
2. Using the index for a search.
  1. Single word: Just fetch the URLs from the index.
  2. Multiple-word searches.
    1. Find the URLs for each word.
    2. Find the URLs in both lists. These are the search results.
    3. Example: Search for “apple pear”
    4. Note: This treats the multi-word search as an AND search. Engines may not follow this strictly, especially if it yields few results.
  3. Ordering the results: Rank each result page by the number of links pointing to it.
    1. Known as “PageRank”: Google's innovation.
      The biggest search engine before that had been Yahoo!
    2. Greatly tweaked since, as people try to game the ranking.
  4. Building the index, which is a collection of pairs of a word w and a URL u, ⟨w, u⟩:
    todo ← some starting list of URLs.
    while todo is not empty do:
      u ← some URL removed from todo
      fetch the web page at u.
      break the page into individual words (usually just divide at non-letters)
      for each word w in the page do:
        add ⟨w, u⟩ into the database.
      for each URL d mentioned in the downloaded page do:
        if d is not in todo, and not in the database then:
          add d to todo
    1. The engine will also occasionally scan the database for old entries and check them for changes.
    2. The program which builds the index is called a crawler or spider.
3. There's always something missing.
  1. Web changes quickly.
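The steps above (crawl, index, AND search, order by links) can be sketched in a few lines of Python. This is a minimal illustration, not how any real engine is built: the three-page "web" in PAGES and its example URLs are made up, a dictionary lookup stands in for fetching a page, and the ordering is a plain count of inbound links, which is far simpler than Google's actual ranking.

```python
import re
from collections import defaultdict

# Hypothetical three-page "web": URL -> (page text, URLs the page links to).
PAGES = {
    "http://a.example/": ("apple pear orchard",
                          ["http://b.example/", "http://c.example/"]),
    "http://b.example/": ("apple pie recipe", ["http://c.example/"]),
    "http://c.example/": ("pear and apple tart", ["http://a.example/"]),
}

def crawl(start_urls):
    """Build the index (word -> set of URLs) and count inbound links per URL."""
    index = defaultdict(set)
    inbound = defaultdict(int)
    todo = list(start_urls)
    seen = set(start_urls)   # every URL ever queued, so none is fetched twice
    while todo:
        u = todo.pop()
        text, links = PAGES.get(u, ("", []))   # stands in for fetching the page
        for w in re.split(r"[^a-z]+", text.lower()):   # divide at non-letters
            if w:
                index[w].add(u)                # record the pair <w, u>
        for d in links:
            inbound[d] += 1
            if d not in seen:
                seen.add(d)
                todo.append(d)
    return index, inbound

def search(index, inbound, query):
    """AND search: URLs containing every query word, most-linked first."""
    words = query.lower().split()
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits, key=lambda u: (-inbound[u], u))

index, inbound = crawl(["http://a.example/"])
print(search(index, inbound, "apple pear"))
# prints ['http://c.example/', 'http://a.example/']
```

Page c comes first because two pages link to it, while only one links to page a. Page b is left out because it does not contain "pear".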
Complex Searches.
1. List of words: Find any.
2. Conjunctions: AND, OR. Use with parens.
3. Use -word to eliminate that word.
4. Use +n to keep Google from ignoring numbers.
5. If the last search isn't helpful, adjust.
  1. If it's too small, remove constraints or try an OR.
  2. If it's too big, add words, or try more specific words.
  3. If it's full of the wrong subject, maybe try a NOT (or -) to eliminate stuff you don't want.
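One way to picture these operators is as set operations on the URL lists from the index. The sketch below assumes a tiny made-up index (INDEX, with page names p1 to p4) and a hypothetical helper pages(); real engines interpret queries more loosely than this.

```python
# Hypothetical index: word -> set of pages containing it.
INDEX = {
    "apple": {"p1", "p2", "p3"},
    "pear": {"p1", "p3", "p4"},
    "pie": {"p2", "p3"},
}

def pages(word):
    """Pages containing the word (empty set if the word is not indexed)."""
    return INDEX.get(word, set())

# apple AND pear: intersection, fewer results
both = pages("apple") & pages("pear")

# apple OR pear: union, more results
either = pages("apple") | pages("pear")

# (apple OR pear) -pie: drop every page that mentions "pie"
no_pie = (pages("apple") | pages("pear")) - pages("pie")

print(sorted(both))     # ['p1', 'p3']
print(sorted(either))   # ['p1', 'p2', 'p3', 'p4']
print(sorted(no_pie))   # ['p1', 'p4']
```

This matches the adjustment advice: an OR can only grow the result set, while an added AND word or a -word can only shrink it.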
Web Page Reliability.
1. Web v. print.
  1. Less editorial control.
  2. Harder to identify the true source/owner.
2. Hoaxes and jokes.
  1. DHMO.
  2. Tree Octopus.
  3. How long does it look real?
3. Checking.
  1. Check for true owner. (Maybe use the whois database.)
  2. Verify information using other sources.