Building the index, which is a collection of pairs ⟨w, u⟩ of a word w and
a URL u:
todo ← some starting list of URLs.
while todo is not empty do:
    u ← some URL removed from todo
    fetch the web page at u.
    break the page into individual words (usually just divide at non-letters)
    for each word w in the page do:
        add ⟨w, u⟩ into the database.
    for each URL d mentioned in the downloaded page do:
        if d is not in todo, and not in the database then:
            add d to todo
- The engine will also occasionally scan the database for
old entries and check them for changes.
- The program which builds the index is called a crawler or spider.
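
The crawl loop above can be sketched in Python roughly as follows. This is
only an illustration under stated assumptions, not a real crawler: fetch is
a hypothetical function supplied by the caller that returns the page text
and the list of URLs the page mentions, the database is a plain in-memory
set, and a seen set stands in for the "not in todo, and not in the
database" test.

```python
import re
from collections import deque


def build_index(start_urls, fetch):
    """Return a set of (word, url) pairs reachable from start_urls.

    fetch is a hypothetical callable (an assumption of this sketch):
    fetch(url) -> (page_text, list_of_urls_mentioned_in_the_page).
    """
    index = set()              # the "database" of (w, u) pairs
    todo = deque(start_urls)   # URLs still waiting to be fetched
    seen = set(start_urls)     # every URL ever put on todo

    while todo:
        u = todo.popleft()             # some URL removed from todo
        text, links = fetch(u)         # fetch the web page at u

        # Break the page into words by dividing at non-letters.
        for w in re.split(r"[^A-Za-z]+", text):
            if w:                      # skip empty strings from the split
                index.add((w, u))

        # Queue every mentioned URL that has not been queued before.
        for d in links:
            if d not in seen:
                seen.add(d)
                todo.append(d)

    return index
```

To try it without touching the network, fetch can simply look pages up in
a dictionary, for example build_index(["a"], lambda u: pages[u]) where
pages maps each URL to a (text, links) pair. Because seen records every
URL ever queued, pages that link back to each other are fetched only once.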