Automated Searches for coterms: To locate all the prefixes and suffixes
for a single term, I have created automated searches that traverse a
significant part of the web and, in the process, try to build a complete
list of coterms (prefixes or suffixes). Typical searches and their
approximate sizes:
Term + datebysingleday ~ 50,000 URLs or 5,000 searches
Term + ListTopLevelDomains ~ 25,000 URLs or 250 searches
Term + ListStateCodes.US ~ 5,000 URLs or 50 searches
Term + ListCoterms ~ thousands of coterms, thousands of searches
Term + ListTerms ~ any number of searches from an arbitrary list (e.g. common
words)
Term + site:ListDomains ~ uses domains from one search with the core term, or
another term; domains are taken in random or frequency order
Term may be any boolean search.
Term may be a list of terms.
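The list-driven searches above can be sketched as simple query generators. This is a minimal illustration, not the actual tooling: the function names, the ISO date format, and the plain space between term and list item are all assumptions, since the exact query syntax depends on the search engine being used.

```python
from datetime import date, timedelta


def dated_queries(term, start, days):
    """Yield one query per calendar day: the term plus an ISO date.

    The date format is an assumption; a real search engine may need
    a different syntax for restricting results to a single day.
    """
    for offset in range(days):
        day = start + timedelta(days=offset)
        yield f"{term} {day.isoformat()}"


def list_queries(term, items, template="{term} {item}"):
    """Pair the term with every item of a list (coterms, state codes, ...)."""
    return [template.format(term=term, item=item) for item in items]


def site_queries(term, domains):
    """Restrict the term to each domain in turn with the site: operator."""
    return list_queries(term, domains, template="{term} site:{item}")
```

For example, `site_queries("solar", ["example.com"])` yields the single query `solar site:example.com`. Since the term is only interpolated into a string, it can itself be a boolean expression.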
As the searches proceed, the number of new coterms found per search decays
logarithmically or exponentially. Gradually no new terms are found, and
the total number of coterms approaches a plateau or limit. Where the
coterm list is finite (approaches a limit), one can be reasonably assured
that the most relevant search terms have been found for the root term.
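The plateau can be detected mechanically by tracking how many new coterms each batch of results contributes. The sketch below is one possible way to do it and rests on several assumptions: the term is a single word, a coterm is the word immediately before or after it, and the hypothetical `patience` parameter decides how many consecutive searches without a new coterm count as a plateau.

```python
import re


def extract_coterms(text, term):
    """Return the words found immediately before or after `term` in `text`."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    term = term.lower()
    found = set()
    for i, word in enumerate(words):
        if word == term:
            if i > 0:
                found.add(words[i - 1])      # prefix
            if i + 1 < len(words):
                found.add(words[i + 1])      # suffix
    return found


def collect_until_plateau(result_texts, term, patience=3):
    """Accumulate coterms until `patience` searches in a row add nothing new.

    Returns the coterm set and the cumulative count after each search,
    which is the curve that flattens out as the list nears its limit.
    """
    seen = set()
    growth = []
    dry_runs = 0
    for text in result_texts:
        new = extract_coterms(text, term) - seen
        seen |= new
        growth.append(len(seen))
        dry_runs = 0 if new else dry_runs + 1
        if dry_runs >= patience:
            break
    return seen, growth
```

Plotting `growth` against the search number shows the decay directly: the steps between successive values shrink toward zero as the list saturates.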
Instead of one search on a term, I can do hundreds or thousands, in a fairly
efficient manner.
Automated Scanning of Web Browser pages: I can capture
pages and process them on the fly as I am browsing the web. Whether it is a
set of searches or a browse through interesting sites (e.g. Wikipedia), I can mine the
pages for useful information. Where I am traversing a database, I can
capture the contents of the database. I create web mining tools for
selected large sites. I can download complete sites for processing.
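As one example of mining a captured page, the sketch below pulls the linked domains out of a page's HTML and ranks them by frequency, which is the kind of list a site: search can then be driven from. It uses only the Python standard library; the class and function names are illustrative assumptions, and capturing the page from the browser is left out.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def domains_by_frequency(html):
    """Return (domain, count) pairs for absolute links, most frequent first."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    counts = Counter()
    for link in collector.links:
        host = urlparse(link).hostname  # None for relative links
        if host:
            counts[host] += 1
    return counts.most_common()
```

Relative links are skipped here because they carry no domain of their own; resolving them against the page URL would be a natural extension.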
Data Mining: Besides downloading websites, it is also possible
to download databases, or other specific types of pages, automatically,
for instance the pages corresponding to a set of search engine
results. This is a good way to do statistical surveys of particular
content, of types of pages, or of technical aspects of how pages are built.
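A survey of how pages are built can be as simple as counting which HTML tags appear across a set of downloaded pages. The following is a rough sketch under stated assumptions: the pages are already available as HTML strings, and the names `TagCounter` and `survey` are made up for this illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count every start tag seen in one page."""

    def __init__(self):
        super().__init__()
        self.tags = Counter()

    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1


def survey(pages):
    """Return, for each tag, the number of pages that use it at least once."""
    usage = Counter()
    for html in pages:
        parser = TagCounter()
        parser.feed(html)
        parser.close()
        usage.update(set(parser.tags))  # one vote per page, not per tag
    return usage
```

Dividing each count by `len(pages)` gives the share of pages using a given tag, for example how many pages in a result set still rely on table-based layout.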