The Gathering Place - A Comprehensive Resource for Mission The Internet Foundation

Applying the Internet to Global and Systemic Issues


Automated Searches for coterms:  To locate all the prefixes and suffixes for a single term, I have created automated searches that traverse a significant part of the web and, in the process, try to build a complete list of coterms (the words that appear as prefixes or suffixes of the term).

Term + datebysingleday ~ 50,000 URLs or 5,000 searches
Term + ListTopLevelDomains ~ 25,000 URLs or 250 searches
Term + ListStateCodes.US ~ 5,000 URLs or 50 searches
Term + ListCoterms ~ thousands of coterms, thousands of searches
Term + ListTerms ~ any number of searches from an arbitrary list (e.g. common words)
Term + site:ListDomains ~ restricts the search to domains taken from one search on the core term, or on another term, with the domains in random or frequency order

Term may be any Boolean search.
Term may be a list of terms.
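
The search families above can be sketched in Python as follows. This is only a minimal illustration: the function names, the "term modifier" query format, and the sample lists are assumptions, not the actual tool, and the real lists (dates, top-level domains, state codes, coterms) are not given here.

```python
from datetime import date, timedelta
from itertools import product


def expand_queries(terms, modifiers, site=False):
    """Pair every root term with every modifier, one query string per pair.

    With site=True each modifier is treated as a domain and prefixed
    with "site:" (an assumed query syntax).
    """
    prefix = "site:" if site else ""
    return [f"{term} {prefix}{mod}" for term, mod in product(terms, modifiers)]


def single_days(start, end):
    """Yield one ISO date string per day from start to end, inclusive."""
    day = start
    while day <= end:
        yield day.isoformat()
        day += timedelta(days=1)


# Hypothetical sample lists, for illustration only.
state_codes = ["AL", "AK", "AZ"]
domains = ["example.org", "example.com"]

by_state = expand_queries(["solar energy"], state_codes)
by_site = expand_queries(["solar energy"], domains, site=True)
by_day = expand_queries(
    ["solar energy"], single_days(date(2011, 1, 1), date(2011, 1, 3))
)
```

Because the root terms are themselves a list, passing several terms multiplies the number of queries, which is how a single list can turn into hundreds or thousands of searches.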

What happens is that the number of new coterms found per search decays logarithmically or exponentially.  Gradually no new terms are found, and the total number of coterms approaches a plateau or limit.  Where the coterm list is finite (approaches a limit), one can be reasonably assured that the most relevant search terms have been found for the root term.  Instead of one search on a term, I can do hundreds or thousands, fairly efficiently.
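
The plateau behaviour can be illustrated with a short Python sketch. It assumes a single-word root term, treats the words immediately before and after each occurrence as its coterms, and stops once several consecutive result texts add nothing new. The function names and the `patience` parameter are hypothetical, chosen for this example.

```python
import re


def coterms(text, term):
    """Return the words directly before or after each occurrence of term."""
    words = re.findall(r"\w+", text.lower())
    term = term.lower()
    found = set()
    for i, word in enumerate(words):
        if word == term:
            if i > 0:
                found.add(words[i - 1])
            if i + 1 < len(words):
                found.add(words[i + 1])
    found.discard(term)  # the term repeated next to itself is not a coterm
    return found


def collect_until_plateau(result_texts, term, patience=3):
    """Accumulate coterms; stop after `patience` texts in a row add none.

    Returns the set of coterms and the cumulative count after each text,
    which can be plotted to see the curve flatten.
    """
    seen = set()
    growth = []
    quiet = 0
    for text in result_texts:
        new = coterms(text, term) - seen
        seen |= new
        growth.append(len(seen))
        quiet = 0 if new else quiet + 1
        if quiet >= patience:
            break
    return seen, growth
```

The `growth` list is the quantity of interest: when its last entries stop increasing, the coterm list has effectively reached its limit for the texts searched so far.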

Automated Scanning of Web Browser pages:  I can capture pages and process them on the fly as I browse the web.  Whether I am running a set of searches or browsing interesting sites (such as Wikipedia), I can mine the pages for useful information.  Where I am traversing a database, I can capture its contents.  I create web mining tools for selected large sites, and I can download complete sites for processing.
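
As an illustration of mining a captured page, the following Python sketch pulls the links and the visible text out of an HTML string using only the standard library. The class and function names are hypothetical, and how the page is captured from the browser is outside the scope of the sketch.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageMiner(HTMLParser):
    """Collect absolute link URLs and visible text from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.text_parts = []
        self._skip = 0  # depth inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.text_parts.append(data.strip())


def mine_page(html, base_url):
    """Return (links, text) for a captured page."""
    miner = PageMiner(base_url)
    miner.feed(html)
    miner.close()
    return miner.links, " ".join(miner.text_parts)
```

The returned links can feed the next round of downloads, and the returned text can be passed to whatever term extraction is being run.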

Data Mining:  Besides downloading websites, it is also possible to download databases or other specific types of pages automatically, for instance the pages corresponding to a set of search engine results.  This is a good way to do statistical surveys of particular content, of types of pages, or of technical aspects of how pages are built.
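
One simple survey of how pages are built is to measure what fraction of a set of downloaded pages uses each HTML tag. The Python sketch below assumes the pages are already available as HTML strings; the names `TagCounter` and `survey` are made up for this example.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count how often each start tag appears in one page."""

    def __init__(self):
        super().__init__()
        self.tags = Counter()

    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1


def survey(pages):
    """Return, for each tag, the fraction of pages that use it at least once."""
    pages_with_tag = Counter()
    for html in pages:
        counter = TagCounter()
        counter.feed(html)
        counter.close()
        # Count each tag once per page, however often it occurs there.
        pages_with_tag.update(set(counter.tags))
    total = len(pages)
    if total == 0:
        return {}
    return {tag: n / total for tag, n in pages_with_tag.items()}
```

The same pattern works for other per-page features (word counts, presence of a particular term, use of scripts): compute the feature for each page, then aggregate over the whole sample.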

 

Copyright 1988-2011  Richard Collins, All Rights Reserved