Click on these links to learn more about the CompletePlanet site and the automated means by which it is built.
Why the name "CompletePlanet" and what is its purpose?
We chose the name CompletePlanet to convey the site's role as the Internet's largest, most central and complete access point for all things relating to Internet searching. It was created as a public service and as a test bed for the Deep Query Manager.
What is the Deep Query Manager (DQM)?
DQM is a research, information sharing and management tool for organizations that accesses tens of thousands of Deep Web databases and Internet search engines.
With it, individual users can search vast stretches of the Internet in one search and can share their results across the organization or with selected co-workers as appropriate.
The system provides a very powerful infrastructure for finding and managing large amounts of information within the organization.
How are CompletePlanet and the DQM different from other search engines?
They are a complete source for search engines and databases.
They let you explore the quantity and quality of tens of thousands of searchable databases that other search engines cannot even find.
The results from a CompletePlanet search take you to the front door of such sites.
You gain added value from these sites and databases when you search them simultaneously for desired content with a directed query.
While CompletePlanet does not allow you to store, manipulate, and research results, DQM does.
What content does CompletePlanet contain?
CompletePlanet has three major content components:
- The most complete listing available of "surface" Web search engines and Deep Web searchable databases (see the Deep Web FAQ or Deep Web white paper to learn more about these distinctions).
CompletePlanet presently contains search sites organized into subject headings.
- Tips, tutorials and guides for helping you to become a more effective searcher of the Internet
- Links to other valuable Web sites where you can learn more about searching or the operation of search engines
How can I find searchable sites on the Internet using CompletePlanet?
You can find search sites on CompletePlanet in two ways.
One way is to navigate CompletePlanet's "browse tree", which has subject nodes organized into a tree structure up to five levels deep.
Use of this tree is similar to Yahoo! or other standard Web directories.
The second way is to enter a query directly into the search edit box at the top of most pages.
See the help section for details on the site search options available.
How does CompletePlanet identify and then qualify the searchable sites in its listings?
As with all other aspects of the CompletePlanet site, the identification of Web search sites is automatic.
CompletePlanet "spiders" Web sites and then applies a proprietary set of decision rules and tests to determine if a given site is a searchable site or not.
The automation involves inspecting site content and HTML tags.
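The actual decision rules are proprietary and are not described here. As a rough illustration only, the sketch below treats a page as "searchable" if it contains an HTML form with a free-text input. The class name `SearchFormDetector`, the function `looks_searchable`, and the rule itself are assumptions made for this example, not CompletePlanet's method.

```python
from html.parser import HTMLParser


class SearchFormDetector(HTMLParser):
    """Flags pages containing a form with a text-like input (a hypothetical rule)."""

    def __init__(self):
        super().__init__()
        self.in_form = False
        self.has_search_form = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.in_form = True
        elif tag == "input" and self.in_form:
            # An input with no type attribute defaults to a text box.
            if (attrs.get("type") or "text").lower() in ("text", "search"):
                self.has_search_form = True

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False


def looks_searchable(html: str) -> bool:
    """Return True if the page appears to offer a query box."""
    detector = SearchFormDetector()
    detector.feed(html)
    return detector.has_search_form
```

A production system would presumably apply many more tests than this single tag check, for example to tell a database query form apart from a newsletter sign-up box.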
What is the CompletePlanet "browse tree"?
The CompletePlanet "browse tree" gets its name because it appears as an upside-down tree: the lower-level nodes are its branches.
At the "root" there are 34 top-level domains (e.g., Agriculture & Food, References).
Each level thereafter may be split into up to 34 or so sub-nodes, which are more specific child subjects under a parent.
These splits continue deeper and deeper into the tree, with topics getting progressively more specific.
When you navigate from one level to the next deeper level, you are exploring one "branch" of the tree.
To explore other branches, you can back up levels and then descend in different directions.
On a combinatorial basis, the number of specific subjects using a tree structure can grow large rapidly.
For example, assuming about 20 sub-nodes at each level, CompletePlanet's five-level structure could lead to millions of specific subjects:
20 + 400 [20 * 20] + 8,000 [20 * 400] + 160,000 [20 * 8,000] + 3,200,000 [20 * 160,000] = 3,368,420
Of course, because some subjects are more or less comprehensive, the tree is bushier or more sparse for certain nodes.
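The arithmetic in the example above is a sum of powers of the branching factor. A minimal sketch that reproduces it follows; the function name `max_subjects` is made up for this illustration, and the count is an upper bound that assumes every node is split the same number of times.

```python
def max_subjects(branching: int, levels: int) -> int:
    """Upper bound on subject nodes when every node has `branching` children."""
    return sum(branching ** level for level in range(1, levels + 1))


print(max_subjects(20, 5))  # 3368420
```

Because real nodes are bushier or sparser than this uniform model, the actual number of subjects would be well below the bound.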
How does CompletePlanet determine its subject headings in the browse tree?
After ensuring that CompletePlanet's 34 top-level domains encompass the "complete" domain of possible subjects, actual results placed at each node in the tree are evaluated with proprietary computational linguistics techniques to determine possible sub-node subject headings.
CompletePlanet's editors make the final sub-node topic assignments. This process continues to deeper levels until all searchable sites have found a proper subject "home".
Why are some searchable sites listed more than once in the CompletePlanet browse tree?
CompletePlanet is rare in allowing its results to be placed in more than one location in its browse subject tree.
Most Web directories allow one and only one placement of a given result.
A single-placement objective derives from library practice and made sense when cost and bulk dictated that books be placed in only one location on the library's shelves.
Ontologies such as the Dewey Decimal Classification system and the Library of Congress Subject Headings are aids that help librarians make this assignment, and card catalogs provide a cross-reference point for those seeking specific books.
These imperatives do not apply to electronic representation of results on the Internet, since the cost of placing documents in more than a single location is in essence zero.
Because no ontology is perfect and there are many ways that documents can be classified, CompletePlanet allows up to five placements of a given result in different subject nodes.
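One way to picture the five-placement limit is as a "top five" selection over per-node scores. The sketch below assumes each candidate subject node already has a numeric score for the site and that the highest-scoring nodes win. Both the scores and the selection rule are assumptions for illustration, as is the name `choose_placements`.

```python
import heapq

MAX_PLACEMENTS = 5  # the limit stated in the text above


def choose_placements(node_scores: dict[str, float],
                      limit: int = MAX_PLACEMENTS) -> list[str]:
    """Return up to `limit` subject nodes with the highest scores, best first."""
    return heapq.nlargest(limit, node_scores, key=node_scores.get)
```

A site that scores for only one or two nodes simply receives fewer placements; the limit is a ceiling, not a quota.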
How does CompletePlanet place searchable sites within its browse tree subject structure?
All searchable sites are placed automatically within CompletePlanet.
The process combines proprietary computational linguistics techniques with learned decision rules.
Placement is a very complicated process that involves literally dozens of tests for each result and each subject node.
Each searchable site is completely indexed and scored using computational linguistics techniques, with weighting factors for content, the appearance of terms within titles or tags, and so on.
The query terms used for this process involve various combinations of terms within the specific subject node and throughout the "branch" of the tree for that node.
Then, results at each subject node are evaluated with respect to all other searchable sites initially placed at that node.
Additional decision rules compare a given searchable site's placement across the entire subject tree to determine its best home.
QA/QC (quality assurance/quality control) inspections by CompletePlanet's editors discover additional decision rules that are then fed back into the categorization system in a process of continuous quality improvements.
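The scoring techniques themselves are proprietary, so the following is only a toy sketch of the general idea of field-weighted term scoring. The weights in `FIELD_WEIGHTS`, the field names, and the function `score_site` are all hypothetical; the real system applies dozens of tests rather than a simple count.

```python
# Hypothetical weights: a term in the title counts more than one in the body.
FIELD_WEIGHTS = {"title": 3.0, "tags": 2.0, "body": 1.0}


def score_site(site: dict[str, str], node_terms: list[str]) -> float:
    """Sum weighted occurrences of a subject node's terms across a site's fields."""
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        words = site.get(field, "").lower().split()
        total += weight * sum(words.count(term.lower()) for term in node_terms)
    return total
```

For instance, a site whose title, tags, and body each mention "crop" once would score 3.0 + 2.0 + 1.0 = 6.0 for a node whose only term is "crop".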
What determines the order of presentation of site results within the browse tree?
The automatic techniques for placing searchable sites within the CompletePlanet browse tree produce an overall score for that site and subject node.
Results are presented in descending order of these ranked scores.
Why can't I find a favorite site in CompletePlanet under its previous subject node?
Three things can cause a previously listed searchable site to no longer appear under a given subject node in the CompletePlanet browse tree:
- The searchable site no longer exists and has been removed from CompletePlanet
- The subject tree structure has changed within CompletePlanet and the previous searchable site has been moved to a better subject "home"
- Higher quality searchable sites have been added to CompletePlanet for that subject node and that has caused the previous site to score below the maximum listings threshold per node.
The latter two factors may be somewhat frequent since the structure and contents of CompletePlanet are dynamic to keep pace with changing content on the Web.
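The third cause is easiest to see with a small example. The sketch below ranks sites by descending score and keeps only a fixed number per node. The site names, the scores, and the cap of three are invented for illustration; the actual threshold is not stated here.

```python
def listed_sites(scores: dict[str, float], max_listings: int) -> list[str]:
    """Rank sites by descending score and keep only the top `max_listings`."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:max_listings]


node = {"site-a": 0.82, "site-b": 0.75, "site-c": 0.60}
before = listed_sites(node, max_listings=3)

node["site-d"] = 0.90  # a newly added, higher-scoring site
after = listed_sites(node, max_listings=3)

print(before)  # ['site-a', 'site-b', 'site-c']
print(after)   # ['site-d', 'site-a', 'site-b']
```

Nothing about site-c changed, yet it no longer appears under the node once a stronger site arrives.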
Why do I sometimes see a site listed under a node that doesn't seem to belong there?
Search sites misplaced in a browse tree category are "false positives" and are a major focus of CompletePlanet's ongoing quality refinements through document placement decision rules.
An example of a false positive is a searchable site listing university courses that strictly meets the content selection criteria for a given subject node, but which contains or conveys little true content.
Other examples of false positives are documents with ambiguous terms that could be placed in multiple categories.
We are continuously adding new rules and decision criteria to the system to remove false positives and improve placements.
If you discover a searchable site that you think is listed in a wrong category, please let us know via email at firstname.lastname@example.org.
Why do I occasionally see a site result using the search function that does not appear in the browse tree?
All qualified searchable sites are contained within the pool available to CompletePlanet's search function.
However, some searchable sites are essentially duplicates of other sites or score too low to be placed within the browse tree, and therefore do not appear within that structure.
What are the different summary options presented for site search results?
Other search portals either hand-summarize each document (costly and labor-intensive), take the first characters of a page, or use a "window" of the text around the first appearance of the query text in an attempt to generate a description of the document. Unfortunately, this approach usually fails to offer adequate information about the document's content.
To better convey the "true" content of a document, CompletePlanet's result summaries are generally calculated automatically, with the selection among the various forms also determined automatically via proprietary scoring tests.
The three summary options that may appear are:
- "Description": this is the author's own document description, subject to certain acceptance thresholds.
Since not all authors provide these descriptions and some do not pass the acceptance tests, descriptions are used for only about 15% of all summaries.
- "Extract/Summary": this is an automatic summary produced by CompletePlanet's analysis system, which analyzes the text and extracts content to build a useful summary.
- "Keywords": when the extract/summary does not meet certain threshold conditions, keywords present the "highest value" words in the document in order of importance.
This option is used about 10% of the time, mostly for documents with little textual content, such as tabular data.
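The choice among the three options can be pictured as a fallback chain. In the sketch below, simple word-count minimums stand in for the proprietary acceptance and threshold tests. The function `choose_summary`, its parameters, and the default minimums are assumptions for illustration only.

```python
from typing import Optional


def choose_summary(description: Optional[str],
                   extract: Optional[str],
                   keywords: list[str],
                   min_description_words: int = 5,
                   min_extract_words: int = 10) -> tuple[str, str]:
    """Pick a summary form: author description, then extract, then keywords."""
    # Hypothetical acceptance test: the description must be long enough.
    if description and len(description.split()) >= min_description_words:
        return ("Description", description)
    # Hypothetical threshold: the extract must contain enough text.
    if extract and len(extract.split()) >= min_extract_words:
        return ("Extract/Summary", extract)
    # Fall back to the highest-value words, already ordered by importance.
    return ("Keywords", ", ".join(keywords))
```

A page of tabular data with no author description and a very short extract would fall through to the "Keywords" form, matching the case described above.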
What is the "categories" link in the search results at the end of a site summary?
Clicking on the "Categories" link at the end of a result from a site search lists the other browse tree subject nodes in which that result is listed. This is a useful means to identify other searchable sites that may be similar to that given result.
Why is automation emphasized so much within the CompletePlanet site?
We believe automation is an imperative for these reasons:
- An estimated 1.5 million documents are posted daily on the "surface" Web alone. Automation is the only means to keep pace with the Internet's growth, especially considering the faster growth of "deep" Web searchable databases
- Though hand-editing and -culling of documents can lead to the absolute highest quality, reliance on voluntary editors is problematic and hiring a large staff of dedicated editors leads to a high-cost business model
- Reliance on automation forces learning to be incorporated from the outset into constantly improving business rules. In essence, a commitment to automation enforces continuous learning as an explicit part of how we operate our company
- Quality improvement through automation and learned decision rules is sustainable and can be leveraged for other applications.
By employing a comparatively smaller staff of editors for statistical and spot-check QA/QC, we are able to sustain progressive quality improvements through learning new decision rules with acceptable costs.
What are CompletePlanet's privacy policies?
Why are there no "adult" sites on CompletePlanet?
Our focus is to provide the most comprehensive access point to searchable sites on the Web for the research and "power user" community needing sophisticated information and content access. We do not see "adult" sites as consistent with that mission.