US7987172B1 - Minimizing visibility of stale content in web searching including revising web crawl intervals of documents - Google Patents

Info

Publication number
US7987172B1
Authority
US
United States
Prior art keywords
document
interval
web crawl
content
documents
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active, expires
Application number
US10/930,280
Inventor
Anton P. T. Carver
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Google LLC
Original Assignee
Google LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Google LLC filed Critical Google LLC
Priority to US10/930,280 priority Critical patent/US7987172B1/en
Assigned to GOOGLE INC. reassignment GOOGLE INC. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: CARVER, ANTON P.T.
Priority to US13/166,757 priority patent/US8407204B2/en
Application granted granted Critical
Publication of US7987172B1 publication Critical patent/US7987172B1/en
Priority to US13/849,355 priority patent/US8782032B2/en
Assigned to GOOGLE LLC reassignment GOOGLE LLC CHANGE OF NAME (SEE DOCUMENT FOR DETAILS). Assignors: GOOGLE INC.
Active legal-status Critical Current
Adjusted expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques

Abstract

A method and system are disclosed for associating an appropriate web crawl interval with a document so that the probability of the document's stale content being used by a search engine is below an acceptable level when the search engine crawls the document at its associated web crawl interval. The web crawl interval of a document is determined through an iterative process and updated dynamically by the search engine after every visit to the document by a web crawler. A multi-tier data structure is employed for managing the web crawl order of billions of documents on the Internet. The search engine may move a document from one tier to another if its web crawl interval is changed significantly.

Description

FIELD OF THE INVENTION
The present invention relates generally to the field of search engines for locating documents in a computer network system, and in particular, to a system and method for minimizing the visibility of stale data through a web search engine.
BACKGROUND OF THE INVENTION
Search engines provide a powerful tool for locating documents in a large database of documents, such as the documents on the Internet or the documents stored on the computers of an Intranet. In the context of this application, a document is defined as a combination of a document address, e.g., a uniform resource locator (URL), and a document content.
A typical structure of a web search engine comprises a front end and a back end. The front end includes a query server for receiving a search query submitted by a user and displaying search results to the user, and a query processor for transforming the search query into a search request understood by the back end of the web search engine. The back end includes one or more web crawlers for retrieving documents from the Internet, a scheduler for providing addresses of the documents to the web crawlers, an indexer for indexing the documents retrieved by the web crawlers and one or more databases for storing information of the retrieved documents, e.g., the indexes of the documents. Upon receipt of a search request, the front end searches the databases, identifies documents whose contents match the search request and returns them as the search results to the requester.
There are billions of documents accessible through the Internet. The life expectancy of a document's content (after which its contents may be replaced or changed) may vary from a few years to a few seconds. Every day, many thousands of new and revised documents are posted by various web servers all over the world, while other documents are deleted from their hosting web servers and are therefore no longer accessible. As a result, at least some of the document information stored in a web search engine is likely to be stale, even if the web search engine is continuously crawling the web so as to update its database. Stale content in a search engine database is said to be visible when the search engine returns a result (e.g., in response to a search query) that is based on stale information. In some cases, the stale content in the search engine may have no particular significance, because the changes to the documents listed in a search result are minor, or the relevance of the documents remains substantially the same. However, in other cases the search result may include links to documents that no longer exist, or whose content has changed such that the result is no longer relevant to the query (or has lower relevance to the query than the prior content of the documents). For purposes of this document, stale content is assumed to be visible whenever search results are returned based on the stale content, even if the search results are still useful to the user.
In general, it would be desirable to keep the document information in a search engine's databases as fresh as possible, while avoiding needless refreshing of content that is highly static. More generally, it would be desirable to schedule documents for downloading by a web crawler so as to minimize the visibility of stale document information in the databases of the search engine.
SUMMARY
A web crawling system associates an appropriate web crawl interval with a document so that the probability of the document's stale content being used by a search engine is maintained below an acceptable level. Assuming sufficient crawl bandwidth, the search engine crawls each document at its associated web crawl interval.
In some embodiments, the web crawl interval of a document is identified by an iterative process that starts with an initial estimate of the web crawl interval. The iterative process, after crawling a document multiple times at different time intervals and analyzing the content changes associated with the crawling results, converges to a time interval that is deemed most appropriate for this document. This time interval is associated with the document as its web crawl interval.
In one embodiment, documents are partitioned into multiple tiers, each tier including a plurality of documents sharing similar web crawl intervals. After each crawl, the search engine re-evaluates a document's web crawl interval and determines if the document should be moved from its current tier to another tier.
In another embodiment, changes to a document's content are divided into two categories, critical content changes referring to those changes that occur to a predetermined portion of a document and non-critical content changes covering all other changes to the document. During the course of updating a document's web crawl interval, the search engine takes into account only critical content changes and ignores all non-critical content changes to the document.
BRIEF DESCRIPTION OF THE DRAWINGS
The aforementioned features and advantages of the invention as well as additional features and advantages thereof will be more clearly understood hereinafter as a result of a detailed description of preferred embodiments of the invention when taken in conjunction with the drawings.
FIG. 1 schematically represents the distribution of the content update rates of documents on the Internet as an L-shaped curve.
FIG. 2 depicts a search engine system that implements a multi-tier data structure for the billions of documents on the Internet.
FIG. 3 is a flowchart illustrating a dynamic crawling priority update strategy in accordance with an embodiment.
FIG. 4 illustrates a computer-based search engine system in accordance with an embodiment.
Like reference numerals refer to corresponding parts throughout the several views of the drawings.
DESCRIPTION OF EMBODIMENTS
It is expected that a small number of documents on the Internet will have content that changes frequently and a larger number of documents will have content that changes rather infrequently. Document update intervals may range, for example, from once every few seconds to once every few years. FIG. 1 schematically illustrates this as an L-shaped distribution of content update rates for documents. There are a relatively small number of documents having high content update rates, as shown at the left portion of the L-shaped curve. On the other hand, as shown at the right portion of the curve, there are a large number of documents with much lower content update rates. Based on the distribution of content update rates, a search engine may incorporate a multi-tier data structure to group a certain number of documents whose content update rates fall within a particular portion of the L-shaped curve. This grouping may be used to ease the administrative overhead of scheduling efforts to obtain new copies of the documents. On the other hand, in another embodiment, such a tier data structure is not used and documents are not grouped into tiers for crawling purposes. The concepts described below would apply whether or not a tiered structure was used.
As mentioned above, a tiered structure may allow groups of documents to be treated together for various administrative and processing purposes. As shown in FIG. 1, “Tier A” includes documents having the highest content update rates and “Tier Z” includes documents having the lowest content update rates. Typically, a document from a higher tier, e.g., Tier A, is given a higher crawling priority, or a higher crawl repetition rate, than any document from a lower tier, e.g., Tier B, and vice versa.
FIG. 2 depicts a search engine system 200 that implements the multi-tier data structure as suggested above. Information for the documents falling into “Tier A” is stored in a database “Tier 1” and so on. Each document is characterized by a set of parameters including, e.g., a URL, a content fingerprint, a Boolean value suggesting whether there is a critical content change to the document, an actual web crawl interval identified by the search engine during previous web crawl(s) and a web crawl interval recommended for the forthcoming web crawl(s). The parameters could also include a past history of the N previous actual web crawl intervals. This might include information indicating for which intervals the content had changed and for which intervals the content had not changed. Using these values, it would be possible to determine an average interval length over which the document's content had not changed and an average interval length over which the document's content had changed. In another embodiment, a running average of the X previous actual web crawl intervals could be used or stored. In other embodiments, the set of parameters characterizing a document may be a subset of those identified above, or may include a subset of the parameters identified above plus other parameters not identified above.
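The per-document parameters listed above could be held in a record along the following lines. This is only a minimal sketch: the class name, the field names, and the choice of seconds as the unit are assumptions made for illustration and do not come from the specification.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import List, Optional, Tuple


@dataclass
class DocumentRecord:
    """Per-document crawl metadata (all names here are illustrative)."""
    url: str
    content_fingerprint: str
    critical_change: bool         # was a critical change seen on the last crawl?
    actual_interval: float        # seconds between the two most recent crawls
    recommended_interval: float   # seconds to wait before the next crawl
    # Past crawls as (actual interval in seconds, content changed?) pairs.
    history: List[Tuple[float, bool]] = field(default_factory=list)

    def average_interval(self, changed: bool) -> Optional[float]:
        """Mean of past intervals whose outcome matches `changed`, or None."""
        matching = [t for t, c in self.history if c == changed]
        return mean(matching) if matching else None
```

Calling `average_interval(True)` and `average_interval(False)` yields the two averages mentioned in the text, one for intervals over which the content changed and one for intervals over which it did not.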
The multi-tier databases implementing the multi-tier data structure submit web crawl requests to a scheduler, suggesting which documents should be crawled according to their respective web crawl intervals. In response, the scheduler examines the workload and capacity of its multiple web crawlers and then dispatches a particular web crawler, e.g., Crawler 3, to a web server on the Internet hosting the document.
After retrieving a new copy of the document from the hosting web server, the web crawler passes the new copy to a history log database. The history log database also has access to the previous copy of the document stored in the search engine system. Upon receipt of the new copy, the history log database retrieves the previous copy and submits both copies to the scheduler. The scheduler determines whether to modify the document's web crawl interval using information it has gathered about the document and updates one of the multi-tier databases accordingly. Of course, if this is the first time that a document has been crawled, the search engine will not have a previous copy to provide the scheduler. In this case, the scheduler assigns an initial web crawl interval to the document. The initial crawl interval could be determined in any of a number of ways, some of which are described below.
FIG. 3 is a flowchart illustrating a dynamic web crawl interval update strategy in accordance with one embodiment of the present invention. After receiving information of a particular document from the scheduler, one of the multi-tier databases of FIG. 2 schedules a web crawl request for the document based upon a desired web crawl interval for the document (302). Subsequently, one web crawler is invoked by the request to retrieve a new copy of the document and record it in the history log database (304). The history log database then passes the newly recorded document and its previous copy, if any, to the scheduler. The scheduler compares the content of the newly recorded document and that of the previous copy (306) to determine if the document content has changed (308). In some embodiments, the determination made at 308 is whether there have been any critical content changes in the document. The scheduler may indicate whether or not such a change has been detected in the history log and associate it with the particular crawl interval.
The simplest way to determine content changes is to compare the content fingerprint of the document before and after the recent crawl. If the content fingerprints are equal, the document has not changed; otherwise, it has. Changes can be described as critical or non-critical and that determination may depend on the portion of the document changed, or the context of the changes, rather than the amount of text or content changed. Sometimes a change to a document may be insubstantial, e.g., the change of advertisements associated with a document. In this case, it is more appropriate to ignore those accessory materials in a document prior to making content comparisons. In other cases, e.g., as part of a product search, not every piece of information in a document is weighted equally by a potential user. For instance, the user may care more about the unit price of the product and the availability of the product. In this case, it is more appropriate to focus on the changes associated with information that is deemed critical to a potential user rather than something that is less significant, e.g., a change in a product's color. Accordingly, the determination of criticality or materiality is a function of the use and application of the documents.
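One way to picture a fingerprint comparison that ignores non-critical material is sketched below. It assumes the document has already been split into named fields (for example price, availability, color, advertisement); that decomposition, the function names, and the use of SHA-256 are assumptions for illustration, since the specification does not prescribe a fingerprint function.

```python
import hashlib
from typing import Dict, Iterable


def fingerprint(fields: Dict[str, str], critical_keys: Iterable[str]) -> str:
    """Hash only the fields deemed critical, in a stable key order."""
    h = hashlib.sha256()
    for key in sorted(critical_keys):
        value = fields.get(key, "")
        # Length-prefix each part so adjacent fields cannot run together.
        for part in (key, value):
            data = part.encode("utf-8")
            h.update(len(data).to_bytes(8, "big"))
            h.update(data)
    return h.hexdigest()


def has_critical_change(old: Dict[str, str], new: Dict[str, str],
                        critical_keys: Iterable[str]) -> bool:
    """True if the critical-field fingerprints of the two copies differ."""
    keys = list(critical_keys)
    return fingerprint(old, keys) != fingerprint(new, keys)
```

With `critical_keys` set to price and availability, a change of advertisement or product color leaves the fingerprint untouched, while a price change alters it.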
Alternatively, a document could be considered a collection of individual features which change from time to time. Changes associated with different features would be accorded different levels of importance. In this instance, a document would be considered “changed” if the combination of a set of weighted features whose values have changed exceeds a certain threshold. For example in the equation below, when C is greater than some defined value, then the document is deemed to have materially changed:
$$C = \sum_{i=0}^{n-1} \text{weight}_i \cdot \text{feature}_i$$
where n is the number of features whose values have changed. Alternately, n may be the total number of features and the weights may be assigned non-zero values for only those features whose values have changed.
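The weighted sum can be illustrated with a short sketch. It follows the second reading given above (n is the total number of features and only changed features contribute), and it assumes that each feature value in the sum is simply an indicator equal to 1 when the feature changed. The function names, the example weights, and the threshold are hypothetical.

```python
from typing import Dict


def change_score(old: Dict[str, str], new: Dict[str, str],
                 weights: Dict[str, float]) -> float:
    """C = sum of weight_i over features whose values differ between copies."""
    return sum(w for name, w in weights.items()
               if old.get(name) != new.get(name))


def materially_changed(old: Dict[str, str], new: Dict[str, str],
                       weights: Dict[str, float], threshold: float) -> bool:
    """The document counts as changed when C exceeds the threshold."""
    return change_score(old, new, weights) > threshold
```

For example, with weights of 0.6 for price, 0.3 for availability and 0.1 for color, and a threshold of 0.5, a color change alone is not material but a price change is.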
If the document has changed materially since the last crawl (308—Yes), the scheduler sends a notice to a content indexer (not shown), which replaces index entries for the prior version of the document with index entries for the current version of the document (310). Next, the scheduler computes a new web crawl interval (312) for the document based on its old interval and additional information, e.g., the document's importance (as measured by a score, such as pagerank), update rate and/or click rate. If the document's content has not been changed or if the content changes are non-critical (308—No), there is no need to re-index the document (314). However, the scheduler still computes a new web crawl interval (316) for the document based on its old one and other information, in particular, based on the fact that there was no critical content change to the document. A more in-depth discussion regarding the determination of the new web crawl interval is provided below. Of course, the scheduler could be configured to re-index the document and compute a new crawl interval on any change to the content, material or not.
Next, the scheduler records the newly determined web crawl interval at one of the multi-tier databases for later use. However, since the document's web crawl interval may be different from the one used previously, the document's affiliation with a particular tier may terminate as well. More specifically, if the recomputed crawl interval belongs to the interval range associated with a different tier (318—No), the document and its associated web crawl interval are moved to the other tier (320). Otherwise (318—Yes), the document and its new web crawl interval are recorded in the same tier database as previously. Alternately, the determination of whether to move the document to another tier, or to keep it in the current tier, may be based on the magnitude of the change in the document's web crawl interval.
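The tier check at step 318 might be sketched as a lookup against per-tier interval ranges. The boundaries below (one minute, one hour, one day, unbounded) and the function names are invented for illustration; the specification does not give concrete ranges.

```python
from bisect import bisect_left
from typing import Sequence


def tier_for_interval(interval: float, upper_bounds: Sequence[float]) -> int:
    """Index of the first tier whose upper bound is >= interval.

    `upper_bounds` must be sorted ascending; the last tier catches
    anything longer than every bound.
    """
    return min(bisect_left(upper_bounds, interval), len(upper_bounds) - 1)


def should_move(current_tier: int, new_interval: float,
                upper_bounds: Sequence[float]) -> bool:
    """True if the recomputed interval falls in a different tier's range."""
    return tier_for_interval(new_interval, upper_bounds) != current_tier


# Hypothetical tier boundaries in seconds.
TIER_BOUNDS = [60.0, 3600.0, 86400.0, float("inf")]
```

Here a document in tier 1 whose interval is recomputed as 1800 seconds stays where it is, whereas a recomputed interval of 30 seconds moves it to tier 0.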
When determining a new crawl interval, it is desirable to choose one which will reduce the probability that in response to a user request represented by a set of query terms, the web search engine returns the address of a document matching the request based on stale content. Stale content no longer reflects the current state of the document stored on the web server. Such a probability is a function of a user view rate on the document (which is a reflection on how frequently a page is viewed); a document update rate (which is an indication of how frequently the page is updated on the web host server); and the web crawl interval (which is an indication of the time that elapses until the crawler obtains an updated copy of the document from its web server). This function can be expressed as:
Probability(Seen_Stale_Data)=Function(User_View_Rate,Document_Update_Rate,Web_Crawl_Interval).
In one embodiment, given a desired probability, Probability_Desired, the web crawl interval can be expressed as:
Web_Crawl_Interval=Probability_Desired/(User_View_Rate*Document_Update_Rate).
In other words, the higher a user view rate and/or the document update rate, the smaller the web crawl interval must be to maintain the same relative probability (i.e., the document is crawled more frequently).
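As a rough numerical illustration of the expression above, the sketch below evaluates it directly. The function name and the units are assumptions (the specification does not fix units for the rates or the interval); only the inverse relationship is taken from the text.

```python
def web_crawl_interval(probability_desired: float,
                       user_view_rate: float,
                       document_update_rate: float) -> float:
    """Interval = Probability_Desired / (User_View_Rate * Document_Update_Rate)."""
    if user_view_rate <= 0 or document_update_rate <= 0:
        raise ValueError("rates must be positive")
    return probability_desired / (user_view_rate * document_update_rate)


# Doubling the view rate halves the interval, so the document is crawled
# twice as often to hold the same desired probability.
base = web_crawl_interval(0.1, 2.0, 0.5)
busier = web_crawl_interval(0.1, 4.0, 0.5)
```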
Alternatively, the user view rate can be expressed as a user impression rate, a user click rate or a combination of the two. An impression rate is the rate at which the user is presented with the document, which includes presentation of all or part of the document in a search result, whereas the user click rate represents when a user clicks on a document to have it presented. As a combination, the user impression rate would be combined with the user click rate multiplied by a weighting factor. The weighting factor allows a relationship to be created representing the relative worth of a click compared to an impression. For example, a click may be worth x impressions, where x varies from negative values to positive values.
There are different approaches for measuring the user click rate, such as using redirects from the origin application. However, the redirect approach may be unreliable due to various spam robots which may cause the click rate to be artificially inflated. The effects of such robots could be reduced by, for example, using unique session identification information based on IP or cookie information. Alternatively, an application such as Google's NavClient could be used, which is more resistant to spam attacks than the redirect approach.
It would be desirable to accurately estimate an update rate of a particular document to be crawled. Every document on the Internet has an associated document update rate and, as mentioned earlier, some documents are updated more frequently than others. If an estimated document update rate used to determine how frequently a document is crawled is much higher than the actual document update rate, then too small a web crawl interval will be determined. Therefore, a later crawl of the document at that smaller interval is likely to retrieve a copy of the document content that is substantially or materially the same as the previous crawl(s). This unnecessary crawl wastes valuable resources of the search engine. On the other hand, an estimated document update rate that is much lower than the actual document update rate results in a longer than necessary web crawl interval. This may cause the search engine to match a user query to stale data of a document because the search engine has not indexed the current version of the document.
A highly desirable situation would be that the search engine crawls a document right after its update. However, this would require that a web server notify the web search engine every time it updates a document. A more practical approach is to crawl the document at a rate that is close to its “actual” update rate.
As described in reference to FIG. 3 above, a dynamic process to approach the near-“actual” update rate of a document would include the following steps:
    • 1. Crawling a URL to fetch a new copy of the document's content; and
    • 2. Comparing the new content with an old content of the document to determine if the content has changed, and if so, to what extent.
      There are two possible outcomes from the comparison:
    • 1. There is no change (or at least no material change) to the document during the web crawl interval; and
    • 2. There is a content change (or at least a material change) to the document during the web crawl interval.
In the first case, the newly completed crawl does not retrieve any new information about the document and to a certain degree, it is a waste of the search engine's limited crawling resources. In the second case, the newly completed crawl does acquire new information about the document. In this sense, such a crawl is not a waste. However, it indicates that there must be a delay between the time when the document was updated and the time when the document was crawled even though the extent of such delay is unknown. Without knowledge of the exact update time of a document, a desirable web crawl interval for the document is the one that, when applied, alternates between the two possible outcomes.
If there are two consecutive no-change outcomes, the web crawl interval is deemed too small and at least one of the two crawls could have been avoided to save crawling resources. Accordingly, the desirable web crawl interval should be increased. If there are two consecutive change outcomes, the web crawl interval is deemed too large and the risk that a document is “seen stale” has increased. Accordingly, the desirable web crawl interval should be decreased. A number of methodologies can be envisioned for producing these types of modifications to the web crawl rate. For example, the Nyquist sampling law familiar to those involved with signal processing could be applied. According to the Nyquist sampling law, a signal having a period T should be sampled at least twice during each period in order to avoid information loss. In the case of web crawling, a document that is updated every N seconds should be sampled twice during each N seconds. In other words, a desirable web crawl interval would be N/2 seconds. The determination of a desirable web crawl interval is made more difficult by the fact that a particular document's update rate may vary in time. As a consequence, the desired web crawl interval may vary over time.
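The rule on consecutive outcomes can be sketched as a small adjustment function. The text only says that the interval should be increased or decreased; the multiplicative factors used here (1.5 and 0.5) and the function name are assumptions chosen for illustration.

```python
from typing import Tuple


def adjust_interval(interval: float, last_two_changed: Tuple[bool, bool],
                    grow: float = 1.5, shrink: float = 0.5) -> float:
    """Lengthen after two no-change crawls, shorten after two change crawls.

    Alternating outcomes leave the interval as it is, matching the idea
    that a well-chosen interval alternates between the two outcomes.
    """
    previous, latest = last_two_changed
    if previous and latest:
        return interval * shrink   # crawling too rarely
    if not previous and not latest:
        return interval * grow     # crawling too often
    return interval
```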
In one embodiment, a dynamic desirable web crawl interval is determined as follows. Given that a web crawl interval is T1, if a crawl at time T+T1, where T is the time of the previous crawl, shows that the document has been changed, then the web crawl interval is modified to be half of the previous interval, i.e., T1/2. If there is no change to the document after the web crawl interval is halved, the desirable web crawl interval is modified to be somewhere between T1/2 and T1, e.g., the average of the two intervals, 3T1/4. An iterative process can be used to refine the desired web crawl interval. Different embodiments may select the initial web crawl interval in different ways. For example, the initial web crawl interval could be determined to be the average actual or average desired change interval for all documents, for all documents determined to be in a similar tier, or documents having a similarity to the document under consideration. In other embodiments, the initial web crawl interval could be based, at least in part, on a document's popularity or importance (e.g., as measured by the document's pagerank). For example, two documents in the same tier, but with different pageranks, may be assigned different initial web crawl intervals in accordance with their respective pageranks.
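The halving and averaging steps described above resemble a bisection search, which the following sketch makes explicit. The class and attribute names are hypothetical. Two behaviors are assumptions that go beyond the text: doubling the interval when no change has been seen and no upper bracket is known yet, and discarding a bracket that contradicts the newest observation (which can happen when the document's update rate drifts).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IntervalEstimator:
    """Refines a crawl interval by bracketing it between two observations."""
    interval: float
    no_change_at: Optional[float] = None  # latest interval that showed no change
    change_at: Optional[float] = None     # latest interval that showed a change

    def observe(self, changed: bool) -> float:
        """Record the outcome of a crawl made at `self.interval`; return the next."""
        if changed:
            self.change_at = self.interval
            # Drop a lower bracket that contradicts this observation.
            if self.no_change_at is not None and self.no_change_at >= self.interval:
                self.no_change_at = None
            if self.no_change_at is None:
                self.interval = self.interval / 2          # T1 -> T1/2
            else:
                self.interval = (self.no_change_at + self.change_at) / 2
        else:
            self.no_change_at = self.interval
            # Drop an upper bracket that contradicts this observation.
            if self.change_at is not None and self.change_at <= self.interval:
                self.change_at = None
            if self.change_at is None:
                self.interval = self.interval * 2          # assumption, not in the text
            else:
                self.interval = (self.no_change_at + self.change_at) / 2
        return self.interval
```

Starting from T1 = 8, a change followed by no change reproduces the sequence in the text: the interval goes to 4 (T1/2) and then to 6 (3T1/4).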
The term “pagerank” is used in this document to mean a document importance score. PageRank is just one example of a document importance score. A detailed description of the PageRank algorithm can be found in the article “The Anatomy of a Large-Scale Hypertextual Search Engine” by S. Brin and L. Page, 7th International World Wide Web Conference, Brisbane, Australia and U.S. Pat. No. 6,285,999, both of which are hereby incorporated by reference as background information.
In another embodiment, an average interval between changes is compared to an average interval between no changes. If the average interval between crawls where no change was detected is greater than the average interval between crawls where a change was detected, the crawl interval may be close to the desired crawl interval. The interval could be maintained, or could be modified in accordance with the last comparison of the document with its prior version. For example, if the last comparison detected a change, then the web crawl interval may be changed to be the average interval between crawls where change was detected. On the other hand, if the last comparison detected no change, then the web crawl interval may be changed to be the average interval between crawls where no change was detected.
If the average interval between crawls where no change was detected is less than the average interval between crawls where a change was detected, it suggests that the desired crawl interval is between the two averages. Accordingly, the new web crawl interval may be chosen to be the average of the two averages.
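The comparison of the two averages can be written out as a single function. This is a sketch under stated assumptions: the function and parameter names are hypothetical, the first branch implements the "modified in accordance with the last comparison" option rather than the "maintained" option, and the case where the two averages are equal (on which the text is silent) falls into the averaging branch.

```python
def interval_from_averages(avg_no_change: float, avg_change: float,
                           last_changed: bool) -> float:
    """Choose a new crawl interval from the two historical averages."""
    if avg_no_change > avg_change:
        # Likely near the desired interval already; follow the latest outcome.
        return avg_change if last_changed else avg_no_change
    # The desired interval is taken to lie between the two averages.
    return (avg_no_change + avg_change) / 2
```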
The desired web crawl interval can be combined with other information to provide a score used to determine the crawling order for the documents to be crawled by a web search engine. The score takes into account various inputs to create a web crawl priority in order to reduce the probability of stale content to a desired level. For example, a document with a higher web crawl priority would receive more frequent visits from the search engine's web crawlers, resulting in a higher likelihood that the content is not stale.
In reality there are a huge number of documents competing for the limited web crawl capacity of a search engine. Therefore, it is practically inevitable that some documents will have stale content and will be presented to a user in a search result. The search engine can consider each document's pagerank, user click rate, and content update rates and/or other information, and provide an appropriate web crawl priority to the document so that the resultant probability of a document being seen “dirty”, i.e., the document's stale content being used in response to a search query, is below an acceptable level. In other words, a document's web crawl priority will determine its web crawl order relative to other documents competing for a search engine's limited web crawl capacity.
It should be noted that a document's desired web crawl interval is not necessarily identical to the document's actual web crawl interval. For example, the priority given to a certain document may not allow it to be crawled at the desired interval. Or, if documents are grouped in tiers, that too may affect the actual crawl interval. As a result, a document's actual web crawl interval may be longer than the desired web crawl interval. However, the difference between the two web crawl intervals does not adversely affect the role played by the desired web crawl interval in a significant way. Generally, the shorter the web crawl interval of a document, the higher its web crawl priority.
A generic relationship between the probability of a document being seen stale and its pagerank, user click rate, content update rate and web crawl interval can be expressed as:
$$P_{\text{stale}} = f(PR_{\text{pagerank}},\, T_{\text{click rate}},\, T_{\text{content update rate}},\, T_{\text{web crawl}})$$
where Pstale represents a probability that the document is searched, or seen, in its stale state; PRpagerank represents the pagerank or importance of the document; Tclick rate represents the rate at which users click on the document; Tcontent update rate represents the rate at which the document is updated by its web server; and Tweb crawl represents the desired web crawl interval. The exact mathematical expression of the function ƒ is relatively arbitrary depending on how much weight each of the four parameters is allocated by the search engine in determining the probability. However, there is a set of qualitative features characterizing this relationship shared by any particular mathematical expression. For example, if the pagerank, the content update rate and the desired web crawl interval of a document are treated as fixed quantities, an increase in the user click rate will result in a higher probability of the document being seen, or searched, as stale from the search engine. Similarly, an increase in a document's content update rate, while holding fixed the other parameters, will increase the probability of stale content from the document being seen. An increase in the web crawl interval, while holding fixed the other parameters, will also increase the probability of stale content from the document being seen.
The impact of a document's pagerank on its probability of being seen stale is similar to that of the user click rate. A document's pagerank is often correlated with its user click rate, because the pagerank is indicative of the document's popularity or importance. The more popular a document is, the more visits it receives per unit of time.
In one embodiment, the Pstale score is used to order the crawl of documents. In this embodiment, documents are crawled in decreasing order of the probability that they will be seen in their stale state.
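The qualitative features above, and the ordering of the crawl by the stale-probability score, can be illustrated with a minimal Python sketch. The description leaves the form of ƒ open, so the expression in `p_stale` below is purely an assumption, chosen only because it does not decrease when any one of the four parameters increases. The names `Doc`, `p_stale` and `crawl_order`, the units (days) and the exponential saturation terms are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Doc:
    url: str
    pagerank: float             # importance score, assumed to lie in [0, 1]
    click_rate: float           # user clicks per day
    content_update_rate: float  # content updates per day
    crawl_interval: float       # desired web crawl interval, in days


def p_stale(doc: Doc) -> float:
    """Hypothetical f: non-decreasing in each of the four parameters."""
    # Expected number of content updates between two consecutive crawls.
    updates_per_crawl = doc.content_update_rate * doc.crawl_interval
    # Assumed chance that the stored copy is out of date.
    staleness = 1.0 - math.exp(-updates_per_crawl)
    # Assumed chance that the document is looked at; grows with the
    # click rate and, through the (1 + pagerank) factor, with pagerank.
    exposure = 1.0 - math.exp(-doc.click_rate * (1.0 + doc.pagerank))
    return staleness * exposure


def crawl_order(docs):
    """Return documents in decreasing order of their stale probability."""
    return sorted(docs, key=p_stale, reverse=True)


if __name__ == "__main__":
    docs = [
        Doc("http://example.com/archive", 0.5, 10.0, 0.1, 2.0),
        Doc("http://example.com/news", 0.5, 10.0, 1.0, 2.0),
    ]
    for d in crawl_order(docs):
        print(d.url, round(p_stale(d), 3))
```

Under these assumed numbers the frequently updated page is crawled first, because only its content update rate differs from that of the other page.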
As noted above, a document may be thought of as a collection of features which may be individually updated from time to time. As such, each feature may or may not be modified from the previous crawl. Each feature could have a feature change interval associated with it measured and stored as discussed above. The feature change intervals can be used to construct a document change interval where each feature is given a different weight depending on its desired importance, or other factors. For example, the document change interval could be determined by:
$$\text{document\_interval} = \sum_{i=0}^{n-1} \text{weight}_i \cdot \text{feature\_interval}_i$$
where n is the number of features. This change interval could then be used as described above in determining the desired web crawl interval.
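The weighted sum can be sketched in a few lines of Python. The function name `document_interval`, the example features and the choice of weights that sum to one (which makes the result a weighted average) are assumptions made for illustration; the description does not prescribe how the weights are chosen.

```python
def document_interval(feature_intervals, weights):
    """Weighted sum of per-feature change intervals (all in the same unit)."""
    if len(feature_intervals) != len(weights):
        raise ValueError("one weight is required per feature interval")
    return sum(w * t for w, t in zip(weights, feature_intervals))


if __name__ == "__main__":
    # Hypothetical features: title, body text, footer (intervals in days).
    intervals = [30.0, 7.0, 90.0]
    weights = [0.25, 0.5, 0.25]
    print(document_interval(intervals, weights))  # prints 33.5
```

Here the body text changes most often and carries the largest weight, so it pulls the document change interval well below the interval of the rarely changing footer.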
FIG. 4 illustrates an embodiment of a computer-based search engine system 400 that implements the web crawl interval update strategy discussed above. The system 400 includes one or more processing units (CPU's) 402, one or more network or other communications interfaces 410, memory 412, and one or more communication buses 414 for interconnecting these components. The system 400 may optionally include a user interface 404 comprising a display device 406 and a keyboard 408. Memory 412 may include high speed random access memory and may also include non-volatile memory, such as one or more magnetic disk storage devices. Memory 412 may include mass storage that is remotely located from the CPU's 402. The memory 412 preferably stores:
    • an operating system 416 that includes procedures for handling various basic system services and for performing hardware dependent tasks;
    • a network communication module (or instructions) 418 for connecting the computer system 400 to other computers via the one or more communication network interfaces 410 (wired or wireless), such as the Internet, other wide area networks, local area networks, metropolitan area networks, and so on;
    • a system initialization module (or instructions) 420 that initializes other modules and data structures stored in memory 412 required for the appropriate operation of the computer system 400;
    • a query processor 422 for receiving and processing search queries submitted from various client computers, and then organizing and transmitting search results back to the corresponding client computers;
    • a pageranker 424 for calculating a content-independent and structure-based pagerank of a document that is used for representing the document's relative popularity;
    • a content indexer 426 for generating a set of inverted content indexes for a document based on its current content;
    • a scheduler 428 for dispatching web crawlers in response to web crawling requests and determining a new web crawl interval for a crawled document;
    • one or more web crawlers 430 for retrieving documents from various hosting web servers;
    • a history log database 432 for storing previous web crawling results of each document; and
    • one or more multi-tier databases 434, each database managing a certain number of documents' web crawl requests.
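How a scheduler such as scheduler 428 might cooperate with tiered storage such as the multi-tier databases 434 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of system 400: the tier boundaries in `TIER_UPPER_BOUNDS`, the halving and doubling factors in `record_crawl`, and the names `Scheduler`, `tier_for`, `add`, `record_crawl` and `crawl_order` are all hypothetical. The sketch assigns each document to a tier by its web crawl interval, revises the interval after a crawl, moves the document when the revised interval belongs to a different tier, and orders each tier so that shorter intervals come first.

```python
import bisect
from collections import defaultdict

# Hypothetical upper bounds, in days, of each tier's crawl-interval range.
# Tier 0: up to 1 day, tier 1: up to 7 days, tier 2: up to 30 days,
# tier 3: anything longer.
TIER_UPPER_BOUNDS = [1.0, 7.0, 30.0]


def tier_for(interval_days):
    """Index of the tier whose interval range contains interval_days."""
    return bisect.bisect_left(TIER_UPPER_BOUNDS, interval_days)


class Scheduler:
    def __init__(self):
        self.intervals = {}            # url -> current web crawl interval (days)
        self.tiers = defaultdict(set)  # tier index -> urls assigned to that tier

    def add(self, url, initial_interval):
        """Register a document with its initial web crawl interval."""
        self.intervals[url] = initial_interval
        self.tiers[tier_for(initial_interval)].add(url)

    def record_crawl(self, url, content_changed):
        """Revise the interval after a crawl and move tiers if needed."""
        old = self.intervals[url]
        # Assumed revision rule: shorten the interval when the content
        # changed, lengthen it when the content did not change.
        new = old * 0.5 if content_changed else old * 2.0
        self.intervals[url] = new
        old_tier, new_tier = tier_for(old), tier_for(new)
        if new_tier != old_tier:
            self.tiers[old_tier].discard(url)
            self.tiers[new_tier].add(url)
        return new

    def crawl_order(self, tier):
        """Documents of one tier, shortest web crawl interval first."""
        return sorted(self.tiers[tier], key=self.intervals.__getitem__)
```

Keeping the tier membership in plain sets makes moving a document between tiers a constant-time operation; a production system would presumably persist this state rather than hold it in memory.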
The foregoing description, for purpose of explanation, has been described with reference to specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the invention to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The embodiments were chosen and described in order to best explain the principles of the invention and its practical applications, to thereby enable others skilled in the art to best utilize the invention and various embodiments with various modifications as are suited to the particular use contemplated.

Claims (48)

1. A method for scheduling documents to be crawled by a search engine in an appropriate order to reduce visibility of stale content in web searching, comprising:
on a computer system having one or more processors and memory storing one or more programs for execution by the one or more processors:
associating with each of a plurality of documents a respective initial web crawl interval, wherein the initial web crawl interval is determined based at least in part on a document's importance metric;
retrieving each of the plurality of documents from a respective host at one or more times associated with each document's respective initial web crawl interval;
associating a revised web crawl interval with a respective document of the plurality of documents based on the document's respective initial web crawl interval and any changes to content of the document;
partitioning the plurality of documents into multiple tiers according to their respective web crawl intervals, each tier having a distinct associated range of web crawl intervals, including storing data for each tier identifying documents in the plurality of documents assigned to that tier in accordance with the documents' respective web crawl intervals;
for each tier of the multiple tiers, determining a web crawl order for the documents assigned to the tier in accordance with their respective revised web crawl interval; and
moving a respective document between tiers of the multiple tiers when the respective revised web crawl interval of the respective document is associated with a different tier of the multiple tiers than a previous web crawl interval of the respective document.
2. The method of claim 1, including updating the web crawl interval of a document after retrieving a new copy of the document's content from its host and detecting content changes to the document based on the new copy.
3. The method of claim 1, further comprising:
determining for respective documents of the plurality of documents, content update rates of the respective documents, user click rates of the respective documents, and at least one document importance metric of the respective documents;
associating the revised web crawl interval with each document of the plurality of documents based on the document's respective initial web crawl interval, any changes to content of the document, the user click rate of the document, and the at least one document importance metric of the document; and
downloading and recording new copies of at least a subset of the documents in accordance with the determined web crawl order.
4. The method of claim 2, including updating the web crawl interval of the document to be less than the initial web crawl interval when the document's content has changed, and updating the web crawl interval to be more than the initial web crawl interval when the document's content has not changed.
5. The method of claim 2, further comprising,
identifying a first average interval between times where the document's content has not changed;
identifying a second average interval between times where the document's content has changed; and
updating the web crawl interval of the document based on the first average interval and the second average interval.
6. The method of claim 5, wherein the web crawl interval is not updated if the first average interval is greater than the second average interval.
7. The method of claim 5, wherein the web crawl interval is updated in accordance with the first average interval if the first average interval is greater than the second average interval and the document's content has not changed, and is updated in accordance with the second average interval if the document's content has changed.
8. The method of claim 5, wherein the web crawl interval is updated in accordance with an average between the first average interval and the second average interval if the first average interval is less than the second average interval.
9. The method of claim 2, wherein the computer system considers critical content changes to the document and ignores all non-critical content changes to the document when updating the document's web crawl interval.
10. The method of claim 1, wherein the initial web crawl interval of a document assigned to a respective tier is determined based at least in part on an average web crawl interval of all other documents assigned to the respective tier.
11. The method of claim 1, wherein the initial web crawl interval of the document is determined in accordance with a score corresponding to the document's popularity.
12. The method of claim 1, wherein the revised web crawl interval of a document is smaller than the document's content update interval and larger than half of the document's content update interval.
13. The method of claim 12, wherein the search engine dynamically adjusts the revised web crawl interval of the document after a respective web crawler retrieves a new copy of the document's content.
14. The method of claim 1, wherein the revised web crawl interval of a document is a time interval such that the search engine, on average, retrieves each unique version of document content at least twice according to the revised web crawl interval.
15. The method of claim 1, wherein the revised web crawl interval of a document is determined in accordance with the document's user click interval when the document's user click rate is larger than the document's content update rate.
16. A search engine system, for scheduling documents to be crawled by a search engine in an appropriate order to reduce visibility of stale content in web searching, comprising:
one or more central processing units for executing programs;
memory storing a web crawl order scheduler to be executed by the one or more central processing units;
the web crawl order scheduler comprising:
instructions for associating with each of a plurality of documents a respective initial web crawl interval, wherein the initial web crawl interval is determined based at least in part on a document's importance metric;
instructions for retrieving each of the plurality of documents from a respective host at one or more times associated with each document's respective initial web crawl interval;
instructions for associating a revised web crawl interval with a respective document of the plurality of documents based on the document's respective initial web crawl interval and any changes to content of the document;
instructions for partitioning the plurality of documents into multiple tiers according to their respective web crawl intervals, each tier having a distinct associated range of web crawl intervals, including instructions for storing data for each tier identifying documents in the plurality of documents assigned to that tier in accordance with their respective web crawl intervals;
instructions for determining, for each tier of the multiple tiers, a web crawl order for the documents assigned to the tier in accordance with their respective revised web crawl intervals; and
instructions for moving a respective document between tiers of the multiple tiers when the respective revised web crawl interval of the respective document is associated with a different tier of the multiple tiers than a previous web crawl interval of the respective document.
17. The system of claim 16, wherein the search engine system's web crawl order scheduler further comprises:
instructions for determining for respective documents of the plurality of documents, content update rates of the respective documents, user click rates of the respective documents, and at least one document importance metric of the respective documents;
instructions for associating the revised web crawl interval with each document of the plurality of documents based on the document's respective initial web crawl interval, any changes to content of the document, the user click rate of the document, and the at least one document importance metric of the document; and
instructions for downloading and recording new copies of at least a subset of the documents in accordance with the determined web crawl order.
18. The system of claim 16, wherein the changes to the content of a document comprise critical content changes and non-critical content changes, and the search engine system considers the critical content changes to the document and ignores the non-critical content changes to the document when associating a revised web crawl interval with the document.
19. The system of claim 16, wherein the revised web crawl interval of a document is smaller than the document's content update interval and larger than half of the document's content update interval.
20. The system of claim 19, wherein the search engine system dynamically adjusts the revised web crawl interval of the document after retrieving a new copy of the document's content from a respective host.
21. The system of claim 16, wherein the revised web crawl interval of a document is a time interval such that the search engine system retrieves each unique version of document content at least twice according to the revised web crawl interval.
22. The system of claim 16, wherein the revised web crawl interval of a document is determined in accordance with the document's user click interval when the document's user click interval is larger than the document's content update interval.
23. A non-transitory computer readable storage medium, for scheduling documents to be crawled by a search engine in an appropriate order to reduce visibility of stale content in web searching, storing one or more programs to be executed by a computer system, wherein the computer system includes a search engine program, the one or more programs comprising:
instructions for associating with each of a plurality of documents a respective initial web crawl interval, wherein the initial web crawl interval is determined based at least in part on a document's importance metric;
instructions for retrieving each of the plurality of documents from a respective host at one or more times associated with each document's respective initial web crawl interval;
instructions for associating a revised web crawl interval with a respective document of the plurality of documents based on the document's respective initial web crawl interval and any changes to content of the document;
instructions for partitioning the plurality of documents into multiple tiers according to their respective web crawl intervals, each tier having a distinct associated range of web crawl intervals, including instructions for storing data for each tier identifying documents in the plurality of documents assigned to that tier in accordance with their respective web crawl intervals;
instructions for determining, for each tier of the multiple tiers, a web crawl order for the documents assigned to the tier in accordance with their respective revised web crawl intervals; and
instructions for moving a respective document between tiers of the multiple tiers when the respective revised web crawl interval of the respective document is associated with a different tier of the multiple tiers than a previous web crawl interval of the respective document.
24. The computer readable storage medium of claim 23, wherein the one or more programs further comprise:
instructions for determining for respective documents of the plurality of documents, content update rates of the respective documents, user click rates of the respective documents, and at least one document importance metric of the respective documents;
instructions for associating the revised web crawl interval with each document of the plurality of documents based on the document's respective initial web crawl interval, any changes to content of the document, the user click rate of the document, and the at least one document importance metric of the document; and
instructions for downloading and recording new copies of at least a subset of the documents in accordance with the determined web crawl order.
25. The computer readable storage medium of claim 23, wherein the changes to the content of a document comprise critical content changes and non-critical content changes, and the search engine program considers the critical content changes to the document and ignores the non-critical content changes to the document when associating a revised web crawl interval with the document.
26. The computer readable storage medium of claim 23, wherein the revised web crawl interval of a document is smaller than the document's content update interval and larger than half of the document's content update interval.
27. The computer readable storage medium of claim 23, wherein the one or more programs dynamically adjust the revised web crawl interval of the document after retrieving a new copy of the document's content from a respective host.
28. The computer readable storage medium of claim 23, wherein the revised web crawl interval of a document is a time interval such that the search engine program retrieves each unique version of document content at least twice according to the revised web crawl interval.
29. The computer readable storage medium of claim 23, wherein the revised web crawl interval of a document is determined in accordance with the document's user click interval when the document's user click interval is larger than the document's content update interval.
30. The method of claim 1, wherein the changes to the content of a document comprise critical content changes and non-critical content changes, and the computer system considers the critical content changes to the document and ignores the non-critical content changes to the document when associating a revised web crawl interval with the document.
31. The system of claim 16, wherein the search engine system's web crawl order scheduler further comprises:
instructions for updating the web crawl interval of a document after retrieving a new copy of the document's content from its host and detecting content changes to the document based on the new copy.
32. The system of claim 31, including instructions for updating the web crawl interval of the document to be less than the initial web crawl interval when the document's content has changed, and updating the web crawl interval to be more than the initial web crawl interval when the document's content has not changed.
33. The system of claim 31, wherein the search engine system's web crawl order scheduler further comprises:
instructions for identifying a first average interval between times where the document's content has not changed;
instructions for identifying a second average interval between times where the document's content has changed; and
instructions for updating the web crawl interval of the document based on the first average interval and the second average interval.
34. The system of claim 31, wherein the web crawl interval is not updated if the first average interval is greater than the second average interval.
35. The system of claim 31, wherein the web crawl interval is updated in accordance with the first average interval if the first average interval is greater than the second average interval and the document's content has not changed, and is updated in accordance with the second average interval if the document's content has changed.
36. The system of claim 31, wherein the web crawl interval is updated in accordance with an average between the first average interval and the second average interval if the first average interval is less than the second average interval.
37. The system of claim 31, wherein the search engine system considers critical content changes to the document and ignores all non-critical content changes to the document when updating the document's web crawl interval.
38. The system of claim 16, wherein the initial web crawl interval of a document assigned to a respective tier is determined based at least in part on an average web crawl interval of all other documents assigned to the respective tier.
39. The system of claim 16, wherein the initial web crawl interval of the document is determined in accordance with a score corresponding to the document's popularity.
40. The computer readable storage medium of claim 23, wherein the one or more programs further comprise:
instructions for updating the web crawl interval of a document after retrieving a new copy of the document's content from its host and detecting content changes to the document based on the new copy.
41. The computer readable storage medium of claim 40, including instructions for updating the web crawl interval of the document to be less than the initial web crawl interval when the document's content has changed, and updating the web crawl interval to be more than the initial web crawl interval when the document's content has not changed.
42. The computer readable storage medium of claim 40, wherein the one or more programs further comprise:
instructions for identifying a first average interval between times where the document's content has not changed;
instructions for identifying a second average interval between times where the document's content has changed; and
instructions for updating the web crawl interval of the document based on the first average interval and the second average interval.
43. The computer readable storage medium of claim 40, wherein the web crawl interval is not updated if the first average interval is greater than the second average interval.
44. The computer readable storage medium of claim 40, wherein the web crawl interval is updated in accordance with the first average interval if the first average interval is greater than the second average interval and the document's content has not changed, and is updated in accordance with the second average interval if the document's content has changed.
45. The computer readable storage medium of claim 40, wherein the web crawl interval is updated in accordance with an average between the first average interval and the second average interval if the first average interval is less than the second average interval.
46. The computer readable storage medium of claim 40, wherein the one or more programs consider critical content changes to the document and ignore all non-critical content changes to the document when updating the document's web crawl interval.
47. The computer readable storage medium of claim 40, wherein the initial web crawl interval of a document assigned to a respective tier is determined based at least in part on an average web crawl interval of all other documents assigned to the respective tier.
48. The computer readable storage medium of claim 40, wherein the initial web crawl interval of the document is determined in accordance with a score corresponding to the document's popularity.

Priority Applications (3)

Application Number Priority Date Filing Date Title
US10/930,280 US7987172B1 (en) 2004-08-30 2004-08-30 Minimizing visibility of stale content in web searching including revising web crawl intervals of documents
US13/166,757 US8407204B2 (en) 2004-08-30 2011-06-22 Minimizing visibility of stale content in web searching including revising web crawl intervals of documents
US13/849,355 US8782032B2 (en) 2004-08-30 2013-03-22 Minimizing visibility of stale content in web searching including revising web crawl intervals of documents

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US10/930,280 US7987172B1 (en) 2004-08-30 2004-08-30 Minimizing visibility of stale content in web searching including revising web crawl intervals of documents

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US13/166,757 Continuation US8407204B2 (en) 2004-08-30 2011-06-22 Minimizing visibility of stale content in web searching including revising web crawl intervals of documents

Publications (1)

Publication Number Publication Date
US7987172B1 true US7987172B1 (en) 2011-07-26

Family

ID=44280187

Family Applications (3)

Application Number Title Priority Date Filing Date
US10/930,280 Active 2026-02-14 US7987172B1 (en) 2004-08-30 2004-08-30 Minimizing visibility of stale content in web searching including revising web crawl intervals of documents
US13/166,757 Active US8407204B2 (en) 2004-08-30 2011-06-22 Minimizing visibility of stale content in web searching including revising web crawl intervals of documents
US13/849,355 Expired - Fee Related US8782032B2 (en) 2004-08-30 2013-03-22 Minimizing visibility of stale content in web searching including revising web crawl intervals of documents

Country Status (1)

Country Link
US (3) US7987172B1 (en)

Cited By (35)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110016471A1 (en) * 2009-07-15 2011-01-20 Microsoft Corporation Balancing Resource Allocations Based on Priority
US20110173212A1 (en) * 2004-11-22 2011-07-14 Tuttle Timothy D Systems and methods for sorting search results
US20110258176A1 (en) * 2004-08-30 2011-10-20 Carver Anton P T Minimizing Visibility of Stale Content in Web Searching Including Revising Web Crawl Intervals of Documents
US20120023091A1 (en) * 2006-10-12 2012-01-26 Vanessa Fox System and Method for Enabling Website Owner to Manage Crawl Rate in a Website Indexing System
US8255385B1 (en) * 2011-03-22 2012-08-28 Microsoft Corporation Adaptive crawl rates based on publication frequency
US8386459B1 (en) * 2005-04-25 2013-02-26 Google Inc. Scheduling a recrawl
US8386460B1 (en) 2005-06-24 2013-02-26 Google Inc. Managing URLs
US20130185276A1 (en) * 2012-01-17 2013-07-18 Sackett Solutions & Innovations, LLC System for Search and Customized Information Updating of New Patents and Research, and Evaluation of New Research Projects' and Current Patents' Potential
US8533226B1 (en) 2006-08-04 2013-09-10 Google Inc. System and method for verifying and revoking ownership rights with respect to a website in a website indexing system
EP2680171A2 (en) * 2012-06-29 2014-01-01 Orange Intelligent index scheduling
US8666964B1 (en) 2005-04-25 2014-03-04 Google Inc. Managing items in crawl schedule
US8707312B1 (en) 2003-07-03 2014-04-22 Google Inc. Document reuse in a search engine crawler
US8775403B2 (en) 2003-07-03 2014-07-08 Google Inc. Scheduler for search engine crawler
US20140280204A1 (en) * 2013-03-14 2014-09-18 International Business Machines Corporation Document Provenance Scoring Based On Changes Between Document Versions
CN104063504A (en) * 2014-07-08 2014-09-24 百度在线网络技术(北京)有限公司 Method for determining comprehensive access weights of webpages and method for sorting access records
CN104123329A (en) * 2013-04-25 2014-10-29 北京千橡网景科技发展有限公司 Search method and device
US9002819B2 (en) 2005-05-31 2015-04-07 Google Inc. Web crawler scheduler that utilizes sitemaps from websites
WO2015091534A1 (en) * 2013-12-18 2015-06-25 Thomson Reuters Global Resources System and method for dynamically scheduling network scanning tasks
US20150195156A1 (en) * 2011-12-01 2015-07-09 Google Inc. Method and system for providing page visibility information
US9286334B2 (en) 2011-07-15 2016-03-15 International Business Machines Corporation Versioning of metadata, including presentation of provenance and lineage for versioned metadata
US9384193B2 (en) 2011-07-15 2016-07-05 International Business Machines Corporation Use and enforcement of provenance and lineage constraints
US9418065B2 (en) 2012-01-26 2016-08-16 International Business Machines Corporation Tracking changes related to a collection of documents
US9613009B2 (en) 2011-05-04 2017-04-04 Google Inc. Predicting user navigation events
US9769285B2 (en) 2011-06-14 2017-09-19 Google Inc. Access to network content
EP2742438B1 (en) * 2011-08-09 2017-12-13 Microsoft Technology Licensing, LLC Optimizing web crawling with user history
US9846842B2 (en) 2011-07-01 2017-12-19 Google Llc Predicting user navigation events
US9928223B1 (en) 2011-06-14 2018-03-27 Google Llc Methods for prerendering and methods for managing and configuring prerendering operations
US9946792B2 (en) 2012-05-15 2018-04-17 Google Llc Access to network content
CN108416046A (en) * 2018-03-15 2018-08-17 广州优视网络科技有限公司 Sequence reptile boundary detection method, device and server
US20180285327A1 (en) * 2012-07-13 2018-10-04 Ziga Mahkovec Determining cacheability of webpages
CN111444412A (en) * 2020-04-03 2020-07-24 北京明朝万达科技股份有限公司 Scheduling method and device for web crawler task
CN113821705A (en) * 2021-08-30 2021-12-21 湖南大学 Webpage content acquisition method, terminal equipment and readable storage medium
US20220276999A1 (en) * 2015-08-24 2022-09-01 Salesforce.Com, Inc. Generic scheduling
US11531680B2 (en) * 2014-10-03 2022-12-20 Palantir Technologies Inc. Data aggregation and analysis system
US11538202B2 (en) 2014-10-03 2022-12-27 Palantir Technologies Inc. Time-series analysis system

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8243596B2 (en) * 2007-06-21 2012-08-14 Intel Corporation Distributing intelligence across networks
JP5929356B2 (en) * 2012-03-15 2016-06-01 富士ゼロックス株式会社 Information processing apparatus and information processing program
US9311406B2 (en) 2013-06-05 2016-04-12 Microsoft Technology Licensing, Llc Discovering trending content of a domain
RU2634218C2 (en) 2014-07-24 2017-10-24 Общество С Ограниченной Ответственностью "Яндекс" Method for determining sequence of web browsing and server used
CN106294364B (en) * 2015-05-15 2020-04-10 阿里巴巴集团控股有限公司 Method and device for realizing web crawler to capture webpage
CN107133217A (en) * 2017-05-26 2017-09-05 北京惠商之星网络科技有限公司 Target topic intelligent grabbing method, system and computer-readable recording medium
US10575123B1 (en) * 2019-02-14 2020-02-25 Uber Technologies, Inc. Contextual notifications for a network-based service
US10701238B1 (en) * 2019-05-09 2020-06-30 Google Llc Context-adaptive scanning
US10593128B1 (en) 2019-08-20 2020-03-17 Capital One Services, Llc Using augmented reality markers for local positioning in a computing environment
US10614636B1 (en) * 2019-08-20 2020-04-07 Capital One Services, Llc Using three-dimensional augmented reality markers for local geo-positioning in a computing environment

Citations (65)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4312009A (en) 1979-02-16 1982-01-19 Smh-Adrex Device for projecting ink droplets onto a medium
US5521140A (en) 1993-10-22 1996-05-28 Sony Corporation Recording unit structure and recording device
US5594480A (en) 1992-10-14 1997-01-14 Sony Corporation Printing device and photographic paper
US5634062A (en) 1993-10-27 1997-05-27 Fuji Xerox Co., Ltd. System for managing hypertext node information and link information
US5801702A (en) 1995-03-09 1998-09-01 Terrabyte Technology System and method for adding network links in a displayed hierarchy
US5832494A (en) 1993-06-14 1998-11-03 Libertech, Inc. Method and apparatus for indexing, searching and displaying data
US5898836A (en) 1997-01-14 1999-04-27 Netmind Services, Inc. Change-detection tool indicating degree and location of change of internet documents by comparison of cyclic-redundancy-check(CRC) signatures
US6003060A (en) 1996-12-20 1999-12-14 International Business Machines Corporation Method and apparatus to share resources while processing multiple priority data flows
US6012087A (en) 1997-01-14 2000-01-04 Netmind Technologies, Inc. Unique-change detection of dynamic web pages using history tables of signatures
US6049804A (en) * 1995-11-01 2000-04-11 Filetek, Inc. Method and apparatus for segmenting a database
US6068363A (en) 1996-07-04 2000-05-30 Canon Kabushiki Kaisha Recording head and apparatus employing multiple temperature sensors to effect temperature control
US6189019B1 (en) 1996-08-14 2001-02-13 Microsoft Corporation Computer system and computer-implemented process for presenting document connectivity
US6243091B1 (en) 1997-11-21 2001-06-05 International Business Machines Corporation Global history view
WO2001050320A1 (en) 1999-12-30 2001-07-12 Auctionwatch.Com, Inc. Minimal impact crawler
US6263364B1 (en) 1999-11-02 2001-07-17 Alta Vista Company Web crawler system using plurality of parallel priority level queues having distinct associated download priority levels for prioritizing document downloading and maintaining document freshness
US6263350B1 (en) 1996-10-11 2001-07-17 Sun Microsystems, Inc. Method and system for leasing storage
US6269370B1 (en) * 1996-02-21 2001-07-31 Infoseek Corporation Web scan process
US6285999B1 (en) 1997-01-10 2001-09-04 The Board Of Trustees Of The Leland Stanford Junior University Method for node ranking in a linked database
WO2001086507A1 (en) 2000-05-08 2001-11-15 The Johns Hopkins University Relevant search rankings using high refresh-rate distributed crawling
US6321265B1 (en) 1999-11-02 2001-11-20 Altavista Company System and method for enforcing politeness while scheduling downloads in a web crawler
US6336123B2 (en) 1996-10-02 2002-01-01 Matsushita Electric Industrial Co., Ltd. Hierarchical based hyper-text document preparing and management apparatus
US20020010682A1 (en) 2000-07-20 2002-01-24 Johnson Rodney D. Information archival and retrieval system for internetworked computers
US6351755B1 (en) 1999-11-02 2002-02-26 Alta Vista Company System and method for associating an extensible set of data with documents downloaded by a web crawler
US6377984B1 (en) 1999-11-02 2002-04-23 Alta Vista Company Web crawler system using parallel queues for queing data sets having common address and concurrently downloading data associated with data set in each queue
US20020052928A1 (en) 2000-07-31 2002-05-02 Eliyon Technologies Corporation Computer method and apparatus for collecting people and organization information from Web sites
US20020065827A1 (en) 1995-05-31 2002-05-30 David Christie Method and apparatus for workgroup information replication
US6404446B1 (en) 1997-08-15 2002-06-11 International Business Machines Corporation Multi-node user interface component and method thereof for use in displaying visual indication of search results
US20020073188A1 (en) 2000-12-07 2002-06-13 Rawson Freeman Leigh Method and apparatus for partitioning system management information for a server farm among a plurality of leaseholds
US6418453B1 (en) 1999-11-03 2002-07-09 International Business Machines Corporation Network repository service for efficient web crawling
US6418433B1 (en) 1999-01-28 2002-07-09 International Business Machines Corporation System and method for focussed web crawling
US6424966B1 (en) 1998-06-30 2002-07-23 Microsoft Corporation Synchronizing crawler with notification source
US20020099602A1 (en) 2000-12-04 2002-07-25 Paul Moskowitz Method and system to provide web site schedules
US20020129062A1 (en) 2001-03-08 2002-09-12 Wood River Technologies, Inc. Apparatus and method for cataloging data
US20030061260A1 (en) 2001-09-25 2003-03-27 Timesys Corporation Resource reservation and priority management
US6547829B1 (en) 1999-06-30 2003-04-15 Microsoft Corporation Method and system for detecting duplicate documents in web crawls
US6594662B1 (en) 1998-07-01 2003-07-15 Netshadow, Inc. Method and system for gathering information resident on global computer networks
US20030158839A1 (en) 2001-05-04 2003-08-21 Yaroslav Faybishenko System and method for determining relevancy of query responses in a distributed network search mechanism
US6631369B1 (en) 1999-06-30 2003-10-07 Microsoft Corporation Method and system for incremental web crawling
US6638314B1 (en) 1998-06-26 2003-10-28 Microsoft Corporation Method of web crawling utilizing crawl numbers
US6701350B1 (en) 1999-09-08 2004-03-02 Nortel Networks Limited System and method for web page filtering
US20040044962A1 (en) 2001-05-08 2004-03-04 Green Jacob William Relevant search rankings using high refresh-rate distributed crawling
US20040064442A1 (en) * 2002-09-27 2004-04-01 Popovitch Steven Gregory Incremental search engine
US6751612B1 (en) 1999-11-29 2004-06-15 Xerox Corporation User query generate search results that rank set of servers where ranking is based on comparing content on each server with user query, frequency at which content on each server is altered using web crawler in a search engine
US20040128285A1 (en) * 2000-12-15 2004-07-01 Jacob Green Dynamic-content web crawling through traffic monitoring
US6763362B2 (en) 2001-11-30 2004-07-13 Micron Technology, Inc. Method and system for updating a search engine
US6772203B1 (en) 1999-05-14 2004-08-03 Pivia, Inc. Updating data objects for dynamic application caching
US20040225642A1 (en) 2003-05-09 2004-11-11 International Business Machines Corporation Method and apparatus for web crawler data collection
US20040225644A1 (en) 2003-05-09 2004-11-11 International Business Machines Corporation Method and apparatus for search engine World Wide Web crawling
US20050071766A1 (en) 2003-09-25 2005-03-31 Brill Eric D. Systems and methods for client-based web crawling
US20050086206A1 (en) 2003-10-15 2005-04-21 International Business Machines Corporation System, Method, and service for collaborative focused crawling of documents on a network
US6950874B2 (en) 2000-12-15 2005-09-27 International Business Machines Corporation Method and system for management of resource leases in an application framework system
US6952730B1 (en) 2000-06-30 2005-10-04 Hewlett-Packard Development Company, L.P. System and method for efficient filtering of data set addresses in a web crawler
US20060036605A1 (en) * 2004-04-14 2006-02-16 Microsoft Corporation System and method for storage power, thermal and acoustic management in server systems
US20060069663A1 (en) 2004-09-28 2006-03-30 Eytan Adar Ranking results for network search query
US7047491B2 (en) 2000-12-05 2006-05-16 Schubert Daniel M Electronic information management system for abstracting and reporting document information
US7080073B1 (en) 2000-08-18 2006-07-18 Firstrain, Inc. Method and apparatus for focused crawling
US7139747B1 (en) 2000-11-03 2006-11-21 Hewlett-Packard Development Company, L.P. System and method for distributed web crawling
US7171619B1 (en) 2001-07-05 2007-01-30 Sun Microsystems, Inc. Methods and apparatus for accessing document content
US7200592B2 (en) 2002-01-14 2007-04-03 International Business Machines Corporation System for synchronizing of user's affinity to knowledge
US7231606B2 (en) 2000-10-31 2007-06-12 Software Research, Inc. Method and system for testing websites
US7260543B1 (en) 2000-05-09 2007-08-21 Sun Microsystems, Inc. Automatic lease renewal with message gates in a distributed computing environment
US7308643B1 (en) 2003-07-03 2007-12-11 Google Inc. Anchor tag indexing in a web crawler system
US7310632B2 (en) 2004-02-12 2007-12-18 Microsoft Corporation Decision-theoretic web-crawling and predicting web-page change
US7346839B2 (en) 2003-09-30 2008-03-18 Google Inc. Information retrieval based on historical data
US7725452B1 (en) 2003-07-03 2010-05-25 Google Inc. Scheduler for search engine crawler

Family Cites Families (33)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6213652B1 (en) 1995-04-18 2001-04-10 Fuji Xerox Co., Ltd. Job scheduling system for print processing
US6836768B1 (en) * 1999-04-27 2004-12-28 Surfnotes Method and apparatus for improved information representation
US7343412B1 (en) 1999-06-24 2008-03-11 International Business Machines Corporation Method for maintaining and managing dynamic web pages stored in a system cache and referenced objects cached in other data stores
US6418452B1 (en) * 1999-11-03 2002-07-09 International Business Machines Corporation Network repository service directory for efficient web crawling
US6883135B1 (en) * 2000-01-28 2005-04-19 Microsoft Corporation Proxy server using a statistical model
WO2001081829A1 (en) 2000-04-27 2001-11-01 Brio Technology, Inc. Method and apparatus for processing jobs on an enterprise-wide computer system
US7236976B2 (en) * 2000-06-19 2007-06-26 Aramark Corporation System and method for scheduling events and associated products and services
GB2368670A (en) 2000-11-03 2002-05-08 Envisional Software Solutions Data acquisition system
US7043473B1 (en) 2000-11-22 2006-05-09 Widevine Technologies, Inc. Media tracking system and method
US6910071B2 (en) * 2001-04-02 2005-06-21 The Aerospace Corporation Surveillance monitoring and automated reporting method for detecting data changes
US6754651B2 (en) * 2001-04-17 2004-06-22 International Business Machines Corporation Mining of generalized disjunctive association rules
MXPA03011976A (en) 2001-06-22 2005-07-01 Nervana Inc System and method for knowledge retrieval, management, delivery and presentation.
US7089233B2 (en) 2001-09-06 2006-08-08 International Business Machines Corporation Method and system for searching for web content
US20030131005A1 (en) 2002-01-10 2003-07-10 International Business Machines Corporation Method and apparatus for automatic pruning of search engine indices
US7447777B1 (en) 2002-02-11 2008-11-04 Extreme Networks Switching system
US6993534B2 (en) 2002-05-08 2006-01-31 International Business Machines Corporation Data store for knowledge-based data mining system
US20040064432A1 (en) * 2002-09-27 2004-04-01 Oetringer Eugen H. Method and system for maintaining documents
US7213047B2 (en) * 2002-10-31 2007-05-01 Sun Microsystems, Inc. Peer trust evaluation using mobile agents in peer-to-peer networks
US8707312B1 (en) * 2003-07-03 2014-04-22 Google Inc. Document reuse in a search engine crawler
US8136025B1 (en) * 2003-07-03 2012-03-13 Google Inc. Assigning document identification tags
US7240064B2 (en) * 2003-11-10 2007-07-03 Overture Services, Inc. Search engine with hierarchically stored indices
US20090326604A1 (en) * 2003-11-26 2009-12-31 Wicab, Inc. Systems and methods for altering vestibular biology
US7483891B2 (en) 2004-01-09 2009-01-27 Yahoo, Inc. Content presentation and management system associating base content and relevant additional content
US20050210008A1 (en) * 2004-03-18 2005-09-22 Bao Tran Systems and methods for analyzing documents over a network
US20050216522A1 (en) * 2004-03-23 2005-09-29 Integrated Data Corporation Multi-tier document management system
US20070214133A1 (en) * 2004-06-23 2007-09-13 Edo Liberty Methods for filtering data and filling in missing data using nonlinear inference
US7565423B1 (en) * 2004-06-30 2009-07-21 Google Inc. System and method of accessing a document efficiently through multi-tier web caching
US7437364B1 (en) * 2004-06-30 2008-10-14 Google Inc. System and method of accessing a document efficiently through multi-tier web caching
US8224964B1 (en) * 2004-06-30 2012-07-17 Google Inc. System and method of accessing a document efficiently through multi-tier web caching
US7987172B1 (en) * 2004-08-30 2011-07-26 Google Inc. Minimizing visibility of stale content in web searching including revising web crawl intervals of documents
US7769742B1 (en) 2005-05-31 2010-08-03 Google Inc. Web crawler scheduler that utilizes sitemaps from websites
US7475069B2 (en) 2006-03-29 2009-01-06 International Business Machines Corporation System and method for prioritizing websites during a webcrawling process
US8180760B1 (en) * 2007-12-20 2012-05-15 Google Inc. Organization system for ad campaigns

Patent Citations (68)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4312009A (en) 1979-02-16 1982-01-19 Smh-Adrex Device for projecting ink droplets onto a medium
US5594480A (en) 1992-10-14 1997-01-14 Sony Corporation Printing device and photographic paper
US5832494A (en) 1993-06-14 1998-11-03 Libertech, Inc. Method and apparatus for indexing, searching and displaying data
US5521140A (en) 1993-10-22 1996-05-28 Sony Corporation Recording unit structure and recording device
US5634062A (en) 1993-10-27 1997-05-27 Fuji Xerox Co., Ltd. System for managing hypertext node information and link information
US5801702A (en) 1995-03-09 1998-09-01 Terrabyte Technology System and method for adding network links in a displayed hierarchy
US20020065827A1 (en) 1995-05-31 2002-05-30 David Christie Method and apparatus for workgroup information replication
US6049804A (en) * 1995-11-01 2000-04-11 Filetek, Inc. Method and apparatus for segmenting a database
US6269370B1 (en) * 1996-02-21 2001-07-31 Infoseek Corporation Web scan process
US6068363A (en) 1996-07-04 2000-05-30 Canon Kabushiki Kaisha Recording head and apparatus employing multiple temperature sensors to effect temperature control
US6189019B1 (en) 1996-08-14 2001-02-13 Microsoft Corporation Computer system and computer-implemented process for presenting document connectivity
US6336123B2 (en) 1996-10-02 2002-01-01 Matsushita Electric Industrial Co., Ltd. Hierarchical based hyper-text document preparing and management apparatus
US6263350B1 (en) 1996-10-11 2001-07-17 Sun Microsystems, Inc. Method and system for leasing storage
US6003060A (en) 1996-12-20 1999-12-14 International Business Machines Corporation Method and apparatus to share resources while processing multiple priority data flows
US6285999B1 (en) 1997-01-10 2001-09-04 The Board Of Trustees Of The Leland Stanford Junior University Method for node ranking in a linked database
US6219818B1 (en) 1997-01-14 2001-04-17 Netmind Technologies, Inc. Checksum-comparing change-detection tool indicating degree and location of change of internet documents
US6012087A (en) 1997-01-14 2000-01-04 Netmind Technologies, Inc. Unique-change detection of dynamic web pages using history tables of signatures
US5898836A (en) 1997-01-14 1999-04-27 Netmind Services, Inc. Change-detection tool indicating degree and location of change of internet documents by comparison of cyclic-redundancy-check(CRC) signatures
US6404446B1 (en) 1997-08-15 2002-06-11 International Business Machines Corporation Multi-node user interface component and method thereof for use in displaying visual indication of search results
US6243091B1 (en) 1997-11-21 2001-06-05 International Business Machines Corporation Global history view
US6638314B1 (en) 1998-06-26 2003-10-28 Microsoft Corporation Method of web crawling utilizing crawl numbers
US6424966B1 (en) 1998-06-30 2002-07-23 Microsoft Corporation Synchronizing crawler with notification source
US6594662B1 (en) 1998-07-01 2003-07-15 Netshadow, Inc. Method and system for gathering information resident on global computer networks
US6418433B1 (en) 1999-01-28 2002-07-09 International Business Machines Corporation System and method for focussed web crawling
US6772203B1 (en) 1999-05-14 2004-08-03 Pivia, Inc. Updating data objects for dynamic application caching
US6631369B1 (en) 1999-06-30 2003-10-07 Microsoft Corporation Method and system for incremental web crawling
US6547829B1 (en) 1999-06-30 2003-04-15 Microsoft Corporation Method and system for detecting duplicate documents in web crawls
US6701350B1 (en) 1999-09-08 2004-03-02 Nortel Networks Limited System and method for web page filtering
US6321265B1 (en) 1999-11-02 2001-11-20 Altavista Company System and method for enforcing politeness while scheduling downloads in a web crawler
US6377984B1 (en) 1999-11-02 2002-04-23 Alta Vista Company Web crawler system using parallel queues for queing data sets having common address and concurrently downloading data associated with data set in each queue
US6351755B1 (en) 1999-11-02 2002-02-26 Alta Vista Company System and method for associating an extensible set of data with documents downloaded by a web crawler
US6263364B1 (en) 1999-11-02 2001-07-17 Alta Vista Company Web crawler system using plurality of parallel priority level queues having distinct associated download priority levels for prioritizing document downloading and maintaining document freshness
US6418453B1 (en) 1999-11-03 2002-07-09 International Business Machines Corporation Network repository service for efficient web crawling
US6751612B1 (en) 1999-11-29 2004-06-15 Xerox Corporation User query generate search results that rank set of servers where ranking is based on comparing content on each server with user query, frequency at which content on each server is altered using web crawler in a search engine
WO2001050320A1 (en) 1999-12-30 2001-07-12 Auctionwatch.Com, Inc. Minimal impact crawler
WO2001086507A1 (en) 2000-05-08 2001-11-15 The Johns Hopkins University Relevant search rankings using high refresh-rate distributed crawling
US7260543B1 (en) 2000-05-09 2007-08-21 Sun Microsystems, Inc. Automatic lease renewal with message gates in a distributed computing environment
US6952730B1 (en) 2000-06-30 2005-10-04 Hewlett-Packard Development Company, L.P. System and method for efficient filtering of data set addresses in a web crawler
US20020010682A1 (en) 2000-07-20 2002-01-24 Johnson Rodney D. Information archival and retrieval system for internetworked computers
US20020052928A1 (en) 2000-07-31 2002-05-02 Eliyon Technologies Corporation Computer method and apparatus for collecting people and organization information from Web sites
US20060277175A1 (en) 2000-08-18 2006-12-07 Dongming Jiang Method and Apparatus for Focused Crawling
US7080073B1 (en) 2000-08-18 2006-07-18 Firstrain, Inc. Method and apparatus for focused crawling
US7231606B2 (en) 2000-10-31 2007-06-12 Software Research, Inc. Method and system for testing websites
US7139747B1 (en) 2000-11-03 2006-11-21 Hewlett-Packard Development Company, L.P. System and method for distributed web crawling
US20020099602A1 (en) 2000-12-04 2002-07-25 Paul Moskowitz Method and system to provide web site schedules
US7047491B2 (en) 2000-12-05 2006-05-16 Schubert Daniel M Electronic information management system for abstracting and reporting document information
US20020073188A1 (en) 2000-12-07 2002-06-13 Rawson Freeman Leigh Method and apparatus for partitioning system management information for a server farm among a plurality of leaseholds
US6950874B2 (en) 2000-12-15 2005-09-27 International Business Machines Corporation Method and system for management of resource leases in an application framework system
US20040128285A1 (en) * 2000-12-15 2004-07-01 Jacob Green Dynamic-content web crawling through traffic monitoring
US20020129062A1 (en) 2001-03-08 2002-09-12 Wood River Technologies, Inc. Apparatus and method for cataloging data
US20030158839A1 (en) 2001-05-04 2003-08-21 Yaroslav Faybishenko System and method for determining relevancy of query responses in a distributed network search mechanism
US7299219B2 (en) 2001-05-08 2007-11-20 The Johns Hopkins University High refresh-rate retrieval of freshly published content using distributed crawling
US20040044962A1 (en) 2001-05-08 2004-03-04 Green Jacob William Relevant search rankings using high refresh-rate distributed crawling
US7171619B1 (en) 2001-07-05 2007-01-30 Sun Microsystems, Inc. Methods and apparatus for accessing document content
US20030061260A1 (en) 2001-09-25 2003-03-27 Timesys Corporation Resource reservation and priority management
US6763362B2 (en) 2001-11-30 2004-07-13 Micron Technology, Inc. Method and system for updating a search engine
US7200592B2 (en) 2002-01-14 2007-04-03 International Business Machines Corporation System for synchronizing of user's affinity to knowledge
US20040064442A1 (en) * 2002-09-27 2004-04-01 Popovitch Steven Gregory Incremental search engine
US20040225642A1 (en) 2003-05-09 2004-11-11 International Business Machines Corporation Method and apparatus for web crawler data collection
US20040225644A1 (en) 2003-05-09 2004-11-11 International Business Machines Corporation Method and apparatus for search engine World Wide Web crawling
US7308643B1 (en) 2003-07-03 2007-12-11 Google Inc. Anchor tag indexing in a web crawler system
US7725452B1 (en) 2003-07-03 2010-05-25 Google Inc. Scheduler for search engine crawler
US20050071766A1 (en) 2003-09-25 2005-03-31 Brill Eric D. Systems and methods for client-based web crawling
US7346839B2 (en) 2003-09-30 2008-03-18 Google Inc. Information retrieval based on historical data
US20050086206A1 (en) 2003-10-15 2005-04-21 International Business Machines Corporation System, Method, and service for collaborative focused crawling of documents on a network
US7310632B2 (en) 2004-02-12 2007-12-18 Microsoft Corporation Decision-theoretic web-crawling and predicting web-page change
US20060036605A1 (en) * 2004-04-14 2006-02-16 Microsoft Corporation System and method for storage power, thermal and acoustic management in server systems
US20060069663A1 (en) 2004-09-28 2006-03-30 Eytan Adar Ranking results for network search query

Non-Patent Citations (33)

* Cited by examiner, † Cited by third party
Title
Ali, What's Changed? Measuring Document Change in Web Crawling for Search Engines, SPIRE 2003, LNCS 2857, 2003, pp. 28-42, Springer-Verlag, Berlin, Germany.
Arasu, Searching the Web, ACM Transactions on Internet Technology, vol. 1, No. 1, Aug. 2001, pp. 2-43.
Baeza-Yates, Balancing Volume, Quality and Freshness in Web Crawling, Center for Web Research, Dept. of Computer Science, University of Chile, 2002, pp. 1-10.
Brandman, O., et al., "Crawler-Friendly Web Servers," ACM Sigmetrics Performance Evaluation Review, vol. 28, Issue 2, Sep. 2000, pp. 9-14.
Brin, S., et al., "The Anatomy of a Large-Scale Hypertextual Search Engine," Proceedings of the 7th Int'l World Wide Web Conference, Brisbane, Australia, 1998.
Brusilovsky, Map-Based Horizontal Navigation in Education Hypertext, ACM Press, Jun. 2002, pp. 1-10.
Bullot, A Data-Mining Approach for Optimizing Performance of an Incremental Crawler, WI '03, Oct. 13-17, 2003, pp. 610-615.
Cho et al. "Effective Page Refresh Policies for Web Crawlers." ACM Transactions on Database Systems, vol. 28, No. 4, Dec. 2003, pp. 390-426. *
Cho et al. "Estimating Frequency of Change." ACM Transactions on Internet Technology, 3(3): Aug. 2003. 32 pages. *
Cho, J., "Crawling the Web: Discovery and Maintenance of Large-Scale Web Data," PhD Thesis, Dept. of Computer Science, Stanford University, 2001, 188 pages.
Cho, J., et al., "Efficient Crawling Through URL Ordering," Computer Networks and ISDN Systems, vol. 30, Issues 1-7, Apr. 1998, pp. 161-172.
Cho, J., et al., "Synchronizing a Database to Improve Freshness," SIGMOD 2000, Dallas, Texas, Jun. 2000, pp. 117-128.
Cho, J., et al., "The Evolution of the Web and Implications for an Incremental Crawler," Proc. of the 26th VLDB Conf., Cairo, Egypt, 2000, pp. 200-209.
Coffman, Jr., E.G., et al., "Optimal Robot Scheduling," Tech. Rep. RR-3317, 1997, 19 pages.
Douglis, Rate of Change and Other Metrics: a Live Study of the World Wide Web, USENIX Symposium on Internetworking Technologies and Systems, Monterey, CA, Dec. 1997, pp. I and 1-14.
Douglis, The AT&T Internet Difference Engine: Tracking and Viewing Changes on the Web, World Wide Web, vol. 1, No. 1, Mar. 1998, pp. 27-44.
Fetterly, A Large-Scale Study of the Evolution of Web Pages, WWW 2003, Budapest, Hungary, May 20-24, 2003, pp. 669-678.
Final Office Action issued in U.S. Appl. No. 10/853,627, on May 12, 2008.
Haveliwala, Topic-Sensitive PageRank, WWW2002, Honolulu, HI, May 7-11, 2002, 10 pages.
Henzinger, Web Information Retrieval-an Algorithmic Perspective, ESA 2000, LNCS 1879, 2000, pp. 1-8, Springer-Verlag, Berlin, Germany.
Heydon, Mercator: A Scalable, Extensible Web Crawler, World Wide Web, vol. 2, No. 4, Dec. 1999, pp. 219-229.
Hirai, WebBase: a Repository of Web Pages, Computer Networks, vol. 33, Jun. 2000, pp. 277-293.
Introna, L., et al., "Defining the Web: the Politics of Search Engines," Computer, vol. 33, Issue 1, Jan. 2000, pp. 54-62.
Jeh, Scaling Personalized Web Search, WWW2003, Budapest, Hungary, May 20-24, 2003, pp. 271-279.
Kamvar, Exploiting the Block Structure of the Web for Computing PageRank, Stanford University Technical Report, 2003, 13 pages.
Klemm, R.P., "WebCompanion: A Friendly Client-Side Web Prefetching Agent," IEEE Transactions on Knowledge and Data Engineering, vol. 11, No. 4, Jul./Aug. 1999, pp. 577-594.
Lee, J.K.W., et al., "Intelligent Agents for Matching Information Providers and Consumers on the World-Wide Web," Proc. of the 13th Annual Hawaii Int'l Conf. on System Sciences, 1997, 11 pages.
Najork, Breadth-First Search Crawling Yields High-Quality Pages, WWW10, May 1-5, 2001, pp. 114-118.
Office Action from related U.S. Appl. No. 11/394,619, dated Sep. 23, 2010, 23 pages.
Pandey, S., et al., "Monitoring the Dynamic Web to Respond to Continuous Queries," WWW 2003, Budapest, Hungary, May 20-24, 2003, pp. 659-668.
Shkapenyuk, Design and Implementation of a High-Performance Distributed Web Crawler, ICDE '02, San Jose, CA, Feb. 26-Mar. 1, 2002, pp. 357-368.
Suel, Odissea: A Peer-to-Peer Architecture for Scalable Web Search and Information Retrieval, WebDB, San Diego, CA, Jun. 12-13, 2003, pp. 1-6.
Wolf, Optimal Crawling Strategies for Web Search Engines, WWW 2002, Honolulu, Hawaii, May 7-11, 2002, pp. 136-147.

Cited By (59)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8707312B1 (en) 2003-07-03 2014-04-22 Google Inc. Document reuse in a search engine crawler
US9679056B2 (en) 2003-07-03 2017-06-13 Google Inc. Document reuse in a search engine crawler
US10216847B2 (en) 2003-07-03 2019-02-26 Google Llc Document reuse in a search engine crawler
US10621241B2 (en) 2003-07-03 2020-04-14 Google Llc Scheduler for search engine crawler
US8775403B2 (en) 2003-07-03 2014-07-08 Google Inc. Scheduler for search engine crawler
US8707313B1 (en) 2003-07-03 2014-04-22 Google Inc. Scheduler for search engine crawler
US8407204B2 (en) * 2004-08-30 2013-03-26 Google Inc. Minimizing visibility of stale content in web searching including revising web crawl intervals of documents
US20110258176A1 (en) * 2004-08-30 2011-10-20 Carver Anton P T Minimizing Visibility of Stale Content in Web Searching Including Revising Web Crawl Intervals of Documents
US20130226897A1 (en) * 2004-08-30 2013-08-29 Anton P.T. Carver Minimizing Visibility of Stale Content in Web Searching Including Revising Web Crawl Intervals of Documents
US8782032B2 (en) * 2004-08-30 2014-07-15 Google Inc. Minimizing visibility of stale content in web searching including revising web crawl intervals of documents
US8788488B2 (en) 2004-11-22 2014-07-22 Facebook, Inc. Ranking search results based on recency
US8463778B2 (en) * 2004-11-22 2013-06-11 Facebook, Inc. Systems and methods for sorting search results
US20110173212A1 (en) * 2004-11-22 2011-07-14 Tuttle Timothy D Systems and methods for sorting search results
US8386459B1 (en) * 2005-04-25 2013-02-26 Google Inc. Scheduling a recrawl
US8666964B1 (en) 2005-04-25 2014-03-04 Google Inc. Managing items in crawl schedule
US9002819B2 (en) 2005-05-31 2015-04-07 Google Inc. Web crawler scheduler that utilizes sitemaps from websites
US8386460B1 (en) 2005-06-24 2013-02-26 Google Inc. Managing URLs
US8533226B1 (en) 2006-08-04 2013-09-10 Google Inc. System and method for verifying and revoking ownership rights with respect to a website in a website indexing system
US20120023091A1 (en) * 2006-10-12 2012-01-26 Vanessa Fox System and Method for Enabling Website Owner to Manage Crawl Rate in a Website Indexing System
US8458163B2 (en) * 2006-10-12 2013-06-04 Google Inc. System and method for enabling website owner to manage crawl rate in a website indexing system
US20110016471A1 (en) * 2009-07-15 2011-01-20 Microsoft Corporation Balancing Resource Allocations Based on Priority
US8255385B1 (en) * 2011-03-22 2012-08-28 Microsoft Corporation Adaptive crawl rates based on publication frequency
US10896285B2 (en) 2011-05-04 2021-01-19 Google Llc Predicting user navigation events
US9613009B2 (en) 2011-05-04 2017-04-04 Google Inc. Predicting user navigation events
US11019179B2 (en) 2011-06-14 2021-05-25 Google Llc Access to network content
US9928223B1 (en) 2011-06-14 2018-03-27 Google Llc Methods for prerendering and methods for managing and configuring prerendering operations
US9769285B2 (en) 2011-06-14 2017-09-19 Google Inc. Access to network content
US11032388B2 (en) 2011-06-14 2021-06-08 Google Llc Methods for prerendering and methods for managing and configuring prerendering operations
US10332009B2 (en) 2011-07-01 2019-06-25 Google Llc Predicting user navigation events
US9846842B2 (en) 2011-07-01 2017-12-19 Google Llc Predicting user navigation events
US9384193B2 (en) 2011-07-15 2016-07-05 International Business Machines Corporation Use and enforcement of provenance and lineage constraints
US9286334B2 (en) 2011-07-15 2016-03-15 International Business Machines Corporation Versioning of metadata, including presentation of provenance and lineage for versioned metadata
EP2742438B1 (en) * 2011-08-09 2017-12-13 Microsoft Technology Licensing, LLC Optimizing web crawling with user history
US9584579B2 (en) * 2011-12-01 2017-02-28 Google Inc. Method and system for providing page visibility information
US20150195156A1 (en) * 2011-12-01 2015-07-09 Google Inc. Method and system for providing page visibility information
US20130185276A1 (en) * 2012-01-17 2013-07-18 Sackett Solutions & Innovations, LLC System for Search and Customized Information Updating of New Patents and Research, and Evaluation of New Research Projects' and Current Patents' Potential
US9418065B2 (en) 2012-01-26 2016-08-16 International Business Machines Corporation Tracking changes related to a collection of documents
US9946792B2 (en) 2012-05-15 2018-04-17 Google Llc Access to network content
US10754900B2 (en) 2012-05-15 2020-08-25 Google Llc Access to network content
EP2680171A2 (en) * 2012-06-29 2014-01-01 Orange Intelligent index scheduling
US20180285327A1 (en) * 2012-07-13 2018-10-04 Ziga Mahkovec Determining cacheability of webpages
US11429651B2 (en) * 2013-03-14 2022-08-30 International Business Machines Corporation Document provenance scoring based on changes between document versions
US20140379657A1 (en) * 2013-03-14 2014-12-25 International Business Machines Corporation Document Provenance Scoring Based On Changes Between Document Versions
US20140280204A1 (en) * 2013-03-14 2014-09-18 International Business Machines Corporation Document Provenance Scoring Based On Changes Between Document Versions
CN104123329B (en) * 2013-04-25 2019-06-07 北京千橡网景科技发展有限公司 Searching method and device
CN104123329A (en) * 2013-04-25 2014-10-29 北京千橡网景科技发展有限公司 Search method and device
WO2015091534A1 (en) * 2013-12-18 2015-06-25 Thomson Reuters Global Resources System and method for dynamically scheduling network scanning tasks
CN104063504A (en) * 2014-07-08 2014-09-24 百度在线网络技术(北京)有限公司 Method for determining comprehensive access weights of webpages and method for sorting access records
US11531680B2 (en) * 2014-10-03 2022-12-20 Palantir Technologies Inc. Data aggregation and analysis system
US11538202B2 (en) 2014-10-03 2022-12-27 Palantir Technologies Inc. Time-series analysis system
US20220276999A1 (en) * 2015-08-24 2022-09-01 Salesforce.Com, Inc. Generic scheduling
US11669522B2 (en) * 2015-08-24 2023-06-06 Salesforce, Inc. Generic scheduling
US11734266B2 (en) 2015-08-24 2023-08-22 Salesforce, Inc. Generic scheduling
CN108416046B (en) * 2018-03-15 2020-05-26 阿里巴巴(中国)有限公司 Sequence crawler boundary detection method and device and server
CN108416046A (en) * 2018-03-15 2018-08-17 广州优视网络科技有限公司 Sequence crawler boundary detection method, device and server
CN111444412A (en) * 2020-04-03 2020-07-24 北京明朝万达科技股份有限公司 Scheduling method and device for web crawler task
CN111444412B (en) * 2020-04-03 2023-06-16 北京明朝万达科技股份有限公司 Method and device for scheduling web crawler tasks
CN113821705A (en) * 2021-08-30 2021-12-21 湖南大学 Webpage content acquisition method, terminal equipment and readable storage medium
CN113821705B (en) * 2021-08-30 2024-02-20 湖南大学 Webpage content acquisition method, terminal equipment and readable storage medium

Also Published As

Publication number Publication date
US8782032B2 (en) 2014-07-15
US20130226897A1 (en) 2013-08-29
US8407204B2 (en) 2013-03-26
US20110258176A1 (en) 2011-10-20

Similar Documents

Publication Publication Date Title
US7987172B1 (en) Minimizing visibility of stale content in web searching including revising web crawl intervals of documents
US8745183B2 (en) System and method for adaptively refreshing a web page
US8515952B2 (en) Systems and methods for determining document freshness
US8352452B2 (en) Methods and apparatus for employing usage statistics in document retrieval
US8832085B2 (en) Method and system for updating a search engine
US7260573B1 (en) Personalizing anchor text scores in a search engine
CA2492348C (en) Decision-theoretic web-crawling and predicting web-page change
US6640218B1 (en) Estimating the usefulness of an item in a collection of information
RU2419860C2 (en) Relative search results based on user interaction
US6792419B1 (en) System and method for ranking hyperlinked documents based on a stochastic backoff processes
AU2010343183B2 (en) Search suggestion clustering and presentation
US6873982B1 (en) Ordering of database search results based on user feedback
AU2004250658B2 (en) System and method for providing preferred country biasing of search results
US20010044791A1 (en) Automated adaptive classification system for bayesian knowledge networks
US7447684B2 (en) Determining searchable criteria of network resources based on a commonality of content
US9081861B2 (en) Uniform resource locator canonicalization
US8645367B1 (en) Predicting data for document attributes based on aggregated data for repeated URL patterns
US20080104502A1 (en) System and method for providing a change profile of a web page
US9569504B1 (en) Deriving and using document and site quality signals from search query streams
US8849775B2 (en) Caching web documents in two or more caches
US20060173556A1 (en) Methods and apparatus for using user gender and/or age group to improve the organization of documents retrieved in response to a search query
US20090132529A1 (en) Method and System for URL Autocompletion Using Ranked Results
US20030018621A1 (en) Distributed information search in a networked environment
US10078702B1 (en) Personalizing aggregated news content
US9734211B1 (en) Personalizing search results

Legal Events

Date Code Title Description
AS Assignment
Owner name: GOOGLE INC., CALIFORNIA
Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:CARVER, ANTON P.T.;REEL/FRAME:015316/0100
Effective date: 20040930

STCF Information on status: patent grant
Free format text: PATENTED CASE

FPAY Fee payment
Year of fee payment: 4

AS Assignment
Owner name: GOOGLE LLC, CALIFORNIA
Free format text: CHANGE OF NAME;ASSIGNOR:GOOGLE INC.;REEL/FRAME:044101/0405
Effective date: 20170929

MAFP Maintenance fee payment
Free format text: PAYMENT OF MAINTENANCE FEE, 8TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1552); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY
Year of fee payment: 8

MAFP Maintenance fee payment
Free format text: PAYMENT OF MAINTENANCE FEE, 12TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1553); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY
Year of fee payment: 12