BINRANK: SCALING DYNAMIC AUTHORITY-BASED SEARCH USING MATERIALIZED SUBGRAPHS

Conceptually, these algorithms require a query-time PageRank-style iterative computation over the full graph. This computation is too expensive for large graphs and not feasible at query time. Alternatively, building an index of precomputed results for some or all keywords involves very expensive preprocessing. We introduce BinRank, a system that approximates ObjectRank results by utilizing a hybrid approach inspired by materialized views in traditional query processing. We materialize a number of relatively small subsets of the data graph in such a way that any keyword query can be answered by running ObjectRank on only one of the subgraphs. BinRank generates the subgraphs by partitioning all the terms in the corpus based on their co-occurrence, executing ObjectRank for each partition using the terms to generate a set of random walk starting points, and keeping only those objects that receive non-negligible scores.


An ARSG may be constructed for term t by executing ObjectRank with some set of objects B as the baseset and restricting the graph to include only nodes with non-negligible ObjectRank scores, i.e., scores above the convergence threshold. The main challenge of this approach is identifying a baseset B that will provide a good RSG approximation for term t. Embodiments of the invention focus on sets B that are supersets of the baseset of t. This relationship gives us the following important result.

According to this theorem, for a given term t, if the term baseset BS(t) is a subset of B, all the important nodes relevant to t are always subsumed within MSG(B). That is, all the non-negligible end points of random walks originating from starting nodes containing t are present in the sub-graph generated using B. However, it may be observed that even though two nodes v1 and v2 are guaranteed to be found both in G and in MSG(B), the ordering of their ObjectRank scores might not be preserved on MSG(B), as we do not include intermediate nodes whose ObjectRank scores are below the convergence threshold.
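To make the construction concrete, the following sketch shows one way an MSG could be materialized (a simplified illustration, not the inventors' implementation: it uses uniform edge weights instead of ObjectRank's per-edge-type authority transfer rates, assumes every node appears as a key of the adjacency dictionary, and the function names are placeholders). ObjectRank is run as a PageRank-style power iteration whose random surfer restarts only at the baseset B, and the sub-graph keeps exactly the nodes whose final score exceeds the convergence threshold.

```python
import numpy as np

def object_rank(adj, base_set, d=0.85, threshold=1e-4, max_iters=100):
    """PageRank-style iteration whose random surfer restarts only at base_set.

    adj: dict mapping each node to a list of its out-neighbors (every node,
    including pure sinks, must appear as a key). Returns node -> score.
    """
    nodes = list(adj)
    idx = {v: i for i, v in enumerate(nodes)}
    n = len(nodes)
    restart = np.zeros(n)
    for v in base_set:
        restart[idx[v]] = 1.0 / len(base_set)   # walks start only from the baseset
    score = restart.copy()
    for _ in range(max_iters):
        new = (1.0 - d) * restart
        for v, outs in adj.items():
            if outs:                              # sinks simply drop their mass here
                share = d * score[idx[v]] / len(outs)
                for w in outs:
                    new[idx[w]] += share
        done = np.abs(new - score).max() < threshold
        score = new
        if done:
            break
    return {v: score[idx[v]] for v in nodes}

def materialize_msg(adj, base_set, threshold=1e-4):
    """Keep only nodes with non-negligible scores, and the edges between them."""
    scores = object_rank(adj, base_set, threshold=threshold)
    keep = {v for v, s in scores.items() if s > threshold}
    return {v: [w for w in adj[v] if w in keep] for v in keep}
```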

However, it is unlikely that many walks terminating on relevant nodes will pass through irrelevant nodes. Experimental evaluations performed by the inventors support this intuition. The quality of search results should improve if objects in B are semantically related to t. In fact, the inventors have discovered that terms with strong semantic connections can generate good RSGs for each other. Consider, for example, the terms "XML" and "schema": there is a strong semantic connection between these terms, since XML is a data format famous for its flexible schema.

Papers about XML tend to cite papers that talk about schemas and vice versa. It can be hard to automatically identify terms with such strong semantic connections for every query term.

A baseset B is created for every bin by taking the union of the posting lists of the terms in the bin, and MSG(B) is constructed for every bin.

The mapping of terms to bins is remembered, so at query time the corresponding bin for each term can be uniquely identified and the query executed on the MSG of this bin. Empirical results support this approach. The most frequent among them appeared in 8 documents.

As previously discussed, a set of MSGs is constructed for terms of a dictionary or a workload by partitioning the terms into a set of term bins based on their co-occurrence. An MSG is generated for every bin based on the intuition that a sub-graph that contains all objects and links relevant to a set of related terms should have all the information needed to rank objects with respect to one of these terms.

There are two main goals in constructing term bins. The first goal is controlling the size of each bin to ensure that the resulting sub-graph is small enough for ObjectRank to execute in a reasonable amount of time.

The second goal is minimizing the number of bins to save pre-processing time. We know that pre-computing ObjectRank for all terms in our corpus is not feasible. To achieve the first goal, a maxBinSize parameter is introduced that limits the size of the union of the posting lists of the terms in a bin, called the bin size.

As discussed above, ObjectRank uses a convergence threshold that is inversely proportional to the size of the baseset, i.e., the larger the baseset, the smaller the per-node threshold. Thus, there is a strong correlation between the bin size and the size of the materialized sub-graph, and the value of maxBinSize should be determined by the quality and performance requirements of the system.
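Written as a formula (the proportionality constant ε is an assumption; the source states only the inverse relationship), the threshold used for a bin with baseset B would be

threshold(B) = ε / |B|,

so a larger bin baseset lowers the per-node cutoff, more nodes receive non-negligible scores, and the materialized sub-graph grows accordingly.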

The problem of minimizing the number of bins is NP-hard. In fact, if all posting lists are disjoint, the problem becomes the classical NP-hard bin packing problem.

Embodiments of the invention apply a greedy algorithm that picks an unassigned term with the largest posting list to start a bin and then repeatedly adds the candidate term whose posting list overlaps the most with the documents already in the bin. We use a number of heuristics to minimize the required number of set intersections, which dominate the complexity of the algorithm. The tight upper bound on the number of set intersections that the algorithm needs to perform is the number of pairs of terms that co-occur in at least one document.
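A minimal sketch of this greedy loop is shown below (illustrative only: packTermsIntoBins is the name used in the disclosure, but this version works on exact document-ID sets, ignores the intersection-minimizing heuristics and the KMV estimates discussed next, and closes a bin as soon as no overlapping candidate still fits under the size limit; parameter names are assumptions).

```python
def pack_terms_into_bins(posting_lists, max_bin_size):
    """Greedy bin construction.

    posting_lists: dict term -> set of document IDs.
    Returns a list of (terms_in_bin, bin_documents) pairs.
    """
    unassigned = set(posting_lists)
    bins = []
    while unassigned:
        # Start a new bin with the unassigned term that has the largest posting list.
        seed = max(unassigned, key=lambda t: len(posting_lists[t]))
        bin_terms = {seed}
        bin_docs = set(posting_lists[seed])
        unassigned.remove(seed)
        while True:
            # Candidates must overlap the bin and must not grow it past max_bin_size.
            best, best_overlap = None, 0
            for t in unassigned:
                pl = posting_lists[t]
                overlap = len(pl & bin_docs)
                if overlap > best_overlap and len(bin_docs | pl) <= max_bin_size:
                    best, best_overlap = t, overlap
            if best is None:
                break
            bin_terms.add(best)
            bin_docs |= posting_lists[best]
            unassigned.remove(best)
        bins.append((bin_terms, bin_docs))
    return bins
```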

To speed up set intersections for larger posting lists, KMV (K Minimum Values) synopses may be used to estimate intersection sizes. This process works on term posting lists from a text index. As the process fills up a bin, it maintains a list of document IDs that are already in the bin and a list of candidate terms that are known to overlap with the bin, i.e., terms whose posting lists intersect the bin's document list. The bin computation implements a greedy algorithm that picks the candidate term whose posting list overlaps the most with the documents already in the bin, as long as the size of the posting list union does not exceed the maximum bin size.
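A KMV synopsis can be sketched as follows (an assumption-laden illustration following the standard KMV estimator rather than the inventors' exact procedure; the hash function and the synopsis size k are arbitrary choices): each posting list is summarized by its k smallest hash values, and the size of an intersection is estimated from the combined synopses without scanning the full lists.

```python
import hashlib

def kmv_synopsis(doc_ids, k=256):
    """Summarize a posting list by its k smallest hash values in (0, 1)."""
    def h(x):
        digest = hashlib.md5(str(x).encode()).hexdigest()
        return (int(digest, 16) + 1) / (2.0 ** 128 + 2)
    return sorted({h(d) for d in doc_ids})[:k]

def estimate_intersection(syn_a, syn_b, k=256):
    """Estimate |A ∩ B| from two KMV synopses without reading the full lists.

    Standard KMV estimator: take the k smallest values of the combined synopsis,
    estimate the union size from the k-th smallest value, and scale it by the
    fraction of those values present in both synopses. (For very small posting
    lists the synopses are exact and can simply be intersected directly.)
    """
    sa, sb = set(syn_a), set(syn_b)
    combined = sorted(sa | sb)[:k]
    k_eff = len(combined)
    if k_eff < 2:
        return float(len(sa & sb))
    union_estimate = (k_eff - 1) / combined[-1]
    in_both = sum(1 for v in combined if v in sa and v in sb)
    return (in_both / k_eff) * union_estimate
```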

While it is more efficient to prepare bins for a particular workload, which may come from a system query log, it cannot be assumed that a query term that has not been seen before will never be seen in the future.

It can be demonstrated that it is feasible to use the entire dataset dictionary as the workload, in order to be able to answer any query. Due to the caching of candidate intersection results during the bin computation, no pair of terms needs to be intersected more than once.

For example, consider N terms with posting lists of size X each that all co-occur in a single document d0, with no other co-occurrences. In this case the bin computation process may have to check intersections for every pair of terms, i.e., N(N-1)/2 intersections; thus, the upper bound on the number of intersections is tight. Fortunately, real-world text databases have structures that are far from this worst case.

During a pre-processing stage, a query pre-processor 12 generates MSGs, as defined above.

During a query processing stage, a query processor 14 executes the ObjectRank process on the sub-graphs instead of the full graph and produces high quality approximations of top-K lists, at a small fraction of the cost. In order to save pre-processing cost and storage, each MSG is designed to answer multiple term queries.

We observed in the Wikipedia dataset that a single MSG can be shared by a large number of terms, on average. The query pre-processor 12 of the BinRank system 10 starts with a set of workload terms W for which MSGs will be materialized. If an actual query workload is not available, W includes the entire set of terms found in the corpus. All terms with posting lists longer than a system parameter maxPostingList are excluded.

The posting lists of these terms are deemed too large to be packed into bins. ObjectRank is executed for each such term individually, and the resulting top-K lists are stored. The maxPostingList parameter should be tuned so that there are relatively few of these frequent terms. A greedy bin algorithm unit 20, using the above-discussed bin construction process packTermsIntoBins, partitions W into a set of bins composed of frequently co-occurring terms.

This process takes a single parameter, maxBinSize, which limits the size of a bin's posting list, i.e., the size of the union of the posting lists of the terms in the bin. During the bin construction process, the BinRank system 10 stores the bin identifier of each term into the Lucene index 16 as an additional field. This allows the system to map each term to the corresponding bin and MSG at query time.
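Tying the offline pieces together, a hypothetical pre-processing driver could look like the sketch below (it reuses the object_rank, materialize_msg, and pack_terms_into_bins sketches above, and assumes document IDs are nodes of the data graph; max_posting_list and max_bin_size correspond to the maxPostingList and maxBinSize parameters).

```python
def preprocess(posting_lists, data_graph, max_posting_list, max_bin_size,
               threshold=1e-4, k=100):
    """Offline stage: frequent terms get individual top-k lists; the rest share per-bin MSGs."""
    frequent = {t for t, pl in posting_lists.items() if len(pl) > max_posting_list}
    top_k_lists = {}
    for t in frequent:
        # ObjectRank is run individually for each frequent term and its top-k list stored.
        scores = object_rank(data_graph, list(posting_lists[t]), threshold=threshold)
        top_k_lists[t] = sorted(scores, key=scores.get, reverse=True)[:k]

    rest = {t: pl for t, pl in posting_lists.items() if t not in frequent}
    term_to_bin, msgs = {}, {}
    for bin_id, (terms, bin_docs) in enumerate(pack_terms_into_bins(rest, max_bin_size)):
        for t in terms:
            term_to_bin[t] = bin_id          # stored in the text index as an extra field
        msgs[bin_id] = materialize_msg(data_graph, list(bin_docs), threshold=threshold)
    return top_k_lists, term_to_bin, msgs
```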

The threshold determines the convergence of the algorithm as well as the minimum ObjectRank score of MSG nodes. The BinRank system 10 stores a graph as a row-compressed adjacency matrix. In this format, the entire Wikipedia graph is compact enough to be loaded into main memory for MSG generation. In case the entire data graph does not fit in main memory, the system 10 can apply parallel PageRank computation techniques, such as hypergraph partitioning schemes.

Once the MSG is constructed and stored in MSG storage 26, it is serialized to a binary file on disk in the same row-compressed adjacency matrix format to facilitate fast deserialization. The serialization takes place in a sub-graph serializer 28 within an MSG cache module. In general, deserialization speed can be greatly improved by increasing the transfer rate of the disk subsystem.
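For concreteness, a row-compressed adjacency matrix and its binary serialization might look like the sketch below (a generic CSR-style layout using NumPy; the actual on-disk format of the sub-graph serializer 28 is not specified in the source).

```python
import numpy as np

def to_row_compressed(adj, num_nodes):
    """CSR-style layout: targets[offsets[v]:offsets[v+1]] are node v's out-neighbors.

    adj maps integer node IDs (0..num_nodes-1) to lists of integer neighbor IDs.
    """
    offsets = np.zeros(num_nodes + 1, dtype=np.int64)
    for v, outs in adj.items():
        offsets[v + 1] = len(outs)
    offsets = np.cumsum(offsets)
    targets = np.empty(int(offsets[-1]), dtype=np.int32)
    for v, outs in adj.items():
        targets[offsets[v]:offsets[v] + len(outs)] = outs
    return offsets, targets

def serialize_msg(path, offsets, targets):
    """Dump both arrays into one binary .npz file; loading is essentially I/O-bound."""
    np.savez(path, offsets=offsets, targets=targets)

def deserialize_msg(path):
    data = np.load(path if path.endswith(".npz") else path + ".npz")
    return data["offsets"], data["targets"]
```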

For a given keyword query q, a query dispatcher 32 retrieves from the Lucene index 16 the posting list bs(q), used as the baseset for the ObjectRank execution, and the bin identifier b(q). However, in the Wikipedia dataset this would introduce an additional delay at query time.

Once the ObjectRank scores are computed and sorted, the resulting document IDs are used to retrieve and present the top-k objects to the user. One of the advantages of the BinRank query processor 14 is that it can easily utilize large clusters of nodes. A set of dispatcher processes, each with its own replica of the Lucene index, may route the queries to the appropriate nodes.
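Put together, query-time dispatch might look like the following sketch (schematic only: text_index and msg_cache are placeholders for the Lucene index 16 and the MSG cache, and object_rank is the earlier sketch).

```python
def answer_query(term, text_index, msg_cache, top_k=10):
    """Dispatch a single-term query to the MSG of the term's bin.

    text_index.posting_list(term) -> bs(q), the ObjectRank baseset
    text_index.bin_id(term)       -> b(q), the bin whose MSG answers the term
    msg_cache[bin_id]             -> deserialized MSG adjacency for that bin
    """
    base_set = text_index.posting_list(term)
    bin_id = text_index.bin_id(term)
    msg = msg_cache[bin_id]
    # Restrict the baseset to nodes present in the MSG, then run ObjectRank on it.
    scores = object_rank(msg, [v for v in base_set if v in msg])
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]
```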

In block 42, materialized sub-graphs are pre-computed. A search query is then received in block 44, and one of the pre-computed materialized sub-graphs is accessed using a text index. In block 48, an authority-based keyword search is executed on the materialized sub-graph. In block 50, nodes are retrieved from the dataset based on the keyword search. The retrieved nodes are transmitted as the results of the query. In block 56, all terms in the dataset are partitioned.

A partition identifier is stored for each term. A random walk is then executed over each partition. In block 62, important nodes are identified for each partition based on the random walk.

The important nodes are used to construct a corresponding sub-graph for each partition.

As can be seen from the above disclosure, embodiments of the invention provide a practical solution for scalable dynamic authority-based ranking. As will be appreciated by one skilled in the art, the present invention may be embodied as a system, method, or computer program product.

Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.), or an embodiment combining software and hardware aspects.

Any combination of one or more computer-usable or computer-readable medium(s) may be utilized. The computer-usable or computer-readable medium may be, for example but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, device, or propagation medium.

More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, transmission media such as those supporting the Internet or an intranet, or a magnetic storage device.

Note that the computer-usable or computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted, or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory. In the context of this document, a computer-usable or computer-readable medium may be any medium that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.

The computer-usable medium may include a propagated data signal with the computer-usable program code embodied therewith, either in baseband or as part of a carrier wave. The computer usable program code may be transmitted using any appropriate medium, including but not limited to wireless, wire line, optical fiber cable, RF, etc.
