
Latent Semantic Indexing

Conceptual analytics in Relativity uses latent semantic indexing (LSI). Rather than referring to a master dictionary, LSI uses mathematics to identify concepts in documents. The approach is based on the co-occurrence of terms in the documents that an analytics index is built from. The content of the workspace determines how documents are related to one another and which concepts are present in those documents.


Latent Semantic Indexing is the mathematical basis of a conceptual analytics index, which is built from a set of documents. Conceptual analytics can also use a classification index, which is built from coded examples. This type of index uses Support Vector Machine (SVM) learning.



LSI as used in Relativity has several key characteristics:

  1. It is language agnostic. Latent Semantic Indexing will discover correlations between documents, and the concepts inside them, regardless of the language the documents are written in.

  2. The training data source used for an LSI conceptual analytics index can be the same as the full set of documents to be analyzed, or a subset of those documents.

  3. It generates a multi-dimensional concept space, which is a mathematical model. The indexed documents are mapped onto the concept space, which acts as a spatial index. Documents that are closer together in the concept space are more conceptually similar.

  4. The similarity between two documents or two words is measured by rank value, also referred to as a coherence score. The higher the score, the higher the degree of similarity.

  5. The coherence score is not a percentage of shared content, but a measurement of distance.

  6. Analytics indexes are always in memory.


The Latent Semantic Indexing processing technique is based on the following:

  1. The principle that documents which are conceptually similar will use similar sections of text.

  2. Use of a matrix, or chart, in which each word is listed on a separate row, and each document in a separate column.

  3. Singular value decomposition (SVD) is used to decrease the number of rows while maintaining the similarity relationships between the columns.

  4. The degree of similarity between two documents is calculated by finding the cosine of the angle between the two vectors formed by their columns. A value close to 1 indicates that the documents are very similar; a value closer to 0 indicates that they are more dissimilar.
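
The matrix and cosine steps above can be sketched in a few lines of Python. This is a minimal illustration only, not Relativity's implementation: the function names (term_document_matrix, column, cosine) and the sample documents are hypothetical, the tokenizer simply splits on whitespace, and the singular value decomposition step is omitted, so the cosine is computed on raw term counts rather than in the reduced concept space.

```python
import math
from collections import Counter


def term_document_matrix(docs):
    """Return (terms, matrix) where matrix[i][j] counts term i in document j."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    terms = sorted(set().union(*counts))  # one row per distinct word
    matrix = [[c[t] for c in counts] for t in terms]
    return terms, matrix


def column(matrix, j):
    """Extract the vector for document j (one column of the matrix)."""
    return [row[j] for row in matrix]


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical sample documents, used only for illustration.
docs = [
    "contract breach damages claim",
    "breach of contract claim filed",
    "lunch menu pizza salad",
]

terms, matrix = term_document_matrix(docs)
sim_01 = cosine(column(matrix, 0), column(matrix, 1))  # documents share three words
sim_02 = cosine(column(matrix, 0), column(matrix, 2))  # documents share no words
```

Here sim_01 is well above sim_02, because the first two documents share most of their terms and the third shares none. In a full LSI pipeline the same cosine would be taken after the decomposition step, which is what allows documents to score as similar even when they use related rather than identical words.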
