Document clustering | How AI identifies similar files

We know that artificial intelligence (AI) can recognise faces and other biometric data to find duplicates. AI can also perform tasks based on voice commands. But could this technology help compare two PDF files? Here's how AI helps do it.

How does it do it?

Researchers use document text clustering to segregate documents based on their content. They analyse the documents for clusters of similar words, phrases, and sentences.
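
One simple way to picture "similar words" is to count the words in each document and compare the counts. The sketch below is only an illustration, not the method any particular researcher or product uses: it assumes a plain bag-of-words model and cosine similarity, and the function names and sample sentences are made up for this example. Text from a PDF would first have to be extracted by a separate tool.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Lower-case the text and count each word (a simple bag-of-words model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors, from 0.0 to 1.0."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical sample documents: two about a loan, one about the weather.
doc_a = "The bank approved the loan application"
doc_b = "The loan application was approved by the bank"
doc_c = "Heavy rain flooded the city streets"

sim_ab = cosine_similarity(word_counts(doc_a), word_counts(doc_b))
sim_ac = cosine_similarity(word_counts(doc_a), word_counts(doc_c))

# The two loan documents score higher than the loan/weather pair.
print(sim_ab > sim_ac)
```

A score close to 1 means the two texts use largely the same words; a score close to 0 means they share almost none. Real systems usually weight rare words more heavily than common ones such as "the".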

This way of grouping and segregating data helps simplify the extraction process needed to pull relevant information, especially when the user is presented with large amounts of data.

The document clustering technique is commonly used in data analysis and mining, image analysis, data compression, and information retrieval.

Where is it used?

The World Wide Web is the largest shared information source. Despite the existence of search engines like Google, Yahoo!, and Bing, retrieval of specific pages or documents can be overwhelming.

So, algorithms use one or more methods to classify documents on the web based on content. Common clustering applications include Vivisimo, KartOO and DuckDuckGo.

Search result clustering involves grouping content based on parameters like hyperlinks, the user's context and web usage. The most common method employed by clustering engines is to group short pieces of text, or snippets, that hint at what the actual document contains, researchers said in a study titled 'Web Search Result Clustering based on Heuristic Search and k-means.'

How is AI used in these applications to cluster?

Two types of algorithms are used to cluster documents: hierarchical clustering and non-hierarchical clustering.

A hierarchical clustering algorithm divides and aggregates documents in a predefined, hierarchical manner. Pairs of clusters of data objects in the hierarchy are then linked together. Although the resulting hierarchy may be easy to read and understand, the approach may not be as efficient as non-hierarchical clustering. Clustering may also be difficult when the data contains high levels of errors.
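
The linking of pairs of clusters can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the algorithm of any specific search engine: it works bottom-up (agglomerative), measures similarity as the share of words two documents have in common (Jaccard similarity), and uses single linkage. The function names, the 0.2 threshold and the sample headlines are all hypothetical.

```python
import re


def words(text):
    """The set of distinct lower-cased words in a document."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Share of words two documents have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def agglomerate(docs, threshold=0.2):
    """Bottom-up clustering: repeatedly link the two most similar clusters.

    Cluster similarity is the best similarity between any pair of members
    (single linkage). Linking stops once no pair reaches `threshold`.
    Returns the final clusters (lists of document indices) and the link history.
    """
    vectors = [words(d) for d in docs]
    clusters = [[i] for i in range(len(docs))]  # every document starts alone
    history = []
    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                sim = max(jaccard(vectors[i], vectors[j])
                          for i in clusters[x] for j in clusters[y])
                if best is None or sim > best[0]:
                    best = (sim, x, y)
        sim, x, y = best
        if sim < threshold:
            break
        history.append((clusters[x], clusters[y], sim))
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)] + [merged]
    return clusters, history


# Hypothetical headlines: two about finance, two about football.
docs = [
    "stock markets rallied as bank shares rose",
    "bank shares rose after the stock markets opened",
    "the football team won the league title",
    "the league title went to the football team",
]
clusters, history = agglomerate(docs)
print(sorted(sorted(c) for c in clusters))  # [[0, 1], [2, 3]]
```

The link history is what makes the result readable as a hierarchy. The nested loops also show why the approach can be slow: every round compares every pair of clusters again.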

Non-hierarchical clustering forms new clusters by merging and splitting existing ones. It is a relatively fast, reliable and stable clustering technique.
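
k-means, the method named in the study quoted earlier, is a widely used non-hierarchical technique. The sketch below is a bare-bones version for illustration only: it assumes documents are represented by raw word counts, and the starting centroids are picked by hand through a hypothetical `seeds` argument so that the run is repeatable. Library implementations choose starting points more carefully.

```python
import re
from collections import Counter


def vectorise(docs):
    """Turn each document into a list of word counts over a shared vocabulary."""
    counts = [Counter(re.findall(r"[a-z0-9]+", d.lower())) for d in docs]
    vocab = sorted(set().union(*counts))
    return [[c[w] for w in vocab] for c in counts]


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, seeds, max_iter=20):
    """Plain k-means; `seeds` are indices of the points used as starting centroids."""
    centroids = [list(points[s]) for s in seeds]
    labels = None
    for _ in range(max_iter):
        # Step 1: assign every document to its nearest centroid.
        new_labels = [min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing, so the clusters are stable
        labels = new_labels
        # Step 2: move each centroid to the average of its members.
        for k in range(len(centroids)):
            members = [p for p, l in zip(points, labels) if l == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical headlines: two about finance, two about football.
docs = [
    "stock markets rallied as bank shares rose",
    "bank shares rose after the stock markets opened",
    "the football team won the league title",
    "the league title went to the football team",
]
labels = k_means(vectorise(docs), seeds=[0, 2])
print(labels)  # [0, 0, 1, 1]
```

Unlike the hierarchical approach, the number of clusters is fixed in advance (two here), and documents can move between clusters from one round to the next until the grouping settles.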

