Document clustering | How AI identifies similar files

Despite the existence of search engines like Google, Yahoo!, and Bing, retrieval of specific pages or documents can be overwhelming. This is how AI makes searching for documents easy.

November 17, 2020 05:05 pm | Updated November 18, 2020 03:31 pm IST

Algorithms use one or more methods to classify documents on the web based on content.

We know that artificial intelligence (AI) can recognise faces and other biometric data to find duplicates. AI also performs tasks based on voice commands. But could this technology help compare two PDF files? Here's how AI helps do it.

How does it do it?

Researchers use document text clustering to segregate documents based on their content. They analyse the documents based on clusters of similar words, phrases, and sentences.

This way of grouping and segregating data helps simplify the extraction process needed to pull relevant information, especially when the user is presented with large amounts of data.
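To make the idea of comparing documents by their shared words concrete, here is a minimal Python sketch. It assumes a simple word-count comparison known as cosine similarity; the function names and sample texts are illustrative and are not taken from any study or product mentioned in this article.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Lowercase the text and count how often each word appears."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(text_a, text_b):
    """Score from 0.0 (no shared words) to 1.0 (same mix of words)."""
    a, b = word_counts(text_a), word_counts(text_b)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical sample documents, invented for illustration.
docs = {
    "report_a": "AI clusters documents by comparing the words they share",
    "report_b": "Documents that share words are clustered together by AI",
    "recipe": "Boil the rice and add salt before serving",
}
sim_reports = cosine_similarity(docs["report_a"], docs["report_b"])
sim_recipe = cosine_similarity(docs["report_a"], docs["recipe"])
```

In this sketch the two reports score noticeably higher with each other than either does with the recipe, which is the kind of signal a clustering system could use to place them in the same group. Real systems typically go further, for example by weighting rare words more heavily.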

The document clustering technique is commonly used in data analysis and mining, image analysis, data compression, and information retrieval.

Where is it used?

The World Wide Web is the largest shared information source. Despite the existence of search engines like Google, Yahoo!, and Bing, retrieval of specific pages or documents can be overwhelming.

So, algorithms use one or more methods to classify documents on the web based on content. Common clustering applications include Vivisimo, KartOO and DuckDuckGo.

Search result clustering involves grouping content based on parameters like hyperlinks, user's context and web usage. The most common method employed by clustering engines is grouping of short text or snippets that hint at what the actual document contains, researchers said in a study titled 'Web Search Result Clustering based on Heuristic Search and k-means.'

How is AI used in these applications to cluster?

There are two types of algorithms used to cluster documents: hierarchical clustering and non-hierarchical clustering.

A hierarchical clustering algorithm divides and aggregates documents in a predefined, hierarchical manner. Pairs of clusters of data objects in the hierarchy are then linked together. Although this system may be easy to read and understand, it may not be as efficient as non-hierarchical clustering. Clustering may also be difficult in cases where the data has high levels of errors.
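The linking of pairs of clusters can be sketched in a few lines of Python. This is an illustrative bottom-up (agglomerative) version only, assuming word overlap (Jaccard similarity) as the measure and a hypothetical cut-off of 0.2; the sample headlines are invented and the article's sources do not specify this exact procedure.

```python
import re


def words(text):
    """Return the set of distinct lowercase words in a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Share of distinct words two word sets have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def agglomerative(texts, threshold=0.2):
    """Repeatedly merge the two most similar clusters (single linkage).

    Stops when no pair of clusters is at least `threshold` similar.
    Returns the final clusters and the merge history (the hierarchy).
    """
    sets = [words(t) for t in texts]
    clusters = [[i] for i in range(len(texts))]
    history = []
    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                # Single linkage: score a pair by its two closest members.
                score = max(jaccard(sets[i], sets[j])
                            for i in clusters[x] for j in clusters[y])
                if best is None or score > best[0]:
                    best = (score, x, y)
        score, x, y = best
        if score < threshold:
            break
        history.append((clusters[x], clusters[y], score))
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return clusters, history


# Hypothetical headlines, invented for illustration.
texts = [
    "stock markets fell sharply on Monday",
    "markets fell as stock prices dropped",
    "the team won the cricket match",
    "cricket team wins the final match",
]
clusters, history = agglomerative(texts)
```

Every pair of clusters is compared on each pass, which hints at why this approach can be slower than non-hierarchical methods as the number of documents grows.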

Non-hierarchical clustering involves the formation of new clusters by merging and splitting existing clusters. This is a relatively fast, reliable and stable technique of clustering.
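One widely used non-hierarchical method is k-means, the technique named in the study cited earlier. The Python sketch below is a simplified illustration, not the researchers' implementation: it assumes plain word counts as the document representation and uses hand-picked starting documents (`seeds`, a hypothetical parameter) so the result is repeatable.

```python
import re
from collections import Counter


def vectorize(texts):
    """Turn each text into a list of word counts over a shared vocabulary."""
    counts = [Counter(re.findall(r"[a-z0-9]+", t.lower())) for t in texts]
    vocab = sorted(set().union(*counts))
    return [[c[w] for w in vocab] for c in counts]


def sq_dist(p, q):
    """Squared straight-line distance between two count vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(vectors, seeds, max_iter=20):
    """Assign each vector to its nearest centroid, then recompute centroids.

    `seeds` lists the indices of the documents used as starting centroids.
    """
    centroids = [list(vectors[i]) for i in seeds]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(len(centroids)),
                          key=lambda k: sq_dist(v, centroids[k]))
                      for v in vectors]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for k in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels


# Hypothetical headlines, invented for illustration.
texts = [
    "stock markets fell sharply on Monday",
    "markets fell as stock prices dropped",
    "the team won the cricket match",
    "cricket team wins the final match",
]
labels = kmeans(vectorize(texts), seeds=[0, 2])
```

Here the two market headlines end up with one label and the two cricket headlines with the other. Each pass compares documents only with the cluster centres rather than with every other document, which is one reason methods of this kind tend to scale well.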
