PatSeer’s AI Classifier Obtains a 95% Accuracy Score in Performance Tests Using Gold Standard Datasets

PatSeer
3 min read · Dec 8, 2022


PatSeer’s AI Classifier

The recent PatSeer release features an AI Classifier that makes predictions for newer, uncategorised records based on your existing taxonomy and the records you have categorised against it. The AI Classifier classifies new records automatically according to hierarchical taxonomies, thereby saving you time and effort. Once you enable the AI Classifier for your projects, it continues to learn based on the feedback you provide. Various modes of operation permit you to select the level of intervention you prefer for the process.

PatSeer’s AI Classifier uses the latest AI-NLP stack and semantic rules in ReleSense, with the aim of achieving higher precision on scientific literature and patents.

Selecting a benchmark to assess performance

To benchmark the AI Classifier’s performance, we tested its accuracy against two Gold Standard Datasets (Quantum Computing and Cannabinoid Edibles) that were released as part of a collaboration between Aistemos Ltd and Patinformatics LLC. They are available at: https://github.com/swh/classification-gold-standard/tree/master/data.

The two topics covered by the gold standard datasets are Cannabinoid Edibles and Qubit Generation for Quantum Computing:

  • Cannabinoid Edibles: The positive set of patents discusses food and beverages, pharmaceutical products, and cosmetic products that include cannabinoid substances. The negative set includes food products that contain a substance similar to a cannabinoid but not a cannabinoid itself.
  • Qubit Generation for Quantum Computing: This dataset covers patents that discuss the various means of generating qubits for use in a quantum mechanics-based computing system. The positive set discusses types of qubits, including superconducting loops, topological, quantum dot-based, and ion-trap methods. The negative set covers excluded technologies: applications, algorithms, and other auxiliary aspects of quantum computing that do not mention a hardware component, as well as hardware for quantum phenomena other than qubit generation.

The original Qubit dataset (as released on GitHub) contains multiple members of the same patent family, and a family ID (DOCDB family) is provided for each record. When each family member is counted separately, the dataset has an equal number of positive and negative records. However, the testing methodology required randomly selected records to go into the training dataset, so a record used for training could have family members in the prediction dataset. Because family members typically share largely the same content, this overlap would skew the AI Classifier’s performance metrics.

We, therefore, deduplicated the dataset to one member per family.
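The deduplication step can be sketched as follows. This is a minimal illustration, not the script used for the tests: the field names (`publication`, `family_id`, `label`), the sample records, and the choice to keep the first member encountered in each family are all assumptions made for the example.

```python
def dedupe_by_family(records, family_key="family_id"):
    """Keep one record per patent family.

    Assumption: the first member encountered for each family is kept.
    The source does not say which member was retained.
    """
    seen_families = set()
    kept = []
    for record in records:
        family = record[family_key]
        if family not in seen_families:
            seen_families.add(family)
            kept.append(record)
    return kept


# Hypothetical records; not taken from the gold standard datasets.
records = [
    {"publication": "US-A", "family_id": 1, "label": "positive"},
    {"publication": "EP-A", "family_id": 1, "label": "positive"},
    {"publication": "US-B", "family_id": 2, "label": "negative"},
    {"publication": "WO-C", "family_id": 3, "label": "positive"},
    {"publication": "JP-C", "family_id": 3, "label": "positive"},
]

deduped = dedupe_by_family(records)
print([r["publication"] for r in deduped])
```

With one member per family, a random split into training and prediction sets can no longer place near-identical documents on both sides.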

Gold Standard Datasets for Cannabinoid Edibles and Qubit Generation for Quantum Computing.

After deduplication, the datasets have more negatives than positives, which implies that positive records averaged more family members than negative records in the original datasets. It is also worth noting that an unbalanced dataset can affect common AI metrics such as accuracy to varying degrees.
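To see why class imbalance matters for accuracy, consider a small sketch. The 80/20 split below is a hypothetical example, not the actual composition of either gold standard dataset, and the "classifier" is a deliberately trivial one that labels every record negative.

```python
def accuracy(y_true, y_pred):
    """Fraction of records whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred, label):
    """Fraction of records with the given true label that were predicted correctly."""
    predictions_for_label = [p for t, p in zip(y_true, y_pred) if t == label]
    return sum(p == label for p in predictions_for_label) / len(predictions_for_label)


# Hypothetical unbalanced dataset: 80 negatives, 20 positives.
y_true = ["neg"] * 80 + ["pos"] * 20
# Trivial baseline that predicts "neg" for every record.
y_pred = ["neg"] * 100

acc = accuracy(y_true, y_pred)
pos_recall = recall(y_true, y_pred, "pos")
neg_recall = recall(y_true, y_pred, "neg")
# Balanced accuracy is the mean of the per-class recalls.
balanced_acc = (pos_recall + neg_recall) / 2

print(f"accuracy={acc:.2f} positive recall={pos_recall:.2f} balanced accuracy={balanced_acc:.2f}")
```

The baseline reaches 80% accuracy without finding a single positive record, so accuracy on an unbalanced dataset is best read alongside per-class metrics.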

Read more: Testing the AI Classifier’s performance on the gold standard dataset.

Written by PatSeer
