Conformance1

Tools for conforming to standards, goals and processes

Improving Data Cleaning by Learning From Unstructured Textual Data

Filed Under: Quality-Continuous Improvement

  • The approach trains machine learning models on clean data and uses unstructured text, such as titles or descriptions, to predict and correct errors in structured data.
  • The proposed method integrates Natural Language Processing (NLP) techniques with machine learning classifiers and applies constraint-based filtering to improve training data quality and model performance.
  • Experiments with real-world datasets show that the hybrid strategy, combining the ML-based model with traditional rule-based engines like Parker, yields strong results, with precision up to 0.88 and recall up to 0.94.

This paper introduces a machine learning-based strategy to improve data cleaning by integrating unstructured textual data with traditional structured datasets. Recognizing that structured data often contains errors due to inconsistencies, omissions, or conflicts, the authors propose leveraging textual descriptions, such as product titles or clinical trial summaries, to predict and correct inaccurate entries. The process begins by identifying “clean” data that adheres to integrity constraints like functional dependencies and selection rules. These clean samples are used to train machine learning models that learn mappings between unstructured text and structured attribute values.
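The selection of "clean" training rows can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the `clean_rows` helper, the `trial_id -> phase` functional dependency, the phase selection rule, and the sample rows are all hypothetical.

```python
from collections import defaultdict

def clean_rows(rows, fd_lhs, fd_rhs, selection_rules=()):
    """Return rows that satisfy every selection rule and take no part
    in a violation of the functional dependency fd_lhs -> fd_rhs."""
    # Collect the distinct right-hand-side values seen for each left-hand-side key.
    rhs_values = defaultdict(set)
    for row in rows:
        key = tuple(row[attr] for attr in fd_lhs)
        rhs_values[key].add(row[fd_rhs])

    clean = []
    for row in rows:
        key = tuple(row[attr] for attr in fd_lhs)
        fd_ok = len(rhs_values[key]) == 1            # one key, one value
        rules_ok = all(rule(row) for rule in selection_rules)
        if fd_ok and rules_ok:
            clean.append(row)
    return clean

# Hypothetical clinical-trial rows: T1 violates the dependency, T3 breaks the rule.
rows = [
    {"trial_id": "T1", "phase": "2", "title": "Phase 2 study of drug A"},
    {"trial_id": "T1", "phase": "3", "title": "Phase 2 study of drug A"},
    {"trial_id": "T2", "phase": "1", "title": "Phase 1 safety study of drug B"},
    {"trial_id": "T3", "phase": "9", "title": "Phase 3 trial of drug C"},
]

def valid_phase(row):
    # Hypothetical selection rule: phase must be one of the known values.
    return row["phase"] in {"1", "2", "3", "4"}

training_rows = clean_rows(rows, ["trial_id"], "phase", [valid_phase])
```

Only the `T2` row survives here. In this sketch every row involved in a dependency conflict is dropped, because the constraint alone cannot say which of the conflicting values is correct.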

The ML models rely on text representation techniques, such as Bag-of-Words (BOW), TF-IDF, and BERT embeddings, to transform unstructured text into usable features. These features then feed into base learners like XGBoost, Logistic Regression, Naïve Bayes, or Artificial Neural Networks, which predict structured attribute values. During the data repair phase, the models output a probability distribution over the candidate values of an attribute. A configurable confidence threshold determines whether the predicted value replaces a potentially incorrect one, so that a replacement is made only when the model's confidence is high enough.
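The threshold-gated repair step might look like the sketch below. To stay dependency-free it pairs Bag-of-Words features with a small hand-written Naïve Bayes learner, two of the options named above, instead of the TF-IDF and XGBoost combination. The class, the `repair` function, the 0.8 threshold, and the training texts are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    return text.lower().split()

class BowNaiveBayes:
    """Multinomial Naive Bayes over bag-of-words counts (Laplace smoothing)."""

    def fit(self, texts, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict_proba(self, text):
        total = sum(self.label_counts.values())
        log_scores = {}
        for label, count in self.label_counts.items():
            score = math.log(count / total)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for tok in tokenize(text):
                if tok in self.vocab:  # words never seen in training are ignored
                    score += math.log((self.word_counts[label][tok] + 1) / denom)
            log_scores[label] = score
        # Normalise the log scores into a probability distribution.
        top = max(log_scores.values())
        exp = {label: math.exp(s - top) for label, s in log_scores.items()}
        z = sum(exp.values())
        return {label: v / z for label, v in exp.items()}

def repair(model, text, current_value, threshold=0.8):
    """Replace current_value only when the top prediction clears the threshold."""
    probs = model.predict_proba(text)
    best, p = max(probs.items(), key=lambda kv: kv[1])
    return best if p >= threshold and best != current_value else current_value

# Hypothetical clean training data: trial titles and their phase attribute.
texts = [
    "phase 2 efficacy study of drug a",
    "phase 2 dose finding study",
    "phase 3 confirmatory trial of drug b",
    "phase 3 long term outcomes trial",
]
labels = ["2", "2", "3", "3"]
model = BowNaiveBayes().fit(texts, labels)

confident = repair(model, "Phase 3 trial of drug C", current_value="2")
uncertain = repair(model, "study of drug", current_value="3")
```

In this toy run the first title gives the model enough evidence to replace the stored phase with `"3"`, while the second title is too ambiguous to clear the threshold, so the stored value is left untouched.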

The paper also explores a hybrid approach that combines the ML model with a rule-based system called the Parker engine. This combined method either uses the ML predictions as inputs to Parker or uses Parker-cleaned data as training input for the ML model. The hybrid strategy generally improves performance, showing how integrating rule-based and learning-based systems can yield better data repair outcomes. Experimental evaluations using datasets related to clinical trials and allergen labeling validate the method’s effectiveness. Among several tested models, the combination of TF-IDF with XGBoost performs best overall in both prediction accuracy and repair robustness.
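The two ways of combining the components can be sketched as plain function composition. Nothing below reflects Parker's actual interface: `rule_repair`, `title_repair`, and `train_lookup` are toy stand-ins invented for this example, and only the order of the steps is the point.

```python
def ml_then_rules(rows, ml_repair, rule_repair):
    """Order 1: the ML prediction is handed to the rule-based step as input."""
    return [rule_repair(ml_repair(row)) for row in rows]

def rules_then_ml(rows, rule_repair, train_model):
    """Order 2: rule-cleaned rows become training data for the ML step."""
    cleaned = [rule_repair(row) for row in rows]
    ml_repair = train_model(cleaned)   # returns a row -> row repair function
    return [ml_repair(row) for row in rows]

def rule_repair(row):
    # Stand-in rule: normalise Roman-numeral phases to digits.
    roman = {"I": "1", "II": "2", "III": "3"}
    return {**row, "phase": roman.get(row["phase"], row["phase"])}

def title_repair(row):
    # Stand-in for a trained text model: read the phase from the title if stated.
    tokens = row["title"].lower().split()
    if "phase" in tokens[:-1]:
        return {**row, "phase": tokens[tokens.index("phase") + 1].upper()}
    return row

def train_lookup(cleaned_rows):
    # Stand-in "training": memorise title -> phase from the rule-cleaned rows.
    table = {r["title"]: r["phase"] for r in cleaned_rows}
    return lambda row: {**row, "phase": table.get(row["title"], row["phase"])}

hybrid_a = ml_then_rules(
    [{"title": "Phase III trial of drug C", "phase": "1"}],
    title_repair, rule_repair,
)
hybrid_b = rules_then_ml(
    [{"title": "Drug D extension study", "phase": "II"},
     {"title": "Drug D extension study", "phase": "2"}],
    rule_repair, train_lookup,
)
```

In the first ordering the text-derived value `III` is normalised by the rule step to `3`. In the second, the rule step first reconciles `II` and `2`, so the learned mapping is trained on consistent values.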

Performance evaluations demonstrate that using constraint-verified training data improves precision and recall. In particular, the best models achieve a precision of 0.88 and a recall of 0.94 in some cases. Models trained on BERT embeddings underperformed, which is attributed to the factual rather than narrative nature of the text, while simpler representations such as TF-IDF, paired with XGBoost, yielded more reliable results. The paper concludes by emphasizing the value of using textual data to inform structured data cleaning and the strength of combining AI techniques with constraint-based frameworks for scalable and accurate data repair.
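For readers who want to score their own repairs, the sketch below shows one common way to compute precision and recall for a data repair task. The summary does not give the paper's exact definitions, so treat these as an assumption: precision is the share of changed cells that now match the ground truth, and recall is the share of erroneous cells that were fixed. The sample values are made up.

```python
def repair_precision_recall(dirty, repaired, truth):
    """Score repairs cell by cell against a ground-truth column."""
    cells = range(len(dirty))
    changed = [i for i in cells if repaired[i] != dirty[i]]   # repairs made
    errors = [i for i in cells if dirty[i] != truth[i]]       # cells needing repair
    correct = [i for i in changed if repaired[i] == truth[i]] # repairs that are right
    # By convention in this sketch, a metric with an empty denominator is 0.0.
    precision = len(correct) / len(changed) if changed else 0.0
    recall = len(correct) / len(errors) if errors else 0.0
    return precision, recall

# Hypothetical phase column before repair, after repair, and the true values.
dirty    = ["2", "3", "1", "2", "3"]
repaired = ["1", "2", "1", "3", "1"]
truth    = ["2", "2", "1", "3", "1"]

precision, recall = repair_precision_recall(dirty, repaired, truth)
```

Here four cells were changed and three of those changes are correct, giving a precision of 0.75. All three original errors were fixed, giving a recall of 1.0.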

Copyright © 2025 · Conformance1