
What is data normalization in data preprocessing?

By Jacob Andra / Published July 16, 2024 
Last Updated: July 30, 2024

Data normalization is a step in data preprocessing that involves adjusting the values of data attributes to a common scale, without distorting differences in the ranges of values.

Enterprise data lives in diverse sources: databases, spreadsheets, documents, APIs, and more. Each source may use different formats, units, and scales.

Normalization also reduces redundancy, so your AI ingests more substance and less fluff, which makes for improved performance.

For example, Wang et al. demonstrate that normalizing a sales database to second normal form (2NF) can reduce its size by more than 79%.

Main takeaways
Normalized data improves AI model accuracy.
Standardizing text ensures consistent interpretation.
Normalization speeds up data analysis.
Normalization reduces bias for fairer AI predictions.
Normalization gives you increased ROI on your enterprise AI investment.

What is the purpose of data normalization?

Data normalization lays the foundation for balanced AI analysis, fair feature comparison, and overall improved model performance. Think of it like tuning instruments and adjusting microphones before a concert: skip it, and some elements will overpower others, distorting the overall output.

There are many different use cases for enterprise AI implementation, and most of them rely on the AI's ability to interpret data accurately and without bias. All features need to be standardized for maximum fairness in the AI's interpretation.

Data normalization is how we prepare your entire dataset, including numbers, text, and categorical data, to be interpreted equitably by the AI system. This includes scaling numerical features, standardizing text formats, and ensuring consistent terminology usage across your knowledge base.

The basics of data normalization

At its core, data normalization is about creating a level playing field for all your data. It's like establishing a common language across your entire knowledge base.

Here are the "3 S's" of data normalization:

  1. Scaled: we want to bring all numerical features to a common range. This prevents features with larger scales from dominating the analysis and ensures each feature contributes proportionally.
  2. Standardized: we unify the format of textual data across all documents. This includes consistent capitalization, punctuation, and special character usage. Standardization ensures the AI interprets similar information consistently, regardless of its original format.
  3. Streamlined: we focus on creating a unified vocabulary and terminology across the knowledge base. This involves standardizing acronyms, industry terms, and naming conventions. Streamlining helps the AI recognize and relate similar concepts, even when expressed differently in various documents.

Our data normalization process

Here are our basic steps to effective data normalization:

  1. Assess feature scales: we evaluate your knowledge base to identify numerical features with varying scales. We determine which scaling method (e.g., min-max normalization, z-score normalization) is most appropriate for each feature.
  2. Standardize text formats: we analyze textual data across your documents, identifying inconsistencies in capitalization, punctuation, and special character usage. We then apply uniform formatting rules to standardize all text entries.
  3. Unify terminology: we create a comprehensive list of terms, acronyms, and industry-specific language used throughout your knowledge base. We then standardize these terms to ensure consistent usage across all documents.
  4. Apply numerical scaling: we implement the chosen scaling methods to bring all numerical features to a common range, usually between 0 and 1. This prevents features with larger scales from dominating the analysis.
  5. Normalize categorical data: we address categorical variables by applying techniques such as one-hot encoding or label encoding, depending on the nature of the data and the requirements of the AI system.
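
Here's a minimal sketch of steps 4 and 5, assuming a small pandas DataFrame with hypothetical "price" and "category" columns; in practice we select scaling and encoding methods per feature.

```python
import pandas as pd

# A toy dataset; "price" and "category" are hypothetical column names.
df = pd.DataFrame({
    "price": [3.25, 2415.00, 15.99, 129.99],
    "category": ["accessory", "bag", "accessory", "audio"],
})

# Step 4: min-max scaling brings all prices into the 0-1 range.
lo, hi = df["price"].min(), df["price"].max()
df["price_scaled"] = (df["price"] - lo) / (hi - lo)

# Step 5: one-hot encoding turns the categorical column into binary indicator columns.
df = pd.get_dummies(df, columns=["category"])

print(df)
```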

If you’d like to explore how data normalization can drive efficiencies in your workplace, request a free consultation with Talbot West. We can discuss specific tools, implementations, and risk management strategies.

Get in touch

Data normalization examples

Here are ten examples of how data normalization ensures balanced and accurate AI analysis.

Scaling numerical features

  • Industry: retail
  • Scenario: an e-commerce platform is training its AI expert with customer purchase data.
  • Issue: purchase amounts vary widely, from a few dollars to thousands, and the high-value purchases skew the model.
| Order ID | Product Name | Price ($) |
|---|---|---|
| 1 | Novelty keychain | 3.25 |
| 2 | Designer handbag | 2415.00 |
| 3 | Smartphone charger | 15.99 |
| 4 | Wireless earbuds | 129.99 |
| 5 | Smart watch | 299.00 |
| 6 | Laptop sleeve | 24.50 |

  • Solution: scale all purchase amounts to a common range.
| Order ID | Product Name | Normalized Price |
|---|---|---|
| 1 | Novelty keychain | 0.0000 |
| 2 | Designer handbag | 1.0000 |
| 3 | Smartphone charger | 0.0053 |
| 4 | Wireless earbuds | 0.0525 |
| 5 | Smart watch | 0.1226 |
| 6 | Laptop sleeve | 0.0088 |

  • Implementation: use min-max scaling to normalize purchase amounts, as sketched below.
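
The sketch below reproduces the normalized prices with plain Python min-max scaling; the product names and prices come from the table above.

```python
# Min-max scaling: (x - min) / (max - min), using the prices from the table above.
prices = {
    "Novelty keychain": 3.25,
    "Designer handbag": 2415.00,
    "Smartphone charger": 15.99,
    "Wireless earbuds": 129.99,
    "Smart watch": 299.00,
    "Laptop sleeve": 24.50,
}

lo, hi = min(prices.values()), max(prices.values())
for product, price in prices.items():
    normalized = (price - lo) / (hi - lo)
    print(f"{product}: {normalized:.4f}")
# Prints 0.0000 for the keychain, 1.0000 for the handbag, 0.0053 for the charger, etc.
```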

Standardizing text formats

  • Industry: marketing
  • Scenario: a marketing agency is training an AI system on customer feedback.
  • Issue: inconsistent spelling or formatting of product features (e.g., "face-id", "FaceID", "face id") can result in the AI missing important trends or issues related to specific features.
  • Solution: apply consistent text formatting rules, spelling, and capitalization across all feedback.
  • Implementation: use text processing tools to standardize capitalization and punctuation.
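
Here's a minimal sketch of this kind of text standardization; the variant patterns and the canonical form "face id" are illustrative assumptions, not a definitive rule set.

```python
import re

# Map known spelling variants of a feature name to one canonical form (illustrative).
FEATURE_VARIANTS = {
    r"\bface[\s\-_]?id\b": "face id",
}

def standardize_feedback(text: str) -> str:
    text = text.lower().strip()                  # consistent capitalization
    text = re.sub(r"\s+", " ", text)             # collapse runs of whitespace
    for pattern, canonical in FEATURE_VARIANTS.items():
        text = re.sub(pattern, canonical, text)  # unify feature spellings
    return text

print(standardize_feedback("FaceID stopped  working!"))  # -> "face id stopped working!"
print(standardize_feedback("Love the face-id feature"))  # -> "love the face id feature"
```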

Unifying terminology

  • Industry: healthcare
  • Scenario: a hospital is preparing patient records for AI analysis to improve treatment plans.
  • Issue: different departments use different terms for the same medical conditions.
  • Solution: standardize medical terminology across all records.
  • Implementation: create a unified terminology database and update records accordingly.
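
A minimal sketch of terminology unification; the term map below is illustrative, not a clinical vocabulary, and a real implementation would draw on a curated terminology database.

```python
import re

# Illustrative map from colloquial or departmental terms to standard terminology.
TERM_MAP = {
    r"\bheart attack\b": "myocardial infarction",
    r"\bhigh blood pressure\b": "hypertension",
    r"\bhtn\b": "hypertension",
}

def unify_terms(record_text: str) -> str:
    result = record_text.lower()
    for pattern, standard in TERM_MAP.items():
        result = re.sub(pattern, standard, result)  # word boundaries avoid partial matches
    return result

print(unify_terms("History of high blood pressure and a prior heart attack."))
# -> "history of hypertension and a prior myocardial infarction."
```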

Handling categorical data

  • Industry: human resources
  • Scenario: a company is fine-tuning an LLM on employee performance data.
  • Issue: job titles are recorded inconsistently (e.g., "Software Engineer," "SWE," "Dev").
  • Solution: normalize job titles to a standard set of categories.
  • Implementation: map out all variations to a single standard title.
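
A minimal sketch of this mapping; the title variants and the standard category are illustrative.

```python
# Map job-title variants to a single standard category (mapping is illustrative).
TITLE_MAP = {
    "software engineer": "Software Engineer",
    "swe": "Software Engineer",
    "dev": "Software Engineer",
    "developer": "Software Engineer",
}

def normalize_title(raw_title: str) -> str:
    key = raw_title.strip().lower()
    return TITLE_MAP.get(key, raw_title.strip())  # fall back to the original if unmapped

print(normalize_title(" SWE "))  # -> "Software Engineer"
print(normalize_title("Dev"))    # -> "Software Engineer"
```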

Normalizing dates

  • Industry: finance
  • Scenario: a financial institution is preparing transaction histories for ingestion into a retrieval-augmented generation (RAG) system.
  • Issue: dates are recorded in a variety of formats (e.g., "MM/DD/YYYY," "DD-MM-YYYY").
  • Solution: convert all dates to a single standard format (e.g., "YYYY-MM-DD").
  • Implementation: use a date normalization tool to reformat all dates.
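
A minimal sketch using Python's standard library; the list of input formats is an assumption about the source data, and genuinely ambiguous dates (e.g., "05-10-2023") still need a business rule to disambiguate.

```python
from datetime import datetime

# Try each known input format and emit ISO 8601 (YYYY-MM-DD).
INPUT_FORMATS = ["%m/%d/%Y", "%d-%m-%Y", "%Y-%m-%d"]

def normalize_date(raw: str) -> str:
    for fmt in INPUT_FORMATS:
        try:
            return datetime.strptime(raw, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {raw}")

print(normalize_date("01/15/2023"))  # -> 2023-01-15
print(normalize_date("20-04-2023"))  # -> 2023-04-20
```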

Addressing missing values

  • Industry: education
  • Scenario: a university is training an AI on anonymized student performance data.
  • Issue: some records have missing grades or lack attendance information.
  • Solution: normalize missing values by imputing them or omitting the associated records.
  • Implementation: use statistical methods to fill in or flag missing data, or delete the records.
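
A minimal sketch of both strategies with pandas; the column names and values are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "student_id": [1, 2, 3, 4],
    "grade": [88.0, None, 92.0, 75.0],
    "attendance_pct": [0.95, 0.80, None, 0.88],
})

# Option 1: impute missing values with a column statistic (here, the mean).
imputed = df.fillna(df[["grade", "attendance_pct"]].mean())

# Option 2: drop records with any missing values instead.
dropped = df.dropna()

print(imputed)
print(dropped)
```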

Transforming logarithmic data

  • Industry: environmental science
  • Scenario: a research team is training an AI with data about pollutant concentrations.
  • Issue: concentration levels vary exponentially.
  • Solution: apply a logarithmic transformation to normalize data.
  • Implementation: use log normalization to scale pollutant concentration levels.
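
A minimal sketch using NumPy; the concentration values are illustrative.

```python
import numpy as np

# log1p (log of 1 + x) compresses exponentially varying concentrations
# and stays well-defined at zero.
concentrations = np.array([0.3, 4.2, 87.0, 1250.0, 96000.0])

log_scaled = np.log1p(concentrations)
print(np.round(log_scaled, 2))  # -> [ 0.26  1.65  4.48  7.13 11.47]
```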

Normalizing geographical data

  • Industry: real estate
  • Scenario: a real estate company is training an internal AI on property values.
  • Issue: property addresses are recorded in many different formats.
  • Solution: standardize address formats and geocode locations.
  • Implementation: use an address normalization and geocoding tool.

Equalizing survey responses

  • Industry: market research
  • Scenario: a market research firm is inputting survey data into an AI.
  • Issue: survey responses use different scales (e.g., 1-5, 1-10).
  • Solution: normalize all responses to a common scale.
  • Implementation: rescale all survey responses to a unified range.
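
A minimal sketch of rescaling responses from their native scales onto a common 0-1 range; the response values are illustrative.

```python
# Rescale a response from its native range to 0-1.
def rescale(value: float, src_min: float, src_max: float) -> float:
    return (value - src_min) / (src_max - src_min)

# A 4 on a 1-5 scale and an 8 on a 1-10 scale land near the same point.
print(rescale(4, 1, 5))   # 0.75
print(rescale(8, 1, 10))  # ~0.78
```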

Normalizing units of measurement

  • Industry: manufacturing
  • Scenario: a manufacturing company is preparing production data for an internal AI monitoring system.
  • Issue: measurements are recorded in different units (e.g., inches, centimeters).
  • Solution: convert all measurements to a single unit.
  • Implementation: use unit conversion tools to standardize all data measurements.
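
A minimal sketch of converting mixed units to a single base unit (centimeters); the conversion factors are exact, the measurements illustrative.

```python
# Convert mixed measurement units to centimeters before analysis.
CONVERSION_TO_CM = {
    "in": 2.54,
    "cm": 1.0,
    "mm": 0.1,
}

def to_cm(value: float, unit: str) -> float:
    return value * CONVERSION_TO_CM[unit]

print(to_cm(12.0, "in"))   # 30.48 cm
print(to_cm(305.0, "mm"))  # 30.5 cm
```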

Need help with data normalization?

Need help with data normalization? Book a free consultation and let us show you how we can tackle your data preprocessing and document preparation challenges. Check out our services for a full rundown of what we offer.

Work with Talbot West

Data normalization FAQ

Can you show an example of data normalization?

Here’s an example of using min-max scaling and other methods to normalize data with a wide variance.

Original data:

| Order ID | Product name | Price ($) | Order date |
|---|---|---|---|
| 1 | Widget A | 2500.00 | 01/15/2023 |
| 2 | Gadget B | 35.99 | 03/01/2023 |
| 3 | Widget A | 2400.50 | 2023-04-20 |
| 4 | Tool C | 150.75 | 05-10-2023 |

Normalized data:

| Order ID | Product name | Normalized price | Order date |
|---|---|---|---|
| 1 | Widget A | 1.00 | 2023-01-15 |
| 2 | Gadget B | 0.00 | 2023-03-01 |
| 3 | Widget A | 0.96 | 2023-04-20 |
| 4 | Tool C | 0.05 | 2023-05-10 |

Here are the normalization techniques we used to solve this:

Scaling numerical data:

  1. Price: prices are scaled to a range between 0 and 1.
  • Max price: $2500.00
  • Min price: $35.99
  • Normalized price = (price - min price) / (max price - min price)
  • Widget A: (2500.00 - 35.99) / (2500.00 - 35.99) = 1.00
  • Gadget B: (35.99 - 35.99) / (2500.00 - 35.99) = 0.00
  • Widget A (2400.50): (2400.50 - 35.99) / (2500.00 - 35.99) = 0.96
  • Tool C: (150.75 - 35.99) / (2500.00 - 35.99) = 0.05

Normalizing dates:

  1. Order date: all dates get formatted consistently as "YYYY-MM-DD."
  • 01/15/2023 becomes 2023-01-15.
  • 03/01/2023 becomes 2023-03-01.
  • 2023-04-20 remains unchanged.
  • 05-10-2023 becomes 2023-05-10.

When do you need data normalization?

When preparing your data for AI ingestion, normalization is a crucial step in the preprocessing stage. Anyone training an AI on their knowledge base or otherwise developing AI models should normalize data in the preparatory stages. Normalization matters most in the following situations:

  • Machine learning algorithms: algorithms that rely on distance measurements, such as K-nearest neighbors, support vector machines, and clustering algorithms, need normalized data for accuracy. This ensures no single feature dominates due to its scale.
  • Gradient descent optimization: for neural networks that use gradient descent for optimization, normalization helps achieve faster convergence. It balances the gradients across features, leading to more efficient training.
  • Comparative analysis: when features have different units or scales, normalization makes them comparable. This allows for more accurate analyses and insights.
  • Improving model performance: normalization can enhance the performance and stability of machine learning models, reducing the risk of numerical instability.
  • Equal contribution of features: normalization ensures that each feature contributes equally to the analysis, which is crucial in multivariate analysis.
  • Dealing with outliers: normalizing data can mitigate the impact of outliers on model performance, making the model more robust.

Are NoSQL databases normalized?

NoSQL ("not only SQL") databases are designed to handle large volumes of unstructured or semi-structured data, offering flexibility, scalability, and high performance.

Unlike traditional relational databases, NoSQL databases do not rely on a fixed schema and can store data in varied formats such as documents, key-value pairs, wide-column stores, and graph structures. Because of this flexibility, NoSQL databases are not typically normalized. Instead of adhering to normalization rules to minimize redundancy, NoSQL databases use denormalization to optimize read and write performance.

Is SQL or NoSQL faster?

Neither SQL nor NoSQL is universally faster; performance depends on the specific requirements and characteristics of the application.

SQL databases excel in structured data environments with complex queries, while NoSQL databases offer superior performance and scalability for large, unstructured datasets and high-throughput scenarios.

Which normalization form is best?

The best normalization form depends on your specific needs:

  • 1NF: ensures data is stored in atomic units.
  • 2NF: removes partial dependencies, best for composite keys.
  • 3NF: eliminates transitive dependencies, suitable for most general use cases.
  • BCNF: a stricter version of 3NF, ensuring more rigorous constraints.
  • 4NF and 5NF: used for handling very complex data relationships and ensuring minimal redundancy.

For most practical purposes, achieving 3NF is sufficient and balances data integrity with efficiency. For highly complex systems, BCNF or higher normal forms might be necessary. In performance-critical applications, strategic denormalization might be employed to optimize read operations.

What comes after data normalization?

After data normalization, we may need to apply other preprocessing methodologies to your knowledge base. Once the entire data preprocessing task is complete, your knowledge base is ready for AI ingestion.

Resources

  1. Wang, T. J., Du, H., & Lehmann, C. M. (2010). Accounting for the benefits of database normalization. American Journal of Business Education, 3(1), 41-50. Retrieved from https://files.eric.ed.gov/fulltext/EJ1060329.pdf
  2. Eessaar, E. (2016). The database normalization theory and the theory of normalized systems: Finding a common ground. Baltic Journal of Modern Computing, 4(1), 5-33. Retrieved from https://www.proquest.com/docview/1785391026
  3. Mingers, J., & Meyer, M. (2017). Normalizing Google Scholar data for use in research evaluation. Scientometrics, 113(3), 1203-1225. https://doi.org/10.1007/s11192-017-2415-x
  4. Izonin, I., Tkachenko, R., Shakhovska, N., Ilchyshyn, B., & Singh, K. K. (2022). A two-step data normalization approach for improving classification accuracy in the medical diagnosis domain. Mathematics, 10(11), 1942. https://doi.org/10.3390/math10111942
  5. Muhammad Ali, P. J. (2022). Investigating the impact of min-max data normalization on the regression performance of K-nearest neighbors with different similarity measurements. ARO-The Scientific Journal of Koya University, 10(1), 85-91. https://doi.org/10.14500/aro.10955

About the author

Jacob Andra is the founder of Talbot West and a co-founder of The Institute for Cognitive Hive AI, a not-for-profit organization dedicated to promoting Cognitive Hive AI (CHAI) as a superior architecture to monolithic AI models. Jacob serves on the board of 47G, a Utah-based public-private aerospace and defense consortium. He spends his time pushing the limits of what AI can accomplish, especially in high-stakes use cases. Jacob also writes and publishes extensively on the intersection of AI, enterprise, economics, and policy, covering topics such as explainability, responsible AI, gray zone warfare, and more.

