How to prepare data for GenAI

Blog Editorial Team

Thinking about implementing generative AI (GenAI) capabilities in your organization? Before you get started, examine the datasets you’ll be using to train your models and be sure they have the quality, features, security precautions and scalability you need to optimize your AI outcomes.

Since the launch of OpenAI’s generative AI (GenAI) tool ChatGPT in late 2022, organizations of all kinds have started exploring ways to incorporate its capabilities into their products, services and daily operations. Today, a rapidly expanding variety of GenAI tools enable companies to automatically write copy for websites and marketing, create images and videos, generate software code, analyze data, conduct research and much more.

Unlike traditional AI applications, GenAI tools aren’t trained using specific data for specific tasks but are built on foundation models using a vast amount of varied data – not just words, but images, video, audio and other types of information. These large bodies of training data enable the GenAI tools they power to generate accurate, intelligent-sounding responses to almost any prompt… or to occasionally “hallucinate” answers full of falsehoods.

To maximize the odds of the first outcome and minimize the chances of the second, it’s important to build your foundation model on high-quality data and best practices.

Data quality and pre-processing

Although foundation models are trained on a broader range of data than is used for task-specific AI applications, your training data should be relevant to the problems you’re looking to solve through GenAI. This requires you to draw from the same data sources that your people would use to find the answers they need.

After you’ve identified those sources, you’ll need to check and pre-process that data to make sure it’s accurate, reliable and verifiable. You should also ensure it’s well stored (ideally, in the cloud), secure and properly integrated into the systems you use. Depending on the type of data, you might also need to clean or update files, reformat items, resize images or make other revisions. In addition, it’s important to check for missing values and data gaps, and fill those in as required by adding new information from other data sources.
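As a rough illustration of the gap check described above, here is a minimal Python sketch. The record fields, the sample values and the `secondary` lookup are all hypothetical; a real pipeline would read from your own sources and would likely need type-specific rules for what counts as "missing."

```python
# Hypothetical records from a primary source; field names are illustrative only.
primary = [
    {"id": 1, "title": "Install guide", "updated": "2024-01-10"},
    {"id": 2, "title": "", "updated": None},
    {"id": 3, "title": "FAQ", "updated": None},
]

# Hypothetical second source, keyed by record id, used to fill gaps.
secondary = {2: {"title": "Release notes", "updated": "2024-02-01"}}


def is_missing(value) -> bool:
    """Treat None and blank strings as missing (an assumption for this sketch)."""
    return value is None or (isinstance(value, str) and not value.strip())


def find_gaps(records):
    """Return {record id: [names of missing fields]} for records with gaps."""
    gaps = {}
    for rec in records:
        missing = [key for key, value in rec.items() if is_missing(value)]
        if missing:
            gaps[rec["id"]] = missing
    return gaps


def fill_gaps(records, fallback):
    """Return copies of the records, filling missing fields from fallback where it has a value."""
    filled = []
    for rec in records:
        extra = fallback.get(rec["id"], {})
        new_rec = dict(rec)
        for key, value in rec.items():
            if is_missing(value) and not is_missing(extra.get(key)):
                new_rec[key] = extra[key]
        filled.append(new_rec)
    return filled


gaps_before = find_gaps(primary)
cleaned = fill_gaps(primary, secondary)
gaps_after = find_gaps(cleaned)

print("Gaps before:", gaps_before)
print("Gaps after:", gaps_after)
```

Anything still listed after the fill step (record 3 in this sample) has no value in the second source either, so it would need another source or a manual review.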

Feature engineering

Feature engineering involves manipulating or extracting information from existing raw data to create new kinds of variables or datasets that will be digestible by the foundation model you’re using. How you manage this process will depend on what you aim to achieve using GenAI.

Imagine that you want to build a tool to forecast demand for guest rooms at a large resort complex. Your training data will need to include key features such as types of rooms available, price per night, seasonal promotions, length of stay, when and how far in advance guests typically book rooms and so on. But you might then want to further refine some of that data. Are more online bookings made late in the evening, for example, or on weekends? By applying a more granular view to the date and time of bookings, you improve your ability to identify patterns and make more accurate predictions.
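To make the resort example concrete, the sketch below derives a few finer-grained features from a booking timestamp using only Python's standard library. The function name, the feature names and the 9 p.m. cutoff for "late evening" are assumptions made for illustration, not part of any particular forecasting tool.

```python
from datetime import datetime


def booking_features(booked_at: str, check_in: str) -> dict:
    """Derive illustrative features from ISO 8601 booking and check-in timestamps."""
    booked = datetime.fromisoformat(booked_at)
    arrival = datetime.fromisoformat(check_in)
    return {
        # Hour of day (0-23) when the booking was made.
        "booking_hour": booked.hour,
        # Assumed cutoff: 21:00 or later counts as "late evening".
        "booked_late_evening": booked.hour >= 21,
        # weekday() returns 5 for Saturday and 6 for Sunday.
        "booked_on_weekend": booked.weekday() >= 5,
        # Whole days between the booking date and the arrival date.
        "lead_time_days": (arrival.date() - booked.date()).days,
    }


# Hypothetical booking made on a Saturday evening for a stay about three weeks later.
features = booking_features("2024-03-09T22:15:00", "2024-04-01T15:00:00")
print(features)
```

Each derived value becomes a column that a model can use directly, which is usually easier to learn from than a raw timestamp string.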

Or say you want to use GenAI to automatically generate web content in another language based on existing English-language content on your site. You’ll need to test different search terms and questions that your non-English-speaking users are likely to use, preferably with support from native speakers who understand your intended audience. Good feature engineering means thinking about what kind of information your GenAI users will be looking for, and what datasets will be needed to generate useful and accurate responses. This might require you to seek input from domain experts to ensure outcomes that are fact-based and relevant.

Data privacy and security

Whatever kind of GenAI tool you use, it’s critical to understand the potential implications for data privacy and security. For example, some consumer-facing GenAI tools note that input by users might be used for future model training. As a result, users should avoid submitting prompts that include sensitive, proprietary or confidential information. Off-the-shelf tools might also follow different data residency and privacy requirements depending on the regions in which they operate.
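One simple precaution is to mask obviously sensitive strings before a prompt leaves your systems. The sketch below is a deliberately minimal example under stated assumptions: the two patterns (email addresses and 13- to 16-digit numbers that resemble card numbers) and the placeholder labels are illustrative, and pattern matching of this kind will miss many forms of confidential information, so it complements rather than replaces policy and review.

```python
import re

# Illustrative patterns only; real-world redaction needs far broader coverage.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# 13 to 16 digits, optionally separated by single spaces or hyphens.
CARD_LIKE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")


def redact(prompt: str) -> str:
    """Replace email addresses and card-like digit runs with placeholder labels."""
    prompt = EMAIL.sub("[EMAIL]", prompt)
    prompt = CARD_LIKE.sub("[NUMBER]", prompt)
    return prompt


# Hypothetical prompt containing test data, not real customer details.
raw = "Contact jane.doe@example.com about card 4111 1111 1111 1111."
safe = redact(raw)
print(safe)
```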

In a recent survey, Gartner found that 70% of legal, compliance and privacy leaders see GenAI as a top concern over the next two years. And the analyst group says that advanced technologies such as GenAI and cloud are driving increased spending in security and risk management, which is projected to grow by 14% to $215 billion in 2024.

It also predicts that, by 2026, AI deepfakes will mean that 30% of organizations won't find face biometrics alone reliable for verifying and authenticating identities.

Whether you’re using a third-party GenAI tool or developing your own in house, be sure to follow best practices for secure and responsible AI. Among the key precautions you should take: carefully review and verify your data; understand what your technology can and can’t do, and what it should and shouldn’t be used for; and test your inputs, models, systems and outputs regularly, adjusting as needed to improve results.

Scaling for GenAI

Finally, make sure that the technologies you choose can handle the large datasets required for effective GenAI applications – and can be scaled as required as your needs evolve and grow. For data processing, in particular, cloud-based or distributed computing solutions are generally preferable to on-premises systems.

Building your GenAI applications on a solid foundation requires you to prepare your training data in a way that can optimize your AI outputs and minimize the risks of AI “hallucinations.” SoftwareOne can support you with advisory, platform and solution services and our Intelligence Fabric methodology, designed to achieve data-driven and AI-powered success.

Learn more about what AI can do for you

SoftwareOne demystifies AI and helps your team understand the value and risks, pragmatically defining the capabilities needed for your organization to adopt data-driven practices and scale analytics and AI.

Reach out today to schedule a free 1-hour session for you and your team.
