Richard Winter on Data Platforms for AI and ML
Independent consultant Richard Winter explains what a data platform is, the role of generative AI, and how chatbots can protect data from public view.
In our latest podcast, Richard Winter, CEO and principal consultant at WinterCorp, discussed modern data platforms for advanced analytics, artificial intelligence, and machine learning. Winter will teach a session on data platform strategies for AI and ML at TDWI's Modern Data Leaders Summit in Chicago on April 30. His career as an independent consultant spans more than 30 years and has focused on researching, understanding, testing, analyzing, and evaluating data platforms, and on helping customers use them. [Editor’s note: Speaker quotations have been edited for length and clarity.]
To set the stage, host Andrew Miller asked Winter what a data platform is in the context of artificial intelligence. “Most people don't think of these data platforms that way. They think of them in the traditional data warehousing sense, as platforms for business intelligence, reporting, and dashboards. What started happening about 15 years ago is that some vendors began building the ability to do machine learning and certain advanced analytics into their data platforms rather than leaving it external, as was traditionally done.”
More vendors are doing similar work these days, including adding generative AI capabilities, Winter says. “Data scientists have traditionally done these tasks outside of data platforms, in specialized data science workbenches or environments. But data volumes have grown so large, and these technologies are being used at such scale, that it has become very important for some customers to move the processing closer to the data, within the data platform, where it is more efficient and scalable. For some customers, the only practical way to run machine learning, AI, and advanced analytics workloads in a timely manner is to run them in-database.” There are benefits beyond efficiency and scalability. Working within the data platform also affects how models move into production: Winter says it's faster, easier, less error-prone, and cheaper.
Existing ML models built outside the platform can be brought into it through a feature called Bring Your Own Model (BYOM). With this feature, models are developed elsewhere and exported in a model interchange language such as PMML, which the platform can then import. Data scientists may have strong preferences for a particular tool, and BYOM lets them keep using it while still making it easy to move models into production.
Generative AI
What does Winter think about generative AI? “ChatGPT and chatbots in general are primarily consumer-facing uses of generative AI. In enterprise environments and B2B applications, when you ask a question of generative AI, the answer involves more than a large language model. It involves retrieving data from an enterprise data warehouse.
“For example, an insurance claims application might let you see who was involved in an accident and detect insurance fraud by spotting repeat claimants. That question can be answered with a traditional database query, but you can also have a generative AI app create the query and answer it by retrieving data from a database.”
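As a rough illustration of the "traditional database query" side of that example, the sketch below finds people who appear in more than one claim. The table name, columns, and rows are invented for illustration and do not come from the podcast; SQLite stands in for whatever data platform a company actually uses.

```python
import sqlite3

# Hypothetical schema and sample rows, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE claim_parties (claim_id INTEGER, person TEXT, role TEXT)")
conn.executemany(
    "INSERT INTO claim_parties VALUES (?, ?, ?)",
    [
        (101, "A. Jones", "claimant"),
        (101, "B. Smith", "witness"),
        (102, "A. Jones", "claimant"),
        (103, "C. Lee", "claimant"),
        (104, "A. Jones", "passenger"),
        (104, "B. Smith", "claimant"),
    ],
)


def repeat_parties(conn, min_claims=2):
    """Return (person, claim_count) for people who appear in at least
    min_claims distinct claims, most frequent first."""
    sql = """
        SELECT person, COUNT(DISTINCT claim_id) AS n_claims
        FROM claim_parties
        GROUP BY person
        HAVING COUNT(DISTINCT claim_id) >= ?
        ORDER BY n_claims DESC, person
    """
    return conn.execute(sql, (min_claims,)).fetchall()


repeats = repeat_parties(conn)
print(repeats)
```

In the generative AI variant Winter describes, an application would produce a query like this one from a plain-language question instead of an analyst writing it by hand.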
In generative AI, many use cases involve similarity search. Rather than searching by exact match (the way most database queries are written), you're saying, “This is a thing or an idea, and I want to know if there is anything similar.” Or, if a user is asking about a broad subject, you may want to retrieve all records that are broadly related to the question.
“That similarity search is done on large amounts of data using vector indexes. Popular data warehouse platforms now have or are adding vector indexes,” Winter explains. “These platforms are also adding the ability to create vectors (known as vector embedding), store vectors, and search using vectors. All these features are built into the data platform.”
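To make the idea of searching by similarity rather than exact match concrete, here is a minimal sketch that ranks a few records by cosine similarity to a query vector. The three-dimensional vectors are made up by hand; real embeddings come from a model and have far more dimensions. The sketch also compares the query with every record, whereas a vector index of the kind Winter mentions exists to avoid that exhaustive scan on large data.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


# Toy "embeddings", invented for illustration only.
documents = {
    "rear-end collision claim": [0.9, 0.1, 0.0],
    "windshield damage claim": [0.7, 0.3, 0.1],
    "quarterly sales report": [0.0, 0.2, 0.9],
}


def top_k(query_vec, docs, k=2):
    """Brute-force search: score every document, return the k most similar names."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in docs.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]


query = [0.8, 0.2, 0.0]  # hypothetical embedding of a question about a car accident
result = top_k(query, documents)
print(result)
```

Neither of the two claim records matches the query exactly, yet both rank above the unrelated sales report, which is the behavior an exact-match query cannot provide.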
There's still a lot to consider, warns Winter. “There are differences between data platforms, and those differences become even greater as the requirements become more difficult. If you have a relatively small database and the same day-to-day requirements as a million other companies, you'll probably be able to meet your requirements with a popular platform. But if you have a large data warehouse, or your requirements are somehow more complex than the average user's, if you're in the 5% of companies that have stringent requirements, it's very important how Platform A differs from Platform B, not just for business intelligence but also for machine learning, generative AI, and related topics.”
Data protection
With so many people using their data as input to public-facing applications and potentially exposing that data outside the company, what are companies doing to keep it safe? “ChatGPT and the equivalents provided by other vendors use a very large language model. It's really huge. It incorporates trillions of documents, and most companies do not build a model like that of their own. What they do instead is take a small language model that is populated with, say, billions of documents and train it on private data. As long as it is configured correctly, there is no risk of data being exposed to the open Internet through the operation of such a model.
“Another scenario is to use a public model such as ChatGPT but enrich the answers you get by generating queries against private data. This is called retrieval-augmented generation. When you ask a question, the application uses the large language model to understand the question. It then generates queries against company data to retrieve the specific information and delivers an answer. This must be done in a way that keeps private information protected and out of the general model.”
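The flow Winter outlines (interpret the question, generate a query, retrieve private data, deliver an answer) can be sketched as follows. This is an illustration under stated assumptions, not any vendor's implementation: `generate_query` and `compose_answer` are hypothetical stand-ins for calls to a language model and return canned output so the example runs offline, and the `claims` table and its rows are invented.

```python
import sqlite3


def generate_query(question):
    """Hypothetical stand-in for a language model turning a question into SQL.
    A real system would call a model here; this stub returns a fixed,
    parameterized query."""
    return "SELECT claim_id, status FROM claims WHERE customer = ? ORDER BY claim_id"


def compose_answer(question, rows):
    """Hypothetical stand-in for the model phrasing an answer from retrieved rows."""
    if not rows:
        return "No matching records were found."
    facts = "; ".join(f"claim {claim_id} is {status}" for claim_id, status in rows)
    return f"Based on company records: {facts}."


def answer(conn, question, customer):
    sql = generate_query(question)                     # 1. interpret the question
    rows = conn.execute(sql, (customer,)).fetchall()   # 2. retrieve from private data
    return compose_answer(question, rows)              # 3. ground the answer in the rows


# Invented private data, held in a local in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE claims (claim_id INTEGER, customer TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO claims VALUES (?, ?, ?)",
    [(1, "Acme", "open"), (2, "Acme", "paid"), (3, "Globex", "open")],
)

reply = answer(conn, "What is the status of Acme's claims?", "Acme")
print(reply)
```

In this sketch the retrieved rows are used only to build the answer at question time and are never written back into a model. That separation is the property Winter says a real deployment has to guarantee.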
[Editor’s note: You can replay the podcast on demand here.]