The Impact of GPT and Generative AI Models on People Analytics (Interview with Andrew Marritt)
Since ChatGPT was launched in November 2022, there has been a surge of interest in, and some concern about, the potential impact of GPT and generative AI models on business, HR and workers. One recent report by Goldman Sachs suggests that AI could replace the equivalent of 300 million full-time jobs and automate a quarter of work tasks in the US and Europe. However, the same report also predicts that, as in previous industrial revolutions, new jobs will emerge alongside a productivity boom.
What will the impact be on HR and specifically on people analytics? To find out more, I was delighted to recently catch up with Andrew Marritt - someone who has his finger firmly on the pulse when it comes to large language models and text analytics. Andrew is the CEO of OrganizationView, a specialist firm that uses quantitative research methods augmented by AI to analyse employee feedback and workforce data. Andrew has been working in the people analytics field since 2008, publishes the brilliant Empirical HR newsletter, and is an instructor on the myHRfuture Academy course Applying Text Analytics to HR Data.
I started the discussion by asking Andrew to provide more information about GPT models:
What are GPT models, and where will they (and similar models) be used in People Analytics?
Today, GPT models are best known via ChatGPT, a model introduced by OpenAI in November '22 but already surpassed by their GPT-4 models. They belong to a class of models often called 'foundation models' or 'large language models' (LLMs). There is a rapidly growing number of these generative models available to the analyst. As well as the OpenAI suite, other interesting options include Cohere's models, Anthropic's Claude and Google's PaLM. (For simplicity, I'm going to discuss the OpenAI models here.)
The current LLMs are mostly created by a process called self-supervised learning. The models are trained on very large volumes of text in which words are masked out, and the model learns to predict the hidden word. In this way, they're probabilistic models - they don't understand the words, but given a large enough context they can often accurately predict the next one or ones. At the moment, one way of making these models more performant seems to be increasing the number of tokens (words) they are trained on, so we're seeing a race to produce larger and larger models.
The recent LLMs also use a secondary process called reinforcement learning from human feedback (RLHF). A team of human reviewers rates machine-generated text, judging which responses are the most 'human-like'. This process ensures that the resulting generated text reads naturally.
We think of these models as having two characteristics. The first is the generation of human-like text, which they do very well. The second is that they learn 'facts'. Consider the sentence, 'The capital of France is Paris.' A model built to predict the final word from a large volume of data would, through learning to predict the next word, learn that 'Paris' is the most likely word to follow the phrase 'the capital of France is'.
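To make the 'probabilistic next word' idea concrete, here is a toy sketch in Python - a simple bigram counter, nothing like the billion-parameter neural networks real LLMs use, but it shows how word statistics alone can make 'Paris' the most likely completion:

```python
from collections import Counter, defaultdict

# A tiny corpus standing in for the web-scale text a real LLM is trained on.
corpus = (
    "the capital of france is paris . "
    "the capital of italy is rome . "
    "the capital of france is paris ."
).split()

# Count how often each word follows each other word (a bigram model).
next_word_counts = defaultdict(Counter)
for current, following in zip(corpus, corpus[1:]):
    next_word_counts[current][following] += 1

def predict_next(word):
    """Return the most probable next word and its estimated probability."""
    counts = next_word_counts[word]
    best, freq = counts.most_common(1)[0]
    return best, freq / sum(counts.values())

print(predict_next("is"))  # ('paris', ~0.67): seen more often than 'rome'
```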
Our view is that the strength of these models is language generation and that knowledge recall is a side effect. It's a side effect because if your primary focus were knowledge recall, you'd probably use a different model. A large language model would know that 2+2=4, but a simple calculator would do calculations faster, at a lower cost and with more accuracy. If I needed to solve a calculation and had the choice of an LLM or a calculator, I'd take the calculator.
You could think about this differentiation in AI terms as a split between "Creative AI" and "Precision AI". I think as an analyst, you need to be very clear whether what is important is creativity or precision. For many tasks a Creative AI is what you'd want. For most People Analytics tasks you probably want Precision.
As someone working in this area, was the advent of ChatGPT a surprise?
No, not really - there are lots of other models - but ChatGPT did have some interesting characteristics that I hadn't anticipated. With hindsight, I think it was the ease of use of ChatGPT's UI that fired the public imagination. In some ways, you could say ChatGPT's key innovation was a UI that moved LLMs from the developer community to the general public.
We've been seeing rapid increases in performance for some time. ChatGPT was an improvement over OpenAI's previous GPT-3, especially for conversational uses. GPT-4 is again a big step up from ChatGPT in terms of the accuracy and usefulness of the models.
I think what has surprised me in some ways was how far OpenAI brought down the cost of using ChatGPT compared to the earlier GPT-3 models. For me, a key differentiating factor between those earlier GPT-3 models and the later ones (ChatGPT and GPT-4) is that some of the earlier models can be fine-tuned. For the analyst, this has distinct advantages, especially for a task such as classification.
Coming onto the more recent GPT-4 model, this counters some - but not all - of the issues of ChatGPT. It's undoubtedly more accurate. What is potentially more important and useful is the significantly longer context that it can use - around 8,000 tokens as standard and, I believe, 32,000 tokens on request. This opens up considerably more opportunities for some use cases.
Finally, the recent introduction of ChatGPT plugins (https://openai.com/blog/chatgpt-plugins), launched at the end of March '23, is possibly the most exciting addition for many users. These solve the issue I described above for a calculation - they identify when a different API would be better suited to answering the question and call that API to get the answer.
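The plugin idea can be illustrated with a minimal routing sketch (hypothetical code, not the actual plugin protocol): detect when a request is really a calculation and hand it to a precise tool, reserving the language model for open-ended questions.

```python
import re

def call_llm(prompt: str) -> str:
    # Placeholder for a real LLM call (e.g. via the OpenAI API).
    return f"[LLM answer to: {prompt}]"

def answer(prompt: str) -> str:
    """Route simple arithmetic to an exact 'calculator'; all else to the LLM."""
    match = re.fullmatch(r"\s*(\d+)\s*([+\-*/])\s*(\d+)\s*", prompt)
    if match:
        a, op, b = match.groups()
        ops = {"+": lambda x, y: x + y, "-": lambda x, y: x - y,
               "*": lambda x, y: x * y, "/": lambda x, y: x / y}
        return str(ops[op](int(a), int(b)))  # precise, cheap, deterministic
    return call_llm(prompt)  # creative or open-ended requests go to the model

print(answer("2+2"))                   # '4' - answered without the LLM
print(answer("Summarise our survey"))  # handled by the LLM
```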
You wrote an article late last year about the importance of training data. Does that still apply to these large language models?
Of course. One could argue that, with the use of human feedback in their development, these models themselves rely on human-created training data.
From the context of the typical use cases of people analytics teams, I suspect that small organisations - those that typically don't have people analytics staff - could perform simple tasks with something like GPT-4. If you have a few hundred answers and someone who can call the API, you could easily do some analysis of your survey comments, especially simple tasks like summarisation.
For large organisations, it's likely to be more complex. Firstly, even the largest contexts (the 32k tokens possible with GPT-4) are likely insufficient for most survey datasets. Secondly, you probably want to define which themes you want to use, how you define each theme and, as importantly, what isn't part of that theme - this matters greatly for the consistency of model results. Run an LLM twice over the same context and you'll get two different answers. Finally, there is the cost, both in terms of time and money.
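As a hedged sketch of the kind of survey-comment task described above, the snippet below classifies one comment against explicitly defined themes using the openai Python library's chat interface (the v0.x `openai.ChatCompletion` API that was current at the time of this interview). The theme names and definitions are purely illustrative, and the temperature is pinned to 0 to reduce - though not eliminate - run-to-run variation:

```python
import openai  # pip install openai (the v0.x interface is assumed here)

openai.api_key = "YOUR_API_KEY"  # placeholder

# Define the themes up front, including what does NOT belong to each,
# to make the results as consistent as a generative model allows.
THEMES = """\
Career Development: promotions, learning, progression. NOT pay or benefits.
Compensation: salary, bonus, benefits. NOT workload.
Other: anything that fits neither theme above.
"""

def classify_comment(comment: str) -> str:
    """Ask the model to map one survey comment onto a predefined theme."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        temperature=0,  # reduces (but does not remove) answer variability
        messages=[
            {"role": "system",
             "content": ("Classify the comment into exactly one theme:\n"
                         f"{THEMES}Reply with the theme name only.")},
            {"role": "user", "content": comment},
        ],
    )
    return response["choices"][0]["message"]["content"].strip()

print(classify_comment("I never get time to attend training courses."))
```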
What are the weaknesses of these LLMs as you see them?
There is a series of well-understood limitations of these LLMs. The first, and probably the most important, is that they require enormous resources to train, which puts them out of the reach of the vast majority of even large corporates. Historically, training your own language models made sense, as domain-specific models often outperform larger generic ones. We have trained our own embeddings for some time; LinkedIn has 'LiBERT', its own domain-specific version of the earlier BERT model.
The resource requirement also makes model updating difficult. The latest OpenAI GPT models haven't seen text created since 2021, so they can't draw on any later data without something like the plugin approach.
The limitation we've battled with for some time is 'hallucination' - basically, these models have a tendency to make things up, often presenting their fictitious answers in a very confident manner. We developed a hallucination detector for our use cases to clean the answers provided by an API; it's a huge improvement, but not perfect. Others are doing similar things, and we're seeing an increasing number of papers on the topic.
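Andrew doesn't describe how OrganizationView's detector works, but one published family of approaches (SelfCheckGPT-style sampling) flags likely hallucinations by asking the same question several times and measuring agreement. A minimal sketch of that idea, again assuming the v0.x openai interface:

```python
import openai
from collections import Counter

def ask_llm(question: str) -> str:
    """One sampled answer; temperature > 0 so repeated calls can differ."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        temperature=0.7,
        messages=[{"role": "user", "content": question}],
    )
    return response["choices"][0]["message"]["content"].strip()

def looks_hallucinated(question: str, n_samples: int = 5,
                       threshold: float = 0.6) -> bool:
    """Low agreement across repeated samples is a warning sign."""
    # Real systems compare meaning rather than exact strings; exact
    # matching keeps this sketch simple.
    answers = [ask_llm(question) for _ in range(n_samples)]
    _, top_freq = Counter(answers).most_common(1)[0]
    return top_freq / n_samples < threshold
```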
For many real-world applications, making a good decision requires not only the text data but other datasets that provide context. Multi-modal models have come a long way (GPT-4 will be able to handle images), but tabular data - the sort that sits in most of our data warehouses - is still poorly served by modern deep-learning algorithms, and LLMs currently have few answers to this challenge.
The final issues we've come up against are practical ones: using the latest models like GPT-4, or even ChatGPT, is slow and can quickly become expensive. The APIs are also not as stable as the other APIs we're used to using. We couldn't deliver at the speed our clients expect, and do what we do, with the GPT-4 API alone.
How are OrganizationView using LLMs?
We've been experimenting with LLMs for some time now, but in many use cases we don't see them offering advantages over our more traditional AI models. That being said, we are using them in our pipeline in a few instances, all aligned to the 'Creative AI' use cases.
A few examples of where we're using them include:
Data cleaning/pre-processing
Synthetic data generation
Summarisation - e.g. creating a human-readable label for a cluster.
In all instances, we're using them in combination with other models. In our experience, this combination of LLMs and traditional AI models is where they're most useful, and we use a mix of different LLMs depending on the task to be performed.
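As a hedged illustration of that combination (and of the cluster-labelling item in the list above): a traditional, precise pipeline - TF-IDF vectors and k-means from scikit-learn in this sketch, not necessarily what OrganizationView uses - does the grouping, and the LLM only does the 'creative' step of writing a readable label for each cluster.

```python
import openai
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

comments = [
    "No clear path to promotion here.",
    "The training budget is impossible to access.",
    "My manager never gives feedback.",
    "Weekly 1:1s with my manager are constantly cancelled.",
]

# Traditional models do the precise work of grouping similar comments...
vectors = TfidfVectorizer().fit_transform(comments)
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

# ...and the LLM does the 'creative' part: a human-readable cluster label.
def label_cluster(texts: list[str]) -> str:
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[{"role": "user",
                   "content": ("Give a 2-4 word label for these comments:\n"
                               + "\n".join(texts))}],
    )
    return response["choices"][0]["message"]["content"].strip()

for cluster_id in sorted(set(clusters)):
    members = [t for t, c in zip(comments, clusters) if c == cluster_id]
    print(cluster_id, label_cluster(members))
```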
Outside PA, how do you think HR will adopt LLMs? Any good use cases?
With the release of GPT-4 and our experimentation with it, I think we'll rapidly see LLMs being used almost anywhere an open-text box appears in an application. It really is a step up from ChatGPT, and I can see it having a wide range of use cases.
We'll also see a lot of uses that fit the 'Creative AI' category. One that I really like comes from the recruitment marketing firm Cliquify. They built a feature into their product that generates new LinkedIn posts from a client's marketing messages for employees to share. Each post is unique but draws on the corporate language and marketing messages, so it looks as if it had been written by the employee sharing it.
I think for these creative AI applications, the best uses will include a human in the loop - mostly to review a text before applying it. This reduces the issues with hallucination.
What does the adoption of LLMs and the general increase in use of data and analytics mean in terms of the capabilities HR professionals require?
For most HR teams, we realised some time ago that the value is not created by the text analysis itself but by the downstream tasks - turning the structured data into recommendations and outcomes. To this end, we developed a structured, co-creation-based methodology to help our clients realise the value of their text data. The opportunities provided by LLMs don't really change this: you need to pull together the structured text data, the statistical analyses, other data and so on into a concise and cohesive whole to inform better decision-making. I don't see this as a problem that can be solved with technology on its own, which is why we're moving clients to a technology-augmented services approach, either working as part of their team or developing the skills internally.
I think the big shift for most PA teams is that using these LLMs really requires engineering skills over traditional data science skills. You're calling an API and then often having to parse the response. It's probably going to be more aligned with the skills of a data engineer than a data scientist.
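In practice, 'calling an API and parsing the response' looks something like the sketch below - the engineering work is in validating and retrying, because a generative model won't always return well-formed output. (Assumptions: the v0.x openai interface and an illustrative prompt that asks for JSON.)

```python
import json
import openai

def classify_as_json(comment: str, retries: int = 2) -> dict:
    """Engineering, not data science: call, validate, retry on bad output."""
    prompt = ('Return ONLY JSON like {"theme": "...", "sentiment": "..."} '
              f"for this survey comment: {comment}")
    for _ in range(retries + 1):
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            temperature=0,
            messages=[{"role": "user", "content": prompt}],
        )
        raw = response["choices"][0]["message"]["content"]
        try:
            return json.loads(raw)  # the model may return malformed JSON
        except json.JSONDecodeError:
            continue  # retry; in production you'd also log the failure
    raise ValueError(f"No valid JSON after {retries + 1} attempts")
```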
If I was leading a PA team, I'd be thinking about aligning resources to develop little applications for HR that have LLMs at their heart. There are probably hundreds of use cases for simple apps, and it's probably a good use for a low-code application approach. At the same time, I'd want to dedicate resources to model checking and monitoring on an ongoing basis. You probably also need to think about audit-type issues.
I don't see most HR professionals needing to develop specialist skills in LLMs, though 'prompt engineering' - the discipline of writing good prompts - is an accessible skill that can be learnt and added to their portfolio. I personally think the majority of HR teams will rely on others' prompts.
What's really important to realise, though, is that an LLM provides a qualitative response. At the moment, LLMs can't perform statistical analysis - even something as simple as identifying what proportion of the population is discussing 'Career Development'.
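That statistical step therefore stays outside the model: once each comment has a label (for example, from a classification call like the earlier sketch), the proportion is ordinary code, as in this minimal illustration with made-up labels:

```python
# Labels per comment, e.g. produced upstream by an LLM classification step.
labels = ["Career Development", "Compensation", "Career Development", "Other"]

share = labels.count("Career Development") / len(labels)
print(f"{share:.0%} of comments discuss Career Development")  # 50%
```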
Where do you see this taking us in the next 24 months?
It's clear that the best LLMs can claim leadership for only a very short time. ChatGPT was the shiny new thing in November, and it has already been surpassed.
Many of the issues with LLMs can probably be dealt with via the plugin-type architecture that OpenAI has announced and many others are developing. I'm also convinced that someone will identify a way of updating these models without having to train them again from scratch.
The change most people will notice is that technology firms will completely rethink their UIs. Very shortly, many firms will offer some form of human augmentation that rests on LLMs.
In terms of HR, I suspect that the first functions to be impacted will be the employee support teams. If I were working in a helpdesk-type area, I'd be thinking about my career direction. I would be surprised if firms had many such teams in five years, and it wouldn't surprise me if they're mostly gone within three. The LLMs will be able to deal with the majority of requests themselves and, where they can't, automatically pass the request to a specialist.
I think that for People Analytics functions wanting to analyse text, LLMs can't be the full answer. In the text / NLP community, they are just one niche at the moment, and there is still a lot of mileage in more traditional approaches. If you have a specialist application, you'll likely find a specialist approach will outperform a general one like an LLM.
As I mentioned above, even the best text analytics can only provide a narrow view of the data and the world it represents. There's a real skill in building mixed-methods analyses and combining rich text data with other contextual information into concise and valuable business recommendations.
THANK YOU
Thanks to Andrew for sharing his time, perspective and expertise with readers of the myHRfuture blog. If you want to find out more about his work and OrganizationView, you can connect with Andrew on LinkedIn, follow him on Twitter @AndrewMarritt, subscribe to Andrew's Empirical HR newsletter and visit the OrganizationView website.
If you'd like to explore more of Andrew's recent work, do take a look at some of the links below:
Why great training data is the key to text analysis in people analytics (the article mentioned above): https://www.organizationview.com/insights-articles/2022/12/19/why-great-training-data-is-the-key-to-text-analysis-in-people-analytics
How to use text data to prioritise employee experience projects
An introduction to mixed methods approaches: https://www.organizationview.com/insights-articles/2022/6/16/four-ways-to-use-qualitative-data-in-people-analytics
ABOUT THE AUTHORS
Andrew Marritt
Andrew founded OrganizationView, his people analytics practice, in 2010 after an early career in industry and management consulting. Since 2015, OrganizationView have specialised in text analysis and today work with a large number of global firms, helping them make use of text feedback to increase the effectiveness of their organizations. Andrew is an advocate of using mixed-methods approaches to people analytics to increase the effectiveness of business decisions. He regularly teaches and speaks on HR text analytics to HR and text analytics audiences.
David Green
David is a globally respected author, speaker, and executive consultant on people analytics, data-driven HR and the future of work. With lead responsibility for Insight222’s brand and market development, David helps chief people officers and people analytics leaders create value with people analytics. David is the co-author of Excellence in People Analytics, host of the Digital HR Leaders podcast, and regularly speaks at industry events such as UNLEASH and People Analytics World. Prior to co-founding Insight222, David worked in the human resources field in multiple major global companies including most recently with IBM.
Our myHRfuture course ‘An Introduction to AI in HR’ empowers you to gain the foundational knowledge surrounding AI in HR that you need to solve real HR challenges and facilitate HR Processes using AI.
Our unique mix of training courses, videos, interviews, podcasts, case studies and articles help you build hands-on skills while providing real-life context to what you’ve learnt.
By completing this course, you'll build on your knowledge of AI in HR and gain a solid foundation in what AI is and its impact on business, society and the HR profession. Enroll now!