Final Project: ChatGPT UX Case Study

Introduction and Project Focus:

Perhaps it’s a fool’s errand to attempt to improve the design and user experience of the fastest-growing app in history. ChatGPT reached 1 million users within 5 days of its release and 100 million users in just two months. Some if-it-ain’t-broke-don’t-fix-it wisdom might seem to apply to such a wildly successful product, but UX experts Gavin Lew and Robert Schumacher issue an important warning in their 2020 book AI and UX: users who encounter problems early in their use of any product are likely to abandon it, often never to return. Therefore, if OpenAI (the company behind ChatGPT) wants to hold on to its massive user base, attract and keep new users, and (equally importantly) transition users to its upgraded paid version, it is imperative that the free ChatGPT 3.5 generative AI model deliver an accessible design and experience without frustrating flaws. But to create the most desirable experience, UX designers must first figure out what their users are seeking and what exactly they need from the product.

For this study of ChatGPT, the issue of use cases is the first area of focus. Beyond the UX problems related to serving the most appropriate and popular use cases, two other aspects of the ChatGPT experience merit attention: interface design/functionality and response accuracy. Informed by industry research, academic research, and focus group testing, this case study presents analysis and recommendations for ChatGPT’s improvement in all three areas.

Competitive Analysis:

A recent (February 2024) competitive analysis study from Copenhagen Economics looks at the current state of generative AI:

https://copenhageneconomics.com/wp-content/uploads/2024/03/Copenhagen-Economics-Generative-Artificial-Intelligence-The-Competitive-Landscape.pdf

Included in the report’s executive summary is the following:

“This white paper provides a preliminary review of current developments in the GenAI space and the emerging implications for competition. Further research is needed to shed additional light on the evolving competitive situation in GenAI, as well as any case-specific matters, particularly as this is a fast-evolving sector.”

Also included is this analysis of the most notable generative AI models:

“First, most foundation model developers (e.g., OpenAI, Anthropic, Mistral) already provide their own user-facing applications. At the same time, many foundation model developers offer rival developers the ability to build applications on top of their existing models via open models or closed APIs (e.g., Jasper.ai writing assistant built on OpenAI GPT models).

 Second, players in different levels of the value chain often specialise in specific domains (e.g., BloombergGPT in finance, or Isomorphic Labs in the pharmaceutical drugs discovery process) or tasks (GitHub coding copilot, or Perplexity.ai in search). The success of these models suggests that specialisation is a viable business model.”

The second observation there is especially important for this ChatGPT case study because it reveals a difference between models that serve specific domain purposes and those (like ChatGPT, claude.ai, Llama 2, and Gemini) that function more generally, without strongly defined purposes.

There are also a number of comprehensive comparison guides for generative AI models, and this February 2024 guide does an effective job of assessing a handful of the most common use cases:

https://www.zdnet.com/article/best-ai-chatbot/

This comparison guide offers valuable information for the UX researcher by pointing out some of the pros and cons of each model. Its analysis suggests that ChatGPT 3.5 lacks two distinct features that some other free generative AI models offer: a live connection to the internet and the ability to upload files for analysis. While the lack of these features may hurt ChatGPT 3.5’s ability to compete with other free generative AI models, the paid subscription upgrade (ChatGPT 4) does offer both internet connectivity and file uploads. From a profit standpoint, it may make sense for OpenAI not to offer every available feature in the free version, but if other companies offer those features at no charge, then those companies are positioned to deliver a superior overall user experience among free generative AI models. To remain competitive, ChatGPT 3.5 should incorporate internet connectivity and file uploading; OpenAI could limit file sizes and the number of internet-connected responses in order to maintain an incentive for users to upgrade. At minimum, it seems important to demonstrate for users what the upgraded version of a product is capable of.

Issue #1: Identifying Use Cases

Understanding a product’s primary use is key to designing the best experience for the user. Designers are used to having an end goal or ultimate purpose in mind when they seek to improve how a particular product works. If the product is a shovel, the designers know to work on creating the best digging experience for the target users. Understanding that products could have secondary or multiple uses is important as well, and sometimes those uses must be factored into design. In the case of a Swiss Army Knife, the designers considered a number of different uses for their product and then chose to accommodate all of those uses within a compact device.

ChatGPT and other similar generative AI tools are unusual among human-created technologies because, unlike most tools, which are created to solve defined problems, generative AI is more like a solution looking for problems to solve. This is evident in ChatGPT’s user interface, which currently presents users with four different suggestions for how it can “help you today.”


Google’s generative AI model Gemini presents users with a very similar set of use options. While these options may change daily or upon each new visit to the tool, this example from Gemini shows two of its four options relating to cooking, which makes Gemini appear limited in what it can do. Even the first and fourth suggestions, about brainstorming presentation ideas and outlining a logical sales pitch, are similar to each other.

Many first-time users of ChatGPT or other models likely arrive at their chosen tool with no defined purpose, so it may be helpful to include some very specific options for how the product can be used. However, the options currently provided on the webpage may not be the best strategy for engaging curious explorers: they give some sense of what ChatGPT might be able to do, but they are also overly specific. A better approach might be to highlight the broader tasks ChatGPT can help with, along with multiple examples of each task (instead of just one), and then have the tool initiate the chat and help the user narrow down to something specific they want or need help with. Here is a mockup:

Issue #2: Interface Design and Functionality

Method for testing interface and functionality:

A group of 12 first-year college students used ChatGPT for a prewriting activity designed to help them explore a topic or tentative thesis statement. The use case was thus pre-determined and defined for them, although they had a choice of which initial prompt to use.

The demographics of the student test group reflect the college they attend: racially diverse, majority female, and mostly traditional college students aged 18-24. While different groups of users will have different needs and expectations of ChatGPT, this group is a likely target audience for generative AI models: students who learn to use the technology in college may carry their use of generative AI with them into their career fields.

Prompt options for product test:

Prompt for exploring a topic:
Please help me prepare for writing my persuasive essay about [insert topic]. Ask me questions (one at a time) that will help me come to a thesis and understand the strengths and weaknesses of my position and any opposing view(s).

 

Prompt for finding and understanding potential credible sources:
Can you point me to some prominent researchers who have published on [insert topic] and very briefly summarize their positions?


Prompt for testing a tentative thesis when you have a guess of what your position might be:
Let’s engage in a dialectic exercise. I will present my tentative thesis, and then you will take the role of Socrates and ask me yes or no questions (one at a time) to expose any potential flaws or contradictions in my position.    
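For illustration, the prompt options above could be parameterized before being sent to a chat model. This sketch shows one way to substitute a student's topic into a template and package it as a chat-style message list; the helper function, example topic, and payload shape are illustrative assumptions, not ChatGPT's actual internals.

```python
# Sketch: fill one of the study's prompt templates with a student's topic
# and wrap it as a chat-style message list. The template text comes from
# the prompt options above; everything else is an illustrative assumption.

TOPIC_PROMPT = (
    "Please help me prepare for writing my persuasive essay about "
    "{topic}. Ask me questions (one at a time) that will help me come "
    "to a thesis and understand the strengths and weaknesses of my "
    "position and any opposing view(s)."
)

def build_messages(template, **fields):
    """Substitute fields into a prompt template and wrap it as a user message."""
    return [{"role": "user", "content": template.format(**fields)}]

# Hypothetical topic for demonstration only.
messages = build_messages(TOPIC_PROMPT, topic="campus food waste")
```

Pre-filling templates this way would let an interface offer the study's defined use cases as one-click starting points while still letting the student supply the topic.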

Data Collected: Students submitted their ChatGPT pre-writing activity conversations in the form of sharable links provided by ChatGPT. The user interface of ChatGPT does include a helpful “share” button at the top right of the screen (desktop version).

This share button creates a special link that captures an entire chat session that others can view without having to access the original user’s account. The user can also choose whether to remain named or anonymous in the shared version of the chat. Users in this case study generally found the share feature offered by ChatGPT to be easy to locate and use.

The researcher also conducted a focus group to gather the students’ opinions and insights regarding the design and their experience.

The full ChatGPT conversations collected from students are of limited value for UX purposes, but from a qualitative perspective they reveal something important: each student prompted and replied to ChatGPT in very different ways. Some used formal, properly punctuated language, while others were far more informal, using abbreviations and texting language. Despite this, ChatGPT’s responses showed no variation in tone or style. It might make sense to have ChatGPT adjust its formality according to user inputs, or to make the level of formality an adjustable option for the user.
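One way to approximate formality-matching is a simple heuristic that scans the user's input for texting-style cues and selects a matching tone instruction. This is a minimal sketch under stated assumptions: the cue list, threshold, and instruction strings are invented for illustration and do not reflect how ChatGPT actually works.

```python
# Sketch: naive formality detection on a user's message, used to pick a
# matching tone instruction. Cue words, threshold, and instruction text
# are illustrative assumptions only.

INFORMAL_CUES = {"lol", "omg", "u", "ur", "gonna", "wanna", "idk", "btw"}

def detect_formality(user_text):
    """Classify a message as 'informal' if it contains any texting-style cue word."""
    words = (w.strip(".,!?") for w in user_text.lower().split())
    informal_hits = sum(1 for w in words if w in INFORMAL_CUES)
    return "informal" if informal_hits >= 1 else "formal"

def tone_instruction(user_text):
    """Return a system-style instruction matched to the detected formality."""
    if detect_formality(user_text) == "informal":
        return "Reply in a relaxed, conversational tone."
    return "Reply in a clear, professional tone."
```

A production system would use something far more robust than keyword matching, but even this sketch shows how the mismatch the students noticed could be addressed, with a user-facing toggle overriding the automatic guess.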

Focus Group Discussion Summary and Interface Recommendations:

  • A majority of the students reported using some type of generative AI before, typically for academic reasons, but none of the students were regular users except for one who said she used Snapchat’s AI feature. This particular student remarked that she liked the Snapchat feature better than ChatGPT because the Snapchat AI did not decline to answer questions like ChatGPT does and Snapchat’s AI was not as “neutral.”
  • Students found the website for ChatGPT easily and reported no problems with creating an account. Some reported that creating the account/signing in using their existing Google account made things easier.
    • Recommendation: Keep the simplified sign-in process for Google users.
  • Initial impressions of the interface were mixed, with some liking the “clean” layout and others calling it “simple in a bad way.” Some users felt they did not know exactly what to do first within the site. Based on that confusion, the layout would not score well on a five-second test.
    • Recommendation: Reduce confusion about what users should do first by having ChatGPT start the conversation with a letter-by-letter appearance of text instead of the static “How can I help you?” message. This text would automatically open the chat conversation instead of waiting for the user to initiate the first response.
  • Some students appreciated the ability to change their personal settings to show the site in dark mode, but most students did not seek or find that option. Some felt that the dark mode option from their browser was sufficient.
    • Recommendation: Add a simple dark mode toggle switch at the top of the interface.
  • All students accessed the desktop version of the tool, but many said they would consider using the mobile app.
  • About half of the students did notice the accuracy disclaimer at the bottom of the screen, and some students agreed that a pop-up notification at the start of a chat would be better because the disclaimer is not highly visible.
    • Recommendation: Increase font size of disclaimer and add disclaimer as a pop-up message at the beginning of a chat session. Include an option on the pop-up that says, “Do not show this message again.”
  • Students commented that the letter-by-letter appearance of the text is “cool” and makes it feel like “someone is typing to you.”
  • Students generally agreed that the ability to verbally enter prompts and responses would be a good feature to add. Most did not realize that they could use certain internet browsers for that same feature.
  • Almost all students found it easy to get responses from ChatGPT, except for one student who had issues when using a Safari browser. When she switched to Chrome, that resolved her issues.
    • Recommendation: improve the tool’s functionality in Safari or stop offering the tool through Safari altogether. No experience may be better than a frustrating experience in this case.
  • Students generally had no issues with the speed of ChatGPT responses, but some thought the responses were too long.
    • Recommendation: Have ChatGPT ask users if its responses are sufficient in length during the chat.
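The letter-by-letter opening message recommended above can be modeled as a sequence of progressively longer prefixes that the interface renders on a short timer. This sketch computes those display frames; the function name and greeting text are illustrative assumptions.

```python
# Sketch: generate the successive display states for a typewriter-style
# opening message. A real interface would render each frame on a short
# timer (e.g., every few tens of milliseconds) to mimic someone typing.

def typewriter_frames(text):
    """Return each progressive prefix of `text`, one per revealed character."""
    return [text[: i + 1] for i in range(len(text))]

greeting = "Hi! What can I help you with today?"  # hypothetical opening line
frames = typewriter_frames(greeting)
```

The timing loop itself lives in the front end; separating frame generation from rendering keeps the animation logic trivial to test.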

 

Issue #3: Accuracy of Responses

There are a handful of reasons why potential users of generative AI choose to steer clear of the technology. Concerns about intellectual property rights and about bias in the training of these models are enough to keep some away completely. A more pressing concern for UX designers is the accuracy of the responses the different models generate. For any user on the fence about trying ChatGPT, one inaccurate response may be enough to cement an opinion of the tool as unhelpful or even dangerous. Because many people use ChatGPT as a source of information on a range of topics and as an alternative to Google search, it is essential that responses contain accurate information and avoid the so-called “hallucinations” and misinformation that harm the tool’s credibility and lead to the frustrating experiences that drive users to alternatives. Furthermore, since ChatGPT 3.5’s training data cuts off at January of 2022 and the model lacks an internet connection, it usually will not provide information about events after that date; when it does, it is most likely providing misinformation, because it has no way of accessing new information. Other free models (such as Perplexity.ai) do include internet connectivity and can pull up-to-date information, giving them a major edge over ChatGPT 3.5.

Industry studies and scholarly research point toward improvements in accuracy for generative AI. A 2023 study by Walters and Wilder found that GPT4 produced significantly fewer false citations than ChatGPT 3.5: 55% of the sources created by ChatGPT 3.5 were fictitious, compared with only 18% of those created by GPT4. It is likely that GPT4’s access to the internet contributed to this improved accuracy, as the model was better able to verify that sources exist. This is further reason to make internet connectivity a priority change for ChatGPT.

Other industry studies also point to rapid improvement in generative AI accuracy.

Originality.ai released data in February of 2024 showing GPT4 outperforming ChatGPT 3.5 in a fact-checking task (Gillham), and Anyscale released data indicating GPT4 outperforms both ChatGPT 3.5 and humans in identifying accurate summaries of news reports (Kadous, 2023). This is good news for the user experience, but generative AI already carries a significant stigma when it comes to accuracy. Some misinformation makes its way into generative AI responses because the program simply tries to execute a prompt without accounting for accuracy, but other inaccuracies can be the direct result of biases that influence outputs.

Since the optimal user experience of ChatGPT includes program outputs with accurate information, designers need to ensure the model’s training data better represents marginalized online communities and cultures. This means finding higher quality data than the primarily English-language swaths of the internet collected by Common Crawl (Baak, 2024).

Another source of bias that generates less accurate information is ChatGPT’s custom instructions feature, which allows users to tailor the responses they receive based on their own preferences. For example:

A response from ChatGPT based on these instructions thus becomes biased and will ignore certain information, making the responses potentially less accurate.

 

Although these kinds of tailored responses do create a preferred user experience, in the long run the presentation of biased information to users (even when they ask for it) will hurt the overall reputation of ChatGPT and other generative AI models. This is an ethical problem for UX designers that deserves careful consideration. There may be ways to honor users’ preferences for length and formality of responses without reflecting their own biases back at them.

Conclusion

ChatGPT brought generative AI to the world’s attention in 2022, but today it faces strong competition from similar free (and freemium) models that offer important features ChatGPT reserves for its subscription-based GPT4 model. There are a handful of minor changes ChatGPT could make to its interface to enhance users’ overall experience, but if OpenAI chooses to keep ChatGPT 3.5 offline, it is taking a major risk. The level of misinformation in generative AI responses is a major concern to many, and users concerned with accuracy may well choose to experiment with more accurate, free, internet-connected models like Perplexity.ai and Gemini before they choose to pay $20 per month for that feature in GPT4.

 

References

Baak, S. (2024, February 6). Training data for the price of a sandwich. Mozilla Insights. https://foundation.mozilla.org/en/research/library/generative-ai-training-data/common-crawl/

Gillham, J. (2024, February 2). AI fact checking accuracy study. Originality.ai. https://originality.ai/blog/ai-fact-checking-accuracy

Kadous, W. (2023, August 23). Llama 2 is about as factually accurate as GPT-4 for summaries and is 30X cheaper. Anyscale. https://www.anyscale.com/blog/llama-2-is-about-as-factually-accurate-as-gpt-4-for-summaries-and-is-30x-cheaper

Lew, G., & Schumacher, R. Jr. (2020). AI and UX: Why artificial intelligence needs user experience. O’Reilly.

Walters, W. H., & Wilder, E. I. (2023). Fabrication and errors in the bibliographic citations generated by ChatGPT. Scientific Reports, 13(1), 1–8. https://doi.org/10.1038/s41598-023-41032-5
