For those of you not up to date, ChatGPT in late 2022 gave us a computer that understands and generates English. Not just English, but human language in general. At first glance, this may seem like a small step. It is not.
Yes, it struggles with facts and hard data, but that part is solvable and often not as critical as you might think.
If that’s not revolutionary enough for you, the next change is Generative Artificial Intelligence (GenAI) going multimodal. ChatGPT started as text to text. Multimodal is everything else:
- Text to audio. Generate voice from text. Companies like Elevenlabs, Google, and Amazon can now offer both pre-trained voices and custom cloned voices that are in many cases indistinguishable from a human voice. Anyone can access these tools – it’s really easy.
- Audio to text. Apple, Amazon and Google have been doing this for years. Good luck if you have an accent. Today, through GenAI, it’s vastly improved – even with that accent. Otter.ai and OpenAI’s Whisper are two of my favorites. ChatGPT now offers audio in > audio out as an option, so you can bypass text entirely.
- Text to image. This area has improved rapidly over the past year. OpenAI’s DALL-E was fun to play with a year ago. Today, the free Bing.com/create comes really close to producing usable photorealistic images, and MidJourney is right there too. Need to generate images for your travel marketing campaigns? I’ll stay out of that argument.
- Image to text. This just happened in the last few weeks as part of ChatGPT. Take a picture, upload a sketch, whatever you want, and ask GenAI to describe, infer, tell a story … anything you want.
- Text to video and video to text. These aren’t really any different from the image technology, except there’s a lot more data to process (more calculations, more money). They’re not ready for prime time yet, but they’re coming – and they’re fun to play with, just as image generation was a year ago.
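To make the list above concrete, here is a minimal sketch of how these modality pairs map onto API surfaces. The route names mirror the namespaces in OpenAI’s Python SDK (`audio.speech`, `audio.transcriptions`, `images.generate`, `chat.completions`), but the mapping table itself is illustrative, not an official reference.

```python
# Illustrative mapping from (input mode, output mode) to an API surface.
# Names mirror OpenAI's SDK namespaces; the table is a sketch, not a spec.
ROUTES = {
    ("text", "audio"): "audio.speech",          # text to speech
    ("audio", "text"): "audio.transcriptions",  # e.g. Whisper
    ("text", "image"): "images.generate",       # e.g. DALL-E
    ("image", "text"): "chat.completions",      # vision input to a chat model
}

def route(input_mode, output_mode):
    """Return the API surface for a modality pair, or None if unsupported."""
    return ROUTES.get((input_mode, output_mode))
```

Text-to-video has no entry here, which matches the point above: it exists in research previews but is not yet a routine API call.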
Take any of these modes in isolation and each is significant. But when you connect them all together seamlessly, things start to get really interesting.
Consider the new Ray-Ban Meta. They’re basically a multimodal large language model (LLM – think ChatGPT) that you wear on your face. You can do the basics like listening to music or taking photos and videos. But really, they’re a seamless way to connect what you see, hear and say to the most powerful (in a practical sense) computer in the world, the LLM.
Across the human senses, it’s a fair argument that sight and hearing are how we consume most data, and voice and text are how we transmit most of it. All of them are now functional and connected.
Today we already use many of these tools, but they rarely work together cohesively. It’s still a bit patchy. That will change.
The end of the phone?
To be clear, I’m not saying this is the end of the phone. Not yet. We still need somewhere to look at pictures of cats or play Candy Crush. But that user experience (UX) – the screen we all stare at and multitask on for most of our workday – will become obsolete. A screen connected to a keyboard feels very natural to us, but it was simply the best way to interact with the technology available when it was invented. Before GenAI, the only practical way to communicate with a computer was by typing words into a field or clicking buttons and drop-down filters.
For many tasks, this is simply not a very effective way to communicate. And it’s a mistake to argue that because this was promised to us ten years ago with the smart speaker, the idea has been tried and is doomed to eternal failure. Smart speakers weren’t very smart back then. They didn’t have GenAI, and that makes today’s version a completely different proposition. It is now really easy to search an entire database with natural language and return exact matching results. A year ago this was almost impossible.
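The natural-language search described above is usually built on embeddings: turn the query and every record into vectors, then rank records by similarity. The sketch below is a toy version – the `embed` function is a bag-of-words stand-in for a real embedding model (such as OpenAI’s or sentence-transformers’), and the tour inventory is invented – but the retrieval flow is the same one production systems use.

```python
import math
from collections import Counter

def embed(text):
    # Toy stand-in for a learned embedding model: a bag-of-words vector.
    # Real systems use dense vectors, but the search flow is identical.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def search(query, records):
    # Rank every record by its similarity to the natural-language query.
    q = embed(query)
    return sorted(records, key=lambda r: cosine(q, embed(r)), reverse=True)

# Hypothetical tour inventory for illustration.
tours = [
    "Alcatraz Island ferry and prison tour, San Francisco",
    "Napa Valley wine tasting day trip",
    "Golden Gate Bridge bike rental",
]
results = search("how do I visit the famous prison island", tours)
```

The query never has to match a category, filter, or exact keyword field – similarity ranking surfaces the Alcatraz tour anyway, which is what makes voice-first interfaces practical.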
So now, with my sunglasses on, I can look at Alcatraz out of my hotel window and just say, “Hey Meta, what is this and how do I visit it?” Without ever looking at the screen, I can get a set of options, prices and availability. And book it by voice only.
The glasses? I’m not sure they’re the future. But again, save the comments about the Google Glass failure. This isn’t much different from pointing your phone at something. Seamlessness matters: one less click, or two fewer actions like picking up your phone and navigating to the right app – those things add up.
Fifteen years ago, we heard stories of people in Asia booking hotels or even flights on a smartphone. None of it seemed plausible at the time. Surely this is a step too far? It will never catch on in the West. These things take time, and people are great at slowing down progress in the short term. But in the end, better is just better, and people find the easiest way to accomplish a task with the tools they have access to.
Many legacy companies (my new definition: any company more than two years old) face the innovator’s dilemma. Their UX is hard-coded into everything they do. It’s not only loyal customers who use the current UX; so do the company’s designers, data analysts, marketers, engineers, and everyone else whose role derives from the current flow. That’s very hard to walk away from.
It just might open the door for some new players. Many startups are coming after this market share. Most are travel planners. Most are hard to distinguish from each other at the moment. But one of them could create the next killer app and turn everything upside down. Exciting times ahead.