Menus, Models, and the Madness in Between
WIMP and the CLI are both here to stay, but we can elevate each of them
“In the beginning was the command line.” Remember that book? I can’t say I recommend it, but the title was fantastic.
The command line is having a bit of a renaissance. Chatbots as traditionally implemented were mediocre at best, but ChatGPT-style prompting is... better.
A few years ago, Apple came out with Siri Shortcuts to access key features of applications via Siri. As we know, Siri is not a graphical experience.
As with traditional command lines, one issue lingers: discoverability. This was the massive win of *WIMP* - windows, icons, *menus*, and pointer. Menus not only organize the commands available to the user; they are explorable. NetApp, Cisco, et al implemented noun/verb/direct-object text interfaces. This consistent structure might mean a bit more typing, but it is a hierarchical structure, like WIMP menus, and I claim the natural structures of the two are virtually identical.
Traditional Unix shells (and other shells) have commands that originated in a need to be sparse with memory, storage, and display area. That frugality led to arcane commands that overwhelm a new user because there is no hierarchical relationship between pwd, ls, pushd, popd, and cd. A ‘d’ shows up in most of them, but its only consistency is that it occurs at the end. And if we then learn chmod, we might be surprised to find that the ‘d’ in chmod has nothing to do with the ‘d’ in pushd.
So with modern computers, we can reimagine the shell, and that’s what Microsoft did with PowerShell.
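To make the contrast concrete, here is a minimal sketch (in Python, with entirely hypothetical command names) of a noun/verb command tree. The point is not the implementation but the shape: every command lives at a predictable address, so listing the children of any node is itself the documentation--exactly what a WIMP menu gives you.

```python
# A toy noun/verb command tree -- the command names here are made up.
# Discoverability falls out of the structure: at any node, "what can I do?"
# is just a dictionary listing, much like walking a WIMP menu.
COMMANDS = {
    "volume": {
        "create": lambda name, size: f"created volume {name} ({size})",
        "delete": lambda name: f"deleted volume {name}",
        "show":   lambda name: f"volume {name}: online",
    },
    "network": {
        "show": lambda name: f"interface {name}: up",
    },
}

def run(noun, verb, *args):
    """Dispatch 'noun verb args', or list what is available on a miss."""
    if noun not in COMMANDS:
        return f"unknown noun {noun!r}; try one of: {', '.join(COMMANDS)}"
    if verb not in COMMANDS[noun]:
        return f"'{noun}' supports: {', '.join(COMMANDS[noun])}"
    return COMMANDS[noun][verb](*args)

print(run("volume", "create", "vol0", "10GB"))  # created volume vol0 (10GB)
print(run("volume", "rename"))                  # 'volume' supports: create, delete, show
```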
Today, we have the opportunity to consider a new command line with AI. Prompt engineering will remain a useful skill for some time to come, just as the Unix shell continues to be useful.
The challenge with prompt engineering is that the entire natural language is available. Depending on how our agents respond to commands, we would ideally cover as much of that search space as possible. This becomes rather interesting--the original commands that link to specific actions will clearly have synonyms. From a user’s query or command, we may find semantic matches. If not, it would be nice to generate a trajectory from the current incoherent query to nearby possible coherent queries. The further we are from our nearest neighbors, the more likely we want to expose some summary of the search--because then we can let the user redirect their efforts toward the available gamut of options rather than merely locking in on some calculated similarity with significant noise.
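A hedged sketch of what that matching loop could look like: embed the user’s utterance, compare it against embeddings of the known commands, and when nothing is close enough, surface the nearest few as a summary instead of guessing. Here `embed` is a stand-in for whatever embedding model you have, and the threshold is an assumption; the rest is plain cosine similarity.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def route(utterance, command_vectors, embed, threshold=0.8):
    """Map a free-form utterance to a known command, or return options to show.

    command_vectors: {command_name: embedding}; embed: text -> embedding (assumed).
    """
    query = embed(utterance)
    scored = sorted(
        ((cosine(query, vec), name) for name, vec in command_vectors.items()),
        reverse=True,
    )
    best_score, best_name = scored[0]
    if best_score >= threshold:
        return {"action": best_name}
    # Too far from any known command: expose a summary of the nearest options
    # so the user can redirect, rather than silently locking in a noisy match.
    return {"suggestions": [name for _, name in scored[:3]]}
```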
So we want to navigate our options, fine, and we can solve the discoverability problem.
Will we have a wave of GUI via AI? Boy, I hope so. Today we have two classes of GUI apps. The first is native apps that run on your workstation, which are, ipso facto, fairly complicated: plugins, numerous ways of manipulating and viewing things, and so on. No app today is installed if it doesn’t have to be, so the stuff running locally is the demanding stuff.
The other class is the applications we tolerate in our browsers. We have inconsistency at scale at every step of the web layer, from the user’s perception down to the variety of web frameworks. The UIs of these apps are all idiosyncratic and crippled by the restrictions of the browser: funky file I/O, and right-clicking the mouse is treated as such a security risk that we cannot possibly let it happen. Right-clicking as a security risk is the canary, but it is treated as the solution rather than the alarm.
Anyway the point is that we tolerate a crazy amount of inconsistency in every experience in our web apps, and many times in our desktop apps as well.
And the help menu has been wonderful in the last 10-20 years because it has become a way to navigate the menu hierarchy by typing rather than pointing. And if I know the name of what I want, I can skip the mental load of digging through menus and jump right to it. For maintaining flow this is critical. For maintaining sanity and reducing swearing at computers, I think it’s pretty good too.
But today there is an interesting AI opportunity under our noses and I hope we start to leverage it more.
Every list we show a user is an opportunity to prioritize, filter, and adjust what we present. Advertisers and retailers know all about this, of course, and some applications track your favorite actions, commands, or files and present them in some way.
But UIs could be much more adaptive: not just favorites or recently used records--menus could be filtered and ordered based on collaborative filtering of command clusters. Here the “users” collaborating would be sessions of users--and, for personalization, likely a model trained on the user’s own activity (each application session or document becomes a time series data set). Of course, there are likely patterns across users, especially users unfamiliar with the tool they are working with. As users develop familiarity and their own “style” of working with an application (some mix of habits and workarounds for limitations, real or perceived, in the tool), the lifetime collection of these per-session sub-time-series becomes a new input to a model of a user’s sophistication with an app. That could then be clustered, and ideally the clustering would reveal representations of different styles, end goals, and levels of familiarity with the product.
How these relationships work and evolve is a perfect gig for a Featrix embedding space. Different users, styles, and goals, captured across those sub-time-series and inferred by a higher-level model, require flexibility and, ideally, discovery of the optimal segmentation of users. We may want to build predictive models for which object the user will want next, which action they will take next, and so on. These predictions can inform not just obvious tasks like pre-fetching data; we could also change the UI’s responsiveness to predicted actions. Every touch or mouse click on a specific element could be predicted at every step of interaction--every key press, successful or not, every mouse movement.
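As a toy version of that idea, here is a sketch of a first-order model over a user’s session history: count which command tends to follow which, and use that to reorder a menu. A real system would work in an embedding space over sessions, as described above; this is just the smallest thing that shows the mechanic, and the command names are made up.

```python
from collections import Counter, defaultdict

class NextCommandModel:
    """First-order (Markov-ish) model of which command follows which."""

    def __init__(self):
        self.following = defaultdict(Counter)

    def observe(self, session):
        # session is an ordered list of commands from one application session
        for prev, nxt in zip(session, session[1:]):
            self.following[prev][nxt] += 1

    def rank_menu(self, menu_items, last_command):
        """Order menu_items so the likeliest next commands come first."""
        counts = self.following[last_command]
        return sorted(menu_items, key=lambda item: -counts[item])

model = NextCommandModel()
model.observe(["open", "copy", "paste", "save"])
model.observe(["open", "copy", "paste", "paste", "save"])
print(model.rank_menu(["cut", "paste", "save", "copy"], last_command="copy"))
# ['paste', ...] -- paste floats to the top after copy
```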
Imagine we have a toolbar and we have buttons on it. From past experience, our app knows that our user rarely uses the ‘cut’ command, preferring to use copy, paste, and then go back and delete once he is confident the target has accepted the paste and the new file is saved, or whatever the train of thought is. In any event, the user doesn’t hit the ‘cut’ button very often.
So our app could make the following decisions:
1. for this user, never show the ‘cut’ button
2. for this user, only show the ‘cut’ button if there is room and ‘cut’ is deemed more likely than other buttons that could be displayed.
3. if the user clicks or touches the ‘cut’ button, double check. Was the pointer recently near the ‘copy’ button? Were the user’s eyes on ‘copy’ or ‘cut’? In other words, is it valuable to let the computer second-guess the user? If the click landed closer to the edge than the center of the button, maybe it was sloppy mousing rather than intent. So we could show less-likely buttons smaller, or dimmer, and we could shrink their response target. We could also reduce the responsiveness of the target without enlarging the area in which neighboring buttons respond--which might be less frustrating if our user suddenly changes their style (or if someone else has taken over their terminal). A sketch of this adaptive-toolbar logic follows the list.
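Here is a sketch of decisions 2 and 3, with made-up numbers: a button earns a toolbar slot only if its predicted use clears a floor, and an ambiguous click (near the edge of a rarely used button) asks for confirmation rather than firing.

```python
import math

def should_show(usage_prob, slots_left, floor=0.02):
    """Decision 2: show a button only if space remains and it clears a usage floor."""
    return slots_left > 0 and usage_prob >= floor

def accept_click(click_xy, button_center, button_radius, usage_prob):
    """Decision 3: second-guess clicks that land on the edge of rarely used buttons."""
    dx, dy = click_xy[0] - button_center[0], click_xy[1] - button_center[1]
    off_center = math.hypot(dx, dy) / button_radius   # 0 = dead center, 1 = edge
    # Shrink the effective target of unlikely buttons instead of growing neighbors.
    effective_target = 1.0 - 0.5 * (1.0 - usage_prob)
    return "fire" if off_center <= effective_target else "confirm"

# A sloppy, near-the-edge click on a button this user almost never presses:
print(accept_click((114, 18), button_center=(100, 12), button_radius=16, usage_prob=0.01))
# -> 'confirm'  (e.g. "Did you mean Cut? You usually Copy here.")
```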
This brings up another point--security. Your app or computer could recognize unusual patterns--maybe you always use keyboard shortcuts and now someone logged in as you is not using any? Maybe ask for the password again and take a snapshot with the camera or insist on FaceID?
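A crude sketch of that check, with an invented signal: compare this session’s keyboard-shortcut ratio against the user’s baseline and, if it deviates wildly, ask for a step-up authentication instead of silently trusting the session.

```python
def shortcut_ratio(events):
    """Fraction of commands in a session that were issued via keyboard shortcut."""
    shortcuts = sum(1 for e in events if e.get("via") == "shortcut")
    return shortcuts / max(len(events), 1)

def needs_stepup_auth(baseline_ratio, session_events, tolerance=0.4):
    """Flag sessions whose shortcut usage deviates far from this user's baseline."""
    return abs(shortcut_ratio(session_events) - baseline_ratio) > tolerance

# This user almost always uses shortcuts (baseline 0.9); this session uses none.
events = [{"cmd": "copy", "via": "menu"}, {"cmd": "paste", "via": "menu"}]
if needs_stepup_auth(baseline_ratio=0.9, session_events=events):
    print("re-authenticate: ask for the password again, or insist on Face ID")
```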
For network security, there’s this idea of port knocking. I’m not sure that it ever went anywhere, but I always loved it--and you could imagine something like it here. Though with port knocking we usually have a predefined pattern, so I guess the only real connection is the idea of a pattern being executed.
You could imagine the computer system as a whole establishing a user baseline--a clear time series model of behavior throughout the day. So prioritize the Mail app at 8 am and, whatever you do, don’t let it run out of RAM. Anyway, we could establish baseline behavior and look for when the user is stressed or agitated. We could offer some calmer music, or a funny cat picture, or other sorts of things. We could also turn off dumb notifications during this time, and especially notifications that we have seen elevate heart rate.
The telemetry of mouse acceleration, force on buttons, pupil dilation via webcam, and heart rate via watch, combined with what’s on our screens, could be used to help us manage stress and maybe enhance collaboration. Imagine our computer knowing that emails from ‘Boss’ always stress us out and that Boss’ latest request doesn’t seem super important... but the last time the computer decided something wasn’t important, it was, and that raised the user’s heart rate even more. So proceed with caution. The model is damned if it does, damned if it doesn’t. But with some calibration, we can determine where the decision threshold is, and it may not be 50%.
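That last point has a standard cost-sensitive form, sketched here with made-up costs: if flagging an email as important when it isn’t costs a little, but missing a genuinely important one costs a lot (and raises the heart rate), the break-even probability lands well below one half.

```python
def decision_threshold(cost_false_alarm, cost_miss):
    """Flag when P(important) exceeds this; the classic cost-sensitive break-even point."""
    return cost_false_alarm / (cost_false_alarm + cost_miss)

# Made-up calibration: a false alarm is mildly annoying, a miss is five times worse.
print(decision_threshold(cost_false_alarm=1.0, cost_miss=5.0))  # ~0.17, not 0.5
```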
So if we can model the user’s emotional state--get measurements, associate them with the user’s calendar, featurize webcam calls for emotional signals--with the goal of the emotional well-being of our user, we can build something that might be pretty interesting. And it relies heavily on joining a ton of unjoinable data, which invites an embedding space approach.
OK, so that’s a long term vision--it might be dark, it might not.
The chat UI as a CLI gets really interesting when we think about the software business. If the main UI becomes a chat UI, then a lot can change--web UI limitations will not be as severe. But the computing experience gets tricky. Generation speed is the limiting factor. And what does the output look like? Often the generation is not as specific or accurate as what we are looking for from the model.
So I think a fully chat-driven, generative interface is a long way off as a primary means of computing (the computer boots up to a prompt that is an LLM session). The point being that in that world, the default software application might be something that plugs into that chat system. The system would be a set of tools--like OpenDoc--or like OpenAI-style “software-defined tools” (my term, as far as I know) today. But would app vendors want to embrace this as a computing enabler?
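What “plugging into the chat system” might look like is already visible in today’s function-calling APIs: the application describes its capabilities declaratively and the model decides when to invoke them. Here is a rough sketch in the shape of an OpenAI-style tool declaration for the draw_barchart binding discussed below--treat the exact schema as illustrative, not definitive.

```python
# Roughly the shape of a function-calling tool declaration: the app exposes a
# capability, and the chat system decides when (and with what arguments) to call it.
barchart_tool = {
    "type": "function",
    "function": {
        "name": "draw_barchart",
        "description": "Render a bar chart from tabular data in a Tufte-ish style.",
        "parameters": {
            "type": "object",
            "properties": {
                "data": {"type": "string", "description": "CSV or JSON rows to plot"},
                "legend": {"type": "string", "description": "e.g. 'upper right'"},
            },
            "required": ["data"],
        },
    },
}
```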
One of the fundamental observations I’ve had about programming with ChatGPT is that a lot of programming an application involves stuff for which a language like Python or TypeScript is not the appropriate layer of abstraction. SQL, for example, is a ‘4GL’ that sits at a higher level than what we usually program in (Python, et al.--and I say this with the idea that Python et al. do not move the ball significantly from C; some memory safety and automatic memory management are fantastic, but that’s housekeeping, not a logical leap).
However, for many things, we would like to have an API binding. Something like this:
```python
def draw_barchart(**kwargs):
    # openai_chat is a hypothetical wrapper around a chat-completion call
    image = openai_chat("Hey, some stuff came in, make a nice bar chart with labels "
                        "and a legend and colors and use MY tick marks and otherwise "
                        "follow a Tufte style and go easy on the data density", **kwargs)
    return image
```
Where this gets exciting is the passthrough of kwargs. Presumably I can just make up stuff and it will work. “y-axis: no”, hmm what else, “show title: yes”, “legend: upper right”.
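Concretely, a call might look like this--the keyword names are invented on the spot, and the model maps them onto chart features as best it can:

```python
# Hypothetical call: none of these keyword names come from a spec.
quarterly_sales = {"Q1": 12, "Q2": 19, "Q3": 14, "Q4": 23}
chart = draw_barchart(data=quarterly_sales, y_axis="no",
                      show_title="yes", legend="upper right")
```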
Look how nice this API is. I get to put my own style on it and if I don’t quite get stuff to line up, maybe the legend will end up missing or not where I want it, but ideally I don’t have a total crash of the system. We can structure our AI generation to do a best effort job, and only flag a failure (or react to a failure flag) on mission critical operations. But for trying to get a chart made and into an email, we might not care to debug anything that didn’t map, since we got the core of what we needed. Just like today, it’s up to the user how much to grind and how far to go to achieve what level of result.
But the important thing is not an invitation to sloppiness; it’s the idea that we do not crash on a failure. If we mistype a key name (and remember, we may not be typoing against a specification, but rather misspelling or mis-constructing our own label towards a goal), the system can still make a reasonable guess instead of failing.
Imagine that--a computer system where things are squishy enough that if things do not totally line up, they do not fail.
Now many programmers take this approach and often end up in very confused (i.e., data loss) states. We are not talking about them here.
Instead, what we’re talking about is flexibility for code to recover from typos, and for extra or missing arguments to be inferred from whatever context we have about the current situation and previously observed situations. So we can recover and keep operating in the face of errors--again, for operations that can tolerate it. Previously, programming a computer to be squishy and accepting in its behavior was quite deliberate and quite tedious to put together.
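Pieces of that squishiness are cheap to build even without a model in the loop. A sketch using only the standard library--the option names are invented for the example: unknown keyword arguments are snapped to their closest known counterpart instead of raising, and anything that cannot be mapped is set aside rather than crashing the call.

```python
import difflib

KNOWN_OPTIONS = {"legend", "title", "y_axis", "tick_marks", "palette"}  # invented

def normalize_kwargs(kwargs, known=KNOWN_OPTIONS, cutoff=0.6):
    """Snap misspelled option names to the closest known option; never raise."""
    accepted, leftovers = {}, {}
    for key, value in kwargs.items():
        match = difflib.get_close_matches(key, known, n=1, cutoff=cutoff)
        if match:
            accepted[match[0]] = value
        else:
            leftovers[key] = value   # tolerated, logged, maybe handed to a model later
    return accepted, leftovers

print(normalize_kwargs({"legnd": "upper right", "titel": "Q3 sales", "mystery": 7}))
# ({'legend': 'upper right', 'title': 'Q3 sales'}, {'mystery': 7})
```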
But something far better is within grasp today.
Buttons in dialogs could be rearranged. Notifications, errors, warnings could be ordered. Hero photos, button styles, accent colors, and more could all be tailored.
The command line never really left. But now it’s showing up with better manners, more memory, and a lot more potential. We’ve got a shot at rethinking how we interact with computers: not just clicking around menus or memorizing shortcuts, but actually shaping interfaces that adapt to us in real time, that prioritize what matters, that guess right more often than not, and that don’t fall apart when we’re vague or sloppy. The old dream of flexible, intelligent UI is suddenly within reach because the system got better at learning patterns. That’s a big shift. And if we get this right, everything from toolbars to security to our own cognitive load starts to look very different. Not simpler, necessarily, but more fluid. More human.