In my last article, I explored what I called the "mom and dad" problem: how do non-technical people use AI to automate the repetitive, structured tasks that make up most of their daily friction? I found that pure agent approaches are too expensive and unreliable for repeated tasks, that workflow builders like n8n are too technical for regular people, and that natural language authoring shows promise but isn't sufficient yet.
I ended with several hypotheses about what might close the gap:
- What if the agent wrote workflows in an established framework with deep training data representation, instead of a custom DSL?
- What if it had a library of proven, production-tested workflows to reference before writing new ones?
- What if the runtime handled durability and error recovery natively, so the author could focus on logic instead of infrastructure?
- What if the authoring process included better iteration tooling, like sandboxing and pressure-testing before a workflow goes live?
I've now tested several of these hypotheses, and I'm starting to believe the gap between workflow author and consumer can be meaningfully closed.
Python Workflows on Temporal
The previous version of WorkflowSkill used a custom YAML language I designed specifically for agent-authored workflows. It had five step types, a clean chaining model, and a validation system that caught structural errors before execution. It worked surprisingly well, but certainly not well enough to unlock natural language authoring for a non-technical person.
The hard part wasn't syntax. The agent could produce valid YAML reliably. The problem was the substance of the workflow. The authoring agent struggled with handling data transforms, wiring data between steps, and configuring durability mechanisms like retry policies. Roughly 70% of generated workflows would validate and run correctly. Sounds decent until you realize that means nearly one in three fails.
I replaced the YAML language with Python, and the custom runtime with Temporal.
The SKILL.md format stayed: a markdown file with YAML frontmatter for metadata and a fenced code block for the workflow logic. But instead of declarative YAML steps, the code block is a Python method body. The loader wraps it in a Temporal workflow class, auto-injects imports, and runs it on an embedded Temporal server. Actions (API calls, web scraping, LLM inference) are registered by the consumer and executed as Temporal activities with built-in retries, timeouts, and durable state.
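To make the loading step concrete, here is a minimal sketch of how a code block from a SKILL.md file could be wrapped into a class with a `run()` method. The names (`SKILL_BODY`, `wrap_skill`, `GeneratedWorkflow`) are illustrative, not the project's real API, and the real loader also handles frontmatter, import injection, and Temporal registration:

```python
import asyncio
import textwrap

# A toy method body of the kind that might appear in a SKILL.md code block.
SKILL_BODY = """
msg = "snow report"
return msg.title()
"""

def wrap_skill(body: str, name: str = "GeneratedWorkflow"):
    """Wrap a bare method body in a class exposing an async run() method."""
    indented = textwrap.indent(textwrap.dedent(body).strip(), " " * 8)
    src = f"class {name}:\n    async def run(self):\n{indented}\n"
    namespace: dict = {}
    exec(src, namespace)  # compile the generated class definition
    return namespace[name]

# Load the skill and execute its run() method.
result = asyncio.run(wrap_skill(SKILL_BODY)().run())
```

The point of the sketch is the division of labor: the agent only ever writes the method body, and the surrounding class, imports, and runtime wiring are supplied by the loader.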
This tested two of the hypotheses from the previous article directly.
Training data representation. Python's massive presence in training data would let the agent write workflow logic with far less instruction. This proved correct immediately. The authoring guide shrank because it no longer needed to teach the agent how to express transforms, conditionals, or loops. The agent already knows Python. The guide now focuses entirely on patterns and high-level rules: when to use parallel execution, how to structure retry policies, what the action interface looks like. The specifics of writing working code are baked into the model's pre-training.
Native durability. Temporal would give us production-grade infrastructure for free. Also correct. Retry policies, timeouts, scheduled execution, and state persistence are all Temporal primitives. We don't implement any of it. A workflow author (human or AI) gets these capabilities by following standard Temporal patterns that the agent has also seen extensively in training data.
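To illustrate what "for free" means here, the sketch below reimplements, in plain Python, the retry semantics a Temporal retry policy provides natively: exponential backoff with a maximum attempt count. In a real workflow you would pass a `RetryPolicy` to the activity call instead of writing this loop; the function and parameter names below are mine, for illustration only:

```python
import time

def run_with_retry(action, max_attempts=3, initial_interval=0.01, backoff=2.0):
    """Call action(), retrying with exponential backoff on failure.

    Temporal's runtime does this (durably, across process restarts) when an
    activity is configured with a retry policy; a workflow author never
    writes this loop by hand.
    """
    interval = initial_interval
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts; surface the failure
            time.sleep(interval)  # Temporal uses durable timers instead
            interval *= backoff
```

The contrast is the argument: in the old custom runtime, logic like this had to be implemented and maintained; on Temporal it is a declarative policy the authoring agent has already seen in training data.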
The combined effect is dramatic. My qualitative assessment, based on extensive testing, is that the success rate jumped from around 70% to something like 98%. I haven't run a statistically significant study, so take those numbers as directional. But the difference is obvious in practice. Workflows that used to require multiple rounds of iteration and human debugging now work on the first or second attempt.
The third hypothesis, better iteration tooling, is partially tested. The project includes an eval suite that measures authoring quality: given a task description, does the agent generate a workflow that validates and runs? Each test targets a specific language feature (loops, error recovery, parallel execution) so regressions are caught per-feature. This has been valuable for iterating on the authoring guide, but true sandboxing, where the agent pressure-tests a workflow against live data before deploying it, is still ahead.
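As a rough sketch of what a per-feature eval case can check, the snippet below validates that a generated method body at least compiles once wrapped; the `validate` helper and the `FEATURE_CASES` table are hypothetical stand-ins for the project's real eval suite, which also executes the workflow:

```python
def validate(source: str) -> bool:
    """Structural check: does the generated body compile as a method body?"""
    body = "\n".join("    " + line for line in source.splitlines())
    try:
        compile(f"async def run(self):\n{body}\n", "<skill>", "exec")
        return True
    except SyntaxError:
        return False

# One generated body per language feature, so regressions are caught
# per-feature (these examples are illustrative).
FEATURE_CASES = {
    "loops": "total = 0\nfor n in [1, 2, 3]:\n    total += n\nreturn total",
    "conditionals": "return 'go' if True else 'no-go'",
}
```

Validation is only the first gate; the interesting half of the eval, running the workflow against its task description, needs the sandboxing discussed above.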
The fourth hypothesis, a reference library of proven workflows, is not yet tested. I have a rapidly growing collection of best-practice examples, but I need to build out more of them before they make sense to serve as a library to authoring agents.
Jakob's Law for Machines
This insight generalizes beyond my project, so I want to spend a moment on it.
In human-facing design, there's a well-known principle called Jakob's Law: users spend most of their time on other sites, so they expect your site to work the same way as the ones they already know. Familiarity reduces cognitive load. People perform better when new tools feel like tools they've already used. This is why every e-commerce checkout looks like Amazon and every music player has the same transport controls. Fighting user expectations is a losing game, even when your novel approach is objectively better.
The same principle applies when the user is an AI model. The custom YAML language I built existed in zero training data. The agent had to learn it from scratch through a single instruction document. With Python on Temporal, the agent arrived already knowing the language, the idioms, and the framework's API. The authoring guide became a thin layer on top of existing competence rather than a complete education.
The mechanism is different but the principle is identical. Humans build mental models from prior experience with other products. Models build competence from prior exposure to training data. In both cases, familiarity is the strongest predictor of performance. And in both cases, a novel approach can perform worse than an established convention, because the user already knows the convention cold.
If you're building anything an AI agent will interact with, this might be the most important design question: how familiar is the model with what you're asking it to do? Not just code. Concepts, patterns, vocabulary, structure. The model's prior exposure to an idea matters as much as the quality of the idea itself.
Tempura: The Experiment
WorkflowSkill was always intended as the building blocks for a product: the library is a workflow engine. The open question was whether it could be packaged into something that regular people could use. People who will never install a CLI, never configure a Temporal cluster, and never look at a Python file.
So I built Tempura, a way for non-technical people to test the WorkflowSkill paradigm.
Tempura is a hosted workflow service. You sign up, describe what you want to automate in plain language, and the platform authors a workflow, runs it on a schedule, and shows you the results. No code. No infrastructure. No workflow engine to operate. Under the hood, it's WorkflowSkill running on a hosted Temporal cluster with managed tool connections. But the user never sees any of that. They see a chat interface, a dashboard, and an intuitive visualization of the steps their workflow will take.

Here's what it looks like in practice. I wanted a daily snow report for Summit at Snoqualmie. I described it in the chat. Tempura built a workflow that fetches the National Weather Service forecast, scrapes the resort's mountain report page, pulls I-90 road conditions from WSDOT, and passes all of it to Claude Haiku to generate a go/no-go riding report with conditions and a star rating. Three data sources fetched in parallel, one LLM synthesis step. I asked it to send me this snow report via Slack.

The whole thing runs in ~4 seconds and costs $0.002.
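The shape of that workflow can be sketched as below: three fetches fanned out in parallel, then one synthesis step. The action functions are stand-ins I wrote for illustration; the real workflow runs them as Temporal activities against the NWS, the resort's page, WSDOT, and Claude Haiku:

```python
import asyncio

async def fetch_forecast():
    # Stand-in for the National Weather Service fetch.
    return "6 inches of new snow expected"

async def fetch_mountain_report():
    # Stand-in for scraping the resort's mountain report page.
    return "all lifts open"

async def fetch_road_conditions():
    # Stand-in for the WSDOT I-90 conditions fetch.
    return "I-90 bare and wet"

async def synthesize(forecast, mountain, roads):
    # Stand-in for the LLM go/no-go synthesis step.
    return f"GO. {forecast}; {mountain}; {roads}."

async def snow_report():
    # Fan out the three data fetches concurrently, then synthesize.
    forecast, mountain, roads = await asyncio.gather(
        fetch_forecast(), fetch_mountain_report(), fetch_road_conditions()
    )
    return await synthesize(forecast, mountain, roads)

report = asyncio.run(snow_report())
```

The parallel fan-out is why the whole run stays in the seconds range: the three fetches cost one round trip, not three.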

You can see every step: what it called, how long it took, whether it succeeded. You can see the output. You can see the history of past runs. And if you want to change something, you go back to the chat and describe the change.
And, importantly, you can schedule workflows to run automatically for you.

In the future, workflows will also be able to run from triggers: for example, when somebody places an order in your store, or when somebody messages you in an app.
This is early. The integration library is small right now (web requests, Anthropic's API, and Slack), but the core authoring and execution loops work.

The question is whether it works for the kinds of tasks real people care about, and whether the experience is intuitive enough that someone without a technical background can get value from it.
What I Need Help With
I'm running Tempura as a private beta because I want to learn from real usage before building more. There are specific questions I'm trying to answer, and I'd rather answer them with evidence than assumptions.
What would you automate? I have hypotheses about use cases, but the most valuable signal will come from what people actually try to build. The use cases I haven't imagined are the ones I most need to hear about.
Where does the experience break? When you describe a workflow and the result isn't what you expected, that's gold for me. The failure modes teach me what the authoring guide is missing, what clarifying questions the agent should ask, and where the UX needs guardrails.
What integrations matter most? What integration would unlock your most valuable automation?
If you want to try it, join the beta at tempura.run. It's free during the beta period. And if you want to be part of the ongoing conversation about what this product becomes, join the Discord. That's where I'm sharing what I learn, collecting feedback, and making decisions about what to build next in the open.
I'm also building toward a curated collection of immediately useful workflows (the reference library hypothesis from the last article, finally getting a real test), an MCP server so you can trigger Tempura workflows from your existing agent tools (Claude, Cursor, etc.), and a broader set of native integrations. All of this will be shaped by what I hear from early users.
The Gap, Revisited
When I wrote the first article about the mom and dad problem, I framed it as a distance between "needs a developer" and "a conversation gets you there." That distance is shorter than it was. The Python/Temporal pivot closed a significant chunk of it on the authoring side. Tempura closed another chunk by eliminating infrastructure entirely.
I'm also finding the gap is wider in ways I didn't anticipate. For example, helping someone who has never thought about automation understand what a workflow is, and how it could fit into their daily life, is genuinely difficult.
I find that encouraging, honestly. Most of the remaining distance isn't about waiting for better models or more capable infrastructure. It's about product design.
If you're thinking about these problems too, I'd love to hear from you. And if you're a non-technical person who's curious whether this kind of tool could actually help you, let's talk!