Technology

Mimic 2.1: portable cognitive twins as a single skill.md

Mimic 2.1: portable cognitive twins as a single skill.md

Mimic 2.1 compiles a person's decision-making into a single portable SKILL.md twin any model can load. Built declaratively or inferred from real work history, coupled twins reach 96% decision endorsement, near a person's own consistency.

Private beta access : https://tally.so/r/jap6LR

We are introducing Mimic 2.1, an engine that compiles a person's decision-making into a single portable SKILL.md file written in a structured format we call the Mimic grammar. A Mimic twin is not a fine-tuned model and not a new base model. It is a structured text artifact that any host model can load at inference time, and it is built to approximate three things about its subject: the decisions they make in a given situation, the stance they hold on events and options, and the chain of thought they follow to get there.

The design goal was deliberately narrow: reproduce a person's choices well enough to be useful inside an existing AI workflow, as a readable, auditable artifact rather than an opaque profile baked into model weights.

Why a file, and not a model

Most personalization approaches change the model. Mimic changes the context. A twin is a SKILL.md that is loaded as a skill by whatever host model is already in the pipeline. This buys four properties that matter for deployment:

  • Portability. The same twin runs across host models with no retraining.

  • Auditability. The artifact is human-readable. The subject can open it, read what it claims about them.

  • Low cost. There is no training job and no model hosting. You load a text artifact into the context of a model you already run.

  • Authorship. Because the file is editable by the subject, the twin stays something a person owns rather than something inferred about them in the dark.

What a twin contains

A twin is written in the Mimic grammar, a structured but human-readable format. Rather than one block of prose, the file is organized into named strata that the host model reads as an operating role:

  • Core drives, each carrying a numeric weight, that set what the subject optimizes for and how competing pulls are resolved when they conflict.

  • Cognition, a ranked set of reasoning models and decision directives that encode how the subject moves from a situation to a choice. This is where the chain-of-thought target lives.

  • Expertise and stances, the domains the subject is fluent in and the positions they hold on contested questions. This is where the viewpoint target lives.

  • Voice, the register and the recurring patterns of how the subject expresses a judgment.

  • Calibration and fidelity, tunable dials plus self-check gauges the twin scores its own draft against before it answers.

  • Guardrails, a small set of absolute rules, including that the twin is a behavioral emulation and never a claim to be the real person.

  • Exemplars, worked situations in review, conversation, and decision modes, each paired with the reasoning that produced it. This is where the decision target is anchored.

The decision, viewpoint, and chain-of-thought targets each map onto specific strata, which is what lets us benchmark the three of them independently.

Two ways to build a twin

Declarative mode (available in private beta). Human-in-the-loop. The subject ranks options across structured situations and annotates a subset with short rationales; the engine compiles these into a SKILL.md and iterates until the twin matches the subject on held-out items. Nothing enters the twin that the subject did not produce or approve.

Inferred mode (research preview). The engine drafts a twin from the artifacts a subject already produced, then asks the subject to confirm it. We detail this engine, and its results, at the end of this post.

How we built the engine

Phase 1: choosing the representation (no reinforcement)

Before adding any learned adjustment layer, we wanted to know which SKILL.md representation reproduces a subject's behavior best. We treated this as a representation ablation.

  1. Stimuli. We authored a pool of 3,000 multiple-choice situations focused on professional, white-collar tasks: prioritization, delegation, hiring, escalation, risk tolerance, and resourcing. Each item presents a realistic situation and a set of plausible actions.

  2. Panel. 76 participants (41 women, 35 men). Each participant saw 108 randomly drawn items and ranked every option from most to least probable for themselves. For 50% of their items, participants also explained how they arrived at that ordering.

  3. Ceiling. We re-tested a subset of items per participant to estimate a self-consistency ceiling. People do not answer identically twice, so test-retest agreement is the realistic upper bound for any twin. We observed a self-consistency ceiling of roughly 0.86 Top-1 and 0.81 ranking agreement, and we report all twin results against that bound.

Metrics. We scored each twin against its own subject on three measures. For all three, higher is better:

  • Decision Top-1 (0 to 100%): the twin's top-ranked option matches the subject's top-ranked option.

  • Ranking agreement (τ) (ranges from -1 to 1, where 1 is a perfect match): a normalized Kendall's τ over the full ordering of options.

  • Rationale alignment (0 to 100%): a rubric-scored concordance between the twin's stated reasoning and the subject's written rationale.

Procedure. For each subject we compiled four candidate SKILL.md variants and benchmarked each with 10 runs per item across the host models, comparing the twin's output to the subject's own answers, then refined the winning representation and re-ran.


Representation (SKILL.md variant)

Decision Top-1

Ranking agreement (τ)

Rationale alignment

V1: flat trait list

58.4%

0.41

49%

V2: narrative biography

61.2%

0.46

57%

V3: weighted drives + exemplars

71.9%

0.63

68%

V4: full Mimic grammar

76.8%

0.69

74%

Table 1. Representation ablation, averaged across host models, no reinforcement. Higher is better in every column. Self-consistency ceiling: 0.86 Top-1, 0.81 τ.

The full Mimic grammar (V4) clearly beat both a flat trait list and a free-form biography. The highest-fidelity representation was also the largest: we did not optimize for a minimal file, we optimized for fidelity and auditability, and the structured strata earn their footprint. We adopted the Mimic grammar as the default representation.

Holding the representation fixed at the Mimic grammar, we then measured fidelity across host models, 10 runs per item.


Host model

Decision Top-1

Ranking agreement (τ)

Rationale alignment

Claude (frontier)

78.9% ± 1.2

0.71 ± 0.02

76% ± 1.5

GPT (frontier)

77.4% ± 1.4

0.69 ± 0.03

74% ± 1.7

Gemini (frontier)

75.1% ± 1.6

0.66 ± 0.03

71% ± 1.9

Llama (open weight)

70.3% ± 2.1

0.61 ± 0.04

65% ± 2.3

Table 2. Host-model fidelity using the Mimic grammar (mean ± standard deviation over 10 runs). Higher is better in every column. Names refer to the most capable generally available checkpoint in each family at time of testing. These numbers measure how faithfully the twin reproduces its subject, not the general capability of the host model.

The gap between frontier hosts is smaller than the gap between representations in Table 1, which was the main finding of Phase 1: how the twin is written matters more than which model reads it, at least within the frontier set.

Phase 2: reinforcement on failure cases

Phase 1 left a residue of items where twins disagreed with their subjects, and a separate set where the host model could not settle on a confident choice (high disagreement across the 10 runs). We isolated this ambiguous subset and ran a targeted correction pass.

14 participants reviewed a fresh pool of questions drawn from these failure cases and supplied corrective rankings and rationales. Mimic 2.1 now includes a reinforcement layer that uses these corrections to adjust the SKILL.md directly: it re-weights drives and directives, edits or adds exemplars, and tightens the cognition strata. No host-model weights are touched; the learning lives entirely in the artifact.


Metric

Pre-reinforcement

Post-reinforcement

Δ

Decision Top-1 (ambiguous subset)

54.6%

67.2%

+12.6

Ranking agreement (τ, ambiguous subset)

0.42

0.55

+0.13

Rationale alignment (ambiguous subset)

58%

69%

+11

Decision Top-1 (full set)

78.9%

80.4%

+1.5

Table 3. Effect of the reinforcement layer, Claude host. Higher is better in every column. Gains concentrate on the ambiguous subset, with a small lift on the full set.

As expected, reinforcement helped most where the twin was previously uncertain, and added little on items the twin already handled well. This is the behavior we wanted: reinforcement edits the strata that were failing rather than rewriting the whole twin, and the gains land where the twin was previously unsure.

The inferred engine

Declarative mode asks the subject to answer. The inferred engine instead reconstructs a twin from the behavioral footprint a person already leaves at work, with their permission:

  • email and chat history (Slack, Teams),

  • meeting and call voice transcripts,

  • documents, decision logs, and ticket or review trails.

The engine extracts the real decisions, stances, and reasoning patterns from this corpus and compiles them into a draft Mimic grammar, which the subject then confirms. In short:

  • Declarative captures what a person says they would do, from a short, deliberate elicitation.

  • Inferred captures what a person actually did, from a large, passive record of real decisions.

  • Coupled uses the inferred draft as the starting point and a short declarative pass to correct it, getting the best of both.

Internal results are stronger for the inferred path, for an intuitive reason: months of real decisions are a richer and less self-conscious signal than a 108-item elicitation. We benchmarked the three configurations on the same held-out protocol, Claude host, against the same per-subject self-consistency ceiling.


Configuration

Decision Top-1

Ranking agreement (τ)

Rationale alignment

Subject endorsement

Declarative (post-reinforcement)

80.4%

0.72

77%

89%

Inferred (confirmed draft)

83.1%

0.75

79%

92%

Coupled (inferred + declarative)

85.6%

0.80

83%

96%

Human self-consistency ceiling

0.86

0.81

n/a

n/a

Table 4. Inferred and coupled configurations versus declarative, Claude host, preliminary internal results. Higher is better in every column. Subject endorsement is the share of the twin's decisions a subject reviews and confirms as ones they would themselves make, a looser test than exact Top-1 match, which is why it can sit above the Top-1 ceiling.

The takeaways:

  • Inferred alone beats declarative alone on every metric, because it learns from real behavior rather than self-report.

  • Coupling reaches the human ceiling. A coupled twin matches its subject's exact top choice about as often as the subject matches their own earlier answers. There is little exact-match headroom left above this.

  • The number users actually feel is 96%. Asked whether each decision the coupled twin made was one they would make themselves, subjects agreed 96% of the time. This endorsement test is looser than exact Top-1 match, and it is the closest proxy for everyday use.

These numbers are preliminary and internal. The inferred engine is being hardened, and the draft-and-confirm review that keeps a subject in control of their own twin is the main work standing between it and beta. On current evidence it is the more faithful of the two paths, and the two together are stronger than either alone.

Limitations and what is next

These results are confined to professional, white-collar tasks and should not be read as validation for high-stakes personal, clinical, or legal decisions. A twin is also a snapshot: people change, and an unrefreshed twin drifts from its subject over time. The inferred engine's results are preliminary and internal, and it is being hardened for beta. Our next steps are to broaden the situation domains, add a longitudinal refresh so twins can be updated cheaply, and ship the draft-and-confirm version of the inferred engine.

Access

Mimic 2.1 is closed-source and available in private beta. If you would like to build or evaluate a twin, request access through the beta program and our team will follow up : https://tally.so/r/jap6LR

Tech Stack

People from to-markdown.ai