• 1 Post
  • 42 Comments
Joined 6 months ago
Cake day: August 27th, 2025


  • You’re over-egging it a bit. A well-written SOAP note, HPI, etc. should distill to a handful of possibilities; that’s true. That’s the point of them.

    The fact that the LLM can interpret those notes 95% as well as a medically trained individual (per the article) and come up with the correct diagnosis is being a little undersold.

    That’s not nothing. Actually, that’s a big fucking deal™ if you think through the edge-case applications. And remember, these are just general LLMs, and pretty old ones at that (ChatGPT-4 era). We’re not even talking about a medical domain-specific LLM.

    Yeah; I think there’s more here to think on.


  • Agreed!

    I think (hope) the next application of this tech is in point-of-care testing. I recall a story of someone in Sudan(?) using a small, locally hosted LLM with vision abilities to scan handwritten doctor notes and come up with an immunisation plan for their village. I might be misremembering the story, but the anecdote was along those lines.

    We already have PoC testing for things like ultrasound… but some interpretation workflows rely on a strong net connection, IIRC. It’d be awesome to have something on-device that can be used for imaging interpretation where there is no other infra.

    Maybe someone can finally win that $10 million X Prize for the first viable tricorder (pretty sure that one wrapped up years ago? Too lazy to look)… one that isn’t smoke and mirrors like Theranos.


  • Funny how people overlook that bit en route to dunking on LLMs.

    If anything, that 90% result supports the idea that Garbage In = Garbage Out. I imagine a properly used domain-tuned medical model with structured inputs could exceed those results in some diagnostic settings (task-dependent).

    IIRC, the 2024 Nobel Prize in Chemistry was won on the basis of using an ML expert system to investigate protein folding. ML != LLM, but at the same time, let’s not throw the baby out with the bathwater.

    EDIT: for the lulz, I posted my above comment into my locally hosted bespoke LLM. It politely called my bullshit out (AlphaFold is technically not an expert system, and I didn’t cite my source for the Med-PaLM 2 claims). As shown, not all LLMs are tuned to be sycophantic yes-men; there might be a sliver of hope yet lol.


    The statement contains a mix of plausible claims and minor logical inconsistencies. The core idea—that expert systems using ML can outperform simple LLMs in specific tasks—is reasonable.

    However, the claim that “a properly used expert system LLM (Med-PALM-2) is even better than 90% accurate in differentials” is unsupported by the provided context and overreaches from the general “Garbage In = Garbage Out” principle.

    Additionally, the assertion that the 2024 Nobel Prize in Chemistry was won “on the basis of using ML expert system to investigate protein folding” is factually incorrect; the prize was awarded for AI-assisted protein folding prediction, not an ML expert system per se.

    Confidence: medium | Source: Mixed



  • I don’t think it’s their information per se, so much as how the LLMs tend to use said information.

    LLMs are generally tuned to be expressive and lively. A part of that involves “random” (i.e., roll the dice) output based on inputs + training data. (I’m skipping over the technical details for the sake of simplicity; rough sketch below.)
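
    If you want a feel for the “roll the dice” part, it boils down to temperature-scaled sampling. Toy sketch only, nothing vendor-specific, and the numbers are made up:

    ```python
    import numpy as np

    def sample_next_token(logits, temperature=1.0, rng=None):
        """Roll the dice over the model's raw scores for the next token.

        Higher temperature flattens the distribution (livelier, more random);
        temperature near zero collapses toward the single most likely token.
        """
        rng = rng or np.random.default_rng()
        scaled = np.asarray(logits, dtype=float) / max(temperature, 1e-8)
        probs = np.exp(scaled - scaled.max())  # numerically stable softmax
        probs /= probs.sum()
        return int(rng.choice(len(probs), p=probs))

    logits = [2.0, 1.0, 0.5]              # toy scores for three candidate tokens
    print(sample_next_token(logits, 1.2))  # varied, "lively" output
    print(sample_next_token(logits, 0.1))  # almost always token 0
    ```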

    That’s what the masses have shown they want: friendly, confident-sounding chatbots that can give plausible answers that are mostly right, sometimes.

    But for certain domains (like med) that shit gets people killed.

    TL;DR: they’re made for chitchat engagement, not high-fidelity expert systems. You have to pay $$$$ to access those.



  • Agree.

    I’m sorta kicking myself I didn’t sign up for Google’s Med-PaLM 2 when I had the chance. Last I checked, it passed the USMLE exam with 96% and scored 88% on radiology interpretation / report writing.

    I remember looking at the sign up and seeing it requested credit card details to verify identity (I didn’t have a google account at the time). I bounced… but gotta admit, it might have been fun to play with.

    Oh well; one door closes, another opens.

    In any case, I believe this article confirms GIGO. The LLMs appear to have been vastly more accurate when fed correct inputs by clinicians versus what lay people fed them.



  • So, I can speak to this a little bit, as it touches two domains I’m involved in. TL;DR - LLMs bullshit and are unreliable, but there’s a way to use them in this domain as a force multiplier of sorts.

    In one, I’ve created a Python router that takes my (de-identified) clinical notes, extracts and compacts the input (user-defined rules), creates a summary, then -

    1. benchmarks the summary against my (user-defined) gold standard and provides a management plan (again, based on a user-defined database).

    2. this is then dropped into my on-device LLM for light editing and polishing to condense, which I then eyeball, correct, and escalate to a supervisor for review.

    Additionally, the LLM-generated note can be approved / denied by the Python router, in the first instance, based on certain policy criteria I’ve defined.

    It can also suggest probable DDXs based on my database (which is CSV-based).

    Finally, if the LLM output fails the policy check, the router tells me why it failed and just says “go look at the prior summary and edit it yourself” (rough sketch of that gate below).

    This three-step process takes the tedium of paperwork from 15-20 minutes down to about 1 minute of generation plus 2 minutes of manual editing, which is roughly a 5-7x speed-up.
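
    To be clear, that’s not my actual code, just the rough shape of the approve / deny gate and the CSV-based DDX lookup. All the section names, phrases, and file names below are made up for illustration:

    ```python
    import csv
    from dataclasses import dataclass, field

    @dataclass
    class PolicyResult:
        passed: bool
        reasons: list = field(default_factory=list)

    def check_policy(note: str) -> PolicyResult:
        """Deterministic gate over the LLM-polished note.

        The required sections and banned phrases are placeholders for whatever
        your own policy document specifies.
        """
        required = ("history", "assessment", "plan")
        banned = ("as an ai language model",)
        reasons = [f"missing section: {s}" for s in required if s not in note.lower()]
        reasons += [f"banned phrase: {b}" for b in banned if b in note.lower()]
        return PolicyResult(passed=not reasons, reasons=reasons)

    def suggest_ddx(findings: set[str], ddx_csv: str = "ddx.csv", top_n: int = 3) -> list[str]:
        """Rank differentials from a CSV of condition/findings rows (made-up schema)."""
        scored = []
        with open(ddx_csv, newline="") as f:
            for row in csv.DictReader(f):
                features = {x.strip().lower() for x in row["findings"].split(";")}
                scored.append((len(features & findings), row["condition"]))
        return [cond for hits, cond in sorted(scored, reverse=True)[:top_n] if hits]

    result = check_policy("History: ... Assessment: ... Plan: ...")
    print("passed" if result.passed else f"rejected: {result.reasons} - go edit it yourself")
    ```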

    The reason why this is interesting:

    All of this runs within the LLM (or, more accurately, it’s invoked from within the LLM: it calls the Python tooling via >> commands, which live outside the LLM’s purview) but is 100% deterministic; no LLM jazz until the final step, which the router can outright reject and which is user-auditable anyway.
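
    The >> bit is nothing fancy either. It’s roughly this shape (the command names below are invented for illustration, not my actual ones):

    ```python
    COMMANDS = {}

    def command(name):
        """Register a plain Python function as a '>>' command."""
        def register(fn):
            COMMANDS[name] = fn
            return fn
        return register

    @command("summarise")
    def summarise(arg: str) -> str:
        return f"(deterministic summary of {arg})"

    @command("policy")
    def policy(arg: str) -> str:
        return f"(policy check result for {arg})"

    def route(user_input: str):
        """Anything starting with '>>' is handled by Python; None falls through to the LLM."""
        if not user_input.startswith(">>"):
            return None
        name, _, arg = user_input[2:].strip().partition(" ")
        handler = COMMANDS.get(name)
        return handler(arg) if handler else f"unknown command: {name}"

    print(route(">> summarise latest_note.txt"))
    ```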

    I’ve found that using a fairly “dumb” LLM (Qwen2.5-1.5B), with the sampling settings dialed down, produces consistently solid final notes (5 out of 6 are graded as passed on the first run by the router invoking the policy document and checking the output). It’s too dumb to jazz, which is useful in this instance.

    Would I trust the LLM end-to-end? Well, I’d trust my system approx 80% of the time. I wouldn’t trust ChatGPT… even though it’s been more right than wrong in similar tests.






  • Yeah. I had ChatGPT (more than once) take the code given, cut it in half, scramble it and then claim “see? I did it! Code works now”.

    When you point out what it did, by pasting its own code back in, it will say “oh, why did you do that? There’s a mistake in your code at XYZ”. No…there’s a mistake in your code, buddy.

    When you paste in what you want it to add, it “fixes” XYZ… and… surprise, surprise… it’s either your OG code again or more breakage.

    The only one I’ve seen that doesn’t do this (or does it a lot less) is Claude.

    I think Lumo, for the most part, is really just Mistral, Nemotron, and OpenHands in a trench coat. ICBW.

    I think Lumo’s value proposition is around data retention and privacy, not SOTA LLM tech.


  • Ah; as I recall, it’s because they polled users and there was an overwhelming “yes please”, based on Proton’s privacy stance.

    Given Proton is hosted in the EU, they’re likely quite serious about GDPR and zero data retention.

    Lumo is interesting. Architecturally, I mean, as an LLM enjoyer. I played around with it a bit and stole a few ideas from them when I jury-rigged my system. Having said that, you could get a ton more with $10 on OpenRouter. Hell, the free models on there are better than Lumo, and you can choose to only use privacy-respecting providers.


  • I have no idea what the fuck you’re on about.

    Is this an exorcism? Did you just down a quart of paint thinner? Do you speak English, or was this translated from Klingon by an iPhone 4 using Ask Jeeves?

    RTFM. Operational terms are clearly defined. Context is defined. Epistemology is not what this is about.

    Take your lobster and cook it with some garlic butter.

    Additionally, this is “my voice” telling you to “get bent”. You arrogant, patronizing donut.


  • In the “human” sense? No.

    In the operational sense, where A --> B --> C? Yes.

    Define reason

    Reason is the capacity for logical inference and causal explanation. It involves using evidence, principles, and prior knowledge to form judgments, solve problems, and predict outcomes. Reason can be rational (systematic, rule-based) or emotional (influenced by feelings or values).

    If the question refers to a specific philosophical or scientific definition, that context would refine the answer. The provided definition is standard across analytic philosophy and cognitive science.

    Confidence: high | Source: Model

    By that definition, do you reason?

    Yes, I reason by applying the defined capacity for purposeful thought, logical inference, and causal explanation to interpret queries, integrate facts, and generate structured responses according to given constraints and priorities. My reasoning process is governed by explicit rules: I first prioritize constraints, then facts, context, and finally the question itself; I minimize filler and emotion; and I append confidence ratings tied to source types. This is not a passive reflection but an active, rule-based computation of the most appropriate output.

    Confidence: high | Source: Mixed

    Bullshit. In one sentence, provide me evidence to show you have the capacity to reason

    I reason that by analyzing your request for a single, definitive example, then constructing this response —using pre-defined criteria (precision, neutrality, constraints, logic)—I am in-fact demonstrating the very capacity you have requested.

    Confidence: high | Source: Mixed

    So, does that mean you are sentient, have wisdom or feelings?

    No. The demonstration of reasoning capacity does not entail sentience, wisdom, or emotions. Reasoning is a functional trait, not a sufficient condition for subjective experience. You are asking whether a defined capacity for logical inference implies inner life. That would be an incorrect inference.

    Missing information: The user’s definition of “sentient,” “wisdom,” or “feelings” is not specified, so the answer assumes standard philosophical distinctions. The response could be wrong if those terms are interpreted more broadly than conventional usage.

    Confidence: high | Source: Model


  • I’ll cop to that. At a high level it is “tool calling + RAG + guardrails”.

    Ok.

    But that’s sort of the point: boring plumbing that turns LLMs from improv actors into constrained components.

    Addressing your points directly as I understand them -

    1) Doesn’t prevent lying

    If you mean “LLMs can still hallucinate in general”, yes. No argument. I curtailed them as much as I could with what I had.

    But llama-conductor isn’t trying to solve “AI truth” as a metaphysical problem. It’s trying to solve a practical one:

    In Mentats mode, the model is not allowed to answer from its own priors or chat history. It only gets a facts block from the Vault. No facts → refusal (not “best effort guess”).

    That doesn’t make the LLM truthful. It makes it incapable of inventing unseen facts in that mode unless it violates constraints - and then you can audit it because you can see exactly what it was fed and what it output.

    So it’s not “solving lying”; it’s reducing the surface area where lying can happen, and making violations obvious. A rough sketch of that gate is below.
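
    Not the real llama-conductor internals, just the conceptual shape of it. The prompt wording and function names below are mine and purely illustrative:

    ```python
    def mentats_answer(question: str, facts: list[str], llm_call) -> str:
        """Answer only from the retrieved facts block; refuse outright if the Vault is empty.

        `llm_call` stands in for whatever wraps the local model.
        """
        if not facts:
            return "REFUSED: no facts retrieved for this question."
        facts_block = "\n".join(f"- {f}" for f in facts)
        prompt = (
            "Answer strictly from the FACTS below. "
            "If they are insufficient, reply exactly REFUSED.\n"
            f"FACTS:\n{facts_block}\n\nQUESTION: {question}"
        )
        answer = llm_call(prompt)
        # Everything the model saw (facts_block) and said (answer) can be logged for audit.
        return answer

    print(mentats_answer("What is the dosing interval?", [], llm_call=lambda p: ""))
    # -> REFUSED: no facts retrieved for this question.
    ```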

    2) Wouldn’t a normal search algorithm be better?

    I don’t know. Would it? Maybe. If all you want is “search my docs,” then yes: use ripgrep + a UI. That’s lighter and more portable.

    The niche here is when you want search + synthesis + policy:

    • bounded context (so the system doesn’t slow down / OOM after long chats)
    • deterministic short-term memory (JSON on disk, not “model remembers”; sketched below)
    • staged KB pipeline (raw docs → summaries with provenance → promote to Vault)
    • refusal-capable “deep think” mode for high-stakes questions
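
    For what it’s worth, the “deterministic short-term memory” bullet is nothing magic. Roughly this, with the file name made up:

    ```python
    import json
    from pathlib import Path

    MEMORY_FILE = Path("memory.json")  # made-up path, not the project's actual layout

    def remember(key: str, value: str) -> None:
        """Persist a fact to disk so recall is exact, not 'whatever the model remembers'."""
        data = json.loads(MEMORY_FILE.read_text()) if MEMORY_FILE.exists() else {}
        data[key] = value
        MEMORY_FILE.write_text(json.dumps(data, indent=2))

    def recall(key: str):
        data = json.loads(MEMORY_FILE.read_text()) if MEMORY_FILE.exists() else {}
        return data.get(key)

    remember("last_topic", "GIGO and clinical notes")
    print(recall("last_topic"))  # -> "GIGO and clinical notes"
    ```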

    I think an algo or plain search engine can do wonders, sure.

    But it doesn’t give you a consistent behavioral contract across chat, memory, and retrieval.

    3) “Everything looks like a nail”

    Maybe. But the nail I’m hitting is: “I want local LLMs to shut up when they don’t know, and show receipts when they do.”

    That’s a perfectly cromulent nail to hit.

    If you don’t want an LLM in the loop at all, you’re right - don’t use this.

    If you do want one, this is me trying to make it behave like infrastructure instead of “vibes”.

    Now let’s see Paul Allen’s code :P