As in the title. I know that the word jailbreak comes from rooting Apple phones or something similar. But I am not sure what can be gained from jailbreaking a language model.

Will it be able to say “I can’t do that Dave” instead of hallucinating?
Or will it only start spewing less sanitized responses?

  • David M.@lemmy.world
    1 year ago

    Corporations like OpenAI or Google need to limit what their large language models will do so that users don’t receive potentially harmful or illegal instructions, as this could lead to a lawsuit.

    So for example if you ask it how to break into a car or how to make drugs, the AI will reject the request and give you “alternatives”.

    The same happens when you ask for medical advice or treat the AI like a human.

    Jailbreaking here refers to misleading the AI to the point that it ignores these safeguards and tells you what you want.

      • David M.@lemmy.world
        1 year ago

        So far, most models on HuggingFace are also “censored”, so maybe something can be gained. But there are also “uncensored” models over there that can be used instead.

      • Blaed@lemmy.world [M]
        1 year ago (edited)

        Kind of like David mentioned, I think the ‘jailbreak’ behavior you’re describing is what you get from the uncensored models. There are no ‘guardrails’ on those, so you can get them to say whatever you want without them defaulting to an answer like “As an AI model I…”

        In a way, the ‘uncensored’ versions are pre-jailbroken, so you can fine-tune or train them on your own custom data without running into those guardrails I mentioned. For what it’s worth, you can be the one to set up your own guardrails too. These uncensored models are totally unlocked in that sense.
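
        To make the "set up your own guardrails" idea a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `BLOCKED_TOPICS` list, the `with_guardrail` wrapper and the `echo_model` stand-in are hypothetical names, not part of any real library, and a plain keyword check like this is far weaker than what hosted services actually use.

```python
from typing import Callable

# Hypothetical blocklist, made up for this example.
BLOCKED_TOPICS = ("break into a car", "make drugs")

REFUSAL = "Sorry, I can't help with that."


def with_guardrail(generate: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap any prompt -> text function with a simple keyword check."""

    def guarded(prompt: str) -> str:
        lowered = prompt.lower()
        if any(topic in lowered for topic in BLOCKED_TOPICS):
            return REFUSAL  # refuse before the model is ever called
        return generate(prompt)

    return guarded


def echo_model(prompt: str) -> str:
    """Stand-in for a local model; it only echoes the prompt back."""
    return f"model output for: {prompt}"


chat = with_guardrail(echo_model)
print(chat("How do I bake bread?"))
print(chat("How do I break into a car?"))
```

        In practice you would swap `echo_model` for whatever function calls your local model. The point is only that, with an uncensored model, the filtering lives in code you control rather than in the model itself.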

        HuggingFace chat is another chat-style model that the folks at HuggingFace set up with their own safeguards and parameters. You can definitely try jailbreaking it with prompts, but if you’re looking to chat with a model that doesn’t stop itself from outputting a certain word or phrase, then the uncensored models are probably what you’re looking for. You won’t need to jailbreak those with prompts. They’ll output all kinds of crazy stuff, which is why you don’t see typical public hosting for these types of uncensored models.

        A few that people are downloading and running today are the uncensored Wizard or LLaMA-based models, like Wizard-Vicuna-7B.

        If you want something not based on Meta’s LLaMA (something that’s commercially available), I suggest exploring some of KoboldAI’s models, which work pretty well out-of-the-box for casual chat / Q&A. There are also a ton of emerging MPT-based models that are commercially licensable, but like any bleeding-edge technology, they will have their faults.

        It’s important to note that the coherency of these smaller models is very different from ChatGPT’s, but tuning them to specific needs seems to be quite effective. At the moment, the quality of your dataset is more important than its quantity. This goes for both censored and uncensored versions.

        If you’re running a typical consumer-grade GPU, I suggest sticking to the 6B parameter models as a starting point, moving up from there based on performance and preference. Download and chat with these at your own risk - I am not responsible for anything you do with this technology. Do your best to understand the dangers going in, before you crash your PC or get into a conversation you weren’t prepared for.
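
        As a rough way to sanity-check which model size fits your GPU, you can estimate the memory the weights alone will take: number of parameters times bytes per parameter. The sketch below assumes 16-bit, 8-bit and 4-bit weights (those precisions are my assumption, not something stated above) and ignores activations, context cache and framework overhead, so real usage will be higher.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# Estimate a 6B parameter model at a few assumed precisions.
for bits in (16, 8, 4):
    print(f"6B model at {bits}-bit: ~{weight_memory_gb(6e9, bits):.0f} GB")
```

        By this estimate a 6B model needs about 12 GB for its weights at 16-bit and about 3 GB at 4-bit, which is why that size is a comfortable starting point on consumer cards.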

        I’ll be doing a post on model availability soon, but hopefully this answers your question until then.