feat: add PassGPT model fine-tuning and training menu integration

Add ability to fine-tune PassGPT models on custom password wordlists.
Models save locally to ~/.hate_crack/passgpt/ with no data uploaded to
HuggingFace (push_to_hub=False, HF_HUB_DISABLE_TELEMETRY=1). The
PassGPT menu now shows available models (default + local fine-tuned)
and a training option. Adds datasets to [ml] deps and passgptTrainingList
config key.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Justin Bollinger
2026-02-18 09:51:06 -05:00
parent 4a7f0724d9
commit 56aaa9b47d
8 changed files with 524 additions and 28 deletions


@@ -325,15 +325,18 @@ chmod +x .git/hooks/pre-push
### Optional Dependencies
The optional `[ml]` group includes ML/AI features required for the PassGPT attack:
- **torch** - PyTorch deep learning framework (for PassGPT attack and training)
- **transformers** - HuggingFace transformers library (for GPT-2 models)
- **datasets** - HuggingFace datasets library (for fine-tuning support)
Install with:
```bash
uv pip install -e ".[ml]"
```
PassGPT (option 17) will be hidden from the menu if ML dependencies are not installed.
### Dev Dependencies
The optional `[dev]` group includes:
@@ -721,7 +724,9 @@ Uses the Ordered Markov ENumerator (OMEN) to train a statistical password model
* Model files are stored in `~/.hate_crack/omen/` for persistence across sessions
#### PassGPT Attack
Uses PassGPT, a GPT-2 based password generator trained on leaked password datasets, to generate candidate passwords. PassGPT produces higher-quality candidates than traditional Markov models by leveraging transformer-based language modeling. You can use the default HuggingFace model or fine-tune a custom model on your own password wordlist.
**Requirements:** ML dependencies must be installed separately (the menu item is hidden until they are):
```bash
@@ -734,24 +739,60 @@ This installs PyTorch and HuggingFace Transformers. GPU acceleration (CUDA/MPS)
- `passgptModel` - HuggingFace model name (default: `javirandor/passgpt-10characters`)
- `passgptMaxCandidates` - Maximum candidates to generate (default: 1000000)
- `passgptBatchSize` - Generation batch size (default: 1024)
- `passgptTrainingList` - Default wordlist for fine-tuning (default: `rockyou.txt`)
**Supported models:**
- `javirandor/passgpt-10characters` - Trained on passwords up to 10 characters (default)
- `javirandor/passgpt-16characters` - Trained on passwords up to 16 characters
- Any compatible GPT-2 model on HuggingFace
- Locally fine-tuned models (stored in `~/.hate_crack/passgpt/`)
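The menu's discovery of local models can be sketched as follows: any subdirectory of the model directory containing a `config.json` counts as a usable model (a minimal illustration; the helper name here is not the actual hate_crack API):

```python
# Sketch: a local fine-tuned model is any subdirectory of
# ~/.hate_crack/passgpt/ that contains a config.json file.
import os

def list_local_models(model_dir: str) -> list[str]:
    # No directory yet means no local models have been trained.
    if not os.path.isdir(model_dir):
        return []
    return sorted(
        os.path.join(model_dir, entry)
        for entry in os.listdir(model_dir)
        if os.path.isfile(os.path.join(model_dir, entry, "config.json"))
    )
```

Any directory path returned here can be passed wherever a HuggingFace model name is accepted.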
**Training a Custom Model:**
When you select the PassGPT Attack (option 17), the menu presents:
- List of available models (default HF model + any locally fine-tuned models)
- Option (T) to train a new model on a custom wordlist
Fine-tuned models are automatically saved to `~/.hate_crack/passgpt/<name>/` for reuse.
To train a new model:
1. Select option (T) from the model selection menu
2. Choose a training wordlist (supports tab-complete file selection)
3. Optionally specify a base model (defaults to configured `passgptModel`)
4. Training will fine-tune the model on your wordlist and save it locally
Fine-tuned models can be reused in future cracking sessions and appear in the model selection menu alongside the default models.
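The saved model's directory name is derived from the training wordlist's filename, roughly as sketched below (illustrative function name; unsafe characters are replaced with underscores):

```python
# Sketch: derive a filesystem-safe model directory name from the
# training wordlist's base name.
import os

def model_dir_name(training_file: str) -> str:
    base = os.path.splitext(os.path.basename(training_file))[0]
    # Keep alphanumerics, '-' and '_'; replace everything else with '_'.
    return "".join(c if c.isalnum() or c in "-_" else "_" for c in base)
```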
**Apple Silicon (MPS) Performance Notes:**
- Batch size is automatically capped at 64 to prevent memory errors on MPS devices
- GPU memory watermark ratios are configured for stability (50% high, 30% low)
- Specify `--device cpu` to force CPU generation if MPS has issues
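The MPS safeguards above can be sketched as follows (illustrative helper names, not the module's actual internals):

```python
# Sketch of the MPS stability safeguards: watermark ratios via
# environment variables, and a hard cap on the generation batch size.
import os

MPS_BATCH_SIZE_CAP = 64  # larger batches risk out-of-memory errors on MPS

def configure_mps_env() -> None:
    # setdefault keeps any user-supplied override in place.
    os.environ.setdefault("PYTORCH_MPS_HIGH_WATERMARK_RATIO", "0.5")
    os.environ.setdefault("PYTORCH_MPS_LOW_WATERMARK_RATIO", "0.3")

def effective_batch_size(requested: int, device: str) -> int:
    # Only MPS needs the cap; CUDA and CPU use the requested size.
    return min(requested, MPS_BATCH_SIZE_CAP) if device == "mps" else requested
```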
**Standalone usage:**
Generate candidates:
```bash
python -m hate_crack.passgpt_generate --num 1000 --model javirandor/passgpt-10characters
```
Fine-tune a custom model:
```bash
python -m hate_crack.passgpt_train --training-file wordlist.txt --output-dir ~/.hate_crack/passgpt/my_model
```
**Generator command-line options:**
- `--num` - Number of candidates to generate (default: 1000000)
- `--model` - HuggingFace model name or local path (default: javirandor/passgpt-10characters)
- `--batch-size` - Generation batch size (default: 1024)
- `--max-length` - Max token length including special tokens (default: 12)
- `--device` - Device: cuda, mps, or cpu (default: auto-detect)
**Training command-line options:**
- `--training-file` - Path to password wordlist for fine-tuning (required)
- `--output-dir` - Directory to save the fine-tuned model (required)
- `--base-model` - Base HuggingFace model to fine-tune (default: javirandor/passgpt-10characters)
- `--epochs` - Number of training epochs (default: 3)
- `--batch-size` - Training batch size (default: 8)
- `--device` - Device: cuda, mps, or cpu (default: auto-detect)
#### Download Rules from Hashmob.net
Downloads the latest rule files from Hashmob.net's rule repository. These rules are curated and optimized for password cracking and can be used with the Quick Crack and Loopback Attack modes.
@@ -789,8 +830,9 @@ Version 2.0+
- Added automatic update checks on startup (check_for_updates config option)
- Added `packaging` dependency for version comparison
- Added PassGPT Attack (option 17) using GPT-2 based ML password generation
- Added PassGPT fine-tuning capability for custom password models
- Added PassGPT configuration keys (passgptModel, passgptMaxCandidates, passgptBatchSize, passgptTrainingList)
- Added `[ml]` optional dependency group for PyTorch, Transformers, and Datasets
- Added OMEN Attack (option 16) using statistical model-based password generation
- Added OMEN configuration keys (omenTrainingList, omenMaxCandidates)
- Added LLM Attack (option 15) using Ollama for AI-generated password candidates


@@ -30,5 +30,6 @@
"passgptModel": "javirandor/passgpt-10characters",
"passgptMaxCandidates": 1000000,
"passgptBatchSize": 1024,
"passgptTrainingList": "rockyou.txt",
"check_for_updates": true
}


@@ -534,14 +534,62 @@ def passgpt_attack(ctx: Any) -> None:
print("\n\tPassGPT requires ML dependencies. Install them with:")
print('\t uv pip install -e ".[ml]"')
return
# Build model choices: default HF model + any local fine-tuned models
default_model = ctx.passgptModel
models = [(default_model, f"{default_model} (default)")]
model_dir = ctx._passgpt_model_dir()
if os.path.isdir(model_dir):
for entry in sorted(os.listdir(model_dir)):
entry_path = os.path.join(model_dir, entry)
if os.path.isdir(entry_path) and os.path.isfile(
os.path.join(entry_path, "config.json")
):
models.append((entry_path, f"{entry} (local)"))
print("\n\tSelect a model:")
for i, (_, label) in enumerate(models, 1):
print(f"\t ({i}) {label}")
print("\t (T) Train a new model")
choice = input("\n\tChoice: ").strip()
if choice.upper() == "T":
print("\n\tTrain a new PassGPT model")
training_file = ctx.select_file_with_autocomplete(
"Select training wordlist", base_dir=ctx.hcatWordlists
)
if not training_file:
print("\n\tNo training file selected. Aborting.")
return
if isinstance(training_file, list):
training_file = training_file[0]
base = input(f"\n\tBase model ({default_model}): ").strip()
if not base:
base = default_model
result = ctx.hcatPassGPTTrain(training_file, base)
if result is None:
print("\n\tTraining failed. Returning to menu.")
return
model_name = result
else:
try:
idx = int(choice) - 1
if 0 <= idx < len(models):
model_name = models[idx][0]
else:
print("\n\tInvalid selection.")
return
except ValueError:
print("\n\tInvalid selection.")
return
max_candidates = input(
f"\n\tMax candidates to generate ({ctx.passgptMaxCandidates}): "
).strip()
if not max_candidates:
max_candidates = str(ctx.passgptMaxCandidates)
ctx.hcatPassGPT(
ctx.hcatHashType,
ctx.hcatHashFile,


@@ -522,6 +522,15 @@ except KeyError as e:
)
)
passgptBatchSize = int(default_config.get("passgptBatchSize", 1024))
try:
passgptTrainingList = config_parser["passgptTrainingList"]
except KeyError as e:
print(
"{0} is not defined in config.json using defaults from config.json.example".format(
e
)
)
passgptTrainingList = default_config.get("passgptTrainingList", "rockyou.txt")
try:
check_for_updates_enabled = config_parser["check_for_updates"]
except KeyError as e:
@@ -673,6 +682,7 @@ hcatGoodMeasureBaseList = _normalize_wordlist_setting(
)
hcatPrinceBaseList = _normalize_wordlist_setting(hcatPrinceBaseList, wordlists_dir)
omenTrainingList = _normalize_wordlist_setting(omenTrainingList, wordlists_dir)
passgptTrainingList = _normalize_wordlist_setting(passgptTrainingList, wordlists_dir)
if not SKIP_INIT:
# Verify hashcat binary is available
# hcatBin should be in PATH or be an absolute path (resolved from hcatPath + hcatBin if configured)
@@ -2278,6 +2288,55 @@ def hcatOmen(hcatHashType, hcatHashFile, max_candidates):
enum_proc.kill()
# PassGPT model directory - writable location for fine-tuned models.
# Models are saved to ~/.hate_crack/passgpt/<model_name>/.
def _passgpt_model_dir():
model_dir = os.path.join(os.path.expanduser("~"), ".hate_crack", "passgpt")
os.makedirs(model_dir, exist_ok=True)
return model_dir
# PassGPT Attack - Fine-tune a model on a custom wordlist
def hcatPassGPTTrain(training_file, base_model=None):
training_file = os.path.abspath(training_file)
if not os.path.isfile(training_file):
print(f"Error: Training file not found: {training_file}")
return None
if base_model is None:
base_model = passgptModel
# Derive output dir name from training file
basename = os.path.splitext(os.path.basename(training_file))[0]
# Sanitize: replace non-alphanumeric chars with underscores
sanitized = "".join(c if c.isalnum() or c in "-_" else "_" for c in basename)
output_dir = os.path.join(_passgpt_model_dir(), sanitized)
os.makedirs(output_dir, exist_ok=True)
cmd = [
sys.executable,
"-m",
"hate_crack.passgpt_train",
"--training-file",
training_file,
"--base-model",
base_model,
"--output-dir",
output_dir,
]
print(f"[*] Running: {_format_cmd(cmd)}")
proc = subprocess.Popen(cmd)
try:
proc.wait()
except KeyboardInterrupt:
print("Killing PID {0}...".format(str(proc.pid)))
proc.kill()
return None
if proc.returncode == 0:
print(f"PassGPT model training complete. Model saved to: {output_dir}")
return output_dir
else:
print(f"PassGPT training failed with exit code {proc.returncode}")
return None
# PassGPT Attack - Generate candidates with ML model and pipe to hashcat
def hcatPassGPT(
hcatHashType,


@@ -8,8 +8,12 @@ hashcat. Progress and diagnostic messages go to stderr.
from __future__ import annotations
import argparse
import os
import sys
# Disable HuggingFace telemetry before any HF imports
os.environ["HF_HUB_DISABLE_TELEMETRY"] = "1"
_MPS_BATCH_SIZE_CAP = 64

hate_crack/passgpt_train.py (new file, 174 lines)

@@ -0,0 +1,174 @@
"""Fine-tune a PassGPT model on a custom password wordlist.
Invokable as ``python -m hate_crack.passgpt_train``. Progress and
diagnostic messages go to stderr.
"""
from __future__ import annotations
import argparse
import os
import sys
# Disable HuggingFace telemetry before any HF imports
os.environ["HF_HUB_DISABLE_TELEMETRY"] = "1"
def _detect_device() -> str:
import torch
if torch.cuda.is_available():
return "cuda"
if hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
return "mps"
return "cpu"
def _configure_mps() -> None:
"""Set MPS memory limits before torch is imported."""
os.environ.setdefault("PYTORCH_MPS_HIGH_WATERMARK_RATIO", "0.5")
os.environ.setdefault("PYTORCH_MPS_LOW_WATERMARK_RATIO", "0.3")
def train(
training_file: str,
output_dir: str,
base_model: str,
epochs: int,
batch_size: int,
device: str | None,
) -> None:
if device == "mps" or device is None:
_configure_mps()
import torch
from transformers import ( # type: ignore[attr-defined]
GPT2LMHeadModel,
RobertaTokenizerFast,
Trainer,
TrainingArguments,
)
if device is None:
device = _detect_device()
print(f"[*] Loading base model {base_model} on {device}", file=sys.stderr)
tokenizer = RobertaTokenizerFast.from_pretrained(base_model)
model = GPT2LMHeadModel.from_pretrained(base_model).to(device) # type: ignore[arg-type]
print(f"[*] Reading training file: {training_file}", file=sys.stderr)
with open(training_file, encoding="utf-8", errors="replace") as f:
passwords = [line.strip() for line in f if line.strip()]
print(f"[*] Loaded {len(passwords)} passwords", file=sys.stderr)
print("[*] Tokenizing passwords...", file=sys.stderr)
max_length = model.config.n_positions if hasattr(model.config, "n_positions") else 16
encodings = tokenizer(
passwords,
truncation=True,
padding="max_length",
max_length=max_length,
return_tensors="pt",
)
class PasswordDataset(torch.utils.data.Dataset): # type: ignore[type-arg]
def __init__(self, encodings):
self.input_ids = encodings["input_ids"]
self.attention_mask = encodings["attention_mask"]
def __len__(self):
return len(self.input_ids)
def __getitem__(self, idx):
return {
"input_ids": self.input_ids[idx],
"attention_mask": self.attention_mask[idx],
"labels": self.input_ids[idx],
}
dataset = PasswordDataset(encodings)
# Force CPU training unless CUDA is available (Trainer-driven MPS training can be unstable)
use_cpu = device not in ("cuda",)
training_args = TrainingArguments(
output_dir=output_dir,
num_train_epochs=epochs,
per_device_train_batch_size=batch_size,
save_strategy="epoch",
logging_steps=100,
use_cpu=use_cpu,
report_to="none",
push_to_hub=False,
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=dataset,
)
print(
f"[*] Starting training: {epochs} epochs, batch_size={batch_size}, device={device}",
file=sys.stderr,
)
trainer.train()
print(f"[*] Saving model to {output_dir}", file=sys.stderr)
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
print("[*] Training complete.", file=sys.stderr)
def main() -> None:
parser = argparse.ArgumentParser(
description="Fine-tune a PassGPT model on a password wordlist"
)
parser.add_argument(
"--training-file",
type=str,
required=True,
help="Path to the password wordlist for training",
)
parser.add_argument(
"--base-model",
type=str,
default="javirandor/passgpt-10characters",
help="Base HuggingFace model to fine-tune (default: javirandor/passgpt-10characters)",
)
parser.add_argument(
"--output-dir",
type=str,
required=True,
help="Directory to save the fine-tuned model",
)
parser.add_argument(
"--epochs",
type=int,
default=3,
help="Number of training epochs (default: 3)",
)
parser.add_argument(
"--batch-size",
type=int,
default=8,
help="Training batch size (default: 8)",
)
parser.add_argument(
"--device",
type=str,
default=None,
help="Device: cuda, mps, or cpu (default: auto-detect)",
)
args = parser.parse_args()
train(
training_file=args.training_file,
output_dir=args.output_dir,
base_model=args.base_model,
epochs=args.epochs,
batch_size=args.batch_size,
device=args.device,
)
if __name__ == "__main__":
main()


@@ -22,6 +22,7 @@ hate_crack = "hate_crack.__main__:main"
ml = [
"torch>=2.0.0",
"transformers>=4.30.0",
"datasets>=2.14.0",
]
dev = [
"mypy>=1.8.0",


@@ -1,3 +1,4 @@
import os
import sys
from unittest.mock import MagicMock, patch
@@ -78,8 +79,99 @@ class TestHcatPassGPT:
assert "512" in gen_cmd
class TestHcatPassGPTTrain:
def test_builds_correct_subprocess_command(self, main_module, tmp_path):
training_file = tmp_path / "wordlist.txt"
training_file.write_text("password123\nabc456\n")
with patch.object(
main_module, "passgptModel", "javirandor/passgpt-10characters"
), patch("hate_crack.main.subprocess.Popen") as mock_popen:
mock_proc = MagicMock()
mock_proc.returncode = 0
mock_proc.wait.return_value = None
mock_popen.return_value = mock_proc
with patch.object(
main_module,
"_passgpt_model_dir",
return_value=str(tmp_path / "models"),
):
result = main_module.hcatPassGPTTrain(str(training_file))
assert result is not None
assert mock_popen.call_count == 1
cmd = mock_popen.call_args[0][0]
assert cmd[0] == sys.executable
assert "-m" in cmd
assert "hate_crack.passgpt_train" in cmd
assert "--training-file" in cmd
assert str(training_file) in cmd
assert "--base-model" in cmd
assert "javirandor/passgpt-10characters" in cmd
assert "--output-dir" in cmd
def test_missing_training_file(self, main_module, capsys):
result = main_module.hcatPassGPTTrain("/nonexistent/wordlist.txt")
assert result is None
captured = capsys.readouterr()
assert "Training file not found" in captured.out
def test_custom_base_model(self, main_module, tmp_path):
training_file = tmp_path / "wordlist.txt"
training_file.write_text("test\n")
with patch("hate_crack.main.subprocess.Popen") as mock_popen:
mock_proc = MagicMock()
mock_proc.returncode = 0
mock_proc.wait.return_value = None
mock_popen.return_value = mock_proc
with patch.object(
main_module,
"_passgpt_model_dir",
return_value=str(tmp_path / "models"),
):
main_module.hcatPassGPTTrain(
str(training_file), base_model="custom/base-model"
)
cmd = mock_popen.call_args[0][0]
assert "custom/base-model" in cmd
def test_training_failure_returns_none(self, main_module, tmp_path):
training_file = tmp_path / "wordlist.txt"
training_file.write_text("test\n")
with patch.object(
main_module, "passgptModel", "javirandor/passgpt-10characters"
), patch("hate_crack.main.subprocess.Popen") as mock_popen:
mock_proc = MagicMock()
mock_proc.returncode = 1
mock_proc.wait.return_value = None
mock_popen.return_value = mock_proc
with patch.object(
main_module,
"_passgpt_model_dir",
return_value=str(tmp_path / "models"),
):
result = main_module.hcatPassGPTTrain(str(training_file))
assert result is None
class TestPassGPTModelDir:
def test_creates_directory(self, main_module, tmp_path):
with patch("hate_crack.main.os.path.expanduser", return_value=str(tmp_path)):
result = main_module._passgpt_model_dir()
assert os.path.isdir(result)
assert result.endswith("passgpt")
class TestPassGPTAttackHandler:
def _make_ctx(self, model_dir=None):
ctx = MagicMock()
ctx.HAS_ML_DEPS = True
ctx.passgptMaxCandidates = 1000000
@@ -87,8 +179,21 @@ class TestPassGPTAttackHandler:
ctx.passgptBatchSize = 1024
ctx.hcatHashType = "1000"
ctx.hcatHashFile = "/tmp/hashes.txt"
ctx.hcatWordlists = "/tmp/wordlists"
if model_dir is None:
ctx._passgpt_model_dir.return_value = "/nonexistent/empty"
else:
ctx._passgpt_model_dir.return_value = model_dir
return ctx
def test_select_default_model_and_generate(self):
ctx = self._make_ctx()
# "1" selects default model, "" accepts default max candidates
inputs = iter(["1", ""])
with patch("builtins.input", side_effect=inputs), patch(
"hate_crack.attacks.os.path.isdir", return_value=False
):
from hate_crack.attacks import passgpt_attack
passgpt_attack(ctx)
@@ -101,28 +206,70 @@ class TestPassGPTAttackHandler:
batch_size=1024,
)
def test_select_local_model(self, tmp_path):
# Create a fake local model directory
model_dir = tmp_path / "passgpt"
local_model = model_dir / "my_model"
local_model.mkdir(parents=True)
(local_model / "config.json").write_text("{}")
ctx = self._make_ctx(model_dir=str(model_dir))
# "2" selects the local model, "" accepts default max candidates
inputs = iter(["2", ""])
with patch("builtins.input", side_effect=inputs), patch(
"hate_crack.attacks.os.path.isdir", return_value=True
), patch("hate_crack.attacks.os.listdir", return_value=["my_model"]), patch(
"hate_crack.attacks.os.path.isfile", return_value=True
):
from hate_crack.attacks import passgpt_attack
passgpt_attack(ctx)
ctx.hcatPassGPT.assert_called_once()
call_kwargs = ctx.hcatPassGPT.call_args
# The model_name should be the local path
assert call_kwargs[1]["model_name"] == str(local_model)
def test_train_new_model(self):
ctx = self._make_ctx()
ctx.select_file_with_autocomplete.return_value = "/tmp/wordlist.txt"
ctx.hcatPassGPTTrain.return_value = "/home/user/.hate_crack/passgpt/wordlist"
# "T" for train, "" for default base model, "" for default max candidates
inputs = iter(["T", "", ""])
with patch("builtins.input", side_effect=inputs), patch(
"hate_crack.attacks.os.path.isdir", return_value=False
):
from hate_crack.attacks import passgpt_attack
passgpt_attack(ctx)
ctx.hcatPassGPTTrain.assert_called_once_with(
"/tmp/wordlist.txt", "javirandor/passgpt-10characters"
)
ctx.hcatPassGPT.assert_called_once()
call_kwargs = ctx.hcatPassGPT.call_args
assert call_kwargs[1]["model_name"] == "/home/user/.hate_crack/passgpt/wordlist"
def test_train_failure_aborts(self):
ctx = self._make_ctx()
ctx.select_file_with_autocomplete.return_value = "/tmp/wordlist.txt"
ctx.hcatPassGPTTrain.return_value = None
inputs = iter(["T", ""])
with patch("builtins.input", side_effect=inputs), patch(
"hate_crack.attacks.os.path.isdir", return_value=False
):
from hate_crack.attacks import passgpt_attack
passgpt_attack(ctx)
ctx.hcatPassGPTTrain.assert_called_once()
ctx.hcatPassGPT.assert_not_called()
def test_ml_deps_missing(self, capsys):
ctx = MagicMock()
@@ -136,3 +283,23 @@ class TestPassGPTAttackHandler:
assert "ML dependencies" in captured.out
assert "uv pip install" in captured.out
ctx.hcatPassGPT.assert_not_called()
def test_custom_max_candidates(self):
ctx = self._make_ctx()
# "1" selects default model, "500000" for custom max candidates
inputs = iter(["1", "500000"])
with patch("builtins.input", side_effect=inputs), patch(
"hate_crack.attacks.os.path.isdir", return_value=False
):
from hate_crack.attacks import passgpt_attack
passgpt_attack(ctx)
ctx.hcatPassGPT.assert_called_once_with(
"1000",
"/tmp/hashes.txt",
500000,
model_name="javirandor/passgpt-10characters",
batch_size=1024,
)