
Fix benchmark scoring, classification, and Keycloak userinfo roles

Benchmark (03_benchmark.yml, all.yml):
- Add has_code_block and has_import signals to coding heuristic; reduce
  has_assert/has_test_def weights so debug/refactor prompts aren't penalised
- Add has_list and has_detail to general heuristic; cut length_score weight
  from 0.60 to 0.35 to reduce verbosity dominance
- Replace composite-delta classification (always ~0) with 3-tier logic:
  override dict -> raw quality delta -> name pattern (coder/codestral/etc.)
- Lower toks_norm ceiling 30 -> 22 to match observed Dell M630 hardware max
- Add model_category_overrides variable for manual classification escape hatch
- Result: deepseek-coder-v2 and qwen2.5-coder:7b now correctly land in
  slots 3/4 (coding); duplicate llama3.2:3b in slot 3 eliminated

Keycloak (05_keycloak.yml):
- Add oidc-usermodel-realm-role-mapper to open-webui client so realm_access.roles
  is included in the userinfo endpoint response; fixes Open WebUI resetting
  OIDC users to pending on every login despite having ai-admin in Keycloak
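The 3-tier classification described above can be sketched in Python (the real logic lives in Jinja inside 03_benchmark.yml; the function name and signature here are illustrative only):

```python
# Illustrative re-implementation of the 3-tier category logic from
# 03_benchmark.yml; classify_model is a hypothetical helper name.
def classify_model(model, coding_avg, general_avg,
                   overrides=None, threshold=0.10):
    overrides = overrides or {}
    # Tier 1: explicit override dict (model_category_overrides)
    if overrides.get(model) in ('coding', 'general'):
        return overrides[model]
    # Tier 2: raw quality delta. The old composite delta was always ~0
    # because both composites share the same speed and latency terms.
    if coding_avg - general_avg >= threshold:
        return 'coding'
    # Tier 3: name-pattern fallback for known coding-model families
    if any(p in model.lower()
           for p in ('coder', 'codestral', 'codellama', 'starcoder')):
        return 'coding'
    return 'general'
```

With the measured qualities from this run, deepseek-coder-v2 falls through to the name-pattern tier (its general quality is higher than its coding quality), while llama3.2:3b stays general.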
Shaun Arman, 6 days ago
commit 75e9ea03bc

+ 41 - 0
benchmarks/results/benchmark_20260307T170059.md

@@ -0,0 +1,41 @@
+# Benchmark Results - 20260307T170059
+
+## Model Selection
+| Slot | Role | Model | Composite Score |
+|------|------|-------|----------------|
+| 1 | General (Primary) | llama3.2:3b | 0.967 |
+| 2 | General (Secondary) | llama3.2:3b | 0.967 |
+| 3 | Coding (Primary) | deepseek-coder-v2 | 0.738 |
+| 4 | Coding (Secondary) | qwen2.5-coder:7b | 0.63 |
+
+## Detailed Metrics
+### deepseek-coder-v2
+- **Category**: coding
+- **Coding Quality**: 0.667
+- **General Quality**: 0.918
+- **Avg Tokens/sec**: 20.2
+- **Latency (ms)**: 1744.5
+- **Coding Composite**: 0.738
+- **General Composite**: 0.852
+### qwen2.5-coder:7b
+- **Category**: coding
+- **Coding Quality**: 0.64
+- **General Quality**: 0.922
+- **Avg Tokens/sec**: 11.2
+- **Latency (ms)**: 1211.5
+- **Coding Composite**: 0.63
+- **General Composite**: 0.757
+### llama3.2:3b
+- **Category**: general
+- **Coding Quality**: 0.607
+- **General Quality**: 0.991
+- **Avg Tokens/sec**: 22.5
+- **Latency (ms)**: 576.1
+- **Coding Composite**: 0.794
+- **General Composite**: 0.967
+
+## Scoring Formula
+- Composite = quality * 0.45 + token_speed_normalized * 0.30 + latency_score * 0.25
+- Speed normalized against 22 tok/sec ceiling (hardware-observed max)
+- Coding quality: has_def×0.20 + has_return×0.20 + has_docstring×0.15 + has_type_hint×0.15 + has_code_block×0.10 + has_assert×0.08 + has_test_def×0.07 + has_import×0.05
+- Category: override dict → quality delta (coding_avg - general_avg >= 0.1) → name pattern (coder/codestral/codellama/starcoder) → general

+ 65 - 64
benchmarks/results/model_selection.json

@@ -1,89 +1,90 @@
 {
     "all_metrics": {
         "deepseek-coder-v2": {
-            "avg_tok_per_sec": 19.8,
-            "category": "general",
-            "coding_composite": 0.602,
-            "coding_quality": 0.55,
-            "general_composite": 0.781,
-            "general_quality": 0.948,
-            "latency_ms": 1875.8,
-            "latency_score": 0.625,
-            "toks_norm": 0.661
+            "avg_tok_per_sec": 20.2,
+            "category": "coding",
+            "coding_composite": 0.738,
+            "coding_quality": 0.667,
+            "general_composite": 0.852,
+            "general_quality": 0.918,
+            "latency_ms": 1744.5,
+            "latency_score": 0.651,
+            "toks_norm": 0.919
         },
         "llama3.2:3b": {
-            "avg_tok_per_sec": 21.8,
+            "avg_tok_per_sec": 22.5,
             "category": "general",
-            "coding_composite": 0.748,
-            "coding_quality": 0.7,
-            "general_composite": 0.86,
-            "general_quality": 0.949,
-            "latency_ms": 697.1,
-            "latency_score": 0.861,
-            "toks_norm": 0.728
+            "coding_composite": 0.794,
+            "coding_quality": 0.607,
+            "general_composite": 0.967,
+            "general_quality": 0.991,
+            "latency_ms": 576.1,
+            "latency_score": 0.885,
+            "toks_norm": 1.0
         },
         "qwen2.5-coder:7b": {
-            "avg_tok_per_sec": 12.3,
-            "category": "general",
-            "coding_composite": 0.518,
-            "coding_quality": 0.6,
-            "general_composite": 0.65,
-            "general_quality": 0.895,
-            "latency_ms": 2501.0,
-            "latency_score": 0.5,
-            "toks_norm": 0.41
+            "avg_tok_per_sec": 11.2,
+            "category": "coding",
+            "coding_composite": 0.63,
+            "coding_quality": 0.64,
+            "general_composite": 0.757,
+            "general_quality": 0.922,
+            "latency_ms": 1211.5,
+            "latency_score": 0.758,
+            "toks_norm": 0.509
         }
     },
-    "coding_ranking": [],
-    "general_ranking": [
+    "coding_ranking": [
         {
-            "composite": 0.86,
+            "composite": 0.738,
             "metrics": {
-                "avg_tok_per_sec": 21.8,
-                "category": "general",
-                "coding_composite": 0.748,
-                "coding_quality": 0.7,
-                "general_composite": 0.86,
-                "general_quality": 0.949,
-                "latency_ms": 697.1,
-                "latency_score": 0.861,
-                "toks_norm": 0.728
+                "avg_tok_per_sec": 20.2,
+                "category": "coding",
+                "coding_composite": 0.738,
+                "coding_quality": 0.667,
+                "general_composite": 0.852,
+                "general_quality": 0.918,
+                "latency_ms": 1744.5,
+                "latency_score": 0.651,
+                "toks_norm": 0.919
             },
-            "name": "llama3.2:3b"
+            "name": "deepseek-coder-v2"
         },
         {
-            "composite": 0.781,
+            "composite": 0.63,
             "metrics": {
-                "avg_tok_per_sec": 19.8,
-                "category": "general",
-                "coding_composite": 0.602,
-                "coding_quality": 0.55,
-                "general_composite": 0.781,
-                "general_quality": 0.948,
-                "latency_ms": 1875.8,
-                "latency_score": 0.625,
-                "toks_norm": 0.661
+                "avg_tok_per_sec": 11.2,
+                "category": "coding",
+                "coding_composite": 0.63,
+                "coding_quality": 0.64,
+                "general_composite": 0.757,
+                "general_quality": 0.922,
+                "latency_ms": 1211.5,
+                "latency_score": 0.758,
+                "toks_norm": 0.509
             },
-            "name": "deepseek-coder-v2"
-        },
+            "name": "qwen2.5-coder:7b"
+        }
+    ],
+    "general_ranking": [
         {
-            "composite": 0.65,
+            "composite": 0.967,
             "metrics": {
-                "avg_tok_per_sec": 12.3,
+                "avg_tok_per_sec": 22.5,
                 "category": "general",
-                "coding_composite": 0.518,
-                "coding_quality": 0.6,
-                "general_composite": 0.65,
-                "general_quality": 0.895,
-                "latency_ms": 2501.0,
-                "latency_score": 0.5,
-                "toks_norm": 0.41
+                "coding_composite": 0.794,
+                "coding_quality": 0.607,
+                "general_composite": 0.967,
+                "general_quality": 0.991,
+                "latency_ms": 576.1,
+                "latency_score": 0.885,
+                "toks_norm": 1.0
             },
-            "name": "qwen2.5-coder:7b"
+            "name": "llama3.2:3b"
         }
     ],
     "slot1_general": "llama3.2:3b",
-    "slot2_general": "deepseek-coder-v2",
-    "slot3_coding": "llama3.2:3b",
-    "slot4_coding": "none"
+    "slot2_general": "llama3.2:3b",
+    "slot3_coding": "deepseek-coder-v2",
+    "slot4_coding": "qwen2.5-coder:7b"
 }

+ 6 - 1
inventory/group_vars/all.yml

@@ -81,9 +81,14 @@ benchmark_thresholds:
   min_quality_score: 0.6
   min_composite_score: 0.55
 
-benchmark_toks_norm_ceiling: 30     # Realistic max tok/sec for CPU-only inference
+benchmark_toks_norm_ceiling: 22     # Observed hardware max on Dell M630 (21.8 tok/sec measured)
 benchmark_coding_threshold: 0.10    # Delta to classify a model as coding-specialized
 
+# Explicit category overrides applied before heuristics. Keys are model names as
+# returned by `ollama list`. Valid values: 'coding' or 'general'.
+# Example: { "deepseek-coder-v2": "coding", "qwen2.5-coder:7b": "coding" }
+model_category_overrides: {}
+
 # Candidate models to recommend/pull if benchmark scores are below threshold
 candidate_models:
   - name: "qwen2.5-coder:32b-instruct-q4_K_M"

+ 19 - 6
playbooks/03_benchmark.yml

@@ -170,13 +170,17 @@
           {%         set has_test_def = 1 if 'def test_' in response_text else 0 %}
           {%         set has_docstring = 1 if '"""' in response_text else 0 %}
           {%         set has_type_hint = 1 if ' -> ' in response_text else 0 %}
-          {%         set quality = (has_def * 0.20 + has_return * 0.20 + has_assert * 0.15 + has_test_def * 0.15 + has_docstring * 0.15 + has_type_hint * 0.15) %}
+          {%         set has_code_block = 1 if '```' in response_text else 0 %}
+          {%         set has_import = 1 if ('import ' in response_text or 'from ' in response_text) else 0 %}
+          {%         set quality = (has_def * 0.20 + has_return * 0.20 + has_docstring * 0.15 + has_type_hint * 0.15 + has_code_block * 0.10 + has_assert * 0.08 + has_test_def * 0.07 + has_import * 0.05) %}
           {%         set ns2.coding_quality = ns2.coding_quality + quality %}
           {%         set ns2.coding_count = ns2.coding_count + 1 %}
           {%       elif test_name in ['explain', 'creative', 'reasoning'] %}
           {%         set length_score = [resp_len / 800.0, 1.0] | min %}
           {%         set has_structure = 1 if ('\n' in response_text and resp_len > 100) else 0 %}
-          {%         set quality = (length_score * 0.6 + has_structure * 0.4) %}
+          {%         set has_list = 1 if ('\n- ' in response_text or '\n* ' in response_text or '\n1.' in response_text) else 0 %}
+          {%         set has_detail = 1 if '\n\n' in response_text else 0 %}
+          {%         set quality = (length_score * 0.35 + has_structure * 0.40 + has_list * 0.15 + has_detail * 0.10) %}
           {%         set ns2.general_quality = ns2.general_quality + quality %}
           {%         set ns2.general_count = ns2.general_count + 1 %}
           {%       endif %}
@@ -191,7 +195,16 @@
           {%   set latency_score = [1.0 - (latency_ms / 5000.0), 0] | max %}
           {%   set coding_composite = coding_avg * 0.45 + toks_norm * 0.30 + latency_score * 0.25 %}
           {%   set general_composite = general_avg * 0.45 + toks_norm * 0.30 + latency_score * 0.25 %}
-          {%   set category = 'coding' if (coding_composite - general_composite) >= benchmark_coding_threshold else 'general' %}
+          {%   set _override = (model_category_overrides | default({}))[model] | default('') %}
+          {%   if _override in ['coding', 'general'] %}
+          {%     set category = _override %}
+          {%   elif (coding_avg - general_avg) >= benchmark_coding_threshold %}
+          {%     set category = 'coding' %}
+          {%   elif 'coder' in model | lower or 'codestral' in model | lower or 'codellama' in model | lower or 'starcoder' in model | lower %}
+          {%     set category = 'coding' %}
+          {%   else %}
+          {%     set category = 'general' %}
+          {%   endif %}
           {%   set _ = ns.results.update({model: {'coding_quality': coding_avg | round(3), 'general_quality': general_avg | round(3), 'avg_tok_per_sec': avg_toks | round(1), 'toks_norm': toks_norm | round(3), 'latency_ms': latency_ms | round(1), 'latency_score': latency_score | round(3), 'coding_composite': coding_composite | round(3), 'general_composite': general_composite | round(3), 'category': category}}) %}
           {% endfor %}
           {{ ns.results | to_json }}
@@ -279,9 +292,9 @@
 
           ## Scoring Formula
           - Composite = quality * 0.45 + token_speed_normalized * 0.30 + latency_score * 0.25
-          - Speed normalized against {{ benchmark_toks_norm_ceiling }} tok/sec ceiling
-          - Coding quality: has_def×0.20 + has_return×0.20 + has_assert×0.15 + has_test_def×0.15 + has_docstring×0.15 + has_type_hint×0.15
-          - Category: coding if (coding_composite - general_composite) >= {{ benchmark_coding_threshold }}, else general
+          - Speed normalized against {{ benchmark_toks_norm_ceiling }} tok/sec ceiling (hardware-observed max)
+          - Coding quality: has_def×0.20 + has_return×0.20 + has_docstring×0.15 + has_type_hint×0.15 + has_code_block×0.10 + has_assert×0.08 + has_test_def×0.07 + has_import×0.05
+          - Category: override dict → quality delta (coding_avg - general_avg >= {{ benchmark_coding_threshold }}) → name pattern (coder/codestral/codellama/starcoder) → general
         dest: "{{ benchmark_results_dir }}/benchmark_{{ benchmark_timestamp }}.md"
         mode: "0644"
       delegate_to: localhost

+ 36 - 0
playbooks/05_keycloak.yml

@@ -245,6 +245,42 @@
       tags:
         - keycloak-client
 
+    # ── Add realm roles → userinfo protocol mapper ──────────────────
+    - name: "Keycloak | Get open-webui client internal ID"
+      ansible.builtin.uri:
+        url: "{{ keycloak_base_url }}/admin/realms/{{ keycloak_realm }}/clients?clientId=open-webui"
+        method: GET
+        headers:
+          Authorization: "Bearer {{ kc_token }}"
+        status_code: 200
+      register: openwebui_client_info
+      tags:
+        - keycloak-client
+
+    - name: "Keycloak | Add realm roles userinfo mapper to open-webui client"
+      ansible.builtin.uri:
+        url: "{{ keycloak_base_url }}/admin/realms/{{ keycloak_realm }}/clients/{{ openwebui_client_info.json[0].id }}/protocol-mappers/models"
+        method: POST
+        headers:
+          Authorization: "Bearer {{ kc_token }}"
+          Content-Type: application/json
+        body_format: json
+        body:
+          name: realm-roles-userinfo
+          protocol: openid-connect
+          protocolMapper: oidc-usermodel-realm-role-mapper
+          config:
+            claim.name: realm_access.roles
+            jsonType.label: String
+            multivalued: "true"
+            userinfo.token.claim: "true"
+            id.token.claim: "true"
+            access.token.claim: "true"
+        status_code: [201, 409]
+      when: openwebui_client_info.json | length > 0
+      tags:
+        - keycloak-client
+
     # ── Create realm roles ───────────────────────────────────────────
     - name: "Keycloak | Create ai-user role"
       ansible.builtin.uri: