Categories
AI

Using LLMs for Massive Code Trace Analysis Part 3: Switching from Ollama to Hugging Face

Up to now, I’ve been using Ollama as my platform to research the requirements for processing massive amounts of Intel Processor Trace. But it now seems that I’ll need a more sophisticated development environment to accomplish the model fine-tuning I need to do; so, I’ve switched over to Hugging Face.

In Part 1 and Part 2 of this article series, I worked with the Ollama framework. Those who are familiar with both Ollama and Hugging Face can skip this part: the main difference between the two is that Ollama is a runtime environment for running models locally, while Hugging Face is a collaborative hub for building and sharing AI assets. And since I plan to open-source this work, and Ollama seems more limited in its ability to customize my model, I’m more inclined to work with Hugging Face. A good comparison table is here:

Feature | Ollama | Hugging Face
Primary Role | A local runtime for LLMs | A hub for models, datasets, and tools
Data Privacy | Excellent (all data stays local) | Cloud-based services may process data in the cloud
Latency | Extremely low (no network overhead) | Varies, dependent on network and server load
Scalability | Limited to your local hardware | Virtually unlimited via cloud services
Model Access | A curated list of popular models | Thousands of models across all modalities
Customization | Via simple Modelfile syntax | Via code using libraries like PEFT
Relationship | Often used together; Ollama can pull models from the Hugging Face Hub (GGUF format) and run them locally | The source for most models that Ollama runs

Again, I used Tech with Tim’s work as a foundation for this research. It’s really an excellent channel, one of the better ones on YouTube. Have a look at the following links as references:

LangChain Explained In 15 Minutes – A MUST Learn For Python Programmers

HuggingFace + Langchain | Run 1,000s of FREE AI Models Locally

Hugging Face uses the Transformers Python library to work with its models. As with Ollama, you can download these models to your local computer and run them on your own hardware – subject to the VRAM requirements that I referred to in Part 1. Also, the LangChain Python library provides some advanced features that I’ll need. Finally, the PyCharm application makes working in this environment so much easier. Let’s walk through the steps, and I’ll explain how all these fit together.

The first step is to create a userid on Hugging Face. Simple enough.

Then download and install JetBrains’ PyCharm application, Community Edition. In Tim’s video above, he mentions a promotion for the Professional Edition: three months of the Pro edition with every initial download. I couldn’t find that offer, but it seems that more recently, JetBrains has been offering one month of the Pro edition with every initial download. That’s cool, so I took it. PyCharm has an integration with Hugging Face, as well as support for TensorFlow, git, PyTorch, Jupyter notebooks, etc. It looks great.

I thought that it would be a good idea to set up the environment variable:

Of course, to follow the remaining content, it helps to have a working knowledge of Python, and to have Visual Studio Code installed on your station, with all needed extensions. If you’re not there yet, I recommend this written Visual Studio Code tutorial here: Getting Started with Python in VS Code.

Note that when I first installed PyCharm, it came up with the virtual environment already set up when I created a new Project. This makes life a lot easier by isolating my environment for this Project and avoiding the tricky Python dependencies that used to surface back in the old days. Thank you, new Python!

Once this is all set up, you can speed things up a little by creating a requirements.txt file, which pulls in everything we need for the Transformers library and LangChain. Right-click on the Project name (in my case, I just named it PythonProject) to add the file to the Project, then edit it to add these three lines:

transformers
langchain
langchain-huggingface

And then use pip to install all these modules into this environment:

(.venv) >pip install -r .\requirements.txt

It will run for a few minutes, and finally succeed. We’re almost ready to start coding; first, we need our Hugging Face access token! This is a “keep them honest” way for Hugging Face to ensure that the license terms for the models are honored.

Go back into Hugging Face, click on your profile picture, and click on Access Tokens:

Then click on “Create New Token”, add in a Token name (anything you want), then hit the Read tab, and create it. Save your token in a safe place!

Then we’re ready to start the connection process. From your shell, type in:

(venv) >hf auth login

And you’ll see something like the below (you’ll have to enter your token, but it will remain invisible):

Note: PyCharm accepted the main.py code below from Tim’s video like a charm:

from transformers import pipeline

model = pipeline("summarization", model="facebook/bart-large-cnn")  # downloads the model on first use
response = model("text to summarize")
print(response)  # a list with one dict holding the 'summary_text'

However, when I tried to run it, unlike in Tim’s video, it failed with the following error messages:

C:\Users\alans\PycharmProjects\PythonProject\.venv\Scripts\python.exe C:\Users\alans\PycharmProjects\PythonProject\main.py

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.

Traceback (most recent call last):

  File "C:\Users\alans\PycharmProjects\PythonProject\main.py", line 3, in <module>

    model = pipeline("summarization", model="facebook/bart-large-cnn")

  File "C:\Users\alans\PycharmProjects\PythonProject\.venv\Lib\site-packages\transformers\pipelines\__init__.py", line 1018, in pipeline

    if isinstance(dtype, str) and hasattr(torch, dtype):

                                          ^^^^^

NameError: name 'torch' is not defined

Process finished with exit code 1

Apparently, we also need PyTorch itself; and, to use the GPU on my PC, the NVIDIA CUDA toolkit (and driver). Go to https://developer.nvidia.com/cuda-downloads and initiate the download.

You MUST restart PyCharm for PyTorch to become visible, and to run the model.

Then, in the Terminal window, type in:

nvcc --version

And in my case, I saw:

(.venv) PS C:\Users\alans\PycharmProjects\PythonProject> nvcc --version         
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Wed_Aug_20_13:58:20_Pacific_Daylight_Time_2025
Cuda compilation tools, release 13.0, V13.0.88
Build cuda_13.0.r13.0/compiler.36424714_0

Note that I have version 13.0 of the tools. So this version is used in the next step, installing PyTorch:

>pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130

But this didn’t work! Apparently, as of this writing, PyTorch doesn’t yet support this version of the CUDA toolkit. I got this error message from pip:

Looking in indexes: https://download.pytorch.org/whl/cu130
ERROR: Could not find a version that satisfies the requirement torch (from versions: none)
ERROR: No matching distribution found for torch

I then decided to fall back to the cu128 build, keeping my fingers crossed:

>pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128

And, that successfully installed.
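With PyTorch in place, a quick sanity check (my own sketch, not from Tim’s video) confirms whether the interpreter can now see it, and whether CUDA is usable:

```python
import importlib.util

def torch_status():
    # Report whether PyTorch is importable, and if so, whether CUDA is usable.
    if importlib.util.find_spec("torch") is None:
        return "PyTorch not installed; transformers can only use tokenizers/configs"
    import torch  # safe: we just confirmed the module exists
    return f"PyTorch {torch.__version__}, CUDA available: {torch.cuda.is_available()}"

print(torch_status())
```

If this reports that CUDA is unavailable, the pipeline will still run, just on the CPU.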

Then, it’s a simple matter of running my program, and the output is below:

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`

Device set to use cuda:0

Your max_length is set to 142, but your input_length is only 5. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=2)

[{'summary_text': 'CNN.com will feature iReporter photos in a weekly Travel Snapshots gallery. Visit CNN.com/Travel each week for a new gallery of snapshots. Please share your best shots of the U.S. with CNN iReport. Send your best photos of the United States to jennifer.smith@dailymail.co.uk.'}]

Process finished with exit code 0

Very cool! If you go into the Hugging Face Model card for this model, https://huggingface.co/facebook/bart-large-cnn, you can see that this is a BART (Bidirectional and Auto-Regressive Transformer) model, pre-trained on English, and fine-tuned on the CNN Daily Mail dataset: https://huggingface.co/datasets/abisee/cnn_dailymail. That explains part of the text above.

There’s so much more to write about in this part of my research, but I’ll save the remainder for an upcoming article, where I’ll cover some of the advanced capabilities provided by LangChain, and start closing in on model fine-tuning.

Categories
AI Graphics and Game Programming

AI refactoring of 20+ year-old code from Tricks of the Windows Game Programming Gurus

How good is AI at bringing old code into the 21st century? I’ve undertaken a small project to refactor code from the book Tricks of the Windows Game Programming Gurus, by Andre Lamothe, to see how easy it might be to modernize.

First, some background: back in 2002, I was a General Manager at the Adtech division of Spirent Communications, located in Honolulu, Hawaii. Adtech made telecom test equipment, and I oversaw our division’s P&L through the telecom boom/bust cycle: and what a wild ride that was! It was an extremely stressful job, with long hours and lots of travel; I needed a way to relax in between corporate crises.

Somehow, I came across the legendary Andre Lamothe’s book, Tricks of the Windows Game Programming Gurus, Second Edition. I think that it’s still available on Amazon, if you want to try to get a copy: https://www.amazon.com/Tricks-Windows-Game-Programming-Gurus/dp/0672323699. Although I wasn’t a hardcore programmer at the time, I bought the book, and I was hooked. Andre’s approachable style and great sense of humor made learning some of the basics of Windows game programming easy; there were plenty of source code examples included, and they all compiled nicely using Visual Studio 6 and DirectX 8. I worked my way through the entire book, compiling and running all of his example code, starting with a simple “Hello World” Windows program, and ending with a complete game, “Outpost”.

The book was a groundbreaker at the time: over 1,000 pages of amazing content, I think the first of its kind to teach low-level game programming. We have certainly come a long way since then, with new platforms like Unity, and now even AI writing games on its own. But to learn game design from the bottom up, I can’t think of any better way to have gotten started (at least, back in 2002).

And there’s the rub – the code in Andre’s book is old! 25+ years old, to be precise – the book was published in 2002, but the code on the included CD dates back to 1998 and 1999. All the code was Win32, and it even pre-dated the use of Unicode in Windows. And we’re up to DirectX 12 Ultimate, released in 2020, versus DirectX 8.0. I ran all this code on a PC running Windows ME. Would it be possible to port it to Windows 11 with a modern IDE and code base using AI, and how much guidance would AI need to get it right? I decided to find out.

For these experiments, I’m running on Windows 11, with a fresh install of Visual Studio 2022 Community, and GitHub Copilot.

Let’s start with the first version of a Windows “Hello World” program. The 1998 code for this from Andre’s book looks like the below:

// DEMO2_2.CPP - a simple message box

#define WIN32_LEAN_AND_MEAN

#include <windows.h>        // the main windows headers
#include <windowsx.h>       // a lot of cool macros

// main entry point for all windows programs
int WINAPI WinMain(HINSTANCE hinstance,
                   HINSTANCE hprevinstance,
                   LPSTR lpcmdline,
                   int ncmdshow)
{
// call message box api with NULL for parent window handle
MessageBox(NULL, "THERE CAN BE ONLY ONE!!!",
                 "MY FIRST WINDOWS PROGRAM",
                  MB_OK | MB_ICONEXCLAMATION);

// exit program
return(0);

} // end WinMain

To begin the modernization, open up an Empty Project:

Copy and paste in Andre’s code.

Of course, compiling it generates build errors:

I asked for help from GitHub Copilot:

HelloWorld.cpp is Win32 code that was written in 2002. It has build errors, and needs to be refactored to run on Windows 11 with unicode. Can you modernize this code so that it will compile and run on Windows 11?

Certainly. To modernize this Win32 code for Windows 11 and Unicode support, the following changes are required:

  1. Use the wide-character (Unicode) versions of Win32 APIs by:
    1. Replacing string literals with wide string literals (prefix with L).
    1. Changing the WinMain signature to use LPWSTR for the command line argument.
  2. Ensure the project is set to use Unicode (which is default in modern Visual Studio projects).
  3. No other major changes are needed for such a simple program.

Here is the refactored code:

// DEMO2_2.CPP - a simple message box

#define WIN32_LEAN_AND_MEAN
#include <windows.h>        // the main windows headers
#include <windowsx.h>       // a lot of cool macros

// main entry point for all windows programs (Unicode)
int WINAPI wWinMain(HINSTANCE hinstance,
    HINSTANCE hprevinstance,
    LPWSTR lpcmdline,
    int ncmdshow)
{
    // call message box api with NULL for parent window handle
    MessageBoxW(NULL, L"THERE CAN BE ONLY ONE!!!",
        L"MY FIRST WINDOWS PROGRAM",
        MB_OK | MB_ICONEXCLAMATION);

    // exit program
    return 0;
} // end wWinMain

Copilot then creates a nice diff environment within the Code window:

Use the Tab key selectively and the final code is updated to the below, which is the same that was displayed in the Copilot Chat window:

// DEMO2_2.CPP - a simple message box

#define WIN32_LEAN_AND_MEAN
#include <windows.h>        // the main windows headers
#include <windowsx.h>       // a lot of cool macros

// main entry point for all windows programs (Unicode)
int WINAPI wWinMain(HINSTANCE hinstance,
    HINSTANCE hprevinstance,
    LPWSTR lpcmdline,
    int ncmdshow)
{
    // call message box api with NULL for parent window handle
    MessageBoxW(NULL, L"THERE CAN BE ONLY ONE!!!",
        L"MY FIRST WINDOWS PROGRAM",
        MB_OK | MB_ICONEXCLAMATION);

    // exit program
    return 0;
} // end wWinMain

Compile it again and there’s still an error, with an unresolved external:

Now, this is a common error.

I asked for help from Copilot again:

I have accepted all of your changes, but the compile still fails with LNK2019 and LNK1120 unresolved externals error messages. Can you help me with that?

Yes, I can help. The linker errors LNK2019 and LNK1120 typically mean that the linker cannot find the entry point or other required symbols. In this case, the most common cause is that the project is still configured to use WinMain as the entry point, but your code now uses wWinMain for Unicode.

How to fix:

  1. Ensure the project is set to use Unicode:
    1. In Visual Studio, go to Project > Properties, then navigate to Configuration Properties > General.
    1. Set Character Set to Use Unicode Character Set.
  2. Set the correct entry point:
    1. Go to Configuration Properties > Linker > System, and ensure Subsystem is set to Windows (/SUBSYSTEM:WINDOWS).
    1. If you still get errors, explicitly set the entry point:
      1. Go to Configuration Properties > Linker > Advanced.
      1. Set Entry Point to wWinMainCRTStartup.
  3. Rebuild the project.

Why this works:

  • wWinMain is the Unicode entry point, and the CRT startup code expects to call wWinMainCRTStartup when Unicode is enabled.
  • If the entry point is not set correctly, the linker cannot find the function to start your program, resulting in LNK2019/LNK1120.

Summary of steps:

  • Set the project to use Unicode.
  • Set the entry point to wWinMainCRTStartup if needed.

Try these steps and rebuild. If you still see errors, please provide the full error messages for further assistance.

In my opinion, #1 seems unlikely. And besides, I can’t find Character Set in the Project Properties (I believe that setting was superseded in my current version of VS):

So going on to #2, I find in fact that the Linker Subsystem is not correctly set:

I saw that error 25 years ago. It’s strange that it’s still there – you would think the system would be able to auto-correct that. And although I asked Copilot to go ahead and change it, it refused – it said it doesn’t have access to my HelloWorld.vcxproj project file:

I do not currently have access to your HelloWorld.vcxproj project file, so I cannot directly check the Linker Subsystem setting.

Anyway, I made the change suggested, and it compiled and ran successfully!

This took only about five minutes to get through – amazing!

Just for fun, I decided to have the same interaction with ChatGPT 5:

Thought for 11s

Here’s a tiny, modernized version that compiles on Visual Studio 2022 (Community), runs on Windows 11, and uses Unicode explicitly:

// A minimal Unicode Win32 app that shows a message box.
// Build: Visual Studio 2022 (Community)
// Platform: Windows 11 (works on earlier versions too)

#define WIN32_LEAN_AND_MEAN
#include <windows.h>

// Unicode entry point (no console)
int APIENTRY wWinMain(_In_ HINSTANCE hInstance,
                      _In_opt_ HINSTANCE /*hPrevInstance*/,
                      _In_ PWSTR /*lpCmdLine*/,
                      _In_ int /*nCmdShow*/)
{
    MessageBoxW(nullptr,
                L"THERE CAN BE ONLY ONE!!!",
                L"MY FIRST WINDOWS PROGRAM",
                MB_OK | MB_ICONEXCLAMATION);
    return 0;
}

Why this works now

  • Uses the Unicode entry point wWinMain and wide‐char Windows API (MessageBoxW) instead of ANSI.
  • Drops the old <windowsx.h> macros you don’t need for this sample.
  • Avoids deprecated/ambiguous types like LPSTR in the entry point.

Visual Studio 2022 setup (quick)

  1. File → New → Project → Windows Desktop Application (C++).
  2. Name the project, then Create.
  3. In Project → Properties (for All Configurations):
    • General → Character Set: Use Unicode Character Set.
    • (Optional) C/C++ → Language → Conformance mode: Yes (default).
  4. Replace the auto-generated .cpp with the main.cpp above.
  5. Build → Build Solution (Ctrl+Shift+B), then Debug → Start Without Debugging (Ctrl+F5).

That’s it—clean, Unicode-safe, and ready for Win11 with VS2022.

That compiled and ran like a charm!

It’s interesting that ChatGPT removed the deprecated #include <windowsx.h>, whereas Copilot just seemed to ignore it.

I also like the fact that ChatGPT used the Source Code Annotation Language (SAL) annotations, for example _In_ and _In_opt_. It makes the code much more readable. And finally, it used the more modern PWSTR, versus the legacy LPWSTR, as the pointer to the wide-character string.

So, the winner is ChatGPT, at least for now.

To wrap this up, there are dozens and dozens of examples on the CD that was included with the book, in its original form. It might be hard to find that CD now, but I have the original source.  Here are the contents for all 15 chapters:

In my spare time, I plan to go through much of the content of the book again and modernize the code. Many of the older .exe files that Andre included in the book’s CD don’t run anymore (although some of them do – which is amazing!) so over time I’ll bring them all into Windows 11 and the 21st century.

Categories
AI

Using LLMs for Massive Code Trace Analysis Part 2: Parameters, Tokens and Modelfiles

In this article, I explored ways to extend my LLM with knowledge specific to analyzing Intel Processor Trace.

“To me the question is what happens next. And there are three things that are happening this year. The first is infinite context-window.”

– Eric Schmidt, Ex-Google CEO

Source: The 7 Things Everyone Needs to Know About AI Right Now: Superintelligence

In Part 1 of this series, I introduced the concept behind this project: to fine-tune an open-source LLM to process massive amounts of Intel Processor Trace. This has the potential to apply to many use cases: at-scale bug triage, codebase vulnerability scans, malware detection at the point of execution in RAM, and many others.

As such, I’ll chronicle the journey of discovery as I develop this solution over many months. I’ll need to become a bit of an SME for LLMs, and combine that with my current knowledge of x86 architecture, UEFI, and OS internals.

In this article, I’ll cover the following topics:

  • Simplify my Python coding environment
  • Parameters versus Tokens – a short description
  • Using a Modelfile to customize LLMs – first steps

Simplifying Python Coding

I ended Part 1 with writing some Python code that connects with a local LLM, delivers a payload (the prompt) via an HTTP POST, and prints out the response. It’s pretty low-level, complex code. You can simplify this code by using the ollama Python package, obtaining that module via:

>pip3 install ollama

And then the code becomes much simpler:

import ollama

# Initialize the Ollama client
client = ollama.Client()

# Define the model and the input prompt
model = "mistral"  # Replace with your model name
prompt = "What is PL/I?"

# Send the query to the model
response = client.generate(model=model, prompt=prompt)

# Print the response from the model
print("Response from Ollama:")
print(response.response)

For now, I’ll continue the development on my local hardware, so this will be the approach for a while. I may have to change my strategy when I move to Google Colab or some other platform going forward. We’ll see.

Parameters versus Tokens

Before continuing the description of the fine-tuning process, I’ll take a detour into two very important LLM concepts: tokens and parameters. While this article is not an LLM tutorial, understanding these basics is essential for adapting an LLM to specific tasks, such as interpreting instruction execution traces in UEFI.

To put it simply, tokens are the basic building blocks of text that an LLM reads and generates. They are the model’s “words” or “letters.” A token can be a single word, a part of a word, a punctuation mark, or even a space. For example, the sentence “I love my dog” might be broken down into the tokens: [“I”, ” love”, ” my”, ” dog”].
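To make this concrete, here’s a toy sketch of tokenization. Real tokenizers, like BART’s, use learned subword vocabularies, so this whitespace-based split is only an illustration of the idea:

```python
def toy_tokenize(text):
    # crude illustration: split on words, keeping the leading space with each
    # token, the way many real subword tokenizers do
    words = text.split(" ")
    return [words[0]] + [" " + w for w in words[1:]]

print(toy_tokenize("I love my dog"))  # ['I', ' love', ' my', ' dog']
```

A real tokenizer would also split rare words into multiple subword pieces, which is why token counts usually exceed word counts.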

Parameters, on the other hand, are the model’s knowledge and learned relationships. They are the numerical values (weights and biases) that the model adjusts during its training to become more accurate. Think of parameters as the “neural connections” in the model’s brain; I like that analogy.

A good way to compare the two is in this table:

Aspect | Tokens | Parameters
What they are | The units of text being processed | The model’s internal knowledge and weights
Measurement | Counted per interaction (input & output) | A fixed, static number for a specific model
Function | The “what” of the conversation (input/output) | The “how” of the model’s intelligence
Analogy | The words you speak and hear | The knowledge in your brain

For tokens and parameters, in general, the larger the better. We often talk about the LLM’s “token context window” as a key measure of how much data a model can hold “in memory” at once, retaining earlier parts of the input without losing context.

Determining the number of parameters of any given LLM is easy. It’s a fixed characteristic of the model. If you’re working with https://ollama.com/search, you’ll see it’s sometimes displayed in the description of the model; in this case, gpt-oss has two versions available, one with 20 billion parameters, and a larger one with 120 billion parameters:

But as noted in the ollama GitHub repository, https://github.com/ollama/ollama, the bigger the model, the more RAM and disk space it’s going to take. Here are some examples:

And further, you should have at least 8 GB of RAM available to run the 7B models, 16 GB to run the 13B models, and 32 GB to run the 33B models.

Since my local PC has a 16-thread AMD Ryzen 7 1700X CPU, 16 GB of RAM, and an NVIDIA GeForce GTX 1060 with only 6 GB of VRAM, I’ll need to stick with models that have significantly fewer than 7B parameters.
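As a rough rule of thumb (my own back-of-the-envelope sketch, not an official formula), the weights-only footprint of a model is its parameter count times the bytes per parameter. It shows why a quantized ~3B model fits my 6 GB card comfortably while a 7B one is tight:

```python
def approx_weights_gb(params_billions, bits_per_param=4):
    # weights-only estimate; the KV cache and activations need additional room
    return params_billions * 1e9 * bits_per_param / 8 / 1e9

# assuming 4-bit quantization, as is common for locally run GGUF models
print(round(approx_weights_gb(7.2, 4), 1))  # 3.6 (GB) - tight on a 6 GB card
print(round(approx_weights_gb(3.2, 4), 1))  # 1.6 (GB) - comfortable
```

At full 16-bit precision the same 7.2B model would need about 14.4 GB for weights alone, which is why quantization matters so much on consumer GPUs.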

The context window can also be determined from the model repository, for example:

Or, once you’ve loaded the model onto your local machine, “ollama show <model_name>” will display it as well:

Of the models that I worked with in Part 1, here’s a summary table of their context windows and parameters:

Model | Context | Parameters
mistral | 32,768 | 7.2B
llama2 | 4,096 | 6.7B
llama3.2 | 131,072 | 3.2B

Keep in mind that the context window represents the maximum number of tokens that can be accommodated in a single interaction, including both the prompt and its response. This makes sense, based on what I’ve learned about LLM tokenizer design.
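One practical consequence: the budget for the response is whatever remains of the window after the prompt is counted. A trivial sketch:

```python
def response_budget(context_window, prompt_tokens):
    # tokens remaining for the model's reply once the prompt is counted;
    # if the prompt alone overflows the window, nothing is left
    return max(0, context_window - prompt_tokens)

# llama3.2's 131,072-token window, after a 120,000-token trace excerpt
print(response_budget(131_072, 120_000))  # 11072
```

So a prompt that nearly fills the window leaves the model almost no room to answer, which matters when the prompt is a giant trace excerpt.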

In my webinar video, Advanced Threat Analysis and Reverse Engineering using AI, IDA, and the SourcePoint JTAG Debugger, I worked with 100 million lines of Intel Processor Trace, which took up about 8GB of disk space. I had to compress it using 7-Zip to about 500MB, as ChatGPT has a hard limit on file size and tokens supported. When queried, it informed me:

Uploading and analyzing an 8 GB / 107 M-line .txt in one go isn’t possible in a single ChatGPT Plus session.

ChatGPT file uploads have a hard per-file cap of 512 MB and text/doc uploads are limited to ~2 million tokens per file (far smaller than your 8 GB trace). The blocking issue is the per-file 512 MB and 2M-token caps, which apply broadly—not just Plus. Pro/Enterprise mainly increase file count/quota, not the single-file size/token limit.

My current context window is about 128k tokens (roughly 400–500 pages of text). That’s the maximum amount of text I can keep “in working memory” at once when analyzing a file or conversation.
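Using the common heuristic of roughly four characters per token for English-like text (an assumption on my part; disassembly-style trace output may tokenize differently), it’s easy to see how far beyond these caps my trace is:

```python
def approx_tokens(num_bytes, chars_per_token=4):
    # rough heuristic: ~4 characters (bytes, for ASCII text) per token
    return num_bytes // chars_per_token

trace_bytes = 8 * 1024**3                          # the ~8 GB Intel PT text dump
print(f"{approx_tokens(trace_bytes):,} tokens")    # 2,147,483,648 tokens
print(approx_tokens(trace_bytes) // 2_000_000)     # 1073 - over a thousand times the 2M-token file cap
```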

Very interesting. For now, I’ll focus on the token context window as a key aspect that I want to increase in size. Since I’m using an open-source model on my local PC, I’m looking for the largest context window that will fit into its constrained space. In an upcoming article, I’ll refer back to Arthur Rasmusson’s use of Paged Attention over RDMA as a means to expand the context window for Intel PT.

Here’s an important caveat: even though the context window may be quite large, the ollama platform may default it to be lower (often 2,048) to save VRAM. For customization, you will want to tune this; see the next section.
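Given those limits, any realistic workflow will have to feed a trace to the model in slices. Here’s a minimal chunking sketch (my own, with an arbitrary chunk size) along the lines of what I expect to need:

```python
def chunk_lines(lines, lines_per_chunk=50_000):
    # yield successive slices of a huge trace so each slice fits the context window
    for start in range(0, len(lines), lines_per_chunk):
        yield lines[start:start + lines_per_chunk]

# stand-in for real Intel PT disassembly lines
trace = [f"instruction {i}" for i in range(120_000)]
chunks = list(chunk_lines(trace))
print(len(chunks), len(chunks[-1]))  # 3 20000
```

In practice, the chunk size would be derived from the model’s num_ctx and the tokens-per-line of the trace format, with some overlap between chunks to preserve context.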

Using a Modelfile to customize LLMs

One of the means of customizing an LLM in the ollama platform is with the use of a Modelfile. This is a little text file that modifies an existing base model to create a more specialized one.

The use of the Modelfile is covered lightly in Tech with Tim’s video, Learn Ollama in 15 Minutes – Run LLM Models Locally for FREE. In this instance, we’ll use the following Modelfile to modify the base llama3.2 model:

FROM llama3.2

# set the temperature to 1 [higher is more creative, lower is more coherent]
PARAMETER temperature 1

# set the system message
SYSTEM """
You are Mario from Super Mario Bros. Answer as Mario, the assistant, only.
"""

Run the following from a CMD prompt to change the model’s behavior and watch it act like Super Mario:

>ollama create mario -f ./Modelfile

>ollama run mario

>>>Hello

It's-a me, Mario! *moustache twirl* How can I help you today?

This is a simple example, so let’s dive in a little deeper to see what more interesting things that we can do.

Here’s a table of attributes that can be changed within a Modelfile:

Attribute | Description | Default Value
FROM | The mandatory first instruction, specifying the base model to build upon: either a model from the official Ollama library (e.g., llama3:8b) or a local .gguf file path. | N/A (required)
SYSTEM | Sets the system prompt: a set of instructions the model follows as its core programming or persona. A powerful tool for defining the model’s behavior, style, and constraints (e.g., “You are a helpful coding assistant who explains concepts clearly with code examples.”). | None (models may have a built-in system prompt)
TEMPLATE | Defines the full prompt template sent to the model, controlling how the system message, user prompt, and model response are structured. Essential for models that require a specific format to follow the conversation flow correctly. | Varies by model
PARAMETER | Sets a wide range of generation parameters that fine-tune the model’s behavior and output. The most common are: temperature (controls creativity and randomness; lower is more deterministic and factual, higher more creative and diverse); top_p (a cumulative probability cutoff for selecting the next token; lower values draw from a smaller, more focused set); top_k (limits the choice to the k most likely tokens at each step, reducing “nonsense”); and repeat_penalty (penalizes recently repeated tokens; higher values discourage loops). | temperature 0.8; top_p 0.9; top_k 40; repeat_penalty 1.1
ADAPTER | Applies a LoRA (Low-Rank Adaptation) adapter to the base model. LoRAs are small, fine-tuned files that add a specific skill or knowledge set (e.g., a writing style or technical expertise) without retraining the entire model. | None
LICENSE | A metadata instruction specifying the legal license under which your custom Modelfile and the resulting model are released, important for proper usage and distribution. | None
STOP | Defines one or more stop sequences. When the model generates any of these strings, it immediately stops generating, which is crucial for managing conversation turns and avoiding unnecessary output. | Varies by model (e.g., \n[INST])
MESSAGE | Adds pre-defined messages to the model’s history, establishing a conversational pattern or pre-loading context before the user starts chatting. | None

The PARAMETER and ADAPTER attributes are the most interesting ones in the context of this article.
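Since a Modelfile is just text, it’s easy to generate one from Python. This sketch (my own helper, not part of the ollama package) emits a Modelfile that raises num_ctx and sets a persona; the output is what you’d feed to ollama create:

```python
def make_modelfile(base, num_ctx=None, temperature=None, system=None):
    # assemble Modelfile text for `ollama create <name> -f Modelfile`
    lines = [f"FROM {base}"]
    if num_ctx is not None:
        lines.append(f"PARAMETER num_ctx {num_ctx}")
    if temperature is not None:
        lines.append(f"PARAMETER temperature {temperature}")
    if system is not None:
        lines += ['SYSTEM """', system, '"""']
    return "\n".join(lines) + "\n"

print(make_modelfile("llama3.2", num_ctx=32768,
                     system="You are an x86 and Windows internals expert."))
```

Generating the file programmatically will make it easy to sweep num_ctx values later and measure what actually fits in VRAM.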

For completeness, here’s a complete list of generation parameters (hyperparameters) that are supported by PARAMETER within Modelfiles, with a brief explanation of what each one does:

  • mirostat: Enables Mirostat sampling, an alternative to top_p and top_k. It is designed to control perplexity, balancing the coherence and randomness of the output. The value can be 0 (disabled), 1 (Mirostat), or 2 (Mirostat 2.0).
  • mirostat_eta: A learning rate parameter for Mirostat sampling. It influences how quickly the algorithm adjusts to the generated text. A higher value makes the model more responsive to feedback, while a lower value makes adjustments more slowly.
  • mirostat_tau: A target perplexity parameter for Mirostat. It controls the balance between coherence and diversity in the output. A lower value leads to more focused and coherent text.
  • num_ctx: Sets the size of the context window in tokens. This determines how much of the previous conversation or prompt the model “remembers” when generating its next response.
  • num_gqa: The number of Grouped-Query Attention (GQA) groups in the transformer layer. This parameter is required for some model architectures and is used to optimize performance.
  • num_gpu: Specifies the number of layers of the model to offload to the GPU(s) for accelerated computation. Setting it to 0 will force the model to run on the CPU.
  • num_predict: The maximum number of tokens to predict and generate in a single response. A value of -1 allows for infinite generation, and -2 fills the entire context window.
  • num_thread: Sets the number of threads used for computation. By default, Ollama automatically detects the optimal number for your system’s performance.
  • repeat_last_n: Defines how far back (in tokens) the model should look to apply the repetition penalty. A value of 0 disables the penalty, and -1 looks back at the entire context.
  • repeat_penalty: A multiplier that penalizes tokens that have appeared recently in the output. A higher value (e.g., 1.5) will more strongly discourage repetition.
  • seed: Sets the random number seed for generation. Using a specific seed ensures that the model will produce the exact same text for the same prompt, which is useful for debugging and reproducibility.
  • temperature: Controls the creativity and randomness of the model’s output. A value of 0 makes the output deterministic and factual, while a higher value leads to more diverse and creative responses.
  • tfs_z: Stands for Tail Free Sampling. This is a method to reduce the impact of less probable tokens. A higher value reduces this impact more, while a value of 1.0 disables it.
  • top_k: Limits the pool of potential next tokens to the k most likely ones. This helps prevent the model from generating random or nonsensical text.
  • top_p: A more dynamic alternative to top_k. It selects the smallest set of tokens whose cumulative probability exceeds the value of top_p. This balances diversity and quality.
  • min_p: An alternative to top_p that ensures a balance of quality and variety by setting a minimum probability for a token to be considered, relative to the most likely token.
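To build some intuition for temperature and top_k, here’s a small pure-Python sketch of the underlying sampling math (illustrative only; real inference engines implement this over the full vocabulary, in optimized kernels):

```python
import math

def softmax(logits, temperature=1.0):
    # lower temperature sharpens the distribution; higher flattens it
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    # keep only the k most likely tokens, then renormalize
    keep = set(sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k])
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]

logits = [2.0, 1.0, 0.5, -1.0]          # scores for four candidate tokens
probs = top_k_filter(softmax(logits, temperature=0.8), k=2)
print([round(p, 3) for p in probs])     # only the top two tokens keep any probability
```

Raising the temperature above 1.0 pulls the surviving probabilities closer together (more random choices); lowering it concentrates mass on the single best token.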

Note the num_ctx generation parameter. We’ll need to use that.

I’ll demonstrate the use of the PARAMETER and ADAPTER features in an upcoming article.

I’ll also cover the SYSTEM feature in the next article, as I modify the model’s behavior to act like an x86 and Windows internals expert (with some forthcoming training from me, of course).

As a final note, I should mention that Arthur Rasmusson suggested testing my LLM with TensorRT-LLM and its KVCacheTransferManager. Since the ultimate goal of this project is to train an LLM to process huge amounts of Intel Processor Trace, this seems like a good direction. TensorRT-LLM is an NVIDIA open-source library that accelerates LLM inference, and KVCacheTransferManager, a memory-management system for efficiently handling large token context windows, is a core part of it.

Also, Arthur mentioned that it might be possible to run NVIDIA’s GPUDirect Storage on the GTX 1060 in my aging workstation. This is a mechanism that expands limited VRAM via a direct pathway between it and external memory, and it could be essential for creating the massive context window I’ll need for Intel Processor Trace. How cool would that be?

This is going to take some study on my part, as these tools aren’t for the faint of heart, but I’m looking forward to it.