How It Works
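tokenspy works by in-process monkey-patching: at runtime it replaces the SDK methods that make LLM calls with thin wrappers. Like line_profiler, it instruments code from inside the running process, whereas py-spy samples a process from the outside. No proxy server, no HTTP interception, no environment variables.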
Architecture
Your Code
    │
    ├── @tokenspy.profile ────────────────────────────── sets active function name
    │
    └── openai_client.chat.completions.create(...)
            │
            └── tokenspy interceptor (in-process monkey-patch)
                    ├── calls original SDK method
                    ├── reads response.usage (tokens)
                    ├── looks up cost in built-in pricing table
                    ├── records: function · model · tokens · cost · duration
                    └── returns response UNCHANGED to your code

tokenspy.report() → renders flame graph from recorded data
Your code runs exactly as before. tokenspy just watches and keeps score.
Provider interceptors
Each supported SDK is patched at a specific method:
| Provider | Patched method |
|---|---|
| OpenAI | openai.resources.chat.completions.Completions.create |
| Anthropic | anthropic.resources.messages.Messages.create |
| Google | google.generativeai.GenerativeModel.generate_content |
The patch replaces the original method with a wrapper that:
- Delegates: calls the original method and captures the response
- Reads usage: extracts input_tokens and output_tokens from the response
- Calculates cost: looks up the model in the built-in pricing table
- Records: creates a CallRecord and stores it in the active Tracker
- Returns: passes the original response back to your code unchanged
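The five steps above can be sketched against a stand-in SDK class. This is a minimal illustration, not tokenspy's actual implementation: FakeCompletions, PRICES, the records list, the patch helper, and the CallRecord fields shown here are all assumptions made for the example.

```python
import functools
import time
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass
class CallRecord:
    # Hypothetical record shape; the real CallRecord may carry other fields.
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float
    duration_s: float


class FakeCompletions:
    """Stand-in for an SDK resource class such as Completions."""

    def create(self, **kwargs):
        usage = SimpleNamespace(input_tokens=1000, output_tokens=500)
        return SimpleNamespace(model=kwargs["model"], usage=usage)


# Illustrative prices in USD per 1M tokens: (input, output).
PRICES = {"demo-model": (2.0, 8.0)}
records = []


def patch(cls, method_name):
    """Replace cls.method_name with a recording wrapper; return the original."""
    original = getattr(cls, method_name)

    @functools.wraps(original)
    def wrapper(self, *args, **kwargs):
        start = time.perf_counter()
        response = original(self, *args, **kwargs)      # 1. delegate
        usage = response.usage                          # 2. read usage
        in_price, out_price = PRICES[response.model]    # 3. calculate cost
        cost = (usage.input_tokens * in_price
                + usage.output_tokens * out_price) / 1_000_000
        records.append(CallRecord(                      # 4. record
            model=response.model,
            input_tokens=usage.input_tokens,
            output_tokens=usage.output_tokens,
            cost_usd=cost,
            duration_s=time.perf_counter() - start,
        ))
        return response                                 # 5. return unchanged

    setattr(cls, method_name, wrapper)
    return original


original_create = patch(FakeCompletions, "create")
response = FakeCompletions().create(model="demo-model")
```

Patching the class attribute rather than an instance means every client created before or after the patch goes through the wrapper, which is presumably why the table above lists class-level methods.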
Function attribution
When @tokenspy.profile wraps a function, it pushes the function name onto a thread-local stack before calling the function, and pops it after. The interceptors read from this stack to know which function to attribute the cost to.
Thread-local stack: []
Enter @profile("run_pipeline")
  Stack: ["run_pipeline"]
  LLM call → recorded as function_name="run_pipeline"
  Enter @profile("fetch_and_summarize")
    Stack: ["run_pipeline", "fetch_and_summarize"]
    LLM call → recorded as function_name="fetch_and_summarize"
  Exit @profile("fetch_and_summarize")
  Stack: ["run_pipeline"]
Exit @profile("run_pipeline")
Stack: []
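The push/pop behaviour traced above can be reproduced with threading.local from the standard library. The sketch below is an assumption about how such a decorator might look; the names _stack, current_function, profile, and fake_llm_call are hypothetical and not tokenspy's API. The try/finally is a design choice of the sketch so that the name is popped even when the profiled function raises.

```python
import functools
import threading

_local = threading.local()


def _stack():
    # Each thread lazily gets its own list, so threads never share a stack.
    if not hasattr(_local, "stack"):
        _local.stack = []
    return _local.stack


def current_function():
    """Return the innermost profiled function name, or None outside any."""
    stack = _stack()
    return stack[-1] if stack else None


def profile(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        _stack().append(func.__name__)   # push before the call
        try:
            return func(*args, **kwargs)
        finally:
            _stack().pop()               # pop after, even on exceptions
    return wrapper


attributed = []


def fake_llm_call():
    # Stands in for an interceptor reading the top of the stack.
    attributed.append(current_function())


@profile
def fetch_and_summarize():
    fake_llm_call()


@profile
def run_pipeline():
    fake_llm_call()
    fetch_and_summarize()
    fake_llm_call()


run_pipeline()
```

After the run, attributed holds "run_pipeline", "fetch_and_summarize", "run_pipeline" in that order: the nested call is charged to the inner function, and attribution falls back to the outer one once the inner function exits.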
Streaming
For streaming responses, the interceptor wraps the response iterator:
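# Simplified: original_create, tracker, function_name, _wrap_stream and
# _record are provided by the surrounding interceptor code.
def patched_create(**kwargs):
    response = original_create(**kwargs)
    if kwargs.get("stream"):
        # Streaming: recording is deferred until the stream is consumed.
        return _wrap_stream(response, tracker, function_name)
    else:
        # Non-streaming: usage is available immediately.
        _record(response, tracker, function_name)
        return response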
The wrapper yields chunks unchanged and accumulates token counts. The call is recorded to the tracker after the last chunk is consumed.
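A stream wrapper of this kind can be written as a generator. The sketch below assumes each chunk may carry an optional usage attribute whose counts can be summed; real SDKs report streaming usage in different ways, so treat the chunk shape, wrap_stream, and the on_complete callback as hypothetical.

```python
from types import SimpleNamespace


def wrap_stream(chunks, on_complete):
    """Yield chunks unchanged, then report accumulated usage exactly once."""
    input_tokens = 0
    output_tokens = 0
    for chunk in chunks:
        usage = getattr(chunk, "usage", None)
        if usage is not None:
            input_tokens += usage.input_tokens
            output_tokens += usage.output_tokens
        yield chunk  # the caller sees the same object the SDK produced
    # Reached only after the last chunk has been consumed.
    on_complete(input_tokens, output_tokens)


chunks = [
    SimpleNamespace(text="Hel", usage=None),
    SimpleNamespace(text="lo", usage=SimpleNamespace(input_tokens=12, output_tokens=2)),
]
recorded = []
stream = wrap_stream(iter(chunks), lambda i, o: recorded.append((i, o)))

first = next(stream)
recorded_before_end = list(recorded)  # still empty: the stream is not finished
rest = list(stream)                   # draining the stream triggers the record
```

One consequence of this design is that a caller who abandons the stream early never reaches the final line, so in this sketch nothing would be recorded for that call.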
Thread safety
The Tracker uses a threading.Lock around all writes. This means:
- Multiple threads can make LLM calls concurrently
- Each call is recorded atomically
- Function attribution uses thread-local storage — two threads in different profiled functions don't interfere
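The locking pattern described above looks roughly like this. It is a minimal sketch, not the real Tracker: the record and snapshot methods and the total_tokens field are assumptions made for illustration.

```python
import threading


class Tracker:
    """Toy tracker that serializes writes with a single lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._records = []
        self.total_tokens = 0

    def record(self, function_name, tokens):
        # The append and the counter update happen as one atomic step.
        with self._lock:
            self._records.append((function_name, tokens))
            self.total_tokens += tokens

    def snapshot(self):
        # Copy under the lock so readers never see a half-applied write.
        with self._lock:
            return list(self._records)


tracker = Tracker()


def worker(name):
    for _ in range(1000):
        tracker.record(name, 1)


threads = [threading.Thread(target=worker, args=(f"worker_{i}",)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With eight threads recording 1,000 calls each, the tracker ends up with 8,000 records and a matching token total; without the lock, the read-modify-write on total_tokens could lose updates.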
Activation lifecycle
Interceptors are activated automatically the first time @tokenspy.profile or tokenspy.session() is used. They stay in place until the process exits or until you call deactivate(). You can also control them manually:
from tokenspy.interceptor import activate, deactivate, is_active
activate() # patch all available SDKs
is_active() # True
deactivate() # restore all original methods
is_active() # False
If an SDK isn't installed, its interceptor silently does nothing — no import errors.
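One way to get this behaviour is to attempt the import inside activate() and skip the target on ImportError. The sketch below registers a fake module so it runs without third-party packages; TARGETS, _originals, _make_wrapper, and the module names are hypothetical, and the real interceptor may be organized differently.

```python
import functools
import importlib
import sys
import types


class Messages:
    def create(self, **kwargs):
        return {"ok": True}


# Register a stand-in "SDK" module for the demonstration.
fake_sdk = types.ModuleType("fake_sdk")
fake_sdk.Messages = Messages
sys.modules["fake_sdk"] = fake_sdk

# (module, class, method) targets; the second module does not exist.
TARGETS = [
    ("fake_sdk", "Messages", "create"),
    ("sdk_that_is_not_installed", "Client", "create"),
]
_originals = {}


def _make_wrapper(original):
    @functools.wraps(original)
    def wrapper(self, *args, **kwargs):
        # A real interceptor would record usage around this call.
        return original(self, *args, **kwargs)
    return wrapper


def activate():
    for key in TARGETS:
        module_name, class_name, method_name = key
        if key in _originals:
            continue  # already patched
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            continue  # SDK not installed: skip without raising
        cls = getattr(module, class_name)
        original = getattr(cls, method_name)
        _originals[key] = (cls, original)
        setattr(cls, method_name, _make_wrapper(original))


def deactivate():
    for (_mod, _cls_name, method_name), (cls, original) in _originals.items():
        setattr(cls, method_name, original)  # restore the saved method
    _originals.clear()


def is_active():
    return bool(_originals)


original_create = Messages.create
activate()
active_after_activate = is_active()
patched = Messages.create is not original_create
result = Messages().create()
deactivate()
```

Keeping the saved originals in a dictionary makes activate() idempotent and lets deactivate() restore exactly what was replaced.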
Error handling
All interception code is wrapped in try/except. If tokenspy encounters an unexpected response format (e.g. a new SDK version), it logs a warning and returns the original response untouched. Your code is never interrupted by tokenspy bugs.
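The guard can be sketched as follows. This is an illustration under assumptions: safe_intercept, record, and the logger name are hypothetical, and the sketch deliberately keeps the SDK call itself outside the try block so that genuine SDK errors still reach the caller.

```python
import logging

logger = logging.getLogger("tokenspy_sketch")


def safe_intercept(original, record):
    """Wrap `original` so that failures in `record` never reach the caller."""
    def wrapper(*args, **kwargs):
        # SDK errors belong to the caller, so this call is not guarded.
        response = original(*args, **kwargs)
        try:
            record(response)
        except Exception as exc:
            # Unexpected response shape: warn and carry on.
            logger.warning("could not record usage: %s", exc)
        return response
    return wrapper


def original_create(**kwargs):
    # No "usage" key, simulating an unexpected response format.
    return {"text": "hi"}


recorded = []


def record(response):
    recorded.append(response["usage"])  # raises KeyError for this response


create = safe_intercept(original_create, record)
response = create(model="demo-model")
```

The caller still receives the response even though recording failed; the only visible effect is a warning in the log.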
Built-in pricing table
Cost calculations use a static pricing table compiled from official provider pricing pages. No network calls are made — the table is bundled with the package.
from tokenspy.pricing import PRICING
# PRICING maps model name → (input_$/1M_tokens, output_$/1M_tokens)
print(PRICING["gpt-4o"]) # (2.5, 10.0)
print(PRICING["claude-haiku-4-5"]) # (0.8, 4.0)
See tokenspy/pricing.py for the full table, or call tokenspy.pricing.list_models() to get all known models.
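Given the per-million-token tuples, the cost of a call is simple arithmetic. The sketch below uses a local copy of the two entries shown above rather than importing tokenspy; estimate_cost is a hypothetical helper, and returning None for an unknown model is an assumption about how a missing entry might be handled.

```python
# Local copy of the two entries shown above: USD per 1M tokens (input, output).
PRICING = {
    "gpt-4o": (2.5, 10.0),
    "claude-haiku-4-5": (0.8, 4.0),
}


def estimate_cost(model, input_tokens, output_tokens):
    """Return the cost in USD, or None if the model is not in the table."""
    prices = PRICING.get(model)
    if prices is None:
        return None
    input_price, output_price = prices
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# 1,200 input tokens and 300 output tokens on gpt-4o:
# (1200 * 2.5 + 300 * 10.0) / 1_000_000 = 0.006 USD
cost = estimate_cost("gpt-4o", 1_200, 300)
unknown = estimate_cost("model-not-in-table", 10, 10)
```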