How to work around a request with too many tokens for llama-3.1-70b-versatile (Groq)?
Posted by dirtyring@reddit | LocalLLaMA | View on Reddit | 2 comments
groq.APIStatusError: Error code: 413 - {'error': {'message': 'Request too large for model `llama-3.1-70b-versatile
Limit 6000, Requested 10648, please reduce your message size and try again
For the purposes of my application, which analyzes the markdown text of an OCR'd bank account statement, I need to feed in all of the information: the analysis has to cover the entire statement.
The LLM needs to see all of the text to answer the question I want it to answer. Is this problem usually solved by sending multiple messages that together cover all the information, e.g. chunk 1 contains 50% of the info and chunk 2 contains the other 50%?
Are there ways around this, perhaps by making multiple calls to Groq? But the limit is 6,000 tokens PER MINUTE.
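Roughly what I mean, as an untested sketch (the helper name, prompt wording, and the naive 50/50 split are just placeholders, not code I'm actually running):

    import time
    from groq import Groq  # same client as in my code below

    client = Groq()  # API key read from GROQ_API_KEY

    def ask_in_chunks(statement_md: str, question: str) -> list[str]:
        # naive split: chunk 1 has half the statement, chunk 2 the rest
        half = len(statement_md) // 2
        chunks = [statement_md[:half], statement_md[half:]]

        answers = []
        for i, chunk in enumerate(chunks):
            completion = client.chat.completions.create(
                model="llama-3.1-70b-versatile",
                messages=[{"role": "user", "content": f"{question}\n\n{chunk}"}],
                max_tokens=530,
            )
            answers.append(completion.choices[0].message.content)
            if i < len(chunks) - 1:
                time.sleep(60)  # limit is 6,000 tokens PER MINUTE, so wait before the next call
        return answers

But each call only sees its own chunk, so the model never looks at the whole statement at once, which is why I'm not sure this is the right approach.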
My current code is simply:
    from groq import Groq

    client = Groq()  # API key read from GROQ_API_KEY

    completion = client.chat.completions.create(
        model="llama-3.1-70b-versatile",
        messages=[
            {
                "role": "user",
                "content": prompt,  # ******** PROMPT IS TOO LARGE!
            },
            # assistant prefill starting a JSON object
            {"role": "assistant", "content": "{"},
        ],
        temperature=1,
        max_tokens=530,  # prev 1024
        top_p=1,
        stream=False,
        stop=None,
    )
1) I've tried running the model locally with Ollama, but my Mac M1 Pro takes forever to run it. 2) Is the solution to find a different provider?
kryptkpr@reddit
Yeah, the easy solution is a provider that doesn't runtime limit context to 6k.
Otherwise you can try a rolling extractor approach: feed 3K tokens of the input + last output, ask for new output based on old and new info. Repeat as needed.
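Something like this (untested sketch; the ~3K-token chunk size from above, the chars/4 token estimate, and the prompt wording are all placeholders):

    import time
    from groq import Groq

    client = Groq()

    def rolling_extract(statement_md: str, question: str) -> str:
        # split on lines so a markdown table row isn't cut in half,
        # keeping each chunk around 3K tokens (~4 chars per token, rough guess)
        chunks, current = [], ""
        for line in statement_md.splitlines(keepends=True):
            if len(current + line) // 4 > 3000 and current:
                chunks.append(current)
                current = ""
            current += line
        if current:
            chunks.append(current)

        extracted = ""  # carries the last output forward
        for chunk in chunks:
            prompt = (
                f"Information extracted so far:\n{extracted}\n\n"
                f"Next part of the bank statement:\n{chunk}\n\n"
                f"Update the extracted information so it answers: {question}"
            )
            completion = client.chat.completions.create(
                model="llama-3.1-70b-versatile",
                messages=[{"role": "user", "content": prompt}],
                temperature=0,
                max_tokens=1024,
            )
            extracted = completion.choices[0].message.content
            time.sleep(60)  # stay under the 6K tokens-per-minute limit
        return extracted

Trade-off is latency (one call per chunk plus the rate-limit wait) and the risk that the running summary drops details you need later.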
kif88@reddit
AFAIK Groq doesn't have context caching, so the whole thing has to be sent each time. You'll have to find a provider with a higher rate limit.