How to work around a request with too many tokens for llama-3.1-70b-versatile (Groq)?
Posted by dirtyring@reddit | LocalLLaMA | View on Reddit | 2 comments
groq.APIStatusError: Error code: 413 - {'error': {'message': 'Request too large for model `llama-3.1-70b-versatile
Limit 6000, Requested 10648, please reduce your message size and try again
For the purposes of my application, which analyzes the markdown text of an OCR'd bank account statement, I need to feed in all of the information: the analysis has to cover the entire statement.
The LLM needs to see all of the text to answer the question I want it to answer. Is this problem usually solved by sending multiple messages that together cover all the information, e.g. chunk 1 contains 50% of the info and chunk 2 contains the other 50%?
Are there ways around this, perhaps by making multiple calls to Groq? But the limit is 6,000 tokens PER MINUTE.
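Roughly what I mean, as an untested sketch (the helper name, prompt wording, and the naive 50/50 split are just placeholders, not code I'm actually running):

    import time
    from groq import Groq  # same client as in my code below

    client = Groq()  # API key read from GROQ_API_KEY

    def ask_in_chunks(statement_md: str, question: str) -> list[str]:
        # naive split: chunk 1 has half the statement, chunk 2 the rest
        half = len(statement_md) // 2
        chunks = [statement_md[:half], statement_md[half:]]

        answers = []
        for i, chunk in enumerate(chunks):
            completion = client.chat.completions.create(
                model="llama-3.1-70b-versatile",
                messages=[{"role": "user", "content": f"{question}\n\n{chunk}"}],
                max_tokens=530,
            )
            answers.append(completion.choices[0].message.content)
            if i < len(chunks) - 1:
                time.sleep(60)  # limit is 6,000 tokens PER MINUTE, so wait before the next call
        return answers

But each call only sees its own chunk, so the model never looks at the whole statement at once, which is why I'm not sure this is the right approach.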
My current code is simply:
    from groq import Groq

    client = Groq()  # API key read from GROQ_API_KEY

    completion = client.chat.completions.create(
        model="llama-3.1-70b-versatile",
        messages=[
            {
                "role": "user",
                "content": prompt,  # ******** PROMPT IS TOO LARGE!
            },
            # assistant prefill starting a JSON object
            {"role": "assistant", "content": "{"},
        ],
        temperature=1,
        max_tokens=530,  # prev 1024
        top_p=1,
        stream=False,
        stop=None,
    )
1) I've tried running the model locally with Ollama, but my Mac M1 Pro takes forever to run it. 2) Is the solution to find a different provider?
kryptkpr@reddit
Yeah, the easy solution is a provider that doesn't runtime limit context to 6k.
Otherwise you can try a rolling extractor approach: feed 3K tokens of the input + last output, ask for new output based on old and new info. Repeat as needed.
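Something like this (untested sketch; the ~3K-token chunk size from above, the chars/4 token estimate, and the prompt wording are all placeholders):

    import time
    from groq import Groq

    client = Groq()

    def rolling_extract(statement_md: str, question: str) -> str:
        # split on lines so a markdown table row isn't cut in half,
        # keeping each chunk around 3K tokens (~4 chars per token, rough guess)
        chunks, current = [], ""
        for line in statement_md.splitlines(keepends=True):
            if len(current + line) // 4 > 3000 and current:
                chunks.append(current)
                current = ""
            current += line
        if current:
            chunks.append(current)

        extracted = ""  # carries the last output forward
        for chunk in chunks:
            prompt = (
                f"Information extracted so far:\n{extracted}\n\n"
                f"Next part of the bank statement:\n{chunk}\n\n"
                f"Update the extracted information so it answers: {question}"
            )
            completion = client.chat.completions.create(
                model="llama-3.1-70b-versatile",
                messages=[{"role": "user", "content": prompt}],
                temperature=0,
                max_tokens=1024,
            )
            extracted = completion.choices[0].message.content
            time.sleep(60)  # stay under the 6K tokens-per-minute limit
        return extracted

Trade-off is latency (one call per chunk plus the rate-limit wait) and the risk that the running summary drops details you need later.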
kif88@reddit
AFAIK Groq doesn't have context caching, so the whole thing has to be sent each time. You'll have to find a provider with a higher rate limit.