
[Review] #34

Open
druce opened this issue Feb 5, 2025 · 1 comment

Comments

druce commented Feb 5, 2025

Format

What's the book format where you found this issue?
[ ] pdf
[x] web
[ ] ipynb

Chapter

In what chapter did you find this issue?
2 - Structured Output

Issue Description

A few minor random comments:
"Pos-training" -> Post-training

On fine-tuned JSON approaches, the chapter says: "JSON mode is typically a form of fine-tuning, where a base model went through a post-training process to learn target formats. However, while useful, this strategy is not guaranteed to work all the time." However, from the OpenAI docs: "While both ensure valid JSON is produced, only Structured Outputs ensure schema adherance [sic]." My understanding and experience is that the response is guaranteed to adhere to the provided schema; presumably they go beyond fine-tuning to some of the other approaches for their Structured Outputs mode. Hopefully the same is true of other LLMs that offer structured output.

Of course it doesn't guarantee integrity beyond valid JSON matching the provided schema: it can't force a length constraint or an enumerated type, and you could still get e.g. a refusal.

TL;DR: just use JSON mode if available and it should guarantee good JSON? And then, what is best practice beyond that, presumably outlines/instructor?
https://platform.openai.com/docs/guides/structured-outputs
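To make the distinction concrete, here is a minimal sketch of what a Structured Outputs request looks like, per the OpenAI docs linked above. The schema and field names are hypothetical, and no network call is made; the point is that `"strict": True` under `"response_format"` is what requests schema adherence, as opposed to plain JSON mode (`{"type": "json_object"}`), which only promises valid JSON:

```python
import json

# Hypothetical extraction schema; field names are illustrative.
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price": {"type": "number"},
    },
    "required": ["name", "price"],
    "additionalProperties": False,
}

# Request body for POST /v1/chat/completions (shape per OpenAI docs
# at the time of writing; this only builds the dict, no API call).
payload = {
    "model": "gpt-4o-2024-08-06",
    "messages": [{"role": "user", "content": "Extract: Widget, $9.99"}],
    "response_format": {
        "type": "json_schema",
        "json_schema": {"name": "item", "strict": True, "schema": schema},
    },
}

print(json.dumps(payload, indent=2))
```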

Would maybe mention that Pydantic just serves as a convenient, readable way to generate the JSON Schema that goes into the REST API call (maybe obvious).
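That is, the Pydantic model is just a readable authoring format for the schema dict shown in the REST payload. A quick sketch with a hypothetical model (Pydantic v2's `model_json_schema()`):

```python
from pydantic import BaseModel, Field

# Hypothetical model; the class definition is the human-readable source
# of the JSON Schema that the API call ultimately carries.
class Item(BaseModel):
    name: str = Field(description="Product name")
    price: float

# Emit standard JSON Schema from the model definition.
schema = Item.model_json_schema()
print(schema)
```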

From the original OpenAI blog post: "Structured Outputs takes inspiration from excellent work from the open source community: namely, the outlines, jsonformer, instructor, guidance, and lark libraries." https://archive.is/KIRVL#selection-19231.0-19240.0 . The Applied LLMs folks mention " (If you’re importing an LLM API SDK, use Instructor; if you’re importing Huggingface for a self-hosted model, use Outlines.)" https://applied-llms.org/

druce commented Feb 5, 2025

LangChain also has an output-fixing parser and similar approaches - https://python.langchain.com/docs/how_to/output_parser_fixing/
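The idea behind LangChain's `OutputFixingParser` is a fix-and-retry loop: on a parse failure, the broken output plus the error message are sent back to an LLM for repair. A minimal stdlib sketch of that pattern, with a stand-in fixer function where the real implementation would re-prompt a model:

```python
import json

def parse_with_fixing(text, fix_fn, max_retries=2):
    """Try to parse JSON; on failure, hand the broken text and the
    parse error to fix_fn (an LLM call in practice) and retry."""
    for _ in range(max_retries + 1):
        try:
            return json.loads(text)
        except json.JSONDecodeError as err:
            text = fix_fn(text, str(err))  # re-prompt with output + error
    raise ValueError("could not repair model output")

# Stand-in "fixer": a real one would prompt an LLM with the broken
# output and the error; here we just patch a common failure mode.
def naive_fixer(text, error):
    return text.replace("'", '"')  # single quotes -> valid JSON

print(parse_with_fixing("{'a': 1}", naive_fixer))  # {'a': 1}
```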

https://www.boundaryml.com/blog/structured-output-from-llms

Curious how this might evolve over time, whether we'll get LLM APIs that do all the validation and structured output out of the box.
