Late last week, I had the pleasure of sitting down with Michael Nuñez from VentureBeat to discuss my team's latest work on building and open sourcing SWE-PolyBench, the first industry benchmark to evaluate AI coding agents' ability to navigate and understand complex codebases. It introduces rich metrics to advance AI performance in real-world scenarios. In the interview, I discuss the importance of building fine-grained metrics to track, measure, and improve agents' reasoning, decision making, and ability to understand (very) large context spaces. https://lnkd.in/gVJKaxt7
Congratulations, Anoop! Evaluation criteria must continue to evolve, and this is a major stepping stone. Great to see this important work in the limelight!
Are you also going to post a video?
Congrats, Anoop Deoras and team!