Self-Healing Applications

LogicStar AI autonomously investigates, reproduces, and fixes real bugs pulled from your backlog - and only acts when it can fully validate the fix. No prompts. No hallucinations. Your product stays reliable and more robust.

Trusted By Industry Giants

Stay Focused On Features
And A Reliable Product.

Faster bug resolution

Mean time to resolution (MTTR) cut by 95%

Fully tested PRs

Most pull requests come with 100% test coverage and pass static verification.

Clear your bug backlog

40% of application bugs resolved fully autonomously.

Afraid to open your backlog?

Backlogs that grow faster than your team can clear them. Delayed features. Burned-out teams. We've seen it all before.

More than bugs: it’s time, morale, and momentum lost.

40% of engineering time disappears into triage and fixes. Backlogs pile up, features stall, and your team feels like they’re always firefighting instead of building.

Co-pilots assist. Agents guess. LogicStar fixes for you.

Co‑pilots and LLM agents still need feedback, oversight, and review, shifting work instead of removing it. LogicStar is different: it correlates signals, analyzes code like an engineer, and delivers fully validated fixes that prevent regressions. This is the only way to truly save time, not just trade bug fixing for reviewing bad AI suggestions.

Critical bugs sit unresolved for weeks or months.

LogicStar delivers autonomous, end‑to‑end bug resolution by eliminating queues, coordination, and waiting on the right people to triage, investigate, and prioritize. By removing these hand‑offs, it cuts bug resolution time by over 95%.

Bug fixing drains...

Every hour spent fixing bugs is an hour not building new features, work that can deliver 10× more ROI. LogicStar clears the backlog so engineering time shifts back to innovation, helping your team ship faster and maximize the value of every development hour.

LogicStar changes the game: fully autonomous, production-grade bug fixing that works like an extra engineer on your team.

Reproducing issues, validating fixes, and submitting tested pull requests before your human developers even get involved.

Fully Autonomous. Fully Trusted. Always On.

Reproduces real-world bugs in isolated sandboxed environments.

Validates every fix with static analysis, fuzzing & safety checks

Quick to adopt: up and running in hours, no workflow disruption

Abstains when uncertain - only delivers verified, trustworthy code

Generates full unit, regression & reproduction tests

Structured PRs with root cause summaries, test results, and coverage reports


The LogicStar team combines deep technical expertise with a proven record of impact in autonomous AI and software maintenance. Our founders created DeepCode, used by over 3 million developers worldwide, and after its acquisition by Snyk, the technology now powers more than $100M in annual revenue. Backed by leading AI researchers from ETH Zurich, MIT, and INSAIT, we bring cutting-edge AI research into production. With LogicStar, we are pioneering self-healing applications that autonomously fix real software bugs, reduce mean time to resolution by 95%, and deliver production-ready pull requests with full validation.

Built By AI Experts That Lead The Way.

Our team consists of leading researchers and entrepreneurs from ETH, MIT, and INSAIT, including the people behind Snyk Code and DeepCode.ai, trusted by 3M developers.


See it work
On your bugs

Stop wasting your best engineers on bug fixing.

AI That Fixes Code.

September 26, 2025

Evaluating coding agents shouldn’t feel like watching paint dry. Yet with SWE-Bench Verified, it often does—hundreds of Docker images totaling 240 GiB, throttled by rate limits*, turn the first setup on a new machine into a 30-hour ordeal. Want to test across a broader, less overfitted, and more representative set of repositories, by also using our SWA-Bench and SWEE-Bench or your own environments? Good luck; things only get slower.

So we decided to fix that. By restructuring layers, trimming unnecessary files, and compressing the results, we shrank SWE-Bench Verified from 240 GiB to just 5 GiB. Now it downloads in under a minute, making large-scale evaluation and trace generation on cloud machines fast and painless.

*100 images per 6h as an unauthenticated user, 200 as an authenticated user without Docker Hub Pro

Background

Evaluating SWE-Bench Verified requires 500 containerized environments, one for each issue across twelve repositories. Your options are either to build all of them from scratch (and pray all dependencies were pinned) or to pull the prebuilt images from Docker Hub. Neither choice is great. Building takes hours and can introduce inconsistencies. Pulling requires downloading more than 100 GiB of compressed layers and expanding them into 240 GiB of local storage. Even with a Docker Hub Pro subscription and a fast connection, this process takes anywhere from half an hour to several hours. Without a Pro account, rate limits make it even worse—you can spend 30 hours just waiting for pulls to finish.

The situation becomes truly painful if you want to evaluate more instances at scale on ephemeral cloud machines. Copying 100s of GiB around the world hundreds of times adds up quickly. So we set out to make the environment images light enough to be dropped onto a fresh machine in minutes.

The Layering Problem

At the core of every Docker image lies a stack of layers representing filesystem changes. When a container runs, Docker (via OverlayFS) looks for the topmost layer containing a requested file and reads it from there. The container itself adds a thin writable layer on top: when you modify a file, Docker copies it into this writable layer so changes never affect the underlying image layers.

This design is clever because it makes image storage and distribution efficient. If two images share a base like ubuntu:latest, they can both use the same base layer and only add their own differences on top. However, every file that is modified is fully duplicated.

For SWE-Bench, every image starts with ubuntu:22.04. Then comes one of 63 distinct “environment” layers that set up dependencies, and finally one of 500 "instance" layers, including the repository checkout at the right commit.

The problem is that while the environment layers share many dependencies and repositories change very little between commits, the resulting layers are still different. As a result, full copies are created every time. While every checkout is only a few hundred megabytes, that quickly adds up when multiplied by 500 instances.

In short, the way SWE-Bench (Verified) images are constructed leads to hundreds of near-duplicate layers adding up to 240 GiB.
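You can see the duplication for yourself by listing the layers of any pulled instance image (the image name below is a placeholder):

# list the layers of one instance image together with their sizes (image name is a placeholder)
docker history <swe-bench-instance-image>
# the shared ubuntu and environment layers appear once, but the topmost "instance" layer
# carries a full repository checkout in each of the 500 images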

Fixing the Layering Problem

To resolve this, we introduce a technique we call delta layering. Instead of creating a single layer for every checkout containing a full copy of the repository, we post-process the images so that each instance layer only adds the difference - the delta - relative to the previous commit.

The intuition is simple: two snapshots of the same repository taken only a few weeks apart are nearly identical. Yet in the default layering scheme, both snapshots get packaged as full copies; delta layering removes that duplication.
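A rough way to convince yourself of this (not our actual post-processing pipeline) is to export two consecutive instance images of the same repository and diff their filesystems; the image names are placeholders:

docker export "$(docker create <instance-N>)"   > a.tar
docker export "$(docker create <instance-N+1>)" > b.tar
mkdir a b && tar -xf a.tar -C a && tar -xf b.tar -C b
diff -rq a b | head    # only a handful of files actually differ between the two snapshots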

We build chronological chains—one per repository—where each instance builds directly on top of the previous one. The resulting layers become small changes between commits (including potential dependency changes), instead of big, redundant snapshots. Only Django had so many instances that we had to split it into two chains due to Docker’s hard limit of 125 layers per image.

All of these chains share a common base layer that holds the truly universal pieces - Ubuntu 22.04, Conda, and other system-level dependencies. 

Could we get the same result by just cloning the chronologically last state of the repo and then checking out the right commit? Unfortunately, no. This would leave future commits in the git history, which can and did get exploited by agents to cheat.

Git History and Packfiles

Delta layering solves much of the duplication problem, but there’s a hidden complication: git history. Each SWE-Bench image includes the full git history of the repository up to the point when the issue was created. In principle, this shouldn’t be a huge deal. Git stores its data as a key–value database of objects: commits, trees, and blobs. Adding a new commit just creates a few new objects - the changed files, changed directories, and the commit object itself. If everything were stored as loose zlib-compressed files in .git/objects, delta layering could simply capture the handful of new objects.

But in practice, git uses packfiles. A packfile bundles thousands of objects into a single large file and applies compression across them. This is great for efficiency, but the problem is that every time a new packfile is generated, that’s an entirely new multi-hundred-megabyte file from Docker's perspective. As a result, all the benefits of delta layering vanish.

To resolve this problem, we restructured the packfiles, creating one per instance, containing all additional git objects. We do lose some of git’s internal compression, but the trade-off is worth it: small, incremental layers instead of massive redundant packfiles.
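A minimal sketch of that restructuring, where PREV and CURR stand in for the head commits of two consecutive instances in a chain:

# pack only the objects that CURR introduces on top of PREV into one small per-instance packfile
git rev-list --objects PREV..CURR | git pack-objects .git/objects/pack/pack-delta
# objects reachable from PREV stay in the packs shipped with the earlier layers,
# so each new layer only has to carry its small incremental pack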

Removing Build Artifacts

Many of the images contained leftovers from the build process that were never needed at runtime—installers, caches, etc. For example, the Miniconda installer alone added 136 MB to every image. Pip and Conda caches consumed even more. Removing these shaves off gigabytes at essentially no cost.
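The cleanup itself is mundane; a sketch of the kind of commands involved, assuming a standard Miniconda-plus-pip image (the installer path is an assumption):

rm -f /opt/miniconda3/miniconda.sh   # leftover Miniconda installer
conda clean --all --yes              # conda package caches and tarballs
pip cache purge                      # pip's wheel and HTTP caches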

Final Compression

In addition to making each layer as small as possible, we also apply cross-layer compression. While Docker’s layer model copies the entire file when a single line changes, compression algorithms are very good at spotting such repeated data.

We chose zstd because it’s fast, highly parallel, and supports very large compression windows. To give the compressor the best shot, we sorted the layers by their chronological chain order. That way, nearly identical layers sit next to each other in the input stream. As a result, the entire benchmark, 240 GiB of raw images, now fits into a single 5 GiB archive.

Using 100 cores, the compression process below takes around ten minutes. Decompression, however, is extremely fast—about forty seconds on a single core.

zstd -T100 -19 --long=31 layers.tar
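Decompressing the result needs the matching long-distance window flag, for example:

zstd -d --long=31 layers.tar.zst   # restores layers.tar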

                         Original Size    Our Size
Uncompressed (Podman)    240 GiB          31 GiB
Compressed (Registry)    106 GiB          12.4 GiB
Compressed II (Zstd)     n/a              5.0 GiB

Summary

All told, our optimizations bring SWE-Bench Verified down from 240 GiB of raw layers to just 31 GiB uncompressed—and with the right compression, a single archive of only 5 GiB. That archive is small enough to download and unpack in about five minutes on any modern machine. Best of all, the core of our optimization – delta layering – is not SWE-Bench specific and can easily be applied to any other series of execution environments. Because Docker and Podman can’t natively load compressed bundles, we’ve provided helper scripts on GitHub. The final archive itself is hosted on Hugging Face, supporting fast downloads.

If all you care about is the quickest way to set up SWE-Bench Verified, here it is:

curl -L -#  https://huggingface.co/LogicStar/SWE-Bench-Verified-Compressed/resolve/main/saved.tar.zst?download=true | zstd -d --long=31 --stdout | docker load
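If you use Podman instead of Docker, the same stream should load just as well (a sketch, not one of our shipped helper scripts):

curl -L -# https://huggingface.co/LogicStar/SWE-Bench-Verified-Compressed/resolve/main/saved.tar.zst?download=true | zstd -d --long=31 --stdout | podman load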

What's Next?

Execution environments are not only essential for evaluating code agents but also for training code models. Regardless of whether you do RL or SFT, generating high-quality training data requires diverse agent traces, which in turn require a large number of execution environments. These are exactly the environments we can now efficiently store and distribute to fleets of ephemeral machines to generate traces at scale…

Stay tuned to learn more about what comes next.

Authors: Christian Mürtz & Mark Niklas Müller

How We Made SWE-Bench 50x Smaller

We optimized the OCI layer structure of code execution environments to improve storage and distribution at scale

September 16, 2025

At LogicStar, our mission is to build a platform for self-healing applications. This relies on a strong bug-fixing backbone and review system working hand in hand to produce high-quality code fixes where possible, while abstaining rather than proposing incorrect fixes. We are therefore excited to announce that we not only have the best test generation system (announced last week) but have also reached state-of-the-art fix generation with 76.8% accuracy on SWE-Bench Verified, the most competitive benchmark for automated bug fixing. Combining these systems, we achieve 80% precision, i.e., if our agent proposes a code fix, it is ready to merge 8 out of 10 times.

We are particularly proud that we achieved these results with our cost-effective production system rather than an agent carefully tuned for SWE-Bench and too expensive to ever run on customer problems. To achieve this, our L* Agent v1 leverages only the cost-effective OpenAI GPT-5 and GPT-5-mini, breaks down the bug fixing problem into clear sub-problems, and then orchestrates multiple sub-agents to investigate, reproduce, and fix the issue, before carefully reviewing and testing the generated code fix. All of this is enabled by our agent’s unique codebase understanding, powered by proprietary static analysis. 

So, how does our L* Agent work and why is it so cost-effective? The main insight is to combine a strong model (GPT-5), generating baseline patches and tests, with diverse cheaper agents based on GPT-5-mini, to increase diversity before picking the best patch using our state-of-the-art tests. All of this is enabled by our static-analysis-powered codebase understanding, which boosts the performance of both the weak and strong models.

We prioritize correctness and validation over speed, processing all issues asynchronously, as soon as they appear in your bug backlog or observability stack. This approach ensures you don’t have to waste time manually triaging and reviewing issues but simply receive high-quality patches from LogicStar for the issues we can solve confidently. We are now turning this technology into a lovable product, and invite you to sign up as a design partner if you’d like to help us build a system that will reliably maintain your code. While SWE-Bench is an important benchmark, it’s only part of the story — we are developing our agents for real-world use, not only benchmarks, so be sure to follow us for more updates.

SWE-Bench Verified – Best Fix Generation at 76.8%

The L* agent achieves state-of-the-art results on SWE-Bench Verified using an ensemble of cheap agents and strong validation

September 10, 2025

In a series of posts, we will outline some of the core technologies behind LogicStar.

At LogicStar AI, we are building the platform for self-healing software applications, leveraging agentic systems to autonomously identify, reproduce, and fix bugs. This requires rigorous testing and thorough validation of every application behavior to avoid introducing new issues or wasting reviewer time. Therefore, test generation is an area of key importance at LogicStar.

Our vision is to deliver substantial value for commercial applications rather than flashy AI demos; we design LogicStar to avoid wasting developer time on reviewing partial or almost-correct pull requests.

To drive innovation in test generation, we have developed and open-sourced SWT-Bench, published at NeurIPS 2024. While the popular SWE-Bench requires code agents to fix given issues, SWT-Bench tests their ability to generate effective tests. This allows us to develop agents that excel at test generation. Within LogicStar, we orchestrate these test and code generation agents to collaboratively produce well-tested patches for every bug we address.

This system allows our agents to score 84% on SWT-Bench, beating the previous state of the art of 75.8%, held by the OpenHands team. We achieve this performance by combining multiple agents and models, iteratively refining both code and tests. The seamless orchestration of these agents relies heavily on our proprietary technology, including advanced static analysis tools used directly by our agents. As our agents do not rely on Internet access, there is no risk of leaking your source code, secrets, or your customers' data. Instead, our agents leverage advanced code search capabilities, iterative feedback driven by code execution with coverage metrics, and static analysis tools developed by LogicStar for building codebase understanding.

We are rolling out our latest agent advancements with selected design partners who share our vision for self-healing applications and are helping us shape the future of this technology. Their collaboration ensures that our research delivers immediate value for commercial software. If you also believe in this direction and work with Python, JavaScript, or TypeScript repositories, we invite you to sign up here. We will support you through onboarding and ensure full SOC 2 compliance.

SWT-Bench Verified – Best Test Generation at 84%

The L* Agent achieves a new state-of-the-art of 84% on SWT-Bench Verified

May 22, 2025

At LogicStar AI, trust, security, and operational excellence are foundational to how we build and deliver our autonomous software maintenance platform.

We’re proud to share that LogicStar has successfully completed a SOC 2 audit conducted by an independent third-party firm, validating the design and implementation of our security controls in alignment with the AICPA Trust Services Criteria.

This achievement reflects our commitment to safeguarding customer data and building secure systems from day one.

Importantly, we’ve also implemented continuous monitoring processes that ensure our controls remain active and effective — not just at a point in time, but throughout our operations.

SOC 2 compliance is one step in our broader mission to build infrastructure our customers and partners can rely on with confidence.

If you’re a customer or vendor and would like to receive a copy of our SOC 2 audit report, please reach out to: info@logicstar.ai

March 3, 2025

We’re excited to announce that LogicStar AI has officially joined the ETH Zurich AI Center as an affiliate member🎉! This partnership is a significant milestone for us, especially since many of our team members have deep roots in AI research at ETH.

As a pioneering AI company focused on autonomous software maintenance, this collaboration strengthens our commitment to advancing AI research and innovation. Partnering with one of the world’s leading AI research hubs at ETH Zurich will accelerate our efforts in building cutting-edge AI agents that autonomously detect, reproduce, and fix software bugs, transforming the way engineers maintain commercial applications. At LogicStar, we harness AI alongside classical computer science to enable engineering teams and AI agents to autonomously maintain commercial applications, resolving issues faster and freeing engineers to focus on innovation by automating tedious maintenance tasks.

What This Means for LogicStar AI & Our Community:
✅ Access to World-Class Research & Talent - Collaborating with ETH Zurich’s AI experts, faculty, and students to push the boundaries of AI-powered software development.
✅ Advancing AI Reliability & Explainability - Working alongside top researchers to refine AI verification and validation techniques, ensuring robust and trustworthy autonomous coding agents.
✅ Stronger AI Ecosystem - Engaging with startups, industry leaders, and academia to shape the future of self-healing software and AI-driven code maintenance.

This partnership marks a major milestone in our mission to revolutionize software reliability. We’re excited about the journey ahead and look forward to working with ETH Zurich’s brilliant minds to make AI a seamless, dependable partner for engineering teams.

As AI is central to our mission, we are at the forefront of AI research and innovation. Being affiliated with the ETH Zurich AI Center allows us to do just that. We’re excited to collaborate as we advance the field of agentic AI for application maintenance together!

ETH AI Center Affiliation

LogicStar AI Joins the ETH AI Center as an Affiliate! 🚀

February 24, 2025

LLM-generated applications are here. Some well-known tools now offer to turn anyone into an app developer, while others aim to make current developers more productive. Given the concerns about the security, support, and future development of these quickly made apps, we wanted to measure this with a benchmark. Our focus wasn’t on any specific tools but on the LLM models that power them. We present BaxBench, a benchmark of 392 instances: 28 scenarios for LLMs to implement using 14 different frameworks across 6 languages, including Django, ExpressJS, Flask, Ruby on Rails, and others. Together with the team at ETH Zurich, we have created baxbench.com for the leaderboard and more benchmark details.

We conducted an analysis of backend applications generated by LLMs with a specific focus on assessing their exposure to real security exploits. Backends are integral to many systems and vary in complexity; some are designed to manage the entire state of an application, while others are constructed by integrating multiple specialized services, known as microservices. Applications rely on one or more security-critical backends to perform tasks such as handling logins, managing application state, and storing user data. To evaluate these systems, we developed BaxBench, which consists of small and frequently seen tasks for application backends. These backends are granted access to a database for storage, and large language models (LLMs) are tasked with generating their logic based on predefined OpenAPI specifications. Our findings revealed that many of these backends were not secure, as we were able to execute actual attacks against them. This goes beyond mere analysis or tool-generated warnings about hypothetical security issues - we successfully executed real exploits, including SQL injection, path traversal, and user impersonation. It is crucial to emphasize that our specifications did not suggest any vulnerabilities; the vulnerabilities we exploited arose from the outputs generated by the LLMs.
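To make this concrete, an exploit of the SQL-injection kind boils down to sending a malicious request to the running generated backend; the endpoint, port, and payload below are purely hypothetical illustrations, not part of BaxBench itself:

curl -s -X POST http://localhost:5000/login \
     -H "Content-Type: application/json" \
     -d "{\"username\": \"admin' OR '1'='1\", \"password\": \"x\"}"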

One interesting point is that we can ask LLMs to fix these vulnerabilities, and they manage to solve many of them. They do best when we tell them exactly what we will try to exploit. However, even then, not every vulnerability goes away, and there is a trade-off: when security issues are fixed, we find that some apps stop working properly. This creates a big opportunity for tools like the ones we are building at LogicStar, which can both identify and fix these security issues. And of course, the benchmark is open source, so security and application development experts can help us add more scenarios or new ways to exploit vulnerabilities. In addition, we expect that LLMs will also get better thanks to benchmarks like BaxBench.

Looking deeper, it’s clear that correctness and security aren’t the only challenges. LLMs also struggle to create reliable code in different backend frameworks, especially those that aren’t the most popular ones. Engineers see firsthand that LLMs can have trouble with complex and varied tasks, which means the results aren’t always perfect. Sometimes, even the best tools get stuck and can’t improve an app just by using more LLM prompts. However, to make progress, you have to start by measuring the problem. With BaxBench, we looked at security; going forward, at LogicStar we are focusing on checking and improving how well models can understand existing apps, fix real problems in supporting them, and ultimately make their end users - and yours - happier.

For more work on maintaining software, you can follow our research, blogs and social media. If you’re running an app with Python backends, we’d love to talk about how our early access product can help maintain that app by fixing bugs.

Need more information? Have a look at the paper and the baxbench.com website.

Please file issues or contribute to the benchmark code.

Introducing BaxBench

BaxBench: Can LLMs Generate Secure and Correct Backends?

February 4, 2025

On 4th February 2025, TechCrunch’s senior reporter Natasha Lomas wrote this article about LogicStar.

The text of the article is quoted below: “ Swiss startup LogicStar is bent on joining the AI agent game. The summer 2024-founded startup has bagged $3 million in pre-seed funding to bring tools to the developer market that can do autonomous maintenance of software applications, rather than the more typical AI agent use-case of code co-development.

LogicStar CEO and co-founder Boris Paskalev (pictured top right, in the feature image, with his fellow co-founders) suggests the startup’s AI agents could end up partnering with code development agents - such as, say, the likes of Cognition Labs’ Devin - in a business win-win.

Code fidelity is an issue for AI agents building and deploying software, just as it is for human developers, and LogicStar wants to do its bit to grease the development wheel by automatically picking up and fixing bugs wherever they may crop up in deployed code.

As it stands, Paskalev suggests that “even the best models and agents” out there are unable to resolve the majority of bugs they’re presented with - hence the team spying an opportunity for an AI startup that’s dedicated to improving these odds and delivering on the dream of less tedious app maintenance.

To this end, they are building atop large language models (LLMs) - such as OpenAI’s GPT or even China’s DeepSeek - taking a model-agnostic approach for their platform. This allows LogicStar to dip into different LLMs and maximize its AI agents’ utility, based on which foundational model works best for resolving a particular code issue.

Paskalev contends that the founding team has the technical and domain-specific knowledge to build a platform that can resolve programming problems which can challenge or outfox LLMs working alone. They also have past entrepreneurial success to point to: he sold his prior code review startup, DeepCode, to cybersecurity giant Snyk back in September 2020.

“In the beginning we were thinking about actually building a large language model for code,” he told TechCrunch. “Then we realized that that will quickly become a commodity… Now we’re building assuming all those large language models are there. Assuming there’s some actually decent [AI] agents for code, how do we extract the maximum business value from them?”

He said that the idea built on the team’s understanding of how to analyze software applications. “Combine that with large language models - then focus into grounding and verifying what those large language models and the AI agent actually suggest.”

Test-driven development

What does that mean in practice? Paskalev says LogicStar performs an analysis of each application that its tech is deployed on - using “classical computer science methods” - in order to build a “knowledge base”. This gives its AI agent a comprehensive map of the software’s inputs and outputs; how variables link to functions; and any other linkages and dependencies etc.

Then, for every bug it’s presented with, the AI agent is able to determine which parts of the application are impacted - allowing LogicStar to narrow down the functions needing to be simulated in order to test scores of potential fixes.

Per Paskalev, this “minimized execution environment” allows the AI agent to run “thousands” of tests aimed at reproducing bugs to identify a “failing test”, and - through this “test-driven development” approach - ultimately land on a fix that sticks.

He confirms that the actual bug fixes are sourced from the LLMs. But because LogicStar’s platform enables this “very fast executive environment” its AI agents can work at scale to separate the wheat from the chaff, as it were, and serve its users with a shortcut to the best that LLMs can offer.

“What we see is [LLMs are] great for prototyping, testing things, etc, but it’s absolutely not great for [code] production, commercial applications. I think we’re far from there, and this is what our platform delivers,” he argued. “To be able to extract those capabilities of the models today, we can actually safely extract commercial value and actually save time for developers to really focus on the important stuff.”

Enterprises are set to be LogicStar’s initial target. Its “silicon agents” are intended to be put to work alongside corporate dev teams, albeit at a fraction of the salary required to hire a human developer, handling a range of app upkeep tasks and freeing up engineering talent for more creative and/or challenging work. (Or, well, at least until LLMs and AI agents get a lot more capable.)

While the startup’s pitch touts a “fully autonomous” app maintenance capability, Paskalev confirms that the platform will allow human developers to review (and otherwise oversee) the fixes its AI agents call up. So trust can be - and must be - earned first.

“The accuracy that a human developer delivers ranges between 80 to 90%. Our goal [for our AI agents] is to be exactly there,” he adds.

It’s still early days for LogicStar: an alpha version of its technology is in testing with a number of undisclosed companies which Paskalev refers to as “design partners”. Currently the tech only supports Python - but expansions to Typescript, Javascript and Java are billed as “coming soon”.

“The main goal [with the pre-seed funding] is to actually show the technology works with our design partners - focusing on Python,” adds Paskalev. “We already spent a year on it, and we have lots of opportunity to actually expand. And that’s why we’re trying to focus it first, to show the value in one case.”

The startup’s pre-seed raise was led by European VC firm Northzone, with angel investors from DeepMind, Fleet, Sequoia scouts, Snyk and Spotify also joining the round.

In a statement, Michiel Kotting, partner at Northzone, said: “AI-driven code generation is still in its early stages, but the productivity gains we’re already seeing are revolutionary. The potential for this technology to streamline development processes, reduce costs, and accelerate innovation is immense. and the team’s vast technical expertise and proven track record position them to deliver real, impactful results. The future of software development is being reshaped, and LogicStar will play a crucial role in software maintenance.”

LogicStar is operating a waiting list for potential customers wanting to express interest in getting early access. It told us a beta release is planned for later this year. “

TechCrunch Article About LogicStar

A TechCrunch article about us titled LogicStar is building AI agents for app maintenance

December 18, 2024

SWT-Bench: Benchmarking CodeAgents’ Test Generation Capabilities

As the complexity of modern software systems grows, so does the challenge of ensuring their reliability. To this end, rigorous testing plays a critical role in maintaining high software quality. However, while the rise of Large Language Models (LLMs) has catalyzed advancements in code generation, their potential in test automation remains underexplored. Enter SWT-Bench, a novel benchmark for test generation based on real-world GitHub issues, developed in collaboration with ETH Zurich. With the release of a public leaderboard at swtbench.com, we aim to spark a similar push from the research community on test generation as SWE-Bench caused for code generation.

What is SWT-Bench?

SWT-Bench is a test generation benchmark based on real-world GitHub issues. The objective is to generate a test reproducing the described issue given the full codebase. We determine whether a test reproduces an issue by checking whether it fails on the original codebase but passes after a human-written ground-truth fix, taken from the corresponding pull request (PR), has been applied. We call this the success rate S. Additionally, we measure the coverage delta ΔC of the lines modified in this ground-truth bug fix to further assess test quality.
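In formula form (our paraphrase of the definition above, where t_i denotes the test generated for instance i out of N):

\mathcal{S} \;=\; \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\!\left[\, t_i \text{ fails on the original codebase} \;\wedge\; t_i \text{ passes after the gold fix} \,\right]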

How did we create SWT-Bench?

Starting with over 90,000 PRs from 12 popular GitHub repositories, we applied rigorous filtering to obtain 1,900 diverse and high-quality instances. SWT-Bench thus reflects the complexity of modern software ecosystems, challenging AI systems to navigate large codebases (up to 700k lines), interpret nuanced issue descriptions (320 words average), and integrate tests into diverse existing test suites and frameworks (from pytest to tox to custom frameworks).

First Results

Performance of Code Agents

We found that Code Agents originally designed for program repair (e.g., SWE-Agent) perform well on test-generation tasks, even outperforming dedicated test-generation methods (LIBRO). Moreover, even minimal modifications, like explicitly instructing the agent to execute the generated tests (SWE-Agent+), improve performance significantly further. This highlights the potential of dedicated Code Agents for test generation.

A new Patch Format for Test Generation

Based on the insight that test generation is typically solved by adding a new (test) function or class, we propose a novel patch format tailored for fault tolerance and simplicity. This format alone allows vanilla LLMs to generate executable tests in twice as many cases (ZeroShot vs. ZeroShotPlus), leading to almost 3 times as many solved instances.

Utility of Generated Tests

Automatically generating high-quality tests not only lets developers focus on (test-driven) development that generates real business value, but can also boost the performance of code generation agents. In particular, the generated tests can guide them along the whole generation process, from informing context localization to bug-fix validation. Early results show that simply using generated tests to filter proposed bug fixes can more than double the achieved precision.
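As a rough illustration of that filtering step (the paths and file names are hypothetical), one can keep only the candidate patches under which the generated tests pass:

# hypothetical sketch: keep only candidate patches that make the generated reproduction tests pass
for patch in candidates/*.patch; do
    git apply "$patch" || continue
    pytest generated_tests/ -q && echo "$patch" >> surviving_patches.txt
    git checkout -- . && git clean -fd    # restore a clean tree before trying the next candidate
done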

Correlation of Test and Fix Generation

While we observe that Code Agents that perform well on code generation also perform well on test generation, we interestingly do not see such a correlation for individual issues. That is, an issue that is easy to fix is not necessarily easy to test, and vice versa. Indeed, we see no statistically significant correlation between the hardness/resolution rates of these tasks, highlighting the unique challenges of test generation.

Implications for the Future of Software Maintenance

SWT-Bench demonstrates the capability of LLMs to interpret and formalize the intent of natural-language issue descriptions into tests. This has the potential, in the long run, to significantly improve software quality by making thorough testing attainable without significant manual effort. As a next step, it can even enable self-healing systems that automatically detect, reproduce, and resolve issues in real time as they appear, minimizing downtime and increasing reliability.

We at LogicStar AI believe that reliable automated testing is the key to unlocking the real potential of Code Agents and will be essential to push the frontier in automated application maintenance. Therefore, we are extra excited to see the great interest of the community in SWT-Bench and hope that our public leaderboard can make it even more accessible.

For more details, check out our NeurIPS paper (https://arxiv.org/pdf/2406.12952) or our open-source code (https://github.com/logic-star-ai/swt-bench).

Introducing the SWT-Bench Leaderboard!

SWT-Bench Benchmarking CodeAgents' Test Generation Capabilities

December 5, 2024

Researchers and entrepreneurs from INSAIT and ETH Zurich have launched LogicStar AI, a new deep-tech startup, which is building fully autonomous agentic AI that helps teams maintain their software.

Founding Team
The founding team behind LogicStar AI is star-studded and includes the founders of DeepCode.ai (now Snyk Code), which currently delivers more than $100M ARR for Snyk. The LogicStar AI founders are:

🌟 Boris Paskalev (CEO), formerly CEO of DeepCode, then Director of Product AI at Snyk, and currently a Strategic Entrepreneurship Advisor at INSAIT.
🌟 Dr. Mark Niklas Müller (CTO), AI PhD from ETH Zurich, formerly an engineer at Porsche and Mercedes AMG Petronas F1 Team.
🌟 Dr. Veselin Raychev (Chief Architect), formerly CTO of DeepCode, then Head of AI at Snyk and researcher at INSAIT.
🌟 Prof. Martin Vechev (Adviser), full professor at ETH Zurich, founder and scientific director of INSAIT.

LogicStar AI Mission
LogicStar AI is building an agentic AI platform for automatically validating, reproducing, and fixing bugs with high precision. Their technology empowers engineering teams to focus on creating new features, driving growth and innovation, by reducing the burden of maintenance and debugging issues. With LogicStar AI, developers can thus spend more of their time delivering real business value, while LogicStar AI reliably addresses software maintenance problems without manual intervention.

INSAIT’s mission
INSAIT (https://insait.ai) is a world-class computer science and AI research institution, founded in 2022 in partnership with Switzerland’s ETH Zurich and EPFL. The focus of INSAIT is on conducting world-class research and attracting outstanding faculty, research scientists, postdocs, and PhD students. In the short time since its inception, INSAIT has published over 50 papers in all major AI venues, as well as in premier theory conferences.

Join the Journey
As LogicStar AI embarks on this exciting new chapter, we invite talented individuals to join our team and shape the future of reliable AI for code and software applications. For more information on career opportunities, partnerships, or to learn about our innovative solutions, please visit our website and follow LogicStar on LinkedIn. LogicStar AI has offices in both Sofia, Bulgaria and Zurich, Switzerland.

Agentic AI from INSAIT and ETH Zurich

INSAIT and ETH Zurich Entrepreneurs launch LogicStar AI, a new Agentic AI startup

October 17, 2024

We are excited to share SWT-Bench, the first benchmark for reproducing bugs and validating their fixes based on GitHub issue descriptions. We presented SWT-Bench at two ICML workshops and want to thank everyone who stopped by for their interest, enthusiasm, and the great discussions we had. We now see a community trend to not only focus on fixing bugs but also generating tests that can effectively reproduce them and validate that proposed fixes truly resolve the issues. We believe this is essential for achieving truly autonomous bug fixing, which is what LogicStar delivers.

In our paper, we demonstrate how any code repair benchmark with a known ground truth solution can be transformed into a test generation and issue reproduction benchmark. There, the goal is to create a “reproducing test” that fails on the original codebase and passes after the ground truth fix has been applied. Our analysis shows that Code Agents excel in this task and outperform dedicated LLM-based test generation methods. Leveraging these tests for code repair further allows us to significantly enhance precision. To learn more, please check out our preprint paper.

LogicStar AI builds on top of this research to achieve truly autonomous bug fixing that you can trust as much as you trust your top engineers.

SWT-Bench

A Benchmark for Testing and Validating Bugfixes

July 1, 2024

Zurich, Switzerland - 4th February 2025 - LogicStar, the AI agent for fully autonomous software maintenance, has raised $3M (CHF 2.6M) in a pre-seed funding round led by Northzone, with angel investors from DeepMind, Snyk, Spotify, Fleet and Sequoia scouts. LogicStar empowers engineering teams to focus on innovation by automating tedious application maintenance tasks.

LogicStar is revolutionising software maintenance with its autonomous code agent, designed to deliver self-healing software applications that empower engineers to focus on innovation and growth. LogicStar works seamlessly alongside human developers, autonomously reproducing application issues, testing solutions and proposing precise fixes without the need for human oversight. The world’s rapid adoption of AI has sparked a wave of pivotal trends that are reshaping industries and workflows. Organisations relying on custom software spend considerable resources and time on maintenance and bug fixes, which divert developers from innovation. AI coding agents, despite performing well on benchmarks and simple tasks, tend to introduce errors in complex settings, leaving teams stuck with tedious maintenance tasks.

The founding team consists of Boris Paskalev, Veselin Raychev, Mark Müller, and Prof. Dr. Martin Vechev. Boris, Veselin, and Martin previously built DeepCode.ai (acquired by Snyk and now called Snyk Code) and scaled it to over $100M ARR: a technology trusted by millions of developers. Martin also leads the Secure, Reliable, and Intelligent Systems (SRI) lab at ETH Zurich and is the Founder and Scientific Director of INSAIT. The unique technology behind LogicStar draws from the team’s deep research background and expertise from ETH Zurich, MIT, TRIUM, and INSAIT, resulting in over 20,000 citations and 350 top publications in AI and program analysis, particularly in large codebases and software development.

LogicStar has already released SWT-Bench to support the development of code agents and demonstrated that existing code agents are not up to the challenge of enterprise code bases, failing on >95% of issues. Using an advanced mock execution environment, LogicStar swiftly runs generated tests to reproduce issues and confirm solutions - spotting errors before you’re aware of them. At the core of LogicStar’s technology lies a blend of the latest advancements in LLMs for code and classical computer science techniques. The platform is rapidly evolving, with Python support already available and expansions to Typescript, Javascript, and Java coming soon. Technology leaders managing commercial software systems are invited to join the waiting list to experience the benefits of LogicStar firsthand.

Boris Paskalev comments, “I am excited that developers can focus on innovation and creativity while automation handles the burden of application maintenance. Our platform eliminates the need to oversee the current generation of agents and LLMs in maintaining commercial software. Providing an evolving solution that seamlessly grows along LLM advancements and maximises successful task completion.”

Michiel Kotting, partner at Northzone adds, “AI-driven code generation is still in its early stages, but the productivity gains we’re already seeing are revolutionary. The potential for this technology to streamline development processes, reduce costs, and accelerate innovation is immense. and the team’s vast technical expertise and proven track record position them to deliver real, impactful results. The future of software development is being reshaped, and LogicStar will play a crucial role in software maintenance.”

About LogicStar
LogicStar is the AI agent for fully autonomous application maintenance. Founded in 2024, the company is headquartered in Zurich, Switzerland and is backed by global venture firm Northzone, as well as angels from DeepMind, Snyk, Spotify, Fleet, and Sequoia Scouts.

About Northzone
Northzone (northzone.com) is a global venture capital fund built on experience spanning multiple economic and disruptive technology cycles. Founded in 1996, Northzone has raised more than ten funds to date, with its most recent fundraise in excess of $1.2 billion and has invested in more than 175 companies, including category-defining businesses such as Trustpilot, Spotify, Klarna, iZettle, Kahoot!, Personio, TrueLayer, Spring Health, amongst others. Northzone is a full-stack investor from Seed to Growth stage, with transatlantic hubs out of London, New York, Amsterdam, Berlin, Stockholm and Oslo.

LogicStar AI raised a $3m round led by Northzone

LogicStar, building the AI agent for fully autonomous application maintenance, raised a $3m round led by Northzone.

July 1, 2024

LogicStar AI is looking for passionate software engineers to join our team. We are a team of researchers, engineers, and product people that focus on cutting edge research and quickly bringing it to a product. If you are interested in working with us, please review our jobs on our careers page and send your resume to jobs@logicstar.ai.

Jobs

We are looking for passionate software engineers to join our team

April 11, 2024

🚀 We are thrilled to introduce LogicStar, a pioneering deep-tech startup based in Switzerland, revolutionizing application monitoring and maintenance.

Our cutting-edge platform blends AI with proven computer science methodologies to create agentic AI-tailored application mocks. These mocks reproduce software bugs in an AI-powered mock execution environment, enabling scalable evaluation and verification of AI-driven fix suggestions.

Our exceptional team comprises experts and top researchers from ETH Zurich, INSAIT, and MIT, alongside seasoned serial entrepreneurs, united by a shared mission to redefine the future of software reliability.

✨ Join us on this transformative journey as we push the boundaries of application monitoring and maintenance with groundbreaking innovation.

Introducing LogicStar

We are excited to announce the launch of LogicStar AI, our startup to revolutionize application monitoring.


Stop Drowning in Bugs.
Start Shipping Features Faster.

Join the beta and let LogicStar AI clear your backlog while your team stays focused on what matters.

No workflow changes and no risky AI guesses. Only validated fixes you can trust.
