16.10.2025
Jitendra Palepu
Open Source
From Efficiency to Exposure: The Rise of Vibe Coding
Developers no longer write every line of code from scratch. Most software is built on layers of existing libraries. Traditionally, this meant reusing vetted, attributed, and licensed Free and Open Source code. Enter “vibe coding” — the practice of using generative AI tools to produce quick scaffolds, utility functions, or even core business logic. In some organizations, more than 60% of code is now AI-generated, yet only a fraction of companies have processes to approve these tools or vet their output.
The result: opaque, untraceable code with unknown licensing, origins, or vulnerabilities. Even worse, many developers can’t tell if a function was generated by AI, copy-pasted from Stack Overflow, or copied wholesale from a GPL repository.
When prompted to complete a sorting algorithm or a math function, GitHub Copilot often reproduces code that is near-identical to existing examples in public repositories. Our own demonstrations have shown exact matches—but with the license and author stripped away. This is not accidental. It’s architectural.
AI code generation systems are trained on massive datasets of existing code, often without respecting the terms of use or the license obligations. And the models are not built to preserve provenance.
“Copilot is not a co-author. It is a collector—often of other people’s work.”
The Legal Shift: From Infringement Theory to Infringement Practice
Until recently, legal risks around AI-generated code were largely hypothetical. That changed in September 2025, when a German court (Landgericht München I) found OpenAI likely liable for copyright infringement over the use of song lyrics in training its models.
The court rejected:
- OpenAI’s claim that the users were responsible.
- Arguments invoking EU text and data mining exceptions.
- Comparisons to U.S. “fair use.”
Instead, the court made clear: training on copyrighted data without permission or license is infringement. And generating content based on that training is unauthorized reproduction.
This case could soon lead to a formal injunction, and the court has signalled that it could become a venue for similar lawsuits. If this logic extends to source code, Copilot-style models trained on GPL code could be in legal free fall.
Diverging Legal Standards: Europe vs. the United States
While European courts are beginning to impose strict copyright obligations on model training and output, the situation in the United States remains more ambiguous. Under U.S. copyright law, AI companies often argue that training large language models on public code falls under the doctrine of “fair use”. Fair use in the U.S. is not a legal permission – it’s a defense, one that is fact-specific, unpredictable, and applied inconsistently across courts. Some AI developers rely on it as a shield for training on copyrighted data, but there is no guarantee that courts will agree.
There are several ongoing lawsuits in the US concerning AI and potential violations of intellectual property rights. U.S. courts have yet to form a consistent view on whether AI training constitutes fair use. Until clear precedent is established, companies using or distributing AI-generated code, especially code that resembles existing works, face significant legal uncertainty. To address this issue, Creative Commons has proposed a set of machine-readable opt-out signals that would allow copyright holders to express a preference that their works not be used for AI training. This opt-out mechanism is gaining legal weight in Europe.
Under the EU AI Act (Article 53(1)(c), recital 106, and Measure I.2.3 of the Copyright section in the draft Code of Practice) and the CDSM Directive, model developers must respect such opt-outs when training on copyrighted works, even if the training occurs outside the EU. This also means that once an AI model or its outputs are placed on the EU market, its developers are expected to follow EU copyright rules, no matter where the training took place and even if that training would be covered by U.S. fair use or a similar doctrine elsewhere.
Detecting the Invisible: AI Copying May Be Proven
As discussed in our recent Bitkom Forum Open Source 2025 talk, detecting AI-generated code is tricky, as most code lacks clear indicators of its origin. While some snippets may include comments like “generated by ChatGPT,” this is rare. However, there are some indicators.
AI-generated code often has a uniform structure with excessive or unnecessary comments, uses generic variable names like temp or data, and resembles textbook examples rather than real-world code. Semantically, it may include redundant or illogical statements, lack edge case handling, or show little understanding of domain-specific logic unless explicitly prompted.
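To make these stylistic indicators concrete, here is a short constructed snippet of the kind reviewers frequently flag: uniform structure, a comment on nearly every line, generic names, and no edge-case handling. It is an illustration, not output captured from any particular tool.

```python
# Constructed example of "textbook-style" generated code: generic names,
# a comment on nearly every line, and no handling of empty or bad input.
def process_data(data):
    # Initialize the result list
    result = []
    # Loop over every item in the data
    for item in data:
        # Check whether the item is greater than zero
        if item > 0:
            # Append the item to the result list
            result.append(item)
    # Return the result list
    return result
```

A human-written equivalent in a production codebase would more likely carry a domain-specific name, type hints, and at least minimal validation of its input.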
Tools like GPTZero and DetectGPT, originally designed for text, can sometimes flag AI-generated comments or explanations. Plagiarism detectors like PlagScan and Turnitin are also beginning to scan code. Searching snippets on GitHub or Google often reveals near-identical code from public sources like Stack Overflow.
Other clues may appear in commit history: GitHub Copilot commits sometimes include metadata or trailers such as “Co-authored-by.” Occasionally, prompt fragments even leak into code comments or variable names.
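Where a repository is available, this kind of clue can be checked mechanically. The sketch below, a minimal example assuming Python 3 and a local Git checkout, walks the commit log and flags messages carrying a “Co-authored-by:” trailer. A trailer alone does not prove AI involvement, but entries naming an assistant are worth a closer look.

```python
import subprocess

def find_coauthor_trailers(repo_path: str = ".") -> None:
    """Print commits whose messages contain a 'Co-authored-by:' trailer."""
    # %H = commit hash, %B = full commit message; %x00 separates commits.
    log = subprocess.run(
        ["git", "-C", repo_path, "log", "--pretty=format:%H%n%B%x00"],
        capture_output=True, text=True, check=True,
    ).stdout
    for record in filter(None, log.split("\x00")):
        commit_hash, _, body = record.strip().partition("\n")
        trailers = [line.strip() for line in body.splitlines()
                    if line.strip().lower().startswith("co-authored-by:")]
        if trailers:
            print(commit_hash[:10], "->", "; ".join(trailers))

if __name__ == "__main__":
    find_coauthor_trailers()
```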
Tools like Vendetect use semantic fingerprinting to detect vendored or copied code across repositories, even after it has been refactored. Combined with version control analysis, such tools can trace code back to the exact commit in the source repository. But even these tools have limits. Obfuscated code, minor structural variations, or deeply transformed snippets can still evade detection.
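The core idea behind such fingerprinting can be sketched in a few lines. The toy example below is not Vendetect's actual algorithm and only handles Python source: it normalizes identifiers and literals, hashes k-grams of tokens, and keeps the minimum hash per sliding window in the style of the classic winnowing approach, so that renamed variables or reflowed code still produce overlapping fingerprints.

```python
import hashlib
import io
import keyword
import tokenize

def normalized_tokens(source: str) -> list[str]:
    """Tokenize Python source, keeping keywords and operators but
    collapsing identifiers and literals so renaming does not matter."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME:
            out.append(tok.string if keyword.iskeyword(tok.string) else "ID")
        elif tok.type in (tokenize.NUMBER, tokenize.STRING):
            out.append("LIT")
        elif tok.type == tokenize.OP:
            out.append(tok.string)
    return out

def fingerprints(source: str, k: int = 5, window: int = 4) -> set[int]:
    """Hash every k-gram of normalized tokens, then keep the minimum
    hash in each sliding window (winnowing-style selection)."""
    toks = normalized_tokens(source)
    hashes = [
        int(hashlib.sha1(" ".join(toks[i:i + k]).encode()).hexdigest(), 16)
        for i in range(len(toks) - k + 1)
    ]
    if not hashes:
        return set()
    return {min(hashes[i:i + window])
            for i in range(max(1, len(hashes) - window + 1))}

def similarity(a: str, b: str) -> float:
    """Jaccard overlap of two fingerprint sets (1.0 means identical)."""
    fa, fb = fingerprints(a), fingerprints(b)
    return len(fa & fb) / len(fa | fb) if fa or fb else 0.0
```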
Despite these indicators, reliable detection remains difficult and requires a combination of tools, context, and manual review. 100% detection is difficult—if not impossible—at scale. That’s why detection must be combined with forensic-level scanning of codebases, developer disclosure, and clear contractual safeguards.
Security and Quality of AI-Generated Code
The issues surrounding code generated by AI go beyond licenses and copyrights. If AI models are trained on insecure, outdated, or buggy code, they reproduce those flaws. Researchers warn that AI-generated code often ignores edge cases, mishandles input types, or introduces vulnerabilities that seasoned developers would avoid. In a Checkmarx survey, 80 percent of developers reported using AI tools, but nearly half said they do not trust the output. Yet that output is silently entering production.
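A constructed example of the kind of flaw reviewers should watch for: a generated lookup helper that builds SQL by string interpolation, next to the parameterized version a careful reviewer would insist on. The table and function names are hypothetical.

```python
import sqlite3

# Hypothetical generated helper: builds the query by string interpolation,
# so a crafted name such as "x' OR '1'='1" returns every row in the table.
def find_user_unsafe(conn: sqlite3.Connection, name: str):
    query = f"SELECT id, name FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()

# Reviewed version: the driver binds the parameter, no string interpolation.
def find_user_safe(conn: sqlite3.Connection, name: str):
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (name,)
    ).fetchall()
```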
The security issues with AI-generated code will be crucial to address in the current regulatory landscape of the Cyber Resilience Act (CRA), the Digital Operational Resilience Act (DORA), NIS-2, and product liability laws that now also cover software.
AI Can Also Expose You — by Finding Bugs
Ironically, the same AI techniques that generate code can also detect flaws in it. Researcher Joshua Rogers used generative-AI-powered Static Application Security Testing (SAST) tools to discover 50 new bugs in cURL – one of the most heavily used and audited open source projects in the world. Even the project’s maintainer, Daniel Stenberg, who had previously dismissed AI-generated bug reports as “slop,” acknowledged the quality of these findings.
The tools Rogers used go beyond syntax analysis. They understand intent, protocol logic, and semantics — just like they do when generating code.
The dual-use nature of AI shows that the flaw lies not in the tool itself, but in how people use it. AI without review, audit, or attribution is a liability. AI with validation can be an asset.
From Blind Trust to Controlled Use?
AI-generated code must be treated like third-party code: it requires license verification, origin tracing, and security review. SBOMs (Software Bills of Materials) must include provenance data where possible: was this code generated? If so, how? What prompt was used? What training data is known? Developers must disclose AI usage to customers and partners; lack of disclosure creates legal risk under German contract law and makes warranty disclaimers ineffective. Software buyers must shift the risk contractually: define AI-origin code as a defect, demand auditability, and require vendors to assume responsibility rather than shift blame.
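One pragmatic way to record such provenance today is to attach custom properties to the affected component in a CycloneDX-style SBOM. The snippet below is only an illustration; the “internal:*” property names are an example convention, not part of the CycloneDX specification, and the component itself is hypothetical.

```python
import json

# Illustrative CycloneDX-style component entry with AI provenance recorded
# as custom properties. The "internal:*" names are an example convention,
# not part of the CycloneDX specification.
component = {
    "type": "library",
    "name": "sorting-utils",   # hypothetical component
    "version": "0.3.1",
    "licenses": [{"license": {"id": "MIT"}}],
    "properties": [
        {"name": "internal:ai-generated", "value": "true"},
        {"name": "internal:ai-tool", "value": "GitHub Copilot"},
        {"name": "internal:prompt-reference", "value": "ticket/1234-sort-helper"},
        {"name": "internal:human-review", "value": "approved, 2025-10-01"},
    ],
}

print(json.dumps({"components": [component]}, indent=2))
```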
Legal Accountability and Provenance in AI-Generated Code
If Copilot or a similar tool produces code that closely matches copyrighted material — such as code from Free and Open Source projects under the GPL or LGPL — that can be enough to trigger a copyright violation. German copyright law even gives rights holders the ability to request access to your source code if they suspect this kind of unauthorized reuse.
This can lead to lawsuits, takedown demands, or even financial damages, especially if the reused code comes from projects that offer commercial licenses, like Qt, MySQL, or OpenJDK. From a buyer’s or customer’s perspective, the law treats any code without a clear license or proof of origin as defective — just like broken hardware. In such cases, the law holds the software vendor responsible for shipping code that isn’t legally clean. To avoid this, developers and vendors should be upfront if they used AI-generated code. Developers should document, review, and include it in SBOMs just like any other third-party code.
When purchasing software, organizations must ensure that contracts clearly define AI-generated code as a potential risk, and require the supplier to take responsibility for it. That includes making sure the code is legal to use and properly reviewed. Knowingly shipping AI-generated code without verifying its origins or legality signals acceptance of the risk of breaking the law.
Organizations can implement review workflows, such as our OCCTET toolchain, that allow for forensic audits of codebases and generate clean SBOMs showing all software components and their provenance, together with their respective copyrights, licenses, and vulnerabilities.
Developers as Gatekeepers
Developers using Copilot, ChatGPT, or other codegen tools are also gatekeepers. They decide what gets into the codebase. That means they also decide what risks the company assumes. AI is here to stay. But so are copyright law, security standards, and contract law. We cannot ignore one just because the other is exciting.
Protect yourself. Audit your code. Demand transparency. At Bitsea, we help organizations turn uncertainty into clarity. Whether you are developing with AI tools like Copilot, integrating third-party code, or sourcing software from vendors, our forensic audit services ensure your codebase is legally clean, traceable, and secure at the file level.
