Review Article
Protecting the Candy: A Case for Data and Software Rights in the Age of Machine Learning
Charles R. Macedo*
Partner, Amster, Rothstein & Ebenstein, LLP, Chloe Vizzone, Associate, Amster, Rothstein & Ebenstein, LLP, and ChatGPT 4o
Received Date: May 28, 2025; Published Date: June 14, 2025
Abstract
As machine learning and generative AI technologies reshape innovation and creativity, legal frameworks for intellectual property (IP) have lagged behind. This article argues that data and software, the fundamental ingredients of AI systems, deserve dedicated legal protections analogous to physical goods. Using the analogy of a candy store, this piece frames generative AI outputs, platforms, and data preparation as products and supply chains deserving IP rights. A new category of “Data Rights” and “Software Rights,” grounded in originality and commercial investment, should be introduced to fill the current gap. These rights would be simple to enforce, valid for 15 years, and provide statutory royalties to all contributors, from data curators to platform developers.
Introduction
Imagine walking into a candy store, admiring the shelves lined with bright wrappers and mouthwatering sweets, and then simply walking out with your favorite chocolate bar without paying. Most would agree such an act is theft. Under state criminal codes, this would typically constitute petty larceny or shoplifting, with penalties including fines or jail time. Yet in today’s AI-driven economy, analogous behavior is not just common; it is often legal. Developers, platforms, and users routinely exploit data and generative AI outputs without attribution or compensation to those who created, curated, or maintained the underlying resources. As the economic and creative value of machine learning tools continues to surge, so too must the legal protections afforded to their building blocks.
Problem: Outdated IP Laws in a Post-ChatGPT Economy
The rise of ChatGPT in late 2022 marked a paradigm shift in public awareness and commercial adoption of artificial intelligence (“AI”) and generative technologies. Within months, generative AI tools were being used in virtually every sector, from education and entertainment to scientific research and legal services. This explosion of usage has propelled generative AI to become a critical engine of the digital economy, projected to contribute trillions of dollars to global GDP in the coming decade.
Yet existing legal frameworks, primarily copyright and patent law, are ill-suited (at least in the U.S.) to protect the core elements of these systems: data and software.
Copyright law struggles to protect software-generated content due to the “authorship” requirement, which excludes outputs not created by a human author. Additionally, raw and curated datasets used in training AI generally do not qualify for copyright protection unless they exhibit a creative selection or arrangement, leaving most training data unprotected.
While a stolen candy bar invokes clear criminal consequences, an AI-generated image copied without attribution may go unpunished.
Likewise, while software code can be copyrighted, its functionality and architecture often fall outside that protection. Patent law, for its part, imposes a high bar for novelty and non-obviousness and has been reluctant to treat software as patent-eligible.
This legal vacuum creates significant risks:
i. Developers lack enforceable rights over generative outputs, undermining monetization and investment.
ii. Platforms cannot ensure exclusive use of their software systems or training pipelines.
iii. Data contributors and curators go uncompensated, weakening incentives for high-quality data preparation.
In sum, the current IP regime, built for a pre-AI world, fails to recognize or reward the layered economic and creative contributions required to build machine learning systems.
More critically, these legal constructs are rooted in the economic and technological paradigms of the 18th and 19th centuries. The foundational structure of copyright was designed to incentivize and protect authors of literary and artistic works in a print-dominated society. Similarly, patent law emerged to promote invention during the Industrial Revolution, where mechanical processes and chemical compounds formed the backbone of economic value. These laws were well-suited to the tangible, human-centered outputs of their time.
But the 21st century operates on an entirely different substrate: information. The raw materials of today’s economy are no longer cotton, coal, or steel, but data, algorithms, and computation. Intellectual labor is increasingly performed not only by humans but also by intelligent machines. Creation is less about fixing ink to paper and more about training models on vast datasets, producing outputs through layers of computation, and refining them through human-machine collaboration. Despite this, our IP regime continues to hinge on antiquated notions of authorship and fixation.
As AI systems generate images, text, and software that rival or exceed human creations, legal questions of ownership, attribution, and compensation have become more urgent. Traditional IP laws do not answer these questions adequately because they were never designed to accommodate non-human creators, or the distributed, iterative nature of machine learning development. Without reform, we risk suppressing innovation, misallocating value, and fostering inequity across the AI economy.
In this context, reform is not only warranted; it is inevitable. Just as the law adapted to the printing press, the camera, and the internet, it must now evolve to embrace the realities of generative AI. Doing so will require new legal instruments that account for the complexity, scale, and economic significance of digital creation. It will also demand a reimagining of what it means to protect an “original work” or an “inventive step” in a world where machines contribute to both.
We propose that this reform should begin with the establishment of two new IP frameworks: Data Rights and Software Rights. These rights would recognize the economic and creative value of curated data and model design. They would offer protections where existing regimes fall short and provide a balanced, time-limited structure for attribution, licensing, and fair remuneration. They represent a first step toward reconciling law with the logic of the digital age.
The Output: The Candy on the Shelf
The final product of a generative AI system, whether it is a poem, image, software code, or synthetic data, is akin to a candy bar on a shelf. Just as a consumer cannot lawfully walk into a candy shop and take a bar of chocolate without paying, users should not be able to freely extract and commercialize AI-generated content.
This principle is rigorously upheld in the physical world. Criminal statutes, such as New York Penal Law § 155.25, treat the unauthorized taking of tangible goods, including candy, as “petit larceny”. Retail theft is monitored by surveillance systems, prosecuted by district attorneys, and deterred through fines, community service, and incarceration. In contrast, the digital appropriation of AI-generated works often goes unchecked due to gaps in intellectual property law.
Under the Copyright Act of 1976, copyright protection is limited to “original works of authorship fixed in any tangible medium of expression”. The U.S. Copyright Office has clarified that works created without human authorship are ineligible for copyright, as articulated in its 2023 guidance on AI-generated works. This limitation leaves AI-generated content unprotected, even when it has substantial economic or creative value.
Legal scholars such as Prof. Jane Ginsburg have emphasized that the concept of authorship is central to copyright, and without a human agent, courts are reluctant to extend protection. As a result, the moment an AI-generated poem, image, or article is released, it can be copied and redistributed without fear of legal repercussions. This asymmetry distorts market incentives and disincentivizes creators from investing in AI development.
The Platform: The Storefront That Sells the Candy
The generative AI platform is not unlike the neighborhood candy store: a curated, maintained, and monetized environment designed to offer products to the public. These platforms represent significant capital outlay, often in the tens or hundreds of millions of dollars, for computing resources, engineering teams, interface development, cybersecurity infrastructure, marketing strategies, and compliance.
In the brick-and-mortar world, protections for business operators are well established. Trademark law under the Lanham Act protects the visual identity, slogans, and branding of a store from infringement and dilution. Commercial landlords and retail franchises benefit from well-defined leasing contracts, trade dress protections, and franchise laws.
In the digital realm, however, these safeguards are patchy. While platform names and logos may be trademarked, the user interface design, the recommendation engines, and the model architectures behind generative platforms enjoy minimal protection.
Copyright law explicitly excludes protection for “ideas, procedures, processes, systems, methods of operation, concepts, principles, or discoveries,” which limits the protection of AI system functionalities.
Trade secret law, such as that codified in the Defend Trade Secrets Act of 2016, offers some remedies, but these are difficult to enforce internationally and provide no recourse once a secret is made public or reverse-engineered. As many legal scholars and AI experts have pointed out, trade secrets rely on confidentiality, which clashes with the transparency ethos of open science and responsible AI development.
The Data: The Ingredients That Make the Candy
Just as no candy bar can be made without ingredients, no AI model can exist without data. Training data serves as the sugar, cocoa, and milk of the machine learning pipeline. It is the essential fuel from which patterns are learned, and without it, there is no model.
Candy manufacturers pay for their ingredients and are legally bound by contracts and supply chains governed by the Uniform Commercial Code (UCC). A breach of contract results in clear civil liability; theft or misappropriation can result in criminal charges or trade secret litigation.
In contrast, training datasets are often compiled through web scraping, API extraction, or bulk licensing from undisclosed sources. While some uses may fall under fair use exemptions (as discussed in Authors Guild v. Google, Inc., 804 F.3d 202 (2d Cir. 2015)), this doctrine is narrow and context-dependent. Moreover, many creators object to the use of their content for AI training, especially when it competes with or displaces their own economic activity.
Legal commentators such as Mark Lemley have argued that existing copyright law lacks the doctrinal tools to address nonconsumptive, large-scale data use. The sui generis database protection offered in the EU under the Database Directive (96/9/EC) has no parallel in the United States, creating a regulatory arbitrage problem. Without an American counterpart, curated datasets, especially those requiring human labor to annotate, structure, and clean, remain largely unprotected.
The Data Preparation: The Candy-Making Process
Turning raw sugar and milk into a finished candy bar requires more than a recipe; it requires skilled labor, machinery, sanitation, packaging, branding, and logistics. These processes are protected by a variety of legal and contractual rights in the physical economy.
Similarly, data must be curated, cleaned, labeled, normalized, enriched, and formatted before it becomes usable in training a model. These processes are labor-intensive, often requiring data scientists, annotators, domain experts, and engineers. For example, the ImageNet project involved hundreds of thousands of human hours to label images accurately, work that underpins countless AI applications today.
However, traditional IP frameworks fail to recognize these contributions. As the Supreme Court held in Feist Publications, Inc. v. Rural Telephone Service Co., 499 U.S. 340 (1991), mere investment in data collection and compilation does not meet the originality requirement of copyright. While this ruling reinforced the idea-expression dichotomy, it left producers of non-expressive data vulnerable.
Scholars such as Jerome Reichman and Paul Uhlir have proposed sui generis protection for data compilation based on investment and utility rather than creativity. Such proposals have gained little traction in the U.S., but the need is growing more acute as AI models increasingly depend on curated, domain-specific datasets.
Proposing New Rights: The Case for Data Rights and Software Rights
To resolve these challenges, the U.S. should introduce two new IP categories: Data Rights and Software Rights.
Data Rights would protect original, curated datasets, especially those demonstrating selection, coordination, or investment. These rights would:
i. Last for 15 years from first commercial use.
ii. Prohibit unauthorized copying, re-use, or distribution.
iii. Allow for reasonable licensing and fair royalties.
iv. Be subject to clear, simple criteria for enforcement (e.g., registration and publication).
This concept mirrors the EU Database Directive, but would be tailored to the American legal landscape. Instead of requiring creativity, it would recognize investment, labor, and organization. This would ensure that data compilers, especially in sectors like healthcare, scientific research, and education, receive returns on their efforts.
Software Rights would protect machine learning systems and underlying algorithms, in a manner similar to design patents. These rights would include:
i. Coverage of the structure, training architecture, and tuning of AI models.
ii. A 15-year term, renewable once for systems under active use.
iii. Enforcement mechanisms modeled after copyright and trade secret law.
Software Rights would be narrower than full patent protection but broader than copyright in covering functional elements of AI systems. They could adopt principles from the Semiconductor Chip Protection Act of 1984, which recognizes the industrial design of microchips without demanding full patent standards. By protecting the model architecture as an engineered system, these rights would balance innovation incentives with competition and interoperability.
These rights would ensure that developers and data curators are fairly compensated while still allowing access under fair terms. They would provide legal certainty, promote responsible investment, and reward those who contribute to the AI ecosystem without stifling downstream innovation.
Acknowledgement
None
Conflict of Interest
No conflict of interest.
Citation: Charles R. Macedo*. Protecting the Candy: A Case for Data and Software Rights in the Age of Machine Learning. Iris On J of Arts & Soc Sci. 2(4): 2025. IOJASS.MS.ID.000542.
Keywords: Intellectual property, Data Rights, Software Rights, Post-ChatGPT, AI economy
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.