Lumen Tools for Researchers
An Introduction to the Fake DMCA Database
FAQ Contents:
- What is Fake DMCA Investigation?
- What is in the database?
- What is missing?
- Who is Lumen for?
- Viewing a Notice
- How does it work?
- API Terms of Use
Lumen is an independent research project studying cease and desist letters concerning online content. We collect and analyze requests to remove material from the web. Our goals are to educate the public, to facilitate research about the different kinds of complaints and requests for removal–both legitimate and questionable–that are being sent to Internet publishers and service providers, and to provide as much transparency as possible about the “ecology” of such notices, in terms of who is sending them and why, and to what effect.
Our database contains millions of notices, some of them with valid legal basis, some of them without, and some difficult to determine. The fact that Lumen has a notice in its database does not mean that Lumen is authenticating the provenance of that notice or making any judgment on the validity of the claims it raises.
Conceived, developed, and founded in 2002 by then-Berkman Klein Center Fellow Wendy Seltzer, the project, then called “Chilling Effects”, was initially focused on requests submitted under the United States’ Digital Millennium Copyright Act. As the Internet and its usage have evolved, so has Lumen, and the database now includes complaints of all varieties, including trademark, defamation, and privacy, domestic and international, and court orders. The Lumen database grows by more than 40,000 notices per week, with voluntary submissions provided by companies such as Google, Twitter, YouTube, Wikipedia, Meta, Counterfeit Technology, Medium, Stack Exchange, Vimeo, DuckDuckGo, aspects of the University of California system, and WordPress. As of the summer of 2019, the project hosts approximately twelve million notices, referencing close to four billion URLs. In 2018, the project website was visited over ten million times by users from virtually every country in the world.
What content can I find in the database, and where does it come from?
See here for comprehensive and up-to-date details of who sends Lumen notices, what notices they send, and what details are part of each notice. Aggregating all of these different requests to remove material facilitates the research, study and mapping of the Internet’s removal request landscape by journalists, NGOs, policy-makers, and academics.
A notice or work contains “[redacted]” – what is missing?
Lumen makes a good faith effort to redact all personally identifying information (“PII”) contained within notices, other than the name of the sender or rightsholder and the country of origin of the notice. Our automatic redaction processes seek to identify and remove the following:
- Email addresses
- Phone numbers
- Other forms of ID number (e.g. Social Security #s, national ID #s)
Further, an individual or company submitting a notice directly to the Lumen database may have decided not to share with Lumen, or to keep private, certain pieces of information in the notice.
Please note that for DMCA notices, Lumen does not typically redact the name of the rightsholder making the request or the URL(s) of the material complained of. Without the location of the complained-of material and the complainant, the notices are meaningless from a transparency or research perspective, to say nothing of offering no insight as to possible misuse of takedown notices as a vehicle for censorship.
When a company shares copies of court orders it has received with us, Lumen typically displays those orders in the form in which they have been shared with Lumen and further, makes a good faith effort to do so in accordance with the applicable law of the jurisdiction from which the order emerged. United States court orders, unless sealed, are public documents.
See here for more details on what gets redacted from what notices.
Who is Lumen for?
Lumen is designed both for casual use by lay Internet users curious about a notice they may have encountered, perhaps in the news or because of personal interest (see below for more details about viewing notices), and for more powerful and potentially expansive use by journalists, NGOs, policy-makers, academics, and other legal researchers conducting more in-depth and focused research or studying larger trends about content removal online.
Lumen is not intended to be, is not set up to be, and should not be used as, part of the work-flow of any particular business model. Companies interested in takedown notices regarding them or their clients that have been sent to other platforms would be best served contacting those platforms directly for more information. If you or your organization are interested in conducting journalistic, academic, legal, or policy-focused research on Lumen data (research you will then publish), or have further ideas about how we might improve the database and its interfaces, email us at [email protected].
At this time, Lumen generally issues researcher credentials only to people or non-profit organizations planning journalistic, academic, or legislative policy-focused public written research outputs.
Viewing a Notice
For non-researchers, Lumen currently offers access to one full notice per email address every twenty-four (24) hours. Submitting an email address through the request form will provide a one-time-use URL for that particular notice that will display the full contents of the notice. Access through this URL will last for 24 hours. See here for more details.
How does it work?
Most users will find that the web interface will suffice for browsing and discovery within the database. However, for those that need to access larger swaths of data for their research, or for those interested in submitting copies of takedown notices to Lumen, we offer our API. Read on for further information.
BASIC FACTS ABOUT THE API AND DATABASE
Contents
- API Documentation
- Formatting
- Understanding dates – Unix Timestamps
- Searching the database
The documentation for the Lumen API can be found here.
Formatting
When a query or request is submitted to the database, the system will return a response with a list of JSON-encoded attributes. Learn more about JSON (JavaScript Object Notation) here. This format is designed to be “machine readable,” and not necessarily useful to a human reader in its raw form. However, there are many tools for rendering JSON output into a friendlier form, and we recommend finding one that works for you.
Example JSON Request:
<code> curl https://research.fakedmca.com/dmca/1.json </code>
Example Successful JSON Output:
<code>
{
  "dmca": {
    "id": 1,
    "title": "Lion King on YouTube",
    "body": null,
    "date_sent": "2013-06-04T19:23:12Z",
    "date_received": "2013-06-05T20:31:44Z",
    "topics": [ "Anticircumvention (DMCA)", "Bookmarks", "Lumen" ],
    "tags": [ "tag_1", "tag_2" ],
    "jurisdictions": [ "US", "CA" ],
    "action_taken": "Partial",
    "sender_name": "Joe Lawyer",
    "recipient_name": "Google, Inc.",
    "works": [
      {
        "description": "Lion King Video",
        "copyrighted_urls": [
          { "url": "https://www.example.com/lion_king.mp4" },
          { "url": "https://www.example.com/lion_king.mov" }
        ],
        "infringing_urls": [
          { "url": "https://www.example.com/infringing1" },
          { "url": "https://www.example.com/infringing2" },
          { "url": "https://www.example.com/infringing3" }
        ]
      }
    ]
  }
}
</code>
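Once a response like the one above has been retrieved, it can be read with any JSON library. The following is a minimal Python sketch, assuming the response has the shape shown in the example output; the `summarize_notice` helper and the trimmed `SAMPLE_RESPONSE` string are illustrative and not part of the Lumen API.

```python
import json

# Trimmed copy of the example output above, used here as stand-in data.
SAMPLE_RESPONSE = """
{"dmca": {"id": 1, "title": "Lion King on YouTube",
  "sender_name": "Joe Lawyer", "recipient_name": "Google, Inc.",
  "works": [{"description": "Lion King Video",
    "copyrighted_urls": [{"url": "https://www.example.com/lion_king.mp4"}],
    "infringing_urls": [{"url": "https://www.example.com/infringing1"},
                        {"url": "https://www.example.com/infringing2"}]}]}}
"""


def summarize_notice(raw_json):
    """Pull a few fields and every infringing URL out of one notice."""
    notice = json.loads(raw_json)["dmca"]
    # A notice can list several works, each with its own URL lists.
    infringing = [
        entry["url"]
        for work in notice.get("works", [])
        for entry in work.get("infringing_urls", [])
    ]
    return {
        "id": notice["id"],
        "title": notice["title"],
        "sender": notice.get("sender_name"),
        "recipient": notice.get("recipient_name"),
        "infringing_urls": infringing,
    }


summary = summarize_notice(SAMPLE_RESPONSE)
print(summary["title"], "from", summary["sender"], "to", summary["recipient"])
for url in summary["infringing_urls"]:
    print(" ", url)
```

Fields that may be absent or redacted are read with `.get()` so that a missing key yields `None` instead of raising an error.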
Understanding dates – Unix Timestamps
The Lumen database accepts dates in a variety of formats but always outputs dates in Unix Time, which is the number of seconds elapsed since the beginning of the Unix epoch. This can be quite confusing at first, and we recommend using a Unix Timestamp conversion tool (like this one here) to transform these raw date outputs into something a human can understand.
Searching the Database
Most users will find that the web interface will suffice for browsing and discovery within the database. However, for those that need to access larger swaths of data or create automated processes to digest data trends, we offer our new API.
Searching the database, whether through the web interface or with the API, is done via full-text search. The default is to search all possible notice fields and facets. Searches can also be refined based on specific slices of the database or on specific facets of the data. See the documentation for the applicable notice parameters and metadata.
QUERYING THE DATABASE WITH THE API
Contents
- Getting an API Key
- Basic search from the command line
- Requesting a list of topics
- Searching notices
- Downloading results in bulk
Getting an API Key
An authentication key is needed in order to query the database at will via the API. Researchers interested in using Lumen’s API should contact the Lumen staff at [email protected] to be provided with one. API queries to the database submitted without a token will return an error.
At this time we generally issue Lumen researcher credentials only to people or non-profit organizations planning journalistic, academic, or legislative policy-focused public written research outputs.
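The text above does not say how the key is attached to a request. As a minimal sketch, the Python below assumes the key travels in a request header named `X-Authentication-Token`; that header name, the `build_request` helper, and the `YOUR_API_KEY` placeholder are assumptions to be checked against the API documentation.

```python
from urllib.request import Request

# Assumed header name; confirm it in the API documentation before use.
TOKEN_HEADER = "X-Authentication-Token"


def build_request(url, api_key):
    """Prepare (but do not send) a GET request that carries the API key."""
    return Request(
        url,
        headers={
            "Accept": "application/json",
            "Content-type": "application/json",
            TOKEN_HEADER: api_key,
        },
        method="GET",
    )


request = build_request(
    "https://www.research.fakedmca.com/dmca/search?term=star",
    "YOUR_API_KEY",  # placeholder, not a real key
)
print(request.get_method(), request.full_url)
```

Passing the prepared object to `urllib.request.urlopen(request)` would send it. Building and sending are kept separate so the key handling can be inspected without touching the network.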
Basic search from the command line
To query the database, use your preferred tools for HTTP “get” requests. There are a number of options available, so pick one depending on your research needs.
Examples include:
- curl – a command-line program included with macOS, most Linux and BSD systems, and Windows 10 and later. On older versions of Windows, a separate environment such as Cygwin is needed to run curl commands.
- wget – dumps the results of the “get” request to a file.
Example search query for Batman, where <parameter> is the database field or facet that is the object of the search.
<code> curl -H "Accept: application/json" -H "Content-type: application/json" 'https://www.research.fakedmca.com/dmca/search?<b><parameter>=batman</b>' </code>
Here’s a search query for star, where term is the parameter.
<code> curl -H "Accept: application/json" -H "Content-type: application/json" 'https://www.research.fakedmca.com/dmca/search?<b>term=star</b>' </code>
Searches can also combine multiple parameters when linked with an ampersand. Below, the query combines a search for star where term is the parameter, where batman is the sender_name, and date_received falls between RANGE1..RANGE2.
<code> curl -H "Accept: application/json" -H "Content-type: application/json" 'https://www.research.fakedmca.com/dmca/search?<b>term=star&sender_name=batman&date_received_facet=RANGE1..RANGE2</b>' </code>
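The same kind of multi-parameter query can be assembled in code, which avoids hand-escaping values and turns calendar dates into the Unix timestamps the database works with. The sketch below rests on two assumptions that should be confirmed in the API documentation: that the date parameter is named `date_received_facet`, and that the range endpoints are Unix timestamps in seconds. The helper names are illustrative.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

SEARCH_URL = "https://www.research.fakedmca.com/dmca/search"


def to_unix(moment):
    """Convert a timezone-aware datetime to whole seconds since the epoch."""
    return int(moment.timestamp())


def from_unix(seconds):
    """Convert seconds since the epoch back to a readable UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


def build_search_url(term, sender_name=None, received_between=None,
                     date_param="date_received_facet"):
    """Join search parameters with ampersands, as in the curl example.

    date_param and the seconds-based range are assumptions, not confirmed
    by the text above.
    """
    params = {"term": term}
    if sender_name is not None:
        params["sender_name"] = sender_name
    if received_between is not None:
        start, end = received_between
        params[date_param] = f"{to_unix(start)}..{to_unix(end)}"
    return f"{SEARCH_URL}?{urlencode(params)}"


url = build_search_url(
    "star",
    sender_name="batman",
    received_between=(
        datetime(2019, 1, 1, tzinfo=timezone.utc),
        datetime(2019, 2, 1, tzinfo=timezone.utc),
    ),
)
print(url)
print(from_unix(0).isoformat())
```

Timezone-aware datetimes are used on purpose: a naive `datetime` would be interpreted in the local timezone and could shift the range by several hours.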
Running search queries through the API lets you restrict a search to a period of time and download the results for use and reuse in applications. A complete list of searchable parameters can be found here.
Requesting a List of Topics
The database classifies notices into one or more topics, more of which may be added over time. Certain topics are categorized as subtopics of a larger, comprehensive root topic. For example, topics like “DMCA,” “fair use,” and “anti-circumvention” all fall under “Copyright.” Each topic has a unique numerical ID in the database. To request a list of topics, use the following command.
<code> curl https://www.research.fakedmca.com/topics.json </code>
This command will return results with three pieces of information: 1) the topic’s unique ID number, 2) the name of the topic, and 3) either the ID number of the parent topic or null if the topic is a root topic.
Field | Type | Description |
id | integer | The unique ID used for the topic_ids array during notice creation |
name | string | The topic name |
parent_id | integer | The parent topic_id of this topic, or “null” if this is a root topic. |
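Because each topic carries its parent's ID, the flat list can be regrouped into root topics and their subtopics. The Python sketch below assumes the topics have already been decoded into a list of dictionaries with the three fields described above. The IDs in `SAMPLE_TOPICS` are invented for illustration, and only one level of nesting is handled.

```python
# Invented sample data; real IDs come from the topics request.
# JSON null becomes None once the response has been decoded.
SAMPLE_TOPICS = [
    {"id": 1, "name": "Copyright", "parent_id": None},
    {"id": 2, "name": "DMCA", "parent_id": 1},
    {"id": 3, "name": "Fair Use", "parent_id": 1},
    {"id": 4, "name": "Trademark", "parent_id": None},
]


def subtopics_by_root(topics):
    """Map each root topic name to the names of its direct subtopics."""
    names = {topic["id"]: topic["name"] for topic in topics}
    # Root topics are the ones without a parent.
    grouped = {t["name"]: [] for t in topics if t["parent_id"] is None}
    for topic in topics:
        parent_name = names.get(topic["parent_id"])
        # Skip roots and any topic whose parent is not itself a root.
        if parent_name in grouped:
            grouped[parent_name].append(topic["name"])
    return grouped


for root, children in subtopics_by_root(SAMPLE_TOPICS).items():
    print(root, "->", ", ".join(children) or "(no subtopics)")
```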
On the web interface, search results are paginated above a certain number of hits. By default, results are sorted by descending relevance. Full-text search results contain the same data as individually requested notices, with the addition of a score field that indicates the result's relevance to the query term; higher numbers are more relevant. Terms are joined with an ‘OR’ by default.