All APIs besides the authentication endpoints are served from
https://api.www.documentcloud.org/api.
If you develop in Python, check out python-documentcloud which is our Python wrapper for the DocumentCloud API and its corresponding documentation.
The API end points are generally organized as /api/<resource>/
representing
the entirety of the resource, and /api/<resource>/<id>/
representing a single
resource identified by its ID. All REST actions are not available on every
endpoint, and some resources may have additional endpoints, but the following
are how HTTP verbs generally map to REST operations:
/api/<resource>/
HTTP Verb | REST Operation | Parameters |
---|---|---|
GET | List the resources | May support parameters for filtering |
POST | Create a new resource | Must supply all required fields, and may supply all non-read only fields |
/api/<resource>/<id>/
HTTP Verb | REST Operation | Parameters |
---|---|---|
GET | Display the resource | |
PUT | Update the resource | Same as for creating - all required fields must be present. For updating resources PATCH is usually preferred, as it allows you to only update the fields needed. PUT support is included for completeness |
PATCH | Partially update the resource | Same as for creating, but all fields are optional |
DELETE | Destroy the resources |
A select few of the resources support some bulk operations on the /api/<resource>/
route:
HTTP Verb | REST Operation | Parameters |
---|---|---|
POST | Bulk create | A list of objects, where each object is what you would POST for a single object |
PUT | Bulk update | A list of objects, where each object is what you would PUT for a single object — except it must also include the ID |
PATCH | Bulk partial update | A list of objects, where each object is what you would PATCH for a single object — except it must also include the ID |
DELETE | Bulk destroy | Bulk destroys will have a filtering parameter, often required, to specify which resources to delete |
Lists response will be of the form
{
"next": <next url if applicable>,
"previous": <previous url if applicable>,
"results": <list of results>
}
with a 200 status code. The document search route will also include a count
key, with a total count of all documents returned by the search.
Getting a single resource, creating and updating will return just the object. Create uses a 201 status code and get and update will return 200.
Delete will have an empty response with a 204 status code.
Batch updates will contain a list of objects updated with a 200 status code.
Specifying invalid parameters will generally return a 400 error code with a
JSON object with a single "error"
key, whose value will be an error message.
Specifying an ID that does not exist or that you do not have access to view
will return status 404. Trying to create or update a resource you do not have
permission to will return status 403.
All list views accept a per_page
parameter, which specifies how many
resources to list per page. It is 25
by default and may be set up to 100
for authenticated users. For anonymous users it is restricted to 25
. You
may register for a free account at https://accounts.muckrock.com/ to use the
100
limit. You may view subsequent pages by using the next
URL.
Page offset pagination does not scale well to a large number of pages. For
improved performance, DocumentCloud uses a cursor based
pagination system. Instead of a page
parameter, there is a cursor
parameter, which accepts an opaque cursor
which specifies the last value
seen. To use this system, you must use the next
and previous
links as
returned by the API, as random access is not available. This system also
restricts arbitrary ordering of the results, except for the document search
route, which will still allow re-ordering with cursor based pagination.
If the cursor based pagination breaks your workflow, you may continue to use
the old page-offset based pagination system for now. In the future, this will
be disabled completely, and you will be forced to use the cursor based
pagination. To use the page-offset based pagination, which also has a top
level count
key with a total count of the objects returned for all list
queries, add a version=1.0
query parameter to your API queries. Be aware
that this will make your queries less performant, possibly to the point of them
being unusable. This should only be used as a stop-gap solution while you
update your workflow to use the new cursor based pagination. Please reach out
to info@documentcloud.org if you need
assistance moving to the new version.
Some resources also support sub resources, which is a resource that belongs to another. The general format is:
/api/<resource>/<id>/<subresource>/
or
/api/<resource>/<id>/<subresource>/<subresource_id>/
It generally works the same as a resource, except scoped to the parent resource.
TODO: Examples
Filters on list views which have choices generally allow you to specify
multiple values, and will filter on all resources that match at least one
choices. To specify multiple parameters you may either supply a comma
separated list of IDs — ?parameter=1,2
— or by specify the
parameter multiple times — ?parameter=1¶meter=2
.
The DocumentCloud API is rate limited to 10 requests per second. It also allows bursts up to 20 requests. This means if you exceed the the 10 request per second limit, it will serve you up to 20 requests more quickly, while keeping track of your average rate. After the 20 requests are served, additional requests will be rejected with an HTTP status of 503 until you again fall under an average of 10 requests per second. If you use the Python DocumentCloud library, it will automatically throttle your requests to 10 per second to avoid going over the rate limit. If you are writing custom code, please be mindful of the rate limits.
There is also a secondary limit of 500 requests per day for anonymous users. If you exceed this limit, you will start receiving errors with an HTTP status of 429. In order to avoid this, please register for a free account at https://aacounts.muckrock.com/. Currently, there are no daily limits of registered accounts, although this may change in the future.
Authentication happens at the MuckRock accounts server located at
https://accounts.muckrock.com/. The API provided there will supply you with
a JWT access token and refresh token in exchange for your username and
password. The access token should be placed in the Authorization
header
preceded by Bearer
- {'Authorization': 'Bearer <access token>'}
. The
access token is valid for 5 minutes, after which you will receive a 403
forbidden error if you continue trying to use it. At this point you may use
the refresh token to obtain a new access token and refresh token. The refresh
token is valid for one day.
Param | Type | Description |
---|---|---|
username | String | Your username |
password | String | Your password |
{'access': <access token>, 'refresh': <refresh token>}
Param | Type | Description |
---|---|---|
refresh | String | Refresh token |
{'access': <access token>, 'refresh': <refresh token>}
The documents API allows you to upload, browse and edit documents. To add or remove documents from a project, please see project documents.
Field | Type | Options | Description |
---|---|---|---|
ID | Integer | Read Only | The ID for the document |
access | String | Default: private |
The access level for the document |
asset_url | String | Read Only | The base URL to load this document's static assets from |
canonical_url | URL | Read Only | The canonical URL to view this document |
created_at | Date Time | Read Only | Time stamp when this document was created |
data | JSON | Not Required | Custom metadata |
description | String | Not Required | A brief description of the document |
edit_access | Bool | Read Only | Does the current user have edit access to this document |
file_url | URL | Create Only | A URL to a publicly accessible document for the URL Upload Flow |
file_hash | String | Read Only | A sha1 hash representation of the raw PDF data as a hexadecimal string. |
force_ocr | Bool | Create Only | Force OCR even if the PDF contains embedded text - only include if file_url is set, otherwise should set force_ocr on the call to the processing endpoint. This operation clears underlying metadata about the document like authorship, creation date, etc. If this is a concern, make sure to keep a copy of the original document. |
language | String | Default: eng |
The language the document is in |
ocr_engine | string | Not required | Specifies which OCR engine to use on documents. Use with force_ocr set to True. Accepted values: tess4 for tesseract and textract for Amazon textract (which requires AI credits). |
noindex | Bool | Not required | Ask search engines and DocumentCloud search to not index this document |
organization | Integer | Read Only | The ID for the organization this document belongs to |
original_extension | String | Default: pdf |
The original file extension of the document you are seeking to upload. It must be a supported file type |
page_count | Integer | Read Only | The number of pages in this document |
page_spec | Integer | Read Only | The dimensions for all pages in the document |
pages | JSON | Write Only | Allows you to set page text via the API. See set page text for more information. |
presigned_url | URL | Read Only | The pre-signed URL to directly PUT the PDF file to |
projects | List:Integer | Create Only | The IDs of the projects this document belongs to - this may be set on creation, but may not be updated. See project documents |
publish_at | Date Time | Not Required | A timestamp when to automatically make this document public |
published_url | URL | Not Required | The URL where this document is embedded |
related_article | URL | Not Required | The URL for the article about this document |
remaining | JSON | Read Only | The number of pages left for text and image processing - only included if remaining is included as a GET parameter |
slug | String | Read Only | The slug is a URL safe version of the title |
source | String | Not Required | The source who produced the document |
status | String | Read Only | The status for the document |
title | String | Required | The document's title |
updated_at | Date Time | Read Only | Time stamp when the document was last updated |
user | Integer | Read Only | The ID for the user this document belongs to |
Expandable fields: user, organization, projects, sections, notes
There are two supported ways to upload documents — directly uploading the file to our storage servers or by providing a URL to a publicly available PDF or other supported file type. To upload another supported file type you will need to include the original_extension field documented above.
POST /api/documents/
To initiate an upload, you will first create the document. You may specify all
writable document fields (besides file_url
). The response will contain all
the fields for the document, with two being of note for this flow:
presigned_url
and id
.
If you would like to upload files in bulk, you may POST
a list of JSON
objects to /api/documents/
instead of a single object. The response will
contain a list of document objects.
PUT <presigned_url>
Next, you will PUT
the binary data for the file to the given
presigned_url
. The presigned URL is valid for 5 minutes. You may obtain a
new URL by issuing a GET
request to /api/documents/\<id\>/
.
If you are bulk uploading, you will still need to issue a single PUT
to the
corresponding presigned_url
for each file.
POST /api/documents/<id>/process/
Finally, you will begin processing of the document. Note that this endpoint
accepts only one optional parameter — force_ocr
which, if set to true,
will OCR the document even if it contains embedded text.
If you are uploading in bulk you can issue a single POST
to
/api/document/process/
which will begin processing in bulk. You should pass
a list of objects containing the document IDs of the documents you would like
to being processing. You may optionally specify force_ocr
for each document.
POST /api/documents/
If you set file_url
to a URL pointing to a publicly accessible PDF, our
servers will fetch the PDF and begin processing it automatically.
You may also send a list of document objects with file_url
set to bulk upload
files using this flow.
GET /api/documents/
— List documentsPOST /api/documents/
— Create documentPUT /api/documents/
— Bulk update documentsPATCH /api/documents/
— Bulk partial update documentsDELETE /api/documents/
— Bulk delete documentsid__in
filter.POST /api/documents/process/
— Bulk process documents[{"id": 1, "force_ocr": true}, {"id": 2}]
It expects a list of objects, where each object contains the ID of the
document to process, and an optional boolean, force_ocr
, which will OCR
the document even if it contains embedded text if set to true
GET /api/documents/search/
— Search documentsGET /api/documents/<id>/
— Get documentPUT /api/documents/<id>/
— Update documentPATCH /api/documents/<id>/
— Partial update documentDELETE /api/documents/<id>/
— Delete documentPOST /api/documents/<id>/process/
— Process documentforce_ocr
, which will OCR the document
even if it contains embedded text if it is set to true
. Note that it
is an error to try to process a document that is already processing.DELETE /api/documents/<id>/process/
— Cancel processing documentGET /api/documents/<id>/search/
— Search within a documentordering
— Sort the results — valid options include: created_at
,
page_count
, title
, and source
. You may prefix any valid option with
-
to sort it in reverse order.user
— Filter by the ID of the owner of the document.organization
— Filter by the ID of the organization of the document.project
— Filter by the ID of a project the document is in.access
— Filter by the access level.status
— Filter by status.created_at__lt
, created_at__gt
— Filter by documents created
either before or after a given date. You may specify both to find documents
created between two dates. This may be a date or date time, in the following
formats: YYYY-MM-DD
or YYYY-MM-DD+HH:MM:SS
.page_count
, page_count__lt
, page_count__gt
— Filter by documents
with a specified number of pages, or more or less pages then a given amount.id__in
— Filter by specific document IDs, passed in as comma
separated values.Notes can be left on documents for yourself, or to be shared with other users. They may contain HTML for formatting.
Field | Type | Options | Description |
---|---|---|---|
ID | Integer | Read Only | The ID for the note |
access | String | Default: private |
The access level for the note |
content | String | Not Required | Content for the note, which may include HTML |
created_at | Date Time | Read Only | Time stamp when this note was created |
edit_access | Bool | Read Only | Does the current user have edit access to this note |
organization | Integer | Read Only | The ID for the organization this note belongs to |
page_number | Integer | Required | The page of the document this note appears on |
title | String | Required | Title for the note |
updated_at | Date Time | Read Only | Time stamp when this note was last updated |
user | ID | Read Only | The ID for the user this note belongs to |
x1 | Float | Not Required | Left most coordinate of the note, as a percentage of page size |
x2 | Float | Not Required | Right most coordinate of the note, as a percentage of page size |
y1 | Float | Not Required | Top most coordinate of the note, as a percentage of page size |
y2 | Float | Not Required | Bottom most coordinate of the note, as a percentage of page size |
Expandable fields: user, organization
The coordinates must either all be present or absent — absent represents a page level note which is displayed between pages.
GET /api/documents/<document_id>/notes/
- List notesPOST /api/documents/<document_id>/notes/
- Create noteGET /api/documents/<document_id>/notes/<id>/
- Get notePUT /api/documents/<document_id>/notes/<id>/
- Update notePATCH /api/documents/<document_id>/notes/<id>/
- Partial update noteDELETE /api/documents/<document_id>/notes/<id>/
- Delete noteSections can mark certain pages of your document — the viewer will show an outline of the sections allowing for quick access to those pages.
Field | Type | Options | Description |
---|---|---|---|
ID | Integer | Read Only | The ID for the section |
page_number | Integer | Required | The page of the document this section appears on |
title | String | Required | Title for the section |
GET /api/documents/<document_id>/sections/
- List sectionsPOST /api/documents/<document_id>/sections/
- Create sectionGET /api/documents/<document_id>/sections/<id>/
- Get sectionPUT /api/documents/<document_id>/sections/<id>/
- Update sectionPATCH /api/documents/<document_id>/sections/<id>/
- Partial update sectionDELETE /api/documents/<document_id>/sections/<id>/
- Delete sectionSometimes errors happen — if you find one of your documents in an error state, you may check the errors here to see a log of the latest, as well as all previous errors. If the message is cryptic, please contact us — we are happy to help figure out what went wrong.
Field | Type | Options | Description |
---|---|---|---|
ID | Integer | Read Only | The ID for the error |
created_at | Date Time | Read Only | Time stamp when this error was created |
message | String | Required | The error message |
GET /api/documents/<document_id>/errors/
- List errorsDocuments may contain user supplied metadata. You may assign multiple values
to arbitrary keys. This is represented as a JSON object, where each key has a
list of strings as a value. The special key _tag
is used by the front end to
represent tags. These values are useful for searching and organizing documents.
You may directly set or update the data from the document endpoints, but these
additional endpoints are supplied to add or remove data on a per key basis.
Field | Type | Options | Description |
---|---|---|---|
values | List:String | Required | The values associated with a key |
remove | List:String | Not Required | Values to be removed |
remove
is only used for PATCH
ing. values
is not required when PATCH
ing.
GET /api/documents/<document_id>/data/
- List values for all keys{
"_tag": ["important"],
"location": ["boston", "new york"]
}
GET /api/documents/<document_id>/data/<key>/
- Get values for the given key["one", "two"]
PUT /api/documents/<document_id>/data/<key>/
- Set values for the given keyPATCH /api/documents/<document_id>/data/<key>/
- Add and/or remove values for the given keyDELETE /api/documents/<document_id>/data/<key>/
- Delete all values for a given keyRedactions allow you to obscure parts of the document which are confidential before publishing them. The pages which are redacted will be fully flattened and reprocessed, so that the original content is not present in lower levels of the image or as text data. Redactions are not reversible, and may only be created, not retrieved or edited. Redacting a document strips available metadata from a document about authorship, creation date, etc. If this is a concern to you, you may want to hold onto an original copy before redaction, as it is irreversible.
Field | Type | Options | Description |
---|---|---|---|
page_number | Integer | Required | The page of the document this redaction appears on |
x1 | Float | Required | Left most coordinate of the redaction, as a percentage of page size |
x2 | Float | Required | Right most coordinate of the redaction, as a percentage of page size |
y1 | Float | Required | Top most coordinate of the redaction, as a percentage of page size |
y2 | Float | Required | Bottom most coordinate of the redaction, as a percentage of page size |
POST /api/documents/<document_id>/redactions/
- Create redactionModifications allow you to perform page modification operations on a document, including moving pages, rotating pages, copying pages, deleting pages, and inserting pages from other documents. Applying modifications effectively shuffles, removes, and copies pages, preserving and duplicating page information as needed (this includes page text and any annotations and sections attached to the page). No page text needs to be reprocessed or re-OCR'd. After successfully applying modifications, the document cannot be reverted.
To support a flexible host of potential modifications, you must pass in the modifications as a JSON array that lists the operations to take place. The modification specification defines the pages that should compose the document post-modification and any operations such as rotation to apply to the pages. Each element of the modification array can have the following fields (instructive examples will be listed after the official specification):
Field | Description |
---|---|
page | A comma-separated string of page ranges, which can include individual pages or hyphenated inclusive runs of pages. Page numbers are 0-based (the first page of the document is page 0 , and 0-9 refers to the first through the 10th page of the document). Valid examples of page ranges include "7" , "0-499" , "0-5,8,11-13" , and 0,0,0 (page numbers can be repeated to duplicate them). |
id | If unspecified, pull pages from the current document. Otherwise, pull pages from the document with the specified id. |
modifications | An array of JSON objects defining modifications to take place. The only currently defined page modification operation is rotate , which rotates pages clockwise, counterclockwise, or halfway. Rotation is specified as {"type": "rotate", "angle": <angle>} , where <angle> is one of cc , ccw , or hw (corresponding to clockwise, counterclockwise, and halfway, respectively). |
The following examples assume you are modifying the Mueller Report, a 448-page document.
Example | Description |
---|---|
[{ |
Leave the Mueller Report unchanged |
[{ |
Remove the middle 400 pages of the Mueller Report |
[{ |
Duplicate the first 50 pages of the Mueller Report at the end of the document |
[{ |
Rotate all the pages of the Mueller Report counter-clockwise |
[ |
Rotate just the first 50 pages of the Mueller Report 180 degrees |
[ |
Import 50 pages of another document with id 2000000 at the end of the Mueller report |
POST /api/documents/<document_id>/modifications/
- Create modificationsEntities can be extracted using Google Cloud's Natural Language API. Entity extraction must be initalized manually per document and entities are read-only.
Top level fields
Field | Type | Description |
---|---|---|
entity | Object | Object containing information about this particular entity |
relevance | Float | An estimate as to how relevant this entity is to this document |
occurences | List | A list of occurence objects specifying where in the document this entity was found |
Fields for the entity object
Field | Type | Description |
---|---|---|
name | String | The name of the entity |
kind | String | The kind of entity |
description | String | A short description of the entity |
mid | String | The Knowledge Graph ID |
wikipedia_url | URL | The Wikipedia URL for this entity |
metadata | Object | Additional metadata for the entity, based on its kind |
Fields for the occurence objects
Field | Type | Description |
---|---|---|
page | Integer | The page of the document this occurs on |
offset | Integer | The character offset into the document this occurs on |
content | String | The content of this occurence (the occurence may not match the entity name) |
page_offset | Integer | The character offset into the page this occurs on |
kind | String | proper for proper nouns, common for common nouns or unknown |
Entity kinds include
unknown
person
location
organization
event
work_of_art
consumer_good
other
phone_number
— metadata may include number, national_prefix, area_code and extensionaddress
— metadata may include street_number, locality, street_name, postal_code, country, broad_region, narrow_region, and sublocalitydate
— metadata may include year, month and dayprice
— metadata may include value and currencyGET /api/documents/<document_id>/entities/
- List entities for this documentPOST /api/documents/<document_id>/entities/
- Begin extracting entities for this document (POST body is empty)DELETE /api/documents/<document_id>/entities/
- Delete all entities for this documentkind
— Filter for entities with the given kind (may give multiple, comma seperated)occurences
— Filter for entities with the given occurence kind (proper
or common
)relevance__gt
— Filter for documents with the given relevance or highermid
— Boolean filter for entities which do or do not have a MIDwikipedia_url
— Boolean filter for entities which do or do not have a Wikipedia URLProjects are collections of documents. They can be used for organizing groups of documents, or for collaborating with other users by sharing access to private documents.
Projects may be used for sharing documents. When you add a collaborator to a project, you may select one of three access levels:
view
- This gives the collaborator permission to view your documents that
you have added to the projectedit
- This gives the collaborator permission to view or edit your
documents you have added to the projectadmin
- This gives the collaborator both view and edit permissions, as well
as the ability to add their own documents and invite other collaborators to
the projectAdditionally, you may add public documents to a project, for organizational
purposes. Obviously, no permissions are granted to your or your collaborators
when you add documents you do not own to your project — this is tracked
by the edit_access
field on the project membership.
When you add documents you or your organization do own, it will be added with
edit_access
enabled by default. You may override this using the API if you
would like to add your documents to a project, but not extend permissions to
any of your collaborators. Also note that documents shared with you for
editing via another project may not be added to your own project with
edit_access
enabled. This means the original owner of a document may revoke
any access they have granted to others via projects at any time.
Field | Type | Options | Description |
---|---|---|---|
ID | Integer | Read Only | The ID for the project |
created_at | Date Time | Read Only | Time stamp when this project was created |
description | String | Not Required | A brief description of the project |
edit_access | Bool | Read Only | Does the current user have edit access to this project |
add_remove_access | Bool | Read Only | Does the current user have permission to add and remove documents to this project |
private | Bool | Default: false |
Private projects may only be viewed by their collaborators |
slug | String | Read Only | The slug is a URL safe version of the title |
title | String | Required | Title for the project |
updated_at | Date Time | Read Only | Time stamp when this project was last updated |
user | ID | Read Only | The ID for the user who created this project |
GET /api/projects/
- List projectsPOST /api/projects/
- Create projectGET /api/projects/<id>/
- Get projectPUT /api/projects/<id>/
- Update projectPATCH /api/projects/<id>/
- Partial update projectDELETE /api/projects/<id>/
- Delete projectuser
— Filter by projects where this user is a collaboratordocument
— Filter by projects which contain the given documentprivate
— Filter by private or public projects. Specify either
true
or false
.slug
— Filter by projects with the given slug.title
— Filter by projects with the given title.These endpoints allow you to browse, add and remove documents from a project
Field | Type | Options | Description |
---|---|---|---|
document | Integer | Required | The ID for the document in the project |
edit_access | Bool | Default: true if you have access |
If collaborators of this project should be granted edit access to this document |
Expandable fields: document
GET /api/projects/<project_id>/documents/
- List documents in the projectPOST /api/projects/<project_id>/documents/
- Add a document to the projectPUT /api/projects/<project_id>/documents/
- Bulk update documents in the projectPATCH /api/projects/<project_id>/documents/
- Bulk partial update documents
in the projectDELETE /api/projects/<project_id>/documents/
- Bulk remove documents from
the projectdocument_id__in
filter. This endpoint will allow you to remove all
documents in the project if you call it with no filter specified.GET /api/projects/<project_id>/documents/<document_id>/
- Get a document in the projectPUT /api/projects/<project_id>/documents/<document_id>/
- Update document in the projectPATCH /api/projects/<project_id>/documents/<document_id>/
- Partial update document in the projectDELETE /api/projects/<project_id>/documents/<document_id>/
- Remove document from the projectdocument_id__in
— Filter by specific document IDs, passed in as comma
separated values.Other users who you would like share this project with. See Sharing Documents
Field | Type | Options | Description |
---|---|---|---|
access | String | Default: view |
The access level for this collaborator |
Create Only | Email address of user to add as a collaborator to this project | ||
user | Integer | Read Only | The ID for the user who is collaborating on this project |
Expandable fields: user
GET /api/projects/<project_id>/users/
- List collaborators on the projectPOST /api/projects/<project_id>/users/
- Add a collaborator to the project
— you must know the email address of a user with a DocumentCloud
account in order to add them as a collaborator on your projectGET /api/projects/<project_id>/users/<user_id>/
- Get a collaborator in the projectPUT /api/projects/<project_id>/users/<user_id>/
- Update collaborator in the projectPATCH /api/projects/<project_id>/users/<user_id>/
- Partial update collaborator in the projectDELETE /api/projects/<project_id>/users/<user_id>/
- Remove collaborator from the projectOrganizations represent a group of users. They may share a paid plan and resources with each other. Organizations can be managed and edited from the MuckRock accounts site. You may only view organizations through the DocumentCloud API.
Field | Type | Options | Description |
---|---|---|---|
ID | Integer | Read Only | The ID for the organization |
avatar_url | URL | Read Only | A URL pointing to an avatar for the organization — normally a logo for the company |
individual | Bool | Read Only | Is this organization for the sole use of an individual |
name | String | Read Only | The name of the organization |
slug | String | Read Only | The slug is a URL safe version of the name |
uuid | UUID | Read Only | UUID which links this organization to the corresponding organization on the MuckRock Accounts Site |
GET /api/organizations/
- List organizationsGET /api/organizations/<id>/
- Get an organizationUsers can be managed and edited from the MuckRock accounts site. You may view users and change your own active organization from the DocumentCloud API.
Field | Type | Options | Description |
---|---|---|---|
ID | Integer | Read Only | The ID for the user |
avatar_url | URL | Read Only | A URL pointing to an avatar for the user |
name | String | Read Only | The user's full name |
organization | Integer | Required | The user's active organization |
organizations | List:Integer | Read Only | A list of the IDs of the organizations this user belongs to |
username | String | Read Only | The user's username |
uuid | UUID | Read Only | UUID which links this user to the corresponding user on the MuckRock Accounts Site |
Expandable fields: organization
GET /api/users/
- List usersGET /api/users/<id>/
- Get a userPUT /api/users/<id>/
- Update a userPATCH /api/users/<id>/
- Partial update a userAdd-Ons allow you to easily add custom features to DocumentCloud. Learn more about Add-Ons. Add-Ons are added by installing the GitHub App in the repository you would like to use as an add-on. The API allows you to view, edit and run your add-ons.
Field | Type | Options | Description |
---|---|---|---|
ID | Integer | Read Only | The ID for the add-on |
access | String | Read Only | The access level for the add-on (will be settable in the future) |
active | Bool | Default: false |
Whether this add-on is active for you |
created_at | Date Time | Read Only | Time stamp when this add-on was created |
name | String | Read Only | The name of the add-on (set in the configuration) |
organization | Integer | Not Required | The ID for the organization this add-on belongs to |
parameters | JSON | Read Only | The contents of the config.yaml file from the repository, converted to JSON |
repository | String | Read Only | The full name of the GitHub repository, including the account name |
updated_at | Date Time | Read Only | Time stamp when the add-on was last updated |
user | Integer | Read Only | The ID for the user this add-on belongs to |
Your active add-ons are showed to you in the web interface.
GET /api/addons/
- List add-onsGET /api/addons/<id>/
- Get an add-onPUT /api/addons/<id>/
- Update an add-onPATCH /api/addons/<id>/
- Partial update an add-onactive
— Filter by only your active or inactive add-ons query
— Searches for add-ons which contain the query in their name or descriptionAdd-on runs represent an invocation of an add-on. You create one to run the add-on. The add-on itself can then update the add-on run as a means of supplying feedback to the caller.
Field | Type | Options | Description |
---|---|---|---|
UUID | UUID | Read Only | The ID for the add-on run |
addon | Integer | Required | The ID of the add-on that is being ran |
created_at | Date Time | Read Only | Time stamp when this add-on was created |
dismissed | Bool | Default: false |
Add-on runs are shown to the user until they are dismissed |
file_name | String | Write Only | The add-on must set this to the name of the file supplied to presigned_url after uploading the file to make it accessible to the user |
file_url | URL | Read Only | The URL of a file uploaded via presigned_url |
message | String | Not Required | Add-ons may set infromational messages to the user while running |
parameters | JSON | Write Only | The add-on specific data |
presigned_url | URL | Read Only | Only included if you set the upload_file query parameter to the name of the file to upload. This is a URL the add-on can directly PUT a file to in order to return it to the user |
progress | Integer | Not Required | Long running add-ons may set this as a percentage of their progress |
status | String | Read Only | The status of the run - queued , in_progress , success , or failure |
updated_at | Date Time | Read Only | Time stamp when the add-on was last updated |
user | Integer | Read Only | The ID for the user who ran the add-on |
POST /api/addon_runs/
- Create a new add-on run - this will start the run using GitHub actionsGET /api/addon_runs
- List add-on runsGET /api/addon_runs<uuid>/
- Get an add-on runPUT /api/addon_runs/<uuid>/
- Update an add-on runPATCH /api/addon_runs/<uuid>/
- Partial update an add-on rundismissed
— Filter by dismissed or not dismissed add-on runsGenerate an embed code for a document, page, or annotation using our oEmbed service.
Field | Type | Options | Description |
---|---|---|---|
url | URL | Required | The URL for the document, page or annotation to get an embed code for |
maxwidth | Integer | The maximum width of the embedded resource | |
maxheight | Integer | The maximum height of the embedded resource |
GET /api/oembed/
- Get an embed code for a given URLNote: The hash symbol (#) in the URL will need to be encoded, which converts it to %23.
Generate an embed code for a page:
GET api/oembed?url=https://www.documentcloud.org/documents/23745991-gpo-j6-video-exh-vc9mp4%23document/p1
Generate an embed code for an annotation:
GET /api/oembed?url=https://www.documentcloud.org/documents/23745991-gpo-j6-video-exh-vc9mp4%23document/p1/a2242636
The access level allows you to control who has access to your document by default. You may also explicitly share a document with additional users by collaborating with them on a project.
public
– Anyone on the internet can search for and view the documentprivate
– Only people with explicit permission (via collaboration) have accessorganization
– Only the people in your organization have accessFor notes, the organization
access level will extend access to all users with
edit access to the document — this includes project
collaborators.
The status informs you to the current status of your document.
success
– The document has been succesfully processedreadable
– The document is currently processing, but is readable during the operationpending
– The document is processing and not currently readableerror
– There was an error during processingnofile
– The document was created, but no file was uploaded yetFormat | Extension | Type | Notes |
---|---|---|---|
AbiWord | ABW, ZABW | Document | |
Adobe PageMaker | PMD, PM3, PM4, PM5, PM6, P65 | Document, DTP | |
AppleWorks word processing | CWK | Document | Formerly called ClarisWorks |
Adobe FreeHand | AGD, FHD | Graphics / Vector | |
Apple Keynote | KTH, KEY | Presentation | |
Apple Numbers | Numbers | Spreadsheet | |
Apple Pages | Pages | Document | |
BMP file format | BMP | Graphics / Raster | |
Comma-separated values | CSV, TXT | Text | |
CorelDRAW 6-X7 | CDR, CMX | Graphics / Vector | |
Computer Graphics Metafile | CGM | Graphics | Binary-encoded only; not those using clear-text or character-based encoding |
Data Interchange Format | DIF | Spreadsheet | |
DBase, Clipper, VP-Info, FoxPro | DBF | Database | |
DocBook | XML | XML | |
Encapsulated PostScript | EPS | Graphics | |
Enhanced Metafile | EMF | Graphics / Vector / Text | |
FictionBook | FB2 | eBook | |
Gnumeric | GNM, GNUMERIC | Spreadsheet | |
Graphics Interchange Format | GIF | Graphics / Raster | |
Hangul WP 97 | HWP | Document | Newer "5.x" documents are not supported |
HPGL plotting file | PLT | Graphics | |
HTML | HTML, HTM | Document, text | |
Ichitaro 8/9/10/11 | JTD, JTT | Document | |
JPEG | JPG, JPEG | Graphics | |
Lotus 1-2-3 | WK1, WKS, 123, wk3, wk4 | Spreadsheet | |
Macintosh Picture File | PCT | Graphics | |
MathML | MML | Math | |
Microsoft Excel 2003 XML | XML | Spreadsheet | |
Microsoft Excel 4/5/95 | XLS, XLW, XLT | Spreadsheet | |
Microsoft Excel 97–2003 | XLS, XLW, XLT | Spreadsheet | |
Microsoft Excel 2007-2016 | XLSX | Spreadsheet | |
Microsoft Office 2007-2016 Office Open XML | DOCX, XLSX, PPTX | Multiple formats | |
Microsoft PowerPoint 97–2003 | PPT, PPS, POT | Presentation | |
Microsoft PowerPoint 2007-2016 | PPTX | Presentation | |
Microsoft Publisher | PUB | Document, DTP | |
Microsoft RTF | RTF | Document | |
Microsoft Word 2003 XML (WordprocessingML) | XML | Document | |
Microsoft Word | DOC, DOT, DOCX | Document | |
Microsoft Works | WPS, WKS, WDB | Multiple | Microsoft Works for Mac formats since 4.1 |
Microsoft Write | WRI | Document | |
Microsoft Visio | VSD | Graphics / Vector | |
Netpbm format | PGM, PBM, PPM | Graphics / Raster | |
OpenDocument | ODT, FODT, ODS, FODS, ODP, FODP, ODG, FODG, ODF | Multiple formats | |
Open Office Base | ODB | Database forms, data | |
OpenOffice.org XML | SXW, STW, SXC, STC, SXI, STI, SXD, STD, SXM | Multiple formats | |
PCX | PCX | Graphics | |
Photo CD | PCD | Presentation | |
PhotoShop | PSD | Graphics | |
Plain text | TXT | Text | Various encodings supported |
Portable Document Format | Document | Including hybrid PDF |
The page spec is a compressed string that lists dimensions in pixels for every
page in a document. Refer to ListCrunch for the compression format. For
example, 612.0x792.0:0-447
The static assets for a document are loaded from different URLs depending on
its access level. Append the following to the asset_url
returned to load the static asset:
Asset | Path | Description |
---|---|---|
Document | documents/<id>/<slug>.pdf | The original document |
Full Text | documents/<id>/<slug>.txt | The full text of the document, obtained from the PDF or via OCR |
JSON Text | documents/<id>/<slug>.txt.json | The text of the document, in a custom JSON format (see below) |
Page Text | documents/<id>/pages/<slug>-p<page number>.txt | The text for each page in the document |
Page Positions | documents/<id>/pages/<slug>-p<page number>.position.json | The position of text on each page, in a custom JSON format |
Page Image | documents/<id>/pages/<slug>-p<page number>-<size>.gif | An image of each page in the document, in various sizes |
<size> may be one of large
, normal
, small
, or thumbnail
The TXT JSON file is a single file containing all of the text, but broken out
per page. This is useful if you need the text per page for every page, as you
can download just a single file. There is a top level object with an updated
key, which is a Unix time stamp of when the file was last updated. There may
be an is_import
key, which will be set to true
if this document was
imported from legacy DocumentCloud. The last key is pages
which contains the
per page info. It is a list of objects, one per page. Each page object will
have a page
key, which is a 0-indexed page number. There is a contents
key
which contains the text for the page. There is an ocr
key, which is the
version of OCR software used to obtain the text. Finally there is an updated
key, which is a Unix time stamp of when this page was last updated.
The position JSON file constains position information for each word of text on the page. It is an optional file, which may be generated depending on the type of OCR run on the document. If it exists, it will be a JSON array, which contains a JSON object for each word of text. The object for each word will have the following fields:
text
- The text for the current wordx1
, x2
, y1
, y2
- The coordinates of the bounding box for this word on
the page. Each value will be between 0 and 1 and represents a percentage of
the width or height of the page.The format to set the page text is similar to the text formats described above.
The pages
field may be set to a JSON array of page objects, with the
following fields:
Field | Type | Options | Description |
---|---|---|---|
page_number | Integer | Required | The page number you would like to set the page text for, zero indexed |
text | String | Required | The updated text for the given page |
ocr | String | Not Required | An optional identifier for the OCR engine used to generate this text |
positions | Array of JSON Objects | Not Required | Optionally set the position of each word of text, see next table for details |
The position
field in each pages
object is a JSON array of position
objects, with the following fields:
Field | Type | Options | Description |
---|---|---|---|
text | String | Required | A single word on the page |
x1 | Float | Required | Left most coordinate of the word, as a percentage of page size |
x2 | Float | Required | Right most coordinate of the word, as a percentage of page size |
y1 | Float | Required | Top most coordinate of the word, as a percentage of page size |
y2 | Float | Required | Bottom most coordinate of the word, as a percentage of page size |
metadata | JSON | Not Required | Any extra metadata that you would like to store with this word |
Example JSON setting just the page text:
[
{"page_number": 0, "text": "Page 1 text"},
{"page_number": 1, "text": "Page 2 text"}
]
Example JSON setting the page text and word positions:
[
{
"page_number": 0,
"text": "Page 1 text",
"ocr": "my-ocr-engine",
"positions": [
{
"text": "Page",
"x1": 0.1,
"x2": 0.2,
"y1": 0.1,
"y2": 0.2,
"metadata": {"type": "word"}
},
{
"text": "1",
"x1": 0.3,
"x2": 0.4,
"y1": 0.1,
"y2": 0.2,
"metadata": {"type": "word"}
},
{
"text": "text",
"x1": 0.5,
"x2": 0.6,
"y1": 0.1,
"y2": 0.2,
"metadata": {"type": "word"}
}
]
}
]
The API uses expandable fields in a few places, which are implemented by the Django REST - FlexFields package. It allows related fields, which would normally be returned by ID, be expanded into the fully nested representation. This allows you to save additional requests to the server when you need the related information, but for the server to not need to serve this information when it is not needed.
To expand one of the expandable fields, which are document in the fields
section for each resource, add the expand
query parameter to your request:
?expand=user
To expand multiple fields, separate them with a comma:
?expand=user,organization
You may also expand nested fields if the expanded field has its own expandable fields:
?expand=user.organization
To expand all fields:
?expand=~all