October 2024
XLSM File Support
Added support for XLSM file uploads, both from third-party integrations and local file uploads.
Similar to XLSX files, each row is split onto its own line, and each element within the row is prefixed with the header of its corresponding column. Our parser assumes that the first row, and only the first row, is the header. Macros, images, and charts aren't supported yet.
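To make the flattening concrete, here is a minimal illustration (not Carbon's actual parser) of the rule described above; the headers, values, and the exact "Header: value" separator are made-up assumptions.

```python
# Illustrative only: mimics the described XLSM/XLSX flattening, where every cell
# is prefixed with its column header and every row becomes one line of text.
header = ["Name", "Region", "Revenue"]   # assumed single header row
rows = [
    ["Acme", "EMEA", "1200"],
    ["Globex", "APAC", "3400"],
]

for row in rows:
    # "Header: value" pairs joined per row, one row per output line.
    print(", ".join(f"{col}: {val}" for col, val in zip(header, row)))

# Name: Acme, Region: EMEA, Revenue: 1200
# Name: Globex, Region: APAC, Revenue: 3400
```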
Webscrape Improvements
We now immediately abort the web scrape run when the file is deleted in Carbon, freeing up resources to submit another web scrape request.
Auto-sync for web scrapes can now be managed at the user and organization level via auto_sync_enabled_sources, using the value WEB_SCRAPE. The default auto-sync schedule is 2 weeks (as opposed to daily for other data sources).
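As a rough sketch, enabling web scrape auto-sync for a user might look like the following; the base URL, header names, and the exact payload shape around auto_sync_enabled_sources are assumptions based on the user-update endpoint referenced later in these notes.

```python
import requests

BASE_URL = "https://api.carbon.ai"          # assumed base URL
HEADERS = {
    "authorization": "Bearer <API_KEY>",    # assumed header names
    "customer-id": "<CUSTOMER_ID>",
}

# Enable auto-sync for web scrapes only (value per the changelog: WEB_SCRAPE).
resp = requests.post(
    f"{BASE_URL}/update_users",             # assumed user-update endpoint name
    headers=HEADERS,
    json={
        "customer_ids": ["<CUSTOMER_ID>"],
        "auto_sync_enabled_sources": ["WEB_SCRAPE"],
    },
)
resp.raise_for_status()
```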
Store Files Without Parsing
The sync endpoints now take a new parameter store_file_only (file_sync_config.store_file_only for external files) to allow users to skip parsing during the sync. This means the file will have a presigned_url but not a parsed_text_url.
Because we are skipping parsing, we won't be able to count the number of characters in the file. That means the only metrics we'll report to Stripe are bytes uploaded (and URLs scraped if it's a web scrape file).
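For illustration, a local file upload that stores the file without parsing might look like the sketch below; the upload endpoint name, base URL, and header names are assumptions, and store_file_only is the new parameter.

```python
import requests

BASE_URL = "https://api.carbon.ai"          # assumed base URL
HEADERS = {
    "authorization": "Bearer <API_KEY>",    # assumed header names
    "customer-id": "<CUSTOMER_ID>",
}

# Upload a file but skip parsing: the file will get a presigned_url
# but no parsed_text_url.
with open("report.pdf", "rb") as f:
    resp = requests.post(
        f"{BASE_URL}/uploadfile",            # assumed upload endpoint name
        headers=HEADERS,
        params={"store_file_only": True},
        files={"file": f},
    )
resp.raise_for_status()
```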
Sync Additional GitHub Data
In addition to syncing files from repositories, you can fetch data directly from GitHub via the following endpoints:
/integrations/data/github/pull_requests: Lists all pull requests for a repository
/integrations/data/github/pull_requests/{pull_number}: Retrieves a specific pull request
/integrations/data/github/pull_requests/comments: Fetches comments on a pull request
/integrations/data/github/pull_requests/files: Retrieves files that were changed
/integrations/data/github/pull_requests/commits: Retrieves a list of commits on a pull request
/integrations/data/github/issues: Lists repository issues
/integrations/data/github/issues/{issue_number}: Retrieves a specific issue
By default, we return responses with mappings applied, but there is an option to include the entire GitHub response on every endpoint (include_remote_data). Find more details in our documentation here.
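As an example of the shape of these calls, listing pull requests with the raw GitHub response attached might look like the sketch below; the base URL, header names, and the repository/data-source fields are assumptions, and include_remote_data is the documented option.

```python
import requests

BASE_URL = "https://api.carbon.ai"          # assumed base URL
HEADERS = {
    "authorization": "Bearer <API_KEY>",    # assumed header names
    "customer-id": "<CUSTOMER_ID>",
}

# List pull requests for a repository and ask Carbon to attach the full
# GitHub response alongside its mapped fields.
resp = requests.post(
    f"{BASE_URL}/integrations/data/github/pull_requests",
    headers=HEADERS,
    json={
        "data_source_id": 123,               # hypothetical GitHub connection ID
        "repository": "owner/repo",          # hypothetical repository identifier
        "include_remote_data": True,
    },
)
print(resp.json())
```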
/user_files_v2: New upload_id Property
User files now contain a new property called upload_id, which is generated internally by Carbon. This property groups together files that were part of the same upload. Each upload from a third-party file picker will have its own unique upload_id, even if the files were uploaded in the same session. Sessions are still identified by the request_id. If the same file is uploaded multiple times, only the most recent upload_id is saved.
Webhooks that send the request_id will now also send the upload_id.
The /user_files_v2 endpoint now accepts a new filter called upload_ids.
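For example, filtering a file listing down to a single upload could look like this sketch; the base URL, header names, and the surrounding filter structure are assumptions, and upload_ids is the new filter.

```python
import requests

BASE_URL = "https://api.carbon.ai"          # assumed base URL
HEADERS = {
    "authorization": "Bearer <API_KEY>",    # assumed header names
    "customer-id": "<CUSTOMER_ID>",
}

# List only the files that belong to one upload, using the new upload_ids filter.
resp = requests.post(
    f"{BASE_URL}/user_files_v2",
    headers=HEADERS,
    json={
        "pagination": {"limit": 100},
        "filters": {"upload_ids": ["<UPLOAD_ID>"]},   # assumed filter envelope
    },
)
for file in resp.json().get("results", []):           # assumed response key
    print(file["id"], file.get("upload_id"))
```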
New ALL_FILES_PROCESSED Webhook
The new webhook ALL_FILES_PROCESSED will be sent when all files in an upload have moved into the READY, SYNC_ERROR, READY_TO_SYNC, or RATE_LIMITED status. It includes the request_id as the sent object and the upload_id as additional information.
API Update
Starting next Tuesday (10/15), the hot_storage_time_to_live field under file upload endpoints will no longer take values in seconds. Instead, it will need to be a discrete number of days from the list: [1, 3, 7, 14, 30]. Anything else will raise an exception.
Self-Hosting Updates
You can now bring your own S3-compatible object storage instead of using S3 (AWS) or Google Blob Storage (GCP).
Added a flag DISABLE_RATE_LIMITS to disable all of Carbon's rate limits listed here.
Premium Proxies for Web Scraping
We have introduced a new feature called use_premium_proxies for web scraping and sitemap scraping that can be enabled upon request. This feature aims to enhance the success rate when scraping websites that utilize captchas or bot blockers.
Please note that enabling premium proxies may result in longer web scraping durations.
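Once the feature has been enabled for your organization, a scrape request using premium proxies could look roughly like this; the /web_scrape endpoint name, base URL, header names, and other payload fields are assumptions, and use_premium_proxies is the new flag.

```python
import requests

BASE_URL = "https://api.carbon.ai"          # assumed base URL
HEADERS = {
    "authorization": "Bearer <API_KEY>",    # assumed header names
    "customer-id": "<CUSTOMER_ID>",
}

# Scrape a captcha/bot-blocker-protected site through premium proxies.
# Note: this can increase scraping duration.
resp = requests.post(
    f"{BASE_URL}/web_scrape",               # assumed endpoint name
    headers=HEADERS,
    json=[{
        "url": "https://example.com/docs",
        "use_premium_proxies": True,
    }],
)
resp.raise_for_status()
```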
Limit File Syncs by Character Count
Initial file syncs now include the option to limit based on the number of characters. There are three levels of character limits:
max_characters_per_file: A single file from the user cannot exceed this character limit.
max_characters_per_upload: Custom character limit for the user across a single upload request.
max_characters: Custom character limit for the user across all of the user's files. Please note that in this case, the value can slightly exceed the limit.
These limits can be configured using the user (/update_user) or organization (/organization/update) update endpoints. If these limits are exceeded, the file that surpasses the threshold will be moved to SYNC_ERROR, and the corresponding webhook (FILE_ERROR) will be sent. Please be aware that files that have already synced from the same upload request will not be rolled back.
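For example, setting all three character limits for a single user might look like the sketch below; the base URL, header names, and endpoint naming are assumptions, while the three limit fields are the ones documented above.

```python
import requests

BASE_URL = "https://api.carbon.ai"          # assumed base URL
HEADERS = {
    "authorization": "Bearer <API_KEY>",    # assumed header names
    "customer-id": "<CUSTOMER_ID>",
}

# Cap a user at 1M characters per file, 5M per upload request, and 20M overall.
resp = requests.post(
    f"{BASE_URL}/update_users",             # assumed user-update endpoint name
    headers=HEADERS,
    json={
        "customer_ids": ["<CUSTOMER_ID>"],
        "max_characters_per_file": 1_000_000,
        "max_characters_per_upload": 5_000_000,
        "max_characters": 20_000_000,
    },
)
resp.raise_for_status()
```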
Email Notifications for Usage Limits
You can now enable the following emails (currently upon request) to be sent to admins and users under your portal.carbon.ai account:
Daily Limit Reached: Your organization has reached the 2.5GB (or custom) upload limit across all users and data sources. We'll return the organizationName, uploadLimit, and resetTime.
User Exceeded Their Upload Limit: A certain user has exceeded the upload limits you set via max_files or max_files_per_upload. We'll return the customerId, limitType, and dateTime.
User Exceeded Their Upload Limit: A certain user has exceeded the upload limits you set via max_characters_per_file, max_characters_per_upload, or max_characters. We'll return the customerId, limitType, and dateTime.
Self-Hosting Updates
We have added two new environment variables:
HASH_BEARER_TOKEN: When set to true, we store only hashed bearer tokens in the database. This is optional and adds an additional layer of security if your database is already encrypted at rest.
DATA_SOURCE_ENCRYPTION_KEY: Enables encryption of client secrets and access tokens when set. This key should be a URL-safe, base64-encoded 32-byte key. Refresh tokens are not encrypted because they are not useful without the client secret. Encrypted values can be decrypted and rolled back using a migration that does this for all tokens.
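A key in the required format (URL-safe, base64-encoded, 32 random bytes) can be generated with the Python standard library, for example:

```python
import base64
import os

# 32 random bytes, base64-encoded with the URL-safe alphabet.
key = base64.urlsafe_b64encode(os.urandom(32)).decode()
print(key)  # set this value as DATA_SOURCE_ENCRYPTION_KEY
```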
You can now use your own SQS-compatible queue instead of using SQS (AWS) or PubSub (GCP). Currently we’ve implemented elasticmq as the open-source SQS alternative.
Carbon Connect Enhancements
Users can now opt to have the updated_at column displayed in filesTabColumns instead of created_at, allowing for sorting by this column.
Status updates for files in a syncing state have been implemented. The file view will automatically refresh when the file changes to either Ready or Sync Error.
API Endpoint for White Labeling
If white-labeling is enabled for your organization, you can directly manage your OAuth credentials for white-labeling via the following endpoints:
/white_label/create: Add credentials to white-label data sources.
/white_label/delete: Delete credentials for white-labeled data sources.
/white_label/update: Update credentials for a white-labeled data source.
/white_label/list: List credentials for white-labeled data sources.
Below is a list of data sources that can be white-labeled:
NOTION GOOGLE_DRIVE BOX ONEDRIVE SHAREPOINT INTERCOM SLACK ZENDESK OUTLOOK GMAIL SERVICENOW SALESFORCE ZOTERO CONFLUENCE DROPBOX GOOGLE_CLOUD_STORAGE GONG
For all these data source types, client_id and redirect_uri are required credentials. client_secret is optional for those who want to create data sources with access tokens obtained outside of Carbon.
For data source specific credentials:
Google Drive optionally takes an api_key for those who want to use Google's file picker.
OneDrive and Sharepoint take a file_picker_client_id and file_picker_redirect_uri for those who want to use Microsoft's file picker.
Note: Carbon will encrypt client secrets in our database, but return them unencrypted in the API responses.
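A sketch of adding white-label credentials for one data source is shown below; the base URL, header names, and the payload field grouping are assumptions, while client_id, redirect_uri, and client_secret are the credential names listed above.

```python
import requests

BASE_URL = "https://api.carbon.ai"          # assumed base URL
HEADERS = {
    "authorization": "Bearer <API_KEY>",    # assumed header names
    "customer-id": "<CUSTOMER_ID>",
}

# Register white-label OAuth credentials for Google Drive.
resp = requests.post(
    f"{BASE_URL}/white_label/create",
    headers=HEADERS,
    json=[{
        "data_source_type": "GOOGLE_DRIVE",  # assumed field name
        "credentials": {                     # assumed grouping
            "client_id": "<OAUTH_CLIENT_ID>",
            "redirect_uri": "https://yourapp.example.com/oauth/callback",
            "client_secret": "<OAUTH_CLIENT_SECRET>",  # optional
        },
    }],
)
resp.raise_for_status()
```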
Disabling File Formats in CCv3 File Picker (3.0.21)
You can now disable the selection of unsupported or disabled file formats in the CCv3 in-house file picker for the following integrations:
GOOGLE_DRIVE ONEDRIVE SHAREPOINT BOX DROPBOX S3 (includes Digital Ocean Spaces) ZOTERO AZURE_BLOB_STORAGE GOOGLE_CLOUD_STORAGE
By default, all file formats supported by Carbon are enabled. Users can set the allowed_file_formats under connector settings at the user (update_users) or organization (organization/update) level to control which file formats are enabled.
Self-Hosting Updates (1.3.18)
We now allow environment variables for the Carbon application to be passed as a yaml file. The config.yaml file is a configuration file that stores all the environment variables for the application. It can have multiple environments, such as dev, prod, etc. Each environment should be placed under a key with the name of the environment, and the key must be at the top level of the file. The environment that is used is determined by the global_env key-value pair (e.g. global_env: dev). It's important to note that the variables in this file are converted into environment variables. Essentially, every key-value pair at the leaf level is extracted. The key becomes the key of the environment variable in all caps, and the value remains the same.
For instance, the microsoft_file_picker_client_id variable under prod.data_connects.onedrive would be converted to the env variable MICROSOFT_FILE_PICKER_CLIENT_ID=test_id_here.
Here is an example of the .yaml file for reference.
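Separately, here is a minimal Python sketch of the leaf-extraction rule described above (walk the selected environment's tree, take every leaf key-value pair, and upper-case the key). It illustrates the rule only; it is not Carbon's actual loader, and the nested structure shown is hypothetical.

```python
# Illustration of the config.yaml leaf-extraction rule described above.
config = {
    "global_env": "prod",
    "prod": {
        "data_connects": {                    # hypothetical nesting, per the path above
            "onedrive": {"microsoft_file_picker_client_id": "test_id_here"},
        },
    },
}

def leaf_env_vars(tree):
    """Yield (KEY_IN_CAPS, value) for every leaf key-value pair."""
    for key, value in tree.items():
        if isinstance(value, dict):
            yield from leaf_env_vars(value)
        else:
            yield key.upper(), value

env = dict(leaf_env_vars(config[config["global_env"]]))
print(env)  # {'MICROSOFT_FILE_PICKER_CLIENT_ID': 'test_id_here'}
```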
Custom Metadata for Data Sources
We added the functionality to add custom tags to data sources, similar to those currently supported for files.
You can add tags to any data source via the following endpoints:
data_sources/tags/add: Add tags to a data source.
data_sources/tags/remove: Remove tags from a data source.
All endpoints for connecting data sources (i.e., integrations/connect and /integrations/oauth_url) take a data_source_tags param for adding tags.
The tags must be added as key-value pairs (same as file tags). Example: {"userId": "swapnil@carbon.ai"}
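For example, tagging an already connected data source might look like this sketch; the tag format matches the example above, while the base URL, header names, and the data_source_id field are assumptions.

```python
import requests

BASE_URL = "https://api.carbon.ai"          # assumed base URL
HEADERS = {
    "authorization": "Bearer <API_KEY>",    # assumed header names
    "customer-id": "<CUSTOMER_ID>",
}

# Attach a key-value tag to a connected data source.
resp = requests.post(
    f"{BASE_URL}/data_sources/tags/add",
    headers=HEADERS,
    json={
        "data_source_id": 123,               # hypothetical data source ID
        "tags": {"userId": "swapnil@carbon.ai"},
    },
)
resp.raise_for_status()
```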
We have also introduced two parameters in Carbon Connect (3.0.23), allowing customers to add and filter displayed data sources for users:
dataSourceTags: These are key-value pairs that will be added to all data sources connected through CCv3.
dataSourceTagsFilterQuery: This parameter filters for tags when querying data sources. It functions similarly to our documented file filters. If not provided, all data sources will be returned. Example: {"key": "userId", "value": "swapnil@carbon.ai"}
Sharepoint Team Site Support
We now support Sharepoint team sites. To connect a Sharepoint team site, leave sharepoint_site_name undefined when calling /integrations/oauth_url.
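In practice, this just means omitting sharepoint_site_name from the OAuth URL request, roughly as below; the base URL, header names, and the service field naming are assumptions.

```python
import requests

BASE_URL = "https://api.carbon.ai"          # assumed base URL
HEADERS = {
    "authorization": "Bearer <API_KEY>",    # assumed header names
    "customer-id": "<CUSTOMER_ID>",
}

# Request an OAuth URL for SharePoint without sharepoint_site_name,
# which connects the account's team sites rather than a named site.
resp = requests.post(
    f"{BASE_URL}/integrations/oauth_url",
    headers=HEADERS,
    json={
        "service": "SHAREPOINT",             # assumed field/value naming
        # no "sharepoint_site_name" key here
    },
)
print(resp.json())
```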
Cursor-Based Pagination
We have begun to implement a more efficient pagination system for our API endpoints, starting with the /user_files_v2 endpoint.
We introduced a new parameter called starting_id in the pagination block. It is recommended to use a combination of limit and starting_id instead of limit and offset. This not only reduces the load on our backend but also leads to significantly faster response times for you. The limit-starting_id approach is essentially cursor-based pagination, with the cursor being the starting_id in this case.
To use it, if you are unsure which ID to use for starting_id, you should initially make a query with just a limit, an order direction, and a field to order by. For example:
{ "order_by": "id", "order_dir": "desc", "pagination": { "limit": 10 } }
You will receive a list of results (files in the case of /user_files_v2), ordered by id in descending order. From here, you can use the last ID in the list as the starting ID for the next API call. For instance:
{ "order_by": "id", "order_dir": "desc", "pagination": { "limit": 10, "starting_id": 25032 } }
This assumes that the last ID of the first API call was 25032. By following this method, you can retrieve the next 10 files, and you can continue this process as needed.
We aim to eventually phase out offset-based pagination in favor of this cursor-based pagination, as offset-based pagination performs significantly worse at the database level.
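Putting the two requests above into a loop, a cursor-style pagination sketch could look like this; the request body mirrors the JSON examples above, while the base URL, header names, and the response field name are assumptions.

```python
import requests

BASE_URL = "https://api.carbon.ai"          # assumed base URL
HEADERS = {
    "authorization": "Bearer <API_KEY>",    # assumed header names
    "customer-id": "<CUSTOMER_ID>",
}

def iter_user_files(page_size=10):
    """Walk /user_files_v2 using limit + starting_id instead of limit + offset."""
    starting_id = None
    while True:
        pagination = {"limit": page_size}
        if starting_id is not None:
            pagination["starting_id"] = starting_id
        resp = requests.post(
            f"{BASE_URL}/user_files_v2",
            headers=HEADERS,
            json={"order_by": "id", "order_dir": "desc", "pagination": pagination},
        )
        resp.raise_for_status()
        files = resp.json().get("results", [])   # assumed response key
        if not files:
            return
        yield from files
        # Use the last ID on this page as the cursor for the next page.
        starting_id = files[-1]["id"]

for f in iter_user_files():
    print(f["id"])
```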
Self-Hosting Updates
Azure Blob Storage has been added as an object storage backend, alongside S3 (AWS), Google Blob Storage (GCP), and S3-compatible open source alternatives.
Customer Portal v2
We have completely redesigned our customer portal UI (portal.carbon.ai) and have a roadmap to significantly enhance the functionality.
You can now manage the following through the portal:
Webhooks
API keys
Admin and User Permissions
Subscription Plans
Drives Listed As Top-Level Items
Personal and Shared Drives are now listed as top-level source items, both via the API and within the in-house Carbon file picker.
Drives themselves cannot be selected for syncing, but you can click into a Drive to select folders and files within it.
Self-Hosting Updates
We added the following environment variables for self-hosted deployments:
default_request_timeout: This is the default timeout for all requests made to external APIs and URLs. Defaults to 7 seconds.
web_scrape_request_timeout: This timeout is specifically for requests made during web scraping. Defaults to 60 seconds.
<data_source_name>_request_timeout: This allows you to customize the request timeout for specific data sources. Replace <data_source_name> with the actual name, such as notion_request_timeout or google_drive_request_timeout. Defaults to 7 seconds.
Custom Scopes for Connectors
You can now directly pass in custom scopes to request from the OAuth provider via /integrations/oauth_url. The scopes will be used as-is, not combined with the default scopes that Carbon requests. The scopes must be passed in as an array, for example:
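The scope strings below are purely illustrative (standard Google OAuth scope URLs used as stand-ins), not Carbon defaults:

```python
# Illustrative only: an array of custom scopes to pass to /integrations/oauth_url.
custom_scopes = [
    "https://www.googleapis.com/auth/drive.readonly",
    "https://www.googleapis.com/auth/userinfo.email",
]
```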
Support for custom scopes has also been added to Carbon Connect 3.0.26. The prop is called scopes and is an array that can be set only on the integration level.
Presigned URL Expiry Time
We added a new, optional field on the /user_files_v2 endpoint called presigned_url_expiry_time_seconds that can be used to set the expiry time for generated presigned URLs. The default is 3600 seconds.
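For instance, requesting shorter-lived presigned URLs might look like this sketch; the base URL, header names, and surrounding fields are assumptions, and presigned_url_expiry_time_seconds is the new field.

```python
import requests

BASE_URL = "https://api.carbon.ai"          # assumed base URL
HEADERS = {
    "authorization": "Bearer <API_KEY>",    # assumed header names
    "customer-id": "<CUSTOMER_ID>",
}

# Ask for presigned URLs that expire after 10 minutes instead of the default hour.
resp = requests.post(
    f"{BASE_URL}/user_files_v2",
    headers=HEADERS,
    json={
        "pagination": {"limit": 10},
        "include_raw_file": True,                    # assumed flag to request presigned URLs
        "presigned_url_expiry_time_seconds": 600,
    },
)
resp.raise_for_status()
```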
List Sharepoint Sites
You can now list all the SharePoint sites associated with a user’s SharePoint account.
After connecting to a SharePoint account, you can use the endpoint /integrations/sharepoint/sites/list to retrieve a list of all sites in the account.
This endpoint has two optional parameters:
data_source_id: This must be provided if there are multiple SharePoint connections under the same customer ID.
cursor: This is used for pagination.
Each site will return three properties:
site_url
site_display_name
site_name
: This value is used for thesharepoint_site_name
when connecting sites withintegrations/oauth_url
.
Please note that this endpoint requires an additional scope, Sites.Read.All, which Carbon does not request by default. In order to list sites, connect sites, and sync files from connected sites, you must include Sites.Read.All in /integrations/oauth_url through the scopes parameter, along with the required scopes: openid, offline_access, User.Read, and Files.Read.All.
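Putting this together, connecting with the extra scope and then listing sites could look like the sketch below; the base URL, header names, service field naming, and response key are assumptions, while the scopes list matches the one above.

```python
import requests

BASE_URL = "https://api.carbon.ai"          # assumed base URL
HEADERS = {
    "authorization": "Bearer <API_KEY>",    # assumed header names
    "customer-id": "<CUSTOMER_ID>",
}

# 1) Request the OAuth URL with Sites.Read.All added to the required scopes.
oauth = requests.post(
    f"{BASE_URL}/integrations/oauth_url",
    headers=HEADERS,
    json={
        "service": "SHAREPOINT",             # assumed field/value naming
        "scopes": [
            "openid", "offline_access", "User.Read",
            "Files.Read.All", "Sites.Read.All",
        ],
    },
)

# 2) After the user completes OAuth, list the sites in the account.
sites = requests.post(
    f"{BASE_URL}/integrations/sharepoint/sites/list",
    headers=HEADERS,
    json={"data_source_id": 123},            # needed if there are multiple SharePoint connections
)
for site in sites.json().get("sites", []):   # assumed response key
    print(site["site_name"], site["site_url"])
```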
New Filters for Source Items
We added two new optional filters for /integrations/items/list:
file_formats: Filter based on all file formats supported by Carbon. This is a new feature that won't be backfilled, so it will only apply to data sources that are synced or re-synced moving forward.
item_types: Filter on different item types at the source; for example, help centers will have TICKET and ARTICLE, while Google Drive will have FILE and FOLDER.
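As an example, listing only plain files in PDF format from a connected source might look like this sketch; the base URL, header names, and the filter envelope are assumptions, while file_formats and item_types are the new filters.

```python
import requests

BASE_URL = "https://api.carbon.ai"          # assumed base URL
HEADERS = {
    "authorization": "Bearer <API_KEY>",    # assumed header names
    "customer-id": "<CUSTOMER_ID>",
}

# List only PDF files (by format) that are plain files (by item type).
resp = requests.post(
    f"{BASE_URL}/integrations/items/list",
    headers=HEADERS,
    json={
        "data_source_id": 123,               # hypothetical connection ID
        "filters": {                         # assumed filter envelope
            "file_formats": ["PDF"],
            "item_types": ["FILE"],
        },
    },
)
print(resp.json())
```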
Return external_url for Freshdesk
We now return the external_url value for Freshdesk articles.
Sync Outlook Emails Across All Folders
We have introduced support for syncing Outlook emails across all user folders. Users can specify the folder as null to achieve this, with the default being the inbox if this input is excluded.