October 2024

XLSM File Support

  • Added support for XLSM files uploaded via third-party connectors and local file uploads.

  • Similar to XLSX files, each row is placed on its own line, and each element within the row is prefixed with the header of its corresponding column. Our parser assumes that the first row, and only the first row, is the header. Macros, images, and charts aren’t supported yet.
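
  • For illustration, here is a minimal sketch of the row-to-text transformation described above (the exact separator between header and cell value is an assumption):

# Illustrative only: mimic the described XLSX/XLSM parsing, where the first row
# is treated as the header and each cell is prefixed with its column header.
rows = [
    ["Name", "Role", "Location"],         # header row
    ["Ada", "Engineer", "London"],
    ["Grace", "Scientist", "Washington"],
]
header, data_rows = rows[0], rows[1:]
for row in data_rows:
    # One output line per row; the "header: value" separator is assumed.
    print(" ".join(f"{col}: {val}" for col, val in zip(header, row)))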

Webscrape Improvements

  • We now immediately abort the web scrape run when the file is deleted in Carbon, freeing up resources to submit another web scrape request.

  • Auto-sync for web scrapes can now be managed at the user and organization level by including WEB_SCRAPE in auto_sync_enabled_sources. The default auto-sync schedule is every 2 weeks (as opposed to daily for other data sources).

Store Files Without Parsing

  • The sync endpoints now take a new parameter store_file_only (file_sync_config.store_file_only for external files) to allow users to skip parsing during the sync. This means the file will have a presigned_url but not a parsed_text_url.

  • Because we are skipping parsing, we won’t be able to count the number of characters in the file. That means the only metrics we’ll report to Stripe are bytes uploaded (and URLs scraped if it’s a web scrape file).
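
  • A minimal sketch of a sync request with parsing skipped (the endpoint path, headers, and surrounding fields are illustrative; only store_file_only / file_sync_config.store_file_only comes from this note):

# Sketch: store an external file without parsing it.
import requests

resp = requests.post(
    "https://api.carbon.ai/integrations/files/sync",   # assumed sync endpoint
    headers={"authorization": "Bearer <API_KEY>", "customer-id": "<CUSTOMER_ID>"},
    json={
        "data_source_id": 123,
        "ids": ["<external_file_id>"],
        "file_sync_config": {"store_file_only": True},  # stored, not parsed
    },
)
print(resp.json())  # the file will have a presigned_url but no parsed_text_url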

Sync Additional GitHub Data

  • In addition to syncing files from repositories, you can fetch data directly from GitHub via the following endpoints:

    • /integrations/data/github/pull_requests: Lists all pull requests for a repository

    • /integrations/data/github/pull_requests/{pull_number}: Retrieves a specific pull request

    • /integrations/data/github/pull_requests/comments: Fetches comments on a pull request

    • /integrations/data/github/pull_requests/files: Retrieves files that were changed

    • /integrations/data/github/pull_requests/commits: Retrieves a list of commits on a pull request

    • /integrations/data/github/issues: Lists repository issues

    • /integrations/data/github/issues/{issue_number}: Retrieves a specific issue

  • By default, we return responses with mappings applied, but there is an option to include the entire GitHub response on every endpoint (include_remote_data).

  • Find more details in our documentation here.
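
  • A minimal sketch of listing pull requests with the raw GitHub payload included (parameter names other than include_remote_data, such as repository and data_source_id, are assumptions):

# Sketch: list pull requests for a repository via Carbon's GitHub data endpoint.
import requests

resp = requests.post(
    "https://api.carbon.ai/integrations/data/github/pull_requests",
    headers={"authorization": "Bearer <API_KEY>", "customer-id": "<CUSTOMER_ID>"},
    json={
        "data_source_id": 123,
        "repository": "owner/repo",
        "include_remote_data": True,  # include the full GitHub response, not just mappings
    },
)
print(resp.json())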

/user_files_v2: New upload_id Property

  • User files now contain a new property called upload_id which is generated internally by Carbon. This property groups together files that were part of the same upload. Each upload from a third-party file picker will have its own unique upload_id, even if the files were uploaded in the same session. Sessions are still identified by the request_id. If the same file is uploaded multiple times, only the most recent upload_id is saved.

  • Webhooks that send the request_id will now also send the upload_id.

  • The /user_files_v2 endpoint now accepts a new filter called upload_ids.
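
  • A minimal sketch of using the new upload_ids filter (the shape of the surrounding filters block is assumed from the existing /user_files_v2 filters):

# Sketch: fetch only the files that belong to a specific upload.
import requests

resp = requests.post(
    "https://api.carbon.ai/user_files_v2",
    headers={"authorization": "Bearer <API_KEY>", "customer-id": "<CUSTOMER_ID>"},
    json={
        "filters": {"upload_ids": ["<upload_id>"]},
        "pagination": {"limit": 100},
    },
)
print(resp.json())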

New ALL_FILES_PROCESSED Webhook

  • The new webhook ALL_FILES_PROCESSED will be sent when all files in an upload have moved into the “READY,” “SYNC_ERROR,” “READY_TO_SYNC,” or “RATE_LIMITED” status. It includes the request_id as the sent object and the upload_id as additional information.

API Update

  • Starting next Tuesday (10/15), the hot_storage_time_to_live field under file upload endpoints will no longer take values in seconds. Instead it will need to be a discrete number of days from the list: [1, 3, 7, 14, 30]. Anything else will raise an exception.
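
  • A minimal sketch of a compliant upload request after the change (the endpoint path and parameter placement are illustrative):

# Sketch: request 7 days of hot storage; values other than 1, 3, 7, 14, or 30 will raise an exception.
import requests

with open("report.pdf", "rb") as f:
    resp = requests.post(
        "https://api.carbon.ai/uploadfile",  # assumed file upload endpoint
        headers={"authorization": "Bearer <API_KEY>", "customer-id": "<CUSTOMER_ID>"},
        params={"hot_storage_time_to_live": 7},  # days, no longer seconds
        files={"file": f},
    )
print(resp.json())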

Self-Hosting Updates

  • You can now bring your own S3-compatible object storage instead of using S3 (AWS) or Google Blob Storage (GCP).

  • Added a flag DISABLE_RATE_LIMITS to disable all of Carbon’s rate limits listed here.

Premium Proxies for Web Scraping

  • We have introduced a new feature called use_premium_proxies for web scraping and sitemap scraping that can be enabled upon request. This feature aims to enhance the success rate when scraping websites that utilize captchas or bot blockers.

  • Please note that enabling this feature may result in longer web scraping durations.

Limit File Syncs by Character Count

  • Initial file syncs now include the option to limit based on the number of characters. There are three levels of character limits:

    • max_characters_per_file: A single file from the user cannot exceed this character limit.

    • max_characters_per_upload: Custom character limit for the user across a single upload request.

    • max_characters: Custom character limit for the user across all of the user’s files. Please note that in this case, the value can slightly exceed the limit.

  • These limits can be configured using the user (/update_users) or organization (/organization/update) update endpoints. If these limits are exceeded, the file that surpasses the threshold will be moved to SYNC_ERROR, and the corresponding webhook (FILE_ERROR) will be sent. Please be aware that files that have already synced from the same upload request will not be rolled back.
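
  • A minimal sketch of configuring the three limits for a user (field placement and the customer_ids field name are assumptions; the limit names and endpoints come from this note):

# Sketch: apply character limits at the user level.
import requests

resp = requests.post(
    "https://api.carbon.ai/update_users",
    headers={"authorization": "Bearer <API_KEY>"},
    json={
        "customer_ids": ["<CUSTOMER_ID>"],
        "max_characters_per_file": 500_000,
        "max_characters_per_upload": 2_000_000,
        "max_characters": 10_000_000,
    },
)
print(resp.json())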

Email Notifications for Usage Limits

  • You can now enable the following emails (currently upon request) to be sent to admins and users under your portal.carbon.ai account:

    • Daily Limit Reached: Your organization has reached the 2.5GB (or custom) upload limit across all users and data sources. We’ll return the organizationName, uploadLimit, and resetTime.

    • User Exceeded Their Upload Limit: A certain user has exceeded the upload limits you set via max_files or max_files_per_upload. We’ll return the customerId, limitType, and dateTime.

    • User Exceeded Their Upload Limit: A certain user has exceeded the upload limits you set via max_characters_per_file, max_characters_per_upload, or max_characters. We’ll return the customerId, limitType, and dateTime.

Self-Hosting Updates

  • We have added two new environment variables:

    • HASH_BEARER_TOKEN: when set to true, we store only hashed bearer tokens in the database. This is optional and adds an additional layer of security if your database is already encrypted at rest.

    • DATA_SOURCE_ENCRYPTION_KEY: enables encryption of client secrets and access tokens when set. This key should be a URL-safe, base64-encoded 32-byte key. Refresh tokens are not encrypted because they are not useful without the client secret. Encrypted values can be decrypted and rolled back using a migration that does this for all tokens.

  • You can now use your own SQS-compatible queue instead of using SQS (AWS) or PubSub (GCP). Currently we’ve implemented elasticmq as the open-source SQS alternative.
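
  • For DATA_SOURCE_ENCRYPTION_KEY, any URL-safe, base64-encoded 32-byte value works. One way to generate such a key:

# Generate a URL-safe, base64-encoded 32-byte key for DATA_SOURCE_ENCRYPTION_KEY.
import base64
import os

key = base64.urlsafe_b64encode(os.urandom(32)).decode()
print(key)  # e.g. export DATA_SOURCE_ENCRYPTION_KEY=<printed value>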

Carbon Connect Enhancements

  • Users can now opt to have the updated_at column displayed in filesTabColumns instead of created_at, allowing for sorting by this column.

  • Status updates for files in a syncing state have been implemented. The file view will automatically refresh when the file changes to either Ready or Sync Error.

API Endpoint for White Labeling

  • If white-labeling is enabled for your organization, you can directly manage your OAuth credentials for white-labeling via the following endpoints:

    • /white_label/create: Add credentials to white label data sources.

    • /white_label/delete: Delete credentials for white-labeled data sources.

    • /white_label/update: Update credentials for a white-labeled data source.

    • /white_label/list: List credentials for white-labeled data sources.

  • Below is a list of data sources that can be white-labeled:

NOTION, GOOGLE_DRIVE, BOX, ONEDRIVE, SHAREPOINT, INTERCOM, SLACK, ZENDESK, OUTLOOK, GMAIL, SERVICENOW, SALESFORCE, ZOTERO, CONFLUENCE, DROPBOX, GOOGLE_CLOUD_STORAGE, GONG

  • For all these data source types, client_id and redirect_uri are required credentials. client_secret is optional for those who want to create data sources with access tokens obtained outside of Carbon. For data source-specific credentials:

    • Google Drive optionally takes an api_key for those who want to use Google’s file picker.

    • OneDrive and Sharepoint take a file_picker_client_id and file_picker_redirect_uri for those who want to use Microsoft’s file picker.

  • Note: Carbon will encrypt client secrets in our database, but return them unencrypted in the API responses.
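
  • A minimal sketch of adding white-label credentials (the request-body field names data_source_type and credentials are assumptions; client_id and redirect_uri are the required credentials noted above):

# Sketch: register OAuth credentials for a white-labeled data source.
import requests

resp = requests.post(
    "https://api.carbon.ai/white_label/create",
    headers={"authorization": "Bearer <API_KEY>"},
    json={
        "data_source_type": "GOOGLE_DRIVE",
        "credentials": {
            "client_id": "<OAUTH_CLIENT_ID>",
            "redirect_uri": "https://yourapp.example.com/oauth/callback",
            # client_secret is optional; api_key may be added for Google's file picker
        },
    },
)
print(resp.json())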

Disabling File Formats in CCv3 File Picker (3.0.21)

  • You can now disable the selection of unsupported or disabled file formats in the CCv3 in-house file picker for the following integrations:

GOOGLE_DRIVE, ONEDRIVE, SHAREPOINT, BOX, DROPBOX, S3 (includes Digital Ocean Spaces), ZOTERO, AZURE_BLOB_STORAGE, GOOGLE_CLOUD_STORAGE

  • By default, all file formats supported by Carbon are enabled. Users can set the allowed_file_formats under connector settings at the user (update_users) or organization level (organization/update) to control which file formats are enabled.
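
  • A minimal sketch of restricting the picker to PDFs and CSVs for Google Drive at the user level (the connector_settings nesting and customer_ids field name are assumptions; allowed_file_formats and the update endpoints come from this note):

# Sketch: only allow PDF and CSV selection in the Google Drive picker.
import requests

resp = requests.post(
    "https://api.carbon.ai/update_users",
    headers={"authorization": "Bearer <API_KEY>"},
    json={
        "customer_ids": ["<CUSTOMER_ID>"],
        "connector_settings": {
            "GOOGLE_DRIVE": {"allowed_file_formats": ["PDF", "CSV"]},
        },
    },
)
print(resp.json())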

Self-Hosting Updates (1.3.18)

  • We now allow environment variables for the Carbon application to be passed as a YAML file. The config.yaml file stores all of the application’s environment variables and can contain multiple environments, such as dev and prod. Each environment’s variables are placed under a top-level key named after that environment, and the environment in use is selected by the global_env key-value pair (e.g. global_env: dev). The variables in this file are converted into environment variables: every key-value pair at the leaf level is extracted, the key becomes the environment variable name in all caps, and the value remains the same.

    • For instance, the microsoft_file_picker_client_id variable under prod.data_connects.onedrive would be converted to the env variable MICROSOFT_FILE_PICKER_CLIENT_ID=test_id_here.

  • Here is an example of the .yaml file for reference.
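
  • As an illustration of the conversion rule (a sketch, not the loader Carbon ships), leaf-level keys under the environment selected by global_env become upper-cased environment variables:

# Sketch: flatten a config.yaml environment into environment variables.
import os
import yaml  # pip install pyyaml

CONFIG = """
global_env: prod
prod:
  data_connects:
    onedrive:
      microsoft_file_picker_client_id: test_id_here
"""

config = yaml.safe_load(CONFIG)

def export_leaves(node):
    for key, value in node.items():
        if isinstance(value, dict):
            export_leaves(value)       # descend until we reach leaf key-value pairs
        else:
            os.environ[key.upper()] = str(value)

export_leaves(config[config["global_env"]])
print(os.environ["MICROSOFT_FILE_PICKER_CLIENT_ID"])  # -> test_id_here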

Custom Metadata for Data Sources

  • We added the functionality to add custom tags to data sources, similar to those currently supported for files.

  • You can add tags to any data source via the following endpoints:

    • data_sources/tags/add: Add tags to a data source.

    • data_sources/tags/remove: Remove tags from a data source.

  • All endpoints for connecting data sources (i.e., /integrations/connect and /integrations/oauth_url) take a data_source_tags param for adding tags.

  • The tags must be added as a key-value pair (same as file tags). Example: {"userId": "swapnil@carbon.ai"}

  • We have also introduced two parameters in Carbon Connect (3.0.23), allowing customers to add and filter displayed data sources for users:

    • dataSourceTags: These are key-value pairs that will be added to all data sources connected through CCv3.

    • dataSourceTagsFilterQuery: This parameter filters for tags when querying data sources. It functions similarly to our documented file filters. If not provided, all data sources will be returned. Example: {"key": "userId", "value": "swapnil@carbon.ai"}
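
  • A minimal sketch of tagging a data source (the data_source_id and tags field names are assumptions; the tag format matches the example above):

# Sketch: attach a key-value tag to an existing data source.
import requests

resp = requests.post(
    "https://api.carbon.ai/data_sources/tags/add",
    headers={"authorization": "Bearer <API_KEY>", "customer-id": "<CUSTOMER_ID>"},
    json={"data_source_id": 123, "tags": {"userId": "swapnil@carbon.ai"}},
)
print(resp.json())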

Sharepoint Team Site Support

  • We now support Sharepoint team sites. To connect a Sharepoint team site, leave sharepoint_site_name undefined when calling /integrations/oauth_url.

Cursor-Based Pagination

  • We have begun to implement a more efficient pagination system for our API endpoints, starting with the /user_files_v2 endpoint.

  • We introduced a new parameter called starting_id in the pagination block. It is recommended to use a combination of limit and starting_id instead of limit and offset. This not only reduces the load on our backend but also leads to significantly faster response times for you. The limit-starting_id approach is essentially cursor-based pagination, with the cursor being the starting_id in this case.

  • To use it, if you are unsure about which ID to use for starting_id, you should initially make a query with just a limit, order direction, and field to order by. For example:

{ "order_by": "id", "order_dir": "desc", "pagination": { "limit": 10 } }

  • You will receive a list of results (files in the case of /user_files_v2), ordered by id in descending order. From here, you can use the last ID in the list as the starting ID for the next API call. For instance:

{ "order_by": "id", "order_dir": "desc", "pagination": { "limit": 10, "starting_id": 25032 } }

  • This assumes that the last ID of the first API call was 25032. By following this method, you can retrieve the next 10 files. You can continue this process as needed.

  • We aim to eventually phase out offset-based pagination in favor of this cursor-based pagination, as offset-based pagination performs significantly worse at the database level.
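
  • A minimal sketch of paging through /user_files_v2 with limit + starting_id (the results key in the response is an assumption; the request shape matches the examples above):

# Sketch: cursor-based pagination over user files.
import requests

headers = {"authorization": "Bearer <API_KEY>", "customer-id": "<CUSTOMER_ID>"}
body = {"order_by": "id", "order_dir": "desc", "pagination": {"limit": 10}}

while True:
    page = requests.post("https://api.carbon.ai/user_files_v2",
                         headers=headers, json=body).json()
    files = page.get("results", [])  # response key name assumed
    if not files:
        break
    for f in files:
        print(f["id"])
    # Use the last ID of this page as the cursor for the next request.
    body["pagination"]["starting_id"] = files[-1]["id"]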

Self-Hosting Updates

  • Azure Blob Storage has been added as an object storage backend, alongside S3 (AWS), Google Blob Storage (GCP), and S3-compatible open source alternatives.

Customer Portal v2

  • We have completely redesigned our customer portal UI (portal.carbon.ai) and have a roadmap to significantly enhance the functionality.

    • You can now manage the following through the portal:

      • Webhooks

      • API keys

      • Admin and User Permissions

      • Subscription Plans

Drives Listed As Top-Level Items

  • Personal and Shared Drives are now listed as top-level source items, both via the API and within the in-house Carbon file picker.

  • Drives themselves cannot be selected for syncing, but you can click into them to select folders and files within the Drives.

Self-Hosting Updates

  • We added the following environment variables for self-hosted deployments:

    • default_request_timeout: This is the default timeout for all requests made to external APIs and URLs. Defaults to 7 seconds.

    • web_scrape_request_timeout: This timeout is specifically for requests made during web scraping. Defaults to 60 seconds.

    • <data_source_name>_request_timeout: This allows you to customize the request timeout for specific data sources. Replace <data_source_name> with the actual name, such as notion_request_timeout or google_drive_request_timeout. Defaults to 7 seconds.

Custom Scopes for Connectors

  • You can now directly pass in custom scopes to request from the OAuth provider via /integrations/oauth_url. The scopes will be used as is and will not be combined with the default scopes that Carbon requests.

  • The scopes must be passed in as an array. For example:

"scopes": [
    "https://www.googleapis.com/auth/userinfo.profile",
    "https://www.googleapis.com/auth/userinfo.email",
    "https://www.googleapis.com/auth/drive.readonly"
  ]

  • Support for custom scopes has also been added to Carbon Connect 3.0.26. The prop is called scopes and is an array that can be set only on the integration level.

Presigned URL Expiry Time

  • We added a new, optional field on the /user_files_v2 endpoint called presigned_url_expiry_time_seconds that can be used to set the expiry time for generated presigned URLs. The default is 3600 seconds.

List Sharepoint Sites

  • You can now list all the SharePoint sites associated with a user’s SharePoint account.

  • After connecting to a SharePoint account, you can use the endpoint /integrations/sharepoint/sites/list to retrieve a list of all sites in the account.

    • This endpoint has two optional parameters:

      • data_source_id: This must be provided if there are multiple SharePoint connections under the same customer ID.

      • cursor: This is used for pagination.

    • Each site will return three properties:

      • site_url

      • site_display_name

      • site_name: This value is used for the sharepoint_site_name when connecting sites with integrations/oauth_url.

  • Please note that this endpoint requires an additional scope, Sites.Read.All, which Carbon does not request by default. In order to list sites, connect sites, and sync files from connected sites, you must include Sites.Read.All in the /integrations/oauth_url through the scopes parameter, along with the required scopes: openid, offline_access, User.Read, and Files.Read.All.
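
  • A minimal sketch of the flow (the service field name is an assumption; the scopes, endpoints, and data_source_id parameter come from the notes above):

# Sketch: request an OAuth URL with Sites.Read.All, then list the account's sites.
import requests

headers = {"authorization": "Bearer <API_KEY>", "customer-id": "<CUSTOMER_ID>"}

requests.post(
    "https://api.carbon.ai/integrations/oauth_url",
    headers=headers,
    json={
        "service": "SHAREPOINT",
        "scopes": ["openid", "offline_access", "User.Read",
                   "Files.Read.All", "Sites.Read.All"],
    },
)

# After the user completes the OAuth flow:
sites = requests.post(
    "https://api.carbon.ai/integrations/sharepoint/sites/list",
    headers=headers,
    json={"data_source_id": 123},  # needed only if multiple SharePoint connections exist
).json()
print(sites)  # each site has site_url, site_display_name, and site_name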

New Filters for Source Items

  • We added two new optional filters for /integrations/items/list:

    • file_formats: Filter based on all file formats supported by Carbon. This is a new feature that won’t be backfilled, so it will only apply to data sources that are synced or re-synced moving forward.

    • item_types: Filter on different item types at the source; for example, help centers will have TICKET and ARTICLE, while Google Drive will have FILE and FOLDER.
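
  • A minimal sketch of applying the new filters (the filters wrapper and data_source_id field are assumptions; file_formats and item_types come from this note):

# Sketch: list only PDF files among a data source's items.
import requests

resp = requests.post(
    "https://api.carbon.ai/integrations/items/list",
    headers={"authorization": "Bearer <API_KEY>", "customer-id": "<CUSTOMER_ID>"},
    json={
        "data_source_id": 123,
        "filters": {"file_formats": ["PDF"], "item_types": ["FILE"]},
    },
)
print(resp.json())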

Return external_url for Freshdesk

  • We now return the external_url value for Freshdesk articles.

Sync Outlook Emails Across All Folders

  • We have introduced support for syncing Outlook emails across all user folders. To do this, specify the folder as null; if this input is omitted, the folder defaults to the inbox.
