feat(community): Support google cloud storage document loader #7740

Merged
@@ -0,0 +1,33 @@
---
hide_table_of_contents: true
sidebar_class_name: node-only
---

# Google Cloud Storage

:::tip Compatibility
Only available on Node.js.
:::

This covers how to load a Google Cloud Storage file into LangChain documents.

## Setup

To use this loader, you'll need to have Unstructured already set up and ready to use at an available URL endpoint. It can also be configured to run locally.

See the docs [here](/docs/integrations/document_loaders/file_loaders/unstructured) for information on how to do that.

You'll also need to install the official Google Cloud Storage SDK:

```bash npm2yarn
npm install @langchain/community @langchain/core @google-cloud/storage
```

## Usage

Once Unstructured is configured, you can use the Google Cloud Storage loader to download a file and convert it into `Document` objects.

import CodeBlock from "@theme/CodeBlock";
import Example from "@examples/document_loaders/google_cloud_storage.ts";

<CodeBlock language="typescript">{Example}</CodeBlock>
17 changes: 17 additions & 0 deletions examples/src/document_loaders/google_cloud_storage.ts
@@ -0,0 +1,17 @@
import { GoogleCloudStorageLoader } from "@langchain/community/document_loaders/web/google_cloud_storage";

const loader = new GoogleCloudStorageLoader({
  bucket: "my-bucket-123",
  file: "path/to/file.pdf",
  storageOptions: {
    keyFilename: "/path/to/keyfile.json",
  },
  unstructuredLoaderOptions: {
    apiUrl: "http://localhost:8000/general/v0/general",
    apiKey: "", // this will soon be required
  },
});

const docs = await loader.load();

console.log(docs);
4 changes: 4 additions & 0 deletions libs/langchain-community/.gitignore
@@ -934,6 +934,10 @@ document_loaders/web/college_confidential.cjs
document_loaders/web/college_confidential.js
document_loaders/web/college_confidential.d.ts
document_loaders/web/college_confidential.d.cts
document_loaders/web/google_cloud_storage.cjs
document_loaders/web/google_cloud_storage.js
document_loaders/web/google_cloud_storage.d.ts
document_loaders/web/google_cloud_storage.d.cts
document_loaders/web/gitbook.cjs
document_loaders/web/gitbook.js
document_loaders/web/gitbook.d.ts
1 change: 1 addition & 0 deletions libs/langchain-community/langchain.config.js
@@ -289,6 +289,7 @@ export const config = {
"document_loaders/web/playwright": "document_loaders/web/playwright",
"document_loaders/web/college_confidential":
"document_loaders/web/college_confidential",
"document_loaders/web/google_cloud_storage": "document_loaders/web/google_cloud_storage",
"document_loaders/web/gitbook": "document_loaders/web/gitbook",
"document_loaders/web/hn": "document_loaders/web/hn",
"document_loaders/web/imsdb": "document_loaders/web/imsdb",
15 changes: 14 additions & 1 deletion libs/langchain-community/package.json
@@ -76,7 +76,7 @@
"@gomomento/sdk": "^1.51.1",
"@gomomento/sdk-core": "^1.51.1",
"@google-ai/generativelanguage": "^2.5.0",
-"@google-cloud/storage": "^7.7.0",
+"@google-cloud/storage": "^7.15.2",
"@gradientai/nodejs-sdk": "^1.2.0",
"@huggingface/inference": "^2.6.4",
"@huggingface/transformers": "^3.2.3",
@@ -2823,6 +2823,15 @@
"import": "./document_loaders/web/college_confidential.js",
"require": "./document_loaders/web/college_confidential.cjs"
},
"./document_loaders/web/google_cloud_storage": {
"types": {
"import": "./document_loaders/web/google_cloud_storage.d.ts",
"require": "./document_loaders/web/google_cloud_storage.d.cts",
"default": "./document_loaders/web/google_cloud_storage.d.ts"
},
"import": "./document_loaders/web/google_cloud_storage.js",
"require": "./document_loaders/web/google_cloud_storage.cjs"
},
"./document_loaders/web/gitbook": {
"types": {
"import": "./document_loaders/web/gitbook.d.ts",
@@ -4150,6 +4159,10 @@
"document_loaders/web/college_confidential.js",
"document_loaders/web/college_confidential.d.ts",
"document_loaders/web/college_confidential.d.cts",
"document_loaders/web/google_cloud_storage.cjs",
"document_loaders/web/google_cloud_storage.js",
"document_loaders/web/google_cloud_storage.d.ts",
"document_loaders/web/google_cloud_storage.d.cts",
"document_loaders/web/gitbook.cjs",
"document_loaders/web/gitbook.js",
"document_loaders/web/gitbook.d.ts",
@@ -0,0 +1,104 @@
import { Storage, StorageOptions } from "@google-cloud/storage";
import * as os from "node:os";
import * as path from "node:path";
import * as fsDefault from "node:fs";
import { BaseDocumentLoader } from "@langchain/core/document_loaders/base";
import { Document } from "@langchain/core/documents";
import {
  UnstructuredLoaderOptions,
  UnstructuredLoader,
} from "../fs/unstructured.js";

/**
 * Represents the parameters for the GoogleCloudStorageLoader class. It includes
 * properties such as the GCS bucket, file destination, the options for the UnstructuredLoader and
 * the options for Google Cloud Storage
 */
export type GcsLoaderConfig = {
  fs?: typeof fsDefault;
  bucket: string;
  file: string;
  unstructuredLoaderOptions: UnstructuredLoaderOptions;
  storageOptions: StorageOptions;
Review comment (Contributor): `StorageOptions` seems to be used only for authentication. Does this also work with Application Default Credentials? If so, this should be an optional field.
};
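
One way to act on the review question about Application Default Credentials is to make `storageOptions` optional and fall back to an empty options object. The following is only a minimal sketch under that assumption, not the merged implementation: `StorageOptionsLike`, `GcsLoaderConfigSketch`, and `resolveStorageOptions` are hypothetical names, and `StorageOptionsLike` is a local stand-in for the real `StorageOptions` type so the snippet stays self-contained.

```typescript
// Local stand-in for `StorageOptions` from "@google-cloud/storage" (assumption:
// only the fields needed for this illustration are declared here).
type StorageOptionsLike = {
  keyFilename?: string;
  projectId?: string;
};

// Hypothetical variant of GcsLoaderConfig in which `storageOptions` is optional.
type GcsLoaderConfigSketch = {
  bucket: string;
  file: string;
  storageOptions?: StorageOptionsLike;
};

// Returns the options that would be handed to the storage client constructor.
// An empty object leaves credential discovery to the client library, which is
// where Application Default Credentials would come into play.
function resolveStorageOptions(
  config: GcsLoaderConfigSketch
): StorageOptionsLike {
  return config.storageOptions ?? {};
}
```

With this shape, callers running where default credentials are available could omit `storageOptions` entirely, while explicit key files would keep working unchanged.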

/**
 * A class that extends the BaseDocumentLoader class. It represents a
 * document loader for loading files from a Google Cloud Storage bucket.
 * @example
 * ```typescript
 * const loader = new GoogleCloudStorageLoader({
 *   bucket: "<my-bucket-name>",
 *   file: "<file-path>",
 *   storageOptions: {
 *     keyFilename: "<key-file-name-path>",
 *   },
 *   unstructuredLoaderOptions: {
 *     apiUrl: "<unstructured-API-URL>",
 *     apiKey: "<unstructured-API-key>",
 *   },
 * });
 * const docs = await loader.load();
 * ```
 */
export class GoogleCloudStorageLoader extends BaseDocumentLoader {
  private bucket: string;

  private file: string;

  private storageOptions: StorageOptions;

  private _fs: typeof fsDefault;

  private unstructuredLoaderOptions: UnstructuredLoaderOptions;

  constructor({
    fs = fsDefault,
    file,
    bucket,
    unstructuredLoaderOptions,
    storageOptions,
  }: GcsLoaderConfig) {
    super();
    this._fs = fs;
    this.bucket = bucket;
    this.file = file;
    this.unstructuredLoaderOptions = unstructuredLoaderOptions;
    this.storageOptions = storageOptions;
  }

  async load(): Promise<Document<Record<string, any>>[]> {
    const tempDir = this._fs.mkdtempSync(
Review comment (Collaborator): nit: could we do this without the filesystem as a passthrough?
      path.join(os.tmpdir(), "googlecloudstoragefileloader-")
    );
    const filePath = path.join(tempDir, this.file);

    try {
      const storage = new Storage(this.storageOptions);
      const bucket = storage.bucket(this.bucket);

      const [buffer] = await bucket.file(this.file).download();
      this._fs.mkdirSync(path.dirname(filePath), { recursive: true });

      this._fs.writeFileSync(filePath, buffer);
      // eslint-disable-next-line @typescript-eslint/no-explicit-any
    } catch (e: any) {
      throw new Error(
        `Failed to download file ${this.file} from google cloud storage bucket ${this.bucket}: ${e.message}`
      );
    }

    try {
      const unstructuredLoader = new UnstructuredLoader(
        filePath,
        this.unstructuredLoaderOptions
      );
      const docs = await unstructuredLoader.load();
      return docs;
    } catch {
      throw new Error(
        `Failed to load file ${filePath} using unstructured loader.`
      );
    }
  }
}
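
The collaborator's nit about avoiding the filesystem passthrough could, in principle, be addressed by handing the downloaded bytes straight to the parser instead of writing a temporary file. The sketch below illustrates that flow only. `FileDownloader`, `BufferParser`, `baseName`, and `loadFromMemory` are hypothetical names standing in for the GCS client and the Unstructured loader; whether the real Unstructured loader accepts an in-memory buffer should be checked against its current API before relying on this.

```typescript
// Hypothetical interfaces: stand-ins for the GCS client and the Unstructured
// loader, declared locally so the sketch is self-contained.
interface FileDownloader {
  download(bucket: string, file: string): Promise<Uint8Array>;
}

interface BufferParser {
  parse(input: { buffer: Uint8Array; fileName: string }): Promise<string[]>;
}

// Derives a file name from an object path such as "path/to/file.pdf".
function baseName(objectPath: string): string {
  return objectPath.slice(objectPath.lastIndexOf("/") + 1);
}

// Downloads the object into memory and parses it without touching the disk.
async function loadFromMemory(
  downloader: FileDownloader,
  parser: BufferParser,
  bucket: string,
  file: string
): Promise<string[]> {
  let buffer: Uint8Array;
  try {
    buffer = await downloader.download(bucket, file);
  } catch (e) {
    // Keep the underlying cause in the message, as the download branch of
    // the original load() does.
    const message = e instanceof Error ? e.message : String(e);
    throw new Error(
      `Failed to download file ${file} from bucket ${bucket}: ${message}`
    );
  }
  return parser.parse({ buffer, fileName: baseName(file) });
}
```

Besides removing the `fs` dependency, this shape would also avoid leaving a temporary directory behind after each `load()` call.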
1 change: 1 addition & 0 deletions libs/langchain-community/src/load/import_map.ts
@@ -80,6 +80,7 @@ export * as indexes__base from "../indexes/base.js";
export * as indexes__memory from "../indexes/memory.js";
export * as document_loaders__web__airtable from "../document_loaders/web/airtable.js";
export * as document_loaders__web__html from "../document_loaders/web/html.js";
export * as document_loaders__web__google_cloud_storage from "../document_loaders/web/google_cloud_storage.js";
export * as document_loaders__web__jira from "../document_loaders/web/jira.js";
export * as document_loaders__web__searchapi from "../document_loaders/web/searchapi.js";
export * as document_loaders__web__serpapi from "../document_loaders/web/serpapi.js";
48 changes: 45 additions & 3 deletions yarn.lock
@@ -6802,6 +6802,29 @@ __metadata:
languageName: node
linkType: hard

"@google-cloud/storage@npm:^7.15.2":
version: 7.15.2
resolution: "@google-cloud/storage@npm:7.15.2"
dependencies:
"@google-cloud/paginator": ^5.0.0
"@google-cloud/projectify": ^4.0.0
"@google-cloud/promisify": ^4.0.0
abort-controller: ^3.0.0
async-retry: ^1.3.3
duplexify: ^4.1.3
fast-xml-parser: ^4.4.1
gaxios: ^6.0.2
google-auth-library: ^9.6.3
html-entities: ^2.5.2
mime: ^3.0.0
p-limit: ^3.0.1
retry-request: ^7.0.0
teeny-request: ^9.0.0
uuid: ^8.0.0
checksum: 1c3a2614349ca9e16bc2949e1dad511e321ca38551a51987a972955a88bf21f7ad31dccec9b63233c5d678cccb84654575098f85a592a8d335b09cab8859ea4c
languageName: node
linkType: hard

"@google-cloud/storage@npm:^7.7.0":
version: 7.7.0
resolution: "@google-cloud/storage@npm:7.7.0"
@@ -8270,7 +8293,7 @@ __metadata:
"@gomomento/sdk": ^1.51.1
"@gomomento/sdk-core": ^1.51.1
"@google-ai/generativelanguage": ^2.5.0
-"@google-cloud/storage": ^7.7.0
+"@google-cloud/storage": ^7.15.2
"@gradientai/nodejs-sdk": ^1.2.0
"@huggingface/inference": ^2.6.4
"@huggingface/transformers": ^3.2.3
@@ -20670,6 +20693,18 @@ __metadata:
languageName: node
linkType: hard

"duplexify@npm:^4.1.3":
version: 4.1.3
resolution: "duplexify@npm:4.1.3"
dependencies:
end-of-stream: ^1.4.1
inherits: ^2.0.3
readable-stream: ^3.1.1
stream-shift: ^1.0.2
checksum: 9636a027345de3dd3c801594d01a7c73d9ce260019538beb1ee650bba7544e72f40a4d4902b52e1ab283dc32a06f210d42748773af02ff15e3064a9659deab7f
languageName: node
linkType: hard

"eastasianwidth@npm:^0.2.0":
version: 0.2.0
resolution: "eastasianwidth@npm:0.2.0"
@@ -24207,7 +24242,7 @@ __metadata:
languageName: node
linkType: hard

-"google-auth-library@npm:^9.15.1":
+"google-auth-library@npm:^9.15.1, google-auth-library@npm:^9.6.3":
version: 9.15.1
resolution: "google-auth-library@npm:9.15.1"
dependencies:
@@ -24844,7 +24879,7 @@ __metadata:
languageName: node
linkType: hard

-"html-entities@npm:^2.3.3":
+"html-entities@npm:^2.3.3, html-entities@npm:^2.5.2":
version: 2.5.2
resolution: "html-entities@npm:2.5.2"
checksum: b23f4a07d33d49ade1994069af4e13d31650e3fb62621e92ae10ecdf01d1a98065c78fd20fdc92b4c7881612210b37c275f2c9fba9777650ab0d6f2ceb3b99b6
@@ -35608,6 +35643,13 @@ __metadata:
languageName: node
linkType: hard

"stream-shift@npm:^1.0.2":
version: 1.0.3
resolution: "stream-shift@npm:1.0.3"
checksum: a24c0a3f66a8f9024bd1d579a533a53be283b4475d4e6b4b3211b964031447bdf6532dd1f3c2b0ad66752554391b7c62bd7ca4559193381f766534e723d50242
languageName: node
linkType: hard

"streamsearch@npm:^1.1.0":
version: 1.1.0
resolution: "streamsearch@npm:1.1.0"