diff --git a/.gitignore b/.gitignore index acfb63b..2ab7092 100644 --- a/.gitignore +++ b/.gitignore @@ -1,4 +1,6 @@ /telecrawl coverage.out dist/ +__pycache__/ +*.pyc .DS_Store diff --git a/CHANGELOG.md b/CHANGELOG.md index e9d7785..afa5025 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,10 +1,12 @@ # Changelog -All notable changes to this project are documented here. +## [0.2.1] - Unreleased + +### Added -The format follows Keep a Changelog, and this project uses Semantic Versioning. +- Archive Telegram contact records from local Postbox imports. (#7; thanks @joshp123) -## [0.2.1] - Unreleased +## [0.2.0] - 2026-05-31 ### Added diff --git a/README.md b/README.md index 7111847..b11eacf 100644 --- a/README.md +++ b/README.md @@ -13,7 +13,8 @@ It is local-first: - Normal archive/search commands do not upload data. - `backup push` uploads only age-encrypted shards when you run it explicitly. -- Telegram message text, chat names, sender names, and media metadata stay inside +- Telegram message text, chat names, sender names, contact phone numbers, + contact usernames, avatar path metadata, and media metadata stay inside encrypted backup payloads. ## Install @@ -89,9 +90,12 @@ fetch is attempted, so `--fetch-media` only tries media that is not already in the local archive. Native Postbox can tag link previews, polls, geo/live-geo, service messages, or -deleted messages as broad media candidates. `telecrawl` tries those during -`--fetch-media`, but only keeps them as media rows when Telegram returns a -downloadable file. +deleted messages as broad media candidates. `telecrawl` archives their decoded +message metadata separately from binary media, and only keeps them as media rows +when Telegram returns a downloadable file. +`metadata_json` is a local source-native Postbox payload for later rendering or +search; it is not a cross-source schema and can contain private Telegram +metadata. When no `--source` is provided on macOS, `telecrawl` checks Telegram Desktop `tdata` first, then the native Telegram for macOS group container. No backend @@ -103,14 +107,18 @@ telecrawl import --path "$HOME/Library/Group Containers/6N38VWS5BX.ru.keepcoder. Native macOS imports include every local `account-*` database they find; if more than one account is present, stored chat and sender IDs are account-scoped to -avoid collisions. They archive cached media by default. `--fetch-media` also uses -the existing native Telegram session to fetch missing cloud media when account -auth data is present; this does not launch Telegram or start a login/2FA flow. +avoid collisions. They archive cached media by default and store Telegram peer +records as contacts for message enrichment. Contacts can include phone numbers, +usernames, and archived avatar paths when those values exist locally, and are +visible through `telecrawl contacts`. `--fetch-media` also uses the existing +native Telegram session to fetch missing cloud media when account auth data is +present; this does not launch Telegram or start a login/2FA flow. Useful reads: ```bash telecrawl folders +telecrawl contacts telecrawl chats --limit 20 telecrawl chats --folder FOLDER_ID telecrawl chats --unread @@ -230,9 +238,10 @@ Git can still see cleartext metadata: - plaintext shard hashes - backup cadence and which encrypted shards changed -Git cannot read message text, chat names, sender names, or media metadata without -an age identity. Binary media files archived in `~/.telecrawl/media/` are local -only and are not included in backup shards. +Git cannot read message text, chat names, sender names, contact phone numbers, +contact usernames, avatar path metadata, or media metadata without an age +identity. Binary media files and cached avatar files archived in +`~/.telecrawl/media/` are local only and are not included in backup shards. Keep `~/.telecrawl/age.key` private. If you lose it and no other recipient can decrypt the backup, the encrypted backup cannot be restored. diff --git a/internal/cli/cli.go b/internal/cli/cli.go index 52b3076..940bedb 100644 --- a/internal/cli/cli.go +++ b/internal/cli/cli.go @@ -13,6 +13,7 @@ import ( "sort" "strings" "time" + "unicode" "github.com/openclaw/telecrawl/internal/backup" "github.com/openclaw/telecrawl/internal/store" @@ -102,6 +103,8 @@ func (r *runtime) dispatch(args []string) error { return r.runChats(args[1:]) case "folders": return r.runFolders(args[1:]) + case "contacts": + return r.runContacts(args[1:]) case "topics": return r.runTopics(args[1:]) case "messages": @@ -257,14 +260,14 @@ func storeImportResult(ctx context.Context, st *store.Store, result *telegramdes } refreshImportMediaStats(result) if strings.TrimSpace(chatFilter) == "" { - return st.ReplaceAll(ctx, result.Stats, result.Chats, result.Folders, result.FolderChats, result.Topics, result.Messages) + return st.ReplaceAll(ctx, result.Stats, result.Contacts, result.Chats, result.Folders, result.FolderChats, result.Topics, result.Messages) } if len(result.Chats) == 0 { return fmt.Errorf("telegram import returned no chats for --chat %s", chatFilter) } for _, chat := range result.Chats { partial := importResultForChat(*result, chat.JID) - if err := st.UpsertChat(ctx, partial.Stats, chat.JID, partial.Chats, partial.Folders, partial.FolderChats, partial.Topics, partial.Messages); err != nil { + if err := st.UpsertChat(ctx, partial.Stats, chat.JID, partial.Contacts, partial.Chats, partial.Folders, partial.FolderChats, partial.Topics, partial.Messages); err != nil { return err } } @@ -394,6 +397,29 @@ func importResultForChat(result telegramdesktop.ImportResult, chatJID string) te out.Messages = append(out.Messages, message) } } + out.Contacts = contactsForMessages(result.Contacts, out.Messages, chatJID) + return out +} + +func contactsForMessages(contacts []store.Contact, messages []store.Message, chatJID string) []store.Contact { + peerIDs := map[string]struct{}{} + if strings.TrimSpace(chatJID) != "" { + peerIDs[chatJID] = struct{}{} + } + for _, message := range messages { + if strings.TrimSpace(message.ChatJID) != "" { + peerIDs[message.ChatJID] = struct{}{} + } + if strings.TrimSpace(message.SenderJID) != "" { + peerIDs[message.SenderJID] = struct{}{} + } + } + out := make([]store.Contact, 0, len(peerIDs)) + for _, contact := range contacts { + if _, ok := peerIDs[contact.JID]; ok { + out = append(out, contact) + } + } return out } @@ -440,6 +466,116 @@ func (r *runtime) runFolders(args []string) error { }) } +func (r *runtime) runContacts(args []string) error { + if len(args) > 0 && args[0] == "export" { + return r.runContactsExport(args[1:]) + } + fs := flag.NewFlagSet("telecrawl contacts", flag.ContinueOnError) + fs.SetOutput(io.Discard) + limit := fs.Int("limit", 100, "") + if err := fs.Parse(args); err != nil { + return usageErr(err) + } + if fs.NArg() != 0 { + return usageErr(errors.New("contacts takes flags only")) + } + return r.withStore(func(st *store.Store) error { + contacts, err := st.ListContacts(r.ctx, *limit) + if err != nil { + return err + } + return r.print(contacts) + }) +} + +type contactExport struct { + Contacts []exportedContact `json:"contacts"` +} + +type exportedContact struct { + DisplayName string `json:"display_name"` + PhoneNumbers []string `json:"phone_numbers"` +} + +func (r *runtime) runContactsExport(args []string) error { + fs := flag.NewFlagSet("telecrawl contacts export", flag.ContinueOnError) + fs.SetOutput(io.Discard) + if err := fs.Parse(args); err != nil { + return usageErr(err) + } + if fs.NArg() != 0 { + return usageErr(errors.New("contacts export takes no arguments")) + } + return r.withStore(func(st *store.Store) error { + contacts, err := st.ExportContacts(r.ctx) + if err != nil { + return err + } + return r.print(contactExport{Contacts: exportContacts(contacts)}) + }) +} + +func exportContacts(contacts []store.Contact) []exportedContact { + out := make([]exportedContact, 0, len(contacts)) + for _, contact := range contacts { + name := contactDisplayName(contact) + phone := strings.TrimSpace(contact.Phone) + if name == "" || phone == "" { + continue + } + out = append(out, exportedContact{DisplayName: name, PhoneNumbers: []string{phone}}) + } + return out +} + +func contactDisplayName(contact store.Contact) string { + if name := cleanContactName(contact.FullName, contact); name != "" { + return name + } + return cleanContactName(strings.TrimSpace(contact.FirstName+" "+contact.LastName), contact) +} + +func cleanContactName(name string, contact store.Contact) string { + name = strings.TrimSpace(name) + switch { + case name == "": + return "" + case name == strings.TrimSpace(contact.Phone): + return "" + case name == strings.TrimSpace(contact.JID): + return "" + case name == strings.TrimSpace(contact.Username): + return "" + case name == strings.TrimSpace(contact.LID): + return "" + case strings.HasPrefix(name, "@"): + return "" + case looksLikePhone(name): + return "" + default: + return name + } +} + +func looksLikePhone(value string) bool { + value = strings.TrimSpace(value) + if value == "" { + return false + } + digits := 0 + other := 0 + for _, r := range value { + switch { + case unicode.IsDigit(r): + digits++ + case strings.ContainsRune(" +()-.", r): + default: + other++ + } + } + return digits >= 5 && other == 0 +} + func (r *runtime) runTopics(args []string) error { fs := flag.NewFlagSet("telecrawl topics", flag.ContinueOnError) fs.SetOutput(io.Discard) @@ -765,6 +901,8 @@ usage: telecrawl [--json] import [--path PATH] [--chat ID] [--dialogs-limit N] [--messages-limit N] [--fetch-media] telecrawl [--json] status telecrawl [--json] folders + telecrawl [--json] contacts [--limit N] + telecrawl --json contacts export telecrawl [--json] chats [--limit N] [--unread] [--folder ID] telecrawl [--json] topics --chat ID [--limit N] telecrawl [--json] messages [--chat ID] [--topic ID] [--limit N] [--after DATE] diff --git a/internal/cli/cli_test.go b/internal/cli/cli_test.go index 1e09ab5..f9b74b4 100644 --- a/internal/cli/cli_test.go +++ b/internal/cli/cli_test.go @@ -3,6 +3,7 @@ package cli import ( "bytes" "context" + "encoding/json" "os" "path/filepath" "slices" @@ -63,6 +64,129 @@ func TestStoreImportResultUpsertsReturnedAccountScopedChats(t *testing.T) { } } +func TestImportResultForChatFiltersContacts(t *testing.T) { + result := accountScopedImportResult("filtered") + partial := importResultForChat(result, "111") + + got := make([]string, 0, len(partial.Contacts)) + for _, contact := range partial.Contacts { + got = append(got, contact.JID) + } + want := []string{"111", "10"} + if !slices.Equal(got, want) { + t.Fatalf("contacts = %v, want %v", got, want) + } +} + +func TestContactsExportUsesContractShapeAndSkipsUnsafeNames(t *testing.T) { + ctx := context.Background() + db := filepath.Join(t.TempDir(), "telecrawl.db") + st, err := store.Open(ctx, db) + if err != nil { + t.Fatal(err) + } + defer func() { _ = st.Close() }() + contacts := make([]store.Contact, 0, 104) + for i := 0; i < 101; i++ { + contacts = append(contacts, store.Contact{ + JID: "safe-" + string(rune('a'+(i%26))) + "-" + string(rune('a'+((i/26)%26))), + Phone: "+1555010" + strings.Repeat("0", 3-len(string(rune('0'+(i%10))))) + string(rune('0'+(i%10))), + FullName: "Safe Person", + }) + } + contacts = append(contacts, + store.Contact{JID: "first-last", Phone: "+15559990001", FirstName: "First", LastName: "Last"}, + store.Contact{JID: "username-only", Phone: "+15559990002", Username: "handle", FullName: "@handle"}, + store.Contact{JID: "phone-only", Phone: "+15559990003", FullName: "+15559990003"}, + store.Contact{JID: "jid-only", Phone: "+15559990004", FullName: "jid-only"}, + store.Contact{JID: "blank-name", Phone: "+15559990005"}, + store.Contact{JID: "no-phone", FullName: "No Phone"}, + ) + if err := st.ReplaceAll(ctx, store.ImportStats{}, contacts, nil, nil, nil, nil, nil); err != nil { + t.Fatal(err) + } + var out, errOut bytes.Buffer + err = Run(ctx, []string{"--json", "--db", db, "contacts", "export"}, &out, &errOut) + if err != nil { + t.Fatalf("contacts export: %v stderr=%s", err, errOut.String()) + } + var payload struct { + Contacts []struct { + DisplayName string `json:"display_name"` + PhoneNumbers []string `json:"phone_numbers"` + JID string `json:"jid"` + Username string `json:"username"` + } `json:"contacts"` + } + if err := json.Unmarshal(out.Bytes(), &payload); err != nil { + t.Fatalf("json = %s err=%v", out.String(), err) + } + assertContactExportKeys(t, out.Bytes()) + if len(payload.Contacts) != 102 { + t.Fatalf("contacts = %d, want 102", len(payload.Contacts)) + } + var sawFirstLast bool + for _, contact := range payload.Contacts { + if contact.DisplayName == "First Last" { + sawFirstLast = true + } + if contact.DisplayName == "" || len(contact.PhoneNumbers) != 1 { + t.Fatalf("bad contact = %#v", contact) + } + if contact.JID != "" || contact.Username != "" { + t.Fatalf("leaked source fields = %#v", contact) + } + if strings.HasPrefix(contact.DisplayName, "@") || strings.HasPrefix(contact.DisplayName, "+") || contact.DisplayName == "jid-only" { + t.Fatalf("unsafe display name exported: %#v", contact) + } + } + if !sawFirstLast { + t.Fatalf("missing composed first/last name: %#v", payload.Contacts) + } +} + +func assertContactExportKeys(t *testing.T, data []byte) { + t.Helper() + var root map[string]json.RawMessage + if err := json.Unmarshal(data, &root); err != nil { + t.Fatal(err) + } + contactsJSON, ok := root["contacts"] + if !ok || len(root) != 1 { + t.Fatalf("root keys = %#v, want only contacts", root) + } + var contacts []map[string]json.RawMessage + if err := json.Unmarshal(contactsJSON, &contacts); err != nil { + t.Fatal(err) + } + for _, contact := range contacts { + if _, ok := contact["display_name"]; !ok { + t.Fatalf("contact keys = %#v, missing display_name", contact) + } + if _, ok := contact["phone_numbers"]; !ok { + t.Fatalf("contact keys = %#v, missing phone_numbers", contact) + } + if len(contact) != 2 { + t.Fatalf("contact keys = %#v, want only display_name and phone_numbers", contact) + } + } +} + +func TestMetadataAdvertisesContactExport(t *testing.T) { + manifest := controlManifest() + command, ok := manifest.Commands["contact-export"] + if !ok { + t.Fatalf("commands = %#v", manifest.Commands) + } + if command.Mutates || !command.JSON { + t.Fatalf("contact-export command = %#v", command) + } + want := []string{"telecrawl", "--json", "contacts", "export"} + if !slices.Equal(command.Argv, want) { + t.Fatalf("argv = %#v, want %#v", command.Argv, want) + } +} + func TestStoreImportResultPreservesArchivedMediaOnReimport(t *testing.T) { ctx := context.Background() st, err := store.Open(ctx, filepath.Join(t.TempDir(), "telecrawl.db")) @@ -271,13 +395,20 @@ func accountScopedImportResult(label string) telegramdesktop.ImportResult { now := time.Unix(1_800_000_000, 0).UTC() return telegramdesktop.ImportResult{ Stats: store.ImportStats{SourcePath: "postbox", StartedAt: now, FinishedAt: now}, + Contacts: []store.Contact{ + {JID: "111", FullName: "Account A"}, + {JID: "10", FullName: "Sender A"}, + {JID: "222", FullName: "Account B"}, + {JID: "20", FullName: "Sender B"}, + {JID: "999", FullName: "Unrelated"}, + }, Chats: []store.Chat{ {JID: "111", Kind: "chat", Name: "account a", LastMessageAt: now, MessageCount: 1}, {JID: "222", Kind: "chat", Name: "account b", LastMessageAt: now, MessageCount: 1}, }, Messages: []store.Message{ - {SourcePK: 1, ChatJID: "111", ChatName: "account a", MessageID: "0:1", Timestamp: now, Text: label + " a"}, - {SourcePK: 2, ChatJID: "222", ChatName: "account b", MessageID: "0:1", Timestamp: now, Text: label + " b"}, + {SourcePK: 1, ChatJID: "111", ChatName: "account a", MessageID: "0:1", SenderJID: "10", Timestamp: now, Text: label + " a"}, + {SourcePK: 2, ChatJID: "222", ChatName: "account b", MessageID: "0:1", SenderJID: "20", Timestamp: now, Text: label + " b"}, }, } } diff --git a/internal/cli/control.go b/internal/cli/control.go index d88c1e5..e3117ce 100644 --- a/internal/cli/control.go +++ b/internal/cli/control.go @@ -19,10 +19,11 @@ func controlManifest() control.Manifest { m.Capabilities = []string{"metadata", "doctor", "status", "sync", "search", "backup"} m.Privacy = control.Privacy{ContainsPrivateMessages: true, ExportsSecrets: false, LocalOnlyScopes: []string{"telegram-desktop", "telegram-macos-postbox", "sqlite", "encrypted-git-backup"}} m.Commands = map[string]control.Command{ - "doctor": {Title: "Doctor", Argv: []string{"telecrawl", "--json", "doctor"}, JSON: true}, - "status": {Title: "Status", Argv: []string{"telecrawl", "--json", "status"}, JSON: true}, - "sync": {Title: "Import", Argv: []string{"telecrawl", "--json", "import"}, JSON: true, Mutates: true}, - "search": {Title: "Search", Argv: []string{"telecrawl", "--json", "search"}, JSON: true}, + "doctor": {Title: "Doctor", Argv: []string{"telecrawl", "--json", "doctor"}, JSON: true}, + "status": {Title: "Status", Argv: []string{"telecrawl", "--json", "status"}, JSON: true}, + "sync": {Title: "Import", Argv: []string{"telecrawl", "--json", "import"}, JSON: true, Mutates: true}, + "search": {Title: "Search", Argv: []string{"telecrawl", "--json", "search"}, JSON: true}, + "contact-export": {Title: "Export contacts", Argv: []string{"telecrawl", "--json", "contacts", "export"}, JSON: true}, } return m } diff --git a/internal/cli/version.go b/internal/cli/version.go index c030e8f..6182dd0 100644 --- a/internal/cli/version.go +++ b/internal/cli/version.go @@ -1,3 +1,3 @@ package cli -var version = "0.1.0" +var version = "0.2.0" diff --git a/internal/store/export.go b/internal/store/export.go index 250c8df..f230dee 100644 --- a/internal/store/export.go +++ b/internal/store/export.go @@ -32,6 +32,10 @@ func (d SnapshotData) Validate() error { } func (s *Store) ExportAll(ctx context.Context) (SnapshotData, error) { + contacts, err := s.allContacts(ctx) + if err != nil { + return SnapshotData{}, err + } chats, err := s.ListChats(ctx, int(^uint(0)>>1), false) if err != nil { return SnapshotData{}, err @@ -52,7 +56,7 @@ func (s *Store) ExportAll(ctx context.Context) (SnapshotData, error) { if err != nil { return SnapshotData{}, err } - return SnapshotData{Chats: chats, Folders: folders, FolderChats: folderChats, Topics: topics, Messages: messages}, nil + return SnapshotData{Contacts: contacts, Chats: chats, Folders: folders, FolderChats: folderChats, Topics: topics, Messages: messages}, nil } func (s *Store) ImportSnapshot(ctx context.Context, data SnapshotData, sourcePath string, finishedAt time.Time) error { @@ -65,7 +69,47 @@ func (s *Store) ImportSnapshot(ctx context.Context, data SnapshotData, sourcePat stats.MediaMessages++ } } - return s.ReplaceAll(ctx, stats, data.Chats, data.Folders, data.FolderChats, data.Topics, data.Messages) + return s.ReplaceAll(ctx, stats, data.Contacts, data.Chats, data.Folders, data.FolderChats, data.Topics, data.Messages) +} + +func (s *Store) ListContacts(ctx context.Context, limit int) ([]Contact, error) { + if limit <= 0 { + limit = 100 + } + return s.contacts(ctx, limit) +} + +func (s *Store) ExportContacts(ctx context.Context) ([]Contact, error) { + return s.allContacts(ctx) +} + +func (s *Store) allContacts(ctx context.Context) ([]Contact, error) { + return s.contacts(ctx, 0) +} + +func (s *Store) contacts(ctx context.Context, limit int) ([]Contact, error) { + query := `select jid,coalesce(peer_type,''),coalesce(phone,''),coalesce(full_name,''),coalesce(first_name,''),coalesce(last_name,''),coalesce(business_name,''),coalesce(username,''),coalesce(lid,''),coalesce(about_text,''),coalesce(avatar_path,''),coalesce(updated_at,0) from contacts order by jid` + args := []any{} + if limit > 0 { + query += " limit ?" + args = append(args, limit) + } + rows, err := s.db.QueryContext(ctx, query, args...) + if err != nil { + return nil, err + } + defer func() { _ = rows.Close() }() + var out []Contact + for rows.Next() { + var c Contact + var updatedAt int64 + if err := rows.Scan(&c.JID, &c.PeerType, &c.Phone, &c.FullName, &c.FirstName, &c.LastName, &c.BusinessName, &c.Username, &c.LID, &c.AboutText, &c.AvatarPath, &updatedAt); err != nil { + return nil, err + } + c.UpdatedAt = fromUnix(updatedAt) + out = append(out, c) + } + return out, rows.Err() } func (s *Store) allFolderChats(ctx context.Context) ([]FolderChat, error) { diff --git a/internal/store/schema.go b/internal/store/schema.go index b3bc187..980b538 100644 --- a/internal/store/schema.go +++ b/internal/store/schema.go @@ -47,6 +47,7 @@ create table if not exists topics ( create table if not exists contacts ( jid text primary key, + peer_type text, phone text, full_name text, first_name text, @@ -55,6 +56,7 @@ create table if not exists contacts ( username text, lid text, about_text text, + avatar_path text, updated_at integer ); @@ -93,6 +95,10 @@ create table if not exists messages ( media_path text, media_url text, media_size integer, + metadata_type text, + metadata_title text, + metadata_url text, + metadata_json text, starred integer not null default 0, topic_id text, reply_to_msg_id text, diff --git a/internal/store/store.go b/internal/store/store.go index f07edd0..4a1a536 100644 --- a/internal/store/store.go +++ b/internal/store/store.go @@ -13,7 +13,7 @@ import ( _ "modernc.org/sqlite" ) -const schemaVersion = 2 +const schemaVersion = 4 type Store struct { db *sql.DB @@ -98,6 +98,7 @@ type Topic struct { type Contact struct { JID string `json:"jid"` + PeerType string `json:"peer_type,omitempty"` Phone string `json:"phone,omitempty"` FullName string `json:"full_name,omitempty"` FirstName string `json:"first_name,omitempty"` @@ -106,6 +107,7 @@ type Contact struct { Username string `json:"username,omitempty"` LID string `json:"lid,omitempty"` AboutText string `json:"about_text,omitempty"` + AvatarPath string `json:"avatar_path,omitempty"` UpdatedAt time.Time `json:"updated_at,omitzero"` } @@ -143,6 +145,10 @@ type Message struct { MediaPath string `json:"media_path,omitempty"` MediaURL string `json:"media_url,omitempty"` MediaSize int64 `json:"media_size,omitempty"` + MetadataType string `json:"metadata_type,omitempty"` + MetadataTitle string `json:"metadata_title,omitempty"` + MetadataURL string `json:"metadata_url,omitempty"` + MetadataJSON string `json:"metadata_json,omitempty"` Starred bool `json:"starred,omitempty"` TopicID string `json:"topic_id,omitempty"` ReplyToID string `json:"reply_to_message_id,omitempty"` @@ -207,7 +213,7 @@ func Open(ctx context.Context, path string) (*Store, error) { func (s *Store) Close() error { return s.db.Close() } func (s *Store) Path() string { return s.path } -func (s *Store) UpsertChat(ctx context.Context, stats ImportStats, chatJID string, chats []Chat, folders []Folder, folderChats []FolderChat, topics []Topic, messages []Message) error { +func (s *Store) UpsertChat(ctx context.Context, stats ImportStats, chatJID string, contacts []Contact, chats []Chat, folders []Folder, folderChats []FolderChat, topics []Topic, messages []Message) error { tx, err := s.db.BeginTx(ctx, nil) if err != nil { return err @@ -237,6 +243,9 @@ func (s *Store) UpsertChat(ctx context.Context, stats ImportStats, chatJID strin return err } } + if err := insertContacts(ctx, tx, contacts); err != nil { + return err + } for _, c := range chats { if preserveFolderState && c.FolderID == "" { c.FolderID = existingFolderID @@ -264,15 +273,8 @@ func (s *Store) UpsertChat(ctx context.Context, stats ImportStats, chatJID strin return err } } - for _, m := range messages { - if _, err := tx.ExecContext(ctx, `insert into messages(source_pk,chat_jid,chat_name,msg_id,sender_jid,sender_name,ts,from_me,text,raw_type,message_type,media_type,media_title,media_path,media_url,media_size,starred,topic_id,reply_to_msg_id,reply_to_chat_jid,thread_id,edit_ts,forward_json,reactions_json,views,forwards,replies_count,pinned) values(?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)`, - m.SourcePK, m.ChatJID, m.ChatName, m.MessageID, m.SenderJID, m.SenderName, unix(m.Timestamp), boolInt(m.FromMe), m.Text, m.RawType, m.MessageType, m.MediaType, m.MediaTitle, m.MediaPath, m.MediaURL, m.MediaSize, boolInt(m.Starred), m.TopicID, m.ReplyToID, m.ReplyToChat, m.ThreadID, unix(m.EditTime), m.ForwardJSON, m.ReactionsJSON, m.Views, m.Forwards, m.RepliesCount, boolInt(m.Pinned)); err != nil { - return err - } - if _, err := tx.ExecContext(ctx, `insert into messages_fts(rowid,text,chat,sender,media) values((select rowid from messages where source_pk=?),?,?,?,?)`, - m.SourcePK, strings.TrimSpace(m.Text+" "+m.MediaTitle), m.ChatName, m.SenderName, m.MediaType); err != nil { - return err - } + if err := insertMessages(ctx, tx, messages); err != nil { + return err } now := stats.FinishedAt if now.IsZero() { @@ -286,7 +288,7 @@ func (s *Store) UpsertChat(ctx context.Context, stats ImportStats, chatJID strin return tx.Commit() } -func (s *Store) ReplaceAll(ctx context.Context, stats ImportStats, chats []Chat, folders []Folder, folderChats []FolderChat, topics []Topic, messages []Message) error { +func (s *Store) ReplaceAll(ctx context.Context, stats ImportStats, contacts []Contact, chats []Chat, folders []Folder, folderChats []FolderChat, topics []Topic, messages []Message) error { tx, err := s.db.BeginTx(ctx, nil) if err != nil { return err @@ -297,6 +299,9 @@ func (s *Store) ReplaceAll(ctx context.Context, stats ImportStats, chats []Chat, return err } } + if err := insertContacts(ctx, tx, contacts); err != nil { + return err + } for _, c := range chats { if _, err := tx.ExecContext(ctx, `insert into chats(id,kind,name,username,last_message_at,unread_count,message_count,folder_id,forum) values(?,?,?,?,?,?,?,?,?)`, parseInt64(c.JID), c.Kind, c.Name, c.Username, unix(c.LastMessageAt), c.UnreadCount, c.MessageCount, c.FolderID, boolInt(c.Forum)); err != nil { @@ -321,15 +326,8 @@ func (s *Store) ReplaceAll(ctx context.Context, stats ImportStats, chats []Chat, return err } } - for _, m := range messages { - if _, err := tx.ExecContext(ctx, `insert into messages(source_pk,chat_jid,chat_name,msg_id,sender_jid,sender_name,ts,from_me,text,raw_type,message_type,media_type,media_title,media_path,media_url,media_size,starred,topic_id,reply_to_msg_id,reply_to_chat_jid,thread_id,edit_ts,forward_json,reactions_json,views,forwards,replies_count,pinned) values(?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)`, - m.SourcePK, m.ChatJID, m.ChatName, m.MessageID, m.SenderJID, m.SenderName, unix(m.Timestamp), boolInt(m.FromMe), m.Text, m.RawType, m.MessageType, m.MediaType, m.MediaTitle, m.MediaPath, m.MediaURL, m.MediaSize, boolInt(m.Starred), m.TopicID, m.ReplyToID, m.ReplyToChat, m.ThreadID, unix(m.EditTime), m.ForwardJSON, m.ReactionsJSON, m.Views, m.Forwards, m.RepliesCount, boolInt(m.Pinned)); err != nil { - return err - } - if _, err := tx.ExecContext(ctx, `insert into messages_fts(rowid,text,chat,sender,media) values((select rowid from messages where source_pk=?),?,?,?,?)`, - m.SourcePK, strings.TrimSpace(m.Text+" "+m.MediaTitle), m.ChatName, m.SenderName, m.MediaType); err != nil { - return err - } + if err := insertMessages(ctx, tx, messages); err != nil { + return err } now := stats.FinishedAt if now.IsZero() { @@ -343,6 +341,33 @@ func (s *Store) ReplaceAll(ctx context.Context, stats ImportStats, chats []Chat, return tx.Commit() } +func insertContacts(ctx context.Context, tx *sql.Tx, contacts []Contact) error { + for _, c := range contacts { + if strings.TrimSpace(c.JID) == "" { + continue + } + if _, err := tx.ExecContext(ctx, `insert into contacts(jid,peer_type,phone,full_name,first_name,last_name,business_name,username,lid,about_text,avatar_path,updated_at) values(?,?,?,?,?,?,?,?,?,?,?,?) on conflict(jid) do update set peer_type=excluded.peer_type, phone=excluded.phone, full_name=excluded.full_name, first_name=excluded.first_name, last_name=excluded.last_name, business_name=excluded.business_name, username=excluded.username, lid=excluded.lid, about_text=excluded.about_text, avatar_path=excluded.avatar_path, updated_at=excluded.updated_at`, + c.JID, c.PeerType, c.Phone, c.FullName, c.FirstName, c.LastName, c.BusinessName, c.Username, c.LID, c.AboutText, c.AvatarPath, unix(c.UpdatedAt)); err != nil { + return err + } + } + return nil +} + +func insertMessages(ctx context.Context, tx *sql.Tx, messages []Message) error { + for _, m := range messages { + if _, err := tx.ExecContext(ctx, `insert into messages(source_pk,chat_jid,chat_name,msg_id,sender_jid,sender_name,ts,from_me,text,raw_type,message_type,media_type,media_title,media_path,media_url,media_size,metadata_type,metadata_title,metadata_url,metadata_json,starred,topic_id,reply_to_msg_id,reply_to_chat_jid,thread_id,edit_ts,forward_json,reactions_json,views,forwards,replies_count,pinned) values(?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)`, + m.SourcePK, m.ChatJID, m.ChatName, m.MessageID, m.SenderJID, m.SenderName, unix(m.Timestamp), boolInt(m.FromMe), m.Text, m.RawType, m.MessageType, m.MediaType, m.MediaTitle, m.MediaPath, m.MediaURL, m.MediaSize, m.MetadataType, m.MetadataTitle, m.MetadataURL, m.MetadataJSON, boolInt(m.Starred), m.TopicID, m.ReplyToID, m.ReplyToChat, m.ThreadID, unix(m.EditTime), m.ForwardJSON, m.ReactionsJSON, m.Views, m.Forwards, m.RepliesCount, boolInt(m.Pinned)); err != nil { + return err + } + if _, err := tx.ExecContext(ctx, `insert into messages_fts(rowid,text,chat,sender,media) values((select rowid from messages where source_pk=?),?,?,?,?)`, + m.SourcePK, strings.TrimSpace(m.Text+" "+m.MediaTitle+" "+m.MetadataTitle+" "+m.MetadataURL), m.ChatName, m.SenderName, m.MediaType); err != nil { + return err + } + } + return nil +} + func (s *Store) Status(ctx context.Context) (Status, error) { out := Status{DBPath: s.path} for _, c := range []struct { @@ -500,11 +525,11 @@ func (s *Store) messages(ctx context.Context, filter MessageFilter, search bool) if filter.Limit <= 0 { filter.Limit = 50 } - query := `select source_pk,chat_jid,chat_name,msg_id,sender_jid,sender_name,ts,edit_ts,from_me,text,raw_type,message_type,media_type,media_title,media_path,media_url,media_size,starred,topic_id,reply_to_msg_id,reply_to_chat_jid,thread_id,forward_json,reactions_json,views,forwards,replies_count,pinned,'' from messages where 1=1` + query := `select source_pk,chat_jid,chat_name,msg_id,sender_jid,sender_name,ts,edit_ts,from_me,text,raw_type,message_type,media_type,media_title,media_path,media_url,media_size,coalesce(metadata_type,''),coalesce(metadata_title,''),coalesce(metadata_url,''),coalesce(metadata_json,''),starred,topic_id,reply_to_msg_id,reply_to_chat_jid,thread_id,forward_json,reactions_json,views,forwards,replies_count,pinned,'' from messages where 1=1` args := []any{} prefix := "" if search { - query = `select m.source_pk,m.chat_jid,m.chat_name,m.msg_id,m.sender_jid,m.sender_name,m.ts,m.edit_ts,m.from_me,m.text,m.raw_type,m.message_type,m.media_type,m.media_title,m.media_path,m.media_url,m.media_size,m.starred,m.topic_id,m.reply_to_msg_id,m.reply_to_chat_jid,m.thread_id,m.forward_json,m.reactions_json,m.views,m.forwards,m.replies_count,m.pinned,snippet(messages_fts,0,'[',']','...',12) from messages_fts f join messages m on m.rowid=f.rowid where messages_fts match ?` + query = `select m.source_pk,m.chat_jid,m.chat_name,m.msg_id,m.sender_jid,m.sender_name,m.ts,m.edit_ts,m.from_me,m.text,m.raw_type,m.message_type,m.media_type,m.media_title,m.media_path,m.media_url,m.media_size,coalesce(m.metadata_type,''),coalesce(m.metadata_title,''),coalesce(m.metadata_url,''),coalesce(m.metadata_json,''),m.starred,m.topic_id,m.reply_to_msg_id,m.reply_to_chat_jid,m.thread_id,m.forward_json,m.reactions_json,m.views,m.forwards,m.replies_count,m.pinned,snippet(messages_fts,0,'[',']','...',12) from messages_fts f join messages m on m.rowid=f.rowid where messages_fts match ?` args = append(args, filter.Query) prefix = "m." } @@ -556,7 +581,7 @@ func (s *Store) messages(ctx context.Context, filter MessageFilter, search bool) var m Message var ts, editTS int64 var fromMe, starred, pinned int - if err := rows.Scan(&m.SourcePK, &m.ChatJID, &m.ChatName, &m.MessageID, &m.SenderJID, &m.SenderName, &ts, &editTS, &fromMe, &m.Text, &m.RawType, &m.MessageType, &m.MediaType, &m.MediaTitle, &m.MediaPath, &m.MediaURL, &m.MediaSize, &starred, &m.TopicID, &m.ReplyToID, &m.ReplyToChat, &m.ThreadID, &m.ForwardJSON, &m.ReactionsJSON, &m.Views, &m.Forwards, &m.RepliesCount, &pinned, &m.Snippet); err != nil { + if err := rows.Scan(&m.SourcePK, &m.ChatJID, &m.ChatName, &m.MessageID, &m.SenderJID, &m.SenderName, &ts, &editTS, &fromMe, &m.Text, &m.RawType, &m.MessageType, &m.MediaType, &m.MediaTitle, &m.MediaPath, &m.MediaURL, &m.MediaSize, &m.MetadataType, &m.MetadataTitle, &m.MetadataURL, &m.MetadataJSON, &starred, &m.TopicID, &m.ReplyToID, &m.ReplyToChat, &m.ThreadID, &m.ForwardJSON, &m.ReactionsJSON, &m.Views, &m.Forwards, &m.RepliesCount, &pinned, &m.Snippet); err != nil { return nil, err } m.Timestamp = fromUnix(ts) @@ -587,6 +612,14 @@ func migrate(ctx context.Context, db *sql.DB) error { "forwards": "integer not null default 0", "replies_count": "integer not null default 0", "pinned": "integer not null default 0", + "metadata_type": "text", + "metadata_title": "text", + "metadata_url": "text", + "metadata_json": "text", + }, + "contacts": { + "peer_type": "text", + "avatar_path": "text", }, } for table, defs := range adds { diff --git a/internal/store/store_test.go b/internal/store/store_test.go index 5d32e51..eedae1a 100644 --- a/internal/store/store_test.go +++ b/internal/store/store_test.go @@ -2,6 +2,7 @@ package store import ( "context" + "database/sql" "path/filepath" "testing" "time" @@ -12,6 +13,16 @@ func TestSnapshotRoundTripPreservesTelegramStructure(t *testing.T) { ctx := context.Background() now := time.Date(2026, 5, 9, 3, 17, 53, 0, time.UTC) data := SnapshotData{ + Contacts: []Contact{{ + JID: "9", + PeerType: "user", + Phone: "+15551234567", + FullName: "Peter Example", + FirstName: "Peter", + LastName: "Example", + Username: "peter", + UpdatedAt: now, + }}, Chats: []Chat{{ JID: "-10042", Kind: "channel", @@ -63,6 +74,10 @@ func TestSnapshotRoundTripPreservesTelegramStructure(t *testing.T) { MediaType: "webpage", MediaTitle: "GitHub", MediaSize: 123, + MetadataType: "web_page", + MetadataTitle: "GitHub", + MetadataURL: "https://github.com/openclaw/telecrawl", + MetadataJSON: `{"url":"https://github.com/openclaw/telecrawl"}`, ForwardJSON: `{"from_name":"someone"}`, ReactionsJSON: `{"results":[]}`, Views: 10, @@ -89,6 +104,9 @@ func TestSnapshotRoundTripPreservesTelegramStructure(t *testing.T) { if got := len(exported.Topics); got != 1 { t.Fatalf("topics = %d, want 1", got) } + if got := len(exported.Contacts); got != 1 { + t.Fatalf("contacts = %d, want 1", got) + } restored := openTestStore(t, filepath.Join(t.TempDir(), "restored.db")) if err := restored.ImportSnapshot(ctx, exported, "backup", now); err != nil { @@ -116,9 +134,16 @@ func TestSnapshotRoundTripPreservesTelegramStructure(t *testing.T) { t.Fatalf("messages = %d, want 1", len(messages)) } msg := messages[0] - if msg.ReplyToID != "17" || msg.ReactionsJSON == "" || msg.ForwardJSON == "" || msg.Views != 10 || !msg.Pinned { + if msg.ReplyToID != "17" || msg.ReactionsJSON == "" || msg.ForwardJSON == "" || msg.Views != 10 || !msg.Pinned || msg.MetadataType != "web_page" || msg.MetadataURL == "" { t.Fatalf("message metadata lost: %#v", msg) } + restoredExport, err := restored.ExportAll(ctx) + if err != nil { + t.Fatal(err) + } + if len(restoredExport.Contacts) != 1 || restoredExport.Contacts[0].Phone != "+15551234567" || restoredExport.Contacts[0].PeerType != "user" { + t.Fatalf("contact lost: %#v", restoredExport.Contacts) + } } func openTestStore(t *testing.T, path string) *Store { @@ -135,6 +160,74 @@ func openTestStore(t *testing.T, path string) *Store { return st } +func TestOpenMigratesSchema2MessageMetadataColumns(t *testing.T) { + t.Parallel() + ctx := context.Background() + path := filepath.Join(t.TempDir(), "schema2.db") + db, err := sql.Open("sqlite", path) + if err != nil { + t.Fatal(err) + } + if _, err := db.ExecContext(ctx, ` +create table messages ( + rowid integer primary key autoincrement, + source_pk integer not null unique, + chat_jid text not null, + chat_name text, + msg_id text not null, + sender_jid text, + sender_name text, + ts integer not null, + from_me integer not null, + text text, + raw_type integer not null default 0, + message_type text, + media_type text, + media_title text, + media_path text, + media_url text, + media_size integer, + starred integer not null default 0, + topic_id text, + reply_to_msg_id text, + reply_to_chat_jid text, + thread_id text, + edit_ts integer, + forward_json text, + reactions_json text, + views integer not null default 0, + forwards integer not null default 0, + replies_count integer not null default 0, + pinned integer not null default 0 +); +pragma user_version = 2; +`); err != nil { + _ = db.Close() + t.Fatal(err) + } + if err := db.Close(); err != nil { + t.Fatal(err) + } + + st := openTestStore(t, path) + cols, err := columns(ctx, st.db, "messages") + if err != nil { + t.Fatal(err) + } + for _, name := range []string{"metadata_type", "metadata_title", "metadata_url", "metadata_json"} { + if !cols[name] { + t.Fatalf("missing migrated column %q", name) + } + } + var version int + if err := st.db.QueryRowContext(ctx, "pragma user_version").Scan(&version); err != nil { + t.Fatal(err) + } + if version != schemaVersion { + t.Fatalf("user_version = %d, want %d", version, schemaVersion) + } +} + func TestUpsertChatPreservesUnrelatedChats(t *testing.T) { t.Parallel() ctx := context.Background() @@ -154,7 +247,9 @@ func TestUpsertChatPreservesUnrelatedChats(t *testing.T) { msgB2 := Message{SourcePK: 3, ChatJID: "-1002", ChatName: "Chat B", MessageID: "2", SenderJID: "20", SenderName: "Bob", Timestamp: later, Text: "hello b2", MessageType: "Message"} initial := ImportStats{SourcePath: "tdata", DBPath: st.Path(), Chats: 2, Messages: 3, StartedAt: now, FinishedAt: now} - if err := st.ReplaceAll(ctx, initial, + if err := st.ReplaceAll( + ctx, initial, + nil, []Chat{chatA, chatB}, []Folder{{ID: "1", Title: "F1"}, {ID: "2", Title: "F2"}}, []FolderChat{fcA, fcB}, @@ -168,7 +263,9 @@ func TestUpsertChatPreservesUnrelatedChats(t *testing.T) { updatedMsgA := Message{SourcePK: 4, ChatJID: "-1001", ChatName: "Chat A Updated", MessageID: "2", SenderJID: "10", SenderName: "Alice", Timestamp: later, Text: "updated a", MessageType: "Message", MediaType: "photo", MediaTitle: "pic.jpg"} upsertStats := ImportStats{SourcePath: "tdata", DBPath: st.Path(), Chats: 1, Messages: 1, MediaMessages: 1, StartedAt: later, FinishedAt: later} - if err := st.UpsertChat(ctx, upsertStats, "-1001", + if err := st.UpsertChat( + ctx, upsertStats, "-1001", + nil, []Chat{updatedChatA}, nil, nil, nil, diff --git a/internal/telegramdesktop/importer.go b/internal/telegramdesktop/importer.go index 4a62900..0ed1e3a 100644 --- a/internal/telegramdesktop/importer.go +++ b/internal/telegramdesktop/importer.go @@ -46,6 +46,7 @@ type ExistingMediaRef struct { type ImportResult struct { Stats store.ImportStats + Contacts []store.Contact Chats []store.Chat Folders []store.Folder FolderChats []store.FolderChat @@ -77,6 +78,20 @@ type pyResult struct { FolderID string `json:"folder_id"` Forum bool `json:"forum"` } `json:"chats"` + Contacts []struct { + ID string `json:"id"` + PeerType string `json:"peer_type"` + Phone string `json:"phone"` + FullName string `json:"full_name"` + FirstName string `json:"first_name"` + LastName string `json:"last_name"` + BusinessName string `json:"business_name"` + Username string `json:"username"` + LID string `json:"lid"` + AboutText string `json:"about_text"` + AvatarPath string `json:"avatar_path"` + UpdatedAt string `json:"updated_at"` + } `json:"contacts"` Folders []struct { ID string `json:"id"` Title string `json:"title"` @@ -124,6 +139,10 @@ type pyResult struct { MediaTitle string `json:"media_title"` MediaPath string `json:"media_path"` MediaSize int64 `json:"media_size"` + MetadataType string `json:"metadata_type"` + MetadataTitle string `json:"metadata_title"` + MetadataURL string `json:"metadata_url"` + MetadataJSON string `json:"metadata_json"` Views int `json:"views"` Forwards int `json:"forwards"` RepliesCount int `json:"replies_count"` @@ -173,7 +192,11 @@ func Import(ctx context.Context, opts ImportOptions, dbPath string) (ImportResul return ImportResult{}, err } result := decodeImportResult(raw, dbPath) - if err := copyImportedMedia(result.Messages, mediaArchiveDir(dbPath), &result.Stats); err != nil { + archiveDir := mediaArchiveDir(dbPath) + if err := copyImportedContactAvatars(result.Contacts, archiveDir); err != nil { + return ImportResult{}, err + } + if err := copyImportedMedia(result.Messages, archiveDir, &result.Stats); err != nil { return ImportResult{}, err } return result, nil @@ -200,6 +223,14 @@ func Import(ctx context.Context, opts ImportOptions, dbPath string) (ImportResul defer func() { _ = os.RemoveAll(mediaTempDir) }() args = append(args, "--fetch-media", "--media-output-dir", mediaTempDir) } + existingRefsPath, cleanupExistingRefs, err := writeExistingMediaRefs(opts, source.path) + if err != nil { + return ImportResult{}, err + } + defer cleanupExistingRefs() + if existingRefsPath != "" { + args = append(args, "--existing-media-refs", existingRefsPath) + } if opts.ChatID != "" { args = append(args, "--chat", opts.ChatID) } @@ -208,7 +239,11 @@ func Import(ctx context.Context, opts ImportOptions, dbPath string) (ImportResul return ImportResult{}, err } result := decodeImportResult(raw, dbPath) - if err := copyImportedMedia(result.Messages, mediaArchiveDir(dbPath), &result.Stats); err != nil { + archiveDir := mediaArchiveDir(dbPath) + if err := copyImportedContactAvatars(result.Contacts, archiveDir); err != nil { + return ImportResult{}, err + } + if err := copyImportedMedia(result.Messages, archiveDir, &result.Stats); err != nil { return ImportResult{}, err } return result, nil @@ -356,6 +391,22 @@ func decodeImportResult(raw pyResult, dbPath string) ImportResult { Forum: c.Forum, }) } + for _, c := range raw.Contacts { + result.Contacts = append(result.Contacts, store.Contact{ + JID: c.ID, + PeerType: c.PeerType, + Phone: c.Phone, + FullName: c.FullName, + FirstName: c.FirstName, + LastName: c.LastName, + BusinessName: c.BusinessName, + Username: c.Username, + LID: c.LID, + AboutText: c.AboutText, + AvatarPath: c.AvatarPath, + UpdatedAt: parseTime(c.UpdatedAt), + }) + } for _, f := range raw.Folders { result.Folders = append(result.Folders, store.Folder{ ID: f.ID, @@ -410,6 +461,10 @@ func decodeImportResult(raw pyResult, dbPath string) ImportResult { MediaTitle: m.MediaTitle, MediaPath: m.MediaPath, MediaSize: m.MediaSize, + MetadataType: m.MetadataType, + MetadataTitle: m.MetadataTitle, + MetadataURL: m.MetadataURL, + MetadataJSON: m.MetadataJSON, Views: m.Views, Forwards: m.Forwards, RepliesCount: m.RepliesCount, @@ -515,6 +570,38 @@ func copyImportedMedia(messages []store.Message, archiveDir string, stats *store return nil } +func copyImportedContactAvatars(contacts []store.Contact, archiveDir string) error { + copiedSources := make(map[string]string) + for i := range contacts { + sourcePath := strings.TrimSpace(contacts[i].AvatarPath) + if sourcePath == "" { + continue + } + archivedPath, ok := copiedSources[sourcePath] + if !ok { + path, _, alreadyArchived, err := existingArchivedMedia(sourcePath, archiveDir) + if err != nil { + return err + } + if !alreadyArchived { + path, _, err = copyMediaFile(sourcePath, archiveDir) + } + if err != nil { + if isMediaSourceUnavailable(err) { + copiedSources[sourcePath] = "" + contacts[i].AvatarPath = "" + continue + } + return err + } + archivedPath = path + copiedSources[sourcePath] = archivedPath + } + contacts[i].AvatarPath = archivedPath + } + return nil +} + func existingArchivedMedia(sourcePath, archiveDir string) (string, int64, bool, error) { sourceAbs, err := filepath.Abs(filepath.Clean(sourcePath)) if err != nil { diff --git a/internal/telegramdesktop/importer_test.go b/internal/telegramdesktop/importer_test.go index c176a94..a3fe52d 100644 --- a/internal/telegramdesktop/importer_test.go +++ b/internal/telegramdesktop/importer_test.go @@ -124,6 +124,43 @@ func TestCopyImportedMediaArchivesByContentHash(t *testing.T) { } } +func TestCopyImportedContactAvatarsArchivesByContentHash(t *testing.T) { + t.Parallel() + source := filepath.Join(t.TempDir(), "source-avatar") + if err := os.WriteFile(source, []byte("fixture avatar"), 0o600); err != nil { + t.Fatal(err) + } + contacts := []store.Contact{ + {JID: "1", AvatarPath: source}, + {JID: "2", AvatarPath: source}, + {JID: "3", AvatarPath: filepath.Join(t.TempDir(), "missing-avatar")}, + } + archiveDir := filepath.Join(t.TempDir(), "media") + + if err := copyImportedContactAvatars(contacts, archiveDir); err != nil { + t.Fatal(err) + } + if contacts[0].AvatarPath == source { + t.Fatal("avatar path still points at source cache") + } + if contacts[1].AvatarPath != contacts[0].AvatarPath { + t.Fatalf("duplicate avatar archived to different paths: %q != %q", contacts[1].AvatarPath, contacts[0].AvatarPath) + } + if contacts[2].AvatarPath != "" { + t.Fatalf("missing avatar path = %q, want cleared", contacts[2].AvatarPath) + } + data, err := os.ReadFile(contacts[0].AvatarPath) + if err != nil { + t.Fatal(err) + } + if string(data) != "fixture avatar" { + t.Fatalf("archived avatar = %q", data) + } + if !strings.HasPrefix(contacts[0].AvatarPath, archiveDir+string(os.PathSeparator)) { + t.Fatalf("avatar path %q is not under archive dir %q", contacts[0].AvatarPath, archiveDir) + } +} + func TestCopyImportedMediaKeepsExistingArchiveRef(t *testing.T) { t.Parallel() archiveDir := filepath.Join(t.TempDir(), "media") @@ -193,6 +230,34 @@ func TestImportPassesFetchMediaToTDataImporter(t *testing.T) { } } +func TestImportPassesExistingMediaRefsToTDataImporter(t *testing.T) { + t.Parallel() + python, argvPath := fakePythonImporter(t) + source := t.TempDir() + + _, err := Import(context.Background(), ImportOptions{ + Path: source, + Python: python, + FetchMedia: true, + ExistingMediaSourcePath: source, + ExistingMediaRefs: []ExistingMediaRef{{ + SourcePK: 42, + MediaType: "photo", + MediaPath: "/tmp/already-archived", + MediaSize: 12, + }}, + }, filepath.Join(t.TempDir(), "telecrawl.db")) + if err != nil { + t.Fatal(err) + } + + args := readImporterArgs(t, argvPath) + idx := indexArg(args, "--existing-media-refs") + if idx < 0 || idx+1 >= len(args) || strings.TrimSpace(args[idx+1]) == "" { + t.Fatalf("args missing --existing-media-refs value: %v", args) + } +} + func TestImportPassesExistingMediaRefsToPostboxImporter(t *testing.T) { t.Parallel() python, argvPath := fakePythonImporter(t) diff --git a/internal/telegramdesktop/scripts/import_postbox.py b/internal/telegramdesktop/scripts/import_postbox.py index 697b3e9..0ee83d4 100644 --- a/internal/telegramdesktop/scripts/import_postbox.py +++ b/internal/telegramdesktop/scripts/import_postbox.py @@ -22,6 +22,7 @@ import io import json import os +import re import struct import sys import tempfile @@ -47,8 +48,14 @@ RESOURCE_TYPE_CLOUD_PHOTO_SIZE = 1226791958 RESOURCE_TYPE_CLOUD_DOCUMENT_SIZE = -2129249780 RESOURCE_TYPE_CLOUD_DOCUMENT = 486562374 +RESOURCE_TYPE_CLOUD_PEER_PHOTO_SIZE = 923090569 RESOURCE_TYPE_LOCAL_FILE = 711798229 RESOURCE_TYPE_LOCAL_FILE_REFERENCE = 1868491758 +MESSAGE_METADATA_SERVICE_ACTION = -1132984447 +MESSAGE_METADATA_LOCATION = -1138242673 +MESSAGE_METADATA_POLL = -165764138 +MESSAGE_METADATA_WEBPAGE = -661322156 +URL_RE = re.compile(r"https?://[^\s<>()\"']+") # Public Telegram for macOS app identity from TelegramSwift. TELEGRAM_MAC_API_ID = 9 TELEGRAM_MAC_API_HASH = "3975f648bb682ee889f35483bc618d1c" # gitleaks:allow @@ -412,6 +419,28 @@ def peer_display(peer: Any) -> str: return "" +def clean_text(value: Any) -> str: + if value is None: + return "" + return str(value).strip() + + +def peer_type_for_id(peer_id: int) -> str: + parts = postbox_peer_parts(peer_id) + if parts is None: + return "unknown" + namespace, _ = parts + if namespace == 0: + return "user" + if namespace == 1: + return "group" + if namespace == 2: + return "channel" + if namespace == 3: + return "secret_chat" + return "unknown" + + def peer_access_hash(peer: Any) -> int: if not isinstance(peer, dict): return 0 @@ -421,7 +450,7 @@ def peer_access_hash(peer: Any) -> int: return 0 -def load_peer_records(conn: Any) -> dict[int, dict[str, Any]]: +def load_peer_records(conn: Any, media_root: Path | None = None) -> dict[int, dict[str, Any]]: peers: dict[int, dict[str, Any]] = {} for key, value in conn.execute("SELECT key, value FROM t2"): if not isinstance(key, int) or not isinstance(value, bytes): @@ -432,8 +461,23 @@ def load_peer_records(conn: Any) -> dict[int, dict[str, Any]]: continue display = peer_display(peer) access_hash = peer_access_hash(peer) - if display or access_hash: - peers[key] = {"display": display, "access_hash": access_hash} + first_name = clean_text(peer.get("fn") if isinstance(peer, dict) else "") + last_name = clean_text(peer.get("ln") if isinstance(peer, dict) else "") + title = clean_text(peer.get("t") if isinstance(peer, dict) else "") + username = clean_text(peer.get("un") if isinstance(peer, dict) else "") + phone = clean_text(peer.get("p") if isinstance(peer, dict) else "") + avatar_path = cached_peer_avatar_path(peer, media_root) if media_root is not None else "" + if display or access_hash or first_name or last_name or title or username or phone or avatar_path: + peers[key] = { + "display": display, + "access_hash": access_hash, + "first_name": first_name, + "last_name": last_name, + "title": title, + "username": username, + "phone": phone, + "avatar_path": avatar_path, + } return peers @@ -441,15 +485,47 @@ def load_peer_map(conn: Any) -> dict[int, str]: return {peer_id: str(peer.get("display") or "") for peer_id, peer in load_peer_records(conn).items()} -def read_source_records(source: PostboxSource, conn: Any, multi_account: bool) -> tuple[dict[str, str], list[dict[str, Any]]]: - raw_peer_records = load_peer_records(conn) +def contact_for_peer(account_id: str, peer_id: int, peer: dict[str, Any], multi_account: bool) -> dict[str, Any] | None: + jid = peer_store_id(account_id, peer_id, multi_account) + peer_type = peer_type_for_id(peer_id) + full_name = clean_text(peer.get("display") or peer.get("title")) + first_name = clean_text(peer.get("first_name")) + last_name = clean_text(peer.get("last_name")) + username = clean_text(peer.get("username")) + phone = clean_text(peer.get("phone")) + avatar_path = clean_text(peer.get("avatar_path")) + if not any([full_name, first_name, last_name, username, phone, avatar_path]) and peer_type == "unknown": + return None + return { + "id": jid, + "peer_type": peer_type, + "phone": phone, + "full_name": full_name, + "first_name": first_name, + "last_name": last_name, + "business_name": "", + "username": username, + "lid": "", + "about_text": "", + "avatar_path": avatar_path, + "updated_at": "", + } + + +def read_source_records(source: PostboxSource, conn: Any, multi_account: bool) -> tuple[dict[str, str], list[dict[str, Any]], list[dict[str, Any]]]: + media_root = source.db_path.parent.parent / "media" + raw_peer_records = load_peer_records(conn, media_root) raw_peers = {peer_id: str(peer.get("display") or "") for peer_id, peer in raw_peer_records.items()} peers = { peer_store_id(source.account_id, peer_id, multi_account): display for peer_id, display in raw_peers.items() } + contacts = [ + contact + for peer_id, peer in raw_peer_records.items() + if (contact := contact_for_peer(source.account_id, peer_id, peer, multi_account)) is not None + ] messages: list[dict[str, Any]] = [] - media_root = source.db_path.parent.parent / "media" for key_blob, value in conn.execute("SELECT key, value FROM t7 ORDER BY key"): if not isinstance(key_blob, bytes) or len(key_blob) < 20 or not isinstance(value, bytes): continue @@ -495,8 +571,9 @@ def read_source_records(source: PostboxSource, conn: Any, multi_account: bool) - "media_path": media_path, "media_size": media_size, "embedded_media": msg.get("embedded_media") or [], + "referenced_media_ids": msg.get("referenced_media_ids") or [], }) - return peers, messages + return peers, contacts, messages def read_forward_info(reader: ByteReader) -> None: @@ -590,6 +667,8 @@ def visit(item: Any) -> None: ids.append(f"telegram-cloud-document-size-{item.get('d')}-{item.get('i')}-{item.get('s')}") elif resource_type == RESOURCE_TYPE_CLOUD_DOCUMENT: ids.append(f"telegram-cloud-document-{item.get('d')}-{item.get('f')}") + elif resource_type == RESOURCE_TYPE_CLOUD_PEER_PHOTO_SIZE: + ids.append(f"telegram-peer-photo-size-{item.get('d')}-{item.get('s')}-{item.get('v')}-{item.get('l')}") elif resource_type == RESOURCE_TYPE_LOCAL_FILE: ids.append(f"telegram-local-file-{item.get('f')}") elif resource_type == RESOURCE_TYPE_LOCAL_FILE_REFERENCE: @@ -616,6 +695,20 @@ def cached_media_for(msg: dict[str, Any], media_root: Path) -> tuple[str, int]: return str(path), size +def cached_peer_avatar_path(peer: Any, media_root: Path) -> str: + if not isinstance(peer, dict): + return "" + candidates: list[tuple[int, Path]] = [] + for item in peer.get("ph") or []: + for resource_id in media_resource_ids(item): + for path in cached_media_paths(resource_id, media_root): + candidates.append((path.stat().st_size, path)) + if not candidates: + return "" + _, path = max(candidates, key=lambda item: item[0]) + return str(path) + + def cached_media_paths(resource_id: str, media_root: Path) -> list[Path]: paths: list[Path] = [] exact = media_root / resource_id @@ -679,6 +772,90 @@ def visit(item: Any) -> str: return "" +def json_safe(value: Any) -> Any: + if isinstance(value, dict): + return {str(key): json_safe(nested) for key, nested in sorted(value.items(), key=lambda item: str(item[0]))} + if isinstance(value, list): + return [json_safe(nested) for nested in value] + if isinstance(value, tuple): + return [json_safe(nested) for nested in value] + if isinstance(value, bytes): + return {"base64": base64.b64encode(value).decode("ascii")} + return value + + +def metadata_json(value: Any) -> str: + return json.dumps(json_safe(value), ensure_ascii=False, sort_keys=True, separators=(",", ":")) + + +def first_string(value: Any, keys: Iterable[str]) -> str: + if not isinstance(value, dict): + return "" + for key in keys: + text = clean_text(value.get(key)) + if text: + return text + return "" + + +def first_url(text: str) -> str: + match = URL_RE.search(text or "") + if not match: + return "" + return match.group(0).rstrip(".,;:") + + +def service_action_title(value: dict[str, Any]) -> str: + if any(key in value for key in ("d", "dr", "vc")): + return "call" + if "t" in value: + return "title_change" + if "m" in value: + return "member_change" + if "p" in value: + return "pin" + return "service_action" + + +def message_metadata(msg: dict[str, Any]) -> tuple[str, str, str, str]: + for item in msg.get("embedded_media") or []: + if not isinstance(item, dict): + continue + type_hash = item.get("@type") + if type_hash == MESSAGE_METADATA_WEBPAGE: + url = first_string(item, ("u", "du", "url")) + title = first_string(item, ("ti", "title", "t", "tx")) + return "web_page", title, url, metadata_json(item) + if type_hash == MESSAGE_METADATA_LOCATION: + title = first_string(item, ("adr", "t", "title")) + return "location", title, "", metadata_json(item) + if type_hash == MESSAGE_METADATA_POLL: + title = first_string(item, ("t", "title")) + return "poll", title, "", metadata_json(item) + if type_hash == MESSAGE_METADATA_SERVICE_ACTION: + title = service_action_title(item) + return "service_action", title, "", metadata_json(item) + + url = first_url(str(msg.get("text") or "")) + if url and msg.get("media_type") == "web_page": + return "web_page", "", url, metadata_json({"url": url}) + return "", "", "", "" + + +def attach_message_metadata(messages: list[dict[str, Any]]) -> int: + count = 0 + for msg in messages: + metadata_type, title, url, raw_json = message_metadata(msg) + if not metadata_type: + continue + msg["metadata_type"] = metadata_type + msg["metadata_title"] = title + msg["metadata_url"] = url + msg["metadata_json"] = raw_json + count += 1 + return count + + def stable_int(*parts: object) -> int: digest = hashlib.sha256(":".join(str(part) for part in parts).encode("utf-8")).digest() return int.from_bytes(digest[:8], "big") & 0x7FFFFFFFFFFFFFFF @@ -723,7 +900,17 @@ def filter_chat(messages: list[dict[str, Any]], chat_id: str) -> list[dict[str, return [msg for msg in messages if msg["chat_id"] == chat_id or msg.get("_raw_chat_id") == chat_id] -def import_source(source: PostboxSource, passcodes: list[bytes], multi_account: bool) -> tuple[dict[str, str], list[dict[str, Any]]]: +def filter_contacts_for_messages(contacts: list[dict[str, Any]], messages: list[dict[str, Any]]) -> list[dict[str, Any]]: + peer_ids = { + str(value) + for msg in messages + for value in (msg.get("chat_id"), msg.get("sender_id")) + if str(value or "").strip() + } + return [contact for contact in contacts if str(contact.get("id") or "") in peer_ids] + + +def import_source(source: PostboxSource, passcodes: list[bytes], multi_account: bool) -> tuple[dict[str, str], list[dict[str, Any]], list[dict[str, Any]]]: key = parse_tempkey(source.key_path, passcodes) conn = connect_postbox(source.db_path, key) try: @@ -777,6 +964,11 @@ def message_resource_ids(msg: dict[str, Any]) -> list[str]: return list(dict.fromkeys(ids)) +def has_remote_media_identity(msg: dict[str, Any]) -> bool: + # Referenced-only Postbox rows still carry enough chat/message identity for Telethon fetch. + return bool(message_resource_ids(msg) or msg.get("referenced_media_ids")) + + def share_duplicate_media(messages: list[dict[str, Any]]) -> int: known: dict[tuple[str, int, int, str, str, str], tuple[str, int]] = {} for msg in messages: @@ -827,6 +1019,8 @@ def remote_media_candidates(messages: list[dict[str, Any]]) -> list[dict[str, An for msg in messages: if msg.get("media_path") or not msg.get("media_type"): continue + if not has_remote_media_identity(msg): + continue if cloud_media_key(msg) is None: continue candidates.append(msg) @@ -1128,10 +1322,12 @@ def download_remote_media( def build_result( source_path: str, peers: dict[str, str], + contacts: list[dict[str, Any]], messages: list[dict[str, Any]], started: dt.datetime, remote_media: dict[str, int] | None = None, ) -> dict[str, Any]: + attach_message_metadata(messages) clear_unarchived_placeholder_media(messages) chats: dict[str, dict[str, Any]] = {} for msg in messages: @@ -1140,6 +1336,7 @@ def build_result( msg.pop("_account_id", None) msg.pop("_access_hash", None) msg.pop("embedded_media", None) + msg.pop("referenced_media_ids", None) chat_id = msg["chat_id"] chat = chats.setdefault(chat_id, { "id": chat_id, @@ -1162,6 +1359,7 @@ def build_result( "finished_at": finished.isoformat().replace("+00:00", "Z"), "remote_media": remote_media or {"downloaded": 0, "missing": 0}, "chats": sorted(chats.values(), key=lambda c: c["last_message_at"], reverse=True), + "contacts": sorted(contacts, key=lambda c: c["id"]), "folders": [], "folder_chats": [], "topics": [], @@ -1195,6 +1393,10 @@ def fixture_kv_int64(key: str, value: int) -> bytes: return fixture_short_str(key) + struct.pack(" bytes: + return fixture_short_str(key) + struct.pack(" bytes: return fixture_short_str(key) + struct.pack("B", 5) + fixture_object(payload, type_hash) @@ -1211,8 +1413,13 @@ def fixture_root_typed_object(payload: bytes, type_hash: int) -> bytes: return fixture_short_str("_") + struct.pack("B", 5) + fixture_object(payload, type_hash) -def fixture_peer(first: str, last: str = "") -> bytes: - return fixture_root_object(fixture_kv_string("fn", first) + fixture_kv_string("ln", last)) +def fixture_peer(first: str, last: str = "", username: str = "", phone: str = "") -> bytes: + payload = fixture_kv_string("fn", first) + fixture_kv_string("ln", last) + if username: + payload += fixture_kv_string("un", username) + if phone: + payload += fixture_kv_string("p", phone) + return fixture_root_object(payload) def fixture_message( @@ -1276,6 +1483,37 @@ def fixture_photo_media(photo_id: int = 123456789) -> bytes: return fixture_root_typed_object(media, -1951522668) +def fixture_webpage_media() -> bytes: + media = fixture_kv_string("u", "https://example.com/article") + fixture_kv_string("ti", "Example Article") + return fixture_root_typed_object(media, MESSAGE_METADATA_WEBPAGE) + + +def fixture_location_media() -> bytes: + media = ( + fixture_kv_double("la", 52.1) + + fixture_kv_double("lo", 4.3) + + fixture_kv_string("adr", "Example Place") + ) + return fixture_root_typed_object(media, MESSAGE_METADATA_LOCATION) + + +def fixture_poll_media() -> bytes: + media = fixture_kv_string("t", "Example Poll") + return fixture_root_typed_object(media, MESSAGE_METADATA_POLL) + + +def fixture_service_action_media(action: str = "call") -> bytes: + if action == "title_change": + media = fixture_kv_string("t", "Example Title") + elif action == "member_change": + media = fixture_kv_int32("m", 1001) + elif action == "pin": + media = fixture_kv_int32("p", 2002) + else: + media = fixture_kv_int32("d", 42) + fixture_kv_int32("dr", 10) + return fixture_root_typed_object(media, MESSAGE_METADATA_SERVICE_ACTION) + + def fixture_message_key(peer_id: int, namespace: int, timestamp: int, message_id: int) -> bytes: return struct.pack(">qiii", peer_id, namespace, timestamp, message_id) @@ -1371,6 +1609,19 @@ def run_self_test(fixture_dir: str) -> None: if cached_path != str(large) or cached_size != 6: raise AssertionError(f"largest cached photo lookup failed: {(cached_path, cached_size)!r}") + peer_photo = { + "@type": 1774815102, + "r": {"@type": RESOURCE_TYPE_CLOUD_PEER_PHOTO_SIZE, "d": 2, "s": 1, "v": 333, "l": 444}, + } + if media_resource_ids(peer_photo) != ["telegram-peer-photo-size-2-1-333-444"]: + raise AssertionError(f"peer photo resource decode failed: {peer_photo!r}") + with tempfile.TemporaryDirectory() as tmp: + avatar = Path(tmp) / "telegram-peer-photo-size-2-1-333-444.jpg" + avatar.write_bytes(b"avatar") + got = cached_peer_avatar_path({"ph": [peer_photo]}, Path(tmp)) + if got != str(avatar): + raise AssertionError(f"cached peer avatar lookup failed: {got!r}") + sample = [ {"chat_id": "1", "_raw_chat_id": "1", "_ts": 10, "source_pk": 1}, {"chat_id": "1", "_raw_chat_id": "1", "_ts": 20, "source_pk": 2}, @@ -1409,16 +1660,19 @@ def run_self_test(fixture_dir: str) -> None: raise AssertionError("channel peer id conversion failed") if postbox_peer_to_telethon_id(fixture_postbox_peer_id(3, 42)) is not None: raise AssertionError("secret chat peer id should not be remotely fetched") + remote_resource = PostboxDecoder(fixture_photo_media(77)).decode_root_object() remote_sample = [ - {"_raw_chat_id": str(fixture_postbox_peer_id(0, 777000)), "message_id": "0:1", "media_type": "photo", "media_path": ""}, - {"_raw_chat_id": str(fixture_postbox_peer_id(0, 777000)), "message_id": "0:2", "media_type": "photo", "media_path": "cached"}, - {"_raw_chat_id": str(fixture_postbox_peer_id(3, 42)), "message_id": "0:3", "media_type": "photo", "media_path": ""}, - {"_raw_chat_id": str(fixture_postbox_peer_id(0, 777000)), "message_id": "1:4", "media_type": "photo", "media_path": ""}, - {"_raw_chat_id": str(fixture_postbox_peer_id(0, 777000)), "message_id": "0:5", "media_type": "", "media_path": ""}, + {"_raw_chat_id": str(fixture_postbox_peer_id(0, 777000)), "message_id": "0:1", "media_type": "photo", "media_path": "", "embedded_media": [remote_resource]}, + {"_raw_chat_id": str(fixture_postbox_peer_id(0, 777000)), "message_id": "0:2", "media_type": "photo", "media_path": "cached", "embedded_media": [remote_resource]}, + {"_raw_chat_id": str(fixture_postbox_peer_id(3, 42)), "message_id": "0:3", "media_type": "photo", "media_path": "", "embedded_media": [remote_resource]}, + {"_raw_chat_id": str(fixture_postbox_peer_id(0, 777000)), "message_id": "1:4", "media_type": "photo", "media_path": "", "embedded_media": [remote_resource]}, + {"_raw_chat_id": str(fixture_postbox_peer_id(0, 777000)), "message_id": "0:5", "media_type": "", "media_path": "", "embedded_media": [remote_resource]}, + {"_raw_chat_id": str(fixture_postbox_peer_id(0, 777000)), "message_id": "0:6", "media_type": "web_page", "media_path": "", "embedded_media": []}, + {"_raw_chat_id": str(fixture_postbox_peer_id(0, 777000)), "message_id": "0:7", "media_type": "media", "media_path": "", "embedded_media": [], "referenced_media_ids": [(7, 123456789)]}, ] - if [row["message_id"] for row in remote_media_candidates(remote_sample)] != ["0:1"]: + if [row["message_id"] for row in remote_media_candidates(remote_sample)] != ["0:1", "0:7"]: raise AssertionError(f"remote media candidate selection failed: {remote_sample!r}") - if remote_media_missing_count(remote_sample) != 1: + if remote_media_missing_count(remote_sample) != 2: raise AssertionError(f"remote missing count failed: {remote_sample!r}") existing_ref_sample = [ {"source_pk": 10, "message_id": "0:10", "media_type": "photo", "media_path": ""}, @@ -1442,6 +1696,34 @@ def run_self_test(fixture_dir: str) -> None: raise AssertionError(f"placeholder clear count failed: {placeholder_sample!r}") if [row["media_type"] for row in placeholder_sample] != ["", "", "web_page", "photo"]: raise AssertionError(f"placeholder media clear failed: {placeholder_sample!r}") + + metadata_sample = [ + {"text": "", "media_type": "web_page", "embedded_media": [PostboxDecoder(fixture_webpage_media()).decode_root_object()]}, + {"text": "", "media_type": "media", "embedded_media": [PostboxDecoder(fixture_location_media()).decode_root_object()]}, + {"text": "", "media_type": "media", "embedded_media": [PostboxDecoder(fixture_poll_media()).decode_root_object()]}, + {"text": "", "media_type": "media", "embedded_media": [PostboxDecoder(fixture_service_action_media()).decode_root_object()]}, + {"text": "see https://example.com/from-text.", "media_type": "web_page", "embedded_media": []}, + ] + if attach_message_metadata(metadata_sample) != 5: + raise AssertionError(f"metadata attach count failed: {metadata_sample!r}") + if [row["metadata_type"] for row in metadata_sample] != ["web_page", "location", "poll", "service_action", "web_page"]: + raise AssertionError(f"metadata type decode failed: {metadata_sample!r}") + if metadata_sample[0]["metadata_url"] != "https://example.com/article": + raise AssertionError(f"web page URL decode failed: {metadata_sample!r}") + if metadata_sample[1]["metadata_title"] != "Example Place" or metadata_sample[2]["metadata_title"] != "Example Poll": + raise AssertionError(f"location/poll metadata title failed: {metadata_sample!r}") + if metadata_sample[3]["metadata_title"] != "call": + raise AssertionError(f"service metadata title failed: {metadata_sample!r}") + if metadata_sample[4]["metadata_url"] != "https://example.com/from-text": + raise AssertionError(f"text URL metadata failed: {metadata_sample!r}") + service_samples = [ + {"text": "", "media_type": "media", "embedded_media": [PostboxDecoder(fixture_service_action_media(action)).decode_root_object()]} + for action in ("call", "title_change", "member_change", "pin") + ] + if attach_message_metadata(service_samples) != 4: + raise AssertionError(f"service metadata attach count failed: {service_samples!r}") + if [row["metadata_title"] for row in service_samples] != ["call", "title_change", "member_change", "pin"]: + raise AssertionError(f"service metadata subtype failed: {service_samples!r}") import_module = importlib.import_module try: def missing_telethon(name: str, package: str | None = None) -> Any: @@ -1454,7 +1736,7 @@ def missing_telethon(name: str, package: str | None = None) -> Any: finally: importlib.import_module = import_module if result != { - "candidates": 1, + "candidates": 2, "attempted": 0, "downloaded": 0, "missing": 1, @@ -1470,7 +1752,7 @@ async def never_finishes() -> None: try: asyncio.run(wait_remote(never_finishes(), 0.001)) raise AssertionError("remote timeout did not fire") - except TimeoutError: + except (TimeoutError, asyncio.TimeoutError): pass duplicate_sample = [ @@ -1552,21 +1834,31 @@ async def never_finishes() -> None: ] public_connections = [ FixturePostboxConnection( - {100: fixture_peer("Fixture", "A"), 4242: fixture_peer("Sender", "A")}, + { + 100: fixture_peer("Fixture", "A"), + 4242: fixture_peer("Sender", "A", "sendera", "+15550000001"), + 9999: fixture_peer("Unused", "A", "unuseda", "+15559990001"), + }, [(fixture_message_key(100, 0, 1_421_404_800, 1), fixture_message("public account a"))], ), FixturePostboxConnection( - {100: fixture_peer("Fixture", "B"), 4242: fixture_peer("Sender", "B")}, + { + 100: fixture_peer("Fixture", "B"), + 4242: fixture_peer("Sender", "B", "senderb", "+15550000002"), + 9999: fixture_peer("Unused", "B", "unusedb", "+15559990002"), + }, [(fixture_message_key(100, 0, 1_421_404_801, 1), fixture_message("public account b"))], ), ] public_peers: dict[str, str] = {} + public_contacts: list[dict[str, Any]] = [] public_messages: list[dict[str, Any]] = [] if len(public_sources) != len(public_connections): raise AssertionError("public fixture source/connection mismatch") for source, conn in zip(public_sources, public_connections): - peers, messages = read_source_records(source, conn, multi_account=True) + peers, contacts, messages = read_source_records(source, conn, multi_account=True) public_peers.update(peers) + public_contacts.extend(contacts) public_messages.extend(messages) public_filtered = filter_chat(public_messages, "100") if len(public_filtered) != 2: @@ -1575,9 +1867,18 @@ async def never_finishes() -> None: raise AssertionError("public multi-account import collapsed distinct chats") if public_filtered[0]["source_pk"] == public_filtered[1]["source_pk"]: raise AssertionError("public multi-account import collided source keys") - public_result = build_result("fixture-postbox", public_peers, public_filtered, dt.datetime(2026, 1, 1, tzinfo=dt.timezone.utc)) + public_filtered_contacts = filter_contacts_for_messages(public_contacts, public_filtered) + if len(public_contacts) != 6 or len(public_filtered_contacts) != 4: + raise AssertionError(f"public contact filtering shape failed: {public_contacts!r} -> {public_filtered_contacts!r}") + if any(contact["username"].startswith("unused") for contact in public_filtered_contacts): + raise AssertionError(f"public contact filtering kept unrelated contacts: {public_filtered_contacts!r}") + public_result = build_result("fixture-postbox", public_peers, public_filtered_contacts, public_filtered, dt.datetime(2026, 1, 1, tzinfo=dt.timezone.utc)) if len(public_result["chats"]) != 2 or len(public_result["messages"]) != 2: raise AssertionError(f"public import result shape failed: {public_result!r}") + if len(public_result["contacts"]) != 4: + raise AssertionError(f"public contacts shape failed: {public_result!r}") + if not any(contact["phone"] == "+15550000001" and contact["username"] == "sendera" for contact in public_result["contacts"]): + raise AssertionError(f"public contact enrichment failed: {public_result!r}") if any("embedded_media" in msg for msg in public_result["messages"]): raise AssertionError(f"public import leaked internal media records: {public_result!r}") if {msg["text"] for msg in public_result["messages"]} != {"public account a", "public account b"}: @@ -1614,10 +1915,13 @@ def main() -> None: multi_account = len(sources) > 1 sources_by_account = {source.account_id: source for source in sources} all_peers: dict[str, str] = {} + all_contacts: dict[str, dict[str, Any]] = {} by_identity: dict[tuple[str, str, str], dict[str, Any]] = {} for source in sources: - peers, messages = import_source(source, passcodes, multi_account) + peers, contacts, messages = import_source(source, passcodes, multi_account) all_peers.update(peers) + for contact in contacts: + all_contacts[contact["id"]] = contact for msg in messages: by_identity[(source.account_id, msg["chat_id"], msg["message_id"])] = msg @@ -1636,7 +1940,11 @@ def main() -> None: share_duplicate_media(limited) share_resource_media(limited) source_path = str(Path(args.source).expanduser()) if args.source else str(default_group_path()) - json.dump(build_result(source_path, all_peers, limited, started, remote_media), sys.stdout, separators=(",", ":")) + # Full imports refresh account-level contacts; targeted chat imports keep contacts scoped to exported messages. + contacts = list(all_contacts.values()) + if args.chat: + contacts = filter_contacts_for_messages(contacts, limited) + json.dump(build_result(source_path, all_peers, contacts, limited, started, remote_media), sys.stdout, separators=(",", ":")) if __name__ == "__main__": diff --git a/internal/telegramdesktop/scripts/import_tdata.py b/internal/telegramdesktop/scripts/import_tdata.py index a960a57..6e28d78 100644 --- a/internal/telegramdesktop/scripts/import_tdata.py +++ b/internal/telegramdesktop/scripts/import_tdata.py @@ -109,6 +109,19 @@ def media_size(message): return 0 +def load_existing_media_refs(path): + if not path: + return {} + data = json.loads(Path(path).read_text()) + refs = {} + for item in data: + source_pk = str(item.get("source_pk") or "") + media_path = str(item.get("media_path") or "") + if source_pk and media_path: + refs[source_pk] = item + return refs + + async def wait_remote(awaitable, seconds): return await asyncio.wait_for(awaitable, timeout=seconds) @@ -265,6 +278,7 @@ async def main(): parser.add_argument("--chat", default="") parser.add_argument("--fetch-media", action="store_true") parser.add_argument("--media-output-dir", default="") + parser.add_argument("--existing-media-refs", default="") args = parser.parse_args() started = datetime.now(timezone.utc) @@ -291,6 +305,7 @@ async def main(): out_folders = [] remote_media = empty_remote_media_stats() media_output_dir = Path(args.media_output_dir).expanduser() if args.fetch_media and args.media_output_dir else None + existing_media_refs = load_existing_media_refs(args.existing_media_refs) folder_memberships = {} if not args.chat: out_folders, folder_memberships = await load_folders(client) @@ -331,21 +346,32 @@ async def main(): topic_id = str(msg.id) replies = getattr(msg, "replies", None) msg_media_type = media_type(msg) + msg_media_title = media_title(msg) msg_media_path = "" msg_media_size = media_size(msg) + source_pk_value = stable_pk(chat_id, msg.id) if args.fetch_media and msg_media_type: - remote_media["candidates"] += 1 - remote_media["attempted"] += 1 - msg_media_path, downloaded_size, reason = await download_message_media(client, msg, media_output_dir, chat_id) - if msg_media_path: - msg_media_size = downloaded_size - remote_media["downloaded"] += 1 + existing_ref = existing_media_refs.get(str(source_pk_value)) + if existing_ref: + msg_media_path = str(existing_ref.get("media_path") or "") + msg_media_size = int(existing_ref.get("media_size") or 0) + if not msg_media_type: + msg_media_type = str(existing_ref.get("media_type") or "") + if not msg_media_title: + msg_media_title = str(existing_ref.get("media_title") or "") else: - remote_media["missing"] += 1 - remote_media[reason or "unavailable"] += 1 + remote_media["candidates"] += 1 + remote_media["attempted"] += 1 + msg_media_path, downloaded_size, reason = await download_message_media(client, msg, media_output_dir, chat_id) + if msg_media_path: + msg_media_size = downloaded_size + remote_media["downloaded"] += 1 + else: + remote_media["missing"] += 1 + remote_media[reason or "unavailable"] += 1 out_messages.append( { - "source_pk": stable_pk(chat_id, msg.id), + "source_pk": source_pk_value, "chat_id": chat_id, "chat_name": chat_name, "message_id": str(msg.id), @@ -361,7 +387,7 @@ async def main(): "text": text, "message_type": type(msg).__name__, "media_type": msg_media_type, - "media_title": media_title(msg), + "media_title": msg_media_title, "media_path": msg_media_path, "media_size": msg_media_size, "views": int(getattr(msg, "views", 0) or 0),