[NYT] How Amazon Taught Alexa to Speak in an Irish Brogue

Like Henry Higgins, the phonetician from George Bernard Shaw’s play “Pygmalion,” Marius Cotescu and Georgi Tinchev recently demonstrated how their student was trying to overcome pronunciation difficulties.

The two data scientists, who work for Amazon in Europe, were teaching Alexa, the company’s digital assistant. Their task: to help Alexa master an Irish-accented English with the aid of artificial intelligence and recordings from native speakers.

During the demonstration, Alexa spoke about a memorable night out. “The party last night was great craic,” Alexa said with a lilt, using the Irish word for fun. “We got ice cream on the way home, and we were happy out.”

Mr. Tinchev shook his head. Alexa had dropped the “r” in “party,” making the word sound flat, like pah-tee. Too British, he concluded.

The technologists are part of a team at Amazon working on a challenging area of data science known as voice disentanglement. It’s a tricky issue that has gained new relevance amid a wave of A.I. developments, with researchers believing the speech and technology puzzle can help make A.I.-powered devices, bots and speech synthesizers more conversational — that is, capable of pulling off a multitude of regional accents.

Tackling voice disentanglement involves far more than grasping vocabulary and syntax. A speaker’s pitch, timbre and accent often give words nuanced meaning and emotional weight. Linguists call this language feature “prosody,” something machines have had a hard time mastering.

Only in recent years, thanks to advances in A.I., computer chips and other hardware, have researchers made strides in cracking the voice disentanglement issue, transforming computer-generated speech into something more pleasing to the ear.

Such work may eventually converge with an explosion of “generative A.I.,” a technology that enables chatbots to generate their own responses, researchers said. Chatbots like ChatGPT and Bard may someday fully act on users’ voice commands and respond verbally. At the same time, voice assistants like Alexa and Apple’s Siri will become more conversational, potentially rekindling consumer interest in a tech segment that had seemingly stalled, analysts said.

Getting voice assistants such as Alexa, Siri and Google Assistant to speak multiple languages has been an expensive and protracted process. Tech companies have hired voice actors to record hundreds of hours of speech, which helped create synthetic voices for digital assistants. Advanced A.I. systems known as “text-to-speech models” — because they convert text to natural-sounding synthetic speech — are just beginning to streamline this process.

The technology “is now able to create a human’s voice and synthetic audio based on a text input, in different languages, accents and dialects,” said Marion Laboure, a senior strategist at Deutsche Bank Research.

Amazon has been under pressure to catch up to rivals like Microsoft and Google in the A.I. race. In April, Andy Jassy, Amazon’s chief executive, told Wall Street analysts that the company planned to make Alexa “even more proactive and conversational” with the help of sophisticated generative A.I. And Rohit Prasad, Amazon’s head scientist for Alexa, told CNBC in May that he saw the voice assistant as a voice-enabled “instantly available, personal A.I.”

Irish Alexa made its commercial debut in November, after nine months of training in comprehending an Irish accent and then speaking it.

“Accent is different from language,” Mr. Prasad said in an interview. A.I. technologies must learn to extricate the accent from other parts of speech, such as tone and frequency, before they can replicate the peculiarities of local dialects — for instance, maybe the “a” is flatter and “t’s” are pronounced more forcibly.

These systems must figure out these patterns “so you can synthesize a whole new accent,” he said. “That’s hard.”

Harder still was trying to get the technology to learn a new accent largely on its own, from a different-sounding speech model. That’s what Mr. Cotescu’s team tried in building Irish Alexa. They relied heavily on an existing speech model of primarily British-English accents — with a far smaller range of American, Canadian and Australian accents — to train it to speak Irish English.

The team contended with various linguistic challenges of Irish English. The Irish tend to drop the “h” in “th,” for example, pronouncing the letters as a hard “t” or a “d,” making “bath” sound like “bat,” or even “bad.” Irish English is also rhotic, meaning the “r” is pronounced after a vowel rather than dropped. That means the “r” in “party” will be more distinct than what you might hear out of a Londoner’s mouth. Alexa had to learn these speech features and master them.
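The two features described above can be written down as simple rewrite rules over phoneme sequences. The sketch below is only an illustration and not Amazon's method: the ARPAbet-style symbols (without stress marks), the small vowel set and the function names `th_stopping` and `drop_postvocalic_r` are all assumptions made for this example.

```python
# Toy illustration only: hypothetical rules over ARPAbet-style phoneme lists.
# The vowel set is deliberately small and covers just the examples below.
VOWELS = {"AA", "AE", "AH", "AO", "EH", "ER", "IH", "IY", "UW"}


def th_stopping(phonemes):
    """Replace the "th" sounds with stops: TH -> T and DH -> D."""
    mapping = {"TH": "T", "DH": "D"}
    return [mapping.get(p, p) for p in phonemes]


def drop_postvocalic_r(phonemes):
    """Non-rhotic rendering: drop R after a vowel unless a vowel follows."""
    out = []
    for i, p in enumerate(phonemes):
        prev_is_vowel = i > 0 and phonemes[i - 1] in VOWELS
        next_is_vowel = i + 1 < len(phonemes) and phonemes[i + 1] in VOWELS
        if p == "R" and prev_is_vowel and not next_is_vowel:
            continue  # the "r" is silent in a non-rhotic accent
        out.append(p)
    return out


bath = ["B", "AE", "TH"]
party = ["P", "AA", "R", "T", "IY"]

print(th_stopping(bath))          # "bath" rendered like "bat"
print(drop_postvocalic_r(party))  # the "pah-tee" rendering with no "r"
print(party)                      # a rhotic accent keeps the "r"
```

A rhotic accent such as Irish English simply never applies the second rule, which is why the "r" in "party" stays audible.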

Irish English, said Mr. Cotescu, who is Romanian and was the lead researcher on the Irish Alexa team, “is a hard one.”

The speech models that power Alexa’s verbal skills have been growing more advanced in recent years. In 2020, Amazon researchers taught Alexa to speak fluent Spanish from an English language-speaking model.

Mr. Cotescu and the team saw accents as the next frontier of Alexa’s speech capabilities. They designed Irish Alexa to rely more on A.I. than on actors to build up its speech model. As a result, Irish Alexa was trained on a relatively small corpus — about 24 hours of recordings by voice actors who recited 2,000 utterances in Irish-accented English.

At the outset, when Amazon’s researchers fed the Irish recordings to the still-learning Irish Alexa, some weird things happened.

Letters and syllables occasionally dropped out of the response. “S’s” sometimes stuck together. A word or two, sometimes crucial ones, were inexplicably mumbled and incomprehensible. At least in one case, Alexa’s female voice dropped a few octaves, sounding more masculine. Worse, the masculine voice sounded distinctly British, the kind of goof that might raise eyebrows in some Irish homes.

“They are big black boxes,” Mr. Tinchev, a Bulgarian national who is Amazon’s lead scientist on the project, said of the speech models. “You have to have a lot of experimentation to tune them.”

That’s what the technologists did to correct Alexa’s “party” gaffe. They disentangled the speech, word by word, phoneme (the smallest audible sliver of a word) by phoneme to pinpoint where Alexa was slipping and fine-tune it. Then they fed Irish Alexa’s speech model more recorded voice data to correct the mispronunciation.

The result: the “r” in “party” returned. But then the “p” disappeared.

So the data scientists went through the same process again. They eventually zeroed in on the phoneme that contained the missing “p.” Then they fine-tuned the model further so the “p” sound returned and the “r” didn’t disappear. Alexa was finally learning to speak like a Dubliner.
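The phoneme-by-phoneme check described above can be pictured as a comparison between the sequence a word should contain and the sequence the model actually produced. The following is a minimal sketch using Python's standard `difflib` module. It is not the team's tooling; the phoneme lists and the helper name `phoneme_diff` are hypothetical and simply mirror the two "party" errors in the story.

```python
from difflib import SequenceMatcher


def phoneme_diff(expected, produced):
    """Return (operation, expected_part, produced_part) for each mismatch."""
    matcher = SequenceMatcher(a=expected, b=produced, autojunk=False)
    return [
        (tag, expected[i1:i2], produced[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Hypothetical transcriptions of "party" (ARPAbet-style, no stress marks).
expected = ["P", "AA", "R", "T", "IY"]
first_try = ["P", "AA", "T", "IY"]    # the "r" was dropped
second_try = ["AA", "R", "T", "IY"]   # the "r" returned, the "p" vanished

print(phoneme_diff(expected, first_try))   # [('delete', ['R'], [])]
print(phoneme_diff(expected, second_try))  # [('delete', ['P'], [])]
```

A report like this only locates the slip. Correcting it, as the article notes, still took more recorded data and further fine-tuning of the model.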

Two Irish linguists, Elaine Vaughan, who teaches at the University of Limerick, and Kate Tallon, a Ph.D. student who works in the Phonetics and Speech Laboratory at Trinity College Dublin, have since given Irish Alexa’s accent high marks. The way Irish Alexa emphasized “r’s” and softened “t’s” stuck out, they said, and Amazon got the accent as a whole right.

“It sounds authentic to me,” Ms. Tallon said.

Amazon’s researchers said they were gratified by the largely positive feedback. That their speech models disentangled the Irish accent so quickly gave them hope they could replicate accents elsewhere.

“We also plan to extend our methodology to accents of language other than English,” they wrote in a January research paper about the Irish Alexa project.
