» Opening with a Story (Anecdote)
A good way of catching your reader’s attention is by sharing a story that sets up your paper. Sharing a story gives a paper a more personal feel and helps make your reader comfortable.
This example was borrowed from Jack Gannon’s The Week the World Heard Gallaudet (1989):
Astrid Goodstein, a Gallaudet faculty member, entered the beauty salon for her regular appointment, proudly wearing her DPN button. (“I was married to that button that week!” she later confided.) When Sandy, her regular hairdresser, saw the button, he spoke and gestured, “Never! Never! Never!” Offended, Astrid turned around and headed for the door but stopped short of leaving. She decided to keep her appointment, confessing later that at that moment, her sense of principles had lost out to her vanity. Later she realized that her hairdresser had thought she was pushing for a deaf U.S. President.
Hook: a specific example or story that interests the reader and introduces the topic.
Transition: connects the hook to the thesis statement
Thesis: summarizes the overall claim of the paper
» Specific Detail Opening
Giving specific details about your subject appeals to your reader’s curiosity and helps establish a visual picture of what your paper is about.
Hands flying, green eyes flashing, and spittle spraying, Jenny howled at her younger sister Emma. People walked by, gawking at the spectacle as Jenny’s grunts emanated through the mall. Emma sucked at her thumb, trying to appear nonchalant. Jenny’s blond hair stood almost on end. Her hands seemed to fly so fast that her signs could barely be understood. Jenny was angry. Very angry.
» Open with a Quotation
Another method of writing an introduction is to open with a quotation. This method makes your introduction more interactive and more appealing to your reader.
“People paid more attention to the way I talked than what I said!” exclaimed the woman from Brooklyn, New York, in the movie American Tongues. This young woman’s home dialect interferes with people taking her seriously because they see her as a New Yorker’s cartoonish stereotype. The effects on this woman indicate the widespread judgment that occurs about nonstandard dialects. People around America judge those with nonstandard dialects because of _____________ and _____________. This type of judgment can even cause some to be ashamed of or try to change their language identity.*
» Open with an Interesting Statistic
Statistics that grab the reader help to make an effective introduction.
American Sign Language is the second most preferred foreign language in the United States. 50% of all deaf and hard of hearing people use American Sign Language (ASL).* ASL is beginning to be provided by the Foreign Language Departments of many universities and high schools around the nation. The statistics are not accurate. They were invented as an example.
» Question Openings
Possibly the easiest opening is one that presents one or more questions to be answered in the paper. This is effective because questions are usually what the reader has in mind when he or she sees your topic.
Is ASL a language? Can ASL be written? Do you have to be born deaf to understand ASL completely? To answer these questions, one must first understand exactly what ASL is. In this paper, I attempt to explain this as well as answer my own questions.
Source: *Writing an Introduction for a More Formal Essay. (2012). Retrieved April 25, 2012, from http://flightline.highline.edu/wswyt/Writing91/handouts/hook_trans_thesis.htm
The conclusion to any paper is the final impression that can be made. It is the last opportunity to get your point across to the reader and leave the reader feeling as if they learned something. Leaving a paper “dangling” without a proper conclusion can seriously devalue what was said in the body itself. Here are a few effective ways to conclude or close your paper.
» Summary Closing
Conclusions are often simple restatements of the thesis, and these conclusions are often much like their introductions (see Thesis Statement Opening).
Because of a charter signed by President Abraham Lincoln and because of the work of two men, Amos Kendall and Edward Miner Gallaudet, Gallaudet University is what it is today – the place where people from all over the world can find information about deafness and deaf education. Gallaudet and the deaf community truly owe these three men for without them, we might still be “deaf and dumb.”
» Close with a Logical Conclusion
This is a good closing for argumentative or opinion papers that present two or more sides of an issue. The conclusion drawn as a result of the research is presented here in the final paragraphs.
As one can see from reading the information presented, mainstreaming deaf students isn’t always as effective as educating them in a segregated classroom. Deaf students learn better on a more one-on-one basis like they can find in a school or program specially designed for them. Mainstreaming lacks such a design; deaf students get lost in the mainstream.
» Real or Rhetorical Question Closings
This method of concluding a paper is one step short of giving a logical conclusion. Rather than handing the conclusion over, you can leave the reader with a question that causes him or her to draw his own conclusions.
Why, then, are schools for the deaf becoming a dying species?
» Close with a Speculation or Opinion
This is a good style for instances when the writer was unable to come up with an answer or a clear decision about whatever it was he or she was researching. For example:
Through all of my research, all of the people I interviewed, all of the institutions I visited, not one person could give me a clear-cut answer to my question. Can all deaf people be educated in the same manner? I couldn’t find the “right” answer. I hope you, the reader, will have better luck.
» Close with a Recommendation
A good conclusion can suggest that the reader do something in support of a cause or make a plea for them to take action.
American Sign Language is a fast-growing language in America. More and more universities and colleges are offering it as part of their curriculum and some are even requiring it as part of their program. This writer suggests that anyone who has a chance to learn this beautiful language should grab that opportunity.
This chapter is adapted from Stand up, Speak out: The Practice and Ethics of Public Speaking, CC BY-NC-SA 4.0.
“OK, I’m done; thank God that’s over!” Or, “Thanks. Now what? Do I just sit down?” It’s understandable to feel relief as you end your speech, but remember that as a speaker, your conclusion is the last chance you have to drive home your ideas. When you opt to end the speech with an ineffective conclusion—or no conclusion at all—your speech loses the energy you created, and the audience is left confused and disappointed. Just as a good introduction helps bring an audience into your speech’s world, and a good speech body holds the audience in that world, a good conclusion helps bring that audience back to reality. So, plan ahead to ensure that your conclusion is an effective one. While a good conclusion will not rescue a poorly prepared speech, a strong conclusion signals to your listeners that the speech is over and helps them remember your topic. Now, let’s examine the functions fulfilled by a speech conclusion.
The first function of a good conclusion is to signal the speech’s end. You may be thinking that telling an audience that you’re about to stop speaking is a “no brainer,” but many speakers really don’t prepare their audience to conclude. When a speaker just suddenly stops speaking, the audience is left confused and disappointed. Instead, make sure that you leave your audience knowledgeable and satisfied.
The second function of a good conclusion stems from some very interesting research reported by the German psychologist Hermann Ebbinghaus in his 1885 book Memory: A Contribution to Experimental Psychology (Ebbinghaus, 1885). Ebbinghaus proposed that humans remember information in a linear fashion, an idea he called the serial position effect. He found that an individual’s ability to remember an item in a list, such as a grocery list, a chores list, or a to-do list, depends on the item’s location in the list. Specifically, items toward the beginning and items toward the end of a list tended to have the highest recall rates. Ebbinghaus called information at the beginning of a list “primacy” and information at the end of a list “recency,” and showed that information in these positions is easier to recall than information in the middle of a list.
So, what does this serial position effect have to do with speech conclusions? A lot! Ray Ehrensberger wanted to test Ebbinghaus’ serial position effect in public speaking. Ehrensberger created an experiment that rearranged the order of topics within a speech to determine how well the audience recalled information (Ehrensberger, 1945). His study reaffirmed the importance of primacy and recency when listening to speeches. In fact, Ehrensberger found that the information delivered during a speech conclusion (recency) had the highest recall level overall.
A strong conclusion restates the thesis, reviews the main points, and uses a concluding device.
The first step in a powerful conclusion is to restate your thesis statement. When you restate the thesis statement as you conclude, you’re re-emphasizing your speech’s overarching main idea. For example, suppose your thesis statement is, “I will analyze how Barack Obama uses lyricism in his July 2008 speech, ‘A World That Stands as One.’” At the conclusion, restate the thesis in this fashion: “In the past few minutes, I have analyzed how Barack Obama uses lyricism in his July 2008 speech, ‘A World That Stands as One.’” Notice the shift in tense: the statement has gone from the future tense—this is what I will speak about, to the past tense—this is what I have spoken about. Restating the thesis in your conclusion reminds the audience of your speech’s major purpose or goal, helping them to better remember it.
The second step in a powerful conclusion is to review the main points after restating the speech’s thesis. A big difference between written and oral communication is oral communication’s need for repetition. So, you increase the likelihood that the audience retains your main points after the speech is over when you do the following: preview your main points in the introduction, effectively discuss and make transitions to your main points during the speech’s body, and finally, review your main points in the conclusion.
In a speech’s introduction, deliver a preview of the main body points, and in the conclusion, deliver a review. Let’s look at a sample preview:
To understand the gender and communication field, I will first differentiate between the terms biological sex and gender. I will then explain gender research in communication’s history. Lastly, I will examine some important findings related to gender and communication.
In this preview, you have three clear main points. Let’s see how you can review them at your speech’s conclusion:
Today, we have differentiated between the terms biological sex and gender, examined gender research in communication’s history, and analyzed some topic research findings.
In the past few minutes, I have explained the difference between the terms biological sex and gender, discussed the communication field’s rise in gender research, and examined some groundbreaking topic studies.
Notice that both conclusions review the main points originally set forth. Both variations are equally effective main point reviews, but you might like the linguistic turn of one over the other. Remember, while there is a lot of science to help us understand public speaking, there’s a lot of art as well, so you are always encouraged to choose the wording that you think is most effective for your audience.
The final step in a powerful conclusion is to employ a concluding device. A concluding device is essentially the final thought you want to impart to your audience when you stop speaking. It also provides a definitive sense of closure to your speech. Just as a gymnast dismounting the parallel bars or balance beam wants to stick the landing and avoid taking two or three additional steps, a speaker wants to stick their speech ending with a concluding device instead of with, “Well, umm, I guess I’m done.” Miller observed that speakers tend to use one of ten concluding devices when ending a speech (Miller, 1946). Let’s examine these ten concluding devices.
The first way to conclude a speech is with a challenge. A challenge is a call to engage in some kind of activity that requires a contest or special effort. In a speech on fund raising’s necessity, conclude by challenging the audience to raise 10 percent more than their original projections. Audience members are being asked to go out of their way to do something different that involves their effort.
The second way to conclude a speech is by reciting a quotation relevant to the speech topic. When using a quotation, think about whether your goal is to end on a persuasive note or an informative note. Some quotations will have a clear call-to-action, while other quotations summarize or provoke thought. For example, let’s say you are delivering an informative speech about dissident writers in the former Soviet Union. End by citing this quotation from Alexander Solzhenitsyn: “A great writer is, so to speak, a second government in his country. And for that reason, no regime has ever loved great writers” (Solzhenitsyn, 1964). Notice that this quotation underscores the writers-as-dissidents idea, but it doesn’t ask listeners to put forth effort to engage in any specific thought process or behavior. If, on the other hand, you are delivering a persuasive speech urging your audience to participate in a very risky political demonstration, use this quotation from Martin Luther King Jr.: “If a man hasn’t discovered something that he will die for, he isn’t fit to live” (King, 1963). In this case, the quotation leaves the audience with these messages: that great risks are worth taking, that they make our lives worthwhile, and that the right thing to do is to take that great risk.
The third way to conclude a speech is to end with a summary. To do this, the speaker simply extends the review of the main points. While this may not be the most exciting concluding device, it can be useful for information that is highly technical or complex or for speeches that last longer than thirty minutes. Typically, for short speeches, such as student-given speeches, avoid this summary device.
The fourth way to conclude a speech is to visualize the future. This device helps your audience imagine the future that you believe can occur. For example, if you are giving a speech on developing video games for learning, conclude by inviting your audience to visualize a future classroom where video games are perceived as true learning tools and how those tools are used. More often, speakers use future visualization to depict how society, or how an individual listener’s life, would be different if the speaker’s persuasive attempts work. For example, if in your speech you propose that hiring more public-school reading specialists will solve illiteracy, ask your audience to imagine a world without illiteracy. In using this visual, your goal is to persuade your audience to adopt your viewpoint. By showing that your future vision is a positive one, this conclusion further persuades your audience to help create this future.
The fifth way to conclude a speech, and probably the most common persuasive device, is the appeal-for-action or the call-to-action. In essence, the appeal-for-action occurs when a speaker asks her or his audience to engage in a specific behavior or to change their thinking. When a speaker concludes by asking the audience “to do” or “to think” in a specific manner, the speaker wants to see an actual change. Whether the speaker appeals for people to eat more fruit, buy a car, vote for a candidate, oppose the death penalty, or sing more in the shower, the speaker is asking the audience to engage in action.
One specific appeal type is the immediate call-to-action. Whereas some appeals ask for people to engage in future behavior, the immediate call-to-action asks people to engage in behavior right now . If a speaker wants to see a new traffic light placed at a dangerous intersection, he or she may conclude by asking all the audience members to sign a digital petition right then and there, using a computer the speaker has made available. Here are more immediate call-to-action examples:
These are just a few different examples we’ve actually seen students use to elicit an immediate change in behavior. The immediate call-to-action may not lead to long-term change, but can be very effective at increasing the likelihood that an audience will change behavior in the short term.
The sixth way to conclude a speech is to inspire someone. By definition, the word inspire means to affect or arouse someone. Both affect and arouse have strong emotional connotations. The goal of employing an inspirational concluding device is similar to that of an appeal-for-action, but more lofty or ambiguous: to stir someone’s emotions in a specific manner. Maybe a speaker is giving an informative speech on domestic violence’s prevalence in our society today. That speaker could end the speech by reading Paulette Kelly’s powerful poem, “I Got Flowers Today,” which evokes strong emotions because it is about an abuse victim who receives flowers from her abuser every time she is victimized. The poem ends by saying, “I got flowers today… / Today was a special day—it was the day of my funeral / Last night he killed me” (Kelly, 1994).
The seventh way to conclude a speech is to end with your advice. This concluding device is one that should be used primarily by speakers who are recognized as expert authorities on a given subject. Advice is essentially a speaker’s opinion about what should or should not be done. The problem with opinions is that everyone has one; and one person’s opinion is not necessarily any more correct than another’s. There must be a really good reason your opinion—and therefore your advice—should matter to your audience. If, for example, you are a nuclear physics expert, conclude an energy speech by giving advice about nuclear energy’s benefits.
The eighth way to conclude a speech is to offer a powerful solution to the problem discussed within your speech. For example, perhaps you have been discussing the problems associated with art education’s disappearance in the United States. Propose a solution to create more community-based art experiences for school children as a way to fill this gap. Although this can be an effective conclusion, consider discussing the solution in more depth as a stand-alone main point within the speech’s body so that you can address your audience’s concerns about the proposed solution.
The ninth way to conclude a speech is to ask a rhetorical question that forces the audience to ponder an idea. Maybe you are giving a speech on the environment’s importance, so you end the speech by saying, “Think about your children’s future. What kind of world do you want them raised in? A world that is clean, vibrant, and beautiful—or one that is filled with smog, pollution, filth, and disease?” Notice that you aren’t actually asking the audience to verbally or nonverbally answer the question—the question’s goal is to force the audience into thinking about what kind of world they want for their children.
The tenth way to conclude a speech is to refer to your audience. As discussed by Miller (1946), this concluding device is useful when a speaker attempts to answer a basic question for the audience, such as, “What’s in it for me?” The goal of this concluding device is to spell out the direct benefits a behavior or thought-change has for audience members. For example, a speaker talking about stress reduction techniques concludes by clearly listing all the physical health benefits that stress reduction offers, such as improved reflexes, improved immune system, improved hearing, and lowered blood pressure. In this case, the speaker is clearly spelling out why audience members should care—so what? What’s in it for me!
Ebbinghaus, H. (1885). Memory: A contribution to experimental psychology [Online version]. Retrieved from http://psychclassics.yorku.ca/Ebbinghaus/index.htm.
Ehrensberger, R. (1945). An experimental study of the relative effectiveness of certain forms of emphasis in public speaking. Speech Monographs , 12, 94–111. doi: 10.1080/03637754509390108.
Kelly, P. (1994). I got flowers today. In C. J. Palmer & J. Palmer, Fire from within. Painted Post, NY: Creative Arts & Science Enterprises.
King, M. L. (1963, June 23). Speech in Detroit. Cited in Bartlett, J., & Kaplan, J. (Eds.), Bartlett’s familiar quotations (6th ed.). Boston, MA: Little, Brown & Co., p. 760.
Miller, E. (1946). Speech introductions and conclusions. Quarterly Journal of Speech , 32, 181–183.
Solzhenitsyn, A. (1964). The first circle . New York: Harper & Row. Cited in Bartlett, J., & Kaplan, J. (Eds.), Bartlett’s familiar quotations (6th ed.). Boston, MA: Little, Brown & Co., p. 746.
University of Minnesota. (2011). Stand up, Speak out: The Practice and Ethics of Public Speaking . University of Minnesota Libraries Publishing. https://open.lib.umn.edu/publicspeaking/ . CC BY-SA 4.0.
Public Speaking Copyright © 2022 by Sarah Billington and Shirene McKay is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License , except where otherwise noted.
A good conclusion of a speech grows out of a good introduction; the two are linked together. Whatever you said in your introduction must be reiterated in the conclusion, though with a different structure and approach. It is like saying again what you intended to say in the first place.
Closing a special occasion speech is one of the most important parts of an event. It sets the tone and leaves everyone something to think about after the event. However, it is not an easy job; in fact, it can be the trickiest. The key to an effective closing speech is to make it short and simple.
See the bottom of the main Writing Guides page for licensing information.
Part I: The Introduction
An introduction is usually the first paragraph of your academic essay. If you’re writing a long essay, you might need 2 or 3 paragraphs to introduce your topic to your reader. A good introduction does 2 things:
Body paragraphs help you prove your thesis and move you along a compelling trajectory from your introduction to your conclusion. If your thesis is a simple one, you might not need a lot of body paragraphs to prove it. If it’s more complicated, you’ll need more body paragraphs. An easy way to remember the parts of a body paragraph is to think of them as the MEAT of your essay:
Main Idea. The part of a topic sentence that states the main idea of the body paragraph. All of the sentences in the paragraph connect to it. Keep in mind that main ideas are…
Evidence. The parts of a paragraph that prove the main idea. You might include different types of evidence in different sentences. Keep in mind that different disciplines have different ideas about what counts as evidence and they adhere to different citation styles. Examples of evidence include…
Analysis. The parts of a paragraph that explain the evidence. Make sure you tie the evidence you provide back to the paragraph’s main idea. In other words, discuss the evidence.
Transition. The part of a paragraph that helps you move fluidly from the last paragraph. Transitions appear in topic sentences along with main ideas, and they look both backward and forward in order to help you connect your ideas for your reader. Don’t end paragraphs with transitions; start with them.
Keep in mind that MEAT does not occur in that order. The “ T ransition” and the “ M ain Idea” often combine to form the first sentence—the topic sentence—and then paragraphs contain multiple sentences of evidence and analysis. For example, a paragraph might look like this: TM. E. E. A. E. E. A. A.
A conclusion is the last paragraph of your essay, or, if you’re writing a really long essay, you might need 2 or 3 paragraphs to conclude. A conclusion typically does one of two things—or, of course, it can do both:
Handout by Dr. Liliana Naydan. Do not reproduce without permission.
Real-time speech enhancement has begun to rise in performance, and the Demucs Denoiser model has recently demonstrated strong performance in multiple-speech-source scenarios when accompanied by a location-based speech target selection strategy. However, it has been shown to be sensitive to errors in the direction-of-arrival (DOA) estimation. In this work, a DOA correction scheme is proposed that uses the real-time estimated speech quality of its enhanced output as the observed variable in an Adam-based optimization feedback loop to find the correct DOA. In spite of the high variability of the speech quality estimation, the proposed system is able to correct in real time an error of up to 15° using only the speech quality as its guide. Several insights are provided for future versions of the proposed system to speed up convergence and further reduce the speech quality estimation variability.
organization=Universidad Nacional Autonoma de Mexico, addressline=Circuito Escolar 3000, city=Mexico, postcode=03740, state=CDMX, country=Mexico
In real-life scenarios where a speech source of interest is present and is to be processed and analyzed, various other audio sources typically coexist in the acoustic environment. This results in a mixture of the target speech source and these other sources, referred to here as “interferences,” being captured together. Additionally, other effects occur in real-life scenarios, such as noise and reverberation. The aim is to remove all of these from the mixture, such that only the target speech source remains. This task, known as “speech enhancement,” has shown significant advancements through deep learning methods [1].
When conducted offline (using previously recorded audio), speech enhancement has benefited various applications such as security [ 2 ] and music production [ 3 , 4 ] . Additionally, there is interest in performing speech enhancement in an online manner (using live audio capture), since it holds promise for a diverse range of applications, including real-time automatic speech recognition [ 5 ] , sound source localization in robotics [ 6 ] , hearing aids [ 7 ] , mobile communications [ 8 ] , and teleconferencing [ 9 ] .
It should be noted that carrying out a process in an online manner and in “real time” are usually distinguished. Online processing only requires that the execution is carried out within a time frame shorter than the capture time. Real-time processing, in addition to this constraint, must also meet specific latency or response-time requirements that are application-dependent [10, 11]. For the sake of simplicity, this work focuses solely on meeting the execution-time-less-than-capture-time criterion, and the terms “online processing” and “real-time processing” are used interchangeably.
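The execution-time-less-than-capture-time criterion can be checked directly. The sketch below is illustrative only (the function name and the toy processing step are not from this work); it times one processing call over a 0.064 s window at 16 kHz:

```python
import time

def meets_online_constraint(process_fn, audio_window, window_duration_s):
    """Return True if processing one window takes less than its capture time."""
    start = time.perf_counter()
    process_fn(audio_window)
    return time.perf_counter() - start < window_duration_s

# Toy example: a trivial gain stage over a 0.064 s window at 16 kHz
window = [0.0] * 1024  # 1024 samples / 16000 Hz = 0.064 s
ok = meets_online_constraint(lambda w: [x * 0.5 for x in w], window, 1024 / 16000)
print(ok)
```

A real deployment would also track latency percentiles, since real-time use adds application-dependent response-time bounds on top of this criterion.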
To this effect, the online speech enhancement technique known as Demucs Denoiser [12] has recently demonstrated superior performance compared to several other advanced methods when running in real-time [13]. It achieves excellent enhancement results, typically reaching an output signal-to-interference ratio exceeding 20 dB, even when using very short window segments (0.064 s). This positions it as a representative example of current advancements in online speech enhancement technologies.
However, similar to other speech enhancement techniques, the Demucs Denoiser model operates under a crucial assumption: that there is only one speech source in the mixture. This assumption may not pose a significant issue in applications like mobile telecommunication [8] and teleconferencing [9], which typically involve one user speaking at a time. In contexts such as service robotics [6] and hearing aids [7], however, where multiple speech sources are expected, speech enhancement techniques may not be suitable. In such cases, “speech separation” techniques [14] could be more appropriate, as they aim to separate multiple speech sources into distinct audio streams. Yet while speech enhancement techniques can be executed in real-time, speech separation techniques are not yet typically suitable for online applications [13].
It has been demonstrated that when inputting a mix of speech sources into contemporary online speech enhancement techniques, some algorithms isolate the speech mixture from non-speech sources, whereas others, such as Demucs Denoiser, prioritize enhancing the “loudest” speech source [ 13 ] . To this effect, the work in [ 15 ] , which this work is based on, proposed two target selection strategies so as to “nudge” Demucs Denoiser towards the speech source of interest. The strategy that generally outperformed the other was the one that selects the target speech through its location (specifically, its direction of arrival). However, it was found to be very sensitive to localization errors.
In this work, a direction-of-arrival (DOA) correction mechanism is proposed to supplement this previous work, based on maximizing the speech quality measured in the audio output. A feedback loop is implemented, where the speech quality is used by an optimization mechanism as the observed variable, and a ‘corrected’ DOA as its controlled variable. This corrected DOA is fed back to the location-based speech enhancement to obtain a new speech quality estimation, repeating the process to find the correct DOA.
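As a rough illustration of this feedback loop, the sketch below runs an Adam update on a scalar DOA using a finite-difference gradient of the observed quality, since the real estimator is not differentiable with respect to the DOA. The quality proxy, learning rate, and all constants here are assumptions for illustration, not the paper's values:

```python
TRUE_DOA = 30.0  # hypothetical ground-truth direction of arrival (degrees)

def estimated_quality(doa):
    # Stand-in for the real-time speech quality estimate of the enhanced
    # output; assumed here to peak smoothly at the correct DOA.
    return -((doa - TRUE_DOA) / 15.0) ** 2

def correct_doa(initial_doa, steps=500, lr=0.5, fd=0.5,
                beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam ascent on the observed quality: the DOA is the controlled
    variable, the quality estimate the observed variable."""
    doa, m, v = initial_doa, 0.0, 0.0
    for t in range(1, steps + 1):
        # Finite-difference estimate of d(quality)/d(DOA)
        grad = (estimated_quality(doa + fd) - estimated_quality(doa - fd)) / (2 * fd)
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad * grad
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        doa += lr * m_hat / (v_hat ** 0.5 + eps)  # ascend to maximize quality
    return doa

corrected = correct_doa(initial_doa=15.0)  # start 15 degrees off target
```

In the real system, each quality evaluation requires enhancing a new audio window and the estimate is noisy, which is why convergence speed and estimation variability are the stated concerns.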
This work has the following structure: Section 2 details the proposed system and all of its modules; Section 3 shows how the proposed system works in a real-time scenario and discusses its limitations and insights for future versions; finally, conclusions and future work are presented in Section 4.
In the following sections, these three modules are detailed. It is important to mention that, for reasons of reproducibility and ease of testing, the implementation of the proposed system is available at https://github.com/balkce/doacorrection , which includes the direction of arrival correction module, the speech quality estimation module, and the trained Demucs Denoiser sub-module. As for the beamformer sub-module, its implementation is available at https://github.com/balkce/beamformer2 .
The speech enhancement module is based on the location-based target selection strategy detailed in [15], which, for completeness’ sake, is summarized here.
The Demucs Denoiser model [12] has been shown to be effective at carrying out speech enhancement in an online manner, with relatively minimal computational power and a low response time [13]. However, that same study [13] also showed that, as with any other current speech enhancement technique, the Demucs Denoiser model assumes that only one speech source is present in the input mixture. In the case of multiple speech sources, it was shown that it tends to separate the “loudest” speech source in the input mixture. Thus, it requires a target selection strategy to appropriately select the speech source of interest (SOI). In [15], two strategies were explored, and the one based on the location of the SOI provided good results.
The location-based strategy requires knowing the location of the SOI in relation to the microphone array as a direction of arrival (θ), which can be provided by a diverse set of sound source localization techniques [6]. The target selection strategy uses a phase-based frequency-masking beamformer [16] to create a preliminary estimation of the SOI (S_beam). Although it may not be able to remove all the interferences (and even inserts some musical artifacts in its estimation), it does increase the energy of the SOI compared to the rest of the sound sources in the mixture. This increase is enough for the Demucs Denoiser model to ‘pick it up’ as the speech source to aim for and separate from the mixture.
However, [15] also showed that the location-based target selection strategy is very sensitive to localization errors: a location error of 0.1 m resulted in a 5 dB drop in the average output signal-to-interference ratio (SIR), as well as a considerable increase in result variability. Several approaches were proposed as future work in [15] to circumvent this issue, such as:
Re-train the Demucs Denoiser model with data that is the result of artificially inserting location errors. It is to be noted that the author of [15] did attempt this, but the Demucs Denoiser model seemed unable to internally correct for such errors. However, no further tests were carried out on this front (and they are out of the scope of this work), which means that this approach may still merit exploring.
Incorporate a robust localization method. Although it is also worth exploring, it does not fix any issues within the speech enhancement module; it just ‘pushes’ the responsibility elsewhere.
Incorporate a quality metric as feedback to correct the beamformer. This is the approach explored in this work.
To this effect, it is essential to have a quality metric to carry out this approach. This is detailed in the following section.
The proposed system requires that the quality of the enhanced speech be measured in an online manner. This is a research problem unto itself, with two main challenges: 1) assessing the quality of the enhanced signal without a reference signal to compare it to; and 2) doing so on a window-by-window basis (since the whole input signal is unavailable). As for the first challenge, several approaches have attempted to solve it [17, 18, 19, 20, 21, 22].
The approach employed in this work is the model known as Squim [17]. The reasoning behind this selection is that: 1) it is one of the most recent, outperforming earlier approaches; 2) it is quite popular (and, thus, tested by the community) and already implemented as part of Torchaudio [23], the deep learning framework employed here; and 3) although it has not been reported as such, it can actually be run online (partially solving the second aforementioned challenge). To further this last point, its response times were measured as part of this work, using different capture window lengths (t_w), which are shown in Table 1.
| Capture Time (seconds) | Response Time [Min, Max] (seconds) |
|---|---|
| 1.0 | [0.0234, 0.0242] |
| 2.0 | [0.0310, 0.0425] |
| 3.0 | [0.0538, 0.0704] |
Considering that these tests were carried out with a low-power GPU (Nvidia GTX 1050 Ti), these are very good response times. A time step (t_h) of 0.1 s is more than enough to measure the quality of the last 3.0 s of captured audio (t_w).
Additionally, Squim assumes that there is speech activity in its input, which may not always be the case. This can be solved by using voice activity detection (VAD) as a pre-processing step. To this effect, the Silero-VAD technique [24] was chosen because of its good performance, low response times, and low computing requirements. To provide a smooth transition between quality estimations, the latest t_w window is divided into smaller windows of length t_vad, and Silero-VAD is applied to each of them. If more than 3/4 of these t_vad windows have active speech, the latest t_w window is fed to Squim to obtain a quality estimation. This results in a series of continuous quality estimations through time, each calculated from the latest t_w window of the input signal that has active speech.
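The VAD gating described above can be sketched as follows; `vad_is_speech` and `squim_quality` are hypothetical stand-ins for Silero-VAD and Squim, not their actual APIs:

```python
def gated_quality(window, sr, t_vad, vad_is_speech, squim_quality):
    """Feed `window` to the quality estimator only if more than 3/4 of
    its t_vad sub-windows contain active speech."""
    n = int(t_vad * sr)                       # samples per VAD sub-window
    subwins = [window[i:i + n] for i in range(0, len(window) - n + 1, n)]
    active = sum(1 for w in subwins if vad_is_speech(w))
    if active > 0.75 * len(subwins):
        return squim_quality(window)          # quality of the whole window
    return None                               # no estimation at this step
```

Returning `None` when the speech-activity threshold is not met means the downstream smoothing and optimization simply skip that step, which matches the behavior of only estimating quality on windows with active speech.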
However, even when applying a VAD pre-processing step, Squim provides very ‘noisy’ quality estimations through time, as shown in Figure 2 (with t_h = 0.1 s and t_w = 3.0 s).
As will be detailed later, the direction of arrival correction module works well with a ‘smooth’ input/observed signal, which is not the case with the Squim output. To overcome this issue, exponential smoothing is applied, as presented in (1).
Q̂_k = α Q̂_{k-1} + (1 − α) Q_k (1)
where Q_k is the (raw) quality estimation at moment k in time, Q̂_k is its smoothed counterpart, and α is a smoothing factor in the range [0, 1], with higher values providing smoother results but less responsiveness to underlying changes, and vice-versa. In Figure 3, results are shown when applying different values of α to the quality estimations through time.
It can be seen that in all cases the final smoothed result is considerably less variable than the original Squim output. An α value of 0.9 is a good balance between providing a smooth output and still being somewhat responsive to underlying changes.
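A minimal sketch of this smoothing step, assuming the standard exponential-smoothing recursion with smoothing factor α:

```python
def smooth(raw, alpha=0.9):
    """Exponentially smooth a sequence of quality estimations.
    Higher alpha -> smoother but less responsive output."""
    smoothed = []
    prev = raw[0]                 # initialize with the first raw value
    for q in raw:
        prev = alpha * prev + (1.0 - alpha) * q
        smoothed.append(prev)
    return smoothed
```

In the online system this recursion is applied one estimation at a time, keeping only the previous smoothed value as state.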
The complete speech quality estimation module is shown in Algorithm 1 .
Once the quality (Q) of a given past audio window has been provided, the direction of arrival correction module uses this information to correct the localization (θ_est) estimated by a sound source localization technique. It does this by aiming to find the direction of arrival (θ_corr) that maximizes Q.
The Adam optimization method [25] is very popular in the deep learning community [26], typically used to find the weights of a given model that maximize its performance. It has been mathematically proven to converge on a non-convex objective function [27], which is of great relevance to this work, given the estimated quality variability shown in Figure 3. For this to be the case, however, the objective function is required to be twice continuously differentiable. Unfortunately, this is not ensured by purely smoothing the output of the speech quality estimation module (as described in the previous section). However, the smoothing does seem to provide an objective function that is ‘close’ to satisfying this requirement.
Additionally, Adam is well suited to carry out its optimization process in an online manner, and because of its simplicity, a low response time can be assumed. It dynamically changes the updating factor of the controlled variable (which in this case is θ) during its optimization process, considering the gradient of the optimized value (which in this case is Q). To do so, Adam tracks the first and second moments of past gradients by way of exponentially decaying averages; these moments correspond, respectively, to the gradient’s mean (or ‘momentum’) and its uncentered variance. All of this in conjunction helps it avoid ‘getting stuck’ in local optima.
The Adam-based quality optimization process is presented in Algorithm 2 .
As can be seen, it requires the current and past quality estimations (Q_c and Q_p, respectively), as well as the current and past directions of arrival (θ_c and θ_p, respectively). Three configurable parameters are meant to be set: the ‘forgetting’ factors (as they are commonly known) for the momentum and the variance (β_m and β_v, respectively), as well as the learning rate (η) used to update the direction of arrival.
It is important to mention that the Adam optimizer [25] was originally intended as a minimization process. However, it is used here to maximize the speech quality. Thus, in the implementation shown in Algorithm 2, the result from the speech quality estimation module is subtracted from 100 so that Q_c bears a value that is aimed to be minimized.
It is also important to mention that the optimization process shown in Algorithm 2 does not follow the original Adam implementation, which usually includes bias correction to avoid ‘getting stuck’ at the beginning of the optimization process. Bias correction is appropriate when the objective function is differentiable, which in turn implies that its current value can be robustly predicted given past values. However, as explained in the previous section, the output of the speech quality estimation module does not satisfy this assumption. Thus, in Algorithm 2, no bias correction is carried out as part of the Adam-based optimization.
The sound source localization estimation (θ_est) is used as the starting point of the optimization process, established as the first value of θ_c. Furthermore, θ_p is initialized at 0 so that the first calculated quality gradient has a ‘reasonable’ value. If θ_p were instead initialized with the same value as θ_c, the first calculated quality gradient would be equivalent to Q_c/ε, which results in an astronomical value. This, in turn, would make the mean and variance calculations have very similar values throughout the first iterations, effectively ‘halting’ the optimization process in the meantime. It would only begin to update adequately once that first calculated gradient is ‘forgotten’, which may take a considerable number of iterations, given its enormous value.
It is also worth pointing out that the quality gradient (∇Q) is calculated using only the current and past data points; that is, the instantaneous gradient is used. Additionally, θ_est is not used again during the optimization process. Although it would be worth exploring the impact of using more data points to calculate ∇Q, as well as updated values of θ_est throughout the optimization process, this is left for future work. For the time being, the current proposal of the Adam-based optimization process seems to be enough to provide reasonable results, as can be seen in the following section.
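Putting the above together, a single update step of the modified Adam-based optimization could be sketched as follows (an illustrative reading of Algorithm 2, not its actual code; the ε constant and variable names are assumptions):

```python
def adam_doa_step(theta_c, theta_p, Q_c, Q_p, m, v,
                  eta=0.1, beta_m=0.9, beta_v=0.999, eps=1e-8):
    """One Adam-style update of the DOA, without bias correction.
    Quality is maximized by minimizing (100 - Q)."""
    # instantaneous two-point gradient of the minimized value w.r.t. theta
    grad = ((100.0 - Q_c) - (100.0 - Q_p)) / (theta_c - theta_p + eps)
    m = beta_m * m + (1.0 - beta_m) * grad           # momentum (mean)
    v = beta_v * v + (1.0 - beta_v) * grad ** 2      # uncentered variance
    theta_new = theta_c - eta * m / (v ** 0.5 + eps)  # no bias correction
    return theta_new, m, v
```

The caller keeps `m` and `v` as state between steps, starting both at 0, with `theta_p` initialized at 0 and `theta_c` at θ_est, as described above.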
A multi-channel recording from the AIRA corpus [28] was used to test the proposed system. The recording contains two sound sources located at 1 m from the center of the microphone array, which has a triangular shape. There is a microphone at each vertex of the triangle; the inter-microphone distance is 0.18 m. The source of interest is located at around 0°, and the interference is located at around 90°. The same recording was used throughout the tests so as to not have it be a source of variability that may impede comparability between results.
A simple multi-channel reproduction program (referred to here as ReadMicWavs) was built using the JACK audio connection toolkit [29] to feed the recording to the proposed system in real-time, so as to emulate a live microphone signal. Thus, all the results shown in the following sections are from tests that were carried out in an online manner. JACK was chosen given its well-established performance in capturing and reproducing multi-channel audio in real-time, with good inter-microphone synchronicity [30, 31].
To connect all of the previously described modules, ROS2 [32] was used along with a custom message type called jackaudio that holds the window of audio data, its length, and a timestamp. ROS2 was chosen since it has been widely used, mainly in the robotics and automation community [33, 34], for near real-time communication between modules, and has shown great potential for use with low-power hardware [35].
As mentioned before, the beamformer used is the phase-based frequency masking technique [ 16 ] , implemented as part of the beamform2 ROS2 package (which can be accessed at https://github.com/balkce/beamformer2 ).
ReadMicWavs uses the transport protocol in JACK to feed the recording (as if it were a real-time captured signal) to beamform2, which acts as the beamformer sub-module shown in Figure 1. This sub-module uses the direction of arrival (fed to it through the theta ROS2 topic) to spatially filter the sound source of interest. Its results are published through the jackaudio ROS2 topic (using the custom message type of the same name) and are fed to the demucs ROS2 node, which acts as the Demucs Denoiser sub-module in Figure 1. The results from beamform2 are enhanced by demucs and published through the jackaudio_filtered ROS2 topic (using the jackaudio custom message type). These results are fed to the online_sqa ROS2 node, which acts as the speech quality estimation module in Figure 1. Its quality estimations are published through the SDR ROS2 topic and are fed to the doacorrect ROS2 node, which acts as the direction of arrival correction module in Figure 1. This module optimizes the speech quality by modifying the direction of arrival, which is published through the aforementioned theta topic, closing the loop with the beamform2 node.
Because of the multi-modular nature of ROS2, the theta topic can also be published by a node other than doacorrect , which provides flexibility for future versions of the proposed system.
The values of the parameters (detailed previously) during testing are as follows:
Capture time (t_w): 3.0 s
Time step (t_h): 0.1 s
VAD window size (t_vad): 0.032 s
Smoothing factor (α): 0.9
Momentum ‘forgetting’ factor (β_m): 0.9
Variance ‘forgetting’ factor (β_v): 0.999
t_w and t_h were chosen considering the information shown in Table 1. β_m and β_v were chosen based on the recommendations in [25], which are typically used in other works as well. α was chosen considering what is discussed in Section 2.2.
To observe the overall repeatability of the proposed system’s behavior, several ‘runs’ were carried out with each configuration. Additionally, as will be seen, several of these ‘runs’ do not exhibit an appropriate behavior: the system gets ‘lost’ and the corrected DOA does not end up near the correct DOA. To this effect, a ‘good’ run is here defined as one in which, during the last third of its run-time, the average θ is less than 5° from the correct DOA.
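This criterion can be expressed as a small check (a sketch; `thetas` stands for the per-step corrected DOA trace of a run):

```python
def is_good_run(thetas, correct_doa, tol=5.0):
    """A run is 'good' if, over the last third of its trace,
    the average corrected DOA is within `tol` degrees of the truth."""
    last_third = thetas[-(len(thetas) // 3):]
    mean_theta = sum(last_third) / len(last_third)
    return abs(mean_theta - correct_doa) < tol
```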
The learning rate parameter (η) is one of the cornerstones of the proposed system. To observe its effect, several runs were carried out using different values. The results are shown in Figure 4, where the dark blue line is the mean θ at each moment in time, and the space below and above this line represents the standard deviation (as a measure of variability).
From these results, it can be surmised that a low η value (such as 0.01, shown in Figure 4a) gets ‘stuck’ early in the optimization process, but provides more consistent (less variable) results. On the other hand, a high η value (such as 0.3, shown in Figure 4d) does not get ‘stuck’ as easily, but provides much less consistent (more variable) results and tends to overshoot the target. A good balance between these two scenarios is an η value of 0.1 (shown in Figure 4b), which provides consistent results while not getting ‘stuck’ at the start and not overshooting the target.
To validate the removal of the bias correction in the Adam-based optimization shown in Algorithm 2, two tests were carried out as previously described, the results of which are shown in Figure 5. The difference between them is that one was carried out with bias correction (Figure 5a), and the other without (Figure 5b).
As can be seen, when applying bias correction, no ‘good’ runs were observed and no tendency towards the correct DOA can be observed. This is opposed to when no bias correction is carried out, in which case a considerable number of ‘good’ runs are observed and the overall tendency of the system is toward the correct DOA. This demonstrates that the overall response of the proposed system is improved when not applying bias correction in the Adam-based quality optimization.
To observe the effect of the estimated direction of arrival, several runs were carried out varying θ_est, and the results are shown in Figure 6. This variation simulates scenarios in which a sound source localization technique provides an estimated direction of arrival with varying degrees of error.
As can be seen, the proposed system is able to correct the DOA when θ_est is less than 20° away from the correct DOA. From 20° onward, the number of ‘good’ runs is reduced considerably, with θ_est = 25° being the limit beyond which the system is not able to ‘recover’. Additionally, it can also be seen that the greater the error in θ_est, the more time it takes the proposed system to reach a value near the correct DOA.
It is worth pointing out that when θ_est = 1°, the proposed system actually inserts a DOA error instead of correcting it, even though its starting DOA is close to the correct DOA. As will be discussed further in Section 3.6, this is due to the fact that the system is not able to converge on a single value (because of the input variability and the nature of the optimization approach). Future efforts are to be made so that the proposed system’s behavior is more stable.
Another noteworthy case is θ_est = 15°, which is an outlier to the following tendency: the farther θ_est is from the correct DOA, the lower the number of ‘good’ runs. This is due to the fact that the proposed system’s behavior starts with a noticeable downward ‘push’ (which is the result of the θ_p ← 0 step in Algorithm 2), whose size depends on the value of the first quality estimation. In fact, a tendency in the size of this downward ‘push’ can be observed in Figure 6: the closer θ_est is to the correct DOA, the smaller the ‘push’. This turned out to specifically benefit the case of θ_est = 15°, since that initial downward ‘push’ makes the proposed system land close enough to the correct DOA (probably just inside the valley of the global optimum in the search space) that the subsequent DOA correction only needs to refine its result. It is important to mention that the value of η was chosen while θ_est was set at 15°, so this is evidence of a type of over-fitting of the system’s behavior.
As will be discussed in Section 3.6, it is thus of great interest to explore an automatic parameter calibration scheme that involves selecting all the parameters in conjunction, so as to obtain a more generalizable optimization behavior. Having said all of this, the combination of α = 0.9 and η = 0.1 seems to provide an acceptable optimization performance, given that θ_est < 20°.
To further inspect the generalizability of the proposed system, another test was carried out with θ_est = 105°, which is closer to the other source located at 90°. The result is shown in Figure 7.
As can be seen, the proposed system got close (±5°) to the correct DOA.
Additionally, it is also of interest to explore the proposed system’s performance with more interferences present. Figure 8 shows the results of a 3-source scenario, with one source located at around 90°, another at around 180°, and a third at around 0°. The three runs, respectively, had θ_est = 105°, θ_est = 195°, and θ_est = −15° (a 15° initial error). As can be seen, the first two runs got close (±5°) to the correct DOA. The third run did not get as close, but it is definitely trending towards the correct DOA.
As far as the author knows, this is the first successful attempt to carry out real-time direction-of-arrival correction using only the estimated speech quality as feedback. Overall, the proposed system performs this task well. However, there are some issues that need to be considered for later versions of the proposed system.
First off, the system can be considered slow: in general, a ‘good’ run takes up to 40 seconds (as shown in Figure 6d) to reach a value near the correct DOA. Unfortunately, accelerating the optimization process requires increasing the learning rate, which results in the system getting ‘lost’ more frequently. Also, the proposed system seems to be sensitive to the initial circumstances in which the DOA correction is carried out (which is the reason why several runs needed to be carried out for each test). It is the belief of the author that this is due to the high variability of the quality estimations. The causes of this may be many-fold. For one, the quality estimations are initially provided by Squim and are highly variable. Additionally, although ROS2 is presumed to be able to run in real-time, the manner in which it interacts with a true real-time server, such as JACK, may result in an overrun: the execution time for a given input data window ends up being higher than its capture time. The rate of this occurrence appears to be agnostic to the response time of the system (which is quite low, as will be discussed later); a possible reason is that the internal communication mechanisms of ROS2 are not built to truly run in real-time. In any case, when an overrun occurs, a zero-valued output data window is returned, resulting in small “breaks” (or silences) in the enhanced signal. The quality estimation is then affected, resulting in more variability. All of this in conjunction makes it difficult for the optimization process to converge on a single value. Instead, a slight but continuous variation of the corrected DOA is observed, the average of which is usually close (±5°) to the correct DOA.
However, in cases when the initial DOA estimated by the sound source localization technique is already close to the correct DOA (as in the case of Figure 6a ), the optimization process will introduce errors in an already close-to-correct estimated DOA.
In addition, it was observed that when the source was not well enhanced in its ideal state (i.e., when accurately located from the start), the quality estimation objective function provided little to no quality improvement compared to other values of θ_est. This resulted in a sparse search space that is difficult to optimize. Fortunately, in these cases, the behavior of the proposed system still ‘tended’ towards the correct DOA (as exemplified in Figure 8c), although it implies the need for more time to converge.
Thus, the optimization process would benefit from other types of approaches, such as those applied in control engineering [36, 37], that can handle the input variability while converging quickly on a single steady value. Also, although substituting ROS2 as the inter-module communication framework is outside the scope of this work, it would be of interest to explore ways to make the real-time interaction between ROS2 and JACK more seamless.
Next, it can be assumed that if and when the proposed system converges on the correct DOA, the speech quality is maximized. Thus, it can be argued that the proposed system carries out, simultaneously, both DOA correction and speech quality maximization. However, the latter was not discussed here because the main focus of this first version of the proposed system is DOA correction. Also, it is suspected that the speech quality improvement is minimal. To be fair, this assessment was obtained informally during the implementation of the proposed system, where the resulting speech was subjectively evaluated. Thus, it is left for future work to formally assess the speech quality improvement, as well as its impact on other possible subsequent steps in the audio processing data flow, such as sound source classification or speech recognition.
Furthermore, the only parameters whose impact on the system's behavior was characterized in this work were η and α. The former resulted in a type of over-fitting to the specific case of θ_est = 15° (as shown in Figure 6), while the latter was admittedly superficial and subjectively chosen. The impact of the values of other parameters (such as β_m and β_v) is also worth exploring; it would be of interest to observe how their values, in conjunction, impact the system's optimization behavior. Additionally, it would be of interest to create an automatic parameter calibration scheme that could run alongside the proposed system and provide a more dynamic, off-the-shelf application.
Moreover, it was observed that the bias correction in the Adam-based optimization made the proposed system get 'stuck' prematurely. This is counter-intuitive, since the main objective of the bias correction is to prevent Adam from getting stuck at the beginning of the optimization process. However, given that the objective function is not differentiable, it was expected that the original Adam implementation would need modification to work well. Interestingly, since no bias correction is carried out, the global optimum can be considered dynamic, as opposed to the assumption of a static global optimum in the original Adam implementation. This opens the possibility of tracking mobile sound sources in later versions of the proposed system.
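Since the exact update equations are not listed here, the following is a minimal sketch of an Adam-style ascent step with the bias-correction terms removed, as described above. The function name and parameter defaults (eta, beta_m, beta_v) are illustrative stand-ins for the section's symbols, not the authors' implementation.

```python
import math

def adam_step_no_bias_correction(theta, grad, m, v,
                                 eta=0.1, beta_m=0.9, beta_v=0.999, eps=1e-8):
    """One Adam-style ascent step without the 1/(1 - beta^t) corrections.

    Leaving out the corrections damps the first steps and lets the moment
    estimates keep tracking a moving (dynamic) optimum instead of locking
    on prematurely.
    """
    m = beta_m * m + (1.0 - beta_m) * grad          # first-moment estimate
    v = beta_v * v + (1.0 - beta_v) * grad ** 2     # second-moment estimate
    theta = theta + eta * m / (math.sqrt(v) + eps)  # ascent: maximize quality
    return theta, m, v
```

On a toy concave quality function, iterating this step with a finite-difference gradient walks θ towards the maximizer, which mirrors how the DOA estimate is nudged towards the quality-maximizing angle.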
Additionally, it was observed that the Adam-based optimization was better served when the speech quality estimation module had run for a short while (~10 s) before running the DOA correction module. It appears that the combination of the Squim model and the exponential smoothing requires some initial data to provide a 'moderately stable' objective function signal. This, in turn, creates an initial state that keeps the Adam-based optimization from getting 'lost' at the start.
Finally, since the proposed system is meant to run in real time, its response time is of interest. Since all the modules shown in Figure 1 run in parallel, the overall response time depends solely on the 'slowest' module. To this effect, the beamformer sub-module runs as a client of JACK, which has a maximum latency of 0.021 s (as configured); during testing, this sub-module showed a response time close to half that amount, with a minimal number of overruns (<5 in an hour-long evaluation). The direction-of-arrival correction module runs a simple Adam-based optimization process, which bears a response time lower than 0.001 s. The speech enhancement sub-module runs the Demucs model, which has a measured response time between 0.006 and 0.009 s to enhance a window of 0.063 s. And the speech quality estimation module runs the Squim model which, as shown in Table 1, has a measured response time between 0.0538 and 0.0704 s to measure the quality of a 3-second window. Since this last module is the slowest, the system's response time is between 0.0538 and 0.0704 s.
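The reasoning above, that a fully parallel pipeline is only as slow as its slowest module, can be sketched in a few lines. The dictionary keys and the single-value latencies for the beamformer and DOA modules are illustrative placeholders derived from the figures quoted in this section.

```python
# Per-module response times in seconds, following the measurements reported
# above; module names and single-value entries are illustrative.
module_latency = {
    "beamformer (JACK client)": 0.011,            # ~half the 0.021 s budget
    "DOA correction (Adam step)": 0.001,
    "speech enhancement (Demucs)": (0.006, 0.009),
    "quality estimation (Squim)": (0.0538, 0.0704),
}

def overall_response_time(latencies):
    """All modules run in parallel, so the pipeline is only as fast as its
    slowest module: take the max over the per-module (lo, hi) bounds."""
    lo = max(t[0] if isinstance(t, tuple) else t for t in latencies.values())
    hi = max(t[1] if isinstance(t, tuple) else t for t in latencies.values())
    return lo, hi
```

With these numbers, the slowest module is the Squim-based quality estimator, so the overall bound is (0.0538, 0.0704) s, matching the conclusion above.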
Speech enhancement carried out in an online manner is of great interest in various areas of application. However, doing so with deep-learning-based models has proven challenging, given their long response times. To this effect, recent efforts have focused not only on making this type of model run in real time while still providing comparable performance, as is the case of the Demucs Denoiser model, but also on enabling them to focus on a given speech source of interest in a multi-speech environment. One of these efforts uses the location (mainly, the direction of arrival) of the source of interest to make Demucs Denoiser "focus" on it. However, this approach has been shown to be sensitive to location errors.
In this work, a direction-of-arrival correction system is proposed that is based on maximizing the speech quality of the enhanced output. Through a feedback loop, quality estimation is carried out via the Squim model, whose output is post-processed using exponential smoothing to provide a close-to-differentiable objective function. An Adam-based optimization scheme is then applied to find the direction of arrival that maximizes this function.
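As a rough illustration of the feedback loop's post-processing step, here is a minimal exponential smoothing pass over a stream of noisy quality scores. The smoothing factor `alpha` is a stand-in for the paper's α, whose actual value is not given here.

```python
def exponential_smoothing(raw_scores, alpha=0.3):
    """Smooth a stream of noisy quality estimates (e.g. from a Squim-like
    model) so the resulting objective is closer to differentiable.
    alpha = 0.3 is illustrative, not the paper's tuned value."""
    smoothed = []
    s = raw_scores[0]  # seed with the first observation
    for x in raw_scores:
        s = alpha * x + (1.0 - alpha) * s
        smoothed.append(s)
    return smoothed
```

For white-noise-like fluctuations, the smoothed sequence has markedly lower variance than the raw one, which is what makes the objective tractable for the gradient-style optimizer.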
It was shown that the proposed system works well, in that it is able to correct the direction of arrival towards the correct location. However, it was found that the system is sensitive to the learning rate of the Adam-based optimization, for which a recommended value was provided. Making the process less sensitive to this value, as well as reducing the variability of the speech quality estimations, is of great interest and will be carried out as future work.
The author would like to acknowledge the support of PAPIIT-UNAM through the grant IN100624.
Uchida Shinichi, Deputy Governor of the Bank of Japan, May 27, 2024
It is my great pleasure to welcome all of you to this conference. As Governor Ueda mentioned in his opening remarks, the Bank of Japan is conducting the "Broad Perspective Review" of our monetary policy over the past 25 years. In short, it has been a battle against persistent deflation and a battle with the zero lower bound.
Let me start by giving an overview of the inflation picture during this period. Please look at Chart 1. Japan's deflation started in the late 1990s and continued for 15 years. The average inflation rate was just minus 0.3%. It was a mild but persistent deflation.
To tackle this situation, the Bank introduced the 2% price stability target and Quantitative and Qualitative Monetary Easing, or QQE, in 2013, and a negative interest rate policy and Yield Curve Control, or YCC, in 2016. As a result, we succeeded in achieving a situation without deflation, but the average inflation rate was 0.5%, which fell short of our 2% goal. Recently, the inflation rate has risen to around 3%, following the global inflation.
The big question is whether the current change in the inflation picture means an irreversible, structural change from deflation, or just a temporary phenomenon led by global inflation. In this speech, I will try to give an answer to this important question, which has implications for the future course of our monetary policy as well as Japan's economy.
The bursting of the asset bubble and chronic shortages of demand.
For this, we need to go back to the 1990s and explore the causes of Japan's deflation. As background to this deflation, from a real economy perspective, Japan's economy experienced two things: a decline in the growth trend and chronic shortages of demand. You can see these in Chart 2.
The causes of these developments are compound. The most important factor appears to be the bursting of the asset bubble in the early 1990s. This was followed by financial system turmoil and painful balance sheet adjustments in the corporate sector. Companies had to address excess capacity, excess labor, and debt-overhang. Against this backdrop, they became more and more reluctant to take risks and were slow to adjust their operations to the globalization trend brought about by the rises of the emerging economies. As shown in the left-hand panel of Chart 3, the corporate sector turned to a net saving position. Companies invested their limited resources mostly abroad, as shown in the center panel of Chart 3. This lowered the accumulation of capital stock and the growth rate of labor productivity and hence, the potential growth rate, as shown in the right-hand panel of Chart 3.
In this environment, the natural rate of interest, r* , declined earlier and to a greater extent than in other countries. It is always difficult to estimate r* , and various models give us different figures ranging from minus 1% to plus 0.5%, as you can see in Chart 4. But it is safe to say that our r* is low and has been declining over time. Otherwise, we cannot explain what has happened over these decades.
In addition to the bursting of the bubble, the declining and aging population might have affected r* . The impact of demography on r* is not straightforward, even theoretically. r* is often related to the per-capita growth rate of GDP. So, a declining population itself would not necessarily lower r* if the size of the economy went hand-in-hand with labor input. But still, a higher dependency ratio should lower per-capita growth, as you can see in Chart 5.
The answer to the problem of dependency is clear: continue working. The good thing is that senior people are much healthier than before. But the move toward continued employment did not happen until the 2010s. The labor force participation rate of seniors started to rise from 2012, as you can see in Chart 6. Japan then experienced a labor-shortage situation for the first time since the bursting of the bubble, with the aggressive stimulus to the economy by QQE and other policies. Before then, companies did not necessarily have to rely on senior workers.
As you know, Japan is a frontrunner among countries with aging populations. In 2019, when Japan hosted the G20 meetings, "aging" was one of the priority topics. Participants discussed various issues on this topic and reached the natural conclusion: the impact of aging is complicated. When senior citizens reduce savings in their life cycle, r* rises. But, if people have strong concerns over the risks associated with increased longevity, younger generations save more and seniors slow the pace at which they use their savings. I am not saying that a declining and aging population is a problem by itself. I would rather stress that society appears to have failed or been slow to address this issue properly.
We tend to have a negative attitude when we discuss demographic issues. Companies tended to focus on the demand side and worry about shrinking domestic markets. Of course, a declining population also means a decline in the labor force. However, the supply-side implications were ignored or marginalized during the course of deflation, for good reason, I would say. As you can see in Chart 7, during that period, colored in blue, companies felt that they had more than enough employees. I will return to this issue later. Here, I just want to stress that the labor market is the key.
Let's move on to inflation. Actual and expected inflation declined in the 1990s, stayed low in the 2000s, and then rose somewhat after 2013 (Chart 8).
There are two distinctive features in our inflation expectations. First, inflation expectations in Japan have a high positive correlation with growth trends or growth expectations, as you can see in Chart 9. Secondly, the formation of medium- to long-term inflation expectations is adaptive rather than forward-looking, as shown in Chart 10. Of course, these observations are far from ideal. No central banker would welcome them. It means that inflation expectations are not anchored and fluctuate, reflecting real variables and actual inflation rates.
Basically, what happened is as follows (Chart 11). In the 1990s and 2000s, the inflation rate declined due to chronic demand shortages. The growth trend and r* declined, and the Bank of Japan's monetary policy, which was mostly conventional at that time and faced with the zero lower bound constraint, could not sufficiently stimulate demand. Prolonged weak demand prevented the inflation rate from rising. It is natural that people lack faith in the central bank's ability to raise the inflation rate and, so, inflation expectations remained low. All in all, our monetary policy did not have enough power to lift up the actual and expected inflation under the zero lower bound constraint, while it would be fair to mention that the policy measures then contributed to protect the financial system by providing ample liquidity.
Since 2013, we have overcome the zero lower bound to some extent through the introduction of QQE and YCC. As you can see in Chart 12, real interest rates were in negative territory, and monetary policy has been very accommodative, even while r* has been low. But ten more years were needed to give enough stimulus to change the whole picture of the economy.
Formation of deflationary norm.
So far, I have explained the mechanism by which Japan's economy dropped into deflation and failed to exit from it, from the perspective of a central banker. This is the main component of our story, though I am afraid it may not be particularly interesting from an academic point of view. In the end, it was just a typical story of the zero lower bound.
To paint a complete picture, however, we need to add another set of stories. There is one phenomenon which only Japan has experienced. The mild but persistent deflation created a social norm based on the belief that "today's prices and wages will be the same tomorrow." I use the word "social." It is not just an economic phenomenon.
Chart 13 shows the distribution of the inflation rate for all goods and services in the CPI. You can see that most items have concentrated around zero % in Japan. In the US, the peak was around 2%, and the distribution is much wider compared with Japan. Of course, these figures are before the pandemic and the recent global inflation. In Japan, the price-setting behavior based on the belief that there would be no change in prices and wages spread widely among companies and became a kind of norm. They all kept their prices unchanged for fear of losing customers.
How does this come to be the norm? Again, the initial trigger was chronic demand shortages. A typical textbook way of handling demand shortages would be to reduce prices, downsize production, and reduce the number of employees, in the hope of a later recovery. But that was not the case in Japan. There was a strong consensus in society that employment should be maintained as long as possible. Companies continued to hold on to their labor force, and the government helped them by providing various subsidies, furlough programs, and public financing. As you can see in Chart 14, the unemployment rate was not very high, even at its peak, and we had limited cases of bankruptcy.
On the price front, companies continued to face harsh competition, as their rivals were still there. As for wages, employees started to accept reduced wages in exchange for job security (Chart 15). Companies also tried to replace retiring employees with part-time workers. As in Chart 16, price markups declined significantly, while wage markdowns increased.
In our large-scale survey of the corporate sector, companies responded that, in the face of harsh competition, they refrained from passing on costs to their customers, as you can see in the left-hand panel of Chart 17. Instead, they tried to cut costs by reducing wages and by asking their suppliers to reduce prices, or they just accepted smaller profit margins as shown in the right-hand panel of Chart 17.
They also argued that, under this deflationary norm, it was difficult to move in the direction of making better products and raising prices. As the left-hand panel of Chart 18 shows, more than 70% of the respondents prefer a future landscape with mild, positive price and wage inflation over one with zero inflation. They believe such an environment will make their businesses easier, as they can pass on costs to their customers. They also said that if such a situation is realized, they would invest more and raise wages, instead of just cutting costs, as shown in the right-hand panel of Chart 18.
Those results correspond with some stylized facts. For the past 25 years, Japanese companies have implemented process innovations that cut costs, rather than product innovations that develop new products. This explains and is explained by the decline in markups. Companies did not invest enough in R&D and failed to differentiate their products sufficiently from their competitors.
It is often argued that deflation or the deflationary mindset in Japan is the fundamental cause of the slow economic development during that period. Here, you may argue, causes and consequences are being confused. You may also argue that relative prices and overall inflation are being mixed up. In theory, of course, changes in the relative prices between individual products can happen even when the overall inflation rate is zero. Companies can raise their relative prices regardless of the overall inflation rate. Basically, I agree. It is difficult to tell whether and through what channels this norm has adversely affected the economy.
There are several possible candidates for theoretical explanations of the norm, such as the nominal rigidity of wages and menu costs. Here I would like to focus on menu costs, as I believe this illustrates some important aspects of what happened during this period.
In Chart 19, we can see that the frequency of price changes in Japan declined in the 1990s, especially in the service sector. The frequency of price increases, that is, the share of prices that saw an upward change, as shown by the blue line, declined along with trend inflation. In this context, the decline in frequency itself is quite natural, but the extent of the decline is large. Meanwhile, the frequency of price decreases, that is, the share of prices that saw a downward change, the green line, rose only slightly, even though trend inflation declined. A salient decline in the former and a modest rise in the latter both suggest that companies have become more reluctant to change their prices. These observations suggest that there has been an increase in menu costs. Please look at Chart 17 again. Based on the survey mentioned above, companies responded that they refrained from passing on costs to their prices because, among other things, they were afraid of losing their reputation. That is why I used the word "social" norm to describe this "economic" phenomenon.
Higher menu costs, together with mild inflation, have slowed the pace of price adjustment. And for us, the central bank, this requires more effort to get out of this situation. The "no change in prices and wages" norm worked as if inflation expectations were anchored at zero %. And the gravity towards zero % is stronger than a 2% anchor; as you can see, the 0% peak of the distribution in Japan's CPI is much higher than the 2% peak seen in that of the US.
Two things are required to escape this situation. First, we need to resolve the original causes of deflation, that is, demand shortages and consequent excess labor supply. Secondly, we need to overcome the threshold of menu costs, or more fundamentally, the deflationary norm.
As to the first, QQE and other accommodative monetary policy tools provided powerful stimulus to the economy and, together with government measures, created more than 5 million jobs, mainly for women and seniors (Chart 20). It was basically a high-pressure economic strategy.
Chart 21 shows what happened in the labor markets during the period under QQE. In this period, the year-on-year rate of change in employee income was stable at around 2-3 percent. Prior to the pandemic, the increase was driven by a rise in the number of employees, the blue bars. But since the pandemic, the increase has been led by a rise in wages, the white bars, given the limited room for additional labor supply of women and seniors. The labor market structure appears to have changed after the pandemic, and wages are likely to continue increasing.
In other words, when we started QQE in 2013, there was considerable slack in the economy. We did not expect this scale of additional labor coming from women and seniors. Of course, this should be taken as a favorable development in addressing our demographic challenges. In addition, there was another type of slack, a kind of hidden slack, in which companies continued to provide too many services to their customers for free, which is possible only with an abundant labor force. The Bank of Japan has spent ten years providing high pressure, aiming to remove all of these forms of slack from the economy.
Another issue we need to address is to overcome the threshold of menu costs, or the deflationary norm. As an initial response, menu costs appear to have been normalized after the global inflation. Please see Chart 19 again. The frequency of price changes has returned to the levels of the early 1990s. And in Chart 22, the shape of the distribution of the CPI has changed significantly. Now the distribution is wider and the peak is lower, which is not surprising in the current situation. The question is whether this is a temporary development due to the recent global inflation or an irreversible, structural change. I will discuss this issue again shortly, focusing more on the fundamental issue of the deflationary norm.
Before I conclude, let me summarize today's explanation in Chart 23. The bursting of the asset bubble in the 1990s was followed by financial system turmoil and painful balance sheet adjustments in the corporate sector. Companies and society as a whole were slow to address trends in globalization and, more importantly, the declining and aging population.
The trend growth rate declined, coupled with chronic demand shortages, particularly in the labor market. r* also declined. On the price front, actual and expected inflation declined, and the Bank of Japan's monetary policy did not have enough power to lift them up under the zero lower bound constraint.
Faced with this difficult situation, the consensus in society was that employment should be maintained. Labor hoarding by companies, supported by the government, contributed to maintaining economic and social stability in exchange for leaving the excess labor and excess numbers of companies. Price markups declined significantly, along with an increase in wage markdowns. "No change in prices and wages" became the norm. Companies proceeded with process innovations rather than product innovations, which affected and was affected by markups.
From 2013, the Bank of Japan started to provide high pressure to the economy under QQE and YCC, which, together with government measures, created millions of jobs, mainly for women and seniors, and gradually changed labor market conditions, resulting in labor shortages. And the recent global inflation has put final pressure on the deflationary norm. That is the summary of today's explanation.
Now it is time to answer the question I posed at the beginning of this speech. Are these trends irreversible? As I mentioned, two things are needed to change the current situation: resolving the original causes of deflation and overcoming the deflationary norm.
As to the first issue, resolving the original causes of deflation, I can confidently answer "yes." Labor market conditions have changed structurally and irreversibly. We cannot expect much more labor participation to come from women and seniors. The job participation rate for women in Japan is now higher than that in the US. To be precise, there is still some more room for supply. Women may work longer hours as full-time workers. Companies are extending the retirement age to keep their labor force. These efforts are reasonable and necessary, but they are not enough to change the overall picture.
As to the second issue, overcoming the deflationary norm, the answer is not so clear. Will companies continue the current price-setting behavior even after the cost-push pressure from global inflation wanes? The key once again is the labor market. If the structural changes in the labor market continue, companies will have to build business models that generate enough profits and wages to keep and attract employees. As to price-setting strategy, companies need to rewrite their prices in their menus promptly, reflecting their labor costs while paying due attention to the possible impact on demand for their products.
In the end, the "social norm" is set to be dissolved. The main driving force for these developments and long-awaited structural changes is labor shortages. Labor shortages drive individual companies' transformations and the dynamics of the whole economy, while this process entails relatively low transition costs, as it is less likely to give rise to unemployment.
After ten years of experience under QQE, YCC, and the negative interest rate policy, the Bank of Japan declared this March that these unconventional policy tools had fulfilled their roles. We returned to a conventional monetary policy framework, aiming at a 2% price stability target through adjustments of the short-term policy rate, which means we have overcome the zero lower bound. While we still face the big challenge of anchoring inflation expectations at 2%, the end of our battle is in sight.
So, I would like to conclude my speech with this phrase: "This time is different."
Detection and Segmentation of Mouth Region in Stereo Stream Using YOLOv6 and DeepLab v3+ Models for Computer-Aided Speech Diagnosis in Children

Contents: 2. Materials; 2.1. Data Acquisition Equipment; 2.2. Image Dataset Management; 3.1. Preprocessing and Data Preparation for Training; 3.2. Object Detection; 3.3. Image Segmentation; 3.3.1. Preparation of Weak Labels Using Level Set Segmentation; 3.3.2. DeepLab v3+ Model; 4. Experiments and Results; 4.1. Object Detection; 4.2. Image Segmentation; 4.3. Time Consumption and Processing Speed; 5. Discussion; 6. Conclusions; Supplementary Materials; Author Contributions; Institutional Review Board Statement; Informed Consent Statement; Data Availability Statement; Conflicts of Interest; Abbreviations.
| Abbreviation | Meaning |
|---|---|
| SLP | Speech and language pathologist |
| CASD | Computer-aided speech diagnosis |
| CAD | Computer-aided diagnosis |
| ROI | Region of interest |
| CLAHE | Contrast-limited adaptive histogram equalization |
| DRLSE | Distance-regularized level set evolution |
| CNN | Convolutional neural network |
| YOLO | You only look once |
| DLv3+ | DeepLab v3+ |
| IoU | Intersection-over-union |
| AP | Average precision |
| PvR | Precision vs. recall curve |
| DSC | Dice similarity coefficient |
| Acc | Accuracy |
| Subset | Total Frames | Children | Frames per Child, Median (Min–Max) | Lips | Teeth | Tongue |
|---|---|---|---|---|---|---|
| Subset A | 16,156 | 35 | 407 (290–880) | 16,079 | 10,738 | 6766 |
| Subset B | 1092 | 25 | 45 (29–47) | 1089 | 525 | 289 |
| Subset C | 665 | 16 | 45 (16–54) | 664 | 421 | 212 |
| Total | 17,913 | 76 | | 17,832 (99.5%) | 11,684 (65.2%) | 7267 (40.6%) |
| IoU Threshold | Metric | Lips | Teeth | Tongue | Overall |
|---|---|---|---|---|---|
| 0.50 | AP | 0.999 | 0.745 | 0.651 | 0.798 |
| 0.50 | F1 | 0.999 | 0.794 | 0.664 | 0.798 |
| 0.75 | AP | 0.999 | 0.748 | 0.661 | 0.803 |
| 0.75 | F1 | 0.999 | 0.788 | 0.659 | 0.798 |
| Metric | Model | Lips | Teeth | Tongue | Overall |
|---|---|---|---|---|---|
| AP | YOLOv3 | 0.999 | 0.612 | 0.276 | 0.629 |
| AP | YOLOv5 | 0.999 | 0.532 | | 0.768 |
| AP | YOLOv6 | 0.999 | 0.745 | | |
| AP | YOLOv7 | | 0.744 | 0.624 | 0.789 |
| F1 | YOLOv3 | | 0.645 | 0.336 | 0.641 |
| F1 | YOLOv5 | | 0.749 | 0.543 | 0.753 |
| F1 | YOLOv6 | | 0.794 | | 0.798 |
| F1 | YOLOv7 | | | | 0.605 |
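For context on the 0.50 and 0.75 thresholds used in the tables above, here is a straightforward intersection-over-union computation for axis-aligned boxes; a detection is typically counted as correct when its IoU with the ground-truth box reaches the threshold. This is a generic sketch, not the authors' evaluation code.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2).
    A detection counts as correct at threshold t when iou(...) >= t."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))   # overlap width
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))   # overlap height
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0
```

Raising the threshold from 0.50 to 0.75 demands tighter localization, which is why the AP values at the two thresholds can differ.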
Sage, A.; Badura, P. Detection and Segmentation of Mouth Region in Stereo Stream Using YOLOv6 and DeepLab v3+ Models for Computer-Aided Speech Diagnosis in Children. Appl. Sci. 2024 , 14 , 7146. https://doi.org/10.3390/app14167146
Gov. Josh Shapiro of Pennsylvania rallied Democrats in his home state behind the Democratic ticket after the conclusion of a vice-presidential search that prompted intense public scrutiny of his views on Israel.
By Katie Glueck
For years, Gov. Josh Shapiro of Pennsylvania has said that his Jewish faith drives his commitment to public service.
But as he wrapped up a fiery speech in Philadelphia on Tuesday, after the conclusion of a vice-presidential search process that prompted intense public scrutiny of his views on Israel, Mr. Shapiro’s familiar references to his religious background took on a raw new resonance. And he seemed to sound a note of defiance.
“I am proud of my faith,” he said, his voice rising, speaking slowly and deliberately to sustained applause.
Mr. Shapiro’s comments came as part of a well-received speech welcoming Vice President Kamala Harris and her new running mate, Gov. Tim Walz of Minnesota. Throughout, Mr. Shapiro praised the ticket effusively.
But the moment followed an ugly final phase of Ms. Harris’s search.
Mr. Shapiro’s positions on the Middle East , his allies have noted, are well within the Democratic mainstream, and were not markedly different from other vice-presidential candidates under consideration.
Yet Mr. Shapiro — dubbed “Genocide Josh” by some activists — drew outsize attention on the subject, his supporters said, and some saw that focus as driven by antisemitism.
Left-wing and pro-Palestinian activists and other critics vehemently denied that, saying they objected because they saw Mr. Shapiro as too sympathetic to Israel and overly critical of campus Gaza-war protests, not because of his faith.
Some Democrats worried that elevating Mr. Shapiro to the ticket would reignite the battles raging in the party over Israel and the Gaza war — divisions that still exist, but had been more muted after President Biden said he would not seek re-election.
For others — especially for some Jewish Democrats who have struggled with their place in the party against that backdrop — the spotlight on Mr. Shapiro was painful.
In his remarks, Mr. Shapiro emphasized the role his faith — which he did not highlight by name — has played in his public life.
“I lean on my family, and I lean on my faith, which calls me to serve,” Mr. Shapiro said.
Soon after, he was back to stumping for the Democratic ticket.
The introduction of a speech is incredibly important because it needs to establish the topic and purpose, set up the reason your audience should listen to you and set a precedent for the rest of the speech. ... When we restate the thesis statement at the conclusion of our speech, we're attempting to reemphasize what the overarching main idea ...
The general rule is that the introduction and conclusion should each be about 10-15% of your total speech, leaving 80% for the body section. Let's say that your informative speech has a time limit of 5-7 minutes: if we average that out to 6 minutes that gives you 360 seconds. Ten to 15 percent means that the introduction and conclusion should ...
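The 10-15 percent guideline above is simple arithmetic. As an illustrative sketch (the helper name and default percentages are mine, chosen to match the guideline, not taken from any particular source), it can be written as:

```python
def time_budget(total_seconds, pct_low=10, pct_high=15):
    """Suggest second ranges for intro, body, and conclusion,
    using the common 10-15% guideline for the intro and conclusion."""
    lo = total_seconds * pct_low / 100
    hi = total_seconds * pct_high / 100
    return {
        "introduction": (lo, hi),
        "conclusion": (lo, hi),
        # Whatever the intro and conclusion don't use goes to the body.
        "body": (total_seconds - 2 * hi, total_seconds - 2 * lo),
    }

# A 5-7 minute informative speech averages out to 6 minutes (360 seconds):
print(time_budget(6 * 60)["introduction"])  # (36.0, 54.0) -> roughly 36-54 seconds
```

So for the 6-minute speech in the example, the introduction and conclusion should each run about 36 to 54 seconds.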
Melissa Butler's speech ending: "When you go home today, see yourself in the mirror, see all of you, look at all your greatness that you embody, accept it, love it and finally, when you leave the house tomorrow, try to extend that same love and acceptance to someone who doesn't look like you."
Conclusions should reinforce the message and give the speech unity and closure. Do: summarize the main points of your speech; restate your purpose or thesis; create closure, a sense of finality; and, in persuasive speeches, make a final call for commitment or action. Don't: open new areas of discussion or argument.
The introduction is the speaker's first and only chance to make a good impression, so, if done correctly, your speech will start strong and encourage the audience to listen to the rest. Speech Introductions. The introduction for a speech is generally only 10 to 15 percent of the entire time the speaker will spend speaking.
Simon Sinek's speech ending line: "Listen to politicians now, with their comprehensive 12-point plans. They're not inspiring anybody. Because there are leaders and there are those who lead. Leaders hold a position of power or authority, but those who lead inspire us."
An introduction speech, or introductory address, is a brief presentation at the beginning of an event or public speaking engagement. Its primary purpose is to establish a connection with the audience and to introduce yourself or the main speaker. ... Conclusion: Summarize the main points briefly and express enthusiasm for the upcoming ...
Common conclusion mistakes include: introducing new information (which belongs in the body of your speech); making the ending too long in comparison to the rest of your speech; using a different style or tone that doesn't fit with what went before, which puzzles listeners; and ending abruptly without preparing the audience for the conclusion. Without a transition, signal or indication ...
To end your speech with impact, you can use a lot of the devices discussed in the attention-getting section of the introductions page such as: quotations, jokes, anecdotes, audience involvement, questions, etc. One of the best ways to conclude a speech is to tie the conclusion into the introduction.
When contemplating how to end a speech, remember that your introduction is the appetizer, while your conclusion is its dessert. Conclusions must round off the topic and make a strong impression on people's minds. To create a conclusion that will satisfy and sum up all the vital information from your speech, consider these three key elements: 1.
This means that if your speech is meant to be five minutes long, your introduction should be no more than about forty-five seconds. If your speech is to be ten minutes long, then your introduction should be no more than about a minute and a half. Keep in mind that 10 to 20 percent of your speech can either make your audience interested in what ...
Get tips for creating a great introduction to your speech from the Writing & Speaking Center at the University of Nevada, Reno.
The introduction and conclusion are essential to a speech. The audience will remember the main ideas even if the middle of the speech is a mess or nerves overtake the speaker. So if nothing else, get these parts down! The speech is almost over and the audience needs closure. The conclusion needs to ...
It's in the news. Take headlines from what's trending in media you know the audience will be familiar with and see. Using those that relate to your speech topic as the opening of your speech is a good way to grab the attention of the audience. It shows how relevant and up-to-the-minute the topic is.
Chapter 10: Introductions and Conclusions. Learning objectives: identify key elements of an effective introduction; list methods of grabbing the audience's attention; identify key elements of an effective conclusion.
Conclusion. Your conclusion is the part of your speech the audience is most likely to remember. Therefore, it must be well planned out. A conclusion serves three purposes: Gives the audience one last opportunity to understand the material. Provides the audience with a course of action. Lets the audience know that the speech is ending.
The speech conclusion has four basic missions. Wraps things up: this portion is often referred to as a "brakelight." Much like brake lights on a car warn us the car will be stopping, this "brakelight" or transitional statement warns the audience that the speech is coming to a close. Summarizes: a solid conclusion briefly restates the ...
Recall something from the introduction so your speech comes full circle. Thank your audience for attending and listening. In some cases, you can use the conclusion to recall the introduction, showing how the speech comes full circle. Or, if you have a catchy title, work it into the conclusion to grab your ...
Guide to Writing Introductions and Conclusions. First and last impressions are important in any part of life, especially in writing. This is why the introduction and conclusion of any paper - whether it be a simple essay or a long research paper - are essential. Introductions and conclusions are just as important as the body of your paper.
Just as a good introduction helps bring an audience into your speech's world, and a good speech body holds the audience in that world, a good conclusion helps bring that audience back to reality. So, plan ahead to ensure that your conclusion is an effective one. While a good conclusion will not rescue a poorly prepared speech, a strong ...
A good conclusion of a speech comes from a good introduction; the two are linked together. Whatever you said in your introduction must be reiterated in the conclusion, but with a different structure and approach. It is like saying again what you intended to say in the first place.
Part I: The Introduction. An introduction is usually the first paragraph of your academic essay. If you're writing a long essay, you might need 2 or 3 paragraphs to introduce your topic to your reader. A good introduction does 2 things: Gets the reader's attention. You can get a reader's attention by telling a story, providing a statistic ...
Introduction. It is my great pleasure to welcome all of you to this conference. ... Participants discussed various issues on this topic and reached the natural conclusion: the impact of aging is complicated. ... So, I would like to conclude my speech with this phrase: "This time is different."